Computing Term Frequency-Inverse Document Frequency (TF-IDF) for Documents - Alibaba Cloud Developer Community


2024-06-07



TF-IDF calculation:

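TF-IDF reflects how important a word is to a document within a document collection, and it is often used as a weighting factor in text data mining and information retrieval. Within a given document, term frequency (TF) is the frequency with which a given term occurs in that document. Inverse document frequency (IDF) measures how informative a term is across the whole collection: the rarer the term, the higher its IDF. The IDF of a particular term is obtained by dividing the total number of documents by the number of documents that contain the term, then taking the logarithm of the quotient.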
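The definitions above can be written compactly as follows. The notation is assumed here for illustration and does not come from the original article: $f_{t,d}$ is the number of occurrences of term $t$ in document $d$, $N$ is the total number of documents, and $n_t$ is the number of documents that contain $t$. TF is normalized by the document's token count and the logarithm is taken as the natural logarithm, which matches what the Java code in this article does with `Math.log`.

```latex
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{n_t}, \qquad
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
```

As a quick check, a term that occurs 2 times in a 10-token document and appears in 1 of 4 documents has $\mathrm{tf} = 0.2$, $\mathrm{idf} = \ln 4 \approx 1.386$, and $\mathrm{tfidf} \approx 0.277$.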

Related code:

import java.util.List;
import java.util.regex.Pattern;

    // Delimiters used to split a document into terms
    private static Pattern r = Pattern.compile("([ \\t{}()\",:;. \n])");
    // The whole document collection; must be populated before findTFIDF is called
    private static List<String> documentCollection;

    // Calculates the TF-IDF weight of term t in document d
    private static float findTFIDF(String document, String term)
    {
        float tf = findTermFrequency(document, term);
        float idf = findInverseDocumentFrequency(term);
        return tf * idf;
    }

    // TF: occurrences of the term divided by the number of tokens in the document
    private static float findTermFrequency(String document, String term)
    {
        int count = getFrequencyInOneDoc(document, term);
        return (float) count / r.split(document).length;
    }

    // Counts case-insensitive occurrences of the term in one document
    private static int getFrequencyInOneDoc(String document, String term)
    {
        int count = 0;
        for (String s : r.split(document))
        {
            if (s.equalsIgnoreCase(term)) {
                count++;
            }
        }
        return count;
    }

    private static float findInverseDocumentFrequency(String term)
    {
        // Count the documents in the whole collection that contain the term
        // (each document counts once, however often the term occurs in it)
        int count = 0;
        for (String doc : documentCollection)
        {
            if (getFrequencyInOneDoc(doc, term) > 0) {
                count++;
            }
        }
        // No document contains the term: return 0 instead of dividing by zero
        if (count == 0) {
            return 0;
        }
        /*
         * Log of the ratio of the total number of documents in the collection
         * to the number of documents containing the term.
         * Math.log(documentCollection.size() / (1.0 + count)) is a common
         * smoothed alternative that also avoids the divide-by-zero case.
         */
        return (float) Math.log((float) documentCollection.size() / count);
    }
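To make the functions above concrete, here is a minimal, self-contained sketch run on a tiny three-document collection. The class name `TfIdfExample`, the method names, the sample sentences, and the tokenizer (which lowercases the input and drops empty tokens) are assumptions made for illustration and are not part of the original code. A term that appears in no document is given an IDF of 0 here, which is one possible convention among several.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch; names and sample data are illustrative only.
class TfIdfExample {
    // Assumed tokenizer: split on runs of whitespace and common punctuation.
    private static final Pattern DELIMS = Pattern.compile("[\\s{}()\",:;.]+");

    static List<String> tokenize(String document) {
        List<String> tokens = new ArrayList<>();
        for (String s : DELIMS.split(document.toLowerCase())) {
            if (!s.isEmpty()) {
                tokens.add(s); // skip the empty token a leading delimiter produces
            }
        }
        return tokens;
    }

    // Occurrences of the term divided by the number of tokens in the document.
    static double tf(String document, String term) {
        List<String> tokens = tokenize(document);
        if (tokens.isEmpty()) {
            return 0.0;
        }
        return (double) Collections.frequency(tokens, term.toLowerCase()) / tokens.size();
    }

    // Natural log of (total documents / documents containing the term).
    static double idf(List<String> collection, String term) {
        int docsWithTerm = 0;
        for (String doc : collection) {
            if (tokenize(doc).contains(term.toLowerCase())) {
                docsWithTerm++;
            }
        }
        if (docsWithTerm == 0) {
            return 0.0; // assumed convention for unseen terms
        }
        return Math.log((double) collection.size() / docsWithTerm);
    }

    static double tfIdf(List<String> collection, String document, String term) {
        return tf(document, term) * idf(collection, term);
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "the cat sat on the mat",
                "the dog chased the cat",
                "birds sing");
        // "cat" is 1 of 6 tokens in the first document and occurs in 2 of 3 documents,
        // so its weight is (1/6) * ln(3/2).
        System.out.println(tfIdf(docs, docs.get(0), "cat"));
        // "birds" is 1 of 2 tokens in the third document and occurs in 1 of 3 documents,
        // so its weight is (1/2) * ln(3).
        System.out.println(tfIdf(docs, docs.get(2), "birds"));
    }
}
```

In this sample the rarer word "birds" gets a much higher weight than "cat", which is the behavior the IDF factor is meant to produce.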

Build a vector space model (VSM) of the documents and compute the cosine similarity between them.

Related code:

public static float findCosineSimilarity(float[] vecA, float[] vecB)
{
    float dotProduct = dotProduct(vecA, vecB);
    float magnitudeOfA = magnitude(vecA);
    float magnitudeOfB = magnitude(vecB);
    float result = dotProduct / (magnitudeOfA * magnitudeOfB);
    //when 0 is divided by 0 it shows result NaN so return 0 in such case.
    if (Float.isNaN(result))
        return 0;
    else
        return (float)result;
}
 
public static float dotProduct(float[] vecA, float[] vecB)
{
 
    float dotProduct = 0;
    for (int i = 0; i < vecA.length; i++)
    {
        dotProduct += (vecA[i] * vecB[i]);
    }
 
    return dotProduct;
}
 
// Magnitude of the vector is the square root of the dot product of the vector with itself.
public static float magnitude(float[] vector)
{
    return (float)Math.sqrt(dotProduct(vector, vector));
}
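The cosine similarity code above takes ready-made vectors, but the article does not show how they are built. One possible way, sketched below, is to use the sorted vocabulary of the whole collection as the dimensions and store each document's TF-IDF weights in a `float[]`. The class `VectorSpaceExample`, its method names, the tokenizer, and the sample sentences are all hypothetical and chosen for illustration; they are not taken from the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch; names and sample data are illustrative only.
class VectorSpaceExample {
    // Assumed tokenizer: lowercase, split on whitespace and common punctuation.
    static List<String> tokenize(String doc) {
        List<String> tokens = new ArrayList<>();
        for (String s : doc.toLowerCase().split("[\\s{}()\",:;.]+")) {
            if (!s.isEmpty()) {
                tokens.add(s);
            }
        }
        return tokens;
    }

    // Sorted vocabulary of the whole collection; index i is dimension i.
    static List<String> vocabulary(List<String> docs) {
        TreeSet<String> vocab = new TreeSet<>();
        for (String d : docs) {
            vocab.addAll(tokenize(d));
        }
        return new ArrayList<>(vocab);
    }

    // One TF-IDF weight per vocabulary term for the given document.
    static float[] toVector(String doc, List<String> docs, List<String> vocab) {
        List<String> tokens = tokenize(doc);
        float[] vec = new float[vocab.size()];
        for (int i = 0; i < vocab.size(); i++) {
            String term = vocab.get(i);
            int inDoc = Collections.frequency(tokens, term);
            int docsWithTerm = 0;
            for (String d : docs) {
                if (tokenize(d).contains(term)) {
                    docsWithTerm++;
                }
            }
            if (inDoc == 0 || docsWithTerm == 0) {
                continue; // weight stays 0
            }
            float tf = (float) inDoc / tokens.size();
            float idf = (float) Math.log((double) docs.size() / docsWithTerm);
            vec[i] = tf * idf;
        }
        return vec;
    }

    // Cosine similarity; returns 0 when either vector has zero magnitude.
    static float cosine(float[] a, float[] b) {
        float dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        float result = dot / (float) Math.sqrt(normA * normB);
        return Float.isNaN(result) ? 0 : result;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "the cat sat on the mat",
                "the cat sat on the sofa",
                "stock prices fell sharply");
        List<String> vocab = vocabulary(docs);
        float[] a = toVector(docs.get(0), docs, vocab);
        float[] b = toVector(docs.get(1), docs, vocab);
        float[] c = toVector(docs.get(2), docs, vocab);
        System.out.println(cosine(a, b)); // the two documents share several terms
        System.out.println(cosine(a, c)); // no shared terms, so the dot product is 0
    }
}
```

Note that a term occurring in every document has an IDF of 0 under this formula, so it contributes nothing to the similarity. Recomputing document frequencies for every term is quadratic and only acceptable for a small demonstration; a real implementation would likely precompute them once.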