Finding Similar Items 文本相似度计算的算法——机器学习、词向量空间cosine、NLTK、diff、Levenshtein距离

简介:

http://infolab.stanford.edu/~ullman/mmds/ch3.pdf 汇总于此 还有这本书 http://www-nlp.stanford.edu/IR-book/ 里面有词向量空间 SVM 等介绍

http://pages.cs.wisc.edu/~dbbook/openAccess/thirdEdition/slides/slides3ed-english/Ch27b_ir2-vectorspace-95.pdf 专门介绍向量空间

https://courses.cs.washington.edu/courses/cse573/12sp/lectures/17-ir.pdf 也提到了其他思路 貌似类似语音识别的统计模型

使用深度学习来做文档相似度计算 https://cs224d.stanford.edu/reports/PoulosJackson.pdf 还有这里 http://www.cms.waikato.ac.nz/~ml/publications/2012/JASIST2012.pdf

网页里直接比较文本相似度的 http://www.scurtu.it/documentSimilarity.html

这里汇总了一些回答 http://stackoverflow.com/questions/8897593/similarity-between-two-text-documents  包括利用NLP NLTK库来做,或者是diff,skylearn词向量空间+cos

http://stackoverflow.com/questions/1844194/get-cosine-similarity-between-two-documents-in-lucene 也有cosine相似度计算方法

lucene 3 里的cosine相似度计算方法 https://darakpanand.wordpress.com/2013/06/01/document-comparison-by-cosine-methodology-using-lucene/#more-53 注意:4和3的计算方法不一样

向量空间模型(http://stackoverflow.com/questions/10649898/better-way-of-calculating-document-similarity-using-lucene):

Once you've got your data components properly standardized, then you can worry about what's better: fuzzy match, Levenshtein distance, or cosine similarity (etc.)

As I told you in my comment, I think you made a mistake somewhere. The vectors actually contain the <word,frequency> pairs, not words only. Therefore, when you delete the sentence, only the frequency of the corresponding words are subtracted by 1 (the words after are not shifted). Consider the following example:

Document a:

A B C A A B C. D D E A B. D A B C B A.

Document b:

A B C A A B C. D A B C B A.

Vector a:

A:6, B:5, C:3, D:3, E:1

Vector b:

A:5, B:4, C:3, D:1, E:0

Which result in the following similarity measure:

(6*5+5*4+3*3+3*1+1*0)/(Sqrt(6^2+5^2+3^2+3^2+1^2) Sqrt(5^2+4^2+3^2+1^2+0^2))=
62/(8.94427*7.14143)=
0.970648

 

lucene里 more like this:

you may want to check the MoreLikeThis feature of lucene.

MoreLikeThis constructs a lucene query based on terms within a document to find other similar documents in the index.

http://lucene.apache.org/java/3_0_1/api/contrib-queries/org/apache/lucene/search/similar/MoreLikeThis.html

Sample code example (java reference) -

MoreLikeThis mlt = new MoreLikeThis(reader); // Pass the index reader
mlt.setFieldNames(new String[] {"title", "author"}); // specify the fields for similiarity

Query query = mlt.like(docID); // Pass the doc id 
TopDocs similarDocs = searcher.search(query, 10); // Use the searcher
if (similarDocs.totalHits == 0)
    // Do handling
}

 

http://stackoverflow.com/questions/1844194/get-cosine-similarity-between-two-documents-in-lucene 提到: 

i have built an index in Lucene. I want without specifying a query, just to get a score (cosine similarity or another distance?) between two documents in the index.

For example i am getting from previously opened IndexReader ir the documents with ids 2 and 4. Document d1 = ir.document(2); Document d2 = ir.document(4);

How can i get the cosine similarity between these two documents?

Thank you

 

When indexing, there's an option to store term frequency vectors.

During runtime, look up the term frequency vectors for both documents using IndexReader.getTermFreqVector(), and look up document frequency data for each term using IndexReader.docFreq(). That will give you all the components necessary to calculate the cosine similarity between the two docs.

An easier way might be to submit doc A as a query (adding all words to the query as OR terms, boosting each by term frequency) and look for doc B in the result set.

 

 

As Julia points out Sujit Pal's example is very useful but the Lucene 4 API has substantial changes. Here is a version rewritten for Lucene 4.

import java.io.IOException;
import java.util.*;

import org.apache.commons.math3.linear.*;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;
import org.apache.lucene.util.*;

public class CosineDocumentSimilarity {

    public static final String CONTENT = "Content";

    private final Set<String> terms = new HashSet<>();
    private final RealVector v1;
    private final RealVector v2;

    CosineDocumentSimilarity(String s1, String s2) throws IOException {
        Directory directory = createIndex(s1, s2);
        IndexReader reader = DirectoryReader.open(directory);
        Map<String, Integer> f1 = getTermFrequencies(reader, 0);
        Map<String, Integer> f2 = getTermFrequencies(reader, 1);
        reader.close();
        v1 = toRealVector(f1);
        v2 = toRealVector(f2);
    }

    Directory createIndex(String s1, String s2) throws IOException {
        Directory directory = new RAMDirectory();
        Analyzer analyzer = new SimpleAnalyzer(Version.LUCENE_CURRENT);
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_CURRENT,
                analyzer);
        IndexWriter writer = new IndexWriter(directory, iwc);
        addDocument(writer, s1);
        addDocument(writer, s2);
        writer.close();
        return directory;
    }

    /* Indexed, tokenized, stored. */
    public static final FieldType TYPE_STORED = new FieldType();

    static {
        TYPE_STORED.setIndexed(true);
        TYPE_STORED.setTokenized(true);
        TYPE_STORED.setStored(true);
        TYPE_STORED.setStoreTermVectors(true);
        TYPE_STORED.setStoreTermVectorPositions(true);
        TYPE_STORED.freeze();
    }

    void addDocument(IndexWriter writer, String content) throws IOException {
        Document doc = new Document();
        Field field = new Field(CONTENT, content, TYPE_STORED);
        doc.add(field);
        writer.addDocument(doc);
    }

    double getCosineSimilarity() {
        return (v1.dotProduct(v2)) / (v1.getNorm() * v2.getNorm());
    }

    public static double getCosineSimilarity(String s1, String s2)
            throws IOException {
        return new CosineDocumentSimilarity(s1, s2).getCosineSimilarity();
    }

    Map<String, Integer> getTermFrequencies(IndexReader reader, int docId)
            throws IOException {
        Terms vector = reader.getTermVector(docId, CONTENT);
        TermsEnum termsEnum = null;
        termsEnum = vector.iterator(termsEnum);
        Map<String, Integer> frequencies = new HashMap<>();
        BytesRef text = null;
        while ((text = termsEnum.next()) != null) {
            String term = text.utf8ToString();
            int freq = (int) termsEnum.totalTermFreq();
            frequencies.put(term, freq);
            terms.add(term);
        }
        return frequencies;
    }

    RealVector toRealVector(Map<String, Integer> map) {
        RealVector vector = new ArrayRealVector(terms.size());
        int i = 0;
        for (String term : terms) {
            int value = map.containsKey(term) ? map.get(term) : 0;
            vector.setEntry(i++, value);
        }
        return (RealVector) vector.mapDivide(vector.getL1Norm());
    }
}


















本文转自张昺华-sky博客园博客,原文链接:http://www.cnblogs.com/bonelee/p/6423490.html ,如需转载请自行联系原作者

相关文章
|
6天前
|
机器学习/深度学习 数据采集 自然语言处理
理解并应用机器学习算法:神经网络深度解析
【5月更文挑战第15天】本文深入解析了神经网络的基本原理和关键组成,包括神经元、层、权重、偏置及损失函数。介绍了神经网络在图像识别、NLP等领域的应用,并涵盖了从数据预处理、选择网络结构到训练与评估的实践流程。理解并掌握这些知识,有助于更好地运用神经网络解决实际问题。随着技术发展,神经网络未来潜力无限。
|
4天前
|
存储 算法 物联网
R-Tree算法:空间索引的高效解决方案
【5月更文挑战第17天】R-Tree是用于多维空间索引的数据结构,常用于地理信息系统、数据库和计算机图形学。它通过分层矩形区域组织数据,支持快速查询。文章介绍了R-Tree的工作原理、应用场景,如地理信息存储和查询,以及Python的`rtree`库实现示例。此外,还讨论了R-Tree的优势(如空间效率和查询性能)与挑战(如实现复杂和内存消耗),以及优化和变种,如R* Tree和STR。R-Tree在机器学习、实时数据分析等领域有广泛应用,并与其他数据结构(如kd-trees和quad-trees)进行比较。未来趋势将聚焦于优化算法、动态适应性和分布式并行计算。
19 1
|
3天前
|
机器学习/深度学习 算法 数据处理
探索机器学习中的决策树算法
【5月更文挑战第18天】探索机器学习中的决策树算法,一种基于树形结构的监督学习,常用于分类和回归。算法通过递归划分数据,选择最优特征以提高子集纯净度。优点包括直观、高效、健壮和可解释,但易过拟合、对连续数据处理不佳且不稳定。广泛应用于信贷风险评估、医疗诊断和商品推荐等领域。优化方法包括集成学习、特征工程、剪枝策略和参数调优。
|
4天前
|
机器学习/深度学习 算法 数据挖掘
【机器学习】K-means算法与PCA算法之间有什么联系?
【5月更文挑战第15天】【机器学习】K-means算法与PCA算法之间有什么联系?
|
4天前
|
机器学习/深度学习 算法 数据挖掘
【机器学习】维度灾难问题会如何影响K-means算法?
【5月更文挑战第15天】【机器学习】维度灾难问题会如何影响K-means算法?
|
5天前
|
机器学习/深度学习 算法 数据挖掘
【机器学习】聚类算法中,如何判断数据是否被“充分”地聚类,以便算法产生有意义的结果?
【5月更文挑战第14天】【机器学习】聚类算法中,如何判断数据是否被“充分”地聚类,以便算法产生有意义的结果?
|
5天前
|
机器学习/深度学习 运维 算法
【机器学习】可以利用K-means算法找到数据中的离群值吗?
【5月更文挑战第14天】【机器学习】可以利用K-means算法找到数据中的离群值吗?
|
5天前
|
算法 调度 UED
作业调度算法(含详细计算过程)和进程调度算法浅析
作业调度算法(含详细计算过程)和进程调度算法浅析
37 1
作业调度算法(含详细计算过程)和进程调度算法浅析
|
6天前
|
算法 数据安全/隐私保护 计算机视觉
基于二维CS-SCHT变换和LABS方法的水印嵌入和提取算法matlab仿真
该内容包括一个算法的运行展示和详细步骤,使用了MATLAB2022a。算法涉及水印嵌入和提取,利用LAB色彩空间可能用于隐藏水印。水印通过二维CS-SCHT变换、低频系数处理和特定解码策略来提取。代码段展示了水印置乱、图像处理(如噪声、旋转、剪切等攻击)以及水印的逆置乱和提取过程。最后,计算并保存了比特率,用于评估水印的稳健性。
|
2天前
|
算法
m基于BP译码算法的LDPC编译码matlab误码率仿真,对比不同的码长
MATLAB 2022a仿真实现了LDPC码的性能分析,展示了不同码长对纠错能力的影响。短码长LDPC码收敛快但纠错能力有限,长码长则提供更强纠错能力但易陷入局部最优。核心代码通过循环进行误码率仿真,根据EsN0计算误比特率,并保存不同码长(12-768)的结果数据。
21 9
m基于BP译码算法的LDPC编译码matlab误码率仿真,对比不同的码长

热门文章

最新文章