Finding Similar Items: algorithms for computing text similarity (machine learning, vector space cosine, NLTK, diff, Levenshtein distance)

Introduction:

The material is summarized in http://infolab.stanford.edu/~ullman/mmds/ch3.pdf. The book at http://www-nlp.stanford.edu/IR-book/ also covers the vector space model, SVMs, and related topics.

http://pages.cs.wisc.edu/~dbbook/openAccess/thirdEdition/slides/slides3ed-english/Ch27b_ir2-vectorspace-95.pdf is dedicated to the vector space model.

https://courses.cs.washington.edu/courses/cse573/12sp/lectures/17-ir.pdf also mentions other approaches, which appear to resemble the statistical models used in speech recognition.

Document similarity using deep learning: https://cs224d.stanford.edu/reports/PoulosJackson.pdf and also http://www.cms.waikato.ac.nz/~ml/publications/2012/JASIST2012.pdf

A web page that compares text similarity directly: http://www.scurtu.it/documentSimilarity.html

Several answers are collected at http://stackoverflow.com/questions/8897593/similarity-between-two-text-documents, including using NLP through the NLTK library, using diff, or using scikit-learn's vector space model with cosine similarity.

http://stackoverflow.com/questions/1844194/get-cosine-similarity-between-two-documents-in-lucene also describes a way to compute cosine similarity.

Cosine similarity computation in Lucene 3: https://darakpanand.wordpress.com/2013/06/01/document-comparison-by-cosine-methodology-using-lucene/#more-53 Note: the approach differs between Lucene 4 and Lucene 3.

The vector space model (http://stackoverflow.com/questions/10649898/better-way-of-calculating-document-similarity-using-lucene):

Once you've got your data components properly standardized, then you can worry about what's better: fuzzy match, Levenshtein distance, or cosine similarity (etc.)
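
Of the options named above, Levenshtein distance is the simplest to show in isolation. The following is a minimal sketch of the standard dynamic-programming edit distance; the class name `Levenshtein` and the length-normalized `similarity` helper are assumptions made for illustration and do not come from the quoted answer.

```java
class Levenshtein {

    // Minimum number of single-character insertions, deletions, and
    // substitutions needed to turn s into t (two-row dynamic programming).
    static int distance(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j; // turning "" into t[0..j) takes j insertions
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i; // turning s[0..i) into "" takes i deletions
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    // Hypothetical helper: maps the distance to [0, 1] by dividing by the
    // longer length. This is one common convention, not the only one.
    static double similarity(String s, String t) {
        int max = Math.max(s.length(), t.length());
        return max == 0 ? 1.0 : 1.0 - (double) distance(s, t) / max;
    }
}
```

Edit distance compares character sequences, so it tends to suit short strings such as names or titles; for whole documents the term-frequency approach discussed next is usually the more practical choice.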

As I told you in my comment, I think you made a mistake somewhere. The vectors actually contain <word, frequency> pairs, not just words. Therefore, when you delete a sentence, only the frequencies of the words in that sentence decrease, each by the number of times it occurs there (the words that follow are not shifted). Consider the following example:

Document a:

A B C A A B C. D D E A B. D A B C B A.

Document b:

A B C A A B C. D A B C B A.

Vector a:

A:6, B:5, C:3, D:3, E:1

Vector b:

A:5, B:4, C:3, D:1, E:0

Which results in the following similarity measure:

(6*5+5*4+3*3+3*1+1*0)/(Sqrt(6^2+5^2+3^2+3^2+1^2) Sqrt(5^2+4^2+3^2+1^2+0^2))=
62/(8.94427*7.14143)=
0.970648
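
The calculation above can be reproduced with a small plain-Java sketch that uses no Lucene at all. The class and method names are hypothetical, and the tokenizer is an assumption that only suits this toy example: it treats periods as separators and splits on whitespace.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class TermFrequencyCosine {

    // Builds the <word, frequency> pairs for a text. Assumption: periods
    // only end sentences, and words are separated by whitespace.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> freq = new HashMap<>();
        for (String token : text.replace(".", " ").trim().split("\\s+")) {
            if (!token.isEmpty()) {
                freq.merge(token, 1, Integer::sum);
            }
        }
        return freq;
    }

    // Cosine similarity of two sparse term-frequency vectors; a term that
    // is missing from one document counts as frequency 0 there.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> vocabulary = new HashSet<>(a.keySet());
        vocabulary.addAll(b.keySet());
        double dot = 0, normA = 0, normB = 0;
        for (String term : vocabulary) {
            int x = a.getOrDefault(term, 0);
            int y = b.getOrDefault(term, 0);
            dot += x * y;
            normA += x * x;
            normB += y * y;
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // an empty document has no direction to compare
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> a = termFrequencies("A B C A A B C. D D E A B. D A B C B A.");
        Map<String, Integer> b = termFrequencies("A B C A A B C. D A B C B A.");
        System.out.println(cosine(a, b));
    }
}
```

For the two documents above, the result agrees with the hand calculation (about 0.970648).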

 

MoreLikeThis in Lucene:

You may want to check the MoreLikeThis feature of Lucene.

MoreLikeThis constructs a Lucene query based on terms within a document to find other similar documents in the index.

http://lucene.apache.org/java/3_0_1/api/contrib-queries/org/apache/lucene/search/similar/MoreLikeThis.html

Sample code (Java):

MoreLikeThis mlt = new MoreLikeThis(reader); // pass the index reader
mlt.setFieldNames(new String[] {"title", "author"}); // fields to compare for similarity

Query query = mlt.like(docID); // pass the ID of the source document
TopDocs similarDocs = searcher.search(query, 10); // run the query with the searcher
if (similarDocs.totalHits == 0) {
    // Do handling
}

 

The question at http://stackoverflow.com/questions/1844194/get-cosine-similarity-between-two-documents-in-lucene asks:

I have built an index in Lucene. Without specifying a query, I just want to get a score (cosine similarity or another distance?) between two documents in the index.

For example, from a previously opened IndexReader ir I get the documents with IDs 2 and 4: Document d1 = ir.document(2); Document d2 = ir.document(4);

How can I get the cosine similarity between these two documents?

Thank you

 

When indexing, there's an option to store term frequency vectors.

During runtime, look up the term frequency vectors for both documents using IndexReader.getTermFreqVector(), and look up document frequency data for each term using IndexReader.docFreq(). That will give you all the components necessary to calculate the cosine similarity between the two docs.
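
Once the term frequencies and document frequencies have been read from the index, the remaining arithmetic might look like the sketch below. It is plain Java and makes several assumptions: the maps stand in for values fetched from the index, the weight tf * ln(N / df) is one common TF-IDF variant rather than Lucene's own scoring formula, and all names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class TfIdfCosine {

    // Weighs each term frequency by ln(numDocs / documentFrequency).
    // Terms without a known document frequency are skipped.
    static Map<String, Double> weigh(Map<String, Integer> termFreq,
                                     Map<String, Integer> docFreq,
                                     int numDocs) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreq.entrySet()) {
            Integer df = docFreq.get(e.getKey());
            if (df == null || df == 0) {
                continue;
            }
            weights.put(e.getKey(), e.getValue() * Math.log((double) numDocs / df));
        }
        return weights;
    }

    // Cosine similarity of two sparse weighted vectors.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // no weighted terms, so nothing to compare
        }
        return dot / Math.sqrt(normA * normB);
    }
}
```

With this weighting, a term that occurs in every document gets weight 0 and no longer contributes to the similarity, which is the main difference from comparing raw term frequencies.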

An easier way might be to submit doc A as a query (adding all words to the query as OR terms, boosting each by term frequency) and look for doc B in the result set.
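
As a rough illustration of that idea, the sketch below turns a term-frequency map into a query string in Lucene's classic query-parser syntax, where `^` sets a boost and `OR` joins optional clauses. The class name and the field name in the usage note are hypothetical, and the sketch assumes every term is a plain lowercase word; terms that contain query-syntax characters or coincide with operators would need escaping first.

```java
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

class BoostedOrQuery {

    // Emits one "field:term^frequency" clause per term, joined with OR.
    // TreeMap is used only to make the clause order deterministic.
    static String build(String field, Map<String, Integer> termFrequencies) {
        StringJoiner query = new StringJoiner(" OR ");
        for (Map.Entry<String, Integer> e : new TreeMap<>(termFrequencies).entrySet()) {
            query.add(field + ":" + e.getKey() + "^" + e.getValue());
        }
        return query.toString();
    }
}
```

For a document A with frequencies apple=3 and pie=1 in a field named content, this yields `content:apple^3 OR content:pie^1`. The string would then be parsed and searched as usual, and the score of document B in the result list serves as the similarity estimate.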

 

 

As Julia points out, Sujit Pal's example is very useful, but the Lucene 4 API has changed substantially. Here is a version rewritten for Lucene 4:

import java.io.IOException;
import java.util.*;

import org.apache.commons.math3.linear.*;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;
import org.apache.lucene.util.*;

public class CosineDocumentSimilarity {

    public static final String CONTENT = "Content";

    private final Set<String> terms = new HashSet<>();
    private final RealVector v1;
    private final RealVector v2;

    CosineDocumentSimilarity(String s1, String s2) throws IOException {
        Directory directory = createIndex(s1, s2);
        IndexReader reader = DirectoryReader.open(directory);
        Map<String, Integer> f1 = getTermFrequencies(reader, 0);
        Map<String, Integer> f2 = getTermFrequencies(reader, 1);
        reader.close();
        v1 = toRealVector(f1);
        v2 = toRealVector(f2);
    }

    Directory createIndex(String s1, String s2) throws IOException {
        Directory directory = new RAMDirectory();
        Analyzer analyzer = new SimpleAnalyzer(Version.LUCENE_CURRENT);
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_CURRENT,
                analyzer);
        IndexWriter writer = new IndexWriter(directory, iwc);
        addDocument(writer, s1);
        addDocument(writer, s2);
        writer.close();
        return directory;
    }

    /* Indexed, tokenized, stored. */
    public static final FieldType TYPE_STORED = new FieldType();

    static {
        TYPE_STORED.setIndexed(true);
        TYPE_STORED.setTokenized(true);
        TYPE_STORED.setStored(true);
        TYPE_STORED.setStoreTermVectors(true);
        TYPE_STORED.setStoreTermVectorPositions(true);
        TYPE_STORED.freeze();
    }

    void addDocument(IndexWriter writer, String content) throws IOException {
        Document doc = new Document();
        Field field = new Field(CONTENT, content, TYPE_STORED);
        doc.add(field);
        writer.addDocument(doc);
    }

    double getCosineSimilarity() {
        return (v1.dotProduct(v2)) / (v1.getNorm() * v2.getNorm());
    }

    public static double getCosineSimilarity(String s1, String s2)
            throws IOException {
        return new CosineDocumentSimilarity(s1, s2).getCosineSimilarity();
    }

    Map<String, Integer> getTermFrequencies(IndexReader reader, int docId)
            throws IOException {
        Terms vector = reader.getTermVector(docId, CONTENT);
        TermsEnum termsEnum = null;
        termsEnum = vector.iterator(termsEnum);
        Map<String, Integer> frequencies = new HashMap<>();
        BytesRef text = null;
        while ((text = termsEnum.next()) != null) {
            String term = text.utf8ToString();
            int freq = (int) termsEnum.totalTermFreq();
            frequencies.put(term, freq);
            terms.add(term);
        }
        return frequencies;
    }

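    // Maps the term frequencies onto a vector with one entry per term seen in either document.
    RealVector toRealVector(Map<String, Integer> map) {
        RealVector vector = new ArrayRealVector(terms.size());
        int i = 0;
        for (String term : terms) {
            int value = map.containsKey(term) ? map.get(term) : 0;
            vector.setEntry(i++, value);
        }
        // Scaling by the L1 norm leaves the cosine unchanged; mapDivide already returns a RealVector.
        return vector.mapDivide(vector.getL1Norm());
    }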
}

This article is reposted from 张昺华-sky's blog on cnblogs (博客园). Original link: http://www.cnblogs.com/bonelee/p/6423490.html . To repost it, please contact the original author.
