Finding Similar Items 文本相似度计算的算法——机器学习、词向量空间cosine、NLTK、diff、Levenshtein距离-阿里云开发者社区

开发者社区> 桃子红了呐> 正文

Finding Similar Items 文本相似度计算的算法——机器学习、词向量空间cosine、NLTK、diff、Levenshtein距离

简介:
+关注继续查看

http://infolab.stanford.edu/~ullman/mmds/ch3.pdf 汇总于此 还有这本书 http://www-nlp.stanford.edu/IR-book/ 里面有词向量空间 SVM 等介绍

http://pages.cs.wisc.edu/~dbbook/openAccess/thirdEdition/slides/slides3ed-english/Ch27b_ir2-vectorspace-95.pdf 专门介绍向量空间

https://courses.cs.washington.edu/courses/cse573/12sp/lectures/17-ir.pdf 也提到了其他思路 貌似类似语音识别的统计模型

使用深度学习来做文档相似度计算 https://cs224d.stanford.edu/reports/PoulosJackson.pdf 还有这里 http://www.cms.waikato.ac.nz/~ml/publications/2012/JASIST2012.pdf

网页里直接比较文本相似度的 http://www.scurtu.it/documentSimilarity.html

这里汇总了一些回答 http://stackoverflow.com/questions/8897593/similarity-between-two-text-documents  包括利用NLP NLTK库来做,或者是diff,skylearn词向量空间+cos

http://stackoverflow.com/questions/1844194/get-cosine-similarity-between-two-documents-in-lucene 也有cosine相似度计算方法

lucene 3 里的cosine相似度计算方法 https://darakpanand.wordpress.com/2013/06/01/document-comparison-by-cosine-methodology-using-lucene/#more-53 注意:4和3的计算方法不一样

向量空间模型(http://stackoverflow.com/questions/10649898/better-way-of-calculating-document-similarity-using-lucene):

Once you've got your data components properly standardized, then you can worry about what's better: fuzzy match, Levenshtein distance, or cosine similarity (etc.)

As I told you in my comment, I think you made a mistake somewhere. The vectors actually contain the <word,frequency> pairs, not words only. Therefore, when you delete the sentence, only the frequency of the corresponding words are subtracted by 1 (the words after are not shifted). Consider the following example:

Document a:

A B C A A B C. D D E A B. D A B C B A.

Document b:

A B C A A B C. D A B C B A.

Vector a:

A:6, B:5, C:3, D:3, E:1

Vector b:

A:5, B:4, C:3, D:1, E:0

Which result in the following similarity measure:

(6*5+5*4+3*3+3*1+1*0)/(Sqrt(6^2+5^2+3^2+3^2+1^2) Sqrt(5^2+4^2+3^2+1^2+0^2))=
62/(8.94427*7.14143)=
0.970648

 

lucene里 more like this:

you may want to check the MoreLikeThis feature of lucene.

MoreLikeThis constructs a lucene query based on terms within a document to find other similar documents in the index.

http://lucene.apache.org/java/3_0_1/api/contrib-queries/org/apache/lucene/search/similar/MoreLikeThis.html

Sample code example (java reference) -

MoreLikeThis mlt = new MoreLikeThis(reader); // Pass the index reader
mlt.setFieldNames(new String[] {"title", "author"}); // specify the fields for similiarity

Query query = mlt.like(docID); // Pass the doc id 
TopDocs similarDocs = searcher.search(query, 10); // Use the searcher
if (similarDocs.totalHits == 0)
    // Do handling
}

 

http://stackoverflow.com/questions/1844194/get-cosine-similarity-between-two-documents-in-lucene 提到: 

i have built an index in Lucene. I want without specifying a query, just to get a score (cosine similarity or another distance?) between two documents in the index.

For example i am getting from previously opened IndexReader ir the documents with ids 2 and 4. Document d1 = ir.document(2); Document d2 = ir.document(4);

How can i get the cosine similarity between these two documents?

Thank you

 

When indexing, there's an option to store term frequency vectors.

During runtime, look up the term frequency vectors for both documents using IndexReader.getTermFreqVector(), and look up document frequency data for each term using IndexReader.docFreq(). That will give you all the components necessary to calculate the cosine similarity between the two docs.

An easier way might be to submit doc A as a query (adding all words to the query as OR terms, boosting each by term frequency) and look for doc B in the result set.

 

 

As Julia points out Sujit Pal's example is very useful but the Lucene 4 API has substantial changes. Here is a version rewritten for Lucene 4.

import java.io.IOException;
import java.util.*;

import org.apache.commons.math3.linear.*;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;
import org.apache.lucene.util.*;

public class CosineDocumentSimilarity {

    public static final String CONTENT = "Content";

    private final Set<String> terms = new HashSet<>();
    private final RealVector v1;
    private final RealVector v2;

    CosineDocumentSimilarity(String s1, String s2) throws IOException {
        Directory directory = createIndex(s1, s2);
        IndexReader reader = DirectoryReader.open(directory);
        Map<String, Integer> f1 = getTermFrequencies(reader, 0);
        Map<String, Integer> f2 = getTermFrequencies(reader, 1);
        reader.close();
        v1 = toRealVector(f1);
        v2 = toRealVector(f2);
    }

    Directory createIndex(String s1, String s2) throws IOException {
        Directory directory = new RAMDirectory();
        Analyzer analyzer = new SimpleAnalyzer(Version.LUCENE_CURRENT);
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_CURRENT,
                analyzer);
        IndexWriter writer = new IndexWriter(directory, iwc);
        addDocument(writer, s1);
        addDocument(writer, s2);
        writer.close();
        return directory;
    }

    /* Indexed, tokenized, stored. */
    public static final FieldType TYPE_STORED = new FieldType();

    static {
        TYPE_STORED.setIndexed(true);
        TYPE_STORED.setTokenized(true);
        TYPE_STORED.setStored(true);
        TYPE_STORED.setStoreTermVectors(true);
        TYPE_STORED.setStoreTermVectorPositions(true);
        TYPE_STORED.freeze();
    }

    void addDocument(IndexWriter writer, String content) throws IOException {
        Document doc = new Document();
        Field field = new Field(CONTENT, content, TYPE_STORED);
        doc.add(field);
        writer.addDocument(doc);
    }

    double getCosineSimilarity() {
        return (v1.dotProduct(v2)) / (v1.getNorm() * v2.getNorm());
    }

    public static double getCosineSimilarity(String s1, String s2)
            throws IOException {
        return new CosineDocumentSimilarity(s1, s2).getCosineSimilarity();
    }

    Map<String, Integer> getTermFrequencies(IndexReader reader, int docId)
            throws IOException {
        Terms vector = reader.getTermVector(docId, CONTENT);
        TermsEnum termsEnum = null;
        termsEnum = vector.iterator(termsEnum);
        Map<String, Integer> frequencies = new HashMap<>();
        BytesRef text = null;
        while ((text = termsEnum.next()) != null) {
            String term = text.utf8ToString();
            int freq = (int) termsEnum.totalTermFreq();
            frequencies.put(term, freq);
            terms.add(term);
        }
        return frequencies;
    }

    RealVector toRealVector(Map<String, Integer> map) {
        RealVector vector = new ArrayRealVector(terms.size());
        int i = 0;
        for (String term : terms) {
            int value = map.containsKey(term) ? map.get(term) : 0;
            vector.setEntry(i++, value);
        }
        return (RealVector) vector.mapDivide(vector.getL1Norm());
    }
}


















本文转自张昺华-sky博客园博客,原文链接:http://www.cnblogs.com/bonelee/p/6423490.html,如需转载请自行联系原作者

版权声明:本文内容由阿里云实名注册用户自发贡献,版权归原作者所有,阿里云开发者社区不拥有其著作权,亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容,填写侵权投诉表单进行举报,一经查实,本社区将立刻删除涉嫌侵权内容。

相关文章
DevExpress学习03——label控件的背景色问题
今天使用了DevExpress的labelControl,发现拖放上去,其背景色和主窗体的背景一样,非常不谐调,把BackColor设置为透明也不行(Windows中的Label可以)。 没有办法,我用颜色拾取器,拾取到上方面板颜色是RGB(80,80,80),于是将labelControl背景色改为RGB(80,80,80)颜色就统一了。
975 0
DevExpress学习02——DevExpress 14.1的汉化
汉化资源: 汉化补丁:dxKB_A421_DXperience_v14.1_(2014-06-09):http://www.t00y.com/file/86576990 汉化工具:DXperienceUniversal-14.
1214 0
计算字符串相似度算法——Levenshtein
0.这个算法实现起来很简单 1.百度百科介绍: Levenshtein 距离,又称编辑距离,指的是两个字符串之间,由一个转换成另一个所需的最少编辑操作次数。 许可的编辑操作包括将一个字符替换成另一个字符,插入一个字符,删除一个字符。
926 0
文本相似度——自己实现文本相似度算法(余弦定理)
最近由于工作项目,需要判断两个txt文本是否相似,于是开始在网上找资料研究,因为在程序中会把文本转换成String再做比较,所以最开始找到了这篇关于 距离编辑算法 Blog写的非常好,受益匪浅。        于是我决定把它用到项目中,来判断两个文本的相似度。
1345 0
【HEVC学习与研究】45、HEVC的自适应采样点补偿SAO
1、SAO的意义 在图像中像素值剧烈变化的边界区,经过编码-解码重建之后通常会出现波纹状的失真现象,这种失真称作振铃效应。振铃效应产生的根本原因在于边界区高频分量在编码过程中丢失。
1081 0
冰与火之歌:「时间」与「空间」复杂度 | 算法必看系列三十六
对于一个算法,其时间复杂度和空间复杂度往往是相互影响的。当追求一个较好的时间复杂度时,可能会使空间复杂度的性能变差,即可能导致占用较多的存储空间; 反之,求一个较好的空间复杂度时,可能会使时间复杂度的性能变差,即可能导致占用较长的运行时间。另外,算法的所有性能之间都存在着或多或少的相互影响。因此,当设计一个算法(特别是大型算法)时,要综合考虑算法的各项性能,算法的使用频率,算法处理的数据量的大小,算法描述语言的特性,算法运行的机器系统环境等各方面因素,才能够设计出比较好的算法。
2254 0
【HEVC学习与研究】46、HEVC参考代码中SAO的实现
1、HM10 encoder关于SAO的初始化操作 在编码开始之前,编码器将进行与SAO有关的一些初始化设置。如在TAppEncTop::encode()函数中调用TAppEncTop::xInitLibCfg()进行与config相关的设置: m_cTEncTop.
1576 0
【HEVC学习与研究】43、HEVC变换编码的实现
【变换和量化是整个视频编码技术系列中理论性和研究性较强的一部分,本文暂时不去研究变换的原理、推导过程等,只是调试一下在参考代码中对预测残差进行变换的实现过程。
1283 0
4269
文章
0
问答
文章排行榜
最热
最新
相关电子书
更多
《2021云上架构与运维峰会演讲合集》
立即下载
《零基础CSS入门教程》
立即下载
《零基础HTML入门教程》
立即下载