Installation
!pip install gensim==4.0.0b0 -i https://pypi.tuna.tsinghua.edu.cn/simple/
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
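First, we need to create a corpus to work with. This step is the same as in the previous tutorial; if you have already completed it, feel free to skip to the next section.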
from collections import defaultdict
from gensim import corpora

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# Remove stop words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]

# Remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]

# Map each remaining token to an integer id and build the bag-of-words corpus
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
2021-01-28 10:06:04,335 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2021-01-28 10:06:04,336 : INFO : built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)
from gensim import models
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
2021-01-28 10:20:02,307 : INFO : using serial LSI version on this node
2021-01-28 10:20:02,308 : INFO : updating model with new documents
2021-01-28 10:20:02,309 : INFO : preparing a new chunk of documents
2021-01-28 10:20:02,309 : INFO : using 100 extra samples and 2 power iterations
2021-01-28 10:20:02,310 : INFO : 1st phase: constructing (12, 102) action matrix
2021-01-28 10:20:02,311 : INFO : orthonormalizing (12, 102) action matrix
2021-01-28 10:20:02,312 : INFO : 2nd phase: running dense svd on (12, 9) matrix
2021-01-28 10:20:02,313 : INFO : computing the final decomposition
2021-01-28 10:20:02,313 : INFO : keeping 2 factors (discarding 43.156% of energy spectrum)
2021-01-28 10:20:02,314 : INFO : processed documents up to #9
2021-01-28 10:20:02,315 : INFO : topic #0(3.341): 0.644*"system" + 0.404*"user" + 0.301*"eps" + 0.265*"time" + 0.265*"response" + 0.240*"computer" + 0.221*"human" + 0.206*"survey" + 0.198*"interface" + 0.036*"graph"
2021-01-28 10:20:02,315 : INFO : topic #1(2.542): 0.623*"graph" + 0.490*"trees" + 0.451*"minors" + 0.274*"survey" + -0.167*"system" + -0.141*"eps" + -0.113*"human" + 0.107*"response" + 0.107*"time" + -0.072*"interface"
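- First, this is just another transformation: it converts vectors from one space into another.
- Second, the benefit of LSI is that it can identify patterns and relationships between terms (in our case, the words in the documents) and topics.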
Our LSI space is two-dimensional (num_topics = 2).
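Now suppose a user types the query "Human computer interaction". We would like to sort the nine corpus documents by their relevance to this query. Here we consider only one kind of similarity: the apparent semantic relatedness of their texts (words). No hyperlinks are involved.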
doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]  # LSI vector of the query document
print(vec_lsi)
[(0, 0.46182100453271596), (1, -0.07002766527899937)]
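To make these two steps concrete, here is a minimal pure-Python sketch of what `doc2bow` and the LSI projection do for this query. The `token2id` excerpt and the `bow` and `project` helpers are hypothetical stand-ins, not gensim's API, and the token ids are assumed. The topic #0 weights are the rounded values copied from the training log, and the sketch assumes the unscaled projection, that is, a dot product between the bag-of-words vector and a topic's term weights. Only the first coordinate is reproduced, because the log prints just the ten largest weights per topic.

```python
from collections import Counter

# Hypothetical excerpt of the dictionary: token -> integer id (ids assumed).
token2id = {"computer": 0, "human": 1, "interface": 2}

# Rounded topic #0 weights for those ids, copied from the training log.
topic0 = {0: 0.240, 1: 0.221, 2: 0.198}


def bow(tokens, token2id):
    """Count known tokens; tokens missing from the dictionary are dropped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


def project(vec_bow, topic):
    """Dot product of a sparse bag-of-words vector with one topic's weights."""
    return sum(count * topic.get(token_id, 0.0) for token_id, count in vec_bow)


query_bow = bow("Human computer interaction".lower().split(), token2id)
first_coordinate = project(query_bow, topic0)
print(query_bow)
print(round(first_coordinate, 3))
```

The word "interaction" is not in the dictionary, so it is ignored. The resulting value, about 0.461, agrees with the first component of the printed vec_lsi up to the rounding of the logged weights.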
- Different similarity matching methods
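To prepare for similarity queries, we first index all the documents that subsequent queries will be compared against. In our case, these are the same nine documents that were used to train LSI, converted to the 2-D LSA space. But that is only incidental; we could just as well index a different corpus.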
from gensim import similarities
index = similarities.MatrixSimilarity(lsi[corpus])  # transform corpus to LSI space and index it
index
2021-01-28 10:37:02,431 : WARNING : scanning corpus to determine the number of features (consider setting `num_features` explicitly)
2021-01-28 10:37:02,433 : INFO : creating matrix with 9 documents and 2 features
<gensim.similarities.docsim.MatrixSimilarity at 0x7f5e58cc4070>
# Save the index to disk, then load it back
index.save('/tmp/deerwester.index')
index = similarities.MatrixSimilarity.load('/tmp/deerwester.index')
sims = index[vec_lsi]  # perform a similarity query against the corpus
print(list(enumerate(sims)))  # print (document_number, document_similarity) 2-tuples
[(0, 0.998093), (1, 0.93748635), (2, 0.9984453), (3, 0.98658866), (4, 0.90755945), (5, -0.12416792), (6, -0.1063926), (7, -0.09879464), (8, 0.05004177)]
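These scores are cosine similarities, the measure that `MatrixSimilarity` uses, so they lie between -1 and 1. The following is a minimal pure-Python sketch of the idea; the `TinyIndex` class and its two document vectors are hypothetical illustrations, not gensim's implementation and not the actual LSI vectors of the corpus. Each indexed vector is normalized to unit length once, so a query reduces to one dot product per document.

```python
import math


def unit(vec):
    """Scale a dense vector to unit length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


class TinyIndex:
    """Hypothetical stand-in for a dense cosine-similarity index."""

    def __init__(self, vectors):
        # Normalize once at build time so each query is a plain dot product.
        self.rows = [unit(v) for v in vectors]

    def query(self, vec):
        q = unit(vec)
        return [sum(a * b for a, b in zip(row, q)) for row in self.rows]


# Made-up 2-D document vectors: one along topic #0, one along topic #1.
tiny_index = TinyIndex([[2.0, 0.0], [0.0, 3.0]])

# Query vector rounded from the printed vec_lsi.
tiny_sims = tiny_index.query([0.4618, -0.0700])
print(tiny_sims)
```

Because the vectors are normalized, only their direction matters: the query points almost entirely along topic #0, so it scores close to 1 against the first made-up document and slightly below 0 against the second.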
sims = sorted(enumerate(sims), key=lambda item: -item[1])
for doc_position, doc_score in sims:
    print(doc_score, documents[doc_position])
0.9984453 The EPS user interface management system
0.998093 Human machine interface for lab abc computer applications
0.98658866 System and human system engineering testing of EPS
0.93748635 A survey of user opinion of computer system response time
0.90755945 Relation of user perceived response time to error measurement
0.05004177 Graph minors A survey
-0.09879464 Graph minors IV Widths of trees and well quasi ordering
-0.1063926 The intersection graph of paths in trees
-0.12416792 The generation of random binary unordered trees
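When only the best few matches are needed, a full sort is not required. As a small sketch using the standard library rather than anything gensim-specific, `heapq.nlargest` can pick the top results directly; the `scores` list below is copied from the printed query output, and the name `top3` is just illustrative.

```python
import heapq

# (document_number, similarity) pairs copied from the query output.
scores = [
    (0, 0.998093), (1, 0.93748635), (2, 0.9984453),
    (3, 0.98658866), (4, 0.90755945), (5, -0.12416792),
    (6, -0.1063926), (7, -0.09879464), (8, 0.05004177),
]

# Keep only the three highest-scoring documents, best first.
top3 = heapq.nlargest(3, scores, key=lambda item: item[1])
for doc_position, doc_score in top3:
    print(doc_score, doc_position)
```

This selects documents 2, 0 and 3, the same three that head the fully sorted list.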