Code for this section: https://www.kesci.com/mw/project/600ade02e455800015b7e609
Introduction to Gensim
- A Python library specialized for training word vectors. Gensim's core algorithms rely on highly optimized, parallelized C routines.
- Gensim can process arbitrarily large corpora using data-streamed algorithms. There is no "the dataset must fit in RAM" limitation.
- Gensim runs on Linux, Windows and OS X, as well as any other platform that supports Python and NumPy.
- Gensim is used by thousands of companies every day, with over 2,600 academic citations and more than 1 million downloads per week, making it one of the most mature ML libraries.
- All Gensim source code is hosted on Github under the GNU LGPL license and maintained by its open-source community. For commercial arrangements, see business support.
- The Gensim community also publishes pretrained models for specific domains such as legal or health through the Gensim-data project.
Installing Gensim
Installation is simple, either with pip or with conda:
pip install --upgrade gensim
conda install -c conda-forge gensim
!pip install gensim==4.0.0b0 -i https://pypi.tuna.tsinghua.edu.cn/simple
Installation succeeded:
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting gensim==4.0.0b0
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/d0/86/2d03cb418e9fae6dbc17bafd7524d407be0691a703829cecca23e2bc31a9/gensim-4.0.0b0-cp38-cp38-manylinux1_x86_64.whl (24.0 MB)
Requirement already satisfied: numpy>=1.11.3 in /opt/conda/lib/python3.8/site-packages (from gensim==4.0.0b0) (1.19.1)
Requirement already satisfied: smart-open>=1.8.1 in /opt/conda/lib/python3.8/site-packages (from gensim==4.0.0b0) (4.1.2)
Requirement already satisfied: scipy>=0.18.1 in /opt/conda/lib/python3.8/site-packages (from gensim==4.0.0b0) (1.5.2)
Installing collected packages: gensim
  Attempting uninstall: gensim
    Found existing installation: gensim 3.8.3
    Uninstalling gensim-3.8.3:
      Successfully uninstalled gensim-3.8.3
Successfully installed gensim-4.0.0b0
Quick Start
Key concepts in Gensim:
- Document: a piece of text.
- Corpus: a collection of documents.
- Vector: a mathematically convenient representation of a document.
- Model: an algorithm that transforms vectors from one representation to another.
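Before turning to Gensim's API, these four concepts can be sketched in a few lines of plain Python (the names and helpers here are illustrative only, not part of Gensim):

```python
# Document: a single piece of text
document = "human computer interaction"

# Corpus: a collection of documents
corpus = [
    "human machine interface",
    "computer system survey",
    "human computer interaction",
]

# Vector: a mathematically convenient representation of a document,
# here a simple bag-of-words count over a fixed vocabulary
vocab = sorted({w for doc in corpus for w in doc.split()})

def to_vector(doc):
    words = doc.split()
    return [words.count(w) for w in vocab]

# Model: a transformation from one vector representation to another,
# here scaling raw counts to relative frequencies
def to_frequencies(vec):
    total = sum(vec) or 1
    return [v / total for v in vec]

print(to_vector(document))
print(to_frequencies(to_vector(document)))
```

Gensim implements the same pipeline, but with streaming, sparse vectors, and optimized models.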
import pprint
Document
document = "Human machine interface for lab abc computer applications"
Corpus
# The example text corpus (the nine documents from the Gensim documentation,
# not shown in the original text)
text_corpus = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# Create a set of frequent words
stoplist = set('for a of the and to in'.split(' '))
# Lowercase each document, split it by white space and filter out stopwords
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in text_corpus]

# Count word frequencies
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

# Only keep words that appear more than once
processed_corpus = [[token for token in text if frequency[token] > 1]
                    for text in texts]
pprint.pprint(processed_corpus)
from gensim import corpora

dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)
Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)
Vector
pprint.pprint(dictionary.token2id)
{'computer': 0,
 'eps': 8,
 'graph': 10,
 'human': 1,
 'interface': 2,
 'minors': 11,
 'response': 3,
 'survey': 4,
 'system': 5,
 'time': 6,
 'trees': 9,
 'user': 7}
doc2bow converts the tokens into an id-based bag-of-words representation: in each (id, count) tuple, the first element is the word's index in the dictionary and the second is the number of times it occurs in the document.
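As a sanity check, the same conversion can be reproduced with plain Python using the token2id mapping printed above (my_doc2bow is a hypothetical helper written here for illustration, not a Gensim function):

```python
from collections import Counter

# token2id mapping as printed above
token2id = {'computer': 0, 'eps': 8, 'graph': 10, 'human': 1, 'interface': 2,
            'minors': 11, 'response': 3, 'survey': 4, 'system': 5, 'time': 6,
            'trees': 9, 'user': 7}

def my_doc2bow(tokens, token2id):
    # Count each known token; unknown tokens (like "interaction") are ignored
    counts = Counter(t for t in tokens if t in token2id)
    return sorted((token2id[t], n) for t, n in counts.items())

print(my_doc2bow("Human computer interaction".lower().split(), token2id))
# → [(0, 1), (1, 1)]
```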
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)
[(0, 1), (1, 1)]
Next, we represent all of the documents this way:
bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
pprint.pprint(bow_corpus)
[[(0, 1), (1, 1), (2, 1)],
 [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(2, 1), (5, 1), (7, 1), (8, 1)],
 [(1, 1), (5, 2), (8, 1)],
 [(3, 1), (6, 1), (7, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(4, 1), (10, 1), (11, 1)]]
Model
Import and train the tf-idf model:
from gensim import models

# train the model
tfidf = models.TfidfModel(bow_corpus)

# transform the "system minors" string
words = "system minors".lower().split()
print(tfidf[dictionary.doc2bow(words)])
We get the tf-idf weight for each word id:
[(5, 0.5898341626740045), (11, 0.8075244024440723)]
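These two weights can be checked by hand. Assuming Gensim's default weighting (idf as log base 2 of total documents over document frequency, followed by L2 normalization of the document vector), a minimal sketch:

```python
import math

D = 9                 # total number of documents in the corpus
df = {5: 3, 11: 2}    # document frequencies of "system" (id 5) and "minors" (id 11)
tf = {5: 1, 11: 1}    # both words occur once in "system minors"

# Raw tf-idf weight per term: tf * log2(D / df)
raw = {i: tf[i] * math.log2(D / df[i]) for i in tf}

# L2-normalize the vector, as TfidfModel does by default
norm = math.sqrt(sum(w * w for w in raw.values()))
weights = {i: w / norm for i, w in raw.items()}
print(weights)  # ≈ {5: 0.5898, 11: 0.8075}
```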
Transform the whole corpus to tf-idf and build a similarity index over it:
from gensim import similarities

index = similarities.SparseMatrixSimilarity(tfidf[bow_corpus], num_features=12)
Compute the similarity between a query document and every document in the corpus:
query_document = 'system engineering'.split()
query_bow = dictionary.doc2bow(query_document)
sims = index[tfidf[query_bow]]
print(list(enumerate(sims)))
We obtain the similarity of the query against all documents:
[(0, 0.0), (1, 0.32448703), (2, 0.41707572), (3, 0.7184812), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0)]
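Because the query and document tf-idf vectors are both L2-normalized, SparseMatrixSimilarity's cosine similarity reduces to a dot product. The top score for document 3 can be verified by hand under the same assumptions as above (idf = log2(D/df), L2 normalization):

```python
import math

# Document 3's bag-of-words is [(1, 1), (5, 2), (8, 1)]:
# "human" (id 1, df 2), "system" (id 5, df 3), "eps" (id 8, df 2) over 9 documents
idf = {1: math.log2(9 / 2), 5: math.log2(9 / 3), 8: math.log2(9 / 2)}
raw = {1: 1 * idf[1], 5: 2 * idf[5], 8: 1 * idf[8]}
norm = math.sqrt(sum(w * w for w in raw.values()))
doc3 = {i: w / norm for i, w in raw.items()}

# Query "system engineering": "engineering" is not in the dictionary,
# so after L2 normalization the query's tf-idf vector is just {5: 1.0}
query = {5: 1.0}

# Cosine similarity of unit vectors = dot product over shared ids
sim = sum(q * doc3.get(i, 0.0) for i, q in query.items())
print(sim)  # ≈ 0.71848, document 3's score above
```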
for document_number, score in sorted(enumerate(sims), key=lambda x: x[1], reverse=True):
    print(document_number, score)
3 0.7184812
2 0.41707572
1 0.32448703
0 0.0
4 0.0
5 0.0
6 0.0
7 0.0
8 0.0