
Introduction: This article covers the basics of using gensim. Most of the material comes from https://radimrehurek.com/gensim/tut1.html#from-strings-to-vectors

Installation: pip install gensim



from gensim import corpora

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]


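Tokenize the documents, then remove stop words and words that occur only once in the corpus. There are many ways to preprocess a corpus; here we simply split on whitespace, lowercase every word, and finally drop a few common words and the words that appear only once.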

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove words that appear only once; collections is a Python standard-library module
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1]
         for text in texts]

from pprint import pprint  # pprint makes the output easier to read
pprint(texts)
# Output:
[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]


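There are many ways to extract features from documents. Here we use the simple bag-of-words model: it counts how many times each word occurs in a document and assembles those counts into a vector, which turns the document into a vector. First we need to build a dictionary from the corpus; the dictionary contains every word that appears in the corpus.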

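import os
os.makedirs('./gensim_out', exist_ok=True)  # make sure the output directory exists
dictionary = corpora.Dictionary(texts)
dictionary.save('./gensim_out/deerwester.dict')  # real-world dictionaries can be very large, so save the dictionary for later use
print(dictionary)  # Output: Dictionary(35 unique tokens: ['abc', 'applications', 'computer', 'human', 'interface']...)
# The dictionary assigns an integer id to every unique word.
# Note: the output shown here has 35 tokens and includes words that occur only once (e.g. 'abc', 'lab'), so it comes from token lists that had not been through the once-only filter; built from the filtered texts above, the dictionary has 12 unique tokens.
print(dictionary.token2id)  # Output: {'abc': 0, 'applications': 1, 'computer': 2, 'human': 3, 'interface': 4, 'lab': 5, 'machine': 6, 'opinion': 7, 'response': 8, 'survey': 9, 'system': 10, 'time': 11, 'user': 12, 'eps': 13, 'management': 14, 'engineering': 15, 'testing': 16, 'error': 17, 'measurement': 18, 'perceived': 19, 'relation': 20, 'binary': 21, 'generation': 22, 'random': 23, 'trees': 24, 'unordered': 25, 'graph': 26, 'intersection': 27, 'paths': 28, 'iv': 29, 'minors': 30, 'ordering': 31, 'quasi': 32, 'well': 33, 'widths': 34}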
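
To make the bag-of-words idea concrete, here is a minimal pure-Python sketch that does not use gensim. It builds a word-to-id mapping from a tiny made-up corpus and turns each document into a dense count vector. The names toy_texts, build_vocab, and to_dense_vector are invented for this illustration, and the id assignment (first-seen order) is an assumption of the sketch, not a description of how corpora.Dictionary numbers its tokens.

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized (not the corpus from this article).
toy_texts = [["human", "computer"],
             ["computer", "system", "system"]]

def build_vocab(texts):
    """Assign an integer id to every distinct token, in first-seen order."""
    vocab = {}
    for text in texts:
        for token in text:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_dense_vector(tokens, vocab):
    """Return a list where position i holds the count of the token with id i."""
    counts = Counter(tokens)
    vector = [0] * len(vocab)
    for token, count in counts.items():
        if token in vocab:
            vector[vocab[token]] = count
    return vector

vocab = build_vocab(toy_texts)
vectors = [to_dense_vector(text, vocab) for text in toy_texts]
print(vocab)    # {'human': 0, 'computer': 1, 'system': 2}
print(vectors)  # [[1, 1, 0], [0, 1, 2]]
```

With a realistic vocabulary most entries of such a dense vector are zero, which is why gensim keeps only the non-zero (id, count) pairs.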

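With the dictionary built, we can use it to vectorize other text with the bag-of-words model. Suppose the new text is "Human computer interaction"; the output vector is then [(2, 1), (3, 1)]. In (2, 1), the "2" means that computer has id 2 in the dictionary and the "1" means that computer occurs once in this document. Likewise, (3, 1) means that human has id 3 in the dictionary and occurs once. The tuples in the output vector are sorted by ascending id. interaction is not in the dictionary, so it is simply ignored.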

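new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)  # Output: [(2, 1), (3, 1)]
# MmCorpus.serialize expects a corpus (an iterable of bag-of-words documents), not a single vector,
# so vectorize every text and store the whole corpus on disk for later use
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('./gensim_out/deerwester.mm', corpus)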
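
The conversion described above can be sketched in a few lines of plain Python. This is a simplified stand-in for illustration, not gensim's implementation: toy_token2id and simple_doc2bow are hypothetical names, and the toy mapping copies only three of the ids from the token2id output shown earlier.

```python
from collections import Counter

# Hypothetical miniature mapping; the three ids are copied from the token2id output above.
toy_token2id = {"computer": 2, "human": 3, "system": 10}

def simple_doc2bow(tokens, token2id):
    """Count the known tokens and return (id, count) pairs sorted by id."""
    counts = Counter(token for token in tokens if token in token2id)
    return sorted((token2id[token], count) for token, count in counts.items())

# "interaction" is not in the mapping, so it is dropped.
new_vec = simple_doc2bow("Human computer interaction".lower().split(), toy_token2id)
print(new_vec)   # [(2, 1), (3, 1)]

# A token that occurs twice gets a count of 2.
repeated = simple_doc2bow("system human system".split(), toy_token2id)
print(repeated)  # [(3, 1), (10, 2)]
```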



class MyCorpus(object):
    def __iter__(self):
        # mycorpus.txt is read as one document per line, tokens separated by whitespace
        with open('mycorpus.txt') as f:
            for line in f:
                yield dictionary.doc2bow(line.lower().split())

corpus_memory_friendly = MyCorpus()  # the corpus is not loaded into memory
print(corpus_memory_friendly)  # Output: <__main__.MyCorpus object at 0x10d5690>

for vector in corpus_memory_friendly:  # load one vector into memory at a time
    print(vector)
# Output (the exact ids depend on the dictionary in use):
[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]
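
The point of the class above is that iteration reads the file lazily, one line per step, so memory use does not grow with the size of the corpus. The sketch below shows the same pattern without gensim, under some assumptions: the input file is a temporary one created on the spot, StreamingCorpus and toy_token2id are hypothetical names, and the bag-of-words conversion is a simplified stand-in for dictionary.doc2bow.

```python
import os
import tempfile
from collections import Counter

toy_token2id = {"graph": 0, "minors": 1, "trees": 2}  # hypothetical mapping

class StreamingCorpus:
    """Yield one bag-of-words vector per line; the file is re-read on every pass."""

    def __init__(self, path, token2id):
        self.path = path
        self.token2id = token2id

    def __iter__(self):
        with open(self.path) as f:
            for line in f:  # only the current line is held in memory
                counts = Counter(t for t in line.lower().split() if t in self.token2id)
                yield sorted((self.token2id[t], c) for t, c in counts.items())

# Write a small throwaway file with one document per line.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Graph minors trees\nGraph minors A survey\n")
    path = tmp.name

corpus = StreamingCorpus(path, toy_token2id)
first_pass = list(corpus)
second_pass = list(corpus)  # an object with __iter__ can be traversed again
os.remove(path)
print(first_pass)  # [[(0, 1), (1, 1), (2, 1)], [(0, 1), (1, 1)]]
```

Because __iter__ reopens the file on each call, the corpus can be traversed repeatedly, which a plain generator would not allow.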


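# build the dictionary by streaming over the file, without loading all texts into memory
with open('mycorpus.txt') as f:
    dictionary = corpora.Dictionary(line.lower().split() for line in f)
# collect the ids of stop words and of words that appear in only one document
stop_ids = [dictionary.token2id[stopword] for stopword in stoplist if stopword in dictionary.token2id]
once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1]
dictionary.filter_tokens(stop_ids + once_ids)  # remove the stop words and the once-only words
dictionary.compactify()  # remove the gaps in the id sequence left by the filtering
print(dictionary)  # Output: Dictionary(12 unique tokens)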
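
What the filtering and compacting steps do can be illustrated with plain dictionaries. This is only a sketch under assumptions: toy_token2id, toy_dfs, and filter_and_compact are hypothetical, and the renumbering used here (surviving tokens keep their relative id order) is one reasonable choice rather than a statement about how gensim renumbers ids.

```python
def filter_and_compact(token2id, dfs, stopwords):
    """Drop stop words and tokens seen in only one document, then close the id gaps."""
    bad_ids = {token2id[w] for w in stopwords if w in token2id}
    bad_ids |= {tokenid for tokenid, docfreq in dfs.items() if docfreq == 1}
    survivors = sorted((tokenid, token) for token, tokenid in token2id.items()
                       if tokenid not in bad_ids)
    # Renumber 0..n-1 so that no ids are skipped.
    return {token: new_id for new_id, (_, token) in enumerate(survivors)}

# Hypothetical dictionary state: token -> id, and id -> document frequency.
toy_token2id = {"the": 0, "graph": 1, "abc": 2, "trees": 3}
toy_dfs = {0: 5, 1: 3, 2: 1, 3: 3}

compacted = filter_and_compact(toy_token2id, toy_dfs, stopwords={"the", "of"})
print(compacted)  # {'graph': 0, 'trees': 1}
```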


  1. corpora.MmCorpus.serialize(path, result)
  2. corpora.SvmLightCorpus.serialize(path, result)
  3. corpora.BleiCorpus.serialize(path, result)
  4. corpora.LowCorpus.serialize(path, result)
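These calls save a corpus to disk in different formats. The first one, the Matrix Market format, deserves special attention; for example: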
corpus = [[(1, 0.5)], []]  # make one document empty, for the heck of it
corpora.MmCorpus.serialize('./gensim_out/corpus.mm', corpus)
corpus = corpora.MmCorpus('./gensim_out/corpus.mm')
print(corpus)  # Output: MmCorpus(2 documents, 2 features, 1 non-zero entries)
print(list(corpus))  # calling list() will convert any sequence to a plain Python list
# Output: [[(1, 0.5)], []]

for doc in corpus:
    print(doc)
# Output:
# [(1, 0.5)]
# []
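
For orientation, the Matrix Market coordinate format is plain text: a header line, a line with the number of rows, columns, and non-zero entries, then one line per non-zero entry with 1-based indices. The sketch below renders the two-document corpus from the example in that layout, treating documents as rows and term ids as columns. to_matrix_market is a hypothetical helper written for this illustration; the file gensim actually writes may differ in details such as whitespace padding.

```python
def to_matrix_market(corpus):
    """Render an iterable of [(term_id, weight), ...] documents as Matrix Market text."""
    entries = []
    num_docs = 0
    max_term_id = -1
    for doc_no, doc in enumerate(corpus):
        num_docs += 1
        for term_id, weight in doc:
            max_term_id = max(max_term_id, term_id)
            # Matrix Market indices start at 1, so shift both by one.
            entries.append("%d %d %s" % (doc_no + 1, term_id + 1, weight))
    header = "%%MatrixMarket matrix coordinate real general"
    stats = "%d %d %d" % (num_docs, max_term_id + 1, len(entries))
    return "\n".join([header, stats] + entries) + "\n"

text = to_matrix_market([[(1, 0.5)], []])
print(text)
```

The stats line "2 2 1" corresponds to the summary printed for this corpus: 2 documents, 2 features, 1 non-zero entry.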
