I tried using Latent Dirichlet Allocation (LDA) to extract some topics. This tutorial walks through an end-to-end natural language processing pipeline: starting from the raw data, then preparing it, modeling it, and visualizing the papers.
We will cover the following points:
- Topic modeling with LDA
- Visualizing topic models with pyLDAvis
- Visualizing LDA results with t-SNE and Bokeh
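Before running the notebook, the libraries imported below need to be installed, and NLTK needs a couple of corpora. A minimal setup sketch (the package list is inferred from the imports that appear later in this post):

# One-time setup (sketch): install the libraries used in this notebook.
# pip install pandas scikit-learn gensim nltk pyLDAvis bokeh

import nltk

# The WordNetLemmatizer used in preprocessing needs the WordNet corpus;
# the stop-word list is used later for the TF-IDF preprocessing.
nltk.download('wordnet')
nltk.download('stopwords')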
In [1]:
%pylab inline
from scipy import sparse as sp

Populating the interactive namespace from numpy and matplotlib

In [2]:
docs = array(p_df['PaperText'])
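The cells above use a DataFrame `p_df` that is never loaded in the excerpt. A sketch of the likely loading step, assuming the NIPS papers dataset; the file name `Papers.csv` is an assumption, but the frame must expose at least the `PaperText` and `Title` columns used later:

import pandas as pd

# Load the papers table (file name assumed); 'PaperText' holds the full text
# of each paper and 'Title' its title.
p_df = pd.read_csv('Papers.csv')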
Preprocess and vectorize the documents
In [3]:
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

def docs_preprocessor(docs):
    tokenizer = RegexpTokenizer(r'\w+')
    for idx in range(len(docs)):
        docs[idx] = docs[idx].lower()  # Convert to lowercase.
        docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

    # Remove numbers, but not words that contain numbers.
    docs = [[token for token in doc if not token.isdigit()] for doc in docs]

    # Remove short words (3 characters or fewer).
    docs = [[token for token in doc if len(token) > 3] for doc in docs]

    # Lemmatize all words in the documents.
    lemmatizer = WordNetLemmatizer()
    docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]

    return docs
In [4]:
docs = docs_preprocessor(docs)
Compute bigrams/trigrams:
Since the topics are very similar, what distinguishes them are phrases rather than single words.
In [5]:
from gensim.models import Phrases

# Add bigrams and trigrams to docs (only ones that appear 10 times or more).
bigram = Phrases(docs, min_count=10)
trigram = Phrases(bigram[docs])

for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)
    for token in trigram[docs[idx]]:
        if '_' in token:
            # Token is a trigram, add to document.
            docs[idx].append(token)

Using TensorFlow backend.
/opt/conda/lib/python3.6/site-packages/gensim/models/phrases.py:316: UserWarning: For a faster implementation, use the gensim.models.phrases.Phraser class
  warnings.warn("For a faster implementation, use the gensim.models.phrases.Phraser class")
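The UserWarning in the output above points at `gensim.models.phrases.Phraser`, which freezes a trained `Phrases` model into a faster, read-only form. An optional sketch:

from gensim.models.phrases import Phraser

# Freeze the trained collocation detectors; they apply the same learned
# bigrams/trigrams but transform documents faster and with less memory.
bigram_phraser = Phraser(bigram)
trigram_phraser = Phraser(trigram)

# bigram_phraser[doc] and trigram_phraser[doc] can now replace bigram[doc]
# and trigram[doc] in the loop above.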
Remove rare and common words:
In [6]:
from gensim.corpora import Dictionary

# Create a dictionary representation of the documents.
dictionary = Dictionary(docs)
print('Number of unique words in initial documents:', len(dictionary))

# Filter out words that occur in fewer than 10 documents, or in more than 20% of the documents.
dictionary.filter_extremes(no_below=10, no_above=0.2)
print('Number of unique words after removing rare and common words:', len(dictionary))

Number of unique words in initial documents: 39534
Number of unique words after removing rare and common words: 6001
After trimming common and rare words, we are left with about 6,000 unique words, roughly 15% of the original vocabulary.
Vectorize the data:
The first step is to get a bag-of-words representation of each document.
In [7]:
corpus = [dictionary.doc2bow(doc) for doc in docs]
In [8]:
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 6001
Number of documents: 403
With the bag-of-words corpus in hand, we can move on to learning the topic model from the documents.
Train the LDA model
In [9]:
from gensim.models import LdaModel
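The training cell below refers to several names (`id2word`, `chunksize`, `iterations`, `num_topics`, `passes`, `eval_every`) that are never defined in the excerpt. A plausible setup; only `num_topics = 4` is grounded in the discussion further down, the other values are assumptions:

# Hyperparameters used by the training call in the next cell.
num_topics = 4       # four topics, as discussed below
chunksize = 500      # documents per training chunk (assumed)
passes = 20          # full passes over the corpus (assumed)
iterations = 400     # max iterations per document during inference (assumed)
eval_every = None    # skip perplexity evaluation, which is slow (assumed)

# The gensim Dictionary doubles as the id-to-word mapping.
id2word = dictionary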
In [10]:
%time model = LdaModel(corpus=corpus, id2word=id2word, chunksize=chunksize, \
                       alpha='auto', eta='auto', \
                       iterations=iterations, num_topics=num_topics, \
                       passes=passes, eval_every=eval_every)

CPU times: user 3min 58s, sys: 348 ms, total: 3min 58s
Wall time: 3min 59s
How do we choose the number of topics?
LDA is an unsupervised technique, which means that before running the model we do not know how many topics are present in our corpus. Topic coherence is one of the main techniques used to decide on the number of topics.
However, I used pyLDAvis, an LDA visualization tool, tried several numbers of topics, and compared the results. Four seemed to be the number of topics that best separates them.
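The original post picks the number of topics by eye with pyLDAvis, but the coherence-based selection mentioned above can be sketched as follows (this cell is not part of the original notebook; the candidate topic counts and `passes=10` are assumptions):

from gensim.models import CoherenceModel, LdaModel

# Train a small model per candidate topic count and compare c_v coherence;
# higher coherence generally indicates more interpretable topics.
for k in (2, 4, 6, 8):
    lda_k = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                     passes=10, random_state=0)
    cm = CoherenceModel(model=lda_k, texts=docs, dictionary=dictionary, coherence='c_v')
    print('num_topics = {}: coherence = {:.3f}'.format(k, cm.get_coherence()))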
In [11]:
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
In [12]:
pyLDAvis.gensim.prepare(model, corpus, dictionary)
What do we see here?
In the left panel, labeled Intertopic Distance Map, circles represent the different topics and the distances between them. Similar topics appear closer together and dissimilar topics farther apart. The relative size of a topic's circle corresponds to the relative frequency of that topic in the corpus.
How do we evaluate our model?
- Split each document into two halves and check whether the topics assigned to them are similar. => The more similar, the better.
- Compare randomly chosen documents with each other. => The less similar, the better.
In [13]:
from sklearn.metrics.pairwise import cosine_similarity

p_df['tokenz'] = docs

docs1 = p_df['tokenz'].apply(lambda l: l[:int0(len(l)/2)])
docs2 = p_df['tokenz'].apply(lambda l: l[int0(len(l)/2):])
Transform the data
In [14]:
corpus1 = [dictionary.doc2bow(doc) for doc in docs1]
corpus2 = [dictionary.doc2bow(doc) for doc in docs2]

# Using the LDA model, transform both corpora into topic space.
lda_corpus1 = model[corpus1]
lda_corpus2 = model[corpus2]
In [15]:
from collections import OrderedDict

def get_doc_topic_dist(model, corpus, kwords=False):
    '''
    LDA transformation: for each doc the model only returns topics with non-zero weight,
    so this function fills in the zeros and builds a matrix of documents in topic space.
    '''
    top_dist = []
    keys = []
    for d in corpus:
        tmp = {i: 0 for i in range(num_topics)}
        tmp.update(dict(model[d]))
        vals = list(OrderedDict(tmp).values())
        top_dist += [array(vals)]
        if kwords:
            keys += [array(vals).argmax()]
    return array(top_dist), keys

Intra similarity: cosine similarity for corresponding parts of a doc (higher is better): 0.906086532099
Inter similarity: cosine similarity between random parts (lower is better): 0.846485334252
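The cell that produces the two similarity numbers above (In [16]) is not shown in the excerpt. A sketch of how they can be computed from the half-document corpora built earlier; the exact pairing and the number of random pairs are assumptions:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Topic distributions for the first and second half of each paper.
top_dist1, _ = get_doc_topic_dist(model, corpus1)
top_dist2, _ = get_doc_topic_dist(model, corpus2)

# Intra similarity: the two halves of the same paper should look alike.
intra = np.mean([cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0][0]
                 for a, b in zip(top_dist1, top_dist2)])
print('Intra similarity: cosine similarity for corresponding parts of a doc (higher is better):', intra)

# Inter similarity: halves of randomly paired papers should not.
rng = np.random.default_rng(0)
pairs = rng.integers(0, len(top_dist1), size=(400, 2))  # 400 random pairs is an assumption
inter = np.mean([cosine_similarity(top_dist1[i].reshape(1, -1), top_dist2[j].reshape(1, -1))[0][0]
                 for i, j in pairs])
print('Inter similarity: cosine similarity between random parts (lower is better):', inter)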
Let's look at the terms that appear in each topic.
In [17]:
def explore_topic(lda_model, topic_number, topn, output=True):
    """
    Accept an LdaModel, a topic number, and the number of top terms of interest;
    print a formatted list of the topn terms.
    """
    terms = []
    for term, frequency in lda_model.show_topic(topic_number, topn=topn):
        terms += [term]
        if output:
            print(u'{:20} {:.3f}'.format(term, round(frequency, 3)))
    return terms
In [18]:
term                 frequency

Topic 0 |---------------------
data_set             0.006
embedding            0.004
query                0.004
document             0.003
tensor               0.003
multi_label          0.003
graphical_model      0.003
singular_value       0.003
topic_model          0.003
margin               0.003

Topic 1 |---------------------
policy               0.007
regret               0.007
bandit               0.006
reward               0.006
active_learning      0.005
agent                0.005
vertex               0.005
item                 0.005
reward_function      0.005
submodular           0.004

Topic 2 |---------------------
convolutional        0.005
generative_model     0.005
variational_inference 0.005
recurrent            0.004
gaussian_process     0.004
fully_connected      0.004
recurrent_neural     0.004
hidden_unit          0.004
deep_learning        0.004
hidden_layer         0.004

Topic 3 |---------------------
convergence_rate     0.007
step_size            0.006
matrix_completion    0.006
rank_matrix          0.005
gradient_descent     0.005
regret               0.004
sample_complexity    0.004
strongly_convex      0.004
line_search          0.003
sample_size          0.003
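The loop that produced the listing above (cell In [18]) is not shown; something like the following sketch, with `topn=10` to match the ten terms per topic, would generate it:

# Print the top ten terms of each topic using the helper defined above.
print(u'{:20} {}'.format(u'term', u'frequency') + u'\n')
for i in range(num_topics):
    print('Topic ' + str(i) + ' |---------------------')
    explore_topic(model, topic_number=i, topn=10)
    print()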
From the above we can inspect each topic and assign it a human-interpretable label. Here I label them as follows:
In [19]:
top_labels = {0: 'Statistics', 1:'Numerical Analysis', 2:'Online Learning', 3:'Deep Learning'}
In [20]:
import re
import nltk
from nltk.corpus import stopwords

stops = set(stopwords.words('english'))

def paper_to_wordlist(paper):
    # 1. Remove non-letters.
    paper_text = re.sub("[^a-zA-Z]", " ", paper)
    # 2. Convert words to lower case and split them.
    words = paper_text.lower().split()
    # 3. Remove stop words.
    words = [w for w in words if not w in stops]
    # 4. Remove short words.
    words = [t for t in words if len(t) > 2]
    # 5. Lemmatize.
    words = [nltk.stem.WordNetLemmatizer().lemmatize(t) for t in words]
    return words

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

tvectorizer = TfidfVectorizer(input='content', analyzer='word', lowercase=True, stop_words='english',
                              tokenizer=paper_to_wordlist, ngram_range=(1, 3), min_df=40, max_df=0.20,
                              norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=True)

dtm = tvectorizer.fit_transform(p_df['PaperText']).toarray()
In [22]:
top_dist = []
for d in corpus:
    tmp = {i: 0 for i in range(num_topics)}
    tmp.update(dict(model[d]))
    vals = list(OrderedDict(tmp).values())
    top_dist += [array(vals)]
In [23]:
top_dist, lda_keys = get_doc_topic_dist(model, corpus, True)
features = tvectorizer.get_feature_names()

In [24]:
top_ws = []
for n in range(len(dtm)):
    inds = int0(argsort(dtm[n])[::-1][:4])
    tmp = [features[i] for i in inds]
    top_ws += [' '.join(tmp)]

cluster_colors = {0: 'blue', 1: 'green', 2: 'yellow', 3: 'red', 4: 'skyblue',
                  5: 'salmon', 6: 'orange', 7: 'maroon', 8: 'crimson', 9: 'black', 10: 'gray'}

p_df['colors'] = p_df['clusters'].apply(lambda l: cluster_colors[l])

In [25]:
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(top_dist)

In [26]:
p_df['X_tsne'] = X_tsne[:, 0]
p_df['Y_tsne'] = X_tsne[:, 1]

In [27]:
from bokeh.plotting import figure, show, output_notebook, save  #, output_file
from bokeh.models import HoverTool, value, LabelSet, Legend, ColumnDataSource
output_notebook()

BokehJS 0.12.5 successfully loaded.

In [28]:
source = ColumnDataSource(dict(
    x=p_df['X_tsne'],
    y=p_df['Y_tsne'],
    color=p_df['colors'],
    label=p_df['clusters'].apply(lambda l: top_labels[l]),
    # msize=p_df['marker_size'],
    topic_key=p_df['clusters'],
    title=p_df[u'Title'],
    content=p_df['Text_Rep']
))

In [29]:
title = 'T-SNE visualization of topics'

plot_lda.scatter(x='x', y='y', legend='label', source=source,
                 color='color', alpha=0.8, size=10)  # size='msize'
show(plot_lda)
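Two pieces the excerpt never defines are the `clusters` and `Text_Rep` columns used by the ColumnDataSource and the `plot_lda` figure that the final cell draws on. A sketch of the likely missing steps (these assignments would need to run before the cells that use them; the hover-tool details are assumptions):

from bokeh.plotting import figure
from bokeh.models import HoverTool

# Dominant topic per paper (from the LDA keys) and a short text representation
# built from the top TF-IDF words of each paper.
p_df['clusters'] = lda_keys
p_df['Text_Rep'] = top_ws

# Scatter-plot canvas for the t-SNE embedding, with a hover tool.
plot_lda = figure(plot_width=800, plot_height=600,
                  title='T-SNE visualization of topics',
                  tools='pan,wheel_zoom,box_zoom,reset,hover,save')

# Show the paper title and its top TF-IDF words when hovering over a point;
# '@title' and '@content' refer to columns of the ColumnDataSource above.
hover = plot_lda.select_one(HoverTool)
hover.tooltips = [('Title', '@title'), ('Top words', '@content')]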