scikit-learn: Topic extraction with NMF (Non-negative Matrix Factorization) and LDA (Latent Dirichlet Allocation)

Introduction:

This is an example of applying NMF and LDA to a corpus to extract topics.

The input is a tf-idf matrix for NMF and a raw term-frequency (tf) matrix for LDA.

The output is a set of topics, each represented by a list of words.

With the default parameters (n_samples / n_features / n_topics), the example runs in a few tens of seconds.

You can try changing the problem size, but note that NMF's time complexity is polynomial, while LDA's running time is proportional to (n_samples * iterations).

A few notes:

(1) Make sure the `exit()` on line 61 of the listing is commented out; otherwise the script stops right after fitting the NMF model and you never see the topics.

(2) On the first run, the program downloads the news data from the Internet and stores it in a cache directory; subsequent runs read the cache instead of downloading again.

(3) For the NMF and LDA parameter settings, see the scikit-learn site ([NMF documentation] [LDA documentation]).

(4) The listing targets scikit-learn 0.17.1.
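Because the listing targets 0.17.1, two of its names no longer exist on current releases; if you run it on a newer scikit-learn, the renames below (version numbers per the scikit-learn changelogs, stated here as my understanding rather than from the original post) are the ones to watch for:

```python
# The listing targets scikit-learn 0.17.1; on newer releases two names differ:
#   LatentDirichletAllocation(n_topics=...)  -> n_components=...   (old name removed in 0.21)
#   vectorizer.get_feature_names()           -> get_feature_names_out()  (old name removed in 1.2)
import sklearn
print(sklearn.__version__)  # check which API your installation expects
```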

Code:

 1 # Author: Olivier Grisel <olivier.grisel@ensta.org>
 2 #         Lars Buitinck <L.J.Buitinck@uva.nl>
 3 #         Chyi-Kwei Yau <chyikwei.yau@gmail.com>
 4 # License: BSD 3 clause
 5 
 6 from __future__ import print_function
 7 from time import time
 8 
 9 from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
10 from sklearn.decomposition import NMF, LatentDirichletAllocation
11 from sklearn.datasets import fetch_20newsgroups
12 
13 n_samples = 2000
14 n_features = 1000
15 n_topics = 10
16 n_top_words = 20
17 
18 
19 def print_top_words(model, feature_names, n_top_words):
20     for topic_idx, topic in enumerate(model.components_):
21         print("Topic #%d:" % topic_idx)
22         print(" ".join([feature_names[i]
23                         for i in topic.argsort()[:-n_top_words - 1:-1]]))
24     print()
25 
26 
27 # Load the 20 newsgroups dataset and vectorize it. We use a few heuristics
28 # to filter out useless terms early on: the posts are stripped of headers,
29 # footers and quoted replies, and common English words, words occurring in
30 # only one document or in at least 95% of the documents are removed.
31 
32 print("Loading dataset...")
33 t0 = time()
34 dataset = fetch_20newsgroups(shuffle=True, random_state=1,
35                              remove=('headers', 'footers', 'quotes'))
36 data_samples = dataset.data[:n_samples]  # keep only the first n_samples posts
37 print("done in %0.3fs." % (time() - t0))
38 
39 # Use tf-idf features for NMF.
40 print("Extracting tf-idf features for NMF...")
41 tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, #max_features=n_features,
42                                    stop_words='english')
43 t0 = time()
44 tfidf = tfidf_vectorizer.fit_transform(data_samples)
45 print("done in %0.3fs." % (time() - t0))
46 
47 # Use tf (raw term count) features for LDA.
48 print("Extracting tf features for LDA...")
49 tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=n_features,
50                                 stop_words='english')
51 t0 = time()
52 tf = tf_vectorizer.fit_transform(data_samples)
53 print("done in %0.3fs." % (time() - t0))
54 
55 # Fit the NMF model
56 print("Fitting the NMF model with tf-idf features, "
57       "n_samples=%d and n_features=%d..."
58       % (n_samples, n_features))
59 t0 = time()
60 nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf)
61 # exit()  # debugging stop; keep commented so the topics below get printed
62 print("done in %0.3fs." % (time() - t0))
63 
64 print("\nTopics in NMF model:")
65 tfidf_feature_names = tfidf_vectorizer.get_feature_names()
66 print_top_words(nmf, tfidf_feature_names, n_top_words)
67 
68 print("Fitting LDA models with tf features, n_samples=%d and n_features=%d..."
69       % (n_samples, n_features))
70 lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
71                                 learning_method='online', learning_offset=50.,
72                                 random_state=0)
73 t0 = time()
74 lda.fit(tf)
75 print("done in %0.3fs." % (time() - t0))
76 
77 print("\nTopics in LDA model:")
78 tf_feature_names = tf_vectorizer.get_feature_names()
79 print_top_words(lda, tf_feature_names, n_top_words)
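The listing stops at printing the topics. To assign topics to unseen documents, you transform them with the same fitted vectorizer and model; a small sketch with a toy corpus (illustrative documents, current API names rather than the 0.17 ones above):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

train_docs = ["space shuttle launch orbit",
              "nasa launch mission orbit moon",
              "god faith church bible",
              "bible church belief faith god"]

tf_vec = CountVectorizer()
tf = tf_vec.fit_transform(train_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(tf)

# transform() returns one topic distribution per document (rows sum to 1);
# words the vectorizer has never seen are simply ignored.
new_docs = ["the shuttle reached orbit"]
doc_topic = lda.transform(tf_vec.transform(new_docs))
print(doc_topic.round(3))
```

The same pattern works for NMF: `nmf.transform(tfidf_vectorizer.transform(new_docs))` yields per-document topic weights, though for NMF the rows are not normalized to sum to 1.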

Results:

Loading dataset...
done in 2.222s.
Extracting tf-idf features for NMF...
done in 2.730s.
Extracting tf features for LDA...
done in 2.702s.
Fitting the NMF model with tf-idf features,n_samples=2000 and n_features=1000...
done in 1.904s.

Topics in NMF model:
Topic #0:
don just people think like know good time right ve say did make really way want going new year ll
Topic #1:
windows thanks file card does dos mail files know program use advance hi window help software looking ftp video pc
Topic #2:
drive scsi ide drives disk controller hard floppy bus hd cd boot mac cable card isa rom motherboard mb internal
Topic #3:
key chip encryption clipper keys escrow government algorithm security secure encrypted public nsa des enforcement law privacy bit use secret
Topic #4:
00 sale 50 shipping 20 10 price 15 new 25 30 dos offer condition 40 cover asking 75 01 interested
Topic #5:
armenian armenians turkish genocide armenia turks turkey soviet people muslim azerbaijan russian greek argic government serdar kurds population ottoman million
Topic #6:
god jesus bible christ faith believe christians christian heaven sin life hell church truth lord does say belief people existence
Topic #7:
mouse driver keyboard serial com1 port bus com3 irq button com sys microsoft ball problem modem adb drivers card com2
Topic #8:
space nasa shuttle launch station sci gov orbit moon earth lunar satellite program mission center cost research data solar mars
Topic #9:
msg food chinese flavor eat glutamate restaurant foods reaction taste restaurants salt effects carl brain people ingredients natural causes olney

Fitting LDA models with tf features, n_samples=2000 and n_features=1000...
done in 22.548s.

Topics in LDA model:
Topic #0:
government people mr law gun state president states public use right rights national new control american security encryption health united
Topic #1:
drive card disk bit scsi use mac memory thanks pc does video hard speed apple problem used data monitor software
Topic #2:
said people armenian armenians turkish did saw went came women killed children turkey told dead didn left started greek war
Topic #3:
year good just time game car team years like think don got new play games ago did season better ll
Topic #4:
10 00 15 25 12 11 20 14 17 16 db 13 18 24 30 19 27 50 21 40
Topic #5:
windows window program version file dos use files available display server using application set edu motif package code ms software
Topic #6:
edu file space com information mail data send available program ftp email entry info list output nasa address anonymous internet
Topic #7:
ax max b8f g9v a86 pl 145 1d9 0t 34u 1t 3t giz bhj wm 2di 75u 2tm bxn 7ey
Topic #8:
god people jesus believe does say think israel christian true life jews did bible don just know world way church
Topic #9:
don know like just think ve want does use good people key time way make problem really work say need

 


This post is reproduced from the ZH奶酪 (CheeseZH) blog on cnblogs; original link: http://www.cnblogs.com/CheeseZH/p/5254082.html. Please contact the original author before reprinting.
