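Preface
This article shows how to use Python for data clustering, script development, and office automation. The script reads VOC (voice of customer) review data and groups it into topics with an LDA model.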
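1. Business logic
- Read the data gathered by the VOC data collection step
- Process it in batch: segment each comment with jieba and remove stop words
- Fit an LDA model and compute the vocabulary and the frequency of each term
- Save the visualization to an HTML file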
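The stop-word filtering and frequency-counting steps above can be sketched without any third-party library. The snippet assumes the comments have already been segmented into tokens (in the script that is jieba's job). The sample tokens, the stop-word set, and the helper names `filter_tokens` and `term_frequencies` are illustrative and do not come from the original script.

```python
from collections import Counter


def filter_tokens(tokens, stopwords):
    """Drop stop words and whitespace-only tokens, keeping the original order."""
    return [t for t in tokens if t.strip() and t not in stopwords]


def term_frequencies(docs):
    """Count how often each term occurs across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc)
    return counts


# Hypothetical pre-segmented comments (assumed output of a segmenter).
segmented = [
    ["路由器", "的", "信号", "很", "好"],
    ["信号", "不", "稳定", "的", "路由器"],
]
stopwords = {"的", "很", "不"}

filtered = [filter_tokens(doc, stopwords) for doc in segmented]
# Space-joined strings are the form a whitespace-based vectorizer expects.
joined = [" ".join(doc) for doc in filtered]
freq = term_frequencies(filtered)
```

Joining the surviving tokens with spaces is what lets a vectorizer built for space-delimited languages treat each Chinese word as one term.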
2. Deliverables
3. Running the script
python lda.py
4. Key code
# LDA topic analysis model
import pandas as pd
import jieba
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import pyLDAvis
fileName = "100005785591"  # file name, without extension
# Load the stop words
with open('stopwordsfull', 'r', encoding='utf-8') as f:
    stopwords = set([line.strip() for line in f])
# Load the business-domain terms
with open('luyouqi.txt', 'r', encoding='utf-8') as f:
    business_terms = set([line.strip() for line in f])
# Add the business terms to jieba's dictionary
for term in business_terms:
    jieba.add_word(term)
# Segment a comment and drop stop words
def tokenize(text):
    words = jieba.cut(text)
    filtered_words = [word for word in words if word not in stopwords]
    return ' '.join(filtered_words)
# Load the comment data from the xlsx file
data = pd.read_excel('clean/cleaned_voc'+fileName+'.xlsx')
comments = data['content'].tolist()
# Segment every comment to build a new list of tokenized comments
tokenized_comments = [tokenize(comment) for comment in comments]
# Use CountVectorizer to obtain term counts
vectorizer = CountVectorizer(max_df=0.85, min_df=2, max_features=1000)
X = vectorizer.fit_transform(tokenized_comments)
# LDA model
lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(X)
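# Vocabulary and the corpus-wide frequency of each term
vocab = vectorizer.get_feature_names_out()
term_frequency = X.sum(axis=0).tolist()[0]
# Document-topic distributions and document lengths
doc_topic_dists = lda.transform(X)
# Count tokens from the document-term matrix; the raw comments are unsegmented
doc_lengths = [row[0] for row in X.sum(axis=1).tolist()]
# pyLDAvis expects each topic row to sum to 1; components_ is unnormalized
topic_term_dists = lda.components_ / lda.components_.sum(axis=1)[:, None]
# Build the visualization with pyLDAvis.prepare
lda_display = pyLDAvis.prepare(
    topic_term_dists=topic_term_dists,
    doc_topic_dists=doc_topic_dists,
    doc_lengths=doc_lengths,
    vocab=vocab,
    term_frequency=term_frequency
)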
# Save the visualization to an HTML file
output_file_path = 'lda/'+fileName+'.html'
pyLDAvis.save_html(lda_display, output_file_path)
# Read the generated HTML file and replace the CDN links with local paths
with open(output_file_path, 'r', encoding='utf-8') as file:
    file_contents = file.read()
file_contents = file_contents.replace(
    'https://cdn.jsdelivr.net/gh/bmabey/pyLDAvis@3.4.0/pyLDAvis/js/ldavis.v1.0.0.js',
    'ldavis.v1.0.0.js'
)
file_contents = file_contents.replace(
    'https://cdn.jsdelivr.net/gh/bmabey/pyLDAvis@3.4.0/pyLDAvis/js/ldavis.v1.0.0.css',
    'ldavis.v1.0.0.css'
)
# Save the modified HTML file
with open(output_file_path, 'w', encoding='utf-8') as file:
    file.write(file_contents)
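pyLDAvis.prepare expects every row of topic_term_dists to be a probability distribution, whereas scikit-learn describes components_ as unnormalized pseudo-counts. The following is a minimal pure-Python sketch of that row normalization, together with a way to list the heaviest terms of each topic without opening the HTML page. The matrix values, the vocabulary, and the function names `normalize_rows` and `top_terms` are made up for illustration.

```python
def normalize_rows(matrix):
    """Scale each row so that it sums to 1 (rows must have a positive sum)."""
    normalized = []
    for row in matrix:
        total = sum(row)
        if total <= 0:
            raise ValueError("each topic row needs a positive total weight")
        normalized.append([value / total for value in row])
    return normalized


def top_terms(topic_term, vocab, n=3):
    """Return the n highest-weighted (term, weight) pairs for every topic."""
    result = []
    for row in topic_term:
        ranked = sorted(zip(vocab, row), key=lambda pair: pair[1], reverse=True)
        result.append(ranked[:n])
    return result


# Made-up pseudo-counts for 2 topics over a 4-term vocabulary.
vocab = ["WiFi6", "信号", "价格", "安装"]
pseudo_counts = [
    [6.0, 3.0, 0.5, 0.5],
    [1.0, 1.0, 5.0, 3.0],
]

topic_term_dists = normalize_rows(pseudo_counts)
for topic_id, terms in enumerate(top_terms(topic_term_dists, vocab, n=2)):
    print(topic_id, terms)
```

With a fitted scikit-learn model the same idea is usually written as a single array division, but the list version makes the per-topic arithmetic explicit.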
5. Key files
luyouqi.txt: segmentation dictionary (excerpt)
2.4G
2.5G口
软路由
2.5G
WiFi
WiFi5
WiFi6
WiFi4
stopwordsfull: stop words (excerpt)
客户
层面
菜鸟
滑丝
换货
三思
固记
厂商
吸引力
体会
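Both files hold one entry per line. Reading them with only `line.strip()` keeps an empty string in the set whenever the file contains a blank line. The loader below is a hedged sketch that skips such lines. The helper name `load_terms` and the temporary file are for illustration only and are not part of the original script.

```python
import os
import tempfile


def load_terms(path, encoding="utf-8"):
    """Read one term per line, skipping blank lines and surrounding whitespace."""
    with open(path, "r", encoding=encoding) as f:
        return {line.strip() for line in f if line.strip()}


# Demonstration with a temporary file standing in for a dictionary file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", encoding="utf-8", delete=False
) as tmp:
    tmp.write("2.4G\n软路由\n\nWiFi6 \n")
    demo_path = tmp.name

terms = load_terms(demo_path)
os.remove(demo_path)
```

The same helper could serve for both the stop-word list and the business dictionary, since the two share the one-entry-per-line layout.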
6. Reference for LDA topic weight priority
https://www.bilibili.com/video/BV1Sr4y1C7Xc/?spm_id_from=333.337.search-card.all.click