BERTopic (Part 1): Basic Usage

Summary: basic usage of BERTopic

The documentation at https://maartengr.github.io/BERTopic/algorithm/algorithm.html describes the basic pipeline.

Document embeddings
Embedding whole documents overcomes the bag-of-words drawback that it "disregard[s] semantic relationships among words".
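In the snippets below, docs is assumed to be a plain Python list of raw document strings, for example:

# Hypothetical corpus for illustration; in practice, load your own documents
docs = [
    "今天股市大幅上涨",
    "球队在昨晚的比赛中获胜",
    "新款手机发布会定于下周举行",
]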

from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("DMetaSoul/Dmeta-embedding")
# Computing embeddings is slow, so it is worth pre-calculating them
embeddings = embedding_model.encode(docs, show_progress_bar=True)

Document clustering
Reduce dimensionality with UMAP, then cluster with HDBSCAN.
HDBSCAN is used because "a cluster will not always lie within a sphere around a cluster centroid", i.e. clusters can take arbitrary shapes.

from umap import UMAP
from hdbscan import HDBSCAN

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

The n_neighbors and min_cluster_size parameters effectively tune how large a cluster can be, while n_components sets the dimensionality of UMAP's output.
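As a rough illustration (the exact values below are assumptions, not recommendations): smaller n_neighbors and min_cluster_size tend to produce more, finer-grained topics, while larger values merge documents into fewer, broader ones.

# Hypothetical settings for finer-grained topics
umap_fine = UMAP(n_neighbors=5, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
hdbscan_fine = HDBSCAN(min_cluster_size=5, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

# Hypothetical settings for fewer, broader topics
umap_coarse = UMAP(n_neighbors=50, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
hdbscan_coarse = HDBSCAN(min_cluster_size=50, metric='euclidean', cluster_selection_method='eom', prediction_data=True)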

Topic Representation
This is the core part, consisting of three steps: vectorize, c-TF-IDF, and fine-tune.
For Chinese text, word segmentation is required first.

from sklearn.feature_extraction.text import CountVectorizer
import jieba

def tokenize_zh(text):
    words = jieba.lcut(text)
    return words

# By increasing the n-gram range we will consider topic representations that are made up of one to three words.
vectorizer_model = CountVectorizer(tokenizer=tokenize_zh, ngram_range=(1, 3))
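As a quick sanity check of the tokenizer (the exact segmentation depends on jieba's dictionary, so the output shown is only illustrative):

print(tokenize_zh("自然语言处理很有趣"))
# e.g. ['自然语言', '处理', '很', '有趣']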

The classic TF-IDF procedure combines two statistics: term frequency and inverse document frequency. We generalize this procedure to clusters of documents. First, we treat all documents in a cluster as a single document by simply concatenating them. Then, TF-IDF is adjusted to account for this representation by translating documents to clusters.
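Conceptually, the weight of a term in a cluster is its frequency within that cluster, scaled by how rare it is across all clusters. A minimal numpy sketch of this idea (an illustration only, not BERTopic's actual implementation):

import numpy as np

# Hypothetical term-frequency matrix: rows = clusters, columns = terms
tf = np.array([[5., 0., 2.],
               [1., 3., 0.]])

A = tf.sum() / tf.shape[0]             # average number of words per cluster
idf = np.log(1 + A / tf.sum(axis=0))   # rarity of each term across clusters
ctfidf = tf * idf                      # class-based TF-IDF weights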

from bertopic.vectorizers import ClassTfidfTransformer

ctfidf_model = ClassTfidfTransformer()

from bertopic.representation import KeyBERTInspired

# Create your representation model
# KeyBERTInspired helps filter stop words out of the topic representations
representation_model = KeyBERTInspired()
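Other representation models can be swapped into this same slot. For example, MaximalMarginalRelevance from the same module trades off relevance against diversity of the topic words (the diversity value here is just an assumption):

from bertopic.representation import MaximalMarginalRelevance

# diversity ranges from 0 to 1; higher values favor more diverse topic words
# representation_model = MaximalMarginalRelevance(diversity=0.3)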

Then assemble the components and train the model:

from bertopic import BERTopic

topic_model = BERTopic(
  embedding_model=embedding_model,          # Step 1 - Extract embeddings
  umap_model=umap_model,                    # Step 2 - Reduce dimensionality
  hdbscan_model=hdbscan_model,              # Step 3 - Cluster reduced embeddings
  vectorizer_model=vectorizer_model,        # Step 4 - Tokenize topics
  ctfidf_model=ctfidf_model,                # Step 5 - Extract topic words
  representation_model=representation_model, # Step 6 - (Optional) Fine-tune topic representations
  verbose=True
)

# Pass the pre-calculated embeddings so they are not computed again
topics, probs = topic_model.fit_transform(docs, embeddings)
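HDBSCAN labels documents that fit no cluster as outliers, and these end up in topic -1. If too many documents land there, recent BERTopic versions provide reduce_outliers to reassign them (a sketch; check your installed version's API):

# Reassign outlier documents (topic -1) to their closest topics,
# then refresh the topic representations with the new assignments
new_topics = topic_model.reduce_outliers(docs, topics)
topic_model.update_topics(docs, topics=new_topics, vectorizer_model=vectorizer_model, ctfidf_model=ctfidf_model)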

Let's look at the topics that were generated:

topic_model.get_topic_info()

topic_model.get_topic(0)
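To see the topic assignment for each individual document, recent versions also offer:

topic_model.get_document_info(docs)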

Finally, let's visualize the results:

topic_model.visualize_topics()

topic_model.visualize_heatmap()
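Other built-in views are worth trying as well, e.g. bar charts of the top words per topic and a hierarchy of topics:

topic_model.visualize_barchart()
topic_model.visualize_hierarchy()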

The model can also be saved and loaded:

# serialization="safetensors" stores the sub-models in safetensors format;
# save_ctfidf=True also keeps the c-TF-IDF weights
topic_model.save("path/to/my/model_dir", serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)

loaded_model = BERTopic.load("path/to/my/model_dir")
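The reloaded model can then assign topics to unseen documents (the example document below is made up):

new_topics, new_probs = loaded_model.transform(["这是一条新的文档"])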