BERTopic (3): Update Topics

Summary: Updating topics in BERTopic

BERTopic tends to generate a large number of topics, and some of them overlap.

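HDBSCAN's min_cluster_size is the officially recommended parameter for controlling the number of topics. With BERTopic's default HDBSCAN model it is set through the min_topic_size argument; a larger value yields fewer topics.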

Manual Topic Reduction
Merge similar topics by hand:

topics_to_merge = [[1, 2],
                   [3, 4]]
topic_model.merge_topics(docs, topics_to_merge)
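
To make the effect on the per-document assignments concrete, here is a minimal sketch of the relabeling step only. The helper merge_topic_labels is hypothetical and not part of BERTopic; the library additionally recomputes the topic representations, which this sketch omits.

```python
def merge_topic_labels(topics, topics_to_merge):
    """Relabel documents so every topic in a group shares the group's first id."""
    mapping = {}
    for group in topics_to_merge:
        target = group[0]
        for topic in group:
            mapping[topic] = target
    # Topics that are not part of any group keep their id.
    return [mapping.get(topic, topic) for topic in topics]


# Toy per-document assignments; -1 marks an outlier.
topics = [0, 1, 2, 3, 4, -1, 2, 4]
merged = merge_topic_labels(topics, [[1, 2], [3, 4]])
print(merged)
```

Topics 1 and 2 collapse into one label and topics 3 and 4 into another, while topic 0 and the outlier are left untouched.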

Automatic Topic Reduction

from bertopic import BERTopic
topic_model = BERTopic(nr_topics="auto")

Topic Reduction after Training

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Create topics -> Typically over 50 topics
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

# Further reduce topics
topic_model.reduce_topics(docs, nr_topics=30)

# Access updated topics
topics = topic_model.topics_
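
One way to picture topic reduction is as repeated merging of similar topics until the requested number remains. The sketch below is a simplified illustration under that assumption, not BERTopic's actual algorithm: the function reduce_topics_sketch, the two-dimensional topic vectors, and the "merge the smallest topic first" rule are all made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def reduce_topics_sketch(topic_vectors, topic_sizes, nr_topics):
    """Merge the smallest topic into its most similar topic until nr_topics remain.

    Returns a dict mapping every original topic id to its final topic id.
    """
    vectors = {t: list(v) for t, v in topic_vectors.items()}
    sizes = dict(topic_sizes)
    mapping = {t: t for t in vectors}
    while len(vectors) > nr_topics:
        smallest = min(sizes, key=sizes.get)
        others = [t for t in vectors if t != smallest]
        target = max(others, key=lambda t: cosine(vectors[smallest], vectors[t]))
        # A size-weighted average keeps the merged vector representative.
        total = sizes[smallest] + sizes[target]
        vectors[target] = [
            (a * sizes[target] + b * sizes[smallest]) / total
            for a, b in zip(vectors[target], vectors[smallest])
        ]
        sizes[target] = total
        # Redirect everything that pointed at the removed topic.
        for original, current in mapping.items():
            if current == smallest:
                mapping[original] = target
        del vectors[smallest], sizes[smallest]
    return mapping


# Hypothetical topic representations and document counts.
topic_vectors = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0], 3: [0.1, 0.9]}
topic_sizes = {0: 50, 1: 10, 2: 40, 3: 5}
mapping = reduce_topics_sketch(topic_vectors, topic_sizes, nr_topics=2)
print(mapping)
```

In this toy data, topic 3 is absorbed by topic 2 and topic 1 by topic 0, leaving two topics.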

Update Topic Representation after Training

# Try a different n_gram_range
topic_model.update_topics(docs, n_gram_range=(1, 3))

# Or pass a custom CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 5))
topic_model.update_topics(docs, vectorizer_model=vectorizer_model)
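
To see why the n-gram range changes the topic representation, the sketch below lists the candidate terms produced for one short text. It assumes plain whitespace tokenization, which is simpler than what CountVectorizer does; word_ngrams is a hypothetical helper.

```python
def word_ngrams(text, ngram_range=(1, 1)):
    """Return all word n-grams whose length lies within ngram_range (inclusive)."""
    tokens = text.lower().split()
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


unigrams = word_ngrams("new york stock exchange", (1, 1))
up_to_trigrams = word_ngrams("new york stock exchange", (1, 3))
print(unigrams)
print(up_to_trigrams)
```

With a range of (1, 3), multi-word terms such as "new york stock" become candidates for the topic keywords, at the cost of a much larger vocabulary.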

Outlier reduction
Assign outlier documents (topic -1) to non-outlier topics.

new_topics = topic_model.reduce_outliers(docs, topics)

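The threshold parameter sets the minimum similarity between an outlier document and a topic that is required before the document is reassigned (the minimum probability when strategy="probabilities"). Outliers that do not reach it stay at -1.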
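
The role of the threshold can be sketched with cosine similarity on toy vectors. This is an illustration only: assign_outliers, the vectors, and the use of >= for the comparison are assumptions, not BERTopic internals.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_outliers(doc_vectors, topics, topic_vectors, threshold=0.0):
    """Move each outlier (-1) to its most similar topic if the similarity reaches threshold."""
    new_topics = []
    for vector, topic in zip(doc_vectors, topics):
        if topic != -1:
            new_topics.append(topic)
            continue
        best = max(topic_vectors, key=lambda t: cosine(vector, topic_vectors[t]))
        similar_enough = cosine(vector, topic_vectors[best]) >= threshold
        new_topics.append(best if similar_enough else -1)
    return new_topics


# Hypothetical topic and document vectors.
topic_vectors = {0: [1.0, 0.0], 1: [0.0, 1.0]}
doc_vectors = [[0.9, 0.1], [0.6, 0.5], [0.1, 0.8]]
topics = [0, -1, -1]

loose = assign_outliers(doc_vectors, topics, topic_vectors, threshold=0.0)
strict = assign_outliers(doc_vectors, topics, topic_vectors, threshold=0.9)
print(loose)
print(strict)
```

With a strict threshold, the ambiguous second document remains an outlier, while the third one, which clearly resembles topic 1, is still reassigned.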

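The default method computes the c-TF-IDF representation of each outlier document and assigns the document to the best-matching non-outlier topic. The strategies to choose from are: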

Assigning topics with topic-document probabilities

This strategy uses the soft-clustering as performed by HDBSCAN to find the best matching topic for each outlier document.

from bertopic import BERTopic

# Train your BERTopic model and calculate the document-topic probabilities
topic_model = BERTopic(calculate_probabilities=True)
topics, probs = topic_model.fit_transform(docs)

# Reduce outliers using the `probabilities` strategy
new_topics = topic_model.reduce_outliers(docs, topics, probabilities=probs, strategy="probabilities")
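
Conceptually, this strategy picks, for every outlier, the topic with the highest probability in that document's row. The sketch below assumes probs is a matrix with one row per document and one column per topic; assign_by_probability and the numbers are hypothetical.

```python
def assign_by_probability(topics, probs, threshold=0.0):
    """For each outlier (-1), pick the most probable topic if it reaches threshold."""
    new_topics = []
    for topic, row in zip(topics, probs):
        if topic == -1:
            best = max(range(len(row)), key=row.__getitem__)
            topic = best if row[best] >= threshold else -1
        new_topics.append(topic)
    return new_topics


# Hypothetical document-topic probabilities.
probs = [
    [0.80, 0.10, 0.05],
    [0.20, 0.55, 0.10],
    [0.05, 0.04, 0.06],
]
topics = [0, -1, -1]

no_threshold = assign_by_probability(topics, probs)
with_threshold = assign_by_probability(topics, probs, threshold=0.3)
print(no_threshold)
print(with_threshold)
```

Without a threshold every outlier is reassigned, even the last document whose probabilities are all close to zero; a threshold keeps such documents as outliers.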

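Assigning topics with topic-document distributions
Use the topic distributions, as calculated with approximate_distribution, to find the most frequent topic in each outlier document.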

from bertopic import BERTopic

# Train your BERTopic model
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

# Reduce outliers using the `distributions` strategy
new_topics = topic_model.reduce_outliers(docs, topics, strategy="distributions")

Assigning topics with c-TF-IDF representations
Calculate the c-TF-IDF representation for each outlier document and find the best matching c-TF-IDF topic representation using cosine similarity.

from bertopic import BERTopic

# Train your BERTopic model
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

# Reduce outliers using the `c-tf-idf` strategy
new_topics = topic_model.reduce_outliers(docs, topics, strategy="c-tf-idf")

Assigning topics with document and topic embeddings
Using the embedding of each outlier document, find the best matching topic embedding using cosine similarity.

from bertopic import BERTopic

# Train your BERTopic model
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

# Reduce outliers using the `embeddings` strategy
new_topics = topic_model.reduce_outliers(docs, topics, strategy="embeddings")

Combining strategies

# Use the "c-TF-IDF" strategy with a threshold, starting from the original topics
new_topics = topic_model.reduce_outliers(docs, topics, strategy="c-tf-idf", threshold=0.1)

# Reduce all outliers that are left with the "distributions" strategy,
# feeding in the result of the first pass
new_topics = topic_model.reduce_outliers(docs, new_topics, strategy="distributions")
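
The point of chaining is that the second pass only has to deal with the outliers the first pass left behind, so the output of one call becomes the input of the next. A minimal sketch with made-up similarity scores (reduce_pass and both score tables are hypothetical):

```python
def reduce_pass(topics, scores, threshold=0.0):
    """One reduction pass; scores[i] maps topic id -> similarity for document i."""
    result = []
    for topic, doc_scores in zip(topics, scores):
        if topic == -1:
            best = max(doc_scores, key=doc_scores.get)
            topic = best if doc_scores[best] >= threshold else -1
        result.append(topic)
    return result


topics = [0, -1, -1, -1]

# Hypothetical scores from a strict first strategy and a permissive second one.
first_scores = [{0: 0.90, 1: 0.10}, {0: 0.05, 1: 0.40}, {0: 0.08, 1: 0.02}, {0: 0.03, 1: 0.09}]
second_scores = [{0: 0.70, 1: 0.30}, {0: 0.20, 1: 0.80}, {0: 0.60, 1: 0.40}, {0: 0.10, 1: 0.90}]

first = reduce_pass(topics, first_scores, threshold=0.1)
second = reduce_pass(first, second_scores)  # continue from the first pass
print(first)
print(second)
```

The first pass reassigns only the confident case and the second pass resolves the remaining two outliers.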

Update Topics

topic_model.update_topics(docs, topics=new_topics)