BERTopic often produces a large number of topics, some of which overlap or duplicate each other.
HDBSCAN's min_cluster_size is the officially recommended parameter for controlling the number of topics: a larger value produces fewer, larger clusters, and therefore fewer topics.
Manual Topic Reduction
Merge topics manually by listing the topic IDs to combine:
topics_to_merge = [[1, 2],
[3, 4]]
topic_model.merge_topics(docs, topics_to_merge)
Automatic Topic Reduction
from bertopic import BERTopic
topic_model = BERTopic(nr_topics="auto")
Topic Reduction after Training
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
# Create topics -> Typically over 50 topics
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
# Further reduce topics
topic_model.reduce_topics(docs, nr_topics=30)
# Access updated topics
topics = topic_model.topics_
Update Topic Representation after Training
# Try a different n_gram_range
topic_model.update_topics(docs, n_gram_range=(1, 3))
# Or pass a custom CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 5))
topic_model.update_topics(docs, vectorizer_model=vectorizer_model)
Outlier reduction
Assign outlier documents (topic -1) to the closest non-outlier topic:
new_topics = topic_model.reduce_outliers(docs, topics)
The threshold parameter sets the minimum similarity (or probability, depending on the strategy) an outlier document must have with a topic before it is assigned; documents below it remain outliers.
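Conceptually, the thresholding step works like the following standalone sketch. This uses scikit-learn's cosine_similarity on made-up vectors, not BERTopic's internals:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical representations: 2 outlier documents x 3 topics.
outlier_reprs = np.array([[1.0, 0.0], [0.5, 0.5]])
topic_reprs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])

sims = cosine_similarity(outlier_reprs, topic_reprs)
best = sims.argmax(axis=1)
# Keep -1 (outlier) when the best similarity falls below the threshold.
threshold = 0.9
assigned = [int(t) if sims[i, t] >= threshold else -1
            for i, t in enumerate(best)]
```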
The default strategy computes a c-TF-IDF representation for each outlier document and assigns it to the best-matching non-outlier topic. Other strategies to consider:
Assign topics using topic-document probabilities
This strategy uses the soft-clustering as performed by HDBSCAN to find the best matching topic for each outlier document.
from bertopic import BERTopic
# Train your BERTopic model and calculate the document-topic probabilities
topic_model = BERTopic(calculate_probabilities=True)
topics, probs = topic_model.fit_transform(docs)
# Reduce outliers using the `probabilities` strategy
new_topics = topic_model.reduce_outliers(docs, topics, probabilities=probs, strategy="probabilities")
Assign topics using topic-document distributions
This strategy uses the topic-document distributions, as calculated with approximate_distribution, to find the best matching topic for each outlier document.
from bertopic import BERTopic
# Train your BERTopic model
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
# Reduce outliers using the `distributions` strategy
new_topics = topic_model.reduce_outliers(docs, topics, strategy="distributions")
Assign topics using c-TF-IDF representations
Calculate the c-TF-IDF representation for each outlier document and find the best matching c-TF-IDF topic representation using cosine similarity.
from bertopic import BERTopic
# Train your BERTopic model
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
# Reduce outliers using the `c-tf-idf` strategy
new_topics = topic_model.reduce_outliers(docs, topics, strategy="c-tf-idf")
Assign topics using document and topic embeddings
Using the embedding of each outlier document, find the best matching topic embedding through cosine similarity.
from bertopic import BERTopic
# Train your BERTopic model
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
# Reduce outliers using the `embeddings` strategy
new_topics = topic_model.reduce_outliers(docs, topics, strategy="embeddings")
Combining strategies
Strategies can be chained: apply one with a threshold first, then mop up the remaining outliers with another.
# Use the "c-TF-IDF" strategy with a threshold
new_topics = topic_model.reduce_outliers(docs, topics, strategy="c-tf-idf", threshold=0.1)
# Reduce all outliers that are left with the "distributions" strategy
new_topics = topic_model.reduce_outliers(docs, new_topics, strategy="distributions")
Update Topics
After re-assigning outliers, refresh the topic representations so they match the new assignments:
topic_model.update_topics(docs, topics=new_topics)