BERTopic (Part 2): Fine-tune representation by LLM - Alibaba Cloud Developer Community

2024-05-30, published in Beijing

Summary: bertopic

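In the overview, KeyBERT was used to fine-tune the topic representation. BERTopic can also fine-tune it with a large language model (LLM), and supports openai, llama.cpp, and langchain for this. This article uses the openai client together with Ollama for a local deployment.

For Ollama, see https://ollama.com/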

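import openai
import bertopic.representation

# Point the OpenAI client at Ollama's local OpenAI-compatible endpoint.
client = openai.OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',  # required, but unused
)

# "yi" is the model name served by Ollama; chat=True selects the chat completions API.
representation_model = bertopic.representation.OpenAI(client, model="yi", chat=True)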
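To take effect, the representation model is passed to BERTopic when the topic model is constructed. The following is a minimal sketch, assuming a local Ollama server is running with the `yi` model available. `build_topic_model` is a hypothetical helper name, and `chat=True` follows the call used in this article (it may differ in other BERTopic versions).

```python
def build_topic_model(base_url="http://localhost:11434/v1", model="yi"):
    """Return a BERTopic instance whose topic labels are generated by a local LLM."""
    # Imported lazily so the helper can be defined before the packages are loaded.
    import openai
    from bertopic import BERTopic
    from bertopic.representation import OpenAI

    client = openai.OpenAI(base_url=base_url, api_key="ollama")  # key is required but unused
    representation_model = OpenAI(client, model=model, chat=True)
    return BERTopic(representation_model=representation_model)


# Usage (needs a running Ollama server and a list of strings `docs`):
# topic_model = build_topic_model()
# topics, probs = topic_model.fit_transform(docs)
# print(topic_model.get_topic_info())
```

The usage lines are left commented out because they need a live model server and a real corpus.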

Default prompt

DEFAULT_CHAT_PROMPT = """
I have a topic that contains the following documents: 
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]

Based on the information above, extract a short topic label in the following format:
topic: <topic label>
"""
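The two placeholders in the prompt, [DOCUMENTS] and [KEYWORDS], are filled in per topic before the request is sent. The sketch below illustrates the idea with plain string replacement. `fill_prompt` is a hypothetical helper, and the bullet formatting of the documents is an assumption rather than a guaranteed match for the library's output.

```python
PROMPT = """
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]

Based on the information above, extract a short topic label in the following format:
topic: <topic label>
"""


def fill_prompt(prompt, docs, keywords):
    """Substitute the two placeholders with plain string replacement."""
    doc_block = "\n".join(f"- {doc}" for doc in docs)
    filled = prompt.replace("[DOCUMENTS]", doc_block)
    return filled.replace("[KEYWORDS]", ", ".join(keywords))


example = fill_prompt(
    PROMPT,
    docs=["The cat sat on the mat.", "Dogs chase cats."],
    keywords=["cat", "dog", "pet"],
)
print(example)
```

A custom prompt can use the same placeholders, as long as it still asks for the answer in the `topic: <topic label>` form that the label parsing expects.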

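A quick walk through the source, starting from fit_transform: at line 433 (in the version read here; line numbers vary across releases) it calls self._extract_topics(documents, embeddings=embeddings, verbose=self.verbose). In _openai.py, the extract_topics method calls _extract_representative_docs to select the representative documents of each topic. For each topic, those documents are placed into the prompt and self.client.chat.completions.create is called, so the LLM generates one response per topic. The label is then obtained with response.choices[0].message.content.strip().replace("topic: ", "").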
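The label extraction step can be tried in isolation. This is a small sketch that applies the same `strip().replace("topic: ", "")` expression to a stand-in object. `extract_label` and `fake_response` are hypothetical names, and the stand-in models only the attributes that the expression touches.

```python
from types import SimpleNamespace


def extract_label(response):
    """Pull the topic label out of a chat-completion style response."""
    content = response.choices[0].message.content
    return content.strip().replace("topic: ", "")


# Stand-in for a real response; only the attributes used above are modeled.
fake_response = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content="\ntopic: Pet care\n"))]
)
label = extract_label(fake_response)
print(label)  # Pet care
```

If the model does not follow the `topic: ` format, the text is returned unchanged apart from the stripped whitespace, so a verbose model can produce long labels.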

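The main parameters of bertopic.representation.OpenAI are described below. By default, the four most representative documents are passed to [DOCUMENTS].
nr_docs changes the number of documents passed in. diversity mitigates the problem of documents that are too similar to each other; it takes a value between 0 and 1, and 0.1 is the recommended setting.
doc_length truncates each document.
tokenizer determines how doc_length is counted, for example by characters (char) or by whitespace-separated tokens (whitespace).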
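To make the interaction between doc_length and tokenizer concrete, here is a simplified sketch of truncation under the two modes mentioned above. `truncate` is a hypothetical function written for illustration and is not the library's own implementation.

```python
def truncate(document, doc_length=None, tokenizer="char"):
    """Shorten a document to doc_length units, counted per the tokenizer mode."""
    if doc_length is None:
        return document  # no truncation requested
    if tokenizer == "char":
        return document[:doc_length]  # count characters
    if tokenizer == "whitespace":
        return " ".join(document.split()[:doc_length])  # count whitespace-separated tokens
    raise ValueError(f"unsupported tokenizer mode: {tokenizer!r}")


text = "BERTopic builds topics from document embeddings"
by_char = truncate(text, doc_length=8, tokenizer="char")
by_word = truncate(text, doc_length=3, tokenizer="whitespace")
print(by_char)
print(by_word)
```

With the real class, the same settings would be passed as keyword arguments, for example nr_docs=4, diversity=0.1, doc_length=100, tokenizer="whitespace", alongside the client and model name. The specific values here are only illustrative.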
