Self-Instruct Data Generation with Qwen

Run this cookbook online:
https://colab.research.google.com/drive/1tNRrC3u6TjdHz_vG3VicYz6Q7DV_E1cq?usp=sharing

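⭐ Star the Repo

If you find CAMEL useful or interesting, please consider starring the CAMEL repository! Your star helps more people discover the project and motivates us to keep improving it.

Self-instruct is a technique for automatically generating instruction data for large language models (LLMs). Creating such datasets by hand can be time-consuming and expensive; self-instruct automates the process so that a large number of instructions can be generated quickly and efficiently.

Installation and Setup

First, install the CAMEL package with all of its dependencies: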

!pip install "git+https://github.com/camel-ai/camel.git@master#egg=camel-ai[all]"

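If you do not have a Qwen API key yet, you can obtain one as follows:

1. Visit the Alibaba Cloud Model Studio console and follow the on-screen instructions to activate the model service.
2. In the upper-right corner of the console, click your account name and select API-KEY.
3. On the API key management page, click the Create API Key button to generate a new key.

import os
from getpass import getpass

# Read the key without echoing it, then expose it through the environment
qwen_api_key = getpass('Enter your Qwen API key: ')
os.environ["QWEN_API_KEY"] = qwen_api_key

Enter your Qwen API key: ··········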

from camel.configs import QwenConfig
from camel.models import ModelFactory
from camel.types import ModelPlatformType, ModelType
from camel.agents import ChatAgent
from camel.messages import BaseMessage

qwen_model = ModelFactory.create(
    model_platform=ModelPlatformType.QWEN,
    model_type=ModelType.QWEN_TURBO,
    model_config_dict=QwenConfig(temperature=0.2).as_dict(),
)

Basic Agent Setup

from camel.agents import ChatAgent
from camel.datagen.self_instruct import SelfInstructPipeline

agent = ChatAgent(
    model=qwen_model,
)

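Basic Pipeline Setup

The pipeline starts from a small set of seed (human-written) instructions and then uses an LLM to generate new instructions based on those seeds.

Seed instructions are typically stored in a JSON Lines (JSONL) file, where each line holds a single instruction encoded as JSON.
Like the seed file, the output is stored in JSONL format, which makes it easy to parse and reuse in later tasks such as training or fine-tuning language models.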
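As a rough illustration of the JSONL layout described above, the sketch below writes two seed records to a temporary file and reads them back one line at a time. The `instruction` and `id` field names and the sample records are assumptions made for this example; check your own seed file for the fields it actually uses.

```python
import json
import os
import tempfile

# Hypothetical seed records; the field names are assumptions for illustration.
seed_records = [
    {"id": "seed_task_0", "instruction": "Sort the numbers in ascending order."},
    {"id": "seed_task_1", "instruction": "Summarize the article in one sentence."},
]


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Parse each non-empty line as an independent JSON object."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


demo_path = os.path.join(tempfile.mkdtemp(), "demo_seed_tasks.jsonl")
write_jsonl(demo_path, seed_records)

loaded = read_jsonl(demo_path)
print([record["instruction"] for record in loaded])
```

Because every line is a complete JSON document, a file like this can be appended to or streamed without loading it all into memory.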

Replace seed_path with the path to your seed file and data_output_path with your desired output location.

import os
import requests

# Create a directory for local data
os.makedirs('local_data', exist_ok=True)

# URL of the raw seed file
url = "https://raw.githubusercontent.com/camel-ai/camel/master/examples/synthetic_datagen/self_instruct/seed_tasks.jsonl"

# Download the raw file and stop early if the request failed
response = requests.get(url, timeout=30)
response.raise_for_status()

with open('local_data/seed_tasks.jsonl', 'wb') as file:
    file.write(response.content)

seed_path = 'local_data/seed_tasks.jsonl'
data_output_path = 'data_output.json'

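Self-instruct works iteratively. In each round:

A number of human-written instructions (num_human_sample) are sampled from seed_path.
A number of machine-generated instructions (num_machine_sample) are sampled from previous rounds.
The selected instructions are used to guide the language model in generating new instructions.
The new instructions are added to the pool of machine-generated instructions, and the process repeats until the desired number of instructions has been generated.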
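The rounds above can be sketched in plain Python as follows. This is a simplified illustration, not CAMEL's implementation: `fake_generate` is a hypothetical stand-in for the LLM call, `self_instruct_rounds` is a name invented for this example, and the filtering step is omitted.

```python
import random


def fake_generate(examples):
    """Hypothetical stand-in for the LLM call that writes a new instruction."""
    return f"New instruction inspired by {len(examples)} examples"


def self_instruct_rounds(human_pool, target, num_human_sample,
                         num_machine_sample, seed=0):
    """Grow a pool of machine instructions until it reaches `target`."""
    rng = random.Random(seed)
    machine_pool = []
    while len(machine_pool) < target:
        # Sample human-written seeds (never more than are available).
        humans = rng.sample(human_pool, min(num_human_sample, len(human_pool)))
        # Sample earlier machine output; empty in the first round.
        machines = rng.sample(
            machine_pool, min(num_machine_sample, len(machine_pool))
        )
        new_instruction = fake_generate(humans + machines)
        # Add the result to the pool so later rounds can sample it.
        machine_pool.append(f"{new_instruction} (#{len(machine_pool) + 1})")
    return machine_pool


human_pool = [f"Human instruction {i}" for i in range(8)]
generated = self_instruct_rounds(
    human_pool, target=10, num_human_sample=6, num_machine_sample=2
)
print(len(generated))
```

In the first round the machine pool is empty, so the prompt consists of human-written examples only; machine-generated examples start to appear from the second round onward.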

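human_to_machine_ratio controls the balance between human guidance and model creativity throughout the process. Adjusting this ratio lets you influence the quality and diversity of the generated instructions.

Feel free to adjust num_human_sample and num_machine_sample; both values are later passed to the pipeline as the human_to_machine_ratio tuple.

num_human_sample = 6
num_machine_sample = 2

Replace target_num_instructions with the number of machine-generated instructions you want.

target_num_instructions = 10

Pass everything to the pipeline.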

pipeline = SelfInstructPipeline(
    agent=agent,
    seed=seed_path,
    num_machine_instructions=target_num_instructions,
    data_output_path=data_output_path,
    human_to_machine_ratio=(num_human_sample, num_machine_sample),
)

Now try generating! The generated data file will be created at the location you specified.

pipeline.generate()

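Filter Functions

Newly generated instructions are filtered and evaluated before they are added to the results; only instructions that meet predefined criteria are kept. CAMEL provides several filter functions that can be used with self-instruct, and custom filters are supported for tailored evaluation. A filter function returns True when an instruction is valid and False otherwise.

Length Filter

LengthFilter filters out every instruction whose length (counted in words, as the output below indicates) is less than min_len or greater than max_len.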

from camel.datagen.self_instruct import LengthFilter

length_filter = LengthFilter(min_len=5, max_len=50)

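instructions = [
    "Sort the numbers in ascending order.",
    "Calculate the sum.",
    "Create a spreadsheet report detailing monthly expenses and savings."
]

# Keep only the instructions that pass the length check
filtered_instructions = [instr for instr in instructions if length_filter.apply(instr)]
print(filtered_instructions)

['Sort the numbers in ascending order.', 'Create a spreadsheet report detailing monthly expenses and savings.']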

Keyword Filter

KeywordFilter filters out instructions that contain specific unwanted keywords.

from camel.datagen.self_instruct import KeywordFilter

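keyword_filter = KeywordFilter(keywords=["ban", "prohibit", "forbid"])

instructions = [
    "Ban the use of plastic bags.",
    "Encourage recycling programs.",
    "Prohibit smoking in public places."
]

# Keep only the instructions that contain none of the keywords
filtered_instructions = [instr for instr in instructions if keyword_filter.apply(instr)]
print(filtered_instructions)

['Encourage recycling programs.']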

Punctuation Filter

PunctuationFilter filters out instructions that begin with a non-alphanumeric character.

from camel.datagen.self_instruct import PunctuationFilter

punctuation_filter = PunctuationFilter()

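instructions = [
    "Sort the data by category.",
    "#Analyze trends over time.",
    "*Create a summary of the results."
]

# Keep only the instructions that do not start with punctuation
filtered_instructions = [instr for instr in instructions if punctuation_filter.apply(instr)]
print(filtered_instructions)

['Sort the data by category.']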

Non-English Filter

NonEnglishFilter filters out instructions that do not begin with an English letter.

from camel.datagen.self_instruct import NonEnglishFilter

non_english_filter = NonEnglishFilter()

instructions = [
    "Analyze the performance metrics.",
    "计算结果的统计数据.",
    "Test the new algorithm."
]

filtered_instructions = [instr for instr in instructions if non_english_filter.apply(instr)]
print(filtered_instructions)

['Analyze the performance metrics.', 'Test the new algorithm.']

ROUGE Similarity Filter

RougeSimilarityFilter uses ROUGE scores to filter out instructions that are too similar to existing ones.

from camel.datagen.self_instruct import RougeSimilarityFilter

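existing_instructions = [
    "Summarize the article.",
    "Write a brief overview of the text."
]

similarity_filter = RougeSimilarityFilter(existing_instructions, threshold=0.5)

instructions = [
    "Summarize the content.",
    "Create a summary for the text.",
    "Provide an analysis of the text."
]

# Keep only the instructions that are not too close to an existing one
filtered_instructions = [instr for instr in instructions if similarity_filter.apply(instr)]
print(filtered_instructions)

['Create a summary for the text.', 'Provide an analysis of the text.']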
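To make the similarity check more concrete, here is a minimal pure-Python sketch of a ROUGE-L F-measure computed from the longest common subsequence of word tokens. It is a simplified stand-in under stated assumptions (a basic lowercase ASCII tokenizer, no stemming), not the scoring code CAMEL uses, so exact scores may differ from the library's. The helper names `tokenize`, `lcs_length`, `rouge_l_f1`, and `is_novel` are invented for this example, which uses English sentences.

```python
import re


def tokenize(text):
    """Lowercase and keep runs of ASCII letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F-measure: harmonic mean of LCS precision and recall."""
    cand, ref = tokenize(candidate), tokenize(reference)
    lcs = lcs_length(cand, ref) if cand and ref else 0
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def is_novel(instruction, existing, threshold):
    """Accept an instruction only if no existing one scores above the threshold."""
    return max(rouge_l_f1(instruction, e) for e in existing) <= threshold


existing = ["Summarize the article.", "Write a brief overview of the text."]
candidates = [
    "Summarize the content.",
    "Create a summary for the text.",
    "Provide an analysis of the text.",
]
novel = [c for c in candidates if is_novel(c, existing, threshold=0.5)]
print(novel)
```

In this sketch, "Summarize the content." shares two of its three tokens, in order, with "Summarize the article.", which gives a score of about 0.67 and puts it above the 0.5 threshold; the other two candidates score about 0.46 against their closest match and are kept.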

Custom Filter Function

You can also implement your own filter function.

from camel.datagen.self_instruct import FilterFunction

class CustomFilter(FilterFunction):

    def apply(self, instruction: str) -> bool:
        # Apply your own logic here
        logic = ...
        return logic
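For a concrete picture of what such a filter might look like, the sketch below fills in the `apply` method with a simple rule: keep an instruction only if its first word is one of a set of allowed verbs. Everything here is hypothetical. `BaseFilter` is a local stand-in for the base class so the example runs without CAMEL installed, and `StartsWithVerbFilter` is a name invented for illustration.

```python
from abc import ABC, abstractmethod


class BaseFilter(ABC):
    """Local stand-in for the filter base class; with CAMEL installed you
    would subclass its FilterFunction instead."""

    @abstractmethod
    def apply(self, instruction: str) -> bool:
        """Return True if the instruction should be kept."""


class StartsWithVerbFilter(BaseFilter):
    """Hypothetical filter: keep instructions whose first word is an allowed verb."""

    def __init__(self, allowed_verbs):
        self.allowed_verbs = {verb.lower() for verb in allowed_verbs}

    def apply(self, instruction: str) -> bool:
        words = instruction.split()
        # Empty instructions have no first word and are rejected.
        return bool(words) and words[0].lower() in self.allowed_verbs


verb_filter = StartsWithVerbFilter(["sort", "summarize", "calculate"])
candidates = ["Sort the list by date.", "The weather is nice today.", ""]
kept = [c for c in candidates if verb_filter.apply(c)]
print(kept)
```

The only contract the filter has to honor is the one stated earlier: `apply` takes an instruction string and returns True for valid instructions and False otherwise.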

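Instruction Filter

InstructionFilter manages all of the filter functions. We can initialize the pipeline with a custom InstructionFilter.

First, choose the filter functions you want and configure them.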

filter_config = {
    "length": {"min_len": 5, "max_len": 100},
    "keyword": {"keywords": ["image", "video"]},
    "non_english": {},
    "rouge_similarity": {
        "existing_instructions": ["Some existing instructions"],
        "threshold": 0.6,
    },
}

Then initialize an InstructionFilter:

from camel.datagen.self_instruct import InstructionFilter
filters = InstructionFilter(filter_config)

All of the configured filter functions can then be applied at once:

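instructions = [
    "Sort the numbers in ascending order.",
    "Calculate the sum.",
    "Create a spreadsheet report detailing monthly expenses and savings.",
    "*Create a summary of the results.",
    "计算结果的统计数据。"  # left in Chinese on purpose: rejected by the non-English filter
]

# An instruction is kept only if it passes every configured filter
filtered_instructions = [instr for instr in instructions if filters.filter(instr)]
print(filtered_instructions)

['Sort the numbers in ascending order.', 'Create a spreadsheet report detailing monthly expenses and savings.']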

Setting Up the Pipeline with a Custom `InstructionFilter`

CAMEL applies some default filter functions in the pipeline, but you can also choose your own!

pipeline = SelfInstructPipeline(
    agent=agent,
    seed=seed_path,
    num_machine_instructions=target_num_instructions,
    data_output_path=data_output_path,  # the variable defined earlier, not the literal string
    human_to_machine_ratio=(num_human_sample, num_machine_sample),
    instruction_filter=filters,  # pass in your InstructionFilter
)

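Alternatively, if you want to keep the default filter functions but use a different configuration, you can pass in just the filter configuration instead.

Finally, start generating!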

pipeline.generate()

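That's everything. Got questions about 🐫 CAMEL-AI? Join us on Discord! Whether you want to share feedback, explore the latest advances in multi-agent systems, get support, or connect with others on exciting projects, we would love to have you in the community! 🤝

Check out some of our other work:

🐫 Creating Your First CAMEL Agent (free Colab)
Graph RAG Cookbook (free Colab)
🧑‍⚖️ Create a Hackathon Judge Committee with Workforce (free Colab)
🔥 3 Ways to Ingest Data from Websites with Firecrawl and CAMEL (free Colab)
🦥 Agentic SFT Data Generation with CAMEL and Mistral Models, Fine-Tuned with Unsloth (free Colab)

Thanks from everyone at 🐫 CAMEL-AI

⭐ Star us on GitHub, join our Discord, or follow us on X ⭐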
