Data Generation and Qwen Model Fine-Tuning with CAMEL and Unsloth

2024-12-31

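Summary: This project combines CAMEL and Unsloth to generate high-quality training data and fine-tune the Qwen 7B model, improving its ability to understand and generate text about specific content. We use CAMEL to generate instruction-input-output triplets and Unsloth's LoRA support to speed up fine-tuning. The steps cover environment setup, API key configuration, model loading and configuration, data generation and saving, model training, and inference. The fine-tuned Qwen 7B model is expected to handle text about the CAMEL community better. For more details, see [CAMEL GitHub](https://github.com/camel-ai/camel).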

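In this project, we combine CAMEL and Unsloth to generate high-quality training data and fine-tune the Qwen 7B model so that it better understands and generates text related to specific content. This article walks through the full pipeline from data generation to model training and shares some practical tips and caveats.

1. Project Background and Tools

1.1 CAMEL-AI

CAMEL-AI is an open-source community dedicated to finding the scaling laws of agents. We believe that studying these agents at scale offers valuable insight into their behaviors, capabilities, and potential risks. To facilitate research in this field, we implement and support various types of agents, tasks, prompts, models, and simulated environments.

CAMEL is a multi-agent framework. This article mainly uses its data synthesis capabilities, which are well suited to generating instruction-input-output triplets (such as the Alpaca format). CAMEL supports many models, including the Qwen series, and can generate diverse training data efficiently.

Run this cookbook online: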
https://colab.research.google.com/drive/1sMnWOvdmASEMhsRIOUSAeYuEywby6FRV?usp=sharing
Official documentation for this cookbook:
https://docs.camel-ai.org/cookbooks/sft_data_generation_and_unsloth_finetuning_Qwen2_5_7B.html

1.2 Unsloth

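Unsloth is a fine-tuning tool focused on speeding up the training of large language models. It supports fine-tuning techniques such as LoRA (Low-Rank Adaptation), which can substantially reduce training time and resource usage while preserving model performance.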

1.3 Qwen 7B

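Qwen 7B is an open-source large language model with roughly 7 billion parameters, suitable for fine-tuning on a wide range of natural language processing tasks. Here we load the Qwen2.5-7B checkpoint and fine-tune it with Unsloth so that it better understands and generates content related to the CAMEL community.

2. Environment Setup

First, we set up the environment in Google Colab and install the required libraries. The installation steps are:

# Install the required libraries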
!pip install unsloth
!pip install camel-ai==0.2.14
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
!pip install firecrawl

3. Setting the API Keys

To use CAMEL and Firecrawl, we need to set the OpenAI and Firecrawl API keys:

from getpass import getpass
import os

openai_api_key = getpass('Enter your OpenAI API key: ')
os.environ["OPENAI_API_KEY"] = openai_api_key

firecrawl_api_key = getpass('Enter your Firecrawl API key: ')
os.environ["FIRECRAWL_API_KEY"] = firecrawl_api_key
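Before moving on to model loading, it can help to confirm that both keys actually reached the environment. The following is a minimal sketch; `missing_keys` is a hypothetical helper written for this article, not part of CAMEL or Firecrawl.

```python
import os
from typing import Iterable, List, Mapping


def missing_keys(names: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or blank in `env` (hypothetical helper)."""
    return [name for name in names if not env.get(name, "").strip()]


required = ["OPENAI_API_KEY", "FIRECRAWL_API_KEY"]
absent = missing_keys(required)
if absent:
    # Report early rather than failing later inside an API call.
    print("Missing API keys:", ", ".join(absent))
```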

4. Loading and Configuring the Model

Next, we load the Qwen 7B model with Unsloth and configure it for LoRA fine-tuning:

from unsloth import FastLanguageModel
import torch

max_seq_length = 2048
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen2.5-7B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)
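To get a feel for what `r = 16` means in practice, the sketch below estimates how many trainable weights LoRA adds for the seven `target_modules` above. The layer shapes are assumptions about Qwen2.5-7B that the article does not state (hidden size 3584, MLP size 18944, 4 key/value heads of 128 dimensions, 28 layers); verify them against the model's `config.json` before relying on the result.

```python
# Assumed Qwen2.5-7B shapes (not stated in the article; check the model's config.json).
HIDDEN = 3584
INTERMEDIATE = 18944
KV_DIM = 512       # 4 key/value heads x 128 dimensions per head
NUM_LAYERS = 28

# (in_features, out_features) of each linear layer named in target_modules.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(r: int, shapes=TARGET_SHAPES, num_layers: int = NUM_LAYERS) -> int:
    """LoRA adds A (r x in) and B (out x r) per wrapped layer: r * (in + out) weights."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


print(f"{lora_param_count(16):,}")  # 40,370,176 under the assumed shapes
```

Under these assumptions the adapters amount to roughly 0.5% of the base model's weights, which is why LoRA training fits on a single Colab GPU when combined with 4-bit loading.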

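5. Data Generation

We use CAMEL to generate training data in Alpaca format. First, we define a generation function that produces instruction-input-output triplets from the given content: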

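from typing import List, Optional
import json

from pydantic import BaseModel
from camel.loaders import Firecrawl
from camel.models import ModelFactory
from camel.types import ModelPlatformType, ModelType
from camel.configs import ChatGPTConfig
from camel.agents import ChatAgent
# Assumption: AlpacaItem is importable from this module in camel-ai 0.2.14.
from camel.messages.conversion import AlpacaItem


# The article uses these wrapper models without defining them. They are
# reconstructed here from how the code below reads `.items` and `.item`,
# so treat them as an assumption rather than the author's exact code.
class NumberedAlpacaItem(BaseModel):
    number: int
    item: AlpacaItem


class AlpacaItemResponse(BaseModel):
    items: List[NumberedAlpacaItem]


def generate_alpaca_items(content: str, n_items: int, start_num: int = 1, examples: Optional[List[AlpacaItem]] = None) -> List[AlpacaItem]: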
    system_msg = """
You are an AI assistant generating detailed, accurate responses based on the provided content.
You will be given a reference content, and you must generate a specific number of AlpacaItems.
These are instruction-input-response triplets, where the input is the context or examples.

Add a number to the items to keep track of the order. Generate exactly that many.

For each instruction, imagine but do not include a real world scenario and real user in that scenario to inform realistic and varied instructions. Avoid common sense questions and answers.

Include multiple lines in the output as appropriate to provide sufficient detail. Cite the most relevant context verbatim in output fields, do not omit anything important.

Leave the input field blank.

Ensure all of the most significant parts of the context are covered.

Start with open ended instructions, then move to more specific ones. Consider the starting number for an impression of what has already been generated.
    """

    examples_str = ""
    if examples:
        examples_str = "\n\nHere are some example items for reference:\n" + \
            "\n".join(ex.model_dump_json() for ex in examples)

    model = ModelFactory.create(
        model_platform=ModelPlatformType.OPENAI,
        model_type=ModelType.GPT_4O_MINI,
        model_config_dict=ChatGPTConfig(
            temperature=0.6, response_format=AlpacaItemResponse
        ).as_dict(),
    )

    agent = ChatAgent(
        system_message=system_msg,
        model=model,
    )

    prompt = f"Content reference:\n{content}{examples_str}\n\n Generate {n_items} AlpacaItems. The first should start numbering at {start_num}."
    response = agent.step(prompt)

    alpaca_items = [n_item.item for n_item in
                    AlpacaItemResponse.
                    model_validate_json(response.msgs[0].content).items]

    return alpaca_items
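The last step of the function parses the model's JSON reply into a list of items. The sketch below mirrors that step with only the standard library so the expected JSON shape is easy to see. The `Item` dataclass and `parse_items` function are illustrative stand-ins rather than CAMEL classes, and the sample reply is made up for the demonstration.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    """Stand-in for an Alpaca-style record (hypothetical, stdlib only)."""
    instruction: str
    input: str
    output: str


def parse_items(raw: str) -> List[Item]:
    """Unwrap {"items": [{"number": n, "item": {...}}, ...]} into a flat list."""
    payload = json.loads(raw)
    return [Item(**entry["item"]) for entry in payload["items"]]


# Made-up reply in the shape the structured response is expected to have.
sample_reply = json.dumps({
    "items": [
        {"number": 1, "item": {"instruction": "Summarize the guide.", "input": "", "output": "It covers..."}},
        {"number": 2, "item": {"instruction": "List the review steps.", "input": "", "output": "First..."}},
    ]
})

items = parse_items(sample_reply)
print(len(items), items[0].instruction)
```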

6. Generating and Saving the Data

We use Firecrawl to scrape content from the specified URL and generate training data from it:

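import json
import random


# `save_json` and `examples` are used below but not defined in the article.
# These are minimal stand-ins, not necessarily the author's originals.
def save_json(data, filename):
    with open(filename, "w", encoding="utf-8") as f:
        json.dump([entry.model_dump() for entry in data], f, indent=2, ensure_ascii=False)


# Few-shot examples (a list of AlpacaItem). Empty here; supply your own to steer the output style.
examples = []
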
firecrawl = Firecrawl()
response = firecrawl.scrape(
    url="https://github.com/camel-ai/camel/blob/master/CONTRIBUTING.md"
)

alpaca_entries = []
for start in range(1, 301, 50):
    current_examples = examples + (random.sample(alpaca_entries,
                                                 min(5, len(alpaca_entries)))
                                                  if alpaca_entries else [])

    batch = generate_alpaca_items(
        content=response["markdown"],
        n_items=50,
        start_num=start,
        examples=current_examples
    )
    print(f"Generated {len(batch)} items")
    alpaca_entries.extend(batch)

save_json(alpaca_entries, 'alpaca_format_data.json')
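The loop above requests 300 items in six batches of 50 and feeds up to five previously generated items back in as extra few-shot examples. A small standard-library sketch of that schedule follows; `batch_starts` and `pick_examples` are hypothetical names, and plain strings stand in for the Alpaca items.

```python
import random
from typing import List, Sequence


def batch_starts(total: int, batch_size: int) -> List[int]:
    """1-based starting number of each batch, as in range(1, total + 1, batch_size)."""
    return list(range(1, total + 1, batch_size))


def pick_examples(seed: Sequence[str], generated: Sequence[str], k: int = 5, rng=random) -> List[str]:
    """Seed examples plus up to k randomly chosen items generated so far."""
    extra = rng.sample(list(generated), min(k, len(generated))) if generated else []
    return list(seed) + extra


print(batch_starts(300, 50))  # [1, 51, 101, 151, 201, 251]
print(pick_examples(["seed item"], ["a", "b", "c", "d", "e", "f", "g"]))
```

Sampling from earlier batches shows the model what has already been produced, which, together with the running start number, is meant to reduce repeated instructions across batches.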

7. Model Training

We now fine-tune the Qwen 7B model on the generated Alpaca-format data using TRL's SFTTrainer.
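The trainer reads a `dataset` with a `text` field, which the article does not construct. One way to build it is sketched here under two assumptions: the saved JSON file is a list of objects with `instruction`, `input`, and `output` keys, and the prompt layout in `TEMPLATE` matches what `AlpacaItem.to_string()` produces in your CAMEL version. Both function names are hypothetical.

```python
import json
from typing import Dict, List

# Assumed Alpaca-style layout; confirm it matches AlpacaItem.to_string() before training.
TEMPLATE = "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n{output}"


def to_text_records(items: List[Dict[str, str]], eos_token: str) -> List[Dict[str, str]]:
    """Render each item as one training string that ends with the EOS token."""
    return [{"text": TEMPLATE.format(**item) + eos_token} for item in items]


def load_records(path: str, eos_token: str) -> List[Dict[str, str]]:
    """Read the saved JSON list and convert it to records with a `text` field."""
    with open(path, encoding="utf-8") as f:
        return to_text_records(json.load(f), eos_token)


demo = to_text_records(
    [{"instruction": "Say hi.", "input": "", "output": "Hello!"}], eos_token="[EOS]"
)
print(demo[0]["text"])
```

With the Hugging Face `datasets` library, `Dataset.from_list(load_records("alpaca_format_data.json", tokenizer.eos_token))` would then give a dataset with the `text` column. Appending the EOS token matters because without it the fine-tuned model has no signal for where a response should stop.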

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

model = FastLanguageModel.for_training(model)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 30,
        learning_rate = 0.001,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",
    ),
)

trainer_stats = trainer.train()
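Two of the settings above interact in ways that are easy to overlook: the effective batch size is the per-device batch size times the gradient accumulation steps, and `warmup_steps = 5` with `lr_scheduler_type = "linear"` ramps the learning rate up before decaying it to zero. The sketch below shows the shape of such a schedule, assuming the usual linear-warmup-then-linear-decay rule. The function name and the total step count are hypothetical; the real total depends on the dataset size.

```python
def linear_lr(step: int, total_steps: int, base_lr: float = 0.001, warmup_steps: int = 5) -> float:
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# per_device_train_batch_size x gradient_accumulation_steps on a single GPU
effective_batch = 2 * 4

total = 1000  # hypothetical number of optimizer steps, for illustration only
for step in (0, 5, total // 2, total):
    print(step, linear_lr(step, total))
```

A learning rate of 0.001 over 30 epochs is aggressive for a small dataset, so it is worth watching the logged loss for signs of overfitting and lowering either value if the model starts to memorize the training text.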

8. Model Inference

Once training is complete, we can run inference with the fine-tuned model:

FastLanguageModel.for_inference(model)
inputs = tokenizer(
[
    AlpacaItem(
        instruction="Explain how can I stay up to date with the CAMEL community.",
        input="",
        output="",
    ).to_string()
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 512, use_cache = True)
tokenizer.batch_decode(outputs)
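`batch_decode` returns the full sequence, prompt included. To keep only the generated answer, the decoded string can be split at the response marker. This is a minimal sketch: `extract_response` is a hypothetical helper, and the `### Response:` marker is an assumption about the prompt layout that should be matched to the template actually used.

```python
RESPONSE_MARKER = "### Response:"  # assumed marker; match it to your prompt template


def extract_response(decoded: str, eos_token: str = "") -> str:
    """Return the text after the response marker, cut at the EOS token if given."""
    _, found, tail = decoded.partition(RESPONSE_MARKER)
    if not found:
        # Marker missing: fall back to the whole string.
        return decoded.strip()
    if eos_token:
        tail = tail.split(eos_token, 1)[0]
    return tail.strip()


sample = "### Instruction:\nSay hi.\n\n### Input:\n\n\n### Response:\nHello![EOS]"
print(extract_response(sample, eos_token="[EOS]"))
```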

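9. Summary

In this project, we used CAMEL to generate training data and Unsloth to fine-tune the Qwen 7B model. The fine-tuned model is better able to understand and generate content related to the CAMEL community. We hope this article helps you understand how to use CAMEL and Unsloth for data generation and model fine-tuning.

If you have any questions or want to learn more about CAMEL, visit the CAMEL repository. If you found this helpful, please star our open-source project. Thank you!

CAMEL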
https://github.com/camel-ai/camel
