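High-Flyer Open-Sources Its Second-Generation MoE Model DeepSeek-V2: Best Practices for Inference and Fine-Tuning on the ModelScope Community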

2024-05-08

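Summary: On May 6, four months after releasing the first Chinese-developed MoE model in January, High-Flyer released its second-generation MoE model, DeepSeek-V2, and open-sourced both the technical report and the model weights. The model can be downloaded and tried out on the ModelScope community.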

Introduction

Technical report:

https://github.com/deepseek-ai/DeepSeek-V2/blob/main/deepseek-v2-tech-report.pdf

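DeepSeek-V2 departs from the two designs that dominate the industry, the LLaMA-style dense architecture and the Mistral-style sparse architecture, and instead reworks the model architecture throughout. It introduces MLA (Multi-head Latent Attention), an attention mechanism that performs on par with MHA (Multi-Head Attention) while significantly reducing computation and inference-time memory usage. In addition, the in-house sparse architecture DeepSeekMoE greatly reduces computation. Combining the two yields a substantial improvement in model performance. (See the technical report and the open-source code for details.)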

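According to the official release, DeepSeek-V2 has 236B total parameters with 21B activated per token, and roughly matches the capability of 70B to 110B dense models. Its KV cache memory footprint is only 1/5 to 1/100 of that of dense models in the same class, which substantially lowers the cost per token. In an actual deployment on a machine with 8 H800 GPUs, input throughput exceeds 100,000 tokens per second and output throughput exceeds 50,000 tokens per second.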
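
To make these figures more concrete, here is a minimal back-of-the-envelope sketch. The 236B and 21B parameter counts come from the text above. The attention dimensions (`n_heads`, `head_dim`, `latent_dim`, `rope_dim`) are illustrative assumptions chosen for the example, not values stated in this article, and the function names are hypothetical. The sketch compares how many values per token per layer a standard MHA cache stores against a cache that keeps only a compressed latent vector plus a small positional key, which is the general idea behind MLA.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Fraction of parameters used per token in a sparse MoE model."""
    return active_params_b / total_params_b


def mha_cache_per_token(n_heads: int, head_dim: int) -> int:
    """Standard MHA caches one key and one value vector per head."""
    return 2 * n_heads * head_dim


def latent_cache_per_token(latent_dim: int, rope_dim: int) -> int:
    """A latent-compressed cache stores one shared latent vector plus a
    small positional key, instead of full per-head keys and values."""
    return latent_dim + rope_dim


if __name__ == "__main__":
    # Parameter counts from the article (in billions).
    print(f"active fraction: {active_fraction(236, 21):.1%}")

    # Illustrative dimensions (assumptions, not taken from the article).
    mha = mha_cache_per_token(n_heads=128, head_dim=128)
    latent = latent_cache_per_token(latent_dim=512, rope_dim=64)
    print(f"MHA cache values per token per layer:    {mha}")
    print(f"latent cache values per token per layer: {latent}")
    print(f"ratio: about 1/{mha / latent:.0f}")
```

With these assumed dimensions the ratio falls inside the 1/5 to 1/100 range quoted above. The real saving depends on which dense model is used as the baseline, for example whether it already uses grouped-query attention.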

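In terms of performance, DeepSeek-V2 does well across the mainstream LLM leaderboards:

- Chinese general capability (AlignBench): the strongest among open-source models, and in the same tier as closed-source models such as GPT-4-Turbo and ERNIE 4.0 in the evaluation.
- English general capability (MT-Bench): in the top tier alongside LLaMA3-70B, the strongest open-source model, and ahead of Mixtral 8x22B, the strongest open-source MoE model.
- Results on knowledge, math, reasoning, and coding leaderboards are also near the top.
- Supports a 128K context window.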

Compared with DeepSeek 67B, DeepSeek-V2 cuts training cost by 42.5%, reduces the inference KV cache by 93.3%, and raises maximum throughput to 576% of the earlier model's.
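
These three percentages are easier to compare once they are converted into ratios. The short calculation below uses only the numbers quoted in the sentence above; the helper names are ours and purely illustrative.

```python
def remaining_fraction(saving_pct: float) -> float:
    """Share of the original cost that is left after a saving of saving_pct percent."""
    return 1.0 - saving_pct / 100.0


def reduction_factor(saving_pct: float) -> float:
    """How many times smaller the new cost is than the original."""
    return 1.0 / remaining_fraction(saving_pct)


if __name__ == "__main__":
    print(f"training cost left: {remaining_fraction(42.5):.1%} of DeepSeek 67B")
    print(f"KV cache left:      {remaining_fraction(93.3):.1%} "
          f"(about {reduction_factor(93.3):.1f}x smaller)")
    print(f"max throughput:     {576 / 100:.2f}x DeepSeek 67B")
```

In other words, a 93.3% reduction means the KV cache is roughly one fifteenth of its former size, and "576%" means a little under six times the throughput.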

Model links and download

The DeepSeek-V2 model series is now open-sourced on the ModelScope community, including:

DeepSeek-V2-Chat:

https://modelscope.cn/models/deepseek-ai/DeepSeek-V2-Chat

DeepSeek-V2:

https://modelscope.cn/models/deepseek-ai/DeepSeek-V2

The community supports downloading the model repo directly:

# Model download
from modelscope import snapshot_download
model_dir = snapshot_download('deepseek-ai/DeepSeek-V2-Chat')

Model inference

Inference code:

import torch
from modelscope import AutoTokenizer, AutoModelForCausalLM, GenerationConfig, AutoConfig

model_name = "deepseek-ai/DeepSeek-V2-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# `max_memory` should be set based on your devices
max_memory = {i: "75GB" for i in range(8)}
# Load the config first so the attention implementation can be set to eager
model_config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
model_config._attn_implementation='eager'
# Load the bf16 weights and let device_map="auto" spread them across the GPUs
model = AutoModelForCausalLM.from_pretrained(model_name, config=model_config, trust_remote_code=True, device_map="auto", torch_dtype=torch.bfloat16, max_memory=max_memory)
# Use the model's default generation settings, padding with the EOS token
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id
messages = [
    {"role": "user", "content": "Write a piece of quicksort code in C++"}
]
# Apply the chat template, then generate up to 100 new tokens
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=100)
# Decode only the newly generated tokens, skipping the prompt
result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)
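
The `max_memory` setting in the code above caps each of the 8 GPUs at 75GB. A rough estimate shows why that much headroom is needed. The sketch below assumes 2 bytes per parameter for bf16 and counts weights only (activations, the KV cache, and framework overhead are ignored). The helper names are hypothetical and not part of ModelScope or Transformers.

```python
def weight_memory_gb(num_params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes); bf16 uses 2 bytes."""
    return num_params_billion * bytes_per_param


def build_max_memory(num_gpus: int, per_gpu_gb: int) -> dict:
    """Build a dict in the shape passed as `max_memory`, one entry per GPU index."""
    return {i: f"{per_gpu_gb}GB" for i in range(num_gpus)}


if __name__ == "__main__":
    num_gpus = 8
    total_gb = weight_memory_gb(236)      # 236B total parameters, from the article
    per_gpu_gb = total_gb / num_gpus      # even split, weights only
    print(f"bf16 weights: about {total_gb:.0f} GB in total, "
          f"{per_gpu_gb:.0f} GB per GPU on {num_gpus} GPUs")
    print(build_max_memory(num_gpus, 75))
```

Because the weights alone take most of an 80GB card under this estimate, the per-GPU cap has to sit well above the even split while still leaving a few GB free for the CUDA context. Adjust the value for your own devices.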

Sample inference output:

Sure, here is a simple implementation of the quicksort algorithm in C++:
```cpp
#include <iostream>
#include <vector>
using namespace std;
int partition(vector<int>& arr, int low, int high) {
    int pivot = arr[high];
    int i = (low - 1);
    for (int j = low; j <= high - 1; j++) {
        if (arr[j] < pivot) {
            i++;
            swap(arr[i], arr[j]);
        }
    }
    swap(arr[i + 1], arr[high]);
    return (i + 1);
}
void quickSort(vector<int>& arr, int low, int high) {
    if (low < high) {
        int pi = partition(arr, low, high);
        quickSort(arr, low, pi - 1);
        quickSort(arr, pi + 1, high);
    }
}
int main() {
    vector<int> arr = {10, 7, 8, 9, 1, 5};
    int n = arr.size();
    quickSort(arr, 0, n - 1);
    cout << "Sorted array: \n";
    for(int i = 0; i < n; i++) {
        cout << arr[i] << " ";
    }
    return 0;
}
```
This code sorts an array of integers using the quicksort algorithm. The `quickSort` function is the main function that recursively sorts the array. The `partition` function is used to partition the array around a pivot element and return the index of the pivot element. The elements smaller than the pivot are moved to its left and the elements larger are moved to its right.

Inference resource usage:

[Figure: resource usage during inference]

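Model fine-tuning and inference after fine-tuning

We use swift to fine-tune the model. swift is the LLM fine-tuning and inference framework officially provided by the ModelScope community.

Fine-tuning code repository: https://github.com/modelscope/swift

We fine-tune on the self-cognition dataset, whose task is to change the model's self-identity, that is, how it describes who it is and who built it.

Environment setup: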

git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[llm]'

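Fine-tuning script (LoRA):

By default, LoRA fine-tuning is applied only to the qkv projections of the LLM. To fine-tune all linear layers of the LLM instead, specify `--lora_target_modules ALL`.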
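
To give a feel for how few parameters LoRA actually trains, here is a minimal sketch of the standard LoRA bookkeeping. It uses the rank 8 and alpha 32 set in the script below. The 4096 x 4096 layer shape is a hypothetical example, not DeepSeek-V2's real projection size, and the function names are ours.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) next to a frozen weight."""
    return rank * (d_in + d_out)


def lora_scaling(alpha: float, rank: int) -> float:
    """In standard LoRA the low-rank update B @ A is scaled by alpha / rank."""
    return alpha / rank


if __name__ == "__main__":
    d_in, d_out = 4096, 4096   # hypothetical layer shape (assumption)
    rank, alpha = 8, 32        # --lora_rank 8, --lora_alpha 32

    full = d_in * d_out
    lora = lora_trainable_params(d_in, d_out, rank)
    print(f"frozen weight params: {full}")
    print(f"LoRA trainable params: {lora} ({lora / full:.2%} of the layer)")
    print(f"update scaling alpha / rank: {lora_scaling(alpha, rank)}")
```

Targeting all linear layers with `ALL` multiplies the number of adapted layers, so it raises both trainable parameter count and memory use compared with the qkv-only default.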

# Experimental environment: 8*A100
# 8*80GB GPU memory
nproc_per_node=8
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=$nproc_per_node \
swift sft \
    --model_type deepseek-v2-chat \
    --sft_type lora \
    --tuner_backend peft \
    --dtype bf16 \
    --output_dir output \
    --ddp_backend nccl \
    --self_cognition_sample 2000 \
    --model_name 小白 'Xiao Bai' \
    --model_author 魔搭 'Modelscope' \
    --train_dataset_sample -1 \
    --num_train_epochs 1 \
    --max_length 512 \
    --check_dataset_strategy warning \
    --lora_rank 8 \
    --lora_alpha 32 \
    --lora_dropout_p 0.05 \
    --lora_dtype AUTO \
    --lora_target_modules DEFAULT \
    --gradient_checkpointing false \
    --batch_size 2 \
    --weight_decay 0.1 \
    --learning_rate 1e-4 \
    --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \
    --max_grad_norm 0.5 \
    --warmup_ratio 0.03 \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 10 \
    --logging_steps 10 \
    --deepspeed default-zero3
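
The `--gradient_accumulation_steps $(expr 16 / $nproc_per_node)` line keeps the effective batch size constant as the number of GPUs changes. The sketch below reproduces that arithmetic. It assumes `--batch_size` is the per-device batch size, as is usual for data-parallel training; the function names are hypothetical.

```python
def grad_accum_steps(target: int, nproc_per_node: int) -> int:
    """Mirror `$(expr 16 / $nproc_per_node)`: shell expr does integer division."""
    return target // nproc_per_node


def effective_batch_size(per_device_batch: int, nproc: int, accum: int) -> int:
    """Samples contributing to one optimizer step across all processes."""
    return per_device_batch * nproc * accum


if __name__ == "__main__":
    nproc = 8                               # nproc_per_node=8
    accum = grad_accum_steps(16, nproc)     # value passed to --gradient_accumulation_steps
    total = effective_batch_size(2, nproc, accum)   # --batch_size 2
    print(f"gradient accumulation steps: {accum}")
    print(f"effective batch size per optimizer step: {total}")
```

With fewer GPUs the accumulation steps grow to compensate, so under this assumption 4 GPUs would use 4 accumulation steps and still reach the same effective batch size.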

Inference script after fine-tuning (change `ckpt_dir` to the checkpoint folder produced by training):

# Experimental environment: A10, 3090, V100
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift infer \
    --ckpt_dir output/deepseek-v2-chat/vx-xxx/checkpoint-xxx \
    --load_dataset_config true \
    --eval_human true \
    --max_length 512
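
Since `--ckpt_dir` has to point at a real checkpoint folder, a small helper can locate one instead of copying the path by hand. This is a sketch that assumes the layout shown in the script above (`<output root>/<run folder>/checkpoint-<step>`); `latest_checkpoint` is a hypothetical helper, not part of swift.

```python
import re
from pathlib import Path
from typing import Optional


def latest_checkpoint(output_root: str) -> Optional[Path]:
    """Return the checkpoint-<step> directory with the highest step number
    found under <output_root>/<run>/, or None if there is none."""
    best_step, best_path = -1, None
    for path in Path(output_root).glob("*/checkpoint-*"):
        match = re.fullmatch(r"checkpoint-(\d+)", path.name)
        if path.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


if __name__ == "__main__":
    # Prints None if no training run has produced a checkpoint yet.
    print(latest_checkpoint("output/deepseek-v2-chat"))
```

Note that the helper picks the highest step across all run folders, which is not necessarily the most recent run. If you keep several runs, pass the specific run folder's parent or compare modification times instead.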

Fine-tuning visualization:

[Figure: training accuracy]

[Figure: training loss]

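Sample output after fine-tuning:

<<< Who are you?
I am an artificial intelligence program developed by ModelScope, known as Xiao Bai. My main purpose is to provide help, information, and entertainment to people through text conversation. If you have any questions or need help, feel free to ask.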

Resource usage

[Figure: resource usage during fine-tuning]

Direct link: DeepSeek-V2-Chat · Model Library (modelscope.cn)
