Microsoft recently released more details about Phi-2, the model it announced as open source at Microsoft Ignite 2023. Claims such as "breaks the conventional scaling laws of language models" and "can compete with models 25x its size" quickly drew attention to Phi-2 in the open-source community.
Phi-2 is a 2.7-billion-parameter model. In benchmark evaluations covering commonsense reasoning, language understanding, math, and coding tasks, Phi-2 shows excellent performance compared with popular open-source models of up to 13 billion parameters.
Microsoft summarizes the key insights behind Phi-2 as follows:
- Textbook-quality data: In keeping with the Phi series' motto "Textbooks Are All You Need", Phi-2's training mixture includes synthetic datasets built specifically to teach the model commonsense reasoning and general knowledge, spanning science, daily activities, theory of mind, and more, further augmented with web data filtered for educational value and content quality.
- Knowledge transfer: Starting from the 1.3B-parameter Phi-1.5, its knowledge was embedded into the 2.7B-parameter Phi-2. This scaled knowledge transfer not only accelerated training convergence but also clearly improved Phi-2's benchmark scores.
Notably, Phi-2 was neither aligned with reinforcement learning from human feedback (RLHF) nor instruction fine-tuned. Even so, thanks to its high-quality training data, Phi-2 behaves better with respect to toxicity and bias than existing aligned open-source models.
On the application side, Microsoft stresses that Phi-2 is currently intended for research purposes only: a tool for AI developers and researchers to explore interpretability, safety improvements, and fine-tuning experiments on a variety of tasks. Text or code generated by the model should be treated as a starting point for potential use cases rather than a final solution. Phi-2 has not been tested for production tasks, and performance in production-grade applications is not guaranteed.
What follows is a best-practices tutorial for Phi-2 inference and fine-tuning on the ModelScope community. We hope it is helpful to anyone interested in the model.
Environment Setup and Installation
- Python 3.8 or later
- PyTorch 1.12 or later (2.0 or later recommended)
- CUDA 11.4 or later recommended
To run inference with ModelScope:
```shell
pip install modelscope transformers -U
```
To use SWIFT for streaming output, inference acceleration, and fine-tuning:
```shell
# Install ms-swift
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e .[llm]

# If you want inference acceleration:
# vllm releases are tied to specific CUDA versions; choose a matching version
# following https://docs.vllm.ai/en/latest/getting_started/installation.html
pip install vllm -U
```
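As a quick sanity check that the environment meets the requirements above, a minimal sketch like the following can be run in Python (the expected values are the requirements listed earlier, not output captured for this article):
```python
# Minimal environment check; compare the printed values against the
# requirements listed above.
import sys
import torch

print(sys.version)                # expect Python 3.8+
print(torch.__version__)          # expect PyTorch 1.12+ (2.0+ recommended)
print(torch.version.cuda)         # CUDA version this PyTorch build targets
print(torch.cuda.is_available())  # expect True on a CUDA 11.4+ machine
```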
The Phi-2 inference demonstrated in this article runs in the ModelScope Notebook environment (PAI-DSW is used as the example here).
Model Link and Download
The Phi-2 model is now available to download and try on the ModelScope community:
Phi-2 link:
https://modelscope.cn/models/AI-ModelScope/phi-2/summary
```python
from modelscope import snapshot_download

model_dir = snapshot_download("AI-ModelScope/phi-2", revision="master")
```
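`snapshot_download` returns the local directory of the downloaded snapshot. As a side note (an assumption based on standard `from_pretrained` behavior, not a step required below), that path can be passed wherever the model id is accepted:
```python
# Hypothetical alternative: load from the local snapshot directory instead of
# the "AI-ModelScope/phi-2" model id used in the inference code below.
from modelscope import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
```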
Model Inference
Using ModelScope
Inference code:
```python
import torch
from modelscope import AutoModelForCausalLM, AutoTokenizer

torch.set_default_device("cuda")

model = AutoModelForCausalLM.from_pretrained("AI-ModelScope/phi-2", torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("AI-ModelScope/phi-2", trust_remote_code=True)

inputs = tokenizer('''def print_prime(n):
   """
   Print all primes between 1 and n
   """''', return_tensors="pt", return_attention_mask=False)

outputs = model.generate(**inputs, max_length=200)
text = tokenizer.batch_decode(outputs)[0]
print(text)
```
Resource usage: 11 GB of GPU memory
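The call above decodes greedily. If sampled outputs are preferred, the standard `transformers` generation arguments apply; the values below are illustrative assumptions, not settings recommended by Microsoft:
```python
# Sketch: sampled decoding instead of greedy decoding.
# temperature/top_p values are illustrative assumptions.
outputs = model.generate(
    **inputs,
    max_length=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,  # avoids a pad-token warning
)
print(tokenizer.batch_decode(outputs)[0])
```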
Streaming Output with SWIFT
Inference code:
````python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    get_model_tokenizer, get_template, inference_stream, ModelType,
    get_default_template_type,
)
from swift.utils import seed_everything

model_type = ModelType.phi2_3b
template_type = get_default_template_type(model_type)

model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'})
# modify max_new_tokens
model.generation_config.max_new_tokens = 512

template = get_template(template_type, tokenizer)
seed_everything(42)

query = """\
# Print all primes between 1 and n
```python
"""
gen = inference_stream(model, template, query, stop_words=['```\n'])
print_idx = 0
print(query, end='')
for response, history in gen:
    print(response[print_idx:], end='')
    print_idx = len(response)
print()
````
Resource usage: 7 GB of GPU memory
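`inference_stream` also yields the running `history`, so a follow-up turn can reuse it. The sketch below follows the multi-turn pattern from SWIFT's own examples (verify against your installed SWIFT version); the follow-up query is invented for illustration:
```python
# Sketch of a second turn that reuses `history` from the loop above.
query = 'Explain the time complexity of the function above.'
gen = inference_stream(model, template, query, history)
print_idx = 0
for response, history in gen:
    print(response[print_idx:], end='')
    print_idx = len(response)
print()
```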
Inference Acceleration with vLLM
SWIFT integrates both Phi-2 and vLLM, so we can use SWIFT to accelerate model inference:
Documentation:
https://github.com/modelscope/swift/blob/main/docs/source/LLM/VLLM推理加速与部署.md
````python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    ModelType, get_vllm_engine, get_default_template_type,
    get_template, inference_vllm
)

model_type = ModelType.phi2_3b
llm_engine = get_vllm_engine(model_type)
template_type = get_default_template_type(model_type)
template = get_template(template_type, llm_engine.tokenizer)
# an interface similar to `transformers.GenerationConfig`
llm_engine.generation_config.max_new_tokens = 512

query1 = """\
# Print all primes between 1 and n
```python
"""
query2 = """\
# quick_sort
```python
"""
request_list = [{'query': query1}, {'query': query2}]
llm_engine.generation_config.stop = ['```\n']
resp_list = inference_vllm(llm_engine, template, request_list)
for request, resp in zip(request_list, resp_list):
    print(f"{request['query']}{resp['response']}")
"""Out[0]
# Print all primes between 1 and n
```python
def is_prime(n):
    # Assume n is a positive integer
    # Check if n is 1 or less, which are not prime
    if n <= 1:
        return False
    # Check if n is divisible by any integer from 2 to its square root
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    # If no divisor is found, n is prime
    return True

for n in range(1, 101):
    if is_prime(n):
        print(n)
```

# quick_sort
```python
import random

def quick_sort(arr) -> None:
    if len(arr) <= 1:
        return
    pivot = arr[random.randint(0, len(arr)-1)]
    less = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    greater = [x for x in arr if x > pivot]
    quick_sort(less)
    quick_sort(greater)
    arr[:] = less + middle + greater

a = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
quick_sort(a)
print("Sorted array is:", a)
```
"""
````
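For reference, besides the `max_new_tokens` and `stop` fields used above, the engine's `generation_config` should expose the usual sampling fields. The snippet below is an assumption based on vLLM's `SamplingParams`; check the field names against your installed SWIFT/vLLM versions:
```python
# Illustrative only: adjust sampling before calling inference_vllm.
llm_engine.generation_config.temperature = 0.2  # assumed field, mirrors vLLM SamplingParams
llm_engine.generation_config.top_p = 0.9        # assumed field
```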
Fine-tuning Phi-2
Phi-2 can be fine-tuned with SWIFT. The LoRA fine-tuning scripts are available at:
https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/phi2_3b
Fine-tuning script:
```shell
# Experimental environment: A100
# 60GB GPU memory
CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model_type phi2-3b \
    --sft_type lora \
    --template_type default \
    --train_dataset_sample 20000 \
    --eval_steps 100 \
    --output_dir output \
    --num_train_epochs 1 \
    --max_length 2048 \
    --learning_rate 1e-4 \
    --use_flash_attn true \
    --only_save_model true \
    --lora_target_modules ALL \
    --dataset codefuse-python-en \
    --gradient_checkpointing false \
```
Training also supports local datasets, which are specified with the following arguments:
```shell
--custom_train_dataset_path xxx.jsonl \
--custom_val_dataset_path yyy.jsonl \
```
The format of custom datasets is documented at:
https://github.com/modelscope/swift/blob/main/docs/source/LLM/自定义与拓展.md#注册数据集的方式
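The linked document is the authoritative reference for the accepted formats. As one illustration, a minimal query/response JSONL file could be produced like this (the sample content and the `xxx.jsonl` filename are placeholders):
```python
# Sketch: write a minimal custom dataset in query/response JSONL form.
import json

samples = [
    {'query': 'Write a Python function that reverses a string.',
     'response': 'def reverse_string(s):\n    return s[::-1]'},
]
with open('xxx.jsonl', 'w', encoding='utf-8') as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + '\n')
```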
Inference script after fine-tuning:
Here, ckpt_dir must be changed to the checkpoint folder produced by training:
```shell
# Experimental environment: A10
# 8GB GPU memory
CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --ckpt_dir "output/phi2-3b/vx_xxx/checkpoint-xxx" \
    --load_dataset_config true \
    --max_length 2048 \
    --use_flash_attn false \
    --max_new_tokens 2048 \
    --temperature 0.1 \
    --top_p 0.7 \
    --repetition_penalty 1.05 \
    --merge_lora_and_save false \
```
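Judging by the flag name, setting `--merge_lora_and_save true` should merge the LoRA weights into the base model when saving; it is left `false` here, and the SWIFT documentation linked above is the authoritative reference for this option.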
Sample Generation After Fine-tuning
````
<<< quick sort, python
Sure! Here's an example of a quick sort implementation in Python with a detailed docstring explaining the code methods:

```python
def quick_sort(arr):
    """
    Sorts a given array using the Quick Sort algorithm.

    Args:
        arr (list): The input array to be sorted.

    Returns:
        list: The sorted array.

    Algorithm:
    1. Choose a pivot element from the array.
    2. Partition the array into two sub-arrays: one with elements smaller than the pivot and another with elements larger than the pivot.
    3. Recursively apply quick sort to the sub-arrays until they are fully sorted.
    4. Combine the sorted sub-arrays with the pivot element to obtain the final sorted array.

    Time Complexity:
    - Best Case: O(n log n)
    - Average Case: O(n log n)
    - Worst Case: O(n^2)

    Space Complexity:
    O(log n) for the recursive calls.
    """
    if len(arr) <= 1:
        return arr  # Base case: if the array has 0 or 1 element, it is already sorted

    pivot = arr[0]  # Choose the first element as the pivot
    left = []   # List to store elements smaller than the pivot
    right = []  # List to store elements larger than the pivot

    for i in range(1, len(arr)):
        if arr[i] < pivot:
            left.append(arr[i])
        else:
            right.append(arr[i])

    # Recursively sort the left and right sub-arrays
    sorted_left = quick_sort(left)
    sorted_right = quick_sort(right)

    # Combine the sorted sub-arrays with the pivot element
    return sorted_left + [pivot] + sorted_right
```

In this implementation, we use the partitioning technique to divide the array into two sub-arrays based on the pivot element. We then recursively apply quick sort to each sub-array until they are fully sorted. Finally, we combine the sorted sub-arrays with the pivot element to obtain the final sorted array.

The time complexity of quick sort is generally O(n log n), but in the worst case scenario where the pivot is always the smallest or largest element, it can become O(n^2). However, this is rare in practice. The space complexity of quick sort is O(log n) due to the recursive calls.
````
Running the generated code:
```python
print(quick_sort([10, 7, 8, 4, 3, 6]))  # [3, 4, 6, 7, 8, 10]
```
Click through to the Phi-2 model card: https://modelscope.cn/models/AI-ModelScope/phi-2/summary