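Introduction
CodeFuse-CodeLlama-34B was released on 2023-09-11 and reached 74.4% pass@1 on HumanEval (greedy decoding), the open-source SOTA at the time. A 4-bit quantized version, CodeFuse-CodeLlama-34B-4bits, has now been released. The original CodeFuse-CodeLlama-34B is a code LLM obtained by fine-tuning the base model CodeLlama-34b-Python with QLoRA on multiple code tasks; its input length is 4K.
After 4-bit quantization, CodeFuse-CodeLlama-34B-4bits can be loaded on a single A10 (24GB of GPU memory) or RTX 4090 (24GB of GPU memory), and the quantized model still reaches 73.8% pass@1 on HumanEval.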
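A back-of-the-envelope estimate helps explain why the quantized model fits on a 24GB card. The sketch below counts weight storage only and assumes a nominal 34e9 parameters (taken from the "34B" in the model name, not from a measured count). It ignores quantization scales and zero points, activations, and the KV cache, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GiB, ignoring all runtime overhead."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


NOMINAL_PARAMS = 34e9  # assumption: nominal size implied by the "34B" name

for label, bits in [("bfloat16", 16), ("int4", 4)]:
    print(f"{label}: ~{weight_memory_gib(NOMINAL_PARAMS, bits):.1f} GiB of weights")
```

Under this assumption the 4-bit weights need roughly a quarter of the bfloat16 storage, which is what brings the model within reach of a single 24GB GPU.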
Evaluation results (code):
| Model | HumanEval (pass@1) | Date |
| --- | --- | --- |
| CodeFuse-CodeLlama-34B | 74.4% | 2023.9 |
| CodeFuse-CodeLlama-34B-4bits | 73.8% | 2023.9 |
| WizardCoder-Python-34B-V1.0 | 73.2% | 2023.8 |
| GPT-4 (zero-shot) | 67.0% | 2023.3 |
| PanGu-Coder2 15B | 61.6% | 2023.8 |
| CodeLlama-34b-Python | 53.7% | 2023.8 |
| CodeLlama-34b | 48.8% | 2023.8 |
| GPT-3.5 (zero-shot) | 48.1% | 2022.11 |
| OctoCoder | 46.2% | 2023.8 |
| StarCoder-15B | 33.6% | 2023.5 |
| LLaMA 2 70B (zero-shot) | 29.9% | 2023.7 |
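To put the 0.6-point gap between the full-precision and 4-bit models in perspective, the percentages can be converted back to problem counts. HumanEval contains 164 problems, and with a single completion per problem pass@1 is simply the fraction solved. The sketch below assumes both scores were computed that way on the full benchmark; the source states greedy decoding only for the 74.4% result.

```python
HUMANEVAL_PROBLEMS = 164  # size of the HumanEval benchmark


def solved_count(pass_at_1_percent: float, total: int = HUMANEVAL_PROBLEMS) -> int:
    """Nearest whole number of solved problems for a reported pass@1 percentage."""
    return round(pass_at_1_percent / 100 * total)


full = solved_count(74.4)
quantized = solved_count(73.8)
print(full, quantized, full - quantized)
```

Under these assumptions the two scores correspond to 122 and 121 solved problems, so quantization costs a single problem on this benchmark.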
Environment Setup and Installation
- Python 3.8 or later
- PyTorch 1.12 or later; 2.0 or later is recommended
- CUDA 11.4 or later is recommended (relevant for GPU users)
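The requirements above can be checked from Python before downloading anything. This is a minimal sketch: `parse_version` and `check_environment` are hypothetical helper names, and the check skips PyTorch gracefully when it is not installed.

```python
import importlib.util
import re
import sys


def parse_version(text: str) -> tuple:
    """Turn '2.0.1+cu117' into (2, 0, 1); non-numeric suffixes are ignored."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups() if part is not None)


def check_environment() -> dict:
    """Compare the running interpreter and PyTorch install with the list above."""
    report = {
        "python_ok": sys.version_info >= (3, 8),
        "torch_ok": None,       # stays None when PyTorch is not installed
        "cuda_version": None,   # stays None on CPU-only builds
    }
    if importlib.util.find_spec("torch") is not None:
        import torch
        report["torch_ok"] = parse_version(torch.__version__) >= (1, 12)
        report["cuda_version"] = torch.version.cuda
    return report


print(check_environment())
```

The CUDA version is reported rather than enforced, since the list only recommends 11.4 or later.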
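Usage Steps
This article was run on PAI-DSW (a single GPU is sufficient).
Model Link and Download
The quantized CodeFuse model is now open source in the ModelScope community:
CodeFuse-CodeLlama-34B 4bits: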
https://modelscope.cn/models/codefuse-ai/CodeFuse-CodeLlama-34B-4bits/summary
The community supports downloading the model repo directly:
from modelscope.hub.snapshot_download import snapshot_download

# Download the model repo (pinned to revision v1.0.0) and return its local path
model_dir = snapshot_download('codefuse-ai/CodeFuse-CodeLlama-34B-4bits', revision='v1.0.0')
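After the download finishes, it can be useful to confirm how much disk space the checkpoint occupies. The helper below is a minimal sketch using only the standard library; `directory_size_gib` is a hypothetical name, and it assumes `model_dir` is the local path returned by `snapshot_download`.

```python
import os


def directory_size_gib(path: str) -> float:
    """Sum the sizes of all regular files under `path`, in GiB."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.isfile(file_path):
                total_bytes += os.path.getsize(file_path)
    return total_bytes / 2**30


# Usage, assuming `model_dir` is the path returned by snapshot_download:
# print(f"{directory_size_gib(model_dir):.2f} GiB on disk")
```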
Model Inference
Dependencies:
pip install "modelscope>=1.9.1"
pip install auto_gptq
Inference code:
import os
import time

import torch
from modelscope import AutoTokenizer, snapshot_download
from auto_gptq import AutoGPTQForCausalLM

os.environ["TOKENIZERS_PARALLELISM"] = "false"


def load_model_tokenizer(model_path):
    """
    Load the model and tokenizer from the given model name or the local path of a downloaded model.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_path,
                                              trust_remote_code=True,
                                              use_fast=False,
                                              legacy=False)
    # Pad on the left so generated tokens directly follow the prompt
    tokenizer.padding_side = "left"
    tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids("<unk>")
    tokenizer.eos_token_id = tokenizer.convert_tokens_to_ids("</s>")

    model = AutoGPTQForCausalLM.from_quantized(model_path,
                                               inject_fused_attention=False,
                                               inject_fused_mlp=False,
                                               use_cuda_fp16=True,
                                               disable_exllama=False,
                                               device_map='auto'  # Support multi-gpus
                                               )
    return model, tokenizer


def inference(model, tokenizer, prompt):
    """
    Use the given model and tokenizer to generate an answer for the specified prompt.
    """
    st = time.time()
    # The prompt must end with a newline before the bot role marker
    prompt = prompt if prompt.endswith('\n') else f'{prompt}\n'
    inputs = f"<|role_start|>human<|role_end|>{prompt}<|role_start|>bot<|role_end|>"

    input_ids = tokenizer.encode(inputs,
                                 return_tensors="pt",
                                 padding=True,
                                 add_special_tokens=False).to("cuda")
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=input_ids,
            top_p=0.95,
            temperature=0.1,
            do_sample=True,
            max_new_tokens=512,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id
        )
    print(f'generated tokens num is {len(generated_ids[0][input_ids.size(1):])}')
    outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    print(f'generate text is {outputs[0][len(inputs): ]}')
    latency = time.time() - st
    print('latency is {} seconds'.format(latency))


if __name__ == "__main__":
    model_dir = snapshot_download('codefuse-ai/CodeFuse-CodeLlama-34B-4bits', revision='v1.0.0')
    prompt = 'Please write a QuickSort program in Python'
    model, tokenizer = load_model_tokenizer(model_dir)
    inference(model, tokenizer, prompt)
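The inference code wraps the user request in role markers before tokenizing it. The sketch below isolates that single-turn template so it can be inspected without loading the model; `build_prompt` is a hypothetical helper name, and only the single-turn form shown in the inference code is covered.

```python
ROLE_START = "<|role_start|>"
ROLE_END = "<|role_end|>"


def build_prompt(user_message: str) -> str:
    """Wrap one user message in the role markers used by the inference code."""
    # The template expects the user text to end with a newline
    if not user_message.endswith("\n"):
        user_message += "\n"
    return f"{ROLE_START}human{ROLE_END}{user_message}{ROLE_START}bot{ROLE_END}"


print(repr(build_prompt("Please write a QuickSort program in Python")))
```

The string ends with the bot role marker, so the model's generation continues from there as the answer.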
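Resource consumption:
We measured GPU memory usage right after loading the model, and again when processing 2048 input tokens with 1024 output tokens and 1024 input tokens with 2048 output tokens. The results are shown in the table below.
| Precision | Model loaded (idle) | 2048 input tokens + 1024 output tokens | 1024 input tokens + 2048 output tokens |
| --- | --- | --- | --- |
| bfloat16 | 64.89GB | 69.31GB | 66.41GB |
| int4 | 19.09GB | 22.19GB | 20.78GB |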
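The measurements in the table can be summarized as a reduction ratio and as the headroom left on a single 24GB card. This is a small sketch over the reported numbers only; it assumes the table and the card capacity use the same GB unit, and `summarize` is a hypothetical helper name.

```python
# GPU memory in GB, copied from the table above.
MEMORY_GB = {
    "bfloat16": {"idle": 64.89, "in2048_out1024": 69.31, "in1024_out2048": 66.41},
    "int4": {"idle": 19.09, "in2048_out1024": 22.19, "in1024_out2048": 20.78},
}
CARD_GB = 24.0  # capacity of a single A10 or RTX 4090


def summarize(memory: dict, card_gb: float) -> list:
    """Per scenario: bfloat16/int4 ratio and int4 headroom on one card."""
    rows = []
    for scenario, int4_gb in memory["int4"].items():
        ratio = memory["bfloat16"][scenario] / int4_gb
        rows.append((scenario, round(ratio, 2), round(card_gb - int4_gb, 2)))
    return rows


for scenario, ratio, headroom in summarize(MEMORY_GB, CARD_GB):
    print(f"{scenario}: {ratio}x smaller, {headroom} GB headroom on a 24 GB card")
```

With these figures the heaviest measured scenario leaves under 2 GB free on a 24GB card, so longer inputs may not fit without further savings.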
GPU memory usage of the int4 example code: [image]