04 Performance Comparison
At this point there are two Bloom-7b1 checkpoints on our OSS bucket: the bloom-7b folder stores the original HuggingFace checkpoint, and the bloom-7b-ft-fp16 folder stores the converted FasterTransformer checkpoint. We will benchmark both checkpoints to see whether FasterTransformer actually delivers a performance gain.
The comparison uses examples/pytorch/gpt/bloom_lambada.py shipped with FasterTransformer, which we have also integrated into the cloud-native AI suite. Below we submit one benchmark job for each checkpoint.
Command to benchmark the HuggingFace Bloom-7b1 checkpoint:
arena submit pytorchjob \
    --gpus=2 \
    --image ai-studio-registry.cn-beijing.cr.aliyuncs.com/kube-ai/fastertransformer:torch-0.0.1 \
    --name perf-hf-bloom \
    --workers 1 \
    --namespace default-group \
    --data bloom7b1-pvc:/mnt \
    'python /FasterTransformer/examples/pytorch/gpt/bloom_lambada.py \
     --tokenizer-path /mnt/model/bloom-7b1 \
     --dataset-path /mnt/data/lambada/lambada_test.jsonl \
     --batch-size 16 \
     --test-hf \
     --show-progress'
Check the HuggingFace result:
$ arena -n default-group logs -t 5 perf-hf-bloom
Accuracy: 57.5587% (2966/5153) (elapsed time: 173.2149 sec)
Command to benchmark the FasterTransformer Bloom-7b1 checkpoint:
arena submit pytorchjob \
    --gpus=2 \
    --image ai-studio-registry.cn-beijing.cr.aliyuncs.com/kube-ai/fastertransformer:torch-0.0.1 \
    --name perf-ft-bloom \
    --workers 1 \
    --namespace default-group \
    --data bloom7b1-pvc:/mnt \
    'mpirun --allow-run-as-root -n 2 python /FasterTransformer/examples/pytorch/gpt/bloom_lambada.py \
     --lib-path /FasterTransformer/build/lib/libth_transformer.so \
     --checkpoint-path /mnt/model/2-gpu \
     --batch-size 16 \
     --tokenizer-path /mnt/model/bloom-7b1 \
     --dataset-path /mnt/data/lambada/lambada_test.jsonl \
     --show-progress'
Check the FasterTransformer result. The elapsed time drops from roughly 173 seconds to roughly 69 seconds, about a 2.5x speedup.
$ arena -n default-group logs -t 5 perf-ft-bloom
Accuracy: 57.6363% (2970/5153) (elapsed time: 68.7818 sec)
Comparing the two results, FasterTransformer delivers a clear performance improvement over the native HuggingFace implementation while keeping accuracy essentially unchanged (57.64% vs. 57.56%).
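The 2.5x figure comes directly from the two elapsed times reported above; a quick back-of-the-envelope check:

# speedup implied by the two benchmark runs above
hf_elapsed = 173.2149   # HuggingFace run, seconds
ft_elapsed = 68.7818    # FasterTransformer run, seconds
print(f"speedup: {hf_elapsed / ft_elapsed:.2f}x")   # -> speedup: 2.52x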
05 Model Deployment
In this section we deploy the FasterTransformer model with Triton Server. Triton Server does not ship with a FasterTransformer backend, so we use the FasterTransformer backend provided by NVIDIA. With this backend, Triton Server no longer allocates GPU resources itself; instead, the FasterTransformer backend determines the currently visible GPUs from CUDA_VISIBLE_DEVICES and assigns them to the corresponding ranks for distributed inference.
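As a rough illustration of that idea (this is a conceptual sketch, not the backend's actual source code), each tensor-parallel rank ends up bound to one of the GPUs listed in CUDA_VISIBLE_DEVICES:

import os

# Conceptual sketch only: one visible GPU per tensor-parallel rank.
visible_gpus = os.environ.get("CUDA_VISIBLE_DEVICES", "0,1").split(",")
tensor_para_size = len(visible_gpus)            # 2 in our 2-GPU deployment
for rank in range(tensor_para_size):
    print(f"rank {rank} -> GPU {visible_gpus[rank]}")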
The model repository for FasterTransformer is laid out as follows:
├── model_repo
│   └── fastertransformer
│       ├── 1
│       │   └── config.ini
│       └── config.pbtxt
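Here config.ini is produced by the checkpoint conversion step and records the model's hyperparameters, while config.pbtxt tells Triton how to serve the model. As a hedged sketch only (the authoritative template lives in NVIDIA's fastertransformer_backend examples; the values below are assumptions matching our 2-GPU FP16 setup, and the checkpoint path in particular is illustrative), the relevant part of config.pbtxt looks roughly like this:

name: "fastertransformer"
backend: "fastertransformer"
max_batch_size: 1024
parameters {
  key: "tensor_para_size"
  value: { string_value: "2" }     # matches the number of GPUs / converted ranks
}
parameters {
  key: "pipeline_para_size"
  value: { string_value: "1" }
}
parameters {
  key: "data_type"
  value: { string_value: "fp16" }
}
parameters {
  key: "model_checkpoint_path"
  value: { string_value: "/mnt/triton_repo/fastertransformer/1" }   # illustrative path
}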
Start FasterTransformer with the following Arena command:
arena serve triton \
    --namespace=default-group \
    --version=1 \
    --data=bloom7b1-pvc:/mnt \
    --name=ft-triton-bloom \
    --allow-metrics \
    --gpus=2 \
    --replicas=1 \
    --image=ai-studio-registry.cn-beijing.cr.aliyuncs.com/kube-ai/triton_with_ft:22.03-main-2edb257e-transformers \
    --model-repository=/mnt/triton_repo
With kubectl logs we can inspect the Triton Server startup log, which shows that Triton Server is using two GPUs for distributed inference.
I0721 08:57:28.116291 1 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7fd264000000' with size 268435456
I0721 08:57:28.118393 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0721 08:57:28.118403 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 1 with size 67108864
I0721 08:57:28.443529 1 model_lifecycle.cc:459] loading: fastertransformer:1
I0721 08:57:28.625253 1 libfastertransformer.cc:1828] TRITONBACKEND_Initialize: fastertransformer
I0721 08:57:28.625307 1 libfastertransformer.cc:1838] Triton TRITONBACKEND API version: 1.10
I0721 08:57:28.625315 1 libfastertransformer.cc:1844] 'fastertransformer' TRITONBACKEND API version: 1.10
I0721 08:57:28.627137 1 libfastertransformer.cc:1876] TRITONBACKEND_ModelInitialize: fastertransformer (version 1)
I0721 08:57:28.628304 1 libfastertransformer.cc:372] Instance group type: KIND_CPU count: 1
I0721 08:57:28.628326 1 libfastertransformer.cc:402] Sequence Batching: disabled
I0721 08:57:28.628334 1 libfastertransformer.cc:412] Dynamic Batching: disabled
I0721 08:57:28.661657 1 libfastertransformer.cc:438] Before Loading Weights:
+-------------------+------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend           | Path                                                                         | Config                                                                                                                                                         |
+-------------------+------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| fastertransformer | /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so | {"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}} |
+-------------------+------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
after allocation : free: 7.47 GB, total: 15.78 GB, used: 8.31 GB
I0721 09:01:19.653743 1 server.cc:633]
+-------------------+---------+--------+
| Model             | Version | Status |
+-------------------+---------+--------+
| fastertransformer | 1       | READY  |
+-------------------+---------+--------+
I0721 09:01:19.668137 1 metrics.cc:864] Collecting metrics for GPU 0: Tesla V100-SXM2-16GB
I0721 09:01:19.668167 1 metrics.cc:864] Collecting metrics for GPU 1: Tesla V100-SXM2-16GB
I0721 09:01:19.669954 1 metrics.cc:757] Collecting CPU metrics
I0721 09:01:19.670150 1 tritonserver.cc:2264]
+----------------------------------+--------------------------------------------------------------------------------------+
| Option                           | Value                                                                                |
+----------------------------------+--------------------------------------------------------------------------------------+
| server_id                        | triton                                                                               |
| server_version                   | 2.29.0                                                                               |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace logging |
| model_repository_path[0]         | /mnt/triton_repo                                                                     |
| model_control_mode               | MODE_NONE                                                                            |
| strict_model_config              | 0                                                                                    |
| rate_limit                       | OFF                                                                                  |
| pinned_memory_pool_byte_size     | 268435456                                                                            |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                             |
| cuda_memory_pool_byte_size{1}    | 67108864                                                                             |
| response_cache_byte_size         | 0                                                                                    |
| min_supported_compute_capability | 6.0                                                                                  |
| strict_readiness                 | 1                                                                                    |
| exit_timeout                     | 30                                                                                   |
+----------------------------------+--------------------------------------------------------------------------------------+
I0721 09:01:19.672326 1 grpc_server.cc:4819] Started GRPCInferenceService at 0.0.0.0:8001
I0721 09:01:19.672597 1 http_server.cc:3477] Started HTTPService at 0.0.0.0:8000
I0721 09:01:19.714356 1 http_server.cc:184] Started Metrics Service at 0.0.0.0:8002
06 Sending Requests
Start a port-forward to verify the service:
# Start a port-forward with kubectl
kubectl -n default-group port-forward svc/ft-triton-bloom-1-tritoninferenceserver 8001:8001
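Before sending a full inference request, it can be worth confirming through the forwarded port that the server and the model are ready. A minimal sketch using the Triton gRPC client SDK (the file name check_ready.py is just an example):

import tritonclient.grpc as grpcclient

# connect through the locally forwarded gRPC port
client = grpcclient.InferenceServerClient(url="localhost:8001")
print("server ready:", client.is_server_ready())
print("model ready :", client.is_model_ready("fastertransformer"))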
Next we use a script written with the Python SDK provided by Triton Server (tritonclient) to send a request to Triton Server. The script mainly does the following: it creates a gRPC client, tokenizes the prompt and builds the input tensors, sends the inference request, and decodes the returned token IDs.
import os, sys
import numpy as np
import json
import torch
import argparse
import time
from transformers import AutoTokenizer
import tritonclient.grpc as grpcclient

# create the tokenizer from the HuggingFace checkpoint
tokenizer = AutoTokenizer.from_pretrained('/mnt/model/bloom-7b1', padding_side='right')
tokenizer.pad_token_id = tokenizer.eos_token_id


def load_image(img_path: str):
    """
    Loads an encoded image as an array of bytes.
    """
    return np.fromfile(img_path, dtype='uint8')


def tokenize(query):
    # encode the prompt and compute the real (unpadded) length of each sequence
    encoded_inputs = tokenizer(query, padding=True, return_tensors='pt')
    input_token_ids = encoded_inputs['input_ids'].int()
    input_lengths = encoded_inputs['attention_mask'].sum(
        dim=-1, dtype=torch.int32).view(-1, 1)
    return input_token_ids.numpy().astype('uint32'), input_lengths.numpy().astype('uint32')


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name", type=str, required=False, default="fastertransformer",
                        help="Model name")
    parser.add_argument("--url", type=str, required=False, default="localhost:8001",
                        help="Inference server URL. Default is localhost:8001.")
    parser.add_argument('-v', "--verbose", action="store_true", required=False, default=False,
                        help='Enable verbose output')
    args = parser.parse_args()

    # 1) create the gRPC client
    try:
        triton_client = grpcclient.InferenceServerClient(
            url=args.url, verbose=args.verbose)
    except Exception as e:
        print("channel creation failed: " + str(e))
        sys.exit(1)

    output_name = "OUTPUT"

    # 2) set up the inputs
    inputs = []
    ## 2.1) input_ids
    query = "deepspeed is"
    input_ids, input_lengths = tokenize(query)
    inputs.append(grpcclient.InferInput("input_ids", input_ids.shape, "UINT32"))
    inputs[0].set_data_from_numpy(input_ids)
    ## 2.2) input_lengths
    inputs.append(grpcclient.InferInput("input_lengths", input_lengths.shape, "UINT32"))
    inputs[1].set_data_from_numpy(input_lengths)
    ## 2.3) output length
    output_len = 32
    output_len_np = np.array([[output_len]], dtype=np.uintc)
    inputs.append(grpcclient.InferInput("request_output_len", output_len_np.shape, "UINT32"))
    inputs[2].set_data_from_numpy(output_len_np)

    # 3) set up the requested output
    outputs = []
    outputs.append(grpcclient.InferRequestedOutput("output_ids"))

    # 4) send the request
    start_time = time.time()
    results = triton_client.infer(model_name=args.model_name,
                                  inputs=inputs,
                                  outputs=outputs)
    latency = time.time() - start_time

    # 5) handle the result: convert to numpy and decode the generated token ids
    output0_data = results.as_numpy("output_ids")
    print(output0_data.shape)
    result = tokenizer.batch_decode(output0_data[0])
    print(result)
Send the client request as follows:
$ python3 bloom_7b_client.py
(1, 1, 36)
['deepspeed is the speed of the ship at the time of the collision, and the\ndeepspeed of the other ship is the speed of the other ship at the time']
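A quick note on the printed shape, assuming the returned output_ids tensor is laid out as (batch_size, beam_width, sequence_length): the last dimension is the number of input tokens plus request_output_len, which is consistent with the numbers above:

# (1, 1, 36): 1 prompt, 1 beam, 36 total tokens
request_output_len = 32              # as set in bloom_7b_client.py
input_tokens = 36 - request_output_len
print(input_tokens)                  # -> 4 input tokens for "deepspeed is"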
07 Summary
In this article we used the Bloom-7b1 model to show how to accelerate large language model inference with FasterTransformer in the cloud-native AI suite; compared with the HuggingFace version, it delivers roughly a 2.5x speedup. We will continue to roll out more inference acceleration solutions for large models to meet different business needs. Stay tuned.