Container Service for Kubernetes (ACK) LLM Inference Best Practices, Part 1: TensorRT-LLM
Deploying Triton + TensorRT-LLM on ACK with KServe. Using the Llama-2-7b-hf model as an example, this tutorial demonstrates how to deploy the Triton inference framework on ACK through KServe, with TensorRT-LLM as the Triton backend.
Background
1. KServe
KServe[1] is an open-source, cloud-native model serving platform designed to simplify deploying and running machine learning models on Kubernetes. It supports multiple machine learning frameworks and scales model services elastically. KServe exposes a declarative API: a model service is described in a simple YAML manifest, which makes configuring and managing model services much easier.
For more information about KServe, see the KServe community documentation[2].
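To illustrate this declarative style, the sketch below shows what a minimal InferenceService manifest can look like. The names and values are illustrative only; the manifest actually used in this tutorial appears in section 3.1.

kubectl apply -f- << EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model                       # illustrative name
spec:
  predictor:
    model:
      modelFormat:
        name: triton                   # which model format / runtime to use
      storageUri: pvc://my-model-pvc/  # where the model artifacts are stored
EOF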
2. Triton (Triton Inference Server)
Triton Inference Server[3] is an open-source inference serving framework from NVIDIA that helps users quickly build AI inference applications. Triton supports many machine learning frameworks as runtime backends, including TensorRT, TensorFlow, PyTorch, ONNX, and vLLM, and it is heavily optimized for real-time, batch, and audio/video streaming inference scenarios to deliver better inference performance.
For more information about the Triton inference serving framework, see the Triton Inference Server GitHub repository[4].
3. TensorRT-LLM
TensorRT-LLM[5] is NVIDIA's open-source LLM optimization engine. It is used to define LLM models and build them into TensorRT engines, improving inference efficiency on NVIDIA GPUs. TensorRT-LLM can also be paired with Triton as one of Triton's backends, tensorrtllm_backend[6]. Engines built with TensorRT-LLM can run on a single GPU or across multiple GPUs, with support for tensor parallelism and pipeline parallelism.
For more information about TensorRT-LLM, see the TensorRT-LLM GitHub repository[7].
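As a side note on multi-GPU support: this tutorial builds a single-GPU engine, but the Llama example in TensorRT-LLM exposes parallelism options at checkpoint-conversion time. The sketch below follows the v0.9.0 Llama example; treat the flag names as assumptions and verify them against the TensorRT-LLM version you actually install.

# Hypothetical 2-GPU tensor-parallel build (not used later in this tutorial).
# convert_checkpoint.py is the script from the TensorRT-LLM Llama example.
python3 examples/llama/convert_checkpoint.py \
  --model_dir ./Llama-2-7b-hf \
  --output_dir ./llama-2-7b-ckpt-tp2 \
  --dtype float16 \
  --tp_size 2          # shard the weights across 2 GPUs (tensor parallelism)

trtllm-build \
  --checkpoint_dir ./llama-2-7b-ckpt-tp2 \
  --output_dir ./llama-2-7b-engine-tp2 \
  --gemm_plugin float16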
Prerequisites
•A Kubernetes cluster with GPU nodes has been created. For details, see Create a Kubernetes cluster with GPU nodes[8]. (A quick sanity check is shown after this list.)
•Each GPU node needs at least 24 GB of GPU memory.
•KServe has been installed. For details, see Install the ack-kserve component[9].
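A minimal sanity check for the first and third prerequisites, assuming your GPU nodes expose the standard nvidia.com/gpu resource and that ack-kserve installs the usual KServe CRDs (both are assumptions about your particular setup):

# List allocatable GPUs per node (an empty value means no schedulable GPU on that node)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
# Confirm that the KServe CRDs exist
kubectl get crd inferenceservices.serving.kserve.io clusterservingruntimes.serving.kserve.io

GPU memory itself is easiest to confirm by running nvidia-smi on the node.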
1. Prepare the model data and the model build script
1.1. Download the Llama-2-7b-hf model from Hugging Face or ModelScope
For other models supported by the TensorRT-LLM framework, see the TensorRT-LLM support matrix[10].
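The rest of the tutorial assumes the model files sit in a local directory named Llama-2-7b-hf, matching the paths used by the build script in section 1.2. One hedged way to fetch the model is shown below: the Hugging Face repo meta-llama/Llama-2-7b-hf is gated, so you need an account with granted access and an access token, and the ModelScope repo id is an assumption to verify before use.

git lfs install
# Hugging Face (gated; git will prompt for your username and access token)
git clone https://huggingface.co/meta-llama/Llama-2-7b-hf
# or ModelScope (repo id is an assumption, check modelscope.cn first)
# git clone https://www.modelscope.cn/shakechen/Llama-2-7b-hf.git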
1.2. Prepare the model build script
Create a file named trtllm-llama-2-7b.sh with the following content. On startup it clones tensorrtllm_backend, converts the Hugging Face checkpoint, builds the TensorRT engine, fills in the Triton model configuration, and finally starts Triton:
#!/bin/sh
set -e

# This script targets the nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3 image
MODEL_MOUNT_PATH=/mnt/models
OUTPUT_DIR=/root/trt-llm
TRT_BACKEND_DIR=/root/tensorrtllm_backend

# clone tensorrtllm_backend
echo "clone tensorrtllm_backend..."
if [ -d "$TRT_BACKEND_DIR" ]; then
  echo "directory $TRT_BACKEND_DIR exists, skip clone tensorrtllm_backend"
else
  cd /root
  git clone -b v0.9.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
  cd $TRT_BACKEND_DIR
  git submodule update --init --recursive
  git lfs install
  git lfs pull
fi

# convert checkpoint
if [ -d "$OUTPUT_DIR/llama-2-7b-ckpt" ]; then
  echo "directory $OUTPUT_DIR/llama-2-7b-ckpt exists, skip convert checkpoint"
else
  echo "convert checkpoint..."
  python3 $TRT_BACKEND_DIR/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir $MODEL_MOUNT_PATH/Llama-2-7b-hf \
    --output_dir $OUTPUT_DIR/llama-2-7b-ckpt \
    --dtype float16
fi

# build trtllm engine
if [ -d "$OUTPUT_DIR/llama-2-7b-engine" ]; then
  echo "directory $OUTPUT_DIR/llama-2-7b-engine exists, skip build trtllm engine"
else
  echo "build trtllm engine..."
  trtllm-build --checkpoint_dir $OUTPUT_DIR/llama-2-7b-ckpt \
    --remove_input_padding enable \
    --gpt_attention_plugin float16 \
    --context_fmha enable \
    --gemm_plugin float16 \
    --output_dir $OUTPUT_DIR/llama-2-7b-engine \
    --paged_kv_cache enable \
    --max_batch_size 8
fi

# configure the Triton model repository
echo "config model..."
cd $TRT_BACKEND_DIR
cp all_models/inflight_batcher_llm/ llama_ifb -r
export HF_LLAMA_MODEL=$MODEL_MOUNT_PATH/Llama-2-7b-hf
export ENGINE_PATH=$OUTPUT_DIR/llama-2-7b-engine
python3 tools/fill_template.py -i llama_ifb/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:8,preprocessing_instance_count:1
python3 tools/fill_template.py -i llama_ifb/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:8,postprocessing_instance_count:1
python3 tools/fill_template.py -i llama_ifb/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:8,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i llama_ifb/ensemble/config.pbtxt triton_max_batch_size:8
python3 tools/fill_template.py -i llama_ifb/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:8,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:1280,max_attention_window_size:1280,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0

# run server
echo "run server..."
pip install SentencePiece
tritonserver --model-repository=$TRT_BACKEND_DIR/llama_ifb --http-port=8080 --grpc-port=9000 --metrics-port=8002 --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_
1.3. Upload the model and script to OSS
# create the directory
ossutil mkdir oss://<your-bucket-name>/Llama-2-7b-hf
# upload the model files
ossutil cp -r ./Llama-2-7b-hf oss://<your-bucket-name>/Llama-2-7b-hf
# upload the build-and-serve script
chmod +x trtllm-llama-2-7b.sh
ossutil cp -r ./trtllm-llama-2-7b.sh oss://<your-bucket-name>/trtllm-llama-2-7b.sh
The expected file layout in OSS is as follows:
tree -L 1
.
├── Llama-2-7b-hf
└── trtllm-llama-2-7b.sh
1.4. Create the PV and PVC
Replace the ${your-accesskey-id}, ${your-accesskey-secret}, ${your-bucket-name}, and ${your-bucket-endpoint} placeholders in the manifest below with your own values.
kubectl apply -f- << EOF
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: ${your-accesskey-id}         # AccessKey ID used to access OSS
  akSecret: ${your-accesskey-secret} # AccessKey Secret used to access OSS
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: model-oss
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: ${your-bucket-name}
      url: ${your-bucket-endpoint}   # e.g. oss-cn-hangzhou.aliyuncs.com
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: "/"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
EOF
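Optionally, confirm that the static PV and PVC bind before continuing; both should report a Bound status:

kubectl get pv llm-model
kubectl get pvc llm-model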
2. Create the ClusterServingRuntime
kubectl apply -f- <<EOF
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: triton-trtllm
spec:
  annotations:
    prometheus.kserve.io/path: /metrics
    prometheus.kserve.io/port: "8002"
  containers:
    - args:
        - tritonserver
        - --model-store=/mnt/models
        - --grpc-port=9000
        - --http-port=8080
        - --allow-grpc=true
        - --allow-http=true
      image: nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
      name: kserve-container
      resources:
        requests:
          cpu: "4"
          memory: 12Gi
  protocolVersions:
    - v2
    - grpc-v2
  supportedModelFormats:
    - name: triton
      version: "2"
EOF
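You can verify that the runtime object was created (it is cluster-scoped, so no namespace is needed):

kubectl get clusterservingruntime triton-trtllm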
3. Deploy the application
3.1. Deploy the KServe application
kubectl apply -f- << EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-2-7b
spec:
  predictor:
    model:
      modelFormat:
        name: triton
        version: "2"
      runtime: triton-trtllm
      storageUri: pvc://llm-model/
      name: kserve-container
      resources:
        limits:
          nvidia.com/gpu: "1"
        requests:
          cpu: "4"
          memory: 12Gi
          nvidia.com/gpu: "1"
      command:
        - sh
        - -c
        - /mnt/models/trtllm-llama-2-7b.sh
EOF
Run the following command to check whether the application is ready:
kubectl get isvc llama-2-7b
Expected output:
NAME         URL                                     READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION   AGE
llama-2-7b   http://llama-2-7b-default.example.com   True                                                                  29m
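Note that trtllm-llama-2-7b.sh clones the backend repository and builds the TensorRT engine the first time the pod starts, so it can take a while before READY turns True. One way to follow the build progress, assuming KServe attaches its standard serving.kserve.io/inferenceservice label to the predictor pod (an assumption about your deployment):

kubectl logs -f -l serving.kserve.io/inferenceservice=llama-2-7b -c kserve-container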
3.2. Access the application
3.2.1. Access from inside the container
curl -X POST localhost:8080/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'
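The command above assumes a shell inside the serving container. One hedged way to run it from your workstation instead, again assuming KServe's standard serving.kserve.io/inferenceservice pod label and the kserve-container name used above:

POD=$(kubectl get pod -l serving.kserve.io/inferenceservice=llama-2-7b -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it $POD -c kserve-container -- \
  curl -X POST localhost:8080/v2/models/ensemble/generate \
  -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'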
3.2.2. Access from a node inside the cluster
NGINX_INGRESS_IP=`kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.spec.clusterIP}'`
# If the service is not deployed in the default namespace, change the namespace name accordingly
SERVICE_HOSTNAME=$(kubectl get inferenceservice llama-2-7b -n default -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
  http://$NGINX_INGRESS_IP:80/v2/models/ensemble/generate \
  -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'
3.2.3. Access from outside the cluster
NGINX_INGRESS_IP=`kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'`
# If the service is not deployed in the default namespace, change the namespace name accordingly
SERVICE_HOSTNAME=$(kubectl get inferenceservice llama-2-7b -n default -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
  http://$NGINX_INGRESS_IP:80/v2/models/ensemble/generate \
  -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'
Expected output:
{"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"\nMachine learning is a type of artificial intelligence (AI) that allows software applications to become more accurate"}
4. Q&A
Failed to pull image "nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3": failed to pull and unpack image "nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3": failed to copy: httpReadSeeker: failed open: failed to authorize: failed to fetch anonymous token: unexpected status from GET request to https://authn.nvidia.com/token?scope=repository%3Anvidia%2Ftritonserver%3Apull&service=registry: 401
Cause: authentication against the NVIDIA container registry (nvcr.io) failed.
Solution: manually pull the nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3 image on a local machine, push it to your own image registry, and then change the image in the ClusterServingRuntime to the address in your own registry.
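A sketch of the mirroring steps, assuming docker is available locally; <your-registry>/<namespace> is a placeholder for your own registry address, not a real value.

docker pull nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
docker tag nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3 <your-registry>/<namespace>/tritonserver:24.04-trtllm-python-py3
docker push <your-registry>/<namespace>/tritonserver:24.04-trtllm-python-py3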
5. Reference
https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama.md
https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llama/README.md
Related links:
[1] KServe
https://github.com/kserve/kserve
[2] KServe community documentation
https://kserve.github.io/website/latest/
[3] Triton Inference Server
https://github.com/triton-inference-server/server
[4] Triton Inference Server GitHub repository
https://github.com/triton-inference-server/server
[5] TensorRT-LLM
https://github.com/NVIDIA/TensorRT-LLM
[6] tensorrtllm_backend
https://github.com/triton-inference-server/tensorrtllm_backend
[7] TensorRT-LLM GitHub repository
https://github.com/NVIDIA/TensorRT-LLM
[8] Create a Kubernetes cluster with GPU nodes
[9] Install the ack-kserve component
[10] TensorRT-LLM support matrix
https://nvidia.github.io/TensorRT-LLM/reference/support-matrix.html