Container Service ACK Large Model Inference Best Practices, Part 1: TensorRT-LLM - Alibaba Cloud Developer Community

2024-06-12

Summary: Deploy Triton with the TensorRT-LLM backend on ACK using KServe.

This tutorial uses the Llama-2-7b-hf model as an example to show how to deploy the Triton inference server on ACK with KServe. Triton runs with the TensorRT-LLM backend.

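Background

1. KServe

KServe[1] is an open-source, cloud-native model serving platform that simplifies deploying and running machine learning models on Kubernetes. It supports multiple machine learning frameworks and can scale elastically. KServe exposes a declarative API: a model service is described in a simple YAML file, which makes model services easier to configure and manage.

For more information about KServe, see the KServe community documentation[2].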

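2. Triton (Triton Inference Server)

Triton Inference Server[3] is an open-source inference serving framework from NVIDIA that helps users quickly build AI inference applications. Triton supports many machine learning frameworks as runtime backends, including TensorRT, TensorFlow, PyTorch, ONNX, and vLLM. It includes many optimizations for real-time inference, batch inference, and audio/video streaming inference to improve inference performance.

For more information about the Triton inference serving framework, see the Triton Inference Server GitHub repository[4].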

3. TensorRT-LLM

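TensorRT-LLM[5] is NVIDIA's open-source engine for optimizing LLM inference. It is used to define LLM models and build them into TensorRT engines, which improves inference efficiency on NVIDIA GPUs. TensorRT-LLM can also be combined with Triton as one of its backends, tensorrtllm_backend[6]. Models built with TensorRT-LLM can run on a single GPU or on multiple GPUs, with support for tensor parallelism and pipeline parallelism.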

For more information about TensorRT-LLM, see the TensorRT-LLM GitHub repository[7].
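
As a rough illustration of the tensor and pipeline parallelism mentioned above, the number of GPUs an engine needs is the product of the two degrees. The helper names below are hypothetical, and the divisibility checks (attention heads by the tensor-parallel degree, layers by the pipeline-parallel degree) are a common constraint assumed here rather than something this tutorial states. This tutorial itself runs on a single GPU, which corresponds to both degrees being 1.

```shell
#!/bin/sh
# Hypothetical helpers: a sketch of how TP and PP degrees map to a GPU count.

# GPUs required = tensor-parallel degree * pipeline-parallel degree.
# Args: tp pp
gpus_needed() {
  echo $(( $1 * $2 ))
}

# Succeeds when heads split evenly across TP ranks and layers across PP ranks.
# Args: tp pp num_heads num_layers
valid_split() {
  [ $(( $3 % $1 )) -eq 0 ] && [ $(( $4 % $2 )) -eq 0 ]
}

# Llama-2-7b has 32 attention heads and 32 transformer layers.
if valid_split 2 2 32 32; then
  echo "tp=2 pp=2 needs $(gpus_needed 2 2) GPUs"
fi
```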

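Prerequisites

• A Kubernetes cluster with GPU nodes has been created. For details, see Create a Kubernetes cluster with GPUs[8].

• Each GPU node has at least 24 GB of GPU memory.

• KServe is installed. For details, see Install the ack-kserve component[9].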
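
The GPU memory requirement can be sanity-checked with rough arithmetic. The sketch below is only an estimate under stated assumptions: the parameter count is an approximate public figure for Llama-2-7b, FP16 uses 2 bytes per value, and the KV cache size per token is taken as 2 (key and value) × layers × hidden size × bytes per value. It ignores activations, runtime overhead, and the extra memory needed while the engine is being built.

```shell
#!/bin/sh
# Rough FP16 memory estimate for Llama-2-7b (all figures are approximations).
PARAMS=6738415616     # approximate parameter count (assumption)
BYTES_PER_VALUE=2     # float16
LAYERS=32             # transformer layers in Llama-2-7b
HIDDEN=4096           # hidden size in Llama-2-7b
KV_TOKENS=1280        # matches max_tokens_in_paged_kv_cache used in this tutorial

# Model weights in MiB (integer division).
weights_mib=$(( PARAMS * BYTES_PER_VALUE / 1024 / 1024 ))
# Key + value entries for one token across all layers, in bytes.
kv_bytes_per_token=$(( 2 * LAYERS * HIDDEN * BYTES_PER_VALUE ))
# KV cache for the configured token budget, in MiB.
kv_mib=$(( kv_bytes_per_token * KV_TOKENS / 1024 / 1024 ))

echo "weights: ${weights_mib} MiB"
echo "KV cache per token: ${kv_bytes_per_token} bytes"
echo "KV cache for ${KV_TOKENS} tokens: ${kv_mib} MiB"
```

Under these assumptions the weights alone take roughly 12.5 GiB, so a 24 GB GPU leaves headroom for the KV cache and for building the engine.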

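1. Prepare the model data and the model build script

1.1. Download the Llama-2-7b-hf model from HuggingFace or ModelScope

For other models supported by TensorRT-LLM, see the TensorRT-LLM support matrix[10].

1.2. Prepare the model build script

Create a file named trtllm-llama-2-7b.sh with the following content: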

#!/bin/sh
set -e
# This script targets the nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3 image
MODEL_MOUNT_PATH=/mnt/models
OUTPUT_DIR=/root/trt-llm
TRT_BACKEND_DIR=/root/tensorrtllm_backend
# clone tensorrtllm_backend
echo "clone tensorrtllm_backend..."
if [ -d "$TRT_BACKEND_DIR" ]; then
  echo "directory $TRT_BACKEND_DIR exists, skip clone tensorrtllm_backend"
else
  cd /root
  git clone -b v0.9.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
  cd $TRT_BACKEND_DIR
  git submodule update --init --recursive
  git lfs install
  git lfs pull
fi
# convert checkpoint
if [ -d "$OUTPUT_DIR/llama-2-7b-ckpt" ]; then
  echo "directory $OUTPUT_DIR/llama-2-7b-ckpt exists, skip convert checkpoint"
else
  echo "convert checkpoint..."
  python3 $TRT_BACKEND_DIR/tensorrt_llm/examples/llama/convert_checkpoint.py \
  --model_dir $MODEL_MOUNT_PATH/Llama-2-7b-hf \
  --output_dir $OUTPUT_DIR/llama-2-7b-ckpt \
  --dtype float16
fi
# build trtllm engine
if [ -d "$OUTPUT_DIR/llama-2-7b-engine" ]; then
  echo "directory $OUTPUT_DIR/llama-2-7b-engine exists, skip build trtllm engine"
else
  echo "build trtllm engine..."
  trtllm-build --checkpoint_dir $OUTPUT_DIR/llama-2-7b-ckpt \
               --remove_input_padding enable \
               --gpt_attention_plugin float16 \
               --context_fmha enable \
               --gemm_plugin float16 \
               --output_dir $OUTPUT_DIR/llama-2-7b-engine \
               --paged_kv_cache enable \
               --max_batch_size 8
fi
# config model
echo "config model..."
cd $TRT_BACKEND_DIR
# start from a fresh copy of the template model repository so that reruns do not nest directories
rm -rf llama_ifb
cp -r all_models/inflight_batcher_llm llama_ifb
export HF_LLAMA_MODEL=$MODEL_MOUNT_PATH/Llama-2-7b-hf
export ENGINE_PATH=$OUTPUT_DIR/llama-2-7b-engine
python3 tools/fill_template.py -i llama_ifb/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:8,preprocessing_instance_count:1
python3 tools/fill_template.py -i llama_ifb/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:8,postprocessing_instance_count:1
python3 tools/fill_template.py -i llama_ifb/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:8,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i llama_ifb/ensemble/config.pbtxt triton_max_batch_size:8
python3 tools/fill_template.py -i llama_ifb/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:8,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:1280,max_attention_window_size:1280,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
# run server
echo "run server..."
pip install SentencePiece
tritonserver --model-repository=$TRT_BACKEND_DIR/llama_ifb --http-port=8080 --grpc-port=9000 --metrics-port=8002 --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_

1.3. Upload the model and script to OSS

# Create the directory
ossutil mkdir oss://<your-bucket-name>/Llama-2-7b-hf
# Upload the model files
ossutil cp -r ./Llama-2-7b-hf oss://<your-bucket-name>/Llama-2-7b-hf
# Upload the script file
chmod +x trtllm-llama-2-7b.sh
ossutil cp -r ./trtllm-llama-2-7b.sh oss://<your-bucket-name>/trtllm-llama-2-7b.sh

The expected file layout in OSS is:

tree -L 1
.
├── Llama-2-7b-hf
└── trtllm-llama-2-7b.sh

1.4. Create the PV and PVC

Replace the ${your-accesskey-id}, ${your-accesskey-secret}, ${your-bucket-name}, and ${your-bucket-endpoint} variables in the file.

kubectl apply -f- << EOF
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: ${your-accesskey-id} # AccessKey ID used to access OSS
  akSecret: ${your-accesskey-secret} # AccessKey secret used to access OSS
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi 
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: model-oss
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: ${your-bucket-name}
      url: ${your-bucket-endpoint} # e.g. oss-cn-hangzhou.aliyuncs.com
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: "/"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
EOF
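
Before moving on, it can be worth confirming that the claim created above has bound to the volume. The sketch below is one way to do that; the function name is hypothetical, and it assumes kubectl is configured for the cluster and that the PVC lives in the default namespace unless a second argument is given.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the named PVC reports phase "Bound".
# Args: pvc_name [namespace]
pvc_is_bound() {
  phase=$(kubectl get pvc "$1" -n "${2:-default}" -o jsonpath='{.status.phase}')
  [ "$phase" = "Bound" ]
}
```

For example, `pvc_is_bound llm-model && echo "llm-model is bound"` prints the message only once the PVC created in this step is bound.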

2. Create the ClusterServingRuntime

kubectl apply -f- <<EOF
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: triton-trtllm
spec:
  annotations:
    prometheus.kserve.io/path: /metrics
    prometheus.kserve.io/port: "8002"
  containers:
  - args:
    - tritonserver
    - --model-store=/mnt/models
    - --grpc-port=9000
    - --http-port=8080
    - --allow-grpc=true
    - --allow-http=true
    image: nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
    name: kserve-container
    resources:
      requests:
        cpu: "4"
        memory: 12Gi
  protocolVersions:
  - v2
  - grpc-v2
  supportedModelFormats:
  - name: triton
    version: "2"
EOF

3. Deploy the application

3.1. Deploy the KServe application

kubectl apply -f- << EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-2-7b
spec:
  predictor:
    model:
      modelFormat:
        name: triton
        version: "2"
      runtime: triton-trtllm
      storageUri: pvc://llm-model/
      name: kserve-container
      resources:
        limits:
          nvidia.com/gpu: "1"
        requests:
          cpu: "4"
          memory: 12Gi
          nvidia.com/gpu: "1"
      command:
      - sh
      - -c
      - /mnt/models/trtllm-llama-2-7b.sh
EOF

Run the following command to check whether the application is ready:

kubectl get isvc llama-2-7b

Expected output:

NAME         URL                                     READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION   AGE
llama-2-7b   http://llama-2-7b-default.example.com   True                                                                  29m
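
Because the startup script converts the checkpoint and builds the engine before it starts Triton, the service can take a while to become ready. The sketch below checks the Ready condition and prints recent container logs to follow progress. The function names are hypothetical; the label selector relies on the `serving.kserve.io/inferenceservice` label that KServe sets on predictor pods, and the container name matches the one used in this tutorial.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the InferenceService has Ready=True.
# Args: isvc_name [namespace]
isvc_is_ready() {
  status=$(kubectl get inferenceservice "$1" -n "${2:-default}" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
  [ "$status" = "True" ]
}

# Hypothetical helper: print the last 50 log lines of the serving container.
# Args: isvc_name [namespace]
isvc_logs() {
  kubectl logs -n "${2:-default}" \
    -l "serving.kserve.io/inferenceservice=$1" \
    -c kserve-container --tail=50
}
```

A typical use is `isvc_is_ready llama-2-7b || isvc_logs llama-2-7b`, which shows the build output while the service is still starting.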

3.2. Access the application

3.2.1. Access from inside the container

curl -X POST localhost:8080/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'
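
If the request above fails right after the pod starts, the server may simply not be up yet. A minimal sketch of a readiness poll follows, assuming the HTTP port 8080 configured in this tutorial and the `/v2/health/ready` endpoint of the KServe v2 inference protocol that Triton implements. The function name and its arguments are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: poll Triton's readiness endpoint until it answers.
# Args: host:port max_attempts sleep_seconds
wait_for_triton() {
  addr=$1
  attempts=$2
  pause=$3
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl return non-zero on HTTP errors; -s keeps it quiet.
    if curl -sf -o /dev/null "http://${addr}/v2/health/ready"; then
      echo "ready after ${i} attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$pause"
  done
  echo "not ready after ${attempts} attempt(s)" >&2
  return 1
}
```

For example, `wait_for_triton localhost:8080 60 10` waits up to about ten minutes before giving up.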

3.2.2. Access from a node inside the cluster

NGINX_INGRESS_IP=`kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.spec.clusterIP}'`
# If the service is not deployed in the default namespace, change the namespace name
SERVICE_HOSTNAME=$(kubectl get inferenceservice llama-2-7b -n default  -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
http://$NGINX_INGRESS_IP:80/v2/models/ensemble/generate \
-d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'

3.2.3. Access from outside the cluster

NGINX_INGRESS_IP=`kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'`
# If the service is not deployed in the default namespace, change the namespace name
SERVICE_HOSTNAME=$(kubectl get inferenceservice llama-2-7b -n default  -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
http://$NGINX_INGRESS_IP:80/v2/models/ensemble/generate \
-d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'

Expected output:

{"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"\nMachine learning is a type of artificial intelligence (AI) that allows software applications to become more accurate"}
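
The response carries several bookkeeping fields, while the generated text is in `text_output`. The sketch below pulls out just that field. The function name is hypothetical, and it assumes python3 is available on the machine that sends the request.

```shell
#!/bin/sh
# Hypothetical helper: read a generate response on stdin and print only
# its text_output field. Assumes python3 is installed on the client.
extract_text_output() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["text_output"])'
}
```

It can be appended to any of the curl commands above, for example `curl -s ... | extract_text_output`, where the `-s` flag keeps curl's progress output out of the pipe.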

4. Q&A

Failed to pull image "nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3": failed to pull and unpack image "nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3": failed to copy: httpReadSeeker: failed open: failed to authorize: failed to fetch anonymous token: unexpected status from GET request to https://authn.nvidia.com/token?scope=repository%3Anvidia%2Ftritonserver%3Apull&service=registry: 401

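Cause: authentication with the NVIDIA image registry failed.

Solution: manually pull the nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3 image on a local machine and push it to your own registry, then change the image address in the ClusterServingRuntime to the address in your own registry.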
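
The workaround above can be sketched as follows. The registry address `registry.example.com/my-namespace` is a placeholder and the function names are hypothetical; the sketch assumes docker is installed locally and already logged in to the target registry.

```shell
#!/bin/sh
SRC_IMAGE=nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
# Placeholder registry and namespace; replace with your own.
DST_REPO=${DST_REPO:-registry.example.com/my-namespace}

# Keep the image name and tag, swap the registry/namespace prefix.
# Args: source_image destination_repo
dst_image() {
  echo "$2/${1##*/}"
}

# Pull the source image, retag it for the destination, and push it.
# Args: source_image destination_repo
mirror_image() {
  target=$(dst_image "$1" "$2")
  docker pull "$1" && docker tag "$1" "$target" && docker push "$target"
}
```

Running `mirror_image "$SRC_IMAGE" "$DST_REPO"` copies the image; the value printed by `dst_image "$SRC_IMAGE" "$DST_REPO"` is then what goes into the `image` field of the ClusterServingRuntime.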

5. Reference

https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama.md

https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llama/README.md

Related links:

[1] KServe

https://github.com/kserve/kserve

[2] KServe community documentation

https://kserve.github.io/website/latest/

[3] Triton Inference Server

https://github.com/triton-inference-server/server