A Practical Guide to TensorRT-LLM Inference Serving - Alibaba Cloud Developer Community


2025-10-31


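Summary: `trtllm-serve` is the official TensorRT-LLM serving tool. It deploys a production-grade, OpenAI-compatible service with a single command, exposes model listing, text completion, and chat completion endpoints, and also supports multimodal models and distributed deployment.

Introduction

trtllm-serve is the serving launcher that ships with TensorRT-LLM. It lets you quickly deploy an optimized model as a production-grade service: there is no extra framework to set up, and a single command starts a server with an OpenAI-compatible API. That makes it the most direct path from local development to online deployment.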

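It supports the following core endpoints:

/v1/models - list models
/v1/completions - text completion
/v1/chat/completions - chat completion

For details on these endpoints, see the OpenAI API reference documentation.

It also provides the following utility endpoints:

/health - service health status
/metrics - runtime performance metrics
/version - version information

The /metrics endpoint reports runtime data such as GPU memory usage, KV cache statistics, and in-flight batching details.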
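As a small illustration of the utility endpoints above, the sketch below waits for a server to report healthy before any traffic is sent. It assumes the server listens on http://localhost:8000 (the address used in the request examples of this guide) and that /health answers with HTTP 200 once the service is ready. The helper names `endpoint_url` and `wait_until_healthy` are hypothetical and are not part of TensorRT-LLM.

```python
import time
import urllib.error
import urllib.request


def endpoint_url(base_url: str, path: str) -> str:
    """Join a server base URL and an endpoint path with exactly one slash."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def wait_until_healthy(base_url: str, timeout_s: float = 60.0,
                       interval_s: float = 1.0) -> bool:
    """Poll /health until it returns HTTP 200 or the timeout expires.

    Assumption: a 200 response means the service is ready to accept requests.
    """
    deadline = time.monotonic() + timeout_s
    url = endpoint_url(base_url, "/health")
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not reachable yet; retry after a short sleep
        time.sleep(interval_s)
    return False
```

A caller might run `wait_until_healthy("http://localhost:8000")` right after launching the server and only start sending requests once it returns True.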

Quick start

Basic launch command:

trtllm-serve <model> [--tp_size <tp> --pp_size <pp> --ep_size <ep> --host <host> --port <port>]

For a full description of the command-line arguments, see the Command-line reference section.
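When the server is launched from a script rather than typed by hand, it can help to assemble the command programmatically. The following is a minimal sketch that builds the argument list from the options shown in the basic launch command. The function name `build_serve_command` is hypothetical, and only the flags that appear in this guide are used.

```python
import shlex
from typing import List, Optional


def build_serve_command(
    model: str,
    tp_size: Optional[int] = None,
    pp_size: Optional[int] = None,
    ep_size: Optional[int] = None,
    host: Optional[str] = None,
    port: Optional[int] = None,
) -> List[str]:
    """Return the trtllm-serve argument list, omitting options left as None."""
    cmd = ["trtllm-serve", model]
    options = {
        "--tp_size": tp_size,
        "--pp_size": pp_size,
        "--ep_size": ep_size,
        "--host": host,
        "--port": port,
    }
    for flag, value in options.items():
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


cmd = build_serve_command(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0", tp_size=1, host="0.0.0.0", port=8000
)
# Print the command as a single shell-quoted line for inspection.
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.Popen(cmd)` to start the server as a child process.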

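Inference request examples

Once the server is running, you can send requests to the inference endpoints. The examples below use the TinyLlama-1.1B-Chat-v1.0 model.

Chat completions API

Using the OpenAI Python client: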

from openai import OpenAI

# api_key may be any string
client = OpenAI(api_key="any-string", base_url="http://localhost:8000/v1")

completion = client.chat.completions.create(
    model="TinyLlama-1.1B-Chat-v1.0",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
)
print(completion.choices[0].message.content)

Or with curl:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TinyLlama-1.1B-Chat-v1.0",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "temperature": 0.7
  }'

Text completions API

Using the Python client:

from openai import OpenAI

# api_key may be any string
client = OpenAI(api_key="any-string", base_url="http://localhost:8000/v1")

completion = client.completions.create(
    model="TinyLlama-1.1B-Chat-v1.0",
    prompt="Once upon a time there was a mountain,",
)
print(completion.choices[0].text)

Or with curl:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TinyLlama-1.1B-Chat-v1.0",
    "prompt": "Once upon a time there was a mountain,",
    "temperature": 0.7
  }'
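For environments where the `openai` package is not installed, the same request can be sent with the Python standard library alone. The sketch below mirrors the curl call above: it builds the JSON body, posts it to /v1/completions, and reads `choices[0].text` from the response. The helper names are hypothetical, and the response shape is assumed to follow the OpenAI completions format used in the client example.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             temperature: float = 0.7) -> urllib.request.Request:
    """Build a POST request for the /v1/completions endpoint."""
    body = {"model": model, "prompt": prompt, "temperature": temperature}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url: str, model: str, prompt: str) -> str:
    """Send the request and return the text of the first choice."""
    req = build_completion_request(base_url, model, prompt)
    with urllib.request.urlopen(req, timeout=60) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["text"]
```

With a server running locally, `complete("http://localhost:8000", "TinyLlama-1.1B-Chat-v1.0", "Once upon a time there was a mountain,")` would return the generated continuation.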

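Multimodal serving

Deploying multimodal models has the following limitations:

TRT-LLM multimodal does not yet support kv_cache_reuse.
Multimodal models require a chat_template, so only the Chat API is supported.

Configuration steps

Create the configuration file: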

cat > ./extra-llm-api-config.yml << EOF
kv_cache_config:
    enable_block_reuse: false
EOF

Start the server:

trtllm-serve Qwen/Qwen2-VL-7B-Instruct \
    --extra_llm_api_options ./extra-llm-api-config.yml

Multimodal request example

Multimodal models accept mixed text, image, video, and audio input.

Using the Python client:

from openai import OpenAI

# api_key may be any string
client = OpenAI(api_key="any-string", base_url="http://localhost:8000/v1")

completion = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}},
            ],
        }
    ],
)
print(completion.choices[0].message.content)

Or with curl:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2-VL-7B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in this image?"},
          {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}}
        ]
      }
    ]
  }'

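Supported modality types

TRT-LLM multimodal supports the following data formats (actual support depends on the model):

Text

Basic format:

{"role": "user", "content": "What is the capital of South Korea?"}

Explicit type:

{"role": "user", "content": [{"type": "text", "text": "What is the capital of South Korea?"}]}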

Image

By URL:

{
  "role": "user",
  "content": [
    {"type": "text", "text": "What is in the image?"},
    {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}}
  ]
}

Base64-encoded:

{
  "role": "user",
  "content": [
    {"type": "text", "text": "What is in the image?"},
    {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,{image_base64}"}}
  ]
}

Tip: TensorRT-LLM provides a load_base64_image utility function for encoding images; see the source code for details.
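To make the Base64 format concrete, here is a minimal sketch that encodes a local image into a data URL and wraps it in a chat message of the shape shown above. It uses only the Python standard library and does not call `load_base64_image`, whose exact signature is not given in this guide. The helper names `to_data_url` and `image_message` are hypothetical.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a data URL usable in an image_url entry."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_message(question: str, image_path: str) -> dict:
    """Build a user message that pairs a text question with a local image."""
    # Guess the MIME type from the file extension; fall back to JPEG.
    mime_type = mimetypes.guess_type(image_path)[0] or "image/jpeg"
    data_url = to_data_url(Path(image_path).read_bytes(), mime_type)
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }
```

The returned dictionary can be passed directly as one element of the `messages` list in the chat completion call.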

Video

{
  "role": "user",
  "content": [
    {"type": "text", "text": "What is in the video?"},
    {"type": "video_url", "video_url": {"url": "https://example.com/video.mp4"}}
  ]
}

Audio

{
  "role": "user",
  "content": [
    {"type": "text", "text": "What is being said in this audio?"},
    {"type": "audio_url", "audio_url": {"url": "https://example.com/audio.mp3"}}
  ]
}

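Multi-node distributed deployment

TRT-LLM can deploy large models across multiple nodes through Slurm. Taking DeepSeek-V3 as an example:

# Create the config file, enabling attention data parallelism and the overlap scheduler
echo -e "enable_attention_dp: true\npytorch_backend_config:\n  enable_overlap_scheduler: true" > extra-llm-api-config.yml

# Launch the multi-node service through Slurm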
srun -N 2 -w [NODES] \
    --output=benchmark_2node.log \
    --ntasks 16 --ntasks-per-node=8 \
    --mpi=pmix --gres=gpu:8 \
    --container-image=<CONTAINER_IMG> \
    --container-mounts=/workspace:/workspace \
    --container-workdir /workspace \
    bash -c "trtllm-llmapi-launch trtllm-serve deepseek-ai/DeepSeek-V3 \
      --max_batch_size 161 \
      --max_num_tokens 1160 \
      --tp_size 16 \
      --ep_size 4 \
      --kv_cache_free_gpu_memory_fraction 0.95 \
      --extra_llm_api_options ./extra-llm-api-config.yml"

See the trtllm-llmapi-launch source for details.
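The Slurm example above launches 2 nodes with 8 tasks each (16 ranks) and passes --tp_size 16 and --ep_size 4. The sketch below is a small pre-flight check for such a layout. It rests on two assumptions that are consistent with the example but are not stated in this guide: the number of launched ranks should equal tp_size × pp_size, and ep_size should evenly divide tp_size. The function name `check_layout` is hypothetical.

```python
from typing import List


def check_layout(nodes: int, tasks_per_node: int, tp_size: int,
                 pp_size: int = 1, ep_size: int = 1) -> List[str]:
    """Return a list of problems; an empty list means the layout looks consistent.

    Assumptions (not taken from the TensorRT-LLM docs):
      * total ranks == tp_size * pp_size
      * ep_size divides tp_size
    """
    problems = []
    total_ranks = nodes * tasks_per_node
    if tp_size * pp_size != total_ranks:
        problems.append(
            f"tp_size * pp_size = {tp_size * pp_size}, but {total_ranks} ranks are launched"
        )
    if tp_size % ep_size != 0:
        problems.append(f"ep_size {ep_size} does not divide tp_size {tp_size}")
    return problems


# Layout from the Slurm example: 2 nodes x 8 tasks, tp_size 16, ep_size 4.
print(check_layout(nodes=2, tasks_per_node=8, tp_size=16, ep_size=4))
```

Running such a check before submitting the job can catch a mismatch between the Slurm task count and the parallelism flags without waiting for the job to start.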

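Metrics endpoint

Notes:

Metrics for the PyTorch backend are still being completed, and the data is less comprehensive than for the TensorRT backend.

Some fields, such as CPU memory usage, are not yet available for the PyTorch backend.

Enabling enable_iter_perf_stats introduces a slight performance overhead.

The /metrics endpoint provides real-time performance data such as GPU memory usage and KV cache statistics.

Enabling metrics collection

For the PyTorch backend, enable performance statistics through the configuration file: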

# extra_llm_config.yaml
enable_iter_perf_stats: true

Pass the configuration file when starting the server:

trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
    --extra_llm_api_options extra_llm_config.yaml

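Querying metrics

After sending inference requests, you can query the metrics. Because the statistics are held in a queue and deleted once retrieved, poll the endpoint right after each request and store the results if you need them later:

curl -X GET http://localhost:8000/metrics

Typical response: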

[
    {
        "gpuMemUsage": 76665782272,
        "iter": 154,
        "iterLatencyMS": 7.00688362121582,
        "kvCacheStats": {
            "allocNewBlocks": 3126,
            "allocTotalBlocks": 3126,
            "cacheHitRate": 0.00128,
            "freeNumBlocks": 101253,
            "maxNumBlocks": 101256,
            "missedBlocks": 3121,
            "reusedBlocks": 4,
            "tokensPerBlock": 32,
            "usedNumBlocks": 3
        },
        "numActiveRequests": 1
    }
]

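The response includes key metrics such as GPU memory usage, KV cache hit rate, and the number of active requests, which are useful for performance analysis and tuning.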
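Since the statistics are removed from the queue once read, a client usually wants to fetch them and reduce each entry to a few numbers right away. The sketch below does that with the standard library, using the field names from the sample response. The helper names are hypothetical. The hit-rate formula, reusedBlocks / (reusedBlocks + missedBlocks), is an inference from the sample values (4 / 3125 = 0.00128) rather than a documented definition.

```python
import json
import urllib.request


def fetch_metrics(base_url: str = "http://localhost:8000") -> list:
    """GET /metrics and return the decoded JSON list of per-iteration stats."""
    with urllib.request.urlopen(base_url.rstrip("/") + "/metrics", timeout=10) as resp:
        return json.load(resp)


def summarize_iteration(stats: dict) -> dict:
    """Reduce one /metrics entry to a few headline numbers."""
    kv = stats["kvCacheStats"]
    looked_up = kv["reusedBlocks"] + kv["missedBlocks"]
    return {
        "iter": stats["iter"],
        "gpu_mem_gib": stats["gpuMemUsage"] / 2**30,  # bytes to GiB
        "kv_block_utilization": kv["usedNumBlocks"] / kv["maxNumBlocks"],
        # Assumed definition, inferred from the sample response.
        "kv_hit_rate": kv["reusedBlocks"] / looked_up if looked_up else 0.0,
        "active_requests": stats["numActiveRequests"],
    }
```

A polling loop could call `fetch_metrics()` after each request and append `summarize_iteration(entry)` for every returned entry to a log for later analysis.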

Command-line reference

For the full list of options:

trtllm-serve --help
