Deploying a Qwen3 Inference Service with vLLM on Alibaba Cloud Container Service for Kubernetes (ACK)

01. Background

Qwen3

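Qwen3 is the latest generation of the Qwen (Tongyi Qianwen) series and its first family of hybrid reasoning models. On benchmarks covering coding, math, and general capabilities, the flagship model Qwen3-235B-A22B achieves results competitive with top-tier models such as DeepSeek-R1, o1, o3-mini, Grok-3, and Gemini-2.5-Pro. In addition, the small MoE model Qwen3-30B-A3B outperforms QwQ-32B while activating only 10% as many parameters, and even a model as small as Qwen3-4B can rival the performance of Qwen2.5-72B-Instruct. Qwen3 supports multiple thinking modes, so users can control how much reasoning the model performs for a given task. The Qwen3 models support 119 languages and dialects and also offer improved support for MCP. For more information, see "Qwen3: Think Deeper, Act Faster".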

ACK

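Container Service for Kubernetes (ACK) is one of the first service platforms in the world to pass Kubernetes conformance certification, and it provides high-performance management of containerized applications. It integrates Alibaba Cloud virtualization, storage, networking, and security capabilities and simplifies tasks such as cluster creation and scaling, so that you can focus on developing and managing containerized applications.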

ACS

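Container Compute Service (ACS) is a container service product that uses Kubernetes as its user interface and provides compute resources that conform to container specifications.

ACS compute is attached to an ACK cluster in the form of virtual nodes, which gives the cluster a large amount of elastic capacity without being limited by the compute capacity of its own nodes. Once ACS takes over management of the infrastructure underneath Pod containers, Kubernetes no longer has to handle the placement and startup of individual Pods or track the resources of the underlying virtual machines; ACS makes sure that the resources a Pod needs are available at any time.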

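02. Prerequisites

An ACK cluster with GPU nodes has been created. For details, see Adding a GPU node pool to a cluster [1].

kubectl is connected to the cluster. For details, see Connecting to a cluster with kubectl [2].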

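03. Model deployment

Step 1: Prepare the Qwen3-8B model files

1. Run the following commands to download the Qwen3-8B model from Hugging Face.

Make sure the git-lfs extension is installed first. If it is not, install it with yum install git-lfs or apt-get install git-lfs. For other installation methods, see Installing git-lfs [3].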

git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Qwen/Qwen3-8B
cd Qwen3-8B/
git lfs pull

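2. Create a directory in OSS and upload the model to it. The previous step ends inside Qwen3-8B/, so return to its parent directory (cd ..) first; otherwise the ./Qwen3-8B source path below does not resolve.

For how to install and use the ossutil tool, see Installing ossutil [4].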

ossutil mkdir oss://<your-bucket-name>/models/Qwen3-8B
ossutil cp -r ./Qwen3-8B oss://<your-bucket-name>/models/Qwen3-8B

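3. Create a PV and a PVC. Configure a persistent volume (PV) and a persistent volume claim (PVC), both named llm-model, for the target cluster. The manifest below also creates the oss-secret Secret that holds the OSS credentials.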

apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: <your-oss-ak> # AccessKey ID used to access OSS
  akSecret: <your-oss-sk> # AccessKey Secret used to access OSS
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi 
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <your-bucket-name> # Bucket name
      url: <your-bucket-endpoint> # Endpoint, for example oss-cn-hangzhou-internal.aliyuncs.com
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: <your-model-path> # /models/Qwen3-8B/ in this example
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
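
The manifest above contains several <your-...> placeholders (AccessKey pair, bucket name, endpoint, model path). As a minimal sketch, assuming the manifest has been saved to a local file, the following Python helper lists any placeholder that is still unfilled before the file is applied. The function name and the regular expression are illustrative and not part of the original procedure.

```python
import re

# Matches template values such as <your-bucket-name> or <your-oss-ak>.
PLACEHOLDER_RE = re.compile(r"<your-[a-z0-9-]+>")


def find_placeholders(manifest_text):
    """Return (line_number, placeholder) pairs for every unfilled <your-...> value."""
    hits = []
    for line_no, line in enumerate(manifest_text.splitlines(), start=1):
        for match in PLACEHOLDER_RE.finditer(line):
            hits.append((line_no, match.group(0)))
    return hits
```

For example, calling find_placeholders(open("pv.yaml").read()) on a hypothetical pv.yaml should return an empty list once every value has been filled in.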

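Step 2: Deploy the inference service

Save the following manifest to a file and apply it with kubectl apply -f to start an inference service named qwen3.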

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: qwen3
  name: qwen3
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen3
  template:
    metadata:
      labels:
        app: qwen3
        # for ACS Cluster
        # alibabacloud.com/compute-class: gpu
        # Set the GPU model series. example-model is a placeholder; replace it with the actual value, such as T4
        # alibabacloud.com/gpu-model-series: "example-model"
    spec:
      volumes:
        - name: model
          persistentVolumeClaim:
            claimName: llm-model
      containers:
      - command:
        - sh
        - -c
        - vllm serve /models/Qwen3-8B/ --port 8000 --trust-remote-code --max-model-len 2048 --gpu-memory-utilization 0.98 
        image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.8.4
        imagePullPolicy: IfNotPresent
        name: vllm
        ports:
        - containerPort: 8000
          name: restful
          protocol: TCP
        readinessProbe:
          tcpSocket:
            port: 8000
          initialDelaySeconds: 30
        resources:
          limits:
            nvidia.com/gpu: "1"
            cpu: 8
            memory: 16Gi
          requests:
            nvidia.com/gpu: "1"
            cpu: 8
            memory: 16Gi
        volumeMounts:
          - mountPath: /models/Qwen3-8B/
            name: model
---
apiVersion: v1
kind: Service
metadata:
  name: qwen3
spec:
  ports:
    - name: http
      port: 8000
      protocol: TCP
      targetPort: 8000
  selector:
    app: qwen3
  type: ClusterIP
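
The vllm serve command above caps the context at --max-model-len 2048 and lets vLLM use up to 98% of GPU memory. To get a feel for how much of that memory one full-length sequence occupies in the KV cache, the standard per-token estimate can be sketched in Python. The layer count, KV head count, and head dimension below are assumptions used for illustration and are not taken from this article; read the actual values from the config.json in the downloaded model directory.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_mib(max_model_len, **model_shape):
    """KV cache size in MiB for a single sequence of max_model_len tokens."""
    return kv_cache_bytes_per_token(**model_shape) * max_model_len / 2**20


# Assumed shape values for illustration; replace with those in config.json.
ASSUMED_SHAPE = dict(num_layers=36, num_kv_heads=8, head_dim=128)

print(f"{kv_cache_mib(2048, **ASSUMED_SHAPE):.0f} MiB per 2048-token sequence")
```

The estimate ignores the model weights and activation memory, so it only indicates how the context length scales the cache, not the total GPU memory the Pod needs.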

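Step 3: Verify the inference service

1. Run the following command to set up port forwarding between the inference service and your local environment.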

kubectl port-forward svc/qwen3 8000:8000

Expected output:

Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
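
Port forwarding can be established before the model has finished loading, so the first request may fail. A minimal sketch of a readiness wait, assuming the OpenAI-compatible /v1/models endpoint of the vLLM server is reachable through the forwarded port, is shown below. The function names, retry count, and delay are illustrative choices.

```python
import http.client
import time
import urllib.request


def http_ok(url):
    """Return True if a GET request to url is answered with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and non-2xx responses.
        return False


def wait_until_ready(url, probe=http_ok, attempts=30, delay_s=2.0, sleep=time.sleep):
    """Probe url until it succeeds; return the attempt number, or None on give-up."""
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        if attempt < attempts:
            sleep(delay_s)
    return None
```

With the port forward running, wait_until_ready("http://localhost:8000/v1/models") returns the number of the first successful attempt, or None if the server never answered.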

2. Run the following command to send an inference request to the model inference service.

curl -H "Content-Type: application/json" http://localhost:8000/v1/chat/completions -d '{"model": "/models/Qwen3-8B/", "messages": [{"role": "user", "content": "Say this is a test!"}], "max_tokens": 512, "temperature": 0.7, "top_p": 0.9, "seed": 10}'
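
The same request can also be issued from Python using only the standard library. The sketch below mirrors the curl payload above (model path, sampling parameters, and seed); the helper names build_chat_request and send_chat_request are hypothetical.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000",
                       model="/models/Qwen3-8B/", max_tokens=512):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,  # must match the path passed to `vllm serve`
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
        "top_p": 0.9,
        "seed": 10,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout_s=120):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the port forward active, send_chat_request(build_chat_request("Say this is a test!")) returns the response as a Python dictionary.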

Expected output:

{"id":"chatcmpl-3e472d9f449648718a483279062f4987","object":"chat.completion","created":1745980464,"model":"/models/Qwen3-8B/","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"<think>\nOkay, the user said \"Say this is a test!\" and I need to respond. Let me think about how to approach this. First, I should acknowledge their message. Maybe start with a friendly greeting. Then, since they mentioned a test, perhaps they're testing my response capabilities. I should confirm that I'm here to help and offer assistance with anything they need. Keep it open-ended so they feel comfortable asking more. Also, make sure the tone is positive and encouraging. Let me put that together in a natural way.\n</think>\n\nHello! It's great to meet you. If you have any questions or need help with something, feel free to let me know. I'm here to assist! 😊","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":14,"total_tokens":161,"completion_tokens":147,"prompt_tokens_details":null},"prompt_logprobs":null}
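
In the response above, reasoning_content is null and the model's reasoning appears inline in content, wrapped in <think>...</think> tags. If a client needs only the final answer, one option is to separate the two on the client side. The following is a minimal sketch, assuming at most one think block per message; split_reasoning is a hypothetical helper name.

```python
import re

# DOTALL lets "." match the newlines inside the think block.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(content):
    """Split a message into (reasoning, answer); reasoning is "" if absent."""
    match = THINK_RE.search(content)
    if match is None:
        return "", content.strip()
    reasoning = match.group(1).strip()
    # Keep everything outside the think block as the answer.
    answer = (content[:match.start()] + content[match.end():]).strip()
    return reasoning, answer
```

Applied to response["choices"][0]["message"]["content"] from the output above, the second element of the returned tuple is the greeting that follows the closing </think> tag.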

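04. Bursting from an ACK Pro cluster to ACS compute

ACK also supports consuming ACS GPU compute through serverless Pods. ACS container compute is attached to the Kubernetes cluster in the form of virtual nodes, which gives the cluster a large amount of elastic capacity without being limited by the compute capacity of its own nodes.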

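Prerequisites

Container Service for Kubernetes is activated, the default roles are authorized, and the related cloud products are activated. For details, see Quickly creating an ACK managed cluster [5].

Log on to the Container Compute Service console [6] and follow the prompts to activate ACS.

Install the virtual node component (ACK Virtual Node) from the component center.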

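Model deployment

Deploying on ACS works almost the same way as deploying on ACK Pro. The only extra step is to add the ACS compute label alibabacloud.com/compute-class: gpu to the Pod, as shown below: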

apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen3
spec:
  template:
    metadata:
      labels:
        app: qwen3
        # for ACS compute
        alibabacloud.com/compute-class: gpu
        # Set the GPU model series. example-model is a placeholder; replace it with the actual value, such as T4
        alibabacloud.com/gpu-model-series: "example-model"
    spec:
      containers:
      ...
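
Because the ACS variant differs from the ACK Pro Deployment only in its Pod template labels, the change can also be expressed as a small transformation of an already loaded manifest. The sketch below assumes the Deployment is available as a Python dictionary (for example, parsed from YAML by a library of your choice); with_acs_labels is a hypothetical helper, and the GPU model series must be supplied by the caller.

```python
import copy


def with_acs_labels(deployment, gpu_model_series):
    """Return a copy of a Deployment dict whose Pod template carries the ACS labels."""
    patched = copy.deepcopy(deployment)  # leave the caller's manifest untouched
    labels = (
        patched.setdefault("spec", {})
        .setdefault("template", {})
        .setdefault("metadata", {})
        .setdefault("labels", {})
    )
    labels["alibabacloud.com/compute-class"] = "gpu"
    labels["alibabacloud.com/gpu-model-series"] = gpu_model_series
    return patched
```

Existing labels such as app: qwen3 are preserved, so the Service selector from the earlier step continues to match the Pods.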