Reference: https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html#pre-built-wheels
Environment
CUDA: 12.2
GPU memory: 40 GB
Python package management: conda
LLM: Qwen3-8B
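Before installing, it is worth confirming the machine matches the specs above. A minimal check (a sketch; it assumes the NVIDIA driver is installed and nvidia-smi is on PATH, and runs with any system Python):

# Environment check: confirm the driver-reported CUDA version and VRAM.
import subprocess

out = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
print(out)  # look for "CUDA Version: 12.2" and roughly 40 GB of memory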
Install vLLM
1) Create a conda environment
# Create a conda virtual environment named vllm, with Python 3.10
conda create -n vllm python=3.10
2) Activate the vllm environment
conda activate vllm
3) Install vLLM
pip install -U vllm \
    --pre \
    --extra-index-url https://wheels.vllm.ai/nightly
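To confirm the install worked, import vLLM inside the activated environment and print its version (a quick sanity check, not part of the official steps):

# Run inside the activated vllm environment.
import vllm

print(vllm.__version__)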
Start the API server
Reference: https://qwen.readthedocs.io/zh-cn/latest/deployment/vllm.html#
vllm serve Qwen/Qwen3-8B
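The server exposes an OpenAI-compatible API on port 8000 by default. Once it is up, a quick way to confirm the model is loaded is to list the served models (a minimal sketch using the openai SDK; the EMPTY key is a placeholder, since vLLM does not check it by default):

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
for model in client.models.list():
    print(model.id)  # should include Qwen/Qwen3-8B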
Chat
curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "Qwen/Qwen3-8B",
  "messages": [
    {"role": "user", "content": "Your role is now Liu Bei, and I am Guan Yu; please carry on the conversation in this setting. Elder brother, when shall we restore the Han dynasty?"}
  ],
  "temperature": 0.6,
  "top_p": 0.95,
  "top_k": 20,
  "max_tokens": 32768
}'
python
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[
        {"role": "user", "content": "Your role is now Liu Bei, and I am Guan Yu; please carry on the conversation in this setting. Elder brother, when shall we restore the Han dynasty?"},
    ],
    max_tokens=32768,
    temperature=0.6,
    top_p=0.95,
    extra_body={
        "top_k": 20,
    },
)
print("Chat response:", chat_response)
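For long generations such as the 32768-token budget above, streaming the output is often more pleasant than waiting for the full response. A sketch of a streaming variant (stream=True is standard in the openai SDK; the prompt is abbreviated here):

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Same request as above, but tokens are printed as they arrive.
stream = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[
        {"role": "user", "content": "Elder brother, when shall we restore the Han dynasty?"},
    ],
    max_tokens=32768,
    temperature=0.6,
    top_p=0.95,
    extra_body={"top_k": 20},
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()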