Key points for running models with vLLM:
Start the model download first, and do the rest of the setup while it runs.
1. Install the Hugging Face CLI
pip install "huggingface_hub[hf_transfer]"
2. Download the model
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
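To put the weights in a predictable location instead of the default cache, --local-dir can be added; the target path below is only an example:
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --local-dir /data/models/DeepSeek-R1-Distill-Qwen-32B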
Software installation
1. Install the GPU driver, CUDA Toolkit, and cuDNN (make sure each version matches the installed driver)
Command to pin the driver version:
apt-mark hold nvidia-dkms-525
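To confirm the hold took effect and check which driver is actually loaded:
apt-mark showhold   # should list nvidia-dkms-525
nvidia-smi          # reports the driver version and visible GPUs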
Download and install the CUDA Toolkit:
wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
sudo sh cuda_11.8.0_520.61.05_linux.run
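After the runfile finishes, expose the toolchain on the path; a sketch assuming the default /usr/local/cuda-11.8 install prefix:
export PATH=/usr/local/cuda-11.8/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64:$LD_LIBRARY_PATH
nvcc --version   # should report CUDA 11.8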
wget https://developer.download.nvidia.com/compute/cudnn/9.7.1/local_installers/cudnn-local-repo-ubuntu2004-9.7.1_1.0-1_amd64.deb
sudo dpkg -i cudnn-local-repo-ubuntu2004-9.7.1_1.0-1_amd64.deb
sudo cp /var/cudnn-local-repo-ubuntu2004-9.7.1/cudnn-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cudnn
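A quick way to confirm the package landed, assuming the Debian packaging above:
dpkg -l | grep cudnn   # should list the installed cudnn packages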
2. Create a conda environment
conda create -n vllm python=3.11
conda activate vllm
3. Install torch (optional; it is installed together with vLLM)
pip install torch==2.1.1+cu121 torchvision==0.16.1+cu121 torchaudio==2.1.1+cu121 --index-url https://download.pytorch.org/whl/cu121
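To verify that torch sees the GPU before moving on:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"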
4. Install xformers (optional; it is installed together with vLLM)
5. Install vLLM (pick the wheel built for your CUDA and torch versions)
wget https://github.com/vllm-project/vllm/releases/download/v0.6.1.post1/vllm-0.6.1.post1+cu118-cp311-cp311-manylinux1_x86_64.whl
pip install vllm-0.6.1.post1+cu118-cp311-cp311-manylinux1_x86_64.whl
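A quick import check confirms the wheel installed cleanly:
python -c "import vllm; print(vllm.__version__)"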
6. Install flash-attn
pip install flash-attn (use a build that matches your CUDA version)
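flash-attn compiles against the locally installed torch, so its README recommends disabling build isolation; a sketch:
pip install flash-attn --no-build-isolation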
7. Launch
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
--trust-remote-code --served-model-name ds14 \
--gpu-memory-utilization 0.98 --tensor-parallel-size 1 \
--port 8000 --max-model-len=65536 --api-key=
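Once up, the server exposes an OpenAI-compatible API on the chosen port; a smoke test against the served-model-name above (ds14), assuming no API key was configured (otherwise add an Authorization: Bearer header):
curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "ds14", "messages": [{"role": "user", "content": "hello"}]}'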
If launch fails with a missing libnvJitLink.so.12 error, create a symlink:
ln -s /root/miniconda3/envs/vllm/lib/python3.11/site-packages/nvidia/nvjitlink/lib/libnvJitLink.so.12 /root/miniconda3/envs/vllm/lib/python3.11/site-packages/nvidia/cusparse/lib/libnvJitLink.so.12
export LD_LIBRARY_PATH=/root/miniconda3/envs/vllm/lib/python3.11/site-packages/nvidia/cusparse/lib:$LD_LIBRARY_PATH
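The export only lasts for the current shell; to make it survive new sessions, append it to the shell profile:
echo 'export LD_LIBRARY_PATH=/root/miniconda3/envs/vllm/lib/python3.11/site-packages/nvidia/cusparse/lib:$LD_LIBRARY_PATH' >> ~/.bashrc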
Open WebUI installation:
conda create -n open-webui python=3.11
conda activate open-webui
pip install open-webui
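To start the UI and point it at the vLLM server above; the OPENAI_API_BASE_URL value and port assume the serve command from step 7 running on the same host:
export OPENAI_API_BASE_URL=http://localhost:8000/v1   # Open WebUI's OpenAI-compatible backend setting
open-webui serve --port 3000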