1. omni go
1.1. Reference Documentation
https://modelscope.cn/models/Qwen/Qwen2.5-Omni-7B/files
https://github.com/QwenLM/Qwen2.5-Omni
1.2. Base Environment Info
1.2.1. uname -a
root@gpu-h20-69f8f8d484-7cd5n:/vllm-workspace# uname -a
Linux gpu-h20-69f8f8d484-7cd5n 5.10.134-008.15.kangaroo.al8.x86_64 #1 SMP Sun Mar 2 10:55:41 CST 2025 x86_64 x86_64 x86_64 GNU/Linux
1.2.2. nvcc
root@gpu-h20-69f8f8d484-7cd5n:~# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
1.2.3. nvidia-smi
root@gpu-h20-69f8f8d484-7cd5n:~# nvidia-smi
Tue Apr  8 04:31:36 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H20                     Off |   00000000:00:01.0 Off |                    0 |
| N/A   32C    P0            114W /  500W |   23700MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H20                     Off |   00000000:00:02.0 Off |                    0 |
| N/A   37C    P0            118W /  500W |   26758MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                                    Usage |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
1.2.4. python -V
root@gpu-h20-69f8f8d484-7cd5n:~# python3 -V
Python 3.12.9
1.2.5. pip freeze
root@gpu-h20-69f8f8d484-7cd5n:~# pip freeze accelerate==1.3.0 aiofiles==23.2.1 aiohappyeyeballs==2.4.4 aiohttp==3.11.12 aiohttp-cors==0.7.0 aiosignal==1.3.2 airportsdata==20241001 annotated-types==0.7.0 anyio==4.8.0 astor==0.8.1 attrs==25.1.0 audioread==3.0.1 av==14.3.0 bitsandbytes==0.45.1 blake3==1.0.4 blinker==1.4 boto3==1.36.14 botocore==1.36.14 cachetools==5.5.1 certifi==2025.1.31 cffi==1.17.1 charset-normalizer==3.4.1 click==8.1.8 cloudpickle==3.1.1 cmake==3.31.4 colorful==0.5.6 compressed-tensors==0.9.1 cryptography==3.4.8 dbus-python==1.2.18 decorator==5.2.1 decord==0.6.0 depyf==0.18.0 dill==0.3.9 diskcache==5.6.3 distlib==0.3.9 distro==1.7.0 distro-info==1.1+ubuntu0.2 einops==0.8.0 fastapi==0.115.8 ffmpeg==1.4 ffmpy==0.5.0 filelock==3.17.0 flash-attn @ file:///oss/sunyf/whl/flash_attn-2.7.3%2Bcu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl#sha256=cbb9f1af63fb1ebe3b6a16b52c0653ce60e29b42f95b1a16a95825eddb74f01d flashinfer-python @ https://wheels.vllm.ai/flashinfer/524304395bd1d8cd7d07db083859523fcaa246a4/flashinfer_python-0.2.0.post1-cp312-cp312-linux_x86_64.whl#sha256=52d821a3972da8a6a874c1420fb6434cff641e0342e3fe192b2899b296b8b116 frozenlist==1.5.0 fsspec==2025.2.0 gguf==0.10.0 google-api-core==2.24.1 google-auth==2.38.0 googleapis-common-protos==1.67.0rc1 gradio==5.23.3 gradio_client==1.8.0 groovy==0.1.2 grpcio==1.70.0 h11==0.14.0 hf_transfer==0.1.9 httpcore==1.0.7 httplib2==0.20.2 httptools==0.6.4 httpx==0.28.1 huggingface-hub==0.28.1 humanize==4.11.0 idna==3.10 importlib-metadata==4.6.4 iniconfig==2.0.0 interegular==0.3.3 jeepney==0.7.1 Jinja2==3.1.5 jiter==0.8.2 jmespath==1.0.1 joblib==1.4.2 jsonschema==4.23.0 jsonschema-specifications==2024.10.1 keyring==23.5.0 lark==1.2.2 launchpadlib==1.10.16 lazr.restfulclient==0.14.4 lazr.uri==1.0.6 lazy_loader==0.4 librosa==0.11.0 llvmlite==0.44.0 lm-format-enforcer==0.10.9 markdown-it-py==3.0.0 MarkupSafe==3.0.2 mdurl==0.1.2 mistral_common==1.5.2 modelscope==1.24.1 modelscope_studio==1.2.2 more-itertools==8.10.0 mpmath==1.3.0 msgpack==1.1.0 msgspec==0.19.0 multidict==6.1.0 nest-asyncio==1.6.0 networkx==3.4.2 ninja==1.11.1.3 numba==0.61.0 numpy==1.26.4 nvidia-cublas-cu12==12.4.5.8 nvidia-cuda-cupti-cu12==12.4.127 nvidia-cuda-nvrtc-cu12==12.4.127 nvidia-cuda-runtime-cu12==12.4.127 nvidia-cudnn-cu12==9.1.0.70 nvidia-cufft-cu12==11.2.1.3 nvidia-curand-cu12==10.3.5.147 nvidia-cusolver-cu12==11.6.1.9 nvidia-cusparse-cu12==12.3.1.170 nvidia-ml-py==12.570.86 nvidia-nccl-cu12==2.21.5 nvidia-nvjitlink-cu12==12.4.127 nvidia-nvtx-cu12==12.4.127 oauthlib==3.2.0 openai==1.61.1 opencensus==0.11.4 opencensus-context==0.1.3 opencv-python-headless==4.11.0.86 orjson==3.10.16 outlines==0.1.11 outlines_core==0.1.26 packaging==24.2 pandas==2.2.3 partial-json-parser==0.2.1.1.post5 pillow==10.4.0 platformdirs==4.3.6 pluggy==1.5.0 pooch==1.8.2 prometheus-fastapi-instrumentator==7.0.2 prometheus_client==0.21.1 propcache==0.2.1 proto-plus==1.26.0 protobuf==5.29.3 psutil==6.1.1 py-cpuinfo==9.0.0 py-spy==0.4.0 pyasn1==0.6.1 pyasn1_modules==0.4.1 pybind11==2.13.6 pycountry==24.6.1 pycparser==2.22 pydantic==2.10.6 pydantic_core==2.27.2 pydub==0.25.1 Pygments==2.19.1 PyGObject==3.42.1 PyJWT==2.3.0 pyparsing==2.4.7 pytest==8.3.4 python-apt==2.4.0+ubuntu4 python-dateutil==2.9.0.post0 python-dotenv==1.0.1 python-multipart==0.0.20 pytz==2025.2 PyYAML==6.0.2 pyzmq==26.2.1 qwen-omni-utils==0.0.3 ray==2.42.0 referencing==0.36.2 regex==2024.11.6 requests==2.32.3 rich==14.0.0 rpds-py==0.22.3 rsa==4.9 ruff==0.11.4 runai-model-streamer==0.12.0 
runai-model-streamer-s3==0.12.0 s3transfer==0.11.2 safehttpx==0.1.6 safetensors==0.5.2 scikit-learn==1.6.1 scipy==1.15.2 SecretStorage==3.3.1 semantic-version==2.10.0 sentencepiece==0.2.0 setuptools==75.8.0 setuptools-scm==8.1.0 shellingham==1.5.4 six==1.16.0 smart-open==7.1.0 sniffio==1.3.1 soundfile==0.13.1 soxr==0.5.0.post1 starlette==0.45.3 sympy==1.13.1 threadpoolctl==3.6.0 tiktoken==0.7.0 timm==0.9.10 tokenizers==0.21.0 tomlkit==0.13.2 torch==2.5.1 torchaudio==2.5.1 torchvision==0.20.1 tqdm==4.67.1 transformers @ file:///root/transformers-4.50.0.dev0-py3-none-any.whl#sha256=3dffab149ebdfc8e9c938a4fa1c5e7cc4784ee1ffd9d14618931b6fe5f541654 triton==3.1.0 typer==0.15.2 typing_extensions==4.12.2 tzdata==2025.2 unattended-upgrades==0.1 urllib3==2.3.0 uvicorn==0.34.0 uvloop==0.21.0 virtualenv==20.29.1 vllm @ file:///vllm-workspace/dist/vllm-0.7.2-cp38-abi3-linux_x86_64.whl#sha256=d7f8438c3524442f45a6f1d33fdd0d548cf0bc7f5ce78b2ac5fca346143c6ddb wadllib==1.3.6 watchfiles==1.0.4 websockets==14.2 wheel==0.37.1 wrapt==1.17.2 xformers==0.0.28.post3 xgrammar==0.1.11 yarl==1.18.3 zipp==1.0.0
1.2.6. Docker Image
qwenllm/qwen-omni:2.5-cu121
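A minimal sketch of starting this image with GPU access; the mount path, port, and shm size below are placeholders for this setup, not values from the image documentation:

# Sketch: run the container with all GPUs; adjust paths/ports to your environment
docker run --gpus all -it --rm \
    --shm-size=4g \
    -v /oss/model:/oss/model \
    -p 8901:8901 \
    qwenllm/qwen-omni:2.5-cu121 bash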
1.3. Preparation
1.3.1. Model Download
Download from ModelScope:
https://modelscope.cn/models/Qwen/Qwen2.5-Omni-7B/files
modelscope download --model Qwen/Qwen2.5-Omni-7B --local_dir ./xxx
modelscope download --model Qwen/Qwen2.5-Omni-7B
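Equivalently, via the ModelScope Python SDK. A minimal sketch; snapshot_download is the standard API, and the local_dir target below is an assumption (omit it to use the default cache, as the second CLI command above does):

from modelscope import snapshot_download

# Download the model repo to a local directory (path is a placeholder)
model_dir = snapshot_download("Qwen/Qwen2.5-Omni-7B", local_dir="./Qwen2.5-Omni-7B")
print(model_dir)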
1.3.2. Environment Setup
# The web_demo script lives in this repo; see the launch command below
git clone https://github.com/QwenLM/Qwen2.5-Omni
pip uninstall transformers
# Pre-built whl of the transformers branch: transformers-4.50.0.dev0-py3-none-any.whl
pip install /path/to/transformers-4.50.0.dev0-py3-none-any.whl
pip install accelerate
# Pre-downloaded flash-attention whl for the matching version; fetching it also requires external network access
pip install /path/to/flash_attn-2.7.3+cu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
1.3.3. Test Code
import soundfile as sf
from modelscope import Qwen2_5OmniModel, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

model_path = "/oss/model/Qwen2.5-Omni-7B"

# default: Load the model on the available device(s)
model = Qwen2_5OmniModel.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)
processor = Qwen2_5OmniProcessor.from_pretrained(model_path)

conversation = [
    {
        "role": "system",
        "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4"},
        ],
    },
]

# set use audio in video
USE_AUDIO_IN_VIDEO = True

# Preparation for inference
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = processor(text=text, audios=audios, images=images, videos=videos,
                   return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = inputs.to(model.device).to(model.dtype)

# Inference: Generation of the output text and audio
text_ids, audio = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO)

text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(text)
sf.write(
    "output.wav",
    audio.reshape(-1).detach().cpu().numpy(),
    samplerate=24000,
)
1.3.4. File Size / GPU Memory Usage
The model files total about 21 GB on disk.
model = Qwen2_5OmniModel.from_pretrained("/oss/model/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto")
Loaded this way across two cards, each card uses roughly 21 GB of GPU memory.
GPU memory usage with the model on a single card, and with the processor additionally loaded:
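To reproduce these numbers in-process, a small sketch using PyTorch's allocator statistics, run after the from_pretrained call above. Note nvidia-smi will report somewhat more than this, since it also counts the CUDA context and other non-PyTorch allocations:

import torch

# Per-GPU memory held by PyTorch after loading the model
for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 1024**3
    reserved = torch.cuda.memory_reserved(i) / 1024**3
    print(f"cuda:{i}: allocated={allocated:.1f} GiB, reserved={reserved:.1f} GiB")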
1.4. Launch Command
python3 web_demo.py --checkpoint-path /oss/model/Qwen2.5-Omni-7B --ui-language zh --server-name 0.0.0.0 --flash-attn2
1.5. Frontend Test
1.6. fastapi
from typing import List

import soundfile as sf
import torch
from fastapi import FastAPI
from transformers import Qwen2_5OmniModel, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

app = FastAPI()

model_path = "/oss/model/Qwen2.5-Omni-7B"

# Load the model and processor once at startup
model = Qwen2_5OmniModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2",
)
processor = Qwen2_5OmniProcessor.from_pretrained(model_path)

# set use audio in video
USE_AUDIO_IN_VIDEO = True


@app.post("/sunyf_post")
def test_post(data: List[dict]):
    # Preparation for inference: data is the conversation (list of role/content messages)
    text = processor.apply_chat_template(data, add_generation_prompt=True, tokenize=False)
    audios, images, videos = process_mm_info(data, use_audio_in_video=USE_AUDIO_IN_VIDEO)
    inputs = processor(text=text, audios=audios, images=images, videos=videos,
                       return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
    inputs = inputs.to(model.device).to(model.dtype)

    # Inference: generation of the output text and audio
    text_ids, audio = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO)

    text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
    print(text)
    sf.write(
        "output.wav",
        audio.reshape(-1).detach().cpu().numpy(),
        samplerate=24000,
    )
    # Return the decoded text so the client gets a JSON response
    return {"text": text}
python3 -m hypercorn qwen2_5_omni_7b_fastapi:app --bind 0.0.0.0:8000
curl -X 'POST' \
  'localhost:8000/sunyf_post' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '[
    {
      "role": "system",
      "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."
    },
    {
      "role": "user",
      "content": [
        {"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4"}
      ]
    }
  ]'
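The same call from Python, for reference. A sketch using requests; the long timeout is an assumption, since the server has to download the video and run generation before it responds:

import requests

messages = [
    {
        "role": "system",
        "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4"},
        ],
    },
]

# POST the conversation as the JSON body, matching the curl call above
resp = requests.post("http://localhost:8000/sunyf_post", json=messages, timeout=600)
print(resp.status_code, resp.json())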
1.7. vllm
Update 2025-04-30:
Qwen2.5-Omni support landed in vLLM 0.8.5, but only the thinker part has been released (i.e., text output only for now). It works on NVIDIA GPUs. On PAI there are two usable images, a temporarily compiled build and the 0.8.5 release; PPU still needs an adapted release from the development team. See GitHub:
https://github.com/vllm-project/vllm/blob/v0.8.5/vllm/model_executor/models/registry.py
https://docs.vllm.ai/en/v0.8.5.post1/models/supported_models.html
1.7.1. Deployment
Environment setup reference: https://github.com/QwenLM/Qwen2.5-Omni#deployment-with-vllm
pip install vllm==0.8.5
# Without the audio extras the server still starts, but inference requests fail with:
# ModuleNotFoundError: No module named 'librosa'
pip install vllm[audio]
# transformers must be built from source here, otherwise the model arch is not found
pip install transformers-4.52.0.dev0-py3-none-any.whl
VLLM_USE_V1=0 vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-Omni-3B
1.7.2. Request Test
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
        {"type": "audio_url", "audio_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/cough.wav"}},
        {"type": "text", "text": "告诉我图片中的文字以及语音中的声音"}
      ]}
    ]
  }'
{ "id": "chatcmpl-df07174d01a840468c6fc01b834285aa", "object": "chat.completion", "created": 1746261800, "model": "/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-Omni-3B", "choices": [ { "index": 0, "message": { "role": "assistant", "reasoning_content": null, "content": "图片中的文字是“TONGYI Qwen”,语音中的声音是一个人在咳嗽。", "tool_calls": [] }, "logprobs": null, "finish_reason": "stop", "stop_reason": null } ], "usage": { "prompt_tokens": 148, "total_tokens": 168, "completion_tokens": 20, "prompt_tokens_details": null }, "prompt_logprobs": null }
1.7.3. Load Test on L20
A simple load test was run on an L20, using the HF dataset lmarena-ai/VisionArena-Chat recommended in the vLLM docs.
python3 benchmark_serving.py \
  --backend openai-chat \
  --model /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-Omni-3B \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path lmarena-ai/VisionArena-Chat \
  --hf-split train \
  --num-prompts 100 \
  --hf-output-len 1000 \
  --max-concurrency 10
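benchmark_serving.py is not installed with the pip package; it lives in the vLLM source tree under benchmarks/. A sketch of fetching it to match the server version; the extra datasets dependency for --dataset-name hf is an assumption:

# Grab the benchmark script matching the server version
git clone -b v0.8.5 --depth 1 https://github.com/vllm-project/vllm.git
cd vllm/benchmarks
pip install datasets  # needed for loading HF datasets (assumption)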
2. Related Issues
2.1. How to install a transformers branch that has not been merged/released
2.1.1. Direct pip install
Limitation: only works when external network access is available.
pip install git+https://github.com/huggingface/transformers@f742a644ca32e65758c3adb36225aef1731bd2a8
(base) [root@iZt4nh1mo71f8inpq7c9zaZ transformers]# pip install git+https://github.com/huggingface/transformers@f742a644ca32e65758c3adb36225aef1731bd2a8
Looking in indexes: http://mirrors.cloud.aliyuncs.com/pypi/simple/
Collecting git+https://github.com/huggingface/transformers@f742a644ca32e65758c3adb36225aef1731bd2a8
  Cloning https://github.com/huggingface/transformers (to revision f742a644ca32e65758c3adb36225aef1731bd2a8) to /tmp/pip-req-build-huloit5n
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-huloit5n
  Running command git rev-parse -q --verify 'sha^f742a644ca32e65758c3adb36225aef1731bd2a8'
  Running command git fetch -q https://github.com/huggingface/transformers f742a644ca32e65758c3adb36225aef1731bd2a8
  Running command git checkout -q f742a644ca32e65758c3adb36225aef1731bd2a8
  Resolved https://github.com/huggingface/transformers to commit f742a644ca32e65758c3adb36225aef1731bd2a8
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting filelock (from transformers==4.50.0.dev0)
  Using cached http://mirrors.cloud.aliyuncs.com/pypi/packages/4d/36/2a115987e2d8c300a974597416d9de88f2444426de9571f4b59b2cca3acc/filelock-3.18.0-py3-none-any.whl (16 kB)
Collecting huggingface-hub<1.0,>=0.26.0 (from transformers==4.50.0.dev0)
  Using cached http://mirrors.cloud.aliyuncs.com/pypi/packages/99/e3/2232d0e726d4d6ea69643b9593d97d0e7e6ea69c2fe9ed5de34d476c1c47/huggingface_hub-0.30.1-py3-none-any.whl (481 kB)
Collecting numpy>=1.17 (from transformers==4.50.0.dev0)
  Downloading http://mirrors.cloud.aliyuncs.com/pypi/packages/02/e2/e2cbb8d634151aab9528ef7b8bab52ee4ab10e076509285602c2a3a686e0/numpy-2.2.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16.1/16.1 MB 147.9 MB/s eta 0:00:00
Requirement already satisfied: packaging>=20.0 in /root/miniconda3/lib/python3.12/site-packages (from transformers==4.50.0.dev0) (24.2)
Collecting pyyaml>=5.1 (from transformers==4.50.0.dev0)
  Downloading http://mirrors.cloud.aliyuncs.com/pypi/packages/b9/2b/614b4752f2e127db5cc206abc23a8c19678e92b23c3db30fc86ab731d3bd/PyYAML-6.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (767 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 767.5/767.5 kB 83.1 MB/s eta 0:00:00
Collecting regex!=2019.12.17 (from transformers==4.50.0.dev0)
  Downloading http://mirrors.cloud.aliyuncs.com/pypi/packages/fb/13/e3b075031a738c9598c51cfbc4c7879e26729c53aa9cca59211c44235314/regex-2024.11.6-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (796 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 796.9/796.9 kB 82.2 MB/s eta 0:00:00
Requirement already satisfied: requests in /root/miniconda3/lib/python3.12/site-packages (from transformers==4.50.0.dev0) (2.32.3)
Collecting tokenizers<0.22,>=0.21 (from transformers==4.50.0.dev0)
  Using cached http://mirrors.cloud.aliyuncs.com/pypi/packages/8a/63/38be071b0c8e06840bc6046991636bcb30c27f6bb1e670f4f4bc87cf49cc/tokenizers-0.21.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)
Collecting safetensors>=0.4.1 (from transformers==4.50.0.dev0)
  Using cached http://mirrors.cloud.aliyuncs.com/pypi/packages/a6/f8/dae3421624fcc87a89d42e1898a798bc7ff72c61f38973a65d60df8f124c/safetensors-0.5.3-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (471 kB)
Requirement already satisfied: tqdm>=4.27 in /root/miniconda3/lib/python3.12/site-packages (from transformers==4.50.0.dev0) (4.67.1)
Collecting fsspec>=2023.5.0 (from huggingface-hub<1.0,>=0.26.0->transformers==4.50.0.dev0)
  Using cached http://mirrors.cloud.aliyuncs.com/pypi/packages/44/4b/e0cfc1a6f17e990f3e64b7d941ddc4acdc7b19d6edd51abf495f32b1a9e4/fsspec-2025.3.2-py3-none-any.whl (194 kB)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /root/miniconda3/lib/python3.12/site-packages (from huggingface-hub<1.0,>=0.26.0->transformers==4.50.0.dev0) (4.12.2)
Requirement already satisfied: charset-normalizer<4,>=2 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers==4.50.0.dev0) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers==4.50.0.dev0) (3.7)
Requirement already satisfied: urllib3<3,>=1.21.1 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers==4.50.0.dev0) (2.3.0)
Requirement already satisfied: certifi>=2017.4.17 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers==4.50.0.dev0) (2025.1.31)
Building wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml) ... done
  Created wheel for transformers: filename=transformers-4.50.0.dev0-py3-none-any.whl size=11030162 sha256=4ea444b63511af6f0b4e5ad40034871a05a691dc946a8f58fbc498d5f50f20d4
  Stored in directory: /root/.cache/pip/wheels/f2/41/36/989e2608a431821b658c608fd1a84528d94288ca63198c584c
Successfully built transformers
Installing collected packages: safetensors, regex, pyyaml, numpy, fsspec, filelock, huggingface-hub, tokenizers, transformers
Successfully installed filelock-3.18.0 fsspec-2025.3.2 huggingface-hub-0.30.1 numpy-2.2.4 pyyaml-6.0.2 regex-2024.11.6 safetensors-0.5.3 tokenizers-0.21.1 transformers-4.50.0.dev0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
2.1.2. Packaging as a whl
This scenario is common: branches are usually released on GitHub, and machines in mainland China installing directly via pip get stuck on network issues.
# Recommend using a virtual env to keep the Python version consistent
# Create a directory for the project
mkdir -p transformers
cd transformers
# Clone the git repo
git clone https://github.com/huggingface/transformers.git
# Enter the project
cd transformers/
# Fetch only this revision; fetching everything would be very large and is not recommended
git fetch -q https://github.com/huggingface/transformers f742a644ca32e65758c3adb36225aef1731bd2a8
# Check out that revision
git checkout -q f742a644ca32e65758c3adb36225aef1731bd2a8
# Python needs these two packages
pip install wheel setuptools
# Build the wheel from the repo root
python3 setup.py bdist_wheel
# Inspect the whl; this file can be copied to a machine in China by other means
ll dist/
# Install it via pip
pip install dist/transformers-4.50.0.dev0-py3-none-any.whl
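A quick check that the right build ended up installed; the expected version string comes from the build log in 2.1.1 above:

import transformers

# Should print 4.50.0.dev0 for the branch built above
print(transformers.__version__)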
2.2. flash-attn installation errors
The final error reported is: Failed to build flash-attn
File "/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/setup.py", line 486, in run urllib.request.urlretrieve(wheel_url, wheel_filename)
The core error is that the code above times out when fetching the prebuilt wheel. In the full build log below, the same wheel fetch falls through to a source build, which then dies during nvcc compilation (the "Killed" lines, most likely the OOM killer given MAX_JOBS=13):
urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>
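The practical workaround, used in section 1.3.2 above: on a machine with external network access, download the prebuilt wheel matching your torch/CUDA/Python/ABI combination from the flash-attention GitHub releases, then install it locally. A sketch; the v2.7.3 URL below follows the "Guessing wheel URL" pattern visible in the build log and is an assumption:

# Wheel name must match exactly: torch 2.5, cu12, cp312, cxx11abi FALSE
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
# Copy the file to the offline machine, then:
pip install ./flash_attn-2.7.3+cu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl

The full log of the failed source build follows for reference.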
root@gpu-h20-69f8f8d484-7cd5n:/vllm-workspace# pip install flash-attn --no-build-isolation -i http://mirrors.cloud.aliyuncs.com/pypi/simple/ --trusted-host mirrors.cloud.aliyuncs.com Looking in indexes: http://mirrors.cloud.aliyuncs.com/pypi/simple/ Collecting flash-attn Downloading http://mirrors.cloud.aliyuncs.com/pypi/packages/11/34/9bf60e736ed7bbe15055ac2dab48ec67d9dbd088d2b4ae318fd77190ab4e/flash_attn-2.7.4.post1.tar.gz (6.0 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.0/6.0 MB 74.5 MB/s eta 0:00:00 Preparing metadata (setup.py) ... done Requirement already satisfied: torch in /usr/local/lib/python3.12/dist-packages (from flash-attn) (2.5.1) Requirement already satisfied: einops in /usr/local/lib/python3.12/dist-packages (from flash-attn) (0.8.0) Requirement already satisfied: filelock in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (3.17.0) Requirement already satisfied: typing-extensions>=4.8.0 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (4.12.2) Requirement already satisfied: networkx in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (3.4.2) Requirement already satisfied: jinja2 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (3.1.5) Requirement already satisfied: fsspec in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (2025.2.0) Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.4.127 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (12.4.127) Requirement already satisfied: nvidia-cuda-runtime-cu12==12.4.127 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (12.4.127) Requirement already satisfied: nvidia-cuda-cupti-cu12==12.4.127 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (12.4.127) Requirement already satisfied: nvidia-cudnn-cu12==9.1.0.70 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (9.1.0.70) Requirement already satisfied: nvidia-cublas-cu12==12.4.5.8 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (12.4.5.8) Requirement already satisfied: nvidia-cufft-cu12==11.2.1.3 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (11.2.1.3) Requirement already satisfied: nvidia-curand-cu12==10.3.5.147 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (10.3.5.147) Requirement already satisfied: nvidia-cusolver-cu12==11.6.1.9 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (11.6.1.9) Requirement already satisfied: nvidia-cusparse-cu12==12.3.1.170 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (12.3.1.170) Requirement already satisfied: nvidia-nccl-cu12==2.21.5 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (2.21.5) Requirement already satisfied: nvidia-nvtx-cu12==12.4.127 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (12.4.127) Requirement already satisfied: nvidia-nvjitlink-cu12==12.4.127 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (12.4.127) Requirement already satisfied: triton==3.1.0 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (3.1.0) Requirement already satisfied: setuptools in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (75.8.0) Requirement already satisfied: sympy==1.13.1 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (1.13.1) Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.12/dist-packages (from sympy==1.13.1->torch->flash-attn) (1.3.0) 
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.12/dist-packages (from jinja2->torch->flash-attn) (3.0.2) Building wheels for collected packages: flash-attn Building wheel for flash-attn (setup.py) ... - \ | \ error error: subprocess-exited-with-error × python setup.py bdist_wheel did not run successfully. │ exit code: 1 ╰─> [255 lines of output] torch.__version__ = 2.5.1+cu124 /usr/local/lib/python3.12/dist-packages/setuptools/__init__.py:94: _DeprecatedInstaller: setuptools.installer and fetch_build_eggs are deprecated. !! ******************************************************************************** Requirements should be satisfied by a PEP 517 installer. If you are using pip, you can try `pip install --use-pep517`. ******************************************************************************** !! dist.fetch_build_eggs(dist.setup_requires) running bdist_wheel Guessing wheel URL: https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl Precompiled wheel not found. Building from source... running build running build_py creating build/lib.linux-x86_64-cpython-312/hopper copying hopper/test_kvcache.py -> build/lib.linux-x86_64-cpython-312/hopper copying hopper/benchmark_split_kv.py -> build/lib.linux-x86_64-cpython-312/hopper copying hopper/generate_kernels.py -> build/lib.linux-x86_64-cpython-312/hopper copying hopper/__init__.py -> build/lib.linux-x86_64-cpython-312/hopper copying hopper/benchmark_flash_attention_fp8.py -> build/lib.linux-x86_64-cpython-312/hopper copying hopper/test_flash_attn.py -> build/lib.linux-x86_64-cpython-312/hopper copying hopper/test_util.py -> build/lib.linux-x86_64-cpython-312/hopper copying hopper/padding.py -> build/lib.linux-x86_64-cpython-312/hopper copying hopper/benchmark_attn.py -> build/lib.linux-x86_64-cpython-312/hopper copying hopper/flash_attn_interface.py -> build/lib.linux-x86_64-cpython-312/hopper copying hopper/test_attn_kvcache.py -> build/lib.linux-x86_64-cpython-312/hopper copying hopper/setup.py -> build/lib.linux-x86_64-cpython-312/hopper creating build/lib.linux-x86_64-cpython-312/flash_attn copying flash_attn/flash_blocksparse_attention.py -> build/lib.linux-x86_64-cpython-312/flash_attn copying flash_attn/flash_attn_triton.py -> build/lib.linux-x86_64-cpython-312/flash_attn copying flash_attn/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn copying flash_attn/flash_blocksparse_attn_interface.py -> build/lib.linux-x86_64-cpython-312/flash_attn copying flash_attn/flash_attn_triton_og.py -> build/lib.linux-x86_64-cpython-312/flash_attn copying flash_attn/fused_softmax.py -> build/lib.linux-x86_64-cpython-312/flash_attn copying flash_attn/bert_padding.py -> build/lib.linux-x86_64-cpython-312/flash_attn copying flash_attn/flash_attn_interface.py -> build/lib.linux-x86_64-cpython-312/flash_attn creating build/lib.linux-x86_64-cpython-312/flash_attn/losses copying flash_attn/losses/cross_entropy.py -> build/lib.linux-x86_64-cpython-312/flash_attn/losses copying flash_attn/losses/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/losses creating build/lib.linux-x86_64-cpython-312/flash_attn/layers copying flash_attn/layers/patch_embed.py -> build/lib.linux-x86_64-cpython-312/flash_attn/layers copying flash_attn/layers/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/layers copying flash_attn/layers/rotary.py -> build/lib.linux-x86_64-cpython-312/flash_attn/layers creating 
build/lib.linux-x86_64-cpython-312/flash_attn/ops copying flash_attn/ops/fused_dense.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops copying flash_attn/ops/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops copying flash_attn/ops/activations.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops copying flash_attn/ops/layer_norm.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops copying flash_attn/ops/rms_norm.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops creating build/lib.linux-x86_64-cpython-312/flash_attn/utils copying flash_attn/utils/distributed.py -> build/lib.linux-x86_64-cpython-312/flash_attn/utils copying flash_attn/utils/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/utils copying flash_attn/utils/pretrained.py -> build/lib.linux-x86_64-cpython-312/flash_attn/utils copying flash_attn/utils/benchmark.py -> build/lib.linux-x86_64-cpython-312/flash_attn/utils copying flash_attn/utils/generation.py -> build/lib.linux-x86_64-cpython-312/flash_attn/utils creating build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd copying flash_attn/flash_attn_triton_amd/utils.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd copying flash_attn/flash_attn_triton_amd/bench.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd copying flash_attn/flash_attn_triton_amd/bwd_ref.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd copying flash_attn/flash_attn_triton_amd/fwd_decode.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd copying flash_attn/flash_attn_triton_amd/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd copying flash_attn/flash_attn_triton_amd/interface_torch.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd copying flash_attn/flash_attn_triton_amd/bwd_prefill.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd copying flash_attn/flash_attn_triton_amd/interface_fa.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd copying flash_attn/flash_attn_triton_amd/test.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd copying flash_attn/flash_attn_triton_amd/fwd_ref.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd copying flash_attn/flash_attn_triton_amd/fwd_prefill.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd creating build/lib.linux-x86_64-cpython-312/flash_attn/modules copying flash_attn/modules/mha.py -> build/lib.linux-x86_64-cpython-312/flash_attn/modules copying flash_attn/modules/block.py -> build/lib.linux-x86_64-cpython-312/flash_attn/modules copying flash_attn/modules/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/modules copying flash_attn/modules/mlp.py -> build/lib.linux-x86_64-cpython-312/flash_attn/modules copying flash_attn/modules/embedding.py -> build/lib.linux-x86_64-cpython-312/flash_attn/modules creating build/lib.linux-x86_64-cpython-312/flash_attn/models copying flash_attn/models/gptj.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models copying flash_attn/models/baichuan.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models copying flash_attn/models/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models copying flash_attn/models/opt.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models copying flash_attn/models/bert.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models copying 
flash_attn/models/falcon.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models copying flash_attn/models/llama.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models copying flash_attn/models/btlm.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models copying flash_attn/models/bigcode.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models copying flash_attn/models/gpt_neox.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models copying flash_attn/models/vit.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models copying flash_attn/models/gpt.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models creating build/lib.linux-x86_64-cpython-312/flash_attn/ops/triton copying flash_attn/ops/triton/k_activations.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops/triton copying flash_attn/ops/triton/cross_entropy.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops/triton copying flash_attn/ops/triton/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops/triton copying flash_attn/ops/triton/linear.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops/triton copying flash_attn/ops/triton/mlp.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops/triton copying flash_attn/ops/triton/layer_norm.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops/triton copying flash_attn/ops/triton/rotary.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops/triton running build_ext /usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py:416: UserWarning: The detected CUDA version (12.1) has a minor version mismatch with the version that was used to compile PyTorch (12.4). Most likely this shouldn't be a problem. warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda)) /usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py:426: UserWarning: There are no x86_64-linux-gnu-g++ version bounds defined for CUDA version 12.1 warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}') building 'flash_attn_2_cuda' extension creating /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn creating /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src Emitting ninja build file /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/build.ninja... Compiling objects... Using envvar MAX_JOBS (13) as the number of workers... 
[1/85] c++ -MMD -MF /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/flash_api.o.d -fno-strict-overflow -Wsign-compare -DNDEBUG -g -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/flash_api.cpp -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/flash_api.o -O3 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 [2/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim160_bf16_causal_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim160_bf16_causal_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim160_bf16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 [3/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim160_fp16_causal_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn 
-I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim160_fp16_causal_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim160_fp16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 [4/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim192_fp16_causal_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim192_fp16_causal_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim192_fp16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 [5/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim192_bf16_causal_sm80.o.d 
-I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim192_bf16_causal_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim192_bf16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 [6/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_bf16_causal_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim256_bf16_causal_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_bf16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 FAILED: /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_bf16_causal_sm80.o 
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_bf16_causal_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim256_bf16_causal_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_bf16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 Killed Killed [7/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_fp16_causal_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim128_fp16_causal_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_fp16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' 
'-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 [8/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_causal_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim128_bf16_causal_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0 [9/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim160_bf16_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim160_bf16_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim160_bf16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H 
'-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
[10/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim160_fp16_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim160_fp16_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim160_fp16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
[11/85] ... [18/85]: identical nvcc invocations for the remaining flash_bwd_hdim{32,128,192,256}_{fp16,bf16}[_causal]_sm80.cu kernels (only the source/object file names differ; elided for brevity)
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/usr/lib/python3.12/urllib/request.py", line 1344, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  [... http.client / socket frames elided ...]
  File "/usr/lib/python3.12/socket.py", line 850, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/setup.py", line 486, in run
    urllib.request.urlretrieve(wheel_url, wheel_filename)
  File "/usr/lib/python3.12/urllib/request.py", line 240, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  [... urllib.request frames elided ...]
  File "/usr/lib/python3.12/urllib/request.py", line 1347, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py", line 2104, in _run_ninja_build
    subprocess.run(
  File "/usr/lib/python3.12/subprocess.py", line 573, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v', '-j', '13']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<string>", line 2, in <module>
  File "<pip-setuptools-caller>", line 34, in <module>
  File "/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/setup.py", line 526, in <module>
    setup(
  [... setuptools / distutils build_ext frames elided ...]
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py", line 1784, in _write_ninja_file_and_compile_objects
    _run_ninja_build(
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py", line 2120, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for flash-attn
Running setup.py clean for flash-attn
Failed to build flash-attn
ERROR: Failed to build installable wheels for some pyproject.toml based projects (flash-attn)
root@gpu-h20-69f8f8d484-7cd5n:/vllm-workspace#
Reading the setup.py source against the stack trace shows what happens: it collects the versions on the current machine (torch, CUDA, flash-attn, cxx11 ABI) and uses them to download the matching prebuilt wheel from the flash-attention GitHub releases. The failure here is again a network issue:
https://github.com/Dao-AILab/flash-attention/releases
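A practical workaround is to fetch the matching prebuilt wheel on a host that can reach GitHub and install it offline. A minimal sketch, assuming the v2.7.3 release and its usual asset-naming convention; the wheel name below is an assumption and must encode the local torch/CUDA/cxx11-ABI/Python versions, which is exactly what setup.py derives:

import urllib.request

# Assumed asset name: adjust the versions to match the target machine
# (torch 2.5 / CUDA 12 / cxx11abi FALSE / CPython 3.12 in this environment).
wheel = "flash_attn-2.7.3+cu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl"
url = "https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/" + wheel
urllib.request.urlretrieve(url, wheel)
# Then, on the GPU machine: pip install ./<wheel>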
2.3. Missing package
The cause is an outdated modelscope; force a re-download of the latest version (pip install --upgrade --force-reinstall modelscope). The errors look like this:
ModuleNotFoundError: No module named 'modelscope'
ImportError: Cannot import available module of Qwen2_5OmniModel in modelscope, or related packages(['transformers', 'peft', 'diffusers'])
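A quick post-upgrade check (a minimal sketch; these are the same imports used in the session in 2.4 below):

# If the upgrade worked, the Omni classes import cleanly from modelscope.
from modelscope import Qwen2_5OmniModel, Qwen2_5OmniProcessor
print(Qwen2_5OmniModel.__name__, Qwen2_5OmniProcessor.__name__)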
2.4. Video via a remote URL occasionally times out; local files work fine
Going forward, the shared URL can be swapped for a local file path (see the sketch after the session log below).
>>> import soundfile as sf
>>>
>>> from modelscope import Qwen2_5OmniModel, Qwen2_5OmniProcessor
>>> from qwen_omni_utils import process_mm_info
>>>
>>> model = Qwen2_5OmniModel.from_pretrained("/oss/model/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto")
Qwen2_5OmniToken2WavModel does not support eager attention implementation, fall back to sdpa
Loading checkpoint shards: 100%|██████████| 5/5 [00:30<00:00, 6.07s/it]
/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py:4641: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value) ... (remainder of the torch.load safety warning elided)
  for key, value in torch.load(path).items():
>>> processor = Qwen2_5OmniProcessor.from_pretrained("/oss/model/Qwen2.5-Omni-7B")
>>> conversation = [
...     {
...         "role": "system",
...         "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
...     },
...     {
...         "role": "user",
...         "content": [
...             {"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4"},
...         ],
...     },
... ]
>>> USE_AUDIO_IN_VIDEO = True
>>> text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
>>> audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
/usr/local/lib/python3.12/dist-packages/librosa/core/audio.py:172: FutureWarning: librosa.core.audio.__audioread_load
        Deprecated as of librosa version 0.10.0. It will be removed in librosa version 1.0.
  y, sr_native = __audioread_load(path, offset, duration, dtype)
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/audioread/ffdec.py", line 188, in read_data
    data = self.stdout_reader.queue.get(timeout=timeout)
  File "/usr/lib/python3.12/queue.py", line 179, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.12/dist-packages/qwen_omni_utils/v2_5/__init__.py", line 12, in process_mm_info
    audios = process_audio_info(conversations, use_audio_in_video)
  File "/usr/local/lib/python3.12/dist-packages/qwen_omni_utils/v2_5/audio_process.py", line 46, in process_audio_info
    audios.append(librosa.load(audioread.ffdec.FFmpegAudioFile(path), sr=16000)[0])
  File "/usr/local/lib/python3.12/dist-packages/librosa/core/audio.py", line 172, in load
    y, sr_native = __audioread_load(path, offset, duration, dtype)
  [... decorator / librosa wrapper frames elided ...]
  File "/usr/local/lib/python3.12/dist-packages/audioread/ffdec.py", line 201, in read_data
    raise ReadTimeoutError('ffmpeg output: {}'.format(
audioread.ffdec.ReadTimeoutError: ffmpeg output: b'... Stream #0:1 -> #0:0 (aac (native) -> pcm_s16le (native)) ... size= 348kB time=00:00:01.99 bitrate=1427.6kbits/s speed=0.333x ... size= 1292kB time=00:00:07.47 bitrate=1415.6kbits/s speed=0.117x ...' (the decode speed keeps dropping while streaming over the network until the read times out; full ffmpeg log elided)
>>> inputs = processor(text=text, audios=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'audios' is not defined
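The workaround in code form, a minimal sketch continuing the session above (the local path /tmp/draw.mp4 is an arbitrary choice):

import urllib.request

# Download the video once, then point the conversation at the local copy so
# ffmpeg decodes from disk instead of streaming over the network.
local_video = "/tmp/draw.mp4"
urllib.request.urlretrieve(
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4",
    local_video,
)
conversation[1]["content"][0]["video"] = local_video
audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)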
2.5. GPU memory issues
2.5.1. OOM
The non-flash-attn attention backend implementation appears to have a problem: a single ACP certificate image maxes out the full 96 GB of GPU memory. After switching to attn_implementation="flash_attention_2" it behaves normally; a similar OOM bug has been reported in the community.
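A minimal sketch of the loading call that avoids the OOM (model path as used elsewhere in this doc; flash-attn must be installed first, see the build section above):

import torch
from modelscope import Qwen2_5OmniModel

# The flash-attention-2 backend avoids materializing the full
# (seq_len x seq_len) attention matrix that the eager path allocates.
model = Qwen2_5OmniModel.from_pretrained(
    "/oss/model/Qwen2.5-Omni-7B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

The original failing requests looked like this: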
predict history: [{'role': 'system', 'content': 'You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.'}, {'role': 'user', 'content': 'Meijiao的ACP证书ID是多少?'}, {'role': 'user', 'content': [{'type': 'image', 'image': '/tmp/gradio/d97304ebdcff708153634e166099bef672384228166f6bed37e3cbb03d8e05db/ACP-yibei.png'}]}]
Traceback (most recent call last):
  [... gradio / anyio event-loop frames elided ...]
  File "/root/github/Qwen2.5-Omni/web_demo.py", line 203, in chat_predict
    for chunk in predict(formatted_history, voice_choice):
  File "/root/github/Qwen2.5-Omni/web_demo.py", line 115, in predict
    text_ids, audio = model.generate(**inputs, spk=voice, use_audio_in_video=True)
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 4796, in generate
    thinker_result = self.thinker.generate(
  File "/usr/local/lib/python3.12/dist-packages/transformers/generation/utils.py", line 3303, in _sample
    outputs = self(**model_inputs, return_dict=True)
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 2667, in forward
    image_embeds = self.visual(pixel_values, grid_thw=image_grid_thw)
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 1551, in forward
    hidden_states = blk(
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 1338, in forward
    hidden_states = hidden_states + self.attn(
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 1212, in forward
    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(q.dtype)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 55.04 GiB. GPU 0 has a total capacity of 94.99 GiB of which 46.20 GiB is free. Process 687142 has 48.79 GiB memory in use. Of the allocated memory 40.22 GiB is allocated by PyTorch, and 8.13 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Three more retries (the same image, once with the same prompt and twice with the English prompt 'what is the acp certificate id') fail with an identical stack, just one line earlier in the vision attention, where the raw attention matrix is materialized:

  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 1210, in forward
    attn_weights = torch.matmul(q, k.transpose(1, 2)) / math.sqrt(self.head_dim)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 27.52 GiB. GPU 0 has a total capacity of 94.99 GiB of which 18.60 GiB is free. Process 687142 has 76.38 GiB memory in use. Of the allocated memory 70.03 GiB is allocated by PyTorch, and 5.84 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2.5.2. device_map logic
In practice, though, running a single video request looks like this, and there are two issues here:
- With device_map="auto", model.device appears to sit only on GPU 0, yet both cards are in fact occupied; the device_map logic needs a closer look later (see the inspection sketch after the loading call below).
model = Qwen2_5OmniModel.from_pretrained("/oss/model/Qwen2.5-Omni-7B", torch_dtype=torch.bfloat16, device_map="auto")
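A quick way to see the actual placement: model.device only reports the device of the first parameters, while the hf_device_map attribute recorded by accelerate shows the full layout. A minimal sketch:

# With device_map="auto", accelerate records the per-module placement here;
# expect entries on both cuda:0 and cuda:1 even though model.device is cuda:0.
print(model.device)
print(model.hf_device_map)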
Reloading it gives the following; why does GPU 1 show 40 GB?
Judging by the memory usage, it is indeed all under this same PID; it may be cached (reserved) allocations.
model = Qwen2_5OmniModel.from_pretrained("/oss/model/Qwen2.5-Omni-7B", torch_dtype=torch.bfloat16, device_map="cuda:0")
After restarting the Python process with device_map pointed at cuda:0, GPU 1 is indeed no longer used.
TODO: work out the logic of "auto" (is the default effectively dp=2 rather than tp=2?). Note that accelerate's device_map="auto" places whole submodules on different GPUs (naive model parallelism), which would explain both cards being occupied while model.device still reports cuda:0.