Alibaba Cloud PAI: A First Look at Inference with the Omni-Modal Model Qwen2.5-Omni-7B


1. omni go

1.1. Reference Documentation

https://modelscope.cn/models/Qwen/Qwen2.5-Omni-7B/files

https://github.com/QwenLM/Qwen2.5-Omni

1.2. Base Environment

1.2.1. uname -a

root@gpu-h20-69f8f8d484-7cd5n:/vllm-workspace# uname -a
Linux gpu-h20-69f8f8d484-7cd5n 5.10.134-008.15.kangaroo.al8.x86_64 #1 SMP Sun Mar 2 10:55:41 CST 2025 x86_64 x86_64 x86_64 GNU/Linux

1.2.2. nvcc

root@gpu-h20-69f8f8d484-7cd5n:~# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

1.2.3. nvidia-smi

root@gpu-h20-69f8f8d484-7cd5n:~# nvidia-smi 
Tue Apr  8 04:31:36 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H20                     Off |   00000000:00:01.0 Off |                    0 |
| N/A   32C    P0            114W /  500W |   23700MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H20                     Off |   00000000:00:02.0 Off |                    0 |
| N/A   37C    P0            118W /  500W |   26758MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

1.2.4. python3 -V

root@gpu-h20-69f8f8d484-7cd5n:~# python3 -V
Python 3.12.9

1.2.5. pip freeze

root@gpu-h20-69f8f8d484-7cd5n:~# pip freeze
accelerate==1.3.0
aiofiles==23.2.1
aiohappyeyeballs==2.4.4
aiohttp==3.11.12
aiohttp-cors==0.7.0
aiosignal==1.3.2
airportsdata==20241001
annotated-types==0.7.0
anyio==4.8.0
astor==0.8.1
attrs==25.1.0
audioread==3.0.1
av==14.3.0
bitsandbytes==0.45.1
blake3==1.0.4
blinker==1.4
boto3==1.36.14
botocore==1.36.14
cachetools==5.5.1
certifi==2025.1.31
cffi==1.17.1
charset-normalizer==3.4.1
click==8.1.8
cloudpickle==3.1.1
cmake==3.31.4
colorful==0.5.6
compressed-tensors==0.9.1
cryptography==3.4.8
dbus-python==1.2.18
decorator==5.2.1
decord==0.6.0
depyf==0.18.0
dill==0.3.9
diskcache==5.6.3
distlib==0.3.9
distro==1.7.0
distro-info==1.1+ubuntu0.2
einops==0.8.0
fastapi==0.115.8
ffmpeg==1.4
ffmpy==0.5.0
filelock==3.17.0
flash-attn @ file:///oss/sunyf/whl/flash_attn-2.7.3%2Bcu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl#sha256=cbb9f1af63fb1ebe3b6a16b52c0653ce60e29b42f95b1a16a95825eddb74f01d
flashinfer-python @ https://wheels.vllm.ai/flashinfer/524304395bd1d8cd7d07db083859523fcaa246a4/flashinfer_python-0.2.0.post1-cp312-cp312-linux_x86_64.whl#sha256=52d821a3972da8a6a874c1420fb6434cff641e0342e3fe192b2899b296b8b116
frozenlist==1.5.0
fsspec==2025.2.0
gguf==0.10.0
google-api-core==2.24.1
google-auth==2.38.0
googleapis-common-protos==1.67.0rc1
gradio==5.23.3
gradio_client==1.8.0
groovy==0.1.2
grpcio==1.70.0
h11==0.14.0
hf_transfer==0.1.9
httpcore==1.0.7
httplib2==0.20.2
httptools==0.6.4
httpx==0.28.1
huggingface-hub==0.28.1
humanize==4.11.0
idna==3.10
importlib-metadata==4.6.4
iniconfig==2.0.0
interegular==0.3.3
jeepney==0.7.1
Jinja2==3.1.5
jiter==0.8.2
jmespath==1.0.1
joblib==1.4.2
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
keyring==23.5.0
lark==1.2.2
launchpadlib==1.10.16
lazr.restfulclient==0.14.4
lazr.uri==1.0.6
lazy_loader==0.4
librosa==0.11.0
llvmlite==0.44.0
lm-format-enforcer==0.10.9
markdown-it-py==3.0.0
MarkupSafe==3.0.2
mdurl==0.1.2
mistral_common==1.5.2
modelscope==1.24.1
modelscope_studio==1.2.2
more-itertools==8.10.0
mpmath==1.3.0
msgpack==1.1.0
msgspec==0.19.0
multidict==6.1.0
nest-asyncio==1.6.0
networkx==3.4.2
ninja==1.11.1.3
numba==0.61.0
numpy==1.26.4
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-ml-py==12.570.86
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
oauthlib==3.2.0
openai==1.61.1
opencensus==0.11.4
opencensus-context==0.1.3
opencv-python-headless==4.11.0.86
orjson==3.10.16
outlines==0.1.11
outlines_core==0.1.26
packaging==24.2
pandas==2.2.3
partial-json-parser==0.2.1.1.post5
pillow==10.4.0
platformdirs==4.3.6
pluggy==1.5.0
pooch==1.8.2
prometheus-fastapi-instrumentator==7.0.2
prometheus_client==0.21.1
propcache==0.2.1
proto-plus==1.26.0
protobuf==5.29.3
psutil==6.1.1
py-cpuinfo==9.0.0
py-spy==0.4.0
pyasn1==0.6.1
pyasn1_modules==0.4.1
pybind11==2.13.6
pycountry==24.6.1
pycparser==2.22
pydantic==2.10.6
pydantic_core==2.27.2
pydub==0.25.1
Pygments==2.19.1
PyGObject==3.42.1
PyJWT==2.3.0
pyparsing==2.4.7
pytest==8.3.4
python-apt==2.4.0+ubuntu4
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-multipart==0.0.20
pytz==2025.2
PyYAML==6.0.2
pyzmq==26.2.1
qwen-omni-utils==0.0.3
ray==2.42.0
referencing==0.36.2
regex==2024.11.6
requests==2.32.3
rich==14.0.0
rpds-py==0.22.3
rsa==4.9
ruff==0.11.4
runai-model-streamer==0.12.0
runai-model-streamer-s3==0.12.0
s3transfer==0.11.2
safehttpx==0.1.6
safetensors==0.5.2
scikit-learn==1.6.1
scipy==1.15.2
SecretStorage==3.3.1
semantic-version==2.10.0
sentencepiece==0.2.0
setuptools==75.8.0
setuptools-scm==8.1.0
shellingham==1.5.4
six==1.16.0
smart-open==7.1.0
sniffio==1.3.1
soundfile==0.13.1
soxr==0.5.0.post1
starlette==0.45.3
sympy==1.13.1
threadpoolctl==3.6.0
tiktoken==0.7.0
timm==0.9.10
tokenizers==0.21.0
tomlkit==0.13.2
torch==2.5.1
torchaudio==2.5.1
torchvision==0.20.1
tqdm==4.67.1
transformers @ file:///root/transformers-4.50.0.dev0-py3-none-any.whl#sha256=3dffab149ebdfc8e9c938a4fa1c5e7cc4784ee1ffd9d14618931b6fe5f541654
triton==3.1.0
typer==0.15.2
typing_extensions==4.12.2
tzdata==2025.2
unattended-upgrades==0.1
urllib3==2.3.0
uvicorn==0.34.0
uvloop==0.21.0
virtualenv==20.29.1
vllm @ file:///vllm-workspace/dist/vllm-0.7.2-cp38-abi3-linux_x86_64.whl#sha256=d7f8438c3524442f45a6f1d33fdd0d548cf0bc7f5ce78b2ac5fca346143c6ddb
wadllib==1.3.6
watchfiles==1.0.4
websockets==14.2
wheel==0.37.1
wrapt==1.17.2
xformers==0.0.28.post3
xgrammar==0.1.11
yarl==1.18.3
zipp==1.0.0

1.2.6. Container Image

qwenllm/qwen-omni:2.5-cu121

1.3. Preparation

1.3.1. Model Download

Download the model from ModelScope:

https://modelscope.cn/models/Qwen/Qwen2.5-Omni-7B/files

# download to a specific local directory
modelscope download --model Qwen/Qwen2.5-Omni-7B --local_dir ./xxx
# or download into the default cache directory
modelscope download --model Qwen/Qwen2.5-Omni-7B
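
The download can also be done programmatically; a minimal sketch using ModelScope's snapshot_download (the call below reuses the local cache if the model is already present):

from modelscope import snapshot_download

# downloads the weights (or reuses the cache) and returns the local path
model_dir = snapshot_download("Qwen/Qwen2.5-Omni-7B")
print(model_dir)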

1.3.2. Environment Setup

# the web_demo script lives in this repo; see the launch command section below
git clone https://github.com/QwenLM/Qwen2.5-Omni
pip uninstall transformers
# pre-built wheel from the required transformers branch: transformers-4.50.0.dev0-py3-none-any.whl
pip install /path/to/transformers-4.50.0.dev0-py3-none-any.whl
pip install accelerate
# pre-downloaded flash-attention wheel matching the CUDA/torch/Python versions; fetching it also requires external network access
pip install /path/to/flash_attn-2.7.3+cu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
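
A quick sanity check that the environment matches the pip freeze above (a minimal sketch):

# sanity_check.py -- verify the key package versions and GPU visibility
import torch
import transformers
import flash_attn

print(transformers.__version__)   # expected: 4.50.0.dev0
print(flash_attn.__version__)     # expected: 2.7.3
print(torch.cuda.is_available())  # expected: True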

1.3.3. Test Code

import soundfile as sf
from modelscope import Qwen2_5OmniModel, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info
model_path = "/oss/model/Qwen2.5-Omni-7B"
# default: Load the model on the available device(s)
model = Qwen2_5OmniModel.from_pretrained(model_path, 
                                         torch_dtype="auto", 
                                         device_map="auto",
                                         attn_implementation="flash_attention_2",
                                        )
processor = Qwen2_5OmniProcessor.from_pretrained(model_path)
conversation = [
    {
        "role": "system",
        "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4"},
        ],
    },
]
# set use audio in video
USE_AUDIO_IN_VIDEO = True
# Preparation for inference
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = processor(text=text, audios=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = inputs.to(model.device).to(model.dtype)
# Inference: Generation of the output text and audio
text_ids, audio = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(text)
sf.write(
    "output.wav",
    audio.reshape(-1).detach().cpu().numpy(),
    samplerate=24000,
)

[screenshot: inference output]
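
If only text output is needed, the Qwen2.5-Omni README also documents a return_audio=False switch on generate (plus an spk argument to pick the voice); a hedged sketch, assuming those arguments exist in this transformers build:

# text-only generation skips the talker and saves memory (arguments assumed
# per the Qwen2.5-Omni README -- verify against your transformers build)
text_ids = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO, return_audio=False)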

1.3.4. File Size / GPU Memory Usage

The model files total about 21 GB on disk.

[screenshot: model file listing]

model = Qwen2_5OmniModel.from_pretrained("/oss/model/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto")

Loaded this way across two cards, each uses roughly 21 GB of GPU memory.

[screenshot: nvidia-smi with the model sharded across two GPUs]

GPU memory usage with the model on a single card plus an additional processor loaded:

[screenshot: nvidia-smi for the single-card setup]
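
To capture these numbers from inside Python rather than eyeballing nvidia-smi, a small sketch using torch.cuda.mem_get_info:

import torch

# report used/total memory per visible GPU, mirroring the nvidia-smi columns
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {(total - free) / 2**30:.1f} GiB used / {total / 2**30:.1f} GiB total")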

1.4. Launch Command

python3 web_demo.py --checkpoint-path /oss/model/Qwen2.5-Omni-7B --ui-language zh --server-name 0.0.0.0 --flash-attn2

1.5. Web UI Test

[screenshot: web demo UI]

1.6. FastAPI

A minimal FastAPI wrapper around the same inference code:

from typing import List

import soundfile as sf
import torch
from fastapi import FastAPI
from qwen_omni_utils import process_mm_info
from transformers import Qwen2_5OmniModel, Qwen2_5OmniProcessor

app = FastAPI()

model_path = "/oss/model/Qwen2.5-Omni-7B"
# default: Load the model on the available device(s)
model = Qwen2_5OmniModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2",
)
processor = Qwen2_5OmniProcessor.from_pretrained(model_path)
# conversation = [
#     {
#         "role": "system",
#         "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
#     },
#     {
#         "role": "user",
#         "content": [
#             {"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4"}
#         ]
#     }
# ]
# set use audio in video
USE_AUDIO_IN_VIDEO = True
@app.post("/sunyf_post")
def test_post(data: List[dict]):
    # Preparation for inference
    text = processor.apply_chat_template(data, add_generation_prompt=True, tokenize=False)
    audios, images, videos = process_mm_info(data, use_audio_in_video=USE_AUDIO_IN_VIDEO)
    inputs = processor(text=text, audios=audios, images=images, videos=videos, return_tensors="pt", padding=True,
                       use_audio_in_video=USE_AUDIO_IN_VIDEO)
    inputs = inputs.to(model.device).to(model.dtype)
    # Inference: generate the output text and audio
    text_ids, audio = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO)
    text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
    print(text)
    # save the generated speech to a wav file on the server side
    sf.write(
        "output.wav",
        audio.reshape(-1).detach().cpu().numpy(),
        samplerate=24000,
    )
    # return the decoded text so the caller gets a response body
    return {"text": text}

Start the service with hypercorn, then exercise the endpoint with curl:

python3 -m hypercorn qwen2_5_omni_7b_fastapi:app --bind 0.0.0.0:8000

curl -X 'POST' \
  'localhost:8000/sunyf_post' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '[
    {
        "role": "system",
        "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4"}
        ]
    }
]'
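
The same request from Python, a minimal client sketch using requests (generation over a video can take a while, hence the long timeout):

import requests

conversation = [
    {"role": "system",
     "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."},
    {"role": "user",
     "content": [{"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4"}]},
]
resp = requests.post("http://localhost:8000/sunyf_post", json=conversation, timeout=600)
print(resp.status_code, resp.json())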

1.7. vLLM

Update 2025-04-30:

Qwen2.5-Omni support landed in vLLM 0.8.5, but only the thinker part has been released (i.e., only text output is supported for now). It works on NVIDIA GPUs; PAI provides both an interim self-built image and a 0.8.5 image, and either works. PPU support still requires an adapted release from the engineering team. See GitHub:

https://github.com/vllm-project/vllm/blob/v0.8.5/vllm/model_executor/models/registry.py

https://docs.vllm.ai/en/v0.8.5.post1/models/supported_models.html

1.7.1. Deployment

Environment setup reference: https://github.com/QwenLM/Qwen2.5-Omni#deployment-with-vllm

pip install vllm==0.8.5
# the server starts fine without this extra, but inference requests then fail with
# ModuleNotFoundError: No module named 'librosa'
pip install vllm[audio]
# transformers must be built from source and installed here, otherwise vLLM cannot find the model arch
pip install transformers-4.52.0.dev0-py3-none-any.whl
# VLLM_USE_V1=0 falls back to the V0 engine (the Omni arch was not yet supported on the V1 engine at the time)
VLLM_USE_V1=0 vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-Omni-3B

1.7.2. Request Test

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
        {"type": "audio_url", "audio_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/cough.wav"}},
        {"type": "text", "text": "告诉我图片中的文字以及语音中的声音"}
    ]}
    ]
    }'
{
    "id": "chatcmpl-df07174d01a840468c6fc01b834285aa",
    "object": "chat.completion",
    "created": 1746261800,
    "model": "/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-Omni-3B",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "reasoning_content": null,
                "content": "图片中的文字是“TONGYI Qwen”,语音中的声音是一个人在咳嗽。",
                "tool_calls": []
            },
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": null
        }
    ],
    "usage": {
        "prompt_tokens": 148,
        "total_tokens": 168,
        "completion_tokens": 20,
        "prompt_tokens_details": null
    },
    "prompt_logprobs": null
}
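
Since vLLM exposes an OpenAI-compatible endpoint, the same request can also be sent through the OpenAI Python SDK; a sketch (the api_key value is a placeholder, vLLM ignores it by default):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-Omni-3B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
            {"type": "audio_url", "audio_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/cough.wav"}},
            {"type": "text", "text": "告诉我图片中的文字以及语音中的声音"},
        ]},
    ],
)
print(resp.choices[0].message.content)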

1.7.3. Load Test on L20

A simple load test on an L20 GPU, using the lmarena-ai/VisionArena-Chat Hugging Face dataset recommended in the vLLM docs (benchmark_serving.py comes from the benchmarks/ directory of the vLLM repo):

python3 benchmark_serving.py \
  --backend openai-chat \
  --model /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-Omni-3B \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path lmarena-ai/VisionArena-Chat \
  --hf-split train \
  --num-prompts 100 \
  --hf-output-len 1000 \
  --max-concurrency 10

[screenshot: benchmark results]

2. Related Issues

2.1. How to install a transformers branch/commit that is not yet merged or released

2.1.1. Direct pip install

Constraint: only works when external network access is available.

pip install git+https://github.com/huggingface/transformers@f742a644ca32e65758c3adb36225aef1731bd2a8
(base) [root@iZt4nh1mo71f8inpq7c9zaZ transformers]# pip install git+https://github.com/huggingface/transformers@f742a644ca32e65758c3adb36225aef1731bd2a8
Looking in indexes: http://mirrors.cloud.aliyuncs.com/pypi/simple/
Collecting git+https://github.com/huggingface/transformers@f742a644ca32e65758c3adb36225aef1731bd2a8
  Cloning https://github.com/huggingface/transformers (to revision f742a644ca32e65758c3adb36225aef1731bd2a8) to /tmp/pip-req-build-huloit5n
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-huloit5n
  Running command git rev-parse -q --verify 'sha^f742a644ca32e65758c3adb36225aef1731bd2a8'
  Running command git fetch -q https://github.com/huggingface/transformers f742a644ca32e65758c3adb36225aef1731bd2a8
  Running command git checkout -q f742a644ca32e65758c3adb36225aef1731bd2a8
  Resolved https://github.com/huggingface/transformers to commit f742a644ca32e65758c3adb36225aef1731bd2a8
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting filelock (from transformers==4.50.0.dev0)
  Using cached http://mirrors.cloud.aliyuncs.com/pypi/packages/4d/36/2a115987e2d8c300a974597416d9de88f2444426de9571f4b59b2cca3acc/filelock-3.18.0-py3-none-any.whl (16 kB)
Collecting huggingface-hub<1.0,>=0.26.0 (from transformers==4.50.0.dev0)
  Using cached http://mirrors.cloud.aliyuncs.com/pypi/packages/99/e3/2232d0e726d4d6ea69643b9593d97d0e7e6ea69c2fe9ed5de34d476c1c47/huggingface_hub-0.30.1-py3-none-any.whl (481 kB)
Collecting numpy>=1.17 (from transformers==4.50.0.dev0)
  Downloading http://mirrors.cloud.aliyuncs.com/pypi/packages/02/e2/e2cbb8d634151aab9528ef7b8bab52ee4ab10e076509285602c2a3a686e0/numpy-2.2.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16.1/16.1 MB 147.9 MB/s eta 0:00:00
Requirement already satisfied: packaging>=20.0 in /root/miniconda3/lib/python3.12/site-packages (from transformers==4.50.0.dev0) (24.2)
Collecting pyyaml>=5.1 (from transformers==4.50.0.dev0)
  Downloading http://mirrors.cloud.aliyuncs.com/pypi/packages/b9/2b/614b4752f2e127db5cc206abc23a8c19678e92b23c3db30fc86ab731d3bd/PyYAML-6.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (767 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 767.5/767.5 kB 83.1 MB/s eta 0:00:00
Collecting regex!=2019.12.17 (from transformers==4.50.0.dev0)
  Downloading http://mirrors.cloud.aliyuncs.com/pypi/packages/fb/13/e3b075031a738c9598c51cfbc4c7879e26729c53aa9cca59211c44235314/regex-2024.11.6-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (796 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 796.9/796.9 kB 82.2 MB/s eta 0:00:00
Requirement already satisfied: requests in /root/miniconda3/lib/python3.12/site-packages (from transformers==4.50.0.dev0) (2.32.3)
Collecting tokenizers<0.22,>=0.21 (from transformers==4.50.0.dev0)
  Using cached http://mirrors.cloud.aliyuncs.com/pypi/packages/8a/63/38be071b0c8e06840bc6046991636bcb30c27f6bb1e670f4f4bc87cf49cc/tokenizers-0.21.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)
Collecting safetensors>=0.4.1 (from transformers==4.50.0.dev0)
  Using cached http://mirrors.cloud.aliyuncs.com/pypi/packages/a6/f8/dae3421624fcc87a89d42e1898a798bc7ff72c61f38973a65d60df8f124c/safetensors-0.5.3-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (471 kB)
Requirement already satisfied: tqdm>=4.27 in /root/miniconda3/lib/python3.12/site-packages (from transformers==4.50.0.dev0) (4.67.1)
Collecting fsspec>=2023.5.0 (from huggingface-hub<1.0,>=0.26.0->transformers==4.50.0.dev0)
  Using cached http://mirrors.cloud.aliyuncs.com/pypi/packages/44/4b/e0cfc1a6f17e990f3e64b7d941ddc4acdc7b19d6edd51abf495f32b1a9e4/fsspec-2025.3.2-py3-none-any.whl (194 kB)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /root/miniconda3/lib/python3.12/site-packages (from huggingface-hub<1.0,>=0.26.0->transformers==4.50.0.dev0) (4.12.2)
Requirement already satisfied: charset-normalizer<4,>=2 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers==4.50.0.dev0) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers==4.50.0.dev0) (3.7)
Requirement already satisfied: urllib3<3,>=1.21.1 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers==4.50.0.dev0) (2.3.0)
Requirement already satisfied: certifi>=2017.4.17 in /root/miniconda3/lib/python3.12/site-packages (from requests->transformers==4.50.0.dev0) (2025.1.31)
Building wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml) ... done
  Created wheel for transformers: filename=transformers-4.50.0.dev0-py3-none-any.whl size=11030162 sha256=4ea444b63511af6f0b4e5ad40034871a05a691dc946a8f58fbc498d5f50f20d4
  Stored in directory: /root/.cache/pip/wheels/f2/41/36/989e2608a431821b658c608fd1a84528d94288ca63198c584c
Successfully built transformers
Installing collected packages: safetensors, regex, pyyaml, numpy, fsspec, filelock, huggingface-hub, tokenizers, transformers
Successfully installed filelock-3.18.0 fsspec-2025.3.2 huggingface-hub-0.30.1 numpy-2.2.4 pyyaml-6.0.2 regex-2024.11.6 safetensors-0.5.3 tokenizers-0.21.1 transformers-4.50.0.dev0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.

2.1.2. Packaging into a whl

This scenario is fairly common: unreleased branches live only on GitHub, and machines in mainland China installing directly via pip usually get stuck on network issues.

# recommended: use a virtual env to keep the Python version consistent
# create a directory to hold the project
mkdir -p transformers
cd transformers
# clone the git repo
git clone https://github.com/huggingface/transformers.git
# enter the project
cd transformers/
# fetch only this commit; fetching everything would be very large and is not recommended
git fetch -q https://github.com/huggingface/transformers f742a644ca32e65758c3adb36225aef1731bd2a8
# check out this commit
git checkout -q f742a644ca32e65758c3adb36225aef1731bd2a8
# the build needs these two packages
pip install wheel setuptools

# build the wheel from the repo root
python3 setup.py bdist_wheel
# inspect the whl; this file can be transferred to a machine in China by other means
ll dist/
# install it via pip
pip install dist/transformers-4.50.0.dev0-py3-none-any.whl

[screenshot: built wheel in dist/]

2.2. flash-attn installation error

The final error reported is: Failed to build flash-attn

File "/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/setup.py", line 486, in run
    urllib.request.urlretrieve(wheel_url, wheel_filename)

The root cause is that the code above times out trying to download the prebuilt wheel:

urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>

root@gpu-h20-69f8f8d484-7cd5n:/vllm-workspace# pip install flash-attn --no-build-isolation -i http://mirrors.cloud.aliyuncs.com/pypi/simple/ --trusted-host mirrors.cloud.aliyuncs.com
Looking in indexes: http://mirrors.cloud.aliyuncs.com/pypi/simple/
Collecting flash-attn
  Downloading http://mirrors.cloud.aliyuncs.com/pypi/packages/11/34/9bf60e736ed7bbe15055ac2dab48ec67d9dbd088d2b4ae318fd77190ab4e/flash_attn-2.7.4.post1.tar.gz (6.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.0/6.0 MB 74.5 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Requirement already satisfied: torch in /usr/local/lib/python3.12/dist-packages (from flash-attn) (2.5.1)
Requirement already satisfied: einops in /usr/local/lib/python3.12/dist-packages (from flash-attn) (0.8.0)
Requirement already satisfied: filelock in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (3.17.0)
Requirement already satisfied: typing-extensions>=4.8.0 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (4.12.2)
Requirement already satisfied: networkx in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (3.4.2)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (3.1.5)
Requirement already satisfied: fsspec in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (2025.2.0)
Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.4.127 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (12.4.127)
Requirement already satisfied: nvidia-cuda-runtime-cu12==12.4.127 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (12.4.127)
Requirement already satisfied: nvidia-cuda-cupti-cu12==12.4.127 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (12.4.127)
Requirement already satisfied: nvidia-cudnn-cu12==9.1.0.70 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (9.1.0.70)
Requirement already satisfied: nvidia-cublas-cu12==12.4.5.8 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (12.4.5.8)
Requirement already satisfied: nvidia-cufft-cu12==11.2.1.3 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (11.2.1.3)
Requirement already satisfied: nvidia-curand-cu12==10.3.5.147 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (10.3.5.147)
Requirement already satisfied: nvidia-cusolver-cu12==11.6.1.9 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (11.6.1.9)
Requirement already satisfied: nvidia-cusparse-cu12==12.3.1.170 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (12.3.1.170)
Requirement already satisfied: nvidia-nccl-cu12==2.21.5 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (2.21.5)
Requirement already satisfied: nvidia-nvtx-cu12==12.4.127 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (12.4.127)
Requirement already satisfied: nvidia-nvjitlink-cu12==12.4.127 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (12.4.127)
Requirement already satisfied: triton==3.1.0 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (3.1.0)
Requirement already satisfied: setuptools in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (75.8.0)
Requirement already satisfied: sympy==1.13.1 in /usr/local/lib/python3.12/dist-packages (from torch->flash-attn) (1.13.1)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.12/dist-packages (from sympy==1.13.1->torch->flash-attn) (1.3.0)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.12/dist-packages (from jinja2->torch->flash-attn) (3.0.2)
Building wheels for collected packages: flash-attn
  Building wheel for flash-attn (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [255 lines of output]
      
      
      torch.__version__  = 2.5.1+cu124
      
      
      /usr/local/lib/python3.12/dist-packages/setuptools/__init__.py:94: _DeprecatedInstaller: setuptools.installer and fetch_build_eggs are deprecated.
      !!
      
              ********************************************************************************
              Requirements should be satisfied by a PEP 517 installer.
              If you are using pip, you can try `pip install --use-pep517`.
              ********************************************************************************
      
      !!
        dist.fetch_build_eggs(dist.setup_requires)
      running bdist_wheel
      Guessing wheel URL:  https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
      Precompiled wheel not found. Building from source...
      running build
      running build_py
      creating build/lib.linux-x86_64-cpython-312/hopper
      copying hopper/test_kvcache.py -> build/lib.linux-x86_64-cpython-312/hopper
      copying hopper/benchmark_split_kv.py -> build/lib.linux-x86_64-cpython-312/hopper
      copying hopper/generate_kernels.py -> build/lib.linux-x86_64-cpython-312/hopper
      copying hopper/__init__.py -> build/lib.linux-x86_64-cpython-312/hopper
      copying hopper/benchmark_flash_attention_fp8.py -> build/lib.linux-x86_64-cpython-312/hopper
      copying hopper/test_flash_attn.py -> build/lib.linux-x86_64-cpython-312/hopper
      copying hopper/test_util.py -> build/lib.linux-x86_64-cpython-312/hopper
      copying hopper/padding.py -> build/lib.linux-x86_64-cpython-312/hopper
      copying hopper/benchmark_attn.py -> build/lib.linux-x86_64-cpython-312/hopper
      copying hopper/flash_attn_interface.py -> build/lib.linux-x86_64-cpython-312/hopper
      copying hopper/test_attn_kvcache.py -> build/lib.linux-x86_64-cpython-312/hopper
      copying hopper/setup.py -> build/lib.linux-x86_64-cpython-312/hopper
      creating build/lib.linux-x86_64-cpython-312/flash_attn
      copying flash_attn/flash_blocksparse_attention.py -> build/lib.linux-x86_64-cpython-312/flash_attn
      copying flash_attn/flash_attn_triton.py -> build/lib.linux-x86_64-cpython-312/flash_attn
      copying flash_attn/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn
      copying flash_attn/flash_blocksparse_attn_interface.py -> build/lib.linux-x86_64-cpython-312/flash_attn
      copying flash_attn/flash_attn_triton_og.py -> build/lib.linux-x86_64-cpython-312/flash_attn
      copying flash_attn/fused_softmax.py -> build/lib.linux-x86_64-cpython-312/flash_attn
      copying flash_attn/bert_padding.py -> build/lib.linux-x86_64-cpython-312/flash_attn
      copying flash_attn/flash_attn_interface.py -> build/lib.linux-x86_64-cpython-312/flash_attn
      creating build/lib.linux-x86_64-cpython-312/flash_attn/losses
      copying flash_attn/losses/cross_entropy.py -> build/lib.linux-x86_64-cpython-312/flash_attn/losses
      copying flash_attn/losses/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/losses
      creating build/lib.linux-x86_64-cpython-312/flash_attn/layers
      copying flash_attn/layers/patch_embed.py -> build/lib.linux-x86_64-cpython-312/flash_attn/layers
      copying flash_attn/layers/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/layers
      copying flash_attn/layers/rotary.py -> build/lib.linux-x86_64-cpython-312/flash_attn/layers
      creating build/lib.linux-x86_64-cpython-312/flash_attn/ops
      copying flash_attn/ops/fused_dense.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops
      copying flash_attn/ops/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops
      copying flash_attn/ops/activations.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops
      copying flash_attn/ops/layer_norm.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops
      copying flash_attn/ops/rms_norm.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops
      creating build/lib.linux-x86_64-cpython-312/flash_attn/utils
      copying flash_attn/utils/distributed.py -> build/lib.linux-x86_64-cpython-312/flash_attn/utils
      copying flash_attn/utils/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/utils
      copying flash_attn/utils/pretrained.py -> build/lib.linux-x86_64-cpython-312/flash_attn/utils
      copying flash_attn/utils/benchmark.py -> build/lib.linux-x86_64-cpython-312/flash_attn/utils
      copying flash_attn/utils/generation.py -> build/lib.linux-x86_64-cpython-312/flash_attn/utils
      creating build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
      copying flash_attn/flash_attn_triton_amd/utils.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
      copying flash_attn/flash_attn_triton_amd/bench.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
      copying flash_attn/flash_attn_triton_amd/bwd_ref.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
      copying flash_attn/flash_attn_triton_amd/fwd_decode.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
      copying flash_attn/flash_attn_triton_amd/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
      copying flash_attn/flash_attn_triton_amd/interface_torch.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
      copying flash_attn/flash_attn_triton_amd/bwd_prefill.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
      copying flash_attn/flash_attn_triton_amd/interface_fa.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
      copying flash_attn/flash_attn_triton_amd/test.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
      copying flash_attn/flash_attn_triton_amd/fwd_ref.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
      copying flash_attn/flash_attn_triton_amd/fwd_prefill.py -> build/lib.linux-x86_64-cpython-312/flash_attn/flash_attn_triton_amd
      creating build/lib.linux-x86_64-cpython-312/flash_attn/modules
      copying flash_attn/modules/mha.py -> build/lib.linux-x86_64-cpython-312/flash_attn/modules
      copying flash_attn/modules/block.py -> build/lib.linux-x86_64-cpython-312/flash_attn/modules
      copying flash_attn/modules/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/modules
      copying flash_attn/modules/mlp.py -> build/lib.linux-x86_64-cpython-312/flash_attn/modules
      copying flash_attn/modules/embedding.py -> build/lib.linux-x86_64-cpython-312/flash_attn/modules
      creating build/lib.linux-x86_64-cpython-312/flash_attn/models
      copying flash_attn/models/gptj.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
      copying flash_attn/models/baichuan.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
      copying flash_attn/models/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
      copying flash_attn/models/opt.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
      copying flash_attn/models/bert.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
      copying flash_attn/models/falcon.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
      copying flash_attn/models/llama.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
      copying flash_attn/models/btlm.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
      copying flash_attn/models/bigcode.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
      copying flash_attn/models/gpt_neox.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
      copying flash_attn/models/vit.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
      copying flash_attn/models/gpt.py -> build/lib.linux-x86_64-cpython-312/flash_attn/models
      creating build/lib.linux-x86_64-cpython-312/flash_attn/ops/triton
      copying flash_attn/ops/triton/k_activations.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops/triton
      copying flash_attn/ops/triton/cross_entropy.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops/triton
      copying flash_attn/ops/triton/__init__.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops/triton
      copying flash_attn/ops/triton/linear.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops/triton
      copying flash_attn/ops/triton/mlp.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops/triton
      copying flash_attn/ops/triton/layer_norm.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops/triton
      copying flash_attn/ops/triton/rotary.py -> build/lib.linux-x86_64-cpython-312/flash_attn/ops/triton
      running build_ext
      /usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py:416: UserWarning: The detected CUDA version (12.1) has a minor version mismatch with the version that was used to compile PyTorch (12.4). Most likely this shouldn't be a problem.
        warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
      /usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py:426: UserWarning: There are no x86_64-linux-gnu-g++ version bounds defined for CUDA version 12.1
        warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')
      building 'flash_attn_2_cuda' extension
      creating /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn
      creating /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src
      Emitting ninja build file /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/build.ninja...
      Compiling objects...
      Using envvar MAX_JOBS (13) as the number of workers...
      [1/85] c++ -MMD -MF /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/flash_api.o.d -fno-strict-overflow -Wsign-compare -DNDEBUG -g -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/flash_api.cpp -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/flash_api.o -O3 -std=c++17 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      [2/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim160_bf16_causal_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim160_bf16_causal_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim160_bf16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      [3/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim160_fp16_causal_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim160_fp16_causal_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim160_fp16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      [4/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim192_fp16_causal_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim192_fp16_causal_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim192_fp16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      [5/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim192_bf16_causal_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim192_bf16_causal_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim192_bf16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      [6/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_bf16_causal_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim256_bf16_causal_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_bf16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      FAILED: /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_bf16_causal_sm80.o
      /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_bf16_causal_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim256_bf16_causal_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_bf16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      Killed
      Killed
      [7/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_fp16_causal_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim128_fp16_causal_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_fp16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      [8/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_causal_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim128_bf16_causal_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      [9/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim160_bf16_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim160_bf16_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim160_bf16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      [10/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim160_fp16_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim160_fp16_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim160_fp16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      [11/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim192_fp16_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim192_fp16_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim192_fp16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      [12/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim192_bf16_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim192_bf16_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim192_bf16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      [13/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_fp16_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim128_fp16_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_fp16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      [14/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim32_bf16_causal_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim32_bf16_causal_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim32_bf16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      [15/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim128_bf16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      [16/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_fp16_causal_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim256_fp16_causal_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_fp16_causal_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      [17/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_bf16_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim256_bf16_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_bf16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      [18/85] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_fp16_sm80.o.d -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src -I/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/cutlass/include -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.12/dist-packages/torch/include/TH -I/usr/local/lib/python3.12/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/csrc/flash_attn/src/flash_bwd_hdim256_fp16_sm80.cu -o /tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/build/temp.linux-x86_64-cpython-312/csrc/flash_attn/src/flash_bwd_hdim256_fp16_sm80.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 --threads 2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=0
      ninja: build stopped: subcommand failed.
      Traceback (most recent call last):
        File "/usr/lib/python3.12/urllib/request.py", line 1344, in do_open
          h.request(req.get_method(), req.selector, req.data, headers,
        File "/usr/lib/python3.12/http/client.py", line 1338, in request
          self._send_request(method, url, body, headers, encode_chunked)
        File "/usr/lib/python3.12/http/client.py", line 1384, in _send_request
          self.endheaders(body, encode_chunked=encode_chunked)
        File "/usr/lib/python3.12/http/client.py", line 1333, in endheaders
          self._send_output(message_body, encode_chunked=encode_chunked)
        File "/usr/lib/python3.12/http/client.py", line 1093, in _send_output
          self.send(msg)
        File "/usr/lib/python3.12/http/client.py", line 1037, in send
          self.connect()
        File "/usr/lib/python3.12/http/client.py", line 1472, in connect
          super().connect()
        File "/usr/lib/python3.12/http/client.py", line 1003, in connect
          self.sock = self._create_connection(
                      ^^^^^^^^^^^^^^^^^^^^^^^^
        File "/usr/lib/python3.12/socket.py", line 865, in create_connection
          raise exceptions[0]
        File "/usr/lib/python3.12/socket.py", line 850, in create_connection
          sock.connect(sa)
      TimeoutError: [Errno 110] Connection timed out
      
      During handling of the above exception, another exception occurred:
      
      Traceback (most recent call last):
        File "/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/setup.py", line 486, in run
          urllib.request.urlretrieve(wheel_url, wheel_filename)
        File "/usr/lib/python3.12/urllib/request.py", line 240, in urlretrieve
          with contextlib.closing(urlopen(url, data)) as fp:
                                  ^^^^^^^^^^^^^^^^^^
        File "/usr/lib/python3.12/urllib/request.py", line 215, in urlopen
          return opener.open(url, data, timeout)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/usr/lib/python3.12/urllib/request.py", line 515, in open
          response = self._open(req, data)
                     ^^^^^^^^^^^^^^^^^^^^^
        File "/usr/lib/python3.12/urllib/request.py", line 532, in _open
          result = self._call_chain(self.handle_open, protocol, protocol +
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/usr/lib/python3.12/urllib/request.py", line 492, in _call_chain
          result = func(*args)
                   ^^^^^^^^^^^
        File "/usr/lib/python3.12/urllib/request.py", line 1392, in https_open
          return self.do_open(http.client.HTTPSConnection, req,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/usr/lib/python3.12/urllib/request.py", line 1347, in do_open
          raise URLError(err)
      urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>
      
      During handling of the above exception, another exception occurred:
      
      Traceback (most recent call last):
        File "/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py", line 2104, in _run_ninja_build
          subprocess.run(
        File "/usr/lib/python3.12/subprocess.py", line 573, in run
          raise CalledProcessError(retcode, process.args,
      subprocess.CalledProcessError: Command '['ninja', '-v', '-j', '13']' returned non-zero exit status 1.
      
      The above exception was the direct cause of the following exception:
      
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/setup.py", line 526, in <module>
          setup(
        File "/usr/local/lib/python3.12/dist-packages/setuptools/__init__.py", line 117, in setup
          return distutils.core.setup(**attrs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/usr/local/lib/python3.12/dist-packages/setuptools/_distutils/core.py", line 186, in setup
          return run_commands(dist)
                 ^^^^^^^^^^^^^^^^^^
        File "/usr/local/lib/python3.12/dist-packages/setuptools/_distutils/core.py", line 202, in run_commands
          dist.run_commands()
        File "/usr/local/lib/python3.12/dist-packages/setuptools/_distutils/dist.py", line 983, in run_commands
          self.run_command(cmd)
        File "/usr/local/lib/python3.12/dist-packages/setuptools/dist.py", line 999, in run_command
          super().run_command(command)
        File "/usr/local/lib/python3.12/dist-packages/setuptools/_distutils/dist.py", line 1002, in run_command
          cmd_obj.run()
        File "/tmp/pip-install-c2mvc8ry/flash-attn_9e815b8babcd4740b973fd92f49451e3/setup.py", line 503, in run
          super().run()
        File "/usr/lib/python3/dist-packages/wheel/bdist_wheel.py", line 299, in run
          self.run_command('build')
        File "/usr/local/lib/python3.12/dist-packages/setuptools/_distutils/cmd.py", line 339, in run_command
          self.distribution.run_command(command)
        File "/usr/local/lib/python3.12/dist-packages/setuptools/dist.py", line 999, in run_command
          super().run_command(command)
        File "/usr/local/lib/python3.12/dist-packages/setuptools/_distutils/dist.py", line 1002, in run_command
          cmd_obj.run()
        File "/usr/local/lib/python3.12/dist-packages/setuptools/_distutils/command/build.py", line 136, in run
          self.run_command(cmd_name)
        File "/usr/local/lib/python3.12/dist-packages/setuptools/_distutils/cmd.py", line 339, in run_command
          self.distribution.run_command(command)
        File "/usr/local/lib/python3.12/dist-packages/setuptools/dist.py", line 999, in run_command
          super().run_command(command)
        File "/usr/local/lib/python3.12/dist-packages/setuptools/_distutils/dist.py", line 1002, in run_command
          cmd_obj.run()
        File "/usr/local/lib/python3.12/dist-packages/setuptools/command/build_ext.py", line 99, in run
          _build_ext.run(self)
        File "/usr/local/lib/python3.12/dist-packages/setuptools/_distutils/command/build_ext.py", line 365, in run
          self.build_extensions()
        File "/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py", line 868, in build_extensions
          build_ext.build_extensions(self)
        File "/usr/local/lib/python3.12/dist-packages/setuptools/_distutils/command/build_ext.py", line 481, in build_extensions
          self._build_extensions_serial()
        File "/usr/local/lib/python3.12/dist-packages/setuptools/_distutils/command/build_ext.py", line 507, in _build_extensions_serial
          self.build_extension(ext)
        File "/usr/local/lib/python3.12/dist-packages/setuptools/command/build_ext.py", line 264, in build_extension
          _build_ext.build_extension(self, ext)
        File "/usr/local/lib/python3.12/dist-packages/setuptools/_distutils/command/build_ext.py", line 562, in build_extension
          objects = self.compiler.compile(
                    ^^^^^^^^^^^^^^^^^^^^^^
        File "/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py", line 681, in unix_wrap_ninja_compile
          _write_ninja_file_and_compile_objects(
        File "/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py", line 1784, in _write_ninja_file_and_compile_objects
          _run_ninja_build(
        File "/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py", line 2120, in _run_ninja_build
          raise RuntimeError(message) from e
      RuntimeError: Error compiling objects for extension
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for flash-attn
  Running setup.py clean for flash-attn
Failed to build flash-attn
[notice] A new release of pip is available: 25.0 -> 25.0.1
[notice] To update, run: python3.12 -m pip install --upgrade pip
ERROR: Failed to build installable wheels for some pyproject.toml based projects (flash-attn)
root@gpu-h20-69f8f8d484-7cd5n:/vllm-workspace#

Walking the stack back into the source shows that setup.py collects the versions on the current machine (torch, CUDA, Python, cxx11 ABI) and uses them to pull a prebuilt wheel from the flash-attention GitHub releases page; the failure here is, once again, a network issue.

https://github.com/Dao-AILab/flash-attention/releases
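
If outbound access to GitHub is blocked, one workaround (a sketch, not verified here) is to download the matching prebuilt wheel on a machine that does have network access and install it offline. The asset name encodes the torch/CUDA/Python/cxx11-ABI combination, so the filename below is only an example and must be matched to the local environment:

wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
pip install ./flash_attn-2.7.3+cu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl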


2.3. Missing package

This is caused by an outdated modelscope version; force-reinstall the latest release (see the command after the errors below).

ModuleNotFoundError: No module named 'modelscope'
ImportError: Cannot import available module of Qwen2_5OmniModel in modelscope, or related packages(['transformers', 'peft', 'diffusers'])
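
A force-reinstall along these lines pulls in the latest release (the --force mentioned above corresponds to pip's --force-reinstall):

pip install -U --force-reinstall modelscope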

2.4. Video via network URL occasionally times out; local files are fine

The workaround is to swap the share URL for a local path (see the sketch after the traceback below).

>>> import soundfile as sf
>>> 
>>> from modelscope import Qwen2_5OmniModel, Qwen2_5OmniProcessor
>>> from qwen_omni_utils import process_mm_info
>>> 
>>> 
>>> model = Qwen2_5OmniModel.from_pretrained("/oss/model/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto")
Qwen2_5OmniToken2WavModel does not support eager attention implementation, fall back to sdpa
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 5/5 [00:30<00:00,  6.07s/it]
/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py:4641: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  for key, value in torch.load(path).items():
>>> processor = Qwen2_5OmniProcessor.from_pretrained("/oss/model/Qwen2.5-Omni-7B")
>>> conversation = [
...     {
...         "role": "system",
...         "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
...     },
...     {
...         "role": "user",
...         "content": [
...             {"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4"},
...         ],
...     },
... ]
>>> USE_AUDIO_IN_VIDEO = True
>>> text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
>>> audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
/usr/local/lib/python3.12/dist-packages/librosa/core/audio.py:172: FutureWarning: librosa.core.audio.__audioread_load
  Deprecated as of librosa version 0.10.0.
  It will be removed in librosa version 1.0.
  y, sr_native = __audioread_load(path, offset, duration, dtype)
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/audioread/ffdec.py", line 188, in read_data
    data = self.stdout_reader.queue.get(timeout=timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/queue.py", line 179, in get
    raise Empty
_queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.12/dist-packages/qwen_omni_utils/v2_5/__init__.py", line 12, in process_mm_info
    audios = process_audio_info(conversations, use_audio_in_video)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/qwen_omni_utils/v2_5/audio_process.py", line 46, in process_audio_info
    audios.append(librosa.load(audioread.ffdec.FFmpegAudioFile(path), sr=16000)[0])
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/librosa/core/audio.py", line 172, in load
    y, sr_native = __audioread_load(path, offset, duration, dtype)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/decorator.py", line 235, in fun
    return caller(func, *(extras + args), **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/librosa/util/decorators.py", line 63, in __wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/librosa/core/audio.py", line 255, in __audioread_load
    for frame in input_file:
                 ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/audioread/ffdec.py", line 201, in read_data
    raise ReadTimeoutError('ffmpeg output: {}'.format(
audioread.ffdec.ReadTimeoutError: ffmpeg output: b'    Metadata:
      creation_time   : 2025-03-14T07:52:19.000000Z
      handler_name    : Core Media Audio
      vendor_id       : [0][0][0][0]
Stream mapping:
  Stream #0:1 -> #0:0 (aac (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, s16le, to \'pipe:\':
  Metadata:
    major_brand     : qt  
    minor_version   : 0
    compatible_brands: qt  
    com.apple.quicktime.artwork: {"data":{"editType":"default","edittime":835,"infoStickerId":"","is_ai_lyric":0,"is_aimusic_mv":0,"is_use_ai_image_generation":0,"is_use_ai_sound":0,"is_use_ai_video_generation":0,"is_use_aimusic_bgm":0,"is_use_aimusic_vocal":0,"is_use_graph_chart":0,"is_
    encoder         : Lavf58.76.100
  Stream #0:0(und): Audio: pcm_s16le, 44100 Hz, stereo, s16, 1411 kb/s (default)
    Metadata:
      creation_time   : 2025-03-14T07:52:19.000000Z
      handler_name    : Core Media Audio
      vendor_id       : [0][0][0][0]
      encoder         : Lavc58.134.100 pcm_s16le
size=       4kB time=00:00:00.00 bitrate=N/A speed=   0x    
size=     348kB time=00:00:01.99 bitrate=1427.6kbits/s speed=0.333x    
size=     432kB time=00:00:02.48 bitrate=1424.4kbits/s speed=0.319x    
size=     520kB time=00:00:02.99 bitrate=1422.1kbits/s speed=0.293x    
size=     604kB time=00:00:03.48 bitrate=1420.6kbits/s speed=0.247x    
size=     692kB time=00:00:03.99 bitrate=1419.4kbits/s speed=0.196x    
size=     776kB time=00:00:04.48 bitrate=1418.5kbits/s speed=0.165x    
size=     864kB time=00:00:04.99 bitrate=1417.8kbits/s speed=0.149x    
size=     928kB time=00:00:05.36 bitrate=1417.3kbits/s speed=0.157x    
size=     948kB time=00:00:05.47 bitrate=1417.2kbits/s speed=0.131x    
size=    1036kB time=00:00:05.98 bitrate=1416.7kbits/s speed=0.133x    
size=    1120kB time=00:00:06.47 bitrate=1416.3kbits/s speed=0.121x    
size=    1208kB time=00:00:06.98 bitrate=1415.9kbits/s speed=0.122x    
size=    1268kB time=00:00:07.33 bitrate=1415.7kbits/s speed=0.127x    
size=    1292kB time=00:00:07.47 bitrate=1415.6kbits/s speed=0.117x    
size=   '
>>> inputs = processor(text=text, audios=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'audios' is not defined
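
A minimal sketch of the local-path workaround, reusing the conversation defined above (the download location is just an example):

import urllib.request

local_video = "/tmp/draw.mp4"  # example local path
urllib.request.urlretrieve(
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4",
    local_video,
)
conversation[1]["content"][0]["video"] = local_video  # swap the share URL for the local file
audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)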


2.5. GPU memory issues

2.5.1. OOM

The non-flash-attn attention backend seems to have a problem here: a single ACP-certificate image maxes out a 96 GB card. After switching to attn_implementation="flash_attention_2" it runs normally (a loading sketch follows below); the community has a report of a similar OOM bug.
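
For reference, a loading call along the lines of the fix (a sketch based on the calls used elsewhere in this post; flash-attn only supports half precision, hence the explicit bfloat16):

import torch
from modelscope import Qwen2_5OmniModel

model = Qwen2_5OmniModel.from_pretrained(
    "/oss/model/Qwen2.5-Omni-7B",
    torch_dtype=torch.bfloat16,               # flash-attn requires fp16/bf16
    device_map="auto",
    attn_implementation="flash_attention_2",  # the switch that avoided the OOM
)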

predict history:  [{'role': 'system', 'content': 'You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.'}, {'role': 'user', 'content': 'Meijiao的ACP证书ID是多少?'}, {'role': 'user', 'content': [{'type': 'image', 'image': '/tmp/gradio/d97304ebdcff708153634e166099bef672384228166f6bed37e3cbb03d8e05db/ACP-yibei.png'}]}]
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/gradio/queueing.py", line 715, in process_events
    response = await route_utils.call_process_api(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/route_utils.py", line 322, in call_process_api
    output = await app.get_blocks().process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/blocks.py", line 2137, in process_api
    result = await self.call_function(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/blocks.py", line 1675, in call_function
    prediction = await utils.async_iteration(iterator)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/utils.py", line 735, in async_iteration
    return await anext(iterator)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/utils.py", line 729, in __anext__
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/anyio/_backends/_asyncio.py", line 2461, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/anyio/_backends/_asyncio.py", line 962, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/utils.py", line 712, in run_sync_iterator_async
    return next(iterator)
           ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/utils.py", line 873, in gen_wrapper
    response = next(iterator)
               ^^^^^^^^^^^^^^
  File "/root/github/Qwen2.5-Omni/web_demo.py", line 203, in chat_predict
    for chunk in predict(formatted_history, voice_choice):
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/github/Qwen2.5-Omni/web_demo.py", line 115, in predict
    text_ids, audio = model.generate(**inputs, spk=voice, use_audio_in_video=True)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 4796, in generate
    thinker_result = self.thinker.generate(
                     ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/generation/utils.py", line 2315, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/generation/utils.py", line 3303, in _sample
    outputs = self(**model_inputs, return_dict=True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 2667, in forward
    image_embeds = self.visual(pixel_values, grid_thw=image_grid_thw)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 1551, in forward
    hidden_states = blk(
                    ^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 1338, in forward
    hidden_states = hidden_states + self.attn(
                                    ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 1212, in forward
    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(q.dtype)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/functional.py", line 2142, in softmax
    ret = input.softmax(dim, dtype=dtype)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 55.04 GiB. GPU 0 has a total capacity of 94.99 GiB of which 46.20 GiB is free. Process 687142 has 48.79 GiB memory in use. Of the allocated memory 40.22 GiB is allocated by PyTorch, and 8.13 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
predict history:  [{'role': 'system', 'content': 'You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.'}, {'role': 'user', 'content': 'Meijiao的ACP证书ID是多少?'}, {'role': 'user', 'content': [{'type': 'image', 'image': '/tmp/gradio/d97304ebdcff708153634e166099bef672384228166f6bed37e3cbb03d8e05db/ACP-yibei.png'}]}]
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/gradio/queueing.py", line 715, in process_events
    response = await route_utils.call_process_api(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/route_utils.py", line 322, in call_process_api
    output = await app.get_blocks().process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/blocks.py", line 2137, in process_api
    result = await self.call_function(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/blocks.py", line 1675, in call_function
    prediction = await utils.async_iteration(iterator)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/utils.py", line 735, in async_iteration
    return await anext(iterator)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/utils.py", line 729, in __anext__
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/anyio/_backends/_asyncio.py", line 2461, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/anyio/_backends/_asyncio.py", line 962, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/utils.py", line 712, in run_sync_iterator_async
    return next(iterator)
           ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/utils.py", line 873, in gen_wrapper
    response = next(iterator)
               ^^^^^^^^^^^^^^
  File "/root/github/Qwen2.5-Omni/web_demo.py", line 203, in chat_predict
    for chunk in predict(formatted_history, voice_choice):
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/github/Qwen2.5-Omni/web_demo.py", line 115, in predict
    text_ids, audio = model.generate(**inputs, spk=voice, use_audio_in_video=True)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 4796, in generate
    thinker_result = self.thinker.generate(
                     ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/generation/utils.py", line 2315, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/generation/utils.py", line 3303, in _sample
    outputs = self(**model_inputs, return_dict=True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 2667, in forward
    image_embeds = self.visual(pixel_values, grid_thw=image_grid_thw)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 1551, in forward
    hidden_states = blk(
                    ^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 1338, in forward
    hidden_states = hidden_states + self.attn(
                                    ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 1210, in forward
    attn_weights = torch.matmul(q, k.transpose(1, 2)) / math.sqrt(self.head_dim)
                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 27.52 GiB. GPU 0 has a total capacity of 94.99 GiB of which 18.60 GiB is free. Process 687142 has 76.38 GiB memory in use. Of the allocated memory 70.03 GiB is allocated by PyTorch, and 5.84 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
predict history:  [{'role': 'system', 'content': 'You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.'}, {'role': 'user', 'content': 'what is the acp certificate id'}, {'role': 'user', 'content': [{'type': 'image', 'image': '/tmp/gradio/d97304ebdcff708153634e166099bef672384228166f6bed37e3cbb03d8e05db/ACP-yibei.png'}]}]
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/gradio/queueing.py", line 715, in process_events
    response = await route_utils.call_process_api(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/route_utils.py", line 322, in call_process_api
    output = await app.get_blocks().process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/blocks.py", line 2137, in process_api
    result = await self.call_function(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/blocks.py", line 1675, in call_function
    prediction = await utils.async_iteration(iterator)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/utils.py", line 735, in async_iteration
    return await anext(iterator)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/utils.py", line 729, in __anext__
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/anyio/_backends/_asyncio.py", line 2461, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/anyio/_backends/_asyncio.py", line 962, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/utils.py", line 712, in run_sync_iterator_async
    return next(iterator)
           ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/utils.py", line 873, in gen_wrapper
    response = next(iterator)
               ^^^^^^^^^^^^^^
  File "/root/github/Qwen2.5-Omni/web_demo.py", line 203, in chat_predict
    for chunk in predict(formatted_history, voice_choice):
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/github/Qwen2.5-Omni/web_demo.py", line 115, in predict
    text_ids, audio = model.generate(**inputs, spk=voice, use_audio_in_video=True)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 4796, in generate
    thinker_result = self.thinker.generate(
                     ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/generation/utils.py", line 2315, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/generation/utils.py", line 3303, in _sample
    outputs = self(**model_inputs, return_dict=True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 2667, in forward
    image_embeds = self.visual(pixel_values, grid_thw=image_grid_thw)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 1551, in forward
    hidden_states = blk(
                    ^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 1338, in forward
    hidden_states = hidden_states + self.attn(
                                    ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 1210, in forward
    attn_weights = torch.matmul(q, k.transpose(1, 2)) / math.sqrt(self.head_dim)
                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 27.52 GiB. GPU 0 has a total capacity of 94.99 GiB of which 18.60 GiB is free. Process 687142 has 76.38 GiB memory in use. Of the allocated memory 70.03 GiB is allocated by PyTorch, and 5.84 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
predict history:  [{'role': 'system', 'content': 'You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.'}, {'role': 'user', 'content': 'what is the acp certificate id'}, {'role': 'user', 'content': [{'type': 'image', 'image': '/tmp/gradio/d97304ebdcff708153634e166099bef672384228166f6bed37e3cbb03d8e05db/ACP-yibei.png'}]}]
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/gradio/queueing.py", line 715, in process_events
    response = await route_utils.call_process_api(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/route_utils.py", line 322, in call_process_api
    output = await app.get_blocks().process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/blocks.py", line 2137, in process_api
    result = await self.call_function(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/blocks.py", line 1675, in call_function
    prediction = await utils.async_iteration(iterator)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/utils.py", line 735, in async_iteration
    return await anext(iterator)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/utils.py", line 729, in __anext__
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/anyio/_backends/_asyncio.py", line 2461, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/anyio/_backends/_asyncio.py", line 962, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/utils.py", line 712, in run_sync_iterator_async
    return next(iterator)
           ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/gradio/utils.py", line 873, in gen_wrapper
    response = next(iterator)
               ^^^^^^^^^^^^^^
  File "/root/github/Qwen2.5-Omni/web_demo.py", line 203, in chat_predict
    for chunk in predict(formatted_history, voice_choice):
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/github/Qwen2.5-Omni/web_demo.py", line 115, in predict
    text_ids, audio = model.generate(**inputs, spk=voice, use_audio_in_video=True)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 4796, in generate
    thinker_result = self.thinker.generate(
                     ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/generation/utils.py", line 2315, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/generation/utils.py", line 3303, in _sample
    outputs = self(**model_inputs, return_dict=True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 2667, in forward
    image_embeds = self.visual(pixel_values, grid_thw=image_grid_thw)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 1551, in forward
    hidden_states = blk(
                    ^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 1338, in forward
    hidden_states = hidden_states + self.attn(
                                    ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py", line 1210, in forward
    attn_weights = torch.matmul(q, k.transpose(1, 2)) / math.sqrt(self.head_dim)
                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 27.52 GiB. GPU 0 has a total capacity of 94.99 GiB of which 18.60 GiB is free. Process 687142 has 76.38 GiB memory in use. Of the allocated memory 70.03 GiB is allocated by PyTorch, and 5.84 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
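
The allocation that fails is the full attention matrix materialized by the eager-mode vision attention (torch.matmul(q, k.transpose(1, 2))), so the 27.52 GiB request grows with the number of video frames. The expandable_segments hint in the error message only reduces fragmentation; it cannot help with a single 27.52 GiB tensor. A more direct mitigation, sketched below under the assumption that flash-attn 2 is installed in the image, is to load the model with a memory-efficient attention implementation:

import os
# Optional: reduce fragmentation between large transient allocations.
# Must be set before torch initializes CUDA.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
from transformers import Qwen2_5OmniModel

# FlashAttention-2 never materializes the seq_len x seq_len attention matrix,
# which is exactly what the eager path OOMs on for long videos.
model = Qwen2_5OmniModel.from_pretrained(
    "/oss/model/Qwen2.5-Omni-7B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)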


2.5.2. device_map logic

However, actually running a video request reveals two issues:

  1. With device_map = "auto", model.device appears to show only GPU 0, but in reality both GPUs are occupied. The device_map logic needs a closer look later; a quick way to inspect it is sketched after the code below.


model = Qwen2_5OmniModel.from_pretrained("/oss/model/Qwen2.5-Omni-7B", torch_dtype=torch.bfloat16, device_map="auto")
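
model.device only reports the device of the first parameter, which is why everything appears to sit on GPU 0. With device_map="auto", accelerate records the actual per-module placement in model.hf_device_map. A minimal check of which modules landed on which card:

# model.device reflects only the first parameter; hf_device_map has the truth.
print(model.device)
for name, device in model.hf_device_map.items():
    print(f"{name} -> {device}")  # expect modules split between cuda:0 and cuda:1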

Reloading the model and checking again: why is GPU 1 at 40 GB?


Checking GPU memory usage, it is indeed all held by this single PID; there may be cached allocations left over.


model = Qwen2_5OmniModel.from_pretrained("/oss/model/Qwen2.5-Omni-7B", torch_dtype=torch.bfloat16, device_map="cuda:0")

After restarting the Python process, with the model pinned to cuda:0, GPU 1 is indeed no longer used.

TODO: the logic of "auto" (is the default dp=2 rather than tp=2?)
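
As far as I know, accelerate's device_map="auto" is neither dp=2 nor tp=2: it shards the model layer by layer across the visible GPUs (naive model parallelism), so each card holds a disjoint slice of the weights and only one card computes at a time. A sketch of how one might preview that split without loading any weights (the max_memory budgets here are assumptions, not measured values):

from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, Qwen2_5OmniModel

config = AutoConfig.from_pretrained("/oss/model/Qwen2.5-Omni-7B")
with init_empty_weights():
    # Meta-device instantiation: no GPU or host memory is actually allocated.
    empty_model = Qwen2_5OmniModel(config)

# Preview the layer-wise split that device_map="auto" would produce.
device_map = infer_auto_device_map(empty_model, max_memory={0: "90GiB", 1: "90GiB"})
print(device_map)  # consecutive layers assigned to 0, the rest spilling to 1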

