Help needed: GPU not found when running a model on Platform for AI (PAI) PAI-DSW

Screenshot 2024-10-16 03.28.18.png

(envTimeLLM) root@dsw-456910-599d598865-67g8w:/mnt/workspace/Time-LLM# bash ./scripts/TimeLLM_ETTh1.sh
The following values were not passed to accelerate launch and had defaults used instead:
    --num_machines was set to a value of 1
    --dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
[2024-10-16 03:14:13,335] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-16 03:14:13,587] [INFO] [comm.py:637:init_distributed] cdb=None
Traceback (most recent call last):
  File "/mnt/workspace/Time-LLM/run_main.py", line 105, in <module>
    accelerator = Accelerator(kwargs_handlers=[ddp_kwargs], deepspeed_plugin=deepspeed_plugin)
  File "/mnt/workspace/Time-LLM/envTimeLLM/lib/python3.10/site-packages/accelerate/accelerator.py", line 371, in __init__
    self.state = AcceleratorState(
  File "/mnt/workspace/Time-LLM/envTimeLLM/lib/python3.10/site-packages/accelerate/state.py", line 777, in __init__
    PartialState(cpu, **kwargs)
  File "/mnt/workspace/Time-LLM/envTimeLLM/lib/python3.10/site-packages/accelerate/state.py", line 211, in __init__
    torch.cuda.set_device(self.device)
  File "/mnt/workspace/Time-LLM/envTimeLLM/lib/python3.10/site-packages/torch/cuda/__init__.py", line 408, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Screenshot 2024-10-16 03.31.20.png

There is only one GPU, and I have already set export CUDA_VISIBLE_DEVICES=0, but it still fails.
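
One thing that may be worth checking (this is a guess, not something the post confirms): "invalid device ordinal" is raised when torch.cuda.set_device() is given a GPU index that does not exist among the visible devices. The accelerate warning above does not list --num_processes among the defaulted values, which may mean the launch script passes it explicitly. If it asks for more processes than there are visible GPUs, every process with a local rank of 1 or higher would hit this error. Below is a minimal diagnostic sketch in Python under that assumption. The helper names and the value 8 are hypothetical placeholders, not taken from the Time-LLM scripts.

```python
import os


def visible_gpu_ids(env=None):
    """Return the ids listed in CUDA_VISIBLE_DEVICES, or None if it is unset."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None  # unset: the CUDA runtime exposes every physical GPU
    return [tok.strip() for tok in raw.split(",") if tok.strip()]


def invalid_local_ranks(num_processes, num_visible_gpus):
    """Local ranks whose device ordinal would not exist.

    Assumption: each launched process calls torch.cuda.set_device()
    with its own local rank, as the traceback suggests.
    """
    return [r for r in range(num_processes) if r >= num_visible_gpus]


def detected_gpu_count():
    """GPU count reported by PyTorch, or None if torch is not importable."""
    try:
        import torch
    except ImportError:
        return None
    return torch.cuda.device_count()


if __name__ == "__main__":
    # Hypothetical value: replace with the --num_processes used in your script.
    requested = 8
    print("CUDA_VISIBLE_DEVICES:", visible_gpu_ids())
    count = detected_gpu_count()
    print("torch.cuda.device_count():", count)
    if count is not None:
        print("ranks with no matching GPU:", invalid_local_ranks(requested, count))
```

If the last list is not empty, setting --num_processes to 1 in the accelerate launch line of scripts/TimeLLM_ETTh1.sh could be a reasonable first thing to try on a single-GPU instance.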

Posted by 游客gh2ock4e6m5xu on 2024-10-16 08:10:07
0 answers