Help needed: Platform for AI PAI-DSW cannot find the GPU when running a model

[Screenshot 2024-10-16 03.28.18.png]

(envTimeLLM) root@dsw-456910-599d598865-67g8w:/mnt/workspace/Time-LLM# bash ./scripts/TimeLLM_ETTh1.sh
The following values were not passed to accelerate launch and had defaults used instead:
--num_machines was set to a value of 1
--dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
[2024-10-16 03:14:13,335] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-16 03:14:13,587] [INFO] [comm.py:637:init_distributed] cdb=None
Traceback (most recent call last):
  File "/mnt/workspace/Time-LLM/run_main.py", line 105, in <module>
    accelerator = Accelerator(kwargs_handlers=[ddp_kwargs], deepspeed_plugin=deepspeed_plugin)
  File "/mnt/workspace/Time-LLM/envTimeLLM/lib/python3.10/site-packages/accelerate/accelerator.py", line 371, in __init__
    self.state = AcceleratorState(
  File "/mnt/workspace/Time-LLM/envTimeLLM/lib/python3.10/site-packages/accelerate/state.py", line 777, in __init__
    PartialState(cpu, **kwargs)
  File "/mnt/workspace/Time-LLM/envTimeLLM/lib/python3.10/site-packages/accelerate/state.py", line 211, in __init__
    torch.cuda.set_device(self.device)
  File "/mnt/workspace/Time-LLM/envTimeLLM/lib/python3.10/site-packages/torch/cuda/__init__.py", line 408, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

[Screenshot 2024-10-16 03.31.20.png]

The instance has only one GPU. I have already set export CUDA_VISIBLE_DEVICES=0, but it still fails.
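One possible line of diagnosis, offered as a sketch and not a confirmed answer: the traceback fails inside torch.cuda.set_device, which raises "invalid device ordinal" when the requested GPU index is not below the number of GPUs the process can see. The launch log warns about --num_machines and --dynamo_backend but not about --num_processes, which suggests the script passes --num_processes explicitly. If that value is larger than 1 on a single-GPU instance, every local rank above 0 would ask for a GPU that does not exist. The contents of TimeLLM_ETTh1.sh are not shown in the post, so this is an assumption. The Python check below prints what the process actually sees; the helper names and the CHECK_NUM_PROCESSES variable are hypothetical and exist only for this sketch.

```python
import os


def parse_visible_devices(value):
    """Split a CUDA_VISIBLE_DEVICES string into its entries (None if unset)."""
    if value is None:
        return None
    return [entry.strip() for entry in value.split(",") if entry.strip()]


def invalid_ranks(num_processes, visible_gpu_count):
    """Local ranks that would ask for a GPU index the process cannot see."""
    return [rank for rank in range(num_processes) if rank >= visible_gpu_count]


def main():
    visible = parse_visible_devices(os.environ.get("CUDA_VISIBLE_DEVICES"))
    print("CUDA_VISIBLE_DEVICES:", visible)
    # LOCAL_RANK is normally set by the launcher for each worker process.
    print("LOCAL_RANK:", os.environ.get("LOCAL_RANK"))

    try:
        import torch
    except ImportError:
        print("torch is not importable in this environment")
        return

    gpu_count = torch.cuda.device_count()
    print("torch.cuda.is_available():", torch.cuda.is_available())
    print("torch.cuda.device_count():", gpu_count)

    # CHECK_NUM_PROCESSES is a hypothetical variable used only by this sketch;
    # set it to the --num_processes value found in the launch script.
    num_processes = int(os.environ.get("CHECK_NUM_PROCESSES", "1"))
    bad = invalid_ranks(num_processes, gpu_count)
    if bad:
        print("These local ranks have no matching GPU:", bad)
    else:
        print("Every local rank maps to a visible GPU.")


if __name__ == "__main__":
    main()
```

If the check reports ranks without a matching GPU, lowering --num_processes in the accelerate launch command to the printed device count is the first thing to try. If device_count() itself prints 0, the problem is more likely the instance type or the PyTorch build than the launch arguments.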

Posted by 游客gh2ock4e6m5xu on 2024-10-16 08:10:07, from Hunan

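Platform for AI (PAI, formerly Machine Learning Platform for AI) is a machine learning and deep learning engineering platform for developers and enterprises. It provides full-lifecycle AI development services covering data labeling, model building, model training, model deployment, and inference optimization. It ships with more than 140 built-in optimized algorithms and a rich set of industry-scenario plugins, and offers low-barrier, high-performance cloud-native AI engineering capabilities.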
