1. Errors encountered when running PyG
Environment, error message, and the cause I eventually found:
Linux
Python 3.8 (in an anaconda-managed virtual environment)
PyTorch 1.11 + cudatoolkit 10.2 (installed via anaconda)
torch-scatter 2.0.9
torch-sparse 0.6.14
pyg-nightly 2.1.0.dev20220815
The error shows up as many repeated lines like the following (the specific numbers may vary):
/opt/conda/conda-bld/pytorch_1646755853042/work/aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [279,0,0], thread: [118,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
The final traceback:
Traceback (most recent call last):
  File "try1.py", line 128, in <module>
    print(model(train_data.x_dict,train_data.edge_index_dict))
  File "env_path/lib/python3.8/site-packages/torch/fx/graph_module.py", line 630, in wrapped_call
    raise e.with_traceback(None)
RuntimeError: CUDA error: device-side assert triggered
Who the hell is supposed to make sense of this?!
The first step of the fix is to move the data and the model off the GPU onto the CPU and rerun the code; that produces a normal error message instead of the cryptic one above. In my case, a bug in earlier code meant the adjacency structure (edge_index) contained indices larger than num_nodes - 1:
Traceback (most recent call last):
  File "try1.py", line 146, in <module>
    print(model(train_data.x_dict,train_data.edge_index_dict))
  File "env_path/lib/python3.8/site-packages/torch/fx/graph_module.py", line 630, in wrapped_call
    raise e.with_traceback(None)
IndexError: index out of range in self
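Besides rerunning on CPU, the out-of-range edges can be found directly with a quick sanity check on edge_index. The sketch below is framework-free for illustration: in a real PyG script edge_index is a 2 x E tensor, so the nested list assumed here would come from something like `edge_index.tolist()`, and `num_nodes` from the node feature matrix.

```python
def find_bad_edge_indices(edge_index, num_nodes):
    """Return (row, col, value) for every entry of edge_index that is
    outside the valid node-id range [0, num_nodes - 1].

    edge_index is assumed to be a 2 x E nested list, e.g. the result of
    calling .tolist() on a PyG edge_index tensor.
    """
    bad = []
    for row, indices in enumerate(edge_index):
        for col, node_id in enumerate(indices):
            if node_id < 0 or node_id >= num_nodes:
                bad.append((row, col, node_id))
    return bad

# A graph with 3 nodes (valid ids 0..2): the target id 3 in the
# second edge is out of range and would trigger the CUDA assert.
edge_index = [[0, 1, 2],
              [1, 3, 0]]
print(find_bad_edge_indices(edge_index, num_nodes=3))  # [(1, 1, 3)]
```

Running this before moving anything to the GPU turns the opaque `srcIndex < srcSelectDimSize` assert into a plain report of which edges are broken.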
2. Errors encountered when running transformers
Environment, error message, and the cause I eventually found:
Linux
Python 3.8 (in an anaconda-managed virtual environment)
PyTorch 1.11 + cudatoolkit 10.2 (installed via anaconda)
transformers 4.21.1
I was using the AutoModelForSequenceClassification class and set num_labels incorrectly at initialization, so some labels in the data reached that value, which triggered the error.
The fix was simply to change num_labels to the correct value.
Error message:
Traceback (most recent call last):
  File "c1bert.py", line 109, in <module>
    optimizer.step()
  File "env_path/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
    return wrapped(*args, **kwargs)
  File "env_path/lib/python3.8/site-packages/torch/optim/optimizer.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "env_path/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "env_path/lib/python3.8/site-packages/torch/optim/adamw.py", line 145, in step
    F.adamw(params_with_grad,
  File "env_path/lib/python3.8/site-packages/torch/optim/_functional.py", line 155, in adamw
    param.addcdiv_(exp_avg, denom, value=-step_size)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
/opt/conda/conda-bld/pytorch_1646755853042/work/aten/src/ATen/native/cuda/Loss.cu:257: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [4,0,0] Assertion `t >= 0 && t < n_classes` failed.
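The assertion `t >= 0 && t < n_classes` on the last line is exactly the check that failed: some target label t fell outside [0, num_labels). The same kind of pre-flight check works here; plain Python for illustration, where `labels` stands in for the dataset's label column (e.g. `labels_tensor.tolist()` on a real tensor).

```python
def find_bad_labels(labels, num_labels):
    """Return every label that would fail the NLL-loss kernel's
    `t >= 0 && t < n_classes` assertion for a model initialized
    with num_labels classes."""
    return [t for t in labels if t < 0 or t >= num_labels]

# Model initialized with num_labels=3 (valid labels 0..2),
# but the data contains a label equal to 3.
labels = [0, 2, 1, 3]
print(find_bad_labels(labels, num_labels=3))  # [3]
```

If this returns anything non-empty, either the labels need remapping or num_labels passed to AutoModelForSequenceClassification is too small.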