Traceback (most recent call last):
  File "src/main.py", line 442, in <module>
    main(args)
  File "src/main.py", line 404, in main
    args.clip_max_norm, args)
  File "/home/wsx/0A_DATA/HFPN/src/engine.py", line 52, in train_one_epoch
    losses = sum(loss_dict[k] * weight_dict[k] for k in loss_dict.keys() if k in weight_dict)
UnboundLocalError: local variable 'loss_dict' referenced before assignment
Killing subprocess 21108
Cause: several distributed training jobs were running at the same time, which exhausted GPU memory.
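One plausible way an out-of-memory failure turns into this UnboundLocalError is sketched below. It assumes the statement that assigns loss_dict sits inside a try block whose except branch does not re-raise; that is an assumption, not something confirmed from engine.py. The names train_step and failing_forward are hypothetical.

```python
def train_step(forward):
    """Hypothetical stand-in for one training step; not the code in engine.py."""
    try:
        loss_dict = forward()  # never assigned if forward() raises
    except RuntimeError as e:
        # The error is only logged, so execution continues past the except.
        print(f"forward failed: {e}")
    # If the try body failed, loss_dict was never bound in this scope,
    # so this line raises UnboundLocalError instead of the original error.
    return sum(loss_dict.values())


def failing_forward():
    # Simulates a forward pass that runs out of GPU memory.
    raise RuntimeError("CUDA out of memory (simulated)")


caught = None
try:
    train_step(failing_forward)
except UnboundLocalError as e:
    caught = e
    print(type(e).__name__)
```

The original out-of-memory error is hidden in this pattern, which is why the traceback points at the line that uses loss_dict rather than at the allocation that failed.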
Fix: reduce the batch size, change the DDP setup, lower GPU memory usage, and adjust how the data is passed in.
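You can also wrap the code just before the failing statement in the following, so that an out-of-memory error is reported explicitly: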
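import sys

try:
    ......  # the code that may run out of memory (elided in the original)
    ......
except RuntimeError as e:
    if "out of memory" in str(e):
        # Stop with a clear message instead of failing later on an unbound variable.
        sys.exit('Out Of Memory')
    else:
        raise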
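The same guard can be packaged as a small helper so it is easy to reuse and to try out without a GPU. This is a minimal sketch: run_with_oom_guard, fake_step, and other_failure are hypothetical names, and the out-of-memory error is simulated with a plain RuntimeError.

```python
import sys


def run_with_oom_guard(step_fn, *args, **kwargs):
    """Call step_fn and exit with a clear message on an out-of-memory RuntimeError.

    step_fn is a hypothetical placeholder for the code that precedes the
    failing statement (for example, the forward pass and loss computation).
    """
    try:
        return step_fn(*args, **kwargs)
    except RuntimeError as e:
        if "out of memory" in str(e):
            sys.exit("Out Of Memory")
        raise  # any other RuntimeError keeps its original traceback


def fake_step():
    # Simulated failure; the message contains the text the guard looks for.
    raise RuntimeError("CUDA out of memory (simulated)")


def other_failure():
    # A RuntimeError unrelated to memory, which the guard must not swallow.
    raise RuntimeError("shape mismatch (simulated)")


# sys.exit raises SystemExit, which is caught here only to inspect it.
exit_code = None
try:
    run_with_oom_guard(fake_step)
except SystemExit as exit_info:
    exit_code = exit_info.code
```

Because sys.exit raises SystemExit rather than returning, the process stops at the point of failure, and errors that do not mention "out of memory" still propagate unchanged.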