Resource exhausted: OOM when allocating tensor with shape[2304,384] Traceback (most recent call last

The reason is simple: I was simply using too much GPU memory:

```
2018-09-26 18:50:05.489980: W T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:279] ********__*_____________________*_**_____________________*____*_**********************************xx
2018-09-26 18:50:05.490391: W T:\src\github\tensorflow\tensorflow\core\framework\op_kernel.cc:1275] OP_REQUIRES failed at conv_ops.cc:636 : Resource exhausted: OOM when allocating tensor with shape[32,32,417,417] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):

    callbacks=[logging, checkpoint])
  File "D:\Anaconda3\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "D:\Anaconda3\lib\site-packages\keras\engine\training.py", line 1415, in fit_generator
    initial_epoch=initial_epoch)
  File "D:\Anaconda3\lib\site-packages\keras\engine\training_generator.py", line 213, in fit_generator
    class_weight=class_weight)
  File "D:\Anaconda3\lib\site-packages\keras\engine\training.py", line 1215, in train_on_batch
    outputs = self.train_function(ins)
  File "D:\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py", line 2666, in __call__
    return self._call(inputs)
  File "D:\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py", line 2636, in _call
    fetched = self._callable_fn(*array_vals)
  File "D:\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1382, in __call__
    run_metadata_ptr)
  File "D:\Anaconda3\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 519, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[32,32,417,417] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
   [[Node: conv2d_2/convolution = Conv2D[T=DT_FLOAT, _class=["loc:@batch_normalization_2/cond/FusedBatchNorm/Switch"], data_format="NHWC", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 2, 2, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](zero_padding2d_1/Pad, conv2d_2/kernel/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

   [[Node: yolo_loss/while_1/LoopCond/_2963 = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_6607_yolo_loss/while_1/LoopCond", tensor_type=DT_BOOL, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopyolo_loss/while_1/strided_slice_1/stack_2/_2805)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
```
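The hint at the end of the traceback is worth acting on before anything else: setting report_tensor_allocations_upon_oom on the RunOptions makes TensorFlow print the live tensors when an OOM hits, which is how you find out what is actually eating the memory. Below is a minimal sketch against the raw TF 1.x session API; the toy graph is only a stand-in, and wiring these options into a Keras fit_generator run depends on your Keras version.

```python
import tensorflow as tf

# TF 1.x graph mode, matching the traceback above.
# With this flag set, an OOM error also reports the list of allocated tensors.
run_opts = tf.RunOptions(report_tensor_allocations_upon_oom=True)

x = tf.random_normal([32, 32, 417, 417])  # roughly the shape that failed above
y = tf.reduce_sum(x)

with tf.Session() as sess:
    # Pass the options to every run call you want instrumented.
    print(sess.run(y, options=run_opts))
```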
```
2018-09-26 18:50:05.482286: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:674] 1 Chunks of size 16384 totalling 16.0KiB
2018-09-26 18:50:05.482594: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:674] 8 Chunks of size 21504 totalling 168.0KiB
2018-09-26 18:50:05.482884: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:674] 2 Chunks of size 32768 totalling 64.0KiB
2018-09-26 18:50:05.483090: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:674] 8 Chunks of size 43008 totalling 336.0KiB
2018-09-26 18:50:05.483276: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:674] 5 Chunks of size 65024 totalling 317.5KiB
2018-09-26 18:50:05.483457: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:674] 2 Chunks of size 73728 totalling 144.0KiB
2018-09-26 18:50:05.483656: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:674] 8 Chunks of size 86016 totalling 672.0KiB
2018-09-26 18:50:05.483844: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:674] 3 Chunks of size 129792 totalling 380.3KiB
2018-09-26 18:50:05.484411: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:674] 11 Chunks of size 131072 totalling 1.38MiB
2018-09-26 18:50:05.484719: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:674] 1 Chunks of size 196608 totalling 192.0KiB
2018-09-26 18:50:05.484902: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:674] 5 Chunks of size 259584 totalling 1.24MiB
2018-09-26 18:50:05.485216: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:674] 3 Chunks of size 294912 totalling 864.0KiB
2018-09-26 18:50:05.485494: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:674] 1 Chunks of size 454400 totalling 443.8KiB
2018-09-26 18:50:05.485748: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:674] 3 Chunks of size 519168 totalling 1.49MiB
2018-09-26 18:50:05.486063: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:674] 11 Chunks of size 524288 totalling 5.50MiB
2018-09-26 18:50:05.486245: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:674] 1 Chunks of size 786432 totalling 768.0KiB
2018-09-26 18:50:05.486419: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:674] 4 Chunks of size 1038336 totalling 3.96MiB
2018-09-26 18:50:05.486590: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:674] 12 Chunks of size 1179648 totalling 13.50MiB
2018-09-26 18:50:05.486764: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:674] 1 Chunks of size 1817088 totalling 1.73MiB
2018-09-26 18:50:05.486934: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:674] 3 Chunks of size 2076672 totalling 5.94MiB
2018-09-26 18:50:05.487432: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:674] 7 Chunks of size 2097152 totalling 14.00MiB
2018-09-26 18:50:05.487719: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:674] 12 Chunks of size 4718592 totalling 54.00MiB
2018-09-26 18:50:05.487982: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:674] 1 Chunks of size 7268352 totalling 6.93MiB
2018-09-26 18:50:05.488284: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:674] 8 Chunks of size 18874368 totalling 144.00MiB
2018-09-26 18:50:05.488560: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:674] 1 Chunks of size 431485952 totalling 411.50MiB
2018-09-26 18:50:05.488842: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:674] 1 Chunks of size 712249344 totalling 679.25MiB
2018-09-26 18:50:05.489097: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:678] Sum Total of in-use chunks: 1.32GiB
2018-09-26 18:50:05.489374: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:680] Stats: 
Limit:                  3211594956
InUse:                  1415122432
MaxInUse:               2420054016
NumAllocs:                    1707
MaxAllocSize:            712249344
```
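For scale: a float32 tensor of shape [32, 32, 417, 417] takes 32 × 32 × 417 × 417 × 4 bytes = 712,249,344 bytes ≈ 679 MiB as a single contiguous chunk, which is exactly the MaxAllocSize in the stats above, while the whole BFC pool is capped at about 3 GiB (Limit: 3211594956). Intermediate buffers like this grow linearly with the batch size, so shrinking the batch shrinks them in proportion.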

The most expedient way is probably to reduce the batch size. It'll run slower, but use less memory.

I changed the batch_size from 128 to 32 and the problem was resolved!
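For reference, here is a hedged sketch of what that change boils down to in a Keras fit_generator setup. The model and generator below are toy placeholders, not the YOLO training script from the traceback; the only line that really matters is batch_size.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Conv2D, Flatten, Dense

# The fix itself: a smaller batch per training step.
batch_size = 32  # was 128 -> per-step activation memory shrinks roughly 4x

def data_generator(batch_size, input_shape=(64, 64, 3), num_classes=10):
    """Toy stand-in for the real data generator: yields random (x, y) batches."""
    while True:
        x = np.random.rand(batch_size, *input_shape).astype("float32")
        y = np.eye(num_classes)[np.random.randint(num_classes, size=batch_size)]
        yield x, y

# Toy stand-in for the real model.
model = Sequential([
    Conv2D(16, 3, strides=2, activation="relu", input_shape=(64, 64, 3)),
    Flatten(),
    Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")

model.fit_generator(data_generator(batch_size),
                    steps_per_epoch=100,
                    epochs=1)
```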

 
