CUDA存储单元的使用GPU存储单元的分配与释放

简介: CUDA存储单元的使用GPU存储单元的分配与释放

一、问题源起

从以下的异常堆栈可以看到是BLAS程序集初始化失败,可以看到是执行MatMul的时候发生的异常,基本可以断定可能数据集太大导致memory不够用了。

2021-08-10 16:38:04.917501: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-08-10 16:38:04.960048: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-08-10 16:38:04.986898: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-08-10 16:38:04.992366: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-08-10 16:38:04.992389: W tensorflow/stream_executor/stream.cc:1455] attempting to perform BLAS operation using StreamExecutor without BLAS support
Traceback (most recent call last):
  File "/home/mango/PycharmProjects/DeepLearing/minist_conv.py", line 32, in <module>
    model.fit(train_images, train_labels, epochs=5, batch_size=64)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/keras/engine/training.py", line 1183, in fit
    tmp_logs = self.train_function(iterator)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/def_function.py", line 889, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/def_function.py", line 950, in _call
    return self._stateless_fn(*args, **kwds)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/function.py", line 3023, in __call__
    return graph_function._call_flat(
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/function.py", line 1960, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/function.py", line 591, in call
    outputs = execute.execute(
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError:  Blas xGEMM launch failed : a.shape=[1,64,576], b.shape=[1,576,64], m=64, n=64, k=576
   [[node sequential/dense/MatMul (defined at home/mango/PycharmProjects/DeepLearing/minist_conv.py:32) ]] [Op:__inference_train_function_993]
Function call stack:
train_function

二、开发环境

mango@mango-ubuntu:~$ /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Wed_Jul_14_19:41:19_PDT_2021
Cuda~~ compilation tools, release 11.4, V11.4.100==
Build cuda_11.4.r11.4/compiler.30188945_0
mango@mango-ubuntu:~$ tail -n 10 /usr/include/cudnn_version.h
#ifndef CUDNN_VERSION_H_
#define CUDNN_VERSION_H_
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 2
#define CUDNN_PATCHLEVEL 2
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
#endif /* CUDNN_VERSION_H */
mango@mango-ubuntu:~$ python3 --version
Python 3.9.5
mango@mango-ubuntu:~$ nvidia-smi
Tue Aug 10 19:57:58 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   54C    P0    N/A /  N/A |    329MiB /  2002MiB |      9%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1818      G   /usr/lib/xorg/Xorg                186MiB |
|    0   N/A  N/A      2002      G   /usr/bin/gnome-shell               45MiB |
|    0   N/A  N/A      3435      G   ...AAAAAAAAA= --shared-files       75MiB |
|    0   N/A  N/A      6016      G   python3                            13MiB |
+-----------------------------------------------------------------------------+
mango@mango-ubuntu:~$ python3
Python 3.9.5 (default, May 11 2021, 08:20:37) 
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2021-08-10 18:33:05.917520: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
>>> tf.__version__
'2.5.0'
>>>

三、Tensorflow针对GPU内存的分配策略

By default, TensorFlow maps nearly all of the GPU memory of all GPUs (subject to CUDA_VISIBLE_DEVICES) visible to the process. This is done to more efficiently use the relatively precious GPU memory resources on the devices by reducing memory fragmentation.

默认情况下,为了通过减少内存碎片更有效地利用设备上相对宝贵的GPU内存资源,TensorFlow进程会使用所有可见的GPU。

In some cases it is desirable for the process to only allocate a subset of the available memory, or to only grow the memory usage as is needed by the process. TensorFlow provides two methods to control this.

在某些情况下,进程只分配可用内存的一个子集,或者只根据进程的需要增加内存使用量。TensorFlow提供了两种方法来控制这种情况。

The first option is to turn on memory growth by calling tf.config.experimental.set_memory_growth, which attempts to allocate only as much GPU memory as needed for the runtime allocations: it starts out allocating very little memory, and as the program gets run and more GPU memory is needed, the GPU memory region is extended for the TensorFlow process. Memory is not released since it can lead to memory fragmentation. To turn on memory growth for a specific GPU, use the following code prior to allocating any tensors or executing any ops.

第一种选择是通过调用tf.config.experimental.set_memory_growth来打开内存增长,它尝试只分配运行时所需的GPU内存:它开始分配很少的内存,当程序运行时需要更多的GPU内存时,GPU内存区域会进一步扩展增大。内存不会被释放,因为这会导致内存碎片。为了打开特定GPU的内存增长,在分配任何张量或执行任何操作之前,使用以下代码。

gpus = tf.config.list_physical_devices('GPU')
if gpus:
  try:
    # Currently, memory growth needs to be the same across GPUs
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Memory growth must be set before GPUs have been initialized
    print(e)

Another way to enable this option is to set the environmental variable TF_FORCE_GPU_ALLOW_GROWTH to true. This configuration is platform specific.

启用该选项的另一种方法是将环境变量TF_FORCE_GPU_ALLOW_GROWTH设置为true。此配置是特定于平台的。

The second method is to configure a virtual GPU device with tf.config.experimental.set_virtual_device_configuration and set a hard limit on the total memory to allocate on the GPU.

This is useful if you want to truly bound the amount of GPU memory available to the TensorFlow process. This is common practice for local development when the GPU is shared with other applications such as a workstation GUI.

第二种方法是使用tf.config.experimental.set_virtual_device_configuration配置虚拟GPU设备,并设置GPU上可分配的总内存的硬限制。

如果你想真正将GPU内存的数量绑定到TensorFlow进程中,这是非常有用的。当GPU与其他应用程序(如工作站GUI)共享时,这是本地开发的常见做法。

gpus = tf.config.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to only allocate 1GB of memory on the first GPU
  try:
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=1024)])
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)

四、问题分析验证

通过上边对TensorFlow文档的分析,默认情况下会占用所有的GPU内存,但是TensorFlow提供了两种方式可以灵活的控制内存的分配策略;

我们可以直接设置GPU内存按需动态分配

import tensorflow as tf
physical_gpus = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_gpus[0], True)

通过以下命令可以看到执行过程中GPU内存的占用最高为697M

mango@mango-ubuntu:~$ while true; do nvidia-smi; sleep 0.2; done;
Tue Aug 10 20:30:58 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   58C    P0    N/A /  N/A |   1026MiB /  2002MiB |     72%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1818      G   /usr/lib/xorg/Xorg                186MiB |
|    0   N/A  N/A      2002      G   /usr/bin/gnome-shell               45MiB |
|    0   N/A  N/A      3435      G   ...AAAAAAAAA= --shared-files       73MiB |
|    0   N/A  N/A      6016      G   python3                            13MiB |
|    0   N/A  N/A     13829      C   /usr/bin/python3.9                697MiB |
+-----------------------------------------------------------------------------+

我们也可以限制最多使用1024M的GPU内存

import tensorflow as tf
physical_gpus = tf.config.list_physical_devices('GPU')
tf.config.set_logical_device_configuration(physical_gpus[0], [tf.config.LogicalDeviceConfiguration(memory_limit=1024)])

同样通过命令可以看到执行过程中GPU内存的占用最高为1455M

mango@mango-ubuntu:~$ while true; do nvidia-smi; sleep 0.2; done;
Tue Aug 10 20:31:24 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   58C    P0    N/A /  N/A |   1784MiB /  2002MiB |     74%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1818      G   /usr/lib/xorg/Xorg                186MiB |
|    0   N/A  N/A      2002      G   /usr/bin/gnome-shell               46MiB |
|    0   N/A  N/A      3435      G   ...AAAAAAAAA= --shared-files       72MiB |
|    0   N/A  N/A      6016      G   python3                            13MiB |
|    0   N/A  N/A     13570      C   /usr/bin/python3.9               1455MiB |
+-----------------------------------------------------------------------------+

五、GPU分配策略分析

通过四中的测试结果可得

  1. 默认的分配策略会占用所有的内存,并且执行中不会进行释放,如果训练数据量比较打很容易内存不够用;
  2. 限制最大使用内存,测试占用内存比设置的大,这个可能跟训练中间使用的模型和操作的复杂程度有关系,需要根据具体的业务场景设置合适的值;但是要注意不能设置大了,否则还是会报错,但是设置小了只是执行的慢一些罢了;
  3. 设置内存按需分配可能是一个相对比较中庸的方案,感觉可能是一个更好的方案,不知道TensorFlow为什么没有设置为默认值,留作一个问题,后续有新的认知的话再补充;

六、扩展

单GPU模拟多GPU环境

当我们的本地开发环境只有一个GPU,但却需要编写多GPU的程序在工作站上进行训练任务时,TensorFlow为我们提供了一个方便的功能,可以让我们在本地开发环境中建立多个模拟GPU,从而让多GPU的程序调试变得更加方便。以下代码在实体GPU GPU:0 的基础上建立了两个显存均为2GB的虚拟GPU。

gpus = tf.config.list_physical_devices('GPU')
if gpus:
  # Create 2 virtual GPUs with 1GB memory each
  try:
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=1024),
         tf.config.LogicalDeviceConfiguration(memory_limit=1024)])
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(gpus), "Physical GPU,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)

多GPU的数据并行

使用 tf.distribute.Strategy可以将模型拷贝到每个GPU上,然后将训练数据分批在不同的GPU上执行,达到数据并行。

tf.debugging.set_log_device_placement(True)
gpus = tf.config.list_logical_devices('GPU')
strategy = tf.distribute.MirroredStrategy(gpus)
with strategy.scope():
  inputs = tf.keras.layers.Input(shape=(1,))
  predictions = tf.keras.layers.Dense(1)(inputs)
  model = tf.keras.models.Model(inputs=inputs, outputs=predictions)
  model.compile(loss='mse',
                optimizer=tf.keras.optimizers.SGD(learning_rate=0.2))
相关实践学习
部署Stable Diffusion玩转AI绘画(GPU云服务器)
本实验通过在ECS上从零开始部署Stable Diffusion来进行AI绘画创作,开启AIGC盲盒。
目录
相关文章
|
7月前
|
机器学习/深度学习 并行计算 API
【GPU】CUDA是什么?以及学习路线图!
【GPU】CUDA是什么?以及学习路线图!
335 0
|
3月前
|
存储 并行计算 算法
CUDA统一内存:简化GPU编程的内存管理
在GPU编程中,内存管理是关键挑战之一。NVIDIA CUDA 6.0引入了统一内存,简化了CPU与GPU之间的数据传输。统一内存允许在单个地址空间内分配可被两者访问的内存,自动迁移数据,从而简化内存管理、提高性能并增强代码可扩展性。本文将详细介绍统一内存的工作原理、优势及其使用方法,帮助开发者更高效地开发CUDA应用程序。
|
7月前
|
机器学习/深度学习 并行计算 算法框架/工具
Anaconda+Cuda+Cudnn+Pytorch(GPU版)+Pycharm+Win11深度学习环境配置
Anaconda+Cuda+Cudnn+Pytorch(GPU版)+Pycharm+Win11深度学习环境配置
1011 3
|
7月前
|
并行计算 API C++
GPU 硬件与 CUDA 程序开发工具
GPU 硬件与 CUDA 程序开发工具
138 0
|
7月前
|
并行计算 API 开发工具
【GPU】GPU 硬件与 CUDA 程序开发工具
【GPU】GPU 硬件与 CUDA 程序开发工具
108 0
|
7月前
|
机器学习/深度学习 并行计算 流计算
【GPU】GPU CUDA 编程的基本原理是什么?
【GPU】GPU CUDA 编程的基本原理是什么?
186 0
|
7月前
|
弹性计算 并行计算 UED
带你读《弹性计算技术指导及场景应用》——4. 自动安装NVIDIA GPU驱动和CUDA组件
带你读《弹性计算技术指导及场景应用》——4. 自动安装NVIDIA GPU驱动和CUDA组件
153 0
|
1月前
|
弹性计算 人工智能 Serverless
阿里云ACK One:注册集群云上节点池(CPU/GPU)自动弹性伸缩,助力企业业务高效扩展
在当今数字化时代,企业业务的快速增长对IT基础设施提出了更高要求。然而,传统IDC数据中心却在业务存在扩容慢、缩容难等问题。为此,阿里云推出ACK One注册集群架构,通过云上节点池(CPU/GPU)自动弹性伸缩等特性,为企业带来全新突破。
|
4月前
|
机器学习/深度学习 编解码 人工智能
阿里云gpu云服务器租用价格:最新收费标准与活动价格及热门实例解析
随着人工智能、大数据和深度学习等领域的快速发展,GPU服务器的需求日益增长。阿里云的GPU服务器凭借强大的计算能力和灵活的资源配置,成为众多用户的首选。很多用户比较关心gpu云服务器的收费标准与活动价格情况,目前计算型gn6v实例云服务器一周价格为2138.27元/1周起,月付价格为3830.00元/1个月起;计算型gn7i实例云服务器一周价格为1793.30元/1周起,月付价格为3213.99元/1个月起;计算型 gn6i实例云服务器一周价格为942.11元/1周起,月付价格为1694.00元/1个月起。本文为大家整理汇总了gpu云服务器的最新收费标准与活动价格情况,以供参考。
阿里云gpu云服务器租用价格:最新收费标准与活动价格及热门实例解析
|
21天前
|
人工智能 弹性计算 编解码
阿里云GPU云服务器性能、应用场景及收费标准和活动价格参考
GPU云服务器作为阿里云提供的一种高性能计算服务,通过结合GPU与CPU的计算能力,为用户在人工智能、高性能计算等领域提供了强大的支持。其具备覆盖范围广、超强计算能力、网络性能出色等优势,且计费方式灵活多样,能够满足不同用户的需求。目前用户购买阿里云gpu云服务器gn5 规格族(P100-16G)、gn6i 规格族(T4-16G)、gn6v 规格族(V100-16G)有优惠,本文为大家详细介绍阿里云gpu云服务器的相关性能及收费标准与最新活动价格情况,以供参考和选择。