Multi-GPU Data-Parallel Processing with TensorFlow

Multi-GPU processing with data parallelism

If you write your software in a language like C++ for a single CPU core, making it run on multiple GPUs in parallel would require rewriting it from scratch. This is not the case with TensorFlow. Because of its symbolic nature, TensorFlow can hide much of that complexity, which makes it easy to scale your program across many CPUs and GPUs.

Let’s start with the simple example of adding two tensors on the CPU:

```python
import tensorflow as tf

# Pin the ops to the first CPU device.
with tf.device(tf.DeviceSpec(device_type='CPU', device_index=0)):
    a = tf.random_uniform([1000, 100])
    b = tf.random_uniform([1000, 100])
    c = a + b

tf.Session().run(c)
```

The same computation is just as simple on a GPU:

```python
# Same graph, now placed on the first GPU.
with tf.device(tf.DeviceSpec(device_type='GPU', device_index=0)):
    a = tf.random_uniform([1000, 100])
    b = tf.random_uniform([1000, 100])
    c = a + b
```

But what if we have two GPUs and want to utilize both? To do that, we can split the data and use a separate GPU for processing each half:
```python
split_a = tf.split(a, 2)
split_b = tf.split(b, 2)

split_c = []
for i in range(2):
    with tf.device(tf.DeviceSpec(device_type='GPU', device_index=i)):
        split_c.append(split_a[i] + split_b[i])

c = tf.concat(split_c, axis=0)
 ```
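
The split-then-concatenate pattern can be checked without any GPU. The following plain-Python sketch uses no TensorFlow, and the helper names `split_rows` and `add_rows` are made up for illustration. It mimics what the graph above does: cut each batch into equal chunks along the first axis, process each chunk independently, and join the results in order. It assumes, as `tf.split` does when given an integer, that the batch size divides evenly by the number of chunks.

```python
def split_rows(batch, num_chunks):
    """Split a list of rows into equal chunks along the first axis."""
    if len(batch) % num_chunks != 0:
        raise ValueError("batch size must be divisible by num_chunks")
    size = len(batch) // num_chunks
    return [batch[i * size:(i + 1) * size] for i in range(num_chunks)]


def add_rows(a, b):
    """Element-wise sum of two equally shaped lists of rows."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
b = [[10.0, 20.0], [30.0, 40.0], [50.0, 60.0], [70.0, 80.0]]

# Each chunk stands in for the work one GPU would do.
chunks = [add_rows(ca, cb)
          for ca, cb in zip(split_rows(a, 2), split_rows(b, 2))]

# Joining the chunks in order plays the role of tf.concat(..., axis=0).
parallel_result = [row for chunk in chunks for row in chunk]
serial_result = add_rows(a, b)
```

Because addition treats every row independently, `parallel_result` and `serial_result` hold the same rows in the same order. Any per-row operation can be swapped in for `add_rows` without changing that property.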

Let's rewrite this in a more general form so that we can replace addition with any other set of operations:




```python
def make_parallel(fn, num_gpus, **kwargs):
    # Split every input tensor into num_gpus chunks along the batch axis.
    in_splits = {}
    for k, v in kwargs.items():
        in_splits[k] = tf.split(v, num_gpus)

    out_split = []
    for i in range(num_gpus):
        with tf.device(tf.DeviceSpec(device_type='GPU', device_index=i)):
            # Create variables on the first GPU and reuse them on the rest.
            with tf.variable_scope(tf.get_variable_scope(), reuse=i > 0):
                out_split.append(fn(**{k: v[i] for k, v in in_splits.items()}))

    # Reassemble the per-GPU outputs into a single batch.
    return tf.concat(out_split, axis=0)


def model(a, b):
    return a + b

c = make_parallel(model, 2, a=a, b=b)
```

You can replace the model with any function that takes a set of tensors as input and returns a tensor, provided that both the inputs and the output are batched along the first dimension. Note that we also added a variable scope and set reuse to true for every GPU after the first. This makes sure that all splits are processed with the same variables, which will come in handy in our next example.
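
To see why `reuse=i > 0` matters, here is a toy analogy in plain Python. The `VariableStore` class is hypothetical and exists only for this illustration; it is not TensorFlow's implementation. It only imitates the rule that a variable is created when reuse is off and looked up by name when reuse is on.

```python
class VariableStore:
    """Toy stand-in for a name-keyed variable store (hypothetical)."""

    def __init__(self):
        self._vars = {}

    def get_variable(self, name, initial, reuse):
        if reuse:
            # Reuse mode: the variable must already exist.
            if name not in self._vars:
                raise ValueError("variable %r does not exist" % name)
            return self._vars[name]
        # Creation mode: the name must still be free.
        if name in self._vars:
            raise ValueError("variable %r already exists" % name)
        self._vars[name] = initial
        return initial


store = VariableStore()


def model(x, reuse):
    # A one-element list acts as a mutable weight.
    w = store.get_variable("w", [2.0], reuse)
    return [w[0] * v for v in x], w


num_gpus = 2
shards = [[1.0, 2.0], [3.0, 4.0]]
outputs, weights = [], []
for i in range(num_gpus):
    # Same pattern as make_parallel: create on shard 0, reuse afterwards.
    out, w = model(shards[i], reuse=i > 0)
    outputs.append(out)
    weights.append(w)
```

Both entries of `weights` refer to the same object, so an update to the weight is seen by every shard. Calling `model` a second time with `reuse=False` raises an error in this sketch instead of silently creating a second copy of the weight.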

Let’s look at a slightly more practical example. We want to train a neural network on multiple GPUs. During training we need to compute not only the forward pass but also the backward pass (the gradients). But how can we parallelize the gradient computation? This turns out to be pretty easy.

Recall from the first item that we wanted to fit a second-degree polynomial to a set of samples. We have reorganized the code a bit so that the bulk of the operations live in the model function:

```python
import numpy as np
import tensorflow as tf

def model(x, y):
    # Coefficients of the polynomial w[0] * x^2 + w[1] * x + w[2].
    w = tf.get_variable("w", shape=[3, 1])

    f = tf.stack([tf.square(x), x, tf.ones_like(x)], 1)
    yhat = tf.squeeze(tf.matmul(f, w), 1)

    # Per-sample squared error; the mean is taken outside the model.
    loss = tf.square(yhat - y)
    return loss

x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)

loss = model(x, y)

train_op = tf.train.AdamOptimizer(0.1).minimize(
    tf.reduce_mean(loss))

def generate_data():
    # Samples of the target function y = 5 * x^2 + 3.
    x_val = np.random.uniform(-10.0, 10.0, size=100)
    y_val = 5 * np.square(x_val) + 3
    return x_val, y_val

sess = tf.Session()
sess.run(tf.global_variables_initializer())
for _ in range(1000):
    x_val, y_val = generate_data()
    _, loss_val = sess.run([train_op, loss], {x: x_val, y: y_val})

print(sess.run(tf.contrib.framework.get_variables_by_name("w")))
```

Now let’s use the make_parallel function that we just wrote to parallelize this. We only need to change two lines of the code above:

```python
loss = make_parallel(model, 2, x=x, y=y)

train_op = tf.train.AdamOptimizer(0.1).minimize(
    tf.reduce_mean(loss),
    colocate_gradients_with_ops=True)
```

The only thing we need to change to parallelize the backpropagation of gradients is to set the colocate_gradients_with_ops flag to True. This ensures that each gradient op runs on the same device as the original op it differentiates.
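
The arithmetic that makes this split possible can be checked by hand. The gradient of the mean loss is the sum of the per-sample gradients divided by the batch size, so each shard can compute its own partial sum independently. The sketch below is plain Python with no TensorFlow; the helper names `features` and `sum_grad` and the sample values are illustrative assumptions. It compares the full-batch gradient of the polynomial model with the one assembled from two shards.

```python
def features(x):
    """Feature vector [x^2, x, 1] of the second-degree polynomial."""
    return [x * x, x, 1.0]


def sum_grad(w, xs, ys):
    """Sum over samples of d/dw (yhat - y)^2, with yhat = w . features(x)."""
    grad = [0.0, 0.0, 0.0]
    for x, y in zip(xs, ys):
        f = features(x)
        err = sum(wk * fk for wk, fk in zip(w, f)) - y
        for k in range(3):
            grad[k] += 2.0 * err * f[k]
    return grad


w = [1.0, -2.0, 0.5]
xs = [-2.0, -1.0, 1.0, 3.0]
ys = [5.0 * x * x + 3.0 for x in xs]
n = len(xs)

# Gradient of the mean loss computed on the whole batch at once.
full = [g / n for g in sum_grad(w, xs, ys)]

# The same gradient assembled from two shards, one per (imaginary) GPU.
half = n // 2
g0 = sum_grad(w, xs[:half], ys[:half])
g1 = sum_grad(w, xs[half:], ys[half:])
sharded = [(p + q) / n for p, q in zip(g0, g1)]
```

`full` and `sharded` agree up to floating-point rounding. This is why the backward pass can follow the same data split as the forward pass: each device differentiates only the samples it processed, and the partial results are combined afterwards.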

More tutorials: http://www.tensorflownews.com/
