【DSW Gallery】HybridBackend Quickstart: Accelerating Recommendation Model Training on GPUs

Summary: This article describes how to use HybridBackend to accelerate the training of a sample recommendation model on GPUs. HybridBackend is an industrial-grade sparse model training framework from Alibaba that helps users easily improve the throughput of sparse model training on GPUs.

Run it directly

Open "HybridBackend Quickstart: Accelerating Recommendation Model Training on GPUs" and click "Open in DSW" in the upper-right corner.



HybridBackend Quickstart

In this tutorial, we use HybridBackend to speed up training of a sample ranking model based on stacked DCNv2 on a Taobao ad click dataset.

Why HybridBackend

  • Training industrial recommendation models can benefit greatly from GPUs:
    • The embedding layer becomes wider, consuming up to thousands of feature fields, which requires larger memory bandwidth;
    • The feature interaction layer goes deeper by leveraging multiple DNN submodules over different subsets of features, which requires higher computing capability;
    • GPUs provide much higher computing capability, larger memory bandwidth, and faster data movement.
  • Industrial recommendation models do not take full advantage of GPU resources under canonical training frameworks:
    • Industrial recommendation models contain up to a thousand input feature fields, introducing fragmentary, memory-intensive operations;
    • The multiple constituent feature-interaction submodules introduce a substantial number of small-sized compute kernels.
  • A training framework for industrial recommendation models must be minimally invasive and compatible with existing workflows:
    • Training is only one part of a production recommendation system, and modifying the inference pipeline takes great effort;
    • AI scientists write models in a variety of ways, especially on a big team.

HybridBackend speeds up training of industrial recommendation models on GPUs with minimal effort. In this tutorial, you will learn how to use HybridBackend to make training of industrial recommendation models much faster.

See the HybridBackend GitHub repo and the paper for more information.

Requirements

  • Hardware
    • Modern GPU and interconnect (e.g. A10 / PCIe Gen4)
    • Fast data storage (e.g. ESSD)
  • Software
    • Ubuntu 20.04 or above
    • Python 3.8 or above
    • CUDA 11.4
    • TensorFlow 1.15
  • Data format
    • TFRecord format
    • Parquet format
# Install HybridBackend built for TensorFlow 1.15 and CUDA 11.4
!pip3 install hybridbackend-tf115-cu114
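After installation, it is worth sanity-checking the environment before training. A minimal check (a sketch; it deliberately does not import HybridBackend yet, since importing it turns it on until the kernel restarts):

import tensorflow as tf
print(tf.__version__)              # expect 1.15.x
print(tf.test.is_gpu_available())  # expect True on a GPU instance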

Sample ranking model

In this tutorial, a sample ranking model based on stacked DCNv2 is used. See the code in the ranking directory for more details.
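For intuition, the core building block of DCNv2 is the cross layer, which computes x_{l+1} = x_0 ⊙ (W x_l + b) + x_l, so explicit feature crossings grow with depth at low cost. A minimal TensorFlow 1.x sketch of such a layer (an illustration only, not the actual ranking.model.stacked_dcn_v2 implementation) could look like:

import tensorflow as tf

def cross_layer(x0, xl, name):
  # DCNv2 cross layer: x_{l+1} = x0 * (W @ xl + b) + xl
  dim = x0.shape[-1].value
  w = tf.get_variable(name + '_w', shape=[dim, dim])
  b = tf.get_variable(name + '_b', shape=[dim],
                      initializer=tf.zeros_initializer())
  return x0 * (tf.matmul(xl, w) + b) + xl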

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'  # suppress TensorFlow C++ log noise
import tensorflow as tf
# Silence TF 1.x deprecation warnings to keep the notebook output readable.
from tensorflow.python.util import module_wrapper as deprecation
deprecation._PER_MODULE_WARNING_LIMIT = 0
tf.get_logger().propagate = False
from ranking.data import DataSpec
from ranking.model import stacked_dcn_v2
from ranking.model import wide_and_deep_features
# Global configuration
train_max_steps = 100
train_batch_size = 16000
data_spec = DataSpec.read('ranking/taobao/data/spec.json')
# Build wide-and-deep features and stacked DCNv2 logits, then train with
# Adagrad under a MonitoredTrainingSession until `train_max_steps`.
def train(iterator, embedding_weight_device, dnn_device, hooks):
  batch = iterator.get_next()
  batch.pop('ts')
  labels = tf.reshape(tf.to_float(batch.pop('label')), shape=[-1, 1])
  wide_features, deep_features = wide_and_deep_features(
    batch,
    data_spec.defaults,
    data_spec.norms,
    data_spec.logs,
    data_spec.embedding_dims,
    data_spec.embedding_sizes,
    embedding_weight_device)
  with tf.device(dnn_device):
    logits = stacked_dcn_v2(
      wide_features + deep_features,
      [1024, 1024, 512, 256, 1])
    loss = tf.reduce_mean(tf.keras.losses.binary_crossentropy(labels, logits))
    step = tf.train.get_or_create_global_step()
    opt = tf.train.AdagradOptimizer(learning_rate=0.001)
    train_op = opt.minimize(loss, global_step=step)
  hooks.append(tf.train.StepCounterHook(10))
  hooks.append(tf.train.StopAtStepHook(train_max_steps))
  config = tf.ConfigProto(allow_soft_placement=True)
  config.gpu_options.allow_growth = True
  config.gpu_options.force_gpu_compatible = True
  with tf.train.MonitoredTrainingSession(
      '', hooks=hooks, config=config) as sess:
    while not sess.should_stop():
      sess.run(train_op)
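The StepCounterHook in train() reports steps per second at INFO level every 10 steps, which is the simplest way to compare throughput across the runs below. If its output does not show up in the notebook, you may need to raise TensorFlow's logging verbosity first:

tf.logging.set_verbosity(tf.logging.INFO)  # make StepCounterHook's global_step/sec visible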

Training without HybridBackend

Without HybridBackend, training the sample ranking model underutilizes GPUs.

# Download training data in TFRecord format
!wget http://easyrec.oss-cn-beijing.aliyuncs.com/data/taobao/day_0.tfrecord
with tf.Graph().as_default():
  ds = tf.data.TFRecordDataset('./day_0.tfrecord', compression_type='GZIP')
  ds = ds.batch(train_batch_size, drop_remainder=True)
  ds = ds.map(
    lambda batch: tf.io.parse_example(batch, data_spec.to_example_spec()))
  ds = ds.prefetch(2)
  iterator = tf.data.make_one_shot_iterator(ds)
  with tf.device('/gpu:0'):
    train(iterator, '/cpu:0', '/gpu:0', [])
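To observe the underutilization yourself, you can open a DSW terminal while the cell above runs and watch the GPU (nvidia-smi ships with the GPU driver; the flags below sample utilization and memory every 2 seconds):

nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 2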

Training with HybridBackend

With just a one-line import, HybridBackend applies packing and interleaving to speed up the embedding layer dramatically and automatically.

# Note: Once HybridBackend is on, you need to restart the notebook kernel to turn it off.
import hybridbackend.tensorflow as hb
# Exactly the same code as before, except HybridBackend is now on.
with tf.Graph().as_default():
  ds = tf.data.TFRecordDataset('./day_0.tfrecord', compression_type='GZIP')
  ds = ds.batch(train_batch_size, drop_remainder=True)
  ds = ds.map(
    lambda batch: tf.io.parse_example(batch, data_spec.to_example_spec()))
  ds = ds.prefetch(2)
  iterator = tf.data.make_one_shot_iterator(ds)
  with tf.device('/gpu:0'):
    train(iterator, '/cpu:0', '/gpu:0', [])

Training with HybridBackend (Optimized data pipeline)

Even greater training performance gains can be achieved by using the optimized data pipeline provided by HybridBackend.

# Download training data in Parquet format
!wget http://easyrec.oss-cn-beijing.aliyuncs.com/data/taobao/day_0.parquet
# Note: Once HybridBackend is on, you need to restart the notebook kernel to turn it off.
import hybridbackend.tensorflow as hb
with tf.Graph().as_default():
  ds = hb.data.ParquetDataset(
    './day_0.parquet',
    batch_size=train_batch_size,
    num_parallel_parser_calls=tf.data.experimental.AUTOTUNE,
    drop_remainder=True)
  ds = ds.apply(hb.data.to_sparse())  # emit tf.SparseTensor for multi-valued fields
  ds = ds.prefetch(2)
  iterator = tf.data.make_one_shot_iterator(ds)
  with tf.device('/gpu:0'):
    iterator = hb.data.Iterator(iterator, 2)  # prefetch batches onto the GPU
    train(iterator, '/cpu:0', '/gpu:0', [hb.data.Iterator.Hook()])
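If you want to inspect the Parquet training data itself, a quick schema check with pyarrow (assuming it is available in the environment, which this tutorial does not guarantee) might look like:

import pyarrow.parquet as pq
# Print the column names and types of the downloaded training data.
print(pq.ParquetFile('./day_0.parquet').schema_arrow)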

