[DSW Gallery] HybridBackend Quickstart: Accelerating Recommendation Model Training on GPUs

Summary: This article shows how to use HybridBackend to accelerate the training of a sample recommendation model on GPUs. HybridBackend is an industrial-grade sparse model training framework from Alibaba that helps users easily improve the training throughput of sparse models on GPUs.

Run it directly

Open HybridBackend Quickstart: Accelerating Recommendation Model Training on GPUs and click "Open in DSW" in the upper-right corner.



HybridBackend Quickstart

In this tutorial, we use HybridBackend to speed up the training of a sample ranking model based on stacked DCNv2 on the Taobao ad click dataset.

Why HybridBackend

  • Training industrial recommendation models can benefit greatly from GPUs:
    • The embedding layer grows wider, consuming up to thousands of feature fields, which requires larger memory bandwidth;
    • The feature interaction layer goes deeper, leveraging multiple DNN submodules over different subsets of features, which requires higher computing capability;
    • GPUs provide much higher computing capability, larger memory bandwidth, and faster data movement.
  • Canonical training frameworks do not let industrial recommendation models take full advantage of GPU resources:
    • Industrial recommendation models contain up to a thousand input feature fields, introducing fragmentary and memory-intensive operations;
    • The multiple constituent feature-interaction submodules introduce a substantial number of small compute kernels.
  • A training framework for industrial recommendation models must be minimally invasive and compatible with existing workflows:
    • Training is only one part of a production recommendation system, and modifying the inference pipeline takes great effort;
    • AI scientists write models in a variety of ways, especially on a big team.

HybridBackend speeds up the training of industrial recommendation models on GPUs with minimal effort. In this tutorial, you will learn how to use HybridBackend to make such training much faster.

See the HybridBackend GitHub repo and the paper for more information.

Requirements

  • Hardware
    • Modern GPU and interconnect (e.g. A10 / PCIe Gen4)
    • Fast data storage (e.g. ESSD)
  • Software
    • Ubuntu 20.04 or above
    • Python 3.8 or above
    • CUDA 11.4
    • TensorFlow 1.15
  • Data format
    • TFRecord
    • Parquet
!pip3 install hybridbackend-tf115-cu114
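If the wheel installed cleanly, the import below should succeed. As a quick sanity check (not part of the tutorial itself), run it in a subprocess rather than the notebook kernel, so HybridBackend is not activated before the baseline run later in this tutorial:

!python3 -c "import hybridbackend.tensorflow"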

Sample ranking model

In this tutorial, a sample ranking model based on stacked DCNv2 is used. You can see the code in ranking for more details.
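For orientation, the core building block of DCNv2 is the cross layer, which computes x_{l+1} = x0 * (W xl + b) + xl to model explicit feature interactions with a residual connection. The snippet below is a minimal TF 1.x sketch of that formula for illustration only; the actual model used here lives in the ranking package and may be implemented differently.

import tensorflow as tf

def cross_layer_v2(x0, xl, name):
  # One DCNv2 cross layer: x_{l+1} = x0 * (W @ xl + b) + xl
  dim = int(x0.shape[-1])
  w = tf.get_variable(name + '_w', shape=[dim, dim])
  b = tf.get_variable(name + '_b', shape=[dim],
                      initializer=tf.zeros_initializer())
  return x0 * (tf.matmul(xl, w) + b) + xl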

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'  # hide TensorFlow C++ log noise
import tensorflow as tf
# Silence repeated deprecation warnings for cleaner notebook output.
from tensorflow.python.util import module_wrapper as deprecation
deprecation._PER_MODULE_WARNING_LIMIT = 0
tf.get_logger().propagate = False
from ranking.data import DataSpec
from ranking.model import stacked_dcn_v2
from ranking.model import wide_and_deep_features
# Global configuration
train_max_steps = 100
train_batch_size = 16000
data_spec = DataSpec.read('ranking/taobao/data/spec.json')
def train(iterator, embedding_weight_device, dnn_device, hooks):
  # embedding_weight_device: where embedding tables live (host memory below);
  # dnn_device: where the dense feature-interaction layers run.
  batch = iterator.get_next()
  batch.pop('ts')  # drop the timestamp field; it is not a model input
  labels = tf.reshape(tf.to_float(batch.pop('label')), shape=[-1, 1])
  wide_features, deep_features = wide_and_deep_features(
    batch,
    data_spec.defaults,
    data_spec.norms,
    data_spec.logs,
    data_spec.embedding_dims,
    data_spec.embedding_sizes,
    embedding_weight_device)
  with tf.device(dnn_device):
    logits = stacked_dcn_v2(
      wide_features + deep_features,
      [1024, 1024, 512, 256, 1])
    loss = tf.reduce_mean(tf.keras.losses.binary_crossentropy(labels, logits))
    step = tf.train.get_or_create_global_step()
    opt = tf.train.AdagradOptimizer(learning_rate=0.001)
    train_op = opt.minimize(loss, global_step=step)
  hooks.append(tf.train.StepCounterHook(10))
  hooks.append(tf.train.StopAtStepHook(train_max_steps))
  config = tf.ConfigProto(allow_soft_placement=True)
  config.gpu_options.allow_growth = True
  config.gpu_options.force_gpu_compatible = True
  with tf.train.MonitoredTrainingSession(
      '', hooks=hooks, config=config) as sess:
    while not sess.should_stop():
      sess.run(train_op)

Training without HybridBackend

Without HybridBackend, training the sample ranking model underutilizes the GPU.

# Download training data in TFRecord format
!wget http://easyrec.oss-cn-beijing.aliyuncs.com/data/taobao/day_0.tfrecord
with tf.Graph().as_default():
  ds = tf.data.TFRecordDataset('./day_0.tfrecord', compression_type='GZIP')
  ds = ds.batch(train_batch_size, drop_remainder=True)
  # Batch first, then parse: tf.io.parse_example is vectorized over a batch.
  ds = ds.map(
    lambda batch: tf.io.parse_example(batch, data_spec.to_example_spec()))
  ds = ds.prefetch(2)
  iterator = tf.data.make_one_shot_iterator(ds)
  with tf.device('/gpu:0'):
    train(iterator, '/cpu:0', '/gpu:0', [])
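To observe the underutilization directly, you can watch GPU usage while the training cell runs (e.g. watch -n 1 nvidia-smi in a terminal), or take a snapshot from the notebook:

!nvidia-smi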

Training with HybridBackend

With just a one-line import, HybridBackend automatically applies packing and interleaving to speed up the embedding layer dramatically.

# Note: once HybridBackend is imported, restart the notebook kernel to turn it off.
import hybridbackend.tensorflow as hb
# Exact same code except HybridBackend is on.
with tf.Graph().as_default():
  ds = tf.data.TFRecordDataset('./day_0.tfrecord', compression_type='GZIP')
  ds = ds.batch(train_batch_size, drop_remainder=True)
  ds = ds.map(
    lambda batch: tf.io.parse_example(batch, data_spec.to_example_spec()))
  ds = ds.prefetch(2)
  iterator = tf.data.make_one_shot_iterator(ds)
  with tf.device('/gpu:0'):
    train(iterator, '/cpu:0', '/gpu:0', [])
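Both runs log global_step/sec through the StepCounterHook added in train(); multiplying that value by train_batch_size gives throughput in samples per second, so the two runs can be compared directly. A hypothetical illustration:

# If the baseline logs global_step/sec: 2.5 and this run logs 5.0,
# with train_batch_size = 16000 the throughput goes from
print(2.5 * 16000, '->', 5.0 * 16000, 'samples/sec')  # 40000.0 -> 80000.0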

Training with HybridBackend (Optimized data pipeline)

Even greater training performance gains can be achieved by using the optimized data pipeline provided by HybridBackend.

# Download training data in Parquet format
!wget http://easyrec.oss-cn-beijing.aliyuncs.com/data/taobao/day_0.parquet
# Note: once HybridBackend is imported, restart the notebook kernel to turn it off.
import hybridbackend.tensorflow as hb
with tf.Graph().as_default():
  # ParquetDataset reads and batches columnar data directly,
  # replacing the separate batch/parse stages needed for TFRecord.
  ds = hb.data.ParquetDataset(
    './day_0.parquet',
    batch_size=train_batch_size,
    num_parallel_parser_calls=tf.data.experimental.AUTOTUNE,
    drop_remainder=True)
  ds = ds.apply(hb.data.to_sparse())  # convert list columns to sparse tensors
  ds = ds.prefetch(2)
  iterator = tf.data.make_one_shot_iterator(ds)
  with tf.device('/gpu:0'):
    # Wrap the iterator so batches are prefetched (capacity 2) ahead of use.
    iterator = hb.data.Iterator(iterator, 2)
    train(iterator, '/cpu:0', '/gpu:0', [hb.data.Iterator.Hook()])
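hb.data.ParquetDataset consumes standard Parquet files, so you can prepare your own training data with common tools. A minimal sketch (with hypothetical column names, assuming pandas and pyarrow are installed):

import pandas as pd

# Hypothetical toy columns; replace with your own feature fields.
df = pd.DataFrame({
    'label': [0, 1, 0],
    'user_id': [101, 102, 103],
    'item_id': [5001, 5002, 5003],
})
df.to_parquet('my_data.parquet')  # uses the pyarrow engine if installed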

