Model Deployment Series | 01: Deploying a BERT Model with Triton Server

Background

This article briefly describes how to deploy a BERT model with Triton, based mainly on NVIDIA/DeepLearningExamples.

For more, and more timely, content, follow the WeChat official account 小窗幽记机器学习.

Preparation

Download the data

Go to /data/DeepLearningExamples-master/PyTorch/LanguageModeling/BERT/data/squad and download the data:

bash ./squad_download.sh
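You can sanity-check the download with a small sketch that counts articles, paragraphs and questions in a SQuAD-format JSON file. It relies only on the standard SQuAD layout (data → paragraphs → qas). The helper name is illustrative, and the file name in the usage line below is an assumption: use whichever files squad_download.sh actually placed in the directory.

```python
import json


def squad_stats(path):
    """Count articles, paragraphs and questions in a SQuAD-format JSON file."""
    with open(path, encoding="utf-8") as f:
        articles = json.load(f)["data"]
    n_paragraphs = sum(len(article["paragraphs"]) for article in articles)
    # Every paragraph carries its own list of question/answer entries under "qas".
    n_questions = sum(
        len(paragraph["qas"])
        for article in articles
        for paragraph in article["paragraphs"]
    )
    return {"articles": len(articles), "paragraphs": n_paragraphs, "questions": n_questions}
```

Usage, with a hypothetical path: squad_stats("v1.1/dev-v1.1.json").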

Download the model

wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/bert_large_pyt_amp_ckpt_squad_qa1_1/versions/1/zip -O bert_large_pyt_amp_ckpt_squad_qa1_1_1.zip

The scripts all expect the checkpoint to be named bert_qa.pt, so rename the downloaded model file accordingly.
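Here is a minimal sketch of the rename step. It assumes the downloaded zip contains exactly one checkpoint file ending in .pt or .pth; this article does not show the archive layout. The function name and the destination directory are hypothetical, so point dest_dir at whatever checkpoint path the scripts actually read.

```python
import shutil
import zipfile
from pathlib import Path


def extract_as_bert_qa(zip_path, dest_dir, target_name="bert_qa.pt"):
    """Copy the single .pt/.pth member of zip_path to dest_dir/target_name."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        members = [n for n in zf.namelist() if n.endswith((".pt", ".pth"))]
        if len(members) != 1:
            raise ValueError(f"expected exactly one checkpoint in {zip_path}, found {members}")
        target = dest / target_name
        # Stream the member to its new name so the large checkpoint never sits fully in memory.
        with zf.open(members[0]) as src, open(target, "wb") as out:
            shutil.copyfileobj(src, out)
    return target
```

Usage, with a hypothetical destination: extract_as_bert_qa("bert_large_pyt_amp_ckpt_squad_qa1_1_1.zip", "checkpoints").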

Build the container

bash ./scripts/docker/build.sh

The build output ends with lines like these:

Processing triggers for libc-bin (2.27-3ubuntu1) ...
Removing intermediate container 89010b0a75b2
 ---> 562bcc14dbfa
Step 15/15 : COPY . .
 ---> 23bac3585a43
Successfully built 23bac3585a43
Successfully tagged bert:latest

Model deployment

Export the checkpoint to TorchScript

On the host (not inside a container), go to DeepLearningExamples-master/PyTorch/LanguageModeling/BERT and run the following script to convert the checkpoint to TorchScript:

bash ./triton/export_model.sh

Conversion output:

=============
== PyTorch ==
=============

NVIDIA Release 20.06 (build 13419386)
PyTorch Version 1.6.0a0+9907a3e

Container image Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.

Copyright (c) 2014-2020 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU                      (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015      Google Inc.
Copyright (c) 2015      Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

NOTE: Legacy NVIDIA Driver detected.  Compatibility mode ENABLED.

NOTE: MOFED driver for multi-node communication was not detected.
      Multi-node communication performance may be reduced.

deploying model bertQA-ts-script in format pytorch_libtorch
/opt/conda/lib/python3.6/site-packages/torch/jit/_recursive.py:160: UserWarning: 'bias' was found in ScriptModule constants,  but it is a non-constant parameter. Consider removing it.
  " but it is a non-constant {}. Consider removing it.".format(name, hint))

conversion correctness test results
-----------------------------------
maximal absolute error over dataset (L_inf):  0.0322265625

average L_inf error over output tensors:  0.02264404296875
variance of L_inf error over output tensors:  5.4970383644104004e-05
stddev of L_inf error over output tensors:  0.00741420148391612

time of error check of native model:  0.8040032386779785 seconds
time of error check of ts model:  1.7353665828704834 seconds

done
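The "conversion correctness test results" compare the outputs of the native PyTorch model with those of the exported model. Below is a minimal sketch of how statistics of this kind could be computed. It assumes each output tensor is a flat list of floats and that the variance is a population variance; the exact definitions in the NVIDIA script may differ. In the log above, the reported stddev is indeed the square root of the variance (sqrt(5.497e-05) ≈ 0.00741).

```python
import math
import statistics


def conversion_error_report(native_outputs, converted_outputs):
    """Compare per-tensor outputs of the native and exported model.

    Each argument is a list of output tensors; each tensor is a flat list of floats.
    """
    # L_inf error of one tensor = largest absolute elementwise difference.
    per_tensor = [
        max(abs(a - b) for a, b in zip(native, converted))
        for native, converted in zip(native_outputs, converted_outputs)
    ]
    variance = statistics.pvariance(per_tensor)
    return {
        "max_linf": max(per_tensor),          # maximal absolute error over the dataset
        "mean_linf": statistics.mean(per_tensor),
        "var_linf": variance,
        "std_linf": math.sqrt(variance),
    }
```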

After the conversion, the Triton model ready for deployment is stored in BERT/results/triton_models.
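Triton expects a model repository in which each model is a directory holding a config.pbtxt and one numbered subdirectory per version. This sketch (the helper name is illustrative) lists what the export step produced so you can check it before starting the server:

```python
from pathlib import Path


def describe_model_repository(repo_dir):
    """Map each model directory to (has config.pbtxt, sorted numeric version dirs)."""
    summary = {}
    for model_dir in sorted(Path(repo_dir).iterdir()):
        if not model_dir.is_dir():
            continue  # stray files at the top level are not models
        versions = sorted(
            int(p.name) for p in model_dir.iterdir() if p.is_dir() and p.name.isdigit()
        )
        summary[model_dir.name] = ((model_dir / "config.pbtxt").is_file(), versions)
    return summary
```

Usage: describe_model_repository("results/triton_models").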

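In ./triton/export_model.sh, an EXPORT_FORMAT value of ts-script exports the model to TorchScript. To deploy in ONNX format instead, set EXPORT_FORMAT in ./triton/export_model.sh to onnx. Also change triton_model_name accordingly (for example to bertQA-onnx) so that the newly converted model gets a matching name.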
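Changing EXPORT_FORMAT changes the Triton backend that serves the model. The export logs in this article show pytorch_libtorch for ts-script and onnxruntime_onnx for onnx. Below is a small lookup sketch that assumes Triton's default model file names (model.pt and model.onnx) and a version directory of 1. The helper and the version number are illustrative and are not taken from the NVIDIA scripts.

```python
from pathlib import Path

# Triton platform and default model file name for the two export formats used here.
EXPORT_FORMATS = {
    "ts-script": ("pytorch_libtorch", "model.pt"),
    "onnx": ("onnxruntime_onnx", "model.onnx"),
}


def expected_artifact(repo_dir, model_name, export_format, version=1):
    """Return (platform, path) where Triton would look for the exported model file."""
    if export_format not in EXPORT_FORMATS:
        raise ValueError(f"unsupported export format: {export_format!r}")
    platform, filename = EXPORT_FORMATS[export_format]
    return platform, Path(repo_dir) / model_name / str(version) / filename
```

Giving each format its own model name (bertQA-ts-script, bertQA-onnx) lets both exports live side by side in one repository.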

Start the Triton server

Start the Triton server with the following command:

docker run --rm --gpus device=0 --ipc=host --network=host -p 8000:8000 -p 8001:8001 -p 8002:8002 -v $PWD/results/triton_models:/models nvcr.io/nvidia/tritonserver:20.06-v1-py3 trtserver --model-store=/models --log-verbose=1
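Before sending requests, you can poll the server until it reports ready. The sketch below assumes that the v1 HTTP API of the 20.06-v1 image exposes /api/health/ready; newer v2 servers use /v2/health/ready instead. The helper name and the timeouts are illustrative.

```python
import http.client
import time
import urllib.request


def wait_until_ready(url="http://localhost:8000/api/health/ready", timeout=60.0, interval=1.0):
    """Poll a Triton health endpoint until it answers HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            pass  # not up yet: connection refused, HTTP error status, or timeout
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```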

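The image nvcr.io/nvidia/tritonserver:20.06-v1-py3 had not yet been pulled locally, so the command pulls it first. With --network=host the container shares the host's network stack, so Docker ignores the -p mappings. The server listens directly on host ports 8000 (HTTP), 8001 (gRPC) and 8002 (metrics).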

Also note that --model-store=/models points to the mounted ./results/triton_models directory. That directory contains two models, so the server loads both at startup.

Once the server is up, check GPU memory usage (for example with nvidia-smi) to confirm the models have been loaded.
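Instead of reading nvidia-smi's full table, you can query the memory figures in CSV form. This sketch assumes nvidia-smi is on the host's PATH. The fields come from nvidia-smi's --query-gpu options, and with nounits the memory values are reported in MiB. The helper names are illustrative.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' rows (MiB) into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        rows.append({"index": int(index), "used_mib": int(used), "total_mib": int(total)})
    return rows


def gpu_memory():
    """Run nvidia-smi on the host and return per-GPU memory usage."""
    out = subprocess.run(QUERY, check=True, capture_output=True, text=True).stdout
    return parse_gpu_memory(out)
```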

Start the custom Triton client

./triton/client.py contains the custom client code.

Step 1: Start a client container

docker run -it --rm --ipc=host --network=host -v $PWD/vocab:/workspace/bert/vocab bert:latest

PS:
The client does not need a GPU. Because the container is started with --rm, it is removed automatically when you exit it from the terminal.

This starts a container and drops you into a shell inside it.

Step 2: Run the client
Go to the client code directory (cd /workspace/bert/triton/) and run the following command to send a request to the bertQA-ts-script model:

python client.py --do_lower_case --version_2_with_negative --vocab_file=../vocab/vocab --triton-model-name=bertQA-ts-script

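The client sends a request to the running Triton server, which processes it and returns the result. To use your own passage and question, pass them with the --question and --context arguments when running client.py. You can also select a specific model with --triton-model-name. Because the server has loaded two models, the client can also query the ONNX model: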
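Conceptually, a SQuAD-style client packs the question and context into fixed-length BERT inputs. It then turns the start/end logits returned by the server into an answer span. The pure-Python sketch below illustrates that standard pre- and post-processing. It is not the code of client.py: tokenization is omitted, the token ids are hypothetical, and neither long-context splitting nor the SQuAD 2.0 "no answer" case (--version_2_with_negative) is handled.

```python
def build_features(question_ids, context_ids, cls_id, sep_id, max_seq_len=384, pad_id=0):
    """Pack [CLS] question [SEP] context [SEP] into fixed-length BERT inputs."""
    input_ids = [cls_id] + question_ids + [sep_id] + context_ids + [sep_id]
    if len(input_ids) > max_seq_len:
        raise ValueError("sequence too long; a real client splits long contexts with a stride")
    # Segment 0 covers [CLS] question [SEP]; segment 1 covers context [SEP].
    segment_ids = [0] * (len(question_ids) + 2) + [1] * (len(context_ids) + 1)
    input_mask = [1] * len(input_ids)
    padding = max_seq_len - len(input_ids)
    return (input_ids + [pad_id] * padding,
            segment_ids + [0] * padding,
            input_mask + [0] * padding)


def best_span(start_logits, end_logits, context_start, context_end, max_answer_len=30):
    """Return (start, end) maximizing start+end logit, restricted to context positions."""
    best, best_score = None, float("-inf")
    for s in range(context_start, context_end):
        for e in range(s, min(s + max_answer_len, context_end)):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best
```

For a question of length q and a context of length c, context tokens occupy positions q + 2 through q + 1 + c, so context_start = q + 2 and context_end = q + 2 + c.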

python client.py --do_lower_case --version_2_with_negative --vocab_file=../vocab/vocab --triton-model-name=bertQA-onnx

Post-deployment evaluation: SQuAD 1.1

To deploy and evaluate the model, run the following command on the host:

bash ./triton/evaluate.sh

PS:
Before deploying and evaluating, stop the Triton server started earlier; otherwise the ports will conflict.
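A quick way to confirm that the earlier server is really gone is to check whether anything still accepts connections on Triton's ports. This sketch (helper name illustrative) tries a TCP connection to each port on the local host. If it reports busy ports, stop the old server container (for example with docker stop) first.

```python
import socket

TRITON_PORTS = (8000, 8001, 8002)  # HTTP, gRPC, metrics


def ports_in_use(ports=TRITON_PORTS, host="127.0.0.1"):
    """Return the ports on host that already accept TCP connections."""
    busy = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(0.5)
            # connect_ex returns 0 when something is listening on the port.
            if s.connect_ex((host, port)) == 0:
                busy.append(port)
    return busy
```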

The server startup and evaluation output looks like this:

deploying model bert_large_fp32 in format pytorch_libtorch
/opt/conda/lib/python3.6/site-packages/torch/jit/_recursive.py:160: UserWarning: 'bias' was found in ScriptModule constants,  but it is a non-constant parameter. Consider removing it.
  " but it is a non-constant {}. Consider removing it.".format(name, hint))

conversion correctness test results
-----------------------------------
maximal absolute error over dataset (L_inf):  1.4185905456542969e-05

average L_inf error over output tensors:  1.0482966899871826e-05
variance of L_inf error over output tensors:  8.773056355456296e-12
stddev of L_inf error over output tensors:  2.961934562993635e-06

time of error check of native model:  1.596167802810669 seconds
time of error check of ts model:  2.414717435836792 seconds

done
Starting server...
Waiting for TRITON Server to be ready at http://localhost:8000...
000
.......TRITON Server is ready!

Sending Requests: 100%|███████████████████████████████████████████████████████████████████████████| 10833/10833 [04:20<00:00, 27.84sentences/s-----------------------------█████████████████████████████████████████████████████████████████████▉| 10832/10833 [14:29<00:00, 12.28sentences/s]
Individual Time Runs
Total Time: 869886.3623142242 ms
-----------------------------
-----------------------------
Total Inference Time = 432310.23 forSentences processed = 10833
Throughput Average (sentences/sec) = 12.45
Throughput Average (batches/sec) = 1.56
-----------------------------
-----------------------------
Summary Statistics
Batch size = 8
Sequence Length = 384
Latency Confidence Level 95 (ms) = 594040.61627388
Latency Confidence Level 99 (ms)  = 615392.275094986
Latency Confidence Level 100 (ms)  = 619993.6480522156
Latency Average (ms)  = 319048.1366518239
-----------------------------
Sending Requests: 100%|███████████████████████████████████████████████████████████████████████████| 10833/10833 [15:16<00:00, 11.82sentences/s]
Processed Requests: 100%|█████████████████████████████████████████████████████████████████████████| 10833/10833 [15:16<00:00, 11.82sentences/s]

trt_server_cont
tritonnet
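The reported throughput equals the number of processed sentences divided by the wall-clock Total Time, not by the Total Inference Time. This small check reproduces both summary values of the TorchScript run from the totals in the log:

```python
def throughput(sentences, total_time_ms, batch_size):
    """Sentences/s and batches/s over the wall-clock Total Time."""
    seconds = total_time_ms / 1000.0
    return sentences / seconds, sentences / batch_size / seconds


sent_per_s, batch_per_s = throughput(10833, 869886.3623142242, batch_size=8)
print(f"{sent_per_s:.2f} sentences/s, {batch_per_s:.2f} batches/s")  # 12.45, 1.56
```

Dividing by the Total Inference Time (432310.23 ms) instead would give about 25 sentences/s, which does not match the summary.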

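Note that by default the service is deployed in TorchScript format and evaluated on SQuAD 1.1. To evaluate the ONNX model, change EXPORT_FORMAT in ./triton/evaluate.sh from ts-script to onnx. The output of the ONNX run follows: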

deploying model bert_large_fp32 in format onnxruntime_onnx
/opt/conda/lib/python3.6/site-packages/torch/onnx/utils.py:955: UserWarning: No names were found for specified dynamic axes of provided input.Automatically generated names will be applied to each dynamic axes of input input__0
  'Automatically generated names will be applied to each dynamic axes of input {}'.format(key))
/opt/conda/lib/python3.6/site-packages/torch/onnx/utils.py:955: UserWarning: No names were found for specified dynamic axes of provided input.Automatically generated names will be applied to each dynamic axes of input input__1
  'Automatically generated names will be applied to each dynamic axes of input {}'.format(key))
/opt/conda/lib/python3.6/site-packages/torch/onnx/utils.py:955: UserWarning: No names were found for specified dynamic axes of provided input.Automatically generated names will be applied to each dynamic axes of input input__2
  'Automatically generated names will be applied to each dynamic axes of input {}'.format(key))
/opt/conda/lib/python3.6/site-packages/torch/onnx/utils.py:955: UserWarning: No names were found for specified dynamic axes of provided input.Automatically generated names will be applied to each dynamic axes of input output__0
  'Automatically generated names will be applied to each dynamic axes of input {}'.format(key))
/opt/conda/lib/python3.6/site-packages/torch/onnx/utils.py:955: UserWarning: No names were found for specified dynamic axes of provided input.Automatically generated names will be applied to each dynamic axes of input output__1
  'Automatically generated names will be applied to each dynamic axes of input {}'.format(key))
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:604] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 1336539136

conversion correctness test results
-----------------------------------
maximal absolute error over dataset (L_inf):  0.00022530555725097656

average L_inf error over output tensors:  0.0001377016305923462
variance of L_inf error over output tensors:  6.448256743378049e-09
stddev of L_inf error over output tensors:  8.030103824595327e-05

time of error check of native model:  1.2507586479187012 seconds
time of error check of onnx model:  76.80649089813232 seconds

done
Starting server...
Waiting for TRITON Server to be ready at http://localhost:8000...
000
.......TRITON Server is ready!

Sending Requests: 100%|███████████████████████████████████████████████████████████████████████████| 10833/10833 [04:40<00:00, 15.52sentences/s-----------------------------█████████████████████████████████████████████████████████████████████▉| 10832/10833 [14:23<00:00, 12.42sentences/s]
Individual Time Runs
Total Time: 863938.3265972137 ms
-----------------------------
-----------------------------
Total Inference Time = 418017.89 forSentences processed = 10833
Throughput Average (sentences/sec) = 12.54
Throughput Average (batches/sec) = 1.57
-----------------------------
-----------------------------
Summary Statistics
Batch size = 8
Sequence Length = 384
Latency Confidence Level 95 (ms) = 568533.2419872284
Latency Confidence Level 99 (ms)  = 591532.5634479523
Latency Confidence Level 100 (ms)  = 595446.0487365723
Latency Average (ms)  = 308500.2912194087
-----------------------------
Sending Requests: 100%|███████████████████████████████████████████████████████████████████████████| 10833/10833 [15:10<00:00, 11.90sentences/s]
Processed Requests: 100%|█████████████████████████████████████████████████████████████████████████| 10833/10833 [15:10<00:00, 11.90sentences/s]

trt_server_cont
tritonnet
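To compare the TorchScript and ONNX runs side by side, you can pull the "Key = value" lines out of each saved evaluation log. In this sketch (helper names illustrative), the regular expression deliberately skips lines containing more than one "=", such as the "Total Inference Time ... forSentences processed" line.

```python
import re

# Key starts with a letter and contains no '='; value is a single number ending the line.
_LINE = re.compile(r"^([A-Za-z][^=]*?)\s*=\s*([-+]?\d[\d.eE+-]*)\s*$")


def parse_summary(log_text):
    """Collect 'Key = number' lines from an evaluation log into a dict of floats."""
    stats = {}
    for line in log_text.splitlines():
        m = _LINE.match(line.strip())
        if not m:
            continue
        try:
            stats[m.group(1)] = float(m.group(2))
        except ValueError:
            pass  # malformed number; ignore the line
    return stats


def compare_runs(baseline, candidate, keys):
    """Relative change of candidate versus baseline for each selected key."""
    return {k: (candidate[k] - baseline[k]) / baseline[k] for k in keys}
```

Applied to the two logs above, throughput goes from 12.45 to 12.54 sentences/s (about +0.7%). At batch size 8 and sequence length 384, the two formats therefore perform almost identically in this setup.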
