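PAI-Blade General Inference Optimization: System Optimization in Practice (Part 2) | Study Notes - Alibaba Cloud Developer Community

Study notes for the Developer Academy course "PAI Platform Learning Path: Machine Learning from Beginner to Application: PAI-Blade General Inference Optimization: System Optimization in Practice (Part 2)". The notes follow the course closely so that readers can pick up the material quickly.

Course link: https://developer.aliyun.com/learning/course/855/detail/14120

PAI-Blade General Inference Optimization: System Optimization in Practice (Part 2)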

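Contents:

I. AI inference acceleration

II. ResNet50 example

III. RetinaNet example 1

IV. RetinaNet example 2

This is the second part on Blade, the general inference optimization tool, and covers optimizing PyTorch models with Blade. Blade's main characteristics are that it is general, automatic, adapts to different hardware, and has been validated extensively in Alibaba Cloud production environments.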

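I. AI inference acceleration

1. General optimization

2. Automatic optimization

3. Adaptation to different hardware

4. Validation in production environments

5. Blade technical architecture

On top of this foundation, Blade supports several AI framework front ends, such as TensorFlow and PyTorch. During optimization it performs computation graph optimization and conversion, compilation optimization, and quantization, and it uses different hardware accelerators to speed up the model. The optimized model is still in the original framework's model format, so it can be deployed directly in the original framework, and optimization can be introduced with only a few code changes.

There are three examples: a ResNet50 model optimization example, a Detectron2 model optimization example, and a PyTorch extension (custom operator) optimization example. The first, the ResNet50 example, optimizes a PyTorch model with Blade using O1 optimization and O2 quantization.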

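II. ResNet50 example

This example shows how to install the Blade optimization environment, load a PyTorch model and test data, optimize the PyTorch model losslessly with Blade O1, and quantize it with Blade O2.

The first step is installing the example environment. We use PyTorch 1.8.1 with CUDA 10.2 and the matching 3.16.0 packages; two pip commands install the Blade environment.

In the second step we load a model from torchvision.models, move it to the CUDA device, export it as a TorchScript model with torch.jit.script, and create a set of input data for it.

For the lossless optimization, the TorchScript model and the test data created above are passed to Blade's optimize function, together with the optimization level and the target device, which is GPU here. Blade prints its progress while it runs. The call returns three values: optimized_model is the optimized model, still a Torch model, which can be saved and used for deployment; opt_spec contains the information needed to reproduce the optimization result; report is used to print the optimization report. The printed report includes the software environment, the hardware environment, and the shape of the input test data.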

1. Installing the example environment

· Linux with Python >= 3.6

· CUDA 10.2

· PyTorch >= 1.8.1

· Blade >= 3.16.0

Run the following commands to install PyTorch and the wheel packages of PAI-Blade Agile Edition. For details, see "Installing PAI-Blade Agile Edition".

!pip install pai_blade_gpu==3.16.0+cu102 -f https://pai-blade.oss-cn-zhangjiakou.aliyuncs.com/release/repo_ext.html

!pip install torch_addons==3.16.0+1.8.1.cu102 -f https://pai-blade.oss-cn-zhangjiakou.aliyuncs.com/release/repo_ext.html
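The version pinned for torch_addons appears to encode three things: the Blade release, the PyTorch release, and the CUDA build. The sketch below splits such a string so the parts can be checked against the environment list. The layout `<blade>+<torch>.<cuda tag>` is an assumption inferred from the command above, and `parse_addons_version` is a hypothetical helper, not part of Blade.

```python
def parse_addons_version(spec):
    """Split a version such as '3.16.0+1.8.1.cu102' into its parts.

    Assumed layout (inferred, not documented here):
    <blade version>+<torch version>.<cuda tag>
    """
    blade_version, _, local = spec.partition("+")
    torch_version, _, cuda_tag = local.rpartition(".")
    return blade_version, torch_version, cuda_tag


print(parse_addons_version("3.16.0+1.8.1.cu102"))
```

If the three parts do not match the installed PyTorch and CUDA versions, a different wheel is probably needed.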

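2. Loading the PyTorch model and test data

For simplicity, the ResNet50 model from torchvision.models is used for the demonstration. The example program below performs the following steps:

· Load the torchvision.models.resnet50 model onto the CUDA device

· Export the PyTorch model as a TorchScript model with torch.jit.script

· Construct test input data for the ResNet50 model

import torch
import torchvision.models as models

model = models.resnet50().float().cuda()  # Prepare the model.
model = torch.jit.script(model).eval()  # Export it as a ScriptModule.
dummy = torch.ones(32, 3, 224, 224).cuda()  # Build the test data.
outputs = model(dummy)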

3. Blade O1 lossless optimization

Call the blade.optimize function to optimize the model. The code is as follows:

import blade

config = blade.Config()
config.gpu_config.disable_fp16_accuracy_check = True  # Disable the FP16 numerical accuracy check.
with config:
    optimized_model, opt_spec, report = blade.optimize(
        model,  # The model to optimize.
        'o1',  # Optimization level, o1 or o2.
        device_type='gpu',  # Target device: gpu/cpu/edge.
        test_data=[(dummy,)],  # PyTorch input data is a list of tuples of tensors.
    )

[Progress] 5%, phase: user_test_data_validation.
[Progress] 10%, phase: test_data_deduction.
[Progress] 15%, phase: CombinedSwitch_4.
[Progress] 95%, phase: model_collecting.
[Progress] 100%, Finished!
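The test_data argument is described as a list of tuples: each list item is one group of inputs, and each tuple holds the positional arguments of one model call. A minimal sketch of a layout check follows, using only the standard library. `check_test_data` is a hypothetical helper for illustration and is not part of Blade.

```python
def check_test_data(test_data):
    """Return the number of input groups, or raise if the layout is wrong."""
    if not isinstance(test_data, list):
        raise TypeError("test_data must be a list of input groups")
    for i, group in enumerate(test_data):
        if not isinstance(group, tuple):
            raise TypeError("group {} must be a tuple of positional inputs".format(i))
    return len(test_data)


dummy = object()  # Stand-in for a tensor so the sketch runs without PyTorch.
print(check_test_data([(dummy,)]))  # One group with one positional input.
```

A common slip is passing `(dummy,)` or `[dummy]` instead of `[(dummy,)]`; this check flags both.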

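blade.optimize returns three objects:

· optimized_model: the optimized model, here a torch.jit.ScriptModule. It can be saved and used for deployment.

· opt_spec: the configuration, environment variables, and resource files needed to reproduce the optimization result. It takes effect through a with statement.

· report: the optimization report, which can be printed directly. For an explanation of its fields, see "Optimization report".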

Print the optimization report as follows:

print("Report: {}".format(report))

Report: {
  "software_context": [
    {"software": "pytorch", "version": "1.8.1+cu102"},
    {"software": "cuda", "version": "10.2.0"}
  ],
  "hardware_context": {
    "device_type": "gpu",
    "microarchitecture": "T4"
  },
  "user_config": "",
  "diagnosis": {
    "model": "unnamed.pt",
    "test_data_source": "user provided",
    "shape_variation": "undefined",
    "message": "Unable to deduce model inputs information (data type, shape, value range, etc.)",
    "test_data_info": "0 shape: (32, 3, 224, 224) data type: float32"
  },
  "optimizations": [
    {
      "name": "PtTrtPassFp16",
      "status": "effective",
      "speedup": "6.39",
      "pre_run": "98.77 ms",
      "post_run": "15.45 ms"
    }
  ],
  "overall": {
    "baseline": "99.07 ms",
    "optimized": "15.48 ms",
    "speedup": "6.40"
  },
  "model_info": {
    "input_format": "torch_script"
  },
  "compatibility_list": [
    {"device_type": "gpu", "microarchitecture": "T4"}
  ],
  "model_sdk": {}
}

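4. Blade O2 quantization

Enabling quantization requires a calibration dataset, which is used to compute the quantization parameters offline. For a PyTorch model, the calibration dataset is a list containing several groups of input data, for example:

import torch

calib_data = list()
for i in range(10):  # Build 10 groups of calibration data; real applications should use real data.
    image = (torch.ones(32, 3, 224, 224).cuda(),)  # One group of model inputs for inference, as a tuple.
    calib_data.append(image)
test_data = (torch.ones(32, 3, 224, 224).cuda(),)  # One group of test data to assist optimization.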
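To make "computing quantization parameters offline from calibration data" concrete, here is a minimal sketch of one common scheme, symmetric max-abs INT8 calibration, in plain Python. This illustrates the general idea only. It is not Blade's actual calibration algorithm, which the source does not describe, and all function names are hypothetical.

```python
def symmetric_int8_scale(calibration_values):
    """Map the largest magnitude seen during calibration onto the int8 limit 127."""
    max_abs = max(abs(v) for v in calibration_values)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantize(value, scale):
    """Round to the nearest int8 step and clamp to [-127, 127]."""
    q = int(round(value / scale))
    return max(-127, min(127, q))


def dequantize(q, scale):
    """Map an int8 step back to an approximate real value."""
    return q * scale


calib = [-2.0, -0.5, 0.25, 1.27]  # Stand-in for activations observed on calibration inputs.
scale = symmetric_int8_scale(calib)
q = quantize(0.5, scale)
print(scale, q, dequantize(q, scale))
```

The scale depends entirely on the values seen during calibration, which is why the comment in the cell above recommends real data: with all-ones inputs the observed range may not resemble production traffic.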

As with the O1 optimization above, you only need to specify optimization_level='o2' when optimizing with PAI-Blade to enable quantization.

import blade

optimized_model, opt_spec, report = blade.optimize(
    model=model,
    optimization_level='o2',
    device_type='gpu',
    test_data=test_data,
    calib_data=calib_data,
)
print("Report: {}".format(report))

[Progress] 5%, phase: user_test_data_validation.
[Progress] 10%, phase: test_data_deduction.
[Progress] 15%, phase: CombinedSwitch_9.
[Progress] 95%, phase: model_collecting.
[Progress] 100%, Finished!

Report: {
  "software_context": [
    {"software": "pytorch", "version": "1.8.1+cu102"},
    {"software": "cuda", "version": "10.2.0"}
  ],
  "hardware_context": {
    "device_type": "gpu",
    "microarchitecture": "T4"
  },
  "user_config": "",
  "diagnosis": {
    "model": "unnamed.pt",
    "test_data_source": "user provided",
    "shape_variation": "undefined",
    "message": "Unable to deduce model inputs information (data type, shape, value range, etc.)",
    "test_data_info": "0 shape: (32, 3, 224, 224) data type: float32"
  },
  "optimizations": [
    {
      "name": "PyTorchQuantizeInt8_Fp32",
      "status": "effective",
      "speedup": "11.41",
      "pre_run": "88.21 ms",
      "post_run": "7.73 ms"
    }
  ],
  "overall": {
    "baseline": "88.01 ms",
    "optimized": "8.08 ms",
    "speedup": "10.90"
  },
  "model_info": {
    "input_format": "torch_script"
  },
  "compatibility_list": [
    {"device_type": "gpu", "microarchitecture": "T4"}
  ],
  "model_sdk": {}
}

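The O1 report also lists the data type and the optimization pass that took effect; the final result is a 6.4x speedup over the baseline. The second optimization technique is quantization. Enabling it normally requires a calibration dataset for computing the quantization parameters offline. In the example we constructed a simulated calibration dataset and also created a set of test data to assist optimization. As with the O1 optimization, besides supplying these datasets we only need to set the optimization level to o2; the other arguments stay the same. The report shows that the pass that took effect is INT8 quantization. Compared with the FP16 optimization above, quantization achieves a higher speedup, and the latency of the optimized model is only about half that of the FP16 model. That concludes the first example.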
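The speedup figures can be recomputed from the "overall" section of each report. The sketch below assumes the printed report is valid JSON with latency strings of the form "99.07 ms"; only the overall latencies from the two ResNet50 reports are copied in. The helper names are hypothetical.

```python
import json


def latency_ms(text):
    """Convert a string such as '99.07 ms' to a float in milliseconds."""
    return float(text.replace("ms", "").strip())


def overall_speedup(report_json):
    """Baseline latency divided by optimized latency from the 'overall' section."""
    overall = json.loads(report_json)["overall"]
    return latency_ms(overall["baseline"]) / latency_ms(overall["optimized"])


# Overall latencies copied from the O1 (FP16) and O2 (INT8) ResNet50 reports.
o1_report = '{"overall": {"baseline": "99.07 ms", "optimized": "15.48 ms"}}'
o2_report = '{"overall": {"baseline": "88.01 ms", "optimized": "8.08 ms"}}'

print("O1 speedup: {:.2f}".format(overall_speedup(o1_report)))
print("O2 speedup: {:.2f}".format(overall_speedup(o2_report)))
print("O1/O2 optimized latency ratio: {:.2f}".format(15.48 / 8.08))
```

The recomputed values can differ from the reported speedup in the last digit, since the report rounds its latencies. The latency ratio of roughly 1.9 is what "about half" refers to.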

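III. RetinaNet example 1

RetinaNet is a one-stage RCNN-type detection network. Its basic structure consists of a backbone, several subnets, and NMS post-processing. Many training frameworks implement it, Detectron2 being a typical one. Using Detectron2's standard RetinaNet implementation as the example, this section describes how to optimize RetinaNet (Detectron2) type models with Blade.

1. Example environment

· Linux with Python >= 3.6

· CUDA 10.2

· PyTorch >= 1.8.1 and Detectron2 >= 0.4.1

· Blade >= 3.16.0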

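2. Model export

Detectron2 is a flexible, extensible, and configurable training framework for object detection and segmentation, open-sourced by FAIR. Because of the framework's flexibility, exporting a model with conventional methods may fail or produce an incorrect result. To support TorchScript deployment, Detectron2 provides two export methods: TracingAdapter and scripting_with_instances.

For details on exporting and deploying Detectron2 models, see the official example:

https://detectron2.readthedocs.io/en/latest/tutorials/deployment.html#usage

Blade accepts TorchScript models in any form. This article uses scripting_with_instances to walk through export and optimization: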

import torch
import numpy as np
from torch import Tensor
from torch.testing import assert_allclose
from detectron2 import model_zoo
from detectron2.export import scripting_with_instances
from detectron2.structures import Boxes
from detectron2.data.detection_utils import read_image

# Export the RetinaNet model with scripting_with_instances.
def load_retinanet(config_path):
    model = model_zoo.get(config_path, trained=True).eval()
    fields = {
        "pred_boxes": Boxes,
        "scores": Tensor,
        "pred_classes": Tensor,
    }
    script_model = scripting_with_instances(model, fields)
    return model, script_model

# Download a sample image.
!wget http://images.cocodataset.org/val2017/000000439715.jpg -q -O input.jpg
img = read_image('./input.jpg')
img = torch.from_numpy(np.ascontiguousarray(img.transpose(2, 0, 1)))

# Run both models and compare the results before and after export.
pytorch_model, script_model = load_retinanet("COCO-Detection/retinanet_R_50_FPN_3x.yaml")
with torch.no_grad():
    batched_inputs = [{"image": img.float()}]
    pred1 = pytorch_model(batched_inputs)
    pred2 = script_model(batched_inputs)
    assert_allclose(pred1[0]['instances'].scores, pred2[0].scores)

Loading config /usr/local/lib64/python3.6/site-packages/detectron2/model_zoo/configs/COCO-Detection/../Base-RetinaNet.yaml with yaml.unsafe_load. Your machine may be at risk if the file contains malicious content.

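In this example the model is exported with scripting_with_instances. The key export step is the short block of code below: the type of each field of the special data structure (Instances) is declared and passed to the scripting function, which returns a script model. We then downloaded an image and compared the results of the original model and the script model.

    model = model_zoo.get(config_path, trained=True).eval()
    fields = {
        "pred_boxes": Boxes,
        "scores": Tensor,
        "pred_classes": Tensor,
    }
    script_model = scripting_with_instances(model, fields)
    return model, script_model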

3. Blade optimization

After exporting the model to be optimized, we start the Blade optimization. As in the previous example, we call blade.optimize on the model, with the following inputs:

· The TorchScript model script_model exported above is the model input. Blade o1-level optimization is enabled, the target device is GPU, and a set of test data is supplied to assist optimization and testing.

(1) Calling Blade to optimize the model

import blade

test_data = [(batched_inputs,)]  # PyTorch input data is a list of tuples.
optimized_model, opt_spec, report = blade.optimize(
    script_model,
    'o1',
    device_type='gpu',
    test_data=test_data,
)

[Progress] 5%, phase: user_test_data_validation.
[Progress] 10%, phase: test_data_deduction.
[Progress] 15%, phase: CombinedSwitch_4.
[Progress] 95%, phase: model_collecting.
[Progress] 100%, Finished!

(2) Printing the optimization report and saving the model

The model optimized by Blade is still a TorchScript model. Once optimization is complete, the code below prints the optimization report and saves the optimized model:

# Print the optimization report.
print("Report: {}".format(report))
torch.jit.save(optimized_model, 'optimized.pt')

Report: {
  "software_context": [
    {"software": "pytorch", "version": "1.8.1+cu102"},
    {"software": "cuda", "version": "10.2.0"}
  ],
  "hardware_context": {
    "device_type": "gpu",
    "microarchitecture": "T4"
  },
  "user_config": "",
  "diagnosis": {
    "model": "unnamed.pt",
    "test_data_source": "user provided",
    "shape_variation": "undefined",
    "message": "Unable to deduce model inputs information (data type, shape, value range, etc.)",
    "test_data_info": "0 shape: (3, 480, 640) data type: float32"
  },
  "optimizations": [
    {
      "name": "PtTrtPassFp16",
      "status": "effective",
      "speedup": "3.77",
      "pre_run": "40.64 ms",
      "post_run": "10.78 ms"
    }
  ],
  "overall": {
    "baseline": "40.73 ms",
    "optimized": "10.76 ms",
    "speedup": "3.79"
  },
  "model_info": {
    "input_format": "torch_script"
  },
  "compatibility_list": [
    {"device_type": "gpu", "microarchitecture": "T4"}
  ],
  "model_sdk": {}
}

(3) Performance test before and after optimization

import time

@torch.no_grad()
def benchmark(model, inp):
    # Warm up before timing.
    for i in range(100):
        model(inp)
    torch.cuda.synchronize()
    start = time.time()
    for i in range(200):
        model(inp)
    torch.cuda.synchronize()
    elapsed_ms = (time.time() - start) * 1000
    print("Latency: {:.2f}".format(elapsed_ms / 200))

# Benchmark the model before optimization.
benchmark(pytorch_model, batched_inputs)
# Benchmark the model after optimization.
benchmark(optimized_model, batched_inputs)

Latency: 42.38
Latency: 10.77

As the results show, over the same 200 iterations the average latency is 42.38 ms before optimization and 10.77 ms after.
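The benchmark above follows a common pattern: warm-up runs, a device synchronization, a timed loop, and another synchronization before the clock is read. A device-agnostic sketch of the same pattern is shown below. The `sync` parameter is an addition for illustration, so the sketch runs without a GPU, and `time.perf_counter` replaces `time.time` because it is a monotonic clock better suited to short intervals.

```python
import time


def benchmark(fn, warmup=100, iters=200, sync=None):
    """Return the mean latency of fn() in milliseconds.

    sync is an optional callable (for example torch.cuda.synchronize)
    that blocks until queued device work has finished.
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()
    return (time.perf_counter() - start) * 1000 / iters


latency = benchmark(lambda: sum(range(1000)), warmup=10, iters=50)
print("Latency: {:.4f} ms".format(latency))
```

Without the synchronization calls, GPU kernels launched asynchronously may still be running when the timer stops, which would understate the latency.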

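(4) Deployment

NOTE: During the trial period, set this environment variable to prevent the program from exiting because of an authentication failure.

export BLADE_AUTH_USE_COUNTING=1

NOTE: Before formal deployment, contact the PAI team to obtain authorization.

# Required. Contact the PAI team to obtain it.
export BLADE_REGION=<region>

# Required. Contact the PAI team to obtain it.
export BLADE_TOKEN=<token>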
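Because a missing variable only shows up as an authentication failure at run time, it can help to check the environment before loading the model. The sketch below uses the three variable names from the text. `missing_blade_env` is a hypothetical helper, and treating trial mode as needing no region or token is an assumption based on the first NOTE.

```python
import os

REQUIRED = ("BLADE_REGION", "BLADE_TOKEN")


def missing_blade_env(environ=None):
    """Return the names of required variables that are unset or empty.

    Assumption: with BLADE_AUTH_USE_COUNTING=1 (trial mode) no region
    or token is needed, so nothing is reported as missing.
    """
    env = os.environ if environ is None else environ
    if env.get("BLADE_AUTH_USE_COUNTING") == "1":
        return []
    return [name for name in REQUIRED if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
print(missing_blade_env({"BLADE_REGION": "<region>"}))
```

Calling `missing_blade_env()` with no argument checks the real process environment.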

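During optimization, Blade prints progress information. The printed optimization report shows that Blade's FP16 optimization speeds the model up nearly 4x, with latency dropping from about 40 ms to about 10 ms. Benchmarking the models again gives two latencies that are consistent with the report. Finally the model can be deployed; here the optimized model is deployed directly with PyTorch.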

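IV. RetinaNet example 2

RetinaNet example 2: optimizing with Blade combined with TorchScript Custom C++ Operators.

RetinaNet is a one-stage RCNN-type detection network. Its basic structure consists of a backbone, several subnets, and NMS post-processing. Many training frameworks implement it, Detectron2 being a typical one. The previous tutorial showed how to export a Detectron2 model with scripting_with_instances and quickly optimize it with Blade. However, the post-processing code of a detection model usually has to compute and filter boxes, run NMS, and so on. Python implementations of this logic are often inefficient, and even after export to TorchScript there is limited room for optimization. In this case, TorchScript Custom C++ Operators can replace the logic implemented in Python with an efficient C++ implementation; the model is then exported to TorchScript and optimized with Blade. This tutorial shows how to combine Blade and Custom C++ Operators for joint optimization.

1. Example environment

· Linux with Python >= 3.6, GCC >= 5.4

· Nvidia Tesla T4, CUDA 10.2, CuDNN 8.0.5.39

· PyTorch >= 1.8.1 and Detectron2 >= 0.4.1

· Blade >= 3.16.0

2. Creating a PyTorch model with Custom C++ Operators

Blade's optimization capabilities connect seamlessly with PyTorch's TorchScript extension mechanism. This section uses a TorchScript extension to implement the post-processing part of RetinaNet. For an introduction to TorchScript Custom Operators, see the official PyTorch document "Extending TorchScript with Custom C++ Operators".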

(1) Downloading and extracting the example code

This tutorial provides an example program for the RetinaNet post-processing. Its logic comes from NVIDIA's open-source example at https://github.com/NVIDIA/retinanet-examples. Only the core code is extracted here, to illustrate the process of developing a Custom Operator. First download and extract the example code:

!wget -nv https://pai-blade.oss-cn-zhangjiakou.aliyuncs.com/tutorials/retinanet_example/retinanet-examples.tar.gz -O retinanet-examples.tar.gz
!tar xvfz retinanet-examples.tar.gz 1>/dev/null

2021-08-03 01:00:47 URL:https://pai-blade.oss-cn-zhangjiakou.aliyuncs.com/tutorials/retinanet_example/retinanet-examples.tar.gz [71080/71080] -> "retinanet-examples.tar.gz"

(2) Compiling the Custom C++ Operators

The official PyTorch document "Extending TorchScript with Custom C++ Operators" describes three ways to build a Custom Operator:

· Building with CMake

· Building with JIT compilation

· Building with Setuptools

The three methods suit different scenarios, and developers can choose according to their needs. For simplicity, this tutorial uses building with JIT compilation.

import torch.utils.cpp_extension
import os

codebase = "retinanet-examples"
sources = ['csrc/extensions.cpp',
           'csrc/cuda/decode.cu',
           'csrc/cuda/nms.cu',]
sources = [os.path.join(codebase, src) for src in sources]
torch.utils.cpp_extension.load(
    name="custom",
    sources=sources,
    build_directory=codebase,
    extra_include_paths=['/usr/local/TensorRT/include/', '/usr/local/cuda/include/', '/usr/local/cuda/include/thrust/system/cuda/detail'],
    extra_cflags=['-std=c++14', '-O2', '-Wall'],
    extra_cuda_cflags=[
        '-std=c++14',
        '--expt-extended-lambda',
        '--use_fast_math', '-Xcompiler', '-Wall,-fno-gnu-unique',
        '-gencode=arch=compute_75,code=sm_75',],
    is_python_module=False,
    with_cuda=True,
    verbose=False,
)

After these steps complete, the compiled custom.so is saved under retinanet-examples and is used in later steps.

This code comes from the open-source example tutorial; we extracted the core implementation to illustrate the development flow. Compilation and loading are done with the torch.utils.cpp_extension.load function. It produces a file named custom.so, a compiled binary that is used later.
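The flag `-gencode=arch=compute_75,code=sm_75` targets compute capability 7.5, which is the capability of the Tesla T4 listed in the example environment. On other GPUs the flag needs a different number. A small sketch of building it from a (major, minor) pair follows; `gencode_flag` is a hypothetical helper, not part of PyTorch or nvcc.

```python
def gencode_flag(major, minor):
    """Build the nvcc -gencode flag for one compute capability."""
    cc = "{}{}".format(major, minor)
    return "-gencode=arch=compute_{0},code=sm_{0}".format(cc)


print(gencode_flag(7, 5))  # The T4 case used in the build above.
```

On a machine with PyTorch and a GPU, `torch.cuda.get_device_capability()` returns such a (major, minor) pair for the current device, so the flag can be derived instead of hard-coded.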

(3) Replacing the RetinaNet post-processing with Custom C++ Operators

For brevity, RetinaNet.forward is replaced directly with adapter_forward. adapter_forward uses the two Custom C++ Operators decode_cuda and nms_cuda to implement the RetinaNet post-processing.

import os
import torch
from typing import Tuple, Dict, List, Optional

codebase = "retinanet-examples"
torch.ops.load_library(os.path.join(codebase, 'custom.so'))

decode_cuda = torch.ops.retinanet.decode
nms_cuda = torch.ops.retinanet.nms

# The main body of this function is the same as RetinaNet.forward, but the
# post-processing is replaced with the decode_cuda and nms_cuda implementations.
def adapter_forward(self, batched_inputs: Tuple[Dict[str, torch.Tensor]]):
    images = self.preprocess_image(batched_inputs)
    features = self.backbone(images.tensor)
    features = [features[f] for f in self.head_in_features]
    cls_heads, box_heads = self.head(features)
    cls_heads = [cls.sigmoid() for cls in cls_heads]
    box_heads = [b.contiguous() for b in box_heads]

    # Post-processing.
    strides = [images.tensor.shape[-1] / cls_head.shape[-1] for cls_head in cls_heads]
    decoded = [
        decode_cuda(
            cls_head,
            box_head,
            anchor.view(-1),
            stride,
            self.test_score_thresh,
            self.test_topk_candidates,
        )
        for stride, cls_head, box_head, anchor in zip(
            strides, cls_heads, box_heads, self.cell_anchors
        )
    ]
    # Non-maximum suppression.
    return nms_cuda(decoded[0], decoded[1], decoded[2], self.test_nms_thresh, self.max_detections_per_image)

from detectron2.modeling.meta_arch import retinanet

# Replace RetinaNet.forward with adapter_forward.
retinanet.RetinaNet.forward = adapter_forward

Here we load the compiled custom.so and use the two custom ops defined in it to implement the post-processing. In this example the input goes through the backbone and head, and decode_cuda and nms_cuda then process the head outputs; the backbone part is the same as in the original code. The function should be mathematically equivalent to the original implementation. The adapted forward simply replaces the original forward.
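For readers unfamiliar with what the NMS step computes, here is a plain-Python reference of greedy non-maximum suppression over axis-aligned boxes. It is a class-agnostic sketch for illustration only and is not the nms_cuda implementation from the example code. It also shows why a Python version is slow: the work is nested loops over boxes.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh, max_detections):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if len(keep) == max_detections:
            break
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores, iou_thresh=0.5, max_detections=100))  # prints [0, 2]
```

The second box overlaps the first with an IoU of about 0.68, above the 0.5 threshold, so it is suppressed; the third box does not overlap and is kept.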

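(4) TorchScript model export

As in the previous tutorial, to support TorchScript deployment Detectron2 provides two export methods, TracingAdapter and scripting_with_instances.

For details on exporting and deploying Detectron2 models, see:

https://detectron2.readthedocs.io/en/latest/tutorials/deployment.html#usage

With just a few lines of code, the model's post-processing is replaced by a custom implementation. Exporting the resulting model works the same way as the export described earlier; in this example we again use scripting_with_instances.

Blade accepts TorchScript models in any form. This article uses scripting_with_instances to walk through export and optimization: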

import torch
import numpy as np
from torch import Tensor
from torch.testing import assert_allclose
from detectron2 import model_zoo
from detectron2.export import scripting_with_instances
from detectron2.structures import Boxes
from detectron2.data.detection_utils import read_image

# Export the RetinaNet model with scripting_with_instances.
def load_retinanet(config_path):
    model = model_zoo.get(config_path, trained=True).eval()
    # Set a new cell_anchors attribute on the PyTorch model.
    model.cell_anchors = [c.contiguous() for c in model.anchor_generator.cell_anchors]
    fields = {
        "pred_boxes": Boxes,
        "scores": Tensor,
        "pred_classes": Tensor,
    }
    script_model = scripting_with_instances(model, fields)
    return model, script_model

# Download a sample image.
!wget http://images.cocodataset.org/val2017/000000439715.jpg -q -O input.jpg
img = read_image('./input.jpg')
img = torch.from_numpy(np.ascontiguousarray(img.transpose(2, 0, 1)))

# Run both models and compare the results before and after export.
pytorch_model, script_model = load_retinanet("COCO-Detection/retinanet_R_50_FPN_3x.yaml")
with torch.no_grad():
    batched_inputs = [{"image": img.float()}]
    pred1 = pytorch_model(batched_inputs)
    pred2 = script_model(batched_inputs)

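In the export code, the part to pay attention to is the fields setting, which declares the type of each field. scripting_with_instances then returns a TorchScript model. We run both models and compare their results to check that the export is correct.

    fields = {
        "pred_boxes": Boxes,
        "scores": Tensor,
        "pred_classes": Tensor,
    }
    script_model = scripting_with_instances(model, fields)
    return model, script_model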

3. Blade optimization

(1) Calling Blade to optimize the model

In the example below we call blade.optimize on the model, with the following inputs:

· The TorchScript model script_model exported above is the model input

· Blade o1-level optimization is enabled

· The target device is GPU

· A set of test data is supplied to assist optimization and testing

import os
import blade
import torch

# Load the custom C++ operator shared library.
codebase = "retinanet-examples"
torch.ops.load_library(os.path.join(codebase, 'custom.so'))

blade_config = blade.Config()
blade_config.gpu_config.disable_fp16_accuracy_check = True
test_data = [(batched_inputs,)]  # PyTorch input data is a list of tuples.
with blade_config:
    optimized_model, opt_spec, report = blade.optimize(
        script_model,
        'o1',
        device_type='gpu',
        test_data=test_data,
    )

[Progress] 5%, phase: user_test_data_validation.
[Progress] 10%, phase: test_data_deduction.
[Progress] 15%, phase: CombinedSwitch_4.
[Progress] 95%, phase: model_collecting.
[Progress] 100%, Finished!

(2) Printing the optimization report and saving the model

The model optimized by Blade is still a TorchScript model. Once optimization is complete, the code below prints the optimization report and saves both models:

# Print the optimization report.
print("Report: {}".format(report))
torch.jit.save(script_model, 'script_model.pt')
torch.jit.save(optimized_model, 'optimized.pt')

Report: {
  "software_context": [
    {"software": "pytorch", "version": "1.8.1+cu102"},
    {"software": "cuda", "version": "10.2.0"}
  ],
  "hardware_context": {
    "device_type": "gpu",
    "microarchitecture": "T4"
  },
  "user_config": "",
  "diagnosis": {
    "model": "unnamed.pt",
    "test_data_source": "user provided",
    "shape_variation": "undefined",
    "message": "Unable to deduce model inputs information (data type, shape, value range, etc.)",
    "test_data_info": "0 shape: (3, 480, 640) data type: float32"
  },
  "optimizations": [
    {
      "name": "PtTrtPassFp16",
      "status": "effective",
      "speedup": "3.92",
      "pre_run": "40.72 ms",
      "post_run": "10.39 ms"
    }
  ],
  "overall": {
    "baseline": "40.64 ms",
    "optimized": "10.41 ms",
    "speedup": "3.90"
  },
  "model_info": {
    "input_format": "torch_script"
  },
  "compatibility_list": [
    {"device_type": "gpu", "microarchitecture": "T4"}
  ],
  "model_sdk": {}
}

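This step optimizes the exported model that contains the custom operators. Before the Blade optimization we again load the custom operator library first, then pass the TorchScript model to the optimize function, with o1 optimization and GPU as the target device. The optimized model is saved with torch.jit.save. The report shows a speedup of nearly 4x: comparing pre_run and post_run, latency drops from about 40 ms before optimization to about 10 ms after, which is consistent with the overall figures in the Blade report.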

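4. Deployment

NOTE: During the trial period, set this environment variable to prevent the program from exiting because of an authentication failure.

export BLADE_AUTH_USE_COUNTING=1

NOTE: Before formal deployment, contact the PAI team to obtain authorization.

# Required. Contact the PAI team to obtain it.
export BLADE_REGION=<region>

# Required. Contact the PAI team to obtain it.
export BLADE_TOKEN=<token>

The model optimized by Blade is still TorchScript, so you can load and run the optimized result without switching environments.

import blade.runtime.torch
import detectron2
import os
import numpy as np
import torch
from torch.testing import assert_allclose
from detectron2.data.detection_utils import read_image

# Load the custom C++ operator shared library.
codebase = "retinanet-examples"
torch.ops.load_library(os.path.join(codebase, 'custom.so'))

script_model = torch.jit.load('script_model.pt')
optimized_model = torch.jit.load('optimized.pt')

img = read_image('./input.jpg')
img = torch.from_numpy(np.ascontiguousarray(img.transpose(2, 0, 1)))

# Run both models and compare the results before and after optimization.
with torch.no_grad():
    batched_inputs = [{"image": img.float()}]
    pred1 = script_model(batched_inputs)
    pred2 = optimized_model(batched_inputs)
    assert_allclose(pred1[0], pred2[0], rtol=1e-3, atol=1e-2)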
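The rtol and atol arguments set how far the optimized outputs may drift from the reference. The usual elementwise rule for this kind of comparison is |actual - expected| <= atol + rtol * |expected|. The sketch below applies that rule to plain floats; `is_close` is a hypothetical stand-in so the example runs without PyTorch.

```python
def is_close(actual, expected, rtol, atol):
    """Elementwise tolerance test: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# With rtol=1e-3 and atol=1e-2, a value near 0.9 may differ by about 0.0109.
print(is_close(0.905, 0.900, rtol=1e-3, atol=1e-2))  # difference 0.005
print(is_close(0.950, 0.900, rtol=1e-3, atol=1e-2))  # difference 0.05
```

The first comparison passes and the second fails. Tolerances looser than the defaults are typical when comparing an FP16-optimized model with its FP32 reference, since reduced precision introduces small numeric differences.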

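Deploying the optimized TorchScript model with PyTorch is almost unchanged from deploying the original; the only difference is that the model is replaced with the optimized one.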
