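PyTorch 2.2 Official Tutorials, Chinese Edition (9) (3) - Alibaba Cloud Developer Community

Previous part: PyTorch 2.2 Official Tutorials, Chinese Edition (9) (2) https://developer.aliyun.com/article/1482546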

(Optional) Exporting a Model from PyTorch to ONNX and Running It with ONNX Runtime

Original: pytorch.org/tutorials/advanced/super_resolution_with_onnxruntime.html

Translator: 飞龙

License: CC BY-NC-SA 4.0

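Note

As of PyTorch 2.1, there are two versions of the ONNX exporter.

torch.onnx.dynamo_export is the newest exporter (still in beta), based on the TorchDynamo technology released with PyTorch 2.0.
torch.onnx.export is based on the TorchScript backend and has been available since PyTorch 1.2.0.

In this tutorial, we describe how to convert a model defined in PyTorch into the ONNX format using the TorchScript-based torch.onnx.export ONNX exporter.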

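The exported model will be executed with ONNX Runtime. ONNX Runtime is a performance-focused engine for ONNX models that runs inference efficiently across multiple platforms and hardware (Windows, Linux, and Mac, on both CPUs and GPUs). ONNX Runtime has been shown to considerably increase performance across multiple models.

For this tutorial, you will need to install ONNX and ONNX Runtime. You can get binary builds of both with: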

%%bash
pip install onnx onnxruntime

ONNX Runtime recommends using the latest stable runtime for PyTorch.

# Some standard imports
import numpy as np
from torch import nn
import torch.utils.model_zoo as model_zoo
import torch.onnx

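Super-resolution is a way of increasing the resolution of images and videos, and is widely used in image processing and video editing. For this tutorial, we will use a small super-resolution model.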

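First, let's create a SuperResolution model in PyTorch. This model uses the efficient sub-pixel convolution layer described in "Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network" (Shi et al.) to increase the resolution of an image by an upscale factor. The model expects the Y component of the image's YCbCr representation as input, and outputs the upscaled Y component at super resolution.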
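
The rearrangement performed by the sub-pixel (pixel shuffle) layer can be sketched in NumPy. This is a minimal illustration rather than the PyTorch implementation: the helper name pixel_shuffle_np is hypothetical, and the channel ordering is assumed to follow the layout documented for nn.PixelShuffle, where an input of shape (N, C*r*r, H, W) becomes (N, C, H*r, W*r).

```python
import numpy as np

def pixel_shuffle_np(x, r):
    """Rearrange (N, C*r*r, H, W) into (N, C, H*r, W*r). Hypothetical helper."""
    n, c_in, h, w = x.shape
    c_out = c_in // (r * r)
    # Split the channel axis into (c_out, r, r): the two r-axes hold the
    # row and column offsets inside each r x r output block.
    x = x.reshape(n, c_out, r, r, h, w)
    # Interleave the offsets with the spatial axes: (n, c_out, h, r, w, r).
    x = x.transpose(0, 1, 4, 2, 5, 3)
    return x.reshape(n, c_out, h * r, w * r)

# With upscale_factor=3 the last convolution emits 3 ** 2 = 9 channels per pixel.
features = np.arange(1 * 9 * 2 * 2, dtype=np.float32).reshape(1, 9, 2, 2)
upscaled = pixel_shuffle_np(features, 3)
print(upscaled.shape)  # (1, 1, 6, 6)
```

Each group of 9 channels at one low-resolution position fills one 3x3 block of the output, which is why the final convolution produces upscale_factor ** 2 channels.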

The model comes directly from PyTorch's examples without modification:

# Super Resolution model definition in PyTorch
import torch.nn as nn
import torch.nn.init as init
class SuperResolutionNet(nn.Module):
    def __init__(self, upscale_factor, inplace=False):
        super(SuperResolutionNet, self).__init__()
        self.relu = nn.ReLU(inplace=inplace)
        self.conv1 = nn.Conv2d(1, 64, (5, 5), (1, 1), (2, 2))
        self.conv2 = nn.Conv2d(64, 64, (3, 3), (1, 1), (1, 1))
        self.conv3 = nn.Conv2d(64, 32, (3, 3), (1, 1), (1, 1))
        self.conv4 = nn.Conv2d(32, upscale_factor ** 2, (3, 3), (1, 1), (1, 1))
        self.pixel_shuffle = nn.PixelShuffle(upscale_factor)
        self._initialize_weights()
    def forward(self, x):
        x = self.relu(self.conv1(x))
        x = self.relu(self.conv2(x))
        x = self.relu(self.conv3(x))
        x = self.pixel_shuffle(self.conv4(x))
        return x
    def _initialize_weights(self):
        init.orthogonal_(self.conv1.weight, init.calculate_gain('relu'))
        init.orthogonal_(self.conv2.weight, init.calculate_gain('relu'))
        init.orthogonal_(self.conv3.weight, init.calculate_gain('relu'))
        init.orthogonal_(self.conv4.weight)
# Create the super-resolution model by using the above model definition.
torch_model = SuperResolutionNet(upscale_factor=3)

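Ordinarily, you would now train this model; for this tutorial, however, we will instead download some pretrained weights. Note that this model was not trained fully for good accuracy and is used here for demonstration purposes only.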

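It is important to call torch_model.eval() or torch_model.train(False) before exporting the model, to switch the model to inference mode. This is required because operators such as dropout or batchnorm behave differently in inference and training mode.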
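
The difference between the two modes can be seen on a single dropout layer. This is a small sketch that is not part of the original tutorial; the layer and input below are illustrative only and are unrelated to the super-resolution model, which contains no dropout.

```python
import torch
from torch import nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)   # illustrative layer, not part of SuperResolutionNet
x = torch.ones(1, 8)

drop.train()
train_out = drop(x)        # entries are either zeroed or scaled by 1 / (1 - p)

drop.eval()
eval_out = drop(x)         # in inference mode dropout passes the input through

print(train_out)
print(eval_out)
```

Exporting while the module is still in training mode would bake the training-time behavior into the traced graph, which is why eval() is called first.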

# Load pretrained model weights
model_url = 'https://s3.amazonaws.com/pytorch/test_data/export/superres_epoch100-44c6958e.pth'
batch_size = 1    # just a random number
# Initialize model with the pretrained weights
map_location = lambda storage, loc: storage
if torch.cuda.is_available():
    map_location = None
torch_model.load_state_dict(model_zoo.load_url(model_url, map_location=map_location))
# set the model to inference mode
torch_model.eval()

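Exporting a model in PyTorch works via tracing or scripting. This tutorial uses a model exported by tracing as its example. To export a model, we call the torch.onnx.export() function. This executes the model, recording a trace of which operators are used to compute the outputs. Because export runs the model, we need to provide an input tensor x. The values in this tensor can be random as long as it has the right type and size.

Note that the input size will be fixed in the exported ONNX graph for all of the input's dimensions, unless they are specified as dynamic axes. In this example we export the model with a batch size of 1, but then specify the first dimension as dynamic in the dynamic_axes parameter of torch.onnx.export(). The exported model will thus accept inputs of size [batch_size, 1, 224, 224], where batch_size can vary.

To learn more about PyTorch's export interface, check out the torch.onnx documentation.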

# Input to the model
x = torch.randn(batch_size, 1, 224, 224, requires_grad=True)
torch_out = torch_model(x)
# Export the model
torch.onnx.export(torch_model,               # model being run
                  x,                         # model input (or a tuple for multiple inputs)
                  "super_resolution.onnx",   # where to save the model (can be a file or file-like object)
                  export_params=True,        # store the trained parameter weights inside the model file
                  opset_version=10,          # the ONNX version to export the model to
                  do_constant_folding=True,  # whether to execute constant folding for optimization
                  input_names = ['input'],   # the model's input names
                  output_names = ['output'], # the model's output names
                  dynamic_axes={'input' : {0 : 'batch_size'},    # variable length axes
                                'output' : {0 : 'batch_size'}})

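We also computed torch_out, the output of the model, which we will use to verify that the exported model computes the same values when run in ONNX Runtime.

But before verifying the model's output with ONNX Runtime, we will check the ONNX model with the ONNX API. First, onnx.load("super_resolution.onnx") loads the saved model and returns an onnx.ModelProto structure (a top-level file/container format for bundling an ML model; for more information, see the onnx.proto documentation). Then, onnx.checker.check_model(onnx_model) verifies the model's structure and confirms that the model has a valid schema. The validity of the ONNX graph is verified by checking the model's version, the graph's structure, and the nodes along with their inputs and outputs.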

import onnx
onnx_model = onnx.load("super_resolution.onnx")
onnx.checker.check_model(onnx_model)

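Now let's compute the output using ONNX Runtime's Python API. This part can normally be done in a separate process or on another machine, but we will continue in the same process so that we can verify that ONNX Runtime and PyTorch compute the same values for the network.

To run the model with ONNX Runtime, we need to create an inference session for the model with the chosen configuration parameters (here we use the default configuration). Once the session is created, we evaluate the model using the run() API. The output of this call is a list containing the outputs of the model computed by ONNX Runtime.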

import onnxruntime
ort_session = onnxruntime.InferenceSession("super_resolution.onnx", providers=["CPUExecutionProvider"])
def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()
# compute ONNX Runtime output prediction
ort_inputs = {ort_session.get_inputs()[0].name: to_numpy(x)}
ort_outs = ort_session.run(None, ort_inputs)
# compare ONNX Runtime and PyTorch results
np.testing.assert_allclose(to_numpy(torch_out), ort_outs[0], rtol=1e-03, atol=1e-05)
print("Exported model has been tested with ONNXRuntime, and the result looks good!")

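We should see that the outputs of PyTorch and ONNX Runtime match numerically within the given precision (rtol=1e-03 and atol=1e-05). As a side note, if they do not match, there is an issue in the ONNX exporter; please contact us in that case.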
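
To make the two tolerances concrete: np.testing.assert_allclose accepts an element when |actual - desired| <= atol + rtol * |desired|. The values below are made up for illustration and are not outputs of the model.

```python
import numpy as np

rtol, atol = 1e-03, 1e-05
# Hypothetical values chosen to sit just inside the tolerance.
desired = np.array([1.0, 100.0, 0.0])
actual = np.array([1.0009, 100.09, 0.000009])

# Per-element bound applied by np.testing.assert_allclose.
allowed = atol + rtol * np.abs(desired)
within = np.abs(actual - desired) <= allowed
print(allowed)
print(within)

# Raises AssertionError if any element falls outside its bound.
np.testing.assert_allclose(actual, desired, rtol=rtol, atol=atol)
```

The relative term dominates for large values, while atol keeps the check meaningful for values near zero, where a purely relative bound would shrink to nothing.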

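Running the Model on an Image Using ONNX Runtime

So far we have exported a model from PyTorch and shown how to load it and run it in ONNX Runtime with a dummy tensor as input.

For this tutorial, we will use a famous, widely used cat image.

First, let's load the image and preprocess it using the standard PIL Python library. Note that this preprocessing is the standard practice for processing data when training or testing neural networks.

We first resize the image to fit the model's input size (224x224). Then we split the image into its Y, Cb, and Cr components. These components represent a grayscale image (Y) and the blue-difference (Cb) and red-difference (Cr) chroma components. Because the human eye is more sensitive to the Y component, that is the component we are interested in and will transform. After extracting the Y component, we convert it to a tensor, which will be the input of our model.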

from PIL import Image
import torchvision.transforms as transforms
img = Image.open("./_static/img/cat.jpg")
resize = transforms.Resize([224, 224])
img = resize(img)
img_ycbcr = img.convert('YCbCr')
img_y, img_cb, img_cr = img_ycbcr.split()
to_tensor = transforms.ToTensor()
img_y = to_tensor(img_y)
img_y.unsqueeze_(0)
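
For intuition about what the YCbCr split computes, the conversion can be sketched in NumPy. This sketch assumes the JPEG/JFIF (full-range BT.601) coefficients; PIL's own conversion may round differently, so treat it as an approximation of img.convert('YCbCr') rather than a replacement. The helper name rgb_to_ycbcr is hypothetical.

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """Convert an (..., 3) uint8 RGB array to float Y, Cb, Cr planes.

    Assumes JPEG/JFIF coefficients; hypothetical helper for illustration.
    """
    r, g, b = (rgb[..., i].astype(np.float64) for i in range(3))
    y = 0.299 * r + 0.587 * g + 0.114 * b                  # luma (grayscale)
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b     # blue-difference
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b     # red-difference
    return y, cb, cr

# One white, one black, and one pure-red pixel.
pixels = np.array([[[255, 255, 255], [0, 0, 0], [255, 0, 0]]], dtype=np.uint8)
y, cb, cr = rgb_to_ycbcr(pixels)
print(np.round(y, 1))
print(np.round(cb, 1))
print(np.round(cr, 1))
```

White and black differ only in Y while their chroma stays at the neutral value 128, which is why the model can upscale Y alone and leave Cb and Cr to a simple resize.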

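Now, as a next step, let's take the tensor representing the resized grayscale cat image and run the super-resolution model in ONNX Runtime as explained previously.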

ort_inputs = {ort_session.get_inputs()[0].name: to_numpy(img_y)}
ort_outs = ort_session.run(None, ort_inputs)
img_out_y = ort_outs[0]

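At this point, the output of the model is a tensor. Now, we will process the output of the model to construct the final output image from the output tensor, and save the image. The post-processing steps are adopted from the PyTorch implementation of the super-resolution model.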

img_out_y = Image.fromarray(np.uint8((img_out_y[0] * 255.0).clip(0, 255)[0]), mode='L')
# get the output image follow post-processing step from PyTorch implementation
final_img = Image.merge(
    "YCbCr", [
        img_out_y,
        img_cb.resize(img_out_y.size, Image.BICUBIC),
        img_cr.resize(img_out_y.size, Image.BICUBIC),
    ]).convert("RGB")
# Save the image, we will compare this with the output image from mobile device
final_img.save("./_static/img/cat_superres_with_ort.jpg")
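
The scaling and clipping in the first post-processing line can be followed on a tiny array. The values below are hypothetical stand-ins for a model output; they only illustrate why the clip is needed before the cast to 8-bit.

```python
import numpy as np

# Hypothetical model output in the layout used above: (batch, channel, H, W).
out = np.array([[[[-0.1, 0.0], [0.5, 1.2]]]], dtype=np.float32)

scaled = out[0] * 255.0        # drop the batch axis, map [0, 1] to [0, 255]
clipped = scaled.clip(0, 255)  # saturate predictions outside the valid range
pixels = np.uint8(clipped[0])  # drop the channel axis, truncate to 8-bit
print(pixels)
```

The network is free to predict values slightly below 0 or above 1, so without the clip the cast to uint8 would not yield the intended black or white pixels.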

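ONNX Runtime is a cross-platform engine; you can run it across multiple platforms and on both CPUs and GPUs.

ONNX Runtime can also be deployed to the cloud for model inference using Azure Machine Learning Services.

More information about ONNX Runtime and its performance is available through the links in the original tutorial.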

Real Time Inference on Raspberry Pi 4 (30 fps!)

Original: pytorch.org/tutorials/intermediate/realtime_rpi.html

Translator: 飞龙

License: CC BY-NC-SA 4.0

Author: Tristan Rice

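PyTorch has out-of-the-box support for Raspberry Pi 4. This tutorial will guide you through setting up a Raspberry Pi 4 for running PyTorch and running a MobileNet v2 classification model in real time (30 fps+) on the CPU.

This was all tested with a Raspberry Pi 4 Model B 4GB, but it should also work with the 2GB variant, and on the 3B with reduced performance.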

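Prerequisites

To follow this tutorial you'll need a Raspberry Pi 4, a camera for it, and all the other standard accessories.

Raspberry Pi 4 Model B 2GB+
Raspberry Pi Camera Module
Heat sinks and fan (optional but recommended)
5V 3A USB-C power supply
SD card (at least 8GB)
SD card reader/writer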

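Raspberry Pi 4 Setup

PyTorch only provides pip packages for Arm 64-bit (aarch64), so you'll need to install a 64-bit version of the OS on your Raspberry Pi.

You can download the latest arm64 Raspberry Pi OS from downloads.raspberrypi.org/raspios_arm64/images/ and install it via rpi-imager.

32-bit Raspberry Pi OS will not work.

Installation will take at least a few minutes, depending on your internet speed and SD card speed.

Now put your SD card in your Raspberry Pi, connect the camera, and boot it up.

Once it boots and you complete the initial setup, you'll need to edit the /boot/config.txt file to enable the camera.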

# This enables the extended features such as the camera.
start_x=1
# This needs to be at least 128M for the camera processing, if it's bigger you can just leave it as is.
gpu_mem=128
# You need to comment out or remove the existing camera_auto_detect line since it causes issues with OpenCV/V4L2 capture.
#camera_auto_detect=1

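Then reboot. After the reboot, the video4linux2 device /dev/video0 should exist.

Installing PyTorch and OpenCV

PyTorch and all the other libraries we need have ARM 64-bit/aarch64 variants, so you can install them via pip and have them work like on any other Linux system.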

$ pip install torch torchvision torchaudio
$ pip install opencv-python
$ pip install numpy --upgrade

We can now check that everything installed correctly:

$ python -c "import torch; print(torch.__version__)"

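Video Capture

For video capture we're going to use OpenCV to stream the video frames instead of the more common picamera. picamera isn't available on 64-bit Raspberry Pi OS and is much slower than OpenCV. OpenCV directly accesses the /dev/video0 device to grab frames.

The model we're using (MobileNetV2) takes 224x224 images, so we can request that size directly from OpenCV at 36fps. We're targeting 30fps for the model, but we request a slightly higher framerate so there are always enough frames.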

import cv2
from PIL import Image
cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 224)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 224)
cap.set(cv2.CAP_PROP_FPS, 36)

OpenCV returns a numpy array in BGR, so we need to read a frame and do a bit of shuffling to get it into the expected RGB format.

ret, image = cap.read()
# convert opencv output from BGR to RGB
image = image[:, :, [2, 1, 0]]
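
The channel shuffle is easy to check on a tiny array. The two-pixel frame below is hypothetical and stands in for a real capture.

```python
import numpy as np

# A hypothetical 1x2 BGR frame: one pure-blue pixel and one pure-red pixel.
bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

rgb = bgr[:, :, [2, 1, 0]]   # index list: a new array with channels reordered
rgb_view = bgr[:, :, ::-1]   # reversed slice: same values as a strided view

print(rgb[0, 0], rgb[0, 1])
```

Both forms give the same values. The index-list form used in the tutorial produces a fresh array, whereas the reversed slice is a negatively strided view, which torch.from_numpy does not accept without an extra copy.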

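This data reading and processing takes about 3.5 ms.

Image Preprocessing

We need to take the frames and transform them into the format the model expects. This is the same processing you would do on any machine with the standard torchvision transforms.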
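
To put the 3.5 ms in context, here is the per-frame time budget implied by the 30fps target. The arithmetic uses only the figures quoted above.

```python
target_fps = 30
capture_ms = 3.5  # read + BGR->RGB time quoted above

frame_budget_ms = 1000 / target_fps           # time available per frame
remaining_ms = frame_budget_ms - capture_ms   # left for preprocessing + model

print(f"budget per frame: {frame_budget_ms:.1f} ms")
print(f"left for preprocessing and the model: {remaining_ms:.1f} ms")
```

Capture therefore consumes roughly a tenth of the budget, leaving a little under 30 ms for preprocessing and the model itself.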

from torchvision import transforms
preprocess = transforms.Compose([
    # convert the frame to a CHW torch tensor for training
    transforms.ToTensor(),
    # normalize the colors to the range that mobilenet_v2/3 expect
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
input_tensor = preprocess(image)
# The model can handle multiple images simultaneously so we need to add an
# empty dimension for the batch.
# [3, 224, 224] -> [1, 3, 224, 224]
input_batch = input_tensor.unsqueeze(0)
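
As a cross-check on what the two transforms do, the same arithmetic can be written in NumPy. This is a sketch assuming a uint8 HWC input, the case in which ToTensor scales values to [0, 1]; the helper name preprocess_np and the all-white frame are hypothetical.

```python
import numpy as np

mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
std = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess_np(frame):
    """HWC uint8 RGB frame -> normalized 1xCxHxW float32 batch (hypothetical)."""
    chw = frame.transpose(2, 0, 1).astype(np.float32) / 255.0  # like ToTensor
    chw = (chw - mean[:, None, None]) / std[:, None, None]      # like Normalize
    return chw[None]                                            # add batch axis

frame = np.full((224, 224, 3), 255, dtype=np.uint8)  # hypothetical white frame
batch = preprocess_np(frame)
print(batch.shape)  # (1, 3, 224, 224)
```

The mean and std are applied per channel, so they are broadcast over the height and width axes.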

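Model Choices

There are a number of models you can choose from, with different performance characteristics. Not all models provide a qnnpack pretrained variant, so for testing purposes you should choose one that does; if you train and quantize your own model, you can use any of them.

We're using mobilenet_v2 for this tutorial since it offers good performance and accuracy.

Raspberry Pi 4 Benchmark Results:

Model	FPS	Total Time (ms/frame)	Model Time (ms/frame)	qnnpack Pretrained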
mobilenet_v2	33.7	29.7	26.4	True
mobilenet_v3_large	29.3	34.1	30.7	True
resnet18	9.2	109.0	100.3	False
resnet50	4.3	233.9	225.2	False
resnext101_32x8d	1.1	892.5	885.3	False
inception_v3	4.9	204.1	195.5	False
googlenet	7.4	135.3	132.0	False
shufflenet_v2_x0_5	46.7	21.4	18.2	False
shufflenet_v2_x1_0	24.4	41.0	37.7	False
shufflenet_v2_x1_5	16.8	59.6	56.3	False
shufflenet_v2_x2_0	11.6	86.3	82.7	False
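
The FPS column follows from the total time per frame (FPS = 1000 / total ms). The short check below uses a few rows copied from the table; the 30fps threshold is the tutorial's stated target.

```python
# (model, total ms/frame) pairs copied from the table above
total_ms = {
    "mobilenet_v2": 29.7,
    "mobilenet_v3_large": 34.1,
    "resnet18": 109.0,
    "shufflenet_v2_x0_5": 21.4,
}

fps = {name: round(1000 / ms, 1) for name, ms in total_ms.items()}
realtime = [name for name, value in fps.items() if value >= 30]

print(fps)
print(realtime)
```

Of the rows checked here, only mobilenet_v2 and shufflenet_v2_x0_5 clear 30fps, and of those two only mobilenet_v2 has a qnnpack pretrained variant.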

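MobileNetV2: Quantization and JIT

For optimal performance we want a model that's quantized and fused. Quantized means that it does the computation using int8, which is much more performant than standard float32 math. Fused means that consecutive operations have been merged into a more performant version where possible, commonly by folding things like activations (ReLU) into the preceding layer (Conv2d) during inference.

The aarch64 version of PyTorch requires using the qnnpack engine.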
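
As a rough illustration of what "computing in int8" involves, here is a minimal sketch of affine quantization in NumPy. It is not how qnnpack or torchvision implement it: the function names, the scale, and the zero point are hypothetical, and real backends choose these parameters by calibration, often per tensor or per channel.

```python
import numpy as np

def quantize(x, scale, zero_point):
    """Map float values to int8 with an affine transform (illustrative)."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q, scale, zero_point):
    """Map int8 values back to approximate floats."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([-1.0, -0.3, 0.0, 0.4, 1.0], dtype=np.float32)
scale, zero_point = 1.0 / 127, 0  # hypothetical parameters for the range [-1, 1]

q = quantize(x, scale, zero_point)
x_hat = dequantize(q, scale, zero_point)

print(q)
print(np.abs(x - x_hat).max())  # rounding error, at most about scale / 2
```

Each value is stored in one byte instead of four, at the cost of a small rounding error bounded by about half the scale.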

import torch
torch.backends.quantized.engine = 'qnnpack'

For this example we'll use a prequantized and fused version of MobileNetV2 that torchvision provides out of the box.

from torchvision import models
net = models.quantization.mobilenet_v2(pretrained=True, quantize=True)

We then want to jit the model to reduce Python overhead and fuse any ops. JIT gives us ~30fps, compared with ~20fps without it.

net = torch.jit.script(net)

Putting It Together

We can now put all the pieces together and run it:

import time
import torch
import numpy as np
from torchvision import models, transforms
import cv2
from PIL import Image
torch.backends.quantized.engine = 'qnnpack'
cap = cv2.VideoCapture(0, cv2.CAP_V4L2)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 224)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 224)
cap.set(cv2.CAP_PROP_FPS, 36)
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
net = models.quantization.mobilenet_v2(pretrained=True, quantize=True)
# jit model to take it from ~20fps to ~30fps
net = torch.jit.script(net)
started = time.time()
last_logged = time.time()
frame_count = 0
with torch.no_grad():
    while True:
        # read frame
        ret, image = cap.read()
        if not ret:
            raise RuntimeError("failed to read frame")
        # convert opencv output from BGR to RGB
        image = image[:, :, [2, 1, 0]]
        # preprocess
        input_tensor = preprocess(image)
        # create a mini-batch as expected by the model
        input_batch = input_tensor.unsqueeze(0)
        # run model
        output = net(input_batch)
        # do something with output ...
        # log model performance
        frame_count += 1
        now = time.time()
        if now - last_logged > 1:
            print(f"{frame_count / (now - last_logged)} fps")
            last_logged = now
            frame_count = 0

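Running it shows that we're hovering at ~30 fps.

This is with all the default settings in Raspberry Pi OS. If you disable the UI and the other background services that are enabled by default, performance and stability improve.

If we check htop, we see almost 100% utilization.

To verify that it's working correctly, we can compute the probabilities of the classes and use the ImageNet class labels to print the detections.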

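# NOTE: `classes` is not defined in this tutorial; it is assumed to be a
# list of the ImageNet class labels, indexed by class id and loaded separately.
top = list(enumerate(output[0].softmax(dim=0)))
top.sort(key=lambda x: x[1], reverse=True)
for idx, val in top[:10]:
    print(f"{val.item()*100:.2f}% {classes[idx]}")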
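
The same softmax and top-k logic can be tried without a camera or a model. The labels and logits below are hypothetical stand-ins for the ImageNet classes and the network output.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - logits.max()   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical labels and logits standing in for the ImageNet classes.
classes = ["tabby cat", "orange", "coffee mug", "laptop"]
logits = np.array([2.0, 0.5, 3.0, -1.0])

probs = softmax(logits)
top = np.argsort(probs)[::-1][:3]     # indices of the three largest probabilities
for idx in top:
    print(f"{probs[idx] * 100:.2f}% {classes[idx]}")
```

Sorting every class, as the snippet above does, is fine for 1000 classes; np.argsort is used here only to keep the sketch short.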

Figure: mobilenet_v3_large running in real time

Figure: detecting an orange

Figure: detecting a mug

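Troubleshooting: Performance

By default, PyTorch uses all of the available cores. If anything is running in the background on the Raspberry Pi, it may contend with model inference and cause latency spikes. To alleviate this you can reduce the number of threads, which lowers peak latency at a small performance cost.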

torch.set_num_threads(2)

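For shufflenet_v2_x1_5, using 2 threads instead of 4 increases best-case latency from 60 ms to 72 ms, but eliminates the 128 ms latency spikes.

Next Steps

You can create your own model or fine-tune an existing one. If you fine-tune one of the models from torchvision.models.quantization, most of the work to fuse and quantize has already been done for you, so you can deploy directly to a Raspberry Pi with good performance.

See more:

Quantization, for more information on how to quantize and fuse your model.
Transfer Learning Tutorial, for how to use transfer learning to fine-tune a pre-existing model to your dataset.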

Profiling PyTorch

Profiling Your PyTorch Module

Original: pytorch.org/tutorials/beginner/profiler.html

Translator: 飞龙

License: CC BY-NC-SA 4.0

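Author: Suraj Subramanian

PyTorch includes a profiler API that is useful for identifying the time and memory costs of various PyTorch operations in your code. The profiler can be easily integrated into your code, and the results can be printed as a table or returned in a JSON trace file.

Note

The profiler supports multithreaded models. It runs in the same thread as the operation, but it will also profile child operators that might run in another thread. Concurrently running profilers are scoped to their own thread to prevent results from mixing.

Note

PyTorch 1.8 introduces a new API that will replace the older profiler API in future releases. See the new API page for details.

For a quicker walkthrough of profiler API usage, see the profiler recipe.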

import torch
import numpy as np
from torch import nn
import torch.autograd.profiler as profiler

Next part: PyTorch 2.2 Official Tutorials, Chinese Edition (9) (4) https://developer.aliyun.com/article/1482550