PyTorch 2.2 Official Tutorials in Chinese (Part 11) (2)

PyTorch 2.2 Official Tutorials in Chinese (Part 11) (1): https://developer.aliyun.com/article/1482551

Basic Syntax

The two important APIs for dynamic parallelism are:

torch.jit.fork(fn: Callable[..., T], *args, **kwargs) -> torch.jit.Future[T]
torch.jit.wait(fut: torch.jit.Future[T]) -> T

A good way to show how these work is with an example:

import torch
def foo(x):
    return torch.neg(x)
@torch.jit.script
def example(x):
    # Call `foo` using parallelism:
    # First, we "fork" off a task. This task will run `foo` with argument `x`
    future = torch.jit.fork(foo, x)
    # Call `foo` normally
    x_normal = foo(x)
    # Second, we "wait" on the task. Since the task may be running in
    # parallel, we have to "wait" for its result to become available.
    # Notice that by having lines of code between the "fork()" and "wait()"
    # call for a given Future, we can overlap computations so that they
    # run in parallel.
    x_parallel = torch.jit.wait(future)
    return x_normal, x_parallel
print(example(torch.ones(1))) # (-1., -1.)

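fork() takes the callable fn and the arguments to that callable, args and kwargs, and creates an asynchronous task for the execution of fn. fn can be a function, a method, or a Module instance. fork() returns a reference to the result value of this execution, called a Future. Because fork() returns immediately after creating the asynchronous task, fn may not yet have run by the time the lines of code after the fork() call are executed. wait() is therefore used to wait for the asynchronous task to complete and to return its value.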
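As a rough analogy only, and not TorchScript itself, the same fork-then-wait control flow can be sketched with Python's standard-library concurrent.futures. Here submit() plays the role of fork() and Future.result() the role of wait(). The helper names fork, wait, and neg are hypothetical stand-ins introduced for this sketch; they are not part of PyTorch.

```python
from concurrent.futures import Future, ThreadPoolExecutor

# Hypothetical stand-ins for torch.jit.fork / torch.jit.wait, built on the
# standard library purely to illustrate the control flow.
_pool = ThreadPoolExecutor(max_workers=2)


def fork(fn, *args, **kwargs) -> Future:
    # Schedule fn(*args, **kwargs) as an asynchronous task and return at once.
    return _pool.submit(fn, *args, **kwargs)


def wait(fut: Future):
    # Block until the task has finished, then hand back its return value.
    return fut.result()


def neg(x: float) -> float:
    return -x


def example(x: float):
    future = fork(neg, x)      # the task may not have run yet when this returns
    x_normal = neg(x)          # this call can overlap with the forked task
    x_parallel = wait(future)  # synchronize and fetch the forked result
    return x_normal, x_parallel


result = example(1.0)
print(result)  # (-1.0, -1.0)
```

The point of the sketch is the ordering: useful work placed between fork() and wait() is what gets overlapped with the forked task.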

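These constructs can be used to overlap the execution of statements within a function (shown in the worked example section) or be composed with other language constructs such as loops: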

import torch
from typing import List
def foo(x):
    return torch.neg(x)
@torch.jit.script
def example(x):
    futures : List[torch.jit.Future[torch.Tensor]] = []
    for _ in range(100):
        futures.append(torch.jit.fork(foo, x))
    results = []
    for future in futures:
        results.append(torch.jit.wait(future))
    return torch.sum(torch.stack(results))
print(example(torch.ones([])))

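Note

When we initialize an empty list of Futures, we need to add an explicit type annotation to futures. In TorchScript, empty containers are assumed by default to contain Tensor values, so we annotate the list constructor as being of type List[torch.jit.Future[torch.Tensor]].

This example uses fork() to launch 100 instances of the function foo, waits for the 100 tasks to complete, then sums the results, returning -100.0.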

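Applied Example: Ensemble of Bidirectional LSTMs

Let's try to apply parallelism to a more realistic example and see what sort of performance we can get out of it. First, let's define the baseline model: an ensemble of bidirectional LSTM layers.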

import torch, time
# In RNN parlance, the dimensions we care about are:
# # of time-steps (T)
# Batch size (B)
# Hidden size/number of "channels" (C)
T, B, C = 50, 50, 1024
# A module that defines a single "bidirectional LSTM". This is simply two
# LSTMs applied to the same sequence, but one in reverse
class BidirectionalRecurrentLSTM(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.cell_f = torch.nn.LSTM(input_size=C, hidden_size=C)
        self.cell_b = torch.nn.LSTM(input_size=C, hidden_size=C)
    def forward(self, x : torch.Tensor) -> torch.Tensor:
        # Forward layer
        output_f, _ = self.cell_f(x)
        # Backward layer. Flip input in the time dimension (dim 0), apply the
        # layer, then flip the outputs in the time dimension
        x_rev = torch.flip(x, dims=[0])
        output_b, _ = self.cell_b(x_rev)
        output_b_rev = torch.flip(output_b, dims=[0])
        return torch.cat((output_f, output_b_rev), dim=2)
# An "ensemble" of `BidirectionalRecurrentLSTM` modules. The modules in the
# ensemble are run one-by-one on the same input then their results are
# stacked and summed together, returning the combined result.
class LSTMEnsemble(torch.nn.Module):
    def __init__(self, n_models):
        super().__init__()
        self.n_models = n_models
        self.models = torch.nn.ModuleList([
            BidirectionalRecurrentLSTM() for _ in range(self.n_models)])
    def forward(self, x : torch.Tensor) -> torch.Tensor:
        results = []
        for model in self.models:
            results.append(model(x))
        return torch.stack(results).sum(dim=0)
# For a head-to-head comparison to what we're going to do with fork/wait, let's
# instantiate the model and compile it with TorchScript
ens = torch.jit.script(LSTMEnsemble(n_models=4))
# Normally you would pull this input out of an embedding table, but for the
# purpose of this demo let's just use random data.
x = torch.rand(T, B, C)
# Let's run the model once to warm up things like the memory allocator
ens(x)
x = torch.rand(T, B, C)
# Let's see how fast it runs!
s = time.time()
ens(x)
print('Inference took', time.time() - s, ' seconds')

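On my machine, this network runs in 2.05 seconds. We can do a lot better!

Parallelizing Forward and Backward Layers

A very simple thing we can do is parallelize the forward and backward layers within BidirectionalRecurrentLSTM. The structure of this computation is static, so we don't actually need any loops. Let's rewrite the forward method of BidirectionalRecurrentLSTM like so: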

def forward(self, x : torch.Tensor) -> torch.Tensor:
    # Forward layer - fork() so this can run in parallel to the backward
    # layer
    future_f = torch.jit.fork(self.cell_f, x)
    # Backward layer. Flip input in the time dimension (dim 0), apply the
    # layer, then flip the outputs in the time dimension
    x_rev = torch.flip(x, dims=[0])
    output_b, _ = self.cell_b(x_rev)
    output_b_rev = torch.flip(output_b, dims=[0])
    # Retrieve the output from the forward layer. Note this needs to happen
    # *after* the stuff we want to parallelize with
    output_f, _ = torch.jit.wait(future_f)
    return torch.cat((output_f, output_b_rev), dim=2)

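In this example, forward() delegates the execution of cell_f to another thread while it continues to execute cell_b. This causes the execution of both cells to overlap.

Running the script again with this simple modification yields a runtime of 1.71 seconds, an improvement of 17%!

Aside: Visualizing Parallelism

We're not done optimizing our model, but it's worth introducing the tooling we have for visualizing performance. One important tool is the PyTorch profiler.

Let's use the profiler along with the Chrome trace export functionality to visualize the performance of our parallelized model: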

with torch.autograd.profiler.profile() as prof:
    ens(x)
prof.export_chrome_trace('parallel.json')

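This snippet of code will write out a file named parallel.json. If you navigate Google Chrome to chrome://tracing, click the Load button, and load in that JSON file, you should see a timeline like the following:

[Figure: Chrome trace timeline of the parallelized model]

The horizontal axis of the timeline represents time and the vertical axis represents threads of execution. As we can see, we are running two lstm instances at a time. This is the result of our work parallelizing the bidirectional layers!

Parallelizing Models in the Ensemble

You may have noticed that there is a further parallelization opportunity in our code: we can also run the models contained in LSTMEnsemble in parallel with each other. The way to do that is simple enough: we change the forward method of LSTMEnsemble: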

def forward(self, x : torch.Tensor) -> torch.Tensor:
    # Launch tasks for each model
    futures : List[torch.jit.Future[torch.Tensor]] = []
    for model in self.models:
        futures.append(torch.jit.fork(model, x))
    # Collect the results from the launched tasks
    results : List[torch.Tensor] = []
    for future in futures:
        results.append(torch.jit.wait(future))
    return torch.stack(results).sum(dim=0)

Or, if you value brevity, we can use list comprehensions:

def forward(self, x : torch.Tensor) -> torch.Tensor:
    futures = [torch.jit.fork(model, x) for model in self.models]
    results = [torch.jit.wait(fut) for fut in futures]
    return torch.stack(results).sum(dim=0)

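As described in the introduction, we use a loop to fork off a task for each of the models in our ensemble. We then use another loop to wait for all of the tasks to complete. This provides even more overlap of computation.

With this small update, the script runs in 1.4 seconds, for a total speedup of 32%! Pretty good for two lines of code.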
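The quoted percentages are relative reductions in run time against the 2.05-second baseline. A small check of that arithmetic, using only the timings reported above (the function name speedup_percent is a hypothetical helper for this sketch):

```python
def speedup_percent(baseline_s: float, optimized_s: float) -> float:
    """Relative reduction in run time, as a percentage of the baseline."""
    return (baseline_s - optimized_s) / baseline_s * 100.0


baseline = 2.05      # sequential ensemble
fork_layers = 1.71   # forward and backward layers overlapped
fork_models = 1.4    # ensemble members overlapped as well

layers_gain = speedup_percent(baseline, fork_layers)
models_gain = speedup_percent(baseline, fork_models)
print(round(layers_gain), round(models_gain))  # 17 32
```

These timings come from one machine; the gain on other hardware will depend on the number of cores and on how much of the work can actually overlap.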

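We can also use the Chrome tracer again to see what's going on:

[Figure: Chrome trace timeline with the ensemble models running in parallel]

We can now see that all LSTM instances are being run fully in parallel.

Conclusion

In this tutorial, we learned about fork() and wait(), the basic APIs for doing dynamic, inter-op parallelism in TorchScript. We saw a few typical usage patterns for using these functions to parallelize the execution of functions, methods, or Modules. Finally, we worked through an example of optimizing a model using this technique and explored the performance measurement and visualization tooling available in PyTorch.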

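Autograd in the C++ Frontend

Original: pytorch.org/tutorials/advanced/cpp_autograd.html

Translator: 飞龙

License: CC BY-NC-SA 4.0

The autograd package is crucial for building highly flexible and dynamic neural networks in PyTorch. Most of the autograd APIs in the PyTorch Python frontend are also available in the C++ frontend, which makes it easy to translate autograd code from Python to C++.

In this tutorial, we explore several examples of doing autograd in the PyTorch C++ frontend. Note that this tutorial assumes you already have a basic understanding of autograd in the Python frontend. If that's not the case, please first read Autograd: Automatic Differentiation.

Basic autograd operations

(Adapted from this tutorial)

Create a tensor and set torch::requires_grad() to track computation with it: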

auto x = torch::ones({2, 2}, torch::requires_grad());
std::cout << x << std::endl;

Output:

1  1
1  1
[ CPUFloatType{2,2} ]

Do a tensor operation:

auto y = x + 2;
std::cout << y << std::endl;

Output:

3  3
3  3
[ CPUFloatType{2,2} ]

y was created as a result of an operation, so it has a grad_fn.

std::cout << y.grad_fn()->name() << std::endl;

Output:

AddBackward1

Do more operations on y:

auto z = y * y * 3;
auto out = z.mean();
std::cout << z << std::endl;
std::cout << z.grad_fn()->name() << std::endl;
std::cout << out << std::endl;
std::cout << out.grad_fn()->name() << std::endl;

Output:

27  27
27  27
[ CPUFloatType{2,2} ]
MulBackward1
27
[ CPUFloatType{} ]
MeanBackward0

.requires_grad_( ... ) changes an existing tensor's requires_grad flag in-place.

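// Print booleans as true/false rather than 1/0 (the setting persists on std::cout)
std::cout << std::boolalpha;
auto a = torch::randn({2, 2});
a = ((a * 3) / (a - 1));
std::cout << a.requires_grad() << std::endl;
a.requires_grad_(true);
std::cout << a.requires_grad() << std::endl;
auto b = (a * a).sum();
std::cout << b.grad_fn()->name() << std::endl;

Output:

false
true
SumBackward0

Let's backpropagate now. Because out contains a single scalar, out.backward() is equivalent to out.backward(torch::tensor(1.)).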

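out.backward();

Print the gradients d(out)/dx:

std::cout << x.grad() << std::endl;

Output:

4.5000  4.5000
4.5000  4.5000
[ CPUFloatType{2,2} ]

You should get a matrix of 4.5. For an explanation of how we arrive at this value, see the corresponding section in this tutorial.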
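For reference, the value 4.5 follows directly from the operations above (y = x + 2, z = y * y * 3, out = z.mean() over four elements). A short derivation, writing x_i for one element of x:

```latex
\begin{aligned}
out &= \frac{1}{4}\sum_{i} z_i, \qquad z_i = 3\,(x_i + 2)^2,\\
\frac{\partial\, out}{\partial x_i} &= \frac{1}{4}\cdot 6\,(x_i + 2) = \frac{3}{2}\,(x_i + 2),\\
\left.\frac{\partial\, out}{\partial x_i}\right|_{x_i = 1} &= \frac{3}{2}\cdot 3 = 4.5 .
\end{aligned}
```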

Now let's take a look at an example of a vector-Jacobian product:

x = torch::randn(3, torch::requires_grad());
y = x * 2;
while (y.norm().item<double>() < 1000) {
  y = y * 2;
}
std::cout << y << std::endl;
std::cout << y.grad_fn()->name() << std::endl;

Output:

-1021.4020
314.6695
-613.4944
[ CPUFloatType{3} ]
MulBackward1

If we want the vector-Jacobian product, pass the vector to backward as an argument:

auto v = torch::tensor({0.1, 1.0, 0.0001}, torch::kFloat);
y.backward(v);
std::cout << x.grad() << std::endl;

Output:

102.4000
1024.0000
0.1024
[ CPUFloatType{3} ]
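The printed gradient is exactly 1024 times v, which suggests that in this particular run the loop left y = 1024 * x (the factor depends on the random x). For such an elementwise scaling the Jacobian is a scaled identity matrix, so the vector-Jacobian product is easy to reproduce by hand. The pure-Python sketch below assumes that scale factor of 1024; vjp_scale is a hypothetical helper, not a PyTorch API.

```python
from typing import List


def vjp_scale(v: List[float], scale: float) -> List[float]:
    """Vector-Jacobian product v^T J for y = scale * x, where J = scale * I."""
    n = len(v)
    # Build the diagonal Jacobian explicitly to make the product visible.
    jacobian = [[scale if i == j else 0.0 for j in range(n)] for i in range(n)]
    # (v^T J)_j = sum_i v_i * J[i][j]
    return [sum(v[i] * jacobian[i][j] for i in range(n)) for j in range(n)]


v = [0.1, 1.0, 0.0001]
grad = vjp_scale(v, scale=1024.0)  # assumed scale, inferred from the output above
print(grad)
```

Autograd never materializes the full Jacobian like this; it computes the product directly, which is what makes backward(v) cheap for large tensors.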

You can also stop autograd from tracking history on tensors that require gradients by putting torch::NoGradGuard in a code block:

std::cout << x.requires_grad() << std::endl;
std::cout << x.pow(2).requires_grad() << std::endl;
{
  torch::NoGradGuard no_grad;
  std::cout << x.pow(2).requires_grad() << std::endl;
}

Output:

true
true
false

Or by using .detach() to get a new tensor with the same content that does not require gradients:

std::cout << x.requires_grad() << std::endl;
y = x.detach();
std::cout << y.requires_grad() << std::endl;
std::cout << x.eq(y).all().item<bool>() << std::endl;

Output:

true
false
true

For more information on C++ tensor autograd APIs such as grad / requires_grad / is_leaf / backward / detach / detach_ / register_hook / retain_grad, please see the corresponding C++ API docs.

Computing higher-order gradients in C++

One application of higher-order gradients is calculating a gradient penalty. Let's see an example of it using torch::autograd::grad:

#include <torch/torch.h>

auto model = torch::nn::Linear(4, 3);
auto input = torch::randn({3, 4}).requires_grad_(true);
auto output = model(input);
// Calculate loss
auto target = torch::randn({3, 3});
auto loss = torch::nn::MSELoss()(output, target);
// Use norm of gradients as penalty
auto grad_output = torch::ones_like(output);
auto gradient = torch::autograd::grad({output}, {input}, /*grad_outputs=*/{grad_output}, /*create_graph=*/true)[0];
auto gradient_penalty = torch::pow((gradient.norm(2, /*dim=*/1) - 1), 2).mean();
// Add gradient penalty to loss
auto combined_loss = loss + gradient_penalty;
combined_loss.backward();
std::cout << input.grad() << std::endl;

Output:

-0.1042  -0.0638  0.0103  0.0723
-0.2543  -0.1222  0.0071  0.0814
-0.1683  -0.1052  0.0355  0.1024
[ CPUFloatType{3,4} ]

Please see the documentation for torch::autograd::backward and torch::autograd::grad for more information about how to use them.
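Written out, the quantity computed by the code above can be read as follows. The symbols g_i and N are notation introduced here for illustration: g_i is row i of gradient (the gradient of the summed output with respect to input row i, since grad_outputs is all ones) and N = 3 is the number of input rows.

```latex
\begin{aligned}
g_i &= \frac{\partial}{\partial\, input_i} \sum_{j,k} output_{jk},\\
gradient\_penalty &= \frac{1}{N}\sum_{i=1}^{N} \bigl(\lVert g_i \rVert_2 - 1\bigr)^2,\\
combined\_loss &= loss + gradient\_penalty .
\end{aligned}
```

Because gradient_penalty is itself a function of a gradient, backpropagating through it needs second-order derivatives, which is why create_graph is set to true in the call to torch::autograd::grad.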

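Using custom autograd functions in C++

(Adapted from this tutorial)

Adding a new elementary operation to torch::autograd requires implementing a new torch::autograd::Function subclass for each operation. torch::autograd::Function is what torch::autograd uses to compute the results and gradients and to encode the operation history. Every new function requires you to implement two methods, forward and backward; see the torch::autograd::Function documentation for the detailed requirements.

Below is the code for a Linear function from torch::nn: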

#include <torch/torch.h>

using namespace torch::autograd;

// Inherit from Function
class LinearFunction : public Function<LinearFunction> {
 public:
  // Note that both forward and backward are static functions
  // bias is an optional argument
  static torch::Tensor forward(
      AutogradContext *ctx, torch::Tensor input, torch::Tensor weight, torch::Tensor bias = torch::Tensor()) {
    ctx->save_for_backward({input, weight, bias});
    auto output = input.mm(weight.t());
    if (bias.defined()) {
      output += bias.unsqueeze(0).expand_as(output);
    }
    return output;
  }

  static tensor_list backward(AutogradContext *ctx, tensor_list grad_outputs) {
    auto saved = ctx->get_saved_variables();
    auto input = saved[0];
    auto weight = saved[1];
    auto bias = saved[2];
    auto grad_output = grad_outputs[0];
    auto grad_input = grad_output.mm(weight);
    auto grad_weight = grad_output.t().mm(input);
    auto grad_bias = torch::Tensor();
    if (bias.defined()) {
      grad_bias = grad_output.sum(0);
    }
    return {grad_input, grad_weight, grad_bias};
  }
};

Then, we can use LinearFunction in the following way:

auto x = torch::randn({2, 3}).requires_grad_();
auto weight = torch::randn({4, 3}).requires_grad_();
auto y = LinearFunction::apply(x, weight);
y.sum().backward();
std::cout << x.grad() << std::endl;
std::cout << weight.grad() << std::endl;

Output:

0.5314  1.2807  1.4864
0.5314  1.2807  1.4864
[ CPUFloatType{2,3} ]
3.7608  0.9101  0.0073
3.7608  0.9101  0.0073
3.7608  0.9101  0.0073
3.7608  0.9101  0.0073
[ CPUFloatType{4,3} ]
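Every row of each printed gradient is identical. That follows from the backward formulas: y.sum().backward() supplies a grad_output of all ones, so grad_output.mm(weight) yields the column sums of weight in every row, and grad_output.t().mm(input) yields the column sums of the input in every row. The pure-Python sketch below replays those two products on small made-up matrices; the values of x and weight are hypothetical stand-ins for the random tensors, and matmul and transpose are helpers written for this sketch.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    # Plain triple-loop matrix product: (n x k) times (k x m) gives (n x m).
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    return [list(row) for row in zip(*a)]


# Hypothetical values standing in for the random x (2x3) and weight (4x3).
x = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
weight = [[1.0, 0.0, 2.0],
          [0.0, 1.0, 0.0],
          [3.0, 0.0, 1.0],
          [1.0, 1.0, 1.0]]

# y.sum().backward() feeds a grad_output of all ones with the shape of y (2x4).
grad_output = [[1.0] * 4 for _ in range(2)]

grad_input = matmul(grad_output, weight)         # mirrors grad_output.mm(weight)
grad_weight = matmul(transpose(grad_output), x)  # mirrors grad_output.t().mm(input)

print(grad_input)   # [[5.0, 2.0, 4.0], [5.0, 2.0, 4.0]]
print(grad_weight)  # four rows of [5.0, 7.0, 9.0]
```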

Here, we give an additional example of a function that is parametrized by non-tensor arguments:

#include <torch/torch.h>

using namespace torch::autograd;

class MulConstant : public Function<MulConstant> {
 public:
  static torch::Tensor forward(AutogradContext *ctx, torch::Tensor tensor, double constant) {
    // ctx is a context object that can be used to stash information
    // for backward computation
    ctx->saved_data["constant"] = constant;
    return tensor * constant;
  }

  static tensor_list backward(AutogradContext *ctx, tensor_list grad_outputs) {
    // We return as many input gradients as there were arguments.
    // Gradients of non-tensor arguments to forward must be `torch::Tensor()`.
    return {grad_outputs[0] * ctx->saved_data["constant"].toDouble(), torch::Tensor()};
  }
};

Then, we can use MulConstant in the following way:

auto x = torch::randn({2}).requires_grad_();
auto y = MulConstant::apply(x, 5.5);
y.sum().backward();
std::cout << x.grad() << std::endl;

Output:

5.5000
5.5000
[ CPUFloatType{2} ]

For more information on torch::autograd::Function, please see its documentation.

Translating autograd code from Python to C++

At a high level, the easiest way to use autograd in C++ is to first write working autograd code in Python, and then translate it from Python to C++ using the following table:

Python	C++
`torch.autograd.backward`	`torch::autograd::backward`
`torch.autograd.grad`	`torch::autograd::grad`
`torch.Tensor.detach`	`torch::Tensor::detach`
`torch.Tensor.detach_`	`torch::Tensor::detach_`
`torch.Tensor.backward`	`torch::Tensor::backward`
`torch.Tensor.register_hook`	`torch::Tensor::register_hook`
`torch.Tensor.requires_grad`	`torch::Tensor::requires_grad_`
`torch.Tensor.retain_grad`	`torch::Tensor::retain_grad`
`torch.Tensor.grad`	`torch::Tensor::grad`
`torch.Tensor.grad_fn`	`torch::Tensor::grad_fn`
`torch.Tensor.set_data`	`torch::Tensor::set_data`
`torch.Tensor.data`	`torch::Tensor::data`
`torch.Tensor.output_nr`	`torch::Tensor::output_nr`
`torch.Tensor.is_leaf`	`torch::Tensor::is_leaf`

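After translation, most of your Python autograd code should just work in C++. If that's not the case, please file a bug report at GitHub issues and we will fix it as soon as possible.

Conclusion

You should now have a good overview of PyTorch's C++ autograd API. The code examples shown in this note are linked from the original tutorial. As always, if you run into any problems or have questions, you can use our forum or GitHub issues to get in touch.

PyTorch 2.2 Official Tutorials in Chinese (Part 11) (3): https://developer.aliyun.com/article/1482554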
