Introduction to torch.compile
Original: pytorch.org/tutorials/intermediate/torch_compile_tutorial.html

Translator: 飞龙
Author: William Wen
`torch.compile` is the latest method to speed up your PyTorch code! `torch.compile` makes PyTorch code run faster by JIT-compiling it into optimized kernels, all while requiring minimal code changes.

In this tutorial, we cover basic `torch.compile` usage and demonstrate the advantages of `torch.compile` over previous PyTorch compiler solutions, such as TorchScript and FX Tracing.
Contents

- Basic Usage
- Demonstrating Speedups
- Comparison to TorchScript and FX Tracing
- TorchDynamo and FX Graphs
- Conclusion
Required pip dependencies:

- torch >= 2.0
- torchvision
- numpy
- scipy
- tabulate
NOTE: a modern NVIDIA GPU (H100, A100, or V100) is recommended for this tutorial in order to reproduce the speedup numbers shown below and documented elsewhere.
```python
import torch
import warnings

gpu_ok = False
if torch.cuda.is_available():
    device_cap = torch.cuda.get_device_capability()
    if device_cap in ((7, 0), (8, 0), (9, 0)):
        gpu_ok = True

if not gpu_ok:
    warnings.warn(
        "GPU is not NVIDIA V100, A100, or H100. Speedup numbers may be lower "
        "than expected."
    )
```
```
/var/lib/jenkins/workspace/intermediate_source/torch_compile_tutorial.py:48: UserWarning: GPU is not NVIDIA V100, A100, or H100. Speedup numbers may be lower than expected.
```
Basic Usage
`torch.compile` is included in the latest PyTorch. Running TorchInductor on GPU requires Triton, which is included with the PyTorch 2.0 nightly binaries. If Triton is still missing, try installing `torchtriton` via pip (`pip install torchtriton --extra-index-url "https://download.pytorch.org/whl/nightly/cu117"` for CUDA 11.7).
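Before continuing, it can be worth a quick environment check. The snippet below is a minimal sketch (that Triton is importable as `triton` and is only needed for the GPU backend is our assumption, based on the requirements above):

```python
# Minimal environment check: verify the PyTorch version and that Triton
# (needed by TorchInductor on GPU) can be imported.
import torch

print(torch.__version__)  # expected to be >= 2.0
try:
    import triton  # noqa: F401
    print("Triton is available")
except ImportError:
    print("Triton is missing -- see the pip command above")
```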
Arbitrary Python functions can be optimized by passing the callable to `torch.compile`. We can then call the returned optimized function in place of the original function.
```python
def foo(x, y):
    a = torch.sin(x)
    b = torch.cos(y)
    return a + b

opt_foo1 = torch.compile(foo)
print(opt_foo1(torch.randn(10, 10), torch.randn(10, 10)))
```
```
tensor([[ 1.6850,  1.9924,  1.7090,  0.0034,  1.1414, -0.1822,  0.4861, -0.0536, -0.2252,  1.9398],
        [ 0.3693, -0.0695,  0.1748,  0.3436,  0.1939,  1.5721,  1.9882, -0.2235,  0.3161,  1.2642],
        [ 0.2480,  1.8793,  1.7152,  1.6772,  1.8881,  1.4748,  1.3466,  1.7763,  0.7469,  1.0407],
        [-0.1121,  1.6015, -0.0188,  0.2128,  0.5218,  1.9838,  0.8185,  0.5093, -0.3603,  0.1793],
        [-1.7890,  1.7532, -0.4040,  0.1222, -0.0029,  1.7975, -0.3877,  0.5123,  0.1673,  0.1330],
        [ 1.0627,  0.9609,  0.1019,  1.8814,  0.1142, -0.2338, -0.9621,  0.7631,  0.6506,  0.1853],
        [ 0.4584,  1.7648, -0.0444,  1.9610,  1.5884,  0.7353,  1.2190,  1.3662,  1.0938, -0.1587],
        [-0.7502,  1.6640,  0.3495,  1.3496,  0.8187,  1.1719,  0.5820,  0.1498,  0.0885,  0.1036],
        [ 0.3961,  0.6043, -0.0861, -0.3371,  0.8622,  1.4341,  1.2988,  0.5023,  0.3074,  0.1277],
        [ 0.9748,  0.4117,  1.2616,  1.6314,  0.4693,  0.4092,  0.0401,  1.1196,  1.2458,  1.3280]])
```
Alternatively, we can decorate the function.
```python
@torch.compile
def opt_foo2(x, y):
    a = torch.sin(x)
    b = torch.cos(y)
    return a + b

print(opt_foo2(torch.randn(10, 10), torch.randn(10, 10)))
```
```
tensor([[ 0.5360,  0.1697, -0.0561,  0.1890, -0.1310,  1.2276,  1.1739,  0.1944, -0.1561,  1.6990],
        [ 1.0421,  1.9472,  0.2682,  0.2701,  1.3346,  0.7651,  1.0897,  1.1730,  0.6161,  0.9223],
        [ 1.5756,  1.5294,  0.0112, -0.1522, -0.7674,  1.8515, -0.2443,  0.3696,  0.2693,  0.8735],
        [-0.3701,  1.1190,  1.4164,  1.8648,  1.2080,  0.0732,  1.5274,  0.6868,  1.2440,  1.0715],
        [-1.2454, -0.0159,  0.4315,  0.1317,  1.0530, -1.0603, -0.0532,  0.6661,  1.7101, -0.2076],
        [-0.7091,  0.7824,  1.7161,  1.2750,  0.6368,  1.2488,  0.4897,  1.2429,  1.3409,  1.3735],
        [ 0.8345,  0.0653,  0.3462,  1.2383, -0.4092,  1.6438, -0.0962,  0.4011,  0.2463, -0.5802],
        [ 1.6349,  0.7297,  1.2547, -0.3113,  0.9310,  0.1162,  1.7618,  0.4882,  0.7640,  0.2930],
        [ 1.1669, -0.7775,  1.2000,  0.6008, -0.2814,  0.5541,  0.5753,  1.4731,  1.6835,  0.7370],
        [ 1.5087,  0.6195,  0.1153,  1.2966,  1.8815,  1.1678,  1.5686,  1.6018,  0.2193,  1.3500]])
```
We can also optimize `torch.nn.Module` instances.
```python
class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(100, 10)

    def forward(self, x):
        return torch.nn.functional.relu(self.lin(x))

mod = MyModule()
opt_mod = torch.compile(mod)
print(opt_mod(torch.randn(10, 100)))
```
```
tensor([[-0.0000, -0.0000,  0.2419,  0.0446,  0.9011,  0.2674,  0.3633,  0.4984, -0.0000,  0.0988],
        [ 0.6906, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000,  0.8490, -0.0000, -0.0000,  0.5475],
        [ 0.0852,  0.2762,  0.7441, -0.0000, -0.0000,  0.1820, -0.0000, -0.0000, -0.0000,  0.0334],
        [ 0.3024,  0.0077,  1.2572, -0.0000, -0.0000,  0.6520, -0.0000, -0.0000, -0.0000,  0.8976],
        [ 0.1998,  0.3333, -0.0000,  0.7803,  0.4202,  0.0915, -0.0000,  1.2543, -0.0000,  0.4615],
        [ 0.2487,  0.4187, -0.0000, -0.0000,  0.5124, -0.0000,  0.2512, -0.0000,  0.5850, -0.0000],
        [-0.0000,  0.0048, -0.0000, -0.0000, -0.0000,  0.2287, -0.0000,  0.4841,  0.3915, -0.0000],
        [ 0.2017, -0.0000,  0.0896,  1.4135,  0.0593,  0.3788, -0.0000, -0.0000, -0.0000,  0.4972],
        [-0.0000, -0.0000,  1.6580,  0.6414, -0.0000, -0.0000, -0.0000, -0.0000,  0.6491,  0.7755],
        [-0.0000, -0.0000,  0.6442,  0.0260,  0.7456,  0.1000, -0.0000, -0.0000,  0.5366,  0.1193]],
       grad_fn=<CompiledFunctionBackward>)
```
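As a quick sanity check (a minimal sketch; the tolerances below are arbitrary choices of ours), the compiled module should return the same values as the eager module, up to the small numerical differences that optimized kernels can introduce:

```python
# Illustrative check: compiled and eager outputs should agree within
# small numerical tolerances.
x = torch.randn(10, 100)
torch.testing.assert_close(mod(x), opt_mod(x), rtol=1e-4, atol=1e-4)
print("eager and compiled outputs match")
```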
Demonstrating Speedups
Let's now demonstrate that using `torch.compile` can speed up real models. We will compare standard eager mode and `torch.compile` by evaluating and training a `torchvision` model on random data.
Before we begin, we need to define some utility functions.
```python
# Returns the result of running `fn()` and the time it took for `fn()` to run,
# in seconds. We use CUDA events and synchronization for the most accurate
# measurements.
def timed(fn):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    result = fn()
    end.record()
    torch.cuda.synchronize()
    return result, start.elapsed_time(end) / 1000

# Generates random input and targets data for the model, where `b` is
# batch size.
def generate_data(b):
    return (
        torch.randn(b, 3, 128, 128).to(torch.float32).cuda(),
        torch.randint(1000, (b,)).cuda(),
    )

N_ITERS = 10

from torchvision.models import densenet121
def init_model():
    return densenet121().to(torch.float32).cuda()
```
First, let's compare inference.

Note that in the call to `torch.compile`, we have the additional `mode` argument, which we will discuss below.
```python
model = init_model()

# Reset since we are using a different mode.
import torch._dynamo
torch._dynamo.reset()

model_opt = torch.compile(model, mode="reduce-overhead")

inp = generate_data(16)[0]
with torch.no_grad():
    print("eager:", timed(lambda: model(inp))[1])
    print("compile:", timed(lambda: model_opt(inp))[1])
```
```
eager: 0.3166423034667969
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:140: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
compile: 76.9008984375
```
Notice that `torch.compile` takes a lot longer to complete compared to eager. This is because `torch.compile` compiles the model into optimized kernels as it executes. In our example, the structure of the model doesn't change, so recompilation is not needed. Therefore, if we run our optimized model several more times, we should see a significant improvement compared to eager.
```python
eager_times = []
for i in range(N_ITERS):
    inp = generate_data(16)[0]
    with torch.no_grad():
        _, eager_time = timed(lambda: model(inp))
    eager_times.append(eager_time)
    print(f"eager eval time {i}: {eager_time}")
print("~" * 10)

compile_times = []
for i in range(N_ITERS):
    inp = generate_data(16)[0]
    with torch.no_grad():
        _, compile_time = timed(lambda: model_opt(inp))
    compile_times.append(compile_time)
    print(f"compile eval time {i}: {compile_time}")
print("~" * 10)

import numpy as np
eager_med = np.median(eager_times)
compile_med = np.median(compile_times)
speedup = eager_med / compile_med
assert(speedup > 1)
print(f"(eval) eager median: {eager_med}, compile median: {compile_med}, speedup: {speedup}x")
print("~" * 10)
```
```
eager eval time 0: 0.018123775482177733
eager eval time 1: 0.01638707160949707
eager eval time 2: 0.015945728302001954
eager eval time 3: 0.015856639862060547
eager eval time 4: 0.016062463760375977
eager eval time 5: 0.016149408340454103
eager eval time 6: 0.01600307273864746
eager eval time 7: 0.01600614356994629
eager eval time 8: 0.015964159965515135
eager eval time 9: 0.015935487747192383
~~~~~~~~~~
compile eval time 0: 0.708474853515625
compile eval time 1: 0.008540160179138183
compile eval time 2: 0.00828006362915039
compile eval time 3: 0.008294400215148925
compile eval time 4: 0.00828825569152832
compile eval time 5: 0.008264703750610352
compile eval time 6: 0.008274944305419921
compile eval time 7: 0.008263680458068847
compile eval time 8: 0.008263680458068847
compile eval time 9: 0.00827187156677246
~~~~~~~~~~
(eval) eager median: 0.016004608154296874, compile median: 0.008277503967285157, speedup: 1.9335065519208734x
~~~~~~~~~~
```
Indeed, we can see that running our model with `torch.compile` results in a significant speedup. The speedup mainly comes from reducing Python overhead and GPU read/writes, so the observed speedup may vary with factors such as model architecture and batch size. For example, if a model's architecture is simple and the amount of data is large, then the bottleneck will be GPU compute and the observed speedup may be less significant.
You may also see different speedup results depending on the chosen `mode` argument. The `"reduce-overhead"` mode uses CUDA graphs to further reduce Python overhead. For your own models, you may need to experiment with different modes to maximize speedup. You can read more about modes here.
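A minimal sketch of such an experiment, reusing `init_model`, `generate_data`, and `timed` from above (the specific mode strings are assumptions based on the `torch.compile` documentation, and compile times vary widely between them):

```python
# Sweep a few torch.compile modes and compare steady-state timings.
# Mode strings are assumptions from the torch.compile docs; "max-autotune"
# in particular can take much longer to compile.
import torch._dynamo

for mode in [None, "reduce-overhead", "max-autotune"]:
    torch._dynamo.reset()  # clear compilation caches between runs
    compiled = torch.compile(init_model(), mode=mode)
    inp = generate_data(16)[0]
    with torch.no_grad():
        for _ in range(3):  # warm-up (also covers CUDA graph warm-up runs)
            timed(lambda: compiled(inp))
        _, t = timed(lambda: compiled(inp))
    print(f"mode={mode}: {t}")
```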
You may also notice that the second time we run our model with `torch.compile` is significantly slower than the other runs, although it is much faster than the first run. This is because the `"reduce-overhead"` mode runs a few warm-up iterations for CUDA graphs.
For general PyTorch benchmarking, you can try using `torch.utils.benchmark` instead of the `timed` function we defined above. We wrote our own timing function in this tutorial to show `torch.compile`'s compilation latency.
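A minimal sketch of what that might look like, reusing `model_opt` and `inp` from above (the arguments here are one reasonable configuration, not the only one):

```python
# Benchmark the compiled model with torch.utils.benchmark, which handles
# CUDA synchronization and warm-up internally, so it reports steady-state
# performance rather than compilation latency.
from torch.utils import benchmark

timer = benchmark.Timer(
    stmt="model_opt(inp)",
    globals={"model_opt": model_opt, "inp": inp},
)
print(timer.timeit(100))  # run the statement 100 times and print a summary
```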
Now, let's consider comparing training.
```python
model = init_model()
opt = torch.optim.Adam(model.parameters())

def train(mod, data):
    opt.zero_grad(True)
    pred = mod(data[0])
    loss = torch.nn.CrossEntropyLoss()(pred, data[1])
    loss.backward()
    opt.step()

eager_times = []
for i in range(N_ITERS):
    inp = generate_data(16)
    _, eager_time = timed(lambda: train(model, inp))
    eager_times.append(eager_time)
    print(f"eager train time {i}: {eager_time}")
print("~" * 10)

model = init_model()
opt = torch.optim.Adam(model.parameters())
train_opt = torch.compile(train, mode="reduce-overhead")

compile_times = []
for i in range(N_ITERS):
    inp = generate_data(16)
    _, compile_time = timed(lambda: train_opt(model, inp))
    compile_times.append(compile_time)
    print(f"compile train time {i}: {compile_time}")
print("~" * 10)

eager_med = np.median(eager_times)
compile_med = np.median(compile_times)
speedup = eager_med / compile_med
assert(speedup > 1)
print(f"(train) eager median: {eager_med}, compile median: {compile_med}, speedup: {speedup}x")
print("~" * 10)
```
```
eager train time 0: 0.3557437438964844
eager train time 1: 0.0508171501159668
eager train time 2: 0.04858163070678711
eager train time 3: 0.048674816131591796
eager train time 4: 0.04914883041381836
eager train time 5: 0.04877619171142578
eager train time 6: 0.048503807067871094
eager train time 7: 0.048318462371826174
eager train time 8: 0.04821299362182617
eager train time 9: 0.04865331268310547
~~~~~~~~~~
compile train time 0: 208.459546875
compile train time 1: 5.33654541015625
compile train time 2: 0.0332677116394043
compile train time 3: 0.023565311431884766
compile train time 4: 0.023459840774536132
compile train time 5: 0.02349772834777832
compile train time 6: 0.023554048538208007
compile train time 7: 0.02490163230895996
compile train time 8: 0.023513023376464843
compile train time 9: 0.02345062446594238
~~~~~~~~~~
(train) eager median: 0.048664064407348634, compile median: 0.023559679985046385, speedup: 2.065565595043579x
~~~~~~~~~~
```
Again, we can see that `torch.compile` takes longer in the first iteration, as it must compile the model, but in subsequent iterations we see significant speedups compared to eager.
We remark that the speedup numbers presented in this tutorial are for demonstration purposes only. Official speedup values can be seen at the TorchInductor performance dashboard.