Introduction to torch.compile
Original: pytorch.org/tutorials/intermediate/torch_compile_tutorial.html

Translator: 飞龙
Author: William Wen
`torch.compile` is the latest method to speed up your PyTorch code! `torch.compile` makes PyTorch code run faster by JIT-compiling it into optimized kernels, all while requiring minimal code changes.

In this tutorial, we cover basic `torch.compile` usage and demonstrate the advantages of `torch.compile` over previous PyTorch compiler solutions, such as TorchScript and FX Tracing.
Contents

- Basic Usage
- Demonstrating Speedups
- Comparison to TorchScript and FX Tracing
- TorchDynamo and FX Graphs
- Conclusion
Required pip dependencies:

- torch >= 2.0
- torchvision
- numpy
- scipy
- tabulate
NOTE: a modern NVIDIA GPU (H100, A100, or V100) is recommended for this tutorial in order to reproduce the speedup numbers shown below and documented elsewhere.
```python
import torch
import warnings

gpu_ok = False
if torch.cuda.is_available():
    device_cap = torch.cuda.get_device_capability()
    if device_cap in ((7, 0), (8, 0), (9, 0)):
        gpu_ok = True

if not gpu_ok:
    warnings.warn(
        "GPU is not NVIDIA V100, A100, or H100. Speedup numbers may be lower "
        "than expected."
    )
```
```
/var/lib/jenkins/workspace/intermediate_source/torch_compile_tutorial.py:48: UserWarning: GPU is not NVIDIA V100, A100, or H100. Speedup numbers may be lower than expected.
```
Basic Usage
`torch.compile` is included in the latest PyTorch. Running TorchInductor on GPU requires Triton, which is included with the PyTorch 2.0 nightly binaries. If Triton is still missing, try installing `torchtriton` via pip (`pip install torchtriton --extra-index-url "https://download.pytorch.org/whl/nightly/cu117"` for CUDA 11.7).
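Before continuing, it can be worth a quick environment check. The snippet below is a minimal sketch (that Triton is importable as `triton` and is only needed for the GPU backend is our assumption, based on the requirements above):

```python
# Minimal environment check: verify the PyTorch version and that Triton
# (needed by TorchInductor on GPU) can be imported.
import torch

print(torch.__version__)  # expected to be >= 2.0
try:
    import triton  # noqa: F401
    print("Triton is available")
except ImportError:
    print("Triton is missing -- see the pip command above")
```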
Arbitrary Python functions can be optimized by passing the callable to `torch.compile`. We can then call the returned optimized function in place of the original function.
```python
def foo(x, y):
    a = torch.sin(x)
    b = torch.cos(y)
    return a + b

opt_foo1 = torch.compile(foo)
print(opt_foo1(torch.randn(10, 10), torch.randn(10, 10)))
```
```
tensor([[ 1.6850,  1.9924,  1.7090,  0.0034,  1.1414, -0.1822,  0.4861, -0.0536, -0.2252,  1.9398],
        [ 0.3693, -0.0695,  0.1748,  0.3436,  0.1939,  1.5721,  1.9882, -0.2235,  0.3161,  1.2642],
        [ 0.2480,  1.8793,  1.7152,  1.6772,  1.8881,  1.4748,  1.3466,  1.7763,  0.7469,  1.0407],
        [-0.1121,  1.6015, -0.0188,  0.2128,  0.5218,  1.9838,  0.8185,  0.5093, -0.3603,  0.1793],
        [-1.7890,  1.7532, -0.4040,  0.1222, -0.0029,  1.7975, -0.3877,  0.5123,  0.1673,  0.1330],
        [ 1.0627,  0.9609,  0.1019,  1.8814,  0.1142, -0.2338, -0.9621,  0.7631,  0.6506,  0.1853],
        [ 0.4584,  1.7648, -0.0444,  1.9610,  1.5884,  0.7353,  1.2190,  1.3662,  1.0938, -0.1587],
        [-0.7502,  1.6640,  0.3495,  1.3496,  0.8187,  1.1719,  0.5820,  0.1498,  0.0885,  0.1036],
        [ 0.3961,  0.6043, -0.0861, -0.3371,  0.8622,  1.4341,  1.2988,  0.5023,  0.3074,  0.1277],
        [ 0.9748,  0.4117,  1.2616,  1.6314,  0.4693,  0.4092,  0.0401,  1.1196,  1.2458,  1.3280]])
```
Alternatively, we can decorate the function.
```python
@torch.compile
def opt_foo2(x, y):
    a = torch.sin(x)
    b = torch.cos(y)
    return a + b

print(opt_foo2(torch.randn(10, 10), torch.randn(10, 10)))
```
```
tensor([[ 0.5360,  0.1697, -0.0561,  0.1890, -0.1310,  1.2276,  1.1739,  0.1944, -0.1561,  1.6990],
        [ 1.0421,  1.9472,  0.2682,  0.2701,  1.3346,  0.7651,  1.0897,  1.1730,  0.6161,  0.9223],
        [ 1.5756,  1.5294,  0.0112, -0.1522, -0.7674,  1.8515, -0.2443,  0.3696,  0.2693,  0.8735],
        [-0.3701,  1.1190,  1.4164,  1.8648,  1.2080,  0.0732,  1.5274,  0.6868,  1.2440,  1.0715],
        [-1.2454, -0.0159,  0.4315,  0.1317,  1.0530, -1.0603, -0.0532,  0.6661,  1.7101, -0.2076],
        [-0.7091,  0.7824,  1.7161,  1.2750,  0.6368,  1.2488,  0.4897,  1.2429,  1.3409,  1.3735],
        [ 0.8345,  0.0653,  0.3462,  1.2383, -0.4092,  1.6438, -0.0962,  0.4011,  0.2463, -0.5802],
        [ 1.6349,  0.7297,  1.2547, -0.3113,  0.9310,  0.1162,  1.7618,  0.4882,  0.7640,  0.2930],
        [ 1.1669, -0.7775,  1.2000,  0.6008, -0.2814,  0.5541,  0.5753,  1.4731,  1.6835,  0.7370],
        [ 1.5087,  0.6195,  0.1153,  1.2966,  1.8815,  1.1678,  1.5686,  1.6018,  0.2193,  1.3500]])
```
We can also optimize `torch.nn.Module` instances.
```python
class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(100, 10)

    def forward(self, x):
        return torch.nn.functional.relu(self.lin(x))

mod = MyModule()
opt_mod = torch.compile(mod)
print(opt_mod(torch.randn(10, 100)))
```
```
tensor([[-0.0000, -0.0000,  0.2419,  0.0446,  0.9011,  0.2674,  0.3633,  0.4984, -0.0000,  0.0988],
        [ 0.6906, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000,  0.8490, -0.0000, -0.0000,  0.5475],
        [ 0.0852,  0.2762,  0.7441, -0.0000, -0.0000,  0.1820, -0.0000, -0.0000, -0.0000,  0.0334],
        [ 0.3024,  0.0077,  1.2572, -0.0000, -0.0000,  0.6520, -0.0000, -0.0000, -0.0000,  0.8976],
        [ 0.1998,  0.3333, -0.0000,  0.7803,  0.4202,  0.0915, -0.0000,  1.2543, -0.0000,  0.4615],
        [ 0.2487,  0.4187, -0.0000, -0.0000,  0.5124, -0.0000,  0.2512, -0.0000,  0.5850, -0.0000],
        [-0.0000,  0.0048, -0.0000, -0.0000, -0.0000,  0.2287, -0.0000,  0.4841,  0.3915, -0.0000],
        [ 0.2017, -0.0000,  0.0896,  1.4135,  0.0593,  0.3788, -0.0000, -0.0000, -0.0000,  0.4972],
        [-0.0000, -0.0000,  1.6580,  0.6414, -0.0000, -0.0000, -0.0000, -0.0000,  0.6491,  0.7755],
        [-0.0000, -0.0000,  0.6442,  0.0260,  0.7456,  0.1000, -0.0000, -0.0000,  0.5366,  0.1193]],
       grad_fn=<CompiledFunctionBackward>)
```
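As a quick sanity check (a minimal sketch; the tolerances below are arbitrary choices of ours), the compiled module should return the same values as the eager module, up to the small numerical differences that optimized kernels can introduce:

```python
# Illustrative check: compiled and eager outputs should agree within
# small numerical tolerances.
x = torch.randn(10, 100)
torch.testing.assert_close(mod(x), opt_mod(x), rtol=1e-4, atol=1e-4)
print("eager and compiled outputs match")
```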
Demonstrating Speedups
Let's now demonstrate that using `torch.compile` can speed up real models. We will compare standard eager mode and `torch.compile` by evaluating and training a `torchvision` model on random data.
Before we begin, we need to define some utility functions.
```python
# Returns the result of running `fn()` and the time it took for `fn()` to run,
# in seconds. We use CUDA events and synchronization for the most accurate
# measurements.
def timed(fn):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    result = fn()
    end.record()
    torch.cuda.synchronize()
    return result, start.elapsed_time(end) / 1000

# Generates random input and targets data for the model, where `b` is
# batch size.
def generate_data(b):
    return (
        torch.randn(b, 3, 128, 128).to(torch.float32).cuda(),
        torch.randint(1000, (b,)).cuda(),
    )

N_ITERS = 10

from torchvision.models import densenet121
def init_model():
    return densenet121().to(torch.float32).cuda()
```
First, let's compare inference.

Note that in the call to `torch.compile`, we have the additional `mode` argument, which we will discuss below.
```python
model = init_model()

# Reset since we are using a different mode.
import torch._dynamo
torch._dynamo.reset()

model_opt = torch.compile(model, mode="reduce-overhead")

inp = generate_data(16)[0]
with torch.no_grad():
    print("eager:", timed(lambda: model(inp))[1])
    print("compile:", timed(lambda: model_opt(inp))[1])
```
```
eager: 0.3166423034667969
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:140: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
compile: 76.9008984375
```
Notice that `torch.compile` takes a lot longer to complete compared to eager. This is because `torch.compile` compiles the model into optimized kernels as it executes. In our example, the structure of the model doesn't change, so recompilation is not needed. Therefore, if we run our optimized model several more times, we should see a significant improvement compared to eager.
```python
eager_times = []
for i in range(N_ITERS):
    inp = generate_data(16)[0]
    with torch.no_grad():
        _, eager_time = timed(lambda: model(inp))
    eager_times.append(eager_time)
    print(f"eager eval time {i}: {eager_time}")
print("~" * 10)

compile_times = []
for i in range(N_ITERS):
    inp = generate_data(16)[0]
    with torch.no_grad():
        _, compile_time = timed(lambda: model_opt(inp))
    compile_times.append(compile_time)
    print(f"compile eval time {i}: {compile_time}")
print("~" * 10)

import numpy as np
eager_med = np.median(eager_times)
compile_med = np.median(compile_times)
speedup = eager_med / compile_med
assert(speedup > 1)
print(f"(eval) eager median: {eager_med}, compile median: {compile_med}, speedup: {speedup}x")
print("~" * 10)
```
```
eager eval time 0: 0.018123775482177733
eager eval time 1: 0.01638707160949707
eager eval time 2: 0.015945728302001954
eager eval time 3: 0.015856639862060547
eager eval time 4: 0.016062463760375977
eager eval time 5: 0.016149408340454103
eager eval time 6: 0.01600307273864746
eager eval time 7: 0.01600614356994629
eager eval time 8: 0.015964159965515135
eager eval time 9: 0.015935487747192383
~~~~~~~~~~
compile eval time 0: 0.708474853515625
compile eval time 1: 0.008540160179138183
compile eval time 2: 0.00828006362915039
compile eval time 3: 0.008294400215148925
compile eval time 4: 0.00828825569152832
compile eval time 5: 0.008264703750610352
compile eval time 6: 0.008274944305419921
compile eval time 7: 0.008263680458068847
compile eval time 8: 0.008263680458068847
compile eval time 9: 0.00827187156677246
~~~~~~~~~~
(eval) eager median: 0.016004608154296874, compile median: 0.008277503967285157, speedup: 1.9335065519208734x
~~~~~~~~~~
```
Indeed, we can see that running our model with `torch.compile` results in a significant speedup. The speedup mainly comes from reducing Python overhead and GPU read/writes, so the observed speedup may vary with factors such as model architecture and batch size. For example, if a model's architecture is simple and the amount of data is large, then the bottleneck will be GPU compute and the observed speedup may be less significant.
You may also see different speedup results depending on the chosen `mode` argument. The `"reduce-overhead"` mode uses CUDA graphs to further reduce Python overhead. For your own models, you may need to experiment with different modes to maximize speedup. You can read more about modes here.
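A minimal sketch of such an experiment, reusing `init_model`, `generate_data`, and `timed` from above (the specific mode strings are assumptions based on the `torch.compile` documentation, and compile times vary widely between them):

```python
# Sweep a few torch.compile modes and compare steady-state timings.
# Mode strings are assumptions from the torch.compile docs; "max-autotune"
# in particular can take much longer to compile.
import torch._dynamo

for mode in [None, "reduce-overhead", "max-autotune"]:
    torch._dynamo.reset()  # clear compilation caches between runs
    compiled = torch.compile(init_model(), mode=mode)
    inp = generate_data(16)[0]
    with torch.no_grad():
        for _ in range(3):  # warm-up (also covers CUDA graph warm-up runs)
            timed(lambda: compiled(inp))
        _, t = timed(lambda: compiled(inp))
    print(f"mode={mode}: {t}")
```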
You may also notice that the second time we run our model with `torch.compile` is significantly slower than the other runs, although it is much faster than the first run. This is because the `"reduce-overhead"` mode runs a few warm-up iterations for CUDA graphs.
For general PyTorch benchmarking, you can try using `torch.utils.benchmark` instead of the `timed` function we defined above. We wrote our own timing function in this tutorial to show `torch.compile`'s compilation latency.
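A minimal sketch of what that might look like, reusing `model_opt` and `inp` from above (the arguments here are one reasonable configuration, not the only one):

```python
# Benchmark the compiled model with torch.utils.benchmark, which handles
# CUDA synchronization and warm-up internally, so it reports steady-state
# performance rather than compilation latency.
from torch.utils import benchmark

timer = benchmark.Timer(
    stmt="model_opt(inp)",
    globals={"model_opt": model_opt, "inp": inp},
)
print(timer.timeit(100))  # run the statement 100 times and print a summary
```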
Now, let's consider comparing training.
```python
model = init_model()
opt = torch.optim.Adam(model.parameters())

def train(mod, data):
    opt.zero_grad(True)
    pred = mod(data[0])
    loss = torch.nn.CrossEntropyLoss()(pred, data[1])
    loss.backward()
    opt.step()

eager_times = []
for i in range(N_ITERS):
    inp = generate_data(16)
    _, eager_time = timed(lambda: train(model, inp))
    eager_times.append(eager_time)
    print(f"eager train time {i}: {eager_time}")
print("~" * 10)

model = init_model()
opt = torch.optim.Adam(model.parameters())
train_opt = torch.compile(train, mode="reduce-overhead")

compile_times = []
for i in range(N_ITERS):
    inp = generate_data(16)
    _, compile_time = timed(lambda: train_opt(model, inp))
    compile_times.append(compile_time)
    print(f"compile train time {i}: {compile_time}")
print("~" * 10)

eager_med = np.median(eager_times)
compile_med = np.median(compile_times)
speedup = eager_med / compile_med
assert(speedup > 1)
print(f"(train) eager median: {eager_med}, compile median: {compile_med}, speedup: {speedup}x")
print("~" * 10)
```
```
eager train time 0: 0.3557437438964844
eager train time 1: 0.0508171501159668
eager train time 2: 0.04858163070678711
eager train time 3: 0.048674816131591796
eager train time 4: 0.04914883041381836
eager train time 5: 0.04877619171142578
eager train time 6: 0.048503807067871094
eager train time 7: 0.048318462371826174
eager train time 8: 0.04821299362182617
eager train time 9: 0.04865331268310547
~~~~~~~~~~
compile train time 0: 208.459546875
compile train time 1: 5.33654541015625
compile train time 2: 0.0332677116394043
compile train time 3: 0.023565311431884766
compile train time 4: 0.023459840774536132
compile train time 5: 0.02349772834777832
compile train time 6: 0.023554048538208007
compile train time 7: 0.02490163230895996
compile train time 8: 0.023513023376464843
compile train time 9: 0.02345062446594238
~~~~~~~~~~
(train) eager median: 0.048664064407348634, compile median: 0.023559679985046385, speedup: 2.065565595043579x
~~~~~~~~~~
```
Again, we can see that `torch.compile` takes longer in the first iteration, as it must compile the model, but in subsequent iterations we see significant speedups compared to eager.
We remark that the speedup numbers presented in this tutorial are for demonstration purposes only. Official speedup values can be seen at the TorchInductor performance dashboard.