手把手教你如何高效地在 MMCV 中贡献算子-阿里云开发者社区

手把手教你如何高效地在 MMCV 中贡献算子

2022-05-09 665

版权

本文内容由阿里云实名注册用户自发贡献，版权归原作者所有，阿里云开发者社区不拥有其著作权，亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容，填写侵权投诉表单进行举报，一经查实，本社区将立刻删除涉嫌侵权内容。

简介： 不知道大家在使用 MMCV 的过程中有没有遇到这种情况：MMCV 没有提供自己需要的 CPU/CUDA 算子，于是希望提一个 PR（Pull Request），将这个算子加入 MMCV，但是又不知从何处下手。本文以最简单的 TensorAdd 算子为例，向大家展示为 MMCV 贡献算子的全过程，希望能够帮助大家更好地理解 MMCV 算子的

前言

不知道大家在使用 MMCV 的过程中有没有遇到这种情况：MMCV 没有提供自己需要的 CPU/CUDA 算子，于是希望提一个 PR（Pull Request），将这个算子加入 MMCV，但是又不知从何处下手。本文以最简单的 TensorAdd 算子为例，向大家展示为 MMCV 贡献算子的全过程，希望能够帮助大家更好地理解 MMCV 算子的目录结构，以便更高效地贡献算子。

注意：如果您不太了解提 PR 的流程，可以阅读https://mmcv.readthedocs.io/zh_CN/latest/community/pr.html。

1.更新文档

在 docs/en/understand_mmcv/ops.md 以及 docs/zh_cn/understand_mmcv/ops.md 里添加 TensorAdd，该文档用于维护 MMCV 已支持的算子。

2.算子实现

在这篇文章里我们提供了 CPU 算子和 CUDA 算子的实现方法，您可以选择只提供 CPU 算子或者只提供 CUDA 算子或者两种算子都提供。

2.1 提供算子 C++ 接口

在 mmcv/ops/csrc/pytorch/ 目录添加 tensor_add.cpp 文件。

// Copyright(c) OpenMMLab.All rights reserved.
 2#include "pytorch_cpp_helper.hpp"
 3#include "pytorch_device_registry.hpp"
 4
 5void tensor_add_impl(const Tensor input1, const Tensor input2, Tensor output) {
 6  DISPATCH_DEVICE_IMPL(tensor_add_impl, input1, input2, output);
 7}
 8
 9void tensor_add(const Tensor input1, const Tensor input2, Tensor output) {
10  tensor_add_impl(input1, input2, output);
11}

tensor_add 是算子在 C++ 层的接口，而 tensor_add_impl 中的 DISPATCH_DEVICE_IMPL 宏会根据 Tensor 参数的设备类型自动选择 CPU 或 CUDA 的算子实现。DISPATCH_DEVICE_IMPL 宏依赖于REGISTER_DEVICE_IMPL 宏，我们会在下面的 CPU 算子实现和 CUDA 算子实现提到。DISPATCH_DEVICE_IMPL 和 REGISTER_DEVICE_IMPL 都用于算子分发，更多细节见兼容性文档，PR-1463 以及 Dispatcher 机制解析。

2.2 CPU 算子实现

在 mmcv/ops/csrc/pytorch/cpu/ 目录添加 tensor_add_cpu.cpp 文件。

// Copyright(c) OpenMMLab.All rights reserved.
 2#include "pytorch_cpp_helper.hpp"
 3#include "pytorch_device_registry.hpp"
 4
 5template <typename T>
 6void tensor_add_cpu_kernel(int N, const T* input1, const T* input2, T* output) {
 7  for (int i = 0; i < N; i++) {
 8    output[i] = input1[i] + input2[i];
 9  }
10}
11
12void TensorAddCPUKernelLaucher(const Tensor input1, const Tensor input2,
13                         Tensor output) {
14  int N = input1.size(0);
15  AT_DISPATCH_FLOATING_TYPES_AND_HALF(
16      input1.scalar_type(), "tensor_add_cpu_kernel", [&] {
17        tensor_add_cpu_kernel<scalar_t>(N, input1.data_ptr<scalar_t>(),
18                                        input2.data_ptr<scalar_t>(),
19                                        output.data_ptr<scalar_t>());
20      });
21}
22
23void tensor_add_cpu(const Tensor input1, const Tensor input2, Tensor output) {
24  TensorAddCPUKernelLaucher(input1, input2, output);
25}
26void tensor_add_impl(const Tensor input1, const Tensor input2, Tensor output);
27
28REGISTER_DEVICE_IMPL(tensor_add_impl, CPU, tensor_add_cpu);

我们看到最后一行 REGISTER_DEVICE_IMPL 宏将 tensor_add_impl 和 tensor_add_cpu 绑定在一起。tensor_add_cpu 调用 TensorAddCPUKernelLaucher，TensorAddCPUKernelLaucher 启动 tensor_add_cpu_kernel 完成算子的计算。

2.3 CUDA 算子实现

CUDA 算子的调用过程和 CPU 算子类似，但是在代码安排上略有不同，比如很多 CUDA Kernel 被放在 mmcv/ops/csrc/common/cuda/ 目录，因为该目录负责管理后端无关可共享的代码（cuda kernel，mlu kernel等）。

2.3.1 算子绑定

在 mmcv/ops/csrc/pytorch/cuda/cudabind.cpp 里添加 TensorAdd 的 CUDA 算子绑定。

// Copyright (c) OpenMMLab. All rights reserved.
 2...
 3void TensorAddCUDAKernelLauncher(const Tensor input1, const Tensor input2,
 4                                 const Tensor output);
 5
 6void tensor_add_cuda(const Tensor input1, const Tensor input2, Tensor output) {
 7  TensorAddCUDAKernelLauncher(input1, input2, output);
 8}
 9
10void tensor_add_impl(const Tensor input1, const Tensor input2, Tensor output);
11
12REGISTER_DEVICE_IMPL(tensor_add_impl, CUDA, tensor_add_cuda);

我们看到最后一行 REGISTER_DEVICE_IMPL 宏将 tensor_add_impl 和 tensor_add_cuda 绑定在一起。tensor_add_cuda 调用 TensorAddCUDAKernelLaucher。

2.3.2 KernelLaucher

在 mmcv/ops/csrc/pytorch/cuda/ 目录添加 tensor_add_cuda.cu 文件。

// Copyright (c) OpenMMLab. All rights reserved.
 2#include <torch/types.h>
 3
 4#include "pytorch_cuda_helper.hpp"
 5#include "tensor_add_cuda_kernel.cuh"
 6
 7void TensorAddCUDAKernelLauncher(const Tensor input1, const Tensor input2,
 8                                 const Tensor output) {
 9  int N = input1.size(0);
10  at::cuda::CUDAGuard device_guard(input1.device());
11  cudaStream_t stream = at::cuda::getCurrentCUDAStream();
12
13  AT_DISPATCH_FLOATING_TYPES_AND_HALF(
14      input1.scalar_type(), "tensor_add_cuda_kernel", ([&] {
15        dim3 blocks(GET_BLOCKS(N));
16        dim3 threads(THREADS_PER_BLOCK);
17        tensor_add_cuda_kernel<<<blocks, threads, 0, stream>>>(
18            N, input1.data_ptr<scalar_t>(), input2.data_ptr<scalar_t>(),
19            output.data_ptr<scalar_t>());
20      }));
21
22  AT_CUDA_CHECK(cudaGetLastError());
23}

TensorAddCUDAKernelLauncher 会启动 tensor_add_cuda_kernel 完成算子的具体操作。其中使用 AT_DISPATCH_FLOATING_TYPES_AND_HALF 宏启动 CUDA Kernel ，该宏内部包装了一个 switch 语句来完成针对张量类型的分派，更多这类宏可见：https://github.com/pytorch/pytorch/blob/HEAD/aten/src/ATen/Dispatch.h#L262。

2.3.3 Kernel

在 mmcv/ops/csrc/common/cuda/ 下添加 tensor_add_cuda_kernel.cuh。

// Copyright (c) OpenMMLab. All rights reserved.
 2#ifndef TENSOR_ADD_CUDA_KERNEL_CUH
 3#define TENSOR_ADD_CUDA_KERNEL_CUH
 4
 5#ifdef MMCV_USE_PARROTS
 6#include "parrots_cuda_helper.hpp"
 7#else
 8#include "pytorch_cuda_helper.hpp"
 9#endif
10
11template <typename T>
12__global__ void tensor_add_cuda_kernel(const int N, const T* input1,
13                                       const T* input2, T* output) {
14  CUDA_1D_KERNEL_LOOP(i, N) { output[i] = input1[i] + input2[i]; }
15}
16#endif  // TENSOR_ADD_CUDA_KERNEL_CUH

在这里实现了算子的具体操作。其中 CUDA_1D_KERNEL_LOOP 是 MMCV 提供的一个简写 Kernel Loop 的宏

3. 提供 Python 接口

在完成 C++/CUDA 的算子后，我们需要在 mmcv/ops/csrc/pytorch/pybind.cpp 里实现 C++ 接口和 Python 接口的绑定，从而提供一个 Python 可以调用的接口。

// Copyright (c) OpenMMLab. All rights reserved.
2...
3void tensor_add(const Tensor input1, const Tensor input2, Tensor output);
4
5PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {  
6  ...
7  m.def("tensor_add", &tensor_add, "tensor_add", py::arg("input1"),
8        py::arg("input2"), py::arg("output"));
9}

4. Python 调用算子

在 mmcv/ops/ 下添加 tensor_add.py。

# Copyright (c) OpenMMLab. All rights reserved.
 2import torch
 3from torch.autograd import Function
 4from torch.autograd.function import once_differentiable
 5
 6from ..utils import ext_loader
 7
 8ext_module = ext_loader.load_ext('_ext', ['tensor_add'])
 9
10
11class TensorAdd(Function):
12
13    @staticmethod
14    def forward(ctx, input1: torch.Tensor,
15                input2: torch.Tensor) -> torch.Tensor:
16        """Add two tensor.
17
18        Args:
19            input1 (torch.Tensor): shape (N).
20            input2 (torch.Tensor): shape (N).
21
22        Returns:
23            torch.Tensor: shape (N), tensor of input1 + input2.
24        """
25        assert input1.is_contiguous() and input2.is_contiguous()
26        assert input1.dim() == 1 and input2.dim() == 1
27        assert input1.size(0) == input2.size(0)
28        output = torch.zeros(
29            input1.size(0), dtype=input1.dtype, device=input1.device)
30        ext_module.tensor_add(input1, input2, output)
31        return output
32
33    @staticmethod
34    @once_differentiable
35    def backward(ctx, grad_output):
36        return grad_output, grad_output
37
38
39tensor_add = TensorAdd.apply

在 mmcv/ops/__init__.py 文件添加对外的接口。

# Copyright (c) OpenMMLab. All rights reserved.
2...
3from .tensor_add import tensor_add
4
5__all__ = [
6    ..., 
7    'tensor_add'
8]

5. 编译 MMCV

参考从源码编译 MMCV 的步骤重新编译 MMCV

从源码编译 MMCV：https://mmcv.readthedocs.io/zh_CN/latest/get_started/build.html

PyTorch 源码解读之 cpp_extension：

https://zhuanlan.zhihu.com/p/348555597

6. 添加单元测试

import numpy as np
 2import pytest
 3import torch
 4
 5from mmcv.ops import tensor_add
 6
 7
 8@pytest.mark.parametrize('device', [
 9    'cpu',
10    pytest.param(
11        'cuda',
12        marks=pytest.mark.skipif(
13            not torch.cuda.is_available(), reason='requires CUDA support'))
14])
15@pytest.mark.parametrize('dtype', [torch.float, torch.half])
16def test_tensor_add(device, dtype):
17    n = 1024 * 1024
18    input1 = torch.rand(n).type(dtype).to(device).requires_grad_()
19    input2 = torch.rand(n).type(dtype).to(device).requires_grad_()
20    expected_output = (input1 + input2).cpu()
21    output = tensor_add(input1, input2)
22    output.backward(torch.ones_like(output))
23    assert np.allclose(
24        output.detach().cpu(), expected_output.detach(), atol=1e-4)
25    assert np.allclose(input1.grad.cpu().detach(), 1, atol=1e-4)

在终端通过 pytest tests/test_ops/test_tensor_add.py 测试 tensor_add 是否正确。

7. 需要注意的点

1.文件需要添加 copyright。

2.DISPATCH_DEVICE_IMPL 接口参数至少要有一个是 Tensor，否则无法根据设备类型做算子分发。

3.目录安排、代码风格和命名规范等需符合 MMCV 算子的规范，例如 tensor_add_cpu_kernel 不能写成 tensor_add_kernel、tensor_add_cpukernel 等其他形式。

4.如果算子需要反向传播，请编写反向传播算子并根据自定义 C++ 和 CUDA 扩展组织 Python 代码。

8. 函数调用流程图

横实线区别 Python 层和 C++ 层，两者之间通过 pybind11 绑定。竖实线区别 C++ 算子和 CUDA 算子，算子的分发由 DISPATCH_DEVICE_IMPL 决定。

9. 总结

希望本篇文章让您更为深入地了解了如何在 MMCV 中添加自定义算子

文章来源：【OpenMMLab】

2022-02-09 18:07

手把手教你如何高效地在 MMCV 中贡献算子

前言

1.更新文档

2.算子实现

2.1 提供算子 C++ 接口

2.2 CPU 算子实现

2.3 CUDA 算子实现

2.3.1 算子绑定

2.3.2 KernelLaucher

2.3.3 Kernel

3. 提供 Python 接口

4. Python 调用算子

5. 编译 MMCV

6. 添加单元测试

7. 需要注意的点

8. 函数调用流程图

9. 总结

热门文章

最新文章

相关课程

相关电子书

相关实验场景

热门

活动广场

任务中心

开发者评测

高校计划

乘风者计划

训练营

阿里云MVP

话题

直播

下载

镜像站

技术资料

插件

手把手教你如何高效地在 MMCV 中贡献算子

前言

1.更新文档

2.算子实现

2.1 提供算子 C++ 接口

2.2 CPU 算子实现

2.3 CUDA 算子实现

2.3.1 算子绑定

2.3.2 KernelLaucher

2.3.3 Kernel

3. 提供 Python 接口

4. Python 调用算子

5. 编译 MMCV

6. 添加单元测试

7. 需要注意的点

8. 函数调用流程图

9. 总结

热门文章

最新文章

相关课程

相关电子书

相关实验场景