从头开始重新创建 PyTorch-阿里云开发者社区

简介

多年来，我们一直在使用 PyTorch 来构建和训练深度学习模型。尽管对其语法和规则已经了如指掌，但内心深处总有一些疑问：这些操作背后究竟发生了什么？它们是如何运作的？如果询问你如何在 PyTorch 中创建和训练一个模型，你可能会想到类似以下的代码：

import torch
import torch.nn as nn
import torch.optim as optim

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.fc1 = nn.Linear(1, 10)
        self.sigmoid = nn.Sigmoid()
        self.fc2 = nn.Linear(10, 1)

    def forward(self, x):
        out = self.fc1(x)
        out = self.sigmoid(out)
        out = self.fc2(out)

        return out

...

model = MyModel().to(device)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(epochs):
    for x, y in ...

        x = x.to(device)
        y = y.to(device)

        outputs = model(x)
        loss = criterion(outputs, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

如果询问你这个回溯步骤是如何运作的，你会如何回答？再比如，当你改变一个张量的形状时，背后发生了什么？数据是否在内部被重新组织？这个过程是如何进行的？PyTorch为何能够运行得如此迅速？它是如何处理GPU运算的？这些问题一直让我充满好奇，我相信它们同样也引起了你的兴趣。因此，为了更深入地掌握这些概念，有什么方法能比自己动手从零开始构建一个张量库更有效呢？这就是你在本文中将要学习的内容！

Tensor

要创建一个张量库，你首先必须掌握的概念当然是：张量是什么？

你可能已经有一个直观的理解，即张量是一个包含数值的多维数据结构的数学概念。但在这儿，我们得从计算的角度来理解如何构建这种数据结构。我们可以将张量想象成由两部分组成：一部分是数据本身，另一部分是描述张量特征的元数据，比如它的维度形状，或者它存储在哪种设备上（比如CPU内存或GPU内存）。

还有一个你可能不太熟悉的元数据概念，称为步长（stride）。理解这个概念对于深入掌握张量数据重排的内部工作原理至关重要，因此我们有必要对其进行更深入的探讨。

设想一个形状为 [4, 8] 的二维张量，如下图所示。

张量的数据（即浮点数）实际上在内存中存储为一维数组：

所以，为了将这个一维数组表示为多维张量，我们利用了步长（strides）的概念。简单来说，思路是这样的：

我们有一个 4 行 8 列的矩阵。假设所有元素都是按行顺序存储在一维数组中，如果我们想访问位于 [2, 3] 的元素值，我们需要跳过 2 行（每行有 8 个元素）再加上 3 个位置。用数学语言描述，我们需要在一维数组中跳过 3 + 2 * 8 个元素来访问它：

这个“8”代表的是第二维度的步长（stride）。简单来说，它告诉我们在数组中需要跳过多少个元素才能“移动”到第二维度上的其他位置。因此，要访问形状为 [shape_0, shape_1] 的二维张量中的元素 [i, j]，我们实际上需要定位到 j + i * shape_1 的位置。

接下来，让我们设想一个三维张量的情况：

你可以将这个三维张量视作矩阵的序列。比如，你可以把这个形状为 [5, 4, 8] 的张量看作是 5 个形状为 [4, 8] 的矩阵。现在，要获取位于 [1, 2, 7] 的元素，你需要跳过 1 个完整的 [4,8] 形状的矩阵，2 行 [8] 形状的行，以及 7 列 [1] 形状的列。所以，你需要在一维数组中跳过 (1 4 8) + (2 8) + (7 1) 个位置。

因此，要访问 1-D 数据数组上具有 [shape_0, shape_1, shape_2] 的 3-D 张量的元素 [i][j][k]，您可以执行以下操作：

shape_1 * shape_2 是第一维度的步幅，shape_2 是第二维度的步幅，1 是第三维度的步幅。然后，为了概括：

其中每个维度的步幅可以使用下一维度张量形状的乘积来计算：

然后我们设置 stride[n-1] = 1。在形状为 [5, 4, 8] 的张量示例中，步长 = [4*8, 8, 1] = [32, 8, 1] 您可以自行测试：

import torch

torch.rand([5, 4, 8]).stride()
#(32, 8, 1)

好的，但我们为什么要使用形状和步长呢？这个概念不仅可以用于访问存储为一维数组形式的N维张量的元素，还可以非常方便地用来调整张量的布局。

比如，当你想要改变一个张量的形状时，只需指定新的形状，并据此计算出新的步长即可！（这样做是因为新的形状确保了元素的总数保持不变）

import torch

t = torch.rand([5, 4, 8])

print(t.shape)
# [5, 4, 8]

print(t.stride())
# [32, 8, 1]

new_t = t.reshape([4, 5, 2, 2, 2])

print(new_t.shape)
# [4, 5, 2, 2, 2]

print(new_t.stride())
# [40, 8, 4, 2, 1]

在内部，张量仍然以一维数组的形式存储。重塑操作并没有改变数组中元素的排列顺序！这真是令人惊叹，对吧？你可以通过下面的函数来亲自验证这一点，该函数能够访问 PyTorch 内部的一维数组：

import ctypes

def print_internal(t: torch.Tensor):
    print(
        torch.frombuffer(
            ctypes.string_at(t.data_ptr(), t.storage().nbytes()), dtype=t.dtype
        )
    )

print_internal(t)
# [0.0752, 0.5898, 0.3930, 0.9577, 0.2276, 0.9786, 0.1009, 0.138, ...

print_internal(new_t)
# [0.0752, 0.5898, 0.3930, 0.9577, 0.2276, 0.9786, 0.1009, 0.138, ...

或者例如，您想要转置两个轴。在内部，您只需交换各自的步幅即可！

t = torch.arange(0, 24).reshape(2, 3, 4)
print(t)
# [[[ 0,  1,  2,  3],
#   [ 4,  5,  6,  7],
#   [ 8,  9, 10, 11]],

#  [[12, 13, 14, 15],
#   [16, 17, 18, 19],
#   [20, 21, 22, 23]]]

print(t.shape)
# [2, 3, 4]

print(t.stride())
# [12, 4, 1]

new_t = t.transpose(0, 1)
print(new_t)
# [[[ 0,  1,  2,  3],
#   [12, 13, 14, 15]],

#  [[ 4,  5,  6,  7],
#   [16, 17, 18, 19]],

#  [[ 8,  9, 10, 11],
#   [20, 21, 22, 23]]]

print(new_t.shape)
# [3, 2, 4]

print(new_t.stride())
# [4, 12, 1]

如果打印内部数组，两者具有相同的值：

print_internal(t)
# [ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]

print_internal(new_t)
# [ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]

但是，新张量 new_t 的步长现在与我之前展示的公式不一致。这是因为张量现在不再连续。也就是说，尽管内部数组本身没有变化，它在内存中的存储顺序与张量的实际排列顺序并不一致。

t.is_contiguous()
# True

new_t.is_contiguous()
# False

这意味着，如果我们要按顺序访问那些不连续的元素，效率会比较低（因为这些元素在内存中的存储并不是连续的）。为了提高效率，我们可以通过以下方式来解决：

new_t_contiguous = new_t.contiguous()

print(new_t_contiguous.is_contiguous())
# True

如果我们分析内部数组，它的顺序现在与实际张量顺序匹配，这可以提供更好的内存访问效率：

print(new_t)
# [[[ 0,  1,  2,  3],
#   [12, 13, 14, 15]],

#  [[ 4,  5,  6,  7],
#   [16, 17, 18, 19]],

#  [[ 8,  9, 10, 11],
#   [20, 21, 22, 23]]]

print_internal(new_t)
# [ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]

print_internal(new_t_contiguous)
# [ 0,  1,  2,  3, 12, 13, 14, 15,  4,  5,  6,  7, 16, 17, 18, 19,  8,  9, 10, 11, 20, 21, 22, 23]

构建自己的张量库（Norch）

现在我们已经明白了张量是如何构建的，接下来就让我们着手开发我们的库。我打算给它命名为Norch，这个名字既表示“非PyTorch”，也隐含了我自己的姓氏Nogueira 😁首先需要了解的是，尽管PyTorch是通过Python接口使用的，但其核心实际上是用C/C++编写的。因此，我们首先需要开发我们自己的C/C++底层功能。我们可以首先定义一个结构体来存储张量的数据和元数据，并编写一个函数来创建这个结构体的实例。

//norch/csrc/tensor.cpp

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

typedef struct {
   
   
    float* data;
    int* strides;
    int* shape;
    int ndim;
    int size;
    char* device;
} Tensor;

Tensor* create_tensor(float* data, int* shape, int ndim) {
   
   

    Tensor* tensor = (Tensor*)malloc(sizeof(Tensor));
    if (tensor == NULL) {
   
   
        fprintf(stderr, "Memory allocation failed\n");
        exit(1);
    }
    tensor->data = data;
    tensor->shape = shape;
    tensor->ndim = ndim;

    tensor->size = 1;
    for (int i = 0; i < ndim; i++) {
   
   
        tensor->size *= shape[i];
    }

    tensor->strides = (int*)malloc(ndim * sizeof(int));
    if (tensor->strides == NULL) {
   
   
        fprintf(stderr, "Memory allocation failed\n");
        exit(1);
    }
    int stride = 1;
    for (int i = ndim - 1; i >= 0; i--) {
   
   
        tensor->strides[i] = stride;
        stride *= shape[i];
    }

    return tensor;
}

为了访问某些元素，我们可以利用步幅，正如我们之前学到的：

//norch/csrc/tensor.cpp

float get_item(Tensor* tensor, int* indices) {
   
   
    int index = 0;
    for (int i = 0; i < tensor->ndim; i++) {
   
   
        index += indices[i] * tensor->strides[i];
    }

    float result;
    result = tensor->data[index];

    return result;
}

现在，我们可以创建张量运算。我将展示一些示例，您可以在本文末尾链接的存储库中找到完整版本。

//norch/csrc/cpu.cpp

void add_tensor_cpu(Tensor* tensor1, Tensor* tensor2, float* result_data) {
   
   

    for (int i = 0; i < tensor1->size; i++) {
   
   
        result_data[i] = tensor1->data[i] + tensor2->data[i];
    }
}

void sub_tensor_cpu(Tensor* tensor1, Tensor* tensor2, float* result_data) {
   
   

    for (int i = 0; i < tensor1->size; i++) {
   
   
        result_data[i] = tensor1->data[i] - tensor2->data[i];
    }
}

void elementwise_mul_tensor_cpu(Tensor* tensor1, Tensor* tensor2, float* result_data) {
   
   

    for (int i = 0; i < tensor1->size; i++) {
   
   
        result_data[i] = tensor1->data[i] * tensor2->data[i];
    }
}

void assign_tensor_cpu(Tensor* tensor, float* result_data) {
   
   

    for (int i = 0; i < tensor->size; i++) {
   
   
        result_data[i] = tensor->data[i];
    }
}

...

之后我们就可以创建其他张量函数来调用这些操作：

//norch/csrc/tensor.cpp

Tensor* add_tensor(Tensor* tensor1, Tensor* tensor2) {
   
   
    if (tensor1->ndim != tensor2->ndim) {
   
   
        fprintf(stderr, "Tensors must have the same number of dimensions %d and %d for addition\n", tensor1->ndim, tensor2->ndim);
        exit(1);
    }

    int ndim = tensor1->ndim;
    int* shape = (int*)malloc(ndim * sizeof(int));
    if (shape == NULL) {
   
   
        fprintf(stderr, "Memory allocation failed\n");
        exit(1);
    }

    for (int i = 0; i < ndim; i++) {
   
   
        if (tensor1->shape[i] != tensor2->shape[i]) {
   
   
            fprintf(stderr, "Tensors must have the same shape %d and %d at index %d for addition\n", tensor1->shape[i], tensor2->shape[i], i);
            exit(1);
        }
        shape[i] = tensor1->shape[i];
    }        
    float* result_data = (float*)malloc(tensor1->size * sizeof(float));
    if (result_data == NULL) {
   
   
        fprintf(stderr, "Memory allocation failed\n");
        exit(1);
    }
    add_tensor_cpu(tensor1, tensor2, result_data);

    return create_tensor(result_data, shape, ndim, device);
}

如前所述，张量重塑不会修改内部数据数组：

//norch/csrc/tensor.cpp

Tensor* reshape_tensor(Tensor* tensor, int* new_shape, int new_ndim) {
   
   

    int ndim = new_ndim;
    int* shape = (int*)malloc(ndim * sizeof(int));
    if (shape == NULL) {
   
   
        fprintf(stderr, "Memory allocation failed\n");
        exit(1);
    }

    for (int i = 0; i < ndim; i++) {
   
   
        shape[i] = new_shape[i];
    }

    // Calculate the total number of elements in the new shape
    int size = 1;
    for (int i = 0; i < new_ndim; i++) {
   
   
        size *= shape[i];
    }

    // Check if the total number of elements matches the current tensor's size
    if (size != tensor->size) {
   
   
        fprintf(stderr, "Cannot reshape tensor. Total number of elements in new shape does not match the current size of the tensor.\n");
        exit(1);
    }

    float* result_data = (float*)malloc(tensor->size * sizeof(float));
    if (result_data == NULL) {
   
   
        fprintf(stderr, "Memory allocation failed\n");
        exit(1);
    }
    assign_tensor_cpu(tensor, result_data);
    return create_tensor(result_data, shape, ndim, device);
}

虽然我们已经能够执行一些张量操作，但不应该有人需要直接使用C/C++来运行这些操作，对吗？现在，让我们着手开发Python接口！在Python中执行C/C++代码有很多选择，例如Pybind11和Cython。在我们的示例中，我选择使用ctypes。下面是ctypes的基本结构：

//C code
#include <stdio.h>

float add_floats(float a, float b) {
   
   
    return a + b;
}

# Compile
gcc -shared -o add_floats.so -fPIC add_floats.c

# Python code
import ctypes

# Load the shared library
lib = ctypes.CDLL('./add_floats.so')

# Define the argument and return types for the function
lib.add_floats.argtypes = [ctypes.c_float, ctypes.c_float]
lib.add_floats.restype = ctypes.c_float

# Convert python float to c_float type 
a = ctypes.c_float(3.5)
b = ctypes.c_float(2.2)

# Call the C function
result = lib.add_floats(a, b)
print(result)
# 5.7

正如您所看到的，它非常直观。编译 C/C++ 代码后，您可以非常轻松地通过 ctypes 在 Python 上使用它。您只需要定义函数的参数和返回 c_types，将变量转换为其各自的 c_types 并调用该函数。对于更复杂的类型，例如数组（浮点列表），您可以使用指针。

data = [1.0, 2.0, 3.0]
data_ctype = (ctypes.c_float * len(data))(*data)

lib.some_array_func.argstypes = [ctypes.POINTER(ctypes.c_float)]

...

lib.some_array_func(data)

对于结构类型，我们可以创建自己的 c_type：

class CustomType(ctypes.Structure):
    _fields_ = [
        ('field1', ctypes.POINTER(ctypes.c_float)),
        ('field2', ctypes.POINTER(ctypes.c_int)),
        ('field3', ctypes.c_int),
    ]

# Can be used as ctypes.POINTER(CustomType)

经过简短的解释后，让我们为张量 C/C++ 库构建 Python 包装器！

# norch/tensor.py

import ctypes

class CTensor(ctypes.Structure):
    _fields_ = [
        ('data', ctypes.POINTER(ctypes.c_float)),
        ('strides', ctypes.POINTER(ctypes.c_int)),
        ('shape', ctypes.POINTER(ctypes.c_int)),
        ('ndim', ctypes.c_int),
        ('size', ctypes.c_int),
    ]

class Tensor:
    os.path.abspath(os.curdir)
    _C = ctypes.CDLL("COMPILED_LIB.so"))

    def __init__(self):

        data, shape = self.flatten(data)
        self.data_ctype = (ctypes.c_float * len(data))(*data)
        self.shape_ctype = (ctypes.c_int * len(shape))(*shape)
        self.ndim_ctype = ctypes.c_int(len(shape))

        self.shape = shape
        self.ndim = len(shape)

        Tensor._C.create_tensor.argtypes = [ctypes.POINTER(ctypes.c_float), ctypes.POINTER(ctypes.c_int), ctypes.c_int]
        Tensor._C.create_tensor.restype = ctypes.POINTER(CTensor)

        self.tensor = Tensor._C.create_tensor(
            self.data_ctype,
            self.shape_ctype,
            self.ndim_ctype,
        )

    def flatten(self, nested_list):
        """
        This method simply convert a list type tensor to a flatten tensor with its shape

        Example:

        Arguments:  
            nested_list: [[1, 2, 3], [-5, 2, 0]]
        Return:
            flat_data: [1, 2, 3, -5, 2, 0]
            shape: [2, 3]
        """
        def flatten_recursively(nested_list):
            flat_data = []
            shape = []
            if isinstance(nested_list, list):
                for sublist in nested_list:
                    inner_data, inner_shape = flatten_recursively(sublist)
                    flat_data.extend(inner_data)
                shape.append(len(nested_list))
                shape.extend(inner_shape)
            else:
                flat_data.append(nested_list)
            return flat_data, shape

        flat_data, shape = flatten_recursively(nested_list)
        return flat_data, shape

现在我们包含 Python 张量操作来调用 C/C++ 操作。

# norch/tensor.py

def __getitem__(self, indices):
    """
    Access tensor by index tensor[i, j, k...]
    """

    if len(indices) != self.ndim:
        raise ValueError("Number of indices must match the number of dimensions")

    Tensor._C.get_item.argtypes = [ctypes.POINTER(CTensor), ctypes.POINTER(ctypes.c_int)]
    Tensor._C.get_item.restype = ctypes.c_float

    indices = (ctypes.c_int * len(indices))(*indices)
    value = Tensor._C.get_item(self.tensor, indices)  

    return value

def reshape(self, new_shape):
    """
    Reshape tensor
    result = tensor.reshape([1,2])
    """
    new_shape_ctype = (ctypes.c_int * len(new_shape))(*new_shape)
    new_ndim_ctype = ctypes.c_int(len(new_shape))

    Tensor._C.reshape_tensor.argtypes = [ctypes.POINTER(CTensor), ctypes.POINTER(ctypes.c_int), ctypes.c_int]
    Tensor._C.reshape_tensor.restype = ctypes.POINTER(CTensor)
    result_tensor_ptr = Tensor._C.reshape_tensor(self.tensor, new_shape_ctype, new_ndim_ctype)   

    result_data = Tensor()
    result_data.tensor = result_tensor_ptr
    result_data.shape = new_shape.copy()
    result_data.ndim = len(new_shape)
    result_data.device = self.device

    return result_data

def __add__(self, other):
    """
    Add tensors
    result = tensor1 + tensor2
    """

    if self.shape != other.shape:
        raise ValueError("Tensors must have the same shape for addition")

    Tensor._C.add_tensor.argtypes = [ctypes.POINTER(CTensor), ctypes.POINTER(CTensor)]
    Tensor._C.add_tensor.restype = ctypes.POINTER(CTensor)

    result_tensor_ptr = Tensor._C.add_tensor(self.tensor, other.tensor)

    result_data = Tensor()
    result_data.tensor = result_tensor_ptr
    result_data.shape = self.shape.copy()
    result_data.ndim = self.ndim
    result_data.device = self.device

    return result_data

# Include the other operations:
# __str__
# __sub__ (-)
# __mul__ (*)
# __matmul__ (@)
# __pow__ (**)
# __truediv__ (/)
# log
# ...

您现在就可以运行代码并开始执行一些张量运算！

import norch

tensor1 = norch.Tensor([[1, 2, 3], [3, 2, 1]])
tensor2 = norch.Tensor([[3, 2, 1], [1, 2, 3]])

result = tensor1 + tensor2
print(result[0, 0])
# 4

从头开始重新创建 PyTorch

简介

Tensor

构建自己的张量库（Norch）

热门文章

最新文章

相关课程

相关电子书

相关实验场景

热门

活动广场

任务中心

开发者评测

高校计划

乘风者计划

训练营

阿里云MVP

话题

直播

下载

镜像站

技术资料

插件

从头开始重新创建 PyTorch

简介

Tensor

构建自己的张量库（Norch）

热门文章

最新文章

相关课程

相关电子书

相关实验场景