简介
多年来,我们一直在使用 PyTorch 来构建和训练深度学习模型。尽管对其语法和规则已经了如指掌,但内心深处总有一些疑问:这些操作背后究竟发生了什么?它们是如何运作的?如果询问你如何在 PyTorch 中创建和训练一个模型,你可能会想到类似以下的代码:
import torch
import torch.nn as nn
import torch.optim as optim
class MyModel(nn.Module):
def __init__(self):
super(MyModel, self).__init__()
self.fc1 = nn.Linear(1, 10)
self.sigmoid = nn.Sigmoid()
self.fc2 = nn.Linear(10, 1)
def forward(self, x):
out = self.fc1(x)
out = self.sigmoid(out)
out = self.fc2(out)
return out
...
model = MyModel().to(device)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)
for epoch in range(epochs):
for x, y in ...
x = x.to(device)
y = y.to(device)
outputs = model(x)
loss = criterion(outputs, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
如果询问你这个回溯步骤是如何运作的,你会如何回答?再比如,当你改变一个张量的形状时,背后发生了什么?数据是否在内部被重新组织?这个过程是如何进行的?PyTorch为何能够运行得如此迅速?它是如何处理GPU运算的?这些问题一直让我充满好奇,我相信它们同样也引起了你的兴趣。因此,为了更深入地掌握这些概念,有什么方法能比自己动手从零开始构建一个张量库更有效呢?这就是你在本文中将要学习的内容!
Tensor
要创建一个张量库,你首先必须掌握的概念当然是:张量是什么?
你可能已经有一个直观的理解,即张量是一个包含数值的多维数据结构的数学概念。但在这儿,我们得从计算的角度来理解如何构建这种数据结构。我们可以将张量想象成由两部分组成:一部分是数据本身,另一部分是描述张量特征的元数据,比如它的维度形状,或者它存储在哪种设备上(比如CPU内存或GPU内存)。
还有一个你可能不太熟悉的元数据概念,称为步长(stride)。理解这个概念对于深入掌握张量数据重排的内部工作原理至关重要,因此我们有必要对其进行更深入的探讨。
设想一个形状为 [4, 8] 的二维张量,如下图所示。
张量的数据(即浮点数)实际上在内存中存储为一维数组:
所以,为了将这个一维数组表示为多维张量,我们利用了步长(strides)的概念。简单来说,思路是这样的:
我们有一个 4 行 8 列的矩阵。假设所有元素都是按行顺序存储在一维数组中,如果我们想访问位于 [2, 3] 的元素值,我们需要跳过 2 行(每行有 8 个元素)再加上 3 个位置。用数学语言描述,我们需要在一维数组中跳过 3 + 2 * 8 个元素来访问它:
这个“8”代表的是第二维度的步长(stride)。简单来说,它告诉我们在数组中需要跳过多少个元素才能“移动”到第二维度上的其他位置。因此,要访问形状为 [shape_0, shape_1] 的二维张量中的元素 [i, j],我们实际上需要定位到 j + i * shape_1 的位置。
接下来,让我们设想一个三维张量的情况:
你可以将这个三维张量视作矩阵的序列。比如,你可以把这个形状为 [5, 4, 8] 的张量看作是 5 个形状为 [4, 8] 的矩阵。现在,要获取位于 [1, 2, 7] 的元素,你需要跳过 1 个完整的 [4,8] 形状的矩阵,2 行 [8] 形状的行,以及 7 列 [1] 形状的列。所以,你需要在一维数组中跳过 (1 4 8) + (2 8) + (7 1) 个位置。
因此,要访问 1-D 数据数组上具有 [shape_0, shape_1, shape_2] 的 3-D 张量的元素 [i][j][k],您可以执行以下操作:
shape_1 * shape_2 是第一维度的步幅,shape_2 是第二维度的步幅,1 是第三维度的步幅。然后,为了概括:
其中每个维度的步幅可以使用下一维度张量形状的乘积来计算:
然后我们设置 stride[n-1] = 1。在形状为 [5, 4, 8] 的张量示例中,步长 = [4*8, 8, 1] = [32, 8, 1] 您可以自行测试:
import torch
torch.rand([5, 4, 8]).stride()
#(32, 8, 1)
好的,但我们为什么要使用形状和步长呢?这个概念不仅可以用于访问存储为一维数组形式的N维张量的元素,还可以非常方便地用来调整张量的布局。
比如,当你想要改变一个张量的形状时,只需指定新的形状,并据此计算出新的步长即可!(这样做是因为新的形状确保了元素的总数保持不变)
import torch
t = torch.rand([5, 4, 8])
print(t.shape)
# [5, 4, 8]
print(t.stride())
# [32, 8, 1]
new_t = t.reshape([4, 5, 2, 2, 2])
print(new_t.shape)
# [4, 5, 2, 2, 2]
print(new_t.stride())
# [40, 8, 4, 2, 1]
在内部,张量仍然以一维数组的形式存储。重塑操作并没有改变数组中元素的排列顺序!这真是令人惊叹,对吧?你可以通过下面的函数来亲自验证这一点,该函数能够访问 PyTorch 内部的一维数组:
import ctypes
def print_internal(t: torch.Tensor):
print(
torch.frombuffer(
ctypes.string_at(t.data_ptr(), t.storage().nbytes()), dtype=t.dtype
)
)
print_internal(t)
# [0.0752, 0.5898, 0.3930, 0.9577, 0.2276, 0.9786, 0.1009, 0.138, ...
print_internal(new_t)
# [0.0752, 0.5898, 0.3930, 0.9577, 0.2276, 0.9786, 0.1009, 0.138, ...
或者例如,您想要转置两个轴。在内部,您只需交换各自的步幅即可!
t = torch.arange(0, 24).reshape(2, 3, 4)
print(t)
# [[[ 0, 1, 2, 3],
# [ 4, 5, 6, 7],
# [ 8, 9, 10, 11]],
# [[12, 13, 14, 15],
# [16, 17, 18, 19],
# [20, 21, 22, 23]]]
print(t.shape)
# [2, 3, 4]
print(t.stride())
# [12, 4, 1]
new_t = t.transpose(0, 1)
print(new_t)
# [[[ 0, 1, 2, 3],
# [12, 13, 14, 15]],
# [[ 4, 5, 6, 7],
# [16, 17, 18, 19]],
# [[ 8, 9, 10, 11],
# [20, 21, 22, 23]]]
print(new_t.shape)
# [3, 2, 4]
print(new_t.stride())
# [4, 12, 1]
如果打印内部数组,两者具有相同的值:
print_internal(t)
# [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
print_internal(new_t)
# [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
但是,新张量 new_t
的步长现在与我之前展示的公式不一致。这是因为张量现在不再连续。也就是说,尽管内部数组本身没有变化,它在内存中的存储顺序与张量的实际排列顺序并不一致。
t.is_contiguous()
# True
new_t.is_contiguous()
# False
这意味着,如果我们要按顺序访问那些不连续的元素,效率会比较低(因为这些元素在内存中的存储并不是连续的)。为了提高效率,我们可以通过以下方式来解决:
new_t_contiguous = new_t.contiguous()
print(new_t_contiguous.is_contiguous())
# True
如果我们分析内部数组,它的顺序现在与实际张量顺序匹配,这可以提供更好的内存访问效率:
print(new_t)
# [[[ 0, 1, 2, 3],
# [12, 13, 14, 15]],
# [[ 4, 5, 6, 7],
# [16, 17, 18, 19]],
# [[ 8, 9, 10, 11],
# [20, 21, 22, 23]]]
print_internal(new_t)
# [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
print_internal(new_t_contiguous)
# [ 0, 1, 2, 3, 12, 13, 14, 15, 4, 5, 6, 7, 16, 17, 18, 19, 8, 9, 10, 11, 20, 21, 22, 23]
构建自己的张量库(Norch)
现在我们已经明白了张量是如何构建的,接下来就让我们着手开发我们的库。我打算给它命名为Norch,这个名字既表示“非PyTorch”,也隐含了我自己的姓氏Nogueira 😁首先需要了解的是,尽管PyTorch是通过Python接口使用的,但其核心实际上是用C/C++编写的。因此,我们首先需要开发我们自己的C/C++底层功能。我们可以首先定义一个结构体来存储张量的数据和元数据,并编写一个函数来创建这个结构体的实例。
//norch/csrc/tensor.cpp
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
typedef struct {
float* data;
int* strides;
int* shape;
int ndim;
int size;
char* device;
} Tensor;
Tensor* create_tensor(float* data, int* shape, int ndim) {
Tensor* tensor = (Tensor*)malloc(sizeof(Tensor));
if (tensor == NULL) {
fprintf(stderr, "Memory allocation failed\n");
exit(1);
}
tensor->data = data;
tensor->shape = shape;
tensor->ndim = ndim;
tensor->size = 1;
for (int i = 0; i < ndim; i++) {
tensor->size *= shape[i];
}
tensor->strides = (int*)malloc(ndim * sizeof(int));
if (tensor->strides == NULL) {
fprintf(stderr, "Memory allocation failed\n");
exit(1);
}
int stride = 1;
for (int i = ndim - 1; i >= 0; i--) {
tensor->strides[i] = stride;
stride *= shape[i];
}
return tensor;
}
为了访问某些元素,我们可以利用步幅,正如我们之前学到的:
//norch/csrc/tensor.cpp
float get_item(Tensor* tensor, int* indices) {
int index = 0;
for (int i = 0; i < tensor->ndim; i++) {
index += indices[i] * tensor->strides[i];
}
float result;
result = tensor->data[index];
return result;
}
现在,我们可以创建张量运算。我将展示一些示例,您可以在本文末尾链接的存储库中找到完整版本。
//norch/csrc/cpu.cpp
void add_tensor_cpu(Tensor* tensor1, Tensor* tensor2, float* result_data) {
for (int i = 0; i < tensor1->size; i++) {
result_data[i] = tensor1->data[i] + tensor2->data[i];
}
}
void sub_tensor_cpu(Tensor* tensor1, Tensor* tensor2, float* result_data) {
for (int i = 0; i < tensor1->size; i++) {
result_data[i] = tensor1->data[i] - tensor2->data[i];
}
}
void elementwise_mul_tensor_cpu(Tensor* tensor1, Tensor* tensor2, float* result_data) {
for (int i = 0; i < tensor1->size; i++) {
result_data[i] = tensor1->data[i] * tensor2->data[i];
}
}
void assign_tensor_cpu(Tensor* tensor, float* result_data) {
for (int i = 0; i < tensor->size; i++) {
result_data[i] = tensor->data[i];
}
}
...
之后我们就可以创建其他张量函数来调用这些操作:
//norch/csrc/tensor.cpp
Tensor* add_tensor(Tensor* tensor1, Tensor* tensor2) {
if (tensor1->ndim != tensor2->ndim) {
fprintf(stderr, "Tensors must have the same number of dimensions %d and %d for addition\n", tensor1->ndim, tensor2->ndim);
exit(1);
}
int ndim = tensor1->ndim;
int* shape = (int*)malloc(ndim * sizeof(int));
if (shape == NULL) {
fprintf(stderr, "Memory allocation failed\n");
exit(1);
}
for (int i = 0; i < ndim; i++) {
if (tensor1->shape[i] != tensor2->shape[i]) {
fprintf(stderr, "Tensors must have the same shape %d and %d at index %d for addition\n", tensor1->shape[i], tensor2->shape[i], i);
exit(1);
}
shape[i] = tensor1->shape[i];
}
float* result_data = (float*)malloc(tensor1->size * sizeof(float));
if (result_data == NULL) {
fprintf(stderr, "Memory allocation failed\n");
exit(1);
}
add_tensor_cpu(tensor1, tensor2, result_data);
return create_tensor(result_data, shape, ndim, device);
}
如前所述,张量重塑不会修改内部数据数组:
//norch/csrc/tensor.cpp
Tensor* reshape_tensor(Tensor* tensor, int* new_shape, int new_ndim) {
int ndim = new_ndim;
int* shape = (int*)malloc(ndim * sizeof(int));
if (shape == NULL) {
fprintf(stderr, "Memory allocation failed\n");
exit(1);
}
for (int i = 0; i < ndim; i++) {
shape[i] = new_shape[i];
}
// Calculate the total number of elements in the new shape
int size = 1;
for (int i = 0; i < new_ndim; i++) {
size *= shape[i];
}
// Check if the total number of elements matches the current tensor's size
if (size != tensor->size) {
fprintf(stderr, "Cannot reshape tensor. Total number of elements in new shape does not match the current size of the tensor.\n");
exit(1);
}
float* result_data = (float*)malloc(tensor->size * sizeof(float));
if (result_data == NULL) {
fprintf(stderr, "Memory allocation failed\n");
exit(1);
}
assign_tensor_cpu(tensor, result_data);
return create_tensor(result_data, shape, ndim, device);
}
虽然我们已经能够执行一些张量操作,但不应该有人需要直接使用C/C++来运行这些操作,对吗?现在,让我们着手开发Python接口!在Python中执行C/C++代码有很多选择,例如Pybind11和Cython。在我们的示例中,我选择使用ctypes。下面是ctypes的基本结构:
//C code
#include <stdio.h>
float add_floats(float a, float b) {
return a + b;
}
# Compile
gcc -shared -o add_floats.so -fPIC add_floats.c
# Python code
import ctypes
# Load the shared library
lib = ctypes.CDLL('./add_floats.so')
# Define the argument and return types for the function
lib.add_floats.argtypes = [ctypes.c_float, ctypes.c_float]
lib.add_floats.restype = ctypes.c_float
# Convert python float to c_float type
a = ctypes.c_float(3.5)
b = ctypes.c_float(2.2)
# Call the C function
result = lib.add_floats(a, b)
print(result)
# 5.7
正如您所看到的,它非常直观。编译 C/C++ 代码后,您可以非常轻松地通过 ctypes 在 Python 上使用它。您只需要定义函数的参数和返回 c_types,将变量转换为其各自的 c_types 并调用该函数。对于更复杂的类型,例如数组(浮点列表),您可以使用指针。
data = [1.0, 2.0, 3.0]
data_ctype = (ctypes.c_float * len(data))(*data)
lib.some_array_func.argstypes = [ctypes.POINTER(ctypes.c_float)]
...
lib.some_array_func(data)
对于结构类型,我们可以创建自己的 c_type:
class CustomType(ctypes.Structure):
_fields_ = [
('field1', ctypes.POINTER(ctypes.c_float)),
('field2', ctypes.POINTER(ctypes.c_int)),
('field3', ctypes.c_int),
]
# Can be used as ctypes.POINTER(CustomType)
经过简短的解释后,让我们为张量 C/C++ 库构建 Python 包装器!
# norch/tensor.py
import ctypes
class CTensor(ctypes.Structure):
_fields_ = [
('data', ctypes.POINTER(ctypes.c_float)),
('strides', ctypes.POINTER(ctypes.c_int)),
('shape', ctypes.POINTER(ctypes.c_int)),
('ndim', ctypes.c_int),
('size', ctypes.c_int),
]
class Tensor:
os.path.abspath(os.curdir)
_C = ctypes.CDLL("COMPILED_LIB.so"))
def __init__(self):
data, shape = self.flatten(data)
self.data_ctype = (ctypes.c_float * len(data))(*data)
self.shape_ctype = (ctypes.c_int * len(shape))(*shape)
self.ndim_ctype = ctypes.c_int(len(shape))
self.shape = shape
self.ndim = len(shape)
Tensor._C.create_tensor.argtypes = [ctypes.POINTER(ctypes.c_float), ctypes.POINTER(ctypes.c_int), ctypes.c_int]
Tensor._C.create_tensor.restype = ctypes.POINTER(CTensor)
self.tensor = Tensor._C.create_tensor(
self.data_ctype,
self.shape_ctype,
self.ndim_ctype,
)
def flatten(self, nested_list):
"""
This method simply convert a list type tensor to a flatten tensor with its shape
Example:
Arguments:
nested_list: [[1, 2, 3], [-5, 2, 0]]
Return:
flat_data: [1, 2, 3, -5, 2, 0]
shape: [2, 3]
"""
def flatten_recursively(nested_list):
flat_data = []
shape = []
if isinstance(nested_list, list):
for sublist in nested_list:
inner_data, inner_shape = flatten_recursively(sublist)
flat_data.extend(inner_data)
shape.append(len(nested_list))
shape.extend(inner_shape)
else:
flat_data.append(nested_list)
return flat_data, shape
flat_data, shape = flatten_recursively(nested_list)
return flat_data, shape
现在我们包含 Python 张量操作来调用 C/C++ 操作。
# norch/tensor.py
def __getitem__(self, indices):
"""
Access tensor by index tensor[i, j, k...]
"""
if len(indices) != self.ndim:
raise ValueError("Number of indices must match the number of dimensions")
Tensor._C.get_item.argtypes = [ctypes.POINTER(CTensor), ctypes.POINTER(ctypes.c_int)]
Tensor._C.get_item.restype = ctypes.c_float
indices = (ctypes.c_int * len(indices))(*indices)
value = Tensor._C.get_item(self.tensor, indices)
return value
def reshape(self, new_shape):
"""
Reshape tensor
result = tensor.reshape([1,2])
"""
new_shape_ctype = (ctypes.c_int * len(new_shape))(*new_shape)
new_ndim_ctype = ctypes.c_int(len(new_shape))
Tensor._C.reshape_tensor.argtypes = [ctypes.POINTER(CTensor), ctypes.POINTER(ctypes.c_int), ctypes.c_int]
Tensor._C.reshape_tensor.restype = ctypes.POINTER(CTensor)
result_tensor_ptr = Tensor._C.reshape_tensor(self.tensor, new_shape_ctype, new_ndim_ctype)
result_data = Tensor()
result_data.tensor = result_tensor_ptr
result_data.shape = new_shape.copy()
result_data.ndim = len(new_shape)
result_data.device = self.device
return result_data
def __add__(self, other):
"""
Add tensors
result = tensor1 + tensor2
"""
if self.shape != other.shape:
raise ValueError("Tensors must have the same shape for addition")
Tensor._C.add_tensor.argtypes = [ctypes.POINTER(CTensor), ctypes.POINTER(CTensor)]
Tensor._C.add_tensor.restype = ctypes.POINTER(CTensor)
result_tensor_ptr = Tensor._C.add_tensor(self.tensor, other.tensor)
result_data = Tensor()
result_data.tensor = result_tensor_ptr
result_data.shape = self.shape.copy()
result_data.ndim = self.ndim
result_data.device = self.device
return result_data
# Include the other operations:
# __str__
# __sub__ (-)
# __mul__ (*)
# __matmul__ (@)
# __pow__ (**)
# __truediv__ (/)
# log
# ...
您现在就可以运行代码并开始执行一些张量运算!
import norch
tensor1 = norch.Tensor([[1, 2, 3], [3, 2, 1]])
tensor2 = norch.Tensor([[3, 2, 1], [1, 2, 3]])
result = tensor1 + tensor2
print(result[0, 0])
# 4