How to Migrate from TensorFlow to PyTorch - Alibaba Cloud Developer Community


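When I first started studying PyTorch, I gave it up after a few days. Compared with TensorFlow, I struggled to grasp the core concepts of the framework. So I put it on my "knowledge shelf" and forgot about it. But not long ago a new version of PyTorch was released, so I decided to give it another chance. After a while I realized that the framework is really easy to use and that writing code in PyTorch is a pleasure. In this article I will walk you through the core concepts of PyTorch so that you can try it now rather than in a few years. It covers some fundamentals as well as more advanced topics such as learning rate schedulers and custom layers.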

Learning resources

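First of all, get to know PyTorch through its documentation and tutorials. Because releases come quickly, the docs and tutorials sometimes lag behind the actual software, so feel free to read the source code, which is often clearer and more direct. There is also an excellent PyTorch forum where you can ask any question and get an answer quickly. It seems to be more popular than the PyTorch tag on StackOverflow.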

Using PyTorch as NumPy

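Let's start with PyTorch itself. Its main building block is the tensor, which is very similar to a NumPy array. The two share a large part of their API, so you can sometimes use PyTorch as a drop-in replacement for NumPy. You may ask why you would. The main reason is that PyTorch can use the GPU, so data preprocessing or any other compute-heavy work can run on the same hardware as your machine learning code. Converting tensors between NumPy and PyTorch is easy in both directions. Consider the following code example: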

import torch
import numpy as np

numpy_tensor = np.random.randn(10, 20)

# convert a NumPy array to a PyTorch tensor (copies the data, casts to float32)
pytorch_tensor = torch.Tensor(numpy_tensor)
# or another way (shares memory with the array and keeps its dtype)
pytorch_tensor = torch.from_numpy(numpy_tensor)

# convert torch tensor to numpy representation
pytorch_tensor.numpy()

# if we want to use tensor on GPU provide another type
dtype = torch.cuda.FloatTensor
gpu_tensor = torch.randn(10, 20).type(dtype)
# or just call `cuda()` method
gpu_tensor = pytorch_tensor.cuda()
# call back to the CPU
cpu_tensor = gpu_tensor.cpu()

# define pytorch tensors
x = torch.randn(10, 20)
y = torch.ones(20, 5)
# `@` is the matrix multiplication operator (Python 3.5+, PEP 465)
res = x @ y

# get the shape
res.shape  # torch.Size([10, 5])
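
As a small aside on the two conversion routes above, the following sketch (variable names are illustrative and not from the original article) suggests how they differ: `torch.from_numpy` shares memory with the source array, whereas `torch.Tensor(...)` makes a float32 copy.

```python
import numpy as np
import torch

arr = np.zeros((2, 3))             # float64 NumPy array

shared = torch.from_numpy(arr)     # shares memory, keeps float64
copied = torch.Tensor(arr)         # copies the data and casts to float32

arr[0, 0] = 7.0                    # mutate the NumPy array in place

print(shared[0, 0].item())         # 7.0, the tensor sees the change
print(copied[0, 0].item())         # 0.0, the copy is unaffected
print(shared.dtype, copied.dtype)  # torch.float64 torch.float32
```

Sharing memory avoids a copy, but it also means that in-place edits on either side are visible on the other.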

From tensors to variables

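Tensors are a great part of PyTorch, but our main goal is to build neural networks. What about backpropagation? We could implement it by hand, but why? Fortunately, automatic differentiation exists. To support it, PyTorch provides variables. A Variable is a wrapper around a tensor. With variables we can build a computational graph and compute gradients automatically. Every Variable instance has two attributes: .data, which holds the underlying tensor, and .grad, which holds the gradient for that tensor. (Since PyTorch 0.4, Variable has been merged into Tensor; the wrapper still works but is no longer required.)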

import torch
from torch.autograd import Variable

# define the inputs
x_tensor = torch.randn(10, 20)
y_tensor = torch.randn(10, 5)
x = Variable(x_tensor, requires_grad=False)
y = Variable(y_tensor, requires_grad=False)
# define some weights
w = Variable(torch.randn(20, 5), requires_grad=True)

# get variable tensor
print(type(w.data))  # torch.FloatTensor
# get variable gradient
print(w.grad)  # None

loss = torch.mean((y - x @ w) ** 2)

# calculate the gradients
loss.backward()
print(w.grad)  # some gradients
# manually apply gradients
w.data -= 0.01 * w.grad.data
# manually zero gradients after update
w.grad.data.zero_()

You may have noticed that we computed and applied the gradients by hand. That is tedious. Can we use an optimizer instead? Of course!

import torch
from torch.autograd import Variable
import torch.nn.functional as F


x = Variable(torch.randn(10, 20), requires_grad=False)
y = Variable(torch.randn(10, 3), requires_grad=False)
# define some weights
w1 = Variable(torch.randn(20, 5), requires_grad=True)
w2 = Variable(torch.randn(5, 3), requires_grad=True)

learning_rate = 0.1
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD([w1, w2], lr=learning_rate)
for step in range(5):
    pred = F.sigmoid(x @ w1)
    pred = F.sigmoid(pred @ w2)
    loss = loss_fn(pred, y)

    # manually zero all previous gradients
    optimizer.zero_grad()
    # calculate new gradients
    loss.backward()
    # apply new gradients
    optimizer.step()

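Now all of the variables are updated automatically. But note one thing about the last snippet: we have to zero the gradients manually before computing new ones. This is one of PyTorch's core concepts. It may not be obvious why it works this way, but it gives us full control over when and how gradients are applied.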
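
To make the need for zeroing concrete, here is a minimal sketch, assuming PyTorch 0.4 or later (where a tensor can carry `requires_grad` directly). It shows that repeated `backward()` calls add to `.grad` rather than overwrite it; the toy function `2 * w` is purely illustrative.

```python
import torch

w = torch.ones(3, requires_grad=True)

# first backward pass: d(sum(2 * w)) / dw = 2 for every element
(2 * w).sum().backward()
first = w.grad.clone()

# second backward pass without zeroing: new gradients are added to the old ones
(2 * w).sum().backward()
accumulated = w.grad.clone()

# zero the gradient in place, then run backward again
w.grad.zero_()
(2 * w).sum().backward()
after_zero = w.grad.clone()

print(first.tolist())        # [2.0, 2.0, 2.0]
print(accumulated.tolist())  # [4.0, 4.0, 4.0]
print(after_zero.tolist())   # [2.0, 2.0, 2.0]
```

This accumulation is also what makes it possible to sum gradients over several small batches before a single optimizer step.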

Static vs. dynamic computation graphs

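Another major difference between PyTorch and TensorFlow is how they represent the graph. TensorFlow uses a static graph: we define it once and then execute it over and over again. In PyTorch, every forward pass defines a new computational graph. At first the difference may not seem large, but dynamic graphs become very handy when you want to debug your code or define conditional statements. You can simply use your favorite debugger! Compare two definitions of a while loop, the first in TensorFlow and the second in PyTorch: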

# TensorFlow version (1.x graph mode)
import tensorflow as tf


first_counter = tf.constant(0)
second_counter = tf.constant(10)
some_value = tf.Variable(15)


# condition should handle all args:
def cond(first_counter, second_counter, *args):
    return first_counter < second_counter


def body(first_counter, second_counter, some_value):
    first_counter = tf.add(first_counter, 2)
    second_counter = tf.add(second_counter, 1)
    return first_counter, second_counter, some_value


c1, c2, val = tf.while_loop(
    cond, body, [first_counter, second_counter, some_value])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    counter_1_res, counter_2_res = sess.run([c1, c2])

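# PyTorch version
import torch

first_counter = torch.Tensor([0])
second_counter = torch.Tensor([10])
# note the list: torch.Tensor(15) would create an uninitialized tensor of size 15
some_value = torch.Tensor([15])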

while (first_counter < second_counter)[0]:
    first_counter += 2
    second_counter += 1

In my opinion the second solution is much simpler than the first. What do you think?
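
The same idea extends to code that needs gradients. The sketch below, assuming PyTorch 0.4 or later, uses a hypothetical helper `repeated_double` whose loop count depends on the tensor values; autograd simply records whichever operations actually ran.

```python
import torch


def repeated_double(x, limit):
    """Double x until its sum reaches the limit; return the result and step count."""
    steps = 0
    # an ordinary Python loop whose condition depends on tensor values
    while x.sum().item() < limit:
        x = x * 2
        steps += 1
    return x, steps


x = torch.tensor([1.0, 2.0], requires_grad=True)
out, steps = repeated_double(x, limit=20.0)
out.sum().backward()

print(steps)            # 3
print(x.grad.tolist())  # [8.0, 8.0]
```

Three doublings ran, so the output is `8 * x` and the gradient of its sum with respect to each input element is 8.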

Model definition

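So writing if/else/while logic in PyTorch is easy. Now let's go back to how common models are defined. The framework provides out-of-the-box layer constructors very similar to those in Keras: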

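The nn package defines a set of modules, which are roughly equivalent to neural network layers. A module receives input variables and computes output variables, but may also hold internal state such as variables containing learnable parameters. The nn package also defines a set of loss functions that are commonly used when training neural networks.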

from collections import OrderedDict

import torch.nn as nn


# Example of using Sequential
model = nn.Sequential(
    nn.Conv2d(1, 20, 5),
    nn.ReLU(),
    nn.Conv2d(20, 64, 5),
    nn.ReLU()
)

# Example of using Sequential with OrderedDict
model = nn.Sequential(OrderedDict([
    ('conv1', nn.Conv2d(1, 20, 5)),
    ('relu1', nn.ReLU()),
    ('conv2', nn.Conv2d(20, 64, 5)),
    ('relu2', nn.ReLU())
]))

# `some_input` is a batch of single-channel images: [batch, 1, height, width]
output = model(some_input)

If you want to build a more complex model, you can subclass the provided nn.Module class. The two approaches can of course be combined.

from torch import nn


class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(3, 12, kernel_size=3, padding=1, stride=1),
            nn.Conv2d(12, 24, kernel_size=3, padding=1, stride=1),
        )
        # input and output channels match so the layer can be applied repeatedly
        self.second_extractor = nn.Conv2d(
            24, 24, kernel_size=3, padding=1, stride=1)

    def forward(self, x):
        x = self.feature_extractor(x)
        x = self.second_extractor(x)
        # note that we may call the same layer twice or more
        x = self.second_extractor(x)
        return x

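In the __init__ method we define all the layers we will use later. In the forward method we specify the steps in which the defined layers are applied. The backward pass is, as usual, computed automatically.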
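
As a quick check of this pattern, here is a small sketch with a hypothetical `TinyNet` class (layer sizes chosen only for illustration). It runs a forward pass on random data and counts the parameters that `nn.Module` registered from `__init__`.

```python
import torch
from torch import nn


class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # same number of input and output channels, so the layer can be reused
        self.refine = nn.Conv2d(8, 8, kernel_size=3, padding=1)

    def forward(self, x):
        x = self.features(x)
        x = self.refine(x)
        x = self.refine(x)  # the second call shares the same weights
        return x


model = TinyNet()
batch = torch.randn(2, 3, 16, 16)
out = model(batch)
print(out.shape)  # torch.Size([2, 8, 16, 16])

# layers assigned as attributes in __init__ are discovered automatically
n_params = sum(p.numel() for p in model.parameters())
print(n_params)   # 808
```

A layer that is applied twice in `forward` must accept its own output, which is why `refine` keeps the channel count unchanged.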

Custom layers

But what if we want to define a custom model with non-standard backpropagation? One example is XNOR networks.

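I won't go into the details here; to learn more about this type of network, read the original paper. All that matters for our problem is that gradients should be propagated only for weights that are less than 1 and greater than -1. In PyTorch this is easy to implement: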

import torch


class MyFunction(torch.autograd.Function):

    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        output = torch.sign(input)
        return output

    @staticmethod
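    def backward(ctx, grad_output):
        # saved_tensors is a tuple of tensors, so unpack the first element
        input, = ctx.saved_tensors
        # copy first so the incoming gradient is not modified in place
        grad_input = grad_output.clone()
        grad_input[input.ge(1)] = 0
        grad_input[input.le(-1)] = 0
        return grad_input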


# usage
x = torch.randn(10, 20)
y = MyFunction.apply(x)
# or
my_func = MyFunction.apply
y = my_func(x)


# and if we want to use inside nn.Module
class MyFunctionModule(torch.nn.Module):
    def forward(self, x):
        return MyFunction.apply(x)

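As you can see, we only need to define two methods: one for the forward pass and one for the backward pass. If we need to access some variables from the forward pass, we can store them in the ctx object. Note: in the older API the forward and backward methods were not static; variables were stored with self.save_for_backward(input) and accessed with input, _ = self.saved_tensors.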
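
To see such a function at work, here is a self-contained sketch, assuming PyTorch 0.4 or later. The class name `ClippedSign` and the mask-based backward are illustrative choices, not the article's exact code: the forward pass takes the sign, and the backward pass lets the gradient through only where the input lies strictly between -1 and 1.

```python
import torch


class ClippedSign(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return torch.sign(input)

    @staticmethod
    def backward(ctx, grad_output):
        (input,) = ctx.saved_tensors
        # pass the gradient through only where -1 < input < 1
        mask = (input.abs() < 1).to(grad_output.dtype)
        return grad_output * mask


x = torch.tensor([-2.0, -0.5, 0.5, 3.0], requires_grad=True)
y = ClippedSign.apply(x)
y.sum().backward()

print(y.tolist())       # [-1.0, -1.0, 1.0, 1.0]
print(x.grad.tolist())  # [0.0, 1.0, 1.0, 0.0]
```

Multiplying by a mask returns a new tensor, so the incoming `grad_output` is never modified in place.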

Training a model with CUDA

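We discussed earlier how to move a single tensor to CUDA. To move a whole model, call the model's .cuda() method and then call .cuda() on every input variable. After all the computation is done, we can get the results back with the .cpu() method.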

import torch
from torch.autograd import Variable

### tensor example
x_cpu = torch.randn(10, 20)
w_cpu = torch.randn(20, 10)
# direct transfer to the GPU
x_gpu = x_cpu.cuda()
w_gpu = w_cpu.cuda()
result_gpu = x_gpu @ w_gpu
# get back from GPU to CPU
result_cpu = result_gpu.cpu()

### model example (assumes `model` is an nn.Module and `inputs` is a CPU tensor)
model = model.cuda()
# train step
inputs = Variable(inputs.cuda())
outputs = model(inputs)
# get back from GPU to CPU
outputs = outputs.cpu()

PyTorch also supports allocating devices directly in the source code:

import torch

# check whether CUDA is available
torch.cuda.is_available()

# set required device
torch.cuda.set_device(0)

# work with some required cuda device
with torch.cuda.device(1):
    # allocates a tensor on GPU 1
    a = torch.cuda.FloatTensor(1)
    assert a.get_device() == 1

    # but you still can manually assign tensor to required device
    d = torch.randn(2).cuda(2)
    assert d.get_device() == 2

Because we sometimes want to run the same model on CPU and GPU without modifying the code, I came up with this kind of wrapper:

class Trainer:
    def __init__(self, model, use_cuda=False, gpu_idx=0):
        self.use_cuda = use_cuda
        self.gpu_idx = gpu_idx
        self.model = self.to_gpu(model)

    def to_gpu(self, tensor):
        if self.use_cuda:
            return tensor.cuda(self.gpu_idx)
        else:
            return tensor

    def from_gpu(self, tensor):
        if self.use_cuda:
            return tensor.cpu()
        else:
            return tensor

    def train(self, inputs):
        inputs = self.to_gpu(inputs)
        outputs = self.model(inputs)
        outputs = self.from_gpu(outputs)
        # hand the CPU copy of the outputs back to the caller
        return outputs
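
In PyTorch 0.4 and later the same goal can be reached without a wrapper class. The sketch below, which is an alternative to the wrapper above and not the article's own method, uses `torch.device` and `.to(device)`; the layer sizes are arbitrary.

```python
import torch
from torch import nn

# pick the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(20, 5).to(device)
inputs = torch.randn(10, 20).to(device)

outputs = model(inputs)
# .cpu() is a no-op when the tensor already lives on the CPU
result = outputs.detach().cpu()

print(result.shape)        # torch.Size([10, 5])
print(result.device.type)  # cpu
```

Because the device is chosen once at the top, the rest of the script stays identical on machines with and without a GPU.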

Weight initialization

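In TensorFlow, weight initialization mostly happens when the tensor is declared. PyTorch takes a different approach: you declare the tensor first and then change its weights. Weights can be initialized either by accessing the tensor's attributes directly or by calling one of the functions in the torch.nn.init package. This design may not seem straightforward, but it becomes useful when you want to initialize all layers of a certain type in the same way.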

import math

import torch
import torch.nn as nn
from torch.autograd import Variable

# new way with the `init` module (in-place initializers end with an underscore)
w = torch.Tensor(3, 5)
torch.nn.init.normal_(w)
# works for Variables as well
w2 = Variable(w)
torch.nn.init.normal_(w2)
# old-style direct access to the tensor's data attribute
w2.data.normal_()

# example for some module
def weights_init(m):
    classname = m.__class__.__name__
    if classname.find('Conv') != -1:
        m.weight.data.normal_(0.0, 0.02)
    elif classname.find('BatchNorm') != -1:
        m.weight.data.normal_(1.0, 0.02)
        m.bias.data.fill_(0)

# for-loop approach with direct access
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # ... define the layers here, then initialize them in the loop below
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()
            elif isinstance(m, nn.Linear):
                m.bias.data.zero_()
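
A per-module function like this is typically applied with `Module.apply`, which calls it on every submodule. Here is a minimal sketch; the function name `init_weights` and the tiny model are made up for illustration, and the in-place initializers from `torch.nn.init` are used instead of direct `.data` access.

```python
import torch
from torch import nn


def init_weights(m):
    # called once for every submodule by `model.apply`
    if isinstance(m, nn.Conv2d):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)
        nn.init.zeros_(m.bias)
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.ones_(m.weight)
        nn.init.zeros_(m.bias)


model = nn.Sequential(
    nn.Conv2d(1, 4, kernel_size=3),
    nn.BatchNorm2d(4),
    nn.ReLU(),
)
model.apply(init_weights)

print(model[0].bias.tolist())    # [0.0, 0.0, 0.0, 0.0]
print(model[1].weight.tolist())  # [1.0, 1.0, 1.0, 1.0]
```

Checking the layer type with `isinstance` is a little more robust than matching on the class name, since it also covers subclasses.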

Excluding subgraphs from backward

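Sometimes, when you want to retrain only some layers of a model or prepare it for production, it is useful to disable autograd for certain layers. For this purpose PyTorch provides two flags: requires_grad and volatile. The first disables gradients for the current layer, while child nodes can still compute them. The second disables autograd for the current layer and for all child nodes.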

import torch
from torch.autograd import Variable

# requires grad
# If there’s a single input to an operation that requires gradient,
# its output will also require gradient.
x = Variable(torch.randn(5, 5))
y = Variable(torch.randn(5, 5))
z = Variable(torch.randn(5, 5), requires_grad=True)
a = x + y
a.requires_grad  # False
b = a + z
b.requires_grad  # True

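# Volatile differs from requires_grad in how the flag propagates.
# If there’s even a single volatile input to an operation,
# its output is also going to be volatile.
# (PyTorch 0.3 and earlier only: `volatile` was removed in 0.4,
# where `with torch.no_grad():` takes its place.)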
x = Variable(torch.randn(5, 5), requires_grad=True)
y = Variable(torch.randn(5, 5), volatile=True)
a = x + y
a.requires_grad  # False
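
For newer releases, a rough equivalent of both flags might look like the sketch below. It assumes PyTorch 0.4 or later, where `volatile` no longer exists and `torch.no_grad()` plays its role; the small model is only an illustration.

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# freeze the first layer: its parameters will receive no gradients
for p in model[0].parameters():
    p.requires_grad = False

out = model(torch.randn(3, 4))
out.sum().backward()
print(model[0].weight.grad)              # None
print(model[2].weight.grad is not None)  # True

# inference without building a graph (the job `volatile` used to do)
with torch.no_grad():
    pred = model(torch.randn(3, 4))
print(pred.requires_grad)                # False
```

Freezing by parameter is handy for fine-tuning, while `torch.no_grad()` suits pure inference because no graph is recorded at all.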

The training process

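PyTorch has a few other nice features. For example, you can use a learning rate scheduler that adjusts the learning rate according to some rule. You can enable or disable batch normalization layers and dropout with a simple train flag. And it is easy to set the random seed separately for the CPU and the GPU.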

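# scheduler example (assumes `model`, `train` and `validate` are defined elsewhere)
import torch
from torch.optim import lr_scheduler

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(100):
    train()
    validate()
    # since PyTorch 1.1, step the scheduler after the epoch's optimizer updates
    scheduler.step()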

# Train flag can be updated with boolean
# to disable dropout and batch norm learning
model.train(True)
# execute train step
model.train(False)
# run inference step

# CPU seed
torch.manual_seed(42)
# GPU seed
torch.cuda.manual_seed_all(42)
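
To see what a step schedule actually does to the learning rate, here is a short sketch with a dummy parameter in place of a real model. The values `step_size=2` and `gamma=0.1` are chosen only to keep the output short.

```python
import torch
from torch.optim import lr_scheduler

param = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.SGD([param], lr=0.1)
scheduler = lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.1)

lrs = []
for epoch in range(6):
    # record the learning rate used during this epoch
    lrs.append(optimizer.param_groups[0]["lr"])
    # a real loop would compute a loss and call backward() here
    optimizer.step()
    scheduler.step()

# roughly 0.1 for two epochs, then 0.01 for two, then 0.001 for two
print(lrs)
```

The learning rate is multiplied by `gamma` every `step_size` epochs, so printed values may show small floating-point rounding.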

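You can also print information about your model, or save and load it with a few lines of code. If the model was initialized with an OrderedDict or is class-based, its string representation will include the layer names.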

from collections import OrderedDict

import torch
import torch.nn as nn

model = nn.Sequential(OrderedDict([
    ('conv1', nn.Conv2d(1, 20, 5)),
    ('relu1', nn.ReLU()),
    ('conv2', nn.Conv2d(20, 64, 5)),
    ('relu2', nn.ReLU())
]))

print(model)

# Sequential (
#   (conv1): Conv2d(1, 20, kernel_size=(5, 5), stride=(1, 1))
#   (relu1): ReLU ()
#   (conv2): Conv2d(20, 64, kernel_size=(5, 5), stride=(1, 1))
#   (relu2): ReLU ()
# )

# save/load only the model parameters (preferred solution);
# `save_path` is a file path of your choice
torch.save(model.state_dict(), save_path)
model.load_state_dict(torch.load(save_path))

# save whole model
torch.save(model, save_path)
model = torch.load(save_path)

According to the PyTorch documentation, saving the model with the state_dict() method is the preferred approach.
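
A round trip through `state_dict()` can be sketched as follows. An in-memory `io.BytesIO` buffer stands in for a real file path so the example leaves nothing on disk; that substitution is an assumption made for the sketch.

```python
import io

import torch
from torch import nn

model = nn.Linear(3, 2)

# serialize only the parameters into an in-memory buffer
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# rebuild the architecture in code, then load the saved parameters into it
restored = nn.Linear(3, 2)
restored.load_state_dict(torch.load(buffer))

print(torch.equal(model.weight, restored.weight))  # True
print(torch.equal(model.bias, restored.bias))      # True
```

Only tensors are stored this way, so the class definition has to be available in code when loading, which keeps the saved file independent of how the project is laid out.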

Logging

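Logging the training process is an important part of the workflow. Unfortunately, at the time of writing PyTorch had no tool like TensorBoard. So you can either write plain-text logs with Python's logging module or try one of the third-party libraries.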
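
For the plain-text route, a minimal sketch with the standard `logging` module could look like this. The logger name, the message format and the helper `log_epoch` are all hypothetical choices.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger("training")  # hypothetical logger name


def log_epoch(epoch, loss, accuracy):
    """Format one line of training metrics, log it, and return the message."""
    message = f"epoch={epoch} loss={loss:.4f} acc={accuracy:.2%}"
    logger.info(message)
    return message


line = log_epoch(1, 0.73512, 0.8125)
print(line)  # epoch=1 loss=0.7351 acc=81.25%
```

Keeping a fixed `key=value` layout makes the log easy to parse later if you want to plot the metrics.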

Data handling

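You may remember the data loaders offered by TensorFlow, or may even have tried to implement some of them yourself. For me, it took more than four hours to get an idea of how the whole pipeline works.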

Initially I wanted to add some code here, but I think an animation is enough to explain the basic idea.

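The PyTorch developers decided not to reinvent the wheel: they simply use multiprocessing. To create a custom data loader, inherit your class from torch.utils.data.Dataset and override a few methods: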

import pandas as pd
import torch
import torchvision as tv


class ImagesDataset(torch.utils.data.Dataset):
    def __init__(self, df, transform=None,
                 loader=tv.datasets.folder.default_loader):
        self.df = df
        self.transform = transform
        self.loader = loader

    def __getitem__(self, index):
        row = self.df.iloc[index]

        target = row['class_']
        path = row['path']
        img = self.loader(path)
        if self.transform is not None:
            img = self.transform(img)

        return img, target

    def __len__(self):
        n, _ = self.df.shape
        return n

# what transformations should be done with our images
data_transforms = tv.transforms.Compose([
    tv.transforms.RandomCrop((64, 64), padding=4),
    tv.transforms.RandomHorizontalFlip(),
    tv.transforms.ToTensor(),
])

train_df = pd.read_csv('path/to/some.csv')
# initialize our dataset at first
train_dataset = ImagesDataset(
    df=train_df,
    transform=data_transforms
)

# initialize data loader with required number of workers and other params
train_loader = torch.utils.data.DataLoader(train_dataset,
                                           batch_size=10,
                                           shuffle=True,
                                           num_workers=16)

# fetch the batch(call to `__getitem__` method)
for img, target in train_loader:
    pass

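There are two things you should know. First, image dimensions differ from TensorFlow: they are [batch_size x channels x height x width]. The conversion is done for you by the preprocessing step torchvision.transforms.ToTensor(). The transforms package contains many other useful utilities.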
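
If your data already sits in a TensorFlow-style layout, the axes can be reordered by hand. The following sketch uses a made-up batch to show the conversion from [batch, height, width, channels] to [batch, channels, height, width] with `permute`.

```python
import torch

# a fake batch in TensorFlow-style layout: [batch, height, width, channels]
nhwc = torch.arange(2 * 4 * 5 * 3, dtype=torch.float32).reshape(2, 4, 5, 3)

# reorder the axes to PyTorch's [batch, channels, height, width]
nchw = nhwc.permute(0, 3, 1, 2).contiguous()

print(nchw.shape)  # torch.Size([2, 3, 4, 5])

# the same pixel is addressed with the channel index moved forward
same_pixel = nhwc[1, 2, 3, 0].item() == nchw[1, 0, 2, 3].item()
print(same_pixel)  # True
```

`permute` only changes how the data is viewed; the `contiguous()` call then copies it into the new memory order.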

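Second, you can use pinned memory to speed up transfers to the GPU. To do so, pass the additional flag async=True to the cuda() call (renamed non_blocking=True in PyTorch 0.4) and fetch pinned batches from a DataLoader created with pin_memory=True. The original article links to a more detailed discussion of this feature.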
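
The pinned-memory setup can be sketched as below, assuming PyTorch 0.4 or later, where the flag is spelled `non_blocking`. The `RandomImages` dataset is a hypothetical in-memory stand-in for real image files, and pinning is switched on only when a GPU is present.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class RandomImages(Dataset):
    """Hypothetical in-memory dataset standing in for real image files."""

    def __init__(self, n=8):
        self.images = torch.randn(n, 3, 16, 16)
        self.targets = torch.arange(n) % 2

    def __getitem__(self, index):
        return self.images[index], self.targets[index]

    def __len__(self):
        return len(self.images)


use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# pinned (page-locked) host memory only matters when copying to a GPU
loader = DataLoader(RandomImages(), batch_size=4, pin_memory=use_cuda)

shapes = []
for img, target in loader:
    # non_blocking=True lets the copy from pinned memory run asynchronously
    img = img.to(device, non_blocking=True)
    target = target.to(device, non_blocking=True)
    shapes.append(tuple(img.shape))

print(shapes)  # [(4, 3, 16, 16), (4, 3, 16, 16)]
```

On a CPU-only machine both flags have no effect, so the same loop runs unchanged.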

Final architecture overview

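By now you know about models, optimizers, and many other pieces. How do you combine them properly? I suggest splitting your model and all the wrappers into separate building blocks.

Here is some pseudocode to help you understand: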

class ImagesDataset(torch.utils.data.Dataset):
    pass

class Net(nn.Module):
    pass

model = Net()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
criterion = torch.nn.MSELoss()

dataset = ImagesDataset(path_to_images)
data_loader = torch.utils.data.DataLoader(dataset, batch_size=10)

train = True
for epoch in range(epochs):
    for inputs, labels in data_loader:
        inputs = Variable(to_gpu(inputs))
        labels = Variable(to_gpu(labels))

        outputs = model(inputs)
        loss = criterion(outputs, labels)
        if train:
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    if train:
        # step the scheduler instance (not the lr_scheduler module) once per epoch
        scheduler.step()
    else:
        save_best_model(epoch_validation_accuracy)
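
For a version you can actually run, here is a compact sketch of the same structure on synthetic data. Everything specific in it is an assumption made for illustration: the linear target `y = 3x + 1`, the single `nn.Linear` model, and the hyperparameters.

```python
import torch
from torch import nn
from torch.optim import lr_scheduler
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# synthetic regression data: y = 3x + 1 (purely illustrative)
x = torch.linspace(-1, 1, 64).unsqueeze(1)
y = 3 * x + 1
data_loader = DataLoader(TensorDataset(x, y), batch_size=16, shuffle=True)

model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
criterion = nn.MSELoss()


def run_epoch():
    """Train for one pass over the data and return the mean loss."""
    total = 0.0
    for inputs, labels in data_loader:
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total += loss.item() * len(inputs)
    return total / len(data_loader.dataset)


first_loss = run_epoch()
scheduler.step()
for epoch in range(1, 40):
    last_loss = run_epoch()
    scheduler.step()

print(last_loss < first_loss)  # True
```

The dataset, model, optimizer, scheduler and loss are each built separately and meet only inside the loop, which mirrors the split suggested above.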

Conclusion

I hope this article has shown you some of PyTorch's strengths:

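It can be used as a drop-in replacement for NumPy
It is really fast for prototyping
It is easy to debug and to use conditional flows
It comes with many useful tools out of the box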

PyTorch is a fast-growing framework with an awesome community. There is no time like the present, so give PyTorch a try!

Original title: "PyTorch tutorial distilled - Migrating from TensorFlow to PyTorch". Author: Illarion Khlestov. Translator: 夏天. Reviewer: 主题曲.

This article is an abridged translation. For more details, see the original.
