【pytorch】【202504】About torch.nn.Linear

Summary: Starting from a short code snippet, Xiaobai looks at how `nn.Linear` is used and what happens behind it. Along the way, Xiaobai digs into the source of PyTorch's core class `torch.nn.Module` and its subclass `torch.nn.Linear`, and learns that `grad_fn` is a tensor attribute that guides backpropagation. Going further, Xiaobai explores the relationship between `requires_grad` and leaf tensors: a leaf tensor is one with no preceding operation in the computation graph, and only leaf tensors with `requires_grad=True` have their gradients saved during the backward pass. Finally, Xiaobai studies PyTorch's three gradient modes. Through all of this, Xiaobai comes away with a much deeper understanding of PyTorch's automatic differentiation mechanism.

Today Xiaobai's torch journey begins. Xiaobai came across the following snippet:

import torch
import torch.nn as nn
l = nn.Linear(100,2)

d1 = torch.randn(3,100)
d2 = torch.randn(3,100)
l(d1), l(d2)

Output of l(d1):

tensor([[ 0.8882, -0.1242],
        [ 0.2551, -0.0560],
        [ 1.3589,  0.2343]], grad_fn=<AddmmBackward0>)

Output of l(d2):
tensor([[ 0.1178,  0.7039],
        [-0.1690,  0.5082],
        [-1.0211, -0.1887]], grad_fn=<AddmmBackward0>)

To get started, Xiaobai works through this snippet piece by piece:

【1】 nn.Linear is a linear (affine) transformation, i.e. a linear function of the form y = kx + b


In l = nn.Linear(100,2), 100 is the input dimension (in_features) and 2 is the output dimension (out_features).

Now let's reproduce the top-left entry of the l(d1) output by hand; the computation is:
sum([a*b for a,b in zip(d1[0],l.weight[0])])+l.bias[0]

Running this linear computation gives:

tensor(0.8882, grad_fn=<AddBackward0>)
As expected, this matches the element in the first row, first column of the l(d1) tensor.
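The same check can be done for the whole output in one shot. Below is a self-contained minimal sketch (it re-creates l and d1, so the numbers will differ from the printout above): nn.Linear computes x @ W.T + b, so the manual matrix form should match l(d1) up to floating-point error.

import torch
import torch.nn as nn

l = nn.Linear(100, 2)
d1 = torch.randn(3, 100)

# nn.Linear applies y = x @ W^T + b, where weight has shape (out_features, in_features)
manual = d1 @ l.weight.T + l.bias       # shape (3, 2)
print(torch.allclose(l(d1), manual))    # expected: True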

【2】 The object created by nn.Linear has type torch.nn.modules.linear.Linear:

type(l)

Output:

torch.nn.modules.linear.Linear

With that, let's look at the source of torch.nn.Linear:

https://pytorch.org/docs/stable/_modules/torch/nn/modules/linear.html

We can see that torch.nn.Linear inherits from torch.nn.Module:

from .module import Module
class Linear(Module):

    __constants__ = ["in_features", "out_features"]
    in_features: int
    out_features: int
    weight: Tensor

    def __init__(
        self,
        in_features: int,
        out_features: int,
        bias: bool = True,
        device=None,
        dtype=None,
    ) -> None:

        # ... (creation of the weight and bias Parameters elided)
        self.reset_parameters()

    def reset_parameters(self) -> None: ...

    def forward(self, input: Tensor) -> Tensor: ...

    def extra_repr(self) -> str: ...

In fact, torch.nn.Module sits at the very core of PyTorch. See the Containers section of torch.nn in the official docs:
https://pytorch.org/docs/stable/nn.html#containers

It introduces torch.nn.Module as: "Base class for all neural network modules." (in other words, the classes under nn.modules all inherit from it; see https://github.com/pytorch/pytorch/tree/v2.6.0/torch/nn/modules)

Including torch.nn.Module itself, the Containers section of torch.nn lists 6 containers:

Container      Purpose                                      Note
Module         Base class for all neural network modules.   The core class
Sequential     A sequential container.                      Subclass of Module
ModuleList     Holds submodules in a list.                  Subclass of Module
ModuleDict     Holds submodules in a dictionary.            Subclass of Module
ParameterList  Holds parameters in a list.                  Subclass of Module
ParameterDict  Holds parameters in a dictionary.            Subclass of Module

Looking at the source, https://github.com/pytorch/pytorch/blob/v2.6.0/torch/nn/modules/container.py,
we can see that the other five all inherit from torch.nn.Module.

In fact, even the loss functions inherit from Module:

https://pytorch.org/docs/main/_modules/torch/nn/modules/loss.html
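A quick check of this (a minimal sketch; only standard torch.nn classes are used):

import torch.nn as nn

# loss functions are nn.Module subclasses too
print(isinstance(nn.MSELoss(), nn.Module))           # True
print(isinstance(nn.CrossEntropyLoss(), nn.Module))  # True
print(issubclass(nn.Linear, nn.Module))              # True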

【3】 What does "grad_fn=<AddBackward0>" mean?

grad_fn is simply an attribute of a tensor, used to guide backpropagation ("fn" is short for "function").

Here "Add" indicates the tensor came from an addition. You will also see tensors whose attribute reads grad_fn=<PowBackward0>, meaning they came from a power operation, as well as variants that differ only in the trailing digit, such as grad_fn=<MeanBackward0>, grad_fn=<MeanBackward1>, and so on.
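A minimal sketch showing how the grad_fn name follows the operation that produced the tensor (the exact trailing digit may vary across PyTorch versions):

import torch

x = torch.randn(3, requires_grad=True)

print((x + 1).grad_fn)   # e.g. <AddBackward0 object at ...>
print((x ** 2).grad_fn)  # e.g. <PowBackward0 object at ...>
print(x.mean().grad_fn)  # e.g. <MeanBackward0 object at ...>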

type(l(d1))  returns torch.Tensor

type(d1)  returns torch.Tensor

We can inspect the grad_fn attribute of this torch.Tensor object:

l(d1).grad_fn  returns <AddmmBackward0 at 0x7f0870c064d0>

l(d1).grad_fn._saved_self

We can keep reading the official notes, https://pytorch.org/docs/stable/notes/autograd.html.
No need to dig too deep for now; just remember that grad_fn is an attribute of a tensor.

When computing the forward pass, autograd simultaneously performs the requested computations and builds up a graph representing the function that computes the gradient (the .grad_fn attribute of each torch.Tensor is an entry point into this graph). When the forward pass is completed, we evaluate this graph in the backwards pass to compute the gradients.

【4】 Making sense of grad_fn

Next, let's build some intuition from how PyTorch constructs its computation graph. Here is one answerer's explanation: https://discuss.pytorch.org/t/what-does-grad-fn-powbackward0-mean-exactly/160014

To explain grad_fn properly, we need to start from the requires_grad flag and walk through the overall flow:

1. Quote: "during the forward pass PyTorch will track the operations if one of the involved tensors requires gradients (i.e. its .requires_grad attribute is set to True) and will create a computation graph from these operations."
   Reading: requires_grad means, literally, that gradients are needed. As long as one of the tensors involved has requires_grad=True, PyTorch builds a computation graph for the gradient update.
2. Quote: "To be able to backpropagate through this computation graph and to calculate the gradients for all involved parameters, PyTorch will additionally store the corresponding 'gradient functions' (or 'backward functions') of the executed operations to the output tensor (stored as the .grad_fn attribute)."
   Reading: to backpropagate along this graph, each operation's gradient function must be stored, and it is stored in grad_fn.
3. Quote: "Once the forward pass is done, you can then call the .backward() operation on the output (or loss) tensor, which will backpropagate through the computation graph using the functions stored in .grad_fn."
   Reading: after the forward pass, calling backward() on the output (or loss) tensor backpropagates through the computation graph, and the gradients are computed with the functions stored in grad_fn.

【4.1】 So what exactly is the requires_grad setting?

The answerer above said that as long as one tensor has requires_grad set, a computation graph is built. So what is this flag? The official docs spell it out:
https://pytorch.org/docs/stable/notes/autograd.html#setting-requires-grad

requires_grad defaults to False, unless the tensor is wrapped in a torch.nn.Parameter, in which case it defaults to True.
requires_grad is a flag, defaulting to false unless wrapped in a nn.Parameter, that allows for fine-grained exclusion of subgraphs from gradient computation. It takes effect in both the forward and backward passes:

The flag takes effect in both the forward and the backward pass:

[In the forward pass, it decides whether the backward graph is recorded]
During the forward pass, an operation is only recorded in the backward graph if at least one of its input tensors require grad.

[In the backward pass, only leaf tensors with requires_grad=True have their gradients saved]
During the backward pass (.backward()), only leaf tensors with requires_grad=True will have gradients accumulated into their .grad fields.
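Both effects can be seen in a minimal sketch (plain tensors only, nothing from the snippet above is assumed):

import torch

a = torch.randn(3)                      # requires_grad=False (the default)
b = torch.randn(3)
print((a * b).grad_fn)                  # None: nothing was recorded in the forward pass

w = torch.randn(3, requires_grad=True)  # now one input requires grad
y = (w * b).sum()
print(y.grad_fn)                        # <SumBackward0 ...>: the operation was recorded

y.backward()
print(w.grad)                           # populated: leaf tensor with requires_grad=True
print(b.grad)                           # None: leaf tensor with requires_grad=False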

【4.2】 Is setting requires_grad=True enough to keep the gradient? No, the tensor must also be a leaf tensor! A non-leaf tensor's gradient is released even if requires_grad=True

only leaf tensors with requires_grad=True will have gradients accumulated into their .grad fields.

The official docs stress this point explicitly ("this flag" below refers to requires_grad):
setting requires_grad=True only makes sense for leaf tensors, i.e. tensors that have no grad_fn; the parameters of an nn.Module are an example of leaf tensors.

It is important to note that even though every tensor has this flag, setting it only makes sense for leaf tensors (tensors that do not have a grad_fn, e.g., a nn.Module’s parameters).

Non-leaf tensors are the ones that do have a grad_fn. A non-leaf tensor has a backward graph associated with it, and its gradient is only needed as an intermediate result for computing the gradients of leaf tensors.
Non-leaf tensors (tensors that do have grad_fn) are tensors that have a backward graph associated with them. Thus their gradients will be needed as an intermediary result to compute the gradient for a leaf tensor that requires grad. From this definition, it is clear that all non-leaf tensors will automatically have require_grad=True.
From this definition we can also read off a property of non-leaf tensors: they automatically have requires_grad=True, but their grad is not kept.

Let's look at the following forum thread, where someone asking about the is_leaf attribute of tensors quotes a few statements about leaf tensors from the official docs:
https://discuss.pytorch.org/t/what-is-the-purpose-of-is-leaf/87000

Here is what that person quoted from the docs:

1. Quote: "All Tensors that have requires_grad which is False will be leaf Tensors by convention."
   Reading: by convention, a tensor with requires_grad=False is always a leaf tensor.
2. Quote: "For Tensors that have requires_grad which is True, they will be leaf Tensors if they were created by the user. This means that they are not the result of an operation and so grad_fn is None."
   Reading: a tensor with requires_grad=True can also be a leaf tensor, provided the user created it directly. For example, a tensor created by hand with requires_grad=True is not the result of an operation, so its grad_fn is None, and it counts as a leaf tensor.
3. Quote: "Only leaf Tensors will have their grad populated during a call to backward(). To get grad populated for non-leaf Tensors, you can use retain_grad()."
   Reading: only leaf tensors keep their gradient when backward() is called; to keep the gradient of a non-leaf tensor you must call retain_grad() on it.

【4.3】 is_leaf tells you whether a tensor is a leaf tensor

Let's check whether the tensors from the snippet at the very beginning are leaf tensors, via the is_leaf attribute:

# not a leaf tensor
l(d1).is_leaf  # False
# not a leaf tensor
(sum([a*b for a,b in zip(d1[0],l.weight[0])])+l.bias[0]).is_leaf # False
# a leaf tensor
d1.is_leaf # True

We will not go further into how leaf status is determined here; that is covered towards the end.

【5】 Reproducing a grad_fn computation

This blogger gives a worked example: https://www.cnblogs.com/picassooo/p/13757403.html

import torch

w1 = torch.tensor(2.0, requires_grad=True)
a = torch.tensor([[1., 2.], [3., 4.]], requires_grad=True)
tmp = a[0, :]
# tmp is a non-leaf tensor; call .retain_grad() to keep its gradient,
# otherwise the gradient is released once the backward pass finishes
tmp.retain_grad()
b = tmp.repeat([3, 1])
b.retain_grad()
loss = (b * w1).mean()
loss.backward()

Inspecting the outputs:

print(b.grad_fn)  # output:
<RepeatBackward0 object at 0x7f0870612e50>
# as mentioned earlier, gradients are accumulated into the .grad field during the backward pass
print(b.grad)  # output:
tensor([[0.3333, 0.3333],
        [0.3333, 0.3333],
        [0.3333, 0.3333]])
print(tmp.grad_fn)  # output:
<SliceBackward0 object at 0x7f087050b750>
print(tmp.grad)  # output:
tensor([1., 1.])
print(a.grad)  # output:
tensor([[1., 1.],
        [0., 0.]])

Below is the author's manual derivation of these gradients:

[Figure: hand-written derivation from the original post; image not reproduced here.]
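A sketch of that derivation (my own working, following the code above): with $w_1 = 2$, tmp $= a_{0,:}$, and $b$ the $3\times 2$ tile of tmp, the loss is the mean of $b \cdot w_1$ over its 6 elements, so

\frac{\partial\,\text{loss}}{\partial b_{ij}} = \frac{w_1}{6} = \frac{2}{6} \approx 0.3333,
\qquad
\frac{\partial\,\text{loss}}{\partial\,\text{tmp}_j} = \sum_{i=1}^{3}\frac{\partial\,\text{loss}}{\partial b_{ij}} = 3\cdot\frac{2}{6} = 1,
\qquad
\frac{\partial\,\text{loss}}{\partial a_{0,j}} = 1,\quad \frac{\partial\,\text{loss}}{\partial a_{1,j}} = 0,

which matches the printed b.grad, tmp.grad, and a.grad above.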

【6】 Clarifying the concept of a leaf tensor

Even readers familiar with PyTorch's automatic differentiation machinery (torch.autograd) can find the notion of a leaf tensor a little fuzzy, so this author lays it out in detail, from a mathematical rather than an engineering angle:
https://leimao.github.io/blog/PyTorch-Leaf-Tensor/

A tensor can be examined along four axes: Requires Grad, User Created, Is Leaf, and Grad Populated.

1. Requires Grad: the requires_grad attribute of a torch.Tensor, which marks it as a constant or a variable.
2. User Created: whether the tensor was created explicitly by the user.
3. Is Leaf: "Is Leaf" is true means that a torch.Tensor is a leaf node in a torch.autograd directed acyclic graph (DAG), which consists of a root (tensor) node, many leaf (tensor) nodes, and many intermediate (backward function call) nodes.
4. Grad Populated: "Grad Populated" is true means that the gradient with respect to a torch.Tensor will be saved in the tensor object (for optimization), so that the grad attribute of the torch.Tensor will not be None after a backward pass.

Does that mean there are 2 × 2 × 2 × 2 = 16 kinds of tensors? No; the author says only 4 combinations actually occur:

Requires Grad   User Created   Is Leaf   Grad Populated
True            True           True      True
True            False          False     False
False           True           True      False
False           False          True      False

Summarizing the author's table in our own words:
1. A non-leaf tensor never keeps its grad (and, going by this table, a non-leaf tensor is never one the user created by hand).
2. A constant (requires_grad=False) is always a leaf tensor and never keeps a grad, whether or not it was created by hand.
3. Even with requires_grad=True, a non-leaf tensor ends up with no grad.

The author also gives some examples:

import torch

# helper: print the four attributes of a tensor
def print_tensor_attributes(tensor: torch.Tensor) -> None:
    print(f"requires_grad: {tensor.requires_grad}, "
          f"grad_fn: {tensor.grad_fn is not None}, "
          f"is_leaf: {tensor.is_leaf}, "
          f"grad: {tensor.grad is not None}")

cuda_device = torch.device("cuda:0")
# two variables (requires_grad=True): one on CPU, one on CUDA
variable_tensor_cpu = torch.tensor([2., 3.], requires_grad=True)
variable_tensor_cuda = variable_tensor_cpu.to(cuda_device)

# a constant
constant_tensor_cuda = torch.tensor([6., 4.],
                                    requires_grad=False,
                                    device=cuda_device)

# the loss; all of these operations run on CUDA
loss = torch.sum((constant_tensor_cuda - variable_tensor_cuda)**2)

First, the tensor attributes before the backward pass:

# before backward: the CPU variable
print_tensor_attributes(tensor=variable_tensor_cpu)
Output:
requires_grad: True
grad_fn: False
is_leaf: True
grad: False

After the CPU variable is moved to CUDA, its attributes change; this is the key point: a leaf node has turned into a non-leaf node.

# before backward: the CUDA variable
print_tensor_attributes(tensor=variable_tensor_cuda)
Output:
requires_grad: True
grad_fn: True
is_leaf: False
grad: False
# before backward: the CUDA constant
print_tensor_attributes(tensor=constant_tensor_cuda)
requires_grad: False
grad_fn: False
is_leaf: True
grad: False
# before backward: the loss
print_tensor_attributes(tensor=loss)

requires_grad: True
grad_fn: True
is_leaf: False
grad: False

As we can see, before the backward pass every tensor reports grad: False.

Now let's run the backward pass:

loss.backward()

What happens?

# after backward: the CPU variable is a leaf node, so its gradient is stored; grad flips from False to True
print_tensor_attributes(tensor=variable_tensor_cpu)
requires_grad: True
grad_fn: False
is_leaf: True
grad: True
# after backward: the CUDA variable is a non-leaf node, so its gradient is not stored; grad is still False
print_tensor_attributes(tensor=variable_tensor_cuda)
requires_grad: True
grad_fn: True
is_leaf: False
grad: False
# after backward: the CUDA constant is a leaf node but a constant, so no gradient is required; same state as before
print_tensor_attributes(tensor=constant_tensor_cuda)
requires_grad: False
grad_fn: False
is_leaf: True
grad: False
# after backward: the loss is a non-leaf node, so again no gradient is stored
print_tensor_attributes(tensor=loss)
requires_grad: True
grad_fn: True
is_leaf: False
grad: False

Looking only at these four fields: the CUDA variable, the CUDA constant, and the loss print exactly what they printed before; only the original CPU variable changes, with grad flipping from False to True.

Many readers might expect the CUDA variable's grad to become True after the backward pass, since that is what would get optimized during training. In fact the CUDA variable still has no grad while the CPU variable does, which means variable_tensor_cpu is the actual variable for optimization.

In other words, .to(cuda) is itself an operation!

We can check which grad_fn the .to(cuda) operation produces:
print(variable_tensor_cuda.grad_fn)
<ToCopyBackward0 object at 0x7ff4d8ea15a0>

After the optimization is performed after the backward pass, the variable_tensor_cuda value will not be the same as the variable_tensor_cpu until the next forward pass is performed.

Only at the next forward pass will variable_tensor_cuda be brought back in sync with variable_tensor_cpu.

Also, if you try to access the grad attribute of a non-leaf tensor, you get a warning: the grad of a non-leaf tensor is not populated during autograd.backward(), and if you really need it you should call .retain_grad() on that non-leaf tensor.

leaf.py:6: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at aten/src/ATen/core/TensorBody.h:489.)
  f"grad: {tensor.grad is not None}")

We can visualize this computation DAG with a third-party package, torchviz.
Installation:

# if the sudo command is missing (e.g. in a minimal container), first run:
$ apt-get update
$ apt-get install sudo

# then run:
$ sudo apt update
$ sudo apt install -y graphviz
$ pip install torchviz

Then the visualization script:

import torch
from torchviz import make_dot

if __name__ == "__main__":

    cuda_device = torch.device("cuda:0")
    variable_tensor_cpu = torch.tensor([2., 3.], requires_grad=True)
    variable_tensor_cuda = variable_tensor_cpu.to(cuda_device)
    constant_tensor_cuda = torch.tensor([6., 4.],
                                        requires_grad=False,
                                        device=cuda_device)
    loss = torch.sum((constant_tensor_cuda - variable_tensor_cuda)**2)
    make_dot(loss).render("dag", format="svg")

Here is the graph the author renders. Note that a graph drawn with torchviz does not show the leaf node that does not require grad.

[Figure: the torchviz DAG for this loss; image not reproduced here.]

"The blue box in the DAG diagram, although having no tensor name, is the leaf tensor variable_tensor_cpu in our program."

That is, the blue box in the DAG is variable_tensor_cpu from our script; the box simply does not carry the tensor's name.

【7】 Why is grad not populated for a tensor that requires grad but is not a leaf node? Because it is not worth it

The author above says:

1. Quote: "Conventionally, only leaf tensors, usually model parameters to be trained, deserve grad."
   Reading: only leaf tensors, such as the model parameters, are worth keeping a grad for.
2. Quote: "All the non-leaf tensors, such as the intermediate activation tensors, do not deserve grad."
   Reading: non-leaf tensors, such as intermediate activation tensors, are not worth keeping a grad for.

The author then asks: why would we keep a grad for the activation tensors? Their values are overwritten in the next forward pass anyway, so there is no point in keeping the gradient:

Why would we need to keep a grad for the activation tensors?
1 Even if we keep the grad in the activation tensor and apply the grad to the activation tensor values in the optimization, those values will be overwritten in the next forward pass.
2 So populating grad for non-leaf tensors is usually a waste of memory and computation.

However, in some “rare” use cases, the user would need the grad for non-leaf tensors, and PyTorch has the API torch.Tensor.retain_grad() for that. But usually it’s not making sense and is an indication of problematic implementation.

https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html
https://pytorch.org/docs/1.13/generated/torch.Tensor.is_leaf.html#torch.Tensor.is_leaf

【8】 A CPU leaf tensor becomes a non-leaf tensor after moving to CUDA

On this point, the is_leaf thread mentioned above (https://discuss.pytorch.org/t/what-is-the-purpose-of-is-leaf/87000/3) also contains a question:

# once cast to CUDA, the tensor becomes a non-leaf node
# b was created by the operation that cast a cpu Tensor into a cuda Tensor
>>> b = torch.rand(10, requires_grad=True).cuda()
>>> b.is_leaf
False
# created directly on CUDA, it is a leaf node
# f requires grad, has no operation creating it
>>> f1 = torch.rand(10, requires_grad=True, device="cuda")
>>> f1.is_leaf
True

cuda_device = torch.device("cuda:0")
f2 = torch.rand(10, requires_grad=True, device=cuda_device)
f2.is_leaf
True

Given this behavior, the questioner asks: does this mean that whenever we obtain a tensor by calling .cuda(), we have to use .retain_grad() so that its grad attribute gets populated?

One answerer puts it well: .cuda() is itself an operation.

# indeed, if you check, it does have a grad_fn
b.grad_fn
<ToCopyBackward0 object at 0x7ff3e1d409d0>
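If you really do want the gradient on the CUDA copy, the thread's answer is the .retain_grad() method; a minimal sketch of that (my own example, assuming a CUDA device is available):

import torch

b = torch.rand(10, requires_grad=True).cuda()  # non-leaf: .cuda() is an operation
b.retain_grad()                                # ask autograd to keep its grad anyway

loss = (b ** 2).sum()
loss.backward()
print(b.grad is not None)  # True, thanks to retain_grad()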

【9】 Case studies: which tensors keep their gradients

Here are some of the examples posted in the is_leaf thread mentioned earlier:
https://discuss.pytorch.org/t/what-is-the-purpose-of-is-leaf/87000/4

【9.1】 grad retention for leaf tensors

################################################################
#  SCENARIO 1  a leaf tensor's requires_grad setting decides whether grad is kept after backward
################################################################
a = torch.rand(3,2,requires_grad=True)
b = torch.rand(2,3)
loss = 10 - (a @ b).sum()
print(a.is_leaf) # before backward: True
print(a.grad)  # before backward: None

print(b.is_leaf) # before backward: True
print(b.grad)  # before backward: None


# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
# >>>>>>>>>>>>>>>>>> start the backward pass
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
loss.backward()

# a is a leaf node with requires_grad=True, so its grad is kept after backward
print(a.is_leaf) # True, a is a leaf node
print(a.grad) # has a grad: tensor([[-1.6699, -1.7115],
        [-1.6699, -1.7115],
        [-1.6699, -1.7115]])


# b is also a leaf node, but it was created with the default requires_grad=False, so no grad is kept after backward
print(b.is_leaf) # True, b is a leaf node
print(b.grad) # None, b does not require grad

【9.2】 Leaf status and grad retention once a CUDA operation is involved

################################################################
#  SCENARIO 2  tensors produced by a .cuda() call
################################################################
import torch
a = torch.rand(3,2,requires_grad=True).cuda()
b = torch.rand(2,3).cuda()
loss = 10 - (a @ b).sum()


# before backward
print('>>>> a')
print(a.is_leaf) # False: requires_grad=True followed by a .cuda() call means a is no longer a leaf node
print(a.grad) # None: no grad yet
print(a.grad_fn) # <ToCopyBackward0 object at 0x7fd1f416ba60>; by definition, having a grad_fn makes it a non-leaf node

print('>>>> b')
print(b.is_leaf) # True: requires_grad defaults to False, so after .cuda() it is still a leaf node
print(b.grad)  # None: no grad yet
print(b.grad_fn) # None; by definition, no grad_fn means a leaf node

# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
# >>>>>>>>>>>>>>>>>> start the backward pass
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
loss.backward()

print('>>>> a')
print(a.is_leaf) # False
print(a.grad) # None: after backward, a non-leaf node does not keep its grad
print(a.grad_fn) # <ToCopyBackward0 object at 0x7fd1f416ba60>

print('>>>> b')
print(b.is_leaf) # True, a leaf node
print(b.grad)  # None: a leaf node, but requires_grad=False, so nothing is kept
print(b.grad_fn) # None

【9.3】 grad retention for nn.Module parameters

The parameters inside an nn.Module are leaf tensors, and they default to requires_grad=True. Leaf tensors you create yourself must be given requires_grad=True explicitly for their grad to be kept.

################################################################
#  SCENARIO 3  a Module's parameter tensors
################################################################
ll = nn.Linear(3,3).cuda()
inp = torch.rand(3).cuda()
loss = (10 - ll(inp)).sum()

# before backward
print(inp.is_leaf) #True
print(inp.grad) #None
print(inp.grad_fn) # None

print(loss.is_leaf) # False
print(loss.grad) # None
print(loss.grad_fn) # <SumBackward0 object at 0x7fd0e5c77ee0>

print(ll.weight.is_leaf) #True
print(ll.weight.grad) #None
print(ll.weight.grad_fn) #None

print(ll.bias.is_leaf) #True
print(ll.bias.grad) #None
print(ll.bias.grad_fn) #None

# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
# >>>>>>>>>>>>>>>>>> start the backward pass
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
loss.backward()

# after backward
print(inp.is_leaf) #True
print(inp.grad) #None
print(inp.grad_fn) # None

print(loss.is_leaf) # False
print(loss.grad) # None
print(loss.grad_fn) # a different object address: <SumBackward0 object at 0x7fd1f41a3760>

print(ll.weight.is_leaf) #True
print(ll.weight.grad) # tensor([[-0.3796, -0.9358, -0.0274],
        [-0.3796, -0.9358, -0.0274],
        [-0.3796, -0.9358, -0.0274]], device='cuda:0')
print(ll.weight.grad_fn) #None

print(ll.bias.is_leaf) #True
print(ll.bias.grad) # tensor([-1., -1., -1.], device='cuda:0')
print(ll.bias.grad_fn) #None

【9.4】 Summary

To wrap up, here is one answerer's summary from the same thread (a few of the details are debatable):
https://discuss.pytorch.org/t/what-is-the-purpose-of-is-leaf/87000/5

1. Quote: "The docs don't really indicate what is going on with is_leaf. In particular, I think the sentence that says 'Only leaf Tensors will have their grad populated…' is misleading. From what I can guess, leaves don't really have to do with populating grad; requires_grad is what governs that."
   Gist: not every leaf node keeps a grad; requires_grad is what governs that.
2. Quote: "I think the is_leaf property is really about the reverse-graph. When x.backward() is called, all of the action happens on the 'reverse-differentiation mode' graph at x. x is the root, and the graph runs up along (against arrows in) the forward graph from x."
   Gist: being a leaf is about the reverse (backward) graph.
3. Quote: "When the process hits a non-leaf, it knows it can keep mapping along to more nodes. On the other hand, when the process hits a leaf, it knows to stop; leaves have no grad_fn."
   Gist: backpropagation stops once it reaches a leaf.
4. Quote: "If this is right, it makes it more clear why weights are 'leaves with requires_grad = True', and inputs are 'leaves with requires_grad = False.' You could even take this as a definition of 'weights' and 'inputs'."
   Gist: parameters are leaves with requires_grad=True; inputs, unless you set the flag yourself, are leaves with requires_grad=False.

【10】 Some earnest advice from the official docs

https://pytorch.org/docs/stable/notes/autograd.html#setting-requires-grad

1. Quote: "Setting requires_grad should be the main way you control which parts of the model are part of the gradient computation, for example, if you need to freeze parts of your pretrained model during model fine-tuning."
   Reading: requires_grad is the main switch for controlling which parts of the model take part in gradient computation.
2. Quote: "To freeze parts of your model, simply apply .requires_grad_(False) to the parameters that you don't want updated."
   Reading: to freeze some of the model's parameters, set this flag to False on them.
3. Quote: "And as described above, since computations that use these parameters as inputs would not be recorded in the forward pass, they won't have their .grad fields updated in the backward pass because they won't be part of the backward graph in the first place, as desired."
   Reading: they are never recorded into the graph, so their grad is never updated either.
4. Quote: "Because this is such a common pattern, requires_grad can also be set at the module level with nn.Module.requires_grad_(). When applied to a module, .requires_grad_() takes effect on all of the module's parameters (which have requires_grad=True by default)."
   Reading: it can also be set at the module level with requires_grad_(), in which case it applies to all of that module's parameters. A sketch of the freezing pattern follows below.
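A minimal sketch of freezing part of a model (a toy two-layer model of my own, not from the docs):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.Linear(10, 2))

# freeze the first layer: its parameters drop out of gradient computation
model[0].requires_grad_(False)

out = model(torch.randn(4, 10)).sum()
out.backward()

print(model[0].weight.grad)              # None: frozen
print(model[1].weight.grad is not None)  # True: still trainable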

【11】 The effect of requires_grad also depends on the grad mode

Apart from setting requires_grad there are also three grad modes that can be selected from Python that can affect how computations in PyTorch are processed by autograd internally:

The three modes are:
1. default mode (grad mode)
2. no-grad mode
3. inference mode

All of them can be toggled via context managers and decorators.

Mode       | Excludes ops from the backward graph | Skips extra autograd tracking overhead | Tensors created in this mode usable in grad mode later | Examples
default    | no                                   | no                                     | yes                                                    | Forward pass
no-grad    | yes                                  | no                                     | yes                                                    | Optimizer updates
inference  | yes                                  | yes                                    | no                                                     | Data processing, model evaluation

【11.1】 Default Mode (Grad Mode)

Of the three modes, this is the only one in which requires_grad takes effect; in the other two modes, requires_grad is always overridden to False.
The “default mode” is the mode we are implicitly in when no other modes like no-grad and inference mode are enabled. To be contrasted with “no-grad mode” the default mode is also sometimes called “grad mode”.

The most important thing to know about the default mode is that it is the only mode in which requires_grad takes effect. requires_grad is always overridden to be False in both the two other modes.

【11.2】No-grad Mode

No-grad mode behaves as if none of the inputs required grad; in other words, computations performed in no-grad mode are never recorded in the backward graph, even if some inputs have requires_grad=True.

Computations in no-grad mode behave as if none of the inputs require grad. In other words, computations in no-grad mode are never recorded in the backward graph even if there are inputs that have require_grad=True.

Enable no-grad mode when you need to perform operations that should not be recorded by autograd, but you’d still like to use the outputs of these computations in grad mode later.

It lets you disable requires_grad for a whole block of code at once:
This context manager makes it convenient to disable gradients for a block of code or function without having to temporarily set tensors to have requires_grad=False, and then back to True.

For example, no-grad mode might be useful when writing an optimizer: when performing the training update you’d like to update parameters in-place without the update being recorded by autograd. You also intend to use the updated parameters for computations in grad mode in the next forward pass.

The implementations in torch.nn.init also rely on no-grad mode when initializing the parameters as to avoid autograd tracking when updating the initialized parameters in-place.
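A minimal sketch of no-grad mode used as a context manager (my own example):

import torch

w = torch.randn(3, requires_grad=True)

with torch.no_grad():
    y = w * 2           # not recorded, even though w requires grad
print(y.requires_grad)  # False
print(y.grad_fn)        # None

z = (y * w).sum()       # back in grad mode, y can still be used in recorded computations
print(z.grad_fn)        # <SumBackward0 ...>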

【11.3】Inference Mode

Inference mode is the extreme version of no-grad mode.

Just like in no-grad mode, computations in inference mode are not recorded in the backward graph, but enabling inference mode will allow PyTorch to speed up your model even more.

This better runtime comes with a drawback: tensors created in inference mode will not be able to be used in computations to be recorded by autograd after exiting inference mode.

Enable inference mode when you are performing computations that do not have interactions with autograd, AND you don’t plan on using the tensors created in inference mode in any computation that is to be recorded by autograd later.

It is recommended that you try out inference mode in the parts of your code that do not require autograd tracking (e.g., data processing and model evaluation).

If it works out of the box for your use case it’s a free performance win. If you run into errors after enabling inference mode, check that you are not using tensors created in inference mode in computations that are recorded by autograd after exiting inference mode. If you cannot avoid such use in your case, you can always switch back to no-grad mode.

For details on inference mode please see Inference Mode(https://pytorch.org/cppdocs/notes/inference_mode.html).

For implementation details of inference mode see RFC-0011-InferenceMode.(https://github.com/pytorch/rfcs/pull/17)
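A minimal sketch of inference mode (my own example); unlike the no-grad case above, using the resulting tensor later in a computation recorded by autograd is expected to raise an error:

import torch

w = torch.randn(3, requires_grad=True)

with torch.inference_mode():
    y = w * 2            # faster, but y is an inference tensor
print(y.requires_grad)   # False

# (y * w).sum().backward()  # expected to raise a RuntimeError, because an inference
#                           # tensor is being used in an autograd-recorded computation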

【11.4】Evaluation Mode (nn.Module.eval())

Evaluation mode is not a mechanism to locally disable gradient computation. It is included here anyway because it is sometimes confused to be such a mechanism.

Functionally, module.eval() (or equivalently module.train(False)) is completely orthogonal to no-grad mode and inference mode. How model.eval() affects your model depends entirely on the specific modules used in your model and whether they define any training-mode specific behavior.

You are responsible for calling model.eval() and model.train() if your model relies on modules such as torch.nn.Dropout and torch.nn.BatchNorm2d that may behave differently depending on training mode, for example, to avoid updating your BatchNorm running statistics on validation data.

It is recommended that you always use model.train() when training and model.eval() when evaluating your model (validation/testing) even if you aren’t sure your model has training-mode specific behavior, because a module you are using might be updated to behave differently in training and eval modes.
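A minimal sketch of the usual pattern (my own example): eval()/train() toggle module behavior such as Dropout, while no_grad() separately disables gradient tracking.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.Dropout(p=0.5), nn.Linear(10, 2))
x = torch.randn(4, 10)

model.eval()              # Dropout becomes a no-op; BatchNorm layers would use running stats
with torch.no_grad():     # separately: skip autograd recording during evaluation
    preds = model(x)

model.train()             # switch back to training-mode behavior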
