This section focuses on gradients of real-valued functions with respect to matrices or vectors. We start with the definition. If $f:\mathbb{R}^{m\times n}\rightarrow \mathbb{R}$, then $\frac{\partial f}{\partial \boldsymbol{X}}$ is also an $m\times n$ matrix, whose entries satisfy

$$\left( \frac{\partial f}{\partial \boldsymbol{X}} \right) _{ij}=\frac{\partial f}{\partial x_{ij}}$$

This matrix is the gradient of the real-valued function with respect to the matrix, denoted $\nabla _{\boldsymbol{X}}f$.
Similarly, if $f:\mathbb{R}^{m}\rightarrow \mathbb{R}$, then $\frac{\partial f}{\partial \boldsymbol{x}}$ is also an $m$-dimensional column vector, whose components satisfy

$$\left( \frac{\partial f}{\partial \boldsymbol{x}} \right) _i=\frac{\partial f}{\partial x_i}$$

This vector is the gradient of the real-valued function with respect to the vector, denoted $\nabla _{\boldsymbol{x}}f$.
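Summary: the gradient of a real-valued function with respect to a vector or matrix has the same shape as that vector or matrix. Starting from the definition, we now derive the vector/matrix gradient formulas commonly used in machine learning.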
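The entry-wise definition above can be illustrated numerically. The following is a minimal sketch in plain Python, not part of the original derivation: the helper `numerical_gradient` and the test function `sum_of_squares` are hypothetical names chosen for illustration. Each entry of the gradient is approximated with a central difference, so the result has the same shape as $\boldsymbol{X}$ by construction.

```python
def numerical_gradient(f, X, h=1e-6):
    """Approximate df/dX entry by entry with central differences.

    X is a list of rows (an m x n matrix); the result has the same shape.
    """
    m, n = len(X), len(X[0])
    grad = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            original = X[i][j]
            X[i][j] = original + h
            f_plus = f(X)
            X[i][j] = original - h
            f_minus = f(X)
            X[i][j] = original  # restore the entry before moving on
            grad[i][j] = (f_plus - f_minus) / (2 * h)
    return grad


def sum_of_squares(X):
    # Hypothetical test function f(X) = sum_ij x_ij^2, whose gradient is 2X.
    return sum(x * x for row in X for x in row)


X = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
G = numerical_gradient(sum_of_squares, X)
print(len(G), len(G[0]))  # prints "2 3", the shape of X
```

Central differences are used here only because they are easy to read; an automatic-differentiation library would serve the same purpose.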
① $\nabla \left( \boldsymbol{a}^T\boldsymbol{x} \right) =\nabla \left( \boldsymbol{x}^T\boldsymbol{a} \right) =\boldsymbol{a}$, where $\boldsymbol{a}$ is a constant vector and the gradient is taken with respect to $\boldsymbol{x}$.

Proof: Since $\boldsymbol{a}^T\boldsymbol{x}$ is a scalar, $\boldsymbol{x}^T\boldsymbol{a}=\boldsymbol{a}^T\boldsymbol{x}$, so it suffices to consider $\boldsymbol{a}^T\boldsymbol{x}$. Component-wise,

$$\nabla \left( \boldsymbol{a}^T\boldsymbol{x} \right) _i=\frac{\partial \left( \boldsymbol{a}^T\boldsymbol{x} \right)}{\partial x_i}=\frac{\partial \left( \sum_j{a_jx_j} \right)}{\partial x_i}=a_i\quad \Rightarrow \quad \nabla \left( \boldsymbol{a}^T\boldsymbol{x} \right) =\boldsymbol{a}$$
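As a sanity check on formula ①, the sketch below compares a central-difference gradient of $\boldsymbol{a}^T\boldsymbol{x}$ with $\boldsymbol{a}$ itself. It is an illustration under assumptions: the helper names `dot` and `numerical_gradient` and the sample values of `a` and `x` are made up for this example.

```python
def dot(u, v):
    # Inner product u^T v of two equal-length lists.
    return sum(ui * vi for ui, vi in zip(u, v))


def numerical_gradient(f, x, h=1e-6):
    # Central-difference approximation of the gradient of f at x.
    grad = []
    for i in range(len(x)):
        x_plus = x[:i] + [x[i] + h] + x[i + 1:]
        x_minus = x[:i] + [x[i] - h] + x[i + 1:]
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad


a = [2.0, -1.0, 0.5]   # arbitrary constant vector (assumed values)
x = [0.3, 1.7, -2.0]   # arbitrary evaluation point (assumed values)

grad = numerical_gradient(lambda v: dot(a, v), x)
max_error = max(abs(g - ai) for g, ai in zip(grad, a))
print(max_error)  # should be close to zero, up to floating-point rounding
```

Because the function is linear, the gradient does not depend on the evaluation point `x`; changing `x` should leave `grad` essentially unchanged.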
② $\nabla \lVert \boldsymbol{x} \rVert _{2}^{2}=\nabla \left( \boldsymbol{x}^T\boldsymbol{x} \right) =2\boldsymbol{x}$

Proof:

$$\nabla \left( \boldsymbol{x}^T\boldsymbol{x} \right) _i=\frac{\partial \left( \sum_j{x_{j}^{2}} \right)}{\partial x_i}=2x_i\quad \Rightarrow \quad \nabla \left( \boldsymbol{x}^T\boldsymbol{x} \right) =2\boldsymbol{x}$$
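A small worked instance may make formula ② concrete. The vector below is chosen arbitrarily for illustration. Take $\boldsymbol{x}=(1,2,3)^T$, so that $\lVert \boldsymbol{x} \rVert _{2}^{2}=x_{1}^{2}+x_{2}^{2}+x_{3}^{2}=14$. Differentiating each term separately gives

$$\nabla \lVert \boldsymbol{x} \rVert _{2}^{2}=\begin{pmatrix} 2x_1\\ 2x_2\\ 2x_3 \end{pmatrix} =\begin{pmatrix} 2\\ 4\\ 6 \end{pmatrix} =2\boldsymbol{x}$$

which has the same shape as $\boldsymbol{x}$, as expected.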
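③ $\nabla \left( \boldsymbol{x}^T\boldsymbol{Ax} \right) =\left( \boldsymbol{A}+\boldsymbol{A}^T \right) \boldsymbol{x}$

Proof: Apply the product rule, differentiating with respect to each occurrence of $\boldsymbol{x}$ in turn while holding the other fixed. The subscript $C$ marks the copy of $\boldsymbol{x}$ that is treated as a constant.

$$\begin{aligned} LHS&=\nabla \left( \boldsymbol{x}_{C}^{T}\boldsymbol{Ax} \right) +\nabla \left( \boldsymbol{x}^T\boldsymbol{Ax}_C \right) \\ &=\left( \boldsymbol{x}_{C}^{T}\boldsymbol{A} \right) ^T+\boldsymbol{Ax}_C\\ &=\left( \boldsymbol{A}+\boldsymbol{A}^T \right) \boldsymbol{x}\\ &=RHS \end{aligned}$$

The second line applies formula ① to each term, with $\boldsymbol{a}=\left( \boldsymbol{x}_{C}^{T}\boldsymbol{A} \right) ^T=\boldsymbol{A}^T\boldsymbol{x}_C$ in the first term and $\boldsymbol{a}=\boldsymbol{Ax}_C$ in the second. The third line drops the subscript, since $\boldsymbol{x}_C$ equals $\boldsymbol{x}$ once the differentiation is done.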
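Formula ③ can also be checked numerically. The sketch below is an illustration, not part of the original proof: the helper names and the deliberately non-symmetric sample matrix `A` are assumptions made for this example. It compares a central-difference gradient of $\boldsymbol{x}^T\boldsymbol{Ax}$ with $\left( \boldsymbol{A}+\boldsymbol{A}^T \right) \boldsymbol{x}$.

```python
def mat_vec(A, x):
    # Matrix-vector product A x, with A given as a list of rows.
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def quadratic_form(A, x):
    # x^T A x
    return sum(x_i * y_i for x_i, y_i in zip(x, mat_vec(A, x)))


def numerical_gradient(f, x, h=1e-6):
    # Central-difference approximation of the gradient of f at x.
    grad = []
    for i in range(len(x)):
        x_plus = x[:i] + [x[i] + h] + x[i + 1:]
        x_minus = x[:i] + [x[i] - h] + x[i + 1:]
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad


# Non-symmetric A, so that A x alone and 2 A x would both give the wrong answer.
A = [[1.0, 2.0, 0.0],
     [-3.0, 0.5, 4.0],
     [1.5, -1.0, 2.0]]
x = [0.3, 1.7, -2.0]

numeric = numerical_gradient(lambda v: quadratic_form(A, v), x)
analytic = [p + q for p, q in zip(mat_vec(A, x), mat_vec(transpose(A), x))]
max_error = max(abs(n - s) for n, s in zip(numeric, analytic))
print(max_error)  # should be close to zero, up to floating-point rounding
```

When $\boldsymbol{A}$ is symmetric the formula reduces to $2\boldsymbol{Ax}$, and with $\boldsymbol{A}=\boldsymbol{I}$ it recovers formula ②.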