Least Squares
Least squares regression is a classical ML algorithm that minimizes the total squared error between the model output and the training targets, i.e. it minimizes
$$J_{LS}(\theta) = \frac{1}{2}\sum_{i=1}^{n}\left(f_\theta(x_i) - y_i\right)^2$$
We want the estimate $\hat{\theta}_{LS}$ at which $J_{LS}$ is smallest:
$$\hat{\theta}_{LS} = \operatorname*{arg\,min}_{\theta} J_{LS}(\theta)$$
Take the linear model as an illustration.
$$f_\theta(x) = \sum_{j=1}^{b} \theta_j \phi_j(x) = \theta^{T}\phi(x)$$
Then
$$J_{LS}(\theta) = \frac{1}{2}\left\|\Phi\theta - y\right\|^2$$
where
$$\Phi = \begin{pmatrix} \phi_1(x_1) & \cdots & \phi_b(x_1) \\ \vdots & \ddots & \vdots \\ \phi_1(x_n) & \cdots & \phi_b(x_n) \end{pmatrix}$$
which is known as the design matrix.
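As a concrete illustration, here is a minimal NumPy sketch of building $\Phi$ for one-dimensional inputs, assuming a polynomial basis $\phi_j(x) = x^{j-1}$ (the basis choice and the sample points are made up for illustration, not prescribed by the text).

```python
# A minimal sketch: design matrix Phi for a 1-D input with a
# (hypothetical) polynomial basis phi_j(x) = x**(j-1), j = 1..b.
import numpy as np

def design_matrix(x, b):
    """Return the n-by-b design matrix Phi with Phi[i, j] = phi_{j+1}(x_i)."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, N=b, increasing=True)  # columns: 1, x, x**2, ...

x_train = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
Phi = design_matrix(x_train, b=3)              # shape (5, 3)
```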
It is not the minimum value of $J_{LS}$ itself that we need but the $\theta$ that attains it, so we take the gradient of $J_{LS}(\theta)$ and set it to zero:
$$\nabla_\theta J_{LS} = \left(\frac{\partial J_{LS}}{\partial \theta_1}, \cdots, \frac{\partial J_{LS}}{\partial \theta_b}\right)^{T} = \Phi^{T}\Phi\theta - \Phi^{T}y = 0$$
Then
$$\hat{\theta}_{LS} = (\Phi^{T}\Phi)^{-1}\Phi^{T}y \triangleq \Phi^{\dagger}y$$
where $\Phi^{\dagger}$ is the generalized (Moore-Penrose) pseudoinverse of $\Phi$.
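Continuing the sketch, the closed-form estimate can be computed with NumPy's pseudoinverse; `np.linalg.lstsq` solves the same minimization and is usually the numerically safer route. The toy data below are made up for illustration.

```python
# A minimal sketch of the closed-form solution theta_hat = Phi^† y.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 2.0, 20)
y_train = 1.0 + 2.0 * x_train + 0.1 * rng.standard_normal(x_train.size)

Phi = np.vander(x_train, N=2, increasing=True)          # basis: 1, x
theta_pinv = np.linalg.pinv(Phi) @ y_train              # theta_hat = Phi^† y
theta_lstsq, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)
# Both estimates should be roughly (1.0, 2.0).
```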
You can also apply weights to the training set:
$$\min_{\theta} \frac{1}{2}\sum_{i=1}^{n} w_i\left(f_\theta(x_i) - y_i\right)^2$$
Then
$$\hat{\theta}_{LS} = (\Phi^{T}W\Phi)^{\dagger}\Phi^{T}Wy$$
where $W = \operatorname{diag}(w_1, \dots, w_n)$.
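A minimal sketch of the weighted case, assuming the per-sample weights $w_i$ are given, might look like this.

```python
# Weighted least squares: theta_hat = (Phi^T W Phi)^† Phi^T W y,
# where W = diag(w_1, ..., w_n).
import numpy as np

def weighted_least_squares(Phi, y, w):
    W = np.diag(w)
    return np.linalg.pinv(Phi.T @ W @ Phi) @ Phi.T @ W @ y

x_train = np.linspace(0.0, 2.0, 10)
y_train = 1.0 + 2.0 * x_train
w = np.ones_like(x_train)          # uniform weights reduce to ordinary LS
Phi = np.vander(x_train, N=2, increasing=True)
theta_hat = weighted_least_squares(Phi, y_train, w)   # roughly (1.0, 2.0)
```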
If we take the kernel model as an example, i.e.
$$f_\theta(x) = \sum_{j=1}^{n} \theta_j K(x, x_j)$$
this is still a model that is linear in the parameters $\theta$. For simplicity, we just show the corresponding design matrix:
$$K = \begin{pmatrix} K(x_1, x_1) & \cdots & K(x_1, x_n) \\ \vdots & \ddots & \vdots \\ K(x_n, x_1) & \cdots & K(x_n, x_n) \end{pmatrix}$$
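For the kernel model, a small sketch assuming a Gaussian kernel $K(x, x') = \exp\!\left(-\frac{(x - x')^2}{2h^2}\right)$ (one common choice; the bandwidth $h$ and the toy data are made up) builds the Gram matrix and reuses the same pseudoinverse solution.

```python
# Least squares with a (hypothetical) Gaussian kernel: the design matrix
# is the n-by-n Gram matrix K[i, j] = K(x_i, x_j), and theta_hat = K^† y.
import numpy as np

def gaussian_kernel_matrix(x, centers, h=0.5):
    diff = x[:, None] - centers[None, :]
    return np.exp(-diff**2 / (2.0 * h**2))

x_train = np.linspace(-1.0, 1.0, 15)
y_train = np.sin(np.pi * x_train)

K = gaussian_kernel_matrix(x_train, x_train)   # n-by-n design matrix
theta_hat = np.linalg.pinv(K) @ y_train        # theta_hat = K^† y

def predict(x_new):
    """Evaluate f_theta(x) = sum_j theta_j K(x, x_j) at new inputs."""
    return gaussian_kernel_matrix(np.atleast_1d(x_new), x_train) @ theta_hat
```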
Note that the least squares estimator has the property of asymptotic unbiasedness: when the noise in $y$ has zero mean, its effect cancels out in expectation, so
$$\mathbb{E}[\hat{\theta}_{LS}] = \theta^{*}$$
where $\theta^{*}$ denotes the true parameter.
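A quick simulation, with made-up true parameters and zero-mean Gaussian noise, illustrates this: averaging $\hat{\theta}_{LS}$ over many noisy datasets recovers $\theta^{*}$.

```python
# Empirical check of unbiasedness: with zero-mean noise, the average of
# theta_hat over many noisy datasets approaches the true theta*.
import numpy as np

rng = np.random.default_rng(0)
theta_star = np.array([1.0, 2.0])              # assumed "true" parameters
x_train = np.linspace(0.0, 2.0, 20)
Phi = np.vander(x_train, N=2, increasing=True)

estimates = []
for _ in range(2000):
    noise = 0.3 * rng.standard_normal(x_train.size)   # zero-mean noise
    y = Phi @ theta_star + noise
    estimates.append(np.linalg.pinv(Phi) @ y)

print(np.mean(estimates, axis=0))   # close to theta_star = [1.0, 2.0]
```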