Huber Loss Minimization Learning

Huber regression
In least squares learning methods, we make use of the $\ell_2$ loss to make sure that we get an accurate outcome. However, from the robustness point of view, it is often better to use the least absolute deviations as the main criterion, i.e.

$$\hat{\theta}_{\mathrm{LA}} = \arg\min_{\theta} J_{\mathrm{LA}}(\theta), \qquad J_{\mathrm{LA}}(\theta) = \sum_{i=1}^{n} |r_i|$$

where $r_i = f_\theta(x_i) - y_i$ is the residual error. By doing so, the learning method becomes more robust, at the cost of some accuracy.
In order to balance robustness and accuracy, the Huber loss is a good alternative:
$$\rho_{\mathrm{Huber}}(r) = \begin{cases} r^2/2 & (|r| \le \eta) \\ \eta|r| - \eta^2/2 & (|r| > \eta) \end{cases}$$
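As an aside (a minimal sketch of my own, not part of the original post), the piecewise definition can be coded directly in MATLAB; huber_loss is a hypothetical helper name and e plays the role of η:

function rho = huber_loss(r, e)
% Elementwise Huber loss with threshold e (eta in the text). Sketch only.
rho = r.^2/2;                          % quadratic part, used where |r| <= e
idx = abs(r) > e;
rho(idx) = e*abs(r(idx)) - e^2/2;      % linear part, used where |r| > e
end

Plotting huber_loss(r,1) over, say, r = -3:0.01:3 shows the quadratic bowl turning into straight lines beyond |r| = 1.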

Then the optimization goal turns out to be:
$$\min_{\theta} J(\theta), \qquad J(\theta) = \sum_{i=1}^{n} \rho_{\mathrm{Huber}}(r_i)$$

As usual, take the linear parameterized model as an example:
$$f_\theta(x) = \sum_{j=1}^{b} \theta_j \phi_j(x) = \theta^T \phi(x)$$

For simplicity, we omit the derivation and give the final result (for more details, refer to the earlier post on $\ell_1$-constrained least squares):
$$\hat{\theta} = \arg\min_{\theta} \tilde{J}(\theta), \qquad \tilde{J}(\theta) = \frac{1}{2} \sum_{i=1}^{n} \tilde{\omega}_i r_i^2 + C$$

where
$$\tilde{\omega}_i = \begin{cases} 1 & (|\tilde{r}_i| \le \eta) \\ \eta/|\tilde{r}_i| & (|\tilde{r}_i| > \eta) \end{cases}, \qquad C = \sum_{i: |\tilde{r}_i| > \eta} \left( \frac{\eta|\tilde{r}_i|}{2} - \frac{\eta^2}{2} \right)$$
are independent of $\theta$.
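The omitted step is essentially the standard quadratic upper-bound argument (this sketch is my reading; it is not spelled out in the original): writing $\tilde{r}_i$ for the residual under the current parameter estimate, the Huber loss is majorized by a quadratic function that touches it at $r = \tilde{r}_i$:

$$\rho_{\mathrm{Huber}}(r) \le \frac{\tilde{\omega}}{2} r^2 + c(\tilde{r}), \qquad \tilde{\omega} = \begin{cases} 1 & (|\tilde{r}| \le \eta) \\ \eta/|\tilde{r}| & (|\tilde{r}| > \eta) \end{cases}, \qquad c(\tilde{r}) = \begin{cases} 0 & (|\tilde{r}| \le \eta) \\ \eta|\tilde{r}|/2 - \eta^2/2 & (|\tilde{r}| > \eta) \end{cases}$$

with equality at $r = \tilde{r}$. Summing these bounds over all samples gives exactly $\tilde{J}(\theta)$, so alternately minimizing $\tilde{J}$ and refreshing the weights never increases $J(\theta)$.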
Therefore, the solution can be formulated as:
$$\hat{\theta} = (\Phi^T \tilde{W} \Phi)^{-1} \Phi^T \tilde{W} y$$
where $\tilde{W} = \mathrm{diag}(\tilde{\omega}_1, \ldots, \tilde{\omega}_n)$.
By iterating this weighted least-squares step until $\hat{\theta}$ stops changing, we obtain an estimate of $\theta$. The corresponding MATLAB code is given below:
n=50; N=1000;                                  % number of training samples and test points
x=linspace(-3,3,n)'; X=linspace(-4,4,N)';      % training inputs and test inputs
y=x+0.2*randn(n,1); y(n)=-4;                   % noisy targets, with the last sample turned into an outlier

p(:,1)=ones(n,1); p(:,2)=x;                    % design matrix Phi for the straight-line model
t0=p\y; e=1;                                   % ordinary LS initialization; e is the threshold eta
for o=1:1000
    r=abs(p*t0-y);                             % current absolute residuals
    w=ones(n,1); w(r>e)=e./r(r>e);             % Huber weights
    t=(p'*(repmat(w,1,2).*p))\(p'*(w.*y));     % weighted LS solution (Phi'*W*Phi)\(Phi'*W*y)
    if norm(t-t0)<0.001, break, end            % stop once the estimate has converged
    t0=t;
end
P(:,1)=ones(N,1); P(:,2)=X; F=P*t;             % evaluate the fitted line on the test inputs

figure(1); clf; hold on; axis([-4,4,-4.5,3.5]);
plot(X,F,'g-'); plot(x,y,'bo');                % fitted line (green) and training data (blue circles)

[Figure: Huber regression fit (green line) to the training samples (blue circles), which include one outlier.]

Tukey regression
The Huber loss combines the $\ell_1$ loss and the $\ell_2$ loss to balance robustness and accuracy. However, since the $\ell_1$ part still grows with the residual, outliers can still have a noticeable impact on the final outcome. To tackle that, the Tukey loss is a reasonable alternative:

$$\rho_{\mathrm{Tukey}}(r) = \begin{cases} \left(1 - \left[1 - r^2/\eta^2\right]^3\right)\eta^2/6 & (|r| \le \eta) \\ \eta^2/6 & (|r| > \eta) \end{cases}$$

Of course, the Tukey loss is not a convex function, which means there may be several local optimal solutions. In practical applications, we apply the following weights:
$$\omega = \begin{cases} (1 - r^2/\eta^2)^2 & (|r| \le \eta) \\ 0 & (|r| > \eta) \end{cases}$$

Hence samples with $|r| > \eta$ receive zero weight, so outliers no longer have any impact on our estimation.
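The original post gives no code for this variant; below is a minimal sketch of my own that reuses p, y and the threshold e from the Huber example above and only swaps the weight update to the Tukey weights:

% Tukey (biweight) version of the iteration above; reuses p, y from the Huber example.
e=1; t0=p\y;                                   % threshold eta and ordinary LS initialization
for o=1:1000
    r=abs(p*t0-y);                             % current absolute residuals
    w=(1-r.^2/e^2).^2;                         % Tukey weights where |r| <= e
    w(r>e)=0;                                  % outliers receive zero weight
    t=(p'*(repmat(w,1,2).*p))\(p'*(w.*y));     % weighted LS step, as before
    if norm(t-t0)<0.001, break, end
    t0=t;
end

Because the Tukey loss is non-convex, the result can depend on the initial value t0; starting from the ordinary least-squares or the Huber solution is a common choice.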