Logistic Regression & Least Squares Probability Classification
1. Logistic Regression
The likelihood function, as interpreted by Wikipedia, plays a key role in statistical inference, especially in methods of estimating a parameter from a set of statistics. In this article, we will make full use of it.
Pattern recognition works by learning the posterior probability $p(y|x)$ of class $y$ given pattern $x$. The posterior probability can be seen as the credibility of assigning pattern $x$ to class $y$.

In the logistic regression algorithm, we use a log-linear model to express the posterior probability:

$$q(y|x;\theta) = \frac{\exp\!\left(\theta^{(y)T}\phi(x)\right)}{\sum_{y'=1}^{c}\exp\!\left(\theta^{(y')T}\phi(x)\right)}$$

Note that the denominator is a normalization term which makes the probabilities sum to one over the $c$ classes. The logistic regression model is then trained by maximizing the log-likelihood of the training samples:

$$\max_{\theta}\; J(\theta)=\sum_{i=1}^{n}\log q(y_i|x_i;\theta)$$
We can solve it by stochastic gradient ascent:

1. Initialize $\theta = (\theta^{(1)T},\ldots,\theta^{(c)T})^{T}$.
2. Pick a training sample $(x_i, y_i)$ at random.
3. Update $\theta$ along the direction of gradient ascent:
$$\theta^{(y)} \leftarrow \theta^{(y)} + \epsilon\,\nabla_y J_i(\theta),\quad y=1,\ldots,c$$
$$\nabla_y J_i(\theta) = -\frac{\exp\!\left(\theta^{(y)T}\phi(x_i)\right)\phi(x_i)}{\sum_{y'=1}^{c}\exp\!\left(\theta^{(y')T}\phi(x_i)\right)} + \begin{cases}\phi(x_i) & (y = y_i)\\ 0 & (y \neq y_i)\end{cases}$$
4. Go back to steps 2 and 3 until $\theta$ reaches a suitable precision.
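Before specializing to kernels, here is a minimal MATLAB sketch of a single update of step 3, assuming the feature vectors are precomputed; the names Phi, theta and eps0 are illustrative and not taken from the script below.

% One stochastic gradient-ascent update for the softmax model (sketch).
% Assumes features are stored row-wise in Phi (n x b) and labels in y (1..c).
n=20; b=5; c=3; eps0=0.1;
Phi=randn(n,b); y=randi(c,n,1); theta=zeros(b,c);
i=ceil(rand*n);                       % pick a training sample at random
p=exp(Phi(i,:)*theta);                % unnormalized class scores exp(theta^(y)'*phi(x_i))
grad=-Phi(i,:)'*p/sum(p);             % -q(y|x_i;theta)*phi(x_i) for every class y
grad(:,y(i))=grad(:,y(i))+Phi(i,:)';  % +phi(x_i) for the true class y_i only
theta=theta+eps0*grad;                % ascent step with step size eps0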
Take the Gaussian kernel model as an example, where the basis functions are kernel values centered at the training samples:

$$q(y|x;\theta) \propto \exp\!\left(\sum_{j=1}^{n}\theta_j^{(y)} K(x,x_j)\right),\qquad K(x,c)=\exp\!\left(-\frac{\|x-c\|^2}{2h^2}\right)$$

If you are not familiar with the Gaussian kernel model, refer to the earlier article on it. Here is the corresponding MATLAB code:
n=90; c=3; y=ones(n/c,1)*(1:c); y=y(:);                  % n samples, c classes, labels 1..c
x=randn(n/c,c)+repmat(linspace(-3,3,c),n/c,1); x=x(:);   % class means at -3, 0, 3
hh=2*1^2; t0=randn(n,c);                                 % hh = 2*h^2 (kernel width), random init
for o=1:n*1000
  i=ceil(rand*n); yi=y(i); ki=exp(-(x-x(i)).^2/hh);      % pick a sample; its Gaussian kernel vector
  ci=exp(ki'*t0);                                        % unnormalized class scores
  t=t0-0.1*(ki*ci)/(1+sum(ci));                          % subtract ~q(y|x_i)*k_i for every class
  t(:,yi)=t(:,yi)+0.1*ki;                                % add k_i for the true class only
  if norm(t-t0)<0.000001
    break;                                               % stop when the parameters have converged
  end
  t0=t;
end
N=100; X=linspace(-5,5,N)';                              % test points
K=exp(-(repmat(X.^2,1,n)+repmat(x.^2',N,1)-2*X*x')/hh);  % test-vs-train kernel matrix
figure(1); clf; hold on; axis([-5,5,-0.3,1.8]);
C=exp(K*t); C=C./repmat(sum(C,2),1,c);                   % estimated posteriors q(y|x)
plot(X,C(:,1),'b-');
plot(X,C(:,2),'r--');
plot(X,C(:,3),'g:');
plot(x(y==1),-0.1*ones(n/c,1),'bo');                     % training samples of each class
plot(x(y==2),-0.2*ones(n/c,1),'rx');
plot(x(y==3),-0.1*ones(n/c,1),'gv');
legend('q(y=1|x)','q(y=2|x)','q(y=3|x)');
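The matrix C computed above holds the estimated posteriors at the grid points X; if hard class decisions are needed, one could simply take the class with the largest estimated posterior (yhat is an illustrative name, not part of the original script):

[~,yhat]=max(C,[],2);   % predicted labels for the grid points in X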
2. Least Squares Probability Classification
In the least squares probability classifier, a linearly parameterized model is used to express the posterior probability:

$$q(y|x;\theta^{(y)}) = \theta^{(y)T}\phi(x) = \sum_{j=1}^{b}\theta_j^{(y)}\phi_j(x)$$

These models depend on the parameters $\theta^{(y)}=(\theta_1^{(y)},\ldots,\theta_b^{(y)})^{T}$, one vector for each class $y$. They are learned by minimizing the squared error to the true posterior:

$$J_y(\theta^{(y)}) = \frac{1}{2}\int\left(q(y|x;\theta^{(y)})-p(y|x)\right)^2 p(x)\,dx$$

By the Bayesian formula,

$$p(y|x)\,p(x) = p(x,y) = p(x|y)\,p(y)$$

Hence

$$J_y(\theta^{(y)}) = \frac{1}{2}\int q(y|x;\theta^{(y)})^2\,p(x)\,dx - \int q(y|x;\theta^{(y)})\,p(x|y)\,p(y)\,dx + \frac{1}{2}\int p(y|x)^2\,p(x)\,dx$$

Note that the first term and the second term in the equation above stand for the mathematical expectation of $q(y|x;\theta^{(y)})^2$ over $p(x)$ and of $q(y|x;\theta^{(y)})$ over $p(x|y)$ (weighted by the class prior $p(y)$), respectively, while the last term does not depend on $\theta^{(y)}$. Due to the fact that these expectations are unknown, we approximate them by sample averages: the expectation over $p(x)$ by the average over all training samples, and the expectation over $p(x|y)$ by the average over the samples of class $y$.

Next, we introduce the regularization term to get the following calculation rule:

$$\hat{J}_y(\theta^{(y)}) = \frac{1}{2n}\,\theta^{(y)T}\Phi^{T}\Phi\,\theta^{(y)} - \frac{1}{n}\,\theta^{(y)T}\Phi^{T}\pi^{(y)} + \frac{\lambda}{2n}\left\|\theta^{(y)}\right\|^2$$

Let $\Phi$ denote the $n\times b$ design matrix with $\Phi_{ij}=\phi_j(x_i)$, and let $\pi^{(y)}=(\pi_1^{(y)},\ldots,\pi_n^{(y)})^{T}$ with $\pi_i^{(y)}=1$ if $y_i=y$ and $\pi_i^{(y)}=0$ otherwise.

Therefore, it is evident that the problem above is a convex optimization problem, and we can get the analytic solution by setting the first-order derivative to zero:

$$\hat{\theta}^{(y)} = \left(\Phi^{T}\Phi + \lambda I\right)^{-1}\Phi^{T}\pi^{(y)}$$

In order not to get a negative estimation of the posterior probability, we clip negative outputs at zero and then normalize:

$$\hat{p}(y|x) = \frac{\max\!\left(0,\ \hat{\theta}^{(y)T}\phi(x)\right)}{\sum_{y'=1}^{c}\max\!\left(0,\ \hat{\theta}^{(y')T}\phi(x)\right)}$$
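As a minimal MATLAB sketch of this analytic solution, using a generic design matrix rather than the class-wise kernel shortcut that appears in the script below (the names Phi, pi_y and q are illustrative):

% Analytic LSPC solution for a generic design matrix Phi (n x b), labels y in 1..c.
n=20; b=5; c=3; l=0.1;                            % l is the regularization parameter lambda
Phi=randn(n,b); y=randi(c,n,1); theta=zeros(b,c);
for yy=1:c
  pi_y=double(y==yy);                             % indicator vector pi^(y)
  theta(:,yy)=(Phi'*Phi+l*eye(b))\(Phi'*pi_y);    % (Phi'Phi + lambda I)^{-1} Phi' pi^(y)
end
q=max(0,Phi*theta);                               % clip negative posterior estimates
q=q./repmat(max(sum(q,2),1e-12),1,c);             % normalize over classes (guard against all-zero rows)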
We also take the Gaussian kernel model as an example, where the basis functions $\phi_j(x)$ are the kernel values $K(x,x_j)$:
n=90; c=3; y=ones(n/c,1)*(1:c); y=y(:);                  % same toy data as before
x=randn(n/c,c)+repmat(linspace(-3,3,c),n/c,1); x=x(:);
hh=2*1^2; x2=x.^2; l=0.1; N=100; X=linspace(-5,5,N)';    % kernel width 2*h^2, regularizer l
k=exp(-(repmat(x2,1,n)+repmat(x2',n,1)-2*x*(x'))/hh);    % train-vs-train kernel matrix
K=exp(-(repmat(X.^2,1,n)+repmat(x2',N,1)-2*X*(x'))/hh);  % test-vs-train kernel matrix
Kt=zeros(N,c);
for yy=1:c
  yk=(y==yy); ky=k(:,yk);                                % kernel columns of class yy
  ty=(ky'*ky+l*eye(sum(yk)))\(ky'*yk);                   % regularized least-squares solution
  Kt(:,yy)=max(0,K(:,yk)*ty);                            % clip negative estimates at zero
end
ph=Kt./repmat(sum(Kt,2),1,c);                            % normalized posterior estimates
figure(1); clf; hold on; axis([-5,5,-0.3,1.8]);
plot(X,ph(:,1),'b-');
plot(X,ph(:,2),'r--');
plot(X,ph(:,3),'g:');
plot(x(y==1),-0.1*ones(n/c,1),'bo');                     % training samples of each class
plot(x(y==2),-0.2*ones(n/c,1),'rx');
plot(x(y==3),-0.1*ones(n/c,1),'gv');
legend('q(y=1|x)','q(y=2|x)','q(y=3|x)');
3. Summary
Logistic regression is good at dealing with small sample sets since it works in a simple way. However, when the number of samples becomes large, it is better to turn to the least squares probability classifier, since its solution is obtained analytically rather than by iterative gradient updates.