A Brief Introduction to Logistic Regression and Least-Squares Probabilistic Classification, with Examples

2017-04-08

Logistic Regression & Least Square Probability Classification

1. Logistic Regression

The likelihood function (see Wikipedia: https://en.wikipedia.org/wiki/Likelihood_function) plays a key role in statistical inference, especially in methods for estimating a parameter from a set of statistics. This article makes extensive use of it.
Pattern recognition can be framed as learning the posterior probability $p(y|x)$ that a pattern $x$ belongs to class $y$. Given a pattern $x$, we assign it to the class $y$ whose posterior probability is the largest, i.e.

$\hat{y}=\arg\max_{y=1,\dots,c}p(y|x)$

The posterior probability can be read as the confidence that pattern $x$ belongs to class $y$.
Logistic regression models the posterior probability with a log-linear function:

$q(y|x,\theta)=\frac{\exp\left( \sum_{j=1}^{b}\theta_{j}^{(y)}\phi_{j}(x) \right)}{\sum_{y'=1}^{c}\exp\left( \sum_{j=1}^{b}\theta_{j}^{(y')}\phi_{j}(x) \right)}$

Note that the denominator is a normalization term: it makes the $c$ class probabilities sum to one. Logistic regression is then defined by the following optimization problem, which maximizes the log-likelihood of the training samples:

$\max_{\theta}\sum_{i=1}^{n}\log q(y_{i}|x_{i},\theta)$
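To make the model and the objective concrete, here is a minimal Python sketch that evaluates $q(y|x,\theta)$ and the log-likelihood for a toy problem. The function names, the toy values, and the 0-based class labels are assumptions made for illustration; they are not part of the original article.

```python
import math

def posterior(theta, phi_x):
    """Return [q(y|x, theta) for each class y] under the log-linear model.

    theta: list of c parameter vectors theta^(y), each of length b.
    phi_x: basis-function values phi(x), length b.
    """
    scores = [sum(t_j * p_j for t_j, p_j in zip(t, phi_x)) for t in theta]
    m = max(scores)  # subtracting the max avoids overflow and cancels in the ratio
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)  # normalization term (the denominator)
    return [e / z for e in exps]

def log_likelihood(theta, samples):
    """Sum of log q(y_i | x_i, theta); labels y_i are 0-based in this sketch."""
    return sum(math.log(posterior(theta, phi_x)[y]) for phi_x, y in samples)

# Hypothetical toy data: c = 2 classes, b = 2 basis functions.
theta = [[1.0, 0.0], [0.0, 1.0]]
samples = [([2.0, 0.0], 0), ([0.0, 2.0], 1)]
q = posterior(theta, samples[0][0])
print(q, log_likelihood(theta, samples))
```

The class probabilities returned by `posterior` always sum to one, which is exactly the role of the denominator in the model.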
Because this is a maximization problem, we can solve it by stochastic gradient ascent. Let $J_{i}(\theta)=\log q(y_{i}|x_{i},\theta)$ denote the log-likelihood of the $i$-th sample.

Step 1. Initialize $\theta$.
Step 2. Pick a training sample $(x_{i},y_{i})$ at random.
Step 3. Update $\theta=({\theta^{(1)}}^{T},\dots, {\theta^{(c)}}^{T})^{T}$ along the direction of gradient ascent: $\theta^{(y)}\leftarrow \theta^{(y)}+\epsilon\nabla_{y}J_{i}(\theta),\quad y=1,\dots,c$ where $\epsilon>0$ is the step size and $\nabla_{y}J_{i}(\theta)=-\frac{\exp\left( {\theta^{(y)}}^{T}\phi(x_{i}) \right)\phi(x_{i})}{\sum_{y'=1}^{c}\exp\left( {\theta^{(y')}}^{T}\phi(x_{i}) \right)}+\left\{\begin{aligned} &\phi(x_{i})\quad &(y=y_{i})\\ &0 &(y\neq y_{i}) \end{aligned}\right.$
Step 4. Repeat steps 2 and 3 until $\theta$ has converged to the desired precision.
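The steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names (`sga_step`, `fit`, `predict`), the basis $\phi(x)=(x,1)^{T}$, the toy samples, the fixed iteration count used in place of a convergence test, and the 0-based labels are all hypothetical choices, not taken from the original article.

```python
import math
import random

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def sga_step(theta, phi_x, y, eps):
    """Apply theta^(k) <- theta^(k) + eps * grad_k J_i for every class k (0-based)."""
    q = softmax([sum(t * p for t, p in zip(tk, phi_x)) for tk in theta])
    for k, tk in enumerate(theta):
        coeff = (1.0 if k == y else 0.0) - q[k]  # indicator of the true class minus posterior
        for j, p in enumerate(phi_x):
            tk[j] += eps * coeff * p

def fit(samples, c, b, eps=0.1, iters=2000, seed=0):
    rng = random.Random(seed)
    theta = [[0.0] * b for _ in range(c)]   # step 1: initialize theta
    for _ in range(iters):                  # fixed budget instead of a convergence test
        phi_x, y = rng.choice(samples)      # step 2: pick a sample at random
        sga_step(theta, phi_x, y, eps)      # step 3: gradient ascent update
    return theta

# Hypothetical toy problem: phi(x) = (x, 1), two classes separated at x = 0.
samples = [([-2.0, 1.0], 0), ([-1.0, 1.0], 0), ([1.0, 1.0], 1), ([2.0, 1.0], 1)]
theta = fit(samples, c=2, b=2)

def predict(phi_x):
    """Return the class with the largest score, i.e. the largest posterior."""
    return max(range(len(theta)), key=lambda k: sum(t * p for t, p in zip(theta[k], phi_x)))

print([predict(phi_x) for phi_x, _ in samples])
```

The coefficient `(1 if k == y else 0) - q[k]` is the gradient formula of step 3 with the common factor $\phi(x_{i})$ pulled out.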

Take the Gaussian kernel model as an example:

$q(y|x,\theta) \propto \exp\left( \sum_{j=1}^{n}\theta_{j}^{(y)}K(x,x_{j}) \right),\quad K(x,x')=\exp\left( -\frac{\|x-x'\|^{2}}{2h^{2}} \right)$

If you are not familiar with the Gaussian kernel model, refer to this article:

http://blog.csdn.net/philthinker/article/details/65628280

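Here is the corresponding MATLAB code:

% Toy data: n samples in c classes, 1-D inputs centered at -3, 0 and 3
n=90; c=3; y=ones(n/c,1)*(1:c); y=y(:);
x=randn(n/c,c)+repmat(linspace(-3,3,c),n/c,1);x=x(:);

hh=2*1^2; t0=randn(n,c);  % hh = 2*h^2 with bandwidth h = 1; random initial theta
for o=1:n*1000
    i=ceil(rand*n); yi=y(i); ki=exp(-(x-x(i)).^2/hh);  % random sample and its kernel vector
    % Softmax part of the gradient (step size 0.1); the denominator is
    % sum(ci), matching the normalization term of the model
    ci=exp(ki'*t0); t=t0-0.1*(ki*ci)/sum(ci);
    t(:,yi)=t(:,yi)+0.1*ki;  % add phi(x_i) for the true class
    if norm(t-t0)<0.000001   % stop once the update is negligible
        break;
    end
    t0=t;
end

% Evaluate the learned posteriors on a grid of N test points
N=100; X=linspace(-5,5,N)';
K=exp(-(repmat(X.^2,1,n)+repmat(x.^2',N,1)-2*X*x')/hh);

figure(1); clf; hold on; axis([-5,5,-0.3,1.8]);
C=exp(K*t); C=C./repmat(sum(C,2),1,c);  % normalize across classes
plot(X,C(:,1),'b-');
plot(X,C(:,2),'r--');
plot(X,C(:,3),'g:');
plot(x(y==1),-0.1*ones(n/c,1),'bo');
plot(x(y==2),-0.2*ones(n/c,1),'rx');
plot(x(y==3),-0.1*ones(n/c,1),'gv');
legend('q(y=1|x)','q(y=2|x)','q(y=3|x)');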

2. Least Square Probability Classification

In the least-squares probabilistic classifier, a linear-in-parameter model is used to express the posterior probability:

$q(y|x,\theta^{(y)})=\sum_{j=1}^{b}\theta_{j}^{(y)}\phi_{j}(x)={\theta^{(y)}}^{T}\phi(x),\quad y=1,\dots,c$

Unlike the model used by the logistic classifier, the model for each class $y$ depends only on that class's own parameter vector $\theta^{(y)}=(\theta_{1}^{(y)},\dots, \theta_{b}^{(y)})^{T}$. Learning these models means minimizing the following squared error:

$\begin{split}J_{y}(\theta^{(y)})= & \frac{1}{2}\int\left( q(y|x,\theta^{(y)})-p(y|x) \right)^{2}p(x)\mathrm{d}x \\ =& \frac{1}{2}\int q(y|x,\theta^{(y)})^{2}p(x) \mathrm{d}x-\int q(y|x,\theta^{(y)})p(y|x)p(x)\mathrm{d}x \\ &+ \frac{1}{2}\int p(y|x)^{2}p(x) \mathrm{d}x\end{split}$

where $p(x)$ is the probability density of the training inputs $\{x_{i}\}_{i=1}^{n}$.
By Bayes' rule,

$p(y|x)p(x)=p(x,y)=p(x|y)p(y)$

Hence $J_{y}$ can be reformulated as

$J_{y}(\theta^{(y)})=\frac{1}{2}\int q(y|x,\theta^{(y)})^{2}p(x) \mathrm{d}x-\int q(y|x,\theta^{(y)})p(x|y)p(y)\mathrm{d}x+ \frac{1}{2}\int p(y|x)^{2}p(x) \mathrm{d}x$

Note that the first and second terms in the equation above are expectations with respect to $p(x)$ and $p(x|y)$ respectively, which usually cannot be computed directly. The last term is independent of $\theta^{(y)}$ and can therefore be omitted.
Because $p(x|y)$ is the probability density of the samples $x$ belonging to class $y$, the first and second terms can be estimated by the following sample averages:

$\frac{1}{n}\sum_{i=1}^{n}q(y|x_{i},\theta^{(y)})^{2},\quad \frac{1}{n_{y}}\sum_{i:y_{i}=y}^{}q(y|x_{i},\theta^{(y)})p(y)$

where $n_{y}$ is the number of training samples in class $y$. Estimating the class prior $p(y)$ by $n_{y}/n$ and adding an $\ell_{2}$ regularization term gives the following training criterion:

$\hat{J}_{y}(\theta^{(y)})=\frac{1}{2n}\sum_{i=1}^{n}q(y|x_{i},\theta^{(y)})^{2}-\frac{1}{n}\sum_{i:y_{i}=y}^{}q(y|x_{i},\theta^{(y)})+\frac{\lambda}{2n}\|\theta^{(y)}\|^{2}$
Let $\pi^{(y)}=(\pi_{1}^{(y)},\dots,\pi_{n}^{(y)})^{T}$ with $\pi_{i}^{(y)}=\left\{\begin{aligned}&1\quad (y_{i}=y)\\ &0 \quad (y_{i}\neq y)\end{aligned}\right.$ , and let $\Phi$ be the $n\times b$ design matrix with entries $\Phi_{i,j}=\phi_{j}(x_{i})$. Then

$\hat{J}_{y}(\theta^{(y)})=\frac{1}{2n}{\theta^{(y)}}^{T}\Phi^{T}\Phi\theta^{(y)}-\frac{1}{n}{\theta^{(y)}}^{T}\Phi^{T}\pi^{(y)}+\frac{\lambda}{2n}\|\theta^{(y)}\|^{2}$ .
This criterion is a convex quadratic function of $\theta^{(y)}$, so the minimizer can be obtained analytically by setting the first-order derivative (the gradient) to zero:

$\hat{\theta}^{(y)}=\left( \Phi^{T}\Phi+\lambda I \right)^{-1}\Phi^{T}\pi^{(y)}$ .
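For completeness, the analytic solution can be worked out in two lines from the matrix form of $\hat{J}_{y}$. This is a standard derivation added here as a sketch; it assumes $\lambda>0$ so that the matrix being inverted is positive definite.

$\nabla_{\theta^{(y)}}\hat{J}_{y}(\theta^{(y)})=\frac{1}{n}\Phi^{T}\Phi\theta^{(y)}-\frac{1}{n}\Phi^{T}\pi^{(y)}+\frac{\lambda}{n}\theta^{(y)}=0$

$\Longrightarrow\quad \left( \Phi^{T}\Phi+\lambda I \right)\theta^{(y)}=\Phi^{T}\pi^{(y)}$

The Hessian is $\frac{1}{n}\left( \Phi^{T}\Phi+\lambda I \right)$, which is positive definite for $\lambda>0$, so this stationary point is the unique minimizer.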
Because the linear model is unconstrained, the estimated posterior probability can be negative. To avoid this, we round negative outputs up to zero and renormalize so that the estimates sum to one:

$\hat{p}(y|x)=\frac{\max(0,{\hat{\theta}^{(y)}}^{T}\phi(x))}{\sum_{y'=1}^{c}\max(0,{\hat{\theta}^{(y')}}^{T}\phi(x))}$
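The clipping and renormalization step is small enough to show in a few lines of Python. The function name and the input values are hypothetical, and the uniform fallback for the case where every output is clipped to zero is an assumption of this sketch: the formula above leaves that case undefined.

```python
def normalize_posterior(raw):
    """Clip negative least-squares outputs to zero, then rescale them to sum to one."""
    clipped = [max(0.0, r) for r in raw]
    total = sum(clipped)
    if total == 0.0:
        # Assumption: the source formula does not cover this case; fall back to uniform.
        return [1.0 / len(raw)] * len(raw)
    return [v / total for v in clipped]

# Hypothetical raw outputs theta_hat^(y)^T phi(x) for c = 3 classes.
print(normalize_posterior([0.75, -0.25, 0.25]))  # prints [0.75, 0.0, 0.25]
```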

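Again, take the Gaussian kernel model as an example:

% Toy data generated in the same way as in Section 1
n=90; c=3; y=ones(n/c,1)*(1:c); y=y(:);
x=randn(n/c,c)+repmat(linspace(-3,3,c),n/c,1);x=x(:);

% hh = 2*h^2 with bandwidth h = 1; l is the regularization parameter lambda
hh=2*1^2; x2=x.^2; l=0.1; N=100; X=linspace(-5,5,N)';
k=exp(-(repmat(x2,1,n)+repmat(x2',n,1)-2*x*(x'))/hh);     % kernel matrix on the training points
K=exp(-(repmat(X.^2,1,n)+repmat(x2',N,1)-2*X*(x'))/hh);   % kernels between test grid and training points
for yy=1:c
    yk=(y==yy); ky=k(:,yk);                 % use kernels centered at class-yy samples only
    ty=(ky'*ky +l*eye(sum(yk)))\(ky'*yk);   % regularized least-squares solution
    Kt(:,yy)=max(0,K(:,yk)*ty);             % clip negative outputs to zero
end
ph=Kt./repmat(sum(Kt,2),1,c);               % normalize across classes

% Plot the estimated posteriors ph (not C=exp(K*t), which belongs to the logistic script)
figure(1); clf; hold on; axis([-5,5,-0.3,1.8]);
plot(X,ph(:,1),'b-');
plot(X,ph(:,2),'r--');
plot(X,ph(:,3),'g:');
plot(x(y==1),-0.1*ones(n/c,1),'bo');
plot(x(y==2),-0.2*ones(n/c,1),'rx');
plot(x(y==3),-0.1*ones(n/c,1),'gv');
legend('q(y=1|x)','q(y=2|x)','q(y=3|x)');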

3. Summary

Logistic regression works well on small training sets, where its simple iterative updates are cheap. When the number of samples becomes large, however, it is better to turn to the least-squares probabilistic classifier, whose parameters are obtained analytically for each class rather than by repeated gradient updates.
