3. Logistic Regression

Everyone knows the sign function:

    f(x) = 1    (x > 0)
         = 0    (x = 0)
         = -1   (x < 0)

The following function is a variant of the sign function:

    g(x) = 1    (x > 0)
         = 0.5  (x = 0)
         = 0    (x < 0)

The logistic function

    g(z) = 1 / (1 + e^(-z))

matches this variant in the limit, yet is continuous everywhere; its curve is the familiar S shape. We call this the logistic function. Starting from the linear model y = wx, we let z = wx, so that g(z) = 1 / (1 + e^(-wx)). Then when z = 0, g(z) = 0.5; when z > 0, g(z) > 0.5 and tends to 1; and when z < 0, g(z) < 0.5 and tends to 0, which is exactly the behavior needed for binary classification. sklearn.linear_model implements logistic regression in the LogisticRegression class.
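As a quick sanity check of these three cases, the logistic function can be evaluated directly (a minimal sketch; `sigmoid` is our own helper name, not part of sklearn):

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real z into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))   # exactly 0.5
print(sigmoid(5.0))   # > 0.5, approaching 1
print(sigmoid(-5.0))  # < 0.5, approaching 0
```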
```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Analyze an sklearn dataset with logistic regression
def useing_sklearn_datasets_for_LogisticRegression():
    # Load and split the data
    cancer = datasets.load_breast_cancer()
    X = cancer.data
    y = cancer.target
    print("X shape={}, positive samples: {}, negative samples: {}".format(
        X.shape, y[y == 1].shape[0], y[y == 0].shape[0]))
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    # Train the model
    model = LogisticRegression()
    model.fit(X_train, y_train)
    # Check the model scores
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    print("breast cancer training set score: {trs:.2f}, test set score: {tss:.2f}".format(
        trs=train_score, tss=test_score))
```

Output:

```
X shape=(569, 30), positive samples: 357, negative samples: 212
breast cancer training set score: 0.95, test set score: 0.95
```

These results are quite satisfactory.
4. Ridge Regression

Ridge regression (also known as Tikhonov regularization) is a biased-estimation regression method designed for collinear data. It is essentially an improved least-squares estimator: by giving up the unbiasedness of ordinary least squares, it trades some information and precision for regression coefficients that are more realistic and more reliable, and it handles ill-conditioned data better than ordinary least squares. In other words, ridge regression sacrifices some training-set score in exchange for a better test-set score. sklearn implements it in the Ridge class.
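The "improved least squares" at the heart of ridge regression has a simple closed form: add alpha times the identity matrix to the normal equations before solving. A minimal NumPy sketch on synthetic data (the coefficients [1.0, -2.0, 0.5] and all other numbers here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=50)

alpha = 1.0
# Ordinary least squares solves (X^T X) w = X^T y; ridge regression solves
# (X^T X + alpha*I) w = X^T y, shrinking the coefficients toward zero.
w = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)
print(w)  # close to true_w, slightly shrunk toward zero
```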
```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Ridge regression analysis
def useing_sklearn_datasets_for_Ridge():
    X, y = datasets.load_diabetes().data, datasets.load_diabetes().target
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=8, test_size=0.3)
    lr = LinearRegression().fit(X_train, y_train)
    ridge = Ridge().fit(X_train, y_train)
    print('alpha=1, diabetes training set score: {:.2f}'.format(ridge.score(X_train, y_train)))
    print('alpha=1, diabetes test set score: {:.2f}'.format(ridge.score(X_test, y_test)))
    ridge10 = Ridge(alpha=10).fit(X_train, y_train)
    print('alpha=10, diabetes training set score: {:.2f}'.format(ridge10.score(X_train, y_train)))
    print('alpha=10, diabetes test set score: {:.2f}'.format(ridge10.score(X_test, y_test)))
    ridge01 = Ridge(alpha=0.1).fit(X_train, y_train)
    print('alpha=0.1, diabetes training set score: {:.2f}'.format(ridge01.score(X_train, y_train)))
    print('alpha=0.1, diabetes test set score: {:.2f}'.format(ridge01.score(X_test, y_test)))
```

Output:

```
alpha=1, diabetes training set score: 0.43
alpha=1, diabetes test set score: 0.43
alpha=10, diabetes training set score: 0.14
alpha=10, diabetes test set score: 0.16
alpha=0.1, diabetes training set score: 0.52
alpha=0.1, diabetes test set score: 0.47
```
The table below compares the training-set and test-set scores for each alpha:

| alpha | training set score | test set score |
| --- | --- | --- |
| 1 | 0.43 | 0.43 |
| 10 | 0.14 | 0.16 |
| 0.1 | 0.52 | 0.47 |
| linear regression | 0.54 | 0.45 |
```python
import matplotlib.pyplot as plt

plt.plot(ridge.coef_, 's', label='Ridge alpha=1')
plt.plot(ridge10.coef_, '^', label='Ridge alpha=10')
plt.plot(ridge01.coef_, 'v', label='Ridge alpha=0.1')
plt.plot(lr.coef_, 'o', label='Linear Regression')
plt.xlabel('coefficient index')
plt.ylabel('coefficient magnitude')
plt.hlines(0, 0, len(lr.coef_))
plt.show()
```
- alpha=10: the coefficients stay close to 0 (orange `^` markers)
- alpha=1: the coefficients vary more (blue `s` markers)
- alpha=0.1: the coefficients vary even more, approaching the linear-regression solution (green `v` markers)
- linear regression: the coefficients vary the most, some even beyond the plot range (red `o` markers)

The larger alpha is, the more strongly the coefficients are shrunk toward zero.
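This shrinkage can be verified directly on the diabetes data used above by fitting Ridge with increasing alpha and measuring the L2 norm of `coef_` (a small sketch; the exact norm values depend on the data):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge

X, y = load_diabetes(return_X_y=True)
norms = []
for alpha in (0.1, 1, 10, 100):
    ridge = Ridge(alpha=alpha).fit(X, y)
    norms.append(np.linalg.norm(ridge.coef_))
    print("alpha={}: ||coef_|| = {:.1f}".format(alpha, norms[-1]))
# The coefficient norm shrinks monotonically as alpha grows.
```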
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import learning_curve, KFold

# Plot the learning curve of an estimator
def plot_learning_curve(est, X, y):
    training_set_size, train_scores, test_scores = learning_curve(
        est, X, y, train_sizes=np.linspace(.1, 1, 20),
        cv=KFold(20, shuffle=True, random_state=1))
    estimator_name = est.__class__.__name__
    line = plt.plot(training_set_size, train_scores.mean(axis=1), '--',
                    label="training " + estimator_name)
    plt.plot(training_set_size, test_scores.mean(axis=1), '-',
             label="test " + estimator_name, c=line[0].get_color())
    plt.xlabel('training set size')
    plt.ylabel('Score')
    plt.ylim(0, 1.1)

# X, y are the diabetes data loaded above
plot_learning_curve(Ridge(alpha=1), X, y)
plot_learning_curve(LinearRegression(), X, y)
plt.legend(loc=(0, 1.05), ncol=2, fontsize=11)
plt.show()
```
These curves show that:

- the training score is higher than the test score;
- the ridge test score is lower than the linear-regression test score;
- for ridge, the test score and training score are close to each other;
- when the training set is small, neither linear model learns much;
- as the training set grows, the two models' scores converge.
5. Lasso Regression

Lasso regression is broadly similar to ridge regression. In practice, when choosing between the two, ridge regression is usually tried first. However, if there are very many features and only some of them are truly important, lasso's built-in feature selection can make it the better choice. sklearn implements it in the Lasso class.
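The key difference from ridge is that the lasso (L1) penalty drives some coefficients exactly to zero, which is what makes it a feature selector. A quick sketch on the same diabetes data (the alpha value is illustrative):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge

X, y = load_diabetes(return_X_y=True)
lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
# Lasso zeros out most coefficients; ridge only shrinks them.
print("lasso nonzero coefficients:", np.sum(lasso.coef_ != 0))
print("ridge nonzero coefficients:", np.sum(ridge.coef_ != 0))
```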
```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Lasso regression analysis
def useing_sklearn_datasets_for_Lasso():
    X, y = datasets.load_diabetes().data, datasets.load_diabetes().target
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=8, test_size=0.3)
    lasso = Lasso().fit(X_train, y_train)
    print('alpha=1, diabetes training set score: {:.2f}'.format(lasso.score(X_train, y_train)))
    print('alpha=1, diabetes test set score: {:.2f}'.format(lasso.score(X_test, y_test)))
    print('alpha=1, number of features used: {}'.format(np.sum(lasso.coef_ != 0)))
    lasso01 = Lasso(alpha=0.1, max_iter=100000).fit(X_train, y_train)
    print('alpha=0.1, max_iter=100000, diabetes training set score: {:.2f}'.format(lasso01.score(X_train, y_train)))
    print('alpha=0.1, max_iter=100000, diabetes test set score: {:.2f}'.format(lasso01.score(X_test, y_test)))
    print('alpha=0.1, number of features used: {}'.format(np.sum(lasso01.coef_ != 0)))
    lasso00001 = Lasso(alpha=0.0001, max_iter=100000).fit(X_train, y_train)
    print('alpha=0.0001, max_iter=100000, diabetes training set score: {:.2f}'.format(lasso00001.score(X_train, y_train)))
    print('alpha=0.0001, max_iter=100000, diabetes test set score: {:.2f}'.format(lasso00001.score(X_test, y_test)))
    print('alpha=0.0001, number of features used: {}'.format(np.sum(lasso00001.coef_ != 0)))
```

Output:

```
alpha=1, diabetes training set score: 0.37
alpha=1, diabetes test set score: 0.38
alpha=1, number of features used: 3
alpha=0.1, max_iter=100000, diabetes training set score: 0.52
alpha=0.1, max_iter=100000, diabetes test set score: 0.48
alpha=0.1, number of features used: 7
alpha=0.0001, max_iter=100000, diabetes training set score: 0.53
alpha=0.0001, max_iter=100000, diabetes test set score: 0.45
alpha=0.0001, number of features used: 10
```
- With alpha=1, only 3 features are used and the scores are low: the model underfits.
- With alpha=0.1, lowering alpha raises the scores and increases the number of features used to 7.
- With alpha=0.0001, the training score rises to 0.53 but the test score drops from 0.48 to 0.45: lowering alpha too far pushes the model toward overfitting.
Comparing ridge regression with lasso regression:

```python
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

def Ridge_VS_Lasso():
    X, y = datasets.load_diabetes().data, datasets.load_diabetes().target
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=8, test_size=0.3)
    lasso = Lasso(alpha=1, max_iter=100000).fit(X_train, y_train)
    plt.plot(lasso.coef_, 's', label='lasso alpha=1')
    lasso01 = Lasso(alpha=0.1, max_iter=100000).fit(X_train, y_train)
    plt.plot(lasso01.coef_, '^', label='lasso alpha=0.1')
    lasso00001 = Lasso(alpha=0.0001, max_iter=100000).fit(X_train, y_train)
    plt.plot(lasso00001.coef_, 'v', label='lasso alpha=0.0001')
    ridge01 = Ridge(alpha=0.1).fit(X_train, y_train)
    plt.plot(ridge01.coef_, 'o', label='ridge alpha=0.1')
    plt.legend(ncol=2, loc=(0, 1.05))
    plt.ylim(-1000, 750)
    plt.xlabel('Coefficient index')
    plt.ylabel('Coefficient magnitude')
    plt.show()
```
The plot shows that:

- With alpha=1, most coefficients are near 0.
- With alpha=0.1, most coefficients are still near 0, but fewer than with alpha=1, and more of them are nonzero.
- With alpha=0.0001, the model is barely regularized and most coefficients are nonzero.
- Ridge regression with alpha=0.1 behaves almost the same as lasso.

When the data has many features and only a small fraction of them are truly important, use lasso regression; otherwise use ridge regression.
6. Testing All the Linear Models on sklearn Datasets

Create the file machinelearn_data_model.py:

```python
# coding:utf-8
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR
import statsmodels.api as sm


class data_for_model:
    def machine_learn(data, model):
        if data == "iris":
            mydata = datasets.load_iris()
        elif data == "wine":
            mydata = datasets.load_wine()
        elif data == "breast_cancer":
            mydata = datasets.load_breast_cancer()
        elif data == "diabetes":
            mydata = datasets.load_diabetes()
        elif data == "boston":
            mydata = datasets.load_boston()
        elif data != "two_moon":
            return "Invalid data; valid choices are: iris, wine, breast_cancer, diabetes, boston, two_moon"
        if data == "two_moon":
            X, y = datasets.make_moons(n_samples=200, noise=0.05, random_state=0)
        elif model in ("DecisionTreeClassifier", "RandomForestClassifier"):
            X, y = mydata.data[:, :2], mydata.target
        else:
            X, y = mydata.data, mydata.target
        X_train, X_test, y_train, y_test = train_test_split(X, y)
        if model == "LinearRegression":
            md = LinearRegression().fit(X_train, y_train)
        elif model == "LogisticRegression":
            if data == "boston":
                y_train = y_train.astype('int')
                y_test = y_test.astype('int')
            md = LogisticRegression().fit(X_train, y_train)
        elif model == "Ridge":
            md = Ridge(alpha=0.1).fit(X_train, y_train)
        elif model == "Lasso":
            md = Lasso(alpha=0.0001, max_iter=10000000).fit(X_train, y_train)
        elif model == "SVM":
            md = LinearSVR(C=2).fit(X_train, y_train)
        elif model == "sm":
            md = sm.OLS(y, X).fit()
        else:
            return "Invalid model; valid choices are: LinearRegression, LogisticRegression, Ridge, Lasso, SVM, sm"
        if model == "sm":
            print("results.params(diabetes):\n", md.params,
                  "\nresults.summary(diabetes):\n", md.summary())
        else:
            print("model:", model, "data:", data,
                  "training set score: {:.2%}".format(md.score(X_train, y_train)))
            print("model:", model, "data:", data,
                  "test set score: {:.2%}".format(md.score(X_test, y_test)))
```
A few points to note:

- On the Boston housing data, LogisticRegression requires the target y to be of integer type, hence the cast;
- the Ridge model uses alpha=0.1;
- the Lasso model uses alpha=0.0001 with max_iter=10,000,000.
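The cast in the first point simply truncates the continuous house prices, so each truncated price becomes its own class label, which is why classification models score so poorly on this regression target. A tiny illustration (the sample prices are made up):

```python
import numpy as np

# astype('int') truncates toward zero, turning nearby prices
# like 21.6 and 21.7 into the same "class" 21, while 34.7 and 33.4
# become different classes 34 and 33.
y = np.array([21.6, 21.7, 34.7, 33.4, 36.2])
print(y.astype('int'))  # [21 21 34 33 36]
```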
With this in place, we can run a quantitative analysis of any chosen model on any chosen dataset:
```python
from machinelearn_data_model import data_for_model

def linear_for_all_data_and_model():
    datas = ["iris", "wine", "breast_cancer", "diabetes", "boston", "two_moon"]
    models = ["LinearRegression", "LogisticRegression", "Ridge", "Lasso", "SVM", "sm"]
    for data in datas:
        for model in models:
            data_for_model.machine_learn(data, model)
```
We compare the test results:

| Data | Model | Training set | Test set |
| --- | --- | --- | --- |
| iris | LinearRegression | 92.7% | 93.7% |
| iris | LogisticRegression | 96.4% | 100% |
| iris | Ridge | 93.1% | 92.8% |
| iris | Lasso | 92.8% | 93.1% |
| iris | StatsModels OLS | 0.972 | - |
| wine | LinearRegression | 90.6% | 85.1% |
| wine | LogisticRegression | 97.7% | 95.6% |
| wine | Ridge | 90.2% | 86.8% |
| wine | Lasso | 91.0% | 85.2% |
| wine | StatsModels OLS | 0.948 | - |
| breast cancer | LinearRegression | 79.1% | 68.9% |
| breast cancer | LogisticRegression | 95.3% | 93.0% |
| breast cancer | Ridge | 75.7% | 74.5% |
| breast cancer | Lasso | 77.6% | 71.4% |
| breast cancer | StatsModels OLS | 0.908 | - |
| diabetes | LinearRegression | 52.5% | 47.9% |
| diabetes | LogisticRegression | 2.7% | 0.0% |
| diabetes | Ridge | 51.5% | 49.2% |
| diabetes | Lasso | 51.5% | 50.2% |
| diabetes | StatsModels OLS | 0.106 | - |
| Boston housing | LinearRegression | 74.5% | 70.9% |
| Boston housing | LogisticRegression | 20.8% | 11.0% |
| Boston housing | Ridge | 76.0% | 62.7% |
| Boston housing | Lasso | 73.5% | 74.5% |
| Boston housing | StatsModels OLS | 0.959 | - |
| two moons | LinearRegression | 66.9% | 63.0% |
| two moons | LogisticRegression | 89.3% | 86.0% |
| two moons | Ridge | 66.3% | 64.3% |
| two moons | Lasso | 65.3% | 65.2% |
| two moons | StatsModels OLS | 0.501 | - |

Observations:

- The iris data performs well under every model.
- The wine data also performs well under every model, though slightly worse than iris.
- The breast cancer data performs well only under logistic regression and OLS.
- The diabetes data performs poorly under every model.
- The Boston housing data performs well only under OLS; under the other models it does poorly, though under logistic regression it does slightly better than diabetes.
- The two-moons data performs best under LogisticRegression; the other models do not do well.
The results are summarized in the table below (good / fair / poor):

| Data | LinearRegression | LogisticRegression | Ridge | Lasso | OLS |
| --- | --- | --- | --- | --- | --- |
| iris | good | good | good | good | good |
| wine | good | good | good | good | good |
| breast cancer | fair | good | fair | fair | good |
| diabetes | poor | poor | poor | poor | poor |
| Boston housing | fair | poor | fair | fair | good |
| two moons | fair | good | fair | fair | poor |
Finally, we bind the KNN algorithm in as well; machinelearn_data_model.py is extended as follows:

```python
        # …
        elif model == "KNeighborsClassifier":
            if data == "boston":
                y_train = y_train.astype('int')
                y_test = y_test.astype('int')
            md = KNeighborsClassifier().fit(X_train, y_train)
        else:
            return ("Invalid model; valid choices are: LinearRegression, LogisticRegression, "
                    "Ridge, Lasso, SVM, sm, KNeighborsClassifier")
        # …
```
Call the test program:

```python
from machinelearn_data_model import data_for_model

def KNN_for_all_data_and_model():
    datas = ["iris", "wine", "breast_cancer", "diabetes", "boston", "two_moon"]
    models = ["KNeighborsClassifier"]
    for data in datas:
        for model in models:
            data_for_model.machine_learn(data, model)
```
The test results:

| Data | Model | Training set | Test set |
| --- | --- | --- | --- |
| iris | KNeighborsClassifier | 95.5% | 97.4% |
| wine | KNeighborsClassifier | 76.7% | 68.9% |
| breast cancer | KNeighborsClassifier | 95.3% | 90.2% |
| diabetes | KNeighborsClassifier | 19.6% | 0.0% |
| Boston housing | KNeighborsClassifier | 36.4% | 4.7% |
| two moons | KNeighborsClassifier | 100.00% | 100.00% |
This shows that KNeighborsClassifier is effective on the iris, breast cancer, and two-moons data:
| Data | KNeighborsClassifier |
| --- | --- |
| iris | good |
| wine | fair |
| breast cancer | good |
| diabetes | poor |
| Boston housing | poor |
| two moons | good |