1.3 Decision boundary
x1 = np.linspace(30, 100, 100)                               # independent variable
x2 = -(final_theta[0] + x1*final_theta[1]) / final_theta[2]  # dependent variable

fig, ax = plt.subplots(figsize=(12,8))
ax.plot(x1, x2, 'y', label='Prediction')
ax.scatter(positive['Exam 1'], positive['Exam 2'], s=50, c='b', marker='o', label='Admitted')
ax.scatter(negative['Exam 1'], negative['Exam 2'], s=50, c='r', marker='x', label='Not Admitted')
ax.legend()
ax.set_xlabel('Exam 1 Score')
ax.set_ylabel('Exam 2 Score')
plt.show()
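The expression for x2 comes from the decision-boundary condition: the sigmoid crosses 0.5 exactly when its argument is zero, so

θ0 + θ1·x1 + θ2·x2 = 0  ⟹  x2 = -(θ0 + θ1·x1) / θ2

which is exactly the line plotted above.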
2 Regularized logistic regression
In this part of the exercise, you will implement regularized logistic regression to predict whether microchips from a fabrication plant pass quality assurance (QA). During QA, each microchip goes through various tests to ensure it is functioning correctly.
Suppose you are the product manager of the factory and you have the test results for some microchips on two different tests. From these two tests, you would like to determine whether the microchips should be accepted or rejected. To help you make the decision, you have a dataset of test results on past microchips, from which you can build a logistic regression model.
In this part we encounter the term regularization. In short, regularization is a term in the cost function that makes the algorithm prefer "simpler" models (in this case, models with smaller coefficients). This helps reduce overfitting and improves the model's ability to generalize.
2.1 Visualizing the data
As before, read in the data first:
path = '../data_files/data/ex2data2.txt'
data2 = pd.read_csv(path, header=None, names=['Microchip 1', 'Microchip 2', 'Accepted'])
data2.head()
Next, draw the scatter plot: + marks Accepted and ○ marks Rejected, one being the positive class and the other the negative class.
def plot_data():
    positive = data2[data2['Accepted'].isin([1])]
    negative = data2[data2['Accepted'].isin([0])]

    fig, ax = plt.subplots(figsize=(12,8))
    ax.scatter(x=positive['Microchip 1'], y=positive['Microchip 2'], c='black', s=50, marker='+', label='Accepted')
    ax.scatter(x=negative['Microchip 1'], y=negative['Microchip 2'], c='y', s=50, marker='o', label='Reject')
    ax.legend()
    ax.set_xlabel('Microchip 1')
    ax.set_ylabel('Microchip 2')
    # plt.show()

plot_data()
2.2 Feature mapping
One way to fit the data better is to create more features from each data point. We will map the features into all polynomial terms of x1 and x2 up to the sixth power.
def feature_mapping(x1, x2, power, as_ndarray=False):
    # map the two features to every polynomial term x1^(i-p) * x2^p up to the given power
    data = {"f{}{}".format(i - p, p): np.power(x1, i - p) * np.power(x2, p)
            for i in np.arange(power + 1)
            for p in np.arange(i + 1)}
    if as_ndarray:
        return np.array(pd.DataFrame(data))
    else:
        return pd.DataFrame(data)
x1 = np.array(data2['Microchip 1'])
x2 = np.array(data2['Microchip 2'])
_data2 = feature_mapping(x1, x2, power=6)
print(_data2.shape)
_data2.head()
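As a quick sanity check (not part of the original post, and assuming the feature_mapping defined above), mapping a single point with power=2 makes the column naming easy to read. The column count also works out: summing i + 1 terms for i = 0..6 gives 1 + 2 + ... + 7 = 28 columns, matching the shape printed above.

demo = feature_mapping(np.array([2.0]), np.array([3.0]), power=2)
print(demo)
# columns f{a}{b} hold x1^a * x2^b, so here: f00=1, f10=2, f01=3, f20=4, f11=6, f02=9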
As a result of this mapping, our vector of two features (the scores on two QA tests) has been transformed into a 28-dimensional vector. A logistic regression classifier trained on this higher-dimensional feature vector will have a more complex decision boundary and will appear nonlinear when drawn in our 2-dimensional plot.
While the feature mapping allows us to build a more expressive classifier, it is also more susceptible to overfitting. In the next parts of the exercise, you will implement regularized logistic regression to fit the data and also see for yourself how regularization can help combat the overfitting problem.
2.3 Cost function
The basic idea of regularization: add a penalty on the high-order terms so that their coefficients are driven toward 0.

When we have many features and do not know in advance which ones to penalize, we simply penalize all of them and let the optimization software decide how strongly each one is shrunk. The result is a simpler hypothesis that is less prone to overfitting:

J(θ) = (1/m) Σ_{i=1..m} [ -y⁽ⁱ⁾ log(hθ(x⁽ⁱ⁾)) - (1 - y⁽ⁱ⁾) log(1 - hθ(x⁽ⁱ⁾)) ] + (λ / 2m) Σ_{j=1..n} θj²

where λ is called the regularization parameter. Note: by convention, θ0 is not penalized.
The sigmoid function:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
First build the feature matrix and the label vector, then work with them:
theta = np.zeros(_data2.shape[1])
X = feature_mapping(x1, x2, power=6, as_ndarray=True)
print(X.shape)
y = np.array(data2.iloc[:, -1])
print(y.shape)
def regularized_cost(theta, X, y, l=1):
    thetaReg = theta[1:]    # theta_0 is not penalized
    first = (-y * np.log(sigmoid(X @ theta))) - (1 - y) * np.log(1 - sigmoid(X @ theta))
    reg = (thetaReg @ thetaReg) * l / (2 * len(X))
    return np.mean(first) + reg

regularized_cost(theta, X, y, l=1)
0.6931471805599454
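The initial cost is exactly ln 2: with θ = 0 every prediction is sigmoid(0) = 0.5, so each term of the cost is -log(0.5) = ln 2 ≈ 0.6931, and the regularization term vanishes, which matches the value above.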
2.4 Regularized Gradient
Note:
1. Although the gradient-descent update for regularized logistic regression looks the same as the one for regularized linear regression, the two are quite different because the hypothesis hθ(x) = g(θᵀx) is different.
2. θ0 does not take part in the regularization.
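For completeness, the gradient of the regularized cost (which the code below implements), with m the number of training examples, is:

∂J(θ)/∂θ0 = (1/m) Σ_{i=1..m} (hθ(x⁽ⁱ⁾) - y⁽ⁱ⁾) x0⁽ⁱ⁾                 for j = 0
∂J(θ)/∂θj = (1/m) Σ_{i=1..m} (hθ(x⁽ⁱ⁾) - y⁽ⁱ⁾) xj⁽ⁱ⁾ + (λ/m) θj      for j ≥ 1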
def regularized_gradient(theta, X, y, l=1):
    thetaReg = theta[1:]
    first = (X.T @ (sigmoid(X @ theta) - y)) / len(X)
    # prepend a 0 so that theta_0 is not penalized
    reg = np.concatenate([np.array([0]), (l / len(X)) * thetaReg])
    return first + reg

regularized_gradient(theta, X, y)
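As an optional check (not in the original post), the analytic gradient can be compared against a central finite-difference approximation of regularized_cost; the two values should be nearly identical:

eps = 1e-4
j = 3                                    # an arbitrary component to check
theta_plus, theta_minus = theta.copy(), theta.copy()
theta_plus[j] += eps
theta_minus[j] -= eps
numeric = (regularized_cost(theta_plus, X, y) - regularized_cost(theta_minus, X, y)) / (2 * eps)
analytic = regularized_gradient(theta, X, y)[j]
print(numeric, analytic)                 # should agree to several decimal places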
A further improvement

Instead of redefining the functions from scratch as above, we can also just build on the existing cost and gradient functions and add the regularization terms to them:
def cost(theta, X, y):
    first = (-y) * np.log(sigmoid(X @ theta))
    second = (1 - y) * np.log(1 - sigmoid(X @ theta))
    return np.mean(first - second)
def gradient(theta, X, y):
    # the gradient of the cost is a vector of the same length as theta,
    # where the j-th element (for j = 0, 1, ..., n) is the partial derivative w.r.t. theta_j
    return (X.T @ (sigmoid(X @ theta) - y)) / len(X)
def costReg(theta, X, y, l=1):
    _theta = theta[1:]                            # do not penalize the first term (theta_0)
    reg = (l / (2 * len(X))) * (_theta @ _theta)  # _theta @ _theta is the inner product
    return cost(theta, X, y) + reg
def gradientReg(theta, X, y, l=1):
    reg = (l / len(X)) * theta
    reg[0] = 0                                    # theta_0 is not regularized
    return gradient(theta, X, y) + reg
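As a quick sanity check (not part of the original post), the two pairs of functions should produce identical values:

print(costReg(theta, X, y, l=1), regularized_cost(theta, X, y, l=1))      # identical costs
print(np.allclose(gradientReg(theta, X, y, l=1),
                  regularized_gradient(theta, X, y, l=1)))                # True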
2.5 Learning θ parameters
As in the previous part, we use scipy's optimize functions:
import scipy.optimize as opt

print('init cost = {}'.format(regularized_cost(theta, X, y)))   # init cost = 0.6931471805599454
res = opt.minimize(fun=regularized_cost, x0=theta, args=(X, y), method='CG', jac=regularized_gradient)
res
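The OptimizeResult returned by opt.minimize keeps the learned parameters in res.x and the final cost in res.fun, which is what the following cells rely on:

print(res.fun)   # final value of the regularized cost
res.x            # the learned parameters, used as final_theta below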
2.6 Evaluating logistic regression
def predict(theta, X):
    probability = sigmoid(X @ theta)
    return [1 if x >= 0.5 else 0 for x in probability]   # returns a list
final_theta = res.x
predictions = predict(final_theta, X)
correct = [1 if a == b else 0 for (a, b) in zip(predictions, y)]
accuracy = sum(correct) / len(correct)
accuracy

The training accuracy comes out to roughly 0.83, consistent with the classification report and the scikit-learn model below.
from sklearn.metrics import classification_report

final_theta = res.x
y_predict = predict(final_theta, X)
print(classification_report(y, y_predict))
The report confirms an accuracy of about 0.83.
Alternatively, we can call the logistic regression model from scikit-learn's linear_model module (its C parameter is the inverse of the regularization strength, so C=1.0 plays the role of 1/λ):
from sklearn import linear_model   # scikit-learn's linear_model module

model = linear_model.LogisticRegression(penalty='l2', C=1.0)
model.fit(X, y.ravel())
model.score(X, y) # 0.8305084745762712
2.7 Plotting the decision boundary
The decision boundary is the set of points where Xθ = 0 (this is the line we want to draw).
When we try to visualize it, we find that this curve is not easy to solve for in closed form. Instead, we evaluate Xθ on a grid and draw it with a contour plot, setting the contour height to 0; that gives us exactly the boundary we want.
x = np.linspace(-1, 1.5, 50)                  # 50 evenly spaced values from -1 to 1.5
xx, yy = np.meshgrid(x, x)                    # combine them into a 50x50 = 2500-point grid
z = np.array(feature_mapping(xx.ravel(), yy.ravel(), 6))
z = z @ final_theta
z = z.reshape(xx.shape)

plot_data()
plt.contour(xx, yy, z, 0, colors='black')     # a contour is the projection of a 3D surface onto 2D; here we want the level z = 0
plt.ylim(-.8, 1.2)
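A more explicit but equivalent call (my suggestion, not from the original post) pins the drawn level at exactly z = 0 by passing a list of contour levels:

plt.contour(xx, yy, z, [0], colors='black')   # draw only the z = 0 contour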
And with that, the visualization works!

That wraps up exercise 2. I'll try to get exercise 3 written up in the next few days.
Quote of the day

Take control of your own destiny.
If you need the data and code, you can grab them yourself:

Option 1: my Gitee

Option 2: Baidu Netdisk

Link: https://pan.baidu.com/s/1uA5YU06FEW7pW8g9KaHaaw

Extraction code: 5605