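1. Main uses of sklearn in machine learning:
   1. Supervised and unsupervised learning
   2. Automated hyperparameter tuning with sklearn
   3. Building pipelines with sklearn to keep machine learning code concise and efficient
1.1 Supervised learning with sklearn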
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')

df = pd.read_csv('diabetes.csv')
df.head()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
# Features are every column except the label; the label is 'Outcome'
X = df.drop('Outcome', axis=1).values
y = df['Outcome'].values
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.4,random_state=42, stratify=y)
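from sklearn.neighbors import KNeighborsClassifier

# Try k = 1..8 and record the accuracy on both splits
neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

for i, k in enumerate(neighbors):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_accuracy[i] = knn.score(X_train, y_train)
    # Also record test accuracy; otherwise the plotted array stays uninitialized
    test_accuracy[i] = knn.score(X_test, y_test)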
plt.title('k-NN Varying number of neighbors')
plt.plot(neighbors, test_accuracy, label='Testing Accuracy')
plt.plot(neighbors, train_accuracy, label='Training accuracy')
plt.legend()
plt.xlabel('Number of neighbors')
plt.ylabel('Accuracy')
plt.show()

# Refit with k=7 and report accuracy on the test split
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)
knn.score(X_test, y_test)
- Computing the confusion matrix with the function provided by sklearn
from sklearn.metrics import confusion_matrix

y_pred = knn.predict(X_test)
confusion_matrix(y_test, y_pred)
array([[165, 36], [ 47, 60]])
- Computing the same table with pandas' crosstab method
pd.crosstab(y_test, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)
| True \ Predicted | 0 | 1 | All |
|---|---|---|---|
| 0 | 165 | 36 | 201 |
| 1 | 47 | 60 | 107 |
| All | 212 | 96 | 308 |
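- sklearn also provides a convenient API, classification_report
classification_report computes the metrics most commonly used for classification problems: precision, recall, and F1 score, reported per class together with the overall accuracy.
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))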
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.78 | 0.82 | 0.80 | 201 |
| 1 | 0.62 | 0.56 | 0.59 | 107 |
| accuracy | | | 0.73 | 308 |
| macro avg | 0.70 | 0.69 | 0.70 | 308 |
| weighted avg | 0.73 | 0.73 | 0.73 | 308 |
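The class-1 figures in the report can be checked by hand against the confusion matrix shown earlier. The snippet below is a minimal sketch that assumes the matrix values printed above (rows are true labels, columns are predictions); the variable names are illustrative only.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows = true class, columns = predicted class.
cm = np.array([[165, 36],
               [47, 60]])

tp = cm[1, 1]  # class 1 samples predicted as 1
fp = cm[0, 1]  # class 0 samples predicted as 1
fn = cm[1, 0]  # class 1 samples predicted as 0

precision = tp / (tp + fp)                          # 60 / 96
recall = tp / (tp + fn)                             # 60 / 107
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = np.trace(cm) / cm.sum()                  # correct predictions / all samples

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

Rounded to two decimals, these agree with the class-1 row and the accuracy row of the report.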
from sklearn.metrics import roc_curve

# Predicted probability of the positive class (column 1)
y_pred_proba = knn.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

plt.plot([0, 1], [0, 1], 'k--')  # diagonal = random classifier
plt.plot(fpr, tpr, label='Knn')
plt.xlabel('fpr')
plt.ylabel('tpr')
plt.title('Knn(n_neighbors=7) ROC curve')
plt.show()

from sklearn.metrics import roc_auc_score

roc_auc_score(y_test, y_pred_proba)
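1.2 Hyperparameter tuning with sklearn
- GridSearch
- In this example there is only one hyperparameter, n_neighbors: the number of nearest neighbors that vote on each prediction (not the number of classes).
- In practice, grid search fits a model for every parameter combination on every cross-validation fold, so it can become very slow as the number of parameters and candidate values grows.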
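from sklearn.model_selection import GridSearchCV

# Search k = 1..49 with 5-fold cross-validation
param_grid = {'n_neighbors': np.arange(1, 50)}
knn = KNeighborsClassifier()
knn_cv = GridSearchCV(knn, param_grid, cv=5)
knn_cv.fit(X, y)

# Best parameter setting found by the search
knn_cv.best_params_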
{'n_neighbors': 14}
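To make the cost of grid search concrete, the sketch below counts how many model fits a search performs during cross-validation. `count_fits` is a hypothetical helper written for this illustration, and the larger grid is an assumed example, not one used in this post; `ParameterGrid` is the sklearn class that enumerates the combinations.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid


def count_fits(param_grid, cv):
    """Fits performed during cross-validation (the final refit is not counted)."""
    return len(ParameterGrid(param_grid)) * cv


# The single-parameter grid used above: 49 candidates x 5 folds.
knn_grid = {'n_neighbors': np.arange(1, 50)}
print(count_fits(knn_grid, cv=5))

# Hypothetical larger grid: two more k-NN parameters multiply the candidates.
bigger_grid = {
    'n_neighbors': np.arange(1, 50),
    'weights': ['uniform', 'distance'],
    'p': [1, 2],
}
print(count_fits(bigger_grid, cv=5))
```

The count grows multiplicatively with each added parameter. When the grid becomes too large, sklearn's RandomizedSearchCV, which samples a fixed number of candidates, is one commonly used alternative.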
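1.3 Using pipelines
- sklearn's pipeline feature chains modules that run in sequence, which keeps the code more concise
- In this example the pipeline does the following
  - Standardizes the data (scaler)
  - Reduces dimensionality (PCA)
  - Trains a logistic regression model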
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn import datasets

# Pipeline: standardize -> reduce dimensionality -> classify
pca = PCA()
scaler = StandardScaler()
logistic = LogisticRegression(max_iter=10000, tol=0.1)
pipe = Pipeline(steps=[("scaler", scaler), ("pca", pca), ("logistic", logistic)])

X_digits, y_digits = datasets.load_digits(return_X_y=True)

# Parameters of pipeline steps are addressed as <step name>__<parameter name>
param_grid = {
    "pca__n_components": [5, 15, 20, 25, 30, 50, 55, 60],
    "logistic__C": np.logspace(-4, 4, 8),
}
search = GridSearchCV(pipe, param_grid, n_jobs=2)
search.fit(X_digits, y_digits)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)

# Top panel: PCA spectrum of the full data set
pca.fit(X_digits)
fig, (ax0, ax1) = plt.subplots(nrows=2, sharex=True, figsize=(12, 12))
ax0.plot(
    np.arange(1, pca.n_components_ + 1), pca.explained_variance_ratio_, "+", linewidth=2
)
ax0.set_ylabel("PCA explained variance ratio")

# Bottom panel: best mean CV score for each n_components value.
# cv_results_ holds one score per (n_components, C) pair, so it is grouped
# by n_components instead of being plotted against the component index.
results = pd.DataFrame(search.cv_results_)
best_per_n = results.groupby("param_pca__n_components")["mean_test_score"].max()
ax1.plot(best_per_n.index.astype(int), best_per_n.values, "*", linewidth=2)
ax1.set_ylabel("Classification Accuracy")

# Mark the chosen number of components on both panels
best_n = search.best_estimator_.named_steps["pca"].n_components
ax0.axvline(best_n, linestyle=":", label="n_components chosen")
ax1.axvline(best_n, linestyle=":", label="n_components chosen")
ax0.legend()
plt.show()
Best parameter (CV score=0.925): {'logistic__C': 0.2682695795279725, 'pca__n_components': 55}
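The keys in the output above, such as `pca__n_components`, follow the pipeline's `<step name>__<parameter name>` convention. The following is a minimal sketch of that convention on an unfitted pipeline with the same three steps; the values passed to `set_params` are arbitrary and chosen only for illustration.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("logistic", LogisticRegression(max_iter=10000)),
])

# Nested parameters are exposed as "<step name>__<parameter name>".
tunable = [name for name in pipe.get_params() if "__" in name]
print("pca__n_components" in tunable, "logistic__C" in tunable)

# set_params accepts the same names; a grid search applies each candidate this way.
pipe.set_params(pca__n_components=15, logistic__C=0.1)
print(pipe.named_steps["pca"].n_components, pipe.named_steps["logistic"].C)
```

Listing `pipe.get_params()` is a quick way to find the exact key to put in a `param_grid` for any step.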
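1. sklearn also provides a set of tools and transformers for text processing, which make it easy to encode and classify text data.
2. sklearn provides ready-to-use APIs for many classic statistical learning models, such as naive_bayes and SVM. XGBoost is a separate library rather than part of sklearn, although it offers an sklearn-compatible estimator interface.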
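As a rough illustration of both points, the sketch below encodes text with `TfidfVectorizer` and classifies it with `MultinomialNB` from sklearn's naive_bayes module. The tiny corpus and its labels are made up for this example, and the model choice is an assumption rather than a recommendation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy corpus for illustration only: 1 = sports, 0 = cooking.
texts = [
    "the team won the football match",
    "a great goal in the final match",
    "the striker scored a goal for the team",
    "bake the cake in the oven",
    "add sugar and flour to the cake recipe",
    "stir the soup and add salt to the recipe",
]
labels = [1, 1, 1, 0, 0, 0]

# TfidfVectorizer turns raw strings into a sparse TF-IDF matrix;
# MultinomialNB is then fitted on those features.
text_clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
text_clf.fit(texts, labels)

predictions = text_clf.predict([
    "the team scored a goal",
    "flour and sugar for the cake",
])
print(predictions)
```

Because the vectorizer and the classifier sit in one pipeline, the same object accepts raw strings at both fit and predict time.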