05 Hyperparameters
Hyperparameter: a parameter that must be fixed before the algorithm runs, e.g. k in kNN.
Model parameter: a parameter learned by the algorithm during training.
From the discussion of kNN above, we can see that kNN has no model parameters.
```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()
X = digits.data
y = digits.target

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
knn_clf.score(X_test, y_test)
# Out[4]:
# 0.98611111111111116
```
Searching for the best k
```python
best_score = 0.0
best_k = -1
for k in range(1, 11):  # search k from 1 to 10: build a classifier for each k and evaluate it with score
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train)
    score = knn_clf.score(X_test, y_test)
    if score > best_score:
        best_k = k
        best_score = score
print("best_k =", best_k)
print("best_score =", best_score)
# best_k= 7
# best_score= 0.988888888889
```
Weight by distance, or not?
If kNN takes distance into account, each neighbor's vote is weighted inversely to its distance from the query point: the larger the distance, the smaller the weight of that neighbor's vote.
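The mechanics of this weighting can be sketched in a few lines of NumPy. The neighbor labels and distances below are made-up values for illustration; the 1/distance weights mirror the behavior that sklearn's `weights="distance"` option uses.

```python
import numpy as np

# Hypothetical 3 nearest neighbors of a query point: their labels and distances
labels = np.array([0, 1, 1])
distances = np.array([0.5, 1.0, 2.0])

# Each neighbor votes with weight 1/distance, so closer points count more
weights = 1.0 / distances
votes = {label: weights[labels == label].sum() for label in np.unique(labels)}

# Class 0 scores 1/0.5 = 2.0; class 1 scores 1/1.0 + 1/2.0 = 1.5,
# so the single nearest neighbor outvotes the two farther ones
prediction = max(votes, key=votes.get)
```

Note that with plain majority voting (`weights="uniform"`) the same three neighbors would predict class 1 instead, since it holds two of the three votes.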
```python
best_method = ""
best_score = 0.0
best_k = -1
for method in ["uniform", "distance"]:
    for k in range(1, 11):  # search the best k in 1..10
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
            best_method = method
print("best_method =", best_method)
print("best_k =", best_k)
print("best_score =", best_score)
# best_method= uniform
# best_k= 1
# best_score= 0.994444444444
```
Searching for the Minkowski distance parameter p
```python
%%time
best_p = -1
best_score = 0.0
best_k = -1
for k in range(1, 11):  # search the best k in 1..10
    for p in range(1, 6):  # Minkowski distance parameter
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights="distance", p=p)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
            best_p = p
print("best_p =", best_p)
print("best_k =", best_k)
print("best_score =", best_score)
# best_p= 2
# best_k= 1
# best_score= 0.994444444444
# CPU times: user 15.3 s, sys: 51.5 ms, total: 15.4 s
# Wall time: 15.5 s
```
The nested loops above can be pictured as a grid: every combination of parameter values is one point on the grid, and we traverse all of the points to find the best value.
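That traversal can be sketched with `itertools.product`, which enumerates every point of the parameter grid. The `evaluate` function below is a hypothetical stand-in for fitting and scoring a classifier, so the example stays self-contained:

```python
from itertools import product

# The parameter grid: every (k, p) pair is one point
ks = range(1, 11)
ps = range(1, 6)

def evaluate(k, p):
    # Toy stand-in for training a model and returning its score;
    # a real version would fit kNN with these hyperparameters
    return -((k - 3) ** 2 + (p - 2) ** 2)

best_params, best_score = None, float("-inf")
for k, p in product(ks, ps):  # traverse all 10 * 5 = 50 grid points
    score = evaluate(k, p)
    if score > best_score:
        best_params, best_score = (k, p), score
```

This is exactly what the nested `for` loops do; `product` just makes the "walk every grid point" structure explicit.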
06 Grid Search
Supplement: "distance" in kNN
Minkowski distance: d(x, y) = (Σ|x_i − y_i|^p)^(1/p). This introduces a hyperparameter p: when p = 1 it reduces to the Manhattan distance, and when p = 2 to the Euclidean distance.
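A minimal sketch of the formula, checking both special cases on a made-up pair of points (the 3-4-5 triangle makes the expected values easy to verify by hand):

```python
import numpy as np

def minkowski(x, y, p):
    # (sum |x_i - y_i|^p) ** (1/p), with p as the hyperparameter
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])

manhattan = minkowski(x, y, p=1)  # 3 + 4 = 7.0
euclidean = minkowski(x, y, p=2)  # sqrt(9 + 16) = 5.0
```

As p grows, the largest coordinate difference dominates the sum, which is why large p values rarely help in practice.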
Data preparation:
```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()
X = digits.data
y = digits.target

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)

from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
knn_clf.score(X_test, y_test)
# Out[4]:
# 0.98888888888888893
```
Grid Search
The grid-search idea from the previous section can be expressed more concisely as follows:
```python
# Define the parameter grid: each dict lists the candidate values for one group of parameters
param_grid = [
    {
        'weights': ['uniform'],
        'n_neighbors': [i for i in range(1, 11)]
    },
    {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 11)],
        'p': [i for i in range(1, 6)]
    }
]

knn_clf = KNeighborsClassifier()

# Import the grid-search class (it evaluates candidates with cross-validation, CV)
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(knn_clf, param_grid)

%%time
grid_search.fit(X_train, y_train)
# CPU times: user 2min 2s, sys: 320 ms, total: 2min 2s
# Wall time: 2min 3s
"""
GridSearchCV(cv=None, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
           metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=5,
           p=2, weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'weights': ['uniform'], 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]},
                   {'weights': ['distance'], 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                    'p': [1, 2, 3, 4, 5]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
"""

grid_search.best_estimator_  # the best classifier found (the trailing underscore marks an attribute computed from the user's data)
"""
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=3,
           weights='distance')
"""

grid_search.best_score_  # cross-validated accuracy of the best candidate
# Out[11]:
# 0.98538622129436326

grid_search.best_params_  # parameters of the best candidate
# Out[12]:
# {'n_neighbors': 3, 'p': 3, 'weights': 'distance'}

knn_clf = grid_search.best_estimator_
knn_clf.score(X_test, y_test)
# Out[14]:
# 0.98333333333333328

%%time
# n_jobs: number of CPU cores to use (-1 = all); verbose: progress output, larger integers mean more detail
grid_search = GridSearchCV(knn_clf, param_grid, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)
```
```
Fitting 3 folds for each of 60 candidates, totalling 180 fits
[CV] n_neighbors=1, weights=uniform ..................................
[CV] n_neighbors=1, weights=uniform ..................................
[CV] n_neighbors=1, weights=uniform ..................................
[CV] n_neighbors=2, weights=uniform ..................................
[CV] ................... n_neighbors=1, weights=uniform, total=   0.7s
[CV] n_neighbors=3, weights=uniform ..................................
[CV] ................... n_neighbors=2, weights=uniform, total=   1.0s
......
CPU times: user 651 ms, sys: 343 ms, total: 994 ms
Wall time: 1min 23s
[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed:  1.4min finished
```