Python Meets Machine Learning ---- The k-Nearest Neighbors (kNN) Algorithm (Part 2) - Alibaba Cloud Developer Community

2022-04-15 113

Summary: As the saying goes, "One who stays near vermilion turns red; one who stays near ink turns black."

03 Testing Our Algorithm

This example uses the iris dataset that ships with sklearn's datasets module.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
# Load the iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target
X.shape
# Out[4]:
# (150, 4)
y.shape
# Out[5]:
# (150,)

train_test_split

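When we get a dataset, we usually use one part of it to train the model and hold back the rest to test the trained model. This is the train/test split (train_test_split).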

y
"""
Out[6]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
"""
shuffle_indexes = np.random.permutation(len(X)) # Shuffle the data: a random permutation of the 150 indices
shuffle_indexes
"""
Out[8]:
array([ 22, 142, 86, 111, 72, 80, 17, 137, 5, 66, 33, 55, 40, 122, 108, 24, 45, 110, 68, 46, 118, 44, 136, 121, 78, 31, 103, 35, 105, 107, 76, 116, 84, 144, 123, 57, 42, 7, 38, 28, 117, 115, 89, 58, 126, 74, 49, 27, 94, 77, 85, 21, 119, 132, 100, 120, 6, 104, 62, 53, 64, 41, 106, 26, 29, 18, 129, 146, 148, 1, 82, 139, 135, 96, 127, 56, 37, 130, 65, 149, 113, 92, 131, 2, 4, 125, 54, 79, 50, 61, 112, 95, 19, 109, 102, 141, 30, 39, 83, 25, 140, 60, 12, 20, 138, 71, 59, 11, 13, 0, 52, 91, 3, 73, 23, 124, 15, 14, 81, 97, 75, 114, 16, 69, 32, 134, 36, 8, 63, 51, 147, 67, 93, 47, 133, 48, 143, 43, 34, 98, 87, 88, 145, 70, 90, 9, 10, 128, 101, 99])
"""
test_ratio = 0.2 # Fraction of the data reserved for testing
test_size = int(len(X) * test_ratio)
test_indexes = shuffle_indexes[:test_size] # Indices of the test set
train_indexes = shuffle_indexes[test_size:] # Indices of the training set
X_train = X[train_indexes]
y_train = y[train_indexes]
X_test = X[test_indexes]
y_test = y[test_indexes]
print(X_train.shape)
print(y_train.shape)
# (120, 4) (120,)
print(X_test.shape)
print(y_test.shape)
# (30, 4) (30,)
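
The manual steps above can be wrapped in a reusable function. What follows is only a minimal sketch: the name `split_train_test` and its `seed` parameter are illustrative assumptions, not part of the original code or of sklearn. The demo uses synthetic data with the same shape as the iris dataset so that it runs without sklearn.

```python
import numpy as np


def split_train_test(X, y, test_ratio=0.2, seed=None):
    """Split X and y into train and test sets (hypothetical helper)."""
    assert X.shape[0] == y.shape[0], "X and y must have the same number of samples"
    assert 0.0 <= test_ratio <= 1.0, "test_ratio must be in [0, 1]"

    # A seeded generator makes the shuffle reproducible
    rng = np.random.default_rng(seed)
    shuffle_indexes = rng.permutation(len(X))

    # The first test_size shuffled indices form the test set, the rest the training set
    test_size = int(len(X) * test_ratio)
    test_indexes = shuffle_indexes[:test_size]
    train_indexes = shuffle_indexes[test_size:]

    return X[train_indexes], X[test_indexes], y[train_indexes], y[test_indexes]


# Synthetic stand-in data shaped like the iris dataset: 150 samples, 4 features
X_demo = np.arange(600, dtype=float).reshape(150, 4)
y_demo = np.repeat([0, 1, 2], 50)

X_tr, X_te, y_tr, y_te = split_train_test(X_demo, y_demo, test_ratio=0.2, seed=666)
print(X_tr.shape, y_tr.shape)
print(X_te.shape, y_te.shape)
```

The return order (X_train, X_test, y_train, y_test) deliberately mirrors sklearn's train_test_split, so the two can be swapped with little friction.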

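train_test_split in sklearn

sklearn also provides a function for splitting a dataset into a training set and a test set.

# First create a kNN classifier my_knn_clf (omitted here)
# Import the function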
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
print(X_train.shape)
print(y_train.shape)
# (120, 4) (120,)
print(X_test.shape)
print(y_test.shape)
# (30, 4) (30,)
my_knn_clf.fit(X_train, y_train)
# Out[32]:
# KNN(k=3)
y_predict = my_knn_clf.predict(X_test)
sum(y_predict == y_test)/len(y_test)
# Out[34]:
# 1.0
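
The creation of my_knn_clf is omitted above. For readers who want something runnable, here is a minimal sketch of a kNN classifier with the same fit/predict interface. The class name `KNNSketch`, the Euclidean distance, and the simple majority vote are assumptions for illustration; the classifier actually used in this series may differ in its details.

```python
from collections import Counter

import numpy as np


class KNNSketch:
    """Minimal kNN classifier (illustrative; not the author's omitted class)."""

    def __init__(self, k):
        assert k >= 1, "k must be at least 1"
        self.k = k
        self._X_train = None
        self._y_train = None

    def fit(self, X_train, y_train):
        # kNN has no real training step: it only stores the training data
        assert X_train.shape[0] == y_train.shape[0], "X and y sizes must match"
        assert self.k <= X_train.shape[0], "k cannot exceed the number of samples"
        self._X_train = X_train
        self._y_train = y_train
        return self

    def predict(self, X_predict):
        assert self._X_train is not None, "call fit before predict"
        return np.array([self._predict_one(x) for x in X_predict])

    def _predict_one(self, x):
        # Euclidean distance from x to every training sample
        distances = np.sqrt(np.sum((self._X_train - x) ** 2, axis=1))
        # Labels of the k closest samples, then a majority vote
        nearest = np.argsort(distances)[: self.k]
        votes = Counter(self._y_train[nearest].tolist())
        return votes.most_common(1)[0][0]

    def __repr__(self):
        return "KNNSketch(k=%d)" % self.k


# Two well-separated synthetic clusters, one per class
X_demo = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

clf = KNNSketch(k=3).fit(X_demo, y_demo)
y_pred = clf.predict(np.array([[0.05, 0.05], [5.05, 5.0]]))
print(y_pred)
```

With a class like this, the fit and predict calls shown above would work unchanged on the iris split.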

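04 Classification Accuracy

This example uses the handwritten digits dataset from datasets to demonstrate how classification accuracy is computed.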

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from sklearn import datasets
digits = datasets.load_digits() # Handwritten digits dataset
digits.keys()
# Out[3]:
# dict_keys(['data', 'target', 'target_names', 'images', 'DESCR'])
print(digits.DESCR)
X = digits.data # Feature matrix
y = digits.target # Labels
# Pick an arbitrary sample
some_digit = X[666]
some_digit_image = some_digit.reshape(8, 8)
plt.imshow(some_digit_image, cmap = matplotlib.cm.binary)
plt.show()

Output: [figure: the selected sample rendered as an 8x8 grayscale image]

accuracy_score in scikit-learn

# Import train_test_split and split the dataset first
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Import sklearn's kNN classifier
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
y_predict = knn_clf.predict(X_test)
# Import sklearn's function for computing accuracy
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_predict)
# Out[27]:
# 0.99444444444444446
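
Classification accuracy is simply the fraction of test samples whose predicted label equals the true label, the same quantity computed earlier as sum(y_predict == y_test)/len(y_test). A minimal sketch of that calculation, assuming NumPy arrays of labels, is shown below. The function name `accuracy` and the demo labels are hypothetical; with its default arguments, sklearn's accuracy_score should return the same fraction.

```python
import numpy as np


def accuracy(y_true, y_predict):
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_predict = np.asarray(y_predict)
    assert y_true.shape == y_predict.shape, "label arrays must have the same shape"
    # The mean of a boolean array is the share of True entries
    return float(np.mean(y_true == y_predict))


# Hypothetical labels: 4 of the 5 predictions are correct
y_true_demo = np.array([3, 1, 4, 1, 5])
y_pred_demo = np.array([3, 1, 4, 7, 5])
print(accuracy(y_true_demo, y_pred_demo))  # 0.8
```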
