The following is an example that clusters beers by the amounts of their ingredients, in order to separate different brands of beer.
1. Import and process the data
import pandas as pd

beer = pd.read_csv('data.txt', sep=' ')
beer
X = beer[["calories", "sodium", "alcohol", "cost"]]

# Import KMeans to cluster with K-means
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3).fit(X)
km2 = KMeans(n_clusters=2).fit(X)

# Add the label columns cluster and cluster2 directly to the data
beer['cluster'] = km.labels_
beer['cluster2'] = km2.labels_
beer.sort_values('cluster')
2. Compute the per-feature means of each cluster
# The old import below is wrong -- scatter_matrix was moved
# out of pandas.tools.plotting into pandas.plotting
## from pandas.tools.plotting import scatter_matrix
from pandas.plotting import scatter_matrix
%matplotlib inline

cluster_centers = km.cluster_centers_
cluster_centers_2 = km2.cluster_centers_
beer.groupby("cluster").mean()
beer.groupby("cluster2").mean()
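The four features above sit on very different scales (calories in the hundreds, cost below a dollar), so K-means on the raw values is dominated by the calories column. A common refinement, sketched here on synthetic stand-in data since data.txt is not included, is to standardize each feature with sklearn's StandardScaler before clustering:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the beer data: calories dominate the raw scale
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(150, 30, 60),   # calories
    rng.normal(15, 5, 60),     # sodium
    rng.normal(4.5, 0.5, 60),  # alcohol
    rng.normal(0.5, 0.1, 60),  # cost
])

# Standardize each column to zero mean / unit variance so no single
# feature dominates the Euclidean distances used by K-means
X_scaled = StandardScaler().fit_transform(X)

km_scaled = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
print(X_scaled.mean(axis=0).round(6))  # approximately [0, 0, 0, 0]
print(X_scaled.std(axis=0).round(6))   # approximately [1, 1, 1, 1]
```

After scaling, each feature contributes comparably to the distance computation, and the resulting cluster labels can differ noticeably from the unscaled ones.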
3. Plotting and analysis
Below is the clustering result on the calories and alcohol dimensions.
# Cluster centers
centers = beer.groupby("cluster").mean().reset_index()

# Plot the 3-cluster result
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['font.size'] = 14

# Four colors, one per possible cluster label
import numpy as np
colors = np.array(['red', 'green', 'blue', 'yellow'])

# Plot the samples (x = calories, y = alcohol) colored by cluster,
# then mark the cluster centers with black crosses.
# Note: the y-values were mistakenly beer["calories"] in the original.
plt.scatter(beer["calories"], beer["alcohol"], c=colors[beer["cluster"]])
plt.scatter(centers.calories, centers.alcohol, linewidths=3, marker='+', s=300, c='black')
plt.xlabel("Calories")
plt.ylabel("Alcohol")
Plot the pairwise scatter plots of the features after clustering:
- Result for K = 3
scatter_matrix(beer[["calories", "sodium", "alcohol", "cost"]],
               s=100, alpha=1, c=colors[beer["cluster"]], figsize=(10, 10))
plt.suptitle("With 3 centroids initialized")
- Result for K = 2
scatter_matrix(beer[["calories", "sodium", "alcohol", "cost"]],
               s=100, alpha=1, c=colors[beer["cluster2"]], figsize=(10, 10))
plt.suptitle("With 2 centroids initialized")
4. Silhouette coefficient analysis
Since the results for k = 2 and k = 3 above differ little, we can bring in the silhouette coefficient, a way of evaluating how good a clustering is.
- Compute ai, the average distance from sample i to the other samples in its own cluster. The smaller ai is, the more sample i belongs in that cluster; ai is called the intra-cluster dissimilarity of sample i.
- Compute bij, the average distance from sample i to all samples of another cluster Cj, called the dissimilarity of sample i to cluster Cj. The inter-cluster dissimilarity of sample i is then defined as bi = min{bi1, bi2, …, bik}.
- The silhouette coefficient of sample i is si = (bi − ai) / max(ai, bi).
- If si is close to 1, sample i is clustered reasonably.
- If si is close to −1, sample i should rather be assigned to another cluster.
- If si is approximately 0, sample i lies on the boundary between two clusters.
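The definition above can be checked directly against sklearn: the sketch below computes ai, bi and si by hand on a tiny made-up two-cluster dataset and compares the result with sklearn's silhouette_samples (the data values are purely illustrative):

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Tiny synthetic dataset: two well-separated groups of three points each
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

def manual_silhouette(X, labels):
    n = len(X)
    # Pairwise Euclidean distance matrix
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    s = np.zeros(n)
    for i in range(n):
        same = labels == labels[i]
        same[i] = False  # exclude the sample itself from its own cluster
        a_i = dist[i, same].mean()                 # intra-cluster dissimilarity ai
        b_i = min(dist[i, labels == c].mean()      # inter-cluster dissimilarity bi
                  for c in np.unique(labels) if c != labels[i])
        s[i] = (b_i - a_i) / max(a_i, b_i)         # si = (bi - ai) / max(ai, bi)
    return s

manual = manual_silhouette(X, labels)
print(np.allclose(manual, silhouette_samples(X, labels)))   # per-sample agreement
print(np.isclose(manual.mean(), silhouette_score(X, labels)))  # mean agreement
```

silhouette_score is simply the mean of the per-sample values, which is why it serves as a single summary number for a whole clustering.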
from sklearn import metrics

# Silhouette score for the k = 3 labelling.
# (The original also scored a beer.scaled_cluster column, but that
# column is never created above, so it is dropped here.)
score = metrics.silhouette_score(X, beer.cluster)
print(score)

# Try k = 2..19 and record the silhouette score for each
scores = []
for k in range(2, 20):
    labels = KMeans(n_clusters=k).fit(X).labels_
    score = metrics.silhouette_score(X, labels)
    scores.append(score)

plt.plot(list(range(2, 20)), scores)
plt.xlabel("Number of Clusters Initialized")
plt.ylabel("Silhouette Score")
As the plot shows, the silhouette coefficient is closer to 1 when n_clusters = 2, so that choice is more suitable. In clustering, however, evaluation metrics are only a reference; with a real dataset you should still examine the results concretely.
When modeling with the sklearn toolkit, switching to a different algorithm is very convenient: you only need to change the estimator.
from pandas.plotting import scatter_matrix

# Switch to the DBSCAN algorithm
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=10, min_samples=2).fit(X)
labels = db.labels_
beer['cluster_db'] = labels
beer.sort_values('cluster_db')
beer.groupby('cluster_db').mean()
scatter_matrix(X, c=colors[beer.cluster_db], figsize=(10, 10), s=100)
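One detail worth noting about the DBSCAN cell above: points that belong to no cluster are labelled -1 (noise), which the colors[...] indexing trick silently maps to the last entry of the color array. A minimal sketch on synthetic data (the eps and min_samples values here are illustrative, not tuned):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# Two dense blobs plus one far-away outlier
X = np.vstack([
    rng.normal(0, 0.3, (20, 2)),
    rng.normal(5, 0.3, (20, 2)),
    [[50.0, 50.0]],               # isolated point -> labelled as noise (-1)
])

db = DBSCAN(eps=1.0, min_samples=3).fit(X)
print(sorted(set(db.labels_)))    # noise points appear as the label -1
```

Unlike K-means, DBSCAN infers the number of clusters from density, so the count of distinct non-negative labels depends on eps and min_samples rather than being fixed in advance.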