聚类
在本练习中,我们将实现K-means聚类
K-means 聚类
我们将实施和应用K-means到一个简单的二维数据集,以获得一些直观的工作原理。 K-means是一个迭代的,无监督的聚类算法,将类似的实例组合成簇。 该算法通过猜测每个簇的初始聚类中心开始,然后重复将实例分配给最近的簇,并重新计算该簇的聚类中心。 我们要实现的第一部分是找到数据中每个实例最接近的聚类中心的函数。
1 准备数据
无监督学习中,数据是不带任何标签的
import numpy as np import pandas as pd import matplotlib.pyplot as plt
import pandas as pd import numpy as np data=pd.read_csv("data/ex7data2.csv") data.head()
X1 | X2 | |
0 | 1.842080 | 4.607572 |
1 | 5.658583 | 4.799964 |
2 | 6.352579 | 3.290854 |
3 | 2.904017 | 4.612204 |
4 | 3.231979 | 4.939894 |
import seaborn as sb plt.figure(figsize=(4,6)) sb.lmplot(x="X1",y="X2",data=data,fit_reg=False) plt.show()
<Figure size 288x432 with 0 Axes>
1 初始化聚类中心
2 所有样本点聚类(计算每个样本点与聚类中心的距离,选择最小距离的聚类中心所在的聚类作为该样本所属聚类)
3 重新计算聚类中心(每个聚类中所有样本的均值)
4 迭代执行2-3直到聚类中心不再变化(iter_num)
2 给定聚类中心,计算每个点属于哪个聚类,定义函数实现
#给定聚类中心,如何求每个样本所属的聚类 def find_closest(X,centroids): #样本数量 m=X.shape[0] idx=np.zeros(m) k=centroids.shape[0] #遍历所有样本 for i in range(m): distance=10000 #遍历所有聚类中心 for j in range(k): #计算样本与聚类中心的距离 dist=np.sum(np.power(X[i,:]-centroids[j,:],2)) if dist<distance: distance=dist idx[i]=j return idx
#测试一下 centorids=np.arange(1,5).reshape(2,2) centorids
array([[1, 2], [3, 4]])
idx=find_closest(data.values,centorids)
idx
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
3 根据已有的数据的标记,来重新更新聚类中心,定义相应的函数
index=np.where(idx==0)[0] index
array([ 56, 100, 101, 103, 104, 105, 106, 107, 108, 110, 111, 112, 114, 115, 117, 118, 119, 120, 121, 122, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 140, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 162, 164, 165, 166, 167, 168, 169, 170, 172, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 188, 189, 190, 191, 192, 193, 194, 195, 197, 198, 199], dtype=int64)
def update_centorids(X,idx,k): centorids=np.zeros((k,X.shape[1])) for i in range(k): index=np.where(idx==i)[0] centorids[i]=np.sum(X[index,:],axis=0)/len(index) return centorids
#测试函数 update_centorids(data.values,idx,2)
array([[2.8195125 , 0.99467112], [4.03762952, 3.8009101 ]])
4 初始化聚类中心,定义相应的函数
def initialize_centroid(X,k): np.random.seed(30) index_initial=[np.random.randint(1,X.shape[0]) for i in range(k)] centorids=np.zeros((k,X.shape[1])) print(index_initial) for i,j in enumerate(index_initial): centorids[i]=X[j,:] return centorids
#测试一下该函数 initialize_centroid(data.values,3)
[294, 141, 252] array([[6.48212628, 2.5508514 ], [3.7875723 , 1.45442904], [6.01017978, 2.72401338]])
data.values[294]
array([6.48212628, 2.5508514 ])
5 定义K-means算法
#设计K-means算法 def k_means(X,k,iter_num): centroids=np.zeros((k,X.shape[1])) #初始化聚类中心 centroids=initialize_centroid(X,k) for i in range(iter_num): #每个样本找到所属聚类 idx=find_closest(X,centroids) print(idx) #更新新的聚类中心 centroids=update_centorids(X,idx,k) print(centroids) return centroids,idx
centroids,idx=k_means(data.values,3,10)
[294, 141, 252] [1. 2. 2. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 1. 2. 1. 1. 1. 1. 1. 1. 2. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 1. 1. 1. 1. 2. 1. 1. 1. 1. 1. 2. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 2. 0. 2. 2. 2. 0. 2. 2. 2. 0. 0. 0. 0. 2. 2. 2. 0. 0. 0. 2. 2. 0. 2. 2. 2. 2. 2. 2. 2. 2. 2. 0. 2. 0. 2. 0. 2. 2. 1. 2. 1. 2. 0. 0. 0. 2. 2. 2. 2. 2. 0. 2. 0. 2. 2. 2. 0. 0. 0. 0. 2. 0. 2. 0. 1. 0. 2. 2. 2. 0. 2. 0. 1. 2. 2. 2. 2. 2. 2. 0. 2. 0. 0. 2. 2. 2. 0. 2. 2. 2. 2. 2. 2. 0. 0. 0. 2. 2. 1.] [[6.89324886 2.94679018] [2.48934355 2.89564245] [5.42986227 3.25759288]] [1. 2. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 1. 2. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 1. 1. 1. 1. 1. 1. 1. 1. 2. 1. 1. 1. 2. 2. 0. 2. 2. 2. 0. 2. 2. 2. 0. 0. 0. 0. 0. 2. 2. 0. 0. 0. 2. 2. 0. 2. 2. 2. 2. 2. 2. 2. 2. 2. 0. 0. 0. 2. 0. 2. 2. 1. 0. 2. 2. 0. 0. 0. 2. 2. 2. 2. 2. 0. 2. 0. 0. 0. 2. 0. 0. 0. 0. 2. 0. 2. 0. 1. 0. 2. 2. 2. 0. 2. 0. 2. 2. 0. 2. 2. 0. 2. 0. 2. 0. 0. 0. 0. 2. 0. 2. 2. 2. 2. 2. 2. 0. 0. 0. 2. 0. 1.] [[6.73758256 2.94610993] [2.41318124 3.02894849] [5.28641575 2.89315506]] [1. 2. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 1. 2. 1. 1. 2. 1. 2. 1. 1. 1. 2. 1. 1. 1. 1. 1. 1. 1. 2. 1. 2. 2. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 1. 2. 2. 1. 1. 1. 2. 1. 2. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 2. 1. 2. 1. 1. 1. 1. 1. 2. 1. 2. 1. 2. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 1. 2. 1. 1. 1. 1. 1. 1. 1. 1. 2. 1. 1. 1. 2. 0. 0. 2. 2. 0. 0. 2. 2. 2. 0. 0. 0. 0. 0. 2. 2. 0. 0. 0. 2. 2. 0. 2. 2. 2. 2. 2. 2. 0. 2. 2. 0. 0. 0. 2. 0. 2. 2. 2. 0. 2. 2. 0. 0. 0. 2. 2. 2. 0. 2. 0. 2. 0. 0. 0. 2. 0. 0. 0. 0. 2. 0. 2. 0. 1. 0. 2. 2. 2. 0. 2. 0. 2. 2. 0. 2. 2. 0. 2. 0. 2. 0. 0. 0. 0. 2. 0. 2. 2. 2. 2. 2. 2. 0. 0. 0. 2. 0. 1.] [[6.68390018 2.94499954] [2.28558411 3.23123345] [5.0071835 2.44170899]] [1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 1. 2. 1. 2. 2. 2. 2. 1. 2. 1. 1. 1. 2. 1. 1. 1. 1. 2. 1. 2. 2. 2. 2. 2. 1. 1. 2. 1. 1. 2. 2. 1. 2. 2. 1. 1. 1. 2. 2. 2. 2. 2. 1. 1. 2. 2. 2. 1. 1. 1. 2. 1. 1. 1. 1. 1. 1. 2. 2. 2. 2. 1. 2. 1. 1. 1. 1. 1. 2. 1. 2. 1. 2. 2. 1. 2. 2. 1. 1. 2. 1. 1. 2. 2. 2. 2. 2. 1. 1. 1. 1. 1. 2. 1. 1. 2. 1. 1. 1. 0. 0. 0. 2. 2. 0. 0. 2. 2. 2. 0. 0. 0. 0. 0. 2. 2. 0. 0. 0. 0. 0. 0. 0. 2. 2. 2. 2. 2. 0. 2. 2. 0. 0. 0. 0. 0. 2. 0. 2. 0. 2. 2. 0. 0. 0. 2. 2. 0. 0. 0. 0. 0. 0. 0. 0. 2. 0. 0. 0. 0. 2. 0. 2. 0. 1. 0. 2. 2. 2. 0. 2. 0. 2. 0. 0. 2. 2. 0. 2. 0. 0. 0. 0. 0. 0. 0. 0. 2. 0. 0. 2. 2. 2. 0. 0. 0. 2. 0. 1.] [[6.49272845 2.9926145 ] [2.11716681 3.6129498 ] [4.38057974 1.85041121]] [1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 2. 2. 1. 2. 2. 2. 2. 1. 2. 2. 1. 1. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 1. 2. 2. 2. 2. 2. 2. 1. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 1. 2. 2. 2. 2. 1. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 1. 2. 2. 2. 2. 2. 2. 2. 2. 2. 1. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 1. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 2. 0. 0. 0. 0. 0. 2. 0. 0. 0. 0. 0. 0. 0. 0. 2. 0. 0. 2. 0. 0. 2. 0. 0. 0. 0. 0. 0. 0. 0. 2. 0. 2. 0. 0. 0. 0. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 2. 0. 0. 0. 2. 0. 2. 2. 0. 0. 2. 0. 2. 0. 0. 0. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 2. 0. 1.] [[6.23121683 3.03625011] [1.90893972 4.6245583 ] [3.44146283 1.24700833]] [1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 2. 0. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 2. 0. 0. 0. 0. 0. 0. 0. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.] [[6.07115453 3.00350207] [1.95399466 5.02557006] [3.06584667 1.05078048]] [1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.] [[6.03366736 3.00052511] [1.95399466 5.02557006] [3.04367119 1.01541041]] [1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.] [[6.03366736 3.00052511] [1.95399466 5.02557006] [3.04367119 1.01541041]] [1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.] [[6.03366736 3.00052511] [1.95399466 5.02557006] [3.04367119 1.01541041]] [1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.] [[6.03366736 3.00052511] [1.95399466 5.02557006] [3.04367119 1.01541041]]
centroids.shape[0]
3
6 绘制各个聚类的图
#画图 cluster1=data.values[np.where(idx==0)[0],:] cluster1 cluster2=data.values[np.where(idx==1)[0],:] cluster2 cluster3=data.values[np.where(idx==2)[0],:] cluster3
array([[3.20360621, 0.7222149 ], [3.06192918, 1.5719211 ], [4.01714917, 1.16070647], [1.40260822, 1.08726536], [4.08164951, 0.87200343], [3.15273081, 0.98155871], [3.45186351, 0.42784083], [3.85384314, 0.7920479 ], [1.57449255, 1.34811126], [4.72372078, 0.62044136], [2.87961084, 0.75413741], [0.96791348, 1.16166819], [1.53178107, 1.10054852], [4.13835915, 1.24780979], [3.16109021, 1.29422893], [2.95177039, 0.89583143], [3.27844295, 1.75043926], [2.1270185 , 0.95672042], [3.32648885, 1.28019066], [2.54371489, 0.95732716], [3.233947 , 1.08202324], [4.43152976, 0.54041 ], [3.56478625, 1.11764714], [4.25588482, 0.90643957], [4.05386581, 0.53291862], [3.08970176, 1.08814448], [2.84734459, 0.26759253], [3.63586049, 1.12160194], [1.95538864, 1.32156857], [2.88384005, 0.80454506], [3.48444387, 1.13551448], [3.49798412, 1.10046402], [2.45575934, 0.78904654], [3.2038001 , 1.02728075], [3.00677254, 0.62519128], [1.96547974, 1.2173076 ], [2.17989333, 1.30879831], [2.61207029, 0.99076856], [3.95549912, 0.83269299], [3.64846482, 1.62849697], [4.18450011, 0.45356203], [3.7875723 , 1.45442904], [3.30063655, 1.28107588], [3.02836363, 1.35635189], [3.18412176, 1.41410799], [4.16911897, 0.20581038], [3.24024211, 1.14876237], [3.91596068, 1.01225774], [2.96979716, 1.01210306], [1.12993856, 0.77085284], [2.71730799, 0.48697555], [3.1189017 , 0.69438336], [2.4051802 , 1.11778123], [2.95818429, 1.01887096], [1.65456309, 1.18631175], [2.39775807, 1.24721387], [2.28409305, 0.64865469], [2.79588724, 0.99526664], [3.41156277, 1.1596363 ], [3.50663521, 0.73878104], [3.93616029, 1.46202934], [3.90206657, 1.27778751], [2.61036396, 0.88027602], [4.37271861, 1.02914092], [3.08349136, 1.19632644], [2.1159935 , 0.7930365 ], [2.15653404, 0.40358861], [2.14491101, 1.13582399], [1.84935524, 1.02232644], [4.1590816 , 0.61720733], [2.76494499, 1.43148951], [3.90561153, 1.16575315], [2.54071672, 0.98392516], [4.27783068, 1.1801368 ], [3.31058167, 1.03124461], [2.15520661, 0.80696562], [3.71363659, 0.45813208], [3.54010186, 0.86446135], [1.60519991, 1.1098053 ], [1.75164337, 0.68853536], [3.12405123, 0.67821757], [2.37198785, 1.42789607], [2.53446019, 1.21562081], [3.6834465 , 1.22834538], [3.2670134 , 0.32056676], [3.94159139, 0.82577438], [3.2645514 , 1.3836869 ], [4.30471138, 1.10725995], [2.68499376, 0.35344943], [3.12635184, 1.2806893 ], [2.94294356, 1.02825076], [3.11876541, 1.33285459], [2.02358978, 0.44771614], [3.62202931, 1.28643763], [2.42865879, 0.86499285], [2.09517296, 1.14010491], [5.29239452, 0.36873298], [2.07291709, 1.16763851], [0.94623208, 0.24522253], [2.73911908, 1.10072284], [3.96162465, 2.72025046], [3.45928006, 2.68478445]])
centroids
array([[6.03366736, 3.00052511], [1.95399466, 5.02557006], [3.04367119, 1.01541041]])
import matplotlib.pyplot as plt fig,axe=plt.subplots(figsize=(6,9)) axe.scatter(cluster1[:,0],cluster1[:,1],s=10,color="red",label="cluster1") axe.scatter(cluster2[:,0],cluster2[:,1],s=10,color="yellow",label="cluster2") axe.scatter(cluster3[:,0],cluster3[:,1],s=10,color="green",label="cluster2") axe.scatter(centroids[:,0],centroids[:,1],s=30,marker="+",c="k") plt.show()
我们跳过的一个步骤是初始化聚类中心的过程。 这可以影响算法的收敛。 我们的任务是创建一个选择随机样本并将其用作初始聚类中心的函数。
7 定义评价函数–即任意一点所在聚类与聚类中心的距离平方和
#定义一个评价函数 def metric_square(X,idx,centroids,k): lst_dist=[] for i in range(k): cluster=X[np.where(idx==i)[0],:] dist=np.sum(np.power(cluster-centroids[i,:],2)) lst_dist.append(dist) return sum(lst_dist)
cluster=data.values[np.where(idx==0)[0],:]
centroids[0,:]
array([6.03366736, 3.00052511])
np.sum(np.power(cluster-centroids[0,:],2))
82.48594291556887
#测试该函数 metric_square(data.values,idx,centroids,3)
266.65851965491936