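1. Requirements
Cluster a given data set.
This example uses a two-dimensional data set with 80 samples that fall into 4 clusters. A sample of the data (testSet.txt), one tab-separated point per line: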
1.658985	4.285136
-3.453687	3.424321
4.838138	-1.151539
-5.379713	-3.362104
0.972564	2.924086
-3.567919	1.531611
0.450614	-3.302219
-3.487105	-1.724432
2.668759	1.594842
-3.156485	3.191137
3.165506	-3.999838
-2.786837	-3.099354
4.208187	2.984927
-2.123337	2.943366
0.704199	-0.479481
-0.392370	-3.963704
2.831667	1.574018
-0.790153	3.343144
2.943496	-3.357075
2. Python implementation
2.1 Manual implementation with NumPy
from numpy import *

# Load the data: one sample per line, tab-separated floats
def loadDataSet(fileName):
    dataMat = []
    with open(fileName) as fr:
        for line in fr:
            curLine = line.strip().split('\t')
            fltLine = list(map(float, curLine))  # convert every field to float
            dataMat.append(fltLine)
    return dataMat

# Euclidean distance between two vectors
def distEclud(vecA, vecB):
    return sqrt(sum(power(vecA - vecB, 2)))

# Build the initial cluster centers: k random centroids (k = 4 in this example)
def randCent(dataSet, k):
    n = shape(dataSet)[1]
    centroids = mat(zeros((k, n)))  # k centroids, each with n coordinates
    for j in range(n):
        minJ = min(dataSet[:, j])
        maxJ = max(dataSet[:, j])
        rangeJ = float(maxJ - minJ)
        centroids[:, j] = minJ + rangeJ * random.rand(k, 1)
    return centroids

# k-means clustering; dataSet must be a NumPy matrix, e.g. mat(loadDataSet(fileName))
def kMeans(dataSet, k, distMeans=distEclud, createCent=randCent):
    m = shape(dataSet)[0]
    # For each sample: index of its cluster and squared distance to that centroid
    clusterAssment = mat(zeros((m, 2)))
    centroids = createCent(dataSet, k)
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        for i in range(m):
            minDist = inf
            minIndex = -1
            for j in range(k):
                distJI = distMeans(centroids[j, :], dataSet[i, :])
                if distJI < minDist:
                    minDist = distJI
                    minIndex = j
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist**2
        print(centroids)
        for cent in range(k):
            # Select all rows whose first column (cluster index) equals cent
            ptsInClust = dataSet[nonzero(clusterAssment[:, 0].A == cent)[0]]
            centroids[cent, :] = mean(ptsInClust, axis=0)
    return centroids, clusterAssment
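The per-sample loops in kMeans can also be written with array broadcasting. The following is only a minimal sketch of that idea, not the listing above: it works on plain ndarrays rather than mat, the function name kmeans_vectorized and its parameters are made up for illustration, it picks the initial centroids from the samples themselves instead of using randCent, and it uses synthetic data as a stand-in for testSet.txt.

```python
import numpy as np

def kmeans_vectorized(data, k, rng, max_iter=100):
    """Lloyd's algorithm on an (m, n) ndarray; returns (centroids, labels)."""
    # Assumption: start from k distinct samples chosen at random
    # (this differs from randCent, which draws points inside the bounding box).
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    labels = np.full(len(data), -1)
    for _ in range(max_iter):
        # (m, k) matrix of Euclidean distances from every sample to every centroid
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no sample changed cluster, so the algorithm has converged
        labels = new_labels
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Synthetic stand-in for testSet.txt: 80 points scattered around 4 centers
rng = np.random.default_rng(0)
true_centers = np.array([[3.0, 3.0], [-3.0, 3.0], [3.0, -3.0], [-3.0, -3.0]])
data = np.vstack([c + 0.5 * rng.standard_normal((20, 2)) for c in true_centers])

centroids, labels = kmeans_vectorized(data, 4, rng)
print(centroids)
```

Like the loop version, this can settle in a local optimum depending on the initial centroids, so different seeds may give different clusterings.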
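2.2 Implementation with scikit-learn
Scikit-learn is a Python machine learning module released under the BSD open-source license.
Its functionality falls into six main areas: classification, regression, clustering, dimensionality reduction, model selection, and data preprocessing. It includes algorithms such as SVM, decision trees, GBDT, KNN, and k-means.
K-means is already implemented in scikit-learn. Once the data has been prepared in the form the algorithm expects, you pass in the parameters and call its k-means implementation directly to do the clustering.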
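As a rough illustration of what such a direct call might look like, here is a minimal sketch using sklearn.cluster.KMeans. The synthetic data is an assumption standing in for testSet.txt, and the n_init and random_state values are arbitrary choices made only so the run is repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for testSet.txt: 80 two-dimensional points around 4 centers
rng = np.random.default_rng(0)
centers = np.array([[3.0, 3.0], [-3.0, 3.0], [3.0, -3.0], [-3.0, -3.0]])
data = np.vstack([c + 0.5 * rng.standard_normal((20, 2)) for c in centers])

# n_clusters is k; n_init restarts with different initial centroids and keeps the best run
model = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = model.fit_predict(data)

print(model.cluster_centers_)  # one row per centroid
print(labels[:10])             # cluster index of the first 10 samples
```

To cluster the real file instead, something like data = np.loadtxt("testSet.txt") should read the tab-separated format, assuming the file is in the working directory.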
#################################################
# kmeans: k-means cluster
#################################################
from numpy import *
import time
import matplotlib.pyplot as plt

## step 1: load the data
print("step 1: load data...")
dataSet = []
fileIn = open('E:/Python/ml-data/kmeans/testSet.txt')
for line in fileIn.readlines():
    lineArr = line.strip().split('\t')
    dataSet.append([float(lineArr[0]), float(lineArr[1])])

## step 2: cluster
# Note: kmeans() and showCluster() are not defined in this listing and must be supplied separately.
print("step 2: clustering...")
dataSet = mat(dataSet)
k = 4
centroids, clusterAssment = kmeans(dataSet, k)

## step 3: show the result
print("step 3: show the result...")
showCluster(dataSet, k, centroids, clusterAssment)
2.3 Results
Each cluster is drawn in a different color, and the large diamond marks the mean (centroid) of the corresponding cluster.
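The plotting helper showCluster is called in section 2.2 but its code is not shown. The following is a minimal sketch of how such a plot could be drawn with matplotlib. The function name show_cluster, its signature (plain ndarrays and a 1-D label array rather than the clusterAssment matrix), and the color list are all assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np

def show_cluster(data, k, centroids, labels):
    """Scatter-plot 2-D samples by cluster and mark each centroid with a large diamond."""
    colors = ["red", "blue", "green", "black", "orange", "purple"]
    if k > len(colors):
        raise ValueError("this sketch only defines %d colors" % len(colors))
    fig, ax = plt.subplots()
    for j in range(k):
        pts = data[labels == j]
        ax.scatter(pts[:, 0], pts[:, 1], color=colors[j], s=20)   # samples of cluster j
        ax.scatter(centroids[j, 0], centroids[j, 1],
                   color=colors[j], marker="D", s=150)             # centroid as a large diamond
    return fig

# Tiny hand-made example: two clusters of two points each
data = np.array([[1.0, 1.0], [1.2, 0.8], [-1.0, -1.0], [-0.8, -1.2]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([data[labels == j].mean(axis=0) for j in range(2)])
fig = show_cluster(data, 2, centroids, labels)
```

The returned figure can then be written to disk with fig.savefig, or shown with plt.show() once the Agg line is removed.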