优化Dunn指数计算？_问答-阿里云开发者社区

def dunn_index(pf, cf): """ pf -- all data points cf -- cluster centroids """ numerator = inf for c in cf: # for each cluster for t in cf: # for each cluster if t is c: continue # if same cluster, ignore numerator = min(numerator, distance(t, c)) # find distance between centroids denominator = 0 for c in cf: # for each cluster for p in pf: # for each point if p.get_cluster() is not c: continue # if point not in cluster, ignore for t in pf: # for each point if t.get_cluster() is not c: continue # if point not in cluster, ignore if t is p: continue # if same point, ignore denominator = max(denominator, distance(t, p)) return numerator/denominator

TLDR ：重要的是，问题是在二维*中设置的。对于大尺寸，这些技术可能无效。

在2D中，我们可以在'O（n log n）`时间内计算每个簇的直径（簇间距离），其中'n'是使用凸包的簇大小。向量化用于加快剩余操作的速度。文章结尾提到了两种可能的渐近改进，欢迎贡献;）

*设置和伪造数据： import numpy as np from scipy import spatial from matplotlib import pyplot as plt

# set up fake data
np.random.seed(0)
n_centroids = 1000
centroids = np.random.rand(n_centroids, 2)
cluster_sizes = np.random.randint(1, 1000, size=n_centroids)
# labels from 1 to n_centroids inclusive
labels = np.repeat(np.arange(n_centroids), cluster_sizes) + 1
points = np.zeros((cluster_sizes.sum(), 2))
points[:,0] = np.repeat(centroids[:,0], cluster_sizes)
points[:,1] = np.repeat(centroids[:,1], cluster_sizes)
points += 0.05 * np.random.randn(cluster_sizes.sum(), 2)

看起来有点像这样：

接下来，基于使用凸包的方法，我们定义一个“直径”函数，用于计算最大簇内距离。

# compute the diameter based on convex hull 
def diameter(pts):
  # need at least 3 points to construct the convex hull
  if pts.shape[0] <= 1:
    return 0
  if pts.shape[0] == 2:
    return ((pts[0] - pts[1])\*2).sum()
  # two points which are fruthest apart will occur as vertices of the convex hull
  hull = spatial.ConvexHull(pts)
  candidates = pts[spatial.ConvexHull(pts).vertices]
  return spatial.distance_matrix(candidates, candidates).max()

对于Dunn指数计算，我假设我们已经计算了点，聚类标签和聚类质心。

如果群集数量很大，则以下基于Pandas的解决方案可能会表现良好：

import pandas as pd
def dunn_index_pandas(pts, labels, centroids):
  # O(k n log(n)) with k clusters and n points; better performance with more even clusters
  max_intracluster_dist = pd.DataFrame(pts).groupby(labels).agg(diameter_pandas)[0].max()
  # O(k^2) with k clusters; can be reduced to O(k log(k))
  # get pairwise distances between centroids
  cluster_dmat = spatial.distance_matrix(centroids, centroids)
  # fill diagonal with +inf: ignore zero distance to self in "min" computation
  np.fill_diagonal(cluster_dmat, np.inf)
  min_intercluster_dist = cluster_sizes.min()
  return min_intercluster_dist / max_intracluster_dist

否则，我们可以继续使用纯粹的numpy解决方案。

def dunn_index(pts, labels, centroids):
  # O(k n log(n)) with k clusters and n points; better performance with more even clusters
  max_intracluster_dist = max(diameter(pts[labels==i]) for i in np.unique(labels))
  # O(k^2) with k clusters; can be reduced to O(k log(k))
  # get pairwise distances between centroids
  cluster_dmat = spatial.distance_matrix(centroids, centroids)
  # fill diagonal with +inf: ignore zero distance to self in "min" computation
  np.fill_diagonal(cluster_dmat, np.inf)
  min_intercluster_dist = cluster_sizes.min()
  return min_intercluster_dist / max_intracluster_dist

%time dunn_index(points, labels, centroids)
# returned value 2.15
# in 2.2 seconds
%time dunn_index_pandas(points, labels, centroids)
# returned 2.15
# in 885 ms

对于iid〜U [1,1000]集群大小的1000集群，这需要2.2。秒在我的机器上。在本例中，使用Pandas方法时，此数字下降到0.8秒（许多小集群）。

当集群数量很大时，还有两个其他相关的优化机会：

First, I am computing the minimal intercluster distance with a brute force ` O(k^2) ` approach where ` k ` is the number of clusters. This can be reduced to ` O(k log(k)) ` , as discussed here.
Second, ` max(diameter(pts[labels==i]) for i in np.unique(labels)) ` requires ` k ` passes over an array of size ` n ` . With many clusters this can become the bottleneck (as in this example). This is somewhat mitigated with the pandas approach, but I expect that this can be optimized a lot further. For current parameters, roughly one third of compute time is spent outside of computing intercluser of intracluster distances.

回答来源：stackoverflow

优化Dunn指数计算？

相关课程

相关电子书

优化Dunn指数计算？

相关课程

相关文章

相关电子书