
Optimizing Dunn index calculation?

The Dunn index is a way of evaluating a clustering; a higher value is better. It is calculated as the lowest intercluster distance (i.e. the smallest distance between any two cluster centroids) divided by the highest intracluster distance (i.e. the largest distance between any two points in any cluster).
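
For example, a direct reading of that definition on tiny hypothetical data (the numbers and names below are made up purely for illustration):

import numpy as np
from scipy.spatial import distance_matrix

# hypothetical toy data: two clusters of two points each
clusters = [np.array([[0.0, 0.0], [0.0, 1.0]]),
            np.array([[5.0, 0.0], [5.0, 1.0]])]
centroids = np.array([c.mean(axis=0) for c in clusters])

# numerator: smallest distance between any two cluster centroids
inter = distance_matrix(centroids, centroids)
np.fill_diagonal(inter, np.inf)   # ignore zero distance to self
numerator = inter.min()           # 5.0

# denominator: largest distance between two points of the same cluster
denominator = max(distance_matrix(c, c).max() for c in clusters)  # 1.0

print(numerator / denominator)    # Dunn index = 5.0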

I have a code snippet for calculating the Dunn index:

from math import inf

def dunn_index(pf, cf):
    """
    pf -- all data points
    cf -- cluster centroids
    """
    numerator = inf
    for c in cf: # for each cluster
        for t in cf: # for each cluster
            if t is c: continue # if same cluster, ignore
            numerator = min(numerator, distance(t, c)) # find distance between centroids
    denominator = 0
    for c in cf: # for each cluster
        for p in pf: # for each point
            if p.get_cluster() is not c: continue # if point not in cluster, ignore
            for t in pf: # for each point
                if t.get_cluster() is not c: continue # if point not in cluster, ignore
                if t is p: continue # if same point, ignore
                denominator = max(denominator, distance(t, p))
    return numerator/denominator

The problem is that this is exceptionally slow: for an example data set consisting of 5000 instances and 15 clusters, the function above needs to perform over 375 million distance calculations in the worst case. In practice it is much lower, but even a best case where the data is already sorted by cluster requires around 25 million distance calculations. I want to shave time off it, and I have already tried rectilinear distance vs. Euclidean distance, which did not help much.

How can I improve this algorithm?

Question source: stackoverflow

is大龙 2020-03-23 23:55:10
1 answer
    • TL;DR: Importantly, the problem is set up in two dimensions. For high dimensions, these techniques may be ineffective.

    In 2D, we can compute the diameter (maximum intracluster distance) of each cluster in `O(n log n)` time, where `n` is the cluster size, using the convex hull. Vectorization is used to speed up the remaining operations. Two possible asymptotic improvements are mentioned at the end of the post; contributions welcome ;)

    Setup and fake data:

    import numpy as np
    from scipy import spatial
    from matplotlib import pyplot as plt

    # set up fake data
    np.random.seed(0)
    n_centroids = 1000
    centroids = np.random.rand(n_centroids, 2)
    cluster_sizes = np.random.randint(1, 1000, size=n_centroids)
    # labels from 1 to n_centroids inclusive
    labels = np.repeat(np.arange(n_centroids), cluster_sizes) + 1
    points = np.zeros((cluster_sizes.sum(), 2))
    points[:,0] = np.repeat(centroids[:,0], cluster_sizes)
    points[:,1] = np.repeat(centroids[:,1], cluster_sizes)
    points += 0.05 * np.random.randn(cluster_sizes.sum(), 2)
    

    It looks something like this (scatter plot of the generated clusters):
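
    The original figure is not reproduced here; a quick sketch to generate a similar plot from the fake data above (the exact styling of the original figure is not known):

    plt.scatter(points[:, 0], points[:, 1], c=labels, s=1, cmap='tab20')
    plt.gca().set_aspect('equal')
    plt.show()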

    Next, based on the convex hull approach, we define a `diameter` function that computes the largest intracluster distance.

    # compute the diameter based on convex hull
    def diameter(pts):
      # the convex hull needs at least 3 points, so handle the small cases directly
      if pts.shape[0] <= 1:
        return 0
      if pts.shape[0] == 2:
        return np.sqrt(((pts[0] - pts[1]) ** 2).sum())
      # the two points which are furthest apart occur as vertices of the convex hull
      hull = spatial.ConvexHull(pts)
      candidates = pts[hull.vertices]
      return spatial.distance_matrix(candidates, candidates).max()
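
    As a quick sanity check, one might compare the hull-based diameter with a brute-force pairwise maximum on small random data (an assumed test, not part of the original answer):

    test_pts = np.random.rand(50, 2)
    brute_force = spatial.distance_matrix(test_pts, test_pts).max()
    assert np.isclose(diameter(test_pts), brute_force)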
    

    For the Dunn index computation, I assume we have already computed the points, the cluster labels and the cluster centroids.
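
    For reference, one way such inputs might be produced (a sketch assuming scikit-learn's KMeans; the variable names below are hypothetical, and any clustering method that yields labels and centroids will do):

    from sklearn.cluster import KMeans

    # fit on the fake data above; for real use, substitute your own points and cluster count
    km = KMeans(n_clusters=n_centroids, n_init=10, random_state=0).fit(points)
    labels_km = km.labels_               # one cluster label per point
    centroids_km = km.cluster_centers_   # one centroid per cluster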

    If the number of clusters is large, the following Pandas-based solution may perform well:

    import pandas as pd
    def dunn_index_pandas(pts, labels, centroids):
      # O(k n log(n)) with k clusters and n points; better performance with more even clusters
      # apply the convex-hull diameter to each cluster's points
      max_intracluster_dist = pd.DataFrame(pts).groupby(labels).apply(
          lambda grp: diameter(grp.values)).max()
      # O(k^2) with k clusters; can be reduced to O(k log(k))
      # get pairwise distances between centroids
      cluster_dmat = spatial.distance_matrix(centroids, centroids)
      # fill diagonal with +inf: ignore zero distance to self in "min" computation
      np.fill_diagonal(cluster_dmat, np.inf)
      min_intercluster_dist = cluster_dmat.min()
      return min_intercluster_dist / max_intracluster_dist
    

    Otherwise, we can proceed with a pure NumPy solution.

    def dunn_index(pts, labels, centroids):
      # O(k n log(n)) with k clusters and n points; better performance with more even clusters
      max_intracluster_dist = max(diameter(pts[labels==i]) for i in np.unique(labels))
      # O(k^2) with k clusters; can be reduced to O(k log(k))
      # get pairwise distances between centroids
      cluster_dmat = spatial.distance_matrix(centroids, centroids)
      # fill diagonal with +inf: ignore zero distance to self in "min" computation
      np.fill_diagonal(cluster_dmat, np.inf)
      min_intercluster_dist = cluster_dmat.min()
      return min_intercluster_dist / max_intracluster_dist
    
    %time dunn_index(points, labels, centroids)
    # returned value 2.15
    # in 2.2 seconds
    %time dunn_index_pandas(points, labels, centroids)
    # returned 2.15
    # in 885 ms
    

    For 1000 clusters with i.i.d. ~U[1, 1000] cluster sizes, this takes 2.2 seconds on my machine. For this example (many small clusters), that number drops to 0.8 seconds with the Pandas approach.

    When the number of clusters is large, there are two further relevant optimization opportunities:

    • First, I am computing the minimal intercluster distance with a brute force `O(k^2)` approach where `k` is the number of clusters. This can be reduced to `O(k log(k))`, as discussed here (see the sketch after this list).

    • Second, `max(diameter(pts[labels==i]) for i in np.unique(labels))` requires `k` passes over an array of size `n`. With many clusters this can become the bottleneck (as in this example). This is somewhat mitigated with the pandas approach, but I expect that this can be optimized a lot further; a sketch of a single-pass grouping is also given after this list. For the current parameters, roughly one third of compute time is spent outside of computing intercluster or intracluster distances.
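
    One way to approach the `O(k log(k))` bound for the minimal intercluster distance is a nearest-neighbour query over the centroids; the sketch below uses `scipy.spatial.cKDTree` (an assumed substitute, not necessarily the method discussed in the link above), which behaves like O(k log k) in typical cases:

    tree = spatial.cKDTree(centroids)
    # for each centroid, find its 2 nearest centroids; the nearest is itself at distance 0
    dists, _ = tree.query(centroids, k=2)
    min_intercluster_dist = dists[:, 1].min()

    For the second point, a sketch of grouping the points in a single pass (sort once by label, then split into contiguous per-cluster blocks) instead of scanning the full array once per cluster; `pts` and `labels` mirror the argument names used in `dunn_index`:

    order = np.argsort(labels, kind="stable")
    sorted_pts = pts[order]
    # cluster sizes in sorted-label order, matching the order of the sorted points
    _, counts = np.unique(labels, return_counts=True)
    groups = np.split(sorted_pts, np.cumsum(counts)[:-1])
    max_intracluster_dist = max(diameter(g) for g in groups)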

    Answer source: stackoverflow

    2020-03-23 23:55:18