Big Data Mining Algorithms: A K-Means Example


1. Introduction

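  K-Means is one of the most widely used clustering algorithms. This article uses the Euclidean distance formula d = sqrt((x1-x2)^2 + (y1-y2)^2) to compute the distance between two-dimensional vectors, which is the basis for assigning points to clusters. The input is two-dimensional data in two columns; the output is the cluster centers and the assignment of each element to a cluster. In the input file, the first three lines give the number of patterns (18), the vector dimension (2), and the number of clusters (2), followed by one pattern per line: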


18
2
2
0.0 0.0 
1.0 0.0 
0.0 1.0 
2.0 1.0 
1.0 2.0 
2.0 2.0 
2.0 0.0 
0.0 2.0 
7.0 6.0 
7.0 7.0 
7.0 8.0 
8.0 6.0 
8.0 7.0 
8.0 8.0 
8.0 9.0 
9.0 7.0 
9.0 8.0 
9.0 9.0 
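
The layout above (three header integers, then one pattern per row) could be read and validated along the following lines. This is only a sketch, separate from the program in this article: `PatternFile` and `parsePatterns` are hypothetical names introduced here, and the code assumes whitespace-separated values exactly as in the sample.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for the parsed file; not part of the original program.
struct PatternFile {
    int numPatterns = 0;   // first header value
    int dimension = 0;     // second header value
    int numClusters = 0;   // third header value
    std::vector<std::vector<double>> patterns;
};

// Reads the three header integers, then numPatterns rows of `dimension` values.
// Returns false if the header is implausible or the data runs out early.
bool parsePatterns(std::istream& in, PatternFile& out) {
    if (!(in >> out.numPatterns >> out.dimension >> out.numClusters)) return false;
    if (out.numPatterns <= 0 || out.dimension <= 0 ||
        out.numClusters <= 0 || out.numClusters > out.numPatterns) return false;
    out.patterns.assign(static_cast<std::size_t>(out.numPatterns),
                        std::vector<double>(static_cast<std::size_t>(out.dimension)));
    for (auto& row : out.patterns)
        for (double& value : row)
            if (!(in >> value)) return false;
    return true;
}
```

Passing a `std::ifstream` opened on the data file, or a `std::istringstream` for testing, works the same way because both derive from `std::istream`.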

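2. Euclidean Distance

Definition: the Euclidean distance is the length of the shortest line segment between two points in n-dimensional space. It is the most commonly used definition of distance and corresponds to the true straight-line distance between two points.
In two and three dimensions, the Euclidean distance is the ordinary distance between two points. The two-dimensional formula is
d = sqrt((x1-x2)^2 + (y1-y2)^2)
The three-dimensional formula is
d = sqrt((x1-x2)^2 + (y1-y2)^2 + (z1-z2)^2)
Generalized to n-dimensional space, the formula is
d = sqrt( ∑(xi1-xi2)^2 ), where i = 1, 2, ..., n
Here xi1 is the i-th coordinate of the first point and xi2 is the i-th coordinate of the second point.
n-dimensional Euclidean space is a set of points, each of which can be written as (x(1), x(2), ..., x(n)), where each x(i) (i = 1, 2, ..., n) is a real number called the i-th coordinate of x. The distance d(x, y) between two points x and y = (y(1), y(2), ..., y(n)) is defined by the formula above.
Euclidean distance can also be read as a measure of similarity between signals: the closer two signals are, the more similar they are, the more easily they interfere with each other, and the higher the bit error rate.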
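
The n-dimensional formula translates almost directly into code. The sketch below is illustrative only; `squaredDistance` and `euclideanDistance` are hypothetical helper names, and both points are assumed to have the same number of coordinates.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sum of squared coordinate differences: the quantity under the square root.
// Assumes a and b have the same number of coordinates.
double squaredDistance(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double diff = a[i] - b[i];
        sum += diff * diff;
    }
    return sum;
}

// d = sqrt( sum over i of (a_i - b_i)^2 )
double euclideanDistance(const std::vector<double>& a, const std::vector<double>& b) {
    return std::sqrt(squaredDistance(a, b));
}
```

For the points (0, 0) and (3, 4) the squared distance is 9 + 16 = 25, so `euclideanDistance` returns 5. When distances are only compared with each other, as in nearest-center search, the squared value is enough because the square root does not change the ordering.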
3. Code Example


/****************************************************************************
*                                                                           *
*  KMEANS                                                                   *
*                                                                           *
*****************************************************************************/

#include <stdio.h>
#include <stdlib.h>   // fcvt() is declared here; it is non-standard (available in MSVC and glibc)
#include <string.h>
#include <math.h>

// FUNCTION PROTOTYPES


// DEFINES
#define         SUCCESS         1
#define         FAILURE         0
#define         TRUE            1
#define         FALSE           0
#define         MAXVECTDIM      20
#define         MAXPATTERN      20
#define         MAXCLUSTER      10





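// Format x as text with `width` digits after the decimal point, e.g. " 1.0000" or " .0000".
// The buffer is static so the returned pointer stays valid after the function returns;
// each call overwrites the previous result, so use the string before calling f2a again.
char *f2a(double x, int width){
   static char cbuf[255];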
   char *cp;
   int i,k;
   int d,s;
    cp=fcvt(x,width,&d,&s);
    if (s) {
       strcpy(cbuf,"-");
     }
     else {
       strcpy(cbuf," ");
       } /* endif */
    if (d>0) {
       for (i=0; i<d; i++) {
          cbuf[i+1]=cp[i];
          } /* endfor */
       cbuf[d+1]=0;
       cp+=d;
       strcat(cbuf,".");
       strcat(cbuf,cp);
       } else {
          if (d==0) {
             strcat(cbuf,".");
             strcat(cbuf,cp);
             } 
           else {
             k=-d;
             strcat(cbuf,".");
             for (i=0; i<k; i++) {
                strcat(cbuf,"0");
                } /* endfor */
             strcat(cbuf,cp);
             } /* endif */
       } /* endif */
    cp=&cbuf[0];
    return cp;
}




// ***** Defined structures & classes *****
struct aCluster {
   double       Center[MAXVECTDIM];
   int          Member[MAXPATTERN];  //Index of Vectors belonging to this cluster
   int          NumMembers;
};

struct aVector {
   double       Center[MAXVECTDIM];
   int          Size;
};

class System {
private:
   double       Pattern[MAXPATTERN][MAXVECTDIM+1];
   aCluster     Cluster[MAXCLUSTER];
   int          NumPatterns;          // Number of patterns
   int          SizeVector;           // Number of dimensions in vector
   int          NumClusters;          // Number of clusters
   void         DistributeSamples();  // Step 2 of K-means algorithm
   int          CalcNewClustCenters();// Step 3 of K-means algorithm
   double       EucNorm(int, int);   // Calc Euclidean norm vector
   int          FindClosestCluster(int); //ret indx of clust closest to pattern
                                         //whose index is arg
public:
   void system();
   int LoadPatterns(char *fname);      // Get pattern data to be clustered
   void InitClusters();                // Step 1 of K-means algorithm
   void RunKMeans();                   // Overall control K-means process
   void ShowClusters();                // Show results on screen
   void SaveClusters(char *fname);     // Save results to file
   void ShowCenters();
};
// Print the current cluster centers
void System::ShowCenters(){
    int i;
    printf("Cluster centers:\n");
    for (i=0; i<NumClusters; i++) {
       printf("ClusterCenter[%d]=(%f,%f)\n",i,Cluster[i].Center[0],Cluster[i].Center[1]);
       } /* endfor */
    printf("\n");
    getchar();
}

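// Read the pattern file: a three-value header followed by the pattern data
int System::LoadPatterns(char *fname)
{
   FILE *InFilePtr;
   int    i,j;
   double x;
    if((InFilePtr = fopen(fname, "r")) == NULL)
        return FAILURE;
    // Read the header and reject sizes that would overflow the fixed-size arrays
    if (fscanf(InFilePtr, "%d", &NumPatterns) != 1 ||  // # of patterns (18 in the sample file)
        fscanf(InFilePtr, "%d", &SizeVector)  != 1 ||  // dimension of vector (2)
        fscanf(InFilePtr, "%d", &NumClusters) != 1 ||  // # of clusters for K-Means (2)
        NumPatterns<1 || NumPatterns>MAXPATTERN ||
        SizeVector<1  || SizeVector>MAXVECTDIM  ||
        NumClusters<1 || NumClusters>MAXCLUSTER || NumClusters>NumPatterns) {
       fclose(InFilePtr);
       return FAILURE;
       } /* endif */
    for (i=0; i<NumPatterns; i++) {         // For each vector
       for (j=0; j<SizeVector; j++) {       // create a pattern
          if (fscanf(InFilePtr,"%lg",&x) != 1) {  // consisting of all elements
             fclose(InFilePtr);
             return FAILURE;
             } /* endif */
          Pattern[i][j]=x;
          } /* endfor */
       } /* endfor */
    fclose(InFilePtr);
    // Print every input pattern (only the first two coordinates are shown)
    printf("Input patterns:\n");
    for (i=0; i<NumPatterns; i++) {
       printf("Pattern[%d]=(%2.3f,%2.3f)\n",i,Pattern[i][0],Pattern[i][1]);
       } /* endfor */
    printf("\n--------------------\n");
    getchar();
    return SUCCESS;
}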
//***************************************************************************
// InitClusters                                                             *
//   Arbitrarily assign a vector to each of the K clusters                  *
//   We choose the first K vectors to do this                               *
//***************************************************************************
// Initialize the cluster centers
void System::InitClusters(){
    int i,j;
    printf("Initial cluster centers:\n");
    for (i=0; i<NumClusters; i++) {
       Cluster[i].Member[0]=i;
       for (j=0; j<SizeVector; j++) {
          Cluster[i].Center[j]=Pattern[i][j];
          } /* endfor */
       } /* endfor */
    for (i=0; i<NumClusters; i++) {
        printf("ClusterCenter[%d]=(%f,%f)\n",i,Cluster[i].Center[0],Cluster[i].Center[1]);                //untransplant
       } /* endfor */
    printf("\n");
    getchar();
}
// Run K-means until the cluster centers stop changing
void System::RunKMeans(){
      int converged;
      int pass;
    pass=1;
    converged=FALSE;
    // Each loop iteration is one clustering pass
    while (converged==FALSE) {
       printf("PASS=%d\n",pass++);
       DistributeSamples();
       converged=CalcNewClustCenters();
       ShowCenters();
       getchar();
       } /* endwhile */
}
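// In two and three dimensions the Euclidean distance is the ordinary distance between two points; in 2-D:
// d = sqrt((x1-x2)^2 + (y1-y2)^2)
// Summing over every coordinate brings all columns (attributes) into the distance.
// Note: EucNorm returns the squared distance (no sqrt); the caller applies sqrt() only when printing.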
double System::EucNorm(int p, int c){        // Calc Euclidean norm of vector difference
    double dist,x;                          // between pattern vector, p, and cluster
    int i;                                  // center, c.
    char zout[128];
    char znum[40];
    char *pnum;
    //
    pnum=&znum[0];
    strcpy(zout,"d=sqrt(");
    printf("The distance from pattern %d to cluster %d is calculated as:\n",p,c);
    dist=0;
    for (i=0; i<SizeVector ;i++){
       // Append this term to the formula string shown on screen
       x=(Cluster[c].Center[i]-Pattern[p][i])*(Cluster[c].Center[i]-Pattern[p][i]);
       strcat(zout,f2a(x,4));
       if (i<SizeVector-1)
          strcat(zout,"+");
        // Accumulate the squared distance
       dist += x;
       } /* endfor */
    printf("%s)\n",zout);
    return dist;
}
// Find the cluster whose center is closest to pattern pat
int System::FindClosestCluster(int pat){
   int i, ClustID;
   double MinDist, d;
    MinDist =9.9e+99;
    ClustID=-1;
    for (i=0; i<NumClusters; i++) {
       d=EucNorm(pat,i);
       printf("Distance from pattern %d to cluster %d is %f\n\n",pat,i,sqrt(d));
       if (d<MinDist) {
          MinDist=d;
          ClustID=i;
          } /* endif */
       } /* endfor */
    if (ClustID<0) {
       printf("Aaargh");
       exit(0);
       } /* endif */
    return ClustID;
}
// Assign every pattern to its nearest cluster (step 2 of K-means)
void System::DistributeSamples(){
    int i,pat,Clustid,MemberIndex;
    //Clear membership list for all current clusters
    for (i=0; i<NumClusters;i++){
       Cluster[i].NumMembers=0;
       }
    for (pat=0; pat<NumPatterns; pat++) {
       //Find cluster center to which the pattern is closest
       Clustid= FindClosestCluster(pat);
       printf("patern %d assigned to cluster %d\n\n",pat,Clustid);
       //post this pattern to the cluster
       MemberIndex=Cluster[Clustid].NumMembers;
       Cluster[Clustid].Member[MemberIndex]=pat;
       Cluster[Clustid].NumMembers++;
       } /* endfor */
}
// Compute the new cluster centers; returns TRUE when no center moved
int  System::CalcNewClustCenters(){
   int ConvFlag,VectID,i,j,k;
   double tmp[MAXVECTDIM];
   char xs[255];
   char ys[255];
   char nc1[20];
   char nc2[20];
   char *pnc1;
   char *pnc2;
   char *fpv;

    pnc1=&nc1[0];
    pnc2=&nc2[0];
    ConvFlag=TRUE;
    printf("The new cluster centers are now calculated as:\n");
    for (i=0; i<NumClusters; i++) {              //for each cluster
       sprintf(nc1,"%d",Cluster[i].NumMembers);  // itoa() is non-standard; sprintf is portable
       sprintf(nc2,"%d",i);
       strcpy(xs,"Cluster Center");
       strcat(xs,nc2);
       strcat(xs,"(1/");
       strcpy(ys,"(1/");
       strcat(xs,nc1);
       strcat(ys,nc1);
       strcat(xs,")(");
       strcat(ys,")(");
       for (j=0; j<SizeVector; j++) {            // clear workspace
          tmp[j]=0.0;
          } /* endfor */
       for (j=0; j<Cluster[i].NumMembers; j++) { //traverse member vectors
          VectID=Cluster[i].Member[j];
          for (k=0; k<SizeVector; k++) {         //traverse elements of vector
                 tmp[k] += Pattern[VectID][k];       // add (member) pattern elmnt into temp
                 if (k==0) {
                      strcat(xs,f2a(Pattern[VectID][k],3));
                    } else {
                      strcat(ys,f2a(Pattern[VectID][k],3));
                      } /* endif */
            } /* endfor */
          if(j<Cluster[i].NumMembers-1){
             strcat(xs,"+");
             strcat(ys,"+");
             }
            else {
             strcat(xs,")");
             strcat(ys,")");
             }
          } /* endfor */
       for (k=0; k<SizeVector; k++) {            //traverse elements of vector
          if (Cluster[i].NumMembers==0)
             break;                              // empty cluster: keep the old center, avoid dividing by zero
          tmp[k]=tmp[k]/Cluster[i].NumMembers;
          if (tmp[k] != Cluster[i].Center[k])
             ConvFlag=FALSE;
          Cluster[i].Center[k]=tmp[k];
          } /* endfor */
       printf("%s,\n",xs);
       printf("%s\n",ys);
       } /* endfor */
    return ConvFlag;
}
// Print the final cluster centers
void System::ShowClusters(){
   int cl;
    for (cl=0; cl<NumClusters; cl++) {
       printf("\nCLUSTER %d ==>[%f,%f]\n", cl,Cluster[cl].Center[0],Cluster[cl].Center[1]);
       } /* endfor */
}

// Placeholder: saving the results to a file is not implemented in this example
void System::SaveClusters(char *fname){
}

4. Main Program


int main()
{
   char fname[] = "KM2.DAT";   // the input file name is fixed; command-line arguments are not used
   System kmeans;

    if (kmeans.LoadPatterns(fname)==FAILURE ){
       printf("UNABLE TO READ PATTERN_FILE:%s\n",fname);
       return 1;
        }

    kmeans.InitClusters();
    kmeans.RunKMeans();
    kmeans.ShowClusters();
    return 0;
}

5. Output

The listing below shows the program output through the end of the first pass (PASS=1).


Input patterns:
Pattern[0]=(0.000,0.000)
Pattern[1]=(1.000,0.000)
Pattern[2]=(0.000,1.000)
Pattern[3]=(2.000,1.000)
Pattern[4]=(1.000,2.000)
Pattern[5]=(2.000,2.000)
Pattern[6]=(2.000,0.000)
Pattern[7]=(0.000,2.000)
Pattern[8]=(7.000,6.000)
Pattern[9]=(7.000,7.000)
Pattern[10]=(7.000,8.000)
Pattern[11]=(8.000,6.000)
Pattern[12]=(8.000,7.000)
Pattern[13]=(8.000,8.000)
Pattern[14]=(8.000,9.000)
Pattern[15]=(9.000,7.000)
Pattern[16]=(9.000,8.000)
Pattern[17]=(9.000,9.000)

--------------------

Initial cluster centers:
ClusterCenter[0]=(0.000000,0.000000)
ClusterCenter[1]=(1.000000,0.000000)


PASS=1
The distance from pattern 0 to cluster 0 is calculated as:
d=sqrt( .0000+ .0000)
Distance from pattern 0 to cluster 0 is 0.000000

The distance from pattern 0 to cluster 1 is calculated as:
d=sqrt( 1.0000+ .0000)
Distance from pattern 0 to cluster 1 is 1.000000

patern 0 assigned to cluster 0

The distance from pattern 1 to cluster 0 is calculated as:
d=sqrt( 1.0000+ .0000)
Distance from pattern 1 to cluster 0 is 1.000000

The distance from pattern 1 to cluster 1 is calculated as:
d=sqrt( .0000+ .0000)
Distance from pattern 1 to cluster 1 is 0.000000

patern 1 assigned to cluster 1

The distance from pattern 2 to cluster 0 is calculated as:
d=sqrt( .0000+ 1.0000)
Distance from pattern 2 to cluster 0 is 1.000000

The distance from pattern 2 to cluster 1 is calculated as:
d=sqrt( 1.0000+ 1.0000)
Distance from pattern 2 to cluster 1 is 1.414214

patern 2 assigned to cluster 0

The distance from pattern 3 to cluster 0 is calculated as:
d=sqrt( 4.0000+ 1.0000)
Distance from pattern 3 to cluster 0 is 2.236068

The distance from pattern 3 to cluster 1 is calculated as:
d=sqrt( 1.0000+ 1.0000)
Distance from pattern 3 to cluster 1 is 1.414214

patern 3 assigned to cluster 1

The distance from pattern 4 to cluster 0 is calculated as:
d=sqrt( 1.0000+ 4.0000)
Distance from pattern 4 to cluster 0 is 2.236068

The distance from pattern 4 to cluster 1 is calculated as:
d=sqrt( .0000+ 4.0000)
Distance from pattern 4 to cluster 1 is 2.000000

patern 4 assigned to cluster 1

The distance from pattern 5 to cluster 0 is calculated as:
d=sqrt( 4.0000+ 4.0000)
Distance from pattern 5 to cluster 0 is 2.828427

The distance from pattern 5 to cluster 1 is calculated as:
d=sqrt( 1.0000+ 4.0000)
Distance from pattern 5 to cluster 1 is 2.236068

patern 5 assigned to cluster 1

The distance from pattern 6 to cluster 0 is calculated as:
d=sqrt( 4.0000+ .0000)
Distance from pattern 6 to cluster 0 is 2.000000

The distance from pattern 6 to cluster 1 is calculated as:
d=sqrt( 1.0000+ .0000)
Distance from pattern 6 to cluster 1 is 1.000000

patern 6 assigned to cluster 1

The distance from pattern 7 to cluster 0 is calculated as:
d=sqrt( .0000+ 4.0000)
Distance from pattern 7 to cluster 0 is 2.000000

The distance from pattern 7 to cluster 1 is calculated as:
d=sqrt( 1.0000+ 4.0000)
Distance from pattern 7 to cluster 1 is 2.236068

patern 7 assigned to cluster 0

The distance from pattern 8 to cluster 0 is calculated as:
d=sqrt( 49.0000+ 36.0000)
Distance from pattern 8 to cluster 0 is 9.219544

The distance from pattern 8 to cluster 1 is calculated as:
d=sqrt( 36.0000+ 36.0000)
Distance from pattern 8 to cluster 1 is 8.485281

patern 8 assigned to cluster 1

The distance from pattern 9 to cluster 0 is calculated as:
d=sqrt( 49.0000+ 49.0000)
Distance from pattern 9 to cluster 0 is 9.899495

The distance from pattern 9 to cluster 1 is calculated as:
d=sqrt( 36.0000+ 49.0000)
Distance from pattern 9 to cluster 1 is 9.219544

patern 9 assigned to cluster 1

The distance from pattern 10 to cluster 0 is calculated as:
d=sqrt( 49.0000+ 64.0000)
Distance from pattern 10 to cluster 0 is 10.630146

The distance from pattern 10 to cluster 1 is calculated as:
d=sqrt( 36.0000+ 64.0000)
Distance from pattern 10 to cluster 1 is 10.000000

patern 10 assigned to cluster 1

The distance from pattern 11 to cluster 0 is calculated as:
d=sqrt( 64.0000+ 36.0000)
Distance from pattern 11 to cluster 0 is 10.000000

The distance from pattern 11 to cluster 1 is calculated as:
d=sqrt( 49.0000+ 36.0000)
Distance from pattern 11 to cluster 1 is 9.219544

patern 11 assigned to cluster 1

The distance from pattern 12 to cluster 0 is calculated as:
d=sqrt( 64.0000+ 49.0000)
Distance from pattern 12 to cluster 0 is 10.630146

The distance from pattern 12 to cluster 1 is calculated as:
d=sqrt( 49.0000+ 49.0000)
Distance from pattern 12 to cluster 1 is 9.899495

patern 12 assigned to cluster 1

The distance from pattern 13 to cluster 0 is calculated as:
d=sqrt( 64.0000+ 64.0000)
Distance from pattern 13 to cluster 0 is 11.313708

The distance from pattern 13 to cluster 1 is calculated as:
d=sqrt( 49.0000+ 64.0000)
Distance from pattern 13 to cluster 1 is 10.630146

patern 13 assigned to cluster 1

The distance from pattern 14 to cluster 0 is calculated as:
d=sqrt( 64.0000+ 81.0000)
Distance from pattern 14 to cluster 0 is 12.041595

The distance from pattern 14 to cluster 1 is calculated as:
d=sqrt( 49.0000+ 81.0000)
Distance from pattern 14 to cluster 1 is 11.401754

patern 14 assigned to cluster 1

The distance from pattern 15 to cluster 0 is calculated as:
d=sqrt( 81.0000+ 49.0000)
Distance from pattern 15 to cluster 0 is 11.401754

The distance from pattern 15 to cluster 1 is calculated as:
d=sqrt( 64.0000+ 49.0000)
Distance from pattern 15 to cluster 1 is 10.630146

patern 15 assigned to cluster 1

The distance from pattern 16 to cluster 0 is calculated as:
d=sqrt( 81.0000+ 64.0000)
Distance from pattern 16 to cluster 0 is 12.041595

The distance from pattern 16 to cluster 1 is calculated as:
d=sqrt( 64.0000+ 64.0000)
Distance from pattern 16 to cluster 1 is 11.313708

patern 16 assigned to cluster 1

The distance from pattern 17 to cluster 0 is calculated as:
d=sqrt( 81.0000+ 81.0000)
Distance from pattern 17 to cluster 0 is 12.727922

The distance from pattern 17 to cluster 1 is calculated as:
d=sqrt( 64.0000+ 81.0000)
Distance from pattern 17 to cluster 1 is 12.041595

patern 17 assigned to cluster 1

The new cluster centers are now calculated as:
Cluster Center0(1/3)( .000+ .000+ .000),
(1/3)( .000+ 1.000+ 2.000)
Cluster Center1(1/15)( 1.000+ 2.000+ 1.000+ 2.000+ 2.000+ 7.000+ 7.000+ 7.000+ 8.000+ 8.000+ 8.000+ 8.000+ 9.000+ 9.000+ 9.000),
(1/15)( .000+ 1.000+ 2.000+ 2.000+ .000+ 6.000+ 7.000+ 8.000+ 6.000+ 7.000+ 8.000+ 9.000+ 7.000+ 8.000+ 9.000)
Cluster centers:
ClusterCenter[0]=(0.000000,1.000000)
ClusterCenter[1]=(5.866667,5.333333)
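
The listing above ends after the first pass. To follow the remaining passes without the verbose tracing, the same three steps (take the first K points as centers, assign each point to its nearest center, recompute each center as the mean of its members) can be sketched compactly. This is an illustrative rewrite under stated assumptions, not the program from this article: `Point`, `squaredDist`, and `kMeans` are hypothetical names, and the loop stops when no point changes cluster, whereas the program above stops when no center moves.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

using Point = std::vector<double>;  // hypothetical alias, not from the article

// Sum of squared coordinate differences; enough for comparing distances.
double squaredDist(const Point& a, const Point& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double diff = a[i] - b[i];
        sum += diff * diff;
    }
    return sum;
}

// Assumes 1 <= k <= points.size() and that all points share one dimension.
std::vector<Point> kMeans(const std::vector<Point>& points, std::size_t k) {
    assert(k >= 1 && k <= points.size());
    // Step 1: the first k points become the initial centers.
    std::vector<Point> centers(points.begin(),
                               points.begin() + static_cast<std::ptrdiff_t>(k));
    std::vector<std::size_t> owner(points.size(), k);  // k marks "unassigned"
    bool changed = true;
    while (changed) {
        changed = false;
        // Step 2: assign each point to its nearest center.
        for (std::size_t i = 0; i < points.size(); ++i) {
            std::size_t best = 0;
            double bestDist = std::numeric_limits<double>::max();
            for (std::size_t c = 0; c < k; ++c) {
                const double d = squaredDist(points[i], centers[c]);
                if (d < bestDist) { bestDist = d; best = c; }
            }
            if (owner[i] != best) { owner[i] = best; changed = true; }
        }
        // Step 3: move each center to the mean of its members.
        for (std::size_t c = 0; c < k; ++c) {
            Point sum(points[0].size(), 0.0);
            std::size_t count = 0;
            for (std::size_t i = 0; i < points.size(); ++i) {
                if (owner[i] != c) continue;
                for (std::size_t j = 0; j < sum.size(); ++j) sum[j] += points[i][j];
                ++count;
            }
            if (count == 0) continue;  // empty cluster: keep the previous center
            for (double& v : sum) v /= static_cast<double>(count);
            centers[c] = sum;
        }
    }
    return centers;
}
```

Called with the 18 sample points in file order and k = 2, the first pass reproduces the centers shown above, (0, 1) and (5.866667, 5.333333). The second pass moves the first eight points into cluster 0, giving centers (1, 1) and (8, 7.5), and a third pass changes nothing, so those are the centers the sketch returns.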
