2 聚合操作:groupBy和agg
2.1 排序算子sort(sort等价于orderBy)
DF.sort(DF.col(“id”).desc).show 以DF中字段id降序,指定升降序的方法。另外可指定多个字段排序
=DF.sort($“id”.desc).show
DF.sort 等价于DF.orderBy
scala> peopleDF.sort($"age").show +--------+---+--------+ | name|age| address| +--------+---+--------+ |zhangsan| 22| chengdu| | lisi| 28|shanghai| | wangwu| 33| beijing| +--------+---+--------+ scala> peopleDF.sort($"age".desc).show +--------+---+--------+ | name|age| address| +--------+---+--------+ | wangwu| 33| beijing| | lisi| 28|shanghai| |zhangsan| 22| chengdu| +--------+---+--------+ scala> peopleDF.sort($"age".asc).show +--------+---+--------+ | name|age| address| +--------+---+--------+ |zhangsan| 22| chengdu| | lisi| 28|shanghai| | wangwu| 33| beijing| +--------+---+--------+ scala> peopleDF.orderBy($"age".asc).show +--------+---+--------+ | name|age| address| +--------+---+--------+ |zhangsan| 22| chengdu| | lisi| 28|shanghai| | wangwu| 33| beijing| +--------+---+--------+ scala> peopleDF.orderBy($"age".desc).show +--------+---+--------+ | name|age| address| +--------+---+--------+ | wangwu| 33| beijing| | lisi| 28|shanghai| |zhangsan| 22| chengdu| +--------+---+--------+
2.2 分组函数groupBy
2.2.1 分组计数
select address,count(1) from people group by address; 等价的算子如下
scala> peopleDF.show() +--------+---+--------+ | name|age| address| +--------+---+--------+ |zhangsan| 22| chengdu| | wangwu| 33| beijing| | lisi| 28|shanghai| |xiaoming| 28| beijing| | mm| 21| chengdu| |xiaoming| 18| beijing| | mm| 11| chengdu| +--------+---+--------+ scala> peopleDF.groupBy("address").count().show +--------+-----+ | address|count| +--------+-----+ | beijing| 3| | chengdu| 3| |shanghai| 1| +--------+-----+
2.2.2 分组后求最值、平均值、求和的方法
//等价于select address,max(age) from people group by address; scala> peopleDF.groupBy("address").max("age").show +--------+--------+ | address|max(age)| +--------+--------+ | beijing| 33| | chengdu| 22| |shanghai| 28| +--------+--------+ //等价于select address,avg(age) from people group by address; scala> peopleDF.groupBy("address").avg("age").show +--------+------------------+ | address| avg(age)| +--------+------------------+ | beijing|26.333333333333332| | chengdu| 18.0| |shanghai| 28.0| +--------+------------------+ //等价于select address,min(age) from people group by address; scala> peopleDF.groupBy("address").min("age").show +--------+--------+ | address|min(age)| +--------+--------+ | beijing| 18| | chengdu| 11| |shanghai| 28| +--------+--------+ //等价于select address,sum(age) from people group by address; scala> peopleDF.groupBy("address").sum("age").show +--------+--------+ | address|sum(age)| +--------+--------+ | beijing| 79| | chengdu| 54| |shanghai| 28| +--------+--------+
2.2.3 分组后,求多个聚合值(最值、平均值等)。使用算子groupBy+agg
//等价于select address,count(age),max(age),min(age),avg(age),sum(age) from people group by address; scala> peopleDF.groupBy("address").agg(count("age"),max("age"),min("age"),avg("age"),sum("age")).show +--------+----------+--------+--------+------------------+--------+ | address|count(age)|max(age)|min(age)| avg(age)|sum(age)| +--------+----------+--------+--------+------------------+--------+ | beijing| 3| 33| 18|26.333333333333332| 79| | chengdu| 3| 22| 11| 18.0| 54| |shanghai| 1| 28| 28| 28.0| 28| +--------+----------+--------+--------+------------------+--------+
2.2.4 分组聚合后取别名
scala> peopleDF.groupBy("address").agg(count("age").as("cnt"),avg("age").as("avg")).show +--------+---+------------------+ | address|cnt| avg| +--------+---+------------------+ | beijing| 3|26.333333333333332| | chengdu| 3| 18.0| |shanghai| 1| 28.0| +--------+---+------------------+
2.2.5 分组后行转列,使用pivot
//求同名用户在同一个地址的平均年龄 //把name的不同值作为列名 scala> peopleDF.groupBy("address").pivot("name").avg("age").show +--------+----+----+------+--------+--------+ | address|lisi| mm|wangwu|xiaoming|zhangsan| +--------+----+----+------+--------+--------+ | beijing|null|null| 33.0| 23.0| null| | chengdu|null|16.0| null| null| 22.0| |shanghai|28.0|null| null| null| null| +--------+----+----+------+--------+--------+
2.3 案例