I have a dataframe that looks like this:
df
client | date
C1 |08-NOV-18 11.29.43
C2 |09-NOV-18 13.29.43
C2 |09-NOV-18 18.29.43
C3 |11-NOV-18 19.29.43
C1 |12-NOV-18 10.29.43
C2 |13-NOV-18 09.29.43
C4 |14-NOV-18 20.29.43
C1 |15-NOV-18 11.29.43
C5 |16-NOV-18 15.29.43
C10 |17-NOV-18 19.29.43
C1 |18-NOV-18 12.29.43
C2 |18-NOV-18 10.29.43
C2 |19-NOV-18 09.29.43
C6 |20-NOV-18 13.29.43
C6 |21-NOV-18 14.29.43
C1 |21-NOV-18 18.29.43
C1 |22-NOV-18 11.29.43
My goal is to filter this dataframe into a new dataframe holding the last two occurrences of each client, but only when those two occurrences are less than 24 hours apart. For this example, the result must be:
client |date
C2 |18-NOV-18 10.29.43
C2 |19-NOV-18 09.29.43
C1 |21-NOV-18 18.29.43
C1 |22-NOV-18 11.29.43
Use window functions. Take a look at this:
val df = Seq(("C1","08-NOV-18 11.29.43"),
("C2","09-NOV-18 13.29.43"),
("C2","09-NOV-18 18.29.43"),
("C3","11-NOV-18 19.29.43"),
("C1","12-NOV-18 10.29.43"),
("C2","13-NOV-18 09.29.43"),
("C4","14-NOV-18 20.29.43"),
("C1","15-NOV-18 11.29.43"),
("C5","16-NOV-18 15.29.43"),
("C10","17-NOV-18 19.29.43"),
("C1","18-NOV-18 12.29.43"),
("C2","18-NOV-18 10.29.43"),
("C2","19-NOV-18 09.29.43"),
("C6","20-NOV-18 13.29.43"),
("C6","21-NOV-18 14.29.43"),
("C1","21-NOV-18 18.29.43"),
("C1","22-NOV-18 11.29.43")).toDF("client","dt").withColumn("dt",from_unixtime(unix_timestamp('dt,"dd-MMM-yy HH.mm.ss"),"yyyy-MM-dd HH:mm:ss"))
df.createOrReplaceTempView("tbl")
val df2 = spark.sql("""
  select * from (
    select client, dt,
           count(*) over (partition by client) cnt,
           rank()   over (partition by client order by dt desc) rk1
    from tbl
  ) t
  where cnt > 1 and rk1 in (1, 2)
""")
df2.alias("t1").join(df2.alias("t2"), $"t1.client" === $"t2.client" and $"t1.rk1" =!= $"t2.rk1" , "inner" ).withColumn("dt24",(unix_timestamp($"t1.dt") - unix_timestamp($"t2.dt") )/ 3600 ).where("dt24 > -24 and dt24 < 24").select($"t1.client", $"t1.dt").show(false)
Result:
|client | dt                  |
|-------|---------------------|
|C1     | 2018-11-22 11:29:43 |
|C1     | 2018-11-21 18:29:43 |
|C2     | 2018-11-19 09:29:43 |
|C2     | 2018-11-18 10:29:43 |
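For reference, the same result can be reached without the self-join by letting each client's latest row carry its previous occurrence along via a window function. This is a minimal sketch, assuming the parsed `df` from above; the names `w`, `rk`, and `prev_dt` are mine:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Most recent row first within each client.
val w = Window.partitionBy("client").orderBy($"dt".desc)

df.withColumn("rk", row_number().over(w))
  // In descending order, lead(1) is the next-older row, i.e. the previous occurrence.
  .withColumn("prev_dt", lead("dt", 1).over(w))
  // Keep only the latest row per client, and only when the previous occurrence
  // is less than 24 hours earlier (a null prev_dt drops out of the comparison).
  .where($"rk" === 1 &&
    (unix_timestamp($"dt") - unix_timestamp($"prev_dt")) / 3600 < 24)
  // Re-expand each qualifying pair into its two original rows.
  .select($"client", explode(array($"prev_dt", $"dt")).as("dt"))
  .show(false)

This avoids joining `df2` against itself, but it relies on `dt` being a sortable "yyyy-MM-dd HH:mm:ss" string, which is exactly what the `from_unixtime` step above produces.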