如何根据条件为日期列的列中值的出现来过滤spark数据帧?-问答-阿里云开发者社区-阿里云

开发者社区> 问答> 正文

如何根据条件为日期列的列中值的出现来过滤spark数据帧?

社区小助手 2018-12-11 18:16:53 1127

我正在使用数据框看起来像:

df
client   | date   
  C1     |08-NOV-18 11.29.43
  C2     |09-NOV-18 13.29.43
  C2     |09-NOV-18 18.29.43
  C3     |11-NOV-18 19.29.43
  C1     |12-NOV-18 10.29.43
  C2     |13-NOV-18 09.29.43
  C4     |14-NOV-18 20.29.43
  C1     |15-NOV-18 11.29.43
  C5     |16-NOV-18 15.29.43
  C10    |17-NOV-18 19.29.43
  C1     |18-NOV-18 12.29.43
  C2     |18-NOV-18 10.29.43
  C2     |19-NOV-18 09.29.43
  C6     |20-NOV-18 13.29.43
  C6     |21-NOV-18 14.29.43
  C1     |21-NOV-18 18.29.43
  C1     |22-NOV-18 11.29.43

我的目标是过滤此数据框并获取包含每个客户端最后一次出现的新数据帧,如果此次出现<24小时,例如,对于此示例,结果必须为:

 client  |date
  C2     |18-NOV-18 10.29.43
  C2     |19-NOV-18 09.29.43
  C1     |21-NOV-18 18.29.43
  C1     |22-NOV-18 11.29.43
过期数据 del条件 条件数据 日期util util数据
分享到
取消 提交回答
全部回答(1)
  • 社区小助手
    2019-07-17 23:19:58

    使用窗口函数。看一下这个:

    val df = Seq(("C1","08-NOV-18 11.29.43"),
    ("C2","09-NOV-18 13.29.43"),
    ("C2","09-NOV-18 18.29.43"),
    ("C3","11-NOV-18 19.29.43"),
    ("C1","12-NOV-18 10.29.43"),
    ("C2","13-NOV-18 09.29.43"),
    ("C4","14-NOV-18 20.29.43"),
    ("C1","15-NOV-18 11.29.43"),
    ("C5","16-NOV-18 15.29.43"),
    ("C10","17-NOV-18 19.29.43"),
    ("C1","18-NOV-18 12.29.43"),
    ("C2","18-NOV-18 10.29.43"),
    ("C2","19-NOV-18 09.29.43"),
    ("C6","20-NOV-18 13.29.43"),
    ("C6","21-NOV-18 14.29.43"),
    ("C1","21-NOV-18 18.29.43"),
    ("C1","22-NOV-18 11.29.43")).toDF("client","dt").withColumn("dt",from_unixtime(unix_timestamp('dt,"dd-MMM-yy HH.mm.ss"),"yyyy-MM-dd HH:mm:ss"))

    df.createOrReplaceTempView("tbl")

    val df2 = spark.sql(""" select from ( select client, dt, count() over(partition by client ) cnt, rank() over(partition by client order by dt desc) rk1 from tbl ) t where cnt>1 and rk1 in (1,2) """)

    df2.alias("t1").join(df2.alias("t2"), $"t1.client" === $"t2.client" and $"t1.rk1" =!= $"t2.rk1" , "inner" ).withColumn("dt24",(unix_timestamp($"t1.dt") - unix_timestamp($"t2.dt") )/ 3600 ).where("dt24 > -24 and dt24 < 24").select($"t1.client", $"t1.dt").show(false)
    结果:

    client dt
    C1 2018-11-22 11:29:43
    C1 2018-11-21 18:29:43
    C2 2018-11-19 09:29:43
    C2 2018-11-18 10:29:43
    0 0
大数据
使用钉钉扫一扫加入圈子
+ 订阅

大数据计算实践乐园,近距离学习前沿技术

相似问题
最新问题