开发者社区> 问答> 正文

如何根据条件为日期列的列中值的出现来过滤spark数据帧?

我正在使用数据框看起来像:

df
client   | date   
  C1     |08-NOV-18 11.29.43
  C2     |09-NOV-18 13.29.43
  C2     |09-NOV-18 18.29.43
  C3     |11-NOV-18 19.29.43
  C1     |12-NOV-18 10.29.43
  C2     |13-NOV-18 09.29.43
  C4     |14-NOV-18 20.29.43
  C1     |15-NOV-18 11.29.43
  C5     |16-NOV-18 15.29.43
  C10    |17-NOV-18 19.29.43
  C1     |18-NOV-18 12.29.43
  C2     |18-NOV-18 10.29.43
  C2     |19-NOV-18 09.29.43
  C6     |20-NOV-18 13.29.43
  C6     |21-NOV-18 14.29.43
  C1     |21-NOV-18 18.29.43
  C1     |22-NOV-18 11.29.43

我的目标是过滤此数据框并获取包含每个客户端最后一次出现的新数据帧,如果此次出现<24小时,例如,对于此示例,结果必须为:

 client  |date
  C2     |18-NOV-18 10.29.43
  C2     |19-NOV-18 09.29.43
  C1     |21-NOV-18 18.29.43
  C1     |22-NOV-18 11.29.43

展开
收起
社区小助手 2018-12-11 18:16:53 1835 0
1 条回答
写回答
取消 提交回答
  • 社区小助手是spark中国社区的管理员,我会定期更新直播回顾等资料和文章干货,还整合了大家在钉群提出的有关spark的问题及回答。

    使用窗口函数。看一下这个:

    val df = Seq(("C1","08-NOV-18 11.29.43"),
    ("C2","09-NOV-18 13.29.43"),
    ("C2","09-NOV-18 18.29.43"),
    ("C3","11-NOV-18 19.29.43"),
    ("C1","12-NOV-18 10.29.43"),
    ("C2","13-NOV-18 09.29.43"),
    ("C4","14-NOV-18 20.29.43"),
    ("C1","15-NOV-18 11.29.43"),
    ("C5","16-NOV-18 15.29.43"),
    ("C10","17-NOV-18 19.29.43"),
    ("C1","18-NOV-18 12.29.43"),
    ("C2","18-NOV-18 10.29.43"),
    ("C2","19-NOV-18 09.29.43"),
    ("C6","20-NOV-18 13.29.43"),
    ("C6","21-NOV-18 14.29.43"),
    ("C1","21-NOV-18 18.29.43"),
    ("C1","22-NOV-18 11.29.43")).toDF("client","dt").withColumn("dt",from_unixtime(unix_timestamp('dt,"dd-MMM-yy HH.mm.ss"),"yyyy-MM-dd HH:mm:ss"))

    df.createOrReplaceTempView("tbl")

    val df2 = spark.sql(""" select from ( select client, dt, count() over(partition by client ) cnt, rank() over(partition by client order by dt desc) rk1 from tbl ) t where cnt>1 and rk1 in (1,2) """)

    df2.alias("t1").join(df2.alias("t2"), $"t1.client" === $"t2.client" and $"t1.rk1" =!= $"t2.rk1" , "inner" ).withColumn("dt24",(unix_timestamp($"t1.dt") - unix_timestamp($"t2.dt") )/ 3600 ).where("dt24 > -24 and dt24 < 24").select($"t1.client", $"t1.dt").show(false)
    结果:

    client dt
    C1 2018-11-22 11:29:43
    C1 2018-11-21 18:29:43
    C2 2018-11-19 09:29:43
    C2 2018-11-18 10:29:43
    2019-07-17 23:19:58
    赞同 展开评论 打赏
问答分类:
问答地址:
问答排行榜
最热
最新

相关电子书

更多
Hybrid Cloud and Apache Spark 立即下载
Scalable Deep Learning on Spark 立即下载
Comparison of Spark SQL with Hive 立即下载