
End-dating records with window functions in Spark SQL

I have a DataFrame like this:

colA  colB  colC        colD
a     2     2013-12-12  2999-12-31
b     3     2011-12-14  2999-12-31
a     4     2013-12-17  2999-12-31
b     8     2011-12-19  2999-12-31
a     6     2013-12-23  2999-12-31

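I need to group the records by colA, rank them by colC within each group (the most recent date gets the highest rank), and then update colD by taking the colC of the record with the next higher rank and subtracting one day. The record with the highest rank in each group keeps its original colD.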

The final DataFrame should look like this:

colA  colB  colC        colD
a     2     2013-12-12  2013-12-16
a     4     2013-12-17  2013-12-22
a     6     2013-12-23  2999-12-31
b     3     2011-12-14  2011-12-18
b     8     2011-12-19  2999-12-31

Posted by 社区小助手 on 2018-12-21 11:49:33
1 answer
  • 社区小助手 (administrator of the Spark China community)

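    You can do this with window functions. lead() fetches the next colC within each colA partition, date_sub() subtracts one day from it, and coalesce() falls back to the existing colD for the last row of each partition, where lead() returns null.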
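To make the rule concrete: within each colA group, rows are sorted by colC, each row's colD becomes the next row's colC minus one day, and the last row keeps its existing colD. The plain-Scala sketch below applies that rule to the sample rows without Spark. The Rec case class, the endDate function, and the openEnd value are hypothetical names introduced only for illustration; this is a sketch of the logic, not a replacement for the Spark query.

```scala
import java.time.LocalDate

// Hypothetical record type mirroring the four sample columns.
case class Rec(colA: String, colB: Int, colC: LocalDate, colD: LocalDate)

// Within each colA group, sort by colC and close each row one day
// before the next row's colC; the last row keeps its original colD.
def endDate(rows: Seq[Rec]): Seq[Rec] =
  rows.groupBy(_.colA).toSeq.sortBy(_._1).flatMap { case (_, group) =>
    val sorted = group.sortBy(_.colC.toEpochDay)
    // Pair each row with its successor's colC, if any (the role lead() plays in SQL).
    val nextDates = sorted.drop(1).map(r => Option(r.colC)) :+ None
    sorted.zip(nextDates).map { case (r, next) =>
      r.copy(colD = next.map(_.minusDays(1)).getOrElse(r.colD))
    }
  }

val openEnd = LocalDate.parse("2999-12-31")
val sample = Seq(
  Rec("a", 2, LocalDate.parse("2013-12-12"), openEnd),
  Rec("b", 3, LocalDate.parse("2011-12-14"), openEnd),
  Rec("a", 4, LocalDate.parse("2013-12-17"), openEnd),
  Rec("b", 8, LocalDate.parse("2011-12-19"), openEnd),
  Rec("a", 6, LocalDate.parse("2013-12-23"), openEnd)
)

val result = endDate(sample)
result.foreach(println)
```

The Option around the successor date stands in for the null that lead() returns on the last row of a partition, and getOrElse stands in for coalesce().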

    scala> val df = Seq(("a",2,"2013-12-12","2999-12-31"),("b",3,"2011-12-14","2999-12-31"),("a",4,"2013-12-17","2999-12-31"),("b",8,"2011-12-19","2999-12-31"),("a",6,"2013-12-23","2999-12-31")).toDF("colA","colB","colC","colD")
    df: org.apache.spark.sql.DataFrame = [colA: string, colB: int ... 2 more fields]

    scala> val df2 = df.withColumn("colc",'colc.cast("date")).withColumn("cold",'cold.cast("date"))
    df2: org.apache.spark.sql.DataFrame = [colA: string, colB: int ... 2 more fields]

    scala> df2.createOrReplaceTempView("yash")

    scala> spark.sql(""" select cola,colb,colc,cold, rank() over(partition by cola order by colc) c1, coalesce(date_sub(lead(colc) over(partition by cola order by colc),1),cold) as cold2 from yash """).show
    cola  colb  colc        cold        c1  cold2
    b     3     2011-12-14  2999-12-31  1   2011-12-18
    b     8     2011-12-19  2999-12-31  2   2999-12-31
    a     2     2013-12-12  2999-12-31  1   2013-12-16
    a     4     2013-12-17  2999-12-31  2   2013-12-22
    a     6     2013-12-23  2999-12-31  3   2999-12-31

    Dropping the unnecessary columns:

    scala> spark.sql(""" select cola,colb,colc, coalesce(date_sub(lead(colc) over(partition by cola order by colc),1),cold) as cold from yash """).show
    cola  colb  colc        cold
    b     3     2011-12-14  2011-12-18
    b     8     2011-12-19  2999-12-31
    a     2     2013-12-12  2013-12-16
    a     4     2013-12-17  2013-12-22
    a     6     2013-12-23  2999-12-31
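The same result can probably be reached without a temp view by using the DataFrame API. The sketch below assumes a local SparkSession (in spark-shell, getOrCreate simply reuses the existing one); the application name and the names w and result are illustrative choices, not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, date_sub, lead}

// Assumption: a local session for a standalone run; spark-shell already provides one.
val spark = SparkSession.builder()
  .appName("end-date-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Same sample rows as above, with the two date columns cast from string to date.
val df = Seq(
  ("a", 2, "2013-12-12", "2999-12-31"),
  ("b", 3, "2011-12-14", "2999-12-31"),
  ("a", 4, "2013-12-17", "2999-12-31"),
  ("b", 8, "2011-12-19", "2999-12-31"),
  ("a", 6, "2013-12-23", "2999-12-31")
).toDF("colA", "colB", "colC", "colD")
  .withColumn("colC", col("colC").cast("date"))
  .withColumn("colD", col("colD").cast("date"))

// Same partitioning and ordering as the OVER clause in the SQL version.
val w = Window.partitionBy("colA").orderBy("colC")

// Next colC in the partition minus one day; keep the existing colD
// where lead() is null (the last row of each partition).
val result = df.withColumn(
  "colD",
  coalesce(date_sub(lead(col("colC"), 1).over(w), 1), col("colD"))
)

result.orderBy("colA", "colC").show()
```

Overwriting colD in place avoids the separate step of dropping the helper columns; if the rank is also needed, rank().over(w) can be added as another column over the same window.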


    2019-07-17 23:23:20