我有一个如下的数据框
colA | colB | colC | colD |
---|---|---|---|
a | 2 | 2013-12-12 | 2999-12-31 |
b | 3 | 2011-12-14 | 2999-12-31 |
a | 4 | 2013-12-17 | 2999-12-31 |
b | 8 | 2011-12-19 | 2999-12-31 |
a | 6 | 2013-12-23 | 2999-12-31 |
我需要根据ColA对记录进行分组,并根据colC对记录进行排名(最近的日期得到更大的排名),然后通过从相邻排名的colC记录中减去一天来更新colD中的日期。
最终的数据框如下所示
colA | colB | colC | colD |
---|---|---|---|
a | 2 | 2013-12-12 | 2013-12-16 |
a | 4 | 2013-12-17 | 2013-12-22 |
a | 6 | 2013-12-23 | 2999-12-31 |
b | 3 | 2011-12-14 | 2011-12-18 |
b | 8 | 2011-12-29 | 2999-12-31 |
你可以使用窗口函数来获取它
scala> val df = Seq(("a",2,"2013-12-12","2999-12-31"),("b",3,"2011-12-14","2999-12-31"),("a",4,"2013-12-17","2999-12-31"),("b",8,"2011-12-19","2999-12-31"),("a",6,"2013-12-23","2999-12-31")).toDF("colA","colB","colC","colD")
df: org.apache.spark.sql.DataFrame = [colA: string, colB: int ... 2 more fields]
scala> val df2 = df.withColumn("colc",'colc.cast("date")).withColumn("cold",'cold.cast("date"))
df2: org.apache.spark.sql.DataFrame = [colA: string, colB: int ... 2 more fields]
scala> df2.createOrReplaceTempView("yash")
scala> spark.sql(""" select cola,colb,colc,cold, rank() over(partition by cola order by colc) c1, coalesce(date_sub(lead(colc) over(partition by cola order by colc),1),cold) as cold2 from yash """).show | |||||
---|---|---|---|---|---|
cola | colb | colc | cold | c1 | cold2 |
b | 3 | 2011-12-14 | 2999-12-31 | 1 | 2011-12-18 |
b | 8 | 2011-12-19 | 2999-12-31 | 2 | 2999-12-31 |
a | 2 | 2013-12-12 | 2999-12-31 | 1 | 2013-12-16 |
a | 4 | 2013-12-17 | 2999-12-31 | 2 | 2013-12-22 |
a | 6 | 2013-12-23 | 2999-12-31 | 3 | 2999-12-31 |
scala>
删除不必要的列
scala> spark.sql(""" select cola,colb,colc, coalesce(date_sub(lead(colc) over(partition by cola order by colc),1),cold) as cold from yash """).show | |||
---|---|---|---|
cola | colb | colc | cold |
b | 3 | 2011-12-14 | 2011-12-18 |
b | 8 | 2011-12-19 | 2999-12-31 |
a | 2 | 2013-12-12 | 2013-12-16 |
a | 4 | 2013-12-17 | 2013-12-22 |
a | 6 | 2013-12-23 | 2999-12-31 |
scala>
版权声明:本文内容由阿里云实名注册用户自发贡献,版权归原作者所有,阿里云开发者社区不拥有其著作权,亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容,填写侵权投诉表单进行举报,一经查实,本社区将立刻删除涉嫌侵权内容。