I am trying to find the elapsed time between two rows of the same column, including both date and time, like this:

column1
1/1/2017 12:01:00 AM
1/1/2017 12:05:00 AM

So I want the time difference between row 1 and row 2 of column1, since they both fall on the same date. What is the best way to achieve this?
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.expr
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val sparkSession = SparkSession.builder()
  .appName(this.getClass.getSimpleName)
  .master("local[4]") // local test
  .config("spark.driver.host", "127.0.0.1")
  .config("spark.driver.bindAddress", "127.0.0.1")
  .getOrCreate()

val rows = Seq(
  Row("31-AUG-21 02.16.33.371 PM")
)
val schema = StructType(
  Seq(
    StructField("my_time", StringType, true)
  )
)
val rowsRDD = sparkSession.sparkContext.parallelize(rows, 4)
val df = sparkSession.createDataFrame(rowsRDD, schema)

df.withColumn("time1", expr("date_format(from_unixtime(unix_timestamp(my_time,'dd-MMM-yy hh.mm.ss.SSS a'),'yyyy-MM-dd HH:mm:ss.SSS'),'yyyy-MM-dd HH:mm:ss.SSS')"))
  .show(10)
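Two caveats about the snippet above: `unix_timestamp` returns whole seconds, so the `.SSS` fraction comes out as `.000`, and Spark 3.x parses timestamps with `java.time`, which is case-sensitive by default (the legacy `SimpleDateFormat` parser in Spark 2.x accepted `AUG` as-is). The pattern can be checked outside Spark with a stand-alone `java.time` sketch; this is illustrative code, not part of the original answer:

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatterBuilder
import java.util.Locale

// Build a case-insensitive formatter so the upper-case month "AUG" parses,
// mirroring the pattern string used in the Spark expression above.
val fmt = new DateTimeFormatterBuilder()
  .parseCaseInsensitive()
  .appendPattern("dd-MMM-yy hh.mm.ss.SSS a")
  .toFormatter(Locale.ENGLISH)

val parsed = LocalDateTime.parse("31-AUG-21 02.16.33.371 PM", fmt)
println(parsed) // 2021-08-31T14:16:33.371
```

If the upper-case month fails to parse in Spark 3, setting `spark.sql.legacy.timeParserPolicy` to `LEGACY` restores the old behavior.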
You need to cast the columns to timestamps and then compute the difference. Take a look at this:
scala> val df = Seq(("1/01/2017 12:01:00 AM","1/1/2017 12:05:00 AM")).toDF("time1","time2")
df: org.apache.spark.sql.DataFrame = [time1: string, time2: string]
scala> val df2 = df.withColumn("time1",to_timestamp('time1,"d/MM/yyyy hh:mm:ss a")).withColumn("time2",to_timestamp('time2,"d/MM/yyyy hh:mm:ss a"))
df2: org.apache.spark.sql.DataFrame = [time1: timestamp, time2: timestamp]
scala> df2.printSchema
root
|-- time1: timestamp (nullable = true)
|-- time2: timestamp (nullable = true)
scala> df2.withColumn("diff_sec",unix_timestamp('time2)-unix_timestamp('time1)).withColumn("diff_min",'diff_sec/60).show(false)
+-------------------+-------------------+--------+--------+
|time1              |time2              |diff_sec|diff_min|
+-------------------+-------------------+--------+--------+
|2017-01-01 00:01:00|2017-01-01 00:05:00|240     |4.0     |
+-------------------+-------------------+--------+--------+
scala>
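The `diff_sec`/`diff_min` arithmetic above is just epoch-second subtraction. The same calculation in plain Scala with `java.time` (a stand-alone sketch for illustration, not the Spark API):

```scala
import java.time.{Duration, LocalDateTime}

// The two parsed timestamps from the example (12:01 AM and 12:05 AM).
val t1 = LocalDateTime.of(2017, 1, 1, 0, 1, 0)
val t2 = LocalDateTime.of(2017, 1, 1, 0, 5, 0)

// Same as unix_timestamp('time2) - unix_timestamp('time1) in the Spark query.
val diffSec = Duration.between(t1, t2).getSeconds
val diffMin = diffSec / 60.0

println(s"$diffSec sec, $diffMin min") // 240 sec, 4.0 min
```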
UPDATE1:
scala> val df = Seq(("1/01/2017 12:01:00 AM"),("1/1/2017 12:05:00 AM")).toDF("timex")
df: org.apache.spark.sql.DataFrame = [timex: string]
scala> val df2 = df.withColumn("timex",to_timestamp('timex,"d/MM/yyyy hh:mm:ss a"))
df2: org.apache.spark.sql.DataFrame = [timex: timestamp]
scala> df2.show
+-------------------+
|              timex|
+-------------------+
|2017-01-01 00:01:00|
|2017-01-01 00:05:00|
+-------------------+
scala> val df3 = df2.alias("t1").join(df2.alias("t2"), $"t1.timex" =!= $"t2.timex", "leftOuter").toDF("time1","time2")
df3: org.apache.spark.sql.DataFrame = [time1: timestamp, time2: timestamp]
scala> df3.withColumn("diff_sec",unix_timestamp('time2)-unix_timestamp('time1)).withColumn("diff_min",'diff_sec/60).show(false)
+-------------------+-------------------+--------+--------+
|time1              |time2              |diff_sec|diff_min|
+-------------------+-------------------+--------+--------+
|2017-01-01 00:01:00|2017-01-01 00:05:00|240     |4.0     |
|2017-01-01 00:05:00|2017-01-01 00:01:00|-240    |-4.0    |
+-------------------+-------------------+--------+--------+
scala> df3.withColumn("diff_sec",unix_timestamp('time2)-unix_timestamp('time1)).withColumn("diff_min",'diff_sec/60).show(1,false)
+-------------------+-------------------+--------+--------+
|time1              |time2              |diff_sec|diff_min|
+-------------------+-------------------+--------+--------+
|2017-01-01 00:01:00|2017-01-01 00:05:00|240     |4.0     |
+-------------------+-------------------+--------+--------+
only showing top 1 row
scala>
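The `=!=` self-join above works for two rows, but with more rows it pairs every row with every other row, and it silently drops pairs of identical timestamps. For "time since the previous row" the usual Spark approach is `lag('timex, 1).over(Window.orderBy('timex))`, then subtracting `unix_timestamp` values. The logic that window computes can be sketched in plain Scala (stand-alone illustrative code, not the Spark API):

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit
import java.util.Locale

// Parse the two sample rows with the same pattern the answer uses.
val fmt = DateTimeFormatter.ofPattern("d/M/yyyy hh:mm:ss a", Locale.ENGLISH)
val rows = Seq("1/1/2017 12:01:00 AM", "1/1/2017 12:05:00 AM")
  .map(LocalDateTime.parse(_, fmt))

// Pair each row with its predecessor, as lag(timex, 1) over an ordered
// window would, and take the difference in seconds.
val diffs = rows.sliding(2).map(w => ChronoUnit.SECONDS.between(w(0), w(1))).toList
println(diffs) // List(240)
```

Unlike the self-join, this scales linearly with the number of rows and handles duplicate timestamps (their diff is simply 0).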