开发者社区> 问答> 正文

如何计算spark Scala中2行之间的时间差

我试图在两行相同的列之间找到时间,包括日期和时间,如下所示,

column1
1/1/2017 12:01:00 AM
1/1/2017 12:05:00 AM
所以我想得到column1的第1行和第2行的两行之间的时间变化,因为它们都属于同一个日期。请让我知道实现它的最佳方法是什么?

展开
收起
社区小助手 2018-12-05 14:07:12 8363 0
2 条回答
写回答
取消 提交回答
  • val sparkSession = SparkSession .builder() .appName(this.getClass.getSimpleName) //本地测试 .master("local[4]") .config("spark.driver.host", "127.0.0.1") .config("spark.driver.bindAddress", "127.0.0.1") .getOrCreate() val rows = Seq( Row("31-AUG-21 02.16.33.371 PM") ) val schema = StructType( Seq( StructField("my_time", StringType, true) ) ) val rowsRDD = sparkSession.sparkContext.parallelize(rows, 4) val df = sparkSession.createDataFrame(rowsRDD, schema)

      df.withColumn("time1",expr("date_format(from_unixtime(unix_timestamp(my_time,'dd-MMM-yy hh.mm.ss.SSS a'),'yyyy-MM-dd HH:mm:ss.SSS'),'yyyy-MM-dd HH:mm:ss.SSS')"))
      .show(10)
    
    2021-09-03 17:36:20
    赞同 展开评论 打赏
  • 社区小助手是spark中国社区的管理员,我会定期更新直播回顾等资料和文章干货,还整合了大家在钉群提出的有关spark的问题及回答。

    您需要将列转换为时间戳,然后执行diff计算。看一下这个:

    scala> val df = Seq(("1/01/2017 12:01:00 AM","1/1/2017 12:05:00 AM")).toDF("time1","time2")
    df: org.apache.spark.sql.DataFrame = [time1: string, time2: string]

    scala> val df2 = df.withColumn("time1",to_timestamp('time1,"d/MM/yyyy hh:mm:ss a")).withColumn("time2",to_timestamp('time2,"d/MM/yyyy hh:mm:ss a"))
    df2: org.apache.spark.sql.DataFrame = [time1: timestamp, time2: timestamp]

    scala> df2.printSchema
    root
    |-- time1: timestamp (nullable = true)
    |-- time2: timestamp (nullable = true)

    scala> df2.withColumn("diff_sec",unix_timestamp('time2)-unix_timestamp('time1)).withColumn("diff_min",'diff_sec/60).show(false)
    time1 time2 diff_sec diff_min
    2017-01-01 00:01:00 2017-01-01 00:05:00 240 4.0

    scala>
    UPDATE1:

    scala> val df = Seq(("1/01/2017 12:01:00 AM"),("1/1/2017 12:05:00 AM")).toDF("timex")
    df: org.apache.spark.sql.DataFrame = [timex: string]

    scala> val df2 = df.withColumn("timex",to_timestamp('timex,"d/MM/yyyy hh:mm:ss a"))
    df2: org.apache.spark.sql.DataFrame = [timex: timestamp]

    scala> df2.show
    timex
    2017-01-01 00:01:00
    2017-01-01 00:05:00

    scala> val df3 = df2.alias("t1").join(df2.alias("t2"), $"t1.timex" =!= $"t2.timex", "leftOuter").toDF("time1","time2")
    df3: org.apache.spark.sql.DataFrame = [time1: timestamp, time2: timestamp]

    scala> df3.withColumn("diff_sec",unix_timestamp('time2)-unix_timestamp('time1)).withColumn("diff_min",'diff_sec/60).show(false)
    time1 time2 diff_sec diff_min
    2017-01-01 00:01:00 2017-01-01 00:05:00 240 4.0
    2017-01-01 00:05:00 2017-01-01 00:01:00 -240 -4.0
    scala> df3.withColumn("diff_sec",unix_timestamp('time2)-unix_timestamp('time1)).withColumn("diff_min",'diff_sec/60).show(1,false)
    time1 time2 diff_sec diff_min
    2017-01-01 00:01:00 2017-01-01 00:05:00 240 4.0

    only showing top 1 row

    scala>

    2019-07-17 23:18:21
    赞同 展开评论 打赏
问答排行榜
最热
最新

相关电子书

更多
Hybrid Cloud and Apache Spark 立即下载
Scalable Deep Learning on Spark 立即下载
Comparison of Spark SQL with Hive 立即下载