Spark SQL Date and Timestamp Functions

Spark SQL provides built-in standard Date and Timestamp functions, defined in the DataFrame API, and all of them accept a date, timestamp, or string as input. A string input is converted when it matches a parseable date or timestamp format such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS, returning a date or a timestamp respectively; if the input string cannot be converted to a date or timestamp, the function returns null.

Use these standard functions wherever possible: compared to Spark UDFs, they are safer at compile time, handle null values, and perform better.
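As a quick illustration of the null-handling difference, here is a minimal sketch (it assumes the SparkSession setup shown below; the UDF and sample data are hypothetical, added only for contrast):


import org.apache.spark.sql.functions.{col, udf, year}

// One valid date string and one null value.
val df = Seq(Some("2019-01-23"), None).toDF("input")

// Hypothetical hand-written UDF: extracts the year by substring.
// It is NOT null-safe -- it throws a NullPointerException on the null row.
val yearUdf = udf((s: String) => s.substring(0, 4).toInt)

// The built-in year() simply returns null for the null row.
df.select(col("input"), year(col("input")).as("year")).show()

// df.select(yearUdf(col("input"))).show()  // fails at runtime on the null row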

For readability, the Date and Timestamp functions are grouped as follows.

  • [Spark SQL Date Functions]
  • [Spark SQL Timestamp Functions]
  • [Date and Timestamp Window Functions]

Before trying any of the examples below, make sure to [create a SparkSession].


import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder()
    .master("local[3]")
    .appName("SparkByExample")
    .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
import org.apache.spark.sql.functions._

Spark SQL Date Functions

current_date(): Column
  Returns the current date as a date column.

date_format(dateExpr: Column, format: String): Column
  Converts a date/timestamp/string to a value of string in the format specified by the date format given by the second argument.

to_date(e: Column): Column
  Converts the column into DateType by casting rules to DateType.

to_date(e: Column, fmt: String): Column
  Converts the column into a DateType with a specified format.

add_months(startDate: Column, numMonths: Int): Column
  Returns the date that is numMonths after startDate.

date_add(start: Column, days: Int): Column
  Returns the date that is days days after start.

date_sub(start: Column, days: Int): Column
  Returns the date that is days days before start.

datediff(end: Column, start: Column): Column
  Returns the number of days from start to end.

months_between(end: Column, start: Column): Column
  Returns number of months between dates start and end. A whole number is returned if both inputs have the same day of month or both are the last day of their respective months. Otherwise, the difference is calculated assuming 31 days per month.

months_between(end: Column, start: Column, roundOff: Boolean): Column
  Returns number of months between dates end and start. If roundOff is set to true, the result is rounded off to 8 digits; it is not rounded otherwise.

next_day(date: Column, dayOfWeek: String): Column
  Returns the first date which is later than the value of the date column that is on the specified day of the week. For example, next_day('2015-07-27', "Sunday") returns 2015-08-02 because that is the first Sunday after 2015-07-27.

trunc(date: Column, format: String): Column
  Returns date truncated to the unit specified by the format. For example, trunc("2018-11-19 12:01:19", "year") returns 2018-01-01.
  format: 'year', 'yyyy', 'yy' to truncate by year; 'month', 'mon', 'mm' to truncate by month.

date_trunc(format: String, timestamp: Column): Column
  Returns timestamp truncated to the unit specified by the format. For example, date_trunc("year", "2018-11-19 12:01:19") returns 2018-01-01 00:00:00.
  format: 'year', 'yyyy', 'yy' to truncate by year; 'month', 'mon', 'mm' to truncate by month; 'day', 'dd' to truncate by day. Other options are: 'second', 'minute', 'hour', 'week', 'quarter'.

year(e: Column): Column
  Extracts the year as an integer from a given date/timestamp/string.

quarter(e: Column): Column
  Extracts the quarter as an integer from a given date/timestamp/string.

month(e: Column): Column
  Extracts the month as an integer from a given date/timestamp/string.

dayofweek(e: Column): Column
  Extracts the day of the week as an integer from a given date/timestamp/string. Ranges from 1 for a Sunday through to 7 for a Saturday.

dayofmonth(e: Column): Column
  Extracts the day of the month as an integer from a given date/timestamp/string.

dayofyear(e: Column): Column
  Extracts the day of the year as an integer from a given date/timestamp/string.

weekofyear(e: Column): Column
  Extracts the week number as an integer from a given date/timestamp/string. A week is considered to start on a Monday and week 1 is the first week with more than 3 days, as defined by ISO 8601.

last_day(e: Column): Column
  Returns the last day of the month which the given date belongs to. For example, input "2015-07-27" returns "2015-07-31" since July 31 is the last day of the month in July 2015.

from_unixtime(ut: Column): Column
  Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the yyyy-MM-dd HH:mm:ss format.

from_unixtime(ut: Column, f: String): Column
  Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format.

unix_timestamp(): Column
  Returns the current Unix timestamp (in seconds) as a long.

unix_timestamp(s: Column): Column
  Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale.

unix_timestamp(s: Column, p: String): Column
  Converts time string with given pattern to Unix timestamp (in seconds).
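Several entries above (last_day(), date_trunc(), unix_timestamp(), from_unixtime()) have no dedicated example later in this article, so here is a minimal sketch of them. The column name and sample value are illustrative, not from the original, and the unix_timestamp()/from_unixtime() results depend on the session time zone.


import org.apache.spark.sql.functions._

Seq(("2019-01-23 12:01:19"))
  .toDF("input")
  .select(
    col("input"),
    last_day(col("input")).as("last_day"),              // 2019-01-31
    date_trunc("month", col("input")).as("date_trunc"), // 2019-01-01 00:00:00
    unix_timestamp(col("input")).as("unix_seconds"),    // seconds since the epoch
    from_unixtime(unix_timestamp(col("input"))).as("roundtrip")
  ).show(false)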

Spark SQL Timestamp Functions

Below are the Spark SQL Timestamp functions; these functions operate on both date and timestamp values. Examples of the most commonly used ones follow later in this article.

The default format of a Spark Timestamp is yyyy-MM-dd HH:mm:ss.SSSS.

current_timestamp(): Column
  Returns the current timestamp as a timestamp column.

hour(e: Column): Column
  Extracts the hours as an integer from a given date/timestamp/string.

minute(e: Column): Column
  Extracts the minutes as an integer from a given date/timestamp/string.

second(e: Column): Column
  Extracts the seconds as an integer from a given date/timestamp/string.

to_timestamp(s: Column): Column
  Converts to a timestamp by casting rules to TimestampType.

to_timestamp(s: Column, fmt: String): Column
  Converts time string with the given pattern to timestamp.

Spark Date and Timestamp Window Functions

Below are the date and timestamp window functions.

DATE & TIME WINDOW FUNCTION SYNTAXDATE & TIME WINDOW FUNCTION DESCRIPTION
window(timeColumn: Column, windowDuration: String,
slideDuration: String, startTime: String): ColumnBucketize rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported.
window(timeColumn: Column, windowDuration: String, slideDuration: String): ColumnBucketize rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported. The windows start beginning at 1970-01-01 00:00:00 UTC
window(timeColumn: Column, windowDuration: String): ColumnGenerates tumbling time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported. The windows start beginning at 1970-01-01 00:00:00 UTC.
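The window() variants have no worked example in this article, so here is a minimal sketch of a tumbling five-minute window aggregation; the event data and column names are my own, purely illustrative:


import org.apache.spark.sql.functions._

// Hypothetical event data: (timestamp string, value).
val events = Seq(
    ("2019-01-23 12:01:00", 10),
    ("2019-01-23 12:04:30", 20),
    ("2019-01-23 12:07:15", 30)
  ).toDF("event_time", "value")
  .withColumn("event_time", to_timestamp(col("event_time")))

// Tumbling 5-minute windows: 12:01 and 12:04 fall in [12:00, 12:05),
// while 12:07 falls in [12:05, 12:10).
events
  .groupBy(window(col("event_time"), "5 minutes"))
  .agg(sum("value").as("total"))
  .show(false)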

Spark Date Function Examples

Below are examples of the most commonly used date functions.

current_date() and date_format()

This example shows how to get the current date and convert a date to a specific format with date_format(), using Scala. The snippet below parses the input date and converts it from the 'yyyy-MM-dd' format to 'MM-dd-yyyy'.


import org.apache.spark.sql.functions._
Seq(("2019-01-23"))
  .toDF("Input")
  .select( 
    current_date().as("current_date"), 
    col("Input"), 
    date_format(col("Input"), "MM-dd-yyyy").as("format") 
  ).show()


+------------+----------+----------+
|current_date|     Input|    format|
+------------+----------+----------+
|  2019-07-23|2019-01-23|01-23-2019|
+------------+----------+----------+

to_date()

The example below converts a string in the 'MM/dd/yyyy' date format to a 'yyyy-MM-dd' DateType using to_date(), with Scala.


import org.apache.spark.sql.functions._
Seq(("04/13/2019"))
   .toDF("Input")
  .select( col("Input"), 
           to_date(col("Input"), "MM/dd/yyyy").as("to_date") 
   ).show()


+----------+----------+ 
|Input     |to_date   | 
+----------+----------+ 
|04/13/2019|2019-04-13| 
+----------+----------+

datediff()

The example below returns the difference in days between two dates using datediff(), with Scala.


import org.apache.spark.sql.functions._
Seq(("2019-01-23"),("2019-06-24"),("2019-09-20"))
   .toDF("input")
   .select( col("input"), current_date(), 
       datediff(current_date(),col("input")).as("diff") 
    ).show()


+----------+--------------+--------+ 
| input    |current_date()| diff   | 
+----------+--------------+--------+ 
|2019-01-23| 2019-07-23   | 181    | 
|2019-06-24| 2019-07-23   | 29     | 
|2019-09-20| 2019-07-23   | -59    | 
+----------+--------------+--------+

months_between()

The example below returns the number of months between two dates using months_between(), with Scala.


import org.apache.spark.sql.functions._
Seq(("2019-01-23"),("2019-06-24"),("2019-09-20"))
   .toDF("date")
  .select( col("date"), current_date(), 
       datediff(current_date(),col("date")).as("datediff"), 
       months_between(current_date(),col("date")).as("months_between")
   ).show()


+----------+--------------+--------+--------------+ 
| date     |current_date()|datediff|months_between| 
+----------+--------------+--------+--------------+ 
|2019-01-23| 2019-07-23   |     181|           6.0| 
|2019-06-24| 2019-07-23   |      29|    0.96774194| 
|2019-09-20| 2019-07-23   |     -59|   -1.90322581| 
+----------+--------------+--------+--------------+
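The function table above also lists a three-argument months_between() overload. As a small illustrative addition (not in the original article), passing roundOff = false returns the value without the 8-digit rounding seen above:


import org.apache.spark.sql.functions._

Seq(("2019-06-24"))
  .toDF("date")
  .select(
    months_between(current_date(), col("date"), false).as("months_unrounded")
  ).show(false)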

trunc()

The example below truncates a date to the specified unit using trunc(), with Scala.


import org.apache.spark.sql.functions._
Seq(("2019-01-23"),("2019-06-24"),("2019-09-20"))
    .toDF("input")
    .select( col("input"), 
          trunc(col("input"),"Month").as("Month_Trunc"), 
          trunc(col("input"),"Year").as("Month_Year"), 
          trunc(col("input"),"Month").as("Month_Trunc") 
     ).show()


+----------+-----------+----------+
|     input|Month_Trunc|Year_Trunc|
+----------+-----------+----------+
|2019-01-23| 2019-01-01|2019-01-01|
|2019-06-24| 2019-06-01|2019-01-01|
|2019-09-20| 2019-09-01|2019-01-01|
+----------+-----------+----------+

add_months(), date_add(), date_sub()

Here, we add months and days to the given input dates, and subtract them as well.


import org.apache.spark.sql.functions._
Seq(("2019-01-23"),("2019-06-24"),("2019-09-20")).toDF("input")
  .select( col("input"), 
      add_months(col("input"),3).as("add_months"), 
      add_months(col("input"),-3).as("sub_months"), 
      date_add(col("input"),4).as("date_add"), 
      date_sub(col("input"),4).as("date_sub") 
   ).show()


+----------+----------+----------+----------+----------+ 
| input    |add_months|sub_months| date_add | date_sub | 
+----------+----------+----------+----------+----------+ 
|2019-01-23|2019-04-23|2018-10-23|2019-01-27|2019-01-19| 
|2019-06-24|2019-09-24|2019-03-24|2019-06-28|2019-06-20| 
|2019-09-20|2019-12-20|2019-06-20|2019-09-24|2019-09-16| 
+----------+----------+----------+----------+----------+

year(), month(), dayofweek(), dayofmonth(), dayofyear(), next_day(), weekofyear()


import org.apache.spark.sql.functions._
Seq(("2019-01-23"),("2019-06-24"),("2019-09-20"))
  .toDF("input")
  .select( col("input"), year(col("input")).as("year"), 
       month(col("input")).as("month"), 
       dayofweek(col("input")).as("dayofweek"), 
       dayofmonth(col("input")).as("dayofmonth"), 
       dayofyear(col("input")).as("dayofyear"), 
       next_day(col("input"),"Sunday").as("next_day"), 
       weekofyear(col("input")).as("weekofyear") 
   ).show()


+----------+----+-----+---------+----------+---------+----------+----------+
|     input|year|month|dayofweek|dayofmonth|dayofyear|  next_day|weekofyear|
+----------+----+-----+---------+----------+---------+----------+----------+
|2019-01-23|2019|    1|        4|        23|       23|2019-01-27|         4|
|2019-06-24|2019|    6|        2|        24|      175|2019-06-30|        26|
|2019-09-20|2019|    9|        6|        20|      263|2019-09-22|        38|
+----------+----+-----+---------+----------+---------+----------+----------+

Spark Timestamp Function Examples

Below are examples of the most commonly used timestamp functions.

current_timestamp()

Returns the current timestamp in the Spark default format, yyyy-MM-dd HH:mm:ss.


import org.apache.spark.sql.functions._
val df = Seq((1)).toDF("seq")
val curDate = df.withColumn("current_date", current_date())
  .withColumn("current_timestamp", current_timestamp())
curDate.show(false)


+---+------------+-----------------------+
|seq|current_date|current_timestamp      |
+---+------------+-----------------------+
|1  |2019-11-16  |2019-11-16 21:00:55.349|
+---+------------+-----------------------+

to_timestamp()

Converts a string timestamp to the TimestampType format.


import org.apache.spark.sql.functions._
  val dfDate = Seq(("07-01-2019 12 01 19 406"),
    ("06-24-2019 12 01 19 406"),
    ("11-16-2019 16 44 55 406"),
    ("11-16-2019 16 50 59 406")).toDF("input_timestamp")

  dfDate.withColumn("datetype_timestamp",
          to_timestamp(col("input_timestamp"),"MM-dd-yyyy HH mm ss SSS"))
    .show(false)


+-----------------------+-------------------+
|input_timestamp        |datetype_timestamp |
+-----------------------+-------------------+
|07-01-2019 12 01 19 406|2019-07-01 12:01:19|
|06-24-2019 12 01 19 406|2019-06-24 12:01:19|
|11-16-2019 16 44 55 406|2019-11-16 16:44:55|
|11-16-2019 16 50 59 406|2019-11-16 16:50:59|
+-----------------------+-------------------+

hour(), minute() and second()


import org.apache.spark.sql.functions._
  val df = Seq(("2019-07-01 12:01:19.000"),
    ("2019-06-24 12:01:19.000"),
    ("2019-11-16 16:44:55.406"),
    ("2019-11-16 16:50:59.406")).toDF("input_timestamp")

  df.withColumn("hour", hour(col("input_timestamp")))
    .withColumn("minute", minute(col("input_timestamp")))
    .withColumn("second", second(col("input_timestamp")))
    .show(false)


+-----------------------+----+------+------+
|input_timestamp        |hour|minute|second|
+-----------------------+----+------+------+
|2019-07-01 12:01:19.000|12  |1     |19    |
|2019-06-24 12:01:19.000|12  |1     |19    |
|2019-11-16 16:44:55.406|16  |44    |55    |
|2019-11-16 16:50:59.406|16  |50    |59    |
+-----------------------+----+------+------+

Conclusion

This post consolidated the complete list of Spark date and timestamp functions, along with descriptions and examples of the most commonly used ones.
