07: Dates and times

STA35B: Statistical Data Science 2

Akira Horiguchi

Dates and times

Dates and times: complications

Many things complicate working with dates and times

Not all years have 365 days

A year is a leap year if it’s divisible by 4, unless it’s also divisible by 100, except if it’s also divisible by 400. In other words, in every set of 400 years, there are 97 leap years.

Not every day in every location has 24 hours
- Daylight saving time means one day of the year has 23 hours and another has 25
Time zones are difficult!
We will use the lubridate package (loaded as part of the core tidyverse in recent versions) and nycflights13.
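
The leap-year rule and the effect of daylight saving time on day length can both be checked in base R. This is a minimal sketch: `is_leap()` and `day_length_hours()` are helper names introduced here for illustration, and the time zone `America/New_York` with the 2026 changeover dates is only an example (the result relies on the time zone database installed on your system).

```r
# Hypothetical helper (not part of lubridate): the Gregorian leap-year rule.
is_leap <- function(year) {
  (year %% 4 == 0 & year %% 100 != 0) | (year %% 400 == 0)
}

# Count the leap years in one full 400-year cycle.
n_leap <- sum(is_leap(2001:2400))
n_leap

# Hypothetical helper: hours from local midnight of `day` to the next
# local midnight in time zone `tz`.
day_length_hours <- function(day, tz) {
  start <- as.POSIXct(day, tz = tz)
  end <- as.POSIXct(as.character(as.Date(day) + 1), tz = tz)
  as.numeric(difftime(end, start, units = "hours"))
}

spring <- day_length_hours("2026-03-08", "America/New_York")  # clocks go forward
fall   <- day_length_hours("2026-11-01", "America/New_York")  # clocks go back
c(spring = spring, fall = fall)
```
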

library(tidyverse)
library(nycflights13)

Creating dates and times

Three types of date/time data:

A date. Tibbles print this as <date>
A time within a day. Tibbles print this as <time>
A date-time is a date plus a time. Tibbles print this as <dttm>
- It uniquely identifies an instant in time (typically to the nearest second).

We will focus on dates and date-times as R doesn’t have a native class for storing times. If you need one, you can use the hms package.

Use the simplest possible data type that works for your needs (e.g., use a date instead of a date-time, if you can).

Get current date or date-time:

today()  # get current date

[1] "2025-04-11"

now()  # get current date-time

[1] "2025-04-11 17:43:44 PDT"

tibble(y=today(), x=now())

# A tibble: 1 × 2
  y          x                  
  <date>     <dttm>             
1 2025-04-11 2025-04-11 17:43:44

class(today())

[1] "Date"

class(now())

[1] "POSIXct" "POSIXt"

Extracting dates and times from strings

To create dates, convert a string using functions whose names are three letter combos of “y”, “m”, “d”

ymd("2017-01-31")

[1] "2017-01-31"

mdy("January 31st, 2017")

[1] "2017-01-31"

mdy("January 31, 2017")

[1] "2017-01-31"

dmy("31-Jan-2017")

[1] "2017-01-31"
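
The function name tells lubridate the order of the components, so choosing the wrong one can silently give a different date. A small sketch with an ambiguous string (the string itself is made up for illustration):

```r
library(lubridate)

# The same string parsed with two different component orders.
x <- "01-02-2017"
as_day_first   <- dmy(x)  # day = 01, month = 02
as_month_first <- mdy(x)  # month = 01, day = 02

as_day_first
as_month_first
```

Both calls succeed without a warning, so it is worth checking which order your data actually uses.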

To create date-times, add an underscore _ and then one or more of “h”, “m”, “s”.

(ymd_hms("2017-01-31 20:11:59"))

[1] "2017-01-31 20:11:59 UTC"

mdy_hm("01/31/2017 08:01")

[1] "2017-01-31 08:01:00 UTC"

Times are assumed to be in the UTC time zone by default; supply tz= to use a different time zone

mdy_hm("01/31/2017 08:01", tz = "EST")

[1] "2017-01-31 08:01:00 EST"
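
The tz= argument changes which instant the string refers to, not just how it prints. A minimal sketch, assuming the time zone name `America/New_York` (chosen here for illustration; the slide above uses "EST"):

```r
library(lubridate)

utc <- mdy_hm("01/31/2017 08:01")                           # default: UTC
ny  <- mdy_hm("01/31/2017 08:01", tz = "America/New_York")  # same clock time, New York

# Hours between the two instants: 08:01 in New York happens later than 08:01 UTC.
gap_hours <- as.numeric(difftime(ny, utc, units = "hours"))
gap_hours
```
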

Creating date-times from dplyr parts

Remember how flights stored some of the date information:

flights |> select(year, month, day, hour, minute)

# A tibble: 336,776 × 5
    year month   day  hour minute
   <int> <int> <int> <dbl>  <dbl>
 1  2013     1     1     5     15
 2  2013     1     1     5     29
...

To create date/time from this, can use make_date() or make_datetime():

flights |>
  mutate(departure = make_datetime(year, month, day, hour, minute), .keep="used")

# A tibble: 336,776 × 6
    year month   day  hour minute departure          
   <int> <int> <int> <dbl>  <dbl> <dttm>             
 1  2013     1     1     5     15 2013-01-01 05:15:00
 2  2013     1     1     5     29 2013-01-01 05:29:00
...
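
Both functions are vectorized, which is why they work inside mutate(). A toy sketch outside of a data frame; the vectors `yr`, `mo`, `dy`, `hr`, and `mi` are made-up stand-ins for the flights columns:

```r
library(lubridate)

# Toy component vectors standing in for year, month, day, hour, minute.
yr <- c(2013, 2013)
mo <- c(1, 12)
dy <- c(1, 31)
hr <- c(5, 23)
mi <- c(15, 59)

dates     <- make_date(yr, mo, dy)              # Date vector
datetimes <- make_datetime(yr, mo, dy, hr, mi)  # date-time vector (UTC by default)

dates
datetimes
```
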

Creating date-times from dplyr parts

We’ll now do a similar computation for the four time columns in flights
We’ll do so using a function - we haven’t seen this yet, but we will see it in a couple weeks

make_datetime_100 <- function(year, month, day, time) {
  # recall: time is an integer such as 517 (5:17) or 1929 (19:29); split it with integer arithmetic
  hr <- time %/% 100  # hour: everything before the last two digits of `time`
  mnt <- time %% 100  # minute: last two digits of `time`
  make_datetime(year, month, day, hr, mnt)
}
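
To see what the integer division and remainder do, here is the same arithmetic on a few made-up time values (a sketch; the values and the fixed date are chosen only for illustration):

```r
library(lubridate)

time <- c(517, 1929, 3)   # 5:17, 19:29, and 0:03 stored as integers
hr  <- time %/% 100       # integer division drops the last two digits
mnt <- time %% 100        # remainder keeps the last two digits

dt <- make_datetime(2013, 1, 1, hr, mnt)
dt
```
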

Creating date-times from dplyr parts

flights_dt <- flights |> 
  filter(!is.na(dep_time), !is.na(arr_time)) |> 
  mutate(
    dep_time = make_datetime_100(year, month, day, dep_time),
    arr_time = make_datetime_100(year, month, day, arr_time),
    sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
    sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
  ) |> 
  select(origin, dest, ends_with("delay"), ends_with("time"))
flights_dt

# A tibble: 328,063 × 9
   origin dest  dep_delay arr_delay dep_time            sched_dep_time     
   <chr>  <chr>     <dbl>     <dbl> <dttm>              <dttm>             
 1 EWR    IAH           2        11 2013-01-01 05:17:00 2013-01-01 05:15:00
 2 LGA    IAH           4        20 2013-01-01 05:33:00 2013-01-01 05:29:00
 3 JFK    MIA           2        33 2013-01-01 05:42:00 2013-01-01 05:40:00
 4 JFK    BQN          -1       -18 2013-01-01 05:44:00 2013-01-01 05:45:00
 5 LGA    ATL          -6       -25 2013-01-01 05:54:00 2013-01-01 06:00:00
 6 EWR    ORD          -4        12 2013-01-01 05:54:00 2013-01-01 05:58:00
 7 EWR    FLL          -5        19 2013-01-01 05:55:00 2013-01-01 06:00:00
 8 LGA    IAD          -3       -14 2013-01-01 05:57:00 2013-01-01 06:00:00
 9 JFK    MCO          -3        -8 2013-01-01 05:57:00 2013-01-01 06:00:00
10 LGA    ORD          -2         8 2013-01-01 05:58:00 2013-01-01 06:00:00
# ℹ 328,053 more rows
# ℹ 3 more variables: arr_time <dttm>, sched_arr_time <dttm>, air_time <dbl>

Updated flights df with times for arrivals/departures

We’ll now use these date-time values

flights_dt |>
  filter(dep_time < ymd(20130102))

# A tibble: 837 × 9
   origin dest  dep_delay arr_delay dep_time            sched_dep_time     
   <chr>  <chr>     <dbl>     <dbl> <dttm>              <dttm>             
 1 EWR    IAH           2        11 2013-01-01 05:17:00 2013-01-01 05:15:00
 2 LGA    IAH           4        20 2013-01-01 05:33:00 2013-01-01 05:29:00
 3 JFK    MIA           2        33 2013-01-01 05:42:00 2013-01-01 05:40:00
 4 JFK    BQN          -1       -18 2013-01-01 05:44:00 2013-01-01 05:45:00
 5 LGA    ATL          -6       -25 2013-01-01 05:54:00 2013-01-01 06:00:00
 6 EWR    ORD          -4        12 2013-01-01 05:54:00 2013-01-01 05:58:00
 7 EWR    FLL          -5        19 2013-01-01 05:55:00 2013-01-01 06:00:00
 8 LGA    IAD          -3       -14 2013-01-01 05:57:00 2013-01-01 06:00:00
 9 JFK    MCO          -3        -8 2013-01-01 05:57:00 2013-01-01 06:00:00
10 LGA    ORD          -2         8 2013-01-01 05:58:00 2013-01-01 06:00:00
# ℹ 827 more rows
# ℹ 3 more variables: arr_time <dttm>, sched_arr_time <dttm>, air_time <dbl>

Date-time components: accessor functions

x <- ymd_hms("2026-07-08 12:34:56")

year(), month(), day(), hour(), minute(), second()

year(x)

[1] 2026

month(x)

[1] 7

mday(x)

[1] 8

wday() (day of the week), mday() (day of the month), yday() (day of the year)

wday(x) # 2026-07-08 is Weds. (Sun.=1)

[1] 4

yday(x)

[1] 189

month() and wday() accept label=TRUE, which returns the abbreviated name of the month or weekday

Set abbr=FALSE to get full name

month(x, label=TRUE)

[1] Jul
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec

wday(x, label=TRUE, abbr=FALSE)

[1] Wednesday
7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday

Date-time components: example

You can do things like find the minute of the hour with the highest average departure delay:

flights_dt |>
  mutate(minute = minute(dep_time)) |>
  group_by(minute) |>
  summarize(avg_delay = mean(dep_delay, na.rm=TRUE)) |> 
  arrange(desc(avg_delay))

# A tibble: 60 × 2
   minute avg_delay
    <int>     <dbl>
 1     17      18.6
 2     32      17.8
 3     34      17.8
 4     33      17.7
 5     37      17.5
 6     15      17.2
 7     13      17.1
 8     36      17.1
 9     16      17.1
10     18      17.0
# ℹ 50 more rows

Rounding

Analogues of standard rounding functions for dates:

floor_date()
ceiling_date()
round_date()

Arguments:

vector of dates to adjust,
name of unit (week, day, etc)

flights_dt |>
  slice_sample(n=9) |>  # selects 9 rows at random
  mutate(month = floor_date(dep_time, "month"), 
         .keep = "used")

# A tibble: 9 × 2
  dep_time            month              
  <dttm>              <dttm>             
1 2013-05-28 14:40:00 2013-05-01 00:00:00
2 2013-11-07 10:37:00 2013-11-01 00:00:00
3 2013-09-02 19:30:00 2013-09-01 00:00:00
4 2013-07-09 23:49:00 2013-07-01 00:00:00
5 2013-03-22 15:27:00 2013-03-01 00:00:00
6 2013-04-17 07:03:00 2013-04-01 00:00:00
7 2013-09-25 11:52:00 2013-09-01 00:00:00
8 2013-11-11 20:25:00 2013-11-01 00:00:00
9 2013-04-18 06:35:00 2013-04-01 00:00:00
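
The three rounding functions differ only in direction. A short sketch on a single date-time, reusing the value of `x` from the accessor slides:

```r
library(lubridate)

x <- ymd_hms("2026-07-08 12:34:56")

down    <- floor_date(x, "hour")    # round down to the start of the hour
up      <- ceiling_date(x, "hour")  # round up to the start of the next hour
nearest <- round_date(x, "hour")    # 34:56 past the hour is closer to 13:00
month_start <- floor_date(x, "month")

down
up
nearest
month_start
```
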

Examples

Compute the average delay time of flights that depart at times in two groups:

departure minute is in 20-30 or 50-59 vs. all other minutes (minutes run from 0 to 59)

flights_dt |>
  mutate(dep_minute = minute(dep_time),
         mins_2030 = dep_minute >= 20 & dep_minute <= 30,
         mins_5060 = dep_minute >= 50 & dep_minute <= 59,
         mins_2030_or_5060 = mins_2030 | mins_5060) |>
  group_by(mins_2030_or_5060) |>
  summarize(avg_dep_delay = mean(dep_delay, na.rm=TRUE),
            n = n())

# A tibble: 2 × 3
  mins_2030_or_5060 avg_dep_delay      n
  <lgl>                     <dbl>  <int>
1 FALSE                     15.5  181621
2 TRUE                       8.90 146442
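
The two range checks can also be written as a single membership test with %in%. This is an alternative sketch, not the code used above, shown on a few made-up departure times:

```r
library(lubridate)

dep <- ymd_hm(c("2013-01-01 05:17", "2013-01-01 05:20", "2013-01-01 05:30",
                "2013-01-01 05:31", "2013-01-01 05:50", "2013-01-01 05:59"))

# TRUE when the departure minute falls in 20-30 or 50-59.
in_window <- minute(dep) %in% c(20:30, 50:59)
in_window
```
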

Time spans

Date-time arithmetic

Subtracting two dates returns a “difftime” object:

# How old is Hadley?
h_age <- today() - ymd("1979-10-14")
h_age

Time difference of 16616 days

str(h_age)

 'difftime' num 16616
 - attr(*, "units")= chr "days"

A difftime can be recorded in seconds, minutes, hours, days, or weeks, so you need to check its units before using the number.
as.duration() always uses seconds:

as.duration(h_age)

[1] "1435622400s (~45.49 years)"
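
To illustrate why the unit matters, here is a sketch with two date-times about 35 minutes apart (values made up for illustration). R picks the unit of the difftime automatically, while the Duration is always stored in seconds:

```r
library(lubridate)

start <- ymd_hms("2026-07-08 12:00:00")
end   <- ymd_hms("2026-07-08 12:34:56")

gap <- end - start   # a difftime; R chooses the unit
gap_unit <- units(gap)
gap_unit

# Converting to a Duration gives a number of seconds regardless of the unit above.
secs <- as.numeric(as.duration(gap))
secs
```
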

Can think of difference between two times as a time span

Time spans

Three important classes representing time spans:

Durations: exact time, measured in seconds
Periods: human units, like weeks/months
Intervals: represent starting/end point

ddays(2) |> class()

[1] "Duration"
attr(,"package")
[1] "lubridate"

days(2) |> class()

[1] "Period"
attr(,"package")
[1] "lubridate"

(ymd("2023-01-01") %--% ymd("2024-01-01")) |> class()

[1] "Interval"
attr(,"package")
[1] "lubridate"

Durations: constructing durations

Durations measure exact time span (in seconds)

Useful constructors: d{units}, {units} is seconds, days, etc

dseconds(15)

[1] "15s"

dminutes(10)

[1] "600s (~10 minutes)"

dhours(c(12, 24))

[1] "43200s (~12 hours)" "86400s (~1 days)"

A month has no fixed length in seconds (28 to 31 days), so converting a month to an exact Duration is not well-defined

Arithmetic

Can add and multiply Durations:

2 * dyears(1)

[1] "63115200s (~2 years)"

dyears(1) + dweeks(12) + dhours(5)

[1] "38833200s (~1.23 years)"

Can add Durations to, or subtract them from, dates:

today() + ddays(1) # returns Date

[1] "2025-04-12"

today() - dyears(1) # returns date-time

[1] "2024-04-10 18:00:00 UTC"
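
The 18:00:00 in the last result comes from how dyears() is defined. A small sketch, using the fixed date 2025-04-11 in place of today() so that the result is reproducible:

```r
library(lubridate)

# One duration-year in days: an average year, not a calendar year.
days_per_dyear <- as.numeric(dyears(1)) / 86400
days_per_dyear

# Subtracting it moves back 365 days plus a quarter day (6 hours).
start <- ymd("2025-04-11")
back <- start - dyears(1)
back
```
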

Durations: computations and weird results

Because durations represent an exact number of seconds, adding one across a daylight saving time change can give a surprising clock time

one_am <- ymd_hms("2026-03-08 01:00:00", tz = "America/New_York")
one_am

[1] "2026-03-08 01:00:00 EST"

one_am + ddays(1)  # the day advances, but the hour also advances

[1] "2026-03-09 02:00:00 EDT"

Adding a Duration of one day adds exactly 86,400 seconds, but March 8, 2026 has only 23 hours in New York because clocks move from EST to EDT, so the clock time lands one hour later.
The lubridate package provides periods to address this

Periods

Periods are time spans but don’t have fixed length in seconds

Work like “human” times, i.e. days/months

one_am

[1] "2026-03-08 01:00:00 EST"

# days() returns a Period object
one_am + days(1)

[1] "2026-03-09 01:00:00 EDT"

# ddays() returns a Duration object
one_am + ddays(1)

[1] "2026-03-09 02:00:00 EDT"

Useful constructors and behavior under +/*:

hours(c(12, 24))

[1] "12H 0M 0S" "24H 0M 0S"

days(7)

[1] "7d 0H 0M 0S"

months(2:3)

[1] "2m 0d 0H 0M 0S" "3m 0d 0H 0M 0S"

months(6) + days(2)

[1] "6m 2d 0H 0M 0S"

10 * (months(6) + days(2))

[1] "60m 20d 0H 0M 0S"

Periods: vs Durations

Adding periods can be a bit more in line with expectations

Example: a leap year

ymd("2024-01-01") + dyears(1)  # date + Duration

[1] "2024-12-31 06:00:00 UTC"

ymd("2024-01-01") + years(1)  # date + Period

[1] "2025-01-01"

Example: Daylight saving time

one_am + ddays(1)  # date-time + Duration

[1] "2026-03-09 02:00:00 EDT"

one_am + days(1)  # date-time + Period

[1] "2026-03-09 01:00:00 EDT"

Fixing a bug in `flights_dt`

Some flights arrived before they departed?!

flights_dt |> filter(arr_time < dep_time) |> select(origin, dest, arr_time, dep_time)

# A tibble: 10,633 × 4
   origin dest  arr_time            dep_time           
   <chr>  <chr> <dttm>              <dttm>             
 1 EWR    BQN   2013-01-01 00:03:00 2013-01-01 19:29:00
 2 JFK    DFW   2013-01-01 00:29:00 2013-01-01 19:39:00
 3 EWR    TPA   2013-01-01 00:08:00 2013-01-01 20:58:00
...

These are overnight flights: flights arrived on the following day.

flights |> filter(arr_time < dep_time) |> select(origin, dest, arr_time, dep_time)

# A tibble: 10,633 × 4
   origin dest  arr_time dep_time
   <chr>  <chr>    <int>    <int>
 1 EWR    BQN          3     1929
 2 JFK    DFW         29     1939
 3 EWR    TPA          8     2058
...

Fixing a bug in `flights_dt`

We can fix this by adding a day to the arrival time of each overnight flight.

days(TRUE) gets coerced to days(1)
days(FALSE) gets coerced to days(0)

flights_dt_2 <- flights_dt |> 
  mutate(
    overnight = arr_time < dep_time,  # returns T/F
    arr_time = arr_time + days(overnight),  # coercion
    sched_arr_time = sched_arr_time + days(overnight)
  )

Now flights_dt_2 has no flight that appears to arrive before it departed:

flights_dt_2 |> filter(arr_time < dep_time)

# A tibble: 0 × 10
# ℹ 10 variables: origin <chr>, dest <chr>, dep_delay <dbl>, arr_delay <dbl>,
#   dep_time <dttm>, sched_dep_time <dttm>, arr_time <dttm>,
#   sched_arr_time <dttm>, air_time <dbl>, overnight <lgl>
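
The coercion of TRUE/FALSE inside days() can be checked on a toy example. The two flights below are made up for illustration: the first is an overnight flight, the second is not.

```r
library(lubridate)

dep <- ymd_hms(c("2013-01-01 19:29:00", "2013-01-01 05:17:00"))
arr <- ymd_hms(c("2013-01-01 00:03:00", "2013-01-01 08:30:00"))

overnight <- arr < dep              # TRUE for the first flight only
arr_fixed <- arr + days(overnight)  # adds 1 day where TRUE, 0 days where FALSE

overnight
arr_fixed
```
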

Intervals

Motivating calculations

dyears(1) / ddays(365) does not return 1, since dyears() is defined as the number of seconds in an average year of 365.25 days.
years(1) / days(1) has no single correct answer: a year has 365 days, or 366 if it is a leap year, so lubridate cannot return exactly 365.
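
Both calculations can be run directly. A minimal sketch; the value returned for the Period ratio is lubridate's estimate based on an average year, and may differ in older versions of the package:

```r
library(lubridate)

# Duration ratio: 365.25 days of seconds divided by 365 days of seconds.
dur_ratio <- dyears(1) / ddays(365)
dur_ratio

# Period ratio: an estimate, since the true answer depends on the year.
per_ratio <- years(1) / days(1)
per_ratio
```
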

Intervals allow for defining a time interval between two specific date-times.

Format: start %--% end:

(y2023 <- ymd("2023-01-01") %--% ymd("2024-01-01"))

[1] 2023-01-01 UTC--2024-01-01 UTC

(y2024 <- ymd("2024-01-01") %--% ymd("2025-01-01"))

[1] 2024-01-01 UTC--2025-01-01 UTC

Can divide by days() to find out how many days in specific year:

y2023 / days(1)

[1] 365

y2024 / days(1)

[1] 366

Putting it all together

Example: extracting dates and computing durations

df

# A tibble: 3 × 2
  name  entry                                                  
  <chr> <chr>                                                  
1 Ant   First arrival: Jan 2, 2005; Second arrival: Jan 6, 2023
2 Bug   First time: Jan 5, 1997; Second time: Jan 8, 2015      
3 Cat   First visit: Jan 4, 1990; Second visit: Jan 9, 2008

Task: using df, return a tibble that says how many days elapsed between the first and second visit.

Complex task! We need to:

Parse the “entry” column to extract the dates
Turn them into proper dates
Compute the number of days between visits (leap years happen at different times!)

Let’s start by parsing “entry” and creating two columns for different dates

df |>  # step 1: separate two strings
  separate_wider_delim(entry, delim="; ", names=c("d1", "d2"))

# A tibble: 3 × 3
  name  d1                         d2                         
  <chr> <chr>                      <chr>                      
1 Ant   First arrival: Jan 2, 2005 Second arrival: Jan 6, 2023
2 Bug   First time: Jan 5, 1997    Second time: Jan 8, 2015   
3 Cat   First visit: Jan 4, 1990   Second visit: Jan 9, 2008

df |>  # step 2: get date from "d1"
  separate_wider_delim(entry, delim="; ", names=c("d1", "d2")) |> 
  separate_wider_delim(d1, delim=": ", names=c(NA, "date1"))

# A tibble: 3 × 3
  name  date1       d2                         
  <chr> <chr>       <chr>                      
1 Ant   Jan 2, 2005 Second arrival: Jan 6, 2023
2 Bug   Jan 5, 1997 Second time: Jan 8, 2015   
3 Cat   Jan 4, 1990 Second visit: Jan 9, 2008

df |>  # step 3: get date from "d2"
  separate_wider_delim(entry, delim="; ", names=c("d1", "d2")) |> 
  separate_wider_delim(d1, delim=": ", names=c(NA, "date1")) |> 
  separate_wider_delim(d2, delim=": ", names=c(NA, "date2"))

# A tibble: 3 × 3
  name  date1       date2      
  <chr> <chr>       <chr>      
1 Ant   Jan 2, 2005 Jan 6, 2023
2 Bug   Jan 5, 1997 Jan 8, 2015
3 Cat   Jan 4, 1990 Jan 9, 2008

df |>  # step 4: convert "date1" and "date2" into Dates
  separate_wider_delim(entry, delim="; ", names=c("d1", "d2")) |> 
  separate_wider_delim(d1, delim=": ", names=c(NA, "date1")) |> 
  separate_wider_delim(d2, delim=": ", names=c(NA, "date2")) |> 
  mutate(date1 = mdy(date1), date2 = mdy(date2))

# A tibble: 3 × 3
  name  date1      date2     
  <chr> <date>     <date>    
1 Ant   2005-01-02 2023-01-06
2 Bug   1997-01-05 2015-01-08
3 Cat   1990-01-04 2008-01-09

df |>  # step 5: compute number of days between visits
  separate_wider_delim(entry, delim="; ", names=c("d1", "d2")) |> 
  separate_wider_delim(d1, delim=": ", names=c(NA, "date1")) |> 
  separate_wider_delim(d2, delim=": ", names=c(NA, "date2")) |> 
  mutate(date1 = mdy(date1), date2 = mdy(date2), 
         days_elapsed = days(date2-date1))

# A tibble: 3 × 4
  name  date1      date2      days_elapsed  
  <chr> <date>     <date>     <Period>      
1 Ant   2005-01-02 2023-01-06 6578d 0H 0M 0S
2 Bug   1997-01-05 2015-01-08 6577d 0H 0M 0S
3 Cat   1990-01-04 2008-01-09 6579d 0H 0M 0S
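
If a plain number of days is more convenient than a Period column, two alternatives are sketched below. This is not the approach used in the slides above; the date vectors simply repeat the parsed values from the table.

```r
library(lubridate)

date1 <- mdy(c("Jan 2, 2005", "Jan 5, 1997", "Jan 4, 1990"))
date2 <- mdy(c("Jan 6, 2023", "Jan 8, 2015", "Jan 9, 2008"))

# Option 1: subtract the Dates and convert the difftime to a number of days.
n_days_difftime <- as.numeric(date2 - date1, units = "days")

# Option 2: build an Interval and divide it by a one-day Period.
n_days_interval <- (date1 %--% date2) / days(1)

n_days_difftime
n_days_interval
```

Both options account for the leap days that fall between each pair of dates.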
