07: Dates and times

STA35B: Statistical Data Science 2

Akira Horiguchi

Material from Chapter 17 of R4DS2

Dates and times

Dates and times: complications

Many things complicate working with dates and times

  • Not all years have 365 days

    A year is a leap year if it’s divisible by 4, unless it’s also divisible by 100, except if it’s also divisible by 400. In other words, in every set of 400 years, there are 97 leap years.

  • Not every day has 24 hours in every location
    • Daylight saving time means one day of the year has 23 hours and another has 25
  • Time zones are difficult!
  • We will use the lubridate package (loaded with recent versions of the tidyverse) and nycflights13.
library(tidyverse)
library(nycflights13)
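
The leap-year rule above can be checked with a few lines of base R. This is a minimal sketch; is_leap is a hypothetical helper name introduced here for illustration (lubridate offers leap_year() for the same purpose).

```r
# Leap-year rule: divisible by 4, except century years, unless divisible by 400.
# `is_leap` is a hypothetical helper, not part of any package.
is_leap <- function(year) {
  (year %% 4 == 0 & year %% 100 != 0) | (year %% 400 == 0)
}

is_leap(c(1900, 2000, 2023, 2024))  # FALSE TRUE FALSE TRUE
sum(is_leap(1601:2000))             # 97 leap years in one 400-year cycle
```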

Creating dates and times

Three types of date/time data:

  • A date. Tibbles print this as <date>
  • A time within a day. Tibbles print this as <time>
  • A date-time is a date plus a time. Tibbles print this as <dttm>
    • It uniquely identifies an instant in time (typically to the nearest second).

We will focus on dates and date-times as R doesn’t have a native class for storing times. If you need one, you can use the hms package.

  • Use the simplest possible data type that works for your needs (e.g., use a date instead of a date-time, if you can).

Get current date or date-time:

today()  # get current date
[1] "2025-04-11"
now()  # get current date-time
[1] "2025-04-11 17:43:44 PDT"
tibble(y=today(), x=now())
# A tibble: 1 × 2
  y          x                  
  <date>     <dttm>             
1 2025-04-11 2025-04-11 17:43:44
class(today())
[1] "Date"
class(now())
[1] "POSIXct" "POSIXt" 

Extracting dates and times from strings

To create dates, convert a string using the function whose name lists “y”, “m”, “d” in the same order as the year, month, and day appear in the string

ymd("2017-01-31")
[1] "2017-01-31"
mdy("January 31st, 2017")
[1] "2017-01-31"
mdy("January 31, 2017")
[1] "2017-01-31"
dmy("31-Jan-2017")
[1] "2017-01-31"

To create date-times, add an underscore _ and then one or more of “h”, “m”, “s”.

(ymd_hms("2017-01-31 20:11:59"))
[1] "2017-01-31 20:11:59 UTC"
mdy_hm("01/31/2017 08:01")
[1] "2017-01-31 08:01:00 UTC"
  • Times are assumed to be in the UTC time zone; change this with the tz argument
mdy_hm("01/31/2017 08:01", tz = "EST")
[1] "2017-01-31 08:01:00 EST"
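
The tz argument also works with the date-only parsers. The sketch below assumes only that lubridate is installed; supplying tz to ymd() yields a date-time at midnight instead of a Date.

```r
library(lubridate)

d  <- ymd("2017-01-31")              # no tz: a Date
dt <- ymd("2017-01-31", tz = "UTC")  # tz supplied: a date-time at midnight UTC

class(d)   # "Date"
class(dt)  # "POSIXct" "POSIXt"
```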

Creating date-times from dplyr parts

Remember how flights stored some of the date information:

flights |> select(year, month, day, hour, minute)
# A tibble: 336,776 × 5
    year month   day  hour minute
   <int> <int> <int> <dbl>  <dbl>
 1  2013     1     1     5     15
 2  2013     1     1     5     29
...
  • To create date/time from this, can use make_date() or make_datetime():
flights |>
  mutate(departure = make_datetime(year, month, day, hour, minute), .keep="used")
# A tibble: 336,776 × 6
    year month   day  hour minute departure          
   <int> <int> <int> <dbl>  <dbl> <dttm>             
 1  2013     1     1     5     15 2013-01-01 05:15:00
 2  2013     1     1     5     29 2013-01-01 05:29:00
...

Creating date-times from dplyr parts

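  • We’ll now do a similar computation for the four time columns in flights (dep_time, arr_time, sched_dep_time, sched_arr_time), which store clock times as integers such as 517 for 5:17
  • We’ll do so using a function. We haven’t covered writing functions yet, but we will in a couple of weeks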
make_datetime_100 <- function(year, month, day, time) {
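  # recall: time is an integer like 517 (5:17) or 1929 (19:29); split it with integer arithmetic
  hr <- time %/% 100  # integer division by 100: the hour (all digits before the last two)
  mnt <- time %% 100  # remainder after dividing by 100: the minute (last two digits)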
  make_datetime(year, month, day, hr, mnt)
} 
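
To see what the two operators inside make_datetime_100() do, here is a small base R illustration on made-up clock times (the values are chosen for illustration only):

```r
# Clock times stored as integers: 5:17, 19:29, and 00:03
time <- c(517, 1929, 3)

time %/% 100  # hours:    5 19  0
time %% 100   # minutes: 17 29  3
```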

Creating date-times from dplyr parts

flights_dt <- flights |> 
  filter(!is.na(dep_time), !is.na(arr_time)) |> 
  mutate(
    dep_time = make_datetime_100(year, month, day, dep_time),
    arr_time = make_datetime_100(year, month, day, arr_time),
    sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
    sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
  ) |> 
  select(origin, dest, ends_with("delay"), ends_with("time"))
flights_dt
# A tibble: 328,063 × 9
   origin dest  dep_delay arr_delay dep_time            sched_dep_time     
   <chr>  <chr>     <dbl>     <dbl> <dttm>              <dttm>             
 1 EWR    IAH           2        11 2013-01-01 05:17:00 2013-01-01 05:15:00
 2 LGA    IAH           4        20 2013-01-01 05:33:00 2013-01-01 05:29:00
 3 JFK    MIA           2        33 2013-01-01 05:42:00 2013-01-01 05:40:00
 4 JFK    BQN          -1       -18 2013-01-01 05:44:00 2013-01-01 05:45:00
 5 LGA    ATL          -6       -25 2013-01-01 05:54:00 2013-01-01 06:00:00
 6 EWR    ORD          -4        12 2013-01-01 05:54:00 2013-01-01 05:58:00
 7 EWR    FLL          -5        19 2013-01-01 05:55:00 2013-01-01 06:00:00
 8 LGA    IAD          -3       -14 2013-01-01 05:57:00 2013-01-01 06:00:00
 9 JFK    MCO          -3        -8 2013-01-01 05:57:00 2013-01-01 06:00:00
10 LGA    ORD          -2         8 2013-01-01 05:58:00 2013-01-01 06:00:00
# ℹ 328,053 more rows
# ℹ 3 more variables: arr_time <dttm>, sched_arr_time <dttm>, air_time <dbl>

Updated flights df with times for arrivals/departures

  • We can now filter on these date-time values, e.g., to keep flights that departed before January 2, 2013
flights_dt |>
  filter(dep_time < ymd(20130102))
# A tibble: 837 × 9
   origin dest  dep_delay arr_delay dep_time            sched_dep_time     
   <chr>  <chr>     <dbl>     <dbl> <dttm>              <dttm>             
 1 EWR    IAH           2        11 2013-01-01 05:17:00 2013-01-01 05:15:00
 2 LGA    IAH           4        20 2013-01-01 05:33:00 2013-01-01 05:29:00
 3 JFK    MIA           2        33 2013-01-01 05:42:00 2013-01-01 05:40:00
 4 JFK    BQN          -1       -18 2013-01-01 05:44:00 2013-01-01 05:45:00
 5 LGA    ATL          -6       -25 2013-01-01 05:54:00 2013-01-01 06:00:00
 6 EWR    ORD          -4        12 2013-01-01 05:54:00 2013-01-01 05:58:00
 7 EWR    FLL          -5        19 2013-01-01 05:55:00 2013-01-01 06:00:00
 8 LGA    IAD          -3       -14 2013-01-01 05:57:00 2013-01-01 06:00:00
 9 JFK    MCO          -3        -8 2013-01-01 05:57:00 2013-01-01 06:00:00
10 LGA    ORD          -2         8 2013-01-01 05:58:00 2013-01-01 06:00:00
# ℹ 827 more rows
# ℹ 3 more variables: arr_time <dttm>, sched_arr_time <dttm>, air_time <dbl>

Date-time components: accessor functions

x <- ymd_hms("2026-07-08 12:34:56")

year(), month(), day(), hour(), minute(), second()

year(x)
[1] 2026
month(x)
[1] 7
mday(x)
[1] 8

wday() (day of the week), mday() (day of the month), yday() (day of the year)

wday(x) # 2026-07-08 is Weds. (Sun.=1)
[1] 4
yday(x)
[1] 189

month() and wday() accept label=TRUE, which returns the abbreviated name of the month or day

  • Set abbr=FALSE to get full name
month(x, label=TRUE) 
[1] Jul
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
wday(x, label=TRUE, abbr=FALSE)
[1] Wednesday
7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday

Date-time components: example

You can do things like find the minute of the hour with the highest average departure delay:

flights_dt |>
  mutate(minute = minute(dep_time)) |>
  group_by(minute) |>
  summarize(avg_delay = mean(dep_delay, na.rm=TRUE)) |> 
  arrange(desc(avg_delay))
# A tibble: 60 × 2
   minute avg_delay
    <int>     <dbl>
 1     17      18.6
 2     32      17.8
 3     34      17.8
 4     33      17.7
 5     37      17.5
 6     15      17.2
 7     13      17.1
 8     36      17.1
 9     16      17.1
10     18      17.0
# ℹ 50 more rows

Rounding

Analogues of standard rounding functions for dates:

  • floor_date()
  • ceiling_date()
  • round_date()

Arguments:

  • vector of dates to adjust,
  • name of unit (week, day, etc)
flights_dt |>
  slice_sample(n=9) |>  # selects 9 rows at random
  mutate(month = floor_date(dep_time, "month"), 
         .keep = "used")
# A tibble: 9 × 2
  dep_time            month              
  <dttm>              <dttm>             
1 2013-05-28 14:40:00 2013-05-01 00:00:00
2 2013-11-07 10:37:00 2013-11-01 00:00:00
3 2013-09-02 19:30:00 2013-09-01 00:00:00
4 2013-07-09 23:49:00 2013-07-01 00:00:00
5 2013-03-22 15:27:00 2013-03-01 00:00:00
6 2013-04-17 07:03:00 2013-04-01 00:00:00
7 2013-09-25 11:52:00 2013-09-01 00:00:00
8 2013-11-11 20:25:00 2013-11-01 00:00:00
9 2013-04-18 06:35:00 2013-04-01 00:00:00
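
The three rounding functions can be compared on a single value. This is a small sketch using the same date-time as the accessor examples; it assumes only that lubridate is loaded.

```r
library(lubridate)

x <- ymd_hms("2026-07-08 12:34:56")

floor_date(x, "hour")    # "2026-07-08 12:00:00 UTC"
ceiling_date(x, "hour")  # "2026-07-08 13:00:00 UTC"
round_date(x, "hour")    # "2026-07-08 13:00:00 UTC" (12:34:56 is closer to 13:00)
floor_date(x, "month")   # "2026-07-01 UTC"
```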

Examples

Compute the average departure delay for flights in two groups:

  • flights departing during minutes 20-30 or 50-59 of the hour vs. flights departing at all other minutes
flights_dt |>
  mutate(dep_minute = minute(dep_time),
         mins_2030 = dep_minute >= 20 & dep_minute <= 30,
         mins_5060 = dep_minute >= 50 & dep_minute <= 59,
         mins_2030_or_5060 = mins_2030 | mins_5060) |>
  group_by(mins_2030_or_5060) |>
  summarize(avg_dep_delay = mean(dep_delay, na.rm=TRUE),
            n = n())
# A tibble: 2 × 3
  mins_2030_or_5060 avg_dep_delay      n
  <lgl>                     <dbl>  <int>
1 FALSE                     15.5  181621
2 TRUE                       8.90 146442

Time spans

Date-time arithmetic

Subtracting two dates returns a “difftime” object:

# How old is Hadley?
h_age <- today() - ymd("1979-10-14")
h_age
Time difference of 16616 days
str(h_age)
 'difftime' num 16616
 - attr(*, "units")= chr "days"
  • The unit of a difftime can be seconds, minutes, hours, days, or weeks, depending on the inputs.
  • as.duration() always uses seconds:
as.duration(h_age)
[1] "1435622400s (~45.49 years)"
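
Because the unit of a difftime is chosen automatically, it can help to request a unit explicitly when a plain number is needed. The sketch below fixes the end date at 2025-04-11 (the date shown in the output above) so the numbers are reproducible; the variable name age is introduced here for illustration.

```r
library(lubridate)

# Fixed end date instead of today(), so the result does not change from day to day
age <- ymd("2025-04-11") - ymd("1979-10-14")

as.numeric(age, units = "days")   # 16616
as.numeric(age, units = "weeks")  # about 2373.7
as.numeric(as.duration(age))      # 1435622400 (seconds)
```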

We can think of the difference between two times as a time span.

Time spans

Three important classes representing time spans:

  1. Durations: exact time, measured in seconds
  2. Periods: human units, like weeks/months
  3. Intervals: a time span defined by a start point and an end point
ddays(2) |> class()
[1] "Duration"
attr(,"package")
[1] "lubridate"
days(2) |> class()
[1] "Period"
attr(,"package")
[1] "lubridate"
(ymd("2023-01-01") %--% ymd("2024-01-01")) |> class()
[1] "Interval"
attr(,"package")
[1] "lubridate"

Durations: constructing durations

Durations measure exact time span (in seconds)

  • Useful constructors: d{units}(), where {units} is seconds, minutes, hours, days, weeks, or years
dseconds(15)
[1] "15s"
dminutes(10)
[1] "600s (~10 minutes)"
dhours(c(12, 24))
[1] "43200s (~12 hours)" "86400s (~1 days)"  
  • Cannot convert month to Duration: not well-defined

Arithmetic

  • Can add and multiply Durations:
2 * dyears(1)
[1] "63115200s (~2 years)"
dyears(1) + dweeks(12) + dhours(5)
[1] "38833200s (~1.23 years)"
  • Can add Durations to, and subtract them from, dates
today() + ddays(1) # returns Date
[1] "2025-04-12"
today() - dyears(1) # returns date-time
[1] "2024-04-10 18:00:00 UTC"

Durations: computations and weird results

  • Because durations represent an exact number of seconds, results can look odd when a time zone changes its clock (e.g., daylight saving time)
one_am <- ymd_hms("2026-03-08 01:00:00", tz = "America/New_York")
one_am
[1] "2026-03-08 01:00:00 EST"
one_am + ddays(1)  # the day advances, but the hour also advances
[1] "2026-03-09 02:00:00 EDT"
  • March 8, 2026 has only 23 hours in New York because of the change from EST to EDT, so adding a full day of seconds (86,400) lands one clock hour later on the next day.
  • The lubridate package provides periods to address this
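
One way to confirm the missing hour is to measure the length of that day directly. This is a sketch under the assumption that the system time zone database includes America/New_York; the names start and end are introduced here for illustration.

```r
library(lubridate)

# Midnight to midnight in New York across the 2026 spring-forward date
start <- ymd_hms("2026-03-08 00:00:00", tz = "America/New_York")
end   <- ymd_hms("2026-03-09 00:00:00", tz = "America/New_York")

(start %--% end) / dhours(1)  # 23: the day is one hour short
```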

Periods

Periods are time spans but don’t have fixed length in seconds

  • Work like “human” times, i.e. days/months
one_am
[1] "2026-03-08 01:00:00 EST"
# days() returns a Period object
one_am + days(1)  
[1] "2026-03-09 01:00:00 EDT"
# ddays() returns a Duration object
one_am + ddays(1)  
[1] "2026-03-09 02:00:00 EDT"
  • Useful constructors and behavior under +/*:
hours(c(12, 24))
[1] "12H 0M 0S" "24H 0M 0S"
days(7)
[1] "7d 0H 0M 0S"
months(2:3)
[1] "2m 0d 0H 0M 0S" "3m 0d 0H 0M 0S"
months(6) + days(2)
[1] "6m 2d 0H 0M 0S"
10 * (months(6) + days(2))
[1] "60m 20d 0H 0M 0S"

Periods: vs Durations

Adding periods can be a bit more in line with expectations

  • Example: a leap year
ymd("2024-01-01") + dyears(1)  # date + Duration
[1] "2024-12-31 06:00:00 UTC"
ymd("2024-01-01") + years(1)  # date + Period
[1] "2025-01-01"
  • Example: Daylight saving time
one_am + ddays(1)  # date-time + Duration
[1] "2026-03-09 02:00:00 EDT"
one_am + days(1)  # date-time + Period
[1] "2026-03-09 01:00:00 EDT"

Fixing a bug in flights_dt

Some flights arrived before they departed?!

flights_dt |> filter(arr_time < dep_time) |> select(origin, dest, arr_time, dep_time)
# A tibble: 10,633 × 4
   origin dest  arr_time            dep_time           
   <chr>  <chr> <dttm>              <dttm>             
 1 EWR    BQN   2013-01-01 00:03:00 2013-01-01 19:29:00
 2 JFK    DFW   2013-01-01 00:29:00 2013-01-01 19:39:00
 3 EWR    TPA   2013-01-01 00:08:00 2013-01-01 20:58:00
...

These are overnight flights: flights arrived on the following day.

flights |> filter(arr_time < dep_time) |> select(origin, dest, arr_time, dep_time)
# A tibble: 10,633 × 4
   origin dest  arr_time dep_time
   <chr>  <chr>    <int>    <int>
 1 EWR    BQN          3     1929
 2 JFK    DFW         29     1939
 3 EWR    TPA          8     2058
...

Fixing a bug in flights_dt

We can fix this by adding a day to the arrival time of each overnight flight.

  • days(TRUE) gets coerced to days(1)
  • days(FALSE) gets coerced to days(0)
flights_dt_2 <- flights_dt |> 
  mutate(
    overnight = arr_time < dep_time,  # returns T/F
    arr_time = arr_time + days(overnight),  # coercion
    sched_arr_time = sched_arr_time + days(overnight)
  ) 

Now flights_dt_2 has no flight that appears to arrive before it departed:

flights_dt_2 |> filter(arr_time < dep_time)
# A tibble: 0 × 10
# ℹ 10 variables: origin <chr>, dest <chr>, dep_delay <dbl>, arr_delay <dbl>,
#   dep_time <dttm>, sched_dep_time <dttm>, arr_time <dttm>,
#   sched_arr_time <dttm>, air_time <dbl>, overnight <lgl>
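
The logical-to-number coercion in days(overnight) can be seen on a two-row toy example. The flights below are made up for illustration; only lubridate is assumed.

```r
library(lubridate)

# Two made-up flights: the first lands just after midnight, the second on the same day
dep <- ymd_hm(c("2013-01-01 19:29", "2013-01-01 08:00"))
arr <- ymd_hm(c("2013-01-01 00:03", "2013-01-01 10:30"))

overnight <- arr < dep              # TRUE FALSE
arr_fixed <- arr + days(overnight)  # adds 1 day where TRUE, 0 days where FALSE

arr_fixed > dep                     # TRUE TRUE
```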

Intervals

Motivating calculations

  • dyears(1) / ddays(365) does not return 1, since dyears() is defined as the number of seconds per average year: 365.25 days.
  • years(1) / days(1) does not return 365 either: the true answer depends on whether the year is a leap year, so lubridate can only return an estimate.
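
Both calculations can be run directly; the values in the comments follow from the 365.25-day average year.

```r
library(lubridate)

dyears(1) / ddays(365)  # 1.000685, i.e. 365.25 / 365
years(1) / days(1)      # 365.25, an estimate rather than an exact count
```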

Intervals allow for defining a time interval between two specific date-times.

  • Format: start %--% end:
(y2023 <- ymd("2023-01-01") %--% ymd("2024-01-01"))
[1] 2023-01-01 UTC--2024-01-01 UTC
(y2024 <- ymd("2024-01-01") %--% ymd("2025-01-01"))
[1] 2024-01-01 UTC--2025-01-01 UTC

Divide an interval by days(1) to find out how many days a specific year has:

y2023 / days(1)
[1] 365
y2024 / days(1)
[1] 366

Putting it all together

Example: extracting dates and computing durations

df
# A tibble: 3 × 2
  name  entry                                                  
  <chr> <chr>                                                  
1 Ant   First arrival: Jan 2, 2005; Second arrival: Jan 6, 2023
2 Bug   First time: Jan 5, 1997; Second time: Jan 8, 2015      
3 Cat   First visit: Jan 4, 1990; Second visit: Jan 9, 2008    
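
The slides do not show how df was created. One way to build an equivalent tibble, assuming the two columns are plain character vectors as the printout suggests, is:

```r
library(tibble)

# Reconstruction of the example data from the printed output above
df <- tibble(
  name  = c("Ant", "Bug", "Cat"),
  entry = c(
    "First arrival: Jan 2, 2005; Second arrival: Jan 6, 2023",
    "First time: Jan 5, 1997; Second time: Jan 8, 2015",
    "First visit: Jan 4, 1990; Second visit: Jan 9, 2008"
  )
)
```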

Task: using df, return a tibble that says how many days elapsed between the first and second visit.

Complex task! We need to:

  1. Parse the “entry” column to extract the dates
  2. Turn them into proper dates
  3. Compute the number of days between visits (leap years happen at different times!)

Let’s start by parsing “entry” and creating two columns for different dates

Example: extracting dates and computing durations

df |>  # step 1: separate two strings
  separate_wider_delim(entry, delim="; ", names=c("d1", "d2"))
# A tibble: 3 × 3
  name  d1                         d2                         
  <chr> <chr>                      <chr>                      
1 Ant   First arrival: Jan 2, 2005 Second arrival: Jan 6, 2023
2 Bug   First time: Jan 5, 1997    Second time: Jan 8, 2015   
3 Cat   First visit: Jan 4, 1990   Second visit: Jan 9, 2008  

Example: extracting dates and computing durations

df |>  # step 2: get date from "d1"
  separate_wider_delim(entry, delim="; ", names=c("d1", "d2")) |> 
  separate_wider_delim(d1, delim=": ", names=c(NA, "date1"))  
# A tibble: 3 × 3
  name  date1       d2                         
  <chr> <chr>       <chr>                      
1 Ant   Jan 2, 2005 Second arrival: Jan 6, 2023
2 Bug   Jan 5, 1997 Second time: Jan 8, 2015   
3 Cat   Jan 4, 1990 Second visit: Jan 9, 2008  

Example: extracting dates and computing durations

df |>  # step 3: get date from "d2"
  separate_wider_delim(entry, delim="; ", names=c("d1", "d2")) |> 
  separate_wider_delim(d1, delim=": ", names=c(NA, "date1")) |> 
  separate_wider_delim(d2, delim=": ", names=c(NA, "date2"))
# A tibble: 3 × 3
  name  date1       date2      
  <chr> <chr>       <chr>      
1 Ant   Jan 2, 2005 Jan 6, 2023
2 Bug   Jan 5, 1997 Jan 8, 2015
3 Cat   Jan 4, 1990 Jan 9, 2008

Example: extracting dates and computing durations

df |>  # step 4: convert "date1" and "date2" into Dates
  separate_wider_delim(entry, delim="; ", names=c("d1", "d2")) |> 
  separate_wider_delim(d1, delim=": ", names=c(NA, "date1")) |> 
  separate_wider_delim(d2, delim=": ", names=c(NA, "date2")) |> 
  mutate(date1 = mdy(date1), date2 = mdy(date2))
# A tibble: 3 × 3
  name  date1      date2     
  <chr> <date>     <date>    
1 Ant   2005-01-02 2023-01-06
2 Bug   1997-01-05 2015-01-08
3 Cat   1990-01-04 2008-01-09

Example: extracting dates and computing durations

df |>  # step 5: compute number of days between visits
  separate_wider_delim(entry, delim="; ", names=c("d1", "d2")) |> 
  separate_wider_delim(d1, delim=": ", names=c(NA, "date1")) |> 
  separate_wider_delim(d2, delim=": ", names=c(NA, "date2")) |> 
  mutate(date1 = mdy(date1), date2 = mdy(date2), 
         days_elapsed = days(date2-date1))
# A tibble: 3 × 4
  name  date1      date2      days_elapsed  
  <chr> <date>     <date>     <Period>      
1 Ant   2005-01-02 2023-01-06 6578d 0H 0M 0S
2 Bug   1997-01-05 2015-01-08 6577d 0H 0M 0S
3 Cat   1990-01-04 2008-01-09 6579d 0H 0M 0S
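
If a plain number of days is preferred over a Period column, two alternatives are sketched below on the first row's dates. Both are offered as options, not as the method used in the slides.

```r
library(lubridate)

date1 <- mdy("Jan 2, 2005")
date2 <- mdy("Jan 6, 2023")

as.numeric(date2 - date1, units = "days")  # 6578, from the difftime
(date1 %--% date2) / days(1)               # 6578, from an interval
```

Either expression could replace days(date2-date1) inside mutate() to produce a <dbl> column.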
