STA35B: Statistical Data Science 2
Material from Chapter 17 of R4DS2
Many things complicate working with dates and times
Not all years have 365 days
A year is a leap year if it’s divisible by 4, unless it’s also divisible by 100, except if it’s also divisible by 400. In other words, in every set of 400 years, there’s 97 leap years.
Three types of date/time data:
<date>
<time>
<dttm>
We will focus on dates and date-times as R doesn’t have a native class for storing times. If you need one, you can use the hms package.
To create dates, convert a string using functions whose names are three letter combos of “y”, “m”, “d”
To create date-times, add an underscore _
and then one or more of “h”, “m”, “s”.
[1] "2017-01-31 20:11:59 UTC"
[1] "2017-01-31 08:01:00 UTC"
tz=
Remember how flights stored some of the date information:
# A tibble: 336,776 × 5
year month day hour minute
<int> <int> <int> <dbl> <dbl>
1 2013 1 1 5 15
2 2013 1 1 5 29
...
make_date()
or make_datetime()
:flights_dt <- flights |>
filter(!is.na(dep_time), !is.na(arr_time)) |>
mutate(
dep_time = make_datetime_100(year, month, day, dep_time),
arr_time = make_datetime_100(year, month, day, arr_time),
sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
) |>
select(origin, dest, ends_with("delay"), ends_with("time"))
flights_dt
# A tibble: 328,063 × 9
origin dest dep_delay arr_delay dep_time sched_dep_time
<chr> <chr> <dbl> <dbl> <dttm> <dttm>
1 EWR IAH 2 11 2013-01-01 05:17:00 2013-01-01 05:15:00
2 LGA IAH 4 20 2013-01-01 05:33:00 2013-01-01 05:29:00
3 JFK MIA 2 33 2013-01-01 05:42:00 2013-01-01 05:40:00
4 JFK BQN -1 -18 2013-01-01 05:44:00 2013-01-01 05:45:00
5 LGA ATL -6 -25 2013-01-01 05:54:00 2013-01-01 06:00:00
6 EWR ORD -4 12 2013-01-01 05:54:00 2013-01-01 05:58:00
7 EWR FLL -5 19 2013-01-01 05:55:00 2013-01-01 06:00:00
8 LGA IAD -3 -14 2013-01-01 05:57:00 2013-01-01 06:00:00
9 JFK MCO -3 -8 2013-01-01 05:57:00 2013-01-01 06:00:00
10 LGA ORD -2 8 2013-01-01 05:58:00 2013-01-01 06:00:00
# ℹ 328,053 more rows
# ℹ 3 more variables: arr_time <dttm>, sched_arr_time <dttm>, air_time <dbl>
# A tibble: 837 × 9
origin dest dep_delay arr_delay dep_time sched_dep_time
<chr> <chr> <dbl> <dbl> <dttm> <dttm>
1 EWR IAH 2 11 2013-01-01 05:17:00 2013-01-01 05:15:00
2 LGA IAH 4 20 2013-01-01 05:33:00 2013-01-01 05:29:00
3 JFK MIA 2 33 2013-01-01 05:42:00 2013-01-01 05:40:00
4 JFK BQN -1 -18 2013-01-01 05:44:00 2013-01-01 05:45:00
5 LGA ATL -6 -25 2013-01-01 05:54:00 2013-01-01 06:00:00
6 EWR ORD -4 12 2013-01-01 05:54:00 2013-01-01 05:58:00
7 EWR FLL -5 19 2013-01-01 05:55:00 2013-01-01 06:00:00
8 LGA IAD -3 -14 2013-01-01 05:57:00 2013-01-01 06:00:00
9 JFK MCO -3 -8 2013-01-01 05:57:00 2013-01-01 06:00:00
10 LGA ORD -2 8 2013-01-01 05:58:00 2013-01-01 06:00:00
# ℹ 827 more rows
# ℹ 3 more variables: arr_time <dttm>, sched_arr_time <dttm>, air_time <dbl>
wday()
(day of the week), mday()
(day of the month), yday()
(day of the year),
month()
and wday()
can have label=TRUE
, returns abbreviated name of month/day
abbr=FALSE
to get full nameYou can do things like calculate the minute with the highest departure delays:
flights_dt |>
mutate(minute = minute(dep_time)) |>
group_by(minute) |>
summarize(avg_delay = mean(dep_delay, na.rm=TRUE)) |>
arrange(by = desc(avg_delay))
# A tibble: 60 × 2
minute avg_delay
<int> <dbl>
1 17 18.6
2 32 17.8
3 34 17.8
4 33 17.7
5 37 17.5
6 15 17.2
7 13 17.1
8 36 17.1
9 16 17.1
10 18 17.0
# ℹ 50 more rows
Analogues of standard rounding functions for dates:
floor_date()
ceiling_date()
round_date()
Arguments:
flights_dt |>
slice_sample(n=9) |> # selects 9 rows at random
mutate(year = floor_date(dep_time, "month"),
.keep = "used")
# A tibble: 9 × 2
dep_time year
<dttm> <dttm>
1 2013-05-28 14:40:00 2013-05-01 00:00:00
2 2013-11-07 10:37:00 2013-11-01 00:00:00
3 2013-09-02 19:30:00 2013-09-01 00:00:00
4 2013-07-09 23:49:00 2013-07-01 00:00:00
5 2013-03-22 15:27:00 2013-03-01 00:00:00
6 2013-04-17 07:03:00 2013-04-01 00:00:00
7 2013-09-25 11:52:00 2013-09-01 00:00:00
8 2013-11-11 20:25:00 2013-11-01 00:00:00
9 2013-04-18 06:35:00 2013-04-01 00:00:00
Compute the average delay time of flights that depart at times in two groups:
flights_dt |>
mutate(dep_minute = minute(dep_time),
mins_2030 = dep_minute >= 20 & dep_minute <= 30,
mins_5060 = dep_minute >= 50 & dep_minute <= 59,
mins_2030_or_5060 = mins_2030 | mins_5060) |>
group_by(mins_2030_or_5060) |>
summarize(avg_dep_delay = mean(dep_delay, na.rm=TRUE),
n = n())
# A tibble: 2 × 3
mins_2030_or_5060 avg_dep_delay n
<lgl> <dbl> <int>
1 FALSE 15.5 181621
2 TRUE 8.90 146442
Subtracting two dates returns a “difftime” object:
Time difference of 16616 days
'difftime' num 16616
- attr(*, "units")= chr "days"
as.duration()
always uses seconds:Can think of difference between two times as a time span
Three important classes representing time spans:
Durations measure exact time span (in seconds)
d{units}
, {units}
is seconds
, days
, etc[1] "15s"
[1] "600s (~10 minutes)"
[1] "43200s (~12 hours)" "86400s (~1 days)"
Arithmetic
[1] "63115200s (~2 years)"
[1] "38833200s (~1.23 years)"
[1] "2026-03-08 01:00:00 EST"
[1] "2026-03-09 02:00:00 EDT"
Periods are time spans but don’t have fixed length in seconds
Adding periods can be a bit more in line with expectations
[1] "2024-12-31 06:00:00 UTC"
[1] "2025-01-01"
flights_dt
Some flights arrived before they departed?!
# A tibble: 10,633 × 4
origin dest arr_time dep_time
<chr> <chr> <dttm> <dttm>
1 EWR BQN 2013-01-01 00:03:00 2013-01-01 19:29:00
2 JFK DFW 2013-01-01 00:29:00 2013-01-01 19:39:00
3 EWR TPA 2013-01-01 00:08:00 2013-01-01 20:58:00
...
These are overnight flights: flights arrived on the following day.
flights_dt
We can fix this by adding a day to the arrival time of each overnight flight.
days(TRUE)
gets coerced to days(1)
days(FALSE)
gets coerced to days(0)
Now flights_dt_2
has no flight that appears to arrive before it departed:
Motivating calculations
dyears(1) / ddays(365)
does not return 1, since dyears()
is defined as the number of seconds per average year: 365.25 days.years(1) / days(1)
does not return 365, since in leap years this isn’t true.Intervals allow for defining a time interval between two specific date-times.
start %--% end
:# A tibble: 3 × 2
name entry
<chr> <chr>
1 Ant First arrival: Jan 2, 2005; Second arrival: Jan 6, 2023
2 Bug First time: Jan 5, 1997; Second time: Jan 8, 2015
3 Cat First visit: Jan 4, 1990; Second visit: Jan 9, 2008
Task: using df
, return a tibble that says how many days elapsed between the first and second visit.
Complex task! We need to:
Let’s start by parsing “entry” and creating two columns for different dates
# A tibble: 3 × 2
name entry
<chr> <chr>
1 Ant First arrival: Jan 2, 2005; Second arrival: Jan 6, 2023
2 Bug First time: Jan 5, 1997; Second time: Jan 8, 2015
3 Cat First visit: Jan 4, 1990; Second visit: Jan 9, 2008
# A tibble: 3 × 3
name d1 d2
<chr> <chr> <chr>
1 Ant First arrival: Jan 2, 2005 Second arrival: Jan 6, 2023
2 Bug First time: Jan 5, 1997 Second time: Jan 8, 2015
3 Cat First visit: Jan 4, 1990 Second visit: Jan 9, 2008
# A tibble: 3 × 2
name entry
<chr> <chr>
1 Ant First arrival: Jan 2, 2005; Second arrival: Jan 6, 2023
2 Bug First time: Jan 5, 1997; Second time: Jan 8, 2015
3 Cat First visit: Jan 4, 1990; Second visit: Jan 9, 2008
df |> # step 2: get date from "d1"
separate_wider_delim(entry, delim="; ", names=c("d1", "d2")) |>
separate_wider_delim(d1, delim=": ", names=c(NA, "date1"))
# A tibble: 3 × 3
name date1 d2
<chr> <chr> <chr>
1 Ant Jan 2, 2005 Second arrival: Jan 6, 2023
2 Bug Jan 5, 1997 Second time: Jan 8, 2015
3 Cat Jan 4, 1990 Second visit: Jan 9, 2008
# A tibble: 3 × 2
name entry
<chr> <chr>
1 Ant First arrival: Jan 2, 2005; Second arrival: Jan 6, 2023
2 Bug First time: Jan 5, 1997; Second time: Jan 8, 2015
3 Cat First visit: Jan 4, 1990; Second visit: Jan 9, 2008
df |> # step 3: get date from "d2"
separate_wider_delim(entry, delim="; ", names=c("d1", "d2")) |>
separate_wider_delim(d1, delim=": ", names=c(NA, "date1")) |>
separate_wider_delim(d2, delim=": ", names=c(NA, "date2"))
# A tibble: 3 × 3
name date1 date2
<chr> <chr> <chr>
1 Ant Jan 2, 2005 Jan 6, 2023
2 Bug Jan 5, 1997 Jan 8, 2015
3 Cat Jan 4, 1990 Jan 9, 2008
# A tibble: 3 × 2
name entry
<chr> <chr>
1 Ant First arrival: Jan 2, 2005; Second arrival: Jan 6, 2023
2 Bug First time: Jan 5, 1997; Second time: Jan 8, 2015
3 Cat First visit: Jan 4, 1990; Second visit: Jan 9, 2008
df |> # step 4: convert "date1" and "date2" into Dates
separate_wider_delim(entry, delim="; ", names=c("d1", "d2")) |>
separate_wider_delim(d1, delim=": ", names=c(NA, "date1")) |>
separate_wider_delim(d2, delim=": ", names=c(NA, "date2")) |>
mutate(date1 = mdy(date1), date2 = mdy(date2))
# A tibble: 3 × 3
name date1 date2
<chr> <date> <date>
1 Ant 2005-01-02 2023-01-06
2 Bug 1997-01-05 2015-01-08
3 Cat 1990-01-04 2008-01-09
# A tibble: 3 × 2
name entry
<chr> <chr>
1 Ant First arrival: Jan 2, 2005; Second arrival: Jan 6, 2023
2 Bug First time: Jan 5, 1997; Second time: Jan 8, 2015
3 Cat First visit: Jan 4, 1990; Second visit: Jan 9, 2008
df |> # step 5: compute number of days between visits
separate_wider_delim(entry, delim="; ", names=c("d1", "d2")) |>
separate_wider_delim(d1, delim=": ", names=c(NA, "date1")) |>
separate_wider_delim(d2, delim=": ", names=c(NA, "date2")) |>
mutate(date1 = mdy(date1), date2 = mdy(date2),
days_elapsed = days(date2-date1))
# A tibble: 3 × 4
name date1 date2 days_elapsed
<chr> <date> <date> <Period>
1 Ant 2005-01-02 2023-01-06 6578d 0H 0M 0S
2 Bug 1997-01-05 2015-01-08 6577d 0H 0M 0S
3 Cat 1990-01-04 2008-01-09 6579d 0H 0M 0S
–> –>
–> –>