STA141A: Fundamentals of Statistical Data Science
Previous material you will likely use for any data science project you work on
For logical vectors, every element takes one of 3 values: TRUE, FALSE, NA
We’ll investigate how to manipulate and transform data to get logicals, and how to use logicals.
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
...
- AND (& in R): operation between two logicals
- OR (| in R): operation between two logicals
- NOT (! in R): operation on a single logical

Truth table for AND:
| A | B | A & B |
|---|---|---|
| TRUE | TRUE | TRUE |
| TRUE | FALSE | FALSE |
| FALSE | TRUE | FALSE |
| FALSE | FALSE | FALSE |
Truth table for OR:
| A | B | A \| B |
|---|---|---|
| TRUE | TRUE | TRUE |
| TRUE | FALSE | TRUE |
| FALSE | TRUE | TRUE |
| FALSE | FALSE | FALSE |
Truth table for NOT:
| A | ! A |
|---|---|
| TRUE | FALSE |
| FALSE | TRUE |
Can combine AND/OR with NOT to cover any binary Boolean operation
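As a small illustration of combining these operators, exclusive OR (XOR) can be written using only &, |, and !. This is a sketch; `xor_manual` is a hypothetical name, and base R already provides `xor()` for the same purpose.

```r
# Exclusive OR built from AND, OR, and NOT
# (xor_manual is a hypothetical helper; base R provides xor())
xor_manual <- function(a, b) (a | b) & !(a & b)

a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

xor_manual(a, b)                         # FALSE TRUE TRUE FALSE
identical(xor_manual(a, b), xor(a, b))   # TRUE
```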
Common way to create a logical vector: numeric comparison with <, !=, etc.
We have implicitly been using this when doing filtering.
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
...
Combining two logical vectors with & or | works pairwise, element by element.
So when we use multiple comparisons in filter(), we are building a new vector of logicals.
We only keep those rows where the vector is TRUE.
# A tibble: 172,286 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 601 600 1 844 850
2 2013 1 1 602 610 -8 812 820
3 2013 1 1 602 605 -3 821 805
...
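The filtering logic described above can be sketched on a toy table. The tibble `toy` and its values are hypothetical stand-ins for a few rows of flights, not the real data.

```r
library(dplyr)

# Hypothetical toy data standing in for a few rows of flights
toy <- tibble(
  dep_time  = c(517, 601, 1530, 2230, NA),
  arr_delay = c(11, -6, 45, 3, NA)
)

# Each comparison yields a logical vector; & combines them pairwise
keep <- toy$dep_time > 600 & toy$dep_time < 2000 & abs(toy$arr_delay) < 20
keep   # FALSE TRUE FALSE FALSE NA

# filter() keeps rows where the condition is TRUE and drops FALSE and NA rows
kept_rows <- toy |>
  filter(dep_time > 600 & dep_time < 2000 & abs(arr_delay) < 20)
kept_rows
```

Only the second row survives, since it is the only one where every comparison is TRUE.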
filter() and mutate() can be used in conjunction
flights |>
  mutate(
    daytime = dep_time > 600 & dep_time < 2000,
    approx_ontime = abs(arr_delay) < 20,
  ) |>
  filter(daytime & approx_ontime)
# A tibble: 172,286 × 21
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 601 600 1 844 850
2 2013 1 1 602 610 -8 812 820
3 2013 1 1 602 605 -3 821 805
...
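Testing equality with == on floating-point numbers can cause problems. Doubles are stored with finite “precision”: 64 bits in total, which gives roughly 15 to 17 significant decimal digits (a relative precision of about 2^{-52}), so many values are only approximated.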
What’s going on? Let’s look at more precise representation in R.
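A minimal sketch of the problem, using two standard examples (the vector `x` is chosen here for illustration). `near()` from dplyr compares within a small tolerance instead of exactly.

```r
library(dplyr)

x <- c(1 / 49 * 49, sqrt(2)^2)
x                      # prints as 1 2 at the default number of digits
x == c(1, 2)           # exact comparison: FALSE FALSE
print(x, digits = 16)  # more digits reveal the rounding error
near(x, c(1, 2))       # comparison within a tolerance: TRUE TRUE
```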
A useful function for dealing with NA: is.na()
is.na(x) works with any type of vector and returns TRUE for missing values and FALSE for everything else:
Since is.na() returns logicals, it can be used in filter():
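A short sketch of both uses. The tibble `toy` is hypothetical; with the real data the same pattern would be applied to flights.

```r
library(dplyr)

# is.na() works on logical, numeric, and character vectors alike
is.na(c(TRUE, NA, FALSE))   # FALSE TRUE FALSE
is.na(c(1, NA, 3))          # FALSE TRUE FALSE
is.na(c("a", NA, "b"))      # FALSE TRUE FALSE

# Hypothetical toy data: keep only the rows with a missing dep_time
toy <- tibble(flight = c(1, 2, 3), dep_time = c(517, NA, 533))
missing_rows <- toy |> filter(is.na(dep_time))
missing_rows
```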
We can use it to help identify where NAs come from, e.g., why are there NAs in air_time?
dep_time, dep_delay, and sched_dep_time are related.
flights |>
  mutate(missing_dep_time = is.na(dep_time),
         missing_dep_delay = is.na(dep_delay),
         missing_sched_dep_time = is.na(sched_dep_time)) |>
  count(missing_dep_time, missing_dep_delay, missing_sched_dep_time)
# A tibble: 2 × 4
missing_dep_time missing_dep_delay missing_sched_dep_time n
<lgl> <lgl> <lgl> <int>
1 FALSE FALSE FALSE 328521
2 TRUE TRUE FALSE 8255
- All rows where dep_delay is missing also have dep_time missing.
- Is it the case that dep_delay = dep_time - sched_dep_time?
flights |>
  mutate(dep_delay_manual = dep_time - sched_dep_time,
         manual_matches_given = near(dep_delay_manual, dep_delay)) |>
  count(manual_matches_given)
# A tibble: 3 × 2
manual_matches_given n
<lgl> <int>
1 FALSE 99777
2 TRUE 228744
3 NA 8255
Quite weird, since we are getting a lot right but also getting a lot wrong.
Let’s inspect further. What do the mismatched observations look like?
flights |>
  mutate(manual_delay = dep_time - sched_dep_time,
         manual_matches_given = near(manual_delay, dep_delay)) |>
  filter(!manual_matches_given) |>
  select(time_hour, flight, dep_time, sched_dep_time, dep_delay, manual_delay)
# A tibble: 99,777 × 6
time_hour flight dep_time sched_dep_time dep_delay manual_delay
<dttm> <int> <int> <int> <dbl> <int>
1 2013-01-01 06:00:00 461 554 600 -6 -46
2 2013-01-01 06:00:00 507 555 600 -5 -45
3 2013-01-01 06:00:00 5708 557 600 -3 -43
...
Problem: R is treating dep_time and sched_dep_time as integers, not time!
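One possible fix, sketched under the assumption that times are encoded as HHMM integers (so 554 means 5:54): convert each time to minutes since midnight before subtracting. `hhmm_to_minutes` is a hypothetical helper, not part of any package.

```r
# hhmm_to_minutes is a hypothetical helper:
# 554 means 5:54, i.e. 5 * 60 + 54 = 354 minutes since midnight
hhmm_to_minutes <- function(hhmm) (hhmm %/% 100) * 60 + hhmm %% 100

# Values taken from the three mismatched rows shown above
dep_time       <- c(554, 555, 557)
sched_dep_time <- c(600, 600, 600)

dep_time - sched_dep_time                                    # naive: -46 -45 -43
hhmm_to_minutes(dep_time) - hhmm_to_minutes(sched_dep_time)  # -6 -5 -3
```

The converted differences agree with the dep_delay values in those rows. This sketch does not handle departures that cross midnight.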
Logical and missing values interact in logical ways, but this requires some thought.
Think of NA as an “unknown” logical value.
Does not depend on value of NA:
- NA OR TRUE will return TRUE.
- NA AND FALSE will return FALSE.
Depends on value of NA:
- NA AND TRUE will return NA.
- NA OR FALSE will return NA.
# A tibble: 55,403 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 11 1 5 2359 6 352 345
2 2013 11 1 35 2250 105 123 2356
...
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
...
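The NA rules can be checked directly in base R. The second half of this sketch shows a common order-of-operations pitfall; the two row counts above (55,403 rows versus all 336,776 rows) are consistent with it, although the code that produced them is not shown here, so treat that link as an assumption.

```r
NA | TRUE    # TRUE: the result is TRUE whatever the unknown value is
NA & FALSE   # FALSE: the result is FALSE whatever the unknown value is
NA & TRUE    # NA: depends on the unknown value
NA | FALSE   # NA: depends on the unknown value

# Pitfall: `month == 11 | 12` is parsed as `(month == 11) | 12`,
# and the number 12 is coerced to TRUE
month <- c(1, 11, 12)
month == 11 | month == 12   # FALSE TRUE TRUE
month == 11 | 12            # TRUE TRUE TRUE
```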
%in%
Instead of worrying about getting | and == in the right order, just use %in%.
So to find all flights from November and December:
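A minimal sketch on a toy vector (`month` here is a hypothetical stand-in for the flights column):

```r
month <- c(1, 11, 12, NA)

month %in% c(11, 12)        # FALSE TRUE TRUE FALSE
month == 11 | month == 12   # FALSE TRUE TRUE NA

# With the flights data, the same idea would read:
# flights |> filter(month %in% c(11, 12))
```

Note that %in% returns FALSE for the missing month, whereas the == version returns NA.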
%in% obeys different rules for NA than == does: NA %in% NA is TRUE, whereas NA == NA is NA.
Two main functions for logical summaries:
- any(x) returns TRUE if any value in x is TRUE
- all(x) returns TRUE only if all values in x are TRUE
For instance, was there a day where every flight was delayed on departure by at most an hour? Or a day where there were any flights delayed on arrival by \(\geq 5\) hours?
flights |>
  group_by(year, month, day) |>
  summarize(
    all_delayed = all(dep_delay <= 60, na.rm=TRUE),
    any_long_delay = any(arr_delay >= 300, na.rm=TRUE)
  )
# A tibble: 365 × 5
# Groups: year, month [12]
year month day all_delayed any_long_delay
<int> <int> <int> <lgl> <lgl>
1 2013 1 1 FALSE TRUE
2 2013 1 2 FALSE TRUE
3 2013 1 3 FALSE FALSE
4 2013 1 4 FALSE FALSE
...
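The behaviour of any() and all(), including the effect of na.rm, can be seen on a short vector (the values in `dep_delay` are made up for illustration):

```r
dep_delay <- c(5, 70, NA, -3)

all(dep_delay <= 60)                # FALSE: one value is known to exceed 60
any(dep_delay > 60)                 # TRUE: one value is known to exceed 60
all(c(5, NA) <= 60)                 # NA: the answer depends on the missing value
all(c(5, NA) <= 60, na.rm = TRUE)   # TRUE: the missing value is ignored
```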
mean(), sum()
flights |>
  group_by(year, month, day) |>
  summarise(
    prop_delayed_1hour = mean(dep_delay > 60, na.rm=TRUE),
    num_long_delay = sum(arr_delay > 300, na.rm=TRUE)
  )
# A tibble: 365 × 5
# Groups: year, month [12]
year month day prop_delayed_1hour num_long_delay
<int> <int> <int> <dbl> <int>
1 2013 1 1 0.0609 3
2 2013 1 2 0.0856 3
3 2013 1 3 0.0586 0
...
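Note the output line # Groups: year, month [12]: by default summarise() drops only the last grouping variable (day), so the result is still grouped by year and month. Setting .groups = 'drop' returns an ungrouped tibble: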
flights |>
  group_by(year, month, day) |>
  summarise(
    prop_delayed_1hour = mean(dep_delay > 60, na.rm=TRUE),
    num_long_delay = sum(arr_delay > 300, na.rm=TRUE),
    .groups = 'drop'
  )
# A tibble: 365 × 5
year month day prop_delayed_1hour num_long_delay
<int> <int> <int> <dbl> <int>
1 2013 1 1 0.0609 3
2 2013 1 2 0.0856 3
3 2013 1 3 0.0586 0
...
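The summaries above work because TRUE counts as 1 and FALSE as 0 in arithmetic, so sum() gives the number of TRUEs and mean() gives their proportion. A small sketch with a made-up vector:

```r
delayed <- c(TRUE, FALSE, TRUE, NA, FALSE)

sum(delayed, na.rm = TRUE)    # 2: number of TRUE values
mean(delayed, na.rm = TRUE)   # 0.5: proportion of TRUE among non-missing values
```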
if_else()
if_else(CONDITION, TRUE_VAL, FALSE_VAL, MISSING_VAL) is useful when we want to return one value when the condition is TRUE and another value when the condition is FALSE.
The optional fourth argument of if_else() specifies the value to use when the condition is NA:
[1] "boo" "boo"
[3] "yay" "yay"
[5] "idk how i feel about x yet"
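One call that reproduces output of this shape is sketched below. The input vector `x` and the condition are assumptions, since the original call is not shown.

```r
library(dplyr)

x <- c(-2, -1, 1, 2, NA)   # hypothetical input

# condition, value if TRUE, value if FALSE, value if the condition is NA
labels <- if_else(x > 0, "yay", "boo", "idk how i feel about x yet")
labels
```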
We can also use vectors as arguments.
if_else()
We can use general vectors inside of if_else():
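For instance, passing vectors as the TRUE and FALSE values gives an element-wise absolute value. This is a sketch with a made-up `x`; in practice abs() does the same job.

```r
library(dplyr)

x <- c(-3, -1, 0, 2, NA)

# Where x < 0 take the element from -x, otherwise the element from x
abs_x <- if_else(x < 0, -x, x)
abs_x   # 3 1 0 2 NA
```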
case_when()
If you have many different conditions for which you want to specify values, e.g.
- a and b then do…
- b and c then do…
- c and d then do…
You can use case_when().
case_when()
Inspired by SQL’s CASE statement. Has a very weird syntax:
- condition ~ output
- condition is a logical vector
- When condition is TRUE, output is used.
Weird, but pretty readable:
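A minimal sketch of the syntax on a made-up vector `x`:

```r
library(dplyr)

x <- c(-3:3, NA)

signs <- case_when(
  x == 0   ~ "0",
  x < 0    ~ "-ve",
  x > 0    ~ "+ve",
  is.na(x) ~ "???"
)
signs   # "-ve" "-ve" "-ve" "0" "+ve" "+ve" "+ve" "???"
```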
case_when()
If no cases match, then case_when() returns NA.
If multiple conditions are satisfied, only the first is used – be careful!
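Both behaviours can be seen on a made-up vector `x`:

```r
library(dplyr)

x <- c(-3:3, NA)

# Neither condition matches 0 or NA, so those elements become NA
unmatched <- case_when(
  x < 0 ~ "-ve",
  x > 0 ~ "+ve"
)
unmatched   # "-ve" "-ve" "-ve" NA "+ve" "+ve" "+ve" NA

# 3 satisfies both conditions, but only the first match is used,
# so "big" never appears
first_wins <- case_when(
  x > 0 ~ "+ve",
  x > 2 ~ "big"
)
first_wins  # NA NA NA NA "+ve" "+ve" "+ve" NA
```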
case_when(): more complex example
Provide human-readable labels to flight delays.
flights |>
  mutate(
    status = case_when(
      is.na(arr_delay)      ~ "cancelled",
      arr_delay < -30       ~ "very early",
      arr_delay < -15       ~ "early",
      abs(arr_delay) <= 15  ~ "on time",
      arr_delay < 60        ~ "late",
      arr_delay < Inf       ~ "very late",
    ),
    .keep = "used" # only returns those columns used in calculations
  )
# A tibble: 336,776 × 2
arr_delay status
<dbl> <chr>
1 11 on time
2 20 late
...
- We can refer to columns directly inside case_when(), just as in most other tidyverse functions.
- If arr_delay < -30, the remaining conditions do not get checked.
Both if_else() and case_when() require the outputs to be of consistent types.
Most types are incompatible in order to catch errors. Compatible types:
- NA is compatible with everything
%% (read “modulo”) does “modular arithmetic”:
a %% b returns the remainder when dividing a by b, e.g.
- 17 %% 12 = 5
- 34 %% 6 = 4
- n is even if and only if n %% 2 == 0; otherwise, odd.
- Use if_else to label numbers between 0 and 20 as even or odd
The readr package in the tidyverse has two useful functions:
- parse_double() – useful when you have numbers written as strings.
- parse_number() – ignores all non-numeric text to parse strings.
What happens if we use parse_double() with strings that do not represent numbers?
x <- parse_double(c("qwerty", "6.5", "asdf"))
str(x) # shows that x is a vector with informative attributes
 num [1:3] NA 6.5 NA
- attr(*, "problems")= tibble [2 × 4] (S3: tbl_df/tbl/data.frame)
..$ row : int [1:2] 1 3
..$ col : int [1:2] NA NA
..$ expected: chr [1:2] "a double" "a double"
..$ actual : chr [1:2] "qwerty" "asdf"
Can access this tibble by attributes(x)$problems
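For contrast, a short sketch of the two readr parsers on well-behaved input (the example strings are made up):

```r
library(readr)

# Numbers written as strings
parsed_dbl <- parse_double(c("1.2", "5.6", "1e3"))
parsed_dbl   # 1.2 5.6 1000

# Numbers surrounded by currency symbols, separators, or units
parsed_num <- parse_number(c("$1,234", "USD 3,513", "59%"))
parsed_num   # 1234 3513 59
```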
Consider flights |> mutate(air_time_hr = air_time / 60).
air_time has 336,776 elements while 60 has only one, so we divide every element of air_time by 60.
What happens if the number of elements is not 1 or the exact matching number?
R does what is called recycling, or repeating: the shorter vector is repeated until it is as long as the longer one.
These rules apply to all logical comparisons (==, <, etc.) and arithmetic (+, ^, etc.).
# A tibble: 25,977 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 542 540 2 923 850
3 2013 1 1 554 600 -6 812 837
...
month == c(1,2) returns a logical vector that is:
- TRUE if either the row number is odd and the month is 1, OR the row number is even and the month is 2.
- FALSE otherwise.
Better to use month %in% c(1,2) here!
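Recycling can be seen on short vectors. The `month` values below are made up for illustration.

```r
# The shorter vector c(1, 10) is recycled to 1, 10, 1, 10, 1, 10
1:6 * c(1, 10)       # 1 20 3 40 5 60

month <- c(1, 2, 2, 1, 1, 2)

# c(1, 2) is recycled to 1, 2, 1, 2, 1, 2 and compared position by position
month == c(1, 2)     # TRUE TRUE FALSE FALSE TRUE TRUE

# %in% asks whether each month appears anywhere in c(1, 2)
month %in% c(1, 2)   # TRUE TRUE TRUE TRUE TRUE TRUE
```

When the longer length is not a multiple of the shorter one, base R still recycles but issues a warning.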
Recall from grade school: division with remainder.
- %/%: integer division
- %%: the remainder after integer division.
We can then do things like calculate the percent of cancelled flights per scheduled departure hour.
flights |>
  mutate(hour = sched_dep_time %/% 100) |>
  group_by(hour) |>
  summarize(percent_cancelled = 100*mean(is.na(dep_time)),
            n = n())
# A tibble: 20 × 3
hour percent_cancelled n
<dbl> <dbl> <int>
1 1 100 1
2 5 0.461 1953
3 6 1.64 25951
4 7 1.27 22821
5 8 1.62 27242
6 9 1.61 20312
7 10 1.74 16708
8 11 1.85 16033
...
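The remainder examples and the even/odd exercise mentioned earlier can be sketched as follows. This is one possible solution; the names `n` and `parity` are chosen here for illustration.

```r
library(dplyr)

17 %% 12    # 5
34 %% 6     # 4
17 %/% 12   # 1: integer division drops the remainder

# Label the numbers 0 to 20 as even or odd
n <- 0:20
parity <- if_else(n %% 2 == 0, "even", "odd")
head(parity, 4)   # "even" "odd" "even" "odd"
```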
dplyr::min_rank()
min_rank(x) gives the smallest values the lowest ranks; to give the largest values the lowest ranks, use desc(x).
There are many variants of min_rank() in dplyr:
- row_number(), cume_dist(), percent_rank()
- min_rank() is enough in most cases.
dplyr::min_rank()
Example: which 3 flight routes have the longest average delays?
- Routes are identified by origin and dest
- Negative arr_delay means the flight arrived early
flights |>
  filter(arr_delay > 0) |>
  group_by(origin, dest) |>
  summarize(avg_delay = mean(arr_delay, na.rm=TRUE),
            .groups = 'drop') |>
  mutate(rank = min_rank(desc(avg_delay))) |>
  filter(rank <= 3) |>
  arrange(rank)
# A tibble: 3 × 4
origin dest avg_delay rank
<chr> <chr> <dbl> <int>
1 LGA TVC 72.7 1
2 EWR TYS 72.6 2
3 LGA OMA 65.0 3
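How min_rank() handles ties, missing values, and desc() can be seen on a short made-up vector:

```r
library(dplyr)

x <- c(1, 2, 2, 3, 4, NA)

min_rank(x)         # 1 2 2 4 5 NA: ties share the lowest rank, NA stays NA
min_rank(desc(x))   # 5 3 3 2 1 NA: largest value gets rank 1
```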
min(x) and max(x) return the single smallest/largest values within vector x.
quantile(vector, threshold) is a generalization of the median:
- quantile(x, 0.25) returns a value of x that is >= 25% of the values within x
- quantile(x, 0.5) returns the median
- quantile(x, 0.95) returns a value of x that is >= 95% of the values within x.
Example vector: c(1, 2, 3, 2, 5, 2, 3, 1, 4, 2, 3, 1, 5, 2, 10000000)
Example: calculate the maximum delay and the 95th percentile of delays for flights per day.
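A sketch of these summaries on the vector above, followed by a per-day version. The tibble `toy` and its values are hypothetical; with the real data the same summarize() call would be applied to flights grouped by year, month, and day.

```r
library(dplyr)

x <- c(1, 2, 3, 2, 5, 2, 3, 1, 4, 2, 3, 1, 5, 2, 10000000)

max(x)              # 1e+07: determined entirely by the single extreme value
median(x)           # 2
quantile(x, 0.5)    # the 50% quantile equals the median
quantile(x, 0.25)

# Hypothetical toy data: per-day maximum and 95% quantile of delays
toy <- tibble(
  day       = c(1, 1, 1, 2, 2, 2),
  dep_delay = c(0, 10, 200, 5, NA, 15)
)

daily <- toy |>
  group_by(day) |>
  summarize(
    max_delay = max(dep_delay, na.rm = TRUE),
    q95_delay = quantile(dep_delay, 0.95, na.rm = TRUE),
    .groups = "drop"
  )
daily
```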
glue::glue()
Similar to f-strings in Python. Can be less clunky than R’s paste().
tibble(
  Student = c("Ant", "Bug", "Cat", "Ant", "Bug", "Cat"),
  Subject = c("Math", "Math", "Math", "Chem", "Chem", "Chem"),
  Score = c(85, 92, 88, 90, 78, 95)
) |>
  mutate(Msg = glue('Superstar {Student} scored {Score} in {Subject}'))
# A tibble: 6 × 4
Student Subject Score Msg
<chr> <chr> <dbl> <glue>
1 Ant Math 85 Superstar Ant scored 85 in Math
2 Bug Math 92 Superstar Bug scored 92 in Math
3 Cat Math 88 Superstar Cat scored 88 in Math
4 Ant Chem 90 Superstar Ant scored 90 in Chem
5 Bug Chem 78 Superstar Bug scored 78 in Chem
6 Cat Chem 95 Superstar Cat scored 95 in Chem