STA141A: Fundamentals of Statistical Data Science
We’ll investigate how to manipulate and transform data to get logicals, and how to use logicals.
For logical vectors, every element takes one of 3 values: TRUE, FALSE, NA
Common way to create a logical vector: numeric comparison with <, !=, etc. For example, flights$dep_time > 600 produces:
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
...
We might also want to combine multiple conditions.
- AND (& in R): operation between two logicals
- OR (| in R): operation between two logicals
- NOT (! in R): operation on a single logical.

Truth table for AND:
| A | B | A & B |
|---|---|---|
| TRUE | TRUE | TRUE |
| TRUE | FALSE | FALSE |
| FALSE | TRUE | FALSE |
| FALSE | FALSE | FALSE |
Truth table for OR:
| A | B | A \| B |
|---|---|---|
| TRUE | TRUE | TRUE |
| TRUE | FALSE | TRUE |
| FALSE | TRUE | TRUE |
| FALSE | FALSE | FALSE |
Truth table for NOT:
| A | ! A |
|---|---|
| TRUE | FALSE |
| FALSE | TRUE |
Can combine AND/OR with NOT to cover any binary Boolean operation
A comparator between two vectors of logicals returns pairwise comparisons.
FYI, we can also use T for TRUE and F for FALSE:
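A small sketch of the element-wise operators, using two made-up vectors `a` and `b` (not taken from the flights data). It also builds XOR out of AND, OR and NOT, and checks the `T`/`F` shorthand.

```r
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

and_ab <- a & b   # pairwise AND
or_ab  <- a | b   # pairwise OR
not_a  <- !a      # NOT, element by element

# XOR expressed with only AND, OR and NOT
xor_ab <- (a | b) & !(a & b)

# T and F are shorthand for TRUE and FALSE ...
same <- identical(c(T, F), c(TRUE, FALSE))
# ... but they are ordinary variables that can be overwritten,
# so spelling out TRUE and FALSE is the safer habit.
```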
Multiple comparisons in filter() produce a new vector of logicals.
# A tibble: 172,286 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 601 600 1 844 850
2 2013 1 1 602 610 -8 812 820
3 2013 1 1 602 605 -3 821 805
4 2013 1 1 606 610 -4 858 910
5 2013 1 1 606 610 -4 837 845
6 2013 1 1 607 607 0 858 915
...
filter() and mutate() can be used in conjunction
# A tibble: 172,286 × 21
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 601 600 1 844 850
2 2013 1 1 602 610 -8 812 820
3 2013 1 1 602 605 -3 821 805
...
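The code behind the two outputs above is not shown, so here is a hedged sketch on a hypothetical five-row stand-in for `flights`. Only `dep_time > 600` comes from these notes; the other thresholds (`2000`, `20`) and the column names `daytime` and `approx_ontime` are assumptions for illustration.

```r
library(dplyr)

# Hypothetical mini data set standing in for flights
toy <- tibble(
  dep_time  = c(545, 601, 1530, 2115, NA),
  arr_delay = c(5, -8, 45, 10, NA)
)

# Conditions written directly inside filter()
direct <- toy |>
  filter(dep_time > 600 & dep_time < 2000 & abs(arr_delay) < 20)

# Same conditions stored as logical columns first, then used to filter
via_mutate <- toy |>
  mutate(
    daytime       = dep_time > 600 & dep_time < 2000,
    approx_ontime = abs(arr_delay) < 20
  ) |>
  filter(daytime & approx_ontime)
```

Both versions keep the same rows; the `mutate()` version carries two extra logical columns, which makes the intermediate steps easy to inspect.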
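Testing equality with == for floating points can cause problems. Numbers are represented with finite “precision”: R stores them as 64-bit doubles with a 53-bit significand, i.e. a relative precision of about 2^{-52} (roughly 15 to 17 significant decimal digits).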
What’s going on? Let’s look at more precise representation in R.
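A minimal sketch of the problem, using two values that should mathematically equal 1 and 2. The tolerance `1e-8` is an arbitrary choice for illustration; `dplyr::near()` offers a similar tolerance-based comparison.

```r
x <- c(1 / 49 * 49, sqrt(2)^2)

x                       # prints as 1 2
exact <- x == c(1, 2)   # neither element is exactly equal

# Show more digits to reveal the rounding error
print(x, digits = 16)

# Compare with a tolerance instead of ==
close_enough <- abs(x - c(1, 2)) < 1e-8
```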
is.na() is a useful function for dealing with NA values.
It returns TRUE for missing values and FALSE for everything else.
Since is.na() returns logicals, it can be used in filter():
# A tibble: 8,255 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 NA 1630 NA NA 1815
2 2013 1 1 NA 1935 NA NA 2240
3 2013 1 1 NA 1500 NA NA 1825
4 2013 1 1 NA 600 NA NA 901
...
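A hedged sketch of both uses on made-up data; `toy` and its columns are hypothetical stand-ins for `flights`.

```r
library(dplyr)

na_flags <- is.na(c(1, NA, 3))   # FALSE TRUE FALSE

# Hypothetical mini data set: flight 2 has no recorded departure time
toy <- tibble(
  flight   = c(1, 2, 3),
  dep_time = c(517L, NA, 602L)
)

# Keep only the rows where dep_time is missing
missing_dep <- toy |>
  filter(is.na(dep_time))
```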
We can use is.na() to help identify where NAs come from, e.g., why are there air_time NAs?
dep_time, dep_delay, and sched_dep_time are related.
# A tibble: 2 × 4
missing_dep_time missing_dep_delay missing_sched_dep_time n
<lgl> <lgl> <lgl> <int>
1 FALSE FALSE FALSE 328521
2 TRUE TRUE FALSE 8255
Every row where dep_delay is missing also has dep_time missing.
Is dep_delay = dep_time - sched_dep_time?
# A tibble: 3 × 2
manual_matches_given n
<lgl> <int>
1 FALSE 99777
2 TRUE 228744
3 NA 8255
Quite weird, since we are getting a lot right but also getting a lot wrong.
Let’s inspect further. What do the mismatched observations look like?
# A tibble: 99,777 × 6
time_hour flight dep_time sched_dep_time dep_delay manual_delay
<dttm> <int> <int> <int> <dbl> <int>
1 2013-01-01 06:00:00 461 554 600 -6 -46
2 2013-01-01 06:00:00 507 555 600 -5 -45
3 2013-01-01 06:00:00 5708 557 600 -3 -43
...
Problem: R is treating dep_time and sched_dep_time as integers, not time!
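To see the issue concretely: 554 means 5:54, so subtracting 600 as plain integers gives -46 rather than -6 minutes. A sketch using the three mismatched rows shown above; `to_minutes()` is a hypothetical helper, not part of any package.

```r
# Hypothetical helper: convert a clock time stored as HHMM
# into minutes since midnight
to_minutes <- function(hhmm) (hhmm %/% 100) * 60 + hhmm %% 100

dep_time       <- c(554L, 555L, 557L)
sched_dep_time <- c(600L, 600L, 600L)

# Integer subtraction treats 554 and 600 as ordinary numbers
naive_delay <- dep_time - sched_dep_time

# Converting to minutes first recovers the recorded dep_delay
minute_delay <- to_minutes(dep_time) - to_minutes(sched_dep_time)
```

This sketch ignores flights whose scheduled and actual departures fall on different sides of midnight.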
Logical and missing values interact in logical ways, but this requires some thought.
Think of NA as an unknown logical value.
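A quick check of how an unknown value combines with TRUE and FALSE. The result is known only when it would be the same whichever value NA stood for.

```r
r1 <- NA | TRUE    # TRUE: OR with TRUE is TRUE either way
r2 <- NA & FALSE   # FALSE: AND with FALSE is FALSE either way
r3 <- NA & TRUE    # NA: the answer depends on the unknown value
r4 <- NA | FALSE   # NA: the answer depends on the unknown value
```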
Does not depend on value of NA:
- NA OR TRUE will return TRUE.
- NA AND FALSE will return FALSE.

Depends on value of NA:
- NA AND TRUE will return NA.
- NA OR FALSE will return NA.

# A tibble: 55,403 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 11 1 5 2359 6 352 345
2 2013 11 1 35 2250 105 123 2356
...
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
...
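In month == 11 | 12:
- R first evaluates month==11, creating a logical vector vec.
- It then computes vec | 12.
- Used with a logical operator, a nonzero number such as 12 is converted to TRUE.
- So vec | 12 returns a vector with TRUE for every element.

%in%
Instead of worrying about | and == in order, just use %in%.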
So to find all flights from November and December:
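A sketch on a made-up `month` vector comparing the mistaken `| 12`, the long form, and `%in%`. With the real data, the same condition would go inside `filter()`, e.g. `flights |> filter(month %in% c(11, 12))`.

```r
month <- c(1, 11, 12, 6, 11)

# Mistake: 12 is treated as TRUE, so every element is TRUE
wrong <- month == 11 | 12

# Correct but repetitive
long_form <- month == 11 | month == 12

# Same result, easier to read and extend
with_in <- month %in% c(11, 12)
```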
%in% obeys different rules for NA vs. ==, since NA %in% NA is TRUE, whereas NA == NA is NA.
Two main functions to summarize a logical vector:
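- any(x) returns TRUE if any value in x is TRUE
- all(x) returns TRUE only if all values in x are TRUE

E.g., was there a day where every flight was delayed (all_delayed), or where any flight had a long delay (any_long_delay)?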
# A tibble: 365 × 5
# Groups: year, month [12]
year month day all_delayed any_long_delay
<int> <int> <int> <lgl> <lgl>
1 2013 1 1 FALSE TRUE
2 2013 1 2 FALSE TRUE
3 2013 1 3 FALSE FALSE
4 2013 1 4 FALSE FALSE
...
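A hedged sketch of a per-day summary with `all()` and `any()`. The data are made up, and the cutoffs (`> 0` for "delayed", `>= 300` minutes for "long delay") are assumptions; the notes do not state which were used.

```r
library(dplyr)

# Hypothetical delays (in minutes) for two days
toy <- tibble(
  day       = c(1, 1, 1, 2, 2),
  dep_delay = c(10, 30, 400, -5, 20)
)

daily <- toy |>
  group_by(day) |>
  summarize(
    all_delayed    = all(dep_delay > 0),     # TRUE only if every flight was late
    any_long_delay = any(dep_delay >= 300)   # TRUE if at least one long delay
  )
```

With missing values in the column, `all()` and `any()` can return NA; passing `na.rm = TRUE` ignores the missing entries.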
- TRUE coerces to 1 and FALSE coerces to 0
- So numeric summaries work on TRUE/FALSE, e.g. mean(), sum()

# A tibble: 365 × 5
# Groups: year, month [12]
year month day prop_delayed_1hour num_long_delay
<int> <int> <int> <dbl> <int>
1 2013 1 1 0.0609 3
2 2013 1 2 0.0856 3
3 2013 1 3 0.0586 0
...
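Because of this coercion, `sum()` counts the TRUEs and `mean()` gives their proportion. A sketch on a made-up delay vector, with a 60-minute cutoff chosen for illustration:

```r
dep_delay <- c(10, 75, -3, 120, NA, 5)

late <- dep_delay > 60   # FALSE TRUE FALSE TRUE NA FALSE

num_late  <- sum(late, na.rm = TRUE)    # number of TRUEs
prop_late <- mean(late, na.rm = TRUE)   # share of TRUEs among non-missing values
```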
Note the line # Groups: year, month [12] in the output above: the summary is still grouped by year and month. The same summary, ungrouped:
# A tibble: 365 × 5
year month day prop_delayed_1hour num_long_delay
<int> <int> <int> <dbl> <int>
1 2013 1 1 0.0609 3
2 2013 1 2 0.0856 3
3 2013 1 3 0.0586 0
4 2013 1 4 0.0473 0
5 2013 1 5 0.0363 1
...
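One way to control this is the `.groups` argument of `summarize()`. This is a sketch on made-up data and may not be how the output above was produced (`ungroup()` would also work). The 60-minute cutoff is an assumption.

```r
library(dplyr)

# Hypothetical mini data set
toy <- tibble(
  year      = 2013,
  month     = c(1, 1, 2, 2),
  day       = c(1, 1, 1, 2),
  dep_delay = c(70, 10, 5, 90)
)

# Drops only the last grouping variable (day): still grouped by year, month
grouped <- toy |>
  group_by(year, month, day) |>
  summarize(prop_delayed_1hour = mean(dep_delay > 60), .groups = "drop_last")

# Drops all grouping
ungrouped <- toy |>
  group_by(year, month, day) |>
  summarize(prop_delayed_1hour = mean(dep_delay > 60), .groups = "drop")

group_vars(grouped)
group_vars(ungrouped)
```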
if_else()
if_else(CONDITION, TRUE_VAL, FALSE_VAL, MISSING_VAL) is useful when we want to return one value when the condition is TRUE and another value when the condition is FALSE.
The fourth argument of if_else() specifies what to fill NA’s with:
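A short sketch with made-up labels, showing the result with and without the fourth argument:

```r
library(dplyr)

x <- c(-3:3, NA)

# Without the fourth argument, a missing condition gives NA
signs_na <- if_else(x > 0, "+ve", "-ve")

# With it, missing conditions are filled with the supplied value
signs <- if_else(x > 0, "+ve", "-ve", "???")
```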
We can also use vectors as an argument.
if_else()
We can use general vectors inside of if_else():
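For instance, a sketch that computes absolute values by choosing between `-x` and `x` element by element:

```r
library(dplyr)

x <- c(-3, -1, 0, 2, NA)

# For each position i, take -x[i] where x[i] < 0 and x[i] otherwise
abs_x <- if_else(x < 0, -x, x)
```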
case_when()
if_else() is useful if you have two different conditions for which you want to specify values.
case_when() is useful if you have many different conditions for which you want to specify values. For example:
- If the value is between 0 and 20, then print F,
- if between 20 and 40, then print D,
- if between 40 and 60, then print C,
- if between 60 and 80, then print B,
- if between 80 and 100, then print A.

case_when()
Inspired by SQL’s CASE statement. Has a very weird syntax:
- condition ~ output
- condition is a logical vector
- When condition is TRUE, output is used.

Weird, but pretty readable:
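A sketch of the grading example. The notes do not say which grade a boundary value such as 20 receives, so assigning it to the higher band is an assumption here.

```r
library(dplyr)

score <- c(5, 20, 39, 61, 100)

# Conditions are checked top to bottom; the first TRUE one decides the output
grade <- case_when(
  score < 20   ~ "F",
  score < 40   ~ "D",
  score < 60   ~ "C",
  score < 80   ~ "B",
  score <= 100 ~ "A"
)
```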
case_when()
If no cases match, then case_when() returns NA.
If multiple conditions are satisfied, only the first is used, so be careful!
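Both behaviours in a small sketch with made-up labels:

```r
library(dplyr)

x <- c(-3, 0, 3, NA)

# 0 and NA match neither condition, so they become NA
no_match <- case_when(
  x < 0 ~ "-ve",
  x > 0 ~ "+ve"
)

# 3 satisfies both conditions, but only the first one is used
first_wins <- case_when(
  x > 0 ~ "+ve",
  x > 2 ~ "big"
)
```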
case_when(): more complex example
Provide human-readable labels to flight delays.
# A tibble: 336,776 × 2
arr_delay status
<dbl> <chr>
1 11 on time
2 20 late
...
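The code for this output is not shown. The sketch below is one set of rules consistent with the two rows above (11 is "on time", 20 is "late"). Apart from `arr_delay < -30`, the cutoffs and labels are assumptions, and `toy` is a made-up stand-in for `flights`.

```r
library(dplyr)

toy <- tibble(arr_delay = c(11, 20, -45, 90, NA))

labelled <- toy |>
  mutate(
    status = case_when(
      is.na(arr_delay)     ~ "cancelled",   # assumed label for missing delays
      arr_delay < -30      ~ "very early",
      arr_delay < -15      ~ "early",
      abs(arr_delay) <= 15 ~ "on time",
      arr_delay < 60       ~ "late",
      arr_delay < Inf      ~ "very late"
    )
  )
```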
- We can refer to variables from the data frame inside case_when(), just as in most other tidyverse functions.
- If arr_delay < -30, the remaining conditionals do not get checked.

Both if_else() and case_when() require the outputs to be of consistent types.
Most types are incompatible in order to catch errors. Compatible types:
- NA is compatible with everything

%% (read “modulo”) does “modular arithmetic”:
a %% b returns the remainder when dividing a by b, e.g.
- 17 %% 12 = 5
- 34 %% 6 = 4

n is even if and only if n %% 2 == 0; otherwise, odd.
Use if_else to label numbers between 0 and 20 as even or odd.
The readr package in the tidyverse has two useful functions:
- parse_double(): useful when you have numbers written as strings.
- parse_number(): ignores all non-numeric text to parse strings.

What happens if we use parse_double() with non-numeric-identifying strings?
num [1:3] NA 6.5 NA
- attr(*, "problems")= tibble [2 × 4] (S3: tbl_df/tbl/data.frame)
..$ row : int [1:2] 1 3
..$ col : int [1:2] NA NA
..$ expected: chr [1:2] "a double" "a double"
..$ actual : chr [1:2] "qwerty" "asdf"
Can access this tibble by attributes(x)$problems
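A sketch of both parsers. The input strings for `parse_number()` are made up; the vector passed to `parse_double()` is chosen to be consistent with the output above ("qwerty" and "asdf" appear in the problems table, "6.5" is inferred from the parsed value).

```r
library(readr)

# Numbers written as strings
doubles <- parse_double(c("1.2", "5.6", "1e3"))

# Currency symbols, grouping commas and percent signs are ignored
amounts <- parse_number(c("USD 3,513", "59%", "$1,500"))

# Strings that are not numbers become NA (with a warning)
x <- suppressWarnings(parse_double(c("qwerty", "6.5", "asdf")))

# Details of the failures are stored as an attribute
probs <- attr(x, "problems")
```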
Consider flights |> mutate(air_time_hr = air_time / 60).
air_time has 336,776 elements while 60 has only one, so we divide every element of air_time by 60.
What happens if the number of elements is not 1 or the exact matching number?
R does what is called recycling: it repeats the shorter vector until it is as long as the longer one.
The same rules apply to all logical comparisons (==, <, etc.) and arithmetic (+, ^, etc.).
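A small sketch of recycling in arithmetic, on a made-up vector:

```r
x <- c(1, 2, 10, 20)

# Length 1: the single value is reused for every element
x / 5

# Length 2: c(1, 2) is repeated as 1, 2, 1, 2
doubled <- x * c(1, 2)

# Length 3 does not divide 4: R still recycles (1, 2, 3, 1) but warns
uneven <- suppressWarnings(x * c(1, 2, 3))
```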
# A tibble: 25,977 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 542 540 2 923 850
3 2013 1 1 554 600 -6 812 837
...
month == c(1,2) returns a logical vector that is:
TRUE if either the row number is odd and the month is 1, OR the row number is even and the month is 2. Otherwise it is FALSE.
Better to use month %in% c(1,2) here!
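The difference is easy to see on a short made-up `month` vector:

```r
month <- c(1, 1, 2, 2, 1, 2)

# c(1, 2) is recycled to 1, 2, 1, 2, 1, 2 and compared position by position
recycled <- month == c(1, 2)

# %in% asks, for each element, whether it appears anywhere in c(1, 2)
membership <- month %in% c(1, 2)
```

Here `recycled` misses the second and third elements even though both are January or February.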
Recall from grade school: division with remainder.
- %/%: integer division
- %%: the remainder after integer division.

We can then do things like calculate the percent of cancelled flights per hour.
# A tibble: 20 × 3
hour percent_cancelled n
<dbl> <dbl> <int>
1 1 100 1
2 5 0.461 1953
3 6 1.64 25951
4 7 1.27 22821
5 8 1.62 27242
6 9 1.61 20312
7 10 1.74 16708
8 11 1.85 16033
...
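A hedged sketch of how such a table could be computed, on made-up data. It assumes that a missing `dep_time` marks a cancelled flight and that the hour is taken from `sched_dep_time`; the notes do not show the original code.

```r
library(dplyr)

# Hypothetical mini data set: times stored as HHMM
toy <- tibble(
  sched_dep_time = c(515, 559, 600, 645, 630),
  dep_time       = c(517, NA, 601, NA, 628)
)

by_hour <- toy |>
  mutate(
    hour   = sched_dep_time %/% 100,   # 559 %/% 100 is 5
    minute = sched_dep_time %% 100     # 559 %% 100 is 59
  ) |>
  group_by(hour) |>
  summarize(
    percent_cancelled = 100 * mean(is.na(dep_time)),
    n = n()
  )
```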
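dplyr::min_rank()
min_rank(x) gives the smallest value rank 1. To give the largest values the smallest ranks, use min_rank(desc(x)).
There are many variants of min_rank() in dplyr:
- row_number(), cume_dist(), percent_rank()
- min_rank() is enough in most cases.

dplyr::min_rank()
Example: which 3 flight routes have the longest average delays?
- A route is a combination of origin and dest
- A negative arr_delay means the flight arrived early

# A tibble: 3 × 4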
origin dest avg_delay rank
<chr> <chr> <dbl> <int>
1 LGA TVC 72.7 1
2 EWR TYS 72.6 2
3 LGA OMA 65.0 3
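A sketch of `min_rank()` on a small vector, followed by a route ranking on made-up data. The delays in `toy` are invented and do not reproduce the table above; the pipeline is one plausible approach, not necessarily the one used for it.

```r
library(dplyr)

x <- c(1, 2, 2, 3, 4, NA)
asc_rank  <- min_rank(x)         # ties share the smallest rank; NA stays NA
desc_rank <- min_rank(desc(x))   # largest value gets rank 1

# Hypothetical arrival delays for three routes
toy <- tibble(
  origin    = c("LGA", "LGA", "EWR", "EWR", "JFK", "JFK"),
  dest      = c("TVC", "TVC", "TYS", "TYS", "LAX", "LAX"),
  arr_delay = c(80, 60, 50, 70, -10, 5)
)

top_routes <- toy |>
  group_by(origin, dest) |>
  summarize(avg_delay = mean(arr_delay, na.rm = TRUE), .groups = "drop") |>
  mutate(rank = min_rank(desc(avg_delay))) |>
  filter(rank <= 2) |>   # keep the two worst routes in this toy example
  arrange(rank)
```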
min(x) and max(x) return the single smallest/largest values within vector x.
quantile(vector, threshold) is a generalization of median:
- quantile(x, 0.25) returns the value of x that is >= 25% of values within x
- quantile(x, 0.5) returns the median
- quantile(x, 0.95) returns the value of x that is >= 95% of values within x.

E.g., consider x = c(1, 2, 3, 2, 5, 2, 3, 1, 4, 2, 3, 1, 5, 2, 10000000)
Example: calculate the maximum delay and the 95% quantile of delays for flights per day.
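A sketch using the vector above, then a per-day summary on made-up delays (`toy` is hypothetical). `names = FALSE` drops the "95%" label that `quantile()` attaches by default.

```r
library(dplyr)

x <- c(1, 2, 3, 2, 5, 2, 3, 1, 4, 2, 3, 1, 5, 2, 10000000)

max(x)              # dominated by the single outlier
med <- median(x)    # unaffected by the outlier
q25 <- quantile(x, 0.25, names = FALSE)

# Hypothetical departure delays for two days
toy <- tibble(
  day       = c(1, 1, 1, 2, 2, 2),
  dep_delay = c(0, 10, 200, 5, NA, 15)
)

daily <- toy |>
  group_by(day) |>
  summarize(
    max_delay = max(dep_delay, na.rm = TRUE),
    q95       = quantile(dep_delay, 0.95, na.rm = TRUE, names = FALSE)
  )
```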
glue::glue()
Similar to f-strings in Python. Can be less clunky than R’s paste().
# A tibble: 6 × 4
Student Subject Score Msg
<chr> <chr> <dbl> <glue>
1 Ant Math 85 Superstar Ant scored 85 in Math
2 Bug Math 92 Superstar Bug scored 92 in Math
3 Cat Math 88 Superstar Cat scored 88 in Math
4 Ant Chem 90 Superstar Ant scored 90 in Chem
5 Bug Chem 78 Superstar Bug scored 78 in Chem
6 Cat Chem 95 Superstar Cat scored 95 in Chem
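A sketch that rebuilds a table like the one above. The data are copied from the output; the exact template string is inferred from the Msg column. The `paste()` version is included for comparison.

```r
library(dplyr)
library(glue)

scores <- tibble(
  Student = rep(c("Ant", "Bug", "Cat"), times = 2),
  Subject = rep(c("Math", "Chem"), each = 3),
  Score   = c(85, 92, 88, 90, 78, 95)
)

# Column names inside {} are filled in row by row
with_msg <- scores |>
  mutate(Msg = glue("Superstar {Student} scored {Score} in {Subject}"))

# Equivalent message built with paste()
paste_msg <- paste(
  "Superstar", scores$Student, "scored", scores$Score, "in", scores$Subject
)
```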
Comments
Previous material you will likely use for any data science project you work on