Producing a logical vector
Logical vector
For logical vectors, every element takes one of 3 values: TRUE, FALSE, NA
c(TRUE, FALSE, NA) |> typeof()
A common way to create a logical vector is a numeric comparison with <, <=, ==, !=, etc.:
flights$dep_time > 600 # produces a logical vector
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
...
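The same kind of comparison can be tried without loading the flights data. The sketch below uses a small made-up vector (dep and its values are hypothetical, not taken from flights):

```r
# Hypothetical departure times in HHMM format (not from the flights data)
dep <- c(517, 559, 600, 601, 1345)

after_six <- dep > 600   # elementwise comparison
after_six
#> [1] FALSE FALSE FALSE  TRUE  TRUE
typeof(after_six)
#> [1] "logical"
```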
We might also want to combine multiple conditions.
For example: flights that depart between 6am and 8pm and that were not delayed by more than 20 minutes.
Three basic logical operators that we will use over and over:
AND (denoted & in R): operation between two logicals
OR (denoted | in R): operation between two logicals
NOT (denoted ! in R): operation on a single logical.
Truth table for AND:
a      b      a & b
TRUE   TRUE   TRUE
TRUE   FALSE  FALSE
FALSE  TRUE   FALSE
FALSE  FALSE  FALSE
Truth table for OR:
a      b      a | b
TRUE   TRUE   TRUE
TRUE   FALSE  TRUE
FALSE  TRUE   TRUE
FALSE  FALSE  FALSE
Can combine AND/OR with NOT to cover any binary Boolean operation
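As one illustration of building another Boolean operation from these three, exclusive OR (XOR) can be written with &, | and !. This is only a sketch; the name xor_manual is made up for the example, and base R already provides xor():

```r
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

# XOR: TRUE when exactly one of a, b is TRUE
xor_manual <- (a | b) & !(a & b)
xor_manual
#> [1] FALSE  TRUE  TRUE FALSE

# Base R's xor() gives the same answer
identical(xor_manual, xor(a, b))
#> [1] TRUE
```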
Comparisons: pairwise
A logical operator applied to two logical vectors works elementwise, i.e. it returns pairwise results.
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)
x & y
[1] TRUE FALSE FALSE FALSE
!x
[1] FALSE FALSE TRUE TRUE
FYI, we can also use T for TRUE and F for FALSE:
a <- c(T, T, F, F)
b <- c(T, F, T, F)
a == x
Comparisons: multiple
Combining multiple comparisons inside filter() produces a single logical vector; filter() keeps the rows where it is TRUE.
flights |>
  filter(dep_time > 600 & dep_time < 2000 & abs(arr_delay) < 20) # keep those rows where the vector is TRUE
# A tibble: 172,286 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 601 600 1 844 850
2 2013 1 1 602 610 -8 812 820
3 2013 1 1 602 605 -3 821 805
4 2013 1 1 606 610 -4 858 910
5 2013 1 1 606 610 -4 837 845
6 2013 1 1 607 607 0 858 915
...
Comparisons: conjunction
filter() and mutate() can be used in conjunction
flights |>
  mutate(
    daytime = dep_time > 600 & dep_time < 2000,
    approx_ontime = abs(arr_delay) < 20,
  ) |>
  filter(daytime & approx_ontime)
# A tibble: 172,286 × 21
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 601 600 1 844 850
2 2013 1 1 602 610 -8 812 820
3 2013 1 1 602 605 -3 821 805
...
Floating point comparisons
Testing equality with == on floating-point numbers can cause problems. Numbers are stored with finite precision: R's doubles keep 53 binary digits, so a result can be off by a relative error of about 2^{-52} (roughly 2.2e-16).
x <- c((1 / 49) * 49, sqrt(2)^2)
x == c(1, 2)
What’s going on? Let’s look at more precise representation in R.
[1] 0.99999999999999988898 2.00000000000000044409
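dplyr::near() helps with this by ignoring small differences; it compares elementwise. Base R's all.equal() also tolerates small differences but returns a single value for the whole vector:
all.equal(x, c(1, 2)) # returns single value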
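The three approaches can be compared side by side. A minimal sketch, assuming dplyr is installed; near() uses a default tolerance of sqrt(.Machine$double.eps), and wrapping all.equal() in isTRUE() is a common way to get a plain TRUE/FALSE:

```r
library(dplyr)  # for near()

x <- c((1 / 49) * 49, sqrt(2)^2)

x == c(1, 2)                    # exact comparison fails
#> [1] FALSE FALSE
near(x, c(1, 2))                # elementwise, within a small tolerance
#> [1] TRUE TRUE
isTRUE(all.equal(x, c(1, 2)))   # one TRUE/FALSE for the whole vector
#> [1] TRUE
```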
Missing values
Almost any operation involving an NA returns NA.
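A few base R expressions illustrate the rule; each one has an unknown input, so the result is unknown too:

```r
NA > 5
#> [1] NA
10 == NA
#> [1] NA
NA + 1
#> [1] NA
mean(c(1, 2, NA))
#> [1] NA
```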
What about NA==NA?
Why? Think of this example
# Suppose we don't know Ant's age
age_ant <- NA
# And we also don't know Bug's age
age_bug <- NA
# Then we shouldn't know whether Ant and
# Bug are the same age
age_ant == age_bug
Missing values
is.na() is a useful function for dealing with NA values.
works with any type of vector;
returns TRUE for missing values and FALSE for everything else:
is.na(c(TRUE, NA, FALSE))
is.na(list(3, '5', NA)) # also works for a list
Missing values
Since is.na() returns logicals, it can be used in filter():
flights |>
  filter(is.na(dep_time))
# A tibble: 8,255 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 NA 1630 NA NA 1815
2 2013 1 1 NA 1935 NA NA 2240
3 2013 1 1 NA 1500 NA NA 1825
4 2013 1 1 NA 600 NA NA 901
...
Missing values
We can use is.na() to help identify where NAs come from, e.g., why are there air_time NAs?
Let’s examine how dep_time, dep_delay, and sched_dep_time are related.
flights |>
  mutate(missing_dep_time = is.na(dep_time),
         missing_dep_delay = is.na(dep_delay),
         missing_sched_dep_time = is.na(sched_dep_time)) |>
  count(missing_dep_time, missing_dep_delay, missing_sched_dep_time)
# A tibble: 2 × 4
missing_dep_time missing_dep_delay missing_sched_dep_time n
<lgl> <lgl> <lgl> <int>
1 FALSE FALSE FALSE 328521
2 TRUE TRUE FALSE 8255
dep_delay is missing exactly when dep_time is missing, while sched_dep_time is never missing.
Missing values
Is it the case that dep_delay = dep_time - sched_dep_time?
flights |>
  mutate(dep_delay_manual = dep_time - sched_dep_time,
         manual_matches_given = near(dep_delay_manual, dep_delay)) |>
  count(manual_matches_given)
# A tibble: 3 × 2
manual_matches_given n
<lgl> <int>
1 FALSE 99777
2 TRUE 228744
3 NA 8255
This is odd: the manual calculation matches for many flights but is wrong for many others.
Missing values
Let’s inspect further. What do the mismatched observations look like?
flights |>
  mutate(manual_delay = dep_time - sched_dep_time,
         manual_matches_given = near(manual_delay, dep_delay)) |>
  filter(!manual_matches_given) |>
  select(time_hour, flight, dep_time, sched_dep_time, dep_delay, manual_delay)
# A tibble: 99,777 × 6
time_hour flight dep_time sched_dep_time dep_delay manual_delay
<dttm> <int> <int> <int> <dbl> <int>
1 2013-01-01 06:00:00 461 554 600 -6 -46
2 2013-01-01 06:00:00 507 555 600 -5 -45
3 2013-01-01 06:00:00 5708 557 600 -3 -43
...
Problem: R is treating dep_time and sched_dep_time as integers, not time!
5:54 is only 6 minutes away from 6:00, rather than 46.
We will later see how to properly treat dates and times.
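Until then, one rough workaround is to convert the HHMM integers to minutes since midnight before subtracting. The helper hhmm_to_minutes() below is hypothetical (it is not part of dplyr or of the original analysis), and this sketch ignores flights whose actual departure falls on a different day than the scheduled one:

```r
# Hypothetical helper: 554 means 5:54, i.e. 5 * 60 + 54 minutes after midnight
hhmm_to_minutes <- function(t) {
  (t %/% 100) * 60 + t %% 100
}

dep_time       <- c(554, 555, 557)   # the three mismatched rows shown above
sched_dep_time <- c(600, 600, 600)

hhmm_to_minutes(dep_time) - hhmm_to_minutes(sched_dep_time)
#> [1] -6 -5 -3
```

For these three rows the result agrees with the dep_delay column.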
Missing values in Boolean algebra
Logical operators and missing values interact in consistent ways, but this requires some thought.
Think of NA as an unknown logical value.
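Under that reading, the result is known only when it would be the same for either possible value of the unknown. A short base R sketch:

```r
NA | TRUE    # TRUE: the answer is TRUE whichever value NA stands for
#> [1] TRUE
NA & FALSE   # FALSE: the answer is FALSE whichever value NA stands for
#> [1] FALSE
NA | FALSE   # depends on the unknown value
#> [1] NA
NA & TRUE    # depends on the unknown value
#> [1] NA
!NA
#> [1] NA
```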
Consider finding all flights departing in November or December.
flights |> # results in correct calculation
  filter(month == 11 | month == 12)
# A tibble: 55,403 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 11 1 5 2359 6 352 345
2 2013 11 1 35 2250 105 123 2356
...
flights |> # results in incorrect calculation. Why?
  filter(month == 11 | 12)
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
...
R first evaluates month == 11, creating a logical vector vec.
R then evaluates vec | 12.
When a number is used with a logical operator, it is coerced to a logical: any nonzero number is treated as TRUE, and 0 as FALSE.
[1] TRUE TRUE FALSE TRUE TRUE TRUE TRUE
So vec | 12 returns a vector with TRUE for every element. (Why?)
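The two steps can be traced on a tiny example. The vector month below is a hypothetical stand-in for the flights column:

```r
month <- c(1, 11, 12)   # hypothetical values, not the flights column

vec <- month == 11      # step 1: the comparison
vec
#> [1] FALSE  TRUE FALSE

as.logical(12)          # 12 is nonzero, so it counts as TRUE
#> [1] TRUE

vec | 12                # step 2: anything OR TRUE is TRUE
#> [1] TRUE TRUE TRUE
```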
%in%
Instead of worrying about how | and == combine, just use %in%.
[1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
So to find all flights from November and December:
flights |>
  filter(month %in% c(11, 12))
%in% obeys different rules for NA vs. ==, since NA %in% NA is TRUE:
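The difference is easy to check directly in base R:

```r
NA == NA
#> [1] NA
NA %in% NA
#> [1] TRUE

c(1, 2, NA) == NA
#> [1] NA NA NA
c(1, 2, NA) %in% NA
#> [1] FALSE FALSE  TRUE
```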
case_when(): more complex example
Provide human-readable labels to flight delays.
flights |>
  mutate(
    status = case_when(
      is.na(arr_delay)      ~ "cancelled",
      arr_delay < -30       ~ "very early",
      arr_delay < -15       ~ "early",
      abs(arr_delay) <= 15  ~ "on time",
      arr_delay < 60        ~ "late",
      arr_delay < Inf       ~ "very late",
    ),
    .keep = "used" # only returns those columns used in calculations
  )
# A tibble: 336,776 × 2
arr_delay status
<dbl> <chr>
1 11 on time
2 20 late
...
Inside case_when() we can refer to variables in the data frame, just as in most other tidyverse functions.
The first conditional that is true is what gets assigned.
So when arr_delay < -30, the remaining conditionals do not get checked.
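The effect of ordering can be seen on a small vector. This is a sketch with made-up delays (the vector arr_delay below is hypothetical), assuming dplyr is installed:

```r
library(dplyr)

arr_delay <- c(-40, -20, 0, 30, 90, NA)   # hypothetical delays in minutes

# Same rules as above: each element gets the label of the first TRUE condition
status <- case_when(
  is.na(arr_delay)      ~ "cancelled",
  arr_delay < -30       ~ "very early",
  arr_delay < -15       ~ "early",
  abs(arr_delay) <= 15  ~ "on time",
  arr_delay < 60        ~ "late",
  arr_delay < Inf       ~ "very late"
)
status

# Swapping two conditions changes the result: -40 now matches "early" first,
# so the "very early" branch is never reached. Unmatched elements become NA.
swapped <- case_when(
  arr_delay < -15 ~ "early",
  arr_delay < -30 ~ "very early"
)
swapped
```

Here status labels -40 as "very early", while swapped labels it "early".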
Compatible types
Both if_else() and case_when() require the outputs to be of consistent types.
if_else(TRUE, "a", 1)
#> Error in `if_else()`:
#> ! Can't combine `true` <character> and `false` <double>.
case_when(
  x < -1 ~ TRUE,
  x > 0 ~ now()
)
#> Error in `case_when()`:
#> ! Can't combine `..1 (right)` <logical> and `..2 (right)` <datetime<local>>
Most types are incompatible in order to catch errors. Compatible types:
Numeric and logical (treats TRUE=1, FALSE=0)
Dates and “date-times” - we will discuss these types later
NA is compatible with everything
Strings and factors are compatible - will discuss later
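Two of these compatible combinations can be checked quickly. A minimal sketch, assuming dplyr 1.1.0 or later (the version whose error messages are shown above); older versions of if_else() were stricter about types:

```r
library(dplyr)

# Numeric and logical: FALSE is converted to 0
num_lgl <- if_else(c(TRUE, FALSE), 1, FALSE)
num_lgl

# NA is compatible with everything, here with a character value
chr_na <- if_else(c(TRUE, FALSE), "a", NA)
chr_na
```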
Example: labelling numbers as even or odd
Number is even \(\Leftrightarrow\) number is divisible by two.
In R, operator %% (read “modulo”) does “modular arithmetic”:
a %% b returns the remainder when dividing a by b, e.g. 17 %% 12 returns 5.
A number n is even if and only if n %% 2 == 0; otherwise, odd.
We can use if_else() to label the numbers from 0 to 18 as even or odd:
x <- 0:18
if_else(x %% 2 == 0, 'even', 'odd')
[1] "even" "odd" "even" "odd" "even" "odd" "even" "odd" "even" "odd"
[11] "even" "odd" "even" "odd" "even" "odd" "even" "odd" "even"