03: Transformations of logical vectors

STA35B: Statistical Data Science 2

Akira Horiguchi

Flights data

For this slide deck, we will repeatedly use the flights data set

library(tidyverse)
library(nycflights13)
flights

# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
...

Producing a logical vector

Logical vector

For logical vectors, every element takes one of 3 values: TRUE, FALSE, NA

c(TRUE, FALSE, NA) |> typeof()

[1] "logical"

Common way to create a logical vector: numeric comparison with <, !=, etc:

flights$dep_time > 600  # produces a logical vector

    [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
   [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
...

We might also want to combine multiple conditions.

For example, flights that departed between 6am and 8pm and were not delayed by more than 20 minutes.

Three basic logical operators that we will use over and over:

AND (denoted & in R): operation between two logicals
OR (denoted | in R): operation between two logicals
NOT (denoted ! in R): operation on a single logical.

Truth table for AND:

A	B	A `&` B
`TRUE`	`TRUE`	`TRUE`
`TRUE`	`FALSE`	`FALSE`
`FALSE`	`TRUE`	`FALSE`
`FALSE`	`FALSE`	`FALSE`

Truth table for OR:

A	B	A `|` B
`TRUE`	`TRUE`	`TRUE`
`TRUE`	`FALSE`	`TRUE`
`FALSE`	`TRUE`	`TRUE`
`FALSE`	`FALSE`	`FALSE`

Truth table for NOT:

A	`!` A
`TRUE`	`FALSE`
`FALSE`	`TRUE`

Can combine AND/OR with NOT to cover any binary Boolean operation
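As one illustration of this claim, exclusive OR (TRUE when exactly one input is TRUE) can be written with only &, | and !. The sketch below uses its own example vectors lhs and rhs (names chosen here, not taken from the slides) and checks the result against base R's xor().

```r
lhs <- c(TRUE, TRUE, FALSE, FALSE)
rhs <- c(TRUE, FALSE, TRUE, FALSE)

# XOR: at least one input is TRUE, but not both
xor_manual <- (lhs | rhs) & !(lhs & rhs)
xor_manual

# Base R provides the same operation as xor()
identical(xor_manual, xor(lhs, rhs))
```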

Comparisons: pairwise

A logical operator applied to two logical vectors is evaluated elementwise, i.e. pairwise.

x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)

(x & y)  # x AND y

[1]  TRUE FALSE FALSE FALSE

(x | y)  # x OR y

[1]  TRUE  TRUE  TRUE FALSE

!x  # NOT x

[1] FALSE FALSE  TRUE  TRUE

FYI, we can also use T for TRUE and F for FALSE:

a <- c(T, T, F, F)
b <- c(T, F, T, F)
a == x

[1] TRUE TRUE TRUE TRUE

b == y

[1] TRUE TRUE TRUE TRUE
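A caveat worth keeping in mind: TRUE and FALSE are reserved words, whereas T and F are ordinary variables that can be reassigned. The small sketch below shadows T inside local() so that nothing outside the block is affected; the name shadowed is made up for illustration.

```r
shadowed <- local({
  T <- FALSE   # allowed, because T is just a variable
  c(T, TRUE)   # inside this block, T now means FALSE
})
shadowed
```

For this reason, spelling out TRUE and FALSE is the safer habit in scripts.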

Comparisons: multiple

Combining several comparisons inside filter() produces a single logical vector; filter() keeps the rows where it is TRUE.

flights |>
  filter(dep_time > 600 & dep_time < 2000 & abs(arr_delay) < 20)  # keep those rows where the vector is TRUE

# A tibble: 172,286 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      601            600         1      844            850
 2  2013     1     1      602            610        -8      812            820
 3  2013     1     1      602            605        -3      821            805
 4  2013     1     1      606            610        -4      858            910
 5  2013     1     1      606            610        -4      837            845
 6  2013     1     1      607            607         0      858            915
...

Comparisons: conjunction

filter() and mutate() can be used in conjunction

flights |>
  mutate(
    daytime = dep_time > 600 & dep_time < 2000,
    approx_ontime = abs(arr_delay) < 20,
  ) |>
  filter(daytime & approx_ontime)

# A tibble: 172,286 × 21
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      601            600         1      844            850
 2  2013     1     1      602            610        -8      812            820
 3  2013     1     1      602            605        -3      821            805
...

Floating point comparisons

Testing equality with == for floating-point numbers can cause problems. Numbers are stored with finite “precision”: a double has a 53-bit significand, so its relative precision is about 2^{-52} (roughly 16 significant decimal digits).

x <- c( (1/49) * 49, sqrt(2)^2)
x == c(1,2)

[1] FALSE FALSE

What’s going on? Let’s print x with more digits.

print(x, digits=10)

[1] 1 2

print(x, digits=20)

[1] 0.99999999999999988898 2.00000000000000044409

dplyr::near() helps with this by ignoring small differences:

near(x, c(1,2))

[1] TRUE TRUE

all.equal(x, c(1,2))  # returns single value

[1] TRUE
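To make "ignores small differences" concrete, here is a base-R sketch of a tolerance comparison. It assumes a tolerance equal to the square root of machine epsilon, which is the default used by near(); the names target, tol and close_enough are chosen here for illustration.

```r
x <- c((1 / 49) * 49, sqrt(2)^2)
target <- c(1, 2)

.Machine$double.eps   # gap between 1 and the next representable double
tol <- sqrt(.Machine$double.eps)

# Elementwise "close enough" test, written out by hand
close_enough <- abs(x - target) < tol
close_enough

# all.equal() returns either TRUE or a character description of the
# difference, so wrap it in isTRUE() before using it in an if () condition
isTRUE(all.equal(x, target))
```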

Missing values

Almost any operation involving an NA returns NA.

(NA > 5)

[1] NA

(10 == NA)

[1] NA

What about NA==NA?

NA==NA

[1] NA

Why? Think of this example

# Suppose we don't know Ant's age
age_ant <- NA

# And we also don't know Bug's age
age_bug <- NA

# Then we shouldn't know whether Ant and
# Bug are the same age
age_ant == age_bug

[1] NA

Missing values

is.na() is a useful function for dealing with NA values.

works with any type of vector;
returns TRUE for missing values and FALSE for everything else:

is.na(c(TRUE, NA, FALSE))

[1] FALSE  TRUE FALSE

is.na(c(1, NA, 3))

[1] FALSE  TRUE FALSE

is.na(c("a", NA, "b"))

[1] FALSE  TRUE FALSE

is.na(list(3, '5', NA))  # also works for a list

[1] FALSE FALSE  TRUE

Missing values

Since is.na() returns a logical vector, it can be used in filter():

flights |>
  filter(is.na(dep_time))

# A tibble: 8,255 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1       NA           1630        NA       NA           1815
 2  2013     1     1       NA           1935        NA       NA           2240
 3  2013     1     1       NA           1500        NA       NA           1825
 4  2013     1     1       NA            600        NA       NA            901
...
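A related pitfall: because any comparison with NA is itself NA, and filter() keeps only rows where the condition is TRUE, testing with == NA returns no rows. The sketch below uses a made-up four-row data frame (toy is not part of the flights data) to contrast the two approaches.

```r
library(dplyr)

# Toy data, invented for illustration
toy <- data.frame(id = 1:4, dep_time = c(517L, NA, 533L, NA))

wrong <- toy |> filter(dep_time == NA)   # condition is NA for every row
right <- toy |> filter(is.na(dep_time))  # TRUE exactly for the missing rows

nrow(wrong)
nrow(right)
```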

Missing values

We can use is.na() to help identify where NAs come from, e.g., why are there dep_delay NAs?

Let’s examine how dep_time, dep_delay, and sched_dep_time are related.

flights |> 
  mutate(missing_dep_time = is.na(dep_time),
         missing_dep_delay = is.na(dep_delay),
         missing_sched_dep_time = is.na(sched_dep_time)) |> 
  count(missing_dep_time, missing_dep_delay, missing_sched_dep_time)

# A tibble: 2 × 4
  missing_dep_time missing_dep_delay missing_sched_dep_time      n
  <lgl>            <lgl>             <lgl>                   <int>
1 FALSE            FALSE             FALSE                  328521
2 TRUE             TRUE              FALSE                    8255

dep_delay is missing exactly when dep_time is missing; sched_dep_time is never missing.

Missing values

Is it the case that dep_delay = dep_time - sched_dep_time?

flights |> 
  mutate(dep_delay_manual = dep_time - sched_dep_time,
         manual_matches_given = near(dep_delay_manual, dep_delay)) |>
  count(manual_matches_given)

# A tibble: 3 × 2
  manual_matches_given      n
  <lgl>                 <int>
1 FALSE                 99777
2 TRUE                 228744
3 NA                     8255

Odd: the manual calculation matches for most flights but disagrees for 99,777 of them.

Missing values

Let’s inspect further. What do the mismatched observations look like?

flights |> 
  mutate(manual_delay = dep_time - sched_dep_time,
         manual_matches_given = near(manual_delay, dep_delay)) |>
  filter(!manual_matches_given) |>
  select(time_hour, flight, dep_time, sched_dep_time, dep_delay, manual_delay)

# A tibble: 99,777 × 6
   time_hour           flight dep_time sched_dep_time dep_delay manual_delay
   <dttm>               <int>    <int>          <int>     <dbl>        <int>
 1 2013-01-01 06:00:00    461      554            600        -6          -46
 2 2013-01-01 06:00:00    507      555            600        -5          -45
 3 2013-01-01 06:00:00   5708      557            600        -3          -43
...

Problem: R is treating dep_time and sched_dep_time as integers, not time!

5:54 is only 6 minutes away from 6:00, rather than 46.
We will later see how to properly treat dates and times.
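Until proper date-time tools are introduced, one possible workaround is to convert each clock reading to minutes since midnight using integer division and modulo. This is only a sketch: it assumes the values are HHMM clock readings, the helper name hhmm_to_minutes is hypothetical, and it ignores delays that cross midnight.

```r
# Convert an HHMM clock reading (e.g. 554 for 5:54) to minutes since midnight
hhmm_to_minutes <- function(t) (t %/% 100) * 60 + t %% 100

# First three mismatched rows from the table above
dep_time       <- c(554, 555, 557)
sched_dep_time <- c(600, 600, 600)

manual_delay <- hhmm_to_minutes(dep_time) - hhmm_to_minutes(sched_dep_time)
manual_delay  # agrees with dep_delay for these rows: -6 -5 -3
```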

Missing values in Boolean algebra

Logical and missing values interact in logical ways, but this requires some thought.

c(TRUE, FALSE) | NA

[1] TRUE   NA

c(TRUE, FALSE) & NA

[1]    NA FALSE

Think of NA as an unknown logical value.

Does not depend on value of NA:
- NA OR TRUE will return TRUE.
- NA AND FALSE will return FALSE.
Depends on value of NA:
- NA AND TRUE will return NA.
- NA OR FALSE will return NA.
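The four rules above can be checked all at once by building the full three-valued truth tables with outer(). This is a small base-R sketch; the object names vals, labels, and_table and or_table are chosen here for illustration.

```r
vals   <- c(TRUE, FALSE, NA)
labels <- c("TRUE", "FALSE", "NA")

# Apply & and | to every pair of values
and_table <- outer(vals, vals, "&")
or_table  <- outer(vals, vals, "|")
dimnames(and_table) <- list(A = labels, B = labels)
dimnames(or_table)  <- list(A = labels, B = labels)

and_table  # the only known results involving NA are in the FALSE row and column
or_table   # the only known results involving NA are in the TRUE row and column
```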

Consider finding all flights departing in November or December.

flights |>  # results in correct calculation
  filter(month == 11 | month == 12)

# A tibble: 55,403 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013    11     1        5           2359         6      352            345
 2  2013    11     1       35           2250       105      123           2356
...

flights |>  # results in incorrect calculation. Why?
  filter(month == 11 | 12)

# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
...

R first evaluates month == 11, creating a logical vector vec.
R then evaluates vec | 12.
When a number is used where a logical is expected, any nonzero number is treated as TRUE.

(x <- -2:4)

[1] -2 -1  0  1  2  3  4

TRUE & x

[1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE

So vec | 12 returns a vector with TRUE for every element. (Why?)
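The same behaviour can be reproduced on a three-element vector, which makes the parsing easier to see. The vector below is a made-up stand-in for the month column, and the names accidental and intended are chosen here for illustration.

```r
month <- c(1, 11, 12)

# Parsed as (month == 11) | 12, and 12 counts as TRUE
accidental <- month == 11 | 12
accidental

as.logical(12)  # any nonzero number coerces to TRUE

# What was actually intended
intended <- month == 11 | month == 12
intended
```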

`%in%`

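Instead of worrying about how to combine | and == correctly, just use %in%. x %in% y returns a logical vector as long as x that is TRUE wherever a value of x appears in y.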

1:10 %in% c(1, 5, 10)

 [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE

So to find all flights from November and December:

flights |>
  filter(month %in% c(11, 12))

%in% obeys different rules for NA vs. ==, since NA %in% NA is TRUE:

(c(1,2,NA) == NA)

[1] NA NA NA

(c(1,2,NA) %in% NA)

[1] FALSE FALSE  TRUE

Summarizing a logical vector

Logical summaries

Two main functions to summarize a logical vector:

any(x) returns TRUE if any value in x is TRUE
all(x) returns TRUE only if all values in x are TRUE

E.g., was there a day where:

every flight was delayed on departure by less than 1 hour?
any flight was delayed on arrival by at least 5 hours?

flights |>
  group_by(year, month, day) |>
  summarize(
    all_delayed = all(dep_delay < 60, na.rm=TRUE),
    any_long_delay = any(arr_delay >= 300, na.rm=TRUE)
  )

# A tibble: 365 × 5
# Groups:   year, month [12]
    year month   day all_delayed any_long_delay
   <int> <int> <int> <lgl>       <lgl>         
 1  2013     1     1 FALSE       TRUE          
 2  2013     1     2 FALSE       TRUE          
 3  2013     1     3 FALSE       FALSE         
 4  2013     1     4 FALSE       FALSE         
...
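The na.rm = TRUE arguments matter because any() and all() follow the same three-valued logic as | and &: a missing value makes the answer NA unless the other elements already decide it. A short base-R sketch (the vector name res is chosen here):

```r
res <- c(
  all_with_unknown  = all(c(TRUE, NA)),                  # NA: the unknown could be FALSE
  all_decided       = all(c(FALSE, NA)),                 # FALSE regardless of the unknown
  any_with_unknown  = any(c(FALSE, NA)),                 # NA: the unknown could be TRUE
  any_decided       = any(c(TRUE, NA)),                  # TRUE regardless of the unknown
  all_ignoring_na   = all(c(TRUE, NA), na.rm = TRUE),    # TRUE
  any_ignoring_na   = any(c(FALSE, NA), na.rm = TRUE)    # FALSE
)
res
```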

Logical summaries

When coerced to numeric, TRUE becomes 1 and FALSE becomes 0.
Useful if you want the number or proportion of TRUE/FALSE values, e.g. with sum() or mean().

x <- c(TRUE, FALSE, FALSE, FALSE, TRUE)
sum(x)  # number of TRUE elements in x

[1] 2

mean(x)  # proportion of TRUE elements in x

[1] 0.4

length(x) - sum(x)  # number of FALSE elements in x

[1] 3

1 - mean(x)  # proportion of FALSE elements in x

[1] 0.6
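The same idea applies to a comparison computed on the fly, with one caveat: if the data contain NA, so does the logical vector, and na.rm = TRUE is needed to get a number back. The delay values below are made up for illustration.

```r
delay <- c(5, 75, NA, 130, 20)   # invented delays in minutes

n_long    <- sum(delay > 60, na.rm = TRUE)   # number of delays over 1 hour
prop_long <- mean(delay > 60, na.rm = TRUE)  # proportion among non-missing values
n_missing <- sum(is.na(delay))               # number of missing delays

c(n_long = n_long, prop_long = prop_long, n_missing = n_missing)

sum(delay > 60)  # without na.rm the result is NA
```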

Logical summaries

Example: proportion of flights delayed > 1 hour on departure, and number of flights delayed on arrival by > 5 hours:

flights |> 
  group_by(year, month, day) |>
  summarise(
    prop_delayed_1hour = mean(dep_delay > 60, na.rm=TRUE),
    num_long_delay = sum(arr_delay > 300, na.rm=TRUE)
  )

# A tibble: 365 × 5
# Groups:   year, month [12]
    year month   day prop_delayed_1hour num_long_delay
   <int> <int> <int>              <dbl>          <int>
 1  2013     1     1             0.0609              3
 2  2013     1     2             0.0856              3
 3  2013     1     3             0.0586              0
...

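Note the line "# Groups: year, month [12]" in the output: by default summarise() drops only the last grouping variable (day), so the result is still grouped by year and month.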

Logical summaries

Example: proportion of flights delayed > 1 hour on departure, and number of flights delayed on arrival by > 5 hours:

flights |> 
  group_by(year, month, day) |>
  summarise(
    prop_delayed_1hour = mean(dep_delay > 60, na.rm=TRUE),
    num_long_delay = sum(arr_delay > 300, na.rm=TRUE),
    .groups = 'drop'  # Output is no longer grouped
  )

# A tibble: 365 × 5
    year month   day prop_delayed_1hour num_long_delay
   <int> <int> <int>              <dbl>          <int>
 1  2013     1     1             0.0609              3
 2  2013     1     2             0.0856              3
 3  2013     1     3             0.0586              0
 4  2013     1     4             0.0473              0
 5  2013     1     5             0.0363              1
...

Conditional transformations

Conditional transformations: `if_else()`

if_else(CONDITION, TRUE_VAL, FALSE_VAL, MISSING_VAL) is useful when we want to return some value when condition is TRUE and return another value when condition is FALSE.

x <- c(-2, -1, 1, 2, NA)
if_else(x > 0, "yay", "boo")

[1] "boo" "boo" "yay" "yay" NA

The fourth argument of if_else() specifies what to fill NA’s with:

if_else(x > 0, "yay", "boo", "huh")

[1] "boo" "boo" "yay" "yay" "huh"

The TRUE_VAL and FALSE_VAL arguments can also be vectors:

if_else(x < 0, -x, x)

[1]  2  1  1  2 NA

Conditional transformations: `if_else()`

We can use general vectors inside of if_else():

x1 <- c(NA, 1, 2, NA)
y1 <- c(3, NA, 4, 6)
if_else(is.na(x1), y1, x1)

[1] 3 1 2 6
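This particular pattern, "use x1 unless it is missing, otherwise fall back to y1", is also available directly as dplyr's coalesce(). The sketch below checks that the two approaches agree on the vectors from this slide; the names via_if_else and via_coalesce are chosen here.

```r
library(dplyr)

x1 <- c(NA, 1, 2, NA)
y1 <- c(3, NA, 4, 6)

via_if_else  <- if_else(is.na(x1), y1, x1)
via_coalesce <- coalesce(x1, y1)   # first non-missing value at each position

via_coalesce
identical(via_if_else, via_coalesce)
```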

Conditional transformations: `case_when()`

if_else() is useful when a single condition chooses between two values.

case_when() is useful if you have many different conditions for which you want to specify values. For example:

If number is between 0 and 20, then print F,
If number is between 20 and 40, then print D,
If number is between 40 and 60, then print C,
If number is between 60 and 80, then print B,
If number is between 80 and 100, then print A.
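A possible case_when() version of this grading rule is sketched below, with each line written as condition ~ output. The slides do not say which grade a boundary value such as 20 receives, so the sketch assumes each lower bound is inclusive and 100 counts as an A; the scores are made up, and scores are assumed to lie between 0 and 100.

```r
library(dplyr)

score <- c(5, 20, 39.5, 60, 85, 100, NA)   # invented scores

grade <- case_when(
  score < 20   ~ "F",
  score < 40   ~ "D",   # reached only if score >= 20
  score < 60   ~ "C",
  score < 80   ~ "B",
  score <= 100 ~ "A"
)
grade
```

Because the conditions are checked in order and the first match wins, each line only needs an upper bound. A missing score matches no condition and stays NA.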

Conditional transformations: `case_when()`

Inspired by SQL’s CASE statement. Has a very weird syntax:

condition ~ output
condition is a logical vector
when condition is TRUE, output is used.

Weird, but pretty readable:

(x <- c(-3:3, NA))

[1] -3 -2 -1  0  1  2  3 NA

case_when(
  x == 0   ~ "0",
  x < 0    ~ "-ve", 
  x > 0    ~ "+ve",
  is.na(x) ~ "???"
)

[1] "-ve" "-ve" "-ve" "0"   "+ve" "+ve" "+ve" "???"

Conditional transformations: `case_when()`

(x <- c(-1:2, NA))

[1] -1  0  1  2 NA

If no cases match, then returns NA:

case_when(
  x < 0 ~ "-ve",
  x > 0 ~ "+ve"
)

[1] "-ve" NA    "+ve" "+ve" NA

If multiple conditions are satisfied, only the first is used – be careful!

case_when(
  x > 0 ~ "+ve",
  x > 1 ~ "big"
)

[1] NA    NA    "+ve" "+ve" NA

The argument .default specifies the return value when no condition is satisfied, which includes elements where every condition evaluates to NA.

case_when(
  x < 0 ~ "-ve",
  x > 0 ~ "+ve",
  .default = "???"
)

[1] "-ve" "???" "+ve" "+ve" "???"

`case_when()`: more complex example

Provide human-readable labels to flight delays.

flights |> 
  mutate(
    status = case_when(
      is.na(arr_delay)      ~ "cancelled",
      arr_delay < -30       ~ "very early",
      arr_delay < -15       ~ "early",
      abs(arr_delay) <= 15  ~ "on time",
      arr_delay < 60        ~ "late",
      arr_delay < Inf       ~ "very late",
    ),
    .keep = "used" # only returns those columns used in calculations
  )

# A tibble: 336,776 × 2
   arr_delay status 
       <dbl> <chr>  
 1        11 on time
 2        20 late   
...

We can refer to variables in the data frame inside case_when(), just as in most other tidyverse functions.
The first conditional that is true is what gets assigned.
So when arr_delay < -30, the remaining conditionals do not get checked.

Compatible types

Both if_else() and case_when() require the outputs to be of consistent types.

if_else(TRUE, "a", 1)
#> Error in `if_else()`:
#> ! Can't combine `true` <character> and `false` <double>.

case_when(
  x < -1 ~ TRUE,
  x > 0 ~ now()
)
#> Error in `case_when()`:
#> ! Can't combine `..1 (right)` <logical> and `..2 (right)` <datetime<local>>

Most types are incompatible in order to catch errors. Compatible types:

Numeric and logical (treats TRUE=1, FALSE=0)
Dates and “date-times” - we will discuss these types later
NA is compatible with everything
Strings and factors are compatible - will discuss later
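A few combinations that do work, as a sketch assuming a recent dplyr (1.1.0 or later, the version that introduced the .default argument used earlier). The vector flag and the result names are made up for illustration.

```r
library(dplyr)

flag <- c(TRUE, FALSE)

# Fix for the error above: convert the number to a string explicitly
as_strings <- if_else(flag, "a", as.character(1))

# Numeric and logical are compatible: TRUE becomes 1
num_and_lgl <- if_else(flag, 10, TRUE)

# NA is compatible with any output type
with_na <- if_else(flag, "a", NA)

as_strings
num_and_lgl
with_na
```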

Example: labelling numbers as even or odd

A number is even if and only if it is divisible by two.
In R, operator %% (read “modulo”) does “modular arithmetic”:
a %% b returns the remainder when dividing a by b, e.g.
- 17 %% 12 = 5
- 34 %% 6 = 4
A number n is even if and only if n %% 2 == 0; otherwise, odd.
We can use if_else() to label the numbers from 0 to 18 as even or odd:

x <- 0:18
if_else(x %% 2 == 0, 'even', 'odd')

 [1] "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd" 
[11] "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even"
