03: Transformations of logical vectors

STA35B: Statistical Data Science 2

Akira Horiguchi

Producing logical vectors

Intro

For logical vectors, every element takes one of 3 values: TRUE, FALSE, NA

We’ll investigate how to manipulate and transform data to get logicals, and how to use logicals.

library(tidyverse)
library(nycflights13)
flights
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
...

Three basic logical operators that we will use over and over:

  • AND (denoted & in R): operation between two logicals
  • OR (denoted | in R): operation between two logicals
  • NOT (denoted ! in R): operation on a single logical.

Truth table for AND:

A B A & B
TRUE TRUE TRUE
TRUE FALSE FALSE
FALSE TRUE FALSE
FALSE FALSE FALSE

Truth table for OR:

A B A | B
TRUE TRUE TRUE
TRUE FALSE TRUE
FALSE TRUE TRUE
FALSE FALSE FALSE

Truth table for NOT:

A ! A
TRUE FALSE
FALSE TRUE

Comparisons

Common way to create a logical vector: numeric comparison with <, !=, etc.

We have implicitly been using this when doing filtering.

flights$dep_time > 600
    [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
   [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
...

Using a comparator between two vectors of logicals returns pairwise comparisons.

x <- c(TRUE, FALSE, TRUE)
y <- c(FALSE, FALSE, TRUE)
(x & y) # x AND y
[1] FALSE FALSE  TRUE
(x | y) # x OR y
[1]  TRUE FALSE  TRUE

Comparisons

So when we use multiple comparisons in filter(), we are building a new vector of logicals.

We only keep those rows where the vector is TRUE.

flights %>%
  filter(dep_time > 600 & dep_time < 2000 & abs(arr_delay) < 20)
# A tibble: 172,286 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      601            600         1      844            850
 2  2013     1     1      602            610        -8      812            820
 3  2013     1     1      602            605        -3      821            805
...

Comparisons

Filter and mutate can be used in conjunction

flights %>%
  mutate(
    daytime = dep_time > 600 & dep_time < 2000,
    approx_ontime = abs(arr_delay) < 20,
  ) %>%
  filter(daytime & approx_ontime)
# A tibble: 172,286 × 21
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      601            600         1      844            850
 2  2013     1     1      602            610        -8      812            820
 3  2013     1     1      602            605        -3      821            805
...

Floating point comparisons

Testing equality with == for floating points can cause problems. Numbers are represented with finite “precision”, i.e. only up to 2^{-32} or 2^{-64}.

x <- c( (1/49) * 49, sqrt(2)^2)
x == c(1,2)
[1] FALSE FALSE

What’s going on? Let’s look at more precise representation in R.

print(x, digits=10)
[1] 1 2
print(x, digits=20)
[1] 0.99999999999999988898 2.00000000000000044409

dplyr::near() helps with this, ignores small differences

near(x, c(1,2))
[1] TRUE TRUE
all.equal(x, c(1,2))  # returns single value
[1] TRUE

Missing values

Almost any operation involving an NA returns NA.

(NA > 5)
[1] NA
(10 == NA)
[1] NA

What about NA==NA?

NA==NA
[1] NA

Why? Think of this example

# Suppose we don't know Ant's age
age_ant <- NA

# And we also don't know Bug's age
age_bug <- NA

# Then we shouldn't know whether Ant and
# Bug are the same age
age_ant == age_bug
[1] NA

Missing values

A useful function for dealing with NA: is.na()

is.na(x) works with any type of vector and returns TRUE for missing values and FALSE for everything else:

is.na(c(TRUE, NA, FALSE))
[1] FALSE  TRUE FALSE
is.na(c(1, NA, 3))
[1] FALSE  TRUE FALSE
is.na(c("a", NA, "b")) 
[1] FALSE  TRUE FALSE

Missing values

Since is.na() returns logicals, can be used in filter():

flights %>%
  filter(is.na(dep_time))
# A tibble: 8,255 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1       NA           1630        NA       NA           1815
 2  2013     1     1       NA           1935        NA       NA           2240
...

Missing values

Can use to help identify where NA come from. e.g., why are there air_time NAs?

  • Let’s examine how dep_time, dep_delay, and sched_dep_time are related.
flights %>% 
  mutate(missing_dep_time = is.na(dep_time),
         missing_dep_delay = is.na(dep_delay),
         missing_sched_dep_time = is.na(sched_dep_time)) %>% 
  count(missing_dep_time, missing_dep_delay, missing_sched_dep_time)
# A tibble: 2 × 4
  missing_dep_time missing_dep_delay missing_sched_dep_time      n
  <lgl>            <lgl>             <lgl>                   <int>
1 FALSE            FALSE             FALSE                  328521
2 TRUE             TRUE              FALSE                    8255
  • The only instances where dep_delay is missing have dep_time missing.

Missing values

  • Is it the case that dep_delay = dep_time - sched_dep_time?
flights %>% 
  mutate(dep_delay_manual = dep_time - sched_dep_time,
         manual_matches_given = near(dep_delay_manual, dep_delay)) %>%
  count(manual_matches_given)
# A tibble: 3 × 2
  manual_matches_given      n
  <lgl>                 <int>
1 FALSE                 99777
2 TRUE                 228744
3 NA                     8255

Quite weird, since we are getting a lot right but also getting a lot wrong.

Missing values

Let’s inspect further. What do the mismatched observations look like?

flights %>% 
  mutate(manual_delay = dep_time - sched_dep_time,
         manual_matches_given = near(manual_delay, dep_delay)) %>%
  filter(!manual_matches_given) %>%
  select(time_hour, flight, dep_time, sched_dep_time, dep_delay, manual_delay)
# A tibble: 99,777 × 6
   time_hour           flight dep_time sched_dep_time dep_delay manual_delay
   <dttm>               <int>    <int>          <int>     <dbl>        <int>
 1 2013-01-01 06:00:00    461      554            600        -6          -46
 2 2013-01-01 06:00:00    507      555            600        -5          -45
 3 2013-01-01 06:00:00   5708      557            600        -3          -43
...

Problem: R is treating dep_time and sched_dep_time as integers, not time!

  • 5:54 is only 6 minutes away from 6:00, rather than 46.
  • We will later see how to properly treat dates and times.

Boolean algebra

  • Recall the basic Boolean algebra comparators, AND and OR
  • There is a third one, XOR, which we won’t use that often
  • Can combine AND/OR with NOT and cover any combination of a pair of Booleans

Visualization of Boolean algebra

Boolean algebra and missing values

Booleans and missing values interact in logical, but maybe counterintuitive ways.

df <- tibble(x = c(TRUE, FALSE, NA))
df %>% 
  mutate(
    and_NA = x & NA,
    or_NA = x | NA
  )
# A tibble: 3 × 3
  x     and_NA or_NA
  <lgl> <lgl>  <lgl>
1 TRUE  NA     TRUE 
2 FALSE FALSE  NA   
3 NA    NA     NA   
  • NA OR TRUE returns true, since it is TRUE regardless of NA being FALSE or TRUE.
  • NA AND TRUE returns NA since it depends on value of NA.
  • NA OR FALSE returns NA since it depends on value of NA.
  • NA AND FALSE returns FALSE since NA value doesn’t affect result, always false.

Consider finding all flights departing in November or December.

flights %>%  # results in correct calculation
  filter(month == 11 | month == 12)
flights %>%  # results in incorrect calculation. Why?
  filter(month == 11 | 12)
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
...
  • R first evaluates month==11, creating a logical vector vec.
  • R then compares vec | 12.
  • When comparing a number to a logical, any nonzero number is considered as TRUE.
  • So vec | 12 returns a vector with TRUE for every element.

%in%

Instead of worrying about | and == in order, just use %in%.

1:10 %in% c(1, 5, 10)
 [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE

So to find all flights from November and December:

flights %>%
  filter(month %in% c(11, 12))
  • %in% obeys different rules for NA vs. ==, since NA %in% NA is TRUE:
(c(1,2,NA) == NA)
[1] NA NA NA
(c(1,2,NA) %in% NA)
[1] FALSE FALSE  TRUE

Summarizing logical vectors

Logical summaries

Two main functions to summarize a logical vector:

  • any(x) returns TRUE if any value in x is TRUE
  • all(x) returns TRUE only if all values in x are TRUE

E.g., was there a day where:

  • every flight was delayed on departure by less than an hour?
  • any flight was delayed on arrival by \(\geq 5\) hours?
flights %>%
  group_by(year, month, day) %>%
  summarize(
    all_delayed = all(dep_delay < 60, na.rm=TRUE),
    any_long_delay = any(arr_delay >= 300, na.rm=TRUE)
  )
# A tibble: 365 × 5
# Groups:   year, month [12]
    year month   day all_delayed any_long_delay
   <int> <int> <int> <lgl>       <lgl>         
 1  2013     1     1 FALSE       TRUE          
 2  2013     1     2 FALSE       TRUE          
 3  2013     1     3 FALSE       FALSE         
 4  2013     1     4 FALSE       FALSE         
...

Logical summaries

  • When coerced into a numeric, TRUE coerces to 1 and FALSE coerces to 0
  • Useful if you want to find proportions that are TRUE/FALSE, e.g. mean(), sum()
x <- c(TRUE, FALSE, FALSE, FALSE, TRUE)
sum(x)  # number of TRUE elements in x
[1] 2
mean(x)  # proportion of TRUE elements in x
[1] 0.4
length(x) - sum(x)  # number of FALSE elements in x
[1] 3
1 - mean(x)  # proportion of FALSE elements in x
[1] 0.6

Logical summaries

  • Example: proportion of flights delayed > 1 hour on departure, and number of flights delayed on arrival by > 5 hours:
flights %>% 
  group_by(year, month, day) %>%
  summarise(
    prop_delayed_1hour = mean(dep_delay > 60, na.rm=TRUE),
    num_long_delay = sum(arr_delay > 300, na.rm=TRUE)
  )
# A tibble: 365 × 5
# Groups:   year, month [12]
    year month   day prop_delayed_1hour num_long_delay
   <int> <int> <int>              <dbl>          <int>
 1  2013     1     1             0.0609              3
 2  2013     1     2             0.0856              3
 3  2013     1     3             0.0586              0
...

Note output: # Groups: year, month [12]

Logical summaries

  • Example: proportion of flights delayed > 1 hour on departure, and number of flights delayed on arrival by > 5 hours:
flights %>% 
  group_by(year, month, day) %>%
  summarise(
    prop_delayed_1hour = mean(dep_delay > 60, na.rm=TRUE),
    num_long_delay = sum(arr_delay > 300, na.rm=TRUE),
    .groups = 'drop'
  )
# A tibble: 365 × 5
    year month   day prop_delayed_1hour num_long_delay
   <int> <int> <int>              <dbl>          <int>
 1  2013     1     1             0.0609              3
 2  2013     1     2             0.0856              3
 3  2013     1     3             0.0586              0
 4  2013     1     4             0.0473              0
 5  2013     1     5             0.0363              1
...

Logical subsetting

Average delay for delayed flights:

flights |> 
  filter(arr_delay > 0) |> 
  group_by(year, month, day) |> 
  summarize(
    behind = mean(arr_delay),
    n_flight = n(),
    .groups = 'drop'
  )
# A tibble: 365 × 5
    year month   day behind n_flight
   <int> <int> <int>  <dbl>    <int>
 1  2013     1     1   32.5      461
 2  2013     1     2   32.0      535
...

n() gives number of delayed flights per group.

Another way is to subset using logical vectors and the subset operator []:

flights |> 
  group_by(year, month, day) |> 
  summarize(
    behind = mean(arr_delay[arr_delay > 0], na.rm=TRUE),
    n_flight = n(),
    .groups = 'drop'
  )
# A tibble: 365 × 5
    year month   day behind n_flight
   <int> <int> <int>  <dbl>    <int>
 1  2013     1     1   32.5      842
 2  2013     1     2   32.0      943
...

n() gives total \(\#\) of flights per group, not ideal.

Conditional transformations

Conditional transformations: if_else()

if_else(CONDITION, TRUE_VAL, FALSE_VAL, MISSING_VAL) is useful when we want to return some value when condition is TRUE and return another value when condition is FALSE.

x <- c(-2, -1, 1, 2, NA)
if_else(x > 0, "yay", "boo")
[1] "boo" "boo" "yay" "yay" NA   

The fourth argument of if_else() specifies what to fill NA’s with:

if_else(x > 0, "yay", "boo", "idk how i feel yet")
[1] "boo"                "boo"                "yay"               
[4] "yay"                "idk how i feel yet"

We can also use vectors as an argument.

if_else(x < 0, -x, x)
[1]  2  1  1  2 NA

Conditional transformations: if_else()

We can use general vectors inside of if_else():

x1 <- c(NA, 1, 2, NA)
y1 <- c(3, NA, 4, 6)
if_else(is.na(x1), y1, x1)
[1] 3 1 2 6

If you have many different conditions for which you want to specify values, e.g.

  • If number is between a and b then do…
  • If number is between b and c then do…
  • If number is between c and d then do…

You can use case_when().

Conditional transformations: case_when()

Inspired by SQL’s CASE statement. Has a very weird syntax:

  • condition ~ output
  • condition is a logical vector
  • when condition is TRUE, output is used.

Weird, but pretty readable:

x <- c(-3:3, NA)
case_when(
  x == 0   ~ "0",
  x < 0    ~ "-ve", 
  x > 0    ~ "+ve",
  is.na(x) ~ "???"
)
[1] "-ve" "-ve" "-ve" "0"   "+ve" "+ve" "+ve" "???"

Conditional transformations: case_when()

If no cases match, then returns NA:

x <- c(-2:2, NA)
case_when(
  x < 0 ~ "-ve",
  x > 0 ~ "+ve"
)
[1] "-ve" "-ve" NA    "+ve" "+ve" NA   

If multiple conditions are satisfied, only the first is used – be careful!

case_when(
  x > 0 ~ "+ve",
  x > 2 ~ "big"
)
[1] NA    NA    NA    "+ve" "+ve" NA   

The argument .default specifies return value if condition is satisfied, or if value is NA.

case_when(
  x < 0 ~ "-ve",
  x > 0 ~ "+ve",
  .default = "???"
)
[1] "-ve" "-ve" "???" "+ve" "+ve" "???"

case_when(): more complex example

Provide human-readable labels to flight delays.

flights |> 
  mutate(
    status = case_when(
      is.na(arr_delay)      ~ "cancelled",
      arr_delay < -30       ~ "very early",
      arr_delay < -15       ~ "early",
      abs(arr_delay) <= 15  ~ "on time",
      arr_delay < 60        ~ "late",
      arr_delay < Inf       ~ "very late",
    ),
    .keep = "used" # only returns those columns used in calculations
  )
# A tibble: 336,776 × 2
   arr_delay status 
       <dbl> <chr>  
 1        11 on time
 2        20 late   
...
  • We can refer to variables inside the dataframe inside case_when(), just as in most other tidyverse functions .
  • The first conditional that is true is what gets assigned.
  • So when arr_delay < -30, the remaining conditionals do not get checked.

Compatible types

Both if_else() and case_when() require the outputs to be of consistent types.

if_else(TRUE, "a", 1)
#> Error in `if_else()`:
#> ! Can't combine `true` <character> and `false` <double>.

case_when(
  x < -1 ~ TRUE,
  x > 0 ~ now()
)
#> Error in `case_when()`:
#> ! Can't combine `..1 (right)` <logical> and `..2 (right)` <datetime<local>>

Most types are incompatible in order to catch errors. Compatible types:

  • Numeric and logical (treats TRUE=1, FALSE=0)
  • Dates and “date-times” - we will discuss these types later
  • NA is compatible with everything
  • Strings and factors are compatible - will discuss later

Example: labelling numbers as even or odd

  • Number is even \(\Leftrightarrow\) number is divisible by two.
  • In R, operator %% (read “modulo”) does “modular arithmetic”:
  • a %% b returns the remainder when dividing a by b, e.g.
    • 17 %% 12 = 5
    • 34 %% 6 = 4
  • A number n is even if and only if n %% 2 == 0; otherwise, odd.
  • We can use if_else to label numbers between 0 and 20 as even or odd
x <- 0:20
if_else(x %% 2 == 0, 'even', 'odd')
 [1] "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd" 
[11] "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd" 
[21] "even"