03: Transformations of logical vectors

STA35B: Statistical Data Science 2

Akira Horiguchi

Producing logical vectors

Intro

For logical vectors, every element takes one of 3 values: TRUE, FALSE, NA

We’ll investigate how to manipulate and transform data to get logicals, and how to use logicals.

library(tidyverse)
library(nycflights13)
flights

# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
...

Three basic logical operators that we will use over and over:

AND (denoted & in R): operation between two logicals
OR (denoted | in R): operation between two logicals
NOT (denoted ! in R): operation on a single logical.

Truth table for AND:

A	B	A `&` B
`TRUE`	`TRUE`	`TRUE`
`TRUE`	`FALSE`	`FALSE`
`FALSE`	`TRUE`	`FALSE`
`FALSE`	`FALSE`	`FALSE`

Truth table for OR:

A	B	A `|` B
`TRUE`	`TRUE`	`TRUE`
`TRUE`	`FALSE`	`TRUE`
`FALSE`	`TRUE`	`TRUE`
`FALSE`	`FALSE`	`FALSE`

Truth table for NOT:

A	`!` A
`TRUE`	`FALSE`
`FALSE`	`TRUE`
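The three tables can be reproduced directly in R. The following is a small sketch using only base R; the data frame name `truth` and its column names are chosen here for illustration.

```r
# Enumerate every combination of two logical values.
vals <- c(TRUE, FALSE)
truth <- expand.grid(A = vals, B = vals)

# Apply the three operators element by element.
truth$A_and_B <- truth$A & truth$B
truth$A_or_B  <- truth$A | truth$B
truth$not_A   <- !truth$A
truth
```

Each row of `truth` corresponds to one row of the AND and OR tables.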

Comparisons

Common way to create a logical vector: numeric comparison with <, !=, etc.

We have implicitly been using this when doing filtering.

flights$dep_time > 600

    [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
   [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
...

Applying a logical operator (& or |) to two logical vectors works element by element: the first elements are combined, then the second elements, and so on.

x <- c(TRUE, FALSE, TRUE)
y <- c(FALSE, FALSE, TRUE)
(x & y) # x AND y

[1] FALSE FALSE  TRUE

(x | y) # x OR y

[1]  TRUE FALSE  TRUE

Comparisons

So when we use multiple comparisons in filter(), we are building a new vector of logicals.

We only keep those rows where the vector is TRUE.

flights %>%
  filter(dep_time > 600 & dep_time < 2000 & abs(arr_delay) < 20)

# A tibble: 172,286 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      601            600         1      844            850
 2  2013     1     1      602            610        -8      812            820
 3  2013     1     1      602            605        -3      821            805
...
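One consequence is worth seeing on a small case: filter() drops rows where the condition is NA as well as rows where it is FALSE. A minimal sketch with a made-up tibble (`toy` and its values are hypothetical, not taken from flights):

```r
library(dplyr)

# Hypothetical departure times; one value is missing.
toy <- tibble(dep_time = c(530, 700, NA, 2100))

# The condition is FALSE, TRUE, NA, FALSE for the four rows.
keep <- toy$dep_time > 600 & toy$dep_time < 2000
keep

# Only the row where the condition is TRUE survives.
kept <- toy %>% filter(dep_time > 600 & dep_time < 2000)
kept
```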

Comparisons

filter() and mutate() can be used together: name the intermediate logical vectors with mutate(), then filter on them.

flights %>%
  mutate(
    daytime = dep_time > 600 & dep_time < 2000,
    approx_ontime = abs(arr_delay) < 20,
  ) %>%
  filter(daytime & approx_ontime)

# A tibble: 172,286 × 21
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      601            600         1      844            850
 2  2013     1     1      602            610        -8      812            820
 3  2013     1     1      602            605        -3      821            805
...

Floating point comparisons

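Testing equality with == on floating-point numbers can cause problems. Numbers are represented with finite precision: R stores them as double-precision values with a 53-bit significand, which is roughly 15 to 17 significant decimal digits, so many values are only approximated.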

x <- c( (1/49) * 49, sqrt(2)^2)
x == c(1,2)

[1] FALSE FALSE

What’s going on? Let’s print x with more digits.

print(x, digits=10)

[1] 1 2

print(x, digits=20)

[1] 0.99999999999999988898 2.00000000000000044409

dplyr::near() helps with this: it treats two numbers as equal when they differ by less than a small tolerance.

near(x, c(1,2))

[1] TRUE TRUE

all.equal(x, c(1,2))  # returns single value

[1] TRUE
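The idea behind near() can be sketched by hand: compare the absolute difference against a small tolerance. The tolerance used below, sqrt(.Machine$double.eps), is chosen here for illustration; see ?near for the default that dplyr uses.

```r
x <- c((1 / 49) * 49, sqrt(2)^2)

# A small tolerance (about 1.5e-8 on typical machines).
tol <- sqrt(.Machine$double.eps)

# "Close enough" comparison, element by element.
close_enough <- abs(x - c(1, 2)) < tol
close_enough

# all.equal() returns TRUE or a character description of the difference,
# so wrap it in isTRUE() when a single TRUE/FALSE is needed.
same <- isTRUE(all.equal(x, c(1, 2)))
same
```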

Missing values

Almost any operation involving an NA returns NA.

(NA > 5)

[1] NA

(10 == NA)

[1] NA

What about NA==NA?

NA==NA

[1] NA

Why? Think of this example

# Suppose we don't know Ant's age
age_ant <- NA

# And we also don't know Bug's age
age_bug <- NA

# Then we shouldn't know whether Ant and
# Bug are the same age
age_ant == age_bug

[1] NA

Missing values

A useful function for dealing with NA: is.na()

is.na(x) works with any type of vector and returns TRUE for missing values and FALSE for everything else:

is.na(c(TRUE, NA, FALSE))

[1] FALSE  TRUE FALSE

is.na(c(1, NA, 3))

[1] FALSE  TRUE FALSE

is.na(c("a", NA, "b"))

[1] FALSE  TRUE FALSE

Missing values

Since is.na() returns a logical vector, it can be used in filter():

flights %>%
  filter(is.na(dep_time))

# A tibble: 8,255 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1       NA           1630        NA       NA           1815
 2  2013     1     1       NA           1935        NA       NA           2240
...

Missing values

is.na() can also help identify where NAs come from, e.g., why are there air_time NAs?

Let’s examine how dep_time, dep_delay, and sched_dep_time are related.

flights %>% 
  mutate(missing_dep_time = is.na(dep_time),
         missing_dep_delay = is.na(dep_delay),
         missing_sched_dep_time = is.na(sched_dep_time)) %>% 
  count(missing_dep_time, missing_dep_delay, missing_sched_dep_time)

# A tibble: 2 × 4
  missing_dep_time missing_dep_delay missing_sched_dep_time      n
  <lgl>            <lgl>             <lgl>                   <int>
1 FALSE            FALSE             FALSE                  328521
2 TRUE             TRUE              FALSE                    8255

dep_delay is missing exactly when dep_time is missing, and sched_dep_time is never missing.

Missing values

Is it the case that dep_delay = dep_time - sched_dep_time?

flights %>% 
  mutate(dep_delay_manual = dep_time - sched_dep_time,
         manual_matches_given = near(dep_delay_manual, dep_delay)) %>%
  count(manual_matches_given)

# A tibble: 3 × 2
  manual_matches_given      n
  <lgl>                 <int>
1 FALSE                 99777
2 TRUE                 228744
3 NA                     8255

Quite weird: the manual calculation matches for most flights but is wrong for a lot of others.

Missing values

Let’s inspect further. What do the mismatched observations look like?

flights %>% 
  mutate(manual_delay = dep_time - sched_dep_time,
         manual_matches_given = near(manual_delay, dep_delay)) %>%
  filter(!manual_matches_given) %>%
  select(time_hour, flight, dep_time, sched_dep_time, dep_delay, manual_delay)

# A tibble: 99,777 × 6
   time_hour           flight dep_time sched_dep_time dep_delay manual_delay
   <dttm>               <int>    <int>          <int>     <dbl>        <int>
 1 2013-01-01 06:00:00    461      554            600        -6          -46
 2 2013-01-01 06:00:00    507      555            600        -5          -45
 3 2013-01-01 06:00:00   5708      557            600        -3          -43
...

Problem: R is treating dep_time and sched_dep_time as plain integers (554 stands for 5:54), not as times!

5:54 is only 6 minutes away from 6:00, rather than 46.
We will later see how to properly treat dates and times.
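Until then, one way to see the issue is to convert the HHMM integers to minutes since midnight before subtracting. This is only a sketch: the helper `hhmm_to_minutes()` is a hypothetical name introduced here, and it does not handle flights whose actual departure slips past midnight.

```r
# Hypothetical helper: 554 (i.e. 5:54) -> 5 * 60 + 54 = 354 minutes.
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100) * 60 + (hhmm %% 100)
}

# The first three mismatched rows from the output above.
dep_time       <- c(554L, 555L, 557L)
sched_dep_time <- c(600L, 600L, 600L)

manual_delay <- hhmm_to_minutes(dep_time) - hhmm_to_minutes(sched_dep_time)
manual_delay  # -6 -5 -3, matching dep_delay for these rows
```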

Boolean algebra

Recall the basic Boolean algebra operators, AND and OR.
There is a third one, XOR (exclusive or), which we won’t use that often.
Combining AND/OR with NOT covers any combination of a pair of Booleans.
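As a small illustration in base R, XOR can be written with AND, OR, and NOT, and "neither x nor y" can be written in two equivalent ways. The variable names below are chosen here for illustration.

```r
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)

# XOR is TRUE when exactly one of the two inputs is TRUE.
xor_builtin <- xor(x, y)

# The same result built from AND, OR, and NOT.
xor_manual <- (x | y) & !(x & y)

# "Neither x nor y", written two equivalent ways (De Morgan's law).
neither_a <- !(x | y)
neither_b <- !x & !y
```

Note that ! binds more tightly than & and |, so `!x & !y` is read as `(!x) & (!y)`.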

Boolean algebra and missing values

Booleans and missing values interact in logical, but maybe counterintuitive ways.

df <- tibble(x = c(TRUE, FALSE, NA))
df %>% 
  mutate(
    and_NA = x & NA,
    or_NA = x | NA
  )

# A tibble: 3 × 3
  x     and_NA or_NA
  <lgl> <lgl>  <lgl>
1 TRUE  NA     TRUE 
2 FALSE FALSE  NA   
3 NA    NA     NA

NA OR TRUE returns TRUE, since the result is TRUE whether the NA stands for FALSE or TRUE.
NA AND TRUE returns NA, since the result depends on the value the NA stands for.
NA OR FALSE returns NA, since the result depends on the value the NA stands for.
NA AND FALSE returns FALSE, since the result is FALSE whatever the NA stands for.
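These rules can be checked by trying both values that a missing logical could have taken. A minimal base R sketch (the name `possible` is introduced here for illustration):

```r
# The two values a missing logical could have taken.
possible <- c(TRUE, FALSE)

or_true   <- TRUE  | possible  # TRUE  TRUE  -> same either way, so NA | TRUE is TRUE
and_false <- FALSE & possible  # FALSE FALSE -> same either way, so NA & FALSE is FALSE
and_true  <- TRUE  & possible  # TRUE  FALSE -> depends on the value, so NA & TRUE is NA
or_false  <- FALSE | possible  # TRUE  FALSE -> depends on the value, so NA | FALSE is NA
```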

Consider finding all flights departing in November or December.

flights %>%  # results in correct calculation
  filter(month == 11 | month == 12)

flights %>%  # results in incorrect calculation. Why?
  filter(month == 11 | 12)

# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
...

R first evaluates month == 11, creating a logical vector vec.
R then evaluates vec | 12.
When a number is used with a logical operator, it is coerced to a logical: any nonzero number is treated as TRUE.
So vec | 12 returns a vector with TRUE for every element, and filter() keeps every row.
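The same steps can be traced on a three-element vector in base R. The vector `month` below is a made-up stand-in for the flights column.

```r
month <- c(1L, 11L, 12L)

vec     <- month == 11               # FALSE TRUE FALSE
coerced <- as.logical(12)            # TRUE: a nonzero number becomes TRUE
wrong   <- vec | 12                  # TRUE TRUE TRUE: every row would be kept
right   <- month == 11 | month == 12 # FALSE TRUE TRUE
```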

`%in%`

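Instead of worrying about getting | and == in the right order, just use %in%. x %in% y returns a logical vector of the same length as x that is TRUE wherever a value of x appears anywhere in y.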

1:10 %in% c(1, 5, 10)

 [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE

So to find all flights from November and December:

flights %>%
  filter(month %in% c(11, 12))

%in% obeys different rules for NA than == does: NA %in% NA is TRUE, whereas NA == NA is NA.

(c(1,2,NA) == NA)

[1] NA NA NA

(c(1,2,NA) %in% NA)

[1] FALSE FALSE  TRUE
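A short sketch of how this plays out with a missing month, and of how to negate %in%. The vector `month` is made up for illustration.

```r
month <- c(1, 11, 12, NA)

# %in% never returns NA: the missing month is simply "not in the set".
in_nov_dec  <- month %in% c(11, 12)      # FALSE TRUE TRUE FALSE

# Wrap the whole expression in parentheses before negating it.
not_nov_dec <- !(month %in% c(11, 12))   # TRUE FALSE FALSE TRUE

# With ==, the missing month gives NA instead of FALSE.
with_equals <- month == 11 | month == 12 # FALSE TRUE TRUE NA
```

Inside filter() the two versions keep the same rows, since NA rows are dropped, but they differ once negated: the missing month is TRUE in `not_nov_dec` and would still be NA in `!with_equals`.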

Summarizing logical vectors

Logical summaries

Two main functions to summarize a logical vector:

any(x) returns TRUE if any value in x is TRUE
all(x) returns TRUE only if all values in x are TRUE

E.g., was there a day where:

every flight was delayed on departure by less than an hour?
any flight was delayed on arrival by \(\geq 5\) hours?

flights %>%
  group_by(year, month, day) %>%
  summarize(
    all_delayed = all(dep_delay < 60, na.rm=TRUE),
    any_long_delay = any(arr_delay >= 300, na.rm=TRUE)
  )

# A tibble: 365 × 5
# Groups:   year, month [12]
    year month   day all_delayed any_long_delay
   <int> <int> <int> <lgl>       <lgl>         
 1  2013     1     1 FALSE       TRUE          
 2  2013     1     2 FALSE       TRUE          
 3  2013     1     3 FALSE       FALSE         
 4  2013     1     4 FALSE       FALSE         
...
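The na.rm = TRUE argument matters because any() and all() follow the same missing-value logic as | and &. A minimal base R sketch (variable names chosen here for illustration):

```r
any_unknown <- any(c(FALSE, NA))                # NA: the missing value might be TRUE
any_removed <- any(c(FALSE, NA), na.rm = TRUE)  # FALSE
any_known   <- any(c(TRUE, NA))                 # TRUE regardless of the NA

all_unknown <- all(c(TRUE, NA))                 # NA: the missing value might be FALSE
all_removed <- all(c(TRUE, NA), na.rm = TRUE)   # TRUE
all_known   <- all(c(FALSE, NA))                # FALSE regardless of the NA
```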

Logical summaries

When coerced to numeric, TRUE becomes 1 and FALSE becomes 0.
This is useful for counting TRUE values with sum() and finding the proportion of TRUE values with mean().

x <- c(TRUE, FALSE, FALSE, FALSE, TRUE)
sum(x)  # number of TRUE elements in x

[1] 2

mean(x)  # proportion of TRUE elements in x

[1] 0.4

length(x) - sum(x)  # number of FALSE elements in x

[1] 3

1 - mean(x)  # proportion of FALSE elements in x

[1] 0.6
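The same trick combines with is.na() to count missing values, and na.rm = TRUE is needed when the logical vector itself contains NA. A small base R sketch with a made-up vector:

```r
x <- c(TRUE, NA, FALSE, TRUE, NA)

n_missing    <- sum(is.na(x))           # 2 missing values
prop_missing <- mean(is.na(x))          # 0.4

n_true_raw <- sum(x)                    # NA, because x contains NA
n_true     <- sum(x, na.rm = TRUE)      # 2
prop_true  <- mean(x, na.rm = TRUE)     # 2 of the 3 non-missing values
```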

Logical summaries

Example: proportion of flights delayed > 1 hour on departure, and number of flights delayed on arrival by > 5 hours:

flights %>% 
  group_by(year, month, day) %>%
  summarise(
    prop_delayed_1hour = mean(dep_delay > 60, na.rm=TRUE),
    num_long_delay = sum(arr_delay > 300, na.rm=TRUE)
  )

# A tibble: 365 × 5
# Groups:   year, month [12]
    year month   day prop_delayed_1hour num_long_delay
   <int> <int> <int>              <dbl>          <int>
 1  2013     1     1             0.0609              3
 2  2013     1     2             0.0856              3
 3  2013     1     3             0.0586              0
...

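Note the line # Groups: year, month [12] in the output: by default summarise() drops only the last grouping variable (day), so the result is still grouped by year and month.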

Logical summaries

Same example, now with .groups = 'drop' so that the result is no longer grouped:

flights %>% 
  group_by(year, month, day) %>%
  summarise(
    prop_delayed_1hour = mean(dep_delay > 60, na.rm=TRUE),
    num_long_delay = sum(arr_delay > 300, na.rm=TRUE),
    .groups = 'drop'
  )

# A tibble: 365 × 5
    year month   day prop_delayed_1hour num_long_delay
   <int> <int> <int>              <dbl>          <int>
 1  2013     1     1             0.0609              3
 2  2013     1     2             0.0856              3
 3  2013     1     3             0.0586              0
 4  2013     1     4             0.0473              0
 5  2013     1     5             0.0363              1
...

Logical subsetting

Average delay for delayed flights:

flights |> 
  filter(arr_delay > 0) |> 
  group_by(year, month, day) |> 
  summarize(
    behind = mean(arr_delay),
    n_flight = n(),
    .groups = 'drop'
  )

# A tibble: 365 × 5
    year month   day behind n_flight
   <int> <int> <int>  <dbl>    <int>
 1  2013     1     1   32.5      461
 2  2013     1     2   32.0      535
...

n() gives the number of delayed flights per group, because the other flights were filtered out first.

Another way is to subset using logical vectors and the subset operator []:

flights |> 
  group_by(year, month, day) |> 
  summarize(
    behind = mean(arr_delay[arr_delay > 0], na.rm=TRUE),
    n_flight = n(),
    .groups = 'drop'
  )

# A tibble: 365 × 5
    year month   day behind n_flight
   <int> <int> <int>  <dbl>    <int>
 1  2013     1     1   32.5      842
 2  2013     1     2   32.0      943
...

n() gives the total number of flights per group rather than the number of delayed flights, which is not ideal.
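One possible way to get both counts in a single summary is to sum the logical condition. The sketch below uses a made-up tibble (`toy`, with hypothetical values) rather than flights, and the column name `n_delayed` is introduced here for illustration.

```r
library(dplyr)

# Hypothetical data: three flights on day 1, two on day 2.
toy <- tibble(
  day       = c(1, 1, 1, 2, 2),
  arr_delay = c(10, -5, NA, 30, 20)
)

daily <- toy |>
  group_by(day) |>
  summarize(
    behind    = mean(arr_delay[arr_delay > 0], na.rm = TRUE), # mean delay of delayed flights
    n_delayed = sum(arr_delay > 0, na.rm = TRUE),             # number of delayed flights
    n_flight  = n(),                                          # number of all flights
    .groups = "drop"
  )
daily
```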

Conditional transformations

Conditional transformations: `if_else()`

if_else(CONDITION, TRUE_VAL, FALSE_VAL, MISSING_VAL) is useful when we want to return some value when condition is TRUE and return another value when condition is FALSE.

x <- c(-2, -1, 1, 2, NA)
if_else(x > 0, "yay", "boo")

[1] "boo" "boo" "yay" "yay" NA

The fourth argument of if_else() specifies the value to return where the condition is NA:

if_else(x > 0, "yay", "boo", "idk how i feel yet")

[1] "boo"                "boo"                "yay"               
[4] "yay"                "idk how i feel yet"

The TRUE_VAL and FALSE_VAL arguments can also be vectors, here -x and x:

if_else(x < 0, -x, x)

[1]  2  1  1  2 NA

Conditional transformations: `if_else()`

We can use general vectors inside of if_else():

x1 <- c(NA, 1, 2, NA)
y1 <- c(3, NA, 4, 6)
if_else(is.na(x1), y1, x1)

[1] 3 1 2 6
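This "fill in missing values from another vector" pattern is common enough that dplyr also provides coalesce(). A brief sketch comparing the two on the vectors above:

```r
library(dplyr)

x1 <- c(NA, 1, 2, NA)
y1 <- c(3, NA, 4, 6)

# Take y1 where x1 is missing, otherwise x1.
via_if_else  <- if_else(is.na(x1), y1, x1)

# coalesce() returns, position by position, the first non-missing value.
via_coalesce <- coalesce(x1, y1)
```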

If you have many different conditions for which you want to specify values, e.g.

If number is between a and b then do…
If number is between b and c then do…
If number is between c and d then do…

You can use case_when().

Conditional transformations: `case_when()`

Inspired by SQL’s CASE statement. Has a very weird syntax:

condition ~ output
condition is a logical vector
when condition is TRUE, output is used.

Weird, but pretty readable:

x <- c(-3:3, NA)
case_when(
  x == 0   ~ "0",
  x < 0    ~ "-ve", 
  x > 0    ~ "+ve",
  is.na(x) ~ "???"
)

[1] "-ve" "-ve" "-ve" "0"   "+ve" "+ve" "+ve" "???"

Conditional transformations: `case_when()`

If no cases match, then returns NA:

x <- c(-2:2, NA)
case_when(
  x < 0 ~ "-ve",
  x > 0 ~ "+ve"
)

[1] "-ve" "-ve" NA    "+ve" "+ve" NA

If multiple conditions are satisfied, only the first is used – be careful!

case_when(
  x > 0 ~ "+ve",
  x > 1 ~ "big"  # never used: x = 2 already matched x > 0
)

[1] NA    NA    NA    "+ve" "+ve" NA

The argument .default specifies the return value when no condition is satisfied, which includes the case where the value is NA.

case_when(
  x < 0 ~ "-ve",
  x > 0 ~ "+ve",
  .default = "???"
)

[1] "-ve" "-ve" "???" "+ve" "+ve" "???"

`case_when()`: more complex example

Provide human-readable labels to flight delays.

flights |> 
  mutate(
    status = case_when(
      is.na(arr_delay)      ~ "cancelled",
      arr_delay < -30       ~ "very early",
      arr_delay < -15       ~ "early",
      abs(arr_delay) <= 15  ~ "on time",
      arr_delay < 60        ~ "late",
      arr_delay < Inf       ~ "very late",
    ),
    .keep = "used" # only returns those columns used in calculations
  )

# A tibble: 336,776 × 2
   arr_delay status 
       <dbl> <chr>  
 1        11 on time
 2        20 late   
...

We can refer to variables inside the dataframe inside case_when(), just as in most other tidyverse functions.
The first conditional that is true is what gets assigned.
So when arr_delay < -30, the remaining conditionals do not get checked.
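Because only the first matching condition is used, the order of the conditions matters. The sketch below uses a made-up vector of delays and a simplified version of the labels above to show what goes wrong when a broad condition is listed first.

```r
library(dplyr)

# Hypothetical arrival delays in minutes.
arr_delay <- c(-40, -20, 5, 45, 90)

# Narrow conditions first: each value reaches the label intended for it.
good <- case_when(
  arr_delay < -30 ~ "very early",
  arr_delay < -15 ~ "early",
  arr_delay <= 15 ~ "on time",
  arr_delay < 60  ~ "late",
  .default = "very late"
)

# Broad condition first: it captures everything below 60,
# so the later conditions are never reached.
bad <- case_when(
  arr_delay < 60  ~ "late",
  arr_delay < -30 ~ "very early",
  arr_delay < -15 ~ "early",
  .default = "very late"
)
```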

Compatible types

Both if_else() and case_when() require the outputs to be of consistent types.

if_else(TRUE, "a", 1)
#> Error in `if_else()`:
#> ! Can't combine `true` <character> and `false` <double>.

case_when(
  x < -1 ~ TRUE,
  x > 0 ~ now()
)
#> Error in `case_when()`:
#> ! Can't combine `..1 (right)` <logical> and `..2 (right)` <datetime<local>>

Most types are incompatible, which helps catch errors. The compatible combinations are:

Numeric and logical (treats TRUE=1, FALSE=0)
Dates and “date-times” - we will discuss these types later
NA is compatible with everything
Strings and factors are compatible - will discuss later
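A few of these combinations can be tried on a small vector. This sketch assumes a recent dplyr (1.1.0 or later, matching the error messages shown above); earlier versions of if_else() were stricter about types.

```r
library(dplyr)

x <- c(-2, 0, 3)

# Numeric and logical combine; FALSE becomes 0.
num_lgl <- if_else(x > 0, 1, FALSE)              # 0 0 1

# NA combines with anything; here the result stays numeric.
with_na <- if_else(x > 0, x, NA)                 # NA NA 3

# Incompatible types need an explicit conversion to a common type.
as_text <- if_else(x > 0, "a", as.character(1))  # "1" "1" "a"
```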

Example: labelling numbers as even or odd

Number is even \(\Leftrightarrow\) number is divisible by two.
In R, operator %% (read “modulo”) does “modular arithmetic”:
a %% b returns the remainder when dividing a by b, e.g.
- 17 %% 12 = 5
- 34 %% 6 = 4
A number n is even if and only if n %% 2 == 0; otherwise, odd.
We can use if_else() to label the numbers from 0 to 20 as even or odd.

x <- 0:20
if_else(x %% 2 == 0, 'even', 'odd')

 [1] "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd" 
[11] "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd" 
[21] "even"
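The evenness test also works for negative numbers, because in R the result of %% takes the sign of the divisor. A short base R sketch, which also uses the integer-division operator %/% (not covered above) to check the results:

```r
a <- c(17, 34, -7)
b <- c(12, 6, 2)

remainder <- a %% b    # 5 4 1
quotient  <- a %/% b   # 1 5 -4 (integer division rounds down)

# The two operators fit together: quotient * b + remainder gives back a.
rebuilt <- quotient * b + remainder

# n %% 2 is 0 or 1 even for negative n, so the same test applies.
is_even <- (-3:3) %% 2 == 0   # FALSE TRUE FALSE TRUE FALSE TRUE FALSE
```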
