3.2 Explore data: transforming values

STA141A: Fundamentals of Statistical Data Science

Akira Horiguchi
library(tidyverse)  # Do at the start of every session
library(nycflights13) # loads the `flights` tibble into environment

Previous material you will likely use for any data science project you work on

  • This slide deck will be a bit more specialized
  • Some material will be a lifesaver; some you will never use after this class
    • But we don’t know which is which!

Operating on logical vectors

Intro

For logical vectors, every element takes one of 3 values: TRUE, FALSE, NA

We’ll investigate how to manipulate and transform data to get logicals, and how to use logicals.

library(tidyverse)
library(nycflights13)
flights
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
...

Three basic logical operators that we will use over and over:

  • AND (denoted & in R): operation between two logicals
  • OR (denoted | in R): operation between two logicals
  • NOT (denoted ! in R): operation on a single logical.

Truth table for AND:

A B A & B
TRUE TRUE TRUE
TRUE FALSE FALSE
FALSE TRUE FALSE
FALSE FALSE FALSE

Truth table for OR:

A B A | B
TRUE TRUE TRUE
TRUE FALSE TRUE
FALSE TRUE TRUE
FALSE FALSE FALSE

Truth table for NOT:

A ! A
TRUE FALSE
FALSE TRUE

Can combine AND/OR with NOT to cover any binary Boolean operation

Comparisons

Common way to create a logical vector: numeric comparison with <, !=, etc.

We have implicitly been using this when doing filtering.

flights$dep_time > 600
    [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
   [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
...

Using a comparator between two vectors of logicals returns pairwise comparisons.

x <- c(TRUE, FALSE, TRUE)
y <- c(FALSE, FALSE, TRUE)
(x & y) # x AND y
[1] FALSE FALSE  TRUE
(x | y) # x OR y
[1]  TRUE FALSE  TRUE

Comparisons

So when we use multiple comparisons in filter(), we are building a new vector of logicals.

We only keep those rows where the vector is TRUE.

flights |>
  filter(dep_time > 600 & dep_time < 2000 & abs(arr_delay) < 20)
# A tibble: 172,286 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      601            600         1      844            850
 2  2013     1     1      602            610        -8      812            820
 3  2013     1     1      602            605        -3      821            805
...

Comparisons

filter() and mutate() can be used in conjunction

flights |>
  mutate(
    daytime = dep_time > 600 & dep_time < 2000,
    approx_ontime = abs(arr_delay) < 20,
  ) |>
  filter(daytime & approx_ontime)
# A tibble: 172,286 × 21
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      601            600         1      844            850
 2  2013     1     1      602            610        -8      812            820
 3  2013     1     1      602            605        -3      821            805
...

Floating point comparisons

Testing equality with == for floating points can cause problems. Numbers are represented with finite “precision”, i.e. only up to 2^{-32} or 2^{-64}.

x <- c( (1/49) * 49, sqrt(2)^2)
x == c(1,2)
[1] FALSE FALSE

What’s going on? Let’s look at more precise representation in R.

print(x, digits=10)
[1] 1 2
print(x, digits=20)
[1] 0.99999999999999988898 2.00000000000000044409

dplyr::near() helps with this, ignores small differences

near(x, c(1,2))
[1] TRUE TRUE
all.equal(x, c(1,2))  # returns single value
[1] TRUE

Missing values

Almost any operation involving an NA returns NA.

(NA > 5)
[1] NA
(10 == NA)
[1] NA

What about NA==NA?

NA==NA
[1] NA

Why? Think of this example

# Suppose we don't know Ant's age
age_ant <- NA

# And we also don't know Bug's age
age_bug <- NA

# Then we shouldn't know whether Ant and
# Bug are the same age
age_ant == age_bug
[1] NA

Missing values

A useful function for dealing with NA: is.na()

is.na(x) works with any type of vector and returns TRUE for missing values and FALSE for everything else:

is.na(c(TRUE, NA, FALSE))
[1] FALSE  TRUE FALSE
is.na(c(1, NA, 3))
[1] FALSE  TRUE FALSE
is.na(c("a", NA, "b")) 
[1] FALSE  TRUE FALSE

Missing values

Since is.na() returns logicals, can be used in filter():

flights |>
  filter(is.na(dep_time))
# A tibble: 8,255 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1       NA           1630        NA       NA           1815
 2  2013     1     1       NA           1935        NA       NA           2240
...

Missing values

Can use to help identify where NA come from. e.g., why are there air_time NAs?

  • Let’s examine how dep_time, dep_delay, and sched_dep_time are related.
flights |> 
  mutate(missing_dep_time = is.na(dep_time),
         missing_dep_delay = is.na(dep_delay),
         missing_sched_dep_time = is.na(sched_dep_time)) |> 
  count(missing_dep_time, missing_dep_delay, missing_sched_dep_time)
# A tibble: 2 × 4
  missing_dep_time missing_dep_delay missing_sched_dep_time      n
  <lgl>            <lgl>             <lgl>                   <int>
1 FALSE            FALSE             FALSE                  328521
2 TRUE             TRUE              FALSE                    8255
  • The only instances where dep_delay is missing have dep_time missing.

Missing values

  • Is it the case that dep_delay = dep_time - sched_dep_time?
flights |> 
  mutate(dep_delay_manual = dep_time - sched_dep_time,
         manual_matches_given = near(dep_delay_manual, dep_delay)) |>
  count(manual_matches_given)
# A tibble: 3 × 2
  manual_matches_given      n
  <lgl>                 <int>
1 FALSE                 99777
2 TRUE                 228744
3 NA                     8255

Quite weird, since we are getting a lot right but also getting a lot wrong.

Missing values

Let’s inspect further. What do the mismatched observations look like?

flights |> 
  mutate(manual_delay = dep_time - sched_dep_time,
         manual_matches_given = near(manual_delay, dep_delay)) |>
  filter(!manual_matches_given) |>
  select(time_hour, flight, dep_time, sched_dep_time, dep_delay, manual_delay)
# A tibble: 99,777 × 6
   time_hour           flight dep_time sched_dep_time dep_delay manual_delay
   <dttm>               <int>    <int>          <int>     <dbl>        <int>
 1 2013-01-01 06:00:00    461      554            600        -6          -46
 2 2013-01-01 06:00:00    507      555            600        -5          -45
 3 2013-01-01 06:00:00   5708      557            600        -3          -43
...

Problem: R is treating dep_time and sched_dep_time as integers, not time!

  • 5:54 is only 6 minutes away from 6:00, rather than 46.
  • See Ch 17 (Dates and times) of R4DS2 on how to properly treat dates and times.

Missing values in Boolean algebra

Logical and missing values interact in logical ways, but requires some thought.

  • Think of NA as “unknown” logical value.
tibble(x = c(TRUE, FALSE, NA)) |>  
  mutate(
    and_NA = x & NA,
    or_NA = x | NA
  )
# A tibble: 3 × 3
  x     and_NA or_NA
  <lgl> <lgl>  <lgl>
1 TRUE  NA     TRUE 
2 FALSE FALSE  NA   
3 NA    NA     NA   

Does not depend on value of NA:

  • NA OR TRUE will return TRUE.
  • NA AND FALSE will return FALSE.

Depends on value of NA:

  • NA AND TRUE will return NA.
  • NA OR FALSE will return NA.

Consider finding all flights departing in November or December.

flights |>  # results in correct calculation
  filter(month == 11 | month == 12)
# A tibble: 55,403 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013    11     1        5           2359         6      352            345
 2  2013    11     1       35           2250       105      123           2356
...
flights |>  # results in incorrect calculation. (we get all rows back)
  filter(month == 11 | 12)
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
...

%in%

Instead of worrying about | and == in order, just use %in%.

1:10 %in% c(1, 5, 10)
 [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE

So to find all flights from November and December:

flights |>
  filter(month %in% c(11, 12))
  • %in% obeys different rules for NA vs. ==, since NA %in% NA is TRUE:
(c(1,2,NA) == NA)
[1] NA NA NA
(c(1,2,NA) %in% NA)
[1] FALSE FALSE  TRUE

Logical summaries

Two main functions for logical summaries:

  • any(x) returns TRUE if any value in x is TRUE
  • all(x) returns TRUE only if all values in x are TRUE

For instance, was there a day where every flight was delayed on departure by less than an hour? Or a day where there were any flights delayed on arrival by \(\leq 5\) hours?

flights |>
  group_by(year, month, day) |>
  summarize(
    all_delayed = all(dep_delay <= 60, na.rm=TRUE),
    any_long_delay = any(arr_delay >= 300, na.rm=TRUE)
  )
# A tibble: 365 × 5
# Groups:   year, month [12]
    year month   day all_delayed any_long_delay
   <int> <int> <int> <lgl>       <lgl>         
 1  2013     1     1 FALSE       TRUE          
 2  2013     1     2 FALSE       TRUE          
 3  2013     1     3 FALSE       FALSE         
 4  2013     1     4 FALSE       FALSE         
...

Logical summaries

  • When coerced into a numeric, TRUE coerces to 1 and FALSE coerces to 0
  • Useful if you want to find proportions that are TRUE/FALSE, e.g. mean(), sum()

Logical summaries

  • Example: proportion of flights delayed > 1 hour on departure, and number of flights delayed on arrival by > 5 hours:
flights |> 
  group_by(year, month, day) |>
  summarise(
    prop_delayed_1hour = mean(dep_delay > 60, na.rm=TRUE),
    num_long_delay = sum(arr_delay > 300, na.rm=TRUE)
  )
# A tibble: 365 × 5
# Groups:   year, month [12]
    year month   day prop_delayed_1hour num_long_delay
   <int> <int> <int>              <dbl>          <int>
 1  2013     1     1             0.0609              3
 2  2013     1     2             0.0856              3
 3  2013     1     3             0.0586              0
...

Note output: # Groups: year, month [12]

Logical summaries

  • Example: proportion of flights delayed > 1 hour on departure, and number of flights delayed on arrival by > 5 hours:
flights |> 
  group_by(year, month, day) |>
  summarise(
    prop_delayed_1hour = mean(dep_delay > 60, na.rm=TRUE),
    num_long_delay = sum(arr_delay > 300, na.rm=TRUE),
    .groups = 'drop'
  )
# A tibble: 365 × 5
    year month   day prop_delayed_1hour num_long_delay
   <int> <int> <int>              <dbl>          <int>
 1  2013     1     1             0.0609              3
 2  2013     1     2             0.0856              3
 3  2013     1     3             0.0586              0
...

Conditional transformations: if_else()

if_else(CONDITION, TRUE_VAL, FALSE_VAL, MISSING_VAL) is useful when we want to return some value when condition is TRUE and return another value when condition is FALSE.

x <- c(-2, -1, 1, 2, NA)
if_else(x > 0, "yay", "boo")
[1] "boo" "boo" "yay" "yay" NA   

The fourth argument of if_else() specifies what to fill NA’s with:

if_else(x > 0, "yay", "boo", "idk how i feel about x yet")
[1] "boo"                        "boo"                       
[3] "yay"                        "yay"                       
[5] "idk how i feel about x yet"

We can also use vectors as an argument.

if_else(x < 0, -x, x)
[1]  2  1  1  2 NA

Conditional transformations: if_else()

We can use general vectors inside of if_else():

x1 <- c(NA, 1, 2, NA)
y1 <- c(3, NA, 4, 6)
if_else(is.na(x1), y1, x1)
[1] 3 1 2 6

Conditional transformations: case_when()

If you have many different conditions for which you want to specify values, e.g.

  • If number is between a and b then do…
  • If number is between b and c then do…
  • If number is between c and d then do…

You can use case_when().

Conditional transformations: case_when()

Inspired by SQL’s CASE statement. Has a very weird syntax:

  • condition ~ output
  • condition is a logical vector
  • when condition is TRUE, output is used.

Weird, but pretty readable:

x <- c(-3:3, NA)
case_when(
  x == 0   ~ "0",
  x < 0    ~ "-ve", 
  x > 0    ~ "+ve",
  is.na(x) ~ "???"
)
[1] "-ve" "-ve" "-ve" "0"   "+ve" "+ve" "+ve" "???"

Conditional transformations: case_when()

If no cases match, then returns NA:

x <- c(-2:2, NA)
case_when(
  x < 0 ~ "-ve",
  x > 0 ~ "+ve"
)
[1] "-ve" "-ve" NA    "+ve" "+ve" NA   

If multiple conditions are satisfied, only the first is used – be careful!

case_when(
  x > 0 ~ "+ve",
  x > 2 ~ "big"
)
[1] NA    NA    NA    "+ve" "+ve" NA   

The argument .default specifies return value if condition is satisfied, or if value is NA.

case_when(
  x < 0 ~ "-ve",
  x > 0 ~ "+ve",
  .default = "???"
)
[1] "-ve" "-ve" "???" "+ve" "+ve" "???"

case_when(): more complex example

Provide human-readable labels to flight delays.

flights |> 
  mutate(
    status = case_when(
      is.na(arr_delay)      ~ "cancelled",
      arr_delay < -30       ~ "very early",
      arr_delay < -15       ~ "early",
      abs(arr_delay) <= 15  ~ "on time",
      arr_delay < 60        ~ "late",
      arr_delay < Inf       ~ "very late",
    ),
    .keep = "used" # only returns those columns used in calculations
  )
# A tibble: 336,776 × 2
   arr_delay status 
       <dbl> <chr>  
 1        11 on time
 2        20 late   
...
  • We can refer to variables inside the dataframe inside case_when(), just as in most other tidyverse functions .
  • The first conditional that is true is what gets assigned.
  • So when arr_delay < -30, the remaining conditionals do not get checked.

Compatible types

Both if_else() and case_when() require the outputs to be of consistent types.

if_else(TRUE, "a", 1)
#> Error in `if_else()`:
#> ! Can't combine `true` <character> and `false` <double>.

case_when(
  x < -1 ~ TRUE,
  x > 0 ~ now()
)
#> Error in `case_when()`:
#> ! Can't combine `..1 (right)` <logical> and `..2 (right)` <datetime<local>>

Most types are incompatible in order to catch errors. Compatible types:

  • Numeric and logical (treats TRUE=1, FALSE=0)
  • Dates and “date-times” – we might discuss these types after midterm 2
  • NA is compatible with everything
  • Strings and factors are compatible

Example: labelling numbers as even or odd

  • Number is even \(\Leftrightarrow\) number is divisible by two.
  • In R, operator %% (read “modulo”) does “modular arithmetic”:
  • a %% b returns the remainder when dividing a by b, e.g.
    • 17 %% 12 = 5
    • 34 %% 6 = 4
  • A number n is even if and only if n %% 2 == 0; otherwise, odd.
  • We can use if_else to label numbers between 0 and 20 as even or odd
x <- 0:20
if_else(x %% 2 == 0, 'even', 'odd')
 [1] "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd" 
[11] "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd" 
[21] "even"

Operating on numeric vectors

Parsing strings to get numbers

readr package in the tidyverse has two useful functions:

  • parse_double() – useful when you have numbers written as strings.
  • parse_number() – ignores all non-numeric text to parse strings.
parse_double(c("1.2", "5.6", "1e3"))
[1]    1.2    5.6 1000.0
parse_number(c("$1,234", "USD 53,256", "59%"))
[1]  1234 53256    59

Parsing strings to get numbers

What happens if use parse_double() with non-numeric-identifying strings?

x <- parse_double(c("qwerty", "6.5", "asdf"))
str(x)  # shows that x is a vector with informative attributes
 num [1:3] NA 6.5 NA
 - attr(*, "problems")= tibble [2 × 4] (S3: tbl_df/tbl/data.frame)
  ..$ row     : int [1:2] 1 3
  ..$ col     : int [1:2] NA NA
  ..$ expected: chr [1:2] "a double" "a double"
  ..$ actual  : chr [1:2] "qwerty" "asdf"

Can access this tibble by attributes(x)$problems

Arithmetic and “recycling rules”

  • We’ve created new rows before
    • e.g. flights |> mutate(air_time_hr = air_time / 60).
    • air_time has 336,776 elements while 60 has only one, so we divide every element of air_time by 60.
  • If you have two vectors of same length, operations are done element-wise:
x <- c(1, 2, 3, 4)
y <- c(2, 3, 4, 5)
x / y
[1] 0.5000000 0.6666667 0.7500000 0.8000000

Arithmetic and “recycling rules”

What happens if the number of elements is not 1 or the exact matching number?

  • R does what is called recycling, or repeating

    • It will create a new vector which repeats until reaches vector length.
    • Will throw warning if not an even multiple.
    x <- c(-1, -2, -3, -4)
    x * c(1,2)
    [1] -1 -4 -3 -8
    x * c(5, 6, 7)
    Warning in x * c(5, 6, 7): longer object length is not a multiple of shorter
    object length
    [1]  -5 -12 -21 -20

Recycling rules

Rules apply for all logical comparison (==, <, etc) and arithmetic (+, ^, etc)

  • Be careful when doing logical comparisons / arithmetic using two vectors!
flights |> filter(month == c(1,2))
# A tibble: 25,977 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      542            540         2      923            850
 3  2013     1     1      554            600        -6      812            837
...
  • month == c(1,2) returns a logical vector where:

    • TRUE if either the row number is odd and the month is 1, OR row number is even and month 2. Otherwise is FALSE.
  • Better to use month %in% c(1,2) here!

Modular arithmetic

Recall from grade school: division by remainder.

  • %/%: integer division
1:10 %/% 3
 [1] 0 0 1 1 1 2 2 2 3 3
  • %%: the remainder after integer division.
1:10 %% 3
 [1] 1 2 0 1 2 0 1 2 0 1

Useful for time-related calculations, e.g. get hour and minute from sched_dep_time:

flights |> 
  mutate(hour = sched_dep_time %/% 100,
         minute = sched_dep_time %% 100,
         .keep = 'used') 
# A tibble: 336,776 × 3
   sched_dep_time  hour minute
            <int> <dbl>  <dbl>
 1            515     5     15
 2            529     5     29
 3            540     5     40
 4            545     5     45
...

Modular arithmetic

We can then do things like calculate the percent of delayed flights per hour.

flights |> 
  mutate(hour = sched_dep_time %/% 100) |>
  group_by(hour) |>
  summarize(percent_cancelled = 100*mean(is.na(dep_time)),
            n = n())
# A tibble: 20 × 3
    hour percent_cancelled     n
   <dbl>             <dbl> <int>
 1     1           100         1
 2     5             0.461  1953
 3     6             1.64  25951
 4     7             1.27  22821
 5     8             1.62  27242
 6     9             1.61  20312
 7    10             1.74  16708
 8    11             1.85  16033
...

–>

–>

–>

–>

–>

Ranks: dplyr::min_rank()

  • Takes a vector of numbers and returns the rank of each element, with lowest = 1st.
min_rank(c(7,4,8,9))
[1] 2 1 3 4
  • Ties broken in obvious way: 1st, 2nd, 2nd, 4th if second and third element equal.
x <- c(62, 62, 64, NA, 20)
min_rank(x)
[1]  2  2  4 NA  1
  • To rank large values first, use desc(x):
min_rank(desc(x))
[1]  2  2  1 NA  4

There are many variants of min_rank() in dplyr:

  • row_number(), cume_dist(), percent_rank()
  • You can explore these on your own; min_rank() is enough in most cases.

Ranks: dplyr::min_rank()

Example: which 3 flight routes have the longest average delays?

  • flight route determined by origin and dest
  • negative arr_delay means the flight left early
flights |>
  filter(arr_delay > 0) |>
  group_by(origin, dest) |>
  summarize(avg_delay = mean(arr_delay, na.rm=TRUE),
            .groups = 'drop') |>
  mutate(rank = min_rank(desc(avg_delay))) |>
  filter(rank <= 3) |>
  arrange(by = rank)
# A tibble: 3 × 4
  origin dest  avg_delay  rank
  <chr>  <chr>     <dbl> <int>
1 LGA    TVC        72.7     1
2 EWR    TYS        72.6     2
3 LGA    OMA        65.0     3

Minimum, maximum, quantiles

  • Again, min(x) and max(x) return single smallest/largest vals within vector x.
  • quantile(vector, threshold) is a generalization of median:
    • quantile(x, 0.25) returns value of x that is >= 25% of values within x
    • quantile(x, 0.5) returns median
    • quantile(x, 0.95) returns value of x that is >= 95% of values within x.
  • Compare to the mean, quantiles are less susceptible to extreme values
    • Consider c(1, 2, 3, 2, 5, 2, 3, 1, 4, 2, 3, 1, 5, 2, 10000000)

Example: calculate the maximum delay and the 95th quantile of delays for flights per day.

flights |> 
  group_by(year, month, day) |>
  summarise(
    maxim = max(dep_delay, na.rm=TRUE),
    q95 = quantile(dep_delay, 0.95, na.rm=TRUE),
    .groups = 'drop'
  )
# A tibble: 365 × 5
    year month   day maxim   q95
   <int> <int> <int> <dbl> <dbl>
 1  2013     1     1   853  70.1
 2  2013     1     2   379  85  
 3  2013     1     3   291  68  
...

Operating on character vectors

glue::glue()

Similar to f-strings in Python. Can be less clunky than R’s paste().

library(glue)
tibble(
  Student = c("Ant", "Bug", "Cat", "Ant", "Bug", "Cat"),
  Subject = c("Math", "Math", "Math", "Chem", "Chem", "Chem"),
  Score = c(85, 92, 88, 90, 78, 95)
) |> 
  mutate(Msg = glue('Superstar {Student} scored {Score} in {Subject}'))
# A tibble: 6 × 4
  Student Subject Score Msg                            
  <chr>   <chr>   <dbl> <glue>                         
1 Ant     Math       85 Superstar Ant scored 85 in Math
2 Bug     Math       92 Superstar Bug scored 92 in Math
3 Cat     Math       88 Superstar Cat scored 88 in Math
4 Ant     Chem       90 Superstar Ant scored 90 in Chem
5 Bug     Chem       78 Superstar Bug scored 78 in Chem
6 Cat     Chem       95 Superstar Cat scored 95 in Chem