3.3 Explore data: transforming values

STA141A: Fundamentals of Statistical Data Science

Akira Horiguchi

Comments

Previous material you will likely use for any data science project you work on

This slide deck will be a bit more specialized
Some material will be a lifesaver; some you will never use after this class
- But we don’t know which is which!

library(tidyverse)  # Do at the start of every session
library(nycflights13) # loads the `flights` tibble into environment
flights

# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
...

Operating on logical vectors

Intro

We’ll investigate how to manipulate and transform data to get logicals, and how to use logicals.

For logical vectors, every element takes one of 3 values: TRUE, FALSE, NA

Common way to create a logical vector: numeric comparison with <, !=, etc:

flights$dep_time > 600  # produces a logical vector

    [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
   [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
...

We might also want to combine multiple conditions.

For example, flights that depart between 6am and 8pm that was not delayed for more than 20 minutes.

Three basic logical operators that we will use over and over:

AND (denoted & in R): operation between two logicals
OR (denoted | in R): operation between two logicals
NOT (denoted ! in R): operation on a single logical.

Truth table for AND:

A	B	A `&` B
`TRUE`	`TRUE`	`TRUE`
`TRUE`	`FALSE`	`FALSE`
`FALSE`	`TRUE`	`FALSE`
`FALSE`	`FALSE`	`FALSE`

Truth table for OR:

A	B	A `\|` B
`TRUE`	`TRUE`	`TRUE`
`TRUE`	`FALSE`	`TRUE`
`FALSE`	`TRUE`	`TRUE`
`FALSE`	`FALSE`	`FALSE`

Truth table for NOT:

A	`!` A
`TRUE`	`FALSE`
`FALSE`	`TRUE`

Can combine AND/OR with NOT to cover any binary Boolean operation

Comparisons: pairwise

A comparator between two vectors of logicals returns pairwise comparisons.

x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)

(x & y)  # x AND y

[1]  TRUE FALSE FALSE FALSE

(x | y)  # x OR y

[1]  TRUE  TRUE  TRUE FALSE

!x  # NOT x

[1] FALSE FALSE  TRUE  TRUE

FYI, we can also use T for TRUE and F for FALSE:

a <- c(T, T, F, F)
b <- c(T, F, T, F)
a == x

[1] TRUE TRUE TRUE TRUE

b == y

[1] TRUE TRUE TRUE TRUE

Comparisons: multiple

Multiple comparisons in filter() produce a new vector of logicals.

flights |>
  filter(dep_time > 600 & dep_time < 2000 & abs(arr_delay) < 20)  # keep those rows where the vector is TRUE

# A tibble: 172,286 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      601            600         1      844            850
 2  2013     1     1      602            610        -8      812            820
 3  2013     1     1      602            605        -3      821            805
 4  2013     1     1      606            610        -4      858            910
 5  2013     1     1      606            610        -4      837            845
 6  2013     1     1      607            607         0      858            915
...

Comparisons

filter() and mutate() can be used in conjunction

flights |>
  mutate(
    daytime = dep_time > 600 & dep_time < 2000,
    approx_ontime = abs(arr_delay) < 20,
  ) |>
  filter(daytime & approx_ontime)

# A tibble: 172,286 × 21
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      601            600         1      844            850
 2  2013     1     1      602            610        -8      812            820
 3  2013     1     1      602            605        -3      821            805
...

Floating point comparisons

Testing equality with == for floating points can cause problems. Numbers are represented with finite “precision”, i.e. only up to 2^{-32} or 2^{-64}.

x <- c( (1/49) * 49, sqrt(2)^2)
x == c(1,2)

[1] FALSE FALSE

What’s going on? Let’s look at more precise representation in R.

print(x, digits=10)

[1] 1 2

print(x, digits=20)

[1] 0.99999999999999988898 2.00000000000000044409

dplyr::near() helps with this, ignores small differences

near(x, c(1,2))

[1] TRUE TRUE

all.equal(x, c(1,2))  # returns single value

[1] TRUE

Missing values

Almost any operation involving an NA returns NA.

(NA > 5)

[1] NA

(10 == NA)

[1] NA

What about NA==NA?

NA==NA

[1] NA

Why? Think of this example

# Suppose we don't know Ant's age
age_ant <- NA

# And we also don't know Bug's age
age_bug <- NA

# Then we shouldn't know whether Ant and
# Bug are the same age
age_ant == age_bug

[1] NA

Missing values

is.na() is a useful function for dealing with NA values.

works with any type of vector;
returns TRUE for missing values and FALSE for everything else:

is.na(c(TRUE, NA, FALSE))

[1] FALSE  TRUE FALSE

is.na(c(1, NA, 3))

[1] FALSE  TRUE FALSE

is.na(c("a", NA, "b"))

[1] FALSE  TRUE FALSE

is.na(list(3, '5', NA))  # also works for a list

[1] FALSE FALSE  TRUE

Missing values

Since is.na() returns logicals, can be used in filter():

flights |>
  filter(is.na(dep_time))

# A tibble: 8,255 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1       NA           1630        NA       NA           1815
 2  2013     1     1       NA           1935        NA       NA           2240
 3  2013     1     1       NA           1500        NA       NA           1825
 4  2013     1     1       NA            600        NA       NA            901
...

Missing values

Can use to help identify where NA come from. e.g., why are there air_time NAs?

Let’s examine how dep_time, dep_delay, and sched_dep_time are related.

flights |> 
  mutate(missing_dep_time = is.na(dep_time),
         missing_dep_delay = is.na(dep_delay),
         missing_sched_dep_time = is.na(sched_dep_time)) |> 
  count(missing_dep_time, missing_dep_delay, missing_sched_dep_time)

# A tibble: 2 × 4
  missing_dep_time missing_dep_delay missing_sched_dep_time      n
  <lgl>            <lgl>             <lgl>                   <int>
1 FALSE            FALSE             FALSE                  328521
2 TRUE             TRUE              FALSE                    8255

The only instances where dep_delay is missing have dep_time missing.

Missing values

Is it the case that dep_delay = dep_time - sched_dep_time?

flights |> 
  mutate(dep_delay_manual = dep_time - sched_dep_time,
         manual_matches_given = near(dep_delay_manual, dep_delay)) |>
  count(manual_matches_given)

# A tibble: 3 × 2
  manual_matches_given      n
  <lgl>                 <int>
1 FALSE                 99777
2 TRUE                 228744
3 NA                     8255

Quite weird, since we are getting a lot right but also getting a lot wrong.

Missing values

Let’s inspect further. What do the mismatched observations look like?

flights |> 
  mutate(manual_delay = dep_time - sched_dep_time,
         manual_matches_given = near(manual_delay, dep_delay)) |>
  filter(!manual_matches_given) |>
  select(time_hour, flight, dep_time, sched_dep_time, dep_delay, manual_delay)

# A tibble: 99,777 × 6
   time_hour           flight dep_time sched_dep_time dep_delay manual_delay
   <dttm>               <int>    <int>          <int>     <dbl>        <int>
 1 2013-01-01 06:00:00    461      554            600        -6          -46
 2 2013-01-01 06:00:00    507      555            600        -5          -45
 3 2013-01-01 06:00:00   5708      557            600        -3          -43
...

Problem: R is treating dep_time and sched_dep_time as integers, not time!

5:54 is only 6 minutes away from 6:00, rather than 46.
See Ch 17 (Dates and times) of R4DS2 on how to properly treat dates and times.

Missing values in Boolean algebra

Logical and missing values interact in logical ways, but requires some thought.

c(TRUE, FALSE) | NA

[1] TRUE   NA

c(TRUE, FALSE) & NA

[1]    NA FALSE

Think of NA as an unknown logical value.

Does not depend on value of NA:
- NA OR TRUE will return TRUE.
- NA AND FALSE will return FALSE.
Depends on value of NA:
- NA AND TRUE will return NA.
- NA OR FALSE will return NA.

Consider finding all flights departing in November or December.

flights |>  # results in correct calculation
  filter(month == 11 | month == 12)

# A tibble: 55,403 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013    11     1        5           2359         6      352            345
 2  2013    11     1       35           2250       105      123           2356
...

flights |>  # results in incorrect calculation. (we get all rows back)
  filter(month == 11 | 12)

# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
...

R first evaluates month==11, creating a logical vector vec.
R then compares vec | 12.
When comparing a number to a logical, any nonzero number is considered as TRUE.
So vec | 12 returns a vector with TRUE for every element.

`%in%`

Instead of worrying about | and == in order, just use %in%.

1:10 %in% c(1, 5, 10)

 [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE

So to find all flights from November and December:

flights |>
  filter(month %in% c(11, 12))

%in% obeys different rules for NA vs. ==, since NA %in% NA is TRUE:

(c(1,2,NA) == NA)

[1] NA NA NA

(c(1,2,NA) %in% NA)

[1] FALSE FALSE  TRUE

Logical summaries

Two main functions to summarize a logical vector:

any(x) returns TRUE if any value in x is TRUE
all(x) returns TRUE only if all values in x are TRUE

E.g., was there a day where:

every flight was delayed on departure by $<$ 1 hour?
any flight was delayed on arrival by $\geq 5$ hours?

flights |>
  group_by(year, month, day) |>
  summarize(
    all_delayed = all(dep_delay < 60, na.rm=TRUE),
    any_long_delay = any(arr_delay >= 300, na.rm=TRUE)
  )

# A tibble: 365 × 5
# Groups:   year, month [12]
    year month   day all_delayed any_long_delay
   <int> <int> <int> <lgl>       <lgl>         
 1  2013     1     1 FALSE       TRUE          
 2  2013     1     2 FALSE       TRUE          
 3  2013     1     3 FALSE       FALSE         
 4  2013     1     4 FALSE       FALSE         
...

Logical summaries

When coerced into a numeric, TRUE coerces to 1 and FALSE coerces to 0
Useful if you want to find proportions that are TRUE/FALSE, e.g. mean(), sum()

x <- c(TRUE, FALSE, FALSE, FALSE, TRUE)
sum(x)  # number of TRUE elements in x

[1] 2

mean(x)  # proportion of TRUE elements in x

[1] 0.4

length(x) - sum(x)  # number of FALSE elements in x

[1] 3

1 - mean(x)  # proportion of FALSE elements in x

[1] 0.6

Logical summaries

Example: proportion of flights delayed > 1 hour on departure, and number of flights delayed on arrival by > 5 hours:

flights |> 
  group_by(year, month, day) |>
  summarise(
    prop_delayed_1hour = mean(dep_delay > 60, na.rm=TRUE),
    num_long_delay = sum(arr_delay > 300, na.rm=TRUE)
  )

# A tibble: 365 × 5
# Groups:   year, month [12]
    year month   day prop_delayed_1hour num_long_delay
   <int> <int> <int>              <dbl>          <int>
 1  2013     1     1             0.0609              3
 2  2013     1     2             0.0856              3
 3  2013     1     3             0.0586              0
...

Note output: # Groups: year, month [12]

Logical summaries

Example: proportion of flights delayed > 1 hour on departure, and number of flights delayed on arrival by > 5 hours:

flights |> 
  group_by(year, month, day) |>
  summarise(
    prop_delayed_1hour = mean(dep_delay > 60, na.rm=TRUE),
    num_long_delay = sum(arr_delay > 300, na.rm=TRUE),
    .groups = 'drop'  # Output is no longer grouped
  )

# A tibble: 365 × 5
    year month   day prop_delayed_1hour num_long_delay
   <int> <int> <int>              <dbl>          <int>
 1  2013     1     1             0.0609              3
 2  2013     1     2             0.0856              3
 3  2013     1     3             0.0586              0
 4  2013     1     4             0.0473              0
 5  2013     1     5             0.0363              1
...

Conditional transformations: `if_else()`

if_else(CONDITION, TRUE_VAL, FALSE_VAL, MISSING_VAL) is useful when we want to return some value when condition is TRUE and return another value when condition is FALSE.

x <- c(-2, -1, 1, 2, NA)
if_else(x > 0, "yay", "boo")

[1] "boo" "boo" "yay" "yay" NA

The fourth argument of if_else() specifies what to fill NA’s with:

if_else(x > 0, "yay", "boo", "huh")

[1] "boo" "boo" "yay" "yay" "huh"

We can also use vectors as an argument.

if_else(x < 0, -x, x)

[1]  2  1  1  2 NA

Conditional transformations: `if_else()`

We can use general vectors inside of if_else():

x1 <- c(NA, 1, 2, NA)
y1 <- c(3, NA, 4, 6)
if_else(is.na(x1), y1, x1)

[1] 3 1 2 6

Conditional transformations: `case_when()`

if_else() is useful if you have two different conditions for which you want to specify values.

case_when() is useful if you have many different conditions for which you want to specify values. For example:

If number is between 0 and 20, then print F,
If number is between 20 and 40, then print D,
If number is between 40 and 60, then print C,
If number is between 60 and 80, then print B,
If number is between 80 and 100, then print A.

Conditional transformations: `case_when()`

Inspired by SQL’s CASE statement. Has a very weird syntax:

condition ~ output
condition is a logical vector
when condition is TRUE, output is used.

Weird, but pretty readable:

(x <- c(-3:3, NA))

[1] -3 -2 -1  0  1  2  3 NA

case_when(
  x == 0   ~ "0",
  x < 0    ~ "-ve", 
  x > 0    ~ "+ve",
  is.na(x) ~ "???"
)

[1] "-ve" "-ve" "-ve" "0"   "+ve" "+ve" "+ve" "???"

Conditional transformations: `case_when()`

(x <- c(-1:2, NA))

[1] -1  0  1  2 NA

If no cases match, then returns NA:

case_when(
  x < 0 ~ "-ve",
  x > 0 ~ "+ve"
)

[1] "-ve" NA    "+ve" "+ve" NA

If multiple conditions are satisfied, only the first is used – be careful!

case_when(
  x > 0 ~ "+ve",
  x > 1 ~ "big"
)

[1] NA    NA    "+ve" "+ve" NA

The argument .default specifies return value if condition is satisfied, or if value is NA.

case_when(
  x < 0 ~ "-ve",
  x > 0 ~ "+ve",
  .default = "???"
)

[1] "-ve" "???" "+ve" "+ve" "???"

`case_when()`: more complex example

Provide human-readable labels to flight delays.

flights |> 
  mutate(
    status = case_when(
      is.na(arr_delay)      ~ "cancelled",
      arr_delay < -30       ~ "very early",
      arr_delay < -15       ~ "early",
      abs(arr_delay) <= 15  ~ "on time",
      arr_delay < 60        ~ "late",
      arr_delay < Inf       ~ "very late",
    ),
    .keep = "used" # only returns those columns used in calculations
  )

# A tibble: 336,776 × 2
   arr_delay status 
       <dbl> <chr>  
 1        11 on time
 2        20 late   
...

We can refer to variables inside the dataframe inside case_when(), just as in most other tidyverse functions .
The first conditional that is true is what gets assigned.
So when arr_delay < -30, the remaining conditionals do not get checked.

Compatible types

Both if_else() and case_when() require the outputs to be of consistent types.

if_else(TRUE, "a", 1)
#> Error in `if_else()`:
#> ! Can't combine `true` <character> and `false` <double>.

case_when(
  x < -1 ~ TRUE,
  x > 0 ~ now()
)
#> Error in `case_when()`:
#> ! Can't combine `..1 (right)` <logical> and `..2 (right)` <datetime<local>>

Most types are incompatible in order to catch errors. Compatible types:

Numeric and logical (treats TRUE=1, FALSE=0)
Dates and “date-times” – in R4DS2 book.
NA is compatible with everything
Strings and factors are compatible

Example: labelling numbers as even or odd

Number is even $\Leftrightarrow$ number is divisible by two.
In R, operator %% (read “modulo”) does “modular arithmetic”:
a %% b returns the remainder when dividing a by b, e.g.
- 17 %% 12 = 5
- 34 %% 6 = 4
A number n is even if and only if n %% 2 == 0; otherwise, odd.
We can use if_else to label numbers between 0 and 20 as even or odd

x <- 0:20
if_else(x %% 2 == 0, 'even', 'odd')

 [1] "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd" 
[11] "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd" 
[21] "even"

Operating on numeric vectors

Parsing strings to get numbers

readr package in the tidyverse has two useful functions:

parse_double() – useful when you have numbers written as strings.
parse_number() – ignores all non-numeric text to parse strings.

parse_double(c("1.2", "5.6", "1e3"))

[1]    1.2    5.6 1000.0

parse_number(c("$1,234", "USD 53,256", "59%"))

[1]  1234 53256    59

Parsing strings to get numbers

What happens if use parse_double() with non-numeric-identifying strings?

x <- parse_double(c("qwerty", "6.5", "asdf"))
str(x)  # shows that x is a vector with informative attributes

 num [1:3] NA 6.5 NA
 - attr(*, "problems")= tibble [2 × 4] (S3: tbl_df/tbl/data.frame)
  ..$ row     : int [1:2] 1 3
  ..$ col     : int [1:2] NA NA
  ..$ expected: chr [1:2] "a double" "a double"
  ..$ actual  : chr [1:2] "qwerty" "asdf"

Can access this tibble by attributes(x)$problems

Arithmetic and “recycling rules”

We’ve created new rows before
- e.g. flights |> mutate(air_time_hr = air_time / 60).
- air_time has 336,776 elements while 60 has only one, so we divide every element of air_time by 60.
If you have two vectors of same length, operations are done element-wise:

x <- c(1, 2, 3, 4)
y <- c(2, 3, 4, 5)
x / y

[1] 0.5000000 0.6666667 0.7500000 0.8000000

Arithmetic and “recycling rules”

What happens if the number of elements is not 1 or the exact matching number?

R does what is called recycling, or repeating

It will create a new vector which repeats until reaches vector length.
Will throw warning if not an even multiple.

x <- c(-1, -2, -3, -4)
x * c(1,2)

[1] -1 -4 -3 -8

x * c(5, 6, 7)

Warning in x * c(5, 6, 7): longer object length is not a multiple of shorter
object length

[1]  -5 -12 -21 -20

Recycling rules

Rules apply for all logical comparison (==, <, etc) and arithmetic (+, ^, etc)

Be careful when doing logical comparisons / arithmetic using two vectors!

flights |> filter(month == c(1,2))

# A tibble: 25,977 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      542            540         2      923            850
 3  2013     1     1      554            600        -6      812            837
...

month == c(1,2) returns a logical vector where:
- TRUE if either the row number is odd and the month is 1, OR row number is even and month 2. Otherwise is FALSE.
Better to use month %in% c(1,2) here!

Modular arithmetic

Recall from grade school: division by remainder.

%/%: integer division

1:10 %/% 3

 [1] 0 0 1 1 1 2 2 2 3 3

%%: the remainder after integer division.

1:10 %% 3

 [1] 1 2 0 1 2 0 1 2 0 1

Useful for time-related calculations, e.g. get hour and minute from sched_dep_time:

flights |> 
  mutate(hour = sched_dep_time %/% 100,
         minute = sched_dep_time %% 100,
         .keep = 'used')

# A tibble: 336,776 × 3
   sched_dep_time  hour minute
            <int> <dbl>  <dbl>
 1            515     5     15
 2            529     5     29
 3            540     5     40
 4            545     5     45
...

Modular arithmetic

We can then do things like calculate the percent of delayed flights per hour.

flights |> 
  mutate(hour = sched_dep_time %/% 100) |>
  group_by(hour) |>
  summarize(percent_cancelled = 100*mean(is.na(dep_time)),
            n = n())

# A tibble: 20 × 3
    hour percent_cancelled     n
   <dbl>             <dbl> <int>
 1     1           100         1
 2     5             0.461  1953
 3     6             1.64  25951
 4     7             1.27  22821
 5     8             1.62  27242
 6     9             1.61  20312
 7    10             1.74  16708
 8    11             1.85  16033
...

–>

Ranks: `dplyr::min_rank()`

Takes a vector of numbers and returns the rank of each element, with lowest = 1st.

min_rank(c(7,4,8,9))

[1] 2 1 3 4

Ties broken in obvious way: 1st, 2nd, 2nd, 4th if second and third element equal.

x <- c(62, 62, 64, NA, 20)
min_rank(x)

[1]  2  2  4 NA  1

To rank large values first, use desc(x):

min_rank(desc(x))

[1]  2  2  1 NA  4

There are many variants of min_rank() in dplyr:

row_number(), cume_dist(), percent_rank()
You can explore these on your own; min_rank() is enough in most cases.

Ranks: `dplyr::min_rank()`

Example: which 3 flight routes have the longest average delays?

flight route determined by origin and dest
negative arr_delay means the flight left early

flights |>
  filter(arr_delay > 0) |>
  group_by(origin, dest) |>
  summarize(avg_delay = mean(arr_delay, na.rm=TRUE),
            .groups = 'drop') |>
  mutate(rank = min_rank(desc(avg_delay))) |>
  filter(rank <= 3) |>
  arrange(by = rank)

# A tibble: 3 × 4
  origin dest  avg_delay  rank
  <chr>  <chr>     <dbl> <int>
1 LGA    TVC        72.7     1
2 EWR    TYS        72.6     2
3 LGA    OMA        65.0     3

Minimum, maximum, quantiles

Again, min(x) and max(x) return single smallest/largest vals within vector x.
quantile(vector, threshold) is a generalization of median:
- quantile(x, 0.25) returns value of x that is >= 25% of values within x
- quantile(x, 0.5) returns median
- quantile(x, 0.95) returns value of x that is >= 95% of values within x.
Compare to the mean, quantiles are less susceptible to extreme values
- Consider c(1, 2, 3, 2, 5, 2, 3, 1, 4, 2, 3, 1, 5, 2, 10000000)

Example: calculate the maximum delay and the 95th quantile of delays for flights per day.

flights |> 
  group_by(year, month, day) |>
  summarise(
    maxim = max(dep_delay, na.rm=TRUE),
    q95 = quantile(dep_delay, 0.95, na.rm=TRUE),
    .groups = 'drop'
  )

# A tibble: 365 × 5
    year month   day maxim   q95
   <int> <int> <int> <dbl> <dbl>
 1  2013     1     1   853  70.1
 2  2013     1     2   379  85  
 3  2013     1     3   291  68  
...

Operating on character vectors

`glue::glue()`

Similar to f-strings in Python. Can be less clunky than R’s paste().

library(glue)

tibble(
  Student = c("Ant", "Bug", "Cat", "Ant", "Bug", "Cat"),
  Subject = c("Math", "Math", "Math", "Chem", "Chem", "Chem"),
  Score = c(85, 92, 88, 90, 78, 95)
) |> 
  mutate(Msg = glue('Superstar {Student} scored {Score} in {Subject}'))

# A tibble: 6 × 4
  Student Subject Score Msg                            
  <chr>   <chr>   <dbl> <glue>                         
1 Ant     Math       85 Superstar Ant scored 85 in Math
2 Bug     Math       92 Superstar Bug scored 92 in Math
3 Cat     Math       88 Superstar Cat scored 88 in Math
4 Ant     Chem       90 Superstar Ant scored 90 in Chem
5 Bug     Chem       78 Superstar Bug scored 78 in Chem
6 Cat     Chem       95 Superstar Cat scored 95 in Chem

3.3 Explore data: transforming values

Comments

Operating on logical vectors

Intro

Three basic logical operators that we will use over and over:

Comparisons: pairwise

Comparisons: multiple

Comparisons

Floating point comparisons

Missing values

Missing values

Missing values

Missing values

Missing values

Missing values

Missing values in Boolean algebra

Consider finding all flights departing in November or December.

%in%

Logical summaries

Logical summaries

Logical summaries

Logical summaries

Conditional transformations: if_else()

Conditional transformations: if_else()

Conditional transformations: case_when()

Conditional transformations: case_when()

Conditional transformations: case_when()

case_when(): more complex example

Compatible types

Example: labelling numbers as even or odd

Operating on numeric vectors

Parsing strings to get numbers

Parsing strings to get numbers

Arithmetic and “recycling rules”

Arithmetic and “recycling rules”

Recycling rules

Modular arithmetic

Modular arithmetic

Ranks: dplyr::min_rank()

Ranks: dplyr::min_rank()

Minimum, maximum, quantiles

Operating on character vectors

glue::glue()

`%in%`

Conditional transformations: `if_else()`

Conditional transformations: `if_else()`

Conditional transformations: `case_when()`

Conditional transformations: `case_when()`

Conditional transformations: `case_when()`

`case_when()`: more complex example

Ranks: `dplyr::min_rank()`

Ranks: `dplyr::min_rank()`

`glue::glue()`