04: Transformations of numeric vectors

STA35B: Statistical Data Science 2

Akira Horiguchi

Intro

This slide deck will focus on transforming numbers.

library(tidyverse)
library(nycflights13)
flights
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
...

Parsing strings to get numbers

readr package in the tidyverse has two useful functions:

  • parse_double() – useful when you have numbers written as strings.
  • parse_number() – ignores all non-numeric text to parse strings.
parse_double(c("1.2", "5.6", "1e3"))
[1]    1.2    5.6 1000.0
parse_number(c("$1,234", "USD 53,256", "59%"))
[1]  1234 53256    59

Parsing strings to get numbers

What happens if use parse_double() with non-numeric-identifying strings?

x <- parse_double(c("qwerty", "6.5", "asdf"))
Warning: 2 parsing failures.
row col expected actual
  1  -- a double qwerty
  3  -- a double asdf  
str(x)  # shows that x is a vector with informative attributes
 num [1:3] NA 6.5 NA
 - attr(*, "problems")= tibble [2 × 4] (S3: tbl_df/tbl/data.frame)
  ..$ row     : int [1:2] 1 3
  ..$ col     : int [1:2] NA NA
  ..$ expected: chr [1:2] "a double" "a double"
  ..$ actual  : chr [1:2] "qwerty" "asdf"

Can access this tibble by attributes(x)$problems

Arithmetic and “recycling rules”

  • We’ve created new rows before
    • e.g. flights %>% mutate(air_time_hr = air_time / 60).
    • air_time has 336,776 elements while 60 has only one, so we divide every element of air_time by 60.
  • If you have two vectors of same length, operations are done element-wise:
x <- c(1, 2, 3, 4)
y <- c(2, 3, 4, 5)
x / y
[1] 0.5000000 0.6666667 0.7500000 0.8000000

Arithmetic and “recycling rules”

What happens if the number of elements is not 1 or the exact matching number?

  • R does what is called recycling, or repeating
    • It will create a new vector which repeats until reaches vector length.
    • Will throw warning if not an even multiple.
x <- c(-1, -2, -3, -4)
x * c(1,2)
[1] -1 -4 -3 -8
x * c(5, 6, 7)
Warning in x * c(5, 6, 7): longer object length is not a multiple of shorter
object length
[1]  -5 -12 -21 -20

Recycling rules

Rules apply for all logical comparison (==, <, etc) and arithmetic (+, ^, etc)

  • Be careful when doing logical comparisons / arithmetic using two vectors!
flights %>% filter(month == c(1,2))
# A tibble: 25,977 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      542            540         2      923            850
 3  2013     1     1      554            600        -6      812            837
...
  • month == c(1,2) returns a logical vector where:

    • TRUE if either the row number is odd and the month is 1, OR row number is even and month 2. Otherwise is FALSE.
  • Better to use month %in% c(1,2) here!

Parallel minimums and maximums

pmin() and pmax() return parallel min / max of 2 or more variables

df
# A tibble: 3 × 3
      x     y     z
  <dbl> <dbl> <dbl>
1     1     3     5
2     5     2     7
3     7    NA     1
df %>% 
  mutate(min = pmin(x, y, z, na.rm=TRUE),
         max = pmax(x, y, z, na.rm=TRUE))
# A tibble: 3 × 5
      x     y     z   min   max
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     3     5     1     5
2     5     2     7     2     7
3     7    NA     1     1     7

Parallel minimums and maximums

Different behavior than using min(), max(), which returns a single value:

df %>%
  mutate(bad_min = min(x, y, z, na.rm=TRUE),
         bad_max = max(x, y, z, na.rm=TRUE))
# A tibble: 3 × 5
      x     y     z bad_min bad_max
  <dbl> <dbl> <dbl>   <dbl>   <dbl>
1     1     3     5       1       7
2     5     2     7       1       7
3     7    NA     1       1       7

Modular arithmetic

Recall from grade school: division by remainder.

  • %/%: integer division
1:10 %/% 3
 [1] 0 0 1 1 1 2 2 2 3 3
  • %%: the remainder after integer division.
1:10 %% 3
 [1] 1 2 0 1 2 0 1 2 0 1

Useful for time-related calculations, e.g. get hour and minute from sched_dep_time:

flights %>% 
  mutate(hour = sched_dep_time %/% 100,
         minute = sched_dep_time %% 100,
         .keep = 'used') 
# A tibble: 336,776 × 3
   sched_dep_time  hour minute
            <int> <dbl>  <dbl>
 1            515     5     15
 2            529     5     29
 3            540     5     40
 4            545     5     45
...

Modular arithmetic

We can then do things like calculate the percent of delayed flights per hour.

flights %>% 
  mutate(hour = sched_dep_time %/% 100) %>%
  group_by(hour) %>%
  summarize(percent_cancelled = 100*mean(is.na(dep_time)),
            n = n())
# A tibble: 20 × 3
    hour percent_cancelled     n
   <dbl>             <dbl> <int>
 1     1           100         1
 2     5             0.461  1953
 3     6             1.64  25951
 4     7             1.27  22821
 5     8             1.62  27242
 6     9             1.61  20312
 7    10             1.74  16708
 8    11             1.85  16033
...

Rounding: round()

  • …to either nearest integer
round(pi)
[1] 3
  • …rounds (an integer + 0.5) to the nearest even integer
round(c(1.5, 2.5, 4.5, 5.5))
[1] 2 2 4 6
  • round(x, digits=n) rounds to n digits past the decimal place (if n>0)
round(123.456, digits=2)
[1] 123.46
round(123.456, digits=1)
[1] 123.5
  • What happens if n<0?
round(123.456, -1)
[1] 120
round(123.456, -2)
[1] 100

Rounding

  • Two similar arguments: floor() and ceiling()
    • floor(x) rounds to greatest integer <= x
    • ceiling(x) rounds to least integer >= x.
floor(123.456)
[1] 123
ceiling(123.456)
[1] 124

Cumulative and rolling aggregates

  • R provides many functions for computing rolling (i.e., cumulative) aggregates
    • sums, products, minimums, etc.
    • cumsum(), cumprod(), cummin(), cummax(), dplyr::cummean()
x <- 9:1
cumsum(x)
[1]  9 17 24 30 35 39 42 44 45
cummean(x)
[1] 9.0 8.5 8.0 7.5 7.0 6.5 6.0 5.5 5.0
cumprod(x)
[1]      9     72    504   3024  15120  60480 181440 362880 362880
cummin(x)
[1] 9 8 7 6 5 4 3 2 1

Ranks: dplyr::min_rank()

  • Takes a vector of numbers and returns the rank of each element, with lowest = 1st.
min_rank(c(7,4,8,9))
[1] 2 1 3 4
  • Ties broken in obvious way: 1st, 2nd, 2nd, 4th if second and third element equal.
x <- c(62, 62, 64, NA, 20)
min_rank(x)
[1]  2  2  4 NA  1
  • To rank large values first, use desc(x):
min_rank(desc(x))
[1]  2  2  1 NA  4

There are many variants of min_rank() in dplyr:

  • row_number(), cume_dist(), percent_rank()
  • You can explore these on your own; min_rank() is enough in most cases.

Ranks: dplyr::min_rank()

Example: which 3 flight routes have the longest average delays?

  • flight route determined by origin and dest
  • negative arr_delay means the flight left early
flights %>%
  filter(arr_delay > 0) %>%
  group_by(origin, dest) %>%
  summarize(avg_delay = mean(arr_delay, na.rm=TRUE),
            .groups = 'drop') %>%
  mutate(rank = min_rank(desc(avg_delay))) %>%
  filter(rank <= 3) %>%
  arrange(by = rank)
# A tibble: 3 × 4
  origin dest  avg_delay  rank
  <chr>  <chr>     <dbl> <int>
1 LGA    TVC        72.7     1
2 EWR    TYS        72.6     2
3 LGA    OMA        65.0     3

Offsets

  • dplyr::lag() returns the “previous” values in a vector.
x <- c(2, 5, 11, 11, 19, 35)
lag(x)  # Returns vector of same length, padded with `NA` if cannot compute.
[1] NA  2  5 11 11 19
  • dplyr::lead() returns the “next” values in a vector.
lead(x)
[1]  5 11 11 19 35 NA

Use cases (e.g., time series)

  • difference between current and previous value.
x - lag(x)
[1] NA  3  6  0  8 16
  • tells whether current value changes.
x == lag(x)
[1]    NA FALSE FALSE  TRUE FALSE FALSE
  • to replace the NA:
lag(x, default = -1)
[1] -1  2  5 11 11 19

Minimum, maximum, quantiles

  • Again, min(x) and max(x) return single smallest/largest vals within vector x.
  • quantile(vector, threshold) is a generalization of median:
    • quantile(x, 0.25) returns value of x that is >= 25% of values within x
    • quantile(x, 0.5) returns median
    • quantile(x, 0.95) returns value of x that is >= 95% of values within x.
  • Compare to the mean, quantiles are less susceptible to extreme values
    • Consider c(1, 2, 3, 2, 5, 2, 3, 1, 4, 2, 3, 1, 5, 2, 10000000)

Example: calculate the maximum delay and the 95th quantile of delays for flights per day.

flights %>% 
  group_by(year, month, day) %>%
  summarise(
    maxim = max(dep_delay, na.rm=TRUE),
    q95 = quantile(dep_delay, 0.95, na.rm=TRUE),
    .groups = 'drop'
  )
# A tibble: 365 × 5
    year month   day maxim   q95
   <int> <int> <int> <dbl> <dbl>
 1  2013     1     1   853  70.1
 2  2013     1     2   379  85  
 3  2013     1     3   291  68  
...

Spread

  • Standard measures of spread (you should already be familiar with these):
    • Variance var(): \(var(x) = \frac{1}{\mathsf{length}(x)-1} \sum_{i=1}^{\mathsf{length}(x)} (x[i] - \mathsf{mean}(x))^2\)
    • Standard deviation sd(): \(sd(x) = \sqrt{var(x)}\)
    • Interquartile range contains middle 50% of data:
      • IQR(x) = quantile(x, 0.75) - quantile(x, 0.25)
  • IQR is less sensitive to big outliers compared to standard deviation.
    • Similar to median vs. mean.

Spread: example

50% of Standard Normal data lies within 0.6745 stddev’s of the mean. True here?

flights %>% 
  group_by(year, month, day) %>%
  summarise(
    stddev_50p_range = 0.6745*sd(dep_delay, na.rm=TRUE),
    iqr = IQR(dep_delay, na.rm=TRUE),
    .groups = 'drop'
  )
# A tibble: 365 × 5
    year month   day stddev_50p_range   iqr
   <int> <int> <int>            <dbl> <dbl>
 1  2013     1     1             30.5  13  
 2  2013     1     2             25.1  17  
 3  2013     1     3             21.2  15  
 4  2013     1     4             18.7  14  
 5  2013     1     5             17.4   9  
 6  2013     1     6             15.6  11.5
...

Spread: example

Which destinations show the greatest variation in air speed? Possible metrics:

  • Middle 90% by quantiles; Within 2 standard deviations; Maximum minus minimum.

flights %>%
  mutate(air_speed_mph = distance / (air_time / 60)) %>%
  group_by(dest) %>%
  filter(n() > 1) %>%  # Q: why do we need this?
  summarize(
    speed_middle90 = quantile(air_speed_mph, 0.95, na.rm=TRUE) - quantile(air_speed_mph, 0.05, na.rm=TRUE),
    speed_2stddev = 2*sd(air_speed_mph, na.rm=TRUE),
    speed_max_diff = max(air_speed_mph, na.rm=TRUE) - min(air_speed_mph, na.rm=TRUE) 
  ) %>% arrange(by = desc(speed_middle90))
# A tibble: 103 × 4
   dest  speed_middle90 speed_2stddev speed_max_diff
   <chr>          <dbl>         <dbl>          <dbl>
 1 ILM             127.          73.8           154.
 2 OKC             125.          76.7           213.
...