04: Transformations of numeric vectors

STA35B: Statistical Data Science 2

Akira Horiguchi

Intro

This slide deck will focus on transforming numbers.

library(tidyverse)
library(nycflights13)
flights

# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
...

Parsing strings to get numbers

readr package in the tidyverse has two useful functions:

parse_double() – useful when you have numbers written as strings.
parse_number() – ignores all non-numeric text to parse strings.

parse_double(c("1.2", "5.6", "1e3"))

[1]    1.2    5.6 1000.0

parse_number(c("$1,234", "USD 53,256", "59%"))

[1]  1234 53256    59

Parsing strings to get numbers

What happens if use parse_double() with non-numeric-identifying strings?

x <- parse_double(c("qwerty", "6.5", "asdf"))

Warning: 2 parsing failures.
row col expected actual
  1  -- a double qwerty
  3  -- a double asdf

str(x)  # shows that x is a vector with informative attributes

 num [1:3] NA 6.5 NA
 - attr(*, "problems")= tibble [2 × 4] (S3: tbl_df/tbl/data.frame)
  ..$ row     : int [1:2] 1 3
  ..$ col     : int [1:2] NA NA
  ..$ expected: chr [1:2] "a double" "a double"
  ..$ actual  : chr [1:2] "qwerty" "asdf"

Can access this tibble by attributes(x)$problems

Arithmetic and “recycling rules”

We’ve created new rows before
- e.g. flights %>% mutate(air_time_hr = air_time / 60).
- air_time has 336,776 elements while 60 has only one, so we divide every element of air_time by 60.
If you have two vectors of same length, operations are done element-wise:

x <- c(1, 2, 3, 4)
y <- c(2, 3, 4, 5)
x / y

[1] 0.5000000 0.6666667 0.7500000 0.8000000

Arithmetic and “recycling rules”

What happens if the number of elements is not 1 or the exact matching number?

R does what is called recycling, or repeating
- It will create a new vector which repeats until reaches vector length.
- Will throw warning if not an even multiple.

x <- c(-1, -2, -3, -4)
x * c(1,2)

[1] -1 -4 -3 -8

x * c(5, 6, 7)

Warning in x * c(5, 6, 7): longer object length is not a multiple of shorter
object length

[1]  -5 -12 -21 -20

Recycling rules

Rules apply for all logical comparison (==, <, etc) and arithmetic (+, ^, etc)

Be careful when doing logical comparisons / arithmetic using two vectors!

flights %>% filter(month == c(1,2))

# A tibble: 25,977 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      542            540         2      923            850
 3  2013     1     1      554            600        -6      812            837
...

month == c(1,2) returns a logical vector where:
- TRUE if either the row number is odd and the month is 1, OR row number is even and month 2. Otherwise is FALSE.
Better to use month %in% c(1,2) here!

Parallel minimums and maximums

pmin() and pmax() return parallel min / max of 2 or more variables

df

# A tibble: 3 × 3
      x     y     z
  <dbl> <dbl> <dbl>
1     1     3     5
2     5     2     7
3     7    NA     1

df %>% 
  mutate(min = pmin(x, y, z, na.rm=TRUE),
         max = pmax(x, y, z, na.rm=TRUE))

# A tibble: 3 × 5
      x     y     z   min   max
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     3     5     1     5
2     5     2     7     2     7
3     7    NA     1     1     7

Parallel minimums and maximums

Different behavior than using min(), max(), which returns a single value:

df %>%
  mutate(bad_min = min(x, y, z, na.rm=TRUE),
         bad_max = max(x, y, z, na.rm=TRUE))

# A tibble: 3 × 5
      x     y     z bad_min bad_max
  <dbl> <dbl> <dbl>   <dbl>   <dbl>
1     1     3     5       1       7
2     5     2     7       1       7
3     7    NA     1       1       7

Modular arithmetic

Recall from grade school: division by remainder.

%/%: integer division

1:10 %/% 3

 [1] 0 0 1 1 1 2 2 2 3 3

%%: the remainder after integer division.

1:10 %% 3

 [1] 1 2 0 1 2 0 1 2 0 1

Useful for time-related calculations, e.g. get hour and minute from sched_dep_time:

flights %>% 
  mutate(hour = sched_dep_time %/% 100,
         minute = sched_dep_time %% 100,
         .keep = 'used')

# A tibble: 336,776 × 3
   sched_dep_time  hour minute
            <int> <dbl>  <dbl>
 1            515     5     15
 2            529     5     29
 3            540     5     40
 4            545     5     45
...

Modular arithmetic

We can then do things like calculate the percent of delayed flights per hour.

flights %>% 
  mutate(hour = sched_dep_time %/% 100) %>%
  group_by(hour) %>%
  summarize(percent_cancelled = 100*mean(is.na(dep_time)),
            n = n())

# A tibble: 20 × 3
    hour percent_cancelled     n
   <dbl>             <dbl> <int>
 1     1           100         1
 2     5             0.461  1953
 3     6             1.64  25951
 4     7             1.27  22821
 5     8             1.62  27242
 6     9             1.61  20312
 7    10             1.74  16708
 8    11             1.85  16033
...

Rounding: `round()`

…to either nearest integer

round(pi)

[1] 3

…rounds (an integer + 0.5) to the nearest even integer

round(c(1.5, 2.5, 4.5, 5.5))

[1] 2 2 4 6

round(x, digits=n) rounds to n digits past the decimal place (if n>0)

round(123.456, digits=2)

[1] 123.46

round(123.456, digits=1)

[1] 123.5

What happens if n<0?

round(123.456, -1)

[1] 120

round(123.456, -2)

[1] 100

Rounding

Two similar arguments: floor() and ceiling()
- floor(x) rounds to greatest integer <= x
- ceiling(x) rounds to least integer >= x.

floor(123.456)

[1] 123

ceiling(123.456)

[1] 124

Cumulative and rolling aggregates

R provides many functions for computing rolling (i.e., cumulative) aggregates
- sums, products, minimums, etc.
- cumsum(), cumprod(), cummin(), cummax(), dplyr::cummean()

x <- 9:1
cumsum(x)

[1]  9 17 24 30 35 39 42 44 45

cummean(x)

[1] 9.0 8.5 8.0 7.5 7.0 6.5 6.0 5.5 5.0

cumprod(x)

[1]      9     72    504   3024  15120  60480 181440 362880 362880

cummin(x)

[1] 9 8 7 6 5 4 3 2 1

Ranks: `dplyr::min_rank()`

Takes a vector of numbers and returns the rank of each element, with lowest = 1st.

min_rank(c(7,4,8,9))

[1] 2 1 3 4

Ties broken in obvious way: 1st, 2nd, 2nd, 4th if second and third element equal.

x <- c(62, 62, 64, NA, 20)
min_rank(x)

[1]  2  2  4 NA  1

To rank large values first, use desc(x):

min_rank(desc(x))

[1]  2  2  1 NA  4

There are many variants of min_rank() in dplyr:

row_number(), cume_dist(), percent_rank()
You can explore these on your own; min_rank() is enough in most cases.

Ranks: `dplyr::min_rank()`

Example: which 3 flight routes have the longest average delays?

flight route determined by origin and dest
negative arr_delay means the flight left early

flights %>%
  filter(arr_delay > 0) %>%
  group_by(origin, dest) %>%
  summarize(avg_delay = mean(arr_delay, na.rm=TRUE),
            .groups = 'drop') %>%
  mutate(rank = min_rank(desc(avg_delay))) %>%
  filter(rank <= 3) %>%
  arrange(by = rank)

# A tibble: 3 × 4
  origin dest  avg_delay  rank
  <chr>  <chr>     <dbl> <int>
1 LGA    TVC        72.7     1
2 EWR    TYS        72.6     2
3 LGA    OMA        65.0     3

Offsets

dplyr::lag() returns the “previous” values in a vector.

x <- c(2, 5, 11, 11, 19, 35)
lag(x)  # Returns vector of same length, padded with `NA` if cannot compute.

[1] NA  2  5 11 11 19

dplyr::lead() returns the “next” values in a vector.

lead(x)

[1]  5 11 11 19 35 NA

Use cases (e.g., time series)

difference between current and previous value.

x - lag(x)

[1] NA  3  6  0  8 16

tells whether current value changes.

x == lag(x)

[1]    NA FALSE FALSE  TRUE FALSE FALSE

to replace the NA:

lag(x, default = -1)

[1] -1  2  5 11 11 19

Minimum, maximum, quantiles

Again, min(x) and max(x) return single smallest/largest vals within vector x.
quantile(vector, threshold) is a generalization of median:
- quantile(x, 0.25) returns value of x that is >= 25% of values within x
- quantile(x, 0.5) returns median
- quantile(x, 0.95) returns value of x that is >= 95% of values within x.
Compare to the mean, quantiles are less susceptible to extreme values
- Consider c(1, 2, 3, 2, 5, 2, 3, 1, 4, 2, 3, 1, 5, 2, 10000000)

Example: calculate the maximum delay and the 95th quantile of delays for flights per day.

flights %>% 
  group_by(year, month, day) %>%
  summarise(
    maxim = max(dep_delay, na.rm=TRUE),
    q95 = quantile(dep_delay, 0.95, na.rm=TRUE),
    .groups = 'drop'
  )

# A tibble: 365 × 5
    year month   day maxim   q95
   <int> <int> <int> <dbl> <dbl>
 1  2013     1     1   853  70.1
 2  2013     1     2   379  85  
 3  2013     1     3   291  68  
...

Spread

Standard measures of spread (you should already be familiar with these):
- Variance var(): $var(x) = \frac{1}{\mathsf{length}(x)-1} \sum_{i=1}^{\mathsf{length}(x)} (x[i] - \mathsf{mean}(x))^2$
- Standard deviation sd(): $sd(x) = \sqrt{var(x)}$
- Interquartile range contains middle 50% of data:
  - IQR(x) = quantile(x, 0.75) - quantile(x, 0.25)
IQR is less sensitive to big outliers compared to standard deviation.
- Similar to median vs. mean.

Spread: example

50% of Standard Normal data lies within 0.6745 stddev’s of the mean. True here?

flights %>% 
  group_by(year, month, day) %>%
  summarise(
    stddev_50p_range = 0.6745*sd(dep_delay, na.rm=TRUE),
    iqr = IQR(dep_delay, na.rm=TRUE),
    .groups = 'drop'
  )

# A tibble: 365 × 5
    year month   day stddev_50p_range   iqr
   <int> <int> <int>            <dbl> <dbl>
 1  2013     1     1             30.5  13  
 2  2013     1     2             25.1  17  
 3  2013     1     3             21.2  15  
 4  2013     1     4             18.7  14  
 5  2013     1     5             17.4   9  
 6  2013     1     6             15.6  11.5
...

Spread: example

Which destinations show the greatest variation in air speed? Possible metrics:

Middle 90% by quantiles; Within 2 standard deviations; Maximum minus minimum.

flights %>%
  mutate(air_speed_mph = distance / (air_time / 60)) %>%
  group_by(dest) %>%
  filter(n() > 1) %>%  # Q: why do we need this?
  summarize(
    speed_middle90 = quantile(air_speed_mph, 0.95, na.rm=TRUE) - quantile(air_speed_mph, 0.05, na.rm=TRUE),
    speed_2stddev = 2*sd(air_speed_mph, na.rm=TRUE),
    speed_max_diff = max(air_speed_mph, na.rm=TRUE) - min(air_speed_mph, na.rm=TRUE) 
  ) %>% arrange(by = desc(speed_middle90))

# A tibble: 103 × 4
   dest  speed_middle90 speed_2stddev speed_max_diff
   <chr>          <dbl>         <dbl>          <dbl>
 1 ILM             127.          73.8           154.
 2 OKC             125.          76.7           213.
...

04: Transformations of numeric vectors

Intro

Parsing strings to get numbers

Parsing strings to get numbers

Arithmetic and “recycling rules”

Arithmetic and “recycling rules”

Recycling rules

Parallel minimums and maximums

Parallel minimums and maximums

Modular arithmetic

Modular arithmetic

Rounding: round()

Rounding

Cumulative and rolling aggregates

Ranks: dplyr::min_rank()

Ranks: dplyr::min_rank()

Offsets

Minimum, maximum, quantiles

Spread

Spread: example

Spread: example

Rounding: `round()`

Ranks: `dplyr::min_rank()`

Ranks: `dplyr::min_rank()`