04: Transformations of numeric vectors
STA35B: Statistical Data Science 2
Intro
This slide deck will focus on transforming numbers.
library(tidyverse)
library(nycflights13)
flights
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
...
Parsing strings to get numbers
readr package in the tidyverse has two useful functions:
parse_double() – useful when you have numbers written as strings.
parse_number() – ignores all non-numeric text to parse strings.
parse_double(c("1.2", "5.6", "1e3"))
parse_number(c("$1,234", "USD 53,256", "59%"))
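As a quick check of what the two calls above return, here is a minimal sketch. It assumes only that the readr package is installed; the variable names `doubles` and `numbers` are illustrative and not part of the slides.

```r
library(readr)

# Strings that already look like numbers (including scientific notation)
doubles <- parse_double(c("1.2", "5.6", "1e3"))
doubles
# [1]    1.2    5.6 1000.0

# Currency symbols, grouping commas, and percent signs are dropped
numbers <- parse_number(c("$1,234", "USD 53,256", "59%"))
numbers
# [1]  1234 53256    59
```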
Parsing strings to get numbers
What happens if you use parse_double() on strings that do not represent numbers?
x <- parse_double(c("qwerty", "6.5", "asdf"))
Warning: 2 parsing failures.
row col expected actual
1 -- a double qwerty
3 -- a double asdf
str(x)  # shows that x is a vector with informative attributes
num [1:3] NA 6.5 NA
- attr(*, "problems")= tibble [2 × 4] (S3: tbl_df/tbl/data.frame)
..$ row : int [1:2] 1 3
..$ col : int [1:2] NA NA
..$ expected: chr [1:2] "a double" "a double"
..$ actual : chr [1:2] "qwerty" "asdf"
You can access this tibble with attributes(x)$problems, or equivalently with readr's problems(x).
Arithmetic and “recycling rules”
We’ve created new columns before,
e.g. flights %>% mutate(air_time_hr = air_time / 60).
air_time has 336,776 elements while 60 has only one, so we divide every element of air_time by 60.
If you have two vectors of the same length, operations are done element-wise:
x <- c(1, 2, 3, 4)
y <- c(2, 3, 4, 5)
x / y
[1] 0.5000000 0.6666667 0.7500000 0.8000000
Arithmetic and “recycling rules”
What happens if the shorter vector's length is neither 1 nor the length of the longer vector?
R does what is called recycling, or repeating.
It creates a new vector that repeats the shorter one until it reaches the length of the longer one.
R gives a warning if the longer length is not a whole multiple of the shorter length.
x <- c(-1, -2, -3, -4)
x * c(1, 2)
x * c(5, 6, 7)
Warning in x * c(5, 6, 7): longer object length is not a multiple of shorter
object length
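To make the recycling step concrete, the following base R sketch spells out the repeated vector with rep_len() and compares it with what R does implicitly. The names `recycled` and `partial` are illustrative.

```r
x <- c(-1, -2, -3, -4)

# The shorter vector c(1, 2) is repeated to length 4: 1, 2, 1, 2
recycled <- rep_len(c(1, 2), length(x))
x * c(1, 2)
# [1] -1 -4 -3 -8
identical(x * c(1, 2), x * recycled)
# [1] TRUE

# Length 3 does not divide length 4: R still recycles (5, 6, 7, 5) but warns
partial <- suppressWarnings(x * c(5, 6, 7))
partial
# [1]  -5 -12 -21 -20
```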
Recycling rules
Recycling applies to all comparison operators (==, <, etc.) and arithmetic operators (+, ^, etc.).
Be careful when doing logical comparisons / arithmetic using two vectors!
flights %>% filter(month == c(1, 2))
# A tibble: 25,977 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 542 540 2 923 850
3 2013 1 1 554 600 -6 812 837
...
month == c(1, 2) returns a logical vector in which an element is:
TRUE if the row number is odd and the month is 1, or the row number is even and the month is 2; FALSE otherwise.
To keep every flight from January or February, use month %in% c(1, 2) instead!
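The difference between the two operators is easier to see on a short vector. This base R sketch uses a made-up `month` vector of length 6, so that c(1, 2) recycles without a warning:

```r
month <- c(1, 1, 2, 2, 1, 2)

# == recycles c(1, 2) to 1, 2, 1, 2, 1, 2 and compares position by position
eq_result <- month == c(1, 2)
eq_result
# [1]  TRUE FALSE FALSE  TRUE  TRUE  TRUE

# %in% asks, for each element, whether it appears anywhere in c(1, 2)
in_result <- month %in% c(1, 2)
in_result
# [1] TRUE TRUE TRUE TRUE TRUE TRUE
```

Every element is 1 or 2, yet == silently drops two of them, which is the same thing that happens in the filter() call above.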
Parallel minimums and maximums
pmin() and pmax() return the parallel (element-wise) min / max of 2 or more vectors.
Example tibble df:
# A tibble: 3 × 3
x y z
<dbl> <dbl> <dbl>
1 1 3 5
2 5 2 7
3 7 NA 1
df %>%
  mutate(min = pmin(x, y, z, na.rm = TRUE),
         max = pmax(x, y, z, na.rm = TRUE))
# A tibble: 3 × 5
x y z min max
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 3 5 1 5
2 5 2 7 2 7
3 7 NA 1 1 7
Parallel minimums and maximums
This differs from min() and max(), which each return a single value (here repeated in every row):
df %>%
  mutate(bad_min = min(x, y, z, na.rm = TRUE),
         bad_max = max(x, y, z, na.rm = TRUE))
# A tibble: 3 × 5
x y z bad_min bad_max
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 3 5 1 7
2 5 2 7 1 7
3 7 NA 1 1 7
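The same contrast can be reproduced without a tibble. This base R sketch uses plain vectors holding the same values as the columns of df; the result names are illustrative.

```r
# Same values as the columns of df
x <- c(1, 5, 7)
y <- c(3, 2, NA)
z <- c(5, 7, 1)

# Parallel versions: one result per position
row_min <- pmin(x, y, z, na.rm = TRUE)   # 1 2 1
row_max <- pmax(x, y, z, na.rm = TRUE)   # 5 7 7

# Summary versions: one result for all values pooled together
overall_min <- min(x, y, z, na.rm = TRUE)  # 1
overall_max <- max(x, y, z, na.rm = TRUE)  # 7
```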
Modular arithmetic
Recall from grade school: division with remainder.
%/%: integer division (the quotient, rounded down).
%%: the remainder after integer division.
Useful for time-related calculations, e.g. get hour and minute from sched_dep_time:
flights %>%
  mutate(hour = sched_dep_time %/% 100,
         minute = sched_dep_time %% 100,
         .keep = 'used')
# A tibble: 336,776 × 3
sched_dep_time hour minute
<int> <dbl> <dbl>
1 515 5 15
2 529 5 29
3 540 5 40
4 545 5 45
...
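The two operators can be tried on a few clock times directly. In this base R sketch the input values and the derived `minutes_since_midnight` quantity are made up for illustration:

```r
sched_dep_time <- c(515, 529, 1345, 2359)  # clock times in HHMM form

hour   <- sched_dep_time %/% 100   # 5 5 13 23
minute <- sched_dep_time %% 100    # 15 29 45 59

# Hypothetical derived quantity: minutes elapsed since midnight
minutes_since_midnight <- hour * 60 + minute
minutes_since_midnight
# [1]  315  329  825 1439
```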
Modular arithmetic
We can then do things like calculate the percent of cancelled flights per scheduled departure hour (a flight with a missing dep_time is treated as cancelled).
flights %>%
  mutate(hour = sched_dep_time %/% 100) %>%
  group_by(hour) %>%
  summarize(percent_cancelled = 100 * mean(is.na(dep_time)),
            n = n())
# A tibble: 20 × 3
hour percent_cancelled n
<dbl> <dbl> <int>
1 1 100 1
2 5 0.461 1953
3 6 1.64 25951
4 7 1.27 22821
5 8 1.62 27242
6 9 1.61 20312
7 10 1.74 16708
8 11 1.85 16033
...
Rounding: round()
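round(x) rounds to the nearest integer.
A value exactly halfway between two integers (an integer + 0.5) is rounded to the nearest even integer.
round(c(1.5, 2.5, 4.5, 5.5))
round(x, digits = n) rounds to n digits past the decimal point (if n > 0); a negative n rounds to the left of the decimal point (e.g. digits = -1 rounds to the nearest ten).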
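A short base R sketch of these rounding rules; the value 123.456 is an arbitrary illustration.

```r
# Halves go to the nearest even integer
halves <- round(c(1.5, 2.5, 4.5, 5.5))
halves
# [1] 2 2 4 6

# Positive digits: places after the decimal point
round(123.456, digits = 2)
# [1] 123.46

# Negative digits: round to tens, hundreds, ...
round(123.456, digits = -1)
# [1] 120
```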
Rounding
Two similar functions: floor() and ceiling()
floor(x) rounds down to the greatest integer <= x.
ceiling(x) rounds up to the least integer >= x.
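Negative inputs are where the two definitions are easiest to mix up, so this base R sketch includes some. The values are arbitrary.

```r
x <- c(-1.5, -0.2, 0.2, 1.5)

floor_x <- floor(x)      # always moves toward -Inf
floor_x
# [1] -2 -1  0  1

ceiling_x <- ceiling(x)  # always moves toward +Inf
ceiling_x
# [1] -1  0  1  2
```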
Cumulative and rolling aggregates
R provides many functions for computing running (i.e., cumulative) aggregates:
sums, products, minimums, etc.
cumsum(), cumprod(), cummin(), cummax(), dplyr::cummean()
[1] 9 17 24 30 35 39 42 44 45
[1] 9.0 8.5 8.0 7.5 7.0 6.5 6.0 5.5 5.0
[1] 9 72 504 3024 15120 60480 181440 362880 362880
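The three printed results above are consistent with a cumulative sum, mean, and product of the integers 9 down to 1. That input is an inference from the numbers, not something the slide states. Under that assumption, a base R sketch that reproduces them:

```r
x <- 9:1  # assumed input, inferred from the printed results

running_sum  <- cumsum(x)
running_mean <- cumsum(x) / seq_along(x)  # same values as dplyr::cummean(x)
running_prod <- cumprod(x)

running_sum
# [1]  9 17 24 30 35 39 42 44 45
running_mean
# [1] 9.0 8.5 8.0 7.5 7.0 6.5 6.0 5.5 5.0
```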
Ranks: dplyr::min_rank()
Takes a vector of numbers and returns the rank of each element, with lowest = 1st.
Tied values share the lowest rank and the following rank is skipped: 1st, 2nd, 2nd, 4th if the second and third smallest values are equal.
x <- c(62, 62, 64, NA, 20)
min_rank(x)
To rank large values first, use min_rank(desc(x)).
There are many variants of min_rank() in dplyr:
row_number(), cume_dist(), percent_rank()
You can explore these on your own; min_rank() is enough in most cases.
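For reference, here is what min_rank() returns for the vector x used above, in both directions. The sketch assumes dplyr is installed.

```r
library(dplyr)

x <- c(62, 62, 64, NA, 20)

ascending <- min_rank(x)         # smallest value gets rank 1; NA stays NA
ascending
# [1]  2  2  4 NA  1

descending <- min_rank(desc(x))  # largest value gets rank 1
descending
# [1]  2  2  1 NA  4
```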
Ranks: dplyr::min_rank()
Example: which 3 flight routes have the longest average delays?
A flight route is determined by origin and dest.
A negative arr_delay means the flight arrived early, so the filter below keeps only flights that arrived late.
flights %>%
  filter(arr_delay > 0) %>%
  group_by(origin, dest) %>%
  summarize(avg_delay = mean(arr_delay, na.rm = TRUE),
            .groups = 'drop') %>%
  mutate(rank = min_rank(desc(avg_delay))) %>%
  filter(rank <= 3) %>%
  arrange(rank)
# A tibble: 3 × 4
origin dest avg_delay rank
<chr> <chr> <dbl> <int>
1 LGA TVC 72.7 1
2 EWR TYS 72.6 2
3 LGA OMA 65.0 3
Offsets
dplyr::lag() returns the “previous” values in a vector.
x <- c(2, 5, 11, 11, 19, 35)
lag(x)  # Returns vector of same length, padded with `NA` if cannot compute.
dplyr::lead() returns the “next” values in a vector.
Use cases (e.g., time series):
x - lag(x) gives the difference between the current and previous value.
x == lag(x) tells whether the current value equals the previous one (TRUE means no change).
[1] NA FALSE FALSE TRUE FALSE FALSE
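The offset functions and both use cases can be run on the vector x from above. This sketch assumes dplyr is installed; note that loading dplyr masks the unrelated stats::lag().

```r
library(dplyr)

x <- c(2, 5, 11, 11, 19, 35)

lag(x)        # previous value; the first element has none
# [1] NA  2  5 11 11 19
lead(x)       # next value; the last element has none
# [1]  5 11 11 19 35 NA
x - lag(x)    # change from the previous value
# [1] NA  3  6  0  8 16
x == lag(x)   # TRUE where the value repeats
# [1]    NA FALSE FALSE  TRUE FALSE FALSE
```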
Minimum, maximum, quantiles
Again, min(x) and max(x) return the single smallest / largest value within vector x.
quantile(x, probs) is a generalization of the median:
quantile(x, 0.25) returns a value that is >= about 25% of the values within x
quantile(x, 0.5) returns the median
quantile(x, 0.95) returns a value that is >= about 95% of the values within x.
Compared to the mean, quantiles are less susceptible to extreme values
Consider c(1, 2, 3, 2, 5, 2, 3, 1, 4, 2, 3, 1, 5, 2, 10000000)
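Running the summary functions on that vector shows the effect. A base R sketch:

```r
x <- c(1, 2, 3, 2, 5, 2, 3, 1, 4, 2, 3, 1, 5, 2, 10000000)

mean(x)     # dragged upward by the single huge value (about 666669)
median(x)
# [1] 2
quantile(x, 0.5, names = FALSE)  # the 0.5 quantile is the median
# [1] 2
```

Fourteen of the fifteen values are 5 or less, and the median reflects that while the mean does not.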
Example: calculate the maximum departure delay and the 95th percentile (0.95 quantile) of departure delays for each day.
flights %>%
  group_by(year, month, day) %>%
  summarise(
    maxim = max(dep_delay, na.rm = TRUE),
    q95 = quantile(dep_delay, 0.95, na.rm = TRUE),
    .groups = 'drop'
  )
# A tibble: 365 × 5
year month day maxim q95
<int> <int> <int> <dbl> <dbl>
1 2013 1 1 853 70.1
2 2013 1 2 379 85
3 2013 1 3 291 68
...
Spread
Standard measures of spread (you should already be familiar with these):
Variance var(): \(var(x) = \frac{1}{\mathsf{length}(x)-1} \sum_{i=1}^{\mathsf{length}(x)} (x[i] - \mathsf{mean}(x))^2\)
Standard deviation sd(): \(sd(x) = \sqrt{var(x)}\)
Interquartile range IQR(): the width of the interval containing the middle 50% of the data:
IQR(x) = quantile(x, 0.75) - quantile(x, 0.25)
IQR is less sensitive to big outliers compared to standard deviation.
Similar to median vs. mean.
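These measures can be checked on a toy example. In this base R sketch, `x` and `y` are made-up vectors that differ only in one extreme value, and `manual_var` applies the variance formula above directly.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
y <- c(1, 2, 3, 4, 5, 6, 7, 8, 900)  # same data with one extreme value

# var() matches the formula above
manual_var <- sum((x - mean(x))^2) / (length(x) - 1)
all.equal(var(x), manual_var)
# [1] TRUE

# The outlier inflates sd() but leaves IQR() unchanged
c(sd(x), sd(y))
c(IQR(x), IQR(y))
# [1] 4 4
```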
Spread: example
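For normally distributed data, 50% of values lie within 0.6745 standard deviations of the mean (so the IQR of normal data is about 2 × 0.6745 = 1.349 standard deviations). Is that true here?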
flights %>%
  group_by(year, month, day) %>%
  summarise(
    stddev_50p_range = 0.6745 * sd(dep_delay, na.rm = TRUE),
    iqr = IQR(dep_delay, na.rm = TRUE),
    .groups = 'drop'
  )
# A tibble: 365 × 5
year month day stddev_50p_range iqr
<int> <int> <int> <dbl> <dbl>
1 2013 1 1 30.5 13
2 2013 1 2 25.1 17
3 2013 1 3 21.2 15
4 2013 1 4 18.7 14
5 2013 1 5 17.4 9
6 2013 1 6 15.6 11.5
...
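For comparison, here is a sketch of where 0.6745 comes from and what truly normal data would look like. The simulated sample (its size, mean, standard deviation, and seed) is arbitrary and only serves as a reference point. Note that 0.6745 × sd is the half-width of the middle 50%, so the full IQR of normal data is about twice that.

```r
# 0.6745 is the 75th percentile of the standard normal distribution
qnorm(0.75)
# [1] 0.6744898

# Simulated normal data (arbitrary mean, sd, and seed chosen for illustration)
set.seed(1)
z <- rnorm(1e5, mean = 10, sd = 3)

# For normal data the IQR is close to 2 * 0.6745 = 1.349 standard deviations
ratio <- IQR(z) / sd(z)
ratio
```

In the flights table above the iqr column is smaller than even the half-width 0.6745 × sd, which suggests that daily departure delays are far from normal (a few very long delays inflate the standard deviation).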
Spread: example
Which destinations show the greatest variation in air speed? Possible metrics:
flights %>%
  mutate(air_speed_mph = distance / (air_time / 60)) %>%
  group_by(dest) %>%
  filter(n() > 1) %>%  # Q: why do we need this?
  summarize(
    speed_middle90 = quantile(air_speed_mph, 0.95, na.rm = TRUE) -
      quantile(air_speed_mph, 0.05, na.rm = TRUE),
    speed_2stddev = 2 * sd(air_speed_mph, na.rm = TRUE),
    speed_max_diff = max(air_speed_mph, na.rm = TRUE) - min(air_speed_mph, na.rm = TRUE)
  ) %>%
  arrange(desc(speed_middle90))
# A tibble: 103 × 4
dest speed_middle90 speed_2stddev speed_max_diff
<chr> <dbl> <dbl> <dbl>
1 ILM 127. 73.8 154.
2 OKC 125. 76.7 213.
...
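One plausible answer to the question in the code comment: spread cannot be estimated from a single observation. A base R sketch with made-up air speeds:

```r
one_flight  <- c(450)       # hypothetical air speed for a single flight
two_flights <- c(450, 470)  # hypothetical air speeds for two flights

sd(one_flight)    # no spread can be estimated from one value
# [1] NA
sd(two_flights)   # defined once there are at least two values

# With one value, the quantile-based range is trivially zero
range_one <- quantile(one_flight, 0.95, names = FALSE) -
  quantile(one_flight, 0.05, names = FALSE)
range_one
# [1] 0
```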