04: Transformations of numeric vectors
STA35B: Statistical Data Science 2
Intro
This slide deck will focus on transforming numbers.
library (tidyverse)
library (nycflights13)
flights
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
...
Parsing strings to get numbers
readr
package in the tidyverse has two useful functions:
parse_double()
– useful when you have numbers written as strings.
parse_number()
– ignores all non-numeric text to parse strings.
parse_double (c ("1.2" , "5.6" , "1e3" ))
parse_number (c ("$1,234" , "USD 53,256" , "59%" ))
Parsing strings to get numbers
What happens if use parse_double()
with non-numeric-identifying strings?
x <- parse_double (c ("qwerty" , "6.5" , "asdf" ))
Warning: 2 parsing failures.
row col expected actual
1 -- a double qwerty
3 -- a double asdf
str (x) # shows that x is a vector with informative attributes
num [1:3] NA 6.5 NA
- attr(*, "problems")= tibble [2 × 4] (S3: tbl_df/tbl/data.frame)
..$ row : int [1:2] 1 3
..$ col : int [1:2] NA NA
..$ expected: chr [1:2] "a double" "a double"
..$ actual : chr [1:2] "qwerty" "asdf"
Can access this tibble by attributes(x)$problems
Arithmetic and “recycling rules”
We’ve created new rows before
e.g. flights %>% mutate(air_time_hr = air_time / 60)
.
air_time
has 336,776 elements while 60 has only one, so we divide every element of air_time
by 60.
If you have two vectors of same length, operations are done element-wise:
x <- c (1 , 2 , 3 , 4 )
y <- c (2 , 3 , 4 , 5 )
x / y
[1] 0.5000000 0.6666667 0.7500000 0.8000000
Arithmetic and “recycling rules”
What happens if the number of elements is not 1 or the exact matching number?
R does what is called recycling , or repeating
It will create a new vector which repeats until reaches vector length.
Will throw warning if not an even multiple.
x <- c (- 1 , - 2 , - 3 , - 4 )
x * c (1 ,2 )
Warning in x * c(5, 6, 7): longer object length is not a multiple of shorter
object length
Recycling rules
Rules apply for all logical comparison (==
, <
, etc) and arithmetic (+
, ^
, etc)
Be careful when doing logical comparisons / arithmetic using two vectors!
flights %>% filter (month == c (1 ,2 ))
# A tibble: 25,977 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 542 540 2 923 850
3 2013 1 1 554 600 -6 812 837
...
month == c(1,2)
returns a logical vector where:
TRUE
if either the row number is odd and the month is 1, OR row number is even and month 2. Otherwise is FALSE
.
Better to use month %in% c(1,2)
here!
Parallel minimums and maximums
pmin()
and pmax()
return parallel min / max of 2 or more variables
# A tibble: 3 × 3
x y z
<dbl> <dbl> <dbl>
1 1 3 5
2 5 2 7
3 7 NA 1
df %>%
mutate (min = pmin (x, y, z, na.rm= TRUE ),
max = pmax (x, y, z, na.rm= TRUE ))
# A tibble: 3 × 5
x y z min max
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 3 5 1 5
2 5 2 7 2 7
3 7 NA 1 1 7
Parallel minimums and maximums
Different behavior than using min()
, max()
, which returns a single value:
df %>%
mutate (bad_min = min (x, y, z, na.rm= TRUE ),
bad_max = max (x, y, z, na.rm= TRUE ))
# A tibble: 3 × 5
x y z bad_min bad_max
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 3 5 1 7
2 5 2 7 1 7
3 7 NA 1 1 7
Modular arithmetic
Recall from grade school: division by remainder .
%%
: the remainder after integer division.
Useful for time-related calculations, e.g. get hour and minute from sched_dep_time
:
flights %>%
mutate (hour = sched_dep_time %/% 100 ,
minute = sched_dep_time %% 100 ,
.keep = 'used' )
# A tibble: 336,776 × 3
sched_dep_time hour minute
<int> <dbl> <dbl>
1 515 5 15
2 529 5 29
3 540 5 40
4 545 5 45
...
Modular arithmetic
We can then do things like calculate the percent of delayed flights per hour.
flights %>%
mutate (hour = sched_dep_time %/% 100 ) %>%
group_by (hour) %>%
summarize (percent_cancelled = 100 * mean (is.na (dep_time)),
n = n ())
# A tibble: 20 × 3
hour percent_cancelled n
<dbl> <dbl> <int>
1 1 100 1
2 5 0.461 1953
3 6 1.64 25951
4 7 1.27 22821
5 8 1.62 27242
6 9 1.61 20312
7 10 1.74 16708
8 11 1.85 16033
...
Rounding: round()
…to either nearest integer
…rounds (an integer + 0.5) to the nearest even integer
round (c (1.5 , 2.5 , 4.5 , 5.5 ))
round(x, digits=n)
rounds to n
digits past the decimal place (if n
>0)
Rounding
Two similar arguments: floor()
and ceiling()
floor(x)
rounds to greatest integer <= x
ceiling(x)
rounds to least integer >= x.
Cumulative and rolling aggregates
R provides many functions for computing rolling (i.e., cumulative) aggregates
sums, products, minimums, etc.
cumsum()
, cumprod()
, cummin()
, cummax()
, dplyr::cummean()
[1] 9 17 24 30 35 39 42 44 45
[1] 9.0 8.5 8.0 7.5 7.0 6.5 6.0 5.5 5.0
[1] 9 72 504 3024 15120 60480 181440 362880 362880
Ranks: dplyr::min_rank()
Takes a vector of numbers and returns the rank of each element, with lowest = 1st.
Ties broken in obvious way: 1st, 2nd, 2nd, 4th if second and third element equal.
x <- c (62 , 62 , 64 , NA , 20 )
min_rank (x)
To rank large values first, use desc(x)
:
There are many variants of min_rank()
in dplyr
:
row_number()
, cume_dist()
, percent_rank()
You can explore these on your own; min_rank()
is enough in most cases.
Ranks: dplyr::min_rank()
Example: which 3 flight routes have the longest average delays?
flight route determined by origin
and dest
negative arr_delay
means the flight left early
flights %>%
filter (arr_delay > 0 ) %>%
group_by (origin, dest) %>%
summarize (avg_delay = mean (arr_delay, na.rm= TRUE ),
.groups = 'drop' ) %>%
mutate (rank = min_rank (desc (avg_delay))) %>%
filter (rank <= 3 ) %>%
arrange (by = rank)
# A tibble: 3 × 4
origin dest avg_delay rank
<chr> <chr> <dbl> <int>
1 LGA TVC 72.7 1
2 EWR TYS 72.6 2
3 LGA OMA 65.0 3
Offsets
dplyr::lag()
returns the “previous” values in a vector.
x <- c (2 , 5 , 11 , 11 , 19 , 35 )
lag (x) # Returns vector of same length, padded with `NA` if cannot compute.
dplyr::lead()
returns the “next” values in a vector.
Use cases (e.g., time series)
difference between current and previous value.
tells whether current value changes.
[1] NA FALSE FALSE TRUE FALSE FALSE
Minimum, maximum, quantiles
Again, min(x)
and max(x)
return single smallest/largest vals within vector x
.
quantile(vector, threshold)
is a generalization of median:
quantile(x, 0.25)
returns value of x
that is >= 25% of values within x
quantile(x, 0.5)
returns median
quantile(x, 0.95)
returns value of x
that is >= 95% of values within x
.
Compare to the mean, quantiles are less susceptible to extreme values
Consider c(1, 2, 3, 2, 5, 2, 3, 1, 4, 2, 3, 1, 5, 2, 10000000)
Example: calculate the maximum delay and the 95th quantile of delays for flights per day.
flights %>%
group_by (year, month, day) %>%
summarise (
maxim = max (dep_delay, na.rm= TRUE ),
q95 = quantile (dep_delay, 0.95 , na.rm= TRUE ),
.groups = 'drop'
)
# A tibble: 365 × 5
year month day maxim q95
<int> <int> <int> <dbl> <dbl>
1 2013 1 1 853 70.1
2 2013 1 2 379 85
3 2013 1 3 291 68
...
Spread
Standard measures of spread (you should already be familiar with these):
Variance var()
: \(var(x) = \frac{1}{\mathsf{length}(x)-1} \sum_{i=1}^{\mathsf{length}(x)} (x[i] - \mathsf{mean}(x))^2\)
Standard deviation sd()
: \(sd(x) = \sqrt{var(x)}\)
Interquartile range contains middle 50% of data:
IQR(x) = quantile(x, 0.75) - quantile(x, 0.25)
IQR is less sensitive to big outliers compared to standard deviation.
Similar to median vs. mean.
Spread: example
50% of Standard Normal data lies within 0.6745 stddev’s of the mean. True here?
flights %>%
group_by (year, month, day) %>%
summarise (
stddev_50p_range = 0.6745 * sd (dep_delay, na.rm= TRUE ),
iqr = IQR (dep_delay, na.rm= TRUE ),
.groups = 'drop'
)
# A tibble: 365 × 5
year month day stddev_50p_range iqr
<int> <int> <int> <dbl> <dbl>
1 2013 1 1 30.5 13
2 2013 1 2 25.1 17
3 2013 1 3 21.2 15
4 2013 1 4 18.7 14
5 2013 1 5 17.4 9
6 2013 1 6 15.6 11.5
...
Spread: example
Which destinations show the greatest variation in air speed? Possible metrics:
flights %>%
mutate (air_speed_mph = distance / (air_time / 60 )) %>%
group_by (dest) %>%
filter (n () > 1 ) %>% # Q: why do we need this?
summarize (
speed_middle90 = quantile (air_speed_mph, 0.95 , na.rm= TRUE ) - quantile (air_speed_mph, 0.05 , na.rm= TRUE ),
speed_2stddev = 2 * sd (air_speed_mph, na.rm= TRUE ),
speed_max_diff = max (air_speed_mph, na.rm= TRUE ) - min (air_speed_mph, na.rm= TRUE )
) %>% arrange (by = desc (speed_middle90))
# A tibble: 103 × 4
dest speed_middle90 speed_2stddev speed_max_diff
<chr> <dbl> <dbl> <dbl>
1 ILM 127. 73.8 154.
2 OKC 125. 76.7 213.
...