library(tidyverse)
library(nycflights13)
library(palmerpenguins)
set.seed(88)
STA35B: Statistical Data Science 2
Goal: rescale each column of df to have range between 0 and 1.
# Try to understand this code. (Do you spot the error?)
df <- tibble(a = rnorm(4), b = rnorm(4), c = rnorm(4))
df |> mutate(
a = (a - min(a, na.rm = TRUE)) /
(max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),
b = (b - min(b, na.rm = TRUE)) /
(max(b, na.rm = TRUE) - min(a, na.rm = TRUE)),
c = (c - min(c, na.rm = TRUE)) /
(max(c, na.rm = TRUE) - min(c, na.rm = TRUE))
)
# A tibble: 4 × 3
a b c
<dbl> <dbl> <dbl>
1 0.386 5.44 0
2 0.595 4.71 0.855
3 1 4.73 0.206
4 0 0 1
R environment as an object with this name.
When calling a function, you can specify the arguments by:
get_age_in_years() which takes in a vector of strings of birthdates of the form “YYYY-MM-DD” and computes the age in years.
get_num_both_na() which takes in two vectors of the same length and returns the number of positions which have an NA in both vectors.
Function names should be…
is_xx
Ctrl + I will “correctly” auto-indent the given line.
Ctrl + I to “correctly” auto-indent the selected text.
Ctrl + A then Ctrl + I to “correctly” auto-indent the entire file.
Recall: here’s what we were trying to do:
Compare code with functions…
… to code without functions:
# A tibble: 4 × 3
a b c
<dbl> <dbl> <dbl>
1 0.386 5.44 0
2 0.595 4.71 0.855
3 1 4.73 0.206
4 0 0 1
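A minimal sketch of the function-based version being compared above. It assumes rescale01() is the usual min-max rescaling; the slide's own definition is not shown in this text, and df_demo is illustrative data, not the df from the slides.

```r
library(dplyr)
library(tibble)

# Rescale a numeric vector so that its values span [0, 1].
rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# Illustrative data (an assumption, not the df used in the slides).
df_demo <- tibble(a = c(1, 3, 5), b = c(10, 20, 40), c = c(-1, 0, 1))

# One short call per column: the column name appears once per call,
# so a copy-paste slip like mixing min(a) into the formula for b cannot happen.
df_demo |> mutate(a = rescale01(a), b = rescale01(b), c = rescale01(c))
```

Because the formula lives in one place, a fix to the formula is made once rather than in every copy.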
Note: rescale01 computes min(x, na.rm = TRUE) twice.
min(x, na.rm = TRUE) takes a long time to compute?
min(x, na.rm = TRUE) only once rather than twice.
range() returns min and max of the vector, and has arguments…
na.rm, which removes NAs
finite, which removes Infs and other non-finite values (max() and min() do not have this argument)
mutate() or filter() or summarize()
We want a function inside of…
mutate() to return a vector of the same length as the function’s argument.
filter() to return a logical vector of the same length as the function’s argument.
summarize() to return a single value.
# A tibble: 344 × 2
body_mass_g body_mass_z
<int> <dbl>
1 3750 -0.563
2 3800 -0.501
3 3250 -1.19
4 NA NA
5 3450 -0.937
6 3650 -0.688
7 3625 -0.719
8 4675 0.590
9 3475 -0.906
10 4250 0.0602
# ℹ 334 more rows
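The table above is consistent with a standardizing ("z-score") function used inside mutate(). Below is a sketch under that assumption; the name z_score and the small demo tibble are hypothetical.

```r
library(dplyr)
library(tibble)

# Standardize a vector: subtract the mean, divide by the standard deviation.
# Returns a vector of the same length as x, which is what mutate() needs.
z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Illustrative data; an NA input stays NA in the output.
demo <- tibble(mass = c(3750, 3800, 3250, NA, 3450))
demo |> mutate(mass_z = z_score(mass))
```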
min and value max
[1] 3 3 3 4 5 6 7 7 7 7
# A tibble: 344 × 2
bill_length_mm bill_len_middle
<dbl> <dbl>
1 39.1 39.1
2 39.5 39.5
3 40.3 40
4 NA NA
5 36.7 36.7
6 39.3 39.3
7 38.9 38.9
8 39.2 39.2
9 34.1 35
10 42 40
# ℹ 334 more rows
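A clamping function of the kind illustrated above can be sketched with base R's pmax() and pmin(). The name clamp and the argument names lo and hi are assumptions; the slide's own definition is not shown in this text.

```r
# Force every value of x into the interval [lo, hi].
# pmax() raises values below lo; pmin() lowers values above hi.
# NA inputs stay NA because na.rm defaults to FALSE in both.
clamp <- function(x, lo, hi) {
  pmin(pmax(x, lo), hi)
}

clamp(1:10, lo = 3, hi = 7)
clamp(c(39.1, 40.3, NA, 34.1, 42), lo = 35, hi = 40)
```

The first call reproduces the ten-element vector printed earlier (3 3 3 4 5 6 7 7 7 7). The bounds 35 and 40 in the second call are inferred from the penguin table, not stated in the slides.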
min and value max?
[1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
[1] 344 8
# A tibble: 251 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 40.3 18 195 3250
2 Adelie Torgersen 34.1 18.1 193 3475
3 Adelie Torgersen 42 20.2 190 4250
4 Adelie Torgersen 41.1 17.6 182 3200
5 Adelie Torgersen 34.6 21.1 198 4400
6 Adelie Torgersen 42.5 20.7 197 4500
7 Adelie Torgersen 34.4 18.4 184 3325
8 Adelie Torgersen 46 21.5 194 4200
9 Adelie Biscoe 40.6 18.6 183 3550
10 Adelie Biscoe 40.5 17.9 187 3200
# ℹ 241 more rows
# ℹ 2 more variables: sex <fct>, year <int>
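For filter(), the helper has to return a logical vector. A sketch of such a helper follows; the name is_outside, its argument names, and the demo data are hypothetical.

```r
library(dplyr)
library(tibble)

# TRUE where x lies strictly outside [lo, hi]; NA where x is NA.
is_outside <- function(x, lo, hi) {
  x < lo | x > hi
}

is_outside(1:10, lo = 3, hi = 7)

# filter() keeps rows where the condition is TRUE and drops NA rows.
demo <- tibble(len = c(39.1, 40.3, NA, 34.1, 42))
demo |> filter(is_outside(len, lo = 35, hi = 40))
```

The first call gives the logical vector shown earlier for 1:10; the filter() call keeps the three rows with len 40.3, 34.1, and 42.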
NA:
[1] 3 NA -54
[1] "cat, dog, and pigeon"
[1] "cat, dog, and pigeon"
[1] "cat, dog and pigeon"
[1] "cat and dog"
[1] "cat, and dog"
[1] "cat and dog"
[1] "cat and dog"
We will often manipulate tibbles in a repetitive way.
dplyr function:
g_var is not found.” What’s happening here?
dplyr uses tidy evaluation to allow you to refer to data-frame columns.
flights |> group_by(month) rather than flights |> group_by(flights$month)
{{}}
# A tibble: 12 × 2
month `mean(dep_time)`
<int> <dbl>
1 1 NA
2 2 NA
3 3 NA
4 4 NA
5 5 NA
6 6 NA
7 7 NA
8 8 NA
9 9 NA
10 10 NA
11 11 NA
12 12 NA
{{}}
# A tibble: 12 × 2
month `mean(dep_time, na.rm = TRUE)`
<int> <dbl>
1 1 1347.
2 2 1348.
3 3 1359.
4 4 1353.
5 5 1351.
6 6 1351.
7 7 1353.
8 8 1350.
9 9 1334.
10 10 1340.
11 11 1344.
12 12 1357.
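A sketch of how embracing fixes the "not found" problem. Inside a function, group_by(group_var) would look for a column literally named group_var; wrapping the argument in {{ }} tells dplyr to use the column the caller supplied. The function name grouped_mean and the demo tibble are illustrative assumptions.

```r
library(dplyr)
library(tibble)

# Hypothetical helper: group by one column, average another.
# {{ }} forwards the caller's column names into the dplyr verbs.
grouped_mean <- function(df, group_var, mean_var) {
  df |>
    group_by({{ group_var }}) |>
    summarize(mean = mean({{ mean_var }}, na.rm = TRUE))
}

# Small stand-in for flights (illustrative values only).
demo <- tibble(
  month = c(1, 1, 2, 2),
  dep_time = c(500, 700, NA, 900)
)

demo |> grouped_mean(month, dep_time)
```

With na.rm = TRUE inside the helper, the missing dep_time in month 2 is skipped rather than turning the whole group mean into NA.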
arrange(), filter(), summarize(), select(), rename(), etc.
# A tibble: 365 × 9
year month day min mean median max n n_miss
<int> <int> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 2013 1 1 94 1077. 946 4983 842 0
2 2013 1 2 94 1053. 944 4983 943 0
3 2013 1 3 80 1037. 937 4983 914 0
4 2013 1 4 80 1032. 937 4983 915 0
5 2013 1 5 80 1068. 950 4983 720 0
6 2013 1 6 80 1052. 944 4983 832 0
7 2013 1 7 80 998. 820 4983 933 0
8 2013 1 8 80 986. 764 4983 899 0
9 2013 1 9 80 981. 764 4983 902 0
10 2013 1 10 80 993. 765 4983 932 0
# ℹ 355 more rows
# A tibble: 365 × 9
year month day min mean median max n n_miss
<int> <int> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 2013 1 1 1.97 2.92 2.98 3.70 842 0
2 2013 1 2 1.97 2.91 2.97 3.70 943 0
3 2013 1 3 1.90 2.91 2.97 3.70 914 0
4 2013 1 4 1.90 2.90 2.97 3.70 915 0
5 2013 1 5 1.90 2.91 2.98 3.70 720 0
6 2013 1 6 1.90 2.91 2.97 3.70 832 0
7 2013 1 7 1.90 2.88 2.91 3.70 933 0
8 2013 1 8 1.90 2.87 2.88 3.70 899 0
9 2013 1 9 1.90 2.87 2.88 3.70 902 0
10 2013 1 10 1.90 2.88 2.88 3.70 932 0
# ℹ 355 more rows
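The two tables above have the columns min, mean, median, max, n, and n_miss. A summary helper producing that shape can be sketched as follows; the name summary6 and the demo data are assumptions, since the slide's own code is not shown in this text.

```r
library(dplyr)
library(tibble)

# Six-number summary of one (possibly computed) variable per group.
summary6 <- function(data, var) {
  data |> summarize(
    min = min({{ var }}, na.rm = TRUE),
    mean = mean({{ var }}, na.rm = TRUE),
    median = median({{ var }}, na.rm = TRUE),
    max = max({{ var }}, na.rm = TRUE),
    n = n(),
    n_miss = sum(is.na({{ var }})),
    .groups = "drop"
  )
}

# Illustrative data standing in for flights.
demo <- tibble(g = c("a", "a", "b"), distance = c(10, 1000, 100))

demo |> group_by(g) |> summary6(distance)

# Embracing also accepts an expression, not just a bare column name.
demo |> group_by(g) |> summary6(log10(distance))
```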
See the difference between functions using embracing vs not using embracing.
flights tibble, create a function called filter_severe() to find all flights that were cancelled (is.na(arr_time)) or delayed by more than an hour.
Consider the following tibble (rnorm(n): n independent standard normals)
# A tibble: 1 × 4
n x z y
<int> <dbl> <dbl> <dbl>
1 10 0.212 -0.798 -0.0483
across()
across() works and how to modify this behavior.
across():
.cols: which columns to iterate over
.fns: what to do (function) for each column
.names: name output of each column
across(): selecting columns with .cols
.cols, we can use same things we used for select():
Two additional selection helpers which are useful with .cols: everything() and where().
everything() selects every non-grouping variable
where() allows for selecting columns based on type, e.g. where(is.numeric) for numbers, where(is.character) for strings, where(is.logical) for logicals
x z y grp
1 0.211684 -0.7975214 -0.04833544 NA
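The column-selection options for .cols can be sketched on a small tibble. The data and column names below are illustrative assumptions, not the tibble from the slides.

```r
library(dplyr)
library(tibble)

demo <- tibble(
  grp = c("a", "a", "b", "b"),
  x = c(1, 3, 5, 7),
  y = c(2, 4, 6, 8),
  label = c("p", "q", "r", "s")
)

# A range of columns, exactly as in select().
demo |> summarize(across(x:y, median))

# everything(): all non-grouping columns (x and y once label is dropped).
demo |>
  select(-label) |>
  group_by(grp) |>
  summarize(across(everything(), median))

# where(): choose columns by type.
demo |> summarize(across(where(is.numeric), median))
```

Note that median is passed without parentheses: across() receives the function itself and calls it once per selected column.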
across(): calling a single function with .fns
.fns says how we want data to be transformed
across(), we are not calling the function itself.
() after the function when you pass to across(), otherwise…
median() in console will produce an error (no input!)
across(): calling multiple functions with .fns
na.rm = TRUE we can create a new function in-line which calls median:
# A tibble: 1 × 4
x y z n
<dbl> <dbl> <dbl> <int>
1 0.174 -1.06 -0.288 5
\:
across() by using a named list to .fns argument:
# A tibble: 1 × 7
x_med x_n_miss y_med y_n_miss z_med z_n_miss n
<dbl> <int> <dbl> <int> <dbl> <int> <int>
1 0.174 1 -1.06 2 -0.288 0 5
{.col}_{.fn}
.col is the name of the original column and .fn is the name of the function.
tibble [10 × 3] (S3: tbl_df/tbl/data.frame)
$ x: num [1:10] 1.393 -0.229 0.171 -0.852 0.349 ...
$ z: num [1:10] -0.746 -1.156 0.899 -0.851 -0.635 ...
$ y: num [1:10] -0.559 0.509 0.462 -0.714 -1.482 ...
# A tibble: 1 × 6
x_med x_mn z_med z_mn y_med y_mn
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.212 0.186 -0.798 -0.554 -0.0483 0.0952
# A tibble: 1 × 6
x_1 x_2 z_1 z_2 y_1 y_2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.212 0.186 -0.798 -0.554 -0.0483 0.0952
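The two naming patterns in the outputs above (x_med, x_mn versus x_1, x_2) correspond to passing a named versus an unnamed list to .fns. A sketch on illustrative data (the tibble and the function choices are assumptions):

```r
library(dplyr)
library(tibble)

demo <- tibble(x = c(1, NA, 3), y = c(4, 5, 6))

# Named list: each name becomes the {.fn} part of {.col}_{.fn}.
named <- demo |> summarize(across(x:y, list(
  med = \(v) median(v, na.rm = TRUE),
  n_miss = \(v) sum(is.na(v))
)))
names(named)   # x_med, x_n_miss, y_med, y_n_miss

# Unnamed list: functions are numbered by position instead.
unnamed <- demo |> summarize(across(x:y, list(median, mean)))
names(unnamed) # x_1, x_2, y_1, y_2
```

Naming the list entries is usually worth the extra typing, since x_1 and x_2 say nothing about which function produced them.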
across(): output column names with .names
The .names argument allows for custom output names:
# A tibble: 1 × 7
med_for_x n_miss_for_x med_for_y n_miss_for_y med_for_z n_miss_for_z n
<dbl> <int> <dbl> <int> <dbl> <int> <int>
1 0.174 1 -1.06 2 -0.288 0 5
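Column names of the form med_for_x, as in the output above, can be produced with a glue-style template passed to .names. A sketch on illustrative data (the tibble is an assumption):

```r
library(dplyr)
library(tibble)

demo <- tibble(x = c(1, NA, 3), y = c(4, 5, 6))

# {.fn} is the list name of the function, {.col} the input column name.
res <- demo |> summarize(across(
  x:y,
  list(
    med = \(v) median(v, na.rm = TRUE),
    n_miss = \(v) sum(is.na(v))
  ),
  .names = "{.fn}_for_{.col}"
))
names(res)  # med_for_x, n_miss_for_x, med_for_y, n_miss_for_y
```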
.names is especially important if using only one function; by default, across() returns the same names as the input and thus will replace the input columns.
summarize()
coalesce(x, y) replaces all appearances of NA in x with the value y
.names to give output new names:
# A tibble: 5 × 6
x y z x_na_zero y_na_zero z_na_zero
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA NA 0.495 0 0 0.495
2 -0.829 NA -0.531 -0.829 0 -0.531
3 -0.221 -0.430 0.0803 -0.221 -0.430 0.0803
4 0.915 -1.40 -0.288 0.915 -1.40 -0.288
5 0.570 -1.06 -0.381 0.570 -1.06 -0.381
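The difference between overwriting columns and adding new ones can be sketched with mutate() and coalesce(). The demo tibble is an illustrative assumption; the _na_zero suffix follows the output above.

```r
library(dplyr)
library(tibble)

demo <- tibble(x = c(NA, 1, 2), y = c(3, NA, 4))

# Without .names: across() reuses the input names, so x and y are overwritten.
overwritten <- demo |> mutate(across(x:y, \(v) coalesce(v, 0)))

# With .names: the originals are kept and new columns are added alongside.
kept <- demo |> mutate(across(x:y, \(v) coalesce(v, 0), .names = "{.col}_na_zero"))

names(overwritten)  # x, y
names(kept)         # x, y, x_na_zero, y_na_zero
```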
penguins
mtcars:
cut, clarity, and color, then count the number of observations and compute the mean of each numeric column.
# A tibble: 276 × 11
# Groups: cut, clarity [40]
cut clarity color num carat depth table price x y z
<ord> <ord> <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Fair I1 D 4 1.88 65.6 56.8 7383 7.52 7.42 4.90
2 Fair I1 E 9 0.969 65.6 58.1 2095. 6.17 6.06 4.01
3 Fair I1 F 35 1.02 65.7 58.4 2544. 6.14 6.04 4.00
4 Fair I1 G 53 1.23 65.3 57.7 3187. 6.52 6.43 4.23
5 Fair I1 H 52 1.50 65.8 58.4 4213. 6.96 6.86 4.55
6 Fair I1 I 34 1.32 65.7 58.4 3501 6.76 6.65 4.41
7 Fair I1 J 23 1.99 66.5 57.9 5795. 7.55 7.46 4.99
8 Fair SI2 D 56 1.02 64.7 58.6 4355. 6.24 6.17 4.01
9 Fair SI2 E 78 1.02 63.4 59.5 4172. 6.28 6.22 3.96
10 Fair SI2 F 89 1.08 63.8 59.5 4520. 6.36 6.30 4.04
# ℹ 266 more rows
# A tibble: 2 × 5
name date date_year date_month date_day
<chr> <date> <dbl> <dbl> <int>
1 Ant 2009-08-03 2009 8 3
2 Bug 2010-01-16 2010 1 16
across() is great with summarize() and mutate(), but not so much with filter() because there we usually combine conditions with & / |.
dplyr variants if_any() and if_all() help to combine logicals across columns.
# A tibble: 2 × 3
x y z
<dbl> <dbl> <dbl>
1 NA NA 0.495
2 -0.829 NA -0.531
# A tibble: 0 × 3
# ℹ 3 variables: x <dbl>, y <dbl>, z <dbl>
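A sketch of the two filter() helpers on illustrative data (the tibble is an assumption): if_any() keeps a row when the condition holds in at least one selected column, and if_all() keeps it only when the condition holds in every selected column.

```r
library(dplyr)
library(tibble)

demo <- tibble(x = c(NA, 1, 2), y = c(NA, NA, 3))

# Same as filter(is.na(x) | is.na(y)): rows 1 and 2.
any_na <- demo |> filter(if_any(x:y, is.na))

# Same as filter(is.na(x) & is.na(y)): row 1 only.
all_na <- demo |> filter(if_all(x:y, is.na))

nrow(any_na)
nrow(all_na)
```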
across() vs pivot_longer()
Suppose we want to compute a weighted mean.
# A tibble: 3 × 6
Ant_score Ant_wts Bug_score Bug_wts Cat_score Cat_wts
<int> <dbl> <int> <dbl> <int> <dbl>
1 74 0.25 61 0.25 84 0.25
2 66 0.25 78 0.25 91 0.25
3 84 0.5 70 0.5 77 0.5
across(), but easy with pivot_longer()
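One way the pivot_longer() approach could look for the score and weight table above. This is a sketch, not necessarily the slide's own code; the object names scores and scores_long and the output column wtd_mean are assumptions.

```r
library(dplyr)
library(tidyr)
library(tibble)

# The table shown above, entered by hand.
scores <- tibble(
  Ant_score = c(74L, 66L, 84L), Ant_wts = c(0.25, 0.25, 0.5),
  Bug_score = c(61L, 78L, 70L), Bug_wts = c(0.25, 0.25, 0.5),
  Cat_score = c(84L, 91L, 77L), Cat_wts = c(0.25, 0.25, 0.5)
)

# Split each column name at "_": the first part goes into `name`,
# and the ".value" sentinel turns the second part (score / wts) into columns.
scores_long <- scores |>
  pivot_longer(
    everything(),
    names_to = c("name", ".value"),
    names_sep = "_"
  )

# Each score now sits beside its weight, so a grouped summary is enough.
scores_long |>
  group_by(name) |>
  summarize(wtd_mean = weighted.mean(score, wts))
```

The pairing of each score column with its own weight column is what across() cannot express directly, because across() visits one column at a time.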