09: Functions 1 and 2

STA35B: Statistical Data Science 2

Akira Horiguchi

library(tidyverse)
library(nycflights13)
library(palmerpenguins)
set.seed(88)

General

Functions

Over the next couple of lectures, we’ll talk about writing our own functions.
Why write our own?
- Readability: each well-defined task is encapsulated in its own function, which also makes the code easier to debug.
- Portability: functions make it easy to reuse code.
Example: let’s rescale each column in df so that its values range from 0 to 1.

# Try to understand this code. (Do you spot the error?)
df <- tibble(a = rnorm(4), b = rnorm(4), c = rnorm(4))
df |> mutate(
  a = (a - min(a, na.rm = TRUE)) / 
    (max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),
  b = (b - min(b, na.rm = TRUE)) / 
    (max(b, na.rm = TRUE) - min(a, na.rm = TRUE)),
  c = (c - min(c, na.rm = TRUE)) / 
    (max(c, na.rm = TRUE) - min(c, na.rm = TRUE))
)

# A tibble: 4 × 3
      a     b     c
  <dbl> <dbl> <dbl>
1 0.386  5.44 0    
2 0.595  4.71 0.855
3 1      4.73 0.206
4 0      0    1

Components of a function

Function name: It is stored in the R environment as an object with this name.
Argument(s): When calling a function, you pass a value or values to the argument(s).
- …can be required or optional.
- …can have default values.
Function Body: The sequence of commands that are executed when the function is called.
Return Value: The output of the function.

square <- function(x) {
  y <- x^2
  return(y)
}

The function name is square.
The function has only one argument; here it is called x.
The function body consists of the lines of code between the curly braces { and }.
The return value is y.

square(3)

[1] 9
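As a minimal sketch of an optional argument with a default value: the function raise_to_power() and its argument power are hypothetical names chosen for illustration, not part of the lecture code.

```r
# Hypothetical example: `power` is optional because it has a default value.
raise_to_power <- function(x, power = 2) {
  x^power
}

raise_to_power(3)             # uses the default, power = 2
raise_to_power(3, power = 3)  # overrides the default
```

Here x is required (calling raise_to_power() with no arguments fails), while power falls back to 2 when it is not supplied.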

Passing arguments

When calling a function, you can specify the arguments by:

position

mean(1:10, 0.2, TRUE)

[1] 5.5

complete name

mean(x = 1:10, trim = 0.2, na.rm = TRUE)

[1] 5.5

partial name (does not work when the abbreviation is ambiguous)

mean(x = 1:10, n = TRUE, t = 0.2)

[1] 5.5
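To illustrate the ambiguity caveat, here is a small sketch with a made-up function, describe(), whose arguments trim and trace both start with "tr". The function and its argument names are assumptions for illustration only.

```r
# Hypothetical function with two arguments that both start with "tr".
describe <- function(x, trim = 0, trace = FALSE) {
  mean(x, trim = trim)
}

describe(1:10, tri = 0.2)  # "tri" uniquely matches `trim`

# "tr" matches both `trim` and `trace`, so R raises an error.
# tryCatch() captures the error message instead of stopping the script.
result <- tryCatch(
  describe(1:10, tr = 0.2),
  error = function(e) conditionMessage(e)
)
result
```

Because partial names are easy to get wrong and hard to read, spelling out the complete argument name is usually the safer choice.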

Example

Write a function get_age_in_years() that takes a vector of birthdate strings of the form “YYYY-MM-DD” and computes each age in whole years. (The output depends on the date on which the code is run.)

get_age_in_years <- function(dates) {
  time_intervals <- ymd(dates) %--% today()
  floor(time_intervals / years(1))
}
get_age_in_years(c("2011-11-11", "1997-02-08", "2025-02-08"))

[1] 14 29  1
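Because get_age_in_years() calls today(), its output changes over time. One possible variant, sketched below, takes the reference date as an argument so that results are reproducible. The function name get_age_in_years_as_of() and the argument as_of are hypothetical additions, not part of the lecture code.

```r
library(lubridate)

# Hypothetical variant: `as_of` fixes the reference date (defaults to today).
get_age_in_years_as_of <- function(dates, as_of = today()) {
  time_intervals <- ymd(dates) %--% as_date(as_of)
  floor(time_intervals / years(1))
}

ages <- get_age_in_years_as_of(
  c("2011-11-11", "1997-02-08"),
  as_of = "2026-03-01"
)
ages
```

With a fixed as_of, the same call returns the same ages no matter when it is run.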

Example

Write a function get_num_both_na() that takes two vectors of the same length and returns the number of positions at which both vectors have an NA.

get_num_both_na <- function(vec1, vec2) {
  sum(is.na(vec1) & is.na(vec2))
}
get_num_both_na(c(1, 2, NA, 3), c(4, 5, NA, NA))

[1] 1

get_num_both_na(c(1, 2, NA, NA), c(4, 5, NA, NA))

[1] 2

Style: function names

Function names should be…

…verbs; arguments should be nouns (typically)
- if returning a logical value/vector, can start name as e.g., is_xx
…descriptive; tell the reader something about the function

# Too short
f()

# Neither a verb nor descriptive
my_awesome_function()

# Long, but clear
impute_missing()
collapse_years()

Style: spacing

Indentation and extra spaces do not affect how R executes code (unlike in Python, where indentation is significant).
Still, it helps readers (e.g., future you!) to use consistent spacing conventions.

unique_where <- function(df, condition, var) {
df |> 
filter({{ condition }}) |> 
distinct({{ var }}) |>  # Keep only unique/distinct rows from a data frame. 
arrange({{ var }})
}

For specifics, see the tidyverse style guide.
In RStudio, the keyboard shortcut Ctrl + I will “correctly” auto-indent the current line.
- Select text then Ctrl + I to “correctly” auto-indent the selected text.
- Ctrl + A then Ctrl + I to “correctly” auto-indent the entire file.

Vector argument

Writing a function

Recall: here’s what we were trying to do:

(a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE))
(b - min(b, na.rm = TRUE)) / (max(b, na.rm = TRUE) - min(b, na.rm = TRUE))
(c - min(c, na.rm = TRUE)) / (max(c, na.rm = TRUE) - min(c, na.rm = TRUE))

More generally:

(█ - min(█, na.rm = TRUE)) / (max(█, na.rm = TRUE) - min(█, na.rm = TRUE))

Let’s create a function that has a vector argument.

rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
c(-10, 0, 10) |> rescale01()

[1] 0.0 0.5 1.0

c(1, 2, 3, NA, 5) |> rescale01()

[1] 0.00 0.25 0.50   NA 1.00

Writing a function: comparison

Compare code with functions…

df |> mutate(
  a = rescale01(a),
  b = rescale01(b),
  c = rescale01(c),
)

# A tibble: 4 × 3
      a     b     c
  <dbl> <dbl> <dbl>
1 0.386 1     0    
2 0.595 0.866 0.855
3 1     0.869 0.206
4 0     0     1

… to code without functions:

df |> mutate(
  a = (a - min(a, na.rm = TRUE)) / 
    (max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),
  b = (b - min(b, na.rm = TRUE)) / 
    (max(b, na.rm = TRUE) - min(a, na.rm = TRUE)),
  c = (c - min(c, na.rm = TRUE)) / 
    (max(c, na.rm = TRUE) - min(c, na.rm = TRUE))
)

# A tibble: 4 × 3
      a     b     c
  <dbl> <dbl> <dbl>
1 0.386  5.44 0    
2 0.595  4.71 0.855
3 1      4.73 0.206
4 0      0    1

Improving the function: remove redundant calculations

rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

Note: rescale01 computes min(x, na.rm = TRUE) twice.

Redundant: what if min(x, na.rm = TRUE) takes a long time to compute?
Let’s compute min(x, na.rm = TRUE) only once rather than twice.

rescale01_v2 <- function(x) {
  minval <- min(x, na.rm=TRUE)
  maxval <- max(x, na.rm=TRUE)
  (x - minval) / (maxval - minval)
}

The change is made in one place, which is much easier than editing every copy of copy/pasted code!
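A quick check, on a made-up vector, that the refactored version returns the same values as the original. Both definitions are repeated so the snippet runs on its own.

```r
rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

rescale01_v2 <- function(x) {
  minval <- min(x, na.rm = TRUE)
  maxval <- max(x, na.rm = TRUE)
  (x - minval) / (maxval - minval)
}

x <- c(4, NA, -2, 10, 7)  # made-up test vector
same <- all.equal(rescale01(x), rescale01_v2(x))
same
```

Comparing old and new versions on a few inputs like this is a cheap safeguard whenever a function is rewritten.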

Improving the function: deal with infinity

c(1:5, Inf) |> rescale01_v2()

[1]   0   0   0   0   0 NaN

What if we want to modify it to deal with infinite values differently?
range() returns min and max of the vector, and has arguments…
- na.rm, which removes NAs
- finite, which drops non-finite values such as Inf and -Inf (max() and min() do not have this argument)

rescale01_v3 <- function(x) {
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  minval <- rng[1]
  maxval <- rng[2]
  (x - minval) / (maxval - minval)
}
c(1:5, Inf) |> rescale01_v3()

[1] 0.00 0.25 0.50 0.75 1.00  Inf
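The same function also handles -Inf and NA. The sketch below repeats the definition and applies it to a made-up vector that mixes all three special values.

```r
rescale01_v3 <- function(x) {
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  minval <- rng[1]
  maxval <- rng[2]
  (x - minval) / (maxval - minval)
}

# The range is computed from the finite values 1..5 only;
# -Inf, Inf, and NA pass through the arithmetic unchanged.
rescaled <- rescale01_v3(c(-Inf, 1:5, NA, Inf))
rescaled
```

The finite values are mapped onto [0, 1], while the infinite and missing entries stay recognizable in the output.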

Inside of `mutate()` or `filter()` or `summarize()`

We want a function inside of…

…mutate() to return a vector of the same length as the function’s argument.
…filter() to return a logical vector of the same length as the function’s argument.
…summarize() to return a single value.
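The three output shapes can be sketched with small base-R functions. The names center(), is_large(), and spread() are made up for illustration.

```r
x <- c(2, 9, 4, 7)  # made-up data

center   <- function(x) x - mean(x)      # mutate()-style: same length as x
is_large <- function(x) x > mean(x)      # filter()-style: logical, same length as x
spread   <- function(x) max(x) - min(x)  # summarize()-style: a single value

center(x)
is_large(x)
spread(x)
```

Checking length() and the type of a function's output is a quick way to decide which dplyr verb it belongs in.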

Mutate function: example 1

Compute the z-score of a vector: rescale a vector to have mean zero and standard deviation of one.

z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

penguins |>
  mutate(body_mass_z = z_score(body_mass_g), .keep='used')

# A tibble: 344 × 2
   body_mass_g body_mass_z
         <int>       <dbl>
 1        3750     -0.563 
 2        3800     -0.501 
 3        3250     -1.19  
 4          NA     NA     
 5        3450     -0.937 
 6        3650     -0.688 
 7        3625     -0.719 
 8        4675      0.590 
 9        3475     -0.906 
10        4250      0.0602
# ℹ 334 more rows

Mutate function: example 2

Make sure all values in a vector lie between a minimum value and a maximum value.

clamp <- function(x, min_arg, max_arg) {
  case_when(
    x < min_arg ~ min_arg,
    x > max_arg ~ max_arg,
    .default = x
  )
}

clamp(1:10, min_arg = 3, max_arg = 7)

 [1] 3 3 3 4 5 6 7 7 7 7

penguins |> 
  mutate(bill_len_middle = clamp(bill_length_mm, 35, 40), .keep='used')

# A tibble: 344 × 2
   bill_length_mm bill_len_middle
            <dbl>           <dbl>
 1           39.1            39.1
 2           39.5            39.5
 3           40.3            40  
 4           NA              NA  
 5           36.7            36.7
 6           39.3            39.3
 7           38.9            38.9
 8           39.2            39.2
 9           34.1            35  
10           42              40  
# ℹ 334 more rows
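For comparison, the same clamping can be written with base R's pmax() and pmin(), which take element-wise maxima and minima. This is an alternative sketch, not the lecture's implementation, and the name clamp_base() is made up.

```r
# Alternative sketch using base R only.
clamp_base <- function(x, min_arg, max_arg) {
  # First raise values below min_arg, then lower values above max_arg.
  pmin(pmax(x, min_arg), max_arg)
}

clamped_int <- clamp_base(1:10, min_arg = 3, max_arg = 7)
clamped_int

# Missing values stay missing.
clamped_na <- clamp_base(c(39.1, NA, 34.1, 42), 35, 40)
clamped_na
```

The case_when() version reads closer to the verbal description, while the pmin()/pmax() version is shorter.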

Filter function: example 1

Which values in a vector lie outside the range from a minimum value to a maximum value?

is_outside_rng <- function(x, min_arg, max_arg) {
  (x < min_arg) | (x > max_arg)
}

is_outside_rng(1:10, min_arg = 3, max_arg = 7)

 [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE

dim(penguins)

[1] 344   8

penguins |> 
  filter(is_outside_rng(bill_length_mm, 35, 40))

# A tibble: 251 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           40.3          18                 195        3250
 2 Adelie  Torgersen           34.1          18.1               193        3475
 3 Adelie  Torgersen           42            20.2               190        4250
 4 Adelie  Torgersen           41.1          17.6               182        3200
 5 Adelie  Torgersen           34.6          21.1               198        4400
 6 Adelie  Torgersen           42.5          20.7               197        4500
 7 Adelie  Torgersen           34.4          18.4               184        3325
 8 Adelie  Torgersen           46            21.5               194        4200
 9 Adelie  Biscoe              40.6          18.6               183        3550
10 Adelie  Biscoe              40.5          17.9               187        3200
# ℹ 241 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Mutate function: example 3

Remove all percent signs, commas, and dollar signs from a string and then convert it to a number; values that contained a percent sign are also divided by 100.

clean_number <- function(x) {
  is_pct <- str_detect(x, "%")
  num <- x |> 
    str_remove_all("%") |> 
    str_remove_all(",") |> 
    str_remove_all("\\$") |> 
    as.numeric()
  if_else(is_pct, num / 100, num)
}
clean_number("$12,300")

[1] 12300

clean_number("45%")

[1] 0.45
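Because every step inside clean_number() is vectorized, it also works on a whole character vector, which is what mutate() needs. The sketch below repeats the definition and uses made-up inputs.

```r
library(stringr)
library(dplyr)

clean_number <- function(x) {
  is_pct <- str_detect(x, "%")
  num <- x |>
    str_remove_all("%") |>
    str_remove_all(",") |>
    str_remove_all("\\$") |>
    as.numeric()
  if_else(is_pct, num / 100, num)
}

# One call cleans all elements at once.
cleaned <- clean_number(c("$1,200", "45%", "3.5"))
cleaned
```

Only the element that contained a percent sign is divided by 100; the others are converted as-is.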

Mutate function: example 4

Another example: if someone previously used the codes 997, 998, or 999 to record missing values in a vector, replace those codes with NA:

fix_na <- function(x) {
  if_else(x %in% c(997, 998, 999), NA, x)
}
fix_na(c(3, 998, -54))

[1]   3  NA -54

You can use multiple arguments

make_a_pair <- function(x, y) {
  str_c("The pair will be ", x, " and ", y)
}
make_a_pair("Bug", "Cat")

[1] "The pair will be Bug and Cat"

Summary function: returns a single value

Task: collapse a character vector into a single string

make_and_string <- function(x) {
  str_flatten(x, collapse = ', ', last = ', and ')
}
c('cat', 'dog', 'pigeon') |> make_and_string()

[1] "cat, dog, and pigeon"

Allow no Oxford comma

make_and_string_v2 <- function(x, no_oxford_commas=FALSE) {
  last_str <- if_else(no_oxford_commas, ' and ', ', and ')
  str_flatten(x, collapse = ', ', last = last_str)
}
c('cat', 'dog', 'pigeon') |> make_and_string_v2()

[1] "cat, dog, and pigeon"

c('cat', 'dog', 'pigeon') |> make_and_string_v2(no_oxford_commas=TRUE)

[1] "cat, dog and pigeon"

c('cat', 'dog') |> make_and_string_v2(no_oxford_commas=TRUE)

[1] "cat and dog"

c('cat', 'dog') |> make_and_string_v2(no_oxford_commas=FALSE)

[1] "cat, and dog"

Automatically remove comma if only two nouns

make_and_string_v3 <- function(x, no_oxford_commas=FALSE) {
  if (length(x) == 2) {
    no_oxford_commas <- TRUE
  }
  last_str <- if_else(no_oxford_commas, ' and ', ', and ')
  str_flatten(x, collapse = ', ', last = last_str)
}
c('cat', 'dog') |> make_and_string_v3(no_oxford_commas=TRUE)

[1] "cat and dog"

c('cat', 'dog') |> make_and_string_v3(no_oxford_commas=FALSE)

[1] "cat and dog"

Data frame argument

Motivation

We will often manipulate tibbles in a repetitive way.

Typical format of a dplyr function:
- take a data frame as the first argument,
- take some extra arguments that say what to do with it,
- return a data frame or a vector.
E.g., compute the mean of one column within each group of a tibble.

compute_group_mean_v1 <- function(df, g_var, m_var) {
  df |> 
    group_by(g_var) |> 
    summarize(mean(m_var))
}

flights |> 
  compute_group_mean_v1(month, dep_time)
#> Error in `group_by()`:
#> ! Must group by variables found in `.data`.
#> ✖ Column `g_var` is not found.

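Error message: “Column g_var is not found.” What’s happening here?
dplyr uses tidy evaluation so that you can refer to data-frame columns by bare name.
- e.g. flights |> group_by(month) rather than flights |> group_by(flights$month)
- Inside our function, group_by(g_var) therefore looks for a column literally named g_var instead of the column the caller passed in (month).
How do we enable such functionality in our own functions?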

Embracing

The solution is embracing: wrap the argument in {{ }} so that dplyr uses the column the caller passed in.

compute_group_mean_v2 <- function(df, g_var, m_var) {
  df |> 
    group_by({{ g_var }}) |> 
    summarize(mean({{ m_var }}))
}
flights |> 
  compute_group_mean_v2(month, dep_time)

# A tibble: 12 × 2
   month `mean(dep_time)`
   <int>            <dbl>
 1     1               NA
 2     2               NA
 3     3               NA
 4     4               NA
 5     5               NA
 6     6               NA
 7     7               NA
 8     8               NA
 9     9               NA
10    10               NA
11    11               NA
12    12               NA

Embracing

The means above are all NA because dep_time contains missing values. Keep the embracing and add na.rm = TRUE:

compute_group_mean_v3 <- function(df, g_var, m_var) {
  df |> 
    group_by({{ g_var }}) |> 
    summarize(mean({{ m_var }}, na.rm=TRUE))
}
flights |> 
  compute_group_mean_v3(month, dep_time)

# A tibble: 12 × 2
   month `mean(dep_time, na.rm = TRUE)`
   <int>                          <dbl>
 1     1                          1347.
 2     2                          1348.
 3     3                          1359.
 4     4                          1353.
 5     5                          1351.
 6     6                          1351.
 7     7                          1353.
 8     8                          1350.
 9     9                          1334.
10    10                          1340.
11    11                          1344.
12    12                          1357.

Embracing

Embrace an argument whenever it names a column (or an expression involving columns) that is passed on to functions like arrange(), filter(), summarize(), select(), rename(), etc.
We’ll see examples of how to apply it.
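As a first small sketch of embracing with select() and arrange(): the helper select_sorted() and its argument names are hypothetical, and mtcars is used only because it ships with R.

```r
library(dplyr)

# Hypothetical helper: keep some columns, then sort by one of them.
# Both `cols` and `sort_var` refer to columns, so both are embraced.
select_sorted <- function(df, cols, sort_var) {
  df |>
    select({{ cols }}) |>
    arrange({{ sort_var }})
}

result <- mtcars |>
  select_sorted(c(mpg, cyl, wt), wt)
head(result, 3)
```

Note that sort_var must be one of the selected columns here, because arrange() runs after select().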

Embracing: example

Let’s create a function that computes basic summary statistics of a variable.

summary6 <- function(data, var) {
  data |> summarize(
    min = min({{ var }}, na.rm = TRUE),
    mean = mean({{ var }}, na.rm = TRUE),
    median = median({{ var }}, na.rm = TRUE),
    max = max({{ var }}, na.rm = TRUE),
    n = n(),
    n_miss = sum(is.na({{ var }})),
    .groups = "drop"
  )
}

flights |> 
  summary6(distance)

# A tibble: 1 × 6
    min  mean median   max      n n_miss
  <dbl> <dbl>  <dbl> <dbl>  <int>  <int>
1    17 1040.    872  4983 336776      0

Embracing: example

We can supply grouped data:

flights |> 
  group_by(year, month, day) |> 
  summary6(distance)

# A tibble: 365 × 9
    year month   day   min  mean median   max     n n_miss
   <int> <int> <int> <dbl> <dbl>  <dbl> <dbl> <int>  <int>
 1  2013     1     1    94 1077.    946  4983   842      0
 2  2013     1     2    94 1053.    944  4983   943      0
 3  2013     1     3    80 1037.    937  4983   914      0
 4  2013     1     4    80 1032.    937  4983   915      0
 5  2013     1     5    80 1068.    950  4983   720      0
 6  2013     1     6    80 1052.    944  4983   832      0
 7  2013     1     7    80  998.    820  4983   933      0
 8  2013     1     8    80  986.    764  4983   899      0
 9  2013     1     9    80  981.    764  4983   902      0
10  2013     1    10    80  993.    765  4983   932      0
# ℹ 355 more rows

Embracing: example

We can even do computations on top of variables:

flights |> 
  group_by(year, month, day) |> 
  summary6(log10(distance))

# A tibble: 365 × 9
    year month   day   min  mean median   max     n n_miss
   <int> <int> <int> <dbl> <dbl>  <dbl> <dbl> <int>  <int>
 1  2013     1     1  1.97  2.92   2.98  3.70   842      0
 2  2013     1     2  1.97  2.91   2.97  3.70   943      0
 3  2013     1     3  1.90  2.91   2.97  3.70   914      0
 4  2013     1     4  1.90  2.90   2.97  3.70   915      0
 5  2013     1     5  1.90  2.91   2.98  3.70   720      0
 6  2013     1     6  1.90  2.91   2.97  3.70   832      0
 7  2013     1     7  1.90  2.88   2.91  3.70   933      0
 8  2013     1     8  1.90  2.87   2.88  3.70   899      0
 9  2013     1     9  1.90  2.87   2.88  3.70   902      0
10  2013     1    10  1.90  2.88   2.88  3.70   932      0
# ℹ 355 more rows

Embracing: example

We can supply conditions as well:

unique_where <- function(df, condition, var) {
  df |> 
    filter({{ condition }}) |> 
    distinct({{ var }}) |>  # Keep only unique/distinct rows from a data frame. 
    arrange({{ var }})
}

flights |> 
  unique_where(month == 12, dest)

# A tibble: 96 × 1
   dest 
   <chr>
 1 ABQ  
 2 ALB  
 3 ATL  
 4 AUS  
 5 AVL  
 6 BDL  
 7 BGR  
 8 BHM  
 9 BNA  
10 BOS  
# ℹ 86 more rows

Non-embrace examples

Embracing is needed only when an argument refers to a column. The functions below name their columns directly, and the hours argument is an ordinary number, so no {{ }} is required.

For the flights tibble, create a function called filter_severe() to find all flights that were cancelled (is.na(arr_time)) or delayed by more than an hour.

filter_severe <- function(df) {
  df |> filter(is.na(arr_time) | dep_delay > 60)
}
flights |> filter_severe()

Find all flights that were cancelled or delayed by more than a user supplied number of hours:

filter_severe <- function(df, hours = 1) {
  df |> filter(is.na(arr_time) | dep_delay > 60*hours)
}
flights |> filter_severe(hours = 3)
flights |> filter_severe(3)  # also fine

Modifying multiple columns

Consider the following tibble (rnorm(n): n independent standard normals)

Suppose we want to compute the median of every column:

df <- tibble(x = rnorm(10), z = rnorm(10), y = rnorm(10))
df |> summarize(n = n(), x = median(x), z = median(z), y = median(y))

# A tibble: 1 × 4
      n     x      z       y
  <int> <dbl>  <dbl>   <dbl>
1    10 0.212 -0.798 -0.0483

We don’t want to copy+paste more than twice; what if we had 500 columns?

Helpful function: across():

df |> summarize(n = n(), across(x:y, median))

# A tibble: 1 × 4
      n     x      z       y
  <int> <dbl>  <dbl>   <dbl>
1    10 0.212 -0.798 -0.0483

`across()`

In the coming slides, we’ll see how across() works and how to modify its behavior.
Three especially important arguments to across():
- .cols: which columns to iterate over
- .fns: the function (or functions) to apply to each column
- .names: how to name the output columns
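A minimal sketch that spells out all three arguments by name. The tibble scores and its values are made up for illustration.

```r
library(dplyr)

scores <- tibble(quiz = c(7, 9, 8), exam = c(70, 90, 85))  # made-up data

result <- scores |>
  summarize(
    across(
      .cols = quiz:exam,        # which columns to iterate over
      .fns = median,            # what to do with each column
      .names = "{.col}_median"  # how to name the results
    )
  )
result
```

In practice .cols and .fns are usually passed by position, as in across(quiz:exam, median).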

`across()`: selecting columns with `.cols`

For .cols, we can use the same selection syntax that we used for select():

df |> summarize(across(-x, median))

# A tibble: 1 × 2
       z       y
   <dbl>   <dbl>
1 -0.798 -0.0483

df |> summarize(across(c(x,z), median))

# A tibble: 1 × 2
      x      z
  <dbl>  <dbl>
1 0.212 -0.798

`across()`: selecting columns with `.cols`

Two selection helpers are especially useful here: everything() and where().

everything() selects every non-grouping column.
where() selects columns based on type, e.g. where(is.numeric) for numbers, where(is.character) for strings, where(is.logical) for logicals.

df2 <- cbind(df, grp = sample(c("A", "B"), 10, replace = TRUE))  # grp is either "A" or "B"
df2 |> summarize(across(everything(), median))

         x          z           y grp
1 0.211684 -0.7975214 -0.04833544  NA

df2 |> summarize(across(where(is.numeric), median))

         x          z           y
1 0.211684 -0.7975214 -0.04833544

df2 |> summarize(across(where(is.character), str_flatten))

         grp
1 ABBBBBBAAA
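To see that everything() skips grouping variables, here is a sketch on a small made-up tibble (df_grouped is a hypothetical name) in which grp is used for grouping rather than summarized.

```r
library(dplyr)

df_grouped <- tibble(  # made-up data
  grp = c("A", "A", "B", "B"),
  x = c(1, 3, 10, 20),
  y = c(2, 4, 6, 8)
)

# grp is a grouping variable, so across(everything(), ...) touches only x and y.
result <- df_grouped |>
  group_by(grp) |>
  summarize(across(everything(), median))
result
```

Without the group_by() step, median() would also be applied to the character column grp, as in the ungrouped example above.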

`across()`: calling a single function with `.fns`

.fns says how we want the data to be transformed.
We pass the function itself to across(); we do not call it.
- Never add () after the function name when you pass it to across(), otherwise…

df |> summarize(across(everything(), median()))
#> Error in `summarize()`:
#> ℹ In argument: `across(everything(), median())`.
#> Caused by error in `median.default()`:
#> ! argument "x" is missing, with no default

This fails for the same reason that calling median() on its own in the console produces an error: there is no input!

df |> summarize(across(everything(), median))

# A tibble: 1 × 3
      x      z       y
  <dbl>  <dbl>   <dbl>
1 0.212 -0.798 -0.0483

`across()`: calling multiple functions with `.fns`

We may want to apply multiple transformations or have multiple arguments
Motivating example: tibble with missing data

df_miss

# A tibble: 5 × 3
       x      y       z
   <dbl>  <dbl>   <dbl>
1 NA     NA      0.495 
2 -0.829 NA     -0.531 
3 -0.221 -0.430  0.0803
4  0.915 -1.40  -0.288 
5  0.570 -1.06  -0.381

df_miss |> 
  summarize(
    across(x:z, median),
    n = n()
    )

# A tibble: 1 × 4
      x     y      z     n
  <dbl> <dbl>  <dbl> <int>
1    NA    NA -0.288     5
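The slides do not show how df_miss was created. One way to follow along is to rebuild it from the printed output, as sketched below. The values are the rounded ones shown above, so later results may differ from the slides in the last digit.

```r
library(dplyr)

# Values copied from the printed (rounded) output of df_miss.
df_miss <- tibble(
  x = c(NA, -0.829, -0.221, 0.915, 0.570),
  y = c(NA, NA, -0.430, -1.40, -1.06),
  z = c(0.495, -0.531, 0.0803, -0.288, -0.381)
)

medians <- df_miss |>
  summarize(across(x:z, median), n = n())
medians
```

Columns x and y contain NA, so their medians are NA; only z, which has no missing values, gets a numeric median.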

`across()`: calling multiple functions with `.fns`

If we want to pass along argument na.rm = TRUE we can create a new function in-line which calls median:

df_miss |> 
  summarize(
    across(x:z, function(x) median(x, na.rm = TRUE)),
    n = n())

# A tibble: 1 × 4
      x     y      z     n
  <dbl> <dbl>  <dbl> <int>
1 0.174 -1.06 -0.288     5

A function without a name is called an anonymous function.
R (version 4.1 and later) also provides a shorthand for creating functions in-line, \(x):

df_miss |> 
  summarize(
    across(x:z, \(x) median(x, na.rm = TRUE)),
    n = n())
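Outside of across(), the two spellings can be compared directly. This base-R sketch uses a made-up vector; the names long_form and short_form are chosen for illustration.

```r
values <- c(3, NA, 1, 8)  # made-up data

long_form  <- function(x) median(x, na.rm = TRUE)
short_form <- \(x) median(x, na.rm = TRUE)  # requires R 4.1 or later

long_form(values)
short_form(values)

# An anonymous function can also be called right where it is defined.
direct <- (\(x) median(x, na.rm = TRUE))(values)
direct
```

The two forms create the same kind of object; \(x) simply saves typing when the function is short and used only once.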

`across()`: calling multiple functions with `.fns`

So we can simplify code like …

df_miss |> 
  summarize(
    x = median(x, na.rm = TRUE),
    y = median(y, na.rm = TRUE),
    z = median(z, na.rm = TRUE),
    n = n())

# A tibble: 1 × 4
      x     y      z     n
  <dbl> <dbl>  <dbl> <int>
1 0.174 -1.06 -0.288     5

… to …

df_miss |> 
  summarize(
    across(x:z, \(a) median(a, na.rm = TRUE)),
    n = n())

# A tibble: 1 × 4
      x     y      z     n
  <dbl> <dbl>  <dbl> <int>
1 0.174 -1.06 -0.288     5

`across()`: calling multiple functions with `.fns`

We might also be interested in how many missing values were removed. We can do that again with across() by passing a named list of functions to the .fns argument:

df_miss |> 
  summarize(
    across(x:z, list(
      med = \(a) median(a, na.rm = TRUE),
      n_miss = \(a) sum(is.na(a))
    )),
    n = n()
  )

# A tibble: 1 × 7
  x_med x_n_miss y_med y_n_miss  z_med z_n_miss     n
  <dbl>    <int> <dbl>    <int>  <dbl>    <int> <int>
1 0.174        1 -1.06        2 -0.288        0     5

Output columns are named using a “glue” specification; with a list of functions the default is {.col}_{.fn}.
- .col is the name of the original column and .fn is the name of the function.

`across()`: calling multiple functions with `.fns`

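Column name examples: consider calculating medians/means for all columns. With a named list, the names appear in the output column names; with an unnamed list, the position of each function (1, 2, …) is used instead.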

str(df)

tibble [10 × 3] (S3: tbl_df/tbl/data.frame)
 $ x: num [1:10] 1.393 -0.229 0.171 -0.852 0.349 ...
 $ z: num [1:10] -0.746 -1.156 0.899 -0.851 -0.635 ...
 $ y: num [1:10] -0.559 0.509 0.462 -0.714 -1.482 ...

df |> summarize(across(x:y, list(med = median, mn = mean)))

# A tibble: 1 × 6
  x_med  x_mn  z_med   z_mn   y_med   y_mn
  <dbl> <dbl>  <dbl>  <dbl>   <dbl>  <dbl>
1 0.212 0.186 -0.798 -0.554 -0.0483 0.0952

df |> summarize(across(x:y, list(median, mean)))

# A tibble: 1 × 6
    x_1   x_2    z_1    z_2     y_1    y_2
  <dbl> <dbl>  <dbl>  <dbl>   <dbl>  <dbl>
1 0.212 0.186 -0.798 -0.554 -0.0483 0.0952

`across()`: output column names with `.names`

Specifying the .names argument allows for custom output names:

df_miss |> 
  summarize(
    across(
      x:z,
      list(
        med = \(x) median(x, na.rm = TRUE),
        n_miss = \(x) sum(is.na(x))
      ),
      .names = "{.fn}_for_{.col}"
    ),
    n = n(),
  )

# A tibble: 1 × 7
  med_for_x n_miss_for_x med_for_y n_miss_for_y med_for_z n_miss_for_z     n
      <dbl>        <int>     <dbl>        <int>     <dbl>        <int> <int>
1     0.174            1     -1.06            2    -0.288            0     5

`across()`: output column names with `.names`

Specifying .names is especially important when using only one function: by default, across() gives its outputs the same names as its inputs, so inside mutate() the results replace the input columns.
- We saw this default naming previously when using across() inside summarize().
E.g., coalesce(x, 0) replaces every NA in x with 0:

df_miss |> 
  mutate(
    across(x:z, \(x) coalesce(x, 0))
  )

# A tibble: 5 × 3
       x      y       z
   <dbl>  <dbl>   <dbl>
1  0      0      0.495 
2 -0.829  0     -0.531 
3 -0.221 -0.430  0.0803
4  0.915 -1.40  -0.288 
5  0.570 -1.06  -0.381

`across()`: output column names with `.names`

By default, the outputs of across() overwrite the input columns, as on the previous slide.
To create new columns instead, use .names to give the outputs new names:

df_miss |> 
  mutate(
    across(x:z, \(x) coalesce(x, 0), .names = "{.col}_na_zero")
  )

# A tibble: 5 × 6
       x      y       z x_na_zero y_na_zero z_na_zero
   <dbl>  <dbl>   <dbl>     <dbl>     <dbl>     <dbl>
1 NA     NA      0.495      0         0        0.495 
2 -0.829 NA     -0.531     -0.829     0       -0.531 
3 -0.221 -0.430  0.0803    -0.221    -0.430    0.0803
4  0.915 -1.40  -0.288      0.915    -1.40    -0.288 
5  0.570 -1.06  -0.381      0.570    -1.06    -0.381

Examples: using other data sets

Number of unique values in each column of penguins

penguins |> 
  summarize(across(everything(), \(x) length(unique(x))))

# A tibble: 1 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
    <int>  <int>          <int>         <int>             <int>       <int>
1       3      3            165            81                56          95
# ℹ 2 more variables: sex <int>, year <int>

Examples: using other data sets

The mean of every column in mtcars:

mtcars |>
  summarize(across(everything(), mean))

       mpg    cyl     disp       hp     drat      wt     qsec     vs      am
1 20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625
    gear   carb
1 3.6875 2.8125

Examples: using other data sets

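Group diamonds by cut, clarity, and color, then count the number of observations and compute the mean of each numeric column. Note that num is printed as <dbl>: it is created before across() runs, so where(is.numeric) selects it as well and replaces it with its mean (which equals the count).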

diamonds |>
  group_by(cut, clarity, color) |>
  summarize(num = n(), across(where(is.numeric), mean))

# A tibble: 276 × 11
# Groups:   cut, clarity [40]
   cut   clarity color   num carat depth table price     x     y     z
   <ord> <ord>   <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Fair  I1      D         4 1.88   65.6  56.8 7383   7.52  7.42  4.90
 2 Fair  I1      E         9 0.969  65.6  58.1 2095.  6.17  6.06  4.01
 3 Fair  I1      F        35 1.02   65.7  58.4 2544.  6.14  6.04  4.00
 4 Fair  I1      G        53 1.23   65.3  57.7 3187.  6.52  6.43  4.23
 5 Fair  I1      H        52 1.50   65.8  58.4 4213.  6.96  6.86  4.55
 6 Fair  I1      I        34 1.32   65.7  58.4 3501   6.76  6.65  4.41
 7 Fair  I1      J        23 1.99   66.5  57.9 5795.  7.55  7.46  4.99
 8 Fair  SI2     D        56 1.02   64.7  58.6 4355.  6.24  6.17  4.01
 9 Fair  SI2     E        78 1.02   63.4  59.5 4172.  6.28  6.22  3.96
10 Fair  SI2     F        89 1.08   63.8  59.5 4520.  6.36  6.30  4.04
# ℹ 266 more rows

Examples: in a function

Task: expand all date columns into year / month / day columns.

expand_dates <- function(df) {
  df |> 
    mutate(
      across(where(is.Date), list(year = year, month = month, day = mday))
    )
}

my_df <- tibble(name = c("Ant", "Bug"), date = ymd(c("2009-08-03", "2010-01-16"))) 
my_df |> 
  expand_dates()

# A tibble: 2 × 5
  name  date       date_year date_month date_day
  <chr> <date>         <dbl>      <dbl>    <int>
1 Ant   2009-08-03      2009          8        3
2 Bug   2010-01-16      2010          1       16

Filtering

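across() works well with summarize() and mutate(), but it is a poor fit for filter(), where conditions on several columns usually need to be combined with & or |.
The dplyr variants if_any() and if_all() combine logical results across columns: if_any() keeps a row when the condition holds for at least one selected column, if_all() when it holds for every selected column.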

# same as df_miss |> filter(is.na(x) | is.na(y) | is.na(z))
df_miss |> filter(if_any(x:z, is.na))

# A tibble: 2 × 3
       x     y      z
   <dbl> <dbl>  <dbl>
1 NA        NA  0.495
2 -0.829    NA -0.531

# same as df_miss |> filter(is.na(x) & is.na(y) & is.na(z))
df_miss |> filter(if_all(x:z, is.na))

# A tibble: 0 × 3
# ℹ 3 variables: x <dbl>, y <dbl>, z <dbl>
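if_any() and if_all() also accept anonymous functions, not just is.na. The sketch below uses a small made-up tibble, readings, to show both, including what happens to rows that contain NA.

```r
library(dplyr)

readings <- tibble(  # made-up data
  a = c(1, -2, 3, NA),
  b = c(4, 5, -6, 7)
)

# Keep rows where at least one column is negative.
any_negative <- readings |> filter(if_any(a:b, \(v) v < 0))
any_negative

# Keep rows where every column is positive.
all_positive <- readings |> filter(if_all(a:b, \(v) v > 0))
all_positive
```

The last row (NA, 7) appears in neither result: the comparison with NA yields NA, and filter() drops rows whose condition is NA.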

`across()` vs `pivot_longer()`

Suppose we want to compute a weighted mean for each group.

Suppose the tibble stores each group’s values and weights in separate columns.

df_paired

# A tibble: 3 × 6
  Ant_score Ant_wts Bug_score Bug_wts Cat_score Cat_wts
      <int>   <dbl>     <int>   <dbl>     <int>   <dbl>
1        74    0.25        61    0.25        84    0.25
2        66    0.25        78    0.25        91    0.25
3        84    0.5         70    0.5         77    0.5

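There is no straightforward way to do this with across(), because each score column has to be paired with its own weights column, but it is easy with pivot_longer().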

`across()` vs `pivot_longer()`

( df_long <- df_paired |> 
  pivot_longer(
    cols = everything(), 
    names_to = c("group", ".value"), 
    names_sep = "_"
  ) )

# A tibble: 9 × 3
  group score   wts
  <chr> <int> <dbl>
1 Ant      74  0.25
2 Bug      61  0.25
3 Cat      84  0.25
4 Ant      66  0.25
5 Bug      78  0.25
6 Cat      91  0.25
7 Ant      84  0.5 
8 Bug      70  0.5 
9 Cat      77  0.5

df_long |> 
  group_by(group) |> 
  summarize(wm = weighted.mean(score, wts))

# A tibble: 3 × 2
  group    wm
  <chr> <dbl>
1 Ant    77  
2 Bug    69.8
3 Cat    82.2
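The slides do not show how df_paired was built. The sketch below reconstructs it from the printed output, reruns the pipeline, and checks one group by hand using the definition of a weighted mean (sum of value times weight, divided by the sum of the weights).

```r
library(dplyr)
library(tidyr)

# Reconstructed from the printed output of df_paired.
df_paired <- tibble(
  Ant_score = c(74L, 66L, 84L), Ant_wts = c(0.25, 0.25, 0.5),
  Bug_score = c(61L, 78L, 70L), Bug_wts = c(0.25, 0.25, 0.5),
  Cat_score = c(84L, 91L, 77L), Cat_wts = c(0.25, 0.25, 0.5)
)

result <- df_paired |>
  pivot_longer(
    cols = everything(),
    names_to = c("group", ".value"),  # "Ant_score" -> group "Ant", column "score"
    names_sep = "_"
  ) |>
  group_by(group) |>
  summarize(wm = weighted.mean(score, wts))
result

# Hand check for Ant.
ant_by_hand <- sum(c(74, 66, 84) * c(0.25, 0.25, 0.5)) / sum(c(0.25, 0.25, 0.5))
ant_by_hand
```

The special name ".value" tells pivot_longer() to use the second part of each column name (score or wts) as an output column, which is what pairs each score with its weight.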
