09: Functions 1 and 2

STA35B: Statistical Data Science 2

Akira Horiguchi
library(tidyverse)
library(nycflights13)
library(palmerpenguins)
set.seed(88)

General

Functions

  • Over the next couple of lectures, we’ll talk about writing our own functions.
  • Why write our own?
    • Readability: each well-defined task is encapsulated in its own function, which also makes the code easier to debug.
    • Portability: makes it easy to re-use code.
  • Example: let’s rescale each column in df to have range between 0 and 1.
# Try to understand this code. (Do you spot the error?)
df <- tibble(a = rnorm(4), b = rnorm(4), c = rnorm(4))
df |> mutate(
  a = (a - min(a, na.rm = TRUE)) / 
    (max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),
  b = (b - min(b, na.rm = TRUE)) / 
    (max(b, na.rm = TRUE) - min(a, na.rm = TRUE)),
  c = (c - min(c, na.rm = TRUE)) / 
    (max(c, na.rm = TRUE) - min(c, na.rm = TRUE))
)
# A tibble: 4 × 3
      a     b     c
  <dbl> <dbl> <dbl>
1 0.386  5.44 0    
2 0.595  4.71 0.855
3 1      4.73 0.206
4 0      0    1    

Components of a function

  • Function name: The function is stored in the R environment as an object with this name.
  • Argument(s): When calling a function, you pass a value or values to the argument(s).
    • …can be required or optional.
    • …can have default values.
  • Function Body: The sequence of commands that are executed when the function is called.
  • Return Value: The output of the function.
square <- function(x) {
  y <- x^2
  return(y)
}
  1. The function name is square.
  2. The function has only one argument; here it is called x.
  3. The function body consists of the lines of code between the curly braces { and }.
  4. The return value is y.
square(3)
[1] 9
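
The bullets about optional arguments and default values can be illustrated with a small sketch. The function raise_to() and its argument power are hypothetical names introduced here for illustration; they are not part of the lecture code.

```r
# Hypothetical example: `power` is optional because it has a default value.
raise_to <- function(x, power = 2) {
  x^power
}

raise_to(3)             # uses the default power = 2, so this is 3^2
raise_to(3, power = 3)  # overrides the default, so this is 3^3
raise_to(1:4)           # works element-wise on a vector
```

Here x is required (it has no default), while power can be omitted by the caller.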

Passing arguments

When calling a function, you can specify the arguments by:

  • position
mean(1:10, 0.2, TRUE)
[1] 5.5
  • complete name
mean(x = 1:10, trim = 0.2, na.rm = TRUE)
[1] 5.5
  • partial name (does not work when the abbreviation is ambiguous)
mean(x = 1:10, n = TRUE, t = 0.2)
[1] 5.5
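
To see when partial matching fails, here is a minimal sketch with a hypothetical function describe_value() whose two arguments both start with "v". The function and its argument names are assumptions made up for this illustration.

```r
# Hypothetical function with two arguments that share the prefix "v".
describe_value <- function(value, verbose = FALSE) {
  if (verbose) paste("The value is", value) else value
}

# "val" is a prefix of `value` only, so partial matching succeeds.
describe_value(val = 10)

# "v" is a prefix of both `value` and `verbose`, so the call is ambiguous.
# tryCatch() captures the resulting error instead of stopping the script.
ambiguous_call <- tryCatch(
  describe_value(v = 10),
  error = function(e) e
)
inherits(ambiguous_call, "error")
```

In practice, spelling out the complete argument name avoids this problem entirely.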

Example

  1. Write a function get_age_in_years() which takes in a vector of strings of birthdates of the form “YYYY-MM-DD” and computes the age in years.
get_age_in_years <- function(dates) {
  time_intervals <- ymd(dates) %--% today()
  floor(time_intervals / years(1))
}
get_age_in_years(c("2011-11-11", "1997-02-08", "2025-02-08"))  # result depends on the date the code is run
[1] 14 29  1
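
Because get_age_in_years() calls today(), its output changes over time. One possible way to make the result reproducible is sketched below; the function name get_age_in_years_as_of() and the argument as_of are hypothetical additions, and the reference date is an arbitrary choice.

```r
library(lubridate)

# Hypothetical variant: `as_of` fixes the reference date (defaults to today).
get_age_in_years_as_of <- function(dates, as_of = today()) {
  time_intervals <- ymd(dates) %--% as_of
  floor(time_intervals / years(1))
}

# With a fixed reference date, the result no longer depends on when this runs.
ages <- get_age_in_years_as_of(
  c("2011-11-11", "1997-02-08", "2025-02-08"),
  as_of = ymd("2026-03-01")
)
ages
```

With this particular reference date, the ages agree with the output shown above.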

Example

  1. Write a function get_num_both_na() which takes in two vectors of the same length and returns the number of positions that have an NA in both vectors.
get_num_both_na <- function(vec1, vec2) {
  sum(is.na(vec1) & is.na(vec2))
}
get_num_both_na(c(1, 2, NA, 3), c(4, 5, NA, NA))
[1] 1
get_num_both_na(c(1, 2, NA, NA), c(4, 5, NA, NA))
[1] 2

Style: function names

Function names should be…

  • …verbs; arguments should be nouns (typically)
    • if returning a logical value/vector, can start name as e.g., is_xx
  • …descriptive; tell the reader something about the function
# Too short
f()

# Not a verb, or descriptive
my_awesome_function()

# Long, but clear
impute_missing()
collapse_years()

Style: spacing

  • Whitespace does not affect code execution in R (unlike in Python).
  • Still, it helps readers (e.g., future you!) to use consistent spacing conventions.
unique_where <- function(df, condition, var) {
df |> 
filter({{ condition }}) |> 
distinct({{ var }}) |>  # Keep only unique/distinct rows from a data frame. 
arrange({{ var }})
}
  • For specifics, see Tidyverse style guide.
  • In RStudio, the keyboard shortcut Ctrl + I will “correctly” auto-indent the current line.
    • Select text then Ctrl + I to “correctly” auto-indent the selected text.
    • Ctrl + A then Ctrl + I to “correctly” auto-indent the entire file.

Vector argument

Writing a function

Recall: here’s what we were trying to do:

(a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE))
(b - min(b, na.rm = TRUE)) / (max(b, na.rm = TRUE) - min(b, na.rm = TRUE))
(c - min(c, na.rm = TRUE)) / (max(c, na.rm = TRUE) - min(c, na.rm = TRUE))
  • More generally:
(█ - min(█, na.rm = TRUE)) / (max(█, na.rm = TRUE) - min(█, na.rm = TRUE))
  • Let’s create a function that has a vector argument.
rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
c(-10, 0, 10) |> rescale01()
[1] 0.0 0.5 1.0
c(1, 2, 3, NA, 5) |> rescale01()
[1] 0.00 0.25 0.50   NA 1.00

Writing a function: comparison

Compare code with functions…

df |> mutate(
  a = rescale01(a),
  b = rescale01(b),
  c = rescale01(c),
)
# A tibble: 4 × 3
      a     b     c
  <dbl> <dbl> <dbl>
1 0.386 1     0    
2 0.595 0.866 0.855
3 1     0.869 0.206
4 0     0     1    

… to code without functions:

df |> mutate(
  a = (a - min(a, na.rm = TRUE)) / 
    (max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),
  b = (b - min(b, na.rm = TRUE)) / 
    (max(b, na.rm = TRUE) - min(a, na.rm = TRUE)),
  c = (c - min(c, na.rm = TRUE)) / 
    (max(c, na.rm = TRUE) - min(c, na.rm = TRUE))
)
# A tibble: 4 × 3
      a     b     c
  <dbl> <dbl> <dbl>
1 0.386  5.44 0    
2 0.595  4.71 0.855
3 1      4.73 0.206
4 0      0    1    

Improving the function: remove redundant calculations

rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

Note: rescale01 computes min(x, na.rm = TRUE) twice.

  • Redundant: what if min(x, na.rm = TRUE) takes a long time to compute?
  • Let’s compute min(x, na.rm = TRUE) only once rather than twice.
rescale01_v2 <- function(x) {
  minval <- min(x, na.rm=TRUE)
  maxval <- max(x, na.rm=TRUE)
  (x - minval) / (maxval - minval)
}
  • Much easier than having to go through copy/pasted code!

Improving the function: deal with infinity

c(1:5, Inf) |> rescale01_v2()
[1]   0   0   0   0   0 NaN
  • What if we want to modify it to deal with infinite values differently?
  • range() returns min and max of the vector, and has arguments…
    • na.rm, which removes NAs
    • finite, which removes Infs (max() and min() do not have this argument)
rescale01_v3 <- function(x) {
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  minval <- rng[1]
  maxval <- rng[2]
  (x - minval) / (maxval - minval)
}
c(1:5, Inf) |> rescale01_v3()
[1] 0.00 0.25 0.50 0.75 1.00  Inf

Inside of mutate() or filter() or summarize()

We want a function inside of…

  • mutate() to return a vector of the same length as the function’s argument.
  • filter() to return a logical vector of the same length as the function’s argument.
  • summarize() to return a single value.
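
The three return shapes can be checked on a plain vector before using the functions inside dplyr verbs. The sketch below uses hypothetical helper names (center, is_above_mean, value_range) chosen only for this illustration.

```r
x <- c(2, 4, 9)

# mutate()-style: returns a vector as long as the input
center <- function(x) x - mean(x, na.rm = TRUE)

# filter()-style: returns a logical vector as long as the input
is_above_mean <- function(x) x > mean(x, na.rm = TRUE)

# summarize()-style: returns a single value
value_range <- function(x) max(x, na.rm = TRUE) - min(x, na.rm = TRUE)

center(x)         # -3 -1  4
is_above_mean(x)  # FALSE FALSE  TRUE
value_range(x)    # 7
```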

Mutate function: example 1

  • Compute the z-score of a vector: rescale a vector to have mean zero and standard deviation of one.
z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

penguins |>
  mutate(body_mass_z = z_score(body_mass_g), .keep='used')
# A tibble: 344 × 2
   body_mass_g body_mass_z
         <int>       <dbl>
 1        3750     -0.563 
 2        3800     -0.501 
 3        3250     -1.19  
 4          NA     NA     
 5        3450     -0.937 
 6        3650     -0.688 
 7        3625     -0.719 
 8        4675      0.590 
 9        3475     -0.906 
10        4250      0.0602
# ℹ 334 more rows

Mutate function: example 2

  • Make sure all values in a vector lie between a lower bound min_arg and an upper bound max_arg.
clamp <- function(x, min_arg, max_arg) {
  case_when(
    x < min_arg ~ min_arg,
    x > max_arg ~ max_arg,
    .default = x
  )
}

clamp(1:10, min_arg = 3, max_arg = 7)
 [1] 3 3 3 4 5 6 7 7 7 7
penguins |> 
  mutate(bill_len_middle = clamp(bill_length_mm, 35, 40), .keep='used')
# A tibble: 344 × 2
   bill_length_mm bill_len_middle
            <dbl>           <dbl>
 1           39.1            39.1
 2           39.5            39.5
 3           40.3            40  
 4           NA              NA  
 5           36.7            36.7
 6           39.3            39.3
 7           38.9            38.9
 8           39.2            39.2
 9           34.1            35  
10           42              40  
# ℹ 334 more rows

Filter function: example 1

  • Which values in a vector lie outside the range from min_arg to max_arg?
is_outside_rng <- function(x, min_arg, max_arg) {
  (x < min_arg) | (x > max_arg)
}

is_outside_rng(1:10, min_arg = 3, max_arg = 7)
 [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
dim(penguins)
[1] 344   8
penguins |> 
  filter(is_outside_rng(bill_length_mm, 35, 40))
# A tibble: 251 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           40.3          18                 195        3250
 2 Adelie  Torgersen           34.1          18.1               193        3475
 3 Adelie  Torgersen           42            20.2               190        4250
 4 Adelie  Torgersen           41.1          17.6               182        3200
 5 Adelie  Torgersen           34.6          21.1               198        4400
 6 Adelie  Torgersen           42.5          20.7               197        4500
 7 Adelie  Torgersen           34.4          18.4               184        3325
 8 Adelie  Torgersen           46            21.5               194        4200
 9 Adelie  Biscoe              40.6          18.6               183        3550
10 Adelie  Biscoe              40.5          17.9               187        3200
# ℹ 241 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Mutate function: example 3

  • Remove all percent signs, commas, and dollar signs from a string and then convert it to a number.
clean_number <- function(x) {
  is_pct <- str_detect(x, "%")
  num <- x |> 
    str_remove_all("%") |> 
    str_remove_all(",") |> 
    str_remove_all("\\$") |> 
    as.numeric()
  if_else(is_pct, num / 100, num)
}
clean_number("$12,300")
[1] 12300
clean_number("45%")
[1] 0.45

Mutate function: example 4

  • Another example: if someone previously recorded missing values in a vector as 997, 998, or 999, replace those codes with NA:
fix_na <- function(x) {
  if_else(x %in% c(997, 998, 999), NA, x)
}
fix_na(c(3, 998, -54))
[1]   3  NA -54
  • You can use multiple arguments
make_a_pair <- function(x, y) {
  str_c("The pair will be ", x, " and ", y)
}
make_a_pair("Bug", "Cat")
[1] "The pair will be Bug and Cat"

Summary function: returns a single value

  • Task: collapse argument into a single string
make_and_string <- function(x) {
  str_flatten(x, collapse = ', ', last = ', and ')
}
c('cat', 'dog', 'pigeon') |> make_and_string()
[1] "cat, dog, and pigeon"
  • Allow no Oxford comma
make_and_string_v2 <- function(x, no_oxford_commas=FALSE) {
  last_str <- if_else(no_oxford_commas, ' and ', ', and ')
  str_flatten(x, collapse = ', ', last = last_str)
}
c('cat', 'dog', 'pigeon') |> make_and_string_v2()
[1] "cat, dog, and pigeon"
c('cat', 'dog', 'pigeon') |> make_and_string_v2(no_oxford_commas=TRUE)
[1] "cat, dog and pigeon"
c('cat', 'dog') |> make_and_string_v2(no_oxford_commas=TRUE)
[1] "cat and dog"
c('cat', 'dog') |> make_and_string_v2(no_oxford_commas=FALSE)
[1] "cat, and dog"
  • Automatically remove comma if only two nouns
make_and_string_v3 <- function(x, no_oxford_commas=FALSE) {
  if (length(x) == 2) {
    no_oxford_commas <- TRUE
  }
  last_str <- if_else(no_oxford_commas, ' and ', ', and ')
  str_flatten(x, collapse = ', ', last = last_str)
}
c('cat', 'dog') |> make_and_string_v3(no_oxford_commas=TRUE)
[1] "cat and dog"
c('cat', 'dog') |> make_and_string_v3(no_oxford_commas=FALSE)
[1] "cat and dog"

Data frame argument

Motivation

We will often manipulate tibbles in a repetitive way.

  • Typical format of a dplyr function:
    • take a data frame as the first argument,
    • take some extra arguments that say what to do with it,
    • return a data frame or a vector.
  • E.g., compute the mean of a tibble by groups.
compute_group_mean_v1 <- function(df, g_var, m_var) {
  df |> 
    group_by(g_var) |> 
    summarize(mean(m_var))
}
flights |> 
  compute_group_mean_v1(month, dep_time)
#> Error in `group_by()`:
#> ! Must group by variables found in `.data`.
#> ✖ Column `g_var` is not found.
  • Error message: “Column g_var is not found.” What’s happening here?
  • dplyr uses tidy evaluation to allow you to refer to data-frame columns.
    • e.g. flights |> group_by(month) rather than flights |> group_by(flights$month)
  • How do we enable such functionality in our own functions?
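
One way to see what goes wrong is to run the non-embraced function on a tibble that happens to contain columns literally named g_var and m_var. The sketch below is an illustration under that assumption; the tibble toy and the function name group_mean_no_embrace() are made up for this example.

```r
library(dplyr)

# Same structure as the failing function: no embracing.
group_mean_no_embrace <- function(df, g_var, m_var) {
  df |>
    group_by(g_var) |>
    summarize(avg = mean(m_var))
}

# A toy tibble that contains columns literally called g_var and m_var.
toy <- tibble(
  g_var = c("a", "a", "b"),
  m_var = c(1, 3, 10),
  other_group = c("u", "v", "v"),
  other_value = c(100, 200, 300)
)

# We ask for other_group / other_value, but dplyr looks for columns
# named g_var and m_var and silently uses those instead.
result <- toy |> group_mean_no_embrace(other_group, other_value)
result
```

The result is grouped by the column g_var, not by other_group: inside the function, dplyr treats g_var and m_var as column names rather than as the arguments supplied by the caller.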

Embracing

  • Solution is to use embracing: wrap a variable inside of {{}}
compute_group_mean_v2 <- function(df, g_var, m_var) {
  df |> 
    group_by({{ g_var }}) |> 
    summarize(mean({{ m_var }}))
}
flights |> 
  compute_group_mean_v2(month, dep_time)
# A tibble: 12 × 2
   month `mean(dep_time)`
   <int>            <dbl>
 1     1               NA
 2     2               NA
 3     3               NA
 4     4               NA
 5     5               NA
 6     6               NA
 7     7               NA
 8     8               NA
 9     9               NA
10    10               NA
11    11               NA
12    12               NA

Embracing

  • The means above are all NA because dep_time contains missing values; passing na.rm = TRUE to mean() ignores them:
compute_group_mean_v3 <- function(df, g_var, m_var) {
  df |> 
    group_by({{ g_var }}) |> 
    summarize(mean({{ m_var }}, na.rm=TRUE))
}
flights |> 
  compute_group_mean_v3(month, dep_time)
# A tibble: 12 × 2
   month `mean(dep_time, na.rm = TRUE)`
   <int>                          <dbl>
 1     1                          1347.
 2     2                          1348.
 3     3                          1359.
 4     4                          1353.
 5     5                          1351.
 6     6                          1351.
 7     7                          1353.
 8     8                          1350.
 9     9                          1334.
10    10                          1340.
11    11                          1344.
12    12                          1357.

Embracing

  • Typically, embrace a function argument whenever it refers to a column (or an expression involving columns) and is passed on to functions like arrange(), filter(), summarize(), select(), rename(), etc.
  • We’ll see examples of how to apply it.

Embracing: example

  • Let’s create a function which computes some initial summary statistics of a variable:
summary6 <- function(data, var) {
  data |> summarize(
    min = min({{ var }}, na.rm = TRUE),
    mean = mean({{ var }}, na.rm = TRUE),
    median = median({{ var }}, na.rm = TRUE),
    max = max({{ var }}, na.rm = TRUE),
    n = n(),
    n_miss = sum(is.na({{ var }})),
    .groups = "drop"
  )
}
flights |> 
  summary6(distance)
# A tibble: 1 × 6
    min  mean median   max      n n_miss
  <dbl> <dbl>  <dbl> <dbl>  <int>  <int>
1    17 1040.    872  4983 336776      0

Embracing: example

  • We can supply grouped data:
flights |> 
  group_by(year, month, day) |> 
  summary6(distance)
# A tibble: 365 × 9
    year month   day   min  mean median   max     n n_miss
   <int> <int> <int> <dbl> <dbl>  <dbl> <dbl> <int>  <int>
 1  2013     1     1    94 1077.    946  4983   842      0
 2  2013     1     2    94 1053.    944  4983   943      0
 3  2013     1     3    80 1037.    937  4983   914      0
 4  2013     1     4    80 1032.    937  4983   915      0
 5  2013     1     5    80 1068.    950  4983   720      0
 6  2013     1     6    80 1052.    944  4983   832      0
 7  2013     1     7    80  998.    820  4983   933      0
 8  2013     1     8    80  986.    764  4983   899      0
 9  2013     1     9    80  981.    764  4983   902      0
10  2013     1    10    80  993.    765  4983   932      0
# ℹ 355 more rows

Embracing: example

  • We can even do computations on top of variables:
flights |> 
  group_by(year, month, day) |> 
  summary6(log10(distance))
# A tibble: 365 × 9
    year month   day   min  mean median   max     n n_miss
   <int> <int> <int> <dbl> <dbl>  <dbl> <dbl> <int>  <int>
 1  2013     1     1  1.97  2.92   2.98  3.70   842      0
 2  2013     1     2  1.97  2.91   2.97  3.70   943      0
 3  2013     1     3  1.90  2.91   2.97  3.70   914      0
 4  2013     1     4  1.90  2.90   2.97  3.70   915      0
 5  2013     1     5  1.90  2.91   2.98  3.70   720      0
 6  2013     1     6  1.90  2.91   2.97  3.70   832      0
 7  2013     1     7  1.90  2.88   2.91  3.70   933      0
 8  2013     1     8  1.90  2.87   2.88  3.70   899      0
 9  2013     1     9  1.90  2.87   2.88  3.70   902      0
10  2013     1    10  1.90  2.88   2.88  3.70   932      0
# ℹ 355 more rows

Embracing: example

  • We can supply conditions as well:
unique_where <- function(df, condition, var) {
  df |> 
    filter({{ condition }}) |> 
    distinct({{ var }}) |>  # Keep only unique/distinct rows from a data frame. 
    arrange({{ var }})
}
flights |> 
  unique_where(month == 12, dest)
# A tibble: 96 × 1
   dest 
   <chr>
 1 ABQ  
 2 ALB  
 3 ATL  
 4 AUS  
 5 AVL  
 6 BDL  
 7 BGR  
 8 BHM  
 9 BNA  
10 BOS  
# ℹ 86 more rows

Non-embrace examples

See the difference between functions using embracing vs not using embracing.

  • For the flights tibble, create a function called filter_severe() to find all flights that were cancelled (is.na(arr_time)) or delayed by more than an hour.
filter_severe <- function(df) {
  df |> filter(is.na(arr_time) | dep_delay > 60)
}
flights |> filter_severe()
  • Find all flights that were cancelled or delayed by more than a user-supplied number of hours:
filter_severe <- function(df, hours = 1) {
  df |> filter(is.na(arr_time) | dep_delay > 60*hours)
}
flights |> filter_severe(hours = 3)
flights |> filter_severe(3)  # also fine
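
In filter_severe(), the column names are hard-coded and hours is an ordinary value, so no embracing is needed. For contrast, here is a sketch of a function where the caller chooses the column, which does require embracing. The helper filter_above() and the tibble toy_flights are hypothetical and only loosely mimic flights.

```r
library(dplyr)

# Hypothetical helper: the column to test is chosen by the caller,
# so it is embraced; `threshold` is a plain value and is not.
filter_above <- function(df, var, threshold) {
  df |> filter({{ var }} > threshold)
}

toy_flights <- tibble(
  dep_delay = c(5, 90, 200, NA),
  arr_delay = c(0, 75, 30, NA)
)

toy_flights |> filter_above(dep_delay, 60)  # keeps rows 2 and 3
toy_flights |> filter_above(arr_delay, 60)  # keeps row 2 only
```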

Modifying multiple columns

Consider the following tibble (rnorm(n): n independent standard normals)

  • Suppose we want to compute the median of every column:
df <- tibble(x = rnorm(10), z = rnorm(10), y = rnorm(10))
df |> summarize(n = n(), x = median(x), z = median(z), y = median(y))
# A tibble: 1 × 4
      n     x      z       y
  <int> <dbl>  <dbl>   <dbl>
1    10 0.212 -0.798 -0.0483
  • We don’t want to copy+paste more than twice; what if we had 500 columns?
  • Helpful function: across():
df |> summarize(n = n(), across(x:y, median))
# A tibble: 1 × 4
      n     x      z       y
  <int> <dbl>  <dbl>   <dbl>
1    10 0.212 -0.798 -0.0483

across()

  • In the coming slides, we’ll see how across() works and how to modify its behavior.
  • Three especially important arguments to across():
    • .cols: which columns to iterate over
    • .fns: what to do (function) for each column
    • .names: how to name each output column

across(): selecting columns with .cols

  • For .cols, we can use same things we used for select():
df |> summarize(across(-x, median))
# A tibble: 1 × 2
       z       y
   <dbl>   <dbl>
1 -0.798 -0.0483
df |> summarize(across(c(x,z), median))
# A tibble: 1 × 2
      x      z
  <dbl>  <dbl>
1 0.212 -0.798

across(): selecting columns with .cols

Two additional selection helpers are useful here: everything() and where().

  • everything() computes summaries for every non-grouping variable
  • where() allows for selecting columns based on type, e.g. where(is.numeric) for numbers, where(is.character) for strings, where(is.logical) for logicals
df2 <- cbind(df, grp = sample(c("A", "B"), 10, replace = TRUE))  # grp is either "A" or "B"
df2 |> summarize(across(everything(), median))
         x          z           y grp
1 0.211684 -0.7975214 -0.04833544  NA
df2 |> summarize(across(where(is.numeric), median))
         x          z           y
1 0.211684 -0.7975214 -0.04833544
df2 |> summarize(across(where(is.character), str_flatten))
         grp
1 ABBBBBBAAA

across(): calling a single function with .fns

  • .fns says how we want data to be transformed
  • We are passing the function itself to across(); we are not calling it.
    • Never add the () after the function when you pass to across(), otherwise…
df |> summarize(across(everything(), median()))
#> Error in `summarize()`:
#> ℹ In argument: `across(everything(), median())`.
#> Caused by error in `median.default()`:
#> ! argument "x" is missing, with no default
  • Same reason why calling median() in console will produce an error (no input!)
df |> summarize(across(everything(), median))
# A tibble: 1 × 3
      x      z       y
  <dbl>  <dbl>   <dbl>
1 0.212 -0.798 -0.0483

across(): calling multiple functions with .fns

  • We may want to apply multiple transformations or have multiple arguments
  • Motivating example: tibble with missing data
df_miss
# A tibble: 5 × 3
       x      y       z
   <dbl>  <dbl>   <dbl>
1 NA     NA      0.495 
2 -0.829 NA     -0.531 
3 -0.221 -0.430  0.0803
4  0.915 -1.40  -0.288 
5  0.570 -1.06  -0.381 
df_miss |> 
  summarize(
    across(x:z, median),
    n = n()
    )
# A tibble: 1 × 4
      x     y      z     n
  <dbl> <dbl>  <dbl> <int>
1    NA    NA -0.288     5
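
The slides do not show how df_miss was created. To follow along, a stand-in can be typed in from the printed output; the name df_miss_approx is an assumption made here, and because the printed values are rounded, results will only agree with the slides to about three significant digits.

```r
library(dplyr)

# Stand-in for df_miss, typed in from the rounded values printed above.
df_miss_approx <- tibble(
  x = c(NA, -0.829, -0.221, 0.915, 0.570),
  y = c(NA, NA, -0.430, -1.40, -1.06),
  z = c(0.495, -0.531, 0.0803, -0.288, -0.381)
)

# median() returns NA as soon as a column contains a missing value.
medians <- df_miss_approx |>
  summarize(across(x:z, median), n = n())
medians
```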

across(): calling multiple functions with .fns

  • If we want to pass along argument na.rm = TRUE we can create a new function in-line which calls median:
df_miss |> 
  summarize(
    across(x:z, function(x) median(x, na.rm = TRUE)),
    n = n())
# A tibble: 1 × 4
      x     y      z     n
  <dbl> <dbl>  <dbl> <int>
1 0.174 -1.06 -0.288     5
  • A function without a name is called anonymous.
  • R (version 4.1 and later) also provides a shorthand for creating anonymous functions, \(x):
df_miss |> 
  summarize(
    across(x:z, \(x) median(x, na.rm = TRUE)),
    n = n())
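
As a quick check that the two spellings define the same function, they can be compared outside of across(). The names median_long and median_short are hypothetical and are assigned only so the two versions can be called side by side; inside across() they would normally stay anonymous.

```r
values <- c(3, NA, 1, 8)

# The same function written two ways (the \(x) form needs R >= 4.1).
median_long  <- function(x) median(x, na.rm = TRUE)
median_short <- \(x) median(x, na.rm = TRUE)

median_long(values)   # median of 1, 3, 8 after dropping the NA
median_short(values)
identical(median_long(values), median_short(values))
```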

across(): calling multiple functions with .fns

  • So we can simplify code like …
df_miss |> 
  summarize(
    x = median(x, na.rm = TRUE),
    y = median(y, na.rm = TRUE),
    z = median(z, na.rm = TRUE),
    n = n())
# A tibble: 1 × 4
      x     y      z     n
  <dbl> <dbl>  <dbl> <int>
1 0.174 -1.06 -0.288     5
  • … to …
df_miss |> 
  summarize(
    across(x:z, \(a) median(a, na.rm = TRUE)),
    n = n())
# A tibble: 1 × 4
      x     y      z     n
  <dbl> <dbl>  <dbl> <int>
1 0.174 -1.06 -0.288     5

across(): calling multiple functions with .fns

  • We might also be interested in how many missing values were removed. We can do that, again using across(), by passing a named list to the .fns argument:
df_miss |> 
  summarize(
    across(x:z, list(
      med = \(a) median(a, na.rm = TRUE),
      n_miss = \(a) sum(is.na(a))
    )),
    n = n()
  )
# A tibble: 1 × 7
  x_med x_n_miss y_med y_n_miss  z_med z_n_miss     n
  <dbl>    <int> <dbl>    <int>  <dbl>    <int> <int>
1 0.174        1 -1.06        2 -0.288        0     5
  • Columns are named using “glue”: {.col}_{.fn}
    • .col is the name of the original column and .fn is the name of the function.
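
The naming pattern can be mimicked in base R to see which names it produces and in what order. This is only a sketch of the pattern {.col}_{.fn}, not of how across() builds names internally; the vectors cols and fns are introduced here for illustration.

```r
cols <- c("x", "y", "z")     # original column names
fns  <- c("med", "n_miss")   # names given in the list passed to .fns

# For each column, one name per function: {.col}_{.fn}
out_names <- paste(
  rep(cols, each = length(fns)),
  rep(fns, times = length(cols)),
  sep = "_"
)
out_names
```

The resulting names appear in the same order as the columns of the summary shown above.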

across(): calling multiple functions with .fns

  • Column name examples: consider calculating medians/means for all columns
str(df)
tibble [10 × 3] (S3: tbl_df/tbl/data.frame)
 $ x: num [1:10] 1.393 -0.229 0.171 -0.852 0.349 ...
 $ z: num [1:10] -0.746 -1.156 0.899 -0.851 -0.635 ...
 $ y: num [1:10] -0.559 0.509 0.462 -0.714 -1.482 ...
df |> summarize(across(x:y, list(med = median, mn = mean)))
# A tibble: 1 × 6
  x_med  x_mn  z_med   z_mn   y_med   y_mn
  <dbl> <dbl>  <dbl>  <dbl>   <dbl>  <dbl>
1 0.212 0.186 -0.798 -0.554 -0.0483 0.0952
df |> summarize(across(x:y, list(median, mean)))
# A tibble: 1 × 6
    x_1   x_2    z_1    z_2     y_1    y_2
  <dbl> <dbl>  <dbl>  <dbl>   <dbl>  <dbl>
1 0.212 0.186 -0.798 -0.554 -0.0483 0.0952

across(): output column names with .names

  • Specifying the .names argument allows for custom output names:
df_miss |> 
  summarize(
    across(
      x:z,
      list(
        med = \(x) median(x, na.rm = TRUE),
        n_miss = \(x) sum(is.na(x))
      ),
      .names = "{.fn}_for_{.col}"
    ),
    n = n(),
  )
# A tibble: 1 × 7
  med_for_x n_miss_for_x med_for_y n_miss_for_y med_for_z n_miss_for_z     n
      <dbl>        <int>     <dbl>        <int>     <dbl>        <int> <int>
1     0.174            1     -1.06            2    -0.288            0     5

across(): output column names with .names

  • Specifying .names is especially important if using only one function; by default, across() then gives its output the same names as the input columns and thus replaces them.
    • We saw this behavior previously when using across() inside summarize()
  • e.g., coalesce(x, y) replaces all appearances of NA in x with the value y
df_miss |> 
  mutate(
    across(x:z, \(x) coalesce(x, 0))
  )
# A tibble: 5 × 3
       x      y       z
   <dbl>  <dbl>   <dbl>
1  0      0      0.495 
2 -0.829  0     -0.531 
3 -0.221 -0.430  0.0803
4  0.915 -1.40  -0.288 
5  0.570 -1.06  -0.381 

across(): output column names with .names

  • To create new columns, use .names to give output new names:
df_miss |> 
  mutate(
    across(x:z, \(x) coalesce(x, 0), .names = "{.col}_na_zero")
  )
# A tibble: 5 × 6
       x      y       z x_na_zero y_na_zero z_na_zero
   <dbl>  <dbl>   <dbl>     <dbl>     <dbl>     <dbl>
1 NA     NA      0.495      0         0        0.495 
2 -0.829 NA     -0.531     -0.829     0       -0.531 
3 -0.221 -0.430  0.0803    -0.221    -0.430    0.0803
4  0.915 -1.40  -0.288      0.915    -1.40    -0.288 
5  0.570 -1.06  -0.381      0.570    -1.06    -0.381 

Examples: using other data sets

  • Number of unique values in each column of penguins
penguins |> 
  summarize(across(everything(), \(x) length(unique(x))))
# A tibble: 1 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
    <int>  <int>          <int>         <int>             <int>       <int>
1       3      3            165            81                56          95
# ℹ 2 more variables: sex <int>, year <int>

Examples: using other data sets

  • The mean of every column in mtcars:
mtcars |>
  summarize(across(everything(), mean))
       mpg    cyl     disp       hp     drat      wt     qsec     vs      am
1 20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625
    gear   carb
1 3.6875 2.8125

Examples: using other data sets

  • Group diamonds by cut, clarity, and color, then count the number of observations and compute the mean of each numeric column.
diamonds |>
  group_by(cut, clarity, color) |>
  summarize(num = n(), across(where(is.numeric), mean))
# A tibble: 276 × 11
# Groups:   cut, clarity [40]
   cut   clarity color   num carat depth table price     x     y     z
   <ord> <ord>   <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Fair  I1      D         4 1.88   65.6  56.8 7383   7.52  7.42  4.90
 2 Fair  I1      E         9 0.969  65.6  58.1 2095.  6.17  6.06  4.01
 3 Fair  I1      F        35 1.02   65.7  58.4 2544.  6.14  6.04  4.00
 4 Fair  I1      G        53 1.23   65.3  57.7 3187.  6.52  6.43  4.23
 5 Fair  I1      H        52 1.50   65.8  58.4 4213.  6.96  6.86  4.55
 6 Fair  I1      I        34 1.32   65.7  58.4 3501   6.76  6.65  4.41
 7 Fair  I1      J        23 1.99   66.5  57.9 5795.  7.55  7.46  4.99
 8 Fair  SI2     D        56 1.02   64.7  58.6 4355.  6.24  6.17  4.01
 9 Fair  SI2     E        78 1.02   63.4  59.5 4172.  6.28  6.22  3.96
10 Fair  SI2     F        89 1.08   63.8  59.5 4520.  6.36  6.30  4.04
# ℹ 266 more rows

Examples: in a function

  • Task: expand all date columns into year / month / day columns.
expand_dates <- function(df) {
  df |> 
    mutate(
      across(where(is.Date), list(year = year, month = month, day = mday))
    )
}

my_df <- tibble(name = c("Ant", "Bug"), date = ymd(c("2009-08-03", "2010-01-16"))) 
my_df |> 
  expand_dates()
# A tibble: 2 × 5
  name  date       date_year date_month date_day
  <chr> <date>         <dbl>      <dbl>    <int>
1 Ant   2009-08-03      2009          8        3
2 Bug   2010-01-16      2010          1       16

Filtering

  • across() is great with summarize() and mutate(), but not so much with filter() because there we usually combine conditions with & / |.
  • dplyr variants if_any() and if_all() help to combine logicals across columns.
# same as df_miss |> filter(is.na(x) | is.na(y) | is.na(z))
df_miss |> filter(if_any(x:z, is.na))
# A tibble: 2 × 3
       x     y      z
   <dbl> <dbl>  <dbl>
1 NA        NA  0.495
2 -0.829    NA -0.531
# same as df_miss |> filter(is.na(x) & is.na(y) & is.na(z))
df_miss |> filter(if_all(x:z, is.na))
# A tibble: 0 × 3
# ℹ 3 variables: x <dbl>, y <dbl>, z <dbl>
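
The equivalences stated in the comments can be checked directly. The sketch below uses a stand-in tibble typed in from the rounded values printed for df_miss; the name df_miss_approx is an assumption made for this illustration.

```r
library(dplyr)

# Stand-in for df_miss (rounded values copied from the printed output).
df_miss_approx <- tibble(
  x = c(NA, -0.829, -0.221, 0.915, 0.570),
  y = c(NA, NA, -0.430, -1.40, -1.06),
  z = c(0.495, -0.531, 0.0803, -0.288, -0.381)
)

# if_any() combines the per-column conditions with |
any_na_short <- df_miss_approx |> filter(if_any(x:z, is.na))
any_na_long  <- df_miss_approx |> filter(is.na(x) | is.na(y) | is.na(z))

# if_all() combines the per-column conditions with &
all_na_short <- df_miss_approx |> filter(if_all(x:z, is.na))
all_na_long  <- df_miss_approx |> filter(is.na(x) & is.na(y) & is.na(z))

c(nrow(any_na_short), nrow(any_na_long))  # 2 2
c(nrow(all_na_short), nrow(all_na_long))  # 0 0
```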

across() vs pivot_longer()

Suppose we want to compute a weighted mean.

  • Suppose the tibble contains both values and weights.
df_paired
# A tibble: 3 × 6
  Ant_score Ant_wts Bug_score Bug_wts Cat_score Cat_wts
      <int>   <dbl>     <int>   <dbl>     <int>   <dbl>
1        74    0.25        61    0.25        84    0.25
2        66    0.25        78    0.25        91    0.25
3        84    0.5         70    0.5         77    0.5 
  • There is no direct way to do this with across(), but it is easy with pivot_longer().
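
The slides do not show how df_paired was created. A stand-in can be typed in from the printed tibble, as sketched below, together with a hand computation of the target quantity for one group.

```r
library(tibble)

# Stand-in for df_paired, typed in from the printed tibble above.
df_paired <- tibble(
  Ant_score = c(74L, 66L, 84L), Ant_wts = c(0.25, 0.25, 0.5),
  Bug_score = c(61L, 78L, 70L), Bug_wts = c(0.25, 0.25, 0.5),
  Cat_score = c(84L, 91L, 77L), Cat_wts = c(0.25, 0.25, 0.5)
)

# The weighted mean for one group, computed column pair by column pair:
# 74 * 0.25 + 66 * 0.25 + 84 * 0.5 = 77
ant_wm <- weighted.mean(df_paired$Ant_score, df_paired$Ant_wts)
ant_wm
```

Repeating this by hand for every score/weight pair is exactly the kind of copy and paste we want to avoid, which motivates reshaping the data first.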

across() vs pivot_longer()

( df_long <- df_paired |> 
  pivot_longer(
    cols = everything(), 
    names_to = c("group", ".value"), 
    names_sep = "_"
  ) )
# A tibble: 9 × 3
  group score   wts
  <chr> <int> <dbl>
1 Ant      74  0.25
2 Bug      61  0.25
3 Cat      84  0.25
4 Ant      66  0.25
5 Bug      78  0.25
6 Cat      91  0.25
7 Ant      84  0.5 
8 Bug      70  0.5 
9 Cat      77  0.5 
df_long |> 
  group_by(group) |> 
  summarize(wm = weighted.mean(score, wts))
# A tibble: 3 × 2
  group    wm
  <chr> <dbl>
1 Ant    77  
2 Bug    69.8
3 Cat    82.2