09: Functions

STA35B: Statistical Data Science 2

Akira Horiguchi

Functions

  • In the next couple of lectures, we’ll talk about writing functions
  • Functions allow for automating tasks, with a number of advantages:
    • Readability: have well-defined tasks that are encapsulated within individual functions, easier to debug
    • Portability: makes it easy to re-use code

Vector functions

Vector functions take one or more vectors as input and return a vector as a result

  • Try to understand this code. (Do you spot the error?)
df <- tibble(a = rnorm(4), b = rnorm(4), c = rnorm(4))
df |> mutate(
  a = (a - min(a, na.rm = TRUE)) / 
    (max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),
  b = (b - min(b, na.rm = TRUE)) / 
    (max(b, na.rm = TRUE) - min(a, na.rm = TRUE)),
  c = (c - min(c, na.rm = TRUE)) / 
    (max(c, na.rm = TRUE) - min(c, na.rm = TRUE))
)
# A tibble: 4 × 3
      a     b     c
  <dbl> <dbl> <dbl>
1 0.386  5.44 0    
2 0.595  4.71 0.855
3 1      4.73 0.206
4 0      0    1    
  • Seems to be rescaling each column to have range between 0 and 1
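  • Did you spot the error? In column b, the denominator subtracts min(a, na.rm = TRUE) instead of min(b, na.rm = TRUE), so b does not end up between 0 and 1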
  • Functions will generally make it harder to make these kinds of mistakes

Writing functions

Here’s what we were trying to do:

(a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE))
(b - min(b, na.rm = TRUE)) / (max(b, na.rm = TRUE) - min(b, na.rm = TRUE))
(c - min(c, na.rm = TRUE)) / (max(c, na.rm = TRUE) - min(c, na.rm = TRUE))
  • More generally:
(█ - min(█, na.rm = TRUE)) / (max(█, na.rm = TRUE) - min(█, na.rm = TRUE))
  • To write a function, we need 3 things:
    • name: we’ll use rescale01
    • arguments: things that vary across calls; we’ll use x
    • body: the code that repeats
name <- function(arguments) {
  body
}

Writing functions

  • In our case:
rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
  • Can test it out:
rescale01(c(-10, 0, 10))
[1] 0.0 0.5 1.0
rescale01(c(1, 2, 3, NA, 5))
[1] 0.00 0.25 0.50   NA 1.00

Writing functions

Compare code with functions…

df |> mutate(
  a = rescale01(a),
  b = rescale01(b),
  c = rescale01(c),
)
# A tibble: 4 × 3
      a     b     c
  <dbl> <dbl> <dbl>
1 0.386 1     0    
2 0.595 0.866 0.855
3 1     0.869 0.206
4 0     0     1    

… to code without functions:

df |> mutate(
  a = (a - min(a, na.rm = TRUE)) / 
    (max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),
  b = (b - min(b, na.rm = TRUE)) / 
    (max(b, na.rm = TRUE) - min(a, na.rm = TRUE)),
  c = (c - min(c, na.rm = TRUE)) / 
    (max(c, na.rm = TRUE) - min(c, na.rm = TRUE))
)
# A tibble: 4 × 3
      a     b     c
  <dbl> <dbl> <dbl>
1 0.386  5.44 0    
2 0.595  4.71 0.855
3 1      4.73 0.206
4 0      0    1    

Improving the function: remove redundant calculations

rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

Note: we are computing min(x, na.rm = TRUE) twice.

  • Redundant: what if min(x, na.rm = TRUE) takes a long time to compute?
  • Let’s compute min(x, na.rm = TRUE) only once rather than twice.
rescale01_v2 <- function(x) {
  minval <- min(x, na.rm=TRUE)
  maxval <- max(x, na.rm=TRUE)
  (x - minval) / (maxval - minval)
}
  • Much easier than having to go through copy/pasted code!

Improving the function: deal with infinity

rescale01_v2(c(1:5, Inf))
[1]   0   0   0   0   0 NaN
  • What if we want to modify it to deal with infinite values differently?
  • range() returns the min and max of a vector, and has arguments na.rm = and finite =; na.rm = TRUE drops NA’s, while finite = TRUE drops all non-finite values (NA, NaN, Inf, -Inf)
    • max() and min() do not have the argument finite =
rescale01_v3 <- function(x) {
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  minval <- rng[1]
  maxval <- rng[2]
  (x - minval) / (maxval - minval)
}
rescale01_v3(c(1:5, Inf))
[1] 0.00 0.25 0.50 0.75 1.00  Inf
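
To see what finite = TRUE changes, here is a small base-R check. The vector x is made up for illustration and is not from the slides.

```r
# Illustrative vector with one missing value and two infinite values
x <- c(1:5, NA, Inf, -Inf)

# na.rm = TRUE drops the NA but keeps the infinite values
rng_all <- range(x, na.rm = TRUE)
rng_all     # -Inf and Inf

# finite = TRUE drops every non-finite value (NA, NaN, Inf, -Inf)
rng_finite <- range(x, finite = TRUE)
rng_finite  # 1 and 5
```

This is why rescale01_v3() maps the finite values onto [0, 1] and leaves Inf as Inf.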

Mutate functions

We’ll now look at functions which work well with mutate() and filter()

  • Consider a variation of rescale01(), where we compute the z-score: rescaling a vector to have mean zero and standard deviation of one.
z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}
  • The value of the last evaluated expression is returned as the output of the function
  • Can also use return()
z_score <- function(x) {
  results <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  return(results)
}
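
A quick sanity check of z_score(), using a made-up input vector: after rescaling, the mean should be (numerically) zero and the standard deviation one, with any NA left in place.

```r
# z_score() as defined on the slide
z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Made-up input with one missing value
x <- c(2, 4, 4, 6, NA, 9)
z <- z_score(x)

mean(z, na.rm = TRUE)  # approximately 0, up to floating-point error
sd(z, na.rm = TRUE)    # approximately 1
is.na(z)               # the NA stays in its original position
```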

Mutate functions

  • Function clamp() which makes sure all values in a vector lie between a value min and value max
clamp <- function(x, min, max) {
  case_when(
    x < min ~ min,
    x > max ~ max,
    .default = x
  )
}

clamp(1:10, min = 3, max = 7)
 [1] 3 3 3 4 5 6 7 7 7 7
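
For comparison, the same clamping can be sketched in base R with pmax() and pmin(). The name clamp_base() is hypothetical and not part of the slides.

```r
# Hypothetical base-R variant of clamp(); not part of the original slides
clamp_base <- function(x, min, max) {
  # pmax() raises values below `min`; pmin() then lowers values above `max`
  pmin(pmax(x, min), max)
}

clamped <- clamp_base(1:10, min = 3, max = 7)
clamped
#> [1] 3 3 3 4 5 6 7 7 7 7
```

The case_when() version reads closer to the verbal description ("below min, above max, otherwise unchanged"), which is one reason to prefer it when teaching.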

Mutate functions

Functions can also be applied to non-numeric variables.

  • Example: remove all percent signs, commas, and dollar signs from a string and then convert it to a number.
clean_number <- function(x) {
  is_pct <- str_detect(x, "%")
  num <- x |> 
    str_remove_all("%") |> 
    str_remove_all(",") |> 
    str_remove_all("\\$") |> 
    as.numeric()
  if_else(is_pct, num / 100, num)
}
clean_number("$12,300")
[1] 12300
clean_number("45%")
[1] 0.45
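
Because clean_number() is vectorized, it can be used directly inside mutate(). A minimal sketch, with a made-up tibble called sales:

```r
library(dplyr)
library(stringr)

# clean_number() as defined on the slide
clean_number <- function(x) {
  is_pct <- str_detect(x, "%")
  num <- x |>
    str_remove_all("%") |>
    str_remove_all(",") |>
    str_remove_all("\\$") |>
    as.numeric()
  if_else(is_pct, num / 100, num)
}

# Made-up data for illustration
sales <- tibble(
  item = c("A", "B", "C"),
  price = c("$1,200", "$35", "$7,500"),
  discount = c("10%", "5%", "25%")
)

# Convert both string columns to numbers
sales_clean <- sales |>
  mutate(
    price = clean_number(price),
    discount = clean_number(discount)
  )
sales_clean
```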

Mutate functions

  • Another example: taking a vector, and if someone previously used numbers 997 / 998 / 999 to record missing values, replace them with NA:
fix_na <- function(x) {
  if_else(x %in% c(997, 998, 999), NA, x)
}
fix_na(c(3, 998, -54))
[1]   3  NA -54
  • You can use multiple arguments
make_a_pair <- function(x, y) {
  str_c("The pair will be ", x, " and ", y)
}
make_a_pair("Bug", "Cat")
[1] "The pair will be Bug and Cat"

Summary functions

Functions which take a vector and return a single value - useful for summarize()

commas <- function(x) {
  str_flatten(x, collapse = ", ", last = ", and ")
}
commas(c("cat", "dog", "pigeon"))
[1] "cat, dog, and pigeon"
  • You can set default values for arguments
commas_v2 <- function(x, i_hate_oxford_commas=FALSE) {
  last_str <- if_else(i_hate_oxford_commas, " and ", ", and ")
  str_flatten(x, collapse = ", ", last = last_str)
}
commas_v2(c("cat", "dog", "pigeon"))
[1] "cat, dog, and pigeon"
commas_v2(c("cat", "dog", "pigeon"), i_hate_oxford_commas=TRUE)
[1] "cat, dog and pigeon"
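
A summary function like commas() returns one value per group when called inside summarize(). A small sketch, using a made-up tibble called pets:

```r
library(dplyr)
library(stringr)

# commas() as defined on the slide
commas <- function(x) {
  str_flatten(x, collapse = ", ", last = ", and ")
}

# Made-up data for illustration
pets <- tibble(
  owner = c("Ana", "Ana", "Ana", "Ben", "Ben", "Ben"),
  pet = c("cat", "dog", "pigeon", "ant", "bee", "bug")
)

# One row per owner, with that owner's pets collapsed into a single string
pet_lists <- pets |>
  group_by(owner) |>
  summarize(pets = commas(pet))
pet_lists
```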

Example

  1. Write a function age_in_years() which takes in a vector of strings of birthdates of the form “YYYY-MM-DD” and computes the age in years. (The output below depends on the date the code is run.)
age_in_years <- function(dates) {
  time_intervals <- ymd(dates) %--% today()
  floor(time_intervals / years(1))
}
age_in_years(c("2011-11-11", "1997-02-08", "2025-02-08"))
[1] 13 28  0
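
Since today() changes from day to day, the result of age_in_years() is not reproducible. One possible variant, sketched here, takes the reference date as an argument; the name age_in_years_at() and the argument ref_date are hypothetical additions, not part of the original function.

```r
library(lubridate)

# Hypothetical variant: `ref_date` is an added argument (defaults to today)
age_in_years_at <- function(dates, ref_date = today()) {
  time_intervals <- ymd(dates) %--% as_date(ref_date)
  floor(time_intervals / years(1))
}

# With a fixed reference date, the result no longer depends on when the code runs
ages <- age_in_years_at(c("2011-11-11", "1997-02-08"), ref_date = "2025-03-01")
ages
#> [1] 13 28
```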

Example

  1. Write a function both_na() which takes in two vectors of the same length and returns the number of positions which have an NA in both vectors
both_na <- function(vec1, vec2) {
  sum(is.na(vec1) & is.na(vec2))
}
both_na(c(1, 2, NA, 3), c(4, 5, NA, NA))
[1] 1
both_na(c(1, 2, NA, NA), c(4, 5, NA, NA))
[1] 2

Style: function names

Function names should be…

  • …verbs; arguments should be nouns (typically)
  • …descriptive; tell the reader something about the function
# Too short
f()

# Not a verb, or descriptive
my_awesome_function()

# Long, but clear
impute_missing()
collapse_years()

Style: spacing

  • Indentation and spacing around operators generally do not affect code execution in R (unlike indentation in Python)
  • Nonetheless, it helps readers (including future you!) to use consistent spacing conventions
  • For specifics, see https://style.tidyverse.org/syntax.html#spacing
  • In RStudio, the command Ctrl + I on a given line will indent it correctly (usually)
unique_where <- function(df, condition, var) {
df |> 
filter({{ condition }}) |> 
distinct({{ var }}) |>  # Keep only unique/distinct rows from a data frame. 
arrange({{ var }})
}

Data frame functions

We will often manipulate tibbles in a repetitive way

  • Translating this into functions is a bit complicated for reasons to be discussed

Data frame functions

  • Consider trying to compute the mean of a tibble by groups.
g_mean <- function(df, g_var, m_var) {
  df |> 
    group_by(g_var) |> 
    summarize(mean(m_var))
}
flights |> g_mean(day, dep_time)
#> Error in `group_by()`:
#> ! Must group by variables found in `.data`.
#> ✖ Column `g_var` is not found.
  • Error message: “Column g_var is not found.” What’s happening here? Consider a small tibble df that happens to have columns literally named g_var and m_var:
df
# A tibble: 2 × 5
  m_var g_var   age     x     y
  <dbl> <chr> <dbl> <dbl> <dbl>
1     4 Ant       3    10    -6
2     6 Bug       3    12    -4
df |> g_mean(age, x)
# A tibble: 2 × 2
  g_var `mean(m_var)`
  <chr>         <dbl>
1 Ant               4
2 Bug               6
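  • The function runs df |> group_by(g_var) instead of df |> group_by(age): group_by() looks for a column literally named g_var rather than using the column name we passed in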

Embracing

  • Solution is to use embracing: wrap a variable inside of {{}}
g_mean_v2 <- function(df, g_var, m_var) {
  df |> 
    group_by({{ g_var }}) |> 
    summarize(mean({{ m_var }}))
}
df |> g_mean_v2(age, x)
# A tibble: 1 × 2
    age `mean(x)`
  <dbl>     <dbl>
1     3        11
df |> g_mean_v2(age, y)
# A tibble: 1 × 2
    age `mean(y)`
  <dbl>     <dbl>
1     3        -5
  • Typically use embracing whenever a function argument is passed on to functions like arrange(), filter(), summarize(), select(), rename(), etc.
  • We’ll see examples of how to apply it
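
As a first small example of embracing, here is a sketch of a helper that counts the values of a column and adds proportions. The name count_prop() and the tibble animals are illustrative, not from the slides.

```r
library(dplyr)

# Hypothetical helper: counts each value of `var` and adds the
# proportion of rows it accounts for
count_prop <- function(df, var) {
  df |>
    count({{ var }}) |>          # embrace so `var` refers to the caller's column
    mutate(prop = n / sum(n))
}

# Made-up data for illustration
animals <- tibble(kind = c("Ant", "Bug", "Bug", "Cat"))
props <- animals |> count_prop(kind)
props
```

Without the {{ }}, count() would look for a column literally named var, which is the same problem g_mean() ran into.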

Embracing

  • Let’s create a function which does initial summary statistics of a variable
summary6 <- function(data, var) {
  data |> summarize(
    min = min({{ var }}, na.rm = TRUE),
    mean = mean({{ var }}, na.rm = TRUE),
    median = median({{ var }}, na.rm = TRUE),
    max = max({{ var }}, na.rm = TRUE),
    n = n(),
    n_miss = sum(is.na({{ var }})),
    .groups = "drop"
  )
}
flights |> summary6(distance)
# A tibble: 1 × 6
    min  mean median   max      n n_miss
  <dbl> <dbl>  <dbl> <dbl>  <int>  <int>
1    17 1040.    872  4983 336776      0

Embracing

  • We can supply grouped data:
flights |> 
  group_by(year, month, day) |> 
  summary6(distance)
# A tibble: 365 × 9
    year month   day   min  mean median   max     n n_miss
   <int> <int> <int> <dbl> <dbl>  <dbl> <dbl> <int>  <int>
 1  2013     1     1    94 1077.    946  4983   842      0
 2  2013     1     2    94 1053.    944  4983   943      0
 3  2013     1     3    80 1037.    937  4983   914      0
 4  2013     1     4    80 1032.    937  4983   915      0
 5  2013     1     5    80 1068.    950  4983   720      0
 6  2013     1     6    80 1052.    944  4983   832      0
 7  2013     1     7    80  998.    820  4983   933      0
 8  2013     1     8    80  986.    764  4983   899      0
 9  2013     1     9    80  981.    764  4983   902      0
10  2013     1    10    80  993.    765  4983   932      0
# ℹ 355 more rows

Embracing

  • We can even do computations on top of variables:
flights |> 
  group_by(year, month, day) |> 
  summary6(log10(distance))
# A tibble: 365 × 9
    year month   day   min  mean median   max     n n_miss
   <int> <int> <int> <dbl> <dbl>  <dbl> <dbl> <int>  <int>
 1  2013     1     1  1.97  2.92   2.98  3.70   842      0
 2  2013     1     2  1.97  2.91   2.97  3.70   943      0
 3  2013     1     3  1.90  2.91   2.97  3.70   914      0
 4  2013     1     4  1.90  2.90   2.97  3.70   915      0
 5  2013     1     5  1.90  2.91   2.98  3.70   720      0
 6  2013     1     6  1.90  2.91   2.97  3.70   832      0
 7  2013     1     7  1.90  2.88   2.91  3.70   933      0
 8  2013     1     8  1.90  2.87   2.88  3.70   899      0
 9  2013     1     9  1.90  2.87   2.88  3.70   902      0
10  2013     1    10  1.90  2.88   2.88  3.70   932      0
# ℹ 355 more rows

Embracing

  • We can supply conditions as well:
unique_where <- function(df, condition, var) {
  df |> 
    filter({{ condition }}) |> 
    distinct({{ var }}) |>  # Keep only unique/distinct rows from a data frame. 
    arrange({{ var }})
}
flights |> unique_where(month == 12, dest)
# A tibble: 96 × 1
   dest 
   <chr>
 1 ABQ  
 2 ALB  
 3 ATL  
 4 AUS  
 5 AVL  
 6 BDL  
 7 BGR  
 8 BHM  
 9 BNA  
10 BOS  
# ℹ 86 more rows

Examples

  • For the flights tibble, find all flights that were cancelled (is.na(arr_time)) or delayed by more than an hour; create a function filter_severe() to do so
filter_severe <- function(df) {
  df |> filter(is.na(arr_time) | dep_delay > 60)
}
flights |> filter_severe()
  • A more general version finds all flights that were cancelled or delayed by more than a user-supplied number of hours:
filter_severe <- function(df, hours = 1) {
  df |> filter(is.na(arr_time) | dep_delay > 60*hours)
}
flights |> filter_severe(hours = 3)
flights |> filter_severe(3)  # also fine

Modifying multiple columns

Consider the following tibble (rnorm(n): n independent standard normals)

  • Suppose we want to compute median of every column:
df <- tibble(a = rnorm(10), c = rnorm(10), b = rnorm(10))
df |> summarize(n = n(), a = median(a), c = median(c), b = median(b))
# A tibble: 1 × 4
      n     a      c       b
  <int> <dbl>  <dbl>   <dbl>
1    10 0.212 -0.798 -0.0483
  • Should never copy+paste more than twice; what if we had 500 columns?
  • Helpful function: across():
df |> summarize(n = n(), across(a:b, median))
# A tibble: 1 × 4
      n     a      c       b
  <int> <dbl>  <dbl>   <dbl>
1    10 0.212 -0.798 -0.0483

across()

  • In the coming slides, we’ll see how across() works and how to modify its behavior.
  • Three especially important arguments to across():
    • .cols: which columns to iterate over
    • .fns: what to do (function) for each column
    • .names: name output of each column
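
A minimal sketch that spells out all three arguments by name. The tibble scores is made up so that the result can be checked by hand.

```r
library(dplyr)

# Small made-up tibble
scores <- tibble(a = c(1, 2, 3), b = c(10, 20, 60))

med <- scores |>
  summarize(
    across(
      .cols = a:b,               # which columns to iterate over
      .fns = median,             # function applied to each column
      .names = "{.col}_median"   # template for the output column names
    )
  )
med  # one row: a_median = 2, b_median = 20
```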

across(): selecting columns with .cols

  • For .cols, we can use same things we used for select():
df |> summarize(across(-a, median))
# A tibble: 1 × 2
       c       b
   <dbl>   <dbl>
1 -0.798 -0.0483
df |> summarize(across(c(a,c), median))
# A tibble: 1 × 2
      a      c
  <dbl>  <dbl>
1 0.212 -0.798

across(): selecting columns with .cols

Two additional selection helpers which are useful inside .cols: everything() and where().

  • everything() computes summaries for every non-grouping variable
  • where() allows for selecting columns based on type, e.g. where(is.numeric) for numbers, where(is.character) for strings, where(is.logical) for logicals
df2 <- cbind(df, grp = sample(c("A", "B"), 10, replace = TRUE))  # grp is either "A" or "B"
df2 |> summarize(across(everything(), median))
Warning: There was 1 warning in `summarize()`.
ℹ In argument: `across(everything(), median)`.
Caused by warning in `mean.default()`:
! argument is not numeric or logical: returning NA
         a          c           b grp
1 0.211684 -0.7975214 -0.04833544  NA
df2 |> summarize(across(where(is.numeric), median))
         a          c           b
1 0.211684 -0.7975214 -0.04833544
df2 |> summarize(across(where(is.character), str_flatten))
         grp
1 ABBBBBBAAA

across(): calling a single function with .fns

  • .fns says how we want data to be transformed
  • We are passing the function to across(); we are not calling the function ourselves.
    • Never add the () after the function name when you pass it to across(), otherwise…
df |> summarize(across(everything(), median()))
#> Error in `summarize()`:
#> ℹ In argument: `across(everything(), median())`.
#> Caused by error in `median.default()`:
#> ! argument "x" is missing, with no default
  • Same reason why calling median() in console will produce an error (no input!)
df |> summarize(across(everything(), median))
# A tibble: 1 × 3
      a      c       b
  <dbl>  <dbl>   <dbl>
1 0.212 -0.798 -0.0483

across(): calling multiple functions with .fns

  • We may want to apply multiple transformations or have multiple arguments
  • Motivating example: tibble with missing data
df_miss
# A tibble: 5 × 3
       a      b       c
   <dbl>  <dbl>   <dbl>
1 NA     NA      0.495 
2 -0.829 NA     -0.531 
3 -0.221 -0.430  0.0803
4  0.915 -1.40  -0.288 
5  0.570 -1.06  -0.381 
df_miss |> 
  summarize(
    across(a:c, median),
    n = n()
    )
# A tibble: 1 × 4
      a     b      c     n
  <dbl> <dbl>  <dbl> <int>
1    NA    NA -0.288     5

across(): calling multiple functions with .fns

  • If we want to pass along argument na.rm = TRUE we can create a new function in-line which calls median:
df_miss |> 
  summarize(
    across(a:c, function(x) median(x, na.rm = TRUE)),
    n = n())
# A tibble: 1 × 4
      a     b      c     n
  <dbl> <dbl>  <dbl> <int>
1 0.174 -1.06 -0.288     5
  • Since version 4.1, R also has a shorthand for creating in-line (anonymous) functions: \(x) in place of function(x):
df_miss |> 
  summarize(
    across(a:c, \(x) median(x, na.rm = TRUE)),
    n = n())
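
The two spellings define the same function. A small check outside of across(), with a made-up input vector (the \(x) form needs R 4.1 or later):

```r
# Two ways to write the same anonymous function
med_long <- function(x) median(x, na.rm = TRUE)
med_short <- \(x) median(x, na.rm = TRUE)

x <- c(3, NA, 1, 2)  # made-up input
med_long(x)   # 2
med_short(x)  # 2
```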

across(): calling multiple functions with .fns

  • So we can simplify code like …
df_miss |> 
  summarize(
    a = median(a, na.rm = TRUE),
    b = median(b, na.rm = TRUE),
    c = median(c, na.rm = TRUE),
    n = n())
# A tibble: 1 × 4
      a     b      c     n
  <dbl> <dbl>  <dbl> <int>
1 0.174 -1.06 -0.288     5
  • … to …
df_miss |> 
  summarize(
    across(a:c, \(x) median(x, na.rm = TRUE)),
    n = n())
# A tibble: 1 × 4
      a     b      c     n
  <dbl> <dbl>  <dbl> <int>
1 0.174 -1.06 -0.288     5

across(): calling multiple functions with .fns

  • We might also be interested in how many missing values were removed. We can again use across(), this time supplying a named list of functions to the .fns argument:
df_miss |> 
  summarize(
    across(a:c, list(
      med = \(x) median(x, na.rm = TRUE),
      n_miss = \(x) sum(is.na(x))
    )),
    n = n()
  )
# A tibble: 1 × 7
  a_med a_n_miss b_med b_n_miss  c_med c_n_miss     n
  <dbl>    <int> <dbl>    <int>  <dbl>    <int> <int>
1 0.174        1 -1.06        2 -0.288        0     5
  • Output columns are named using a “glue” specification, by default {.col}_{.fn}
    • .col is the name of the original column and .fn is the name of the function.

across(): calling multiple functions with .fns

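  • Column name examples: consider calculating medians/means for all columns. With a named list, .fn is the name given in the list; with an unnamed list, it is the function’s position (1, 2, …)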
str(df)
tibble [10 × 3] (S3: tbl_df/tbl/data.frame)
 $ a: num [1:10] 1.393 -0.229 0.171 -0.852 0.349 ...
 $ c: num [1:10] -0.746 -1.156 0.899 -0.851 -0.635 ...
 $ b: num [1:10] -0.559 0.509 0.462 -0.714 -1.482 ...
df |> summarize(across(a:b, list(med = median, mn = mean)))
# A tibble: 1 × 6
  a_med  a_mn  c_med   c_mn   b_med   b_mn
  <dbl> <dbl>  <dbl>  <dbl>   <dbl>  <dbl>
1 0.212 0.186 -0.798 -0.554 -0.0483 0.0952
df |> summarize(across(a:b, list(median, mean)))
# A tibble: 1 × 6
    a_1   a_2    c_1    c_2     b_1    b_2
  <dbl> <dbl>  <dbl>  <dbl>   <dbl>  <dbl>
1 0.212 0.186 -0.798 -0.554 -0.0483 0.0952

across(): output column names with .names

  • Specifying the .names argument allows for custom output names:
df_miss |> 
  summarize(
    across(
      a:c,
      list(
        med = \(x) median(x, na.rm = TRUE),
        n_miss = \(x) sum(is.na(x))
      ),
      .names = "{.fn}_for_{.col}"
    ),
    n = n(),
  )
# A tibble: 1 × 7
  med_for_a n_miss_for_a med_for_b n_miss_for_b med_for_c n_miss_for_c     n
      <dbl>        <int>     <dbl>        <int>     <dbl>        <int> <int>
1     0.174            1     -1.06            2    -0.288            0     5

across(): output column names with .names

  • Specifying .names is especially important if using only one function; by default, across() gives its outputs the same names as its inputs and thus will replace the input columns.
    • We saw this behavior previously when using across() inside summarize()
  • e.g., coalesce(x, y) replaces all appearances of NA in x with the value y
df_miss |> 
  mutate(
    across(a:c, \(x) coalesce(x, 0))
  )
# A tibble: 5 × 3
       a      b       c
   <dbl>  <dbl>   <dbl>
1  0      0      0.495 
2 -0.829  0     -0.531 
3 -0.221 -0.430  0.0803
4  0.915 -1.40  -0.288 
5  0.570 -1.06  -0.381 

across(): output column names with .names

  • Specifying .names is especially important if using only one function; by default, across() gives its outputs the same names as its inputs and thus will replace the input columns.
    • We saw this behavior previously when using across() inside summarize()
  • To create new columns, use .names to give output new names:
df_miss |> 
  mutate(
    across(a:c, \(x) coalesce(x, 0), .names = "{.col}_na_zero")
  )
# A tibble: 5 × 6
       a      b       c a_na_zero b_na_zero c_na_zero
   <dbl>  <dbl>   <dbl>     <dbl>     <dbl>     <dbl>
1 NA     NA      0.495      0         0        0.495 
2 -0.829 NA     -0.531     -0.829     0       -0.531 
3 -0.221 -0.430  0.0803    -0.221    -0.430    0.0803
4  0.915 -1.40  -0.288      0.915    -1.40    -0.288 
5  0.570 -1.06  -0.381      0.570    -1.06    -0.381 

Examples: using other data sets

  • Number of unique values in each column of palmerpenguins::penguins:
penguins |> 
  summarize(across(everything(), \(x) length(unique(x))))
# A tibble: 1 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
    <int>  <int>          <int>         <int>             <int>       <int>
1       3      3            165            81                56          95
# ℹ 2 more variables: sex <int>, year <int>
  • The mean of every column in mtcars:
mtcars |>
  summarize(across(everything(), mean))
       mpg    cyl     disp       hp     drat      wt     qsec     vs      am
1 20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625
    gear   carb
1 3.6875 2.8125

Examples: using other data sets

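  • Group diamonds by cut, clarity, and color, then count the number of observations and compute the mean of each numeric column. (Because num is created before across() runs, where(is.numeric) also picks it up; the mean of a single count is the count itself, which is why num prints as <dbl>.)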
diamonds |>
  group_by(cut, clarity, color) |>
  summarize(num = n(), across(where(is.numeric), mean))
`summarise()` has grouped output by 'cut', 'clarity'. You can override using
the `.groups` argument.
# A tibble: 276 × 11
# Groups:   cut, clarity [40]
   cut   clarity color   num carat depth table price     x     y     z
   <ord> <ord>   <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Fair  I1      D         4 1.88   65.6  56.8 7383   7.52  7.42  4.90
 2 Fair  I1      E         9 0.969  65.6  58.1 2095.  6.17  6.06  4.01
 3 Fair  I1      F        35 1.02   65.7  58.4 2544.  6.14  6.04  4.00
 4 Fair  I1      G        53 1.23   65.3  57.7 3187.  6.52  6.43  4.23
 5 Fair  I1      H        52 1.50   65.8  58.4 4213.  6.96  6.86  4.55
 6 Fair  I1      I        34 1.32   65.7  58.4 3501   6.76  6.65  4.41
 7 Fair  I1      J        23 1.99   66.5  57.9 5795.  7.55  7.46  4.99
 8 Fair  SI2     D        56 1.02   64.7  58.6 4355.  6.24  6.17  4.01
 9 Fair  SI2     E        78 1.02   63.4  59.5 4172.  6.28  6.22  3.96
10 Fair  SI2     F        89 1.08   63.8  59.5 4520.  6.36  6.30  4.04
# ℹ 266 more rows

Examples: in a function

  • Example: expand all date columns into year / month / day columns.
expand_dates <- function(df) {
  df |> 
    mutate(
      across(where(is.Date), list(year = year, month = month, day = mday))
    )
}

tibble(name = c("Ant", "Bug"), date = ymd(c("2009-08-03", "2010-01-16"))) |> 
  expand_dates()
# A tibble: 2 × 5
  name  date       date_year date_month date_day
  <chr> <date>         <dbl>      <dbl>    <int>
1 Ant   2009-08-03      2009          8        3
2 Bug   2010-01-16      2010          1       16

Filtering

  • across() is great with summarize() and mutate(), but not so much with filter() because there we usually combine conditions with & / |.
  • dplyr provides two variants: if_any() and if_all() to help combine logicals across columns
# same as df_miss |> filter(is.na(a) | is.na(b) | is.na(c))
df_miss |> filter(if_any(a:c, is.na))
# A tibble: 2 × 3
       a     b      c
   <dbl> <dbl>  <dbl>
1 NA        NA  0.495
2 -0.829    NA -0.531
# same as df_miss |> filter(is.na(a) & is.na(b) & is.na(c))
df_miss |> filter(if_all(a:c, is.na))
# A tibble: 0 × 3
# ℹ 3 variables: a <dbl>, b <dbl>, c <dbl>
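
The equivalences in the comments above can be checked on a tiny tibble. The tibble d is made up so that the pattern of missing values is known.

```r
library(dplyr)

# Made-up tibble: row 1 has no NA, row 2 has some, row 3 is all NA
d <- tibble(
  a = c(1, NA, NA),
  b = c(2, 5, NA),
  c = c(3, NA, NA)
)

any_na <- d |> filter(if_any(a:c, is.na))  # keeps rows 2 and 3
all_na <- d |> filter(if_all(a:c, is.na))  # keeps row 3 only

# The same filters written out by hand
any_na_manual <- d |> filter(is.na(a) | is.na(b) | is.na(c))
all_na_manual <- d |> filter(is.na(a) & is.na(b) & is.na(c))
```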

across() vs pivot_longer()

Suppose a tibble df_paired contains both values and weights, and we want to compute a weighted mean for each group.

df_paired
# A tibble: 3 × 6
  Ant_score Ant_wts Bug_score Bug_wts Cat_score Cat_wts
      <int>   <dbl>     <int>   <dbl>     <int>   <dbl>
1        74    0.25        61    0.25        84    0.25
2        66    0.25        78    0.25        91    0.25
3        84    0.5         70    0.5         77    0.5 
  • There is no straightforward way to do this with across(), but it is easy with pivot_longer()

across() vs pivot_longer()

( df_long <- df_paired |> 
  pivot_longer(
    cols = everything(), 
    names_to = c("group", ".value"), 
    names_sep = "_"
  ) )
# A tibble: 9 × 3
  group score   wts
  <chr> <int> <dbl>
1 Ant      74  0.25
2 Bug      61  0.25
3 Cat      84  0.25
4 Ant      66  0.25
5 Bug      78  0.25
6 Cat      91  0.25
7 Ant      84  0.5 
8 Bug      70  0.5 
9 Cat      77  0.5 
df_long |> 
  group_by(group) |> 
  summarize(wm = weighted.mean(score, wts))
# A tibble: 3 × 2
  group    wm
  <chr> <dbl>
1 Ant    77  
2 Bug    69.8
3 Cat    82.2
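
As a check on the first row of the result, the weighted mean for the Ant group can be recomputed by hand from the values in df_paired:

```r
# Scores and weights for the Ant group, copied from df_paired
ant_score <- c(74, 66, 84)
ant_wts <- c(0.25, 0.25, 0.5)

# weighted.mean() computes sum(x * w) / sum(w)
by_hand <- sum(ant_score * ant_wts) / sum(ant_wts)
by_hand
#> [1] 77
weighted.mean(ant_score, ant_wts)
#> [1] 77
```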