09: Functions

STA35B: Statistical Data Science 2

Akira Horiguchi

Functions

Over the next couple of lectures, we’ll talk about writing functions
Functions allow for automating tasks, with a number of advantages:
- Readability: have well-defined tasks that are encapsulated within individual functions, easier to debug
- Portability: makes it easy to re-use code

Vector functions

A vector function takes one or more vectors as input and returns a vector as its result

Try to understand this code. (Do you spot the error?)

library(tidyverse)  # provides tibble(), mutate(), and the other functions used below
df <- tibble(a = rnorm(4), b = rnorm(4), c = rnorm(4))
df |> mutate(
  a = (a - min(a, na.rm = TRUE)) / 
    (max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),
  b = (b - min(b, na.rm = TRUE)) / 
    (max(b, na.rm = TRUE) - min(a, na.rm = TRUE)),
  c = (c - min(c, na.rm = TRUE)) / 
    (max(c, na.rm = TRUE) - min(c, na.rm = TRUE))
)

# A tibble: 4 × 3
      a     b     c
  <dbl> <dbl> <dbl>
1 0.386  5.44 0    
2 0.595  4.71 0.855
3 1      4.73 0.206
4 0      0    1

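The code appears to rescale each column so that its values range from 0 to 1
Did you spot the error? In the denominator for column b, min(a, na.rm = TRUE) was copied without changing a to b, so b does not end up between 0 and 1
Functions make this kind of copy-and-paste mistake much harder to make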

Writing functions

Here’s what we were trying to do:

(a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE))
(b - min(b, na.rm = TRUE)) / (max(b, na.rm = TRUE) - min(b, na.rm = TRUE))
(c - min(c, na.rm = TRUE)) / (max(c, na.rm = TRUE) - min(c, na.rm = TRUE))

More generally:

(█ - min(█, na.rm = TRUE)) / (max(█, na.rm = TRUE) - min(█, na.rm = TRUE))

To write functions, need 3 things:
- name: we’ll use rescale01
- arguments: things that vary across calls; we’ll use x
- body: the code that repeats

name <- function(arguments) {
  body
}

Writing functions

In our case:

rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

Can test it out:

rescale01(c(-10, 0, 10))

[1] 0.0 0.5 1.0

rescale01(c(1, 2, 3, NA, 5))

[1] 0.00 0.25 0.50   NA 1.00

Writing functions

Compare code with functions…

df |> mutate(
  a = rescale01(a),
  b = rescale01(b),
  c = rescale01(c),
)

# A tibble: 4 × 3
      a     b     c
  <dbl> <dbl> <dbl>
1 0.386 1     0    
2 0.595 0.866 0.855
3 1     0.869 0.206
4 0     0     1

… to code without functions:

df |> mutate(
  a = (a - min(a, na.rm = TRUE)) / 
    (max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),
  b = (b - min(b, na.rm = TRUE)) / 
    (max(b, na.rm = TRUE) - min(a, na.rm = TRUE)),
  c = (c - min(c, na.rm = TRUE)) / 
    (max(c, na.rm = TRUE) - min(c, na.rm = TRUE))
)

# A tibble: 4 × 3
      a     b     c
  <dbl> <dbl> <dbl>
1 0.386  5.44 0    
2 0.595  4.71 0.855
3 1      4.73 0.206
4 0      0    1

Improving the function: remove redundant calculations

rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

Note: we are computing min(x, na.rm = TRUE) twice.

Redundant: what if min(x, na.rm = TRUE) takes a long time to compute?
Let’s compute min(x, na.rm = TRUE) only once rather than twice.

rescale01_v2 <- function(x) {
  minval <- min(x, na.rm=TRUE)
  maxval <- max(x, na.rm=TRUE)
  (x - minval) / (maxval - minval)
}

Much easier than having to go through copy/pasted code!
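A quick way to gain confidence in a refactor like this is to check that both versions return the same values. The sketch below redefines the two functions from above and compares them on a small vector; the vector x is made up for illustration and is not part of the lecture data.

```r
# Original version: computes min(x, na.rm = TRUE) twice
rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# Refactored version: computes the min and the max once each
rescale01_v2 <- function(x) {
  minval <- min(x, na.rm = TRUE)
  maxval <- max(x, na.rm = TRUE)
  (x - minval) / (maxval - minval)
}

x <- c(3, NA, 7, 11)  # illustrative input (assumption, not lecture data)

# TRUE if the two versions agree on this input
same <- isTRUE(all.equal(rescale01(x), rescale01_v2(x)))
print(same)
```

A check on one input is not a proof, but it catches the most common refactoring slips.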

Improving the function: deal with infinity

rescale01_v2(c(1:5, Inf))

[1]   0   0   0   0   0 NaN

What if we want to modify it to deal with infinite values differently?
range() returns the min and max of a vector in a single call. Its argument na.rm = TRUE removes NAs, and finite = TRUE removes all non-finite values (Inf, -Inf, NaN, and NA) before computing
- max() and min() do not have the argument finite =

rescale01_v3 <- function(x) {
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  minval <- rng[1]
  maxval <- rng[2]
  (x - minval) / (maxval - minval)
}
rescale01_v3(c(1:5, Inf))

[1] 0.00 0.25 0.50 0.75 1.00  Inf
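To see what the finite = argument of range() does on its own, here is a small base R sketch. The input vector is made up for illustration.

```r
x <- c(1:5, Inf, NA)  # illustrative input with one Inf and one NA

# na.rm = TRUE drops the NA but keeps Inf, so the max is Inf
rng_default <- range(x, na.rm = TRUE)

# finite = TRUE drops every non-finite value (NA, NaN, Inf, -Inf)
rng_finite <- range(x, finite = TRUE)

print(rng_default)
print(rng_finite)
```

This is why rescale01_v3() maps the finite values to [0, 1] while leaving the infinite value as Inf.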

Mutate functions

We’ll now look at functions which work well with mutate() and filter()

Consider a variation of rescale01(), where we compute the z-score: rescaling a vector to have mean zero and standard deviation of one.

z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

The value of the last expression evaluated is returned as the output of the function
Can also return a value explicitly with return()

z_score <- function(x) {
  results <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  return(results)
}
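As a sanity check on z_score(), the rescaled vector should have mean 0 and standard deviation 1 (ignoring missing values). A minimal sketch with a made-up input:

```r
z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Illustrative input: mean 4 and sd 2 once the NA is ignored
z <- z_score(c(2, 4, 6, NA))

print(z)
print(mean(z, na.rm = TRUE))
print(sd(z, na.rm = TRUE))
```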

Mutate functions

The function clamp() makes sure all values in a vector lie between a lower bound min and an upper bound max

clamp <- function(x, min, max) {
  case_when(
    x < min ~ min,
    x > max ~ max,
    .default = x
  )
}

clamp(1:10, min = 3, max = 7)

 [1] 3 3 3 4 5 6 7 7 7 7
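For comparison, the same clamping can be sketched in base R with the element-wise functions pmin() and pmax(). The name clamp_base is hypothetical and this is an alternative to the case_when() version above, not a replacement for it.

```r
# Hypothetical base R variant of clamp():
# pmin() caps each value at max, then pmax() raises each value to at least min
clamp_base <- function(x, min, max) {
  pmax(pmin(x, max), min)
}

clamped <- clamp_base(1:10, min = 3, max = 7)
print(clamped)
```

Both versions leave NA values as NA; the case_when() version may be easier to extend when more conditions are needed.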

Mutate functions

Functions can also be applied to non-numeric variables.

Example: remove all percent signs, commas, and dollar signs from a string and then convert it to a number.

clean_number <- function(x) {
  is_pct <- str_detect(x, "%")
  num <- x |> 
    str_remove_all("%") |> 
    str_remove_all(",") |> 
    str_remove_all("\\$") |> 
    as.numeric()
  if_else(is_pct, num / 100, num)
}
clean_number("$12,300")

[1] 12300

clean_number("45%")

[1] 0.45

Mutate functions

Another example: taking a vector, and if someone previously used numbers 997 / 998 / 999 to record missing values, replace them with NA:

fix_na <- function(x) {
  if_else(x %in% c(997, 998, 999), NA, x)
}
fix_na(c(3, 998, -54))

[1]   3  NA -54

You can use multiple arguments

make_a_pair <- function(x, y) {
  str_c("The pair will be ", x, " and ", y)
}
make_a_pair("Bug", "Cat")

[1] "The pair will be Bug and Cat"
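The two ideas on this slide can be combined: the codes that stand for missing values can themselves become a second argument. The sketch below uses base R subsetting rather than if_else(); the name fix_na_codes and the argument codes are hypothetical.

```r
# Hypothetical generalization of fix_na(): the sentinel codes are an argument
fix_na_codes <- function(x, codes) {
  x[x %in% codes] <- NA  # overwrite every element that matches a sentinel code
  x
}

res_default <- fix_na_codes(c(3, 998, -54), codes = c(997, 998, 999))
res_custom  <- fix_na_codes(c(3, -99, 998), codes = -99)

print(res_default)
print(res_custom)
```

In the second call 998 is kept, because only -99 was declared to be a missing-value code.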

Summary functions

Functions which take a vector and return a single value - useful for summarize()

commas <- function(x) {
  str_flatten(x, collapse = ", ", last = ", and ")
}
commas(c("cat", "dog", "pigeon"))

[1] "cat, dog, and pigeon"

You can set default values for arguments

commas_v2 <- function(x, i_hate_oxford_commas=FALSE) {
  last_str <- if_else(i_hate_oxford_commas, " and ", ", and ")
  str_flatten(x, collapse = ", ", last = last_str)
}
commas_v2(c("cat", "dog", "pigeon"))

[1] "cat, dog, and pigeon"

commas_v2(c("cat", "dog", "pigeon"), i_hate_oxford_commas=TRUE)

[1] "cat, dog and pigeon"

Example

Write a function age_in_years() which takes in a vector of strings of birthdates of the form “YYYY-MM-DD” and computes the age in years. (Because it uses today(), the output below depends on the date the code is run.)

age_in_years <- function(dates) {
  time_intervals <- ymd(dates) %--% today()
  floor(time_intervals / years(1))
}
age_in_years(c("2011-11-11", "1997-02-08", "2025-02-08"))

[1] 13 28  0

Example

A function both_na() which takes in two vectors of same length and returns the number of positions which have an NA in both vectors

both_na <- function(vec1, vec2) {
  sum(is.na(vec1) & is.na(vec2))
}
both_na(c(1, 2, NA, 3), c(4, 5, NA, NA))

[1] 1

both_na(c(1, 2, NA, NA), c(4, 5, NA, NA))

[1] 2

Style: function names

Function names should be…

…verbs; arguments should be nouns (typically)
…descriptive; tell the reader something about the function

# Too short
f()

# Not a verb, or descriptive
my_awesome_function()

# Long, but clear
impute_missing()
collapse_years()

Style: spacing

Whitespace does not affect code execution in R (unlike Python)
Nonetheless, it helps readers (including future you!) to use consistent spacing conventions
For specifics, see https://style.tidyverse.org/syntax.html#spacing
In RStudio, the command Ctrl + I on a given line will indent it correctly (usually)

unique_where <- function(df, condition, var) {
df |> 
filter({{ condition }}) |> 
distinct({{ var }}) |>  # Keep only unique/distinct rows from a data frame. 
arrange({{ var }})
}

Data frame functions

We will often manipulate tibbles in a repetitive way

Translating this into functions is a bit complicated for reasons to be discussed

Data frame functions

Consider trying to compute the mean of a tibble by groups.

g_mean <- function(df, g_var, m_var) {
  df |> 
    group_by(g_var) |> 
    summarize(mean(m_var))
}

flights |> g_mean(day, dep_time)
#> Error in `group_by()`:
#> ! Must group by variables found in `.data`.
#> ✖ Column `g_var` is not found.

Error message: “Column g_var is not found.” What’s happening here? To see it, consider a small tibble df that happens to contain columns named g_var and m_var:

df

# A tibble: 2 × 5
  m_var g_var   age     x     y
  <dbl> <chr> <dbl> <dbl> <dbl>
1     4 Ant       3    10    -6
2     6 Bug       3    12    -4

df |> g_mean(age, x)

# A tibble: 2 × 2
  g_var `mean(m_var)`
  <chr>         <dbl>
1 Ant               4
2 Bug               6

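The code is doing df |> group_by(g_var) instead of df |> group_by(age): inside the function, dplyr looks for columns literally named g_var and m_var rather than using the column names passed in as arguments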

Embracing

The solution is embracing: wrap the argument in {{ }} so that dplyr uses the column that was passed to the function rather than the argument’s own name

g_mean_v2 <- function(df, g_var, m_var) {
  df |> 
    group_by({{ g_var }}) |> 
    summarize(mean({{ m_var }}))
}

df |> g_mean_v2(age, x)

# A tibble: 1 × 2
    age `mean(x)`
  <dbl>     <dbl>
1     3        11

df |> g_mean_v2(age, y)

# A tibble: 1 × 2
    age `mean(y)`
  <dbl>     <dbl>
1     3        -5

Embracing is typically needed whenever a function argument is passed on to dplyr verbs such as arrange(), filter(), summarize(), select(), rename(), etc.
We’ll see examples of how to apply it
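A self-contained sketch of embracing on a toy tibble follows. The tibble toy and the function name g_mean_named are made up for illustration; the column deliberately named g_var shows that, with embracing, the function no longer confuses the argument name with a column of the same name.

```r
library(dplyr)

# Hypothetical helper: mean of m_var within the groups defined by g_var
g_mean_named <- function(df, g_var, m_var) {
  df |>
    group_by({{ g_var }}) |>
    summarize(mean = mean({{ m_var }}, na.rm = TRUE), .groups = "drop")
}

toy <- tibble(
  g_var   = c("Ant", "Ant", "Bug"),  # deliberately named like the argument
  species = c("x", "y", "y"),
  value   = c(1, 3, 10)
)

# Groups by species (not by the column called g_var)
res <- toy |> g_mean_named(species, value)
print(res)
```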

Embracing

Let’s create a function which does initial summary statistics of a variable

summary6 <- function(data, var) {
  data |> summarize(
    min = min({{ var }}, na.rm = TRUE),
    mean = mean({{ var }}, na.rm = TRUE),
    median = median({{ var }}, na.rm = TRUE),
    max = max({{ var }}, na.rm = TRUE),
    n = n(),
    n_miss = sum(is.na({{ var }})),
    .groups = "drop"
  )
}

flights |> summary6(distance)

# A tibble: 1 × 6
    min  mean median   max      n n_miss
  <dbl> <dbl>  <dbl> <dbl>  <int>  <int>
1    17 1040.    872  4983 336776      0

Embracing

We can supply grouped data:

flights |> 
  group_by(year, month, day) |> 
  summary6(distance)

# A tibble: 365 × 9
    year month   day   min  mean median   max     n n_miss
   <int> <int> <int> <dbl> <dbl>  <dbl> <dbl> <int>  <int>
 1  2013     1     1    94 1077.    946  4983   842      0
 2  2013     1     2    94 1053.    944  4983   943      0
 3  2013     1     3    80 1037.    937  4983   914      0
 4  2013     1     4    80 1032.    937  4983   915      0
 5  2013     1     5    80 1068.    950  4983   720      0
 6  2013     1     6    80 1052.    944  4983   832      0
 7  2013     1     7    80  998.    820  4983   933      0
 8  2013     1     8    80  986.    764  4983   899      0
 9  2013     1     9    80  981.    764  4983   902      0
10  2013     1    10    80  993.    765  4983   932      0
# ℹ 355 more rows

Embracing

We can even do computations on top of variables:

flights |> 
  group_by(year, month, day) |> 
  summary6(log10(distance))

# A tibble: 365 × 9
    year month   day   min  mean median   max     n n_miss
   <int> <int> <int> <dbl> <dbl>  <dbl> <dbl> <int>  <int>
 1  2013     1     1  1.97  2.92   2.98  3.70   842      0
 2  2013     1     2  1.97  2.91   2.97  3.70   943      0
 3  2013     1     3  1.90  2.91   2.97  3.70   914      0
 4  2013     1     4  1.90  2.90   2.97  3.70   915      0
 5  2013     1     5  1.90  2.91   2.98  3.70   720      0
 6  2013     1     6  1.90  2.91   2.97  3.70   832      0
 7  2013     1     7  1.90  2.88   2.91  3.70   933      0
 8  2013     1     8  1.90  2.87   2.88  3.70   899      0
 9  2013     1     9  1.90  2.87   2.88  3.70   902      0
10  2013     1    10  1.90  2.88   2.88  3.70   932      0
# ℹ 355 more rows

Embracing

We can supply conditions as well:

unique_where <- function(df, condition, var) {
  df |> 
    filter({{ condition }}) |> 
    distinct({{ var }}) |>  # Keep only unique/distinct rows from a data frame. 
    arrange({{ var }})
}

flights |> unique_where(month == 12, dest)

# A tibble: 96 × 1
   dest 
   <chr>
 1 ABQ  
 2 ALB  
 3 ATL  
 4 AUS  
 5 AVL  
 6 BDL  
 7 BGR  
 8 BHM  
 9 BNA  
10 BOS  
# ℹ 86 more rows

Examples

For flights tibble, find all flights that were cancelled (is.na(arr_time)) or delayed by more than an hour; create a function filter_severe() to do so

filter_severe <- function(df) {
  df |> filter(is.na(arr_time) | dep_delay > 60)
}
flights |> filter_severe()

A more general version finds all flights that were cancelled or delayed by more than a user-supplied number of hours:

filter_severe <- function(df, hours = 1) {
  df |> filter(is.na(arr_time) | dep_delay > 60*hours)
}
flights |> filter_severe(hours = 3)
flights |> filter_severe(3)  # also fine
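To see how the hours argument changes the result without loading the full flights data, the function can be tried on a tiny made-up tibble. The tibble toy_flights and its values are assumptions for illustration only.

```r
library(dplyr)

filter_severe <- function(df, hours = 1) {
  df |> filter(is.na(arr_time) | dep_delay > 60 * hours)
}

# Made-up stand-in for flights with only the columns the function needs
toy_flights <- tibble(
  flight    = c("F1", "F2", "F3", "F4"),
  arr_time  = c(900, NA, 1200, 1500),
  dep_delay = c(10, NA, 90, 200)
)

severe_1h <- toy_flights |> filter_severe()           # default: more than 1 hour
severe_3h <- toy_flights |> filter_severe(hours = 3)  # more than 3 hours

print(severe_1h$flight)
print(severe_3h$flight)
```

The cancelled flight F2 is kept in both cases because is.na(arr_time) is TRUE, even though its dep_delay is missing.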

Modifying multiple columns

Consider the following tibble (rnorm(n): n independent standard normals)

Suppose we want to compute median of every column:

df <- tibble(a = rnorm(10), c = rnorm(10), b = rnorm(10))
df |> summarize(n = n(), a = median(a), c = median(c), b = median(b))

# A tibble: 1 × 4
      n     a      c       b
  <int> <dbl>  <dbl>   <dbl>
1    10 0.212 -0.798 -0.0483

Should never copy+paste more than twice; what if we had 500 columns?

Helpful function: across():

df |> summarize(n = n(), across(a:b, median))

# A tibble: 1 × 4
      n     a      c       b
  <int> <dbl>  <dbl>   <dbl>
1    10 0.212 -0.798 -0.0483

`across()`

In the coming slides, we’ll see how across() works and how to modify its behavior.
Three especially important arguments to across():
- .cols: which columns to iterate over
- .fns: what to do (function) for each column
- .names: name output of each column
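The three arguments listed above can be written out by name in a single call. This is a minimal sketch on a made-up tibble; the name toy and its values are assumptions for illustration.

```r
library(dplyr)

toy <- tibble(a = c(1, 2, 3), b = c(10, 20, 60))  # illustrative data

res <- toy |>
  summarize(
    across(
      .cols  = a:b,              # which columns to iterate over
      .fns   = median,           # the function to apply (passed, not called)
      .names = "{.col}_median"   # how to name each output column
    )
  )

print(res)
```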

`across()`: selecting columns with `.cols`

For .cols, we can use same things we used for select():

df |> summarize(across(-a, median))

# A tibble: 1 × 2
       c       b
   <dbl>   <dbl>
1 -0.798 -0.0483

df |> summarize(across(c(a,c), median))

# A tibble: 1 × 2
      a      c
  <dbl>  <dbl>
1 0.212 -0.798

`across()`: selecting columns with `.cols`

Two selection helpers are especially useful inside .cols: everything() and where().

everything() selects every non-grouping column, so the function is applied to all of them
where() allows for selecting columns based on type, e.g. where(is.numeric) for numbers, where(is.character) for strings, where(is.logical) for logicals

df2 <- cbind(df, grp = sample(c("A", "B"), 10, replace = TRUE))  # grp is either "A" or "B"; cbind() returns a data frame, not a tibble
df2 |> summarize(across(everything(), median))

Warning: There was 1 warning in `summarize()`.
ℹ In argument: `across(everything(), median)`.
Caused by warning in `mean.default()`:
! argument is not numeric or logical: returning NA

         a          c           b grp
1 0.211684 -0.7975214 -0.04833544  NA

df2 |> summarize(across(where(is.numeric), median))

         a          c           b
1 0.211684 -0.7975214 -0.04833544

df2 |> summarize(across(where(is.character), str_flatten))

         grp
1 ABBBBBBAAA

`across()`: calling a single function with `.fns`

.fns says how we want data to be transformed
We are passing the function itself to across(); across() then calls it on each column for us.
- Never add the () after the function name when you pass it to across(), otherwise…

df |> summarize(across(everything(), median()))
#> Error in `summarize()`:
#> ℹ In argument: `across(everything(), median())`.
#> Caused by error in `median.default()`:
#> ! argument "x" is missing, with no default

Same reason why calling median() in console will produce an error (no input!)

df |> summarize(across(everything(), median))

# A tibble: 1 × 3
      a      c       b
  <dbl>  <dbl>   <dbl>
1 0.212 -0.798 -0.0483

`across()`: calling multiple functions with `.fns`

We may want to apply multiple transformations or have multiple arguments
Motivating example: tibble with missing data

df_miss

# A tibble: 5 × 3
       a      b       c
   <dbl>  <dbl>   <dbl>
1 NA     NA      0.495 
2 -0.829 NA     -0.531 
3 -0.221 -0.430  0.0803
4  0.915 -1.40  -0.288 
5  0.570 -1.06  -0.381

df_miss |> 
  summarize(
    across(a:c, median),
    n = n()
    )

# A tibble: 1 × 4
      a     b      c     n
  <dbl> <dbl>  <dbl> <int>
1    NA    NA -0.288     5

`across()`: calling multiple functions with `.fns`

If we want to pass along argument na.rm = TRUE we can create a new function in-line which calls median:

df_miss |> 
  summarize(
    across(a:c, function(x) median(x, na.rm = TRUE)),
    n = n())

# A tibble: 1 × 4
      a     b      c     n
  <dbl> <dbl>  <dbl> <int>
1 0.174 -1.06 -0.288     5

R (version 4.1.0 and later) also provides a shorthand for creating in-line functions: \(x) is equivalent to function(x)

df_miss |> 
  summarize(
    across(a:c, \(x) median(x, na.rm = TRUE)),
    n = n())
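The equivalence of the two spellings can be checked in base R, outside of across(). In this sketch the names f_long and f_short are hypothetical, and v holds the rounded values of column a as printed for df_miss above.

```r
# The same anonymous function written two ways
f_long  <- function(x) median(x, na.rm = TRUE)
f_short <- \(x) median(x, na.rm = TRUE)  # shorthand, requires R >= 4.1.0

# Rounded values of column a from the df_miss printout
v <- c(NA, -0.829, -0.221, 0.915, 0.570)

print(f_long(v))
print(f_short(v))
```

Both calls return the same median, which rounds to the 0.174 shown in the summarize() output.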

`across()`: calling multiple functions with `.fns`

So we can simplify code like …

df_miss |> 
  summarize(
    a = median(a, na.rm = TRUE),
    b = median(b, na.rm = TRUE),
    c = median(c, na.rm = TRUE),
    n = n())

# A tibble: 1 × 4
      a     b      c     n
  <dbl> <dbl>  <dbl> <int>
1 0.174 -1.06 -0.288     5

… to …

df_miss |> 
  summarize(
    across(a:c, \(x) median(x, na.rm = TRUE)),
    n = n())

# A tibble: 1 × 4
      a     b      c     n
  <dbl> <dbl>  <dbl> <int>
1 0.174 -1.06 -0.288     5

`across()`: calling multiple functions with `.fns`

We might also be interested in how many missing values were removed. We can do that, again using across(), by passing a named list of functions to the .fns argument:

df_miss |> 
  summarize(
    across(a:c, list(
      med = \(x) median(x, na.rm = TRUE),
      n_miss = \(x) sum(is.na(x))
    )),
    n = n()
  )

# A tibble: 1 × 7
  a_med a_n_miss b_med b_n_miss  c_med c_n_miss     n
  <dbl>    <int> <dbl>    <int>  <dbl>    <int> <int>
1 0.174        1 -1.06        2 -0.288        0     5

Columns are named using “glue”: {.col}_{.fn}
- .col is name of original column and .fn is name of function.

`across()`: calling multiple functions with `.fns`

Column name examples: consider calculating medians/means for all columns

str(df)

tibble [10 × 3] (S3: tbl_df/tbl/data.frame)
 $ a: num [1:10] 1.393 -0.229 0.171 -0.852 0.349 ...
 $ c: num [1:10] -0.746 -1.156 0.899 -0.851 -0.635 ...
 $ b: num [1:10] -0.559 0.509 0.462 -0.714 -1.482 ...

df |> summarize(across(a:b, list(med = median, mn = mean)))

# A tibble: 1 × 6
  a_med  a_mn  c_med   c_mn   b_med   b_mn
  <dbl> <dbl>  <dbl>  <dbl>   <dbl>  <dbl>
1 0.212 0.186 -0.798 -0.554 -0.0483 0.0952

df |> summarize(across(a:b, list(median, mean)))

# A tibble: 1 × 6
    a_1   a_2    c_1    c_2     b_1    b_2
  <dbl> <dbl>  <dbl>  <dbl>   <dbl>  <dbl>
1 0.212 0.186 -0.798 -0.554 -0.0483 0.0952

`across()`: output column names with `.names`

Specifying the .names argument allows for custom output names:

df_miss |> 
  summarize(
    across(
      a:c,
      list(
        med = \(x) median(x, na.rm = TRUE),
        n_miss = \(x) sum(is.na(x))
      ),
      .names = "{.fn}_for_{.col}"
    ),
    n = n(),
  )

# A tibble: 1 × 7
  med_for_a n_miss_for_a med_for_b n_miss_for_b med_for_c n_miss_for_c     n
      <dbl>        <int>     <dbl>        <int>     <dbl>        <int> <int>
1     0.174            1     -1.06            2    -0.288            0     5

`across()`: output column names with `.names`

Specifying .names is especially important when using across() with mutate(), particularly with a single function: by default, across() gives its outputs the same names as the inputs and thus replaces the input columns.
- We saw this behavior previously when using across() inside summarize()
Example: coalesce(x, y) replaces each NA in x with the value y, so the call below overwrites columns a, b, and c:

df_miss |> 
  mutate(
    across(a:c, \(x) coalesce(x, 0))
  )

# A tibble: 5 × 3
       a      b       c
   <dbl>  <dbl>   <dbl>
1  0      0      0.495 
2 -0.829  0     -0.531 
3 -0.221 -0.430  0.0803
4  0.915 -1.40  -0.288 
5  0.570 -1.06  -0.381

`across()`: output column names with `.names`

To create new columns, use .names to give output new names:

df_miss |> 
  mutate(
    across(a:c, \(x) coalesce(x, 0), .names = "{.col}_na_zero")
  )

# A tibble: 5 × 6
       a      b       c a_na_zero b_na_zero c_na_zero
   <dbl>  <dbl>   <dbl>     <dbl>     <dbl>     <dbl>
1 NA     NA      0.495      0         0        0.495 
2 -0.829 NA     -0.531     -0.829     0       -0.531 
3 -0.221 -0.430  0.0803    -0.221    -0.430    0.0803
4  0.915 -1.40  -0.288      0.915    -1.40    -0.288 
5  0.570 -1.06  -0.381      0.570    -1.06    -0.381

Examples: using other data sets

Number of unique values in each column of palmerpenguins::penguins:

penguins |> 
  summarize(across(everything(), \(x) length(unique(x))))

# A tibble: 1 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
    <int>  <int>          <int>         <int>             <int>       <int>
1       3      3            165            81                56          95
# ℹ 2 more variables: sex <int>, year <int>

The mean of every column in mtcars:

mtcars |>
  summarize(across(everything(), mean))

       mpg    cyl     disp       hp     drat      wt     qsec     vs      am
1 20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625
    gear   carb
1 3.6875 2.8125

Examples: using other data sets

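Group diamonds by cut, clarity, and color, then count the number of observations and compute the mean of each numeric column. (Because num = n() is created before across(), where(is.numeric) also picks up num and takes its mean, which is why num appears as <dbl> in the output.)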

diamonds |>
  group_by(cut, clarity, color) |>
  summarize(num = n(), across(where(is.numeric), mean))

`summarise()` has grouped output by 'cut', 'clarity'. You can override using
the `.groups` argument.

# A tibble: 276 × 11
# Groups:   cut, clarity [40]
   cut   clarity color   num carat depth table price     x     y     z
   <ord> <ord>   <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Fair  I1      D         4 1.88   65.6  56.8 7383   7.52  7.42  4.90
 2 Fair  I1      E         9 0.969  65.6  58.1 2095.  6.17  6.06  4.01
 3 Fair  I1      F        35 1.02   65.7  58.4 2544.  6.14  6.04  4.00
 4 Fair  I1      G        53 1.23   65.3  57.7 3187.  6.52  6.43  4.23
 5 Fair  I1      H        52 1.50   65.8  58.4 4213.  6.96  6.86  4.55
 6 Fair  I1      I        34 1.32   65.7  58.4 3501   6.76  6.65  4.41
 7 Fair  I1      J        23 1.99   66.5  57.9 5795.  7.55  7.46  4.99
 8 Fair  SI2     D        56 1.02   64.7  58.6 4355.  6.24  6.17  4.01
 9 Fair  SI2     E        78 1.02   63.4  59.5 4172.  6.28  6.22  3.96
10 Fair  SI2     F        89 1.08   63.8  59.5 4520.  6.36  6.30  4.04
# ℹ 266 more rows

Examples: in a function

Example: expand all date columns into year / month / day columns.

expand_dates <- function(df) {
  df |> 
    mutate(
      across(where(is.Date), list(year = year, month = month, day = mday))
    )
}

tibble(name = c("Ant", "Bug"), date = ymd(c("2009-08-03", "2010-01-16"))) |> 
  expand_dates()

# A tibble: 2 × 5
  name  date       date_year date_month date_day
  <chr> <date>         <dbl>      <dbl>    <int>
1 Ant   2009-08-03      2009          8        3
2 Bug   2010-01-16      2010          1       16

Filtering

across() is great with summarize() and mutate(), but not so much with filter() because there we usually combine conditions with & / |.
dplyr provides two variants: if_any() and if_all() to help combine logicals across columns

# same as df_miss |> filter(is.na(a) | is.na(b) | is.na(c))
df_miss |> filter(if_any(a:c, is.na))

# A tibble: 2 × 3
       a     b      c
   <dbl> <dbl>  <dbl>
1 NA        NA  0.495
2 -0.829    NA -0.531

# same as df_miss |> filter(is.na(a) & is.na(b) & is.na(c))
df_miss |> filter(if_all(a:c, is.na))

# A tibble: 0 × 3
# ℹ 3 variables: a <dbl>, b <dbl>, c <dbl>
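Since the df_miss example has no row where all three columns are missing, a tiny made-up tibble makes the difference between if_any() and if_all() easier to see. The tibble toy and its values are assumptions for illustration.

```r
library(dplyr)

toy <- tibble(
  a = c(NA, 1, NA),
  b = c(NA, 2, 3),
  c = c(NA, 4, 5)
)

any_na <- toy |> filter(if_any(a:c, is.na))  # at least one NA in the row
all_na <- toy |> filter(if_all(a:c, is.na))  # every column is NA in the row

# The hand-written condition that if_any() stands for
manual_any <- toy |> filter(is.na(a) | is.na(b) | is.na(c))

print(nrow(any_na))
print(nrow(all_na))
```

Rows 1 and 3 have at least one NA, while only row 1 is missing in every column.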

`across()` vs `pivot_longer()`

Suppose df_paired contains both values and weights, and we want to compute a weighted mean for each group.

df_paired

# A tibble: 3 × 6
  Ant_score Ant_wts Bug_score Bug_wts Cat_score Cat_wts
      <int>   <dbl>     <int>   <dbl>     <int>   <dbl>
1        74    0.25        61    0.25        84    0.25
2        66    0.25        78    0.25        91    0.25
3        84    0.5         70    0.5         77    0.5

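There is no straightforward way to do this with across(), because each score column must be paired with its own weight column, but it is easy with pivot_longer()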

`across()` vs `pivot_longer()`

( df_long <- df_paired |> 
  pivot_longer(
    cols = everything(), 
    names_to = c("group", ".value"), 
    names_sep = "_"
  ) )

# A tibble: 9 × 3
  group score   wts
  <chr> <int> <dbl>
1 Ant      74  0.25
2 Bug      61  0.25
3 Cat      84  0.25
4 Ant      66  0.25
5 Bug      78  0.25
6 Cat      91  0.25
7 Ant      84  0.5 
8 Bug      70  0.5 
9 Cat      77  0.5

df_long |> 
  group_by(group) |> 
  summarize(wm = weighted.mean(score, wts))

# A tibble: 3 × 2
  group    wm
  <chr> <dbl>
1 Ant    77  
2 Bug    69.8
3 Cat    82.2
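The weighted means above can be checked by hand. The sketch below recomputes the Ant row in base R using the Ant_score and Ant_wts values printed in df_paired; the variable names are chosen for illustration.

```r
score <- c(74, 66, 84)       # Ant_score column of df_paired
wts   <- c(0.25, 0.25, 0.5)  # Ant_wts column of df_paired

# Weighted mean by its definition: sum of (value * weight) / sum of weights
by_hand  <- sum(score * wts) / sum(wts)
built_in <- weighted.mean(score, wts)

print(by_hand)
print(built_in)
```

Both give 77, matching the Ant row of the summarize() output.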
