2.2 R code execution

STA141A: Fundamentals of Statistical Data Science

Akira Horiguchi

Functions

General concepts

  • Functions are modules of code that accomplish a specific task.
  • Functions take an input (a value, vector, data frame, etc.), process it, and return an output.
  • Common built-in R functions include sum() and mean(), which take a vector as input and return a single number.

Components of a function

  • Function name: The function is stored in the R environment as an object with this name.
  • Argument(s): When calling a function, you pass a value or values to the argument(s).
    • …can be required or optional.
    • …can have default values.
  • Function Body: The sequence of commands that are executed when the function is called.
  • Return Value: The output of the function.
square <- function(x) {
  y <- x^2
  return(y)
}
  1. The function name is square.
  2. The function has only one argument; here it is called x.
  3. The function body consists of the lines of code between the curly braces { and }.
  4. The return value is y.
square(3)
[1] 9

Passing arguments

When calling a function, you can specify the arguments by:

  • position
mean(1:10, 0.2, TRUE)
[1] 5.5
  • complete name
mean(x = 1:10, trim = 0.2, na.rm = TRUE)
[1] 5.5
  • partial name (does not work when the abbreviation is ambiguous)
mean(x = 1:10, n = TRUE, t = 0.2)
[1] 5.5
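
To illustrate what "ambiguous" means here, the following is a minimal sketch using a made-up function, describe_value(), whose two argument names both begin with "v". The function and its argument names are hypothetical and exist only for this illustration.

```r
# Hypothetical function: both argument names start with "v"
describe_value <- function(value, verbose = FALSE) {
  if (verbose) {
    print(paste("the value is", value))
  }
  value
}

# "val" matches only `value`, so partial matching works
ok <- describe_value(val = 3)

# "v" matches both `value` and `verbose`, so R stops with an error;
# tryCatch() captures the error message instead of halting the script
msg <- tryCatch(describe_value(v = 3), error = function(e) conditionMessage(e))
print(msg)
```

In practice, writing out the complete argument name avoids this problem and makes the call easier to read.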

Some useful statistical functions

Let x and y be numeric vectors of the same length. We can calculate:

  • The mean of x by mean(x);
  • The variance of x by var(x);
  • The standard deviation of x by sd(x);
  • The covariance of x and y using cov(x, y);
  • The correlation of x and y using cor(x, y).
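
As a quick check of these functions, here is a small sketch on made-up vectors chosen so that the results are easy to verify by hand. The data are an assumption for illustration only.

```r
# Small illustrative vectors (made-up data); y is exactly 2 * x
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)

mean(x)    # 3
var(x)     # 2.5 (sample variance, divides by n - 1)
sd(x)      # square root of var(x)
cov(x, y)  # 5
cor(x, y)  # 1, since y is an increasing linear function of x

# The correlation is the covariance rescaled by both standard deviations
all.equal(cor(x, y), cov(x, y) / (sd(x) * sd(y)))
```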

Customized functions – why are they useful?

  • Make code easier to understand due to an evocative name.
  • Useful to avoid code repetitions.
  • Help reduce the chance of making mistakes when you copy and paste.
    • (e.g., updating a variable name in one place, but not in another).

Customized functions – Writing your own function

How to write your own function:

FunctionName <- function(arg1, arg2, ...) {
  # what the function does with the arguments, and the output
}

Example

square_with_offset <- function(x, offset=0) {
  y <- x^2
  return(y + offset)
}
square_with_offset(3)  # default value of offset is 0
[1] 9
square_with_offset(3, -6)
[1] 3

Repetitive execution: Loops

for loop and while loop

Repetitive execution: for loop

We often want to apply the same operation to every element of a vector or list.

Example: square every element in the vector y

y <- sample(1:10, 4)
y
[1] 2 3 6 5
for (i in 1:4) {
  print(y[i]^2)
}
[1] 4
[1] 9
[1] 36
[1] 25

Template

for (variable in vector) {
  # commands to be repeated
}
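
The template can be filled in without hard-coding the length of the vector. The sketch below uses seq_along() for the index and stores the results instead of printing them. The vector values are made up for illustration.

```r
# Made-up example vector
y <- c(2, 3, 6, 5)

# seq_along(y) gives 1, 2, ..., length(y), so the loop adapts to y's length
squares <- numeric(length(y))  # preallocate storage for the results
for (i in seq_along(y)) {
  squares[i] <- y[i]^2
}
squares

# For an empty vector, 1:length(y) would be c(1, 0) and the loop would
# run twice; seq_along() gives an empty sequence, so the loop is skipped
empty <- numeric(0)
n_iterations <- 0
for (i in seq_along(empty)) {
  n_iterations <- n_iterations + 1
}
n_iterations
```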

Example for loop: cumulative sum.

y
[1] 2 3 6 5
z <- 0
for (i in 1:4) {
  z <- z + y[i]  # uses previous iteration's value of z
  print(paste("the cumulative sum of the vector y at index", i, "is:", z))
}
[1] "the cumulative sum of the vector y at index 1 is: 2"
[1] "the cumulative sum of the vector y at index 2 is: 5"
[1] "the cumulative sum of the vector y at index 3 is: 11"
[1] "the cumulative sum of the vector y at index 4 is: 16"
z
[1] 16
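
For this particular task, base R already provides a vectorized function, cumsum(). A minimal sketch, re-creating the values of y shown above so that the block runs on its own:

```r
y <- c(2, 3, 6, 5)  # the same values as in the example above

running_total <- cumsum(y)  # all cumulative sums at once
running_total

# The last cumulative sum equals the total
running_total[length(y)] == sum(y)
```

The loop remains useful as a pattern for updates that have no ready-made vectorized equivalent.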

Example for loop: compute \(\sum_{n=1}^4 n!\)

z <- 0
for (i in 1:4) {
  y <- factorial(i)  # the factorial() function is built into R
  print(paste0("the value of  ", i, "!  is: ", y))
  z <- z + y  # uses previous iteration's value of z
}
[1] "the value of  1!  is: 1"
[1] "the value of  2!  is: 2"
[1] "the value of  3!  is: 6"
[1] "the value of  4!  is: 24"
z  # 1! + 2! + 3! + 4!
[1] 33

Q: what happens if we omit z <- 0 at line 1?

Repetitive execution: while loop

Useful when we do not know in advance how many times the commands should be executed.

while (condition is true) {
  # commands to be repeated 
}

Example while loop (random walk)

x <- 0
while (-2 <= x && x <= 2) {
  curr_step <- sample(c(-1, 1), size=1)
  print(paste0("moving x=", x, " by step of size ", curr_step))
  x <- x + curr_step  # uses previous iteration's value of x
}
[1] "moving x=0 by step of size 1"
[1] "moving x=1 by step of size -1"
[1] "moving x=0 by step of size 1"
[1] "moving x=1 by step of size -1"
[1] "moving x=0 by step of size 1"
[1] "moving x=1 by step of size 1"
[1] "moving x=2 by step of size -1"
[1] "moving x=1 by step of size 1"
[1] "moving x=2 by step of size -1"
[1] "moving x=1 by step of size -1"
[1] "moving x=0 by step of size 1"
[1] "moving x=1 by step of size -1"
[1] "moving x=0 by step of size -1"
[1] "moving x=-1 by step of size 1"
[1] "moving x=0 by step of size -1"
[1] "moving x=-1 by step of size -1"
[1] "moving x=-2 by step of size -1"

Example while loop (random walk): another set of random steps

x <- 0
while (-2 <= x && x <= 2) {
  curr_step <- sample(c(-1, 1), size=1)
  print(paste0("moving x=", x, " by step of size ", curr_step))
  x <- x + curr_step  # uses previous iteration's value of x
}
[1] "moving x=0 by step of size -1"
[1] "moving x=-1 by step of size -1"
[1] "moving x=-2 by step of size -1"

Example while loop (random walk): fix the set of “random” steps

set.seed(42)  # for reproducibility; fixes all subsequent "random" results
x <- 0
while (-2 <= x && x <= 2) {
  curr_step <- sample(c(-1, 1), size=1)
  print(paste0("moving x=", x, " by step of size ", curr_step))
  x <- x + curr_step  # uses previous iteration's value of x
}
[1] "moving x=0 by step of size -1"
[1] "moving x=-1 by step of size -1"
[1] "moving x=-2 by step of size -1"

It is possible that the body of a while() loop will never be executed.

x <- 4  # this does not satisfy the condition of the following while loop
while (-2 <= x && x <= 2) {
  curr_step <- sample(c(-1, 1), size=1)
  print(paste0("moving x=", x, " by step of size ", curr_step))
  x <- x + curr_step  # uses previous iteration's value of x
}
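
To make the number of iterations visible, the random walk can be wrapped in a function that counts its steps. This is only a sketch: count_walk_steps() is a hypothetical helper, and the seed is arbitrary.

```r
# Hypothetical helper: runs the walk from `start` and counts the steps
count_walk_steps <- function(start) {
  x <- start
  n_steps <- 0
  while (-2 <= x && x <= 2) {
    x <- x + sample(c(-1, 1), size = 1)
    n_steps <- n_steps + 1
  }
  list(final_x = x, n_steps = n_steps)
}

set.seed(1)  # arbitrary seed, only for reproducibility
walk_from_0 <- count_walk_steps(0)  # stops once x reaches -3 or 3
walk_from_4 <- count_walk_steps(4)  # condition is FALSE at the start
walk_from_4$n_steps                 # 0: the body never ran
```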

Comments on loops

Performs commands sequentially

  • Pro: Helpful if commands depend on the values from the previous iteration’s commands
  • Con: Is a bit clunky if we want to store results in a vector/list

Often we will want to perform the same set of (complicated) commands on different chunks of data.

  • Can use for loop, but can be difficult to understand because it is so flexible
  • Can instead use apply() family of functions

Repetitive execution: apply() family

Motivation

We often want to apply the same function to many objects and then store the outputs for later use.

Example: calculate group means of grades

grades <- list(group1 = sample(seq(0, 10, 0.1), 10), 
               group2 = sample(seq(5, 10, 0.1), 10), 
               group3 = sample(seq(10, 15, 0.1), 5))
grades
$group1
 [1] 2.4 7.3 1.7 4.8 4.6 2.3 7.0 8.8 3.6 1.9

$group2
 [1] 7.5 9.9 9.6 5.2 9.0 7.4 7.6 8.5 8.6 8.0

$group3
[1] 14.4 10.4 11.9 13.3 12.7
# One approach using a for loop
grades_group_means <- numeric(length = length(grades))  # preallocate a numeric vector of length 3 (filled with zeros)
for (i in 1:3) { 
  grades_group_means[i] <- mean(grades[[i]]) 
}

We can simplify the above code using the apply family of functions.

apply family of functions

For now, we introduce just two functions in this family, lapply() and sapply(). We start with lapply():

  • lapply(X, FUN, ...): returns a list containing the result of the function FUN applied to all the elements of the list/vector X.

... indicates optional additional arguments to be passed to FUN.

  • Don’t worry about this until we see an example of this in a later slide.

Example: calculate group means of grades

# for loop
# output will be printed out but not stored anywhere
for (i in 1:3) { 
  print(mean(grades[[i]])) 
}
[1] 4.44
[1] 8.13
[1] 12.54
# for-each loop
# output will be printed out but not stored anywhere
for (x in grades){ 
  print(mean(x)) 
}
[1] 4.44
[1] 8.13
[1] 12.54
lapply(grades, mean)  # output is a list
$group1
[1] 4.44

$group2
[1] 8.13

$group3
[1] 12.54
sapply(grades, mean)  # output is a vector
group1 group2 group3 
  4.44   8.13  12.54 

Very clean!

sapply()

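  • sapply(X, FUN, ...): essentially does lapply(X, FUN, ...) first and then tries to simplify the output into a vector (or into a matrix, when every result has the same length greater than one). If no simplification is possible, it returns the list.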
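
The sketch below shows how the shape of the sapply() output depends on what FUN returns. The list scores is made-up data. vapply() is a stricter base R relative that is included here only for comparison.

```r
# Made-up data: three groups of scores
scores <- list(a = c(1, 5, 3), b = c(10, 20), c = c(7, 7, 8, 9))

# One number per group: sapply() simplifies the list to a named vector
group_means <- sapply(scores, mean)

# Two numbers per group (min and max): sapply() simplifies to a matrix
# with one column per group
group_ranges <- sapply(scores, range)
group_ranges

# vapply() does the same but checks each result against a template
group_means_checked <- vapply(scores, mean, FUN.VALUE = numeric(1))
```

Because vapply() fails loudly when a result does not match the template, it can be the safer choice inside functions.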

Example: anonymous functions

Calculate group means using…

  • a named function
mean_v2 <- function(x) sum(x)/length(x)
sapply(grades, mean_v2)
group1 group2 group3 
  4.44   8.13  12.54 
  • an anonymous function (useful for single-use execution)
sapply(grades, function(x) sum(x)/length(x))
group1 group2 group3 
  4.44   8.13  12.54 

Instead of the keyword function, we can use the shorthand \ (available since R 4.1.0). Example:

sapply(grades, \(x) sum(x)/length(x))  # same as above 
group1 group2 group3 
  4.44   8.13  12.54 

Example: functions as objects

R supports functional programming: functions are objects that can be stored in lists and passed to other functions!

funlist <- list(sum, mean, var, sd)
dat <- runif(10)

# Use a for loop to apply every function in funlist to dat
for (f in funlist) {
  print(f(dat))  # prints values
}
[1] 4.108169
[1] 0.4108169
[1] 0.116695
[1] 0.3416066
# Use sapply to apply every function in funlist to dat
sapply(funlist, \(f) f(dat))  # also stores values in a vector
[1] 4.1081692 0.4108169 0.1166950 0.3416066
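
If the list of functions is given names, sapply() carries those names over to the output, which makes the results easier to read. A minimal sketch with made-up data and labels of our own choosing:

```r
# Named list of functions; the names become labels in the output
funlist_named <- list(total = sum, average = mean, spread = sd)
dat <- c(1, 2, 3, 4, 5)  # made-up data so the results are easy to check

results <- sapply(funlist_named, function(f) f(dat))
results
results["average"]  # pick out one result by name
```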

General comment about examples

The R documentation for a function usually includes many examples of how to use it.

  • Try running ?lapply to pull up the documentation for the function lapply().

Differences between for() and apply family of functions

Beyond aesthetic differences…

  • for() executes commands sequentially.
  • apply family: each call is independent of the others, so the calls can be executed in parallel (but base R's lapply() does not do so by default).
myfun <- function(x) {
  # commands that take a long time to execute
}

# Takes how much time?
for (x in grades) {
  myfun(x)
}

# Takes how much time if the three calls are spread over 3 cores (e.g., with a parallel version of lapply)?
lapply(grades, myfun)
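
One way to try this out is mclapply() from the parallel package that ships with R. The sketch below is an illustration under assumptions: slow_square() is a hypothetical stand-in for a slow function, three cores are requested, and the actual speed-up depends on the machine.

```r
library(parallel)  # ships with R

slow_square <- function(x) {  # hypothetical stand-in for a slow function
  Sys.sleep(0.2)              # pretend the computation takes 0.2 seconds
  x^2
}
inputs <- list(1, 2, 3)

# Sequential: roughly 3 * 0.2 seconds
time_seq <- system.time(res_seq <- lapply(inputs, slow_square))

# Parallel via forking; mclapply() supports mc.cores > 1 only on
# Unix-like systems, so fall back to 1 core on Windows
n_cores <- if (.Platform$OS.type == "windows") 1L else 3L
time_par <- system.time(res_par <- mclapply(inputs, slow_square, mc.cores = n_cores))

identical(res_seq, res_par)  # same results either way
```

Comparing time_seq["elapsed"] with time_par["elapsed"] shows whether the parallel version was faster on a given machine.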

apply()

The apply() function enables row-wise or column-wise repetitive operations on a matrix, array, or data frame. Examples:

  • Row Means: apply(my_matrix, 1, mean) calculates the average for every row.
  • Column Sums: apply(my_matrix, 2, sum) calculates the total for every column.
  • Custom Functions: apply(my_matrix, 1, function(x) max(x) - min(x)) calculates the range for each row.

The basic syntax is apply(X, MARGIN, FUN, ...)

  • X: The input data object (matrix, array, or data frame).
  • MARGIN: Specifies whether the function is applied to rows or columns:
    • 1: Applies the function to rows.
    • 2: Applies the function to columns.
    • c(1, 2): Applies the function to every individual cell.
  • FUN: The function to be applied (e.g., mean, sum, or a custom function).
  • ...: Optional additional arguments to be passed to FUN

Example of additional arguments: apply(my_matrix, 1, mean, trim = 0.1, na.rm=TRUE) calculates a trimmed average for every row.
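
The calls above can be tried on a small matrix whose results are easy to check by hand. The matrix m below is made up for illustration.

```r
# Made-up 2 x 3 matrix: rows are (1, 3, 5) and (2, 4, 6)
m <- matrix(1:6, nrow = 2)

row_means  <- apply(m, 1, mean)                         # 3 4
col_sums   <- apply(m, 2, sum)                          # 3 7 11
row_ranges <- apply(m, 1, function(x) max(x) - min(x))  # 4 4

# Extra arguments after FUN are passed on to FUN: here na.rm = TRUE
m_with_na <- m
m_with_na[1, 1] <- NA
apply(m_with_na, 1, mean)                # first row mean is NA
apply(m_with_na, 1, mean, na.rm = TRUE)  # 4 4
```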

apply(): Notice the shape of the output!

my_matrix <- rbind(-2, -5, -c(4:1, 2:5))
my_matrix
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]   -2   -2   -2   -2   -2   -2   -2   -2
[2,]   -5   -5   -5   -5   -5   -5   -5   -5
[3,]   -4   -3   -2   -1   -2   -3   -4   -5

Suppose we want to compute the row-wise or the column-wise mean.

apply(my_matrix, 1, mean)  # row-wise mean
[1] -2 -5 -3
apply(my_matrix, 2, mean)  # column-wise mean
[1] -3.666667 -3.333333 -3.000000 -2.666667 -3.000000 -3.333333 -3.666667
[8] -4.000000

Both outputs are vectors.

apply(): Notice the shape of the output!

my_matrix
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]   -2   -2   -2   -2   -2   -2   -2   -2
[2,]   -5   -5   -5   -5   -5   -5   -5   -5
[3,]   -4   -3   -2   -1   -2   -3   -4   -5

Suppose we want to compute the row-wise or the column-wise mean, but also count the number of unique elements in that row or column.

MeanAndNumUnique <- function(x) c(mean(x), length(unique(x)))  # computes mean and number of unique elements in argument
apply(my_matrix, 1, MeanAndNumUnique)  # row-wise mean and number of unique elements
     [,1] [,2] [,3]
[1,]   -2   -5   -3
[2,]    1    1    5
apply(my_matrix, 2, MeanAndNumUnique)  # column-wise mean and number of unique elements
          [,1]      [,2] [,3]      [,4] [,5]      [,6]      [,7] [,8]
[1,] -3.666667 -3.333333   -3 -2.666667   -3 -3.333333 -3.666667   -4
[2,]  3.000000  3.000000    2  3.000000    2  3.000000  3.000000    2

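Both outputs are matrices, but do the shapes make sense? Each call's result becomes one column of the output, so the row-wise result is 2 x 3 rather than 3 x 2.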
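
If one row of output per row of input is preferred, the row-wise result can be transposed with t(). A minimal sketch that re-creates the matrix and function from above so it runs on its own. The column labels are our own choice.

```r
my_matrix <- rbind(-2, -5, -c(4:1, 2:5))  # same matrix as above
MeanAndNumUnique <- function(x) c(mean(x), length(unique(x)))

by_row <- apply(my_matrix, 1, MeanAndNumUnique)
dim(by_row)  # 2 3: one column per row of my_matrix

# Transpose so that each row of the result matches a row of my_matrix
by_row_t <- t(by_row)
colnames(by_row_t) <- c("mean", "n_unique")  # labels are our own choice
by_row_t
```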

Repetitive execution: reduce

purrr::reduce()

Repeatedly applies a binary function to the elements of a vector or list.

  • (a binary function is a function with two arguments)
  • The base R version is Reduce(), but the version from the purrr package has nicer functionality.
  • reduce(<list or vector>, <binary function>)
library(purrr)
xvec <- c(1,3,5,4,2)
accumulate(xvec, `*`)  # collects intermediate steps of reduce()
[1]   1   3  15  60 120
reduce(xvec, `*`)  # typically just want the final result 
[1] 120
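
For comparison, the same computation with base R's Reduce() needs no extra package. A minimal sketch:

```r
xvec <- c(1, 3, 5, 4, 2)

# Base R: note that the function comes first and the vector second
final_product <- Reduce(`*`, xvec)                     # 120
running_products <- Reduce(`*`, xvec, accumulate = TRUE)  # 1 3 15 60 120

final_product
running_products
```

The argument order is the main practical difference to keep in mind when switching between the two versions.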

purrr::reduce(): example using paste()

paste('eek', 'a', 'bear')
[1] "eek a bear"
paste0('eek', 'a', 'bear')
[1] "eekabear"
paste('eek', 'a', 'bear', sep='...')
[1] "eek...a...bear"

paste() also does element-wise pasting

svec <- c('u', 'b', 'e')
paste(svec, 1:3, sep='-')
[1] "u-1" "b-2" "e-3"

So what if you want to paste together all elements of svec?

paste0(svec)  # not what we wanted
[1] "u" "b" "e"
paste0(svec[1], svec[2], svec[3])  # okay but clunky and does not generalize
[1] "ube"
reduce(svec, paste0)
[1] "ube"
accumulate(svec, paste0)
[1] "u"   "ub"  "ube"
reduce(svec, paste)
[1] "u b e"
accumulate(svec, paste)
[1] "u"     "u b"   "u b e"
reduce(svec, paste, sep='-')
[1] "u-b-e"
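
For this specific task, paste() also has a built-in collapse argument that joins all elements of a vector into one string. A short sketch for comparison:

```r
svec <- c('u', 'b', 'e')

joined_plain <- paste(svec, collapse = '')   # "ube"
joined_dash  <- paste(svec, collapse = '-')  # "u-b-e"

# sep joins the arguments element-wise; collapse then joins the results
joined_both <- paste(svec, 1:3, sep = '-', collapse = ' ')  # "u-1 b-2 e-3"
```

reduce() remains the more general tool, since it works with any binary function, not just paste().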

purrr::reduce(): example using set intersection

lst <- lapply(c(5, 3, 2), function(b) seq(0, 30, by=b))
lst
[[1]]
[1]  0  5 10 15 20 25 30

[[2]]
 [1]  0  3  6  9 12 15 18 21 24 27 30

[[3]]
 [1]  0  2  4  6  8 10 12 14 16 18 20 22 24 26 28 30
accumulate(lst, intersect)
[[1]]
[1]  0  5 10 15 20 25 30

[[2]]
[1]  0 15 30

[[3]]
[1]  0 30

The final element of accumulate(lst, intersect) is the output of reduce(lst, intersect).

  • Same as intersect(intersect(lst[[1]], lst[[2]]), lst[[3]])
  • intersect() takes exactly two sets as arguments, so it cannot be called on all three vectors at once.
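
What reduce() does here can be written out as a loop: start with the first set and combine the running result with each remaining set in turn. A minimal sketch, re-creating lst so the block runs on its own:

```r
lst <- lapply(c(5, 3, 2), function(b) seq(0, 30, by = b))

# Start with the first set, then intersect with each remaining set in turn
common <- lst[[1]]
for (s in lst[-1]) {
  common <- intersect(common, s)
}
common  # 0 30

# Same result as the nested call
nested <- intersect(intersect(lst[[1]], lst[[2]]), lst[[3]])
identical(common, nested)
```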

purrr::reduce(): example of stacking data frames

MakeDF <- function(a) {
  data.frame(param = a, u = runif(10, min=a))
}
(av <- -1 * rexp(4))
[1] -0.2992672 -0.2799575 -0.2311141 -1.2870888
(df_list <- lapply(av, MakeDF))
[[1]]
        param           u
1  -0.2992672  0.72025839
2  -0.2992672 -0.24867826
3  -0.2992672  0.67361809
4  -0.2992672  0.58069638
5  -0.2992672 -0.07674904
6  -0.2992672  0.03995586
7  -0.2992672  0.36909267
8  -0.2992672  0.57852718
9  -0.2992672  0.97767495
10 -0.2992672  0.68758376

[[2]]
        param            u
1  -0.2799575  0.445123590
2  -0.2799575  0.807609222
3  -0.2799575 -0.037438947
4  -0.2799575  0.067277809
5  -0.2799575  0.780050158
6  -0.2799575  0.607315197
7  -0.2799575  0.027929514
8  -0.2799575 -0.224933705
9  -0.2799575 -0.100150263
10 -0.2799575 -0.002993396

[[3]]
        param           u
1  -0.2311141  0.35908022
2  -0.2311141  0.01192053
3  -0.2311141  0.65449501
4  -0.2311141 -0.22140712
5  -0.2311141  0.23115687
6  -0.2311141  0.40218047
7  -0.2311141 -0.22918060
8  -0.2311141  0.48490677
9  -0.2311141 -0.03671480
10 -0.2311141  0.21089069

[[4]]
       param            u
1  -1.287089  0.189528629
2  -1.287089  0.487288118
3  -1.287089  0.002021567
4  -1.287089 -0.752588395
5  -1.287089 -1.081295393
6  -1.287089 -1.091286429
7  -1.287089 -0.589027304
8  -1.287089  0.239374898
9  -1.287089 -1.286542449
10 -1.287089 -0.810070808
reduce(df_list, rbind)
        param            u
1  -0.2992672  0.720258394
2  -0.2992672 -0.248678257
3  -0.2992672  0.673618095
4  -0.2992672  0.580696383
5  -0.2992672 -0.076749041
6  -0.2992672  0.039955857
7  -0.2992672  0.369092672
8  -0.2992672  0.578527185
9  -0.2992672  0.977674950
10 -0.2992672  0.687583763
11 -0.2799575  0.445123590
12 -0.2799575  0.807609222
13 -0.2799575 -0.037438947
14 -0.2799575  0.067277809
15 -0.2799575  0.780050158
16 -0.2799575  0.607315197
17 -0.2799575  0.027929514
18 -0.2799575 -0.224933705
19 -0.2799575 -0.100150263
20 -0.2799575 -0.002993396
21 -0.2311141  0.359080216
22 -0.2311141  0.011920531
23 -0.2311141  0.654495006
24 -0.2311141 -0.221407118
25 -0.2311141  0.231156870
26 -0.2311141  0.402180468
27 -0.2311141 -0.229180600
28 -0.2311141  0.484906775
29 -0.2311141 -0.036714798
30 -0.2311141  0.210890690
31 -1.2870888  0.189528629
32 -1.2870888  0.487288118
33 -1.2870888  0.002021567
34 -1.2870888 -0.752588395
35 -1.2870888 -1.081295393
36 -1.2870888 -1.091286429
37 -1.2870888 -0.589027304
38 -1.2870888  0.239374898
39 -1.2870888 -1.286542449
40 -1.2870888 -0.810070808
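
A common base R alternative for stacking a list of data frames is do.call(), which calls rbind() once with all list elements as its arguments. The sketch below uses small made-up data frames so that the result is easy to inspect.

```r
# Small made-up data frames with the same columns
df_list_small <- list(
  data.frame(param = 1, u = c(0.1, 0.2)),
  data.frame(param = 2, u = c(0.3, 0.4)),
  data.frame(param = 3, u = c(0.5, 0.6))
)

# do.call() calls rbind(df1, df2, df3) with the list elements as arguments
stacked <- do.call(rbind, df_list_small)
stacked
nrow(stacked)  # 6
```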

Conditional execution

if and else

Only one condition: if statement

if (condition) {
  # statement 1
} else {
  # statement 2
}

Example (one coin toss)

x <- sample(0:1, 1)
x
[1] 1
if (x==0) {
  print('x is heads')
} else {
  print('x is tails')
}
[1] "x is tails"

if and else: vectorized version

ifelse(condition1, statement1, statement2)

Example (five coin tosses)

y <- sample(0:1, 5, replace=TRUE)
y
[1] 1 1 0 1 1
z <- ifelse(y==0, 'y is heads', 'y is tails')
z
[1] "y is tails" "y is tails" "y is heads" "y is tails" "y is tails"

Example (five dice rolls)

d <- sample(1:6, 5, replace=TRUE)
d
[1] 2 2 1 2 5
ifelse((d%%2)==0, 'roll is even', 'roll is odd')
[1] "roll is even" "roll is even" "roll is odd"  "roll is even" "roll is odd" 
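
The difference between the two forms is worth spelling out: ifelse() returns one result per element, whereas if () expects a single TRUE or FALSE. When a single decision should depend on a whole vector, all() or any() can collapse the element-wise checks first. A minimal sketch using the dice rolls shown above:

```r
d <- c(2, 2, 1, 2, 5)  # the rolls from the example above

# ifelse() checks every element and returns a vector of the same length
labels <- ifelse(d %% 2 == 0, 'roll is even', 'roll is odd')

# if () needs a single TRUE or FALSE; all() and any() collapse a
# logical vector into one value
if (all(d %% 2 == 0)) {
  summary_msg <- 'every roll is even'
} else if (any(d %% 2 == 0)) {
  summary_msg <- 'some rolls are even'
} else {
  summary_msg <- 'no roll is even'
}
summary_msg
```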

else if

More than one condition:

if (condition_1) {
  # statement 1
} else if (condition_2) {
  # statement 2
  # ... (further else if branches as needed)
} else if (condition_n) {
  # statement n
} else {
  # else statement
}

Example

x <- sample(-4:4, size=1)
x
[1] 3
if (x < 0) {
  print('squaring x to make it positive')
  x <- x^2
} else if (x > 0) {
  print('x is already positive')
} else {
  print('adding 1 to make it positive')
  x <- x+1
}
[1] "x is already positive"
x
[1] 3
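
Since x is random in the example, only one branch is seen per run. Wrapping the same logic in a function makes it possible to try each branch deliberately. This is a sketch, and make_positive() is a hypothetical name chosen for illustration.

```r
# Hypothetical wrapper around the example's logic, so each branch can be tried
make_positive <- function(x) {
  if (x < 0) {
    x <- x^2      # negative: squaring gives a positive number
  } else if (x > 0) {
    # already positive: nothing to do
  } else {
    x <- x + 1    # zero: adding 1 gives a positive number
  }
  x
}

make_positive(-3)  # 9
make_positive(3)   # 3
make_positive(0)   # 1
```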