s2: R Basics

STA141A: Fundamentals of Statistical Data Science

Akira Horiguchi

Vectors

Vectors

  • R operates on named data structures, which we simply refer to as data objects.
  • A vector is a single entity consisting of an ordered collection of values.
  • All values in a vector must be of the same data type. Common types of vector:
    • Numeric vector
    • Logical vector
    • Character vector

Numeric Vectors

  • Numeric vectors are an ordered collection of numbers.
  • They can be used in arithmetic expressions, where operations are performed element by element.
  • There are also built-in R functions that take a vector and return summary measures (e.g., the mean or quantiles).
x <- c(4,2,7,8)  # concatenates the numbers 4, 2, 7, and 8
mean(x) 
[1] 5.25
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.00    3.50    5.50    5.25    7.25    8.00 

Logical Vectors

  • R allows manipulation of logical quantities.
  • The elements of a logical vector take the values TRUE or FALSE (or NA for a missing value).
  • Logical vectors are generated by conditions. (Logical vectors are usually not created manually.)
  • For this purpose, we will learn about logical operators.
x <- c(4,2,7,8)  # concatenates the numbers 4, 2, 7, and 8
x < 5
[1]  TRUE  TRUE FALSE FALSE
(x %% 2) == 0  # which elements of x are even
[1]  TRUE  TRUE FALSE  TRUE
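The logical operators mentioned above can be sketched as follows. This is a minimal illustration that reuses the vector x from the slide; the particular conditions are chosen only for demonstration.

```r
x <- c(4, 2, 7, 8)

# Element-wise AND (&), OR (|), and NOT (!)
(x > 3) & (x < 8)   # TRUE FALSE TRUE FALSE
(x < 3) | (x > 7)   # FALSE TRUE FALSE TRUE
!(x < 5)            # FALSE FALSE TRUE TRUE

# Summaries of a logical vector
any(x > 7)          # TRUE: at least one element exceeds 7
all(x > 3)          # FALSE: 2 is not greater than 3
which(x < 5)        # 1 2: positions of the TRUE values
```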

Character Vectors

  • Character strings are entered using matching double quotes (" ") or matching single quotes (' '), e.g., "apple" or 'banana'.
  • Character vectors are a collection of character strings.
  • The paste() function is often used for creating character strings.
z <- c("hello", "world", "!")  # This is a character vector
z
[1] "hello" "world" "!"    
paste(z[1], z[2], z[3])
[1] "hello world !"

Factors

  • A factor is a vector that can contain only pre-defined values, called levels (e.g., only the strings "apple" and "banana"), and is used to store categorical data.
  • You may want to define the levels explicitly, for example to change the order in which they are printed.
x <- c("dog", "dog", "cat", "dog", "cat", "ewe", "bear", "bear", "bear", "bear")
table(x)  # default order is alphabetical order
x
bear  cat  dog  ewe 
   4    2    3    1 
x <- factor(x, levels=c("dog", "ewe", "cat", "bear"))
x  # vector elements remain the same
 [1] dog  dog  cat  dog  cat  ewe  bear bear bear bear
Levels: dog ewe cat bear
table(x)
x
 dog  ewe  cat bear 
   3    1    2    4 

x <- c("dog", "dog", "cat", "dog", "cat", "ewe", "bear", "bear", "bear", "bear")
table(x)  # default order is alphabetical order
x
bear  cat  dog  ewe 
   4    2    3    1 
x <- factor(x, levels=names(sort(table(x))))  # don't need to know this code right now
x  # vector elements remain the same
 [1] dog  dog  cat  dog  cat  ewe  bear bear bear bear
Levels: ewe cat dog bear
table(x)
x
 ewe  cat  dog bear 
   1    2    3    4 

x <- c("dog", "dog", "cat", "dog", "cat", "ewe", "bear", "bear", "bear", "bear")
table(x)  # default order is alphabetical order
x
bear  cat  dog  ewe 
   4    2    3    1 
x <- factor(x, levels=names(sort(table(x), decreasing=TRUE)))  # don't need to know this code right now
x  # vector elements remain the same
 [1] dog  dog  cat  dog  cat  ewe  bear bear bear bear
Levels: bear dog cat ewe
table(x)
x
bear  dog  cat  ewe 
   4    3    2    1 
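To see what a factor stores internally, the following sketch may help. The vectors here are small made-up examples, not the ones from the slides.

```r
x <- c("dog", "cat", "dog", "ewe")
f <- factor(x, levels = c("dog", "ewe", "cat"))

levels(f)       # "dog" "ewe" "cat": the allowed values, in the order given
as.integer(f)   # 1 3 1 2: each element is stored as the position of its level
nlevels(f)      # 3

# A value that is not among the levels becomes NA
g <- factor(c("dog", "yak"), levels = c("dog", "ewe", "cat"))
is.na(g)        # FALSE TRUE
```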

Create a vector

Create “regular” vectors.

1:11
 [1]  1  2  3  4  5  6  7  8  9 10 11
seq(1, 17, by=2)
[1]  1  3  5  7  9 11 13 15 17
rep(-5, times=4)
[1] -5 -5 -5 -5
rep("buffalo", times=8)
[1] "buffalo" "buffalo" "buffalo" "buffalo" "buffalo" "buffalo" "buffalo"
[8] "buffalo"

You can combine c() with rep(thing_to_repeat, num_repetitions):

rep(c("eek", "a", "bear"), times=2)
[1] "eek"  "a"    "bear" "eek"  "a"    "bear"
rep(c(6, 2), times=4)
[1] 6 2 6 2 6 2 6 2

Create a vector: coercion

c() will sometimes coerce different data types into the same type.

str(c(4L, 6L))
 int [1:2] 4 6
str(c(4L, 6))  # coerces the integer 4L into the numeric 4
 num [1:2] 4 6
str(c(3, "haha"))  # coerces the numeric 3 into the character "3"
 chr [1:2] "3" "haha"
str(c(TRUE, 7))  # coerces the logical TRUE into the numeric value 1
 num [1:2] 1 7
str(c(FALSE, 7))  # coerces the logical FALSE into the numeric value 0
 num [1:2] 0 7

Roughly, the coercion order is logical < integer < numeric < character: c() converts every element to the highest type present.
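Coercion also happens in arithmetic and can be requested explicitly with the as.*() functions. The values below are illustrative only.

```r
# Logical values behave as 0/1 in arithmetic
sum(c(TRUE, FALSE, TRUE))        # 2

# Explicit coercion with the as.*() functions
as.numeric(c("3", "4.5"))        # 3.0 4.5
as.character(c(1L, 2L))          # "1" "2"
as.logical(c(0, 1, 2))           # FALSE TRUE TRUE (any nonzero number is TRUE)

# A string that cannot be read as a number becomes NA (with a warning)
suppressWarnings(as.numeric("haha"))  # NA
```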

Create a vector with randomly generated elements

You can generate a vector of random values using, e.g., sample().

animals <- c("ant", "bug", "cat", "dog")
sample(animals, size=3)  # samples without replacement
[1] "cat" "dog" "ant"
sample(animals, size=6)  # results in an error: cannot draw 6 items from 4 without replacement
sample(animals, size=6, replace=TRUE)  # samples with replacement
[1] "ant" "dog" "cat" "ant" "bug" "dog"
runif(5, min=0, max=1)  # generates 5 random numbers between 0 and 1
[1] 0.1480565 0.3443080 0.6948802 0.2596952 0.4794379
rnorm(5)  # generates 5 random numbers from a standard normal distribution
[1]  1.3096479  0.3115038 -1.1938093 -1.3670770 -1.0312860
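Random results change from run to run unless the random number generator is seeded. A minimal sketch, assuming an arbitrary seed value of 1:

```r
animals <- c("ant", "bug", "cat", "dog")

set.seed(1)                       # the seed value 1 is an arbitrary choice
first  <- sample(animals, size = 3)

set.seed(1)                       # resetting the seed replays the same draws
second <- sample(animals, size = 3)

identical(first, second)          # TRUE
```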

Inspecting a vector

  • There are built-in R functions that take a vector and return info about it.
x <- rpois(10000, lambda=2)  # randomly generate 10000 values from a Poisson distribution with parameter lambda=2
length(x)  # returns length of the variable x
[1] 10000
str(x)  # returns the structure of the variable x
 int [1:10000] 1 2 4 2 3 3 4 1 1 1 ...
summary(x)  # returns summary statistics
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   1.000   2.000   2.015   3.000  10.000 
table(x)  # counts occurrences of each value in x
x
   0    1    2    3    4    5    6    7    8    9   10 
1371 2695 2577 1889  939  361  123   32    8    3    2 

Subsetting a vector

Square brackets [ ] are used for indexing (i.e., accessing elements of) a vector, matrix, array, list, or dataframe.

  • Three approaches to access elements:
  1. by integer index
  2. by a logical vector
  3. by name

Subsetting a vector: by integer index

Using a vector of positive integers, the corresponding elements of the vector are selected and concatenated, in that order. A vector of negative integers specifies the values to be excluded rather than included.

x <- runif(7, min=0, max=1)
x
[1] 0.15153443 0.21308165 0.49793895 0.04551215 0.83195540 0.42114951 0.38850633
x[2]
[1] 0.2130816
x[5]
[1] 0.8319554
x[-3]  # returns all but the third element (unlike Python, where x[-3] counts from the end)
[1] 0.15153443 0.21308165 0.04551215 0.83195540 0.42114951 0.38850633
x[9]  # x has only 7 elements
[1] NA
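The index can itself be a vector of integers. The sketch below uses a small fixed vector (not the random one above) so that the results are predictable.

```r
x <- c(10, 20, 30, 40, 50)

x[c(1, 3)]      # 10 30: elements 1 and 3
x[c(3, 1)]      # 30 10: the order of the indices is respected
x[2:4]          # 20 30 40
x[c(1, 1, 5)]   # 10 10 50: indices may repeat
x[-c(1, 2)]     # 30 40 50: drop the first two elements
```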

Subsetting a vector: by a logical vector

Values corresponding to TRUE in the index vector are selected and those corresponding to FALSE are omitted.

x <- runif(7, min=0, max=1)
x
[1] 0.8774323 0.7797666 0.7160577 0.6435915 0.7282974 0.2961213 0.8139053
x[c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE)]
[1] 0.8774323 0.7797666 0.8139053
x[c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)]
[1] 0.7160577
x < 0.5  # a logical vector!
[1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
x[x < 0.5]
[1] 0.2961213

Subsetting a vector: by name

If a vector has a names attribute to identify its components, a sub-vector of the names vector may be used to select the elements.

x <- rnorm(4)
names(x) <- c('a', 'b', 'c', 'd')  # Assigns names to vector's elements
x
         a          b          c          d 
 0.2196825  0.2927151 -1.5388355  0.9532625 
str(x)
 Named num [1:4] 0.22 0.293 -1.539 0.953
 - attr(*, "names")= chr [1:4] "a" "b" "c" "d"
x[c('b', 'a')]  # Selects the elements of x with the name 'b' and 'a'
        b         a 
0.2927151 0.2196825 

Subsetting a vector: by name

To get the names of a vector, use names(x) again, this time without an assignment:

names(x)
[1] "a" "b" "c" "d"
y <- c(5, 8, 2)  # does not have names attribute
names(y)
NULL

Some comments on vectors

  • All elements of a vector must be of the same type (numeric, logical, character).
  • Vectors can be coerced to another type: as.logical(), as.numeric(), as.character().
  • The vector’s type can be tested: use is.character(), is.logical(), etc.

Q: What will be the result of the following code?

y <- c("TRUE", "FALSE", "TRUE")
is.logical(y)

Operations with vectors: Element-wise operations

x <- c(2, 3, 4)
x
[1] 2 3 4
x + 1
[1] 3 4 5
2 * x
[1] 4 6 8
x / 2
[1] 1.0 1.5 2.0
x^2
[1]  4  9 16

Operations with vectors: Vector operations

x <- c(2, 3, 4)
y <- c(5, 7, 6)
x + y
[1]  7 10 10
y - x
[1] 3 4 2
2 * x + 3 * y
[1] 19 27 26

Operations with vectors: functions that return a single number

x <- c(2, 3, 4)
sum(x)
[1] 9
mean(x)
[1] 3
var(x)
[1] 1
sd(x)
[1] 1
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    2.0     2.5     3.0     3.0     3.5     4.0 

Vectors: putting it all together

x <- rnorm(100)

Q: What is the sum of the first three values of the vector x?

Possible answers (each line gives the same result):
x[1] + x[2] + x[3]
sum(c(x[1], x[2], x[3]))
sum(x[1:3])

Vectors: putting it all together

x <- rnorm(100)

Q: What is the mean of the positive values of the vector x?

Possible answer:
mean(x[x > 0])

Vectors: putting it all together

x <- rnorm(100)

Q: How many negative values are there in the vector x?

Possible answers (each line gives the same result):
length(x[x < 0])
sum(x < 0)

all.equal function

  • To test whether values are equal, use ==.
  • Applied to two vectors, == compares them element by element and returns a logical vector.
  • To check whether two vectors are equal as a whole, up to a small numerical tolerance, use the all.equal() function.
  • It is used as all.equal(vector1, vector2, ...).
x <- seq(0.2, 1, by=0.2)
y <- c(0.2, 0.4, 0.6, 0.8, 1.0)
x == y  # element-wise comparison; the third element is FALSE due to floating-point error
[1]  TRUE  TRUE FALSE  TRUE  TRUE
all.equal(x, y)  # tests "near equality" to allow for floating-point error
[1] TRUE
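One detail worth knowing: when the vectors differ, all.equal() returns a character description of the difference rather than FALSE, so it is commonly wrapped in isTRUE() when used as a condition. A short sketch, where the vector w is a made-up example:

```r
x <- seq(0.2, 1, by = 0.2)
y <- c(0.2, 0.4, 0.6, 0.8, 1.0)
w <- c(0.2, 0.4, 0.6, 0.8, 1.5)   # hypothetical vector that really differs

all(x == y)               # FALSE: exact comparison trips on floating-point error
isTRUE(all.equal(x, y))   # TRUE

# When the vectors differ, all.equal() returns a description, not FALSE
result <- all.equal(x, w)
is.character(result)      # TRUE
isTRUE(result)            # FALSE, so this form is safe inside if()
```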

Matrices

Matrices

You can think of a matrix as a collection of vectors of the same type and length.

  • You can also think of a matrix as a table with a certain number of rows and columns, in which values of the same type are stored.
  • “Square” or “rectangular” data structure
  • A matrix can be created by using the matrix() function, where the number of rows and columns must be specified.
matrix(1:6, ncol = 3, nrow = 2)  # Creates matrix with values 1 to 6, with 3 columns, 2 rows
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
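As the output above shows, matrix() fills the matrix column by column by default. The sketch below shows the byrow argument and two other common ways to build a matrix.

```r
matrix(1:6, nrow = 2, ncol = 3)                      # default: fill column by column
M <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)   # fill row by row
M[1, ]        # 1 2 3
dim(M)        # 2 3

# Build a matrix by binding vectors as columns or as rows
cbind(1:2, 3:4)   # columns are 1:2 and 3:4
rbind(1:2, 3:4)   # rows are 1:2 and 3:4
```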

Subsetting a matrix: by index

It is similar to vectors: each element is indexed by row and column.

A = matrix(1:6, ncol = 3, nrow = 2) 
A
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
A[1, 2]
[1] 3
A[1, ]
[1] 1 3 5
A[, 3]
[1] 5 6
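Note that selecting a single row or column returns a plain vector, not a matrix. If a matrix is needed, the drop argument of [ ] can be used, as in this sketch:

```r
A <- matrix(1:6, ncol = 3, nrow = 2)

r <- A[1, ]                    # plain vector: dimensions are dropped
is.matrix(r)                   # FALSE

r_mat <- A[1, , drop = FALSE]  # 1 x 3 matrix
dim(r_mat)                     # 1 3

A[, c(1, 3)]                   # selecting several columns keeps a matrix
```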

Subsetting a matrix: by logical vector

It is similar to vectors: values corresponding to TRUE in the index vector are selected and those corresponding to FALSE are omitted.

A = matrix(1:6, ncol = 3, nrow = 2) 
A
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
A[c(T, F), ]
[1] 1 3 5
A[, c(F, T, F)]
[1] 3 4
A[c(T, F), c(F, T, F)]  # Can use T for TRUE and/or F for FALSE
[1] 3

Subsetting a matrix: by name

A = matrix(1:6, ncol = 3, nrow = 2) 
# Adding row and col names
colnames(A) = c("First", "Second", "Third")
rownames(A) = c("a", "b")
A
  First Second Third
a     1      3     5
b     2      4     6
A['a', ]
 First Second  Third 
     1      3      5 
A[, 'Second']
a b 
3 4 
A['a', 'Second']
[1] 3

Matrix operations: transposition, matrix multiplication

B = matrix(1:4, ncol = 2, nrow = 2)  # 2 by 2 matrix
C = t(B)  # transpose the matrix B
B
     [,1] [,2]
[1,]    1    3
[2,]    2    4
C
     [,1] [,2]
[1,]    1    2
[2,]    3    4
C %*% B  # matrix multiplication
     [,1] [,2]
[1,]    5   11
[2,]   11   25

Matrix operations: element-wise operations

B = matrix(1:4, ncol = 2, nrow = 2)  # 2 by 2 matrix
C = t(B)  # transpose the matrix B
B + C  # element-wise addition
     [,1] [,2]
[1,]    2    5
[2,]    5    8
B * C  # element-wise multiplication
     [,1] [,2]
[1,]    1    6
[2,]    6   16

Matrix operations: column-wise or row-wise

  • Applying column-wise or row-wise functions (apply() function)
A = matrix(1:6, ncol = 3, nrow = 2)
A
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
apply(A, MARGIN = 2, mean)  # Apply the mean() function to all column vectors
[1] 1.5 3.5 5.5
apply(A, MARGIN = 1, mean)  # Apply the mean() function to all row vectors
[1] 3 4
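For means specifically, base R also provides colMeans() and rowMeans(), which give the same results as the apply() calls above. apply() remains useful for functions without such a shortcut; the range function below is just an illustration.

```r
A <- matrix(1:6, ncol = 3, nrow = 2)
colMeans(A)   # 1.5 3.5 5.5
rowMeans(A)   # 3 4

# Any function of a vector can be applied, e.g., the range of each column
apply(A, MARGIN = 2, function(col) max(col) - min(col))  # 1 1 1
```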

Arrays

Arrays

Arrays are a generalization of matrices.

  • High-order arrays are much rarer than matrices.
  • Use dim() to find the size of an array.
  • Use dimnames() to assign names along each dimension (e.g., row names and column names).
  • Subsetting works similarly to subsetting with the other data structures.
A = array(1:24, dim=c(2,3,4))
print(dim(A))  
[1] 2 3 4
A[1,3,2]  # access element (1,3,2) of array A
[1] 11
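A short sketch of dimnames() on the same array. The names "r1", "c1", "k1", and so on are made up for illustration.

```r
A <- array(1:24, dim = c(2, 3, 4))
dimnames(A) <- list(c("r1", "r2"),
                    c("c1", "c2", "c3"),
                    c("k1", "k2", "k3", "k4"))

A["r1", "c3", "k2"]   # 11, same element as A[1, 3, 2]
dim(A[, , 1])         # 2 3: fixing the third index leaves a 2 x 3 matrix
```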

Lists

Lists

List elements can be of any type.

  • Thus, lists are different from vectors, matrices, and arrays.
  • You construct lists by using list().
  • You can then return length, attributes, etc., just as for vectors.
  • Lists can contain other lists.

Combining values into list

x <- list(3, "haha")
str(x)
List of 2
 $ : num 3
 $ : chr "haha"
z <- list(x, 5.4)  # list inside a list
str(z)
List of 2
 $ :List of 2
  ..$ : num 3
  ..$ : chr "haha"
 $ : num 5.4
y <- c(x, 5.4)  # a "flat" list
str(y)
List of 3
 $ : num 3
 $ : chr "haha"
 $ : num 5.4

Can also use rep(thing_to_repeat, num_repetitions):

z <- rep(y, 2)
str(z)
List of 6
 $ : num 3
 $ : chr "haha"
 $ : num 5.4
 $ : num 3
 $ : chr "haha"
 $ : num 5.4

Subsetting a list: By element position

Use the single square bracket operator [ ], which returns a list, or the double square bracket operator [[ ]], which returns the element itself.

profile <- list(name="Mary", no.children=3, child.ages=c(4,7,9))

profile[1]
$name
[1] "Mary"
profile[[1]]  # What is the difference between this and profile[1]? Use str().
[1] "Mary"
str(profile[1])
List of 1
 $ name: chr "Mary"
str(profile[[1]])
 chr "Mary"

Subsetting a list: By logical vector

profile <- list(name="Mary", no.children=3, child.ages=c(4,7,9))

profile[c(T,F,T)]  # Can use T instead of TRUE; F instead of FALSE (but case sensitive -- 'true' or 'True' will return an error)
$name
[1] "Mary"

$child.ages
[1] 4 7 9

Subsetting a list: By name

Use [ ], [[ ]], or the $ operator.

profile <- list(name="Mary", no.children=3, child.ages=c(4,7,9))
profile$child.ages
[1] 4 7 9
profile['child.ages']
$child.ages
[1] 4 7 9
profile[['child.ages']]
[1] 4 7 9
profile[c('child.ages', 'name')]
$child.ages
[1] 4 7 9

$name
[1] "Mary"
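The same operators can be chained to reach inside an element, and used on the left of an assignment to add or remove elements. A minimal sketch; the city value is hypothetical.

```r
profile <- list(name = "Mary", no.children = 3, child.ages = c(4, 7, 9))

profile$child.ages[2]          # 7: second element of the child.ages vector
profile[["child.ages"]][2]     # 7: same thing with [[ ]]

profile$city <- "Davis"        # add a new element (hypothetical value)
names(profile)                 # "name" "no.children" "child.ages" "city"

profile$no.children <- NULL    # assigning NULL removes an element
length(profile)                # 3
```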

Data frames

Examples of built-in datasets

Data frames are a convenient way to store datasets.

  • Many datasets are publicly available through base R or through an R package.
  • Commonly used datasets: USArrests, PlantGrowth, ToothGrowth, mtcars
  • Useful if you want to see if some statistical/ML method (perhaps your own!) generalizes well to various data sets.
  • Saves you the trouble of collecting/generating data yourself
MASS::housing  # dataset: Frequency Table from a Copenhagen Housing Conditions Survey
      Sat   Infl      Type Cont Freq
1     Low    Low     Tower  Low   21
2  Medium    Low     Tower  Low   21
3    High    Low     Tower  Low   28
4     Low Medium     Tower  Low   34
5  Medium Medium     Tower  Low   22
...

Data frames

A data frame is a list with class “data.frame”.

  • Most common use case: each element in the list is a vector, and all vectors have the same length.
  • Then the data frame has a rectangular structure (each column is a vector).
  • Its rows can be extracted by using matrix conventions.
  • Its columns can be extracted using matrix conventions (but also using list conventions).
  • A data frame can be built from vectors with the data.frame() function. A simple way to obtain one from existing data is to use the read.table() function to read an entire data frame from an external file.
    • Can also use read.csv().
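The two routes can be sketched together: build a small data frame with data.frame(), write it to a temporary CSV file, and read it back with read.csv(). The column names and values are made up for illustration.

```r
# Build a small data frame from vectors of equal length
df <- data.frame(id = 1:3, score = c(90.5, 72, 88))

# Round trip through a temporary CSV file
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE)
df2 <- read.csv(path)

str(df2)                      # 3 obs. of 2 variables: id (int), score (num)
isTRUE(all.equal(df, df2))    # TRUE
```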

Data frames: Subsetting

  • Similar to matrix subsetting, you can specify rows and columns.
  • If you do not specify rows, all rows will automatically be selected.
  • But also similar to lists because you can use [, [[ or $, and they return a vector or a dataframe.

Subsetting:

  • By index (using [ or [[).
  • By logical vector (using [ or [[).
  • By name (using [, [[, or $).

Example: Subset

# Create a dataframe
firstNames <- sample(c("Dog", "Cat", "Bug"), replace = TRUE, size = 6)
dat <- data.frame(age = 11:16, weight = 1:6, name = firstNames)
dat
  age weight name
1  11      1  Cat
2  12      2  Bug
3  13      3  Cat
4  14      4  Bug
5  15      5  Dog
6  16      6  Dog

Example: Subset by index

dat[1, 3]  # Like matrix
[1] "Cat"
dat[1, ]  # Like matrix
  age weight name
1  11      1  Cat
dat[, 1]  # Like matrix
[1] 11 12 13 14 15 16
dat[c(3,1), ]  # Like matrix
  age weight name
3  13      3  Cat
1  11      1  Cat

Example: Subset by logical vector

dat[dat$weight > 4, ]
  age weight name
5  15      5  Dog
6  16      6  Dog

Example: Subset by name

dat$age  # Like list
[1] 11 12 13 14 15 16
dat[, 'age']  # Like matrix
[1] 11 12 13 14 15 16
dat[, c('age', 'name')]  # Like matrix
  age name
1  11  Cat
2  12  Bug
3  13  Cat
4  14  Bug
5  15  Dog
6  16  Dog
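A few further subsetting forms, sketched on a fixed copy of the data frame above (the name column is typed in by hand here so that the results do not depend on random sampling):

```r
dat <- data.frame(age = 11:16, weight = 1:6,
                  name = c("Cat", "Bug", "Cat", "Bug", "Dog", "Dog"))

dat[["age"]]      # vector, same as dat$age
dat["age"]        # one-column data frame (list-style single bracket)

# Combine conditions with & and pick columns by name
sub <- dat[dat$weight > 2 & dat$name == "Cat", c("age", "name")]
sub               # only row 3 satisfies both conditions

# Keep a data frame even when a single column is selected
dat[, "age", drop = FALSE]
```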

Functions

General concepts

  • Functions are modules of code that accomplish a specific task.
  • Functions have an input of some sort of data structure (value, vector, dataframe, etc.), process it, and return an output.
  • Common R built-in functions are, e.g., sum() or mean(), where the input is a vector and the output is a number.

Components of a function

  • Function name: It is stored in the R environment as an object with this name.
  • Argument(s): When calling a function, you pass a value or values to the argument(s).
    • …can be required or optional.
    • …can have default values.
  • Function Body: The sequence of commands that are executed when the function is called.
  • Return Value: The output of the function.
square <- function(x) {
  y <- x^2
  return(y)
}
square(3)
[1] 9
  1. The function name is square.
  2. The function has only one argument; here it is called x.
  3. The function body consists of the two lines of code between the curly braces { and }.
  4. The return value is y.

Passing arguments

When calling a function, you can specify the arguments by:

  • position
mean(1:10, 0.2, TRUE)
[1] 5.5
  • complete name
mean(x = 1:10, trim = 0.2, na.rm = TRUE)
[1] 5.5
  • partial name (does not work when the abbreviation is ambiguous)
mean(x = 1:10, n = TRUE, t = 0.2)
[1] 5.5

Customized functions – why are they useful?

  • Make code easier to understand due to an evocative name.
  • Useful to avoid code repetitions.
  • Help reduce the chance of making mistakes when you copy and paste.
    • (e.g., updating a variable name in one place, but not in another).

Customized functions – Writing your own function

How to write your own function:

FunctionName <- function(arg1, arg2, ...) {
  # what the function does with the arguments, and the output
}

Example

square_with_offset <- function(x, offset=0) {
  y <- x^2
  return(y + offset)
}
square_with_offset(3)  # default value of offset is 0
[1] 9
square_with_offset(3, -6)
[1] 3

Some statistical functions

Let x and y be numeric vectors of the same length. We can calculate:

  • The mean of x by mean(x);
  • The variance of x by var(x);
  • The standard deviation of x by sd(x);
  • The covariance of x and y using cov(x, y);
  • The correlation of x and y using cor(x, y).
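These functions are related by simple formulas, which can be checked numerically. The vectors below are arbitrary illustrative values.

```r
x <- c(2, 3, 4, 7)
y <- c(5, 7, 6, 10)
n <- length(x)

# Sample variance uses the n - 1 denominator
manual_var <- sum((x - mean(x))^2) / (n - 1)
all.equal(manual_var, var(x))                       # TRUE

# Standard deviation is the square root of the variance
all.equal(sd(x), sqrt(var(x)))                      # TRUE

# Correlation is covariance rescaled by both standard deviations
all.equal(cor(x, y), cov(x, y) / (sd(x) * sd(y)))   # TRUE
```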

Loops

Repetitive execution: for loop

Template

for (variable in vector) {
  # commands to be repeated
}

Example for loop: element-wise squaring

y <- sample(1:10, 5)
y
[1] 6 4 2 1 8
z <- numeric(length(y))  # creates a numeric vector of zeros with the same length as y (here 5)
for (i in 1:5) {
  z[i] <- y[i]^2
}
z
[1] 36 16  4  1 64
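Two refinements are worth noting: seq_along(y) avoids hard-coding the length 5, and for this particular task the loop can be replaced by the element-wise operation y^2. A sketch using the values of y printed above:

```r
y <- c(6, 4, 2, 1, 8)

z <- numeric(length(y))
for (i in seq_along(y)) {   # seq_along(y) is 1, 2, ..., length(y)
  z[i] <- y[i]^2
}

identical(z, y^2)           # TRUE: the vectorized form gives the same result
```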

Repetitive execution: for loop

Example for loop: cumulative sum.

y
[1] 6 4 2 1 8
z <- 0
for (i in 1:5) {
  z <- z + y[i]  # uses previous iteration's value of z
  print(paste("the cumulative sum of the vector y at index", i, "is:", z))
}
[1] "the cumulative sum of the vector y at index 1 is: 6"
[1] "the cumulative sum of the vector y at index 2 is: 10"
[1] "the cumulative sum of the vector y at index 3 is: 12"
[1] "the cumulative sum of the vector y at index 4 is: 13"
[1] "the cumulative sum of the vector y at index 5 is: 21"
z
[1] 21
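For comparison, base R has a built-in cumsum() function that returns all the running totals at once. Using the same values of y:

```r
y <- c(6, 4, 2, 1, 8)
cumsum(y)   # 6 10 12 13 21
sum(y)      # 21: the final cumulative sum
```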

Repetitive execution: for loop

Example for loop: compute \(\sum_{n=1}^5 n!\)

z <- 0
for (i in 1:5) {
  y <- factorial(i)  # the factorial() function is built into R
  print(paste0("the value of  ", i, "!  is: ", y))
  z <- z + y  # uses previous iteration's value of z
}
[1] "the value of  1!  is: 1"
[1] "the value of  2!  is: 2"
[1] "the value of  3!  is: 6"
[1] "the value of  4!  is: 24"
[1] "the value of  5!  is: 120"
z
[1] 153
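Because factorial() works element by element on a vector, the same total can also be obtained without a loop:

```r
factorial(1:5)        # 1 2 6 24 120
sum(factorial(1:5))   # 153
```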

Q: What happens if we omit z <- 0 on the first line?

Repetitive execution: while loop

Useful for when we don’t know how many times we want to execute commands.

while (condition is true) {
  # commands to be repeated 
}

Example while loop (random walk)

x <- 0
while (-2 <= x && x <= 2) {
  curr_step <- sample(c(-1, 1), size=1)
  print(paste0("moving x=", x, " by step of size ", curr_step))
  x <- x + curr_step  # uses previous iteration's value of x
}
[1] "moving x=0 by step of size -1"
[1] "moving x=-1 by step of size 1"
[1] "moving x=0 by step of size -1"
[1] "moving x=-1 by step of size -1"
[1] "moving x=-2 by step of size -1"

Example while loop (random walk): another set of random steps

x <- 0
while (-2 <= x && x <= 2) {
  curr_step <- sample(c(-1, 1), size=1)
  print(paste0("moving x=", x, " by step of size ", curr_step))
  x <- x + curr_step  # uses previous iteration's value of x
}
[1] "moving x=0 by step of size 1"
[1] "moving x=1 by step of size 1"
[1] "moving x=2 by step of size 1"

Example while loop (random walk): fix the set of “random” steps

set.seed(42)  # for reproducibility; fixes all subsequent "random" results
x <- 0
while (-2 <= x && x <= 2) {
  curr_step <- sample(c(-1, 1), size=1)
  print(paste0("moving x=", x, " by step of size ", curr_step))
  x <- x + curr_step  # uses previous iteration's value of x
}
[1] "moving x=0 by step of size -1"
[1] "moving x=-1 by step of size -1"
[1] "moving x=-2 by step of size -1"

It is possible that the body of a while() loop will never be executed.

x <- 4  # this will not satisfy the condition in the following while loop
while (-2 <= x && x <= 2) {
  curr_step <- sample(c(-1, 1), size=1)
  print(paste0("moving x=", x, " by step of size ", curr_step))
  x <- x + curr_step  # uses previous iteration's value of x
}

Comments on loops

Performs commands sequentially

  • Pro: Helpful if commands depend on the values from the previous iteration’s commands
  • Con: Is a bit clunky if we want to store results in a vector/list

Often we will want to perform the same set of (complicated) commands on different chunks of data.

  • Can use for loop, but can be difficult to understand because it is so flexible
  • Can instead use apply() family of functions

apply() family of functions

Example: calculate group means of grades

# for loop (grades is a named list with one numeric vector per group; its definition is not shown)
for (i in 1:3) { 
  print(mean(grades[[i]])) 
}
[1] 4.44
[1] 8.13
[1] 12.54
# for-each loop
for (x in grades){ 
  print(mean(x)) 
}
[1] 4.44
[1] 8.13
[1] 12.54
lapply(grades, mean)
$group1
[1] 4.44

$group2
[1] 8.13

$group3
[1] 12.54
sapply(grades, mean)
group1 group2 group3 
  4.44   8.13  12.54 

Very clean!
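The slides do not show how grades is defined. To run the code above, a list of the following shape would work; the values here are made up, so the means differ from the output shown. vapply() is a stricter relative of sapply() that checks the type of each result.

```r
# Hypothetical data: the values are made up for illustration
grades <- list(group1 = c(4, 5, 3),
               group2 = c(8, 9, 7, 8),
               group3 = c(12, 13))

means_s <- sapply(grades, mean)                           # named numeric vector
means_v <- vapply(grades, mean, FUN.VALUE = numeric(1))   # same result, with a type check
means_s   # group1 = 4, group2 = 8, group3 = 12.5
```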

Example: anonymous functions

Calculate group means using…

  • a named function
mean_v2 <- function(x) sum(x)/length(x)
sapply(grades, mean_v2)
group1 group2 group3 
  4.44   8.13  12.54 
  • an anonymous function (useful for single-use execution)
sapply(grades, function(x) sum(x)/length(x))
group1 group2 group3 
  4.44   8.13  12.54 

Example: functions as objects

R is a functional programming language: functions can be used as objects!

funlist <- list(sum, mean, var, sd)
dat <- runif(10)

# Use a for loop to apply every function in funlist to dat
for (f in funlist) {
  print(f(dat))  # prints values
}
[1] 4.108169
[1] 0.4108169
[1] 0.116695
[1] 0.3416066
# Use sapply to apply every function in funlist to dat
sapply(funlist, \(f) f(dat))  # also stores values in a vector
[1] 4.1081692 0.4108169 0.1166950 0.3416066

Differences between for() and apply()

Beyond aesthetic differences…

  • for() executes commands sequentially.
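  • The apply() family has parallel counterparts (e.g., parallel::mclapply()), because each call is independent of the others; lapply() and sapply() themselves run sequentially.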
myfun <- function(x) {
  # commands that take a long time to execute
}

# Takes how much time?
for (x in grades) {
  myfun(x)
}

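# Takes how much time if the three calls run at once on 3 cores?
# (lapply() itself is sequential; a parallel counterpart such as parallel::mclapply() is needed to use several cores)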
lapply(grades, myfun)
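A rough sketch of the parallel variant, using mclapply() from the parallel package that ships with R. The grades values and the slow_mean() function are hypothetical stand-ins, and the core count is an assumption: mclapply() relies on forking, which is not available on Windows, so the sketch falls back to one core there.

```r
library(parallel)   # ships with base R

# Hypothetical data and a deliberately slow function
grades <- list(group1 = c(4, 5, 3), group2 = c(8, 9, 7, 8), group3 = c(12, 13))
slow_mean <- function(x) {
  Sys.sleep(0.2)    # stands in for commands that take a long time
  mean(x)
}

# Forking is unavailable on Windows, so use a single core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

res_seq <- lapply(grades, slow_mean)                        # one call after another
res_par <- mclapply(grades, slow_mean, mc.cores = n_cores)  # calls spread over cores

identical(res_seq, res_par)   # the results agree; only the elapsed time differs
```

Wrapping either call in system.time() shows the difference in elapsed time on a given machine.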

reduce

purrr::reduce()

Repeatedly applies a binary function to the elements of a vector or list.

  • (a binary function is a function with two arguments)
  • The base R version is Reduce(), but the version from the purrr package has nicer functionality.
  • reduce(<list or vector>, <binary function>)
library(purrr)
xvec <- c(1,3,5,4,2)
accumulate(xvec, `*`)  # collects intermediate steps of reduce()
[1]   1   3  15  60 120
reduce(xvec, `*`)  # typically just want the final result 
[1] 120
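For reference, the base R version mentioned above takes its arguments in the opposite order (function first). A minimal sketch that needs no extra packages:

```r
xvec <- c(1, 3, 5, 4, 2)

Reduce(`*`, xvec)                      # 120
Reduce(`*`, xvec, accumulate = TRUE)   # 1 3 15 60 120
prod(xvec)                             # 120: built-in shortcut for this special case
```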

purrr::reduce(): example using paste()

paste('eek', 'a', 'bear')
[1] "eek a bear"
paste0('eek', 'a', 'bear')
[1] "eekabear"
paste('eek', 'a', 'bear', sep='...')
[1] "eek...a...bear"

paste() also does element-wise pasting

paste(c("a", "b", "c"), 1:3, sep='-')
[1] "a-1" "b-2" "c-3"

So what if you want to paste together all elements of a character vector?

svec <- c('a', 'p', 'p', 'l', 'e')
accumulate(svec, paste)
[1] "a"         "a p"       "a p p"     "a p p l"   "a p p l e"
reduce(svec, paste)
[1] "a p p l e"
accumulate(svec, paste0)
[1] "a"     "ap"    "app"   "appl"  "apple"
reduce(svec, paste0)
[1] "apple"
reduce(svec, paste, sep='-')
[1] "a-p-p-l-e"
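For this particular task, paste() also has a collapse argument that joins all elements of one vector, so the same strings can be produced without reduce():

```r
svec <- c('a', 'p', 'p', 'l', 'e')

paste(svec, collapse = "")    # "apple"
paste(svec, collapse = " ")   # "a p p l e"
paste(svec, collapse = "-")   # "a-p-p-l-e"
```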

purrr::reduce(): example using set intersection

lst <- lapply(c(5, 3, 2), function(b) seq(0, 30, by=b))
lst
[[1]]
[1]  0  5 10 15 20 25 30

[[2]]
 [1]  0  3  6  9 12 15 18 21 24 27 30

[[3]]
 [1]  0  2  4  6  8 10 12 14 16 18 20 22 24 26 28 30
accumulate(lst, intersect)
[[1]]
[1]  0  5 10 15 20 25 30

[[2]]
[1]  0 15 30

[[3]]
[1]  0 30

Final element of accumulate(lst, intersect) will be the output of reduce(lst, intersect).

  • Same as intersect(intersect(lst[[1]], lst[[2]]), lst[[3]])
  • intersect(): must have two arguments.

purrr::reduce(): example of stacking data frames

MakeDF <- function(a) {
  data.frame(param = a, u = runif(4, min=a))
}
(av <- -1 * rexp(10))
 [1] -0.2992672 -0.2799575 -0.2311141 -1.2870888 -0.5693856 -3.0185564
 [7] -0.4975908 -0.3545537 -1.7564090 -0.7374990
(df_list <- lapply(av, MakeDF))
[[1]]
       param         u
1 -0.2992672 0.3690927
2 -0.2992672 0.5785272
3 -0.2992672 0.9776749
4 -0.2992672 0.6875838

[[2]]
       param           u
1 -0.2799575  0.44512359
2 -0.2799575  0.80760922
3 -0.2799575 -0.03743895
4 -0.2799575  0.06727781

[[3]]
       param          u
1 -0.2311141  0.7884435
2 -0.2311141  0.6223001
3 -0.2311141  0.0650239
4 -0.2311141 -0.1781900

[[4]]
      param          u
1 -1.287089 -0.9658007
2 -1.287089 -0.7921962
3 -1.287089 -0.1906617
4 -1.287089 -0.8355938

[[5]]
       param           u
1 -0.5693856  0.55956111
2 -0.5693856 -0.55701136
3 -0.5693856  0.01990297
4 -0.5693856  0.23791847

[[6]]
      param          u
1 -3.018556 -3.0122451
2 -3.018556 -0.6813479
3 -3.018556 -2.3840054
4 -3.018556 -1.5757809

[[7]]
       param          u
1 -0.4975908  0.4693016
2 -0.4975908  0.6642751
3 -0.4975908  0.3465215
4 -0.4975908 -0.1475987

[[8]]
       param          u
1 -0.3545537 -0.2326702
2 -0.3545537 -0.2385875
3 -0.3545537  0.0588810
4 -0.3545537  0.5495114

[[9]]
      param          u
1 -1.756409 -1.7557505
2 -1.756409 -1.1815049
3 -1.756409  0.8154147
4 -1.756409  0.7950465

[[10]]
      param          u
1 -0.737499  0.5379891
2 -0.737499 -0.1587868
3 -0.737499  0.1574230
4 -0.737499  0.5551562
reduce(df_list, rbind)
        param           u
1  -0.2992672  0.36909267
2  -0.2992672  0.57852718
3  -0.2992672  0.97767495
4  -0.2992672  0.68758376
5  -0.2799575  0.44512359
6  -0.2799575  0.80760922
7  -0.2799575 -0.03743895
8  -0.2799575  0.06727781
9  -0.2311141  0.78844348
10 -0.2311141  0.62230012
11 -0.2311141  0.06502390
12 -0.2311141 -0.17819002
13 -1.2870888 -0.96580066
14 -1.2870888 -0.79219616
15 -1.2870888 -0.19066173
16 -1.2870888 -0.83559384
17 -0.5693856  0.55956111
18 -0.5693856 -0.55701136
19 -0.5693856  0.01990297
20 -0.5693856  0.23791847
21 -3.0185564 -3.01224508
22 -3.0185564 -0.68134793
23 -3.0185564 -2.38400545
24 -3.0185564 -1.57578093
25 -0.4975908  0.46930157
26 -0.4975908  0.66427514
27 -0.4975908  0.34652154
28 -0.4975908 -0.14759872
29 -0.3545537 -0.23267022
30 -0.3545537 -0.23858752
31 -0.3545537  0.05888100
32 -0.3545537  0.54951137
33 -1.7564090 -1.75575051
34 -1.7564090 -1.18150490
35 -1.7564090  0.81541467
36 -1.7564090  0.79504652
37 -0.7374990  0.53798910
38 -0.7374990 -0.15878679
39 -0.7374990  0.15742300
40 -0.7374990  0.55515619
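A common base R alternative for stacking a list of data frames is do.call(), which passes every element of the list to rbind() in a single call. The sketch below uses a small deterministic list (made up for illustration) instead of the random one above.

```r
# Three small data frames with two rows each (hypothetical values)
df_list <- lapply(1:3, function(a) data.frame(param = a, u = a * c(0.1, 0.2)))

stacked <- do.call(rbind, df_list)   # same as rbind(df_list[[1]], df_list[[2]], df_list[[3]])
dim(stacked)                         # 6 2
```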

Conditional execution

if and else

One condition: if ... else statement

if (condition) {
  # statement 1: runs when condition is TRUE
} else {
  # statement 2: runs when condition is FALSE
}

Example (one coin toss)

x <- sample(0:1, 1)
x
[1] 1
if (x==0) {
  print('x is heads')
} else {
  print('x is tails')
}
[1] "x is tails"

if and else: vectorized version

ifelse(condition1, statement1, statement2)

Example (five coin tosses)

y <- sample(0:1, 5, replace=TRUE)
y
[1] 1 1 0 1 0
z <- ifelse(y==0, 'y is heads', 'y is tails')
z
[1] "y is tails" "y is tails" "y is heads" "y is tails" "y is heads"
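ifelse() calls can be nested to handle more than two outcomes element by element. A small sketch with made-up values:

```r
y <- c(-2, 0, 3)

labels <- ifelse(y < 0, "negative",
                 ifelse(y == 0, "zero", "positive"))
labels   # "negative" "zero" "positive"
```

A plain if () expects a single TRUE or FALSE, which is why the vectorized ifelse() is used when the condition is a vector.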

else if

More than two conditions:

if (condition_1) {
  # statement 1
} else if (condition_2) {
  # statement 2
  # ... (further else if branches as needed)
} else if (condition_n) {
  # statement n
} else {
  # else statement
}

Example

x <- sample(-4:4, size=1)
x
[1] 3
if (x < 0) {
  print('squaring x to make it positive')
  x <- x^2
} else if (x > 0) {
  print('x is already positive')
} else {
  print('adding 1 to make it positive')
  x <- x+1
}
[1] "x is already positive"
x
[1] 3