s2: R Basics

STA141A: Fundamentals of Statistical Data Science

Akira Horiguchi

Vectors

R operates on named data structures, which we simply refer to as data objects.
A vector is a single entity consisting of an ordered collection of values.
All values in a vector must be of the same data type.
Numeric vector
Logical vector
Character vector

Numeric Vectors

Numeric vectors are an ordered collection of numbers.
They can be used in arithmetic expressions, where operations are performed element by element.
There are also built-in R functions that take a vector and return summary measures (e.g., mean, quantiles).

x <- c(4,2,7,8)  # concatenates the numbers 4, 2, 7, and 8
mean(x)

[1] 5.25

summary(x)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.00    3.50    5.50    5.25    7.25    8.00

Logical Vectors

R allows manipulation of logical quantities.
The elements of a logical vector take the values TRUE or FALSE (or NA for a missing value).
Logical vectors are usually generated by conditions rather than typed in manually.
To write such conditions, we will learn about logical operators.

x <- c(4,2,7,8)  # concatenates the numbers 4, 2, 7, and 8
x < 5

[1]  TRUE  TRUE FALSE FALSE

(x %% 2) == 0  # which elements of x are even

[1]  TRUE  TRUE FALSE  TRUE

Character Vectors

Character strings are entered using matching double (" ") or single (' ') quotes, e.g., "apple" or 'banana'.
Character vectors are a collection of character strings.
The paste() function is often used for creating character strings.

z <- c("hello", "world", "!")  # This is a character vector
z

[1] "hello" "world" "!"

paste(z[1], z[2], z[3])

[1] "hello world !"

Factors

A factor is a vector that can contain only pre-defined values (e.g., only the strings "apple" and 'banana'), and is used to store categorical data.
You may want to explicitly define the levels to e.g. change their printed order.

x <- c("dog", "dog", "cat", "dog", "cat", "ewe", "bear", "bear", "bear", "bear")
table(x)  # default order is alphabetical order

x
bear  cat  dog  ewe 
   4    2    3    1

x <- factor(x, levels=c("dog", "ewe", "cat", "bear"))
x  # vector elements remain the same

 [1] dog  dog  cat  dog  cat  ewe  bear bear bear bear
Levels: dog ewe cat bear

table(x)

x
 dog  ewe  cat bear 
   3    1    2    4

x <- factor(x, levels=names(sort(table(x))))  # don't need to know this code right now
x  # vector elements remain the same

 [1] dog  dog  cat  dog  cat  ewe  bear bear bear bear
Levels: ewe cat dog bear

table(x)

x
 ewe  cat  dog bear 
   1    2    3    4

x <- factor(x, levels=names(sort(table(x), decreasing=TRUE)))  # don't need to know this code right now
x  # vector elements remain the same

 [1] dog  dog  cat  dog  cat  ewe  bear bear bear bear
Levels: bear dog cat ewe

table(x)

x
bear  dog  cat  ewe 
   4    3    2    1
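
A brief sketch, not from the original slides, of how a factor stores its data: as integer codes that point into the vector of levels. The vector pets is a made-up example.

```r
# Hypothetical example data (not from the slides)
pets <- factor(c("dog", "cat", "dog", "ewe"), levels = c("dog", "ewe", "cat"))

levels(pets)      # the pre-defined values, in the order we chose
nlevels(pets)     # number of levels
as.integer(pets)  # integer codes: position of each element's level in levels(pets)

# Assigning a value that is not one of the levels produces NA (with a warning)
pets2 <- pets
suppressWarnings(pets2[1] <- "bear")
is.na(pets2[1])
```

This is why a factor "can contain only pre-defined values": anything outside the levels cannot be given an integer code.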

Create a vector

Create “regular” vectors.

1:11

 [1]  1  2  3  4  5  6  7  8  9 10 11

seq(1, 17, by=2)

[1]  1  3  5  7  9 11 13 15 17

rep(-5, times=4)

[1] -5 -5 -5 -5

rep("buffalo", times=8)

[1] "buffalo" "buffalo" "buffalo" "buffalo" "buffalo" "buffalo" "buffalo"
[8] "buffalo"

You can combine c() with rep(thing_to_repeat, num_repetitions):

rep(c("eek", "a", "bear"), times=2)

[1] "eek"  "a"    "bear" "eek"  "a"    "bear"

rep(c(6, 2), times=4)

[1] 6 2 6 2 6 2 6 2

Create a vector: coercion

c() will sometimes coerce different data types into the same type.

str(c(4L, 6L))

 int [1:2] 4 6

str(c(4L, 6))  # coerces the integer 4L into the numeric 4

 num [1:2] 4 6

str(c(3, "haha"))  # coerces the numeric 3 into the character "3"

 chr [1:2] "3" "haha"

str(c(TRUE, 7))  # coerces the logical TRUE into the numeric value 1

 num [1:2] 1 7

str(c(FALSE, 7))  # coerces the logical FALSE into the numeric value 0

 num [1:2] 0 7

Roughly, the coercion order is logical < integer < numeric < character: c() converts all elements to the highest type present.
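
Coercion can also be requested explicitly. The following is a small illustrative sketch (the values are made up) using the as.*() functions from base R.

```r
as.numeric(c("3", "4.5"))   # character -> numeric
as.character(c(1.5, 2))     # numeric -> character
as.integer(c(TRUE, FALSE))  # logical -> integer (TRUE becomes 1, FALSE becomes 0)
sum(c(TRUE, TRUE, FALSE))   # logicals are coerced to numbers in arithmetic

# A string that does not look like a number becomes NA (with a warning)
bad <- suppressWarnings(as.numeric("haha"))
is.na(bad)
```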

Create a vector with randomly generated elements

Can generate a vector containing random values using e.g., sample().

animals <- c("ant", "bug", "cat", "dog")
sample(animals, size=3)  # samples without replacement

[1] "cat" "dog" "ant"

sample(animals, size=6)  # results in an error

sample(animals, size=6, replace=TRUE)  # samples with replacement

[1] "ant" "dog" "cat" "ant" "bug" "dog"

runif(5, min=0, max=1)  # generates 5 random numbers between 0 and 1

[1] 0.1480565 0.3443080 0.6948802 0.2596952 0.4794379

rnorm(5)  # generates 5 random numbers from a standard normal distribution

[1]  1.3096479  0.3115038 -1.1938093 -1.3670770 -1.0312860
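
Because these values are random, rerunning the code gives different results each time. A minimal sketch of how to make the draws reproducible with set.seed(); the seed value 141 is an arbitrary choice.

```r
set.seed(141)    # arbitrary seed value
a <- runif(3)
set.seed(141)    # resetting the seed restarts the random number stream
b <- runif(3)
identical(a, b)  # same seed, same draws
```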

Inspecting a vector

There are built-in R functions that take a vector and return info about it.

x <- rpois(10000, lambda=2)  # randomly generate 10000 values from a Poisson distribution with parameter lambda=2
length(x)  # returns length of the variable x

[1] 10000

str(x)  # returns the structure of the variable x

 int [1:10000] 1 2 4 2 3 3 4 1 1 1 ...

summary(x)  # returns summary statistics

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   1.000   2.000   2.015   3.000  10.000

table(x)  # counts occurrences of each value in x

x
   0    1    2    3    4    5    6    7    8    9   10 
1371 2695 2577 1889  939  361  123   32    8    3    2

Subsetting a vector

Square brackets [ ] are used for indexing (i.e., accessing elements of) a vector, matrix, array, list, or dataframe.

Three approaches to access elements:

by integer index
by a logical vector
by name

Subsetting a vector: by integer index

Using a vector of positive integers, the corresponding elements of the vector are selected and concatenated, in that order. A vector of negative integers specifies the values to be excluded rather than included.

x <- runif(7, min=0, max=1)
x

[1] 0.15153443 0.21308165 0.49793895 0.04551215 0.83195540 0.42114951 0.38850633

x[2]

[1] 0.2130816

x[5]

[1] 0.8319554

x[-3]  # returns all but the third element (different from Python!)

[1] 0.15153443 0.21308165 0.04551215 0.83195540 0.42114951 0.38850633

x[9]  # x has only 7 elements

[1] NA
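
The examples above use a single index. A short sketch with a made-up vector v shows index vectors with several entries.

```r
v <- c(10, 20, 30, 40, 50)  # hypothetical example vector

v[c(1, 3)]     # elements 1 and 3
v[2:4]         # elements 2 through 4
v[c(3, 1, 1)]  # order and repetition follow the index vector
v[-c(1, 2)]    # everything except elements 1 and 2
```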

Subsetting a vector: by a logical vector

Values corresponding to TRUE in the index vector are selected and those corresponding to FALSE are omitted.

x <- runif(7, min=0, max=1)
x

[1] 0.8774323 0.7797666 0.7160577 0.6435915 0.7282974 0.2961213 0.8139053

x[c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE)]

[1] 0.8774323 0.7797666 0.8139053

x[c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)]

[1] 0.7160577

x < 0.5  # a logical vector!

[1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE

x[x < 0.5]

[1] 0.2961213
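
Conditions can be combined with the element-wise operators & (and) and | (or). A small sketch with a made-up vector v; which() is a base R function that returns the positions of the TRUE values.

```r
v <- c(0.9, 0.1, 0.6, 0.3, 0.8)  # hypothetical example vector

v[v > 0.2 & v < 0.7]   # both conditions must hold
v[v < 0.2 | v > 0.85]  # at least one condition must hold
which(v < 0.5)         # integer positions where the condition is TRUE
```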

Subsetting a vector: by name

If a vector has a names attribute to identify its components, a sub-vector of the names vector may be used to select the elements.

x <- rnorm(4)
names(x) <- c('a', 'b', 'c', 'd')  # Assigns names to vector's elements
x

         a          b          c          d 
 0.2196825  0.2927151 -1.5388355  0.9532625

str(x)

 Named num [1:4] 0.22 0.293 -1.539 0.953
 - attr(*, "names")= chr [1:4] "a" "b" "c" "d"

x[c('b', 'a')]  # Selects the elements of x with the name 'b' and 'a'

        b         a 
0.2927151 0.2196825

To get the names of a vector, use names(x) again:

names(x)

[1] "a" "b" "c" "d"

y <- c(5, 8, 2)  # does not have names attribute
names(y)

NULL

Some comments on vectors

All elements of a vector must be of the same type (numeric, logical, character).
Vectors can be coerced to another type: as.logical(), as.numeric(), as.character().
The vector’s type can be tested: use is.character(), is.logical(), etc.

Q: What will be the result of the following code?

y <- c("TRUE", "FALSE", "TRUE")
is.logical(y)

Operations with vectors: Element-wise operations

x <- c(2, 3, 4)
x

[1] 2 3 4

x + 1

[1] 3 4 5

2 * x

[1] 4 6 8

x / 2

[1] 1.0 1.5 2.0

x^2

[1]  4  9 16

Operations with vectors: Vector operations

x <- c(2, 3, 4)
y <- c(5, 7, 6)
x + y

[1]  7 10 10

y - x

[1] 3 4 2

2 * x + 3 * y

[1] 19 27 26
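
The vectors above have the same length. As a side note not covered in the original slides, when the lengths differ R recycles the shorter vector; this sketch uses lengths where the longer is a multiple of the shorter, so no warning is raised.

```r
long  <- c(1, 2, 3, 4)
short <- c(10, 20)

long + short  # short is recycled: 1+10, 2+20, 3+10, 4+20
long * 2      # a single number is a vector of length 1, recycled the same way
```

This is also why x + 1 and 2 * x work element-wise.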

Operations with vectors: functions that return a single number

x <- c(2, 3, 4)
sum(x)

[1] 9

mean(x)

[1] 3

var(x)

[1] 1

sd(x)

[1] 1

summary(x)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    2.0     2.5     3.0     3.0     3.5     4.0

Vectors: putting it all together

x <- rnorm(100)

Q: What is the sum of the first three values of the vector x?

x[1] + x[2] + x[3]
sum(c(x[1], x[2], x[3]))
sum(x[1:3])

Q: What is the mean of the positive values of the vector x?

mean(x[x > 0])

Q: How many negative values are there in the vector x?

length(x[x < 0])
sum(x < 0)

`all.equal` function

To compare whether values are equal, use ==.
Applied to two vectors, == compares them element by element and returns a logical vector.
To check whether all elements of one vector are (nearly) equal to those of another, use the all.equal() function.
It is used as all.equal(vector1, vector2, ...).

x <- seq(0.2, 1, by=0.2)
y <- c(0.2, 0.4, 0.6, 0.8, 1.0)
x == y  # element-wise comparison; the FALSE is due to floating-point error

[1]  TRUE  TRUE FALSE  TRUE  TRUE

all.equal(x, y)  # tests "near equality" to allow for floating-point error

[1] TRUE
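
One caveat, sketched below: when the vectors differ, all.equal() does not return FALSE but a character string describing the difference. Wrapping the call in isTRUE() gives a plain TRUE/FALSE, which is safer inside conditions.

```r
x <- seq(0.2, 1, by = 0.2)
y <- c(0.2, 0.4, 0.6, 0.8, 1.0)

isTRUE(all.equal(x, y))         # TRUE: equal up to a small tolerance
all.equal(x, y + 0.01)          # a character string describing the difference
isTRUE(all.equal(x, y + 0.01))  # FALSE
```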

Matrices

You can think of a matrix as a collection of vectors of the same type and length.

You can also think of a matrix as a table with a certain number of rows and columns, in which values of the same type are stored.
“Square” or “rectangular” data structure
A matrix can be created with the matrix() function by specifying the number of rows and/or columns.

matrix(1:6, ncol = 3, nrow = 2)  # Creates matrix with values 1 to 6, with 3 columns, 2 rows

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
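
Note that matrix() fills column by column by default. A short sketch of two alternatives: filling by row with byrow = TRUE, and building a matrix from vectors with cbind() and rbind(), which matches the "collection of vectors" view.

```r
M <- matrix(1:6, nrow = 2, byrow = TRUE)  # fill row by row; ncol is inferred
M
dim(M)  # number of rows and columns

cols <- cbind(c(1, 2), c(3, 4), c(5, 6))  # each vector becomes a column
rows <- rbind(c(1, 3, 5), c(2, 4, 6))     # each vector becomes a row
all(cols == rows)                         # both give the same 2 x 3 matrix
```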

Subsetting a matrix: by index

It is similar to vectors: each element is indexed by row and column.

A = matrix(1:6, ncol = 3, nrow = 2) 
A

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

A[1, 2]

[1] 3

A[1, ]

[1] 1 3 5

A[, 3]

[1] 5 6

Subsetting a matrix: by logical vector

It is similar to vectors: values corresponding to TRUE in the index vector are selected and those corresponding to FALSE are omitted.

A = matrix(1:6, ncol = 3, nrow = 2) 
A

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

A[c(T, F), ]

[1] 1 3 5

A[, c(F, T, F)]

[1] 3 4

A[c(T, F), c(F, T, F)]  # Can use T for TRUE and/or F for FALSE

[1] 3

Subsetting a matrix: by name

A = matrix(1:6, ncol = 3, nrow = 2) 
# Adding row and col names
colnames(A) = c("First", "Second", "Third")
rownames(A) = c("a", "b")
A

  First Second Third
a     1      3     5
b     2      4     6

A['a', ]

 First Second  Third 
     1      3      5

A[, 'Second']

a b 
3 4

A['a', 'Second']

[1] 3

Matrix operations: transposition, matrix multiplication

B = matrix(1:4, ncol = 2, nrow = 2)  # 2 by 2 matrix
C = t(B)  # transpose the matrix B
B

     [,1] [,2]
[1,]    1    3
[2,]    2    4

C

     [,1] [,2]
[1,]    1    2
[2,]    3    4

C %*% B  # matrix multiplication

     [,1] [,2]
[1,]    5   11
[2,]   11   25

Matrix operations: element-wise operations

B = matrix(1:4, ncol = 2, nrow = 2)  # 2 by 2 matrix
C = t(B)  # transpose the matrix B
B + C  # element-wise addition

     [,1] [,2]
[1,]    2    5
[2,]    5    8

B * C  # element-wise multiplication

     [,1] [,2]
[1,]    1    6
[2,]    6   16

Matrix operations: column-wise or row-wise

Applying column-wise or row-wise functions (apply() function)

A = matrix(1:6, ncol = 3, nrow = 2)
A

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

apply(A, MARGIN = 2, mean)  # Apply the mean() function to all column vectors

[1] 1.5 3.5 5.5

apply(A, MARGIN = 1, mean)  # Apply the mean() function to all row vectors

[1] 3 4
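
For means specifically, base R also provides colMeans() and rowMeans(). The sketch below checks that they agree with apply() and shows apply() with an anonymous function (here a hypothetical column range).

```r
A <- matrix(1:6, ncol = 3, nrow = 2)

colMeans(A)  # same values as apply(A, MARGIN = 2, mean)
rowMeans(A)  # same values as apply(A, MARGIN = 1, mean)

# Any function of a vector can be applied, e.g. the range width of each column
apply(A, MARGIN = 2, function(col) max(col) - min(col))
```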

Arrays

Arrays are a generalization of matrices.

Higher-order arrays are much rarer than matrices.
Use dim() to find the size of an array.
Use dimnames() to assign names to the indices along each dimension.
Subsetting works similarly to subsetting with the other data structures.

A = array(1:24, dim=c(2,3,4))
print(dim(A))

[1] 2 3 4

A[1,3,2]  # access element (1,3,2) of array A

[1] 11
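
A minimal sketch of dimnames() and of extracting a slice; the names r1, c1, k1, etc. are made up for illustration.

```r
A <- array(1:24, dim = c(2, 3, 4))

# One character vector of names per dimension (hypothetical names)
dimnames(A) <- list(c("r1", "r2"), c("c1", "c2", "c3"), paste0("k", 1:4))

A["r1", "c3", "k2"]  # same element as A[1, 3, 2]
A[, , 1]             # leaving indices empty selects everything: a 2 x 3 matrix
dim(A[, , 1])
```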

Lists

List elements can be of any type.

Thus, lists are different from vectors, matrices, and arrays.
You construct lists by using list().
You can then return length, attributes, etc., just as for vectors.
Lists can contain other lists.

Combining values into list

x <- list(3, "haha")
str(x)

List of 2
 $ : num 3
 $ : chr "haha"

z <- list(x, 5.4)  # list inside a list
str(z)

List of 2
 $ :List of 2
  ..$ : num 3
  ..$ : chr "haha"
 $ : num 5.4

y <- c(x, 5.4)  # a "flat" list
str(y)

List of 3
 $ : num 3
 $ : chr "haha"
 $ : num 5.4

Can also use rep(thing_to_repeat, num_repetitions):

z <- rep(y, 2)
str(z)

List of 6
 $ : num 3
 $ : chr "haha"
 $ : num 5.4
 $ : num 3
 $ : chr "haha"
 $ : num 5.4

Subsetting a list: By element position

Use the single square bracket operator [ ] (returns a sub-list) or the double square bracket operator [[ ]] (returns the element itself).

profile <- list(name="Mary", no.children=3, child.ages=c(4,7,9))

profile[1]

$name
[1] "Mary"

profile[[1]]  # What is the difference between this and profile[1]? Use str().

[1] "Mary"

str(profile[1])

List of 1
 $ name: chr "Mary"

str(profile[[1]])

 chr "Mary"

Subsetting a list: By logical vector

profile <- list(name="Mary", no.children=3, child.ages=c(4,7,9))

profile[c(T,F,T)]  # Can use T instead of TRUE; F instead of FALSE (but case sensitive -- 'true' or 'True' will return an error)

$name
[1] "Mary"

$child.ages
[1] 4 7 9

Subsetting a list: By name

[ ], [[ ]], or the $ operator.

profile <- list(name="Mary", no.children=3, child.ages=c(4,7,9))
profile$child.ages

[1] 4 7 9

profile['child.ages']

$child.ages
[1] 4 7 9

profile[['child.ages']]

[1] 4 7 9

profile[c('child.ages', 'name')]

$child.ages
[1] 4 7 9

$name
[1] "Mary"
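
The same operators can be chained to reach inside an element, and used on the left of <- to add or remove elements. A short sketch; the element city is a hypothetical addition.

```r
profile <- list(name = "Mary", no.children = 3, child.ages = c(4, 7, 9))

profile$child.ages[2]       # second entry of the child.ages vector
profile[["child.ages"]][2]  # same thing with [[ ]]

profile$city <- "Davis"      # add a new element (hypothetical)
profile$no.children <- NULL  # assigning NULL removes an element
names(profile)
```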

Data frames

Examples of built-in datasets

Data frames are a convenient way to store datasets.

Many datasets are publicly available through base R or through an R package.
Commonly used datasets: USArrests, PlantGrowth, ToothGrowth, mtcars
Useful if you want to see if some statistical/ML method (perhaps your own!) generalizes well to various data sets.
Saves you the trouble of collecting/generating data yourself

MASS::housing  # dataset: Frequency Table from a Copenhagen Housing Conditions Survey

      Sat   Infl      Type Cont Freq
1     Low    Low     Tower  Low   21
2  Medium    Low     Tower  Low   21
3    High    Low     Tower  Low   28
4     Low Medium     Tower  Low   34
5  Medium Medium     Tower  Low   22
...

Data frames

A data frame is a list with class “data.frame”.

Most common use case: each element in the list is a vector, and all vectors have the same length.
Then the data frame has a rectangular structure (each column is a vector).
Its rows can be extracted by using matrix conventions.
Its columns can be extracted using matrix conventions (but also using list conventions).
A simple way to obtain a data frame is to read it from an external file with the read.table() function.
- Can also use read.csv().
- To build one from vectors, use data.frame().

Data frames: Subsetting

Similar to matrix subsetting, you can specify rows and columns.
If you do not specify rows, all rows will automatically be selected.
But it is also similar to list subsetting because you can use [, [[, or $; depending on the operator, the result is a vector or a data frame.

Subsetting:

By index (using [ or [[).
By logical vector (using [).
By name (using [, [[, or $).

Example: Subset

# Create a dataframe
firstNames <- sample(c("Dog", "Cat", "Bug"), replace = TRUE, size = 6)
dat <- data.frame(age = 11:16, weight = 1:6, name = firstNames)
dat

  age weight name
1  11      1  Cat
2  12      2  Bug
3  13      3  Cat
4  14      4  Bug
5  15      5  Dog
6  16      6  Dog

Example: Subset by index

dat[1, 3]  # Like matrix

[1] "Cat"

dat[1, ]  # Like matrix

  age weight name
1  11      1  Cat

dat[, 1]  # Like matrix

[1] 11 12 13 14 15 16

dat[c(3,1), ]  # Like matrix

  age weight name
3  13      3  Cat
1  11      1  Cat

Example: Subset by logical vector

dat[dat$weight > 4, ]

  age weight name
5  15      5  Dog
6  16      6  Dog

Example: Subset by name

dat$age  # Like list

[1] 11 12 13 14 15 16

dat[, 'age']  # Like matrix

[1] 11 12 13 14 15 16

dat[, c('age', 'name')]  # Like matrix

  age name
1  11  Cat
2  12  Bug
3  13  Cat
4  14  Bug
5  15  Dog
6  16  Dog
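
These approaches can be combined. The sketch below rebuilds the data frame with the names fixed (instead of sampled) so the results are deterministic; the column ratio is a hypothetical addition.

```r
dat <- data.frame(age = 11:16, weight = 1:6,
                  name = c("Cat", "Bug", "Cat", "Bug", "Dog", "Dog"))

# Rows by a combined logical condition, all columns
older_cats <- dat[dat$age > 12 & dat$name == "Cat", ]
older_cats

# Rows by logical condition, columns by name
dat[dat$weight > 4, c("age", "name")]

# Add a new column computed from existing ones
dat$ratio <- dat$age / dat$weight
str(dat)
```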

Functions

General concepts

Functions are modules of code that accomplish a specific task.
Functions have an input of some sort of data structure (value, vector, dataframe, etc.), process it, and return an output.
Common R built-in functions are, e.g., sum() or mean(), where the input is a vector and the output is a number.

Components of a function

Function name: The function is stored in the R environment as an object with this name.
Argument(s): When calling a function, you pass a value or values to the argument(s).
- …can be required or optional.
- …can have default values.
Function Body: The sequence of commands that are executed when the function is called.
Return Value: The output of the function.

square <- function(x) {
  y <- x^2
  return(y)
}
square(3)

[1] 9

The function name is square.
The function has only one argument; here it is called x.
The function body consists of the two lines of code between the curly braces { and }.
The return value is y.

Passing arguments

When calling a function, you can specify the arguments by:

position

mean(1:10, 0.2, TRUE)

[1] 5.5

complete name

mean(x = 1:10, trim = 0.2, na.rm = TRUE)

[1] 5.5

partial name (does not work when the abbreviation is ambiguous)

mean(x = 1:10, n = TRUE, t = 0.2)

[1] 5.5

Customized functions – why are they useful?

Make code easier to understand due to an evocative name.
Useful to avoid code repetitions.
Help reduce the chance of making mistakes when you copy and paste.
- (e.g., updating a variable name in one place, but not in another).

Customized functions – Writing your own function

How to write your own function:

FunctionName <- function(arg1, arg2, ...) {
  # what the function does with the arguments, and the output
}

Example

square_with_offset <- function(x, offset=0) {
  y <- x^2
  return(y + offset)
}
square_with_offset(3)  # default value of offset is 0

[1] 9

square_with_offset(3, -6)

[1] 3
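
One more sketch of a custom function, under the assumption that we want several summary values at once. The name describe and its digits argument are made up for illustration; stopifnot() is a base R function that raises an error when a condition fails.

```r
describe <- function(x, digits = 2) {
  stopifnot(is.numeric(x))  # fail early on non-numeric input
  # Return a named vector with two rounded summaries
  c(mean = round(mean(x), digits), sd = round(sd(x), digits))
}

describe(c(2, 3, 4))              # uses the default digits = 2
describe(c(2, 3, 4), digits = 0)  # optional argument passed by name
```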

Some statistical functions

Let x and y be numeric vectors of the same length. We can calculate:

The mean of x by mean(x);
The variance of x by var(x);
The standard deviation of x by sd(x);
The covariance of x and y using cov(x, y);
The correlation of x and y using cor(x, y).
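
As a check on what these functions compute, the sketch below reproduces them from their sample formulas (with denominator n - 1) on made-up data.

```r
x <- c(2, 3, 4, 7)   # hypothetical data
y <- c(5, 7, 6, 10)
n <- length(x)

var_manual <- sum((x - mean(x))^2) / (n - 1)
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
cor_manual <- cov_manual / (sd(x) * sd(y))

all.equal(var_manual, var(x))
all.equal(sqrt(var_manual), sd(x))
all.equal(cov_manual, cov(x, y))
all.equal(cor_manual, cor(x, y))
```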

Loops

Repetitive execution: `for` loop

Template

for (variable in vector) {
  # commands to be repeated
}

Example for loop: element-wise squaring

y <- sample(1:10, 5)
y

[1] 6 4 2 1 8

z <- numeric(length(y))  # creates a numeric vector of zeros with the same length as y (here 5)
for (i in 1:5) {
  z[i] <- y[i]^2
}
z

[1] 36 16  4  1 64
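
A small variation on this loop, as a sketch: seq_along(y) generates the indices 1, 2, ..., length(y), so the loop does not hard-code the length 5. The result matches the vectorized expression y^2.

```r
y <- c(6, 4, 2, 1, 8)  # same values as in the example output above

z <- numeric(length(y))
for (i in seq_along(y)) {  # indices 1 to length(y)
  z[i] <- y[i]^2
}

identical(z, y^2)  # the loop and the vectorized version agree
```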

Example for loop: cumulative sum.

y

[1] 6 4 2 1 8

z <- 0
for (i in 1:5) {
  z <- z + y[i]  # uses previous iteration's value of z
  print(paste("the cumulative sum of the vector y at index", i, "is:", z))
}

[1] "the cumulative sum of the vector y at index 1 is: 6"
[1] "the cumulative sum of the vector y at index 2 is: 10"
[1] "the cumulative sum of the vector y at index 3 is: 12"
[1] "the cumulative sum of the vector y at index 4 is: 13"
[1] "the cumulative sum of the vector y at index 5 is: 21"

z

[1] 21

Example for loop: compute $\sum_{n=1}^5 n!$

z <- 0
for (i in 1:5) {
  y <- factorial(i)  # the factorial() function is built into R
  print(paste0("the value of  ", i, "!  is: ", y))
  z <- z + y  # uses previous iteration's value of z
}

[1] "the value of  1!  is: 1"
[1] "the value of  2!  is: 2"
[1] "the value of  3!  is: 6"
[1] "the value of  4!  is: 24"
[1] "the value of  5!  is: 120"

z

[1] 153

Q: What happens if we omit z <- 0 in the first line?
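
For comparison, both running totals above can also be computed without a loop. This sketch uses the base R functions cumsum() and sum(), with y set to the values shown in the earlier example.

```r
y <- c(6, 4, 2, 1, 8)

cumsum(y)            # all cumulative sums at once: 6 10 12 13 21
sum(factorial(1:5))  # factorial() is vectorized, so this is 1! + 2! + ... + 5!
```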

Repetitive execution: `while` loop

Useful for when we don’t know how many times we want to execute commands.

while (condition is true) {
  # commands to be repeated 
}

Example while loop (random walk)

x <- 0
while (-2 <= x && x <= 2) {
  curr_step <- sample(c(-1, 1), size=1)
  print(paste0("moving x=", x, " by step of size ", curr_step))
  x <- x + curr_step  # uses previous iteration's value of x
}

[1] "moving x=0 by step of size -1"
[1] "moving x=-1 by step of size 1"
[1] "moving x=0 by step of size -1"
[1] "moving x=-1 by step of size -1"
[1] "moving x=-2 by step of size -1"

Example while loop (random walk): another set of random steps

x <- 0
while (-2 <= x && x <= 2) {
  curr_step <- sample(c(-1, 1), size=1)
  print(paste0("moving x=", x, " by step of size ", curr_step))
  x <- x + curr_step  # uses previous iteration's value of x
}

[1] "moving x=0 by step of size 1"
[1] "moving x=1 by step of size 1"
[1] "moving x=2 by step of size 1"

Example while loop (random walk): fix the set of “random” steps

set.seed(42)  # for reproducibility; fixes any subsequent "random" results
x <- 0
while (-2 <= x && x <= 2) {
  curr_step <- sample(c(-1, 1), size=1)
  print(paste0("moving x=", x, " by step of size ", curr_step))
  x <- x + curr_step  # uses previous iteration's value of x
}

[1] "moving x=0 by step of size -1"
[1] "moving x=-1 by step of size -1"
[1] "moving x=-2 by step of size -1"

It is possible that the body of a while() loop will never be executed.

x <- 4  # this will not satisfy the condition in the following while loop
while (-2 <= x && x <= 2) {
  curr_step <- sample(c(-1, 1), size=1)
  print(paste0("moving x=", x, " by step of size ", curr_step))
  x <- x + curr_step  # uses previous iteration's value of x
}
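
Conversely, a while loop can run for a very long time if its condition stays true. A sketch of one common safeguard, assuming we are willing to stop after a fixed number of iterations; the names n_steps and max_steps and the cap of 1000 are hypothetical.

```r
set.seed(1)        # arbitrary seed, for reproducibility
x <- 0
n_steps <- 0
max_steps <- 1000  # hypothetical safety cap on the number of iterations

while (-2 <= x && x <= 2 && n_steps < max_steps) {
  x <- x + sample(c(-1, 1), size = 1)
  n_steps <- n_steps + 1  # count how many steps were taken
}

n_steps  # number of steps until the walk left [-2, 2] (or hit the cap)
```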

Comments on loops

Performs commands sequentially

Pro: Helpful if commands depend on the values from the previous iteration’s commands
Con: Is a bit clunky if we want to store results in a vector/list

Often we will want to perform the same set of (complicated) commands on different chunks of data.

Can use for loop, but can be difficult to understand because it is so flexible
Can instead use apply() family of functions

`apply()` family of functions

`apply()` and related functions

lapply(X, FUN, ...): returns a list containing the result of the function FUN applied to all the elements of the list/vector X.
sapply(X, FUN, ...): essentially does lapply(X, FUN, ...) first and then tries to coerce the output into a vector.

grades <- list(group1 = sample(seq(0, 10, 0.1), 10), 
               group2 = sample(seq(5, 10, 0.1), 10), 
               group3 = sample(seq(10, 15, 0.1), 5))
grades

$group1
 [1] 2.4 7.3 1.7 4.8 4.6 2.3 7.0 8.8 3.6 1.9

$group2
 [1] 7.5 9.9 9.6 5.2 9.0 7.4 7.6 8.5 8.6 8.0

$group3
[1] 14.4 10.4 11.9 13.3 12.7

Example: calculate group means of `grades`

# for loop
for (i in 1:3) { 
  print(mean(grades[[i]])) 
}

[1] 4.44
[1] 8.13
[1] 12.54

# for-each loop
for (x in grades){ 
  print(mean(x)) 
}

[1] 4.44
[1] 8.13
[1] 12.54

lapply(grades, mean)

$group1
[1] 4.44

$group2
[1] 8.13

$group3
[1] 12.54

sapply(grades, mean)

group1 group2 group3 
  4.44   8.13  12.54

Very clean!
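
Two related points, sketched with made-up grades so the numbers are deterministic: vapply() is a base R variant of sapply() in which you declare the expected type and length of each result, and sapply() returns a matrix when the function returns more than one value per element.

```r
grades <- list(group1 = c(2, 4, 6), group2 = c(5, 9), group3 = c(10, 12, 14, 16))

means <- vapply(grades, mean, FUN.VALUE = numeric(1))  # each result must be one number
means

ranges <- sapply(grades, range)  # range() returns 2 values, so the result is a 2 x 3 matrix
ranges
```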

Example: anonymous functions

Calculate group means using…

a named function

mean_v2 <- function(x) sum(x)/length(x)
sapply(grades, mean_v2)

group1 group2 group3 
  4.44   8.13  12.54

an anonymous function (useful for single-use execution)

sapply(grades, function(x) sum(x)/length(x))

group1 group2 group3 
  4.44   8.13  12.54

Example: functions as objects

R supports functional programming: functions are objects and can be stored in lists or passed to other functions!

funlist <- list(sum, mean, var, sd)
dat <- runif(10)

# Use a for loop to apply every function in funlist to dat
for (f in funlist) {
  print(f(dat))  # prints values
}

[1] 4.108169
[1] 0.4108169
[1] 0.116695
[1] 0.3416066

# Use sapply to apply every function in funlist to dat
sapply(funlist, \(f) f(dat))  # also stores values in a vector

[1] 4.1081692 0.4108169 0.1166950 0.3416066

Differences between `for()` and `apply()`

Beyond aesthetic differences…

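for() executes commands sequentially.
Calls in the apply() family have no dependence between iterations, so they can be run in parallel. Base lapply() itself runs sequentially; parallel analogues such as parallel::mclapply() are needed for that.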

myfun <- function(x) {
  # commands that take a long time to execute
}

# Takes how much time?
for (x in grades) {
  myfun(x)
}

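# Takes how much time? (lapply() itself also runs sequentially)
lapply(grades, myfun)

# Takes how much time if 3 cores are used? (mc.cores > 1 is not supported on Windows)
parallel::mclapply(grades, myfun, mc.cores = 3)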
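
A hedged sketch of running such a computation in parallel with mclapply() from the parallel package that ships with R. The function slow_square and the choice of 2 cores are assumptions for illustration; on Windows, mclapply() only supports mc.cores = 1, so the sketch falls back to one core there.

```r
library(parallel)

# Hypothetical stand-in for a slow computation
slow_square <- function(x) {
  Sys.sleep(0.1)  # pretend this takes a while
  x^2
}

# Assumption: use 2 cores where forking is available, 1 otherwise
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

res_seq <- lapply(1:4, slow_square)                         # sequential
res_par <- mclapply(1:4, slow_square, mc.cores = n_cores)   # possibly parallel

identical(res_seq, res_par)  # same results, only the execution differs
```

Wrapping each call in system.time() would show the difference in elapsed time.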

reduce

`purrr::reduce()`

Repeatedly applies a binary function to the elements of a vector or list.

(a binary function is a function with two arguments)
The base R version is Reduce(), but the version from the purrr package has nicer functionality.
reduce(<list or vector>, <binary function>)

library(purrr)
xvec <- c(1,3,5,4,2)
accumulate(xvec, `*`)  # collects intermediate steps of reduce()

[1]   1   3  15  60 120

reduce(xvec, `*`)  # typically just want the final result

[1] 120
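
For reference, a sketch of the same computation with base R's Reduce(), which needs no extra package. Note that the argument order is reversed (function first), and accumulate = TRUE plays the role of accumulate().

```r
xvec <- c(1, 3, 5, 4, 2)

Reduce(`*`, xvec)                     # final result only: ((((1*3)*5)*4)*2)
Reduce(`*`, xvec, accumulate = TRUE)  # intermediate results as well
```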

`purrr::reduce()`: example using `paste()`

paste('eek', 'a', 'bear')

[1] "eek a bear"

paste0('eek', 'a', 'bear')

[1] "eekabear"

paste('eek', 'a', 'bear', sep='...')

[1] "eek...a...bear"

paste() also does element-wise pasting

paste(c("a", "b", "c"), 1:3, sep='-')

[1] "a-1" "b-2" "c-3"

So what if you want to paste together all elements of a character vector?

svec <- c('a', 'p', 'p', 'l', 'e')
accumulate(svec, paste)

[1] "a"         "a p"       "a p p"     "a p p l"   "a p p l e"

reduce(svec, paste)

[1] "a p p l e"

accumulate(svec, paste0)

[1] "a"     "ap"    "app"   "appl"  "apple"

reduce(svec, paste0)

[1] "apple"

reduce(svec, paste, sep='-')

[1] "a-p-p-l-e"

`purrr::reduce()`: example using set intersection

lst <- lapply(c(5, 3, 2), function(b) seq(0, 30, by=b))
lst

[[1]]
[1]  0  5 10 15 20 25 30

[[2]]
 [1]  0  3  6  9 12 15 18 21 24 27 30

[[3]]
 [1]  0  2  4  6  8 10 12 14 16 18 20 22 24 26 28 30

accumulate(lst, intersect)

[[1]]
[1]  0  5 10 15 20 25 30

[[2]]
[1]  0 15 30

[[3]]
[1]  0 30

Final element of accumulate(lst, intersect) will be the output of reduce(lst, intersect).

Same as intersect(intersect(lst[[1]], lst[[2]]), lst[[3]])
intersect(): must have two arguments.

`purrr::reduce()`: example of stacking data frames

MakeDF <- function(a) {
  data.frame(param = a, u = runif(4, min=a))
}
(av <- -1 * rexp(10))

 [1] -0.2992672 -0.2799575 -0.2311141 -1.2870888 -0.5693856 -3.0185564
 [7] -0.4975908 -0.3545537 -1.7564090 -0.7374990

(df_list <- lapply(av, MakeDF))

[[1]]
       param         u
1 -0.2992672 0.3690927
2 -0.2992672 0.5785272
3 -0.2992672 0.9776749
4 -0.2992672 0.6875838

[[2]]
       param           u
1 -0.2799575  0.44512359
2 -0.2799575  0.80760922
3 -0.2799575 -0.03743895
4 -0.2799575  0.06727781

[[3]]
       param          u
1 -0.2311141  0.7884435
2 -0.2311141  0.6223001
3 -0.2311141  0.0650239
4 -0.2311141 -0.1781900

[[4]]
      param          u
1 -1.287089 -0.9658007
2 -1.287089 -0.7921962
3 -1.287089 -0.1906617
4 -1.287089 -0.8355938

[[5]]
       param           u
1 -0.5693856  0.55956111
2 -0.5693856 -0.55701136
3 -0.5693856  0.01990297
4 -0.5693856  0.23791847

[[6]]
      param          u
1 -3.018556 -3.0122451
2 -3.018556 -0.6813479
3 -3.018556 -2.3840054
4 -3.018556 -1.5757809

[[7]]
       param          u
1 -0.4975908  0.4693016
2 -0.4975908  0.6642751
3 -0.4975908  0.3465215
4 -0.4975908 -0.1475987

[[8]]
       param          u
1 -0.3545537 -0.2326702
2 -0.3545537 -0.2385875
3 -0.3545537  0.0588810
4 -0.3545537  0.5495114

[[9]]
      param          u
1 -1.756409 -1.7557505
2 -1.756409 -1.1815049
3 -1.756409  0.8154147
4 -1.756409  0.7950465

[[10]]
      param          u
1 -0.737499  0.5379891
2 -0.737499 -0.1587868
3 -0.737499  0.1574230
4 -0.737499  0.5551562

reduce(df_list, rbind)

        param           u
1  -0.2992672  0.36909267
2  -0.2992672  0.57852718
3  -0.2992672  0.97767495
4  -0.2992672  0.68758376
5  -0.2799575  0.44512359
6  -0.2799575  0.80760922
7  -0.2799575 -0.03743895
8  -0.2799575  0.06727781
9  -0.2311141  0.78844348
10 -0.2311141  0.62230012
11 -0.2311141  0.06502390
12 -0.2311141 -0.17819002
13 -1.2870888 -0.96580066
14 -1.2870888 -0.79219616
15 -1.2870888 -0.19066173
16 -1.2870888 -0.83559384
17 -0.5693856  0.55956111
18 -0.5693856 -0.55701136
19 -0.5693856  0.01990297
20 -0.5693856  0.23791847
21 -3.0185564 -3.01224508
22 -3.0185564 -0.68134793
23 -3.0185564 -2.38400545
24 -3.0185564 -1.57578093
25 -0.4975908  0.46930157
26 -0.4975908  0.66427514
27 -0.4975908  0.34652154
28 -0.4975908 -0.14759872
29 -0.3545537 -0.23267022
30 -0.3545537 -0.23858752
31 -0.3545537  0.05888100
32 -0.3545537  0.54951137
33 -1.7564090 -1.75575051
34 -1.7564090 -1.18150490
35 -1.7564090  0.81541467
36 -1.7564090  0.79504652
37 -0.7374990  0.53798910
38 -0.7374990 -0.15878679
39 -0.7374990  0.15742300
40 -0.7374990  0.55515619
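
A common base R alternative for stacking a list of data frames is do.call(rbind, ...), which calls rbind() once with all list elements as arguments. The sketch below uses small deterministic data frames (made up, not the random ones above) to compare it with the pairwise approach.

```r
# Hypothetical list of three small data frames with identical columns
df_list <- lapply(1:3, function(a) data.frame(param = a, u = a + c(0.1, 0.2)))

stacked_reduce <- Reduce(rbind, df_list)   # rbind applied pairwise, left to right
stacked_docall <- do.call(rbind, df_list)  # a single rbind(df1, df2, df3) call

nrow(stacked_docall)
all.equal(stacked_reduce$u, stacked_docall$u)
```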

Conditional execution

`if` and `else`

Only one condition: if statement (with an optional else branch)

if (condition1) {
  # statement 1: runs when condition1 is TRUE
} else {
  # statement 2: runs when condition1 is FALSE
}

Example (one coin toss)

x <- sample(0:1, 1)
x

[1] 1

if (x==0) {
  print('x is heads')
} else {
  print('x is tails')
}

[1] "x is tails"

`if` and `else`: vectorized version

ifelse(condition1, statement1, statement2)

Example (five coin tosses)

y <- sample(0:1, 5, replace=TRUE)
y

[1] 1 1 0 1 0

z <- ifelse(y==0, 'y is heads', 'y is tails')
z

[1] "y is tails" "y is tails" "y is heads" "y is tails" "y is heads"
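
ifelse() calls can be nested to handle more than two cases element by element. A minimal sketch with a made-up vector v:

```r
v <- c(-2, 0, 3)  # hypothetical example vector

# The inner ifelse() is only used where the first condition is FALSE
signs <- ifelse(v < 0, "negative", ifelse(v > 0, "positive", "zero"))
signs
```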

`else if`

More than two conditions:

if (condition_1) {
  # statement 1
} else if (condition_2) {
  # statement 2
# ... (further else if branches)
} else if (condition_n) {
  # statement n
} else {
  # else statement
}

Example

x <- sample(-4:4, size=1)
x

[1] 3

if (x < 0) {
  print('squaring x to make it positive')
  x <- x^2
} else if (x > 0) {
  print('x is already positive')
} else {
  print('adding 1 to make it positive')
  x <- x+1
}

[1] "x is already positive"

x

[1] 3
