2.1 R data structures

STA141A: Fundamentals of Statistical Data Science

Akira Horiguchi

Overview

R is a programming language for statistical computing and data visualization.

It was created by statisticians with these goals in mind.
Its data structures reflect this mindset.

Broadly speaking

Vectors, matrices, and arrays: values must be of the same basic data type.
- numeric, character/string, logical
Lists and data frames: values can be of different data types/structures.
- can contain list of vectors, list of lists, etc.

Vectors

R operates on named data structures, which we simply refer to as data objects.
A vector is a single entity consisting of an ordered collection of values.
All values in a vector must be of the same data type.
Numeric vector
Logical vector
Character vector

Numeric Vectors

A numeric vector is an ordered collection of numbers.
Numeric vectors can be used in arithmetic expressions, where operations are performed element by element.
There are also built-in R functions that take a vector and return summary measures (e.g., the mean or quantiles).

x <- c(4,2,7,8)  # concatenates the numbers 4, 2, 7, and 8
mean(x)

[1] 5.25

summary(x)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.00    3.50    5.50    5.25    7.25    8.00

Logical Vectors

R allows manipulation of logical quantities.
The elements of a logical vector take the values TRUE or FALSE (or NA for a missing value).
Logical vectors are usually generated by conditions rather than created manually.
For this purpose, we will learn about logical operators.

x <- c(4,2,7,8)  # concatenates the numbers 4, 2, 7, and 8
x < 5

[1]  TRUE  TRUE FALSE FALSE

(x %% 2) == 0  # which elements of x are even

[1]  TRUE  TRUE FALSE  TRUE
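As a small sketch of the logical operators mentioned above, conditions can be combined element by element with & (and), | (or), and ! (not). The vector x is the same one used on this slide; the particular thresholds are only for illustration.

```r
x <- c(4, 2, 7, 8)

# & is element-wise AND: elements that are less than 8 and also even
x < 8 & x %% 2 == 0   # TRUE TRUE FALSE FALSE

# | is element-wise OR: elements that are below 3 or above 7
x < 3 | x > 7         # FALSE TRUE FALSE TRUE

# ! negates every element of a logical vector
!(x < 5)              # FALSE FALSE TRUE TRUE

# any() and all() reduce a logical vector to a single TRUE/FALSE
any(x > 7)            # TRUE
all(x > 1)            # TRUE
```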

Character Vectors

Character strings are entered using matching double quotes (" ") or single quotes (' '), e.g., "apple" or 'banana'.
Character vectors are a collection of character strings.
The paste() function is often used for creating character strings.

z <- c("hello", "world", "!")  # This is a character vector
z

[1] "hello" "world" "!"

paste(z[1], z[2], z[3])

[1] "hello world !"

paste(z, collapse='-')

[1] "hello-world-!"

paste(z, collapse='   ')

[1] "hello   world   !"
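paste() is also vectorized, which the examples above do not show. The sketch below illustrates the sep argument and the paste0() shortcut; the strings used are arbitrary.

```r
# paste() works element by element; the default separator is a space
paste("x", 1:3)                    # "x 1" "x 2" "x 3"

# sep sets the separator between the arguments
paste("x", 1:3, sep = "_")         # "x_1" "x_2" "x_3"

# paste0() is paste() with no separator
paste0("x", 1:3)                   # "x1" "x2" "x3"

# collapse then joins the resulting strings into a single string
paste0("x", 1:3, collapse = "+")   # "x1+x2+x3"
```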

Factors

A factor is a vector that can contain only pre-defined values (e.g., only the strings "apple" and 'banana'), and is used to store categorical data.
You may want to define the levels explicitly, e.g., to change their printed order.

x <- c("dog", "dog", "cat", "dog", "cat", "bear", "bear", "bear", "bear")
table(x)  # default order is alphabetical order

x
bear  cat  dog 
   4    2    3

x <- factor(x, levels=c("dog", "cat", "bear"))
x  # vector elements remain the same

[1] dog  dog  cat  dog  cat  bear bear bear bear
Levels: dog cat bear

table(x)

x
 dog  cat bear 
   3    2    4

The levels can also be ordered by frequency, here from least to most frequent.

x <- c("dog", "dog", "cat", "dog", "cat", "bear", "bear", "bear", "bear")
table(x)  # default order is alphabetical order

x
bear  cat  dog 
   4    2    3

x <- factor(x, levels=names(sort(table(x))))  # don't need to know this code right now
x  # vector elements remain the same

[1] dog  dog  cat  dog  cat  bear bear bear bear
Levels: cat dog bear

table(x)

x
 cat  dog bear 
   2    3    4

Or from most to least frequent.

x <- c("dog", "dog", "cat", "dog", "cat", "bear", "bear", "bear", "bear")
table(x)  # default order is alphabetical order

x
bear  cat  dog 
   4    2    3

x <- factor(x, levels=names(sort(table(x), decreasing=TRUE)))  # don't need to know this code right now
x  # vector elements remain the same

[1] dog  dog  cat  dog  cat  bear bear bear bear
Levels: bear dog cat

table(x)

x
bear  dog  cat 
   4    3    2
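To make "only pre-defined values" concrete, here is a minimal sketch that inspects a factor's levels and shows what happens to a value outside them. The animal names are illustrative, and "fish" is deliberately not among the levels.

```r
x <- factor(c("dog", "dog", "cat", "bear"), levels = c("dog", "cat", "bear"))

levels(x)       # "dog" "cat" "bear"
nlevels(x)      # 3

# Internally, a factor stores integer codes that point into the levels
as.integer(x)   # 1 1 2 3

# A value that is not one of the levels becomes NA
y <- factor(c("dog", "fish"), levels = c("dog", "cat"))
is.na(y)        # FALSE TRUE
```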

Create a vector

Create “regular” vectors.

1:11

 [1]  1  2  3  4  5  6  7  8  9 10 11

seq(1, 17, by=2)

[1]  1  3  5  7  9 11 13 15 17

rep(-5, times=4)

[1] -5 -5 -5 -5

rep("buffalo", times=8)

[1] "buffalo" "buffalo" "buffalo" "buffalo" "buffalo" "buffalo" "buffalo"
[8] "buffalo"

You can combine c() with rep(thing_to_repeat, num_repetitions):

rep(c("eek", "a", "bear"), times=2)

[1] "eek"  "a"    "bear" "eek"  "a"    "bear"

rep(c(6, 2), times=4)

[1] 6 2 6 2 6 2 6 2
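A few other arguments of seq() and rep() can be handy for regular vectors. This is a short sketch with arbitrary values, not an exhaustive list.

```r
# length.out fixes the number of points instead of the step size
seq(0, 1, length.out = 5)           # 0.00 0.25 0.50 0.75 1.00

# each repeats every element before moving on to the next one
rep(c(6, 2), each = 2)              # 6 6 2 2

# times can be a vector: one repetition count per element
rep(c("a", "b"), times = c(3, 1))   # "a" "a" "a" "b"

# seq_len(n) gives 1, 2, ..., n
seq_len(4)                          # 1 2 3 4
```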

Create a vector: coercion

When given values of different types, c() coerces them to a single common type.

str(c(4L, 6L))

 int [1:2] 4 6

str(c(4L, 6))  # coerces the integer 4L into the numeric 4

 num [1:2] 4 6

str(c(3, "haha"))  # coerces the numeric 3 into the character "3"

 chr [1:2] "3" "haha"

str(c(TRUE, 7))  # coerces the logical TRUE into the numeric value 1

 num [1:2] 1 7

str(c(FALSE, 7))  # coerces the logical FALSE into the numeric value 0

 num [1:2] 0 7

Roughly, the coercion order is logical < integer < numeric < character: the result takes the highest type present.
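Coercion of logicals to numbers is useful in practice, and conversions can also be requested explicitly. The following is a small sketch; the vector x and the threshold 3 are only for illustration.

```r
x <- c(4, 2, 7, 8)

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUEs
sum(x > 3)          # 3
# ... and mean() gives the proportion of TRUEs
mean(x > 3)         # 0.75

# Explicit conversion with the as.*() functions
as.numeric("3.5")   # 3.5
as.character(10)    # "10"
as.integer(TRUE)    # 1
```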

Create a vector with randomly generated elements

You can generate a vector of random values using, e.g., sample().

animals <- c("ant", "bug", "cat", "dog")
sample(animals, size=3)  # samples without replacement

[1] "bug" "ant" "dog"

sample(animals, size=6)  # results in an error: cannot draw 6 of 4 items without replacement

sample(animals, size=6, replace=TRUE)  # samples with replacement

[1] "bug" "cat" "cat" "cat" "bug" "bug"

runif(5, min=0, max=1)  # generates 5 random numbers between 0 and 1

[1] 0.6920607 0.9312297 0.2116521 0.3092926 0.2847530

rnorm(5)  # generates 5 random numbers from a standard normal distribution

[1]  0.04980434  1.56936624  1.41340558 -1.43368174 -0.31236580
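Because these values are random, rerunning the code above gives different results each time. One common way to make results reproducible is to set the random seed first; the seed value 141 below is an arbitrary choice.

```r
set.seed(141)     # fix the state of the random number generator
a <- runif(3)

set.seed(141)     # reset to the same state
b <- runif(3)

identical(a, b)   # TRUE: the same seed gives the same draws
```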

Inspecting a vector

There are built-in R functions that take a vector and return info about it.

x <- rpois(10000, lambda=2)  # randomly generate 10000 values from a Poisson distribution with parameter lambda=2
length(x)  # returns length of the variable x

[1] 10000

str(x)  # returns the structure of the variable x

 int [1:10000] 2 1 1 0 6 3 1 4 0 5 ...

summary(x)  # returns summary statistics

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   1.000   2.000   2.003   3.000  10.000

table(x)  # counts occurrences of each value in x

x
   0    1    2    3    4    5    6    7    8    9   10 
1336 2734 2685 1826  890  352  125   40   10    1    1

Subsetting a vector

Square brackets [ ] are used for indexing (i.e., accessing elements of) a vector, matrix, array, list, or data frame.

Three approaches to access elements:

by integer index
by a logical vector
by name

Subsetting a vector: by integer index

Using a vector of positive integers, the corresponding elements of the vector are selected and concatenated, in that order. A vector of negative integers specifies the values to be excluded rather than included.

x <- runif(7, min=0, max=1)
x

[1] 0.6490791 0.1544339 0.3551344 0.5247749 0.4393103 0.3635327 0.4119610

x[2]

[1] 0.1544339

x[5]

[1] 0.4393103

x[-3]  # returns all but the third element (different from Python!)

[1] 0.6490791 0.1544339 0.5247749 0.4393103 0.3635327 0.4119610

x[9]  # x has only 7 elements

[1] NA
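The examples above use a single index. As a sketch of indexing with a vector of integers (using fixed values so the results are easy to check):

```r
x <- c(10, 20, 30, 40, 50)

x[c(1, 3)]      # 10 30: elements 1 and 3
x[c(3, 1, 1)]   # 30 10 10: order is respected, and repeats are allowed
x[2:4]          # 20 30 40
x[-c(1, 2)]     # 30 40 50: everything except elements 1 and 2
x[length(x)]    # 50: the last element
```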

Subsetting a vector: by a logical vector

Values corresponding to TRUE in the index vector are selected and those corresponding to FALSE are omitted.

x <- runif(7, min=0, max=1)
x

[1] 0.6186568 0.8780956 0.2644407 0.4669568 0.5955775 0.9808410 0.9834137

x[c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE)]

[1] 0.6186568 0.8780956 0.9834137

x[c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)]

[1] 0.2644407

x < 0.5  # a logical vector!

[1] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE

x[x < 0.5]

[1] 0.2644407 0.4669568

Subsetting a vector: by name

If a vector has a names attribute to identify its components, a sub-vector of the names vector may be used to select the elements.

x <- rnorm(4)
names(x) <- c('a', 'b', 'c', 'd')  # Assigns names to vector's elements
x

          a           b           c           d 
 0.42191630  0.12014876 -0.01132619  0.92015369

str(x)

 Named num [1:4] 0.4219 0.1201 -0.0113 0.9202
 - attr(*, "names")= chr [1:4] "a" "b" "c" "d"

x[c('b', 'a')]  # Selects the elements of x with the name 'b' and 'a'

        b         a 
0.1201488 0.4219163

To get the names of a vector, use names(x).

names(x)

[1] "a" "b" "c" "d"

y <- c(5, 8, 2)  # does not have names attribute
names(y)

NULL

Some comments on vectors

All elements of a vector must be of the same type (numeric, logical, character).
A vector can be converted to another type with as.logical(), as.numeric(), or as.character().
A vector’s type can be tested with is.character(), is.logical(), etc.

Q: What will be the result of the following code?

y <- c("TRUE", "FALSE", "TRUE")
is.logical(y)
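One way to check your answer is to test the type directly and then convert it. This is a short sketch using the same y as in the question.

```r
y <- c("TRUE", "FALSE", "TRUE")

is.logical(y)     # FALSE: the quotes make these character strings
is.character(y)   # TRUE

z <- as.logical(y)   # convert the strings to logical values
is.logical(z)     # TRUE
sum(z)            # 2: number of TRUE values
```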

Operations with vectors: Element-wise operations

x <- c(2, 3, 4)
x

[1] 2 3 4

x + 1

[1] 3 4 5

2 * x

[1] 4 6 8

x / 2

[1] 1.0 1.5 2.0

x^2

[1]  4  9 16

Operations with vectors: Vector operations

x <- c(2, 3, 4)
y <- c(5, 7, 6)
x + y

[1]  7 10 10

y - x

[1] 3 4 2

2 * x + 3 * y

[1] 19 27 26

Operations with vectors: functions that return a single number

x <- c(2, 3, 4)
sum(x)

[1] 9

mean(x)

[1] 3

var(x)

[1] 1

sd(x)

[1] 1

summary(x)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    2.0     2.5     3.0     3.0     3.5     4.0

Vectors: putting it all together

x <- rnorm(100)

Q: What is the sum of the first three values of the vector x?

x[1] + x[2] + x[3]
sum(c(x[1], x[2], x[3]))
sum(x[1:3])

Vectors: putting it all together

x <- rnorm(100)

Q: What is the mean of the positive values of the vector x?

mean(x[x > 0])

Vectors: putting it all together

x <- rnorm(100)

Q: How many negative values are there in the vector x?

length(x[x < 0])
sum(x < 0)

all.equal function

To test whether two values are equal, use ==.
Applied to two vectors, == compares them component-wise and returns a logical vector.
To check whether two vectors are (nearly) equal as a whole, use the all.equal function.
It is used as all.equal(vector1, vector2, ...).

x <- seq(0.2, 1, by=0.2)
y <- c(0.2, 0.4, 0.6, 0.8, 1.0)
x == y  # the third comparison is FALSE due to floating-point error

[1]  TRUE  TRUE FALSE  TRUE  TRUE

all.equal(x, y)  # tests "near equality" to allow for floating-point error

[1] TRUE
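When the vectors differ, all.equal() returns a character description of the difference rather than FALSE. A common pattern, sketched below, is therefore to wrap it in isTRUE() when a single TRUE/FALSE is needed. The values 1 and 1.1 and the tolerance 0.2 are only for illustration.

```r
x <- seq(0.2, 1, by = 0.2)
y <- c(0.2, 0.4, 0.6, 0.8, 1.0)

all(x == y)                 # FALSE: exact comparison fails at one position
isTRUE(all.equal(x, y))     # TRUE: equal up to a small tolerance

# For unequal inputs, all.equal() returns a message, not FALSE
res <- all.equal(1, 1.1)
is.character(res)           # TRUE
isTRUE(res)                 # FALSE

# The tolerance argument controls how close is "close enough"
isTRUE(all.equal(1, 1.1, tolerance = 0.2))   # TRUE
```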

Matrices

You can think of a matrix as a collection of vectors of the same type and length.

You can also think of a matrix as a table with a certain number of rows and columns, in which information of the same type is stored.
“Square” or “rectangular” data structure
A matrix can be created with the matrix() function by specifying the number of rows and/or columns; by default, values are filled in column by column.

matrix(1:6, ncol = 3, nrow = 2)  # Creates matrix with values 1 to 6, with 3 columns, 2 rows

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
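As a sketch of two related ways to build a matrix: byrow = TRUE fills row by row instead, and cbind() or rbind() glue vectors of the same length together as columns or rows. The variable names M, A1, and A2 are hypothetical.

```r
# Fill row by row instead of column by column
M <- matrix(1:6, nrow = 2, byrow = TRUE)
M[1, ]    # 1 2 3
M[2, ]    # 4 5 6

# A matrix as a collection of vectors: bind them as columns ...
A1 <- cbind(c(1, 2), c(3, 4), c(5, 6))
# ... or as rows
A2 <- rbind(c(1, 3, 5), c(2, 4, 6))

dim(A1)           # 2 3: two rows, three columns
all(A1 == A2)     # TRUE: both give the same 2 x 3 matrix
```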

Subsetting a matrix: by index

It is similar to vectors: each element is indexed by row and column.

A = matrix(1:6, ncol = 3, nrow = 2) 
A

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

A[1, 2]

[1] 3

A[1, ]

[1] 1 3 5

A[, 3]

[1] 5 6
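Note that selecting a single row or column returns a plain vector, as in the outputs above. A minimal sketch of how to keep the matrix structure with drop = FALSE, and how to select several columns at once:

```r
A <- matrix(1:6, ncol = 3, nrow = 2)

# By default a single row is simplified to a vector (no dimensions)
is.null(dim(A[1, ]))         # TRUE

# drop = FALSE keeps the result as a 1 x 3 matrix
dim(A[1, , drop = FALSE])    # 1 3

# A vector of indices selects several columns
A[, c(1, 3)]                 # 2 x 2 matrix with columns 1 and 3
```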

Subsetting a matrix: by logical vector

It is similar to vectors: values corresponding to TRUE in the index vector are selected and those corresponding to FALSE are omitted.

A = matrix(1:6, ncol = 3, nrow = 2) 
A

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

A[c(T, F), ]

[1] 1 3 5

A[, c(F, T, F)]

[1] 3 4

A[c(T, F), c(F, T, F)]  # Can use T for TRUE and/or F for FALSE

[1] 3

Subsetting a matrix: by name

A = matrix(1:6, ncol = 3, nrow = 2) 
# Adding row and col names
colnames(A) = c("First", "Second", "Third")
rownames(A) = c("a", "b")
A

  First Second Third
a     1      3     5
b     2      4     6

A['a', ]

 First Second  Third 
     1      3      5

A[, 'Second']

a b 
3 4

A['a', 'Second']

[1] 3

Matrix operations: transposition, matrix multiplication

B = matrix(1:4, ncol = 2, nrow = 2)  # 2 by 2 matrix
C = t(B)  # transpose the matrix B
B

     [,1] [,2]
[1,]    1    3
[2,]    2    4

C

     [,1] [,2]
[1,]    1    2
[2,]    3    4

C %*% B  # matrix multiplication

     [,1] [,2]
[1,]    5   11
[2,]   11   25

Matrix operations: element-wise operations

B = matrix(1:4, ncol = 2, nrow = 2)  # 2 by 2 matrix
C = t(B)  # transpose the matrix B
B + C  # element-wise addition

     [,1] [,2]
[1,]    2    5
[2,]    5    8

B * C  # element-wise multiplication

     [,1] [,2]
[1,]    1    6
[2,]    6   16

Matrix operations: column-wise or row-wise

Applying column-wise or row-wise functions (apply() function)

A = matrix(1:6, ncol = 3, nrow = 2)
A

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

apply(A, MARGIN = 2, mean)  # Apply the mean() function to all column vectors

[1] 1.5 3.5 5.5

apply(A, MARGIN = 1, mean)  # Apply the mean() function to all row vectors

[1] 3 4
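apply() accepts any function, including one you define yourself, and base R also provides shortcuts for the most common cases. The sketch below uses the same matrix A; the "range" function passed to apply() is only an example.

```r
A <- matrix(1:6, ncol = 3, nrow = 2)

# Built-in shortcuts for column and row means and sums
colMeans(A)   # 1.5 3.5 5.5 (same as apply(A, MARGIN = 2, mean))
rowSums(A)    # 9 12

# apply() with a custom function: max minus min of each column
apply(A, MARGIN = 2, function(col) max(col) - min(col))   # 1 1 1
```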

Arrays

Arrays are a generalization of matrices.

Higher-dimensional arrays are much rarer than matrices.
Use dim() to find the size of an array.
Use dimnames() to assign names along each dimension.
Subsetting works similarly to subsetting with the other data structures.

A = array(1:24, dim=c(2,3,4))
print(dim(A))

[1] 2 3 4

A[1,3,2]  # access element (1,3,2) of array A

[1] 11
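As a sketch of dimnames() and of subsetting by name, the example below labels the three dimensions of the same array. The labels r1, c1, k1, and so on are hypothetical.

```r
A <- array(1:24, dim = c(2, 3, 4))

# One character vector of names per dimension
dimnames(A) <- list(c("r1", "r2"), c("c1", "c2", "c3"), paste0("k", 1:4))

A["r1", "c3", "k2"]   # 11, the same element as A[1, 3, 2]

# Leaving an index empty selects everything along that dimension
dim(A[, , 1])         # 2 3: the first 2 x 3 "slice" is a matrix
```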

Lists

List elements can be of any type.

Thus, lists are different from vectors, matrices, and arrays.
You construct lists by using list().
You can then return length, attributes, etc., just as for vectors.
Lists can contain other lists.

Combining values into list

x <- list(3, "haha")
str(x)

List of 2
 $ : num 3
 $ : chr "haha"

z <- list(x, 5.4)  # list inside a list
str(z)

List of 2
 $ :List of 2
  ..$ : num 3
  ..$ : chr "haha"
 $ : num 5.4

y <- c(x, 5.4)  # a "flat" list
str(y)

List of 3
 $ : num 3
 $ : chr "haha"
 $ : num 5.4

Can also use rep(thing_to_repeat, num_repetitions):

z <- rep(y, 2)
str(z)

List of 6
 $ : num 3
 $ : chr "haha"
 $ : num 5.4
 $ : num 3
 $ : chr "haha"
 $ : num 5.4

Subsetting a list: By element position

Use the single square bracket operator [ ], which returns a sub-list, or the double square bracket operator [[ ]], which returns the element itself.

profile <- list(name="Mary", no.children=3, child.ages=c(4,7,9))

profile[1]

$name
[1] "Mary"

profile[[1]]  # What is the difference between this and profile[1]? Use str().

[1] "Mary"

str(profile[1])

List of 1
 $ name: chr "Mary"

str(profile[[1]])

 chr "Mary"
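The difference matters as soon as you want to work with the extracted value. A short sketch using the same profile list:

```r
profile <- list(name = "Mary", no.children = 3, child.ages = c(4, 7, 9))

# [[ ]] returns the element itself, so it can be used directly
profile[[2]] + 1        # 4
profile[[3]][2]         # 7: second entry of the third element

# [ ] returns a list, even when selecting several elements
length(profile[2:3])    # 2
is.list(profile[2])     # TRUE, so profile[2] + 1 would be an error
```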

Subsetting a list: By logical vector

profile <- list(name="Mary", no.children=3, child.ages=c(4,7,9))

profile[c(T,F,T)]  # Can use T instead of TRUE; F instead of FALSE (but case sensitive -- 'true' or 'True' will return an error)

$name
[1] "Mary"

$child.ages
[1] 4 7 9

Subsetting a list: By name

Use [ ], [[ ]], or the $ operator.

profile <- list(name="Mary", no.children=3, child.ages=c(4,7,9))
profile$child.ages

[1] 4 7 9

profile['child.ages']

$child.ages
[1] 4 7 9

profile[['child.ages']]

[1] 4 7 9

profile[c('child.ages', 'name')]

$child.ages
[1] 4 7 9

$name
[1] "Mary"

Homogeneous vs heterogeneous data structures

Why might one want to use a vector instead of a list?

Vector

c(3, 4) * 2

[1] 6 8

List

list(3, "haha") * 2
# Error in `list(3, "haha") * 2`: non-numeric argument to binary operator

list(3, 4) * 2
# Error in list(3, 4) * 2 : non-numeric argument to binary operator
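Arithmetic is not vectorized over lists, but a function can still be applied to each element. A minimal sketch using lapply() and sapply(); the name values is hypothetical.

```r
values <- list(3, 4)

# lapply() applies a function to each element and returns a list
doubled <- lapply(values, function(v) v * 2)
str(doubled)

# sapply() does the same but simplifies the result to a vector when it can
sapply(values, function(v) v * 2)   # 6 8

# If all elements are numbers, unlist() turns the list into a vector first
unlist(values) * 2                  # 6 8
```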

Data frames

Examples of built-in datasets

Data frames are a convenient way to store datasets.

Many datasets are publicly available through base R or through an R package.
Commonly used datasets: USArrests, PlantGrowth, ToothGrowth, mtcars
Useful if you want to see if some statistical/ML method (perhaps your own!) generalizes well to various data sets.
Saves you the trouble of collecting/generating data yourself.

MASS::housing  # dataset: Frequency Table from a Copenhagen Housing Conditions Survey

      Sat   Infl      Type Cont Freq
1     Low    Low     Tower  Low   21
2  Medium    Low     Tower  Low   21
3    High    Low     Tower  Low   28
4     Low Medium     Tower  Low   34
5  Medium Medium     Tower  Low   22
...

Data frame

A data frame is a list with class “data.frame”.

Most common use case: each element in the list is a vector, and all vectors have the same length.
Then the data frame has a rectangular structure (each column is a vector).
Its rows can be extracted by using matrix conventions.
Its columns can be extracted using matrix conventions (but also using list conventions).
A simple way to obtain a data frame is to use the read.table() function to read an entire data frame from an external file.
- Can also use read.csv().
- To construct a data frame from scratch, use data.frame().

Data frame: Subsetting

Similar to matrix subsetting, you can specify rows and columns.
If you do not specify rows, all rows will automatically be selected.
It is also similar to list subsetting: you can use [, [[, or $, which return a data frame or a vector depending on the form used.

Subsetting:

By index (using [ or [[).
By logical vector (using [).
By name (using [, [[, or $).

Example: Subset

# Create a data frame
firstNames <- sample(c("Dog", "Cat", "Bug"), replace = TRUE, size = 6)
dat <- data.frame(age = 11:16, weight = 1:6, name = firstNames)
dat

  age weight name
1  11      1  Cat
2  12      2  Dog
3  13      3  Bug
4  14      4  Bug
5  15      5  Bug
6  16      6  Cat

Example: Subset by index

dat[1, 3]  # Like matrix

[1] "Cat"

dat[1, ]  # Like matrix

  age weight name
1  11      1  Cat

dat[, 1]  # Like matrix

[1] 11 12 13 14 15 16

dat[c(3,1), ]  # Like matrix

  age weight name
3  13      3  Bug
1  11      1  Cat

Example: Subset by logical vector

dat[dat$weight > 4, ]

  age weight name
5  15      5  Bug
6  16      6  Cat
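Conditions can be combined with & and |, and rows and columns can be selected at the same time. The sketch below rebuilds dat with the names fixed to the values printed above (rather than sampled) so that the results are reproducible.

```r
dat <- data.frame(age = 11:16, weight = 1:6,
                  name = c("Cat", "Dog", "Bug", "Bug", "Bug", "Cat"))

# Rows where both conditions hold
heavy_bugs <- dat[dat$weight > 2 & dat$name == "Bug", ]
heavy_bugs          # rows 3, 4, and 5

# Rows where either condition holds, keeping only two columns
dat[dat$name == "Cat" | dat$age < 12, c("age", "name")]   # rows 1 and 6

# A logical vector can also subset a single column
dat$age[dat$name == "Bug"]   # 13 14 15
```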

Example: Subset by name

dat$age  # Like list

[1] 11 12 13 14 15 16

dat[, 'age']  # Like matrix

[1] 11 12 13 14 15 16

dat[, c('age', 'name')]  # Like matrix

  age name
1  11  Cat
2  12  Dog
3  13  Bug
4  14  Bug
5  15  Bug
6  16  Cat
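Whether the result is a vector or a data frame depends on the form of the subset. A small sketch with a shortened, hypothetical version of dat:

```r
dat <- data.frame(age = 11:13, name = c("Cat", "Dog", "Bug"))

# List-style single bracket: a data frame with one column
is.data.frame(dat["age"])                    # TRUE

# Double bracket or $: the column itself, as a vector
is.data.frame(dat[["age"]])                  # FALSE
dat[["age"]]                                 # 11 12 13

# Matrix-style with drop = FALSE keeps the data frame structure
is.data.frame(dat[, "age", drop = FALSE])    # TRUE
```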

Tibble

A tibble is a modern, opinionated version of a data frame.

library(tibble)
tibble(mtcars)

# A tibble: 32 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# ℹ 22 more rows

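Subset syntax is largely the same as with data frames, except that [ always returns a tibble (e.g., selecting a single column with [ gives a one-column tibble, not a vector).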
