2.1 R data structures

STA141A: Fundamentals of Statistical Data Science

Akira Horiguchi

R is a programming language for statistical computing and data visualization.

  • Created by statisticians with these goals in mind
  • Data structures will reflect this mindset

Broadly speaking

  • Vectors, matrices, and arrays: values must be of the same basic data type.
    • numeric, character/string, logical
  • Lists and data frames: values can be of different data types/structures.
    • can contain list of vectors, list of lists, etc.
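
A minimal sketch of this distinction. The variable names v and l are illustrative and not part of the original slides:

```r
# An atomic vector holds a single type, so mixed inputs are coerced.
v <- c(1, "a", TRUE)
class(v)            # "character"

# A list keeps each element's own type.
l <- list(1, "a", TRUE)
sapply(l, class)    # "numeric" "character" "logical"
```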

Vectors

  • R operates on named data structures, which we simply refer to as data objects.
  • A vector is a single entity consisting of an ordered collection of values.
  • All values in a vector must be of the same data type.
  • We will work with numeric, logical, and character vectors.

Numeric Vectors

  • Numeric vectors are an ordered collection of numbers.
  • They can be used in arithmetic expressions, where operations are performed element by element.
  • There are also built-in R functions that take a vector and return summary measures (e.g., mean, quantiles).
x <- c(4,2,7,8)  # concatenates the numbers 4, 2, 7, and 8
mean(x) 
[1] 5.25
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.00    3.50    5.50    5.25    7.25    8.00 

Logical Vectors

  • R allows manipulation of logical quantities.
  • The elements of a logical vector take the values TRUE or FALSE.
  • Logical vectors are generated by conditions. (Logical vectors are usually not created manually.)
  • For this purpose, we will learn about logical operators.
x <- c(4,2,7,8)  # concatenates the numbers 4, 2, 7, and 8
x < 5
[1]  TRUE  TRUE FALSE FALSE
(x %% 2) == 0  # which elements of x are even
[1]  TRUE  TRUE FALSE  TRUE
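
As a preview of the logical operators mentioned above, here is a small sketch using the same x. The particular thresholds are arbitrary choices for illustration:

```r
x <- c(4, 2, 7, 8)
(x > 3) & (x < 8)   # element-wise AND: TRUE FALSE TRUE FALSE
(x < 3) | (x > 7)   # element-wise OR:  FALSE TRUE FALSE TRUE
!(x < 5)            # element-wise NOT: FALSE FALSE TRUE TRUE
any(x > 7)          # TRUE if at least one element satisfies the condition
all(x > 1)          # TRUE only if every element satisfies the condition
```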

Character Vectors

  • Character strings are entered using matching double quotes (") or single quotes ('), e.g., "apple" or 'banana'.
  • Character vectors are a collection of character strings.
  • The paste() function is often used for creating character strings.
z <- c("hello", "world", "!")  # This is a character vector
z
[1] "hello" "world" "!"    
paste(z[1], z[2], z[3])
[1] "hello world !"
paste(z, collapse='-')
[1] "hello-world-!"
paste(z, collapse='   ')
[1] "hello   world   !"
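
A few further variations on paste(), shown as a sketch. The string "item" is made up for illustration:

```r
z <- c("hello", "world", "!")
paste(z[1], z[2], sep = "_")   # sep sets the separator between arguments: "hello_world"
paste0(z[1], z[2])             # paste0() uses no separator: "helloworld"
paste("item", 1:3)             # vectorized: "item 1" "item 2" "item 3"
```

Note the difference between sep (placed between the arguments) and collapse (used to join the elements of the result into one string).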

Factors

  • A factor is a vector that can contain only pre-defined values, called levels (e.g., only the strings "apple" and "banana"), and is used to store categorical data.
  • You may want to define the levels explicitly, e.g., to change the order in which they are printed.
x <- c("dog", "dog", "cat", "dog", "cat", "bear", "bear", "bear", "bear")
table(x)  # default order is alphabetical order
x
bear  cat  dog 
   4    2    3 
x <- factor(x, levels=c("dog", "cat", "bear"))
x  # vector elements remain the same
[1] dog  dog  cat  dog  cat  bear bear bear bear
Levels: dog cat bear
table(x)
x
 dog  cat bear 
   3    2    4 

Factors

x <- c("dog", "dog", "cat", "dog", "cat", "bear", "bear", "bear", "bear")
table(x)  # default order is alphabetical order
x
bear  cat  dog 
   4    2    3 
x <- factor(x, levels=names(sort(table(x))))  # don't need to know this code right now
x  # vector elements remain the same
[1] dog  dog  cat  dog  cat  bear bear bear bear
Levels: cat dog bear
table(x)
x
 cat  dog bear 
   2    3    4 

Factors

x <- c("dog", "dog", "cat", "dog", "cat", "bear", "bear", "bear", "bear")
table(x)  # default order is alphabetical order
x
bear  cat  dog 
   4    2    3 
x <- factor(x, levels=names(sort(table(x), decreasing=TRUE)))  # don't need to know this code right now
x  # vector elements remain the same
[1] dog  dog  cat  dog  cat  bear bear bear bear
Levels: bear dog cat
table(x)
x
bear  dog  cat 
   4    3    2 
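
To see what "only pre-defined values" means in practice, here is a short sketch. The value "fish" is a made-up example of something that is not among the levels:

```r
x <- factor(c("dog", "cat", "dog", "bear"), levels = c("dog", "cat", "bear"))
levels(x)       # the pre-defined values: "dog" "cat" "bear"
as.integer(x)   # internal integer codes, following the level order: 1 2 1 3

y <- factor(c("dog", "fish"), levels = c("dog", "cat", "bear"))
is.na(y)        # "fish" is not a level, so it becomes NA: FALSE TRUE
```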

Create a vector

Create “regular” vectors.

1:11
 [1]  1  2  3  4  5  6  7  8  9 10 11
seq(1, 17, by=2)
[1]  1  3  5  7  9 11 13 15 17
rep(-5, times=4)
[1] -5 -5 -5 -5
rep("buffalo", times=8)
[1] "buffalo" "buffalo" "buffalo" "buffalo" "buffalo" "buffalo" "buffalo"
[8] "buffalo"

You can combine c() with rep(thing_to_repeat, num_repetitions):

rep(c("eek", "a", "bear"), times=2)
[1] "eek"  "a"    "bear" "eek"  "a"    "bear"
rep(c(6, 2), times=4)
[1] 6 2 6 2 6 2 6 2
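
seq() and rep() accept other arguments as well. A brief sketch of three of them (length.out, each, and a vector-valued times):

```r
seq(0, 1, length.out = 5)           # 5 equally spaced values: 0.00 0.25 0.50 0.75 1.00
rep(c(6, 2), each = 2)              # repeat each element in turn: 6 6 2 2
rep(c("a", "b"), times = c(3, 1))   # per-element counts: "a" "a" "a" "b"
```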

Create a vector: coercion

When given values of different data types, c() coerces them into a single common type.

str(c(4L, 6L))
 int [1:2] 4 6
str(c(4L, 6))  # coerces the integer 4L into the numeric 4
 num [1:2] 4 6
str(c(3, "haha"))  # coerces the numeric 3 into the character "3"
 chr [1:2] "3" "haha"
str(c(TRUE, 7))  # coerces the logical TRUE into the numeric value 1
 num [1:2] 1 7
str(c(FALSE, 7))  # coerces the logical FALSE into the numeric value 0
 num [1:2] 0 7

Roughly, the coercion order is logical < integer < numeric < character: all values are converted to the rightmost type that is present.
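
The ordering can be checked with typeof(), as in the sketch below. Note that typeof() reports numeric values as "double", which is R's storage type for them:

```r
typeof(c(TRUE, 4L))          # "integer"
typeof(c(4L, 6.5))           # "double"
typeof(c(6.5, "haha"))       # "character"
sum(c(TRUE, FALSE, TRUE))    # logicals count as 1/0 in arithmetic: 2
```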

Create a vector with randomly generated elements

You can generate a vector of random values using, e.g., sample().

animals <- c("ant", "bug", "cat", "dog")
sample(animals, size=3)  # samples without replacement
[1] "cat" "bug" "dog"
sample(animals, size=6)  # results in an error: cannot draw 6 values without replacement from only 4
sample(animals, size=6, replace=TRUE)  # samples with replacement
[1] "ant" "ant" "cat" "dog" "ant" "bug"
runif(5, min=0, max=1)  # generates 5 random numbers between 0 and 1
[1] 0.2378454 0.7153129 0.9154155 0.1102262 0.9308610
rnorm(5)  # generates 5 random numbers from a standard normal distribution
[1]  0.1165353  1.1950628 -2.1309916 -1.8542192 -0.3959627
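
Because these functions are random, rerunning them gives different values. One common way to make results reproducible is set.seed(). In the sketch below the seed value 141 is an arbitrary choice:

```r
set.seed(141)      # fix the state of the random number generator
a <- runif(3)
set.seed(141)      # resetting the seed replays the same random draws
b <- runif(3)
identical(a, b)    # TRUE
```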

Inspecting a vector

  • There are built-in R functions that take a vector and return info about it.
x <- rpois(10000, lambda=2)  # randomly generate 10000 values from a Poisson distribution with parameter lambda=2
length(x)  # returns length of the variable x
[1] 10000
str(x)  # returns the structure of the variable x
 int [1:10000] 2 0 5 4 2 1 3 2 2 1 ...
summary(x)  # returns summary statistics
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   1.000   2.000   2.019   3.000   8.000 
table(x)  # counts occurrences of each value in x
x
   0    1    2    3    4    5    6    7    8 
1307 2693 2720 1821  932  349  132   35   11 

Subsetting a vector

Square brackets [ ] are used for indexing (i.e., accessing elements of) a vector, matrix, array, list, or data frame.

  • Three approaches to access elements:
  1. by integer index
  2. by a logical vector
  3. by name

Subsetting a vector: by integer index

Using a vector of positive integers, the corresponding elements of the vector are selected and concatenated, in that order. A vector of negative integers specifies the values to be excluded rather than included.

x <- runif(7, min=0, max=1)
x
[1] 0.5655175 0.7688468 0.7220338 0.5090330 0.3008393 0.2124152 0.4767549
x[2]
[1] 0.7688468
x[5]
[1] 0.3008393
x[-3]  # returns all but the third element (different from Python!)
[1] 0.5655175 0.7688468 0.5090330 0.3008393 0.2124152 0.4767549
x[9]  # x has only 7 elements
[1] NA
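
The index can itself be a vector of several positions. A short sketch with made-up values:

```r
x <- c(10, 20, 30, 40, 50)
x[c(1, 4)]      # first and fourth elements: 10 40
x[c(4, 1, 1)]   # order and repetition follow the index vector: 40 10 10
x[2:4]          # a range of positions: 20 30 40
x[-c(1, 2)]     # drop the first two elements: 30 40 50
```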

Subsetting a vector: by a logical vector

Values corresponding to TRUE in the index vector are selected and those corresponding to FALSE are omitted.

x <- runif(7, min=0, max=1)
x
[1] 0.88553721 0.75740886 0.09780963 0.59125508 0.65270475 0.62804784 0.86978383
x[c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE)]
[1] 0.8855372 0.7574089 0.8697838
x[c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)]
[1] 0.09780963
x < 0.5  # a logical vector!
[1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
x[x < 0.5]
[1] 0.09780963
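
Conditions can be combined or negated before they are used as an index, and which() converts a logical vector into integer positions. A sketch with fixed, made-up values so that the results are predictable:

```r
x <- c(0.9, 0.1, 0.6, 0.3)
which(x < 0.5)          # positions where the condition is TRUE: 2 4
x[x > 0.2 & x < 0.7]    # both conditions must hold: 0.6 0.3
x[!(x < 0.5)]           # negated condition: 0.9 0.6
```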

Subsetting a vector: by name

If a vector has a names attribute identifying its elements, a character vector of names may be used to select the corresponding elements.

x <- rnorm(4)
names(x) <- c('a', 'b', 'c', 'd')  # Assigns names to vector's elements
x
         a          b          c          d 
-0.2558558 -0.4432794  0.2562432  0.5204531 
str(x)
 Named num [1:4] -0.256 -0.443 0.256 0.52
 - attr(*, "names")= chr [1:4] "a" "b" "c" "d"
x[c('b', 'a')]  # Selects the elements of x with the name 'b' and 'a'
         b          a 
-0.4432794 -0.2558558 

To get the names of a vector, call names(x) again, this time without an assignment.

names(x)
[1] "a" "b" "c" "d"
y <- c(5, 8, 2)  # does not have names attribute
names(y)
NULL

Some comments on vectors

  • All elements of a vector must be of the same type (numeric, logical, character).
  • A vector can be coerced to another type with as.logical(), as.numeric(), or as.character().
  • A vector’s type can be tested with is.character(), is.logical(), etc.

Q: What will be the result of the following code?

y <- c("TRUE", "FALSE", "TRUE")
is.logical(y)
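
One way to explore this question is to run the test functions and then convert explicitly. This is only a sketch, and y2 is a name introduced here for illustration:

```r
y <- c("TRUE", "FALSE", "TRUE")
is.logical(y)         # FALSE: the quotes make these character strings
is.character(y)       # TRUE
y2 <- as.logical(y)   # converts the strings to logical values
is.logical(y2)        # TRUE
```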

Operations with vectors: Element-wise operations

x <- c(2, 3, 4)
x
[1] 2 3 4
x + 1
[1] 3 4 5
2 * x
[1] 4 6 8
x / 2
[1] 1.0 1.5 2.0
x^2
[1]  4  9 16

Operations with vectors: Vector operations

x <- c(2, 3, 4)
y <- c(5, 7, 6)
x + y
[1]  7 10 10
y - x
[1] 3 4 2
2 * x + 3 * y
[1] 19 27 26

Operations with vectors: functions that return a single number

x <- c(2, 3, 4)
sum(x)
[1] 9
mean(x)
[1] 3
var(x)
[1] 1
sd(x)
[1] 1
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    2.0     2.5     3.0     3.0     3.5     4.0 

Vectors: putting it all together

x <- rnorm(100)

Q: What is the sum of the first three values of the vector x?

A: any of the following
x[1] + x[2] + x[3]
sum(c(x[1], x[2], x[3]))
sum(x[1:3])

Vectors: putting it all together

x <- rnorm(100)

Q: What is the mean of the positive values of the vector x?

A:
mean(x[x > 0])

Vectors: putting it all together

x <- rnorm(100)

Q: How many negative values are there in the vector x?

A: any of the following
length(x[x < 0])
sum(x < 0)

all.equal function

  • To test whether values are equal, use ==.
  • Applied to two vectors, == compares them element by element and returns a logical vector.
  • To check whether two vectors are equal as a whole, up to floating-point error, use the all.equal() function.
  • It is used as all.equal(vector1, vector2, ...).
x <- seq(0.2, 1, by=0.2)
y <- c(0.2, 0.4, 0.6, 0.8, 1.0)
x == y  # the third comparison is FALSE due to floating-point error
[1]  TRUE  TRUE FALSE  TRUE  TRUE
all.equal(x, y)  # tests "near equality" to allow for floating-point error
[1] TRUE
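
When the vectors differ, all.equal() returns a character description of the difference rather than FALSE, so it is usually wrapped in isTRUE() when a single TRUE/FALSE is needed. A sketch follows. The threshold 1e-8 in the last line is an arbitrary choice for illustration:

```r
x <- seq(0.2, 1, by = 0.2)
y <- c(0.2, 0.4, 0.6, 0.8, 1.0)
identical(x, y)              # FALSE: exact comparison
isTRUE(all.equal(x, y))      # TRUE: equal up to a small tolerance
isTRUE(all.equal(1, 1.1))    # FALSE: all.equal() alone would return a message here
all(abs(x - y) < 1e-8)       # a manual alternative with an explicit threshold
```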

Matrices

You can think of a matrix as a collection of vectors of the same type and length.

  • You can also think of a matrix as a table with a certain number of rows and columns, in which values of the same type are stored.
  • “Square” or “rectangular” data structure
  • A matrix can be created by using the matrix() function, where the number of rows and columns must be specified.
matrix(1:6, ncol = 3, nrow = 2)  # Creates a matrix with values 1 to 6, 3 columns, and 2 rows (filled column by column)
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
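
The values above are filled in column by column. The sketch below shows the byrow argument, dim(), and the cbind()/rbind() helpers. The name M is introduced here for illustration:

```r
M <- matrix(1:6, nrow = 2, byrow = TRUE)   # fill row by row instead
M[1, ]             # first row is now 1 2 3
dim(M)             # number of rows and columns: 2 3
cbind(1:2, 3:4)    # bind vectors together as columns
rbind(1:2, 3:4)    # bind vectors together as rows
```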

Subsetting a matrix: by index

Indexing is similar to vectors, except that each element is indexed by a row and a column.

A = matrix(1:6, ncol = 3, nrow = 2) 
A
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
A[1, 2]
[1] 3
A[1, ]
[1] 1 3 5
A[, 3]
[1] 5 6
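
Selecting a single row or column returns a plain vector, as above. A sketch of selecting several columns, excluding a row, and keeping the matrix structure with drop = FALSE:

```r
A <- matrix(1:6, ncol = 3, nrow = 2)
A[, c(1, 3)]            # columns 1 and 3: still a 2 x 2 matrix
A[-1, ]                 # negative index drops row 1; the result is a vector: 2 4 6
A[1, , drop = FALSE]    # drop = FALSE keeps a 1 x 3 matrix
```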

Subsetting a matrix: by logical vector

As with vectors, values corresponding to TRUE in the index vector are selected and those corresponding to FALSE are omitted.

A = matrix(1:6, ncol = 3, nrow = 2) 
A
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
A[c(T, F), ]
[1] 1 3 5
A[, c(F, T, F)]
[1] 3 4
A[c(T, F), c(F, T, F)]  # Can use T for TRUE and/or F for FALSE
[1] 3

Subsetting a matrix: by name

A = matrix(1:6, ncol = 3, nrow = 2) 
# Adding row and col names
colnames(A) = c("First", "Second", "Third")
rownames(A) = c("a", "b")
A
  First Second Third
a     1      3     5
b     2      4     6
A['a', ]
 First Second  Third 
     1      3      5 
A[, 'Second']
a b 
3 4 
A['a', 'Second']
[1] 3

Matrix operations: transposition, matrix multiplication

B = matrix(1:4, ncol = 2, nrow = 2)  # 2 by 2 matrix
C = t(B)  # transpose the matrix B
B
     [,1] [,2]
[1,]    1    3
[2,]    2    4
C
     [,1] [,2]
[1,]    1    2
[2,]    3    4
C %*% B  # matrix multiplication
     [,1] [,2]
[1,]    5   11
[2,]   11   25

Matrix operations: element-wise operations

B = matrix(1:4, ncol = 2, nrow = 2)  # 2 by 2 matrix
C = t(B)  # transpose the matrix B
B + C  # element-wise addition
     [,1] [,2]
[1,]    2    5
[2,]    5    8
B * C  # element-wise multiplication
     [,1] [,2]
[1,]    1    6
[2,]    6   16

Matrix operations: column-wise or row-wise

  • Applying column-wise or row-wise functions (apply() function)
A = matrix(1:6, ncol = 3, nrow = 2)
A
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
apply(A, MARGIN = 2, mean)  # Apply the mean() function to all column vectors
[1] 1.5 3.5 5.5
apply(A, MARGIN = 1, mean)  # Apply the mean() function to all row vectors
[1] 3 4
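
For sums and means, base R also provides dedicated helpers, and apply() accepts any function that takes a vector. A brief sketch:

```r
A <- matrix(1:6, ncol = 3, nrow = 2)
colMeans(A)                  # same values as apply(A, MARGIN = 2, mean): 1.5 3.5 5.5
rowSums(A)                   # same values as apply(A, MARGIN = 1, sum): 9 12
apply(A, MARGIN = 2, max)    # column-wise maximum: 2 4 6
```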

Arrays

Arrays are a generalization of matrices.

  • Higher-order arrays are much rarer than matrices.
  • Use dim() to find the size of an array.
  • Use dimnames() to assign names along each dimension (e.g., row and column names).
  • Subsetting works similarly to subsetting with the other data structures.
A = array(1:24, dim=c(2,3,4))
print(dim(A))  
[1] 2 3 4
A[1,3,2]  # access element (1,3,2) of array A
[1] 11
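
A sketch of dimnames() on the same array. The labels "r1", "c1", "k1", and so on are made up for illustration:

```r
A <- array(1:24, dim = c(2, 3, 4))
dimnames(A) <- list(c("r1", "r2"),          # names along dimension 1
                    c("c1", "c2", "c3"),    # names along dimension 2
                    paste0("k", 1:4))       # names along dimension 3
A["r1", "c3", "k2"]    # same element as A[1, 3, 2]: 11
dim(A[, , 1])          # fixing the third index leaves a 2 x 3 matrix
```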

Lists

List elements can be of any type.

  • Thus, lists are different from vectors, matrices, and arrays.
  • You construct lists by using list().
  • You can then query the length, attributes, etc., just as for vectors.
  • Lists can contain other lists.

Combining values into list

x <- list(3, "haha")
str(x)
List of 2
 $ : num 3
 $ : chr "haha"
z <- list(x, 5.4)  # list inside a list
str(z)
List of 2
 $ :List of 2
  ..$ : num 3
  ..$ : chr "haha"
 $ : num 5.4
y <- c(x, 5.4)  # a "flat" list
str(y)
List of 3
 $ : num 3
 $ : chr "haha"
 $ : num 5.4

You can also use rep(thing_to_repeat, num_repetitions):

z <- rep(y, 2)
str(z)
List of 6
 $ : num 3
 $ : chr "haha"
 $ : num 5.4
 $ : num 3
 $ : chr "haha"
 $ : num 5.4

Subsetting a list: By element position

Use the single square bracket operator [ ] to get a sub-list, or the double square bracket operator [[ ]] to get the element itself.

profile <- list(name="Mary", no.children=3, child.ages=c(4,7,9))

profile[1]
$name
[1] "Mary"
profile[[1]]  # What is the difference between this and profile[1]? Use str().
[1] "Mary"
str(profile[1])
List of 1
 $ name: chr "Mary"
str(profile[[1]])
 chr "Mary"
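
The same difference shows up with class(), and it matters when you want to work with the extracted element. A small sketch using the same profile list:

```r
profile <- list(name = "Mary", no.children = 3, child.ages = c(4, 7, 9))
class(profile[3])       # "list": single brackets keep the list wrapper
class(profile[[3]])     # "numeric": double brackets return the element itself
profile[[3]][2]         # second element of the child.ages vector: 7
length(profile[2:3])    # single brackets can select several elements: 2
```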

Subsetting a list: By logical vector

profile <- list(name="Mary", no.children=3, child.ages=c(4,7,9))

profile[c(T,F,T)]  # Can use T instead of TRUE; F instead of FALSE (but case sensitive -- 'true' or 'True' will return an error)
$name
[1] "Mary"

$child.ages
[1] 4 7 9

Subsetting a list: By name

Use [ ], [[ ]], or the $ operator.

profile <- list(name="Mary", no.children=3, child.ages=c(4,7,9))
profile$child.ages
[1] 4 7 9
profile['child.ages']
$child.ages
[1] 4 7 9
profile[['child.ages']]
[1] 4 7 9
profile[c('child.ages', 'name')]
$child.ages
[1] 4 7 9

$name
[1] "Mary"

Data frames

Examples of built-in datasets

Data frames are a convenient way to store datasets.

  • Many datasets are publicly available through base R or through an R package.
  • Commonly used datasets: USArrests, PlantGrowth, ToothGrowth, mtcars
  • Useful if you want to see if some statistical/ML method (perhaps your own!) generalizes well to various data sets.
  • Saves you the trouble of collecting/generating data yourself
MASS::housing  # dataset: Frequency Table from a Copenhagen Housing Conditions Survey
      Sat   Infl      Type Cont Freq
1     Low    Low     Tower  Low   21
2  Medium    Low     Tower  Low   21
3    High    Low     Tower  Low   28
4     Low Medium     Tower  Low   34
5  Medium Medium     Tower  Low   22
...

Data frames

A data frame is a list with class “data.frame”.

  • Most common use case: each element in the list is a vector, and all vectors have the same length.
  • Then the data frame has a rectangular structure (each column is a vector).
  • Its rows can be extracted by using matrix conventions.
  • Its columns can be extracted using matrix conventions (but also using list conventions).
  • A simple way to create a data frame is to use the read.table() function to read an entire data frame from an external file.
    • Can also use read.csv().
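
A minimal sketch of both points: a data frame really is a list, and read.csv() produces one. The column names and values are made up for illustration. Because no external file is assumed here, the text argument of read.csv() is used to read from a string instead of a file path:

```r
# Build a small data frame directly.
df <- data.frame(id = 1:3, score = c(90.5, 72, 88))
is.list(df)    # TRUE: a data frame is a list of columns
class(df)      # "data.frame"
dim(df)        # 3 rows, 2 columns

# Read the same data in CSV form from a string.
df2 <- read.csv(text = "id,score\n1,90.5\n2,72\n3,88")
str(df2)
```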

Data frames: Subsetting

  • Similar to matrix subsetting, you can specify rows and columns.
  • If you do not specify rows, all rows will automatically be selected.
  • It is also similar to list subsetting: you can use [, [[, or $. Used list-style, [ returns a data frame, while [[ and $ return a single column as a vector.

Subsetting:

  • By index (using [ or [[).
  • By logical vector (using [).
  • By name (using [, [[, or $).

Example: Subset

# Create a data frame
firstNames <- sample(c("Dog", "Cat", "Bug"), replace = TRUE, size = 6)
dat <- data.frame(age = 11:16, weight = 1:6, name = firstNames)
dat
  age weight name
1  11      1  Bug
2  12      2  Cat
3  13      3  Bug
4  14      4  Bug
5  15      5  Dog
6  16      6  Bug

Example: Subset by index

dat[1, 3]  # Like matrix
[1] "Bug"
dat[1, ]  # Like matrix
  age weight name
1  11      1  Bug
dat[, 1]  # Like matrix
[1] 11 12 13 14 15 16
dat[c(3,1), ]  # Like matrix
  age weight name
3  13      3  Bug
1  11      1  Bug

Example: Subset by logical vector

dat[dat$weight > 4, ]
  age weight name
5  15      5  Dog
6  16      6  Bug
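
Conditions can be combined, and a column can be picked at the same time. In the sketch below, dat is rebuilt with the names fixed to the values printed above (the original draws them at random), and stringsAsFactors = FALSE is set explicitly so the name column stays character in any R version:

```r
dat <- data.frame(age = 11:16, weight = 1:6,
                  name = c("Bug", "Cat", "Bug", "Bug", "Dog", "Bug"),
                  stringsAsFactors = FALSE)
dat[dat$name == "Bug" & dat$age > 12, ]   # rows 3, 4, and 6 satisfy both conditions
dat[dat$weight > 4, "name"]               # filter rows and pick one column: "Dog" "Bug"
subset(dat, weight > 4)                   # base R helper selecting the same rows 5 and 6
```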

Example: Subset by name

dat$age  # Like list
[1] 11 12 13 14 15 16
dat[, 'age']  # Like matrix
[1] 11 12 13 14 15 16
dat[, c('age', 'name')]  # Like matrix
  age name
1  11  Bug
2  12  Cat
3  13  Bug
4  14  Bug
5  15  Dog
6  16  Bug
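
For comparison, a sketch of the remaining list-style forms, plus drop = FALSE for keeping a single column as a data frame. Here dat is rebuilt without the name column for brevity:

```r
dat <- data.frame(age = 11:16, weight = 1:6)
dat[["age"]]                  # like list: returns the column as a vector
dat["age"]                    # like list: returns a one-column data frame
dat[, "age", drop = FALSE]    # like matrix, but keeps the data frame structure
```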