[1] 5.25
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.00 3.50 5.50 5.25 7.25 8.00
STA141A: Fundamentals of Statistical Data Science
Logical values are TRUE and FALSE.
Character strings are delimited by double " or single ' quotes, e.g., "apple" or 'banana'.
The paste() function is often used for creating character strings.
A factor looks like a character vector (e.g., "apple" and 'banana'), and is used to store categorical data.
x <- c("dog", "dog", "cat", "dog", "cat", "ewe", "bear", "bear", "bear", "bear")
table(x) # default order is alphabetical order
x
bear cat dog ewe
4 2 3 1
x <- factor(x, levels=names(sort(table(x)))) # don't need to know this code right now
x # vector elements remain the same
[1] dog dog cat dog cat ewe bear bear bear bear
Levels: ewe cat dog bear
table(x) # counts now follow the new level order
x
ewe cat dog bear
1 2 3 4
Same example, with the levels ordered by decreasing frequency instead:
x <- c("dog", "dog", "cat", "dog", "cat", "ewe", "bear", "bear", "bear", "bear")
table(x) # default order is alphabetical order
x
bear cat dog ewe
4 2 3 1
x <- factor(x, levels=names(sort(table(x), decreasing=TRUE))) # don't need to know this code right now
x # vector elements remain the same
[1] dog dog cat dog cat ewe bear bear bear bear
Levels: bear dog cat ewe
x
bear dog cat ewe
4 3 2 1
Create “regular” vectors.
c() will sometimes coerce different data types into the same type.
int [1:2] 4 6
num [1:2] 4 6
chr [1:2] "3" "haha"
num [1:2] 1 7
num [1:2] 0 7
Roughly, the coercion order is logical < integer < numeric < character: the result takes the right-most type present among the elements.
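The calls that produced the output above are not shown, so the following is only a sketch of inputs that would give output of the same form. The variable names are hypothetical.

```r
# Each call mixes two types; c() promotes everything to the "largest" type.
ints  <- c(4L, 6L)       # integer and integer: stays integer
nums  <- c(4L, 6)        # integer and numeric: becomes numeric
chars <- c(3, "haha")    # numeric and character: becomes character
lgl1  <- c(TRUE, 7)      # logical and numeric: TRUE becomes 1
lgl0  <- c(FALSE, 7)     # logical and numeric: FALSE becomes 0

# str() reports the type that each vector ended up with.
str(ints)
str(nums)
str(chars)
str(lgl1)
str(lgl0)
```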
Can generate a vector containing random values using, e.g., sample().
[1] "cat" "dog" "ant"
[1] "ant" "dog" "cat" "ant" "bug" "dog"
[1] 0.1480565 0.3443080 0.6948802 0.2596952 0.4794379
[1] 1.3096479 0.3115038 -1.1938093 -1.3670770 -1.0312860
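A minimal sketch of calls that generate random vectors like those above. The pool of values, the seed, and the choice of runif() and rnorm() for the numeric draws are assumptions, so the printed values will differ from the output shown.

```r
set.seed(1)  # arbitrary seed, only so that this sketch is reproducible

animals <- c("ant", "bug", "cat", "dog")          # hypothetical pool of values
s1 <- sample(animals, size = 3)                   # 3 draws without replacement
s2 <- sample(animals, size = 6, replace = TRUE)   # 6 draws with replacement
u  <- runif(5)                                    # 5 draws from Uniform(0, 1)
z  <- rnorm(5)                                    # 5 draws from Normal(0, 1)

print(s1)
print(s2)
print(u)
print(z)
```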
x <- rpois(10000, lambda=2) # randomly generate 10000 values from a Poisson distribution with parameter lambda=2
length(x) # returns length of the variable x
[1] 10000
int [1:10000] 1 2 4 2 3 3 4 1 1 1 ...
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 1.000 2.000 2.015 3.000 10.000
x
0 1 2 3 4 5 6 7 8 9 10
1371 2695 2577 1889 939 361 123 32 8 3 2
Square brackets [ ] are used for indexing (i.e., accessing elements of) a vector, matrix, array, list, or dataframe.
Using a vector of positive integers, the corresponding elements of the vector are selected and concatenated, in that order. A vector of negative integers specifies the values to be excluded rather than included.
[1] 0.15153443 0.21308165 0.49793895 0.04551215 0.83195540 0.42114951 0.38850633
[1] 0.2130816
[1] 0.8319554
[1] 0.15153443 0.21308165 0.04551215 0.83195540 0.42114951 0.38850633
[1] NA
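A small sketch of integer indexing, using a hypothetical vector with easy-to-read values instead of the random one above.

```r
x <- c(10, 20, 30, 40, 50, 60, 70)   # hypothetical values

x[2]         # second element: 20
x[c(5, 2)]   # elements returned in the order requested: 50 20
x[-3]        # everything except the third element: 10 20 40 50 60 70
x[8]         # index past the end of the vector gives NA
```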
Values corresponding to TRUE in the index vector are selected and those corresponding to FALSE are omitted.
[1] 0.8774323 0.7797666 0.7160577 0.6435915 0.7282974 0.2961213 0.8139053
[1] 0.8774323 0.7797666 0.8139053
[1] 0.7160577
[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[1] 0.2961213
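A sketch of logical indexing. The vector below is hypothetical (rounded values similar to the output above), and the thresholds 0.75 and 0.5 are assumptions chosen for illustration.

```r
x <- c(0.88, 0.78, 0.72, 0.64, 0.73, 0.30, 0.81)   # hypothetical values

big <- x[x > 0.75]    # keep elements where the condition is TRUE: 0.88 0.78 0.81
is_small <- x < 0.5   # a logical vector, TRUE only in position 6
small <- x[is_small]  # 0.30

print(big)
print(is_small)
print(small)
```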
If a vector has a names attribute to identify its components, a sub-vector of the names vector may be used to select the elements.
a b c d
0.2196825 0.2927151 -1.5388355 0.9532625
Named num [1:4] 0.22 0.293 -1.539 0.953
- attr(*, "names")= chr [1:4] "a" "b" "c" "d"
b a
0.2927151 0.2196825
To get the names of a vector, you can again use names(x).
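A minimal sketch of setting names, indexing by name, and reading the names back. The values are hypothetical.

```r
x <- c(0.2, 0.3, -1.5, 1.0)           # hypothetical values
names(x) <- c("a", "b", "c", "d")     # set the names attribute

sub <- x[c("b", "a")]   # select by name, in the order requested
nms <- names(x)         # read the names back

print(sub)
print(nms)
```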
Types can be converted with as.logical(), as.numeric(), as.character().
Types can be checked with is.character(), is.logical(), etc.
Q: What will be the result of the following code?
Q: What is the sum of the first three values of the vector x?
Q: What is the mean of the positive values of the vector x?
Q: How many negative values are there in the vector x?
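The vector x for these questions is not shown here. As a sketch of the indexing involved, the same three quantities for a hypothetical vector could be computed as follows.

```r
x <- c(-2, 5, 3, -1, 4)   # hypothetical vector, not the one from the questions

sum_first3 <- sum(x[1:3])      # integer indexing: -2 + 5 + 3 = 6
mean_pos   <- mean(x[x > 0])   # logical indexing: mean of 5, 3, 4 = 4
n_neg      <- sum(x < 0)       # TRUE counts as 1, so this counts negatives: 2
```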
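The all.equal function
Vectors can be compared element-wise with ==, or as a whole (up to a small numerical tolerance) with the all.equal function.
Usage: all.equal(vector1, vector2, ...).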
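A short sketch of the difference between == and all.equal() on floating-point values. The vectors a and b are hypothetical.

```r
a <- c(0.1 + 0.2, 1)
b <- c(0.3, 1)

elementwise <- a == b             # exact comparison, one result per element
near_equal  <- all.equal(a, b)    # TRUE if equal up to a small tolerance

# all.equal() returns a character description (not FALSE) when the vectors
# differ, so wrap it in isTRUE() before using it inside if ().
safe_check <- isTRUE(all.equal(a, c(0.4, 1)))

print(elementwise)   # FALSE TRUE, because 0.1 + 0.2 is not exactly 0.3
print(near_equal)    # TRUE
print(safe_check)    # FALSE
```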
You can think of a matrix as a collection of vectors of the same type and length.
A matrix is created with the matrix() function, where the number of rows and columns must be specified.
Indexing is similar to vectors: each element is indexed by row and column.
Logical indexing is also similar to vectors: values corresponding to TRUE in the index vector are selected and those corresponding to FALSE are omitted.
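A minimal sketch of creating and indexing a matrix. The values are hypothetical.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)   # filled column by column

m[2, 3]    # row 2, column 3: 6
m[1, ]     # entire first row: 1 3 5
m[, 2]     # entire second column: 3 4
m[m > 4]   # logical indexing returns a vector: 5 6

row_sums <- apply(m, 1, sum)   # apply sum() over rows (MARGIN = 1): 9 12
```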
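Functions can be applied over the rows or columns of a matrix (see the apply() function).
Arrays are a generalization of matrices.
Use dim() to find the size of an array.
Use dimnames() to assign names to the indices of each dimension.
List elements can be of any type.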
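Lists are created with list().
Sub-lists and elements are extracted with the square brackets operator [ ] or the double square bracket operator [[ ]].
Named elements can be accessed with [ ], [[ ]], or the $ operator.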
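A sketch of the three list operators on a hypothetical list. Note that [ ] returns a (shorter) list, whereas [[ ]] and $ return the element itself.

```r
person <- list(name = "Ada", scores = c(90, 85), passed = TRUE)   # hypothetical

sub_list <- person[1]            # a list of length 1
first    <- person[[1]]          # the element itself: "Ada"
scores   <- person[["scores"]]   # access by name with [[ ]]
passed   <- person$passed        # access by name with $
```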
Data frames are a convenient way to store datasets.
Examples of built-in data frames: USArrests, PlantGrowth, ToothGrowth, mtcars.
A data frame is a list with class “data.frame”.
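A short sketch using the built-in mtcars data frame. It checks that a data frame is a list and shows how the operators [, [[ and $ behave on it.

```r
is.list(mtcars)   # TRUE: a data frame is a list of columns
class(mtcars)     # "data.frame"

mpg_vec  <- mtcars$mpg        # $ returns the column as a vector
mpg_vec2 <- mtcars[["mpg"]]   # [[ ]] also returns a vector
mpg_df   <- mtcars["mpg"]     # [ ] with one index returns a data frame

small <- mtcars[1:3, c("mpg", "cyl")]   # rows 1 to 3, two columns
heavy <- mtcars[mtcars$wt > 5, ]        # rows selected by a logical vector
```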
Use the read.table() function to read an entire data frame from an external file.
For comma-separated files, use read.csv().
Elements of a data frame are accessed with [, [[ or $, and they return a vector or a dataframe.
Subsetting:
([ or [[).
([ or [[).
([, [[, or $).
R built-in functions are, e.g., sum() or mean(), where the input is a vector and the output is a number.
A user-defined function is assigned to a name and is stored in the R environment as an object with this name.
When calling a function, you can specify the arguments by:
How to write your own function:
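As a minimal sketch of writing and calling a function, consider the following. The function name standardize and its arguments are hypothetical, and the calls illustrate passing arguments by position and by name.

```r
# General shape: <name> <- function(<arguments>) { <commands>; <return value> }
standardize <- function(x, center = TRUE, scale = TRUE) {
  if (center) x <- x - mean(x)   # subtract the mean
  if (scale)  x <- x / sd(x)     # divide by the standard deviation
  x                              # the last evaluated expression is returned
}

vals <- c(2, 4, 6, 8)
z1 <- standardize(vals)                      # rely on the default arguments
z2 <- standardize(vals, FALSE, TRUE)         # arguments matched by position
z3 <- standardize(x = vals, scale = FALSE)   # arguments matched by name
```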
Example
Let x and y be numeric vectors of the same length. We can calculate:
the mean of x by mean(x);
the variance of x by var(x);
the standard deviation of x by sd(x);
the covariance of x and y using cov(x, y);
the correlation of x and y using cor(x, y).
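A sketch of these summary functions on two hypothetical vectors, including a check that the correlation equals the covariance divided by the product of the standard deviations.

```r
x <- c(1, 2, 3, 4, 5)   # hypothetical data
y <- c(2, 1, 4, 3, 5)

mean(x)     # 3
var(x)      # 2.5
sd(x)       # square root of the variance
cov(x, y)   # 2
cor(x, y)   # 0.8

# cor(x, y) is cov(x, y) rescaled by the two standard deviations.
manual_cor <- cov(x, y) / (sd(x) * sd(y))
```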
for loop
Template
Example for loop: element-wise squaring
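A sketch of the general shape of a for loop, followed by an element-wise squaring loop. The vector x and the result name squares are hypothetical.

```r
# General shape: for (<index> in <vector>) { <commands using the index> }

x <- c(1, 2, 3, 4)               # hypothetical input
squares <- numeric(length(x))    # pre-allocate the result vector
for (i in seq_along(x)) {
  squares[i] <- x[i]^2           # square the i-th element
}
print(squares)                   # 1 4 9 16, the same as the vectorized x^2
```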
Example for loop: cumulative sum.
y <- c(6, 4, 2, 1, 8) # values taken from the printed output below
y
[1] 6 4 2 1 8
z <- 0
for (i in 1:5) {
z <- z + y[i] # uses previous iteration's value of z
print(paste("the cumulative sum of the vector y at index", i, "is:", z))
}
[1] "the cumulative sum of the vector y at index 1 is: 6"
[1] "the cumulative sum of the vector y at index 2 is: 10"
[1] "the cumulative sum of the vector y at index 3 is: 12"
[1] "the cumulative sum of the vector y at index 4 is: 13"
[1] "the cumulative sum of the vector y at index 5 is: 21"
[1] 21
Example for loop: compute \(\sum_{n=1}^5 n!\)
z <- 0
for (i in 1:5) {
y <- factorial(i) # the factorial() function is built into R
print(paste0("the value of ", i, "! is: ", y))
z <- z + y # uses previous iteration's value of z
}
[1] "the value of 1! is: 1"
[1] "the value of 2! is: 2"
[1] "the value of 3! is: 6"
[1] "the value of 4! is: 24"
[1] "the value of 5! is: 120"
[1] 153
Q: What happens if we omit z <- 0 at line 1?
while loop
Useful when we don’t know how many times we want to execute commands.
Example while loop (random walk)
x <- 0
while (-2 <= x && x <= 2) {
curr_step <- sample(c(-1, 1), size=1)
print(paste0("moving x=", x, " by step of size ", curr_step))
x <- x + curr_step # uses previous iteration's value of x
}
[1] "moving x=0 by step of size -1"
[1] "moving x=-1 by step of size 1"
[1] "moving x=0 by step of size -1"
[1] "moving x=-1 by step of size -1"
[1] "moving x=-2 by step of size -1"
Example while loop (random walk): another set of random steps
Example while loop (random walk): fix the set of “random” steps
set.seed(42) # for reproducibility; fixes any subsequent "random" results
x <- 0
while (-2 <= x && x <= 2) {
curr_step <- sample(c(-1, 1), size=1)
print(paste0("moving x=", x, " by step of size ", curr_step))
x <- x + curr_step # uses previous iteration's value of x
}
[1] "moving x=0 by step of size -1"
[1] "moving x=-1 by step of size -1"
[1] "moving x=-2 by step of size -1"
It is possible that the body of a while() loop will never be executed.
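A minimal sketch of this case: the condition is already FALSE on the first check, so the body is skipped entirely. The names x and n_iter are hypothetical.

```r
x <- 5
n_iter <- 0            # counts how many times the body runs
while (x < 3) {        # FALSE on the very first check
  n_iter <- n_iter + 1
  x <- x + 1
}
print(n_iter)          # 0: the body never ran
print(x)               # 5: x is unchanged
```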
apply() family of functions
apply() and related functions
lapply(X, FUN, ...): returns a list containing the result of the function FUN applied to all the elements of the list/vector X.
sapply(X, FUN, ...): essentially does lapply(X, FUN, ...) first and then tries to coerce the output into a vector.
grades
Calculate group means using…
group1 group2 group3
4.44 8.13 12.54
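A sketch of computing group means with a for loop, lapply(), and sapply(). The original grades data are not shown here, so the list below is made up and will not reproduce the means printed above.

```r
grades <- list(group1 = c(3, 5, 4),      # hypothetical data
               group2 = c(8, 9, 7),
               group3 = c(12, 13, 14))

# for loop: fill a pre-allocated, named vector
means_loop <- numeric(length(grades))
names(means_loop) <- names(grades)
for (g in names(grades)) {
  means_loop[g] <- mean(grades[[g]])
}

means_list <- lapply(grades, mean)   # returns a named list
means_vec  <- sapply(grades, mean)   # simplifies to a named numeric vector

print(means_vec)
```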
R is a functional programming language: functions can be used as objects!
funlist <- list(sum, mean, var, sd)
dat <- runif(10)
# Use a for loop to apply every function in funlist to dat
for (f in funlist) {
print(f(dat)) # prints values
}
[1] 4.108169
[1] 0.4108169
[1] 0.116695
[1] 0.3416066
# Use sapply to apply every function in funlist to dat
sapply(funlist, \(f) f(dat)) # also stores values in a vector
[1] 4.1081692 0.4108169 0.1166950 0.3416066
for() and apply()
Beyond aesthetic differences…
for() executes commands sequentially.
The apply() family can execute commands in parallel (e.g., via parallel counterparts such as parallel::mclapply()), but does not by default.
purrr::reduce()
Repeatedly applies a binary function to the elements of a vector or list.
Base R has Reduce(), but the version from the purrr package has nicer functionality.
Usage: reduce(<list or vector>, <binary function>)
purrr::reduce(): example using paste()
[1] "eek a bear"
[1] "eekabear"
[1] "eek...a...bear"
paste() also does element-wise pasting.
So what if you want to paste together all elements of a character vector?
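A sketch of the pasting behaviour discussed above. To keep it runnable without extra packages it uses base R's Reduce(f, x) in place of purrr::reduce(x, f), which takes the same two inputs in the opposite order. The vector words is hypothetical.

```r
words <- c("eek", "a", "bear")

paste("eek", "a", "bear")   # separate arguments are joined: "eek a bear"
paste(words, "!")           # element-wise on a vector: "eek !" "a !" "bear !"

# Reducing with paste() joins the elements pairwise from left to right.
joined      <- Reduce(paste, words)
joined_dots <- Reduce(function(a, b) paste(a, b, sep = "..."), words)

# paste() can also do this directly through its collapse argument.
collapsed <- paste(words, collapse = " ")

print(joined)        # "eek a bear"
print(joined_dots)   # "eek...a...bear"
```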
purrr::reduce(): example using set intersection
[[1]]
[1] 0 5 10 15 20 25 30
[[2]]
[1] 0 3 6 9 12 15 18 21 24 27 30
[[3]]
[1] 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
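A sketch of reducing the three sets above with intersect(). It again uses base R's Reduce(); with purrr loaded, reduce(sets, intersect) would be the corresponding call. The name sets is hypothetical.

```r
sets <- list(seq(0, 30, by = 5),   # multiples of 5
             seq(0, 30, by = 3),   # multiples of 3
             seq(0, 30, by = 2))   # multiples of 2

# Computes intersect(intersect(sets[[1]], sets[[2]]), sets[[3]])
common <- Reduce(intersect, sets)
print(common)   # 0 30: the only values up to 30 divisible by 5, 3, and 2
```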
purrr::reduce(): example of stacking data frames
 [1] -0.2992672 -0.2799575 -0.2311141 -1.2870888 -0.5693856 -3.0185564
[7] -0.4975908 -0.3545537 -1.7564090 -0.7374990
[[1]]
param u
1 -0.2992672 0.3690927
2 -0.2992672 0.5785272
3 -0.2992672 0.9776749
4 -0.2992672 0.6875838
[[2]]
param u
1 -0.2799575 0.44512359
2 -0.2799575 0.80760922
3 -0.2799575 -0.03743895
4 -0.2799575 0.06727781
[[3]]
param u
1 -0.2311141 0.7884435
2 -0.2311141 0.6223001
3 -0.2311141 0.0650239
4 -0.2311141 -0.1781900
[[4]]
param u
1 -1.287089 -0.9658007
2 -1.287089 -0.7921962
3 -1.287089 -0.1906617
4 -1.287089 -0.8355938
[[5]]
param u
1 -0.5693856 0.55956111
2 -0.5693856 -0.55701136
3 -0.5693856 0.01990297
4 -0.5693856 0.23791847
[[6]]
param u
1 -3.018556 -3.0122451
2 -3.018556 -0.6813479
3 -3.018556 -2.3840054
4 -3.018556 -1.5757809
[[7]]
param u
1 -0.4975908 0.4693016
2 -0.4975908 0.6642751
3 -0.4975908 0.3465215
4 -0.4975908 -0.1475987
[[8]]
param u
1 -0.3545537 -0.2326702
2 -0.3545537 -0.2385875
3 -0.3545537 0.0588810
4 -0.3545537 0.5495114
[[9]]
param u
1 -1.756409 -1.7557505
2 -1.756409 -1.1815049
3 -1.756409 0.8154147
4 -1.756409 0.7950465
[[10]]
param u
1 -0.737499 0.5379891
2 -0.737499 -0.1587868
3 -0.737499 0.1574230
4 -0.737499 0.5551562
param u
1 -0.2992672 0.36909267
2 -0.2992672 0.57852718
3 -0.2992672 0.97767495
4 -0.2992672 0.68758376
5 -0.2799575 0.44512359
6 -0.2799575 0.80760922
7 -0.2799575 -0.03743895
8 -0.2799575 0.06727781
9 -0.2311141 0.78844348
10 -0.2311141 0.62230012
11 -0.2311141 0.06502390
12 -0.2311141 -0.17819002
13 -1.2870888 -0.96580066
14 -1.2870888 -0.79219616
15 -1.2870888 -0.19066173
16 -1.2870888 -0.83559384
17 -0.5693856 0.55956111
18 -0.5693856 -0.55701136
19 -0.5693856 0.01990297
20 -0.5693856 0.23791847
21 -3.0185564 -3.01224508
22 -3.0185564 -0.68134793
23 -3.0185564 -2.38400545
24 -3.0185564 -1.57578093
25 -0.4975908 0.46930157
26 -0.4975908 0.66427514
27 -0.4975908 0.34652154
28 -0.4975908 -0.14759872
29 -0.3545537 -0.23267022
30 -0.3545537 -0.23858752
31 -0.3545537 0.05888100
32 -0.3545537 0.54951137
33 -1.7564090 -1.75575051
34 -1.7564090 -1.18150490
35 -1.7564090 0.81541467
36 -1.7564090 0.79504652
37 -0.7374990 0.53798910
38 -0.7374990 -0.15878679
39 -0.7374990 0.15742300
40 -0.7374990 0.55515619
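A sketch of stacking a list of data frames by reducing with rbind(). How the original data were generated is not shown, so the seed, the number of parameters, and the use of runif() for the u column are assumptions. Base R's Reduce() stands in for purrr::reduce().

```r
set.seed(1)            # arbitrary seed for reproducibility
params <- rnorm(3)     # hypothetical parameters (the example above uses 10)

# One small data frame per parameter value.
dfs <- lapply(params, function(p) {
  data.frame(param = p, u = runif(4, min = p, max = 1))   # assumed recipe for u
})

# rbind() is a binary function, so reducing with it stacks all the pieces.
stacked <- Reduce(rbind, dfs)
print(dim(stacked))    # 12 rows (3 data frames of 4 rows each), 2 columns
```

For a long list, do.call(rbind, dfs) gives the same result in a single call.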
if and else
Only one condition: if statement
Example (one coin toss)
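A sketch of a single coin toss handled first with if alone and then with if and else. The labels "H" and "T", the seed, and the messages are hypothetical.

```r
set.seed(1)                                 # arbitrary seed
toss <- sample(c("H", "T"), size = 1)       # one fair coin toss

# if alone: the body runs only when the condition is TRUE
if (toss == "H") {
  print("heads!")
}

# if and else: exactly one of the two branches runs
if (toss == "H") {
  result <- "heads: you win"
} else {
  result <- "tails: you lose"
}
print(result)
```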
if and else: vectorized version
ifelse(condition1, statement1, statement2)
Example (five coin tosses)
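A sketch of the vectorized version with five coin tosses. The labels, the seed, and the messages are hypothetical.

```r
set.seed(1)                                                 # arbitrary seed
tosses <- sample(c("H", "T"), size = 5, replace = TRUE)     # five tosses

# ifelse() checks the condition for every element and picks a value for each.
results <- ifelse(tosses == "H", "win", "lose")

print(tosses)
print(results)
```

A plain if () would not work here, because its condition must be a single TRUE or FALSE rather than a vector of five values.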
else if
More than two conditions:
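A sketch of chaining conditions with else if. The function name sign_label and its messages are hypothetical.

```r
sign_label <- function(v) {
  if (v < 0) {
    "negative"
  } else if (v == 0) {   # checked only when the first condition is FALSE
    "zero"
  } else {               # reached only when all conditions above are FALSE
    "positive"
  }
}

sign_label(-3)   # "negative"
sign_label(0)    # "zero"
sign_label(2)    # "positive"
```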
Comments on loops
Performs commands sequentially
Often we will want to perform the same set of (complicated) commands on different chunks of data.
This can be done with a for loop, but a for loop can be difficult to understand because it is so flexible.
Alternatively, use the apply() family of functions.