STA35B: Statistical Data Science 2
Based heavily on Chapter 19 of R4DS 2e
Often you need to combine data from multiple sources in order to answer the questions you’re interested in.
rbind()
(if columns match)cbind()
(if rows match)Let’s see what this looks like for some small data sets
How do we combine exams_a
and info_a
?
rbind()
or cbind()
won’t work for this task.Collectively, multiple tables of data are called relational data because it is the relations, not just the individual datasets, that are important.
Relations are always defined between a pair of tables. All other relations are built up from this simple idea: the relations of three or more tables are always a property of the relations between each pair.
Here we will join the two tables together. First we need to understand how two tables can be connected through a pair of keys.
The most common place to find relational data is in a relational database management system, a term that encompasses almost all modern databases.
In a tidy data frame, each row corresponds to an observation. How can we index these observations? What variables uniquely identify each row in a table?
Exercise: what are the primary keys for the datasets info_a
and exams_a
?
It’s good practice to verify that the primary/compound key does indeed uniquely identify each observation.
# A tibble: 0 × 2
# ℹ 2 variables: student <chr>, n <int>
# A tibble: 0 × 2
# ℹ 2 variables: student <chr>, n <int>
# A tibble: 0 × 3
# ℹ 3 variables: time_hour <dttm>, origin <chr>, n <int>
It’s good practice to verify that the primary/compound key does indeed uniquely identify each observation.
For our examples, we will use
A mutating join combines variables from two data frames: it first matches observations by their keys, then copies across variables from one data frame to the other.
mutate()
, the join functions add variables to the right, so if your dataset has many variables, you won’t see the new ones.left_join()
We can use left_join(x, y)
to augment a table x
with info from another table y
.
x
.x
, the data frame you’re joining to.left_join()
We can use left_join(x, y)
to augment a table x
with info from another table y
.
x
.x
, the data frame you’re joining to.left_join()
What if the “right” table y
isn’t fully furnished?
# info_b doesn't have info about Cat, so Cat's major and gpa will be missing
exams_a |> left_join(info_b)
Joining with `by = join_by(student)`
# A tibble: 3 × 5
student exam1 exam2 major gpa
<chr> <dbl> <dbl> <chr> <dbl>
1 Ant 80 85 Math 3.5
2 Bug 70 90 CS 2.5
3 Cat 85 70 <NA> NA
# info_b doesn't have info about Ewe, so Ewe's major and gpa will be missing
exams_b |> left_join(info_b)
Joining with `by = join_by(student)`
# A tibble: 2 × 5
student exam1 exam2 major gpa
<chr> <dbl> <dbl> <chr> <dbl>
1 Dog 90 95 Math 2
2 Ewe 80 100 <NA> NA
right_join()
The other join functions have a similar syntax. We will illustrate the differences in functionality through this example.
inner_join()
The other join functions have a similar syntax. We will illustrate the differences in functionality through this example.
full_join()
The other join functions have a similar syntax. We will illustrate the differences in functionality through this example.
The primary action of a filtering join is to filter the rows.
Two types:
x
that have a match in y
.x
that do not have a match in y
. (useful for finding missing values that are implicit in the data)semi_join()
The primary action of a filtering join is to filter the rows.
anti_join()
The primary action of a filtering join is to filter the rows.
In all cases we’ve seen so far, we didn’t need to specify the keys when joining.
In all cases we’ve seen so far, we didn’t need to specify the keys when joining.
# An example where join_by() could be applied to two possible variables in info_d
info_d <- tribble(
~name_start, ~major, ~gpa, ~name_end,
"Ant", "Math", 3.5, "Ant",
"Bug", "CS", 2.5, "Bug",
"Cat", "EE", 1.5, "Dog",
"Dog", "Math", 2.0, "Cat",
)
exams_a |> left_join(info_d, join_by(student == name_start))
# A tibble: 3 × 6
student exam1 exam2 major gpa name_end
<chr> <dbl> <dbl> <chr> <dbl> <chr>
1 Ant 80 85 Math 3.5 Ant
2 Bug 70 90 CS 2.5 Bug
3 Cat 85 70 EE 1.5 Dog
# A tibble: 3 × 6
student exam1 exam2 name_start major gpa
<chr> <dbl> <dbl> <chr> <chr> <dbl>
1 Ant 80 85 Ant Math 3.5
2 Bug 70 90 Bug CS 2.5
3 Cat 85 70 Dog Math 2
Another scenario is when the two tables share a same column name but the column values mean different things.
animals <- tribble(
~name, ~year, ~species,
"Ant", 2000, "Insect",
"Bug", 2001, "Insect",
"Cat", 2002, "Mammal",
)
# info_e <- tribble(
# ~name_start, ~major, ~gpa, ~name_end,
# "Ant", "Math", 3.5, "Ant",
# "Bug", "CS", 2.5, "Bug",
# "Cat", "EE", 1.5, "Dog",
# "Dog", "Math", 2.0, "Cat",
# )
info_c |> left_join(animals) # What variables are being used as keys here?
Joining with `by = join_by(name)`
# A tibble: 3 × 5
name major gpa year species
<chr> <chr> <dbl> <dbl> <chr>
1 Ant Math 3.5 2000 Insect
2 Bug CS 2.5 2001 Insect
3 Cat EE 1.5 2002 Mammal
# exams_e refers to scores in STA 141A; exams_f refers to scores in STA 141B
exams_e <- tribble(
~student, ~exam1, ~exam2,
"Ant", 80, 85,
"Bug", 70, 90,
"Cat", 85, 70,
"Dog", 24, 10,
)
exams_f <- tribble(
~student, ~exam1, ~exam2,
"Ant", 50, 60,
"Bug", 35, 60,
"Cat", 22, 90,
)
exams_e |> left_join(exams_f, join_by(student)) # How are the exam1 and exam2 variables disambiguated?
# A tibble: 4 × 5
student exam1.x exam2.x exam1.y exam2.y
<chr> <dbl> <dbl> <dbl> <dbl>
1 Ant 80 85 50 60
2 Bug 70 90 35 60
3 Cat 85 70 22 90
4 Dog 24 10 NA NA
# Same as exams_e |> left_join(exams_f, join_by(student==student))
exams_e |> left_join(exams_f, join_by(student), suffix=c('.141A', '.141B')) # How are the exam1 and exam2 variables disambiguated?
# A tibble: 4 × 5
student exam1.141A exam2.141A exam1.141B exam2.141B
<chr> <dbl> <dbl> <dbl> <dbl>
1 Ant 80 85 50 60
2 Bug 70 90 35 60
3 Cat 85 70 22 90
4 Dog 24 10 NA NA
So far we’ve only seen equi joins (joins where the rows match if the x key equals the y key). Non-equi joins relax that restriction and discuss other ways of combining data. See Section 19.5 of R4DS 2e for more information.