# A tibble: 5 × 3
student major gpa
<chr> <chr> <dbl>
1 Ant Math 3.5
2 Bug CS 2.5
3 Cat EE 1.5
4 Dog Math 2
5 Ewe CS 3
info_b
# A tibble: 3 × 3
student major gpa
<chr> <chr> <dbl>
1 Ant Math 3.5
2 Bug CS 2.5
3 Dog Math 2
Mutating joins
A mutating join combines variables from two data frames: it first matches observations by their keys, then copies across variables from one data frame to the other.
Like mutate(), the join functions add variables to the right, so if your dataset has many variables, you won’t see the new ones.
Mutating joins: left_join()
We can use left_join(x, y) to augment a table x with info from another table y.
Primary use is to add additional metadata to the “left” table x.
Output will always have the same rows as x, the data frame you’re joining to.
# return all rows in exams_aexams_a |>left_join(info_a)
# A tibble: 3 × 5
student exam1 exam2 major gpa
<chr> <dbl> <dbl> <chr> <dbl>
1 Ant 80 85 Math 3.5
2 Bug 70 90 CS 2.5
3 Cat 85 70 EE 1.5
# return all rows in exams_bexams_b |>left_join(info_a)
# A tibble: 2 × 5
student exam1 exam2 major gpa
<chr> <dbl> <dbl> <chr> <dbl>
1 Dog 90 95 Math 2
2 Ewe 80 100 CS 3
Mutating joins: left_join()
We can use left_join(x, y) to augment a table x with info from another table y.
Primary use is to add additional metadata to the “left” table x.
Output will always have the same rows as x, the data frame you’re joining to.
# return all rows in exams_cexams_c |>left_join(info_a)
# A tibble: 3 × 5
student exam3 exam4 major gpa
<chr> <dbl> <dbl> <chr> <dbl>
1 Ant 90 95 Math 3.5
2 Bug 80 100 CS 2.5
3 Cat 95 80 EE 1.5
Mutating joins: left_join()
What if the “right” table y isn’t fully furnished?
info_b
# A tibble: 3 × 3
student major gpa
<chr> <chr> <dbl>
1 Ant Math 3.5
2 Bug CS 2.5
3 Dog Math 2
# info_b doesn't have info about Cat, so Cat's major and gpa will be missingexams_a |>left_join(info_b)
# A tibble: 3 × 5
student exam1 exam2 major gpa
<chr> <dbl> <dbl> <chr> <dbl>
1 Ant 80 85 Math 3.5
2 Bug 70 90 CS 2.5
3 Cat 85 70 <NA> NA
# info_b doesn't have info about Ewe, so Ewe's major and gpa will be missingexams_b |>left_join(info_b)
# A tibble: 2 × 5
student exam1 exam2 major gpa
<chr> <dbl> <dbl> <chr> <dbl>
1 Dog 90 95 Math 2
2 Ewe 80 100 <NA> NA
Mutating joins: right_join()
The other join functions have a similar syntax. We will illustrate the differences in functionality through this example.
# return all rows in info_b augmented with values from exams_aexams_a |>right_join(info_b)
# A tibble: 3 × 5
student exam1 exam2 major gpa
<chr> <dbl> <dbl> <chr> <dbl>
1 Ant 80 85 Math 3.5
2 Bug 70 90 CS 2.5
3 Dog NA NA Math 2
Mutating joins: inner_join()
The other join functions have a similar syntax. We will illustrate the differences in functionality through this example.
# return all rows in both exams_a and info_b (augmented)exams_a |>inner_join(info_b)
# A tibble: 2 × 5
student exam1 exam2 major gpa
<chr> <dbl> <dbl> <chr> <dbl>
1 Ant 80 85 Math 3.5
2 Bug 70 90 CS 2.5
Mutating joins: full_join()
The other join functions have a similar syntax. We will illustrate the differences in functionality through this example.
# return all rows in either exams_a and info_b (augmented)exams_a |>full_join(info_b)
# A tibble: 4 × 5
student exam1 exam2 major gpa
<chr> <dbl> <dbl> <chr> <dbl>
1 Ant 80 85 Math 3.5
2 Bug 70 90 CS 2.5
3 Cat 85 70 <NA> NA
4 Dog NA NA Math 2
Filtering joins
The primary action of a filtering join is to filter the rows.
Two types:
semi-join: return all rows in x that have a match in y.
anti-join: return all rows in x that do not have a match in y. (useful for finding missing values that are implicit in the data)
Filtering joins: semi_join()
The primary action of a filtering join is to filter the rows.
# A tibble: 3 × 3
name major gpa
<chr> <chr> <dbl>
1 Ant Math 3.5
2 Bug CS 2.5
3 Cat EE 1.5
exams_a |>left_join(info_c)Error in`left_join()`:!`by` must be supplied when `x` and`y` have no common variables.ℹ Use `cross_join()` to perform a cross-join.Run `rlang::last_trace()` to see where the error occurred.
Specifying join keys: scenario 1
In all cases we’ve seen so far, we didn’t need to specify the keys when joining.
By default, these join methods will use keys that share the same name.
But sometimes we need to specify the key columns when joining.
# A tibble: 3 × 6
student exam1 exam2 n1 major gpa
<chr> <dbl> <dbl> <chr> <chr> <dbl>
1 Ant 80 85 Bug CS 2.5
2 Bug 70 90 Ant Math 3.5
3 Cat 85 70 <NA> <NA> NA
Specifying join keys: scenario 3
Tables share a same column name but the column values mean different things.
info_c
# A tibble: 3 × 3
name major gpa
<chr> <chr> <dbl>
1 Ant Math 3.5
2 Bug CS 2.5
3 Cat EE 1.5
animals
# A tibble: 3 × 3
name year species
<chr> <dbl> <chr>
1 Ant 2000 Insect
2 Bug 2001 Insect
3 Cat 2002 Mammal
# What variables are being used as keys here?info_c |>left_join(animals)
# A tibble: 3 × 5
name major gpa year species
<chr> <chr> <dbl> <dbl> <chr>
1 Ant Math 3.5 2000 Insect
2 Bug CS 2.5 2001 Insect
3 Cat EE 1.5 2002 Mammal
Specifying join keys: scenario 4
Non-key columns share same name.
exams_141a # scores in STA 141A
# A tibble: 3 × 3
stu me1 me2
<chr> <dbl> <dbl>
1 Ant 80 85
2 Bug 70 90
3 Cat 85 70
exams_141b # scores in STA 141B
# A tibble: 2 × 4
stu me1 me2 me3
<chr> <dbl> <dbl> <dbl>
1 Ant 50 60 70
2 Bug 35 60 30
# How are the me1 and me2 variables disambiguated?exams_141a |>left_join(exams_141b, join_by(stu))
# A tibble: 3 × 6
stu me1.x me2.x me1.y me2.y me3
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Ant 80 85 50 60 70
2 Bug 70 90 35 60 30
3 Cat 85 70 NA NA NA
# How are the me1 and me2 variables disambiguated?exams_141a |>left_join(exams_141b, join_by(stu), suffix=c('.141A', '.141B'))
# A tibble: 3 × 6
stu me1.141A me2.141A me1.141B me2.141B me3
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Ant 80 85 50 60 70
2 Bug 70 90 35 60 30
3 Cat 85 70 NA NA NA
Non-equi joins
So far we’ve only seen equi joins (joins where the rows match if the x key equals the y key). Non-equi joins relax that restriction and discuss other ways of combining data. See Section 19.5 of R4DS 2e for more information.
Beyond this course
Small and locally stored data sets
So far we have worked with data sets that are:
smallish (fewer than 1 million rows),
locally stored on our computer.
After this course, you may need to go beyond this.
Bigger and external data sets
The tidyverse approach can be slow if working with larger datasets.
Posit team continually works to improve code.
dtplyr or tidytable package can get closer to data.table-level speed.
In a job, we will often work with data in external databases.
SQL allows you to access/manipluate data in databases and hence is a core tool in a data scientist’s toolkit.
Unfortunately, we will not cover it in this class.
In the meantime, you can use the dbplyr package, which allows you to work with data in external databases without having to write any SQL code. You can state your desired functionality by writing R code with tidyverse syntax; dbplyr will then run the equivalent SQL code behind the scenes.