s3.6: joining data

STA141A: Fundamentals of Statistical Data Science

Akira Horiguchi

Combining data sets

Based heavily on Chapter 19 of R4DS 2e

library(tidyverse)
library(nycflights13)

Often you need to combine data from multiple sources in order to answer the questions you’re interested in.

  • We could combine rows using rbind() (if columns match)
  • We could combine columns using cbind() (if rows match)

Let’s see what this looks like for some small data sets

Let’s combine

exams_a
# A tibble: 3 × 3
  student exam1 exam2
  <chr>   <dbl> <dbl>
1 Ant        80    85
2 Bug        70    90
3 Cat        85    70
exams_b
# A tibble: 2 × 3
  student exam1 exam2
  <chr>   <dbl> <dbl>
1 Dog        90    95
2 Ewe        80   100
exams_c
# A tibble: 3 × 3
  student exam3 exam4
  <chr>   <dbl> <dbl>
1 Ant        90    95
2 Bug        80   100
3 Cat        95    80
rbind(exams_a, exams_b) 
# A tibble: 5 × 3
  student exam1 exam2
  <chr>   <dbl> <dbl>
1 Ant        80    85
2 Bug        70    90
3 Cat        85    70
4 Dog        90    95
5 Ewe        80   100
cbind(exams_a, exams_c)
  student exam1 exam2 student exam3 exam4
1     Ant    80    85     Ant    90    95
2     Bug    70    90     Bug    80   100
3     Cat    85    70     Cat    95    80

Let’s combine

exams_a
# A tibble: 3 × 3
  student exam1 exam2
  <chr>   <dbl> <dbl>
1 Ant        80    85
2 Bug        70    90
3 Cat        85    70

Suppose we want to augment this data in exams_a with the students’ majors and gpa from the following table:

info_a
# A tibble: 5 × 3
  student major   gpa
  <chr>   <chr> <dbl>
1 Ant     Math    3.5
2 Bug     CS      2.5
3 Cat     EE      1.5
4 Dog     Math    2  
5 Ewe     CS      3  

How do we combine exams_a and info_a?

  • If we look at the data frames and think about what each row corresponds to, it becomes clear that rbind() or cbind() won’t work for this task.

Joins

Collectively, multiple tables of data are called relational data because it is the relations, not just the individual datasets, that are important.

Relations are always defined between a pair of tables. All other relations are built up from this simple idea: the relations of three or more tables are always a property of the relations between each pair.

Here we will join the two tables together. First we need to understand how two tables can be connected through a pair of keys.

Aside

The most common place to find relational data is in a relational database management system, a term that encompasses almost all modern databases.

  • If you’ve used a database before, you’ve almost certainly used SQL — Structured Query Language.

Keys

In a tidy data frame, each row corresponds to an observation. How can we index these observations? What variables uniquely identify each row in a table?

  • If each observation can be uniquely identified by a single variable, the variable is called a primary key.
    • If each observation requires multiple variables to be uniquely identified, the combination of variables is called a compound key.

Exercise: what are the primary keys for the datasets info_a and exams_a?

info_a
# A tibble: 5 × 3
  student major   gpa
  <chr>   <chr> <dbl>
1 Ant     Math    3.5
2 Bug     CS      2.5
3 Cat     EE      1.5
4 Dog     Math    2  
5 Ewe     CS      3  
exams_a
# A tibble: 3 × 3
  student exam1 exam2
  <chr>   <dbl> <dbl>
1 Ant        80    85
2 Bug        70    90
3 Cat        85    70

Primary key

It’s good practice to verify that the primary/compound key does indeed uniquely identify each observation.

  • One way is to see if any such key corresponds to more than one row.
exams_a |> count(student) |> filter(n > 1)
# A tibble: 0 × 2
# ℹ 2 variables: student <chr>, n <int>
info_a |> count(student) |> filter(n > 1)
# A tibble: 0 × 2
# ℹ 2 variables: student <chr>, n <int>
weather |> count(time_hour, origin) |> filter(n > 1)  # compound key
# A tibble: 0 × 3
# ℹ 3 variables: time_hour <dttm>, origin <chr>, n <int>

Primary key

It’s good practice to verify that the primary/compound key does indeed uniquely identify each observation.

  • One way is to see if any such key corresponds to more than one row.
  • You should also check for missing values in primary keys.
exams_d
# A tibble: 4 × 3
  student exam1 exam2
  <chr>   <dbl> <dbl>
1 Ant        90    95
2 Bug        80   100
3 Ant        95    80
4 <NA>       95    80
exams_d |> count(student)
# A tibble: 3 × 2
  student     n
  <chr>   <int>
1 Ant         2
2 Bug         1
3 <NA>        1

Basic joins

Two categories of joins: mutating joins and filtering joins

For our examples, we will use

exams_a
# A tibble: 3 × 3
  student exam1 exam2
  <chr>   <dbl> <dbl>
1 Ant        80    85
2 Bug        70    90
3 Cat        85    70
exams_b
# A tibble: 2 × 3
  student exam1 exam2
  <chr>   <dbl> <dbl>
1 Dog        90    95
2 Ewe        80   100
exams_c
# A tibble: 3 × 3
  student exam3 exam4
  <chr>   <dbl> <dbl>
1 Ant        90    95
2 Bug        80   100
3 Cat        95    80
info_a
# A tibble: 5 × 3
  student major   gpa
  <chr>   <chr> <dbl>
1 Ant     Math    3.5
2 Bug     CS      2.5
3 Cat     EE      1.5
4 Dog     Math    2  
5 Ewe     CS      3  
info_b
# A tibble: 3 × 3
  student major   gpa
  <chr>   <chr> <dbl>
1 Ant     Math    3.5
2 Bug     CS      2.5
3 Dog     Math    2  

Mutating joins

A mutating join combines variables from two data frames: it first matches observations by their keys, then copies across variables from one data frame to the other.

  • Like mutate(), the join functions add variables to the right, so if your dataset has many variables, you won’t see the new ones.

Mutating joins: left_join()

We can use left_join(x, y) to augment a table x with info from another table y.

  • Primary use is to add additional metadata to the “left” table x.
  • Output will always have the same rows as x, the data frame you’re joining to.
# return all rows in exams_a
exams_a |> left_join(info_a) 
# A tibble: 3 × 5
  student exam1 exam2 major   gpa
  <chr>   <dbl> <dbl> <chr> <dbl>
1 Ant        80    85 Math    3.5
2 Bug        70    90 CS      2.5
3 Cat        85    70 EE      1.5
# return all rows in exams_b
exams_b |> left_join(info_a)
# A tibble: 2 × 5
  student exam1 exam2 major   gpa
  <chr>   <dbl> <dbl> <chr> <dbl>
1 Dog        90    95 Math      2
2 Ewe        80   100 CS        3

Mutating joins: left_join()

We can use left_join(x, y) to augment a table x with info from another table y.

  • Primary use is to add additional metadata to the “left” table x.
  • Output will always have the same rows as x, the data frame you’re joining to.
# return all rows in exams_c
exams_c |> left_join(info_a) 
# A tibble: 3 × 5
  student exam3 exam4 major   gpa
  <chr>   <dbl> <dbl> <chr> <dbl>
1 Ant        90    95 Math    3.5
2 Bug        80   100 CS      2.5
3 Cat        95    80 EE      1.5

Mutating joins: left_join()

What if the “right” table y isn’t fully furnished?

info_b
# A tibble: 3 × 3
  student major   gpa
  <chr>   <chr> <dbl>
1 Ant     Math    3.5
2 Bug     CS      2.5
3 Dog     Math    2  
# info_b doesn't have info about Cat, so Cat's major and gpa will be missing
exams_a |> left_join(info_b)  
# A tibble: 3 × 5
  student exam1 exam2 major   gpa
  <chr>   <dbl> <dbl> <chr> <dbl>
1 Ant        80    85 Math    3.5
2 Bug        70    90 CS      2.5
3 Cat        85    70 <NA>   NA  
# info_b doesn't have info about Ewe, so Ewe's major and gpa will be missing
exams_b |> left_join(info_b)  
# A tibble: 2 × 5
  student exam1 exam2 major   gpa
  <chr>   <dbl> <dbl> <chr> <dbl>
1 Dog        90    95 Math      2
2 Ewe        80   100 <NA>     NA

Mutating joins: right_join()

The other join functions have a similar syntax. We will illustrate the differences in functionality through this example.

# return all rows in info_b augmented with values from exams_a
exams_a |> right_join(info_b)  
# A tibble: 3 × 5
  student exam1 exam2 major   gpa
  <chr>   <dbl> <dbl> <chr> <dbl>
1 Ant        80    85 Math    3.5
2 Bug        70    90 CS      2.5
3 Dog        NA    NA Math    2  

Mutating joins: inner_join()

The other join functions have a similar syntax. We will illustrate the differences in functionality through this example.

# return all rows in both exams_a and info_b (augmented)
exams_a |> inner_join(info_b)
# A tibble: 2 × 5
  student exam1 exam2 major   gpa
  <chr>   <dbl> <dbl> <chr> <dbl>
1 Ant        80    85 Math    3.5
2 Bug        70    90 CS      2.5

Mutating joins: full_join()

The other join functions have a similar syntax. We will illustrate the differences in functionality through this example.

# return all rows in either exams_a and info_b (augmented)
exams_a |> full_join(info_b)  
# A tibble: 4 × 5
  student exam1 exam2 major   gpa
  <chr>   <dbl> <dbl> <chr> <dbl>
1 Ant        80    85 Math    3.5
2 Bug        70    90 CS      2.5
3 Cat        85    70 <NA>   NA  
4 Dog        NA    NA Math    2  

Filtering joins

The primary action of a filtering join is to filter the rows.

Two types:

  1. semi-join: return all rows in x that have a match in y.
  2. anti-join: return all rows in x that do not have a match in y. (useful for finding missing values that are implicit in the data)

Filtering joins: semi_join()

The primary action of a filtering join is to filter the rows.

# return all rows in exams_a that have a match in info_b; 
# find which students in exams_a are also in info_b
exams_a |> semi_join(info_b)  
# A tibble: 2 × 3
  student exam1 exam2
  <chr>   <dbl> <dbl>
1 Ant        80    85
2 Bug        70    90

Filtering joins: anti_join()

The primary action of a filtering join is to filter the rows.

# return all rows in exams_a that don't have a match in info_b; 
# find which students are missing from info_b
exams_a |> anti_join(info_b)  
# A tibble: 1 × 3
  student exam1 exam2
  <chr>   <dbl> <dbl>
1 Cat        85    70

Specifying join keys

In all cases we’ve seen so far, we didn’t need to specify the keys when joining.

  • By default, these join methods will use keys that share the same name.
  • But sometimes we need to specify the key columns when joining.
exams_a
# A tibble: 3 × 3
  student exam1 exam2
  <chr>   <dbl> <dbl>
1 Ant        80    85
2 Bug        70    90
3 Cat        85    70
info_c
# A tibble: 3 × 3
  name  major   gpa
  <chr> <chr> <dbl>
1 Ant   Math    3.5
2 Bug   CS      2.5
3 Cat   EE      1.5
exams_a |> left_join(info_c)

Error in `left_join()`:
! `by` must be supplied when `x` and
  `y` have no common variables.
ℹ Use `cross_join()` to perform a cross-join.
Run `rlang::last_trace()` to see where the error occurred.

Specifying join keys

In all cases we’ve seen so far, we didn’t need to specify the keys when joining.

  • By default, these join methods will use keys that share the same name.
  • But sometimes we need to specify the key columns when joining.
exams_a
# A tibble: 3 × 3
  student exam1 exam2
  <chr>   <dbl> <dbl>
1 Ant        80    85
2 Bug        70    90
3 Cat        85    70
info_c
# A tibble: 3 × 3
  name  major   gpa
  <chr> <chr> <dbl>
1 Ant   Math    3.5
2 Bug   CS      2.5
3 Cat   EE      1.5
exams_a |> left_join(info_c, join_by(student==name))
# A tibble: 3 × 5
  student exam1 exam2 major   gpa
  <chr>   <dbl> <dbl> <chr> <dbl>
1 Ant        80    85 Math    3.5
2 Bug        70    90 CS      2.5
3 Cat        85    70 EE      1.5
# An example where join_by() could be applied to two possible variables in info_d
info_d <- tribble(
  ~name_start, ~major, ~gpa, ~name_end, 
  "Ant", "Math", 3.5, "Ant", 
  "Bug", "CS",   2.5, "Bug", 
  "Cat", "EE",   1.5, "Dog",
  "Dog", "Math", 2.0, "Cat",
)
exams_a |> left_join(info_d, join_by(student == name_start))
# A tibble: 3 × 6
  student exam1 exam2 major   gpa name_end
  <chr>   <dbl> <dbl> <chr> <dbl> <chr>   
1 Ant        80    85 Math    3.5 Ant     
2 Bug        70    90 CS      2.5 Bug     
3 Cat        85    70 EE      1.5 Dog     
exams_a |> left_join(info_d, join_by(student == name_end))
# A tibble: 3 × 6
  student exam1 exam2 name_start major   gpa
  <chr>   <dbl> <dbl> <chr>      <chr> <dbl>
1 Ant        80    85 Ant        Math    3.5
2 Bug        70    90 Bug        CS      2.5
3 Cat        85    70 Dog        Math    2  

Another scenario is when the two tables share a same column name but the column values mean different things.

animals <- tribble(
  ~name, ~year, ~species,
  "Ant", 2000, "Insect",
  "Bug", 2001, "Insect",
  "Cat", 2002, "Mammal",
)
# info_e <- tribble(
#   ~name_start, ~major, ~gpa, ~name_end, 
#   "Ant", "Math", 3.5, "Ant", 
#   "Bug", "CS",   2.5, "Bug", 
#   "Cat", "EE",   1.5, "Dog",
#   "Dog", "Math", 2.0, "Cat",
# )
info_c |> left_join(animals)  # What variables are being used as keys here?
# A tibble: 3 × 5
  name  major   gpa  year species
  <chr> <chr> <dbl> <dbl> <chr>  
1 Ant   Math    3.5  2000 Insect 
2 Bug   CS      2.5  2001 Insect 
3 Cat   EE      1.5  2002 Mammal 
# exams_e refers to scores in STA 141A; exams_f refers to scores in STA 141B
exams_e <- tribble(
  ~student, ~exam1, ~exam2,
  "Ant", 80, 85,
  "Bug", 70, 90,
  "Cat", 85, 70,
  "Dog", 24, 10,
)
exams_f <- tribble(
  ~student, ~exam1, ~exam2,
  "Ant", 50, 60,
  "Bug", 35, 60,
  "Cat", 22, 90,
)
exams_e |> left_join(exams_f, join_by(student))  # How are the exam1 and exam2 variables disambiguated?
# A tibble: 4 × 5
  student exam1.x exam2.x exam1.y exam2.y
  <chr>     <dbl>   <dbl>   <dbl>   <dbl>
1 Ant          80      85      50      60
2 Bug          70      90      35      60
3 Cat          85      70      22      90
4 Dog          24      10      NA      NA
# Same as exams_e |> left_join(exams_f, join_by(student==student))
exams_e |> left_join(exams_f, join_by(student), suffix=c('.141A', '.141B'))  # How are the exam1 and exam2 variables disambiguated?
# A tibble: 4 × 5
  student exam1.141A exam2.141A exam1.141B exam2.141B
  <chr>        <dbl>      <dbl>      <dbl>      <dbl>
1 Ant             80         85         50         60
2 Bug             70         90         35         60
3 Cat             85         70         22         90
4 Dog             24         10         NA         NA

Non-equi joins

So far we’ve only seen equi joins (joins where the rows match if the x key equals the y key). Non-equi joins relax that restriction and discuss other ways of combining data. See Section 19.5 of R4DS 2e for more information.