3.4 Combining datasets

STA141A: Fundamentals of Statistical Data Science

Akira Horiguchi

Combining data sets

Based heavily on Chapter 19 of R4DS 2e

library(tidyverse)
library(nycflights13)

Often you need to combine data from multiple sources in order to answer the questions you’re interested in.

  • We could combine rows using rbind() (if columns match)
  • We could combine columns using cbind() (if rows match)

Let’s see what this looks like for some small data sets

Let’s combine

exams_a
# A tibble: 3 × 3
  student exam1 exam2
  <chr>   <dbl> <dbl>
1 Ant        80    85
2 Bug        70    90
3 Cat        85    70
exams_b
# A tibble: 2 × 3
  student exam1 exam2
  <chr>   <dbl> <dbl>
1 Dog        90    95
2 Ewe        80   100
exams_c
# A tibble: 3 × 3
  student exam3 exam4
  <chr>   <dbl> <dbl>
1 Ant        90    95
2 Bug        80   100
3 Cat        95    80
rbind(exams_a, exams_b) 
# A tibble: 5 × 3
  student exam1 exam2
  <chr>   <dbl> <dbl>
1 Ant        80    85
2 Bug        70    90
3 Cat        85    70
4 Dog        90    95
5 Ewe        80   100
cbind(exams_a, exams_c)
  student exam1 exam2 student exam3 exam4
1     Ant    80    85     Ant    90    95
2     Bug    70    90     Bug    80   100
3     Cat    85    70     Cat    95    80

How do we augment the data in exams_a

…with the students’ majors and gpa from the table info_a?

exams_a
# A tibble: 3 × 3
  student exam1 exam2
  <chr>   <dbl> <dbl>
1 Ant        80    85
2 Bug        70    90
3 Cat        85    70
info_a
# A tibble: 5 × 3
  student major   gpa
  <chr>   <chr> <dbl>
1 Ant     Math    3.5
2 Bug     CS      2.5
3 Cat     EE      1.5
4 Dog     Math    2  
5 Ewe     CS      3  
  • If we look at the data frames and think about what each row corresponds to, it becomes clear that rbind() or cbind() won’t work for this task.
  • Combining two tables requires specifying the relationship between them.
  • Many valid relations; the appropriate one(s) will be dictated by what the user knows about the data and about the end goal of the analysis.

Joins

Collectively, multiple tables of data are called relational data because it is the relations, not just the individual datasets, that are important.

  • Relations are always defined between a pair of tables.
  • The relations of three or more tables are always a property of the relations between each pair.

The most common place to find relational data is in a relational database management system, a term that encompasses almost all modern databases.

  • If you’ve used a database before, you’ve almost certainly used SQL — Structured Query Language.
  • UC Davis has a free 4-module course on Coursera: SQL for Data Science

Keys

Here we will join the two tables together. First we need to understand how two tables can be connected through a pair of keys.

  • If each observation can be uniquely identified by a single variable, the variable is called a primary key.
  • If each observation requires multiple variables to be uniquely identified, the combination of variables is called a compound key.

Exercise: what are the primary keys for the datasets info_a and exams_a?

info_a
# A tibble: 5 × 3
  student major   gpa
  <chr>   <chr> <dbl>
1 Ant     Math    3.5
2 Bug     CS      2.5
3 Cat     EE      1.5
4 Dog     Math    2  
5 Ewe     CS      3  
exams_a
# A tibble: 3 × 3
  student exam1 exam2
  <chr>   <dbl> <dbl>
1 Ant        80    85
2 Bug        70    90
3 Cat        85    70

Primary key

It’s good practice to verify that the primary/compound key does indeed uniquely identify each observation.

  • One way is to see if any such key corresponds to more than one row.
exams_a |> count(student) |> filter(n > 1)
# A tibble: 0 × 2
# ℹ 2 variables: student <chr>, n <int>
info_a |> count(student) |> filter(n > 1)
# A tibble: 0 × 2
# ℹ 2 variables: student <chr>, n <int>
weather |> count(time_hour, origin) |> filter(n > 1)  # compound key
# A tibble: 0 × 3
# ℹ 3 variables: time_hour <dttm>, origin <chr>, n <int>

Primary key

It’s good practice to verify that the primary/compound key does indeed uniquely identify each observation.

  • One way is to see if any such key corresponds to more than one row.
  • You should also check for missing values in primary keys.
exams_d
# A tibble: 4 × 3
  student exam1 exam2
  <chr>   <dbl> <dbl>
1 Ant        90    95
2 Bug        80   100
3 Ant        95    80
4 <NA>       95    80
exams_d |> count(student)
# A tibble: 3 × 2
  student     n
  <chr>   <int>
1 Ant         2
2 Bug         1
3 <NA>        1

Basic joins

Two categories of joins:

  • mutating joins: combines variables from two data frames. (make bigger)
  • filtering joins: filter the rows of a data frame using another. (make smaller)

For our examples, we will use

exams_a
# A tibble: 3 × 3
  student exam1 exam2
  <chr>   <dbl> <dbl>
1 Ant        80    85
2 Bug        70    90
3 Cat        85    70
exams_b
# A tibble: 2 × 3
  student exam1 exam2
  <chr>   <dbl> <dbl>
1 Dog        90    95
2 Ewe        80   100
exams_c
# A tibble: 3 × 3
  student exam3 exam4
  <chr>   <dbl> <dbl>
1 Ant        90    95
2 Bug        80   100
3 Cat        95    80
info_a
# A tibble: 5 × 3
  student major   gpa
  <chr>   <chr> <dbl>
1 Ant     Math    3.5
2 Bug     CS      2.5
3 Cat     EE      1.5
4 Dog     Math    2  
5 Ewe     CS      3  
info_b
# A tibble: 3 × 3
  student major   gpa
  <chr>   <chr> <dbl>
1 Ant     Math    3.5
2 Bug     CS      2.5
3 Dog     Math    2  

Mutating joins

A mutating join combines variables from two data frames: it first matches observations by their keys, then copies across variables from one data frame to the other.

  • Like mutate(), the join functions add variables to the right, so if your dataset has many variables, you won’t see the new ones.

Mutating joins: left_join()

We can use left_join(x, y) to augment a table x with info from another table y.

  • Primary use is to add additional metadata to the “left” table x.
  • Output will always have the same rows as x, the data frame you’re joining to.
# return all rows in exams_a
exams_a |> left_join(info_a) 
# A tibble: 3 × 5
  student exam1 exam2 major   gpa
  <chr>   <dbl> <dbl> <chr> <dbl>
1 Ant        80    85 Math    3.5
2 Bug        70    90 CS      2.5
3 Cat        85    70 EE      1.5
# return all rows in exams_b
exams_b |> left_join(info_a)
# A tibble: 2 × 5
  student exam1 exam2 major   gpa
  <chr>   <dbl> <dbl> <chr> <dbl>
1 Dog        90    95 Math      2
2 Ewe        80   100 CS        3

Mutating joins: left_join()

We can use left_join(x, y) to augment a table x with info from another table y.

  • Primary use is to add additional metadata to the “left” table x.
  • Output will always have the same rows as x, the data frame you’re joining to.
# return all rows in exams_c
exams_c |> left_join(info_a) 
# A tibble: 3 × 5
  student exam3 exam4 major   gpa
  <chr>   <dbl> <dbl> <chr> <dbl>
1 Ant        90    95 Math    3.5
2 Bug        80   100 CS      2.5
3 Cat        95    80 EE      1.5

Mutating joins: left_join()

What if the “right” table y isn’t fully furnished?

info_b
# A tibble: 3 × 3
  student major   gpa
  <chr>   <chr> <dbl>
1 Ant     Math    3.5
2 Bug     CS      2.5
3 Dog     Math    2  
# info_b doesn't have info about Cat, so Cat's major and gpa will be missing
exams_a |> left_join(info_b)  
# A tibble: 3 × 5
  student exam1 exam2 major   gpa
  <chr>   <dbl> <dbl> <chr> <dbl>
1 Ant        80    85 Math    3.5
2 Bug        70    90 CS      2.5
3 Cat        85    70 <NA>   NA  
# info_b doesn't have info about Ewe, so Ewe's major and gpa will be missing
exams_b |> left_join(info_b)  
# A tibble: 2 × 5
  student exam1 exam2 major   gpa
  <chr>   <dbl> <dbl> <chr> <dbl>
1 Dog        90    95 Math      2
2 Ewe        80   100 <NA>     NA

Mutating joins: right_join()

The other join functions have a similar syntax. We will illustrate the differences in functionality through this example.

# return all rows in info_b augmented with values from exams_a
exams_a |> right_join(info_b)  
# A tibble: 3 × 5
  student exam1 exam2 major   gpa
  <chr>   <dbl> <dbl> <chr> <dbl>
1 Ant        80    85 Math    3.5
2 Bug        70    90 CS      2.5
3 Dog        NA    NA Math    2  

Mutating joins: inner_join()

The other join functions have a similar syntax. We will illustrate the differences in functionality through this example.

# return all rows in both exams_a and info_b (augmented)
exams_a |> inner_join(info_b)
# A tibble: 2 × 5
  student exam1 exam2 major   gpa
  <chr>   <dbl> <dbl> <chr> <dbl>
1 Ant        80    85 Math    3.5
2 Bug        70    90 CS      2.5

Mutating joins: full_join()

The other join functions have a similar syntax. We will illustrate the differences in functionality through this example.

# return all rows in either exams_a and info_b (augmented)
exams_a |> full_join(info_b)  
# A tibble: 4 × 5
  student exam1 exam2 major   gpa
  <chr>   <dbl> <dbl> <chr> <dbl>
1 Ant        80    85 Math    3.5
2 Bug        70    90 CS      2.5
3 Cat        85    70 <NA>   NA  
4 Dog        NA    NA Math    2  

Filtering joins

The primary action of a filtering join is to filter the rows.

Two types:

  1. semi-join: return all rows in x that have a match in y.
  2. anti-join: return all rows in x that do not have a match in y. (useful for finding missing values that are implicit in the data)

Filtering joins: semi_join()

The primary action of a filtering join is to filter the rows.

exams_a
# A tibble: 3 × 3
  student exam1 exam2
  <chr>   <dbl> <dbl>
1 Ant        80    85
2 Bug        70    90
3 Cat        85    70
info_b
# A tibble: 3 × 3
  student major   gpa
  <chr>   <chr> <dbl>
1 Ant     Math    3.5
2 Bug     CS      2.5
3 Dog     Math    2  
# return all rows in exams_a that have a match in info_b; 
# find which students in exams_a are also in info_b
exams_a |> semi_join(info_b)  
# A tibble: 2 × 3
  student exam1 exam2
  <chr>   <dbl> <dbl>
1 Ant        80    85
2 Bug        70    90

Filtering joins: anti_join()

The primary action of a filtering join is to filter the rows.

exams_a
# A tibble: 3 × 3
  student exam1 exam2
  <chr>   <dbl> <dbl>
1 Ant        80    85
2 Bug        70    90
3 Cat        85    70
info_b
# A tibble: 3 × 3
  student major   gpa
  <chr>   <chr> <dbl>
1 Ant     Math    3.5
2 Bug     CS      2.5
3 Dog     Math    2  
# return all rows in exams_a that don't have a match in info_b; 
# find which students are missing from info_b
exams_a |> anti_join(info_b)  
# A tibble: 1 × 3
  student exam1 exam2
  <chr>   <dbl> <dbl>
1 Cat        85    70

Specifying join keys: scenario 1

In all cases we’ve seen so far, we didn’t need to specify the keys when joining.

  • By default, these join methods will use keys that share the same name.
  • But sometimes we need to specify the key columns when joining.
exams_a
# A tibble: 3 × 3
  student exam1 exam2
  <chr>   <dbl> <dbl>
1 Ant        80    85
2 Bug        70    90
3 Cat        85    70
info_c
# A tibble: 3 × 3
  name  major   gpa
  <chr> <chr> <dbl>
1 Ant   Math    3.5
2 Bug   CS      2.5
3 Cat   EE      1.5
exams_a |> left_join(info_c)

Error in `left_join()`:
! `by` must be supplied when `x` and
  `y` have no common variables.
ℹ Use `cross_join()` to perform a cross-join.
Run `rlang::last_trace()` to see where the error occurred.

Specifying join keys: scenario 1

In all cases we’ve seen so far, we didn’t need to specify the keys when joining.

  • By default, these join methods will use keys that share the same name.
  • But sometimes we need to specify the key columns when joining.
exams_a
# A tibble: 3 × 3
  student exam1 exam2
  <chr>   <dbl> <dbl>
1 Ant        80    85
2 Bug        70    90
3 Cat        85    70
info_c
# A tibble: 3 × 3
  name  major   gpa
  <chr> <chr> <dbl>
1 Ant   Math    3.5
2 Bug   CS      2.5
3 Cat   EE      1.5
exams_a |> left_join(info_c, join_by(student==name))
# A tibble: 3 × 5
  student exam1 exam2 major   gpa
  <chr>   <dbl> <dbl> <chr> <dbl>
1 Ant        80    85 Math    3.5
2 Bug        70    90 CS      2.5
3 Cat        85    70 EE      1.5

Specifying join keys: scenario 2

An example where join_by() could be applied to two possible variables in info_d.

exams_a
# A tibble: 3 × 3
  student exam1 exam2
  <chr>   <dbl> <dbl>
1 Ant        80    85
2 Bug        70    90
3 Cat        85    70
info_d
# A tibble: 2 × 4
  n1    major   gpa n2   
  <chr> <chr> <dbl> <chr>
1 Ant   Math    3.5 Bug  
2 Bug   CS      2.5 Ant  
exams_a |> 
  left_join(info_d, join_by(student == n1))
# A tibble: 3 × 6
  student exam1 exam2 major   gpa n2   
  <chr>   <dbl> <dbl> <chr> <dbl> <chr>
1 Ant        80    85 Math    3.5 Bug  
2 Bug        70    90 CS      2.5 Ant  
3 Cat        85    70 <NA>   NA   <NA> 
exams_a |> 
  left_join(info_d, join_by(student == n2))
# A tibble: 3 × 6
  student exam1 exam2 n1    major   gpa
  <chr>   <dbl> <dbl> <chr> <chr> <dbl>
1 Ant        80    85 Bug   CS      2.5
2 Bug        70    90 Ant   Math    3.5
3 Cat        85    70 <NA>  <NA>   NA  

Specifying join keys: scenario 3

Tables share a same column name but the column values mean different things.

info_c
# A tibble: 3 × 3
  name  major   gpa
  <chr> <chr> <dbl>
1 Ant   Math    3.5
2 Bug   CS      2.5
3 Cat   EE      1.5
animals
# A tibble: 3 × 3
  name   year species
  <chr> <dbl> <chr>  
1 Ant    2000 Insect 
2 Bug    2001 Insect 
3 Cat    2002 Mammal 
# What variables are being used as keys here?
info_c |> 
  left_join(animals)  
# A tibble: 3 × 5
  name  major   gpa  year species
  <chr> <chr> <dbl> <dbl> <chr>  
1 Ant   Math    3.5  2000 Insect 
2 Bug   CS      2.5  2001 Insect 
3 Cat   EE      1.5  2002 Mammal 

Specifying join keys: scenario 4

Non-key columns share same name.

exams_141a  # scores in STA 141A
# A tibble: 3 × 3
  stu     me1   me2
  <chr> <dbl> <dbl>
1 Ant      80    85
2 Bug      70    90
3 Cat      85    70
exams_141b  # scores in STA 141B
# A tibble: 2 × 4
  stu     me1   me2   me3
  <chr> <dbl> <dbl> <dbl>
1 Ant      50    60    70
2 Bug      35    60    30
# How are the me1 and me2 variables disambiguated?
exams_141a |> 
  left_join(exams_141b, join_by(stu))  
# A tibble: 3 × 6
  stu   me1.x me2.x me1.y me2.y   me3
  <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Ant      80    85    50    60    70
2 Bug      70    90    35    60    30
3 Cat      85    70    NA    NA    NA
# How are the me1 and me2 variables disambiguated?
exams_141a |> 
  left_join(exams_141b, join_by(stu), suffix=c('.141A', '.141B'))  
# A tibble: 3 × 6
  stu   me1.141A me2.141A me1.141B me2.141B   me3
  <chr>    <dbl>    <dbl>    <dbl>    <dbl> <dbl>
1 Ant         80       85       50       60    70
2 Bug         70       90       35       60    30
3 Cat         85       70       NA       NA    NA

Non-equi joins

So far we’ve only seen equi joins (joins where the rows match if the x key equals the y key). Non-equi joins relax that restriction and discuss other ways of combining data. See Section 19.5 of R4DS 2e for more information.

Beyond this course

Small and locally stored data sets

So far we have worked with data sets that are:

  • smallish (fewer than 1 million rows),
  • locally stored on our computer.

After this course, you may need to go beyond this.

Bigger and external data sets

The tidyverse approach can be slow if working with larger datasets.

  • Posit team continually works to improve code.
  • dtplyr or tidytable package can get closer to data.table-level speed.

In a job, we will often work with data in external databases.

  • SQL allows you to access/manipluate data in databases and hence is a core tool in a data scientist’s toolkit.
  • Unfortunately, we will not cover it in this class.
  • In the meantime, you can use the dbplyr package, which allows you to work with data in external databases without having to write any SQL code. You can state your desired functionality by writing R code with tidyverse syntax; dbplyr will then run the equivalent SQL code behind the scenes.