10: data visualization

STA35B: Statistical Data Science 2

Akira Horiguchi

Purpose of visualizing data

Exploratory data analysis
Presenting findings to others

The purpose affects how much effort to put in

If you are just exploring data, a plot doesn’t need to look pretty as long as you can interpret it
For a high-stakes presentation (e.g. for a job interview, raise, or promotion), you might need to add many bespoke features (not the focus of this class)

Exploratory data analysis

Cycle through the following:

Generate questions about your data.
Search for answers by visualizing, transforming, and modelling your data.
Use what you learn to refine your questions and/or generate new questions.

Requires creativity and critical thinking. Two question categories:

What type of variation occurs within each variable?
- Mean, standard deviation, skewness, etc
What type of covariation occurs between variables?
- How does height vary with weight, etc
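
Both question categories can be given a first numerical answer before any plotting. The sketch below is only an illustration: it assumes the palmerpenguins package is installed and uses its penguins data, with flipper length and body mass standing in for the height and weight example; the variable names on the left-hand side are made up for this sketch.

```r
library(palmerpenguins)  # provides the penguins data frame

# Variation within one variable: center and spread of body mass
mass_mean <- mean(penguins$body_mass_g, na.rm = TRUE)
mass_sd   <- sd(penguins$body_mass_g, na.rm = TRUE)

# Covariation between two variables: linear correlation of flipper
# length and body mass, using only rows where both are observed
flipper_mass_cor <- cor(penguins$flipper_length_mm,
                        penguins$body_mass_g,
                        use = "complete.obs")

mass_mean
mass_sd
flipper_mass_cor
```

Summaries like these can hide structure (e.g. separate clusters per species), which is one reason to also visualize.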

ggplot2

Visualization

We’ll see how to create beautiful visualizations using ggplot2.

library(tidyverse)
library(palmerpenguins)
library(ggthemes) # color palettes for ggplot
penguins

# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Basic structure of `ggplot2`

ggplot() constructs the initial plot.

The first argument of ggplot() is the data set for the plot.
- The data set must be a data frame.
For example, ggplot(data = mpg) creates an empty plot (mpg is a data set that ships with ggplot2).

You then add one or more layers to ggplot() using +.

geom functions add a geometrical object to the plot.
- geom_point(), geom_smooth(), geom_histogram(), geom_boxplot(), etc.
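
One way to see the "empty plot plus layers" structure is to build a plot in two steps and count its layers. This is a minimal sketch, assuming ggplot2 is installed; the object names p_empty and p_points are made up here, and displ and hwy are columns of the mpg data set.

```r
library(ggplot2)

# Step 1: the initial plot has data but no layers yet
p_empty <- ggplot(data = mpg)

# Step 2: adding a geom with + appends one layer
p_points <- p_empty + geom_point(aes(x = displ, y = hwy))

length(p_empty$layers)   # number of layers before adding a geom
length(p_points$layers)  # number of layers after adding geom_point()
```

Note that + returns a new plot object; p_empty itself is unchanged.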

Creating a ggplot

Start with function ggplot()

penguins |> 
    ggplot()

Creating a ggplot

Start with function ggplot()
Add global aesthetics (i.e., aesthetics applied to every layer in plot).

penguins |> 
    ggplot(aes(x = flipper_length_mm, 
               y = body_mass_g))

Creating a ggplot

Start with function ggplot()
Add global aesthetics (i.e., aesthetics applied to every layer in plot).
Add layers.
- Display data using a geom: the geometrical object used to represent the data
- geom_bar(): bar chart; geom_line(): lines; geom_boxplot(): boxplot; geom_point(): scatterplot

penguins |> 
    ggplot(aes(x = flipper_length_mm, 
               y = body_mass_g)) +
    geom_point()

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Adding aesthetics and layers

We can have aesthetics change as a function of variables inside the tibble

e.g. we can differentiate penguin species via colors
When a categorical variable is mapped to an aesthetic, each unique level of the variable (here: species) gets assigned a unique aesthetic value (here: unique color)

penguins |> 
    ggplot(aes(x = flipper_length_mm,
               y = body_mass_g,
               color = species)) +
    geom_point()

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Adding aesthetics and layers

Let’s add a new layer, geom_smooth(method="lm"), which draws the line of best fit from a linear model

When an aesthetic mapping is added inside ggplot(), it is applied to all layers.
- So color=species inside ggplot() will group all penguins by species.
- We now have a line for each species (not one global line).

penguins |> 
    ggplot(aes(x = flipper_length_mm,
               y = body_mass_g,
               color = species)) +
    geom_point() + 
    geom_smooth(method = "lm")

`geom_smooth()` using formula = 'y ~ x'

Adding aesthetics and layers

Let’s keep the geom_smooth(method="lm") layer, but move the color mapping from ggplot() into geom_point()

When an aesthetic mapping is added inside a layer, it is applied to just that layer.
- So color=species inside geom_point() will group all penguins by species only for that layer.
- We now have one global line for all penguins.

penguins |> 
    ggplot(aes(x = flipper_length_mm,
               y = body_mass_g)) +
    geom_point(aes(color = species)) + 
    geom_smooth(method = "lm", color = "purple")

`geom_smooth()` using formula = 'y ~ x'
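
The difference between a global and a local color mapping can also be checked without looking at the picture, by counting how many groups the smoothing layer fits. This is a sketch under a few assumptions: the names peng, p_global, and p_local are made up, rows with missing measurements are dropped first to avoid the warnings, and formula = y ~ x is passed explicitly to avoid the message.

```r
library(ggplot2)
library(palmerpenguins)

# Keep only rows where both plotted variables are observed
peng <- penguins[!is.na(penguins$flipper_length_mm) &
                 !is.na(penguins$body_mass_g), ]

# color = species mapped globally: every layer is grouped by species
p_global <- ggplot(peng, aes(x = flipper_length_mm, y = body_mass_g,
                             color = species)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x)

# color = species mapped only inside geom_point()
p_local <- ggplot(peng, aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(color = species)) +
    geom_smooth(method = "lm", formula = y ~ x)

# layer_data(p, 2) returns the data computed for the second layer
n_lines_global <- length(unique(layer_data(p_global, 2)$group))
n_lines_local  <- length(unique(layer_data(p_local, 2)$group))

n_lines_global  # one fitted line per species
n_lines_local   # a single fitted line
```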

Adding aesthetics and layers

Let’s make the colors more friendly to color-blind viewers

penguins |> 
    ggplot(aes(x = flipper_length_mm,
               y = body_mass_g)) +
    geom_point(aes(color = species)) + 
    geom_smooth(method = "lm", color = "purple") + 
    scale_color_colorblind()

`geom_smooth()` using formula = 'y ~ x'
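
To see which colors scale_color_colorblind() draws from, the underlying palette function in ggthemes can be called directly. A small sketch, assuming ggthemes is installed; species_colors is a made-up name, and 3 is used because there are three species.

```r
library(ggthemes)

# colorblind_pal() returns a function; calling it with n gives n hex colors
species_colors <- colorblind_pal()(3)
species_colors
```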

Adding aesthetics and layers

Let’s further distinguish the species via shapes.

We can specify this in a local aesthetic mapping of points using shape=
The legend will be updated to show this too!

penguins |> 
    ggplot(aes(x = flipper_length_mm,
               y = body_mass_g)) +
    geom_point(aes(color = species, 
                   shape = species)) + 
    geom_smooth(method = "lm") + 
    scale_color_colorblind()

`geom_smooth()` using formula = 'y ~ x'

Adding aesthetics and layers

Let’s further differentiate different species via shapes.

We can make all points the same color by specifying color= outside of aes()

penguins |> 
    ggplot(aes(x = flipper_length_mm,
               y = body_mass_g)) +
    geom_point(aes(shape = species), 
               color = "orange") + 
    geom_smooth(method = "lm", color = "black")

`geom_smooth()` using formula = 'y ~ x'
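
A common slip is to put the constant inside aes(). Then "orange" is treated as a one-level categorical variable and gets a color from the default palette, rather than being used as the color itself. The sketch below compares the two; the names peng, p_set, and p_map are made up for illustration, and rows with missing values are dropped to avoid warnings.

```r
library(ggplot2)
library(palmerpenguins)

peng <- penguins[!is.na(penguins$flipper_length_mm) &
                 !is.na(penguins$body_mass_g), ]

# Setting: color given outside aes() is used literally
p_set <- ggplot(peng, aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(color = "orange")

# Mapping: "orange" inside aes() is data, so the scale picks the color
p_map <- ggplot(peng, aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(color = "orange"))

# Colors actually used to draw the points in each plot
set_colors <- unique(layer_data(p_set)$colour)
map_colors <- unique(layer_data(p_map)$colour)

set_colors
map_colors
```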

Adding aesthetics and layers

Let’s further differentiate different species via shapes.

We can make all points the same shape by specifying shape= outside of aes()

penguins |> 
    ggplot(aes(x = flipper_length_mm,
               y = body_mass_g)) +
    geom_point(aes(color = species), 
               shape = 7) + 
    geom_smooth(method = "lm") + 
    scale_color_colorblind()

`geom_smooth()` using formula = 'y ~ x'

Adding aesthetics and layers

Let’s further differentiate different species via shapes.

We can also specify shape= outside of aes()

Figure 1: Mapping between shapes and the numbers that represent them.

- Hollow shapes (0–14), border determined by `color`: 0 square, 1 circle, 2 triangle point up, 3 plus, 4 cross, 5 diamond, 6 triangle point down, 7 square cross, 8 star, 9 diamond plus, 10 circle plus, 11 triangles up and down, 12 square plus, 13 circle cross, 14 square and triangle down
- Solid shapes (15–20), filled with `color`: 15 square, 16 circle, 17 triangle point up, 18 diamond, 19 circle, 20 bullet (smaller circle)
- Filled shapes (21–25), border of `color` and filled with `fill`: 21 circle, 22 square, 23 diamond, 24 triangle point up, 25 triangle point down

R has 26 built-in shapes, identified by the numbers 0–25. There are some seeming duplicates: for example, 0, 15, and 22 are all squares. The difference comes from the interaction of the `color` and `fill` aesthetics.
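
As an illustration of how `color` and `fill` interact for the filled shapes, the sketch below draws every point as shape 21 with a black border and an orange interior. The specific colors, the size, and the names peng, p_filled, and built are arbitrary choices for this example.

```r
library(ggplot2)
library(palmerpenguins)

peng <- penguins[!is.na(penguins$flipper_length_mm) &
                 !is.na(penguins$body_mass_g), ]

# Shape 21 is a circle with a separate border (color) and interior (fill)
p_filled <- ggplot(peng, aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(shape = 21, color = "black", fill = "orange", size = 3)

# Inspect the aesthetics used to draw the layer
built <- layer_data(p_filled)
unique(built[, c("shape", "colour", "fill")])
```

With a hollow or solid shape (0–20) the fill setting would have no visible effect.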

Adding aesthetics and layers

Let’s further differentiate different species via shapes.

We can also specify size= outside of aes()

penguins |> 
    ggplot(aes(x = flipper_length_mm,
               y = body_mass_g)) +
    geom_point(aes(color = species, 
                   shape = species), size=4) + 
    geom_smooth(method = "lm") + 
    scale_color_colorblind()

`geom_smooth()` using formula = 'y ~ x'

Now we just need to add a title and axis labels

penguins |> 
    ggplot(aes(x = flipper_length_mm,
               y = body_mass_g)) +
    geom_point(aes(color = species, 
                   shape = species)) + 
    geom_smooth(method = "lm") + 
    labs(
        title = "Body mass and flipper length",
        subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
        x = "Flipper length (mm)", y = "Body mass (g)",
        color = "Species", shape = "Species"
    ) +
    scale_color_colorblind()

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Visualizing distributions

A categorical variable takes one of a small, finite set of values

Bar charts are useful for visualizing categorical variables

penguins |> 
    ggplot(aes(x = species)) +
    geom_bar()
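
The bar heights are simply the number of rows in each category. As a quick check (a sketch; species_counts, bar_plot, and bar_heights are made-up names), the counts from table() can be compared with what geom_bar() computes:

```r
library(ggplot2)
library(palmerpenguins)

# Count rows per species directly
species_counts <- table(penguins$species)

# Extract the heights that geom_bar() computes for the same data
bar_plot <- ggplot(penguins, aes(x = species)) +
    geom_bar()
bar_heights <- layer_data(bar_plot)$count

species_counts
bar_heights
```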

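A numerical variable takes values from a wide numeric range, where arithmetic such as averaging makes sense

Histograms are useful for these; set the width of each bin with the argument binwidth =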

penguins |> 
    ggplot(aes(x = body_mass_g)) +
    geom_histogram(binwidth = 200)

Visualizing distributions

You will likely need to spend time tuning the binwidth parameter

penguins |> 
    ggplot(aes(x = body_mass_g)) +
    geom_histogram(binwidth = 2000)

penguins |> 
    ggplot(aes(x = body_mass_g)) +
    geom_histogram(binwidth = 20)
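
A rough way to anticipate how a binwidth will look is to divide the range of the data by it. The sketch below does this for the three binwidths used above; it only approximates the number of bins, since the exact count ggplot2 draws also depends on where the bin boundaries fall. The names mass_range, binwidths, and approx_bins are made up.

```r
library(palmerpenguins)

# Smallest and largest observed body mass, ignoring missing values
mass_range <- range(penguins$body_mass_g, na.rm = TRUE)

# Approximate number of bins needed to cover the range
binwidths <- c(20, 200, 2000)
approx_bins <- ceiling(diff(mass_range) / binwidths)
names(approx_bins) <- binwidths

approx_bins
```

Too few bins hide the shape of the distribution, while too many make the plot noisy.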

Visualizing distributions

A density plot is a smoothed-out version of a histogram that approximates the variable’s probability density function

penguins |> 
    ggplot(aes(x = body_mass_g)) +
    geom_density()

penguins |> 
    ggplot(aes(x = body_mass_g)) +
    geom_histogram(binwidth = 200)
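
Because the curve approximates a probability density, the area under it should be close to 1, unlike a histogram of counts. The sketch below checks this with base R's density() at its default settings, which is assumed here to be close to what geom_density() draws by default; mass, d, and area are made-up names.

```r
library(palmerpenguins)

# Body mass without missing values
mass <- penguins$body_mass_g[!is.na(penguins$body_mass_g)]

# Kernel density estimate with base R's default settings
d <- density(mass)

# Area under the estimated curve, using the trapezoid rule
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area
```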

Visualizing distributions

Let’s check the difference between setting color = vs fill = with geom_bar:

penguins |> 
    ggplot(aes(x = species)) +
    geom_bar(color = "red")

penguins |> 
    ggplot(aes(x = species)) +
    geom_bar(fill = "red")

Visualizing distributions

Let’s check the difference between setting color = vs fill = with geom_bar:

penguins |> 
    ggplot(aes(x = species)) +
    geom_bar(color = "red")

penguins |> 
    ggplot(aes(x = species)) +
    geom_bar(fill = "red", color="purple")

Visualizing distributions

Box plots summarize the center and spread of a distribution
They make it easy to see the 25th percentile, median, 75th percentile, and outliers (points more than 1.5*IQR below the 25th percentile or above the 75th percentile, where IQR = 75th percentile - 25th percentile)
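
The quantities behind a box plot can be computed by hand. This is a minimal sketch for one species using base R's quantile(); the choice of Chinstrap and the names mass, q, iqr, lower_fence, upper_fence, and outliers are just for illustration.

```r
library(palmerpenguins)

# Body mass for one species, without missing values
mass <- penguins$body_mass_g[penguins$species == "Chinstrap"]
mass <- mass[!is.na(mass)]

# 25th percentile, median, and 75th percentile
q <- quantile(mass, probs = c(0.25, 0.5, 0.75))
iqr <- unname(q["75%"] - q["25%"])

# Points beyond these fences are flagged as outliers
lower_fence <- unname(q["25%"]) - 1.5 * iqr
upper_fence <- unname(q["75%"]) + 1.5 * iqr
outliers <- mass[mass < lower_fence | mass > upper_fence]

q
outliers
```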

Visualizing distributions

Let’s see distribution of body mass by species…

…using geom_boxplot():

penguins |> 
    ggplot(aes(x = species, 
               y = body_mass_g)) +
    geom_boxplot()

…using geom_density():

penguins |> 
    ggplot(aes(x = body_mass_g, 
               color = species)) +
    geom_density(linewidth = 0.75)

Playing with visual parameters

Use alpha to add transparency

alpha is a number between 0 and 1; 0 = transparent, 1 = opaque

penguins |> 
    ggplot(aes(x = body_mass_g, 
               color = species, 
               fill = species)) +
    geom_density(alpha = 0.3)

penguins |> 
    ggplot(aes(x = body_mass_g, 
               color = species, 
               fill = species)) +
    geom_density(alpha = 0.7)

Multiple numerical variables

We already saw how to use scatter plots to visualize two numeric variables

We can map different variables to color and shape

penguins |> 
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(color = species, shape = island))