s3.5: data visualization

STA141A: Fundamentals of Statistical Data Science

Akira Horiguchi

Purpose of visualizing data

  1. Exploratory data analysis
  2. Presenting findings to others

The purpose affects how much effort to put in

  • If just exploring data, a plot doesn’t need to look pretty as long as you can interpret it
  • If it is a high-stakes presentation (e.g., for a job interview, raise, or promotion), you might need to add many bespoke features (not the focus of this class)

Exploratory data analysis

Cycle through the following:

  • Generate questions about your data.

  • Search for answers by visualizing, transforming, and modelling your data.

  • Use what you learn to refine your questions and/or generate new questions

Requires creativity and critical thinking. Two question categories:

  1. What type of variation occurs within each variable?
    • Mean, standard deviation, skewness, etc
  2. What type of covariation occurs between variables?
    • How does height vary with weight, etc
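
The two question categories can be made concrete with a few summary statistics. The sketch below uses R's built-in women data set, which records height and weight for 15 women; that data set is an assumption chosen for illustration and is not used elsewhere in these slides.

```r
# Variation within one variable: center and spread of height.
height_mean <- mean(women$height)
height_sd   <- sd(women$height)

# Covariation between two variables: how height and weight move together.
height_weight_cor <- cor(women$height, women$weight)

print(c(mean = height_mean, sd = height_sd, cor = height_weight_cor))
```

Numbers like these answer narrow questions; plots usually reveal more, which is the motivation for the rest of this section.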

ggplot2

Visualization

We’ll see how to create beautiful visualizations using ggplot2.

library(tidyverse)
library(palmerpenguins)
library(ggthemes) # color palettes for ggplot
penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Basic structure of ggplot2

ggplot() constructs the initial plot.

  • The first argument of ggplot() is the data set for the plot.
    • The data set must be a data frame.
  • ggplot(data = mpg) creates an empty plot (mpg is an example data set that ships with ggplot2).

You then add one or more layers to ggplot() using +.

  • geom functions add a geometrical object to the plot.
    • geom_point(), geom_smooth(), geom_histogram(), geom_boxplot(), etc.
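
One way to see the "empty plot plus layers" structure is to count the layers stored in the plot object. This is only a sketch: peeking at the layers element is a convenience for illustration, and the variable names (p_empty, p_points) are made up here.

```r
library(ggplot2)

# ggplot() alone sets up the plot but draws no data: it has zero layers.
p_empty <- ggplot(data = mpg)
n_layers_empty <- length(p_empty$layers)

# Adding a geom with + appends one layer; p_empty itself is unchanged.
p_points <- p_empty + geom_point(aes(x = displ, y = hwy))
n_layers_points <- length(p_points$layers)

print(c(empty = n_layers_empty, with_points = n_layers_points))
```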

Creating a ggplot

  • Start with function ggplot()
penguins |> 
    ggplot()

Creating a ggplot

  • Start with function ggplot()
  • Add global aesthetics (i.e., aesthetics applied to every layer in plot).
penguins |> 
    ggplot(aes(x = flipper_length_mm, 
               y = body_mass_g)) 

Creating a ggplot

  • Start with function ggplot()
  • Add global aesthetics (i.e., aesthetics applied to every layer in plot).
  • Add layers.
    • Display data using geom: geometrical object used to represent data
    • geom_bar(): bar chart; geom_line(): lines; geom_boxplot(): boxplot; geom_point(): scatterplot
penguins |> 
    ggplot(aes(x = flipper_length_mm, 
               y = body_mass_g)) +
    geom_point()

Adding aesthetics and layers

We can have aesthetics change as a function of variables inside the tibble

  • e.g. we can differentiate penguin species via colors
  • When a categorical variable is mapped to an aesthetic, each unique level of the variable (here: species) gets assigned a unique aesthetic value (here: unique color)
penguins |> 
    ggplot(aes(x = flipper_length_mm,
               y = body_mass_g,
               color = species)) +
    geom_point()

Adding aesthetics and layers

Let’s add a new layer, geom_smooth(method = "lm"), which draws the line of best fit from a linear model

  • When an aesthetic mapping is added inside ggplot(), it is applied to all layers.
    • So color=species inside ggplot() will group all penguins by species.
    • We now have a line for each species (not one global line).
penguins |> 
    ggplot(aes(x = flipper_length_mm,
               y = body_mass_g,
               color = species)) +
    geom_point() + 
    geom_smooth(method = "lm")

Adding aesthetics and layers

Now keep geom_smooth(method = "lm") but move color = species from ggplot() into geom_point()

  • When an aesthetic mapping is added inside a layer, it is applied to just that layer.
    • So color=species inside geom_point() will group all penguins by species only for that layer.
    • We now have one global line for all penguins.
penguins |> 
    ggplot(aes(x = flipper_length_mm,
               y = body_mass_g)) +
    geom_point(aes(color = species)) + 
    geom_smooth(method = "lm")
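
The difference between a global and a local mapping can also be checked numerically. The sketch below builds both versions and uses layer_data() to count how many groups the smoothing layer was fitted to; the object names (p_global, p_local) are hypothetical.

```r
library(ggplot2)
library(palmerpenguins)

# Global mapping: color = species is inherited by geom_smooth().
p_global <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
    geom_point() +
    geom_smooth(method = "lm")

# Local mapping: color = species applies to geom_point() only.
p_local <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(color = species)) +
    geom_smooth(method = "lm")

# layer_data(p, 2) returns the data computed for the second layer (the smooth).
n_lines_global <- length(unique(layer_data(p_global, 2)$group))
n_lines_local  <- length(unique(layer_data(p_local, 2)$group))
print(c(global = n_lines_global, local = n_lines_local))
```

Expect warnings about removed rows, because two penguins have missing measurements.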

Adding aesthetics and layers

Let’s make the colors more friendly to color-blind viewers

penguins |> 
    ggplot(aes(x = flipper_length_mm,
               y = body_mass_g)) +
    geom_point(aes(color = species)) + 
    geom_smooth(method = "lm") + 
    scale_color_colorblind()
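
To see which colors scale_color_colorblind() draws from, the palette can be inspected directly. This sketch assumes the ggthemes package is installed and uses its colorblind_pal() function; the variable name is made up.

```r
library(ggthemes)

# colorblind_pal() returns a palette function; calling it with n gives n hex colors.
species_colors <- colorblind_pal()(3)
print(species_colors)
```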

Adding aesthetics and layers

Let’s further distinguish the species via shapes.

  • We can specify this in a local aesthetic mapping of points using shape=
  • The legend will be updated to show this too!
penguins |> 
    ggplot(aes(x = flipper_length_mm,
               y = body_mass_g)) +
    geom_point(aes(color = species, 
                   shape = species)) + 
    geom_smooth(method = "lm") + 
    scale_color_colorblind()

Now we just need to add a title and axis labels

penguins |>
    ggplot(aes(x = flipper_length_mm,
               y = body_mass_g)) +
    geom_point(aes(color = species,
                   shape = species)) +
    geom_smooth(method = "lm") +
    labs(
        title = "Body mass and flipper length",
        subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
        x = "Flipper length (mm)", y = "Body mass (g)",
        color = "Species", shape = "Species"
    ) +
    scale_color_colorblind()

Visualizing distributions

Categorical variables take only one of a finite set of values

  • Bar charts are useful for visualizing categorical variables
penguins |> 
    ggplot(aes(x = species)) +
    geom_bar()

Numerical variables take on a wide range of numeric values

  • Histograms are useful for these; set the width of each bin with the binwidth = argument
penguins |> 
    ggplot(aes(x = body_mass_g)) +
    geom_histogram(binwidth = 200)

Visualizing distributions

You will likely need to spend time tuning the binwidth parameter

penguins |> 
    ggplot(aes(x = body_mass_g)) +
    geom_histogram(binwidth = 2000)

penguins |> 
    ggplot(aes(x = body_mass_g)) +
    geom_histogram(binwidth = 20)
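
Tuning is mostly trial and error, but a rule of thumb can supply a starting value. The sketch below applies the Freedman-Diaconis rule, one common heuristic that is not part of the lecture itself; treat the result as a first guess to adjust by eye.

```r
library(palmerpenguins)

# Drop missing values before computing the heuristic.
mass <- penguins$body_mass_g[!is.na(penguins$body_mass_g)]

# Freedman-Diaconis rule: binwidth = 2 * IQR / n^(1/3).
binwidth_fd <- 2 * IQR(mass) / length(mass)^(1/3)
print(binwidth_fd)
```

The value lands between the two extremes shown above (20 and 2000), which is consistent with those being too narrow and too wide.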

Visualizing distributions

  • A density plot is a smoothed-out version of a histogram that estimates the variable’s probability density function
penguins |> 
    ggplot(aes(x = body_mass_g)) +
    geom_density()

penguins |> 
    ggplot(aes(x = body_mass_g)) +
    geom_histogram(binwidth = 200)
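
To compare the two views directly, the density curve can be drawn on top of the histogram. By default the histogram's y-axis shows counts, so this sketch rescales it with after_stat(density) so that both layers share a density scale; the object name p_overlay is made up.

```r
library(ggplot2)
library(palmerpenguins)

p_overlay <- penguins |>
    ggplot(aes(x = body_mass_g)) +
    # after_stat(density) rescales bar heights so the bars have total area 1.
    geom_histogram(aes(y = after_stat(density)), binwidth = 200) +
    geom_density(linewidth = 0.75)
p_overlay
```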

Visualizing distributions

  • Let’s check the difference between setting color = (the bar outline) vs fill = (the bar interior) with geom_bar():
penguins |> 
    ggplot(aes(x = species)) +
    geom_bar(color = "red")

penguins |> 
    ggplot(aes(x = species)) +
    geom_bar(fill = "red")

Visualizing distributions

  • Both can be set at once, e.g. a red interior with a purple outline:
penguins |>
    ggplot(aes(x = species)) +
    geom_bar(fill = "red", color = "purple")

Visualizing distributions

  • Box plots allow for visualizing the spread of a distribution
  • Makes it easy to see the 25th percentile, median, 75th percentile, and outliers (points more than 1.5*IQR below the 25th percentile or above the 75th percentile)
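
The quantities a box plot displays can be computed by hand. The following is a minimal sketch using dplyr and the penguins data; the column names (q25, lower_fence, and so on) are made up for illustration.

```r
library(dplyr)
library(palmerpenguins)

box_stats <- penguins |>
    filter(!is.na(body_mass_g)) |>
    group_by(species) |>
    summarise(
        q25 = quantile(body_mass_g, 0.25),
        q50 = quantile(body_mass_g, 0.50),
        q75 = quantile(body_mass_g, 0.75),
        # Fences: points beyond these are the ones flagged as outliers.
        lower_fence = q25 - 1.5 * (q75 - q25),
        upper_fence = q75 + 1.5 * (q75 - q25),
        n_outliers = sum(body_mass_g < lower_fence | body_mass_g > upper_fence)
    )
print(box_stats)
```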

Visualizing distributions

Let’s see distribution of body mass by species…

…using geom_boxplot():

penguins |> 
    ggplot(aes(x = species, 
               y = body_mass_g)) +
    geom_boxplot()

…using geom_density():

penguins |> 
    ggplot(aes(x = body_mass_g, 
               color = species)) +
    geom_density(linewidth = 0.75)

Playing with visual parameters

Use alpha to add transparency

  • alpha is a number between 0 and 1; 0 = transparent, 1 = opaque
penguins |> 
    ggplot(aes(x = body_mass_g, 
               color = species, 
               fill = species)) +
    geom_density(alpha = 0.3)

penguins |> 
    ggplot(aes(x = body_mass_g, 
               color = species, 
               fill = species)) +
    geom_density(alpha = 0.7)

Multiple numerical variables

We already saw how to use scatter plots to visualize two numeric variables

We can map different variables to color and shape

penguins |> 
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(color = species, shape = island))

Too many things to remember…

Multiple numerical variables

Too many aesthetic changes (shape, color, fill, size, etc) can clutter plots

  • Can use facets
  • facet_wrap() takes a one-sided formula (~variable) and makes one panel per level of that variable
penguins |> 
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(color = species, shape = species)) +
    facet_wrap(~island)

Multiple numerical variables

Too many aesthetic changes (shape, color, fill, size, etc) can clutter plots

  • Can use facets
  • facet_grid() takes a two-sided formula (rows ~ columns)
penguins |> 
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(color = species, shape = species)) +
    facet_grid(sex~island)

Multiple numerical variables

Too many aesthetic changes (shape, color, fill, size, etc) can clutter plots

  • Can use facets
  • Swapping the two sides of the facet_grid() formula swaps the rows and columns
penguins |> 
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(color = species, shape = species)) +
    facet_grid(island~sex)
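
The formula can also be spelled out with named arguments, which some people find easier to read. This sketch uses the rows = and cols = arguments of facet_grid() with vars(), and additionally drops penguins whose sex was not recorded so that no NA panels appear; that filtering step is an assumption about what you want, not something the plots above do.

```r
library(dplyr)
library(ggplot2)
library(palmerpenguins)

p_facets <- penguins |>
    filter(!is.na(sex)) |>  # drop penguins with unrecorded sex
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(color = species, shape = species)) +
    # Same layout as facet_grid(sex ~ island): rows by sex, columns by island.
    facet_grid(rows = vars(sex), cols = vars(island))
p_facets
```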

Saving plots

Once you’ve made a plot, you can save it using ggsave()

  • Either save the plot you made last:
penguins |> 
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point()
ggsave(filename = "penguin-plot.png")
  • Or assign the plot to an object and pass that object to ggsave()
p <- penguins |>
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point()
ggsave(filename = "penguin-plot.png", plot = p)
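
By default ggsave() uses the size of the current graphics device, so the saved image can vary between sessions. A minimal sketch that fixes the size and resolution explicitly follows; the dimensions and the temporary output path are arbitrary choices for illustration.

```r
library(ggplot2)
library(palmerpenguins)

p <- penguins |>
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point()

# Hypothetical output location: a temporary directory, so no project file is overwritten.
out_file <- file.path(tempdir(), "penguin-plot.png")

# width and height are in the given units; dpi sets the resolution of raster output.
ggsave(filename = out_file, plot = p, width = 6, height = 4, units = "in", dpi = 300)
```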

More examples

We can change the order of a factor’s levels (here, to control the order of the facets)

penguins |> 
    mutate(island = factor(island, levels=c("Dream", "Biscoe", "Torgersen"))) |> 
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(color = species, shape = species)) +
    facet_wrap(~island)
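
Retyping every level works, but the forcats package (loaded with the tidyverse) has helpers for common reorderings. The sketch below shows two of them on the island factor; the variable names are made up.

```r
library(forcats)
library(palmerpenguins)

# Original level order of the factor.
levels(penguins$island)

# fct_relevel() moves the named level(s) to the front and keeps the rest in order.
island_reordered <- fct_relevel(penguins$island, "Dream")
levels(island_reordered)

# fct_infreq() orders levels from most to least frequent.
island_by_count <- fct_infreq(penguins$island)
levels(island_by_count)
```

Either result can be used inside mutate() in place of the factor() call above.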

Shapes can be difficult to distinguish

penguins |> 
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(color = species, shape = species))

Sometimes a plot is easier to read if each point is drawn as the first letter of its species instead of a shape

penguins |> 
    mutate(species_first_letter = substr(species, 1, 1)) |> 
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_text(aes(color = species, label = species_first_letter))

This might allow you to remove the legend

penguins |> 
    mutate(species_first_letter = substr(species, 1, 1)) |> 
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_text(aes(color = species, label = species_first_letter)) + 
    theme(legend.position = "none")

Change background

I don’t like the gray default background

penguins |> 
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(color = species, shape = species)) +
    theme_minimal()  # too minimal for my tastes

Change background

I don’t like the gray default background

penguins |> 
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(color = species, shape = species)) +
    theme_bw()  # Gestalt principles: how does the human brain organize visual information?

Examples in action

Many extensions: https://exts.ggplot2.tidyverse.org/gallery/

Interactive visualizations

Why make interactive visualizations?

For the same reasons as static visualizations:

  • To explore data
  • To serve as the final product itself (i.e., the “deliverable”)

Goal should determine how much time/effort to put in

(gg)plotly

pp <- penguins |>
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(color = species, shape = sex)) +
    facet_grid(year~island) +
    theme(text = element_text(size=14))
pp  # static visualization

(gg)plotly

library(plotly)  # may need install.packages("plotly")
ggplotly(pp)  # interactive visualization. toggle via single-click or double-click
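
An interactive plot can be customized a little and shared as an HTML file. The sketch below assumes the plotly and htmlwidgets packages are installed; the tooltip choice and the temporary output path are illustrative, not part of the lecture.

```r
library(ggplot2)
library(palmerpenguins)
library(plotly)

pp <- penguins |>
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(color = species, shape = sex))

# tooltip limits which mapped aesthetics are shown when hovering over a point.
pp_interactive <- ggplotly(pp, tooltip = c("x", "y", "colour"))

# Hypothetical output path; selfcontained = FALSE writes the JavaScript
# dependencies to a folder next to the HTML file.
out_html <- file.path(tempdir(), "penguin-plot.html")
htmlwidgets::saveWidget(pp_interactive, file = out_html, selfcontained = FALSE)
```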

Other ways (not necessarily R)

  • Shiny: https://shiny.posit.co/
    • R or Python
    • Pro: free
    • Con: not as much functionality as the two below
  • Tableau: https://www.tableau.com/
    • drag and drop
    • Pro: lots of functionality, professional looking
    • Con: not free
  • d3: https://d3js.org/
    • JavaScript library
    • Pro: free, lots of functionality, professional looking
    • Con: coding might be difficult