3.3 Exploratory data analysis

STA141A: Fundamentals of Statistical Data Science

Akira Horiguchi

Exploratory data analysis

Cycle through the following:

  • Generate questions about your data.

  • Search for answers by visualizing, transforming, and modelling your data.

  • Use what you learn to refine your questions and/or generate new questions.

Requires creativity and critical thinking.

How representative is my data of the population I am interested in?

Two question categories:

  1. What type of variation occurs within each variable?
    • Mean, standard deviation, skewness, etc.
  2. What type of covariation occurs between variables?
    • E.g., how does height vary with weight?
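
As a rough sketch of what these two question categories look like numerically, the snippet below summarizes the variation in one variable and the covariation between two. It assumes the palmerpenguins data used later in these notes; the choice of columns is only for illustration.

```r
library(dplyr)
library(palmerpenguins)

# Variation within one variable: center and spread of body mass.
# na.rm = TRUE drops the missing measurements before summarizing.
variation <- penguins |>
    summarise(
        mean_mass = mean(body_mass_g, na.rm = TRUE),
        sd_mass   = sd(body_mass_g, na.rm = TRUE)
    )

# Covariation between two variables: Pearson correlation of
# flipper length and body mass, using only complete pairs.
covariation <- penguins |>
    summarise(
        r = cor(flipper_length_mm, body_mass_g, use = "complete.obs")
    )

variation
covariation
```

Numerical summaries like these complement plots rather than replace them: two variables can share the same correlation while looking very different in a scatterplot.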

Getting started

  1. What type of variation occurs within each variable?
  2. What type of covariation occurs between variables?

Why bother visualizing?

ggplot2

Visualization

We’ll see how to create beautiful visualizations using ggplot2

library(tidyverse)  # also loads ggplot2 package
library(ggthemes)  # color palettes for ggplot

…using the dataset:

library(palmerpenguins)
penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
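
Row 4 of the printout already shows missing values (NA). Before plotting, it can help to see every column at once and to count the missing values per column. This is a minimal sketch using dplyr, which is loaded as part of the tidyverse; the object name na_counts is just an illustrative choice.

```r
library(dplyr)
library(palmerpenguins)

# Compact overview: one line per column with its type and first values
glimpse(penguins)

# Number of missing values in each column
na_counts <- penguins |>
    summarise(across(everything(), ~ sum(is.na(.x))))
na_counts
```

ggplot2 drops rows with missing values in the plotted variables and prints a warning, so knowing where the NAs are makes those warnings easier to interpret.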

Basic structure of ggplot2

ggplot() constructs the initial plot.

  • The first argument of ggplot() is the data set for the plot.
    • The data set must be a data frame.
  • ggplot(data = penguins) creates an empty plot.

You then add one or more layers to ggplot() using +.

  • geom functions add a geometrical object to the plot.
    • geom_point(), geom_smooth(), geom_histogram(), geom_boxplot(), etc.

Creating a ggplot

  • Start with function ggplot()
penguins |> 
    ggplot()

Creating a ggplot

  • Start with function ggplot()
  • Add global aesthetics (i.e., aesthetics applied to every layer in plot).
penguins |> 
    ggplot(aes(x = flipper_length_mm, 
               y = body_mass_g)) 

Creating a ggplot

  • Start with function ggplot()
  • Add global aesthetics (i.e., aesthetics applied to every layer in plot).
  • Add layers.
    • Display data using a geom: a geometrical object used to represent data
    • geom_bar(): bar chart; geom_line(): lines; geom_boxplot(): boxplot; geom_point(): scatterplot
penguins |> 
    ggplot(aes(x = flipper_length_mm, 
               y = body_mass_g)) +
    geom_point()

Adding aesthetics and layers

We can have aesthetics change as a function of variables inside the tibble

  • e.g. we can differentiate penguin species via colors
  • When a categorical variable is mapped to an aesthetic, each unique level of the variable (here: species) gets assigned a unique aesthetic value (here: unique color)
penguins |> 
    ggplot(aes(x = flipper_length_mm,
               y = body_mass_g,
               color = species)) +
    geom_point()

Adding aesthetics and layers

Let’s add a new layer, geom_smooth(method = "lm"), which draws the line of best fit from a linear model.

  • When an aesthetic mapping is added inside ggplot(), it is applied to all layers.
    • So color=species inside ggplot() will group all penguins by species.
    • We now have a line for each species (not one global line).
penguins |> 
    ggplot(aes(x = flipper_length_mm,
               y = body_mass_g,
               color = species)) +
    geom_point() + 
    geom_smooth(method = "lm")

Adding aesthetics and layers

Let’s add a new layer, geom_smooth(method = "lm"), which draws the line of best fit from a linear model.

  • When an aesthetic mapping is added inside a layer, it is applied to just that layer.
    • So color=species inside geom_point() will group all penguins by species only for that layer.
    • We now have one global line for all penguins.
penguins |> 
    ggplot(aes(x = flipper_length_mm,
               y = body_mass_g)) +
    geom_point(aes(color = species)) + 
    geom_smooth(method = "lm")

Adding aesthetics and layers

Let’s make the colors more friendly to color-blind viewers

penguins |> 
    ggplot(aes(x = flipper_length_mm,
               y = body_mass_g)) +
    geom_point(aes(color = species)) + 
    geom_smooth(method = "lm") + 
    scale_color_colorblind()
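
scale_color_colorblind() comes from the ggthemes package loaded earlier. To see which colors it will hand out, one option is to inspect the underlying palette directly. The sketch below asks for three colors, one per species; the name pal is only illustrative.

```r
library(ggthemes)

# colorblind_pal() returns a function; calling it with n gives the
# first n colors of the palette as hex strings
pal <- colorblind_pal()(3)
pal
```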

Adding aesthetics and layers

Let’s further differentiate the species via shapes.

  • We can specify this in a local aesthetic mapping of points using shape=
  • The legend will be updated to show this too!
penguins |> 
    ggplot(aes(x = flipper_length_mm,
               y = body_mass_g)) +
    geom_point(aes(color = species, 
                   shape = species)) + 
    geom_smooth(method = "lm") + 
    scale_color_colorblind()

Now we just need to add a title and axis labels.

penguins |> 
    ggplot(aes(x = flipper_length_mm,
               y = body_mass_g)) +
    geom_point(aes(color = species, 
                   shape = species)) + 
    geom_smooth(method = "lm") + 
    labs(
        title = "Body mass and flipper length",
        subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
        x = "Flipper length (mm)", y = "Body mass (g)",
        color = "Species", shape = "Species"
    ) +
    scale_color_colorblind()

Visualizing distributions

A categorical variable takes one of a finite set of values.

  • Bar charts are useful for visualizing categorical variables
penguins |> 
    ggplot(aes(x = species)) +
    geom_bar()
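
The height of each bar is simply the number of rows in that category. As a small sketch, the same numbers can be tabulated with dplyr's count(); the name species_counts is only illustrative.

```r
library(dplyr)
library(palmerpenguins)

# One row per species; n is the number of penguins, i.e. the bar height
species_counts <- penguins |>
    count(species)
species_counts
```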

Numerical variables are the kind we are already familiar with.

  • Histograms are useful for these; set the bin width with the binwidth argument
penguins |> 
    ggplot(aes(x = body_mass_g)) +
    geom_histogram(binwidth = 200)

Visualizing distributions

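You will likely need to spend time tuning the binwidth parameter: bins that are too wide hide the shape of the distribution, and bins that are too narrow show mostly noise.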

penguins |> 
    ggplot(aes(x = body_mass_g)) +
    geom_histogram(binwidth = 2000)

penguins |> 
    ggplot(aes(x = body_mass_g)) +
    geom_histogram(binwidth = 20)
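
One way to reason about a binwidth before plotting is to compare it with the range of the variable. The sketch below estimates roughly how many bins each binwidth produces for body mass; the exact count in a plot can differ by one or so depending on where the bin boundaries fall.

```r
library(palmerpenguins)

mass <- penguins$body_mass_g
mass_range <- diff(range(mass, na.rm = TRUE))  # max minus min, in grams

# Approximate number of bins needed to cover the range at each binwidth
binwidths <- c(2000, 200, 20)
approx_bins <- ceiling(mass_range / binwidths)
data.frame(binwidth = binwidths, approx_bins = approx_bins)
```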

Visualizing distributions

  • A density plot is a smoothed-out version of a histogram that approximates a probability density function
penguins |> 
    ggplot(aes(x = body_mass_g)) +
    geom_density()

penguins |> 
    ggplot(aes(x = body_mass_g)) +
    geom_histogram(binwidth = 200)
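
A practical consequence of approximating a probability density function is that the area under the curve is about 1, whereas the histogram above shows raw counts. The sketch below checks this with base R's density() function, assuming its default settings are comparable to what geom_density() draws.

```r
library(palmerpenguins)

mass <- penguins$body_mass_g

# Kernel density estimate on a regular grid of x values
d <- density(mass, na.rm = TRUE)

# Riemann-sum approximation of the area under the curve
grid_step <- diff(d$x)[1]
area <- sum(d$y) * grid_step
area
```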

Visualizing distributions

  • Let’s check the difference between setting color = vs fill = with geom_bar:
penguins |> 
    ggplot(aes(x = species)) +
    geom_bar(color = "red")

penguins |> 
    ggplot(aes(x = species)) +
    geom_bar(fill = "red")

  • Note that a color set independently of the data values goes outside of aes()
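
For contrast, here is a small sketch of setting versus mapping a fill. The plot names p_set and p_map are only illustrative.

```r
library(ggplot2)
library(palmerpenguins)

# Setting: a constant outside aes(), so every bar gets the same fill
p_set <- ggplot(penguins, aes(x = species)) +
    geom_bar(fill = "red")

# Mapping: a variable inside aes(), so each species gets its own fill
# and a legend is added
p_map <- ggplot(penguins, aes(x = species)) +
    geom_bar(aes(fill = species))

p_set
p_map
```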

Visualizing distributions

  • Box plots allow for visualizing the spread of a distribution
  • They make it easy to see the 25th percentile, median, 75th percentile, and outliers (points more than 1.5*IQR below the 25th percentile or above the 75th percentile)
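
The quantities behind a box plot can also be computed by hand. This is a minimal sketch for body mass across all penguins, using base R; the names lower_fence and upper_fence are illustrative labels for the 1.5*IQR cutoffs.

```r
library(palmerpenguins)

mass <- penguins$body_mass_g

# 25th percentile, median, and 75th percentile
q <- quantile(mass, c(0.25, 0.5, 0.75), na.rm = TRUE)

# Interquartile range and the 1.5*IQR cutoffs for flagging outliers
iqr <- q[["75%"]] - q[["25%"]]
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

# Observations beyond either cutoff would be drawn as individual points
outliers <- mass[!is.na(mass) & (mass < lower_fence | mass > upper_fence)]

q
c(lower = lower_fence, upper = upper_fence)
outliers
```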

Visualizing distributions

Let’s see distribution of body mass by species…

…using geom_boxplot():

penguins |> 
    ggplot(aes(x = species, 
               y = body_mass_g)) +
    geom_boxplot()

…using geom_density():

penguins |> 
    ggplot(aes(x = body_mass_g, 
               color = species)) +
    geom_density(linewidth = 0.75)

Playing with visual parameters

Use alpha to add transparency

  • alpha is a number between 0 and 1; 0 = transparent, 1 = opaque
penguins |> 
    ggplot(aes(x = body_mass_g, 
               color = species, 
               fill = species)) +
    geom_density(alpha = 0.3)

penguins |> 
    ggplot(aes(x = body_mass_g, 
               color = species, 
               fill = species)) +
    geom_density(alpha = 0.7)

Multiple numerical variables

We already saw how to use scatter plots to visualize two numeric variables.

We can map different variables to color and shape.

penguins |> 
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(color = species, shape = island))

Too many things to remember…

Multiple numerical variables

Too many aesthetic mappings (shape, color, fill, size, etc.) can clutter a plot.

  • Facets split the plot into one panel per group instead
  • facet_wrap() takes a formula argument
penguins |> 
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(color = species, shape = species)) +
    facet_wrap(~island)
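
As a variation on the plot above, the faceting variable can also be given with vars() instead of a formula, and the panel layout can be controlled with ncol. The name p_wrap is only illustrative.

```r
library(ggplot2)
library(palmerpenguins)

p_wrap <- penguins |>
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(color = species, shape = species)) +
    # vars(island) is an alternative to the formula ~island;
    # ncol = 1 stacks the three island panels in a single column
    facet_wrap(vars(island), ncol = 1)
p_wrap
```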

Multiple numerical variables

Too many aesthetic mappings (shape, color, fill, size, etc.) can clutter a plot.

  • Facets split the plot into one panel per group instead
  • facet_grid() takes a formula of the form rows ~ columns
penguins |> 
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(color = species, shape = species)) +
    facet_grid(sex~island)

Multiple numerical variables

Too many aesthetic mappings (shape, color, fill, size, etc.) can clutter a plot.

  • Facets split the plot into one panel per group instead
  • facet_grid() takes a formula of the form rows ~ columns
penguins |> 
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(color = species, shape = species)) +
    facet_grid(island~sex)
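
If the order in the formula is hard to remember, the rows and columns can be named explicitly. The sketch below should give the same layout as facet_grid(sex~island); the name p_grid is only illustrative.

```r
library(ggplot2)
library(palmerpenguins)

p_grid <- penguins |>
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(color = species, shape = species)) +
    # sex defines the rows of the grid, island defines the columns
    facet_grid(rows = vars(sex), cols = vars(island))
p_grid
```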

Saving plots

Once you’ve made a plot, you can save it using ggsave().

  • You can either save the most recent plot:
penguins |> 
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point()
ggsave(filename = "penguin-plot.png")
  • Or store the plot in an object and save that:
p <- penguins |> 
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point()
ggsave(filename = "penguin-plot.png", plot = p)
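
ggsave() also accepts arguments that control the size and resolution of the saved file. The sketch below writes to a temporary directory so that nothing in the working directory is overwritten; the path out_file and the chosen dimensions are illustrative assumptions, not recommended values.

```r
library(ggplot2)
library(palmerpenguins)

p <- penguins |>
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point()

# Hypothetical output location: a temporary file
out_file <- file.path(tempdir(), "penguin-plot.png")

# width and height are in inches here; dpi sets the resolution
ggsave(filename = out_file, plot = p,
       width = 6, height = 4, units = "in", dpi = 300)

file.exists(out_file)
```

Fixing the width and height makes the saved figure reproducible, since otherwise the size is taken from the current graphics device.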

Examples in action

Many extensions: https://exts.ggplot2.tidyverse.org/gallery/

Interactive visualizations

(gg)plotly

pp <- penguins |> 
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(color = species, shape = sex)) +
    facet_grid(year~island) +
    theme(text = element_text(size = 14))
pp  # static visualization

(gg)plotly

library(plotly)  # may need install.packages("plotly")
ggplotly(pp)  # interactive visualization: single-click a legend entry to hide or show it, double-click to isolate it