10: data visualization

STA35B: Statistical Data Science 2

Akira Horiguchi

Purpose of visualizing data

  1. Exploratory data analysis
  2. Presenting findings to others

Affects how much effort to put in

  • If just exploring data, a plot doesn’t need to look pretty if you can interpret it
  • If high-stakes presentation (e.g. for job interview, raise, promotion, etc), might need to add many bespoke features (not focus of this class)

Exploratory data analysis

Cycle through the following:

  • Generate questions about your data.

  • Search for answers by visualizing, transforming, and modelling your data.

  • Use what you learn to refine your questions and/or generate new questions

Requires creativity and critical thinking. Two question categories:

  1. What type of variation occurs within each variable?
    • Mean, standard deviation, skewness, etc
  2. What type of covariation occurs between variables?
    • How does height vary with weight, etc

ggplot2

Visualization

We’ll see how to create beautiful visualizations using ggplot2.

library(tidyverse)
library(palmerpenguins)
library(ggthemes) # color palettes for ggplot
penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Basic structure of ggplot2

ggplot() constructs the initial plot.

  • The first argument of ggplot() is the data set for the plot.
    • The data set must be a data frame.
  • ggplot(data = mpg) creates an empty plot.

You then add one or more layers to ggplot() using +.

  • geom functions add a geometrical object to the plot.
    • geom_point(), geom_smooth(), geom_histogram(), geom_boxplot(), etc.
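
A minimal sketch of the "initial plot plus layers" structure described above, using the mpg data set that ships with ggplot2. The choice of displ and hwy as the plotted variables and the object names are assumptions made for illustration.

```r
library(ggplot2)

# Start with an empty plot: data and aesthetic mappings, but no layers yet.
p_empty <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))

# Each `+ geom_*()` call appends one layer to the plot object.
p_points <- p_empty + geom_point()
p_both   <- p_points + geom_smooth(method = "lm", formula = y ~ x)

# Count the layers attached to each plot object.
length(p_empty$layers)
length(p_points$layers)
length(p_both$layers)
```

Because a plot is an ordinary R object, you can store a partially built plot and keep adding layers to it later.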

Creating a ggplot

  • Start with function ggplot()
penguins |> 
    ggplot()

Creating a ggplot

  • Start with function ggplot()
  • Add global aesthetics (i.e., aesthetics applied to every layer in plot).
penguins |> 
    ggplot(aes(x = flipper_length_mm, 
               y = body_mass_g)) 

Creating a ggplot

  • Start with function ggplot()
  • Add global aesthetics (i.e., aesthetics applied to every layer in plot).
  • Add layers.
    • Display data using geom: geometrical object used to represent data
    • geom_bar(): bar chart; geom_line(): lines; geom_boxplot(): boxplot; geom_point(): scatterplot
penguins |> 
    ggplot(aes(x = flipper_length_mm, 
               y = body_mass_g)) +
    geom_point()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Adding aesthetics and layers

We can have aesthetics change as a function of variables inside the tibble

  • e.g. we can differentiate penguin species via colors
  • When a categorical variable is mapped to an aesthetic, each unique level of the variable (here: species) gets assigned a unique aesthetic value (here: unique color)
penguins |> 
    ggplot(aes(x = flipper_length_mm,
               y = body_mass_g,
               color = species)) +
    geom_point()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Adding aesthetics and layers

Let’s add a new layer, geom_smooth(method="lm"), which visualizes line of best fit based on a linear model

  • When an aesthetic mapping is added inside ggplot(), it is applied to all layers.
    • So color=species inside ggplot() will group all penguins by species.
    • We now have a line for each species (not one global line).
penguins |> 
    ggplot(aes(x = flipper_length_mm,
               y = body_mass_g,
               color = species)) +
    geom_point() + 
    geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

Adding aesthetics and layers

Let’s add a new layer, geom_smooth(method="lm"), which visualizes line of best fit based on a linear model

  • When an aesthetic mapping is added inside a layer, it is applied to just that layer.
    • So color=species inside geom_point() will group all penguins by species only for that layer.
    • We now have one global line for all penguins.
penguins |> 
    ggplot(aes(x = flipper_length_mm,
               y = body_mass_g)) +
    geom_point(aes(color = species)) + 
    geom_smooth(method = "lm", color = "purple")
`geom_smooth()` using formula = 'y ~ x'
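
One way to check the difference between global and local mappings without looking at the picture is to inspect the data ggplot2 computes for the smoothing layer. This is a sketch, not part of the original slides: it uses layer_data() from ggplot2, and the object names are made up for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Drop the two penguins with missing measurements to avoid warnings.
peng <- subset(penguins, !is.na(flipper_length_mm) & !is.na(body_mass_g))

# Global mapping: color = species applies to the points AND to the smoother.
p_global <- ggplot(peng, aes(x = flipper_length_mm, y = body_mass_g,
                             color = species)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Local mapping: color = species applies to the points only.
p_local <- ggplot(peng, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species)) +
  geom_smooth(method = "lm", formula = y ~ x)

# layer_data() returns the data computed for one layer; layer 2 is the smoother.
# Each distinct value of `group` corresponds to one fitted line.
n_lines_global <- length(unique(layer_data(p_global, 2)$group))
n_lines_local  <- length(unique(layer_data(p_local, 2)$group))
```

With the global mapping the smoother is fit once per species; with the local mapping it is fit once for all penguins.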

Adding aesthetics and layers

Let’s make the colors more friendly to color-blind viewers

penguins |> 
    ggplot(aes(x = flipper_length_mm,
               y = body_mass_g)) +
    geom_point(aes(color = species)) + 
    geom_smooth(method = "lm", color = "purple") + 
    scale_color_colorblind()
`geom_smooth()` using formula = 'y ~ x'

Adding aesthetics and layers

Let’s further differentiate different species via shapes.

  • We can specify this in a local aesthetic mapping of points using shape=
  • The legend will be updated to show this too!
penguins |> 
    ggplot(aes(x = flipper_length_mm,
               y = body_mass_g)) +
    geom_point(aes(color = species, 
                   shape = species)) + 
    geom_smooth(method = "lm") + 
    scale_color_colorblind()
`geom_smooth()` using formula = 'y ~ x'

Adding aesthetics and layers

Let’s further differentiate different species via shapes.

  • We can make all points the same color by specifying color= outside of aes()
penguins |> 
    ggplot(aes(x = flipper_length_mm,
               y = body_mass_g)) +
    geom_point(aes(shape = species), 
               color = "orange") + 
    geom_smooth(method = "lm", color="black") + 
    scale_color_colorblind()
`geom_smooth()` using formula = 'y ~ x'

Adding aesthetics and layers

Let’s further differentiate different species via shapes.

  • We can make all points the same shape by specifying shape= outside of aes()
penguins |> 
    ggplot(aes(x = flipper_length_mm,
               y = body_mass_g)) +
    geom_point(aes(color = species), 
               shape = 7) + 
    geom_smooth(method = "lm") + 
    scale_color_colorblind()
`geom_smooth()` using formula = 'y ~ x'

Adding aesthetics and layers

Let’s further differentiate different species via shapes.

  • We can also specify shape= outside of aes()
Mapping between shapes and the numbers that represent them: 0 - square, 1 - circle, 2 - triangle point up, 3 - plus, 4 - cross, 5 - diamond, 6 - triangle point down, 7 - square cross, 8 - star, 9 - diamond plus, 10 - circle plus, 11 - triangles up and down, 12 - square plus, 13 - circle cross, 14 - square and triangle down, 15 - filled square, 16 - filled circle, 17 - filled triangle point-up, 18 - filled diamond, 19 - solid circle, 20 - bullet (smaller circle), 21 - filled circle blue, 22 - filled square blue, 23 - filled diamond blue, 24 - filled triangle point-up blue, 25 - filled triangle point down blue.
Figure 1: R has 26 built-in shapes that are identified by the numbers 0–25. There are some seeming duplicates: for example, 0, 15, and 22 are all squares. The difference comes from the interaction of the color and fill aesthetics. The hollow shapes (0–14) have a border determined by color; the solid shapes (15–20) are filled with color; the filled shapes (21–25) have a border of color and are filled with fill.
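
To see the color/fill interaction from the caption in practice, here is a small sketch comparing a filled shape (21) with a solid shape (16). The specific colors, point size, and object names are arbitrary choices for illustration.

```r
library(ggplot2)
library(palmerpenguins)

peng <- subset(penguins, !is.na(flipper_length_mm) & !is.na(body_mass_g))

# Shape 21 is a filled circle: `color` sets the border, `fill` sets the interior.
p_filled <- ggplot(peng, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(shape = 21, color = "black", fill = "orange", size = 3)

# Shape 16 is a solid circle: it is drawn entirely in `color`, so `fill` has
# no visible effect here.
p_solid <- ggplot(peng, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(shape = 16, color = "black", fill = "orange", size = 3)

p_filled
p_solid
```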

Adding aesthetics and layers

Let’s further differentiate different species via shapes.

  • We can also specify size= outside of aes()
penguins |> 
    ggplot(aes(x = flipper_length_mm,
               y = body_mass_g)) +
    geom_point(aes(color = species, 
                   shape = species), size=4) + 
    geom_smooth(method = "lm") + 
    scale_color_colorblind()
`geom_smooth()` using formula = 'y ~ x'

Now just need to add title and axis labels

penguins |> 
    ggplot(aes(x = flipper_length_mm,
               y = body_mass_g)) +
    geom_point(aes(color = species, 
                   shape = species)) + 
    geom_smooth(method = "lm") + 
  labs(
    title = "Body mass and flipper length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
    x = "Flipper length (mm)", y = "Body mass (g)",
    color = "Species", shape = "Species"
  ) +
  scale_color_colorblind()
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Visualizing distributions

Categorical variables take only one of a finite set of values

  • Bar charts are useful for visualizing categorical variables
penguins |> 
    ggplot(aes(x = species)) +
    geom_bar()

Numerical variables can take on a wide range of numerical values

  • Histograms are useful for these - use argument binwidth =
penguins |> 
    ggplot(aes(x = body_mass_g)) +
    geom_histogram(binwidth = 200)

Visualizing distributions

You will likely need to spend time tuning the binwidth parameter

penguins |> 
    ggplot(aes(x = body_mass_g)) +
    geom_histogram(binwidth = 2000)

penguins |> 
    ggplot(aes(x = body_mass_g)) +
    geom_histogram(binwidth = 20)
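
A rough way to see why binwidth needs tuning is to count how many bins each setting produces. The sketch below does this with layer_data(); count_bins() is a hypothetical helper written for this example, not a ggplot2 function.

```r
library(ggplot2)
library(palmerpenguins)

peng <- subset(penguins, !is.na(body_mass_g))

# Hypothetical helper: build the histogram and count how many bins it uses.
count_bins <- function(binwidth) {
  p <- ggplot(peng, aes(x = body_mass_g)) +
    geom_histogram(binwidth = binwidth)
  nrow(layer_data(p))
}

bins_wide   <- count_bins(2000)  # very few bins: shape of the distribution is hidden
bins_medium <- count_bins(200)
bins_narrow <- count_bins(20)    # very many bins: mostly noise

c(wide = bins_wide, medium = bins_medium, narrow = bins_narrow)
```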

Visualizing distributions

  • A density plot is a smoothed-out version of a histogram that approximates a probability density function
penguins |> 
    ggplot(aes(x = body_mass_g)) +
    geom_density()

penguins |> 
    ggplot(aes(x = body_mass_g)) +
    geom_histogram(binwidth = 200)

Visualizing distributions

  • Let’s check the difference between setting color = vs fill = with geom_bar:
penguins |> 
    ggplot(aes(x = species)) +
    geom_bar(color = "red")

penguins |> 
    ggplot(aes(x = species)) +
    geom_bar(fill = "red")

Visualizing distributions

  • Let’s check the difference between setting color = vs fill = with geom_bar:
penguins |> 
    ggplot(aes(x = species)) +
    geom_bar(color = "red")

penguins |> 
    ggplot(aes(x = species)) +
    geom_bar(fill = "red", color="purple")

Visualizing distributions

  • Box plots allow for visualizing the spread of a distribution
  • They make it easy to see the 25th percentile, median, 75th percentile, and outliers (points more than 1.5*IQR below the 25th or above the 75th percentile)
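
The quantities a box plot displays can also be computed by hand. The sketch below does so for body mass by species with dplyr; the column names (q25, med, lower_fence, and so on) are made up for this example, and the quartiles come from quantile() with its default settings, so treat the numbers as an approximation of what geom_boxplot() draws.

```r
library(dplyr)
library(palmerpenguins)

box_stats <- penguins |>
  filter(!is.na(body_mass_g)) |>
  group_by(species) |>
  summarize(
    q25 = quantile(body_mass_g, 0.25, names = FALSE),
    med = median(body_mass_g),
    q75 = quantile(body_mass_g, 0.75, names = FALSE),
    iqr = q75 - q25,
    # Points beyond these fences are the ones flagged as outliers.
    lower_fence = q25 - 1.5 * iqr,
    upper_fence = q75 + 1.5 * iqr,
    n_outliers  = sum(body_mass_g < lower_fence | body_mass_g > upper_fence)
  )

box_stats
```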

Visualizing distributions

Let’s see distribution of body mass by species…

…using geom_boxplot():

penguins |> 
    ggplot(aes(x = species, 
               y = body_mass_g)) +
    geom_boxplot()

…using geom_density():

penguins |> 
    ggplot(aes(x = body_mass_g, 
               color = species)) +
    geom_density(linewidth = 0.75)

Playing with visual parameters

Use alpha to add transparency

  • alpha is a number between 0 and 1; 0 = transparent, 1 = opaque
penguins |> 
    ggplot(aes(x = body_mass_g, 
               color = species, 
               fill = species)) +
    geom_density(alpha = 0.3)

penguins |> 
    ggplot(aes(x = body_mass_g, 
               color = species, 
               fill = species)) +
    geom_density(alpha = 0.7)

Multiple numerical variables

We already saw how to use scatter plots to visualize two numeric variables

We can map different variables to color and shape

penguins |> 
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(color = species, shape = island))

Too many things to remember…

Multiple numerical variables

Too many aesthetic changes (shape, color, fill, size, etc) can clutter plots

  • Can use facets
  • facet_wrap() takes a formula argument
penguins |> 
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(color = species, shape = species)) +
    facet_wrap(~island)

Multiple numerical variables

Too many aesthetic changes (shape, color, fill, size, etc) can clutter plots

  • Can use facets
  • facet_wrap() takes a formula argument
penguins |> 
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(color = species, shape = species)) +
    facet_wrap(~island, scales="free_y")  # each panel now has its own y-axis scale

Multiple numerical variables

Too many aesthetic changes (shape, color, fill, size, etc) can clutter plots

  • Can use facets
  • facet_wrap() takes a formula argument
penguins |> 
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(color = species, shape = species)) +
    facet_wrap(~island, ncol=1)

Multiple numerical variables

Too many aesthetic changes (shape, color, fill, size, etc) can clutter plots

  • Can use facets
  • facet_wrap() takes a formula argument
penguins |> 
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(color = species, shape = species)) +
    facet_wrap(~island, ncol=2, nrow=2)

Multiple numerical variables

Too many aesthetic changes (shape, color, fill, size, etc) can clutter plots

  • Can use facets
  • facet_grid() takes a formula argument
penguins |> 
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(color = species, shape = species)) +
    facet_grid(sex~island)

Multiple numerical variables

Too many aesthetic changes (shape, color, fill, size, etc) can clutter plots

  • Can use facets
  • facet_grid() takes a formula argument
penguins |> 
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(color = species, shape = species)) +
    facet_grid(island~sex)

Multiple numerical variables

Which of the following makes it easier to compare engine size (displ) across cars with different drive trains?

ggplot(mpg, aes(x = displ)) + 
  geom_histogram() + 
  facet_wrap(~ drv, ncol=1) + 
  theme(axis.title = element_text(size=32), 
        axis.text = element_text(size=20))

ggplot(mpg, aes(x = displ)) + 
  geom_histogram() +
  facet_wrap(~ drv, nrow=1) + 
  theme(axis.title = element_text(size=32), 
        axis.text=element_text(size=20))

Can change order of levels in factors

penguins |>
    mutate(island = factor(island, levels=c("Dream", "Biscoe", "Torgersen"))) |>
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(color = species, shape = species)) +
    facet_wrap(~island)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
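
An alternative sketch for reordering levels uses fct_relevel() from the forcats package (loaded with the tidyverse), which moves the named levels to the front without rebuilding the factor. The object name penguins_reordered is an assumption for illustration.

```r
library(dplyr)
library(forcats)
library(palmerpenguins)

penguins_reordered <- penguins |>
  mutate(island = fct_relevel(island, "Dream", "Biscoe", "Torgersen"))

levels(penguins$island)            # original order
levels(penguins_reordered$island)  # new order, used for facets and legends
```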

Shapes can be difficult to distinguish

penguins |>
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point(aes(color = species, shape = species))

Sometimes easier to read by replacing shape with first letter

penguins |>
    mutate(species_first_letter = substr(species, 1, 1)) |>
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_text(aes(color = species, label = species_first_letter))

Might allow you to remove the legend

penguins |>
    mutate(species_first_letter = substr(species, 1, 1)) |>
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_text(aes(color = species, label = species_first_letter)) +
    theme(legend.position = "none")

Saving plots

Once you’ve made a plot, you can save it using ggsave()

  • You can either save the most recent plot:
penguins |> 
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point()
ggsave(filename = "penguin-plot.png")
  • Or you can assign the plot to an object and save that:
p <- penguins |> 
    ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point()
ggsave(filename = "penguin-plot.png", plot = p)
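
If you do not set a size, ggsave() uses the size of the current graphics device, which can make saved figures hard to reproduce. A sketch with explicit dimensions follows; the width, height, and dpi values and the temporary output path are arbitrary choices for this example.

```r
library(ggplot2)
library(palmerpenguins)

p <- penguins |>
  subset(!is.na(flipper_length_mm) & !is.na(body_mass_g)) |>
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

# Write to a temporary folder so the example does not clutter the working directory.
out_file <- file.path(tempdir(), "penguin-plot.png")

# Explicit plot, size, and resolution make the output reproducible.
ggsave(filename = out_file, plot = p,
       width = 6, height = 4, units = "in", dpi = 300)

file.exists(out_file)
```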

Exploring the diamonds data set

Diamonds data set

Consider the diamonds dataset in ggplot2

  • Contains the prices and other attributes of almost 54,000 diamonds.
diamonds
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

Geometric objects

How are these two plots similar?

ggplot(diamonds, aes(carat, price)) + 
  geom_point() + 
  theme(axis.title = element_text(size=28), 
        axis.text = element_text(size=20))

ggplot(diamonds, aes(carat, price)) + 
  geom_smooth() + 
  theme(axis.title = element_text(size=28), 
        axis.text = element_text(size=20))

  • Both plots contain the same x and y variables; both plots describe same data.
  • Each plot uses a different geometric object, geom, to represent the data.

Geometric objects

  • Can utilize multiple geoms together to elucidate relationship
ggplot(diamonds, aes(carat, price)) + 
  geom_point() + 
  geom_smooth()

Statistical transformations

Let’s create a bar chart across cuts:

ggplot(diamonds, aes(x = cut)) +
  geom_bar()
  • count is not a variable in diamonds, so how is it creating this?

Statistical transformations

Examples

  • Bar charts, histograms, and frequency polygons bin your data, then plot bin counts (the number of points that fall in each bin).
  • Smoothers fit a model to your data and then plot predictions from the model.
  • Boxplots compute the five-number summary of the distribution and then display that summary as a specially formatted box.

Statistical transformations

The algorithm used to calculate new values for a graph is called a stat.

  • (stat is short for statistical transformation)
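
To make the stat behind geom_bar() concrete, the sketch below computes the counts by hand with dplyr and compares them with what ggplot2 computes internally. The object names are assumptions; layer_data() is used only to peek at the computed values.

```r
library(ggplot2)
library(dplyr)

# geom_bar() counts the rows in each category before drawing the bars.
p_bar <- ggplot(diamonds, aes(x = cut)) +
  geom_bar()

# The same counts computed by hand.
cut_counts <- diamonds |> count(cut)

# geom_col() plots values as-is (no counting), so it reproduces the bar chart
# from the precomputed table.
p_col <- ggplot(cut_counts, aes(x = cut, y = n)) +
  geom_col()

# Counts that ggplot2 computed for the bar chart.
computed <- layer_data(p_bar)$count
```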

Figure 2

Exploring variables

Two question categories:

  1. What type of variation occurs within each variable?
  2. What type of covariation occurs between 2+ variables?

Variation

Variation is tendency for values of a variable to change from measurement to measurement

  • Can be due to measurement error (e.g., measuring height with different rulers) or due to within-group variation (different people have different heights)
  • Let’s explore the distribution of weights (carat) of the ~50k diamonds from diamonds dataset.
  • carat is numerical, can use histogram:
diamonds |>
    ggplot(aes(x = carat)) +
    geom_histogram(binwidth = 0.5)

Typical values

  • In bar charts and histograms, tall bars = common values; no bars = values not seen
  • Questions to ask yourself:
    • Which values are the most common? Why?
    • Which values are rare? Why? Does that match your expectations?
    • Can you see any unusual patterns? What might explain them?
  • Let’s look at distribution of weights of smaller diamonds.
diamonds |> 
  filter(carat < 3) |> 
  ggplot(aes(x = carat)) +
  geom_histogram(binwidth = 0.01)

  • Why are there more diamonds at whole carats than fractional?
  • Why are more diamonds slightly to the right of each peak than slightly to left?

Unusual values

If you encounter unusual values in your dataset and want to ignore them for the rest of the analysis, there are two options:

  1. Drop the entire row containing the unusual values, e.g.
diamonds_filtered <- diamonds |>
  filter(between(y, 3, 20))
  2. Replace the unusual values with missing values
diamonds2 <- diamonds |>
  mutate(y = if_else(y < 3 | y > 20, NA, y))
  • The latter is the recommended approach: it keeps the other measurements in each affected row.
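
The two options can be compared directly by counting rows and missing values. This is a sketch with made-up names n_dropped and n_missing; it writes NA_real_ instead of NA, an assumption made so that the type of the replacement is explicit.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Option 1: drop rows whose y value is outside [3, 20].
diamonds_filtered <- diamonds |>
  filter(between(y, 3, 20))

# Option 2: keep every row, but mark the unusual y values as missing.
diamonds2 <- diamonds |>
  mutate(y = if_else(y < 3 | y > 20, NA_real_, y))

n_dropped <- nrow(diamonds) - nrow(diamonds_filtered)  # rows lost by option 1
n_missing <- sum(is.na(diamonds2$y))                   # values blanked by option 2
```

Option 2 keeps the same number of rows as the original data, so the other columns of the affected diamonds remain available for later plots.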

Covariation

… is tendency for values of 2+ variables to vary together in a related way

  • How does price (numerical) of a diamond vary with quality (categorical)?
  • Can use geom_freqpoly() to show “frequency polygons” (similar to histogram)
ggplot(diamonds, aes(x = price)) +
  geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75)
  • Not super informative, since the height mainly reflects the count.

Covariation

  • More useful to compare the density: the count standardized so that the area under each frequency polygon is 1
  • To do this, we can map y to after_stat(density), which does this normalization
ggplot(diamonds, aes(x = price, y = after_stat(density))) +
  geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75)
  • Seems diamond quality has no significant effect?
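
The normalization behind after_stat(density) can be sketched by hand: within each cut, divide each bin count by the group total and by the bin width. The code below is an approximation for illustration only; its bins start at multiples of 500, which may not match the bin boundaries ggplot2 chooses, and the names price_density, bin_left, and area_by_cut are made up.

```r
library(dplyr)
library(ggplot2)

binwidth <- 500

price_density <- diamonds |>
  # Assign each diamond to a bin by the left edge of its price interval.
  mutate(bin_left = floor(price / binwidth) * binwidth) |>
  count(cut, bin_left, name = "count") |>
  group_by(cut) |>
  # density = count / (group total * bin width)
  mutate(density = count / sum(count) / binwidth) |>
  ungroup()

# Check: within each cut, density * binwidth sums to 1.
area_by_cut <- price_density |>
  group_by(cut) |>
  summarize(area = sum(density * binwidth))

# Plot the hand-computed densities at the bin midpoints.
ggplot(price_density, aes(x = bin_left + binwidth / 2, y = density, color = cut)) +
  geom_line(linewidth = 0.75)
```

Because every group now has total area 1, cuts with few diamonds are no longer squashed by cuts with many.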

Covariation

  • Let’s further inspect with a box plot
ggplot(diamonds, aes(x = cut, y = price, fill = cut)) +
  geom_boxplot()
  • Can now easily compare medians and 25th/75th percentiles
  • Better quality diamonds are typically cheaper?! We’ll investigate why.

Further inspecting diamond prices

  • What might be responsible for this? Recall variables: x,y,z refer to length/width/depth of diamond in mm
colnames(diamonds)
 [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
 [8] "x"       "y"       "z"      
  • Let’s look at size vs price.


Size vs price in diamonds

ggplot(diamonds, aes(x, price)) + geom_point() + theme(axis.title = element_text(size=40), axis.text = element_text(size=32))
ggplot(diamonds, aes(y, price)) + geom_point() + theme(axis.title = element_text(size=40), axis.text = element_text(size=32))
ggplot(diamonds, aes(z, price)) + geom_point() + theme(axis.title = element_text(size=40), axis.text = element_text(size=32))

  • Clearly some outliers - “zero” length / width / depth diamonds apparently?
  • But still some positive correlation between x, y, z and price
  • Let’s clean up the tibble to only view the middle 95% of values for x, y, z.

Size vs price in diamonds

  • Let’s clean up the tibble to only view the middle 95% of values for x, y, z
diamonds_middle <- diamonds |> 
  filter(between(x, quantile(diamonds$x, 0.025), quantile(diamonds$x, 0.975)),
         between(y, quantile(diamonds$y, 0.025), quantile(diamonds$y, 0.975)),
         between(z, quantile(diamonds$z, 0.025), quantile(diamonds$z, 0.975)))
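
The same filter can be written with less repetition using if_all() from dplyr. This is a sketch: in_middle_95() is a hypothetical helper defined here, and diamonds_middle_alt is a made-up name. Note that the 95% cutoff is applied to x, y, and z separately, so the share of rows kept overall can differ from 95%.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Hypothetical helper: TRUE where v lies within its own 2.5%-97.5% quantile range.
in_middle_95 <- function(v) {
  between(v, quantile(v, 0.025), quantile(v, 0.975))
}

# Keep rows where x, y, and z are ALL within their middle ranges.
diamonds_middle_alt <- diamonds |>
  filter(if_all(c(x, y, z), in_middle_95))

# Share of the original rows that survive the three filters.
share_kept <- nrow(diamonds_middle_alt) / nrow(diamonds)
share_kept
```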

Size vs price in diamonds

  • Let’s clean up the tibble to only view the middle 95% of values for x, y, z
ggplot(diamonds_middle, aes(x, price)) + geom_point() + theme(axis.title = element_text(size=40), axis.text = element_text(size=32))
ggplot(diamonds_middle, aes(y, price)) + geom_point() + theme(axis.title = element_text(size=40), axis.text = element_text(size=32))
ggplot(diamonds_middle, aes(z, price)) + geom_point() + theme(axis.title = element_text(size=40), axis.text = element_text(size=32))

  • Appears x, y, z (length/width/depth) are highly correlated with price
  • Let’s explore the relationship between estimated volume (x*y*z) and price

Size vs price in diamonds

  • Let’s look at the relationship between estimated volume (x*y*z) and price
diamonds_middle |> 
  mutate(est_volume = x*y*z) |> 
  ggplot(aes(x = est_volume, y = price)) +
  geom_point(alpha = 0.01)

  • Appears to be a strong correlation.
  • Now let’s revisit why we previously found that higher quality diamonds were cheaper.
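
The visual impression of a strong relationship can be backed up with a number. The sketch below computes the correlation between estimated volume and price; the names vol_price, cor_pearson, and cor_spearman are assumptions, and diamonds_middle is rebuilt so the example runs on its own.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

diamonds_middle <- diamonds |>
  filter(between(x, quantile(diamonds$x, 0.025), quantile(diamonds$x, 0.975)),
         between(y, quantile(diamonds$y, 0.025), quantile(diamonds$y, 0.975)),
         between(z, quantile(diamonds$z, 0.025), quantile(diamonds$z, 0.975)))

vol_price <- diamonds_middle |>
  mutate(est_volume = x * y * z)

# Pearson measures linear association; Spearman measures monotone association
# and is less sensitive to the curved shape visible in the scatter plot.
cor_pearson  <- cor(vol_price$est_volume, vol_price$price)
cor_spearman <- cor(vol_price$est_volume, vol_price$price, method = "spearman")

c(pearson = cor_pearson, spearman = cor_spearman)
```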

Size vs price in diamonds

Hypothesis: higher quality diamonds are smaller, lower quality diamonds are larger.

  • To investigate, we could use boxplot: we want to visualize variation in size for each category of quality
diamonds_middle |> 
  mutate(est_volume = x*y*z) |> 
  ggplot(aes(cut, est_volume, fill=cut)) +
  geom_boxplot()
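
A numeric summary can complement the box plot when checking the hypothesis. The sketch below tabulates median estimated volume and median price for each cut; size_by_cut and its column names are made up for this example, and diamonds_middle is rebuilt so the code runs on its own.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

diamonds_middle <- diamonds |>
  filter(between(x, quantile(diamonds$x, 0.025), quantile(diamonds$x, 0.975)),
         between(y, quantile(diamonds$y, 0.025), quantile(diamonds$y, 0.975)),
         between(z, quantile(diamonds$z, 0.025), quantile(diamonds$z, 0.975)))

size_by_cut <- diamonds_middle |>
  mutate(est_volume = x * y * z) |>
  group_by(cut) |>
  summarize(
    n             = n(),
    median_volume = median(est_volume),
    median_price  = median(price)
  )

size_by_cut
```

If the hypothesis holds, the median volume should tend to fall as the cut quality rises.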

Size vs price in diamonds

  • Let’s try to visualize the different cuts on the scatter plot:
diamonds_middle |>
  mutate(est_volume = x*y*z) |>
  ggplot(aes(est_volume, price, color = cut)) +
  geom_point(alpha = 0.5)

diamonds_middle |>
  mutate(est_volume = x*y*z) |>
  ggplot(aes(est_volume, price, color = cut)) +
  geom_smooth()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'