library(tidyverse) # also loads ggplot2 package
library(ggthemes) # color palettes for ggplot
STA141A: Fundamentals of Statistical Data Science
Cycle through the following:
Generate questions about your data.
Search for answers by visualizing, transforming, and modelling your data.
Use what you learn to refine your questions and/or generate new ones.
Requires creativity and critical thinking.
How representative is my data of the population I am interested in?
Two question categories:
GGally::ggpairs() in the GGally R package
Here we use:
ggplot2 package
ggplotly() from the plotly package
We’ll see how to create beautiful visualizations using ggplot2
…using the dataset:
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
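The printout above matches the penguins tibble shipped with the palmerpenguins package. The slides do not name the source package, so treat that as an assumption. A minimal sketch for loading and inspecting the data:

```r
library(tidyverse)
library(palmerpenguins)  # assumption: the slides do not name the data's source package

# Use the package-qualified name: recent R versions also ship a `penguins`
# data frame in base R, with different column names.
penguins <- palmerpenguins::penguins

dim(penguins)       # number of rows and columns
glimpse(penguins)   # every column with its type and first few values
```

glimpse() is handy here because the default print hides the sex and year columns when the console is narrow.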
ggplot2
ggplot() constructs the initial plot.
The first argument of ggplot() is the data set for the plot.
ggplot(data = mpg) creates an empty plot.
You then add one or more layers to ggplot() using +.
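The two steps described above can be sketched with the mpg data that ships with ggplot2. The choice of displ and hwy as the x and y variables, and of geom_point() as the layer, is only illustrative:

```r
library(ggplot2)

# Step 1: ggplot() alone registers the data but draws no geometry yet
p_empty <- ggplot(data = mpg)

# Step 2: add a layer with +; here a scatterplot of engine size vs. highway mileage
p_points <- p_empty +
  geom_point(mapping = aes(x = displ, y = hwy))

p_empty    # blank canvas
p_points   # same canvas with one point layer
```

Storing the plot in a variable makes it easy to keep adding layers later with +.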
geom functions add a geometrical object to the plot.
Examples: geom_point(), geom_smooth(), geom_histogram(), geom_boxplot(), etc.
geom_bar(): bar chart; geom_line(): lines; geom_boxplot(): boxplot; geom_point(): scatterplot
We can have aesthetics change as a function of variables inside the tibble.
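As a sketch of an aesthetic that varies with a variable, the plot below colors each point by species. It assumes the penguins tibble comes from the palmerpenguins package, which the slides do not state explicitly:

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins tibble

p_color <- ggplot(
  data = palmerpenguins::penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point(na.rm = TRUE)  # na.rm = TRUE silently drops rows with missing measurements

p_color
```

Because color is wrapped in aes(), ggplot2 picks one color per species and adds a legend automatically.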
Let’s add a new layer, geom_smooth(method = "lm"), which draws the line of best fit from a linear model.
When an aesthetic mapping is defined in ggplot(), it is applied to all layers.
color = species inside ggplot() groups the penguins by species in every layer, so geom_smooth() fits one line per species.
When an aesthetic mapping is defined inside a geom function, it is applied to that layer only.
color = species inside geom_point() groups the penguins by species only for that layer, so geom_smooth() fits a single line to all penguins.
Let’s further differentiate the species via shapes.
penguins |>
  ggplot(aes(x = flipper_length_mm,
             y = body_mass_g)) +
  geom_point(aes(color = species,
                 shape = species)) +
  geom_smooth(method = "lm") +
  labs(
    title = "Body mass and flipper length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
    x = "Flipper length (mm)", y = "Body mass (g)",
    color = "Species", shape = "Species"
  ) +
  scale_color_colorblind()
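The plot above maps color and shape inside geom_point(), so geom_smooth() sees no grouping and fits one line. To see what changes when the mapping moves into ggplot(), the sketch below builds both variants side by side. It assumes the palmerpenguins package as the data source; the formula argument is spelled out only to silence the default message:

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins tibble

penguins <- palmerpenguins::penguins

# Global mapping: color = species is inherited by geom_point() AND geom_smooth(),
# so a separate regression line is fitted for each species.
p_global <- ggplot(penguins,
                   aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point(na.rm = TRUE) +
  geom_smooth(method = "lm", formula = y ~ x, na.rm = TRUE)

# Local mapping: color = species applies to the points only,
# so geom_smooth() fits a single line through all penguins.
p_local <- ggplot(penguins,
                  aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species), na.rm = TRUE) +
  geom_smooth(method = "lm", formula = y ~ x, na.rm = TRUE)

p_global
p_local
```

Which variant is appropriate depends on the question: per-species trends call for the global mapping, an overall trend for the local one.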
A categorical variable takes only one of a finite set of values.
Let’s see the distribution of body mass by species…
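One possible way to look at body mass by species is a boxplot with the categorical variable on the x axis. This is a sketch, not necessarily the plot used in the lecture, and it assumes the palmerpenguins data:

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins tibble

p_box <- ggplot(palmerpenguins::penguins,
                aes(x = species, y = body_mass_g)) +
  geom_boxplot(na.rm = TRUE) +  # one box per species; missing masses are dropped
  labs(x = "Species", y = "Body mass (g)")

p_box
```

Each box summarizes the median, quartiles, and outliers of one species, which makes the three distributions easy to compare at a glance.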
Use alpha to add transparency.
alpha is a number between 0 and 1; 0 = transparent, 1 = opaque.
Already saw how to use scatter plots to visualize two numeric variables.
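As a sketch of how alpha behaves, overlapping density curves of body mass can be filled per species and made half transparent. The value 0.5 and the use of geom_density() are illustrative choices, and the data are assumed to come from palmerpenguins:

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins tibble

p_density <- ggplot(palmerpenguins::penguins,
                    aes(x = body_mass_g, color = species, fill = species)) +
  # alpha is set outside aes() because it is a fixed value, not a mapped variable
  geom_density(alpha = 0.5, na.rm = TRUE) +
  labs(x = "Body mass (g)", y = "Density")

p_density
```

With alpha = 1 the front curve would hide the ones behind it; lowering alpha lets the overlap between species show through.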
Too many aesthetic changes (shape, color, fill, size, etc.) can clutter plots.
Once you’ve made a plot, you can save it to a file using ggsave().
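A minimal sketch of saving a plot follows. The file name, dimensions, and resolution are made-up values for illustration, and the mpg data from ggplot2 stands in for any plot:

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Illustrative file name and size; a temporary folder avoids cluttering the project
out_file <- file.path(tempdir(), "mpg-scatter.png")

# The file type is inferred from the extension; width and height are in inches by default
ggsave(filename = out_file, plot = p, width = 6, height = 4, dpi = 300)
```

If the plot argument is omitted, ggsave() saves the last plot that was displayed, so passing it explicitly is the safer habit in scripts.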
Many extensions: https://exts.ggplot2.tidyverse.org/gallery/