If you are just exploring data, a plot doesn’t need to look pretty as long as you can interpret it
For a high-stakes presentation (e.g., for a job interview, raise, or promotion), you might need to add many bespoke features (not the focus of this class)
Exploratory data analysis
Cycle through the following:
Generate questions about your data.
Search for answers by visualizing, transforming, and modelling your data.
Use what you learn to refine your questions and/or generate new questions.
Requires creativity and critical thinking. Two question categories:
What type of variation occurs within each variable?
Mean, standard deviation, skewness, etc.
What type of covariation occurs between variables?
E.g., how does height vary with weight?
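As a rough numerical illustration of the two question categories, the sketch below uses R's built-in women data set (heights and weights). That data set is not part of these slides; it is an assumption chosen only because it ships with base R and happens to contain height and weight.

```r
# Assumption: the built-in `women` data set stands in for "your data".
data(women)

# Variation within one variable: centre and spread of height
height_mean <- mean(women$height)
height_sd   <- sd(women$height)

# Covariation between two variables: how height and weight move together
height_weight_cor <- cor(women$height, women$weight)

height_mean
height_sd
height_weight_cor
```

Numbers like these are a starting point; the plots in the rest of the section show the same variation and covariation visually.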
ggplot2
Visualization
We’ll see how to create beautiful visualizations using ggplot2.
library(tidyverse)
library(palmerpenguins)
library(ggthemes) # color palettes for ggplot
penguins
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
Basic structure of ggplot2
ggplot() constructs the initial plot.
The first argument of ggplot() is the data set for the plot.
The data set must be a data frame.
ggplot(data = mpg) creates an empty plot.
You then add one or more layers to ggplot() using +.
geom functions add a geometrical object to the plot.
geom_point(), geom_smooth(), geom_histogram(), geom_boxplot(), etc.
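The structure described above can be sketched with the mpg data frame that ships with ggplot2. The choice of displ and hwy as the plotted columns is an assumption made for illustration only.

```r
library(ggplot2)

# ggplot() alone creates an empty plot for the mpg data frame
empty_plot <- ggplot(data = mpg)

# Adding a geom layer with + draws the data.
# Assumption: displ (engine size) and hwy (highway mpg) are picked for illustration.
scatter <- ggplot(data = mpg) +
  geom_point(aes(x = displ, y = hwy))

empty_plot
scatter
```

The first object has no layers, so it renders as a blank canvas; the second has exactly one layer, added by geom_point().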
Creating a ggplot
Start with function ggplot()
penguins |>
  ggplot()
Add global aesthetics (i.e., aesthetics applied to every layer in plot).
penguins |>
  ggplot(aes(x = flipper_length_mm, y = body_mass_g))
Add layers.
Display data using geom: geometrical object used to represent data
geom_bar(): bar chart; geom_line(): lines; geom_boxplot(): boxplot; geom_point(): scatterplot
penguins |>
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
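The warning appears because some rows have NA for flipper_length_mm or body_mass_g (row 4 in the printout of penguins is one of them), so geom_point() cannot place them. One possible way to handle this explicitly, sketched here as an option and not as something the slides prescribe, is to drop those rows before plotting. The names n_missing and penguins_complete are introduced only for this example.

```r
library(tidyverse)
library(palmerpenguins)

# Count the rows that geom_point() cannot draw
n_missing <- penguins |>
  filter(is.na(flipper_length_mm) | is.na(body_mass_g)) |>
  nrow()
n_missing

# Drop those rows before plotting so that nothing is removed silently
penguins_complete <- penguins |>
  drop_na(flipper_length_mm, body_mass_g)

p <- penguins_complete |>
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
p
```

Alternatively, geom_point(na.rm = TRUE) keeps the data unchanged and only silences the warning.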
Adding aesthetics and layers
We can have aesthetics change as a function of variables inside the tibble
e.g. we can differentiate penguin species via colors
When a categorical variable is mapped to an aesthetic, each unique level of the variable (here: species) gets assigned a unique aesthetic value (here: unique color)
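A minimal sketch of this mapping, assuming the same penguins scatterplot as before: putting color = species inside aes() lets ggplot2 assign one color per species and add a legend. The object name p_color is introduced only for this example.

```r
library(tidyverse)
library(palmerpenguins)

# color is mapped to species inside aes(), so each species level gets its own color
p_color <- penguins |>
  ggplot(aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point()
p_color
```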
Let’s further distinguish the species via shapes.
We can also set shape = outside of aes(), which applies one fixed shape to every point instead of mapping shape to a variable
Figure 1: R has 25 built-in shapes that are identified by numbers. There are some seeming duplicates: for example, 0, 15, and 22 are all squares. The difference comes from the interaction of the color and fill aesthetics. The hollow shapes (0–14) have a border determined by color; the solid shapes (15–20) are filled with color; the filled shapes (21–24) have a border of color and are filled with fill.
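The difference between mapping shape inside aes() and setting it outside can be sketched as follows. The specific values shape = 21 and fill = "white" are assumptions chosen to show the color/fill interaction described in the caption of Figure 1.

```r
library(tidyverse)
library(palmerpenguins)

# Mapped: shape inside aes() varies with species (one shape per level)
p_mapped <- penguins |>
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species))

# Fixed: shape outside aes() applies to every point.
# Shape 21 is a filled circle: color sets its border, fill sets its interior.
p_fixed <- penguins |>
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species), shape = 21, fill = "white")

p_mapped
p_fixed
```

In the second plot only the border color still distinguishes the species, because shape and fill are constants.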
Too many aesthetic changes (shape, color, fill, size, etc) can clutter plots
Can use facets instead: they split the plot into panels, one per level of a variable
facet_wrap() takes a formula argument (e.g., ~island)
penguins |>
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species)) +
  facet_wrap(~island, scales = "free_y") # each panel now has its own y-axis scale
Multiple numerical variables
Too many aesthetic changes (shape, color, fill, size, etc) can clutter plots
A bar chart of diamonds shows count on the y-axis, but count is not a variable in diamonds, so how is the plot creating it?
Statistical transformations
Examples
Bar charts, histograms, and frequency polygons bin your data, then plot bin counts (the number of points that fall in each bin).
Smoothers fit a model to your data and then plot predictions from the model.
Boxplots compute the five-number summary of the distribution and then display that summary as a specially formatted box.
The algorithm used to calculate new values for a graph is called a stat.
(stat is short for statistical transformation)
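To make the idea of a stat concrete, here is a sketch using a bar chart of the diamonds data set. Putting cut on the x-axis is an assumption made for illustration, and the names cut_counts, p_bar, and p_col are introduced only for this example.

```r
library(tidyverse)

# geom_bar() uses a counting stat by default: it counts the rows in each cut level
p_bar <- diamonds |>
  ggplot(aes(x = cut)) +
  geom_bar()

# The same counts computed explicitly with dplyr
cut_counts <- diamonds |>
  count(cut, name = "count")
cut_counts

# Plot the precomputed counts as they are; geom_col() does no counting itself
p_col <- cut_counts |>
  ggplot(aes(x = cut, y = count)) +
  geom_col()

p_bar
p_col
```

Both plots should show the same bar heights: in the first the stat computes count behind the scenes, in the second count is an actual column of the data.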
Figure 2
Exploring variables
Two question categories:
What type of variation occurs within each variable?
What type of covariation occurs between 2+ variables?
Variation
Variation is the tendency of the values of a variable to change from measurement to measurement
Can be due to measurement error (e.g., measuring height with different rulers) or due to within-group variation (different people have different heights)
Let’s explore the distribution of weights (carat) of the ~54k diamonds in the diamonds dataset.
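One way to look at this distribution is a histogram, sketched below. The bin width of 0.5 carat is an assumed starting value, not a recommendation from the slides, and the names p_carat and carat_bins are introduced only for this example.

```r
library(tidyverse)

# Assumption: binwidth = 0.5 carat is an arbitrary starting point
p_carat <- diamonds |>
  ggplot(aes(x = carat)) +
  geom_histogram(binwidth = 0.5)
p_carat

# The same binning done by hand, to see the counts behind the bars
carat_bins <- diamonds |>
  count(cut_width(carat, 0.5))
carat_bins
```

Trying several bin widths is usually worthwhile, since different widths can reveal different patterns in the same variable.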