MATH167R: Data visualization

Peter Gao

Warm-up

  1. What does the following code do?

    library(palmerpenguins)
    library(ggplot2)
    data(penguins)
    ggplot(penguins, 
           aes(x = flipper_length_mm, 
               y = body_mass_g, 
               color = species)) + 
      geom_point()

Overview of today

  • Reviewing ggplot2
  • Advanced data visualization
  • Styling R visualizations

Review: grammar of graphics

The three basic layers:

  1. Data: a data frame with all of the variables of interest.
  2. Aesthetics: graphical dimensions like x, y, color, shape, and more.
  3. Geometries: the specific markings used to illustrate your variables and aesthetics.
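As a rough sketch of how the three layers stack up, the warm-up plot can be built one step at a time. The intermediate names (p_data, p_aes, p_points) are made up for illustration and are not part of ggplot2.

```r
library(palmerpenguins)
library(ggplot2)

# 1. Data only: ggplot() has a data frame but nothing to draw, so the canvas is blank
p_data <- ggplot(penguins)

# 2. Data + aesthetics: the axes now show the mapped variables, but there are no marks
p_aes <- ggplot(penguins,
                aes(x = flipper_length_mm,
                    y = body_mass_g,
                    color = species))

# 3. Data + aesthetics + geometry: geom_point() draws one point per row
p_points <- p_aes + geom_point()

p_points
```

Printing p_data, p_aes, and p_points in turn shows what each layer contributes.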

Review: Cats and Dogs

Suppose we have the following data on animal weights. How can we compute the average weights of cats and dogs? [example from Andrew Heiss]

head(animals, n = 10)
# A tibble: 10 × 2
   animal weight
   <chr>   <dbl>
 1 Cat      18.1
 2 Dog      37.9
 3 Cat      24.2
 4 Dog      58.8
 5 Dog      51.7
 6 Dog      38.5
 7 Cat      22.2
 8 Dog      27.8
 9 Cat      17.4
10 Cat      24.3

Code for generating data

library(tidyverse)
# example from Andrew Heiss
set.seed(12)
animals <- tibble(animal = c(rep(c("Small cat", "Big cat"), each = 250), rep("Dog", 500))) |> 
  mutate(weight = case_when(
    animal == "Small cat" ~ rnorm(n(), 20, 5),
    animal == "Big cat" ~ rnorm(n(), 60, 5),
    animal == "Dog" ~ rnorm(n(), 40, 10)
  )) |> 
  mutate(animal = ifelse(str_detect(animal, "cat"), "Cat", "Dog"))

Review: Cats and Dogs

mean_wt <- animals |>
  group_by(animal) |>
  summarize(mean_wt = mean(weight))
mean_wt
# A tibble: 2 × 2
  animal mean_wt
  <chr>    <dbl>
1 Cat       40.2
2 Dog       40.1

How can we turn this into a visual?

Review: Cats and Dogs

In ggplot2, there are two geometries for bar plots: geom_col() and geom_bar().

  • The heights of the bars in geom_col() represent numerical values that you supply for each group (via the y aesthetic).
  • The heights of the bars in geom_bar() represent the number of rows in each group.
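The difference between the two geometries is easier to see on a tiny data set. The sketch below uses a made-up data frame (pets, with hypothetical values) rather than the animals data.

```r
library(ggplot2)

# Hypothetical toy data: one row per animal
pets <- data.frame(
  animal = c("Cat", "Cat", "Cat", "Dog", "Dog"),
  weight = c(10, 12, 14, 30, 50)
)

# geom_bar() counts rows, so no y aesthetic is needed:
# the bars have heights 3 (Cat) and 2 (Dog)
p_count <- ggplot(pets, aes(x = animal)) +
  geom_bar()

# geom_col() plots a y value that we supply, here the precomputed group means:
# the bars have heights 12 (Cat) and 40 (Dog)
pet_means <- aggregate(weight ~ animal, data = pets, FUN = mean)
p_mean <- ggplot(pet_means, aes(x = animal, y = weight)) +
  geom_col()
```

In short, geom_bar() summarizes the data for you, while geom_col() expects the data to be summarized already.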

Review: Cats and Dogs

ggplot(data = mean_wt, 
       aes(x = animal, y = mean_wt, fill = animal)) +
  geom_col() +
  ylab("Mean weight (lb)")

Review: Cats and Dogs

# construct named vector
mean_wt_vec <- mean_wt$mean_wt
names(mean_wt_vec) <- mean_wt$animal
barplot(mean_wt_vec,
        col = c("red", "blue"),
        ylab = "Mean weight (lb)")

If possible, show the data

ggplot(animals, aes(x = animal, y = weight, color = animal)) +
  geom_point(position = position_jitter(height = 0)) +
  labs(x = NULL, y = "Weight (lb)") +
  guides(color = "none")

If possible, show the data

ggplot(animals, aes(x = animal, 
                    y = weight, 
                    color = animal)) +
  geom_boxplot() +
  geom_point(position = position_jitter(height = 0), alpha = 0.5) +
  labs(x = NULL, y = "Weight (lb)") +
  guides(color = "none")

Review: Cats and Dogs

Takeaways:

  • Data must be in the right format to work with ggplot2.
  • Geometries can be combined for more complex visualizations.
  • Showing only summary statistics can mislead and hide information.
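One possible way to see what the bar chart of means hides is a histogram for each animal. This is only a sketch: the data are regenerated with base R under the same distributions as the generating code, so the individual draws will not match the tibble shown earlier, and the name animals_sketch is hypothetical.

```r
library(ggplot2)

# Assumption: same distributions as the original generating code,
# simulated with base R (individual values will differ from the slides)
set.seed(12)
animals_sketch <- data.frame(
  animal = rep(c("Cat", "Dog"), each = 500),
  weight = c(rnorm(250, 20, 5),   # small cats
             rnorm(250, 60, 5),   # big cats
             rnorm(500, 40, 10))  # dogs
)

# One histogram per animal, stacked vertically so the x axes line up
p_hist <- ggplot(animals_sketch, aes(x = weight, fill = animal)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(~animal, ncol = 1) +
  labs(x = "Weight (lb)", y = "Count") +
  guides(fill = "none")

p_hist
```

Both groups have a mean near 40 lb, but the cat weights form two separate clusters, which neither the bar chart nor the table of means can show.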

Gapminder data

link

Gapminder data

  • What changes do we see over time? What do we notice in this visualization?
  • What questions arise? Do you have any doubts about this visualization?

Gapminder data

library(gapminder)
data(gapminder)
head(gapminder)
# A tibble: 6 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.

Gapminder data

summary(gapminder)
        country        continent        year         lifeExp     
 Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
 Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
 Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
 Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
 Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
 Australia  :  12                  Max.   :2007   Max.   :82.60  
 (Other)    :1632                                                
      pop              gdpPercap       
 Min.   :6.001e+04   Min.   :   241.2  
 1st Qu.:2.794e+06   1st Qu.:  1202.1  
 Median :7.024e+06   Median :  3531.8  
 Mean   :2.960e+07   Mean   :  7215.3  
 3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
 Max.   :1.319e+09   Max.   :113523.1  
                                       

Gapminder data

Let’s work with the following data:

gapminder_2007 <- gapminder |> filter(year == 2007)

How do we recreate the following plot?

Gapminder data

Work with the people around you to recreate this plot:

Gapminder data

Hopefully, you produced something like this:

ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap,
                     y = lifeExp,
                     color = continent,
                     size = pop)) +
  geom_point() 

Playing with scales

Scales are another layer we can add on top of our plot:

ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap,
                     y = lifeExp,
                     color = continent,
                     size = pop)) +
  geom_point() +
  scale_x_log10()

What do you expect the output to look like?

Playing with scales

ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap,
                     y = lifeExp,
                     color = continent,
                     size = pop)) +
  geom_point() +
  scale_x_log10() +
  scale_color_viridis_d() # discrete viridis color scale
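Scales can also control how tick labels and point sizes are drawn. The sketch below is one possible variation, not the plot from the slides: it assumes gdpPercap is measured in dollars, and the choices of label_dollar(), max_size = 10, alpha = 0.7, and the axis titles are illustrative.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

gapminder_2007 <- gapminder |> filter(year == 2007)

p_scales <- ggplot(data = gapminder_2007,
                   mapping = aes(x = gdpPercap,
                                 y = lifeExp,
                                 color = continent,
                                 size = pop)) +
  geom_point(alpha = 0.7) +
  # log10 x axis with dollar-formatted tick labels (assumes dollar units)
  scale_x_log10(labels = scales::label_dollar()) +
  # make point area, rather than radius, proportional to population
  scale_size_area(max_size = 10) +
  scale_color_viridis_d() +
  labs(x = "GDP per capita (log scale)", y = "Life expectancy (years)")

p_scales
```

Each scale_*() function adjusts how one aesthetic is translated from data values to what appears on the plot.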

Small multiples

“At the heart of quantitative reasoning is a single question: Compared to what? Small multiple designs, multivariate and data bountiful, answer directly by visually enforcing comparisons of changes, of the differences among objects, of the scope of alternatives. For a wide range of problems in data presentation, small multiples are the best design solution.”

Edward Tufte

Small multiples

We can often make comparisons clearer by splitting a plot into many “small multiples” which have the same axes:

Facets for small multiples

Small multiples are implemented in ggplot2 via the facet_wrap() and facet_grid() functions:

ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap,
                     y = lifeExp,
                     color = continent,
                     size = pop)) +
  geom_point() +
  scale_x_log10() +
  facet_wrap(~continent)
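facet_grid() works similarly but lays panels out in rows and columns defined by two variables. The following is a sketch that assumes we compare two arbitrarily chosen years, 1957 and 2007; the name gapminder_two_years is hypothetical.

```r
library(gapminder)
library(dplyr)
library(ggplot2)

# Assumption: two years picked only for illustration
gapminder_two_years <- gapminder |> filter(year %in% c(1957, 2007))

p_grid <- ggplot(data = gapminder_two_years,
                 mapping = aes(x = gdpPercap,
                               y = lifeExp,
                               color = continent)) +
  geom_point() +
  scale_x_log10() +
  # rows are years, columns are continents
  facet_grid(year ~ continent) +
  # the column labels already identify the continents
  guides(color = "none")

p_grid
```

Because every panel shares the same axes, it is easy to compare a continent across years (down a column) or continents within a year (across a row).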

Gapminder data

Work with the people around you to recreate this plot:

Themes

ggplot2 provides a default look: the theme_gray() theme (gray background, white gridlines) and a basic discrete color scheme.

Other themes

However, ggplot2 also comes with additional themes that you can layer on top of your plots:

ggplot(penguins, 
       aes(x = flipper_length_mm, 
           y = body_mass_g, 
           color = species)) +
  geom_point() +
  theme_minimal()

Other themes

ggplot(penguins, 
       aes(x = flipper_length_mm, 
           y = body_mass_g, 
           color = species)) +
  geom_point() +
  theme_bw()

Other themes

ggplot(penguins, 
       aes(x = flipper_length_mm, 
           y = body_mass_g, 
           color = species)) +
  geom_point() +
  theme_dark()

Other themes

Other themes can be obtained from the ggthemes package:

library(ggthemes)
ggplot(penguins, 
       aes(x = flipper_length_mm, 
           y = body_mass_g, 
           color = species)) +
  geom_point() +
  theme_economist_white()

Other themes

ggplot(penguins, 
       aes(x = flipper_length_mm, 
           y = body_mass_g, 
           color = species)) +
  geom_point() +
  theme_wsj()

Other themes

ggplot(penguins, 
       aes(x = flipper_length_mm, 
           y = body_mass_g, 
           color = species)) +
  geom_point() +
  theme_fivethirtyeight()

Themes

If you want finer control over your plot appearance, you can dive into the theme() function:

ggplot(penguins, 
       aes(x = flipper_length_mm, 
           y = body_mass_g, 
           color = species)) +
  geom_point() +
  theme(legend.text = element_text(size = 30))

Themes

Within the theme function, we can manipulate various theme elements:
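As a small sketch of what this can look like, the code below changes a handful of theme elements on the penguins scatter plot. The specific elements and values are arbitrary choices for illustration.

```r
library(palmerpenguins)
library(ggplot2)

p_theme <- ggplot(penguins,
                  aes(x = flipper_length_mm,
                      y = body_mass_g,
                      color = species)) +
  geom_point() +
  theme(
    legend.position = "bottom",                          # move the legend below the plot
    axis.title = element_text(size = 14, face = "bold"), # larger, bold axis titles
    panel.grid.minor = element_blank(),                  # remove minor gridlines
    panel.background = element_rect(fill = "white",      # white panel with a gray border
                                    color = "gray50")
  )

p_theme
```

Text elements are set with element_text(), rectangles with element_rect(), lines with element_line(), and element_blank() removes an element entirely.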

Themes

Other resources: