MATH167R: Factors and Categorical Data

Peter Gao

Warm-up

  1. What does the following code do? How would you write it without a pipe?

    "hello" |> print()
    1:6 |> sample(1)

Warm-up

Answer:

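  1. The pipe passes its left-hand side as the first argument of the function on its right, so this code prints "hello" and then draws one random number from 1 to 6.

    "hello" |> print()
    1:6 |> sample(1)

    Without a pipe:

    print("hello")
    sample(1:6, 1)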
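
One way to check this answer, assuming R 4.1 or later (where the native pipe exists): R rewrites a piped expression into an ordinary call when it parses the code, so quoting the piped version should give back the unpiped version. The variable names below are made up for this sketch.

```r
# The native pipe is rewritten at parse time, so the quoted expressions
# are already ordinary function calls.
piped_print  <- quote("hello" |> print())
piped_sample <- quote(1:6 |> sample(1))

piped_print   # print("hello")
piped_sample  # sample(1:6, 1)

identical(piped_sample, quote(sample(1:6, 1)))  # TRUE
```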

Overview of today

  • Classes and attributes
  • Factors for categorical data
  • Advanced descriptive statistics and summaries

Objects revisited

Remember John Chambers’ quote: “Everything that exists in R is an object.” This is roughly correct; however, it’s important to remember that objects in R come in many shapes and flavors.

diamonds <- ggplot2::diamonds
summary(diamonds$carat)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.2000  0.4000  0.7000  0.7979  1.0400  5.0100 
summary(diamonds$cut)
     Fair      Good Very Good   Premium     Ideal 
     1610      4906     12082     13791     21551 

Why do these return two types of output?

Polymorphism

Loosely speaking, carat and cut are stored as different types of data. In this code, R is using two versions of summary(): one meant for categorical data and one meant for numerical data.

This can be confusing, because we need to remember that summary() behaves differently depending on its input. It can also be powerful, because the user doesn’t have to remember many different functions (e.g. summary_numeric() or summary_character()).

This is polymorphism in action: A common functional interface can be used for different types of input.
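
A small sketch of this dispatch, using two made-up vectors in place of the diamonds columns: summary() is a generic function, and the class of its argument decides which method runs. The vector names are illustrative only.

```r
carat_like <- c(0.2, 0.4, 0.7, 1.04, 5.01)         # numerical data
cut_like   <- factor(c("Fair", "Ideal", "Ideal"))  # categorical data

class(carat_like)  # "numeric"
class(cut_like)    # "factor"

summary(carat_like)  # Min., 1st Qu., Median, Mean, 3rd Qu., Max.
summary(cut_like)    # counts per level

# The factor-specific method that summary() dispatches to:
factor_method <- utils::getS3method("summary", "factor")
is.function(factor_method)  # TRUE
```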

Object-oriented programming

In object-oriented programming, developers define different classes of objects with various methods (like summary()). This means that if a developer wants to create a new class of object (e.g. for a new data type), users should still be able to use common functions like print() or summary().

R developers often use object-oriented programming, but the implementation takes many forms, so we won’t go into detail in this course. What is important to remember is that the class(es) of an object determine(s) what you can do with it.
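
To make the idea concrete, here is a minimal sketch using S3, one of several object systems in R. The "temperature" class and the helper new_temperature() are hypothetical names invented for this example, not part of any package.

```r
# Constructor for a hypothetical "temperature" class (illustrative only).
new_temperature <- function(degrees_f) {
  structure(degrees_f, class = "temperature")
}

# A print() method: R calls this whenever a temperature object is printed.
print.temperature <- function(x, ...) {
  cat("Temperatures (F):", unclass(x), "\n")
  invisible(x)
}

# A summary() method for the same class.
summary.temperature <- function(object, ...) {
  values <- unclass(object)
  c(coldest = min(values), warmest = max(values))
}

temps <- new_temperature(c(61, 75, 58))
print(temps)    # uses print.temperature()
summary(temps)  # uses summary.temperature()
```

Users of the new class keep calling print() and summary(); only the developer needs to know that class-specific methods exist.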

Classes

weather_forecasts <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-12-20/weather_forecasts.csv')
class(weather_forecasts$date)
[1] "Date"
summary(weather_forecasts$date)
        Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
"2021-01-30" "2021-05-31" "2021-09-30" "2021-09-30" "2022-01-30" "2022-06-01" 

Here, we see that the date variable has the class Date, which has its own version of summary().

Classes vs. data types

We previously only discussed four basic data types: logical, integer, double, and character.

  • Type (as accessed via typeof()) describes the underlying data type.

  • Class (as accessed via class()) describes an attribute that determines what you can do with this object.

class(weather_forecasts$date)
[1] "Date"
typeof(weather_forecasts$date)
[1] "double"
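
A short sketch of the same distinction on a single date (chosen for illustration): removing the class with unclass() exposes the stored double, which counts days since 1970-01-01.

```r
d <- as.Date("2021-01-30")

class(d)    # "Date": controls how the value prints and is summarized
typeof(d)   # "double": how the value is stored
unclass(d)  # 18657, the number of days since 1970-01-01

# The day after the origin is stored as 1.
unclass(as.Date("1970-01-02"))  # 1
```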

Attributes

In R, objects can carry attributes, such as class, that provide information about the values they contain.

attributes(head(weather_forecasts))
$names
 [1] "date"                  "city"                  "state"                
 [4] "high_or_low"           "forecast_hours_before" "observed_temp"        
 [7] "forecast_temp"         "observed_precip"       "forecast_outlook"     
[10] "possible_error"       

$row.names
[1] 1 2 3 4 5 6

$class
[1] "tbl_df"     "tbl"        "data.frame"
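
Attributes can also be read and set by hand. The sketch below uses a made-up vector and an arbitrary attribute name ("units") purely for illustration.

```r
heights <- c(150, 162, 171)

attributes(heights)             # NULL: a bare vector has no attributes

attr(heights, "units") <- "cm"  # attach a custom attribute
names(heights) <- c("a", "b", "c")

attributes(heights)             # now lists $units and $names
attr(heights, "units")          # "cm"
```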

Factors: categorical data

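Factors are a class of object for categorical data that uses an integer representation.

This can be an efficient way to store character vectors, because each distinct value is stored once as a level and the data themselves are stored as integer codes. For this reason, base R functions that create data frames (but not tibbles!) converted strings to factors by default before R 4.0.0; since then, the default is stringsAsFactors = FALSE.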
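
The following sketch shows how to control this conversion explicitly. It assumes R 4.0.0 or later, where data.frame() leaves strings as character unless asked otherwise; the data are made up.

```r
genres <- c("Action", "Comedy", "Action")

df_default <- data.frame(genre = genres)                           # default in R >= 4.0.0
df_factor  <- data.frame(genre = genres, stringsAsFactors = TRUE)  # opt in to factors

class(df_default$genre)  # "character" (assuming R >= 4.0.0)
class(df_factor$genre)   # "factor"
levels(df_factor$genre)  # "Action" "Comedy"
```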

An example: months of the year

Here’s an example from R for Data Science:

Imagine we have a variable that represents month of the year:

x1 <- c("Dec", "Apr", "Jan", "Mar")

Some issues:

  • Potential for typos
x2 <- c("Dec", "Apr", "Jam", "Mar")
  • Not sorting in a reasonable way
sort(x1)
[1] "Apr" "Dec" "Jan" "Mar"

An example: months of the year

We can represent this variable using a factor by defining its levels, or the valid values this variable can take:

month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

We can create a factor using the factor() function:

y1 <- factor(x1, levels = month_levels)
y1
[1] Dec Apr Jan Mar
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
sort(y1)
[1] Jan Mar Apr Dec
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
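
Factors also help with the typo problem. As a small sketch, reusing the misspelled vector from earlier: any value that is not one of the levels becomes NA, which is easy to detect.

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x2 <- c("Dec", "Apr", "Jam", "Mar")  # "Jam" is a typo

y2 <- factor(x2, levels = month_levels)
y2         # the typo is shown as <NA>
is.na(y2)  # FALSE FALSE  TRUE FALSE
```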

An example: months of the year

Note that factors are stored as integers but displayed using their levels:

class(y1)
[1] "factor"
typeof(y1)
[1] "integer"
y1
[1] Dec Apr Jan Mar
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

An example: months of the year

as.numeric() reveals the underlying integer codes, and attributes() shows the levels those codes index into:

as.numeric(y1)
[1] 12  4  1  3
attributes(y1)
$levels
 [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"

$class
[1] "factor"
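
One practical consequence, sketched with a made-up factor whose labels look like numbers: as.numeric() returns the integer codes, not the labels, so convert through as.character() when you want the values themselves.

```r
digits_fct <- factor(c("10", "20", "10"))

as.numeric(digits_fct)                # 1 2 1: the integer codes
as.numeric(as.character(digits_fct))  # 10 20 10: the labels as numbers

# The codes are positions in the levels vector.
levels(digits_fct)[as.integer(digits_fct)]  # "10" "20" "10"
```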

Movies data

To see how factors work in practice, let’s look at Tidy Tuesday data on movie profits.

movies <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-10-23/movie_profit.csv")
knitr::kable(movies[1:20, ], digits = 3, row.names = FALSE) |>
  kableExtra::kable_styling("striped", full_width = TRUE) |> 
  kableExtra::scroll_box(height = "300px")
...1  release_date  movie                                production_budget  domestic_gross  worldwide_gross  distributor         mpaa_rating  genre
1     6/22/2007     Evan Almighty                        1.75e+08           100289690       174131329        Universal           PG           Comedy
2     7/28/1995     Waterworld                           1.75e+08           88246220        264246220        Universal           PG-13        Action
3     5/12/2017     King Arthur: Legend of the Sword     1.75e+08           39175066        139950708        Warner Bros.        PG-13        Adventure
4     12/25/2013    47 Ronin                             1.75e+08           38362475        151716815        Universal           PG-13        Action
5     6/22/2018     Jurassic World: Fallen Kingdom       1.70e+08           416769345       1304866322       Universal           PG-13        Action
6     8/1/2014      Guardians of the Galaxy              1.70e+08           333172112       771051335        Walt Disney         PG-13        Action
7     5/7/2010      Iron Man 2                           1.70e+08           312433331       621156389        Paramount Pictures  PG-13        Action
8     4/4/2014      Captain America: The Winter Soldier  1.70e+08           259746958       714401889        Walt Disney         PG-13        Action
9     7/11/2014     Dawn of the Planet of the Apes       1.70e+08           208545589       710644566        20th Century Fox    PG-13        Adventure
10    11/10/2004    The Polar Express                    1.70e+08           186493587       310634169        Warner Bros.        G            Adventure
11    6/1/2012      Snow White and the Huntsman          1.70e+08           155136755       401021746        Universal           PG-13        Adventure
12    7/1/2003      Terminator 3: Rise of the Machines   1.70e+08           150358296       433058296        Warner Bros.        R            Action
13    5/7/2004      Van Helsing                          1.70e+08           120150546       300150546        Universal           PG-13        Action
14    5/22/2015     Tomorrowland                         1.70e+08           93436322        207283457        Walt Disney         PG           Adventure
15    5/27/2016     Alice Through the Looking Glass      1.70e+08           77042381        276934087        Walt Disney         PG           Adventure
16    5/21/2010     Shrek Forever After                  1.65e+08           238736787       756244673        Paramount Pictures  PG           Adventure
17    11/4/2016     Doctor Strange                       1.65e+08           232641920       676486457        Walt Disney         PG-13        Action
18    11/7/2014     Big Hero 6                           1.65e+08           222527828       652127828        Walt Disney         PG           Adventure
19    3/26/2010     How to Train Your Dragon             1.65e+08           217581232       494870992        Paramount Pictures  PG           Adventure
20    11/2/2012     Wreck-It Ralph                       1.65e+08           189412677       496511521        Walt Disney         PG           Adventure

Movies genre

genre_char <- movies$genre
genre_fct <- as.factor(movies$genre)
head(genre_char)
[1] "Comedy"    "Action"    "Adventure" "Action"    "Action"    "Action"   
head(genre_fct)
[1] Comedy    Action    Adventure Action    Action    Action   
Levels: Action Adventure Comedy Drama Horror
class(genre_fct)
[1] "factor"
typeof(genre_fct)
[1] "integer"

Size of character and factors

Generally, the factor representation saves space in memory:

object.size(genre_char)
27544 bytes
object.size(genre_fct) 
14376 bytes
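
The saving depends on how often values repeat. The sketch below uses made-up vectors and reports no exact byte counts, since those vary by platform: with a few values repeated many times the factor is smaller, but when every value is distinct the factor has to store both the codes and every level.

```r
# Many repeats of a few values: the factor stores 4-byte integer codes
# plus one copy of each level.
genre_chr <- rep(c("Action", "Comedy", "Drama"), times = 1000)
genre_fct <- factor(genre_chr)

object.size(genre_chr)
object.size(genre_fct)
object.size(genre_fct) < object.size(genre_chr)  # TRUE

# All values distinct: the factor stores the codes and every level,
# so it ends up larger than the character vector.
id_chr <- as.character(1:1000)
id_fct <- factor(id_chr)
object.size(id_fct) > object.size(id_chr)        # TRUE
```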

The forcats package

The forcats package provides helpful functions for working with factors. Consider the following example:

library(forcats)
color_levels <- c(
  "red", "blue", "yellow"
)
color_var <- c("red", "yellow", "blue")
color_fct_1 <- factor(color_var, levels = color_levels)

fct_recode(): recode levels

fct_recode(color_fct_1, ruby = "red")
[1] ruby   yellow blue  
Levels: ruby blue yellow
# "d" is not a level of color_fct_1, so w = "d" is ignored (forcats warns about the unknown level)
fct_recode(color_fct_1, ruby = "red", sapphire = "blue", topaz = "yellow", w = "d")
[1] ruby     topaz    sapphire
Levels: ruby sapphire topaz

fct_collapse(): collapse levels

color_fct_1
[1] red    yellow blue  
Levels: red blue yellow
fct_collapse(color_fct_1, purple = c("red", "blue"))
[1] purple yellow purple
Levels: purple yellow

fct_other(): replace w/ “Other”

color_fct_1
[1] red    yellow blue  
Levels: red blue yellow
fct_other(color_fct_1, keep = "red")
[1] red   Other Other
Levels: red Other

forcats cheatsheet

  • Create a factor: factor(..., levels = ...)
  • Count levels: fct_count()
  • Unique levels: fct_unique()
  • Combine factor vectors: fct_c()
  • Relevel: fct_relevel()
  • Drop levels: fct_drop()
  • Add levels: fct_expand()
  • Recode levels: fct_recode()
  • Collapse levels: fct_collapse()
  • Replace levels with “Other”: fct_other()
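
A few of these functions in action, as a rough sketch on a made-up color factor (it assumes the forcats package is installed; the variable names are illustrative):

```r
library(forcats)

color_levels <- c("red", "blue", "yellow")
color_fct <- factor(c("red", "yellow", "blue", "red"), levels = color_levels)

fct_count(color_fct)                      # one row per level with its count
levels(fct_relevel(color_fct, "yellow"))  # "yellow" "red" "blue"
levels(fct_expand(color_fct, "green"))    # "red" "blue" "yellow" "green"

# Subsetting keeps unused levels; fct_drop() removes them.
reds_only <- color_fct[color_fct == "red"]
levels(reds_only)            # "red" "blue" "yellow"
levels(fct_drop(reds_only))  # "red"

# fct_c() combines factors and takes the union of their levels.
levels(fct_c(color_fct, factor("green")))  # "red" "blue" "yellow" "green"
```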

Takeaways

  • Don’t memorize these functions; read the documentation!
  • Be efficient! Often someone has written a function that does exactly what you want to do.

Data manipulation and grouping

Mutating variables

The mutate() function from the tidyverse gives a convenient way to add/change columns in a data frame.

library(tidyverse)
movies <- movies |>
  mutate(genre = as.factor(genre)) |>
  mutate(return = worldwide_gross / production_budget)
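
To see what mutate() computes without downloading the movies data, here is a minimal sketch on a two-row toy tibble (the values are invented; the column names mirror the movies data):

```r
library(dplyr)

toy_movies <- tibble(
  movie             = c("A", "B"),
  production_budget = c(10, 20),
  worldwide_gross   = c(30, 10)
)

# mutate() adds a column computed row by row from existing columns.
toy_returns <- toy_movies |>
  mutate(return = worldwide_gross / production_budget)

toy_returns$return  # 3.0 0.5
```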

Movies data

movies$genre <- as.factor(movies$genre)  # has no effect if genre is already a factor
movies <- movies |>
  mutate(genre = fct_collapse(genre,
                              AA = c("Action", "Adventure"))) |>
  mutate(genre = fct_recode(genre, Scary = "Horror"))

Summarizing variables

Recall that the summarize() function can be used to calculate statistics on our entire data frame:

movies |>
  summarize(max_gross = max(worldwide_gross, na.rm = TRUE))
# A tibble: 1 × 1
   max_gross
       <dbl>
1 1304866322

Summaries by group

What if we want to learn the max gross for each genre? When we have a character/factor column, we can use the group_by() function in combination with summarize() to calculate group-specific statistics:

movies |>
  group_by(genre) |>
  summarize(highest_gross = max(worldwide_gross)) |>
  ungroup()
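
The same pattern on a three-row toy tibble (invented values) makes the result easy to verify by eye: summarize() returns one row per group.

```r
library(dplyr)

toy_movies <- tibble(
  genre           = c("Drama", "Drama", "Comedy"),
  worldwide_gross = c(5, 9, 7)
)

toy_summary <- toy_movies |>
  group_by(genre) |>
  summarize(
    highest_gross = max(worldwide_gross),  # largest value within each genre
    n_movies      = n()                    # number of rows in each genre
  ) |>
  ungroup()

toy_summary  # Comedy: 7, 1 movie; Drama: 9, 2 movies
```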

Challenge

What do you think this code returns?

movies |>
  group_by(genre) |>
  summarize(highest_gross = movie[which.max(worldwide_gross)]) |>
  ungroup()
# A tibble: 4 × 2
  genre  highest_gross                 
  <fct>  <chr>                         
1 AA     Jurassic World: Fallen Kingdom
2 Comedy The Hangover Part II          
3 Drama  ET: The Extra-Terrestrial     
4 Scary  It                            
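
The key step is which.max(), which returns the position of the largest value rather than the value itself. A base R sketch with made-up vectors shows how indexing the titles by that position picks out the matching movie:

```r
gross  <- c(5, 9, 7)
titles <- c("A", "B", "C")

which.max(gross)          # 2: the position of the largest gross
titles[which.max(gross)]  # "B": the title at that position
```

Inside summarize(), the same indexing runs once per group, so each genre gets the title of its highest-grossing movie.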

Challenge

What do you think this code returns?

movies |>
  group_by(genre, mpaa_rating) |>
  summarize(highest_gross = movie[which.max(worldwide_gross)]) |>
  ungroup()
# A tibble: 19 × 3
   genre  mpaa_rating highest_gross                 
   <fct>  <chr>       <chr>                         
 1 AA     G           The Lion King                 
 2 AA     PG          Minions                       
 3 AA     PG-13       Jurassic World: Fallen Kingdom
 4 AA     R           Deadpool                      
 5 AA     <NA>        Conan the Barbarian           
 6 Comedy G           Gnomeo and Juliet             
 7 Comedy PG          Home Alone                    
 8 Comedy PG-13       Meet the Fockers              
 9 Comedy R           The Hangover Part II          
10 Comedy <NA>        It's a Mad Mad Mad Mad World  
11 Drama  G           Gone with the Wind            
12 Drama  PG          ET: The Extra-Terrestrial     
13 Drama  PG-13       The Twilight Saga: Eclipse    
14 Drama  R           The Passion of the Christ     
15 Drama  <NA>        The Postman Always Rings Twice
16 Scary  PG          Jaws                          
17 Scary  PG-13       I am Legend                   
18 Scary  R           It                            
19 Scary  <NA>        Friday the 13th               

Movies data

genre_medians <- movies |>
  group_by(genre) |>
  summarize(median_budget = median(production_budget),
            median_domestic = median(domestic_gross),
            median_ww = median(worldwide_gross),
            median_return = median(return)) |>
  ungroup()