What does the following code do? How would you write it without a pipe?
Answer:
Remember John Chambers’ quote: “Everything that exists in R is an object.” This is roughly correct; however, it’s important to remember that in R objects come in many shapes/flavors.
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.2000  0.4000  0.7000  0.7979  1.0400  5.0100
     Fair      Good Very Good   Premium     Ideal
     1610      4906     12082     13791     21551
Why do these return two types of output?
Loosely speaking, carat and cut are stored as different types of data. In this code, R is using two versions of summary(): one meant for categorical data and one meant for numerical data.
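The calls that produced the output above are not shown here, so the following is only a small sketch of the same behavior using made-up vectors (carat_toy and cut_toy are hypothetical stand-ins for the real columns):

```r
# Toy stand-ins for a numeric column and a categorical column
carat_toy <- c(0.2, 0.4, 0.7, 1.04, 5.01)
cut_toy <- factor(c("Fair", "Good", "Ideal", "Ideal", "Good"))

# Same function name, different behavior depending on the class of the input
num_summary <- summary(carat_toy)  # min, quartiles, median, mean, max
fct_summary <- summary(cut_toy)    # number of observations per level

print(num_summary)
print(fct_summary)

# The class is what decides which version of summary() runs
class(carat_toy)
class(cut_toy)
```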
This can be confusing, because we need to remember that summary() works differently depending on the input. But it can also be powerful, because the user doesn’t have to remember many different functions (e.g., summary_numeric() or summary_character()).
This is polymorphism in action: a common functional interface can be used for different types of input.
In object-oriented programming, developers define different classes of objects, each with its own methods (like summary()). This means that if a developer creates a new class of object (e.g., for a new data type), users should still be able to use common functions like print() or summary() on it.
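To make this concrete, here is a minimal sketch using R’s S3 system, which is only one of several object-oriented systems in R. The temperature class, its constructor new_temperature(), and the unit attribute are all invented for illustration.

```r
# Constructor for a hypothetical "temperature" class (illustration only)
new_temperature <- function(x, unit = "C") {
  stopifnot(is.numeric(x))
  structure(x, unit = unit, class = "temperature")
}

# print() method: called automatically when print() sees class "temperature"
print.temperature <- function(x, ...) {
  vals <- as.numeric(x)  # drop the class and attributes to get plain numbers
  cat("<temperature>", paste(format(vals), attr(x, "unit")), "\n")
  invisible(x)
}

# summary() method: a class-specific summary
summary.temperature <- function(object, ...) {
  vals <- as.numeric(object)
  c(coldest = min(vals), warmest = max(vals))
}

temps <- new_temperature(c(21.5, 18, 25.2))
print(temps)                 # uses print.temperature()
temp_summary <- summary(temps)  # uses summary.temperature()
temp_summary
```

Users of the new class never call print.temperature() or summary.temperature() directly; they keep using print() and summary(), and R picks the method based on the class.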
R developers often use object-oriented programming, but the implementation takes many forms, so we won’t go into detail in this course. What is important to remember is that the class(es) of an object determine(s) what you can do with it.
weather_forecasts <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-12-20/weather_forecasts.csv')
class(weather_forecasts$date)
[1] "Date"
summary(weather_forecasts$date)
        Min.      1st Qu.       Median         Mean      3rd Qu.         Max.
"2021-01-30" "2021-05-31" "2021-09-30" "2021-09-30" "2022-01-30" "2022-06-01"
Here, we see that the date variable has the class Date, which has its own version of summary().
We previously only discussed four basic data types: logical, integer, double, and character.
Type (as accessed via typeof()) describes the underlying data type.
Class (as accessed via class()) describes an attribute that determines what you can do with this object.
In R, objects can be associated with attributes, such as class, that provide information on the values contained within.
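A short sketch of the difference between type and class, using a plain number and a Date (the values are arbitrary):

```r
x <- 18657
d <- as.Date("2021-01-30")

# Both are stored as doubles under the hood...
type_x <- typeof(x)
type_d <- typeof(d)

# ...but they have different classes, so functions treat them differently
class_x <- class(x)
class_d <- class(d)

attributes(d)  # the class is stored as an attribute of the object
unclass(d)     # removing the class reveals the underlying number (days since 1970-01-01)

print(c(type_x, type_d))
print(c(class_x, class_d))
```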
Factors are a class of object for categorical data that uses integer representation.
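This can be an efficient way to store character vectors, because each unique string is stored only once. Because of this, functions that create data frames (but not tibbles!) in R have historically converted strings to factors by default; since R 4.0.0, data.frame() no longer does so unless asked.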
Here’s an example from R for Data Science:
Imagine we have a variable that represents month of the year:
Some issues:
We can represent this variable using a factor by defining its levels, or the valid values this variable can take:
We can create a factor using the factor() function:
Note that factors are stored as integers but displayed using their levels:
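The code for these steps is not shown here, so the following is a sketch along the lines of the R for Data Science example. The specific values in x1 and x2 are assumptions, and month.abb is R’s built-in vector of abbreviated month names.

```r
# A month variable stored as plain strings
x1 <- c("Dec", "Apr", "Jan", "Mar")

# Issue 1: strings sort alphabetically, not in calendar order
sort(x1)

# Issue 2: nothing stops a typo such as "Jam"
x2 <- c("Dec", "Apr", "Jam", "Mar")

# Define the valid values (levels), in the order we want
month_levels <- month.abb  # "Jan", "Feb", ..., "Dec"

# Create the factor
y1 <- factor(x1, levels = month_levels)
sort(y1)  # now sorts in calendar order

# Values that are not valid levels become NA
y2 <- factor(x2, levels = month_levels)
y2

# Stored as integers, displayed using the levels
as.integer(y1)
levels(y1)
```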
To see how factors work in practice, let’s look at Tidy Tuesday data on movie profits.
movies <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-10-23/movie_profit.csv")
# Show the first 20 rows as a striped, scrollable table
knitr::kable(movies[1:20, ], digits = 3, row.names = FALSE) |>
  kableExtra::kable_styling("striped", full_width = TRUE) |>
  kableExtra::scroll_box(height = "300px")
...1 | release_date | movie | production_budget | domestic_gross | worldwide_gross | distributor | mpaa_rating | genre |
---|---|---|---|---|---|---|---|---|
1 | 6/22/2007 | Evan Almighty | 1.75e+08 | 100289690 | 174131329 | Universal | PG | Comedy |
2 | 7/28/1995 | Waterworld | 1.75e+08 | 88246220 | 264246220 | Universal | PG-13 | Action |
3 | 5/12/2017 | King Arthur: Legend of the Sword | 1.75e+08 | 39175066 | 139950708 | Warner Bros. | PG-13 | Adventure |
4 | 12/25/2013 | 47 Ronin | 1.75e+08 | 38362475 | 151716815 | Universal | PG-13 | Action |
5 | 6/22/2018 | Jurassic World: Fallen Kingdom | 1.70e+08 | 416769345 | 1304866322 | Universal | PG-13 | Action |
6 | 8/1/2014 | Guardians of the Galaxy | 1.70e+08 | 333172112 | 771051335 | Walt Disney | PG-13 | Action |
7 | 5/7/2010 | Iron Man 2 | 1.70e+08 | 312433331 | 621156389 | Paramount Pictures | PG-13 | Action |
8 | 4/4/2014 | Captain America: The Winter Soldier | 1.70e+08 | 259746958 | 714401889 | Walt Disney | PG-13 | Action |
9 | 7/11/2014 | Dawn of the Planet of the Apes | 1.70e+08 | 208545589 | 710644566 | 20th Century Fox | PG-13 | Adventure |
10 | 11/10/2004 | The Polar Express | 1.70e+08 | 186493587 | 310634169 | Warner Bros. | G | Adventure |
11 | 6/1/2012 | Snow White and the Huntsman | 1.70e+08 | 155136755 | 401021746 | Universal | PG-13 | Adventure |
12 | 7/1/2003 | Terminator 3: Rise of the Machines | 1.70e+08 | 150358296 | 433058296 | Warner Bros. | R | Action |
13 | 5/7/2004 | Van Helsing | 1.70e+08 | 120150546 | 300150546 | Universal | PG-13 | Action |
14 | 5/22/2015 | Tomorrowland | 1.70e+08 | 93436322 | 207283457 | Walt Disney | PG | Adventure |
15 | 5/27/2016 | Alice Through the Looking Glass | 1.70e+08 | 77042381 | 276934087 | Walt Disney | PG | Adventure |
16 | 5/21/2010 | Shrek Forever After | 1.65e+08 | 238736787 | 756244673 | Paramount Pictures | PG | Adventure |
17 | 11/4/2016 | Doctor Strange | 1.65e+08 | 232641920 | 676486457 | Walt Disney | PG-13 | Action |
18 | 11/7/2014 | Big Hero 6 | 1.65e+08 | 222527828 | 652127828 | Walt Disney | PG | Adventure |
19 | 3/26/2010 | How to Train Your Dragon | 1.65e+08 | 217581232 | 494870992 | Paramount Pictures | PG | Adventure |
20 | 11/2/2012 | Wreck-It Ralph | 1.65e+08 | 189412677 | 496511521 | Walt Disney | PG | Adventure |
[1] "Comedy" "Action" "Adventure" "Action" "Action" "Action"
[1] Comedy Action Adventure Action Action Action
Levels: Action Adventure Comedy Drama Horror
Generally, the factor representation saves space in memory:
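The comparison itself is not shown here; the sketch below uses a simulated genre vector rather than the movies data. The size advantage assumes a typical 64-bit build of R, where each element of a character vector holds an 8-byte pointer while a factor holds a 4-byte integer code. For very short vectors the factor can actually be larger, because the levels and class attributes add overhead.

```r
genres <- c("Action", "Adventure", "Comedy", "Drama", "Horror")

# Simulate a long column in which the same few strings repeat many times
set.seed(1)
genre_chr <- sample(genres, size = 1e5, replace = TRUE)
genre_fct <- factor(genre_chr)

size_chr <- object.size(genre_chr)
size_fct <- object.size(genre_fct)

print(size_chr, units = "Kb")
print(size_fct, units = "Kb")
```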
forcats package
The forcats package provides helpful functions for working with factors. Consider the following example:
fct_recode(): recode levels
fct_collapse(): collapse levels
fct_other(): replace with “Other”
forcats cheatsheet
factor(..., levels = ...)
fct_count()
fct_unique()
fct_c()
fct_relevel()
fct_drop()
fct_expand()
fct_recode()
fct_collapse()
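As a sketch of how three of these functions behave, here is a small example on a made-up genre factor. The vector and the new level names ("Scary", "AA") are chosen for illustration; the code used in the lesson itself is not shown here.

```r
library(forcats)

genre <- factor(c("Comedy", "Action", "Adventure", "Action", "Drama", "Horror"))
levels(genre)

# fct_recode(): rename a level (new name = "old name")
recoded <- fct_recode(genre, Scary = "Horror")
levels(recoded)

# fct_collapse(): merge several levels into one
collapsed <- fct_collapse(genre, AA = c("Action", "Adventure"))
levels(collapsed)

# fct_other(): keep some levels and lump everything else into "Other"
lumped <- fct_other(genre, keep = c("Action", "Comedy"))
levels(lumped)

# fct_count(): number of observations per level
fct_count(collapsed)
```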
The mutate() function from dplyr (part of the tidyverse) gives a convenient way to add or change columns in a data frame.
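A minimal sketch of mutate(), using three rows copied from the movies table above. The name movies_small and the profit column are made up for this example.

```r
library(dplyr)

# Three rows taken from the table above
movies_small <- tibble(
  movie = c("Evan Almighty", "Waterworld", "Iron Man 2"),
  production_budget = c(1.75e8, 1.75e8, 1.70e8),
  worldwide_gross = c(174131329, 264246220, 621156389),
  genre = c("Comedy", "Action", "Action")
)

movies_small <- movies_small |>
  mutate(
    profit = worldwide_gross - production_budget,  # add a new column
    genre = factor(genre)                          # change an existing column
  )

movies_small
```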
Recall that the summarize() function can be used to calculate statistics on our entire data frame:
What if we want to learn the max gross for each genre? When we have a character/factor column, we can use the group_by() function in combination with summarize() to calculate group-specific statistics:
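The two patterns described above can be sketched on a handful of rows copied from the movies table. The names movies_small, max_gross, and highest_gross are chosen for this example, and the which.max() indexing is just one possible way to find the title with the highest gross.

```r
library(dplyr)

# Five rows taken from the table above
movies_small <- tibble(
  movie = c("Evan Almighty", "Waterworld", "Iron Man 2",
            "The Polar Express", "Tomorrowland"),
  genre = c("Comedy", "Action", "Action", "Adventure", "Adventure"),
  worldwide_gross = c(174131329, 264246220, 621156389, 310634169, 207283457)
)

# One statistic for the entire data frame
overall <- movies_small |>
  summarize(max_gross = max(worldwide_gross))

# The same statistic, calculated separately for each genre
by_genre <- movies_small |>
  group_by(genre) |>
  summarize(
    max_gross = max(worldwide_gross),
    highest_gross = movie[which.max(worldwide_gross)]  # title of the top movie
  )

overall
by_genre
```

Passing more than one column to group_by(), such as group_by(genre, mpaa_rating), would give one row per combination of values instead.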
What do you think this code returns?
# A tibble: 4 × 2
  genre  highest_gross
  <fct>  <chr>
1 AA     Jurassic World: Fallen Kingdom
2 Comedy The Hangover Part II
3 Drama  ET: The Extra-Terrestrial
4 Scary  It
What do you think this code returns?
# A tibble: 19 × 3
   genre  mpaa_rating highest_gross
   <fct>  <chr>       <chr>
 1 AA     G           The Lion King
 2 AA     PG          Minions
 3 AA     PG-13       Jurassic World: Fallen Kingdom
 4 AA     R           Deadpool
 5 AA     <NA>        Conan the Barbarian
 6 Comedy G           Gnomeo and Juliet
 7 Comedy PG          Home Alone
 8 Comedy PG-13       Meet the Fockers
 9 Comedy R           The Hangover Part II
10 Comedy <NA>        It's a Mad Mad Mad Mad World
11 Drama  G           Gone with the Wind
12 Drama  PG          ET: The Extra-Terrestrial
13 Drama  PG-13       The Twilight Saga: Eclipse
14 Drama  R           The Passion of the Christ
15 Drama  <NA>        The Postman Always Rings Twice
16 Scary  PG          Jaws
17 Scary  PG-13       I am Legend
18 Scary  R           It
19 Scary  <NA>        Friday the 13th