What is the main difference between matrices and data frames?
What does the following code do?
Answer:
Matrices only contain one type of data whereas each column of a data frame may contain a different type of data.
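A small sketch of this difference, using made-up values (the names `m` and `df` are illustrative only): mixing numbers and text in a matrix coerces everything to one type, while a data frame keeps a separate type for each column.

```r
# A matrix holds a single type, so the integers are coerced to character.
m <- cbind(id = 1:3, name = c("a", "b", "c"))
typeof(m)  # "character"

# A data frame stores each column separately, so types can differ by column.
df <- data.frame(id = 1:3, name = c("a", "b", "c"))
sapply(df, class)  # one class per column
```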
Today we’ll use the `flights` data to practice our data exploration skills.
If a package is available on CRAN, like most packages we will use for this course, you can install it using `install.packages()`, for example `install.packages("nycflights13")`.
You can also install by clicking Install in the Packages tab through RStudio.
For the most part, after you install a package, it is saved on your computer until you update R, and you will not need to re-install it. Thus, you should never include a call to `install.packages()` in any `.R` or `.Rmd` file!
After a package is installed, you can load it into your current R session using `library()`, for example `library(nycflights13)`.
Note that unlike `install.packages()`, `library()` does not need the package name in quotes.
Loading a package must be done with each new R session, so you should put calls to `library()` in your `.R` and `.Rmd` files.
This can be done in the opening code chunk. In an `.Rmd` file, you can set the chunk option `include = FALSE` to hide the messages and code if the details are unimportant for the reader.
```{r, include = FALSE}
library(nycflights13)
```
Once we load the `nycflights13` package, we can access the `flights` data by typing its name.
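As a minimal sketch, assuming `nycflights13` is already installed, the dataset can be printed and inspected like this:

```r
library(nycflights13)

flights         # printing the object shows the first rows and the column types
dim(flights)    # number of rows and columns
class(flights)  # includes "data.frame", so base R tools also apply
```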
The `flights` data is saved as a special kind of data frame called a tibble. The main difference between tibbles and data frames is that tibbles generally display more nicely.
The `head()` function prints the first \(m\) rows (\(m=6\) by default), so `head(flights)` gives:
# A tibble: 6 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
We can access documentation on the `flights` dataset using the `?` operator, as in `?flights`.
The `summary()` function provides a default way to summarize the dataset. For example, `summary(flights)` gives:
year month day dep_time sched_dep_time
Min. :2013 Min. : 1.000 Min. : 1.00 Min. : 1 Min. : 106
1st Qu.:2013 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.: 907 1st Qu.: 906
Median :2013 Median : 7.000 Median :16.00 Median :1401 Median :1359
Mean :2013 Mean : 6.549 Mean :15.71 Mean :1349 Mean :1344
3rd Qu.:2013 3rd Qu.:10.000 3rd Qu.:23.00 3rd Qu.:1744 3rd Qu.:1729
Max. :2013 Max. :12.000 Max. :31.00 Max. :2400 Max. :2359
NA's :8255
dep_delay arr_time sched_arr_time arr_delay
Min. : -43.00 Min. : 1 Min. : 1 Min. : -86.000
1st Qu.: -5.00 1st Qu.:1104 1st Qu.:1124 1st Qu.: -17.000
Median : -2.00 Median :1535 Median :1556 Median : -5.000
Mean : 12.64 Mean :1502 Mean :1536 Mean : 6.895
3rd Qu.: 11.00 3rd Qu.:1940 3rd Qu.:1945 3rd Qu.: 14.000
Max. :1301.00 Max. :2400 Max. :2359 Max. :1272.000
NA's :8255 NA's :8713 NA's :9430
carrier flight tailnum origin
Length:336776 Min. : 1 Length:336776 Length:336776
Class :character 1st Qu.: 553 Class :character Class :character
Mode :character Median :1496 Mode :character Mode :character
Mean :1972
3rd Qu.:3465
Max. :8500
dest air_time distance hour
Length:336776 Min. : 20.0 Min. : 17 Min. : 1.00
Class :character 1st Qu.: 82.0 1st Qu.: 502 1st Qu.: 9.00
Mode :character Median :129.0 Median : 872 Median :13.00
Mean :150.7 Mean :1040 Mean :13.18
3rd Qu.:192.0 3rd Qu.:1389 3rd Qu.:17.00
Max. :695.0 Max. :4983 Max. :23.00
NA's :9430
minute time_hour
Min. : 0.00 Min. :2013-01-01 05:00:00.00
1st Qu.: 8.00 1st Qu.:2013-04-04 13:00:00.00
Median :29.00 Median :2013-07-03 10:00:00.00
Mean :26.23 Mean :2013-07-03 05:22:54.64
3rd Qu.:44.00 3rd Qu.:2013-10-01 07:00:00.00
Max. :59.00 Max. :2013-12-31 23:00:00.00
As we saw last week, the `$` operator allows us to pull a single column from our dataset, as in `flights$origin`.
For character vectors that represent categorical variables, the `unique()` and `table()` functions provide useful summaries.
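For instance, a short sketch using the `origin` column (the choice of column is ours, for illustration):

```r
library(nycflights13)

unique(flights$origin)  # the distinct departure airports
table(flights$origin)   # number of flights from each airport
```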
For numeric columns, we can use many of the functions we’ve seen earlier:
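A possible set of calls, shown as a sketch rather than the exact code from lecture. The `summary()` output above reports missing values in `dep_delay`, so `na.rm = TRUE` is needed to get a numeric answer.

```r
library(nycflights13)

mean(flights$dep_delay)                  # NA, because some delays are missing
mean(flights$dep_delay, na.rm = TRUE)    # mean of the non-missing delays
median(flights$dep_delay, na.rm = TRUE)
range(flights$distance)                  # shortest and longest flight distance
```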
What if we want to remove all rows with `NA` values? Or what if we only want to look at flights from JFK?
The most basic way to take a subset of (rows of) a data frame is to define an appropriate logical vector:
What do you think the following code does?
[1] FALSE FALSE FALSE FALSE FALSE FALSE
We can then use this logical vector to index the desired rows:
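A minimal sketch of both steps, assuming we want the JFK flights and the rows without any missing values (the object names are illustrative):

```r
library(nycflights13)

# TRUE for flights that departed from JFK, FALSE otherwise.
from_jfk <- flights$origin == "JFK"
flights_jfk <- flights[from_jfk, ]  # keep only the rows where from_jfk is TRUE

# complete.cases() flags rows that have no NA in any column.
flights_complete <- flights[complete.cases(flights), ]
nrow(flights_complete)
```

Leaving the column position empty after the comma keeps every column.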
What do you think the following lines of code do?
Suppose we create three separate datasets, one for each departure airport.
Other summaries might be more relevant/informative:
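One way this could look, as a sketch: the three subsets are built with logical indexing, and the 60-minute threshold below is a hypothetical choice for illustration.

```r
library(nycflights13)

flights_ewr <- flights[flights$origin == "EWR", ]
flights_jfk <- flights[flights$origin == "JFK", ]
flights_lga <- flights[flights$origin == "LGA", ]

# Compare a typical departure delay across the three airports.
median(flights_ewr$dep_delay, na.rm = TRUE)
median(flights_jfk$dep_delay, na.rm = TRUE)
median(flights_lga$dep_delay, na.rm = TRUE)

# Share of JFK flights that left more than 60 minutes late.
mean(flights_jfk$dep_delay > 60, na.rm = TRUE)
```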
The previous code provides a basic way to subset datasets. When doing a lot of descriptive/exploratory analysis, many people prefer using code from the `tidyverse` packages: `readr`, `tidyr`, `dplyr`, `ggplot2`, `tibble`, `purrr`, `stringr`, and `forcats`.
You can install them all using `install.packages("tidyverse")`.
(Remember, you only need to do this once!)
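There are three rules required for data to be considered tidy: each variable has its own column, each observation has its own row, and each value has its own cell.
This seems simple, but it can sometimes be tricky. We will discuss transforming data in the future.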
Recall that packages are essentially ways for you to install and use functions written by others. Occasionally, some of these functions have the same name and there is a conflict. Whichever package you load more recently using `library()` will mask the old function, meaning that R will default to that version.
In general, this is fine, especially with `tidyverse`. These package authors know when they have masked common functions in R, and typically we will prefer the `tidyverse` version.
The conflict message is there to make sure you know about conflicts. You can (and should) hide this in your R Markdown files by adding the parameter `message=FALSE` or `include=FALSE` to your code chunk when you load packages.
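As a sketch, a package-loading chunk with the message hidden might look like the following. It loads only `dplyr` (which masks `stats::filter()` and `stats::lag()`, among others) and uses the built-in `mtcars` data as a stand-in.

```{r, message = FALSE}
library(dplyr)

# With dplyr loaded most recently, filter() refers to dplyr's version.
# The package::function form makes the choice explicit either way.
four_cyl <- dplyr::filter(mtcars, cyl == 4)
nrow(four_cyl)
```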
filter()
The `filter()` function from the `dplyr` package provides a way to subset data. The first argument is the data frame, and the second is a logical condition written in terms of that data frame's columns.
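A minimal sketch, assuming we again want the flights that departed from JFK:

```r
library(nycflights13)
library(dplyr)

# The condition refers to the column by name, with no flights$ prefix.
jfk_dplyr <- filter(flights, origin == "JFK")
nrow(jfk_dplyr)
```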
Compare with:
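A base R counterpart, assuming the comparison is with the logical indexing used earlier. The small `toy` data frame is made up to show one practical difference: `filter()` drops rows where the condition is `NA`, while base indexing on a data frame returns a row of `NA`s for them.

```r
library(nycflights13)

# Base R: build the logical vector with $ and index the rows.
jfk_base <- flights[flights$origin == "JFK", ]

# A missing value in the condition produces an all-NA row in base R.
toy <- data.frame(delay = c(5, NA, 90))
toy_late <- toy[toy$delay > 60, , drop = FALSE]
toy_late
```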
summarize()
The `summarize()` function can similarly be used to compute summary statistics.
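For example, a sketch computing the mean departure delay (the output column name `mean_dep_delay` is our choice):

```r
library(nycflights13)
library(dplyr)

# Returns a one-row tibble with one column per named statistic.
delay_summary <- summarize(flights, mean_dep_delay = mean(dep_delay, na.rm = TRUE))
delay_summary

# Base R gives the same number as a plain numeric value.
mean(flights$dep_delay, na.rm = TRUE)
```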
summarize()
The `summarize()` function can handle more than one statistic at once.
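A sketch with three statistics in a single call. The statistics and their names are illustrative; `n()` is the `dplyr` helper that counts rows.

```r
library(nycflights13)
library(dplyr)

delay_stats <- summarize(
  flights,
  mean_dep_delay   = mean(dep_delay, na.rm = TRUE),
  median_dep_delay = median(dep_delay, na.rm = TRUE),
  n_flights        = n()
)
delay_stats
```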
Pipes use the `|>` operator to take the output from a previous function call and “pipe” it through to the next function.
The object before the pipe is treated as the first argument to the function coming after the pipe.
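A tiny illustration with made-up numbers. Note that the native `|>` pipe requires R 4.1 or later.

```r
x <- c(1.234, 5.678)

rounded_call <- round(x, digits = 1)     # ordinary function call
rounded_pipe <- x |> round(digits = 1)   # x becomes the first argument of round()

identical(rounded_call, rounded_pipe)
```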
filter() and summarize()
Pipes are useful if we want to combine multiple functions. To see how this can be useful, consider combining the `filter()` and `summarize()` functions.
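As a sketch, here is the same computation written with nested calls and with pipes. The filter condition and the statistic are our own choices for illustration.

```r
library(nycflights13)
library(dplyr)

# Nested calls are read from the inside out.
nested <- summarize(
  filter(flights, origin == "JFK"),
  mean_dep_delay = mean(dep_delay, na.rm = TRUE)
)

# The piped version reads top to bottom, in the order the steps happen.
piped <- flights |>
  filter(origin == "JFK") |>
  summarize(mean_dep_delay = mean(dep_delay, na.rm = TRUE))

all.equal(nested, piped)
```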
What does the following code do?
# A tibble: 1 × 1
mean_arr_delay
<dbl>
1 5.69
How could you do the same thing with base R?
You can save single R objects as `.rds` files using `saveRDS()`, multiple R objects as `.RData` or `.rda` files using `save()`, and your entire workspace as `.RData` using `save.image()`.
In general, you should use `.RData` for multiple objects. `save.image()` should never be a part of your workflow, as it is not generally reproducible.
You can load `.rds` files using `readRDS()` and `.RData` and `.rda` files using `load()`.
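A round-trip sketch with hypothetical objects and file names. The files are written to the current working directory and removed again at the end.

```r
scores <- c(90, 85, 77)        # hypothetical objects to save
groups <- c("a", "b", "c")

saveRDS(scores, "scores.rds")                  # one object per .rds file
save(scores, groups, file = "objects.RData")   # several objects in one file

rm(scores, groups)

scores_again <- readRDS("scores.rds")  # readRDS() returns the object, so you choose its name
load("objects.RData")                  # load() restores objects under their original names

file.remove("scores.rds", "objects.RData")  # clean up the example files
```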
The values in quotes are all filepaths, and by default, R will search for these objects in your current working directory.
You can change where R searches for these files by adjusting this filepath. For example, if you save your data in a `Data` subfolder within your working directory, you would start the filepath with `Data/`.
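A sketch of reading from a subfolder, using the built-in `mtcars` data and a hypothetical file name. The example creates the `Data` folder if it does not exist.

```r
dir.create("Data", showWarnings = FALSE)                # make the subfolder if needed
saveRDS(mtcars, file.path("Data", "example_cars.rds"))

# The path is interpreted relative to the current working directory.
getwd()
cars_copy <- readRDS("Data/example_cars.rds")

file.remove("Data/example_cars.rds")  # clean up the example file
```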
Often, you will read and write files as comma separated values, or `.csv`. You can do this by navigating File > Import Dataset in the menu bar, but generally I recommend doing it manually using the `readr` package. You will need to do so if loading data is part of your workflow, such as if it is required for an R Markdown writeup.
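A minimal sketch with `readr`, writing the built-in `mtcars` data to a temporary file (a hypothetical location) and reading it back:

```r
library(readr)

csv_path <- file.path(tempdir(), "cars.csv")  # hypothetical file location

write_csv(mtcars, csv_path)                              # row names are not written
cars_csv <- read_csv(csv_path, show_col_types = FALSE)   # returns a tibble
dim(cars_csv)
```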
Two approaches:
Yesterday’s TidyTuesday featured a very simple dataset on UNESCO World Heritage sites. You can find code to download it here.
This dataset has been used for the 1 Dataset, 100 Visualizations project. Take a look and think about which of these visualizations are the most effective.
Using this dataset…