What is the main difference between matrices and data frames?
What does the following code do?
Answer:
Matrices only contain one type of data whereas each column of a data frame may contain a different type of data.
Today we’ll use the flights data to practice our data exploration skills.
If a package is available on CRAN, like most packages we will use for this course, you can install it using install.packages():
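For example, a minimal sketch using the nycflights13 package that these notes load later. This is meant to be typed once in the console, not saved in a script:

```r
# Install from CRAN; the package name must be a quoted string.
install.packages("nycflights13")
```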
You can also install by clicking Install in the Packages tab through RStudio.
For the most part, after you install a package, it is saved on your computer until you update R, and you will not need to re-install it. Thus, you should never include a call to install.packages() in any .R or .Rmd file!
After a package is installed, you can load it into your current R session using library():
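A minimal sketch, assuming nycflights13 has already been installed:

```r
# Attach the package for the current session.
library(nycflights13)

# A quoted name is accepted as well, although the unquoted form is more common.
library("nycflights13")
```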
Note that unlike install.packages(), you do not need to include the package name in quotes.
Loading a package must be done with each new R session, so you should put calls to library() in your .R and .Rmd files.
This can be done in the opening code chunk. In an .Rmd file, you can set the chunk option include = FALSE to hide the code and its messages if the details are unimportant for the reader.
```{r, include = FALSE}
library(nycflights13)
```

Once we load the nycflights13 package, we can access the flights data using the following command:
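The original command is not preserved in these notes. One plausible sketch uses `data()`, which copies a packaged dataset into the workspace. The object `flights` is also available by name as soon as the package is attached.

```r
library(nycflights13)

# Copy the packaged dataset into the global environment.
data("flights", package = "nycflights13")

# Basic checks on what was loaded.
class(flights)
dim(flights)
```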
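The flights data is saved as a special kind of data frame called a tibble. The main difference between tibbles and data frames is that tibbles print more compactly: by default they show only the first few rows and the columns that fit on screen, along with each column's type.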
The head() function prints the first \(m\) rows (\(m=6\) by default):
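The call behind the output shown here is presumably `head(flights)`. As a sketch, the number of rows is set through the `n` argument of `head()`:

```r
library(nycflights13)

head(flights)          # first 6 rows (the default)
head(flights, n = 10)  # first 10 rows
```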
# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
6  2013     1     1      554            558        -4      740            728
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

We can access documentation on the flights dataset using the ? operator:
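A short sketch of two equivalent ways to open the help page, assuming the package is attached:

```r
library(nycflights13)

?flights         # operator form
help("flights")  # function form
```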
The summary() function provides a default way to summarize the dataset.
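The output shown here is consistent with calling `summary()` on the whole dataset. As a sketch, the same function also works on a single column:

```r
library(nycflights13)

summary(flights)            # one block of statistics per column
summary(flights$dep_delay)  # a single numeric column
```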
      year          month             day           dep_time    sched_dep_time
 Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1   Min.   : 106  
 1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907   1st Qu.: 906  
 Median :2013   Median : 7.000   Median :16.00   Median :1401   Median :1359  
 Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349   Mean   :1344  
 3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744   3rd Qu.:1729  
 Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400   Max.   :2359  
                                                 NA's   :8255                 
   dep_delay          arr_time    sched_arr_time   arr_delay       
 Min.   : -43.00   Min.   :   1   Min.   :   1   Min.   : -86.000  
 1st Qu.:  -5.00   1st Qu.:1104   1st Qu.:1124   1st Qu.: -17.000  
 Median :  -2.00   Median :1535   Median :1556   Median :  -5.000  
 Mean   :  12.64   Mean   :1502   Mean   :1536   Mean   :   6.895  
 3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:1945   3rd Qu.:  14.000  
 Max.   :1301.00   Max.   :2400   Max.   :2359   Max.   :1272.000  
 NA's   :8255      NA's   :8713                  NA's   :9430      
   carrier              flight       tailnum             origin         
 Length:336776      Min.   :   1   Length:336776      Length:336776     
 Class :character   1st Qu.: 553   Class :character   Class :character  
 Mode  :character   Median :1496   Mode  :character   Mode  :character  
                    Mean   :1972                                        
                    3rd Qu.:3465                                        
                    Max.   :8500                                        
                                                                        
     dest              air_time        distance         hour      
 Length:336776      Min.   : 20.0   Min.   :  17   Min.   : 1.00  
 Class :character   1st Qu.: 82.0   1st Qu.: 502   1st Qu.: 9.00  
 Mode  :character   Median :129.0   Median : 872   Median :13.00  
                    Mean   :150.7   Mean   :1040   Mean   :13.18  
                    3rd Qu.:192.0   3rd Qu.:1389   3rd Qu.:17.00  
                    Max.   :695.0   Max.   :4983   Max.   :23.00  
                    NA's   :9430                                  
     minute        time_hour                     
 Min.   : 0.00   Min.   :2013-01-01 05:00:00.00  
 1st Qu.: 8.00   1st Qu.:2013-04-04 13:00:00.00  
 Median :29.00   Median :2013-07-03 10:00:00.00  
 Mean   :26.23   Mean   :2013-07-03 05:22:54.64  
 3rd Qu.:44.00   3rd Qu.:2013-10-01 07:00:00.00  
 Max.   :59.00   Max.   :2013-12-31 23:00:00.00  

As we saw last week, the $ operator allows us to pull a single column from our dataset:
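A minimal sketch using the `carrier` column. The variable name `carrier` on the left is only illustrative:

```r
library(nycflights13)

carrier <- flights$carrier  # a character vector with one entry per flight
length(carrier)
head(carrier)
```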
For character vectors that represent categorical variables, the unique() and table() functions provide useful summaries:
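For instance, a sketch applying both functions to the `origin` and `carrier` columns:

```r
library(nycflights13)

unique(flights$origin)  # the distinct departure airports
table(flights$origin)   # number of flights per departure airport

# Carriers ordered from most to fewest flights.
sort(table(flights$carrier), decreasing = TRUE)
```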
For numeric columns, we can use many of the functions we’ve seen earlier:
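A short sketch with `dep_delay`, which contains missing values according to the summary output, so `na.rm = TRUE` is needed to get a number back:

```r
library(nycflights13)

mean(flights$dep_delay)                  # NA, because some delays are missing
mean(flights$dep_delay, na.rm = TRUE)    # mean of the non-missing delays
median(flights$dep_delay, na.rm = TRUE)
range(flights$distance)                  # distance has no missing values
```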
What if we want to remove all rows with NA values? Or what if we only want to look at flights from JFK?
The most basic way to take a subset of (rows of) a data frame is to define an appropriate logical vector:
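As a sketch of the two questions raised earlier (the names `is_jfk` and `has_na` are illustrative, not from the original notes):

```r
library(nycflights13)

is_jfk <- flights$origin == "JFK"   # TRUE where the flight departed from JFK
has_na <- !complete.cases(flights)  # TRUE where any column is missing

length(is_jfk)  # one entry per row of flights
sum(is_jfk)     # number of JFK departures
sum(has_na)     # number of rows with at least one NA
```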
What do you think the following code does?
[1] FALSE FALSE FALSE FALSE FALSE FALSE

We can then use this logical vector to index the desired rows:
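A minimal sketch of row indexing. The logical vector goes before the comma, and leaving the column position empty keeps every column. The object names are illustrative.

```r
library(nycflights13)

# Keep only flights that departed from JFK.
jfk_flights <- flights[flights$origin == "JFK", ]

# Keep only rows with no missing values in any column.
complete_flights <- flights[complete.cases(flights), ]

nrow(jfk_flights)
nrow(complete_flights)
```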
What do you think the following lines of code do?
Suppose we create three separate datasets, one for each departure airport.
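One way this could look in base R, as a sketch with illustrative object names:

```r
library(nycflights13)

flights_ewr <- flights[flights$origin == "EWR", ]
flights_jfk <- flights[flights$origin == "JFK", ]
flights_lga <- flights[flights$origin == "LGA", ]

# Compare the same summary across the three airports.
mean(flights_ewr$dep_delay, na.rm = TRUE)
mean(flights_jfk$dep_delay, na.rm = TRUE)
mean(flights_lga$dep_delay, na.rm = TRUE)
```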
Other summaries might be more relevant/informative:
The previous code provides a basic way to subset datasets. When doing a lot of descriptive/exploratory analysis, many people prefer using code from the tidyverse packages:
readr, tidyr, dplyr, ggplot2, tibble, purrr, stringr, forcats

You can install them all using
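The original command is missing here. The usual approach, sketched below, is to install the `tidyverse` meta-package from CRAN and then attach it in each session:

```r
# Run once in the console.
install.packages("tidyverse")

# Run in every session that uses these packages.
library(tidyverse)
```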
(Remember, you only need to do this once!)
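There are three rules required for data to be considered tidy: each variable has its own column, each observation has its own row, and each value has its own cell.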
This seems simple, but it can sometimes be tricky. We will discuss transforming data in the future.
Recall that packages are essentially ways for you to install and use functions written by others. Occasionally, some of these functions have the same name and there is a conflict. Whichever package you load more recently using library() will mask the old function, meaning that R will default to that version.
In general, this is fine, especially with tidyverse. These package authors know when they have masked common functions in R, and typically we will prefer the tidyverse version.
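To make masking concrete, here is a small sketch using dplyr, whose `filter()` masks `stats::filter()`. The `::` operator is not covered in these notes; it is shown only as one way to reach a masked function explicitly.

```r
library(dplyr)  # attaching dplyr masks stats::filter() and stats::lag()

# The unqualified name now refers to the dplyr version.
environmentName(environment(filter))

# The masked function is still reachable with an explicit namespace.
# Here it computes a 3-point moving average of 1:5.
stats::filter(1:5, rep(1 / 3, 3))
```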
The conflict message is to make sure you know about conflicts. You can (and should) hide this in your R Markdown files by adding the parameter message=FALSE or include=FALSE to your code chunk when you load packages.
filter()

The filter() function from the dplyr package provides a way to subset data. The second argument of filter() below looks for a logical vector defined in terms of the variables in the first argument.
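A minimal sketch, assuming the JFK example from earlier in these notes. Column names are written bare because `filter()` looks them up inside the data frame.

```r
library(dplyr)
library(nycflights13)

# Flights that departed from JFK.
jfk_flights <- filter(flights, origin == "JFK")

# Several conditions are combined with AND.
jfk_january <- filter(flights, origin == "JFK", month == 1)
```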
Compare with:
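The base R counterpart is presumably bracket indexing, sketched here. One difference worth knowing: when the condition evaluates to NA, `[` returns a row of NAs, while `filter()` drops that row.

```r
library(dplyr)
library(nycflights13)

base_jfk <- flights[flights$origin == "JFK", ]
tidy_jfk <- filter(flights, origin == "JFK")
nrow(base_jfk) == nrow(tidy_jfk)  # same rows when the condition has no NAs

# dep_delay has missing values, so the two approaches give different row counts.
nrow(flights[flights$dep_delay > 60, ])
nrow(filter(flights, dep_delay > 60))
```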
summarize()

The summarize() function can similarly be used to compute summary statistics:
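A minimal sketch. The name on the left of `=` becomes the column name of the one-row result and is an illustrative choice here.

```r
library(dplyr)
library(nycflights13)

delay_summary <- summarize(flights, mean_dep_delay = mean(dep_delay, na.rm = TRUE))
delay_summary
```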
summarize()

The summarize() function can handle more than one statistic at once:
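As a sketch, each named argument adds one column to the result. The column names are illustrative, and `n()` is the dplyr helper that counts rows.

```r
library(dplyr)
library(nycflights13)

delay_stats <- summarize(
  flights,
  mean_dep_delay   = mean(dep_delay, na.rm = TRUE),
  median_dep_delay = median(dep_delay, na.rm = TRUE),
  n_flights        = n()
)
delay_stats
```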
Pipes use the |> operator to take the output from a previous function call and “pipe” it through to the next function.
The object before the pipe is treated as the first argument to the function coming after the pipe.
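A small sketch of that equivalence. Note that the native `|>` pipe requires R 4.1 or later.

```r
library(nycflights13)

# These two calls do the same thing.
direct <- head(flights, n = 3)
piped  <- flights |> head(n = 3)

identical(direct, piped)
```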
filter() and summarize()

Pipes are useful if we want to combine multiple functions. To see how this can be useful, consider combining the filter() and summarize() functions:
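The original chunk is not preserved, so the following is only a sketch with an assumed filter condition and an illustrative column name. It contrasts nested calls with the piped form.

```r
library(dplyr)
library(nycflights13)

# Nested calls are read from the inside out.
nested <- summarize(
  filter(flights, origin == "JFK"),
  avg_dep_delay = mean(dep_delay, na.rm = TRUE)
)

# The piped version reads top to bottom, one step per line.
piped <- flights |>
  filter(origin == "JFK") |>
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE))

identical(nested, piped)
```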
What does the following code do?
# A tibble: 1 × 1
  mean_arr_delay
           <dbl>
1           5.69

How could you do the same thing with base R?
You can save single R objects as .rds files using saveRDS(), multiple R objects as .RData or .rda files using save(), and your entire workspace as .RData using save.image().
In general, use .RData for multiple objects and avoid save.image().
save.image() should not be part of your workflow: it saves whatever happens to be in your workspace, which makes results hard to reproduce.
You can load .rds files using readRDS() and .RData and .rda files using load().
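A round-trip sketch with small hypothetical objects and file names. `readRDS()` returns the object, so it must be assigned, whereas `load()` restores objects under their original names.

```r
scores <- c(90, 85, 77)   # hypothetical objects for illustration
labels <- c("a", "b", "c")

# One object per .rds file.
saveRDS(scores, "scores.rds")
scores_copy <- readRDS("scores.rds")

# Several objects in one .RData file.
save(scores, labels, file = "objects.RData")
rm(scores, labels)
load("objects.RData")     # scores and labels exist again

identical(scores, scores_copy)

# Remove the example files.
file.remove("scores.rds", "objects.RData")
```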
The values in quotes are all filepaths, and by default, R will search for these objects in your current working directory.
You can change where R searches for these files by adjusting the filepath. For example, if you save your data in a Data subfolder within your working directory, you might try
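The original example is missing. As a sketch that runs anywhere, `tempdir()` stands in for the working directory below, and the object and file names are hypothetical.

```r
# tempdir() stands in for your working directory in this sketch.
project_dir <- tempdir()
dir.create(file.path(project_dir, "Data"), showWarnings = FALSE)

scores <- c(90, 85, 77)  # hypothetical object
rds_path <- file.path(project_dir, "Data", "scores.rds")

saveRDS(scores, rds_path)
scores_copy <- readRDS(rds_path)

# With the working directory set to the project folder, the relative path
# "Data/scores.rds" would point to the same file.
identical(scores, scores_copy)
```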
Often, you will read and write files as comma separated values, or .csv. You can do this by navigating File > Import Dataset in the menu bar, but generally I recommend doing it in code using the readr package. You will need to do so if loading data is part of your workflow, such as if it is required for an R Markdown writeup.
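A minimal round-trip sketch with readr, using a small hypothetical data frame and a temporary file so that nothing is written to your project folder:

```r
library(readr)

scores <- data.frame(name = c("a", "b", "c"), score = c(90, 85, 77))  # hypothetical data
csv_path <- tempfile(fileext = ".csv")

write_csv(scores, csv_path)

# show_col_types = FALSE suppresses the column type message.
scores_in <- read_csv(csv_path, show_col_types = FALSE)
scores_in
```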
Two approaches:
Yesterday’s TidyTuesday featured a very simple dataset on UNESCO World Heritage sites. You can find code to download it here.
This dataset has been used for the 1 Dataset, 100 Visualizations project. Take a look and think about which of these visualizations are the most effective.
Using this dataset…