MATH167R: Importing and Exploring Data

Peter Gao

Warm-up

  1. What is the main difference between matrices and data frames?

  2. What does the following code do?

    B <- diag(1, nrow = 4)
    B <- B + .01
    print(B[2, 3])

Warm-up

Answer:

  1. Matrices only contain one type of data whereas each column of a data frame may contain a different type of data.

  2. B <- diag(1, nrow = 4)
    B <- B + .01
    B
         [,1] [,2] [,3] [,4]
    [1,] 1.01 0.01 0.01 0.01
    [2,] 0.01 1.01 0.01 0.01
    [3,] 0.01 0.01 1.01 0.01
    [4,] 0.01 0.01 0.01 1.01
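
Both answers can be checked directly in the console. The sketch below uses small made-up objects (`m` and `df` are illustrative names, not part of the course data) to show that a matrix coerces everything to one type while a data frame keeps a type per column. It then evaluates the expression from the warm-up question.

```r
# A matrix holds a single type: mixing numbers and text coerces all entries to character.
m <- cbind(id = 1:2, name = c("a", "b"))
typeof(m)

# A data frame keeps a separate type for each column.
df <- data.frame(id = 1:2, name = c("a", "b"), stringsAsFactors = FALSE)
col_classes <- sapply(df, class)
col_classes

# Warm-up question 2: a 4 x 4 identity matrix with 0.01 added to every entry.
B <- diag(1, nrow = 4)
B <- B + .01
B[2, 3]  # an off-diagonal entry, so 0 + 0.01
```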

Overview of today

  • Importing and downloading data
  • Descriptive statistics and summaries
  • Filtering and sorting data
  • Pipe notation

Importing data

Today we’ll use the flights data to practice our data exploration skills.

R Packages

  • Packages bundle together code, data, and documentation in an easy-to-share way.
  • They come with code that others have written to extend the functionality of R.
  • Packages can range from graphical software, to web scraping tools, statistical models for spatio-temporal data, microbial data analysis tools, etc.

Downloading packages

  • The most popular package repository is the Comprehensive R Archive Network, or CRAN.
  • Other popular repositories include Bioconductor and GitHub.

Installing packages

If a package is available on CRAN, like most packages we will use for this course, you can install it using install.packages():
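
A minimal sketch of that step, assuming the package in question is nycflights13 (the package loaded later in these slides). The install call is left commented out so that the snippet only checks whether the package is already present; `pkg` and `is_installed` are illustrative names.

```r
pkg <- "nycflights13"

# Run this once, interactively in the console (note the quotes around the name):
# install.packages("nycflights13")

# requireNamespace() returns TRUE if the package is already installed.
is_installed <- requireNamespace(pkg, quietly = TRUE)
if (!is_installed) {
  message("Package '", pkg, "' is not installed yet.")
}
```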

You can also install by clicking Install in the Packages tab through RStudio.

For the most part, after you install a package, it is saved on your computer until you update R, and you will not need to re-install it. Thus, you should never include a call to install.packages() in any .R or .Rmd file!

Loading packages

After a package is installed, you can load it into your current R session using library():

Note that unlike install.packages(), you do not need to include the package name in quotes.
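
As an illustration, the sketch below loads `stats`, a package that ships with every R installation, so it runs without installing anything. Both the unquoted and the quoted form attach the package.

```r
# Unquoted package name: the usual style.
library(stats)

# A quoted name also works.
library("stats")

# search() lists the attached packages; "package:stats" should appear.
stats_attached <- "package:stats" %in% search()
stats_attached
```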

Loading packages

Loading a package must be done with each new R session, so you should put calls to library() in your .R and .Rmd files.

This can be done in the opening code chunk. If it is a .Rmd file, you can set the chunk option include = FALSE to hide the messages and code if the details are unimportant for the reader.

```{r, include = FALSE}
library(nycflights13)
```

Exploring data

Once we load the nycflights13 package, we can access the flights data using the following command:

library(nycflights13)
data(flights)

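The flights data is saved as a special kind of data frame called a tibble. The main difference between tibbles and data frames is that tibbles generally display more nicely: printing a tibble shows only the first 10 rows and the columns that fit on the screen, along with each column's type.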

class(flights)
[1] "tbl_df"     "tbl"        "data.frame"

Exploring data

The head() function prints the first \(n\) rows (\(n=6\) by default):

head(flights)
# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
6  2013     1     1      554            558        -4      740            728
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Exploring data

We can access documentation on the flights dataset using the ? operator:

?flights

The nrow(), ncol(), and dim() functions provide information about the number of rows and columns:

nrow(flights)
[1] 336776
ncol(flights)
[1] 19
dim(flights)
[1] 336776     19
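
The same functions work on any data frame. The sketch below uses a small made-up data frame, `toy_flights`, whose column names mimic the flights data but whose values are invented for illustration.

```r
# Invented values; only the column names mirror the flights data.
toy_flights <- data.frame(
  origin    = c("EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA"),
  dep_delay = c(2, 4, 2, -1, -6, -4, NA, 75),
  stringsAsFactors = FALSE
)

head(toy_flights, n = 3)  # first 3 rows instead of the default 6
nrow(toy_flights)         # number of rows
ncol(toy_flights)         # number of columns
dim(toy_flights)          # rows, then columns
```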

Summarizing data

The summary() function provides a default way to summarize the dataset.

summary(flights)
      year          month             day           dep_time    sched_dep_time
 Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1   Min.   : 106  
 1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907   1st Qu.: 906  
 Median :2013   Median : 7.000   Median :16.00   Median :1401   Median :1359  
 Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349   Mean   :1344  
 3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744   3rd Qu.:1729  
 Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400   Max.   :2359  
                                                 NA's   :8255                 
   dep_delay          arr_time    sched_arr_time   arr_delay       
 Min.   : -43.00   Min.   :   1   Min.   :   1   Min.   : -86.000  
 1st Qu.:  -5.00   1st Qu.:1104   1st Qu.:1124   1st Qu.: -17.000  
 Median :  -2.00   Median :1535   Median :1556   Median :  -5.000  
 Mean   :  12.64   Mean   :1502   Mean   :1536   Mean   :   6.895  
 3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:1945   3rd Qu.:  14.000  
 Max.   :1301.00   Max.   :2400   Max.   :2359   Max.   :1272.000  
 NA's   :8255      NA's   :8713                  NA's   :9430      
   carrier              flight       tailnum             origin         
 Length:336776      Min.   :   1   Length:336776      Length:336776     
 Class :character   1st Qu.: 553   Class :character   Class :character  
 Mode  :character   Median :1496   Mode  :character   Mode  :character  
                    Mean   :1972                                        
                    3rd Qu.:3465                                        
                    Max.   :8500                                        
                                                                        
     dest              air_time        distance         hour      
 Length:336776      Min.   : 20.0   Min.   :  17   Min.   : 1.00  
 Class :character   1st Qu.: 82.0   1st Qu.: 502   1st Qu.: 9.00  
 Mode  :character   Median :129.0   Median : 872   Median :13.00  
                    Mean   :150.7   Mean   :1040   Mean   :13.18  
                    3rd Qu.:192.0   3rd Qu.:1389   3rd Qu.:17.00  
                    Max.   :695.0   Max.   :4983   Max.   :23.00  
                    NA's   :9430                                  
     minute        time_hour                     
 Min.   : 0.00   Min.   :2013-01-01 05:00:00.00  
 1st Qu.: 8.00   1st Qu.:2013-04-04 13:00:00.00  
 Median :29.00   Median :2013-07-03 10:00:00.00  
 Mean   :26.23   Mean   :2013-07-03 05:22:54.64  
 3rd Qu.:44.00   3rd Qu.:2013-10-01 07:00:00.00  
 Max.   :59.00   Max.   :2013-12-31 23:00:00.00  
                                                 

Summarizing one column at a time

As we saw last week, the $ operator allows us to pull a single column from our dataset:

flights$origin[1:10]
 [1] "EWR" "LGA" "JFK" "JFK" "LGA" "EWR" "EWR" "LGA" "JFK" "LGA"

Summarizing one column at a time

For character vectors that represent categorical variables, the unique() and table() functions provide useful summaries:

unique(flights$origin)
[1] "EWR" "LGA" "JFK"
table(flights$origin)

   EWR    JFK    LGA 
120835 111279 104662 
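
A small sketch on a made-up vector (`origin` here is invented, not the real column) shows how the two summaries differ: unique() keeps values in order of first appearance, while table() sorts the categories and counts them.

```r
origin <- c("EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR")  # invented values

unique(origin)           # distinct values, in order of first appearance
counts <- table(origin)  # counts per category, sorted alphabetically
counts

# Dividing by the total turns counts into proportions.
counts / length(origin)
```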

Summarizing one column at a time

For numeric columns, we can use many of the functions we’ve seen earlier:

mean(flights$dep_delay, na.rm = TRUE)
[1] 12.63907
range(flights$dep_delay, na.rm = TRUE)
[1]  -43 1301
max(flights$dep_delay, na.rm = TRUE)
[1] 1301
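
The na.rm argument matters because any arithmetic involving NA returns NA. A minimal sketch with a made-up vector of delays:

```r
delays <- c(2, NA, 4)  # invented values with one missing entry

mean(delays)                # NA: the missing value propagates
mean(delays, na.rm = TRUE)  # the missing value is dropped before averaging
sum(is.na(delays))          # how many values are missing
```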

Filtering data

Subsetting a data frame

What if we want to remove all rows with NA values? Or what if we only want to look at flights from JFK?

The most basic way to take a subset of (rows of) a data frame is to define an appropriate logical vector:

What do you think the following code does?

is_delayed <- flights$dep_delay > 60
head(is_delayed)
[1] FALSE FALSE FALSE FALSE FALSE FALSE

Subsetting a data frame

We can then use this logical vector to index the desired rows:

delayed_flights <- flights[is_delayed, ]
nrow(delayed_flights)
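
One caveat is worth checking when the column used in the comparison contains missing values, as dep_delay does. Comparing NA with a number gives NA, and indexing rows with NA produces rows filled with NA rather than dropping them. The sketch below shows this on a made-up data frame (`toy` is an illustrative name) and uses which() as one possible way to keep only the rows where the condition is TRUE.

```r
toy <- data.frame(dep_delay = c(5, 90, NA, 120))  # invented values

is_delayed <- toy$dep_delay > 60
is_delayed  # FALSE TRUE NA TRUE

with_na_rows <- toy[is_delayed, , drop = FALSE]
nrow(with_na_rows)  # counts the all-NA row as well

# which() returns the positions that are TRUE, so NA entries are skipped.
only_true <- toy[which(is_delayed), , drop = FALSE]
nrow(only_true)
```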

Check your understanding: Subsetting

What do you think the following lines of code do?

flights[flights$origin == "JFK", ]
flights[flights$air_time > 120, ]
flights[!is.na(flights$arr_delay), ]
flights[complete.cases(flights), ]
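
To experiment with these expressions without printing the full dataset, a small made-up data frame can help. In the sketch below, `toy` and its values are invented; only the column names mirror the flights data.

```r
toy <- data.frame(
  origin    = c("JFK", "LGA", "JFK", "EWR"),
  air_time  = c(150, 60, NA, 200),
  arr_delay = c(10, NA, 5, -3),
  stringsAsFactors = FALSE
)

jfk_rows      <- toy[toy$origin == "JFK", ]    # rows whose origin is JFK
observed_rows <- toy[!is.na(toy$arr_delay), ]  # rows where arr_delay is not missing
complete_rows <- toy[complete.cases(toy), ]    # rows with no NA in any column

nrow(jfk_rows)
nrow(observed_rows)
nrow(complete_rows)
```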

Summarizing subsets of a dataset

Suppose we create three separate datasets, one for each departure airport.

JFK_flights <- flights[flights$origin == "JFK", ]
LGA_flights <- flights[flights$origin == "LGA", ]
EWR_flights <- flights[flights$origin == "EWR", ]
mean(JFK_flights$arr_delay, na.rm = TRUE)
[1] 5.551481
mean(LGA_flights$arr_delay, na.rm = TRUE)
[1] 5.783488
mean(EWR_flights$arr_delay, na.rm = TRUE)
[1] 9.107055

Summarizing subsets of a dataset

Other summaries might be more relevant/informative:

mean(JFK_flights$arr_delay > 30, na.rm = TRUE)
[1] 0.1519908
mean(LGA_flights$arr_delay > 30, na.rm = TRUE)
[1] 0.1451256
mean(EWR_flights$arr_delay > 30, na.rm = TRUE)
[1] 0.172821
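
These summaries rely on the fact that R treats TRUE as 1 and FALSE as 0, so the mean of a logical vector is the proportion of TRUE values. A minimal sketch with made-up arrival delays:

```r
arr_delay <- c(45, 10, -5, 31, NA)  # invented values

arr_delay > 30  # TRUE FALSE FALSE TRUE NA

# Share of non-missing flights delayed by more than 30 minutes.
prop_late <- mean(arr_delay > 30, na.rm = TRUE)
prop_late
```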

Summarizing subsets of a dataset

mean(JFK_flights$distance, na.rm = TRUE)
[1] 1266.249
mean(LGA_flights$distance, na.rm = TRUE)
[1] 779.8357
mean(EWR_flights$distance, na.rm = TRUE)
[1] 1056.743

The tidyverse

The previous code provides a basic way to subset datasets. When doing a lot of descriptive/exploratory analysis, many people prefer using code from the tidyverse packages:

  • Reading and saving data: readr
  • Data manipulation: tidyr, dplyr
  • Data visualization: ggplot2
  • Working with different data structures: tibble, purrr, stringr, forcats

You can install them all using

install.packages("tidyverse")

(Remember, you only need to do this once!)

Tidy Data Principles

There are three rules required for data to be considered tidy:

  • Each variable must have its own column
  • Each observation must have its own row
  • Each value must have its own cell

Seems simple, but can sometimes be tricky. We will discuss transforming data in the future.

Name conflicts

Recall that packages are essentially ways for you to install and use functions written by others. Occasionally, functions from different packages share the same name, which creates a conflict. Whichever package you load most recently using library() will mask the older function, meaning that R will default to the newer version.

In general, this is fine, especially with the tidyverse. These package authors know when they have masked common functions in R, and typically we will prefer the tidyverse version.

The conflict message is there to make sure you know about conflicts. You can (and should) hide it in your R Markdown files by adding the chunk option message=FALSE or include=FALSE to the code chunk where you load packages.
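
When it matters which version is used, the `::` operator names the package explicitly. The sketch below is an illustration using only packages that ship with R: before dplyr is loaded, `filter` refers to stats::filter(), which applies a linear filter to a numeric series and is unrelated to subsetting rows.

```r
# stats::filter() computes a 3-point moving average here; it does not subset rows.
x <- 1:5
moving_avg <- stats::filter(x, rep(1 / 3, 3))
as.numeric(moving_avg)  # NA at both ends, averages in the middle

# After library(dplyr), plain filter() would mean dplyr::filter();
# writing stats::filter() or dplyr::filter() removes the ambiguity.
```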

Subsetting with filter()

The filter() function from the dplyr package provides a way to subset data. Its first argument is the data frame; the second is a logical condition written in terms of that data frame's columns, and filter() keeps the rows where the condition is TRUE.

library(dplyr)
JFK_flights <- filter(flights, origin == "JFK")
LGA_flights <- filter(flights, origin == "LGA")
EWR_flights <- filter(flights, origin == "EWR")

Compare with:

JFK_flights <- flights[flights$origin == "JFK", ]
LGA_flights <- flights[flights$origin == "LGA", ]
EWR_flights <- flights[flights$origin == "EWR", ]
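
One practical difference between the two approaches is how missing values are handled: filter() keeps only rows where the condition is TRUE, whereas bracket indexing returns an all-NA row wherever the condition is NA. A sketch on a made-up data frame, assuming dplyr is installed (`toy` is an illustrative name):

```r
library(dplyr)

toy <- data.frame(air_time = c(150, 60, NA, 200))  # invented values

with_filter  <- filter(toy, air_time > 120)
with_bracket <- toy[toy$air_time > 120, , drop = FALSE]

nrow(with_filter)   # 2: the row with a missing air_time is dropped
nrow(with_bracket)  # 3: includes a row of NA
```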

Descriptive statistics with summarize()

The summarize() function can similarly be used to compute summary statistics:

summarize(flights, mean_dep_delay = mean(dep_delay, na.rm = TRUE))
# A tibble: 1 × 1
  mean_dep_delay
           <dbl>
1           12.6

Descriptive statistics with summarize()

The summarize() function can handle more than one statistic at once:

summarize(flights,
          mean_dep_delay = mean(dep_delay, na.rm = TRUE),
          mean_arr_delay = mean(arr_delay, na.rm = TRUE))
# A tibble: 1 × 2
  mean_dep_delay mean_arr_delay
           <dbl>          <dbl>
1           12.6           6.90

Pipe notation

Pipes use the |> operator to take the output from a previous function call and “pipe” it through to the next function.

The object before the pipe is treated as the first argument to the function coming after the pipe.

JFK_flights <- flights |> filter(origin == "JFK")
LGA_flights <- flights |> filter(origin == "LGA")
EWR_flights <- flights |> filter(origin == "EWR")
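
The pipe is not specific to dplyr. As a small illustration with base functions only (the native |> operator requires R 4.1 or later), the two expressions below compute the same value:

```r
x <- c(1, 4, 9)

nested <- sum(sqrt(x))          # read inside-out
piped  <- x |> sqrt() |> sum()  # read left to right

identical(nested, piped)
```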

Piping filter() and summarize()

Pipes are useful if we want to combine multiple functions. To see how this can be useful, consider combining the filter() and summarize() functions:

flights |>
  filter(origin == "JFK") |>
  summarize(mean_dep_delay = mean(dep_delay, na.rm = TRUE))
# A tibble: 1 × 1
  mean_dep_delay
           <dbl>
1           12.1

Check your understanding:

What does the following code do?

flights |>
  filter(air_time > 120) |>
  summarize(mean_arr_delay = mean(arr_delay, na.rm = TRUE))
# A tibble: 1 × 1
  mean_arr_delay
           <dbl>
1           5.69

How could you do the same thing with base R?

long_flights <- flights[flights$air_time > 120, ]
mean(long_flights$arr_delay, na.rm = TRUE)
[1] 5.689161

Note that the class of the result is slightly different: summarize() returns a 1 × 1 tibble, whereas mean() returns a numeric vector of length one.

Downloading data from the internet

Saving R Output

You can save single R objects as .rds files using saveRDS(), multiple R objects as .RData or .rda files using save(), and your entire workspace as .RData using save.image().

Saving R Output

In general, you should use .RData for multiple objects, and generally should not use save.image().

save.image() should never be a part of your workflow as it is not generally reproducible.

Loading R Output

You can load .rds files using readRDS() and .RData and .rda files using load().

# load only object1 (readRDS() returns the object, so assign it)
object1 <- readRDS("object1_only.rds")
# load object1 and object2
load("both_objects.RData")
# load my entire workspace
load("entire_workspace.RData")
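
A round trip makes the difference between the two formats concrete. The sketch below writes to a temporary directory so it does not touch the working directory; the object and file names are illustrative. Note that readRDS() returns the object, so its result must be assigned, whereas load() restores objects under the names they were saved with.

```r
object1 <- c(a = 1, b = 2)
object2 <- "some text"

rds_path   <- file.path(tempdir(), "object1_only.rds")
rdata_path <- file.path(tempdir(), "both_objects.RData")

saveRDS(object1, rds_path)                 # one object per .rds file
save(object1, object2, file = rdata_path)  # several objects in one .RData file

restored <- readRDS(rds_path)  # assign the result to a name of your choice

rm(object1, object2)
loaded_names <- load(rdata_path)  # recreates object1 and object2
loaded_names
```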

Notes on Saving and Loading R Data

The values in quotes are all filepaths, and by default, R will search for these objects in your current working directory.

You can change where R searches for files by adjusting this filepath. For example, if you save your data in a Data subfolder within your working directory, you might try

load("./Data/my_data.RData")
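
To see where R will look, it can help to print the working directory and build the path explicitly. A minimal sketch; the Data folder and file name are the hypothetical ones from the example above, so file.exists() will return FALSE unless such a file is present.

```r
getwd()  # the current working directory

data_path <- file.path("Data", "my_data.RData")  # portable way to join path pieces
data_path

file.exists(data_path)  # check before calling load()
```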

Other types of data

Often, you will read and write files as comma separated values, or .csv. You can do this by navigating to File > Import Dataset in the menu bar, but generally I recommend doing it manually using the readr package. You will need to do so if loading data is part of your workflow, such as if it is required for an R Markdown writeup.

library(readr)
# read a .csv file in a "Data" subfolder (my_data is a placeholder name)
my_data <- read_csv("./Data/file.csv")
# save a data frame as a .csv file in a "Data" subfolder
write_csv(my_data, "./Data/data_output.csv")
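
The same read and write pattern can be tried end to end with the base R analogues read.csv() and write.csv(), which need no extra packages. This is a swapped-in base R sketch rather than the readr code above; it uses a made-up data frame and a temporary file.

```r
my_data <- data.frame(origin = c("JFK", "LGA"), dep_delay = c(2, -1))  # invented values

csv_path <- file.path(tempdir(), "data_output.csv")

# Writing takes the data first, then the destination path.
write.csv(my_data, csv_path, row.names = FALSE)

# Reading returns a data frame, so assign the result.
my_data_again <- read.csv(csv_path, stringsAsFactors = FALSE)
my_data_again
```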

Loading Tidy Tuesday data

Two approaches:

  1. Load directly from the internet using a link.
  2. Download to your computer and then load locally.

Discussion: Exploratory Data Analysis

  1. Better than EDA: start with a question, collect data, and explore your question
  2. EDA: Start with a dataset and use descriptive statistics/visualization to explore your dataset and develop research questions.

Discussion: What can/should we compute?

  1. If I’ve waited 20 minutes for my flight, how much longer should I expect to wait? (Is waiting for a plane memoryless?)
  2. Bonus: If my plane is delayed for a mechanical issue—is it still safe to get on the flight?
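
For the first question, one quantity to compute is the average remaining wait among flights already delayed by at least 20 minutes, compared with the overall average delay. As a reference point, the sketch below simulates delays from an exponential distribution, which is memoryless, so the two averages should nearly agree; real flight delays need not behave this way. The mean of 15 minutes, the seed, and the sample size are arbitrary assumptions.

```r
set.seed(167)
sim_delay <- rexp(1e5, rate = 1 / 15)  # simulated delays with mean 15 minutes

overall_mean <- mean(sim_delay)

# Remaining wait for simulated flights that have already been delayed 20 minutes.
waited <- sim_delay[sim_delay > 20]
remaining_mean <- mean(waited - 20)

c(overall = overall_mean, remaining = remaining_mean)
```

Repeating the same two computations on the positive values of flights$dep_delay would show whether observed delays depart from this memoryless benchmark.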

Another example

Yesterday’s TidyTuesday featured a very simple dataset on UNESCO World Heritage sites. You can find code to download it here.

This dataset has been used for the 1 Dataset, 100 Visualizations project — take a look and think about which of these visualizations are the most effective.

Activity

Using this dataset…

  1. What is one question you could answer with this dataset? What questions could you answer if you could incorporate other variables/datasets?
  2. What would you compute to answer your question?