Discuss the following lines of code. What do they do?
Vectorization: applying a function repeatedly to every entry in a vector/array
Vectorization allows us to quickly carry out computations for every individual in a dataset.
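As a small sketch of vectorization (the vector heights_cm and its values are made up for illustration), a single expression operates on every entry at once, with no explicit loop:

```r
# Hypothetical heights in centimeters, one per individual
heights_cm <- c(150, 162, 175, 181)

# Arithmetic is applied to every entry at once
heights_m <- heights_cm / 100
heights_m

# Many built-in functions are vectorized in the same way
rounded_roots <- round(sqrt(heights_cm), 1)
rounded_roots
```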
Note that R recycles, repeating elements of shorter vectors to match longer vectors. This is incredibly useful when done on purpose, but can also easily lead to hard-to-catch bugs in your code!
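To make recycling concrete, here is a minimal sketch with made-up numbers. The last case shows the kind of situation that can hide a bug: R still returns a result even when the lengths do not line up.

```r
# Lengths match: ordinary element-wise addition
c(1, 2, 3, 4) + c(10, 20, 30, 40)

# The length-2 vector is recycled to length 4 (10 20 10 20)
recycled <- c(1, 2, 3, 4) + c(10, 20)
recycled   # 11 22 13 24

# Length 3 does not divide length 4: R still recycles, but normally
# warns that the longer length is not a multiple of the shorter one.
# The warning is suppressed here only to keep the sketch quiet.
partial <- suppressWarnings(c(1, 2, 3, 4) + c(10, 20, 30))
partial    # 11 22 33 14
```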
We can apply many functions component-wise to vectors, including comparison operators.
In code, entries that are TRUE or FALSE are called booleans (logicals in R). These are incredibly important, because they can be used to give your computer conditions. What will the following code do?
[1] 1 2 3 3 3
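The code that produced the output above is not shown here. One snippet that gives the same result, offered as an assumption rather than the original code, uses a comparison as a condition for which entries to replace:

```r
x <- 1:5

# A comparison is applied to every entry and returns a logical vector
x > 3        # FALSE FALSE FALSE TRUE TRUE

# Use that logical vector as a condition: replace entries above 3 with 3
x[x > 3] <- 3
x            # 1 2 3 3 3
```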
We can also do basic arithmetic with booleans. TRUE is encoded as 1 and FALSE is encoded as 0.
[1] 3
[1] 0.6
What is this last quantity telling us?
By taking the mean, we are looking at the proportion of our vector that is TRUE.
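A short sketch of counting and proportions with logicals. The vector x is a made-up example, chosen so the results match the 3 and 0.6 shown above; the original code may have differed.

```r
x <- c(1, 2, 3, 4, 5)   # hypothetical data

is_big <- x >= 3        # FALSE FALSE TRUE TRUE TRUE

sum(is_big)             # number of TRUE entries: 3
mean(is_big)            # proportion of TRUE entries: 0.6
```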
We can also get more complicated with our indexing.
We can compare entire vectors using identical().
What do you think the function rev() is doing in the code above?
Hint: Use ?rev to read the help files for the function.
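As an illustration of the difference between == and identical() (the vectors a, b, and d are hypothetical, not the ones from the original code), consider:

```r
a <- c(1, 2, 3)
b <- c(1, 2, 3)
d <- c(3, 2, 1)

# == compares entry by entry and returns one logical per entry
a == b                  # TRUE TRUE TRUE

# identical() returns a single TRUE or FALSE for the whole object
identical(a, b)         # TRUE
identical(a, d)         # FALSE
identical(a, rev(d))    # TRUE

# identical() is strict about type: 1:3 is integer, a is double
identical(a, 1:3)       # FALSE
```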
Lists, like vectors and matrices, are a class of objects in R. Lists are special because they can store multiple different types of data.
my_list <- list("some_numbers" = 1:5,
                "some_characters" = c("a", "b", "c"),
                "a_matrix" = diag(2))
my_list
$some_numbers
[1] 1 2 3 4 5
$some_characters
[1] "a" "b" "c"
$a_matrix
     [,1] [,2]
[1,]    1    0
[2,]    0    1
Make sure to store items within a list using the = operator for assigning arguments, not the assignment arrow <-.
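A minimal sketch of why this matters (the names used are hypothetical): with <- the item is stored without a name, and a variable is created as a side effect.

```r
# Using = inside list() names the item
named <- list(some_numbers = 1:3)
names(named)              # "some_numbers"

# Using <- inside list() stores the value without a name...
unnamed <- list(other_numbers <- 1:3)
names(unnamed)            # NULL

# ...and also creates a variable called other_numbers as a side effect
exists("other_numbers")   # TRUE
```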
There are three ways to access an item within a list:
[[]] with its name in quotes
[[]] with its index as a number
$ followed by its name without quotes
If you use a single bracket to index, like we do with matrices and vectors, you will return a list with a single element.
Note that this means double brackets and the dollar sign can only ever return a single item from a list! (Why?)
This is a subtle but important difference!
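The three access methods, and the single-bracket contrast, can be sketched with the my_list example from earlier (redefined here so the snippet runs on its own):

```r
my_list <- list("some_numbers" = 1:5,
                "some_characters" = c("a", "b", "c"),
                "a_matrix" = diag(2))

# Three equivalent ways to pull out the item itself
my_list[["some_numbers"]]
my_list[[1]]
my_list$some_numbers

# Single brackets keep the list wrapper around the item
wrapped <- my_list["some_numbers"]
class(wrapped)                      # "list"
class(my_list[["some_numbers"]])    # "integer"
```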
You can subset a list similarly to vectors and matrices using single brackets.
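For instance, single brackets accept the same kinds of indices as vectors. This sketch reuses the my_list example, redefined so it runs on its own:

```r
my_list <- list("some_numbers" = 1:5,
                "some_characters" = c("a", "b", "c"),
                "a_matrix" = diag(2))

# Subset by position, by name, or by exclusion; each result is a list
first_two <- my_list[1:2]
by_name   <- my_list[c("some_numbers", "a_matrix")]
drop_one  <- my_list[-1]

names(first_two)
names(by_name)
names(drop_one)
```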
We can use the same tools we used to access list elements to add to a list. However, if we use double brackets, we must put the new name in quotes; otherwise R will look for a variable with that name, which does not yet exist.
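A small sketch of adding items (the item names new_logical, new_text, and more_text are hypothetical). The last step deliberately omits the quotes and catches the resulting error so the snippet still runs to the end:

```r
my_list <- list("some_numbers" = 1:5)

# Add an item with the dollar sign (no quotes needed)
my_list$new_logical <- c(TRUE, FALSE)

# Add an item with double brackets (quotes required)
my_list[["new_text"]] <- "hello"

names(my_list)

# Without quotes, R looks for a variable called more_text and fails
attempt <- tryCatch({
  my_list[[more_text]] <- "oops"
  "no error"
}, error = function(e) conditionMessage(e))
attempt
```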
Call names() to get a vector of list item names.
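As a brief sketch, names() can be combined with %in% to check whether an item with a given name is present. The list is the earlier my_list example, and "output" is just an example name to look for:

```r
my_list <- list("some_numbers" = 1:5,
                "some_characters" = c("a", "b", "c"),
                "a_matrix" = diag(2))

item_names <- names(my_list)
item_names

# Check whether an item with a given name exists
"a_matrix" %in% item_names   # TRUE
"output" %in% item_names     # FALSE
```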
If we know that an item named output is stored within a list, we can always search for it, regardless of how the list was created or what else it contains.
A data frame in R is essentially a special type of list, where each item is a vector of equal length. Typically, we say that data has \(n\) rows (one for each observation) and \(p\) columns (one for each variable).
Unlike a matrix, columns can have different types. However, many column functions still apply (such as colSums(), summary(), etc.)!
There are plenty of free datasets available through R and its packages. If you haven’t already, run install.packages("palmerpenguins") in your console. Then, we can load the penguins dataset.
We can use the head() function to look at the first several rows:
  species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1  Adelie Torgersen           39.1          18.7               181        3750
2  Adelie Torgersen           39.5          17.4               186        3800
3  Adelie Torgersen           40.3          18.0               195        3250
4  Adelie Torgersen             NA            NA                NA          NA
5  Adelie Torgersen           36.7          19.3               193        3450
6  Adelie Torgersen           39.3          20.6               190        3650
     sex year
1   male 2007
2 female 2007
3 female 2007
4   <NA> 2007
5 female 2007
6   male 2007
Using the $ operator, we can access individual columns.
We can then use any of our useful functions for vectors to summarize this column (e.g. max(), min(), mean(), median(), sum(), sd(), var(), length()).
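A sketch of pulling out and summarizing one column, assuming the palmerpenguins package is installed (the code is skipped otherwise). Because the head() output above shows missing values (NA) in this column, most summaries need na.rm = TRUE to return a number:

```r
# Only run if the palmerpenguins package is available
if (requireNamespace("palmerpenguins", quietly = TRUE)) {
  penguins <- palmerpenguins::penguins

  # Access a single column as a vector
  bill_length <- penguins$bill_length_mm

  print(length(bill_length))                # number of entries, NAs included
  print(mean(bill_length))                  # NA, because of missing values
  print(mean(bill_length, na.rm = TRUE))    # mean of the observed values
  print(max(bill_length, na.rm = TRUE))     # largest observed value
}
```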
An easy way to create a data frame is to use the function data.frame().
Like lists, make sure you define the names using = and not <-!
If you import or create numeric data as a matrix, you can also convert it easily using as.data.frame().
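A minimal sketch of both approaches. The column names and values are illustrative, loosely modeled on the str() output shown later:

```r
# Build a data frame directly; each argument becomes a named column
df <- data.frame(var1 = c(1, 2, 3),
                 var2 = c("a", "b", "c"),
                 var3 = c(TRUE, FALSE, TRUE))
df

# Convert a numeric matrix to a data frame
m <- matrix(1:6, nrow = 3, ncol = 2)
m_df <- as.data.frame(m)
names(m_df)   # default column names: "V1" "V2"
```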
We can subset data frames using most of the tools we’ve learned about subsetting so far. We can use keys or indices.
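For example, using a small made-up data frame, the list-style and matrix-style tools both work:

```r
df <- data.frame(var1 = c(1, 2, 3),
                 var2 = c("a", "b", "c"),
                 var3 = c(TRUE, FALSE, TRUE))

# List-style access by key: each returns the column as a vector
df$var2
df[["var2"]]

# Matrix-style access with [row, column]
df[2, ]                        # second row, all columns
df[, "var2"]                   # one column, as a vector
df[1:2, c("var1", "var3")]     # rows 1-2 of two columns

# Single brackets with only a name return a one-column data frame
one_col <- df["var2"]

# Logical conditions select rows
big_rows <- df[df$var1 > 1, ]
big_rows
```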
We can add to a data frame using rbind() and cbind(), but be careful with type mismatches! We can also add columns using the column index methods.
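The following sketch uses made-up data and hypothetical row values to show each way of adding to a data frame, and what a type mismatch can do: appending a row whose var3 is numeric silently converts the whole logical column to numbers.

```r
df <- data.frame(var1 = c(1, 2, 3),
                 var2 = c("a", "b", "c"),
                 var3 = c(TRUE, FALSE, TRUE))

# Add a column with the dollar sign
df$var4 <- c(3, 2, 1)

# Add a column with cbind()
df_wide <- cbind(df, var5 = c(10, 20, 30))

# Add a row whose types match the existing columns
good_row <- data.frame(var1 = 4, var2 = "d", var3 = FALSE, var4 = 0)
df_good <- rbind(df, good_row)
class(df_good$var3)   # "logical"

# Add a row with a number where a logical was expected
bad_row <- data.frame(var1 = 1, var2 = "2", var3 = 3, var4 = 4)
df_bad <- rbind(df, bad_row)
class(df_bad$var3)    # "numeric": TRUE/FALSE became 1/0
```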
We can use str() to see the structure of a data frame (or any other object!).
'data.frame': 4 obs. of 4 variables:
$ var1: num 1 2 3 1
$ var2: chr "a" "b" "c" "2"
$ var3: num 1 0 1 3
$ var4: num 3 2 1 4
'data.frame': 4 obs. of 4 variables:
$ var1: num 1 2 3 4
$ var2: chr "a" "b" "c" "d"
$ var3: logi TRUE FALSE TRUE FALSE
$ var4: num 3 2 1 0
Most data frames will have column names describing the variables. They can also include row names, which we can add using rownames().
     var1 var2  var3 var4
Obs1    1    a  TRUE    3
Obs2    2    b FALSE    2
Obs3    3    c  TRUE    1
Obs4    4    d FALSE    0
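Output like the table above could be produced along these lines. The data frame is reconstructed from the printed values, so treat the exact construction as an assumption:

```r
df <- data.frame(var1 = c(1, 2, 3, 4),
                 var2 = c("a", "b", "c", "d"),
                 var3 = c(TRUE, FALSE, TRUE, FALSE),
                 var4 = c(3, 2, 1, 0))

# Assign one row name per observation
rownames(df) <- c("Obs1", "Obs2", "Obs3", "Obs4")
df

# Row names can then be used as keys when subsetting
obs2 <- df["Obs2", ]
obs2
```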