MATH167R: Vectorized functions and lists

Peter Gao

Warm-up

Discuss the following lines of code. What do they do?

x <- c(3 > 4, T, 5 > 6)
y <- c(1, 0, 1)
rbind(x, y)

Answer:

x <- c(3 > 4, T, 5 > 6)
y <- c(1, 0, 1)
rbind(x, y)

  [,1] [,2] [,3]
x    0    1    0
y    1    0    1
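
The answer above prints 0s and 1s for x because a matrix can hold only one type of data. The following minimal sketch checks the types before and after combining; the name m is introduced here only for illustration.

```r
# Same vectors as in the warm-up
x <- c(3 > 4, TRUE, 5 > 6)
y <- c(1, 0, 1)

class(x)        # "logical"
class(y)        # "numeric"

# rbind() builds a matrix, which stores a single type,
# so the logical values in x are converted to numbers
m <- rbind(x, y)
class(m[1, ])   # "numeric"
```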

Overview of today

Vectorized functions
Lists
Data frames

Vectorized functions

Vector Arithmetic

Vectorization: applying a function repeatedly to every entry in a vector/array

Vectorization allows us to quickly carry out computations for every individual in a dataset.

x <- 1:5
y <- -1:-5
y

[1] -1 -2 -3 -4 -5

x + y

[1] 0 0 0 0 0

x * y

[1]  -1  -4  -9 -16 -25

Vector Arithmetic

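Note that R recycles: it repeats the elements of a shorter vector until it matches the length of the longer one. This is very useful when done on purpose, but it can easily lead to hard-to-catch bugs in your code! When the longer length is not a multiple of the shorter one, as with c(1, -1) * x, R still recycles but also issues a warning.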

2 * x

[1]  2  4  6  8 10

c(1, -1) * x

[1]  1 -2  3 -4  5

c(1, -1) + x

[1] 2 1 4 3 6
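
One way to see when recycling is "clean" is to capture the warning R issues. This is only a sketch: tryCatch() and conditionMessage() are base R functions that have not been introduced in these slides, and the name result is made up for the example.

```r
x <- 1:5

# Length 5 is not a multiple of length 2, so R recycles and warns.
# Here the warning is caught and its message is stored instead.
result <- tryCatch(c(1, -1) * x,
                   warning = function(w) conditionMessage(w))
result

# Length 6 is a multiple of 2, so recycling happens silently
c(1, -1) * 1:6
```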

Vector Arithmetic

We can apply many functions component-wise to vectors, including comparison operators.

x >= 3

[1] FALSE FALSE  TRUE  TRUE  TRUE

y < -2

[1] FALSE FALSE  TRUE  TRUE  TRUE

(x >= 3) & (y < -2)

[1] FALSE FALSE  TRUE  TRUE  TRUE

x == c(1, 3, 2, 4, 5)

[1]  TRUE FALSE FALSE  TRUE  TRUE

Boolean Vectors

In code, entries that are TRUE or FALSE are called booleans (logicals in R). These are incredibly important because they let us express conditions, for example to select or modify only the entries that satisfy a test. What will the following code do?

x[x > 3] <- 3
x

[1] 1 2 3 3 3
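
As a point of comparison, the same capping can be written with the base R function pmin(), which takes an element-wise minimum. This is a sketch of an alternative, not something used elsewhere in these slides; the name capped is illustrative.

```r
x <- 1:5

# Logical indexing: overwrite only the entries where the condition is TRUE
capped <- x
capped[capped > 3] <- 3
capped

# pmin() compares each entry of x with 3 and keeps the smaller value
pmin(x, 3)
```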

Boolean Vectors

We can also do basic arithmetic with booleans. TRUE is encoded as 1 and FALSE is encoded as 0.

# First reset x
x <- 1:5
sum(x >= 3)

[1] 3

mean(x >= 3)

[1] 0.6

What is this last quantity telling us?

By taking the mean, we are looking at the proportion of our vector that is TRUE.
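
Here is a small sketch of the same idea with made-up data. The vector scores and the cutoff of 70 are hypothetical and chosen only to illustrate counting and proportions.

```r
# Hypothetical exam scores
scores <- c(72, 85, 90, 64, 88)

# Logical vector: TRUE where the score is at least 70
passed <- scores >= 70

sum(passed)    # number of TRUE entries: 4
mean(passed)   # proportion of TRUE entries: 0.8
```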

Complicated indexing

We can also get more complicated with our indexing.

# Return the second and third elements of y
y[c(2, 3)]

[1] -2 -3

# Return the values of x greater than or equal to 3
x[x >= 3]

[1] 3 4 5

Complicated indexing

# Values of x that match the index of the values of y that are less than -2
x[y < -2]

[1] 3 4 5

# which() returns the indices of the entries that are TRUE
which(y < -2)

[1] 3 4 5
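
The two approaches above are closely related: indexing with the logical vector and indexing with the positions from which() select the same entries. A minimal check, where idx is a name introduced just for this sketch:

```r
x <- 1:5
y <- -1:-5

# Positions where the condition holds
idx <- which(y < -2)
idx

# Indexing by position or by the logical vector gives the same result
identical(x[idx], x[y < -2])   # TRUE
```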

Complicated indexing

We can compare entire vectors using identical(), which returns a single TRUE or FALSE rather than one value per entry.

identical(x, -rev(y))

[1] FALSE

What do you think the function rev() is doing in the code above?

Hint: Use ?rev to read the help files for the function
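
It is worth knowing that identical() is stricter than ==: it also compares the storage type. The sketch below uses two hypothetical vectors, a and b, that print the same but are stored differently.

```r
a <- 1:5                  # stored as integers
b <- c(1, 2, 3, 4, 5)     # stored as doubles

a == b                        # compares entry by entry
all(a == b)                   # TRUE: every pair of entries is equal
identical(a, b)               # FALSE: the types differ
identical(a, as.integer(b))   # TRUE once the types match
```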

Lists

Lists, like vectors and matrices, are a class of objects in R. Lists are special because they can store multiple different types of data.

my_list <- list("some_numbers" = 1:5,
                "some_characters" = c("a", "b", "c"),
                "a_matrix" = diag(2))
my_list

$some_numbers
[1] 1 2 3 4 5

$some_characters
[1] "a" "b" "c"

$a_matrix
     [,1] [,2]
[1,]    1    0
[2,]    0    1

Make sure to name items within a list using the = operator (the one used for function arguments), not the assignment arrow <-.
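
To see why this matters, the sketch below builds one list with = and one with <-. The names good, bad, and other_numbers are made up for the illustration.

```r
# With =, the item is stored under the given name
good <- list(some_numbers = 1:3)
names(good)               # "some_numbers"

# With <-, R creates a separate variable called other_numbers
# and stores its value in the list WITHOUT a name
bad <- list(other_numbers <- 1:3)
names(bad)                # NULL
exists("other_numbers")   # TRUE
```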

Accessing List Elements

There are three ways to access an item within a list:

double brackets [[]] with its name in quotes
double brackets [[]] with its index as a number
dollar sign $ followed by its name without quotes

Accessing List Elements

my_list[["some_numbers"]]

[1] 1 2 3 4 5

my_list[[1]]

[1] 1 2 3 4 5

my_list$some_numbers

[1] 1 2 3 4 5

Why double brackets?

If you use a single bracket to index, like we do with matrices and vectors, R returns a list with a single element rather than the element itself.

my_list[1]

$some_numbers
[1] 1 2 3 4 5

my_list[[1]]

[1] 1 2 3 4 5

Note that this means double brackets and the dollar sign can each return only a single item from a list; to return several items at once, use single brackets. (Why?)
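
A quick way to tell the two kinds of brackets apart is to check the class of what comes back. This sketch rebuilds a shortened version of my_list so that it runs on its own.

```r
my_list <- list(some_numbers = 1:5,
                some_characters = c("a", "b", "c"))

class(my_list[1])      # "list": still a list, with one item inside
class(my_list[[1]])    # "integer": the item itself

# Single brackets can keep several items, and the result is again a list
length(my_list[1:2])   # 2
```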

Why double brackets?

This is a subtle but important difference!

my_list[1] + 1

Error in my_list[1] + 1: non-numeric argument to binary operator

my_list[[1]] + 1

[1] 2 3 4 5 6

Subsetting a list

You can subset a list similarly to vectors and matrices using single brackets.

my_list[1:2]

$some_numbers
[1] 1 2 3 4 5

$some_characters
[1] "a" "b" "c"

my_list[-2]

$some_numbers
[1] 1 2 3 4 5

$a_matrix
     [,1] [,2]
[1,]    1    0
[2,]    0    1

Adding to a list

We can use the same tools we used to access list elements to add to a list. However, if we use double brackets, we must put the new name in quotes; otherwise R treats it as the name of a variable and searches for an object that does not yet exist.

Adding to a list

my_list$a_boolean <- FALSE
my_list[["a_list"]] <- list("recursive" = TRUE)
my_list

$some_numbers
[1] 1 2 3 4 5

$some_characters
[1] "a" "b" "c"

$a_matrix
     [,1] [,2]
[1,]    1    0
[2,]    0    1

$a_boolean
[1] FALSE

$a_list
$a_list$recursive
[1] TRUE
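
The point about quotes can be illustrated with a short sketch. Inside double brackets, a quoted name is used directly, while an unquoted name is looked up as a variable. The names quoted_name, key, and from_variable are hypothetical.

```r
my_list <- list(some_numbers = 1:5)

# Quoted: the string itself becomes the item name
my_list[["quoted_name"]] <- "added with a quoted name"

# Unquoted: R looks up the variable key and uses the string it contains
key <- "from_variable"
my_list[[key]] <- "added using the string stored in key"

names(my_list)
```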

Names of List Items

Call names() to get a vector of list item names.

names(my_list)

[1] "some_numbers"    "some_characters" "a_matrix"        "a_boolean"      
[5] "a_list"

Why bother?

Lists give us key-value pairs, also known as dictionaries or associative arrays
This means we can look up items in a list by name, rather than location
For example, if a function returns its results as a named list, we can pull out the item we need by name, regardless of how the list was created or what else it contains
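
As a minimal sketch of looking up items by name, suppose a function returns several results in a named list. The function summarize_vector() and its item names are invented for this example.

```r
# Hypothetical helper that returns several results in one named list
summarize_vector <- function(v) {
  list(n = length(v),
       average = mean(v),
       largest = max(v))
}

out <- summarize_vector(c(2, 4, 9))

# Look up results by name, without knowing their position in the list
out$average        # 5
out[["largest"]]   # 9
```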

Data Frames

Data frames

A data frame in R is essentially a special type of list, where each item is a vector (a column) and all columns have the same length. Typically, we say that data has $n$ rows (one for each observation) and $p$ columns (one for each variable).

Unlike a matrix, columns can have different types. However, many column functions still apply, such as summary(), or colSums() when every column is numeric.
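
The claim that a data frame is a list of columns can be checked directly. The data frame df below is a made-up example; stringsAsFactors = FALSE is set explicitly so the character column stays a character vector in older versions of R as well.

```r
df <- data.frame(id = 1:3,
                 name = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

is.list(df)         # TRUE: a data frame is a list of columns
length(df)          # 2: one list item per column
names(df)           # column names, just like list item names
sapply(df, class)   # each column can have its own type
dim(df)             # 3 2: rows, then columns
```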

Example data frames in R

There are plenty of free datasets available through R and its packages. If you haven’t already, run install.packages("palmerpenguins") in your console. Then, we can load the penguins dataset.

# load palmer penguins package
library(palmerpenguins)

# open penguins data as a data frame
data(penguins)
penguins <- as.data.frame(penguins)

Penguins data

We can use the head function to look at the first several rows:

head(penguins)

  species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1  Adelie Torgersen           39.1          18.7               181        3750
2  Adelie Torgersen           39.5          17.4               186        3800
3  Adelie Torgersen           40.3          18.0               195        3250
4  Adelie Torgersen             NA            NA                NA          NA
5  Adelie Torgersen           36.7          19.3               193        3450
6  Adelie Torgersen           39.3          20.6               190        3650
     sex year
1   male 2007
2 female 2007
3 female 2007
4   <NA> 2007
5 female 2007
6   male 2007

How many columns are in this dataset?
How many rows are in this dataset?

Penguins data

Using the $ operator, we can access individual columns.

head(penguins$bill_length_mm)

[1] 39.1 39.5 40.3   NA 36.7 39.3

We can then use any of our useful functions for vectors to summarize this column (e.g., max(), min(), mean(), median(), sum(), sd(), var(), length()).

Penguins data

mean(penguins$bill_length_mm)

[1] NA

mean() returns NA because the column contains missing values. To drop them before computing, we use the argument na.rm = T (T is shorthand for TRUE).

mean(penguins$bill_length_mm, na.rm = T)

[1] 43.92193
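
Before dropping missing values, it can help to count them. The sketch below uses only the six bill lengths printed by head() earlier, stored in a stand-in vector called bill, so it runs without loading the package.

```r
# First six bill lengths from the head() output (one is missing)
bill <- c(39.1, 39.5, 40.3, NA, 36.7, 39.3)

sum(is.na(bill))            # 1: number of missing values
mean(bill)                  # NA, because of the missing value
mean(bill, na.rm = TRUE)    # mean of the five observed values

# Equivalent: remove the missing entries by logical indexing first
mean(bill[!is.na(bill)])
```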

Creating a data frame

An easy way to create a data frame is to use the function data.frame().

Like lists, make sure you define the names using = and not <-!

my_data <- data.frame("var1" = 1:3,
                      "var2" = c("a", "b", "c"),
                      "var3" = c(TRUE, FALSE, TRUE))
my_data

  var1 var2  var3
1    1    a  TRUE
2    2    b FALSE
3    3    c  TRUE

Creating a data frame

If you import or create numeric data as a matrix, you can also convert it easily using as.data.frame()

my_matrix <- matrix(1:9, nrow = 3, ncol = 3)
my_matrix

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

as.data.frame(my_matrix)

  V1 V2 V3
1  1  4  7
2  2  5  8
3  3  6  9
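
When the matrix has no column names, R supplies default ones, and it is usually worth replacing them with something descriptive. A minimal sketch, where my_df and the new names a, b, c are chosen only for illustration:

```r
my_matrix <- matrix(1:9, nrow = 3, ncol = 3)
my_df <- as.data.frame(my_matrix)

names(my_df)   # default names: "V1" "V2" "V3"

# Replace the defaults with more descriptive names
names(my_df) <- c("a", "b", "c")
my_df$b        # second column: 4 5 6
```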

Subsetting data frames

We can subset data frames using most of the tools we’ve learned about subsetting so far. We can use column names (keys) or indices.

my_data$var1

[1] 1 2 3

my_data["var1"]

  var1
1    1
2    2
3    3

my_data[["var1"]]

[1] 1 2 3

Subsetting data frames

my_data[1]

  var1
1    1
2    2
3    3

my_data[[1]]

[1] 1 2 3

my_data[, 1]

[1] 1 2 3

my_data[1, ]

  var1 var2 var3
1    1    a TRUE
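
As with lists, the different subsetting forms return different kinds of objects. The sketch below rebuilds my_data so it runs on its own and checks the class of each result; drop = FALSE is a standard argument of the bracket operator that keeps the result as a data frame.

```r
my_data <- data.frame(var1 = 1:3,
                      var2 = c("a", "b", "c"),
                      var3 = c(TRUE, FALSE, TRUE),
                      stringsAsFactors = FALSE)

class(my_data["var1"])              # "data.frame": a one-column data frame
class(my_data[["var1"]])            # "integer": the column as a vector
class(my_data[, 1])                 # "integer": also a vector
class(my_data[, 1, drop = FALSE])   # "data.frame": the column is not simplified

# Logical indexing works on rows too: keep rows where var3 is TRUE
my_data[my_data$var3, ]
```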

Adding to a data frame

We can add to a data frame using rbind() and cbind(), but be careful with type mismatches! We can also add columns using the column index methods.

# Each line below produces the same var4 column:
# cbind() adds it, and the other three then overwrite it with identical values
my_data <- cbind(my_data, "var4" = c(3, 2, 1))
my_data$var4 <- c(3, 2, 1)
my_data[, "var4"] <- c(3, 2, 1)
my_data[["var4"]] <- c(3, 2, 1)
my_data

  var1 var2  var3 var4
1    1    a  TRUE    3
2    2    b FALSE    2
3    3    c  TRUE    1

Adding to a data frame

rbind(my_data, c(1, 2, 3, 4))

  var1 var2 var3 var4
1    1    a    1    3
2    2    b    0    2
3    3    c    1    1
4    1    2    3    4

rbind(my_data, list(4, "d", FALSE, 0))

  var1 var2  var3 var4
1    1    a  TRUE    3
2    2    b FALSE    2
3    3    c  TRUE    1
4    4    d FALSE    0

Investigating a data frame

We can use str() to see the structure of a data frame (or any other object!)

my_data2 <- rbind(my_data, c(1, 2, 3, 4))
str(my_data2)

'data.frame':   4 obs. of  4 variables:
 $ var1: num  1 2 3 1
 $ var2: chr  "a" "b" "c" "2"
 $ var3: num  1 0 1 3
 $ var4: num  3 2 1 4

my_data2 <- rbind(my_data, list(4, "d", FALSE, 0))
str(my_data2)

'data.frame':   4 obs. of  4 variables:
 $ var1: num  1 2 3 4
 $ var2: chr  "a" "b" "c" "d"
 $ var3: logi  TRUE FALSE TRUE FALSE
 $ var4: num  3 2 1 0
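
A compact alternative to reading the full str() output is to ask for the class of each column. The sketch below repeats the comparison on a smaller, made-up data frame df; the names with_vector and with_list are illustrative.

```r
df <- data.frame(id = 1:3,
                 flag = c(TRUE, FALSE, TRUE))

# Adding a row as a plain numeric vector changes the logical column to numbers
with_vector <- rbind(df, c(4, 0))

# Adding a row as a list keeps each value's own type
with_list <- rbind(df, list(4L, FALSE))

sapply(with_vector, class)
sapply(with_list, class)
```

Checking column classes right after rbind() is a cheap way to catch an accidental type change before it affects later computations.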

Investigating a data frame

Most data frames will have column names describing the variables. They can also include row names, which we can set using rownames().

rownames(my_data2) <- c("Obs1", "Obs2", "Obs3", "Obs4")
my_data2

     var1 var2  var3 var4
Obs1    1    a  TRUE    3
Obs2    2    b FALSE    2
Obs3    3    c  TRUE    1
Obs4    4    d FALSE    0
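
Once row names are set, they can be used for indexing in the same way as column names. A minimal sketch on a small stand-in data frame, df, with hypothetical values:

```r
df <- data.frame(var1 = 1:2,
                 var4 = c(3, 2))
rownames(df) <- c("Obs1", "Obs2")

# Select a whole row by its name
df["Obs2", ]

# Select a single value by row name and column name
df["Obs2", "var4"]   # 2
```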