We want to know the relationship between our covariates and our response, we can do this with a method called regression. Regression provides us with a statistical method to conduct inference and prediction.
What is the difference between inference and prediction?
Recall: Regression
inference: assess the relationship between our variables, our statistical model as a whole, predictor importance
What is the relationship between sleep and GPA?
Is parents’ education or parents’ income more important for explaining income?
prediction: predict new/future outcomes from new/future covariates
Can we predict test scores based on hours spent studying?
Linear Regression
Given our response \(Y\) and our predictors \(X_1, \ldots, X_p\), a linear regression model takes the form:
Previously, we focused on using this model for inference (e.g. \(H_0: \beta_1 = 0\)). Today, we are shifting our focus to prediction.
Statistical Prediction
Often, we use a linear model even when we know that the assumptions for inference are broken: for example, when errors are not normally distributed.
Though the model may be wrong, it may still be useful for learning about the relationship between response and covariates.
In addition, we can use a working model for prediction.
Training and testing data
When training predictive algorithms, it is common to split your data into two categories:
Training data: data used to train/fit our model
Test data: data used to evaluate/test the performance of our model
Suppose we have training data \(X_{i1}, \ \ldots,\ X_{ip}, Y_i, \ i = 1,\ldots,n\) which we use to estimate regression coefficents \(\widehat\beta_0, \ \widehat\beta_1, \ldots \widehat\beta_p\) (Recall: we use \(\widehat{}\) “hats” to indicate estimates).
Making predictions
Now suppose we are given new testing data \(X^*_1,\ \ldots, X^*_{p}\) and asked to predict the associated \(Y^∗\). Using our estimated linear model, the prediction is: \[\widehat{Y}^* = \widehat\beta_0 + \widehat{\beta}_1 X^*_1 + \cdots + \widehat{\beta}_p X^*_p\]
We define the mean squared error, or mean squared prediction error, as
\[\mathbb{E}[(Y^* -\widehat{Y}^*)^2].\]
In other words, the test error is defined as the expected squared difference between a new prediction and the truth, where the expectation is taken over the population distribution of test data.
Test error
Mean squared prediction error:\(\mathbb{E}[(Y^* -\widehat{Y}^*)^2],\) the expected squared difference between a new prediction and the truth.
The MSPE quantifies the predictive power of our model, but we do not actually have access to the full population distribution.
As such, we must estimate the MSPE using our data in order to use it to assess our model.
In particular, we could use the MSPE for:
predictive assessment: understanding the magnitude of errors we should expect when making future predictions
model/method selection: choosing among multiple candidate models.
Training Error
A natural estimator for the test error might be the observed average training error:
In general, this will be too optimistic. Why? We obtained parameter estimates \(\widehat{\beta}_0,\ \ldots,\ \widehat{\beta}_p\) based on our training data.
Training Error
More specifically (with a linear regression model), we chose parameters to minimize the squared errors (hence, least squares).
We cannot expect this error rate to generalize to test data.
Critically, the more complex our model, the more optimistic our training data will be as an estimate for our test data
# Fit linear model and calculate training errorlm_fit <-lm(y ~ x, data = df_train)yhat_train <-predict(lm_fit)# (y_i - yhat_i)^2train_err <-mean((df_train$y - yhat_train)^2)# Calculate testing erroryhat_test <-predict(lm_fit, data.frame(x = df_test$x))test_err <-mean((df_test$y - yhat_test)^2)# Add linear model and error text to plotg_train2 <- g_train +labs(subtitle =paste("Training error:", round(train_err, 3))) +geom_line(aes(y =fitted(lm_fit)), col ="red", lwd =1.5)g_test2 <- g_test +labs(subtitle =paste("Test error:", round(test_err, 3))) +geom_line(aes(y =fitted(lm_fit), x = df_train$x), col ="red", lwd =1.5)ggarrange(g_train2, g_test2)
Example: Training Error vs Test Error for linear model
Example: Degree 5 polynomial
# Fit 5 degree polynomial and calculate training errorlm_fit_5 <-lm(y ~poly(x, 5), data = df_train)yhat_train_5 <-predict(lm_fit_5)train_err_5 <-mean((df_train$y - yhat_train_5)^2)# Calculate testing erroryhat_test_5 <-predict(lm_fit_5, data.frame(x = df_test$x))test_err_5 <-mean((df_test$y - yhat_test_5)^2)# Create smooth line for plottingx_fit <-data.frame(x =seq(-3, 3, length =100))line_fit <-data.frame(x = x_fit, y =predict(lm_fit_5, newdata = x_fit))# Add linear model and error text to plotg_train3 <- g_train +labs(subtitle =paste("Training error:", round(train_err_5, 3))) +geom_line(data = line_fit, aes(y = y, x = x), col ="red", lwd =1.5)g_test3 <- g_test +labs(subtitle =paste("Test error:", round(test_err_5, 3))) +geom_line(data = line_fit, aes(y = y, x = x), col ="red", lwd =1.5)ggarrange(g_train3, g_test3)
Example: Degree 5 polynomial
Example: Degree 10 polynomial
# Fit 10 degree polynomial and calculate training errorlm_fit_10 <-lm(y ~poly(x, 10), data = df_train)yhat_train_10 <-predict(lm_fit_10)train_err_10 <-mean((df_train$y - yhat_train_10)^2)# Calculate testing erroryhat_test_10 <-predict(lm_fit_10, data.frame(x = df_test$x))test_err_10 <-mean((df_test$y - yhat_test_10)^2)# Create smooth line for plottingx_fit <-data.frame(x =seq(-3, 3, length =100))line_fit <-data.frame(x = x_fit, y =predict(lm_fit_10, newdata = x_fit))# Add linear model and error text to plotg_train4 <- g_train +labs(subtitle =paste("Training error:", round(train_err_10, 3))) +geom_line(data = line_fit, aes(y = y, x = x), col ="red", lwd =1.5)g_test4 <- g_test +labs(subtitle =paste("Test error:", round(test_err_10, 3))) +geom_line(data = line_fit, aes(y = y, x = x), col ="red", lwd =1.5)ggarrange(g_train4, g_test4)
Example: Degree 10 polynomial
Example: Degree 15 polynomial
# Fit 15 degree polynomial and calculate training errorlm_fit_15 <-lm(y ~poly(x, 15), data = df_train)yhat_train_15 <-predict(lm_fit_15)train_err_15 <-mean((df_train$y - yhat_train_15)^2)# Calculate testing erroryhat_test_15 <-predict(lm_fit_15, data.frame(x = df_test$x))test_err_15 <-mean((df_test$y - yhat_test_15)^2)# Create smooth line for plottingx_fit <-data.frame(x =seq(-3, 3, length =100))line_fit <-data.frame(x = x_fit, y =predict(lm_fit_15, newdata = x_fit))# Add linear model and error text to plotg_train5 <- g_train +labs(subtitle =paste("Training error:", round(train_err_15, 3))) +geom_line(data = line_fit, aes(y = y, x = x), col ="red", lwd =1.5)g_test5 <- g_test +labs(subtitle =paste("Test error:", round(test_err_15, 3))) +geom_line(data = line_fit, aes(y = y, x = x), col ="red", lwd =1.5)ggarrange(g_train5, g_test5)
Example: Degree 15 polynomial
Bias-Variance Tradeoff
The formula for test error can be decomposed as a function of the bias and variance of our estimates.
Bias: the expected difference between our estimate and the truth
Variance: the variability of our estimate of the truth
Bias-Variance Tradeoff
For example, suppose \(Y^*=\mu+\epsilon\) where \(\epsilon\) is mean zero random error with variance \(\sigma^2\):
It is easy to fit a model with extremely low variance and extremely high bias, or extremely low bias and extremely high variance, but these models will not have a good test error. Ideally, we would like an estimate that has both low bias and low variance, but this is not always possible!
As a general rule of thumb, as we increase the complexity of our model, the bias will decrease but the variance will increase. You can think of our bias decreasing because we are training our model to be more specific to our training data. You can think of the variance increasing because our model will be overfit, and will vary substantially with new training data.
Cross-validation
Sample-splitting
Where do we get our “training data” and our “test data” in practice? We can’t just simulate more data for our test data in the real world. In practice, we use a technique called sample-splitting:
Split the data set into two (or more…) parts
First part of data: train the model/method
Second part of data: make predictions
Evaluate observed test error
Sample-splitting
library(forcats)# Generate dataset.seed(302)n <-50x <-sort(runif(n, -3, 3))y <-2*x +2*rnorm(n)# Randomly split data into two partsinds <-sample(rep(1:2, length = n))split <-as.factor(inds) |>fct_recode(Training ="1", Test ="2")data <-data.frame("x"= x, "y"= y, "split"= split)g_split <-ggplot(data, aes(x = x, y = y, color = split)) +geom_point() +labs(title ="Sample-splitting") + my_themeg_split
Sample-splitting
Sample-splitting
# Train on the training splitdata_train <- data |> dplyr::filter(split =="Training")lm_1 <-lm(y ~ x, data = data_train)lm_10 <-lm(y ~poly(x, 10), data = data_train)# Predict on the second halfdata_test <- data |> dplyr::filter(split =="Test")pred_1 <-predict(lm_1, data.frame(x = data_test$x))pred_10 <-predict(lm_10, data.frame(x = data_test$x))# Calculate test errortest_err_1 <-mean((data_test$y - pred_1)^2)test_err_10 <-mean((data_test$y - pred_10)^2)
Sample-splitting
# Create smooth lines for plottingx_fit <-data.frame(x =seq(min(data$x), max(data$x), length =100))line_fit_1 <-data.frame(x = x_fit, y =predict(lm_1, newdata = x_fit))line_fit_10 <-data.frame(x = x_fit, y =predict(lm_10, newdata = x_fit))# Add linear model and error text to plotg_split1 <- g_split +labs(subtitle =paste("Degree 1 Test error:", round(test_err_1, 3))) +geom_line(data = line_fit_1, aes(y = y, x = x), col ="black", lwd =1.5)g_split10 <- g_split +labs(subtitle =paste("Degree 10 Test error:", round(test_err_10, 3))) +geom_line(data = line_fit_10, aes(y = y, x = x), col ="black", lwd =1.5)ggarrange(g_split1, g_split10, common.legend =TRUE, legend ="bottom")
Sample-splitting
Sample-splitting
Sample-splitting gives us a powerful tool to evaluate the performance of estimator on data that was not used to train it. Of course, this comes at the cost of sacrificing some of our data that could be used in order to train a model with as much information as possible, but this is a sacrifice we must make in practice if we want to appropriately evaluate the predictive performance of our models/methods!
However, we can do better than just splitting our data in half.
k-fold Cross-validation
In general, the process of \(k\)-fold cross-validation is as follows:
Split data into \(k\) parts (folds)
Use all but 1 fold as your training data and fit the model
Use the remaining fold for your test data and make predictions
Switch which fold is your test data and repeat steps 2 and 3 until all folds have been test data (k times)
Compute squared error
Commonly, we use \(k=5\) and \(k=10\).
k-fold Cross-validation
k-fold Cross-validation
For each split, we calculate the out-of-sample mean squared prediction error. For example, for the first split, we calculate
\[MSE_1 = \dfrac{1}{m}\sum_{j = 1}^{m} (Y_{j,test} - \widehat{Y}_{j,test})^2,\] where \(m\) is the number of observations in the test set.
Then we calculate the cross-validation estimate of our test error
\[CV_{(k)} = \dfrac{1}{k}\sum_{i=1}^{k}MSE_i.\]
In words, our cross-validation estimate of the test error is the average of the mean squared errors across the splits!
Leave-One-Out Cross-Validation
Leave-one-out cross-validation (LOOCV) is a special instance of \(k\)-fold cross-validation where \(k\) is equal to the sample size \(n\). This means you fit \(n\) different models using \(n-1\) data points for your training data, and then test on the remaining individual data point.
Bias-variance Tradeoff
LOOCV strengths in that it uses as much data as possible to train the model while still having out-of-sample test data. Thus, it will provide approximately unbiased estimates of the test error, and is the best choice for minimizing bias.
However, LOOCV has one major drawback: variance. Because it fits \(n\) models, each of which is trained on almost identical training data, the outputs of the models are highly correlated. All other things being equal, the mean of correlated observations has a higher variance than the mean of uncorrelated observations. Thus, LOOCV has higher variance than \(k\)-fold CV with \(k<n\).
As a rule of thumb, stick to \(k=5\) and \(k=10\). These values tend to result in an ideal balance in terms of the bias-variance tradeoff.
What to do after cross-validation?
We use cross-validation as a method to evaluate our model, not to build our model! Thus, once we have used \(k\)-fold cross-validation to come up with a reasonable estimate of our out-of-sample test-error (and perhaps to choose among several models based on this test error), we use the full data to train our final predictive model.
Thus, using cross-validation, we are able to build a model based on our full data, and still evaluate the performance of our model on out-of-sample data!
k-fold Cross-validation Example
set.seed(302)# Split data in 5 parts, randomlyk <-5inds <-sample(rep(1:k, length = n))# Use data from beforedata <-data.frame("x"= x, "y"= y, "split"= inds)# Empty matrix to store predictions, 2 columns for 2 modelspred_mat <-matrix(NA, n, 2)for (i in1:k) { data_train <- data |> dplyr::filter(split != i) data_test <- data |> dplyr::filter(split == i)# Train our models lm_1_cv <-lm(y ~ x, data = data_train) lm_10_cv <-lm(y ~poly(x, 10), data = data_train)# Record predictions pred_mat[inds == i, 1] <-predict(lm_1_cv, data.frame(x = data_test$x)) pred_mat[inds == i, 2] <-predict(lm_10_cv, data.frame(x = data_test$x))}# Compute average MSE to get CV errorcv_err <-colMeans((pred_mat - data$y)^2)
k-fold Cross-validation Example
# Create predictive modellm_1 <-lm(y ~ x, data = data)lm_10 <-lm(y ~poly(x, 10), data = data)# Create smooth lines for plottingx_fit <-data.frame(x =seq(min(data$x), max(data$x), length =100))line_fit_1 <-data.frame(x = x_fit, y =predict(lm_1, newdata = x_fit))line_fit_10 <-data.frame(x = x_fit, y =predict(lm_10, newdata = x_fit))# Recode split for legenddata <- data |>mutate(Fold =fct_recode(as.factor(split), "Fold 1"="1", "Fold 2"="2","Fold 3"="3", "Fold 4"="4", "Fold 5"="5"))g_cv <-ggplot(data, aes(x = x, y = y, color = Fold)) +geom_point() +labs(title ="Cross-validation") + my_themeg_cv1 <- g_cv +labs(subtitle =paste("Degree 1 CV error:", round(cv_err[1], 3))) +geom_line(data = line_fit_1, aes(y = y, x = x), col ="black", lwd =1.5)g_cv10 <- g_cv +labs(subtitle =paste("Degree 10 CV error:", round(cv_err[2], 3))) +geom_line(data = line_fit_10, aes(y = y, x = x), col ="black", lwd =1.5)ggarrange(g_cv1, g_cv10, common.legend =TRUE, legend ="bottom")
k-fold Cross-validation Example
Statistical Prediction Algorithms
We’ve talked about how regression can be used for both statistical inference and prediction. We are now going to shift to talk about algorithms designed specifically for statistical prediction.
We will discuss only a few methods for this course to teach you about the principles of how they work. This will serve as an introduction to topics you may encounter if you venture further into fields such as machine learning.
Statistical Prediction Algorithms
The normal distribution, specified by parameters for the mean and the variance, is an example of a parametric model. We are now going to discuss some non-parametric models. As the name implies, these models do not assume that the data are drawn from some distribution with a fixed number of parameters.
Non-parametric models are not motivated by inference on model parameters. Instead, the focus is often on pure prediction. How can we use data to help us classify and predict the world around us? For the most part, many of these methods would have been impossible before the rise of high-powered computing.
k-nearest neighbors
k-nearest neighbors is a non-parametric prediction algorithm.
Input: training data \(X_i = (X_{i1}, \ldots, X_{ip})\) and output class \(Y_i\) for \(i = 1,\ldots, n\)
Input: test data \(X^* = (X^*_1, \ldots, X^*_p)\) with unknown output class \(Y^*\)
Algorithm: Find \(X_{(1)},\ldots, X_{(k)}\), the \(k\) nearest points to \(X^*\)
Output: Predicted class \(\widehat{Y}^*\), the most common class from \(X_{(1)},\ldots, X_{(k)}\)
k-nearest neighbors
This method is simple and highly flexible, but it can also be slow and inappropriate for certain settings.
Image Credit: An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
k-nearest neighbors
Image Credit: An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
Choosing k
What happens when \(k=1\)? \(k = n\)?
In order to choose an “optimal” \(k\), it is common to use cross-validation:
For \(k=1,\ldots,n\) (or a more restrictive range), fit \(k\)-nearest neighbors
Use cross-validation to identify \(k\) with the minimum test error
Re-train a \(k\)-nearest neighbor model using all the data with optimal choice of \(k\)
Cross-validation is a great tool to choose values for tuning parameters across a number of algorithms, not just \(k\)-nearest neighbors!
In practice, we don’t typically predict with a single tree. Instead, we combine multiple trees and take the average/most common of their predictions. Keeping with the tree metaphor, this is referred to as the random forest algorithm.
One major benefit of tree-based prediction algorithms is that they provide a powerful and interpretable tool for assessing variable importance. For example, we can look across all trees in a random forest, assess how often variables were used for splitting and how helpful those splits were for prediction, and identify the variables that improve our predictions the most!
(For more on this, see the Gini index and “An Introduction to Statistical Learning” by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani.)
Statistical Prediction Algorithms
This was a very brief overfiew of a very small subset of the algorithms used for statistical prediction. Other common examples include:
neural networks,
support vector machines,
kernel classifiers,
Bayesian classifiers,
and more…
These algorithms, and related topics, are likely things that you will cover if you take courses more focused on machine learning.