Lab 8

Author

YOUR NAME HERE

Remember, follow the instructions below and use R Markdown to create a pdf document with your code and answers to the following questions on Gradescope. You may find a template file by clicking “Code” in the top right corner of this page.

A. Bootstrapping the sampling distribution of the median

  1. Using the penguins dataset in the palmerpenguins package, construct a confidence interval for the mean body_mass_g for female Adelie penguins based on using a normal distribution based on the central limit theorem. You should compute the confidence interval without using confint().

  2. Construct a bootstrap confidence interval for the mean body_mass_g for female Adelie penguins using 10000 resamples.

  3. Construct a bootstrap confidence interval for the median body_mass_g for female Adelie penguins using 10000 resamples.

B. Simulations

  1. Suppose that \(Y\sim \mathrm{Poisson}(X)\) where \(X\sim \mathrm{Exponential}(1)\). Use simulation to estimate \(E(Y)\) and \(\mathrm{Var}(Y)\).

  2. For this question, you will write a simulation to test the frequentist coverage of a 95% confidence interval for a proportion based on the normal approximation.

    1. First, write a function that takes two inputs: n and p. Your function should randomly generate some \(X\sim \mathrm{Binomial}(n, p)\), compute \(\widehat{p}= X/n\), and then compute the corresponding normal distribution-based confidence interval for \(p\) based on your sample \(\widehat{p}\). Your function should return TRUE if \(p\) is in the confidence interval. You may use the following formula for the confidence interval:

    \[\widehat{p}\pm z_{.975}\sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}}\]

    1. Next, write a second function that takes three inputs: n, p, and n_runs, representing the number of times to run your simulation. This function should use your function from (a) to simulate n_runs binomial random variables and return the proportion of the n_runs for which \(p\) is contained in the confidence interval.

    2. Test your function from (b) with n = 20, p = .5, and n_runs = 1000.

    3. Use your simulation code to investigate the following questions: For what values of n and p is the frequentist coverage close to the expected 95% value? For what values of n and p is the frequentist coverage very different to the expected 95% value?

C. Hypothesis Testing

Use the following code to obtain the Hawaiian Airlines and Alaska Airlines flights from the nycflights13 package.

library(tidyverse)
library(nycflights13)
data("flights")
flights_sample <- flights |> 
  filter(carrier %in% c("HA", "AS"))
  1. Compute a 95% confidence interval for the mean arr_delay for Alaska Airlines flights. Interpret your results.

  2. Compute a 95% confidence interval for the mean arr_delay for Hawaiian Airlines flights. Interpret your results.

  3. Compute a 95% confidence interval for the proportion of flights for which arr_delay > 0 for Hawaiian Airlines flights. Interpret your results.

  4. Consider the null hypothesis that the mean arr_delay for Alaska is equal to the mean arr_delay for Hawaiian and the alternative hypothesis that the mean arr_delay values are different for the two airlines. Perform an appropriate hypothesis test and interpret your results.

D. Linear Regression

Researchers at the University of Texas in Austin, Texas tried to figure out what causes differences in instructor teaching evaluation scores. Use the following code to load data on 463 courses. A full description of the data can be found here.

evals <- readr::read_csv("https://www.openintro.org/book/statdata/evals.csv")
  1. Carry out a linear regression with score as the response variable and age as the single explanatory variable. Interpret your results.

  2. Extend your regression model by adding an additional explanatory variable. What happens to your results? Are the new \(p\)-values appropriate to use?