library(tidyverse)
library(nycflights13)
data("flights")
flights_sample <- flights |>
  filter(carrier %in% c("HA", "AS"))
Lab 8
Remember, follow the instructions below and use R Markdown to create a PDF document containing your code and your answers to the following questions, then submit it on Gradescope. You may find a template file by clicking “Code” in the top right corner of this page.
A. Bootstrapping the sampling distribution of the median
Using the `penguins` dataset in the `palmerpenguins` package, construct a confidence interval for the mean `body_mass_g` for female Adelie penguins using the normal approximation justified by the central limit theorem. You should compute the confidence interval without using `confint()`.
Construct a bootstrap confidence interval for the mean `body_mass_g` for female Adelie penguins using 10000 resamples.
Construct a bootstrap confidence interval for the median `body_mass_g` for female Adelie penguins using 10000 resamples.
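As a rough illustration of the two approaches in this part, here is a minimal sketch in base R. It runs on a simulated vector `mass` that only stands in for the real measurements; the helper `boot_ci()`, the 95% level, and the percentile method are assumptions made for this example, not requirements of the lab. With the real data you would also need to drop missing values first.

```r
# Simulated stand-in for the body mass values (hypothetical data,
# NOT the real palmerpenguins measurements).
set.seed(1)
mass <- rnorm(70, mean = 3400, sd = 270)

# Normal-approximation (CLT) interval for the mean, built by hand:
# estimate +/- z * standard error. The 95% level is an assumption.
n <- length(mass)
se <- sd(mass) / sqrt(n)
clt_ci <- mean(mass) + c(-1, 1) * qnorm(0.975) * se

# Percentile bootstrap (hypothetical helper): resample with replacement,
# recompute the statistic each time, and take the outer quantiles of
# the resampled statistics as the interval endpoints.
boot_ci <- function(x, stat, n_resamples = 10000, level = 0.95) {
  boot_stats <- replicate(n_resamples, stat(sample(x, replace = TRUE)))
  alpha <- 1 - level
  quantile(boot_stats, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

mean_ci <- boot_ci(mass, mean)
median_ci <- boot_ci(mass, median)
```

Passing the statistic as a function lets the same resampling code serve both the mean and the median. Other bootstrap intervals exist; the percentile interval is simply the shortest to write down.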
B. Simulations
Suppose that \(Y\sim \mathrm{Poisson}(X)\) where \(X\sim \mathrm{Exponential}(1)\). Use simulation to estimate \(E(Y)\) and \(\mathrm{Var}(Y)\).
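One possible way to set up this simulation is to draw each \(X\) first and then draw \(Y\) with that \(X\) as its rate. The sketch below assumes 100000 draws and a fixed seed, both arbitrary choices. As a sanity check, the laws of total expectation and total variance give \(E(Y) = E(X) = 1\) and \(\mathrm{Var}(Y) = E(X) + \mathrm{Var}(X) = 2\), so the estimates should land near those values.

```r
set.seed(1)
n_sims <- 100000  # number of simulated (X, Y) pairs; an arbitrary choice

# Step 1: draw the random rate X ~ Exponential(1).
x <- rexp(n_sims, rate = 1)

# Step 2: draw Y | X ~ Poisson(X). rpois() is vectorised over lambda,
# so each y[i] uses its own rate x[i].
y <- rpois(n_sims, lambda = x)

# Monte Carlo estimates of E(Y) and Var(Y).
est_mean <- mean(y)
est_var <- var(y)
```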
For this question, you will write a simulation to test the frequentist coverage of a 95% confidence interval for a proportion based on the normal approximation.
- (a) First, write a function that takes two inputs: `n` and `p`. Your function should randomly generate some \(X\sim \mathrm{Binomial}(n, p)\), compute \(\widehat{p}= X/n\), and then compute the corresponding normal distribution-based confidence interval for \(p\) based on your sample \(\widehat{p}\). Your function should return `TRUE` if \(p\) is in the confidence interval and `FALSE` otherwise. You may use the following formula for the confidence interval:
\[\widehat{p}\pm z_{.975}\sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}}\]
- (b) Next, write a second function that takes three inputs: `n`, `p`, and `n_runs`, representing the number of times to run your simulation. This function should use your function from (a) to simulate `n_runs` binomial random variables and return the proportion of the `n_runs` runs for which \(p\) is contained in the confidence interval.
- (c) Test your function from (b) with `n = 20`, `p = .5`, and `n_runs = 1000`.
- (d) Use your simulation code to investigate the following questions: For what values of `n` and `p` is the frequentist coverage close to the expected 95% value? For what values of `n` and `p` is the frequentist coverage very different from the expected 95% value?
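The two-function structure described in this question could be sketched as follows. The names `covers_p()` and `coverage()` are hypothetical, and this is only one of several reasonable ways to organise the simulation.

```r
# One simulated experiment (hypothetical helper): does the normal-approximation
# interval built from a single Binomial(n, p) draw contain the true p?
covers_p <- function(n, p) {
  x <- rbinom(1, size = n, prob = p)
  p_hat <- x / n
  half_width <- qnorm(0.975) * sqrt(p_hat * (1 - p_hat) / n)
  (p_hat - half_width <= p) && (p <= p_hat + half_width)
}

# Repeat the experiment n_runs times and report the fraction of
# intervals that contained p (the estimated coverage).
coverage <- function(n, p, n_runs) {
  mean(replicate(n_runs, covers_p(n, p)))
}

set.seed(1)
cov_mid <- coverage(n = 20, p = 0.5, n_runs = 1000)
cov_edge <- coverage(n = 20, p = 0.01, n_runs = 1000)
```

Note that when \(\widehat{p}\) is 0 or 1 the interval collapses to a single point, which is one reason to try values of `p` near the boundary with small `n`.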
C. Hypothesis Testing
Use the code at the top of this page to obtain the Hawaiian Airlines and Alaska Airlines flights from the `nycflights13` package (it stores them in `flights_sample`).
Compute a 95% confidence interval for the mean `arr_delay` for Alaska Airlines flights. Interpret your results.
Compute a 95% confidence interval for the mean `arr_delay` for Hawaiian Airlines flights. Interpret your results.
Compute a 95% confidence interval for the proportion of flights for which `arr_delay > 0` for Hawaiian Airlines flights. Interpret your results.
Consider the null hypothesis that the mean `arr_delay` for Alaska is equal to the mean `arr_delay` for Hawaiian and the alternative hypothesis that the mean `arr_delay` values are different for the two airlines. Perform an appropriate hypothesis test and interpret your results.
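For orientation, the base R functions `t.test()` and `prop.test()` cover the kinds of intervals and tests asked for here. The sketch below runs them on simulated delays (`delay_a` and `delay_b` are hypothetical vectors, not the real flight data), and choosing Welch's two-sample t-test is an assumption about what counts as "appropriate", not the only defensible option.

```r
# Simulated stand-ins for two groups of arrival delays in minutes
# (hypothetical data, NOT taken from nycflights13).
set.seed(1)
delay_a <- rnorm(300, mean = -10, sd = 30)
delay_b <- rnorm(200, mean = -5, sd = 60)

# 95% t-based confidence interval for one group's mean.
ci_a <- t.test(delay_a, conf.level = 0.95)$conf.int

# 95% interval for the proportion of positive delays. prop.test() returns a
# score-type interval with continuity correction, which differs slightly from
# the formula given in part B. With real data, handle missing values before
# counting (for example with na.rm = TRUE).
late <- delay_b > 0
ci_prop <- prop.test(sum(late), length(late), conf.level = 0.95)$conf.int

# Two-sided test of equal means. t.test() uses Welch's unequal-variance
# version by default (var.equal = FALSE).
welch <- t.test(delay_a, delay_b, alternative = "two.sided")
p_value <- welch$p.value
```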
D. Linear Regression
Researchers at the University of Texas in Austin, Texas tried to figure out what causes differences in instructor teaching evaluation scores. Use the following code to load data on 463 courses. A full description of the data can be found here.
evals <- readr::read_csv("https://www.openintro.org/book/statdata/evals.csv")
Carry out a linear regression with `score` as the response variable and `age` as the single explanatory variable. Interpret your results.
Extend your regression model by adding an additional explanatory variable. What happens to your results? Are the new \(p\)-values appropriate to use?
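The mechanics of fitting and then extending a linear model with `lm()` can be sketched on a toy data frame. Everything in `toy` is simulated for illustration: the columns `rating` and `extra` and the coefficients used to generate them are hypothetical and say nothing about the real `evals` data.

```r
# Toy data (hypothetical): a response that depends weakly on age, plus a
# second predictor that is pure noise.
set.seed(1)
n <- 200
toy <- data.frame(
  age = runif(n, min = 30, max = 70),
  extra = rnorm(n)
)
toy$rating <- 4.5 - 0.006 * toy$age + rnorm(n, sd = 0.5)

# Simple regression: one explanatory variable.
fit_one <- lm(rating ~ age, data = toy)
coef_one <- summary(fit_one)$coefficients  # estimates, SEs, t values, p-values

# Extended model: add a second explanatory variable with `+`.
fit_two <- lm(rating ~ age + extra, data = toy)
coef_two <- summary(fit_two)$coefficients
```

Comparing `coef_one` and `coef_two` side by side shows how the estimate and \(p\)-value for `age` shift once another variable enters the model.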