Lab 5

Author

YOUR NAME HERE

Remember to follow the instructions below and use R Markdown to create a PDF document containing your code and answers to the following questions, then submit it on Gradescope. You may find a template file by clicking “Code” in the top right corner of this page.

0. Cook County Assessor’s Office

For this lab, you will work with data from the Cook County Assessor’s Office in Illinois, which inspects properties across Chicago and its suburbs and assesses the value of each property to determine the property taxes owed by its owner. All property owners are required to pay these taxes, which are used to fund public services at the local level. These assessments are based on property values, which are often estimated via statistical models that account for variables like the size and location of a property.

Since these models determine how much property owners must pay each year, it is desirable to have assessments that are fair. However, in 2017, the office of the former Cook County Assessor Joseph Berrios was sued by two Chicago nonprofits that alleged that Berrios’ office “disproportionately put the burden of residential property taxes on minority homeowners,” so that wealthy property owners paid proportionally less in taxes compared with lower-income, and often minority, property owners. The Chicago Tribune investigated property assessments from 2003 to 2015, arguing that assessments had indeed been discriminatory. Their four-part investigation can be found here.

Since this investigation, the Cook County Assessor’s Office has strived to be more transparent in disclosing their methods and data for property valuation. In this assignment, we will look at data they have released on property valuation from 2013-2019. The office has also published open-source code for their models, which is written in R! This assignment is based on a module developed by instructors at UC Berkeley.

A. Residential Sales Data

  1. Download the data from this link. How many rows are there in this dataset? What does each row represent? (Hint: be precise here).

  2. Examine the Site Desirability variable. What does each level of this variable represent? You may need to refer to the codebook to learn about this variable. Does the codebook explain how this variable is determined?

  3. Give an example of a variable that is not included in this dataset that could be useful in determining property value.

  4. Create a histogram of Sale Price for this dataset. Identify one issue with this visualization and attempt to address this issue.

  5. For the rest of the assignment, we will focus on a subset of properties. Provide code that creates a new data.frame called clean_data that contains only properties whose sale price is at least $500. Create a new column in this data frame called log_sale_price that contains the log-transformed Sale Price values.

  6. Visualize the association between number of bedrooms and log_sale_price using parallel box plots. You may need to convert Bedrooms to a factor before you are able to construct the parallel box plots. For clarity, only include properties with 10 or fewer bedrooms. Interpret your results.

  7. Create a new factor variable called age_bin that has levels 1-20, 21-40, 41-60, 61-80, 81-100, and 100+. Visualize the association between age_bin and log_sale_price using parallel box plots. Interpret your results.
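
As a hint for Part A, the sketch below illustrates the relevant base R tools on a small made-up data frame. The data frame `toy` and its columns `price`, `beds`, and `age` are hypothetical stand-ins, not the names used in the Cook County dataset, and the threshold and bin edges are only one possible choice.

```r
# Toy data: hypothetical values, not from the Cook County dataset
toy <- data.frame(
  price = c(1, 450, 12000, 85000, 250000, 410000),
  beds  = c(2, 3, 2, 3, 4, 4),
  age   = c(5, 33, 20, 21, 64, 130)
)

# Keep rows at or above a threshold, then add a log-transformed column
toy_clean <- toy[toy$price >= 500, ]
toy_clean$log_price <- log(toy_clean$price)

# cut() turns a numeric column into a factor of intervals.
# With right = TRUE (the default), each interval includes its upper edge,
# so an age of 20 falls in "1-20" and an age of 21 falls in "21-40".
toy_clean$age_group <- cut(
  toy_clean$age,
  breaks = c(0, 20, 40, Inf),
  labels = c("1-20", "21-40", "40+")
)

# Parallel box plots: one box per level of the grouping factor
boxplot(log_price ~ factor(beds), data = toy_clean,
        xlab = "Bedrooms", ylab = "log(price)")
```

Note that with these breaks a value of exactly 0 falls outside the first interval and becomes `NA` unless `include.lowest = TRUE` is passed to `cut()`, so it is worth checking how many properties end up unbinned.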

B. Assessor First Pass Values

  1. Not all of the properties in the above dataset have public assessment values. You can download another dataset containing “First Pass Values” representing the Assessor’s initial valuations for a set of properties in 2019 here. How many rows are in this dataset?

  2. Use an appropriate function to combine the first pass values data with the clean_data from Part A. You should keep only rows that have both log_sale_price (from clean_data) and First Pass Value 1 from the first pass values data. How many rows are in this combined dataset?

  3. Create a scatter plot with log(First Pass Value 1) on the x-axis and log_sale_price on the y-axis. Add a line to your plot indicating the line where y=x. Interpret your results. What do points above the line represent? What do points below the line represent?
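
As a hint for Part B, the sketch below shows one way to combine two tables and draw the line y=x in base R. The data frames `sales` and `assessed`, the key column `id`, and the column `value` are hypothetical; the real datasets use their own identifier and column names, which you will need to look up.

```r
# Toy tables: hypothetical keys and values, not from the lab datasets
sales <- data.frame(id = c(1, 2, 3, 4),
                    log_price = log(c(120000, 250000, 90000, 400000)))
assessed <- data.frame(id = c(2, 3, 4, 5),
                       value = c(230000, 110000, 400000, 310000))

# merge() performs an inner join by default (all = FALSE):
# only ids present in both tables are kept
both <- merge(sales, assessed, by = "id")

# Drop rows where either quantity is missing
both <- both[!is.na(both$log_price) & !is.na(both$value), ]

# Scatter plot with the identity line y = x (intercept 0, slope 1)
plot(log(both$value), both$log_price,
     xlab = "log(assessed value)", ylab = "log(sale price)")
abline(a = 0, b = 1, lty = 2)
```

If you prefer the tidyverse, `dplyr::inner_join()` plays the same role as `merge()` here. Either way, check the row count before and after joining so you know how many properties were dropped.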