Lab 5
Remember to follow the instructions below and use R Markdown to create a PDF document containing your code and answers to the following questions, then submit it on Gradescope. You may find a template file by clicking “Code” in the top right corner of this page.
0. Cook County Assessor’s Office
For this lab, you will work with data from the Cook County Assessor’s Office in Illinois, which inspects properties across Chicago and its suburbs and assesses the value of each property to determine the property taxes owed by each owner. All property owners are required to pay taxes, which are used to fund public services at the state level. These assessments are based on property values, which are often estimated via statistical models that account for variables like the size and location of a property.
Since these models determine how much property owners must pay each year, it is desirable to have assessments that are fair. However, in 2017, the office of the former Cook County Assessor Joseph Berrios was sued by two Chicago nonprofits who alleged that Berrios’ office “disproportionately put the burden of residential property taxes on minority homeowners,” so that wealthy property owners paid proportionally less in taxes compared with lower-income, and often minority, property owners. The Chicago Tribune investigated property assessments from 2003 to 2015, arguing that assessments had indeed been discriminatory. Their four-part investigation can be found here.
Since this investigation, the Cook County Assessor’s Office has strived to be more transparent in disclosing their methods and data for property valuation. In this assignment, we will look at data they have released on property valuation from 2013-2019. The office has also published open-source code for their models, which is written in R! This assignment is based on a module developed by instructors at UC Berkeley.
A. Residential Sales Data
Download the data from this link. How many rows are there in this dataset? What does each row represent? (Hint: be precise here).
Examine the Site Desirability variable. What does each level of this variable represent? You may need to refer to the codebook to learn about this variable. Is it explained how this variable is determined?
Give an example of a variable that is not included in this dataset that could be useful in determining property value.
Create a histogram of Sale Price for this dataset. Identify one issue with this visualization and attempt to address this issue.
For the rest of the assignment, we will focus on a subset of properties. Provide code that creates a new data.frame called clean_data that contains only properties whose sale price is at least $500. Create a new column in this data frame called log_sale_price that contains the log-transformed Sale Price values.
Visualize the association between number of bedrooms and log_sale_price using parallel box plots. You may need to convert Bedrooms to a factor before you are able to construct the parallel box plots. For clarity, only include properties with 10 or fewer bedrooms. Interpret your results.
Create a new factor variable called age_bin that has levels 1-20, 21-40, 41-60, 61-80, 81-100, and 100+. Visualize the association between age_bin and log_sale_price using parallel box plots. Interpret your results.
B. Assessor First Pass Values
Not all of the properties in the above dataset have public assessment values. You can download another dataset containing “First Pass Values” representing the Assessor’s initial valuations for a set of properties in 2019 here. How many rows are in this dataset?
Use an appropriate function to combine the first pass values data with the clean_data from Part A. You should keep only rows that have both log_sale_price (from clean_data) and First Pass Value 1 from the first pass values data. How many rows are in this combined dataset?
Create a scatter plot with log(First Pass Value 1) on the x-axis and log_sale_price on the y-axis. Add a line to your plot indicating the line where y=x. Interpret your results. What do points above the line represent? What do points below the line represent?
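For the combining and plotting steps in this part, here is a minimal base R sketch on two made-up tables. It assumes the datasets share an identifier column, called PIN here purely for illustration; confirm the actual key column in the real files before joining. The table names sales and first_pass, the column log_first_pass, and all values are likewise hypothetical.

```r
# Toy tables; the key column name PIN and all values are assumptions.
sales <- data.frame(
  PIN            = c("A", "B", "C", "D"),
  log_sale_price = log(c(120000, 250000, 310000, 64000))
)
first_pass <- data.frame(
  PIN                  = c("B", "C", "D", "E"),
  `First Pass Value 1` = c(240000, 330000, NA, 500000),
  check.names          = FALSE  # keep the spaces in the column name
)

# merge() performs an inner join by default: only PINs in both tables remain.
combined <- merge(sales, first_pass, by = "PIN")

# Drop rows where either value is missing.
keep <- !is.na(combined$log_sale_price) &
  !is.na(combined[["First Pass Value 1"]])
combined <- combined[keep, ]

# Scatter plot on the log scale with the reference line y = x.
combined$log_first_pass <- log(combined[["First Pass Value 1"]])
plot(combined$log_first_pass, combined$log_sale_price,
     xlab = "log(First Pass Value 1)", ylab = "log_sale_price")
abline(a = 0, b = 1, col = "red")  # intercept 0, slope 1
```

An inner join is used here because the question asks for rows that have both values; a left or full join would keep unmatched rows that then need to be filtered out.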