Problem 1 – the boston housing data - Technology - …course1.winona.edu/bdeppa/Stat...



DSCI 425 – Supervised Learning
Assignment 1 – Multiple Linear Regression (105 points)

PROBLEM 1 – THE BOSTON HOUSING DATA

The Boston Housing data set was the basis for a 1978 paper by Harrison and Rubinfeld, which discussed approaches for using housing market data to estimate the willingness to pay for clean air. The authors employed a hedonic price model, based on the premise that the price of the property is determined by structural attributes (such as size, age, condition) as well as neighborhood attributes (such as crime rate, accessibility, environmental factors). This type of approach is often used to quantify the effects of environmental factors that affect the price of a property.

Data were gathered for 506 census tracts in the Boston Standard Metropolitan Statistical Area (SMSA) in 1970, collected from a number of sources including the 1970 US Census and the Boston Metropolitan Area Planning Committee. The variables used to develop the Harrison-Rubinfeld housing value equation are listed in the table below.

Variables Used in the Harrison-Rubinfeld Housing Value Equation

VARIABLE | TYPE                   | DEFINITION                                                           | SOURCE
CMEDV    | Dependent Variable (Y) | Median value of homes in thousands of dollars                        | 1970 U.S. Census
RM       | Structural             | Average number of rooms                                              | 1970 U.S. Census
AGE      | Structural             | % of units built prior to 1940                                       | 1970 U.S. Census
B        | Neighborhood           | % of population that is black                                        | 1970 U.S. Census
LSTAT    | Neighborhood           | % of population that is lower socioeconomic status                   | 1970 U.S. Census
CRIM     | Neighborhood           | Crime rate measure                                                   | FBI (1970)
ZN       | Neighborhood           | % of residential land zoned for lots greater than 25,000 sq. ft.     | Metro Area Planning Commission (1972)
INDUS    | Neighborhood           | % of non-retail business acres (proxy for industry)                  | Mass. Dept. of Commerce & Development (1965)
TAX      | Neighborhood           | Property tax rate                                                    | Mass. Taxpayers Foundation (1970)
PTRATIO  | Neighborhood           | Pupil-Teacher ratio                                                  | Mass. Dept. of Ed ('71-'72)
CHAS     | Neighborhood           | Dummy variable indicating proximity to Charles River (1 = on river)  | 1970 U.S. Census Tract maps
DIS      | Accessibility          | Weighted distances to major employment centers in area               | Schnare dissertation (Unpublished, 1973)
RAD      | Accessibility          | Index of accessibility to radial highways                            | MIT Boston Project
NOX      | Air Pollution          | Nitrogen oxide concentrations (pphm)                                 | TASSIM


REFERENCE

Harrison, D., and Rubinfeld, D. L., “Hedonic Housing Prices and the Demand for Clean Air,” Journal of Environmental Economics and Management, 5 (1978), 81-102.

Develop regression models for predicting CMEDV using the available predictors in the table above. Note that all variables are numeric with the exception of CHAS, which is an indicator/dummy variable indicating whether or not the census tract is located along the Charles River in Boston. The file Boston.csv on my website can be read into R as shown in the handouts.

> Boston = read.table(file.choose(),header=T,sep=",")
> Boston$CHAS = as.factor(Boston$CHAS)   # optional: because CHAS is 0/1 coded, you do not have to do this
> bos.lm = lm(CMEDV~.,data=Boston)

Your analysis should be thorough! Document the model development process by copying and pasting relevant R commands, output, and graphics into your write-up.
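A base fit followed by backward elimination could be sketched as follows. This is only an illustration under stated assumptions: the data frame `sim` is simulated as a stand-in for Boston.csv, the columns `NOISE1` and `NOISE2` are invented, and AIC-based `step()` is one of several possible stepwise choices, not necessarily the one used in the handouts.

```r
# Simulated stand-in for Boston.csv (hypothetical data; NOISE1 and NOISE2
# are invented predictors with no real effect on the response).
set.seed(425)
n <- 200
sim <- data.frame(RM = rnorm(n, 6.3, 0.7),
                  LSTAT = runif(n, 2, 35),
                  NOISE1 = rnorm(n),
                  NOISE2 = rnorm(n))
sim$CMEDV <- 10 + 4 * sim$RM - 0.5 * sim$LSTAT + rnorm(n, sd = 3)

# Base model: response regressed on every available predictor
base.lm <- lm(CMEDV ~ ., data = sim)

# Standard residual diagnostics for spotting deficiencies in the base model
par(mfrow = c(2, 2))
plot(base.lm)
par(mfrow = c(1, 1))

# Backward elimination using AIC; trace = 0 suppresses the step-by-step log
step.lm <- step(base.lm, direction = "backward", trace = 0)
summary(step.lm)$coefficients
```

For the write-up, setting `trace = 1` prints each elimination step, which can be pasted in as documentation of the reduction.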

Grading rubric (50 points)

1) In this part of your analysis of these data you will fit a simple MLR model to these data without trying to address any model deficiencies.
   a) Fit a base model and discuss any deficiencies (but don’t try to fix them). (5 pts.)
   b) Stepwise reduction of base model and discussion of final model. (5 pts.)
   c) Use cross-validation methods to estimate the prediction error of this model using split-sample, k-fold, and the .632 bootstrap approaches. (10 pts.)
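The split-sample and k-fold approaches named in part (c) might look like the sketch below. The function names `rmsep`, `split.cv`, and `kfold.cv` are hypothetical helpers written for this illustration (they are not the functions from the course handouts), and the data are simulated rather than read from Boston.csv.

```r
# Root mean squared error of prediction
rmsep <- function(y, yhat) sqrt(mean((y - yhat)^2))

# Split-sample: fit on a random training fraction p, score on the held-out rest
split.cv <- function(formula, data, resp, p = 0.67) {
  train <- sample(nrow(data), floor(p * nrow(data)))
  fit <- lm(formula, data = data[train, ])
  rmsep(data[[resp]][-train], predict(fit, newdata = data[-train, ]))
}

# k-fold: every case is predicted once by a model fit without its fold
kfold.cv <- function(formula, data, resp, k = 10) {
  fold <- sample(rep(1:k, length.out = nrow(data)))
  yhat <- numeric(nrow(data))
  for (j in 1:k) {
    fit <- lm(formula, data = data[fold != j, ])
    yhat[fold == j] <- predict(fit, newdata = data[fold == j, ])
  }
  rmsep(data[[resp]], yhat)
}

# Simulated stand-in for the Boston data (hypothetical values)
set.seed(425)
n <- 200
sim <- data.frame(RM = rnorm(n, 6.3, 0.7), LSTAT = runif(n, 2, 35))
sim$CMEDV <- 10 + 4 * sim$RM - 0.5 * sim$LSTAT + rnorm(n, sd = 3)

ss.err <- split.cv(CMEDV ~ RM + LSTAT, sim, "CMEDV")
kf.err <- kfold.cv(CMEDV ~ RM + LSTAT, sim, "CMEDV", k = 10)
c(split.sample = ss.err, k.fold = kf.err)
```

A single split-sample estimate depends heavily on which cases land in the test set, so repeating it over many random splits and averaging usually gives a steadier number.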

2) In this part of your analysis of these data you will develop an MLR model that addresses any deficiencies you identified in part (1). Things to consider would be adding higher-order terms (polynomial terms) and power transformations. In the end I would like you to compare the predictive performance of this model to the one you developed in part (1).
   a) Model development, documentation, and discussion. (15 pts.)
   b) Fitting final model, critiquing it, and discussing any deficiencies. (5 pts.)
   c) Use cross-validation methods to estimate the prediction error of this model using split-sample, k-fold, and the .632 bootstrap approaches. All prediction measures should be for the response in the ORIGINAL scale, thus you will need to back-transform your predictions in the CV process. (10 pts.)
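To illustrate back-transforming inside the resampling loop, here is a sketch of a .632 bootstrap estimate for a model fit to a log-transformed response. Several things are assumptions: `boot632` is a hypothetical helper, the log transformation is just one example of a power transformation, the data are simulated, and the out-of-bag error is averaged over bootstrap samples (a common simplification of Efron's per-observation version).

```r
# .632 bootstrap RMSEP in the ORIGINAL response scale.
# 'back' undoes the response transformation used in 'formula'.
boot632 <- function(formula, data, resp, B = 200, back = exp) {
  y <- data[[resp]]                       # response in the original scale
  fit <- lm(formula, data = data)
  # Apparent (resubstitution) error after back-transforming the fitted values
  app.mse <- mean((y - back(predict(fit, newdata = data)))^2)
  oob.mse <- numeric(B)
  for (b in 1:B) {
    idx <- sample(nrow(data), replace = TRUE)   # bootstrap sample
    oob <- setdiff(seq_len(nrow(data)), idx)    # cases left out of it
    fit.b <- lm(formula, data = data[idx, ])
    pred <- back(predict(fit.b, newdata = data[oob, ]))
    oob.mse[b] <- mean((y[oob] - pred)^2)
  }
  # Weighted combination of optimistic and pessimistic error estimates
  sqrt(0.368 * app.mse + 0.632 * mean(oob.mse))
}

# Simulated stand-in for the Boston data (hypothetical values)
set.seed(425)
n <- 200
sim <- data.frame(RM = rnorm(n, 6.3, 0.7), LSTAT = runif(n, 2, 35))
sim$CMEDV <- exp(2 + 0.25 * sim$RM - 0.03 * sim$LSTAT + rnorm(n, sd = 0.15))

err632 <- boot632(log(CMEDV) ~ RM + LSTAT, sim, "CMEDV", B = 200, back = exp)
err632
```

The errors are computed only after applying `back()`, so the result is in thousands of dollars and can be compared directly with the untransformed model from part (1). For a square-root response, `back = function(z) z^2` would play the same role.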


PROBLEM 2 – LISTING PRICE OF HOMES IN THE TWIN CITIES METRO AREA

These data are contained in the TC Homes (train).csv file on the website. The variable descriptions are below. TC Homes (test).csv contains homes for which I would like you to use your final model to predict the list price in the ORIGINAL scale. Whatever data torturing you do to the training data will also need to be done to the test cases.

Variable     | Info         | Description
ListPrice    | Response (Y) | Current List Price ($)
BEDS         | X1           | # of Bedrooms
BATHS        | X2           | # of Bathrooms (can be fractional)
SQFT         | X3           | Square footage of home (ft²)
LotSize      | X4           | Square footage of lot (ft²) – missing for several of the homes in these data.
YearBuilt    | X6           | Year the home was built, could be used to create a new variable called Age = 2014 - YearBuilt
ParkingSpots | X7           | # of Parking Spots (I assume off-street parking)
HasGarage    | X8           | Garage or No (Nominal)
DOM          | X9           | Days on the market, number of days the home has been listed for sale.
BeenReduced  | X10          | Has the price been reduced from the original listing price – Y or N. (Nominal)
SoldPrev     | X12          | Has the home been sold previously? Y or N (Nominal)
Latitude     | X13          | Latitude (degrees)
Longitude    | X14          | Longitude (degrees)
ShortSale    | X15          | Is more money owed on the home than what the asking price is? Y or N (Nominal)

Grading Rubric (55 points)

a) Fitting base model, critiquing it, and discussing any deficiencies. (5 pts.)

b) Model development, documentation, and discussion. (20 pts.)
   - Consideration of assumptions
   - Possible predictor transformations
   - Stepwise procedures

c) Fitting final model, critiquing it, interpreting it, and discussing any deficiencies. (5 pts.)


d) Cross-validation results and discussion for predicting the response in the original scale. (10 pts.)

e) Give me your predicted list price for the test cases contained in the file TC Homes (test).csv using your model. I will discuss how to do this in class. (10 pts.)
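One way to keep the training and test cases in step is to put every derived variable into a single function and apply it to both files before fitting and predicting. The sketch below assumes a hypothetical helper `prep`, a simulated generator `make.homes` standing in for the two TC Homes files, and an illustrative log-price model; none of these come from the assignment itself.

```r
# All "data torturing" lives in one function so that the train and test
# cases receive identical treatment.
prep <- function(d) {
  d$Age <- 2014 - d$YearBuilt     # Age as suggested in the variable table
  d$logSQFT <- log(d$SQFT)        # illustrative predictor transformation
  d
}

# Simulated stand-ins for TC Homes (train).csv and TC Homes (test).csv
set.seed(425)
make.homes <- function(n) {
  d <- data.frame(BEDS = sample(2:5, n, replace = TRUE),
                  SQFT = round(runif(n, 800, 4000)),
                  YearBuilt = sample(1900:2013, n, replace = TRUE))
  d$ListPrice <- exp(8 + 0.6 * log(d$SQFT) - 0.003 * (2014 - d$YearBuilt) +
                     rnorm(n, sd = 0.2))
  d
}
train <- prep(make.homes(150))
test <- make.homes(20)
test$ListPrice <- NULL            # test cases arrive without the response
test <- prep(test)

# Fit on the log scale, then back-transform so predictions are in dollars
fit <- lm(log(ListPrice) ~ logSQFT + Age + BEDS, data = train)
test$PredPrice <- exp(predict(fit, newdata = test))
head(test)
```

With the real files, nominal predictors such as HasGarage would also need the same factor levels in both data frames, otherwise `predict()` may fail on levels it did not see during fitting.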
