(Machine Learning) Clustering & Classifying Houses in King County, WA
Clustering & Classifying Houses in King County, WA
MOHAMMED ALHAMADI - PROJECT 1
Acknowledgement
This project was completed as a partial requirement for the course Introduction to Machine Learning, offered online in Fall 2016 through Tandon Online, NYU Tandon School of Engineering.
Outline
o The dataset
o Loading & Exploring the dataset
o Clustering Zip codes and Prices
o Predicting House Prices Using Support Vector Machine Algorithm
o Decreasing Correlation Between Independent Variables
o Scaling Data for SVM
o References
The Dataset
• House sales in King County in Washington State
• From May 2014 to May 2015
• 21,613 observations and 21 features
ID, Date, Price, Bedrooms, Bathrooms, SQFT Living (living area in square feet), SQFT Lot (lot area in square feet), Floors, Waterfront, View, Grade (house grade ranging from 1 to 13), Condition (house condition ranging from 1 to 5), SQFT Above (living area excluding the basement), SQFT Basement (basement area), Yr Built (the year in which the house was built), Yr Renovated (the year in which the house was renovated), Zipcode, Lat (latitude), Long (longitude), SQFT Living15 (living area in square feet for the nearest 15 neighbors), and SQFT LOT15 (lot area in square feet for the nearest 15 neighbors).
Loading and exploring the data

houses2 <- read.csv("/Users/mohammedalhamadi/GoogleDrive/R_code/data/kc_house_data.csv", header=TRUE)
dim(houses2)
[1] 21613    21

names(houses2)
 [1] "id"            "date"          "price"         "bedrooms"      "bathrooms"
 [6] "sqft_living"   "sqft_lot"      "floors"        "waterfront"    "view"
[11] "condition"     "grade"         "sqft_above"    "sqft_basement" "yr_built"
[16] "yr_renovated"  "zipcode"       "lat"           "long"          "sqft_living15"
[21] "sqft_lot15"
Loading and exploring the data (cont.)

str(houses2)
'data.frame': 21613 obs. of 21 variables:
 $ id           : num 7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
 $ date         : Factor w/ 372 levels "20140502T000000",..: 165 221 291 221 284 11 57 252 340 306 ...
 $ price        : num 221900 538000 180000 604000 510000 ...
 $ bedrooms     : int 3 3 2 4 3 4 3 3 3 3 ...
 $ bathrooms    : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
 $ sqft_living  : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
 $ sqft_lot     : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
 $ floors       : num 1 2 1 1 1 1 2 1 1 2 ...
 $ waterfront   : int 0 0 0 0 0 0 0 0 0 0 ...
 $ view         : int 0 0 0 0 0 0 0 0 0 0 ...
 $ condition    : int 3 3 3 5 3 3 3 3 3 3 ...
 $ grade        : int 7 7 6 7 8 11 7 7 7 7 ...
 $ sqft_above   : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
 $ sqft_basement: int 0 400 0 910 0 1530 0 0 730 0 ...
 $ yr_built     : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
 $ yr_renovated : int 0 1991 0 0 0 0 0 0 0 0 ...
 $ zipcode      : int 98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
 $ lat          : num 47.5 47.7 47.7 47.5 47.6 ...
 $ long         : num -122 -122 -122 -122 -122 ...
 $ sqft_living15: int 1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
 $ sqft_lot15   : int 5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
Discovering correlations between features

cor(houses2[,3:length(houses2)])  # first 2 features are excluded

# Most correlations are insignificant. The most significant correlations, in order:
Grade and sqft_living    = 0.763
Grade and sqft_above     = 0.756
Price and sqft_living    = 0.702
Bathrooms and sqft_above = 0.685
Price and grade          = 0.667
Bathrooms and grade      = 0.665
Price and sqft_above     = 0.606
Price and sqft_living15  = 0.585
Price and bathrooms      = 0.525
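Scanning a 19-by-19 correlation matrix by eye is error-prone, so it can help to rank the pairs programmatically. The helper below is hypothetical (not part of the original code), and it is demonstrated on a small synthetic data frame since the CSV is not bundled here; with the real data you would pass houses2[,3:length(houses2)] instead:

```r
# List the strongest pairwise correlations in a data frame (hypothetical helper).
top_correlations <- function(df, n = 5) {
  cm <- cor(df)
  cm[lower.tri(cm, diag = TRUE)] <- NA          # keep each pair only once
  pairs <- which(!is.na(cm), arr.ind = TRUE)    # upper-triangle index pairs
  out <- data.frame(var1 = rownames(cm)[pairs[, 1]],
                    var2 = colnames(cm)[pairs[, 2]],
                    r    = cm[pairs],
                    stringsAsFactors = FALSE)
  head(out[order(-abs(out$r)), ], n)            # strongest |r| first
}

# Synthetic demo: a and b are built to be strongly correlated, c is noise.
set.seed(1)
x <- rnorm(100)
demo <- data.frame(a = x, b = x + rnorm(100, sd = 0.1), c = rnorm(100))
top_correlations(demo, 2)   # the a-b pair should rank first
```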
Correlation plots (1)

# Correlation between sqft_living and price
plot(houses2[,c(6,3)], main="Correlation between SQFT Living and Price", xlab="Living area in square ft", ylab="Price", col="blue")
Correlation plots (2)

# Correlation between sqft_living and grade
plot(houses2[,c(6,12)], main="Correlation between SQFT Living and Grade", xlab="Living area in square ft", ylab="Grade", col="dark green")
Correlation plots (3)

# Correlation between grade and price
plot(houses2[,c(12,3)], main="Correlation between Grade and Price", xlab="Grade", ylab="Price", col="red")
Clustering zip codes and prices data

zip_and_price <- houses2[1:5000, c("zipcode", "price")]  # consider the first 5000 observations
scaledZP <- scale(zip_and_price)                         # scale for comparability
dist_scaledZP <- dist(scaledZP, method="euclidean")      # use Euclidean distance
clusters <- hclust(dist_scaledZP, method="ward.D")
plot(clusters)                                           # plot clusters in a dendrogram
Clustering zip codes and prices data (cont.)

groups <- cutree(clusters, k=6)
rect.hclust(clusters, k=6, border="blue")
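Once cutree() has assigned a cluster label to each house, the clusters can be profiled with base-R table() and tapply(). The sketch below mirrors the slide's pipeline on synthetic zipcode/price data (the numbers are made up, since the CSV is not available here):

```r
# Synthetic stand-in for houses2[1:5000, c("zipcode", "price")]
set.seed(42)
zp <- data.frame(zipcode = sample(98001:98199, 200, replace = TRUE),
                 price   = rlnorm(200, meanlog = 13))

# Same steps as the slides: scale, Euclidean distance, Ward clustering, cut at k=6
cl <- hclust(dist(scale(zp), method = "euclidean"), method = "ward.D")
groups <- cutree(cl, k = 6)

table(groups)                   # how many houses fall in each cluster
tapply(zp$price, groups, mean)  # average price within each cluster
```

Profiling the clusters this way is what turns the dendrogram into something interpretable, e.g. "cluster 3 is the high-price group".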
Predicting House Prices Using Support Vector Machine Algorithm

# Creating a categorical variable from the price data
quantile(houses2$price)
     0%     25%     50%     75%    100%
  75000  321950  450000  645000 7700000
So we can have our 4 classes like this:

Houses more expensive than 645,000    Expensive   5373 houses
Houses between 450,000 and 645,000    High        5376 houses
Houses between 321,950 and 450,000    Ok          5460 houses
Houses cheaper than 321,950           Cheap       5404 houses
Predicting House Prices Using Support Vector Machine Algorithm (cont.)

# add a column to the data to hold this categorical variable
houses2$price_categ <- cut(houses2$price, c(0, 321950, 450000, 645000, 7700000), labels=c("Cheap", "Ok", "High", "Expensive"))

# view a summary of the categorical variable
summary(houses2$price_categ)
    Cheap        Ok      High Expensive
     5404      5460      5376      5373
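The same four classes can also be derived without hard-coding the break points, by feeding quantile() output straight into cut(); include.lowest=TRUE keeps the minimum price inside the first bin. A sketch on synthetic prices (the real call would use houses2$price):

```r
# Synthetic stand-in for houses2$price
set.seed(7)
price <- rlnorm(1000, meanlog = 13)

# Quartile break points computed from the data itself
breaks <- quantile(price, probs = c(0, 0.25, 0.5, 0.75, 1))

# include.lowest=TRUE so the cheapest house lands in "Cheap" rather than NA
categ <- cut(price, breaks = breaks, include.lowest = TRUE,
             labels = c("Cheap", "Ok", "High", "Expensive"))
table(categ)   # quartile breaks give four equally sized classes
```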
Predicting House Prices Using Support Vector Machine Algorithm (cont.)

# choose relevant columns from the data set based on the correlation analysis done earlier
cols <- c("sqft_living", "grade", "sqft_living15", "bathrooms", "view", "price_categ")

# let's define a training data set and a testing data set
set.seed(100)
training_size <- round(0.7 * dim(houses2)[1])
training_sample <- sample(dim(houses2)[1], training_size, replace=FALSE)
training_houses <- houses2[training_sample, cols]
testing_houses <- houses2[-training_sample, cols]

# calling the SVM function on the training data
library(e1071)
svmfit <- svm(price_categ~., data=training_houses, kernel="linear", cost=0.1, scale=FALSE)
Predicting House Prices Using Support Vector Machine Algorithm (cont.)

# plotting the classification graph of the SVM, two variables at a time
plot(svmfit, training_houses, sqft_living~grade)
plot(svmfit, training_houses, bathrooms~sqft_living)
Predicting House Prices Using Support Vector Machine Algorithm (cont.)

# use the tune() function to try to find the best variables, the best kernel, and the best cost parameter to minimize the error
tuned <- tune(svm, price_categ~bathrooms+sqft_living, data=training_houses, kernel="linear", ranges=list(cost=c(10)))
print(tuned)
Error estimation of 'svm' using 10-fold cross validation: 0.5506647

tuned <- tune(svm, price_categ~bathrooms+sqft_living, data=training_houses, kernel="linear", ranges=list(cost=c(100)))
print(tuned)
Error estimation of 'svm' using 10-fold cross validation: 0.5488125
Predicting House Prices Using Support Vector Machine Algorithm (cont.)

# continue calling tune() and change the combination of independent variables in the formula, the kernel, and the cost parameter.
# Cost is passed as a list of 6 numbers ranging from 0.001 to 100 in 10x increments.
tuned <- tune(svm, price_categ~., data=training_houses, kernel="linear", ranges=list(cost=c(0.001, 0.01, 0.1, 1, 10, 100)))
summary(tuned)
Parameter tuning of 'svm':
- sampling method: 10-fold cross validation
- best parameters: cost 10
- best performance: 0.5088919
- Detailed performance results:
   cost     error dispersion
1 1e-03 0.5197316 0.01126059
2 1e-02 0.5125271 0.01538330
3 1e-01 0.5109408 0.01457529
4 1e+00 0.5092224 0.01532396
5 1e+01 0.5088919 0.01564221
6 1e+02 0.5093546 0.01505520

The best cost is 10, since it gave the smallest error.
Predicting House Prices Using Support Vector Machine Algorithm (cont.)

# Again call tune() and change the independent variables in the formula, the kernel, and the cost parameter. This is a summary of the output for different combinations:

                                                             Error based on cost
Independent Variables                          Kernel        0.001  0.01  0.1   1     10    100
All variables                                  Linear        0.52   0.51  0.51  0.51  0.51  0.51
All variables                                  Polynomial    0.60   0.56  0.54  0.53  0.53  0.53
All variables                                  Radial        0.53   0.51  0.50  0.50  0.50  0.50
All variables                                  Sigmoid       0.55   0.58  0.64  0.64  0.64  0.64
Grade                                          Radial        0.56   0.53  0.53  0.53  0.53  0.53
Grade & sqft_living                            Radial        0.55   0.51  0.51  0.51  0.51  0.51
Grade, sqft_living & sqft_living15             Radial        0.55   0.51  0.51  0.51  0.51  0.51
Grade, sqft_living, sqft_living15 & bathrooms  Radial        0.54   0.52  0.51  0.51  0.50  0.50
# We can see that the best case happened when we used all the variables, the radial kernel, and cost parameter = 10.
# Let's use these parameters: all variables, radial kernel, cost=10
svmfit <- svm(price_categ~., data=training_houses, kernel="radial", cost=10, scale=FALSE)
Predicting House Prices Using Support Vector Machine Algorithm (cont.)

print(svmfit)
Call:
svm(formula = price_categ ~ ., data = training_houses, kernel = "radial", cost = 10, scale = FALSE)

Parameters:
   SVM-Type: C-classification
 SVM-Kernel: radial
       cost: 10
      gamma: 0.1111111

Number of Support Vectors: 14406
p <- predict(svmfit, testing_houses[,cols], type="class")
table(p, testing_houses[,6])

p           Cheap   Ok  High Expensive
  Cheap       463  144   106        22
  Ok          143  423   114        71
  High         95  133   343        94
  Expensive   871  968  1039      1455

mean(p==testing_houses[,6])
[1] 0.413942

Prediction accuracy is 41%, which is pretty low.
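Overall accuracy hides how unevenly the classes are predicted (the model labels many houses "Expensive"); per-class recall can be read off the confusion matrix. This sketch re-enters the printed counts by hand, so it can be checked without refitting anything:

```r
# Confusion matrix from the slide: rows = predicted class, columns = true class
cm <- matrix(c(463, 144,  106,   22,
               143, 423,  114,   71,
                95, 133,  343,   94,
               871, 968, 1039, 1455),
             nrow = 4, byrow = TRUE,
             dimnames = list(pred = c("Cheap","Ok","High","Expensive"),
                             true = c("Cheap","Ok","High","Expensive")))

overall <- sum(diag(cm)) / sum(cm)   # correct predictions over all predictions
recall  <- diag(cm) / colSums(cm)    # per-class recall: correct / true class size

round(overall, 3)   # ≈ 0.414, matching the mean() above
round(recall, 3)
```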
Decreasing correlation between independent variables
o Data can contain attributes that are highly correlated with each other
o Many methods perform better if highly correlated attributes are removed
o Checking the correlations between our independent variables:

               sqft_living  grade  sqft_living15  bathrooms  view
sqft_living           1.00   0.76           0.76       0.75  0.28
grade                 0.76   1.00           0.71       0.66  0.25
sqft_living15         0.76   0.71           1.00       0.57  0.28
bathrooms             0.75   0.66           0.57       1.00  0.19
view                  0.28   0.25           0.28       0.19  1.00

o We can see several correlations above 0.65, which is high
o We want to eliminate that
o Choose other variables with low inter-correlation and high correlation with price_categ
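The 0.65 screening can be automated rather than done by eye. The sketch below re-enters the matrix above and lists every pair over the threshold; caret's findCorrelation (see the Caret reference at the end) implements a more refined version of the same idea:

```r
# Correlation matrix of the five predictors, as printed on the slide
vars <- c("sqft_living", "grade", "sqft_living15", "bathrooms", "view")
cm <- matrix(c(1.00, 0.76, 0.76, 0.75, 0.28,
               0.76, 1.00, 0.71, 0.66, 0.25,
               0.76, 0.71, 1.00, 0.57, 0.28,
               0.75, 0.66, 0.57, 1.00, 0.19,
               0.28, 0.25, 0.28, 0.19, 1.00),
             nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# Flag each pair (once) whose correlation exceeds the 0.65 threshold
idx <- which(abs(cm) > 0.65 & upper.tri(cm), arr.ind = TRUE)
flagged <- data.frame(var1 = rownames(cm)[idx[, 1]],
                      var2 = colnames(cm)[idx[, 2]],
                      r    = cm[idx],
                      stringsAsFactors = FALSE)
flagged   # five pairs exceed 0.65, all involving sqft_living or grade
```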
Decreasing correlation between independent variables (cont.)
o After analyzing the correlations in our data set, we chose the following variables:
o "sqft_living", "floors", "view", "sqft_basement", and "lat"

cols <- c("sqft_living", "floors", "view", "sqft_basement", "lat", "price_categ")

# Running everything again
set.seed(100)
training_size <- round(0.7 * dim(houses2)[1])
training_sample <- sample(dim(houses2)[1], training_size, replace=FALSE)
training_houses <- houses2[training_sample, cols]
testing_houses <- houses2[-training_sample, cols]

# Again, try different kernels and see which is best
# (every time the kernel is changed, call summary(tuned) to see the best cost parameter)
tuned <- tune(svm, price_categ~., data=training_houses, kernel="radial", ranges=list(cost=c(0.001, 0.01, 0.1, 1, 10, 100)))
tuned <- tune(svm, price_categ~., data=training_houses, kernel="linear", ranges=list(cost=c(0.001, 0.01, 0.1, 1, 10, 100)))
tuned <- tune(svm, price_categ~., data=training_houses, kernel="polynomial", ranges=list(cost=c(0.001, 0.01, 0.1, 1, 10, 100)))
tuned <- tune(svm, price_categ~., data=training_houses, kernel="sigmoid", ranges=list(cost=c(0.001, 0.01, 0.1, 1, 10, 100)))
summary(tuned)
Decreasing correlation between independent variables (cont.)
o The best kernel was found to be the radial kernel again

svmfit <- svm(price_categ~., data=training_houses, kernel="radial", cost=100, scale=FALSE)
p <- predict(svmfit, testing_houses[,cols], type="class")
mean(p==testing_houses[,6])
[1] 0.514343

Prediction accuracy increased to 51%.
Data Scaling for SVM
o "Scaling before applying SVM is very important" (Hsu et al.)
o Advantages of scaling:
  o avoid attributes in greater numeric ranges dominating those in smaller numeric ranges
  o avoid numerical difficulties during the calculation
# Scaling the data (the first 5 columns are the numeric predictors; the 6th is price_categ)
training_houses2 <- training_houses
training_houses2[1:5] <- scale(training_houses2[1:5])
testing_houses2 <- testing_houses
testing_houses2[1:5] <- scale(testing_houses2[1:5])
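One caveat worth flagging: the snippet above scales the test set with its own mean and standard deviation. Hsu et al.'s guide advises reusing the training set's scaling factors so both sets live on the same scale. A base-R sketch on synthetic data (the variable names here are illustrative, not from the slides):

```r
# Synthetic train/test split standing in for the housing predictors
set.seed(1)
train <- data.frame(x = rnorm(100, mean = 5), y = rnorm(100, sd = 3))
test  <- data.frame(x = rnorm(20,  mean = 5), y = rnorm(20,  sd = 3))

# scale() stores the centering/scaling it used as attributes...
train_sc <- scale(train)

# ...which can be replayed on the test set, instead of re-estimating them
test_sc <- scale(test,
                 center = attr(train_sc, "scaled:center"),
                 scale  = attr(train_sc, "scaled:scale"))

colMeans(train_sc)   # ~0 by construction; test_sc is now on the training scale
```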
# Calling SVM again and checking the accuracy
svmfit <- svm(price_categ~., data=training_houses2, kernel="radial", cost=100, scale=FALSE)
p <- predict(svmfit, testing_houses2[,cols], type="class")
mean(p==testing_houses2[,6])
[1] 0.6904688

A big increase in accuracy, from 41% to 69%.
References
Support Vector Machines: The Interface to libsvm in Package e1071, by David Meyer. https://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf
Support Vector Machines (SVM) Overview and Demo using R. https://www.youtube.com/watch?v=ueKqDlMxueE
DSO 530: Logistic Regression in R. https://www.youtube.com/watch?v=mteljf020EE
DSO 530: Decision Trees in R (Classification). https://www.youtube.com/watch?v=GOJN9SKl_OE
Feature Selection with the Caret R Package. http://machinelearningmastery.com/feature-selection-with-the-caret-r-package/
A Practical Guide to Support Vector Classification, by Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. https://cloud.scorm.com/content/courses/P0P5ZE81VQ/78c05b7f-2582-46cf-ab6b-160ea2e02a6a/0/story_content/external_files/A%20Practical%20Guide%20to%20Support%20Vector%20Machines.pdf
Cluster Analysis in R. http://www.statmethods.net/advstats/cluster.html
Questions?
Thanks!