Statistical modeling in San Francisco Crime Prediction



Analysis on San Francisco Crime Rate

M.Sc. Student: Jiaying Li, Supervisor: Dr. Ian McLeod

Department of Statistical and Actuarial Sciences, University of Western Ontario

Kaggle

• Kaggle aspires to be "The Home of Data Science". It curates many interesting challenges for modern data analytic methods. Some challenges offer large monetary awards and others are offered for interested students. There are nearly half a million registered users on this website.


Data Description

• This dataset is available from Kaggle and comprises more than 1.7 million records of crimes in the city during the period 2003-2015. The response variable to be predicted is the crime type, which has 39 categories, as summarized in the barchart.

• The main predictors available are the date and time of each incident and its GPS coordinates. This data is an example of "long" data.

[Figure: barchart "Frequency of Crimes in San Francisco": count of records (axis from 0 to 150000) for each of the 39 crime categories, listed from TREA to LARCENY/THEFT]
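The date-and-time predictor has to be split into parts before it can enter a model; the results tables refer to Hr, Yr, Week and DayOfWeek. A minimal Python sketch of that step follows. It assumes timestamps laid out like `2015-05-13 23:53:00` (the format is an assumption), and `time_features` is a hypothetical helper, not code from the poster, whose analysis was done in R.

```python
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def time_features(timestamp: str) -> dict:
    """Split one timestamp string into calendar features.

    Assumes the 'YYYY-MM-DD HH:MM:SS' layout; change the format
    string if the data uses a different one.
    """
    dt = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "Yr": dt.year,
        "Hr": dt.hour,
        "Week": dt.isocalendar()[1],          # ISO week number
        "DayOfWeek": DAY_NAMES[dt.weekday()],  # weekday() is 0 for Monday
    }

features = time_features("2015-05-13 23:53:00")
```

Each feature is returned as a plain value so it can be used directly as a categorical predictor.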

Preliminary analysis

• The raw response “Category” is highly imbalanced: the least and most frequent categories have 1 and 174,588 records respectively. In the preliminary analysis, “Category” was therefore aggregated and reduced to 5 levels based on a common classification of crimes in law.
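The aggregation step amounts to a lookup from raw category to level. The poster does not list its 5 levels, so the sketch below uses a purely hypothetical mapping: the level names and assignments are placeholders chosen for illustration, and only the raw category names come from the barchart.

```python
from collections import Counter

# HYPOTHETICAL mapping: level names and assignments are placeholders,
# not the classification used in the poster.
LEVEL_OF = {
    "LARCENY/THEFT": "property",
    "VEHICLE THEFT": "property",
    "BURGLARY": "property",
    "ASSAULT": "violent",
    "ROBBERY": "violent",
    "DRUG/NARCOTIC": "drug_alcohol",
    "DRUNKENNESS": "drug_alcohol",
    "FRAUD": "financial",
    "BAD CHECKS": "financial",
}

def aggregate(category: str) -> str:
    """Map a raw crime category to its aggregated level (default 'other')."""
    return LEVEL_OF.get(category, "other")

# Toy records, not real data.
raw = ["LARCENY/THEFT", "ASSAULT", "LARCENY/THEFT", "WARRANTS", "FRAUD"]
level_counts = Counter(aggregate(c) for c in raw)
```

Counting the aggregated labels, as in `level_counts`, is a quick way to check how balanced the 5 levels are compared with the raw 39.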

Visualization

• As shown in the map and the barchart above, crime types differ between regions, but they share a similar trend over the hours of the day.


Methods Used

• Nearest neighbour classifier, naive Bayes, quadratic discriminant analysis
• Multinomial regression
• One-vs.-rest and one-vs.-one classifiers (the "ova" and "ovo" rows in the results tables)
• Tree methods, and random forest with gradient boosting
• Neural nets, support vector machines
• Two-stage model as a combination of the above

All methods are already implemented in R in well-established R packages on CRAN.
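To make one of the listed methods concrete, here is a from-scratch sketch of a categorical naive Bayes classifier that returns posterior probabilities. It is not the implementation used in the poster, which relied on R packages; the add-one smoothing, the function names and the toy rows are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def fit_naive_bayes(rows, labels, alpha=1.0):
    """Fit a categorical naive Bayes model with add-alpha smoothing.

    rows   : list of tuples of categorical feature values
    labels : list of class labels, same length as rows
    """
    class_counts = Counter(labels)
    # value_counts[(feature index, class)][value] = number of training rows
    value_counts = defaultdict(Counter)
    levels = defaultdict(set)  # distinct values seen for each feature
    for row, y in zip(rows, labels):
        for j, v in enumerate(row):
            value_counts[(j, y)][v] += 1
            levels[j].add(v)
    return class_counts, value_counts, levels, alpha

def predict_proba(model, row):
    """Return {class: posterior probability} for one feature tuple."""
    class_counts, value_counts, levels, alpha = model
    n = sum(class_counts.values())
    scores = {}
    for y, n_y in class_counts.items():
        score = n_y / n  # prior P(y)
        for j, v in enumerate(row):
            num = value_counts[(j, y)][v] + alpha
            den = n_y + alpha * len(levels[j])
            score *= num / den  # smoothed P(x_j = v | y)
        scores[y] = score
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}

# Toy training set (made up): features are (PdDistrict, Hr).
rows = [("MISSION", 23), ("MISSION", 22), ("RICHMOND", 9), ("RICHMOND", 10)]
labels = ["ASSAULT", "ASSAULT", "LARCENY/THEFT", "LARCENY/THEFT"]
model = fit_naive_bayes(rows, labels)
posterior = predict_proba(model, ("MISSION", 23))
```

The posterior dictionary has the shape needed for both evaluation measures: the full probabilities feed the logloss, and the most probable class gives the final classification.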

Evaluation

• For each test case, I provided both a posterior probability for each of the 39 categories and a final classification of the incident.

• Logloss:

$$\text{Logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{39} y_{ij}\log(p_{ij}),$$

where $N$ is the number of incidents in the test set, $y_{ij} = 1$ if observation $i$ is in class $j$ and 0 otherwise, and $p_{ij}$ is the predicted probability that observation $i$ belongs to class $j$.

• Classification rate:

$$\text{Classification rate} = \frac{n_c}{N},$$

where $n_c$ is the number of incidents correctly classified.
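Both measures are straightforward to compute from predicted probabilities. The sketch below is one way to do it in Python, assuming the natural logarithm and a small clipping constant `eps` to avoid log(0); neither detail is stated in the poster, and the toy data uses 3 classes rather than 39.

```python
import math

def log_loss(true_labels, prob_rows, eps=1e-15):
    """Multiclass log loss.

    Because y_ij is 1 only for the true class, the double sum reduces to
    the log of the probability given to the true class of each observation.
    prob_rows is a list of {class: probability} dicts, one per observation.
    Probabilities are clipped to [eps, 1 - eps] (an assumed convention).
    """
    total = 0.0
    for y, probs in zip(true_labels, prob_rows):
        p = min(max(probs.get(y, 0.0), eps), 1 - eps)
        total += math.log(p)
    return -total / len(true_labels)

def classification_rate(true_labels, prob_rows):
    """Share of observations whose most probable class is the true class."""
    correct = sum(
        1 for y, probs in zip(true_labels, prob_rows)
        if max(probs, key=probs.get) == y
    )
    return correct / len(true_labels)

# Toy example with made-up numbers.
y_true = ["A", "B", "C"]
p_hat = [
    {"A": 0.7, "B": 0.2, "C": 0.1},
    {"A": 0.5, "B": 0.4, "C": 0.1},
    {"A": 0.2, "B": 0.2, "C": 0.6},
]
loss = log_loss(y_true, p_hat)
rate = classification_rate(y_true, p_hat)
```

In the toy data the second observation is misclassified, so two of three cases are correct, while the logloss still rewards the 0.4 probability placed on its true class.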


Model outputs

• Predict raw response with 39 classes directly.

Model        Variables                Classification rate  Logloss
naive Bayes  Pd,Hr,Yr,Week,DayOfWeek  22.85%               2.56
kNN          PdDistrict,Hr,Yr,X,Y     22.33%               11.17
QDA          X,Y                      8.48%                3.47
NNet         X,Y                      19.97%               2.68
SVM          PdDistrict,Hr,Yr,X,Y     22.41%               Not Applicable
C5.0         X,Y                      28.07%               2.50
CART         X,Y                      28.01%               2.87
RF           PdDistrict,Hr,Yr,X,Y     23.78%               6.39
Boosting     PdDistrict,X,Y           25.34%               5.12

• Predict 5 aggregated levels only.

Model        Variables             Classification rate
naive Bayes  PdDistrict            38.78%
kNN          Pd,X,Y,Hr,Yr          30.05%
QDA          X,Y                   30.48%
NNet         DOW,Pd,X,Y,Hr,Yr      37.92%
ovo grpreg   DOW,Pd,X,Y,Hr,Yr      38.81%
ova grpreg   DOW,Pd,X,Y,Hr,Yr      38.78%
SVM          DOW,Pd,X,Y,Hr,Yr      37.92%
C5.0         X,Y                   44.02%
CART         X,Y (stratified)      43.33%
RF           PdDistrict,Hr,Yr,X,Y  39.72%
Boosting     X,Y                   41.61%

Variable abbreviations: Pd and PdDistrict = police district, Hr = hour, Yr = year, DOW = day of week, X and Y = GPS coordinates.

2-stage model

• To get something more balanced, one could group the crimes into 5 categories so that each class has approximately equal numbers, as far as possible. The model first classifies incidents into the 5 aggregated levels; a second stage then predicts one of the 39 crime categories.

Model            Variables             Classification rate  Logloss
C5.0+naiveBayes  PdDistrict            26.08%               2.94
C5.0+C5.0        X,Y                   27.11%               2.97
C5.0+Boosting    PdDistrict,Hr,Yr,X,Y  27.45%               2.95
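The two ideas in this section, balanced grouping and chaining two classifiers, can be sketched as follows. Both functions are assumptions: the poster does not say how the groups were formed or how the two stages were combined. The greedy heuristic and the product rule P(category) = P(group) × P(category | group) are simply one common way to do each, and the counts and probabilities are made up.

```python
import heapq

def balanced_groups(counts, k=5):
    """Greedily split categories into k groups of roughly equal total count.

    Categories are taken largest first and each is assigned to the group
    with the smallest running total (an assumed heuristic).
    """
    heap = [(0, g) for g in range(k)]  # (running total, group id)
    heapq.heapify(heap)
    group_of = {}
    for cat, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        total, g = heapq.heappop(heap)
        group_of[cat] = g
        heapq.heappush(heap, (total + n, g))
    return group_of

def two_stage_probs(group_probs, within_probs):
    """Combine the stages: P(category) = P(group) * P(category | group)."""
    return {
        cat: group_probs[g] * p
        for g, cats in within_probs.items()
        for cat, p in cats.items()
    }

# Toy counts for five categories split into two groups.
toy_counts = {"a": 50, "b": 30, "c": 20, "d": 10, "e": 10}
group_of = balanced_groups(toy_counts, k=2)

# Toy stage-1 and stage-2 outputs for a single incident.
stage1 = {0: 0.6, 1: 0.4}
stage2 = {0: {"a": 0.75, "d": 0.25}, 1: {"b": 0.5, "c": 0.25, "e": 0.25}}
combined = two_stage_probs(stage1, stage2)
```

Because the second-stage probabilities sum to 1 within each group, the combined probabilities sum to 1 over all categories and can be scored with the same logloss as a single-stage model.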

Conclusion

• My best predictor has a classification rate of 28%, which is much better than random guessing (about 2.6% for 39 equally likely classes). I have also submitted my best predictions, from C5.0, to the Kaggle website, where the log-loss (cross-entropy) is used to evaluate model performance. For my predictor this was 2.50, which is close to 2.26, the best score so far on Kaggle.

• After some trial and error, location turned out to be the most important factor in all models. Time of day is useful to some extent, but models are more likely to suffer from overfitting and a decrease in predictive power if date and time are included as predictor variables.
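For context on the figures quoted above, the chance-level baselines for a 39-class problem can be worked out directly. The sketch assumes uniform guessing (every class given probability 1/39) and the natural logarithm in the logloss; a baseline that always predicts the most frequent class would score differently.

```python
import math

K = 39  # number of crime categories

# Uniform random guessing: every class gets probability 1/K.
chance_rate = 1 / K            # expected classification rate
chance_logloss = math.log(K)   # -log(1/K), identical for every observation
```

Under these assumptions the baseline classification rate is about 2.6% and the baseline logloss is about 3.66, against the reported 28% and 2.50.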




Acknowledgement

I would like to thank San Francisco OpenData for the data source, and Kaggle for the platform.
