New York Metro Data Analysis


description

Kevin Hung. Statistical Hypothesis Test and Machine Learning with Linear Regression to analyze and predict New York City Metro Ridership. Udacity Intro to Data Science (ud359) Final Project.

Transcript of New York Metro Data Analysis

Page 1: New York Metro Data Analysis

Analyzing New York Metro

Image credits

"NYC subway-4D" by CountZ at English Wikipedia. Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:NYC_subway-4D.svg#/media/File:NYC_subway-4D.svg

https://c1.staticflickr.com/3/2188/2363525822_9285c44cd7_b.jpg

*proud member of

Udacity Intro to Data Science Final Project


© Kevin Hung 2015

by Kevin Hung


Page 2: New York Metro Data Analysis

Question of Interest | How to Model Hourly Ridership Entries?

Image credits

http://i.dailymail.co.uk/i/pix/2012/11/05/article-2227401-15DC2645000005DC-20_634x407.jpg

Page 3: New York Metro Data Analysis

sec3: Clue #1 | Hourly Schedule

• Do people follow a predictable timetable or itinerary in ridership?

• Peak hours seem intuitive for plausible reasons
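One way to look for a predictable hourly pattern is to average entries by hour of day and see where the peaks fall. The sketch below is only an illustration: the `records` list and the function name `mean_entries_by_hour` are made up for this example and are not taken from the project.

```python
from collections import defaultdict

# Hypothetical (hour_of_day, entries) records; the values are invented for illustration.
records = [(0, 120), (8, 950), (8, 1100), (12, 700), (17, 1300), (17, 1250), (20, 600)]

def mean_entries_by_hour(rows):
    """Return {hour: mean entries} for an iterable of (hour, entries) pairs."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for hour, entries in rows:
        totals[hour] += entries
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in sorted(totals)}

hourly = mean_entries_by_hour(records)
# The hour with the highest mean entries in this toy data.
peak_hour = max(hourly, key=hourly.get)
```

Averaging rather than summing keeps hours with more recorded rows from looking busier than they are.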

Page 4: New York Metro Data Analysis

sec3: Clue #2 | Work Week

• Whopping 10 Million Difference in Entries between Weekdays and Weekends in our subset!

• Found a Great Feature
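A weekday feature can be derived from the date alone. The sketch below assumes records of the form (date, entries); the sample values and the helper names `is_weekday` and `totals_by_weekday_flag` are hypothetical.

```python
from datetime import date

# Hypothetical (date, entries) records; the values are invented for illustration.
records = [
    (date(2011, 5, 2), 1500),  # Monday
    (date(2011, 5, 3), 1600),  # Tuesday
    (date(2011, 5, 7), 700),   # Saturday
    (date(2011, 5, 8), 650),   # Sunday
]

def is_weekday(d):
    """True for Monday-Friday; date.weekday() returns 0 for Monday and 6 for Sunday."""
    return d.weekday() < 5

def totals_by_weekday_flag(rows):
    """Sum entries separately for weekdays (True) and weekends (False)."""
    totals = {True: 0, False: 0}
    for d, entries in rows:
        totals[is_weekday(d)] += entries
    return totals

totals = totals_by_weekday_flag(records)
difference = totals[True] - totals[False]
```

The same boolean can be used directly as a 0/1 column in a regression design matrix.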

Page 5: New York Metro Data Analysis

sec3: Clue #3 | Is it Raining?

• Do people tend to ride less on rainy days, or is it because there are more non-rainy days than rainy ones in our particular dataset?

• Next: Let’s test this!
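The worry about unequal group sizes can be shown with a toy comparison: raw totals favor whichever group has more observations, while per-observation averages do not. The numbers below are invented purely to illustrate that point.

```python
from statistics import mean

# Hypothetical hourly entries; there are twice as many non-rainy observations.
rainy = [900, 1000, 1100]
non_rainy = [950, 1000, 1050, 900, 1100, 1000]

# Totals differ simply because the non-rainy group is larger...
total_gap = sum(non_rainy) - sum(rainy)

# ...while the per-observation means are identical in this toy data.
mean_gap = mean(non_rainy) - mean(rainy)
```

A gap in totals alone therefore says little, which is why a formal test on the two samples is the natural next step.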

Page 6: New York Metro Data Analysis

sec1 | Statistical Test

Q: Why use a statistical significance test?

A1: To draw valid inferences!

A2: It gives a formal framework to compare & evaluate data

A3: It tells us whether effects perceived in our sample reflect the population as a whole

One-Sided Mann-Whitney U-Test[1]: Are the 2 populations the same?

H0: “The distributions of rainy and non-rainy ridership populations are equal!”

HA: “No! Ridership of one population tends to be bigger than the other”
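To make the test concrete, here is a minimal, dependency-free sketch of a one-sided Mann-Whitney U computation. It is not the project's code: the function names are hypothetical, and the p-value uses the normal approximation with a tie correction and no continuity correction, which is only reasonable for larger samples. In practice a library routine such as `scipy.stats.mannwhitneyu` would normally be used instead.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks.
    Also returns the size of each group of equal values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes

def mann_whitney_u_one_sided(x, y):
    """Return (U_x, p) for HA: values in x tend to be larger than values in y.
    Assumes the pooled sample is not entirely tied (otherwise sigma is zero)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks, tie_sizes = average_ranks(list(x) + list(y))
    u_x = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    tie_term = sum(t ** 3 - t for t in tie_sizes) / (n * (n - 1))
    sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term))
    z = (u_x - mu) / sigma
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability P(Z >= z)
    return u_x, p
```

With hypothetical samples `non_rainy` and `rainy`, calling `mann_whitney_u_one_sided(non_rainy, rainy)` and comparing `p` to a chosen significance level is how the null hypothesis would be accepted or rejected.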

Page 7: New York Metro Data Analysis

sec1 | Mann-Whitney U-Test Result

Reject the Null Hypothesis!

H0: “The distributions of rainy and non-rainy ridership populations are equal!”

HA: “No! Ridership of one population tends to be bigger than the other”

Rain may be a good feature…

Page 8: New York Metro Data Analysis

sec2 | Building Our Model

We’ll use the Normal Equation to Find our Solution!

Easy as 1, 2, 3:

1. Design (Data → Features → Matrix)

2. Target (Ridership Entries as Integer Vector)

3. Parameters (Solution Vector that Minimizes Squared Error)
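The three pieces map onto the normal equation, θ = (XᵀX)⁻¹Xᵀy, where X is the design matrix, y the target vector, and θ the parameter vector. The sketch below uses NumPy and a made-up single feature; the function name `normal_equation` and the toy data are assumptions for illustration, not the project's actual feature set.

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y for theta, the least-squares parameters."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical toy data: a single feature (hour) plus an intercept column.
hours = np.array([0.0, 4.0, 8.0, 12.0, 16.0, 20.0])
X = np.column_stack([np.ones_like(hours), hours])  # 1. design matrix
y = 100.0 + 50.0 * hours                           # 2. target vector
theta = normal_equation(X, y)                      # 3. parameters: intercept, slope
```

If XᵀX is singular, for example when dummy variables are perfectly collinear with the intercept, `np.linalg.solve` raises an error; `np.linalg.lstsq` or `np.linalg.pinv` are common alternatives in that case.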

Page 9: New York Metro Data Analysis

sec2 | Linear Regression: Our Model

Coefficient of Determination (R²)

Interpretation

• ~53% of the variation in ridership entries is explained by our model
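The coefficient of determination is computed as R² = 1 − SS_res / SS_tot, the share of variance in the target that the predictions account for. A minimal NumPy sketch, with a hypothetical function name and toy inputs:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy example: predictions that capture most, but not all, of the variation.
example = r_squared([1, 2, 3, 4], [1.5, 1.5, 3.5, 3.5])
```

Perfect predictions give 1, and always predicting the mean of the target gives 0.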

Page 10: New York Metro Data Analysis

sec2 | Model Appropriateness

• Residual Plots Show that our model often underpredicts ridership when entries are 2000+

• Using only Hour, Weekday, UNIT, and Rain as features may not be adequate!

• Suggestions: High Bias Model → Find more Features
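Systematic underprediction can be checked numerically by comparing the average residual (actual minus predicted) below and above a threshold. The sketch below uses invented numbers and a hypothetical helper name; a clearly positive mean residual in the high-entries group indicates underprediction there.

```python
import numpy as np

def mean_residual_by_threshold(y_true, y_pred, threshold=2000):
    """Return (mean residual below threshold, mean residual at or above it).
    Residual = actual - predicted, so positive values mean underprediction.
    Assumes both groups contain at least one observation."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    high = y_true >= threshold
    return residuals[~high].mean(), residuals[high].mean()

# Hypothetical actual and predicted hourly entries.
y_true = np.array([500, 1500, 2500, 4000])
y_pred = np.array([600, 1400, 2000, 3000])
low_mean, high_mean = mean_residual_by_threshold(y_true, y_pred)
```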

Page 11: New York Metro Data Analysis

sec4 | Conclusion

• Mann-Whitney U-Test & Paired Histogram Show Possibility of People Tending to Ride the Metro More on Non-Rainy Days

• Rainy Feature Contributes to 1% Increase in R²

• Need More Features: Incorporate Weather Factors? Foggy? Thunder? Temperature Values?

Image credits

http://pix.avaxnews.com/avaxnews/6a/ab/0001ab6a_medium.jpeg

Page 12: New York Metro Data Analysis

sec0 | References

• [1] "Mann–Whitney U Test." Wikipedia. Wikimedia Foundation, n.d. Web.

• [2] Ng, Andrew. "CS229 Lecture Notes." http://cs229.stanford.edu/notes/cs229notes1.pdf

• [3] Frost, Jim. "Regression Analysis: How Do I Interpret R-squared and Assess the Goodness-of-Fit?" Minitab, 30 May 2013. Web. 15 Sept. 2015.

• [4] NIST/SEMATECH e-Handbook of Statistical Methods. http://www.itl.nist.gov/div898/handbook/pri/section2/pri24.htm

• [5] "GraphPad Curve Fitting Guide." GraphPad, n.d. Web. 16 Sept. 2015. <http://www.graphpad.com/guides/prism/6/curvefitting/index.htm?reg_analysischeck_linearreg.htm>

* data sources: <http://web.mta.info/developers/developer-data-terms.html#data>