Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

Post on 06-Jan-2017

Black Boxes and Unicorns

Jeremy Achin | Data Scientist & CEO | DataRobot

Jeremy Achin?

DataRobot Company History

Timeline (2012 2H to 2015 2H):

● June ’12: Founded
● June ’13: Seed Funding, $3.3M
● July ’14: Series A, $21M
● Bigger & Better Announcements Coming Soon!

DataRobot: better predictive models faster

https://projecteuclid.org/download/pdf_1/euclid.ss/1009213726

Leo Breiman (classification & regression trees, random forest, and my personal hero)

2001: Statistical Modeling: The Two Cultures

● An attack on statisticians who rely solely on regression models

● Argued we should be using the techniques that obtain the best results

● Even a carefully built regression model is just one of many possible representations of the underlying reality

“If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data [regression] models and adopt a more diverse set of tools.”

14 Years Later

Excellent progress in recent years, but...

● still armies of people taking months to manually build regression models (especially in larger companies)

● non-regression methods still thought of as “black box”

Black Box (n) /blak bäks/

A phrase people use when they’re scared of technology they don’t understand and want to keep doing the same thing they’ve been doing for the last twenty years.

What do we really need to know about a predictive model?

1. Overall Performance on Out-of-Sample (Validation) Data

2. Predicted vs Actual by Variable

3. How a model’s predictions change as values of input variables change

None of these depend on the specific algorithm you are using. Even #3!

Overall Out-of-Sample Performance

● Mean Absolute Error
● Weighted Mean Absolute Error
● Root Mean Squared Error
● Root Mean Squared Logarithmic Error
● Mean F Score
● Mean Consequential Error
● Mean Average Precision
● Multi-class Log Loss
● Hamming Loss
● Mean Utility
● Continuous Ranked Probability Score
● AUC
● Average Precision (column-wise)
● Gini
● Average Among Top P
● Mean Average Precision (row-wise)
● Normalized Discounted Cumulative Gain@k
● Mean Average Precision@n
● Levenshtein Distance
● Average Precision
● Absolute Error
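As a rough illustration of why out-of-sample performance does not depend on the algorithm, the sketch below computes four of the metrics listed above from nothing but actual outcomes and predictions. The function names and the tiny holdout set (`y_true`, `y_prob`) are made up for this example and are not from the talk.

```python
import math

def mean_absolute_error(actual, predicted):
    # Average absolute gap between outcome and prediction.
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    # Square root of the average squared gap.
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)

def log_loss(actual, predicted, eps=1e-15):
    # Binary log loss; probabilities are clipped to avoid log(0).
    total = 0.0
    for a, p in zip(actual, predicted):
        p = min(max(p, eps), 1 - eps)
        total += a * math.log(p) + (1 - a) * math.log(1 - p)
    return -total / len(actual)

def auc(actual, predicted):
    # Fraction of (positive, negative) pairs ranked correctly; ties count half.
    pos = [p for a, p in zip(actual, predicted) if a == 1]
    neg = [p for a, p in zip(actual, predicted) if a == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical validation data: outcomes and any model's predicted probabilities.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

for name, metric in [("MAE", mean_absolute_error),
                     ("RMSE", root_mean_squared_error),
                     ("Log Loss", log_loss),
                     ("AUC", auc)]:
    print(f"{name}: {metric(y_true, y_prob):.4f}")
```

Nothing in these functions knows whether the probabilities came from a regression, a random forest, or anything else; they only need predictions on data the model did not see during training.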

Hospital Readmission Model Assessment and Interpretation

[Figure: Hospital Readmission Rate (y-axis) vs. Number of Prior Visits to Hospital (x-axis), built up across slides to compare the Actual Hospital Readmission Rate with the Predicted Hospital Readmission Rate]
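A "predicted vs. actual by variable" view like the one on these slides can be approximated with a simple group-by. This is a minimal sketch under assumptions: the helper `predicted_vs_actual_by_variable` and the six validation rows are hypothetical, and the numbers are not the hospital data from the talk.

```python
from collections import defaultdict

def predicted_vs_actual_by_variable(values, actual, predicted):
    """Group rows by one input variable and return
    {value: (mean actual, mean predicted, row count)}."""
    groups = defaultdict(lambda: [0.0, 0.0, 0])
    for v, a, p in zip(values, actual, predicted):
        g = groups[v]
        g[0] += a   # running sum of actual outcomes
        g[1] += p   # running sum of predictions
        g[2] += 1   # row count
    return {v: (s_a / n, s_p / n, n) for v, (s_a, s_p, n) in sorted(groups.items())}

# Hypothetical validation rows (illustrative only).
prior_visits = [0, 0, 1, 1, 2, 2]
readmitted   = [0, 0, 0, 1, 1, 1]
model_prob   = [0.1, 0.2, 0.4, 0.5, 0.7, 0.9]

table = predicted_vs_actual_by_variable(prior_visits, readmitted, model_prob)
for visits, (actual_rate, predicted_rate, n) in table.items():
    print(f"prior visits={visits}: actual={actual_rate:.2f} "
          f"predicted={predicted_rate:.2f} (n={n})")
```

Plotting the two rates against the variable shows where the model tracks reality and where it drifts, without opening up the model itself.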

Partial Dependence

Section 10.13.2, Partial Dependence Plots (p. 369):

https://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf
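Partial dependence can be sketched in a few lines: for each candidate value of one variable, overwrite that variable in every row, score the rows, and average the predictions. The code below is an illustrative sketch, not DataRobot's implementation; `toy_model`, its coefficients, and the two rows are invented stand-ins for any fitted model and validation set.

```python
def partial_dependence(predict, rows, feature, grid):
    """For each grid value, set `feature` to that value in every row
    and average the model's predictions over all rows."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the original row is untouched
            modified[feature] = value  # force the feature to the grid value
            total += predict(modified)
        curve.append((value, total / len(rows)))
    return curve

# Hypothetical stand-in for any fitted model: only its predict function is used.
def toy_model(row):
    score = 0.1 + 0.15 * row["prior_visits"] + (0.2 if row["age"] > 65 else 0.0)
    return min(score, 1.0)

# Hypothetical validation rows.
rows = [
    {"prior_visits": 0, "age": 40},
    {"prior_visits": 3, "age": 70},
]

curve = partial_dependence(toy_model, rows, "prior_visits", grid=[0, 1, 2])
for value, avg_prediction in curve:
    print(f"prior_visits={value}: average prediction={avg_prediction:.2f}")
```

Because the procedure only calls `predict`, it works the same way for a regression, a tree ensemble, or any other model, which is the sense in which point #3 does not depend on the algorithm.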

Compliance (n) /kəmˈplīəns/

A word people use as a last resort to defend the status quo after they realize that their 100-variable regression model is an arbitrary representation of reality that is less accurate, robust, and interpretable than modern alternatives.

Arbitrary Representations of Reality

Three statisticians sitting at a bar...

One more round?

ftp://ftp.nhtsa.dot.gov/GES/GES12/

● 153,077 Police-reported accidents

● 58 Variables

Goal: Try to Predict Probability of a Fatality

Statistician #1

Variable Name             Regression Coefficient
Restraint Misuse           0.509
Roll Over                  0.355
Alcohol Involved           0.089
Is Driver                 -0.694

Model Performance (Log Loss): 0.469

"Looks like as long as we use seat belts and don't roll over, we’ll survive. Having alcohol in the system doesn’t make much of a difference. Also, being the driver is safe, so I'm driving home."

Statistician #2

Variable Name             Regression Coefficient
Alcohol Involved           1.866
Age                        0.008
Restraint Misuse           0.000
Hour of Accident          -0.019

Model Performance (Log Loss): 0.467

"Hmmm... looks like drinking and driving leads to fatal crashes. Probably shouldn't have another round. Also, the later the better, so let's just wait here until midnight."

Statistician #3

Variable Name             Regression Coefficient
Opening Door In Motion     0.449
Is Police Officer         -0.412
Booster Seat Used         -0.787
Lap And Shoulder Belt     -1.897

Model Performance (Log Loss): 0.422

"No, no, no, we just need to wear lap and shoulder belts with our booster seats, and be police officers. Look at those coefficients! Furthermore, my model is better, so I'm right."

The Killer Potato

Obligatory Data Scientist Definition Slide

Maths & Stats

● Foundational Statistics
● Internals of Algorithms
● Practical Knowledge and Experience

Hacking Skills

● Programming
  ○ Get Data
  ○ Manipulate Data
  ○ Explore Data
  ○ Build Models
  ○ Implement Models

Domain Knowledge

● Understand the Business Problem
● Understanding of the Data

Data Science (center of the Venn diagram)

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

The current path to becoming a Data Scientist

A Better Way

AUTOMATED USING MODERN TOOLS AND COMPUTATIONAL POWER

Takeaways

● There are technique-agnostic ways to assess and interpret predictive models.

● The shortage of Data Scientists will be solved by a combination of pragmatic education and levels of automation currently not thought possible.

Three quick tips for entrepreneurs

Watch out for Lean Startup & MVP Zealots

Minimum viable product (MVP): get the smallest functional product into the market ASAP to de-risk the investment.

Minimum viable product (MVP) is the product with the highest return on investment versus risk.

Be Paranoid and Don’t Rely on Hope.

Choose the Right Investors & Advisors

● Chris Lynch
● Harry Weller
● Jason Seats
● Jit Saxena
● Kevin Dick
● Ray Tacoma
● Brad Gillespie

© DataRobot, Inc. All rights reserved. Confidential

jeremy@datarobot.com