Just the basics_strata_2013

Photo by mikebaird, www.flickr.com/photos/mikebaird Just the Basics: Core Data Science Skills William Cukierski, PhD [email protected] Ben Hamner [email protected]

Transcript of Just the basics_strata_2013

Page 1: Just the basics_strata_2013

Photo by mikebaird, www.flickr.com/photos/mikebaird

Just the Basics: Core Data Science Skills
William Cukierski, PhD [email protected]
Ben Hamner [email protected]

Page 2: Just the basics_strata_2013

JUST the basics

We mean the basics
–  Ask dumb questions (we’ll give dumb answers)
–  We can’t be comprehensive, but we can omit pretense and jargon
–  Expect a little Python, R, Matlab, Excel, command line, hand-waving

Page 3: Just the basics_strata_2013

Pronounced Kah-gull (as in waggle), not Kegel (as in bagel)

Page 4: Just the basics_strata_2013

Before we get started

You’ll need a Kaggle account
www.kaggle.com/account/register

Create a team for the competition
www.kaggle.com/c/just-the-basics-strata-2013
Add (Strata) to the end of your team name, e.g. William Cukierski (Strata)

Page 5: Just the basics_strata_2013

Agenda:
Preliminaries
Identifying a Problem
Performing the analysis
Visualizing the Solution
Contest

Page 6: Just the basics_strata_2013

Will background

Physics & Biomedical Engineering
–  Studied machine learning for diagnosis of pathology images
–  Constantly reinventing sophomore-level CS concepts

Former “successful” machine learning competitor
–  Successful?
   •  Finished near top?
   •  Got me a job?
   •  Fooled people into believing I understand stats (a.k.a. “data scientist”)

Page 7: Just the basics_strata_2013

Ben Background

Biomedical Engineering & Electrical Engineering
–  Applied machine learning to improve brain-computer interface
–  Software development in various languages / domains

Machine learning competitions
–  Top finishes in many 2010-2011
–  Teamed up with Will on several
–  Switched to the dark side, spent much of the past year designing competitions at Kaggle

Driving a Brain-Controlled Wheelchair

Page 8: Just the basics_strata_2013

The unfortunate hype of modern analytics
•  BIG DATA!
•  Every second 6.2 trillion exabytes of data are being collected
•  Need shared vocabulary, shared protocols
•  Need to leverage
   –  weather reports
   –  surveys
   –  text documents
   –  human genomes
   –  regulatory information
   –  cell phone logs
   –  satellite surveillance
   –  etc.
   –  etc.
   –  etc.

Page 9: Just the basics_strata_2013

What do we do about it?

•  Create committees, consortiums, taxonomies, platforms, frameworks, clouds
•  Create acronyms for our committees, consortiums, taxonomies, platforms, frameworks, clouds
•  Go to conferences to promote and learn about our acronym’d things
•  And if time permits and the mood strikes?

work

Page 10: Just the basics_strata_2013
Page 11: Just the basics_strata_2013

I’m ready to leave now

Page 12: Just the basics_strata_2013

Big Data Barry

Lives by the Shirky Principle: preserving the problem to which he is the solution

Favorite talking points
Data provenance, data warehousing, data privacy, data regulations, data silos, need for standards, need for standards on standards of standards, lack of data correctness, need for communication

Source: http://mojette.deviantart.com/

Page 13: Just the basics_strata_2013

Listen, I’ve been in this field for 22 years. The Bayesian guys in the modeling group are never gonna talk to the IT guys because they don’t speak the same language. In my 22 years of experience, what we need are tighter standards around what the processes should be for requesting data, how that data should be stored, and who should have access to the data. Also privacy. Privacy is a thing about which I have no clue, but nonetheless I’m compelled to steamroll even the most benign use of our data for anything beyond occupying a database. Oh, and speaking of databases and my 22 years of experience, we need stricter governance about the schemas and policies that inform the ways the data gets federated, so the model guys will stop trying to implement things that’ll never work…

Page 14: Just the basics_strata_2013

Seriously, guys, let me out

Page 15: Just the basics_strata_2013

The plight of the data scientist

Job description:
Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.

Job reality:
Data Scientist (n.): Person who is worse at statistics than any statistician and worse at software engineering than any software engineer.

Page 16: Just the basics_strata_2013
Page 17: Just the basics_strata_2013

This problem can only be solved by an 8th-order kernel projection onto an orthonormal space of homoscedastic eigentensors

The boss is going to have my neck if I can’t get this Hadoop iPhone app ready in time for Strata

I’m making an Excel VBA script to access our Oracle database and find the mean of the revenue column

Data science (noun): Statistics done wrong

Page 18: Just the basics_strata_2013

Data science
The application of scientific experimentation (hypothesis testing, model generation, statistical analysis) in problem-agnostic ways.

Not data science
{infographics, apps, site architecture, sending JSON thingies around, JavaScript frameworks, web analytics, plotting tweets on maps, cloud storage, domains that end in .io, any idea/thing/product that touches data}

Page 19: Just the basics_strata_2013

Agenda:
Preliminaries
Identifying a Problem
Performing the analysis
Visualizing the Solution
Contest

Page 20: Just the basics_strata_2013

Analytics
–  Optimization: What’s the best that can happen?
–  Predictive modeling: What will happen next?
–  Forecasting/extrapolation: What if these trends continue?
–  Statistical analysis: Why is this happening?

Access and reporting
–  Alerts: What actions are needed?
–  Query/drill down: What exactly is the problem?
–  Ad hoc reports: How many, how often, where?
–  Standard reports: What happened?

(Chart axes: Gain, Sophistication)

Source: Competing on Analytics, Davenport/Harris, 2007

Page 21: Just the basics_strata_2013

When to use data

Asking specific questions is mostly harmless
–  How many users bought shampoo X at store Y last quarter?

Prediction is not a free lunch
–  Being data-driven and wrong is easy and bad
–  Fancy models should serve fancy questions
   •  Don’t forecast something that can be measured

Human knowledge precedes machine knowledge
–  Sometimes black boxes work
–  Often, they don’t: earthquakes, finance models, etc.

Page 22: Just the basics_strata_2013

When to use data

Human experts are good at generalization

Human experts are bad at
–  Accurate predictions
–  Estimating the uncertainty of their predictions
–  Making the same prediction under the same evidence
–  Updating predictions in the face of new evidence
–  Ignoring unrelated evidence

Page 23: Just the basics_strata_2013

http://www.nytimes.com/interactive/science/rock-paper-scissors.html

Page 24: Just the basics_strata_2013

We need to teach the computer to generalize

laptop:~ wcuk$ RUN IT’S A BEAR
-bash: BEAR: threat not found

Page 25: Just the basics_strata_2013

…without overfitting

laptop:~ wcuk$ RUN IT’S A BEAR
run: Must specify one of -black -grizzly -teddy
laptop:~ wcuk$ RUN IT’S A BEAR -grizzly
run: Are you sure you want to run? (y/n) y
run: Enter the bear’s name: Rupert
run: Is it Rupert with the scar on his ear? He’s cool. He’s more of a salmon kind of bear. (y/n): n
run:...RUN!!!!!!!

Page 26: Just the basics_strata_2013

“If you wish to make an apple pie from scratch, you must first invent the universe.” – Carl Sagan

Storing data

Binary, Text, Database

Page 27: Just the basics_strata_2013

Reading data into a useful format

We overcomplicate storage and formats
–  Databases are quite often a bad choice
–  Most data science is a batch process on tabular data
–  Your debugging cycle should be fast

Why text?
–  Simple
–  Universal
–  Fast (to read/write/debug)
–  Transparent
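To make the "why text" point concrete, here is a minimal sketch using Python's standard csv module. The column names and values are made up for illustration, and the data is held in a string so the example runs without a file on disk.

```python
import csv
import io

# Hypothetical tabular data; in practice this would be a text file on disk.
raw_text = "product_id,dept,price\n54323,Household,54.95\n92356,Audio,7.99\n"

def read_table(handle):
    """Read delimited text into a list of dicts, one per row."""
    return list(csv.DictReader(handle))

rows = read_table(io.StringIO(raw_text))

# Everything arrives as strings, so convert numeric columns explicitly.
prices = [float(row["price"]) for row in rows]
mean_price = sum(prices) / len(prices)
print(rows[0]["dept"], round(mean_price, 2))
```

The same read_table function works on a real file by passing open("some_file.csv", newline="") instead of the in-memory string.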

Page 28: Just the basics_strata_2013

Most data is not useful for scientific experimentation
–  Too “macro” (lacking causal detail)
–  Meant for human consumption

Page 29: Just the basics_strata_2013

Structured data is not always machine ready

Game 1
Seat 1: Solracca ($95.30 in chips)
Seat 2: BrickT63 ($127.10 in chips)
Seat 3: sven160482 ($184.30 in chips)
Seat 4: Adelantez ($103 in chips)
Seat 6: manfred zeal ($155.50 in chips)
Solracca: posts small blind $0.50
BrickT63: posts big blind $1
*** HOLE CARDS ***
sven160482: raises $1 to $2
Adelantez: raises $5.50 to $7.50
manfred zeal: folds
Solracca: folds
BrickT63: folds
sven160482: folds
Uncalled bet ($5.50) returned to Adelantez
Adelantez collected $5.50 from pot
*** SUMMARY ***
Total pot $5.50 | Rake $0
Seat 4: Adelantez collected ($5.50)

Game 2
Seat 1: Kingcovey ($108.65 in chips)
Seat 3: VoronIN_exe ($119.80 in chips)
Seat 4: ehle123 ($104 in chips)
Seat 5: MercuriusAA ($107.60 in chips)
Seat 6: budapestkin ($133.15 in chips)
budapestkin: posts small blind $0.50
Kingcovey: posts big blind $1
*** HOLE CARDS ***
VoronIN_exe: raises $2 to $3
ehle123: folds
MercuriusAA: folds
budapestkin: calls $2.50
Kingcovey: folds
*** FLOP *** [7c Tc Ks]
budapestkin: checks
VoronIN_exe: bets $4.45
budapestkin: calls $4.45
*** TURN *** [7c Tc Ks] [8c]
budapestkin: checks
VoronIN_exe: checks
*** RIVER *** [7c Tc Ks 8c] [Kc]
budapestkin: bets $11
VoronIN_exe: folds
Uncalled bet ($11) returned to budapestkin
budapestkin collected $15.15 from pot
*** SUMMARY ***
Total pot $15.90 | Rake $0.75
Seat 6: budapestkin collected ($15.15)
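One way to make a log like this machine ready is to pull each line into a row with regular expressions. The sketch below is an illustration only: the two patterns are assumptions written against the handful of line types shown in the hand histories above, not a complete parser for any poker site's format.

```python
import re

# A few lines in the hand-history format shown above.
hand_text = """Seat 1: Solracca ($95.30 in chips)
Seat 2: BrickT63 ($127.10 in chips)
Solracca: posts small blind $0.50
BrickT63: posts big blind $1
sven160482: raises $1 to $2
manfred zeal: folds
"""

# Assumed patterns: seat lines, and "player: action [$amount]" lines.
SEAT_RE = re.compile(r"^Seat (\d+): (.+) \(\$([\d.]+) in chips\)$")
ACTION_RE = re.compile(
    r"^(.+?): (posts small blind|posts big blind|raises|calls|bets|checks|folds)"
    r"(?: \$([\d.]+))?"
)

def parse_hand(text):
    """Split a hand history into seat rows and action rows."""
    seats, actions = [], []
    for line in text.splitlines():
        seat = SEAT_RE.match(line)
        if seat:
            seats.append((int(seat.group(1)), seat.group(2), float(seat.group(3))))
            continue
        action = ACTION_RE.match(line)
        if action:
            # For a raise, this keeps the first amount (the raise size),
            # not the "to $X" total.
            amount = float(action.group(3)) if action.group(3) else None
            actions.append((action.group(1), action.group(2), amount))
    return seats, actions

seats, actions = parse_hand(hand_text)
```

Lines that match neither pattern (section markers, pot summaries) are skipped, which is the kind of silent gap that makes hand-rolled scrapers brittle.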

Page 30: Just the basics_strata_2013

A word of caution on scraping
•  Scraping is time intensive, unleveraged, brittle
•  Before you code, research existing libraries!
   –  Will solve 95% of the problems you don’t even know you will have
   –  E.g. web scraping using Python’s BeautifulSoup

import re
from urllib.request import urlopen  # the slide used Python 2's urllib2
from bs4 import BeautifulSoup

# uniqify() and getStuff() are helper functions not shown on the slide
page = urlopen("http://www.kaggle.com/competitions")
soup = BeautifulSoup(page.read(), "html.parser")
allLinks = uniqify(soup.find_all('a'))
for link in allLinks:
    href = link.get('href')
    # keep only competition links; skip anchors with no href
    if href and re.search('^/c/.*', href):
        fileName = href.replace('/', '_') + ".zip"
        fileName = fileName[3:]  # drop the leading "_c_"
        getStuff(fileName, "http://www.kaggle.com" + href + "/publicleaderboarddata.zip")

Page 31: Just the basics_strata_2013

Excel has a time and place
–  Looking at data
–  Pivot tables
–  Quick plots to verify things

Never:
–  Pass spreadsheets around
–  “Code” in Excel
–  Create workflows that require copy/pasting data around

Page 32: Just the basics_strata_2013

Excel

Page 33: Just the basics_strata_2013

Agenda:
Preliminaries
Identifying a Problem
Performing the analysis
Visualizing the Solution
Contest

Page 34: Just the basics_strata_2013

Command line

Page 35: Just the basics_strata_2013

Glossary

features = attributes = independent variables

targets = gold standard = ground truth = dependent variable(s)

training set = data & targets used to train a model

validation set = data & targets used as feedback in model training

test set = separate data & targets used only to evaluate the model

cross validation = partitioning the training set to estimate how well a model will generalize
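As a rough sketch of the cross validation entry, the function below partitions row indices into k folds using only the standard library. The function name and the round-robin fold assignment are illustrative choices, not something the slides prescribe.

```python
import random

def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_idx, validation_idx) pairs for k-fold cross validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds, round-robin.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, validation_idx

# Example: 10 rows, 5 folds. Each row lands in exactly one validation fold.
splits = list(k_fold_indices(10, 5))
```

A model is trained on each train_idx and scored on the matching validation_idx; the average of the k scores is the estimate of how well it will generalize. The test set stays out of this loop entirely.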

Page 36: Just the basics_strata_2013

Train

Test

Read, Feature Extraction, Learn

Generalize

Page 37: Just the basics_strata_2013

Bayes theorem

How to update beliefs in the face of evidence?
For proposition A and evidence B:
–  P(A) = prior (belief in A)
–  P(B) = evidence
–  P(A | B) = posterior (belief in A given B)
–  P(B | A) = likelihood

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

$$P(\text{female} \mid \text{long hair}) = \frac{P(\text{long hair} \mid \text{female})\,P(\text{female})}{P(\text{long hair})}$$
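A small worked version of the long-hair example, as a sketch. The three input probabilities are made-up numbers chosen only to show the arithmetic; the slides give no values.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

# Made-up numbers for illustration only.
p_female = 0.5               # P(female), the prior
p_long_given_female = 0.6    # P(long hair | female), the likelihood
p_long_given_male = 0.1      # P(long hair | male)

# The law of total probability gives the evidence term P(long hair).
p_long = p_long_given_female * p_female + p_long_given_male * (1 - p_female)

p_female_given_long = posterior(p_long_given_female, p_female, p_long)
print(round(p_female_given_long, 3))
```

With these assumed inputs the evidence term is 0.35, so seeing long hair moves the belief in "female" from the prior of 0.5 to 0.3 / 0.35, roughly 0.86.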

Page 38: Just the basics_strata_2013

R

Page 39: Just the basics_strata_2013

MATLAB

Page 40: Just the basics_strata_2013

Agenda:
Preliminaries
Identifying a Problem
Performing the analysis
Visualizing the Solution
Contest

Page 41: Just the basics_strata_2013
Page 42: Just the basics_strata_2013

Visualization

Speak the language of your audience
–  Use simple plots
–  Use units that matter (dollars, time, widgets)
–  Include the units!
–  Don’t use acronyms!

Most visualization should be internal facing (am I doing this right?) and not external facing (hey check this out!)

Page 43: Just the basics_strata_2013

•  Plotting raw features
•  Looking for outliers, anomalies, correlation

•  Verifying feature selection or dimensionality reduction
•  Looking at manifold density
•  Looking at class separation

•  Babysitting model performance
•  Looking for optima
•  Watching for sensitivity to initial conditions, perturbations

•  Summarizing
•  Checking the result is reasonable
•  Comparisons to the alternative

Page 44: Just the basics_strata_2013

Your job is to solve a problem
–  Sell the message, not the graphic

Avoid chartjunk
“The purpose of decoration varies – to make the graphic appear more scientific and precise, to enliven the display, to give the designer an opportunity to exercise artistic skills. Regardless of its cause, it is all non-data-ink or redundant data-ink, and it is often chartjunk.” – Edward Tufte

Page 45: Just the basics_strata_2013

source: http://i.dailymail.co.uk/i/pix/2012/03/21/article-2118152-124602BE000005DC-0_964x528.jpg

Page 46: Just the basics_strata_2013

source: http://www.fivethirtyeight.com/2009/10/older-and-wealthier-people-are-more.html

Page 47: Just the basics_strata_2013

Election fraud: 2D histograms of the number of units for a given voter turnout (x axis) and the percentage of votes (y axis) for the winning party

source: http://www.pnas.org/content/early/2012/09/20/1210722109.abstract

Page 48: Just the basics_strata_2013

ggplot2

Page 49: Just the basics_strata_2013

Agenda:
Preliminaries
Identifying a Problem
Performing the analysis
Visualizing the Solution
Contest

Page 50: Just the basics_strata_2013

Make a spam detector

The data represents a corpus of emails. Some are spam and some are normal.
•  Due to time constraints, feature extraction is done for you:
   –  train.csv - contains 600 emails x 100 features
   –  train_labels.csv - contains the 600 training labels (1 = spam, 0 = normal)
   –  test.csv - contains 4000 emails x 100 features
•  Submit a file with each of the 4000 predictions on a separate line (in the same order as test.csv).
   –  No header is necessary
   –  Predictions can be continuous numbers or 0/1 labels
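The submission format described above can be sketched as follows. The function name, the output file name, and the four example predictions are hypothetical; a real submission would hold 4000 values in test.csv order.

```python
import os
import tempfile

def write_submission(predictions, path):
    """Write one prediction per line, no header, in test.csv row order."""
    with open(path, "w") as handle:
        for value in predictions:
            handle.write("%s\n" % value)

# Hypothetical predictions: continuous scores and 0/1 labels are both allowed.
example_predictions = [0.91, 0.05, 1, 0]

# Written to a temporary directory here so the sketch leaves no files behind.
submission_path = os.path.join(tempfile.mkdtemp(), "submission.csv")
write_submission(example_predictions, submission_path)
```

Because the file carries no row identifiers, the only thing tying a prediction to an email is its line position, so the order must never be shuffled between reading test.csv and writing the file.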

Page 51: Just the basics_strata_2013

How the leaderboard works

Return%   ProductID   Dept        Price     MFR
1.94      54323       Household   54.95     USA
0.023     92356       Household   9.95      USA
0.8       78023       Computer    4.5       China
0.01      12340       Audio       109.99    China
0.41      31240       Audio       29.99     Taiwan
0.97      12351       Hardware    54.95     Mexico
0.0115    90141       Hardware    4.99      USA
0.4       81240       Hardware    6.55      Taiwan
0.03      14896       Computer    211.99    Korea
0.205     62132       Computer    1100      USA
1.6878    54323       Audio       34.99     USA
0.0345    92356       Audio       7.99      USA
0.64      78023       Household   229.9     Brazil
0.72      12340       Audio       19.95     Mexico
0.41      31240       Computer    6.99      Taiwan
1.94      54323       Hardware    11.99     Taiwan
0.023     92356       Household   2.05      USA
0.08      78023       Computer    99.99     USA
2.09      12340       Computer    129.99    China
1.1       31240       Audio       18.99     China

Page 52: Just the basics_strata_2013

How the leaderboard works

Return%   ProductID   Dept        Price     MFR
Training
1.94      54323       Household   54.95     USA
0.023     92356       Household   9.95      USA
0.8       78023       Computer    4.5       China
0.01      12340       Audio       109.99    China
0.41      31240       Audio       29.99     Taiwan
0.97      12351       Hardware    54.95     Mexico
0.0115    90141       Hardware    4.99      USA
0.4       81240       Hardware    6.55      Taiwan
0.03      14896       Computer    211.99    Korea
0.205     62132       Computer    1100      USA
1.6878    54323       Audio       34.99     USA
0.0345    92356       Audio       7.99      USA
0.64      78023       Household   229.9     Brazil
Test
0.72      12340       Audio       19.95     Mexico
0.41      31240       Computer    6.99      Taiwan
1.94      54323       Hardware    11.99     Taiwan
0.023     92356       Household   2.05      USA
0.08      78023       Computer    99.99     USA
2.09      12340       Computer    129.99    China
1.1       31240       Audio       18.99     China

Solution “Ground Truth”

Page 53: Just the basics_strata_2013

How the leaderboard works

Return%   ProductID   Dept        Price     MFR
Training
1.94      54323       Household   54.95     USA
0.023     92356       Household   9.95      USA
0.8       78023       Computer    4.5       China
0.01      12340       Audio       109.99    China
0.41      31240       Audio       29.99     Taiwan
0.97      12351       Hardware    54.95     Mexico
0.0115    90141       Hardware    4.99      USA
0.4       81240       Hardware    6.55      Taiwan
0.03      14896       Computer    211.99    Korea
0.205     62132       Computer    1100      USA
1.6878    54323       Audio       34.99     USA
0.0345    92356       Audio       7.99      USA
0.64      78023       Household   229.9     Brazil
Test
?         12340       Audio       19.95     Mexico
?         31240       Computer    6.99      Taiwan
?         54323       Hardware    11.99     Taiwan
?         92356       Household   2.05      USA
?         78023       Computer    99.99     USA
?         12340       Computer    129.99    China
?         31240       Audio       18.99     China

Solution “Ground Truth”

Page 54: Just the basics_strata_2013

How the leaderboard works

Return%   ProductID   Dept        Price     MFR
Training
1.94      54323       Household   54.95     USA
0.023     92356       Household   9.95      USA
0.8       78023       Computer    4.5       China
0.01      12340       Audio       109.99    China
0.41      31240       Audio       29.99     Taiwan
0.97      12351       Hardware    54.95     Mexico
0.0115    90141       Hardware    4.99      USA
0.4       81240       Hardware    6.55      Taiwan
0.03      14896       Computer    211.99    Korea
0.205     62132       Computer    1100      USA
1.6878    54323       Audio       34.99     USA
0.0345    92356       Audio       7.99      USA
0.64      78023       Household   229.9     Brazil
Test
0.03      12340       Audio       19.95     Mexico
1.298     31240       Computer    6.99      Taiwan
0.94      54323       Hardware    11.99     Taiwan
0.04      92356       Household   2.05      USA
0.36      78023       Computer    99.99     USA
1.2       12340       Computer    129.99    China
0.02      31240       Audio       18.99     China

Submission

Page 55: Just the basics_strata_2013

How the leaderboard works

Return%   ProductID   Dept        Price     MFR
Training
1.94      54323       Household   54.95     USA
0.023     92356       Household   9.95      USA
0.8       78023       Computer    4.5       China
0.01      12340       Audio       109.99    China
0.41      31240       Audio       29.99     Taiwan
0.97      12351       Hardware    54.95     Mexico
0.0115    90141       Hardware    4.99      USA
0.4       81240       Hardware    6.55      Taiwan
0.03      14896       Computer    211.99    Korea
0.205     62132       Computer    1100      USA
1.6878    54323       Audio       34.99     USA
0.0345    92356       Audio       7.99      USA
0.64      78023       Household   229.9     Brazil
Test
0.03      12340       Audio       19.95     Mexico
1.298     31240       Computer    6.99      Taiwan
0.94      54323       Hardware    11.99     Taiwan
0.04      92356       Household   2.05      USA
0.36      78023       Computer    99.99     USA
1.2       12340       Computer    129.99    China
0.02      31240       Audio       18.99     China

Submission

Public Leaderboard / Private Leaderboard
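The public/private idea can be sketched as scoring one submission on two disjoint parts of the test set. Two things here are assumptions and not stated on the slides: the metric (mean absolute error) and which test rows count as public (the first three). The numbers are the Return% values for the seven test rows shown on the slides above.

```python
def mean_absolute_error(truth, predicted):
    """Average absolute difference between truth and prediction."""
    return sum(abs(t - p) for t, p in zip(truth, predicted)) / len(truth)

def leaderboard_scores(truth, submission, public_idx):
    """Score a submission separately on the public and private test rows."""
    public = set(public_idx)
    pub_rows = [i for i in range(len(truth)) if i in public]
    priv_rows = [i for i in range(len(truth)) if i not in public]

    def score(rows):
        return mean_absolute_error([truth[i] for i in rows],
                                   [submission[i] for i in rows])

    return score(pub_rows), score(priv_rows)

# Return% for the seven test rows: hidden ground truth vs. submitted values.
ground_truth = [0.72, 0.41, 1.94, 0.023, 0.08, 2.09, 1.1]
submission = [0.03, 1.298, 0.94, 0.04, 0.36, 1.2, 0.02]

# Assumed split: the first three test rows feed the public leaderboard.
public_score, private_score = leaderboard_scores(
    ground_truth, submission, public_idx=[0, 1, 2])
```

Competitors see only the public score while the contest runs, so tuning against it can overfit; the private score on the remaining rows is what guards against that.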

Page 56: Just the basics_strata_2013

Area under the receiver operating characteristic curve
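One way to read this metric: it is the probability that a randomly chosen spam email gets a higher score than a randomly chosen normal one. The sketch below computes it directly from that pairwise definition; the labels and scores are made up, and the double loop is written for clarity, not speed.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a random positive outranks a random negative.

    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up example: 1 = spam, 0 = normal.
labels = [1, 0, 1, 0, 1]
continuous = [0.9, 0.2, 0.6, 0.7, 0.8]   # continuous scores
binary = [1, 0, 1, 1, 1]                 # the same idea, thresholded to 0/1

auc_continuous = roc_auc(labels, continuous)
auc_binary = roc_auc(labels, binary)
```

Only the ordering of the scores matters, not their scale. Thresholding to 0/1 creates ties, and ties throw away ranking information, which is why continuous predictions often score better under this metric.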

Page 57: Just the basics_strata_2013

Example Model

Page 58: Just the basics_strata_2013

Think about

•  Missing values
•  Noise
•  Combinations of features
•  Transformations of features (e.g. log)
•  Combinations of methods
•  Overfitting
•  Binary vs. continuous predictions
•  How good is a good spam detector?