Iterating over statistical models: NCAA tournament edition


Page 1: Iterating over statistical models: NCAA tournament edition

ITERATING OVER STATISTICAL MODELS: NCAA TOURNAMENT EDITION

Page 2: Iterating over statistical models: NCAA tournament edition

Daniel Lee

WHY ARE YOU LISTENING TO ME?

▸ Stan developer: http://mc-stan.org

▸ Researcher at Columbia

▸ Co-founder of Stan Group: training / statistical support / consulting

▸ email: [email protected]; twitter: @djsyclik / @mcmc_stan; web: http://syclik.com

Page 3: Iterating over statistical models: NCAA tournament edition

MARCH MADNESS

Page 4: Iterating over statistical models: NCAA tournament edition

WHAT IS MARCH MADNESS?

Page 5: Iterating over statistical models: NCAA tournament edition

WHAT IS MARCH MADNESS?

Page 6: Iterating over statistical models: NCAA tournament edition

CHAMPIONSHIP GAME: VILLANOVA VS NORTH CAROLINA

Page 7: Iterating over statistical models: NCAA tournament edition

P(VILLANOVA WIN)?

CHAMPIONSHIP GAME: VILLANOVA VS NORTH CAROLINA

Page 8: Iterating over statistical models: NCAA tournament edition

P(VILLANOVA WIN)?

VILLANOVA WILDCATS VS. NORTH CAROLINA TAR HEELS

▸ 1:00. 70 - 69

▸ 0:35. 72 - 69

▸ 0:23. 72 - 71

Page 9: Iterating over statistical models: NCAA tournament edition

P(VILLANOVA WIN)?

VILLANOVA WILDCATS VS. NORTH CAROLINA TAR HEELS

▸ 1:00. 70 - 69

▸ 0:35. 72 - 69

▸ 0:23. 72 - 71

▸ 0:13. 74 - 71

Page 10: Iterating over statistical models: NCAA tournament edition

P(VILLANOVA WIN)?

VILLANOVA WILDCATS VS. NORTH CAROLINA TAR HEELS

▸ 1:00. 70 - 69

▸ 0:35. 72 - 69

▸ 0:23. 72 - 71

▸ 0:13. 74 - 71

▸ 0:06. 74 - 74

Page 11: Iterating over statistical models: NCAA tournament edition

P(VILLANOVA WIN)?

VILLANOVA WILDCATS VS. NORTH CAROLINA TAR HEELS

▸ 1:00. 70 - 69

▸ 0:35. 72 - 69

▸ 0:23. 72 - 71

▸ 0:13. 74 - 71

▸ 0:06. 74 - 74

▸ 0:00. 77 - 74. Villanova wins.

Page 12: Iterating over statistical models: NCAA tournament edition
Page 13: Iterating over statistical models: NCAA tournament edition

TREAT STATISTICAL MODELS AS CODE

Page 14: Iterating over statistical models: NCAA tournament edition

WRITING CODE IS A DISCIPLINE

Page 15: Iterating over statistical models: NCAA tournament edition

WRITING CODE IS A DISCIPLINE

▸ design patterns

▸ testing

▸ code review

▸ maintenance

▸ modularity

▸ collaboration

CAN BE

Page 16: Iterating over statistical models: NCAA tournament edition

IS STATISTICAL MODELING A DISCIPLINE?

▸ Art or science?

▸ Models have names

▸ Statistical model vs implementation

▸ Collaboration on the statistical model?

Page 17: Iterating over statistical models: NCAA tournament edition

TREAT STATISTICAL MODELS AS CODE

WHAT DO WE NEED TO DO

▸ Elevate statistical models: first class entity

▸ Modularize

▸ Language

▸ Discuss subtle details

▸ Collaborate

Page 18: Iterating over statistical models: NCAA tournament edition

TREAT STATISTICAL MODELS AS CODE

STAN GETS US CLOSE

▸ statistical modeling language

▸ domain-specific language: has its own grammar; not R or BUGS!

▸ rstan: shinystan, RStudio integration, rstanarm

▸ open-source: core libraries are new BSD

▸ Stan program

▸ plain text

▸ plays nicely with source repositories

▸ imperative language

Note: Stan isn’t the only thing you can do this with

Page 19: Iterating over statistical models: NCAA tournament edition

MODELING BASKETBALL

Page 20: Iterating over statistical models: NCAA tournament edition

MARCH MADNESS

BASKETBALL HISTORY

▸ 1891

▸ Dr. James Naismith. Springfield, MA

▸ Non-contact conditioning

▸ 13 simple rules

▸ Peach basket

Page 21: Iterating over statistical models: NCAA tournament edition

MARCH MADNESS

BASKETBALL HISTORY

▸ 1891

▸ Dr. James Naismith. Springfield, MA

▸ Non-contact conditioning

▸ 13 simple rules

▸ Peach basket

10. The umpire shall be judge of the men and shall note the fouls and notify the referee when three consecutive fouls have been made. He shall have power to disqualify men according to Rule 5.

Page 22: Iterating over statistical models: NCAA tournament edition

MARCH MADNESS

BASKETBALL NOW

▸ 2 x 20 min halves

▸ Increasing score

▸ Points increment by 2, 3, and 1

▸ 5 players, unlimited substitutions

▸ player DQ: 5th foul

▸ Bonus: 7 team fouls

▸ Double bonus: 10 team fouls

Page 23: Iterating over statistical models: NCAA tournament edition

MARCH MADNESS

DATA

▸ 351 NCAA Division 1 Men’s basketball teams

▸ 33 conferences

▸ 5421 games

▸ 24 - 35 games per team

▸ Max 3 observations

Page 24: Iterating over statistical models: NCAA tournament edition
Page 25: Iterating over statistical models: NCAA tournament edition

IS THIS BIG DATA?

Page 26: Iterating over statistical models: NCAA tournament edition

TALL DATA VS WIDE DATA

▸ Tall data: lots of replications

▸ Wide data: lots of fields (day, home, score, ot, fgm, fga, 3pm, 3pa, 3a, fta, ftm, or, dr, ast, to, stl, blk, pf)

Page 27: Iterating over statistical models: NCAA tournament edition

THREE STEPS OF BAYESIAN DATA ANALYSIS

Page 28: Iterating over statistical models: NCAA tournament edition

ANDREW GELMAN IN BDA

THE THREE STEPS OF BAYESIAN DATA ANALYSIS

1. Set up full probability model

2. Condition on observed data

3. Evaluate the fit of the model

Page 29: Iterating over statistical models: NCAA tournament edition

ANDREW GELMAN IN BDA

THE THREE STEPS OF BAYESIAN DATA ANALYSIS

1. Set up full probability model: write a Stan program

2. Condition on observed data

3. Evaluate the fit of the model

Page 30: Iterating over statistical models: NCAA tournament edition

ANDREW GELMAN IN BDA

THE THREE STEPS OF BAYESIAN DATA ANALYSIS

1. Set up full probability model: write a Stan program

2. Condition on observed data: run RStan

3. Evaluate the fit of the model

Page 31: Iterating over statistical models: NCAA tournament edition

ANDREW GELMAN IN BDA

THE THREE STEPS OF BAYESIAN DATA ANALYSIS

1. Set up full probability model: write a Stan program

2. Condition on observed data: run RStan

3. Evaluate the fit of the model: R, ShinyStan, posterior predictive checks
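The three steps can be walked through on a toy problem. The sketch below is not the Stan/RStan workflow from the slides: it assumes a conjugate Beta-Binomial model for a single team's win rate so that everything runs in plain Python. The prior, the function names, and the 24-of-30 record are all invented for illustration.

```python
import random

# Step 1 (assumed toy model): p ~ Beta(1, 1), wins ~ Binomial(games, p).

def posterior_params(wins, games, prior_a=1.0, prior_b=1.0):
    """Step 2: condition the Beta prior on the observed win/loss record."""
    return prior_a + wins, prior_b + (games - wins)

def posterior_predictive_wins(wins, games, draws=2000, seed=0):
    """Step 3: simulate replicated win totals from the fitted model."""
    rng = random.Random(seed)
    a, b = posterior_params(wins, games)
    replicated = []
    for _ in range(draws):
        p = rng.betavariate(a, b)  # one posterior draw of the win rate
        replicated.append(sum(rng.random() < p for _ in range(games)))
    return replicated

# Toy data, invented for illustration: 24 wins in 30 games.
wins, games = 24, 30
rep = posterior_predictive_wins(wins, games)

# Posterior predictive check: how often does replicated data reach the observed total?
ppp = sum(r >= wins for r in rep) / len(rep)
print(posterior_params(wins, games), round(ppp, 2))
```

A value of `ppp` near 0 or 1 would suggest the model has trouble reproducing the observed record. With Stan, step 2 would be done by MCMC rather than a closed-form update.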

Page 32: Iterating over statistical models: NCAA tournament edition

ITERATING OVER MODELS

Page 33: Iterating over statistical models: NCAA tournament edition

ITERATING OVER STATISTICAL MODELS

STATISTICAL MODEL #1

▸ Only 2015-2016 matters

▸ Teams have a latent ability

▸ “logistic regression”

▸ “Bradley-Terry model”

$$y \sim \mathrm{bernoulli}\left(\mathrm{logit}^{-1}(\theta_1 - \theta_2)\right)$$
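As a rough illustration of what this likelihood implies, the sketch below turns two latent abilities into a win probability with the inverse logit. The function names and the ability values are hypothetical; in the actual model the abilities are estimated from the season's games, not supplied by hand.

```python
import math

def inv_logit(x: float) -> float:
    """Inverse logit (logistic) function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def win_probability(theta_1: float, theta_2: float) -> float:
    """P(team 1 beats team 2) when y ~ bernoulli(inv_logit(theta_1 - theta_2))."""
    return inv_logit(theta_1 - theta_2)

# Hypothetical ability values, chosen only for illustration.
theta_a, theta_b = 0.8, 0.2
print(win_probability(theta_a, theta_b))
print(win_probability(theta_b, theta_a))  # the two probabilities sum to 1
```

Only the difference in abilities matters, so adding the same constant to every team's ability leaves all predictions unchanged. That is one reason priors (or another identifying constraint) are needed when fitting.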

Page 34: Iterating over statistical models: NCAA tournament edition
Page 35: Iterating over statistical models: NCAA tournament edition

P(VILLANOVA > UNC | DATA) = 0.73

Page 36: Iterating over statistical models: NCAA tournament edition
Page 37: Iterating over statistical models: NCAA tournament edition
Page 38: Iterating over statistical models: NCAA tournament edition

ITERATING OVER STATISTICAL MODELS

TREAT STATISTICAL MODELS AS CODE

▸ Statistical model in a separate file

▸ Git

▸ Testing

▸ Inspection of fit

▸ Backtest on historical data

▸ Priors

▸ “Model #1”
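One way to make "testing" and "backtest on historical data" concrete is to score predicted win probabilities against known outcomes. The sketch below uses mean log loss as the score. The slides do not say which metric was used, so treat that choice, the function name, and the toy numbers as assumptions.

```python
import math

def log_loss(predictions, outcomes, eps=1e-15):
    """Mean negative log-likelihood of 0/1 outcomes under predicted probabilities."""
    if len(predictions) != len(outcomes):
        raise ValueError("predictions and outcomes must have the same length")
    if not predictions:
        raise ValueError("need at least one game to score")
    total = 0.0
    for p, y in zip(predictions, outcomes):
        p = min(max(p, eps), 1.0 - eps)  # clip so log(0) never occurs
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(predictions)

# Toy backtest, invented for illustration: predicted P(team 1 wins) and actual results.
predicted = [0.73, 0.52, 0.40]
actual = [1, 1, 0]
print(log_loss(predicted, actual))
```

Lower is better, and a model that always predicts 0.5 scores log(2), which gives a baseline for comparing Model #1 against later iterations kept under version control.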

Page 39: Iterating over statistical models: NCAA tournament edition

IN WRITING, YOU MUST KILL ALL YOUR DARLINGS.

William Faulkner

Page 40: Iterating over statistical models: NCAA tournament edition

ITERATING OVER STATISTICAL MODELS

STATISTICAL MODEL #2

▸ Home court advantage!

▸ Teams have a latent ability

▸ “logistic regression”

▸ “Bradley-Terry model”

$$y \sim \mathrm{bernoulli}\left(\mathrm{logit}^{-1}(\alpha + \theta_1 - \theta_2)\right)$$
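A minimal sketch of how the extra term shifts a prediction, with team 1 taken to be the home team. The `team_1_home` flag and the rule that the advantage is dropped on a neutral court are assumptions added here for illustration; the slide only gives the likelihood. All numeric values are hypothetical.

```python
import math

def inv_logit(x: float) -> float:
    """Inverse logit (logistic) function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def win_probability(theta_1, theta_2, alpha, team_1_home=True):
    """P(team 1 wins) when y ~ bernoulli(inv_logit(alpha + theta_1 - theta_2)).

    Assumption: on a neutral court the home court term alpha is dropped.
    """
    advantage = alpha if team_1_home else 0.0
    return inv_logit(advantage + theta_1 - theta_2)

# Hypothetical values: two evenly matched teams and a positive home court effect.
print(win_probability(0.0, 0.0, alpha=0.4))                     # home team favoured
print(win_probability(0.0, 0.0, alpha=0.4, team_1_home=False))  # 0.5 on a neutral court
```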

Page 41: Iterating over statistical models: NCAA tournament edition
Page 42: Iterating over statistical models: NCAA tournament edition

P(VILLANOVA > UNC | DATA) = 0.52

Page 43: Iterating over statistical models: NCAA tournament edition
Page 44: Iterating over statistical models: NCAA tournament edition

ITERATING OVER STATISTICAL MODELS

STATISTICAL MODEL #3

▸ Assumptions:

▸ Only 2015-2016 matters

▸ Teams have a latent ability

▸ Model points

▸ Add a home court advantage
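The slide does not give the likelihood for the points model. One common way to model points, shown here purely as an assumption, is to treat the final point margin as normally distributed around the ability difference plus home court advantage. The names `sigma` and `win_probability_from_margin` and all numeric values are hypothetical.

```python
import math

def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def win_probability_from_margin(theta_1, theta_2, alpha, sigma):
    """P(margin > 0) when margin ~ normal(alpha + theta_1 - theta_2, sigma).

    Assumed form only: abilities and alpha are measured in points here.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    expected_margin = alpha + theta_1 - theta_2
    return normal_cdf(expected_margin / sigma)

# Hypothetical values: team 1 is 3 points better, no home court effect, sd of 10 points.
print(win_probability_from_margin(3.0, 0.0, alpha=0.0, sigma=10.0))
```

Modeling the margin uses more of each game's information than a win/loss indicator, since a 20-point win and a 1-point win are no longer treated alike.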

Page 45: Iterating over statistical models: NCAA tournament edition
Page 46: Iterating over statistical models: NCAA tournament edition
Page 47: Iterating over statistical models: NCAA tournament edition
Page 48: Iterating over statistical models: NCAA tournament edition

ITERATING OVER STATISTICAL MODELS

HOW DID WE DO?

▸ Kaggle: 87 / 608

▸ What went wrong?

▸ What’s next?

Page 49: Iterating over statistical models: NCAA tournament edition

TREAT STATISTICAL MODELS AS CODE

Page 50: Iterating over statistical models: NCAA tournament edition

THANKS

▸ Collaborative effort

▸ NCAA modeling: Rob Trangucci

▸ Stan team: Andrew Gelman, Bob Carpenter, Matt Hoffman, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, Allen Riddell, Marco Ignacio, Jeff Arnold, Mitzi Morris, Rob Goedman, Brian Lau, Jonah Gabry, Alp Kucukelbir, Robert Grant, Dustin Tran, Krzysztof Sakrejda, Aki Vehtari, Rayleigh Lei, Sebastian Weber

HELP

▸ http://mc-stan.org

▸ stan-users mailing list

▸ Stan Group Inc. (http://stan.fit): training / statistical support / consulting

[email protected] / @djsyclik / @mcmc_stan