Source: cs229.stanford.edu/proj2015/113_poster.pdf

Predicting Final Scores of Major League Baseball Games

Nicolas Cserepy, Robbie Ostrow, Ben Weems
December 8, 2015
CS 229: Machine Learning
Nico Cserepy ([email protected]), Ben Weems ([email protected]), Robbie Ostrow ([email protected])

Introduction

Baseball: America’s national pastime. The MLB had $7.2 billion in revenue in 2010, and, according to CNBC, $30 to $40 billion is bet illegally on the game every year. There are 2,430 baseball games in a season. That’s over 190,000 plate appearances each year. We leveraged comprehensive data since 1980 (over 7 million data points) to predict the scores of baseball games. We used the data to repeatedly simulate every at bat of every game. Sophisticated analysis of these results revealed surprisingly predictive statistics that leave us questioning the proverbial wisdom: “The house always wins.”

Data Sources

1. Retrosheet.org
2. Chadwick: Software Tools for Scoring Baseball Games
3. sportsbookreview.org

Exploring the State Space

We treat baseball games as a Markov decision process. However, there are too many possible states and transitions to feasibly calculate true probabilities. As such, we randomly sample from the state space to estimate the true distribution of games.

Simulating Games

“Actions” is the set of possible results for the batter, and “outcomes” is the set of possible states after a play has been made.

1. Enter start state (away team hitting, nobody on base, etc.)
2. Repeat until game end:
   (a) Calculate P(actions|state) (learn this probability!)
   (b) Choose weighted random action
   (c) Calculate P(outcomes|action) (we assume that this is the same for all games)
   (d) Choose weighted random outcome
   (e) Go to the state that outcome specifies
3. Gather statistics about simulation

We simulate each game 10,000 times. The key to simulating accurate games is learning P(actions|state).

Learning P(actions|state)

Statistics are ubiquitous in baseball. We know every player’s batting average and every pitcher’s earned run average; we can calculate any statistic we need, at arbitrary levels of granularity. A player’s hitting statistics, however, are not sufficient to calculate the probability distribution for some at bat. These probabilities are a combination of hitting statistics, pitching statistics, and environmental variables (like runners on base, handedness, etc.). We have 7.2 million instances of at bats, which we featurized into about 100 features (mostly sparse). Each instance is matched to a result, like “single,” “strikeout,” or “fielder’s choice.” Splitting these examples into 70% development and 30% testing, we run multinomial logistic regression to more accurately calculate P(action|state).

Table 1. A few examples of features.

Binary-Valued                     Real-Valued
has_one_out                       bat_singles_per_try_against_same_hand
is_7th_inning                     pitcher_doubles_per_batter
batter_and_pitcher_same_handed    bat_homers_in_last_30_attempts
runner_on_second                  bat_doubles_per_try_against_diff_hand
in_stadium_5                      bat_walks_per_plate_appearance

Graph 1. Confidence vs. Return (axes: confidence in result, average return on $100 bet)

Graph 2: Example simulation (axes: runs scored, count of games)

Conclusions

We have successfully represented Major League Baseball games as a Markov chain, and through Monte Carlo simulations of these games we can generate meaningful results that are competitive with state-of-the-art techniques. By focusing on high-confidence games, we generate meaningful results. We found, with p < 0.05, that our predictions for games with more than 80% confidence are expected to guess the correct result for the over/under. We’re still working on tweaking our feature selection and algorithm, and expect to improve our model and results in the coming days.
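
The simulation loop described under Simulating Games can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the six-action set, the three-feature vector (a bias term plus the Table 1 features runner_on_second and has_one_out), every number in WEIGHTS and RUNNER_ADVANCE, and all function names are assumptions made for the example. P(actions|state) takes the softmax form that multinomial logistic regression produces, and P(outcomes|action) is a fixed table, reflecting the assumption that it is the same for all games. Extra innings and the unplayed bottom of the ninth are ignored.

```python
import math
import random
from collections import Counter

ACTIONS = ["out", "walk", "single", "double", "triple", "home_run"]

# Hypothetical weights, one row per action, over the feature vector
# [bias, runner_on_second, has_one_out]. A real model would learn these.
WEIGHTS = {
    "out":      [2.0, 0.0, 0.1],
    "walk":     [0.0, 0.3, 0.0],
    "single":   [0.5, 0.0, 0.0],
    "double":   [-1.0, 0.0, 0.0],
    "triple":   [-3.0, 0.0, 0.0],
    "home_run": [-1.5, 0.0, 0.0],
}

# P(outcomes | action), assumed identical for all games. Each entry is
# (probability, bases gained by runners already on base). Illustrative values.
BATTER_BASES = {"single": 1, "double": 2, "triple": 3, "home_run": 4}
RUNNER_ADVANCE = {
    "single": [(0.7, 1), (0.3, 2)],
    "double": [(0.6, 2), (0.4, 3)],
    "triple": [(1.0, 3)],
    "home_run": [(1.0, 4)],
}


def action_probs(features):
    """Softmax over linear scores (the multinomial logistic regression form)."""
    scores = [sum(w * x for w, x in zip(WEIGHTS[a], features)) for a in ACTIONS]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def apply_hit(bases, runner_adv, batter_bases):
    """Move every runner runner_adv bases, then place the batter."""
    runs, new = 0, [False, False, False]
    for i, occupied in enumerate(bases):
        if occupied:
            if i + runner_adv >= 3:
                runs += 1
            else:
                new[i + runner_adv] = True
    if batter_bases >= 4:
        runs += 1
    else:
        new[batter_bases - 1] = True
    return new, runs


def apply_walk(bases):
    """Batter takes first; runners move only when forced."""
    new = list(bases)
    for i in range(3):
        if not new[i]:
            new[i] = True
            return new, 0
    return new, 1  # bases loaded: a run is forced in


def simulate_half_inning(rng):
    """Steps 2(a)-(e): sample actions and outcomes until three outs."""
    outs, runs, bases = 0, 0, [False, False, False]
    while outs < 3:
        features = [1.0, float(bases[1]), float(outs == 1)]
        action = rng.choices(ACTIONS, weights=action_probs(features))[0]
        if action == "out":
            outs += 1
        elif action == "walk":
            bases, scored = apply_walk(bases)
            runs += scored
        else:
            probs, advances = zip(*RUNNER_ADVANCE[action])
            adv = rng.choices(advances, weights=probs)[0]
            bases, scored = apply_hit(bases, adv, BATTER_BASES[action])
            runs += scored
    return runs


def simulate_game(rng, innings=9):
    """Return (away_runs, home_runs); extra innings are ignored for brevity."""
    away = sum(simulate_half_inning(rng) for _ in range(innings))
    home = sum(simulate_half_inning(rng) for _ in range(innings))
    return away, home


def total_runs_distribution(n_games, seed=0):
    """Step 3: count how often each combined score occurs."""
    rng = random.Random(seed)
    return Counter(sum(simulate_game(rng)) for _ in range(n_games))
```

Calling total_runs_distribution(10_000) mirrors the 10,000 simulations per game mentioned above. The fraction of simulated games whose total exceeds a given over/under line is then a Monte Carlo estimate of the probability of the "over", which is one plausible reading of the confidence plotted in Graph 1.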

