ASTR633 Astrophysical Techniques
BAYESIAN STATISTICS (original notes by Eric Nielsen, edited by Mike Liu and Jonathan Williams)
Frequentist (a.k.a. "classical") statistics is what we have done so far:
1. There is a Platonic ideal ("parent population") for the parameter you are measuring. This is a fixed value with no uncertainty, and thus probability statements about it are meaningless.
2. Probability distributions refer to the chance that the true value is within a given confidence interval of the mean from our measurements.
3. Repeated measurements get you ever closer to finding the true value.

Bayesian statistics (which actually came before "classical") instead believes the following:
1. The Platonic mean is not a useful concept. Only the data are real. ("The world is messy.")
2. Probability is used to define our confidence that a parameter we measure from the data accurately describes the universe.
3. Probabilities can be assigned to things other than data, including model parameters and models themselves.
4. We can incorporate our prior knowledge, since we know more about the universe than just this one dataset.

The choice between the two is in some sense philosophical but has profound consequences, as we will see. To illustrate the difference:
Deductive logic – given a cause, we can determine its outcome, e.g. given a fair coin, what is the probability that 10 tosses will produce 10 heads, 9 heads + 1 tail, etc. This is typical for pure mathematics: given some core axioms, derive the outcomes.

Inductive logic – given that certain effects are observed, what is (are) the underlying cause(s)? E.g. if 10 flips yielded 7 heads, is the coin fair or biased? This is typical for observational science (and indeed everyday life).

Both schools of thought have value, and indeed often produce the same results. Maximum likelihood is a major concept in both paradigms.
Brief History

The problem of inferring causes from effects was first addressed by Reverend Thomas Bayes (1701-1761), published posthumously in 1763.

Initial belief + New data → Improved belief
("prior")  ("likelihood")  ("posterior", or probability distribution function)

The most recent posterior then becomes the prior for the next round of estimating.

Pierre-Simon Laplace (1749-1827) independently discovered Bayes' work (1774) and greatly clarified it. He also applied it to important problems, e.g. estimated the mass of Saturn to within 1% of the modern value. Some say it should be called "Laplacian statistics". The approach was then largely ignored until the middle of the 20th century. It is now widely used, as the long-standing challenge of Bayesian calculations being more difficult than frequentist ones has finally been overcome thanks to modern computers. Also, Markov Chain Monte Carlo (MCMC) has allowed Bayesians to do a lot more than frequentists can.

Bayes Theorem

Bayes' Theorem is just a statement about conditional probability and follows directly from the rules of probability. Consider event X and event Y:

Prob that both X & Y will happen: P(X,Y) = P(X|Y) × P(Y)

where "|" means "given" and "," means "and".

Similarly, P(Y,X) = P(Y|X) × P(X). But we know P(X,Y) = P(Y,X), so then P(X|Y) × P(Y) = P(Y|X) × P(X), i.e.

P(Y|X) = P(X|Y) × P(Y) / P(X)

This is often written to explicitly acknowledge the existence of background information ("I"):

P(Y|X,I) = P(X|Y,I) × P(Y|I) / P(X|I)
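The identity above can be checked numerically with a small joint-probability table (the events and numbers here are made up for illustration):

```python
# Check P(Y|X) = P(X|Y) * P(Y) / P(X) on a made-up 2x2 joint distribution.
# Keys are (x, y); values are the joint probability P(X=x, Y=y).
joint = {(0, 0): 0.30, (1, 0): 0.20,
         (0, 1): 0.10, (1, 1): 0.40}

P_X1 = sum(p for (x, y), p in joint.items() if x == 1)   # marginal P(X=1)
P_Y1 = sum(p for (x, y), p in joint.items() if y == 1)   # marginal P(Y=1)

P_Y1_given_X1 = joint[(1, 1)] / P_X1   # conditional, computed directly
P_X1_given_Y1 = joint[(1, 1)] / P_Y1   # the "reversed" conditional

# Bayes' theorem recovers P(Y=1|X=1) from the reversed conditional.
bayes = P_X1_given_Y1 * P_Y1 / P_X1
print(P_Y1_given_X1, bayes)   # the two agree
```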
In-class example: the Monty Hall problem.

To see the relevance for us, replace
X = data
Y = model (your hypothesis)

P(model|data) = P(data|model) × P(model) / P(data)
"posterior"       "likelihood"      "prior"    "evidence"
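The Monty Hall result can also be verified by simulation. This sketch (our illustration, not part of the original notes) plays many games and compares the win rates for switching versus staying:

```python
import random

def monty_hall_trial(switch, rng):
    """One game: prize behind a random door; player picks door 0;
    host opens a non-prize, non-chosen door; player optionally switches."""
    prize = rng.randrange(3)
    pick = 0
    # Host opens a door that is neither the player's pick nor the prize.
    opened = next(d for d in range(3) if d != pick and d != prize)
    if switch:
        pick = next(d for d in range(3) if d != pick and d != opened)
    return pick == prize

n = 100_000
rng = random.Random(42)
switch_wins = sum(monty_hall_trial(True, rng) for _ in range(n)) / n
rng = random.Random(42)
stay_wins = sum(monty_hall_trial(False, rng) for _ in range(n)) / n
print(switch_wins, stay_wins)   # switching wins ~2/3 of the time, staying ~1/3
```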
"posterior" = what you get after you examine the data, i.e. the probability distribution function (PDF) for your model parameters, e.g. y = A + Bx.
"likelihood" = how likely is it that your model can produce your data? This is where you do actual statistics, but it is often very straightforward.
"prior" = what you knew before you examined the data, e.g. what did you think of your model before you went to the telescope? This is the most subjective/controversial part of Bayesian analysis, though oftentimes the choice does not impact the basic outcome.
"evidence" = this can be ignored for parameter estimation, since it just provides the overall normalization. (It is important in model selection, which we will not discuss.)

Bayesian statistics is then just a straightforward way to constrain models, based on both your data and prior knowledge. While the derivation is non-controversial, note that the interpretation of P(model|data) is only meaningful in Bayesian statistics, i.e. it corresponds to our state of knowledge (i.e. belief) about a model and its parameters given the data (e.g. Laplace's estimate of the mass of Saturn, given the orbital data). It is not meaningful in frequentist statistics – there is only one mass for Saturn, not a distribution of masses. Of course, the sum of the final PDF (the posterior) must be 1, i.e.

Σ P(model|data) = 1

which means you can use an unnormalized expression for the likelihood and/or the prior.

For the prior, you can use any info relevant to the parameters of your model. You can also just start from scratch with no prior ("uninformed prior"). Often that is Prior = 1, which is computationally the same as maximum likelihood. But sometimes you can use physical intuition to inform your understanding of the result.
Simple example: xkcd cartoon

We want to derive the PDF for B ("boom" for the Sun), with two possible values for B:
B = 0 → Sun has not exploded
B = 1 → Sun has exploded
And of course P(B=0) + P(B=1) = 1.

Frequentist approach: our only knowledge is the results from the machine. We have 36 possible dice outcomes (1+1, 1+2, 2+1, etc.), so the probability of any 1 event is 1/36 ~ 3%. So the machine lies 3% of the time and tells the truth the other 97%. So if the machine says the Sun just exploded, there's a 97% chance it did and a 3% chance it didn't:
P(B=0) = 3%
P(B=1) = 97%

Bayesian approach: P(model|data) ∝ P(data|model) × P(model)

So we want to evaluate this for the two values of B using Bayes' Theorem:
P(B=0|data) ∝ P(data|B=0) × P(B=0)
P(B=1|data) ∝ P(data|B=1) × P(B=1)

Step 1: likelihood. Just like the frequentist approach:
P(data|B=0) = 3%
P(data|B=1) = 97%

Step 2: prior. What is the probability that our basic understanding of stellar evolution is wrong? Say 1 in a million:
P(B=0) = 0.999999
P(B=1) = 0.000001

Put them together:
P(B=0|data) ∝ 3% × 0.999999 ~ 3%
P(B=1|data) ∝ 97% × 0.000001 ~ 0.0001%

Normalize the total probabilities: P(B=0|data) + P(B=1|data) = 1
P(B=0|data) = 3% / (3% + 0.0001%) ~ 100%
P(B=1|data) = 0.0001% / (3% + 0.0001%) ~ 0.003%

So the Sun exploding got ~30x more likely than our prior, but the conclusion is still the same → it did not explode.
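The arithmetic in this example is a two-line Bayesian update; a minimal sketch:

```python
# Posterior for the xkcd example: likelihood from the dice machine,
# prior from our confidence in stellar evolution.
lik = {0: 0.03, 1: 0.97}           # P(data | B): machine lied vs. told truth
prior = {0: 0.999999, 1: 1e-6}     # P(B): Sun almost certainly did not explode

unnorm = {b: lik[b] * prior[b] for b in (0, 1)}   # likelihood x prior
norm = sum(unnorm.values())                       # the "evidence"
post = {b: unnorm[b] / norm for b in (0, 1)}      # normalized posterior

print(post[1])              # ~3e-5: still essentially impossible
print(post[1] / prior[1])   # posterior is ~30x the prior, as in the notes
```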
Model parameter estimation using Bayesian inference

A very common and powerful application of Bayes' Theorem is parameter estimation.

P(model|data) ∝ P(data|model) × P(model)
"posterior"        "likelihood"       "prior"

Don't think of this as an equation you compute once. Instead, compute it lots (and lots!) of times for all possible values of your model parameters. The result is the posterior (a.k.a. PDF) for your parameters.
Going backwards is hard: finding the best-fitting parameters for a complicated model is hard. E.g. if y = A + Bx^C sin(Dx + E), it would be hard to find the best {A, B, C, D, E}.

Going forward is easy: given some choice of {A, B, C, D, E}, you (i.e. the computer) can easily compute the value of y. So just do lots of calculations over a grid of {A, B, C, D, E} to extract the PDF (a.k.a. "brute force"). This can also be done for non-analytic models.
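As a concrete sketch of the brute-force approach, here is a grid evaluation for a simple straight-line model y = A + Bx (the data points and grid ranges are made up for illustration):

```python
import math

# Made-up data roughly following y = 2 + 0.5x, with sigma = 0.3 uncertainties.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.1, 2.4, 3.1, 3.4, 4.1]
sigma = 0.3

# Brute force: evaluate the (log-)likelihood on a grid of (A, B) values.
A_grid = [1.0 + 0.02 * i for i in range(101)]   # A in [1, 3]
B_grid = [0.0 + 0.01 * j for j in range(101)]   # B in [0, 1]

best = (None, None, -math.inf)
for A in A_grid:
    for B in B_grid:
        chi2 = sum((yi - (A + B * xi))**2 / sigma**2 for xi, yi in zip(x, y))
        loglike = -0.5 * chi2   # log P(data|model), up to a constant
        if loglike > best[2]:
            best = (A, B, loglike)

print("best A, B:", best[0], best[1])
```

Storing the whole grid of likelihoods (rather than just the maximum) is what gives you the full PDF.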
What's the likelihood? Any equation that describes a probability. In astronomy, the two most common are:
1. Counting things: Poisson statistics
2. Generic data with uncertainties: chi-square

Poisson likelihood:
1. Count how many "things" fell into a single bin (M = # of observed things).
2. Use the model parameters to predict the number in each bin (E = expected #).
3. Calculate the probability:

P(data|model) = (E^M / M!) e^(−E)

4. Multiply by the prior.
5. Repeat for each bin, then multiply all the probabilities (one for each bin) together.
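Steps 1-3 and 5 can be sketched as follows (the bin counts and model expectations here are invented for illustration):

```python
import math

def poisson_likelihood(observed, expected):
    """P(data|model) over independent bins: product of E^M e^-E / M! per bin."""
    p = 1.0
    for M, E in zip(observed, expected):
        p *= E**M * math.exp(-E) / math.factorial(M)
    return p

# Made-up example: two bins with observed counts and model-predicted counts.
M_bins = [3, 5]        # observed counts per bin
E_bins = [4.0, 4.0]    # expected counts from the model
print(poisson_likelihood(M_bins, E_bins))   # ~0.031
```

In practice one sums log-probabilities instead of multiplying raw probabilities, to avoid numerical underflow when there are many bins.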
Chi-square likelihood (you've already seen this in maximum likelihood): the generic chi-square formula is all you need:

χ² = Σ_{i=1}^{N} (M_i − E_i)² / σ_i²

which is readily generalized to multi-dimensional data:

χ² = (M_{x1} − E_{x1})²/σ_{x1}² + (M_{y1} − E_{y1})²/σ_{y1}² + (M_{x2} − E_{x2})²/σ_{x2}² + ...

The probability is then just given by

P(data|model) ∝ exp(−χ²/2)
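A minimal numerical version of the chi-square likelihood (unnormalized, as the notes allow; the data values are invented):

```python
import math

def chi2_likelihood(observed, expected, sigma):
    """Unnormalized P(data|model) = exp(-chi2/2) for Gaussian uncertainties."""
    chi2 = sum((M - E)**2 / s**2 for M, E, s in zip(observed, expected, sigma))
    return math.exp(-0.5 * chi2)

obs = [1.2, 1.9, 3.1]
mod_good = [1.0, 2.0, 3.0]   # model close to the data
mod_bad = [0.0, 0.0, 0.0]    # model far from the data
sig = [0.2, 0.2, 0.2]

print(chi2_likelihood(obs, mod_good, sig))   # order unity: plausible model
print(chi2_likelihood(obs, mod_bad, sig))    # vanishingly small: implausible
```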
Astrophysical example: age-dating of field B & A stars (Nielsen et al. 2013, ApJ, 776, 4)

B & A stars are prime targets for direct imaging of planets (e.g. HR 8799, beta Pic). When directly imaging planets, younger is better because the planets are brighter. So we're highly motivated to determine the ages of B & A stars and their uncertainties.

P(model|data) ∝ P(data|model) × P(model)
"posterior"        "likelihood"       "prior"

data = M(V), (B−V) + their uncertainties
model = predictions for {M(V), B−V} from stellar evolutionary models as a function of {[Fe/H], age, stellar mass}
likelihood: chi-square for the data (M(V), B−V) with respect to the model predictions for (M(V), B−V) as a function of {[Fe/H], age, mass}:

χ² = (O_{M(V)} − E_{M(V)})²/σ_{M(V)}² + (O_{B−V} − E_{B−V})²/σ_{B−V}²

where "O" = observed and "E" = expected from the models. The likelihood is then P(data|model) ∝ exp(−χ²/2), which produces a 3-d grid of likelihoods, with axes {age, mass, [Fe/H]}.

What is the most likely age of the star?
priors:
- metallicity distribution of the solar neighborhood (e.g. low-Z stars are rare): adopt a Gaussian for [Fe/H] with mean = 0, sigma = 0.1 dex
- flat age distribution, i.e. constant star formation rate over time
- Salpeter IMF (this turns out to have little effect)

Multiply the likelihood by the priors, normalize the overall PDF sum to 1.0, marginalize to show 1-d PDFs, and see covariances from the 2-d PDFs.

See Nielsen et al. 2013, ApJ, 776, 4.
Markov Chain Monte Carlo parameter estimation

References:
- Ford 2005, AJ, 129, 1706
- Foreman-Mackey et al. 2013, PASP, 125, 306 — http://dfm.io/emcee/current/
- Sharma 2017, ARAA, 55, 213 — https://github.com/sanjibs/bmcmc

The stellar age example is brute force; namely, compute the likelihood over a wide grid of model values. Often this is, at minimum, wasteful. For many problems, it is also infeasible. For N model parameters with R grid steps per parameter, the number of calculations is ~R**N, e.g.
50**2 = 2500 element array
50**7 = 781 billion element array (several TB of memory)

We want a process that quickly finds the peak of the PDF, then spends most of the time near the peak mapping its shape (i.e. doing calculations), where the probability is highest. Avoid regions with low probability to save computing time → Markov Chain Monte Carlo (MCMC).

A Markov chain is a sequence of random variables in which the probability of each step depends only on the current state of the system (it has no memory of the past or prediction for the future).

MCMC produces a "chain" (a series of calculations) that asymptotically approaches the PDF, e.g. 68% of the steps will occur in the 68% confidence limit of the PDF, 95% of steps in the 95% CL, etc. It will tell us the peak and the important part of the PDF, namely the part with non-negligible probability.

Recipe for a Metropolis-Hastings MCMC (simplest MCMC implementation):
• Start with an initial guess of model parameters {A0, B0, C0, ...} and compute the probability of getting our data given those initial values (i.e. the likelihood) = P0.
• Determine the next step in the chain:
1. Randomly vary each of the parameters to get a new set {Ai, Bi, Ci, ...} ("trial").
2. Compute the probability for these new parameters: Pi.
3. Is Pi > Pi-1?
   i. If yes, adopt the new parameters ("take the step", move from a low probability location to a high one).
   ii. If no, generate a uniform random number Ui from 0 to 1.
   iii. If Ui < Pi/Pi-1, then adopt the new parameters. Otherwise keep the current ones. This allows exploration of lower probability regions, but not too-low probability.
4. Return to Step (1). Repeat many times (e.g. >10^6).

For Step 1: the simplest case is to assume a Gaussian proposal for each parameter, so generate new values by drawing from a Gaussian with a fixed standard deviation and mean at the value of the current step:
Ai = Ai-1 + G*σA
Bi = Bi-1 + G*σB
where G is a normally distributed random variable with mean = 0 and variance = 1. Note: the sigmas must remain constant throughout the chain, or else the final results are not a valid measure of the PDF.

How to choose the sigmas? We want a "reasonable" acceptance rate of ~25%, i.e. the fraction of proposed steps in Step 3 that is taken.
If the rate is too large → inefficient b/c the step size is too small.
If the rate is too small → inefficient b/c the steps are too large, so the chain never moves.
You can do an initial test run with various step sizes and adjust the acceptance rate.
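The recipe above fits in a few lines. This sketch samples a simple 1-d Gaussian target PDF (the target, step size, and chain length are arbitrary choices for illustration, and we work with log probabilities to avoid underflow, so the ratio Pi/Pi-1 becomes a difference of logs):

```python
import math
import random

def log_prob(theta):
    """Log of an example target PDF: a Gaussian with mean 2, sigma 1."""
    return -0.5 * (theta - 2.0)**2

def metropolis_hastings(log_prob, theta0, step_sigma, n_steps, seed=0):
    rng = random.Random(seed)
    chain = [theta0]
    lp = log_prob(theta0)
    for _ in range(n_steps):
        trial = chain[-1] + rng.gauss(0.0, step_sigma)   # Gaussian proposal
        lp_trial = log_prob(trial)
        # Accept if more probable, else with probability P_trial / P_current.
        if lp_trial > lp or rng.random() < math.exp(lp_trial - lp):
            chain.append(trial)
            lp = lp_trial
        else:
            chain.append(chain[-1])   # reject: keep the current parameters
    return chain

chain = metropolis_hastings(log_prob, theta0=0.0, step_sigma=1.0, n_steps=50_000)
burn = chain[1000:]   # discard burn-in steps before the chain finds the peak
mean = sum(burn) / len(burn)
std = math.sqrt(sum((t - mean)**2 for t in burn) / len(burn))
print(mean, std)   # close to the target's mean = 2 and sigma = 1
```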
• The result: the MCMC chain is an array of values for each model parameter. E.g. 1000 steps of y = A + Bx gives 1000 values of A and 1000 values of B → these arrays are our model parameter PDFs! In fact, the result also provides the joint PDF for all our parameters.
• This MCMC PDF is the likelihood in Bayes' Theorem. If we have a prior, multiply it by the PDF to get the posterior.
• Side benefit: the PDF for any quantity derived from the model parameters is just the chain of that quantity calculated from the model chains.
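For example, given chains for A and B from y = A + Bx, the PDF of the predicted y at some x follows element by element. Here the "chains" are made-up Gaussian stand-ins for real sampler output:

```python
import random

# Made-up stand-ins for MCMC chains of A and B (real ones come from the sampler).
rng = random.Random(1)
A_chain = [2.0 + rng.gauss(0.0, 0.1) for _ in range(10_000)]
B_chain = [0.5 + rng.gauss(0.0, 0.05) for _ in range(10_000)]

# Derived quantity: predicted y at x = 4. Its chain IS its PDF.
x = 4.0
y_chain = [A + B * x for A, B in zip(A_chain, B_chain)]

# Summarize with the median and the 16th/84th percentiles (68% interval).
y_sorted = sorted(y_chain)
median = y_sorted[len(y_sorted) // 2]
lo = y_sorted[int(0.16 * len(y_sorted))]
hi = y_sorted[int(0.84 * len(y_sorted))]
print(f"y(4) = {median:.2f} (+{hi - median:.2f}/-{median - lo:.2f})")
```

With real chains, any correlation between A and B is propagated into the derived quantity automatically, since each element uses the same step of the chain.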
Now you get to do it yourself in the computationally intensive Problem Set #6.