ASTR633 Astrophysical Techniques

BAYESIAN STATISTICS (original notes by Eric Nielsen, edited by Mike Liu and Jonathan Williams)

Frequentist (a.k.a. "classical") statistics is what we have done so far:

1. There is a Platonic ideal ("parent population") for the parameter you are measuring. This is a fixed value with no uncertainty, and thus probability statements about it are meaningless.

2. Probability distributions refer to the chance that the true value is within a given confidence interval of the mean from our measurements.

3. Repeated measurements get you ever closer to finding the true value.

Bayesian statistics (which actually came before "classical") instead believes the following:

1. The Platonic mean is not a useful concept. Only the data are real. ("The world is messy.")

2. Probability is used to define our confidence that a parameter we measure from the data accurately describes the universe.

3. Probabilities can be assigned to things other than data, including model parameters and models themselves.

4. We can incorporate our prior knowledge, since we know more about the universe than just this one dataset.

The choice between the two is in some sense philosophical but has profound consequences, as we will see. To illustrate the difference:

Deductive logic – given a cause, we can determine its outcome; e.g. given a fair coin, what is the probability that 10 tosses will produce 10 heads, 9 heads + 1 tail, etc.? This is typical for pure mathematics: given some core axioms, derive the outcomes.

Inductive logic – given that certain effects are observed, what is (are) the underlying cause(s)? E.g. if 10 flips yielded 7 heads, is the coin fair or biased? This is typical for observational science (and indeed everyday life).

Both schools of thought have value, and indeed often produce the same results. Maximum likelihood is a major concept in both paradigms.



Brief History

The problem of inferring causes from effects was first addressed by Reverend Thomas Bayes (1701-1761), published posthumously in 1763.

Initial belief + New data → Improved belief
("prior") ("likelihood") ("posterior", or probability distribution function)

The most recent posterior then becomes the prior for the next round of estimating.

Pierre-Simon Laplace (1749-1827) independently discovered Bayes' work (1774) and greatly clarified it. He also applied it to important problems, e.g. estimated the mass of Saturn to <1% accuracy of the modern value. Some say it should be called "Laplacian statistics".

Bayesian methods were then largely ignored until the middle of the 20th century. They are now widely used, since the long-standing challenge of Bayesian calculations being more difficult than frequentist ones has finally been overcome thanks to modern computers. Also, Markov Chain Monte Carlo (MCMC) has allowed Bayesians to do a lot more than frequentists can.

Bayes Theorem

Bayes' Theorem is just a statement about conditional probability and follows directly from the rules of probability. Consider event X and event Y.

Probability that both X & Y will happen: P(X,Y) = P(X|Y) × P(Y)

where "|" means "given" and "," means "and".

Similarly, P(Y,X) = P(Y|X) × P(X). But we know P(X,Y) = P(Y,X), so then P(X|Y) × P(Y) = P(Y|X) × P(X), i.e.

P(Y|X) = P(X|Y) × P(Y) / P(X)

This is often written to explicitly acknowledge the existence of background information ("I"):

P(Y|X,I) = P(X|Y,I) × P(Y|I) / P(X|I)
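A quick numerical check of the derivation, using two dice (my own example in the spirit of the notes, not from the original):

```python
from fractions import Fraction as F

# Two fair dice: X = "sum is 4", Y = "first die shows 1"
outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]

def prob(pred):
    """Exact probability of an event over the 36 equally likely outcomes."""
    return F(sum(1 for o in outcomes if pred(o)), len(outcomes))

p_x  = prob(lambda o: o[0] + o[1] == 4)                 # P(X)  = 3/36
p_y  = prob(lambda o: o[0] == 1)                        # P(Y)  = 6/36
p_xy = prob(lambda o: o[0] + o[1] == 4 and o[0] == 1)   # P(X,Y) = 1/36

p_x_given_y = p_xy / p_y                    # definition of conditional prob
p_y_given_x = p_x_given_y * p_y / p_x       # Bayes' theorem
assert p_y_given_x == p_xy / p_x            # matches the direct computation
```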


In-class example: the Monty Hall problem.

To see the relevance for us, replace X = data, Y = model (your hypothesis):

P(model|data) = P(data|model) × P(model) / P(data)
 "posterior"      "likelihood"     "prior"     "evidence"

"posterior" = what you get after you examine the data, i.e. the probability distribution function (PDF) for your model parameters, e.g. y = A + Bx.

"likelihood" = how likely is it that your model can produce your data? This is where you do actual statistics, but it is often very straightforward.

"prior" = what you knew before you examined the data, e.g. what did you think of your model before you went to the telescope? This is the most subjective/controversial part of Bayesian analysis, though oftentimes the choice does not impact the basic outcome.

"evidence" = this can be ignored for parameter estimation, since it just provides the overall normalization. (It is important in model selection, which we will not discuss.)
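The Monty Hall example mentioned above can be verified by simulation. This sketch (not part of the original notes; function and variable names are mine) plays the game many times with and without switching:

```python
import random

def monty_hall(switch, trials=100_000, seed=0):
    """Simulate the Monty Hall game; return the fraction of wins."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        car = rng.randrange(3)           # door hiding the car
        pick = rng.randrange(3)          # contestant's first pick
        # Host opens a door that is neither the pick nor the car
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            # Move to the one remaining unopened door
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials

p_stay = monty_hall(switch=False)
p_switch = monty_hall(switch=True)
print(p_stay, p_switch)   # staying wins ~1/3 of the time, switching ~2/3
```

The posterior probability that the car is behind the other door, given the host's action, is 2/3 — exactly what Bayes' theorem predicts.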

Bayesian statistics is then just a straightforward way to constrain models, based on both your data and prior knowledge. While the derivation is non-controversial, note that the interpretation of P(model|data) is only meaningful in Bayesian statistics: it corresponds to our state of knowledge (i.e. belief) about a model and its parameters given the data (e.g. Laplace's estimate of the mass of Saturn, given the orbital data). It is not meaningful in frequentist statistics – there is only one mass for Saturn, not a distribution of masses.

Of course, the sum of the final PDF (the posterior) must be 1, i.e.

Σ P(model|data) = 1

which means you can use an unnormalized expression for the likelihood and/or the prior.

For the prior, you can use any info relevant to the parameters of your model. You can also just start from scratch with no prior (an "uninformed prior"). Often that is Prior = 1, which computationally is the same as maximum likelihood. But sometimes you can use physical intuition to inform your understanding of the result.


Simple example: xkcd cartoon

We want to derive the PDF for B ("boom" for the Sun), with two possible values for B:

B = 0 → Sun has not exploded
B = 1 → Sun has exploded

And of course P(B=0) + P(B=1) = 1.

Frequentist approach: our only knowledge is the results from the machine.

Wehave36possiblediceoutcomes(1+1,1+2,2+1,etc.)sotheprobabilityofany1eventis1/36~3%.

So the machine lies 3% of the time and tells the truth the other 97%. So if the machine says the Sun just exploded, there's a 97% chance it did and a 3% chance it didn't:

P(B=0) = 3%
P(B=1) = 97%

Bayesian approach: P(model|data) ∝ P(data|model) × P(model)

So we want to evaluate this for the two values of B using Bayes' Theorem:

P(B=0|data) ∝ P(data|B=0) × P(B=0)
P(B=1|data) ∝ P(data|B=1) × P(B=1)

Step 1: likelihood. Just like the frequentist approach:

P(data|B=0) = 3%
P(data|B=1) = 97%

Step 2: prior. What is the probability that our basic understanding of stellar evolution is wrong? Say 1 in a million:

P(B=0) = 0.999999
P(B=1) = 0.000001

Put them together:

P(B=0|data) ∝ 3% × 0.999999 ~ 3%
P(B=1|data) ∝ 97% × 0.000001 ~ 0.0001%

Normalize so that the total probability is 1, P(B=0|data) + P(B=1|data) = 1:

P(B=0|data) = 3% / (3% + 0.0001%) ~ 100%
P(B=1|data) = 0.0001% / (3% + 0.0001%) ~ 0.003%

So the Sun exploding got ~30× more likely than our prior, but the conclusion is still the same → it did not explode.
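These steps can be checked numerically. A minimal sketch (the dictionary representation and names are mine; the probabilities are the ones in the notes):

```python
# Bayesian update for the xkcd "Has the Sun exploded?" detector.
# The machine lies only when it rolls double sixes: probability 1/36.
p_lie = 1 / 36                      # ~3%

# The machine said "yes, the Sun exploded":
likelihood = {0: p_lie,             # P(data | B=0): machine lied
              1: 1 - p_lie}         # P(data | B=1): machine told the truth
prior = {0: 0.999999, 1: 1e-6}      # P(B): an exploding Sun is very unlikely

# Multiply likelihood by prior, then normalize so the posterior sums to 1
unnorm = {b: likelihood[b] * prior[b] for b in (0, 1)}
norm = sum(unnorm.values())
posterior = {b: unnorm[b] / norm for b in (0, 1)}
print(posterior)   # B=0 is still overwhelmingly favored
```

The posterior for B=1 rises by a factor of a few tens relative to the prior, but remains negligible — the data are not strong enough to overcome our prior knowledge of stellar evolution.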


Model parameter estimation using Bayesian inference

A very common and powerful application of Bayes' Theorem is parameter estimation:

P(model|data) ∝ P(data|model) × P(model)
 "posterior"      "likelihood"     "prior"

Don't think of this as an equation you compute once. Instead, compute it lots (and lots!) of times for all possible values of your model parameters. The result is the posterior (a.k.a. PDF) for your parameters.

Going backwards is hard: finding the best-fitting parameters for a complicated model is hard. E.g. if y = A + B x^C sin(Dx + E), it would be hard to find the best {A, B, C, D, E}.

Going forward is easy: given some choice of {A, B, C, D, E}, you (i.e. the computer) can easily compute the value of y. So just do lots of calculations over a grid of {A, B, C, D, E} to extract the PDF (a.k.a. "brute force"). This can also be done for non-analytic models.

What's the likelihood? Any equation that describes a probability. In astronomy, the two most common are:

1. Counting things: Poisson statistics
2. Generic data with uncertainties: chi-square

Poisson likelihood:

1. Count how many "things" fell into a single bin (M = # of observed things).
2. Use the model parameters to predict the number in each bin (E = expected #).
3. Calculate the probability:

P(data|model) = E^M e^(−E) / M!

4. Multiply by the prior.
5. Repeat for each bin, then multiply all the probabilities (one for each bin) together.
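A minimal sketch of steps 1–5 for a toy one-parameter model, where every bin shares the same expected rate (the data and grid here are invented for illustration; working in log-probability avoids underflow when multiplying many bins):

```python
import math

# Step 1: observed counts in each bin (invented data)
observed = [3, 7, 5, 5]

def poisson_log_like(expected):
    """Sum of log Poisson probabilities, P = E^M e^(-E) / M!, over all bins.
    Summing logs is the same as multiplying the per-bin probabilities."""
    return sum(m * math.log(e) - e - math.lgamma(m + 1)   # lgamma(m+1) = log(m!)
               for m, e in zip(observed, expected))

# Steps 2-5: evaluate over a grid of trial rates (flat prior, so the
# posterior is proportional to the likelihood)
rates = [0.5 + 0.1 * i for i in range(100)]
log_like = [poisson_log_like([r] * len(observed)) for r in rates]
best = rates[log_like.index(max(log_like))]
print(best)   # maximum-likelihood rate = mean of the observed counts
```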

Chi-square likelihood (you've already seen this in maximum likelihood):

The generic chi-square formula is all you need:

χ² = Σ_{i=1}^{N} (M_i − E_i)² / σ_i²

which is readily generalized to multi-dimensional data, e.g. for (x, y) points:

χ² = (M_x1 − E_x1)²/σ_x1² + (M_y1 − E_y1)²/σ_y1² + (M_x2 − E_x2)²/σ_x2² + ...

The probability is then just given by:

P(data|model) ∝ exp(−χ²/2)
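A brute-force grid sketch for the straight-line model y = A + Bx (the data points and grid ranges are invented for illustration), ending with a marginalized 1-d PDF for A:

```python
import math

# Toy data roughly following y = 2 + 3x, with sigma = 0.5 (values invented)
xs    = [0.0, 1.0, 2.0, 3.0, 4.0]
ys    = [2.1, 4.8, 8.2, 11.1, 13.9]
sigma = 0.5

def chi2(a, b):
    """Generic chi-square between data and the model y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 / sigma ** 2 for x, y in zip(xs, ys))

# Brute-force grid over {A, B}; P(data|model) proportional to exp(-chi2/2)
a_grid = [1.0 + 0.05 * i for i in range(41)]   # A in [1, 3]
b_grid = [2.0 + 0.05 * j for j in range(41)]   # B in [2, 4]
post = [[math.exp(-0.5 * chi2(a, b)) for b in b_grid] for a in a_grid]

# Marginalize: sum over B to get the 1-d PDF for A (flat prior assumed)
pdf_a = [sum(row) for row in post]
total = sum(pdf_a)
pdf_a = [p / total for p in pdf_a]
best_a = a_grid[pdf_a.index(max(pdf_a))]
print(best_a)   # peak of the marginalized PDF, near A = 2
```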


Astrophysical example: age-dating of field B & A stars (Nielsen et al. 2013, ApJ, 776, 4)

B & A stars are prime targets for direct imaging of planets (e.g. HR 8799, beta Pic). When directly imaging planets, younger is better because the planets are brighter. So we're highly motivated to determine the ages of B & A stars & their uncertainties.

P(model|data) ∝ P(data|model) × P(model)
 "posterior"      "likelihood"     "prior"

data = M(V), (B−V) + their uncertainties

model = predictions for {M(V), B−V} from stellar evolutionary models as a function of {[Fe/H], age, stellar mass}

likelihood = chi-square for the data (M(V), B−V) with respect to the model predictions for (M(V), B−V) as a function of {[Fe/H], age, mass}

where "O" = observed and "E" = expected from the models; the likelihood is then P(data|model) ∝ exp(−χ²/2),

which produces a 3-d grid of likelihoods, with axes {age, mass, [Fe/H]}.

Whatisthemostlikelyageofthestar?


priors:
- metallicity distribution of the solar neighborhood (e.g. low-Z stars are rare): adopt a Gaussian for [Fe/H] with mean = 0, sigma = 0.1 dex
- flat age distribution, i.e. constant star formation rate over time
- Salpeter IMF (this turns out to have little effect)

Multiply the likelihood by the priors, normalize the overall PDF sum to 1.0, marginalize to show 1-d PDFs, and see covariances from the 2-d PDFs.

see Nielsen et al. 2013, ApJ, 776, 4
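A schematic version of the prior-multiplication and marginalization steps (the grids, numbers, and likelihood shape here are invented placeholders, not the Nielsen et al. calculation; for clarity only two of the three axes are kept):

```python
import math

# Grids for the model parameters (invented ranges)
feh_grid = [-0.3 + 0.05 * i for i in range(13)]   # [Fe/H], -0.3 to +0.3 dex
age_grid = [50 * (j + 1) for j in range(10)]      # age in Myr, 50 to 500

def gauss(x, mu, sig):
    """Unnormalized Gaussian (unnormalized is fine per the notes)."""
    return math.exp(-0.5 * ((x - mu) / sig) ** 2)

# Stand-in likelihood grid, peaked at [Fe/H] = 0.1, age = 200 Myr
like = [[gauss(f, 0.1, 0.15) * gauss(a, 200.0, 80.0) for a in age_grid]
        for f in feh_grid]

# Multiply by the priors: Gaussian in [Fe/H] (mean 0, sigma 0.1 dex),
# flat in age (so no age factor is needed)
post = [[like[i][j] * gauss(feh_grid[i], 0.0, 0.1)
         for j in range(len(age_grid))] for i in range(len(feh_grid))]

# Marginalize over [Fe/H], then normalize the 1-d age PDF to sum to 1
age_pdf = [sum(post[i][j] for i in range(len(feh_grid)))
           for j in range(len(age_grid))]
total = sum(age_pdf)
age_pdf = [p / total for p in age_pdf]
```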


Markov Chain Monte Carlo parameter estimation

References:
- Ford 2005, AJ, 129, 1706
- Foreman-Mackey et al. 2013, PASP, 125, 306 — http://dfm.io/emcee/current/
- Sharma 2017, ARAA, 55, 213 — https://github.com/sanjibs/bmcmc

The stellar age example is brute force, namely computing the likelihood over a wide grid of model values. Often this is, at minimum, wasteful. For many problems, it is also infeasible: for N model parameters with R grid steps per parameter, the number of calculations is ~R^N, e.g.

50^2 = 2500-element array
50^7 = 781-billion-element array (several TB of memory)

We want a process that quickly finds the peak of the PDF and then spends most of its time near the peak mapping its shape (i.e. doing calculations), where the probability is highest. Avoid regions with low probability to save computing time → Markov Chain Monte Carlo (MCMC).

A Markov chain is a sequence of random variables in which the probability of stepping to the next state depends only on the current state of the system (it has no memory of the past or prediction for the future).

MCMC produces a "chain" (a series of calculations) that asymptotically approaches the PDF, e.g. 68% of the steps will occur within the 68% confidence limit of the PDF, 95% of the steps within the 95% CL, etc. It will tell us the peak & the important part of the PDF, namely the part with non-negligible probability.

Recipe for a Metropolis-Hastings MCMC (the simplest MCMC implementation):

• Start with an initial guess of model parameters {A0, B0, C0, ...} and compute the probability of getting our data given those initial values (i.e. the likelihood) = P0.

• Determine the next step in the chain:
1. Randomly vary each of the parameters to get a new set {Ai, Bi, Ci, ...} ("trial").
2. Compute the probability for these new parameters: Pi.
3. Is Pi > Pi-1?

i. If yes, adopt the new parameters ("take the step", move from a low-probability location to a high one).

ii. If no, generate a uniform random number Ui from 0 to 1.


iii. If Ui < Pi/Pi-1, then adopt the new parameters. Otherwise keep the current ones. This allows exploration of lower-probability regions, but not too-low-probability ones.

4. Return to Step (1). Repeat many times (e.g. >10^6).

For Step 1: the simplest case is to use a Gaussian proposal for each parameter, i.e. generate new values by drawing from a Gaussian with a fixed standard deviation, centered on the value of the current step:

Ai = Ai-1 + G·σA
Bi = Bi-1 + G·σB

where G is a normally distributed random variable with mean = 0 and variance = 1. Note: the sigmas must remain constant throughout the chain, or else the final results are not a valid measure of the PDF.

How to choose the sigmas? We want a "reasonable" acceptance rate of ~25%, i.e. the fraction of proposed steps in Step 3 that is taken.

If the rate is too large → inefficient because the step size is too small. If the rate is too small → inefficient because the steps are too large, so the chain never moves.

You can do an initial test run with various step sizes and adjust until the acceptance rate is reasonable.
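The full recipe can be sketched in a few lines of Python (a minimal toy implementation, assuming invented straight-line data y = A + Bx with known sigma and a flat prior; step sizes and chain length are illustrative, not tuned):

```python
import math
import random

def log_post(a, b):
    """Unnormalized log-posterior for y = A + Bx: log P = -chi2/2 (flat prior)."""
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [2.1, 4.8, 8.2, 11.1, 13.9]   # toy data, roughly y = 2 + 3x
    sigma = 0.5
    return -0.5 * sum((y - (a + b * x)) ** 2 / sigma ** 2
                      for x, y in zip(xs, ys))

rng = random.Random(42)
a, b = 0.0, 0.0                       # initial guess {A0, B0}
lp = log_post(a, b)                   # log of P0
step_a, step_b = 0.1, 0.1             # proposal sigmas, held constant
chain = []
for _ in range(20000):
    # Trial: Gaussian proposal centered on the current values
    a_new = a + rng.gauss(0, step_a)
    b_new = b + rng.gauss(0, step_b)
    lp_new = log_post(a_new, b_new)
    # Accept if more probable; else accept with probability P_new/P_old
    if lp_new > lp or rng.random() < math.exp(lp_new - lp):
        a, b, lp = a_new, b_new, lp_new
    chain.append((a, b))              # record the current (possibly repeated) state

burn = chain[5000:]                   # discard burn-in steps
a_mean = sum(s[0] for s in burn) / len(burn)
b_mean = sum(s[1] for s in burn) / len(burn)
print(a_mean, b_mean)                 # near the least-squares values
```

Note that rejected trials still append the current state to the chain; those repeats are what make the density of chain samples trace the PDF.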

• The result: the MCMC chain is an array of values for each model parameter. E.g. 1000 steps of y = A + Bx gives 1000 values of A & 1000 values of B → these arrays are our model parameter PDFs! In fact, the result also provides the joint PDF for all our parameters.

• This MCMC PDF is the likelihood in Bayes' Theorem. If we have a prior, multiply it by the PDF to get the posterior.

• Side benefit: the PDF for any quantity derived from the model parameters is just the chain of that quantity calculated from the model chains.
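For example (the short stand-in chains below are illustrative, not real MCMC output): the PDF for the model prediction y(x0) = A + B·x0 at some x0 is just that expression evaluated element-by-element along the chains.

```python
# Stand-in MCMC chains for A and B (real chains would have thousands of steps)
a_chain = [2.00, 2.10, 1.90, 2.05, 2.00]
b_chain = [3.00, 2.90, 3.10, 3.00, 2.95]

x0 = 2.5
# The chain of the derived quantity y(x0) IS its PDF, sampled like A and B
y_chain = [a + b * x0 for a, b in zip(a_chain, b_chain)]
```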

Now you get to do it yourself in the computationally intensive Problem Set #6.