Introduction to Statistical Methods for Data Analysis
Data Analysis Tutorial at UERJ 2015
Dr Lorenzo Moneta, CERN PH-SFT
CH-1211 Geneva 23
sftweb.cern.ch
root.cern.ch
Outline
• Probability definition
• Probability Density Functions
• Some typical distributions
• Bayes Theorem
• Parameter Estimation
• Hypothesis Testing
References
• Much of the material for this introduction to statistical methods is taken from the course Statistical Methods for Data Analysis (Luca Lista, INFN Napoli).
  – The material is also available in his book Statistical Methods for Data Analysis in Particle Physics (Springer): http://www.springer.com/us/book/9783319201757
• Another suggested book is Data Analysis in High Energy Physics (Wiley).
Definition of Probability
• Two main different definitions:
  – Frequentist
    • Probability is the ratio of the number of occurrences of an event to the total number of experiments, in the limit of a very large number of repeatable experiments.
    • Can only be applied to specific classes of events (repeatable experiments).
    • Meaningless to state: "probability that the lightest SUSY particle's mass is less than 1 TeV".
  – Bayesian
    • Probability measures someone's degree of belief that something is or will be true: would you bet?
      – e.g. the probability that Barcelona will win the next Champions League.
Classical Probability
• Assume all accessible cases are equally probable.
• Valid for discrete cases only.
  – Problematic in continuous cases (definition of the metric).
Binomial Distribution
• Distribution of the number of successes in N trials
  – e.g. tossing a coin or rolling a die N times
• Each trial has a probability p of success.
• Average: <n> = Np
• Variance: <n²> − <n>² = Np(1−p)
• Used for efficiencies.
• In ROOT it is available as:
ROOT::Math::binomial_pdf(n,p,N)
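A minimal sketch of how this can be used (assuming a standard ROOT installation, where binomial_pdf lives in MathCore): it tabulates the probabilities for N = 10, p = 0.3 and checks the mean and variance against Np and Np(1−p).

// Minimal sketch (assumes ROOT's MathCore headers are available).
#include <cstdio>
#include "Math/DistFunc.h"   // ROOT::Math::binomial_pdf(n, p, N)

int main() {
   const unsigned int N = 10;   // number of trials
   const double p = 0.3;        // success probability per trial
   double mean = 0.0, mean2 = 0.0;
   for (unsigned int n = 0; n <= N; ++n) {
      const double prob = ROOT::Math::binomial_pdf(n, p, N);
      std::printf("P(n = %2u) = %.5f\n", n, prob);
      mean  += n * prob;
      mean2 += n * n * prob;
   }
   // Should reproduce <n> = Np = 3 and <n^2> - <n>^2 = Np(1-p) = 2.1
   std::printf("mean = %.3f  variance = %.3f\n", mean, mean2 - mean * mean);
   return 0;
}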
Frequentist Probability
• Law of large numbers: the probability of an event A is the limit of its observed frequency,
  P(A) = lim (N → ∞) n(A)/N
• This also means that, for large N, the expected number of occurrences is n(A) ≈ P(A) · N.
• Circular definition of probability:
  – a phenomenon can be proven to be random only if we observe infinite cases.
Conditional Probability
• Probability of A, given B: P(A|B)
  – probability that an event known to belong to set B is also a member of set A
  – P(A | B) = P(A ∩ B) / P(B)
  – A is independent of B if the conditional probability of A given B is equal to the probability of A:
    • P(A | B) = P(A)
  – Hence, if A is independent of B:
    • P(A ∩ B) = P(A) P(B)
  – If A is independent of B, then B is independent of A.
Prob. Density Functions (PDF)
Gaussian (Normal) Distribution
• Average = µ
• Variance = σ²
• Widely used because of the central limit theorem.
[Figure: Gaussian PDF for (µ=0, σ=0.3), (µ=0, σ=1), (µ=0, σ=3) and (µ=−2, σ=1).]
TMath::Gaus(x, μ, σ, true)
ROOT::Math::normal_pdf(x, σ, μ)
TF1 f("f", "gausn", xmin, xmax);
x = gRandom->Gaus(μ, σ);
N.B. “gausn” for a normalised (PDF) Gaussian
Central limit theorem
• The sum of n random variables xᵢ converges to a Gaussian, irrespective of the original distributions of the variables (only some basic regularity conditions must hold).
  – ∑ xᵢ → Gaussian
  – Example: adding n flat (uniform) distributions, as in the figure and the code sketch below.
[Figure: distribution of <x> for n = 2 and n = 5, with x uniform in [0,10]; Gaussian fits give Mean ≈ 4.99, Sigma ≈ 2.03 for n = 2 and Mean ≈ 5.01, Sigma ≈ 1.29 for n = 5.]
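A minimal sketch of this demonstration (assuming ROOT; the histogram binning, the number of pseudo-experiments and the seed are illustrative choices, not taken from the original macro):

// Minimal sketch (assumes ROOT): the average of n uniform variables approaches a Gaussian.
#include "TH1D.h"
#include "TRandom3.h"

void clt_demo(int n = 5, int nexp = 10000) {
   TRandom3 rng(0);
   TH1D h("h", "<x> for n uniform variables in [0,10]", 100, 0, 10);
   for (int i = 0; i < nexp; ++i) {
      double sum = 0;
      for (int j = 0; j < n; ++j) sum += rng.Uniform(0, 10);
      h.Fill(sum / n);                  // average of the n values
   }
   h.Fit("gaus");                       // compare the resulting shape with a Gaussian
   h.DrawCopy();
}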
Uniform ("flat") distribution
• Standard deviation: σ = (b − a)/√12 for a uniform distribution in [a, b].
• Model for the position of rain drops, the time of cosmic-ray passage, etc.
• Basic distribution for pseudo-random number generation.
ROOT::Math::uniform_pdf(x, a, b)
x = gRandom->Uniform(a, b);
Cumulative Distribution
• Given a PDF f(x), the cumulative distribution is defined as F(x) = ∫ f(x′) dx′ from −∞ to x.
• The value F(x) is uniformly distributed in [0, 1].
• By inverting the cumulative distribution one can generate pseudo-random numbers according to any distribution (see the sketch below).
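A minimal sketch of this inverse-transform technique (assuming ROOT for the uniform generator; the exponential PDF and the value of λ are illustrative choices): for f(x) = λ e^(−λx) the cumulative is F(x) = 1 − e^(−λx), so x = −ln(1 − u)/λ with u uniform in (0, 1).

// Minimal sketch (assumes ROOT): generate exponentially distributed numbers
// by inverting the cumulative distribution F(x) = 1 - exp(-lambda*x).
#include <cmath>
#include "TH1D.h"
#include "TRandom3.h"

void invert_cdf_demo(double lambda = 2.0, int nevents = 100000) {
   TRandom3 rng(0);
   TH1D h("h", "x from the inverted exponential CDF", 100, 0, 5);
   for (int i = 0; i < nevents; ++i) {
      const double u = rng.Uniform();                // u uniform in (0,1)
      const double x = -std::log(1.0 - u) / lambda;  // x = F^{-1}(u)
      h.Fill(x);
   }
   h.DrawCopy();   // should follow lambda * exp(-lambda * x)
}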
Example of Cumulative Distributions
[Figure: normal_pdf(x); normal_cdf(x) and its complement normal_cdf_c(x) as functions of x; the quantiles normal_quantile(p) and normal_quantile_c(p) as functions of p.]
• Probability density function:
  – ROOT::Math::normal_pdf(x, σ, μ)
• Cumulative distribution and its complement (right-tail integral):
  – ROOT::Math::normal_cdf(x, σ, μ)
  – ROOT::Math::normal_cdf_c(x, σ, μ)
• Inverse of the cumulative distributions (quantile functions):
  – ROOT::Math::normal_quantile(p, σ)
  – ROOT::Math::normal_quantile_c(p, σ)
Poisson Distribution
• Probability to have n entries in a small subset x of a much larger sample X (X >> x).
• Limit of the binomial distribution when p = x/X = 𝜈/N << 1:
  – P(n | 𝜈, N) for N → ∞ becomes Poisson(n | 𝜈)
• The limit of the Poisson distribution for large 𝜈 is a Gaussian.
ROOT::Math::poisson_pdf(n,𝝂)
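For reference, the standard Poisson probability for observing n counts when 𝜈 are expected, with mean and variance both equal to 𝜈, is

P(n | 𝜈) = (𝜈^n / n!) e^(−𝜈)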
Poisson limit for large 𝜈
• The Poisson distribution becomes a Gaussian for large 𝜈.
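A minimal numerical check of this limit (assuming ROOT's MathCore; the value 𝜈 = 100 and the scan range are illustrative choices), comparing the Poisson probabilities with a Gaussian of mean 𝜈 and width √𝜈:

// Minimal sketch (assumes ROOT): compare Poisson(n | nu) with a Gaussian
// of mean nu and sigma = sqrt(nu) for large nu.
#include <cmath>
#include <cstdio>
#include "Math/DistFunc.h"   // ROOT::Math::poisson_pdf, ROOT::Math::normal_pdf

int main() {
   const double nu = 100.0;
   for (int n = 80; n <= 120; n += 5) {
      const double pois  = ROOT::Math::poisson_pdf(n, nu);
      const double gauss = ROOT::Math::normal_pdf(n, std::sqrt(nu), nu);
      std::printf("n = %3d  Poisson = %.5f  Gaussian = %.5f\n", n, pois, gauss);
   }
   return 0;
}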
Crystal Ball Function
• Adds an asymmetric power-law tail to a Gaussian PDF, with proper normalisation and continuity of the PDF and of its derivative.

ROOT::Math::crystalball_pdf(x, α, n, σ, μ)
TF1 f("f", "crystalballn", xmin, xmax)
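For reference, the standard (un-normalised) form behind this description, with t = (x − μ)/σ:

f(t) ∝ exp(−t²/2)        for t > −|α|
f(t) ∝ A (B − t)^(−n)    for t ≤ −|α|
with A = (n/|α|)^n exp(−α²/2) and B = n/|α| − |α|

The constants A and B are exactly the values that make the PDF and its first derivative continuous at t = −|α|.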
Landau Distribution
• Models the fluctuations of the energy loss of particles in thin layers.

ROOT::Math::landau_pdf(x, s, m)
TF1 f("f", "landaun", xmin, xmax)
Bayes Theorem
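For reference, the theorem in its basic form follows from the definition of conditional probability given earlier: since P(A ∩ B) = P(A | B) P(B) = P(B | A) P(A), dividing by P(B) gives

P(A | B) = P(B | A) P(A) / P(B)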
A concrete example
• A person received a positive diagnosis for a serious illness.
• The probability that the test detects an ill person (positive result) is ~100%.
• The probability that the test gives a positive result on a healthy person is 0.2%.
• What is the probability that the person is really ill?
• Is 99.8% a reasonable answer?
Result using Bayes theorem
• We know:
  – P(+ | ill) ~ 100% → P(− | ill) << 1
  – P(+ | healthy) = 0.2% → P(− | healthy) = 99.8%
• Using Bayes theorem we want to know:
  – P(ill | +) = P(+ | ill) P(ill) / P(+) ~ P(ill) / P(+)
• We need to know:
  – P(ill) = probability that a random person is ill, << 1
  – P(healthy) = 1 − P(ill)
• We also have:
  – P(+) = P(+ | ill) P(ill) + P(+ | healthy) P(healthy) ~ P(ill) + P(+ | healthy)
• Result: P(ill | +) ~ P(ill) / ( P(ill) + P(+ | healthy) )
Result from Bayes theorem (2)
• Result: P(ill | +) ~ P(ill) / ( P(ill) + P(+ | healthy) )
• Using some numbers:
  – P(ill) = 0.1%
  – P(+ | healthy) = 0.2%
• Then we have: P(ill | +) = 0.1 / (0.1 + 0.2) = 33%
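As a cross-check, a worked evaluation with the same numbers and without the approximations gives essentially the same answer:

P(ill | +) = P(+ | ill) P(ill) / [ P(+ | ill) P(ill) + P(+ | healthy) P(healthy) ]
           = (1.0 × 0.001) / (1.0 × 0.001 + 0.002 × 0.999)
           = 0.001 / 0.002998 ≈ 0.33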
Likelihood Function
• Likelihood function:
  – given some observed events x1, …, xn,
  – the likelihood function is the PDF of the observed variables x1, …, xn, seen as a function of the parameters:
  – L(x1, …, xn | 𝛳1, …, 𝛳n)
• Bayes theorem can then be written as
  posterior = likelihood × prior / normalisation term:
  P(𝛳 | x) = L(x | 𝛳) π(𝛳) / ∫ L(x | 𝛳′) π(𝛳′) d𝛳′
  where L(x | 𝛳) is the likelihood function, π(𝛳) is the prior probability, and the integral in the denominator is the normalisation term.
Repeated use of Bayes theorem
Bayesian Inference
• The posterior summarises all the information on the unknown parameters θ given the data.
• From the posterior one can estimate the best parameter values and probability intervals (credible intervals).
• The result depends on the prior distribution.
How to compute the Posterior PDF
• Perform analytical integration
  – feasible only in very few simple cases
• Use numerical integration
  – may be CPU intensive
  – difficult for large multi-dimensional cases
• Markov Chain Monte Carlo
  – samples the parameter space efficiently using a random walk heading towards the regions of higher probability
  – Metropolis algorithm to sample according to a PDF f(x)
Markov-Chain Monte Carlo
• Available in ROOT in the RooStats package.
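A minimal sketch of the Metropolis algorithm mentioned above (assuming ROOT only for the random generator; the target density, step size and starting point are illustrative choices, and a real analysis would typically use the RooStats machinery such as MCMCCalculator instead):

// Minimal sketch of the Metropolis algorithm: a random walk whose stationary
// distribution is proportional to the target density f(x).
#include <cmath>
#include "TH1D.h"
#include "TRandom3.h"

double target(double x) {              // un-normalised target density (illustrative)
   return std::exp(-0.5 * x * x) * (2.0 + std::sin(3.0 * x));
}

void metropolis_demo(int nsteps = 200000, double stepSize = 1.0) {
   TRandom3 rng(0);
   TH1D h("h", "Metropolis samples of f(x)", 100, -5, 5);
   double x = 0.0;                     // starting point of the chain
   for (int i = 0; i < nsteps; ++i) {
      const double xNew = x + rng.Uniform(-stepSize, stepSize);  // propose a move
      // Accept with probability min(1, f(xNew)/f(x)); otherwise stay at x.
      if (rng.Uniform() < target(xNew) / target(x)) x = xNew;
      h.Fill(x);
   }
   h.DrawCopy();                       // should follow the shape of target(x)
}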
Problems with the Bayesian approach
• Bayesian probability is subjective:
  – it depends on prior probabilities, i.e. on degrees of belief about the unknown parameters.
• Problem of how to represent lack of knowledge:
  – e.g. a uniform distribution is not invariant under coordinate transformations
    • a prior uniform in log 𝛳 is scale-invariant
  – Jeffreys prior: a prior invariant under parameter transformations.
• It is recommended to study the sensitivity of the result to the chosen prior PDF.
Frequentist vs Bayesian Inference
Parameter Estimation
• Parameter estimates
• Likelihood function
• Maximum Likelihood method
• Properties of estimators
Statistical Inference
Parameter estimators
Likelihood Function
Maximum Likelihood Estimates
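For reference, a compact statement of the method for independent observations x1, …, xn with PDF f(x; 𝛳):

𝛳̂ = value of 𝛳 that maximises L(x1, …, xn | 𝛳) = ∏ᵢ f(xᵢ; 𝛳),
equivalently the value that minimises −ln L(𝛳) = −∑ᵢ ln f(xᵢ; 𝛳),
which is the quantity typically minimised numerically (e.g. with Minuit in ROOT fits).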
Gaussian approximation
Estimator properties
• Consistency
• Bias
• Efficiency
• Robustness
Estimator consistency
Bias
Efficiency
Robustness
Parameter uncertainties with ML
Error Determination
Hypothesis Testing
• Definition of hypothesis testing
• Neyman-Pearson lemma and likelihood ratio
Hypothesis Tests
Hypothesis Test
• H0: null hypothesis
  – the hypothesis we try to reject (to prove false)
  – e.g. the data contain only background (no Higgs signal)
• H1: alternative hypothesis
  – e.g. the data contain signal (Higgs) plus background
• α: significance level, the probability to reject H0 when it is true (error of the first kind)
  – in an event selection this is the background misidentification probability
• β: the probability to accept H0 when H1 is true (error of the second kind)
  – power (probability to reject H0 if H1 is true) = 1 − β
  – in an event selection the power 1 − β is the signal selection efficiency
Example: Cut analysis
Likelihood Ratio
Neyman-Pearson Lemma
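For reference, the lemma states that, for a test of H0 against H1 at a given significance level α, the most powerful test is based on the likelihood ratio:

λ(x) = L(x | H1) / L(x | H0), rejecting H0 when λ(x) > kα,
with the cut kα chosen to give the desired significance level α.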
Summary
• In the next lectures we will look at how to:
  – use multivariate (machine learning) methods for classification and more
  – estimate the parameter uncertainties (errors) in maximum likelihood fits
  – estimate confidence intervals
  – use hypothesis tests to estimate the discovery significance of new particles
• We will complement this with examples in TMVA, RooFit and RooStats.