Probability Distributions and Frequentist Statistics “A single death is a tragedy, a million...

Post on 22-Dec-2015


Probability Distributions and Frequentist Statistics

“A single death is a tragedy, a million deaths is a statistic”

Joseph Stalin

Can we answer that?

N balls total: M red, N-M blue

1st draw: $P(R_1|I) = M/N$

2nd draw: $P(R_2|I) = \;?$

The Red and the Blue

N balls total: M red, N-M blue

Red on the 2nd draw: $R_2 = (R_1 + B_1), R_2 = R_1, R_2 + B_1, R_2$

$P(R_2|I) = P(R_1, R_2|I) + P(B_1, R_2|I)$

$= P(R_1|I)\,P(R_2|R_1, I) + P(B_1|I)\,P(R_2|B_1, I)$ (using the product rule)

$= \frac{M}{N}\,\frac{M-1}{N-1} + \frac{N-M}{N}\,\frac{M}{N-1}$

$= \frac{M}{N}$

$= P(R_1|I)$

$\ldots = P(R_3|I)$ etc.

The outcome of the first draw is a “nuisance” parameter. Marginalize = integrate over all options.
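The urn result can be checked with exact fractions. This is a minimal sketch; the function name and the example values of M and N are illustrative, not taken from the slides.

```python
from fractions import Fraction

def p_red_second(M, N):
    """P(R2|I): red on the 2nd draw from N balls (M red), no replacement."""
    p_r1 = Fraction(M, N)                    # P(R1|I)
    p_b1 = Fraction(N - M, N)                # P(B1|I)
    p_r2_given_r1 = Fraction(M - 1, N - 1)   # one red ball already removed
    p_r2_given_b1 = Fraction(M, N - 1)       # one blue ball already removed
    # Marginalize over the outcome of the first draw
    return p_r1 * p_r2_given_r1 + p_b1 * p_r2_given_b1

M, N = 3, 10  # illustrative values
print(p_red_second(M, N), Fraction(M, N))  # 3/10 3/10
```

The unknown first draw drops out of the answer, which is the sense in which it is a nuisance parameter.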

Marginalization

                  RAIN    NO RAIN    Chance of Cloud
CLOUDS            1/6     1/3        1/2
NO CLOUDS         0       1/2        1/2
Chance of Rain    1/6     5/6
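The marginal probabilities in the rain/cloud table are row and column sums of the joint probabilities. A short sketch, with the joint values read column by column from the slide (1/6 and 0 under RAIN, 1/3 and 1/2 under NO RAIN); the dictionary layout and function name are assumptions made for illustration.

```python
from fractions import Fraction as F

# Joint probabilities P(cloud state, rain state | I)
joint = {
    ("clouds", "rain"): F(1, 6),
    ("clouds", "no rain"): F(1, 3),
    ("no clouds", "rain"): F(0),
    ("no clouds", "no rain"): F(1, 2),
}

def marginal(joint, axis):
    """Sum the joint table over the other variable (axis 0 = cloud, 1 = rain)."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], F(0)) + p
    return out

p_cloud = marginal(joint, 0)
p_rain = marginal(joint, 1)
print(p_cloud["clouds"], p_rain["rain"])  # 1/2 1/6
```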

Marginalization

Where $A_i$ represents a set of mutually exclusive and exhaustive possibilities, marginalization, or integrating out of “nuisance parameters”, takes the form (writing θ for the quantity of interest):

$P(\theta|D,I) = \sum_i P(\theta, A_i|D,I)$

In the limit of a continuously variable parameter A (rather than the discrete case above), P becomes a probability density function and the sum becomes an integral:

$P(\theta|D,I) = \int dA\, P(\theta, A|D,I)$

This technique is often required in inference. For example, we may be interested in the frequency of a sinusoidal signal in noisy data but not in its amplitude (a nuisance parameter).

Probability Distributions

We denote the probability distribution over all possible values of a variable x by p(x).

Discrete: $p(x_i) = p(X = x_i)$

Continuous: $f(x) = \lim_{\delta x \to 0} \frac{p(x < X < x + \delta x)}{\delta x}$

Cumulative: $F(x) = p(X \le x)$

Properties of Probability Distributions

The expectation value for a function g(X) is the weighted average:

$\langle g(X) \rangle = \sum_{\text{all } x} g(x)\, p(x)$ (discrete)

$\langle g(X) \rangle = \int g(x)\, f(x)\, dx$ (continuous)

For g(X) = X, if it exists, this is the first moment, or mean, of the distribution.

The rth moment for a random variable X about the origin (x = 0) is:

$\mu'_r = \langle X^r \rangle = \sum_{\text{all } x} x^r\, p(x)$ (discrete)

$\mu'_r = \langle X^r \rangle = \int x^r f(x)\, dx$ (continuous)

The mean $\mu = \mu'_1 = \langle X \rangle$ is the 1st moment about the origin.

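Properties of Probability Distributions

The rth central moment for a random variable X about the mean (origin = $\mu$) is:

$\mu_r = \langle (X-\mu)^r \rangle = \sum_{\text{all } x} (x-\mu)^r\, p(x)$ (discrete)

$\mu_r = \langle (X-\mu)^r \rangle = \int (x-\mu)^r f(x)\, dx$ (continuous)

First central moment: $\mu_1 = \langle X - \mu \rangle = 0$

Second central moment: $\mathrm{Var}(X) = \sigma_x^2 = \langle (X-\mu)^2 \rangle$

$\sigma_x^2 = \langle (X-\mu)^2 \rangle = \langle X^2 - 2\mu X + \mu^2 \rangle = \langle X^2 \rangle - 2\mu \langle X \rangle + \mu^2$

$= \langle X^2 \rangle - 2\mu^2 + \mu^2 = \langle X^2 \rangle - \mu^2 = \langle X^2 \rangle - \langle X \rangle^2$

Therefore the variance $\sigma_x^2 = \langle X^2 \rangle - \langle X \rangle^2$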
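The variance identity can be verified on a small discrete distribution. The sketch below uses a fair six-sided die as an assumed example (it does not appear in the slides) and exact fractions, so both routes to the variance can be compared without rounding.

```python
from fractions import Fraction

# Assumed example: fair six-sided die, p(x) = 1/6 for x = 1..6
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation(g, pmf):
    """<g(X)> = sum over all x of g(x) p(x)."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expectation(lambda x: x, pmf)                        # first moment
second_moment = expectation(lambda x: x**2, pmf)            # <X^2>
var_central = expectation(lambda x: (x - mean) ** 2, pmf)   # <(X - mu)^2>
var_shortcut = second_moment - mean**2                      # <X^2> - <X>^2

print(mean, var_central, var_shortcut)  # 7/2 35/12 35/12
```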

Properties of Probability Distributions

Third central moment: $\mu_3 = \langle (X-\mu)^3 \rangle$ (skewness)

Fourth central moment: $\mu_4 = \langle (X-\mu)^4 \rangle$ (kurtosis)

The median and the mode both provide estimates of central tendency for a distribution, and are in many cases more robust against outliers than the mean.

Example: Mean and Median filtering

[Figure: an image degraded by salt noise, processed with a mean filter and with a median filter]
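A one-dimensional analogue of the filtering example shows why the median is more robust. This is only a sketch: the signal values, the window width and the function name are assumptions, and a real image filter would slide a 2D window instead.

```python
from statistics import mean, median

def sliding_filter(signal, reducer, width=3):
    """Replace each sample by reducer() of its neighbourhood (edges truncated)."""
    half = width // 2
    out = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        out.append(reducer(signal[lo:hi]))
    return out

# Flat signal with a single "salt" outlier
signal = [10, 10, 10, 255, 10, 10, 10]

mean_filtered = sliding_filter(signal, mean)
median_filtered = sliding_filter(signal, median)

print(mean_filtered)
print(median_filtered)
```

The mean filter smears the outlier over its neighbours, while the median filter removes it entirely because a single extreme value never reaches the middle of the sorted window.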

The Uniform Distribution

A flat distribution with peak value normalized so that the area under the curve = 1.

[Figure: uniform PDF and cumulative uniform distribution]

• Commonly used as an ignorance prior to express impartiality (a lack of bias) about the value of a quantity over the given interval.

• Round-off and quantization errors are uniformly distributed.
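The round-off claim can be illustrated by simulation. The sketch below rounds random values to the nearest integer and compares the error variance with 1/12, the variance of a uniform distribution of unit width; the seed, sample size and value range are arbitrary assumptions.

```python
import random

rng = random.Random(0)   # fixed seed so the run is repeatable
n_samples = 100_000

# Round-off error: difference between a value and its nearest integer
errors = []
for _ in range(n_samples):
    x = rng.uniform(0.0, 100.0)
    errors.append(x - round(x))

mean_error = sum(errors) / n_samples
var_error = sum((e - mean_error) ** 2 for e in errors) / n_samples

# A uniform distribution on [-0.5, 0.5] has mean 0 and variance 1/12
print(mean_error, var_error, 1 / 12)
```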

The Binomial Distribution

Binomial statistics apply when there are exactly two mutually exclusive outcomes of a trial (labelled "success" and "failure"). The binomial distribution gives the probability of observing k successes in n trials, with the probability of success on a single trial denoted by p (p is assumed fixed for all trials).

[Figure: binomial distributions for fixed n with varying p, and for fixed p with varying n]

• Among the most useful discrete distribution functions in statistics.

• Multinomial distribution is a generalization for the case where there is more than a binary outcome.
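The binomial probability of k successes in n trials is C(n, k) p^k (1-p)^(n-k). A minimal sketch of that formula follows; the function name and the example values of n and p are illustrative.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.3  # illustrative values
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)                                 # should be 1
mean = sum(k * pk for k, pk in enumerate(pmf))   # should be n * p

print(total, mean)
```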


The Negative Binomial Distribution

Closely related to the binomial distribution, the negative binomial distribution applies under the same circumstances, but the variable of interest is the number of trials n needed to obtain k successes and n-k failures (rather than the number of successes in n trials). For n Bernoulli trials each with success fraction p, the negative binomial distribution gives the probability of observing k successes and n-k failures with a success on the last trial:

$p(n|k,p) = \binom{n-1}{k-1}\, p^k (1-p)^{n-k}$
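A minimal sketch of the negative binomial probability, under the convention that n counts all trials up to and including the k-th success (other references count only the failures). The function name and example parameters are assumptions.

```python
from math import comb

def negative_binomial_pmf(n, k, p):
    """Probability that the k-th success occurs exactly on trial n."""
    if n < k:
        return 0.0  # cannot collect k successes in fewer than k trials
    # The first n-1 trials hold k-1 successes; trial n is a success
    return comb(n - 1, k - 1) * p**k * (1 - p) ** (n - k)

k, p = 3, 0.4  # illustrative values
trials = range(k, 400)  # truncation point chosen so the tail is negligible
total = sum(negative_binomial_pmf(n, k, p) for n in trials)
mean_trials = sum(n * negative_binomial_pmf(n, k, p) for n in trials)

print(total, mean_trials, k / p)
```

With k = 1 this reduces to the geometric distribution, and the mean number of trials comes out at k/p.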

The Poisson Distribution

Another crucial discrete distribution function, the Poisson expresses the probability of a number of events k (e.g. failures, arrivals, occurrences ...) occurring in a fixed period of time (or fixed area of space), provided these events occur with a known mean rate λ (events/time) and each is independent of the previous event.

• The Poisson distribution is the limiting case of a binomial distribution where the probability of success p goes to zero while the number of trials n grows such that λ = np stays fixed.

• Examples: photons received from a star in an interval; meteorite impacts over an area; pedestrians crossing at an intersection etc…
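The binomial limit can be seen numerically. The sketch below uses the Poisson probability λ^k e^(-λ) / k!, with λ taken as the expected number of events in the interval, and compares it with binomial distributions of increasing n at fixed λ = np. Function names and the chosen values are illustrative.

```python
from math import comb, exp, factorial

def poisson_pmf(k, lam):
    """Probability of k events when lam events are expected."""
    return lam**k * exp(-lam) / factorial(k)

def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

lam = 3.0  # illustrative expected count

def max_gap(n):
    """Largest pointwise difference between Binomial(n, lam/n) and Poisson(lam)."""
    p = lam / n
    return max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(21))

gaps = [max_gap(n) for n in (30, 300, 3000)]
print(gaps)  # the gap shrinks as n grows and p = lam / n shrinks
```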

The Normal (Gaussian) Distribution

The Normal or Gaussian distribution is probably the best known statistical distribution. A Gaussian with mean zero and standard deviation one is known as the Standard Normal Distribution. Given mean μ and standard deviation σ it has the PDF:

$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$

• Continuous distribution which is the limiting case of a binomial as the number of trials (and successes) becomes very large.

• Its pivotal role in statistics is partly due to the Central Limit Theorem (see later).
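A short sketch of the Gaussian PDF and its cumulative form, written with the error function from the Python standard library. The function names are illustrative; the check at the end computes the probability mass within one standard deviation of the mean.

```python
from math import erf, exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian probability density with mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative probability P(X <= x), expressed through erf."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# Fraction of the standard normal lying within one sigma of the mean
within_one_sigma = normal_cdf(1.0) - normal_cdf(-1.0)
print(normal_pdf(0.0), within_one_sigma)
```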

Examples: Gaussian Distributions

[Figure: human IQ distribution]

The Power Law Distribution

Power law distributions are ubiquitous in science, occurring in diverse phenomena including city sizes, incomes, word frequencies, and earthquake magnitudes. A power law implies that small occurrences are extremely common, whereas large instances are extremely rare. This “law” takes a number of forms (it is also referred to as Zipf's law and sometimes as a Pareto distribution). A simple illustrative power law:

[Figure: power law PDF on a linear scale and on a log-log scale, for k = 0.5, 1.0 and 2.0]
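The slide's formula did not survive extraction, so the sketch below assumes the simplest form, p(x) proportional to x^(-k), which the k values in the figure legend suggest. Under that assumption a power law is a straight line of slope -k on log-log axes; the function names and the normalization constant c are illustrative.

```python
from math import log

def power_law(x, k, c=1.0):
    """Assumed illustrative form: c * x**(-k); c is an arbitrary constant."""
    return c * x ** (-k)

def loglog_slope(k, x1=1.0, x2=100.0):
    """Slope of log p(x) against log x between two points."""
    y1, y2 = power_law(x1, k), power_law(x2, k)
    return (log(y2) - log(y1)) / (log(x2) - log(x1))

slopes = {k: loglog_slope(k) for k in (0.5, 1.0, 2.0)}
print(slopes)  # each slope is close to -k
```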

Example Power Laws from Nature

Physics Example: Cosmic Ray Spectrum

The Exponential Distribution

The exponential distribution is a continuous probability distribution with an exponential falloff controlled by the rate parameter λ: larger values of λ entail a more rapid falloff in the distribution.

• The exponential distribution is used to model times between independent events which happen at a constant average rate (e.g. lifetimes, waiting times).
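The exponential PDF is f(x) = λ e^(-λx) for x ≥ 0, with mean 1/λ. A sketch of drawing waiting times by inverting the cumulative distribution F(x) = 1 - e^(-λx) follows; the seed, rate and sample size are arbitrary assumptions.

```python
import random
from math import log

def sample_exponential(lam, rng):
    """Inverse-CDF sampling: solve u = 1 - exp(-lam * x) for x."""
    u = rng.random()              # uniform on [0, 1)
    return -log(1.0 - u) / lam    # 1 - u lies in (0, 1], so log is defined

rng = random.Random(42)   # fixed seed so the run is repeatable
lam = 2.0                 # illustrative rate (events per unit time)
waits = [sample_exponential(lam, rng) for _ in range(100_000)]

sample_mean = sum(waits) / len(waits)
print(sample_mean, 1 / lam)  # sample mean should sit near 1 / lam
```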

The Gamma Distribution

The gamma distribution is a continuous PDF characterized by two parameters, usually designated the shape parameter k and the scale parameter θ. When k = 1 it coincides with the exponential distribution, and it is also closely related to the Poisson and chi-squared distributions.

Gamma PDF:

$f(x; k, \theta) = \frac{x^{k-1} e^{-x/\theta}}{\Gamma(k)\, \theta^k}, \quad x > 0$

Where the Gamma function is defined:

$\Gamma(k) = \int_0^\infty t^{k-1} e^{-t}\, dt$

• The Gamma distribution gives a flexible class of PDFs for nonnegative phenomena, often used in modeling waiting times.

• Conjugate prior for the rate parameter of the Poisson distribution.
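The k = 1 case can be checked directly. The sketch assumes the shape/scale form of the gamma PDF, x^(k-1) e^(-x/θ) / (Γ(k) θ^k), and compares it with an exponential of rate λ = 1/θ; function names and test points are illustrative.

```python
from math import exp, gamma

def gamma_pdf(x, k, theta):
    """Gamma density with shape k and scale theta, for x > 0."""
    return x ** (k - 1) * exp(-x / theta) / (gamma(k) * theta**k)

def exponential_pdf(x, lam):
    """Exponential density with rate lam."""
    return lam * exp(-lam * x)

theta = 2.0  # illustrative scale; the matching exponential rate is 1 / theta
for x in (0.1, 1.0, 3.0):
    print(x, gamma_pdf(x, 1.0, theta), exponential_pdf(x, 1.0 / theta))
```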

The Beta Distribution

The family of beta probability distributions is defined on the fixed interval [0,1] and parameterized by two positive shape parameters, α and β. In Bayesian statistics it is frequently encountered as a prior for the success probability of the binomial distribution.

Beta PDF:

$f(x; \alpha, \beta) = \frac{x^{\alpha-1} (1-x)^{\beta-1}}{B(\alpha, \beta)}$

Where the Beta function is defined:

$B(\alpha, \beta) = \int_0^1 t^{\alpha-1} (1-t)^{\beta-1}\, dt$

• The family of Beta distributions allows for a wide variety of shapes over a fixed interval.

• If the likelihood function is a binomial, then a Beta prior will lead to another Beta distribution for the posterior.

• The role of the Beta function can be thought of as a simple normalization to ensure that the total PDF integrates to 1.
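The conjugacy property can be checked numerically: a Beta(α, β) prior times a binomial likelihood for k successes in n trials should be proportional to a Beta(α + k, β + n - k) density. In this sketch the prior parameters and data are made-up values, and B(α, β) is evaluated through the identity B(α, β) = Γ(α)Γ(β)/Γ(α + β).

```python
from math import comb, gamma

def beta_pdf(x, a, b):
    """Beta density on [0, 1] with shape parameters a and b."""
    norm = gamma(a) * gamma(b) / gamma(a + b)   # Beta function B(a, b)
    return x ** (a - 1) * (1 - x) ** (b - 1) / norm

def binomial_likelihood(p, k, n):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

a0, b0 = 2.0, 2.0               # hypothetical prior
k, n = 7, 10                    # hypothetical data: 7 successes in 10 trials
a1, b1 = a0 + k, b0 + (n - k)   # claimed posterior parameters

# If the posterior is Beta(a1, b1), prior * likelihood / posterior is a
# constant (the evidence), whatever value of p it is evaluated at.
ratios = [
    beta_pdf(p, a0, b0) * binomial_likelihood(p, k, n) / beta_pdf(p, a1, b1)
    for p in (0.2, 0.5, 0.8)
]
print(ratios)
```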

Central Limit Theorem: Experimental demonstration


Central Limit Theorem: A Bayesian demonstration

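[Figure: distributions of x1, x2 and y, with the intervals dx1, dx2 and dy marked]

$X_1$ ≡ x1 lies in the range x1 to x1 + dx1

$X_2$ ≡ x2 lies in the range x2 to x2 + dx2

$Y$ ≡ y lies in the range y to y + dy

$I$ ≡ Y is the sum of X1 and X2

$P(x_1|I) = f_1(x_1)$, $P(x_2|I) = f_2(x_2)$

$P(Y|I) = \int dX_1 \int dX_2\, P(Y, X_1, X_2|I)$

$= \int dX_1 \int dX_2\, P(X_1|I)\, P(X_2|I)\, P(Y|X_1, X_2, I)$ (using the product rule and the independence of X1, X2)

$P(Y|X_1, X_2, I) = \delta(y - x_1 - x_2)$ (because y = x1 + x2)

Therefore $P(Y|I) = \int dX_1\, f_1(x_1) \int dX_2\, f_2(x_2)\, \delta(y - x_1 - x_2) = \int dX_1\, f_1(x_1)\, f_2(y - x_1)$

Convolution integral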
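A discrete analogue of the convolution result: the distribution of a sum of independent variables is the convolution of their distributions. The sketch uses fair dice as an assumed example (the slides do not specify one) and exact fractions.

```python
from fractions import Fraction

def convolve(f1, f2):
    """Distribution of y = x1 + x2 for independent x1 ~ f1 and x2 ~ f2."""
    out = {}
    for x1, p1 in f1.items():
        for x2, p2 in f2.items():
            # Each (x1, x2) pair contributes to the outcome y = x1 + x2
            out[x1 + x2] = out.get(x1 + x2, Fraction(0)) + p1 * p2
    return out

die = {x: Fraction(1, 6) for x in range(1, 7)}   # flat distribution
two_dice = convolve(die, die)                    # triangular
three_dice = convolve(two_dice, die)             # already bell-shaped

print(two_dice[7], three_dice[10])  # 1/6 1/8
```

Each further convolution rounds the shape off a little more, which is the behaviour the Central Limit Theorem describes.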

Central Limit Theorem: Convolution Demonstration