Statistical Methods SS2013
02. Statistical models and probability distributions

Contents

1 Introduction to statistical models (modelling data)
2 Bayesian hypothesis testing (model comparison)
3 Continuous univariate probability distributions
  3.1 Gaussian distribution
  3.2 Probability density function and cumulative distribution function
  3.3 Expectation and summarizing probability distributions
4 Random number generation
5 Multiple variables: covariance
6 Multivariate probability distributions
7 Exercises

    1 Introduction to statistical models (modelling data)

Interpreting observational data is about modelling data. Although we can calculate certain statistics from a set of data, such as the mean or sample standard deviation, these are usually of limited use on their own. Instead, we normally use data to fit and to compare models.

We perform experiments or make observations in order to learn about a physical phenomenon. We use a model to represent the data in a more compact form and in a form which gives us scientific meaning: we do data reduction (reductionism) with models. We also introduce a model for this phenomenon because we cannot observe the entire phenomenon directly: we normally only observe a subset of the phenomenon and/or the data we observe have noise (i.e. a non-deterministic component). Statistics is essentially the science of making inferences with such incomplete and/or noisy data.

Imagine we wanted to learn the distribution of heights of the human population of the planet. It is essentially impossible to measure the height of every individual. We therefore measure just a subset of people, a sample, and use this to infer the entire distribution.

To proceed we must assume some model, let's call it M1, for the distribution of true heights, X, which we describe as a probability distribution function (PDF), P(X|M1). M1 will specify the form of this PDF (M1 is a proposition). For example, M1 may state that the PDF is a Gaussian with a certain mean and certain standard deviation. Another model, M2, may state that it is a different distribution, e.g. a gamma distribution. A third model, M3, may state that it is a Gaussian but with an unspecified mean and unspecified standard deviation, in which case we may want to estimate the parameters of the data from the model (e.g. using the maximum likelihood method, which will be covered in lecture 9). All of these are models of the phenomenon itself.
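As an illustrative sketch (not from the original notes), we can plot two such candidate PDFs in R; the dgamma shape and rate are hypothetical values chosen to give the same mean and standard deviation as the Gaussian:

x <- seq(from=100, to=220, by=0.5)               # grid of heights [cm]
plot(x, dnorm(x, mean=170, sd=10), type="l",     # M1: Gaussian with fixed parameters
     xlab="height [cm]", ylab="density")
lines(x, dgamma(x, shape=289, rate=1.7), lty=2)  # M2: gamma with the same mean 170, sd 10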

X is an arbitrary true height. xi is the ith measured height in our sample, the complete set of which we may write as {x} or just as D. All measurements are noisy, so each xi will deviate from the corresponding true height, Xi, by some unknown but not arbitrary amount. In order to learn anything about the model from these data we need to have a model for this noise. We describe this also using a PDF, P(x|X, MA). We read this as: assuming model MA for the noise, what is the PDF of measuring height x of an individual with true height X? (This, incidentally, is usually called the likelihood, but that's just another word for probability or probability density.) Quite often we assume a Gaussian noise model with mean equal to the true, unknown height, and standard deviation equal to our measurement uncertainty. (This may be determined by calibration or may be inferred from the data.)
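A minimal sketch of such a Gaussian noise model in R, with a hypothetical true height and measurement uncertainty:

X          <- 175   # true height [cm], assumed for illustration
sigma_meas <- 2     # measurement uncertainty [cm], assumed known
x          <- seq(from=165, to=185, by=0.1)
plot(x, dnorm(x, mean=X, sd=sigma_meas), type="l",
     xlab="measured height x [cm]", ylab="likelihood P(x | X, MA)")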

Often we work with parametrized models: M3 in the above example is a model with two free parameters; M1 is a special case with no free parameters. (Extra lecture 9B introduces non-parametric models.)

Statistics is mostly about using data to determine something about or with a model (or set of models). We can distinguish between three things we may want to do:

1. Parameter estimation. Given a model M with parameters θ, and a set of data, infer the values of the model parameters, or more correctly, infer P(θ|D, M).

2. Model comparison. Given a set of models {Mk} (parametrized or not), find out which one is best supported by the data.

3. Prediction. Given a set of data D and (possibly parametrized) model M (which may have been identified/fitted from the data), predict the data at some new location.

    2 Bayesian hypothesis testing (model comparison)

Suppose we have observed the optical spectrum of a new astronomical object, which we'll denote as D (for data). We want to determine whether it is a star, galaxy or quasar, which are propositions/models/hypotheses which we'll denote as C1, C2 and C3. Because the spectrum is noisy, and because all astronomical objects are slightly different, we cannot be 100% sure which class it is. A model for the data for each class is given by P(D|Ck). We can think of this as a generative model: given the class, we generate a PDF over the possible values of the data. A spectrum is multidimensional (a vector). It might be easier to think of D as being a scalar, such as the (astronomical) colour, in which case P(D|Ck) is a one-dimensional distribution over D.

Note that P(D|Ck) does not tell us the probability of the object being of class Ck. That would be P(Ck|D). Instead, P(D|Ck) tells us the probability of observing data D assuming it is of class Ck. This is fundamentally different. Given our generative model (which we will typically derive from our theory of astronomical objects), how can we do classification?

For simplicity, let us assume we just have two classes, star (which I'll call Hp), and non-star, which is the complement of this, H̄p. From the basic law of probability

\[ P(D) = P(D, H_p) + P(D, \bar{H}_p) = P(D|H_p)\,P(H_p) + P(D|\bar{H}_p)\,P(\bar{H}_p) \tag{1} \]

where P(D) is the probability (density) of having observed this particular piece of data at all (under either hypothesis). We are interested in P(Hp|D) = 1 − P(H̄p|D). This is related to the above quantities via Bayes' theorem

\[ P(H_p|D) = \frac{P(D|H_p)\,P(H_p)}{P(D)} \tag{2} \]

Substituting equation 1 into this gives

\[ P(H_p|D) = \frac{P(D|H_p)\,P(H_p)}{P(D|H_p)\,P(H_p) + P(D|\bar{H}_p)\,P(\bar{H}_p)} \tag{3} \]

Dividing through by the numerator gives

\[ P(H_p|D) = \frac{1}{1 + 1/R} \tag{4} \]

where

\[ R = \frac{P(D|H_p)\,P(H_p)}{P(D|\bar{H}_p)\,P(\bar{H}_p)} \tag{5} \]

is the (posterior) odds ratio of the two hypotheses. If we assign them equal prior probabilities, then R = P(D|Hp)/P(D|H̄p) is just the ratio of the two likelihoods (sometimes called the Bayes factor in this context).

We can of course generalize the above to non-simple hypotheses where we have many hypotheses or models. But we can still use the odds ratio to compare any two models.

Example. A test for Down's syndrome is 70% reliable, meaning that if the baby has Down's syndrome, then the test will be positive with a probability of 0.7 (true positive rate). The probability of a false positive is 0.05. About 1 in 700 babies born have Down's syndrome. A woman tests positive. What is the probability that her baby has Down's syndrome? Disclaimer: I do not claim these figures accurately represent real tests.

# d is positive result of test. h is hypothesis that baby has Down's
# specify the following:
p_d_h  <- 0.7    # P(d|h): true positive rate
p_d_nh <- 0.05   # P(d|not h): false positive rate
p_h    <- 1/700  # P(h): prior probability (base rate)
# posterior P(h|d) from Bayes' theorem (equation 3)
p_h_d  <- p_d_h*p_h / (p_d_h*p_h + p_d_nh*(1-p_h))
p_h_d

The answer is 0.02. The test isn't very reliable, so it doesn't greatly increase the chance of the baby having Down's over the prior. Let's see how this result varies as a function of the assumed probabilities.

## Vary reliability of test P(D|H)
p_d_h <- seq(from=0, to=1, by=0.01)
p_h_d <- p_d_h*p_h / (p_d_h*p_h + p_d_nh*(1-p_h))
plot(p_d_h, p_h_d, type="l", xlab="P(+ve result | Downs)",
     ylab="P(Downs | +ve result)")

## Vary the prior P(H), i.e. the base rate
p_d_h <- 0.7
p_h   <- 10^seq(from=-4, to=0, by=0.05)
p_h_d <- p_d_h*p_h / (p_d_h*p_h + p_d_nh*(1-p_h))
plot(log10(p_h), log10(p_h_d), type="l", xlab="log10[P(Downs)]",
     ylab="log10[P(Downs | +ve result)]")

The take-home message from this example is that when presented with the information in the problem, we intuitively (and incorrectly) interpret P(D|H̄p), the 0.05 false positive statistic, as P(H̄p|D), and therefore conclude the probability of having the disease, which is P(Hp|D) = 1 − P(H̄p|D), to be 0.95. It's actually much lower because the prior probability of the disease, P(Hp), also known as the base rate, is so small. So if you test positive for a rare disease, do the above analysis! (Better still, do it to help you decide whether or not to take the test in the first place.)

    3 Continuous univariate probability distributions

For a continuous variable, we deal with the probability density function (PDF), f(x). This is a density, not a probability, so f(x)dx is the infinitesimal probability of x in the range x to x + dx. Thus by integration, the probability of x lying between x1 and x2 is

\[ P(x_1 < x < x_2) = \int_{x_1}^{x_2} f(x)\, dx \]

which in general must be solved numerically.
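For instance, a quick numerical check in R, using the standard Gaussian (whose density is dnorm):

integrate(dnorm, lower=-1, upper=1)   # P(-1 < x < 1), about 0.683
pnorm(1) - pnorm(-1)                  # the same probability via the CDF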

    3.1 Gaussian distribution

The Gaussian or Normal distribution is perhaps the best known and most commonly used distribution in the physical sciences. (We'll find out why in lecture 3.) Its density function is

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[ -\frac{(x-\mu)^2}{2\sigma^2} \right] \]

We can plot this in R using the dnorm function

x <- seq(from=-4, to=4, by=0.1)
plot(x, dnorm(x), type="l")   # standard Gaussian: mean=0, sd=1


3.2 Probability density function and cumulative distribution function

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[ -\frac{(x-\mu)^2}{2\sigma^2} \right] \]

f(x)dx is the probability that the data lie between x and x + dx.

R functions:

dnorm(x) = f(x), the (probability) density function (PDF)

pnorm(q) = p = \int_{x=-\infty}^{q} f(x)\, dx, the cumulative distribution function (CDF)

qnorm(p) is the quantile function and is the inverse of pnorm(q)

Given a value of p, qnorm(p) gives the value of x which bounds this probability in f(x).

Thus the probability, P(q), that x lies beyond q (i.e. x > +q or x < −q) is 2(1 − pnorm(q)), and the probability that x lies within q (i.e. |x| ≤ q) is 2 pnorm(q) − 1.
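A quick sketch checking these relations (q = 1.96 is the familiar 95% value):

q <- 1.96
2*(1 - pnorm(q))   # probability of lying beyond +/- q: about 0.05
2*pnorm(q) - 1     # probability of lying within +/- q: about 0.95
qnorm(0.975)       # the quantile function inverts pnorm: recovers 1.96
p <- seq(from=0, to=1, by=0.01)
plot(p, qnorm(p), type="l")   # qnorm(0) and qnorm(1) are infinite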


    Note that R sensibly refuses to plot to infinity.

We can use the quantile function to check how close a set of data is to a Gaussian distribution (a test for normality). The idea is to plot the sample quantiles against the same quantiles drawn from a Gaussian distribution.

temp <- rnorm(1000)   # illustrative data; replace with your own sample
qqnorm(temp)          # sample quantiles vs. Gaussian quantiles
qqline(temp)          # close to a straight line if the data are Gaussian

3.3 Expectation and summarizing probability distributions

The expectation value (mean) of a variable x with PDF p(x) is

\[ E[x] = \int x\, p(x)\, dx \]

The variance is the expected squared deviation from this,

\[ Var(x) = E[(x - E[x])^2] = E[x^2] - (E[x])^2 \]

as E[E[x]] = E[x]. We can remember this as: the variance is the expectation of the square minus the square of the expectation (or mean square minus square mean). For a set of data {xi} this is just

\[ Var(x) = \overline{x^2} - \bar{x}^2 \tag{6} \]

or

\[ Var(x) = \frac{1}{n} \sum_i (x_i - \bar{x})^2 \tag{7} \]

The standard deviation is defined as

\[ \sigma = \sqrt{Var(x)} \]

which for a sample of data is

\[ \sigma = \sqrt{\frac{1}{n} \sum_i (x_i - \mu)^2} \]

Typically the mean, μ, has to be calculated from the data. In that case we have effectively used up one of the n measurements. In using all the data again to calculate the standard deviation, we will typically underestimate it: on average the naive variance estimate is too small by a factor of (n−1)/n. To avoid this, that is, to get an unbiased estimate of the standard deviation, we replace n with n−1 when calculating the sample standard deviation

\[ \sigma = \sqrt{\frac{1}{n-1} \sum_i (x_i - \bar{x})^2} \]

(This is a direct consequence of the degrees of freedom. In essence, the mean is not independent of the n values. So once we've measured the mean, there are only n−1 independent values left in the data set. More on this in lectures 3 and 4.) Likewise, one can refer to the sample variance, which also has this n−1 term. As the two R functions sd and var calculate the mean from the sample provided, they both use the n−1 term, as the short check below illustrates.
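A minimal check of the n versus n−1 conventions, with illustrative data:

set.seed(1)                       # any seed; just for reproducibility
x <- rnorm(10)
n <- length(x)
sqrt(sum((x - mean(x))^2)/n)      # biased estimate (divides by n)
sqrt(sum((x - mean(x))^2)/(n-1))  # unbiased form (divides by n-1)
sd(x)                             # sd agrees with the n-1 form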

As variance is a squared quantity, it tells you nothing about the asymmetry of a distribution. But this we can find out with the skew

\[ \gamma = \frac{1}{n \sigma^3} \sum_i (x_i - \bar{x})^3 \]

which is a dimensionless number. Positively skewed data has an asymmetry about the mean with a tail to positive values. The kurtosis is the next higher power, and measures how centrally concentrated a distribution is

\[ c = \frac{1}{n \sigma^4} \sum_i (x_i - \bar{x})^4 \;-\; 3 \]

It is also dimensionless. (The −3 is there so that a Gaussian has zero kurtosis.) Variance, skew and kurtosis have a generalization in the moments of a distribution. The kth moment of p(x) is defined as

\[ m_k = \int x^k p(x)\, dx \]


and the kth central moment of p(x) is defined as

\[ m_k = \int (x - \mu)^k p(x)\, dx \]

In the same way as with expectation values, the kth sample moment of a set of data {xi} is

\[ m_k = \frac{1}{n} \sum_i x_i^k \]
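As a sketch, the sample skew and kurtosis defined above can be computed directly (base R has no built-in functions for them; these helper functions are illustrative, not from the original notes):

skew <- function(x) {
  s <- sqrt(mean((x - mean(x))^2))   # standard deviation with the 1/n convention
  mean((x - mean(x))^3) / s^3
}
kurtosis <- function(x) {
  s <- sqrt(mean((x - mean(x))^2))
  mean((x - mean(x))^4) / s^4 - 3    # -3 so that a Gaussian gives zero
}
x <- rnorm(1e5)
skew(x)       # close to 0 for Gaussian data
kurtosis(x)   # close to 0 for Gaussian data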

    4 Random number generation

sample(10)                    # random permutation of 1:10
sample(x=10, size=3)          # draw 3 values from 1:10 without replacement
sample(x=10, size=3)          # a different draw each time
set.seed(100)                 # controls the (pseudo) random number sequence
sample(x=10, size=3)
set.seed(100)                 # resetting the seed...
sample(x=10, size=3)          # ...reproduces the same draw
sample(c(1,5,8,13,6,27), size=4)                # sample from an arbitrary vector
sample(c(1,5,8,13,6,27), size=4, replace=TRUE)  # with replacement: repeats possible

R is equipped with functions which draw random variables from many standard PDFs, such as rnorm, rpois and rbinom. E.g.

x <- rnorm(1000)   # 1000 draws from the standard Gaussian
hist(x)            # the histogram approximates the bell curve

5 Multiple variables: covariance

The covariance of two variables x and y is Cov(x, y) = E[(x − E[x])(y − E[y])], and the correlation coefficient is the dimensionless ratio ρ = Cov(x, y)/(σx σy). Note that −1 ≤ ρ ≤ +1. ρ = +1 corresponds to perfect correlation: the two vectors are equal apart from some scale factor (which cancels in the ratio) and an offset (which is removed by the mean). ρ = −1 likewise corresponds to perfect negative correlation. If ρ = 0 then the variables are uncorrelated.

If we have two or more variables, then we can form a covariance matrix, in which the diagonal elements are the variances and the off-diagonal elements are the covariances. In other words the element cij is the covariance between variable i and variable j.
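In R, cov and cor compute these directly; a small sketch with made-up data:

x <- rnorm(100)
y <- 2*x + rnorm(100)   # constructed to be strongly correlated with x
cov(x, y)               # covariance of the two samples
cor(x, y)               # correlation coefficient rho, close to +1 here
cov(cbind(x, y))        # 2x2 covariance matrix: variances on the diagonal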

    6 Multivariate probability distributions

    We can obviously have probability density functions of more than one variable.

The one-dimensional Gaussian probability distribution is

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[ -\frac{(x-\mu)^2}{2\sigma^2} \right] \]

In p dimensions, we talk of the joint probability distribution. The p-dimensional Gaussian PDF is

\[ f_x(x_1, \ldots, x_p) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right) \]

where Σ is the p×p covariance matrix of the data, |Σ| is its determinant and (x−μ)^T denotes the transpose of the p×1 column vector (x−μ). The covariance matrix is a symmetric matrix. If the variables are independent then the covariance matrix is diagonal. For two dimensions

\[ \Sigma = \begin{bmatrix} \sigma_x^2 & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma_y^2 \end{bmatrix} \]

and the distribution is

\[ f_x(x_1, x_2) = \frac{1}{2\pi |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right) \]

where |Σ| = σx² σy² (1 − ρ²).

Sketch contours of 2D Gaussian distributions with different sigmas and degrees of correlation.

If we have a 2D PDF, e.g. P(x, y), then we might want to know P(x | y=y0). This is the conditional PDF given that y = y0 and is clearly a 1D distribution. Consider it as a slice through the 2D PDF holding y constant. Generally you will need to renormalize the PDF. Recall the case of discrete probabilities with two events, A and B, where

\[ P(A, B) = P(A|B)\,P(B) \quad \Rightarrow \quad P(A|B) = \frac{P(A, B)}{P(B)} \]

We may also want to know P(x) regardless of the value of y. This is the marginal PDF of x over y. From the basic laws of probability for discrete variables with two cases, A and B:

\[ P(A) = P(A|B)\,P(B) + P(A|\bar{B})\,P(\bar{B}) \]

so in general

\[ P(A) = \sum_i P(A|B_i)\,P(B_i) \]

For continuous variables this is

\[ P(x) = \int_{-\infty}^{+\infty} P(x|y)\,P(y)\, dy = \int_{-\infty}^{+\infty} P(x, y)\, dy \]
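As a numerical sketch, we can marginalize a bivariate Gaussian (here with independent components, so the answer is known) over y using integrate:

pxy <- function(y, x) dnorm(x)*dnorm(y)             # joint density P(x,y)
integrate(pxy, lower=-Inf, upper=Inf, x=0.5)$value  # marginal P(x) at x = 0.5
dnorm(0.5)                                          # agrees with the known marginal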

To draw random variables from a multivariate normal distribution you can use mvrnorm{MASS} in R. You have to specify the mean vector and the covariance matrix.
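For example (a sketch; the mean vector and covariance values are arbitrary):

library(MASS)
mu    <- c(0, 0)
Sigma <- matrix(c(1, 0.8, 0.8, 1), nrow=2)   # unit variances, rho = 0.8
xy    <- mvrnorm(n=1000, mu=mu, Sigma=Sigma)
plot(xy, xlab="x1", ylab="x2")               # cloud elongated along x1 = x2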


    7 Exercises

1. What are the bounds of the 99% confidence interval (centered on the mean) for (a) a standardized Gaussian; (b) a Gaussian variable with mean 2 and standard deviation 4?

2. Plot the geyser{MASS} data set. Plot each of the variables (individually) as a histogram and experiment with different binwidths. Overplot your histogram with your best estimate of a Gaussian for those data, remembering to have a common normalization. Plot the conditional PDF of the waiting time given that the duration is (a) less than 3, (b) between 4.5 and 5.0.

3. Reproduce the plots in the Down's syndrome example. Investigate the impact of changing the probabilities on the plots and on the conclusions drawn from the test.

    C.A.L. Bailer-Jones Last updated: 2013-03-04 18:43
