Basic principles of statistical inference


Transcript of Basic principles of statistical inference

Page 1: Basic principles of statistical inference

http://evolve.zoo.ox.ac.uk

University of Oxford, Department of Zoology

Evolution & Infectious Disease

South Parks Road

Oxford OX1 3PS, UK

Basic principles of statistical inference

Oliver Pybus

Page 2: Basic principles of statistical inference

What Is Statistical Inference?

“The use of a sample of data to draw inferences or conclusions about some aspect of the situation from which the data were taken.”

Page 3: Basic principles of statistical inference

[Slide image: Holy Trinity, Hendrik van Balen, 1620, with the three elements of statistical inference, HYPOTHESIS, MODEL, and DATA, labelled as a “holy trinity”.]

Page 4: Basic principles of statistical inference

• Typically an alignment of gene sequences, with date and location of sampling for each. Sometimes phenotypic trait data is available.

• Population genetic inference usually requires that sequences are randomly sampled.

• Phylogenetic inference problems often don’t require random sampling.

• The data is assumed to be correct, although uncertainty in the data can sometimes be modelled (e.g. ambiguous nucleotides).

• Alignment uncertainty is usually ignored.

Data

Page 5: Basic principles of statistical inference

• A mathematical representation of the situation under study (or of some aspect of the situation).

• There are several types of evolutionary models, often used in combination.

• Nucleotide substitution models (e.g. JC, HKY, GTR)

• Molecular clock models (e.g. strict, relaxed, local)

• Trait evolution models (e.g. strict, relaxed random walks)

• Population models (e.g. coalescent models)

• Each model has a number of model parameters.

• Is a phylogeny a model or a set of parameters?

Model

Page 6: Basic principles of statistical inference

• A theory about the situation under study.

• Mathematically, a hypothesis is some statement about the parameter values of the model.

e.g. “the dN/dS of my sequences is >1”

e.g. “the date of this phylogenetic node is 1982”

e.g. “the tree topology of my sequences is ”

• The data can be used to assess whether a given hypothesis is reasonable or not.

• The model may have many parameters but the hypothesis may only concern one of them. The rest are called nuisance parameters.

Hypothesis

Page 7: Basic principles of statistical inference

POINT ESTIMATION

• Using the data to estimate values for one or more model parameters.

• e.g. “When was the most recent common ancestor of my sampled sequences?”

• Data: sequence alignment, dates of sampling, tree topology

• Model: HKY

• Estimated parameter: age of root node

• Nuisance parameters: ages of the other nodes in the tree

Types of Inference

Page 8: Basic principles of statistical inference

INTERVAL ESTIMATION

• Using the data to provide a range that represents the degree of uncertainty in the estimate of a parameter.

e.g. “What are the 95% confidence intervals for the evolutionary rate of my sequences?”

Types of Inference

Page 9: Basic principles of statistical inference

HYPOTHESIS TESTING

• Using the data to measure the relative plausibility of different statements about the model parameters.

e.g. “Is my estimated dN/dS value significantly greater than 1?”

e.g. “Does REV fit my data better than HKY?”

e.g. “Can I reject a strict molecular clock?”

e.g. “Do sequences A, B & C form a clade in my tree?”

Types of Inference

Page 10: Basic principles of statistical inference

MODEL SELECTION

• The process of finding the most appropriate model for your data.

• It involves a trade-off between “goodness of fit” and “predictive power”.

• Adding parameters increases the former but decreases the latter.

• Remember, even the best-fitting model may explain the data poorly.

Types of Inference

Model Selection and the Molecular Clock

Oliver G. Pybus

PLoS Biology Primer | May 2006 | Volume 4 | Issue 5 | e151

There are no mathematical equations in On the Origin of Species. A good thing too, you might think, and it is undoubtedly true that Darwin’s clear and flowing narrative style helped ensure the popularity of his writings. Modern research in evolutionary biology can make for less easy reading. Much of it concerns the development of an expanding arsenal of mathematical and statistical techniques, necessary to do battle with the relentless onslaught of gene and genome sequences. Of course, the discrete, ordered nature of genetic information and the stochastic character of Mendelian inheritance have naturally lent themselves to numerical analysis. Consequently, the mathematical foundations of evolutionary genetics have, somewhat unusually for biology, tended to precede the data to which they are applied. The Genetical Theory of Natural Selection by R. A. Fisher, published only fifty years after Darwin’s death, is full of equations [1].

The simplest weapon in the armoury of evolutionary genetics is genetic distance, a measure of the number of evolutionary changes between sequences from different organisms. Genetic distances can be calculated for a pair of sequences by simply counting the number of nucleotides or amino acids that differ between them. Unfortunately, this approach underestimates the amount of evolutionary change because it does not account for the fact that each site may change more than once during evolutionary history. Statistical tools, called nucleotide or amino acid substitution models, are therefore used to estimate genetic distances between sequences. There is a bewildering hierarchy of substitution models available, each making a different and specific set of assumptions about the evolutionary process of sequence change [2]. The simplest models assume that all types of mutation are equivalent and that all sites in a sequence change at the same rate. More complex models loosen these assumptions, allowing heterogeneity in the process of sequence change, but they can be reliably applied to larger datasets only. The task of deciding amongst these competing models is known as statistical model selection and can be thought of as a trade-off between model accuracy and model complexity. The degree to which a model fits the data at hand (accuracy) is always improved by adding more parameters (complexity), but since the amount of data remains constant the statistical uncertainty about each parameter increases. In addition, the biological meaning of each parameter becomes harder to decipher so the explanatory power of the model decreases (Figure 1). Thus the chosen model should have enough parameters to adequately explain the data—but no more. Once an appropriate model is chosen, genetic distances are combined using other statistical techniques to generate a phylogenetic tree of the sequences being studied [2]. The lengths of the branches in the phylogeny thus represent estimated numbers of sequence changes (Figure 2A).

However, genetic distances are rather crude indicators of evolutionary history. A small genetic distance between two sequences may suggest a recent common ancestor, but is also consistent with a slower rate of sequence change and a more ancient common ancestor (i.e., genetic distance = evolutionary rate × time). Genetic distances alone are therefore of little use if, for example, we wish to know the age of the common ancestor of mammals, or the rate at which bacterial antibiotic resistance genes evolve. Such questions can be answered only if independent information about


Citation: Pybus OG (2006) Model selection and the molecular clock. PLoS Biol 4(5): e151. DOI: 10.1371/journal.pbio.0040151


Oliver G. Pybus is in the Department of Zoology, University of Oxford, Oxford, United Kingdom. E-mail: [email protected]


Figure 1. An Illustration of the General Properties of Model Selection. (A) A hypothetical dataset consisting of thirteen points plotted on two axes. (B) A simple model, represented by a straight line through the points. This model has few parameters but does not fit the data particularly well. (C) A very complex model, which fits the data almost perfectly but has too many parameters. The estimated parameters tell us little about the biological process that gave rise to the data. (D) A model with an intermediate number of parameters represented by a curve. This fits the data well but still has relatively few parameters and therefore has greater explanatory power.

Page 11: Basic principles of statistical inference

Inference Frameworks

• There are several different types of statistical inference.

• Evolutionary problems often involve complex models with many parameters and, sometimes, limited amounts of data.

• Likelihood and Bayesian inference are practical methods in this situation. They are related, but differ philosophically in the way they view probability.

Page 12: Basic principles of statistical inference

• Likelihood (frequentist) inference: Probabilities refer only to the outcome of experiments (i.e. data). They represent the frequencies of outcomes if the experiment were repeated many times. The degree to which data supports a hypothesis is termed likelihood.

• Bayesian inference: Both data and model parameters are described by probabilities. Probability reflects our degree of belief in a hypothesis, as well as representing the outcome of experiments. Hence hypotheses have probabilities even in the absence of data.

Inference Frameworks

Page 13: Basic principles of statistical inference

• BIAS: The average deviation of an estimate from the true value.

• VARIANCE: Imprecision, or the degree of uncertainty in an estimate. Reflected in large confidence intervals, or in a wide “spread” of values when estimation is re-run many times.

• CONSISTENCY: The convergence of an estimate to the true parameter value as sample size increases.

• ERROR: The failure of hypothesis tests to get the right answer as often as they should.

Properties of Inference Methods
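As an illustration (a sketch of my own, not from the lecture, assuming NumPy is available), these properties can be demonstrated by simulation: re-running the estimation of a proportion p from n coin flips many times shows the bias and variance of the estimator h/n, and increasing n shows its consistency.

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = 0.6                                    # the (normally unknown) true value

for n in (10, 100, 1000):                       # increasing sample size
    # Re-run the "experiment" 10,000 times: flip n coins, estimate p as h/n.
    estimates = rng.binomial(n, p_true, size=10_000) / n
    bias = estimates.mean() - p_true            # average deviation from the true value
    variance = estimates.var()                  # spread of the estimate across re-runs
    print(f"n = {n:4d}   bias = {bias:+.4f}   variance = {variance:.5f}")

# The bias stays near zero and the variance shrinks as n grows,
# so the estimator converges on the true value: it is consistent.
```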

Page 14: Basic principles of statistical inference

• “A technique for assessing the relative merits of different hypotheses in the light of the data.”

• The probability of observing data D, given hypothesis H, is denoted P(D|H). It is defined by the model and is a probability density function.

• The likelihood of hypothesis H, given data D, is denoted L(H|D). This is a likelihood function and is directly proportional to P(D|H).

i.e. L(H|D) ∝ P(D|H)

• The constant of proportionality is arbitrary, but is the same for all hypotheses under the same data.

• Unlike probabilities, likelihoods do not sum to 1!

Likelihood

Page 15: Basic principles of statistical inference

• Inferences are made by comparing the likelihoods of different hypotheses on the same data (D).

e.g. hypotheses H1 & H2 have likelihoods L(H1|D) & L(H2|D)

• NEVER COMPARE LIKELIHOODS ON DIFFERENT DATA!

• Likelihoods are very small numbers, so the natural logarithm of the likelihood, denoted l(H|D), is used.

• Likelihood ratios are therefore computed as log-likelihood differences, i.e. ln[ L(H1|D) / L(H2|D) ] = l(H1|D) - l(H2|D)

Likelihood Ratios

Page 16: Basic principles of statistical inference

• Suppose we have a model with one parameter, x1.

• We find the value of x1 that maximises the likelihood. This value is the maximum likelihood estimate (MLE) of x1.

• A plot of x1 against log-likelihood is a log-likelihood curve.

• Algebra or optimisation algorithms find the highest point of the curve.

Maximum Likelihood Estimation

[Figure: log-likelihood curve, l(x1|D), plotted against x1; the highest point of the curve, max[l(x1|D)], occurs at the MLE of x1.]

Page 17: Basic principles of statistical inference

Maximum Likelihood Estimation

Situation: Flipping a coin n = 10 times

Data: h = 6 Heads, (n-h) = 4 Tails

Model: binomial distribution

Likelihood function: L(p|h,n) = C(n,h) p^h (1-p)^(n-h), where C(n,h) is the binomial coefficient

MLE: p̂ = h/n = 0.6

[Figure: likelihood L plotted against p from 0 to 1, peaking at the MLE, p = 0.6.]
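As a worked illustration of this slide (a minimal Python sketch of my own, assuming NumPy is available), the binomial log-likelihood can be evaluated over a grid of p values and its maximum compared with the analytical MLE, h/n:

```python
import numpy as np
from math import comb, log

n, h = 10, 6                                   # data: 10 flips, 6 heads

def log_likelihood(p):
    # Binomial log-likelihood: log[ C(n,h) p^h (1-p)^(n-h) ]
    return log(comb(n, h)) + h * np.log(p) + (n - h) * np.log(1 - p)

p_grid = np.linspace(0.001, 0.999, 999)        # avoid log(0) at the endpoints
ll = log_likelihood(p_grid)                    # the log-likelihood curve

p_mle = p_grid[np.argmax(ll)]                  # highest point of the curve
print(f"grid-search MLE: {p_mle:.3f}   analytical MLE h/n: {h / n}")
```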

Page 18: Basic principles of statistical inference

• If we have two parameters, x1 & x2, then the pair of x1 & x2 values that maximises the likelihood function must be found.

• Optimisation algorithms try to find the maximum point.

• This can be a computationally difficult problem.

Maximum Likelihood Estimation

[Figure (from Jonathan Richard Shewchuk): Convergence of the nonlinear Conjugate Gradient Method. (a) A complicated function with many local minima and maxima. (b) Convergence path of Fletcher-Reeves CG; unlike linear CG, convergence does not occur in two steps. (c) Cross-section of the surface corresponding to the first line search. (d) Convergence path of Polak-Ribiere CG.]
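To make the two-parameter case concrete, here is a small sketch of my own (not the lecture's method) that uses SciPy's general-purpose optimiser to maximise a two-parameter log-likelihood; a normal model with unknown mean mu and standard deviation sigma stands in for whatever evolutionary model is being fitted.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=0.5, size=50)        # toy data

def neg_log_likelihood(params):
    mu, log_sigma = params                            # work with log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(norm.logpdf(data, loc=mu, scale=sigma))

# The optimisation algorithm (here Nelder-Mead) searches for the pair of values
# that maximises the likelihood, i.e. minimises the negative log-likelihood.
result = minimize(neg_log_likelihood, x0=[0.0, 0.0], method="Nelder-Mead")
mu_mle, sigma_mle = result.x[0], np.exp(result.x[1])
print(f"MLE: mu = {mu_mle:.3f}, sigma = {sigma_mle:.3f}")
```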

Page 19: Basic principles of statistical inference

• The likelihood ratio statistic (LRS) is used to compare hypotheses.

• Suppose we have two hypotheses, H1 & H2. These can represent specific parameter values, or regions of parameter space, or whole models.

• LRS = 2 * { max[l(H1|D)] - max[l(H2|D)] }

Likelihood Ratio Statistic

Page 20: Basic principles of statistical inference

• If the hypotheses are nested (one is a special case of the other) then the LRS can be used to compare their goodness-of-fit.*

• The LRS is also used to calculate confidence limits for MLEs.*

• Non-nested hypotheses can be compared using the Akaike Information Criterion (AIC). For hypothesis H:

AIC(H) = max[ l(H|D) ] - n

where n is the number of parameters in the model.

• “Better” hypotheses have higher AIC values under this definition. (Note that the more common convention is AIC = 2n - 2 max[l(H|D)], under which lower values are better; the two rankings are equivalent.)

Likelihood Ratio Statistic

* Not covered in this lecture.
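Continuing the coin-flip example from earlier (a sketch of my own, not from the lecture), the LRS and the slide's version of the AIC can be computed for two nested hypotheses: H1, in which p is a free parameter, and H2, in which p is fixed at 0.5. Comparing the LRS to a chi-squared distribution is not covered here, so the sketch just reports the statistic.

```python
from math import comb, log

n, h = 10, 6                               # coin-flip data: 10 flips, 6 heads

def log_lik(p):
    # Binomial log-likelihood for a given value of p
    return log(comb(n, h)) + h * log(p) + (n - h) * log(1 - p)

# H1: p is free (1 parameter, maximised at p = h/n); H2: p = 0.5 (0 free parameters).
# H2 is a special case of H1, so the hypotheses are nested.
l_H1 = log_lik(h / n)
l_H2 = log_lik(0.5)

lrs = 2 * (l_H1 - l_H2)                    # likelihood ratio statistic
aic_H1 = l_H1 - 1                          # AIC as defined on this slide:
aic_H2 = l_H2 - 0                          # max log-likelihood minus number of parameters

print(f"LRS = {lrs:.3f}")
print(f"AIC(H1) = {aic_H1:.3f}, AIC(H2) = {aic_H2:.3f}")
```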

Page 21: Basic principles of statistical inference

• Bayesian inference produces a posterior probability distribution rather than a likelihood curve.

• The “posterior” combines information from the data and from previous knowledge. (Likelihood = the data only.)

• Each parameter in the model has a prior probability distribution representing our previous knowledge about that parameter.

• The “prior” can be strong or weak...

• e.g. “Human heights follow a normal distribution with mean = 1.7m and standard deviation = 15cm”

• e.g. “Human heights follow a uniform (flat) distribution between 10⁻¹⁰ m and 10²⁶ m.”

Bayesian Inference

Page 22: Basic principles of statistical inference

https://xkcd.com/1132/

Page 23: Basic principles of statistical inference

Bayesian Inference

If the posterior and the prior look similar then the data is not very informative about the parameter in question.

Posterior distributions are defined using Bayes’ Theorem.*

* Not covered in this lecture.

Page 24: Basic principles of statistical inference

Posterior Distributions

Posterior distributions are very difficult to calculate directly. However, we can approximate the posterior distribution by using Markov Chain Monte Carlo (MCMC) sampling.

This algorithm walks around parameter space in a pseudo-random way. It moves quickly through ‘low’ regions and slowly in ‘high’ regions whilst keeping a record of where it has been.
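As a minimal sketch of the idea (my own illustration, assuming the coin-flip data from earlier and a flat Uniform(0,1) prior on p; real phylogenetic analyses use specialised software), a Metropolis random walk accepts moves in proportion to the posterior ratio, so it lingers in 'high' regions and passes quickly through 'low' ones while recording where it has been.

```python
import numpy as np
from math import log

rng = np.random.default_rng(42)
n, h = 10, 6                                       # coin-flip data

def log_posterior(p):
    # log[ prior(p) x likelihood(p) ]; the Uniform(0,1) prior is a constant
    if not 0.0 < p < 1.0:
        return -np.inf
    return h * log(p) + (n - h) * log(1 - p)

samples = []
p_current = 0.5
for _ in range(50_000):
    p_proposed = p_current + rng.normal(0.0, 0.1)  # pseudo-random walk proposal
    # Accept with probability min(1, posterior ratio): the chain therefore spends
    # more time in 'high' posterior regions than in 'low' ones.
    if np.log(rng.uniform()) < log_posterior(p_proposed) - log_posterior(p_current):
        p_current = p_proposed
    samples.append(p_current)                      # keep a record of where it has been

samples = np.array(samples[5_000:])                # discard the initial "burn-in"
print(f"approximate posterior mean of p: {samples.mean():.3f}")
```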

Page 25: Basic principles of statistical inference

Posterior Distributions

[Figure: posterior probability distribution plotted against the parameter x, with the Bayesian posterior estimate (BPE) marked.]

Parameters are estimated by finding the mean or median of the posterior distribution. These are Bayesian posterior estimates (BPEs).
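As a small follow-on sketch (my own; for the coin-flip example the Uniform(0,1) prior and binomial likelihood give a Beta(h+1, n-h+1) posterior, so it can be sampled directly), the BPE is simply the mean or median of the posterior sample:

```python
import numpy as np

rng = np.random.default_rng(0)
n, h = 10, 6
# With a Uniform(0,1) prior and a binomial likelihood, the posterior for p
# is the Beta(h + 1, n - h + 1) distribution.
posterior_samples = rng.beta(h + 1, n - h + 1, size=100_000)

print(f"posterior mean (BPE):   {posterior_samples.mean():.3f}")
print(f"posterior median (BPE): {np.median(posterior_samples):.3f}")
```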

Page 26: Basic principles of statistical inference

Posterior Distributions

Here, the posterior is cut off at x=0.75 by the prior. The data support values of x>0.75, but the prior won’t allow this.

[Figure: posterior probability density of the exponential growth rate, r, plotted from 0 to 0.9, with the MPE marked; the prior distribution cuts the posterior off at r = 0.75.]

Page 27: Basic principles of statistical inference

Credible Regions

The Bayesian equivalent of a confidence interval is called the highest posterior density (HPD) credible region. This is the smallest region that contains 95% of the posterior probability.

[Figure: posterior probability distribution of parameter x, with the lower and upper 95% HPD limits marked.]
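A 95% HPD interval can be read off a posterior sample as the shortest window containing 95% of the sorted draws. The sketch below (my own, reusing the Beta(7, 5) posterior from the coin-flip example) illustrates one common way of doing this:

```python
import numpy as np

def hpd_interval(samples, mass=0.95):
    """Shortest interval containing `mass` of the posterior samples."""
    sorted_s = np.sort(samples)
    k = int(np.floor(mass * len(sorted_s)))            # number of samples spanned by the window
    widths = sorted_s[k:] - sorted_s[:len(sorted_s) - k]
    i = np.argmin(widths)                              # index of the narrowest window
    return sorted_s[i], sorted_s[i + k]

rng = np.random.default_rng(0)
samples = rng.beta(7, 5, size=100_000)                 # coin-example posterior, Beta(7, 5)
lower, upper = hpd_interval(samples)
print(f"95% HPD credible interval for p: ({lower:.3f}, {upper:.3f})")
```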

Page 28: Basic principles of statistical inference

Credible Regions

[Figure: joint posterior distribution of two parameters, x and y, with the 90% and 95% HPD regions marked.]

Page 29: Basic principles of statistical inference

• Posteriors are proper probability distributions (unlike likelihood curves). So hypotheses can be tested by simply inspecting areas under the curve (e.g. is parameter x>1.0?)

• The Bayesian equivalent of the likelihood ratio statistic is the “Bayes Factor”.

Bayesian Hypothesis Testing

That is the worst model I have ever seen.

Page 30: Basic principles of statistical inference

• The Bayes factor (BF) is the ratio of the probability of the data D under model 1 to the probability of the data D under model 2.

• The models don’t have to be nested.

• M1 and M2 can represent different models or different regions of parameter space (e.g. M1=x<1 vs M2=x>1).

• Calculating p(D|M1) is computationally difficult.

Bayesian Model Selection

BF = p(D|M1) / p(D|M2)
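For most models p(D|M) must be estimated numerically, but as an analytically tractable sketch (my own, again using the coin-flip data), the Bayes factor comparing M1: p unknown with a Uniform(0,1) prior against M2: p fixed at 0.5 can be computed exactly, because the binomial likelihood integrates over the flat prior in closed form:

```python
import numpy as np
from scipy.special import betaln, gammaln

n, h = 10, 6                                          # coin-flip data

log_choose = gammaln(n + 1) - gammaln(h + 1) - gammaln(n - h + 1)   # log C(n, h)

# M1: p unknown, Uniform(0,1) prior.  Integrating the binomial likelihood over
# the prior gives p(D|M1) = C(n,h) * B(h+1, n-h+1).
log_p_D_M1 = log_choose + betaln(h + 1, n - h + 1)

# M2: p fixed at 0.5, so p(D|M2) = C(n,h) * 0.5^n.
log_p_D_M2 = log_choose + n * np.log(0.5)

bf = np.exp(log_p_D_M1 - log_p_D_M2)
print(f"BF = p(D|M1) / p(D|M2) = {bf:.3f}")           # < 1 here: M1 is actually worse than M2
```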

Page 31: Basic principles of statistical inference

Bayesian Model Selection

Bayes factor    Interpretation
<1              M1 is actually worse than M2
1 to 3          Barely worth mentioning
3 to 10         Substantial support for M1
10 to 30        Strong support for M1
30 to 100       Very strong support for M1
>100            Decisive evidence in favour of M1