An introduction to the Bootstrap method Hugh Shanahan University College London November 2001 I know...


Page 1:

An introduction to the Bootstrap method

Hugh Shanahan

University College London

November 2001

I know that it will happen, Because I believe in the certainty of chance

The Divine Comedy

Page 2:

Outline

• Origin of Statistics
• Central Limit Theorem
• Difficulties in “Standard Statistics”
• Bootstrap - the basic idea
• A simple example
• Case Study I : Phylogenetic Trees
• Case Study II : Bayesian Networks
• Conclusions

Page 3:

Statistics 101

• We want the ‘average’ and ‘error’ for some variable
• Time between first and second division of frog embryo
• Half-life of a radioactive sample
• How many days does Wimbledon get delayed by (grrr……..)

Page 4:

Strategy

• Assuming only statistical variation
• Carry out measurement “many” times
• Error decreases as number of measurements increases

Page 5:

In fact, there’s a huge amount of statistical machinery going on with this…….

Assume the Central Limit Theorem

“If random samples of n observations $y_1, y_2, \ldots, y_n$ are drawn from a population of finite mean $\mu$ and variance $\sigma^2$, then when n is sufficiently large, the sampling distribution of the sample mean can be approximated by a normal density with mean $\mu_{\bar{y}} = \mu$ and standard deviation $\sigma_{\bar{y}} = \sigma / n^{1/2}$”

THE MOST IMPORTANT THEOREM OF STATISTICS

Page 6:

Consequences of CLT

• Averages taken from any distribution with finite variance (your experimental data) will have an approximately normal distribution
• The error for such an observable will decrease slowly (as $1/n^{1/2}$) as the number of observations increases

But nobody tells you how big the sample has to be..
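The $1/n^{1/2}$ behaviour can be checked numerically. The following is a minimal sketch, assuming draws from a Uniform(0, 1) population; the function name `sample_means`, the sample sizes and the number of trials are illustrative choices, not taken from the talk.

```python
import random
import statistics

def sample_means(n, trials, rng):
    """Return `trials` sample means, each taken over n draws from Uniform(0, 1)."""
    return [statistics.fmean(rng.random() for _ in range(n)) for _ in range(trials)]

rng = random.Random(0)
sigma = (1 / 12) ** 0.5  # standard deviation of the Uniform(0, 1) population

for n in (4, 16, 64):
    means = sample_means(n, 5000, rng)
    observed = statistics.stdev(means)   # spread of the sample means
    predicted = sigma / n ** 0.5         # CLT prediction: sigma / sqrt(n)
    print(f"n={n:3d}  observed={observed:.4f}  predicted={predicted:.4f}")
```

Quadrupling n should roughly halve the spread of the averages; a histogram of `means` is what lets you judge whether n is "large enough" for a given population.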

Page 7:

[Figure panels: Normal distribution; Averages of N.D.; a second distribution (symbol lost in extraction); Averages of that distribution]

Page 8:

[Figure panels: Uniform distribution; Averages of U.D.]

Page 9:

Research is more than Statistics 101 !!

• Very often, we are looking at quite complicated objects, not just single variables. Even if we assume CLT, it is not clear how to propagate the uncertainty through to the final objects we are looking at.

• It is not clear when we have a large enough sample. We should do a histogram, but this may not be possible.

Page 10:

What the statistician sees…. (or rather what they talk about)

• The probability distribution rather than the data
• But we just have the data !

• The bootstrap method attempts to determine the probability distribution from the data itself, without recourse to CLT.

• The bootstrap method is not a way of reducing the error ! It only tries to estimate it.

Page 11:

Basic idea of Bootstrap

• Originally, from some list of data, one computes an object.
• Create an artificial list, of the same length, by randomly drawing elements from that list. Some elements will be picked more than once.
• Compute a new object.
• Repeat 100-1000 times and look at the distribution of these objects.
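The steps above can be sketched in a few lines. This is an illustrative sketch only: the function name `bootstrap`, its parameters and the measurements in `data` are made up for the example, and the "object" here is simply the median of the list.

```python
import random
import statistics

def bootstrap(data, statistic, n_resamples=1000, seed=None):
    """Return the statistic computed on n_resamples artificial lists.

    Each artificial list has the same length as `data` and is drawn
    from it with replacement, so some elements appear more than once.
    """
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(n_resamples):
        resample = rng.choices(data, k=n)  # draw with replacement
        replicates.append(statistic(resample))
    return replicates

data = [4.1, 5.3, 2.8, 6.0, 5.5, 3.9, 4.7, 5.1, 4.4, 6.2]  # made-up measurements
reps = bootstrap(data, statistics.median, n_resamples=1000, seed=1)
print("median of the original list:", statistics.median(data))
print("bootstrap estimate of its error:", statistics.stdev(reps))
```

Passing the statistic in as a function is a design choice that keeps the resampling loop unchanged whether the object is a median, a fit parameter or something more complicated.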

Page 12:

A simple example

• Data available comparing grades before and after leaving graduate school amongst 15 U.S. Universities.
• Some linear correlation between grades (high incoming usually means high outgoing). The correlation coefficient is 0.776.
• But how reliable is this result ?
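One way to approach the reliability question is to bootstrap the correlation coefficient itself. The sketch below assumes the Pearson coefficient and uses synthetic data for 15 hypothetical universities, since the talk's data set is not reproduced in this transcript; the numbers it prints are therefore not the talk's results.

```python
import random
import statistics

def pearson(pairs):
    """Pearson linear correlation coefficient of a list of (x, y) pairs."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(42)
# Synthetic stand-in for the 15 universities: NOT the data from the talk.
incoming = [rng.gauss(600, 40) for _ in range(15)]
pairs = [(x, 3.1 + 0.005 * (x - 600) + rng.gauss(0, 0.15)) for x in incoming]

replicates = []
while len(replicates) < 1000:
    resample = rng.choices(pairs, k=len(pairs))  # resample whole (x, y) pairs
    try:
        replicates.append(pearson(resample))
    except ZeroDivisionError:
        continue  # degenerate resample (one pair repeated 15 times); redraw

replicates.sort()
low, high = replicates[25], replicates[974]  # central 95% of the replicates
print(f"r = {pearson(pairs):.3f}, bootstrap SE = {statistics.stdev(replicates):.3f}")
print(f"95% percentile interval: [{low:.3f}, {high:.3f}]")
```

Note that each university's pair of grades is resampled together; drawing the two columns independently would destroy the very correlation being studied.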

Page 17:

Addendum : The Jack-knife

• The jack-knife is a close relative of the bootstrap.
• Each jack-knife subsample has all but one of the original elements of the list.
• For example, if the original list has 10 elements, then there are 10 jack-knife subsamples.
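A minimal sketch of the leave-one-out construction follows. The function names and the ten data values are illustrative, and the error formula used is the standard jack-knife variance estimate, which the slides do not spell out.

```python
import statistics

def jackknife_samples(data):
    """Return the len(data) subsamples obtained by leaving out one element each."""
    return [data[:i] + data[i + 1:] for i in range(len(data))]

def jackknife_std_error(data, statistic):
    """Standard jack-knife estimate of the standard error of `statistic`."""
    n = len(data)
    estimates = [statistic(sample) for sample in jackknife_samples(data)]
    mean_est = statistics.fmean(estimates)
    return ((n - 1) / n * sum((e - mean_est) ** 2 for e in estimates)) ** 0.5

data = [4.1, 5.3, 2.8, 6.0, 5.5, 3.9, 4.7, 5.1, 4.4, 6.2]  # made-up measurements
print("number of jack-knife subsamples:", len(jackknife_samples(data)))
print("jack-knife error on the mean:", jackknife_std_error(data, statistics.fmean))
```

For the mean, this estimate coincides with the familiar s / n^(1/2), which makes a convenient sanity check.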

Page 18:

How many bootstraps ?

• No clear answer to this. Lots of theorems on asymptotic convergence, but no real estimates !
• Rule of thumb : try it 100 times, then 1000 times, and see if your answers have changed by much.
• In any case, a list of N elements has only $N^N$ possible resamples.
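The rule of thumb and the counting argument can both be tried out directly. In this sketch the data are randomly generated placeholders and `bootstrap_std_error` is a hypothetical helper; $N^N$ counts ordered draws, while the number of distinct resamples when order is ignored is the smaller binomial coefficient shown.

```python
import math
import random
import statistics

def bootstrap_std_error(data, statistic, n_resamples, rng):
    """Spread of `statistic` over n_resamples bootstrap resamples of `data`."""
    reps = [statistic(rng.choices(data, k=len(data))) for _ in range(n_resamples)]
    return statistics.stdev(reps)

n = 15
print("ordered resamples, N^N:", n ** n)
print("distinct resamples ignoring order:", math.comb(2 * n - 1, n))

rng = random.Random(7)
data = [rng.gauss(10, 2) for _ in range(n)]  # made-up data
for b in (100, 1000):
    se = bootstrap_std_error(data, statistics.fmean, b, rng)
    print(f"{b:5d} bootstraps -> estimated error {se:.3f}")
```

If the two error estimates differ by more than you are prepared to tolerate, increase the number of bootstraps and compare again.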

Page 19:

Is it reliable ?

• A very very good question !
• Jury still out on how far it can be applied, but for now nobody is going to shoot you down for using it.
• Good agreement for Normal (Gaussian) distributions; skewed distributions tend to be more problematic, particularly for the tails (the bootstrap underestimates the errors).

Page 20:

Case Study I : Phylogenetic Trees

Get a multiple sequence alignment

      C1  C2  C3
S1    A   A   G
S2    A   A   A
S3    G   G   A
S4    A   G   A

Construct a Tree using your favourite method (Parsimony, ML, etc..)

Page 21:

How confident are we of this tree ?

• For example, how confident are we that two sequences are in the same clade ?
• I.e. what is the probability distribution of our confidence of the branches ?
• Certainly not a problem that Stat. 101 can handle !
• Bootstrap can provide a way of determining this (first thought of by Felsenstein, 1985)

Page 23:

Having created an ensemble of Phylogenetic trees (each built from an alignment whose columns were resampled with replacement), one can elucidate the statistical frequency of various features of the tree. E.g. Do two sequences lie in the same clade ?
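The column resampling can be sketched on the toy alignment from the case study. Real analyses rebuild a full tree (parsimony, ML, ...) for every resampled alignment; as a deliberately crude stand-in, this sketch only asks whether S1 is strictly closer to S2 than to S3 or S4 in Hamming distance. The function names and that grouping criterion are assumptions made for illustration.

```python
import random

# Toy alignment: rows are sequences S1..S4, columns are C1..C3.
alignment = {
    "S1": "AAG",
    "S2": "AAA",
    "S3": "GGA",
    "S4": "AGA",
}

def resample_columns(aln, rng):
    """Draw alignment columns with replacement, keeping the alignment length."""
    length = len(next(iter(aln.values())))
    cols = rng.choices(range(length), k=length)
    return {name: "".join(seq[c] for c in cols) for name, seq in aln.items()}

def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def s1_groups_with_s2(aln):
    """Stand-in for tree building: is S1 strictly closest to S2?"""
    d = {name: hamming(aln["S1"], seq) for name, seq in aln.items() if name != "S1"}
    return d["S2"] < min(d["S3"], d["S4"])

rng = random.Random(3)
n_replicates = 2000
hits = sum(s1_groups_with_s2(resample_columns(alignment, rng)) for _ in range(n_replicates))
support = hits / n_replicates
print(f"bootstrap support for grouping S1 with S2: {support:.2f}")
```

The fraction of replicates in which a feature appears is what is usually quoted as its bootstrap support; with only three columns the toy value is dominated by how often column C2 happens to be drawn.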

Can this be used for statistical significance ? This is very much an open question !!!! (Be cautious, and assume not…...)

Page 24:

Case Study II : Gene expression data and Bayesian (Probabilistic) networks

• A method for elucidating which genes are regulating the production of which other genes.
• Problem is that it is difficult to determine how reliable the edges of the network are
• The bootstrap method is the favoured approach…..

Page 26:

Ideally, what you want is the following

Page 28:

Formally, we get a joint probability distribution which takes the form :

P(G1,G2,….) = … x P(G3 | G1, G2 ) x … x P(G7 | G3 ) x …

etc….

More importantly, we can tell which genes directly affect which genes (e.g. G1 and G2 acting on G3) and which ones are indirect (e.g. G6 acting on G3)
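Written out in general, this is the standard factorisation for a Bayesian network. In the sketch below, $\mathrm{Pa}(G_i)$ (the set of parent genes with an edge pointing directly into $G_i$) and the number of genes $n$ are notation introduced here for illustration, not symbols from the slides.

```latex
P(G_1, G_2, \ldots, G_n) = \prod_{i=1}^{n} P\bigl(G_i \mid \mathrm{Pa}(G_i)\bigr),
\qquad \text{e.g. } \mathrm{Pa}(G_3) = \{G_1, G_2\}, \quad \mathrm{Pa}(G_7) = \{G_3\}.
```

In this notation a direct effect is membership of $\mathrm{Pa}(G_i)$, whereas an indirect effect would be one that reaches $G_i$ only through a chain of intermediate genes.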

Page 29:

But there is a problem….

• Finding the right network is an NP-hard problem.
• Have to apply various heuristic techniques….
• Also, given the paucity of data it is not clear that any given connection between two genes is not a spurious correlation that will vanish with more statistics.

Page 31:

Summary of the Bootstrap method

• Original object O (a tree, a best fit...) is computed from a “list of data” (numbers, sequences, microarray data,….).
• Construct a new list, with the same number of elements, from the original list by randomly picking elements from the list. Any one element from the list can be picked any number of times.
• Compute new object, call it O1
• Repeat the process many times (typically 100-1000).
• The elements {O1 , O2 , ……} are assumed to be taken from a statistical distribution, so one can compute averages, variances, etc.

Page 32:

Conclusions

• Don’t feel bad if this went over your head !
• I’m happy to explain this again……..
• Textbook : Randomization, Bootstrap and Monte Carlo Methods in Biology, B.F.J. Manly, Chapman & Hall
• Many extra subtleties (parametric, non-parametric, random numbers) have not been discussed.
• Do NOT scrimp on the explanation of this method when you are writing it up !!!