Comp. Genomics Recitation 3 The statistics of database searching.

Page 1: Comp. Genomics Recitation 3 The statistics of database searching.

Comp. Genomics

Recitation 3: The statistics of database searching

Page 2: Comp. Genomics Recitation 3 The statistics of database searching.

Substitution matrices

• Random model (R): x and y appear at random

  $P(x, y \mid R) = \prod_i q_{x_i} \prod_j q_{y_j}$

• Match model (M): x and y are derived from a common ancestor

  $P(x, y \mid M) = \prod_i p_{x_i y_i}$

• The odds ratio:

  $\frac{P(x, y \mid M)}{P(x, y \mid R)}$

• Using the log of the odds ratio gives an additive scoring system
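The additivity of the log-odds score can be written out as a short derivation. This is a sketch that assumes an ungapped alignment of two sequences of equal length, so that the background terms pair up position by position; the per-position score symbol s(a, b) is introduced here for illustration.

```latex
\log \frac{P(x, y \mid M)}{P(x, y \mid R)}
  = \log \prod_i \frac{p_{x_i y_i}}{q_{x_i}\, q_{y_i}}
  = \sum_i s(x_i, y_i),
\qquad
s(a, b) = \log \frac{p_{ab}}{q_a\, q_b}.
```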

Page 3: Comp. Genomics Recitation 3 The statistics of database searching.

Substitution matrices

• PAM (Point Accepted Mutation, by Margaret Dayhoff, 1978) matrices extrapolate the amino acid substitution rates of evolutionarily distant proteins from the substitutions observed in evolutionarily close proteins

  • PAM1 gives the frequencies of changes for proteins that differ at 1% of the sites

  • $\mathrm{PAM250} = (\mathrm{PAM1})^{250}$ gives the frequencies of changes for proteins separated by 250 substitutions per 100 sites (multiple substitutions per site are possible)

• This calculation assumes that changes are independent of each other and of past changes
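The extrapolation step can be illustrated with a small matrix power. The sketch below uses a hypothetical 3-letter alphabet with made-up transition probabilities (not Dayhoff's data); only the idea of raising a one-step matrix to the 250th power is taken from the slide.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in m]
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Hypothetical one-step matrix: every row sums to 1 and every diagonal entry
# is 0.99, so a single step changes 1% of the sites (the PAM1 analogue).
TOY_PAM1 = [
    [0.990, 0.007, 0.003],
    [0.007, 0.990, 0.003],
    [0.005, 0.005, 0.990],
]

# 250 independent steps, each depending only on the current state.
TOY_PAM250 = mat_pow(TOY_PAM1, 250)

for row in TOY_PAM250:
    print([round(p, 3) for p in row])
```

The diagonal of the result is the probability that a site shows the same letter after 250 steps, which stays well above zero because repeated and reverse substitutions can hit the same site.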

Page 4: Comp. Genomics Recitation 3 The statistics of database searching.

Substitution matrices

• BLOSUM (Blocks of Amino Acid Substitution Matrix) matrices derive substitution rates from proteins of various evolutionary distances

  • Change rates were derived from alignments of common patterns

  • E.g., BLOSUM60 was derived from patterns that were 60% similar

  • No assumption is made regarding the evolutionary relationship between the proteins from which the patterns are taken

• $p_{ij}$ is the probability of two amino acids i and j replacing each other in a homologous sequence, and $q_i$ and $q_j$ are the background probabilities
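As a rough illustration of how pair probabilities and background probabilities combine into a score, the sketch below computes log-odds scores in bits for a hypothetical 2-letter alphabet. The numbers are invented, and the background probabilities are assumed to be the marginals of the ordered-pair distribution, which is a simplification of how real BLOSUM matrices are built.

```python
import math

# Hypothetical joint probabilities of ordered aligned pairs (symmetric, sums to 1).
PAIR_PROB = {
    ("A", "A"): 0.40, ("A", "B"): 0.10,
    ("B", "A"): 0.10, ("B", "B"): 0.40,
}


def background(pair_prob):
    """Background probability q_i as the marginal of the pair distribution."""
    q = {}
    for (i, _j), p in pair_prob.items():
        q[i] = q.get(i, 0.0) + p
    return q


def log_odds(pair_prob):
    """Score s_ij = log2(p_ij / (q_i * q_j)) for every pair."""
    q = background(pair_prob)
    return {(i, j): math.log2(p / (q[i] * q[j]))
            for (i, j), p in pair_prob.items()}


SCORES = log_odds(PAIR_PROB)
for pair, score in sorted(SCORES.items()):
    print(pair, round(score, 2))
```

Pairs that occur more often than chance predicts get a positive score, and pairs that occur less often get a negative one.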

Page 5: Comp. Genomics Recitation 3 The statistics of database searching.

Exercise

The following substitution matrix is given:

      T   C   A   G
  T   1   0  -1  -1
  C   0   1  -1  -1
  A  -1  -1   1   0
  G  -1  -1   0   1

What is the average score per nucleotide pair? Assume each nucleotide appears with equal probability.

Page 6: Comp. Genomics Recitation 3 The statistics of database searching.

Solution

$\sum_{i,j} q_i q_j \, s(i,j) = \sum_{i,j} \frac{1}{4} \cdot \frac{1}{4} \, s(i,j) = -\frac{4}{16} = -\frac{1}{4}$
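The average over the 16 equally likely nucleotide pairs can be checked numerically. This is a minimal sketch; the function and variable names are chosen here for illustration.

```python
from fractions import Fraction

# Substitution matrix from the exercise, indexed as SCORE[row][column].
SCORE = {
    "T": {"T": 1, "C": 0, "A": -1, "G": -1},
    "C": {"T": 0, "C": 1, "A": -1, "G": -1},
    "A": {"T": -1, "C": -1, "A": 1, "G": 0},
    "G": {"T": -1, "C": -1, "A": 0, "G": 1},
}


def average_score(score, freq):
    """Expected score of a random pair: sum over i, j of q_i * q_j * s(i, j)."""
    return sum(freq[i] * freq[j] * score[i][j]
               for i in score for j in score[i])


# Each nucleotide appears with probability 1/4, as the exercise assumes.
UNIFORM = {n: Fraction(1, 4) for n in SCORE}
AVERAGE = average_score(SCORE, UNIFORM)
print(AVERAGE)  # -1/4
```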

• What happens if the average score is not negative?

• What is the chance that in a pair of evolutionarily related sequences T is replaced by C?

Page 7: Comp. Genomics Recitation 3 The statistics of database searching.

Solution

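$S(i,j) = \log_2 \frac{p_{ij}}{q_i q_j}$

$S(T,C) = 0 \;\Rightarrow\; \log_2 \frac{p_{ij}}{q_i q_j} = 0 \;\Rightarrow\; \frac{p_{ij}}{q_i q_j} = 1$

$p_{ij} = q_i q_j = \frac{1}{4} \cdot \frac{1}{4} = \frac{1}{16}$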
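The same inversion can be applied to every entry of the exercise matrix. The sketch below assumes, as in the solution, that the scores are base-2 log-odds and that every nucleotide has background probability 1/4; the helper name is hypothetical.

```python
# Substitution matrix from the exercise, indexed as SCORE[row][column].
SCORE = {
    "T": {"T": 1, "C": 0, "A": -1, "G": -1},
    "C": {"T": 0, "C": 1, "A": -1, "G": -1},
    "A": {"T": -1, "C": -1, "A": 1, "G": 0},
    "G": {"T": -1, "C": -1, "A": 0, "G": 1},
}
Q = 0.25  # uniform background probability assumed in the exercise


def implied_pair_probability(score, q_i, q_j):
    """Invert s = log2(p / (q_i * q_j)) to get p = q_i * q_j * 2**s."""
    return q_i * q_j * 2.0 ** score


PAIR_PROB = {(i, j): implied_pair_probability(SCORE[i][j], Q, Q)
             for i in SCORE for j in SCORE[i]}

print(PAIR_PROB[("T", "C")])     # 0.0625, i.e. 1/16
print(sum(PAIR_PROB.values()))   # total probability over all 16 ordered pairs
```

Under these assumptions the implied pair probabilities add up to 1, so the matrix is consistent with being a log-odds matrix in bits.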

Does the optimal alignment change if we multiply the matrix by a constant C?

Page 8: Comp. Genomics Recitation 3 The statistics of database searching.

How significant is my score?

• Create a mathematical model of the alignment of random sequences, and derive the score distribution analytically

• Use simulation to estimate the score distribution of the alignment of:

  • Generated sequences

  • Real sequences that are known to be non-homologous, or that are shuffled
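The simulation route can be sketched with shuffled sequences. Everything specific below is an assumption made for illustration: the two short DNA strings, the match/mismatch/gap scores, the number of shuffles, and the add-one estimate of the empirical p-value.

```python
import random


def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman score with a linear gap penalty (score only, no traceback)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
        best = max(best, max(curr))
        prev = curr
    return best


def shuffled_null_scores(query, target, n_shuffles, rng):
    """Score the query against shuffled copies of the target."""
    letters = list(target)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # keeps composition, destroys the order
        scores.append(local_alignment_score(query, "".join(letters)))
    return scores


def empirical_p_value(observed, null_scores):
    """Fraction of null scores >= observed, with add-one smoothing."""
    hits = sum(1 for s in null_scores if s >= observed)
    return (hits + 1) / (len(null_scores) + 1)


rng = random.Random(0)           # fixed seed so the run is repeatable
QUERY = "ACGTACGTTAGC"           # hypothetical sequences
TARGET = "TTACGTACGTAA"
OBSERVED = local_alignment_score(QUERY, TARGET)
NULL_SCORES = shuffled_null_scores(QUERY, TARGET, 200, rng)
P_EMPIRICAL = empirical_p_value(OBSERVED, NULL_SCORES)
print(OBSERVED, P_EMPIRICAL)
```

Shuffling preserves the letter composition of the target, so the null scores reflect what chance alone produces for sequences of this length and composition.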

Page 9: Comp. Genomics Recitation 3 The statistics of database searching.

Empirical score distribution

• The picture shows a distribution of scores from a real database search using BLAST.

• This distribution contains scores from both non-homologous and homologous pairs. The high scores come from homologous pairs.

Page 10: Comp. Genomics Recitation 3 The statistics of database searching.

Empirical null score distribution

• This distribution is similar to the previous one, but generated using a randomized sequence database.

Page 11: Comp. Genomics Recitation 3 The statistics of database searching.

Statistical analysis

• What is a null hypothesis?

  • An assumption that may be contradicted (but not validated) by the data.

• The purpose of most statistical tests is to determine whether the observed data can be explained by the null hypothesis.

Page 12: Comp. Genomics Recitation 3 The statistics of database searching.

Overview

• What is a p-value?

  • The probability of observing an effect as strong as or stronger than the one you observed, given the null hypothesis. I.e., “How likely is this effect to occur by chance?”

• $\Pr(x \ge S \mid \text{null})$

Page 13: Comp. Genomics Recitation 3 The statistics of database searching.

Extreme value distribution

• What is the name of the distribution created by local alignment scores, and what does it look like?

  • Extreme value distribution, or Gumbel distribution.

  • It looks similar to a normal distribution, but it has a heavier tail on the right.

  • Maximum scores of ungapped local alignments follow this distribution, and gapped alignment scores seem to follow it as well.
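The heavier right tail can be made concrete by comparing tail probabilities. The sketch below uses a standard Gumbel distribution (location 0, scale 1, chosen arbitrarily) and a normal distribution with the same mean and standard deviation, then evaluates both three standard deviations above the mean.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def gumbel_sf(x, mu=0.0, beta=1.0):
    """Pr(X > x) for a Gumbel (maximum) distribution: 1 - exp(-exp(-(x - mu) / beta))."""
    return -math.expm1(-math.exp(-(x - mu) / beta))


def normal_sf(x, mean=0.0, sd=1.0):
    """Pr(X > x) for a normal distribution."""
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2.0)))


MU, BETA = 0.0, 1.0                      # arbitrary location and scale
MEAN = MU + EULER_GAMMA * BETA           # mean of the Gumbel distribution
SD = math.pi * BETA / math.sqrt(6.0)     # its standard deviation
X = MEAN + 3.0 * SD                      # three standard deviations above the mean

GUMBEL_TAIL = gumbel_sf(X, MU, BETA)
NORMAL_TAIL = normal_sf(X, MEAN, SD)
print(GUMBEL_TAIL, NORMAL_TAIL)
```

At this point the Gumbel tail probability is several times larger than the normal one, which is why a normal approximation would overstate the significance of high alignment scores.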

Page 14: Comp. Genomics Recitation 3 The statistics of database searching.

Extreme value distribution

• The expected number of matches with a score ≥ S (the E-value) is given by the formula:

  $E = K m n e^{-\lambda S}$

  where m, n are the sequence lengths, λ is a scaling parameter for the scoring system and K is a scaling parameter for the search space (e.g. accounts for overlaps)

• For ungapped local alignments, the parameters can be calculated directly from the substitution matrix scores and the lengths of the aligned sequences
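A direct transcription of the E-value formula is sketched below. The values of λ and K are illustrative placeholders rather than parameters taken from the slides; in practice they depend on the scoring system.

```python
import math


def e_value(score, m, n, lam, k):
    """Expected number of chance matches with score >= `score`: K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * score)


LAM = 0.318   # placeholder scaling parameter for the scoring system
K = 0.134     # placeholder scaling parameter for the search space

# A query of length 250 against a sequence of length 400, at several scores.
for s in (20, 40, 60):
    print(s, e_value(s, m=250, n=400, lam=LAM, k=K))
```

Because the score sits in the exponent, raising S by ln(2)/λ halves the E-value, while doubling either sequence length doubles it.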

Page 15: Comp. Genomics Recitation 3 The statistics of database searching.

Exercise

• Assuming that the probability of seeing x matches with score ≥ S is given by the Poisson distribution:

  $P(x) = \frac{e^{-\mu} \mu^{x}}{x!}$

  where μ is the mean, what is the p-value of the score S?

Page 16: Comp. Genomics Recitation 3 The statistics of database searching.

Solution

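• The p-value is the probability of seeing at least one match with score ≥ S by chance

• The mean of the Poisson distribution is the E-value, $\mu = K m n e^{-\lambda S}$, so the probability of not seeing the score by chance is

  $P(x = 0) = e^{-\mu} = \exp(-K m n e^{-\lambda S})$

• The probability of seeing the score by chance is

  $1 - \exp(-K m n e^{-\lambda S})$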
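The conversion from E-value to p-value is a one-liner. The sketch below uses `math.expm1` to keep precision for small E; the function name is chosen here for illustration.

```python
import math


def p_value_from_e(e):
    """Pr(at least one chance match) = 1 - exp(-E) under the Poisson model."""
    return -math.expm1(-e)


for e in (0.001, 0.1, 1.0, 10.0):
    print(e, p_value_from_e(e))
```

For small E the p-value is almost equal to E, while for large E it saturates near 1, so the two quantities only differ noticeably when chance matches are expected anyway.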

Page 17: Comp. Genomics Recitation 3 The statistics of database searching.

What p-value is significant?

• The most common thresholds are 0.01 and 0.05.

• A threshold of 0.05 means that a result explainable by the null hypothesis is called significant only 5% of the time (loosely, you are "95% sure").

• Is 95% enough? It depends upon the cost associated with making a mistake.

• Examples of costs:

  • Doing expensive wet lab validation.

  • Making clinical treatment decisions.

  • Misleading the scientific community.

Page 18: Comp. Genomics Recitation 3 The statistics of database searching.

Database searching

A database contains many sequences

Problem: multiple comparisons increase the chance of a random high score

Page 19: Comp. Genomics Recitation 3 The statistics of database searching.

Multiple testing

• Say that you perform a statistical test with a 0.05 threshold, but you repeat the test on twenty different observations.

• Assume that all of the observations are explainable by the null hypothesis.

• What is the chance that at least one of the observations will receive a p-value less than 0.05?

Page 20: Comp. Genomics Recitation 3 The statistics of database searching.

Multiple testing

• Say that you perform a statistical test with a 0.05 threshold, but you repeat the test on twenty different observations. Assuming that all of the observations are explainable by the null hypothesis, what is the chance that at least one of the observations will receive a p-value less than 0.05?

• Pr(making a mistake) = 0.05

• Pr(not making a mistake) = 0.95

• Pr(not making any mistake) = $0.95^{20}$ = 0.358

• Pr(making at least one mistake) = 1 - 0.358 = 0.642

• There is a 64.2% chance of making at least one mistake.
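The same arithmetic generalizes to any threshold and number of tests. The sketch below assumes the tests are independent, as the calculation on the slide does.

```python
def family_wise_error_rate(alpha, n_tests):
    """Pr(at least one false positive) over n independent tests at threshold alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


FWER = family_wise_error_rate(0.05, 20)
print(round(FWER, 3))  # 0.642

# The error rate climbs quickly as the number of tests grows.
for n in (1, 5, 20, 100):
    print(n, round(family_wise_error_rate(0.05, n), 3))
```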

Page 21: Comp. Genomics Recitation 3 The statistics of database searching.

Bonferroni correction

• Divide the desired p-value threshold by the number of tests performed.

• For the previous example, 0.05 / 20 = 0.0025.

• Pr(making a mistake) = 0.0025

• Pr(not making a mistake) = 0.9975

• Pr(not making any mistake) = $0.9975^{20}$ = 0.9512

• Pr(making at least one mistake) = 1 - 0.9512 = 0.0488
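A short check that the corrected threshold keeps the overall error rate at or below the desired level for any number of tests. This is a sketch assuming independent tests; the function names are illustrative.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that targets an overall error rate of alpha."""
    return alpha / n_tests


def error_rate_after_correction(alpha, n_tests):
    """Pr(at least one false positive) when each test uses alpha / n_tests."""
    per_test = bonferroni_threshold(alpha, n_tests)
    # 1 - (1 - per_test) ** n_tests, computed stably for very small per_test.
    return -math.expm1(n_tests * math.log1p(-per_test))


ALPHA = 0.05
RATES = {n: error_rate_after_correction(ALPHA, n) for n in (1, 20, 1000, 10**6)}
for n, rate in RATES.items():
    print(n, bonferroni_threshold(ALPHA, n), round(rate, 4))
```

The overall error rate stays slightly below the target as the number of tests grows, which is why the correction is described as conservative.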

Page 22: Comp. Genomics Recitation 3 The statistics of database searching.

Corrections for Database searching

• Say that you search the non-redundant protein database at NCBI, containing roughly one million sequences. What p-value threshold should you use?

• What is the hidden assumption here?

Page 23: Comp. Genomics Recitation 3 The statistics of database searching.

Example

• Say that you search the non-redundant protein database at NCBI, containing roughly one million sequences. What p-value threshold should you use?

• Say that you want to use a conservative p-value of 0.001.

• Recall that you would observe such a p-value by chance approximately once in every 1,000 comparisons against a random database.

• A Bonferroni correction would suggest using a p-value threshold of 0.001 / 1,000,000 = 0.000000001 = $10^{-9}$.

Page 27: Comp. Genomics Recitation 3 The statistics of database searching.

Exercise

• A sequence of length m is queried against a database. The database contains k sequences of lengths $n_1, n_2, \ldots, n_k$. The E-value for alignment i is $K m n_i e^{-\lambda S}$.

  • What is the query E-value for score S?

• If we know that the p-value for alignment i is $P_i$, what is the query p-value?

Page 28: Comp. Genomics Recitation 3 The statistics of database searching.

Solution

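• Let $A_i$ denote the number of matches in alignment i that scored ≥ S

• $E(A_1 + A_2 + \ldots + A_k) = E(A_1) + E(A_2) + \ldots + E(A_k) = \sum_{i=1}^{k} K m n_i e^{-\lambda S} = K m e^{-\lambda S} \sum_{i=1}^{k} n_i = K m n e^{-\lambda S}$

  where $n = \sum_{i=1}^{k} n_i$ is the total length of the database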
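The additivity of the per-sequence E-values can be checked numerically. The database lengths and the values of λ and K below are made up for illustration.

```python
import math


def alignment_e_value(score, m, n_i, lam, k):
    """E-value of the query against one database sequence of length n_i."""
    return k * m * n_i * math.exp(-lam * score)


def database_e_value(score, m, lengths, lam, k):
    """Query E-value: the sum of the per-sequence E-values."""
    return sum(alignment_e_value(score, m, n_i, lam, k) for n_i in lengths)


LAM, K = 0.3, 0.1                  # placeholder parameters
LENGTHS = [120, 350, 80, 1000]     # hypothetical database sequence lengths
M, S = 200, 45

SUMMED = database_e_value(S, M, LENGTHS, LAM, K)
CLOSED_FORM = K * M * sum(LENGTHS) * math.exp(-LAM * S)
print(SUMMED, CLOSED_FORM)
```

Both numbers agree up to rounding, so searching k sequences behaves like searching one sequence whose length is the total database length.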

Page 29: Comp. Genomics Recitation 3 The statistics of database searching.

Solution

• The probability of seeing a match with score ≥ S by chance in the entire database is

  $1 - \prod_{i=1}^{k} (1 - P_i) = 1 - \prod_{i=1}^{k} \exp(-K m n_i e^{-\lambda S}) = 1 - \exp\left(-\sum_{i=1}^{k} K m n_i e^{-\lambda S}\right)$
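The combination of per-alignment p-values can be sketched as follows, assuming the alignments are independent. The parameter values and lengths are placeholders, and the helper names are hypothetical.

```python
import math


def alignment_p_value(score, m, n_i, lam, k):
    """P_i = 1 - exp(-K * m * n_i * exp(-lambda * S)) for one database sequence."""
    return -math.expm1(-k * m * n_i * math.exp(-lam * score))


def database_p_value(p_values):
    """1 - prod(1 - P_i): chance of at least one match in the whole database."""
    if any(p >= 1.0 for p in p_values):
        return 1.0  # a certain match somewhere makes the product zero
    log_no_match = sum(math.log1p(-p) for p in p_values)
    return -math.expm1(log_no_match)


LAM, K = 0.3, 0.1                  # placeholder parameters
LENGTHS = [120, 350, 80, 1000]     # hypothetical database sequence lengths
M, S = 200, 45

P_VALUES = [alignment_p_value(S, M, n_i, LAM, K) for n_i in LENGTHS]
COMBINED = database_p_value(P_VALUES)
FROM_TOTAL_E = -math.expm1(-K * M * sum(LENGTHS) * math.exp(-LAM * S))
print(COMBINED, FROM_TOTAL_E)
```

The combined p-value matches 1 - exp(-E) evaluated with the query E-value for the whole database, as the last equality in the solution states.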