Comp. Genomics Recitation 3 The statistics of database searching.

Comp. Genomics

Recitation 3The statistics of database searching

Substitution matrices

• Random model(R): x and y appear at random•

• Match model(M): x and y are derived from a common ancestor•

• The odds ratio:

• Using the log of the odds ratio gives an additive scoring system

( , | )i jx y

i j

P x y R q q

( , | )i ix y

i

P x y M p( , | )

( , | )

P x y M

P x y R


• PAM (Point Accepted Mutation, by Margaret Dayhoff ‘78) matrices extrapolates amino acids substitution rates of evolutionary distant proteins from observed substitutions of evolutionary close proteins• PAM1 gives the frequencies of changes for proteins that

differ at 1% of the sites• PAM250=(PAM1)250 give the frequencies of changes for

proteins that differ in 250% of sites (multiple substitutions per site are possible)

• This calculation assumes that changes are independent of each other and past changes


• BLOSUM (Blocks of Amino Acid Substitution Matrix) matrices derive substitution rates from proteins of various evolutionary distances• Change rates were derived from alignment of common

patterns• E.g., Blosum60 was derived from patterns that were 60%

similar• No assumption is made regarding the evolutionary

relationship between the proteins from which the patterns are taken

• Pij is the probability of two amino acids i and j replacing each other in a homologous sequence, and qi and qj are the background probabilities

Exercise

The following substitution matrix is given:

What is the average score per nucleotide pair? Assume each nucleotide appears with equal probability.

T C A G

T 1 0 -1 -1

C 0 1 -1 -1

A -1 -1 1 0

G -1 -1 0 1

Solution

ji

jis, 16

4),(

4

1

4

1

•What happens if the average score is not negative?•What is the chance that in a pair of evolutionary related sequences T is replaced by C?

Solution

S(i,j)=ji

ij

qq

p2log 0log2

ji

ij

qq

p

1ji

ij

qq

p

Pij=1/16

Does the optimal alignment change if we multiply the matrix by a constant C?

How significant is my score?

• Create a mathematical model of the alignment of random sequences, and derive the score distribution analytically

• Use simulation to estimate the score distribution of the alignment of:• Generated sequences• Real sequences that are known to be

non-homologous, or that are shuffled

Empirical score distribution

• The picture shows a distribution of scores from a real database search using BLAST.

• This distribution contains scores from non-homologous and homologous pairs. High scores from homology.

Empirical null score distribution

• This distribution is similar to the previous one, but generated using a randomized sequence database.

Statistical analysis

• What is a null hypothesis?• An assumption that may be

contradicted (but not validated) by the data.

• The purpose of most statistical tests is to determine whether the observed data can be explained by the null hypothesis.

Overview

• What is a p-value?• The probability of observing an effect as

strong or stronger than you observed, given the null hypothesis. I.e., “How likely is this effect to occur by chance?”

• Pr(x > S|null)

Extreme value distribution

• What is the name of the distribution created by local alignment scores, and what does it look like?• Extreme value distribution, or Gumbel

distribution.• It looks similar to a normal distribution, but it

has a larger tail on the right.• Ungapped local alignment max scores follow

this distribution, and gapped alignment scores seem to follow it

Extreme value distribution

• The expected number of matches with a score≥ S is given by the formula:

• For ungapped local alignments, the parameters can be calculated directly from the substitution matrix scores and the lengths of the aligned sequences

SE Kmne

where m,n are sequence lengths, λ is a scaling parameter for the scoring system and K is a scaling parameter for the search space (e.g. accounts for overlaps)

(E-value)

Exercise

• Assuming that the probability for seeing x matches with score ≥S is given by the Poisson distribution:

where μ is the mean, what is the p-value of the score S?

!

x

ex

Solution

• The p-value is the probability of seeing the score ≥S by chance

• The probability of not seeing the score by chance is

• The probability of seeing the score by chance is

SKmnee

1 exp( )S

Kmne

What p-value is significant?• The most common thresholds are 0.01 and

0.05.• A threshold of 0.05 means you are 95%

sure that the result is significant.• Is 95% enough? It depends upon the cost

associated with making a mistake.• Examples of costs:

• Doing expensive wet lab validation.• Making clinical treatment decisions.• Misleading the scientific community.

Database searching

A database contains many sequences

Problem: multiple comparisonsIncrease chance for random high score

http://upload.wikimedia.org/wikipedia/commons/e/e4/Growth_of_Genbank.svg

Multiple testing

• Say that you perform a statistical test with a 0.05 threshold, but you repeat the test on twenty different observations.

• Assume that all of the observations are explainable by the null hypothesis.

• What is the chance that at least one of the observations will receive a p-value less than 0.05?

Multiple testing

• Say that you perform a statistical test with a 0.05 threshold, but you repeat the test on twenty different observations. Assuming that all of the observations are explainable by the null hypothesis, what is the chance that at least one of the observations will receive a p-value less than 0.05?

• Pr(making a mistake) = 0.05• Pr(not making a mistake) = 0.95• Pr(not making any mistake) = 0.9520 = 0.358• Pr(making at least one mistake) = 1 - 0.358 = 0.642

• There is a 64.2% chance of making at least one mistake.

Bonferroni correction

• Divide the desired p-value threshold by the number of tests performed.

• For the previous example, 0.05 / 20 = 0.0025.

• Pr(making a mistake) = 0.0025• Pr(not making a mistake) = 0.9975• Pr(not making any mistake) = 0.997520 = 0.9512• Pr(making at least one mistake) = 1 - 0.9512 = 0.0488

Corrections for Database searching

• Say that you search the non-redundant protein database at NCBI, containing roughly one million sequences. What p-value threshold should you use?

• What is the hidden assumption here?

Example

• Say that you search the non-redundant protein database at NCBI, containing roughly one million sequences. What p-value threshold should you use?

• Say that you want to use a conservative p-value of 0.001.

• Recall that you would observe such a p-value by chance approximately every 1000 times in a random database.

• A Bonferroni correction would suggest using a p-value threshold of 0.001 / 1,000,000 = 0.000000001 = 10-9.

Exercise

• A sequence of size m is queried against a

database. The database contains k

sequences of lengths n1,n2,…,nk. The E-

value for alignment i is .• What is the query E-value for score S?

• If we know that the p-value for alignment i

is Pi, what is the query p-value?

SiKmn e

Solution

• Let Ai denote the number of matches in alignment i that scored ≥ S

• E(A1+A2+…+Ak)=E(A1)+E(A2)+…+E(Ak)=

1 1

k kS S S

i ii i

Kmn e Kme n Kmne

Solution

• The probability of seeing a match with score ≥S by chance in the entire database is

11 1

1 (1 ) 1 exp( ) 1 exp( )k k k

S Si i i

ii i

P Kmn e Kmn e

Comp. Genomics Recitation 3 The statistics of database searching.

Documents

Transcript of Comp. Genomics Recitation 3 The statistics of database searching.