Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We...

30
Fa07 CSE 182 CSE182-L4: Database filtering

Transcript of Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We...

Page 1: Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Fa07 CSE 182

CSE182-L4: Database filtering

Page 2: Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Fa07 CSE 182

Summary (through lecture 3)

• A2 is online • We considered the basics of sequence alignment

– Opt score computation– Reconstructing alignments– Local alignments– Affine gap costs– Space saving measures

• The local alignment algorithm (for biological strings) was given by Temple Smith, and Mike Waterman, and is called the Smith-Waterman algorithm (SW)

• Can we recreate Blast?

Page 3: Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Fa07 CSE 182

The birthday question

• True or False: 2 people in our class share a birthday?

• How large should the class be that you can answer this reliably

Page 4: Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Fa07 CSE 182

Basic Probability

• Q: Toss a coin many times: If it is HEADS, you get a dollar. Otherwise you lose a dollar.

• What is the money you expect to get (lose) after 100 tosses?

Page 5: Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Fa07 CSE 182

Basic Probability:Expectation

• Q: Toss a coin many times: If it is HEADS, and the previous time was HEADS too, you get a dollar. Otherwise you lose a dollar.

• What is the money you expect to get (lose) after 100 tosses?

Page 6: Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Fa07 CSE 182

Blast and local alignment

• Concatenate all of the database sequences to form one giant sequence.

• Do local alignment computation with the query.

Page 7: Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Fa07 CSE 182

Large database search

Query (m)

Database (n)

Database size n=100M, Querysize m=1000.O(nm) = 1011 computations

Page 8: Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Fa07 CSE 182

Why is Blast Fast?

Page 9: Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Fa07 CSE 182

Observations

• For a typical query, there are only a few ‘real hits’ in the database.

• Much of the database is random from the query’s perspective

• Consider a random DNA string of length n. – Pr[A]=Pr[C] = Pr[G]=Pr[T]=0.25

• Assume for the moment that the query is all A’s (length m).

• What is the probability that an exact match to the query can be found?

Page 10: Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Fa07 CSE 182

Basic probability

• Probability that there is a match starting at a fixed position i = 0.25m

• What is the probability that some position i has a match.

• Dependencies confound probability estimates.

Page 11: Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Fa07 CSE 182

Basic Probability:Expectation

• Q: Toss a coin many times: If it is HEADS, and the previous time was HEADS too, you get a dollar. Otherwise you lose a dollar.

• What is the money you expect to get after n tosses?

– Let Xi be the amount earned in the i-th toss

E(X i) =1.p + (−1).(1− p) = 2p −1

Total money you expect to earn

E( X i) = E(X i) =i

∑ n(2p −1)i

Page 12: Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Fa07 CSE 182

Expected number of matches

• Expected number of matches can still be computed.

Let Xi=1 if there is a match starting at position i, Xi=0 otherwise

Pr(Match at Position i) = pi = 0.25m

E(X i) = pi = 0.25m

Expected number of matches =

E( X i) = E(X i)i∑

i∑ = n 1

4( )m

i

Page 13: Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Fa07 CSE 182

Expected number of exact Matches is small!

• Expected number of matches = n*0.25m

– If n=107, m=10, • Then, expected number of matches = 9.537

– If n=107, m=11• expected number of hits = 2.38

– n=107,m=12, • Expected number of hits = 0.5 < 1

• Bottom Line: An exact match to a substring of the query is unlikely just by chance.

Page 14: Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Fa07 CSE 182

Observation 2

• What is the pigeonhole principle?

Suppose we are looking for a database string with greater than 90% identity to the query (length 100) Partition the query into size 10 substrings. At least one much match the database string exactly

Page 15: Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Fa07 CSE 182

Why is this important?

• Suppose we are looking for sequences that are 80% identical to the query sequence of length 100.

• Assume that the mismatches are randomly distributed.• What is the probability that there is no stretch of 10 bp,

where the query and the subject match exactly?

• Rough calculations show that it is very low. Exact match of a short query substring to a truly similar subject is very high.– The above equation does not take dependencies into

account– Reality is better because the matches are not randomly

distributed

≅ 1− 810( )

10 ⎛ ⎝ ⎜ ⎞

⎠ ⎟90

= 0.000036

Page 16: Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Fa07 CSE 182

Just the Facts

• Consider the set of all substrings of the query string of fixed length W.– Prob. of exact match to a random database

string is very low.– Prob. of exact match to a true homolog is

very high.– Keyword Search (exact matches) is MUCH

faster than sequence alignment

Page 17: Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Fa07 CSE 182

BLAST

• Consider all (m-W) query words of size W (Default = 11)• Scan the database for exact match to all such words• For all regions that hit, extend using a dynamic programming alignment.• Can be many orders of magnitude faster than SW over the entire string

Database (n)

Page 18: Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Fa07 CSE 182

Why is BLAST fast?

• Assume that keyword searching does not consume any time and that alignment computation the expensive step.

• Query m=1000, random Db n=107, no TP• SW = O(nm) = 1000*107 = 1010 computations• BLAST, W=11

• E(#11-mer hits)= 1000* (1/4)11 * 107=2384• Number of computations =

2384*100*100=2.384*107

• Ratio=1010/(2.384*107)=420• Further speed improvements are possible

Page 19: Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Fa07 CSE 182

Keyword Matching

• How fast can we match keywords?

• Hash table/Db index? What is the size of the hash table, for m=11

• Suffix trees? What is the size of the suffix trees?

• Trie based search. We will do this in class.

AATCA 567

Page 20: Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Fa07 CSE 182

Silly Quiz

• Name a famous Bioinformatics Researcher

• Name a famous Bioinformatics Researcher who is a woman

Page 21: Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Fa07 CSE 182

Scoring DNA

• DNA has structure.

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

Page 22: Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Fa07 CSE 182

DNA scoring matrices

• So far, we considered a simple match/mismatch criterion.

• The nucleotides can be grouped into Purines (A,G) and Pyrimidines.

• Nucleotide substitutions within a group (transitions) are more likely than those across a group (transversions)

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

Page 23: Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Fa07 CSE 182

Scoring proteins• Scoring protein sequence alignments is

a much more complex task than scoring DNA– Not all substitutions are equal

• Problem was first worked on by Pauling and collaborators

• In the 1970s, Margaret Dayhoff created the first similarity matrices.– “One size does not fit all”– Homologous proteins which are

evolutionarily close should be scored differently than proteins that are evolutionarily distant

– Different proteins might evolve at different rates and we need to normalize for that

Page 24: Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Fa07 CSE 182

PAM 1 distance

• Two sequences are 1 PAM apart if they differ in 1 % of the residues.

• PAM1(a,b) = Pr[residue b substitutes residue a, when the sequences are 1 PAM apart]

1% mismatch

Page 25: Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Fa07 CSE 182

PAM1 matrix

• Align many proteins that are very similar– Is this a problem?

• 1 PAM evolutionary distance represents the time in which 1% of the residues have changed

• Estimate the frequency Pb|a of residue a being substituted by residue b.

• PAM1(a,b) = Pa|b = Pr(b will mutate to an a after 1 PAM evolutionary distance)

• Scoring matrix – S(a,b) = log10(Pab/PaPb) = log10(Pb|a/Pb)

Page 26: Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Fa07 CSE 182

PAM 1

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

• Top column shows original, and left column shows replacement residue = PAM1(a,b) = Pr(a|b)

Page 27: Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Fa07 CSE 182

PAM distance

• Two sequences are 1 PAM apart when they differ in 1% of the residues.

• When are 2 sequences 2 PAMs apart?

1 PAM

1 PAM

2 PAM

Page 28: Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Fa07 CSE 182

Higher PAMs

• PAM2(a,b) = ∑c PAM1(a,c). PAM1 (c,b)

• PAM2 = PAM1 * PAM1 (Matrix multiplication)

• PAM250

– = PAM1*PAM249

– = PAM1250

Page 29: Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Fa07 CSE 182

Note: This is not the score matrix: What happens as you keep increasing the power?

Page 30: Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Fa07 CSE 182

END of L4