. Fasta, Blast, Probabilities. 2 Reminder u Last classes we discussed dynamic programming algorithms...

Fasta, Blast, Probabilities

Reminder

In previous classes we discussed dynamic programming algorithms for:

• global alignment
• local alignment
• multiple alignment

All of these assumed a scoring rule

$$\sigma : (\Sigma \cup \{-\}) \times (\Sigma \cup \{-\}) \to \mathbb{R}$$

that determines the quality of perfect matches, substitutions, insertions, and deletions.

Alignment in Real Life

One of the major uses of alignments is to find sequences in a “database.”

The current protein database contains about 100 million (i.e., $10^8$) residues. Searching with a target sequence of length 1000 therefore requires evaluating about $10^{11}$ matrix cells, which takes about three hours at a rate of 10 million evaluations per second.

This is quite annoying when, say, one thousand target sequences need to be searched, because the run will take about four months.
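
The estimate above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch; the variable names are made up for illustration and the figures are the ones quoted on the slide.

```python
# Back-of-the-envelope cost of exhaustive DP search (figures from the slide).
database_residues = 10**8      # about 100 million residues
query_length = 10**3           # one target sequence of length 1000
cells_per_second = 10**7       # 10 million cell evaluations per second

cells = database_residues * query_length             # 10**11 matrix cells
seconds_per_query = cells / cells_per_second          # 10**4 seconds
hours_per_query = seconds_per_query / 3600            # just under three hours
days_for_1000_queries = 1000 * seconds_per_query / 86400

print(f"{hours_per_query:.1f} hours per query")
print(f"{days_for_1000_queries:.0f} days for 1000 queries")
```

About 116 days is indeed close to four months.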

Heuristic Fast Search

Instead, most searches rely on heuristic procedures.

• These are not guaranteed to find the best match.
• Sometimes, they will completely miss a high-scoring match.

We now describe the main ideas used by the best known of these heuristic procedures.

Basic Intuition

Almost all heuristic search procedures are based on the observation that real-life matches often contain long strings with gap-less matches.

These heuristics try to find significant gap-less matches and then extend them.

Banded DP

Suppose that we have two strings s[1..n] and t[1..m] such that n ≈ m.

If the optimal alignment of s and t has few gaps, then the path of the alignment will be close to the diagonal.

Banded DP

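To find such a path, it suffices to search in a diagonal region of the matrix.

If the diagonal band has width k, then the dynamic programming step takes O(kn) time, much faster than the O(n^2) of standard DP.

[Figure: DP matrix with axes s and t and a diagonal band of width k. At the edge of the band, the cell V[i+1, i+k/2+1] has the in-band neighbor V[i, i+k/2], while V[i, i+k/2+1] is out of range.]

Note that i-j is constant along each diagonal.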
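
The banded recurrence can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function name, the match/mismatch/gap values, and the choice of a global (not local) score are all assumptions made for the example. Only cells with |i - j| <= k/2 are ever stored, so the work is O(kn).

```python
def banded_global_score(s, t, k, match=1, mismatch=-1, gap=-2):
    """Global alignment score restricted to cells with |i - j| <= k // 2.

    The scoring values are illustrative defaults, not taken from the slides.
    """
    n, m = len(s), len(t)
    half = k // 2
    if abs(n - m) > half:
        raise ValueError("band too narrow to reach the corner cell")
    # prev maps column j -> best score of aligning s[:i-1] with t[:j]
    prev = {j: j * gap for j in range(0, min(m, half) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - half), min(m, i + half) + 1):
            best = float("-inf")
            if j > 0 and (j - 1) in prev:      # diagonal move
                sub = match if s[i - 1] == t[j - 1] else mismatch
                best = max(best, prev[j - 1] + sub)
            if j in prev:                      # vertical move (gap in t)
                best = max(best, prev[j] + gap)
            if j > 0 and (j - 1) in cur:       # horizontal move (gap in s)
                best = max(best, cur[j - 1] + gap)
            cur[j] = best
        prev = cur
    return prev[m]
```

Cells outside the band are simply absent from the dictionaries, which plays the role of the "out of range" entries in the figure.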

Banded DP for local alignment

Problem: Where is the banded diagonal? It need not be the main diagonal when looking for a good local alignment.

How do we select which subsequences to align using banded DP?

We heuristically find potential diagonals and evaluate them using banded DP.

This is the main idea of FASTA.

Finding Potential Diagonals

Suppose that we have a relatively long gap-less match:

AGCGCCATGGATTGAGCGA
TGCGACATTGATCGACCTA

Can we find “clues” that will let us find it quickly? Each such clue defines a potential diagonal (which is then evaluated using banded DP).

Signature of a Match

Assumption: good matches contain several “patches” of perfect matches.

AGCGCCATGGATTGAGCTA
TGCGACATTGATCGACCTA

Since this is a gap-less alignment, all perfect-match regions should be on one diagonal.

FASTA-finding ungapped matches

Input: strings s and t, and a parameter ktup.

• Find all pairs (i,j) such that s[i..i+ktup-1] = t[j..j+ktup-1].
• Locate sets of pairs that are on the same diagonal, by sorting according to the difference i-j.
• Compute the score for the diagonal that contains all these pairs.

FASTA-finding ungapped matches

Input: strings s and t, and a parameter ktup.

Find all pairs (i,j) such that s[i..i+ktup-1] = t[j..j+ktup-1].

Step one: prepare an index of the database and the query sequence such that, given a sequence of length ktup, one gets the list of its positions (linear time).

Step two: for each ktup-long word of the query, add 1 to the diagonal (i-j) on which it appears. Then find contiguous (possibly with mismatches) ktup matches on the diagonals.
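
Steps one and two can be sketched as below. This is a simplified illustration under stated assumptions: the function names are invented, positions are 0-based, only the database string t is indexed, and the sketch stops at counting hits per diagonal (it does not merge neighboring hits or score the diagonals).

```python
from collections import defaultdict

def ktup_index(t, ktup):
    """Map every length-ktup word of t to the list of its start positions."""
    index = defaultdict(list)
    for j in range(len(t) - ktup + 1):
        index[t[j:j + ktup]].append(j)
    return index

def diagonal_hits(s, t, ktup):
    """Count exact ktup matches per diagonal, keyed by the offset i - j."""
    index = ktup_index(t, ktup)
    hits = defaultdict(int)
    for i in range(len(s) - ktup + 1):
        for j in index.get(s[i:i + ktup], ()):
            hits[i - j] += 1
    return dict(hits)

# The two example sequences from the earlier slide.
s = "AGCGCCATGGATTGAGCGA"
t = "TGCGACATTGATCGACCTA"
print(diagonal_hits(s, t, ktup=3))
```

Diagonals with many hits are the candidates that would later be handed to banded DP.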

FASTA- using banded DP

Step 3:

• Select the ten highest-scoring contiguous segments.
• Try and score all combinations of these ten segments in order to constitute a path through the matrix.

Step 4:

• Run banded DP on the region containing the best-scoring path (say, with width 12).

Hence, the algorithm may combine some diagonals into gapped matches (in the example below, diagonals 2 and 3 are combined).

[Figure: DP matrix with axes s and t, showing three diagonals labeled 1, 2, and 3.]

FASTA- practical choices

Some implementation choices and tricks have not been described here.

Most applications of FASTA use a very small ktup (1-2 for proteins, and 4-6 for DNA).

Higher values are faster, yielding fewer diagonals to search around, but increase the chance of missing the optimal local alignment.

FASTA-summary

Input: strings s and t, and a parameter ktup = 1, 2, 4, 5, or 6, depending on the application.

Output: a high-scoring local alignment.

1. Find pairs of matching substrings s[i..i+ktup-1] = t[j..j+ktup-1].
2. Extend to ungapped diagonals.
3. Extend to gapped matches using banded DP.

BLAST Overview (Basic Local Alignment Search Tool)

Input: strings s and t, and a parameter T = threshold value.

Output: a high-scoring local alignment.

Definition: two strings s and t of length k are a high scoring pair (HSP) if d(s,t) > T (usually only un-gapped alignments are considered).

1. Find high scoring pairs of substrings such that d(s,t) > T. These words serve as seeds for finding longer matches.
2. Extend to ungapped diagonals (as in FASTA).
3. Extend to gapped matches.

BLAST Overview (cont.)

Step 1: Find high scoring pairs of substrings such that d(s,t) > T (the seeds):

• Find all strings of length k that score at least T with substrings of s in a gapless alignment (k = 4 for proteins, 11 for DNA). Note that possibly not all k-words need to be tested, e.g. when such a word scores less than T with itself.
• Find in t all exact matches with each of the above strings.
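
The seeding step can be sketched by brute force. This is an illustration under stated assumptions, not BLAST's actual data structures: the function names are invented, `toy_score` is a made-up match/mismatch score rather than a real substitution matrix, "at least T" is taken as the threshold test, and enumerating every k-word is only feasible for small k and small alphabets.

```python
from itertools import product

def neighborhood(query, k, T, score, alphabet):
    """Map each k-word scoring at least T against some k-word of the query
    to the list of query positions it is a neighbor of."""
    seeds = {}
    for i in range(len(query) - k + 1):
        word = query[i:i + k]
        for cand in product(alphabet, repeat=k):
            total = sum(score(a, b) for a, b in zip(word, cand))
            if total >= T:
                seeds.setdefault("".join(cand), []).append(i)
    return seeds

def find_seed_hits(seeds, t, k):
    """Return (i, j) pairs where t[j:j+k] exactly equals a neighbor of query position i."""
    return [(i, j) for j in range(len(t) - k + 1)
            for i in seeds.get(t[j:j + k], ())]

def toy_score(a, b):
    """Hypothetical scores for illustration only."""
    return 2 if a == b else -1

seeds = neighborhood("ACG", k=3, T=3, score=toy_score, alphabet="ACGT")
print(find_seed_hits(seeds, "TTACTT", k=3))
```

With these toy scores a word with one mismatch still reaches the threshold, so `ACT` in the database hits the query word `ACG` even though the two are not identical.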

Extending Potential Matches

Once a seed is found, BLAST attempts to find a local alignment that extends the seed.

Seeds on the same diagonal are combined (as in FASTA), then extended as far as possible in a greedy manner without gaps.

During the extension phase, the search stops when the score falls below some lower bound computed by BLAST (to save time).

For the best ungapped alignment, run a banded Smith-Waterman (SW) alignment and assign a probabilistic score.
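
The greedy ungapped extension can be sketched as follows. The slide does not say how the lower bound is computed, so this sketch assumes one simple rule: stop extending once the running score has dropped more than `max_drop` below the best score seen so far. The function name, the `max_drop` parameter, and the return format are all hypothetical.

```python
def extend_ungapped(s, t, i, j, k, score, max_drop):
    """Extend the seed s[i:i+k] / t[j:j+k] along its diagonal without gaps.

    Each direction stops once the running score falls more than max_drop
    below the best score seen so far. Returns (best_score, start_i, end_i),
    where end_i is exclusive and both refer to positions in s.
    """
    seed_score = sum(score(a, b) for a, b in zip(s[i:i + k], t[j:j + k]))

    def one_direction(pos_s, pos_t, step):
        best = run = 0
        best_len = length = 0
        while 0 <= pos_s < len(s) and 0 <= pos_t < len(t):
            run += score(s[pos_s], t[pos_t])
            length += 1
            if run > best:
                best, best_len = run, length
            elif best - run > max_drop:
                break
            pos_s += step
            pos_t += step
        return best, best_len

    right, right_len = one_direction(i + k, j + k, +1)
    left, left_len = one_direction(i - 1, j - 1, -1)
    return seed_score + left + right, i - left_len, i + k + right_len
```

Only the best-scoring prefix of each extension is kept, so trailing mismatches explored before the cutoff do not end up in the reported segment.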

Why use probability to define and/or interpret a scoring function?

• Similarity is probabilistic in nature because biological changes like mutation, recombination, and selection are not deterministic.

• We could answer questions such as:
  • How probable is it that two sequences are similar?
  • Is the similarity found significant or random?
  • How should a similarity score change when, say, the mutation rate of a specific area on the chromosome becomes known?

A Probabilistic Model

• For now, we will focus on alignment without indels.
• For now, we assume each position (nucleotide/amino acid) is independent of other positions.
• We consider two options:

M: the sequences are Matched (related)

R: the sequences are Random (unrelated)

Unrelated Sequences

Our random model of unrelated sequences is simple:

• Each position is sampled independently from a distribution over the alphabet.
• We assume there is a distribution q(·) that describes the probability of letters in such positions.

Then:

$$P(s[1..n], t[1..n] \mid R) = \prod_i q(s[i])\, q(t[i])$$

Related Sequences

We assume that each pair of aligned positions (s[i],t[i]) evolved from a common ancestor.

• Let p(a,b) be a distribution over pairs of letters.
• p(a,b) is the probability that some ancestral letter evolved into this particular pair of letters.

$$P(s[1..n], t[1..n] \mid M) = \prod_i p(s[i], t[i])$$

Odds-Ratio Test for Alignment

$$Q = \frac{P(s,t \mid M)}{P(s,t \mid R)} = \frac{\prod_i p(s[i],t[i])}{\prod_i q(s[i])\, q(t[i])} = \prod_i \frac{p(s[i],t[i])}{q(s[i])\, q(t[i])}$$

If Q > 1, then the two strings s and t are more likely to be related (M) than unrelated (R).

If Q < 1, then the two strings s and t are more likely to be unrelated (R) than related (M).

Log Odds-Ratio Test for Alignment

Taking the logarithm of Q yields

$$\log Q = \log \frac{P(s,t \mid M)}{P(s,t \mid R)} = \log \prod_i \frac{p(s[i],t[i])}{q(s[i])\, q(t[i])} = \sum_i \underbrace{\log \frac{p(s[i],t[i])}{q(s[i])\, q(t[i])}}_{\text{Score}(s[i],t[i])}$$

If log Q > 0, then s and t are more likely to be related.

If log Q < 0, then they are more likely to be unrelated.

How can we relate this quantity to a score function?

Probabilistic Interpretation of Scores

We define the scoring function via

$$\sigma(a,b) = \log \frac{p(a,b)}{q(a)\, q(b)}$$

Then, the score of an alignment is the log-ratio between the two models:

• Score > 0: the Model (M) is more likely.
• Score < 0: the Random model (R) is more likely.
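
A small numerical sketch may help connect the log-odds score log(p(a,b) / (q(a) q(b))) to the sign of an alignment score. The two-letter alphabet and all probabilities below are invented for illustration, and the function names are hypothetical.

```python
import math

def log_odds_score(p, q):
    """Build the table sigma(a, b) = log(p(a, b) / (q(a) * q(b)))."""
    return {(a, b): math.log(pab / (q[a] * q[b])) for (a, b), pab in p.items()}

def alignment_score(s, t, sigma):
    """Sum of per-position scores of a gapless alignment; this equals log Q."""
    return sum(sigma[a, b] for a, b in zip(s, t))

# Toy two-letter alphabet; the numbers are invented for illustration only.
q = {"A": 0.5, "B": 0.5}
p = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}
sigma = log_odds_score(p, q)

print(alignment_score("AB", "AB", sigma) > 0)   # identical strings favor M
print(alignment_score("AB", "BA", sigma) < 0)   # all mismatches favor R
```

Because the score is a sum over positions, it can be plugged directly into the additive scoring rule used by the dynamic programming algorithms.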

Estimating Probabilities

Suppose we are given a long string s[1..n] of letters from the alphabet $\Sigma$.

We want to estimate the distribution q(·) that generated the sequence.

How should we go about this?

We build on the theory of parameter estimation in statistics, using either maximum likelihood estimation or the Bayesian approach.

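Estimating q(·)

Suppose we are given a long string s[1..n] of letters from the alphabet $\Sigma$.

• s can be the concatenation of all sequences in our database.
• We want to estimate the distribution q(·).

Likelihood function:

$$L(q \mid s) = \prod_{i=1}^{n} q(s[i]) = \prod_{a} q(a)^{N_a}$$

where $N_a$ is the number of times the letter a appears in s. That is, q is defined per letter.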

Estimating q(·) (cont.)

How do we define q?

Likelihood function:

$$L(q \mid s) = \prod_{i=1}^{n} q(s[i]) = \prod_{a} q(a)^{N_a}$$

ML parameters (Maximum Likelihood):

$$q(a) = \frac{N_a}{n}$$

MAP parameters (Maximum A posteriori Probability):

$$q(a) = \frac{N_a + 1}{n + |\Sigma|}$$
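
Both estimates are simple functions of the letter counts, as the sketch below illustrates. The function name and the `pseudocount` parameter are assumptions made for the example: a pseudocount of 0 gives the relative frequency N_a / n, and a pseudocount of 1 gives the add-one estimate (N_a + 1) / (n + |alphabet|).

```python
from collections import Counter

def estimate_q(s, alphabet, pseudocount=0):
    """Estimate letter probabilities from the counts N_a in s.

    pseudocount=0 -> N_a / n
    pseudocount=1 -> (N_a + 1) / (n + len(alphabet))
    """
    counts = Counter(s)
    denom = len(s) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / denom for a in alphabet}

print(estimate_q("AACG", "ACGT"))                 # T gets probability 0
print(estimate_q("AACG", "ACGT", pseudocount=1))  # T gets a small positive probability
```

The add-one variant never assigns probability zero to a letter that happens to be missing from the sample, which matters because the scoring function takes logarithms of these estimates.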

Estimating p(·,·)

Intuition:

• Find a pair of aligned sequences s[1..n], t[1..n].
• Estimate the probability of pairs:

$$p(a,b) = \frac{N_{a,b}}{n}$$

where $N_{a,b}$ is the number of times a is aligned with b in (s,t).

Again, s and t can be the concatenation of many aligned pairs from the database.
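
Counting aligned pairs is equally direct. The sketch below assumes a gapless alignment of two equal-length strings and returns the relative frequency of each ordered pair; the function name is hypothetical and no smoothing or symmetrization is applied.

```python
from collections import Counter

def estimate_p(s, t):
    """Relative frequency of each aligned letter pair (a, b) in a gapless alignment."""
    if len(s) != len(t):
        raise ValueError("aligned sequences must have equal length")
    counts = Counter(zip(s, t))
    n = len(s)
    return {pair: c / n for pair, c in counts.items()}

print(estimate_p("AACG", "AATG"))
```

Pairs that never occur in the sample are simply absent from the result, which is one reason real substitution matrices are built from large collections of aligned sequences.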

Problems in Estimating p(·,·)

• How do we find pairs of aligned sequences?
• How far is the ancestor?
  • earlier divergence → low sequence similarity
  • later divergence → high sequence similarity
• Does one letter mutate to the other, or are they both mutations of a common ancestor having yet another residue (amino acid or nucleotide)?

BLOSUM 62