CSE182-L5: Scoring matrices Dictionary Matching

Fa05 CSE 182


Silly Quiz


PAM 1 distance

• Two sequences are 1 PAM apart if they differ in 1% of the residues.

• PAM1(a,b) = Pr[residue a substitutes residue b, when the sequences are 1 PAM apart]

[Figure label: 1% mismatch]

PAM 1


Generating Higher PAMs

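• $\text{PAM2}(a,b) = \sum_c \text{PAM1}(a,c)\,\text{PAM1}(c,b)$

• PAM2 = PAM1 * PAM1 (matrix multiplication)

• PAM250
  – = PAM1 * PAM249
  – = $\text{PAM1}^{250}$

[Figure: a change from a to b under PAM2 equals a step from a to c under PAM1 followed by a step from c to b under PAM1, taken over every intermediate residue c]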
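The matrix-multiplication rule can be sketched in a few lines of Python. This is only an illustration: the 3-letter alphabet and the numbers in `PAM1` are made up (a real PAM1 is 20 x 20), and the helper names `matmul` and `matpow` are hypothetical.

```python
def matmul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matpow(M, n):
    """Return M raised to the n-th power (n >= 1) by repeated squaring."""
    result = None
    base = M
    while n > 0:
        if n & 1:
            result = base if result is None else matmul(result, base)
        base = matmul(base, base)
        n >>= 1
    return result


# Made-up 3-residue PAM1: symmetric, rows and columns sum to 1,
# and about 1% of the probability mass lies off the diagonal.
PAM1 = [[0.990, 0.006, 0.004],
        [0.006, 0.990, 0.004],
        [0.004, 0.004, 0.992]]

PAM2 = matpow(PAM1, 2)      # PAM2(a,b) = sum over c of PAM1(a,c) * PAM1(c,b)
PAM250 = matpow(PAM1, 250)  # PAM1 multiplied by itself 250 times

print(PAM2[0][1])
print(PAM250[0][0], PAM250[0][1])
```

Repeated squaring is a convenience here; multiplying by PAM1 249 times gives the same matrix up to rounding.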

Scoring residues

• A reasonable score function C(a,b) is given as follows:
  – Look at ‘high quality’ alignments
  – C(a,b) should be high when a,b are seen together more often than is expected by chance
  – C(a,b) should be low otherwise.

• How often would you expect to see a,b together just by chance?
  – Pa Pb

• Let Pab be the probability that a and b are aligned in a high-quality alignment

• A good scoring function is the log-odds score
  – C(a,b) = log10(Pab/(Pa Pb))
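A minimal sketch of the log-odds score, assuming Pab, Pa and Pb are estimated by simple counting over aligned residue pairs. The function name `log_odds_scores` and the toy pair list are hypothetical, not data from a real alignment.

```python
import math
from collections import Counter


def log_odds_scores(aligned_pairs):
    """Estimate C(a,b) = log10(Pab / (Pa * Pb)) from aligned residue pairs."""
    pair_counts = Counter()
    residue_counts = Counter()
    for a, b in aligned_pairs:
        # Count each column in both orders so that C(a,b) == C(b,a).
        pair_counts[(a, b)] += 1
        pair_counts[(b, a)] += 1
        residue_counts[a] += 1
        residue_counts[b] += 1
    total_pairs = sum(pair_counts.values())
    total_residues = sum(residue_counts.values())
    scores = {}
    for (a, b), count in pair_counts.items():
        p_ab = count / total_pairs
        p_a = residue_counts[a] / total_residues
        p_b = residue_counts[b] / total_residues
        scores[(a, b)] = math.log10(p_ab / (p_a * p_b))
    return scores


# Toy columns from an imagined high-quality alignment.
pairs = [("A", "A")] * 8 + [("A", "V")] * 2 + [("V", "V")] * 6 + [("W", "W")] * 4
scores = log_odds_scores(pairs)
for key in sorted(scores):
    print(key, round(scores[key], 3))
```

Pairs that never occur in the input (such as A with W here) get no entry, since log10(0) is undefined. A real matrix would need pseudocounts or much more data.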

Scoring alignments

• To compute Pab, we need ‘high-quality’ alignments

• How can you get quality alignments?
  – Use SW (but that needs the scoring function)
  – Build alignments manually
  – Use Dayhoff’s theory to extrapolate from high identity alignments

Scoring using PAM matrices

• Suppose we know that two sequences are 250 PAMs apart.

• S(a,b) = log10(Pab/(Pa Pb)) = log10(Pa|b/Pa) = log10(PAM250(a,b)/Pa)

• How does it help?
  – S250(A,V) >> S1(A,V)
  – Scoring of human (hum) vs. Drosophila (dros) should use a higher PAM matrix than scoring human vs. mouse (mus).
  – An alignment with a smaller % identity could still have a higher score and be more significant

[Figure labels: hum, mus, dros]

PAM250 based scoring matrix

• S250(a,b) = log10(Pab/(Pa Pb)) = log10(PAM250(a,b)/Pa)
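To see why S250(A,V) >> S1(A,V), here is a rough sketch that turns a PAM1 matrix into log-odds scores at two distances. Everything numeric is an assumption for illustration: a made-up 3-residue alphabet (A, V, W), a made-up symmetric PAM1, and a uniform background Pa = 1/3.

```python
import math


def matmul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(pam1, n):
    """PAMn = PAM1 multiplied by itself n times (plain loop, n >= 1)."""
    result = pam1
    for _ in range(n - 1):
        result = matmul(result, pam1)
    return result


def pam_scores(pam1, background, n):
    """S_n(a,b) = log10(PAMn(a,b) / Pa) for every residue pair."""
    pam = pam_n(pam1, n)
    size = len(pam)
    return [[math.log10(pam[a][b] / background[a]) for b in range(size)]
            for a in range(size)]


RESIDUES = ["A", "V", "W"]            # hypothetical 3-letter alphabet
PAM1 = [[0.990, 0.006, 0.004],        # made-up values, rows sum to 1
        [0.006, 0.990, 0.004],
        [0.004, 0.004, 0.992]]
BACKGROUND = [1 / 3, 1 / 3, 1 / 3]    # assumed uniform residue frequencies

S1 = pam_scores(PAM1, BACKGROUND, 1)
S250 = pam_scores(PAM1, BACKGROUND, 250)
A, V = RESIDUES.index("A"), RESIDUES.index("V")
print("S1(A,V)   =", round(S1[A][V], 3))
print("S250(A,V) =", round(S250[A][V], 3))
print("S1(A,A)   =", round(S1[A][A], 3))
print("S250(A,A) =", round(S250[A][A], 3))
```

In this toy setting the mismatch penalty shrinks and the match reward drops as the PAM distance grows, which is the behaviour the hum/mus/dros comparison relies on.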

BLOSUM series of Matrices

• Henikoff & Henikoff: Sequence substitutions in evolutionarily distant proteins do not seem to follow the PAM distributions

• A more direct method based on hand-curated multiple alignments of distantly related proteins from the BLOCKS database.

• BLOSUM60: Merge all proteins that have greater than 60% identity. Then compute the substitution probabilities.
  – In practice, BLOSUM62 seems to work very well.

PAM vs. BLOSUM

• What is the correspondence?

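• PAM series: PAM1, PAM2, PAM250

• BLOSUM series: Blosum1, Blosum2, Blosum62, Blosum100

• Note that the two scales run in opposite directions: a higher PAM number is meant for more divergent sequences, while a higher BLOSUM number is built from more similar sequences.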

P-value computation

• We use text filtering to filter the database quickly.
• The matching regions are expanded into alignments, which are scored using SW and an appropriate scoring matrix.
• The results are presented in order of score.
• How significant is the top-scoring hit if it has a score S?
• Expect/E-value (score S) = number of times we would expect to see a random query generate a score S, or better
• How can we compute the E-value?

What is a distribution function

• Given a collection of numbers (scores)
  – 1, 2, 8, 3, 5, 3, 6, 4, 4, 1, 5, 3, 6, 7, …

• Plot its distribution as follows:
  – X-axis = each number
  – Y-axis = count/frequency/probability of seeing that number
  – More generally, the x-axis can be a range to accommodate real numbers
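The distribution of the example scores can be tabulated directly. This is a small sketch; the function names `distribution` and `binned_distribution` are hypothetical, and the text histogram is just one way to "plot" the result.

```python
import math
from collections import Counter

scores = [1, 2, 8, 3, 5, 3, 6, 4, 4, 1, 5, 3, 6, 7]


def distribution(values):
    """Map each distinct value to the fraction of times it was seen."""
    counts = Counter(values)
    total = len(values)
    return {x: counts[x] / total for x in sorted(counts)}


def binned_distribution(values, bin_width):
    """Same idea for real numbers: each x-axis entry is the start of a range."""
    counts = Counter(math.floor(v / bin_width) * bin_width for v in values)
    total = len(values)
    return {lo: counts[lo] / total for lo in sorted(counts)}


dist = distribution(scores)
for x, prob in dist.items():
    # One '#' per observation gives a rough sideways histogram.
    print(x, "#" * round(prob * len(scores)), round(prob, 3))

print(binned_distribution([0.2, 0.7, 1.1, 1.9, 2.5], 1.0))
```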

P-value computation

• How significant is a score? What happens to significance when you change the score function?
• A simple empirical method:
  – Compute a distribution of scores against a random database.
  – Use an estimate of the area under the curve to get the probability.
  – OR, fit the distribution to one of the standard distributions.
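The empirical method might look like the sketch below. Several things are assumptions made for illustration: the score function is a simple ungapped local score (match +1, mismatch -1) rather than SW with a real scoring matrix, the random database is uniform over a DNA alphabet, and all function names are hypothetical.

```python
import random


def best_ungapped_score(q, d, match=1, mismatch=-1):
    """Best-scoring ungapped local alignment between q and d (0 if none)."""
    best = 0
    for offset in range(-(len(q) - 1), len(d)):   # one diagonal per offset
        run = 0
        for i in range(len(q)):
            j = i + offset
            if 0 <= j < len(d):
                step = match if q[i] == d[j] else mismatch
                run = max(0, run + step)          # restart when the run goes negative
                best = max(best, run)
    return best


def random_sequence(length, alphabet, rng):
    return "".join(rng.choice(alphabet) for _ in range(length))


def null_scores(query, db_length, trials, alphabet="ACGT", seed=0):
    """Scores of the query against many random database sequences."""
    rng = random.Random(seed)
    return [best_ungapped_score(query, random_sequence(db_length, alphabet, rng))
            for _ in range(trials)]


def empirical_p_value(observed, scores):
    """Fraction of random scores >= observed (area under the tail)."""
    return sum(1 for s in scores if s >= observed) / len(scores)


query = "ACGTACGTAC"
background = null_scores(query, db_length=200, trials=300)
for observed in (4, 6, 8, 10):
    print(observed, empirical_p_value(observed, background))
```

With only a few hundred random trials, very small probabilities cannot be resolved, which is one motivation for fitting a standard distribution instead.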

Z-scores for alignment

• Initial assumption was that the scores followed a normal distribution.

• Z-score computation:
  – For any alignment with score S, shuffle one of the sequences many times, and recompute the alignment each time. Get the mean μ and standard deviation σ of the shuffled scores.
  – Compute the Z-score:

$$Z_S = \frac{S - \mu}{\sigma}$$

  – Look up a table to get a P-value
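A sketch of the shuffling procedure, under stated assumptions: the alignment score is a basic Smith-Waterman score with arbitrary illustrative parameters (match 2, mismatch -1, linear gap -2), and the table lookup is replaced by the standard normal CDF from Python's `statistics.NormalDist`. The function names are hypothetical.

```python
import random
import statistics


def sw_score(s, t, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(t) + 1)
    best = 0
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def z_score(s, t, shuffles=200, seed=0):
    """Z = (S - mu) / sigma, with mu and sigma taken from shuffled copies of t."""
    rng = random.Random(seed)
    observed = sw_score(s, t)
    letters = list(t)
    shuffled_scores = []
    for _ in range(shuffles):
        rng.shuffle(letters)
        shuffled_scores.append(sw_score(s, "".join(letters)))
    mu = statistics.mean(shuffled_scores)
    sigma = statistics.pstdev(shuffled_scores)
    if sigma == 0:
        raise ValueError("shuffled scores have zero variance")
    return (observed - mu) / sigma


def normal_p_value(z):
    """Upper-tail probability under the normal assumption."""
    return 1.0 - statistics.NormalDist().cdf(z)


z = z_score("ACGTTGCAAGCTTAGC", "ACGTTGCAAGCTTAGC")
print("Z =", round(z, 2), "P =", normal_p_value(z))
```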

Blast E-value

• Initial (and natural) assumption was that scores followed a Normal distribution

• In 1990, Karlin and Altschul showed that ungapped local alignment scores follow an exponential distribution

• Practical consequence:
  – Longer tail.
  – Previously significant hits now not so significant

Altschul Karlin statistics

• For simplicity, assume that the database is a binary string, and so is the query.
  – Let match score = 1,
  – mismatch score = −∞,
  – indel = −∞ (no gaps allowed)

• What does it mean to get a score k?

Exponential distribution

• Random database, Pr(1) = p

• What is the expected number of hits to a sequence of k 1’s?

$$(n-k)\,p^k \cong n\,e^{k \ln p} = n\,e^{-k \ln\left(\frac{1}{p}\right)}$$

• Instead, consider a random binary matrix. Expected # of diagonals of k 1s:

$$\Lambda = (n-k)(m-k)\,p^k \cong nm\,e^{k \ln p} = nm\,e^{-k \ln\left(\frac{1}{p}\right)}$$
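The expected counts can be checked numerically. In this sketch the function names are hypothetical and the simulation covers only the 1-D case: it counts the positions in a random binary string where a window of k consecutive 1s ends, averaged over several trials with a fixed seed.

```python
import math
import random


def expected_hits_string(n, k, p):
    """(n - k) * p^k: expected windows of k 1s in a random string of length n."""
    return (n - k) * p ** k


def expected_hits_matrix(n, m, k, p):
    """Lambda = (n - k)(m - k) * p^k: expected diagonal runs of k 1s."""
    return (n - k) * (m - k) * p ** k


def approx_hits_matrix(n, m, k, p):
    """The approximation n * m * exp(-k * ln(1/p)), exponential in k."""
    return n * m * math.exp(-k * math.log(1 / p))


def simulated_hits_string(n, k, p, trials=200, seed=0):
    """Average number of windows of k consecutive 1s over random strings."""
    rng = random.Random(seed)
    total = 0
    for _trial in range(trials):
        run = 0
        for _pos in range(n):
            run = run + 1 if rng.random() < p else 0
            if run >= k:          # a window of k 1s ends at this position
                total += 1
    return total / trials


print(expected_hits_string(1000, 5, 0.5), simulated_hits_string(1000, 5, 0.5))
for k in (5, 10, 15, 20):
    print(k, expected_hits_matrix(1000, 1000, k, 0.5),
          approx_hits_matrix(1000, 1000, k, 0.5))
```

The loop over k shows the count dropping by a factor of 1/p per extra position, which is the exponential decay in k.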

• As you increase k, the number decreases exponentially.
• The number of diagonal runs of k 1s can be approximated by a Poisson process with mean Λ:

$$\Pr[u] = \frac{\Lambda^u e^{-\Lambda}}{u!}, \qquad \Pr[u > 0] = 1 - e^{-\Lambda}$$

• In ungapped alignments, we replace the coin tosses by column scores, but the behaviour does not change (Karlin & Altschul).

• As the score increases, the number of alignments that achieve the score decreases exponentially

Blast E-value

• Choose a scoring function such that the expected score between a pair of residues is < 0

• Expected number of alignments with a score S:

$$E = Kmn\,e^{-\lambda S} = mn\,2^{-\left(\frac{\lambda S - \ln K}{\ln 2}\right)}$$

$$\Pr(S \ge x) = 1 - e^{-Kmn\,e^{-\lambda x}}$$

• For small values, E-value and P-value are the same
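The E-value and P-value formulas translate directly into code. In this sketch the values of K and λ are placeholders chosen only for illustration (in practice they depend on the scoring matrix and residue frequencies), and the function names are hypothetical.

```python
import math


def e_value(S, m, n, K, lam):
    """E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * S)


def bit_score(S, K, lam):
    """(lambda * S - ln K) / ln 2, so that E = m * n * 2^(-bit score)."""
    return (lam * S - math.log(K)) / math.log(2)


def p_value(S, m, n, K, lam):
    """Pr(score >= S) = 1 - exp(-E); expm1 keeps precision when E is tiny."""
    return -math.expm1(-e_value(S, m, n, K, lam))


def poisson_pmf(u, mean):
    """Pr[u] = mean^u * exp(-mean) / u!"""
    return mean ** u * math.exp(-mean) / math.factorial(u)


K, LAM = 0.1, 0.3          # placeholder parameters, not real BLAST values
m, n = 100, 10 ** 6        # query length and database length
for S in (40, 50, 70, 100):
    E = e_value(S, m, n, K, LAM)
    print(S, round(bit_score(S, K, LAM), 2), E, p_value(S, m, n, K, LAM))
```

For the larger scores in the loop, E and P agree to several digits, while at E around 1 or above they diverge because P can never exceed 1.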

The last step in Blast

• We have discussed
  – Alignments
  – Db filtering using keywords
  – Scoring matrices
  – E-values and P-values

• The last step: Database filtering requires us to scan a large sequence fast for matching keywords

Dictionary Matching, R.E. matching, and position specific scoring


Keyword search

• Recall: In BLAST, we get a collection of keywords from the query sequence, and identify all db locations with an exact match to the keyword.

• Question: Given a collection of strings (keywords), find all occurrences in a database string where a keyword matches.

Dictionary Matching

• Q: Given k words (si has length li), and a database of size n, find all matches to these words in the database string.

• How fast can this be done?

dictionary:
1: POTATO
2: POTASSIUM
3: TASTE

database:
P O T A S T P O T A T O

Dict. Matching & string matching

• How fast can you do it, if you only had one word of length m?
  – Trivial algorithm: O(nm) time
  – Pre-processing O(m), search O(n) time.

• Dictionary matching
  – Trivial algorithm: O((l1+l2+l3+…)n)
  – Using a keyword tree: O(lp n) (lp is the length of the longest pattern)
  – Aho-Corasick: O(n) after preprocessing O(l1+l2+…)

• We will consider the most general case
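Two of these bounds can be sketched on the POTATO example. The slides do not name the single-word method with O(m) preprocessing and O(n) search; the sketch below uses Knuth-Morris-Pratt as one algorithm with those bounds. Function names are hypothetical, positions are 0-based, and word indices are 1-based as in the dictionary.

```python
def naive_dictionary_match(words, db):
    """Trivial algorithm: try every word at every position, O((l1+l2+...)n)."""
    matches = []
    for pos in range(len(db)):
        for index, word in enumerate(words, start=1):
            if db.startswith(word, pos):
                matches.append((pos, index))
    return matches


def kmp_search(pattern, text):
    """Single non-empty pattern: O(m) preprocessing, O(n) search."""
    # fail[i] = length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of it.
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    hits = []
    k = 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]          # reuse what we already matched
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits


dictionary = ["POTATO", "POTASSIUM", "TASTE"]
database = "POTASTPOTATO"
print(naive_dictionary_match(dictionary, database))
print(kmp_search("POTATO", database))
```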

Direct Algorithm

[Figure: the pattern P O T A T O aligned against the database P O P O P O T A S T P O T A T O at successive start positions]

Observations:
• When we mismatch, we (should) know something about where the next match will be.
• When there is a mismatch, we (should) know something about other patterns in the dictionary as well.

The Trie Automaton

• Construct an automaton A from the dictionary
  – A[v,x] describes the transition from node v to a node w upon reading x.
  – A[u,'T'] = v, and A[u,'S'] = w
  – Special root node r
  – Some nodes are terminal, and labeled with the index of the dictionary word.

[Figure: keyword tree for POTATO, POTASSIUM, and TASTE, with root r, nodes u, v, w, and terminal nodes labeled 1, 2, 3]

An O(lpn) algorithm for keyword matching

• Start with the first position in the db, and the root node.

• If successful transition
  – Increment current pointer
  – Move to a new node
  – If terminal node, “success”

• Else
  – Retract ‘current’ pointer
  – Increment ‘start’ pointer
  – Move to root & repeat
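A minimal sketch of the keyword tree and the O(lp n) scan, assuming the automaton is stored as a list of dicts so that `transitions[v][x]` plays the role of A[v,x] and node 0 is the root r. The function names and this representation are illustrative choices.

```python
def build_trie(words):
    """Return (transitions, terminal) for the keyword tree; node 0 is the root."""
    transitions = [{}]   # transitions[v][x] = w, i.e. A[v, x] = w
    terminal = {}        # node -> 1-based index of the word ending there
    for index, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in transitions[v]:
                transitions.append({})
                transitions[v][x] = len(transitions) - 1
            v = transitions[v][x]
        terminal[v] = index
    return transitions, terminal


def trie_search(words, db):
    """Report (start position, word index) for every dictionary match in db."""
    transitions, terminal = build_trie(words)
    matches = []
    for start in range(len(db)):              # 'start' pointer
        v = 0                                 # begin at the root
        current = start                       # 'current' pointer
        while current < len(db) and db[current] in transitions[v]:
            v = transitions[v][db[current]]   # successful transition
            current += 1
            if v in terminal:
                matches.append((start, terminal[v]))
        # On failure the loop ends; the next iteration retracts 'current'
        # to the new 'start' and restarts from the root.
    return matches


dictionary = ["POTATO", "POTASSIUM", "TASTE"]
print(trie_search(dictionary, "POTASTPOTATO"))
```

Each start position walks at most lp edges before failing or running out of tree, which gives the O(lp n) bound.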

Illustration:

[Figure: the keyword tree for POTATO, POTASSIUM, and TASTE with terminal nodes 1, 2, 3 and current node v, above the database P O T A S T P O T A T O with pointers l and c]

Idea for improving the time

• Suppose we have partially matched pattern i (indicated by l and c), but fail subsequently. If some other pattern j is to match
  – then a prefix of pattern j must equal a suffix of the first c−l characters of pattern i.

[Figure: database P O T A S T P O T A T O with pointers l and c; pattern i = P O T A S S I U M, pattern j = T A S T E]

Improving speed of dictionary matching

• Every node v corresponds to a string sv that is a prefix of some pattern.

• Define F[v] to be the node u such that su is the longest proper suffix of sv that is also a prefix of some pattern

• If we fail to match at v, we should jump to F[v], and commence matching from there

• Let lp[v] = |su|

[Figure: keyword tree for POTATO, POTASSIUM, and TASTE with numbered nodes 1 to 11]
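The failure function F can be computed breadth-first over the keyword tree, as in the Aho-Corasick algorithm mentioned earlier. The sketch below is one possible arrangement, not necessarily the one used in the course: node 0 is the root, `fail[v]` plays the role of F[v], and `out[v]` lists every dictionary word that ends at v (including words found by following F). All names are hypothetical.

```python
from collections import deque


def build_automaton(words):
    """Keyword tree plus failure links; returns (trans, fail, out)."""
    trans = [{}]
    out = [[]]                       # out[v] = [(word index, word length), ...]
    for index, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in trans[v]:
                trans.append({})
                out.append([])
                trans[v][x] = len(trans) - 1
            v = trans[v][x]
        out[v].append((index, len(word)))
    fail = [0] * len(trans)
    queue = deque(trans[0].values())  # children of the root fail to the root
    while queue:
        v = queue.popleft()
        for x, w in trans[v].items():
            f = fail[v]
            while f and x not in trans[f]:
                f = fail[f]           # try shorter and shorter suffixes
            fail[w] = trans[f].get(x, 0)
            out[w] = out[w] + out[fail[w]]
            queue.append(w)
    return trans, fail, out


def aho_corasick_search(words, db):
    """Single left-to-right scan; report (start position, word index)."""
    trans, fail, out = build_automaton(words)
    matches = []
    v = 0
    for pos, x in enumerate(db):
        while v and x not in trans[v]:
            v = fail[v]               # jump to F[v] instead of restarting
        v = trans[v].get(x, 0)
        for index, length in out[v]:
            matches.append((pos - length + 1, index))
    return matches


dictionary = ["POTATO", "POTASSIUM", "TASTE"]
print(aho_corasick_search(dictionary, "POTASTPOTATO"))
```

On the example, failing at the node for POTAS jumps to the node for TAS, so the scan never moves backwards in the database.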

End of L5
