Heuristic alignment algorithms and cost matrices

Linda Muselaars and Miranda Stobbe


Page 1: Heuristic alignment algorithms and cost matrices

Heuristic alignment algorithms and cost matrices

Linda Muselaars and Miranda Stobbe

Page 2: Heuristic alignment algorithms and cost matrices


Overview chapter 2

1. What sorts of alignment should be considered?

2. The scoring system used to rank alignments.

3. The algorithm used to find optimal (or good) scoring alignments.

4. The statistical methods used to evaluate the significance of an alignment score.


Page 4: Heuristic alignment algorithms and cost matrices


Contents

Heuristic alignment algorithms
– BLAST
– FASTA
Linear space methods
Significance of scores
– Bayesian approach
– Classical approach
Deriving score parameters
– PAM matrices
– BLOSUM


Page 6: Heuristic alignment algorithms and cost matrices


The term heuristic

A heuristic algorithm is based on empirical information that has no explicit rationalization.

It does not necessarily return the exact answer to the problem under study, but is faster than the algorithm that does and is still very usable.

Page 7: Heuristic alignment algorithms and cost matrices


BLAST

Basic Local Alignment Search Tool.
Simplification of the Smith-Waterman algorithm.
Uses subsequences (words) of the query sequence to make ‘neighbourhood words’ using a score threshold.
When a neighbourhood word matches a subsequence in the database, a ‘hit extension’ process is started.

Page 8: Heuristic alignment algorithms and cost matrices


Example

Query sequence: q l n f
All subsequences: q l, l n, n f
Creating neighbourhood words:
q l: q l, q m, h l, z l
l n: l n, l b
n f: n f, a f, n y, d f, q f, e f, g f, h f, k f, s f, t f, b f, z f
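The neighbourhood-word step can be sketched in a few lines of Python. This is only an illustration: the scoring function `toy_score`, the set `SIMILAR`, the four-letter alphabet, and the threshold are all made-up assumptions, not the substitution matrix or threshold that BLAST actually uses.

```python
from itertools import product

# Hypothetical "similar" residue pairs, chosen only for this illustration.
SIMILAR = {frozenset("lm"), frozenset("qh")}

def toy_score(a, b):
    """Toy substitution score: 5 for identity, 1 for a 'similar' pair, -4 otherwise."""
    if a == b:
        return 5
    if frozenset((a, b)) in SIMILAR:
        return 1
    return -4

def neighbourhood_words(word, alphabet, score, threshold):
    """Return every word of the same length whose score against `word` reaches the threshold."""
    hits = []
    for cand in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, cand))
        if total >= threshold:
            hits.append("".join(cand))
    return hits

# Neighbourhood of the query word "ql" over a tiny alphabet.
words = neighbourhood_words("ql", "hlmq", toy_score, threshold=6)
```

With these toy scores the neighbourhood of q l contains q l, q m and h l. Enumerating all candidate words is exponential in the word length, which is acceptable only because the words are short.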

Page 9: Heuristic alignment algorithms and cost matrices


FASTA

FAST-All.
Fast approximation of the Smith-Waterman algorithm.
Step 1: find exact short word matches with length ktup.
Step 2: extend to ungapped alignments.
Step 3: identify gapped alignments.
Step 4: dynamic programming restricted to a subregion.
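Step 1 is essentially a lookup-table search. The following Python sketch shows one way to do it; the function names `ktup_hits` and `best_diagonals` are illustrative and not taken from the FASTA package, and the later steps (rescoring, joining, banded dynamic programming) are left out.

```python
from collections import Counter, defaultdict

def ktup_hits(query, target, ktup=2):
    """Return (i, j) pairs where query[i:i+ktup] == target[j:j+ktup]."""
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)   # word -> positions in the query
    hits = []
    for j in range(len(target) - ktup + 1):
        for i in index.get(target[j:j + ktup], []):
            hits.append((i, j))
    return hits

def best_diagonals(hits):
    """Count hits per diagonal (j - i); crowded diagonals are candidates for extension."""
    return Counter(j - i for i, j in hits).most_common()

hits = ktup_hits("qlnf", "aqlnfa", ktup=2)
diagonals = best_diagonals(hits)
```

Hits that share a diagonal can be merged into one ungapped region, which is where step 2 would start.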


Page 11: Heuristic alignment algorithms and cost matrices


BLAST versus FASTA

They both use the same extension method.
They both can be used for both DNA and proteins.
BLAST is faster than FASTA.
BLAST is more sensitive than FASTA on proteins.
BLAST is less sensitive than FASTA for nucleic acid sequences.
BLAST uses neighbourhood words, FASTA does not.
BLAST is mainly for ungapped alignment, FASTA for gapped alignments.

Page 12: Heuristic alignment algorithms and cost matrices


BLAST vs. FASTA, example

Consider the sequences: n f l and n y l
ktup = 2 (remember: only for FASTA)
Even though FASTA only needs a matching word of size 2, it does not find a match.
BLAST does find a match (of word size 3 even) on account of neighbourhood words.

Page 13: Heuristic alignment algorithms and cost matrices


Demo at www.ebi.ac.uk



Page 16: Heuristic alignment algorithms and cost matrices


Reducing memory usage

Score matrices so far are of size nm (with n and m the sequence lengths).
We can reduce memory usage to n+m.
Cost: time is doubled.
This is done by linear space methods.
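If only the score is needed, keeping two rows of the matrix is enough. The sketch below shows this for global alignment; the linear gap cost and the match/mismatch values are assumptions made for brevity, not the scoring scheme of the slides.

```python
def nw_score_last_row(x, y, match=1, mismatch=-1, gap=-2):
    """Last row of the global alignment score matrix, using O(len(y)) memory.

    Entry j of the result is the best score for aligning all of x with y[:j].
    """
    prev = [j * gap for j in range(len(y) + 1)]      # row for the empty prefix of x
    for i in range(1, len(x) + 1):
        curr = [i * gap]                             # x[:i] aligned to the empty prefix of y
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            curr.append(max(prev[j - 1] + s,         # x[i-1] aligned to y[j-1]
                            prev[j] + gap,           # x[i-1] aligned to a gap
                            curr[j - 1] + gap))      # y[j-1] aligned to a gap
        prev = curr                                  # the older row is no longer needed
    return prev

score = nw_score_last_row("GATTACA", "GATTACA")[-1]
```

Because the rows are discarded, the alignment itself cannot be traced back from this computation; recovering it in linear space needs the divide and conquer step.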

Page 17: Heuristic alignment algorithms and cost matrices


Divide and conquer

We find a cell (u,v) in the middle column that is on the optimal path.
This cell divides the matrix into four parts, of which two are important for the path.
This is done recursively to these two parts.
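One way to find the cell (u,v) is to combine a forward pass up to the middle column with a backward pass on the reversed sequences (the idea behind Hirschberg's algorithm). The Python sketch below shows only this splitting step, under the same assumed linear gap cost as before; the helper and function names are illustrative.

```python
def nw_score_last_row(x, y, match=1, mismatch=-1, gap=-2):
    """Best global scores of all of x against every prefix y[:j], in linear memory."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [i * gap]
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            curr.append(max(prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev

def middle_cell(x, y):
    """Return (u, v, score): a cell in the middle column v that lies on an optimal path.

    Rows are indexed by positions in x, columns by positions in y.
    """
    n, v = len(x), len(y) // 2
    # fwd[i]: best score for aligning x[:i] with y[:v] (left half of the matrix).
    fwd = nw_score_last_row(y[:v], x)
    # back[k]: best score for aligning the last k letters of x with y[v:] (right half).
    back = nw_score_last_row(y[v:][::-1], x[::-1])
    u = max(range(n + 1), key=lambda i: fwd[i] + back[n - i])
    return u, v, fwd[u] + back[n - u]

u, v, total = middle_cell("GATTACA", "GATTACA")
```

The full method would then call itself on (x[:u], y[:v]) and (x[u:], y[v:]), which are the two parts of the matrix that matter for the path.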


Page 19: Heuristic alignment algorithms and cost matrices


Short review

Letter a occurs independently with frequency $q_a$ in the random model.
Aligned pairs of residues occur with a joint probability $p_{ab}$ in the match model.
Random model: $P(x,y \mid R) = \prod_k q_{x_k} \prod_l q_{y_l}$
Match model: $P(x,y \mid M) = \prod_k p_{x_k y_k}$
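The ratio of the two models gives the usual log-odds score of an ungapped alignment, a sum of terms log(p_ab / (q_a q_b)). A small Python sketch follows; the two-letter alphabet and the probabilities in `Q` and `P` are invented for illustration only.

```python
import math

# Hypothetical model parameters over a two-letter alphabet.
Q = {"a": 0.5, "b": 0.5}                       # random model: letter frequencies q_a
P = {("a", "a"): 0.4, ("b", "b"): 0.4,         # match model: joint probabilities p_ab
     ("a", "b"): 0.1, ("b", "a"): 0.1}

def log_odds(x, y, p=P, q=Q):
    """log( P(x,y|M) / P(x,y|R) ) for two equal-length, ungapped sequences."""
    if len(x) != len(y):
        raise ValueError("an ungapped alignment needs sequences of equal length")
    return sum(math.log(p[(a, b)] / (q[a] * q[b])) for a, b in zip(x, y))

related = log_odds("ab", "ab")      # positive: the match model explains the pair better
unrelated = log_odds("ab", "ba")    # negative: the random model explains the pair better
```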

Page 20: Heuristic alignment algorithms and cost matrices


Bayesian approach: model comparison
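The formulas of this slide are missing from the transcript. As a hedged reconstruction, the standard Bayesian comparison of the match model M and the random model R reads as follows, where the priors P(M) and P(R) = 1 - P(M) and the symbols S, S' and σ are notation assumed here:

```latex
P(M \mid x,y)
  = \frac{P(x,y \mid M)\,P(M)}{P(x,y \mid M)\,P(M) + P(x,y \mid R)\,P(R)}
  = \sigma(S'),
\qquad
S' = S + \log\frac{P(M)}{P(R)},
\qquad
S = \log\frac{P(x,y \mid M)}{P(x,y \mid R)},
\qquad
\sigma(t) = \frac{1}{1 + e^{-t}}
```

The alignment score S is thus shifted by the log prior odds before being turned into a posterior probability.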

Page 21: Heuristic alignment algorithms and cost matrices


Comparison

For global matches compare with 0 to determine whether the alignment is significant.

When setting the prior odds ratio in inverse proportion to the size of the database N, compare with log N.

For local matches compare with 0.1 • log(nm)

Page 22: Heuristic alignment algorithms and cost matrices


Extreme value distribution

Scores of a sequence aligned to a set of random sequences obey an extreme value distribution (EVD).
We compute the probability that the best match of unrelated sequences has score greater than our maximal score.

[Figure: extreme value density plotted for x from -4 to 4, vertical axis from 0.0 to 0.4]

Page 23: Heuristic alignment algorithms and cost matrices


Other alignments

For local ungapped alignments we have a different EVD than for fixed ungapped alignment (because we have more possible starting points).

For gapped alignments empirically established distributions are used.
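For local ungapped alignments the standard result (Karlin-Altschul statistics) is that the number of unrelated matches scoring above S is roughly Poisson with mean E = K m n exp(-λS). A Python sketch is given below; the values of K and λ in the usage lines are placeholders, since in practice they depend on the scoring matrix and the residue frequencies.

```python
import math

def expected_hits(score, m, n, K, lam):
    """E-value: expected number of chance local matches with score above `score`."""
    return K * m * n * math.exp(-lam * score)

def p_value(score, m, n, K, lam):
    """Probability of at least one chance match above `score`: 1 - exp(-E)."""
    return -math.expm1(-expected_hits(score, m, n, K, lam))

# Query of length 100 against a sequence of length 1000; K and lam are placeholders.
e = expected_hits(60, m=100, n=1000, K=0.1, lam=0.3)
p = p_value(60, m=100, n=1000, K=0.1, lam=0.3)
```

For small E the p-value is almost equal to E, which is why search tools often report only the E-value.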

Page 24: Heuristic alignment algorithms and cost matrices


Correcting for length

When database sequences are longer, we have higher scores.
Solutions:
– Subtract log(m_i) for length m_i of the database sequence.
– Bin all the database entries by length and fit a linear function.
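The first solution is a one-line correction. The sketch below assumes that scores are natural-log odds, so that the natural logarithm of the length is the matching unit; the hit names, scores and lengths are invented.

```python
import math

def length_corrected(hits):
    """Rank (name, raw_score, length) hits by raw_score - log(length)."""
    corrected = [(name, score - math.log(m)) for name, score, m in hits]
    return sorted(corrected, key=lambda item: item[1], reverse=True)

# Hypothetical hits: a long sequence with a slightly higher raw score and a short one.
ranking = length_corrected([("long", 12.0, 10000), ("short", 10.0, 100)])
```

After the correction the short sequence ranks first, although its raw score is lower.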

Page 25: Heuristic alignment algorithms and cost matrices


Notes on test statistic

Search statistic is the same as the test statistic.

Advantage: both have highly discriminative power.

Disadvantage: introduction of bias in test phase.


Page 27: Heuristic alignment algorithms and cost matrices


Substitution and gap scores

Letter a occurs independently with frequency $q_a$ in the random model.
Aligned pairs of residues occur with a joint probability $p_{ab}$ in the match model.
f(g) is a function of the length of the gap.
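The score formulas belonging to this slide are not in the transcript. In the standard log-odds treatment they take the form below; the symbols s, γ, d (gap-open) and e (gap-extend) are the usual textbook notation and are assumed here:

```latex
s(a,b) = \log\frac{p_{ab}}{q_a\,q_b},
\qquad
\gamma(g) = \log f(g),
\qquad
\text{affine case: } \gamma(g) = -d - (g-1)\,e
```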

Page 28: Heuristic alignment algorithms and cost matrices


Estimating probabilities

Simple approach: set the probabilities to normalised frequencies (assessed by counting frequencies in confirmed alignments).
But:
– It is difficult to obtain a good random sample.
– Does not take into account different ‘distances’ to the common ancestor.

Page 29: Heuristic alignment algorithms and cost matrices


PAM matrices

Percentage of Acceptable point Mutations per $10^8$ years matrices.
Amino acid substitution matrices.
Obtain substitution data from alignments and estimate probabilities for longer evolutionary distances.
PAMn: n accepted mutation events per 100 amino acids.

Page 30: Heuristic alignment algorithms and cost matrices


PAM matrices (2)

Construct phylogenetic trees relating the sequences in 71 families (at least 85% similar).
Count the number of amino acid changes with respect to immediate ancestor.
20 x 20 amino acid substitution matrix computed.
Expected number of substitutions is 1% in PAM1.
PAMn = (PAM1)^n.
PAM matrix converted to a log-odds matrix.
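The extrapolation PAMn = (PAM1)^n is an ordinary matrix power. The Python sketch below uses a made-up 2 x 2 transition matrix in place of the real 20 x 20 PAM1, purely to show the mechanics; the helper names are illustrative.

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    size = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def mat_pow(A, n):
    """A raised to the n-th power by repeated squaring (n >= 0)."""
    size = len(A)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = A
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result

# Toy "PAM1": each letter changes with probability 1% per unit of evolutionary time.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_pow(TOY_PAM1, 250)
```

Each row of the result still sums to 1. Dividing entry (a, b) by the background frequency of b and taking the logarithm would give the log-odds form mentioned on the slide.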

Page 31: Heuristic alignment algorithms and cost matrices


Drawbacks

Using the matrix for short time intervals to compute the ones for longer time intervals does not capture the true difference.
Takes into account only single base changes instead of all types of codon changes.
Databases containing alignments of more distantly related proteins are used to derive matrix scores more directly and accurately.

Page 32: Heuristic alignment algorithms and cost matrices


BLOCKS database

Used to derive BLOSUM matrices.
Sequences are clustered according to percentage of identical residues.
$A_{ab}$ then is the frequency of observing a in one cluster aligned to b in another cluster.
Size of the clusters needs to be corrected for.

Page 33: Heuristic alignment algorithms and cost matrices


BLOSUM

BLOcks SUbstitution Matrix.
BLOSUMn is the matrix where two sequences are put into one cluster when more than n% of their residues are identical (lower n corresponds to longer evolutionary time).
From $A_{ab}$, the values $q_a$ and $p_{ab}$ are estimated, which are used to compute the scores for the matrix.
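A minimal sketch of this estimation step, assuming the simple estimates q_a = Σ_b A_ab / Σ_cd A_cd and p_ab = A_ab / Σ_cd A_cd and a symmetric count table. The counts are invented, and the scaling to rounded half-bit units is an assumption about the final presentation of the scores.

```python
import math

def blosum_like_scores(A):
    """Estimate q_a and rounded half-bit log-odds scores from pair counts A[(a, b)].

    A is assumed symmetric with strictly positive counts for every listed pair.
    """
    total = sum(A.values())
    letters = sorted({a for a, _ in A})
    q = {a: sum(A.get((a, b), 0) for b in letters) / total for a in letters}
    scores = {}
    for (a, b), count in A.items():
        p_ab = count / total
        scores[(a, b)] = round(2 * math.log2(p_ab / (q[a] * q[b])))
    return q, scores

# Hypothetical cluster-to-cluster pair counts over a two-letter alphabet.
TOY_COUNTS = {("a", "a"): 40, ("a", "b"): 10, ("b", "a"): 10, ("b", "b"): 40}
q, scores = blosum_like_scores(TOY_COUNTS)
```

With these counts identical pairs score 1 and mismatched pairs score -3.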

Page 34: Heuristic alignment algorithms and cost matrices


PAM versus BLOSUM

PAM:
– Based on global alignments.
– PAM1 is the matrix calculated from comparisons of substitutions in unit time.
– Other PAM matrices are extrapolated from PAM1.

BLOSUM:
– Based on local alignments.
– BLOSUMn is a matrix calculated from comparisons between clusters of sequences that are no more than n% identical.
– All matrices are based on observed alignments.

Page 35: Heuristic alignment algorithms and cost matrices


Gap penalties

Time-dependent:
– Number of gaps increases (gap-open score d linear in log t).
– Length distribution constant (gap-extend score e remains constant).
In practice people choose gap costs empirically (only two parameters).
As gaps become more likely we could reduce the pairwise scores.

Page 36: Heuristic alignment algorithms and cost matrices


Notes

Objective was to determine whether two sequences are related.
Scoring schemes and statistics to determine the significance of a match.
Even so, it is not always possible to distinguish between two related sequences or two sequences that seem to be related, but are not.

Page 37: Heuristic alignment algorithms and cost matrices


Summary

BLAST and FASTA packages are used to reduce the time used for finding alignments.
Linear space alignments can be used to reduce memory usage.
We need the significance of scores to judge the importance of a match.
We can use the score parameters stated in PAM and BLOSUM matrices.