Lecture 3: Scoring Matrices and Multiple Sequence Alignment BMI/IBGP 730

41
Lecture 3: Scoring Matrices and Multiple Sequence Alignment BMI/IBGP 730 Kun Huang Department of Biomedical Informatics Ohio State University

description

Lecture 3: Scoring Matrices and Multiple Sequence Alignment BMI/IBGP 730. Kun Huang Department of Biomedical Informatics Ohio State University. Review of sequence alignment Scoring matrices Multiple sequence alignment Tools and examples. 18. 3. 20. 5. 25. B. 2. A. 11. - PowerPoint PPT Presentation

Transcript of Lecture 3: Scoring Matrices and Multiple Sequence Alignment BMI/IBGP 730

Page 1: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

Lecture 3: Scoring Matrices and Multiple Sequence

Alignment BMI/IBGP 730

Kun HuangDepartment of Biomedical Informatics

Ohio State University

Page 2: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

Review of sequence alignment

Scoring matrices

Multiple sequence alignment

Tools and examples

Page 3: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

Dynamic programming

• The name comes from an operations research task, and has nothing to do with writing programs. Programming – use tabular structure for computing.

A B

520

2 25

11 17

183

Page 4: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

Dynamic programming matrix

• Each cell has the score for the best aligned sequence prefix up to that position.

• Example:

ATGCT vs. ACCGCT

Match: +2, mismatch: 0, gap: -1

Gap A T G C T

Gap 0 -1 -1 -1 -1 -1

A -1 2 0 0 0 0

C -1 0 0 0 2 0

C -1 0 0 0 2 0

G -1 0 0 2 0 0

C -1 0 0 0 2 0

T -1 0 2 0 0 2

Matching matrix, NOT the dynamical programming matrix!

Page 5: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

Dynamic programming matrix

Gap A T G C T

Gap 0 -1 -2 -3 -4 -5

A -1 2 (2) 1(0) 0 0 0

C -2 1 (0) 2 (0) 1 2 1

C -3 0 1 (0) 2 3 2

G -4 0 0 3 (2) 2 3

C -5 0 0 2 5 (2) 4

T -6 0 2 1 4 7 (2)

0 1 2 3 4 5 6

0

1

2

3

4

5

6

7

AA

A TA _

A TA C

A _A T

A TA C

A TA C

A T _A _ C

A _ TA C _

Page 6: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

Optimal alignment by traceback• We “traceback” a path that gets us the highest score.

If we don't have “end gap” penalties, then takeany path from thelast row or columnto the first.

• Otherwise we needto include the top and bottom corners

Gap A T G C T

Gap 0 -1 -2 -3 -4 -5

A -1 2 (2) 1(0) 0 -1 -2

C -2 1 (0) 2 (0) 1 2 1

C -3 0 1 (0) 2 3 2

G -4 -1 0 3 (2) 2 3

C -5 -2 -1 2 5 (2) 4

T -6 -3 0 1 4 7 (2)

0 1 2 3 4 5 6

0

1

2

3

4

5

6

7

AT - GCT ACCGCT

A - TGCT ACCGCT

Page 7: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

Dynamic programming

• Global alignment – an alignment of two or more sequences that matches as many characters as possible in all of the sequences. Needleman-Wunch algorithm

• Local alignment – an alignment that includes only the best matching, highest-scoring regions in two or more sequences. Smith-Waterman algorithm

• Difference – all the scores are kept in the dynamical programming matrix for global alignment; only the positive scores are kept in the dynamical programming matrix for local alignment, the negative ones are converted to zero.

Page 8: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

BLAST – Algorithm IntuitionThe BLAST algorithm.The BLAST algorithm is a heuristic search method that seeks words of length W (default = 3 in blastp) that score at least T when aligned with the query and scored with a substitution matrix. Words in the database that score T or greater are extended in both directions in an attempt to find a locally optimal ungapped alignment or HSP (high scoring pair) with a score of at least S or an E value lower than the specified threshold. HSPs that meet these criteria will be reported by BLAST, provided they do not exceed the cutoff value specified for number of descriptions and/or alignments to report.

Page 9: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

BLAST – Algorithm Intuition

http://www.compbio.dundee.ac.uk/ftp/preprints/review93/Figure10.pdf

Databases are pre-indexed by the words.

Without gaps:Altschul, S. F., Gish, W., Miller, W., Myers, E. W., Lipman, D. J., J. Mol. Biol. (1990) 215:403-410

With gaps:Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., Lipman, D. J., Nucleic Acids Research (1997) 25(17):3389-3402

Page 10: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

An alignment score

• An alignment score is the sum of all the match scores of an alignment, with a penalty subtracted for each gap.

• Gap penalties are usually "affine" meaning that the penalty for a gap computed as a linear function of its length x:

gap penalty = ax + b.

a b c - - da c c e f d9 2 7 6 => 24 - (10 + 2) = 12

Matchscore

Gap start + continuationpenalty

AlignmentScore

Page 11: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

How we pick substitution scores?

• Problem: how to score each residue pair in two aligned sequences? – intuitively the score should reflect the likelihood that two

sequences are aligned because they related as opposed to a random alignment

• We assume the letter a occurs independently with frequency qa

– random match scenario: the likelihood that the letters a and b occur by chance is qaqb

– true match scenario: the aligned pair a,b occurs with joint probability pab

Page 12: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

Substitution scores

• The likelihood that the residue pair (a, b) occurs as an aligned pair, as opposed to an unaligned pair, is pab/qaqb

• If we use the logarithm of this ratio (log-odds ratio), the score for the entire alignment is additive

ba

ab

qq

pbas log),(

i

ii yxsS ),(

i yx

yx

iy

ix

iyx

ii

ii

ii

ii

qq

p

qq

p

RandomyxP

MatchyxP

)|,(

)|,(

Page 13: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

Substitution matrices

• As we wanted, the total score is the sum of individual scores s(a, b) for each aligned pair

• The s(a, b) can be arranged in a matrix called substitution or score matrix– for proteins, this is a 20 x 20 matrix

• A number of such matrices have been proposed, each computed according to different strategies

Page 14: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

Alignment as An Optimization Problem

Optimization criteria / cost function

• What sort of alignment should be considered• Scoring system

• Additive model• Based on probability compared with random

sequence (PAM, BLOSUM)• Assumption of independence• More complicated cases

Gap penalty – linear (s = -gd) affine (s = -d – (g-1)e)

Page 15: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

Clusters of amino acids

A

PG V

LI

A

M

F

WC H

K

Y

N

R

DT

SE

Q

Hydrophobic

Aromatic

Basic

Acid

Polar

Page 16: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

Match scores

• Match scores are oftencalculated on the basis of the frequency of particular mutations in very similar sequences.

• We can then transform substitution frequencies into log odds scores, and then normalize.

Page 17: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

Example of construction of a phylogenetic tree

• Assumptions:– high similarity => each position should have only

changed once during evolution– number of changes proportional to evolutionary

distance

Organism A A W T V A A A V R T S IOrganism B A Y T V A A A V R T S IOrganism C A W T V A A A V L T S I

A B C

L to R

W to Y

Page 18: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

The PAM (point accepted mutation) approach

• Amino acid substitutions were estimated using 1572 changes in 71 groups of protein sequences with at least 85% similarity– reasons for requiring such high degree of similarity:

• to exclude possibility that observed substitutions reflect two successive changes instead of just one

• to make possible building phylogenetic trees for each group and reconstruct probable history of amino acid changes

• The number of changes of every amino acid a into every b was the base for computing the scores– two-step procedure:

• # of changes -> relative mutability (relative number of changes)

• relative mutability -> scores

Page 19: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

Computing relative mutability in PAM • Problem: how to account for differences across groups in

sequence composition, mutation rate?– a normalization factor called exposure to mutation for

each amino acid was computed for each group– the # of changes per amino acid (in each group) was

divided by this factor– the relative frequencies thus obtained were then

summed for all groups (sum called relative mutability)

• Exposure to mutation of a defined as Pa * K

– Pa : frequency of occurrences of a in group

– K : total # of all changes in group per 100 sites

Page 20: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

Deriving scores in PAM

• Relative mutability used to compute PAM1

– ma = fa/(100*pa*K)

– Mab = (fab/fa) * ma

– PPhe<->Tyr = 260/1572 * relative mutability of Phe * fraction of Phe<->Tyr out of 260 Phe changes

– Repeat for all other 18 PPhe<->whatever

– The resulting 20 transition probabilities were normalized so that their sum was 1

– To obtain the scores, each Pab probability was

• divided by Pb (likelihood, odds ratio)

• the log taken, multiplied by 10 and rounded

Page 21: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

PAM matrices and evolutionary distance

• All these computations were based on an amount of evolution corresponding to 1 mutation in 100 amino acids on average– called a 1 PAM evolutionary distance

• Assuming a Markov model for amino acid substitutions the matrices for an evolutionary distances of N can be computed as PAM1N

– Markov model assumes that every change is independent of the previous changes at that site

– example: PAM250 is computed as PAM1250

Page 22: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

Alignment as An Optimization Problem

Optimization criteria / cost function

• What sort of alignment should be considered• Scoring system

• Additive model• Based on probability compared with random

sequence (PAM, BLOSUM)• Assumption of independence• More complicated cases

Gap penalty – linear (s = -gd) affine (s = -d – (g-1)e)

Page 23: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

The PAM250 scoring matrix

Page 24: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

BLOSUM• The BLOSUM matrix approach is based on the study of ~2000 conserved

amino acid patterns

– protein were grouped in 500+ families based on similarity of biochemical function

– the patterns were found by looking for motifs of type d1 aa1 d2 aa2 d3 aa3, where di are stretches of up to 24 amino acids located in all sequences

– These initial patterns were organized in larger ungapped regions (blocks) of size between 3 and 60 amino acids

– The multiple alignment of these blocks is the basis for determining the probabilities of amino acid changes

– To better balance the count, sequences with more than K% similarity were grouped together in one sequence before counting the amino acid substitutions in the multiple alignment

• The version of the matrix thus derived is called BLOSUMK

Page 25: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

BLOSUM62

• BLOSUM62 is widely used for protein sequence alignments

• Differences with PAM:– no assumptions on

evolutionary model– based on scoring at

conserved positions in blocks, rather than in whole sequences

– based on analysis of larger number of families

Page 26: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

Which scoring matrix to use?

• For protein match scores, two main options– PAM based on global alignments of closely related

sequences. Normalized to changes per 100 sites, then exponentiated for more distant relatives.

– BLOSUM based on local alignments in much more diverse sequences

• Picking the right distance is important, and may be hard to do. BLOSUM seems to work better for more evolutionarily distant sequences. BLOSUM62 is a good default.

Page 27: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

Picking gap penalties

• Many different possible forms:– Most common is affine

(gap open + gap continue penalities)– More complex penalties have been proposed.

• Penalties must be commensurate with match scores. Therefore, the match scoring scheme influences the gap penalty

• Most alignment programs suggest appropriate penalties for each match score option.

Page 28: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

The significance of an alignment

• Significance testing is the branch of statistics that is concerned with assesing the probability that a particular result could have occurred by chance.

• How do we calculate the probability that an alignment occurred by chance?

– Either with a model of evolution, or – Empirically, by scrambling our sequences and

calculating scores on many randomized sequences.

Page 29: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

Multiple Sequence Alignment

• Often applied to proteins

• Proteins that are similar in sequence are often similar in structure and function

• Sequence changes more rapidly in evolution than does structure and function.

Page 30: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

Overview of Methods

• Dynamic programming – too computationally expensive to do a complete search; uses heuristics

• Progressive – starts with pair-wise alignment of most similar sequences; adds to that

• Iterative – make an initial alignment of groups of sequences, adds to these (e.g. genetic algorithms)

• Locally conserved patterns

• Statistical and probabilistic methods

Page 31: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

Dynamic Programming

• Computational complexity – even worse than for pair-wise alignment because we’re finding all the paths through an n-dimensional hyperspace (We can picture this in 2 or 3 dimensions.)

• Can align about 7 relatively short (200-300) protein sequences in a reasonable amount of time; not much beyond that

Page 32: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

A Heuristic for Reducing the Search Space in Dynamic Programming

• Let’s picture this in 3 dimensions. It generalizes to n.

• Consider the pair-wise alignments of each pair of sequences.

• Create a phylogenetic tree from these scores.

• Consider a multiple sequence alignment built from the phylogenetic tree.

• These alignments circumscribe a space in which to search for a good (but not necessarily optimal) alignment of all n sequences.

Page 33: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

Phylogenetic Tree as a Guilding Tree

• Dynamic programming uses a phylogenetic tree to build a “first-cut” MSA.

• The tree shows how protein could have evolved from shared origins over evolutionary time.

Page 34: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

Dynamic Programming -- MSA

• Create a phylogenetic tree based on pair-wise alignments (Pairs of sequences that have the best scores are paired first in the tree.)

• Do a “first-cut” msa by incrementally doing pair-wise alignments in the order of “alikeness” of sequences as indicated by the tree. Most alike sequences aligned first.

• Use the pair-wise alignments and the “first-cut” msa to circumscribe a space within which to do a full msa that searches through this solution space.

• The score for a given alignment of all the sequences is the sum of the scores for each pair, where each of the pair-wise scores is multiplied by a weight є indicating how far the pair-wise score differs from the first-cut msa alignment score.

Page 35: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

Heuristic Dynamic Programming Method for MSA

• Does not guarantee an optimal alignment of all the sequences in the group.

• Does get an optimal alignment within the space chosen.

Page 36: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

Progressive Methods

• Similar to dynamic programming method in that it uses the first step (i.e., it creates a phylogenetic tree, aligns the most-alike pair, and incrementally adds sequences to the alignment in order of “alikeness” as indicated by the tree.)

• Differs from dynamic programming method for MSA in that it doesn’t refine the “first-cut” MSA by doing a full search through the reduced search space. (This is the computationally expensive part of DP MSA in that, even though we’ve cut down the search space, it’s still big when we have many sequences to align.)

Page 37: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

Progressive Method

• Generally proceeds as follows: – Choose a starting pair of sequences and align them– Align each next sequence to those already aligned,

one at a time• Heuristic method – doesn’t guarantee an optimal

alignment• Details vary in implementation:

– How to choose the first sequence to align?– Align all subsequence sequences cumulatively or in

subfamilies?– How to score?

Page 38: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

ClustalW

• Based on phylogenetic analysis

• A phylogenetic tree is created using a pairwise distance matrix and nearest-neighbor algorithm

• The most closely-related pairs of sequences are aligned using dynamic programming

• Each of the alignments is analyzed and a profile of it is created

• Alignment profiles are aligned progressively for a total alignment

• W in ClustalW refers to a weighting of scores depending on how far a sequence is from the root on the phylogenetic tree (See p. 154 of Bioinformatics by Mount.)

Page 39: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

Problems with Progressive Method

• Highly sensitive to the choice of initial pair to align. If they aren’t very similar, it throws everything off.

• It’s not trivial to come up with a suitable scoring matrix or gap penalties.

Page 40: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

ClustalW

http://www.ebi.ac.uk/clustalw/http://align.genome.jp

Page 41: Lecture 3: Scoring Matrices and Multiple Sequence Alignment  BMI/IBGP 730

ClustalW>query

MKNTLLKLGVCVSLLGITPFVSTISSVQAERTVEHKVIKNETGTISISQLNKNVWVHTELGYFSGEAVPS NGLVLNTSKGLVLVDSSWDDKLTKELIEMVEKKFKKRVTDVIITHAHADRIGGMKTLKERGIKAHSTALT AELAKKNGYEEPLGDLQSVTNLKFGNMKVETFYPGKGHTEDNIVVWLPQYQILAGGCLVKSASSKDLGNV ADAYVNEWSTSIENVLKRYGNINLVVPGHGEVGDRGLLLHTLDLLK>gi|2984094 MGGFLFFFLLVLFSFSSEYPKHVKETLRKITDRIYGVFGVYEQVSYENRGFISNAYFYVADDGVLVVDALSTYKLGKELIESIRSVTNKPIRFLVVTHYHTDHFYGAKAFREVGAEVIAHEWAFDYISQPSSYNFFLARKKILKEHLEGTELTPPTITLTKNLNVYLQVGKEYKRFEVLHLCRAHTNGDIVVWIPDEKVLFSGDIVFDGRLPFLGSGNSRTWLVCLDEILKMKPRILLPGHGEALIGEKKIKEAVSWTRKYIKDLRETIRKLYEEGCDVECVRERINEELIKIDPSYAQVPVFFNVNPVNAYYVYFEIENEILMGE>gi|115023|sp|P10425|MKKNTLLKVGLCVSLLGTTQFVSTISSVQASQKVEQIVIKNETGTISISQLNKNVWVHTELGYFNGEAVPSNGLVLNTSKGLVLVDSSWDNKLTKELIEMVEKKFQKRVTDVIITHAHADRIGGITALKERGIKAHSTALTAELAKKSGYEEPLGDLQTVTNLKFGNTKVETFYPGKGHTEDNIVVWLPQYQILAGGCLVKSAEAKNLGNVADAYVNEWSTSIENMLKRYRNINLVVPGHGKVGDKGLLLHTLDLLK>gi|115030|sp|P25910|MKTVFILISMLFPVAVMAQKSVKISDDISITQLSDKVYTYVSLAEIEGWGMVPSNGMIVINNHQAALLDTPINDAQTEMLVNWVTDSLHAKVTTFIPNHWHGDCIGGLGYLQRKGVQSYANQMTIDLAKEKGLPVPEHGFTDSLTVSLDGMPLQCYYLGGGHATDNIVVWLPTENILFGGCMLKDNQATSIGNISDADVTAWPKTLDKVKAKFPSARYVVPGHGDYGGTELIEHTKQIVNQYIESTSKP>gi|282554|pir||S25844 MTVEVREVAEGVYAYEQAPGGWCVSNAGIVVGGDGALVVDTLSTIPRARRLAEWVDKLAAGPGRTVVNTH FHGDHAFGNQVFAPGTRIIAHEDMRSAMVTTGLALTGLWPRVDWGEIELRPPNVTFRDRLTLHVGERQVE LICVGPAHTDHDVVVWLPEERVLFAGDVVMSGVTPFALFGSVAGTLAALDRLAELEPEVVVGGHGPVAGP EVIDANRDYLRWVQRLAADAVDRRLTPLQAARRADLGAFAGLLDAERLVANLHRAHEELLGGHVRDAMEI FAELVAYNGGQLPTCLA