Sequence Alignment CSCE 769 Guest Lecture November 1, 2012 Stephanie Irausquin, PhD

Page 1

Sequence Alignment

CSCE 769 Guest Lecture

November 1, 2012

Stephanie Irausquin, PhD

Page 2

Sequence Alignment: Definition and Importance

● Sequence alignment is the process of comparing at least two homologous sequences; it involves identifying the insertions or deletions that might have occurred in either lineage since their divergence from a common ancestor

● A powerful tool for discovering biological function and establishing evolutionary relationships

Page 3

Sequence Alignment

● The same principles for sequence alignment can be used to align both nucleotide and amino acid sequences

● More reliable alignments are usually obtained by using amino acid sequences

1. Amino acids change less frequently during evolution than nucleotides

2. There are 20 amino acids and only 4 nucleotides, so the probability for 2 sites to be identical by chance is lower at the amino acid level than at the nucleotide level
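The chance-identity argument in point 2 can be made concrete. As a rough sketch, assume (an idealization that is not stated on the slide) that sites are independent and that every residue in the alphabet is equally frequent. With residue frequencies $p_i$, the probability that two unrelated sites are identical is then:

```latex
P(\text{identical by chance}) = \sum_i p_i^2, \qquad
P_{\text{nucleotide}} = 4 \left(\tfrac{1}{4}\right)^2 = 0.25, \qquad
P_{\text{amino acid}} = 20 \left(\tfrac{1}{20}\right)^2 = 0.05
```

Real sequences have unequal residue frequencies, so the actual values differ somewhat from these idealized ones.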

Page 4

Sequence Alignment

● A DNA sequence alignment consists of a series of paired bases (one base from each sequence)

● There are 3 types of aligned pairs

1. Match – it is assumed that the nucleotide at this site has not changed since the divergence between the two sequences

2. Mismatch – at least one substitution has occurred in one of the sequences since their divergence from each other

3. Gap – a deletion has occurred in one sequence, or an insertion has occurred in the other (the alignment itself does not allow us to distinguish between these two possibilities)

Page 5

Types of Alignment

● Manual

● Dot matrix

● Distance and similarity methods

● Alignment algorithms

Page 6

Manual Alignment

● When there are few gaps and the two sequences are not too different from each other, a reasonable alignment can be obtained by visual inspection, using either specialized alignment editors or plain text editors

● Advantages: uses the brain and allows direct integration of additional data (e.g. domain structure)

● Disadvantages: is subjective, and results cannot be compared to those derived using other methods

Page 7

Dot Matrix

● The two sequences to be aligned are written out as column and row headings of a two-dimensional matrix

● A dot is put in the dot matrix plot at a position where the nucleotides in the two sequences are identical

● The alignment is defined by a path through the matrix starting at the upper left and ending with the lower right.

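Dot matrix of SEQUENCEANALYSISPRIMER against itself (• = identical letters, . = no match):

  S E Q U E N C E A N A L Y S I S P R I M E R
S • . . . . . . . . . . . . • . • . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
Q . . • . . . . . . . . . . . . . . . . . . .
U . . . • . . . . . . . . . . . . . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
N . . . . . • . . . • . . . . . . . . . . . .
C . . . . . . • . . . . . . . . . . . . . . .
E . • . . • . . • . . . . . . . . . . . . • .
A . . . . . . . . • . • . . . . . . . . . . .
N . . . . . • . . . • . . . . . . . . . . . .
A . . . . . . . . • . • . . . . . . . . . . .
L . . . . . . . . . . . • . . . . . . . . . .
Y . . . . . . . . . . . . • . . . . . . . . .
S • . . . . . . . . . . . . • . • . . . . . .
I . . . . . . . . . . . . . . • . . . • . . .
S • . . . . . . . . . . . . • . • . . . . . .
P . . . . . . . . . . . . . . . . • . . . . .
R . . . . . . . . . . . . . . . . . • . . . •
I . . . . . . . . . . . . . . • . . . • . . .
M . . . . . . . . . . . . . . . . . . . • . .
E . • . . • . . • . . . . . . . . . . . . • .
R . . . . . . . . . . . . . . . . . • . . . •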
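The dot-matrix construction described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original lecture: the function name `dot_matrix` and the use of "." for empty cells are choices made here for readability.

```python
def dot_matrix(rows: str, cols: str) -> list[str]:
    """Return a text dot matrix: one header line, then one line per row residue.

    A '•' is placed wherever the row residue and column residue are identical;
    '.' (an arbitrary filler chosen for this sketch) marks every other cell.
    """
    lines = ["  " + " ".join(cols)]
    for r in rows:
        cells = ["•" if r == c else "." for c in cols]
        lines.append(r + " " + " ".join(cells))
    return lines


# The example from the slide: a sequence compared against itself.
seq = "SEQUENCEANALYSISPRIMER"
plot = dot_matrix(seq, seq)
print("\n".join(plot))
```

In a self-comparison like this one the main diagonal is always fully dotted; the off-diagonal dots come from letters that recur within the sequence.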

Page 8

Dot Matrix

● Advantages:

– a simple method

– is useful in unraveling important information about the evolution of sequences

● Disadvantages:

– may become very cluttered

– may require human intervention to recognize patterns

– may not be reliable

– limited to two sequences

Page 9

Dot Matrix Examples

[Figure: two dot matrix plots, panels a.) and b.)]

Page 10

Distance and similarity methods

● The best possible alignment between two sequences is the one which minimizes the numbers of mismatches and gaps

● However, reducing the number of mismatches usually results in an increase in the number of gaps (and vice versa)

Page 11

Distance and similarity methods

● Consider the following example:

Seq1: TCAGACGATTG   LengthSeq1 = 11

Seq2: TCGGAGCTG     LengthSeq2 = 9

● We can reduce the number of mismatches to 0, but the number of gaps in this case is 6:

Seq1: TCAG-ACG-ATTG
Seq2: TC-GGA-GC-T-G

Page 12

Distance and similarity methods

● Our example, yet again:

Seq1: TCAGACGATTG   LengthSeq1 = 11

Seq2: TCGGAGCTG     LengthSeq2 = 9

● Conversely, we can reduce the number of gaps to a single gap having the minimum possible size |LengthSeq1 – LengthSeq2| = 2 nucleotides, which increases the number of mismatches to 5 (marked with *):

Seq1: TCAGACGATTG
        *  ****
Seq2: TCGGAGCTG--

Page 13

Distance and similarity methods

● Our example, yet again:

Seq1: TCAGACGATTG   LengthSeq1 = 11

Seq2: TCGGAGCTG     LengthSeq2 = 9

● We can also choose an alignment that minimizes neither the number of gaps nor the number of mismatches. In the case below, the number of gaps is 4 and the number of mismatches is 2 (marked with *):

Seq1: TCAG-ACGATTG
              * *
Seq2: TC-GGA-GCTG-

Page 14

Distance and similarity methods

● Which of the three alignments is preferable?

● In order to determine that, we need to find a common denominator (the gap penalty) that allows us to compare gaps and mismatches

● Gap penalty – a factor (or set of factors) by which gap values (the numbers and lengths of gaps) are multiplied in order to make the gaps equivalent in value to mismatches

– Based on how frequently different types of insertions and deletions occur in evolution in comparison with the frequency of occurrence of point substitutions

● Of course, mismatch penalties also need to be assigned; they reflect how frequently substitutions occur

Page 15

Distance and similarity methods

● For any given alignment, we can calculate a distance or dissimilarity index (D) as:

$D = \sum_i m_i y_i + \sum_k w_k z_k$

where $y_i$ is the number of mismatches of type $i$, $m_i$ is the mismatch penalty for an $i$-type of mismatch, $z_k$ is the number of gaps of length $k$, and $w_k$ is a positive number representing the penalty for gaps of length $k$
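To see how D ranks the three candidate alignments of TCAGACGATTG and TCGGAGCTG, the sketch below tallies mismatches and gaps from the aligned strings and evaluates D. The penalty values are hypothetical and chosen only for illustration: a single mismatch type with penalty 1, and a gap penalty of the form `gap_open + gap_extend * k`. The helper names `tally` and `distance` are likewise made up here.

```python
from collections import Counter
from itertools import groupby


def tally(a: str, b: str):
    """Count mismatches and gaps (grouped by length) in two aligned rows."""
    assert len(a) == len(b), "aligned rows must have equal length"
    mismatches = sum(1 for x, y in zip(a, b) if "-" not in (x, y) and x != y)
    gap_lengths = Counter()  # gap length k -> number of gaps z_k
    for row in (a, b):
        for is_gap, run in groupby(row, key=lambda ch: ch == "-"):
            if is_gap:
                gap_lengths[len(list(run))] += 1
    return mismatches, gap_lengths


def distance(a, b, mismatch_penalty=1.0, gap_open=1.0, gap_extend=0.1):
    """D = sum(m_i * y_i) + sum(w_k * z_k), with one mismatch type and
    the assumed penalty w_k = gap_open + gap_extend * k."""
    y, gaps = tally(a, b)
    gap_term = sum((gap_open + gap_extend * k) * z for k, z in gaps.items())
    return mismatch_penalty * y + gap_term


alignments = [
    ("TCAG-ACG-ATTG", "TC-GGA-GC-T-G"),  # 0 mismatches, 6 gaps
    ("TCAGACGATTG", "TCGGAGCTG--"),      # 5 mismatches, 1 gap of length 2
    ("TCAG-ACGATTG", "TC-GGA-GCTG-"),    # 2 mismatches, 4 gaps
]
for a, b in alignments:
    print(a, b, tally(a, b), round(distance(a, b), 2))
```

With these particular penalties the single-gap alignment has the smallest D, but a different choice of penalties can change the ranking, which is why the penalty system matters.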

Page 16

Distance and similarity methods

● In the most frequently used gap penalty systems, it is assumed that the gap penalty includes two components:

1. Gap-opening penalty

2. Gap-extension penalty

● Further complications in the penalty system may be introduced by distinguishing among different types of mismatches (e.g. between amino acids)

– Leu and Ile (chemically similar) vs Arg and Glu (chemically dissimilar)
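One common way to write the two-component penalty, given here as a sketch with hypothetical symbols $a$ (gap-opening penalty) and $b$ (gap-extension penalty) that do not appear on the slide, is a linear function of the gap length $k$:

```latex
w_k = a + b\,k
```

Other conventions exist, for example charging the extension only for positions after the first, $w_k = a + b\,(k-1)$; this differs from the form above only by a constant shift in $a$.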

Page 17

BLOSUM

● BLOSUM (BLOcks SUbstitution Matrix) is a substitution matrix used for sequence alignment of proteins

● First introduced in a paper by Henikoff and Henikoff

– Scanned very conserved regions of protein families and counted the relative frequencies of amino acids and their substitution probabilities

– Calculated a log-odds score for each of the possible substitutions of the 20 standard amino acids

● Several sets of matrices exist

– High numbers are designed for comparing closely related sequences

– Low numbers are designed for comparing distantly related sequences

Page 18

BLOSUM50 Substitution Matrix

Key: Ala A, Cys C, Asp D, Glu E, Phe F, Gly G, His H, Ile I, Lys K, Leu L, Met M, Asn N, Pro P, Gln Q, Arg R, Ser S, Thr T, Val V, Trp W, Tyr Y

   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  5 -2 -1 -2 -1 -1 -1  0 -2 -1 -2 -1 -1 -3 -1  1  0 -3 -2  0
R -2  7 -1 -2 -4  1  0 -3  0 -4 -3  3 -2 -3 -3 -1 -1 -3 -1 -3
N -1 -1  7  2 -2  0  0  0  1 -3 -4  0 -2 -4 -2  1  0 -4 -2 -3
D -2 -2  2  8 -4  0  2 -1 -1 -4 -4 -1 -4 -5 -1  0 -1 -5 -3 -4
C -1 -4 -2 -4 13 -3 -3 -3 -3 -2 -2 -3 -2 -2 -4 -1 -1 -5 -3 -1
Q -1  1  0  0 -3  7  2 -2  1 -3 -2  2  0 -4 -1  0 -1 -1 -1 -3
E -1  0  0  2 -3  2  6 -3  0 -4 -3  1 -2 -3 -1 -1 -1 -3 -2 -3
G  0 -3  0 -1 -3 -2 -3  8 -2 -4 -4 -2 -3 -4 -2  0 -2 -3 -3 -4
H -2  0  1 -1 -3  1  0 -2 10 -4 -3  0 -1 -1 -2 -1 -2 -3  2 -4
I -1 -4 -3 -4 -2 -3 -4 -4 -4  5  2 -3  2  0 -3 -3 -1 -3 -1  4
L -2 -3 -4 -4 -2 -2 -3 -4 -3  2  5 -3  3  1 -4 -3 -1 -2 -1  1
K -1  3  0 -1 -3  2  1 -2  0 -3 -3  6 -2 -4 -1  0 -1 -3 -2 -3
M -1 -2 -2 -4 -2  0 -2 -3 -1  2  3 -2  7  0 -3 -2 -1 -1  0  1
F -3 -3 -4 -5 -2 -4 -3 -4 -1  0  1 -4  0  8 -4 -3 -2  1  4 -1
P -1 -3 -2 -1 -4 -1 -1 -2 -2 -3 -4 -1 -3 -4 10 -1 -1 -4 -3 -3
S  1 -1  1  0 -1  0 -1  0 -1 -3 -3  0 -2 -3 -1  5  2 -4 -2 -2
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  2  5 -3 -2  0
W -3 -3 -4 -5 -5 -1 -3 -3 -3 -3 -2 -3 -1  1 -4 -4 -3 15  2 -3
Y -2 -1 -2 -3 -3 -1 -2 -3  2 -1 -1 -2  0  4 -3 -2 -2  2  8 -1
V  0 -3 -3 -4 -1 -3 -3 -4 -4  4  1 -3  1 -1 -3 -2  0 -3 -1  5

$s_{x,y} = \log \frac{P_{xy}}{P_x P_y}$

$P_{xy}$ is the probability that x and y are evolutionarily related.

$P_x$ is the probability of occurrence of x.

$P_y$ is the probability of occurrence of y.
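The log-odds formula can be tried out numerically. The sketch below is illustrative only: the function name `log_odds`, the `scale` and `base` parameters, and the probabilities passed in are all hypothetical and are not taken from the BLOSUM data. Published matrices also multiply by a scaling factor and round to integers, which is what the `scale` argument stands in for.

```python
import math


def log_odds(p_xy: float, p_x: float, p_y: float,
             scale: float = 1.0, base: float = 2.0) -> float:
    """s(x, y) = scale * log_base(P_xy / (P_x * P_y)).

    Positive when the pair is seen more often than expected by chance,
    negative when it is seen less often, and zero when the two are equal.
    """
    return scale * math.log(p_xy / (p_x * p_y), base)


# Hypothetical probabilities, for illustration only.
over = log_odds(p_xy=0.125, p_x=0.25, p_y=0.25)      # pair twice as common as chance
under = log_odds(p_xy=0.03125, p_x=0.25, p_y=0.25)   # pair half as common as chance
print(over, under)
```

This is why conservative substitutions and identities receive positive scores in the matrix, while rarely observed substitutions receive negative ones.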

Page 19

Sequence Alignment Algorithms

● The purpose of any alignment algorithm is to choose the alignment associated with the smallest D from all possible alignments

● The number of possible alignments can be very large

● Fortunately, there are computer alignment algorithms for searching for the optimal alignment between two sequences

● Fundamentally, there are two different types of alignment algorithms:

1. Global (Needleman-Wunsch): both sequences are aligned along their entire lengths and the best alignment is found

2. Local (Smith-Waterman): the best subsequence alignment is found

Page 20

Global Alignment: Needleman-Wunsch

● Every letter of each sequence is aligned to a letter or gap

● Alignment takes place in a 2D matrix

● Each cell corresponds to a pairing of one letter from each sequence and contains a score derived from a scoring scheme along with a corresponding pointer

● The algorithm contains three major phases (initialization, fill, and trace-back)

● In order to examine each phase, let's align the words HEAGAWGHE and PAWHEAE using the following scoring scheme:

– gap penalty of -8

– match score and mismatch penalty to be determined using the BLOSUM50 matrix

Page 21

Global Alignment: Needleman-Wunsch

● Initialization

– Values for the first row and column are assigned

– The score of each cell is set to the gap penalty (-8) multiplied by the distance from the origin

        H   E   A   G   A   W   G   H   E
    0  -8 -16 -24 -32 -40 -48 -56 -64 -72
P  -8
A -16
W -24
H -32
E -40
A -48
E -56
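The initialization step can be sketched as follows. This is a minimal illustration rather than the lecture's own code; the function name `nw_init` and the use of `None` for cells that have not been filled yet are choices made here.

```python
def nw_init(x: str, y: str, gap: int = -8):
    """Create the score matrix with only the first row and column filled.

    x runs along the columns (horizontal sequence) and y along the rows
    (vertical sequence); every other cell is left as None for the fill phase.
    """
    F = [[None] * (len(x) + 1) for _ in range(len(y) + 1)]
    for j in range(len(x) + 1):
        F[0][j] = gap * j  # j gaps opened against the vertical sequence
    for i in range(len(y) + 1):
        F[i][0] = gap * i  # i gaps opened against the horizontal sequence
    return F


F = nw_init("HEAGAWGHE", "PAWHEAE")
print(F[0])                   # first row
print([row[0] for row in F])  # first column
```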

Page 22

Global Alignment: Needleman-Wunsch

● Fill

– Three scores are computed for each cell

● Diagonal Score – sum of the diagonal cell score and the score for a match/mismatch (BLOSUM50 matrix)

● Horizontal Score – sum of the cell to the left and the gap penalty

● Vertical Score – sum of the above cell and the gap penalty

– The entire matrix is then filled by assigning to each cell the max score (obtained from the 3 computed scores) and the corresponding pointer

        H   E   A   G   A   W   G   H   E
    0  -8 -16 -24 -32 -40 -48 -56 -64 -72
P  -8  -2
A -16
W -24
H -32
E -40
A -48
E -56

(P->H) Diagonal Score: 0 + (-2) = -2

(P->H) Horizontal Score: -8 + (-8) = -16

(P->H) Vertical Score: -8 + (-8) = -16

(P->H) Max Score = -2

Page 23

Global Alignment: Needleman-Wunsch

● Fill

– Three scores are computed for each cell

● Diagonal Score – sum of the diagonal cell score and the score for a match/mismatch (BLOSUM50 matrix)

● Horizontal Score – sum of the cell to the left and the gap penalty

● Vertical Score – sum of the above cell and the gap penalty

– The entire matrix is then filled by assigning to each cell the max score (obtained from the 3 computed scores) and the corresponding pointer

        H   E   A   G   A   W   G   H   E
    0  -8 -16 -24 -32 -40 -48 -56 -64 -72
P  -8  -2  -9
A -16
W -24
H -32
E -40
A -48
E -56

(P->E) Diagonal Score: -8 + (-1) = -9

(P->E) Horizontal Score: -2 + (-8) = -10

(P->E) Vertical Score: -16 + (-8) = -24

(P->E) Max Score = -9

Page 24

Global Alignment: Needleman-Wunsch

● Fill

– Continue calculating the max score for all cells, along with the corresponding pointer

        H   E   A   G   A   W   G   H   E
    0  -8 -16 -24 -32 -40 -48 -56 -64 -72
P  -8  -2  -9 -17 -25 -33 -41 -49 -57 -65
A -16 -10  -3  -4 -12 -20 -28 -36 -44 -52
W -24 -18 -11  -6  -7 -15  -5 -13 -21 -29
H -32 -14 -18 -13  -8  -9 -13  -7  -3 -11
E -40 -22  -8 -16 -16  -9 -12 -15  -7   3
A -48 -30 -16  -3 -11 -11 -12 -12 -15  -5
E -56 -38 -24 -11  -6 -12 -14 -15 -12  -9
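The fill phase can be reproduced with a short dynamic-programming sketch. The code below is an illustration, not the lecture's own implementation: the names `PAIRS`, `score` and `nw_fill` are made up here, and `PAIRS` holds only the BLOSUM50 entries for the six residues that occur in the two example words (copied from the BLOSUM50 matrix shown earlier).

```python
# BLOSUM50 scores for the residues in HEAGAWGHE and PAWHEAE only.
# Keys are the two residues in alphabetical order.
PAIRS = {
    "AA": 5, "AE": -1, "AG": 0, "AH": -2, "AP": -1, "AW": -3,
    "EE": 6, "EG": -3, "EH": 0, "EP": -1, "EW": -3,
    "GG": 8, "GH": -2, "GP": -2, "GW": -3,
    "HH": 10, "HP": -2, "HW": -3,
    "PP": 10, "PW": -4,
    "WW": 15,
}


def score(a: str, b: str) -> int:
    """Look up the substitution score for residues a and b (order-independent)."""
    return PAIRS["".join(sorted(a + b))]


def nw_fill(x: str, y: str, gap: int = -8):
    """Return the filled Needleman-Wunsch matrix; x = columns, y = rows."""
    F = [[0] * (len(x) + 1) for _ in range(len(y) + 1)]
    for j in range(1, len(x) + 1):
        F[0][j] = gap * j
    for i in range(1, len(y) + 1):
        F[i][0] = gap * i
    for i in range(1, len(y) + 1):
        for j in range(1, len(x) + 1):
            diagonal = F[i - 1][j - 1] + score(y[i - 1], x[j - 1])
            horizontal = F[i][j - 1] + gap  # cell to the left
            vertical = F[i - 1][j] + gap    # cell above
            F[i][j] = max(diagonal, horizontal, vertical)
    return F


F = nw_fill("HEAGAWGHE", "PAWHEAE")
for label, row in zip(" PAWHEAE", F):
    print(label, " ".join(f"{v:4d}" for v in row))
```

This sketch stores only scores; a fuller version would also record which of the three candidates produced each maximum (the pointer), or recompute it during trace-back.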

Page 25

Global Alignment: Needleman-Wunsch

● Trace-back

– Allows one to recover the alignment from the matrix

– Trace back your transitions from the bottom right corner to the top left corner by referring back to the completed matrix

        H   E   A   G   A   W   G   H   E
    0  -8 -16 -24 -32 -40 -48 -56 -64 -72
P  -8  -2  -9 -17 -25 -33 -41 -49 -57 -65
A -16 -10  -3  -4 -12 -20 -28 -36 -44 -52
W -24 -18 -11  -6  -7 -15  -5 -13 -21 -29
H -32 -14 -18 -13  -8  -9 -13  -7  -3 -11
E -40 -22  -8 -16 -16  -9 -12 -15  -7   3
A -48 -30 -16  -3 -11 -11 -12 -12 -15  -5
E -56 -38 -24 -11  -6 -12 -14 -15 -12  -9

Page 26

Global Alignment: Needleman-Wunsch

● Trace-back

– Horizontal transition represents a gap in the vertical sequence

– Vertical transition represents a gap in the horizontal sequence

– Diagonal transition represents a pairing (match or mismatch) of the corresponding characters of the two sequences

– Final Alignment (score -9):

H E A G A W G H - E
- - P - A W H E A E

        H   E   A   G   A   W   G   H   E
    0  -8 -16 -24 -32 -40 -48 -56 -64 -72
P  -8  -2  -9 -17 -25 -33 -41 -49 -57 -65
A -16 -10  -3  -4 -12 -20 -28 -36 -44 -52
W -24 -18 -11  -6  -7 -15  -5 -13 -21 -29
H -32 -14 -18 -13  -8  -9 -13  -7  -3 -11
E -40 -22  -8 -16 -16  -9 -12 -15  -7   3
A -48 -30 -16  -3 -11 -11 -12 -12 -15  -5
E -56 -38 -24 -11  -6 -12 -14 -15 -12  -9
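The trace-back rules above can be sketched in code. The version below is an illustration under stated assumptions, not the lecture's implementation: instead of stored pointers it re-derives each step from the scores, and when several moves tie it prefers diagonal, then vertical, then horizontal (a tie-breaking order chosen here; other orders can return a different alignment with the same score). The names `PAIRS`, `score` and `needleman_wunsch` are made up, and `PAIRS` contains only the BLOSUM50 entries needed for the example words.

```python
# BLOSUM50 scores for the residues in the example (keys in alphabetical order).
PAIRS = {
    "AA": 5, "AE": -1, "AG": 0, "AH": -2, "AP": -1, "AW": -3,
    "EE": 6, "EG": -3, "EH": 0, "EP": -1, "EW": -3,
    "GG": 8, "GH": -2, "GP": -2, "GW": -3,
    "HH": 10, "HP": -2, "HW": -3,
    "PP": 10, "PW": -4,
    "WW": 15,
}


def score(a: str, b: str) -> int:
    return PAIRS["".join(sorted(a + b))]


def needleman_wunsch(x: str, y: str, gap: int = -8):
    """Global alignment of x (horizontal) and y (vertical).

    Returns (score, aligned_x, aligned_y).
    """
    n, m = len(x), len(y)
    # Initialization and fill.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        F[0][j] = gap * j
    for i in range(1, m + 1):
        F[i][0] = gap * i
        for j in range(1, n + 1):
            F[i][j] = max(F[i - 1][j - 1] + score(y[i - 1], x[j - 1]),
                          F[i][j - 1] + gap,
                          F[i - 1][j] + gap)
    # Trace-back from the bottom right corner to the top left corner.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(y[i - 1], x[j - 1]):
            top.append(x[j - 1]); bottom.append(y[i - 1])  # diagonal: pair the letters
            i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append("-"); bottom.append(y[i - 1])       # vertical: gap in horizontal seq
            i -= 1
        else:
            top.append(x[j - 1]); bottom.append("-")       # horizontal: gap in vertical seq
            j -= 1
    return F[m][n], "".join(reversed(top)), "".join(reversed(bottom))


result = needleman_wunsch("HEAGAWGHE", "PAWHEAE")
print(result[0])
print(result[1])
print(result[2])
```

With this tie-breaking order the sketch recovers the alignment shown on the slide.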

Page 27

Local Alignment: Smith-Waterman

● A slight modification of the Needleman-Wunsch algorithm:

– Edges of the matrix are initialized to zero

– Max score is never less than zero, and no pointer is recorded unless the score is greater than zero

– Trace-back starts from the highest score in the matrix and ends at a score of zero

Page 28

Local Alignment: Smith-Waterman

● Again, let's align the words HEAGAWGHE and PAWHEAE using the same scoring scheme:

– gap penalty of -8

– match score and mismatch penalty to be determined using the BLOSUM50 matrix

– Start from the largest score and trace back to determine the best local alignment

– Horizontal transition represents a gap in the vertical sequence

– Vertical transition represents a gap in the horizontal sequence

– Diagonal transition represents a pairing (match or mismatch) of the corresponding characters of the two sequences

● Final Alignment (the locally aligned region is AWGHE over AW-HE, score 28; the flanking letters are shown unaligned):

H E A G A W G H E - -
- - - P A W - H E A E

      H  E  A  G  A  W  G  H  E
   0  0  0  0  0  0  0  0  0  0
P  0  0  0  0  0  0  0  0  0  0
A  0  0  0  5  0  5  0  0  0  0
W  0  0  0  0  2  0 20 12  4  0
H  0 10  2  0  0  0 12 18 22 14
E  0  2 16  8  0  0  4 10 18 28
A  0  0  8 21 13  5  0  4 10 20
E  0  0  6 13 18 12  4  0  4 16
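The three modifications that turn Needleman-Wunsch into Smith-Waterman can be sketched as follows. As with the earlier examples, this is an illustration rather than the lecture's code: `PAIRS`, `score` and `smith_waterman` are names made up here, `PAIRS` holds only the BLOSUM50 entries for the residues in the example, and the trace-back re-derives moves from the scores with a diagonal, vertical, horizontal preference on ties.

```python
# BLOSUM50 scores for the residues in the example (keys in alphabetical order).
PAIRS = {
    "AA": 5, "AE": -1, "AG": 0, "AH": -2, "AP": -1, "AW": -3,
    "EE": 6, "EG": -3, "EH": 0, "EP": -1, "EW": -3,
    "GG": 8, "GH": -2, "GP": -2, "GW": -3,
    "HH": 10, "HP": -2, "HW": -3,
    "PP": 10, "PW": -4,
    "WW": 15,
}


def score(a: str, b: str) -> int:
    return PAIRS["".join(sorted(a + b))]


def smith_waterman(x: str, y: str, gap: int = -8):
    """Local alignment of x (horizontal) and y (vertical).

    Returns (best_score, aligned_part_of_x, aligned_part_of_y).
    """
    n, m = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]  # edges stay at zero
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(0,                   # scores never drop below zero
                          F[i - 1][j - 1] + score(y[i - 1], x[j - 1]),
                          F[i][j - 1] + gap,
                          F[i - 1][j] + gap)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Trace back from the highest score until a zero is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + score(y[i - 1], x[j - 1]):
            top.append(x[j - 1]); bottom.append(y[i - 1])
            i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + gap:
            top.append("-"); bottom.append(y[i - 1])
            i -= 1
        else:
            top.append(x[j - 1]); bottom.append("-")
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


local = smith_waterman("HEAGAWGHE", "PAWHEAE")
swapped = smith_waterman("PAWHEAE", "HEAGAWGHE")
print(local)
print(swapped)
```

Calling the function with the two words swapped returns the same score and the same aligned region with the rows exchanged, which matches the question about horizontal versus vertical placement raised later in the lecture.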

Page 29

Local Alignment: Smith-Waterman

● Does it matter what “word”/sequence is horizontal/vertical?

● To answer this question, let's align PAWHEAE (horizontal) to HEAGAWGHE (vertical) using the same scoring scheme as before:

– gap penalty of -8

– match score and mismatch penalty to be determined using the BLOSUM50 matrix

– Start from the largest score and trace back to determine the best local alignment

– Horizontal transition represents a gap in the vertical sequence

– Vertical transition represents a gap in the horizontal sequence

– Diagonal transition represents a pairing (match or mismatch) of the corresponding characters of the two sequences

      P  A  W  H  E  A  E
   0  0  0  0  0  0  0  0
H  0  0  0  0 10  2  0  0
E  0  0  0  0  2 16  8  6
A  0  0  5  0  0  8 21 13
G  0  0  0  2  0  0 13 18
A  0  0  5  0  0  0  5 12
W  0  0  0 20 12  4  0  4
G  0  0  0 12 18 10  4  0
H  0  0  0  4 22 18 10  4
E  0  0  0  0 14 28 20 16

Page 31

Local Alignment: Smith-Waterman

● Does it matter what “word”/sequence is horizontal/vertical?

● To answer this question, let's align PAWHEAE (horizontal) to HEAGAWGHE (vertical) using the same scoring scheme as before:

– gap penalty of -8

– match score and mismatch penalty to be determined using the BLOSUM50 matrix

– Start from the largest score and trace back to determine the best local alignment

– Horizontal transition represents a gap in the vertical sequence

– Vertical transition represents a gap in the horizontal sequence

– Diagonal transition represents a pairing (match or mismatch) of the corresponding characters of the two sequences

● Final Alignment:

- - - P A W - H E A E
H E A G A W G H E - -

      P  A  W  H  E  A  E
   0  0  0  0  0  0  0  0
H  0  0  0  0 10  2  0  0
E  0  0  0  0  2 16  8  6
A  0  0  5  0  0  8 21 13
G  0  0  0  2  0  0 13 18
A  0  0  5  0  0  0  5 12
W  0  0  0 20 12  4  0  4
G  0  0  0 12 18 10  4  0
H  0  0  0  4 22 18 10  4
E  0  0  0  0 14 28 20 16

Page 32

So does it matter what “word”/sequence is horizontal/vertical? No, it does not. Either way, the final alignment is the same and is considered to be the “optimal” alignment

      P  A  W  H  E  A  E
   0  0  0  0  0  0  0  0
H  0  0  0  0 10  2  0  0
E  0  0  0  0  2 16  8  6
A  0  0  5  0  0  8 21 13
G  0  0  0  2  0  0 13 18
A  0  0  5  0  0  0  5 12
W  0  0  0 20 12  4  0  4
G  0  0  0 12 18 10  4  0
H  0  0  0  4 22 18 10  4
E  0  0  0  0 14 28 20 16

Final Alignment:

H E A G A W G H E - -
- - - P A W - H E A E

      H  E  A  G  A  W  G  H  E
   0  0  0  0  0  0  0  0  0  0
P  0  0  0  0  0  0  0  0  0  0
A  0  0  0  5  0  5  0  0  0  0
W  0  0  0  0  2  0 20 12  4  0
H  0 10  2  0  0  0 12 18 22 14
E  0  2 16  8  0  0  4 10 18 28
A  0  0  8 21 13  5  0  4 10 20
E  0  0  6 13 18 12  4  0  4 16

Page 33

Global or Local?

● When is a global alignment more useful?

– When sequences in a query set are similar and close in size

● When is a local alignment more useful?

– When sequences in a query set are dissimilar but suspected to contain regions of similarity

When sequences (amino acid or nucleotide) are sufficiently similar along their whole length, local and global alignments generally give the same result

Page 34

Helpful Charts

IUPAC chart: http://www.bioinformatics.org/sms/iupac.html

AA chart: http://sofbiology.blogspot.com/2010/12/protein-synthesis-amino-acid-table.html

Page 35

Except where otherwise noted (e.g. items on the slide labeled “Helpful Charts”), most information contained in this presentation was obtained from:

Graur, Dan and Wen-Hsiung Li. Fundamentals of Molecular Evolution. Second Edition. Sunderland, Massachusetts: Sinauer Associates, Inc., Publishers, 2000.

Some of the information related to global & local alignment algorithms was obtained from and can be accessed at: http://etutorials.org/Misc/blast/Part+II+Theory/Chapter+3.+Sequence+Alignment/