Sequence comparison


Page 1: Sequence  comparison

Michael Schroeder, BioTechnological Center, TU Dresden

Biotec

Sequence comparison, based on Chapter 4 of Lesk, Introduction to Bioinformatics

Page 2: Sequence  comparison

By Michael Schroeder, Biotec 2

Contents
• Motivation
• Sequence comparison and alignments
• Dot plots
• Dynamic programming
• Substitution matrices
• Dynamic programming: local and global alignments and gaps
• BLAST
• Significance of alignments
• Multiple sequence alignments

Page 3: Sequence  comparison

Motivation
• Where do we come from? The recent-Africa vs. multi-regional hypothesis
• In 1999, encephalitis caused by the West Nile virus broke out in New York. How did the virus come to New York?
• How did the nucleus get into eukaryotic cells?

To answer such questions we need sequence comparison and phylogenetic trees.

Page 4: Sequence  comparison

Sequence alignment
• Substitutions, insertions and deletions can be interpreted in evolutionary terms
• But: distinguish chance similarity from real biological relationship

[Figure: diagram relating the sequences CCGTAA, CCGTAT, TCGTAG, TAGTAC, TCGTAC, TCGTAA and TTGTAA]

Page 5: Sequence  comparison

Evolution
• Convergent evolution: the same sequence evolved from different ancestors
• Back evolution: a sequence mutates back to a previous sequence

[Figure: diagram relating the sequences CCGTAA, CCGTAT, TCGTAG, TAGTAC, TCGTAC, TCGTAA, TAGTAC, CCGTAA and TAGTAA]

Page 6: Sequence  comparison

Sequence alignments
Given two or more sequences, we wish to
• Measure their similarity
• Determine the residue-residue correspondences
• Observe patterns of conservation and variability
• Infer evolutionary relationships

Page 7: Sequence  comparison

What is the best alignment?

Uninformative:
  -------gctgaacg
  ctataatc-------

Without gaps:
  gctgaacg
  ctataatc

With gaps:
  gctga-a--cg
  --ct-ataatc

Another one:
  gctg-aa-cg
  -ctataatc-

Formally: the best alignments have only a minimal number of mismatches (insertions, deletions, replacements).

We need a method to systematically explore and compute alignments.

Page 8: Sequence  comparison

Scores for an alignment
• Sequence identity: percentage of matches
• Score each match, mismatch, gap opening and gap extension, e.g. (in column order: match, mismatch, gap opening, gap extension)
    a t t g
    a c - -

Example: match +1, mismatch -1, gap opening -3, gap extension -1

Uninformative: 0%, score = -19
  -------gctgaacg
  ctataatc-------

Without gaps: 25%, score = -4
  gctgaacg
  ctataatc

With gaps: 0%, score = -19
  gctga-a--cg
  --ct-ataatc

Another one: 50%, score = -8
  gctg-aa-cg
  -ctataatc-

Page 9: Sequence  comparison

Scores for an alignment
• Sequence identity: percentage of matches
• Score each match, mismatch, gap opening and gap extension, e.g. (in column order: match, mismatch, gap opening, gap extension)
    a t t g
    a c - -

Example: match +2, mismatch -1, gap opening -1, gap extension -1

Uninformative: 0%, score = -15
  -------gctgaacg
  ctataatc-------

Without gaps: 25%, score = -2
  gctgaacg
  ctataatc

With gaps: 0%, score = -11
  gctga-a--cg
  --ct-ataatc

Another one: 50%, score = 5
  gctg-aa-cg
  -ctataatc-
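The scores on the two preceding slides can be recomputed with a few lines of Python. This is a minimal sketch, not part of the original slides: the function name `score_alignment` is hypothetical, and it assumes that the first position of a gap costs the gap-opening penalty, every further position costs the gap-extension penalty, and identity is the number of matches divided by the alignment length. Under these assumptions it reproduces the numbers given above.

```python
def score_alignment(top, bottom, match, mismatch, gap_open, gap_extend):
    """Return (score, percent identity) of a gapped pairwise alignment.

    Assumption: the first '-' of a gap costs gap_open, each following '-'
    in the same row costs gap_extend.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    score = 0
    matches = 0
    previous_gap_row = None  # row (0 or 1) holding the gap in the previous column
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            gap_row = 0 if x == "-" else 1
            score += gap_extend if previous_gap_row == gap_row else gap_open
            previous_gap_row = gap_row
        else:
            previous_gap_row = None
            if x == y:
                score += match
                matches += 1
            else:
                score += mismatch
    return score, 100.0 * matches / len(top)


if __name__ == "__main__":
    for top, bottom in [("-------gctgaacg", "ctataatc-------"),
                        ("gctgaacg", "ctataatc"),
                        ("gctga-a--cg", "--ct-ataatc"),
                        ("gctg-aa-cg", "-ctataatc-")]:
        print(top, bottom, score_alignment(top, bottom, 1, -1, -3, -1))
```

Changing the four scoring parameters changes which alignment ranks best, which is the point of the two example slides.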

Page 10: Sequence  comparison


Dot plots

Page 11: Sequence  comparison

Dot plots
• A convenient way of comparing two sequences visually
• Use a matrix: put one sequence on the X-axis and one on the Y-axis
• Cells with identical characters are filled with a ‘1’, non-identical ones with a ‘0’ (simplest scheme; the cells could also carry weights)
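The simplest scheme described above can be sketched in Python as follows. The function names `dot_plot` and `render` are illustrative and not taken from the slides; the matrix has one row per character of the Y-axis sequence and one column per character of the X-axis sequence.

```python
def dot_plot(x_seq, y_seq):
    """Return a 0/1 matrix: rows follow y_seq, columns follow x_seq."""
    return [[1 if a == b else 0 for a in x_seq] for b in y_seq]


def render(matrix, x_seq, y_seq):
    """Draw the matrix as text, using '*' for 1 and '.' for 0."""
    lines = ["  " + " ".join(x_seq)]
    for label, row in zip(y_seq, matrix):
        lines.append(label + " " + " ".join("*" if cell else "." for cell in row))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(dot_plot("MADAM", "MADAM"), "MADAM", "MADAM"))
```

Replacing the 0/1 entries by values from a substitution matrix would give the weighted variant mentioned on the slide.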

Page 12: Sequence  comparison

Dot plots

[Figure: empty dot-plot grid for the sequences DOROTHYHODGKIN and DOROTHYCROWFOOTHODGKIN]

Page 13: Sequence  comparison

Dot plots

[Figure: dot plot of DOROTHYHODGKIN against DOROTHYCROWFOOTHODGKIN]

Page 14: Sequence  comparison

Interpreting dot plots
• What do identical sequences look like?
• What do unrelated sequences look like?
• What do distantly related sequences look like?
• What does a reversed sequence look like? Relevant for the detection of stems in RNA structures
• What does a palindrome look like? Relevant for restriction enzymes
• What do repeats look like?
• What do a protein with domains A and B and another one with domains B and C look like?

Page 15: Sequence  comparison

Dot plot for identical sequences

[Figure: dot plot of DOROTHYHODGKIN against itself]

Page 16: Sequence  comparison

Dotplot for unrelated sequences

[Figure: dot plot of DOROTHYHODGKIN against an unrelated sequence]

Page 17: Sequence  comparison

Dotplot for distantly related sequences

[Figure: dot plot of DOROTHYHODGKIN against a distantly related sequence]

Page 18: Sequence  comparison

Dotplot for reverse sequences
• Relevant for identifying stems in RNA structures
• Plot the sequence against its reverse complement

Page 19: Sequence  comparison

Dotplot for reverse sequences

[Figure: dot plot of DOROTHYHODGKIN against its reverse]

Page 20: Sequence  comparison

Dotplot of a Palindrome

[Figure: dot plot of MADAM against itself]

Page 21: Sequence  comparison

Dotplot of repeats

[Figure: dot plot of TWENTYONETWENTYTWO against itself]

Page 22: Sequence  comparison

Dotplot of Repeats/Palindrome

[Figure: dot plot of MADAMIMADAM against itself]

Page 23: Sequence  comparison

Dotplot for shared domain

[Figure: dot plot of DOROTHYHODGKIN against a second sequence that shares one domain with it]

Page 24: Sequence  comparison

Result: Dot plot

[Figure: dot plot of dorothyhodgkin against dorothycrowfoothodgkin, matching cells marked with *]

Page 25: Sequence  comparison

Dotplots
• Window size 15
• Dot if 6 matches in window
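A windowed dot plot along these lines could look as follows in Python. This is a sketch under two assumptions that the slide does not spell out: a dot is placed at the start position of a window pair, and "6 matches in window" is read as "at least 6 matching positions". The function name `windowed_dot_plot` is hypothetical.

```python
def windowed_dot_plot(x_seq, y_seq, window=15, min_matches=6):
    """Return a 0/1 matrix over all window start positions.

    Entry [i][j] is 1 if the windows y_seq[i:i+window] and
    x_seq[j:j+window] agree in at least min_matches positions
    (assumed reading of the slide).
    """
    rows = max(len(y_seq) - window + 1, 0)
    cols = max(len(x_seq) - window + 1, 0)
    plot = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(1 for k in range(window) if y_seq[i + k] == x_seq[j + k])
            row.append(1 if matches >= min_matches else 0)
        plot.append(row)
    return plot


if __name__ == "__main__":
    for row in windowed_dot_plot("ABCDE", "ABXDE", window=3, min_matches=2):
        print(row)
```

With `window=1` and `min_matches=1` this reduces to the plain dot plot; larger windows suppress isolated chance matches, which is why the slides contrast window sizes 15, 5 and 1.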

Page 26: Sequence  comparison

Window size 15, dot if 6 matches in window

[Figure: dot plot of Cacain and Caricain, two proteases from papaya]

Page 27: Sequence  comparison

>gi|1942644|pdb|1MEG| Crystal Structure Of A Caricain D158e Mutant In Complex With E-64

Length = 216

Score = 271 bits (693), Expect = 1e-73
Identities = 142/216 (65%), Positives = 168/216 (77%), Gaps = 4/216 (1%)

Query: 1   IPEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKIRTGNLNQYSEQELLDCDRRS 60
           +PE VDWR+KGAVTPV++QGSCGSCWAFSAV T+EGI KIRTG L + SEQEL+DC+RRS
Sbjct: 1   LPENVDWRKKGAVTPVRHQGSCGSCWAFSAVATVEGINKIRTGKLVELSEQELVDCERRS 60

Query: 61  YGCNGGYPWSALQLVAQYGIHYRNTYPYEGVQRYCRSREKGPYAAKTDGVRQVQPYNQGA 120
           +GC GGYP AL+ VA+ GIH R+ YPY+ Q CR+++ G KT GV +VQP N+G
Sbjct: 61  HGCKGGYPPYALEYVAKNGIHLRSKYPYKAKQGTCRAKQVGGPIVKTSGVGRVQPNNEGN 120

Query: 121 LLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCGNKVDHAVAAV----GYGPNYILIKNS 176
           LL +IA QPVSVV+++ G+ FQLY+GGIF GPCG KV+HAV AV G YILIKNS
Sbjct: 121 LLNAIAKQPVSVVVESKGRPFQLYKGGIFEGPCGTKVEHAVTAVGYGKSGGKGYILIKNS 180

Query: 177 WGTGWGENGYIRIKRGTGNSYGVCGLYTSSFYPVKN 212
           WGT WGE GYIRIKR GNS GVCGLY SS+YP KN
Sbjct: 181 WGTAWGEKGYIRIKRAPGNSPGVCGLYKSSYYPTKN 216

  1 lpenvdwrkk gavtpvrhqg scgscwafsa vatveginki rtgklvelse qelvdcerrs
 61 hgckggyppy aleyvakngi hlrskypyka kqgtcrakqv ggpivktsgv grvqpnnegn
121 llnaiakqpv svvveskgrp fqlykggife gpcgtkveha vtavgygksg gkgyilikns
181 wgtawgekgy irikrapgns pgvcglykss yyptkn

Cacain and Caricain, two proteases from papaya

Page 28: Sequence  comparison

Window size 15, dot if 6 matches in window

[Figure: dot plot of Cacain and Cruzain, a protease from Trypanosoma cruzi]

Page 29: Sequence  comparison

>gi|2624670|pdb|1AIM| Cruzain Inhibited By Benzoyl-Tyrosine-Alanine-Fluoromethylketone

Length = 215

Score = 121 bits (303), Expect = 3e-28
Identities = 78/202 (38%), Positives = 107/202 (52%), Gaps = 13/202 (6%)

Query: 2   PEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKIRTGNLNQYSEQELLDCDRRSY 61
           P VDWR +GAVT VK+QG CGSCWAFSA+ +E + L SEQ L+ CD+
Sbjct: 2   PAAVDWRARGAVTAVKDQGQCGSCWAFSAIGNVECQWFLAGHPLTNLSEQMLVSCDKTDS 61

Query: 62  GCNGGYPWSALQLVAQY---GIHYRNTYPY---EGVQRYCRSREKGPYAAKTDGVRQVQP 115
           GC+GG +A + + Q ++ ++YPY EG+ C + A T V Q
Sbjct: 62  GCSGGLMNNAFEWIVQENNGAVYTEDSYPYASGEGISPPCTTSGHTVGATITGHVELPQD 121

Query: 116 YNQGALLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCGNKVDHAVAAVGYGPN----YI 171
           Q A ++ N PV+V + A+ + Y GG+ +DH V VGY + Y
Sbjct: 122 EAQIAAWLAV-NGPVAVAVDAS--SWMTYTGGVMTSCVSEALDHGVLLVGYNDSAAVPYW 178

Query: 172 LIKNSWGTGWGENGYIRIKRGT 193
           +IKNSW T WGE GYIRI +G+
Sbjct: 179 IIKNSWTTQWGEEGYIRIAKGS 200

Cacain and Cruzain, a protease from Trypanosoma

Page 30: Sequence  comparison

Window size 15, dot if 6 matches in window

[Figure: dot plot of Cacain and Cathepsin, a human protease]

Page 31: Sequence  comparison

gi|7546546|pdb|1EF7|B Chain B, Crystal Structure Of Human Cathepsin X

Length = 242

Score = 52.0 bits (123), Expect = 2e-07
Identities = 60/231 (25%), Positives = 94/231 (40%), Gaps = 34/231 (14%)

Query: 1   IPEYVDWRQKGAV---TPVKNQ---GSCGSCWAFSAVVTIEGIIKIRTGNL---NQYSEQ 51
           +P+ DWR V + +NQ CGSCWA ++ + I I+ S Q
Sbjct: 1   LPKSWDWRNVDGVNYASITRNQHIPQYCGSCWAHASTSAMADRINIKRKGAWPSTLLSVQ 60

Query: 52  ELLDCDRRSYGCNGGYPWSALQLVAQYGIHYRNTYPYEGVQRYCR--------SREKGPY 103
           ++DC C GG S Q+GI Y+ + C + K +
Sbjct: 61  NVIDCGNAG-SCEGGNDLSVWDYAHQHGIPDETCNNYQAKDQECDKFNQCGTCNEFKECH 119

Query: 104 AAKTDGVRQVQPYN-----QGALLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCGNK-V 157
           A + + +V Y + + AN P+S + A + Y GGI+ +
Sbjct: 120 AIRNYTLWRVGDYGSLSGREKMMAEIYANGPISCGIMATER-LANYTGGIYAEYQDTTYI 178

Query: 158 DHAVAAVGY----GPNYILIKNSWGTGWGENGYIRI-----KRGTGNSYGV 199
           +H V+ G+ G Y +++NSWG WGE G++RI K G G Y +
Sbjct: 179 NHVVSVAGWGISDGTEYWIVRNSWGEPWGERGWLRIVTSTYKDGKGARYNL 229

Cacain and Cathepsin, a human protease

Page 32: Sequence  comparison

Window size 5, dot if 2 matches in window

[Figure: dot plot of Cacain and Cathepsin, a human protease]

Page 33: Sequence  comparison

Window size 1, dot if 1 match in window

[Figure: dot plot of Cacain and Cathepsin, a human protease]

Page 34: Sequence  comparison


Dynamic programming

Page 35: Sequence  comparison

From Dotplots to Alignments
Obvious best alignment:
  DOROTHYCROWFOOTHODGKIN
  DOROTHY--------HODGKIN

[Figure: dot plot of DOROTHYHODGKIN against DOROTHYCROWFOOTHODGKIN]

Page 36: Sequence  comparison

From Dotplots to Alignments
• Find the “best” path from the top left corner to the bottom right
• Moving “east” corresponds to a “-” in the second sequence
• Moving “south” corresponds to a “-” in the first sequence
• Moving “southeast” corresponds to a match (if the characters are the same) or a mismatch (otherwise)

Can we automate this?
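The correspondence between moves and alignment columns can be made concrete with a small Python sketch. The function name `path_to_alignment` and the move codes "E", "S" and "D" (east, south, southeast) are illustrative choices; the first sequence is assumed to run along the horizontal axis and the second along the vertical axis.

```python
def path_to_alignment(first, second, path):
    """Turn a path of moves through the matrix into two aligned rows.

    'E' consumes a character of `first` and writes '-' in the second row,
    'S' consumes a character of `second` and writes '-' in the first row,
    'D' consumes one character of each (match or mismatch).
    """
    top, bottom = [], []
    i = j = 0  # i indexes `first`, j indexes `second`
    for move in path:
        if move == "E":
            top.append(first[i])
            bottom.append("-")
            i += 1
        elif move == "S":
            top.append("-")
            bottom.append(second[j])
            j += 1
        elif move == "D":
            top.append(first[i])
            bottom.append(second[j])
            i += 1
            j += 1
        else:
            raise ValueError("unknown move: " + move)
    if i != len(first) or j != len(second):
        raise ValueError("path does not end in the bottom right corner")
    return "".join(top), "".join(bottom)


if __name__ == "__main__":
    print(path_to_alignment("tgcata", "atctgat", "SDEDEDSDS"))
```

Every path from the top left to the bottom right corner yields one alignment, so finding the best alignment amounts to finding the best path.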

Page 37: Sequence  comparison

From Dotplots to Alignments
Algorithm (dynamic programming):
• Insert a row 0 and a column 0 initialised with 0
• Starting from the top left, move down row by row from row 1 and right column by column from column 1, visiting each cell
• Consider
  • the value of the cell north
  • the value of the cell west
  • the value of the cell northwest, if the row/column characters mismatch
  • 1 + the value of the cell northwest, if the row/column characters match
• Put down the maximum of these values as the value of the current cell
• Trace back the path with the highest values from the bottom right to the top left and output the alignment
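The table-filling part of this algorithm can be sketched in Python as below. The function name `fill_table` is hypothetical; the column sequence TGCATA and the row sequence ATCTGAT are the ones used in the worked example on the following slides.

```python
def fill_table(col_seq, row_seq):
    """Fill the dynamic-programming table row by row.

    Row 0 and column 0 are initialised with 0. Each other cell takes the
    maximum of north, west and northwest (+1 if the characters match).
    """
    n, m = len(row_seq), len(col_seq)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            bonus = 1 if row_seq[i - 1] == col_seq[j - 1] else 0
            table[i][j] = max(table[i - 1][j],              # north
                              table[i][j - 1],              # west
                              table[i - 1][j - 1] + bonus)  # northwest
    return table


if __name__ == "__main__":
    for row in fill_table("TGCATA", "ATCTGAT"):
        print(row)
```

The bottom right cell holds the number of matched characters in the best alignment; the traceback step then recovers which characters were matched.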

Page 38: Sequence  comparison

From Dotplots to Alignments

      0  1  2  3  4  5  6
         T  G  C  A  T  A
0
1  A
2  T
3  C
4  T
5  G
6  A
7  T

Page 39: Sequence  comparison

From Dotplots to Alignments

      0  1  2  3  4  5  6
         T  G  C  A  T  A
0     0  0  0  0  0  0  0
1  A  0
2  T  0
3  C  0
4  T  0
5  G  0
6  A  0
7  T  0

Insert a row 0 and a column 0 initialised with 0

Page 40: Sequence  comparison

From Dotplots to Alignments

      0  1  2  3  4  5  6
         T  G  C  A  T  A
0     0  0  0  0  0  0  0
1  A  0  0  0  0  1  1  1
2  T  0
3  C  0
4  T  0
5  G  0
6  A  0
7  T  0

• Consider
  • Value north
  • Value west
  • Value northwest, if the row/column characters mismatch
  • 1 + value northwest, if the row/column characters match
• Put down the maximum of these values for the current cell

Page 41: Sequence  comparison

From Dotplots to Alignments

      0  1  2  3  4  5  6
         T  G  C  A  T  A
0     0  0  0  0  0  0  0
1  A  0  0  0  0  1  1  1
2  T  0  1  1  1  1  2  2
3  C  0  1  1  2  2  2  2
4  T  0  1  1  2  2  3  3
5  G  0  1  2  2  2  3  3
6  A  0  1  2  2  3  3  4
7  T  0  1  2  2  3  4  4

Page 42: Sequence  comparison

Reading the Alignment

      0  1  2  3  4  5  6
         T  G  C  A  T  A
0     0  0  0  0  0  0  0
1  A  0  0  0  0  1  1  1
2  T  0  1  1  1  1  2  2
3  C  0  1  1  2  2  2  2
4  T  0  1  1  2  2  3  3
5  G  0  1  2  2  2  3  3
6  A  0  1  2  2  3  3  4
7  T  0  1  2  2  3  4  4

  -tgcat-a-
  at-c-tgat

Page 43: Sequence  comparison

Reading the Alignment: there is more than one possibility

      0  1  2  3  4  5  6
         T  G  C  A  T  A
0     0  0  0  0  0  0  0
1  A  0  0  0  0  1  1  1
2  T  0  1  1  1  1  2  2
3  C  0  1  1  2  2  2  2
4  T  0  1  1  2  2  3  3
5  G  0  1  2  2  2  3  3
6  A  0  1  2  2  3  3  4
7  T  0  1  2  2  3  4  4

  ---tgcata
  atctg-at-

Page 44: Sequence  comparison

Formally: Longest Common Subsequence (LCS)

What is the length $s(V,W)$ of the longest common subsequence of two sequences $V = v_1 \ldots v_n$ and $W = w_1 \ldots w_m$?

Find sequences of indices $1 \le i_1 < \ldots < i_k \le n$ and $1 \le j_1 < \ldots < j_k \le m$ such that $v_{i_t} = w_{j_t}$ for $1 \le t \le k$.

How? Dynamic programming: $s_{i,0} = s_{0,j} = 0$ for all $1 \le i \le n$ and $1 \le j \le m$, and

$$ s_{i,j} = \max \begin{cases} s_{i-1,j} \\ s_{i,j-1} \\ s_{i-1,j-1} + 1, & \text{if } v_i = w_j \end{cases} $$

Then $s(V,W) = s_{n,m}$ is the length of the LCS.

Page 45: Sequence  comparison

Example LCS

      0  1  2  3  4  5  6
         T  G  C  A  T  A
0
1  A
2  T
3  C
4  T
5  G
6  A
7  T

Page 46: Sequence  comparison

Example LCS:

      0  1  2  3  4  5  6
         T  G  C  A  T  A
0     0  0  0  0  0  0  0
1  A  0
2  T  0
3  C  0
4  T  0
5  G  0
6  A  0
7  T  0

Initialisation: $s_{i,0} = s_{0,j} = 0$ for all $1 \le i \le n$ and $1 \le j \le m$

Page 47: Sequence  comparison

Example LCS:

      0  1  2  3  4  5  6
         T  G  C  A  T  A
0     0  0  0  0  0  0  0
1  A  0  0  0  0  1  1  1
2  T  0
3  C  0
4  T  0
5  G  0
6  A  0
7  T  0

Computing each cell:

$$ s_{i,j} = \max \begin{cases} s_{i-1,j} \\ s_{i,j-1} \\ s_{i-1,j-1} + 1, & \text{if } v_i = w_j \end{cases} $$

Page 48: Sequence  comparison

Example LCS:

      0  1  2  3  4  5  6
         T  G  C  A  T  A
0     0  0  0  0  0  0  0
1  A  0  0  0  0  1  1  1
2  T  0  1  1  1  1  2  2
3  C  0  1  1  2  2  2  2
4  T  0  1  1  2  2  3  3
5  G  0  1  2  2  2  3  3
6  A  0  1  2  2  3  3  4
7  T  0  1  2  2  3  4  4

Computing each cell:

$$ s_{i,j} = \max \begin{cases} s_{i-1,j} \\ s_{i,j-1} \\ s_{i-1,j-1} + 1, & \text{if } v_i = w_j \end{cases} $$

Page 49: Sequence  comparison

LCS Algorithm

LCS(V,W)
  For i = 0 to n
    s_{i,0} = 0
  For j = 0 to m
    s_{0,j} = 0
  For i = 1 to n
    For j = 1 to m
      If v_i = w_j and s_{i-1,j-1} + 1 ≥ s_{i-1,j} and s_{i-1,j-1} + 1 ≥ s_{i,j-1} Then
        s_{i,j} = s_{i-1,j-1} + 1
        b_{i,j} = North West
      Else if s_{i-1,j} ≥ s_{i,j-1} Then
        s_{i,j} = s_{i-1,j}
        b_{i,j} = North
      Else
        s_{i,j} = s_{i,j-1}
        b_{i,j} = West
  Return s and b

Complexity: LCS has quadratic complexity, O(n m)

Page 50: Sequence  comparison

Printing the alignment of LCS

PRINT-LCS(b,V,i,j)
  If i = 0 or j = 0 Then
    Return
  If b_{i,j} = North West Then
    PRINT-LCS(b,V,i-1,j-1)
    Print v_i
  Else if b_{i,j} = North Then
    PRINT-LCS(b,V,i-1,j)
  Else
    PRINT-LCS(b,V,i,j-1)
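The two pseudocode procedures LCS and PRINT-LCS translate almost line by line into Python. In this sketch the names `lcs` and `read_lcs` are illustrative, indices are shifted because Python strings start at 0, and `read_lcs` returns the subsequence as a string instead of printing it.

```python
def lcs(v, w):
    """Return the score table s and the backpointer table b for v and w."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    b = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = s[i - 1][j - 1] + 1
            if v[i - 1] == w[j - 1] and diagonal >= s[i - 1][j] and diagonal >= s[i][j - 1]:
                s[i][j], b[i][j] = diagonal, "NW"
            elif s[i - 1][j] >= s[i][j - 1]:
                s[i][j], b[i][j] = s[i - 1][j], "N"
            else:
                s[i][j], b[i][j] = s[i][j - 1], "W"
    return s, b


def read_lcs(b, v, i, j):
    """Follow the backpointers from cell (i, j) and collect the LCS."""
    if i == 0 or j == 0:
        return ""
    if b[i][j] == "NW":
        return read_lcs(b, v, i - 1, j - 1) + v[i - 1]
    if b[i][j] == "N":
        return read_lcs(b, v, i - 1, j)
    return read_lcs(b, v, i, j - 1)


if __name__ == "__main__":
    scores, pointers = lcs("ATCTGAT", "TGCATA")
    print(scores[7][6], read_lcs(pointers, "ATCTGAT", 7, 6))
```

The recursion depth is at most n + m, so for long sequences an iterative traceback would be the safer design; the recursive form is kept here only because it mirrors the slide.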

Page 51: Sequence  comparison

Rewards/Penalties
We can use different schemes:
• -1 for insert/delete/mismatch
• +1 for match

Consider
• -1 + the value of the cell north
• -1 + the value of the cell west
• -1 + the value of the cell northwest, if the row/column characters mismatch
• +1 + the value of the cell northwest, if the row/column characters match

Put down the maximum of these values as the value of the current cell
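As a sketch of how this scheme changes the table, the following Python function fills the matrix with configurable rewards and penalties. The name `score_table` and its parameter names are hypothetical. One assumption is taken from the worked tables in these slides rather than from the rule itself: row 0 and column 0 stay at 0 and are not penalised.

```python
def score_table(col_seq, row_seq, match=1, mismatch=-1, gap=-1):
    """Fill the table with a reward for matches and penalties otherwise.

    Row 0 and column 0 are kept at 0 (assumption taken from the slides).
    """
    n, m = len(row_seq), len(col_seq)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = row_seq[i - 1] == col_seq[j - 1]
            table[i][j] = max(
                table[i - 1][j] + gap,                               # north
                table[i][j - 1] + gap,                               # west
                table[i - 1][j - 1] + (match if same else mismatch), # northwest
            )
    return table


if __name__ == "__main__":
    for row in score_table("TGCATA", "ATCTGAT"):
        print(row)
```

Setting `mismatch=0` and `gap=0` gives back the plain longest-common-subsequence table.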

Page 52: Sequence  comparison

Reading the Alignment

        0   1   2   3   4   5   6
            T   G   C   A   T   A
0       0   0   0   0   0   0   0
1  A    0  -1  -1  -1   1   0   1
2  T    0   1   0  -1   0   2   1
3  C    0   0   0   1   0   1   1
4  T    0   1   0   0   0   1   0
5  G    0   0   2   1   0   0   0
6  A    0  -1   1   1   2   1   1
7  T    0   1   0   0   1   3   2

  ---tgcata
  atctg-at-

Page 53: Sequence  comparison

Rewards/Penalties
Let’s refine the schemes:
• Transition mutations are more common: purine<->purine (a<->g), pyrimidine<->pyrimidine (t<->c)
• Transversions (purine<->pyrimidine) are less common

Use a substitution matrix to rate mismatches:
• -2 for insert/delete
• Mismatch/match according to the substitution matrix

Consider
• -2 + the value of the cell north
• -2 + the value of the cell west
• The corresponding value of the substitution matrix + the value of the cell northwest

Put down the maximum of these values as the value of the current cell

Substitution matrix:

       C   G   T   A
  C    2  -2   0  -2
  G   -2   2  -2   0
  T    0  -2   2  -2
  A   -2   0  -2   2
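The nucleotide substitution matrix above follows a simple rule: +2 for identical bases, 0 for a transition and -2 for a transversion. The Python sketch below encodes that rule and uses it, with -2 per insert/delete, to fill the table for the running example. The names `substitution` and `matrix_score_table` are hypothetical, and row 0 and column 0 are again kept at 0 as in the slide tables.

```python
PURINES = {"A", "G"}  # the remaining bases C and T are pyrimidines


def substitution(a, b):
    """Score a pair of bases: identity 2, transition 0, transversion -2."""
    if a == b:
        return 2
    same_class = (a in PURINES) == (b in PURINES)
    return 0 if same_class else -2


def matrix_score_table(col_seq, row_seq, gap=-2):
    """Fill the table using the substitution matrix for the diagonal move."""
    n, m = len(row_seq), len(col_seq)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = max(
                table[i - 1][j] + gap,
                table[i][j - 1] + gap,
                table[i - 1][j - 1] + substitution(row_seq[i - 1], col_seq[j - 1]),
            )
    return table


if __name__ == "__main__":
    for row in matrix_score_table("TGCATA", "ATCTGAT"):
        print(row)
```

For proteins the function `substitution` would be replaced by a lookup in a matrix such as PAM 250 or BLOSUM 62.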

Page 54: Sequence  comparison

Reading the Alignment

        0   1   2   3   4   5   6
            T   G   C   A   T   A
0       0   0   0   0   0   0   0
1  A    0  -2   0  -2   2   0   2
2  T    0   2   0   0   0   4   2
3  C    0   0   0   2   0   2   2
4  T    0   2   0   0   0   2   0
5  G    0   0   4   2   0   0   2
6  A    0  -2   2   2   4   2   2
7  T    0   2   0   2   2   6   4

  ---tgcata
  atctg-at-

Page 55: Sequence  comparison

Substitution matrices

Page 56: Sequence  comparison

How to derive a substitution matrix for amino acids?

Amino acids can be classified by physicochemical properties.

[Figure: diagram grouping the amino acids into the classes hydrophobic, acidic, polar, aromatic and basic]

Page 57: Sequence  comparison

PAM 250 matrix

Cys  12
Ser   0   2
Thr  -2   1   3
Pro  -3   1   0   6
Ala  -2   1   1   1   2
Gly  -3   1   0  -1   1   5
Asn  -4   1   0  -1   0   0   2
Asp  -5   0   0  -1   0   1   2   4
Glu  -5   0   0  -1   0   0   1   3   4
Gln  -5  -1  -1   0   0  -1   1   2   2   4
His  -3  -1  -1   0  -1  -2   2   1   1   3   6
Arg  -4   0  -1   0  -2  -3   0  -1  -1   1   2   6
Lys  -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
Met  -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
Ile  -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
Leu  -6  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   6
Val  -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
Phe  -4  -3  -3  -5  -4  -5  -4  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
Tyr   0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
Trp  -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17
      C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W

> 0: likely mutation
= 0: random mutation
< 0: unlikely mutation

Page 58: Sequence  comparison


Cys 12 0 -2 -3 -2 -3 -4 -5 -5 -5 -3 -4 -5 -5 -2 -6 -2 -4 0 -8

Ser 0 2 1 1 1 1 1 0 0 -1 -1 0 0 -2 -1 -3 -1 -3 -3 -2

Thr -2 1 3 0 1 0 0 0 0 -1 -1 -1 0 -1 0 -2 0 -3 -3 -5

Pro -3 1 0 6 1 -1 -1 -1 -1 0 0 0 -1 -2 -2 -3 -1 -5 -5 -6

Ala -2 1 1 1 2 1 0 0 0 0 -1 -2 -1 -1 -1 -2 0 -4 -3 -6

Gly -3 1 0 -1 1 5 0 1 0 -1 -2 -3 -2 -3 -3 -4 -1 -5 -5 -7

Asn -4 1 0 -1 0 0 2 2 1 1 2 0 1 -2 -2 -3 -2 -4 -2 -4

Asp -5 0 0 -1 0 1 2 4 3 2 1 -1 0 -3 -2 -4 -2 -6 -4 -7

Glu -5 0 0 -1 0 0 1 3 4 2 1 -1 0 -2 -2 -3 -2 -5 -4 -7

Gln -5 -1 -1 0 0 -1 1 2 2 4 3 1 1 -1 -2 -2 -2 -5 -4 -5

His -3 -1 -1 0 -1 -2 2 1 1 3 6 2 0 -2 -2 -2 -2 -2 0 -3

Arg -4 0 -1 0 -2 -3 0 -1 -1 1 2 6 3 0 -2 -3 -2 -4 -4 2

Lys -5 0 0 -1 -1 -2 1 0 0 1 0 3 5 0 -2 -3 -2 -5 -4 -3

Met -5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2 0 0 6 2 4 2 0 -2 -4

Ile -2 -1 0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2 2 5 2 4 1 -1 -5

Leu -6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3 4 2 6 2 2 -1 -2

Val -2 -1 0 -1 0 -1 -2 -2 -2 -2 -2 -2 -2 2 4 2 4 -1 -2 -6

Phe -4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5 0 1 2 -1 9 7 0

Tyr 0 -3 -3 -5 -3 -5 -2 -4 -4 -4 0 -4 -4 -2 -1 -1 -2 7 10 0

Trp -8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -3 2 -3 -4 -5 -2 -6 0 0 17

  C S T P A G N D E Q H R K M I L V F Y W

 

Average -2.8 -0.5 -0.7 -1.2 -0.9 -1.6 -0.7 -1.1 -1.1 -0.8 -0.3 -0.7 -0.9 -0.8 -0.8 -1.4 -0.8 -1.9 -1.5 -3.1

StDev 4 1.5 1.7 2.6 1.9 2.7 2.1 3 2.8 2.6 2.3 2.6 2.5 2.6 2.4 3 2.3 4.1 3.8 5.4


Page 60: Sequence  comparison

PAM 250: Interpretation
• Immutable:
  • Cysteine (Avg = -2.8): known to have several unique, indispensable functions
    • Attachment site of the heme group in cytochrome and of iron sulphur (FeS) in ferredoxins
    • Cross-links in proteins such as chymotrypsin or ribonuclease
    • Seldom without unique function
  • Glycine (Avg = -1.6): small size may be advantageous
• Mutable:
  • Serine often functions in the active site, but can be easily replaced
• Self-alignment:
  • Tryptophan with itself scores very high, as W occurs rarely

Page 61: Sequence  comparison

Point Accepted Mutations (PAM)
• Substitution matrix using an explicit evolutionary model of how amino acids change over time
• Use the parsimony method to determine the frequency of mutations
• Entry in the PAM matrix: likelihood ratio for residues a and b, i.e. probability that a-b is a mutation / probability that a-b is chance
• PAM x: two sequences V, W have an evolutionary distance of x PAM if a series of accepted point mutations (and no insertions/deletions) converts V into W, averaging x point mutations per 100 residues
• Mutations here = mutations in the DNA
• Because of silent mutations and back mutations, x can be > 100
• PAM 250 is most commonly used

Page 62: Sequence  comparison

PAM and Sequence Similarity

PAM           0   30   80  110  200  250
% identity  100   75   60   50   25   20

Page 63: Sequence  comparison

PAM
• Dayhoff, Eck, Park: A model of evolutionary change in proteins, 1978
• Accepted point mutation = substitution of an amino acid accepted by natural selection
• Assumption: X replacing Y is as likely as Y replacing X
• Used cytochrome c, hemoglobin, myoglobin, virus coat proteins, chymotrypsinogen, glyceraldehyde 3-phosphate dehydrogenase, clupeine, insulin, ferredoxin
• Sequences which are too distantly related have been omitted, as they are more likely to contain multiple mutations per site

Page 64: Sequence  comparison

PAM: Step 1
Construct a multiple alignment. Example:
  ACGCTAFKI
  GCGCTAFKI
  ACGCTAFKL
  GCGCTGFKI
  GCGCTLFKI
  ASGCTAFKL
  ACACTAFKL

Page 65: Sequence  comparison

PAM: Step 2
Create a phylogenetic tree (parsimony method):

                      ACGCTAFKI
               A->G /           \ I->L
           GCGCTAFKI             ACGCTAFKL
      A->G /    \ A->L       C->S /    \ G->A
  GCGCTGFKI  GCGCTLFKI   ASGCTAFKL  ACACTAFKL
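Once the tree is fixed, the accepted point mutations can be counted along its edges. The Python sketch below does this for the example tree; the list `EDGES` and the function `count_substitutions` are illustrative names, and the counts are kept per unordered pair of residues, in line with the assumption that X replacing Y is as likely as Y replacing X.

```python
from collections import Counter

# Parent/child pairs of the example tree from this slide.
EDGES = [
    ("ACGCTAFKI", "GCGCTAFKI"),
    ("ACGCTAFKI", "ACGCTAFKL"),
    ("GCGCTAFKI", "GCGCTGFKI"),
    ("GCGCTAFKI", "GCGCTLFKI"),
    ("ACGCTAFKL", "ASGCTAFKL"),
    ("ACGCTAFKL", "ACACTAFKL"),
]


def count_substitutions(edges):
    """Count substitutions along tree edges, ignoring their direction."""
    counts = Counter()
    for parent, child in edges:
        for a, b in zip(parent, child):
            if a != b:
                counts[frozenset((a, b))] += 1
    return counts


if __name__ == "__main__":
    for pair, count in count_substitutions(EDGES).items():
        print("<->".join(sorted(pair)), count)
```

These raw counts are the input from which mutabilities and, eventually, the matrix entries are derived in the next step.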

Page 66: Sequence  comparison

PAM: Step 3
• Relative mutability m_i: probability that residue i will mutate
• Relative mutability depends on
  • Substitutability: relative mutability should increase with increasing substitutability
  • Residue probability: relative mutability should decrease with increasing residue probability
• The final PAM score is derived from the relative mutability

ri

Page 67: Sequence  comparison

BLOSUM
• A different approach from PAM
• BLOcks SUbstitution Matrix (based on the BLOCKS database)
• Generation of BLOSUM x:
  • Group highly similar sequences and replace them by a representative sequence
  • Only consider sequences with no more than x % similarity
  • Align sequences (no gaps)
  • For any pair of amino acids a, b and for all columns c of the alignment, let q(a,b) be the number of co-occurrences of a, b in all columns c
  • Let p(a) be the overall probability of a occurring
  • The BLOSUM entry for a, b is $\log_2 \frac{q(a,b)}{p(a)\,p(b)}$
• BLOSUM 50 and BLOSUM 62 are widely used
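The log-odds step can be illustrated with a toy block of ungapped sequences. The sketch below is a simplification and not the published BLOSUM procedure: it skips the clustering step, does not scale or round the scores, and it normalises q(a,b) to a relative frequency over ordered residue pairs so that the ratio q(a,b) / (p(a) p(b)) compares like with like. The function name `log_odds_scores` is hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_scores(block):
    """Return {(a, b): log2(q(a,b) / (p(a) * p(b)))} for an ungapped block.

    Assumption: q(a,b) is the relative frequency of the ordered pair (a, b)
    among all residue pairs within columns; p(a) is the residue frequency.
    """
    pair_counts = Counter()
    residue_counts = Counter()
    for column in zip(*block):
        residue_counts.update(column)
        for a, b in combinations(column, 2):
            pair_counts[(a, b)] += 1  # count both orders so that the
            pair_counts[(b, a)] += 1  # table is symmetric
    total_pairs = sum(pair_counts.values())
    total_residues = sum(residue_counts.values())
    scores = {}
    for (a, b), count in pair_counts.items():
        q = count / total_pairs
        p_a = residue_counts[a] / total_residues
        p_b = residue_counts[b] / total_residues
        scores[(a, b)] = math.log2(q / (p_a * p_b))
    return scores


if __name__ == "__main__":
    for pair, score in sorted(log_odds_scores(["AB", "AB", "BA"]).items()):
        print(pair, round(score, 3))
```

A positive score means the pair is seen together more often than expected by chance, a negative score less often.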

Page 68: Sequence  comparison

LCS Algorithm (Longest Common Subsequence) Revisited

Algorithm (dynamic programming) with substitution matrix:
• Insert a row 0 and a column 0 initialised with 0
• Starting from the top left, move down row by row from row 1 and right column by column from column 1, visiting each cell
• Consider
  • the value of the cell north
  • the value of the cell west
  • s + the value of the cell northwest, where s is the value in the substitution matrix for the residues in the row/column (this covers both matches and mismatches)
• Put down the maximum of these values as the value of the current cell
• Trace back the path with the highest values from the bottom right to the top left and output the alignment

Page 69: Sequence  comparison

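LCS Revisited: Formally

What is the score $s(V,W)$ of the best alignment of two sequences $V = v_1 \ldots v_n$ and $W = w_1 \ldots w_m$? For the plain LCS this is the length of the longest common subsequence:

Find sequences of indices $1 \le i_1 < \ldots < i_k \le n$ and $1 \le j_1 < \ldots < j_k \le m$ such that $v_{i_t} = w_{j_t}$ for $1 \le t \le k$.

How? Dynamic programming: $s_{i,0} = s_{0,j} = 0$ for all $1 \le i \le n$ and $1 \le j \le m$, and

$$ s_{i,j} = \max \begin{cases} s_{i-1,j} \\ s_{i,j-1} \\ s_{i-1,j-1} + t, & \text{where } t \text{ is the value for } v_i \text{ and } w_j \text{ in the substitution matrix} \end{cases} $$

Then $s(V,W) = s_{n,m}$ is the score of the best alignment (with the plain +1 scheme, the length of the LCS).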

Page 70: Sequence  comparison

Dynamic programming revisited: local and global alignments and gaps

Page 71: Sequence  comparison

Evolution and Alignments
Alignments can be interpreted in evolutionary terms:
• Identical letters are aligned. Interpretation: part of the same ancestral sequence and not changed
• Non-identical letters are aligned (substitution). Interpretation: mutation
• Gaps. Interpretation: insertions and deletions (indels)

Page 72: Sequence  comparison

Evolution and Alignments
Specific problems aligning DNA:
• “Frame shift”:
  • DNA triplets code for amino acids
  • An indel of one nucleotide shifts the whole sequence of triplets
  • Thus it may have a global effect and change all coded amino acids
• Silent mutation: a substitution in the DNA leaves the encoded amino acid unchanged
• Nonsense mutation: substitution to a stop codon
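The three effects can be demonstrated by translating a short DNA string. The sketch below builds the standard genetic code from the usual TCAG ordering; the example sequence ATGGCCAAATGG and the function name `translate` are made up for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate DNA codon by codon; an incomplete last codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


if __name__ == "__main__":
    original = "ATGGCCAAATGG"
    print(translate(original))             # reference protein
    print(translate("ATGGCTAAATGG"))       # silent: GCC -> GCT
    print(translate("ATGGCCAAATGA"))       # nonsense: TGG -> TGA
    print(translate("ATG" + original[4:])) # frame shift: one base deleted
```

The frame-shifted sequence changes every amino acid after the deletion, which is why a single-nucleotide indel can have a global effect.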

Page 73: Sequence  comparison

Local and Global Alignments
• The global alignment (Needleman-Wunsch) algorithm finds the overall best alignment
  • Example: members of a protein family, e.g. globins, are very conserved and have the same length in different organisms from fruit fly to humans
• The local alignment (Smith-Waterman) algorithm finds the locally best alignment; it is the most widely used, as
  • e.g. genes from different organisms retain similar exons, but may have different introns
  • e.g. the homeobox gene, which regulates embryonic development, occurs in many species, but is very different apart from one region called the homeodomain
  • e.g. proteins share some domains, but not all

Page 74: Sequence  comparison

Local Alignment
• LCS s(V,W) computes the globally best alignment
• Often it is better to maximise locally, i.e. compute the maximal $s(v_i \ldots v_{i'}, w_j \ldots w_{j'})$ over all substrings of V and W
• Can we adapt the algorithm?
• Global alignment = longest path in matrix s from (0,0) to (n,m)
• Local alignment = longest path in matrix s from any (i,j) to any (i’,j’)
• Modify the definition of s by adding edges of weight 0 from the source to every other vertex, creating a free “jump” to any starting position (i,j)

Page 75: Sequence  comparison

Local Alignment
Modify the definition of s as follows: $s_{i,0} = s_{0,j} = 0$ for all $1 \le i \le n$ and $1 \le j \le m$, and

$$ s_{i,j} = \max \begin{cases} 0 \\ s_{i-1,j} \\ s_{i,j-1} \\ s_{i-1,j-1} + t, & \text{where } t \text{ is the value for } v_i, w_j \text{ in the substitution matrix} \end{cases} $$

Then $s(V,W) = \max_{i,j} \{ s_{i,j} \}$ is the score of the best local alignment (the local LCS).

• This computes the longest path in the edit graph
• Several local alignments may have biological significance (consider e.g. two multi-domain proteins whose domains are re-ordered)
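A local alignment in this style can be sketched in Python as follows. The names `local_alignment` and `simple_score` are illustrative. One deviation from the recurrence on the slide is flagged as an assumption: the north and west moves carry a per-residue gap penalty (default -1), as in the earlier rewards/penalties slides; passing `gap=0` gives the recurrence exactly as written.

```python
def simple_score(a, b):
    """Toy substitution function: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def local_alignment(v, w, score=simple_score, gap=-1):
    """Return (best score, aligned part of v, aligned part of w)."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(0,
                          s[i - 1][j] + gap,
                          s[i][j - 1] + gap,
                          s[i - 1][j - 1] + score(v[i - 1], w[j - 1]))
            if s[i][j] > best:
                best, best_cell = s[i][j], (i, j)
    # Trace back from the best cell until a cell with value 0 is reached.
    i, j = best_cell
    top, bottom = [], []
    while i > 0 and j > 0 and s[i][j] > 0:
        if s[i][j] == s[i - 1][j - 1] + score(v[i - 1], w[j - 1]):
            top.append(v[i - 1])
            bottom.append(w[j - 1])
            i, j = i - 1, j - 1
        elif s[i][j] == s[i - 1][j] + gap:
            top.append(v[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(w[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(local_alignment("XXXACGTYYY", "PPACGTQQ"))
```

The extra 0 in the maximum is what allows the alignment to start anywhere, and taking the maximum over all cells allows it to end anywhere.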

Page 76: Sequence  comparison

Aligning with Gap Penalties
• A gap is a sequence of spaces in an alignment
• So far, we consider only insertion and deletion of single nucleotides or amino acids, creating alignments with many gaps
• So far, the score of a gap of length l is proportional to l
• Because insertion/deletion of monomers is an evolutionarily slow process, large numbers of gaps do not make sense
• Instead, whole substrings will be deleted or inserted
• We can generalise the score of a gap to a score function A + B l, where A is the penalty to open the gap and B is the penalty to extend the gap

Page 77: Sequence  comparison

Aligning with Gap Penalties
• High gap penalties result in shorter, lower-scoring alignments with fewer gaps
• Lower gap penalties give higher-scoring, longer alignments with more gaps
• The gap opening penalty A mainly influences the number of gaps
• The gap extension penalty B mainly influences the length of gaps
• E.g. if interested in close relationships, choose A, B above the default values; for distant relationships, decrease the default values

Page 78: Sequence  comparison

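Aligning with Gap Penalties
Adapt the definition of s as follows:

$$ s^{\mathrm{del}}_{i,j} = \max \begin{cases} s^{\mathrm{del}}_{i-1,j} - B \\ s_{i-1,j} - (A+B) \end{cases} $$

$$ s^{\mathrm{ins}}_{i,j} = \max \begin{cases} s^{\mathrm{ins}}_{i,j-1} - B \\ s_{i,j-1} - (A+B) \end{cases} $$

$$ s_{i,j} = \max \begin{cases} 0 \\ s^{\mathrm{del}}_{i,j} \\ s^{\mathrm{ins}}_{i,j} \\ s_{i-1,j-1} + t, & \text{where } t \text{ is the value for } v_i, w_j \text{ in the substitution matrix} \end{cases} $$

Then $s(V,W) = \max_{i,j} \{ s_{i,j} \}$ is the score of the best local alignment (the local LCS) with gap penalties A and B.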
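The three coupled recurrences can be sketched in Python as below. The function name `local_affine_score` and the toy scoring in the usage example are assumptions for illustration; the sketch returns only the best score, not the alignment itself, and initialises the two gap tables with minus infinity because no gap can be open on the border.

```python
NEG_INF = float("-inf")


def local_affine_score(v, w, score, gap_open, gap_extend):
    """Best local alignment score with gap cost A + B * l.

    gap_open is A and gap_extend is B from the slides: opening a gap
    costs A + B, every further gap position costs B.
    """
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    s_del = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    s_ins = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s_del[i][j] = max(s_del[i - 1][j] - gap_extend,
                              s[i - 1][j] - (gap_open + gap_extend))
            s_ins[i][j] = max(s_ins[i][j - 1] - gap_extend,
                              s[i][j - 1] - (gap_open + gap_extend))
            s[i][j] = max(0,
                          s_del[i][j],
                          s_ins[i][j],
                          s[i - 1][j - 1] + score(v[i - 1], w[j - 1]))
            best = max(best, s[i][j])
    return best


if __name__ == "__main__":
    toy = lambda a, b: 2 if a == b else -2
    print(local_affine_score("ABCDEFGH", "ABCDXXEFGH", toy, gap_open=2, gap_extend=1))
```

In the usage example one gap of length 2 costs A + 2B = 4, so bridging the insertion (8 matches, score 16 - 4) beats either half on its own (4 matches, score 8).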

Page 79: Sequence  comparison


FASTA and BLAST

Page 80: Sequence  comparison

Motivation
• As in dotplots, the underlying data structure for dynamic programming is a table
• Given two sequences of length n, dynamic programming takes time proportional to $n^2$
• Given a database with m sequences, comparing a query sequence to the whole database takes time proportional to $m n^2$
• What does this mean?
  • Imagine you need to fill in the tables by hand and it takes 10 seconds to fill in one cell
  • Assume there are 1,000,000 sequences, each 100 amino acids long
  • How long does it take?

Page 81: Sequence  comparison

• 1,000,000 x 100 x 100 x 10 sec = $10^{11}$ sec = 27,777,778 h = 1,157,407 days ≈ 3170 years
• Even if a computer does not take 10 sec but just 0.1 ms to fill in one cell, it would still take about 12 days.
• We cannot do anything about the database size, but can we do something about the table size?
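The arithmetic behind these numbers is easy to check in Python. The function name `search_time_seconds` is hypothetical; the inputs are the ones from the slide (one million sequences of length 100, compared with a query of length 100).

```python
def search_time_seconds(num_sequences, length, seconds_per_cell):
    """Time to fill one length x length table per database sequence."""
    return num_sequences * length * length * seconds_per_cell


by_hand = search_time_seconds(1_000_000, 100, 10)        # 10 s per cell
by_computer = search_time_seconds(1_000_000, 100, 1e-4)  # 0.1 ms per cell

if __name__ == "__main__":
    print("by hand:", by_hand, "s =", by_hand / 3600, "h =",
          by_hand / 86400, "days =", by_hand / 86400 / 365, "years")
    print("by computer:", by_computer / 86400, "days")
```

Because the cost grows with the square of the sequence length, shrinking the part of the table that is actually filled in is the most promising lever.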

Page 82: Sequence  comparison


An idea: Prune the search space

Page 83: Sequence  comparison

Another idea
• Did we formulate the problem correctly?
• Do we need the alignments for all sequences in the database?
• No, only for “reasonable” hits: introduce a threshold
• A “reasonable” alignment will contain short stretches of perfect matches
• Find these first, then extend and connect them as well as possible

Page 84: Sequence  comparison

By Michael Schroeder, Biotec 84

FASTA and BLAST

FASTA and BLAST are faster than dynamic programming (5 times and 50 times, respectively).
Underlying idea for a heuristic: high-scoring alignments will contain short stretches of identical letters, called words.
FASTA and BLAST first search for matches of words of a given length and score threshold:
BLAST for words of length 3 for proteins and 11 for DNA
FASTA for words of length 2 for proteins and 6 for DNA
Next, matches are extended to local (BLAST) and global (FASTA) alignments
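
The word-matching step can be sketched with a hash table. This is a minimal illustration, not the actual BLAST or FASTA implementation: it finds only identical words and does not model the score threshold. The function name `word_hits` is hypothetical.

```python
from collections import defaultdict


def word_hits(query, subject, w):
    """Return all (i, j) with query[i:i+w] == subject[j:j+w]."""
    # Index every word of length w in the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)

    # Scan the subject once and look each of its words up in the index.
    hits = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], ()):
            hits.append((i, j))
    return hits


# Hits with the same difference i - j lie on one diagonal of the
# dot plot / dynamic programming table and are candidates for extension.
for i, j in word_hits("ACGTACGT", "TTACGTAA", 4):
    print(i, j, "diagonal", i - j)
```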

Page 85: Sequence  comparison

FASTA and BLAST

More formally:
If the strings V = v_1..v_m and W = w_1..w_m match with at most k mismatches, then they share a p-tuple for p = floor(m/(k+1)), i.e. v_i..v_{i+p-1} = w_j..w_{j+p-1} for some 1 ≤ i,j ≤ m-p+1
Filtration algorithm, which detects all matching words of length m with up to k mismatches:
Potential match detection: find all matches of p-tuples of V and W (can be done in linear time by inserting them into a hash table)
Potential match verification: verify each potential match by extending it to the left and right until either the first k+1 mismatches are found or the beginning or end of the sequences is reached
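
The two filtration steps can be sketched as follows. This is a simplified illustration under stated assumptions: instead of extending a seed stepwise, the verification step directly checks every window of length m on the seed's diagonal that contains the seed. The names `hamming` and `filtration_matches` are hypothetical.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def filtration_matches(v, w, m, k):
    """All (i, j) such that v[i:i+m] and w[j:j+m] differ in at most k positions."""
    p = m // (k + 1)  # guaranteed length of a shared exact p-tuple
    if p < 1:
        raise ValueError("need m >= k + 1")

    # Potential match detection: hash all p-tuples of v.
    index = defaultdict(list)
    for a in range(len(v) - p + 1):
        index[v[a:a + p]].append(a)

    found = set()
    for b in range(len(w) - p + 1):
        for a in index.get(w[b:b + p], ()):
            # Potential match verification: try every window of length m
            # on this diagonal that contains the seed (a, b).
            for s in range(m - p + 1):
                i, j = a - s, b - s
                if i < 0 or j < 0 or i + m > len(v) or j + m > len(w):
                    continue
                if hamming(v[i:i + m], w[j:j + m]) <= k:
                    found.add((i, j))
    return found
```

Because any two words of length m with at most k mismatches must share an exact p-tuple, no match is lost by looking only at seeded windows.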

Page 86: Sequence  comparison

Example for BLAST

Search SWISSPROT for Immunoglobulin:
SWISS_PROT:C79A_HUMAN P11912

Page 87: Sequence  comparison

Example for BLAST

Search BLAST (www.ncbi.nlm.nih.gov/BLAST/) for P11912
Database: All non-redundant SwissProt sequences
1,292,592 sequences; 412,925,052 total letters

Page 88: Sequence  comparison

Example for BLAST: Distribution of Hits

Page 89: Sequence  comparison

Example for BLAST: Top Hits

Sequences producing significant alignments  Score  E-value
gi|547896|sp|P11912|C79A_HUMAN B-cell antigen receptor comp...  473  e-133
gi|728993|sp|P40293|C79A_BOVIN B-cell antigen receptor comp...  312  3e-85
gi|126779|sp|P11911|C79A_MOUSE B-cell antigen receptor comp...  278  5e-75
gi|728994|sp|P40259|C79B_HUMAN B-cell antigen receptor comp...  55  1e-07
gi|125781|sp|P01618|KV1_CANFA IG KAPPA CHAIN V REGION GOM  38  0.019
gi|125361|sp|P17948|VGR1_HUMAN Vascular endothelial growth ...  37  0.042
gi|549319|sp|P35969|VGR1_MOUSE Vascular endothelial growth ...  36  0.052
gi|114764|sp|P15530|C79B_MOUSE B-cell antigen receptor comp...  36  0.064
gi|1718161|sp|P53767|VGR1_RAT Vascular endothelial growth f...  35  0.080
gi|125735|sp|P01681|KV01_RAT Ig kappa chain V region S211  35  0.095
gi|1730075|sp|P01625|KV4A_HUMAN IG KAPPA CHAIN V-IV REGION LEN  34  0.26
gi|1718188|sp|P52583|VGR2_COTJA Vascular endothelial growth...  33  0.28
gi|125833|sp|P06313|KV4B_HUMAN IG KAPPA CHAIN V-IV REGION J...  33  0.30
gi|125806|sp|P01658|KV3F_MOUSE IG KAPPA CHAIN V-III REGION ...  33  0.30
gi|125808|sp|P01659|KV3G_MOUSE IG KAPPA CHAIN V-III REGION ...  33  0.30
gi|1172451|sp|Q05793|PGBM_MOUSE Basement membrane-specific ...  33  0.33
gi|125850|sp|P01648|KV5O_MOUSE Ig kappa chain V-V region HP...  33  0.36
gi|125830|sp|P06312|KV40_HUMAN Ig kappa chain V-IV region p...  33  0.38
gi|2501738|sp|Q06639|YD03_YEAST Putative 101.7 kDa transcri...  33  0.41

Page 90: Sequence  comparison

Example for BLAST: Alignment

>gi|126779|sp|P11911|C79A_MOUSE B-cell antigen receptor complex associated protein alpha-chain precursor (IG-alpha) (MB-1 membrane glycoprotein) (Surface-IGM-associated protein) (Membrane-bound immunoglobulin associated protein) (CD79A)
Length = 220

Score = 278 bits (711), Expect = 5e-75
Identities = 150/226 (66%), Positives = 165/226 (73%), Gaps = 6/226 (2%)

Query: 1    MPGGPGVLQALPATIFLLFLLSAVYLGPGCQALWMHKVPASLMVSLGEDAHFQCPHNSSN 60
            MPGG          + LL  LS   LGPGCQAL +   P SL V+LGE+A   C  N+
Sbjct: 1    MPGG----LEALRALPLLLFLSYACLGPGCQALRVEGGPPSLTVNLGEEARLTC-ENNGR 55

Query: 61   NANVTWWRVLHGNYTWPPEFLGPGEDPNGTLIIQNVNKSHGGIYVCRVQEGNESYQQSCG 120
            N N+TWW  L  N TWPP  LGPG+   G L    VNK+ G    C+V E N   ++SCG
Sbjct: 56   NPNITWWFSLQSNITWPPVPLGPGQGTTGQLFFPEVNKNTGACTGCQVIE-NNILKRSCG 114

Query: 121  TYLRVRQPPPRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEKLGLDAGD 180
            TYLRVR P PRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEK G+D  D
Sbjct: 115  TYLRVRNPVPRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEKFGVDMPD 174

Query: 181  EYEDENLYEGLNLDDCSMYEDISRGLQGTYQDVGSLNIGDVQLEKP 226
            +YEDENLYEGLNLDDCSMYEDISRGLQGTYQDVG+L+IGD QLEKP
Sbjct: 175  DYEDENLYEGLNLDDCSMYEDISRGLQGTYQDVGNLHIGDAQLEKP 220

Page 91: Sequence  comparison

Example for BLAST: Lineage Report

root
. cellular organisms
. . Eukaryota [eukaryotes]
. . . Fungi/Metazoa group [eukaryotes]
. . . . Bilateria [animals]
. . . . . Coelomata [animals]
. . . . . . Gnathostomata [vertebrates]
. . . . . . . Tetrapoda [vertebrates]
. . . . . . . . Amniota [vertebrates]
. . . . . . . . . Eutheria [mammals]
. . . . . . . . . . Homo sapiens (man) ---------------------- 473 33 hits [mammals] B-cell antigen receptor complex associated protein alpha-ch
. . . . . . . . . . Bos taurus (bovine) ..................... 312 2 hits [mammals] B-cell antigen receptor complex associated protein alpha-ch
. . . . . . . . . . Mus musculus (mouse) .................... 278 31 hits [mammals] B-cell antigen receptor complex associated protein alpha-ch
. . . . . . . . . . Canis familiaris (dogs) ................. 37 1 hit [mammals] IG KAPPA CHAIN V REGION GOM
. . . . . . . . . . Rattus norvegicus (brown rat) ........... 35 7 hits [mammals] Vascular endothelial growth factor receptor 1 precursor (VE
. . . . . . . . . . Oryctolagus cuniculus (domestic rabbit) . 29 1 hit [mammals] IG KAPPA CHAIN V REGION K29-213
. . . . . . . . . Coturnix japonica ------------------------- 33 2 hits [birds] Vascular endothelial growth factor receptor 2 precursor (VE
. . . . . . . . . Gallus gallus (chickens) .................. 31 4 hits [birds] CILIARY NEUROTROPHIC FACTOR RECEPTOR ALPHA PRECURSOR (CNTFR
. . . . . . . . Xenopus laevis (clawed frog) ---------------- 30 2 hits [amphibians] Neural cell adhesion molecule 1, 180 kDa isoform precursor
. . . . . . . Heterodontus francisci ------------------------ 28 1 hit [sharks and rays] Myelin P0 protein precursor (Myelin protein zero) (Myelin p
. . . . . . Drosophila melanogaster ------------------------- 30 2 hits [flies] Neuroglian precursor
. . . . . Caenorhabditis elegans ---------------------------- 29 1 hit [nematodes] Hypothetical protein F59B2.12 in chromosome III
. . . . Saccharomyces cerevisiae (brewer's yeast) ----------- 33 1 hit [ascomycetes] Putative 101.7 kDa transcriptional regulatory protein in PR
. . . Marchantia polymorpha --------------------------------- 29 1 hit [liverworts] Succinate dehydrogenase cytochrome b560 subunit (Succinate
. . Agrobacterium tumefaciens str. C58 ---------------------- 28 1 hit [a-proteobacteria] Formamidopyrimidine-DNA glycosylase (Fapy-DNA glycosylase)
. Human adenovirus type 3 ----------------------------------- 30 1 hit [viruses] EARLY E3 20.5 KD GLYCOPROTEIN
. Human adenovirus type 7 ................................... 30 1 hit [viruses] EARLY E3 20.5 KD GLYCOPROTEIN

Page 92: Sequence  comparison

How good is an alignment?

Be careful: Fitch/Smith found 17 alignments for alpha- and beta-chains in chicken haemoglobins. Only one is the correct one (according to the structure).
Given an alignment, how good is it?
Percentage of matching residues, i.e. number of matches divided by the length of the shortest sequence
Advantage: independent of sequence length
E.g.
AT-C-TGAT
-TGCAT-A-
4/6 = 66.67%
More general: also consider gaps, extensions, ...
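
The percentage of matching residues can be computed directly from the two rows of an alignment. A minimal sketch, assuming both rows have equal length and gaps are written as "-" (or the typographic "–"); the function name is hypothetical.

```python
def percent_identity(row_a, row_b, gap_chars="-–"):
    """Matches divided by the length of the shorter ungapped sequence, in percent."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    # A column counts as a match if both rows show the same residue (not a gap).
    matches = sum(1 for a, b in zip(row_a, row_b)
                  if a == b and a not in gap_chars)
    len_a = sum(1 for a in row_a if a not in gap_chars)
    len_b = sum(1 for b in row_b if b not in gap_chars)
    return 100.0 * matches / min(len_a, len_b)


print(round(percent_identity("AT-C-TGAT", "-TGCAT-A-"), 2))
```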

Page 93: Sequence  comparison

BLAST Raw Score

R = a I + b X - c O - d G, where
I is the number of identities in the alignment and a is the reward for each identity
X is the number of mismatches in the alignment and b is the "reward" for each mismatch
O is the number of gaps (each run of "-" counts once) and c is the penalty for each gap
G is the number of "-" characters in the alignment and d is the penalty for each "-" character
The values for a, b, c, d appear at the bottom of a BLAST report. For BLASTn they are a=1, b=-3, c=5, d=2

Page 94: Sequence  comparison

Example

Query: 1    atgctctggccacggcacttgcggatcccagggtgatctgtgcacctgcgata 53
            |||||||||||||||  |||| ||||||||   |||| |||||||||||||||
Sbjct: 107  atgctctggccacggatcttgtggatccca---tgatatgtgcacctgcgata 156

R = 1 x 46 + (-3) x 4 - 5 x 1 - 2 x 3 = 23
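
The raw score of the example can be recomputed from the two aligned rows. A minimal sketch using the BLASTn values quoted above as defaults; the function name is hypothetical, and adjacent gap characters are assumed to form a single gap.

```python
def blast_raw_score(query, subject, a=1, b=-3, c=5, d=2):
    """R = a*I + b*X - c*O - d*G for two aligned rows of equal length."""
    if len(query) != len(subject):
        raise ValueError("alignment rows must have equal length")
    identities = mismatches = gaps = gap_chars = 0
    in_gap = False
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            gap_chars += 1          # G: every "-" character
            if not in_gap:
                gaps += 1           # O: a new run of "-" starts here
            in_gap = True
        else:
            in_gap = False
            if q == s:
                identities += 1     # I
            else:
                mismatches += 1     # X
    return a * identities + b * mismatches - c * gaps - d * gap_chars


query = "atgctctggccacggcacttgcggatcccagggtgatctgtgcacctgcgata"
sbjct = "atgctctggccacggatcttgtggatccca---tgatatgtgcacctgcgata"
print(blast_raw_score(query, sbjct))
```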

So, given the scores: how significant is the alignment?

Page 95: Sequence  comparison

Significance of an alignment

The significance of an alignment needs to be defined with respect to a control population.
Pairwise alignment: How can we get a control population?
Generate sequences randomly? Not a good model of real sequences.
Chop up both sequences and randomly reassemble them.
Database search: How can we get a control population?
Control = whole database
Align the sequence to the control population and see how good the result is in comparison.
This is captured by Z-scores, P-values and E-values.

Page 96: Sequence  comparison

Z-score

The Z-score normalises the score S:
Let m be the mean of the population and std its standard deviation, then Z-score = (S - m) / std
A Z-score of 0 means the score is no better than average, hence it might have occurred by chance
The higher the Z-score, the better
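
Given the scores of a control population, the Z-score is a one-line computation. A minimal sketch; the function name is hypothetical, and using the population standard deviation (`pstdev`) rather than the sample standard deviation is an assumption.

```python
from statistics import mean, pstdev


def z_score(score, control_scores):
    """Z = (S - m) / std, with m and std taken from the control population."""
    m = mean(control_scores)
    std = pstdev(control_scores)  # assumption: population standard deviation
    if std == 0:
        raise ValueError("control scores have zero spread")
    return (score - m) / std


# Toy control population with mean 5 and standard deviation 2.
print(z_score(10, [2, 4, 4, 4, 5, 5, 7, 9]))
```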

Page 97: Sequence  comparison

P-value

P-value: probability of obtaining a score ≥ S
Range: 0 ≤ P ≤ 1
Let m be the number of sequences in the control population with score ≥ S
Let p be the size of the control population
Then P-value = m / p
Rule of thumb:
P ≤ 10^-100: exact match
10^-100 ≤ P ≤ 10^-50: nearly identical (SNPs)
10^-50 ≤ P ≤ 10^-10: homology certain
10^-5 ≤ P ≤ 10^-1: usually distant relative
P > 10^-1: probably insignificant
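
The definition P-value = m / p translates directly into code. A minimal sketch with a hypothetical function name; the control scores are toy data.

```python
def empirical_p_value(score, control_scores):
    """Fraction of control scores that are at least as high as `score`."""
    if not control_scores:
        raise ValueError("control population is empty")
    m = sum(1 for c in control_scores if c >= score)  # scores >= S
    p = len(control_scores)                           # population size
    return m / p


print(empirical_p_value(5, [1, 5, 7, 3]))
```

Note that with a control population of size p the smallest non-zero value this estimate can take is 1/p, so very small P-values cannot be obtained by counting alone.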

Page 98: Sequence  comparison

E-values

The E-value also takes the database size into account
E-value = expected frequency of a score ≥ S
Range: 0 ≤ E ≤ m, where m is the size of the database
Relationship to P: E = m P
E-values are calculated from the bit score, the length of the query, and the size of the database
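
Both views of the E-value can be sketched in a few lines. The first function is the relationship E = m P stated above. The second assumes the commonly quoted relation E = (query length) x (database length) x 2^(-bit score); BLAST itself applies effective-length corrections, so reported E-values will not be reproduced exactly. Function and parameter names are hypothetical.

```python
def e_value_from_p(p_value, database_size):
    """E = m * P: expected number of hits with score >= S in the database."""
    return database_size * p_value


def e_value_from_bit_score(bit_score, query_length, database_length):
    """Assumed relation E = m * n * 2^(-S') for bit score S' (no length correction)."""
    return query_length * database_length * 2.0 ** (-bit_score)


print(e_value_from_p(0.5, 4))
print(e_value_from_bit_score(10, 1024, 1))
```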

Page 99: Sequence  comparison

Precision and Recall

How good are BLAST and FASTA?
True positives, tp = hits which are biologically meaningful
False positives, fp = hits which are not biologically meaningful
True negatives, tn = non-hits which are not biologically meaningful
False negatives, fn = non-hits which are biologically meaningful
Minimise fp and fn
Recall: tp/(tp+fn) (meaningful hits / all meaningful)
Precision: tp/(tp+fp) (meaningful hits / all hits)
But: since no objective data are available, it is difficult to judge the sensitivity and specificity of BLAST and FASTA
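
Precision and recall follow directly from the counts defined above. A minimal sketch with toy counts; the function name is hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true/false positive and false negative counts."""
    precision = tp / (tp + fp)  # meaningful hits / all hits
    recall = tp / (tp + fn)     # meaningful hits / all meaningful
    return precision, recall


# Toy example: 8 meaningful hits, 2 spurious hits, 8 meaningful sequences missed.
print(precision_recall(tp=8, fp=2, fn=8))
```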

Page 100: Sequence  comparison

Multiple Sequence Alignments

Page 101: Sequence  comparison

Multiple Sequence Alignment

Align more than two sequences
Choice of sequences:
If too closely related, then largely redundant
If very distantly related, then difficult to generate a good alignment
Additionally use colour for residues with similar properties:
Colour  Property  Residues
Yellow  Small polar  Gly, Ala, Ser, Thr
Green  Hydrophobic  Cys, Val, Ile, Leu, Pro, Phe, Tyr, Met, Trp
Magenta  Polar  Asn, Gln, His
Red  Negatively charged  Asp, Glu
Blue  Positively charged  Lys, Arg
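
The colour scheme can be written as a lookup table over one-letter amino acid codes. A minimal sketch; the names `COLOUR_CLASSES` and `colour_sequence` are hypothetical, and the three-letter codes above are translated to their standard one-letter equivalents.

```python
# Colour classes from the slide, using one-letter amino acid codes.
COLOUR_CLASSES = {
    "yellow": "GAST",        # Gly, Ala, Ser, Thr
    "green": "CVILPFYMW",    # Cys, Val, Ile, Leu, Pro, Phe, Tyr, Met, Trp
    "magenta": "NQH",        # Asn, Gln, His
    "red": "DE",             # Asp, Glu
    "blue": "KR",            # Lys, Arg
}

# Invert the table: residue -> colour.
RESIDUE_COLOUR = {
    residue: colour
    for colour, residues in COLOUR_CLASSES.items()
    for residue in residues
}


def colour_sequence(sequence):
    """Colour of each residue; gaps and unknown symbols map to 'none'."""
    return [RESIDUE_COLOUR.get(r.upper(), "none") for r in sequence]


print(colour_sequence("WCGPCK"))
```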

Page 102: Sequence  comparison

Thioredoxins: WCGPC[K or R] motif
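
A conserved motif such as WCGPC[K or R] can be searched for with a regular expression. A minimal sketch; the function name and the test sequence are made up for illustration.

```python
import re

# W-C-G-P-C followed by either K or R.
THIOREDOXIN_MOTIF = re.compile(r"WCGPC[KR]")


def find_motif(sequence):
    """Start positions (0-based) of the motif in a protein sequence."""
    return [m.start() for m in THIOREDOXIN_MOTIF.finditer(sequence.upper())]


print(find_motif("AAWCGPCKMI"))
```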

Page 103: Sequence  comparison

Thioredoxins: Gly/Pro = turn

Page 104: Sequence  comparison

Thioredoxins: every second hydrophobic = beta strand

Page 105: Sequence  comparison

Thioredoxins: ca. every 4th hydrophobic = alpha helix

Page 106: Sequence  comparison

Page 107: Sequence  comparison

Summary

Evolutionary model: indels and substitutions
Homology vs. similarity
Dot plots: easy visual exploration, but not scalable
Dynamic programming: local, global, gaps
Substitution matrices (PAM, BLOSUM)
BLAST and FASTA
Scores and significance
Multiple sequence alignments