
7/27/2019 03 Comparison.ppt


Sequence Comparison

BINF3010/9010

Homology and similarity

Homology
  Sequences are homologous if they are evolutionarily related, i.e. they share a common ancestor through evolution

Similarity
  Looking alike
  Not an evolutionary concept

Homology is not a quantity
  Two sequences are either homologous or not homologous
  e.g., it is incorrect to refer to two sequences as being 50% homologous

Similarity can be quantified
  e.g., two sequences can be 50% similar, 80% similar, etc.

Computational methods recognise and measure similarity

High similarity is supporting evidence to infer homology

Types of homology

Orthologs: Genes/proteins in different species descended from a common ancestral gene through speciation

Paralogs: Genes/proteins related to each other due to a gene duplication event

Evolution through mutations

SPAMEGGANDSPAM
    (substitutions)
SPATEGGANDSPAM
    (insertions, deletions)
1 SPLATEGGANDSPAM        2 SPAGANDSPAM



Visualising the process

Dot matrix plots (dotplots)
Alignments

Dot matrix plot

[Figure: dot matrix plot of sequence 1 (SPLATEGGANDSPAM) against sequence 2 (SPAGANDSPAM).]

Dot matrix plots

Dot matrix plot: Principle

[Figure: dot plots of AAGTTCAGTAGGCATTTAAGCG (horizontal axis) against AGTACCGTTCC (vertical axis), with a dot (*) wherever the words starting at the two positions match. Three panels: Word size = 1, Word size = 2, Word size = 3. The number of dots falls as the word size increases.]



[Figure: dot plot of AAGTTCAGTAGGCATTTAAGCG (horizontal axis) against AGTACCGTTCC (vertical axis) with Word size = 3 and Threshold = 2.]
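The principle behind the panels can be sketched in a few lines of Python. This is a minimal illustration, not the program used to draw the slides: the function names `dot_plot` and `render` are made up here, and `threshold` is assumed to mean the minimum number of matching positions within a word (so a threshold equal to the word size demands an exact word match).

```python
def dot_plot(seq_x, seq_y, word_size=1, threshold=None):
    """Return the set of (row, col) cells that receive a dot.

    A dot is placed at (i, j) when the word of length word_size starting
    at seq_y[i] and the word starting at seq_x[j] agree in at least
    `threshold` positions (assumed meaning of the threshold setting).
    """
    if threshold is None:
        threshold = word_size  # exact word match by default
    dots = set()
    for i in range(len(seq_y) - word_size + 1):
        for j in range(len(seq_x) - word_size + 1):
            matches = sum(
                seq_y[i + k] == seq_x[j + k] for k in range(word_size)
            )
            if matches >= threshold:
                dots.add((i, j))
    return dots


def render(seq_x, seq_y, dots):
    """Draw the plot as text: seq_x along the top, seq_y down the side."""
    lines = ["  " + seq_x]
    for i, residue in enumerate(seq_y):
        row = "".join(
            "*" if (i, j) in dots else " " for j in range(len(seq_x))
        )
        lines.append(residue + " " + row)
    return "\n".join(lines)


if __name__ == "__main__":
    seq1, seq2 = "SPLATEGGANDSPAM", "SPAGANDSPAM"
    for w in (1, 2, 3):
        print(f"Word size = {w}")
        print(render(seq1, seq2, dot_plot(seq1, seq2, word_size=w)))
```

Raising the word size removes isolated chance matches and leaves the diagonals; lowering the threshold below the word size lets imperfect diagonals through again.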

[Figure: four dot plot panels comparing settings: Window = 30, Stringency = 9; Window = 20, Stringency = 9; Window = 30, Stringency = 14; Window = 20, Stringency = 13.]

Dot matrix plot: repeats

[Figure: dot matrix plot of sequence 1 (SPLATEGGANDSPAM) against sequence 2 (SPAGANDSPAM).]



Repeat detection

[Figure: dot plot of TFIIIA vs TFIIIA.]

Sequence alignment

1 SPLATEGGANDSPAM
2 SPAGANDSPAM

1 SPLATEGGANDSPAM
  || |   ||||||||
2 SP-A---GANDSPAM

Global vs Local Alignment

 1 ....AUAUCUUUAAUUUAAUGGUAAAAUAUUAGAAUACGAAUCUAAUUAU 46
       ||||  || |   ||    ||    || || |    |  | || ||
 1 UGGUAUAUAGUUUAAACAAAACGAAUGAUUUCGACUCAUUAAAUUAUGAU 50

47 AUAGGUUCAAAUCCUAUAAGAUAUUCCA 74
   |    |    | |    |
51 AAUCAUAUUUACCAACCA.......... 68

44 UAUAUAGGUUCAA 56
   ||||||| || ||
 4 UAUAUAGUUUAAA 16

Global: align the whole of the two sequences together

Local: align only the region of best similarity

Which alignment is correct?

1 SPLATEGGANDSPAM
  || |   ||||||||
2 SP-A---GANDSPAM    2 insertion/deletions

1 SPLATEGGANDSPAM
  ||     ||||||||
2 SPA----GANDSPAM    1 indel, 1 substitution

1 SPLATEGGANDSPAM
  ||     ||||||||
2 SP----AGANDSPAM    1 indel, 1 substitution

1 SPLATEGGANDSPAM
     |   ||||||||
2 -SPA---GANDSPAM    2 indels, 2 substitutions

Which alignment is optimal?

Select a scoring system for alignments
  Assign values to matches, mismatches and gaps

Sum up the values over the whole alignment
  Alignment score = sum of match and mismatch scores - sum of gap penalties

The optimal alignment is the one with the highest score

For example:

Match: +2    Mismatch: -1    Gap: 5 (penalty charged once per gap, whatever its length)

1 SPLATEGGANDSPAM
  || |   ||||||||
2 SP-A---GANDSPAM    S = (11*2) + (0*-1) - (2*5) = 12

1 SPLATEGGANDSPAM
  ||x    ||||||||
2 SPA----GANDSPAM    S = (10*2) + (1*-1) - (1*5) = 14

1 SPLATEGGANDSPAM
  ||    x||||||||
2 SP----AGANDSPAM    S = (10*2) + (1*-1) - (1*5) = 14

1 SPLATEGGANDSPAM
   xx|   ||||||||
2 -SPA---GANDSPAM    S = (9*2) + (2*-1) - (2*5) = 6
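The four scores can be checked with a short Python sketch. The function name `score_alignment` is hypothetical, and the gap handling is an assumption read off the worked numbers: each run of consecutive gap characters in one sequence is charged the gap penalty once, regardless of its length.

```python
def score_alignment(row1, row2, match=2, mismatch=-1, gap=5):
    """Score a pairwise alignment given as two equal-length rows.

    '-' marks a gap. Each run of consecutive gaps in the same row is
    penalised once (assumption matching the worked example).
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    gapped_row = None  # which row the current gap run is in, if any
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            current = 1 if a == "-" else 2
            if current != gapped_row:
                gaps += 1  # a new gap run starts here
            gapped_row = current
            continue
        gapped_row = None
        if a == b:
            matches += 1
        else:
            mismatches += 1
    return matches * match + mismatches * mismatch - gaps * gap


if __name__ == "__main__":
    seq1 = "SPLATEGGANDSPAM"
    for seq2 in ("SP-A---GANDSPAM", "SPA----GANDSPAM",
                 "SP----AGANDSPAM", "-SPA---GANDSPAM"):
        print(seq2, score_alignment(seq1, seq2))
```

Under this scoring system the second and third alignments tie, so the highest score alone does not always pick out a single alignment.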



Algorithms

Global alignment
  Needleman-Wunsch
  Sellers

Local alignment
  Smith-Waterman

Note that the optimal alignment is not necessarily the correct biological alignment.

However, it is usually impossible to know the correct evolutionary alignment.
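As an illustration of how a global alignment algorithm finds the optimal score, here is a minimal Needleman-Wunsch-style dynamic programming sketch in Python. It is a simplification under stated assumptions: it returns only the score, not the alignment, and it charges the gap penalty for every gapped position (a linear gap cost), which differs from the per-gap counting in the worked example. The function name is made up for this sketch.

```python
def needleman_wunsch_score(a, b, match=2, mismatch=-1, gap=5):
    """Optimal global alignment score with a linear gap penalty.

    Only two rows of the dynamic programming table are kept, since each
    cell depends on its upper, left and upper-left neighbours only.
    """
    # Row 0: aligning the empty prefix of `a` with prefixes of `b`.
    prev = [-gap * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [-gap * i]  # prefix of `a` aligned against nothing
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                prev[j] - gap,       # gap in b
                curr[j - 1] - gap,   # gap in a
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("SPLATEGGANDSPAM", "SPAGANDSPAM"))
```

A local (Smith-Waterman-style) variant would additionally clamp every cell at zero and report the maximum cell rather than the last one.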

Structure alignment

                   10        20        30        40        50        60
           ....*....|....*....|....*....|....*....|....*....|....*....|
4HHB_A   1 ~VLSPADKTNVKAAWGKVgaHAGEYGAEALERMFLSFPTTKTYFPHFDls~~~~~~hGSA 53
2HHB_B   1 vHLTPEEKSAVTALWGKV~~NVDEVGGEALGRLLVVYPWTQRFFESFGdlstpdavmGNP 58

                   70        80        90       100       110       120
           ....*....|....*....|....*....|....*....|....*....|....*....|
4HHB_A  54 QVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHL 113
2HHB_B  59 KVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHF 118

                  130       140
           ....*....|....*....|....*...
4HHB_A 114 PAEFTPAVHASLDKFLASVSTVLTSKYR 141
2HHB_B 119 GKEFTPPVQAAYQKVVAGVANALAHKYH 146

Scoring systems

Matches and mismatches
  Substitution mutations

Gaps
  Insertions and deletions

DNA sequence alignment

768 TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG 813
    ||     ||   ||  |  | ||| |  |||| |||||    |||  |||
 87 TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG 135

814 AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG 863
    |  | |              |  |||||| |   ||||  | || |   |
136 AAGGATC.............TCAGTAATTAATCATGCACCTATGTGGCGG 172

864 AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT 913
    ||| | ||| || || |||   |   |||||||||  ||   |||||| |
173 AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT 216



DNA scoring matrix used in EMBOSS

     A   T   G   C
A    5  -4  -4  -4
T   -4   5  -4  -4
G   -4  -4   5  -4
C   -4  -4  -4   5

Section of EMBOSS data file EDNAFULL

Protein Sequence Alignment

TPKRREAEDLQVGQVLGGPLQLLE...SLQKRGIVEQCCT
 ||:|: |:      |:|||::|:      |||||||||
YPKKRDMEQ......LSGPLDMLQQEYQKMKRGIVEQCCH

|        Identical
:        Similar
(blank)  Different

Protein Comparison: Scoring Matrix

          A    C    D    E    F    G    H    I    K    L    M    N    P    Q    R    S    T    V    W    Y
A Ala   0.8  0.0 -0.4 -0.2 -0.4  0.0 -0.4 -0.2 -0.2 -0.2 -0.2 -0.4 -0.2 -0.2 -0.2  0.2  0.0  0.0 -0.6 -0.4
C Cys        1.8 -0.6 -0.8 -0.4 -0.6 -0.6 -0.2 -0.6 -0.2 -0.2 -0.6 -0.6 -0.6 -0.6 -0.2 -0.2 -0.2 -0.4 -0.4
D Asp             1.2  0.4 -0.6 -0.2 -0.2 -0.6 -0.2 -0.8 -0.6  0.2 -0.2  0.0 -0.4  0.0 -0.2 -0.6 -0.8 -0.6
E Glu                  1.0 -0.6 -0.4  0.0 -0.6  0.2 -0.6 -0.4  0.0 -0.2  0.4  0.0  0.0 -0.2 -0.4 -0.6 -0.4
F Phe                       1.2 -0.6 -0.2  0.0 -0.6  0.0  0.0 -0.6 -0.8 -0.6 -0.6 -0.4 -0.4 -0.2  0.2  0.6
G Gly                            1.2 -0.4 -0.8 -0.4 -0.8 -0.6  0.0 -0.4 -0.4 -0.4  0.0 -0.4 -0.6 -0.4 -0.6
H His                                 1.6 -0.6 -0.2 -0.6 -0.4  0.2 -0.4  0.0  0.0 -0.2 -0.4 -0.6 -0.4  0.4
I Ile                                      0.8 -0.6  0.4  0.2 -0.6 -0.6 -0.6 -0.6 -0.4 -0.2  0.6 -0.6 -0.2
K Lys                                           1.0 -0.4 -0.2  0.0 -0.2  0.2  0.4  0.0 -0.2 -0.4 -0.6 -0.4
L Leu                                                0.8  0.4 -0.6 -0.6 -0.4 -0.4 -0.4 -0.2  0.2 -0.4 -0.2
M Met                                                     1.0 -0.4 -0.4  0.0 -0.2 -0.2 -0.2  0.2 -0.2 -0.2
N Asn                                                          1.2 -0.4  0.0  0.0  0.2  0.0 -0.6 -0.8 -0.4
P Pro                                                               1.4 -0.2 -0.4 -0.2 -0.2 -0.4 -0.8 -0.6
Q Gln                                                                    1.0  0.2  0.0 -0.2 -0.4 -0.4 -0.2
R Arg                                                                         1.0 -0.2 -0.2 -0.6 -0.6 -0.4
S Ser                                                                              0.8  0.2 -0.4 -0.6 -0.4
T Thr                                                                                   1.0  0.0 -0.4 -0.4
V Val                                                                                        0.8 -0.6 -0.2
W Trp                                                                                             2.2  0.4
Y Tyr                                                                                                  1.4

BLOSUM62 Matrix

First-principles amino acid substitution matrices

Identity matrix
  Perfect match: positive score
  Any mismatch: negative score

Genetic score matrix
  Based on the average number of nucleotide changes needed to mutate one amino acid into another
  e.g. K (AAA, AAG) to N (AAC, AAU) has a higher score than K (AAA, AAG) to D (GAU, GAC)

Chemical properties matrices
  e.g. K (basic) to R (basic) has a higher score than K (basic) to F (aromatic) or K to E (acidic)
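The genetic score example can be made concrete by counting nucleotide differences between codons. The sketch below is only an illustration of the idea: it uses the three codon sets quoted above rather than the full genetic code, averages over all codon pairs (one plausible reading of "average number of nucleotide changes"), and the names `CODONS`, `nucleotide_changes` and `average_changes` are made up here.

```python
from itertools import product

# Codon sets exactly as quoted in the example above (RNA alphabet).
CODONS = {
    "K": ["AAA", "AAG"],
    "N": ["AAC", "AAU"],
    "D": ["GAU", "GAC"],
}


def nucleotide_changes(codon_a, codon_b):
    """Number of positions at which two codons differ."""
    return sum(x != y for x, y in zip(codon_a, codon_b))


def average_changes(aa_from, aa_to, codons=CODONS):
    """Average nucleotide changes over all codon pairs of two amino acids."""
    pairs = list(product(codons[aa_from], codons[aa_to]))
    return sum(nucleotide_changes(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    print("K to N:", average_changes("K", "N"))  # one change on average
    print("K to D:", average_changes("K", "D"))  # two changes on average
```

Fewer required nucleotide changes would translate into a higher substitution score, which is why K to N scores above K to D in such a matrix.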



Identity matrix example

D  +1
E  -1  +1
Q  -1  -1  +1
H  -1  -1  -1  +1
V  -1  -1  -1  -1  +1
F  -1  -1  -1  -1  -1  +1
W  -1  -1  -1  -1  -1  -1  +1
    D   E   Q   H   V   F   W

Data-based matrices

Calculated from amino acid frequencies in known homologous sequences

PAM family of matrices
BLOSUM family of matrices

Perform better than first-principle matrices (which are still useful for some specialised applications)

BLOSUM matrices

BLOSUM 62

Henikoff and Henikoff, 1992
BLOcks SUbstitution Matrix
Based on the BLOCKS database
Currently the most widely used matrix family
Most commonly used matrices: BLOSUM62 and BLOSUM55

BLOCKS database

BLOCKS are ungapped multiple sequence alignments based on the SWISS-PROT database and the PROSITE protein family database

All the sequences from SWISS-PROT belonging to a PROSITE family are aligned together, to create local ungapped alignments characteristic of the protein family

BLOCK example

ID   Mn_catalase; BLOCK
AC   IPB007760A; distance from previous block=(3
DE   Manganese containing catalase
BL   HIL; width=14; seqs=49; 99.5%=727; strengt
CTJC_BACSU|Q45538  (  67) HLEMIATMVYKLTK  12
GS80_BACSU|P80878  (  69) HVEMIATMIARLLE  14
YDHU_BACSU|O05513  (   4) HGNLITDLLDNLLL  25
O69145             (  70) HMEIVAETINLLNG  64
Q9KDZ2             ( 136) SGNLIFDLLHNYFL  34
Q9KAU6             (  69) HVEMLATMIARLLD  16
Q9I1T0             (  68) HLEIIGSIVGMLNK  20
Q97JE8             (  68) HLEIVGSIVRQLSR  50
MCAT_CLOAB|Q97FE0  ( 124) TGDIVADLLSNIAS  73
Q8Z7E1             (  68) HLEIIGSLVGMLNK  17
Q8YY54             (  69) HIEMLATMIAHLLD  27
Q8YSJ5             (  68) HLEMVGKLIEAHTK  36



From BLOCKS to BLOSUM

1. Count the number of amino acid pairs observed in each column of each block and calculate the observed frequency of each pair

2. Calculate the expected frequency of each pair (based on the frequency of individual amino acids)

3. Calculate the log ratio (typically log2)

1. Count number of observed pairs and calculate frequencies

DADA
AAAE
AAEE
AADA
AAEE
AADE

There are $4 \times \binom{6}{2} = 60$ aligned pairs of amino acids in the block

Aligned pair (xy)    Proportion of times observed (o_xy)
A to A               26/60
A to D                8/60
A to E               10/60
D to D                3/60
D to E                6/60
E to E                7/60

General case for step 1

For each pair of amino acids x and y,

n_xy = number of times x and y are in the same column of a block

o_xy = observed proportion of aligned pair xy

$$o_{xy} = \frac{n_{xy}}{\sum_{uv} n_{uv}}$$

2. Calculate the expected frequency of each pair

DADA
AAAE
AAEE
AADA
AAEE
AADE

Amino acid (x)    Proportion in block (p_x)
A                 14/24
D                  4/24
E                  6/24

Amino acid pair (xy)    Expected proportion (e_xy)
A to A                  (14/24)^2 = 196/576
A to D                  2 (14/24) (4/24) = 112/576
A to E                  2 (14/24) (6/24) = 168/576
D to D                  (4/24)^2 = 16/576
D to E                  2 (4/24) (6/24) = 48/576
E to E                  (6/24)^2 = 36/576

General case for step 2

Expected proportion of amino acid pair xy in a random block of the same amino acid composition:

$$e_{xy} = \begin{cases} 2 p_x p_y & \text{if } x \ne y \\ p_x p_y & \text{if } x = y \end{cases}$$

3. Calculate the log ratio

$$\text{Matrix entry} = 2 \log_2\left(\frac{o_{xy}}{e_{xy}}\right) \quad \text{(rounded to nearest integer)}$$

Aligned pair (xy)    o_xy     e_xy       2 log2(o_xy/e_xy)
A to A               26/60    196/576     0.70
A to D                8/60    112/576    -1.09
A to E               10/60    168/576    -1.61
D to D                3/60     16/576     1.70
D to E                6/60     48/576     0.53
E to E                7/60     36/576     1.80
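The three steps can be reproduced for the example block with a short Python sketch. The block rows below are written as six sequences of four residues, an arrangement that reproduces the pair counts (26, 8, 10, 3, 6, 7) and residue proportions (14/24, 4/24, 6/24) in the tables; the function name `blosum_from_block` is hypothetical, and no sequence clustering is applied.

```python
from collections import Counter
from itertools import combinations
from math import log2

# Example block: 6 sequences x 4 columns.
BLOCK = ["DADA", "AAAE", "AAEE", "AADA", "AAEE", "AADE"]


def blosum_from_block(block):
    """Return (pair_counts, scores) for one ungapped block.

    pair_counts[(x, y)] is n_xy with x <= y alphabetically;
    scores[(x, y)] is the unrounded entry 2 * log2(o_xy / e_xy).
    """
    # Step 1: count aligned pairs within each column.
    pair_counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            pair_counts[tuple(sorted((x, y)))] += 1
    total_pairs = sum(pair_counts.values())

    # Step 2: expected proportions from single-residue proportions.
    residue_counts = Counter("".join(block))
    total_residues = sum(residue_counts.values())

    # Step 3: log ratio in half-bits.
    scores = {}
    for (x, y), n in pair_counts.items():
        observed = n / total_pairs
        p_x = residue_counts[x] / total_residues
        p_y = residue_counts[y] / total_residues
        expected = p_x * p_y if x == y else 2 * p_x * p_y
        scores[(x, y)] = 2 * log2(observed / expected)
    return pair_counts, scores


if __name__ == "__main__":
    counts, scores = blosum_from_block(BLOCK)
    for pair in sorted(scores):
        print(pair, counts[pair], round(scores[pair], 2), round(scores[pair]))
```

Rounding each score to the nearest integer gives the entries of the small A/D/E matrix derived from this block.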



Final matrix

     A    D    E
A    1   -1   -2
D   -1    2    1
E   -2    1    2

The 2 log2 transformation means that the matrix is in half-bits

BLOSUM family

Problem: counting every amino acid in the block can lead to an over-representation of amino acid changes found in closely related sequences

Solution: cluster sequences closer than a set % identity, and average their contribution so that the whole cluster counts as one sequence

This gives rise to a family of matrices, depending on the % identity threshold

[Figure: five aligned sequences (VSLHL, ELTRS, EWTRS, EISRS, ELCRT) with brackets marking the groups that are 80% identical and 60% identical.]

                                                     n_EE    n_VE
No clustering (BLOSUM100)                              6       4
Clustering sequences with 80% identity (BLOSUM80)      3       3
Clustering sequences with 60% identity (BLOSUM60)      2       2

PAM matrices

PAM120

PAM: Point (Percent) Accepted Mutation
Schwartz and Dayhoff, 1978
Also known as MDM78 (mutation data matrix) or Dayhoff matrix

Empirical matrix based on an evolutionary model
Based on a small number of families of closely related proteins (>85% identity) so that sequences can be aligned unambiguously by hand

Since the changes observed between these sequences did not affect the function of the protein, these are accepted mutations

1. Align the sequences by hand

2. Order the sequences using parsimony

hbb_ornan  LSELHCDKLH VDPENFNRLG NVLIVVLARH FSKDFSPEVQ AAWQKLVSGV
hbb_tacac  LSELHCDKLH VDPENFNRLG NVLVVVLARH FSKEFTPEAQ AAWQKLVSGV
hbe_ponpy  LSELHCDKLH VDPENFKLLG NVMVIILATH FGKEFTPEVQ AAWQKLVSAV
hbb_speci  LSELHCDKLH VDPENFKLLG NMIVIVMAHH LGKDFTPEAQ AAFQKVVAGV
hbb_speto  LSELHCDKLH VDPENFKLLG NMIVIVMAHH LGKDFTPEAQ AAFQKVVAGV
hbb_equhe  LSELHCDKLH VDPENFRLLG NVLVVVLARH FGKDFTPELQ ASYQKVVAGV



3. Count the number of times each amino acid changes to each other one

e.g. F changing to L

hbb_ornan  LSELHCDKLH VDPENFNRLG NVLIVVLARH FSKDFSPEVQ AAWQKLVSGV
hbb_tacac  LSELHCDKLH VDPENFNRLG NVLVVVLARH FSKEFTPEAQ AAWQKLVSGV
hbe_ponpy  LSELHCDKLH VDPENFKLLG NVMVIILATH FGKEFTPEVQ AAWQKLVSAV
hbb_speci  LSELHCDKLH VDPENFKLLG NMIVIVMAHH LGKDFTPEAQ AAFQKVVAGV
hbb_speto  LSELHCDKLH VDPENFKLLG NMIVIVMAHH LGKDFTPEAQ AAFQKVVAGV
hbb_equhe  LSELHCDKLH VDPENFRLLG NVLVVVLARH FGKDFTPELQ ASYQKVVAGV

[Figure: parsimony tree for one alignment column, with F or L at each leaf and internal node.]

1 F to L change (N_FL = 1)

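4. Calculate the probability of each amino acid mutating to each other amino acid

For each pair of amino acids i and j, the frequency of change f_ij is:

$$f_{ij} = \frac{N_{ij}}{\sum_k N_{ik}}$$

For i ≠ j, the probability of change p_ij is:

$$p_{ij} = c f_{ij} \quad \text{and} \quad p_{ii} = 1 - c \sum_{j \ne i} f_{ij}$$

where c is a positive scaling constant chosen so that each p_ii > 0.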

Probability matrix

The resulting probability matrix allows modelling the evolution of protein sequences as a Markov process, that is, the probability of any amino acid mutating to another one is dependent only on that amino acid

A  pAA
C  pAC  pCC
D  pAD  pCD  pDD
E  pAE  pCE  pDE  pEE
    A    C    D    E

PAM1

The constant c is chosen so that the expected number of amino acid changes after one round of applying the probabilities is 1 in 100 amino acids

The resulting probability matrix is the PAM1 probability matrix, giving the probability that an amino acid will mutate to another over an amount of evolutionary time such that 1% of amino acids mutate

Expected proportion of mutated amino acids (p_i is the proportion of amino acid i):

$$\sum_i p_i \sum_{j \ne i} p_{ij} = c \sum_i p_i \sum_{j \ne i} f_{ij} = 0.01$$
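Steps 4 and the PAM1 scaling can be sketched together in Python. Everything concrete here is an assumption for illustration: the function name `pam1_probabilities`, the three-letter alphabet, and the toy counts and frequencies are invented, not Dayhoff's data. The row totals include the diagonal counts N_ii, following the sum over all k in the formula for f_ij.

```python
def pam1_probabilities(counts, freqs, target=0.01):
    """Scale change frequencies so that `target` of residues mutate.

    counts[i][j] is N_ij, the number of times i was seen changing to j
    (counts[i][i] holds unchanged observations); freqs[i] is the
    proportion of amino acid i. Returns (c, p) with p[i][j] = p_ij.
    """
    residues = sorted(counts)
    # f_ij = N_ij / sum_k N_ik
    f = {
        i: {j: counts[i][j] / sum(counts[i].values()) for j in residues}
        for i in residues
    }
    # Expected proportion of mutated residues before scaling.
    unscaled = sum(
        freqs[i] * sum(f[i][j] for j in residues if j != i)
        for i in residues
    )
    c = target / unscaled
    p = {}
    for i in residues:
        p[i] = {j: c * f[i][j] for j in residues if j != i}
        p[i][i] = 1.0 - sum(p[i].values())  # probability of no change
    return c, p


# Hypothetical toy data for a three-letter alphabet.
TOY_COUNTS = {
    "A": {"A": 90, "D": 6, "E": 4},
    "D": {"A": 6, "D": 40, "E": 4},
    "E": {"A": 4, "D": 4, "E": 42},
}
TOY_FREQS = {"A": 0.6, "D": 0.2, "E": 0.2}

if __name__ == "__main__":
    c, p = pam1_probabilities(TOY_COUNTS, TOY_FREQS)
    print("c =", c)
    for i in sorted(p):
        print(i, {j: round(p[i][j], 5) for j in sorted(p[i])})
```

Each row of the result sums to 1, and the frequency-weighted off-diagonal mass equals the 1% target.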

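5. PAMN

Because the probability matrix is Markov, it is possible to calculate probability matrices for longer evolutionary times by multiplying the matrix by itself n times

e.g. PAM2 probability matrix:

$$
\begin{pmatrix}
p_{AA} & p_{AC} & p_{AD} & \cdots \\
p_{CA} & p_{CC} & p_{CD} & \cdots \\
p_{DA} & p_{DC} & p_{DD} & \cdots \\
\cdots & \cdots & \cdots & \cdots
\end{pmatrix}
\begin{pmatrix}
p_{AA} & p_{AC} & p_{AD} & \cdots \\
p_{CA} & p_{CC} & p_{CD} & \cdots \\
p_{DA} & p_{DC} & p_{DD} & \cdots \\
\cdots & \cdots & \cdots & \cdots
\end{pmatrix}
$$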
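The extrapolation is a matrix power: the PAMn probability matrix is the PAM1 matrix raised to the n-th power. The pure-Python sketch below illustrates this with an invented 2 x 2 transition matrix standing in for the real 20 x 20 PAM1 matrix; `mat_mul`, `pam_n` and `TOY_PAM1` are hypothetical names and values.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def pam_n(pam1, n):
    """Return the n-th power of a PAM1 probability matrix."""
    size = len(pam1)
    # Start from the identity matrix (zero evolutionary time).
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, pam1)
    return result


# Hypothetical two-letter stand-in for a PAM1 matrix; rows sum to 1.
TOY_PAM1 = [
    [0.99, 0.01],
    [0.02, 0.98],
]

if __name__ == "__main__":
    for n in (1, 2, 250):
        print(f"PAM{n}", pam_n(TOY_PAM1, n))
```

The rows still sum to 1 after any number of multiplications, and the diagonal shrinks as n grows because repeated and reverse changes accumulate at the same site.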

PAMN

e.g. a PAM250 matrix represents a 250% level of evolutionary change

e.g. PAM120, PAM80, PAM60 matrices could be used for aligning sequences which are approximately 40%, 50% and 60% similar, respectively

PAM250 has been shown preferable for distantly related proteins of 14-27% similarity



Detecting evolutionary relationships

[Figure: tree drawn against a time axis (300 million years, 200 million years, 100 million years, Today), with branches labelled PAM100 and PAM200.]

6. PAM log odds matrices

Rather than use probabilities, it is more convenient to use log odds matrices

If p_ij is an entry in the PAMN probability matrix, the corresponding entry in the PAMN log odds matrix is:

$$C \log\left(\frac{p_{ij}}{q_i q_j}\right)$$

where C is a positive constant and q_i and q_j are the respective observed frequencies of amino acids i and j in the sequences

Interpreted as the ratio of the probability that the substitution represents an authentic evolutionary change to the probability that it occurred due to random events of no biological significance.

PAM matrices - summary

Family of substitution matrices corresponding to different levels of evolutionary time

Based on sound evolutionary principles

Distances for long periods of evolutionary history extrapolated from shorter times (assumption!)

Based on a relatively small data set (mainly globular proteins)

BLOSUM vs PAM

PAM:    Built from an evolutionary model based on closely related proteins
BLOSUM: Built directly from blocks of aligned protein segments covering a wide range of evolutionary time

PAM:    Extrapolation from closely related sequences
BLOSUM: No extrapolation

PAM:    Built from a small number of complete sequences
BLOSUM: Built from a large number of sequence segments

BLOSUM vs PAM (cont.)

PAM:    PAMn matrices with low n are better suited to closely related sequences
BLOSUM: BLOSUMn matrices with low n are better suited to highly divergent sequences

PAM:    Uses phylogenetic tree to avoid over-representing closely related sequences
BLOSUM: Uses clustering of related sequences and direct counting of amino acid changes

PAM:    Commonly used as log odds matrix
BLOSUM: Commonly used as log odds matrix

BLOSUM vs PAM: Counting changes

BLOSUM: direct counts. A-B count = 4

PAM: counts from an evolutionary model. A-B count = 2

[Figure: the same sequences (AA, AB, BB) shown as a plain list for BLOSUM and placed on a tree for PAM.]



Gap penalties I

Rationale:
  Gaps arise through insertion/deletion events, which do not happen one residue at a time.

Gap creation penalty:
  Penalty for creating a new gap
  Typically relatively high, to prevent too many gaps in the alignment

Gap extension (length) penalty:
  Penalty for extending an existing gap
  Typically relatively small, so that a small difference in gap length will not affect the penalty for this gap much, but not so small that it results in very long gaps.
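The two-part penalty can be written as a small function. This is a sketch under an assumed convention: the total cost of a gap is the creation penalty plus the extension penalty times the gap length. Alignment programs differ on the exact form (some charge extension only from the second gapped residue onwards), and the name `affine_gap_penalty` and its default values are chosen here for illustration.

```python
def affine_gap_penalty(length, creation=5.0, extension=0.1):
    """Total penalty for one gap of `length` residues.

    Assumed convention: creation + extension * length. Some programs
    use creation + extension * (length - 1) instead.
    """
    if length <= 0:
        return 0.0
    return creation + extension * length


if __name__ == "__main__":
    one_long_gap = affine_gap_penalty(6)
    six_short_gaps = 6 * affine_gap_penalty(1)
    print("one gap of 6 residues: ", one_long_gap)
    print("six gaps of 1 residue: ", six_short_gaps)
```

With a high creation cost and a low extension cost, one long gap is much cheaper than several short ones, which reflects the rationale that a single insertion or deletion event can span many residues.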

Gap penalties II

Alignment of human α and β hemoglobin chains

Gap penalty = 1, Gap extension penalty = 0.1

  1 V.LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.DLSH.....GSA
    | |.|.:|..|.| ||||  :.:| |:|||:|::: :| |. :|. | |||      |.:
  1 VHLTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNP

 54 QVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHL
    .||:||||| :|:.:::||:|::...:..||:||..||:||| ||:||::.|:..|| |:
 59 KVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHF

114 PAEFTPAVHASLDKFLASVSTVLTSKYR 141
    . ||||:|:|..:|.:|:|...|. ||:
119 GKEFTPPVQAAYQKVVAGVANALAHKYH 146

Gap penalties III

Alignment of human α and β hemoglobin chains

Gap penalty = 5, Gap extension penalty = 0.1

  2 LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF......DLSHGSAQV
    |.|.:|..|.| ||||  :.:| |:|||:|::: :| |. :|. |      |   |.:.|
  3 LTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV

 56 KGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPA
    |:||||| :|:.:::||:|::...:..||:||..||:||| ||:||::.|:..|| |:.
 61 KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK

116 EFTPAVHASLDKFLASVSTVLTSKYR 141
    ||||:|:|..:|.:|:|...|. ||:
121 EFTPPVQAAYQKVVAGVANALAHKYH 146

The twilight zone

[Figure: the twilight zone of sequence alignment, with regions labelled True positives and False negatives.]

Rost, B. Protein Eng. 1999 12:85-94; doi:10.1093/protein/12.2.85

Measuring alignment quality

Alignment score
  Relative to random alignment?

Percentage identity

Percentage similarity

Evolutionary distance
  In its simplest form, 1 - % identity
  Several methods available to correct for multiple substitutions
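Percentage identity and the simplest distance can be computed directly from an alignment, as in the Python sketch below. The function names are hypothetical, and one choice is an assumption that the slides do not settle: identity is taken over ungapped columns only (programs also use the full alignment length or the shorter sequence length as the denominator). No correction for multiple substitutions is applied.

```python
def percent_identity(row1, row2, gap_chars="-."):
    """Percentage of identical residues over the ungapped columns."""
    columns = [
        (a, b) for a, b in zip(row1, row2)
        if a not in gap_chars and b not in gap_chars
    ]
    if not columns:
        raise ValueError("alignment has no ungapped columns")
    identical = sum(a == b for a, b in columns)
    return 100.0 * identical / len(columns)


def simple_distance(row1, row2):
    """Uncorrected distance: 1 minus the fraction of identical residues."""
    return 1.0 - percent_identity(row1, row2) / 100.0


if __name__ == "__main__":
    a, b = "SPLATEGGANDSPAM", "SPA----GANDSPAM"
    print(f"identity: {percent_identity(a, b):.1f}%")
    print(f"distance: {simple_distance(a, b):.3f}")
```

Because the same site can change more than once, this uncorrected distance underestimates the true amount of change for divergent sequences, which is what the correction methods address.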

Something to think about

Why do we add the scores together?