03 Comparison.ppt


  • 7/27/2019 03 Comparison.ppt


    Sequence Comparison

    BINF3010/9010

    Homology and similarity

    Homology: Sequences are homologous if they are evolutionarily related, i.e. they share a common ancestor through evolution

    Similarity: Looking alike. Not an evolutionary concept

    Homology and similarity

    Homology is not a quantity. Two sequences are either homologous or not homologous

    e.g., it is incorrect to refer to two sequences as being 50% homologous

    Similarity can be quantified, e.g., two sequences can be 50% similar, 80% similar etc.

    Homology and similarity

    Computational methods recognise and measure similarity

    High similarity is supporting evidence to infer homology

    Types of homology

    Orthologs: Genes/proteins in different species descended from a common ancestor through a speciation event

    Paralogs: Genes/proteins related to each other due to a gene duplication event

    Evolution through mutations

    SPAMEGGANDSPAM -> (substitutions) -> SPATEGGANDSPAM
    SPATEGGANDSPAM -> (insertions)    -> 1 SPLATEGGANDSPAM
    SPATEGGANDSPAM -> (deletions)     -> 2 SPAGANDSPAM


    Visualising the process

    Dot matrix plots (dot plots)
    Alignments

    Dot matrix plot

    [Figure: dot matrix plot of sequence 1 (SPLATEGGANDSPAM) against sequence 2 (SPAGANDSPAM)]

    Dot matrix plots

    Dot matrix plot: Principle

    [Figure: dot matrix plots of AAGTTCAGTAGGCATTTAAGCG (horizontal axis) against AGTACCGTTCC (vertical axis), with a * wherever the two sequences match. Three panels: Word size = 1, Word size = 2, Word size = 3]
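    The word-size principle behind these plots can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program used to draw the slides: the function name `dotplot` and its `threshold` parameter (the minimum number of agreeing positions within a word, defaulting to an exact word match) are made up for this example. The two sequences are read off the plot axes.

```python
def dotplot(horizontal, vertical, word=1, threshold=None):
    """Return the rows of a text dot plot as strings of '*' and ' '.

    A '*' at row i, column j means that the words of length `word`
    starting at vertical[i] and horizontal[j] agree in at least
    `threshold` positions (hypothetical parameter; default: all of them).
    """
    if threshold is None:
        threshold = word
    rows = []
    for i in range(len(vertical)):
        row = []
        for j in range(len(horizontal)):
            w1 = vertical[i:i + word]
            w2 = horizontal[j:j + word]
            if len(w1) < word or len(w2) < word:
                row.append(" ")  # the word would run off the end of a sequence
                continue
            matches = sum(a == b for a, b in zip(w1, w2))
            row.append("*" if matches >= threshold else " ")
        rows.append("".join(row))
    return rows


# Sequences taken from the axes of the plots above.
HORIZONTAL = "AAGTTCAGTAGGCATTTAAGCG"
VERTICAL = "AGTACCGTTCC"

if __name__ == "__main__":
    for size in (1, 2, 3):
        print(f"Word size = {size}")
        for base, row in zip(VERTICAL, dotplot(HORIZONTAL, VERTICAL, word=size)):
            print(base, row)
```

    Increasing the word size removes isolated dots and leaves mainly the diagonal runs; lowering `threshold` below the word size lets some of them back in.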


    [Figure: dot matrix plot of AAGTTCAGTAGGCATTTAAGCG against AGTACCGTTCC; Word size = 3, Threshold = 2]

    [Figure: four dot matrix plots drawn with different settings: Window = 30, Stringency = 9; Window = 20, Stringency = 9; Window = 30, Stringency = 14; Window = 20, Stringency = 13]

    Dot matrix plot: repeats

    [Figure: dot matrix plot of sequence 1 (SPLATEGGANDSPAM) against sequence 2 (SPAGANDSPAM)]


    Repeat detection

    [Figure: dot matrix plot of TFIIIA vs TFIIIA]

    Sequence alignment

    1 SPLATEGGANDSPAM
    2 SPAGANDSPAM

    1 SPLATEGGANDSPAM
      || |   ||||||||
    2 SP-A---GANDSPAM

    Global vs Local Alignment

     1 ....AUAUCUUUAAUUUAAUGGUAAAAUAUUAGAAUACGAAUCUAAUUAU 46
           ||||  || |   ||    ||    || || |    |  | || ||
     1 UGGUAUAUAGUUUAAACAAAACGAAUGAUUUCGACUCAUUAAAUUAUGAU 50

    47 AUAGGUUCAAAUCCUAUAAGAUAUUCCA 74
       |    |    | |    |
    51 AAUCAUAUUUACCAACCA.......... 68

    44 UAUAUAGGUUCAA 56
       ||||||| || ||
     4 UAUAUAGUUUAAA 16

    Global: align the whole of the two sequences together

    Local: align only the region of best similarity

    Which alignment is correct?

    1 SPLATEGGANDSPAM
      || |   ||||||||
    2 SP-A---GANDSPAM
    2 insertion/deletions

    1 SPLATEGGANDSPAM
      ||     ||||||||
    2 SPA----GANDSPAM
    1 indel, 1 substitution

    1 SPLATEGGANDSPAM
      ||     ||||||||
    2 SP----AGANDSPAM
    1 indel, 1 substitution

    1 SPLATEGGANDSPAM
         |   ||||||||
    2 -SPA---GANDSPAM
    2 indels, 2 substitutions

    Which alignment is optimal?

    Select a scoring system for alignments
      Assign values to matches, mismatches and gaps

    Sum up the values over the whole alignment
      $\text{Alignment score} = \sum \text{Score}_{\text{match}} - \sum \text{Score}_{\text{gap}}$
      (the first sum runs over all matched and mismatched positions, the second over all gaps)

    The optimal alignment is the one with the highest score

    For example:

    Match: +2   Mismatch: -1   Gap: 5 (one penalty per gap, whatever its length)

    1 SPLATEGGANDSPAM
      || |   ||||||||
    2 SP-A---GANDSPAM
    S = (11*2) + (0*-1) - (2*5) = 12

    1 SPLATEGGANDSPAM
      ||x    ||||||||
    2 SPA----GANDSPAM
    S = (10*2) + (1*-1) - (1*5) = 14

    1 SPLATEGGANDSPAM
      ||    x||||||||
    2 SP----AGANDSPAM
    S = (10*2) + (1*-1) - (1*5) = 14

    1 SPLATEGGANDSPAM
       xx|   ||||||||
    2 -SPA---GANDSPAM
    S = (9*2) + (2*-1) - (2*5) = 6
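    The four scores can be checked with a short Python sketch. It assumes the scheme used in the worked example (match +2, mismatch -1, and a single penalty of 5 for each run of gap characters regardless of its length); the function name `score_alignment` is made up for this illustration.

```python
def score_alignment(top, bottom, match=2, mismatch=-1, gap=5):
    """Score a gapped pairwise alignment given as two equal-length strings.

    '-' marks a gap. Each run of consecutive gap characters in the same
    sequence is charged `gap` once, as in the worked example.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    previous_gap_side = None  # sequence holding the gap run we are inside
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            side = "top" if a == "-" else "bottom"
            if side != previous_gap_side:
                score -= gap  # a new gap starts here
            previous_gap_side = side
        else:
            previous_gap_side = None
            score += match if a == b else mismatch
    return score


SEQ1 = "SPLATEGGANDSPAM"
CANDIDATES = ["SP-A---GANDSPAM", "SPA----GANDSPAM",
              "SP----AGANDSPAM", "-SPA---GANDSPAM"]

if __name__ == "__main__":
    for aligned in CANDIDATES:
        print(aligned, score_alignment(SEQ1, aligned))
```

    Under this particular scoring scheme the second and third alignments tie for the highest score, so the optimal alignment need not be unique.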


    Algorithms

    Global alignment
      Needleman-Wunsch
      Sellers

    Local alignment
      Smith-Waterman

    Note that the optimal alignment is not necessarily the correct biological alignment.

    However, it is usually impossible to know the correct evolutionary alignment
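    As a rough illustration of how these algorithms find the optimal score, the sketch below fills the dynamic programming table for a global (Needleman-Wunsch style) or local (Smith-Waterman style) alignment. It is a minimal version under stated assumptions: it returns only the score, not the alignment, and it charges the gap penalty for every gap position (a linear gap cost), which differs from the per-gap penalty in the earlier worked example. The function name `alignment_score` and the default values are choices made for this example.

```python
def alignment_score(a, b, match=2, mismatch=-1, gap=-5, local=False):
    """Optimal alignment score by dynamic programming.

    local=False: global alignment (Needleman-Wunsch style).
    local=True:  local alignment (Smith-Waterman style).
    `gap` is added once per gap position (linear gap cost, an assumption).
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Leading gaps are penalised in a global alignment.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, table[i - 1][j] + gap, table[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            table[i][j] = cell
    return best if local else table[-1][-1]


if __name__ == "__main__":
    print(alignment_score("SPLATEGGANDSPAM", "SPAGANDSPAM"))
    print(alignment_score("SPLATEGGANDSPAM", "SPAGANDSPAM", local=True))
```

    With a per-position gap cost the global optimum for the SPAM example is the alignment with 11 matches and four gap positions, which shows how strongly the choice of scoring system determines which alignment is called optimal.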

    Structure alignment

                           10        20        30        40        50        60
                   ....*....|....*....|....*....|....*....|....*....|....*....|
    4HHB_A   1 ~VLSPADKTNVKAAWGKVgaHAGEYGAEALERMFLSFPTTKTYFPHFDls~~~~~~hGSA  53
    2HHB_B   1 vHLTPEEKSAVTALWGKV~~NVDEVGGEALGRLLVVYPWTQRFFESFGdlstpdavmGNP  58

                           70        80        90       100       110       120
                   ....*....|....*....|....*....|....*....|....*....|....*....|
    4HHB_A  54 QVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHL 113
    2HHB_B  59 KVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHF 118

                          130       140
                   ....*....|....*....|....*...
    4HHB_A 114 PAEFTPAVHASLDKFLASVSTVLTSKYR 141
    2HHB_B 119 GKEFTPPVQAAYQKVVAGVANALAHKYH 146

    Scoring systems

    Matches and mismatches: Substitution mutations

    Gaps: Insertions and deletions

    DNA sequence alignment

    768 TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG 813
        ||     ||   ||  |  | ||| |  |||| |||||    |||  |||
     87 TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG 135

    814 AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG 863
        |  | |              |  |||||| |   ||||  | || |   |
    136 AAGGATC.............TCAGTAATTAATCATGCACCTATGTGGCGG 172

    864 AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT 913
        ||| | ||| || || |||   |   |||||||||  ||   |||||| |
    173 AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT 216


         A    T    G    C
    A    5   -4   -4   -4
    T   -4    5   -4   -4
    G   -4   -4    5   -4
    C   -4   -4   -4    5

    DNA scoring matrix used in EMBOSS

    Section of EMBOSS data file EDNAFULL
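    As a small illustration of how such a matrix is used, the sketch below looks up each aligned pair of bases and sums the values. It encodes only the four-base section shown above (+5 for identical bases, -4 otherwise); the names `EDNAFULL_SECTION` and `ungapped_score` are made up for this example, and gaps are deliberately left out.

```python
BASES = "ATGC"

# The section of the matrix shown above: +5 on the diagonal, -4 elsewhere.
EDNAFULL_SECTION = {
    (a, b): (5 if a == b else -4) for a in BASES for b in BASES
}


def ungapped_score(seq1, seq2, matrix=EDNAFULL_SECTION):
    """Sum the matrix values over an ungapped alignment of equal length."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(matrix[(a, b)] for a, b in zip(seq1, seq2))


if __name__ == "__main__":
    print(ungapped_score("ATGC", "ATGA"))  # three matches and one mismatch
```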

    Protein Sequence Alignment

    TPKRREAEDLQVGQVLGGPLQLLE...SLQKRGIVEQCCT
     ||:|: |:      |:|||::|:      |||||||||
    YPKKRDMEQ......LSGPLDMLQQEYQKMKRGIVEQCCH

    Identical (|)

    Similar (:)

    Different (no symbol)

    Protein Comparison: Scoring Matrix

            Ala  Cys  Asp  Glu  Phe  Gly  His  Ile  Lys  Leu  Met  Asn  Pro  Gln  Arg  Ser  Thr  Val  Trp  Tyr
              A    C    D    E    F    G    H    I    K    L    M    N    P    Q    R    S    T    V    W    Y
    A Ala   0.8  0.0 -0.4 -0.2 -0.4  0.0 -0.4 -0.2 -0.2 -0.2 -0.2 -0.4 -0.2 -0.2 -0.2  0.2  0.0  0.0 -0.6 -0.4
    C Cys        1.8 -0.6 -0.8 -0.4 -0.6 -0.6 -0.2 -0.6 -0.2 -0.2 -0.6 -0.6 -0.6 -0.6 -0.2 -0.2 -0.2 -0.4 -0.4
    D Asp             1.2  0.4 -0.6 -0.2 -0.2 -0.6 -0.2 -0.8 -0.6  0.2 -0.2  0.0 -0.4  0.0 -0.2 -0.6 -0.8 -0.6
    E Glu                  1.0 -0.6 -0.4  0.0 -0.6  0.2 -0.6 -0.4  0.0 -0.2  0.4  0.0  0.0 -0.2 -0.4 -0.6 -0.4
    F Phe                       1.2 -0.6 -0.2  0.0 -0.6  0.0  0.0 -0.6 -0.8 -0.6 -0.6 -0.4 -0.4 -0.2  0.2  0.6
    G Gly                            1.2 -0.4 -0.8 -0.4 -0.8 -0.6  0.0 -0.4 -0.4 -0.4  0.0 -0.4 -0.6 -0.4 -0.6
    H His                                 1.6 -0.6 -0.2 -0.6 -0.4  0.2 -0.4  0.0  0.0 -0.2 -0.4 -0.6 -0.4  0.4
    I Ile                                      0.8 -0.6  0.4  0.2 -0.6 -0.6 -0.6 -0.6 -0.4 -0.2  0.6 -0.6 -0.2
    K Lys                                           1.0 -0.4 -0.2  0.0 -0.2  0.2  0.4  0.0 -0.2 -0.4 -0.6 -0.4
    L Leu                                                0.8  0.4 -0.6 -0.6 -0.4 -0.4 -0.4 -0.2  0.2 -0.4 -0.2
    M Met                                                     1.0 -0.4 -0.4  0.0 -0.2 -0.2 -0.2  0.2 -0.2 -0.2
    N Asn                                                          1.2 -0.4  0.0  0.0  0.2  0.0 -0.6 -0.8 -0.4
    P Pro                                                               1.4 -0.2 -0.4 -0.2 -0.2 -0.4 -0.8 -0.6
    Q Gln                                                                    1.0  0.2  0.0 -0.2 -0.4 -0.4 -0.2
    R Arg                                                                         1.0 -0.2 -0.2 -0.6 -0.6 -0.4
    S Ser                                                                              0.8  0.2 -0.4 -0.6 -0.4
    T Thr                                                                                   1.0  0.0 -0.4 -0.4
    V Val                                                                                        0.8 -0.6 -0.2
    W Trp                                                                                             2.2  0.4
    Y Tyr                                                                                                  1.4

    BLOSUM62 Matrix

    First principles amino acid substitution matrices

    Identity matrix
      Perfect match: positive score
      Any mismatch: negative score

    Genetic score matrix
      Based on the average number of nucleotide changes needed to mutate one amino acid into another
      e.g. K (AAA, AAG) to N (AAC, AAU) has a higher score than K (AAA, AAG) to D (GAU, GAC)

    Chemical properties matrices
      e.g. K (basic) to R (basic) has a higher score than K (basic) to F (aromatic) or K to E (acidic)
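    The genetic score example can be made concrete by counting nucleotide differences between codons. The sketch below uses only the three codon sets quoted in the example and averages the number of differing positions over all codon pairs; the names `CODONS` and `mean_nucleotide_changes` are illustrative, and turning these averages into actual matrix scores (fewer changes, higher score) is left out.

```python
from itertools import product

# Codons quoted in the example above (RNA alphabet).
CODONS = {
    "K": ["AAA", "AAG"],
    "N": ["AAC", "AAU"],
    "D": ["GAU", "GAC"],
}


def codon_distance(codon1, codon2):
    """Number of positions at which two codons differ."""
    return sum(a != b for a, b in zip(codon1, codon2))


def mean_nucleotide_changes(aa1, aa2, codons=CODONS):
    """Average codon distance over all codon pairs of two amino acids."""
    pairs = list(product(codons[aa1], codons[aa2]))
    return sum(codon_distance(c1, c2) for c1, c2 in pairs) / len(pairs)


if __name__ == "__main__":
    print("K to N:", mean_nucleotide_changes("K", "N"))
    print("K to D:", mean_nucleotide_changes("K", "D"))
```

    K to N needs one nucleotide change on average and K to D needs two, which is why K to N receives the higher score in a genetic score matrix.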


    Identity matrix example

    D   +1
    E   -1  +1
    Q   -1  -1  +1
    H   -1  -1  -1  +1
    V   -1  -1  -1  -1  +1
    F   -1  -1  -1  -1  -1  +1
    W   -1  -1  -1  -1  -1  -1  +1
         D   E   Q   H   V   F   W

    Data-based matrices

    Calculated from amino acid frequencies in known homologous sequences

    PAM family of matrices
    BLOSUM family of matrices
    Perform better than first principle matrices (which are still useful for some specialised applications)

    BLOSUM matrices

    BLOSUM 62

    Henikoff and Henikoff, 1992
    Blocks Substitution Matrix
    Based on the BLOCKS database
    Currently, most widely used matrix family
    Most commonly used matrices: BLOSUM62 and BLOSUM55

    BLOCKS database

    BLOCKS are ungapped multiple sequence alignments based on the SWISS-PROT database and the PROSITE protein family database

    All the sequences from SWISS-PROT belonging to a PROSITE family are aligned together, to create local ungapped alignments characteristic of the protein family

    BLOCK example

    ID   Mn_catalase; BLOCK
    AC   IPB007760A; distance from previous block=(3
    DE   Manganese containing catalase
    BL   HIL; width=14; seqs=49; 99.5%=727; strengt
    CTJC_BACSU|Q45538 (  67) HLEMIATMVYKLTK  12
    GS80_BACSU|P80878 (  69) HVEMIATMIARLLE  14
    YDHU_BACSU|O05513 (   4) HGNLITDLLDNLLL  25
    O69145            (  70) HMEIVAETINLLNG  64
    Q9KDZ2            ( 136) SGNLIFDLLHNYFL  34
    Q9KAU6            (  69) HVEMLATMIARLLD  16
    Q9I1T0            (  68) HLEIIGSIVGMLNK  20
    Q97JE8            (  68) HLEIVGSIVRQLSR  50
    MCAT_CLOAB|Q97FE0 ( 124) TGDIVADLLSNIAS  73
    Q8Z7E1            (  68) HLEIIGSLVGMLNK  17
    Q8YY54            (  69) HIEMLATMIAHLLD  27
    Q8YSJ5            (  68) HLEMVGKLIEAHTK  36


    From BLOCKS to BLOSUM

    1. Count the number of amino acid pairs observed in each column of each block and calculate the observed frequency of each pair

    2. Calculate the expected frequency of each pair (based on the frequency of individual amino acids)

    3. Calculate the log ratio (typically log2)

    1. Count number of observed pairs and calculate frequencies

    DADA
    AAAE
    AAEE
    AADA
    AAEE
    AADE

    There are $4 \times \binom{6}{2} = 60$ aligned pairs of amino acids in the block

    Aligned pair (xy)    Proportion of times observed ($o_{xy}$)
    A to A               26/60
    A to D               8/60
    A to E               10/60
    D to D               3/60
    D to E               6/60
    E to E               7/60

    General case for step 1.

    For each pair of amino acids x and y,

    $n_{xy}$ = number of times x and y are in the same column of a block

    $o_{xy}$ = observed proportion of aligned pair xy

    $$o_{xy} = \frac{n_{xy}}{\sum_{u \le v} n_{uv}}$$

    2. Calculate the expected frequency of each pair

    DADA
    AAAE
    AAEE
    AADA
    AAEE
    AADE

    Amino acid (x)    Proportion in block ($p_x$)
    A                 14/24
    D                 4/24
    E                 6/24

    Amino acid pair (xy)    Expected proportion ($e_{xy}$)
    A to A                  $(14/24)^2 = 196/576$
    A to D                  $2 (14/24) (4/24) = 112/576$
    A to E                  $2 (14/24) (6/24) = 168/576$
    D to D                  $(4/24)^2 = 16/576$
    D to E                  $2 (4/24) (6/24) = 48/576$
    E to E                  $(6/24)^2 = 36/576$

    General case for step 2

    Expected proportion of amino acid pair xy in a random block of the same amino acid composition:

    $$e_{xy} = \begin{cases} 2 p_x p_y & \text{if } x \ne y \\ p_x p_y & \text{if } x = y \end{cases}$$

    3. Calculate the log ratio

    $$\text{Matrix entry} = 2 \log_2 \left( \frac{o_{xy}}{e_{xy}} \right) \quad \text{(rounded to nearest integer)}$$

    Aligned pair (xy)    $o_{xy}$    $e_{xy}$    $2 \log_2 (o_{xy}/e_{xy})$
    A to A               26/60       196/576      0.70
    A to D               8/60        112/576     -1.09
    A to E               10/60       168/576     -1.61
    D to D               3/60        16/576       1.70
    D to E               6/60        48/576       0.53
    E to E               7/60        36/576       1.80


    Final matrix

         A    D    E
    A    1   -1   -2
    D   -1    2    1
    E   -2    1    2

    The $2 \log_2$ transformation means that the matrix is in half-bits
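    The three steps can be reproduced for the toy block with a short Python sketch. It follows the worked example only (no sequence clustering), and the function name `blosum_from_block` is made up for this illustration. Pairs that never occur in the block have no defined log ratio and are simply absent from the result.

```python
from collections import Counter
from itertools import combinations
from math import log2

# The six-sequence, four-column block from the worked example.
BLOCK = ["DADA", "AAAE", "AAEE", "AADA", "AAEE", "AADE"]


def blosum_from_block(block):
    """Return {(x, y): score} in half-bits for every pair seen in the block."""
    # Step 1: count aligned pairs column by column.
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total_pairs = sum(pair_counts.values())

    # Step 2: expected proportions from the amino acid composition.
    residue_counts = Counter("".join(block))
    total_residues = sum(residue_counts.values())

    # Step 3: log ratio, scaled by 2 and rounded to the nearest integer.
    matrix = {}
    for (x, y), n in pair_counts.items():
        observed = n / total_pairs
        p_x = residue_counts[x] / total_residues
        p_y = residue_counts[y] / total_residues
        expected = p_x * p_y if x == y else 2 * p_x * p_y
        matrix[(x, y)] = round(2 * log2(observed / expected))
    return matrix


if __name__ == "__main__":
    for pair, score in sorted(blosum_from_block(BLOCK).items()):
        print(pair, score)
```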

    BLOSUM family

    Problem: counting every amino acid in the block can lead to an over-representation of amino acid changes found in closely related sequences

    Solution: cluster sequences closer than a set % identity, and average their contribution so that the whole cluster counts as one sequence

    This gives rise to a family of matrices, depending on the % identity threshold

    VSLHL
    ELTRS
    EWTRS
    EISRS
    ELCRT

    [Figure: brackets marking the sequences that are 80% identical and 60% identical]

                                                         $n_{EE}$   $n_{VE}$
    No clustering (BLOSUM100)                               6          4
    Clustering sequences with 80% identity (BLOSUM80)       3          3
    Clustering sequences with 60% identity (BLOSUM60)       2          2

    PAM matrices

    PAM120

    PAM - Point (Percent) Accepted Mutation
    Schwartz and Dayhoff, 1978
    Also known as MDM78 (mutation data matrix) or Dayhoff matrix

    Empirical matrix based on evolutionary model
    Based on small number of families of closely related proteins (>85% identity) so that sequences can be aligned unambiguously by hand

    Since the changes observed between these sequences did not affect the function of the protein, these are accepted mutations

    1. Align the sequences by hand

    2. Order the sequences using parsimony

    hbb_ornan LSELHCDKLH VDPENFNRLG NVLIVVLARH FSKDFSPEVQ AAWQKLVSGV
    hbb_tacac LSELHCDKLH VDPENFNRLG NVLVVVLARH FSKEFTPEAQ AAWQKLVSGV
    hbe_ponpy LSELHCDKLH VDPENFKLLG NVMVIILATH FGKEFTPEVQ AAWQKLVSAV
    hbb_speci LSELHCDKLH VDPENFKLLG NMIVIVMAHH LGKDFTPEAQ AAFQKVVAGV
    hbb_speto LSELHCDKLH VDPENFKLLG NMIVIVMAHH LGKDFTPEAQ AAFQKVVAGV
    hbb_equhe LSELHCDKLH VDPENFRLLG NVLVVVLARH FGKDFTPELQ ASYQKVVAGV


    3. Count the number of times each amino acid changes to each other one

    e.g. F changing to L

    hbb_ornan LSELHCDKLH VDPENFNRLG NVLIVVLARH FSKDFSPEVQ AAWQKLVSGV
    hbb_tacac LSELHCDKLH VDPENFNRLG NVLVVVLARH FSKEFTPEAQ AAWQKLVSGV
    hbe_ponpy LSELHCDKLH VDPENFKLLG NVMVIILATH FGKEFTPEVQ AAWQKLVSAV
    hbb_speci LSELHCDKLH VDPENFKLLG NMIVIVMAHH LGKDFTPEAQ AAFQKVVAGV
    hbb_speto LSELHCDKLH VDPENFKLLG NMIVIVMAHH LGKDFTPEAQ AAFQKVVAGV
    hbb_equhe LSELHCDKLH VDPENFRLLG NVLVVVLARH FGKDFTPELQ ASYQKVVAGV

    [Figure: tree relating the six sequences, with F or L shown at each tip and internal node for the highlighted column]

    1 F-L change ($N_{FL} = 1$)

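    4. Calculate probability for each amino acid mutating to each other amino acid

    For each pair of amino acids i and j, the frequency of change $f_{ij}$ is:

    $$f_{ij} = \frac{N_{ij}}{\sum_k N_{ik}}$$

    For $i \ne j$, the probability of change $p_{ij}$ is:

    $$p_{ij} = c f_{ij} \quad \text{and} \quad p_{ii} = 1 - c \sum_{j \ne i} f_{ij}$$

    where c is a positive scaling constant chosen so that each $p_{ii} > 0$.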

    Probability matrix

    The resulting probability matrix allows modelling the evolution of protein sequences as a Markov process: that is, the probability of any amino acid mutating to another one is dependent only on that amino acid

    A   pAA
    C   pAC  pCC
    D   pAD  pCD  pDD
    E   pAE  pCE  pDE  pEE
         A    C    D    E

    PAM1

    The constant c is chosen so that the expected number of amino acid changes after one round of applying the probabilities is 1 in 100 amino acids

    The resulting probability matrix is the PAM1 probability matrix, giving the probability that an amino acid will mutate to another over an amount of evolutionary time such that 1% of amino acids mutate

    Expected proportion of mutated amino acids (with $p_i$ the frequency of amino acid i):

    $$\sum_i p_i \sum_{j \ne i} p_{ij} = c \sum_i p_i \sum_{j \ne i} f_{ij} = 0.01$$
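    Steps 4 and the PAM1 scaling can be sketched as follows. Everything numeric here is hypothetical: `TOY_COUNTS` and `TOY_FREQS` are invented three-letter inputs, and `pam1_matrix` is a name chosen for this example. The sketch assumes the counts include the diagonal (unchanged residues), so that each row total plays the role of the sum over k of N_ik.

```python
def pam1_matrix(counts, freqs, target=0.01):
    """Build a PAM1-style probability matrix from toy inputs.

    counts[i][j] stands for N_ij (N_ii counts unchanged residues, an
    assumption of this sketch) and freqs[i] for the frequency of i.
    """
    # f_ij = N_ij / sum_k N_ik, kept for the off-diagonal entries only.
    f = {}
    for i, row in counts.items():
        row_total = sum(row.values())
        f[i] = {j: n / row_total for j, n in row.items() if j != i}

    # Choose c so that sum_i freqs[i] * sum_{j != i} c * f_ij == target.
    c = target / sum(freqs[i] * sum(f[i].values()) for i in counts)

    matrix = {}
    for i in counts:
        matrix[i] = {j: c * f_ij for j, f_ij in f[i].items()}
        matrix[i][i] = 1 - c * sum(f[i].values())
    return matrix


# Hypothetical counts and frequencies, for illustration only.
TOY_COUNTS = {
    "A": {"A": 90, "D": 4, "E": 6},
    "D": {"A": 4, "D": 80, "E": 16},
    "E": {"A": 6, "D": 16, "E": 78},
}
TOY_FREQS = {"A": 0.5, "D": 0.2, "E": 0.3}

if __name__ == "__main__":
    for i, row in pam1_matrix(TOY_COUNTS, TOY_FREQS).items():
        print(i, {j: round(p, 5) for j, p in sorted(row.items())})
```

    Each row of the result sums to 1, and the frequency-weighted probability of change is 1%, which is what defines one PAM unit.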

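    5. PAMN

    Because the probability matrix is Markov, it is possible to calculate probability matrices for longer evolutionary times by raising the PAM1 matrix to the power n (multiplying n copies of the matrix together)

    e.g. PAM2 probability matrix:

    $$\begin{pmatrix} p_{AA} & p_{AC} & p_{AD} & \cdots \\ p_{CA} & p_{CC} & p_{CD} & \cdots \\ p_{DA} & p_{DC} & p_{DD} & \cdots \\ \cdots & \cdots & \cdots & \cdots \end{pmatrix} \begin{pmatrix} p_{AA} & p_{AC} & p_{AD} & \cdots \\ p_{CA} & p_{CC} & p_{CD} & \cdots \\ p_{DA} & p_{DC} & p_{DD} & \cdots \\ \cdots & \cdots & \cdots & \cdots \end{pmatrix}$$

    PAMN

    e.g. a PAM250 matrix represents a 250% level of evolutionary change (250 accepted point mutations per 100 amino acids; the same site can change more than once)

    e.g. PAM120, PAM80, PAM60 matrices could be used for aligning sequences which are approximately 40%, 50% and 60% similar, respectively

    PAM250 has been shown preferable for distantly related proteins of 14-27% similarity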
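    The extrapolation to longer evolutionary times is plain matrix multiplication, as the following sketch shows. `TOY_PAM1` is a hypothetical two-letter probability matrix used only to keep the numbers readable, and the helper names `mat_mul` and `pam_n` are assumptions of this example.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def pam_n(pam1, n):
    """Return the PAM1 probability matrix raised to the power n (n >= 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(result, pam1)
    return result


# Hypothetical two-letter "PAM1" matrix; each row sums to 1.
TOY_PAM1 = [[0.99, 0.01],
            [0.02, 0.98]]

if __name__ == "__main__":
    for n in (1, 2, 250):
        print(n, [[round(p, 4) for p in row] for row in pam_n(TOY_PAM1, n)])
```

    The diagonal entries shrink as n grows, reflecting the falling probability that a residue is still unchanged after more evolutionary time, while every row continues to sum to 1.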


    Detecting evolutionary relationships

    [Figure: evolutionary tree on a time axis (300 million years, 200 million years, 100 million years, Today), with branches labelled PAM100 and PAM200]

    6. PAM log odds matrices

    Rather than use probabilities, it is more convenient to use log odds matrices

    If $p_{ij}$ is an entry in the PAMN probability matrix, the corresponding entry in the PAMN log odds matrix is:

    $$C \log \left( \frac{p_{ij}}{q_i q_j} \right)$$

    where C is a positive constant and $q_i$ and $q_j$ are the respective observed frequencies of amino acids i and j in the sequences

    Interpreted as the ratio of the probability that the substitution represents an authentic evolutionary change to the probability that it occurred due to random events of no biological significance.

    PAM matrices - summary

    Family of substitution matrices corresponding to different levels of evolutionary time

    Based on sound evolutionary principles
    Distances for long periods of evolutionary history extrapolated from shorter times (assumption!)

    Based on a relatively small data set (mainly globular proteins)

    BLOSUM vs PAM

    PAM                                        BLOSUM
    Built from an evolutionary model based     Built directly from blocks of aligned
    on closely related proteins                protein segments covering a wide range
                                               of evolutionary time

    Extrapolation from closely related         No extrapolation
    sequences

    Built from a small number of complete      Built from a large number of sequence
    sequences                                  segments

    BLOSUM vs PAM (cont.)

    PAM                                        BLOSUM
    PAMn matrices with low n are better        BLOSUMn matrices with low n are better
    suited to closely related sequences        suited to highly divergent sequences

    Uses phylogenetic tree to avoid            Uses clustering of related sequences
    over-representing closely related          and direct counting of amino acid
    sequences                                  changes

    Commonly used as log odds matrix           Commonly used as log odds matrix

    BLOSUM vs PAM

    Counting Changes

    BLOSUM: direct counts, A-B count = 4

    PAM: counts from an evolutionary model, A-B count = 2

    [Figure: the same set of aligned A and B residues, counted directly for BLOSUM and along a tree for PAM]


    Gap penalties I

    Rationale:
      Gaps arise through insertion/deletion events, which do not happen one residue at a time.

    Gap creation penalty:
      Penalty for creating a new gap
      Typically relatively high, to prevent too many gaps in the alignment

    Gap extension (length) penalty:
      Penalty for extending an existing gap
      Typically relatively small, so that a small difference in gap length hardly changes the penalty for that gap, but not so small that it results in very long gaps.
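    A gap cost built from these two penalties can be sketched as below. The convention used here (creation penalty plus extension penalty times the gap length) is an assumption of this example; some programs charge the extension only for positions after the first. The function names are illustrative, and `.` is used as the gap character to match the alignments in this lecture.

```python
def affine_gap_penalty(length, creation=5.0, extension=0.1):
    """Penalty for one gap: creation + extension * length (assumed convention)."""
    if length < 1:
        raise ValueError("a gap has at least one position")
    return creation + extension * length


def total_gap_penalty(aligned, creation=5.0, extension=0.1, gap_char="."):
    """Sum the penalties of every run of gap characters in one aligned sequence."""
    total = 0.0
    run = 0
    for ch in aligned + "X":  # the sentinel closes a gap at the very end
        if ch == gap_char:
            run += 1
        elif run:
            total += affine_gap_penalty(run, creation, extension)
            run = 0
    return total


if __name__ == "__main__":
    # One long gap is much cheaper than several short ones.
    print(total_gap_penalty("AB......CD"))
    print(total_gap_penalty("A..B..C..D"))
```

    With a high creation penalty and a small extension penalty, one six-residue gap costs far less than three two-residue gaps, which pushes the alignment towards fewer, longer gaps.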

    Gap Penalties II

    Alignment of human α and β hemoglobin chains

    Gap penalty = 1, Gap extension penalty = 0.1

      1 V.LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.DLSH.....GSA
        | |.|.:|..|.| ||||  :.:| |:|||:|::: :| |. :|. | |||      |.:
      1 VHLTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNP

     54 QVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHL
        .||:||||| :|:.:::||:|::...:..||:||..||:||| ||:||::.|:..|| |:
     59 KVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHF

    114 PAEFTPAVHASLDKFLASVSTVLTSKYR 141
        . ||||:|:|..:|.:|:|...|. ||:
    119 GKEFTPPVQAAYQKVVAGVANALAHKYH 146

    Gap Penalties III

    Alignment of human α and β hemoglobin chains

    Gap penalty = 5, Gap extension penalty = 0.1

      2 LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF......DLSHGSAQV
        |.|.:|..|.| ||||  :.:| |:|||:|::: :| |. :|. |      |   |.:.|
      3 LTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV

     56 KGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPA
        |:||||| :|:.:::||:|::...:..||:||..||:||| ||:||::.|:..|| |:.
     61 KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK

    116 EFTPAVHASLDKFLASVSTVLTSKYR 141
        ||||:|:|..:|.:|:|...|. ||:
    121 EFTPPVQAAYQKVVAGVANALAHKYH 146

    The twilight zone

    [Figure: true positives and false negatives]

    Rost, B. Protein Eng. 1999 12:85-94; doi:10.1093/protein/12.2.85

    Measuring alignment quality

    Alignment score
      Relative to random alignment?

    Percentage identity
    Percentage similarity
    Evolutionary distance
      In its simplest form, 1 - % identity
      Several methods available to correct for multiple substitutions
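    Percentage identity and the simplest distance can be computed directly from an alignment, as in this sketch. The choice of denominator is an assumption of this example (the full alignment length, gap columns included); other conventions divide by the number of ungapped columns or by the length of the shorter sequence. The function names are illustrative, and no correction for multiple substitutions is attempted.

```python
def percent_identity(top, bottom, gap_chars="-."):
    """Percentage of alignment columns holding identical residues.

    Assumption: the denominator is the full alignment length, gaps included.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(a == b and a not in gap_chars for a, b in zip(top, bottom))
    return 100.0 * identical / len(top)


def simple_distance(top, bottom):
    """Simplest evolutionary distance: 1 minus the fraction of identical columns."""
    return 1.0 - percent_identity(top, bottom) / 100.0


if __name__ == "__main__":
    seq1 = "SPLATEGGANDSPAM"
    seq2 = "SP-A---GANDSPAM"
    print(round(percent_identity(seq1, seq2), 1), round(simple_distance(seq1, seq2), 3))
```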

    Something to think about

    Why do we add the scores together?