Sequence motifs, information content, logos, and HMM’s
description
Transcript of Sequence motifs, information content, logos, and HMM’s
![Page 1: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/1.jpg)
Sequence motifs, information content, logos, and HMM’s
Morten Nielsen,CBS, BioCentrum,
DTU
![Page 2: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/2.jpg)
Outline
• What is a binding motif?• How to describe a sequence motif?• Construction of scoring matrices• Sequence motifs and hidden Markov models• Use of HMM• Why are Profile HMM’s better than Anders
Gorms sequence alignments– Or at least PSSM’s
![Page 3: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/3.jpg)
Binding motifs
MHC-I
TAPMHC-II
![Page 4: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/4.jpg)
Anchor positions
MHC class I with peptide
![Page 5: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/5.jpg)
SLLPAIVEL YLLPAIVHI TLWVDPYEV GLVPFLVSV KLLEPVLLL LLDVPTAAV LLDVPTAAV LLDVPTAAVLLDVPTAAV VLFRGGPRG MVDGTLLLL YMNGTMSQV MLLSVPLLL SLLGLLVEV ALLPPINIL TLIKIQHTLHLIDYLVTS ILAPPVVKL ALFPQLVIL GILGFVFTL STNRQSGRQ GLDVLTAKV RILGAVAKV QVCERIPTIILFGHENRV ILMEHIHKL ILDQKINEV SLAGGIIGV LLIENVASL FLLWATAEA SLPDFGISY KKREEAPSLLERPGGNEI ALSNLEVKL ALNELLQHV DLERKVESL FLGENISNF ALSDHHIYL GLSEFTEYL STAPPAHGVPLDGEYFTL GVLVGVALI RTLDKVLEV HLSTAFARV RLDSYVRSL YMNGTMSQV GILGFVFTL ILKEPVHGVILGFVFTLT LLFGYPVYV GLSPTVWLS WLSLLVPFV FLPSDFFPS CLGGLLTMV FIAGNSAYE KLGEFYNQMKLVALGINA DLMGYIPLV RLVTLKDIV MLLAVLYCL AAGIGILTV YLEPGPVTA LLDGTATLR ITDQVPFSVKTWGQYWQV TITDQVPFS AFHHVAREL YLNKIQNSL MMRKLAILS AIMDKNIIL IMDKNIILK SMVGNWAKVSLLAPGAKQ KIFGSLAFL ELVSEFSRM KLTPLCVTL VLYRYGSFS YIGEVLVSV CINGVCWTV VMNILLQYVILTVILGVL KVLEYVIKV FLWGPRALV GLSRYVARL FLLTRILTI HLGNVKYLV GIAGGLALL GLQDCTMLVTGAPVTYST VIYQYMDDL VLPDVFIRC VLPDVFIRC AVGIGIAVV LVVLGLLAV ALGLGLLPV GIGIGVLAAGAGIGVAVL IAGIGILAI LIVIGILIL LAGIGLIAA VDGIGILTI GAGIGVLTA AAGIGIIQI QAGIGILLAKARDPHSGH KACDPHSGH ACDPHSGHF SLYNTVATL RGPGRAFVT NLVPMVATV GLHCYEQLV PLKQHFQIVAVFDRKSDA LLDFVRFMG VLVKSPNHV GLAPPQHLI LLGRNSFEV PLTFGWCYK VLEWRFDSR TLNAWVKVVGLCTLVAML FIDSYICQV IISAVVGIL VMAGVGSPY LLWTLVVLL SVRDRLARL LLMDCSGSI CLTSTVQLVVLHDDLLEA LMWITQCFL SLLMWITQC QLSLLMWIT LLGATCMFV RLTRFLSRV YMDGTMSQV FLTPKKLQCISNDVCAQV VKTDGNPPE SVYDFFVWL FLYGALLLA VLFSSDFRI LMWAKIGPV SLLLELEEV SLSRFSWGAYTAFTIPSI RLMKQDFSV RLPRIFCSC FLWGPRAYA RLLQETELV SLFEGIDFY SLDQSVVEL RLNMFTPYINMFTPYIGV LMIIPLINV TLFIGSHVV SLVIVTTFV VLQWASLAV ILAKFLHWL STAPPHVNV LLLLTVLTVVVLGVVFGI ILHNGAYSL MIMVKCWMI MLGTHTMEV MLGTHTMEV SLADTNSLA LLWAARPRL GVALQTMKQGLYDGMEHL KMVELVHFL YLQLVFGIE MLMAQEALA LMAQEALAF VYDGREHTV YLSGANLNL RMFPNAPYLEAAGIGILT TLDSQVMSL STPPPGTRV KVAELVHFL IMIGVLVGV ALCRWGLLL LLFAGVQCQ VLLCESTAVYLSTAFARV YLLEMLWRL SLDDYNHLV RTLDKVLEV GLPVEYLQV KLIANNTRV FIYAGSLSA KLVANNTRLFLDEFMEGV ALQPGTALL VLDGLDVLL SLYSFPEPE ALYVDSLFF SLLQHLIGL ELTLGEFLK MINAYLDKLAAGIGILTV FLPSDFFPS SVRDRLARL SLREWLLRI LLSAWILTA AAGIGILTV AVPDEIPPL FAYDGKDYIAAGIGILTV FLPSDFFPS AAGIGILTV FLPSDFFPS AAGIGILTV FLWGPRALV ETVSEQSNV ITLWQRPLV
Sequence information
![Page 6: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/6.jpg)
Sequence information
![Page 7: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/7.jpg)
SLLPAIVEL YLLPAIVHI TLWVDPYEV GLVPFLVSV KLLEPVLLL LLDVPTAAV LLDVPTAAV LLDVPTAAVLLDVPTAAV VLFRGGPRG MVDGTLLLL YMNGTMSQV MLLSVPLLL SLLGLLVEV ALLPPINIL TLIKIQHTLHLIDYLVTS ILAPPVVKL ALFPQLVIL GILGFVFTL STNRQSGRQ GLDVLTAKV RILGAVAKV QVCERIPTIILFGHENRV ILMEHIHKL ILDQKINEV SLAGGIIGV LLIENVASL FLLWATAEA SLPDFGISY KKREEAPSLLERPGGNEI ALSNLEVKL ALNELLQHV DLERKVESL FLGENISNF ALSDHHIYL GLSEFTEYL STAPPAHGVPLDGEYFTL GVLVGVALI RTLDKVLEV HLSTAFARV RLDSYVRSL YMNGTMSQV GILGFVFTL ILKEPVHGVILGFVFTLT LLFGYPVYV GLSPTVWLS WLSLLVPFV FLPSDFFPS CLGGLLTMV FIAGNSAYE KLGEFYNQMKLVALGINA DLMGYIPLV RLVTLKDIV MLLAVLYCL AAGIGILTV YLEPGPVTA LLDGTATLR ITDQVPFSVKTWGQYWQV TITDQVPFS AFHHVAREL YLNKIQNSL MMRKLAILS AIMDKNIIL IMDKNIILK SMVGNWAKVSLLAPGAKQ KIFGSLAFL ELVSEFSRM KLTPLCVTL VLYRYGSFS YIGEVLVSV CINGVCWTV VMNILLQYVILTVILGVL KVLEYVIKV FLWGPRALV GLSRYVARL FLLTRILTI HLGNVKYLV GIAGGLALL GLQDCTMLVTGAPVTYST VIYQYMDDL VLPDVFIRC VLPDVFIRC AVGIGIAVV LVVLGLLAV ALGLGLLPV GIGIGVLAAGAGIGVAVL IAGIGILAI LIVIGILIL LAGIGLIAA VDGIGILTI GAGIGVLTA AAGIGIIQI QAGIGILLAKARDPHSGH KACDPHSGH ACDPHSGHF SLYNTVATL RGPGRAFVT NLVPMVATV GLHCYEQLV PLKQHFQIVAVFDRKSDA LLDFVRFMG VLVKSPNHV GLAPPQHLI LLGRNSFEV PLTFGWCYK VLEWRFDSR TLNAWVKVVGLCTLVAML FIDSYICQV IISAVVGIL VMAGVGSPY LLWTLVVLL SVRDRLARL LLMDCSGSI CLTSTVQLVVLHDDLLEA LMWITQCFL SLLMWITQC QLSLLMWIT LLGATCMFV RLTRFLSRV YMDGTMSQV FLTPKKLQCISNDVCAQV VKTDGNPPE SVYDFFVWL FLYGALLLA VLFSSDFRI LMWAKIGPV SLLLELEEV SLSRFSWGAYTAFTIPSI RLMKQDFSV RLPRIFCSC FLWGPRAYA RLLQETELV SLFEGIDFY SLDQSVVEL RLNMFTPYINMFTPYIGV LMIIPLINV TLFIGSHVV SLVIVTTFV VLQWASLAV ILAKFLHWL STAPPHVNV LLLLTVLTVVVLGVVFGI ILHNGAYSL MIMVKCWMI MLGTHTMEV MLGTHTMEV SLADTNSLA LLWAARPRL GVALQTMKQGLYDGMEHL KMVELVHFL YLQLVFGIE MLMAQEALA LMAQEALAF VYDGREHTV YLSGANLNL RMFPNAPYLEAAGIGILT TLDSQVMSL STPPPGTRV KVAELVHFL IMIGVLVGV ALCRWGLLL LLFAGVQCQ VLLCESTAVYLSTAFARV YLLEMLWRL SLDDYNHLV RTLDKVLEV GLPVEYLQV KLIANNTRV FIYAGSLSA KLVANNTRLFLDEFMEGV ALQPGTALL VLDGLDVLL SLYSFPEPE ALYVDSLFF SLLQHLIGL ELTLGEFLK MINAYLDKLAAGIGILTV FLPSDFFPS SVRDRLARL SLREWLLRI LLSAWILTA AAGIGILTV AVPDEIPPL FAYDGKDYIAAGIGILTV FLPSDFFPS AAGIGILTV FLPSDFFPS AAGIGILTV FLWGPRALV ETVSEQSNV ITLWQRPLV
Sequence Information
![Page 8: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/8.jpg)
Sequence Information
Calculate pa at each positionEntropy
Information content
Conserved positions– PV=1, PREST=0 => S=0, I=log(20)
Mutable positions– Paa=1/20 => S=log(20), I=0
€
S = − paa
∑ log(pa )
€
I = log(20) + paa
∑ log(pa )
![Page 9: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/9.jpg)
Information content
A R N D C Q E G H I L K M F P S T W Y V S I1 0.10 0.06 0.01 0.02 0.01 0.02 0.02 0.09 0.01 0.07 0.11 0.06 0.04 0.08 0.01 0.11 0.03 0.01 0.05 0.08 3.96 0.372 0.07 0.00 0.00 0.01 0.01 0.00 0.01 0.01 0.00 0.08 0.59 0.01 0.07 0.01 0.00 0.01 0.06 0.00 0.01 0.08 2.16 2.163 0.08 0.03 0.05 0.10 0.02 0.02 0.01 0.12 0.02 0.03 0.12 0.01 0.03 0.05 0.06 0.06 0.04 0.04 0.04 0.07 4.06 0.264 0.07 0.04 0.02 0.11 0.01 0.04 0.08 0.15 0.01 0.10 0.04 0.03 0.01 0.02 0.09 0.07 0.04 0.02 0.00 0.05 3.87 0.455 0.04 0.04 0.04 0.04 0.01 0.04 0.05 0.16 0.04 0.02 0.08 0.04 0.01 0.06 0.10 0.02 0.06 0.02 0.05 0.09 4.04 0.286 0.04 0.03 0.03 0.01 0.02 0.03 0.03 0.04 0.02 0.14 0.13 0.02 0.03 0.07 0.03 0.05 0.08 0.01 0.03 0.15 3.92 0.407 0.14 0.01 0.03 0.03 0.02 0.03 0.04 0.03 0.05 0.07 0.15 0.01 0.03 0.07 0.06 0.07 0.04 0.03 0.02 0.08 3.98 0.348 0.05 0.09 0.04 0.01 0.01 0.05 0.07 0.05 0.02 0.04 0.14 0.04 0.02 0.05 0.05 0.08 0.10 0.01 0.04 0.03 4.04 0.289 0.07 0.01 0.00 0.00 0.02 0.02 0.02 0.01 0.01 0.08 0.26 0.01 0.01 0.02 0.00 0.04 0.02 0.00 0.01 0.38 2.78 1.55
€
I = log(20) + paa
∑ log(pa )
€
S = − paa
∑ log(pa )
![Page 10: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/10.jpg)
Sequence logos
Height of a column equal to I
Relative height of a letter is pHighly useful tool to visualize sequence motifs
High information positions
HLA-A0201
http://www.cbs.dtu.dk/~gorodkin/appl/plogo.html
![Page 11: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/11.jpg)
Characterizing a sequence motif from small data sets
What can we learn?
1. A at P1 favors binding?2. I is not allowed at P9? 3. K at P4 favors binding?4. Which positions are important
for binding?
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
10 MHC restricted peptides
![Page 12: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/12.jpg)
Simple motifs Yes/No rules
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
10 MHC restricted peptides
€
[AGTK]1[LMIV ]2[ANLV ]3 ...[MNRTVL]9
• Only 11 of 212 peptides identified!• Need more flexible rules
•If not fit P1 but fit P2 then ok• Not all positions are equally important
•We know that P2 and P9 determines binding more than other positions
•Cannot discriminate between good and very good binders
![Page 13: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/13.jpg)
Simple motifsYes/No rules
€
[AGTK]1[LMIV ]2[ANLV ]3 ...[AIFKLV ]7...[MNRTVL]9
• Example
•Two first peptides will not fit the motif
RLLDDTPEV 0.59GLLGNVSTV 0.71ALAKAAAAL 0.47
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
10 MHC restricted peptides
![Page 14: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/14.jpg)
Extended motifs
Fitness of aa at each position given by P(aa)
Example P1PA = 6/10
PG = 2/10
PT = PK = 1/10
PC = PD = …PV = 0
Problems– Few data– Data redundancy/duplication
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
![Page 15: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/15.jpg)
Sequence informationRaw sequence counting
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
![Page 16: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/16.jpg)
Sequence weighting
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
Poor or biased sampling of sequence spaceExample P1
PA = 2/6
PG = 2/6
PT = PK = 1/6
PC = PD = …PV = 0
}Similar sequencesWeight 1/5
Example RLLDDTPEV 0.59 GLLGNVSTV 0.71 ALAKAAAAL 0.47
![Page 17: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/17.jpg)
Sequence weightingALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
![Page 18: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/18.jpg)
Pseudo counts
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
I is not found at position P9. Does this mean that I is forbidden (P(I)=0)?No! Use Blosum substitution matrix to estimate pseudo frequency of I at P9
![Page 19: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/19.jpg)
A R N D C Q E G H I L K M F P S T W Y V A 0.29 0.03 0.03 0.03 0.02 0.03 0.04 0.08 0.01 0.04 0.06 0.04 0.02 0.02 0.03 0.09 0.05 0.01 0.02 0.07 R 0.04 0.34 0.04 0.03 0.01 0.05 0.05 0.03 0.02 0.02 0.05 0.12 0.02 0.02 0.02 0.04 0.03 0.01 0.02 0.03 N 0.04 0.04 0.32 0.08 0.01 0.03 0.05 0.07 0.03 0.02 0.03 0.05 0.01 0.02 0.02 0.07 0.05 0.00 0.02 0.03 D 0.04 0.03 0.07 0.40 0.01 0.03 0.09 0.05 0.02 0.02 0.03 0.04 0.01 0.01 0.02 0.05 0.04 0.00 0.01 0.02 C 0.07 0.02 0.02 0.02 0.48 0.01 0.02 0.03 0.01 0.04 0.07 0.02 0.02 0.02 0.02 0.04 0.04 0.00 0.01 0.06 Q 0.06 0.07 0.04 0.05 0.01 0.21 0.10 0.04 0.03 0.03 0.05 0.09 0.02 0.01 0.02 0.06 0.04 0.01 0.02 0.04 E 0.06 0.05 0.04 0.09 0.01 0.06 0.30 0.04 0.03 0.02 0.04 0.08 0.01 0.02 0.03 0.06 0.04 0.01 0.02 0.03 G 0.08 0.02 0.04 0.03 0.01 0.02 0.03 0.51 0.01 0.02 0.03 0.03 0.01 0.02 0.02 0.05 0.03 0.01 0.01 0.02 H 0.04 0.05 0.05 0.04 0.01 0.04 0.05 0.04 0.35 0.02 0.04 0.05 0.02 0.03 0.02 0.04 0.03 0.01 0.06 0.02 I 0.05 0.02 0.01 0.02 0.02 0.01 0.02 0.02 0.01 0.27 0.17 0.02 0.04 0.04 0.01 0.03 0.04 0.01 0.02 0.18 L 0.04 0.02 0.01 0.02 0.02 0.02 0.02 0.02 0.01 0.12 0.38 0.03 0.05 0.05 0.01 0.02 0.03 0.01 0.02 0.10 K 0.06 0.11 0.04 0.04 0.01 0.05 0.07 0.04 0.02 0.03 0.04 0.28 0.02 0.02 0.03 0.05 0.04 0.01 0.02 0.03 M 0.05 0.03 0.02 0.02 0.02 0.03 0.03 0.03 0.02 0.10 0.20 0.04 0.16 0.05 0.02 0.04 0.04 0.01 0.02 0.09 F 0.03 0.02 0.02 0.02 0.01 0.01 0.02 0.03 0.02 0.06 0.11 0.02 0.03 0.39 0.01 0.03 0.03 0.02 0.09 0.06 P 0.06 0.03 0.02 0.03 0.01 0.02 0.04 0.04 0.01 0.03 0.04 0.04 0.01 0.01 0.49 0.04 0.04 0.00 0.01 0.03 S 0.11 0.04 0.05 0.05 0.02 0.03 0.05 0.07 0.02 0.03 0.04 0.05 0.02 0.02 0.03 0.22 0.08 0.01 0.02 0.04 T 0.07 0.04 0.04 0.04 0.02 0.03 0.04 0.04 0.01 0.05 0.07 0.05 0.02 0.02 0.03 0.09 0.25 0.01 0.02 0.07 W 0.03 0.02 0.02 0.02 0.01 0.02 0.02 0.03 0.02 0.03 0.05 0.02 0.02 0.06 0.01 0.02 0.02 0.49 0.07 0.03 Y 0.04 0.03 0.02 0.02 0.01 0.02 0.03 0.02 0.05 0.04 0.07 0.03 0.02 0.13 0.02 0.03 0.03 0.03 0.32 0.05 V 0.07 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.01 0.16 0.13 0.03 0.03 0.04 0.02 0.03 0.05 0.01 0.02 0.27
The Blosum matrix
Some amino acids are highly conserved (i.e. C), some have a high change of mutation (i.e. I)
![Page 20: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/20.jpg)
• Calculate observed amino acids frequencies fa
• Pseudo frequency for amino acid b
• Example
€
gb = faa
∑ ⋅qb |a
€
gI = 0.2 ⋅qI |M + 0.1⋅qI |N + ...+ 0.3⋅qI |V + 0.1⋅qI |LgI = 0.2 ⋅0.04 + 0.1⋅0.01+ ...+ 0.3⋅0.18 + 0.1⋅0.17 ≈ 0.09
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
Pseudo count estimation
![Page 21: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/21.jpg)
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
Weight on pseudo count
• Pseudo counts are important when only limited data is available
• With large data sets only “true” observation should count
is the effective number of sequences (N-1), is the weight on prior
€
pa =α ⋅ fa + β ⋅gaα + β
![Page 22: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/22.jpg)
• Example
• If large, p ≈ f and only the observed data defines the motif
• If small, p ≈ g and the pseudo counts (or prior) defines the motif
is [50-200] normally
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
Weight on pseudo count
€
pa =α ⋅ fa + β ⋅gaα + β
![Page 23: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/23.jpg)
Sequence weighting and pseudo counts
RLLDDTPEV 0.59GLLGNVSTV 0.71ALAKAAAAL 0.47
P7P and P7S > 0
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
![Page 24: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/24.jpg)
Position specific weighting
We know that positions 2 and 9 are anchor positions for most MHC binding motifs– Increase weight on high
information positions
Motif found on large data set
![Page 25: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/25.jpg)
Weight matrices
Estimate amino acid frequencies from alignment including sequence weighting and pseudo count
What do the numbers mean?– P2(V)>P2(M). Does this mean that V enables binding more than M.– In nature not all amino acids are found equally often
• qA = 0.070, qW = 0.013
• Finding 6% A is hence not significant, but 6% W highly significant • In nature V is found more often than M, so we must somehow rescale
with the background
A R N D C Q E G H I L K M F P S T W Y V1 0.08 0.06 0.02 0.03 0.02 0.02 0.03 0.08 0.02 0.08 0.11 0.06 0.04 0.06 0.02 0.09 0.04 0.01 0.04 0.082 0.04 0.01 0.01 0.01 0.01 0.01 0.02 0.02 0.01 0.11 0.44 0.02 0.06 0.03 0.01 0.02 0.05 0.00 0.01 0.103 0.08 0.04 0.05 0.07 0.02 0.03 0.03 0.08 0.02 0.05 0.11 0.03 0.03 0.06 0.04 0.06 0.05 0.03 0.05 0.074 0.08 0.05 0.03 0.10 0.01 0.05 0.08 0.13 0.01 0.05 0.06 0.05 0.01 0.03 0.08 0.06 0.04 0.02 0.01 0.055 0.06 0.04 0.05 0.03 0.01 0.04 0.05 0.11 0.03 0.04 0.09 0.04 0.02 0.06 0.06 0.04 0.05 0.02 0.05 0.086 0.06 0.03 0.03 0.03 0.03 0.03 0.04 0.06 0.02 0.10 0.14 0.04 0.03 0.05 0.04 0.06 0.06 0.01 0.03 0.137 0.10 0.02 0.04 0.04 0.02 0.03 0.04 0.05 0.04 0.08 0.12 0.02 0.03 0.06 0.07 0.06 0.05 0.03 0.03 0.088 0.05 0.07 0.04 0.03 0.01 0.04 0.06 0.06 0.03 0.06 0.13 0.06 0.02 0.05 0.04 0.08 0.07 0.01 0.04 0.059 0.08 0.02 0.01 0.01 0.02 0.02 0.03 0.02 0.01 0.10 0.23 0.03 0.02 0.04 0.01 0.04 0.04 0.00 0.02 0.25
![Page 26: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/26.jpg)
How to score a sequence to a probability matrix?
• pij describes a motif• The probability that a
peptide fits the motif is
€
p(s | model) = ppap
∏
A R N D C Q E G H I L K M F P S T W Y V1 0.08 0.06 0.02 0.03 0.02 0.02 0.03 0.08 0.02 0.08 0.11 0.06 0.04 0.06 0.02 0.09 0.04 0.01 0.04 0.082 0.04 0.01 0.01 0.01 0.01 0.01 0.02 0.02 0.01 0.11 0.44 0.02 0.06 0.03 0.01 0.02 0.05 0.00 0.01 0.103 0.08 0.04 0.05 0.07 0.02 0.03 0.03 0.08 0.02 0.05 0.11 0.03 0.03 0.06 0.04 0.06 0.05 0.03 0.05 0.074 0.08 0.05 0.03 0.10 0.01 0.05 0.08 0.13 0.01 0.05 0.06 0.05 0.01 0.03 0.08 0.06 0.04 0.02 0.01 0.055 0.06 0.04 0.05 0.03 0.01 0.04 0.05 0.11 0.03 0.04 0.09 0.04 0.02 0.06 0.06 0.04 0.05 0.02 0.05 0.086 0.06 0.03 0.03 0.03 0.03 0.03 0.04 0.06 0.02 0.10 0.14 0.04 0.03 0.05 0.04 0.06 0.06 0.01 0.03 0.137 0.10 0.02 0.04 0.04 0.02 0.03 0.04 0.05 0.04 0.08 0.12 0.02 0.03 0.06 0.07 0.06 0.05 0.03 0.03 0.088 0.05 0.07 0.04 0.03 0.01 0.04 0.06 0.06 0.03 0.06 0.13 0.06 0.02 0.05 0.04 0.08 0.07 0.01 0.04 0.059 0.08 0.02 0.01 0.01 0.02 0.02 0.03 0.02 0.01 0.10 0.23 0.03 0.02 0.04 0.01 0.04 0.04 0.00 0.02 0.25
![Page 27: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/27.jpg)
How to score a sequence to a probability matrix?
• pij describes a motif• The probability that a
peptide fits the motif is
• The probability that the peptide fits a random model is €
p(s | model) = ppap
∏
€
p(s | random) = qap
∏
![Page 28: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/28.jpg)
How to score a sequence to a probability matrix?
• pij describes a motif• The probability that a
peptide fits the motif is• The probability that the
peptide fits a random model is
• The ratio of the two gives the odds
• The log gives the score
€
p(s | model) = ppap
∏
€
p(s | random) = qap
∏
€
0 =
ppap
∏
qap
∏=
ppaqap
∏
€
S = logppaqa
⎛
⎝ ⎜
⎞
⎠ ⎟
p
∑
![Page 29: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/29.jpg)
Weight matrices• A weight matrix is given as
Wij = log(pij/qj)– where i is a position in the motif, and j an amino acid. qj is the background frequency for amino acid j.
• W is a L x 20 matrix, L is motif length
A R N D C Q E G H I L K M F P S T W Y V 1 0.6 0.4 -3.5 -2.4 -0.4 -1.9 -2.7 0.3 -1.1 1.0 0.3 0.0 1.4 1.2 -2.7 1.4 -1.2 -2.0 1.1 0.7 2 -1.6 -6.6 -6.5 -5.4 -2.5 -4.0 -4.7 -3.7 -6.3 1.0 5.1 -3.7 3.1 -4.2 -4.3 -4.2 -0.2 -5.9 -3.8 0.4 3 0.2 -1.3 0.1 1.5 0.0 -1.8 -3.3 0.4 0.5 -1.0 0.3 -2.5 1.2 1.0 -0.1 -0.3 -0.5 3.4 1.6 0.0 4 -0.1 -0.1 -2.0 2.0 -1.6 0.5 0.8 2.0 -3.3 0.1 -1.7 -1.0 -2.2 -1.6 1.7 -0.6 -0.2 1.3 -6.8 -0.7 5 -1.6 -0.1 0.1 -2.2 -1.2 0.4 -0.5 1.9 1.2 -2.2 -0.5 -1.3 -2.2 1.7 1.2 -2.5 -0.1 1.7 1.5 1.0 6 -0.7 -1.4 -1.0 -2.3 1.1 -1.3 -1.4 -0.2 -1.0 1.8 0.8 -1.9 0.2 1.0 -0.4 -0.6 0.4 -0.5 -0.0 2.1 7 1.1 -3.8 -0.2 -1.3 1.3 -0.3 -1.3 -1.4 2.1 0.6 0.7 -5.0 1.1 0.9 1.3 -0.5 -0.9 2.9 -0.4 0.5 8 -2.2 1.0 -0.8 -2.9 -1.4 0.4 0.1 -0.4 0.2 -0.0 1.1 -0.5 -0.5 0.7 -0.3 0.8 0.8 -0.7 1.3 -1.1 9 -0.2 -3.5 -6.1 -4.5 0.7 -0.8 -2.5 -4.0 -2.6 0.9 2.8 -3.0 -1.8 -1.4 -6.2 -1.9 -1.6 -4.9 -1.6 4.5
![Page 30: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/30.jpg)
A R N D C Q E G H I L K M F P S T W Y V E 0.06 0.05 0.04 0.09 0.01 0.06 0.30 0.04 0.03 0.02 0.04 0.08 0.01 0.02 0.03 0.06 0.04 0.01 0.02 0.03 G 0.08 0.02 0.04 0.03 0.01 0.02 0.03 0.51 0.01 0.02 0.03 0.03 0.01 0.02 0.02 0.05 0.03 0.01 0.01 0.02 H 0.04 0.05 0.05 0.04 0.01 0.04 0.05 0.04 0.35 0.02 0.04 0.05 0.02 0.03 0.02 0.04 0.03 0.01 0.06 0.02 I 0.05 0.02 0.01 0.02 0.02 0.01 0.02 0.02 0.01 0.27 0.17 0.02 0.04 0.04 0.01 0.03 0.04 0.01 0.02 0.18 L 0.04 0.02 0.01 0.02 0.02 0.02 0.02 0.02 0.01 0.12 0.38 0.03 0.05 0.05 0.01 0.02 0.03 0.01 0.02 0.10 K 0.06 0.11 0.04 0.04 0.01 0.05 0.07 0.04 0.02 0.03 0.04 0.28 0.02 0.02 0.03 0.05 0.04 0.01 0.02 0.03 M 0.05 0.03 0.02 0.02 0.02 0.03 0.03 0.03 0.02 0.10 0.20 0.04 0.16 0.05 0.02 0.04 0.04 0.01 0.02 0.09 F 0.03 0.02 0.02 0.02 0.01 0.01 0.02 0.03 0.02 0.06 0.11 0.02 0.03 0.39 0.01 0.03 0.03 0.02 0.09 0.06
A R N D C Q E G H I L K M F P S T W Y V 0.08 0.05 0.04 0.05 0.02 0.03 0.06 0.07 0.02 0.06 0.10 0.06 0.02 0.04 0.04 0.06 0.05 0.01 0.03 0.07
Example
Calculate the weight matrix based on the following observation (use =50):
Sequence = I
Important. What is ?
€
gb = faa
∑ ⋅qb |a
€
pa =α ⋅ fa + β ⋅gaα + β
Wij = log(pij/qj)
q
qb|a
![Page 31: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/31.jpg)
Example
So the score is simply the Blosum62 row for amino acid I!!!
This is why is called weight on prior. Our prior knowledge is Blosum. We will only accept a weight matrix different from Blosum if we have many data.
€
gb = faa
∑ ⋅qb |a
€
pa =α ⋅ fa + β ⋅gaα + β
Wij = log(pij/qj)
A R N D C Q E G H I L K M F P S T W Y 0.05 0.02 0.01 0.02 0.02 0.01 0.02 0.02 0.01 0.27 0.17 0.02 0.04 0.04 0.01 0.03 0.04 0.01 0.02 0.08 0.05 0.04 0.05 0 .02 0.03 0.06 0.07 0.02 0.06 0.10 0.06 0.02 0.04 0.04 0.06 0.05 0.01 -0.45 -1.08 -1.12 -1.12 -0.43 -0.94 -1.12 -1.28 -1.08 1.38 0.53 -0.90 0.39 -0.06 -0.98 -0.82 -0.25 -0.79 -0.44 -1.30 -3.11 -3.23 -3.29 -1.25 -2.71 -3.218 -3.69 -3.13 3.99 1.52 -2.60 1.12 -0.18 -2.82 -2.38 -0.72 -2.28 -1.27
![Page 32: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/32.jpg)
Score sequences to weight matrix by looking up and adding L values from the matrix
A R N D C Q E G H I L K M F P S T W Y V 1 0.6 0.4 -3.5 -2.4 -0.4 -1.9 -2.7 0.3 -1.1 1.0 0.3 0.0 1.4 1.2 -2.7 1.4 -1.2 -2.0 1.1 0.7 2 -1.6 -6.6 -6.5 -5.4 -2.5 -4.0 -4.7 -3.7 -6.3 1.0 5.1 -3.7 3.1 -4.2 -4.3 -4.2 -0.2 -5.9 -3.8 0.4 3 0.2 -1.3 0.1 1.5 0.0 -1.8 -3.3 0.4 0.5 -1.0 0.3 -2.5 1.2 1.0 -0.1 -0.3 -0.5 3.4 1.6 0.0 4 -0.1 -0.1 -2.0 2.0 -1.6 0.5 0.8 2.0 -3.3 0.1 -1.7 -1.0 -2.2 -1.6 1.7 -0.6 -0.2 1.3 -6.8 -0.7 5 -1.6 -0.1 0.1 -2.2 -1.2 0.4 -0.5 1.9 1.2 -2.2 -0.5 -1.3 -2.2 1.7 1.2 -2.5 -0.1 1.7 1.5 1.0 6 -0.7 -1.4 -1.0 -2.3 1.1 -1.3 -1.4 -0.2 -1.0 1.8 0.8 -1.9 0.2 1.0 -0.4 -0.6 0.4 -0.5 -0.0 2.1 7 1.1 -3.8 -0.2 -1.3 1.3 -0.3 -1.3 -1.4 2.1 0.6 0.7 -5.0 1.1 0.9 1.3 -0.5 -0.9 2.9 -0.4 0.5 8 -2.2 1.0 -0.8 -2.9 -1.4 0.4 0.1 -0.4 0.2 -0.0 1.1 -0.5 -0.5 0.7 -0.3 0.8 0.8 -0.7 1.3 -1.1 9 -0.2 -3.5 -6.1 -4.5 0.7 -0.8 -2.5 -4.0 -2.6 0.9 2.8 -3.0 -1.8 -1.4 -6.2 -1.9 -1.6 -4.9 -1.6 4.5
Scoring a sequence to a weight matrix
RLLDDTPEVGLLGNVSTVALAKAAAAL
Which peptide is most likely to bind?Which peptide second?
11.9 14.7 4.3
0.59 0.71 0.47
![Page 33: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/33.jpg)
Example from real life
• 10 peptides from MHCpep database
• Bind to the MHC complex
• Relevant for immune system recognition
• Estimate sequence motif and weight matrix
• Evaluate motif “correctness” on 528 peptides
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
![Page 34: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/34.jpg)
Prediction accuracy
Pearson correlation 0.45
![Page 35: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/35.jpg)
Predictive performance
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
Pearsons correlation
CC 0.45 0.5 0.6 0.65 0.79
Simple Seq.W Seq.W+SCSeq.W+SC+PW
Large dataset
![Page 36: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/36.jpg)
End of first part
Take a deep breathSmile to you neighbor
![Page 37: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/37.jpg)
Hidden Markov Models
• Weight matrices do not deal with insertions and deletions
• In alignments, this is done in an ad-hoc manner by optimization of the two gap penalties for first gap and gap extension
• HMM is a natural frame work where insertions/deletions are dealt with explicitly
![Page 38: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/38.jpg)
Why hidden?
Model generates numbers– 312453666641
Does not tell which die was used
Alignment (decoding) can give the most probable solution/path (Viterby)– FFFFFFLLLLLL
1:1/62:1/63:1/64:1/65:1/66:1/6Fair
1:1/102:1/103:1/104:1/105:1/106:1/2Loaded
0.95
0.10
0.05
0.9
The unfair casino: Loaded die p(6) = 0.5; switch fair to load:0.05; switch load to fair: 0.1
![Page 39: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/39.jpg)
HMM (a simple example)
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
• Example from A. Krogh• Core region defines the
number of states in the HMM (red)
• Insertion and deletion statistics are derived from the non-core part of the alignment (black)
Core of alignment
![Page 40: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/40.jpg)
.2
.8
.2
ACGT
ACGT
ACGT
ACGT
ACGT
ACGT.8
.8 .8.8
.2.2.2
.2
1
ACGT
.2
.2
.4
1. .4 1. 1.1.
.6.6
.4
HMM construction
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
• 5 matches. A, 2xC, T, G• 5 transitions in gap region
• C out, G out• A-C, C-T, T out• Out transition 3/5• Stay transition 2/5
ACA---ATG 0.8x1x0.8x1x0.8x0.4x1x1x0.8x1x0.2 = 3.3x10-2
![Page 41: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/41.jpg)
Align sequence to HMM
ACA---ATG 0.8x1x0.8x1x0.8x0.4x1x0.8x1x0.2 = 3.3x10-2
TCAACTATC 0.2x1x0.8x1x0.8x0.6x0.2x0.4x0.4x0.4x0.2x0.6x1x1x0.8x1x0.8 = 0.0075x10-2
ACAC--AGC = 1.2x10-2
AGA---ATC = 3.3x10-2
ACCG--ATC = 0.59x10-2
Consensus:
ACAC--ATC = 4.7x10-2, ACA---ATC = 13.1x10-2
Exceptional:
TGCT--AGG = 0.0023x10-2
![Page 42: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/42.jpg)
Align sequence to HMM - Null model
• Score depends strongly on length
• Null model is a random model. For length L the score is 0.25L
• Log-odds score for sequence S
Log( P(S)/0.25L)• Positive score means
more likely than Null model
ACA---ATG = 4.9
TCAACTATC = 3.0 ACAC--AGC = 5.3AGA---ATC = 4.9ACCG--ATC = 4.6Consensus:ACAC--ATC = 6.7 ACA---ATC = 6.3Exceptional:TGCT--AGG = -0.97
Note!
![Page 43: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/43.jpg)
Model decoding (Viterby)The unfair casino
Example: 1245666
1 2 4 5 6 6 6
F -0.78 -1.58 -2.38 -3.18 -3.98 -4.78 -5.58
L Null -3.08 -3.88 -4.68 -4.78 -5.13 -5.48
FFFFLLL
1:-0.782:-0.783:-0.784:-0.785:-0.786:-0-78
Fair
1:-12:-13:-14:-15:-16:-0.3Loaded
-0.02
-1
-1.3
-0.05Log model
€
2F = −0.78 − 0.02 − 0.78 = −1.58
€
Pl (i +1) = pl (i +1) • maxk
(Pk (i) • akl ) or
log(Pl (i +1) = log(pl (i −1)) + maxk
(log(Pk (i) + log(akl ))
FFFFLLL
![Page 44: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/44.jpg)
HMM’s and weight matrices
• In the case of un-gapped alignments HMM’s become simple weight matrices
• To achieve high performance, the emission frequencies are estimated using the techniques of – Sequence weighting– Pseudo counts
![Page 45: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/45.jpg)
Profile HMM’s
• Alignments based on conventional scoring matrices (BLOSUM62) scores all positions in a sequence in an equal manner
• Some positions are highly conserved, some are highly variable (more than what is described in the BLOSUM matrix)
• Profile HMM’s are ideal suited to describe such position specific variations
![Page 46: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/46.jpg)
ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---IIE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD----TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---VASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVPTVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
Profile HMM’s
Conserved
Core: Position with < 2 gaps
Deletion
Insertion
Non-conserved
Must have a G Any thing can match
![Page 47: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/47.jpg)
Profile HMM’s
All M/D pairs must be visited once
L1- Y2A3V4R5- I6
P1D2P3P4I4P5D6P7
![Page 48: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/48.jpg)
Example. Sequence profiles
• Alignment of protein sequences 1PLC._ and 1GYC.A• E-value > 1000• Profile alignment
– Align 1PLC._ against Swiss-prot– Make position specific weight matrix from
alignment– Use this matrix to align 1PLC._ against 1GYC.A
• E-value < 10-22. Rmsd=3.3
![Page 49: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/49.jpg)
Example continued
Smith-Waterman score: 53; 26.2% identity in 61 aa overlap
10 20 30 1PLC._ IDVLLGADDGSLAFVPSEFSISPG--EKIV-----FKNNAG :: .: : .:: .: . :... 1GYC.A ILRYQGAPVAEPTTTQTTSVIPLIETNLHPLARMPVPGSPTPGGVDKALNLAFNFNGTNF 280 290 300 310 320 330
40 50 60 70 80 90 1PLC._ FPHNIVFDEDSIPSGVDASKISMSEEDLLNAKGETFEVALSNKGEYSFYCSPHQGAGMVG : .: : ..: .. . ... .::: : 1GYC.A FINNASFTPPTVPVLLQILSGAQTAQDLLPAGSVYPLPAHSTIEITLPATALAPGAPHPF 340 350 360 370 380 390
1PLC._ KVTVN 1GYC.A HLHGHAFAVVRSAGSTTYNYNDPIFRDVVSTGTPAAGDNVTIRFQTDNPGPWFLHCHIDF
400 410 420 430 440 450
![Page 50: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/50.jpg)
Example continued
Score = 97.1 bits (241), Expect = 9e-22 Identities = 13/107 (12%), Positives = 27/107 (25%), Gaps = 17/107 (15%) Query: 3 ADDGSLAFVPSEFSISPGEKI------VFKNNAGFPHNIVFDEDSIPSGVDASKIS 56 F + G++ N+ + +G + +Sbjct: 26 ------VFPSPLITGKKGDRFQLNVVDTLTNHTMLKSTSIHWHGFFQAGTNWADGP 79 Query: 57 MSEEDLLNAKGETFEVAL---SNKGEYSFYCSP--HQGAGMVGKVTV 98 A G +F G + ++ G+ G VSbjct: 80 AFVNQCPIASGHSFLYDFHVPDQAGTFWYHSHLSTQYCDGLRGPFVV 126
Rmsd=3.3 ÅModel redStructure blue
![Page 51: Sequence motifs, information content, logos, and HMM’s](https://reader035.fdocuments.in/reader035/viewer/2022062500/56815367550346895dc16dea/html5/thumbnails/51.jpg)
HMM packages
• HMMER (http://hmmer.wustl.edu/)– S.R. Eddy, WashU St. Louis. Freely available.
• SAM (http://www.cse.ucsc.edu/research/compbio/sam.html)– R. Hughey, K. Karplus, A. Krogh, D. Haussler and others, UC Santa
Cruz. Freely available to academia, nominal license fee for commercial users.
• META-MEME (http://metameme.sdsc.edu/)– William Noble Grundy, UC San Diego. Freely available. Combines
features of PSSM search and profile HMM search.
• NET-ID, HMMpro (http://www.netid.com/html/hmmpro.html)– Freely available to academia, nominal license fee for commercial
users.– Allows HMM architecture construction.
• EasyGibbs (http://www.cbs.dtu.dk/biotools/EasyGibbs/)– Webserver for Gibbs sampling of proteins sequences