Sequence motifs, information content, logos, and HMM’s
-
Upload
riley-preston -
Category
Documents
-
view
31 -
download
0
description
Transcript of Sequence motifs, information content, logos, and HMM’s
Sequence motifs, information content,
logos, and HMM’s
Morten Nielsen,CBS, BioCentrum,
DTU
Objectives
• Visualization of binding motifs– Construction of sequence logos
• Understand the concepts of weight matrix construction– One of the most important methods of bioinformatics
• Introduce Hidden Markov models and understand that they are just weight matrices with gaps
Anchor positions
Binding Motif. MHC class I with peptide
SLLPAIVEL YLLPAIVHI TLWVDPYEV GLVPFLVSV KLLEPVLLL LLDVPTAAV LLDVPTAAV LLDVPTAAVLLDVPTAAV VLFRGGPRG MVDGTLLLL YMNGTMSQV MLLSVPLLL SLLGLLVEV ALLPPINIL TLIKIQHTLHLIDYLVTS ILAPPVVKL ALFPQLVIL GILGFVFTL STNRQSGRQ GLDVLTAKV RILGAVAKV QVCERIPTIILFGHENRV ILMEHIHKL ILDQKINEV SLAGGIIGV LLIENVASL FLLWATAEA SLPDFGISY KKREEAPSLLERPGGNEI ALSNLEVKL ALNELLQHV DLERKVESL FLGENISNF ALSDHHIYL GLSEFTEYL STAPPAHGVPLDGEYFTL GVLVGVALI RTLDKVLEV HLSTAFARV RLDSYVRSL YMNGTMSQV GILGFVFTL ILKEPVHGVILGFVFTLT LLFGYPVYV GLSPTVWLS WLSLLVPFV FLPSDFFPS CLGGLLTMV FIAGNSAYE KLGEFYNQMKLVALGINA DLMGYIPLV RLVTLKDIV MLLAVLYCL AAGIGILTV YLEPGPVTA LLDGTATLR ITDQVPFSVKTWGQYWQV TITDQVPFS AFHHVAREL YLNKIQNSL MMRKLAILS AIMDKNIIL IMDKNIILK SMVGNWAKVSLLAPGAKQ KIFGSLAFL ELVSEFSRM KLTPLCVTL VLYRYGSFS YIGEVLVSV CINGVCWTV VMNILLQYVILTVILGVL KVLEYVIKV FLWGPRALV GLSRYVARL FLLTRILTI HLGNVKYLV GIAGGLALL GLQDCTMLVTGAPVTYST VIYQYMDDL VLPDVFIRC VLPDVFIRC AVGIGIAVV LVVLGLLAV ALGLGLLPV GIGIGVLAAGAGIGVAVL IAGIGILAI LIVIGILIL LAGIGLIAA VDGIGILTI GAGIGVLTA AAGIGIIQI QAGIGILLAKARDPHSGH KACDPHSGH ACDPHSGHF SLYNTVATL RGPGRAFVT NLVPMVATV GLHCYEQLV PLKQHFQIVAVFDRKSDA LLDFVRFMG VLVKSPNHV GLAPPQHLI LLGRNSFEV PLTFGWCYK VLEWRFDSR TLNAWVKVVGLCTLVAML FIDSYICQV IISAVVGIL VMAGVGSPY LLWTLVVLL SVRDRLARL LLMDCSGSI CLTSTVQLVVLHDDLLEA LMWITQCFL SLLMWITQC QLSLLMWIT LLGATCMFV RLTRFLSRV YMDGTMSQV FLTPKKLQCISNDVCAQV VKTDGNPPE SVYDFFVWL FLYGALLLA VLFSSDFRI LMWAKIGPV SLLLELEEV SLSRFSWGAYTAFTIPSI RLMKQDFSV RLPRIFCSC FLWGPRAYA RLLQETELV SLFEGIDFY SLDQSVVEL RLNMFTPYINMFTPYIGV LMIIPLINV TLFIGSHVV SLVIVTTFV VLQWASLAV ILAKFLHWL STAPPHVNV LLLLTVLTVVVLGVVFGI ILHNGAYSL MIMVKCWMI MLGTHTMEV MLGTHTMEV SLADTNSLA LLWAARPRL GVALQTMKQGLYDGMEHL KMVELVHFL YLQLVFGIE MLMAQEALA LMAQEALAF VYDGREHTV YLSGANLNL RMFPNAPYLEAAGIGILT TLDSQVMSL STPPPGTRV KVAELVHFL IMIGVLVGV ALCRWGLLL LLFAGVQCQ VLLCESTAVYLSTAFARV YLLEMLWRL SLDDYNHLV RTLDKVLEV GLPVEYLQV KLIANNTRV FIYAGSLSA KLVANNTRLFLDEFMEGV ALQPGTALL VLDGLDVLL SLYSFPEPE ALYVDSLFF SLLQHLIGL ELTLGEFLK MINAYLDKLAAGIGILTV FLPSDFFPS SVRDRLARL SLREWLLRI LLSAWILTA AAGIGILTV AVPDEIPPL FAYDGKDYIAAGIGILTV FLPSDFFPS AAGIGILTV FLPSDFFPS AAGIGILTV FLWGPRALV ETVSEQSNV ITLWQRPLV
Sequence information
Outline
• Pattern recognition– Regular expressions and probabilities
• Information content– Sequence logos
• Multiple alignment and sequence motifs
• Weight matrix construction– Sequence weighting– Low (pseudo) counts
• Examples from the real world
• HMM’s – Viterbi decoding
• HMM’s in immunology– Profile HMMs– TAP binding
• MHC class II binding– Gibbs sampling
• Links to HMM packages
Sequence Information
• Say that a peptide must have L at P2 in order to bind, and that A,F,W,and Y are found at P1. Which position has most information? • How many questions do I need to ask to tell if a peptide binds looking at only P1 or P2?
Sequence Information
• Say that a peptide must have L at P2 in order to bind, and that A,F,W,and Y are found at P1. Which position has most information? • How many questions do I need to ask to tell if a peptide binds looking at only P1 or P2?
• P1: 4 questions (at most)• P2: 1 question (L or not)• P2 has the most information
Sequence Information
Calculate pa at each position
Entropy
Information content
Conserved positions– PV=1, P!v=0 => S=0, I=log(20)
Mutable positions– Paa=1/20 => S=log(20), I=0
• Say that a peptide must have L at P2 in order to bind, and that A,F,W,and Y are found at P1. Which position has most information? • How many questions do I need to ask to tell if a peptide binds looking at only P1 or P2?
• P1: 4 questions (at most)• P2: 1 question (L or not)• P2 has the most information
Information content
A R N D C Q E G H I L K M F P S T W Y V S I1 0.10 0.06 0.01 0.02 0.01 0.02 0.02 0.09 0.01 0.07 0.11 0.06 0.04 0.08 0.01 0.11 0.03 0.01 0.05 0.08 3.96 0.372 0.07 0.00 0.00 0.01 0.01 0.00 0.01 0.01 0.00 0.08 0.59 0.01 0.07 0.01 0.00 0.01 0.06 0.00 0.01 0.08 2.16 2.163 0.08 0.03 0.05 0.10 0.02 0.02 0.01 0.12 0.02 0.03 0.12 0.01 0.03 0.05 0.06 0.06 0.04 0.04 0.04 0.07 4.06 0.264 0.07 0.04 0.02 0.11 0.01 0.04 0.08 0.15 0.01 0.10 0.04 0.03 0.01 0.02 0.09 0.07 0.04 0.02 0.00 0.05 3.87 0.455 0.04 0.04 0.04 0.04 0.01 0.04 0.05 0.16 0.04 0.02 0.08 0.04 0.01 0.06 0.10 0.02 0.06 0.02 0.05 0.09 4.04 0.286 0.04 0.03 0.03 0.01 0.02 0.03 0.03 0.04 0.02 0.14 0.13 0.02 0.03 0.07 0.03 0.05 0.08 0.01 0.03 0.15 3.92 0.407 0.14 0.01 0.03 0.03 0.02 0.03 0.04 0.03 0.05 0.07 0.15 0.01 0.03 0.07 0.06 0.07 0.04 0.03 0.02 0.08 3.98 0.348 0.05 0.09 0.04 0.01 0.01 0.05 0.07 0.05 0.02 0.04 0.14 0.04 0.02 0.05 0.05 0.08 0.10 0.01 0.04 0.03 4.04 0.289 0.07 0.01 0.00 0.00 0.02 0.02 0.02 0.01 0.01 0.08 0.26 0.01 0.01 0.02 0.00 0.04 0.02 0.00 0.01 0.38 2.78 1.55
Sequence logos
Height of a column equal to IRelative height of a letter is pHighly useful tool to visualize sequence motifs
High information positions
HLA-A0201
http://www.cbs.dtu.dk/~gorodkin/appl/plogo.html
Characterizing a binding motif from small data sets
What can we learn?
1. A at P1 favors binding?
2. I is not allowed at P9?
3. K at P4 favors binding?
4. Which positions are important for binding?
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
10 MHC restricted peptides
Simple motifs Yes/No rules
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
10 MHC restricted peptides
• Only 11 of 212 peptides identified!• Need more flexible rules
•If not fit P1 but fit P2 then ok
• Not all positions are equally important
•We know that P2 and P9 determines binding more than other positions
•Cannot discriminate between good and very good binders
Simple motifsYes/No rules
• Example
•Two first peptides will not fit the motif. They are all good binders (aff< 500nM)
RLLDDTPEV 84 nMGLLGNVSTV 23 nMALAKAAAAL 309 nM
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
10 MHC restricted peptides
Extended motifs
Fitness of aa at each position given by P(aa)
Example P1PA = 6/10
PG = 2/10
PT = PK = 1/10
PC = PD = …PV = 0
Problems– Few data– Data redundancy/duplication
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
RLLDDTPEV 84 nMGLLGNVSTV 23 nMALAKAAAAL 309 nM
Sequence informationRaw sequence counting
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
Sequence weighting
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
Poor or biased sampling of sequence spaceExample P1
PA = 2/6
PG = 2/6
PT = PK = 1/6
PC = PD = …PV = 0
}Similar sequencesWeight 1/5
RLLDDTPEV 84 nMGLLGNVSTV 23 nMALAKAAAAL 309 nM
Sequence weightingALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
Pseudo counts
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
I is not found at position P9. Does this mean that I is forbidden (P(I)=0)?No! Use Blosum substitution matrix to estimate pseudo frequency of I at P9
A R N D C Q E G H I L K M F P S T W Y V A 0.29 0.03 0.03 0.03 0.02 0.03 0.04 0.08 0.01 0.04 0.06 0.04 0.02 0.02 0.03 0.09 0.05 0.01 0.02 0.07 R 0.04 0.34 0.04 0.03 0.01 0.05 0.05 0.03 0.02 0.02 0.05 0.12 0.02 0.02 0.02 0.04 0.03 0.01 0.02 0.03 N 0.04 0.04 0.32 0.08 0.01 0.03 0.05 0.07 0.03 0.02 0.03 0.05 0.01 0.02 0.02 0.07 0.05 0.00 0.02 0.03 D 0.04 0.03 0.07 0.40 0.01 0.03 0.09 0.05 0.02 0.02 0.03 0.04 0.01 0.01 0.02 0.05 0.04 0.00 0.01 0.02 C 0.07 0.02 0.02 0.02 0.48 0.01 0.02 0.03 0.01 0.04 0.07 0.02 0.02 0.02 0.02 0.04 0.04 0.00 0.01 0.06 Q 0.06 0.07 0.04 0.05 0.01 0.21 0.10 0.04 0.03 0.03 0.05 0.09 0.02 0.01 0.02 0.06 0.04 0.01 0.02 0.04 E 0.06 0.05 0.04 0.09 0.01 0.06 0.30 0.04 0.03 0.02 0.04 0.08 0.01 0.02 0.03 0.06 0.04 0.01 0.02 0.03 G 0.08 0.02 0.04 0.03 0.01 0.02 0.03 0.51 0.01 0.02 0.03 0.03 0.01 0.02 0.02 0.05 0.03 0.01 0.01 0.02 H 0.04 0.05 0.05 0.04 0.01 0.04 0.05 0.04 0.35 0.02 0.04 0.05 0.02 0.03 0.02 0.04 0.03 0.01 0.06 0.02 I 0.05 0.02 0.01 0.02 0.02 0.01 0.02 0.02 0.01 0.27 0.17 0.02 0.04 0.04 0.01 0.03 0.04 0.01 0.02 0.18 L 0.04 0.02 0.01 0.02 0.02 0.02 0.02 0.02 0.01 0.12 0.38 0.03 0.05 0.05 0.01 0.02 0.03 0.01 0.02 0.10 K 0.06 0.11 0.04 0.04 0.01 0.05 0.07 0.04 0.02 0.03 0.04 0.28 0.02 0.02 0.03 0.05 0.04 0.01 0.02 0.03 M 0.05 0.03 0.02 0.02 0.02 0.03 0.03 0.03 0.02 0.10 0.20 0.04 0.16 0.05 0.02 0.04 0.04 0.01 0.02 0.09 F 0.03 0.02 0.02 0.02 0.01 0.01 0.02 0.03 0.02 0.06 0.11 0.02 0.03 0.39 0.01 0.03 0.03 0.02 0.09 0.06 P 0.06 0.03 0.02 0.03 0.01 0.02 0.04 0.04 0.01 0.03 0.04 0.04 0.01 0.01 0.49 0.04 0.04 0.00 0.01 0.03 S 0.11 0.04 0.05 0.05 0.02 0.03 0.05 0.07 0.02 0.03 0.04 0.05 0.02 0.02 0.03 0.22 0.08 0.01 0.02 0.04 T 0.07 0.04 0.04 0.04 0.02 0.03 0.04 0.04 0.01 0.05 0.07 0.05 0.02 0.02 0.03 0.09 0.25 0.01 0.02 0.07 W 0.03 0.02 0.02 0.02 0.01 0.02 0.02 0.03 0.02 0.03 0.05 0.02 0.02 0.06 0.01 0.02 0.02 0.49 0.07 0.03 Y 0.04 0.03 0.02 0.02 0.01 0.02 0.03 0.02 0.05 0.04 0.07 0.03 0.02 0.13 0.02 0.03 0.03 0.03 0.32 0.05 V 0.07 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.01 0.16 0.13 0.03 0.03 0.04 0.02 0.03 0.05 0.01 0.02 0.27
The Blosum matrix
Some amino acids are highly conserved (i.e. C), some have a high change of mutation (i.e. I)
A R N D C Q E G H I L K M F P S T W Y V A 0.29 0.03 0.03 0.03 0.02 0.03 0.04 0.08 0.01 0.04 0.06 0.04 0.02 0.02 0.03 0.09 0.05 0.01 0.02 0.07 R 0.04 0.34 0.04 0.03 0.01 0.05 0.05 0.03 0.02 0.02 0.05 0.12 0.02 0.02 0.02 0.04 0.03 0.01 0.02 0.03 N 0.04 0.04 0.32 0.08 0.01 0.03 0.05 0.07 0.03 0.02 0.03 0.05 0.01 0.02 0.02 0.07 0.05 0.00 0.02 0.03 D 0.04 0.03 0.07 0.40 0.01 0.03 0.09 0.05 0.02 0.02 0.03 0.04 0.01 0.01 0.02 0.05 0.04 0.00 0.01 0.02 C 0.07 0.02 0.02 0.02 0.48 0.01 0.02 0.03 0.01 0.04 0.07 0.02 0.02 0.02 0.02 0.04 0.04 0.00 0.01 0.06 …. Y 0.04 0.03 0.02 0.02 0.01 0.02 0.03 0.02 0.05 0.04 0.07 0.03 0.02 0.13 0.02 0.03 0.03 0.03 0.32 0.05 V 0.07 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.01 0.16 0.13 0.03 0.03 0.04 0.02 0.03 0.05 0.01 0.02 0.27
What is a pseudo count?
• Say V is observed at P2• Knowing that V at P2 binds, what is the probability that a peptide could have I at P2?
• P(I|V) = 0.16
• Calculate observed amino acids frequencies fa
• Pseudo frequency for amino acid b
• Example
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
Pseudo count estimation
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
Weight on pseudo count
• Pseudo counts are important when only limited data is available
• With large data sets only “true” observation should count
is the effective number of sequences (N-1), is the weight on prior
• Example
• If large, p ≈ f and only the observed data defines the motif
• If small, p ≈ g and the pseudo counts (or prior) defines the motif
is [50-200] normally
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
Weight on pseudo count
Sequence weighting and pseudo counts
RLLDDTPEV 84nMGLLGNVSTV 23nMALAKAAAAL 309nM
P7P and P7S > 0
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
Position specific weighting
We know that positions 2 and 9 are anchor positions for most MHC binding motifs– Increase weight on
high information positions
Motif found on large data set
Weight matrices
Estimate amino acid frequencies from alignment including sequence weighting and pseudo count
What do the numbers mean?– P2(V)>P2(M). Does this mean that V enables binding more
than M.– In nature not all amino acids are found equally often
• In nature V is found more often than M, so we must somehow rescale with the background
• qM = 0.025, qV = 0.073• Finding 7% V is hence not significant, but 7% M highly significant
A R N D C Q E G H I L K M F P S T W Y V1 0.08 0.06 0.02 0.03 0.02 0.02 0.03 0.08 0.02 0.08 0.11 0.06 0.04 0.06 0.02 0.09 0.04 0.01 0.04 0.082 0.04 0.01 0.01 0.01 0.01 0.01 0.02 0.02 0.01 0.11 0.44 0.02 0.06 0.03 0.01 0.02 0.05 0.00 0.01 0.103 0.08 0.04 0.05 0.07 0.02 0.03 0.03 0.08 0.02 0.05 0.11 0.03 0.03 0.06 0.04 0.06 0.05 0.03 0.05 0.074 0.08 0.05 0.03 0.10 0.01 0.05 0.08 0.13 0.01 0.05 0.06 0.05 0.01 0.03 0.08 0.06 0.04 0.02 0.01 0.055 0.06 0.04 0.05 0.03 0.01 0.04 0.05 0.11 0.03 0.04 0.09 0.04 0.02 0.06 0.06 0.04 0.05 0.02 0.05 0.086 0.06 0.03 0.03 0.03 0.03 0.03 0.04 0.06 0.02 0.10 0.14 0.04 0.03 0.05 0.04 0.06 0.06 0.01 0.03 0.137 0.10 0.02 0.04 0.04 0.02 0.03 0.04 0.05 0.04 0.08 0.12 0.02 0.03 0.06 0.07 0.06 0.05 0.03 0.03 0.088 0.05 0.07 0.04 0.03 0.01 0.04 0.06 0.06 0.03 0.06 0.13 0.06 0.02 0.05 0.04 0.08 0.07 0.01 0.04 0.059 0.08 0.02 0.01 0.01 0.02 0.02 0.03 0.02 0.01 0.10 0.23 0.03 0.02 0.04 0.01 0.04 0.04 0.00 0.02 0.25
Weight matrices• A weight matrix is given as
Wij = log(pij/qj)– where i is a position in the motif, and j an amino acid. qj is the background frequency for amino acid j.
• W is a L x 20 matrix, L is motif length
A R N D C Q E G H I L K M F P S T W Y V 1 0.6 0.4 -3.5 -2.4 -0.4 -1.9 -2.7 0.3 -1.1 1.0 0.3 0.0 1.4 1.2 -2.7 1.4 -1.2 -2.0 1.1 0.7 2 -1.6 -6.6 -6.5 -5.4 -2.5 -4.0 -4.7 -3.7 -6.3 1.0 5.1 -3.7 3.1 -4.2 -4.3 -4.2 -0.2 -5.9 -3.8 0.4 3 0.2 -1.3 0.1 1.5 0.0 -1.8 -3.3 0.4 0.5 -1.0 0.3 -2.5 1.2 1.0 -0.1 -0.3 -0.5 3.4 1.6 0.0 4 -0.1 -0.1 -2.0 2.0 -1.6 0.5 0.8 2.0 -3.3 0.1 -1.7 -1.0 -2.2 -1.6 1.7 -0.6 -0.2 1.3 -6.8 -0.7 5 -1.6 -0.1 0.1 -2.2 -1.2 0.4 -0.5 1.9 1.2 -2.2 -0.5 -1.3 -2.2 1.7 1.2 -2.5 -0.1 1.7 1.5 1.0 6 -0.7 -1.4 -1.0 -2.3 1.1 -1.3 -1.4 -0.2 -1.0 1.8 0.8 -1.9 0.2 1.0 -0.4 -0.6 0.4 -0.5 -0.0 2.1 7 1.1 -3.8 -0.2 -1.3 1.3 -0.3 -1.3 -1.4 2.1 0.6 0.7 -5.0 1.1 0.9 1.3 -0.5 -0.9 2.9 -0.4 0.5 8 -2.2 1.0 -0.8 -2.9 -1.4 0.4 0.1 -0.4 0.2 -0.0 1.1 -0.5 -0.5 0.7 -0.3 0.8 0.8 -0.7 1.3 -1.1 9 -0.2 -3.5 -6.1 -4.5 0.7 -0.8 -2.5 -4.0 -2.6 0.9 2.8 -3.0 -1.8 -1.4 -6.2 -1.9 -1.6 -4.9 -1.6 4.5
Score sequences to weight matrix by looking up and adding L values from the matrix
A R N D C Q E G H I L K M F P S T W Y V 1 0.6 0.4 -3.5 -2.4 -0.4 -1.9 -2.7 0.3 -1.1 1.0 0.3 0.0 1.4 1.2 -2.7 1.4 -1.2 -2.0 1.1 0.7 2 -1.6 -6.6 -6.5 -5.4 -2.5 -4.0 -4.7 -3.7 -6.3 1.0 5.1 -3.7 3.1 -4.2 -4.3 -4.2 -0.2 -5.9 -3.8 0.4 3 0.2 -1.3 0.1 1.5 0.0 -1.8 -3.3 0.4 0.5 -1.0 0.3 -2.5 1.2 1.0 -0.1 -0.3 -0.5 3.4 1.6 0.0 4 -0.1 -0.1 -2.0 2.0 -1.6 0.5 0.8 2.0 -3.3 0.1 -1.7 -1.0 -2.2 -1.6 1.7 -0.6 -0.2 1.3 -6.8 -0.7 5 -1.6 -0.1 0.1 -2.2 -1.2 0.4 -0.5 1.9 1.2 -2.2 -0.5 -1.3 -2.2 1.7 1.2 -2.5 -0.1 1.7 1.5 1.0 6 -0.7 -1.4 -1.0 -2.3 1.1 -1.3 -1.4 -0.2 -1.0 1.8 0.8 -1.9 0.2 1.0 -0.4 -0.6 0.4 -0.5 -0.0 2.1 7 1.1 -3.8 -0.2 -1.3 1.3 -0.3 -1.3 -1.4 2.1 0.6 0.7 -5.0 1.1 0.9 1.3 -0.5 -0.9 2.9 -0.4 0.5 8 -2.2 1.0 -0.8 -2.9 -1.4 0.4 0.1 -0.4 0.2 -0.0 1.1 -0.5 -0.5 0.7 -0.3 0.8 0.8 -0.7 1.3 -1.1 9 -0.2 -3.5 -6.1 -4.5 0.7 -0.8 -2.5 -4.0 -2.6 0.9 2.8 -3.0 -1.8 -1.4 -6.2 -1.9 -1.6 -4.9 -1.6 4.5
Scoring a sequence to a weight matrix
RLLDDTPEVGLLGNVSTVALAKAAAAL
Which peptide is most likely to bind?Which peptide second?
11.9 14.7 4.3
84nM 23nM 309nM
Example from real life
• 10 peptides from MHCpep database
• Bind to the MHC complex
• Relevant for immune system recognition
• Estimate sequence motif and weight matrix
• Evaluate motif “correctness” on 528 peptides
ALAKAAAAMALAKAAAANALAKAAAARALAKAAAATALAKAAAAVGMNERPILTGILGFVFTMTLNAWVKVVKLNEPVLLLAVVPFIVSV
Prediction accuracy
Pearson correlation 0.45
Prediction score
Measu
red
affi
nit
y
Predictive performance
End of first part
Take a deep breathSmile to you neighbor
Hidden Markov Models
• Weight matrices do not deal with insertions and deletions
• In alignments, this is done in an ad-hoc manner by optimization of the two gap penalties for first gap and gap extension
• HMM is a natural frame work where insertions/deletions are dealt with explicitly
Why hidden?
Model generates numbers– 312453666641
Does not tell which die was used
Alignment (decoding) can give the most probable solution/path (Viterby)– FFFFFFLLLLLL
Or most probable set of states– FFFFFFLLLLLL
1:1/62:1/63:1/64:1/65:1/66:1/6Fair
1:1/102:1/103:1/104:1/105:1/106:1/2Loaded
0.95
0.10
0.05
0.9
The unfair casino: Loaded die p(6) = 0.5; switch fair to load:0.05; switch load to fair: 0.1
HMM (a simple example)
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
• Example from A. Krogh
• Core region defines the number of states in the HMM (red)
• Insertion and deletion statistics are derived from the non-core part of the alignment (black) Core of alignment
.2
.8
.2
ACGT
ACGT
ACGT
ACGT
ACGT
ACGT.8
.8 .8.8
.2.2.2
.2
1
ACGT
.2
.2
.4
1. .4 1. 1.1.
.6.6
.4
HMM construction
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
• 5 matches. A, 2xC, T, G• 5 transitions in gap region
• C out, G out• A-C, C-T, T out• Out transition 3/5• Stay transition 2/5
ACA---ATG 0.8x1x0.8x1x0.8x0.4x1x1x0.8x1x0.2 = 3.3x10-2
Align sequence to HMM
ACA---ATG 0.8x1x0.8x1x0.8x0.4x1x0.8x1x0.2 = 3.3x10-2
TCAACTATC 0.2x1x0.8x1x0.8x0.6x0.2x0.4x0.4x0.4x0.2x0.6x1x1x0.8x1x0.8 = 0.0075x10-2
ACAC--AGC = 1.2x10-2
Consensus:
ACAC--ATC = 4.7x10-2, ACA---ATC = 13.1x10-2
Exceptional:
TGCT--AGG = 0.0023x10-2
Align sequence to HMM - Null model
• Score depends strongly on length
• Null model is a random model. For length L the score is 0.25L
• Log-odds score for sequence S
Log( P(S)/0.25L)
• Positive score means more likely than Null model
ACA---ATG = 4.9
TCAACTATC = 3.0 ACAC--AGC = 5.3AGA---ATC = 4.9ACCG--ATC = 4.6Consensus:ACAC--ATC = 6.7 ACA---ATC = 6.3Exceptional:TGCT--AGG = -0.97
Note!
Model decoding (Viterby)
Example: 1245666. What was the series of dice used to generate this output?
1:1/62:1/63:1/64:1/65:1/66:1/6
Fair
1:1/102:1/103:1/104:1/105:1/106:1/2
Loaded
0.95
0.10
0.05
0.9
Model decoding (Viterby)
Example: 1245666. What was the series of dice used to generate this output?
1:-0.782:-0.783:-0.784:-0.785:-0.786:-0-78Fair
1:-12:-13:-14:-15:-16:-0.3Loaded
-0.02
-1
-1.3
-0.05Log model
1 2 4 5 6 6 6
F -0.78
L Null -3.08
Model decoding (Viterby)
1:-0.782:-0.783:-0.784:-0.785:-0.786:-0-78Fair
1:-12:-13:-14:-15:-16:-0.3Loaded
-0.02
-1
-1.3
-0.05Log model
1 2 4 5 6 6 6
F -0.78 -1.58
L Null -3.08 -3.88
Model decoding (Viterby)
1:-0.782:-0.783:-0.784:-0.785:-0.786:-0-78Fair
1:-12:-13:-14:-15:-16:-0.3Loaded
-0.02
-1
-1.3
-0.05Log model
1 2 4 5 6 6 6
F -0.78 -1.58 -2.38 -3.18 -3.98 -4.78 -5.58
L Null -3.08 -3.88 -4.68 -4.78 -5.13 -5.48
Series of dice is FFFFLLL
HMM’s and weight matrices
• In the case of un-gapped alignments HMM’s become simple weight matrices
• To achieve high performance, the emission frequencies are estimated using the techniques of – Sequence weighting– Pseudo counts
Profile HMM’s
• Alignments based on conventional scoring matrices (BLOSUM62) scores all positions in a sequence in an equal manner
• Some positions are highly conserved, some are highly variable (more than what is described in the BLOSUM matrix)
• Profile HMM’s are ideal suited to describe such position specific variations
ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---IIE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD----TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---VASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVPTVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
Profile HMM’s
Conserved
Core: Position with < 2 gaps
Deletion
Insertion
Non-conserved
Must have a G Any thing can match
HMM vs. alignment
• Detailed description of core– Conserved/variable positions
• Price for insertions/deletions varies at different locations in sequence
• These features cannot be captured in conventional alignments
Profile HMM’s
All M/D pairs must be visited once
L1- Y2A3V4R5- I6
P1D2P3P4I4P5D6P7
Example. Sequence profiles
• Alignment of protein sequences 1PLC._ and 1GYC.A
• E-value > 1000• Profile alignment
– Align 1PLC._ against Swiss-prot– Make position specific weight matrix from alignment
– Use this matrix to align 1PLC._ against 1GYC.A
• E-value < 10-22. Rmsd=3.3
Example continued
Score = 97.1 bits (241), Expect = 9e-22 Identities = 13/107 (12%), Positives = 27/107 (25%), Gaps = 17/107 (15%) Query: 3 ADDGSLAFVPSEFSISPGEKI------VFKNNAGFPHNIVFDEDSIPSGVDASKIS 56 F + G++ N+ + +G + +Sbjct: 26 ------VFPSPLITGKKGDRFQLNVVDTLTNHTMLKSTSIHWHGFFQAGTNWADGP 79 Query: 57 MSEEDLLNAKGETFEVAL---SNKGEYSFYCSP--HQGAGMVGKVTV 98 A G +F G + ++ G+ G VSbjct: 80 AFVNQCPIASGHSFLYDFHVPDQAGTFWYHSHLSTQYCDGLRGPFVV 126
Rmsd=3.3 ÅModel redStructure blue
Class II MHC binding
• MHC class II binds peptides in the class II antigen presentation pathway• Binds peptides of length 9-18 (even whole proteins can bind!)• Binding cleft is open• Binding core is 9 aa
Gibbs samplerwww.cbs.dtu.dk/biotools/EasyGibbs
100 10mer peptides2100~1030 combinations
Monte Carlo simulations can do
it
Gibbs sampler. Prediction accuracy
HMM packages
• HMMER (http://hmmer.wustl.edu/)– S.R. Eddy, WashU St. Louis. Freely available.
• SAM (http://www.cse.ucsc.edu/research/compbio/sam.html)– R. Hughey, K. Karplus, A. Krogh, D. Haussler and others,
UC Santa Cruz. Freely available to academia, nominal license fee for commercial users.
• META-MEME (http://metameme.sdsc.edu/)– William Noble Grundy, UC San Diego. Freely available.
Combines features of PSSM search and profile HMM search. • NET-ID, HMMpro (http://www.netid.com/html/hmmpro.html)
– Freely available to academia, nominal license fee for commercial users.
– Allows HMM architecture construction.
• EasyGibbs (http://www.cbs.dtu.dk/biotools/EasyGibbs/)– Webserver for Gibbs sampling of proteins sequences