Immunological Bioinformatics



Page 1: Immunological Bioinformatics

Immunological Bioinformatics

Introduction to the immune system

Page 2: Immunological Bioinformatics

Vaccination

• Vaccination: administration of a substance to a person with the purpose of preventing a disease

• Traditionally composed of a killed or weakened microorganism

• Vaccination works by creating a type of immune response that enables the memory cells to later respond to a similar organism before it can cause disease

Page 3: Immunological Bioinformatics

Figure 1-20

Page 4: Immunological Bioinformatics

Effectiveness of vaccines

1958: start of the smallpox eradication program

Page 5: Immunological Bioinformatics

The Immune System

• The innate immune system

• The adaptive immune system

Page 6: Immunological Bioinformatics

The innate immune system

• Unspecific

• Antigen independent

• Immediate response

• No training/selection, hence no memory

• Pathogen independent (but response might be pathogen type dependent)

Page 7: Immunological Bioinformatics

The adaptive immune system

• Pathogen specific

– Humoral

– Cellular

http://www.uni-heidelberg.de/zentral/ztl/grafiken_bilder/bilder/e-coli.jpg

Bacteria

http://en.wikipedia.org/wiki/Image:Aids_virus.jpg

Virus

http://tpeeaupotable.ifrance.com/ma%20photo/bilharzoze.jpg

Parasite

Page 8: Immunological Bioinformatics

Adaptive immune response

• Signal induced

– Pathogens

• Antigens

– Epitopes

B Cell

T Cell

Page 9: Immunological Bioinformatics

Diversity is a hallmark of the (adaptive) immune system

• Diversity of lymphocytes

– Huge diversity within a host

– At least 10^8 different T & B cell clones

• Receptors made by recombination & N-additions, and

• Somatic mutation during immune response

• Repertoires are (partly) random

– Randomness requires self tolerance

Page 10: Immunological Bioinformatics

Figure 1-14

Page 11: Immunological Bioinformatics

The role of lymphocytes

Page 12: Immunological Bioinformatics

Cartoon by Eric Reits

Humoral immunity

Page 13: Immunological Bioinformatics

Antibody - Antigen interaction

Fab

Antigen

Epitope

Paratope

Antibody

The antibody recognizes structural properties of the surface of the antigen

Page 14: Immunological Bioinformatics

Antibody Effect

Virus or Toxin

Neutralizing Antibodies

Page 15: Immunological Bioinformatics

Cellular immune response

Cartoon by Eric Reits

Page 16: Immunological Bioinformatics

MHC-I molecules present peptides on the surface of most cells

Page 17: Immunological Bioinformatics

CTL response

Healthy cell

Virus-infected cell

MHC-I

Page 18: Immunological Bioinformatics

CTL response

Virus-infected cell

MHC-I

Page 19: Immunological Bioinformatics

The death of an infected cell


Page 20: Immunological Bioinformatics

Polymorphism of MHC

• Within a host: limited number of loci (genes)

– only 6 different class I molecules (two each of A, B and C)

– only 12 different class II molecules

• Within a population: > 100 alleles per locus

Page 21: Immunological Bioinformatics

More MHC molecules: more diversity in the presented peptides

• 1% probability that an MHC molecule presents a peptide

• Different hosts sample different peptides from the same pathogen.
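A quick calculation illustrates why more MHC molecules give broader coverage. The sketch assumes, as a simplification not stated on the slide, that each MHC molecule presents a given peptide independently with the 1% probability quoted above; the function name is illustrative only.

```python
def prob_presented(n_molecules: int, p_single: float = 0.01) -> float:
    """Probability that at least one of n MHC molecules presents a peptide.

    Assumes (hypothetically) that the molecules present independently,
    each with probability p_single.
    """
    return 1.0 - (1.0 - p_single) ** n_molecules


if __name__ == "__main__":
    # 6 class I molecules per host, 12 class II molecules per host
    for n in (1, 6, 12):
        print(f"{n:2d} molecules: {prob_presented(n):.3f}")
```

Under this independence assumption, a host with six class I molecules presents a given peptide with a probability of a little under 6%, compared with 1% for a single molecule.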

Page 22: Immunological Bioinformatics

Immunological benefits of MHC polymorphism

• Heterozygote advantage

– Heterozygotes have a selective advantage because they can present more peptides (Hughes.n88).

• Coevolution

– Pathogens avoid presentation on common MHC alleles (HIV)

– Frequency dependent selection

Page 23: Immunological Bioinformatics

Figure 5-13

Page 24: Immunological Bioinformatics

Heterozygote disadvantage! (for vaccine design)

• Few human beings will share the same set of HLA alleles

– Different persons will react differently to infection by the same pathogen

• A CTL based vaccine must include epitopes specific for each HLA allele in a population

– A CTL based vaccine must consist of ~800 HLA class I epitopes and ~400 class II epitopes

Page 25: Immunological Bioinformatics

HLA specificity clustering

A0201

A0101

A6802

B0702

Page 26: Immunological Bioinformatics

HLA polymorphism - supertypes

• Each HLA molecule within a supertype binds essentially the same peptides

• Nine major HLA class I supertypes have been defined

– HLA-A1, A2, A3, A24, B7, B27, B44, B58, and B62

• And maybe add three more

– HLA-A26, HLA-B8, and HLA-B39

=> A CTL based vaccine must consist of 9-12 HLA class I epitopes

Sette et al, Immunogenetics (1999) 50:201-212

Page 27: Immunological Bioinformatics

Summary

• The adaptive immune system is extremely diverse

– An immune response can be raised against anything foreign!

• Antibodies define the humoral response

– Antibodies recognize structural properties on the surface of extracellular antigens

• T cells define the cellular response

– CTLs kill cells that present MHC molecules bound to foreign peptides derived from intracellular proteins

Page 28: Immunological Bioinformatics

Anchor positions

MHC class I with peptide

Page 29: Immunological Bioinformatics

What makes a peptide a potential and effective epitope?

• Part of a pathogen protein

• Successful processing

– Proteasome cleavage

– TAP binding

• Binds to MHC molecule

• Protein function and expression

– Early in replication

– Highly expressed proteins are more likely to generate immunogens

• Sequence conservation in evolution

Page 30: Immunological Bioinformatics

 

Prediction of HLA binding specificity

Historical overview

• Simple motifs

– Allowed/non allowed amino acids

• Extended motifs

– Amino acid preferences (SYFPEITHI)

– Anchor/Preferred/other amino acids

• Hidden Markov models

– Peptide statistics from sequence alignment

• Neural networks

– Can take sequence correlations into account

Page 31: Immunological Bioinformatics

SYFPEITHI predictions

• Extended motifs based on peptides from the literature and peptides eluted from cells expressing specific HLAs (i.e., binding peptides)

• Scoring scheme is not readily accessible.

• Positions defined as anchor or auxiliary anchor positions are weighted differently (higher)

• The final score is the sum of the scores at each position

• Predictions can be made for several HLA-A, -B and -DRB1 alleles, as well as some mouse K, D and L alleles.

Page 32: Immunological Bioinformatics

BIMAS

• Matrix made from peptides with a measured T1/2 for the MHC-peptide complex

• The matrices are available on the website

• The final score is the product of the scores of each position in the matrix multiplied by a constant, different for each MHC, to give a prediction of the T1/2

• Predictions can be obtained for several HLA-A, -B and -C alleles, mouse K, D and L alleles, and a single cattle MHC.

Page 33: Immunological Bioinformatics

SLLPAIVEL YLLPAIVHI TLWVDPYEV GLVPFLVSV KLLEPVLLL LLDVPTAAV LLDVPTAAV LLDVPTAAV
LLDVPTAAV VLFRGGPRG MVDGTLLLL YMNGTMSQV MLLSVPLLL SLLGLLVEV ALLPPINIL TLIKIQHTL
HLIDYLVTS ILAPPVVKL ALFPQLVIL GILGFVFTL STNRQSGRQ GLDVLTAKV RILGAVAKV QVCERIPTI
ILFGHENRV ILMEHIHKL ILDQKINEV SLAGGIIGV LLIENVASL FLLWATAEA SLPDFGISY KKREEAPSL
LERPGGNEI ALSNLEVKL ALNELLQHV DLERKVESL FLGENISNF ALSDHHIYL GLSEFTEYL STAPPAHGV
PLDGEYFTL GVLVGVALI RTLDKVLEV HLSTAFARV RLDSYVRSL YMNGTMSQV GILGFVFTL ILKEPVHGV
ILGFVFTLT LLFGYPVYV GLSPTVWLS WLSLLVPFV FLPSDFFPS CLGGLLTMV FIAGNSAYE KLGEFYNQM
KLVALGINA DLMGYIPLV RLVTLKDIV MLLAVLYCL AAGIGILTV YLEPGPVTA LLDGTATLR ITDQVPFSV
KTWGQYWQV TITDQVPFS AFHHVAREL YLNKIQNSL MMRKLAILS AIMDKNIIL IMDKNIILK SMVGNWAKV
SLLAPGAKQ KIFGSLAFL ELVSEFSRM KLTPLCVTL VLYRYGSFS YIGEVLVSV CINGVCWTV VMNILLQYV
ILTVILGVL KVLEYVIKV FLWGPRALV GLSRYVARL FLLTRILTI HLGNVKYLV GIAGGLALL GLQDCTMLV
TGAPVTYST VIYQYMDDL VLPDVFIRC VLPDVFIRC AVGIGIAVV LVVLGLLAV ALGLGLLPV GIGIGVLAA
GAGIGVAVL IAGIGILAI LIVIGILIL LAGIGLIAA VDGIGILTI GAGIGVLTA AAGIGIIQI QAGIGILLA
KARDPHSGH KACDPHSGH ACDPHSGHF SLYNTVATL RGPGRAFVT NLVPMVATV GLHCYEQLV PLKQHFQIV
AVFDRKSDA LLDFVRFMG VLVKSPNHV GLAPPQHLI LLGRNSFEV PLTFGWCYK VLEWRFDSR TLNAWVKVV
GLCTLVAML FIDSYICQV IISAVVGIL VMAGVGSPY LLWTLVVLL SVRDRLARL LLMDCSGSI CLTSTVQLV
VLHDDLLEA LMWITQCFL SLLMWITQC QLSLLMWIT LLGATCMFV RLTRFLSRV YMDGTMSQV FLTPKKLQC
ISNDVCAQV VKTDGNPPE SVYDFFVWL FLYGALLLA VLFSSDFRI LMWAKIGPV SLLLELEEV SLSRFSWGA
YTAFTIPSI RLMKQDFSV RLPRIFCSC FLWGPRAYA RLLQETELV SLFEGIDFY SLDQSVVEL RLNMFTPYI
NMFTPYIGV LMIIPLINV TLFIGSHVV SLVIVTTFV VLQWASLAV ILAKFLHWL STAPPHVNV LLLLTVLTV
VVLGVVFGI ILHNGAYSL MIMVKCWMI MLGTHTMEV MLGTHTMEV SLADTNSLA LLWAARPRL GVALQTMKQ
GLYDGMEHL KMVELVHFL YLQLVFGIE MLMAQEALA LMAQEALAF VYDGREHTV YLSGANLNL RMFPNAPYL
EAAGIGILT TLDSQVMSL STPPPGTRV KVAELVHFL IMIGVLVGV ALCRWGLLL LLFAGVQCQ VLLCESTAV
YLSTAFARV YLLEMLWRL SLDDYNHLV RTLDKVLEV GLPVEYLQV KLIANNTRV FIYAGSLSA KLVANNTRL
FLDEFMEGV ALQPGTALL VLDGLDVLL SLYSFPEPE ALYVDSLFF SLLQHLIGL ELTLGEFLK MINAYLDKL
AAGIGILTV FLPSDFFPS SVRDRLARL SLREWLLRI LLSAWILTA AAGIGILTV AVPDEIPPL FAYDGKDYI
AAGIGILTV FLPSDFFPS AAGIGILTV FLPSDFFPS AAGIGILTV FLWGPRALV ETVSEQSNV ITLWQRPLV

Sequence information

Page 34: Immunological Bioinformatics

Sequence Information

• Calculate p_a at each position

• Entropy

$S = -\sum_a p_a \log_2(p_a)$

• Information content

$I = \log_2(20) + \sum_a p_a \log_2(p_a)$

• Conserved positions

– $p_V = 1$, $p_{a \neq V} = 0$ => $S = 0$, $I = \log_2(20)$

• Mutable positions

– $p_a = 1/20$ for every amino acid a => $S = \log_2(20)$, $I = 0$

• Say that a peptide must have L at P2 in order to bind, and that A, F, W, and Y are found at P1. Which position has most information?

• How many questions do I need to ask to tell if a peptide binds looking at only P1 or P2?

– P1: 4 questions (at most)

– P2: 1 question (L or not)

– P2 has the most information
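The entropy and information content defined on this slide can be checked numerically. The following is a minimal sketch using base-2 logarithms (the base used in the worked numbers elsewhere in the deck); the function names are illustrative, and the P1 column assumes, hypothetically, that A, F, W and Y are equally frequent.

```python
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def entropy(freqs):
    """Shannon entropy S = -sum_a p_a log2(p_a); zero frequencies add nothing."""
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)


def information(freqs):
    """Information content I = log2(20) - S, in bits."""
    return math.log2(len(AMINO_ACIDS)) - entropy(freqs)


conserved = {"V": 1.0}                          # fully conserved position
uniform = {aa: 1 / 20 for aa in AMINO_ACIDS}    # fully mutable position
p1 = {aa: 0.25 for aa in "AFWY"}                # assumption: equal frequencies
p2 = {"L": 1.0}                                 # L required at P2

if __name__ == "__main__":
    for name, col in [("conserved", conserved), ("uniform", uniform),
                      ("P1", p1), ("P2", p2)]:
        print(f"{name:9s} S={entropy(col):.2f} I={information(col):.2f}")
```

With these inputs P2 carries the full log2(20) bits, while P1 carries two bits less, which agrees with the slide's conclusion that P2 has the most information.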

Page 35: Immunological Bioinformatics

Information content

Amino acid frequencies per position (rows 1-9), with entropy S and information content I:

     A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V    S    I
1 0.10 0.06 0.01 0.02 0.01 0.02 0.02 0.09 0.01 0.07 0.11 0.06 0.04 0.08 0.01 0.11 0.03 0.01 0.05 0.08 3.96 0.37
2 0.07 0.00 0.00 0.01 0.01 0.00 0.01 0.01 0.00 0.08 0.59 0.01 0.07 0.01 0.00 0.01 0.06 0.00 0.01 0.08 2.16 2.16
3 0.08 0.03 0.05 0.10 0.02 0.02 0.01 0.12 0.02 0.03 0.12 0.01 0.03 0.05 0.06 0.06 0.04 0.04 0.04 0.07 4.06 0.26
4 0.07 0.04 0.02 0.11 0.01 0.04 0.08 0.15 0.01 0.10 0.04 0.03 0.01 0.02 0.09 0.07 0.04 0.02 0.00 0.05 3.87 0.45
5 0.04 0.04 0.04 0.04 0.01 0.04 0.05 0.16 0.04 0.02 0.08 0.04 0.01 0.06 0.10 0.02 0.06 0.02 0.05 0.09 4.04 0.28
6 0.04 0.03 0.03 0.01 0.02 0.03 0.03 0.04 0.02 0.14 0.13 0.02 0.03 0.07 0.03 0.05 0.08 0.01 0.03 0.15 3.92 0.40
7 0.14 0.01 0.03 0.03 0.02 0.03 0.04 0.03 0.05 0.07 0.15 0.01 0.03 0.07 0.06 0.07 0.04 0.03 0.02 0.08 3.98 0.34
8 0.05 0.09 0.04 0.01 0.01 0.05 0.07 0.05 0.02 0.04 0.14 0.04 0.02 0.05 0.05 0.08 0.10 0.01 0.04 0.03 4.04 0.28
9 0.07 0.01 0.00 0.00 0.02 0.02 0.02 0.01 0.01 0.08 0.26 0.01 0.01 0.02 0.00 0.04 0.02 0.00 0.01 0.38 2.78 1.55

$I = \log_2(20) + \sum_a p_a \log_2(p_a)$

$S = -\sum_a p_a \log_2(p_a)$

Example (L at P9):

$p_L = 0.26$

$\log_2(0.26) = -1.94$

$-p_L \log_2(p_L) = -0.26 \cdot (-1.94) = 0.51$

Page 36: Immunological Bioinformatics

Sequence logos

• Height of a column equal to I

• Relative height of a letter is p

• Highly useful tool to visualize sequence motifs

High information positions

HLA-A0201

http://www.cbs.dtu.dk/~gorodkin/appl/plogo.html
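The two rules for drawing a logo (column height equal to I, relative letter height equal to p) combine into an absolute letter height of p times I. The sketch below illustrates this for a single column; the function name and the example column are hypothetical and are not taken from the HLA-A0201 logo.

```python
import math


def logo_heights(freqs, alphabet_size=20):
    """Return letter heights p_a * I for one logo column, tallest first."""
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    info = math.log2(alphabet_size) - entropy      # column height I in bits
    heights = {aa: p * info for aa, p in freqs.items() if p > 0}
    return dict(sorted(heights.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical anchor-like column dominated by L
column = {"L": 0.5, "M": 0.25, "I": 0.25}
heights = logo_heights(column)

if __name__ == "__main__":
    for aa, h in heights.items():
        print(f"{aa}: {h:.2f} bits")
```

The letter heights sum to the information content of the column, so a strongly conserved position produces a tall stack and a variable position a short one.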

Page 37: Immunological Bioinformatics

Characterizing a binding motif from small data sets

What can we learn?

1. A at P1 favors binding?

2. I is not allowed at P9?

3. K at P4 favors binding?

4. Which positions are important for binding?

ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV

10 MHC restricted peptides

Page 38: Immunological Bioinformatics

Simple motifs Yes/No rules

ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV

10 MHC restricted peptides

[AGTK]1 [LMIV]2 [ANLV]3 ... [MNRTVL]9

• Only 11 of 212 peptides identified!

• Need more flexible rules

– If not fit P1 but fit P2 then ok

• Not all positions are equally important

– We know that P2 and P9 determine binding more than other positions

• Cannot discriminate between good and very good binders

Page 39: Immunological Bioinformatics

Simple motifs

Yes/No rules

[AGTK]1 [LMIV]2 [ANLV]3 ... [AIFKLV]7 ... [MNRTVL]9

• Example

– The first two peptides do not fit the motif, yet all three are good binders (aff < 500 nM)

RLLDDTPEV 84 nM
GLLGNVSTV 23 nM
ALAKAAAAL 309 nM

ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV

10 MHC restricted peptides
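The yes/no rule can be written out in a few lines. The sketch below derives the allowed amino acids at every position from the ten peptides and tests the three known binders against the resulting motif; the function names are illustrative, and positions the slide abbreviates with "..." are simply taken from the same ten peptides.

```python
PEPTIDES = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]


def build_motif(peptides):
    """One set of allowed amino acids per position (alignment column)."""
    return [set(column) for column in zip(*peptides)]


def fits_motif(peptide, motif):
    """True only if every position holds an amino acid seen in the training set."""
    return len(peptide) == len(motif) and all(
        aa in allowed for aa, allowed in zip(peptide, motif)
    )


motif = build_motif(PEPTIDES)

if __name__ == "__main__":
    for pep in ("RLLDDTPEV", "GLLGNVSTV", "ALAKAAAAL"):
        print(pep, fits_motif(pep, motif))
```

Only ALAKAAAAL passes, although the other two peptides bind more strongly, which is the limitation of yes/no rules that the slide points out.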

Page 40: Immunological Bioinformatics

Extended motifs

• Fitness of aa at each position given by P(aa)

• Example P1

– P_A = 6/10

– P_G = 2/10

– P_T = P_K = 1/10

– P_C = P_D = … = P_V = 0

• Problems

– Few data

– Data redundancy/duplication

ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV

RLLDDTPEV 84 nM
GLLGNVSTV 23 nM
ALAKAAAAL 309 nM
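The raw counting behind the P1 example can be reproduced directly from the ten peptides. This is a minimal sketch with an illustrative function name; exact fractions are used so the values can be compared with the slide.

```python
from collections import Counter
from fractions import Fraction

PEPTIDES = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]


def position_frequencies(peptides, position):
    """Raw frequency of each amino acid at a 1-based position."""
    counts = Counter(pep[position - 1] for pep in peptides)
    return {aa: Fraction(n, len(peptides)) for aa, n in counts.items()}


p1 = position_frequencies(PEPTIDES, 1)

if __name__ == "__main__":
    for aa, freq in sorted(p1.items(), key=lambda kv: kv[1], reverse=True):
        print(aa, freq)
    # R never occurs at P1, so the binder RLLDDTPEV gets probability 0 there.
    print("R", p1.get("R", Fraction(0)))
```

The zero for R at P1 shows the "few data" problem: a known binder is assigned zero fitness only because its first residue was never sampled.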

Page 41: Immunological Bioinformatics

Sequence information

Raw sequence counting

ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV

Page 42: Immunological Bioinformatics

Sequence weighting

ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV

• Poor or biased sampling of sequence space

• Similar sequences (the five ALAKAAAA* peptides): weight 1/5 each

• Example P1

– P_A = 2/6

– P_G = 2/6

– P_T = P_K = 1/6

– P_C = P_D = … = P_V = 0

RLLDDTPEV 84 nM
GLLGNVSTV 23 nM
ALAKAAAAL 309 nM
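One simple way to arrive at these weights is sketched below. It is not necessarily the clustering scheme used for the slide: each peptide is down-weighted by the number of peptides (itself included) that share at least six identical positions with it, and that threshold, like the function names, is an assumption made for illustration.

```python
from collections import defaultdict
from fractions import Fraction

PEPTIDES = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]


def identity(a, b):
    """Number of positions at which two equal-length peptides agree."""
    return sum(x == y for x, y in zip(a, b))


def sequence_weights(peptides, min_identical=6):
    """Weight 1/k, where k counts peptides with >= min_identical shared positions."""
    return [
        Fraction(1, sum(identity(p, q) >= min_identical for q in peptides))
        for p in peptides
    ]


def weighted_frequencies(peptides, weights, position):
    """Weighted amino acid frequencies at a 1-based position."""
    totals = defaultdict(Fraction)
    for pep, w in zip(peptides, weights):
        totals[pep[position - 1]] += w
    norm = sum(weights)
    return {aa: total / norm for aa, total in totals.items()}


weights = sequence_weights(PEPTIDES)
p1 = weighted_frequencies(PEPTIDES, weights, 1)

if __name__ == "__main__":
    print(weights)
    print(p1)
```

With this threshold the five near-identical peptides each receive weight 1/5 and the weighted P1 frequencies reproduce the 2/6, 2/6, 1/6, 1/6 values of the example.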

Page 43: Immunological Bioinformatics

Sequence weighting

ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV

Page 44: Immunological Bioinformatics

Pseudo counts

ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV

• I is not found at position P9. Does this mean that I is forbidden (P(I)=0)?

• No! Use the Blosum substitution matrix to estimate the pseudo frequency of I at P9

Page 45: Immunological Bioinformatics

     A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
A 0.29 0.03 0.03 0.03 0.02 0.03 0.04 0.08 0.01 0.04 0.06 0.04 0.02 0.02 0.03 0.09 0.05 0.01 0.02 0.07
R 0.04 0.34 0.04 0.03 0.01 0.05 0.05 0.03 0.02 0.02 0.05 0.12 0.02 0.02 0.02 0.04 0.03 0.01 0.02 0.03
N 0.04 0.04 0.32 0.08 0.01 0.03 0.05 0.07 0.03 0.02 0.03 0.05 0.01 0.02 0.02 0.07 0.05 0.00 0.02 0.03
D 0.04 0.03 0.07 0.40 0.01 0.03 0.09 0.05 0.02 0.02 0.03 0.04 0.01 0.01 0.02 0.05 0.04 0.00 0.01 0.02
C 0.07 0.02 0.02 0.02 0.48 0.01 0.02 0.03 0.01 0.04 0.07 0.02 0.02 0.02 0.02 0.04 0.04 0.00 0.01 0.06
Q 0.06 0.07 0.04 0.05 0.01 0.21 0.10 0.04 0.03 0.03 0.05 0.09 0.02 0.01 0.02 0.06 0.04 0.01 0.02 0.04
E 0.06 0.05 0.04 0.09 0.01 0.06 0.30 0.04 0.03 0.02 0.04 0.08 0.01 0.02 0.03 0.06 0.04 0.01 0.02 0.03
G 0.08 0.02 0.04 0.03 0.01 0.02 0.03 0.51 0.01 0.02 0.03 0.03 0.01 0.02 0.02 0.05 0.03 0.01 0.01 0.02
H 0.04 0.05 0.05 0.04 0.01 0.04 0.05 0.04 0.35 0.02 0.04 0.05 0.02 0.03 0.02 0.04 0.03 0.01 0.06 0.02
I 0.05 0.02 0.01 0.02 0.02 0.01 0.02 0.02 0.01 0.27 0.17 0.02 0.04 0.04 0.01 0.03 0.04 0.01 0.02 0.18
L 0.04 0.02 0.01 0.02 0.02 0.02 0.02 0.02 0.01 0.12 0.38 0.03 0.05 0.05 0.01 0.02 0.03 0.01 0.02 0.10
K 0.06 0.11 0.04 0.04 0.01 0.05 0.07 0.04 0.02 0.03 0.04 0.28 0.02 0.02 0.03 0.05 0.04 0.01 0.02 0.03
M 0.05 0.03 0.02 0.02 0.02 0.03 0.03 0.03 0.02 0.10 0.20 0.04 0.16 0.05 0.02 0.04 0.04 0.01 0.02 0.09
F 0.03 0.02 0.02 0.02 0.01 0.01 0.02 0.03 0.02 0.06 0.11 0.02 0.03 0.39 0.01 0.03 0.03 0.02 0.09 0.06
P 0.06 0.03 0.02 0.03 0.01 0.02 0.04 0.04 0.01 0.03 0.04 0.04 0.01 0.01 0.49 0.04 0.04 0.00 0.01 0.03
S 0.11 0.04 0.05 0.05 0.02 0.03 0.05 0.07 0.02 0.03 0.04 0.05 0.02 0.02 0.03 0.22 0.08 0.01 0.02 0.04
T 0.07 0.04 0.04 0.04 0.02 0.03 0.04 0.04 0.01 0.05 0.07 0.05 0.02 0.02 0.03 0.09 0.25 0.01 0.02 0.07
W 0.03 0.02 0.02 0.02 0.01 0.02 0.02 0.03 0.02 0.03 0.05 0.02 0.02 0.06 0.01 0.02 0.02 0.49 0.07 0.03
Y 0.04 0.03 0.02 0.02 0.01 0.02 0.03 0.02 0.05 0.04 0.07 0.03 0.02 0.13 0.02 0.03 0.03 0.03 0.32 0.05
V 0.07 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.01 0.16 0.13 0.03 0.03 0.04 0.02 0.03 0.05 0.01 0.02 0.27

The Blosum matrix

Some amino acids are highly conserved (e.g. C), some have a high chance of mutation (e.g. I)

Page 46: Immunological Bioinformatics

     A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
A 0.29 0.03 0.03 0.03 0.02 0.03 0.04 0.08 0.01 0.04 0.06 0.04 0.02 0.02 0.03 0.09 0.05 0.01 0.02 0.07
R 0.04 0.34 0.04 0.03 0.01 0.05 0.05 0.03 0.02 0.02 0.05 0.12 0.02 0.02 0.02 0.04 0.03 0.01 0.02 0.03
N 0.04 0.04 0.32 0.08 0.01 0.03 0.05 0.07 0.03 0.02 0.03 0.05 0.01 0.02 0.02 0.07 0.05 0.00 0.02 0.03
D 0.04 0.03 0.07 0.40 0.01 0.03 0.09 0.05 0.02 0.02 0.03 0.04 0.01 0.01 0.02 0.05 0.04 0.00 0.01 0.02
C 0.07 0.02 0.02 0.02 0.48 0.01 0.02 0.03 0.01 0.04 0.07 0.02 0.02 0.02 0.02 0.04 0.04 0.00 0.01 0.06
…
Y 0.04 0.03 0.02 0.02 0.01 0.02 0.03 0.02 0.05 0.04 0.07 0.03 0.02 0.13 0.02 0.03 0.03 0.03 0.32 0.05
V 0.07 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.01 0.16 0.13 0.03 0.03 0.04 0.02 0.03 0.05 0.01 0.02 0.27

What is a pseudo count?

• Say I observe V at P1

• Knowing that V at P1 binds, what is the probability that a peptide could have I at P1?

• P(I|V) = 0.16

Page 47: Immunological Bioinformatics

• Calculate observed amino acid frequencies f_a

• Pseudo frequency for amino acid b

$g_b = \sum_a f_a \cdot q_{b|a}$

• Example

$g_I = 0.2 \cdot q_{I|M} + 0.1 \cdot q_{I|N} + \ldots + 0.3 \cdot q_{I|V} + 0.1 \cdot q_{I|L}$

$g_I = 0.2 \cdot 0.04 + 0.1 \cdot 0.01 + \ldots + 0.3 \cdot 0.18 + 0.1 \cdot 0.17 \approx 0.09$

ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV

Pseudo count estimation
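The pseudo frequency of I at P9 can be recomputed from the ten peptides. In the sketch below the conditional values for M, N, V and L are the ones printed in the worked example; the two terms hidden behind the ellipsis (R and T) are read from the same row of the Blosum frequency table as 0.02 and 0.04, which is an assumption about how the example was evaluated. The names are illustrative.

```python
from collections import Counter

PEPTIDES = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

# q(I|a) for every amino acid a observed at P9.
# M, N, V, L: from the worked example; R, T: assumed from the same table row.
Q_I_GIVEN = {"M": 0.04, "N": 0.01, "R": 0.02, "T": 0.04, "V": 0.18, "L": 0.17}


def pseudo_frequency(peptides, position, q_b_given_a):
    """g_b = sum_a f_a * q(b|a), with f_a the raw frequency at the position."""
    counts = Counter(pep[position - 1] for pep in peptides)
    n = len(peptides)
    return sum((count / n) * q_b_given_a[aa] for aa, count in counts.items())


g_I = pseudo_frequency(PEPTIDES, 9, Q_I_GIVEN)

if __name__ == "__main__":
    print(f"g_I at P9 = {g_I:.2f}")
```

Although I is never observed at P9, its pseudo frequency is clearly above zero because the residues that are observed there (V and L in particular) are frequently exchanged with I.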

Page 48: Immunological Bioinformatics

ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV

Weight on pseudo count

• Pseudo counts are important when only limited data is available

• With large data sets only “true” observations should count

• α is the effective number of sequences (N-1), β is the weight on the prior

$p_a = \frac{\alpha \cdot f_a + \beta \cdot g_a}{\alpha + \beta}$

Page 49: Immunological Bioinformatics

• Example

• If α is large, p ≈ f and only the observed data defines the motif

• If α is small, p ≈ g and the pseudo counts (or prior) define the motif

• β is [50-200] normally

ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV

Weight on pseudo count

$p_a = \frac{\alpha \cdot f_a + \beta \cdot g_a}{\alpha + \beta}$
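The effect of the two weights can be illustrated with the I-at-P9 case, where the observed frequency is 0 and the pseudo frequency is about 0.09. The sketch assumes an effective number of sequences of 9 (N-1 for the ten peptides, ignoring clustering) and a prior weight of 50, the lower end of the 50 to 200 range quoted above; both choices and the function name are illustrative.

```python
def smoothed_frequency(f_a, g_a, alpha, beta):
    """p_a = (alpha * f_a + beta * g_a) / (alpha + beta)."""
    return (alpha * f_a + beta * g_a) / (alpha + beta)


# I at P9: never observed (f = 0), pseudo frequency g of about 0.09
small_data = smoothed_frequency(0.0, 0.09, alpha=9, beta=50)
# Hypothetical large data set in which I is still never observed
large_data = smoothed_frequency(0.0, 0.09, alpha=10_000, beta=50)

if __name__ == "__main__":
    print(f"alpha = 9:     p_I = {small_data:.4f}")
    print(f"alpha = 10000: p_I = {large_data:.4f}")
```

With few sequences the estimate stays close to the pseudo frequency; with many sequences it approaches the observed frequency, here zero.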

Page 50: Immunological Bioinformatics

Sequence weighting and pseudo counts

RLLDDTPEV 84 nM
GLLGNVSTV 23 nM
ALAKAAAAL 309 nM

P(P) and P(S) at P7 are now > 0

ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV

Page 51: Immunological Bioinformatics

Position specific weighting

• We know that positions 2 and 9 are anchor positions for most MHC binding motifs

– Increase weight on high information positions

• Motif found on large data set

Page 52: Immunological Bioinformatics

Weight matrices

• Estimate amino acid frequencies from alignment including sequence weighting and pseudo count

• What do the numbers mean?

– P2(V) > P2(M). Does this mean that V enables binding more than M?

– In nature not all amino acids are found equally often

– qM = 0.025, qV = 0.073

– Finding 7% V is hence not significant, but 2% M highly significant

• In nature V is found more often than M, so we must somehow rescale with the background

     A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
1 0.08 0.06 0.02 0.03 0.02 0.02 0.03 0.08 0.02 0.08 0.11 0.06 0.04 0.06 0.02 0.09 0.04 0.01 0.04 0.08
2 0.04 0.01 0.01 0.01 0.01 0.01 0.02 0.02 0.01 0.11 0.44 0.02 0.06 0.03 0.01 0.02 0.05 0.00 0.01 0.10
3 0.08 0.04 0.05 0.07 0.02 0.03 0.03 0.08 0.02 0.05 0.11 0.03 0.03 0.06 0.04 0.06 0.05 0.03 0.05 0.07
4 0.08 0.05 0.03 0.10 0.01 0.05 0.08 0.13 0.01 0.05 0.06 0.05 0.01 0.03 0.08 0.06 0.04 0.02 0.01 0.05
5 0.06 0.04 0.05 0.03 0.01 0.04 0.05 0.11 0.03 0.04 0.09 0.04 0.02 0.06 0.06 0.04 0.05 0.02 0.05 0.08
6 0.06 0.03 0.03 0.03 0.03 0.03 0.04 0.06 0.02 0.10 0.14 0.04 0.03 0.05 0.04 0.06 0.06 0.01 0.03 0.13
7 0.10 0.02 0.04 0.04 0.02 0.03 0.04 0.05 0.04 0.08 0.12 0.02 0.03 0.06 0.07 0.06 0.05 0.03 0.03 0.08
8 0.05 0.07 0.04 0.03 0.01 0.04 0.06 0.06 0.03 0.06 0.13 0.06 0.02 0.05 0.04 0.08 0.07 0.01 0.04 0.05
9 0.08 0.02 0.01 0.01 0.02 0.02 0.03 0.02 0.01 0.10 0.23 0.03 0.02 0.04 0.01 0.04 0.04 0.00 0.02 0.25

Page 53: Immunological Bioinformatics

Weight matrices

• A weight matrix is given as

$W_{ij} = \log(p_{ij}/q_j)$

– where i is a position in the motif, and j an amino acid. q_j is the background frequency for amino acid j.

• W is an L x 20 matrix, where L is the motif length

      A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
1   0.6  0.4 -3.5 -2.4 -0.4 -1.9 -2.7  0.3 -1.1  1.0  0.3  0.0  1.4  1.2 -2.7  1.4 -1.2 -2.0  1.1  0.7
2  -1.6 -6.6 -6.5 -5.4 -2.5 -4.0 -4.7 -3.7 -6.3  1.0  5.1 -3.7  3.1 -4.2 -4.3 -4.2 -0.2 -5.9 -3.8  0.4
3   0.2 -1.3  0.1  1.5  0.0 -1.8 -3.3  0.4  0.5 -1.0  0.3 -2.5  1.2  1.0 -0.1 -0.3 -0.5  3.4  1.6  0.0
4  -0.1 -0.1 -2.0  2.0 -1.6  0.5  0.8  2.0 -3.3  0.1 -1.7 -1.0 -2.2 -1.6  1.7 -0.6 -0.2  1.3 -6.8 -0.7
5  -1.6 -0.1  0.1 -2.2 -1.2  0.4 -0.5  1.9  1.2 -2.2 -0.5 -1.3 -2.2  1.7  1.2 -2.5 -0.1  1.7  1.5  1.0
6  -0.7 -1.4 -1.0 -2.3  1.1 -1.3 -1.4 -0.2 -1.0  1.8  0.8 -1.9  0.2  1.0 -0.4 -0.6  0.4 -0.5 -0.0  2.1
7   1.1 -3.8 -0.2 -1.3  1.3 -0.3 -1.3 -1.4  2.1  0.6  0.7 -5.0  1.1  0.9  1.3 -0.5 -0.9  2.9 -0.4  0.5
8  -2.2  1.0 -0.8 -2.9 -1.4  0.4  0.1 -0.4  0.2 -0.0  1.1 -0.5 -0.5  0.7 -0.3  0.8  0.8 -0.7  1.3 -1.1
9  -0.2 -3.5 -6.1 -4.5  0.7 -0.8 -2.5 -4.0 -2.6  0.9  2.8 -3.0 -1.8 -1.4 -6.2 -1.9 -1.6 -4.9 -1.6  4.5
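Rescaling by the background changes the ranking of amino acids. The sketch below applies the log-odds formula to M and V at P2, using the frequencies 0.06 and 0.10 from row 2 of the frequency table and the background values qM = 0.025 and qV = 0.073 quoted earlier. The base of the logarithm is not given in the slides; base 2 is assumed here, so the numbers are not expected to reproduce the matrix shown above.

```python
import math


def log_odds(p, q):
    """Weight matrix entry log(p / q); base 2 is an assumption."""
    return math.log2(p / q)


w_M = log_odds(0.06, 0.025)   # M at P2 against its background frequency
w_V = log_odds(0.10, 0.073)   # V at P2 against its background frequency

if __name__ == "__main__":
    print(f"W(P2, M) = {w_M:.2f}")
    print(f"W(P2, V) = {w_V:.2f}")
```

Although V is more frequent than M at P2, M receives the higher weight because it is rarer in the background.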

Page 54: Immunological Bioinformatics

• Score a sequence against the weight matrix by looking up and adding L values from the matrix, one per position

      A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
1   0.6  0.4 -3.5 -2.4 -0.4 -1.9 -2.7  0.3 -1.1  1.0  0.3  0.0  1.4  1.2 -2.7  1.4 -1.2 -2.0  1.1  0.7
2  -1.6 -6.6 -6.5 -5.4 -2.5 -4.0 -4.7 -3.7 -6.3  1.0  5.1 -3.7  3.1 -4.2 -4.3 -4.2 -0.2 -5.9 -3.8  0.4
3   0.2 -1.3  0.1  1.5  0.0 -1.8 -3.3  0.4  0.5 -1.0  0.3 -2.5  1.2  1.0 -0.1 -0.3 -0.5  3.4  1.6  0.0
4  -0.1 -0.1 -2.0  2.0 -1.6  0.5  0.8  2.0 -3.3  0.1 -1.7 -1.0 -2.2 -1.6  1.7 -0.6 -0.2  1.3 -6.8 -0.7
5  -1.6 -0.1  0.1 -2.2 -1.2  0.4 -0.5  1.9  1.2 -2.2 -0.5 -1.3 -2.2  1.7  1.2 -2.5 -0.1  1.7  1.5  1.0
6  -0.7 -1.4 -1.0 -2.3  1.1 -1.3 -1.4 -0.2 -1.0  1.8  0.8 -1.9  0.2  1.0 -0.4 -0.6  0.4 -0.5 -0.0  2.1
7   1.1 -3.8 -0.2 -1.3  1.3 -0.3 -1.3 -1.4  2.1  0.6  0.7 -5.0  1.1  0.9  1.3 -0.5 -0.9  2.9 -0.4  0.5
8  -2.2  1.0 -0.8 -2.9 -1.4  0.4  0.1 -0.4  0.2 -0.0  1.1 -0.5 -0.5  0.7 -0.3  0.8  0.8 -0.7  1.3 -1.1
9  -0.2 -3.5 -6.1 -4.5  0.7 -0.8 -2.5 -4.0 -2.6  0.9  2.8 -3.0 -1.8 -1.4 -6.2 -1.9 -1.6 -4.9 -1.6  4.5

Scoring a sequence to a weight matrix

Which peptide is most likely to bind? Which peptide second?

Peptide    Score  Affinity
RLLDDTPEV  11.9   84 nM
GLLGNVSTV  14.7   23 nM
ALAKAAAAL   4.3   309 nM
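The scores can be reproduced by summing one matrix entry per position. The sketch below copies the 9 x 20 weight matrix from the slide and scores the three peptides; the variable and function names are illustrative.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Rows are positions 1-9; columns follow AMINO_ACIDS (values from the slide).
WEIGHTS = [
    [0.6, 0.4, -3.5, -2.4, -0.4, -1.9, -2.7, 0.3, -1.1, 1.0,
     0.3, 0.0, 1.4, 1.2, -2.7, 1.4, -1.2, -2.0, 1.1, 0.7],
    [-1.6, -6.6, -6.5, -5.4, -2.5, -4.0, -4.7, -3.7, -6.3, 1.0,
     5.1, -3.7, 3.1, -4.2, -4.3, -4.2, -0.2, -5.9, -3.8, 0.4],
    [0.2, -1.3, 0.1, 1.5, 0.0, -1.8, -3.3, 0.4, 0.5, -1.0,
     0.3, -2.5, 1.2, 1.0, -0.1, -0.3, -0.5, 3.4, 1.6, 0.0],
    [-0.1, -0.1, -2.0, 2.0, -1.6, 0.5, 0.8, 2.0, -3.3, 0.1,
     -1.7, -1.0, -2.2, -1.6, 1.7, -0.6, -0.2, 1.3, -6.8, -0.7],
    [-1.6, -0.1, 0.1, -2.2, -1.2, 0.4, -0.5, 1.9, 1.2, -2.2,
     -0.5, -1.3, -2.2, 1.7, 1.2, -2.5, -0.1, 1.7, 1.5, 1.0],
    [-0.7, -1.4, -1.0, -2.3, 1.1, -1.3, -1.4, -0.2, -1.0, 1.8,
     0.8, -1.9, 0.2, 1.0, -0.4, -0.6, 0.4, -0.5, -0.0, 2.1],
    [1.1, -3.8, -0.2, -1.3, 1.3, -0.3, -1.3, -1.4, 2.1, 0.6,
     0.7, -5.0, 1.1, 0.9, 1.3, -0.5, -0.9, 2.9, -0.4, 0.5],
    [-2.2, 1.0, -0.8, -2.9, -1.4, 0.4, 0.1, -0.4, 0.2, -0.0,
     1.1, -0.5, -0.5, 0.7, -0.3, 0.8, 0.8, -0.7, 1.3, -1.1],
    [-0.2, -3.5, -6.1, -4.5, 0.7, -0.8, -2.5, -4.0, -2.6, 0.9,
     2.8, -3.0, -1.8, -1.4, -6.2, -1.9, -1.6, -4.9, -1.6, 4.5],
]


def score(peptide, weights=WEIGHTS):
    """Sum the matrix entry for the amino acid found at each position."""
    return sum(row[AMINO_ACIDS.index(aa)] for row, aa in zip(weights, peptide))


if __name__ == "__main__":
    for pep in ("RLLDDTPEV", "GLLGNVSTV", "ALAKAAAAL"):
        print(pep, round(score(pep), 1))
```

The ranking by score (GLLGNVSTV, then RLLDDTPEV, then ALAKAAAAL) follows the measured affinities of 23 nM, 84 nM and 309 nM.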