Gibbs sampling - DTU Health Tech · 2011. 10. 27. · Gibbs sampling A special kind of Monte Carlo...

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Department of Systems Biology Technical University of Denmark 1 Gibbs sampling Massimo Andreatta Center for Biological Sequence Analysis Technical University of Denmark [email protected]

  • Monte Carlo simulations

    MC methods use repeated random sampling to numerically approximate solutions to problems

  • Monte Carlo simulations

    A simple example: computing π with sampling

    A circle of radius $r$ inscribed in a square of side $2r$:

    $A_c = \pi r^2$

    $A_s = (2r)^2$

    $\frac{A_c}{A_s} = \frac{\pi r^2}{4r^2} = \frac{\pi}{4}$

    $\pi = 4 \frac{A_c}{A_s}$

    Throw darts randomly:

    $\frac{\text{hit circle}}{\text{hit square}} = \frac{\text{hit}}{\text{hit} + \text{miss}} = \frac{A_c}{A_s}$

    hit = 0
    for N iterations
        x = random(-1,1)
        y = random(-1,1)
        dist = sqrt(x^2 + y^2)
        if (dist

    -  More iterations give a more accurate estimate
    -  After 1,000,000 iterations I got π ≈ 3.14182...
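The dart-throwing pseudocode above is cut off at the distance test. A minimal runnable sketch in Python follows; the comparison against radius 1 and the final `4 * hits / N` step are assumptions reconstructed from the geometry ($\pi = 4 A_c / A_s$), and the function name `estimate_pi` is hypothetical.

```python
import random


def estimate_pi(n_iterations, seed=None):
    """Estimate pi by throwing random darts at the square [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_iterations):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        # Assumed hit test: the dart lands inside the circle of radius 1.
        # Comparing squared distances avoids the square root.
        if x * x + y * y <= 1.0:
            hits += 1
    # hits / n_iterations approximates A_c / A_s = pi / 4
    return 4.0 * hits / n_iterations


if __name__ == "__main__":
    for n in (1_000, 100_000):
        print(n, estimate_pi(n, seed=0))
```

The error of the estimate shrinks roughly with the square root of the number of darts, which is why many iterations are needed for a few correct decimals.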

  • Gibbs sampling

    A special kind of Monte Carlo method (Markov Chain Monte Carlo, or MCMC)

    - estimates a distribution by sampling from it
    - the samples are taken with pseudo-random steps
    - stepping to the next state only depends on the current state (memory-less chain)

  • Gibbs sampling

    Stochastic search

    [Plot: f(Z) versus Z]

    $Z_i$ = current state of the system
    $P$ = probability of accepting the move
    $T$ = a scalar lowered during the search

    $P = \min\left[1, \exp\left(\frac{dE}{T}\right)\right]$

    $dE = f(Z_i) - f(Z_{i-1})$
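The acceptance rule can be written as a small Python helper. This is a sketch, not the author's code: the function names are hypothetical, and drawing a uniform random number to decide acceptance is the usual way such a probability is applied, assumed here.

```python
import math
import random


def acceptance_probability(d_e, temperature):
    """P = min(1, exp(dE / T)); moves that do not lower f(Z) are always accepted."""
    if d_e >= 0:
        return 1.0  # also avoids overflow of exp() for large dE / T
    return math.exp(d_e / temperature)


def accept_move(d_e, temperature, rng=random):
    """Accept the move when a uniform draw in [0, 1) falls below P."""
    return rng.random() < acceptance_probability(d_e, temperature)
```

Moves that improve the fitness are always taken, while moves that worsen it are taken with a probability that drops as `d_e` becomes more negative or as `temperature` is lowered.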

  • Gibbs sampling - down to biology

    Sequence alignment

    SLFIGLKGDIRESTVDGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLNWIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFAESLHNPYPDYHWLRT

    $Z_i$ = current state of the system
    $P$ = probability of accepting the move
    $T$ = a scalar lowered during the search

    $P = \min\left[1, \exp\left(\frac{dE}{T}\right)\right]$

    $dE = f(Z_i) - f(Z_{i-1})$

    $dE = E_i - E_{i-1}$

    $E = \sum_{p,a} C_{p,a} \log \frac{p_{p,a}}{q_a}$

    ($C_{p,a}$ = count of amino acid $a$ at position $p$ of the alignment, $p_{p,a}$ = its frequency at that position, $q_a$ = background frequency of $a$)
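As an illustration of the energy function, the sketch below computes $E = \sum_{p,a} C_{p,a} \log (p_{p,a} / q_a)$ for a set of aligned cores. Several details are assumptions not given in the slides: a uniform background $q_a = 1/20$, plain observed frequencies without pseudocounts, the natural logarithm, and the function name `alignment_energy`.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def alignment_energy(cores, background=None):
    """Sum over positions p and amino acids a of C[p,a] * log(p[p,a] / q[a]).

    cores: list of equal-length strings (the aligned core of each sequence).
    background: dict amino acid -> q_a; uniform 1/20 is assumed if omitted.
    """
    if background is None:
        background = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}
    n_seqs = len(cores)
    energy = 0.0
    for position in range(len(cores[0])):
        counts = Counter(core[position] for core in cores)  # C[p, a]
        for amino_acid, count in counts.items():
            frequency = count / n_seqs                      # p[p, a]
            energy += count * math.log(frequency / background[amino_acid])
    return energy


if __name__ == "__main__":
    print(alignment_energy(["ACD", "ACD"]))  # fully conserved columns
    print(alignment_energy(["ACD", "EFG"]))  # no conserved column
```

A well conserved alignment gives a higher $E$ than a scrambled one, which is what the sampler exploits when it compares $E_i$ with $E_{i-1}$.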

  • Gibbs sampling - sequence alignment

    State transition

    [Figure: alignment of the example sequences before and after the move]

    move to state +1

    Accept or reject the move?

    $P = \min\left[1, \exp\left(\frac{dE}{T}\right)\right]$

    $dE = E_i - E_{i-1}$

    $E = \sum_{p,a} C_{p,a} \log \frac{p_{p,a}}{q_a}$

    Note that the probability of going to the new state only depends on the previous state

  • Gibbs sampling - sequence alignment

    Numerical example - 1

    [Figure: alignment of the example sequences before and after the move]

    move to state +1

    $E_{i-1} = 2.44$, $E_i = 2.52$, $T = 0.2$

    $P = \min\left[1, \exp\left(\frac{0.08}{0.2}\right)\right] = \min[1, 1.49] = 1$

    Accept move with Prob = 100%

  • Gibbs sampling - sequence alignment

    Numerical example - 2

    move to state +1

    $E_{i-1} = 2.44$, $E_i = 2.35$, $T = 0.2$

    $P = \min\left[1, \exp\left(\frac{-0.09}{0.2}\right)\right] = \min[1, 0.638] = 0.638$

    Accept move with Prob = 63.8%

  • Gibbs sampling - sequence alignment

    Now, one thing at a time

  • Gibbs sampling - sequence alignment

    What is the MC temperature?

    It’s a scalar decreased during the simulation

    [Plot: T versus iteration]

    E.g. same dE = -0.3 but at different temperatures

    $t_1 = 0.4$: $P(t_1) = \min\left[1, \exp\left(\frac{dE}{t_1}\right)\right] = \min\left[1, \exp\left(\frac{-0.3}{0.4}\right)\right] = 0.47$

    $t_2 = 0.1$: $P(t_2) = \min\left[1, \exp\left(\frac{-0.3}{0.1}\right)\right] = 0.05$

    $t_3 = 0.02$: $P(t_3) = \min\left[1, \exp\left(\frac{-0.3}{0.02}\right)\right] \approx 0$
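To see the effect of the temperature in code, the sketch below evaluates the acceptance probability for dE = -0.3 at the three temperatures and builds a cooling schedule. The linear shape of the schedule and the name `linear_schedule` are assumptions; the slides only state that T is decreased during the simulation.

```python
import math


def acceptance_probability(d_e, temperature):
    """P = min(1, exp(dE / T))."""
    if d_e >= 0:
        return 1.0
    return math.exp(d_e / temperature)


def linear_schedule(t_start, t_end, n_steps):
    """Hypothetical cooling schedule: n_steps temperatures from t_start down to t_end."""
    if n_steps == 1:
        return [t_start]
    step = (t_start - t_end) / (n_steps - 1)
    return [t_start - i * step for i in range(n_steps)]


if __name__ == "__main__":
    for t in (0.4, 0.1, 0.02):
        print(f"T = {t}: P = {acceptance_probability(-0.3, t):.2f}")
    print(linear_schedule(0.4, 0.02, 5))
```

Other schedules (for example a geometric decay) would fit the same description; the choice mainly trades exploration time at high T against time spent refining at low T.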

  • Move freely around states when the system is “warm”, then cool it off to force it into a state of high fitness

    [Plot: f(Z) versus Z]

  • Gibbs sampling - sequence alignment

    Why sampling?

    SLFIGLKGDIRESTVDGEEEVQLIAAVPGK VFRLKGGAPIKGVTF SFSCIAIGIITLYLG IDQVTIAGAKLRSLNWIQKETLVTFKNPHAKKQDV KMLLDNINTPEGIIP ELLEFHYYLSSKLNK LNKFISPKSVAGRFAESLHNPYPDYHWLRT

    ............

    DFAAQVDYPSTGLY

    50 sequences 12 amino acids long

    try all possible combinations with a 9-mer overlap (each 12-mer has 12 - 9 + 1 = 4 possible 9-mer positions)

    $4^{50} \sim 10^{30}$ possible combinations

    ...computationally unfeasible
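The size of the search space is easy to check with exact integer arithmetic. In the sketch below the evaluation rate of one billion alignments per second is a purely hypothetical figure, used only to give a feeling for the scale.

```python
n_sequences = 50
peptide_length = 12
core_length = 9

# Each 12-mer can place its 9-mer core at 12 - 9 + 1 = 4 offsets
offsets_per_sequence = peptide_length - core_length + 1
n_alignments = offsets_per_sequence ** n_sequences

# Hypothetical evaluation rate, for scale only
alignments_per_second = 1e9
seconds_per_year = 3600 * 24 * 365
years_needed = n_alignments / alignments_per_second / seconds_per_year

print(f"{n_alignments:.2e} possible alignments")
print(f"{years_needed:.1e} years at {alignments_per_second:.0e} alignments per second")
```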

  • Single sequence move

    [Figure: alignment of the example sequences before and after the move]

    move to state +1

    Accept or reject the move?

    $P = \min\left[1, \exp\left(\frac{dE}{T}\right)\right]$

    $dE = E_i - E_{i-1}$

    $E = \sum_{p,a} C_{p,a} \log \frac{p_{p,a}}{q_a}$

  • Phase shift move

    shift all sequences

    move to state +1

    Accept or reject the move?

    $P = \min\left[1, \exp\left(\frac{dE}{T}\right)\right]$

  • A sketch for the alignment algorithm

    •  Start from a random alignment
    •  Set initial temperature
    •  For N iterations
        •  pick a random sequence
        •  suggest a shift move
        •  accept or reject the move depending on $P = \min\left[1, \exp\left(\frac{dE}{T}\right)\right]$
        •  every Psh moves, attempt a phase shift move
        •  decrease temperature
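The steps of the sketch can be turned into a compact Python prototype. This is an illustration under stated assumptions, not the author's implementation: uniform background frequencies, no pseudocounts, a linear cooling schedule, a phase shift of one position to the left or right, and all function and parameter names (`gibbs_align`, `phase_every`, ...) are hypothetical.

```python
import math
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = {a: 1.0 / 20 for a in AMINO_ACIDS}  # assumed uniform q_a


def energy(sequences, offsets, core_len):
    """E = sum over p, a of C[p,a] * log(p[p,a] / q[a]) for the current cores."""
    n = len(sequences)
    total = 0.0
    for p in range(core_len):
        counts = Counter(s[o + p] for s, o in zip(sequences, offsets))
        for a, c in counts.items():
            total += c * math.log((c / n) / BACKGROUND[a])
    return total


def accept(d_e, temperature, rng):
    """Accept with probability min(1, exp(dE / T))."""
    return d_e >= 0 or rng.random() < math.exp(d_e / temperature)


def gibbs_align(sequences, core_len=9, n_iter=2000, t_start=0.5, t_end=0.01,
                phase_every=50, seed=0):
    rng = random.Random(seed)
    max_off = [len(s) - core_len for s in sequences]
    offsets = [rng.randint(0, m) for m in max_off]       # random alignment
    e = energy(sequences, offsets, core_len)
    for it in range(n_iter):
        # Assumed linear cooling from t_start to t_end
        t = t_start + (t_end - t_start) * it / max(1, n_iter - 1)
        # Single sequence move: new random offset for one random sequence
        k = rng.randrange(len(sequences))
        proposal = offsets[:]
        proposal[k] = rng.randint(0, max_off[k])
        e_new = energy(sequences, proposal, core_len)
        if accept(e_new - e, t, rng):
            offsets, e = proposal, e_new
        # Phase shift move: shift all sequences by one position
        if (it + 1) % phase_every == 0:
            shift = rng.choice((-1, 1))
            proposal = [o + shift for o in offsets]
            if all(0 <= o <= m for o, m in zip(proposal, max_off)):
                e_new = energy(sequences, proposal, core_len)
                if accept(e_new - e, t, rng):
                    offsets, e = proposal, e_new
    return offsets, e


if __name__ == "__main__":
    toy = ["AAKLMNPQRW", "KLMNPQRWDD", "GKLMNPQRWG"]
    print(gibbs_align(toy, core_len=8, n_iter=500))
```

Recomputing the full energy at every step keeps the sketch short; an efficient version would update only the counts affected by the move.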

  • Does it work?

  • Gibbs sequence alignment - performance

  • Aligning scoring matrices

    More Gibbs sampling

  • Alignment of scoring matrices

    4 networks trained on HLA-DRB1*0401

    Combined logo

    Equally valid solutions, but with different core registers

  • The PSSM-align algorithm

    Individual PSSM: L positions × 20 amino acids

    1.  Extend matrix with BG frequencies (all individual PSSMs)

    2.  Apply random shift

    3.  Do Gibbs sampling for many iterations

    Maximize combined Information Content of the core

    Accept moves with probability:

    $P = \min\left[1, \exp\left(\frac{dE}{T}\right)\right]$

    [Figure labels: Avg matrix, Offset]

  • Alignment of scoring matrices

    [Figure panels: before alignment, after alignment]

  • And more Gibbs sampling

    Clustering peptide data

  • Gibbs clustering

    --ELLEFHYYLSSKLNK----------LNKFISPKSVAGRFAESLHNPYPDYHWLRT-------NKVKSLRILNTRRKL-------MMGMFNMLSTVLGVS----AKSSPAYPSVLGQTI--------RHLIFCHSKKKCDELAAK-

    ----SLFIGLKGDIRESTV----DGEEEVQLIAAVPGK----------VFRLKGGAPIKGVTF---SFSCIAIGIITLYLG-------IDQVTIAGAKLRSLN--WIQKETLVTFKNPHAKKQDV-------KMLLDNINTPEGIIP

    Cluster 2

    Cluster 1 SLFIGLKGDIRESTVDGEEEVQLIAAVPGKVFRLKGGAPIKGVTFSFSCIAIGIITLYLGIDQVTIAGAKLRSLNWIQKETLVTFKNPHAKKQDVKMLLDNINTPEGIIPELLEFHYYLSSKLNKLNKFISPKSVAGRFAESLHNPYPDYHWLRTNKVKSLRILNTRRKLMMGMFNMLSTVLGVSAKSSPAYPSVLGQTIRHLIFCHSKKKCDELAAK

    Multiple motifs!

  • Gibbs clustering - the algorithm

    1. List of peptides

    FIGLKGDIR EEEVQLIAA RLKGGAPIK SCIAIGIIT QVTIAGAKL QKETLVTFK LLDNINTPE LEFHYYLSS KFISPKSVA LHNPYPDYH VKSLRILNT GMFNMLSTV SSPAYPSVL LIFCHSKKK

    2. Create N random groups

    g1: QVTIAGAKL QKETLVTFK LEFHYYLSS GMFNMLSTV SSPAYPSVL
    g2: SLFIGLKGD SFSCIAIGI KMLLDNINT KYVHGTWRS NKVKSLRIL
    gN: LHNPYPDYH LIFCHSKKK RLKGGAPIK KFISPKSVA EEEVQLIAA

    3. Move sequence (here GMFNMLSTV)

    4b. Remove peptide from its group I

    g1: QVTIAGAKL QKETLVTFK LEFHYYLSS SSPAYPSVL

    5b. Score peptide to a new random group R and in its original group I

    $dE = S_R - S_I$

    6b. Accept or reject move

    $P = \min\left[1, \exp\left(\frac{dE}{T}\right)\right]$

    gN: LHNPYPDYH LIFCHSKKK RLKGGAPIK KFISPKSVA EEEVQLIAA GMFNMLSTV

    And iterate many times, gradually decreasing T
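The clustering loop can be sketched in Python as follows. The scoring details are assumptions, since the slides do not spell them out at this point: each group is scored with a log-odds matrix built from its members (without the peptide being moved), using a uniform background, a pseudocount of 1 per amino acid, and a linear cooling schedule. All names (`gibbs_cluster`, `members_of`, `PSEUDO`, ...) are hypothetical, and the shift moves suggested by the step labels (4b, 5b, 6b) are not included.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
Q = 1.0 / len(AMINO_ACIDS)  # assumed uniform background frequency
PSEUDO = 1.0                # assumed pseudocount per amino acid


def score(peptide, members):
    """Log-odds score of a peptide against a matrix built from `members`."""
    total = 0.0
    for pos, a in enumerate(peptide):
        count = sum(1 for m in members if m[pos] == a)
        freq = (count + PSEUDO) / (len(members) + PSEUDO * len(AMINO_ACIDS))
        total += math.log(freq / Q)
    return total


def members_of(group, peptides, assignment, skip):
    """Peptides currently in `group`, leaving out the peptide at index `skip`."""
    return [p for j, p in enumerate(peptides)
            if assignment[j] == group and j != skip]


def gibbs_cluster(peptides, n_groups, n_iter=5000, t_start=1.0, t_end=0.01, seed=0):
    rng = random.Random(seed)
    assignment = [rng.randrange(n_groups) for _ in peptides]  # random groups
    for it in range(n_iter):
        t = t_start + (t_end - t_start) * it / max(1, n_iter - 1)
        i = rng.randrange(len(peptides))       # peptide to move
        old = assignment[i]                    # its group I
        new = rng.randrange(n_groups)          # random group R
        if new == old:
            continue
        s_new = score(peptides[i], members_of(new, peptides, assignment, i))
        s_old = score(peptides[i], members_of(old, peptides, assignment, i))
        d_e = s_new - s_old                    # dE = S_R - S_I
        if d_e >= 0 or rng.random() < math.exp(d_e / t):
            assignment[i] = new
    return assignment


if __name__ == "__main__":
    toy = ["QVTIAGAKL", "QKETLVTFK", "LEFHYYLSS", "GMFNMLSTV", "SSPAYPSVL"]
    print(gibbs_cluster(toy, n_groups=2, n_iter=500))
```

With this plain score nothing prevents all peptides from drifting into a single group, so the sketch should be read as the bare move-and-accept loop rather than a complete clustering method.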

  • Two MHC class I alleles: HLA-A*0101 and HLA-B*4402

    Does it work?

    Mixture of 100 binders for the two alleles

    ATDKAAAAY  A*0101
    EVDQTKIQY  A*0101
    AETGSQGVY  B*4402
    ITDITKYLY  A*0101
    AEMKTDAAT  B*4402
    FEIKSAKKF  B*4402
    LSEMLNKEY  A*0101
    GELDRWEKI  B*4402
    LTDSSTLLV  A*0101
    FTIDFKLKY  A*0101
    TTTIKPVSY  A*0101
    EEKAFSPEV  B*4402
    AENLWVPVY  B*4402

    [Sequence logos: Mixed, Resolved]

          A0101  B4402
    G1       97      3
    G2        3     97

  • Five MHC class I alleles

    Number of binders of each allele assigned to each of the five groups (G0 to G4), with the dominant allele of each group and its share of the group:

    A0101  A0201  A0301  B0702  B4402   Dominant allele
        0      1     76      1      0   HLA-A0301 97%
        2      4      0      0     95   HLA-B4402 94%
        5     87      5      1      0   HLA-A0201 89%
       93      2     19      0      2   HLA-A0101 80%
        0      6      0     98      3   HLA-B0702 92%

  • HLA-A*02:01 sub-motifs

    = 10 nM = 4 hours

    = 10 nM = 1.5 hours

    666 peptide binders (aff < 500 nM)

  • Splitting with Gibbs clustering

    = 10 nM = 3.5 hours

    = 10 nM = 2.25 hours

  • Gibbs clustering

    And what if we don’t know a priori the number of clusters?

  • How many clusters?

    We could run the algorithm with different numbers of clusters k and choose the k with the highest information content

    What’s going on?

    Smaller groups tend to have higher information content

  • How many clusters?

    Let’s look back at the Energy function

    $E = \sum_{p,a} C_{p,a} \log \frac{p_{p,a}}{q_a}$

    This is equivalent to scoring each sequence S to its matrix (L positions × 20 amino acids), where at each position p, a is the amino acid of S at that position

    $E = \sum_{S} \sum_{p,a} \log \frac{p_{p,a}}{q_a}$

    What is the problem? Overfitting. S was also used to calculate the log-odds matrix!

    The contribution of S to the matrix will be larger if the cluster is small.

    Before scoring S, remove it and update the matrix

    $E = \sum_{S} \sum_{p,a} \log \frac{p_{p,a}^{S-}}{q_a}$

    Is this so important..? YES

    Score YALTVVWLL to a matrix, including vs. excluding YALTVVWLL in the matrix construction

    YQAFRTKVH SPRTLNAWV YALTVVWLL LSSIGIPAY AVAKCNLNH TPYDINQML LLMMTLPSI KELENEYYF IENATFFIF AEMLASIDL ...

    SCORE          Num of sequences in the cluster
                     100      20       3
    w/o removing    5.52   10.42   26.78
    removing        4.11    2.57    0.05
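The overfitting effect can be reproduced qualitatively with a few lines of Python. The sketch scores YALTVVWLL against a matrix built with and without the peptide itself, for clusters of different sizes. The other cluster members are random peptides, and the uniform background and pseudocount of 0.1 are assumptions, so the numbers will not match the table above; only the trend is meant to carry over.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
Q = 1.0 / len(AMINO_ACIDS)  # assumed uniform background frequency
PSEUDO = 0.1                # assumed pseudocount per amino acid


def score(peptide, members):
    """Log-odds score of a peptide against a matrix built from `members`."""
    total = 0.0
    for pos, a in enumerate(peptide):
        count = sum(1 for m in members if m[pos] == a)
        freq = (count + PSEUDO) / (len(members) + PSEUDO * len(AMINO_ACIDS))
        total += math.log(freq / Q)
    return total


def overfit_gap(cluster_size, seed=0):
    """Score the target with and without itself in a cluster of random peptides."""
    rng = random.Random(seed)
    target = "YALTVVWLL"
    others = ["".join(rng.choice(AMINO_ACIDS) for _ in target)
              for _ in range(cluster_size - 1)]
    with_s = score(target, others + [target])  # S took part in building the matrix
    without_s = score(target, others)          # S removed before scoring
    return with_s, without_s


if __name__ == "__main__":
    for size in (100, 20, 3):
        with_s, without_s = overfit_gap(size)
        print(f"{size:>3} sequences: w/o removing {with_s:6.2f}, removing {without_s:6.2f}")
```

The gap between the two scores widens as the cluster shrinks, because a single sequence then accounts for a larger share of the counts it is scored against.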

  • How many clusters?

    Quality of clustering is not only determined by information content of individual clusters (intra-cluster distance), but also by the ability of different groups to discriminate (inter-cluster distance)

    $E = \sum_{S} \sum_{p,a} \log \frac{p_{p,a}^{S-}}{q_a}$

    $E = \sum_{S} \sum_{p,a} \log \frac{p_{p,a}^{S-}}{q_{p,a}^{S}}$

    One last thing and we are ready.

    $E = \sum_{S} \sum_{p,a} \log \frac{p_{p,a}^{S-}}{q_{p,a}^{S}} - \lambda n$

    - $\lambda$: a parameter to modulate the ‘tightness’ of the clustering (n is the number of clusters)
    - $q_{p,a}^{S}$: position and cluster-specific background (the background is calculated on all groups not containing S; it accounts for inter-cluster distance)
    - $p_{p,a}^{S-}$: frequencies are calculated by removing the sequence being scored S
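A minimal sketch of the full objective, assuming pseudocount-smoothed frequencies and an unnormalized sum over sequences. Neither choice is stated in the slides, and the names `frequency` and `clustering_energy` are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PSEUDO = 1.0  # assumed pseudocount per amino acid


def frequency(residue, pos, members):
    """Smoothed frequency of `residue` at position `pos` among `members`."""
    count = sum(1 for m in members if m[pos] == residue)
    return (count + PSEUDO) / (len(members) + PSEUDO * len(AMINO_ACIDS))


def clustering_energy(clusters, lam):
    """E = sum_S sum_p log(p^{S-} / q^{S}) - lambda * n.

    clusters: list of lists of equal-length peptides.
    p^{S-}: frequencies in S's own cluster with S removed.
    q^{S}: frequencies in all clusters that do not contain S.
    """
    total = 0.0
    for g, cluster in enumerate(clusters):
        outside = [p for h, c in enumerate(clusters) if h != g for p in c]
        for i, s in enumerate(cluster):
            inside = cluster[:i] + cluster[i + 1:]  # remove S before scoring
            for pos, a in enumerate(s):
                total += math.log(frequency(a, pos, inside)
                                  / frequency(a, pos, outside))
    return total - lam * len(clusters)


if __name__ == "__main__":
    resolved = [["AAAA", "AAAA", "AAAA"], ["WWWW", "WWWW", "WWWW"]]
    mixed = [["AAAA", "WWWW", "AAAA"], ["WWWW", "AAAA", "WWWW"]]
    print(clustering_energy(resolved, 0.02), clustering_energy(mixed, 0.02))
```

If the score were averaged per sequence rather than summed, the same value of $\lambda$ would weigh far more heavily; the slides do not say which normalization is used.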

  • How many clusters?

    [Plots: KLD sum versus number of groups (2 to 12), one panel each for 2, 3, 4, 5, 6, 7, 8 and 9 alleles, lambda = 0.02]

    Binders for 2 to 9 MHC class I alleles

  • How many clusters?

    [Plots: number of clusters versus number of alleles (2 to 8) for 10 random allele combinations, with lambda penalty = 0, 0.02 and 0.04]

  • In conclusion

    •  Sampling methods can solve problems where the search space is too large to be exhaustively explored

    •  Gibbs sampling can detect even weak motifs in a sequence alignment (e.g. MHC class II)

    •  More than 1,000 papers in PubMed using Gibbs sampling methods

        •  Transcription start-sites
        •  Receptor binding sites
        •  Acceptor:Donor sites
        •  ...