LOGOS: a modular Bayesian model for de novo motif...

23
LOGOS: a modular Bayesian model for de novo motif detection Eric Xing Wei Wu Michael Jordan Richard Karp University of California, Berkeley epxing,jordan,karp @cs.berkeley.edu Lawrance Berkeley National Lab [email protected] 13 Aug 2003 CSB2003

Transcript of LOGOS: a modular Bayesian model for de novo motif...

Page 1: LOGOS: a modular Bayesian model for de novo motif detectionepxing/papers/Old_papers/slide_CSB03/CSB... · 2005-08-18 · LOGOS: a modular Bayesian model for de novo motif detection

LOGOS: a modular Bayesian model for

de novo motif detection

Eric Xing

�Wei Wu

Michael Jordan�

Richard Karp

University of California, Berkeley�

epxing,jordan,karp

�@cs.berkeley.edu

Lawrance Berkeley National [email protected]

13 Aug 2003

CSB2003

Page 2: LOGOS: a modular Bayesian model for de novo motif detectionepxing/papers/Old_papers/slide_CSB03/CSB... · 2005-08-18 · LOGOS: a modular Bayesian model for de novo motif detection

The cis-regulatory machinery of eukaryotic organisms

CSB2003 Eric Xing et al. — 1

Page 3: LOGOS: a modular Bayesian model for de novo motif detectionepxing/papers/Old_papers/slide_CSB03/CSB... · 2005-08-18 · LOGOS: a modular Bayesian model for de novo motif detection

A DNA motif (Gcn4)

Motif features:

� PSMD: position-specific multinomial distribution� PWM: position weight matrix

CSB2003 Eric Xing et al. — 2

Page 4: LOGOS: a modular Bayesian model for de novo motif detectionepxing/papers/Old_papers/slide_CSB03/CSB... · 2005-08-18 · LOGOS: a modular Bayesian model for de novo motif detection

Meta-sequence features

� Shape bias: typical patterns of site-conservativeness due to binding surfaceconstraints.

� Spatial organization of motifs: dependencies between motif instances.

— Believed to be crucial in distinguishing biologically meaningful motifs from random

background or trivial recurring patterns.

Our goal: Build statistical models for motif-bearing sequences that:

� exploit the meta-sequence features and provide improved sensitivity tobiologically plausible motifs

� can generalize easily to various motifs and handle complex cis-regulatorystructures

� can make use of biologically identified motifs for de novo detection

CSB2003 Eric Xing et al. — 3

Page 5: LOGOS: a modular Bayesian model for de novo motif detectionepxing/papers/Old_papers/slide_CSB03/CSB... · 2005-08-18 · LOGOS: a modular Bayesian model for de novo motif detection

Introduction

� Using a hierarchical Bayesian Markovian model to posit hiddendependency structures of motifs

� Exploit this model to predict locations of novel motifs

� Talk outline

� Modeling generation of motif instances with a hidden MarkovDirichlet-multinomial model� Modeling motif instance distribution with a hidden Markov grammar� The LOGOS motif model and a variational EM algorithm for motifdetection� Experiments� Future work

CSB2003 Eric Xing et al. — 4

Page 6: LOGOS: a modular Bayesian model for de novo motif detectionepxing/papers/Old_papers/slide_CSB03/CSB... · 2005-08-18 · LOGOS: a modular Bayesian model for de novo motif detection

The Local Alignment Model— for multiple-aligned motif instances

CSB2003 Eric Xing et al. — 5

Page 7: LOGOS: a modular Bayesian model for de novo motif detectionepxing/papers/Old_papers/slide_CSB03/CSB... · 2005-08-18 · LOGOS: a modular Bayesian model for de novo motif detection

Signature internal structures of motifs

Shape bias:

gal4:

0

1

2

bit

s

5′

1

C2

G3

CG

4CGA

5TA

CG

6

T

A

GC

7

TCGA

8

TGC

9

AT

10

T

CG

11

GCT

12

T

G

AC

13

T

CG

14

G

ACT

15

GC

16

C

17

G3′

gal4-permute:

0

1

2

bit

s

5′

1

TA

CG

2

C

3

C

4

CGA

5

T

CG

6

CG

7

T

CG

8

TCGA

9

AT

10

T

G

AC

11

G

12

G

ACT

13

G

14

T

A

GC

15

GC

16

GCT

17

TGC

3′

pho4:

0

1

2

bit

s

5′

1

CAT

2 3

GCAT

4 5

CGA

6

AC

7

GA

8

C9

G10

T11

TG

12

TCG

13

TAG

3′

pho4-permute:

0

1

2

bit

s

5′

1

TAG

2

G

3

GA

4

TG

5

CAT

6

T

7 8

CGA

9

C

10

GCAT

11 12

AC

13

TCG

3′

Two yeast motifs and their column-permuted versions.

� The conservativeness of sites in the motif may be spatially dependent.

� The conserved sites are more likely to occur consecutively and possibly followed (or

preceded) by heterogeneous sites that are also consecutive.� The “shape” is only associated with the conservativeness pattern of a motif PWM, but

not with any specific consensus sequence of the motif.

CSB2003 Eric Xing et al. — 6

Page 8: LOGOS: a modular Bayesian model for de novo motif detectionepxing/papers/Old_papers/slide_CSB03/CSB... · 2005-08-18 · LOGOS: a modular Bayesian model for de novo motif detection

A standard motif alignment model

The Product Multinomial (PM) Model:

A 1 A 2 A M...={ }

1: AAAAGAGTCA2: AAATGACTCA. AAATGAGTCA . GAATGAGTCAM: AAAAGAGTCA

A=

θLθ2θ1

M M M

...A Am,2 Am,Lm,1

� The nucleotide distributions at different sites within the motif are assumedto be independent.

� Limitation:

� Unable to capture site dependence and shape preference in motifs

CSB2003 Eric Xing et al. — 7

Page 9: LOGOS: a modular Bayesian model for de novo motif detectionepxing/papers/Old_papers/slide_CSB03/CSB... · 2005-08-18 · LOGOS: a modular Bayesian model for de novo motif detection

The PSMD simplex and the Dirichlet Distribution

Each possible position-specific nucleotide distribution corresponds to a point in the simplex.

CSB2003 Eric Xing et al. — 8

Page 10: LOGOS: a modular Bayesian model for de novo motif detectionepxing/papers/Old_papers/slide_CSB03/CSB... · 2005-08-18 · LOGOS: a modular Bayesian model for de novo motif detection

A Markov model for PSMD-simplex sequence

α 1 α 3

α 2

α 4

The PSMD-simplex sequence defines a structural prior for the PWM.

CSB2003 Eric Xing et al. — 9

Page 11: LOGOS: a modular Bayesian model for de novo motif detectionepxing/papers/Old_papers/slide_CSB03/CSB... · 2005-08-18 · LOGOS: a modular Bayesian model for de novo motif detection

Hidden Markov Dirichlet-multinomial model (HMDM)

I

s s

θ θ θ

M M M

1 2 L

1 2 L

...

...

s

A Am,2 Am,Lm,1

α

� Parameters (for all motifs):� �

sets of 4-dimensional Dirichlet parameter � � � � �� � � � � �

.� Initial probability � and transition probability

of an HMM.

� Random variables (for each motif site):� The site prototype indicator indexing different distributions on thePSMD simplex is represented by a multinomial r.v. �� .� The PSMD associated with the site is represented by a Dirichlet r.v.

�� .� The nucleotide content of the site is represented as a multinomial r.v.��� �� .

CSB2003 Eric Xing et al. — 10

Page 12: LOGOS: a modular Bayesian model for de novo motif detectionepxing/papers/Old_papers/slide_CSB03/CSB... · 2005-08-18 · LOGOS: a modular Bayesian model for de novo motif detection

The Global Distribution Model— for patterns of motif occurrences in genomic

sequences

CSB2003 Eric Xing et al. — 11

Page 13: LOGOS: a modular Bayesian model for de novo motif detectionepxing/papers/Old_papers/slide_CSB03/CSB... · 2005-08-18 · LOGOS: a modular Bayesian model for de novo motif detection

Global organization of motif instances

cis-regulatory modules (Davidson [2001]):

� There exist instance (resp. cluster)-level dependencies between motifs(resp. modules).

CSB2003 Eric Xing et al. — 12

Page 14: LOGOS: a modular Bayesian model for de novo motif detectionepxing/papers/Old_papers/slide_CSB03/CSB... · 2005-08-18 · LOGOS: a modular Bayesian model for de novo motif detection

A conventional global model

The uniform and independent (UI) start-position model:

� The locations of motif instances are assumed to be independently sampledfrom a uniform distribution over possible positions.

� Limitations:

� instances can overlap� does not capture inter-instance, inter-motif dependences

CSB2003 Eric Xing et al. — 13

Page 15: LOGOS: a modular Bayesian model for de novo motif detectionepxing/papers/Old_papers/slide_CSB03/CSB... · 2005-08-18 · LOGOS: a modular Bayesian model for de novo motif detection

A hidden Markov model for motif occurrences

The state-transition grammar of the global HMM:

b(1)

(1) (1) (1)...1 2 L

(1’)... 1L(1’) 2 (1’)

β1

11/2

1/2

1

1

1

1

β1,i

β i,1

1,0

1

1

β

b(0)

0,0

αi,0β0,i αk,0 α1,0β0,1

β0,k

β

b(k)

(k) (k) (k)...1 2 L

(k’)...L(k’) 2 (k’) 1

k

k

1/2

1/2

1

1

1

1

βk,i

β i,k

k,0

1

1

...

� Latent states:

� all possible sites within motifs on the forward and reverse complementary strand� intra-cluster backgrounds� inter-cluster (global) background

� Observed output: unannotated sequence of nucleotides.

� Parameters: HMM parameters

��� �� � � �

; motif PWMs

; background frequencies

�! .

CSB2003 Eric Xing et al. — 14

Page 16: LOGOS: a modular Bayesian model for de novo motif detectionepxing/papers/Old_papers/slide_CSB03/CSB... · 2005-08-18 · LOGOS: a modular Bayesian model for de novo motif detection

A modular motif model for de novo motif detection

� The integrated LOcal and GlObal motif Sequence model

� The occurrences of motifs in a DNA sequence, as indicated by ", aregoverned by a global distribution model # � $ " %& ' ( .� For each type of motif, the nucleotide sequence pattern shared by all itsinstances admits a local alignment model #*) $+ $ " , ( % " & � (

.� Assume that the background non-motif sequences are modeled by asimple conditional model, # $ ,.- + $ , " ( % " �0/ (

.

� The likelihood of a regulatory sequence , is:

# $ , %& ( �1

# � $ " % & ' ( #) $ , % " & � (

�1

# � $ " %& ' ( #) $+ % " & � ( # $ ,- + % " �/ (

CSB2003 Eric Xing et al. — 15

Page 17: LOGOS: a modular Bayesian model for de novo motif detectionepxing/papers/Old_papers/slide_CSB03/CSB... · 2005-08-18 · LOGOS: a modular Bayesian model for de novo motif detection

Inference and parameter estimation

� The inference and learning problems:

� Maximum a posteriori (MAP) prediction of motif locations� Bayesian estimation of motif PWMs

A variational EM algorithm for LOGOS

Variational “E” step: Compute the

23 4, the expected count of each

nucleotide, for each motif and each site within it, via inference in theglobal motif model given

2576 � 4and sequence ,.

Variational “M” step: Compute the Bayesian estimation of the PWM, innatural parameter form

2 56 � 4, via inference in the local motif alignment

model given

23 4

.

CSB2003 Eric Xing et al. — 16

Page 18: LOGOS: a modular Bayesian model for de novo motif detectionepxing/papers/Old_papers/slide_CSB03/CSB... · 2005-08-18 · LOGOS: a modular Bayesian model for de novo motif detection

Experiments:

� Semi-synthetic sequence: random background (300 8600bp) with genuinemotif instance(s) planted at known positions.

� Real genomic sequences: 500-5000 bp fragments in the genomicsequences flanking the motif instances.

CSB2003 Eric Xing et al. — 17

Page 19: LOGOS: a modular Bayesian model for de novo motif detectionepxing/papers/Old_papers/slide_CSB03/CSB... · 2005-08-18 · LOGOS: a modular Bayesian model for de novo motif detection

Detecting a single motif using HMDM model

0 20 40 600

1

2

abf1 (21)

0 20 40 600

1

2

gal4 (14)

0 20 40 600

1

2

gcn4 (24)

0 20 40 600

1

2

gcr1 (17)

0 20 40 600

1

2

mat−a2 (12)

0 20 40 600

1

2

mcb (16)

0 20 40 600

1

2

mig1 (11)

0 20 40 600

1

2

crp (24)

abf1 gal4 gcn4 gcr1 mat mcb mig1 crpHMDM(f) 0.81 0.82 0.71 0.65 0 0 0.73 0.58HMDM(e) 0.77 0.89 0.63 0.12 0.67 0 1.00 0.54HMDM(b) 0.12 0 0 0.59 0.92 0.72 0.91 0.04

PM 0.14 0 0 0.41 0.79 0.13 0.64 0.04

� Result: different HMDMs can be used for different “classes” of motifs.

� can use #-value to identify best-performing model.

CSB2003 Eric Xing et al. — 18

Page 20: LOGOS: a modular Bayesian model for de novo motif detectionepxing/papers/Old_papers/slide_CSB03/CSB... · 2005-08-18 · LOGOS: a modular Bayesian model for de novo motif detection

Detecting multiple motif instances using LOGOS

� We compare three variants of LOGOS, ordered with decreasing modelexpressiveness,

� HMDM+HMM (LOGOS 9 9)� PM+HMM (LOGOS : 9)� PM+UI (LOGOS :; )

motif LOGOS < < LOGOS = < LOGOS =>

name FP FN FP FN FP FNabf1 0.31 0.21 0.68 0.20 0.79 0.91gal4 0.16 0.16 0.19 0.15 0.29 0.79gcn4 0.18 0.24 0.61 0.28 0 0.96gcr1 0.20 0.21 0.34 0.20 0.33 0.94mat 0.07 0.03 0.36 0 0.50 0.96mcb 0.37 0.09 0.36 0.08 0.33 0.94mig1 0.08 0 0.09 0 0.98 0.10crp 0.38 0.34 0.27 0.53 0 0.95

CSB2003 Eric Xing et al. — 19

Page 21: LOGOS: a modular Bayesian model for de novo motif detectionepxing/papers/Old_papers/slide_CSB03/CSB... · 2005-08-18 · LOGOS: a modular Bayesian model for de novo motif detection

Analyzing Yeast promoter sequences using LOGOS

set LOGOS MEME AlignACE seqname FP FN FP FN FP FN no.abf1 0.79 0.65 1.00 1.00 0.53 0.61 20csre 0.44 0.17 0.78 0.50 0.80 0.50 4gal4 0.13 0.07 0.17 0.29 0.33 0.14 6gcn4 0.35 0.18 1.00 1.00 0.33 0.56 9gcr1 0.29 0.61 1.00 1.00 0.45 0.46 6hstf 0.86 0.55 0.60 0.56 0.85 0.67 6mat 0.42 0 0.38 0.56 0.25 0.25 7mcb 0.47 0.25 0.20 0.33 0.25 0.25 6mig1 0.81 0.28 1.00 1.00 0.83 0.79 22pho2 0.90 0.50 1.00 1.00 1.00 1.00 3swi5 0.76 0.50 1.00 1.00 0.94 0.75 2uash 0.83 0.68 1.00 1.00 0.92 0.95 18

CSB2003 Eric Xing et al. — 20

Page 22: LOGOS: a modular Bayesian model for de novo motif detectionepxing/papers/Old_papers/slide_CSB03/CSB... · 2005-08-18 · LOGOS: a modular Bayesian model for de novo motif detection

Analyzing Drosophila promoter sequences

� Motif patterns derived from biologically identified motif instances (Berman et al. [2002]).

0

1

2

bit

s

5′

1

A

TC

2

CAT

3

GA

4

A

5

GT

6

A

TC

7

GTC

8

T

CG

3′

0

1

2

bit

s

5′1

T

G

A

2

G

3 4

G

A

TC

5

A

6

GCT

7

A

8

A

9

A

10

C

TGA

3′

0

1

2

bit

s

5′

1

GCAT

2

T

3

G

CT

4

T

5

T

6

T

7

T

CGA

8

G

CAT

9

CTG

10 11

C

G

T

3′

0

1

2

bit

s

5′

1

T

CGA

2

GCA

3

TCA

4

G

TCA

5

T

G

6

AG

7

CAG

8

GT

9

A

CGT

10

C

TGA

3′

0

1

2

bit

s

5′

1

A

2

A

3

GAC

4

GT

5

CA

6

G

7

GCA

8

CTG

9

C

10

A3′

bcd cad hb kni kr

� Motif patterns detected by LOGOS in the promoter regions (600-5000bp) of 9 Drosophilagenes.

0

1

2

bit

s

5′

1

AT

2

CTA

3

ATC

4

A

TGC

5

T

G

AC

6

TGCA

7

GTA

8

GC

9

C

10

C

11

TGA

12

CTA

13

CAT

14

C

15

GC

3′

0

1

2

bit

s

5′

1

TGA

2

ACT

3 4

G

T

AC

5

CTGA

6

G

TCA

7

GCAT

8

A

C

T

9

GTA

10

G

AT

11

A

C

GT

12

TAC

13

TAG

14 15

TAC

16

TGA

17

AT

18

T

GCA

19TA

20

CTA

3′

0

1

2

bit

s

5′

1 2

AGT

3

CAT

4

A

GCT

5

A

CGT

6

A

CGT

7

GCAT

8

GTA

9

C

GAT

10

TG

11

T

G

12

GCT

13

G

A

CT

14

G

CT

15

A

G

CT

3′

0

1

2

bit

s

5′

1

A

G

T

2

GT

3

GC

4

GC

5

G

CTA

6

T

C

AG

7

A

CG

8

C

A

T

9 10

C

G

TA

11 12

GC

13

GTC

14

GAC

15

T

G

A

16

AT

17

GACT

18

C

3′

0

1

2

bit

s

5′

1

CT

2

GTC

3

T

4

TC

5

A

G

6

ATC

7

GACT

8

CGT

9

ACT

10 3′

0

1

2

bit

s

5′

1

A

T

2

T

A

GC

3 4

A

GC

5

CG

6

G

7

AC

8 9

CTA

10

A

C

GT

11

TCG

12

AGC

13

G

C

TA

14

TCGA

15

GAT

3′

0

1

2

bit

s

5′1

AT

2

CTA

3

GTA

4

TCA

5

TA

6

TA

7

AT

8

CGT

9

GAT

10

T

GA

3′

0

1

2

bit

s

5′

1

CTG

2

C

3

C

4

TAG

5

CG

6

T

A

GC

7

A

TCG

8

GCA

9

AGT

10

C3′

� Motif patterns detected using conventional models.

0

1

2

bit

s

5′

1

GCT

2

AT

3

A

GT

4

G

AT

5

CGT

6

GTC

7

ACTG

8

A

TC

9

GTC

10

CG

11

C

G

A

12

T

ACG

13

A

TC

14

A

CGT

15

G

T

CA

3′

0

1

2

bit

s

5′

1

CGA

2

T

CAG

3

T

GA

4

CA

5

TA

6

TCA

7

GCT

8

GCT

9

G

T

A

10

CAG

11

T

CGA

12

ACT

13

T

AGC

14

GTA

15

A

TCG

16

A

C

TG

17

GT

18

C

G

A

T

19

AGT

20

GTC

3′

0

1

2

bit

s

5′

1

C

ATG

2

TGC

3T

G

AC

4

T

GA

5

GAT

6

C

TGA

7

G

T

CA

8

CTA

9

C

T

GA

10

GCA

3′

CSB2003 Eric Xing et al. — 21

Page 23: LOGOS: a modular Bayesian model for de novo motif detectionepxing/papers/Old_papers/slide_CSB03/CSB... · 2005-08-18 · LOGOS: a modular Bayesian model for de novo motif detection

Future work

� Background model: higher order HMM

� Incorporating expression, binding, comparative genomic data

� Analyzing eucaryotic cis-regulatory networks

CSB2003 Eric Xing et al. — 22