Regulatory element discovery for developmental time series

Joint work with Xuejing Li, Chris Wiggins, Valerie Reinke

Christina Leslie

Computational Biology Program, Sloan-Kettering Institute

Memorial Sloan-Kettering Cancer Center

http://cbio.mskcc.org

Regulatory networks in development

• Reinke lab: genome-wide expression for C. elegans developmental time series + germ cell/gametogenesis mutants

• Problem: decipher regulatory networks governing germline- and sex-regulated genes

Previous work: MEDUSA in yeast

• Predict up/down expression of target genes from promoter + regulator expression

• Learns from a set of mRNA expression experiments without clustering

• Problem: high correlation of nearby time points, many regulator profiles

Sequence to expression profile

• Can we learn mapping from promoter sequence to full expression trajectory (with some level of statistical significance)?

• Retain some properties of MEDUSA:

– No clustering of expression profiles

– Learn motifs de novo from promoters by building from k-mers

…AGCTATGCCATCGACTGCTCCA…
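
As a minimal sketch of the "building from k-mers" step, the snippet below counts overlapping k-mers in a promoter fragment. The function name `kmer_counts` and the choice k = 3 are illustrative assumptions; the slides do not state which k was used or whether reverse complements were merged.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every overlapping k-mer in a DNA sequence (forward strand only)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# Fragment shown on the slide; k = 3 is an arbitrary choice for illustration.
promoter = "AGCTATGCCATCGACTGCTCCA"
counts = kmer_counts(promoter, 3)
```

Stacking one such count vector per gene would give a motif matrix with one row per gene and one column per k-mer.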

Regression problem

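• Idea: learn latent factors $T = XW$ that “explain” $Y$

• Then regress $X \approx TP^t$, $Y \approx TQ^t$, or equivalently $Y \approx XB$ where $B = WQ^t$

• $X$ is $G \times M$: row $g$ is the motif vector (k-mer counts) for gene $g$

• $Y$ is $G \times E$: row $g$ is the expression profile for gene $g$

• Columns $w_i$ of $W$ = weight vectors

• Columns of $P$, $Q$ = loadings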
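
The shapes and the identity behind the coefficient matrix can be checked numerically. The sketch below uses random stand-in data and a hypothetical weight matrix `W` (top singular vectors of X^t Y, chosen only for illustration, not the fitted PLS weights); the loadings are obtained by least squares on T, which is one reasonable reading of "regress".

```python
import numpy as np

rng = np.random.default_rng(0)
G, M, E, K = 40, 6, 4, 2           # genes, k-mers, expression columns, latent factors
X = rng.normal(size=(G, M))        # stand-in motif matrix (toy data)
Y = rng.normal(size=(G, E))        # stand-in expression matrix (toy data)

# Hypothetical weights for illustration only: top-K left singular vectors of X^t Y.
W = np.linalg.svd(X.T @ Y, full_matrices=False)[0][:, :K]   # M x K

T = X @ W                                     # G x K latent factors
P = np.linalg.lstsq(T, X, rcond=None)[0].T    # M x K loadings, X ~ T P^t
Q = np.linalg.lstsq(T, Y, rcond=None)[0].T    # E x K loadings, Y ~ T Q^t
B = W @ Q.T                                   # M x E coefficients

Y_hat = X @ B                                 # identical to T @ Q.T
```

Because T = XW, predicting through the latent factors (T Q^t) and predicting directly from motif counts (X W Q^t) are the same computation.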

First step: PLS regression

• Sequentially build latent factors $t_i = Xw_i$:

– Maximize covariance between factors and $Y$

– Constrain $t_1, \ldots, t_K$ to be uncorrelated

• SIMPLS: for $i = 1, \ldots, K$

$$w_i = \arg\max_w \; w^t X^t Y Y^t X w$$

(in the 1D case, $\arg\max_w \mathrm{Cov}(Y, Xw)^2$)

subject to

$$w_i^t w_i = 1, \qquad t_i^t t_j = w_i^t X^t X w_j = 0, \quad j = 1, \ldots, i-1$$
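
A minimal NumPy sketch of this sequential optimization follows. It solves the problem as stated on the slide (unit-norm weights, orthogonal scores) by projecting X^t Y away from the directions X^t t_j of earlier factors and taking the top left singular vector. The function name `simpls_weights` and the toy data are assumptions for illustration; this is not the authors' code, and published SIMPLS implementations may normalize differently.

```python
import numpy as np

def simpls_weights(X, Y, n_factors):
    """Sequentially pick unit-norm w_i maximizing w^t X^t Y Y^t X w,
    keeping the scores t_i = X w_i mutually orthogonal."""
    X = X - X.mean(axis=0)             # center so inner products act as covariances
    Y = Y - Y.mean(axis=0)
    S = X.T @ Y                        # M x E cross-product matrix
    W = np.zeros((X.shape[1], n_factors))
    V = np.zeros((X.shape[1], 0))      # orthonormal basis of span{X^t t_j}
    for i in range(n_factors):
        S_i = S - V @ (V.T @ S)        # drop directions tied to earlier factors
        w = np.linalg.svd(S_i, full_matrices=False)[0][:, 0]   # unit-norm maximizer
        p = X.T @ (X @ w)              # later weights must be orthogonal to this
        p -= V @ (V.T @ p)             # Gram-Schmidt against the existing basis
        V = np.column_stack([V, p / np.linalg.norm(p)])
        W[:, i] = w
    return W, X @ W

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 8))      # toy data, not the worm data
Y_demo = rng.normal(size=(50, 3))
W, T = simpls_weights(X_demo, Y_demo, n_factors=3)
```

On the toy data, the columns of `T` come out mutually orthogonal and the columns of `W` have unit norm, matching the two constraints.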

Equivalent formulation

• Learn latent factors $t_i = Xw_i$ and $u_i = Yc_i$ for both predictor and response variables

– $w_i$ (motif weight vector) and $c_i$ (expression weight vector) chosen to maximize $\mathrm{Cov}(t_i, u_i)$

– for $i = 1, \ldots, K$

$$(w_i, c_i) = \arg\max_{w, c} \; w^t X^t Y c$$

subject to

$$w_i^t w_i = c_i^t c_i = 1, \qquad t_i^t t_j = w_i^t X^t X w_j = 0, \quad j = 1, \ldots, i-1$$
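
For the first factor, where no orthogonality constraint is active yet, the two formulations can be compared directly: the maximizing pair is the top pair of singular vectors of X^t Y. The sketch below checks this on random toy data; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 7)); X -= X.mean(axis=0)   # toy motif matrix, centered
Y = rng.normal(size=(60, 5)); Y -= Y.mean(axis=0)   # toy expression matrix, centered

U, s, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
w1, c1 = U[:, 0], Vt[0]                 # unit-norm motif and expression weights

objective = w1 @ X.T @ Y @ c1                     # value of w^t X^t Y c at the optimum
simpls_objective = w1 @ X.T @ Y @ Y.T @ X @ w1    # value of w^t X^t Y Y^t X w

# Random unit vectors should never beat the singular-vector pair.
best_random = max(
    (w / np.linalg.norm(w)) @ X.T @ Y @ (c / np.linalg.norm(c))
    for w, c in zip(rng.normal(size=(1000, 7)), rng.normal(size=(1000, 5)))
)
```

Here `objective` equals the top singular value and `simpls_objective` equals its square, so both problems select the same first weight vector.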

Next steps: sparsity, graph Laplacian

• For regularization and interpretability of weight vectors, want

– sparsity in $w$: want most components to be 0

$$\sum_k |w_k| \le b_1$$

– smoothness in $w$: define graph on set of k-mers, with edge $k \sim l$ if corresponding k-mers are close in Hamming distance

$$\sum_{k \sim l} (w_k - w_l)^2 \le b_2$$
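
Both penalty terms are easy to compute once the k-mer graph is built. The sketch below assumes "close" means Hamming distance at most 1 (the slide gives no threshold) and uses all 3-mers with random weights as toy input. The smoothness term is written as the quadratic form w^t L w with graph Laplacian L = D - A, which equals the sum over edges of squared weight differences.

```python
from itertools import product
import numpy as np

def hamming(a, b):
    """Number of mismatched positions between two equal-length k-mers."""
    return sum(x != y for x, y in zip(a, b))

def kmer_laplacian(kmers, max_dist=1):
    """Adjacency A and Laplacian L = D - A of the k-mer graph.
    max_dist=1 is an assumed threshold for 'close in Hamming distance'."""
    n = len(kmers)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if hamming(kmers[i], kmers[j]) <= max_dist:
                A[i, j] = A[j, i] = 1.0
    return np.diag(A.sum(axis=1)) - A, A

kmers = ["".join(p) for p in product("ACGT", repeat=3)]   # all 64 3-mers
L, A = kmer_laplacian(kmers)
w = np.random.default_rng(2).normal(size=len(kmers))      # toy weight vector

l1_penalty = np.abs(w).sum()     # sparsity term, sum_k |w_k|
smooth_penalty = w @ L @ w       # smoothness term, sum over edges (w_k - w_l)^2
```

The double loop is quadratic in the number of k-mers, which is fine for a sketch but would need a neighbor-enumeration approach for large k.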

Preliminary results: worm time series

• Reinke data: ~9000 genes, 12 time points (3 replicates), wild type germline development

• Gene sets, from mutant expression data:

– Sperm genes: high expression in spermatogenesis

– Oocyte genes: high expression in oogenesis

• Motif matrix: filter k-mers based on expected counts

Standard PLS

• 10-fold c.v. on held-out genes
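
A hedged sketch of what 10-fold cross-validation over genes could look like: genes (rows) are split into 10 folds, the model is fit on 9 and scored on the held-out fold. The helper names, the mean squared error score, and the ordinary least squares placeholder model are all assumptions for illustration; the slides do not specify the fold assignment or the error measure, and a PLS fit would be plugged in where the placeholder sits.

```python
import numpy as np

def kfold_gene_splits(n_genes, n_folds=10, seed=0):
    """Yield (train, test) index arrays; each gene is held out exactly once."""
    idx = np.random.default_rng(seed).permutation(n_genes)
    folds = np.array_split(idx, n_folds)
    for i in range(n_folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, folds[i]

def cv_squared_error(X, Y, fit, predict, n_folds=10):
    """Mean squared prediction error on held-out genes."""
    total = 0.0
    for train, test in kfold_gene_splits(X.shape[0], n_folds):
        model = fit(X[train], Y[train])
        total += np.sum((Y[test] - predict(model, X[test])) ** 2)
    return total / Y.size

# Placeholder model: ordinary least squares, standing in for a PLS fit.
def fit_ols(X, Y):
    return np.linalg.lstsq(X, Y, rcond=None)[0]

def predict_ols(B, X):
    return X @ B

rng = np.random.default_rng(3)
X_toy = rng.normal(size=(100, 5))                                   # toy data
Y_toy = X_toy @ rng.normal(size=(5, 3)) + 0.1 * rng.normal(size=(100, 3))
err = cv_squared_error(X_toy, Y_toy, fit_ols, predict_ols)
```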

Regularized PLS

• 10-fold c.v. on held-out genes

Regularized PLS

• Sperm/oocyte gene sets: largest chi-square reduction for 3rd/1st latent factor

Interpretation of factor weights

• To infer motifs relevant for an expression pattern:

– Latent factors ti = Xwi and ui = Yci for both predictor and response variables

– wi and ci chosen to maximize Cov(ti,ui)

• ci gives weights over time points: interpret as expression pattern

• wi gives weights over motifs: highly weighted motifs relevant for this expression pattern
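
Reading off the highly weighted motifs for a factor amounts to sorting k-mers by weight. The sketch below ranks by absolute weight, which is an assumption (the slides say only "highly weighted"); the k-mers and weights are made-up toy values, not results from the worm data.

```python
import numpy as np

def top_motifs(w, kmers, n=50):
    """Return the n k-mers with the largest absolute weight, with their weights."""
    order = np.argsort(-np.abs(w))[:n]
    return [(kmers[i], float(w[i])) for i in order]

# Toy k-mers and weights for illustration only.
toy_kmers = ["AAAAC", "AAAAG", "AAAAT", "AAACA"]
toy_w = np.array([0.8, -0.6, 0.05, 0.0])
ranked = top_motifs(toy_w, toy_kmers, n=2)
```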

Sperm genes

• c3 correlated with sperm gene expression, consistent with drop in chi-square

Motif graph for sperm genes

• Top 50 k-mer graph for w3, clusters around GATAA (ELT-1) and ACGTG (bHLH)

Oocyte genes

• Oocyte genes correlate with c1 pattern

Oocyte motif map

• Top 50 k-mer graph for w1, log(p) vs weight

Some related work

• Zhang et al, 2008: PCA in Y for motif discovery

• Naughton et al, 2006: algorithmic motif search using graph representation

• Beer and Tavazoie, 2004; Segal et al, 2002: sequence to expression via clustering
