Regulatory element discovery for developmental time series
Joint work with Xuejing Li, Chris Wiggins, Valerie Reinke
Christina Leslie
Computational Biology Program, Sloan-Kettering Institute
Memorial Sloan-Kettering Cancer Center
http://cbio.mskcc.org
Regulatory networks in development
• Reinke lab: genome-wide expression for C. elegans developmental time series + germ cell/gametogenesis mutants
• Problem: decipher regulatory networks governing germline- and sex-regulated genes
Previous work: MEDUSA in yeast
• Predict up/down expression of target genes from promoter + regulator expression
• Learns from a set of mRNA expression experiments without clustering
• Problem: high correlation of nearby time points, many regulator profiles
Sequence to expression profile
• Can we learn mapping from promoter sequence to full expression trajectory (with some level of statistical significance)?
• Retain some properties of MEDUSA:
– No clustering of expression profiles
– Learn motifs de novo from promoters by building from k-mers
…AGCTATGCCATCGACTGCTCCA…
Regression problem
• Idea: learn latent factors $T = XW$ that “explain” $Y$
• Then regress $X \approx TP^t$, $Y \approx TQ^t$, or $Y \approx XB$ where $B = WQ^t$
• $X$ is $G \times M$: row $g$ = motif vector (k-mer counts) for gene $g$
• $Y$ is $G \times E$: row $g$ = expression profile for gene $g$
• Columns $w_i$ of $W$ = weight vectors
• Columns of $P$, $Q$ = loadings
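A minimal numerical sketch of the factor model above, assuming X is stored as genes x motifs so that the coefficient matrix multiplies on the right. The sizes, the random data, and the placeholder weights W are illustrative assumptions only; in PLS the weights are learned, not drawn at random.

```python
import numpy as np

rng = np.random.default_rng(0)
G, M, E, K = 50, 20, 12, 3           # toy sizes: genes, motifs, time points, factors

X = rng.normal(size=(G, M))          # stand-in for the motif matrix (k-mer counts)
Y = rng.normal(size=(G, E))          # stand-in for the expression matrix
W = rng.normal(size=(M, K))          # placeholder weights (PLS would learn these)

T = X @ W                            # latent factors, G x K

# Least-squares loadings: regress X and Y on the latent factors T
Pt, *_ = np.linalg.lstsq(T, X, rcond=None)   # K x M, plays the role of P^t
Qt, *_ = np.linalg.lstsq(T, Y, rcond=None)   # K x E, plays the role of Q^t

B = W @ Qt                           # M x E coefficient matrix, B = W Q^t
Y_hat = X @ B                        # identical to T @ Qt by construction
```

The point of the sketch is only the bookkeeping: predicting through the latent factors (T Q^t) and predicting directly from the motif matrix (X B) give the same fitted values.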
First step: PLS regression
• Sequentially build latent factors $t_i = Xw_i$:
– Maximize covariance between factors and $Y$
– Constrain $t_1, \dots, t_K$ to be uncorrelated
• SIMPLS: for $i = 1, \dots, K$
$$w_i = \arg\max_w \; w^t X^t Y Y^t X w$$
(in 1D case: $\arg\max_w \mathrm{Cov}(Y, Xw)^2$)
subject to
$$w_i^t w_i = 1, \qquad t_i^t t_j = w_i^t X^t X w_j = 0, \quad j = 1, \dots, i-1$$
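The sequential construction can be sketched as follows. This follows the usual SIMPLS deflation scheme (project the cross-product matrix away from previous loadings, then take its leading left singular vector); whether the talk's implementation matches this in detail is an assumption. The function name `simpls_weights` and the toy data are hypothetical.

```python
import numpy as np

def simpls_weights(X, Y, K):
    """Return unit-norm weight vectors W (M x K) and factors T = X W (G x K)."""
    X = X - X.mean(axis=0)           # center so cross-products act as covariances
    Y = Y - Y.mean(axis=0)
    S = X.T @ Y                      # M x E cross-product matrix
    M = X.shape[1]
    W = np.zeros((M, K))
    V = np.zeros((M, K))             # orthonormal basis for the X-loadings
    for i in range(K):
        U, _, _ = np.linalg.svd(S, full_matrices=False)
        w = U[:, 0]                  # unit norm; maximizes w^t S S^t w
        t = X @ w                    # latent factor t_i = X w_i
        p = X.T @ t / (t @ t)        # loading of X on t_i
        v = p - V[:, :i] @ (V[:, :i].T @ p)   # orthogonalize against earlier loadings
        v /= np.linalg.norm(v)
        S = S - np.outer(v, v @ S)   # deflate: later w's are orthogonal to the loadings
        W[:, i] = w
        V[:, i] = v
    return W, X @ W

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 10))    # toy motif matrix
Y_toy = rng.normal(size=(60, 5))     # toy expression matrix
W_toy, T_toy = simpls_weights(X_toy, Y_toy, K=3)
```

Deflating the cross-product matrix rather than X itself is what makes later weight vectors orthogonal to earlier loadings, which is equivalent to the constraint that the factors are mutually uncorrelated.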
Equivalent formulation
• Learn latent factors $t_i = Xw_i$ and $u_i = Yc_i$ for both predictor and response variables
– $w_i$ and $c_i$ chosen to maximize $\mathrm{Cov}(t_i, u_i)$
– for $i = 1, \dots, K$
$$(w_i, c_i) = \arg\max_{w,c} \; w^t X^t Y c$$
subject to
$$w_i^t w_i = c_i^t c_i = 1, \qquad t_i^t t_j = w_i^t X^t X w_j = 0, \quad j = 1, \dots, i-1$$
$w_i$ = motif weight vector, $c_i$ = expression weight vector
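For the first factor, the equivalence of the two formulations can be checked numerically: with unit-norm w and c, the pair maximizing w^t X^t Y c is the leading singular vector pair of X^t Y, and that w is also the leading eigenvector of X^t Y Y^t X. The sketch below uses random toy data, which is an assumption made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 8))
X -= X.mean(axis=0)                  # centered toy motif matrix
Y = rng.normal(size=(40, 5))
Y -= Y.mean(axis=0)                  # centered toy expression matrix

S = X.T @ Y                          # cross-product matrix X^t Y

# Two-vector formulation: leading singular pair of S
U, s, Vt = np.linalg.svd(S, full_matrices=False)
w, c = U[:, 0], Vt[0]
objective = w @ S @ c                # value of w^t X^t Y c at the optimum

# One-vector formulation: leading eigenvector of S S^t = X^t Y Y^t X
evals, evecs = np.linalg.eigh(S @ S.T)   # eigenvalues in ascending order
w_eig = evecs[:, -1]

same_direction = bool(np.isclose(abs(w @ w_eig), 1.0))   # equal up to sign
```

The optimal value of the two-vector objective is the top singular value, and its square is the optimal value of the one-vector objective.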
Next steps: sparsity, graph Laplacian
• For regularization and interpretability of weight vectors, want
– sparsity in $w$: want most components to be 0
$$\sum_k |w_k| \le b_1$$
– smoothness in $w$: define graph on set of k-mers, with edge $k \sim l$ if corresponding k-mers are close in Hamming distance
$$\sum_{k \sim l} (w_k - w_l)^2 \le b_2$$
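A small sketch of the two penalties on a k-mer graph. Two choices here are assumptions, not taken from the slides: "close" is read as Hamming distance 1, and sparsity is measured by the L1 norm. With graph Laplacian L = D - A, the smoothness term equals the quadratic form w^t L w.

```python
import itertools
import numpy as np

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

# All 64 DNA 3-mers (a short k keeps the example small)
kmers = ["".join(p) for p in itertools.product("ACGT", repeat=3)]
n = len(kmers)

# Adjacency: edge k ~ l if the two k-mers differ at exactly one position (assumed threshold)
A = np.zeros((n, n))
for i, j in itertools.combinations(range(n), 2):
    if hamming(kmers[i], kmers[j]) == 1:
        A[i, j] = A[j, i] = 1.0

L = np.diag(A.sum(axis=1)) - A       # graph Laplacian D - A

rng = np.random.default_rng(2)
w = rng.normal(size=n)               # toy weight vector over k-mers

l1_penalty = np.abs(w).sum()         # sparsity term (assumed L1 norm)
smooth_penalty = w @ L @ w           # equals the sum over edges of (w_k - w_l)^2
```

Writing the smoothness term through the Laplacian turns it into a quadratic form, which is convenient when it is combined with the PLS objective.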
Preliminary results: worm time series
• Reinke data: ~9000 genes, 12 time points (3 replicates), wild type germline development
• Gene sets, from mutant expression data:
– Sperm genes: high expression in spermatogenesis
– Oocyte genes: high expression in oogenesis
• Motif matrix: filter k-mers based on expected counts
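As an illustration of what a motif matrix looks like, the sketch below counts overlapping k-mers in each promoter sequence. The sequences are made up, the function name `kmer_count_matrix` is hypothetical, and the expected-count filter mentioned above is not implemented because its details are not given here. Reverse complements are not merged.

```python
from collections import Counter
import numpy as np

def kmer_count_matrix(promoters, k):
    """Return (X, vocab): X[g, m] = count of k-mer vocab[m] in promoter g."""
    counts = [
        Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        for seq in promoters
    ]
    vocab = sorted(set().union(*counts))          # every k-mer seen in any promoter
    X = np.array([[c[kmer] for kmer in vocab] for c in counts])
    return X, vocab

# Made-up promoter fragments, for illustration only
promoters = ["AGCTATGCCATCG", "GATAAGATAACGT", "ACGTGACGTGCCA"]
X, vocab = kmer_count_matrix(promoters, k=5)
```

Each row of X is the motif vector for one gene; a real analysis would use full promoter regions and a far larger k-mer vocabulary.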
Standard PLS
• 10-fold c.v. on held-out genes
Regularized PLS
• 10-fold c.v. on held-out genes
Regularized PLS
• Sperm/oocyte gene sets: largest chi-square reduction for 3rd/1st latent factor
Interpretation of factor weights
• To infer motifs relevant for an expression pattern:
– Latent factors ti = Xwi and ui = Yci for both predictors and response variables
– wi and ci chosen to maximize Cov(ti,ui)
• ci gives weights over time points: interpret as expression pattern
• wi gives weights over motifs: highly weighted motifs relevant for this expression pattern
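Reading off the highly weighted motifs for one factor amounts to sorting its weight vector. The sketch below ranks by absolute weight, which is an assumption (ranking by signed weight is equally plausible); the helper `top_kmers`, the k-mers, and the weights are made-up placeholders.

```python
import numpy as np

def top_kmers(w, kmers, n=50):
    """Return the n k-mers with the largest |weight|, with their signed weights."""
    order = np.argsort(-np.abs(w))[:n]
    return [(kmers[i], float(w[i])) for i in order]

# Placeholder k-mers and weights, for illustration only
kmers = ["AAAAA", "CCCCC", "GGGGG", "TTTTT"]
w = np.array([0.05, -0.7, 0.9, 0.0])
ranked = top_kmers(w, kmers, n=2)
```

The matching c_i would be inspected the same way, as a vector of weights over time points.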
Sperm genes
• c3 correlated with sperm gene expression, consistent with drop in chi-square
Motif graph for sperm genes
• Top 50 k-mer graph for w3, clusters around GATAA (ELT-1) and ACGTG (bHLH)
Oocyte genes
• Oocyte genes correlate with c1 pattern
Oocyte motif map
• Top 50 k-mer graph for w1, log(p) vs weight
Some related work
• Zhang et al, 2008: PCA in Y for motif discovery
• Naughton et al, 2006: algorithmic motif search using graph representation
• Beer and Tavazoie, 2004; Segal et al, 2002: sequence to expression via clustering