Bayesian detection of non-sinusoidal periodic patterns in circadian expression data Darya Chudova, Alexander Ihler, Kevin K. Lin, Bogi Andersen and Padhraic Smyth BIOINFORMATICS Gene expression Vol. 25 no. 23 2009, pages 3114-3120

Page 1:

Bayesian detection of non-sinusoidal periodic patterns in circadian expression data

Darya Chudova, Alexander Ihler, Kevin K. Lin, Bogi Andersen and Padhraic Smyth

BIOINFORMATICS, Gene expression, Vol. 25 no. 23 2009, pages 3114-3120

Page 2:

Outline

Introduction

Methodology

Experimental Results

Conclusion

Page 3:

Introduction

Cyclical biological processes, such as the cell cycle, hair growth cycle, mammary cycle and circadian rhythms, produce coordinated periodic expression of thousands of genes.

Existing computational methods are biased toward discovering genes that follow sine-wave patterns.

The objective is to identify or rank which of these genes are most likely to be periodically regulated.

Page 4:

Introduction

Existing methods fall into two major categories:

Frequency domain: compute the spectrum of the average expression profile for each probe, then test the significance of the dominant frequency against a suitable null hypothesis such as uncorrelated noise. Not well suited for short time courses.

Time domain: identify sinusoidal expression patterns. Simple and computationally efficient, but not effective at finding periodic signals which violate the sinusoidal assumption.
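A frequency-domain detector of the kind described above might look like the following minimal sketch. It is not taken from the article: the function names, the peak-share statistic (largest periodogram ordinate divided by the sum of the ordinates) and the permutation null standing in for "uncorrelated noise" are all assumptions made for illustration.

```python
import cmath
import math
import random


def periodogram(profile):
    """Power at Fourier frequencies k = 1 .. (n-1)//2, Nyquist excluded."""
    n = len(profile)
    mean = sum(profile) / n
    centered = [y - mean for y in profile]
    powers = {}
    for k in range(1, (n - 1) // 2 + 1):
        coeff = sum(y * cmath.exp(-2j * math.pi * k * t / n)
                    for t, y in enumerate(centered))
        powers[k] = abs(coeff) ** 2 / n
    return powers


def peak_share(profile):
    """Dominant frequency index and its share of the total power."""
    powers = periodogram(profile)
    total = sum(powers.values())
    k_best = max(powers, key=powers.get)
    return k_best, (powers[k_best] / total if total > 0 else 0.0)


def dominant_frequency_test(profile, n_perm=199, seed=0):
    """Return (dominant k, peak share, permutation p-value)."""
    k_obs, share_obs = peak_share(profile)
    rng = random.Random(seed)
    shuffled = list(profile)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # shuffling removes any temporal structure
        if peak_share(shuffled)[1] >= share_obs - 1e-12:
            hits += 1
    return k_obs, share_obs, (hits + 1) / (n_perm + 1)


# Hypothetical profile: 2 cycles sampled at 6 time points each (12 points).
example = [math.cos(2 * math.pi * t / 6) for t in range(12)]
k, share, p_value = dominant_frequency_test(example)
```

With only 12 samples the periodogram has five usable frequencies, which illustrates why this family of methods is not well suited for short time courses.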

Page 5:

Introduction

This article presents a general statistical framework for detecting periodic profiles from time course data:

Analyze the similarity of observed profiles across the cycles.

Discover periodic transcripts of arbitrary shapes from replicated gene expression profiles.

Provide an empirical Bayes procedure for estimating parameters of the prior distribution.

Derive closed-form expressions for the posterior probability of periodicity.

Page 6:

Introduction

Expression profiles from the murine liver time course data set.

Two of these probe sets (Nr1d1 and Arntl) correspond to well-established clock-control genes.

Page 7:

Methodology

Probabilistic mixture model:

Differentially expressed genes change their expression level in response to changes in experimental conditions.

Background genes remain constant throughout the experiment.

Coordinated expression across multiple cycles is used to model periodic phenomena.

Page 8:

Methodology

Model the data using a mixture of three components for background, differentially expressed and periodically expressed profiles.

Compute the posterior probability that a given probe set was generated by the periodic component.

Page 9:

Methodology

A probabilistic model for periodicity

The data consist of N probe sets measured over C cycles of known length.

Each cycle is represented by the same grid of T time points, indexed from 1 to T.

Denote the number of replicate observations for probe set $i \in \{1,\dots,N\}$ at time point $j \in \{1,\dots,T\}$ of cycle $c \in \{1,\dots,C\}$ by $n_{ij}^c$.

$Y_{ijk}^c$: the expression intensity value for a particular probe set i, time point j and replicate k for cycle c.

$Y_i$: the entire set of observations for probe set i.
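One way to hold this indexing in code is sketched below. The container layout and the names `build_observations` and `replicate_count` are hypothetical and only mirror the notation: each probe set maps (cycle, time point) to its list of replicate intensities, so the replicate count is the length of that list.

```python
from collections import defaultdict


def build_observations(records):
    """records: iterable of (probe, cycle, time_point, intensity) tuples."""
    observations = defaultdict(lambda: defaultdict(list))
    for probe, cycle, time_point, intensity in records:
        observations[probe][(cycle, time_point)].append(intensity)
    return observations


def replicate_count(observations, probe, cycle, time_point):
    """Number of replicates for this probe set at this cycle and time point."""
    return len(observations[probe].get((cycle, time_point), []))


# Made-up intensities, for illustration only.
records = [
    ("probeA", 1, 1, 7.1),
    ("probeA", 1, 1, 7.3),
    ("probeA", 2, 1, 7.0),
]
Y = build_observations(records)
```

Unequal replicate counts per time point and cycle are handled naturally, since each list is stored separately.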

Page 10:

Methodology

Our probabilistic model for expression, then, consists of three components: background (b), differentially expressed but aperiodic (d) and periodically expressed profiles (p).

Let $Z_i \in \{b, d, p\}$ denote the component associated with probe set i.

Each of the three component models consists of a Normal/Inverse Gamma (NIG) prior distribution on the latent profile and additional Normal noise on the observations.

Page 11:

Methodology

The Normal/Inverse Gamma (NIG) prior is a flexible and computationally convenient distribution commonly used as a prior model for latent expression levels and replicate variability.

Scalar variables $(\mu, \sigma^2)$ are distributed as NIG with parameters $(\mu_0, \lambda; a, b)$ when

$P(\mu, \sigma^2) = N(\mu \mid \mu_0, \sigma^2/\lambda)\,\Gamma^{-1}(\sigma^2 \mid a, b)$

$\Gamma^{-1}(x \mid a, b)$: inverse Gamma distribution with a degrees of freedom and scale parameter b, evaluated at x.
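The NIG density can be evaluated directly in log space, as in the sketch below. The argument names `mu0` (location), `lam` (scaling of the variance of the mean), `a` and `b` are assumptions of this sketch, and the inverse Gamma is taken in the common shape/scale form with density $b^a/\Gamma(a)\, x^{-a-1} e^{-b/x}$, which may differ from the slide's "degrees of freedom" convention by a constant factor in `a`.

```python
import math


def log_normal_pdf(x, mean, variance):
    """Log density of a Normal with the given mean and variance."""
    return -0.5 * (math.log(2 * math.pi * variance) + (x - mean) ** 2 / variance)


def log_inverse_gamma_pdf(x, a, b):
    """Log density of an inverse Gamma with shape a and scale b (assumed form)."""
    return a * math.log(b) - math.lgamma(a) - (a + 1) * math.log(x) - b / x


def log_nig_pdf(mu, sigma2, mu0, lam, a, b):
    """Log of N(mu | mu0, sigma2 / lam) * InvGamma(sigma2 | a, b)."""
    return log_normal_pdf(mu, mu0, sigma2 / lam) + log_inverse_gamma_pdf(sigma2, a, b)


# Evaluate at mu = 0, sigma2 = 1 under illustrative prior values.
value = log_nig_pdf(0.0, 1.0, mu0=0.0, lam=1.0, a=1.0, b=1.0)
```

The prior is convenient because it is conjugate to Normal observations, which is what makes closed-form posterior computations possible.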

Page 12:

Methodology

Three types of unknown quantities:

The prior parameters, denoted $\Theta$: determined via an empirical Bayes procedure and subsequently treated as known and fixed.

Probe set-specific hidden variables: the latent profiles (consisting of a mean and variance) for each component.

The component identity $Z_i$, indicating from which component the data were generated.

Page 13:

Methodology

Figure: graphical model relating the observed profiles Y to the latent variables Z (component identity) and $\{\mu, \sigma^2\}$. The plate covers the N probe sets and is repeated N times.

Page 14:

Methodology

The background component model:

An NIG prior is shared by all background probe sets and is parameterized by four scalars, $\theta^b = \{\mu_0^b, \lambda^b; a^b, b^b\}$.

The observations $Y_i$ are modeled as independent samples from a Gaussian distribution with mean $\mu_i^b$ and variance $(\sigma_i^b)^2$, where the pair $(\mu_i^b, (\sigma_i^b)^2)$ follows the NIG prior.

Page 15:

Methodology

The differentially expressed component model:

Let $\mu_i^d$ and $(\sigma_i^d)^2$ be (C x T)-dimensional vectors.

The prior distribution for this component is defined by four (C x T)-dimensional parameters, $\theta^d = \{\mu_0^d, \lambda^d; a^d, b^d\}$.

Model the observations as being independent given $(\mu_i^d, (\sigma_i^d)^2)$: each observation is Normal with the mean and variance given by the entry for its cycle and time point.

Page 16:

Methodology

The periodic component model:

Assume repeated expression of the same pattern across multiple cycles.

$\mu_i^p$ and $(\sigma_i^p)^2$ are T-dimensional variables encoding expression levels and replicate variability in the 'ideal' cycle.
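The three components differ mainly in how observations share a latent mean and variance: one pair for the whole profile (background), one per cycle and time point (differential), and one per time point pooled across cycles (periodic). The sketch below makes that grouping concrete using the standard conjugate NIG marginal likelihood for Normal data. It is a simplification rather than the article's derivation: a single scalar prior `(mu0, lam, a, b)` is shared by all components, and all names are hypothetical.

```python
import math


def nig_log_marginal(values, mu0, lam, a, b):
    """Log marginal likelihood of i.i.d. Normal data under a conjugate NIG prior."""
    n = len(values)
    mean = sum(values) / n
    ss = sum((y - mean) ** 2 for y in values)
    lam_n = lam + n
    a_n = a + n / 2
    b_n = b + 0.5 * ss + lam * n * (mean - mu0) ** 2 / (2 * lam_n)
    return (math.lgamma(a_n) - math.lgamma(a)
            + a * math.log(b) - a_n * math.log(b_n)
            + 0.5 * (math.log(lam) - math.log(lam_n))
            - 0.5 * n * math.log(2 * math.pi))


def component_log_marginals(obs, prior):
    """obs maps (cycle, time_point) -> list of replicates for one probe set."""
    everything = [y for reps in obs.values() for y in reps]
    by_time = {}
    for (_cycle, time_point), reps in obs.items():
        by_time.setdefault(time_point, []).extend(reps)
    return {
        # background: one latent level for the whole profile
        "b": nig_log_marginal(everything, *prior),
        # differential: a separate latent level per (cycle, time point)
        "d": sum(nig_log_marginal(reps, *prior) for reps in obs.values()),
        # periodic: a separate latent level per time point, shared by cycles
        "p": sum(nig_log_marginal(reps, *prior) for reps in by_time.values()),
    }


# Made-up profile: a peak at time point 2 that repeats in both cycles.
obs = {
    (1, 1): [0.0, 0.1], (1, 2): [5.0, 5.1], (1, 3): [0.0, -0.1],
    (2, 1): [0.1, 0.0], (2, 2): [5.1, 4.9], (2, 3): [-0.1, 0.1],
}
log_marginals = component_log_marginals(obs, prior=(0.0, 1.0, 1.0, 1.0))
```

On this toy profile the periodic grouping scores highest: it explains the repeated peak with fewer latent levels than the differential grouping and far better than a single background level.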

Page 17:

Methodology

The complete set of prior parameters $\Theta$ includes the prior component probabilities $\pi_z$ (corresponding to the relative frequencies of background, differentially expressed, and periodic probe sets):

$\Theta = \{(\theta^z; \pi_z),\ z \in \{b, d, p\}\}$

Page 18:

Methodology

Inference

Detect periodic expression by computing the posterior probability of the periodic component.
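For a mixture model this posterior follows from Bayes' rule: the prior probability of each component times the marginal likelihood of the probe set's data under that component, normalized over the three components. The sketch below assumes the per-component log marginal likelihoods are already available; the function name and the numeric inputs are illustrative only.

```python
import math


def posterior_probabilities(log_marginals, priors):
    """Bayes' rule over mixture components, computed stably in log space."""
    log_joint = {z: math.log(priors[z]) + log_marginals[z] for z in log_marginals}
    peak = max(log_joint.values())
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in log_joint.values()))
    return {z: math.exp(v - log_norm) for z, v in log_joint.items()}


# Illustrative inputs: made-up log marginal likelihoods and component frequencies.
priors = {"b": 0.8, "d": 0.1, "p": 0.1}
posterior = posterior_probabilities({"b": -30.8, "d": -23.4, "p": -18.6}, priors)
periodic_score = posterior["p"]  # rank probe sets by this value
```

Subtracting the largest log term before exponentiating avoids underflow when the marginal likelihoods are very small, which is typical for profiles with many observations.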

Page 19:

Methodology

An analysis of variance periodicity detector

The resulting inferential test for periodicity is quite close to a simplified, non-Bayesian test based on analysis of variance (ANOVA).

To construct the ANOVA test, divide the data into groups by their associated time points regardless of cycle number: all replicates $Y_{ijk}^c$ for $c = 1,\dots,C$ and $k = 1,\dots,n_{ij}^c$ fall into the same group.

Page 20:

Methodology

The test asks whether the data support separation into these groups, that is, whether the amount of variation between groups is significantly larger than the variation found within the groups.

High values of the ratio of these quantities indicate that most of the variability in the observations can be explained using a time-dependent, cycle-independent profile.
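The ratio in question is the usual one-way ANOVA F statistic, with one group per time point. A minimal sketch follows; the function names and the toy data are hypothetical, and the code assumes at least two groups and non-zero within-group variation.

```python
def anova_f_statistic(groups):
    """One-way ANOVA F ratio: between-group over within-group mean square."""
    groups = [g for g in groups if g]
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def periodicity_f_statistic(obs):
    """Group replicates by time point, ignoring the cycle, then compute F."""
    by_time = {}
    for (_cycle, time_point), reps in obs.items():
        by_time.setdefault(time_point, []).extend(reps)
    return anova_f_statistic(list(by_time.values()))


# Made-up profile with 2 cycles and 3 time points.
obs = {
    (1, 1): [1.0, 2.0], (2, 1): [3.0],
    (1, 2): [2.0, 3.0], (2, 2): [4.0],
    (1, 3): [3.0, 4.0], (2, 3): [5.0],
}
f_value = periodicity_f_statistic(obs)
```

A P-value would then come from the F distribution with (groups - 1, observations - groups) degrees of freedom, for example via `scipy.stats.f.sf` if SciPy is available.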

Page 21:

Methodology

Estimating parameters of the prior distribution:

Develop an empirical Bayes procedure to determine the prior parameters.

Determine a tentative assignment of probe sets to each component.

Use this assignment to find approximate maximum likelihood estimates of the location scale and the parameters of the inverse Gamma distribution (a, b); the location mean is set to 0 in all three components.
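As a rough stand-in for the estimation step, the sketch below fits the inverse Gamma parameters (a, b) to a set of per-probe variance estimates by moment matching. This is a deliberately simpler technique than the approximate maximum likelihood procedure named on the slide, and it assumes the shape/scale form of the inverse Gamma with mean b/(a-1) and variance b^2/((a-1)^2 (a-2)). The function name and the input values are hypothetical.

```python
import statistics


def inverse_gamma_moment_fit(variances):
    """Match the sample mean and variance to those of an inverse Gamma(a, b)."""
    m = statistics.mean(variances)
    v = statistics.pvariance(variances)
    a = m * m / v + 2.0   # from variance = mean^2 / (a - 2)
    b = m * (a - 1.0)     # from mean = b / (a - 1)
    return a, b


# Made-up replicate variances from probe sets tentatively assigned to one component.
a_hat, b_hat = inverse_gamma_moment_fit([1.0, 2.0, 3.0])
```

Moment estimates of this kind are sometimes used to initialize a likelihood-based fit.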

Page 22:

Methodology

To find a tentative initial assignment of probe sets for estimating the prior parameters, run the ANOVA detectors of differential expression and periodicity:

Differentially expressed component: probe sets that vary significantly over time (P<0.01).

Background component: probe sets which fail this test (P>0.1).

Periodic component: choosing those probe sets with P<0.001 results in a number of probe sets similar to that previously identified in the literature.
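The thresholding can be written down directly, as in the sketch below. The thresholds are the ones listed on the slide; the function name, the input format (probe id mapped to P-value) and the reading that the P<0.001 cut applies to the periodicity test are assumptions.

```python
def tentative_assignment(p_differential, p_periodic):
    """Probe sets used to estimate each component's prior parameters."""
    return {
        "d": {i for i, p in p_differential.items() if p < 0.01},
        "b": {i for i, p in p_differential.items() if p > 0.1},
        "p": {i for i, p in p_periodic.items() if p < 0.001},
    }


# Made-up P-values for four probe sets.
p_diff = {"g1": 0.0005, "g2": 0.5, "g3": 0.05, "g4": 0.002}
p_per = {"g1": 0.0001, "g2": 0.7, "g3": 0.2, "g4": 0.03}
assignment = tentative_assignment(p_diff, p_per)
```

Probe sets with intermediate P-values (between 0.01 and 0.1 here) are left out of both the differential and background sets, and a probe set may contribute to both the differential and periodic estimates.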

Page 23:

Experimental Results

Demonstrate that the model can effectively identify both sinusoidal and non-sinusoidal periodic expression patterns.

It is widely believed that 5-10% of transcribed genes may be under circadian regulation, with some studies suggesting a higher proportion – up to 50% in murine liver.

The datasets analyzed in this article contain gene expression profiles of liver and skeletal muscle tissues in mice.

Page 24:

Experimental Results

Sine-wave detection:

Use the sine-wave matching algorithm of Straume (2004).

It identifies 848 distinct rhythmic probe sets in liver and 383 such probe sets in skeletal muscle.

Model-based detection:

Among the top 25 probe sets there are nine that were not among the top 400 ranked by sine-wave matching.

Profiles that peak or drop at a single time point are poorly matched to a sinusoidal shape.
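The poor match can be seen with a small least-squares sine fit. The sketch below is not the Straume (2004) algorithm; it is a plain single-frequency fit that assumes evenly spaced samples covering a whole number of cycles, with more than two samples per cycle. The function name and the toy profiles are hypothetical.

```python
import math


def sinusoid_r_squared(profile, period_in_samples):
    """Fraction of variance explained by the best-fitting sinusoid of this period."""
    n = len(profile)
    mean = sum(profile) / n
    omega = 2 * math.pi / period_in_samples
    # With even sampling over whole cycles, these projections are the
    # exact least-squares coefficients of the cosine and sine terms.
    a = 2 / n * sum((y - mean) * math.cos(omega * t) for t, y in enumerate(profile))
    b = 2 / n * sum((y - mean) * math.sin(omega * t) for t, y in enumerate(profile))
    fitted = [mean + a * math.cos(omega * t) + b * math.sin(omega * t)
              for t in range(n)]
    ss_res = sum((y - f) ** 2 for y, f in zip(profile, fitted))
    ss_tot = sum((y - mean) ** 2 for y in profile)
    return 1 - ss_res / ss_tot


# Two cycles of six time points: a single-point peak versus a pure sinusoid.
spike = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0] * 2
wave = [math.cos(2 * math.pi * t / 6) for t in range(12)]
r2_spike = sinusoid_r_squared(spike, 6)
r2_wave = sinusoid_r_squared(wave, 6)
```

The spike repeats perfectly across cycles, yet the sinusoid explains only 40% of its variance, while the pure wave is explained almost entirely.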

Page 25:

Experimental Results

Page 26:

Experimental Results

Tns3 is the single probe set that ranked in the top 25 by the sine-wave method but below 400 by the model. It conforms to the sine-wave pattern but possesses a very small amplitude, and is assigned to the background component by the model.

All of the other probe sets that were so highly ranked by the sine-wave method received posterior probabilities of periodicity >0.9 from our model.

Page 27:

Conclusion

We argue that in typical experiments with only a small number of samples per cycle, we should test for arbitrary patterns which are repeated between cycles, rather than parametric shapes.

To this end, we propose a Bayesian mixture model for identifying patterns of unconstrained shape, which stand out as both differentially and periodically expressed.