Extracting binary signals from microarray time-course data Debashis Sahoo 1, David L. Dill 2, Rob...

Extracting binary signals from microarray time-course data

Debashis Sahoo1, David L. Dill2, Rob Tibshirani3 and Sylvia K. Plevritis4

1 Department of Electrical Engineering

2 Department of Computer Science

3 Department of Radiology

4 Department of Health Research and Policy and Department of Statistics

Stanford University

Roli Shrivastava

Introduction

• Problem Statement

– To identify up- and down-regulated genes

– To identify the time of transition

• Experimental Technique

– Microarray (tens of thousands of distinct probes on an array to accomplish the equivalent number of genetic tests in parallel)

• Computational Technique

– A tool called StepMiner to extract biologically meaningful results from large amounts of data

Types of Transitions

1. One Step

2. Two Step

3. Genes for which the one- or two-step patterns do not fit appreciably better than a constant mean value (the null hypothesis).

Fitting One or Two-Step Function

• F1 statistic: Computes how well the one-step model fits the data

• F2 statistic: Computes how well the two-step model fits the data

• F12 statistic: Compares the fit of one-step model and two-step model on same data

• P-value: A low P-value indicates a good fit of the model to the data
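The one-step fit and its F statistic can be sketched as follows. This is a minimal illustration, not the StepMiner implementation: it assumes a least-squares fit over every possible step position and assumes the one-step model is charged 3 degrees of freedom (two means plus the step position). The function names and the example signal are hypothetical.

```python
def fit_one_step(values):
    """Return (position, sse) of the least-squares one-step fit.

    A position k means the step lies between index k-1 and index k.
    """
    best_position, best_sse = None, float("inf")
    for k in range(1, len(values)):
        left, right = values[:k], values[k:]
        mean_left = sum(left) / len(left)
        mean_right = sum(right) / len(right)
        # Sum of squared errors around the two segment means
        sse = (sum((v - mean_left) ** 2 for v in left)
               + sum((v - mean_right) ** 2 for v in right))
        if sse < best_sse:
            best_position, best_sse = k, sse
    return best_position, best_sse


def f_statistic(values, sse, model_df=3):
    """Regression F statistic against the constant-mean null model.

    model_df=3 is an assumption for the one-step model.
    """
    n = len(values)
    mean = sum(values) / n
    ss_total = sum((v - mean) ** 2 for v in values)
    ss_regression = ss_total - sse
    if sse == 0:
        return float("inf")  # perfect fit, no residual variance
    return (ss_regression / (model_df - 1)) / (sse / (n - model_df))


# Hypothetical noisy signal with a step between index 3 and index 4
signal = [0.1, -0.2, 0.0, 0.1, 5.1, 4.9, 5.0, 5.2]
position, sse = fit_one_step(signal)
f1 = f_statistic(signal, sse)
```

A large F1 means the step explains far more variance than the residual noise. A two-step fit would follow the same pattern with two positions and more model degrees of freedom.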

1. Calculate the F statistic for the model and data set

2. Calculate the P-value

3. If P < Pthreshold, the model fits; if P > Pthreshold, the model does not fit

Pthreshold = 0.05
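The flow above (F statistic, then P-value, then threshold) can be sketched with the standard library only. The P-value is taken as the upper tail of the F distribution, computed here through the regularized incomplete beta function with a continued-fraction evaluation. In practice a library call such as `scipy.stats.f.sf` would normally be used instead. The degrees of freedom passed in (`d1`, `d2`) and all function names are assumptions for illustration.

```python
import math


def _beta_cf(a, b, x, max_iter=300, eps=1e-13):
    """Continued fraction for the incomplete beta function (Lentz's method)."""
    tiny = 1e-300

    def guard(v):
        return tiny if abs(v) < tiny else v

    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 / guard(1.0 - qab * x / qap)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))  # even term
        d = 1.0 / guard(1.0 + aa * d)
        c = guard(1.0 + aa / c)
        h *= d * c
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))  # odd term
        d = 1.0 / guard(1.0 + aa * d)
        c = guard(1.0 + aa / c)
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def regularized_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log(1.0 - x))
    front = math.exp(log_front)
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def f_p_value(f_value, d1, d2):
    """Upper-tail probability P(F > f_value) for an F(d1, d2) distribution."""
    if f_value <= 0.0:
        return 1.0
    return regularized_beta(d2 / 2.0, d1 / 2.0, d2 / (d2 + d1 * f_value))


def model_fits(p_value, threshold=0.05):
    """The model is accepted when P is below the threshold."""
    return p_value < threshold


# Hypothetical example: F = 3.0 with 2 and 12 degrees of freedom
p = f_p_value(3.0, 2, 12)
fits = model_fits(p)
```

With these hypothetical inputs the P-value is above 0.05, so the model would be rejected. A much larger F statistic on the same degrees of freedom would push P below the threshold.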

StepMiner Algorithm

• One-step: the one-step model fits the data AND fits better than the two-step model

• Two-step: the two-step model fits the data AND the one-step model does not

• Other: neither the one-step nor the two-step model fits the data
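The three-way decision can be sketched as a small function over the three P-values. This is an illustration under assumptions: "one-step fits better than two-step" is read here as the F12 comparison not being significant, and the case the slide leaves open (both models fit and the two-step model is significantly better) is assigned to two-step. The function name and argument names are hypothetical.

```python
def classify(p1, p2, p12, threshold=0.05):
    """Label a gene as 'one-step', 'two-step' or 'other'.

    p1  : P-value of the one-step fit (from F1)
    p2  : P-value of the two-step fit (from F2)
    p12 : P-value comparing the two-step fit against the one-step fit (from F12)
    """
    one_fits = p1 < threshold
    two_fits = p2 < threshold
    # Assumption: a significant F12 means the two-step model is the better fit
    two_is_better = p12 < threshold

    if one_fits and not two_is_better:
        return "one-step"
    if two_fits:
        return "two-step"
    return "other"


labels = [
    classify(0.001, 0.001, 0.50),  # one-step fits, two-step adds nothing
    classify(0.500, 0.001, 0.001), # only the two-step model fits
    classify(0.500, 0.500, 0.500), # neither model fits
]
```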

Comparison of 4 Algorithms

Step height = 5σ. Number of time points = 15. A total of 2000 random data sets, 2000 one-step data sets and 2000 two-step data sets with random step positions.


Generation of Simulated Data

• Microarray data with 15 non-uniform time points

• 4000 genes with 2000 one-step and 2000 two-step patterns

• Gaussian noise was added to the above data

• P-value threshold of 0.05 was used
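A simulation of this kind can be sketched as below. It is a simplified illustration, not the authors' generator: time points are treated as plain indices (the non-uniform spacing is ignored), two-step signals are assumed to return to the baseline after the second step, and the seed, step height and helper names are all hypothetical.

```python
import random


def one_step(n_points, position, height):
    """Baseline 0 before `position`, `height` from `position` onward."""
    return [0.0 if t < position else height for t in range(n_points)]


def two_step(n_points, first, second, height):
    """Rise at `first`, return to baseline at `second` (an assumption)."""
    return [height if first <= t < second else 0.0 for t in range(n_points)]


def add_noise(signal, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return [v + rng.gauss(0.0, sigma) for v in signal]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_points, sigma = 15, 1.0
height = 5 * sigma      # step height expressed in units of the noise level

one_step_genes = []
for _ in range(2000):
    position = rng.randint(1, n_points - 1)  # random step position
    one_step_genes.append(add_noise(one_step(n_points, position, height), sigma, rng))

two_step_genes = []
for _ in range(2000):
    first, second = sorted(rng.sample(range(1, n_points), 2))
    two_step_genes.append(add_noise(two_step(n_points, first, second, height), sigma, rng))
```

Each simulated gene can then be passed through the fitting and thresholding steps to count how often the recovered pattern matches the one that was planted.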

Results of Simulated Data - I

• σ is the standard deviation of noise

• Step position is fixed at 5 for 1-step

• Step position at 5 and 9 for 2-step

• The greater the step height, the easier the identification

Results of Simulated Data - II

• σ is the standard deviation of noise

• Random step positions

• Small reduction in accuracy

• More matches occur when every constant segment of a curve contains several time points.

• It is therefore desirable to design experiments with several time points before the first interesting transition and after the last interesting transition.

Results of Simulated Data - III

• Shows sensitivity to P-value threshold and number of time points

• Random step position and step height of 5σ

• Two-step signals require more time points than one-step signals

• Matches increase as the P-value threshold is raised, but at the cost of a higher false discovery rate (FDR)

Results of Simulated Data - IV

• Shows sensitivity to spacing between steps

• With 15 time points, the first step is fixed at position 4

• A spacing of at least 3 time points between steps is required when the step height is > 3σ

• Steps must be placed at least 3 time points from an end point

Diauxic Shift

• In the initial phases of a growing batch culture, yeast prefers to metabolize glucose and produce ethanol even when oxygen is abundant.

• When the glucose is exhausted, cells undergo a “diauxic shift,” in which they switch abruptly to an oxidative metabolism. This pathway allows the oxidation of the accumulated fermentation products and is highly efficient as a mechanism for generating ATP.

Brauer et al., Mol Biol Cell. 2005 May; 16(5): 2503–2517

Analysis of Experimental Data

• 2284 genes with diauxic shift

• 1088 were matched with a one-step transition

• 267 were two-step transitions

• 929 did not match anything

Fitting functions for 3 genes

Same Data reanalyzed using StepMiner

Heat Maps

Analysis by Brauer et al.

The heat map shows two transitions at 8.25 and 9.25 h

Comparison With Brauer et al’s Results

• The GO annotations and FDR-corrected P-values for the clusters reported in Brauer et al. were recomputed with the latest yeast gene annotations from the Gene Ontology Consortium website

• Table shows the P-values from GO Term Finder as well as StepMiner.

Table for Comparison

Results Of Comparison

• The annotations that had the lowest P-values in Brauer et al. had even lower P-values in the StepMiner groups.

• In most cases, the P-values in the reanalysis are lower than Brauer et al.’s, which implies that grouping by time-of-change is at least as effective as hierarchical clustering at identifying relevant genes.

• GO annotations are obtained fully automatically using StepMiner – it is not necessary to select interesting clusters manually.

• The clusters that have no P-values from StepMiner were “less interpretable in terms of diauxic shift”, in the words of Brauer et al.

Comparison of StepMiner to Other Tools

• Hierarchical clustering: finds clusters that transition at the same time point

– Manual search required to find transitions

• SAM: finds transitions by looking for significant differences in average expression before and after a specified time point.

– However, many of the genes selected by this method do not, in fact, have a transition at the specified time point.

• EDGE: identifies genes whose expression changes systematically over time and differs significantly from the mean expression over time.

– This method does not directly provide the direction and position of a significant change.

Hierarchical vs. StepMiner

Cluster that transitions at 3 hours

StepMiner clearly shows other transition times

Comparison of StepMiner to Other Tools - STEM

• Provides model profiles and their significance values

• But the profiles don’t look like step functions and therefore are not helpful for locating transitions

Strengths and Limitations

• Easy to understand

• Few parameters

• Biologically, transitions can be more interesting

• Very fast: < 15 s for 15 microarrays of 40000 genes

• Can deal with missing measurements

• Provides statistical parameters like P-value, FDR etc.

• Binary model

• There can be other cases, e.g., the transition is not a step

• Short and long time courses are not handled well; most appropriate for 10-30 time measurements.

Post StepMiner Analysis

• Once StepMiner is run, genes undergoing binary transitions can easily be partitioned into sets based on the number, direction, and timing of transitions.

• These sets can be merged at the user’s discretion (e.g., the set of one-step genes that rise at time 3 could be merged with the two-step genes that rise at time 3), or further subdivided.
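The partition-then-merge idea can be sketched with a dictionary keyed on number, direction and timing of transitions. The record layout, gene names and helper functions below are hypothetical and do not reflect StepMiner's actual output format.

```python
from collections import defaultdict

# Hypothetical per-gene results: number of steps, direction, transition times
results = [
    {"gene": "geneA", "steps": 1, "direction": "up", "times": (3,)},
    {"gene": "geneB", "steps": 2, "direction": "up-down", "times": (3, 9)},
    {"gene": "geneC", "steps": 1, "direction": "up", "times": (3,)},
    {"gene": "geneD", "steps": 1, "direction": "down", "times": (5,)},
]


def partition(records):
    """Group gene names by (number of steps, direction, transition times)."""
    groups = defaultdict(list)
    for record in records:
        key = (record["steps"], record["direction"], record["times"])
        groups[key].append(record["gene"])
    return dict(groups)


def rising_at(groups, time):
    """Merge every set whose first transition is a rise at `time`."""
    merged = []
    for (steps, direction, times), genes in groups.items():
        if direction.startswith("up") and times[0] == time:
            merged.extend(genes)
    return sorted(merged)


groups = partition(results)
rise_at_3 = rising_at(groups, 3)
```

Here `rise_at_3` combines the one-step genes that rise at time 3 with the two-step gene whose first transition is a rise at time 3, which mirrors the merge example in the bullet above.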

• BACK UP SLIDES

Replication vs. Resolution

• For accuracy, it is better to take more frequent measurements than to collect replicates

• This comes at the cost of correctly identifying the kind of step
