Finding Transcription Modules from large gene-expression data sets
Ned Wingreen – Molecular Biology
Morten Kloster, Chao Tang – NEC Laboratories America
Outline
• Introduction – transcription, regulation, gene chips, and transcription modules.
• Iterative Signature Algorithm (ISA).
• Advantages of Progressive Iterative Signature Algorithm (PISA).
• PISA applied to yeast data.
Transcription regulation
http://doegenomestolife.org
Gene chips
DNA microarray
Gene-expression profile
$E_{gc}$, with $g = 1, 2, \ldots, N_g$ and $c = 1, 2, \ldots, N_c$
But data very noisy…
Transcription module
[Figure: network diagram linking conditions (C1-C3), transcription factors (TF1-TF4), and genes (G1-G7).]
A Transcription Module: a set of conditions and a set of genes connected by a transcription factor.
A gene can be in multiple transcription modules.
[Figure: gene-expression matrix with genes $g_1, \ldots, g_{N_g}$ as rows and conditions $c_1, \ldots, c_{N_c}$ as columns.]
Signature of a transcription module
Iterative Signature Algorithm (ISA)
Barkai group (2002, 2003)
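$$C_m(G_m) = \{\, c \in C : \langle E^G_{gc} \rangle_{g \in G_m} > t_C \,\}$$
$$G_m(C_m) = \{\, g \in G : \langle E^C_{gc} \rangle_{c \in C_m} > t_G \,\}$$
Transcription Module (TM)
Gene vector and condition vector:
$$\mathbf{m}^G = \begin{pmatrix} m^G_1 \\ m^G_2 \\ \vdots \\ m^G_{N_g} \end{pmatrix}, \qquad \mathbf{m}^C = \begin{pmatrix} m^C_1 \\ m^C_2 \\ \vdots \\ m^C_{N_c} \end{pmatrix}$$
$$\mathbf{m}^C(n+1) = f_{t_C}\!\left( (E^G)^T \, \mathbf{m}^G(n) \right)$$
$$\mathbf{m}^G(n+1) = f_{t_G}\!\left( E^C \, \mathbf{m}^C(n+1) \right)$$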
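The alternating ISA update (conditions from the current genes, then genes from those conditions) can be illustrated with a minimal sketch. Several things here are assumptions for illustration rather than the Barkai group's implementation: the function names `zscore`, `normalize`, and `isa` are hypothetical, the two normalized matrices are plain z-scores, the thresholds are applied directly to average scores, and the toy matrix is invented.

```python
from statistics import mean, pstdev


def zscore(values):
    """Shift to zero mean and scale to unit variance (all zeros if constant)."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd if sd > 0 else 0.0 for v in values]


def normalize(E):
    """Return (EG, EC) for a genes x conditions matrix E given as a list of rows.

    EG: each condition (column) z-scored over genes.
    EC: each gene (row) z-scored over conditions.
    Assumption: simple z-scoring stands in for the real normalization.
    """
    n_genes, n_conds = len(E), len(E[0])
    columns = [zscore([E[g][c] for g in range(n_genes)]) for c in range(n_conds)]
    EG = [[columns[c][g] for c in range(n_conds)] for g in range(n_genes)]
    EC = [zscore(row) for row in E]
    return EG, EC


def isa(E, seed_genes, t_C, t_G, max_iter=100):
    """Iterate gene set -> condition set -> gene set until nothing changes."""
    EG, EC = normalize(E)
    n_genes, n_conds = len(E), len(E[0])
    genes, conds = set(seed_genes), set()
    for _ in range(max_iter):
        # Conditions whose average score over the current genes exceeds t_C.
        new_conds = {c for c in range(n_conds)
                     if mean([EG[g][c] for g in genes]) > t_C}
        if not new_conds:
            return set(), set()
        # Genes whose average score over those conditions exceeds t_G.
        new_genes = {g for g in range(n_genes)
                     if mean([EC[g][c] for c in new_conds]) > t_G}
        if not new_genes:
            return set(), set()
        if new_genes == genes and new_conds == conds:
            break  # fixed point: a candidate transcription module
        genes, conds = new_genes, new_conds
    return genes, conds


# Invented toy data: genes 0-2 respond to conditions 0 and 1; genes 3-5 are flat.
E = [[5, 5, 0, 0, 0]] * 3 + [[0, 0, 0, 0, 0]] * 3
module_genes, module_conds = isa(E, seed_genes={0, 1, 3}, t_C=0.2, t_G=0.5)
```

Starting from a seed that contains one wrong gene (gene 3), the iteration drops it and picks up the missing gene 2, which is the behavior the fixed-point definition of a module is meant to capture.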
[Figure: gene-expression matrix with genes $g_1, \ldots, g_{N_g}$ as rows and conditions $c_1, \ldots, c_{N_c}$ as columns.]
Thresholding on both genes and conditions reduces noise.
Thresholding:
Limitations of ISA
• Lots of spurious modules (millions…).
• Weak modules may be absorbed by strong ones.
• ISA does not make use of identified modules to find new ones.
[Figure: gene-expression matrix with genes $g_1, \ldots, g_{N_g}$ as rows and conditions $c_1, \ldots, c_{N_c}$ as columns.]
Progressive Iterative Signature Algorithm (PISA)
[Figure: gene-expression matrix with genes $g_1, \ldots, g_{N_g}$ as rows and conditions $c_1, \ldots, c_{N_c}$ as columns.]
Advantages of PISA over ISA
• Removing found modules reveals “hidden” modules, and reduces noise for unrelated modules.
• No positive feedback.
• Improved thresholding for genes.
• Combines coregulated and counter-regulated genes.
Example of PISA vs. ISA
[Figure: transcription factors TF1 and TF2 with gene sets G1 and G2; labels A and B.]
The gene-score threshold
• Goal: fewer than one gene included in the module by mistake.
• Requirement: a threshold that is insensitive to the (unknown) module size.
Gene scores along the condition vector for some module
Eliminating false modules
For scrambled data, preliminary modules either have few genes or few contributing conditions.
True positives
PISA applied to yeast data
• Applied PISA to a dataset containing almost all available microarray data for S. cerevisiae: >6000 genes, ~1000 conditions.
• Found ~140 different modules, including all “good” modules found by ISA.
• Found some unknown modules.
• Found many “good” small modules that ISA could not find / separate from the spurious modules.
• ~2600 genes in at least one module, ~900 genes in more than one module.
Some modules found by PISA
Example: Zinc module
ZRT1, YNL254C, INO1, ZAP1, YOL154W, ADH4, ZRT3, ZRT2, YOR387C
ZRT1, ZAP1, ZRT2, YNL254C, YOL154W, ZRT3, ADH4, RAD27, ZRC1, … (Lyons et al., PNAS 97, 7957-7962 (2000))
ZAP1-regulated genes during zinc starvation.
Zinc module found by PISA
Comparison with other databases
“Gold standard”: Gene Ontology (Genome Res. 11, 1425-1433 (2001))
Database A: Immunoprecipitation (Lee et al., Science 298, 799-804 (2002))
Database B: Comparative genomics (Kellis et al., Nature 423, 241-254 (2003))
[Figure: correlations between modules; legend: anticorrelated, correlated. Module groups shown, with the number of genes in parentheses:]
• Oxidative stress response (69); De novo purine biosyn (32); Lysine biosyn (11); Biotin syn & transport (6); Arg biosyn (6); aa biosyn (96)
• Oxidative stress response (69); aryl alcohol dehydrogenase (6); proteolysis (27); trehalose & hexose metabolism/conversion (21); COS genes (11); heat shock (52); repair of disulfide bonds (26)
• Mating genes for type a (15); Mating type a signaling genes (6); Mating (110); Mating factors/receptors: a/α difference (26)
• rRNA processing (117); Ribosomal proteins (126); Histone (19); Fatty acid syn ++ (22); Cell cycle G2/M (31); Cell cycle M/G1 (35); Cell cycle G1/S (66)
Correlations
Summary
• Data from gene chips can be used to identify transcription modules (TMs).
• Iterative approach (ISA) is promising.
• PISA improves on ISA by taking out found TMs.
– PISA also improves gene thresholding, avoids positive feedback, and improves signal to noise by grouping coregulated and counter-regulated genes.
– PISA very effective for finding “secondary modules”.
http://cn.arxiv.org/abs/q-bio/0311017
Future Directions
• Input to experiment:
– new modules and new genes in old modules.
– what kinds of experiments give the most informative data?
• Improve PISA:
– better pre/post-processing of data.
• Apply PISA to other organisms.
• Combine PISA with other data (experimental, bioinformatic) to systematically identify TMs, and reconstruct the transcription network.
De novo purine biosynthesis
Number of genes: 32
Average number of contributing conditions: 14.6
Consistency: 0.59
Best ISA overlap: 0.59 at $t_G = 5.0$; frequency 16

Galactose induced genes
Number of genes: 23
Average number of contributing conditions: 18.1
Consistency: 0.55
Best ISA overlap: 0.74 at $t_G = 3.2$; frequency 686

Hexose transporters
Number of genes: 10
Average number of contributing conditions: 33.7
Consistency: 0.59
Best ISA overlap: 0.6 at $t_G = 3.8$; frequency 41

Peroxide shock
Number of genes: 69
Average number of contributing conditions: 23.9
Consistency: 0.50
Best ISA overlap: 0.34 at $t_G = 3.4$; frequency (1)
Implementation of PISA
• Normalization of gene-expression data
• Iterative algorithm to find preliminary modules (modified ISA)
– avoiding positive feedback
– gene-score threshold
• Orthogonalization
• Finding consistent modules
Normalization of expression data
Gene-score matrix EG:
Condition-score matrix EC:
removes reference-condition bias
normalizes total RNA levels
makes gene scores comparable
makes condition scores comparable
Iterative algorithm: modified ISA (mISA)
Start with a random set of genes $G_I$.
Produce condition-score vector $s^C$.
Produce gene-score vector $s^G$, using “leave-one-out” scoring to avoid positive feedback.
From $s^G$, calculate gene vector $m^G$ for next iteration.
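One way to picture “leave-one-out” scoring is that each gene is scored against a condition-score vector built from all the other genes, so a gene cannot raise its own score. The sketch below is an illustration under stated assumptions, not the authors' code: the function name `leave_one_out_gene_scores` is hypothetical, a single matrix `E` is used for both steps (the slides use separately normalized gene-score and condition-score matrices), and the condition vector is scaled to unit length before scoring.

```python
from math import sqrt


def leave_one_out_gene_scores(E, weights):
    """Score every gene against a condition vector built WITHOUT that gene.

    E: genes x conditions matrix (list of rows).
    weights: one weight per gene (the gene vector; zero outside the current set).
    """
    n_conds = len(E[0])
    # Condition-score vector from all genes: s_c = sum_g w_g * E[g][c].
    s_C = [sum(w * row[c] for w, row in zip(weights, E)) for c in range(n_conds)]
    scores = []
    for w, row in zip(weights, E):
        # Subtract this gene's own contribution before scoring it.
        s_loo = [s - w * x for s, x in zip(s_C, row)]
        norm = sqrt(sum(v * v for v in s_loo))
        if norm == 0:
            scores.append(0.0)  # no other gene supports this one
        else:
            scores.append(sum(x * v for x, v in zip(row, s_loo)) / norm)
    return scores


# Invented toy data: genes 0 and 1 share a profile; gene 2 is unrelated.
E = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
scores = leave_one_out_gene_scores(E, weights=[1.0, 0.0, 0.0])
```

With only gene 0 in the starting set, gene 0 itself scores zero (nothing else supports it), while gene 1 scores high because it matches the profile defined by gene 0. Without the leave-one-out step, gene 0 would score high simply by matching itself.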
Orthogonalization
After finding each converged preliminary module ($s^G$, $s^C$), remove the component along $s^C$ from all genes:
[Figure: vectors $s_1^C$, $s'$, and $s_2^C$.]
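Removing the component along a condition-score vector amounts to projecting each gene's expression profile onto the subspace orthogonal to that vector. A minimal sketch, assuming a plain Euclidean projection applied to every row; the function name `remove_component` and the toy numbers are hypothetical.

```python
def remove_component(E, s_C):
    """Return a copy of E with each gene's component along s_C removed.

    E: genes x conditions matrix (list of rows); s_C: condition-score vector.
    Each row r becomes r - (r . s_C / s_C . s_C) * s_C.
    """
    norm_sq = sum(v * v for v in s_C)
    if norm_sq == 0:
        return [list(row) for row in E]  # nothing to remove
    result = []
    for row in E:
        coeff = sum(x * v for x, v in zip(row, s_C)) / norm_sq
        result.append([x - coeff * v for x, v in zip(row, s_C)])
    return result


# Invented toy data: remove the direction of the first condition.
E = [[3.0, 4.0], [1.0, 0.0]]
E_reduced = remove_component(E, s_C=[1.0, 0.0])
```

After this step every gene profile is orthogonal to the removed vector, so a later search cannot converge on the same module again.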
Why does scrambled data yield large modules?
Long tails of expression data lead to single-condition modules.
Finding consistent modules
• Repeat PISA runs many times (~30).
• Tabulate preliminary modules.
• A preliminary module contributes to a module if:
– the preliminary module contains > 50% of the genes in the module,
– these genes constitute > 20% of the preliminary module.
• A gene is included in a module if it appears in >50% of the contributing modules, always with the same gene-score sign.
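The two rules above (when a preliminary module contributes, and when a gene is included) can be written down directly. This is a sketch of the stated criteria only; the function names `contributes` and `consensus_genes`, the data layout (a dict from gene name to gene score), and the example genes are assumptions for illustration.

```python
def contributes(prelim_genes, module_genes):
    """A preliminary module contributes if it holds > 50% of the module's
    genes and those shared genes make up > 20% of the preliminary module."""
    shared = prelim_genes & module_genes
    return (len(shared) > 0.5 * len(module_genes)
            and len(shared) > 0.2 * len(prelim_genes))


def consensus_genes(contributing):
    """Genes present in > 50% of contributing modules, always with one sign.

    contributing: list of dicts mapping gene name -> gene score.
    """
    signs = {}
    for prelim in contributing:
        for gene, score in prelim.items():
            signs.setdefault(gene, []).append(score > 0)
    return {gene for gene, seen in signs.items()
            if len(seen) > 0.5 * len(contributing) and len(set(seen)) == 1}


# Invented example: "a" is stable, "b" flips sign, "c" appears only once.
prelims = [{"a": 1.0, "b": 2.0}, {"a": 0.5, "b": -1.0}, {"a": 2.0, "c": 1.0}]
module = consensus_genes(prelims)
```

The sign check matters because PISA groups coregulated and counter-regulated genes in the same module, so a gene whose score changes sign between runs is not assigned consistently.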
Comparison with other databases
Gene Ontology (Genome Res. 11, 1425-1433 (2001))
Database A: Immunoprecipitation (Lee et al., Science 298, 799-804 (2002))
Database B: Comparative genomics (Kellis et al., Nature 423, 241-254 (2003))
p value:
$$p = 1 - \sum_{i=0}^{n-1} \frac{\binom{c}{i} \binom{N_g - c}{m - i}}{\binom{N_g}{m}}$$
$N_g$: number of genes in organism
$m$: number of genes in module
$c$: number of genes in GO category
$n$: number of genes in both module and GO category
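The p value is the hypergeometric probability of seeing at least n shared genes by chance. A minimal sketch using only the standard library; the function name `overlap_p_value` and the numbers in the example are hypothetical. It sums the upper tail directly, which equals one minus the sum over fewer than n shared genes but avoids subtracting nearly equal numbers when p is tiny.

```python
from math import comb


def overlap_p_value(n, m, c, n_genes):
    """Probability of at least n shared genes between a module of m genes
    and a category of c genes, both drawn from n_genes genes in total."""
    if n <= 0:
        return 1.0
    # Sum exact integer counts, then divide once by the number of ways
    # to choose m genes out of n_genes.
    tail = sum(comb(c, i) * comb(n_genes - c, m - i)
               for i in range(n, min(m, c) + 1))
    return tail / comb(n_genes, m)


# Invented example: 10 genes in total, module of 3, category of 4, overlap of 2.
p = overlap_p_value(n=2, m=3, c=4, n_genes=10)
```

In the small example the tail is (6·6 + 4·1)/120 = 1/3, so an overlap of two genes there is unremarkable; for real modules the same function applies with the organism's gene count as `n_genes`.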
Correlation of modules
$$\mathrm{Corr}(m_1, m_2) = \mathbf{m}'_1 \cdot \mathbf{m}'_2, \qquad \mathbf{m}' = \frac{\mathbf{m}^C}{|\mathbf{m}^C|}$$
[Figure: gene-expression matrix with genes $g_1, \ldots, g_{N_g}$ as rows and conditions $c_1, \ldots, c_{N_c}$ as columns.]