Network integration and function prediction: Putting it all together
Curtis Huttenhower
04-13-11
Harvard School of Public Health
Department of Biostatistics
2
Outline
• Functional network integration
  – Bayes nets and LR
  – The human genome, tissues, and disease
• Network meta-analysis
  – Pathogens and MTb
  – Quantifying progress in yeast
• Networks to pathways
  – Functional mapping: networks of networks
  – Hierarchical integration
  – Pathway prediction
• Regulatory network integration
  – Network motifs
3
A computational definition of functional genomics
Inputs: genomic data and prior knowledge
Data → Function
Function → Function
Gene → Gene
Gene → Function
4
A framework for functional genomics
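Figure: training examples are gene pairs with known labels: G1-G2 (+), G4-G9 (+), …, G3-G6 (-), G7-G8 (-), …; the pair G2-G5 is unlabeled (?). Each dataset scores every pair (high vs. low correlation, high vs. low similarity, colocalized vs. not colocalized; example rows of per-pair values: 0.9, 0.7, …, 0.1, 0.2, …, 0.8 and 0.8, 0.5, …, 0.05, 0.1, …, 0.6), and frequency histograms of these values are learned from the labeled pairs.
Integrating the evidence gives P(G2-G5|Data) = 0.85
Scale: 100Ms of gene pairs by 1Ks of datasets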
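The integration pictured on this slide can be sketched as a naive Bayes classifier: each dataset contributes likelihoods learned from labeled gene pairs, and these combine with a prior into a posterior such as P(G2-G5|Data). This is a minimal sketch only; the dataset names, training values, pseudocount, and prior below are hypothetical toy choices, not values from the talk.

```python
# Minimal naive Bayes sketch for integrating discretized datasets.
# All dataset names, training values, and the prior are hypothetical.

def learn_likelihoods(observations, labels, pseudocount=1.0):
    """Estimate P(value | related) and P(value | unrelated) for one dataset."""
    values = sorted(set(observations))
    tables = {}
    for label in (True, False):
        counts = {v: pseudocount for v in values}
        for obs, lab in zip(observations, labels):
            if lab == label:
                counts[obs] += 1
        total = sum(counts.values())
        tables[label] = {v: c / total for v, c in counts.items()}
    return tables

def posterior(evidence, likelihoods, prior):
    """P(related | evidence), assuming datasets are independent given the label."""
    p_pos, p_neg = prior, 1.0 - prior
    for dataset, value in evidence.items():
        p_pos *= likelihoods[dataset][True][value]
        p_neg *= likelihoods[dataset][False][value]
    return p_pos / (p_pos + p_neg)

# Gold standard: two related (+) and two unrelated (-) gene pairs.
labels = [True, True, False, False]
training = {
    "correlation": ["high", "high", "low", "low"],
    "colocalization": ["coloc", "not_coloc", "not_coloc", "not_coloc"],
}
likelihoods = {name: learn_likelihoods(obs, labels) for name, obs in training.items()}

# Score an unlabeled pair (the "G2-G5 ?" case) from its observed data.
query = {"correlation": "high", "colocalization": "coloc"}
p_related = posterior(query, likelihoods, prior=0.5)
print(f"P(related | data) = {p_related:.2f}")
```

The pseudocount keeps a value that was never seen in one class from forcing the posterior to exactly 0 or 1.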
5
MEFIT: A Framework for Functional Genomics
Datasets: Golub 1999, Butte 2000, Whitfield 2002, Hansen 1998
Functional Relationship
Biological Context: functional area, tissue, disease, …
6
Functional network prediction and analysis
Global interaction network
Metabolism network, Signaling network, Gut community network
Currently includes data from 30,000 human experimental results, 15,000 expression conditions + 15,000 diverse others, analyzed for 200 biological functions and 150 diseases
HEFalMp
7
HEFalMp: Predicting human gene function
HEFalMp
8
HEFalMp: Predicting human genetic interactions
HEFalMp
9
HEFalMp: Analyzing human genomic data
HEFalMp
10
HEFalMp: Understanding human disease
HEFalMp
11
Validating Human Predictions
Autophagy
Luciferase (negative control), ATG5 (positive control), LAMP2, RAB11A
Not starved vs. starved (autophagic)
Predicted novel autophagy proteins
5½ of 7 predictions currently confirmed
With Erin Haley, Hilary Coller
12
Outline
• Functional network integration
  – Bayes nets and LR
  – The human genome, tissues, and disease
• Network meta-analysis
  – Pathogens and MTb
  – Quantifying progress in yeast
• Networks to pathways
  – Functional mapping: networks of networks
  – Hierarchical integration
  – Pathway prediction
• Regulatory network integration
  – Network motifs
13
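Meta-analysis for unsupervised functional data integration
Evangelou 2007
Huttenhower 2006, Hibbs 2007
Fisher z-transform of each correlation $\rho$:
$$z = \frac{1}{2}\log\frac{1+\rho}{1-\rho}$$
Simple regression: all datasets are equally accurate
$$y_{e,i} = \mu_e + \varepsilon_{e,i}$$
Random effects: variation within and among datasets and interactions
$$y_{e,i} = \mu_e + \Delta_e + \varepsilon_{e,i}$$
$$\hat{\mu}_e = \sum_i w^*_{e,i}\, y_{e,i}, \qquad w^*_{e,i} = \frac{1}{s^2_{e,i} + \hat{\Delta}^2_e}$$
Here $e$ indexes a gene pair (edge), $i$ a dataset, $s^2_{e,i}$ is the within-dataset variance, and $\hat{\Delta}^2_e$ is the among-dataset variance.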
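The random-effects weighting on this slide can be sketched in a few lines. This is a minimal sketch under stated assumptions: the among-dataset variance is estimated with the DerSimonian-Laird moment estimator (the slide does not name an estimator), the weighted sum is divided by the sum of the weights as in the standard random-effects estimator, and the correlations and sample sizes are hypothetical.

```python
import math

def fisher_z(rho):
    """Fisher z-transform of a correlation coefficient (|rho| < 1)."""
    return 0.5 * math.log((1.0 + rho) / (1.0 - rho))

def random_effects_mean(y, s2):
    """Combine per-dataset scores y with within-dataset variances s2.

    Assumes the DerSimonian-Laird moment estimate of the among-dataset
    variance. Returns (combined mean, among-dataset variance).
    """
    w = [1.0 / v for v in s2]                       # fixed-effect weights
    fixed_mean = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - fixed_mean) ** 2 for wi, yi in zip(w, y))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    delta2 = max(0.0, (q - (len(y) - 1)) / c)       # among-dataset variance
    w_star = [1.0 / (v + delta2) for v in s2]       # random-effects weights
    mean = sum(wi * yi for wi, yi in zip(w_star, y)) / sum(w_star)
    return mean, delta2

# Hypothetical correlations for one gene pair in three datasets.
correlations = [0.80, 0.65, 0.10]
n_conditions = [20, 50, 12]
scores = [fisher_z(r) for r in correlations]
variances = [1.0 / (n - 3) for n in n_conditions]   # Var(z) is about 1/(n-3)
combined, delta2 = random_effects_mean(scores, variances)
print(combined, delta2)
```

When the datasets disagree more than their within-dataset variances explain, the among-dataset variance grows and the weights move toward a plain average.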
14
Meta-analysis for unsupervised functional data integration
Evangelou 2007
Huttenhower 2006, Hibbs 2007
$$z = \frac{1}{2}\log\frac{1+\rho}{1-\rho}$$
15
Unsupervised data integration: TB virulence and ESX-1 secretion
With Sarah Fortune
Graphle http://huttenhower.sph.harvard.edu/graphle/
16
Unsupervised data integration: TB virulence and ESX-1 secretion
With Sarah Fortune
Graphle http://huttenhower.sph.harvard.edu/graphle/
X?
17
Predicting gene function
Cell cycle genes
Predicted relationships between genes (legend: high confidence to low confidence)
19
Predicting gene function
Cell cycle genes
Predicted relationships between genes (legend: high confidence to low confidence)
These edges provide a measure of how likely a gene is to specifically participate in the process of interest.
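One way to turn this idea into a number is to compare a gene's average edge weight to the known process genes against its average edge weight to everything else. The scoring ratio below is an illustrative assumption, not the talk's stated method, and the gene names (including the placeholders YFG1 and YFG2) and weights are hypothetical.

```python
# Toy functional network; gene names and edge weights are hypothetical.
EDGES = {
    frozenset(("CDC28", "CLN2")): 0.9,
    frozenset(("CDC28", "YFG1")): 0.8,
    frozenset(("CLN2", "YFG1")): 0.7,
    frozenset(("CDC28", "YFG2")): 0.1,
    frozenset(("CLN2", "YFG2")): 0.2,
    frozenset(("YFG1", "YFG2")): 0.1,
}

def weight(a, b):
    """Undirected edge weight, defaulting to 0 for missing edges."""
    return EDGES.get(frozenset((a, b)), 0.0)

def process_score(gene, process_genes, all_genes):
    """Ratio of mean connectivity to the process versus to all other genes."""
    inside = [weight(gene, g) for g in process_genes if g != gene]
    outside = [weight(gene, g) for g in all_genes
               if g != gene and g not in process_genes]
    if not inside:
        return 0.0
    mean_in = sum(inside) / len(inside)
    mean_out = sum(outside) / len(outside) if outside else 0.0
    return mean_in / mean_out if mean_out > 0 else float("inf")

genes = {"CDC28", "CLN2", "YFG1", "YFG2"}
cell_cycle = {"CDC28", "CLN2"}
scores = {g: process_score(g, cell_cycle, genes) for g in sorted(genes - cell_cycle)}
print(scores)
```

Dividing by connectivity outside the process is what makes the score specific: a gene that is strongly connected to everything does not rank highly.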
20
Comprehensive validation of computational predictions
Genomic data and prior knowledge
Computational predictions of gene function: MEFIT; SPELL (Hibbs et al 2007); bioPIXIE (Myers et al 2005)
Genes predicted to function in mitochondrion organization and biogenesis
Laboratory experiments: petite frequency, growth curves, confocal microscopy
New known functions for correctly predicted genes
Retraining
With David Hess, Amy Caudy
21
Evaluating the performance of computational predictions
Genes involved in mitochondrion organization and biogenesis
• 106 original GO annotations
• 135 under-annotations
• 82 novel confirmations, first iteration
• 17 novel confirmations, second iteration
340 total: >3x previously known genes in ~5 person-months
22
Evaluating the performance of computational predictions
Genes involved in mitochondrion organization and biogenesis
• 106 original GO annotations
• 95 under-annotations
• 40 confirmed under-annotations
• 80 novel confirmations, first iteration
• 17 novel confirmations, second iteration
340 total: >3x previously known genes in ~5 person-months
Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated.
23
Outline
• Functional network integration
  – Bayes nets and LR
  – The human genome, tissues, and disease
• Network meta-analysis
  – Pathogens and MTb
  – Quantifying progress in yeast
• Networks to pathways
  – Functional mapping: networks of networks
  – Hierarchical integration
  – Pathway prediction
• Regulatory network integration
  – Network motifs
24
Functional mapping: mining integrated networks
Chemotaxis
Predicted relationships between genes (legend: high confidence to low confidence)
The strength of these relationships indicates how cohesive a process is.
26
Functional mapping: mining integrated networks
Chemotaxis
Flagellar assembly
Predicted relationships between genes (legend: high confidence to low confidence)
The strength of these relationships indicates how associated two processes are.
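The two quantities described on these slides can be sketched as averages over edge weights: cohesiveness as the mean weight among a process's own genes, and association as the mean weight between the genes of two processes. Using plain means is an assumption made for illustration; the gene sets and weights are hypothetical.

```python
from itertools import combinations, product

def mean_weight(pairs, weights):
    """Mean edge weight over gene pairs; missing edges count as 0.

    Assumes at least one pair, i.e. processes with enough genes.
    """
    pairs = list(pairs)
    return sum(weights.get(frozenset(p), 0.0) for p in pairs) / len(pairs)

def cohesiveness(process, weights):
    """Mean weight among all pairs of genes within one process."""
    return mean_weight(combinations(sorted(process), 2), weights)

def association(p1, p2, weights):
    """Mean weight over pairs with one gene in each process."""
    cross = ((a, b) for a, b in product(sorted(p1), sorted(p2)) if a != b)
    return mean_weight(cross, weights)

# Hypothetical predicted relationships (undirected edge weights).
weights = {
    frozenset(("cheA", "cheY")): 0.9,
    frozenset(("cheA", "cheW")): 0.8,
    frozenset(("cheY", "cheW")): 0.7,
    frozenset(("fliC", "flgE")): 0.9,
    frozenset(("cheA", "fliC")): 0.6,
    frozenset(("cheY", "fliC")): 0.5,
    frozenset(("cheW", "fliC")): 0.4,
}
chemotaxis = {"cheA", "cheY", "cheW"}
flagellar_assembly = {"fliC", "flgE"}

print(cohesiveness(chemotaxis, weights))
print(association(chemotaxis, flagellar_assembly, weights))
```

In practice both numbers would be compared with a genomic background, for example the mean weight over all gene pairs, before calling a process cohesive or two processes associated.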
27
Functional mapping: Associations among processes
Edges: associations between processes (very strong to moderately strong)
Processes shown:
• Hydrogen Transport
• Electron Transport
• Cellular Respiration
• Protein Processing
• Peptide Metabolism
• Cell Redox Homeostasis
• Aldehyde Metabolism
• Energy Reserve Metabolism
• Vacuolar Protein Catabolism
• Negative Regulation of Protein Metabolism
• Organelle Fusion
• Protein Depolymerization
• Organelle Inheritance
28
Functional mapping: Associations among processes
Edges: associations between processes (very strong to moderately strong)
Borders: data coverage of processes (well covered to sparsely covered)
29
Functional mapping: Associations among processes
Edges: associations between processes (very strong to moderately strong)
Nodes: cohesiveness of processes (below baseline, baseline (genomic background), very cohesive)
Borders: data coverage of processes (well covered to sparsely covered)
30
Functional mapping: Associations among processes
• Gene expression
• Physical PPIs
• Genetic interactions
• Colocalization
• Sequence
• Protein domains
• Regulatory binding sites
• …
How do functional interactions become pathways?
31
Functional genomic data
32
Simultaneous inference of physical, genetic, regulatory, and functional networks
With Chris Park, Olga Troyanskaya
• Functional interactions
• Regulatory interactions
• Post-transcriptional regulation
• Metabolic interactions
• Phosphorylation
• Protein complexes
33
Learning a compendium of interaction networks
Train one SVM per interaction type
Resolve consistency using hierarchical Bayes net
34
Learning a compendium of interaction networks
AUC (axis from 0.5 to 1.0)
Both presence/absence and directionality of interactions are accurately inferred
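For reference, AUC can be computed directly from prediction scores and gold-standard labels as the probability that a randomly chosen true interaction outscores a randomly chosen non-interaction; 0.5 corresponds to random guessing and 1.0 to a perfect ranking. The scores and labels below are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    positives = [s for s, lab in zip(scores, labels) if lab]
    negatives = [s for s, lab in zip(scores, labels) if not lab]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical predicted confidences and gold-standard labels for five gene pairs.
predicted = [0.9, 0.8, 0.4, 0.3, 0.2]
is_interaction = [True, True, False, True, False]
print(auc(predicted, is_interaction))
```

The pairwise form is quadratic in the number of examples; for genome-scale networks a rank-based formulation gives the same value more efficiently.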
35
Using network compendia to predict complete pathways
Additional 20 novel synthetic lethality predictions tested, 14 confirmed (>100x better than random)
Confirmed
Unconfirmed
With David Hess
36
Interactive aligned network viewer: http://function.princeton.edu/bioweaver
Graphle
37
Outline
• Functional network integration
  – Bayes nets and LR
  – The human genome, tissues, and disease
• Network meta-analysis
  – Pathogens and MTb
  – Quantifying progress in yeast
• Networks to pathways
  – Functional mapping: networks of networks
  – Hierarchical integration
  – Pathway prediction
• Regulatory network integration
  – Network motifs
38
Human Regulatory Networks
Quiescence: reversible exit from the cell cycle
Figure: expression heat map of 6,829 genes in clusters I to X, with G0 marked, across serum starvation and serum re-stimulation time courses (time points in hours: 1, 2, 4, 8, 24, 96 and 1, 2, 4, 8, 24, 48). Cluster annotations: Development, Development, Cholesterol, Protein localization, Cell cycle, RNA processing, Metabolism.
Motifs found by FIRE (Elemento et al. 2007): Elk-1, Sp1, NF-Y, YY1
• Of only five regulators found, four have generic cell cycle/proliferation targets
• Just five basic regulators for ~7,000 genes?
• These motifs only appear upstream of ~half of the genes
39
COALESCE: Combinatorial Algorithm for Expression and Sequence-based Cluster Extraction
Inputs: gene expression and DNA sequence (5’ UTR, 3’ UTR, upstream flank, downstream flank), plus evolutionary conservation and nucleosome positions
Algorithm:
• Create a new module
• Feature selection (tests for differential expression/frequency):
  – Identify conditions where genes coexpress
  – Identify motifs enriched in genes’ sequences
• Bayesian integration: select genes based on conditions and motifs
• Subtract mean from all data
Output: regulatory modules
• Coregulated genes
• Conditions where they’re coregulated
• Putative regulating motifs
40
COALESCE: Selecting Coexpressed Conditions
• For each gene expression condition…
  – Compare distributions of values for
    • Genes in the module versus
    • Genes not in the module
  – If significantly different, include the condition
Preserving data structure:
• If multiple conditions derive from the same dataset, they can be included/excluded as a unit
• For example, time course vs. deletion collection
• Test using multivariate z-test
• Precalculate covariance matrix; still very efficient
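The per-condition test described on this slide can be sketched with a univariate two-sample z-test. This is a simplification: the slide's multivariate z-test for conditions grouped by dataset is not implemented here, and the significance threshold, gene names, and expression values are hypothetical.

```python
import math

def mean_and_variance(values):
    """Sample mean and unbiased sample variance (needs at least two values)."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return mean, variance

def z_test(inside, outside):
    """Two-sample z-test; returns (z, two-sided p-value under a normal null)."""
    m_in, v_in = mean_and_variance(inside)
    m_out, v_out = mean_and_variance(outside)
    z = (m_in - m_out) / math.sqrt(v_in / len(inside) + v_out / len(outside))
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

def select_conditions(expression, module, alpha=0.05):
    """Keep conditions whose values differ between module and non-module genes.

    expression maps condition name -> {gene: value}; alpha is a hypothetical cutoff.
    """
    selected = []
    for condition, values in expression.items():
        inside = [v for g, v in values.items() if g in module]
        outside = [v for g, v in values.items() if g not in module]
        _, p = z_test(inside, outside)
        if p < alpha:
            selected.append(condition)
    return selected

# Hypothetical expression values for six genes in two conditions.
expression = {
    "heat_shock": {"g1": 2.0, "g2": 2.2, "g3": 1.8, "g4": 0.1, "g5": -0.1, "g6": 0.0},
    "control": {"g1": 0.1, "g2": -0.2, "g3": 0.0, "g4": 0.0, "g5": 0.1, "g6": -0.1},
}
module = {"g1", "g2", "g3"}
print(select_conditions(expression, module))
```

Testing all conditions of one dataset as a unit would replace the scalar variance here with a covariance matrix across those conditions.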
41
COALESCE: Selecting Significant Motifs
• Coalesce looks for three kinds of motifs:
  – K-mers (e.g. ACGACGT)
  – Reverse complement pairs (e.g. ACGACAT | ATGTCGT)
  – Probabilistic Suffix Trees (PSTs)
• For every possible motif…
  – Compare distributions of values for
    • Genes in the module versus
    • Genes not in the module
  – If significantly different, include the motif
• This can distinguish flanks from UTRs
• Fast!
• Efficient enough to search coding sequence (e.g. exons/introns)
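The first two motif types can be illustrated with simple counting code: k-mers are counted per sequence, and a reverse complement pair is counted under a single shared key. This is a minimal sketch, not COALESCE's implementation; PSTs are not covered, and the example sequence is constructed for illustration.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence over the alphabet ACGT."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def kmer_counts(seq, k):
    """Count every overlapping k-mer in a sequence."""
    counts = {}
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts

def rc_pair_counts(seq, k):
    """Count each k-mer together with its reverse complement.

    Both strands share one canonical key: the alphabetically smaller string.
    """
    merged = {}
    for kmer, n in kmer_counts(seq, k).items():
        key = min(kmer, reverse_complement(kmer))
        merged[key] = merged.get(key, 0) + n
    return merged

# Constructed example containing ACGACAT and its reverse complement ATGTCGT.
sequence = "ACGACATTTATGTCGT"
print(reverse_complement("ACGACAT"))
print(rc_pair_counts(sequence, 7)["ACGACAT"])
```

Per-gene counts like these are the values whose distributions are then compared between genes in the module and genes outside it.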
42
COALESCE: Selecting Probable Genes
• For each gene in the genome…
  – For each significant condition…
  – For each significant motif…
    • What’s the probability the gene came from the module’s distribution?
    • What’s the probability that it came from outside the module?
The probability of a gene being in the module given some data…
$$P(g \in M \mid D) = \frac{P(D \mid g \in M)\,P(g \in M)}{P(D \mid g \in M)\,P(g \in M) + P(D \mid g \notin M)\,P(g \notin M)}$$
Distributions of each feature in and out of the developing module are observed from the data.
Prior is used to stabilize module convergence; genes already in the module are more likely to stay there next iteration.
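The gene-selection step can be sketched as Bayes' rule over the selected features. This sketch assumes each feature is normally distributed inside and outside the module and that features are independent given membership; those modeling choices, the feature names, the distribution parameters, and the prior values are all hypothetical.

```python
import math

def normal_logpdf(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * ((x - mean) / sd) ** 2 - math.log(sd * math.sqrt(2.0 * math.pi))

def membership_posterior(gene_values, inside, outside, prior):
    """P(gene in module | data) from per-feature in/out distributions.

    gene_values maps feature -> value; inside and outside map
    feature -> (mean, sd); prior is P(gene in module) before seeing data.
    """
    log_in = math.log(prior)
    log_out = math.log(1.0 - prior)
    for feature, value in gene_values.items():
        log_in += normal_logpdf(value, *inside[feature])
        log_out += normal_logpdf(value, *outside[feature])
    diff = log_out - log_in
    if diff > 700.0:          # avoid overflow in exp for hopeless candidates
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))

# Hypothetical feature distributions for a developing module.
inside = {"cond_A": (2.0, 0.5), "motif_X": (3.0, 1.0)}
outside = {"cond_A": (0.0, 1.0), "motif_X": (0.5, 1.0)}
gene = {"cond_A": 1.8, "motif_X": 2.5}

# A gene already in the module gets a higher prior than one outside it.
p_if_member = membership_posterior(gene, inside, outside, prior=0.8)
p_if_new = membership_posterior(gene, inside, outside, prior=0.2)
print(p_if_member, p_if_new)
```

Working in log space keeps the product over many conditions and motifs from underflowing.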
43
COALESCE: Integrating Additional Data Types
Nucleosome placement (N) and evolutionary conservation (C)
• Can be included as additional datasets and feature selected just like expression conditions/motifs.
• Or can be used as a prior or weight on the values of individual motifs.
      N     C
G1    2.5   0.0
G2    0.6   0.5
G3    1.2   0.9
…     …     …
Example sequence: TCCGGTAGAACTACTGGTATTGTTTTGGATTCCGGTGATG
44
COALESCE Results: S. cerevisiae Modules
The haystack: ~2,200 conditions, ~6,000 genes
A needle: 100 genes, 80 conditions
45
COALESCE Results: S. cerevisiae Modules
• 54 genes, 144 conditions: Conjugation
• 33 genes, 434 conditions: Budding
• 112 genes, 82 conditions: Mitosis and DNA replication
Regulators shown: Swi5, Stb1/Swi6, Ste12
46
COALESCE Results: S. cerevisiae Modules
• 50 genes, 775 conditions: Iron transport
• 11 genes, 844 conditions: Phosphate transport
• 126 genes, 660 conditions: Glycolysis, iron and phosphate transport, amino acid metabolism…
Regulators shown: Pho4; Helix-Loop-Helix Tye7/Cbf1/Pho4; Aft1/2
47
COALESCE Results: S. cerevisiae Modules
• 72 genes, 319 conditions: Mitochondrial translation (Puf3)
…plus more ribosome clusters than you can shake a stick at!
48
COALESCE Results: Yeast TF/Target Accuracy
Figure: bar chart of Z-scores (axis from -0.3 to 1.3) comparing COALESCE, cMonkey, FIRE, and Weeder across the transcription factors Bas1p, Hap4p, Met32p, Cup2p, Met31p, Zap1p, Upc2p, Mbp1p, Hsf1p, Gln3p, Hap3p, Gcn4p, Uga3p, Gis1p, Hap5p
49
COALESCE Results: TF/Targets Influenced by Supporting Data
Figure: bar chart of Z-scores (axis from -0.5 to 3) comparing COALESCE alone, with conservation, with nucleosomes, and with conservation + nucleosomes, across Sfl1p, Gcr1p, Uga3p, Mot3p, Sum1p, Cst6p, Mig3p
Annotations: improved by any additional data, mainly conservation; decreased by additional data; improved by conservation; improved only by both
50
COALESCE Results: Yeast Clustering Accuracy
• ~2,200 yeast conditions
  – Recapitulation of known biology from Gene Ontology
51
COALESCE Results: Yeast Clustering Accuracy
• ~2,200 yeast conditions
  – Recapitulation of known biology from Gene Ontology
ASCL1 in 5’ flank, unch. sequences underenriched in 3’ UTR
M. musculus: Up in callosal and motor neurons
C. elegans: Up in larvae, down in adults
GATA in 5’ flank, miR-788 seed in 3’ UTR
AAGGGGC (zf?) and enriched in 5’ flank
H. sapiens: Up in normal muscle, down in diabetic
52
COALESCE: Coregulated Quiescence Modules
• Predicts regulatory modules from genomic data:
  – Coregulated genes
  – Conditions under which coregulation occurs
  – Putative regulatory motifs
• 5 quiescence-related microarray datasets, 60 conditions
  – Quiescence program (Coller et al. 2006)
  – Adenoviral infection (Miller et al. 2007)
  – let-7 response (Legesse-Miller et al. unpub.)
  – Contact inhibition (Scarino et al. unpub.)
  – Serum withdrawal (Legesse-Miller et al. unpub.)
53
COALESCE: Coregulated Quiescence Modules
• Down during quiescence entry, up during quiescence exit, down with adenoviral infection
  – Specific predicted uncharacterized reverse complement motif
• Up during quiescence entry, down during quiescence exit
  – Many known related (proliferation) motifs: Pax4, Staf, NFKB1, Gfi, ESR1, Runx1, Su(H)
• Down during quiescence entry, enriched for transport/trafficking
  – miR-297 motif predicted in 3’ UTR (CACATAC)
• Down with let-7 exposure
  – let-7 motifs predicted in 3’ UTR (UACCUC)
54
Network Motifs
• Coherent feed-forward: filter
• Incoherent feed-forward: pulse
• Bi-fan
• Positive auto-regulation: delay
• Negative auto-regulation: speed + stability
• Feedback: memory
WGD and evolvability
55
From Milo, et al., Science, 2002
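Motifs such as the feed-forward loop and auto-regulation can be found in a directed regulatory network by simple enumeration. The sketch below counts them in a toy network; the node names and edges are hypothetical, and edge signs are not modeled, so it does not distinguish coherent from incoherent loops.

```python
def feed_forward_loops(edges):
    """Return all (x, y, z) such that x->y, y->z, and x->z are edges."""
    targets = {}
    for source, target in edges:
        targets.setdefault(source, set()).add(target)
    loops = []
    for x, x_targets in targets.items():
        for y in x_targets:
            if y == x:
                continue
            for z in targets.get(y, ()):
                if z != x and z != y and z in x_targets:
                    loops.append((x, y, z))
    return sorted(loops)

def auto_regulators(edges):
    """Return nodes that regulate themselves (self-loops)."""
    return sorted({source for source, target in edges if source == target})

# Hypothetical regulatory network as (regulator, target) pairs.
network = [("X", "Y"), ("Y", "Z"), ("X", "Z"), ("Z", "Z"), ("A", "B")]
print(feed_forward_loops(network))
print(auto_regulators(network))
```

Deciding whether a motif is over-represented, as in Milo et al., additionally requires comparing these counts with counts in randomized networks, which this sketch omits.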
56
Outline
• Functional network integration
  – Bayes nets and LR
  – The human genome, tissues, and disease
• Network meta-analysis
  – Pathogens and MTb
  – Quantifying progress in yeast
• Networks to pathways
  – Functional mapping: networks of networks
  – Hierarchical integration
  – Pathway prediction
• Regulatory network integration
  – Network motifs
57
1:1 Lewis Carroll Map
"… And then came the grandest idea of all! We actually made a map of the country, on the scale of a mile to the mile!"
"Have you used it much?" I enquired.
"It has never been spread out, yet," said Mein Herr: "the farmers objected: they said it would cover the whole country, and shut out the sunlight! So we now use the country itself, as its own map, and I assure you it does nearly as well."
Sylvie and Bruno Concluded by Lewis Carroll, 1893.