Answering biological questions using large genomic data collections
Curtis Huttenhower
10-05-09
Harvard School of Public Health
Department of Biostatistics
A Definition of Computational Functional Genomics
Genomic data, prior knowledge
Data → Function
Function → Function
Gene → Gene
Gene → Function
MEFIT: A Framework for Functional Genomics
Related Gene Pairs
BRCA1 BRCA2 0.9
BRCA1 RAD51 0.8
RAD51 TP53 0.85
…
Unrelated Gene Pairs
BRCA2 SOX2 0.1
RAD51 FOXP2 0.2
ACTR1 H6PD 0.15
…
[Figure: frequency of low versus high correlation among related and unrelated gene pairs]
MEFIT
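One way to read the histogram on this slide is as a Bayes-rule calculation: how often a given correlation level appears among related pairs versus unrelated pairs determines how strongly that correlation suggests a functional relationship. The sketch below is an illustration under stated assumptions, not MEFIT's actual implementation. The two-bin discretization, the pseudocount, the 0.5 prior and all function names are hypothetical; only the six example pairs come from the slide.

```python
from collections import Counter

# Toy gold standard copied from the slide: (gene A, gene B, correlation).
related = [("BRCA1", "BRCA2", 0.9), ("BRCA1", "RAD51", 0.8), ("RAD51", "TP53", 0.85)]
unrelated = [("BRCA2", "SOX2", 0.1), ("RAD51", "FOXP2", 0.2), ("ACTR1", "H6PD", 0.15)]


def bin_of(correlation, n_bins=2):
    """Map a correlation to a bin index, clamping values outside [0, 1].

    The equal-width binning is an assumption made for this sketch.
    """
    return max(0, min(int(correlation * n_bins), n_bins - 1))


def bin_frequencies(pairs, n_bins=2, pseudocount=1.0):
    """Smoothed frequency of each correlation bin among the given pairs."""
    counts = Counter(bin_of(c, n_bins) for _, _, c in pairs)
    total = len(pairs) + pseudocount * n_bins
    return [(counts[b] + pseudocount) / total for b in range(n_bins)]


def posterior_related(correlation, prior=0.5, n_bins=2):
    """P(related | correlation bin), by Bayes' rule over the two histograms."""
    b = bin_of(correlation, n_bins)
    p_related = bin_frequencies(related, n_bins)[b] * prior
    p_unrelated = bin_frequencies(unrelated, n_bins)[b] * (1.0 - prior)
    return p_related / (p_related + p_unrelated)
```

With these toy inputs, a highly correlated pair gets a posterior of about 0.8 and a weakly correlated pair about 0.2; real datasets would need many more gold-standard pairs and finer bins.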
MEFIT: A Framework for Functional Genomics
Golub 1999
Butte 2000
Whitfield 2002
Hansen 1998
Functional Relationship
Biological Context: functional area, tissue, disease, …
Functional Interaction Networks
MEFIT
Global interaction network
Autophagy network
Vacuolar transport network
Translation network
Currently have data from 30,000 human experimental results (15,000 expression conditions + 15,000 diverse others), analyzed for 200 biological functions and 150 diseases
Predicting Gene Function
Cell cycle genes
Predicted relationships between genes (low to high confidence)
These edges provide a measure of how likely a gene is to specifically participate in the process of interest.
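The idea on this slide can be sketched as a score: average the confidence of the edges that connect a candidate gene to the genes already known to be in the process. This is a minimal illustration, not the talk's actual scoring method. The gene names, edge weights and the function `process_score` are hypothetical, and the sketch ignores the "specifically" part of the slide (it does not correct for a gene's connectivity to the rest of the genome).

```python
def process_score(gene, process_genes, edge_confidence):
    """Mean confidence of edges linking `gene` to known process genes.

    `edge_confidence` maps frozenset({gene_a, gene_b}) to a value in [0, 1];
    missing edges count as 0.
    """
    partners = [g for g in process_genes if g != gene]
    if not partners:
        return 0.0
    total = sum(edge_confidence.get(frozenset((gene, g)), 0.0) for g in partners)
    return total / len(partners)


# Hypothetical toy network: three known cell cycle genes, two candidates.
cell_cycle = {"cc1", "cc2", "cc3"}
edges = {
    frozenset(("cc1", "x")): 0.9,
    frozenset(("cc2", "x")): 0.8,
    frozenset(("cc3", "x")): 0.7,
    frozenset(("cc1", "y")): 0.1,
}

# Rank candidates by how strongly they connect to the process.
ranked = sorted(["x", "y"], key=lambda g: process_score(g, cell_cycle, edges), reverse=True)
```

Here candidate `x` ranks above `y` because its edges to the known genes are consistently high confidence.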
Functional Associations Between Contexts
Cell cycle genes
Predicted relationships between genes (low to high confidence)
The average strength of these relationships indicates how cohesive a process is.
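Cohesiveness as described on this slide can be sketched as the mean edge confidence over all pairs of genes within one process. The code below is an illustration only: the gene names, weights and the function `cohesiveness` are hypothetical, and it does not compare against a genomic background as a real analysis presumably would.

```python
from itertools import combinations


def cohesiveness(process_genes, edge_confidence):
    """Mean confidence over all within-process gene pairs (missing edges count as 0)."""
    pairs = list(combinations(sorted(process_genes), 2))
    if not pairs:
        return 0.0
    return sum(edge_confidence.get(frozenset(p), 0.0) for p in pairs) / len(pairs)


# Hypothetical toy process with three genes and three predicted edges.
toy_process = {"a", "b", "c"}
toy_edges = {
    frozenset(("a", "b")): 0.9,
    frozenset(("a", "c")): 0.6,
    frozenset(("b", "c")): 0.3,
}
toy_cohesiveness = cohesiveness(toy_process, toy_edges)
```

For the toy process the result is roughly 0.6, the average of the three pairwise confidences.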
Functional Associations Between Contexts
Cell cycle genes
DNA replication genes
Predicted relationships between genes (low to high confidence)
The average strength of these relationships indicates how associated two processes are.
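The between-process measure on this slide can be sketched as the mean confidence of edges that cross from one gene set to the other. As before, this is a hedged illustration rather than the talk's implementation: the gene names, weights and the function `association` are hypothetical, and no background normalization is applied.

```python
from itertools import product


def association(process_a, process_b, edge_confidence):
    """Mean confidence over cross-process gene pairs (missing edges count as 0).

    Genes shared by both processes are never paired with themselves.
    """
    pairs = [(a, b) for a, b in product(sorted(process_a), sorted(process_b)) if a != b]
    if not pairs:
        return 0.0
    return sum(edge_confidence.get(frozenset(p), 0.0) for p in pairs) / len(pairs)


# Hypothetical toy gene sets and cross-process edges.
cell_cycle_toy = {"c1", "c2"}
dna_replication_toy = {"d1", "d2"}
cross_edges = {
    frozenset(("c1", "d1")): 0.8,
    frozenset(("c1", "d2")): 0.6,
    frozenset(("c2", "d1")): 0.4,
    frozenset(("c2", "d2")): 0.2,
}
toy_association = association(cell_cycle_toy, dna_replication_toy, cross_edges)
```

The toy value is roughly 0.5; applying the same function to every pair of processes would give the edge weights of a process-level map.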
Functional Associations Between Processes
Edges: associations between processes (very strong, moderately strong)
Nodes: cohesiveness of processes (below baseline, baseline (genomic background), very cohesive)
Borders: data coverage of processes (well covered, sparsely covered)
Hydrogen Transport
Electron Transport
Cellular Respiration
Protein Processing
Peptide Metabolism
Cell Redox Homeostasis
Aldehyde Metabolism
Energy Reserve Metabolism
Vacuolar Protein Catabolism
Negative Regulation of Protein Metabolism
Organelle Fusion
Protein Depolymerization
Organelle Inheritance
AHP1, DOT5, GRX1, GRX2, …
APE3, LAP4, PAI3, PEP4, …
HEFalMp: Predicting human gene function
HEFalMp
HEFalMp: Predicting human genetic interactions
HEFalMp
HEFalMp: Analyzing human genomic data
HEFalMp
HEFalMp: Understanding human disease
HEFalMp
Validating Human Predictions
Autophagy
Conditions: not starved, starved (autophagic)
Luciferase (negative control)
ATG5 (positive control)
LAMP2, RAB11A (predicted novel autophagy proteins)
5½ of 7 predictions currently confirmed
With Erin Haley, Hilary Coller
Comprehensive Validation of Computational Predictions
Genomic data, prior knowledge
Computational predictions of gene function: MEFIT, SPELL (Hibbs et al 2007), bioPIXIE (Myers et al 2005)
Genes predicted to function in mitochondrion organization and biogenesis
Laboratory experiments: petite frequency, growth curves, confocal microscopy
New known functions for correctly predicted genes
Retraining
With David Hess, Amy Caudy
Evaluating the Performance of Computational Predictions
Genes involved in mitochondrion organization and biogenesis
106 Original GO Annotations
135 Under-annotations
82 Novel Confirmations, First Iteration
17 Novel Confirmations, Second Iteration
340 total: >3x previously known genes in ~5 person-months
Evaluating the Performance of Computational Predictions
Genes involved in mitochondrion organization and biogenesis
106 Original GO Annotations
95 Under-annotations
40 Confirmed Under-annotations
80 Novel Confirmations, First Iteration
17 Novel Confirmations, Second Iteration
340 total: >3x previously known genes in ~5 person-months
Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated.
Functional Maps: Focused Data Summarization
ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA
Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?
How can a researcher take advantage of all this data to study his/her favorite gene/pathway/disease without losing information?
Functional mapping
• Very large collections of genomic data
• Specific predicted molecular interactions
• Pathway, process, or disease associations
• Underlying experimental results and functional activities in data
Thanks!
Hilary Coller, Erin Haley, Tsheko Mutungu
Olga Troyanskaya, Matt Hibbs, Chad Myers, David Hess, Edo Airoldi, Florian Markowetz
Shuji Ogino, Charlie Fuchs
NIGMS
http://function.princeton.edu/hefalmp
http://www.huttenhower.org
Interested? I’m accepting students and postdocs!
Next Steps: Microbial Communities
• Data integration is off to a great start in humans
  – Complex communities of distinct cell types
  – Very sparse prior knowledge
    • Concentrated in a few specific areas
  – Variation across populations
  – Critical to understand mechanisms of disease
Next Steps: Microbial Communities
• What about microbial communities?
  – Complex communities of distinct species/strains
  – Very sparse prior knowledge
    • Concentrated in a few specific species/strains
  – Variation across populations
  – Critical to understand mechanisms of disease
Next Steps: Microbial Communities
[Figure labels: PKH1, PKH2, PKH3, LPD1, CAR1, W04B5.5, pdk-1, R04B3.2, LLC1.3, T21F4.1, PDPK1, ARG1, DLD, ARG2, AGA]
~120 available expression datasets
~70 species
Weskamp et al 2004
Flannick et al 2006
Kanehisa et al 2008
Tatusov et al 1997
• Data integration works just as well in microbes as it does in humans
• We know an awful lot about some microorganisms and almost nothing about others
• Purely sequence-based and purely network-based tools for function transfer both fall short
• We need data integration to take advantage of both and mine out useful biology!
Next Steps: Functional Metagenomics
• Metagenomics: data analysis from environmental samples
  – Microflora: environment includes us!
• Another data integration problem
  – Must include datasets from multiple organisms
• Another context-specificity problem
  – Now “context” can also mean “species”
• What questions can we answer?
  – How do human microflora interact with diabetes, obesity, oral health, antibiotics, aging, …
  – What’s shared within community X? What’s different? What’s unique?
  – What’s perturbed in disease state Y? One organism, or many? Host interactions?
  – Current methods annotate ~50% of synthetic data, <5% of environmental data
[Figure labels: PKH1, PKH2, PKH3, LPD1, CAR1, W04B5.5, pdk-1, R04B3.2, LLC1.3, T21F4.1, PDPK1, ARG1, DLD, ARG2, AGA]