
Computational Approaches to Functional Genomics

Olga Troyanskaya, Assistant Professor

Lewis-Sigler Institute for Integrative Genomics & Department of Computer Science

Princeton University

Laboratory of Bioinformatics & Functional Genomics

A primer: Molecular biology 101

Cells are fundamental working units of all organisms

Yeast are unicellular organisms

Humans are multi-cellular organisms

Understanding how a cell works is critical to understanding how the organism functions

DNA

• Uses an alphabet of 4 letters {A, T, C, G}, called bases
• Encodes genetic information in a triplet code
• Structure: a double helix

Proteins

• A sequence of amino acids (alphabet of 20)
• Each amino acid is encoded by 3 DNA bases
• Perform most of the actual work in the cell
• Fold into complex 3D structures

Courtesy of the Zhou Laboratory, The State University of New York at Buffalo

How does a cell function?

Courtesy U.S. Department of Energy Genomes to Life program

DNA is a sequence of bases {A, T, C, G}

TAT-CGT-AGT

Proteins consist of amino acids, whose sequence is encoded in DNA

Tyr-Arg-Ser

Each 3 bases of DNA encode 1 amino acid
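The triplet mapping above can be sketched in a few lines of Python. This is only an illustration: the table holds just the three codons from the slide's example (taken from the standard genetic code), and the function name `translate` is a hypothetical helper, not something from the talk.

```python
# Partial DNA codon table: only the three codons used in the slide's example,
# taken from the standard genetic code. A full table has 64 entries.
CODON_TO_AMINO_ACID = {
    "TAT": "Tyr",
    "CGT": "Arg",
    "AGT": "Ser",
}


def translate(dna: str) -> list:
    """Split a DNA string into triplets and map each triplet to an amino acid."""
    bases = dna.replace("-", "").upper()
    if len(bases) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    codons = [bases[i:i + 3] for i in range(0, len(bases), 3)]
    return [CODON_TO_AMINO_ACID[codon] for codon in codons]


print("-".join(translate("TAT-CGT-AGT")))  # Tyr-Arg-Ser
```

Codons missing from this partial table raise a `KeyError`, which is acceptable for a sketch but would need the full table in practice.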

DNA-RNA-protein

Genes vs. proteins

• Genes are units of inheritance
• They are static blueprints
• It’s proteins (dynamic) that do most of the work
• The process of making mRNA, and then protein, from a gene (or genes) is called GENE EXPRESSION
• It’s the control of gene expression that causes most phenotypic differences in organisms

Gene Regulatory Circuit

Genes =? wires

Motifs =? gates

Gene D (inputs A, B, C): Make D
• If C then D
• If B then NOT D
• If A and B then D

Gene B (input D): Make B
• If D then B
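The circuit rules can be read as a small Boolean network. The sketch below is one possible reading, not the slide's definitive semantics: the rules do not say which one wins when they conflict, so the precedence used here ("A and B" overrides repression by B alone) and the synchronous update scheme are assumptions, and `next_state` and `simulate` are hypothetical names.

```python
def next_state(state):
    """One synchronous update of the circuit; A and C are fixed external inputs."""
    a, b, c, d = (state[gene] for gene in "ABCD")
    # Assumed precedence: A together with B activates D; otherwise C
    # activates D only when the repressor B is absent.
    make_d = (a and b) or (c and not b)
    # "If D then B": B simply follows the previous level of D.
    make_b = d
    return {"A": a, "B": make_b, "C": c, "D": make_d}


def simulate(state, steps):
    """Return the list of states visited, starting with the initial state."""
    trajectory = [state]
    for _ in range(steps):
        state = next_state(state)
        trajectory.append(state)
    return trajectory


start = {"A": True, "B": False, "C": True, "D": False}
for t, s in enumerate(simulate(start, 3)):
    print(t, s)
```

With A present the circuit settles with both B and D on; with A absent, D turning on B and B turning off D gives a negative feedback loop that oscillates under this reading.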

The “greatness” of genomics…

• Biological systems are complex
• Many biological processes & diseases result from complex changes at the molecular level
• Need to observe & model cellular processes at a systems level

High-throughput technologies have led to an explosion of data in biology, in hopes of understanding biological systems

Explosion of functional genomic DATA

KNOWLEDGE of components and inter-relationships that lead to function

… And its “downfall”

Why have genomic data not been utilized fully?

Challenges:

•Genomic data are noisy

•Genomic data are heterogeneous

•Coverage/accuracy varies by biological process

Computation is a tool for functional genomics

Our approach:
(1) Integrated analysis of diverse data
(2) Probabilistic methods to battle noise in data
(3) Integrating computation and experiments
(4) Accessibility and usefulness to community

(bringing experts into the analysis loop and feedback to experimental biology)

Computational methods (and targeted experiments) can greatly aid in extracting knowledge from biological data, but several challenges must be addressed.

Story #1: predicting function of unknown proteins

Predicting gene function using the Gene Ontology hierarchy

• A number of previous approaches predict function from diverse data; most use GO biological process terms
• However, GO is a hierarchy

• Could improve accuracy by enforcing hierarchical consistency

[Figure: fragment of the GO hierarchy with nodes Biological Process, Regulation, Cellular Process, Regulation of Cellular Process, Cell Differentiation, and Unknown]

Hierarchical Consistency

All genes

TRAINING

cytokinesis: NO

bud site selection: YES

cell proliferation: YES

EVALUATION

Our Method

• Individual classifiers for each class
• Inconsistent predictions allowed
• Any classification algorithm can be used
• Parallel evaluation

• Bayesian combination of predictions
• Inconsistencies resolved globally
• Any inference algorithm can be used

[Figure: GO fragment with nodes mRNA processing, mRNA metabolism, RNA processing, and RNA metabolism]

A Bayesian Framework

Given predictions g1...gN ∈ ℜ, find true labels y1...yN ∈ {0,1} that maximize

P(y1...yN | g1...gN) = α P(g1...gN | y1...yN) P(y1...yN)
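A minimal sketch of this idea on the four-node GO fragment named above, using brute-force enumeration. Several pieces are assumptions made for illustration and are not taken from the slides: the parent/child edges, a uniform prior over hierarchically consistent labelings, independent Gaussian likelihoods P(g_i | y_i) with means +1 and -1 and unit variance, and the example classifier outputs.

```python
import itertools
import math

# Assumed child -> parents structure for the GO fragment.
PARENTS = {
    "RNA metabolism": [],
    "mRNA metabolism": ["RNA metabolism"],
    "RNA processing": ["RNA metabolism"],
    "mRNA processing": ["mRNA metabolism", "RNA processing"],
}
NODES = list(PARENTS)


def is_consistent(labels):
    """A positive node requires all of its parents to be positive."""
    return all(not labels[c] or all(labels[p] for p in ps)
               for c, ps in PARENTS.items())


def gaussian(x, mean, sd=1.0):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def likelihood(g, labels):
    """P(g | y) under the assumed independent Gaussian output model."""
    return math.prod(gaussian(g[n], 1.0 if labels[n] else -1.0) for n in NODES)


def posterior(g):
    """P(y | g) over consistent labelings; normalizing plays the role of alpha."""
    scored = {}
    for bits in itertools.product([0, 1], repeat=len(NODES)):
        labels = dict(zip(NODES, bits))
        if is_consistent(labels):  # uniform prior on consistent labelings, zero elsewhere
            scored[bits] = likelihood(g, labels)
    total = sum(scored.values())
    return {bits: s / total for bits, s in scored.items()}


def map_labels(g):
    post = posterior(g)
    return dict(zip(NODES, max(post, key=post.get)))


def marginals(g):
    post = posterior(g)
    return {n: sum(p for bits, p in post.items() if bits[i])
            for i, n in enumerate(NODES)}


# Hypothetical raw outputs: thresholding at 0 would say "mRNA processing" yes
# but "mRNA metabolism" no, which is hierarchically inconsistent.
raw_outputs = {"RNA metabolism": 0.2, "mRNA metabolism": -0.4,
               "RNA processing": 0.9, "mRNA processing": 1.5}
print(map_labels(raw_outputs))
print(marginals(raw_outputs))
```

Enumeration is exponential in the number of nodes, so it only works for tiny fragments; for the 105-node hierarchy an inference algorithm over the Bayesian network would be needed.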

Data Types (for Saccharomyces cerevisiae)

The Gene Ontology
• 105 “meaningful” nodes selected

Pairwise Interaction (GRID)
• Affinity Precipitation
• Affinity Chromatography
• Two-Hybrid
• Purified Complex
• Biochemical Assay
• Synthetic Lethality
• Synthetic Rescue
• Dosage Lethality

Colocalization
• O’Shea

Curated Complexes

(152 features)

Transcription Factor Binding Sites
• PROSPECT

(39 features)

Microarrays (SMD)
• Spellman et al., 1998
• Gasch et al., 2000, 2001
• Sudarsanam et al., 2000
• Yoshimoto et al., 2002
• Chu et al., 1998
• Shakoury-Elizeh et al., 2003
• Ogawa et al., 2000

(342 features)

Does hierarchical consistency help?

• For each class, 10 linear SVMs trained by bootstrapping
• Median of unthresholded outputs used (bagging)
• Area under the ROC curve (AUC) for evaluation
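The evaluation recipe above (bootstrap resampling, median of unthresholded outputs, AUC) can be sketched with the standard library alone. To stay dependency-free, a simple centroid-difference linear scorer stands in for the linear SVM, so this shows the bagging and scoring steps rather than the SVM itself; the toy data and all function names are hypothetical.

```python
import random
import statistics


def train_centroid_scorer(X, y):
    """Stand-in for a linear SVM: score(x) = w.x + b with w = mean(pos) - mean(neg)."""
    dim = len(X[0])
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    mu_pos = [sum(x[j] for x in pos) / len(pos) for j in range(dim)]
    mu_neg = [sum(x[j] for x in neg) / len(neg) for j in range(dim)]
    w = [p - n for p, n in zip(mu_pos, mu_neg)]
    # The decision boundary passes through the midpoint of the two centroids.
    b = -sum(wj * (p + n) / 2 for wj, p, n in zip(w, mu_pos, mu_neg))
    return lambda x: sum(wj * xj for wj, xj in zip(w, x)) + b


def bagged_scores(X_train, y_train, X_test, n_models=10, seed=0):
    """Train n_models scorers on bootstrap samples; return per-example median output."""
    rng = random.Random(seed)
    n = len(X_train)
    per_model = []
    while len(per_model) < n_models:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        ys = [y_train[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample if one class is missing
        scorer = train_centroid_scorer([X_train[i] for i in idx], ys)
        per_model.append([scorer(x) for x in X_test])
    return [statistics.median(column) for column in zip(*per_model)]


def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 1/2)."""
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


X_train = [(2, 2), (3, 2), (2, 3), (3, 3), (0, 0), (1, 0), (0, 1), (1, 1)]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]
X_test = [(2.5, 2.5), (0.5, 0.5), (3.0, 1.8), (0.2, 1.0)]
y_test = [1, 0, 1, 0]
print(auc(bagged_scores(X_train, y_train, X_test), y_test))
```

Taking the median rather than the mean makes the aggregate less sensitive to a single badly trained bootstrap model, and AUC needs no threshold, which is why unthresholded outputs can be scored directly.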

93 of 105 nodes (89%) are improved by Bayesian correction.

• Best ΔAUC = +0.346 (+63% of old AUC)
• Worst ΔAUC = -0.031 (-3% of old AUC)
• Average ΔAUC = +0.033 (+4% of old AUC)

Most processes improve in accuracy (AUC Scatter Plot)

AUC Changes

Held-out Example: YNL261W

[Figure panels: raw SVM outputs vs. Bayes net marginal probabilities]

Verification: New Data

GO since our April 2004 snapshot: 105 new annotations for 88 genes

Predictions over the 88 genes on our data:
• Independent SVMs: 32% precision, 7% recall
• Bayesian correction: 32% precision, 20% recall; 51% precision, 7% recall

Predictions of novel proteins involved in mitosis

Lab testing of some predictions for mitosis:

• YMR144W: “mitotic chromosome segregation”
  Large-budded YMR144WΔ cells -> frequent nuclear defects

• YOR315W: “mitotic spindle assembly”
  Large-budded YOR315WΔ cells (fixed; anti-α-tubulin antibody) -> frequent misaligned spindles and nuclear defects

• YMR299C: “mitotic cell cycle”
  Lee et al. (2005) showed that the YMR299C protein is part of a dynein pathway

Independent SVMs miss these.

Experimental validation

[Figure panels: Wild Type, YMR144WΔ, YOR315WΔ]

Summary

Using multiple information sources helps prediction accuracy

• Multiple diverse data sources
• Using the Gene Ontology hierarchy

Probabilistic and machine learning approaches can generate experimentally testable predictions

Our hierarchical consistency approach increases accuracy and generates novel predictions

Story #2: predicting biological networks

Functional genomic DATA

KNOWLEDGE of components and inter-relationships that lead to function

Specific goal: building biological networks from experimental data

• Gene expression

• Physical protein-protein interactions

• Genetic interactions

• Cellular localization

• Sequence

Key ideas:

Integration: combine information from all available sources in a robust way

Understand/use information on biological context

Building a practical system that directly involves biologists in the prediction process and can direct further experiments

http://pixie.princeton.edu

bioPIXIE – a system for discovery & analysis of biological networks

(in specific biological context)

•For S. cerevisiae: integrates data from ~6500 publications

•Other organisms coming

System overview

Data integration via a Bayesian network

Network recovery algorithm

Input data:
• Gene expression: datasets 1 through N
• Physical interactions: yeast two-hybrid dataset 1, co-precipitation dataset 1
• Transcription factor binding sites
• Localization
• Curated literature
• Genetic interactions: synthetic lethality dataset, synthetic rescue dataset

User-selected query focuses search

ste11,ste3,ste5,ste20,msg5

Results displayed in a dynamic visualization

bioPIXIE: Pathway Inference from eXperimental Interaction Evidence

Query determines biological context

Bayesian context-specific integration

[Figure: a Functional Relationship node linked to evidence nodes (Co-expression, Physical interaction, Genetic interaction profiles, Co-localization), repeated for each biological context (DNA repair, Cell cycle, …)]

Query selects context

• We infer functional relationships between proteins
• 174 observable nodes (datasets grouped by publication and by assay)
• Naïve Bayes (compares favorably against more sophisticated alternatives, e.g. TAN, tree-augmented naïve Bayes)
• Training set: GO biological process co-annotated proteins
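A naïve Bayes combination of evidence for one protein pair can be sketched as follows. All numbers are hypothetical placeholders (in the system described, such parameters would be learned from GO co-annotated pairs, and a separate table could be kept per biological context); only three evidence types are shown instead of 174 nodes, and `posterior_related` is an illustrative name.

```python
import math

# Hypothetical parameters per evidence type:
# (P(observed | functionally related), P(observed | unrelated)).
LIKELIHOODS = {
    "co_expression": (0.30, 0.10),
    "physical_interaction": (0.20, 0.01),
    "co_localization": (0.50, 0.25),
}
PRIOR_RELATED = 0.05  # hypothetical prior that a random pair is related


def posterior_related(evidence, likelihoods=LIKELIHOODS, prior=PRIOR_RELATED):
    """P(related | evidence), treating evidence types as conditionally independent."""
    log_rel = math.log(prior)
    log_unrel = math.log(1.0 - prior)
    for name, observed in evidence.items():
        p_rel, p_unrel = likelihoods[name]
        log_rel += math.log(p_rel if observed else 1.0 - p_rel)
        log_unrel += math.log(p_unrel if observed else 1.0 - p_unrel)
    # Normalize in log space to avoid underflow when many datasets are combined.
    top = max(log_rel, log_unrel)
    rel = math.exp(log_rel - top)
    unrel = math.exp(log_unrel - top)
    return rel / (rel + unrel)


all_evidence = {"co_expression": True, "physical_interaction": True,
                "co_localization": True}
print(round(posterior_related(all_evidence), 3))  # 0.863
```

Evidence types absent from the `evidence` dictionary are treated as missing and simply skipped, which is one convenient property of the naïve Bayes factorization when datasets cover different gene pairs.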


Myers et al. Discovery of biological networks from diverse functional genomic data. Genome Biology (2005).

From integrated pairwise data to process-specific networks

use existing knowledge:

Expert-driven discovery

• Rad23 entered with Rad4, Rad3, and Rad24

• The resulting network is enriched (22 of 44) for DNA repair proteins (GO:0006281)

Experts can drive the search process

• Query: Rad23 with proteasome components Pup1, Pre6, Rpn12

• Recovered network is enriched (36 of 44) for ubiquitin-dependent catabolism proteins and only contains 2 DNA repair proteins (Rad6 and Rad23).


A: determine a “characteristic” interaction profile for the query set

B: search the remaining set of proteins for the closest matches to the characteristic profile

Basic idea: local search in the PPI network centered at the query
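Steps A and B can be illustrated on a toy weighted graph. This is a sketch of the general idea, not necessarily the published bioPIXIE algorithm: the protein names, edge weights, the use of the mean profile as the "characteristic" profile, and cosine similarity as the closeness measure are all assumptions.

```python
import math

PROTEINS = ["P1", "P2", "P3", "P4", "P5"]

# Hypothetical symmetric confidence scores between protein pairs
# (for example, posterior probabilities of a functional relationship).
EDGES = {
    ("P1", "P2"): 0.9,
    ("P1", "P3"): 0.8,
    ("P2", "P3"): 0.7,
    ("P2", "P4"): 0.1,
    ("P3", "P4"): 0.1,
    ("P4", "P5"): 0.9,
}


def weight(a, b):
    if a == b:
        return 1.0  # a protein is trivially related to itself
    return EDGES.get((a, b), EDGES.get((b, a), 0.0))


def profile(p):
    """Interaction profile: the weights from p to every protein."""
    return [weight(p, q) for q in PROTEINS]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def recover_network(query, top_k=2):
    # Step A: characteristic profile = mean of the query proteins' profiles.
    profiles = [profile(q) for q in query]
    characteristic = [sum(col) / len(query) for col in zip(*profiles)]
    # Step B: rank the remaining proteins by similarity to that profile.
    candidates = [p for p in PROTEINS if p not in query]
    ranked = sorted(candidates,
                    key=lambda p: cosine(characteristic, profile(p)),
                    reverse=True)
    return ranked[:top_k]


print(recover_network(["P1", "P2"]))  # ['P3', 'P4']
```

P3 ranks first because it is strongly connected to both query proteins, while P5 is only reachable through P4 and scores lowest.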

Which proteins should we extract as a single, functionally coherent group?

RNA splicing (GO:0008380)

Evaluation: the importance of biological context

RNA splicing: same 5 query genes

Global network: 22 FPs / 27% precision

Context-specific network: 6 FPs / 80% precision
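Since precision = TP / (TP + FP), the two reported figures can be checked against each other. The short calculation below back-solves the number of true positives from the false-positive count and the rounded precision; the resulting totals are an inference from the slide's rounded numbers, not values stated in the talk.

```python
def true_positives(false_positives, precision):
    """Solve precision = TP / (TP + FP) for TP."""
    return precision * false_positives / (1.0 - precision)


for name, fp, precision in [("global", 22, 0.27), ("context-specific", 6, 0.80)]:
    tp = round(true_positives(fp, precision))
    print(name, "TP ~", tp, "recovered ~", tp + fp)
```

Both settings come out at roughly 30 recovered proteins (about 8 correct in the global network versus 24 in the context-specific one), which suggests the comparison was made at the same network size.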

10-protein query; each point: average of 50 trials

Context-specific integration improves 44 of 53 evaluated biological process GO terms by an average of 25%

RNA splicing dataset relevance

(16 of 174 input datasets)

[Figure: bar chart of dataset relevance for the RNA splicing network, axis from 0 to 0.8. Datasets shown:]
• Co-purification (BIND+GRID)
• Affinity capture-MS (GRID+BIND)
• Krogan et al. TAP-MS (2006)
• Gavin et al. TAP-MS (2002)
• Two-hybrid (GRID+BIND)
• Krogan et al. RNA processing Af..
• Uetz et al. Two-hybrid (2000)
• Tong et al. Synthetic lethality
• Ito et al. Two-hybrid (2001)
• Nucleus co-localization
• Brem et al. microarray (2002)
• Hughes et al. microarray (2000)
• Pitkanen et al. microarray (2004)
• Cho et al. microarray (1998)
• Spellman et al. microarray (1998)

[Figure: relevance of the input datasets (axis from 0 to 0.8), RNA splicing network vs. Protein folding network]

A consistent improvement

• Context-specific integration improves 44 of 53 evaluated biological process GO terms by an average of 25%

[Figure axis: Global network (# of recovered proteins)]

10-protein query; each point: average of 50 trials

• How accurately can we recover known network components?

• How much does integration of diverse data help?

Evaluation: measure how often observed data connects functionally related proteins (e.g. shared GO annotations)
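One simple way to score this kind of evaluation is precision among the top-ranked protein pairs, counting a pair as correct when the two proteins share a GO annotation. The sketch below uses hypothetical proteins and annotations (the two GO IDs are the DNA repair and RNA splicing terms mentioned in the slides); `precision_at_k` is an illustrative helper, not the talk's exact metric.

```python
# Hypothetical annotations: protein -> set of GO biological process terms.
ANNOTATIONS = {
    "P1": {"GO:0006281"},
    "P2": {"GO:0006281"},
    "P3": {"GO:0008380"},
    "P4": {"GO:0008380", "GO:0006281"},
}

# Hypothetical protein pairs, already sorted from most to least confident.
RANKED_PAIRS = [("P1", "P2"), ("P3", "P4"), ("P1", "P3"), ("P2", "P4")]


def precision_at_k(ranked_pairs, annotations, k):
    """Fraction of the top-k pairs whose two proteins share a GO term."""
    top = ranked_pairs[:k]
    hits = sum(1 for a, b in top
               if annotations.get(a, set()) & annotations.get(b, set()))
    return hits / len(top)


for k in (2, 3, 4):
    print(k, precision_at_k(RANKED_PAIRS, ANNOTATIONS, k))
```

Sweeping k and plotting precision against the number of recovered same-process pairs gives a curve that can be compared across individual datasets and the integrated network.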

General network recovery evaluation

[Figure axes: # of recovered same-process protein pairs vs. precision]

(8 of 174 input datasets)

Evaluation: what about noise in the query set?

[Figure axis: # of random proteins out of 20 total query proteins]

Biological validation: characterizing unknown genes

Uncharacterized genes: YPL077C, YPL017C, YPL144W

Predicted involvement in chromosome segregation

Biological validation: characterizing unknown genes

[Figure: Wild type, YPL017CΔ, YPL077CΔ, and YPL144WΔ cells, each shown by Differential Interference Contrast, DAPI, and FACS]

Prediction: Chromosome segregation

Using bioPIXIE to form testable hypotheses

• Hsp90 complex
– Heat shock protein (Hsp):
  • present under normal conditions, but highly expressed under stress (e.g. heat shock, oxidative stress, heavy metals)
  • molecular chaperones that refold and translocate denatured proteins to prevent aggregation
– Hsp90 is unique: many of its clients are signaling kinases and hormone receptors
– Targeted by recent cancer drugs (geldanamycin)
– Highly conserved protein (bacteria to humans)
– Two Hsp90 homologs in yeast: Hsc82 and Hsp82

We predicted, and have initial experimental confirmation for, a link between Hsp82/Hsc82 and several co-chaperones with the DNA replication complex (Cdc7/Dbf4)

(illustration by Helmut Pospiech)

DNA replication initiation: Cdc7/Dbf4

Cdc7: “switch” that starts replication (activated by Dbf4)

Hsp90 – DNA replication genetic interactions

[Figure: genetic interaction assays and a diagram linking DNA replication with Hsp90 & co-chaperones (aggravating interactions), including the Cdc7-Cdc37 interaction]

Specific to DNA replication (sensitive to HU, not MMS)

A (possible) bigger picture

[Figure labels: Ras/cAMP; DNA replication initiation]

Hsp90 conclusions:
• We confirm several genetic interactions between yeast Hsp90 proteins and Cdc7/Dbf4
• Hsp90 plays a specific role in DNA replication (HU sensitivity)
• Possible new link between glucose signaling, stress, and DNA replication from expression data

So what?
• Analysis of integrated genomic data can direct generation of testable, non-trivial hypotheses
• Important to integrate data and to take into account biological process
