Proteomics - Analysis and integration of large-scale data sets

Proteomics - Analysis and integration of large-scale data sets, Lars Juhl Jensen, EMBL Heidelberg


Second European School on Bioinformatics, CMBI, Nijmegen, Netherlands, January 22-25, 2005

Transcript of Proteomics - Analysis and integration of large-scale data sets

Page 1: Proteomics - Analysis and integration of large-scale data sets

Proteomics - Analysis and integration of large-scale data sets

Lars Juhl Jensen, EMBL Heidelberg

Page 2: Proteomics - Analysis and integration of large-scale data sets

Lars Juhl Jensen, EMBL Heidelberg

Overview

• Methods for predicting protein-protein interactions
  - Cross-species inference
  - Genomic context methods
  - Prediction from expression data
  - Automated extraction from text

• Quality control of high-throughput interaction data
  - Types of data sets available
  - Network representations of interaction data sets
  - Topology-based quality scores
  - Benchmarking of data sets
  - Filtering strategies

• Prediction of protein features and function
  - Linear motifs in proteins
  - Relation to interaction networks
  - Motif prediction from sequence
  - From features to function

• Qualitative modeling of the yeast cell cycle
  - Modeling the cell cycle through large-scale data integration
  - What does the model tell us?
  - A neural network approach to predicting cell cycle proteins
  - The cell cycle in feature space

Page 3: Proteomics - Analysis and integration of large-scale data sets

Part 1: Methods for predicting protein-protein interactions

Lars Juhl Jensen, EMBL Heidelberg


Page 5: Proteomics - Analysis and integration of large-scale data sets

Cross-species integration of diverse data

• Challenges and promises of large-scale data integration
  - Explosive increase in both the amounts and different types of high-throughput data sets that are being produced
  - These data are highly heterogeneous and lack standardization
  - Most data sets are error-prone and suffer from systematic biases
  - Experiments should be integrated across model organisms

• STRING is a web resource that integrates and transfers diverse large-scale data across 100+ species, but it is not
  - a primary repository for experimental data
  - a curated database of complexes or pathways
  - a substitute for expert annotation

Page 6: Proteomics - Analysis and integration of large-scale data sets


What is STRING?

Genomic neighborhood

Species co-occurrence

Gene fusions

Database imports

Exp. interaction data

Microarray expression data

Literature co-mentioning

Page 7: Proteomics - Analysis and integration of large-scale data sets


Genomic context methods

© Nature Biotechnology, 2004

Page 8: Proteomics - Analysis and integration of large-scale data sets

Inferring functional modules from gene presence/absence patterns


Page 11: Proteomics - Analysis and integration of large-scale data sets

Inferring functional modules from gene presence/absence patterns

The “Cellulosome”

[Figure labels: resting protuberances, protracted protuberance, cell, cell wall, anchoring proteins, cellulosomes, cellulose]

© Trends Microbiol, 1999

Page 12: Proteomics - Analysis and integration of large-scale data sets

Formalizing the phylogenetic profile method

• Align all proteins against all

• Calculate best-hit profile

• Join similar species by PCA

• Calculate PC profile distances

• Calibrate against KEGG maps
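The core of the profile comparison can be illustrated with a minimal sketch. The gene names, species names, and best-hit data below are hypothetical, the PCA step that joins similar species is left out, and a plain Hamming distance stands in for the PC profile distance; this is not the STRING implementation.

```python
from itertools import combinations

# Hypothetical toy data: gene -> set of species in which a best hit was found.
best_hits = {
    "geneA": {"sp1", "sp2", "sp4"},
    "geneB": {"sp1", "sp2", "sp4"},
    "geneC": {"sp3"},
    "geneD": {"sp1", "sp2", "sp3", "sp4"},
}
species = ["sp1", "sp2", "sp3", "sp4"]


def profile(gene):
    """Binary presence/absence vector over the species list."""
    return [1 if s in best_hits[gene] else 0 for s in species]


def profile_distance(a, b):
    """Hamming distance: number of species in which the two profiles differ."""
    return sum(x != y for x, y in zip(profile(a), profile(b)))


# Rank gene pairs from most similar to least similar profile.
pairs = sorted(combinations(sorted(best_hits), 2),
               key=lambda pair: profile_distance(*pair))
```

Gene pairs at the top of `pairs` share a presence/absence pattern and are therefore candidates for belonging to the same functional module.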

Page 13: Proteomics - Analysis and integration of large-scale data sets

Inferring functional associations from evolutionarily conserved operons

• Identify runs of adjacent genes with the same direction

• Score each gene pair based on intergenic distances

• Calibrate against KEGG maps

• Infer associations in other species
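The first two steps can be sketched as follows. The gene records are made-up toy data, and the sketch stops at the raw intergenic distances; how distances are turned into scores and calibrated is not specified on the slide.

```python
# Hypothetical gene records: (name, start, end, strand), sorted by start.
genes = [
    ("g1", 100, 900, "+"),
    ("g2", 950, 1800, "+"),
    ("g3", 1900, 2500, "+"),
    ("g4", 3000, 3900, "-"),
    ("g5", 3950, 4800, "-"),
    ("g6", 6000, 7000, "+"),
]


def same_strand_runs(genes):
    """Split an ordered gene list into runs of adjacent genes on one strand."""
    runs, current = [], [genes[0]]
    for gene in genes[1:]:
        if gene[3] == current[-1][3]:
            current.append(gene)
        else:
            runs.append(current)
            current = [gene]
    runs.append(current)
    return runs


def adjacent_pair_gaps(run):
    """Intergenic distance (bp) for each neighboring gene pair in a run."""
    return {(a[0], b[0]): b[1] - a[2] for a, b in zip(run, run[1:])}


runs = same_strand_runs(genes)
gaps = {}
for run in runs:
    gaps.update(adjacent_pair_gaps(run))
```

Pairs with short gaps inside the same run are the most operon-like; pairs that straddle a strand change (such as g3 and g4 here) are never scored.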

Page 14: Proteomics - Analysis and integration of large-scale data sets

Predicting functional and physical interactions from gene fusion/fission events

• Calculate all-against-all pairwise alignments

• Find genes in A that match the same gene in B

• Exclude overlapping alignments

• Calibrate against KEGG maps
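A rough sketch of the fusion search, assuming the alignments are already available as hits with coordinates on the gene in species B. The hit tuples and gene names are hypothetical, and the overlap test is the simplest possible one.

```python
from itertools import combinations

# Hypothetical alignment hits: (gene_in_A, gene_in_B, start_on_B, end_on_B).
hits = [
    ("a1", "b1", 1, 200),
    ("a2", "b1", 210, 450),
    ("a3", "b1", 150, 300),
    ("a4", "b2", 1, 100),
]


def overlaps(h1, h2):
    """True if two hits cover overlapping regions of the B gene."""
    return h1[2] <= h2[3] and h2[2] <= h1[3]


def fusion_candidates(hits):
    """Pairs of A genes that match non-overlapping parts of one B gene."""
    by_target = {}
    for hit in hits:
        by_target.setdefault(hit[1], []).append(hit)
    found = []
    for target, group in by_target.items():
        for h1, h2 in combinations(group, 2):
            if h1[0] != h2[0] and not overlaps(h1, h2):
                found.append((h1[0], h2[0], target))
    return found


candidates = fusion_candidates(hits)
```

Here a1 and a2 align to separate halves of b1 and are reported, while a3 is excluded because its alignment overlaps both.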

Page 15: Proteomics - Analysis and integration of large-scale data sets

Integrating physical interaction screens

• Make binary representation of complexes; calculate score from number of (co-)occurrences

• Yeast two-hybrid data sets are inherently binary; calculate score from non-shared partners

• Calibrate against KEGG maps

• Infer associations in other species

• Combine evidence from experiments

Page 16: Proteomics - Analysis and integration of large-scale data sets

Mining microarray expression databases

• Re-normalize arrays by modern method to remove biases

• Build expression matrix

• Combine similar arrays by PCA

• Construct predictor by Gaussian kernel density estimation

• Calibrate against KEGG maps

• Infer associations in other species
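How a kernel density predictor might work can be sketched under strong assumptions: the input is one expression correlation per gene pair, the labeled example values are invented, and the bandwidth and prior are arbitrary. The slide does not say how the actual predictor was built.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function built from Gaussian kernels on the samples."""
    norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                   for s in samples) / norm

    return density


# Hypothetical expression correlations for labeled gene pairs.
same_pathway = [0.70, 0.80, 0.65, 0.90, 0.75]
different_pathway = [0.00, 0.10, -0.20, 0.20, -0.10, 0.05, 0.30]


def association_probability(r, prior=0.5, bandwidth=0.1):
    """Posterior probability that a pair with correlation r is associated."""
    positive = gaussian_kde(same_pathway, bandwidth)(r) * prior
    negative = gaussian_kde(different_pathway, bandwidth)(r) * (1 - prior)
    return positive / (positive + negative)
```

A correlation of 0.8 falls where only same-pathway pairs were observed and scores close to 1, whereas a correlation of 0.0 scores close to 0.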

Page 17: Proteomics - Analysis and integration of large-scale data sets


The Qspline method for non-linear intensity normalization of expression data

• From the empirical distribution, a number of quantiles are calculated for each of the channels to be normalized (one channel shown in red) and for the reference distribution (shown in black)

• A QQ-plot is made and a normalization curve is constructed by fitting a cubic spline function

• As reference one can either use an artificial “median array” for a set of arrays or use a log-normal distribution, which is a good approximation
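The quantile-matching idea can be sketched with the standard library alone. Note the substitution: this sketch connects the quantile pairs by piecewise-linear interpolation rather than the cubic spline that Qspline fits, and the channel and reference intensities are synthetic.

```python
import statistics
from bisect import bisect_right


def quantile_points(values, n=10):
    """Return the n-1 cut points that divide the data into n quantile groups."""
    return statistics.quantiles(values, n=n, method="inclusive")


def normalization_curve(channel, reference, n=10):
    """Map channel intensities onto the reference via matched quantiles."""
    xs = quantile_points(channel, n)
    ys = quantile_points(reference, n)

    def normalize(value):
        # Pick the quantile segment containing the value; the end segments
        # are extended linearly for values outside the quantile range.
        i = min(max(bisect_right(xs, value) - 1, 0), len(xs) - 2)
        slope = (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
        return ys[i] + slope * (value - xs[i])

    return normalize


# Synthetic example: the channel is a distorted copy of the reference.
reference = [float(v) for v in range(1, 51)]
channel = [2.0 * v + 1.0 for v in reference]
normalize = normalization_curve(channel, reference)
```

Applying `normalize` to a channel intensity returns the reference intensity at the same quantile, which removes the intensity-dependent bias in this toy case.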

Page 18: Proteomics - Analysis and integration of large-scale data sets

Non-linear normalization of intensities and correction for spatial effects

[Figure panels: downloaded SMD data; after intensity normalization; spatial bias estimate; after spatial normalization]

Page 19: Proteomics - Analysis and integration of large-scale data sets

Co-mentioning in the scientific literature

• Associate abstracts with species

• Identify gene names in title/abstract

• Count (co-)occurrences of genes

• Test significance of associations

• Calibrate against KEGG maps

• Infer associations in other species
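The counting and significance steps can be illustrated with a small sketch. The abstracts are reduced to made-up sets of gene names, and a one-sided hypergeometric test is assumed as the significance test; the slide does not name the test that was actually used.

```python
from math import comb

# Hypothetical toy corpus: each abstract reduced to the gene names it mentions.
abstracts = [
    {"CDC28", "CLN2"}, {"CDC28", "CLN2"}, {"CDC28"}, {"ACT1"},
    {"ACT1", "TUB1"}, {"CLN2", "CDC28"}, {"TUB1"}, {"ACT1"},
]


def mention_counts(gene_a, gene_b):
    """Return (abstracts, mentions of a, mentions of b, co-mentions)."""
    n_a = sum(gene_a in genes for genes in abstracts)
    n_b = sum(gene_b in genes for genes in abstracts)
    n_ab = sum(gene_a in genes and gene_b in genes for genes in abstracts)
    return len(abstracts), n_a, n_b, n_ab


def cooccurrence_pvalue(n, n_a, n_b, n_ab):
    """Chance of at least n_ab co-mentions if mentions were independent."""
    tail = sum(comb(n_a, k) * comb(n - n_a, n_b - k)
               for k in range(n_ab, min(n_a, n_b) + 1))
    return tail / comb(n, n_b)


p_value = cooccurrence_pvalue(*mention_counts("CDC28", "CLN2"))
```

In this corpus CLN2 appears in three abstracts and all three also mention CDC28, giving a p-value of 4/56.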

Page 20: Proteomics - Analysis and integration of large-scale data sets

Evidence transfer based on “fuzzy orthology”

[Figure: evidence transfer from source species to target species]

• Orthology transfer is tricky
  - Correct assignment of orthology is difficult for distant species
  - Functional equivalence cannot be guaranteed for in-paralogs

• These problems are addressed by our “fuzzy orthology” scheme
  - Confidence scores for functional equivalence are calculated from all-against-all alignment
  - Evidence is distributed across possible pairs according to confidence scores in the case of many-to-many relationships
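One way to picture the distribution step is the sketch below. It assumes that the evidence score is split in proportion to the product of the normalized confidence scores of the two candidate orthologs; that weighting rule, the function name, and the example scores are illustrative assumptions, not the scheme's published definition.

```python
def transfer_evidence(score, candidates_a, candidates_b):
    """Split one source-species score across candidate target-species pairs.

    candidates_a and candidates_b map target-species proteins to the
    confidence that they are functionally equivalent to the two source
    proteins (hypothetical values).
    """
    total_a = sum(candidates_a.values())
    total_b = sum(candidates_b.values())
    return {
        (a, b): score * (conf_a / total_a) * (conf_b / total_b)
        for a, conf_a in candidates_a.items()
        for b, conf_b in candidates_b.items()
    }


# One source protein has two possible orthologs, the other has one.
transferred = transfer_evidence(0.8, {"t1": 0.9, "t2": 0.1}, {"t3": 1.0})
```

The well-supported pair (t1, t3) receives most of the evidence, and the total transferred evidence never exceeds the original score.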

Page 21: Proteomics - Analysis and integration of large-scale data sets

The power of cross-species transfer and evidence integration


Page 27: Proteomics - Analysis and integration of large-scale data sets

Conclusions

• Many types of data can be used for interaction prediction

• To make the best of these data, they must each be benchmarked and integrated across species

• The STRING web resource does just this

Page 28: Proteomics - Analysis and integration of large-scale data sets

Questions?

Page 29: Proteomics - Analysis and integration of large-scale data sets

Part 2: Quality control of high-throughput interaction data

Lars Juhl Jensen, EMBL Heidelberg


Page 31: Proteomics - Analysis and integration of large-scale data sets

Protein interaction data sets

• Many high-throughput data sets have been published in the past 5 years
  - S. cerevisiae is by far the best covered organism
  - Recently, large data sets were made for two metazoans

• Two fundamentally different techniques have been used
  - Affinity purification/MS
  - The yeast two-hybrid assay

• Interaction databases
  - IntAct, BIND, DIP, MINT
  - Species-specific databases

© Current Opinion in Structural Biology, 2004

Page 32: Proteomics - Analysis and integration of large-scale data sets

The topology of protein interaction networks

• A multitude of publications exist on protein network topology

• Global measures of topology
  - Degree distribution
  - Mean path length
  - Clustering coefficient

• Theoretical models of networks
  - Random
  - Scale-free
  - Hierarchical

• Local topology, network motifs

Page 33: Proteomics - Analysis and integration of large-scale data sets

What is an interaction?

• Physical protein interactions
  - Proteins that physically touch each other within a complex
  - Members of the same stable complex
  - Transient interactions, e.g. a protein kinase with its substrate

• More broadly defined “functional interactions”
  - Direct neighbors in metabolic networks
  - Members of the same pathway

• The pragmatic definition – whatever the assays detect
  - Affinity purification tends to find members of stable complexes
  - Yeast two-hybrid assays also detect more transient interactions

Page 34: Proteomics - Analysis and integration of large-scale data sets


Binary representations of purification data

© Drug Discovery Today: TARGETS, 2004

Page 35: Proteomics - Analysis and integration of large-scale data sets


Topology based quality scores

• Scoring scheme for yeast two-hybrid data: S1 = -log((N1+1)·(N2+1))

N1 and N2 are the numbers of non-shared interaction partners

Similar scoring schemes have been published by Saito et al.

• Scoring scheme for complex pull-down data: S2 = log[(N12·N)/((N1+1)·(N2+1))]

N12 is the number of purifications containing both proteins

N1 is the number containing protein 1, N2 is defined similarly

N is the total number of purifications

• Both schemes aim at identifying ubiquitous interactors
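The two formulas translate directly into code. The natural logarithm is an assumption (the slide does not give the base), and the function names are made up for this sketch.

```python
import math


def y2h_score(n1, n2):
    """S1 = -log((N1+1)*(N2+1)), with N1, N2 the non-shared partner counts."""
    return -math.log((n1 + 1) * (n2 + 1))


def pulldown_score(n12, n1, n2, n):
    """S2 = log[(N12*N)/((N1+1)*(N2+1))] for complex pull-down data.

    n12: purifications containing both proteins (must be at least 1),
    n1, n2: purifications containing protein 1 and protein 2,
    n: total number of purifications.
    """
    return math.log((n12 * n) / ((n1 + 1) * (n2 + 1)))
```

For example, `y2h_score(0, 1)` is higher than `y2h_score(10, 20)`: an interaction between two proteins with many other partners is penalized, which is how both schemes down-weight ubiquitous interactors.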

Page 36: Proteomics - Analysis and integration of large-scale data sets

Calibration of quality scores and combination of evidence

• Different pieces of evidence are not directly comparable
  - A different raw quality score is used for each evidence type
  - Quality differences exist among data sets of the same type

• Solved by calibrating all scores against a common reference
  - The accuracy relative to a “gold standard” is calculated within score intervals
  - The resulting points are approximated by a sigmoid
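The calibration idea can be sketched as follows: compute the accuracy within consecutive score intervals, then approximate the points with a sigmoid. The scored pairs are invented, the bins hold a fixed number of pairs, and the coarse grid search is only a stand-in for a proper curve fit.

```python
import math


def binned_accuracy(scored_pairs, bin_size):
    """Sort (raw score, is_true) pairs by score and summarize consecutive bins."""
    ordered = sorted(scored_pairs)
    points = []
    for i in range(0, len(ordered), bin_size):
        chunk = ordered[i:i + bin_size]
        mean_score = sum(s for s, _ in chunk) / len(chunk)
        accuracy = sum(1 for _, ok in chunk if ok) / len(chunk)
        points.append((mean_score, accuracy))
    return points


def sigmoid(score, midpoint, slope):
    return 1.0 / (1.0 + math.exp(-slope * (score - midpoint)))


def fit_sigmoid(points):
    """Crude grid search for the sigmoid with the smallest squared error."""
    grid = [(m / 20, s) for m in range(21) for s in range(1, 21)]
    return min(grid, key=lambda p: sum((sigmoid(x, *p) - y) ** 2
                                       for x, y in points))


# Hypothetical raw scores with gold-standard labels (True = in reference set).
scored = [(0.1, False), (0.2, False), (0.3, False), (0.4, True),
          (0.6, False), (0.7, True), (0.8, True), (0.9, True)]
points = binned_accuracy(scored, bin_size=4)
midpoint, slope = fit_sigmoid(points)
```

Once fitted, `sigmoid(raw_score, midpoint, slope)` converts any raw score of that evidence type into an estimated accuracy, which puts different evidence types on a common scale.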

Page 37: Proteomics - Analysis and integration of large-scale data sets

Benchmarks for protein interaction sets

• To benchmark interaction sets, one needs a reference set

• Several options exist
  - Directly compare with a curated set of protein complexes from e.g. MIPS
  - Check consistency with metabolic pathways from e.g. KEGG
  - Check consistency with GO biological process or cellular component categories
  - Look for co-expression of genes within complexes

© Current Opinion in Structural Biology, 2004

Page 38: Proteomics - Analysis and integration of large-scale data sets


Benchmark of published interaction sets against the MIPS curated yeast complexes

• Data sets were filtered to remove the most obvious biases by removing ribosomal proteins and interactions obtained from MIPS

• High specificity is often obtained at the price of low coverage

Page 39: Proteomics - Analysis and integration of large-scale data sets

Filtering by subcellular localization

• Proteins cannot interact if they are not in the same place
  - Large-scale subcellular localization screens have been made in yeast
  - A matrix can be constructed that describes the compartments between which interactions should be allowed
  - Two proteins cannot interact if no combination of observed subcellular compartments allows for interaction
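The localization filter can be sketched as below. The compartment names, the allowed compartment pairs, and the protein annotations are all hypothetical, and the choice to keep interactions for proteins without localization data is an assumption of this sketch.

```python
# Hypothetical symmetric "matrix" of compartment pairs that may interact.
ALLOWED = {
    frozenset(["nucleus"]),
    frozenset(["cytoplasm"]),
    frozenset(["nucleus", "cytoplasm"]),
    frozenset(["mitochondrion"]),
}

# Hypothetical observed localizations per protein.
localization = {
    "P1": {"nucleus"},
    "P2": {"cytoplasm", "nucleus"},
    "P3": {"mitochondrion"},
}


def can_interact(a, b):
    """False only if no observed compartment combination is allowed."""
    comps_a, comps_b = localization.get(a), localization.get(b)
    if not comps_a or not comps_b:
        return True  # assumption: no localization data means no filtering
    return any(frozenset([ca, cb]) in ALLOWED
               for ca in comps_a for cb in comps_b)


interactions = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P1", "P9")]
filtered = [pair for pair in interactions if can_interact(*pair)]
```

Only the pairs involving the mitochondrial protein P3 are removed, since no allowed compartment combination connects it to the nuclear or cytoplasmic proteins.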

Page 40: Proteomics - Analysis and integration of large-scale data sets

Restricting the network to a “system”

• Why do large-scale interaction data have high error rates?
  - In a systematic screen we test the hypothesis that any protein interacts with any other protein in the cell
  - The vast majority of these possible interactions do not take place
  - By subsequently limiting the “interaction search space” to only the system of interest, the error rate can be reduced to that of small-scale experiments!

• A simple strategy for making a network of a “system”
  - Define an initial parts list of proteins that should be in the system
  - Use “high confidence” interactions to pull in additional proteins
  - Show all “medium confidence” interactions within the system
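The three-step strategy can be sketched as follows. The confidence thresholds (0.9 and 0.5), the protein names, and the single round of expansion are assumptions made for illustration.

```python
def build_system_network(parts, interactions, high=0.9, medium=0.5):
    """Expand a parts list once, then keep medium-confidence edges inside it.

    interactions: iterable of (protein_a, protein_b, confidence).
    """
    members = set(parts)
    # Step 2: high-confidence partners of the initial parts list are pulled in.
    for a, b, confidence in interactions:
        if confidence >= high:
            if a in parts:
                members.add(b)
            if b in parts:
                members.add(a)
    # Step 3: show every interaction of at least medium confidence
    # between two members of the system.
    edges = [(a, b, c) for a, b, c in interactions
             if c >= medium and a in members and b in members]
    return members, edges


# Hypothetical scored interactions.
interactions = [
    ("A", "B", 0.95), ("B", "C", 0.60), ("C", "D", 0.95),
    ("A", "E", 0.40), ("D", "F", 0.70),
]
members, edges = build_system_network({"A", "C"}, interactions)
```

Starting from the parts list {A, C}, the proteins B and D are pulled in, and the medium-confidence B-C interaction becomes visible only because both ends are inside the system.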

Page 41: Proteomics - Analysis and integration of large-scale data sets

Can the type of interaction be predicted by combining different evidence types?

• Different types of experimental evidence tell us something different

• Correct Y2H interactions that are missed by complex purification methods generally correspond to transient interactions

Page 42: Proteomics - Analysis and integration of large-scale data sets


Conclusions

• When dealing with high-throughput experimental data, it is crucial to do proper benchmarking

• Globally, the error rates are generally very high

• A very large part of the errors can be filtered away by computational methods, allowing high confidence data sets to be constructed

Page 43: Proteomics - Analysis and integration of large-scale data sets

Questions?

Page 44: Proteomics - Analysis and integration of large-scale data sets

Part 3: Prediction of protein features and function

Lars Juhl Jensen, EMBL Heidelberg


Page 46: Proteomics - Analysis and integration of large-scale data sets

Proteins – more than just globular domains

• Eukaryotic linear motifs (ELMs)
  - Ligand peptides
  - Modification sites
  - Targeting signals

• Disordered regions

• Transmembrane helices

Toby Gibson, EMBL Heidelberg

Insulin Receptor Substrate 1

Page 47: Proteomics - Analysis and integration of large-scale data sets

Most ELMs are “information poor”

Pattern               Function
L.C.E                 RB interaction
[RK].{0,1}V.F         PP1 interaction
R.L.{0,1}[FLIMVP]     Cyclin binding motif
SP.[KR]               CDK phosphorylation
L..LL                 NR Box
P.L.P                 MYND finger interaction
F...W..[LIV]          MDM2-binding
RGD                   Integrin-binding
SKL$                  Peroxisome targeting
[RK][RK].[ST]         PKA phosphorylation

• Weak/short consensus sequences for ELMs
  - The typical ELM only has three conserved residues
  - Some variance is often allowed even for these

• ELMs are very hard to predict from sequence
  - Consensus sequences simply match everywhere
  - The information is not in the local sequence
  - Most ELMs can only be predicted using context

Toby Gibson, EMBL Heidelberg
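Because the consensus patterns above are regular expressions, they can be scanned with a few lines of code. The test sequence is invented for this sketch; the patterns are copied from the slide.

```python
import re

# Consensus patterns from the slide, written as Python regular expressions.
ELM_PATTERNS = {
    "RB interaction": r"L.C.E",
    "PP1 interaction": r"[RK].{0,1}V.F",
    "Cyclin binding motif": r"R.L.{0,1}[FLIMVP]",
    "CDK phosphorylation": r"SP.[KR]",
    "NR Box": r"L..LL",
    "MYND finger interaction": r"P.L.P",
    "MDM2-binding": r"F...W..[LIV]",
    "Integrin-binding": r"RGD",
    "Peroxisome targeting": r"SKL$",
    "PKA phosphorylation": r"[RK][RK].[ST]",
}


def find_motifs(sequence):
    """Return (motif name, start position, matched peptide) for every hit."""
    hits = []
    for name, pattern in ELM_PATTERNS.items():
        for match in re.finditer(pattern, sequence):
            hits.append((name, match.start(), match.group()))
    return hits


# Hypothetical 18-residue sequence, not a real protein.
hits = find_motifs("MARGDLTCHEASPAKSKL")
```

Even this short made-up sequence matches four of the ten patterns, which illustrates why consensus matching alone over-predicts and why context is needed.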

Page 48: Proteomics - Analysis and integration of large-scale data sets

Prediction of protein disorder/globularity

• Using known domains
  - SMART
  - Pfam
  - InterPro

• Ab initio from sequence
  - GlobPlot
  - DisEMBL
  - PONDR

[Figure labels: known domains, order preference, disorder preference]

Toby Gibson, EMBL Heidelberg

Page 49: Proteomics - Analysis and integration of large-scale data sets

Prediction of signal peptides from sequence

• Signal peptides play different roles
  - They mediate transport of proteins to the ER in eukaryotes
  - They target proteins for secretion in prokaryotes

• The architecture of signal peptides
  - Positively charged N-terminus
  - Hydrophobic core
  - Short, more polar region
  - Cleavage site with small amino acids at positions -3 and -1

• Signal peptides can be accurately predicted by several methods

Henrik Nielsen, CBS, DTU Lyngby

Page 50: Proteomics - Analysis and integration of large-scale data sets

Function prediction from post-translational modifications

• Proteins with similar function may not be related in sequence

• Still, they must perform their function in the context of the same cellular machinery

• Similarities in features such as PTMs and physical/chemical properties could be expected for proteins with similar function

Henrik Nielsen, CBS, DTU Lyngby

Page 51: Proteomics - Analysis and integration of large-scale data sets


The concept of ProtFun

• Predict as many biologically relevant features as we can from the sequence

• Train artificial neural networks for each category, also optimizing the feature combinations

• Assign a probability for each category from the NN outputs

© Journal of Molecular Biology, 2002

Page 52: Proteomics - Analysis and integration of large-scale data sets

Training of neural networks

• Human protein sequences from SWISS-PROT were assigned to functional classes based on their keywords by using the EUCLID dictionary

• The set of sequences was divided into a test and a training set with no significant sequence similarity between the two sets

• Neural networks were first trained for single features and subsequently for combinations of the best performing features

Page 53: Proteomics - Analysis and integration of large-scale data sets


Prediction performance on cellular role categories

© Journal of Molecular Biology, 2002

Page 54: Proteomics - Analysis and integration of large-scale data sets

© Journal of Molecular Biology, 2002

Page 55: Proteomics - Analysis and integration of large-scale data sets

An example – 1AOZ vs. 1PLC

Scoring matrix: BLOSUM50, gap penalties: -12/-2
15.5% identity; global alignment score: -23

1AOZ SQIRHYKWEVEYMFWAPNCNENIVMGINGQFPGPTIRANAGDSVVVELTNKLHTEGVVIH
1PLC ---------IDVLLGA---DDGSLAFVPSEFS-----ISPGEKIVFK-NNAGFPHNIVFD

1AOZ WHGILQRGTPWADGTASISQCAINPGETFFYNFTVDNPGTFFYHGHLGMQRSAGLYGSLI
1PLC EDSI-PSGVDASKISMSEEDLLNAKGETFEVALSNKGEYSFYCSPHQG----AGMVGKVT

1AOZ VDPPQGKKE
1PLC VN-------

Page 56: Proteomics - Analysis and integration of large-scale data sets


An enzyme and a non-enzyme from the Cupredoxin superfamily

Page 57: Proteomics - Analysis and integration of large-scale data sets

Similar structure, different functions

# Functional category               1AOZ    1PLC
Amino_acid_biosynthesis             0.126   0.070
Biosynthesis_of_cofactors           0.100   0.075
Cell_envelope                       0.429   0.032
Cellular_processes                  0.057   0.059
Central_intermediary_metabolism     0.063   0.041
Energy_metabolism                   0.126   0.268
Fatty_acid_metabolism               0.027   0.072
Purines_and_pyrimidines             0.439   0.088
Regulatory_functions                0.102   0.019
Replication_and_transcription       0.052   0.089
Translation                         0.079   0.150
Transport_and_binding               0.032   0.052

# Enzyme/nonenzyme                  1AOZ    1PLC
Enzyme                              0.773   0.310
Nonenzyme                           0.227   0.690

# Enzyme class                      1AOZ    1PLC
Oxidoreductase (EC 1.-.-.-)         0.077   0.077
Transferase (EC 2.-.-.-)            0.260   0.099
Hydrolase (EC 3.-.-.-)              0.114   0.071
Lyase (EC 4.-.-.-)                  0.025   0.020
Isomerase (EC 5.-.-.-)              0.010   0.068
Ligase (EC 6.-.-.-)                 0.017   0.017

• Many examples exist of structurally similar proteins which have different functions

• Two PDB structures from the Cupredoxin superfamily were shown
  - 1AOZ is an enzyme
  - 1PLC is not an enzyme

• Despite their structural similarity, our method predicts both correctly

Page 58: Proteomics - Analysis and integration of large-scale data sets

Conclusions

• Short linear motifs are likely as important for protein function as the large, well-studied domains

• These features are generally very hard to predict from sequence; however, some can be predicted

• Many functional classes of proteins can be predicted from sequence alone by non-homology based methods

Page 59: Proteomics - Analysis and integration of large-scale data sets

Questions?

Page 60: Proteomics - Analysis and integration of large-scale data sets

Part 4: Qualitative modeling of the yeast cell cycle

Lars Juhl Jensen, EMBL Heidelberg


Page 62: Proteomics - Analysis and integration of large-scale data sets

Qualitative versus quantitative modeling

• Our aim: a qualitative model of the yeast cell cycle that
  - is accurate even at the level of individual interactions
  - provides a global overview of temporal complex formation

© Chen et al., Mol. Biol. Cell, 2004

Ulrik de Lichtenberg, CBS, DTU Lyngby

Page 63: Proteomics - Analysis and integration of large-scale data sets

Model generation through data integration

A parts list
• Literature
• Microarray data

Dynamic data
• Microarray data
• Proteomics data

Connections
• PPI data
• TF-target data

YER001W YBR088C YOL007C YPL127C YNR009W YDR224C YDL003W YBL003C YDR225W YBR010W YKR013W …

YDR097C YBR089W YBR054W YMR215W YBR071W YBL002W YGR189C YNL031C YNL030W YNL283C YGR152C …

Ulrik de Lichtenberg, CBS, DTU Lyngby

Page 64: Proteomics - Analysis and integration of large-scale data sets

Getting the parts list

Cho et al. & Spellman et al.

Yeast culture, microarrays, gene expression, expression profile

New analysis

The parts list: 600 periodically expressed genes (with associated peak times) that encode “dynamic proteins”

Ulrik de Lichtenberg, CBS, DTU Lyngby

Page 65: Proteomics - Analysis and integration of large-scale data sets

Observation: For two thirds of the dynamic proteins, no interactions were found. Why?

• Some may be missed components of the complexes and modules already in the network

• Some may not participate in protein-protein interactions

• But, the majority probably participate in transient interactions that are not so well captured by current interaction assays

The temporal interaction network

Ulrik de Lichtenberg, CBS, DTU Lyngby © Science, 2005

Page 66: Proteomics - Analysis and integration of large-scale data sets

Observation: Interacting dynamic proteins are typically expressed close in time

Interactions are close in time

Ulrik de Lichtenberg, CBS, DTU Lyngby © Science, 2005

Page 67: Proteomics - Analysis and integration of large-scale data sets


Observation: Static (scaffold) proteins comprise about a third of the network and participate in interactions throughout the entire cycle

Static proteins play a major role

Ulrik de Lichtenberg, CBS, DTU Lyngby © Science, 2005

Page 68: Proteomics - Analysis and integration of large-scale data sets


Observation: The dynamic proteins are generally expressed just before they are needed to carry out their function, generally referred to as just-in-time synthesis

But, the general design principle seems to be that only some key components of each module/complex are dynamic

This suggests a mechanism of just-in-time assembly or partial just-in-time synthesis

Just-in-time synthesis? yes and no!

Ulrik de Lichtenberg, CBS, DTU Lyngby © Science, 2005

Page 69: Proteomics - Analysis and integration of large-scale data sets

Observation: The network places 30+ uncharacterized proteins in a temporal interaction context.

The network thus generates detailed hypotheses about their function.

Observation: The network contains entire novel modules and complexes.

Network as a discovery tool

Ulrik de Lichtenberg, CBS, DTU Lyngby © Science, 2005

Page 70: Proteomics - Analysis and integration of large-scale data sets


Network Hubs: “Party” versus “Date”

“Date” Hub: the hub protein interacts with different proteins at different times.

“Party” Hub: the hub protein and its interactors are expressed close in time.
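The distinction can be made concrete with a small sketch. It assumes peak times expressed as a percentage of the cell cycle and an arbitrary 15% cut-off for “close in time”; the protein names and peak times are hypothetical, and static proteins are simply those without a peak time.

```python
def cycle_distance(t1, t2):
    """Shortest distance between two time points on a 0-100% cell cycle."""
    d = abs(t1 - t2) % 100
    return min(d, 100 - d)


def classify_hub(hub, partners, peak_time, max_distance=15):
    """'party' if the hub and its dynamic partners all peak close in time."""
    timed = [p for p in [hub, *partners] if p in peak_time]
    close = all(cycle_distance(peak_time[a], peak_time[b]) <= max_distance
                for a in timed for b in timed)
    return "party" if close else "date"


# Hypothetical peak times (% of the cell cycle); H2 is static and has none.
peak_time = {"H1": 20, "P1": 25, "P2": 15, "P3": 95, "P4": 50}

label_h1 = classify_hub("H1", ["P1", "P2"], peak_time)
label_h2 = classify_hub("H2", ["P1", "P3", "P4"], peak_time)
```

The distance is circular, so proteins peaking at 95% and 5% of the cycle are only 10% apart.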

Ulrik de Lichtenberg, CBS, DTU Lyngby © Science, 2005

Page 71: Proteomics - Analysis and integration of large-scale data sets


Transcription is linked to phosphorylation

Observation: 332 putative targets of the cyclin-dependent kinase Cdc28 have been determined experimentally (Übersax et al.). We find that:

• 6% of all yeast proteins are putative Cdk targets

• 8% of the static proteins (white) are putative Cdk targets

• 27% of the dynamic proteins (colored) are putative Cdk targets

• Conclusion: this reveals a hitherto undescribed link between the levels of transcriptional and post-translational control of the cell cycle

Ulrik de Lichtenberg, CBS, DTU Lyngby © Science, 2005

Page 72: Proteomics - Analysis and integration of large-scale data sets

A neural network strategy for prediction of cell cycle related proteins

Ulrik de Lichtenberg, CBS, DTU Lyngby

Page 73: Proteomics - Analysis and integration of large-scale data sets

Prediction of cell cycle related proteins from sequence derived features

Ulrik de Lichtenberg, CBS, DTU Lyngby © Journal of Molecular Biology, 2003

Page 74: Proteomics - Analysis and integration of large-scale data sets

Evaluating the performance

[Figure: performance of Networks A to E and the ensemble; axes: sensitivity, rate of false positives, Matthews correlation coefficient]

Ulrik de Lichtenberg, CBS, DTU Lyngby
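The three performance measures named on the plot axes can be computed from the counts of a confusion matrix, as in this short sketch; the function names and example counts are chosen for illustration.

```python
import math


def sensitivity(tp, fn):
    """Fraction of true cell cycle proteins that are predicted as such."""
    return tp / (tp + fn)


def false_positive_rate(fp, tn):
    """Fraction of negative examples that are wrongly predicted positive."""
    return fp / (fp + tn)


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when it is undefined."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

A perfect predictor reaches a coefficient of 1.0, while a predictor that is right exactly as often as it is wrong in every class, for example `matthews_corrcoef(5, 5, 5, 5)`, scores 0.0.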

Page 75: Proteomics - Analysis and integration of large-scale data sets

ORF       ANN    F-score   Intensity   Protein function
YIL169C   0.98   2.8       176         Protein of unknown function
YNL322C   0.98   1.7       870         Cell wall protein needed for cell wall beta-1,6-glucan assembly
YJL078C   0.98   5.5       86          Protein that may have a role in mating efficiency
YDL038C   0.98   5.3       165         Protein of unknown function
YOL155C   0.97   3.0       391         Protein with similarity to glucan 1,4-alpha-glucosidase
YJR151C   0.97   1.3       251         Member of the seripauperin (PAU) family
YLR286C   0.97   9.3       520         Endochitinase
YOL030W   0.97   4.1       817         Protein with similarity to Gas1p
YOR220W   0.97   2.5       340         Protein of unknown function
YNR044W   0.97   6.5       172         Anchor subunit of a-agglutinin
YGR023W   0.97   1.8       129         Signal transduction of cell wall stress during morphogenesis
YDL016C   0.97   0.8       338         Protein of unknown function
YDL152W   0.97   1.0       156         Protein of unknown function
YPR136C   0.97   1.1       76          Protein of unknown function
YGR115C   0.97   1.0       71          Protein of unknown function, questionable ORF
YMR317W   0.97   2.1       260         Protein of unknown function
YCR089W   0.97   3.4       104         Protein involved in mating induction
YLR194C   0.96   5.4       1870        Protein of unknown function
YIL011W   0.96   2.6       565         Member of the seripauperin (PAU) family
YGR161C   0.96   2.4       190         Protein of unknown function
YBR067C   0.96   5.9       825         Cold- and heat-shock induced mannoprotein of the cell wall
YNL228W   0.96   1.9       250         Protein of unknown function; questionable ORF
YNL327W   0.96   8.7       1320        Cell-cycle regulation protein involved in cell separation
YLR332W   0.96   1.5       642         Putative sensor for cell wall integrity signaling during growth
YNR067C   0.96   6.3       222         Protein with similarity to endo-1,3-beta-glucanase

Ulrik de Lichtenberg, CBS, DTU Lyngby
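A ranked list like the one on this slide can be produced by combining the outputs of several networks into one score per ORF and sorting. The sketch below assumes the ensemble score is the plain arithmetic mean of the individual network outputs. The slides do not state how the ensemble was formed, so this is an illustration rather than the authors' procedure. The function names, ORF identifiers, and values are hypothetical.

```python
def ensemble_scores(network_outputs):
    """Average per-ORF outputs over all networks.

    network_outputs maps a network name to a dict of ORF -> output in [0, 1].
    Only ORFs scored by every network are kept.
    Assumption: the ensemble is an unweighted mean.
    """
    networks = list(network_outputs.values())
    if not networks:
        raise ValueError("at least one network is required")
    shared_orfs = set(networks[0])
    for outputs in networks[1:]:
        shared_orfs &= set(outputs)
    return {
        orf: sum(outputs[orf] for outputs in networks) / len(networks)
        for orf in shared_orfs
    }


def rank_orfs(scores):
    """Sort ORFs by descending score, breaking ties alphabetically."""
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical outputs from two networks for three placeholder ORFs.
toy_outputs = {
    "network_a": {"ORF1": 0.9, "ORF2": 0.2, "ORF3": 0.6},
    "network_b": {"ORF1": 0.7, "ORF2": 0.4, "ORF3": 0.6},
}
for orf, score in rank_orfs(ensemble_scores(toy_outputs)):
    print(f"{orf}\t{score:.2f}")
```

Averaging is only one possible way to combine networks; a weighted mean or a vote would fit the same interface.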

Page 76: Proteomics - Analysis and integration of large-scale data sets


The yeast cell cycle in feature space

Ulrik de Lichtenberg, CBS, DTU Lyngby © Journal of Molecular Biology, 2003

Page 77: Proteomics - Analysis and integration of large-scale data sets


S phase feature snapshot

• S phase

• 40% into the cell cycle we see:
  - High isoelectric point
  - Many nuclear proteins
  - Short proteins
  - Low N-glycosylation potential
  - Low potential for Ser/Thr-phosphorylation
  - Few PEST regions
  - Low aliphatic index

Ulrik de Lichtenberg, CBS, DTU Lyngby © Journal of Molecular Biology, 2003
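Several of the features in this snapshot are simple functions of amino acid composition. As one example, the aliphatic index can be computed directly from a sequence. The slide does not say how the feature was calculated, so the sketch below assumes the standard definition by Ikai (1980): the mole percent of Ala, plus 2.9 times that of Val, plus 3.9 times that of Ile and Leu combined. The function name is hypothetical.

```python
def aliphatic_index(sequence):
    """Aliphatic index of a protein sequence (one-letter amino acid codes).

    Assumed definition (Ikai, 1980):
        AI = X(Ala) + 2.9 * X(Val) + 3.9 * (X(Ile) + X(Leu))
    where X is the mole percent (100 * count / length) of each residue.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    length = len(seq)

    def mole_percent(residue):
        return 100.0 * seq.count(residue) / length

    return (
        mole_percent("A")
        + 2.9 * mole_percent("V")
        + 3.9 * (mole_percent("I") + mole_percent("L"))
    )


# A sequence with one each of Ala, Val, Ile, and Leu (25 mole percent each).
print(aliphatic_index("AVIL"))
```

A low value, as reported here for S phase proteins, means a small relative volume taken up by aliphatic side chains.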

Page 78: Proteomics - Analysis and integration of large-scale data sets


G1/S phase feature snapshot

• G1/S transition

• 25% into the cell cycle we see:
  - Low isoelectric point
  - Many extracellular proteins
  - Many PEST regions
  - Very high Tyr-phosphorylation potential
  - Higher glycosylation potential
  - Higher potential for Ser/Thr-phosphorylation

Ulrik de Lichtenberg, CBS, DTU Lyngby © Journal of Molecular Biology, 2003

Page 79: Proteomics - Analysis and integration of large-scale data sets


Conclusions

• Accurate models can be constructed by careful integration of several types of high-throughput experimental data

• We have constructed a model of the yeast cell cycle that reveals global trends that were not previously known

• The same strategies are applicable to other systems
  - The integrative approach is applicable to any process for which both interaction data and time series are available
  - Most broad classes of proteins can be predicted using neural networks with sequence-derived features as input

Page 80: Proteomics - Analysis and integration of large-scale data sets

Questions?

Page 81: Proteomics - Analysis and integration of large-scale data sets


Summary

• The many types of high-throughput data should be [...]
  - Better standardization and quality control are crucial
  - Scoring schemes and filtering schemes can drastically reduce the error rate of high-throughput data
  - Integration of many evidence types allows high-confidence predictions of functional relationships
  - New biological discoveries can be made through data integration

• There is more to proteins than just globular domains
  - Proteins contain many short linear motifs (ELMs)
  - Most of these are very difficult to predict from sequence
  - Sequence-derived features can give hints about protein function

Page 82: Proteomics - Analysis and integration of large-scale data sets


Acknowledgments

• STRING and ArrayProspector: Peer Bork, Christian von Mering, Jan Korbel, Berend Snel, Martijn Huynen, Daniel Jaeggi, Steffen Schmidt, Sean Hooper, Mathilde Foglierini, Julien Lagarde, Chris Workman

• ELMs – linear motifs: Rune Linding, Toby Gibson, Rob Russell

• Protein feature/function prediction: Søren Brunak, Alfonso Valencia, Ramneek Gupta, Can Kesmir, Kristoffer Rapacki, Hans-Henrik Stærfeldt, Henrik Nielsen, Nikolaj Blom, Claus A.F. Andersen, Anders Krogh, Steen Knudsen, Chris Workman, Damien Devos, Javier Tamames

• Analysis of the yeast cell cycle: Ulrik de Lichtenberg, Thomas Skøt, Anders Fausbøll, Søren Brunak

Page 83: Proteomics - Analysis and integration of large-scale data sets

Thank you!