Integration of heterogeneous data

Post on 10-May-2015


9th Course in Bioinformatics for Molecular Biologists, Bertinoro, Italy, March 22-26, 2009


Lars Juhl Jensen


what went wrong?

a good question

signaling networks

Oda & Kitano, Molecular Systems Biology, 2006

long way to go

mass spectrometry

Linding, Jensen, Ostheimer et al., Cell, 2007

phosphorylation sites

in vivo

kinases are unknown

peptide assays

Miller, Jensen et al., Science Signaling, 2008

sequence specificity

kinase-specific

in vitro

no context

what a kinase could do

not what it actually does

computational methods

sequence specificity

Miller, Jensen et al., Science Signaling, 2008

kinase-specific

no context

what a kinase could do

not what it actually does

in vitro

in vivo

context

co-activators

scaffolders

expression

association networks

Linding, Jensen, Ostheimer et al., Cell, 2007

a good idea

Linding, Jensen, Ostheimer et al., Cell, 2007

Part I: sequence motifs

curated motifs

PROSITE

ELM

HPRD

regular expressions

[ST]P.[KR]

no score
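To make the "no score" limitation concrete, here is a minimal sketch that scans a protein sequence with the regular expression shown above. The sequence is invented for illustration, and `find_motif_sites` is a hypothetical helper, not part of any of the resources listed.

```python
import re

# The motif from the slide: S or T, then P, then any residue, then K or R.
# The lookahead lets overlapping hits be reported as well.
MOTIF = re.compile(r"(?=([ST]P.[KR]))")

def find_motif_sites(sequence):
    """Return (1-based position, matched peptide) for every motif hit."""
    return [(m.start() + 1, m.group(1)) for m in MOTIF.finditer(sequence)]

# Hypothetical sequence, made up for illustration only.
example = "MASPEKTPLRAASPQA"
print(find_motif_sites(example))  # [(3, 'SPEK'), (7, 'TPLR')]
```

The peptide `SPQA` at position 13 differs from the motif by a single residue and is simply not reported. A regular expression gives a yes or no answer and cannot say that a site is a near miss.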

Miller, Jensen et al., Science Signaling, 2008

insufficient

machine learning

NetPhosK

PredPhospho

PHOSITE

GPS

KinasePhos

PPSP

GANNPhos

PhoScan

no regular updates

NetPhorest

Miller, Jensen et al., Science Signaling, 2008

data sources

Phospho.ELM

Diella et al., Nucleic Acids Res., 2008


Scansite

Obenauer et al., Nucleic Acids Res., 2003

Miller, Jensen et al., Science Signaling, 2008

common basis

Miller, Jensen et al., Science Signaling, 2008

automated pipeline

compilation of datasets

classification vs. prediction

Miller, Jensen et al., Science Signaling, 2008

homology reduction

Miller, Jensen et al., Science Signaling, 2008

training and evaluation

cross-validation
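Homology matters for cross-validation because near-identical sequences in both the training and the test partition inflate the measured performance. The sketch below shows one possible way to guard against this, by keeping every homology cluster inside a single fold. It assumes the clusters are already known. `assign_folds` and the identifiers are hypothetical, and this is not necessarily the procedure used in the cited pipeline.

```python
from collections import defaultdict

def assign_folds(cluster_of, n_folds):
    """Assign sequences to folds so that one homology cluster never spans two folds.

    cluster_of maps a sequence id to the id of its homology cluster.
    """
    clusters = defaultdict(list)
    for seq_id, cluster_id in cluster_of.items():
        clusters[cluster_id].append(seq_id)

    fold_sizes = [0] * n_folds
    fold_of = {}
    # Place the largest clusters first, each into the currently smallest fold.
    for cluster_id in sorted(clusters, key=lambda c: (-len(clusters[c]), c)):
        target = fold_sizes.index(min(fold_sizes))
        for seq_id in clusters[cluster_id]:
            fold_of[seq_id] = target
        fold_sizes[target] += len(clusters[cluster_id])
    return fold_of

# Hypothetical sequences p1..p6 grouped into clusters A..D.
folds = assign_folds(
    {"p1": "A", "p2": "A", "p3": "B", "p4": "C", "p5": "C", "p6": "D"}, 2
)
print(folds)
```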

Miller, Jensen et al., Science Signaling, 2008

classifier selection

Miller, Jensen et al., Science Signaling, 2008

motif atlas

179 kinases

93 SH2 domains

8 PTB domains

BRCT domains

WW domains

14-3-3 proteins

phosphatases

model organisms

S. cerevisiae

D. melanogaster

C. elegans

biological insights

docking domains

Miller, Jensen et al., Science Signaling, 2008

disease-related kinases

Miller, Jensen et al., Science Signaling, 2008

predictive power

ROC curves
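A ROC curve is often summarized by the area under it, which equals the probability that a randomly chosen positive scores higher than a randomly chosen negative. A minimal sketch of that calculation follows. The scores and labels are invented, and `roc_auc` is an illustrative helper rather than code from the cited work.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison; ties count as half a win."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Invented classifier scores with their true labels (1 = real site).
print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```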

Miller, Jensen et al., Science Signaling, 2008

comparison

Miller, Jensen et al., Science Signaling, 2008

conclusions

data collection

automation

benchmarking

homology reduction!

Part II: association networks

STRING

Jensen, Kuhn et al., Nucleic Acids Research, 2009

functional associations

data integration

common basis

630 genomes

model organism databases

Ensembl

RefSeq

genomic context methods

gene fusion

Korbel et al., Nature Biotechnology, 2004

conserved neighborhood

operons

Korbel et al., Nature Biotechnology, 2004

bidirectional promoters

Korbel et al., Nature Biotechnology, 2004

phylogenetic profiles
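The idea behind phylogenetic profiles is that genes present and absent in the same set of genomes are likely to be functionally linked. The sketch below builds presence/absence profiles and compares them by Hamming distance. This is a deliberately simple stand-in for whatever scoring a production resource uses, and the genome and gene names are hypothetical.

```python
def build_profile(present_in, genomes):
    """Presence/absence vector of one gene family over an ordered list of genomes."""
    return tuple(1 if genome in present_in else 0 for genome in genomes)

def hamming(profile_a, profile_b):
    """Number of genomes in which two profiles disagree."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    return sum(a != b for a, b in zip(profile_a, profile_b))

genomes = ["g1", "g2", "g3", "g4"]           # hypothetical genomes
gene_x = build_profile({"g1", "g3"}, genomes)
gene_y = build_profile({"g1", "g3"}, genomes)
gene_z = build_profile({"g2", "g4"}, genomes)
print(hamming(gene_x, gene_y), hamming(gene_x, gene_z))  # 0 4
```

A distance of 0 means the two families always co-occur, which is the pattern this method looks for.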

Korbel et al., Nature Biotechnology, 2004

primary experimental data

protein interactions

yeast two-hybrid

affinity purification

fragment complementation

Jensen & Bork, Science, 2008

genetic interactions

Beyer et al., Nature Reviews Genetics, 2007

BIND (Biomolecular Interaction Network Database)

BioGRID (General Repository for Interaction Datasets)

DIP (Database of Interacting Proteins)

IntAct

MINT (Molecular Interactions Database)

HPRD (Human Protein Reference Database)

PDB (Protein Data Bank)

inferred associations

gene coexpression

GEO (Gene Expression Omnibus)

expression compendia
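Coexpression is commonly quantified as the correlation between two genes' expression profiles across many conditions. The sketch below computes a Pearson correlation on invented numbers. It assumes already normalized expression values and does not reproduce any particular resource's scoring.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equally long expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / sqrt(var_x * var_y)

# Invented expression values for two genes over four conditions.
print(pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]))
```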

curated knowledge

complexes

MIPS (Munich Information Center for Protein Sequences)

Gene Ontology

pathways

Letunic & Bork, Trends in Biochemical Sciences, 2008

KEGG (Kyoto Encyclopedia of Genes and Genomes)

MetaCyc

Reactome

PID (NCI-Nature Pathway Interaction Database)

literature mining

MEDLINE

SGD (Saccharomyces Genome Database)

The Interactive Fly

OMIM (Online Mendelian Inheritance in Man)

co-mentioning

statistical methods
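Co-mentioning can be sketched as counting how often two gene names appear in the same abstract. The example assumes name recognition has already been done, so each document is just a set of gene names. It reports raw counts only, whereas a real statistical method would also compare them with how often each name occurs on its own.

```python
from collections import Counter
from itertools import combinations

def comention_counts(documents):
    """Count, for every pair of names, the number of documents mentioning both."""
    pair_counts = Counter()
    for names in documents:
        for pair in combinations(sorted(set(names)), 2):
            pair_counts[pair] += 1
    return pair_counts

# Hypothetical abstracts, reduced to the gene names recognized in each.
abstracts = [{"CYC1", "HAP1"}, {"CYC1", "CYC7", "HAP1"}, {"GAL4"}]
counts = comention_counts(abstracts)
print(counts[("CYC1", "HAP1")])  # 2
```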

NLP (Natural Language Processing)

Gene and protein names

Cue words for entity recognition

Verbs for relation extraction

[nxgene The GAL4 gene]

[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]

easy in theory …

… but not in practice

different formats

parsers

different identifiers

thesaurus

redundant sources

bookkeeping
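The identifier and redundancy problems can be illustrated together: map every name to a canonical identifier through a thesaurus, then merge duplicate pairs while remembering which sources reported them. Everything in this sketch is hypothetical, including the identifiers `P1` and `P2`, the source names, and the helper `canonical_pairs`.

```python
def canonical_pairs(records, thesaurus):
    """Map interaction partners to canonical ids and merge duplicates across sources.

    records   : iterable of (name_a, name_b, source)
    thesaurus : dict from upper-case synonym to canonical id
    """
    merged = {}
    for name_a, name_b, source in records:
        id_a = thesaurus.get(name_a.upper())
        id_b = thesaurus.get(name_b.upper())
        if id_a is None or id_b is None or id_a == id_b:
            continue  # skip unknown names and self-pairs
        key = tuple(sorted((id_a, id_b)))  # order-independent pair
        merged.setdefault(key, set()).add(source)
    return merged

# Two synonyms of one protein point to the same hypothetical id P1.
thesaurus = {"CDC28": "P1", "CDK1": "P1", "CLN2": "P2"}
records = [("Cdc28", "Cln2", "dbA"), ("CLN2", "CDK1", "dbB"), ("Cdc28", "Unknown", "dbA")]
print(canonical_pairs(records, thesaurus))
```

Keeping the set of sources per pair is the bookkeeping that prevents the same underlying observation from being counted as independent evidence twice.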

variable quality

raw quality scores

reproducibility

von Mering et al., Nucleic Acids Research, 2005

benchmarking
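Benchmarking turns raw quality scores from different sources into comparable numbers by checking them against a trusted reference set. A minimal sketch: rank the pairs by raw score, cut the ranking into bins, and report the fraction of reference-confirmed pairs in each bin. The data are invented, and `precision_per_bin` is an illustrative helper, not the calibration procedure of the cited paper.

```python
def precision_per_bin(raw_scores, is_true, n_bins):
    """Return (lowest score, highest score, fraction true) for each rank-ordered bin."""
    ranked = sorted(zip(raw_scores, is_true), reverse=True)
    bin_size = -(-len(ranked) // n_bins)  # ceiling division
    result = []
    for start in range(0, len(ranked), bin_size):
        chunk = ranked[start:start + bin_size]
        fraction_true = sum(1 for _, truth in chunk if truth) / len(chunk)
        result.append((chunk[-1][0], chunk[0][0], fraction_true))
    return result

# Invented raw scores and whether each pair is supported by the reference set.
scores = [9, 8, 7, 6, 5, 4, 3, 2]
truth = [1, 1, 1, 0, 1, 0, 0, 0]
print(precision_per_bin(scores, truth, 2))  # [(6, 9, 0.75), (2, 5, 0.25)]
```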

von Mering et al., Nucleic Acids Research, 2005

spread over 630 genomes

transfer by orthology

von Mering et al., Nucleic Acids Research, 2005

two modes

COG mode

von Mering et al., Nucleic Acids Research, 2005

protein mode

von Mering et al., Nucleic Acids Research, 2005

combine all evidence
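One common way to combine calibrated scores from several evidence channels is to treat them as independent probabilities and compute 1 minus the product of (1 - score). The sketch below does exactly that. It assumes independence between channels and ignores any prior correction a real pipeline might apply, and `combine_scores` is an illustrative name.

```python
def combine_scores(scores):
    """Probability that at least one evidence channel is right, assuming independence."""
    all_wrong = 1.0
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must be probabilities between 0 and 1")
        all_wrong *= 1.0 - score
    return 1.0 - all_wrong

# Two moderately confident channels support each other.
print(combine_scores([0.5, 0.5]))  # 0.75
```

The combined score is never lower than the best single channel, which matches the intuition that several weak but independent lines of evidence add up.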

visualize

Frishman et al., Modern Genome Annotation, 2009

STITCH

metabolite–enzyme links

pathway databases

Letunic & Bork, Trends in Biochemical Sciences, 2008

drug–target links

DrugBank

PDSP Ki

MATADOR

Campillos & Kuhn et al., Science, 2008

chemical–chemical links

shared targets

fingerprint similarity
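Fingerprint similarity between two chemicals is typically measured with the Tanimoto coefficient: the number of features both compounds have, divided by the number of features either has. The sketch assumes each fingerprint is given as the set of its "on" bit positions. The bit sets are invented, and the choice of Tanimoto is an assumption about what "fingerprint similarity" means here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

# Invented fingerprints: two shared bits out of five distinct bits overall.
print(tanimoto({1, 2, 3, 4}, {3, 4, 5}))  # 0.4
```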

chemical–protein network

conclusions

more data is better

quality scores

benchmarking

cross-species integration

Part III: putting it all together

Linding, Jensen, Ostheimer et al., Cell, 2007

NetworKIN

benchmarking

Linding, Jensen, Ostheimer et al., Cell, 2007

2.5-fold better accuracy

context is crucial

localization

Linding, Jensen, Ostheimer et al., Cell, 2007

DNA damage response

Linding, Jensen, Ostheimer et al., Cell, 2007


small-scale validation

ATM phosphorylates Rad50

Linding, Jensen, Ostheimer et al., Cell, 2007

Cdk1 phosphorylates 53BP1

Linding, Jensen, Ostheimer et al., Cell, 2007

high-throughput validation

multiple reaction monitoring

Linding, Jensen, Ostheimer et al., Cell, 2007

systematic validation

kinase inhibitor matrix

Fedorov et al., PNAS, 2007

design optimal experiments

integration with literature

Reflect

conclusions

complementary data

visualization

a good question

Acknowledgments

NetworKIN.info: Rune Linding, Gerard Ostheimer, Francesca Diella, Karen Colwill, Jing Jin, Pavel Metalnikov, Vivian Nguyen, Adrian Pasculescu, Jin Gyoon Park, Leona D. Samson, Rob Russell, Peer Bork, Michael Yaffe, Tony Pawson

STITCH.embl.de: Michael Kuhn, Christian von Mering, Monica Campillos, Peer Bork

NetPhorest.info: Martin Lee Miller, Francesca Diella, Claus Jørgensen, Michele Tinti, Lei Li, Marilyn Hsiung, Sirlester A. Parker, Jennifer Bordeaux, Thomas Sicheritz-Pontén, Marina Olhovsky, Adrian Pasculescu, Jes Alexander, Stefan Knapp, Nikolaj Blom, Peer Bork, Shawn Li, Gianni Cesareni, Tony Pawson, Benjamin E. Turk, Michael B. Yaffe, Søren Brunak

STRING.embl.de: Christian von Mering, Michael Kuhn, Manuel Stark, Samuel Chaffron, Philippe Julien, Tobias Doerks, Jan Korbel, Berend Snel, Martijn Huynen, Peer Bork

Reflect.ws: Sean O’Donoghue, Evangelos Pafilis, Heiko Horn, Michael Kuhn, Nigel Brown, Reinhardt Schneider