The STRING database - Quality scores for heterogeneous interaction data

Post on 27-Jun-2015

Lyon, France, April 23-25, 2007

The STRING database: Quality scores for heterogeneous interaction data

Lars Juhl Jensen

EMBL Heidelberg

data integration

Jensen et al., Drug Discovery Today: Targets, 2004

functional interactions

Genomic neighborhood

Species co-occurrence

Gene fusions

Database imports

Experimental interaction data

Microarray expression data

Literature mining

373 proteomes

model organism databases

Ensembl

Genome Reviews

RefSeq

genomic context methods

gene fusion

gene neighborhood

phylogenetic profiles

scoring schemes

benchmarking

cross-species transfer

primary experimental data

many sources

different formats

different gene identifiers

redundancy

physical protein interactions

IntAct

BIND (Biomolecular Interaction Network Database)

MINT (Molecular Interactions Database)

DIP (Database of Interacting Proteins)

GRID (General Repository for Interaction Datasets)

HPRD (Human Protein Reference Database)

PSI-MI

reference proteomes

merge data by publication

thousands of interactions

correct interactions

wrong interactions

scoring scheme

complex pull-down

von Mering et al., Nucleic Acids Research, 2005

log[(N12·N)/((N1+1)·(N2+1))]
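As a minimal sketch of how the pull-down score above could be computed: the slide does not define its symbols, so this assumes N12 is the number of purifications containing both proteins, N1 and N2 the number containing each protein, and N the total number of purifications. The function name `pulldown_score` is hypothetical.

```python
import math


def pulldown_score(n12: int, n1: int, n2: int, n_total: int) -> float:
    """Raw score log[(N12*N) / ((N1+1)*(N2+1))].

    Assumed meaning of the symbols (not stated on the slide):
    n12     -- purifications that contain both proteins
    n1, n2  -- purifications that contain protein 1 / protein 2
    n_total -- total number of purifications
    """
    if n12 <= 0:
        # The proteins were never seen together: the logarithm is undefined.
        return float("-inf")
    return math.log((n12 * n_total) / ((n1 + 1) * (n2 + 1)))


# Two proteins that always co-purify score higher than two proteins
# that each appear in many purifications but rarely together.
print(pulldown_score(n12=3, n1=3, n2=3, n_total=16))
print(pulldown_score(n12=1, n1=7, n2=7, n_total=16))
```

The +1 terms in the denominator penalize pairs that were observed only a few times, so a single shared purification does not produce an extreme score.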

yeast two-hybrid

non-shared interactors

-log((N1+1)·(N2+1))
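A small sketch of the yeast two-hybrid score above, assuming N1 and N2 are the numbers of non-shared interaction partners of the two proteins (the slide only says "non-shared interactors"). The function name `y2h_score` is hypothetical.

```python
import math


def y2h_score(n1_nonshared: int, n2_nonshared: int) -> float:
    """Raw score -log((N1+1)*(N2+1)).

    n1_nonshared, n2_nonshared are assumed to be the number of
    interaction partners each protein has that the other lacks.
    """
    if n1_nonshared < 0 or n2_nonshared < 0:
        raise ValueError("interactor counts must be non-negative")
    return -math.log((n1_nonshared + 1) * (n2_nonshared + 1))


# A pair with no other partners gets the best possible score (zero);
# a pair of promiscuous proteins is pushed toward strongly negative values.
print(y2h_score(0, 0))
print(y2h_score(20, 35))
```

Because this score and the pull-down score live on different scales, neither can be compared with the other until both are calibrated.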

not directly comparable

calibrate vs. gold standard
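One way calibration against a gold standard might look, as a hedged sketch rather than the STRING procedure itself: sort the scored pairs, cut them into equal-size bins, and record what fraction of each bin is confirmed by the gold standard. The input layout and the name `calibration_curve` are assumptions made for illustration.

```python
def calibration_curve(scored_pairs, bin_size):
    """Map raw scores to the observed fraction of gold-standard positives.

    scored_pairs -- list of (raw_score, is_in_gold_standard) tuples
    bin_size     -- number of pairs per bin (the last bin may be smaller)

    Returns a list of (mean_raw_score, fraction_confirmed) per bin,
    ordered from the lowest-scoring bin to the highest.
    """
    if bin_size < 1:
        raise ValueError("bin_size must be at least 1")
    ordered = sorted(scored_pairs, key=lambda pair: pair[0])
    curve = []
    for start in range(0, len(ordered), bin_size):
        chunk = ordered[start:start + bin_size]
        mean_score = sum(score for score, _ in chunk) / len(chunk)
        confirmed = sum(1 for _, is_true in chunk if is_true)
        curve.append((mean_score, confirmed / len(chunk)))
    return curve


# Toy data: higher raw scores are confirmed more often.
pairs = [(0.1, False), (0.2, False), (0.3, True),
         (0.4, False), (0.8, True), (0.9, True)]
for mean_score, fraction in calibration_curve(pairs, bin_size=2):
    print(round(mean_score, 2), fraction)
```

A smooth function fitted through such a curve would turn any raw score into a probability-like value, which is what makes scores from different evidence types comparable.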

other types of evidence

co-expression

GEO (Gene Expression Omnibus)

species-specific datasets

correlation coefficient
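The slide does not say which correlation coefficient is used. As an illustration only, the sketch below assumes the Pearson coefficient between the expression profiles of two genes across the same set of arrays; the function name `pearson` and the toy profiles are hypothetical.

```python
import math


def pearson(profile_a, profile_b):
    """Pearson correlation between two equally long expression profiles."""
    if len(profile_a) != len(profile_b) or len(profile_a) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_a = sum(profile_a) / len(profile_a)
    mean_b = sum(profile_b) / len(profile_b)
    covariance = sum((a - mean_a) * (b - mean_b)
                     for a, b in zip(profile_a, profile_b))
    var_a = sum((a - mean_a) ** 2 for a in profile_a)
    var_b = sum((b - mean_b) ** 2 for b in profile_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("a constant profile has no defined correlation")
    return covariance / math.sqrt(var_a * var_b)


# Two genes whose expression rises and falls together across four arrays.
gene_x = [1.0, 2.0, 3.0, 4.0]
gene_y = [2.0, 4.0, 6.0, 8.0]
print(pearson(gene_x, gene_y))
```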

calibrate vs. gold standard

directly comparable

curated knowledge

many sources

different formats

different gene identifiers

redundancy

protein complexes

MIPS (Munich Information Center for Protein Sequences)

Gene Ontology

pathway databases

KEGG (Kyoto Encyclopedia of Genes and Genomes)

Reactome

PID (NCI-Nature Pathway Interaction Database)

STKE (Signal Transduction Knowledge Environment)

BioPAX

reference proteomes

literature mining

MEDLINE

SGD (Saccharomyces Genome Database)

The Interactive Fly

OMIM (Online Mendelian Inheritance in Man)

different gene identifiers

synonyms lists

black list

flexible matching
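The three ideas above (synonym lists, a black list, flexible matching) can be sketched as a dictionary lookup. This is only an illustration: the identifiers, the synonym entries, and the normalization rule (ignore case, hyphens, and spaces) are all assumptions, not the rules STRING uses.

```python
import re


def normalize(name):
    """Flexible matching (assumed rule): ignore case, hyphens, whitespace."""
    return re.sub(r"[\s\-]+", "", name).lower()


def build_lookup(synonyms, blacklist):
    """Map every normalized synonym to its gene identifier,
    skipping names that appear on the black list."""
    blocked = {normalize(name) for name in blacklist}
    lookup = {}
    for gene_id, names in synonyms.items():
        for name in names:
            key = normalize(name)
            if key not in blocked:
                lookup[key] = gene_id
    return lookup


def tag_genes(text, lookup):
    """Return (token, gene_id) for every token that matches a synonym."""
    hits = []
    for token in re.findall(r"[A-Za-z0-9\-]+", text):
        gene_id = lookup.get(normalize(token))
        if gene_id is not None:
            hits.append((token, gene_id))
    return hits


# Hypothetical identifiers and synonym lists, for illustration only.
synonyms = {
    "gene:0001": ["CYC1", "Cyc-1"],
    "gene:0002": ["HAP1"],
    "gene:0003": ["AND"],  # a symbol that collides with an English word
}
lookup = build_lookup(synonyms, blacklist=["and"])
print(tag_genes("The expression of CYC1 and cyc-1 is controlled by Hap1", lookup))
```

The black list removes the ambiguous symbol before lookup, so the ordinary word "and" in the sentence is never tagged as a gene.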

co-occurrence

log[(N12·N)/((N1+1)·(N2+1))]
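For literature co-occurrence, the same formula can be evaluated directly from tagged abstracts. The sketch below assumes N12 is the number of abstracts mentioning both genes, N1 and N2 the number mentioning each gene, and N the total number of abstracts; the slide does not spell this out, and the name `cooccurrence_scores` is hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_scores(abstracts):
    """Score every gene pair that is mentioned together at least once.

    abstracts -- list of sets, each holding the gene identifiers
                 recognized in one abstract (assumed input format)

    Returns {(gene_a, gene_b): log[(N12*N) / ((N1+1)*(N2+1))]}
    with gene_a < gene_b in sort order.
    """
    n_total = len(abstracts)
    mentions = Counter()     # N1 / N2: abstracts mentioning each gene
    co_mentions = Counter()  # N12: abstracts mentioning both genes
    for genes in abstracts:
        mentions.update(genes)
        co_mentions.update(combinations(sorted(genes), 2))
    return {
        (a, b): math.log((n12 * n_total)
                         / ((mentions[a] + 1) * (mentions[b] + 1)))
        for (a, b), n12 in co_mentions.items()
    }


# Four toy abstracts with hypothetical gene identifiers.
abstracts = [{"A", "B"}, {"A", "B"}, {"A", "C"}, {"D"}]
for pair, score in sorted(cooccurrence_scores(abstracts).items()):
    print(pair, round(score, 3))
```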

NLP (Natural Language Processing)

Gene and protein names

Cue words for entity recognition

Verbs for relation extraction

[nxgene The GAL4 gene]

[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]

calibrate vs. gold standard

directly comparable

combine all evidence

spread over many species

transfer by orthology

von Mering et al., Nucleic Acids Research, 2005

two modes

orthologous groups

von Mering et al., Nucleic Acids Research, 2005

fuzzy orthology

von Mering et al., Nucleic Acids Research, 2005

add probabilistic scores

P = 1 - (1-P1)·(1-P2)·(1-P3)…
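The combination rule above can be written in a few lines. This sketch implements only the formula as shown, which treats the individual evidence scores as independent; the function name `combine_scores` is hypothetical.

```python
def combine_scores(probabilities):
    """Combine per-evidence probabilities as P = 1 - product(1 - Pi)."""
    all_wrong = 1.0  # probability that every evidence type is wrong
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each score must lie between 0 and 1")
        all_wrong *= 1.0 - p
    return 1.0 - all_wrong


# Two moderately confident evidence types reinforce each other.
print(combine_scores([0.5, 0.5]))  # prints 0.75
```

The combined score is never lower than the strongest single score, and adding further supporting evidence can only raise it.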

Genomic neighborhood

Species co-occurrence

Gene fusions

Database imports

Experimental interaction data

Microarray expression data

Literature mining

Acknowledgments

The STRING team

– Christian von Mering

– Michael Kuhn

– Berend Snel

– Martijn Huynen

– Sean Hooper

– Samuel Chaffron

– Julien Lagarde

– Mathilde Foglierini

– Peer Bork

Literature mining project

– Jasmin Saric

– Rossitza Ouzounova

– Isabel Rojas