Integration of heterogeneous data


10th Course in Bioinformatics and Systems Biology for Molecular Biologists, Schloss Hohenkammer, Hohenkammer, Germany, March 15, 2010.

Transcript of Integration of heterogeneous data

Page 1: Integration of heterogeneous data

Integration of heterogeneous data

Lars Juhl Jensen

Page 2: Integration of heterogeneous data
Page 3: Integration of heterogeneous data
Page 4: Integration of heterogeneous data
Page 5: Integration of heterogeneous data
Page 6: Integration of heterogeneous data

data mining

Page 7: Integration of heterogeneous data

text mining

Page 8: Integration of heterogeneous data

interaction networks

Page 9: Integration of heterogeneous data
Page 10: Integration of heterogeneous data

Kuhn et al., Nucleic Acids Research, 2010

Page 11: Integration of heterogeneous data

parts lists

Page 12: Integration of heterogeneous data

630 genomes

Page 13: Integration of heterogeneous data

2.5 million proteins

Page 14: Integration of heterogeneous data

~74,000 small molecules

Page 15: Integration of heterogeneous data

many databases

Page 16: Integration of heterogeneous data

different formats

Page 17: Integration of heterogeneous data

model organism databases

Page 18: Integration of heterogeneous data

Ensembl

Page 19: Integration of heterogeneous data

RefSeq

Page 20: Integration of heterogeneous data

PubChem

Page 21: Integration of heterogeneous data

genomic context

Page 22: Integration of heterogeneous data

gene fusion

Page 23: Integration of heterogeneous data

Korbel et al., Nature Biotechnology, 2004

Page 24: Integration of heterogeneous data

conserved neighborhood

Page 25: Integration of heterogeneous data

operons

Page 26: Integration of heterogeneous data

Korbel et al., Nature Biotechnology, 2004

Page 27: Integration of heterogeneous data

bidirectional promoters

Page 28: Integration of heterogeneous data

Korbel et al., Nature Biotechnology, 2004

Page 29: Integration of heterogeneous data

phylogenetic profiles

Page 30: Integration of heterogeneous data

Korbel et al., Nature Biotechnology, 2004
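
A minimal sketch of the phylogenetic-profile idea, assuming each gene is represented by a presence/absence vector across the same ordered set of genomes and that two profiles are compared with a Jaccard index. The function name, the toy profiles, and the choice of similarity measure are illustrative assumptions, not taken from the slides.

```python
def profile_similarity(profile_a, profile_b):
    """Jaccard similarity of two presence/absence profiles.

    Each profile is a sequence of 0/1 flags, one per genome, in the same
    genome order. Returns shared presences divided by genomes in which at
    least one of the two genes is present.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0


# Toy profiles over five hypothetical genomes.
gene_x = [1, 1, 0, 1, 0]
gene_y = [1, 1, 0, 1, 1]
gene_z = [0, 0, 1, 0, 0]

similar = profile_similarity(gene_x, gene_y)    # genes that tend to co-occur
dissimilar = profile_similarity(gene_x, gene_z)  # genes that never co-occur
```

Genes whose profiles are highly similar are candidates for a functional association; a real analysis would also have to account for the phylogenetic relatedness of the genomes.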

Page 31: Integration of heterogeneous data

experimental data

Page 32: Integration of heterogeneous data

gene coexpression

Page 33: Integration of heterogeneous data
Page 34: Integration of heterogeneous data

protein interactions

Page 35: Integration of heterogeneous data

Jensen & Bork, Science, 2008

Page 36: Integration of heterogeneous data

genetic interactions

Page 37: Integration of heterogeneous data

Beyer et al., Nature Reviews Genetics, 2007

Page 38: Integration of heterogeneous data

small molecule interactions

Page 39: Integration of heterogeneous data

in vitro binding assays

Page 40: Integration of heterogeneous data

cellular activity assays

Page 41: Integration of heterogeneous data

many databases

Page 42: Integration of heterogeneous data

GEO: Gene Expression Omnibus

Page 43: Integration of heterogeneous data

BIND: Biomolecular Interaction Network Database

Page 44: Integration of heterogeneous data

BioGRID: General Repository for Interaction Datasets

Page 45: Integration of heterogeneous data

DIP: Database of Interacting Proteins

Page 46: Integration of heterogeneous data

IntAct

Page 47: Integration of heterogeneous data

MINT: Molecular Interactions Database

Page 48: Integration of heterogeneous data

HPRD: Human Protein Reference Database

Page 49: Integration of heterogeneous data

PDB: Protein Data Bank

Page 50: Integration of heterogeneous data

BindingDB

Page 51: Integration of heterogeneous data

CTD: Comparative Toxicogenomics Database

Page 52: Integration of heterogeneous data

DrugBank

Page 53: Integration of heterogeneous data

GLIDA: GPCR-Ligand Database

Page 54: Integration of heterogeneous data

MATADOR

Page 55: Integration of heterogeneous data

PDSP Ki: Psychoactive Drug Screening Program

Page 56: Integration of heterogeneous data

PharmGKB: Pharmacogenomics Knowledge Base

Page 57: Integration of heterogeneous data

different formats

Page 58: Integration of heterogeneous data

different identifiers

Page 59: Integration of heterogeneous data

partially redundant

Page 60: Integration of heterogeneous data

Campillos & Kuhn et al., Science, 2008

Page 61: Integration of heterogeneous data

curated knowledge

Page 62: Integration of heterogeneous data

complexes

Page 63: Integration of heterogeneous data

pathways

Page 64: Integration of heterogeneous data

Letunic & Bork, Trends in Biochemical Sciences, 2008

Page 65: Integration of heterogeneous data

many databases

Page 66: Integration of heterogeneous data

Gene Ontology

Page 67: Integration of heterogeneous data

MIPS: Munich Information Center for Protein Sequences

Page 68: Integration of heterogeneous data

KEGG: Kyoto Encyclopedia of Genes and Genomes

Page 69: Integration of heterogeneous data

MetaCyc

Page 70: Integration of heterogeneous data

Reactome

Page 71: Integration of heterogeneous data

PID: NCI-Nature Pathway Interaction Database

Page 72: Integration of heterogeneous data

high confidence

Page 73: Integration of heterogeneous data

different formats

Page 74: Integration of heterogeneous data

different identifiers

Page 75: Integration of heterogeneous data

partially redundant

Page 76: Integration of heterogeneous data

literature mining

Page 77: Integration of heterogeneous data

>10 km

Page 78: Integration of heterogeneous data

human readable

Page 79: Integration of heterogeneous data

not computer readable

Page 80: Integration of heterogeneous data

different names

Page 81: Integration of heterogeneous data

text corpus

Page 82: Integration of heterogeneous data

MEDLINE

Page 83: Integration of heterogeneous data

SGD: Saccharomyces Genome Database

Page 84: Integration of heterogeneous data

The Interactive Fly

Page 85: Integration of heterogeneous data

OMIM: Online Mendelian Inheritance in Man

Page 86: Integration of heterogeneous data

thesaurus

Page 87: Integration of heterogeneous data

co-mentioning

Page 88: Integration of heterogeneous data

statistical methods
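
A minimal sketch of co-mentioning with a simple statistical score, assuming each document has already been reduced (for example by thesaurus matching) to the set of gene or protein names it mentions. The observed/expected ratio used here is one illustrative choice and is not claimed to be the scoring scheme behind the slides; the function name and toy corpus are hypothetical.

```python
from collections import Counter
from itertools import combinations


def comention_scores(documents):
    """Score name pairs by observed vs. expected co-mentioning.

    documents: list of sets, each holding the names found in one document.
    Returns {(name_a, name_b): observed / expected}, where 'expected' is the
    number of shared documents if the two names were mentioned independently.
    """
    n_docs = len(documents)
    single = Counter()
    pairs = Counter()
    for names in documents:
        single.update(names)
        pairs.update(combinations(sorted(names), 2))
    scores = {}
    for (a, b), observed in pairs.items():
        expected = single[a] * single[b] / n_docs
        scores[(a, b)] = observed / expected
    return scores


# Toy corpus: four "abstracts" reduced to the names they mention.
corpus = [
    {"CYC1", "HAP1"},
    {"CYC1", "HAP1"},
    {"GAL4"},
    {"CYC7", "HAP1"},
]
scores = comention_scores(corpus)
```

A ratio above 1 means the pair is mentioned together more often than chance would suggest; with a corpus this small the numbers carry no statistical weight.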

Page 89: Integration of heterogeneous data

NLP: Natural Language Processing

Page 90: Integration of heterogeneous data

Gene and protein names
Cue words for entity recognition
Verbs for relation extraction

[nxgene The GAL4 gene]

[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]

Page 91: Integration of heterogeneous data
Page 92: Integration of heterogeneous data

restricted access

Page 93: Integration of heterogeneous data

Reflect

Page 94: Integration of heterogeneous data

augmented browsing

Page 95: Integration of heterogeneous data

Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009

Page 96: Integration of heterogeneous data

integration

Page 97: Integration of heterogeneous data

the easy problems

Page 98: Integration of heterogeneous data

many databases

Page 99: Integration of heterogeneous data

different formats

Page 100: Integration of heterogeneous data

different identifiers

Page 101: Integration of heterogeneous data

partially redundant

Page 102: Integration of heterogeneous data

parsers

Page 103: Integration of heterogeneous data

thesaurus

Page 104: Integration of heterogeneous data

bookkeeping
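
The three remedies above (parsers, a thesaurus, and bookkeeping) can be sketched as follows, assuming every source has already been parsed into `(source, name_a, name_b)` tuples. The thesaurus contents, identifiers, and database names are hypothetical placeholders, not real records.

```python
def merge_interactions(records, thesaurus):
    """Map names to canonical identifiers and merge redundant records.

    records:   iterable of (source, name_a, name_b) tuples from the parsers.
    thesaurus: dict mapping upper-cased names/synonyms to canonical IDs.
    Returns {frozenset({id_a, id_b}): set of sources reporting that pair}.
    """
    merged = {}
    for source, name_a, name_b in records:
        id_a = thesaurus.get(name_a.upper())
        id_b = thesaurus.get(name_b.upper())
        if id_a is None or id_b is None:
            continue  # names missing from the thesaurus are skipped here
        key = frozenset((id_a, id_b))  # unordered pair: A-B equals B-A
        merged.setdefault(key, set()).add(source)  # bookkeeping of provenance
    return merged


# Hypothetical thesaurus and parsed records.
thesaurus = {"GENEA": "ID1", "GENE-A": "ID1", "GENEB": "ID2"}
records = [
    ("db1", "GeneA", "GeneB"),
    ("db2", "geneb", "gene-a"),   # same pair, other order and synonym
    ("db3", "GeneA", "unknown"),  # cannot be mapped
]
merged = merge_interactions(records, thesaurus)
```

Keeping the set of sources per pair is what makes partial redundancy visible: the same interaction reported by two databases becomes one entry with two provenance labels.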

Page 105: Integration of heterogeneous data

the hard problems

Page 106: Integration of heterogeneous data

many data types

Page 107: Integration of heterogeneous data

not comparable

Page 108: Integration of heterogeneous data

variable quality

Page 109: Integration of heterogeneous data

raw quality scores

Page 110: Integration of heterogeneous data

intergenic distances

Page 111: Integration of heterogeneous data

Korbel et al., Nature Biotechnology, 2004

Page 112: Integration of heterogeneous data

correlations

Page 113: Integration of heterogeneous data
Page 114: Integration of heterogeneous data

reproducibility

Page 115: Integration of heterogeneous data

von Mering et al., Nucleic Acids Research, 2005

Page 116: Integration of heterogeneous data

score calibration

Page 117: Integration of heterogeneous data

gold standard

Page 118: Integration of heterogeneous data

von Mering et al., Nucleic Acids Research, 2005
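
A minimal sketch of score calibration against a gold standard, assuming raw scores are binned and each bin is assigned the fraction of its pairs that appear in the gold-standard set. The binning scheme, function name, and toy data are illustrative assumptions; the slides name only the concepts.

```python
def calibrate(scored_pairs, gold_standard, bin_edges):
    """Estimate, per raw-score bin, the fraction of gold-standard pairs.

    scored_pairs:  iterable of ((a, b), raw_score).
    gold_standard: set of frozenset({a, b}) pairs treated as true.
    bin_edges:     ascending edges; bin i covers [edges[i], edges[i+1]).
    Returns one fraction per bin, or None for bins without any pairs.
    """
    n_bins = len(bin_edges) - 1
    hits = [0] * n_bins
    totals = [0] * n_bins
    for pair, raw in scored_pairs:
        for i in range(n_bins):
            if bin_edges[i] <= raw < bin_edges[i + 1]:
                totals[i] += 1
                if frozenset(pair) in gold_standard:
                    hits[i] += 1
                break
    return [h / t if t else None for h, t in zip(hits, totals)]


# Toy data: raw scores from one evidence type and a tiny gold standard.
scored = [
    (("A", "B"), 0.9),
    (("A", "C"), 0.8),
    (("B", "C"), 0.6),
    (("B", "D"), 0.2),
    (("C", "D"), 0.1),
]
gold = {frozenset(("A", "B")), frozenset(("A", "C"))}
calibrated = calibrate(scored, gold, [0.0, 0.5, 1.0])
```

Once every evidence type is expressed as such a fraction, scores that were originally not comparable (intergenic distances, correlations, reproducibility) share one scale.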

Page 119: Integration of heterogeneous data

spread over 630 genomes

Page 120: Integration of heterogeneous data

transfer by orthology

Page 121: Integration of heterogeneous data

von Mering et al., Nucleic Acids Research, 2005

Page 122: Integration of heterogeneous data

two modes

Page 123: Integration of heterogeneous data

COG mode

Page 124: Integration of heterogeneous data

von Mering et al., Nucleic Acids Research, 2005

Page 125: Integration of heterogeneous data

protein mode

Page 126: Integration of heterogeneous data

von Mering et al., Nucleic Acids Research, 2005

Page 127: Integration of heterogeneous data

combine all evidence

Page 128: Integration of heterogeneous data

P = 1 - (1 - P1)(1 - P2)(1 - P3) …
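
The formula above can be sketched in a few lines, assuming each Pi is a calibrated probability from one evidence type and that the evidence types are treated as independent. The function name and example values are hypothetical.

```python
def combine_scores(probabilities):
    """Combine per-evidence probabilities as P = 1 - prod(1 - P_i)."""
    p_none = 1.0  # probability that no evidence type supports the link
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
        p_none *= 1.0 - p
    return 1.0 - p_none


# Three evidence types with illustrative probabilities 0.5, 0.5 and 0.6:
# 1 - (0.5 * 0.5 * 0.4) = 0.9
combined = combine_scores([0.5, 0.5, 0.6])
```

The combined score is never lower than the strongest single piece of evidence, and several weak pieces can add up to a confident link.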

Page 129: Integration of heterogeneous data

visualize

Page 130: Integration of heterogeneous data

Kuhn et al., Nucleic Acids Research, 2010

Page 131: Integration of heterogeneous data

access

Page 132: Integration of heterogeneous data

access for humans

Page 133: Integration of heterogeneous data

web interfaces

Page 134: Integration of heterogeneous data
Page 135: Integration of heterogeneous data
Page 136: Integration of heterogeneous data
Page 137: Integration of heterogeneous data

access for computers

Page 138: Integration of heterogeneous data

web services

Page 139: Integration of heterogeneous data

REST: Representational State Transfer
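
A minimal sketch of what REST-style access for computers looks like: the whole request is encoded in a URL that a script can build and fetch. The base URL, method name, and parameter names below are hypothetical placeholders and do not describe the actual interface of any of the resources in this talk.

```python
from urllib.parse import urlencode


def build_query_url(base_url, method, output_format, **params):
    """Assemble a REST-style URL: <base>/<format>/<method>?<parameters>."""
    query = urlencode(sorted(params.items()))  # sorted for a stable URL
    return f"{base_url}/{output_format}/{method}?{query}"


# Hypothetical service and parameters, for illustration only.
url = build_query_url(
    "https://example.org/api",
    "interactions",
    "tsv",
    identifier="CYC1",
    species=4932,
)
```

The resulting URL could then be fetched with any HTTP client; because the request is just a URL, the same query can also be pasted into a browser for inspection.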

Page 140: Integration of heterogeneous data

SOAP: Simple Object Access Protocol

Page 141: Integration of heterogeneous data

Acknowledgments

STITCH

– Michael Kuhn

– Damian Szklarczyk

– Andrea Franceschini

– Monica Campillos

– Christian von Mering

– Lars Juhl Jensen

– Andreas Beyer

– Peer Bork

Reflect

– Sean O’Donoghue

– Heiko Horn

– Sune Frankild

– Evangelos Pafilis

– Michael Kuhn

– Nigel Brown

– Reinhardt Schneider

STRING

– Christian von Mering

– Michael Kuhn

– Manuel Stark

– Samuel Chaffron

– Chris Creevey

– Jean Muller

– Tobias Doerks

– Philippe Julien

– Alexander Roth

– Milan Simonovic

– Jan Korbel

– Berend Snel

– Martijn Huynen

– Peer Bork

Page 142: Integration of heterogeneous data
