Page 1: Mining the functional genomics data III Data integration: Gene Ontology, PPI, URLMAP Jaak Vilo vilo@egeen.ee Havana, Cuba, 21.11.2003.

Mining the functional genomics data III

Data integration: Gene Ontology, PPI, URLMAP

Jaak Vilo, vilo@egeen.ee

Havana, Cuba, 21.11.2003

Page 2:

Components of Expression Profiler (http://ep.ebi.ac.uk/)

• EPCLUST: expression data
• GENOMES: sequence, function, annotation
• SPEXS: discover patterns
• URLMAP: provide links
• PATMATCH: visualise patterns
• EP:GO: GeneOntology
• EP:PPI: protein-protein interactions
• SEQLOGO
• External data, tools: pathways, function, etc.

Page 3:

Expression Profiler: EPCLUST

DATA SELECT/FILTER

FOLDER ANALYZE

A “CLUSTER”

URLMAP

GeneOntology, Pathways, Databases, SPEXS, Other tools

Page 4:

URLMAP

• Given a cluster of genes, there are many web-based tools and databases to consult for follow-up. How do we link to them?

• How do we manage many links and many tools?

• Answer: centralize the linking

Page 5:

URLMAP - no need to “cut & paste”

KEGG:

SRS/InterPro

• Generates all links/forms dynamically

• Maintain links in one place

• Handle renaming of gene IDs by synonyms

• Allow domain-specific link pages

Page 6:

A Simple Metabolic Pathway

Shoshanna Wodak, Jacques van Helden

Page 7:

Links for each item type

• Yeast S. cerevisiae gene IDs (ORF name, SP id, SGD ID, …)

• Pattern collections, e.g. substrings passed to SEQLOGO for profile generation

• Keyword searches in web-based search engines

Page 8:

Management of links

• Hierarchies of link collections
• One can point to any (sub)hierarchy directly
• LINK =
  • URL, title
  • form parameters
  • modifications/code
  • DB lookups for synonyms
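The idea of keeping links in one place can be illustrated with a minimal sketch. It assumes a link is a title, a base URL and form parameters, that link collections form a nested hierarchy, and that synonyms are looked up before the URL is built. All names (`Link`, `LINKS`, `SYNONYMS`, `links_for`) and the example.org URLs are hypothetical and are not the URLMAP implementation.

```python
from dataclasses import dataclass, field
from urllib.parse import urlencode


@dataclass
class Link:
    """One entry of a link collection: title, base URL and form parameters."""
    title: str
    base_url: str                 # hypothetical endpoint, not a real service
    id_param: str                 # query parameter that carries the gene ID
    extra_params: dict = field(default_factory=dict)

    def url_for(self, gene_id: str) -> str:
        params = {self.id_param: gene_id, **self.extra_params}
        return f"{self.base_url}?{urlencode(params)}"


# Illustrative synonym table: alias -> ID expected by the target databases.
SYNONYMS = {"CDC28": "YBR160W"}

# Link collections form a hierarchy addressed by a path such as "yeast/pathways".
LINKS = {
    "yeast": {
        "pathways": [Link("Pathway DB", "http://example.org/pathway", "orf")],
        "proteins": [Link("Protein DB", "http://example.org/protein", "id",
                          {"format": "html"})],
    }
}


def collect(node):
    """Flatten a (sub)hierarchy into a list of Link objects."""
    if isinstance(node, list):
        return list(node)
    return [link for child in node.values() for link in collect(child)]


def links_for(genes, path=""):
    """Return (gene, title, url) triples for each gene and each link under `path`."""
    node = LINKS
    for part in filter(None, path.split("/")):
        node = node[part]
    result = []
    for gene in genes:
        canonical = SYNONYMS.get(gene, gene)   # synonym lookup before linking
        for link in collect(node):
            result.append((gene, link.title, link.url_for(canonical)))
    return result
```

Calling `links_for(["CDC28"], "yeast/pathways")` builds the links of one sub-hierarchy only, while an empty path covers every collection.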

Page 9:

“Screen scraping” – doable with a little perl programming

[Figure: the gene list g1, g2, g356 is submitted to several web resources and the results for g1, g2 and g356 are collected into a single report]

Page 10:

Gene Ontology™

www.geneontology.org

• GO is a systematic effort for data annotation
• Three independent ontologies
  • Molecular Function
  • Biological Process
  • Cellular Component
• How to integrate that into analysis tools?

Page 11:
Page 12:

DAG Structure

Annotate to any level within DAG

mitosis: S.c. NNF1

mitotic chromosome condensation: S.c. BRN1, D.m. barren
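Annotating at any level of the DAG means that a gene attached to a specific term also counts for every ancestor of that term. The sketch below shows this propagation on a tiny fragment. The parent table (including the "cell cycle" root) is a simplified assumption for illustration, not the real GO structure.

```python
# Hypothetical fragment of the process ontology: term -> list of parent terms.
PARENTS = {
    "mitotic chromosome condensation": ["mitosis"],
    "mitosis": ["cell cycle"],
    "cell cycle": [],
}

# Direct annotations from the slide: gene -> terms it was annotated to.
DIRECT = {
    "NNF1": {"mitosis"},
    "BRN1": {"mitotic chromosome condensation"},
}


def ancestors(term):
    """All terms reachable from `term` via parent links (the term itself excluded)."""
    seen = set()
    stack = list(PARENTS.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def genes_by_term():
    """Map each term to the genes annotated to it or to any of its descendants."""
    result = {}
    for gene, terms in DIRECT.items():
        for term in terms:
            for t in {term} | ancestors(term):
                result.setdefault(t, set()).add(gene)
    return result
```

With this propagation, a query for "mitosis" returns both NNF1 and BRN1, while the more specific term returns BRN1 only.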

Page 13:

GO Annotation: Data

• Database object: gene or gene product
• GO term ID
• Reference
  • publication or computational method
• Evidence supporting annotation

Page 14:

GO Evidence Codes

IDA - Inferred from Direct Assay
IMP - Inferred from Mutant Phenotype
IGI - Inferred from Genetic Interaction
IPI - Inferred from Physical Interaction
IEP - Inferred from Expression Pattern
TAS - Traceable Author Statement
NAS - Non-traceable Author Statement
IC - Inferred by Curator
ISS - Inferred from Sequence or structural Similarity
IEA - Inferred from Electronic Annotation
ND - Not Determined
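One way to carry the annotation fields and evidence codes into an analysis tool is a small record type with a filter on the evidence code. This is a sketch only: the class name, field names, gene names and references are hypothetical, and excluding IEA and ND by default is an illustrative choice rather than a recommendation from the slides.

```python
from dataclasses import dataclass

EVIDENCE_CODES = {
    "IDA": "Inferred from Direct Assay",
    "IMP": "Inferred from Mutant Phenotype",
    "IGI": "Inferred from Genetic Interaction",
    "IPI": "Inferred from Physical Interaction",
    "IEP": "Inferred from Expression Pattern",
    "TAS": "Traceable Author Statement",
    "NAS": "Non-traceable Author Statement",
    "IC": "Inferred by Curator",
    "ISS": "Inferred from Sequence or structural Similarity",
    "IEA": "Inferred from Electronic Annotation",
    "ND": "Not Determined",
}


@dataclass(frozen=True)
class Annotation:
    """One GO annotation: database object, GO term ID, reference, evidence."""
    gene: str
    go_id: str
    reference: str
    evidence: str

    def __post_init__(self):
        if self.evidence not in EVIDENCE_CODES:
            raise ValueError(f"unknown evidence code: {self.evidence}")


def filter_by_evidence(annotations, exclude=("IEA", "ND")):
    """Keep annotations whose evidence code is not in `exclude`."""
    return [a for a in annotations if a.evidence not in exclude]


# Hypothetical records; gene names and references are placeholders.
EXAMPLE = [
    Annotation("geneA", "GO:0006364", "ref1", "IDA"),
    Annotation("geneB", "GO:0006364", "ref2", "IEA"),
    Annotation("geneC", "GO:0005730", "ref3", "TAS"),
]
```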

Page 15:

GO Evidence Codes: sources of evidence

• From primary literature
• From reviews or introductions
• Automated

Page 16:

Example (GoMiner)

Page 17:

EP:GO tool for GeneOntology

• Browse

• Search by keywords: EC number, term, etc.

• Get associated genes

• Submit associated genes to URLMAP

• Annotate gene clusters using GO terms

Page 18:

URLMAP => Look up expression data

EP:GO EPCLUST

Page 19:

Annotate Clusters (EP:GO)

[Figure: clusters numbered 1 to 6 over genes A to J (A,D; B,C; E; F,G,H; J; I), compared with GO term gene sets (F,G,H; F,G; B,E; B,A; F,G,I; B,E,F,I)]

Page 20:

Set overlap

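GO term gene set G vs. CLUSTER C, within N genes

A: |G ∩ C| / min(|G|, |C|)

B: P(choose |C| from N with |G|, observe |G ∩ C|+), i.e. the probability of observing an overlap of |G ∩ C| or more when |C| genes are chosen at random from N genes, |G| of which carry the GO term

[Figure: sets G and C within the N genes, overlapping in G ∩ C]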
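Both overlap measures are short to compute. The sketch below assumes measure B is the upper tail of the hypergeometric distribution, which matches the description on the slide; the function names are illustrative.

```python
from math import comb


def overlap_score(g, c):
    """Measure A: |G ∩ C| / min(|G|, |C|) for two sets of gene IDs."""
    return len(g & c) / min(len(g), len(c))


def hypergeom_tail(n_total, n_term, n_cluster, k):
    """Measure B: P(X >= k) when n_cluster genes are drawn at random from
    n_total genes, of which n_term carry the GO term."""
    upper = min(n_term, n_cluster)
    favourable = sum(
        comb(n_term, i) * comb(n_total - n_term, n_cluster - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(n_total, n_cluster)
```

For a cluster line such as "47 from cluster (size 98) vs 187 in this class", the call would be `hypergeom_tail(N, 187, 98, 47)`, where the total number of genes N has to be supplied by the user (it is not stated on the slide).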

Page 21:

Annotation of clusters

• GO:0042254 <U:L> Process: ribosome biogenesis and assembly (+2:15) (depth=7) [sgd:2:187]
  GO:0042254: 47 from cluster (size 98) vs 187 in this class (including subclasses)
• GO:0006364 <U:L> Process: rRNA processing (+3:3) (depth=8) [sgd:50:126]
  GO:0006364: 35 from cluster (size 98) vs 126 in this class (including subclasses)
• GO:0006360 <U:L> Process: transcription from Pol I promoter (+6:14) (depth=8) [sgd:23:155]
  GO:0006360: 38 from cluster (size 98) vs 155 in this class (including subclasses)
• GO:0005730 <U:L> Component: nucleolus (+10:17) (depth=6) [sgd:154:210]
  GO:0005730: 45 from cluster (size 98) vs 210 in this class (including subclasses)
• GO:0030515 <U:L> Function: snoRNA binding (depth=6) [sgd:23:23]
  GO:0030515: 17 from cluster (size 98) vs 23 in this class (including subclasses)
• GO:0030490 <U:L> Process: processing of 20S pre-rRNA (depth=9) [sgd:33:33]
  GO:0030490: 18 from cluster (size 98) vs 33 in this class (including subclasses)
• GO:0005732 <U:L> Component: small nucleolar ribonucleoprotein complex (depth=6) [sgd:30:30]
  GO:0005732: 16 from cluster (size 98) vs 30 in this class (including subclasses)
• GO:0006396 <U:L> Process: RNA processing (+7:52) (depth=7) [sgd:7:370]
  GO:0006396: 40 from cluster (size 98) vs 370 in this class (including subclasses)
• …

Page 22:

>YAL036C chromo=1 coord=(76154-75048(C)) start=-600 end=+2 seq=(76152-76754)
TGTTCTTTCTTCTTCTGCTTCTCCTTTTCCTTTTTTTCCTTCTCCTTTTCCTTCTTGGACTTTAGTATAGGCTTACCATCCTTCTTCTCTTCAATAACCTTCTTTTCTTGCTTCTTCTTCGATTGCTTCAAAGTAGACATGAAGTCGCCTTCAATGGCCTCAGCACCTTCAGCACTTGCACTTGCTTCTCTGGAAGTGTCATCTGCACCTGCGCTGCTTTCTGGATTTGGAGTTGGCGTGGCACTGATTTCTTCGTTCTGGGCGGCGTCTTCTTCGAATTCCTCATCCCAGTAGTTCTGTTGGTTCTTTTTACTCTTTTTCGCCATCTTTCACTTATCTGATGTTCCTGATTGCCCTTCTTATCCCCTCAAAGTTCACCTTTGCCACTTATTCTAGTGCAAGATCTCTTGCTTTCAATGGGCTTAAAGCTTGAAAAATTTTTTCACATCACAAGCGACGAGGGCCCGTTTTTTTCATCGATGAGCTATAAGAGTTTTCCACTTTTAAGATGGGATATTACGGTGTGATGAGGGCGCAATGATAGGAAGTGTTTGAAGCTAGATGCAGTAGGTGCAAGCGTAGAGTTGTTGATTGAGCAAA_ATG_

>YAL025C chromo=1 coord=(101147-100230(C)) start=-600 end=+2 seq=(101145-101747)
CTTAGAAGATAAAGTAGTGAATTACAATAAATTCGATACGAACGTTCAAATAGTCAAGAATTTCATTCAAAGGGTTCAATGGTCCAAGTTTTACACTTTCAAAGTTAACCACGAATTGCTGAGTAAGTGTGTTTATATTAGCACATTAACACAAGAAGAGATTAATGAACTATCCACATGAGGTATTGTGCCACTTTCCTCCAGTTCCCAAATTCCTCTTGTAAAAAACTTTGCATATAAAATATACAGATGGAGCATATATAGATGGAGCATACATACATGTTTTTTTTTTTTTAAAAACATGGACTCGAACAGAATAAAAGAATTTATAATGATAGATAATGCATACTTCAATAAGAGAGAATACTTGTTTTTAAATGAGAATTGCTTTCATTAGCTCATTATGTTCAGATTATCAAAATGCAGTAGGGTAATAAACCTTTTTTTTTTTTTTTTTTTTTTTTGAAAAATTTTCCGATGAGCTTTTGAAAAAAAATGAAAAAGTGATTGGTATAGAGGCAGATATTGCATTGCTTAGTTCTTTCTTTTGACAGTGTTCTCTTCAGTACATAACTACAACGGTTAGAATACAACGAGGAT_ATG_

...

>YBR084W chromo=2 coord=(411012-413936) start=-600 end=+2 seq=(410412-411014)
CCATGTATCCAAGACCTGCTGAAGATGCTTACAATGCCAATTATATTCAAGGTCTGCCCCAGTACCAAACATCTTATTTTTCGCAGCTGTTATTATCATCACCCCAGCATTACGAACATTCTCCACATCAAAGGAACTTTACGCCATCCAACCAATCGCATGGGAACTTTTATTAAATGTCTACATACATACATACATCTCGTACATAAATACGCATACGTATCTTCGTAGTAAGAACCGTCACAGATATGATTGAGCACGGTACAATTATGTATTAGTCAAACATTACCAGTTCTCGAACAAAACCAAAGCTACTCCTGCAACACTCTTCTATCGCACATGTATGGTTCTTATTGTTTCCCGAGTTCTTTTTTACTGACGCGCCAGAACGAGTAAGAAAGTTCTCTAGCGCCATGCTGAAATTTTTTTCACTTCAACGGACAGCGATTTTTTTTCTTTTTCCTCCGAAATAATGTTGCAGCGGTTCTCGATGCCTCAAGAATTGCAGAAGTAAACCAGCCAATACACATCAAAAAACAACTTTCATTACTGTGATTCTCTCAGTCTGTTCATTTGTCAGATATTTAAGGCTAAAAGGAA_ATG_

101 sequences relative to ORF start

GATGAG.T 1:52/70 2:453/508 R:7.52345 BP:1.02391e-33
G.GATGAG.T 1:39/49 2:193/222 R:13.244 BP:2.49026e-33
AAAATTTT 1:63/77 2:833/911 R:4.95687 BP:5.02807e-32
TGAAAA.TTT 1:45/53 2:333/350 R:8.85687 BP:1.69905e-31
TG.AAA.TTT 1:53/61 2:538/570 R:6.45662 BP:3.24836e-31
TG.AAA.TTTT 1:40/43 2:254/260 R:10.3214 BP:3.84624e-30
TGAAA..TTT 1:54/65 2:608/645 R:5.82106 BP:1.0887e-29
...

GATGAG.T, TGAAA..TTT

YGR128C + 100
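Checking how many upstream sequences contain a pattern such as GATGAG.T can be sketched with regular expressions. The sketch assumes that "." stands for any single nucleotide, as in regular-expression syntax; the function names and toy sequences are illustrative, and this is not the SPEXS algorithm itself.

```python
import re


def count_matching(pattern, sequences):
    """Number of sequences containing at least one match of `pattern`.
    '.' in the pattern is treated as a single-character wildcard."""
    rx = re.compile(pattern)
    return sum(1 for seq in sequences if rx.search(seq))


def match_positions(pattern, seq):
    """Start offsets of all matches of `pattern` in `seq`, overlaps included."""
    rx = re.compile(f"(?={pattern})")   # lookahead so overlapping hits are found
    return [m.start() for m in rx.finditer(seq)]


# Toy sequences, not taken from the yeast genome.
TOY = ["AAGATGAGCTAA", "TTTTTT", "GATGAGGT"]
```

Counting matches in a cluster and in a background set gives the two occurrence counts needed for an over-representation test.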

Page 23:

EP:PPI Protein-protein interaction

• There are high-throughput technologies for identifying hypothetical protein-protein interactions

• Which of these are more likely to be true?

• Can these predictions help to predict gene function?

Page 24:
Page 25:

PPI pairs

Page 26:

We have expression data

Page 27:

Cluster

Page 28:

Trust those within the same cluster

Page 29:

PPI are enriched within clusters

Ge, Liu, Church, Vidal: Nature Genetics Nov. 2001

Page 30:

Protein-protein interactions: which to trust more?

Answer: Use the distance measure alone
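Ranking putative interactions by the expression distance between the two partners can be sketched as follows. The slide does not say which distance measure is used, so the choice of 1 minus Pearson correlation is an assumption, as are the function names and the toy profiles.

```python
from math import sqrt


def pearson_distance(x, y):
    """1 - Pearson correlation: 0 for identically shaped profiles, 2 for opposite ones."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def rank_pairs(pairs, profiles):
    """Sort putative interacting pairs by expression distance, most similar first."""
    return sorted(
        pairs,
        key=lambda p: pearson_distance(profiles[p[0]], profiles[p[1]]),
    )


# Toy expression profiles for four hypothetical proteins.
PROFILES = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.0],
    "C": [1.0, 2.0, 3.0, 4.0],
    "D": [4.0, 3.0, 2.0, 1.0],
}
```

In the toy data the pair A and B is co-expressed and would be trusted more than the pair C and D, whose profiles are anti-correlated.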

Page 31:

Kemmeren et al., Molecular Cell, Vol. 9, 1133–1143, May 2002

[Figure panels: Randomized expression data; Yeast 2-hybrid studies; Known (literature) PPI. Labelled pairs: MPK1 YLR350w, SNF4 YCL046W, SNF7 YGR122W]

Page 32:

Interacting pairs of proteins: A and B; C and D. Which would you trust?

[Figure: expression profiles of A, B, C and D, with the distance d shown for each pair]

Page 33:

EP:PPI – combine PPI and expression

Page 34:

Results

• Confidence in 973 out of 5342 putative two-hybrid interactions from S. cerevisiae is increased.

• Besides verification, integration of expression and interaction data is employed to provide functional annotation for over 300 previously uncharacterized genes.

• The robustness of these approaches is demonstrated by experiments that test the in silico predictions made.

• This study shows how integration improves the utility of different types of functional genomic data and how well this contributes to functional annotation.

Page 35:

Gene regulation by transcription factors

[Figure: DNA with promoter and coding regions of GENE 1 to GENE 4, transcription factors, and the corresponding network nodes G1, G2, G3, G4]

Page 36:

Networks

• Graphical models

• Directed labelled graph

• Nodes: genes

• Arcs/Edges: relationships

• Labels: types of relationships
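A directed labelled graph of this kind needs very little machinery. The sketch below stores arcs per source gene together with a label and a weight; the class name, method names and label strings are illustrative assumptions.

```python
from collections import defaultdict


class GeneNetwork:
    """Directed labelled graph: nodes are genes, arcs carry a label and a weight."""

    def __init__(self):
        self.nodes = set()
        self._out = defaultdict(list)   # source -> [(target, label, weight)]

    def add_arc(self, source, target, label, weight=1.0):
        self.nodes.update((source, target))
        self._out[source].append((target, label, weight))

    def arcs(self, label=None):
        """All arcs as (source, target, label, weight), optionally of one label."""
        return [
            (s, t, l, w)
            for s, edges in self._out.items()
            for (t, l, w) in edges
            if label is None or l == label
        ]

    def successors(self, gene, label=None):
        """Genes reached by one arc from `gene`, optionally of one label."""
        return {t for (t, l, _) in self._out.get(gene, []) if label is None or l == label}


# Toy network; the label names are hypothetical relationship types.
net = GeneNetwork()
net.add_arc("A", "B", "interacts")
net.add_arc("A", "C", "knockout-affects", weight=2.5)
net.add_arc("B", "C", "interacts")
```

Keeping the label on each arc lets the same node set support several networks, one per relationship type.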

Page 37:

Graph drawing

[Figure: an arc from start node A (gene) to end node B (gene), labelled with the connection weight w]

Page 38:

Different interpretation of arcs

• Edges can have different meanings, hence different networks
  • Binding site for A is in front of B
  • Proteins A and B interact
  • Deletion of gene A affects expression of B (is somewhere in the regulation cascade)
  • “Literature” mentions genes together

Page 39:

Gene regulation by transcription factors

[Figure repeated from Page 35]

Page 40:

Deletion mutants (gene knockouts)

[Figure: gene A, gene B, gene C and gene D, with network diagrams over the nodes A, B, C and D]
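Turning knockout expression data into a network can be sketched as a thresholding step: an arc A → B is drawn when deleting A changes the expression of B strongly enough. The data layout (a nested dict of normalised expression changes), the function name and the toy values are assumptions for illustration, not the procedure used for the slides.

```python
def disruption_network(changes, threshold):
    """changes[a][b]: normalised expression change of gene b when gene a is deleted.
    Returns arcs (a, b, direction) for changes with magnitude >= threshold."""
    arcs = []
    for a, row in changes.items():
        for b, value in row.items():
            if a != b and abs(value) >= threshold:   # skip the deleted gene itself
                arcs.append((a, b, "up" if value > 0 else "down"))
    return arcs


# Toy data for two hypothetical deletion mutants.
CHANGES = {
    "A": {"A": -9.0, "B": 4.5, "C": 0.3},
    "D": {"B": -2.5, "C": -5.0},
}
```

Lowering the threshold adds arcs, which is why the same data yield a sparse network at a high threshold and a dense one at a low threshold.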

Page 41:

Hughes, T. R. et al: “Functional Discovery via a Compendium of Expression Profiles”, Cell 102 (2000), 109-126.

Page 42:

• Green arrows: upregulation
• Red arrows: downregulation
• Thickness of arrow represents certainty of direction (up/down)

Page 43:

A complete graph

Page 44:

Features/distributions that do not depend on discretisation thresholds

• Visual inspection, biological interpretation

• General statistics and features of the graphs

• Indegree/Outdegree

• Complexity of the networks

  • What is the modularity?
  • How many components?
  • Deletion of hot-spots, does it break the net?
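Indegree, outdegree and candidate hot-spots can be read directly off an arc list. A minimal sketch, assuming arcs are given as (source, target) pairs; the function names are illustrative.

```python
from collections import Counter


def degrees(arcs):
    """Return (indegree, outdegree) counters for a list of (source, target) arcs."""
    out_deg = Counter(a for a, _ in arcs)
    in_deg = Counter(b for _, b in arcs)
    return in_deg, out_deg


def degree_histogram(degree):
    """How many genes have each degree value (genes with degree 0 are not listed)."""
    return Counter(degree.values())


def hot_spots(arcs, k):
    """The k genes with the highest total (in + out) degree."""
    in_deg, out_deg = degrees(arcs)
    total = in_deg + out_deg
    return [gene for gene, _ in total.most_common(k)]


# Toy arc list.
ARCS = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
```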

Page 45:

Filter

• choose a list of genes (MATING, marked in red)
• filter for these genes plus neighbouring genes from the graph

Mutation network, threshold = 4

[Figure: network drawing. Node labels: CUP5, AKR1, VMA8, YAR014C, SST2, YEL044W, YER050C, MFA1, STE2, BAR1, MFA2, AGA1, AFG3, FUS1, FKS1, FUS3, VCX1, ADR1, URA3, ICL1, YGR250C, PGU1, YLR042C, YNR067C, HOG1, FIG1, AGA2, KSS1, RAD6, STE6, RAS2, RPD3, CRS4, ASG7, KAR4, NRC465, YIL080W, FUS2, YNL279W, YOL154W, YPL156C, YPL192C, YML048W-A, STE11, STE12, GPA1, STE18, STE24, STE4, STE5, STE7, TUP1, YER044C, YJL107C, AFR1, SHE4, CMK2, PHO89, RAD16, CYC8, QCR2, SWI4, NPR2]

Page 46:

Mutation network, threshold = 2

[Figure: network drawing. Node labels: AEP2, AKR1, CMK2, ANP1, RAD16, AFR1, CEM1, CUP5, SST2, DIG1, UBP10, STE2, ERG2, PHO89, ERG6, GAS1, PTP2, GYP1, HIR2, HPT1, ISW1, FIG1, ISW2, KIN3, MAC1, MRPL33, MSU1, NPR2, PET111, RAD57, RIP1, RRP6, ASG7, STE6, RTS1, SCS7, SGS1, MFA1, SHE4, AGA1, SWI4, FUS1, SWI5, VAC8, VMA8, YAL004W, YAR014C, YEL044W, YER050C, FUS3, GPA1, BAR1, MFA2, YER083C, RTT104, YMR014W, YMR029C, AGA2, YMR031W-A, YMR293C, YOR078W, ADE2, AFG3, BNI1, CLA4, ERG3, FKS1, KAR4, YAR064W, CHS3, VAP1, ICS2, YCLX09W, YDL009C, STP4, PMT1, VCX1, HO, THI13, ADR1, YDR249C, PAM1, YDR275W, HXT7, HXT6, YDR366C, YDR534C, URA3, YEL071W, MNN1, ICL1, RNR1, YER130C, YER135C, SPI1, DMC1, HSP12, NIL1, GSC2, KSS1, MUP1, YGR138C, SKN1, YGR250C, YHR097C, YHR116W, YHR122W, YHR145C, YIL060W, YIL096C, YIL117C, RHO3, YIL122W, FKH1, NCA3, YJL145W, RPL17B, YJL217W, CYC1, DAN1, PGU1, GFA1, HAP4, RRN3, STE3, PRY2, KTR2, SRL3, YLR040C, YLR042C, SSP120, HSP60, YLR297W, RPS22B, YLR413W, HOF1, DDR48, RNA1, YMR266W, YNL078W, SPC98, YNL133C, YNL217W, WSC2, YPT11, RFA2, YNR009W, YNR067C, MDH2, YOL154W, NDJ1, WSC3, CDC21, PFY1, RGA1, MSB1, SRL1, YOR248W, YOR296W, YOR338W, GDS1, PDE2, FRE5, YPL080C, RPS9A, BBP1, YPL256C, SUA7, MEP3, YPR156C, HMG1, HOG1, MED2, QCR2, RAD6, RAS2, RPD3, RPS24A, CRS4, CYC8, YAR031W, YBR012C, HIS7, YCLX07W, YCRX18C, PCL2, YDR124W, ECM18, APA2, YER024W, HOM3, THI5, YGL053W, NRC465, YGR161C, YHR055C, YIL037C, YIL080W, YIL082W, HIS5, YJL037W, SAG1, CPA2, AAD10, HYM1, MET1, MID2, YML047C, KAR5, CIK1, FUS2, SCW10, BOP3, YNL279W, THI12, YOL119C, YOR203W, TEA1, ISU1, YPL156C, YPL192C, YPL250C, KAR3, YIL082W-A, YML048W-A, YMR085W, STE11, STE12, STE18, URA1, URA4, STE24, STE4, STE5, STE7, SWI6, MAK1, TUP1, YER044C, YJL107C]

Page 47:

Lac-Operon (Thomas Schlitt)

[Figure: lacI with its promoter produces the repressor; promoter and operator precede lacZ ...; activator; glucose; galactosidase splits lactose into glucose + galactose]

Page 48:

Gene regulatory networks

• What formalisms to use to describe them?
• When does a model correspond to biological reality?
• How to simulate models on a computer?
• Is it possible to verify models by experiments?
• How to restore networks from raw data without knowing the structure or parameters?

Page 49:

Number of incoming/outgoing edges

Most genes have only a few incoming/outgoing edges, but some have high numbers (>500)

[Histogram: x-axis "number of outgoing edges" (1 to 96, ...), y-axis "count" (0 to 120)]

Page 50:

[Scatter plot: outdegree (0 to 120) against indegree (0 to 45); axis titles "Rank of outdegree" and "Rank of indegree". Labelled genes: ARG5,6 (108,28), SST2 (60,25), TEC1, HPT1, GCN4, ERG3 (164,15), GAS1, FUS3, ERG28, QCR2, YER083C, GLN3, SPF1, MRT4, CLB2, YHL029C]

Page 51:

Threshold  Outdegree group               m    n    Indegree group            m  n
2.0        Carbohydrate metabolism       363  4    Amino-acid metabolism     9  194
2.0        RNA turnover                  353  4    Nucleotide metabolism     6  82
2.0        Meiosis                       244  3    Energy generation         5  242
2.0        Cell stress                   207  9    Small molecule transport  5  343
2.0        Protein translocation         197  3    Other metabolism          5  148
2.8        RNA turnover                  110  4    Amino-acid metabolism     4  167
2.8        Cell stress                   62   8    Nucleotide metabolism     3  67
2.8        Meiosis                       54   3    Energy generation         2  184
2.8        Protein synthesis             53   7    Differentiation           2  43
2.8        Cell wall maintenance         47   6    Small molecule transport  2  286
3.6        RNA turnover                  48   4    Small molecule transport  2  230
3.6        RNA processing/modification   41   4    Other metabolism          2  96
3.6        Cell stress                   27   8    Nucleotide metabolism     2  58
3.6        Small molecule transport      19   8    Mating response           2  57
3.6        Cell wall maintenance         19   6    Amino-acid metabolism     2  133

Cellular role table showing the top 5 groups with the highest median degrees for the networks with threshold = 2.0, 2.8 and 3.6, with a minimum group size of 3 for the outdegree and 40 for the indegree (m: median degree, n: number of genes per group)

Side labels: High outdegree, High indegree; Regulation, Metabolism

Page 52:

Network modularity

• Is there one “big” dominant connected component and possibly a number of small components, or several components of comparable sizes?

• Can the network be broken down into several components of comparable size by removing nodes of high degree (i.e., nodes with many incoming or outgoing edges)?
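Both questions can be explored by counting connected components before and after removing the highest-degree nodes. The sketch below ignores arc direction (weakly connected components), which is an assumption about how components are defined here; the function names and the toy graph are illustrative.

```python
def components(nodes, arcs):
    """Connected components with arc direction ignored, largest first."""
    neighbours = {n: set() for n in nodes}
    for a, b in arcs:
        if a in neighbours and b in neighbours:
            neighbours[a].add(b)
            neighbours[b].add(a)
    seen, result = set(), []
    for start in nodes:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(neighbours[n] - comp)
        seen |= comp
        result.append(comp)
    return sorted(result, key=len, reverse=True)


def remove_top_degree(nodes, arcs, fraction):
    """Drop the given fraction of nodes with the highest total degree."""
    degree = {n: 0 for n in nodes}
    for a, b in arcs:
        degree[a] += 1
        degree[b] += 1
    k = int(len(nodes) * fraction)
    removed = set(sorted(nodes, key=lambda n: degree[n], reverse=True)[:k])
    kept = [n for n in nodes if n not in removed]
    return kept, [(a, b) for a, b in arcs if a not in removed and b not in removed]


# Toy graph: a hub H with four neighbours, one separate pair, three singletons.
NODES = ["H", "A", "B", "C", "D", "E", "F", "G", "I", "J"]
ARCS = [("H", "A"), ("H", "B"), ("H", "C"), ("H", "D"), ("E", "F")]
```

In the toy graph, removing 10% of the nodes removes the hub and the dominant component falls apart into singletons.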

Page 53:

network modularity

Number of connected components in the networks

Page 54:

Number of connected components in the networks

network modularity

Page 55:

Network modularity: number of connected components in the networks

[Table: columns are component, full network, 1% removed, 5% removed, 10% removed; rows are largest, second and total for the thresholds 2.0, 3.0 and 4.0. Values as extracted, cell boundaries lost:
2.0: 5383, 1, 4707, 1, 368222, 261452
3.0: 355622, 246122, 138549, 7646, 17
4.0: 235434, 120537, 5426, 22, 452851]

Page 56:

Modularity: other opinions

• Wagner, Genome Research 2002 – there exist many independent modules

• Featherstone and Broadie, Bioessays 2002 – there is only one giant module

• All depends on the definition of the ‘module’

Page 57:

Gene disruption network for Saccharomyces cerevisiae

Page 58:

a closer look

Page 59:

Filter

• choose a list of genes (MATING, marked in red)
• filter for these genes plus neighbouring genes from the graph

Mutation network, threshold = 4

[Figure repeated from Page 45, with the same node labels]

Page 60:

Mating subnetwork

This subnetwork is the result of filtering the full network at threshold = 4.0 for the core set marked in red and their next neighbours (red arcs: downregulation, green arcs: upregulation).

[Figure: network drawing with the same node labels as on Page 45]

Page 61:

Mating subnetwork

This subnetwork is the result of filtering the full network at threshold = 2.0 for the core set marked in red and their next neighbours (red arcs: downregulation, green arcs: upregulation).

[Figure: network drawing with the same node labels as on Page 46]

Page 62:

Conclusion

• more information than randomised networks
• no optimal threshold
• power-law distribution of arcs
• no obvious modules
• local networks make sense

Page 63:

Lac-Operon (Thomas Schlitt)

[Figure repeated from Page 47]

Page 64:

A gene network(?)

[Figure: nodes b1, b2, b3, F1, F2, r1, r2]

Page 65:

Of transcription factors

Page 66:

Of transcription factors and KO’s

Page 67:

Of transcription factors and KO’s

Page 68:

Hughes, T. R. et al: “Functional Discovery via a Compendium of Expression Profiles”, Cell 102 (2000), 109-126.

Page 69:

Effectual set and regulation set

[Figure: within the set of all genes, the transcription factors and the disrupted genes; a transcription factor t with the regulation set of t, and a disrupted gene h with the effectual set of h]

Page 70:

Effectual set and regulation set

[Figure: within the set of all genes, the transcription factors and the disrupted genes; a gene g, the regulation set of t and the effectual set of h]

Page 71:

How to estimate whether the overlap is larger than expected at random?

[Figure: the set of all genes G containing the sets R and E and their intersection R ∩ E]

We assume that the elements of the set E are marked and pick a set of size |R| at random. Then the size x = |R ∩ E| of the intersection is distributed according to the hypergeometric distribution. The probability of observing an intersection of size k or larger can be computed as:

\[
P(x \ge k) = 1 - \sum_{i=0}^{k-1} \frac{\binom{|E|}{i}\,\binom{|G|-|E|}{|R|-i}}{\binom{|G|}{|R|}}
\]

Page 72:

Data

• Disrupted genes – 263 disrupted genes, excluding drug treatments and haploid states (Hughes et al.)

• Transcription factor binding sites – 356 binding sites, of which 37 are experimentally proven (Pilpel et al., 2001)

Page 73:

Disrupted TF

• Only 5 transcription factors from our set (of known binding sites) were disrupted in the experiments: mbp1, yap1, yaf1, swi5, gcn4

• For three of them (mbp1, yap1, gcn4) the regulation and effectual sets were highly correlated

• yaf1 is activated by oleate; in an oleate-free environment, disruption of Yaf1 (alias OAF1) does not have a significant effect

• swi5 affects only the haploid state, while we use only diploid

Page 74:

Effectual sets correlating with other TF binding sites

• Of the 37 experimentally proven binding sites, 20 correlate with one or more effectual sets

• If the disrupted gene correlates with the regulation set of a different gene, the correlation should be explained

Page 75:

Possible explanations why disruption of gene A may correlate with the regulation set of a different gene (TF) T:

• T belongs to the disruption set of A (cascade)

Page 76:

Gene regulation cascade

Possible explanations why disruption of gene A may correlate with regulation set of a different gene (TF) T:

• T belongs to the disruption set of A (cascade)

• T is regulated by A (transcription or translation) or by a gene on the cascade of A

• T is modified (e.g., phosphorylated) by A or a cascade of A

• T and A belong to the same protein complex

• A and T are functionally related
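The first explanation (T lies on the cascade of A) can be checked mechanically once disruption sets are available. The sketch below is only an illustration under stated assumptions: the function name `on_cascade`, the dictionary layout (gene mapped to the set of genes affected by its disruption) and the toy gene names are all hypothetical, and a plain breadth-first search stands in for whatever procedure was actually used.

```python
from collections import deque


def on_cascade(disruption_sets: dict[str, set[str]], a: str, t: str) -> bool:
    """Return True if t is reachable from a by following disruption sets,
    i.e. t is in the disruption set of a or of a gene on the cascade of a."""
    seen = {a}
    queue = deque([a])
    while queue:
        gene = queue.popleft()
        for affected in disruption_sets.get(gene, set()):
            if affected == t:
                return True
            if affected not in seen:
                seen.add(affected)
                queue.append(affected)
    return False


# Hypothetical toy data: disrupting A affects B, disrupting B affects T.
toy = {"A": {"B"}, "B": {"T"}, "C": {"A"}}
print(on_cascade(toy, "A", "T"))  # True: A -> B -> T
print(on_cascade(toy, "T", "A"))  # False: nothing is downstream of T
```

The remaining explanations (modification, shared protein complex, functional relation) need other data sources, such as protein interaction or annotation data, and are not covered by this sketch.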

Binding site/disruption correlation summary

|R|   Site        |E|   Disruption          |R∩E|  Description
184   MCB         8     MBP1                5      Part of a DNA binding complex.
78    Yap1        55    YAP1                6
116   Ume6        346   SIN3                20     Interacting proteins.
210   Zap1        3     MSC7                3      Relation via hydrogenases.
243   Ste12       437   DIG1, DIG2          36     DIG1 represses STE12.
153   Ndt80       151   ISW1, ISW2          13     Genetic interaction with ISW2.
180   Rpn4        33    UBR2                17     Similar cellular role.
257   Rap1        121   VAC8                23     Weak link through vacuole.
480   Bas1        23    HPT1                11     Adenine response.
149   Stress      126   SIR4                13     Unexplained.
116   Hap234      -     4 disruptions       -      Unexplained.
151   Gcn4        -     34 disruptions      -      Central biosynthesis regulator.
89    Leu3        -     20 disruptions      -      In biosynthesis pathway.
58    Met31-32    -     16 disruptions      -      In biosynthesis pathway.
188   Aft1        -     CUP5, MAC1, VMA8    -      Small molecule transport, iron uptake.
907   rRNA proc.  -     9 disruptions       -      Ribosomal activity.
514   PAC         -     9 disruptions       -      Ribosomal activity.
356   ECB         -     5 weak disruptions  -      Early Cell-Cycle box.
371   PDR         -     11 weak disruptions -
410   Mcm1        -     10 disruptions      -      Mcm1p needs coregulators.

Conclusion

• Most of the binding site/disruption set correlations can be explained via

• Regulation cascades

• Protein complexes

(K. Palin et al, to appear in ECCB 2002, special issue of Bioinformatics)

SAME SYMPTOMS

SAME DRUG

RESPONSE VARIATION…

SNP: ACCTGACGTGG / ACCTGTCGTGG

PHARMACOGENETICS = NEW OPPORTUNITIES

SNP = Single nucleotide polymorphism, 0.1% = 3 million

SNPs make us unique: ~0.1%, 3,000,000

Goal: Associate SNPs with diseases

A C G T G A C

G T A - AA C T

Genotyping: select few

Measure

Goal: Associate SNPs with diseases, i.e. identify areas of interest

Association analysis:

Identifies MANY, if not all contributing genes

Links genes to disease pathways for optimal target selection

FROM DISEASE GENES TO DRUG TARGETS

EGV: Process of data collection and handling

[Diagram labels: GP; Internet; LIMS; Informed consent; Personal data; Unique code; DNA, plasma, storage; Genotypes; SNPs; Medical information; Data + Analysis = Value]

Bioinformatics: Where does IT stand?

• Data modelling, storage, access

• Inference from data

• Hypotheses generation and testing

• Allow novel types of questions to be asked by providing analysis methods that are able to cope with all the information that is available today

Compute Infrastructure

Bioinformatics: Challenges

• Knowledge representation, data semantics

• Data size and its speed of growth

• New/emerging data collection technologies

• Integration of different data types

• Discovery of useful knowledge

• Modeling living systems as a whole

• Improved health care products

• Medical informatics – bringing the knowledge to doctor’s bench

References for this talk

http://www.egeen.ee/u/vilo/Publications/

Jaak Vilo, Misha Kapushesky, Patrick Kemmeren, Ugis Sarkans, Alvis Brazma. Expression Profiler. In: Parmigiani, G., Garrett, E.S., Irizarry, R. and Zeger, S.L. (eds), The Analysis of Gene Expression Data: Methods and Software. Springer Verlag, New York, NY.

Patrick Kemmeren, Nynke L. van Berkum, Jaak Vilo, Theo Bijma, Rogier Donders, Alvis Brazma, Frank C.P. Holstege. Protein Interaction Verification and Functional Annotation by Integrated Analysis of Genome-Scale Data. Molecular Cell 2002 May 24; 9(5): 1133-1143.

Johan Rung, Thomas Schlitt, Alvis Brazma, Karlis Freivalds, Jaak Vilo. Building and analysing genome-wide gene disruption networks. Bioinformatics 2002 Oct; 18 Suppl 2: S202-210. European Conference on Computational Biology (ECCB 2002).

Kimmo Palin, Esko Ukkonen, Alvis Brazma, Jaak Vilo. Correlating gene promoters and expression in gene disruption experiments. Bioinformatics 2002 Oct; 18 Suppl 2: S172-180. European Conference on Computational Biology (ECCB 2002).

Acknowledgements

Alvis Brazma

Patrick Kemmeren, EBI, UMC Utrecht

Frank Holstege, UMC Utrecht

Thomas Schlitt, Johan Rung, EBI

Kimmo Palin, Esko Ukkonen, U. Helsinki

+ the rest of the EBI microarray team