
On the analysis of protein interaction networks

by

William Paul Kelly

A thesis submitted for the degree of Doctor of Philosophy of the University of London

Department of Mathematics, Imperial College London

180 Queen’s Gate, London, England

October, 2009

© 2009 William Paul Kelly. All rights reserved. Typeset in Times by LaTeX. Graphs typeset in R for Mac OS X.

This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specifically indicated in the text.

This dissertation is not substantially the same as any submitted by the author for any other degree or diploma or other qualification at any other university.

No part of this dissertation has already been, or is currently being submitted by the author for any other degree or diploma or other qualification.

This dissertation does not exceed 50,000 words, including appendices, footnotes, tables and equations. It does not contain more than 100 figures.

This work is supported by a Wellcome Trust grant and completed in the Department of Mathematics and Centre for Bioinformatics at Imperial College, London.


Abstract

Protein interaction networks describe the reported protein interactions found in an organism. Understanding their organisation will have an impact on all areas of systems biology. The amount of interaction data has expanded dramatically since the advent of high-throughput experimental technologies. However, interaction data are believed to contain a high proportion of false-positive interactions as well as true interactions. Incorporating knowledge of other biological characteristics may allow more reliable interaction networks to be produced.

This thesis presents an analysis of the reported Saccharomyces cerevisiae protein interaction network, providing an overview of its contents and a comparison of the contributing experimental techniques. Algorithms for constructing random networks are described and used to assess whether the network’s topology depends upon biological covariates. It is shown that the choice of random network generation algorithm can affect the conclusions drawn.

Phylogenetic trees of S. cerevisiae proteins are compared in order to assess possible evolutionary linkage of protein-protein interactions. The similarity of phylogenetic tree topologies found between interacting proteins is compared to that found for a variety of randomly constructed networks. Whilst the orthologues of interacting proteins show a tendency to be conserved together, the topologies are not more similar than those found for random networks. However, topological similarity is shown to be a means of differentiating between interacting and non-interacting protein pairs that have been reported as binding together in the same multi-protein complex structure.

Finally, a model is described that predicts interactome size and false discovery rate for reported data. The model uses all available interaction data to present the relationship between error rate, interactome size, and the proportion of observed true interactions. True interactions are classified through the use of repeated data, and plausible interactome sizes are used to assess the number of reported interactions necessary to find true interactions reliably.


Contents

List of Figures
List of Tables
List of Abbreviations
List of Mathematical Notation
Acknowledgements

1 Introduction
1.1 Thesis overview
1.1.1 Scope
1.1.2 Outline
1.1.3 Publications
1.2 Biological systems
1.2.1 Genomes
1.2.2 Proteins
1.2.3 Protein interactions
1.2.4 HIV example
1.3 Comparative genomics
1.3.1 Sequence alignment
1.3.2 Phylogenetic trees
1.3.3 Correlated evolution
1.4 Protein interaction data
1.4.1 Traditional methods
1.4.2 High-throughput methods
1.4.3 Interaction inference
1.4.4 Computational predictions
1.5 Graph theory
1.5.1 Graph properties
1.5.2 Graph ensembles
1.6 Noise in interactions
1.6.1 Sampling notation
1.6.2 Error rates
1.6.3 Error and size estimates
1.7 Summary

2 An exploratory analysis of interaction data
2.1 Introduction
2.2 Interactome databases
2.3 Analysis of the BioGRID S. cerevisiae database
2.3.1 Year of publication
2.3.2 Experiment size and technique
2.3.3 Self-interactions
2.3.4 Gene Ontology annotations of interacting proteins
2.3.5 Gene Ontology annotations and experimental techniques
2.3.6 Repeated interactions
2.4 Interaction networks
2.4.1 Local graph structure
2.4.2 Degree sequence
2.5 Discussion

3 Graph ensembles
3.1 Introduction
3.2 Methods
3.2.1 Data
3.2.2 Rewiring
3.2.3 Perturbations
3.3 Results
3.3.1 Rewiring
3.3.2 Perturbations
3.4 Discussion

4 Phylogenetic topologies of interacting proteins
4.1 Introduction
4.2 Methods
4.2.1 Data
4.2.2 Correlated divergence
4.2.3 Measuring topological differences
4.2.4 Phylogenetic analyses
4.3 Results
4.3.1 Phylogenetic profiles
4.3.2 Topological similarity
4.3.3 Phylogenetic methods
4.3.4 Further analyses
4.4 Discussion

5 Measuring the interactome
5.1 Introduction
5.2 Methods
5.2.1 Data
5.2.2 Coupon collecting
5.2.3 Single coupon
5.2.4 Multiple coupons
5.2.5 Finding true interactions
5.3 Results
5.3.1 Interactome size
5.3.2 Experiment size
5.3.3 Classification
5.4 Discussion

6 Conclusions
6.1 Summary
6.2 Conclusions
6.3 Further work

A Mathematical techniques
A.1 Likelihood analysis of degree distributions
A.2 Scaling degree random graphs
A.3 Exponential random graphs
A.4 Further biological random graphs

B Data tables for biological traits
B.1 Experimental interaction techniques
B.2 Further Gene Ontology annotation analysis

C Graph ensemble output

D Phylogenetic topology
D.1 Phylogenetic topologies
D.2 Supplementary phylogenetic results
D.3 Escherichia coli phylogenetic trees

E Sampling schemes

Figures

1.1 Interacting proteins
1.2 HIV virion
1.3 Sequence alignment
1.4 A phylogenetic tree
1.5 Example protein interaction network
1.6 Complex interaction models
1.7 A graph
1.8 Overlap method
1.9 High-throughput interaction overlap
1.10 Sample space overlap

2.1 Number and type of interaction reported in S. cerevisiae by year
2.2 Experimental techniques contribution to BioGRID
2.3 Molecular function annotations of reported interactions
2.4 Cellular component annotations of reported interactions
2.5 Biological process annotations of reported interactions
2.6 Proportion of matching functional annotations by experiment technique
2.7 Proportion of matching component annotations by experiment technique
2.8 Proportion of matching biological process annotations by experiment technique
2.9 Accrual of reported yeast protein interactions over time
2.10 Rank-degree plots of network data

3.1 Node shuffle
3.2 Network shuffle
3.3 Bipartite shuffle
3.4 Biological node shuffle
3.5 Biological network shuffle
3.6 Co-expression trait for ensembles using LC graph
3.7 Complex trait for ensembles using LC graph
3.8 Gene Ontology traits for graph ensembles
3.9 Component and clustering traits for graph ensembles
3.10 Co-expression trait for topological ensembles
3.11 Instability and distance for GO perturbations
3.12 Null homology perturbations for complex annotations
3.13 Null homology perturbations for process annotations
3.14 Similarity score by perturbation method

4.1 Phylogeny of study species
4.2 Topology edit distance
4.3 Phylogenetic profiles for each ensemble
4.4 Phylogenetic profile differences
4.5 Topological matching for LC interaction graph
4.6 Mismatch score using LC interaction graph
4.7 Topological similarity for LC interaction graph
4.8 Similarity of topologies for different tree algorithms

5.1 Single coupon function
5.2 S. cerevisiae physical interactome size
5.3 S. cerevisiae genetic interactome size
5.4 Experiment and interactome size
5.5 Single or multiple coupons
5.6 Multiple coupon interactome size results

B.1 GO slim matching annotations through time
B.2 Known GO annotations for PPIs by method

C.1 Graph ensemble traits for DIP data
C.2 Graph ensemble traits for CORE data

D.1 Phylogeny results for DIP (PROML trees)
D.2 Phylogeny results for CORE (PROML trees)
D.3 Phylogeny results for LC (PAML trees)
D.4 Phylogeny results for LC (PARS trees)
D.5 Phylogenetic topology matches for E. coli data

E.1 Node sampling
E.2 Edge sampling
E.3 Edge discovery

Tables

1.1 HTP experimental methodologies
1.2 Error rate notation
1.3 FDR estimates
1.4 S. cerevisiae interactome size predictions

2.1 Interaction databases
2.2 Self-interactions found from each experimental technique
2.3 Components and degree for empirical graphs
2.4 Clustering coefficients
2.5 AIC analysis of possible degree distribution

3.1 Empirical graph traits
3.2 Size of homology sets

4.1 Similarity for each phylogenetic tree construction algorithm
4.2 Complex results

5.1 Interaction datasets
5.2 Classification performance if ρm = 20,000
5.3 Classification performance if ρm = 40,000

B.1 Interaction prediction methodologies
B.2 GO slim annotation classes
B.3 BioGRID experimental methods

D.1 Number of topologies

Abbreviations

BioGRID   Biological general repository for interaction datasets
BLAST     Basic Local Alignment Search Tool
CORE      PPI subset taken from DIP (interaction graph)
DIP       Database of interacting proteins (interaction graph)
DNA       Deoxyribonucleic acid
ER        Erdős-Rényi (random graph)
ERGM      Exponential random graph model
FDR       False discovery rate
FN        False negative
FP        False positive
FRET      Fluorescence resonance energy transfer
GCC       Giant connected component
GO        Gene Ontology
HTP       High-throughput experiment
LC        Literature curated PPIs (interaction graph)
MIPS      Munich information center for protein sequences
ML        Maximum likelihood
mRNA      Messenger ribonucleic acid
MS        Mass spectrometry
PAML      Phylogenetic analysis by maximum likelihood (phylogeny inference)
PARS      Parsimony (phylogeny inference)
PHYLIP    Phylogeny inference package
PIN       Protein interaction network
PPI       Protein-protein interaction
PROML     Protein maximum likelihood (phylogeny inference)
RNA       Ribonucleic acid
SSE       Small scale experiment
TN        True negative
TP        True positive
Y2H       Yeast two-hybrid


Mathematical Notation

E(X)        Expectation of random variable X
P(X)        Probability of event X
D(A)        Distance matrix for protein A
r_{A,B}     Pearson correlation coefficient between A and B
|z|         Absolute value of z
E           Edge set
V           Node set
G ∼ (V,E)   Graph with nodes, v ∈ V, and edges, e ∈ E
d(v)        Degree of node v
C(v)        Clustering coefficient of node v
N(v)        Set of neighbours of node v
β(v)        Biological characteristic for node v
φ(e)        Biological characteristic for edge e
Π(G)        Network trait for graph G
∆(G,Φ)      Biological trait, φ, for graph G
η_{A,B}     Distance between phylogenetic topology for proteins A and B
Γ_{A,B}     Similarity of topologies for proteins A and B


Acknowledgements

I would like to thank all the members of the Centre for Bioinformatics for providing a helpful and friendly environment. In particular: Ino Agrafioti, Sara Dobbins, Isabel Holmquist, Piers Ingram, Paul Kirk, Yussanne Ma, Ronald Stewart and Tom Thorne.

I thank my supervisors Niall Adams and Michael Stumpf for providing guidance, advice and support throughout my research at Imperial College. David Stephens and Frank Kelly also provided additional support through different sections of the project.

This thesis was greatly enhanced by those that have commented on and proof read it throughout. Thanks go to Niall Adams, Paul Kirk, Katherine Sharrocks and Michael Stumpf.

My friends have provided a huge amount of diversionary support over the last four years, alleviating some of the pressure at key stages of my Ph.D. Thanks also go to Mum, Dad, my brother and Kat for love and support throughout the project.

Finally, I am grateful for the funding and support from the Wellcome Trust for both the Ph.D. and preceding masters which enabled me to complete my studies at Imperial College.


Chapter 1

Introduction

This chapter presents an overview of this thesis and reviews background material used in subsequent chapters. The scope and contents of the thesis are outlined (Section 1.1). The genome and the interactome are introduced (Section 1.2). Comparative genomics is discussed together with the use of phylogenetics for classifying potential protein interactions (Section 1.3). Techniques used to generate protein interaction data are also introduced (Section 1.4). Graph theory is introduced and relevant notation defined (Section 1.5). The literature concerning error found in protein interaction data, and the notation used, are presented (Section 1.6).


1.1 Thesis overview

Systems biology is an emerging inter-disciplinary field which studies the function of biological organisms using a breadth of experimental and computational approaches. A primary aim of these systems approaches is to provide mechanistic, quantitative and predictive models for the dynamics of biological interactions (Schwikowski et al., 2000; Luscombe et al., 2001). Molecular information is used to study the interactions between elements of a biological entity. These interactions form networks that are used to model the overall system (Hintze and Adami, 2008).

This thesis explores the properties of a protein interaction network. The biological features of proteins are used to generate network models of protein-protein interactions. Repeated experimental data are used to form an estimate of the unknown number of distinct interactions found in the interactome and to compare the different experimental techniques that have been used to generate interaction network data.

1.1.1 Scope

This thesis assumes that biological characteristics can explain aspects of the structure and evolution of protein interaction networks. However, the data are considered from two distinct perspectives. First, a collection of empirical graphs is assumed to represent the complete protein interaction network and is analysed. Second, the reported data are assumed, as a result of experimental noise, to be a collection of true and false interactions which form a subset of the complete interaction set. This latter view is used to model the number of different possible protein interactions and assess the reliability of the published data.

Algorithms used to generate protein interaction networks attempt to understand how these systems have evolved, or to provide means of assessing the relevance of biological characteristics. One aim of this thesis is to understand how statistical analyses of biological characteristics on large scale interaction networks are affected by the choice of random graph null models. The relevance of particular biological characteristics (which have been used to find protein interactions previously) and the accuracy of current protein interaction data are also considered. A further aim is to develop a means of elucidating the complete set of possible protein-protein interactions from reported datasets. Finally, this thesis serves as an overview of the current state of Saccharomyces cerevisiae interactome data.

1.1.2 Outline

Chapter 1 presents the required biological background. This breaks down into an introduction to genomics and comparative analyses before discussing recent research on physical networks. Then the published literature regarding interaction data error rates and possible sizes of protein interaction networks is introduced. The chapter concludes with a discussion regarding how the interaction data have been generated.

Chapter 2 analyses the available interaction data in S. cerevisiae and presents the empirical graphs that are subsequently used in this thesis. The protein interaction data for S. cerevisiae are analysed in order to motivate the biological constraints used to generate random networks. Due to the recent proliferation of new experimental data, the analyses are crucial both to reappraise previous results and to fix (in time) the context in which this work is conducted.

Chapter 3 describes a variety of random graph ensembles. These graph ensembles are motivated by the biological literature and factors considered relevant to protein interaction network structure. The ensemble averages of various covariates are compared and contrasted across the graph ensembles and the empirical data.

Chapter 4 uses the graph ensembles to test whether protein interactions have more similar phylogenetic topologies than would be expected by chance in the random ensembles. This is tested both for individual interactions and in the context of the graphs produced by subsamples of the reported S. cerevisiae interactome.

Chapter 5 presents a model to determine the number of distinct interactions found in the interactome. The relationship between the interactome size and the number of falsely reported interactions is assessed. The model is also used to assess the number of interaction reports required to use validated information to reliably classify the true interactions from erroneous data. The error rates for different types of experimental interaction data are compared along with predicted interactome sizes for protein-protein and protein-DNA interactions in S. cerevisiae.

Chapter 6 draws the work together through a summary and general conclusion before a discussion of future work that can be used to develop the methods and results presented in this thesis.

1.1.3 Publications

Contributions from this thesis have been published and the references are:

1. (Stumpf et al., 2007) Stumpf, MPH, Kelly, WP, Thorne, T, and Wiuf, C. Evolution at the system level: the natural history of protein interaction networks. Trends in Ecology & Evolution, 22:366–373, 2007.
   WPK completed analysis on S. cerevisiae PPI and evolutionary rate correlations for this article. Figure 1.1 is from this article.

2. (Kelly and Stumpf, 2008) Kelly, WP and Stumpf, MPH. Protein-protein interactions: from global to local analyses. Current Opinion in Biotechnology, 19:396–403, 2008.
   WPK and MPHS wrote the article, WPK performed the data analysis and created figures which also appear in Chapter 2 as Figures 2.3, 2.4, 2.5, and 2.9.


1.2 Biological systems

Genomic techniques paved the way for the biological sciences to characterise the molecular constituents of life (Bruggeman and Westerhoff, 2007). These constituents have been found to organise and function through various systems of molecular interactions. Networks can be used to describe biological interactions such as: the atomic interactions occurring between protein structures; the interactions of metabolites and proteins during specific cellular events such as the cell cycle; and, on a macroscopic level, the inter-relationships between organisms in an ecosystem (Alm and Arkin, 2003). Systems approaches aim to develop an understanding of the inter-relationships between proteins, metabolites or other molecules across organisms (Barabasi and Oltvai, 2004).

Modern high-throughput techniques, taking measurements on a system-wide level, are well suited to the global analysis and modelling of networks and processes (LaCount et al., 2005). The published data, when adequately verified, can be used to train computational models, as well as to validate models that have been proposed (Shen et al., 2007). In parallel, computational methods have the potential to reduce noise and systematic errors (Gilchrist et al., 2004), whilst also forming a new means of providing constructive feedback across in vitro and in vivo experiments.

The yeast species Saccharomyces cerevisiae is the study organism for this thesis. Multiple studies have provided global analyses of the interactions of S. cerevisiae (Gavin et al., 2006; Hart et al., 2006). Its cells are approximately spherical, around ten micrometres in diameter, and are easy to culture and perform biological experiments upon. The species has at least 5,800 distinct proteins and its genome sequence has 12,495,682 base pairs (Hirschman et al., 2006). S. cerevisiae has a large number of proteins homologous to human proteins, including cell cycle and signalling proteins, making it a good model organism for experiments probing fundamental eukaryotic processes. S. cerevisiae is one of the most intensively studied eukaryotic model organisms in molecular and cell biology (Hong et al., 2008).

1.2.1 Genomes

Deoxyribonucleic acid (DNA) is the hereditary material of the vast majority of organisms. DNA encodes all of the information required for the processes of individual cells, and consequently the functions and inherited characteristics of organisms. The DNA of a cell comprises that cell’s genome: the book of instructions.

Definition 1.1 (Genome) A genome is all the genetic information, the entire genetic complement of the hereditary material, possessed by an organism.

DNA is a polymer consisting of four nucleotide bases: adenine (A), cytosine (C), guanine (G) and thymine (T). Sequences of these nucleotides are joined by covalent bonds to form strands of DNA, each strand of DNA forming hydrogen bonds with a second strand in a specific manner known as complementary base pairing. DNA is composed of two complementary strands in the shape of a double helix. Each base in a strand forms a hydrogen bond with another specific base: adenine with thymine, and cytosine with guanine.
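The pairing rule described above can be illustrated with a minimal Python sketch. The function names and the example sequence are hypothetical and chosen only for illustration; the reverse complement additionally assumes the standard convention that the two strands run in opposite directions, which is not stated in the text.

```python
# Pairing rule from the text: adenine with thymine, cytosine with guanine.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the partner strand read in the opposite direction.

    Assumption (not from the text): the two strands of the double helix
    are antiparallel, so the partner is conventionally written reversed.
    """
    return complement(strand)[::-1]


example = "ATGCGT"  # hypothetical sequence, for illustration only
partner = complement(example)
```

Applying `complement` twice returns the original strand, which reflects the symmetry of the pairing rule.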

1.2.2 Proteins

A protein is formed from a DNA sequence. This sequence of bases, a gene, is ‘read’ using the cellular enzyme RNA polymerase to produce an RNA copy of the DNA, which is in turn ‘read’ by the cell’s ribosomes to produce a protein. Within each gene, some of the DNA does not directly provide information that can be read to produce proteins; these regions are non-coding sequences, or introns. The DNA that can be read to produce proteins forms the coding sequences, or exons.

Proteins comprise sequences of the twenty different amino acids, joined together as per the instructions found within the cell’s DNA. A stretch of DNA that codes for a single protein is called a gene. As there are 20 amino acids that produce proteins but only 4 nucleotide bases found in DNA, it is impossible that one nucleotide base ‘codes’ for one amino acid. Instead, a sequence of three contiguous bases forms a unit, or codon. 61 of the 64 (4^3) possible codons each map to a fixed amino acid, and there is redundancy in the genetic code, with each amino acid being coded for by up to 6 different codons. The 3 remaining codons are stop codons, which signal termination of translation within messenger RNA. The codon for the amino acid methionine, the start codon, initiates the production of a protein.
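The codon counting argument can be checked with a short Python sketch. The specific stop codons (UAA, UAG, UGA) and the start codon (AUG) are those of the standard genetic code and are an assumption here, since the text does not list them; the variable names are illustrative only.

```python
from itertools import product

BASES = "ACGU"  # messenger RNA alphabet; U replaces T after transcription

# All ordered triples of bases: 4**3 = 64 possible codons.
codons = ["".join(triple) for triple in product(BASES, repeat=3)]

# Assumption: stop and start codons of the standard genetic code.
STOP_CODONS = {"UAA", "UAG", "UGA"}
START_CODON = "AUG"  # codes for methionine

# Codons that map to an amino acid (everything except the stop codons).
sense_codons = [c for c in codons if c not in STOP_CODONS]

# Shortest codon length k with 4**k >= 20, i.e. enough distinct codons
# to give each of the 20 amino acids at least one codon.
min_length = next(k for k in range(1, 10) if len(BASES) ** k >= 20)
```

With one or two bases per codon only 4 or 16 combinations are available, so three is the smallest length that can encode 20 amino acids, and the 61 sense codons shared among 20 amino acids account for the redundancy noted above.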

Definition 1.2 (Protein) A protein, p, is a sequence of amino acids defined by the DNA sequence of a gene.

As the collection of DNA, including the genes, forms an organism’s genome, so the collection of proteins expressed by an organism forms the organism’s proteome. Protein interaction network research is concerned with the inter-relationships between the proteins of a proteome.

Definition 1.3 (Proteome) A proteome, P, is the complete set of proteins, p_1, ..., p_n, expressed by a genome.

1.2.3 Protein interactions

The functional operation of biological processes and systems is dependent on the inter-relationships between proteins. Understanding these interactions helps not only to eluci-date how the system works but also to increase our knowledge regarding the evolution oforganisms and function.

Protein interactions are observed using a variety of different experimental techniques, as discussed in Section 1.4. Although the theoretical concept of a protein interaction is well-defined, observed interactions are subject to errors and misclassification. Accordingly, care has to be taken to note the difference between observed interactions and true interactions. This is expanded upon in Section 1.6, where the space of possible protein-protein interactions is defined, and in Chapter 5.

Definition 1.4 (Protein interaction) A protein interaction is the binding of a protein, p, to another molecule.

Interactomes form the complete set of possible molecular interactions which can occur within the cell (Sanchez et al., 1999). These sets may include interactions between any type of biological molecule, including proteins. Throughout this thesis, the complete set of interactions found within the proteome, P, comprises an interactome – or a protein interaction network (PIN). Consequently, the complete interactome may form a superset of the interactions that may occur within a particular individual or environment for the studied system. The networks observed in this thesis involve two types of protein-based interaction:

• physical: the binding (molecules join to form a combined structure) of two different proteins, a protein-protein interaction (PPI) (Collins et al., 2007a; Tarassov et al., 2008).

• genetic: the binding of a protein to a component of the genetic sequence (Boone et al., 2007).

Definition 1.5 (Interactome) An interactome is the complete collection of biological interactions, of a given type, found within an organism. This is the set of interactions that can be detected experimentally under any conditions, in vivo or in vitro, for the defined set of molecules.

Protein domains are components of a protein that have been found to exist independently of the rest of the structure – for instance units that have been found in several different proteins. In general, physical protein interactions are found to exist between particular pairs of domains. A typical PPI is shown in Figure 1.1, where two proteins are shown bound to each other, and the protein domains involved in binding are depicted in different colours.

Figure 1.1: Interacting proteins. This shows the structure of the porcine pancreatic α-amylase (blue) bound to a bean lectin-like inhibitor (a protein with two domains: red and yellow) (Gilles et al., 1996). The interaction occurs solely between the blue and red domains.


1.2.4 HIV example

This example is given to put the previously detailed theory into the context of a model organism – the HIV-1 provirus. Figure 1.2 shows the HIV virion. This is a well-studied virus with a complete genome less than 10 kilobases long, encoding 15 different proteins. Although this virus has a short genome sequence, researchers are interested in both the set of relationships within its proteome, and those that can occur between human and HIV proteins (Sharrocks, 2007).

HIV-1 has been reported to interact with 1,448 human proteins (Ptak et al., 2008). This involves 2,589 HIV-1 to human protein interactions. Across the different proteins of HIV-1, a single regulatory protein, Tat (Trans-Activator of Transcription), participates in around a third of these unique interactions. The high number of interactions, and the disproportionate number of reported interactions with only one of the proteins, shows the potential inhomogeneity of networks even when such a small set of proteins is considered. This illustrates the possible scale of the human interactome, which contains interactions between 20,000-24,000 proteins, and emphasises the need to first understand model examples. For model organisms, the interactions of interest are those that can occur between proteins found solely in the organism’s own genome sequence: those interactions forming the organism’s interactome.

Figure 1.2: HIV virion. The HIV-1 provirus is approximately 9.2 kb in length, encoding 15 distinct proteins (image from Sharrocks (2007)).

1.3 Comparative genomics

The field of comparative genomics mirrors biological studies that have been conducted for decades in ecology and the study of taxa diversity (MacDonald, 1979; Bangert et al., 2006; Pratt et al., 2008). Comparisons are made in order to understand organic diversity and to appreciate the role of evolution in creating that diversity (Allen et al., 2005; Bangert et al., 2006).

Evolutionary work is complicated by a lack of direct knowledge of the history of different organisms, although data for some model organisms are more readily available than others. Studies of model organisms with short generation times – such as Escherichia coli (Konagurthu and Lesk, 2008) or S. cerevisiae (Wolfe, 2006) – have been most useful when aiming to understand the links between characteristics of genomes, genes and proteins.

1.3.1 Sequence alignment

Sequence alignment provides a means of measuring similarity between strings: in this case DNA, RNA, or amino acid sequences. Sequences that show high levels of similarity may be linked by function, structure, or close evolutionary relationships (Yang, 2006). Different sequences are aligned according to a scoring matrix; a multiple sequence alignment (MSA) is shown in Figure 1.3. This matrix is dependent on the alphabet of possible items found in the sequences.

Definition 1.6 (Alphabet) An alphabet is the set of letters that make up the possible items in a code – e.g. the DNA genetic code has an alphabet of {A, C, G, T}.

Definition 1.7 (Sequence) A sequence, of length q, is a q-tuple of letters, (a_1, a_2, …, a_q), where each letter, a_i, is in an alphabet: a_i ∈ A.

Sequence similarities are commonly interpreted as indicating some biologically relevant link between the studied DNA, RNA or protein (Ramani et al., 2008). Alignments are used to classify sequences of unknown function or origin by inference from sequences that have been studied more often. Accordingly, the roles of genes and proteins in non-model organisms can be inferred from work completed on experimental species using sequence alignments (Lehner and Fraser, 2004).

Figure 1.3: Sequence alignment. An example section of an MSA of five similar amino acid sequences from different yeast species. Each dash, ‘-’, represents a gap in the alignment.

Differences between sequences can be used to determine the probable evolutionary history of biological sequences (Felsenstein, 1984). Under appropriate models of genetic evolution, the genetic distances between different genes, proteins, or organisms can be found through the alignment of relevant genetic material. These alignments are used to produce phylogenetic trees that represent the relationships between different biological samples, as discussed further in Section 1.3.2.

BLAST (Altschul et al., 1990) is a program used to compare sequences such as DNA or amino acid sequences. A BLAST search is performed on a sequence of interest, in general a complete gene or protein, which is aligned against a library of other sequences. The BLAST alignment algorithm is similar to the Smith and Waterman (1981) algorithm for sequence alignment, producing an ordered list of sequences similar to the query. This list may include homologous proteins.
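As an illustration of local alignment scoring, the following is a minimal sketch of the Smith and Waterman (1981) dynamic programme. It returns only the best local alignment score, and it assumes a simple match/mismatch score with a linear gap penalty in place of a full scoring matrix; these parameter values are illustrative. BLAST itself relies on faster heuristics that are not reproduced here.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b.

    Assumed scoring scheme (illustrative only): +2 for a match, -1 for a
    mismatch and -2 for each gap position.
    """
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score = max(
                0,                                # start a new local alignment
                previous[j - 1] + substitution,   # align a[i-1] with b[j-1]
                previous[j] + gap,                # gap in b
                current[j - 1] + gap,             # gap in a
            )
            current.append(score)
            best = max(best, score)
        previous = current
    return best
```

Ranking library sequences by such a score against a query gives an ordered list of candidate similar sequences, in the spirit of the search described above.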

Definition 1.8 (Homology) Proteins are homologous if they have evolved from a common ancestor. This may be indicated through a high level of sequence similarity, as measured by alignment.

BLAST searches, on appropriate libraries of distinct individual species, can be used to find orthologous proteins, as demonstrated in Figure 1.3. Paralogous proteins are found by the same means, but using a library of sequences from the same proteome as the query protein. Paralogous proteins are brought about by a gene duplication event in an ancestral species. These duplication events are a consequence of some error which increases the size of the genome by replicating some subset of the DNA through a variety of means, including whole chromosome duplications (Hakes et al., 2007b).

Paralogous genes, at the point of duplication, are in general identical to genes already found in the genome. Accordingly, this leads to functional redundancy, as it is often not advantageous to have two identical genes. Mutations that disrupt the structure and function of one of the two genes are therefore not selected against, enabling novel functions to evolve or other forms of evolutionary development (Zhang, 2003). Duplication events, and the subsequent evolution of redundant genes, are an important driver of PIN development (Pastor-Satorras et al., 2003).

Definition 1.9 (Orthology) Orthologous proteins have evolved from a common ancestor, separated by a speciation event. Orthologous proteins are homologous and found in different species.

Definition 1.10 (Paralogy) Paralogous proteins have a common ancestor in the same species, arising due to a gene duplication. Paralogous proteins are homologous and occur in the same species.

1.3.2 Phylogenetic trees

Phylogenetic trees are used to represent evolutionary relationships between genomes, genes or proteins. Differences between sequences are used to reconstruct a branching process of divergence from a common ancestor, resulting in a diagrammatic representation of the historical evolutionary relationships between different entities (see Figure 1.4).

Definition 1.11 (Phylogenetic tree) A phylogenetic tree details the inferred evolutionary relationships among a set of species, genes, or proteins.

Phylogenetic trees not only depict information as to the evolution of particular sequences, but may also inform about theoretical inter-relationships between sequences (such as their participation in PPIs (Sato et al., 2003)).

The phylogenetic trees considered have two main components: the distances between sequences and a set of branching events that represent when sequences have diverged. A branching event, where a line splits in Figure 1.4, describes a common ancestor diverging into distinct entities. Branching, or divergence, events are characterised by the topology of the phylogenetic tree. The model used to reconstruct the phylogenetic trees can produce two sets of possible topologies: bifurcating or multifurcating.

Figure 1.4: A phylogenetic tree. This shows a sample phylogenetic tree, with time flowing from left to right, for 6 Saccharomyces species. The approximate number of million years (Myr) to the common ancestors of S. cerevisiae and S. paradoxus, S. mikatae and S. kudriavzevii is shown (Wolfe, 2006).

Definition 1.12 (Bifurcating tree) Bifurcating trees are such that a branching event results in exactly two divergent sequences, a binary tree.

Definition 1.13 (Multifurcating tree) Multifurcating trees are such that a branching event can result in any number of divergent sequences.

Figure 1.4 shows a tree on 6 different yeast species. This tree may also be represented using a bracket notation, representing the locations of branching events between different lineages. For example, the topology of the tree in Figure 1.4 is: (S.castellii, (S.bayanus, ((S.cerevisiae, S.paradoxus), (S.mikatae, S.kudriavzevii)))). Within any pair of brackets, the ordering is irrelevant – e.g. for two species (S.cerevisiae, S.paradoxus) = (S.paradoxus, S.cerevisiae). This is a bifurcating tree as each branching event divides the sequences into two subsets.

Multifurcating trees can have any number of sets at each branching event. This includes, for three species, the tree topology: (S.cerevisiae, S.paradoxus, S.mikatae). The set of multifurcating trees includes all bifurcating tree topologies on the same number of sequences.
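The bracket notation can be sketched in code by writing each topology as nested tuples of leaf names. The helper names canonical and is_bifurcating are illustrative assumptions. Converting each bracket to a frozenset makes the ordering within brackets irrelevant, provided all leaf names are distinct.

```python
def canonical(tree):
    """Convert nested tuples into nested frozensets so sibling order is ignored."""
    if isinstance(tree, str):
        return tree
    return frozenset(canonical(child) for child in tree)


def is_bifurcating(tree):
    """True if every branching event yields exactly two descendants."""
    if isinstance(tree, str):
        return True
    return len(tree) == 2 and all(is_bifurcating(child) for child in tree)


# Topology of the six-species tree, written in bracket notation.
figure_tree = (
    "S.castellii",
    ("S.bayanus",
     (("S.cerevisiae", "S.paradoxus"), ("S.mikatae", "S.kudriavzevii"))),
)
# The same topology with the order inside some brackets changed.
reordered_tree = (
    "S.castellii",
    ("S.bayanus",
     (("S.mikatae", "S.kudriavzevii"), ("S.paradoxus", "S.cerevisiae"))),
)
# A multifurcating topology on three species.
star_tree = ("S.cerevisiae", "S.paradoxus", "S.mikatae")
```

Here canonical(figure_tree) and canonical(reordered_tree) compare equal, while is_bifurcating distinguishes the six-species tree from the three-way split.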

A variety of different methods are used to construct phylogenetic trees that detail the evolutionary linkages between different sequences, although they can be described broadly as falling into the following categories: parsimony, maximum likelihood, or distance methods. They produce a tree given a sequence alignment for a collection of sequences, A_i. Let a_{i,j} be letter j from sequence i in the alignment.

Parsimony methods are non-parametric approaches used to find phylogenies. A maximum parsimony method assigns a model of evolutionary change onto the sequence alphabet. Then the best tree is found by determining the minimum number of letter changes required to match letter j for all sequences, A_i. Each step is a change, for a given a_{i,j}, from one letter to another. The algorithm aims to produce a phylogenetic tree that minimises the total number of changes across the alignment.
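To make the counting of changes concrete, the sketch below scores a fixed bifurcating topology using Fitch's small-parsimony algorithm. It assumes that every letter change has unit cost and that the tree is given as nested pairs of sequence names. The function names are illustrative; searching over candidate topologies for the minimum score is not shown.

```python
def fitch(tree, letters):
    """Return (candidate ancestral letters, minimum changes) for one column.

    tree    -- a sequence name, or a pair (left, right) of subtrees
    letters -- mapping from sequence name to its letter in this column
    """
    if isinstance(tree, str):
        return {letters[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, letters)
    right_set, right_changes = fitch(right, letters)
    shared = left_set & right_set
    if shared:
        # The two subtrees can agree on a letter: no extra change is needed.
        return shared, left_changes + right_changes
    # Otherwise one change is required on a branch below this node.
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, alignment):
    """Total number of changes across all columns of the alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[j] for name, seq in alignment.items()})[1]
        for j in range(length)
    )


# Hypothetical toy alignment of four sequences and two candidate topologies.
toy_alignment = {"s1": "AC", "s2": "AC", "s3": "GC", "s4": "GT"}
tree_one = (("s1", "s2"), ("s3", "s4"))
tree_two = (("s1", "s3"), ("s2", "s4"))
```

For this toy alignment the first topology requires fewer changes than the second, so it would be preferred under maximum parsimony.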

Maximum likelihood methods are parametric, employing some probability model of sequence evolution. They use expected patterns of mutational change, alongside the probability model used, to find the most likely tree arrangement. These maximum likelihood (ML) methods, including algorithms found in the Phylogeny Inference Package (PHYLIP) (Felsenstein, 1995) or Phylogenetic Analysis by Maximum Likelihood (PAML) (Yang, 2004), take a model of evolutionary change for the letters of the sequences being considered – e.g. amino acids for a set of aligned protein sequences. The model assumes this pattern of evolutionary change, and then assesses the probability of each potential tree arrangement for every position – i.e. letter j – of the sequence alignment. The tree that is the most likely, after permuting through all possible combinations, is the phylogeny for the sequence alignment assessed (Mount, 2004).

Distance methods produce trees based on the number of differences between sequences found in an MSA. For instance, neighbour-joining algorithms produce a phylogeny by adding the most similar sequence as an additional branch to a given tree by using the evolutionary distances found between the sequences.
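The input to a distance method can be sketched as follows. The example assumes the simplest possible distance, the proportion of differing positions between two aligned sequences with gap positions ignored, rather than a model-based evolutionary distance. The function names and toy alignment are hypothetical, and the neighbour-joining step itself is not implemented.

```python
def p_distance(s, t):
    """Proportion of aligned, gap-free positions at which s and t differ."""
    if len(s) != len(t):
        raise ValueError("aligned sequences must have equal length")
    compared = [(a, b) for a, b in zip(s, t) if a != "-" and b != "-"]
    if not compared:
        raise ValueError("no gap-free positions to compare")
    return sum(a != b for a, b in compared) / len(compared)


def distance_matrix(alignment):
    """Return (names, D) where D[i][j] is the distance between sequences i and j."""
    names = list(alignment)
    matrix = [
        [p_distance(alignment[x], alignment[y]) for y in names] for x in names
    ]
    return names, matrix


def closest_pair(names, matrix):
    """Names of the two most similar sequences (smallest off-diagonal distance)."""
    n = len(names)
    _, first, second = min(
        (matrix[i][j], names[i], names[j])
        for i in range(n)
        for j in range(i + 1, n)
    )
    return first, second


# Hypothetical toy alignment; '-' marks a gap.
toy_msa = {"s1": "ACGTAC", "s2": "ACGTAA", "s3": "TCG-CC"}
toy_names, toy_matrix = distance_matrix(toy_msa)
```

A tree-building method would then repeatedly join the closest sequences or clusters, updating the matrix at each step.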

1.3.3 Correlated evolution

Studies have asserted linkage between the evolutionary rate of proteins and PPIs (Pellegrini et al., 1999; Goh and Cohen, 2002; Gertz et al., 2003; Pazos et al., 2005). For example, chemokines and their corresponding receptors show evidence for correlated evolution reflected by similarity of their respective phylogenetic trees (Goh et al., 2000). In the case of TGFβ ligands and their receptors (Gertz et al., 2003), topological similarities between closely related proteins’ phylogenies have been used to find novel PPIs.

Definition 1.14 (Correlated evolution) The level of correlated evolution between two proteins is the linkage, or correlation, found between the evolutionary rates of the two proteins.

Pellegrini et al. (1999) introduced the phylogenetic profile as whole genome sequences became widely available. Phylogenetic profiles have been used to infer the complexes or pathways in which an unknown protein participates, or to predict protein function (Loganantharaj and Atwi, 2007).

Definition 1.15 (Phylogenetic profile) A phylogenetic profile for a protein is an n-bit string which details whether an orthologue exists, defined by some threshold on sequence similarity, for the protein in each of n distinct species.
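Definition 1.15 can be illustrated with a small sketch. It assumes that, for each species, the best similarity score found for the protein is already available (for example from a sequence search). The species labels, scores and threshold are hypothetical, and comparing profiles by the number of differing bits is one simple choice among several.

```python
def phylogenetic_profile(best_scores, species, threshold):
    """n-bit string: '1' where the best similarity score reaches the threshold."""
    return "".join(
        "1" if best_scores.get(name, 0.0) >= threshold else "0"
        for name in species
    )


def profile_distance(p, q):
    """Number of species in which exactly one of the two proteins has an orthologue."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same species")
    return sum(a != b for a, b in zip(p, q))


# Hypothetical data: best similarity score per species for two proteins.
species = ["sp1", "sp2", "sp3", "sp4"]
scores_a = {"sp1": 0.9, "sp2": 0.2, "sp4": 0.7}
scores_b = {"sp1": 0.8, "sp3": 0.6, "sp4": 0.9}

profile_a = phylogenetic_profile(scores_a, species, threshold=0.5)
profile_b = phylogenetic_profile(scores_b, species, threshold=0.5)
```

Proteins with similar profiles are candidates for participation in the same complex or pathway.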

The mirrortree approach is based on an observation that interacting or functionally related proteins have similar phylogenetic trees (Juan et al., 2008a). The mirrortree algorithm (Pazos and Valencia, 2001; Juan et al., 2008b) uses MSAs of orthologous sequences, and the underlying species distance matrix, to help predict PPIs. The correlation between distance matrices for proteins is used to help find potential PPIs. Distance matrices detail the evolutionary time separating sequences based on a probability model of evolution between the letters of the sequence alphabet. An n × n matrix contains information on distances between n sequences. Suppose we have found homologous proteins for A and B in n species.

Definition 1.16 (Distance matrix) A distance matrix, D, is a two-dimensional array where each entry, d_{i,j}, is the distance between sequence i and sequence j.

The basic mirrortree algorithm uses distance matrices for each protein of a proteome to aid PPI classification (Pazos and Valencia, 2001). Let the distance matrix, D(A), for protein A be defined such that d_{i,j}(A) is the evolutionary distance between the homologous proteins found in species i and j. Let the correlation of the evolutionary rates of A and B be the Pearson correlation coefficient, r, of the two distance matrices:

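\[
r_{A,B} = \frac{\sum_{i<j \in [1,n]} \left( d_{i,j}(A) - \bar{d}_{i,j}(A) \right) \left( d_{i,j}(B) - \bar{d}_{i,j}(B) \right)}{\sqrt{\sum_{i<j \in [1,n]} \left( d_{i,j}(A) - \bar{d}_{i,j}(A) \right)^2} \, \sqrt{\sum_{i<j \in [1,n]} \left( d_{i,j}(B) - \bar{d}_{i,j}(B) \right)^2}}, \tag{1.1}
\]

where

\[
\bar{d}_{i,j}(A) = \frac{\sum_{i<j \in [1,n]} d_{i,j}(A)}{\binom{n}{2}}
\]

is the mean of the pairwise distances for protein A, with \bar{d}_{i,j}(B) defined analogously.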

Distance matrices can be used without calculating the complete phylogeny of each protein. Instead the correlation between distance matrices of protein pairs is used to find PPIs. Mirrortree was tested using 118 known E. coli proteins and their orthologues across 47 different genomes (Pazos et al., 2005). For half of the proteins a reported interaction partner was found in the highest 6.4% of scores.
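The correlation in Equation 1.1 can be computed directly from two distance matrices. The following sketch is an illustration only, not the published mirrortree implementation. It assumes symmetric n × n matrices given as nested lists, with rows and columns ordered by the same n species; the function names are hypothetical.

```python
import math


def upper_triangle(matrix):
    """Entries d[i][j] with i < j, i.e. the n(n-1)/2 distinct pairwise distances."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def mirrortree_correlation(d_a, d_b):
    """Pearson correlation between the pairwise distances of proteins A and B."""
    x, y = upper_triangle(d_a), upper_triangle(d_b)
    if len(x) != len(y) or not x:
        raise ValueError("matrices must cover the same, non-empty set of species")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = math.sqrt(sum((a - mean_x) ** 2 for a in x)) * math.sqrt(
        sum((b - mean_y) ** 2 for b in y)
    )
    if denominator == 0:
        raise ValueError("correlation is undefined for constant distances")
    return numerator / denominator


# Hypothetical distance matrices for three species.
d_a = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
d_scaled = [[0, 2, 4], [2, 0, 6], [4, 6, 0]]    # every distance doubled
d_reversed = [[0, 3, 2], [3, 0, 1], [2, 1, 0]]  # opposite ordering of distances
```

Because the coefficient is invariant to rescaling, d_a and d_scaled are perfectly correlated even though the second protein has diverged twice as far; the coefficient compares relative rates, not absolute ones.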

However, the mirrortree results do not necessarily imply that there is an overall correlation between the evolutionary rates of interacting partners in the PIN as a whole. To demonstrate this requires the use of a complete interactome. The DIP data (Xenarios et al., 2002) for E. coli, as used for the analysis, have very low coverage, and the results over all of the known interaction partners for each protein, rather than just the top interactor, are not known. It may be true that for each test protein at least one other interacting protein has highly correlated evolution whilst the same effect is not apparent when assessed against the complete set of interaction partners. Kann et al. (2007) extended the mirrortree approach, restricting the analysis to highly conserved regions of protein domain sequences. This technique has been analysed and shown to increase the prediction accuracy for domain-domain interactions, rather than PPIs. Kann et al. (2008) describe a further study using the mirrortree approach that shows the correlated evolution detected is found across the binding sites and throughout the domain sequence.

Hakes et al. (2007a) performed a study on yeast proteins to attempt to discover whether the observed levels of correlated evolution were a result of co-evolution: compensatory mutations to maintain the interaction between two proteins. They found that the observed correlated evolution of interacting proteins is due to similar constraints on evolutionary rate, as opposed to co-evolution. They observed similar levels of correlated evolution across protein sequences, rather than simply in the binding interfaces for each interaction. The similar rates of evolution, therefore, were suggested not to be linked to the co-evolution of protein pairs that interact.


Definition 1.17 (Co-evolution) Co-evolution of two proteins occurs if divergent changes in one protein are complemented by compensatory changes in the second protein.

Distance matrix methods assume evidence of co-evolution to justify the PPI predictions (Jothi et al., 2005, 2006), rather than correlated evolution – as highlighted in Hakes et al. (2007a). It is important to differentiate between these concepts, both to guide the interpretation of results and to ensure that the biological characteristics are correctly defined. The coefficient itself, r_{A,B}, details linkage in the evolutionary rates and says nothing about how the proteins have actually evolved. Co-evolution, on the other hand, requires evolutionary changes to be complementary between the candidate proteins.

Whilst evidence of co-evolution implies correlated evolution, the opposite does not hold, as a similarity of evolutionary rates does not mean the mutations are necessarily compensatory. Correlated evolution may simply reflect evolutionary divergence, which has been linked with the expression rates of individual proteins rather than with PPIs (Jordan et al., 2003; Agrafioti et al., 2005; Drummond et al., 2006).

Chapter 4 explores the possible link of evolution with PPIs, focusing on the topology of phylogenetic trees. Sequence alignment tools are employed to identify orthologous proteins that are used to construct phylogenetic trees for each protein in S. cerevisiae. The topologies are then used to assess whether interacting proteins’ phylogenies are more similar in observed PINs than expected through comparison to the properties of randomly generated networks.

1.4 Protein interaction data

Experimental mapping of biological networks is challenging and requires considerable resources and effort. A collection of large protein interaction datasets exists for some organisms (Hermjakob et al., 2004; Breitkreutz et al., 2008). Large quantities of protein interaction data, for model organisms such as S. cerevisiae, became available as a consequence of high-throughput experimental technologies (Bader et al., 2008). These experiments report thousands of putative interactions each year (Collins et al., 2007a). In contrast, relatively few interactions had been reported in total before the turn of the century.

The large number of reported protein interactions represents the work of dozens of groups over many years (e.g. Uetz et al. (2000); Ito et al. (2001); Gavin et al. (2002); Lappe and Holm (2004)). The techniques used across these groups vary considerably but may be divided into two broad categories: traditional methods that delineate individual interactions or the interactions between a small number of proteins; and high-throughput methods that probe thousands of possible interactions simultaneously. Techniques analyse different genetic, biochemical or physical traits, probing various subsamples of the complete set of protein pairs. These studies have helped to enable resources such as the Gene Ontology relational database to be developed (Camon et al., 2004; Hong et al., 2008) and have allowed protein interaction maps to be produced.

Figure 1.5 shows an example PIN based on the data from two S. cerevisiae experiments. Each protein has a variety of biological characteristics and every interaction can be described by a collection of biological and experimental details. Although a protein interaction either occurs or does not occur between each protein pair, the available experimental data are often expressed quantitatively rather than qualitatively. This output is often reduced to binary information for each presented protein pair to enable its simple description as a graph, as shown in Figure 1.5. The types of experiment used to generate the static S. cerevisiae PPIs used throughout this thesis are described in this section.
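The reduction of quantitative output to a binary graph can be sketched as follows. This is an illustration under stated assumptions: each tested protein pair is taken to carry a single numerical score, a pair is reported as interacting when its score reaches a chosen threshold, and self-interactions are dropped. The protein labels, scores and threshold are hypothetical.

```python
def binarise(scores, threshold):
    """Undirected edge set: pairs of distinct proteins scoring at least threshold."""
    return {
        frozenset(pair)
        for pair, score in scores.items()
        if score >= threshold and pair[0] != pair[1]
    }


def adjacency(edges):
    """Map each protein to the set of its reported interaction partners."""
    neighbours = {}
    for edge in edges:
        a, b = sorted(edge)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    return neighbours


# Hypothetical quantitative scores for tested protein pairs.
toy_scores = {("A", "B"): 0.92, ("A", "C"): 0.35, ("B", "C"): 0.71, ("C", "C"): 0.99}
toy_edges = binarise(toy_scores, threshold=0.5)
toy_graph = adjacency(toy_edges)
```

The choice of threshold determines which pairs appear as edges, so any conclusion drawn from the resulting graph depends on it.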

1.4.1 Traditional methods

Traditional or small-scale experiments (SSE) are mainly hypothesis-driven tests that aim to answer specific biological questions (Cusick et al., 2009). They focus on understanding biochemical properties, binding affinities, or how processes are performed through combinations of protein interactions. PPI data from these techniques are limited to those proteins targeted to address hypotheses of interest.

Figure 1.5: Example protein interaction network. The nodes represent proteins, whilst each edge represents an interaction reported between the two proteins that it joins. The data are from Yuan et al. (2001) and Gurunathan et al. (2002), which form a subset of the BioGRID database. Different colours represent different biological process annotations that have been assigned to the proteins, a label concerning the biological properties of the protein.

Fluorescence resonance energy transfer (FRET) generates information on interacting proteins and provides in vivo spatial information using spectroscopy (Andrews and Demidov, 1999; Raveh et al., 2009). The proteins of interest are associated with complementary fluorophores that fluoresce when located closely together. FRET can be used to observe proteins binding as well as the abundance of the bound structure in vivo.

X-ray crystallography is used to determine the structure of molecules at the atomic level. Performing X-ray crystallography on these structures, which may include bound proteins, provides information about how the constituent parts bind with each other (Meinke et al., 2008). Other structural methods, such as nuclear magnetic resonance (NMR) (Freifelder, 1982), can also be used to provide similar information about protein complexes (Kiel et al., 2008).

Atomic force microscopes, which can measure to a resolution of fractions of a nanometre, can be used to measure interaction forces. These microscopy methods enable the analysis of protein interactions at the molecular level, but only for single interactions (Gaczynska et al., 2004).

1.4.2 High-throughput methods

High-throughput (HTP) experiments aim to survey as large a number of PPIs as possible using technology that can be scaled to test thousands of protein pairs (Cusick et al., 2009). These techniques can be readily automated and generally report more interactions than SSEs. Coverage (the protein pairs tested) of the interactome space depends on a variety of experimental limitations including unknown systematic bias and the inability to test certain proteins. Some HTP experiments may also exhibit bias towards testing proteins of particular function or interest.

Affinity capture, including co-immunoprecipitation, with mass spectrometry (Ho et al., 2002; Gavin et al., 2002) and yeast two-hybrid (Ito et al., 2001; Uetz et al., 2000) techniques have been used extensively to identify PPIs. Mass spectrometry (MS) analyses proteins in vitro by producing peptide ions which are recognized by their mass-to-charge ratios and consequently can be directly associated with particular proteins.

Yeast two-hybrid (Y2H) experiments require a transcription factor gene that produces two protein domains, DNA-binding and DNA-activating, which are both essential for the transcription of an associated reporter gene. The DNA-binding and DNA-activating domains, which are required in close proximity for the reporter gene to be transcribed, are separated for the Y2H experiment. A protein of interest (bait) is fused to the DNA-binding domain, and another protein (prey) is fused to a DNA-activating domain. Neither of these two fusion proteins, nor any of the four original parts, is sufficient alone to initiate the transcription of the reporter gene. The bait and prey are reported to bind if the reporter gene is transcribed when the two fused proteins are present (Ito et al., 2001).

There is a lack of symmetry when using bait-prey techniques. Whether a reported interaction can be replicated with the bait and prey swapped is important when determining the reliability of reported PPIs (Scholtens et al., 2008). The interaction characteristics of each protein cannot be assumed to be just the collection of all observed interactions, as the data contain noise. Information on the context of the experimental technique can also help to improve confidence in the predictions.

A variety of other high-throughput experimental methodologies have been used to populate protein interaction databases (Shoemaker and Panchenko, 2007a), some of which are detailed in Table 1.1. These methodologies probe subtly different biological traits, from which putative interactions have been derived. For instance, gene co-expression studies observe functional linkages between proteins rather than physical binding relationships (Bhardwaj and Lu, 2005). These identify different types of interaction, helping to populate the database of functional (or other biological) characteristics for individual proteins.

Method               Interaction       Assay
Yeast-two-hybrid     binary            in vivo
Mass spectrometry    complex           in vitro
Protein microarray   binary, complex   in vitro
Gene co-expression   functional        in vitro
Synthetic lethality  functional        in vivo

Table 1.1: HTP experimental methodologies. Methods used to find different types of protein-protein association, including protein-protein interactions.

Reliability issues in a range of contemporary HTP studies have been highlighted by von Mering et al. (2002) using a benchmark reference set of thousands of protein interactions. The putative interactions reported by each mapping showed little agreement relative to the number of interactions that each global study presented (Ito et al., 2001; Uetz et al., 2000). Assuming that each method probed the same protein pairs, this suggested a low true-positive rate, a high false-positive rate, or a combination of both.

von Mering et al. (2002) reported that a variety of the new techniques had FDRs of between 90% and 99%. However, these error rates were estimated from the overlap between a previously known reference set of PPIs and the new data. Although this is a flawed comparison (if all the interactions were already known, the experiments would have been pointless), it highlighted the skepticism that some contemporary approaches provoked. This skepticism has resulted in the analysis and development of several techniques designed to estimate and account for noise (Nariai et al., 2005; Shoemaker and Panchenko, 2007b).

1.4.3 Interaction inference

Interaction data, as well as being reported directly, can also be inferred from other biological association studies. Techniques such as gene co-expression probe functional linkages between proteins in vitro. Consequently, the collection of PPIs also includes a body of inferred evidence that has been derived from these studies alongside the binary PPI data found directly.



Protein complexes

Protein complexes have added indirect evidence for binary interaction partners. Each complex is a collection of proteins that have been found to bind as a multi-protein structure (i.e. one with more than 2 proteins). Krogan et al. (2006) and Gavin et al. (2006) reported on protein complexes in S. cerevisiae: Krogan et al. (2006) found 547 distinct complexes, averaging just under 5 proteins per complex, whilst Gavin et al. (2006) published 491 complexes.

There are a variety of methods that can identify pairwise interactions from complex experiment results. These include the matrix and spoke models (Bader and Hogue, 2002; Hakes et al., 2007c). Each method, having observed a multi-protein complex, assigns some subset of the protein pairs as binary interactions, according to some structural argument. These interactions may not actually have been observed, but are generally reported along with the complete structure that makes up the protein complex.

Figure 1.6 illustrates these assignments for a toy example of a complex. The complex is made up of 3 core proteins, always in the complex, and a selection of unessential periphery proteins.

(a) Matrix The matrix approach assigns protein interactions to all possible pairs found to co-occur in the experimentally observed complex. This ignores the possibility that each protein may not actually bind with every other protein, but is used to infer pairwise interactions from observed complexes.

(b) Spoke The spoke model refines the set of interactions that are assigned based on the complex found. A subset of the proteins, such as the core proteins found, are assumed to interact with all other members of the complex.

(c) Observed topology The topology approach observes the actual topology of the protein complex, assigning an interaction if the topology suggests the proteins actually bind to each other.
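The matrix and spoke assignments can be illustrated with a minimal Python sketch. The protein labels, the choice of core proteins and the function names are hypothetical and serve only to show how the two models differ in the pairs they assign; this is not the procedure of any of the cited studies.

```python
from itertools import combinations


def matrix_model(complex_members):
    """Assign an interaction to every pair of proteins in the complex."""
    members = sorted(set(complex_members))
    return set(combinations(members, 2))


def spoke_model(complex_members, hubs):
    """Assign interactions from each hub protein to every other member."""
    members = set(complex_members)
    edges = set()
    for hub in hubs:
        for other in members - {hub}:
            # Store each pair in sorted order so (A, B) and (B, A) coincide.
            edges.add(tuple(sorted((hub, other))))
    return edges


complex_members = ["A", "B", "C", "D", "E"]  # hypothetical protein labels
core = ["A", "B", "C"]                       # hypothetical core proteins

matrix_edges = matrix_model(complex_members)
spoke_edges = spoke_model(complex_members, core)
single_bait_edges = spoke_model(complex_members, ["A"])
```

In this toy complex the spoke assignment is a subset of the matrix assignment: the pair of periphery proteins (D, E) is assigned by the matrix model only.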

Hakes et al. (2007c) studied the differences between these methods to assess whether the spoke or matrix assignments produced a higher proportion of false-positive interactions.



[Figure: a protein complex made up of core proteins and periphery proteins, with panels (a) Matrix, (b) Spoke and (c) Observed.]

Figure 1.6: Complex interaction models. Each complex is made up of core and periphery proteins, and binary interactions can be inferred through: (a) Matrix method that assigns interactions between all possible protein pairs; (b) Spoke method that attributes interactions from a protein to all other proteins; or (c) a method by which interactions are assigned according to the molecular structure that has been experimentally observed.

Analysis of S. cerevisiae protein complexes showed that smaller complexes are best described by the matrix model. If the number of proteins in the complex exceeds 5, the spoke model is a better means of inferring pairwise PPIs.

Ultimately, it is important to note that the biological structures formed by proteins binding are not solely the result of pairwise interactions. The full collection of protein interactions includes a set of binary interactions and a separate (possibly overlapping) set of multi-protein complexes. The sets’ properties are not necessarily identical.

1.4.4 Computational predictions

In silico methods predict protein and domain interactions using trait information, by considering a variety of physical or functional associations (see Table B.1 in Appendix B).



However, comparisons with the small reference sets of known interactions suggest that even the most successful novel PPI prediction methods suffer from high false-positive and false-negative rates (Lu et al., 2005; Mika and Rost, 2006). In silico methods have also used a combination of the reported PPI data and biological characteristics in an attempt to reduce the noise found in putative interaction data (Deane et al., 2002).

Experimental data and computational predictions complement each other to form our knowledge of the true interactome. However, it is possible that the interpretation of the interactome using computational methods is biased by prior knowledge and assumptions.

Cross-species interactions

A selection of promising prediction methods have been used to infer interactions across different species (Wojcik et al., 2002; Li et al., 2004; Bork et al., 2004). These aim to transfer knowledge of interactions from a model organism to another organism. For example, if proteins A and B have been reported to interact in one species, and if orthologous (see Definition 1.9) proteins, A′ and B′, can be found in a different species, then the interaction is assigned to the second species provided certain conditions are met (Gertz et al., 2003; Albert and Albert, 2004). This is clearly a sensible starting point, but limitations are also evident: unreliable interaction data will be propagated across species and it may be difficult to reconcile conflicting data.
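A minimal Python sketch of this transfer step is given below. It assumes a hypothetical ortholog mapping supplied as a dictionary, and it omits the additional conditions that the cited methods impose before assigning an interaction, so it illustrates the idea only.

```python
def transfer_interactions(source_edges, orthologs):
    """Map interactions reported in a source species onto a target species.

    source_edges: iterable of (a, b) protein pairs from the source species.
    orthologs: dict mapping a source protein to a set of target orthologs.
    """
    predicted = set()
    for a, b in source_edges:
        for a_target in orthologs.get(a, ()):
            for b_target in orthologs.get(b, ()):
                if a_target != b_target:  # simple graphs: no self-interactions
                    predicted.add(tuple(sorted((a_target, b_target))))
    return predicted


source_edges = [("A", "B"), ("B", "C")]       # hypothetical reported PPIs
orthologs = {"A": {"A1"}, "B": {"B1", "B2"}}  # C has no known ortholog
predicted = transfer_interactions(source_edges, orthologs)
```

The sketch also makes the propagation of errors concrete: a single false-positive pair in `source_edges` yields a predicted pair for every combination of its orthologs.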

Biological trait based inference

A reference set of interactions can be used to assign belief, or confidence, to potentially interacting protein pairs. The traits of the reference set, such as sequence data or expression profiles, can predict which protein pairs are more likely to be in the true interaction graph (Bader et al., 2004; Ben-Hur and Noble, 2005). Hypothesis testing can be used to see whether a particular trait is correlated with observed PPI or PIN data (Agrafioti et al., 2005). If traits are being assessed against a PIN, the graph structure (or topology) may be important, and this may be captured using a probabilistic graph ensemble. However, the choice of model used for comparison subtly, and possibly significantly, affects the hypothesis tested (Thorne and Stumpf, 2007). Differences between graph ensembles, and their effects on inferences about PIN data, are explored in Chapter 3.



1.5 Graph theory

This section details graph theory as used in Chapters 2-4. The inter-relationships of proteins found in an organism’s proteome are studied in this thesis. In order to study these interactions, the experimental data are represented as a set of binary interactions between distinct proteins, forming a graph, $G$. In this thesis, the terms ‘network’ and ‘graph’ are used interchangeably.

Graphs are used to represent the PINs to enable possible understanding of the evolution and structure of the interactome. However, each individual interaction may only occur under specific circumstances and at particular times within the cell cycle. The interactions are all found on a set of proteins, $V$, which forms the proteome of interest. The aim is to be able to find which of the possible protein pairs, $\{(v_i, v_j) : i < j,\ v_i \neq v_j \in V\}$, are interactions and to analyse this set.

An undirected graph, $G \sim (V, E)$, consists of a set of nodes, $V$, together with a set of edges, $E$. Each edge, $e \in E$, is a pair of (unordered) nodes found in $V$. The degree of each node is equal to the number of edges that connect to it (see Definition 1.20). A graph, $G$, has order $|V|$ and size $|E|$. Figure 1.7 shows a graph with order 9 and size 10.

Figure 1.7: A graph. The red circles are the nodes, the set $V$, whilst the cyan links between nodes are the edges, $E$, of $G \sim (V, E)$.



Definition 1.18 (Graph) A graph, $G \sim (V, E)$, is a set of nodes $V = \{v_1, \dots, v_n\}$ and a set of edges $E = \{e_1, \dots, e_m\} \subseteq \{(v_i, v_j) : i < j,\ v_i \neq v_j \in V\}$.

Definition 1.19 (Subgraph) A graph, $H \sim (V_H, E_H)$, is a subgraph of $G \sim (V_G, E_G)$, $H \subseteq G$, if and only if $V_H \subseteq V_G$ and $E_H \subseteq E_G$.

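Definition 1.20 (Degree) The degree, $d(v_i)$, of node $v_i \in V$ is the number of nodes, $v_j \in V$, such that $(v_i, v_j) \in E$,

\[ d(v_i) = \sum_{j=1}^{n} I\big((v_i, v_j) \in E\big), \]

where

\[ I\big((v_i, v_j) \in E\big) = \begin{cases} 0 & \text{if } (v_i, v_j) \notin E, \\ 1 & \text{if } (v_i, v_j) \in E. \end{cases} \]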

The graphs considered here are simple, having no self-interactions, i.e. $(v_i, v_i)$, or edge directions. A directed graph has a direction on each edge $e = (v_i, v_j)$, leading to each node having both an in-degree and an out-degree. The graph can also be labelled: each element, $v \in V$ or $e \in E$, is then associated with some label, $\phi(v)$ or $\phi(e)$.

A simple graph $G$ can be represented as an upper-triangular binary matrix, $A$. This adjacency matrix, $A$, is an $n \times n$ matrix detailing the edges found in the graph.

Definition 1.21 (Adjacency matrix) An adjacency matrix, $A$, is an $n \times n$ upper-triangular matrix where the entry $a_{i,j}$ denotes whether there is an edge between the nodes $v_i$ and $v_j$ for $i < j$,

\[ a_{i,j} = I\big((v_i, v_j) \in E\big). \tag{1.2} \]
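The following Python sketch builds an upper-triangular adjacency matrix and recovers the node degrees from it. Nodes are indexed from 0 and the edge list is a hypothetical toy example. Because each edge is stored only once, above the diagonal, the degree of a node is the sum of its row and its column.

```python
def adjacency_matrix(n, edges):
    """Upper-triangular adjacency matrix of a simple graph on nodes 0..n-1."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        if i == j:
            raise ValueError("simple graphs have no self-interactions")
        i, j = min(i, j), max(i, j)  # store each edge once, with i < j
        A[i][j] = 1
    return A


def degrees(A):
    """Degree of each node: row sum plus column sum of the triangular matrix."""
    n = len(A)
    return [sum(A[i]) + sum(A[k][i] for k in range(n)) for i in range(n)]


edges = [(0, 1), (0, 2), (1, 2), (2, 3)]  # hypothetical toy graph
A = adjacency_matrix(4, edges)
d = degrees(A)
```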

1.5.1 Graph properties

The statistical properties of graphs, from analyses of individual nodes to measurements across the complete graph, $G$, motivate the studies throughout this thesis. The properties measured are divided into two distinct types: graph structural characteristics that can be found from the adjacency matrix; and biological characteristics that require edge or node labels. Each characteristic is referred to as a trait when it is a global property of the graph. For example, the average degree is a network trait whilst the average level of co-expression of interacting proteins across the whole graph is a biological trait.

Network traits

Network traits of a graph are defined as those statistical properties that can be derived solely from the adjacency matrix, $A$. These are topological characteristics of the graph. Some of the important properties, which also have relevance to PINs, are now introduced. The network trait, $\Pi_\alpha(G)$, for a characteristic $\alpha(v_i)$ (or $\alpha(e_i)$) is defined as the arithmetic mean of the characteristic across the graph.

Definition 1.22 (Degree sequence) The degree sequence of the graph $G$ is the list of node degrees from Definition 1.20, $[d(v_1), \dots, d(v_n)]$, for all nodes $v_i \in V$.

The distribution of degrees found in a graph is used to summarise a graph’s structure. A graph where each node has the same number of edges may have different statistical properties (biological or topological) than another graph that has the same size and order but where most of the edges are found between only a subset of the nodes. Experimentally determined PINs have been found to contain a set of nodes, hubs, that have a very high degree (He and Zhang, 2006). Hub-enriched biological networks have encouraged the development of evolutionary models (Stumpf et al., 2007) that attempt to explain how these graphs have been created.

The clustering coefficient summarises properties of graph nodes and can help the analysis of graph motifs (recurring subgraphs found in the data). PINs, as well as other physical networks, have been observed to contain only a small proportion of the possible edges (a sparse graph), together with a relatively high clustering coefficient for each node (Barabasi and Albert, 1999; Carlson and Doyle, 1999; Yook et al., 2002).

Definition 1.23 (Neighbours) The neighbours, $N(v_i)$, of a node $v_i$ are those nodes $v_j$ that are connected to the node of interest: $N(v_i) = \{v_j \in V : (v_i, v_j) \in E\}$.

Definition 1.24 (Clustering coefficient) The clustering coefficient, $C(v)$, of a node $v \in V$ is the proportion of pairs of its neighbours that are themselves neighbours,

\[ C(v) = \frac{\sum_{v_i, v_j \in N(v)} I\big((v_i, v_j) \in E\big)}{\binom{d(v)}{2}}. \]
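As an illustration, the clustering coefficient can be computed from neighbour sets as in the Python sketch below. The toy edge list is hypothetical, and returning 0 for nodes with fewer than two neighbours is an assumed convention, since the ratio above is undefined when $d(v) < 2$.

```python
from itertools import combinations


def neighbours(edges):
    """Neighbour set N(v) of every node that appears in the edge list."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    return nbrs


def clustering_coefficient(v, nbrs):
    """Fraction of pairs of neighbours of v that are joined by an edge."""
    k = len(nbrs.get(v, ()))
    if k < 2:
        return 0.0  # assumed convention: no pairs of neighbours exist
    linked = sum(1 for a, b in combinations(nbrs[v], 2) if b in nbrs[a])
    return linked / (k * (k - 1) / 2)


edges = [(0, 1), (0, 2), (1, 2), (2, 3)]  # hypothetical toy graph
nbrs = neighbours(edges)
```

In this toy graph node 0 lies in a triangle and both of its neighbours are linked, whereas only one of the three pairs of neighbours of node 2 is linked.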

Nodes found in a clique are likely to have higher clustering coefficients than other nodes, and regions of the graph where there is a high proportion of the possible edges are also referred to as highly clustered subgraphs.

Definition 1.25 (Complete graph) A complete graph, $K_{|V|} \sim (V, E)$, is a graph where $E$ contains all possible edges, $(v_i, v_j)$, between nodes in $V$.

Definition 1.26 (Clique) A subgraph, $H \sim (V_c, E_c)$, of $G$ is a clique if $V_c \subseteq V$ and $H$ is complete, $H = K_{|V_c|}$.

Two nodes are connected if there is a path between them, and the graph is connected if a path can be found between all nodes. Measuring the distance between nodes provides information about a graph’s connectedness. Examining the different paths between two nodes, $v_i$ and $v_j$, enables an assessment of the structural stability or robustness of the graph.

Definition 1.27 (Path) A path in a graph, $G \sim (V, E)$, is a sequence of distinct nodes, $v_i \in V$, such that from each of the nodes there is an edge to the next node, $v_{i+1}$, in the sequence. A path, $P(v_1, v_p)$, is a set, $\{v_1, v_2, \dots, v_p\}$, such that:

\[ \exists\, (v_i, v_{i+1}) \in E \quad \forall\, i \in [1, p-1]. \]

Definition 1.28 (Distance) The distance, $\mathrm{dis}(v_i, v_j)$, between two different nodes, $v_i, v_j \in V$, $i \neq j$, is the length of the minimal path, $\mathrm{dis}(v_i, v_j) = \min |P(v_i, v_j)|$, that exists between the two nodes.
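In an unweighted graph the distance can be found with a breadth-first search, as in the Python sketch below. The sketch counts the edges on a shortest path, which is an assumption about how path length is measured, and returns `None` when no path exists. The neighbour dictionary is a hypothetical toy graph.

```python
from collections import deque


def distance(adj, source, target):
    """Number of edges on a shortest path from source to target, or None."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two nodes are not connected


# Hypothetical toy graph given as neighbour sets; node 4 is isolated.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}, 4: set()}
```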



The shortest path lengths found between nodes have also been used as a measure of centrality for graphs. For PINs, the average path length is small relative to the number of nodes. The distance between any pair of nodes of the giant connected component (GCC) is small (Dorogovtsev and Mendes, 2001; Watts, 2004).

Definition 1.29 (Giant connected component) The giant connected component (GCC) of a graph is the largest connected subgraph.
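A minimal Python sketch of how the GCC might be extracted is given below: it groups nodes into connected components with a depth-first search and returns the largest. The neighbour dictionary is a hypothetical toy graph, and ties between equally large components are broken arbitrarily.

```python
def connected_components(adj):
    """List of node sets, one per connected component of the graph."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        component, stack = set(), [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(component)
    return components


def giant_component(adj):
    """Node set of the largest connected component."""
    return max(connected_components(adj), key=len)


# Hypothetical toy graph: a component of 4 nodes, a pair, and an isolated node.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}, 4: {5}, 5: {4}, 6: set()}
gcc = giant_component(adj)
```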

Definition 1.30 (Robustness) Robustness is the ability of a system to respond to either external or internal changes whilst maintaining consistent behaviour.

The robustness of a graph can be measured in terms of its topological robustness by monitoring the effect of small perturbations on each characteristic (Barabasi and Oltvai, 2004). If, for instance, deleting random edges or nodes from the graph has little effect on the average distance between nodes, then the distance over the graph can be said to be robust.

Definition 1.31 (Motif) A motif of a graph is any local, recurring subgraph. For instance, a ‘triangle’ motif is the complete graph on 3 nodes.

Motifs, or graphlets, refer to small subgraphs of a large graph (Shen-Orr et al., 2002; Milo et al., 2002; Przulj, 2007). Counting the different motifs, for example the subgraphs that can occur on 3 nodes, has been used to differentiate between different random graph models and empirical data. The definitions, however, are not always consistent and it is difficult to compare subgraphs with different orders. Differences in the counting techniques have plagued some of the early work on the significance of particular motif features. This has made it difficult to compare the statistics to expectations or to each other (Kashtan et al., 2004; Konagurthu and Lesk, 2008). However, analytical results have been obtained regarding the ability to test the significance of the number of individual motifs found in a PIN given a random graph model (Picard et al., 2008).

Empirical graphs have been shown to contain various correlations between neighbouring nodes; for example, nodes of similar degree may also be neighbours. Assortativity measures correlations between node properties. The assortativity of degree has been of interest for physical networks (Maslov and Sneppen, 2002; Vazquez et al., 2002; Newman, 2003). For example, in social networks, the probability of each interaction is not independent, as friendship groups make up highly connected regions, affecting the graph’s structure (Newman and Park, 2003).

Definition 1.32 (Graph assortativity) Assortativity details the correlation between nodes of a graph, e.g. the correlation of the connectivity of neighbours.

Each characteristic, for a node or edge, can be averaged over the graph under consideration. This ensemble average can then be compared with other graphs to observe differences, e.g. the difference in the average degrees of graphs $G$ and $H$.

Biological traits

The biological properties of a graph are described by biological traits, $\Delta(G, \Phi)$. These are assessed using both the graph, $G$, and a characteristic $\Phi = \{\phi_1, \dots, \phi_m\}$ of the graph’s edges (or nodes). Unlike network traits, a biological trait will not necessarily be invariant under permutation of node labels.

Biological traits can include information on the properties or sequence composition of the proteins as well as measurements of protein activity such as abundance measurements. When the biological characteristic of interest, $\beta(\cdot)$, is associated with each node, $v_i \in V$, a function $f : \beta(V) \times \beta(V) \to \mathbb{R}$ is used to find a characteristic, $\Phi$, for the edges.
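The construction of an edge characteristic from a node characteristic can be sketched in Python as follows. The annotations, the choice of $f$ (an indicator that two proteins share an annotation) and the use of the arithmetic mean as the summary are all illustrative assumptions, chosen by analogy with the network traits above.

```python
def edge_characteristic(edges, beta, f):
    """Apply f to the node characteristics at the two ends of each edge."""
    return [f(beta[u], beta[v]) for u, v in edges]


def biological_trait(edges, beta, f):
    """Arithmetic mean of the edge characteristic over the graph (assumed)."""
    phi = edge_characteristic(edges, beta, f)
    return sum(phi) / len(phi)


def shares_annotation(a, b):
    """Hypothetical f: 1 if two annotation sets overlap, 0 otherwise."""
    return 1.0 if a & b else 0.0


# Hypothetical node characteristic: cellular-component labels per protein.
beta = {"P1": {"nucleus"}, "P2": {"nucleus", "cytoplasm"}, "P3": {"ribosome"}}
edges = [("P1", "P2"), ("P2", "P3")]
trait = biological_trait(edges, beta, shares_annotation)
```

Relabelling the nodes while keeping the annotations fixed would generally change `trait`, which is the sense in which a biological trait need not be invariant under permutation of node labels.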

Biological characteristics have been proposed as being linked with PPIs (Valencia and Pazos, 2002; Bhardwaj and Lu, 2005; Thorne and Stumpf, 2007), and as a means of classifying PPIs (Salwinski and Eisenberg, 2003; Yu and Fotouhi, 2006; Skrabanek et al., 2008; Ramani et al., 2008). These classification methods use various biological characteristics to predict interactions (Ben-Hur and Noble, 2005; Srinivasan et al., 2007) or confer additional support for observed PPIs (Bader et al., 2004; Shen et al., 2007).

Bader et al. (2004) developed a quantitative method for evaluating the biological relevance of PPIs from large scale experimental data. Information from other sources, such as mRNA expression, genetic interactions and other biological annotations, was compared with PPIs. These characteristics are used to assign levels of confidence to the putative PPIs to enable the generation of more reliable interaction datasets.



Genes that produce proteins which interact are believed to be found in close physical proximity on the genome (Overbeek et al., 1999; Skrabanek et al., 2008). These interacting proteins have shown correlated functional and process annotations across many organisms and genomes. Classification methods have used the cellular location of proteins to define protein pairs that do not interact (Jansen et al., 2003; Jansen and Gerstein, 2004).

Protein complexes have been shown to be linked to the function of proteins (Krogan et al., 2006), and are linked to the set of PPIs. In order to clearly differentiate between a complex and a PPI, the former is defined here as involving more than 2 different proteins. Either way, complexes must be linked with PPIs as each is formed by a collection of individual interactions.

Definition 1.33 (Protein complex) A protein complex is a bound protein structure containing more than two proteins.

Correlated mRNA expression patterns of proteins have been used to infer function across species (Eisen et al., 1998; Marcotte et al., 1999; Stuart et al., 2003). Transcriptional mRNA levels have also been used to determine the age of physical PPIs and to infer physical binary interactions (Deane et al., 2002; Jansen et al., 2003). Ramani et al. (2008) compared human mRNA co-expression patterns between orthologous genes in other species, to demonstrate the ability of mRNA data to identify proteins that are found in the same protein complexes and PPIs.

Gene Ontology

Biological traits are annotated within databases through the use of frameworks such as Gene Ontology (GO) (Ashburner et al., 2000; Camon et al., 2004). GO is a structured hierarchy for gene and protein annotations based on their involvement in: biological processes; cellular components; or molecular functions. Each category forms a relational vocabulary in which the terms are related to one another hierarchically, similar to the EC nomenclature for enzymes.

GO is used to categorise, and compare, different proteins with the potential to determine PPIs or improve the reliability of experimental predictions (Lin et al., 2004). GO slim, a non-hierarchical classification system based on the relational vocabulary, is used throughout this thesis. This allows a broad overview of the ontology, enabling easier comparison of terms without any requirement to define semantic distances between different annotations.

For each category, known annotations relate to particular biological properties that have been experimentally determined. For instance, for cellular components, a protein will be annotated with each component in which it has been found (e.g. nucleus, plasma, or ribosome).

These are a selection of characteristics that have been reported as being associated with interactions. Accordingly, biological traits related to these findings are used to characterise observed PIN data throughout the thesis.

1.5.2 Graph ensembles

Random graphs are used extensively throughout this thesis. A random graph is used here to refer to a graph with a fixed number of nodes and edges, $n$ and $m$ respectively. Random graphs are generated from various graph probability distributions. These distributions produce graphs with characteristics and properties that match those found in an empirical graph. Graphs sampled from a probability distribution are referred to as being drawn from a particular graph ensemble.

Definition 1.34 (Graph ensemble) Each graph ensemble is a probability distribution over the space of possible graphs, in general for a fixed number of nodes, $n$, and/or edges, $m$.

The ensembles considered later represent the set of graphs that meet certain constraints. The constraints used are biological and network traits that are believed to be relevant to the generation of PPIs or PINs, as described in Section 1.5.1.

Random graph studies have developed over the last 50 years from the study of graphs with a fixed probability that each edge is present (Erdos and Renyi, 1959) to the generation of large scale graphs, with thousands of nodes, that exhibit similar properties to real-life graphs (Watts and Strogatz, 1998). The development of generative methods for real-life large scale graphs has accompanied the study of the evolution of biological networks including PINs. This section describes some graph ensembles used in network research and more recently for PIN analyses.



Erdos-Renyi graphs

Erdos and Renyi (1959) introduced the random graph. Starting with an empty graph $G \sim (V, \emptyset)$ on $n$ nodes, $V = \{v_1, \dots, v_n\}$, edges are added at random with the same probability. The number of edges $m$ found in each random graph is fixed, so sampling (without replacement) $m$ edges from the complete set of edges $\{(v_i, v_j) : v_i, v_j \in V,\ i < j\}$ determines the graph.

However, the Erdos-Renyi (ER) model of random graphs used throughout this thesis is subtly different to the above description. A random graph is denoted by $G(n, p)$, a graph with $n$ nodes and such that every possible edge occurs independently with probability $p$.

Definition 1.35 (Erdos-Renyi Graph) An ER graph is $G(n, p)$ on $n$ nodes such that each edge, $(v_i, v_j)$, occurs independently with probability $p$, $P\big((v_i, v_j) \in E\big) = p$.
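The difference between the two descriptions can be made concrete with a short Python sketch. The function names, the graph sizes and the fixed seed are illustrative assumptions: `sample_gnp` draws each edge independently, so the number of edges varies between samples, whereas `sample_gnm` draws a fixed number of distinct edges.

```python
import random
from itertools import combinations


def sample_gnp(n, p, rng):
    """G(n, p): include each possible edge independently with probability p."""
    return [e for e in combinations(range(n), 2) if rng.random() < p]


def sample_gnm(n, m, rng):
    """Fixed-size variant: draw m distinct edges uniformly without replacement."""
    return rng.sample(list(combinations(range(n), 2)), m)


rng = random.Random(0)  # fixed seed so that the sketch is reproducible
gnp_edges = sample_gnp(50, 0.1, rng)
gnm_edges = sample_gnm(50, 120, rng)
```

Under $G(n, p)$ the number of edges is binomially distributed with mean $p\,n(n-1)/2$, so matching an empirical graph of size $m$ on average corresponds to setting $p = 2m / (n(n-1))$.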

Although these graphs can be used as random samples to compare to observed data, they do not replicate some of the prominent properties of real-life data. The combination of a small number of edges and high clustering coefficients, as seen in empirical systems (Takemoto and Oosawa, 2005), is not well represented by ER random graphs.

Small-world graphs

Small-world graphs (Dorogovtsev and Mendes, 2001; Watts, 2004) describe networks where nodes can be reached from each other by traversing a small number of edges, so the average path length is small. The path length between two nodes, $v_i, v_j \in V$, $i \neq j$, is $\mathrm{dis}(v_i, v_j) \leq \log(n) = \log(|V|)$. A further typical property of small-world models is that the graphs are sparse. An average node has a small number of neighbours, and the graph size $|E| \ll \frac{n(n-1)}{2}$.

Scale-free graphs

A scale-free graph has a power-law distributed degree sequence, such as the Pareto distribution (Newman, 2005). The probability, $P(d(v_i) = k)$, of nodes having $k$ neighbours is distributed as $k^{-\gamma}$. The value of $\gamma$, for observed data, has been generally estimated to be in the range $2 < \gamma < 3$. Many empirical networks have been observed to be scale-free in addition to having small-world features (Li et al., 2007), or not (Small et al., 2007, 2008). Empirical PINs have been shown to have both small-world and scale-free properties (Almaas, 2007).

Definition 1.36 (Scale-free graph) A scale-free graph $G$ on $n$ nodes is such that the degree distribution follows a power-law distribution, $P(d(v_i) = k) \sim k^{-\gamma}$.
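To make the power-law form concrete, the Python sketch below evaluates the log-likelihood of a degree sequence under a discrete power law truncated at a maximum degree, and picks the best exponent from a grid. The truncation at `k_max`, the exclusion of degree-zero nodes, the grid and the toy degree sequence are all assumptions made for illustration; this is not the fitting procedure of any of the cited studies.

```python
import math


def power_law_pmf(k, gamma, k_max):
    """P(d = k) proportional to k**(-gamma) on k = 1, ..., k_max."""
    norm = sum(j ** -gamma for j in range(1, k_max + 1))
    return k ** -gamma / norm


def log_likelihood(degree_sequence, gamma, k_max):
    """Log-likelihood of independent degrees under the truncated power law."""
    norm = sum(j ** -gamma for j in range(1, k_max + 1))
    return sum(-gamma * math.log(k) - math.log(norm) for k in degree_sequence)


def fit_gamma(degree_sequence, k_max, grid):
    """Exponent on the grid that maximises the log-likelihood."""
    return max(grid, key=lambda g: log_likelihood(degree_sequence, g, k_max))


degree_sequence = [1, 1, 1, 1, 2, 2, 3, 5, 8]  # hypothetical degrees, all >= 1
grid = [1.5 + 0.05 * i for i in range(41)]     # candidate exponents 1.5 to 3.5
gamma_hat = fit_gamma(degree_sequence, k_max=10, grid=grid)
```

The same log-likelihood can be computed for alternative degree models, which allows competing descriptions of an observed degree sequence to be compared on a common scale.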

The networks of the internet (Carlson and Doyle, 1999), power grids (Watts and Strogatz, 1998), and latterly PPIs (Stumpf and Wiuf, 2005) have been found to have a small number of hubs along with a degree distribution similar to the power-law distribution. In biological systems, Barabasi and Albert (1999) also observed that empirical networks exhibited an abundance of hubs and claimed that the degree distribution of the PIN is best described by the same power-law degree model.

The role of hubs and cliques in PINs has generated much interest (He and Zhang, 2006; Batada et al., 2006a,b; Kim et al., 2008). Their ability to confer apparent robustness has also been studied extensively (Barabasi and Oltvai, 2004; Wagner, 2005). Hub-enriched graphs are robust (see Definition 1.30) to random deletion of edges (Albert and Barabasi, 2000; Yook et al., 2002), as these perturbations do not greatly affect the average path length between nodes. However, removal of particular hubs can drastically alter the average path length across the graphs, hence these graphs are referred to as ‘robust yet fragile’ (Wagner, 2005; Doyle et al., 2005).

The power-law distribution has been used as a diagnostic test for the degree sequences of PIN data (Reguly et al., 2006). However, it has been shown that the scaling properties of the observed degree distribution are not best approximated by this simple probability distribution (Tanaka et al., 2005).

Stumpf et al. (2005a) used a likelihood based approach (see Section A.1 in Appendix A) to assess how best to model the degree distribution for empirical PIN data. For observed data, the likelihood analysis of the degree distribution indicates which of the generation models assessed is the most likely. The authors showed that simple power-law models do not provide the best description of the observed data from S. cerevisiae, where discretised log-normal distributions are seen to be a better fit. The conflicting reports found in the literature suggest that using these simple probability models for the degree distribution may be a flawed means of modelling the empirical data.



Biological graphs

Empirical graphs have also generated interest in further models that fix network traits such as the degree sequence, clustering coefficients and other local structure (Barabasi and Albert, 1999; Park and Newman, 2004; Stumpf et al., 2007; Kim and Marcotte, 2008). Several different types of graph model have been used to model PINs including: graph ensembles that generate scale-free degree sequences (Aiello et al., 2000; Gkantsidis et al., 2003; Li et al., 2005); exponential random graph models (ERGMs) that have been used in social network research (Pattison and Wasserman, 1999; Robins et al., 2007); mixture models that use ER random graphs (Daudin et al., 2007); and geometric random graphs (Higham et al., 2008) (described in Section A.2 in Appendix A).

Graph models have also been developed that use evolutionary motivation (Aiello et al., 2000), starting from a graph with two nodes by duplicating nodes and preferentially attaching new edges to highly connected nodes. Duplication-divergence graphs model the process of gene duplication: each new node added is identical to an existing node, and then small changes are made to reflect evolutionary divergence (Gkantsidis et al., 2003). These aim both to model the observed PIN data and to improve knowledge of how the networks may have evolved. ERGMs produce graphs that have the same expected trait statistics as empirical network data (Pattison and Wasserman, 1999), enabling an assessment regarding the types of graphs generated with given traits. These techniques can be used to test the influence of traits on the graph or when assessing the significance of observed features against what would be expected in a random graph with the same properties.



1.6 Noise in interactions

It is essential to appreciate how accurately empirical data reflect the true interactome. For this reason it is of crucial importance to have an understanding of noise found in PIN data when performing any global PIN analyses. Noise may be stochastic, arise from systematic experimental error, or be related to biological properties of the proteins being tested (interactions may be transient or condition dependent). The focus in Chapter 5 is on how the amount of stochastic error can be measured. Error definitions and literature concerning estimations of the PIN size and PPI data error rates are discussed in this section.

Determining whether two proteins interact is hard to achieve. Many issues with experimental data exist including: biases or systematic errors from experimental techniques (Aloy and Russell, 2002; Chiang et al., 2007); how to use binding affinities to infer interactions (Aloy and Russell, 2006); and basic uncertainties regarding our understanding of the regulation system of the cell.

Interactions may only occur under specific conditions, or experimental techniques may be inaccurate, producing sets of data where interactions cannot be known for certain. Similarly, these experimental techniques may only be able to produce a subset of the true interactions, making it difficult to define absolutely proteins that do not interact. These false interactions are needed, as well as true interactions, for biological prediction algorithms to work effectively when reference sets for both interactions and false interactions are required (Ben-Hur and Noble, 2005).

PINs are represented using uncertain data to produce static empirical graphs. The edges (interactions) of the graph either exist (1) or do not exist (0). Although this may be correct for the actual protein interaction network, over all environments, our representation is based on the collation of putative interactions, as opposed to an analysis of a subgraph of the true graph. Theoretically, the interactome sought is the set of all interactions that can occur under any conditions in vivo or in vitro for a fixed collection of proteins (and their possible genotypes).

Measurement error is often ignored in graph-based studies of these systems (for example in Schwikowski et al. (2000) and Barabasi and Oltvai (2004)). Biological traits can aid in minimising the number of false reported interactions considered in the graphs: by pruning putative interaction data according to some known biological information (Deane et al., 2002), or by using interaction set overlaps to find errors in each dataset in contrast to known interactions (D’haeseleer and Church, 2004). The choice of known interactions, however, will undoubtedly be biased until the complete true interactome is discovered.

Chapters 3-5 study experimental data in order to interpret empirical graphs and also to develop a model to find the size of the true PIN, which forms the interactome of interest throughout the thesis.

1.6.1 Sampling notation

The set of nodes, $V$, for the PINs considered is a proteome (see Definition 1.3). Let the complete set of protein pairs be $E_\Omega$, forming the edge set for a graph of the interaction sample space, $\Omega$. This graph is used to assess the coverage (tested protein pairs) and error rates of experimental data.

Definition 1.37 (Interaction sample space) The interaction sample space, $\Omega$, is the complete graph, excluding loops, between node pairs found in the proteome $V$ (so $\Omega \sim K_{|V|}$).

$$\Omega \sim (V, E_\Omega), \quad \text{where } E_\Omega = \{(v_i, v_j) : v_i, v_j \in V,\ v_i < v_j\}.$$

The interaction graph, $G$, is the protein interaction network or interactome, as introduced in Definition 1.5. Any pair of nodes, $v_i, v_j \in V$, is either in the true (unknown) interactome $G$, or not.

Definition 1.38 (Interaction) An interaction is a pair of proteins $(v_i, v_j)$ that is in the true interaction graph, $G$.

Definition 1.39 (Non-interaction) A false interaction is a pair of proteins $(v_i, v_j)$ that is not in the true interaction graph, $G$.

Let any edge, for example $(v_i, v_j)$, be either an interaction or a false interaction, based on whether it is found in the true interaction graph, $G$. Let the set of false interactions, on the node set $V$, be defined as the false interaction graph, $G'$. The union of these graphs is the interaction sample space, $G \cup G' = \Omega$.


Definition 1.40 (Interaction graph) The interaction graph, $G$, has an edge between each pair of proteins that interact in the considered interactome.

$$G \sim (V, E), \quad \text{where } (v_i, v_j) \in E \iff v_i \text{ and } v_j \text{ bind}.$$

Definition 1.41 (False interaction graph) The false interaction graph, $G'$, has edges between all proteins in the proteome that do not interact.

$$G' \sim (V, E'), \quad \text{where } E' = E_\Omega \setminus E.$$

The set of all edges, $E_\Omega$, is needed in order to estimate the proportion of possible protein pairs that have been tested, and to assess the error rates defined in Definitions 1.42-1.45.
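The sampling notation can be made concrete with a small sketch. The protein names and the choice of true edges below are purely illustrative, since the true interaction graph is unknown in practice.

```python
from itertools import combinations

# Toy proteome; names and "true" edges are hypothetical.
V = ["P1", "P2", "P3", "P4"]

# E_Omega: every unordered pair of distinct proteins (no loops).
E_omega = set(combinations(sorted(V), 2))

# E: edge set of the true interaction graph G.
E = {("P1", "P2"), ("P2", "P3")}

# E': edge set of the false interaction graph G', the complement of E.
E_prime = E_omega - E

# G and G' partition the interaction sample space.
assert E | E_prime == E_omega and not (E & E_prime)
```

Storing each pair as a sorted tuple mirrors the condition that each unordered pair appears once in the edge set.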

1.6.2 Error rates

Any experiment, $P$, may be represented as a graph. The reported interaction dataset, $E_P$, forms the edges of the graph, $G_P$, whilst the nodes, $V_P$, are the set of proteins that appear in the interaction data. The experiment size is defined as the number of edges, $|E_P|$, that are in $G_P$. Given uncertainty in experimental interaction data, the edges found in the graph may be members of $E$ or $E'$.

Two further pieces of information are required to measure the error rates for a particular experiment: the complementary set of protein pairs that were tested but gave negative results, and the true interactome ($G$). The latter is unknown, whilst the negative results are not generally reported explicitly. This makes it difficult to assess the error rates. The set of tested interactions, those that have been considered in the experiment, $P$, could be any superset of $E_P$. Chapter 5 assumes that all protein pairs from $V_P$ are tested in any experiment $P$, a technique known as node sampling (Lee et al., 2006) (see Appendix E). The graph $G$, or a representative reference set sampled from the interactome, is also required. The notation used for experimental error is now briefly outlined and presented in Table 1.2.

Definition 1.42 (False positives) The false-positive set of interactions, $FP$, contains protein pairs that are erroneously reported as true interactions (these are in fact not interacting proteins). These interactions are actually found in $E'$. Denote by $p_{FP}$ the conditional probability that a reported interaction is a false-positive.

Definition 1.43 (True positives) The true-positive set of interactions, $TP$, contains interactions that are correctly reported (found in $E$) in an experiment. Denote by $p_{TP}$ the conditional probability that an interaction is a true-positive.

Definition 1.44 (False negatives) The false-negative set of interactions, $FN$, contains those that are tested but incorrectly not reported in an experiment (in $E$). Denote by $p_{FN}$ the probability that an interaction is a false-negative.

Definition 1.45 (True negatives) The true-negative set of interactions, $TN$, contains those that are tested but correctly not reported in an experiment (in $E'$). Denote by $p_{TN}$ the probability that an interaction is a true-negative.

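|                | Edge $\in E$ | Edge $\notin E$ |
|----------------|--------------|-----------------|
| Edge found     | $p_{TP}$     | $p_{FP}$        |
| Edge not found | $p_{FN}$     | $p_{TN}$        |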

Table 1.2: Error rate notation. Four error characteristics required to study the available interactome data. Each $p_A$ is a conditional probability regarding whether an interaction is reported (edge found) or not (edge not found). The probabilities are conditional on whether the interaction is actually a true interaction, so whilst the columns sum to 1 the rows may not.

The false discovery rate (FDR) is directly associated with the size estimates for the interactome model described in Chapter 5. The determination of the sets of true interactions and false interactions, for a dataset, forms the key to estimating the test statistic FDR. Throughout the thesis, when referring to PPI data, the FDR is used as a summary statistic for a dataset, rather than an expectation.

Definition 1.46 (False discovery rate) The false discovery rate for a dataset, $P$, is the proportion of false-positive interactions found.

$$\mathrm{FDR} = \frac{|FP|}{|TP| + |FP|}.$$
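As a worked illustration of Definitions 1.42-1.46, the sketch below splits a set of tested pairs into the four error sets and computes the FDR. It assumes that the true edge set and the tested set are both known, which is rarely the case for real data; all names and the toy data are hypothetical.

```python
def error_sets(reported, tested, true_edges):
    """Split the tested pairs into the TP, FP, FN and TN sets."""
    tp = reported & true_edges      # reported and truly interacting
    fp = reported - true_edges      # reported but not interacting
    not_reported = tested - reported
    fn = not_reported & true_edges  # tested, interacting, but missed
    tn = not_reported - true_edges  # tested and correctly not reported
    return tp, fp, fn, tn


def false_discovery_rate(tp, fp):
    """FDR = |FP| / (|TP| + |FP|)."""
    return len(fp) / (len(tp) + len(fp))


# Toy data: pairs are stored as sorted tuples.
true_edges = {("P1", "P2"), ("P2", "P3")}
tested = {("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P1", "P4")}
reported = {("P1", "P2"), ("P1", "P3")}

TP, FP, FN, TN = error_sets(reported, tested, true_edges)
fdr = false_discovery_rate(TP, FP)
```

In this toy example one of the two reported pairs is a false-positive, so the FDR is 0.5.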

1.6.3 Error and size estimates

PPIs are observed in a variety of environments and using different methods. The experimental techniques employed to determine interactions (discussed in Section 1.4 on page 34) present differing amounts of noise. There is a need to reassess the reliability of PIN data as more interaction studies are published. Studies interested in error rates primarily focus on individual large experimental datasets (von Mering et al., 2002). However, this thesis is interested in assessing the FDR found across a collection of studies, rather than a single experiment.

Several studies have estimated the FDR and interactome size of S. cerevisiae. They use similar methods to each other and the estimates can be dependent on each other. However, they differ in their approach to the set of tested protein pairs and in the experimental studies they use. A rapid accumulation of new PPI data in recent years (see Chapter 2) has also affected the estimates. This section describes a collection of methods that have been used to find the size of the S. cerevisiae physical interactome.

Overlapping interactions

D’haeseleer and Church (2004) presented an overlap method for estimating error rates in PIN data sets. As a consequence of their methodology they are also able to estimate the size of the S. cerevisiae PPI interactome. Their overlap method estimates the FDR from 3 data sets: a reference set (taken from a reliable PPI source) along with two other large experimental sets. Using the overlaps between all the sets, and assuming that these overlaps are error free (both as a consequence of the validation, and also the assumption that the reference is highly accurate), the ratio that should occur if the other sets are error free can be found.

Figure 1.8 shows the overlap sets, between the 3 different datasets, that are used to find the false discovery rate. The FDR estimate is found from the ratios of the number of interactions, I-VI, found in each part of Figure 1.8. A and B are experimental datasets whilst the REFERENCE set is a set of true interactions. The area separated by the dashed line contains all the false interactions. The FDR is approximated by assuming that the data were obtained independently, such that the ratio between I and II is equal to the ratio of III and IV. The sizes of V and VI can then be found that solve

$$\frac{\mathrm{I}}{\mathrm{II}} = \frac{\mathrm{III}}{\mathrm{IV}}. \tag{1.3}$$

Figure 1.8: Overlap method. The overlap found between three different interaction datasets, two of which are being compared against a reference set. FDR is estimated using the ratios of the number of interactions, I-VI, found for the 3 datasets. A and B are experimental datasets whilst REFERENCE is a set of true interactions. The area separated by the dashed line represents estimated interaction noise, containing V and VI interactions from sets A and B respectively.

The solution to this equation may not be unique, and the size of VI has also been separately estimated to help find a unique FDR (Deng et al., 2003).
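The ratio argument can be sketched numerically for a single dataset. The assignment of regions below is an assumption made for illustration and may not match the labelling of Figure 1.8 exactly: for dataset A, region I is taken to be the pairs shared with B and the reference, II the pairs shared with B only, III the pairs shared with the reference only, and the remaining pairs of A are split into true interactions (IV) and noise (V). The counts are invented.

```python
def overlap_fdr(i, ii, iii, a_only_nonref):
    """Estimate the noise in dataset A from overlap counts.

    Assumed (hypothetical) region labels for dataset A:
      i             : pairs in A, B and the reference set
      ii            : pairs in A and B but not the reference set
      iii           : pairs in A and the reference set but not B
      a_only_nonref : pairs found only in A (true part IV plus noise V)
    """
    if i <= 0:
        raise ValueError("region I must be non-empty to form the ratio I/II")
    # I / II = III / IV  implies  IV = II * III / I.
    iv = ii * iii / i
    # Whatever remains of the A-only pairs is attributed to noise.
    v = max(a_only_nonref - iv, 0.0)
    total_a = i + ii + iii + a_only_nonref
    return iv, v, v / total_a


iv, v, fdr_a = overlap_fdr(i=40, ii=20, iii=100, a_only_nonref=250)
```

With these invented counts, 50 of the 250 A-only pairs are estimated to be true, leaving 200 of the 410 pairs in A as noise.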

Figure 1.9, an example of I-VI found in Figure 1.8 for reported interaction studies, shows the overlaps found between PPI data for sets assessed by D’haeseleer and Church (2004). The FDRs found using this method ranged from 0.46 to 0.90. The interactome size estimates obtained from the overlap between pairs of studies were 8,535-10,127 for Uetz et al. (2000) and Ito et al. (2001), and 7,257-25,440 for Ho et al. (2002) and Gavin et al. (2002).

Grigoriev (2003) measured the overlap of interaction data for each protein to find the number of interactions. The number of interactions for a protein from a particular experiment is assumed to be binomially distributed, similarly to the assumption used for the coupon collecting model described in Chapter 5.

Suppose that for each protein, $A$, $a_A$ different interactions are found in one study and $b_A$ different interactions in a second study. The average overlap, $O$, between the interaction sets for every protein is used to eliminate noise and find the number of interactions, $n_A$, for protein $A$. Then $n_A$ for each individual protein, $A$, is found from

$$n_A = \frac{a_A b_A}{O}. \tag{1.4}$$

Figure 1.9: High-throughput interaction overlap. This Venn diagram shows the overlap between two protein interaction studies and a reference set. The reference set is from MIPS (Guldener et al., 2006) whilst the interaction data are reported in Ito et al. (2001) and Uetz et al. (2000).

Grigoriev (2003) concluded that S. cerevisiae proteins have an average degree of between 3 and 5, leading to an estimate of between 16,000 and 26,000 distinct interactions in S. cerevisiae. These estimates excluded proteins thought to generate a number of false-positive interactions. The method can easily be extended to take account of known error rates, as $a_A$ and $b_A$ can be adjusted for each protein, $A$, to reflect the occurrence of false interactions.
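The per-protein estimate of Equation 1.4 amounts to a single division, as the sketch below shows. The protein names, counts and the value of the average overlap are invented; how the average overlap is computed across proteins is not reproduced here.

```python
def estimated_degree(a, b, overlap):
    """Estimate a protein's number of interactions as n_A = a_A * b_A / O."""
    if overlap <= 0:
        raise ValueError("the average overlap O must be positive")
    return a * b / overlap


# Hypothetical counts of interactions found for each protein in two studies.
found_in_study_1 = {"P1": 4, "P2": 6}
found_in_study_2 = {"P1": 3, "P2": 5}
average_overlap = 2.0  # hypothetical value of O

n = {
    protein: estimated_degree(
        found_in_study_1[protein], found_in_study_2[protein], average_overlap
    )
    for protein in found_in_study_1
}
```

A smaller overlap between the two studies implies that each has sampled a smaller share of the true interactions, so the estimated degree rises.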

Protein coverage

The coverage of each experiment is of vital importance when considering error rates. Each experimental technique has possibly different amounts of noise or may only be able to test a subspace of the possible protein pairs. Experimental noise and coverage bias have been considered in order to produce FDR and interactome size estimates (Chiang et al., 2007; Huang et al., 2007; Gentleman and Huber, 2007). The probability of a false-positive, or the ability to test certain protein pairs, may not be the same for each type of experiment and this may influence the error results found using overlap methods.

Huang et al. (2007) used capture-recapture tests for yeast-two-hybrid experiments to estimate the FDR, coverage and interactome size of S. cerevisiae. This method is similar to an overlap study (see Figure 1.8), although the coverage is fixed as replicates are analysed where possible. This indicated that between 15% and 27% of the yeast (and potentially 45% of worm and fly) data are misclassified as interactions.

Hart et al. (2006) focused on improving the methodology set out in Figure 1.8 (D’haeseleer and Church, 2004). Early studies were found to understate the interactome size as each dataset probed different sets of protein pairs. Using estimated error rates and the intersection of datasets, they estimated which protein pairs had been tested. The estimated size is then scaled to take account of this coverage. The S. cerevisiae PIN was found to have 38,000-76,000 interactions, and current data contained only half of the true interactions (Hart et al., 2006).

The need to assess the coverage of each study was highlighted by Gentleman and Huber (2007). Direct comparison of interaction data fails to take account of the different protein pairs tested in each study (illustrated in Figure 1.10). Each study does not test every protein pair in the whole set $E_\Omega$. Accordingly, unless this is explicitly taken into account, the overlap between studies will be lower than expected, increasing the reported error rate.

Stumpf et al. (2008) assessed the size of the S. cerevisiae interactome whilst discussing the possible sizes of various interactomes including that of H. sapiens. The proportion of proteins tested in each study and an assumed model for the true interaction graph, $G$, are used to find the size of the interactome. The authors estimated that the S. cerevisiae interactome size is 24,000-26,000. The study showed the potential differences in interactome size in comparison to proteome sizes. Although H. sapiens has 20% more genes than Caenorhabditis elegans, it has over twice as many PPIs: over 500,000 in contrast to approximately 250,000.

Figure 1.10: Sample space overlap. When using overlap in PPI experiments, the underlying space that is being probed is of crucial importance to the final estimate. Many maps suggest they are globally sampling the protein space, but in reality, as a result of experimental limitations, they observe only a subset of the possible interaction pairs. Each black circle represents a protein, whilst I, II and IV show the spaces tested by each method. III is the space that should be used as a reference set of proteins when the overlap in interactions between studies I and II is assessed.

False-negative interactions

The rate of false-positive and false-negative interactions has also been estimated using repeated interaction data from two studies. Chiang et al. (2007) performed error analysis using: the number of repeated reported interactions ($X$); the number of tested protein pairs between which no interaction is reported ($Y$); and the number of non-repeated reported interactions ($Z$). The error rates, for a pair of datasets, were found using the following relationships:

$$
\begin{aligned}
E(X) &= m(1 - p_{FN})^2 + n\,p_{FP}^2, \\
E(Y) &= m\,p_{FN}^2 + n(1 - p_{FP})^2, \\
E(Z) &= 2m\,p_{FN}(1 - p_{FN}) + 2n\,p_{FP}(1 - p_{FP}),
\end{aligned}
\tag{1.5}
$$

where $m$ is the number of true interactions (observable interactome size) and $n$ is the number of false interactions. If the set of proteins, $V$, tested against each other is known, then a further condition is

$$\binom{|V|}{2} = m + n.$$
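The expectations can be evaluated directly, as in the sketch below. It assumes that every pair is tested twice with independent errors, so that each pair is reported twice, once or never; under that assumption the three expected counts add up to $m + n$. The function name and the numerical values are illustrative only.

```python
from math import comb


def expected_counts(m, n, p_fp, p_fn):
    """Expected numbers of pairs reported twice (X), never (Y) and once (Z).

    m and n are the numbers of true and false interactions; errors in the
    two datasets are assumed independent with the same rates.
    """
    e_x = m * (1 - p_fn) ** 2 + n * p_fp ** 2
    e_y = m * p_fn ** 2 + n * (1 - p_fp) ** 2
    e_z = 2 * m * p_fn * (1 - p_fn) + 2 * n * p_fp * (1 - p_fp)
    return e_x, e_y, e_z


# Hypothetical setting: 100 proteins all tested against each other.
total_pairs = comb(100, 2)
m = 200
n = total_pairs - m
e_x, e_y, e_z = expected_counts(m, n, p_fp=0.01, p_fn=0.5)
```

Because the three expectations always sum to $m + n$, observing $X$, $Y$ and $Z$ supplies only two independent equations for the three unknowns $m$, $p_{FP}$ and $p_{FN}$.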

Equations 1.5 reveal the trade-off between $p_{FP}$ and $p_{FN}$ inherent in comparisons of two datasets. If a reference set is used to find the FDR in experimental data, then careful consideration of its coverage and false-negative rate is required for accurate results. The reference set should not be assumed to be just a random subset of the true interactome.

Reported estimates

Tables 1.3-1.4 detail FDR and interactome size estimates for S. cerevisiae. The FDR estimates are high, ranging from around 0.15 to 0.90 for interaction sets, but are still informative due to the low marginal probability that a random protein pair forms an interaction. As time has passed, and more data have become available, the size estimates have tended to increase. However, the current consensus appears to suggest that the number of distinct PPIs is between 20,000 and 40,000. Hart et al. (2006) present a larger and wider prediction than other methods such as that of Stumpf et al. (2008), who estimate perhaps a third fewer S. cerevisiae PPIs.

| Study                         | Data                | FDR       |
|-------------------------------|---------------------|-----------|
| D’haeseleer and Church (2004) | Uetz et al. (2000)  | 0.46      |
| D’haeseleer and Church (2004) | Gavin et al. (2002) | 0.50-0.68 |
| D’haeseleer and Church (2004) | Ho et al. (2002)    | 0.79-0.90 |
| Huang et al. (2007)           | Ito et al. (2001)   | 0.15-0.27 |

Table 1.3: FDR estimates. This table contains some FDR estimates found in S. cerevisiae PPI studies.

| Study                    | Year | Interactome size |
|--------------------------|------|------------------|
| Tucker et al. (2001)     | 2001 | 8,000-12,000     |
| von Mering et al. (2002) | 2002 | > 30,000 (a)     |
| Bader and Hogue (2002)   | 2002 | ≈ 20,000         |
| Sprinzak et al. (2003)   | 2003 | 10,000-17,000    |
| Grigoriev (2003)         | 2003 | 16,000-26,000    |
| Hart et al. (2006)       | 2006 | 38,000-76,000    |
| Stumpf et al. (2008)     | 2008 | 24,000-26,000    |

(a) Includes inferred interactions from matrix complex annotations, which possibly overstate binary PPIs.

Table 1.4: S. cerevisiae interactome size predictions. Published estimates of the S. cerevisiae interactome size from different sources, along with the year of publication.

1.7 Summary

Chapter 1 has outlined the concepts that feature prominently in Chapters 2-5. The graphs, random graph methods and traits are relied upon in all subsequent chapters. Biological interactions form graphs on thousands of distinct nodes, containing an unknown number of inter-relationships between individual nodes. These relationships are studied in order to understand how biological organisms function, and to appreciate possible differences in complexity across different species. An understanding of how function may be inferred across organisms enables easier experimentation on model organisms.

Graphs have been studied in an attempt to understand better the structural properties of biological networks. Random graph models have made progress in simulating the degree sequences, and elements of the local structure, of the observed graphs. The degree sequence, as well as other network traits, has been used to characterise biological networks. Whilst the degree sequence does not fully summarise all aspects of the graph, it is believed to contain information pertinent to the underlying evolution of the graph and how that relates to the stability and operation of the system. The traits of empirical graph data are discussed further in Chapter 2 along with analysis of the biological traits used in subsequent chapters.

Random graphs can be compared to empirical data to assess the relevance of traits and characterise properties of real life networks, as pursued in Chapter 3. In turn, graph studies can further the biological understanding of PINs as well as the underlying proteins and biological interactions. Evolutionary rates of protein pairs have been linked, along with a number of biological characteristics, to the incidence of PPIs. Studies have looked for linkage over small subsamples of the complete interactome, which may have been biased by prior knowledge. Chapter 4 uses graph theory alongside the phylogenetic concepts introduced in this chapter to compare phylogenetic tree topologies.

Experimental data from PPI studies form the knowledge about interactomes. However, the data are a subsample of the true interactome and may also contain a number of incorrectly observed protein pairs. Accordingly, other biological traits that are perhaps easier to measure have been used to classify possible interactions and measure the error rates of published data. The coverage of an experiment (the protein pairs that have been tested as interactors) is paramount when considering its error. Chapter 5 introduces a model that finds the FDR and interactome size from any experimental dataset.

Chapter 2

An exploratory analysis of interaction data

This chapter presents an analysis of the S. cerevisiae interactome data. A collection of the most prominent databases of S. cerevisiae PPI data is introduced (Section 2.2). The biological traits of the S. cerevisiae interactome data found in BioGRID are then analysed (Section 2.3). Finally, three empirical PINs are defined from the interaction databases and their traits contrasted (Section 2.4). These graphs are used in subsequent chapters.

2.1 Introduction

The vast increase in interaction data since early studies on PINs, for instance from the novel methods introduced in Section 1.4, makes reappraisal of biological and network characteristics essential. Section 2.2 details some of the important PPI databases that contain S. cerevisiae data. The analyses in Section 2.3 give an understanding of the current state of the S. cerevisiae interactome data, as well as the properties of the graph data that are used subsequently in this and later chapters.

The interactome is further investigated in Section 2.4, through the use of three empirical graphs taken from the Database of Interacting Proteins (DIP) and the Biological General Repository for Interaction Datasets (BioGRID). These datasets are analysed and their traits compared to inform later work. Validations of individual interactions are also discussed, as these can confer information about the reliability of putative data.

2.2 Interactome databases

A variety of databases is available, each including a different sample of the complete set of interaction data. These databases provide different services: some include computationally inferred interactions, whilst others report only experimentally determined PPIs that satisfy certain criteria. Table 2.1 details a selection of these interactome sources. Three key S. cerevisiae databases are described here: Munich Information Center for Protein Sequences (MIPS); Database of Interacting Proteins (DIP); and the database of primary concern throughout this thesis, Biological General Repository for Interaction Datasets (BioGRID).

| Database                       | Website                  | Interactions |
|--------------------------------|--------------------------|--------------|
| BioGRID (Stark et al., 2006)   | www.thebiogrid.org       | 220,000      |
| IntAct (Kerrien et al., 2006)  | www.ebi.ac.uk/intact     | 170,000      |
| BIND (Bader et al., 2001)      | bind.ca                  | 84,000       |
| DIP (Xenarios et al., 2002)    | dip.doe-mbi.ucla.edu     | 57,000       |
| MPact (Guldener et al., 2006)  | mips.gsf.de/services/ppi | 15,000       |
| MIPS (Guldener et al., 2006)   | mips.gsf.de/services/ppi | 4,000        |

Table 2.1: Interaction databases. A selection of available online sources of protein interaction data, for different species, that can be used to form empirical PINs.

MIPS interaction data are found in the MPact database (Mewes et al., 2006). MIPS includes a set of high confidence yeast protein interactions from small scale experiments. It separates HTP from SSE experimental data, enabling researchers to use MIPS PPIs as a potentially reliable reference set.

DIP was established to collate interactions found in S. cerevisiae (Xenarios et al., 2002). DIP includes, in addition to the list of interaction data found from the literature, a CORE subset of interactions. CORE PPIs are those from DIP that satisfy criteria which aim to improve data quality (Deane et al., 2002). Although the majority of the data found in DIP come from S. cerevisiae studies, it also now contains interaction information on around twenty other organisms.

BioGRID aims to store all the interactions experimentally reported in published literature (Breitkreutz et al., 2008). The data include physical (i.e. protein-protein) and genetic (e.g. protein-DNA) interactions and are regularly updated. It contains curated (manually reviewed) interaction data originally formed by a comprehensive curation of published articles (Reguly et al., 2006). Binary PPIs are reported from multi-protein complex data using the spoke model (see Section 1.4.3). Having been established in 2006, it now contains over 220,000 interaction reports from 21,638 publications for 22 organisms. S. cerevisiae contributes over half of the reported interactions to the BioGRID database. This extensive source contains repeats of reported PPIs.

2.3 Analysis of the BioGRID S. cerevisiae database

This section presents an analysis of the available S. cerevisiae PIN data from BioGRID. Owing to the rate of data generation, as illustrated later by Figure 2.1, it seems apt to reassess the properties of the reported interactions used here rather than relying on claims made in previous studies. The properties of genetic and physical interactions, found using the methods introduced in Section 1.4, are studied in this section.

BioGRID v2.0.39 (April 2008 release) is used here, containing 115,024 reported S. cerevisiae protein interactions. This section details the stratification of these data according to the experimental techniques used to obtain them, the year in which they were published, and the size of the experiment (i.e. SSE or HTP) from which they were obtained. Additionally, the numbers of self-interactions and the GO annotations of interacting proteins found using different experimental methods are explored. The GO annotations are presented in order to assess how these traits relate to the reported interactions, having been previously linked with PPIs. The overlap methods, and other error rate techniques, have relied on the ability to compare data from contrasting techniques, so the biological properties of the experimental methods are also compared in this section. Finally, the repeated interactions are analysed as a prelude to the estimation of PIN size and FDR described in Chapter 5.

Throughout, when counting the number of distinct interactions, no distinction is made between bait and prey. For example, it makes no difference if proteins A and B are reported as (A,B) or (B,A). From a biophysical point of view this may not be the best interpretation, but it enables easier comparison between techniques and of multiply repeated interactions.
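One way to apply this convention is to store each report as an ordered pair before counting, as in the minimal sketch below. The helper name and the toy reports are hypothetical.

```python
from collections import Counter


def canonical_pair(a, b):
    """Return the pair in a fixed order so that (A,B) and (B,A) coincide."""
    return (a, b) if a <= b else (b, a)


# Toy list of reported interactions given as (bait, prey).
reports = [("A", "B"), ("B", "A"), ("A", "C")]

counts = Counter(canonical_pair(bait, prey) for bait, prey in reports)
n_distinct = len(counts)
repeated = {pair for pair, k in counts.items() if k > 1}
```

Here the two reports of A with B collapse into one distinct interaction that is counted as repeated.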

2.3.1 Year of publication

Figure 2.1 shows the previously mentioned increase in reported interactions whilst dividing them into genetic and physical interactions. In general, more interactions are reported every year. However, during the early years shown, the reported data predominantly pertained to genetic interactions, and in the last decade physical PPIs have become the main reported data. There are 66,464 physical and 48,560 genetic reported interactions in BioGRID.

In 2002, when there was a marked increase in the number of physical interactions reported, HTP techniques based on mass spectrometry led to large scale complex identification alongside other novel PPIs (Ho et al., 2002; Gavin et al., 2002). There has been a further surge in the number of newly reported genetic interactions since 2003. Collins et al. (2007a) presented an analysis of the genetic interactions found from the newly acquired physical complex data, detailing previously unknown functional information. This built on the earlier protein complex data of Krogan et al. (2006) and Gavin et al. (2006).

HTP techniques have become more widely used in recent years. The reduction in the yearly reported PPIs immediately after the HTP studies in 2002 may reflect an early difficulty of publishing new experimental data obtained using these methods. However, the techniques are now used more regularly along with other HTP methods for genetic (Pan et al., 2006; Collins et al., 2007b) and physical (Ptacek et al., 2005; Collins et al., 2007a) interactions. These studies have found thousands of interactions in recent years, of which the majority are still novel (as shown in Figure 2.9).

[Figure 2.1: bar chart of reported interactions by year of publication, 1977-2007, split into physical and genetic interactions, with an inset for 1977-1991.]

Figure 2.1: Number and type of interaction reported in S. cerevisiae by year. Genetic interactions dominate the earliest data until 1997. The inset shows, in more detail, the types of interaction reported between 1977 and 1991. Since 1997, the majority of interactions reported have been physical.

2.3.2 Experiment size and technique

Reported interactions originate from studies using a number of different experimental techniques. There are 22 different experimental techniques contained in BioGRID. The physical and genetic interaction sets are divided by these techniques (see Table B.3 in Appendix B). Physical interactions are sourced mainly from affinity capture, two-hybrid and FRET experiments whilst the genetic interactions are found using other techniques.

Interactions come from 7,393 different published studies across 5,232 yeast proteins, comprising two main types of study: HTP and SSE. HTP studies form a small proportion of experiments, with only 52 studies (0.7%) reporting more than 100 interactions and 11 (0.1%) that contribute over 1,000 interactions to the database. However, the majority of the data (77,854 interactions or 68%) come from these 52 studies. Amongst the 7,341 SSEs (i.e. those reporting 100 or fewer interactions), 88% report fewer than 10 distinct interactions. Several authors have questioned the reliability of large scale studies relative to small scale experiments (Aebersold and Mann, 2003; Phizicky et al., 2003; Bader et al., 2004). Figure 2.2 shows the number of studies and interactions (including repeats) produced by the different experimental methods.
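The split of studies by size can be sketched as below. The record layout, the publication identifiers and the decision to count distinct pairs per study are assumptions made for illustration; only the threshold of 100 interactions comes from the text.

```python
from collections import defaultdict

HTP_THRESHOLD = 100  # studies reporting more than this are treated as HTP


def classify_studies(records):
    """Label each publication as HTP or SSE by its number of distinct pairs.

    records: iterable of (publication_id, protein_a, protein_b) tuples.
    """
    by_study = defaultdict(set)
    for publication, a, b in records:
        by_study[publication].add(tuple(sorted((a, b))))
    return {
        publication: "HTP" if len(pairs) > HTP_THRESHOLD else "SSE"
        for publication, pairs in by_study.items()
    }


# Synthetic records: one large study and one small study with a repeat.
records = [("pub1", f"P{i}", f"P{i + 1}") for i in range(150)]
records += [("pub2", "P0", "P1"), ("pub2", "P1", "P0"), ("pub2", "P2", "P3")]
labels = classify_studies(records)
```

Counting raw reports rather than distinct pairs would move some studies across the threshold, so the choice should be stated when figures are compared.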

Figure 2.2: Experimental techniques contribution to BioGRID. The majority of the physical interactions (techniques in the red dashed area) have been found using affinity capture MS and yeast-two-hybrid studies, whilst the majority of the studies focus on affinity capture western. The genetic data (all within the green dashed area) are spread more evenly across a wider range of methodologies.

Figure 2.2 shows that physical interactions have been primarily identified from data generated by affinity capture using either western blot or mass spectrometry (MS). Affinity capture MS studies produce the most interactions both overall (59% of physical PPIs) and per published article (on average, 157 interactions). The yeast-two-hybrid (Y2H) studies also contribute a large number of the interactions (15% of physical), whilst the remaining methodologies contribute a smaller proportion of physical reports. Genetic data are spread more evenly across 8 techniques, although phenotypic enhancement (37% of genetic) and synthetic lethality (26% of genetic) experiments contribute the majority of the data.

As mentioned, affinity capture MS studies contribute, on average, over 150 distinct interactions to the data, which is far higher than any of the other methods. Y2H (on average, 13 interactions per study) and biochemical activity (23) form the only other methods that contribute more than an average of 10 physical interactions per published article. These 3 techniques make up the majority of the HTP data (if defined by experiment size). For genetic studies, phenotypic suppression (26), phenotypic enhancement (25), synthetic lethality (22) and synthetic growth defect (21) techniques contribute more than 10 interactions per study.

2.3.3 Self-interactions

There is a collection of proteins that have been shown to self-interact, for instance forming a dimer of two identical proteins (homodimer). Self-interactions have been reported using the majority of experimental techniques, as shown in Table 2.2. In order to compare the biological characteristics of PPIs produced by each method it is necessary to appreciate the propensity of self-interaction reporting. Trivially, the biological characteristics found for any self-interaction will match as only a single protein is involved. Self-interactions found in the BioGRID data are removed from subsequent analyses: interest lies only in those PPIs found between different proteins.

Protein-peptide (29.2%) and co-crystal structure (18.2%) exhibit a high proportion of self-interactions, as shown in Table 2.2. This would clearly influence any potential interaction analysis, or classification, based on the similarity of interacting partners' biological annotations. Self-interactions have rarely been reported using genetic methods (6 distinct reports from dosage rescue, 3 from phenotypic suppression and 4 across the synthetic techniques, taken from, for example, Mosch and Fink (1997); Brizzio et al. (1999); Harkness et al. (2002); Umemura et al. (2007)). Much of this variation can simply be explained by the nature of the respective experimental methodology.



Method                     Interactions  Self-interactions  [%]
Affinity Capture-MS              24,295                227  [0.9]
Affinity Capture-RNA                 57                  1  [1.8]
Affinity Capture-Western          4,523                124  [2.7]
Biochemical Activity              5,192                 23  [0.4]
Co-crystal Structure                132                 24  [18.2]
Co-fractionation                    562                 14  [2.5]
Co-localization                     304                  3  [1.0]
Co-purification                   1,193                 19  [1.6]
Dosage Growth Defect                 63                  0  [0.0]
Dosage Lethality                    433                  0  [0.0]
Dosage Rescue                     3,059                  6  [0.2]
Far Western                          53                  1  [1.9]
FRET                                 68                  4  [5.9]
Phenotypic Enhancement           15,948                  0  [0.0]
Phenotypic Suppression            4,395                  3  [0.1]
Protein-peptide                     113                 33  [29.2]
Protein-RNA                          33                  0  [0.0]
Reconstituted Complex             1,748                 91  [5.2]
Synthetic Growth Defect           5,809                  0  [0.0]
Synthetic Lethality               9,638                  2  [0.0]
Synthetic Rescue                  1,931                  2  [0.1]
Two-hybrid                        7,802                345  [4.4]

Table 2.2: Self-interactions found from each experimental technique. The number of distinct self-interactions found in the BioGRID data is shown. Protein-peptide and co-crystal structure techniques have a far higher tendency of reporting self-interactions, whilst genetic techniques do not report many protein-DNA self-interactions.
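The removal of self-interactions and the percentages in Table 2.2 can be illustrated with a minimal Python sketch. The function names and the toy protein labels (`P1`, `P2`, ...) are hypothetical and introduced only for illustration; interactions are assumed to be unordered pairs, so that (a, b) and (b, a) are treated as the same distinct interaction.

```python
def split_self_interactions(reports):
    """Separate distinct self-interactions from distinct pairs of different proteins.

    `reports` is an iterable of (protein_a, protein_b) tuples. Pairs are
    treated as unordered (an assumption of this sketch).
    """
    distinct = {tuple(sorted(pair)) for pair in reports}
    self_pairs = {pair for pair in distinct if pair[0] == pair[1]}
    return distinct - self_pairs, self_pairs


def self_interaction_percentage(n_interactions, n_self):
    """Percentage of a technique's distinct interactions that are self-interactions."""
    return 100.0 * n_self / n_interactions


# Hypothetical toy reports; the labels are placeholders, not real proteins.
toy_reports = [("P1", "P2"), ("P2", "P1"), ("P3", "P3"), ("P1", "P4")]
hetero, homo = split_self_interactions(toy_reports)

# Counts taken from Table 2.2.
protein_peptide = self_interaction_percentage(113, 33)
co_crystal = self_interaction_percentage(132, 24)
```

Only the pairs in `hetero` would be carried forward to the annotation analyses.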

2.3.4 Gene Ontology annotations of interacting proteins

A selection of probabilistic methods have been proposed that use location, functional, and process annotations to find novel, or prune existing, PPI data (Chinnasamy et al., 2006; Skrabanek et al., 2008). Methods have assumed that interactions require matching functional characteristics to be included in the training data used to represent true interactions (Jansen and Gerstein, 2004), and have used pairs of proteins that do not co-localise to generate a negative training set (Jansen et al., 2003). Accordingly, the biological characteristics of proteins that interact should predominantly show matching properties for the available GO categories, as these have already been used to assign training sets of PPIs.

GO annotations for function, location, and process are used to assess how well these biological characteristics reflect the BioGRID PPI data. An organism-specific S. cerevisiae 'GO slim' scheme, based on GO categories and taken from the Saccharomyces Genome Database (SGD), is used for comparison of proteins. The three GO categories have different annotations for: 21 molecular functions; 23 cellular components; and 32 biological processes. Each protein can have multiple annotations or no known annotation. In the latter case, the protein is either classed as unknown or ignored, dependent on the analysis. The proportions of PPIs that have been reported for all different combinations of GO annotation are shown in Figures 2.3-2.5.

Figure 2.3: Molecular function annotations of reported interactions. A heatmap describing the proportion of possible protein pairs that are reported in BioGRID, stratified according to the GO annotation molecular function category. The proportion of protein pairs reported ranges from 0% (white) to 32% (red). The diagonal classes, showing incidence of matching annotations, exhibit the highest proportion of reported interactions.

Figure 2.3 shows, for functional annotations, the percentage of protein pairs for each particular annotation that have been reported as interactions, ranging from 0% to 32% of the possible protein pairs. Proteins that share the same annotation have the highest marginal probability of having been reported as interacting, although this set of interactions is only 20% of the reported PPIs. The rest of the reported data are from proteins that do not share functional annotations. For example, a high proportion of interactions have been reported between protein binding and motor activity classes, and also between the enzyme regulator and signal transducer activity classes.
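The quantity plotted in each heatmap cell can be sketched as follows. This is only an illustration under stated assumptions: a protein pair is taken to be "possible" for a cell (term_a, term_b) if one protein carries term_a and the other carries term_b, proteins with several annotations contribute to several cells, and self-pairs are excluded. The function name, protein labels and annotation terms are hypothetical.

```python
from itertools import combinations


def reported_proportion(annotations, interactions, term_a, term_b):
    """Proportion of possible protein pairs for (term_a, term_b) that are reported.

    `annotations` maps each protein to a set of GO slim terms and
    `interactions` is an iterable of unordered protein pairs.
    """
    reported = {frozenset(pair) for pair in interactions if pair[0] != pair[1]}
    possible = 0
    observed = 0
    for p, q in combinations(sorted(annotations), 2):
        terms_p, terms_q = annotations[p], annotations[q]
        if (term_a in terms_p and term_b in terms_q) or \
           (term_b in terms_p and term_a in terms_q):
            possible += 1
            if frozenset((p, q)) in reported:
                observed += 1
    return observed / possible if possible else None


# Hypothetical toy data.
toy_annotations = {
    "P1": {"kinase"},
    "P2": {"kinase"},
    "P3": {"binding"},
    "P4": {"binding", "kinase"},
}
toy_interactions = [("P1", "P2"), ("P1", "P3")]

within = reported_proportion(toy_annotations, toy_interactions, "kinase", "kinase")
between = reported_proportion(toy_annotations, toy_interactions, "kinase", "binding")
```

In the toy data one of the three possible kinase-kinase pairs is reported, and one of the five possible kinase-binding pairs.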

Figure 2.4 shows how cellular component annotations relate to the observed PPIs. There are 23 different annotations, and again the within-annotation protein pairs show the highest tendency to have been observed as interactions. For this GO category fewer than a fifth of the potential protein pairs have been reported for any annotation characteristic. Several annotation pairs also exhibit a higher proportion of PPIs than some of the within-annotation groups, e.g. between cell wall and extracellular proteins, and also between nucleus and chromosome. Only 28% of the reported interactions are between proteins that share the same component annotation.

Figure 2.4: Cellular component annotations of reported interactions. A heatmap describing the proportion of possible protein pairs that are reported in BioGRID, stratified according to the GO annotation cellular component category. The proportion of protein pairs reported ranges from 0% (white) to 17% (red). The diagonal classes, showing incidence of matching annotations, exhibit the highest proportion of reported interactions.



Biological process annotations cover a larger number of classes. Figure 2.5 shows the distribution of reported PPIs across these categories, which behaves similarly to that of the other ontology groups. Relationships showing the highest marginal rate of reported interactions are between proteins with identical annotations. However, once again, a small overall percentage (approximately 21%) of the reported data are from protein pairs that have the same annotation class. Although the electron transport proteins appear to show little agreement with the other classes, there are only a limited number of interactions reported involving proteins from this class (17 in total), making any inference about how these components interact with other classes difficult.

Figure 2.5: Biological process annotations of reported interactions. A heatmap describing the proportion of possible protein pairs that are reported in BioGRID, stratified according to the GO annotation biological process category. The proportion of protein pairs reported ranges from 0% (white) to 16% (red). The diagonal classes, showing incidence of matching annotations, exhibit the highest proportion of reported interactions.

In summary, although within-class annotations are highly represented, these interactions only represent a small proportion of the overall PPI data. Whilst it may be appealing to assume that some clear link between these biological annotation categories and PPIs exists, and attribute the other data to noise, the evidence suggests that assigning training sets for PPI predictions that rely on GO characteristics is misleading. The probabilistic methods may, in fact, be predictive of GO characteristic linkage, rather than reliable predictors of PPIs.

2.3.5 Gene Ontology annotations and experimental techniques

Experimental techniques may not be designed to sample the same protein pairs, just as some are not designed to find homodimers. Overlap studies, introduced in Section 1.6.3, have reported error rates based on the comparison of data from different experiments, in general using a reference set from a different technique (e.g. the MIPS set used in D'haeseleer and Church (2004) is predominantly drawn from SSE methods). The level of similarity shown between GO characteristics for each method may give an indication of how similar the data are and guide whether it is a valid assumption to use a reference set from SSE methods to test the reliability of HTP data.

For all subsequent analysis, an interaction is defined as having matching annotations if both proteins have the same known GO annotation for a particular annotation category. Figures 2.6-2.8 show the proportion of matching annotations found for PPIs reported by each experimental technique. For each analysis an interaction is included only if there exists a known GO annotation for each protein.
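A minimal Python sketch of this definition is given below. Because a protein may carry several annotations, the sketch assumes that two proteins "match" when they share at least one term within the category; that reading, the function names and the toy data are assumptions made for illustration only.

```python
def has_matching_annotation(terms_a, terms_b):
    """True/False if both proteins have known annotations, otherwise None.

    Matching is taken to mean sharing at least one term (an assumption).
    """
    if not terms_a or not terms_b:
        return None  # at least one protein has no known annotation
    return bool(set(terms_a) & set(terms_b))


def matching_proportion(interactions, annotations):
    """Proportion of matching interactions among those with known annotations.

    Returns (proportion, number_of_interactions_included).
    """
    results = [
        has_matching_annotation(annotations.get(a), annotations.get(b))
        for a, b in interactions
    ]
    known = [r for r in results if r is not None]
    if not known:
        return None, 0
    return sum(known) / len(known), len(known)


# Hypothetical toy data for a single GO category.
toy_annotations = {
    "P1": {"protein binding"},
    "P2": {"protein binding", "hydrolase activity"},
    "P3": {"transporter activity"},
    "P4": set(),  # no known annotation
}
toy_interactions = [("P1", "P2"), ("P1", "P3"), ("P2", "P4"), ("P2", "P3")]
proportion, n_included = matching_proportion(toy_interactions, toy_annotations)
```

The P2-P4 pair is excluded because P4 has no known annotation, leaving one match among three included interactions.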

The dashed line, in each figure, shows the proportion of matching annotations for the complete physical or genetic interaction set. Each bar, for a given experimental technique, has a colour density illustrating the p-value of a proportion test between matching annotations found in the particular experiment type and the complete interaction set (physical or genetic). Bars are more translucent if there is less evidence to support the experimental statistic being different from that of the physical or genetic data (and fully coloured if significant at the 5% level as being different to the complete set proportion).
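The exact form of the proportion test is not specified here, so the following is only a sketch of one common choice: a two-sided test of a technique's matching proportion against the complete-set proportion, treating the latter as fixed and using the normal approximation to the binomial. The function name and the example counts are hypothetical.

```python
import math


def proportion_test_p_value(matches, n, reference_proportion):
    """Two-sided p-value for H0: matching proportion == reference_proportion.

    Uses the normal approximation to the binomial distribution, which is
    an assumption of this sketch rather than the test used in the thesis.
    """
    if n <= 0 or not 0.0 < reference_proportion < 1.0:
        raise ValueError("need n > 0 and a reference proportion in (0, 1)")
    observed = matches / n
    standard_error = math.sqrt(reference_proportion * (1.0 - reference_proportion) / n)
    z = (observed - reference_proportion) / standard_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical counts: 100 interactions compared with a reference proportion of 0.30.
p_same = proportion_test_p_value(30, 100, 0.30)       # observed equals reference
p_different = proportion_test_p_value(60, 100, 0.30)  # observed far from reference
```

A bar would then be drawn fully coloured when the p-value falls below 0.05 and more translucent otherwise.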

Figure 2.6 presents results for the molecular function category. The genetic and physical interactions show marked differences in the level of matching annotations, with fewer found in genetic data. Both biochemical activity techniques and protein-RNA techniques produce a low level of matching annotations. In contrast, interactions elucidated using co-crystal structure exhibit the highest proportion of matching annotations. Co-crystal structure methods also contribute a high proportion of self-interactions (18.2%, Table 2.2), and the 108 other PPIs found using this method show a high level of annotation concordance from 165 studies.

[Figure 2.6: bar chart. y-axis: matching function proportion (0.0 to 0.8); x-axis: experimental technique, grouped into physical interactions and genetic interactions; bars annotated with per-technique interaction counts.]

Figure 2.6: Proportion of matching functional annotations by experiment technique. The proportion of matching functional annotations found is shown for each experimental technique. Dashed lines show the average proportion across the complete genetic or physical interaction set. Bar density shows the p-value of a binomial proportion test, assessing similarity, between the technique and the genetic or physical dataset.

Figure 2.7 shows the proportion of matching cellular component annotations found for reported interactions. The overall difference between the proportion of matching annotations in the genetic and physical techniques is smaller than for functional annotations, and the overall propensity to match is higher. The larger physical datasets comprise affinity capture MS, affinity capture western, biochemical activity and two-hybrid (again referred to as Y2H) reported interactions. These datasets all exhibit different annotation characteristics. Affinity capture MS contributes around half of the physical data and has a significantly different propensity for matching annotations than any of the other techniques that contribute more than 5% of the data. Two-hybrid and biochemical techniques produce interactions with a relatively low propensity of having matching component, or functional, annotations.

[Figure 2.7: bar chart. y-axis: matching component proportion (0.0 to 0.8); x-axis: experimental technique, grouped into physical interactions and genetic interactions; bars annotated with per-technique interaction counts.]

Figure 2.7: Proportion of matching component annotations by experiment technique. The proportion of matching component annotations found is shown for each experimental technique. Dashed lines show the average proportion across the complete genetic or physical interaction set. Bar density shows the p-value of a binomial proportion test, assessing similarity, between the technique and the genetic or physical dataset.

Figure 2.8 shows the same matching annotation data for biological process annotations. The proportions of matching annotations found for each technique replicate the trends shown for molecular function annotations in Figure 2.6. Biochemical activity and Y2H studies report interactions with a lower concordance of biological annotations, whilst some of the smaller scale experimental techniques (including co-crystal structure) display very high levels of concordance. In general, the level of matching annotations shown by each technique is significantly different from those produced by the other techniques.

[Figure 2.8: bar chart. y-axis: matching process proportion (0.0 to 0.8); x-axis: experimental technique, grouped into physical interactions and genetic interactions; bars annotated with per-technique interaction counts.]

Figure 2.8: Proportion of matching biological process annotations by experiment technique. The proportion of matching biological process annotations found is shown for each experimental technique. Dashed lines show the average proportion across the complete genetic or physical interaction set. Bar density shows the p-value of a binomial proportion test, assessing similarity, between the technique and the genetic or physical dataset.

Overall, Figures 2.6-2.8 show large differences in the proportion of matching annotations found for genetic or physical interactions. Over one third of S. cerevisiae proteins have unknown annotations, and each method, whether genetic or physical, reports interactions in which both proteins have known annotations around 60% of the time (shown in Figure B.2 in Appendix B). Affinity capture and far-western techniques report interactions with an enriched level of known annotations. FRET interaction data almost always involve proteins that have been annotated in all three GO categories, suggesting this technique focuses on well-studied proteins. The use of interaction data from particular experimental methods as high confidence reference sets may bias the final analysis of novel methods, as some of the SSE methods (such as FRET) appear highly biased in their reporting of PPIs relative to the overall corpus of reported interactions.



2.3.6 Repeated interactions

Figures 2.6-2.8 have shown that biological differences exist between the characteristics of genetic and protein-protein interactions. Accordingly, the characteristics of these networks may be fundamentally different and, to eliminate any possible confounding influence, the remaining analyses consider only the physical interactome. The number of verifications of reported interactions can be used to assess the overall coverage of reported PPI data. These verifications can be used either to assess how long it may take to report the complete interactome, or to assess the error rates found in the data (conditional on a known interactome size).

Figure 2.9 shows the number of PPIs reported (and stored in BioGRID) until 2007, divided into novel or repeated reports. The complete data contain 30,074 interactions that have been reported once and 9,506 other interactions that have been reported more than once. These validated interactions therefore make up approximately 24% of the distinct PPIs. Across the complete physical data, containing 66,464 reports, there are 1,679 (4%) interactions that have been reported at least 5 times and 321 (0.8%) that have been reported more than 10 times.
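The division into novel and repeated reports can be sketched by counting how often each distinct unordered pair appears. The function names and toy reports below are hypothetical; the final lines simply re-derive the quoted 24% from the counts given in the text.

```python
from collections import Counter


def report_counts(reports):
    """Count the number of reports of each distinct unordered interaction."""
    return Counter(tuple(sorted(pair)) for pair in reports)


def fraction_repeated(counts):
    """Fraction of distinct interactions that have been reported more than once."""
    repeated = sum(1 for n_reports in counts.values() if n_reports > 1)
    return repeated / len(counts)


# Hypothetical toy reports: P1-P2 is reported three times, the others once.
toy_reports = [("P1", "P2"), ("P2", "P1"), ("P1", "P3"), ("P3", "P4"), ("P1", "P2")]
toy_counts = report_counts(toy_reports)
toy_fraction = fraction_repeated(toy_counts)

# Counts quoted in the text for the physical BioGRID data.
reported_once = 30_074
reported_more_than_once = 9_506
validated_share = reported_more_than_once / (reported_once + reported_more_than_once)
```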

[Figure 2.9: bar chart. y-axis: reported interactions (0 to 30,000); x-axis: year (1977 to 2007); legend: novel interactions, repeated interactions.]

Figure 2.9: Accrual of reported yeast protein interactions over time. The numbers of reported interactions found for S. cerevisiae are shown over the last 30 years. Red indicates novel interactions while yellow bars represent the reported interactions that have been published before.


A single interaction, between YDR477W and YGL115W, has been reported 39 times. These proteins, which are both kinases, are part of the protein complex Snf1p. This complex is essential for regulating transcriptional changes in multiple different biological processes (Kuchin et al., 2000; Lo et al., 2001), and is homologous to AMP-K which is found in all eukaryotes (Kemp et al., 1999). Protein kinases modify other proteins through phosphorylation. This can result in functional change of the target protein, and approximately 30% of yeast proteins can be modified at any time (Ptacek et al., 2005). The proteins are both of crucial importance to understanding eukaryotic processes and have featured in commonly completed stress response studies.

Section 2.3 has illustrated some of the properties of BioGRID S. cerevisiae protein interactions. The vast majority of interactions occur between proteins that do not share the same GO annotations, and in general each experimental method reports interactions with significantly different GO annotation characteristics. Accordingly, the use of GO annotations to form strict training and reference sets for reliability analyses appears flawed without further evidence regarding the reliability of the BioGRID database. The availability of PPI validations and a wide variety of published analyses means that the studies that follow concentrate on the physical interactome.

2.4 Interaction networks

The development of novel experimental techniques, along with the concurrent increase in interaction data, makes it difficult to effectively isolate stochastic or systematic errors. Accordingly, graph analyses presented here use a collection of empirical network datasets rather than just a single complete set. As well as possibly improving interpretation of the analyses, this enables observation of any differences between published empirical networks (e.g. between a network that has been curated and one which has not been).

CORE and DIP graphs are formed from the data contained in DIP, whilst a literature curated graph (LC) has been generated using BioGRID. These network datasets, referred to as empirical graphs, are now defined and explored.

The DIP graph contains all distinct PPIs found in the Database of Interacting Proteins (April 2008). CORE is a subset of the DIP graph found by comparison of the expression levels and the availability of paralogous interaction data for each interaction in DIP (Deane et al., 2002). These criteria define the CORE set, which is around a third the size of DIP and is considered a higher confidence set of interactions.

LC contains a subset, found from hand curation of the literature, of reported interactions from BioGRID (Reguly et al., 2006). The PPI data have been divided into high-throughput (HTP) and small scale experimental (SSE) data. The HTP data are taken from five studies: Uetz et al. (2000); Ito et al. (2000, 2001); Ho et al. (2002); Gavin et al. (2002).

The three empirical graphs form different samples of S. cerevisiae PPIs, containing data that have been hand curated (LC), have passed some expert criteria (CORE), or constitute a complete interaction database (DIP). They are treated as being subsamples of the S. cerevisiae interactome in Chapters 3-4. The use of multiple observed PINs enables discussion regarding how different network sizes, and reliability, affect analyses.

Network traits of the empirical graphs can be used to characterise the biological network (see Section 1.5.1 on page 43). They also form a means of discriminating between individual models used to generate graphs such as ERGMs (Pattison and Wasserman, 1999) or geometric models (Higham et al., 2008). The network traits of the empirical graphs are now compared.

2.4.1 Local graph structure

The three empirical graphs have different sizes, with CORE containing approximately a third of the interactions found in DIP. The LC graph is the largest, having the highest order (5,109 nodes or proteins) and size (21,283 interactions or edges), as shown in Table 2.3. The LC set is split up into HTP and SSE subsets, of which HTP contributes 11,571 distinct interactions whilst SSE data contribute 11,334 distinct interactions to the graph (the intersection of HTP and SSE containing 1,622 interactions).

Graph  Nodes   Edges  Components  Maximum degree  Mean degree
CORE   2,528   5,728          78              91          4.8
DIP    4,931  17,471          31             283          7.0
LC     5,109  21,283          42             319          8.5

Table 2.3: Components and degree for empirical graphs. The network traits for empirical data are shown including the size, order and components found in each graph.

The most highly connected protein (the node with highest degree) accounts for a similar proportion of the total interactions found in each empirical graph: 1.6% (CORE); 1.6% (DIP); 1.5% (LC). However, this protein is connected to a higher proportion of each graph's proteins as the graph size increases: 3.6% (CORE); 5.7% (DIP); 6.2% (LC). The maximum degree increases in line with the size of the graph as opposed to the graph's order.
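These percentages follow directly from the node, edge and maximum degree columns of Table 2.3, as the short Python check below illustrates. The helper name `hub_shares` is hypothetical.

```python
# (nodes, edges, maximum degree) taken from Table 2.3.
table_2_3 = {
    "CORE": (2528, 5728, 91),
    "DIP": (4931, 17471, 283),
    "LC": (5109, 21283, 319),
}


def hub_shares(nodes, edges, max_degree):
    """Return the maximum degree as a percentage of the edges and of the nodes."""
    return 100.0 * max_degree / edges, 100.0 * max_degree / nodes


shares = {name: hub_shares(*row) for name, row in table_2_3.items()}
for name, (edge_share, node_share) in shares.items():
    print(f"{name}: {edge_share:.1f}% of edges, {node_share:.1f}% of nodes")
```

The share of edges stays near 1.5% whilst the share of nodes rises with graph size, which is the scaling with size rather than order noted above.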

The percentages of [nodes, edges] found in the largest component, the GCC, of each graph are: CORE [92.7%, 97.9%]; DIP [98.8%, 99.8%]; LC [98.4%, 99.8%]. These values suggest that the CORE graph is structurally different from the other empirical graphs, as it has a smaller proportion of its edges and nodes in its GCC.
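A minimal sketch of how such node and edge shares can be computed is given below, using a breadth-first search over an undirected edge list. The function name and toy graph are hypothetical, and self-interactions are ignored, in keeping with their removal earlier in the chapter.

```python
from collections import defaultdict, deque


def largest_component_shares(edges):
    """Return (node share, edge share) of the largest connected component."""
    adjacency = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # self-interactions are ignored in this sketch
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    largest = set()
    for start in list(adjacency):
        if start in seen:
            continue
        # Breadth-first search to collect the component containing `start`.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        if len(component) > len(largest):
            largest = component

    distinct_edges = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    edges_in_largest = sum(1 for edge in distinct_edges if edge <= largest)
    return len(largest) / len(adjacency), edges_in_largest / len(distinct_edges)


# Hypothetical toy graph: a triangle plus a separate single edge.
toy_edges = [(1, 2), (2, 3), (3, 1), (4, 5)]
node_share, edge_share = largest_component_shares(toy_edges)
```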

Table 2.4 lists the clustering coefficients for the empirical graphs and the average clustering coefficient found for the set of graphs with the same degree sequence. Graphs with an identical degree sequence, on average, exhibit a significantly lower clustering coefficient than the empirical graphs. The CORE data exhibit the highest level of clustering of the empirical graphs whilst producing the lowest level of clustering in the associated random graphs. This suggests that the data are highly clustered around small sets of nodes, which do not have many edges between them. In contrast, the LC and DIP graphs have lower clustering coefficients, which could be a consequence of the higher average degree of each node in these graphs. However, the average trait for the random graphs is significantly lower in all cases.

         Clustering coefficient
Graph  Observed  Random (avg.)
CORE      0.205          0.005
DIP       0.094          0.009
LC        0.125          0.013

Table 2.4: Clustering coefficients. This table shows the clustering coefficients found for the empirical data in comparison to the average coefficient for a random graph with the same degree sequence.
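The comparison in Table 2.4 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure used for the table: random graphs with the same degree sequence are produced by repeated double-edge swaps, and clustering is measured by the global coefficient (closed triples divided by connected triples). Which clustering coefficient and which randomisation scheme underlie Table 2.4 is not stated in this section. All names and the toy graph are hypothetical, and the input is assumed to be a simple graph without self-loops or duplicate edges.

```python
import random
from itertools import combinations


def degree_sequence(edges):
    """Map each node to its degree."""
    degrees = {}
    for a, b in edges:
        degrees[a] = degrees.get(a, 0) + 1
        degrees[b] = degrees.get(b, 0) + 1
    return degrees


def clustering_coefficient(edges):
    """Global clustering coefficient: closed triples / connected triples."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    closed = 0   # each triangle is counted once per centre node, i.e. 3 times
    triples = 0
    for neighbours in adjacency.values():
        k = len(neighbours)
        triples += k * (k - 1) // 2
        closed += sum(1 for u, v in combinations(neighbours, 2) if v in adjacency[u])
    return closed / triples if triples else 0.0


def rewire(edges, n_swaps, rng):
    """Degree-preserving randomisation by double-edge swaps.

    Edges (a, b) and (c, d) are replaced by (a, d) and (c, b) unless this
    would create a self-loop or a duplicate edge.
    """
    edge_list = [tuple(edge) for edge in edges]
    edge_set = {frozenset(edge) for edge in edge_list}
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # consider both orientations of the second edge
        if a == d or c == b:
            continue
        new_1, new_2 = frozenset((a, d)), frozenset((c, b))
        if new_1 in edge_set or new_2 in edge_set:
            continue
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {new_1, new_2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list


# Hypothetical toy graph: a 4-clique attached to a path.
toy_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
             (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9)]
rng = random.Random(0)
observed = clustering_coefficient(toy_edges)
randomised = [rewire(toy_edges, 10 * len(toy_edges), rng) for _ in range(50)]
random_average = sum(clustering_coefficient(g) for g in randomised) / len(randomised)
```

Every randomised graph keeps the degree of each node fixed, so any difference between `observed` and `random_average` cannot be attributed to the degree sequence alone.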

2.4.2 Degree sequence

Figure 2.10 shows rank-degree plots on a log-log scale for the nodes of each of the empirical graphs: CORE, DIP and LC. Plots with a straight line have been used to justify the use of a power-law distribution for PIN degree sequences. Whilst the degree sequences do not fall perfectly on a line, there may be evidence of a power law and scaling relationship if the majority of proteins with few interactions are excluded. However, overall the interaction graphs do not appear to have a degree distribution that can reasonably be described by a power law. A large proportion of the nodes have a limited number of edges, whilst a handful have a large number of the edges. The CORE data, which have around a third of the edges of DIP, also contain fewer nodes of noticeably high degree than the DIP or LC graphs.

[Figure 2.10: three panels, (a) CORE, (b) DIP, (c) LC. x-axis: node rank (1 to 5000, logarithmic); y-axis: node degree (1 to 100, logarithmic).]

Figure 2.10: Rank-degree plots of network data. These show rank-degree plots for the degree sequence data, on a logarithmic plot.

In order to assess the fit of degree sequence data to a Poisson distribution and some commonly used heavy-tailed distributions, maximum log-likelihood analyses (Burnham and Anderson, 1998) have been performed (described in Section A.1 in Appendix A). Table 2.5 shows the Akaike information criterion (AIC) results that are used to choose between the tested models (where the models have up to 4 free parameters). The PIN degree sequences demonstrate relationships more indicative of a heavy-tailed distribution than a Poisson distribution. The best fit of this selection is the stretched exponential model, whilst the discretised log-normal distribution fits nearly as well. At their maximum likelihood parameters, the exponential distribution fits less well than the power-law distribution, as has been found previously (Reguly et al., 2006). However, the assertion that the degree sequences are best modelled by the power-law distribution is not backed up by a comparison with even a handful of other heavy-tailed alternatives. Accordingly, it may be more reasonable to generate random graphs that have the same degree sequence as the empirical data, to avoid placing assumptions on the character of the degree distribution that are not well supported by analysis.
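The model comparison rests on the standard definition AIC = 2k - 2 ln L, where k is the number of free parameters and L the maximised likelihood; lower values indicate a better trade-off between fit and complexity. The Python sketch below illustrates the calculation for two simple one-parameter candidates only, a Poisson model and a geometric (discrete exponential) model with closed-form maximum likelihood estimates. It does not reproduce the fitting procedure of Section A.1, and the degree sequence is a hypothetical toy example.

```python
import math


def aic(log_likelihood, n_parameters):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_parameters - 2 * log_likelihood


def poisson_log_likelihood(degrees):
    """Log-likelihood of a Poisson model at its MLE (rate = sample mean)."""
    rate = sum(degrees) / len(degrees)
    return sum(k * math.log(rate) - rate - math.lgamma(k + 1) for k in degrees)


def geometric_log_likelihood(degrees):
    """Log-likelihood of P(k) = (1 - q) * q**k, k = 0, 1, ..., at its MLE."""
    mean = sum(degrees) / len(degrees)
    q = mean / (1.0 + mean)
    return sum(math.log(1.0 - q) + k * math.log(q) for k in degrees)


# Hypothetical heavy-tailed degree sequence: many low-degree nodes, a few hubs.
toy_degrees = [1] * 60 + [2] * 20 + [3] * 8 + [5] * 5 + [10] * 4 + [40] * 2 + [90]

aic_values = {
    "poisson": aic(poisson_log_likelihood(toy_degrees), 1),
    "geometric": aic(geometric_log_likelihood(toy_degrees), 1),
}
best_model = min(aic_values, key=aic_values.get)
```

For this toy sequence the Poisson model is penalised heavily by the few high-degree nodes, mirroring the much larger Poisson AIC values reported for the empirical graphs.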

Graph  Poisson  Exponential  Power-law  Log-normal  Stretched exponential
CORE     18750        13220      12550       11820                  11800
DIP      62290        29700      28850       27160                  27130
LC       83820        32480      31790       29940                  29900

Table 2.5: AIC analysis of possible degree distribution. This table shows the AIC for 5 different possible degree distributions. The values relate to the log-likelihood and number of parameters for the distribution used and each empirical graph's degree sequence.

The three empirical graphs have similar degree distributions even though they are of different sizes. The CORE set is more highly clustered than the larger networks, although it also has more connected components than the larger graphs. The graphs are all on subsets of the complete proteome, and do not appear to be complete subgraphs of the true interactome as the highest degree scales with the size rather than the order of the graph. The degree distribution of the graph can be easily replicated by fixing the degree of each node, rather than using a simple probability model for the node degrees, none of which closely fits the empirical data. Accordingly, rewiring approaches may be a better means of generating random graphs to compare with the reported networks.

2.5 Discussion

The majority of interaction data have been published in the last 5 years, and the publication of novel interactions from S. cerevisiae, as shown in Figure 2.9, is still common. However, the number of distinct PPIs reported already exceeds, or is comparable to, predicted interactome sizes (listed in Table 1.4 on page 61). This could reflect that empirical data contain erroneous interactions, suggesting a need for means to improve their reliability. Recently, S. cerevisiae PPI data have been supplemented by two global studies of multi-protein complexes (Krogan et al., 2006; Gavin et al., 2006). These have added validations for some binary interactions whilst also presenting an extra biological characteristic that may help to predict further binary PPIs. Any correlation of interactions with biological characteristics may be aided by complex annotations and enable better understanding of how best to infer binary interactions from complex annotations.

Interaction data here are reported from a wide variety of experimental sources. As shown in Figures 2.3-2.5, there is a tendency for physical interactions to be between proteins that share GO slim biological characteristics. However, there are some inter-annotation groups that exhibit a high number of reported interactions. Electron transport proteins have been found to interact with those involved with membrane organisation, whilst nuclear proteins show an abundance of interactions with chromosomal proteins. Both of these observations are consistent with the biological system, as electron transportation occurs on internal mitochondrion membranes and chromosomes are found within the nucleus. Electron transport proteins are known to be connected by non-protein co-factors when performing their role within the cell, supporting the lack of within-category interactions found for this annotation. Clearly, interactions both within annotation classes, and between classes, are biologically important. All these linkages may reflect important properties of the interaction network and preserving these graph characteristics may be crucial if aiming to create a plausible graph model.

The proportion of matching annotations found for each experimental technique varies widely. The differences found in GO characteristics for each technique could be a consequence of either experimental design or noise. If the techniques can all test the same set of protein pairs, then this suggests a higher level of noise from some techniques (depending on which experimental technique is most accurate). However, it seems more realistic that experimental design explains these differences and the reliability of each experiment can best be assessed by replicated data, rather than comparison with results from different techniques.

It was also shown that data obtained by high throughput methodologies exhibit a less pronounced level of similarity in their GO slim biological characteristics than data from small-scale experimental techniques. This corroborates the evidence supplied in Table 2.2, which suggests different techniques have different propensities for reporting self-interactions. The concept that different techniques produce different subsets of the data must be taken into account when comparing data to measure noise or generalising about PPI characteristics, especially when examining the contribution of HTP experiments (52 studies contribute 68% of the data).

The high prevalence of self-interactions found using x-ray crystallography or peptide-protein methods could be explained in a number of ways. First, the intricacy of the technique: x-ray crystallography requires protein crystals, formed using a high concentration of protein in solution. The propensity of a protein to form a self-interaction, or homodimer, under these conditions may be abnormally, non-physiologically, high. Second, both x-ray crystallography and peptide-protein methods aim to isolate specific structures, and consequently may set out to deliberately elucidate homodimers. Third, the lack of self-interactions reported using other methods may be explained by the more macroscopic tools they employ. Whereas x-ray crystallography and peptide-protein methods actually observe the protein structures, this visible information is lost when mass spectrometry is employed: a protein is either present or not, and the level of the protein (i.e. whether there is twice as much, as in a homodimer) cannot be determined.

Whilst biological traits can inform on the structure, or characterise aspects, of a graph, the reliability of individual PPIs, and of the graph as a whole, is informed by validations and replicated reports of interactions. Previous publications have used specific methods, or annotations, to verify data, presuming that they are more reliable. However, as shown by Figures 2.6-2.8, individual methods show massive variation in matching annotations, so the skewed use of reference data from particular SSE methods may not be appropriate, for both coverage and experimental reasons. Validated, within-technique replicates, however, can offer a less biologically biased view of the true interactome.

S. cerevisiae PPI graphs feature a small set of highly connected proteins contained in a sparse graph for thousands of proteins. The majority of these proteins have few interacting partners and the mean degree ranges from 5 to 8 for our 3 empirical graphs. The empirical graphs are more highly clustered than graphs with the same degree sequence, containing small cliques and highly clustered sets that share matching biological annotations.

High levels of clustering may reflect a tendency for small groups of proteins to exhibit similar interaction partners or be an artefact of the techniques used to find the data. The majority of the experiments are small scale and focused on small sections of the proteome, so it is hard to know for certain if this feature is also true for the complete error-free interactome. Data from complex experiments have also influenced the observed clustering, whilst yeast two-hybrid data, completed on large components of the proteome, do not exhibit as high levels of clustering.

The degree sequences of the empirical graphs exhibit characteristics consistent with scale-free networks although they are not best modelled by a power-law degree distribution (as shown in Table 2.5). Indeed, the power-law distribution for the degree sequence may not capture the intricacy and complexity of the interaction data, as has been seen for graphs of the internet (Doyle et al., 2005). Instead, the degree distribution could be preserved simply by fixing the degrees observed in empirical data. This simple approach avoids using probability models to generate structurally similar random graphs for hypothesis testing. The use of probability models for the degree distribution could easily result in comparing empirical data with possibly inaccurate random graphs that do not reasonably reflect the key properties of the empirical graph.


Chapter 3

Graph ensembles

This chapter analyses the network and biological traits of a selection of random graph ensembles. These graphs (Section 3.2) replicate network and biological constraints to gain insights into the evolution, and structure, of PINs. Random graphs sampled from various graph ensembles are compared with empirical data (Section 3.3). The average ensemble traits for these methods form the statistics compared.


3.1 Introduction

Topological and biological features of empirical protein interaction graphs may inform us about the network's evolution (Stumpf et al., 2007). The interactome may also provide further information about protein complexes, interactions, and function of biological systems (Chen et al., 2007) or be used in comparative analyses.

A variety of different graph ensembles, or null models, have been used for PIN analyses (Milo et al., 2002; Jordan et al., 2003), although the rationale for their choice is not always clear. Assumptions regarding how the graph is structured or its size and order may bias conclusions, leading to a model not being appropriate for our hypothesis, and risk falsely dismissing findings or generating false positive conclusions (May, 2001). Ideally a null model is used to negate the potential effects of confounding variables or processes. In practice, it is difficult to find a truly null model as it cannot be certain that features which have shaped the data are not already woven into the model (Harvey et al., 1983; Strong et al., 1984).

A selection of studies have investigated whether or not traits of interacting proteins are different from those of non-interacting proteins (Fraser et al., 2002; Lemos et al., 2004). Particular topological traits have also been found in a variety of different biological graphs (Jordan et al., 2003; Berg and Lassig, 2004; Agrafioti et al., 2005). Graph ensembles which choose certain characteristics to fix have been used to show the significance of traits observed in the empirical data (under the assumption that the data represent a complete PIN). For instance, the total number of nodes, or edges, have been fixed and the random graphs generated compared to an observed graph (Wagner, 2001). Degree sequences, and other biological traits, of the empirical data have also been fixed in the chosen graph ensemble (Milo et al., 2002; Thorne and Stumpf, 2007).

To make a reasonable comparison, which may lead to the conclusion that a characteristic is important in determining the interaction graph, the ensemble graph model should retain certain properties of the empirical data. However, when generating graphs (where the properties of the nodes and edges are important) it is hard to define a satisfactory parameter set, such as the number of edges, that should be fixed.

In order to assess the possible linkage of traits with protein interactions, or with the graph structure, the empirical data are compared here to the traits of different graph ensembles (as introduced in Section 1.5.2 on page 48). Traits used form a selection of properties that have been reported in the literature as linked with PPIs, or have previously been used to generate ensembles in biological graph analyses (from Section 1.5.1 on page 46). A selection of different graph ensembles are proposed and their average trait properties are measured. The traits assessed relate to the functions, processes and apparent abundance of the S. cerevisiae proteins, as well as the structural properties of each protein found in the interaction graph.

Analyses of graph ensembles are complemented by an investigation of perturbed graphs which are close to each empirical graph. Here the graph is gradually perturbed, only changing a single edge or pair of nodes at each step. Traits are observed as the graphs are perturbed and the perturbation effects analysed. This analysis is used to assess the significance of the trait statistics found for the empirical data as well as being used as a measure of each trait's robustness.

3.2 Methods

Biological and network traits are studied through the generation of various random graphs with identical order and size. The data are assumed to represent the complete PINs, ignoring noise in an attempt to model the available reported PPI data. These models, whilst non-random, are intended to inform us about the possible structure of PINs in alternative species whose evolutionary histories are similar but divergent. The use of random graph models, with appropriate assumptions, should allow an assessment of how important topological structure, or other biological traits, are in the context of the overall graph.

Graphs are generated using two approaches:

• rewiring: graphs are sampled from graph ensembles based on traits of empirical data (Section 3.2.2);

• perturbing: empirical graphs are altered by permuting nodes or moving edges, one-by-one (Section 3.2.3).



3.2.1 Data

In order to define biologically motivated rewiring schemes for graph ensembles, biological traits are used. A number of biological features have been proposed as means of classifying protein-protein interactions (Salwinski and Eisenberg, 2003; Yu and Fotouhi, 2006; Skrabanek et al., 2008; Ramani et al., 2008), or linked with PPIs (Valencia and Pazos, 2002; Bhardwaj and Lu, 2005; Thorne and Stumpf, 2007).

A collection of noted biological characteristics are used here for analyses and generation of random graphs: (a) molecular function, biological process or cellular component annotations taken from the GO slim ontology (Ashburner et al., 2000); (b) multi-protein complexes found in Gavin et al. (2006); (c) mRNA expression levels as a proxy for S. cerevisiae protein expression levels from Cho et al. (1998); and (d) percentage sequence similarity between S. cerevisiae proteins found from BLAST alignments.

The random graphs analysed in Section 3.3 also use empirical graphs (CORE, DIP or LC) defined in Section 2.4 on page 79.

3.2.2 Rewiring

Graph ensembles are generated that take account of observed traits of empirical graphs. Rather than focussing solely on network traits to find a probability model for plausible PINs, biological constraints are used to construct graphs alongside network traits (which are either fixed, or not, depending on the graph ensemble).

Graph rewiring (Bender and Canfield, 1978) is used to generate random graphs from the empirical data. Each rewiring maintains both the order, $n$, and size, $m$, of the empirical graph used to generate the rewired graph.

Definition 3.1 (Rewiring) An edge, $e$, is rewired if it is deleted from a graph's edge set, $E$, and a new edge, $e'$, is added to the graph from the same node set, $V$.

Definition 3.2 (Graph rewiring) A graph, $H \sim (V_H, E_H)$, is a rewiring of $G \sim (V_G, E_G)$ if $|E_H| = |E_G|$ and $V_H = V_G$.

Each random graph considered is a sample from a graph ensemble (forming a particular probability distribution over the space of graphs with $n$ nodes and $m$ edges). Comparisons are made between the empirical graph and those found when sampled from the graph ensemble. Consequently, the ensemble serves as a null model for the analyses presented here. Topological and then biological ensembles are discussed. The empirical graphs are considered to be $G \sim (V, E)$ throughout.

Topological ensembles

Three different graph ensembles are used that fix certain network traits of the empirical data: (i) Random graph; (ii) Node shuffle; and (iii) Network shuffle. These take account of the degree sequence, size, and order of the empirical graph.

(i) Random graph. A graph, $H$, from this ensemble is generated using the ER graph model (see Section 1.5.2 on page 48). This fixes the order, $n$, and size, $m$, to be the same as those of the empirical graph, $G$. Biological node traits (such as sequence or annotations) are fixed and the $m$ edges are sampled uniformly without replacement.
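The sampling step of the random graph ensemble can be illustrated with a minimal Python sketch. It assumes a simple undirected graph without self-loops (the treatment of self-interactions is not specified here), and the function name `random_graph` and its parameters are illustrative rather than taken from the thesis.

```python
import random
from itertools import combinations


def random_graph(nodes, m, seed=None):
    """Sample m distinct undirected edges uniformly, without replacement.

    Assumption: simple graph, no self-loops. Node labels are untouched, so
    any biological traits attached to the nodes stay fixed.
    """
    rng = random.Random(seed)
    nodes = sorted(nodes)
    # Every unordered node pair is a candidate edge.
    possible = list(combinations(nodes, 2))
    if m > len(possible):
        raise ValueError("m exceeds the number of possible edges")
    return set(rng.sample(possible, m))
```

Enumerating every node pair is only practical for small examples; for graphs with thousands of nodes a rejection sampler over random pairs would be a more economical choice.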

(ii) Node shuffle. A graph, sampled from this graph ensemble, retains all network traits of the empirical graph, maintaining the adjacency matrix, $A$. The node traits are permuted amongst all the nodes of the graph, $G$. Although the generated graph, $H$, has identical structure to the empirical graph, the node specific traits are randomly allocated amongst the nodes, $V$.

This graph ensemble produces graphs that retain the precise topological features of the network, whilst disassociating the biological traits, $\beta_G(v)$, of each node, $v$, from its network characteristics. This enables assessment of whether the structure of the graph and the node labels are related.
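As an illustration, node shuffle amounts to permuting the trait values over a fixed node set while leaving the edge list alone. The sketch below assumes traits are held in a dictionary keyed by node; the name `node_shuffle` is hypothetical.

```python
import random


def node_shuffle(traits, seed=None):
    """Return a new node -> trait mapping with the trait values permuted.

    The edge set is not touched, so the topology is identical to that of
    the empirical graph; only the association between node and label changes.
    """
    rng = random.Random(seed)
    nodes = list(traits)
    values = [traits[v] for v in nodes]
    rng.shuffle(values)
    return dict(zip(nodes, values))
```

Because the values are only reordered, the overall frequency of each label is unchanged.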

(iii) Network shuffle. This graph ensemble generates graphs that preserve network traits, using the rewiring algorithm presented by Bender and Canfield (1978). The degree of each node, $d_G(v)$, along with each node biological trait, $\beta_G(v)$, are fixed. Edges are randomly distributed under these constraints. The number of legal moves may be small under certain conditions, primarily as the graph becomes denser (that is, as the proportion of possible edges that are present increases). In the case of PIN data this is not a concern in general as the graphs are sparse.



Figure 3.1: Node shuffle. This figure shows the process used for node shuffle: (a) labels are randomly assigned to proteins; (b) the analysis is performed on the simulated network. The labels for each node (e.g. node colour), here $\beta(A), \ldots, \beta(E)$ for the example nodes A to E, are permuted such that the topology of the graph is fixed.

Network shuffle rewiring produces graphs with the same degree sequence and such that each node has identical degree and biological characteristics. The neighbours of each node are altered whilst the degree of each node, $d_G(v_i)$, is fixed:

$$H \sim (V, E') \quad \text{where, for each } v_i \in V, \quad d_H(v_i) = d_G(v_i). \qquad (3.1)$$

Graphs maintain the degree distribution and other characteristics of each node.

Figure 3.2: Network shuffle. (a) The edges, $e_1, \ldots, e_5$, are randomly reassigned; (b) the degree of each protein is retained. The degree of each node, [A,B,C,D,E], is fixed along with the node characteristic, colour, whilst the edges are randomly rewired.
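A common way to obtain graphs satisfying Equation 3.1 is repeated degree-preserving edge swapping. The sketch below uses that swapped-in technique purely as an illustration; it is not claimed to be the Bender and Canfield (1978) procedure used for the ensembles in this chapter, and the names `network_shuffle`, `n_swaps` and `max_tries` are assumptions.

```python
import random


def network_shuffle(edges, n_swaps, seed=None, max_tries=100000):
    """Degree-preserving randomisation by double edge swaps.

    Two edges (a, b) and (c, d) are replaced by (a, d) and (c, b) whenever
    this creates neither a self-loop nor a duplicate edge, so every node
    keeps its degree. Assumes a simple undirected graph.
    """
    rng = random.Random(seed)
    edge_list = [tuple(sorted(e)) for e in edges]
    edge_set = set(edge_list)
    done = tries = 0
    while done < n_swaps and tries < max_tries:
        tries += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # consider both orientations of the second edge
        if a == d or c == b:
            continue  # would create a self-loop
        new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        if new1 in edge_set or new2 in edge_set:
            continue  # would create a duplicate edge
        edge_set -= {edge_list[i], edge_list[j]}
        edge_set |= {new1, new2}
        edge_list[i], edge_list[j] = new1, new2
        done += 1
    return edge_set
```

Rejected proposals are simply skipped, which reflects the remark that legal moves become scarce as a graph grows denser; `max_tries` bounds the loop in that case.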



Biological ensembles

The following graph ensembles fix both network and biological traits of the empirical data: (iv) Bipartite shuffle; (v) Biological node shuffle; and (vi) Biological network shuffle.

(iv) Bipartite shuffle. This uses an edge characteristic, $\Phi = \phi_1, \ldots, \phi_m$, to rewire the empirical graph. Each edge, $e_i$, is rewired such that it retains the same characteristic, $\phi_i$. Figure 3.3 shows an example rewiring when the trait, $\phi(\cdot)$, is the colour of the connected nodes. Edges are rewired randomly whilst maintaining the connections of particular colours. The number of edges that connect each of the possible characteristics, in this case (blue, blue), (green, green), or (blue, green), are fixed.

Bipartite shuffle ensemble ignores the network traits of the empirical graph, instead replicating the types of biological trait between connected nodes. It can be viewed as performing a set of random graph rewirings, as for graph ensemble (i). Each rewiring, however, is over a subgraph of nodes that have particular characteristics. These are the subsets of nodes with either the same node characteristic or with two specific characteristics. All of the edges, $e_i$, retain the characteristic, $\phi_i$, fixing the graph trait statistic, $\Delta(G, \Phi)$, for that characteristic. Obviously, this can be extended to fix multiple trait statistics, although this increases the complexity of the task whilst also limiting the size of the set of graphs that can be sampled.

Figure 3.3: Bipartite shuffle. (a) Edges are permuted to fix label frequencies. In the example, the edges $e_1, \ldots, e_5$ have characteristics $\phi(e_1) = \phi(e_2) = \phi(e_5) = (g, b)$ and $\phi(e_3) = \phi(e_4) = (b, b)$. Each edge is rewired uniformly and at random to one of the set of node pairs that share the same edge characteristic, $\phi(e_i)$.



This ensemble technique is not used in the rewiring component of this chapter. However, it is used to define biological network shuffle and the same biological rewiring constraint defines possible perturbations applied to empirical data in Section 3.2.3.

(v) Biological node shuffle. This graph ensemble produces a subset of the graphs that can be sampled from node shuffle, retaining the topology of the observed graph, $G$. Biological node shuffle permutes the nodes such that each node, $v_i$, is switched with one, $v_j$, sharing a particular characteristic, $\beta(v_i) = \beta(v_j)$.

Figure 3.4: Biological node shuffle. (a) Nodes are permuted to maintain the test characteristic. In the example, the nodes A, B, C, D, E have $\beta(A) = \beta(C) = \beta(D) = b$ and $\beta(B) = \beta(E) = g$. This permutes each node to another node, $v_i \to v_j$, such that $\beta(v_i) = \beta(v_j)$ for the particular characteristic, $\beta$, under consideration.
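The constraint in biological node shuffle can be sketched as a permutation applied separately within each class of nodes sharing the characteristic $\beta$. The code below is a minimal illustration under that reading; `biological_node_shuffle` and the dictionary representation of $\beta$ are assumptions, not the thesis implementation.

```python
import random
from collections import defaultdict


def biological_node_shuffle(beta, seed=None):
    """Return a node -> node permutation that only exchanges nodes sharing
    the same value of the constrained characteristic beta."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for node, label in beta.items():
        groups[label].append(node)
    perm = {}
    for members in groups.values():
        shuffled = members[:]
        rng.shuffle(shuffled)  # permute within one class only
        perm.update(zip(members, shuffled))
    return perm
```

With the labels of Figure 3.4, nodes can only be exchanged within {A, C, D} and within {B, E}, so the constrained characteristic stays attached to each position in the topology while the remaining traits move with the permuted nodes.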

(vi) Biological network shuffle. This graph ensemble is based on the algorithm used to produce the network shuffle graph ensemble. An edge, $e_h$, has a characteristic, $\phi(e_h)$, determined by characteristics of the nodes it connects, $\phi(e_h) = \phi(v_i, v_j)$. Each edge is rewired to maintain the degree of each node, $d_G(v)$, as in network shuffle, and retains the characteristic of the rewired edge, $e_h$. So $e_h \to e'_h$ if $\phi(e_h) = \phi(e'_h)$.

This rewiring algorithm is a form of bipartite shuffle graph rewiring in which the bipartite graphs are randomly rewired according to the Bender and Canfield (1978) rewiring that forms the basis for the network shuffle ensemble (an approach similar to that taken by Thorne and Stumpf (2007)).

Constraints used for the biological ensembles could involve any number of biological traits. However, only ensembles that fix a single characteristic for each edge or node are assessed here, together with the resulting effects on the trait ensemble averages.

Figure 3.5: Biological network shuffle. (a) Edges are permuted to fix label frequencies; (b) the degree of each protein is retained. In the example, the edges $e_1, \ldots, e_5$ have characteristics $\phi(e_1) = \phi(e_2) = \phi(e_5) = (g, b)$ and $\phi(e_3) = \phi(e_4) = (b, b)$. This retains the degree of each node, $d_G(v)$, and also rewires each edge, $e$, to one of the available node pairs that share the same edge characteristic, $\phi(e)$.

Four biological traits are used for each biological ensemble: complex membership [complex]; functional annotation [function]; cellular component annotation [component]; biological process annotation [process]. For each graph ensemble, 1,000 random graphs are sampled for each empirical graph. The following 11 graph ensembles are compared against the 3 empirical graphs:

• (1) Random graph [size and order fixed]

• (2) Node shuffle [nodes permuted according to node characteristic]

• Biological node shuffle: (3) [process]; (4) [component]; (5) [function]; (6) [complex]

• (7) Network shuffle [edges rewired according to edge characteristic]

• Biological network shuffle: (8) [process]; (9) [component]; (10) [function]; (11) [complex]

Graphs are sampled from each ensemble uniformly across all graphs which satisfy the relevant constraints. In order to sample a random graph, $H$, from the empirical graph, $G$, the following is implemented:



1. INITIALISE NEW EMPTY ($E_H = \emptyset$) GRAPH, $H \sim (V_G, E_H)$, AND EMPTY SET, $T$

2. RANDOMLY PICK $e \in E_G \setminus T$

3. FIND SET $S$ OF PAIRS $(v_1, v_2) \notin E_H$, $v_1, v_2 \in V_G$, WHICH FULFIL ENSEMBLE CONSTRAINTS

4. SAMPLE $e' \in S$ IF $S$ IS NON-EMPTY, OTHERWISE RETURN TO 2

5. ADD EDGE SO THAT $E_H = E_H \cup \{e'\}$, $T = T \cup \{e\}$

6. RETURN TO 2 UNLESS $T = E_G$

The algorithm may require knowledge of the degree of each node in $G$ and partially formed $H$ during the course of the implementation, as well as fixed node characteristics related to the biology of each protein.
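The six steps above can be sketched in Python as follows. This is an illustrative reading of the listing rather than the code used for the analyses: the ensemble constraint is passed in as a hypothetical callable `allowed(e, pair, new_edges)`, and a retry limit (`max_retries`, also an assumption) is added so that the loop terminates when no admissible pair remains.

```python
import random
from itertools import combinations


def sample_rewired_graph(nodes, edges, allowed, seed=None, max_retries=1000):
    """Build H by replacing each edge of G with an admissible node pair."""
    rng = random.Random(seed)
    nodes = sorted(nodes)
    remaining = [tuple(sorted(e)) for e in edges]  # E_G \ T
    new_edges = set()                              # step 1: E_H is empty
    retries = 0
    while remaining:                               # step 6: until T = E_G
        idx = rng.randrange(len(remaining))        # step 2: pick e
        e = remaining[idx]
        candidates = [p for p in combinations(nodes, 2)      # step 3: set S
                      if p not in new_edges and allowed(e, p, new_edges)]
        if not candidates:                         # step 4: S is empty
            retries += 1
            if retries > max_retries:
                raise RuntimeError("no admissible node pair found")
            continue
        new_edges.add(rng.choice(candidates))      # steps 4 and 5: add e'
        remaining.pop(idx)                         # step 5: T gains e
    return new_edges


def same_edge_label(beta):
    """Constraint in the spirit of bipartite shuffle: the new pair must carry
    the same unordered pair of node labels as the edge it replaces."""
    def allowed(e, pair, new_edges):
        return sorted(beta[v] for v in e) == sorted(beta[v] for v in pair)
    return allowed
```

Recomputing the candidate set over all node pairs at every step keeps the sketch close to the listing but is quadratic in the number of nodes; an implementation intended for full-size PINs would maintain the candidate sets incrementally.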

3.2.3 Perturbations

A graph is perturbed by rewiring a single edge or permuting a pair of nodes (the latter forming a set of edge rewirings dependent on which nodes are changed). The random graphs are used to see how traits are affected by small changes to the empirical data, which may represent small evolutionary changes or effects of noise. Both the distance of the perturbed graph from the empirical graph and the stability of its properties relative to the empirical data are measured.

Definition 3.3 (Perturbed graph) A graph, $G_{i+1} \sim (V_{i+1}, E_{i+1})$, is a perturbed graph of $G_i \sim (V_i, E_i)$ if either: they differ by a single edge; or two nodes have been permuted. Both share the same order, $|V_{i+1}| = |V_i|$, and size, $|E_{i+1}| = |E_i|$. The subscript, $i$, is the number of perturbation steps taken from the empirical graph, $G$.

Studies that analysed the effect of using incomplete data (de Silva et al., 2006; Lee et al., 2006) or subgraphs of true graphs (Stumpf and Wiuf, 2005; Stumpf et al., 2005b) have shown that certain biases in the assessment of structural network traits are inevitable when using a subset of the true data. Subgraphs may not have similar properties, such as the degree distribution, to the complete true graph. It is important to know whether a graph ensemble can be linked to empirical PINs as well as if similarly motivated perturbations reproduce similar trait statistics.



The rate of change of traits and a measure of how different the graphs are are used to compare perturbed and empirical graphs. Definition 3.3 introduces a method for performing a single perturbation, or step, on a graph. The number of steps between graphs may not be adequate as a comparison between graphs generated by different perturbation methods, as steps may cancel out or lead to different rates of topological change. The differences between perturbed graphs are summarised using two measures: distance and instability between the graphs. These measures are defined for graphs that share the same order and size.

Distance between graphs, $G \sim (V, E_G)$ and $H \sim (V, E_H)$, is defined as the number of edges in which they differ (given the same order and size). This can be easily found from the (upper triangular) adjacency matrices for the two different graphs, forming a Hamming distance (Hamming, 1950). The distance is always a multiple of two because both graphs have the same size.

Definition 3.4 (Graph distance) Given two graphs, $G \sim (V, E_G)$ and $H \sim (V, E_H)$, with adjacency matrices, $A$ and $B$, let the distance between the graphs be defined as:

$$c(G, H) = \sum_{i<j \in [1,n]} |a_{i,j} - b_{i,j}|.$$
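For undirected simple graphs, the sum in Definition 3.4 counts the node pairs that are an edge in exactly one of the two graphs, which is the size of the symmetric difference of the edge sets. A minimal sketch of that equivalent calculation follows; the function name `graph_distance` is illustrative.

```python
def graph_distance(edges_g, edges_h):
    """Hamming distance between two undirected graphs on the same node set,
    computed as the size of the symmetric difference of their edge sets."""
    g = {frozenset(e) for e in edges_g}
    h = {frozenset(e) for e in edges_h}
    return len(g ^ h)
```

Moving a single edge removes one pair and adds another, so one edge rewiring contributes a distance of two, in line with the remark that the distance is always a multiple of two for graphs of equal size.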

This distance is used to assess how sensitive trait statistics are to small perturbations of the empirical graphs. The trait statistic, if the trait is important for the PIN, should not vary substantially for plausible close graphs. Let instability be defined as the rate of change of a graph trait (which is linked to the trait's robustness).

Trait instability is measured for each different perturbation method. In order for this measure to be comparable across different traits, the trait statistic, $\Delta(\Omega, \Phi)$, for the complete protein graph, $\Omega \sim (V, E_\Omega)$, is used. It aims to assess the rate of change, after $n$ perturbations, of a given trait from that found in the observed graph. Instability is non-symmetric, although this does not affect the results as it is always found relative to the same empirical PIN.

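Definition 3.5 (Instability) The instability, $s$, of a trait, $\Phi$, is the difference between the trait statistics of two graphs, $\Delta(G, \Phi)$ and $\Delta(H, \Phi)$, expressed as a proportion of the difference between $\Delta(G, \Phi)$ and the trait statistic across all node pairs, $\Delta(\Omega, \Phi)$:

$$s(G, H, \Phi) = \frac{\Delta(G, \Phi) - \Delta(H, \Phi)}{\Delta(G, \Phi) - \Delta(\Omega, \Phi)}.$$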
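Definition 3.5 can be evaluated directly from three trait statistics. The sketch below is illustrative only; the function name and argument names are assumptions, and the perturbed value used in the usage note is hypothetical rather than a reported result.

```python
def instability(trait_g, trait_h, trait_omega):
    """Instability s(G, H, Phi) from the three trait statistics
    Delta(G, Phi), Delta(H, Phi) and Delta(Omega, Phi)."""
    denominator = trait_g - trait_omega
    if denominator == 0:
        raise ZeroDivisionError("empirical and complete-graph traits coincide")
    return (trait_g - trait_h) / denominator
```

For example, taking the LC complex trait of 0.51 and the complete-graph value of 0.03 from Table 3.1, a hypothetical perturbed graph with a trait of 0.45 would give $s = (0.51 - 0.45)/(0.51 - 0.03) = 0.125$. The measure is 0 when the perturbed graph has the same trait value as the empirical graph and 1 when it has fallen to the value of the complete graph.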

Let graph $G_n$ be a graph that is $n$ perturbations from the empirical graph $G$. Now denote the distance and instability for these perturbed graphs as $c_n = c(G, G_n)$ and $s_n(\Phi) = s(G, G_n, \Phi)$ respectively. The distance, $c_n$, instability, $s_n$, and traits of the graph, $G_n$, generated by $n$ perturbations to the empirical graph $G$ are analysed. The three approaches used to perturb graphs are:

• Biological edge: a random edge, $e_a$, is rewired to form a new edge, $e_b \notin E$, such that $\phi(e_a) = \phi(e_b)$.

• Biological node: a random node, $v_i$, is permuted with another node, $v_j$, sharing a given characteristic, $\beta(v_i) = \beta(v_j)$.

• Biological shuffle: two edges $e_a = (v_{a_1}, v_{a_2})$ and $e_b = (v_{b_1}, v_{b_2})$, sharing node characteristics $\beta(v_{a_1}) = \beta(v_{b_1})$ and $\beta(v_{a_2}) = \beta(v_{b_2})$, are deleted and replaced with $e'_a = (v_{a_1}, v_{b_2})$ and $e'_b = (v_{b_1}, v_{a_2})$.

GO annotations ([process], [component], [function]) and a series of homology sets ($H_\alpha$) are used to constrain perturbations. The homology sets ($H_\alpha$) are defined by scores from BLAST (for proteins $A$ and $B$: $A \in H_\alpha(B)$ if $\mathrm{score}(A, B) > \alpha$).
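As an illustration of the third perturbation, a single biological shuffle step can be sketched as below. The rejection of moves that would create self-loops or duplicate edges is an assumption made to keep the graph simple, and `biological_shuffle_step` and `max_tries` are hypothetical names.

```python
import random


def biological_shuffle_step(edges, beta, seed=None, max_tries=1000):
    """Apply one biological shuffle: edges (a1, a2) and (b1, b2) with
    beta[a1] == beta[b1] and beta[a2] == beta[b2] are replaced by
    (a1, b2) and (b1, a2). Returns the edge set unchanged if no
    admissible pair of edges is found within max_tries attempts."""
    rng = random.Random(seed)
    edge_list = sorted(tuple(sorted(e)) for e in edges)
    edge_set = set(edge_list)
    for _ in range(max_tries):
        (a1, a2), (b1, b2) = rng.sample(edge_list, 2)
        if rng.random() < 0.5:
            b1, b2 = b2, b1  # try either orientation of the second edge
        if beta[a1] != beta[b1] or beta[a2] != beta[b2]:
            continue  # node characteristics do not match
        new_a, new_b = tuple(sorted((a1, b2))), tuple(sorted((b1, a2)))
        if a1 == b2 or b1 == a2 or new_a in edge_set or new_b in edge_set:
            continue  # would create a self-loop or a duplicate edge
        old = {tuple(sorted((a1, a2))), tuple(sorted((b1, b2)))}
        return (edge_set - old) | {new_a, new_b}
    return edge_set
```

Each successful step keeps the degree of every node and the label pair carried by every edge, and moves the graph a distance of four (Definition 3.4) from its predecessor.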


3.3 Results

Rewired (Section 3.3.1) and perturbed (Section 3.3.2) graphs are presented in this section. The biological trait statistics used are, for GO and complex annotations, the proportion of edges found between nodes with matching annotations. For expression data, the trait is the average level of co-expression, using Spearman's rank correlation coefficient, for all graph edges. The clustering coefficient for a graph is the average clustering coefficient across the full node set, as introduced for a given node in Definition 1.24 on page 44.

The trait statistics for each empirical graph are compared to the ensemble trait averages found for rewired and perturbed graphs. An ensemble trait average is the arithmetic mean of the trait values found for (1,000) random samples from the considered graph ensemble. For each rewiring ensemble assessed the available space of allowed graphs greatly exceeded the number sampled and there were no computational problems sampling from each set. The variability of the trait values for each ensemble method is also contrasted and compared, whilst the instability of each trait is focused on when observing the effects of perturbations on the empirical graph data.
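The annotation trait statistic and the ensemble trait average can be sketched as follows. The sketch assumes that two proteins "match" when they share at least one annotation term, which is one possible reading when proteins carry multiple annotations; the function names are illustrative.

```python
from statistics import mean


def matching_proportion(edges, annotations):
    """Proportion of edges whose end nodes share at least one annotation.

    Assumption: annotations maps each node to a set of terms, and nodes
    without an entry are treated as unannotated.
    """
    edges = list(edges)
    if not edges:
        return 0.0
    hits = sum(1 for u, v in edges
               if annotations.get(u, set()) & annotations.get(v, set()))
    return hits / len(edges)


def ensemble_trait_average(sampled_graphs, annotations):
    """Arithmetic mean of the trait over graphs sampled from one ensemble."""
    return mean(matching_proportion(g, annotations) for g in sampled_graphs)
```

Dividing an ensemble trait average by the value for the empirical graph gives the rescaled proportions used when the results are displayed relative to the empirical trait.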

The results presented are descriptive in nature. Network shuffle and node shuffle ensembles are contrasted to the traits observed for random graphs and those found for empirical data. The effects of constraining on biological characteristics as well as network characteristics are also observed. The aim is to assess the biological properties of each of the ensembles, and highlight anomalous findings, to inform the use of graph null models for further network analyses and those presented in Chapter 4.

3.3.1 Rewiring

The trait statistics that graphs from each ensemble produce are presented in this section. It was found that graph ensembles used here do not reproduce the observed traits of empirical graphs, which are shown in Table 3.1.

For each set of analyses, the level of each trait is displayed as a proportion of that observed for the empirical graphs. Trends across the empirical graphs are similar. LC, which is the largest empirical graph, is used for illustration here whilst further results can be found in Appendix C.



Graph           Trait
                Co-expression   Complex   Function   Process   Component   Clustering
CORE            0.11            0.59      0.31       0.41      0.50        0.21
DIP             0.07            0.51      0.17       0.23      0.38        0.09
LC              0.09            0.51      0.19       0.28      0.44        0.13
Complete (Ω)    0.01            0.03      0.03       0.03      0.18        1

Table 3.1: Empirical graph traits. Trait values are detailed for the three empirical graphs and the complete graph with nodes for all proteins found in LC (i.e. the graph with all possible edges). The traits for matching GO categories, complex annotations, average co-expression and clustering coefficients are detailed. For each GO category and the complex annotations the trait statistic is the proportion of edges found between nodes with matching annotations. The average level of co-expression of the nodes that are connected forms the co-expression trait and the clustering coefficient trait is the average clustering coefficient for each node.

Co-expression rates

Figure 3.6 shows the co-expression graph trait for each graph ensemble. Although there are large differences between the ensembles, none of the ensembles produce traits close to the co-expression trait of the empirical data. Biological ensembles that constrain complex annotations, [complex], produce the closest trait values to those observed in the empirical graphs. Biological network shuffle [complex] graphs have co-expression trait values of 80-90% of the trait value for LC or CORE, and less than 80% when compared to DIP. Random graph and node shuffle ensembles, in contrast, produce graphs that have approximately 20% of the LC trait value.

Network shuffle ensembles produce graphs that have higher, and less variable, trait statistics than the equivalent node shuffle ensembles. The node shuffle ensembles show higher variance in trait values than the random graph ensemble, although when no further biological constraints are applied the mean trait is approximately equal.

Complex annotations

The proportion of matching complex annotations shows similar trends to those seen for the co-expression trait values. Figure 3.7 shows the proportion of edges found between proteins that have been reported in the same protein complexes. Each boxplot shows the average trait value produced for graph ensembles as a proportion of the value found for LC (shown in Table 3.1). Figure 3.7 shows that the non-biological ensembles (random graph, network shuffle or node shuffle) all produce a similar number of matching complex annotations.

Figure 3.6: Co-expression trait for ensembles using LC graph. [Boxplots, one per graph ensemble; horizontal axis: average co-expression, from 0.0 to 1.0.] Average co-expression trait results are shown rescaled in proportion to the trait value for the LC graph. The red line, always at 1, shows the empirical trait statistic. The lowest four (yellow) boxplots relate to biological network shuffle results, whilst the red relate to biological node shuffle. The fifth from bottom (green) boxplot is for the unconstrained network shuffle, the blue boxplot for node shuffle and the top (cyan) boxplot shows the results for the random graph ensemble. Co-expression found increases to the right on the graph, and network shuffle ensembles show the highest levels, although lower than that found in the empirical graph.

As for the other traits, network shuffle produces less variability in the complex trait values than either random graph or node shuffle, the most variable ensemble. The trait does not depend on whether the degree sequence is fixed, or whether the labels are fixed along with the node degrees. The ensembles which only constrain network characteristics produce an ensemble trait average almost identical to that shown by the random graph ensemble. On average, the random graph ensemble produces only a tenth of the matching complex annotations that are seen in the empirical LC graph.

Figure 3.7: Complex trait for ensembles using LC graph. [Boxplots, one per graph ensemble; horizontal axis: proportion matching complex annotations, from 0.0 to 1.0.] Matching complex annotations trait results are shown rescaled in proportion to the trait value for the LC graph. The red line, always at 1, shows the empirical trait statistic. The bottom four (yellow) boxplots relate to biological network shuffle results, whilst the red relate to biological node shuffle. The fifth from bottom (green) boxplot is for the unconstrained network shuffle, the blue boxplot for node shuffle and the cyan shows the results for the random graph ensemble. In contrast to co-expression data, the variance of the trait is lower, shown by the smaller width of the boxplots.

Biological network shuffle ensembles produce a higher proportion of matching complex annotations, between just under 20% and 35%, than the biological node shuffle ensembles. After [complex], the [process] constrained ensembles produce graphs with the highest trait value, followed by [function] and lastly [component] for the GO categories. All these ensembles produce graphs that exhibit less than half of the complex annotation matches found in the LC data. This is the lowest proportion, in general, of matching annotations retained of all biological traits examined in this section. Clearly, the [complex] constrained ensembles produce the closest match to the LC graph for this trait. However, owing to multiple annotations (which mean that matching annotated links can be broken under rewiring as a unique annotation is chosen), both ensemble averages are lower than the value found for the LC graph.

Gene Ontology

The ensemble averages produced for the matching GO category traits are shown in Figures 3.8-3.9. By construction, the trait values are closest to the empirical value for ensembles that are constrained by the same biological characteristic (i.e. [function] produces almost identical results for matching function annotations). Otherwise, however, the trait statistics are consistently lower in each different ensemble (whether network or biological characteristics are constrained) than those seen for LC. This is confirmed by the two further empirical graphs (shown in Appendix D).

Although biological ensembles reproduce similar trait statistics for the characteristic constrained, there are still slight differences to the empirical trait. These differences must be a consequence of multiple annotations for each protein.

Topological ensembles, network shuffle and node shuffle with no biological constraints, exhibit different results across the GO slim categories (although consistent with earlier observed traits). For each trait, network shuffle ensemble graphs produce a higher proportion of the matching annotations than node shuffle, which permutes the labels over a fixed topological structure. The network shuffle ensemble produces higher values for the component trait than the biological node shuffle [function] ensemble, shown in Figure 3.9(a). Otherwise topological ensembles all produce lower traits than any of the biological ensembles presented.

Clustering coefficient

Each graph ensemble fixes different topological network characteristics, with all the node shuffle methods fixing the complete topological structure of the observed graph. So inevitably, Figure 3.9(b) shows that the clustering coefficient found for each node shuffle ensemble (biological or topological) is the same as that found for LC.

In general, for other ensembles such as random graph and network shuffle, the clustering coefficient is generally less than a quarter of the value found in the equivalent empirical graph. Biological network shuffle [complex] forms an exception to the low clustering coefficients produced by these ensembles. The complex annotation constraint results in an ensemble average clustering coefficient of over half of that found for LC. Ultimately, however, only node shuffle reproduces the local structure of the empirical data: something it does by design.

Figure 3.8: Gene Ontology traits for graph ensembles. Panels: (a) Function (axis: proportion matching function annotations); (b) Process (axis: proportion matching process annotations). Ensemble trait averages for two Gene Ontology categories are shown rescaled in proportion to the trait value for the LC graph. The red line, always at 1, shows the empirical trait statistic. The bottom four (yellow) boxplots relate to biological network shuffle results, whilst the red relate to biological node shuffle. The fifth from bottom (green) boxplot is for the unconstrained network shuffle, the second from top (blue) boxplot for node shuffle and the top (cyan) boxplot shows the results for the random graph ensemble.

Figure 3.9: Component and clustering traits for graph ensembles. Panels: (a) Component (axis: proportion matching component annotations); (b) Clustering (axis: average clustering coefficient). Ensemble traits are shown for the graphs sampled for clustering and biological component traits, rescaled in proportion to the trait value for the LC graph. The red line, always at 1, shows the empirical trait statistic. The bottom four (yellow) boxplots relate to biological network shuffle results, whilst the red (or four of those on the red line in (b)) boxplots relate to biological node shuffle. The green boxplot is for the unconstrained network shuffle, the second (blue) boxplot for node shuffle and the top (cyan) boxplot shows the results for the random graph ensemble.

Differences between empirical graphs

Although the analyses show similar trends across all 3 empirical graphs, the average co-expression trait results differ between DIP and the other two empirical graphs. For the DIP graph there is a noticeable difference between the trait statistics found for node shuffle and random graph ensembles. This is not found in the results for LC (seen earlier in Figure 3.6) or CORE. Figure 3.10 shows the trait values found for average co-expression for each of the empirical graphs, as a proportion of the empirical observation. This shows the co-expression trait results across the topological ensembles: node shuffle (Figure 3.10(a)); network shuffle (Figure 3.10(b)); and random graph (Figure 3.10(c)). Random graph and network shuffle ensembles show the same properties, reproducing similar proportions of the co-expression that is evident in the relevant empirical data. However, the DIP co-expression proportion found for graphs generated from the node shuffle ensemble is noticeably higher.

Overall, rewiring the empirical data generates graphs that have lower trait values, although for each graph similar proportions of the empirical trait value are generated by each ensemble method. Node shuffle ensembles generate graphs with similar biological traits to those found in random graph samples (except for the DIP co-expression trait shown in Figure 3.10(a)). The trait statistics are not greatly affected by fixing the exact structure of the empirical graph, as shown through comparison to the random graph ensemble. However, node shuffle trait values are more variable than those produced by graphs sampled from random graph ensembles. The third topological method, network shuffle, retains a higher proportion of matching annotations than equivalent node shuffle ensembles. The average distance between the random graphs and empirical data was, as expected, influenced by the constraints placed on the ensemble. Network shuffle ensemble graphs were on average closest to the empirical data, followed by the equivalent node shuffle ensembles and then the random graph ensemble.

Figure 3.10: Co-expression trait for topological ensembles. Panels: (a) Node shuffle; (b) Network shuffle; (c) Random graph (axes: proportion of empirical co-expression against density, with one histogram each for CORE, DIP and LC). Histograms show the density of samples from the topological ensembles for the co-expression trait. The x-axis is the proportion of the empirical trait observed for each sampled graph: CORE, DIP or LC. DIP shows different behaviour for node shuffle ensemble than the other two empirical graphs. For each other ensemble method, the traits values show similar behaviour relative to the empirical trait value.

3.3.2 Perturbations

The rewired graph ensembles present a means of generating random graphs based on basic assumptions about the biological and network properties of PINs. Whilst these ensembles did not generate graphs that shared biological traits with the empirical data, they showed different properties dependent on the network characteristics maintained. Empirical graphs are now perturbed, rather than rewired, to assess how constraints on the graph’s evolution affect the same graph traits. The trait statistics are now measured against the distance found between graphs, whilst in Section 3.3.1 the distance between the graphs was not considered.

Perturbed graphs were generated using the CORE graph (which has 2,528 proteins and 5,728 interactions). Stability and closeness measures are used to compare each of the simulated graphs. First, the use of GO category constraints on perturbations is described, and second the analyses performed on homology sets (Hα) for sets with scores α ∈ {10, 100, 500, 1000}. Results for the perturbations are the average statistic across the runs (for each of 0 to 10,000 steps from the empirical graph). Each constraint has been used for the three different perturbation approaches found in Section 3.2.3: (a) biological edge; (b) biological node; and (c) biological shuffle.

This section concentrates on the relationship observed between graph distance and instability (defined in Section 3.2.3). The graph distance (introduced in Definition 3.4) is a Hamming distance between each of the perturbed graphs and the CORE graph. It is always a multiple of two and lies between 0 and 11,456, as CORE has 5,728 edges.
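A minimal sketch of such a distance follows, assuming that it counts the node pairs that form an edge in exactly one of the two graphs (Definition 3.4 is not reproduced in this section, so this form is an assumption). Under that reading, a perturbation that removes one edge and adds another changes the distance by two, and two graphs with 5,728 edges each can differ in at most 2 × 5,728 = 11,456 pairs. The function name and toy graphs are hypothetical.

```python
def graph_distance(edges_a, edges_b):
    """Hamming-style distance between two undirected simple graphs:
    the number of node pairs that are an edge in exactly one graph."""
    a = {frozenset(e) for e in edges_a}
    b = {frozenset(e) for e in edges_b}
    return len(a ^ b)  # size of the symmetric difference

# Hypothetical toy graphs with the same number of edges.
core = [("p1", "p2"), ("p2", "p3"), ("p3", "p4")]
perturbed = [("p1", "p2"), ("p2", "p4"), ("p3", "p4")]  # one edge moved
disjoint = [("p1", "p3"), ("p1", "p4"), ("p2", "p4")]   # no shared edges

d_one_step = graph_distance(core, perturbed)
d_max = graph_distance(core, disjoint)
```

With three edges in each toy graph the largest possible distance is six, mirroring the bound of 11,456 for CORE.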

Instability (introduced in Definition 3.5) is a measure of how the trait values compare to the trait values found for CORE and the complete graph (on the same node set as CORE). For the traits considered, the CORE graph trait value is higher than that found on average in the complete graph. In general, a negative instability means that a trait value is larger than the value found for CORE, and an ER random graph has an expected trait instability of 1 (as an ER graph is a random sample of all possible edges).
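Definition 3.5 is not reproduced in this section. One linear rescaling that is consistent with the properties stated above (zero at the CORE value, one at the complete-graph value, negative when the trait exceeds the CORE value) is sketched below. It is an assumed form for illustration, not necessarily the definition used in the thesis, and the numerical values are hypothetical.

```python
import math

def instability(trait_value, trait_empirical, trait_complete):
    """Assumed rescaling of a trait value: 0 at the empirical (CORE) value
    and 1 at the complete-graph value."""
    if trait_empirical == trait_complete:
        raise ValueError("empirical and complete-graph trait values must differ")
    return (trait_empirical - trait_value) / (trait_empirical - trait_complete)

# Hypothetical trait values: CORE at 0.6, complete graph at 0.2.
s_core = instability(0.6, 0.6, 0.2)      # unchanged trait
s_complete = instability(0.2, 0.6, 0.2)  # trait has reached the complete-graph value
s_halfway = instability(0.4, 0.6, 0.2)   # halfway between the two
s_higher = instability(0.7, 0.6, 0.2)    # trait larger than for CORE
```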

Gene Ontology constrained graphs

Figure 3.11 shows the distance and instability for the perturbations when GO annotations are used to constrain each step. Aside from the fixed trait, the non-constrained traits move towards the value found for the complete graph, although the biological shuffle method produces a smaller value of instability and distance across the methods.

Figure 3.11(a) shows the instability and distance results for biological node [function]. These perturbed graphs are the furthest away from the empirical graphs for a fixed number of perturbations. Figures 3.11(b)-3.11(c) show the general behaviour of the other two perturbation methods. Biological edge graphs are closer than biological node graphs to the empirical data, and biological shuffle graphs are closer still, for fixed steps. This illustrates the extra constraints that are placed on the adjacency matrices by these algorithms.

Homology null sets

The genetic similarity of proteins should be linked to their interaction partners. If the perturbations made to an empirical interaction graph are constrained according to genetic similarity it is expected that the biological traits would be retained more readily than by chance. To test this hypothesis, four sets of homologous sequences were generated for each protein, using similarity scores of at least 10, 100, 500, 1000 to define which protein pairs are considered homologous. These four sets are of various sizes, forming subsets of the proteome, V, for each protein. Clearly, the set for each protein decreases in size as the score increases (Hξ ⊆ Hβ ∀ ξ > β). Table 3.2 shows the average number of sequences found for each protein.

Score, α     10      100     500     1000
Proteins     65.95   1.92    0.20    0.04

Table 3.2: Size of homology sets. Average set sizes of Hα(v) for a protein v and score α. The average set size that nodes can be permuted within for the simulated perturbations decreases as the score α is increased.
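The construction of the thresholded sets, and the nesting Hξ ⊆ Hβ for ξ > β, can be sketched as follows. The sketch assumes symmetric pairwise similarity scores stored per protein pair; the function names, protein labels and score values are hypothetical and do not come from the thesis data.

```python
def homology_sets(scores, proteins, alpha):
    """H_alpha(v): proteins whose similarity score with v is at least alpha.
    Scores are assumed symmetric, one entry per unordered pair."""
    sets = {v: set() for v in proteins}
    for (u, v), score in scores.items():
        if score >= alpha:
            sets[u].add(v)
            sets[v].add(u)
    return sets

def mean_set_size(sets):
    """Average |H_alpha(v)| over all proteins, as reported in a size table."""
    return sum(len(s) for s in sets.values()) / len(sets)

# Hypothetical toy scores for four proteins.
proteins = ["p1", "p2", "p3", "p4"]
scores = {("p1", "p2"): 1200, ("p1", "p3"): 150, ("p2", "p3"): 40, ("p3", "p4"): 12}

thresholds = (10, 100, 500, 1000)
by_alpha = {alpha: homology_sets(scores, proteins, alpha) for alpha in thresholds}
mean_sizes = {alpha: mean_set_size(by_alpha[alpha]) for alpha in thresholds}
```

Raising the threshold can only remove members from each set, so the mean set size is non-increasing in α, as in Table 3.2.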

Two different null sets were constructed to assess the effect of constraining perturbations on homology sets. First, random sets were chosen such that members of H′α(v) are picked (uniformly) at random so that ∀ v ∈ V : |H′α(v)| = |Hα(v)|. Second, structure sets were generated such that H′α(v) = Hα(f(v)), where f : V → V is a random permutation, or bijection, of the original protein set. Structure retains the size and structure of the homology sets whereas random just maintains the size of each homology set.
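The two null constructions can be sketched as below. This is a minimal illustration under stated assumptions: the random sets are sampled without replacement from all other proteins (excluding v itself is an assumption), and the function names and toy homology sets are hypothetical.

```python
import random

def random_null_sets(homology, rng):
    """Random null: each H'(v) is a uniform sample with |H'(v)| = |H(v)|."""
    proteins = sorted(homology)
    null = {}
    for v in proteins:
        candidates = [u for u in proteins if u != v]  # assumption: v is excluded
        null[v] = set(rng.sample(candidates, len(homology[v])))
    return null

def structure_null_sets(homology, rng):
    """Structure null: H'(v) = H(f(v)) for a random permutation f of V."""
    proteins = sorted(homology)
    permuted = proteins[:]
    rng.shuffle(permuted)
    f = dict(zip(proteins, permuted))
    return {v: set(homology[f[v]]) for v in proteins}

# Hypothetical toy homology sets.
homology = {"p1": {"p2", "p3"}, "p2": {"p1"}, "p3": {"p1"}, "p4": set()}

random_null = random_null_sets(homology, random.Random(1))
structure_null = structure_null_sets(homology, random.Random(1))
```

The random null keeps the size of each protein's set but not its membership, whereas the structure null reassigns whole sets to other proteins, so the collection of sets is unchanged.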

Figures 3.12-3.13 show the results, for complex annotation and GO process traits, of these null set simulations. Distance is plotted against instability for the graphs, Gn, generated by the perturbation simulations. These plots show a noticeable difference in the behaviour of the true homologue sets in comparison to the null sets. Maintaining the structure of the homologue sets leads to graphs that are approximately the same distance apart as those from the true homology sets although the traits are less volatile (exhibiting a significantly lower instability which means they do not change as much).

Figure 3.11: Stability and distance for GO perturbations. Panels: (a) Biological node [function]; (b) Biological edge [component]; (c) Biological shuffle [process] (axes: Stability, s(.), and Distance, c(.); traits plotted: Function, Component, Process, Co-expression, Complex). Distance and instability measurements are shown for a selection of biological traits (GO annotations, complex annotations and average co-expression) when CORE is perturbed by the three perturbation methods. The y-axis shows the distance, c(.), from the empirical graph. The x-axis shows the instability, s(.), for a particular biological trait statistic. All of the trait values converge towards that for the complete graph as the empirical graph is perturbed. All traits show similar instability and distance relationships except from the trait that is being explicitly constrained in the algorithm.

Figure 3.12: Null homology perturbations for complex annotations. Panels: (a) Biological edge H100(v), complex trait; (b) Biological edge H500(v), complex trait (axes: Stability, s(.), and Distance, c(.); series: Homologues, Structure [NULL], Random [NULL]). The trait results for perturbation simulations of null sets, H′α(v), are shown in comparison to the sets, Hα, determined by sequence homology. Figures 3.12(a)-3.12(b) show the complex annotation trait instability, s(.), and distance, c(.), using the biological edge perturbation algorithm. Structure graphs reach comparable distances to the homology sets, although the instability is several times larger, showing that instability is greater if perturbations of edges are rewired according to genetic sequence similarity. The instability, along with the possible distance, reduces as the score is increased.

Figure 3.13: Null homology perturbations for process annotations. Panels: (a) Biological node H500(v), process trait; (b) Biological node H1000(v), process trait (axes: Stability, s(.), and Distance, c(.); series: Homologues, Structure [NULL], Random [NULL]). The trait results for perturbation simulations of null sets, H′α(v), are shown in comparison to the sets, Hα, determined by sequence homology. Figures 3.13(a)-3.13(b) show the process annotation trait instability, s(.), and distance, c(.), using the biological node perturbation algorithm. The instability, along with the possible distance, reduces as the score is increased. If the score is 1,000 then the homology set perturbed graphs have a maximum distance of 100, so only 2% of the edges differ.

As the score (α) increases both the instability and distance get smaller for a given number of perturbations. Figures 3.12(b) and 3.13(a) show that although the null models for biological node and biological edge reach similar distances apart across the simulations, the homologue sets show large differences in the average distances reached. As found for the ensemble methodologies, rewiring edges as opposed to nodes results in a lower variability in the distance from the empirical data.

The structure and random null model comparisons show that rewiring according to homologous sequence information generates graphs with more stable traits than expected by chance for these particular null models. The homologous sequences, therefore, show a higher level of annotation similarity (or level of co-expression) than found for random protein pairs.

Similarity constrains instability and distance

Figure 3.14 shows that as the similarity score, α, increases the sampled perturbed graphs exhibit smaller distance and instability values. This is found for the null models above as well, although there are differences between the perturbation methodologies. The biological node perturbed graphs have greater instability and distance statistics than comparable graphs (measured by number of steps) found using biological edge perturbations.

Figure 3.14 shows the distance and instability found for each of the three perturbation algorithms, for all considered scores. For each method, the instability reduces as the score increases, and when comparable graphs the same distance from the empirical data exist, the instability is lower as the homology score is increased. For a fixed distance, instability is comparable across all three methods. The differences in the distance reached by the methods (node, edge, or shuffle) are a consequence of the different topological constraints placed on the graph, its nodes and edges.

Figure 3.14: Similarity score by perturbation method. Panels: (a) Biological node; (b) Biological edge; (c) Biological shuffle (axes: Stability, s(.), and Distance, c(.); series: scores 10, 100, 500 and 1000). Distance and instability values are displayed as the similarity score, α, increases in the three figures. The shown plots are for the GO cellular component annotation trait.

3.4 Discussion

A series of simulations has been performed on empirical PINs in this chapter, to assess whether random graphs replicate the biological traits found in empirical data. Random graphs have been formed by making both small changes to the observed graph, and completely rewiring the data by different algorithms. Whilst the graph ensembles produced graphs that did not share trait statistics with empirical data, the analysis has shown that certain characteristics can be more closely reproduced through a variety of topological, or network dependent, means.

Rewiring has allowed the similarity of graph ensembles to the PINs to be observed as well as how the ensembles differ in relation to each other. The node shuffle ensemble results suggest that the biological traits are not necessarily dependent on topological structure. The variability of the trait values observed is higher for node shuffle graphs than other tested ensembles, reflecting the effect of rewiring nodes in graphs that exhibit scaling properties (see Section 2.5 on page 83). This shows the effect of the small number of hub proteins and how they increase the variability of the measurements if only the degree sequence is maintained.

Traits can be maintained more readily by constraining characteristics for the edges of the random graphs. If biological characteristics of the edges are maintained, alongside the degree of each particular node, then the trait statistics produced are closer to those found in the empirical data than those found from solely topology dependent ensembles (e.g. node shuffle or random graph ensembles). The biological network shuffle ensemble shows that extra biological constraints can be used to produce graphs that maintain the proportion of matching annotations.
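One common way to sample graphs that keep the degree of every node is the double edge swap, sketched below with an optional acceptance test on the new edges. This is an illustrative stand-in, not the network shuffle algorithm defined earlier in the chapter, which may differ; the `accept` hook, the function names and the toy graph are hypothetical.

```python
import random
from collections import Counter

def degree_preserving_shuffle(edges, n_swaps, rng, accept=lambda u, v: True):
    """Rewire by double edge swaps: (a, b), (c, d) -> (a, d), (c, b).
    Swaps that create self-loops or duplicate edges are rejected, as are
    swaps whose new edges fail the (hypothetical) accept(u, v) constraint."""
    edge_list = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edge_list}
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:  # consider both orientations of the second edge
            c, d = d, c
        if a == d or c == b:
            continue  # would create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:
            continue  # would create a duplicate edge
        if not (accept(a, d) and accept(c, b)):
            continue  # constraint on the rewired edges not satisfied
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list

def degrees(edges):
    """Degree of every node that appears in an edge."""
    return Counter(node for edge in edges for node in edge)

# Hypothetical toy graph.
edges = [("p1", "p2"), ("p3", "p4"), ("p5", "p6"), ("p1", "p3"), ("p2", "p5")]
shuffled = degree_preserving_shuffle(edges, 10, random.Random(0))
```

A biological constraint could be passed through `accept`, for example a test that the two proteins of a proposed edge share a chosen annotation; whether this matches the constraint used for biological network shuffle is not established here.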

Biological constraints can be used to increase the expected trait statistics for each random graph generated. Functional annotations and complex annotations show the highest level of correlation, whilst the complex annotations are the best constraint if the clustering coefficient is important when generating random graphs. The complex annotations cover the widest number of possible classes (one for each of the 547 observed protein complexes) and the smallest number of proteins per class. Therefore, these complex annotations are located in some of the highly clustered and well connected neighbourhoods of the empirical graph. This may be a consequence of how complex data have been used to focus binary interaction testing, or reflect the true biology of the graph.



The three empirical graphs display similar trends across most analyses, although the node shuffle ensemble graphs show surprising differences for the DIP data. The graph data are all supposed to be representations of the same S. cerevisiae PIN. This suggests that there are fundamental differences between the graphs which may be explained either by error or different levels of coverage across the annotated proteins. However, even if, as observed, different results are obtained for different realisations it does not necessarily hold that this can be linked to the true interactome. The property noted may be a consequence of DIP being the only data set used that has not been curated.

GO annotations can be used to fix certain biological traits. When a GO characteristic is used under the perturbation methods, the other GO annotation measures rapidly approach the complete graph trait value regardless of the algorithm used. Within the traits assessed, functional annotations are most highly correlated with complex annotations.

Homology sets have been shown to be linked to the biological traits. The instability and distance that perturbed graphs reached using these sets are significantly lower than seen for equivalent sized null sets. These results also show that set structure, as well as the size of the sets, affects the instability of the traits alongside the distance between the empirical data and the resulting graphs.

Multiple characteristic annotations have meant that certain annotations are not retained even though the rewiring algorithm has aimed to fix them. For each rewiring, a single annotation has been chosen to determine how the node, or edge, is rewired. For the GO categories this meant that biological node shuffle graphs exhibited a lower trait value than the empirical data, whilst biological network shuffle produced graphs with almost identical (in general even marginally higher) trait values. Each rewiring technique has a different effect on the empirical graph. Biological network shuffle rewires each edge to retain a characteristic, whilst biological node shuffle retains only a given characteristic for the rewired node.

Extra constraints could be added that would undoubtedly increase the rewired, or perturbed, graphs’ similarities to empirical graphs. However, the low graph distance seen between the shuffle method graphs and empirical data shows that this ensemble may not generate effective random graphs for comparison with the empirical data. For higher homology scores, the shuffle perturbation method only alters a very small number of possible edges (showing at most 2% difference), suggesting that the sampled graphs will retain the empirical graph’s properties by design, irrespective of their significance.



The effect of topological structure, or biological structure, underlying the empirical data should not be ignored when analysing complex graph structures. Graph ensembles offer a means of generating different random graph structures for network analysis. However, maintaining the topology or a node characteristic does not appear sufficient to generate graphs that share the traits of the empirical PINs. The graph structure, which all the node shuffle ensembles maintain, is not a sufficient property to reproduce any of the tested non-network traits. Indeed, for LC and CORE empirical data, the node shuffle ensemble trait averages are similar to those found in the random graph ensemble, perhaps showing that graph structure does not influence the similarity of GO or complex annotations.

This chapter highlights the importance of using appropriate null models when testing hypotheses on large scale graphs. The choice of null model may change the outcome of hypothesis tests. Indeed, the trait under consideration should be tested against a variety of different graph ensemble probability distributions in order to effectively disassociate the possible effects of topological, as well as other possibly biological, confounding factors from the analysis. The different ensembles provide a means of clearly defining what is meant by ‘expected by chance’ in the network context. Whereas the linkage of individual PPIs to particular traits can be made by assessment against an ER random graph, this is not true if the trait is believed to be linked to the network structure or other possibly biological covariates. The ensembles enable a more subtle view of linkage between traits and PINs, allowing a test of whether the trait values found in the observed interactome are more similar than would be expected in a random graph with clearly defined properties.
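What "expected by chance" means under a given ensemble can be made concrete with an empirical (Monte Carlo) p-value: the fraction of sampled graphs whose trait is at least as large as the observed one. The sketch below uses the standard add-one correction, which is a common convention rather than a procedure taken from the thesis; the function name and the trait values are hypothetical.

```python
def empirical_p_value(observed, ensemble_values):
    """One-sided Monte Carlo p-value with the add-one correction:
    (1 + number of samples >= observed) / (1 + number of samples)."""
    exceed = sum(1 for x in ensemble_values if x >= observed)
    return (exceed + 1) / (len(ensemble_values) + 1)

# Hypothetical trait values sampled from two different ensembles.
observed_trait = 0.45
random_graph_samples = [0.10, 0.12, 0.09, 0.11, 0.13, 0.10, 0.08, 0.12, 0.11]
constrained_samples = [0.40, 0.47, 0.44, 0.46, 0.43, 0.48, 0.42, 0.45, 0.41]

p_random = empirical_p_value(observed_trait, random_graph_samples)
p_constrained = empirical_p_value(observed_trait, constrained_samples)
```

In this toy example the same observed trait looks extreme against the unconstrained samples but unremarkable against the constrained ones, which is the sense in which the choice of null model can change the outcome of a test.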


Chapter 4

Phylogenetic topologies of interacting proteins

This chapter presents a study of the phylogenetic topologies of yeast proteins (Section 4.2), analysing whether or not the topological properties of a protein’s phylogenetic tree are more similar between interacting proteins than would be expected. Further analysis contrasts the linkage of expression and topological characteristics between interacting proteins or proteins found in the same complex (Section 4.3).


4.1 Introduction

The connection between the degree of a protein, and the ability of that protein to change, or evolve, is of considerable interest. That a protein involved in a high number of interactions can be evolutionarily constrained by those interactions has been suggested in the past (Fraser et al., 2002). Several studies indicate a linkage between the evolutionary rate of proteins and the number of PPIs in which they are involved (Pellegrini et al., 1999; Goh and Cohen, 2002; Gertz et al., 2003; Pazos et al., 2005). Conversely, the extent to which PPIs may influence the evolutionary properties of proteins has been estimated using relative sequence conservation by Jordan et al. (2003), who suggest that evolutionary rate shows a much stronger association with factors other than a protein’s degree.

In addition to the connection between the number of PPIs in which a protein partakes and the evolution of that particular protein, the idea that two proteins can evolve in tandem has been postulated. The proteins involved in a small number of E. coli PPIs have been shown to have correlated evolutionary rates (Pazos et al., 2005). This finding has been replicated for particular protein families (Jothi et al., 2005; Juan et al., 2008b). These studies focus primarily on employing distance methods to demonstrate phylogenetic similarity and do not directly compare the topological properties of explicitly reconstructed protein phylogenetic trees. Instead they measure the similarity between branch lengths of the phylogeny, assuming the same model of evolution across the complete tree. The topological information, under an assumption of co-evolution, should be significantly linked to the presence of PPIs.

Although the construction of accurate phylogenetic trees (for many species) is computationally difficult, it is necessary to assess whether the topology of these phylogenies can be used to predict PPIs, an end to which protein phylogenetic profiles (Pellegrini et al., 1999), distance matrices (Pazos and Valencia, 2001; Sato et al., 2003; Pazos et al., 2005), and other measures of co-evolution between proteins (Goh et al., 2000; Goh and Cohen, 2002; Gertz et al., 2003; Ramani and Marcotte, 2003) have already been put.

This chapter assesses whether the topological properties of a protein’s phylogenetic tree and interactions are linked. The hypothesis that phylogenetic topologies of interacting proteins are more similar than those of protein pairs connected is tested using random graph ensembles. A variety of different graph ensembles are used, along with a collection of empirical PINs, to assess whether the topologies of PPIs, or the set of PPIs seen in the full PIN, are more similar than those found in the different graph ensembles.


High levels of concordance between the individual protein phylogenetic trees are anticipated as these should tend to follow the species tree. Whether or not characteristics of phylogenetic trees, especially their topology, show concordance between interacting proteins greater than would be expected in random graphs has not previously been explicitly tested on a global level. The similarity of phylogeny topologies found in S. cerevisiae PIN data is compared to the same trait for a collection of random graph ensembles which were introduced in Chapter 3.

The difference between complex and binary interaction data is also assessed. To illustrate potential differences in these sets, the hypothesis that co-expression rates are higher for protein pairs within complexes than for those outside complexes is tested using expression data alongside the topological characteristics of the proteins’ phylogenetic trees.

4.2 Methods

This section describes the analyses applied to explore the role of evolutionary constraints on S. cerevisiae protein pairs and the S. cerevisiae PIN. These have also been applied to complex annotation data, as introduced in Section 3.2.1 on page 89, and the topological similarity has been compared to co-expression rates as a possible classifier of PPI or multi-protein complex membership.

4.2.1 Data

PIN data are used for the empirical graphs (CORE, DIP and LC) as defined in Section 2.4 on page 79. Expression data, complex data and GO annotations used are taken from the sources described in Section 3.2.1 on page 89. To generate phylogenetic trees for each S. cerevisiae protein, a selection of 9 other yeast species, from Saccharomyces and Candida genera, has been mined for orthologous proteins using BLAST in the same manner as described in Agrafioti et al. (2005). The 10 species form a range of yeasts with common ancestry to S. cerevisiae of between approximately 10 million years (S. paradoxus) and over 300 million years (S. pombe), as shown in Figure 4.1. Protein coding sequences for each proteome used have been translated from their genome sequences (Mewes et al., 2006).



Figure 4.1: Phylogeny of study species. This shows the evolutionary relationship of the ten yeast species used (Wolfe, 2006; Fitzpatrick et al., 2006). S. cerevisiae proteins resulting from gene duplication events are thought to retain the same interactions as the original gene for millions of years rather than tens or hundreds of million years (Wagner, 2001). The genera Saccharomyces and Candida feature in the ten species.

BLAST queries were used to identify if orthologous proteins exist for each S. cerevisiae protein in the other species, to enable the creation of each protein’s phylogenetic tree. Multiple sequence alignments (MSA) (see Section 1.3.1) were performed using CLUSTALW (Thompson et al., 2002) for each S. cerevisiae protein and the most similar protein from every other species, as discovered using BLAST.

Phylogenetic tree topologies for each protein were inferred from the MSAs. Three different algorithms were used to infer the phylogenetic trees: PARS and PROML from Phylip 3.6 (Felsenstein, 1995); and the Codonml routine from PAML (Yang, 2007). For each phylogeny method, the analysis is restricted to those proteins where trees were inferred unambiguously (as an algorithm may return multiple trees with equal confidence). A tree for each S. cerevisiae protein is tested across a subset of the 10 related species, dependent on the availability of orthologous protein sequences.

The species tree, shown in Figure 4.1, and the protein trees may not necessarily agree (Tajima, 1983). As well as the protein trees being on a subset of the 10 study species (dependent on the availability of homologues), the topology may also be different. The species tree aims to depict the evolutionary history of the species, whilst the protein trees represent how a set of homologous proteins have evolved relative to each other through time. The differences are particularly apparent when the divergence time between the species is short (Pamilo and Nei, 1988), so for the yeast species used here there should be apparent variability (which is required for meaningful results) between the trees produced.

4.2.2 Correlated divergence

Proteins that co-evolve have similar evolutionary paths (Pazos and Valencia, 2008) where the mutational changes in each protein are triggered by changes in the co-evolving protein, i.e. the changes are compensatory. One consequence of co-evolution between protein pairs is a tendency to see similar rates of evolutionary change which are reflected through the branch lengths exhibited on the protein phylogenetic trees (Juan et al., 2008a). These branch lengths, whilst indicative of possible co-evolutionary behaviour, may also be indicative of correlated evolutionary rates, which may also be non-compensatory, as has been shown in S. cerevisiae (Hakes et al., 2007a). The correlation observed in the evolutionary distances is a consequence of constraints on the evolutionary rate, rather than a consequence of compensatory changes.

Whilst co-evolutionary behaviour between proteins will influence their rates of evolution, it also should affect the topology of their respective phylogenetic trees. If the proteins do interact, then each divergent split (reflected in the topology) will trigger changes in the co-evolving protein.

If proteins, labelled A and B, co-evolve then any evolutionary change in protein A will trigger compensatory changes in the second protein B, and vice versa. If A diverges, forming proteins A′ and A′′, then B will be triggered into diverging into B′ and B′′ (although it may be true that these new proteins are identical). Accordingly, the topology of the protein trees should reflect co-evolutionary pressures that may be the result of the proteins interacting across the study species. Phylogeny analysis alone cannot discover protein pairs where this is true, as no genetic correlation is seen. Phylogenetic trees can be used, however, to observe similar rates of evolutionary change or whether divergence events occur in similar patterns when comparing different proteins.



In order to provide an alternative view of phylogenetic similarity the topologies of the trees are compared. Measurement of topological similarity aims to discover potentially co-evolutionary relationships between proteins A and B where both proteins diverge (becoming A′ ≠ A′′ and B′ ≠ B′′) and share phylogenetic tree topology. When the topologies match, protein pairs are defined as co-diverging across the study species (or the subset where homologous proteins exist).

Definition 4.1 (Co-divergence) Co-divergent proteins, over a set of species, are those that share the same protein phylogenetic tree topology.

The topologies of protein phylogenetic trees are the same if the proteins exhibit the same pattern of divergences, although topology alone cannot distinguish between compensatory and non-compensatory divergence. Protein trees will differ from the consensus species tree (found for the complete genome rather than a protein sequence), and these changes are assessed for linkage between reported PPIs and the phylogenetic topology similarity, as an assessment of whether PPIs exhibit evidence of co-evolution across yeast species.

4.2.3 Measuring topological differences

In order to measure similarity, an edit distance, η, between phylogenetic topologies on a set of n species is defined. This distance is based on a nearest-neighbour interchange method (Felsenstein, 2003).

A phylogenetic tree topology, e.g. ((1, 2), (5, (3, 4))), contains a set of species, 1, 2, 3, 4, 5, and divergence events or internal nodes, represented by brackets. Topologies are neighbours if they can be made identical when a single species is moved across a node. For the string tree notation, across a node means either: (i) swapping a species with the first bracket either side in the string (deleting unnecessary brackets, e.g. ((1, )2) = (1, 2)); or (ii) creating a bracket around two species in the same set of brackets, e.g. (1, 2, 3) is a neighbour of ((1, 2), 3). For ((1, 2), (5, (3, 4))) the neighbours are: (1, 2, (5, (3, 4))), ((1, 2), (5, 3, 4)), and (5, (1, 2), (3, 4)). Neighbours are found from the set of multifurcating trees as defined in Section 1.3.2 on page 28. Figure 4.2 shows a minimal sequence of neighbouring phylogenetic trees to travel from topology ((1, 3), (2, 4, 5)), for protein A, to ((1, 2), (5, (3, 4))), for protein B. The distance, η_{A,B}, is the minimum number of tree topology changes required to generate matching trees.



Figure 4.2: Topology edit distance. An example of the measure of similarity between phylogenetic tree topologies: ((1,3),(2,4,5)) and ((1,2),(5,(3,4))). The score here is 5.

Each protein may have a different number of homologous proteins on which the phylogenetic tree is based. The number of possible trees (see Section D.1 in Appendix D) is dependent on the number of species included. As a consequence of this, the edit distance is not directly comparable if the trees have a different number of species. The similarity of topologies, Γ_{A,B} ∈ [0, 1], which takes account of the number of species, is:

\[ \Gamma_{A,B} = 1 - \frac{\eta_{A,B}}{M_n}, \tag{4.1} \]

where η_{A,B} is the score between two trees sharing the same n species and M_n is the maximum possible score between two trees on n species.

The maximum edit distance between two phylogenetic trees on n species is found by the recursion:

\[ M_{n+1} = M_n + (n - 2), \tag{4.2} \]

with M_3 = 2.
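As an illustration of Equations 4.1 and 4.2, the following minimal Python sketch evaluates the recursion for M_n and the resulting similarity Γ. The function names are hypothetical and are not taken from the analysis code used in this thesis; the edit distance η itself is assumed to be supplied by the caller. Unrolling the recursion gives the closed form M_n = 2 + (n − 2)(n − 3)/2 for n ≥ 3, which the sketch can be checked against.

```python
def max_edit_distance(n: int) -> int:
    """M_n from the recursion M_{n+1} = M_n + (n - 2), with M_3 = 2."""
    if n < 3:
        raise ValueError("the recursion is only defined for n >= 3 species")
    m = 2  # base case M_3
    for k in range(3, n):
        m += k - 2  # M_{k+1} = M_k + (k - 2)
    return m


def similarity(eta: int, n: int) -> float:
    """Gamma_{A,B} = 1 - eta_{A,B} / M_n for two trees on the same n species."""
    m_n = max_edit_distance(n)
    if not 0 <= eta <= m_n:
        raise ValueError("eta must lie between 0 and M_n")
    return 1.0 - eta / m_n
```

Under this recursion M_5 = 5, so the pair of trees in Figure 4.2 (score 5 on 5 species) would receive Γ = 0, whilst identical topologies always receive Γ = 1.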

4.2.4 Phylogenetic analyses

The similarity in the phylogenetic tree topologies of interacting proteins is assessed. Given two trees, their topologies match if the phylogenetic trees, on the set of species that appear in both topologies, are (non-trivially) identical. This requires that the two trees share at least 3 different species. Along with the match characteristic, both the score, η, and similarity, Γ, are used to assess the similarity of PPIs in the empirical graphs (CORE, DIP and LC). The orthologue information for each protein is used to construct the phylogenetic profiles for each protein pair compared.
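The matching rule described above can be sketched in Python as follows. This is an illustrative sketch only, under two assumptions that are not stated in the text: trees are treated as rooted, and they are represented as nested tuples whose leaves are species labels. All function names are hypothetical. Each tree is pruned to the shared species, internal nodes left with a single child are collapsed, and the two pruned trees are compared through their sets of clades, which is independent of the order in which children are written.

```python
def leaves(tree):
    """Return the set of species labels at the tips of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Prune the tree to the species in `keep`, collapsing single-child nodes."""
    if not isinstance(tree, tuple):
        return tree if tree in keep else None
    children = [c for c in (restrict(child, keep) for child in tree) if c is not None]
    if not children:
        return None
    if len(children) == 1:
        return children[0]
    return tuple(children)


def clades(tree):
    """Set of leaf sets below every internal node (order-independent topology)."""
    if not isinstance(tree, tuple):
        return set()
    result = {leaves(tree)}
    for child in tree:
        result |= clades(child)
    return result


def topologies_match(tree_a, tree_b, min_shared=3):
    """True/False if comparable; None if fewer than `min_shared` species are shared."""
    shared = leaves(tree_a) & leaves(tree_b)
    if len(shared) < min_shared:
        return None
    return clades(restrict(tree_a, shared)) == clades(restrict(tree_b, shared))
```

For example, ((1, 2), (5, (3, 4))) and ((1, 2), (3, 4)) share four species and match once species 5 is pruned from the first tree, whereas two trees sharing only two species are reported as not comparable.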

Analyses are completed on the empirical graphs and sampled random graphs from different graph ensembles (see Section 3.2 on page 88). These graph ensembles, making up 11 different probability distributions on graphs with fixed size and order, are:

• (1) Random graph [size and order fixed]

• (2) Node shuffle [nodes permuted according to node characteristic]

• Biological node shuffle: (3) [process]; (4) [component]; (5) [function]; (6) [complex]

• (7) Network shuffle [edges rewired according to edge characteristic]

• Biological network shuffle: (8) [process]; (9) [component]; (10) [function]; (11) [complex]

This chapter focuses primarily on the differences between the three types of ensemble: random graph; node shuffle; network shuffle. These ensembles probe different aspects of a putative association between the PIN and phylogenetic properties of the constituent proteins. As discussed at length in Chapter 3, node shuffle graph ensembles fix the graph structure and the phylogenetic tree labels are permuted randomly amongst the nodes. Network shuffle graph ensembles associate a tree phylogeny and fixed degree with each node but randomise the interactions. These probe the relative similarity of interacting phylogenies against the traits produced by various types of random graph.

For the analyses, 1,000 graphs are sampled from each graph ensemble. Traits are compared for these graphs with the empirical data. For each empirical graph, the three different phylogenetic techniques (PROML, PARS and PAML) are also contrasted.
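The difference between the two unconstrained structural ensembles can be sketched as follows. This is a generic illustration rather than the sampling algorithms of Chapter 3: the node shuffle is shown as a uniform permutation of labels over a fixed graph, and the network shuffle as repeated double edge swaps, a standard degree-preserving rewiring that is assumed here for illustration. The function names, the number of swaps and the attempt cap are all hypothetical, and the input is assumed to be a simple undirected graph given as a list of node pairs.

```python
import random


def degrees(edges):
    """Degree of every node in an undirected edge list."""
    counts = {}
    for u, v in edges:
        counts[u] = counts.get(u, 0) + 1
        counts[v] = counts.get(v, 0) + 1
    return counts


def node_shuffle(labels, rng):
    """Keep the graph fixed; permute the labels (e.g. protein trees) over nodes."""
    nodes = sorted(labels)
    values = [labels[v] for v in nodes]
    rng.shuffle(values)
    return dict(zip(nodes, values))


def network_shuffle(edges, n_swaps, rng):
    """Rewire by double edge swaps, so that every node keeps its degree."""
    edge_list = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edge_list}
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # choose one of the two possible re-pairings
        if a == d or c == b:
            continue  # the swap would create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present or new1 == new2:
            continue  # the swap would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list
```

A sampled graph would then be obtained with, for example, `network_shuffle(edges, 10 * len(edges), random.Random(1))`; constrained ensembles such as [complex] would additionally restrict which swaps or permutations are allowed.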


4.3 Results

The results from the phylogenetic analyses are now presented. First, the phylogenetic profiles are assessed for each empirical graph and graph ensemble method. Second, the topological similarity of interacting proteins is presented and followed by a comparison of the three different phylogenetic tree construction algorithms. Finally, additional analyses of data from E. coli and on the most closely related yeast species are presented.

The results presented in this section focus on the PROML phylogenetic trees, although there is also a comparison of the three phylogeny techniques in Section 4.3.3. The number of trees generated differs between the methods (owing to either no result from the algorithm or ambiguous trees): PROML, 4,380; PARS, 3,617; and PAML, 4,260. The average number of species represented in each protein tree, across all different phylogeny methods, is greater than 6.

4.3.1 Phylogenetic profiles

The orthologue data for each S. cerevisiae protein form a 9-bit phylogenetic profile where each bit signifies the presence or absence of an identifiable (sequence) orthologue in a given yeast species. Proteins for which no orthologue data are available have been excluded from the analysis, as they have been for all subsequent topological analyses. On average, each protein has more than five identifiable orthologous proteins across the 9 searched species (after those proteins which did not produce phylogenetic trees have been discarded).

For each edge, the difference between the profiles of the two connected proteins is measured. This reflects the number of species where only one of the proteins is conserved. Figure 4.3 shows the phylogenetic profile differences found for the sampled ensembles alongside a red line showing the average for LC. Across all the ensembles, the phylogenetic profile difference is in general higher for the sampled graphs than the value found for LC. Network shuffle ensembles are closer to the empirical value than the node shuffle ensembles, as found in Chapter 3 for other biological traits. Similarly, the node shuffle ensembles produce higher variability than either the random graph ensembles or the network shuffle ensembles. If the edge rewirings are constrained by complex annotations (the [complex] ensembles) the phylogenetic profile differences are closest to those found in the empirical graph.
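The profile difference for an edge is the number of positions at which the two 9-bit profiles disagree, i.e. a Hamming distance. A minimal Python sketch is given below, assuming (for illustration only) that profiles are stored as strings of '0' and '1' characters keyed by protein identifier; the function names and identifiers are hypothetical.

```python
def profile_difference(profile_a: str, profile_b: str) -> int:
    """Number of species in which exactly one of the two proteins has an orthologue."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same set of species")
    return sum(x != y for x, y in zip(profile_a, profile_b))


def mean_edge_difference(edges, profiles):
    """Average profile difference over all edges of a graph."""
    diffs = [profile_difference(profiles[u], profiles[v]) for u, v in edges]
    return sum(diffs) / len(diffs)
```

Applying `mean_edge_difference` to an empirical graph and to each sampled graph gives the trait values of the kind compared in Figure 4.3.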

[Figure 4.3 plot: boxplots grouped by ensemble type (network shuffle, node shuffle, random graph); horizontal axis: average difference in phylogenetic profile, ranging from 3.1 to 3.7.]

Figure 4.3: Phylogenetic profiles for each ensemble. Boxplots of the phylogenetic profile differences for edges in graphs for the 11 graph ensembles. The red line shows the average phylogenetic profile difference found for edges in the LC graph.

Figure 4.4 shows a selection of the graph ensembles (node shuffle, network shuffle, random graph and biological node shuffle [complex]) against the true output for the three empirical graphs: CORE, DIP and LC. The proportion of interacting proteins (after rewiring, or the red dot for empirical data) is shown for each possible phylogenetic profile difference (0–9). The horizontal axis shows the differences found between phylogenetic profiles, ranging from 0 (both proteins have orthologues in exactly the same species) to 9 (one of the compared proteins has orthologues in only those species that the other does not).

The results show that empirical interactions exhibit a higher propensity for similar phylogenetic profiles across all four shown ensembles. Biological node shuffle [complex] ensemble graphs, shown in Figure 4.4(d), are closer to the empirical data than any of the other ensembles. For all graph datasets the phylogenetic profiles with 3 or fewer differences are found more often among the real interacting pairs than in tested random graph ensembles.

[Figure 4.4 plots: four panels, (a) Random graph, (b) Network shuffle, (c) Node shuffle, (d) Biological node shuffle [complex]; horizontal axis: difference (0–9); vertical axis: proportion (0.00–0.25); legend: CORE, DIP, LC, Empirical.]

Figure 4.4: Phylogenetic profile differences. The differences in phylogenetic profiles, for each edge, shown as a proportion of the comparisons made across the data for four different graph ensembles. The empirical data, shown as red dots for each boxplot, are generally higher for differences less than 4, showing that the observed PPIs are more likely to share phylogenetic profiles than those edges found in any of the random graph ensembles.

Figure 4.4 shows that there is little difference between the results found for each empirical graph. Although the graphs are of different sizes, the PPIs in each of them show similar phylogenetic profile differences in both the random ensembles and empirical data. An exception to this is the DIP graph, where a higher proportion of edges are found between proteins that have matching phylogenetic profiles. The proportion of matching phylogenetic profiles is also higher in the node shuffle ensemble graphs sampled using DIP than in either the empirical results for CORE or LC or the other graph ensembles using DIP.

4.3.2 Topological similarity

The phylogenetic topologies for each graph are measured in three ways across all edges: the proportion of matching topologies; the topology score, η; and the similarity score, Γ, found on average. The similarity found in the empirical data should be higher than that found for random ensembles if there is any evidence for enriched co-evolutionary behaviour between interacting proteins. This section describes analysis using the PROML phylogenetic trees, although the trends between the ensemble methods are the same for each of the tree construction methods (PAML, PROML and PARS), results for which are in Appendix D.

Figure 4.5 shows the proportion of matching topologies for each of the graph ensembles in comparison to the proportion found for the LC graph (shown as a red line). Once again, the node shuffle ensemble shows higher variance of the trait than other graph ensembles. Each of the biological node shuffle ensembles constrained by a GO category exhibits a higher proportion of matching topologies than is seen in either LC or any of the network shuffle ensembles or random graph ensemble. Indeed, the average level of topology matching seen in all but the [complex] constrained ensembles is higher than found in the LC graph.

Figure 4.6 shows the topological scores between proteins that have interactions for LC and the graph ensembles. The topological score trait, which measures the average score between phylogenetic topologies, is higher for the random graph and node shuffle ensembles than for LC data. The reported interactions have more similar topological trees, if measured by the average score, than those found in sampled graphs from either the random graph or node shuffle ensembles. However, the average trait score for the network shuffle ensembles is lower than seen for the LC graph, and lowest for the basic network shuffle ensemble.

Score results, seen in Figure 4.6, may be influenced by the number of lineages compared in each tree comparison. Indeed, for each edge the average number of shared orthologues (the number of orthologues found in both proteins) is lower for each network shuffle ensemble. This will downwardly influence the average score as the topological score, η, does not take account of the number of lineages in each tree comparison. In contrast, higher scores may be evident between proteins whose phylogenetic profiles are more similar.

[Figure 4.5 plot: boxplots grouped by ensemble type (network shuffle, node shuffle, random graph); horizontal axis: average topology matches, ranging from 0.34 to 0.46.]

Figure 4.5: Topological matching for LC interaction graph. Boxplots for the distribution of matching topologies found for each sampled graph ensemble. The red line shows the result for the LC graph. The [complex] constrained graphs are the only ensembles that present fewer matching topologies than found in the empirical data.

Figure 4.7 shows the same results for the similarity measure, Γ, which takes account of the number of lineages and the score when comparing phylogenetic topologies. Unlike the results for the average score, all of the network shuffle ensembles now are not significantly different from the value observed empirically, although the average similarity is consistently marginally lower. Node shuffle ensembles have a lower similarity measure than the empirical data, although the sampled distribution of average similarities overlaps with the empirical result. The random graph ensemble has significantly lower similarity. Whilst the topologies of individual PPIs are more similar than expected for a random protein pair, the average similarity across the empirical graph is not higher than is expected if the degree of each protein is maintained and the edges reshuffled, as is the case for network shuffle ensembles.

[Figure 4.6 plot: boxplots grouped by ensemble type (network shuffle, node shuffle, random graph); horizontal axis: average score, ranging from 1.6 to 2.4.]

Figure 4.6: Mismatch score using LC interaction graph. Boxplots for the distribution of average scores between topologies found for graph ensembles. The red line shows the trait statistic for LC. Node shuffle and random graph ensembles exhibit higher scores than found in the empirical data, whilst network shuffle ensembles under any of the tested constraints exhibit a lower average score than is found in the empirical graph.

[Figure 4.7 plot: boxplots grouped by ensemble type (network shuffle, node shuffle, random graph); horizontal axis: average similarity, ranging from 0.81 to 0.86.]

Figure 4.7: Topological similarity for LC interaction graph. Boxplots for the distribution of average similarity between topologies found for graph ensembles. The red line shows the trait statistic for LC. The empirical data have similarity values close to those found for the network shuffle graphs, whilst the similarity is significantly higher for each of these than is found in graphs sampled from the random graph ensemble.

4.3.3 Phylogenetic methods

The level of similarity for the PROML tree construction method showed little difference between the similarity found in graphs from the network shuffle ensembles and the real empirical phylogenetic topology. However, the level of similarity found across the tree construction methods shows more variability. Table 4.1 shows the similarity trait, Γ, for each of the empirical graphs using trees constructed by each of PAML, PARS and PROML. There are large differences between the trait statistics for each phylogenetic construction algorithm, although the trends are similar when comparing the traits produced by graph ensemble with those seen empirically.



                                 Similarity, Γ
Tree construction  Graph  Real   Node shuffle          Network shuffle
PAML               CORE   0.74   0.747 [0.737,0.757]   0.742 [0.737,0.747]
                   DIP    0.75   0.760 [0.750,0.769]   0.743 [0.740,0.746]
                   LC     0.74   0.750 [0.741,0.760]   0.741 [0.739,0.743]
PROML              CORE   0.84   0.848 [0.839,0.857]   0.846 [0.843,0.849]
                   DIP    0.84   0.841 [0.834,0.849]   0.838 [0.837,0.840]
                   LC     0.84   0.831 [0.822,0.839]   0.837 [0.836,0.839]
PARS               CORE   0.90   0.901 [0.893,0.909]   0.899 [0.896,0.903]
                   DIP    0.90   0.893 [0.884,0.900]   0.893 [0.891,0.895]
                   LC     0.89   0.882 [0.874,0.890]   0.893 [0.891,0.895]

Table 4.1: Similarity for each phylogenetic tree construction algorithm. Average similarity, Γ, for each phylogenetic tree construction algorithm, for the empirical graphs. The results for node shuffle and network shuffle graph ensembles are given along with the 95% sample range for the similarity trait.
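A 95% sample range of the kind reported in Table 4.1 can be obtained from the trait values of the sampled graphs. The sketch below shows one common way to do this, taking the central 95% of the ordered sample; this choice of estimator is an assumption made for illustration, as are the function names, and it is not necessarily the calculation used to produce the table.

```python
import math


def sample_range(values, coverage=0.95):
    """Central `coverage` range of a sample, using order statistics."""
    ordered = sorted(values)
    n = len(ordered)
    tail = (1.0 - coverage) / 2.0
    lower = ordered[math.floor(tail * (n - 1))]
    upper = ordered[math.ceil((1.0 - tail) * (n - 1))]
    return lower, upper


def within_range(empirical, values, coverage=0.95):
    """True if the empirical trait lies inside the central sample range."""
    lower, upper = sample_range(values, coverage)
    return lower <= empirical <= upper
```

With the trait values from the 1,000 graphs sampled from an ensemble passed as `values`, `within_range` gives a simple indication of whether the empirical trait is unusual relative to that ensemble.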

PARS topologies show a higher level of concordance across the PPIs in all cases. The two maximum likelihood phylogenetic algorithms (PAML and PROML) produce lower levels of similarity. Whilst each algorithm produces a different level of similarity for the empirical graph data, these different values are also seen in the random graphs sampled from the network shuffle and node shuffle ensembles. Differences between the phylogenetic algorithms are also reflected in the level of matching topologies found in the empirical data for each tree algorithm. For example, in the case of the CORE data, phylogenies inferred using PAML match in approximately 17% of comparisons; phylogenies inferred using PROML match in 42%; and phylogenies inferred using PARS match in 57%. The contrasting results for these phylogenetic algorithms may be explained by the differences between the possible number of bifurcating and multifurcating topologies (see Table D.1) as well as the tree search heuristics used.

Figure 4.8 shows the average similarity, for comparisons made on a fixed number of orthologues, for each of the tree construction methods. The empirical traits are shown, along with the results using the network shuffle ensembles. Figure 4.8(a) uses the PAML phylogenetic trees and the similarity of phylogenetic topologies increases as they share more orthologous sequences. The similarity levels range from 0.65 to 0.80 for the empirical PIN data. Figures 4.8(b) and 4.8(c) show a different trend (for PROML and PARS trees, respectively) as the similarity of topologies decreases significantly as the number of shared orthologous sequences increases. PARS trees exhibit average similarity greater than 0.90 for comparisons between proteins which share 4 orthologous proteins in the study species.

[Figure 4.8 plots: three panels, (a) PAML, (b) PROML, (c) PARS; horizontal axis: species (4–10); vertical axis: average similarity (0.6–1.0); legend: CORE, DIP, LC, Empirical.]

Figure 4.8: Similarity of topologies for different tree algorithms. Average level of similarity for topology comparisons (when rewiring using network shuffle) made for a fixed number of shared species. The average similarity for each method is different, and the trends seen in PAML show marked difference to those methods taken from Phylip. The variance of similarities for each tree construction method increases as more species are compared.


4.3.4 Further analyses

Phylogeny analyses have also been applied to a small set of E. coli PPI data (producing the mirrortree results found in Pazos et al. (2005)). The results (in Appendix D.3) corroborate those presented earlier in this section. The E. coli set consisted of 118 proteins in addition to phylogenetic information on 47 bacterial species. The similarity between topologies found for the E. coli data, using network shuffle graphs (with no biological constraints), matched the reported results for the S. cerevisiae data.

As well as there being no obvious link between reported interactions and the similarity of phylogenetic topologies across the full set of study species, the same results hold for a smaller subset of the 10 study species. To assess whether the divergent range of study species makes a difference to the topological analyses, the 3 most divergent species (S. pombe, C. albicans and S. kluyveri) have been excluded and the same comparisons carried out.

Similarity and the proportion of matching topologies increase as species are excluded. However, leaving out these species does not change the topological results when viewed in relation to the results for the random graph ensembles on the same tree topologies. The network shuffle and empirical similarity levels are almost the same and node shuffle ensembles produce graphs with only marginally less similarity. The choice of null ensemble potentially affects the outcome of the analysis, although none of the measures produce differences as large as those seen for biological traits or the clustering coefficient measured in Chapter 3.

The [complex] ensembles use complex annotations to determine how edges or nodes can be rewired in the sampled random graphs. This constraint resulted in the closest results to empirical data for the average difference in phylogenetic profiles, scores and the number of shared orthologues. Similarity of phylogenetic topologies, or co-divergence, is assessed for PPIs and protein pairs that co-occur in complexes in order to observe differences between these two classes. Co-expression rates, which have also been linked with PPIs, are also contrasted with the similarity.

Similarity levels are on average higher for reported PPIs than the set of protein pairs that have been found in the same complex, as displayed in Table 4.2. However, for other traits (such as co-expression, phylogenetic profile differences or functional annotations) the observed correlation differs. Reported interactions that do not have matching complex annotations produce more divergent phylogenetic profiles, and lower functional similarity, than the average for any protein pair that has matching complex annotations. Finally, the complete set of protein pairs exhibits a higher level of similarity, across phylogenetic topologies, than protein pairs that have been reported in the same complex.

                       LC PPIs                                    Protein pairs
Trait                  All     Same complex   Different complex   Same complex   All
Edges                  21283   3663           3582                33373          (5109 choose 2)
Function               0.19    0.46           0.15                0.18           0.03
Co-expression          0.09    0.20           0.10                0.17           0.01
Phylogenetic profile   3.24    2.94           3.23                3.17           3.38
Similarity, Γ          0.84    0.83           0.84                0.81           0.83

Table 4.2: Complex results. Results for phylogenetic topologies (PROML trees) and the co-expression trait. These show how traits differ for sets of reported PPI and reported matching complex annotations. Co-expression trait is the average co-expression found between all protein pairs considered.

4.4 Discussion

This chapter showed that there is no significant evidence for phylogenies of interacting proteins to show higher levels of topological similarity than expected in a PIN by chance. This finding was further investigated, to address potential reasons for such a conclusion, by: (i) employing different phylogenetic inference approaches; (ii) using a range of different PIN data sets; (iii) investigating the role of protein abundance as a potentially confounding variable; and (iv) investigating the diversity of phylogenetic trees of proteins forming complexes.

The objective was to determine whether protein phylogenetic tree topologies are more similar among pairs of interacting proteins than among pairs of proteins for which no interactions have been reported, accounting for the PIN structure. Empirical graph data have been assessed against a variety of graph ensembles. The two main ensembles which take account of topological structure (network shuffle and node shuffle) show contrasting results regarding the similarity of tree topologies. Node shuffle results suggest a marginally higher level of both topological matches, and of the similarity of empirical data. In contrast, network shuffle graph ensembles produced similarity which is not significantly different from the empirical data’s similarity, highlighting the importance of choosing an appropriate graph ensemble when probing traits of biological networks.



What does emerge from contrasting network shuffle and node shuffle ensembles is the role of hub proteins. This is particularly apparent from the wider variation in the node shuffle ensemble results. This variation is primarily due to changes in the phylogenetic profile of the highly connected proteins. In network shuffle the topology-degree relationship remains fixed, and because degree-degree correlations are low, less variability is observed in the probability of matches.

The tree phylogeny methods used have made use of both likelihood and parsimony approaches. Although there are some differences between the methods, as seen in Figure 4.8, the topological results are consistent across the methods. However, the tree edit distance used, whilst easy to compute, may not be the best means of comparing protein phylogenetic trees. For the likelihood approaches, PROML and PAML, an alternative analysis could be completed that compares the full likelihoods across the possible tree topologies. This would generate an alternative measure of the similarity between the trees, and perhaps be a better indicator of similarity (although the need for a fair comparison between trees on different numbers of lineages may make this a difficult procedure). In practice, the use of heuristics to find the best tree makes the application of such likelihood comparisons cumbersome, whilst the tree edit distance used works across all phylogeny approaches and is easy to implement.

Although there is little evidence for the proportion of matching phylogenetic tree topologies to be significantly enriched in the empirical graph data when compared to structurally similar graphs, the topologies show notable results when interacting and non-interacting proteins that occur in the same yeast complexes are compared (in Table 4.2). The complex protein pairs show a higher correlation to mRNA co-expression levels than for random protein pairs or reported interactions that are not in the same complex. However, all reported interactions exhibit a higher propensity to share topologies than protein pairs that co-occur in a complex but have not been reported to interact.

Moreover, the mRNA co-expression data show a higher level of co-expression for proteins that appear in the same complex, above those found for reported PPIs. The use of co-expression data for PPI classification should actively take account of the potential confounding factor of complex membership. Topological similarity, however, appears to deliver the same results on average for protein interactions found within complexes or those that occur between complexes. The observed similarity is greater than that found for non-interacting protein pairs that are found in the same complex. This suggests that non-interacting complex partners are more divergent in topology than a random pair of proteins, which may be surprising given an assumption that proteins with similar functional roles have a higher propensity to co-evolve.

These results concerning the topology of interacting proteins do not, however, necessarily contradict previous work regarding the co-evolution of interacting proteins (Goh et al., 2000; Goh and Cohen, 2002; Ramani and Marcotte, 2003; Pazos et al., 2005). Measures of the evolutionary rate or functional similarity are not accounted for in this analysis and could be linked with interactions; in yeast (and also in Caenorhabditis elegans), however, there is evidence that such a correlation among the evolutionary rates on interacting proteins is at best weak (Agrafioti et al., 2005). Several sets of authors have also shown that it is in fact the expression level of a gene (or a measure that may act as a proxy for gene expression level, such as the codon-adaptation index (Sharp and Li, 1987)) which explains most of the variation in protein evolutionary rate (Jordan et al., 2003; Agrafioti et al., 2005; Drummond et al., 2006; Hakes et al., 2007a) and not properties related to the topology of the interaction network. This also appears to be independent of noise in, and incompleteness of, the PIN data (de Silva et al., 2006).

This chapter has highlighted the conceptual difference between predicting individual interactions, and predicting the whole interactome of an organism. Correlations among pairs of proteins may be used to detect some interaction, or perhaps complex, partners of a protein. For any given protein it will frequently be found that some of its interaction partners have similar properties. Whilst the set of protein pairs with very similar properties is enriched for true interactors, not all are or have been reported as interactors. It is important to realise that although co-evolution has been shown to be important across key functional proteins, evidence for any evolutionary properties (correlated evolution, co-evolution, or our co-divergence measure) may be absent or weak when the whole interactome is being considered. Overall, the observed level of phylogenetic similarity is not higher for the empirical PINs than that expected for a random network with the same degree sequence.


Chapter 5

Measuring the interactome

This chapter describes a model for finding the interactome size or the false-discovery rate for interaction data (Section 5.2). The model estimates the size and false-discovery rate using the number of repeated interactions, and provides suggestions as to how repeated reports can be used to reduce noise (Section 5.3).


5.1 Introduction

Knowledge of the interactome size provides a view about the biological complexity of an organism (Copley, 2008). Proteomes may have similar orders whilst exhibiting different interactome sizes (Stumpf et al., 2008). Interactome size determination also allows an appreciation of how close the reported data are to a full picture of the underlying true interaction network.

Recent publications have assessed the quality of the available protein interaction data for S. cerevisiae (Chiang et al., 2007; Scholtens et al., 2008). In parallel, as discussed in Section 1.6.3 on page 56, studies have used graph theoretic methods to find the interactome size for a variety of species (Stumpf et al., 2008). The estimates are generally based on small collections of HTP studies (Grigoriev, 2003). Reference sets (such as MIPS or SSE data) are often used in parallel to estimate the error rates in HTP data (D'haeseleer and Church, 2004).

In this chapter a model is presented which estimates interactome size and FDR from reported interaction data. The model uses all the data that have been reported, rather than the output from a small number of studies. Multiply reported interactions are used to obtain estimates for interactome size and FDR.

The model presented here assumes that interaction data are sampled independently so that the reporting of PPIs can be viewed as a coupon collecting problem or multiple capture-recapture approach (Shokouhi et al., 2006). A coupon, treated as an individual protein interaction, is drawn from an urn of fixed size. The observed reported interactions are drawn from either an urn containing true interactions, or a second urn containing false interactions.

Modern global mappings of protein interactions (using HTP methods) attempt to survey as many protein pairs as possible to find protein interactions and produce over two thirds of the data. Accordingly, HTP data can be reasonably equated to independent sampling from the set of true, or false, interactions. SSEs, however, make up the majority of the experiments and produce a significant minority of reported interactions. SSEs have been viewed as more reliable but are difficult to summarise from a sampling point of view. For the sampling of proteins, an independent sampling approach has been shown as the mean-field approximation to non-independent and non-random sampling (Stumpf et al., 2008, Supporting Information). In this case, the same argument is used to justify assuming the independent sampling of interactions for the PPI data.

The FDR is used to observe the influence of false-positive interactions, which have been found to be inherent in experimental studies (von Mering et al., 2002). The proposed coupon collecting model is analysed and compared to an alternative model which models the effect of drawing multiple coupons simultaneously. Finally, the use of validation information is proposed and analysed as a means of separating reported PPI data into true interactions and false interactions. The chapter aims to understand whether the observed error of HTP data is an insurmountable barrier to its use in elucidating the complete interactome in S. cerevisiae. The use of validations, and the coupon model proposed, are tested to assess how the current methods could be utilised to elucidate the full interactome.

5.2 Methods

This section builds on methods developed in the field of systems biology over the last decade. Studies have focused on the size of the complete interactome with a hope that this will aid the assessment of whether reported interactions are true or false positives (Salwinski and Eisenberg, 2003). Section 1.6 on page 52 introduced error rate notation for PPIs, and methods used to find the interactome size were discussed in Section 1.6.3 on page 56. The use of validated information from two HTP datasets features prominently in the discovery of interactome size (Grigoriev, 2003).

Validation data are used to estimate the number of distinct interactions in the S. cerevisiae interactome. The coupon model described here also provides, for a given FDR and interactome size, an assessment of the number of reported interactions required to separate PPI data into interacting and non-interacting sets.

5.2.1 Data

The data required to model the FDR and interactome size are taken from the BioGRID interaction dataset for S. cerevisiae. The information required for the coupon collecting model consists of: the number of different protein pairs observed ($m_{obs}$); the number of interactions reported ($s_{obs}$); the number of distinct interactions reported ($i_{obs}$); and a list of experiment sizes ($r^{obs}_1, r^{obs}_2, \ldots, r^{obs}_q$) as defined in Section 1.6.2.

Table 5.1 shows data used in Section 5.3, although the experiment sizes ($r^{obs}_k$) are not shown. The number of different protein pairs observed, $m_{obs}$, is defined using the number of distinct proteins, $n$, in the data according to $m_{obs} = \binom{n}{2}$. This is used as an estimate for the complete dataset, rather than a proposal of which protein pairs have been assessed in each experiment. In order to assess the size of the complete interactome this figure provides an estimate of the observed coverage of the methods from which the interactions have been sampled.

| Dataset (size, $r_k$) | Experiments | All, $s_{obs}$ | Distinct, $i_{obs}$ | Proteins, $n$ |
|---|---|---|---|---|
| All PPI | 4,167 | 59,956 | 41,313 | 4,967 |
| ≥ 5 | 932 | 54,320 | 39,222 | 4,856 |
| < 5 | 3,235 | 5,636 | 4,768 | 2,203 |
| ≥ 10 | 398 | 50,884 | 37,761 | 4,817 |
| < 10 | 3,769 | 9,072 | 6,914 | 2,550 |
| ≥ 100 | 32 | 42,573 | 33,779 | 4,719 |
| < 100 | 4,135 | 17,383 | 12,452 | 3,216 |
| ≥ 1000 | 7 | 35,596 | 28,710 | 4,239 |
| < 1000 | 4,160 | 24,360 | 18,119 | 4,011 |
| All genetic | 4,426 | 44,275 | 38,071 | 3,793 |

Table 5.1: Interaction datasets. The different data subsets, of physical PPIs from BioGRID, used to find FDR, κ, and interactome size. The columns Experiments, All, Distinct and Proteins are the model parameters. The protein data exclude proteins that have only been reported as self-interacting. Both subsets (≥ and <) are shown as the number of distinct items, $i_{obs}$, cannot be inferred from a single set and knowledge of the complete set (as they are, in general, non-disjoint).

As discussed in Chapter 2, the majority of the PPIs are found in experiments reporting more than 1,000 interactions. Over a quarter of the reported interactions are validations of interactions found in other experiments. Genetic interactions consist of a smaller number of reported interactions, although over a larger number of studies, and have a smaller proportion of validations and distinct proteins than found in the physical data.

A scaling factor, ρ, is defined in order to find the interactome size from the observable interactome size, $m$, found by the coupon model introduced in this section. This is necessary as each experimental technique may only be able to probe a subset of the complete set of protein pairs, and there is no evidence regarding possible interactions between other unobserved proteins. Assuming a uniform distribution of true interactions across the proteome, the scaling factor to find the complete interactome size, as used in Stumpf et al. (2008), is:

\[
\rho = \frac{\binom{n_\Omega}{2}}{\binom{n}{2}} = \frac{n_\Omega (n_\Omega - 1)}{n (n - 1)}. \tag{5.1}
\]

This estimator is used to find the total number of interactions from those found by sampling only $n$ proteins (node sampling is further described in Appendix E). This provides an unbiased estimate of the complete network size assuming that the interactions have been sampled uniformly (Stumpf et al., 2008). Uniform sampling of the possible interactions is also a necessary assumption of the coupon model later introduced. Equation 5.1 assumes that $\binom{n}{2}$ protein pairs are observed, or tested, to produce reported data. The total number of possible proteins found in S. cerevisiae, $n_\Omega$, is here defined to be 5,800 (Hirschman et al., 2006). The number of proteins in a dataset, $n$, is used to find ρ.
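As a quick numerical check of Equation 5.1, the following minimal sketch computes the scaling factor for the complete physical dataset. The function name `scaling_factor` and its argument names are illustrative and not taken from the thesis; the inputs are the proteome size of 5,800 quoted above and the 4,967 proteins in the "All PPI" row of Table 5.1.

```python
def scaling_factor(n_total: int, n_dataset: int) -> float:
    """Ratio of possible protein pairs in the proteome to pairs among the observed proteins."""
    if n_dataset < 2 or n_total < n_dataset:
        raise ValueError("expected 2 <= n_dataset <= n_total")
    # The factor 1/2 in each binomial coefficient cancels.
    return (n_total * (n_total - 1)) / (n_dataset * (n_dataset - 1))


rho = scaling_factor(n_total=5800, n_dataset=4967)
print(f"rho = {rho:.2f}")
```

With these inputs the factor is close to the value of approximately 1.36 used in Section 5.3.1.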

5.2.2 Coupon collecting

A model is proposed to describe the sampling of true and false interactions from the S. cerevisiae PIN. The aim is to estimate the overall population size of true interactions by using knowledge of the total number of observed interactions and the number of times repeated interactions are observed.

The model can be considered to be a capture-recapture approach (Bunge and Fitzpatrick, 1993; Chao, 2001). These approaches have commonly been employed in the literature in order to find a population's size or to elucidate its class structure. The overlap found between two samples (i.e. the number of items recaptured) is used to estimate the complete population's size, as also used for interactome analyses (see Section 1.6.3).

Multiple capture-recapture (Shokouhi et al., 2006) is an extension of this approach to account for any number of samples. This has been used to estimate the size of different populations by observing the overlap between different samples (Xu et al., 2007). This method has also been generalised to use non-uniform sample sizes (Thomas, 2008).

The population size estimator introduced in this section is equivalent to the homogeneous capture-recapture estimator (Shokouhi et al., 2006) when taking samples (of size 1) with replacement from a finite population. Initially, the population considered consists only of true interactions (no false interactions are considered). It is also a natural extension of the overlap methodologies which have focused on individual PPI experiments. Sections 5.2.3 and 5.2.4 describe extensions to the simple estimator. First, the capture-recapture model is adapted in order to account for the presence of false as well as true interactions. The model is then further modified so that non-uniform sample sizes may be considered.

Suppose that interactions, or coupons, are sampled (where each sample reports one interaction) with replacement from an urn containing $m$ different interactions. Having sampled $i$ distinct interactions, the probability that the next sampled interaction is novel is,

\[
P(\text{novel interaction sampled} \mid i \text{ distinct interactions}) = \frac{m - i}{m}. \tag{5.2}
\]

The number of samples to find a novel interaction, given that $i \geq 0$ have already been collected, is geometrically distributed with success parameter $\theta = \frac{m-i}{m}$. For this geometrically distributed variable, the expected number of samples required to find a novel interaction is $E(\text{samples to find novel interaction}) = \frac{1}{\theta} = \frac{m}{m-i}$. Thus, using the linearity of expectations, the expected number of samples, $S$, to collect $i$ distinct interactions is:

\[
\begin{aligned}
E(S, \text{to find } i \text{ distinct interactions}) &= \sum_{k=0}^{i-1} E(\text{novel sample} \mid k \text{ distinct}) \\
&= 1 + \frac{m}{m-1} + \ldots + \frac{m}{m-i+1} \\
&= m \sum_{k=0}^{i-1} \frac{1}{m-k}.
\end{aligned} \tag{5.3}
\]

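The variance of the number of samples, $S$, to collect $i$ distinct interactions, can also be found from the sum of variances of independent geometric random variables,

\[
\begin{aligned}
V(S, \text{to find } i \text{ distinct interactions}) &= \sum_{k=0}^{i-1} V(\text{novel sample} \mid k \text{ distinct}) \\
&= \sum_{k=0}^{i-1} \left(\frac{m}{m-k}\right)^2 \left(1 - \frac{m-k}{m}\right) \\
&< \sum_{k=0}^{i-1} \left(\frac{m}{m-k}\right)^2 \\
&< \frac{m^2 \pi^2}{6}.
\end{aligned} \tag{5.4}
\]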

The coefficient of variation ($CV$) for the distribution of samples necessary to find $i$ distinct interactions is,

\[
CV = \frac{\sqrt{V(S, i)}}{E(S, i)} < \frac{\pi}{\sqrt{6}} \, \frac{1}{\sum_{k=0}^{i-1} \frac{1}{m-k}}. \tag{5.5}
\]

For the coupon distribution the $CV$ decreases as $m$ increases, and is below 1 for all parameters of interest, informing about the reliability of the model presented in Section 5.2.3.

When $m$ is known then Equation 5.3 can be used to estimate the number of samples necessary to have found all of the distinct interactions. Alternatively, given the number of distinct interactions found, $i$, and a given $m$, the expected number of interactions sampled, $S$, can be compared to observed data.
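The quantities in Equations 5.3 to 5.5 are straightforward to evaluate numerically. The sketch below is an illustration only: the function names are hypothetical, and it simply sums the geometric means and variances term by term for a with-replacement urn of size `m`.

```python
import math


def expected_samples(m: int, i: int) -> float:
    """Expected draws (with replacement) needed to see i distinct items out of m (Equation 5.3)."""
    if not 0 <= i <= m:
        raise ValueError("expected 0 <= i <= m")
    return m * sum(1.0 / (m - k) for k in range(i))


def variance_samples(m: int, i: int) -> float:
    """Variance of that number of draws, as a sum of independent geometric variances (Equation 5.4)."""
    if not 0 <= i <= m:
        raise ValueError("expected 0 <= i <= m")
    total = 0.0
    for k in range(i):
        theta = (m - k) / m  # chance the next draw is novel once k distinct items are seen
        total += (1.0 - theta) / theta ** 2
    return total


def coefficient_of_variation(m: int, i: int) -> float:
    """Standard deviation of the number of draws divided by its mean (Equation 5.5)."""
    return math.sqrt(variance_samples(m, i)) / expected_samples(m, i)


m, i = 1000, 500
print(expected_samples(m, i), coefficient_of_variation(m, i))
```

For example, with two coupons the expected number of draws to see both is 3, which `expected_samples(2, 2)` reproduces.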

5.2.3 Single coupon

The single coupon model now described is a modified version of that introduced in Section 5.2.2. Suppose that PPIs are reported from a set of protein pairs, $E_{obs}$, defined as all pairs of $n$ different proteins (those protein pairs that can be observed experimentally). Let $m_{obs}$ be the size of $E_{obs}$, so $m_{obs} = \binom{n}{2}$.

The observed PPIs are either found as edges of the true interaction graph or false interaction graph (introduced in Section 1.6 on page 52). These may be considered as being found in two urns containing either: PPIs, $e_a = (v_i, v_j)$, found in $E \cap E_{obs}$; false interaction protein pairs, $e_b = (v_k, v_h)$, found in $E' \cap E_{obs}$. Each reported interaction is found in one of these urns, since $E \cup E' = E_\Omega$. Let $m$ be the size of $E \cap E_{obs}$ and $m'$ the size of $E' \cap E_{obs}$.

The proportion of reported data that are found in $E'$ is also the FDR, κ, and $s_{obs}$ is the observed number of reported interactions. Now let $S$ be the number of interactions sampled from $E$ and $S'$ the number sampled from $E'$. Then, suppose $S \in [0, s_{obs}] \cap \mathbb{Z}$ is fixed,

\[
S = (1 - \kappa) \, s_{obs}, \tag{5.6}
\]

and also trivially, $S' = \kappa s_{obs}$. Then κ can be found directly from $S$.

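The observed number of distinct interactions, $i_{obs}$, is made up of those sampled from $E$ and those from $E'$. Let $i$ be the number sampled from $E$ and $i'$ the number from $E'$. As $E \cap E' = \emptyset$, $i_{obs} = i + i'$. In summary,

\[
\begin{aligned}
s_{obs} &= S' + S, \\
i_{obs} &= i' + i, \\
m_{obs} &= m' + m.
\end{aligned} \tag{5.7}
\]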

$m$ and $i$ are required to satisfy Equations 5.7, alongside a fixed $S$ (from which κ is found) and $s_{obs}$. $S$ (and $S'$) are assumed to be the expected number of samples necessary to find $i$ (and $i'$) interactions and $E(S, i) = S$ (from Section 5.2.2). $m$ and $i$ are sought that satisfy Equations 5.7 and,

\[
\begin{aligned}
S &= m \sum_{k=0}^{i-1} \frac{1}{m-k}, \\
S' &= m' \sum_{k=0}^{i'-1} \frac{1}{m'-k}.
\end{aligned} \tag{5.8}
\]

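In order to find a solution for $m$, $i$ and $S$, solutions are sought such that the following function, $g(m, i)$, is zero,

\[
\begin{aligned}
g(m, i) &= s_{obs} - m \sum_{k=0}^{i-1} \frac{1}{m-k} - m' \sum_{k=0}^{i'-1} \frac{1}{m'-k} \\
&= s_{obs} - m \sum_{k=0}^{i-1} \frac{1}{m-k} - (m_{obs} - m) \sum_{k=0}^{i_{obs}-i-1} \frac{1}{m_{obs} - m - k}.
\end{aligned} \tag{5.9}
\]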

An approximate solution for $m$ and $i$ is found by assuming that the parameters are from a continuous function (rather than discrete as they are in truth) such that for all $m \in [0, m_{obs}]$, solutions are sought (if they exist) for $i \in [0, i_{obs}]$. The complete interactome size, $m_\Omega$, is then found for a given solution using ρ, the scaling factor introduced in Equation 5.1, and $m$ (with $n_\Omega$ the total number of proteins in the proteome and $n$ the number of proteins in the dataset):

\[
m_\Omega = \rho m = \frac{n_\Omega (n_\Omega - 1)}{n (n - 1)} \, m. \tag{5.10}
\]

Uniqueness of solution

In order to examine the possible uniqueness of $i$ such that $g(m, i) = 0$ take $m$ and $i$ both as positive reals (as performed to find a solution). The expectation found in Equation 5.3 is approximated as the following only for this section,

\[
\begin{aligned}
E(S, \text{to find } i \text{ distinct interactions}) &= m \sum_{k=0}^{i-1} \frac{1}{m-k} \\
&= m \sum_{k=1}^{i-1} \frac{1}{m-k} + 1 \\
&\approx m \int_0^{i-1} \frac{1}{m-x} \, dx + 1 \\
&= m \log\left(\frac{m}{m-i+1}\right) + 1.
\end{aligned} \tag{5.11}
\]
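To give a feel for how close the logarithmic form in Equation 5.11 is to the exact sum, the following sketch evaluates both for one illustrative pair of values. The function names and the choice m = 1000, i = 500 are assumptions made for the example and do not come from the thesis.

```python
import math


def expected_samples_exact(m: int, i: int) -> float:
    """Exact expected number of draws to find i distinct items out of m (Equation 5.3)."""
    return m * sum(1.0 / (m - k) for k in range(i))


def expected_samples_log(m: float, i: float) -> float:
    """Continuous approximation m * log(m / (m - i + 1)) + 1 (Equation 5.11)."""
    return m * math.log(m / (m - i + 1)) + 1.0


exact = expected_samples_exact(1000, 500)
approx = expected_samples_log(1000, 500)
print(f"exact = {exact:.3f}, approximation = {approx:.3f}")
print(f"relative difference = {abs(exact - approx) / exact:.5f}")
```

The approximation also accepts non-integer `i`, which is what allows $m$ and $i$ to be treated as positive reals in the uniqueness argument.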

Now to examine the uniqueness of a solution for the coupon model, Equations 5.8 are approximated using Equation 5.11 as,

\[
\begin{aligned}
S &\approx m \log\left(\frac{m}{m-i+1}\right), \\
S' &\approx m' \log\left(\frac{m'}{m'-i'+1}\right).
\end{aligned} \tag{5.12}
\]

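$g(m, i)$ now is,

\[
\begin{aligned}
g(m, i) &\approx s_{obs} - m \log\left(\frac{m}{m-i+1}\right) - m' \log\left(\frac{m'}{m'-i'+1}\right) \\
&= s_{obs} - m \log\left(\frac{m}{m-i+1}\right) - (m_{obs} - m) \log\left(\frac{m_{obs} - m}{(m_{obs} - m) - (i_{obs} - i) + 1}\right),
\end{aligned} \tag{5.13}
\]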

and the derivative of $g(m, i)$ with respect to $i$ is,

\[
\frac{\partial g(m, i)}{\partial i} \approx -\frac{m}{m-i+1} + \frac{m_{obs} - m}{(m_{obs} - m) - (i_{obs} - i) + 1}. \tag{5.14}
\]

[Figure 5.1, two panels. (a) $g(10000, i)$ for $i \in [0, i_{obs}]$. (b) $g(50000, i)$ for $i \in [0, i_{obs}]$. In both panels the x-axis is $i$ and the y-axis is $g(m, i)$.]

Figure 5.1: Single coupon function. $g(m, i)$ for the physical interactome data parameters, $s_{obs} = 59956$, $m_{obs} = \binom{4967}{2}$ and $i_{obs} = 41313$. For $m \in \{10000, 50000\}$ the function can be seen to have a single solution satisfying $g(m, i) = 0$.


This derivative is negative if

\[
m \left((m_{obs} - m) - (i_{obs} - i) + 1\right) > (m_{obs} - m)(m - i + 1),
\]

which reduces to

\[
\frac{m_{obs}}{m} > \frac{i_{obs} - 2}{i - 1}. \tag{5.15}
\]

As protein interaction graphs are assumed to be sparse (i.e. $m \ll m_{obs}$), it follows that $\frac{\partial g(m, i)}{\partial i}$ can be positive only for small $i$. Figure 5.1 shows the behaviour of $g(m, i)$ for the physical data parameters taken from Table 5.1.

Using Equation 5.13 and setting $i = 1$ for simplicity,

\[
g(m, 1) = s_{obs} - (m_{obs} - m) \log\left(\frac{m_{obs} - m}{(m_{obs} - m) - i_{obs}}\right), \tag{5.16}
\]

which is positive for all parameter sets defined in Table 5.1 and $m \ll m_{obs}$.

Further, Equation 5.14 is decreasing in $i$, so the second derivative of $g$ with respect to $i$ is negative. Therefore, as $g(m, 1)$ for considered $m$ is positive, if an $i$ exists such that $g(m, i) = 0$ then the solution is unique.
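Because the approximate $g(m, i)$ is positive at $i = 1$ and concave in $i$, a simple bisection is enough to locate its root when one exists. The sketch below is one possible way to do this and is not necessarily how the thesis results were computed: the function names, the tolerance and the choice of bisection are all assumptions, and it uses the logarithmic form of Equation 5.13 rather than the exact sums. It assumes $m_{obs} - m$ is larger than $i_{obs}$ so that both logarithms are defined.

```python
import math


def g_approx(m, i, s_obs, m_obs, i_obs):
    """Logarithmic approximation of g(m, i), following Equation 5.13."""
    m_false = m_obs - m  # size of the false-interaction urn, m'
    i_false = i_obs - i  # distinct false interactions, i'
    true_samples = m * math.log(m / (m - i + 1))
    false_samples = m_false * math.log(m_false / (m_false - i_false + 1))
    return s_obs - true_samples - false_samples


def solve_i(m, s_obs, m_obs, i_obs, tol=1e-6):
    """Bisect for the i in [1, min(i_obs, m)] with g(m, i) = 0; None if there is no sign change."""
    lo, hi = 1.0, float(min(i_obs, m))
    if g_approx(m, lo, s_obs, m_obs, i_obs) <= 0 or g_approx(m, hi, s_obs, m_obs, i_obs) >= 0:
        return None
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g_approx(m, mid, s_obs, m_obs, i_obs) > 0:
            lo = mid  # the root lies to the right of mid
        else:
            hi = mid  # the root lies at or to the left of mid
    return 0.5 * (lo + hi)


def fdr_for_size(m, s_obs, m_obs, i_obs):
    """FDR implied by an observable interactome size m, via S = (1 - kappa) * s_obs (Equation 5.6)."""
    i = solve_i(m, s_obs, m_obs, i_obs)
    if i is None:
        return None
    s_true = m * math.log(m / (m - i + 1))  # approximate S from Equation 5.12
    return 1.0 - s_true / s_obs


# Illustrative call with the "All PPI" parameters of Table 5.1 and a hypothetical m.
m_obs = 4967 * 4966 // 2
print(fdr_for_size(30000, s_obs=59956, m_obs=m_obs, i_obs=41313))
```

Repeating the call over a grid of `m` values traces out a curve of FDR against observable interactome size; multiplying `m` by ρ gives the complete interactome size.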

5.2.4 Multiple coupons

Rather than a series of independent studies reporting individual interactions, the S. cerevisiae data have been published in studies producing multiple interactions. Each study, $P_k$, contains a set of reported interactions $E_{P_k}$. A multiple coupon model assumes that interactions are drawn without replacement from the observable protein pairs, $E_{obs}$. This differs from the assumption in Section 5.2.3 where each interaction is drawn from $E_{obs}$ with replacement.

Recall that the number of true interactions, the interactome size, is $m$. Now suppose that $q$ experiments, $P_1, \ldots, P_q$, are conducted and that the number of true interactions reported in experiment $P_k$ is $r_k$. For each experiment, $P_k$, let $p_{h,j,k}$ be the probability of drawing $(j - h)$ novel true interactions, given that $h$ distinct true interactions are observed in experiments $P_1, \ldots, P_{k-1}$. The probability $p_{h,j,k}$ can be described as a transition matrix (each state referring to the number of distinct interactions sampled) where for the $k$th experiment,

\[
p_{h,j,k} =
\begin{cases}
0 & \text{if } j < h, \\[4pt]
\dfrac{\binom{m-h}{j-h} \binom{h}{r_k - j + h}}{\binom{m}{r_k}} & \text{if } j \geq h,
\end{cases} \tag{5.17}
\]

which is equivalent to,

\[
p_{h,j,k} = \frac{(m-h)! \, h! \, r_k! \, (m - r_k)!}{(j-h)! \, (m-j)! \, (r_k - j + h)! \, (j - r_k)! \, m!} \quad \text{if } j \geq h. \tag{5.18}
\]
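The transition probability in Equation 5.17 is a hypergeometric probability, so it can be evaluated directly with binomial coefficients. The following sketch (requiring Python 3.8 or later for `math.comb`) is illustrative: the function names are hypothetical, and `distinct_distribution` simply chains the transitions over a list of true experiment sizes to obtain the distribution of the number of distinct true interactions.

```python
from math import comb


def transition_prob(m: int, h: int, j: int, r_k: int) -> float:
    """Probability of moving from h to j distinct true interactions in an experiment of size r_k."""
    if j < h:
        return 0.0
    novel = j - h          # interactions not seen in earlier experiments
    repeats = r_k - novel  # interactions that were already seen
    if repeats < 0 or repeats > h or novel > m - h:
        return 0.0
    return comb(m - h, novel) * comb(h, repeats) / comb(m, r_k)


def distinct_distribution(m: int, sizes: list) -> dict:
    """Distribution of the number of distinct interactions after experiments of the given sizes."""
    dist = {0: 1.0}
    for r_k in sizes:
        updated = {}
        for h, p_h in dist.items():
            for j in range(h, min(m, h + r_k) + 1):
                p = transition_prob(m, h, j, r_k)
                if p > 0.0:
                    updated[j] = updated.get(j, 0.0) + p_h * p
        dist = updated
    return dist


dist = distinct_distribution(m=10, sizes=[3, 3])
expected_distinct = sum(j * p for j, p in dist.items())
print(expected_distinct)
```

For small urns this exact calculation can be used to check a simulation; for urns of realistic size the state space grows quickly and simulation is likely to be more practical.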

To find possible values of κ and $m$ that are consistent with the data found in Table 5.1, different values of $m$, $s$ and $i$ are simulated. Unlike the single coupon model, however, the experiments provide $s_{obs}$ samples and in each experiment the reported interactions have to be split into true ($e \in E$) and false ($e \in E'$) reported interactions. The complete experiment sizes, $r_{1,obs}, r_{2,obs}, \ldots, r_{q,obs}$, are such that,

\[
s_{obs} = \sum_{k=1}^{q} r_{k,obs}. \tag{5.19}
\]

In order to simulate this model, $\kappa \in \left\{\frac{1}{s_{obs}}, \ldots, \frac{s_{obs}-1}{s_{obs}}\right\}$ is chosen, and then the number of interactions drawn from the urns of true interactions and false interactions are uniformly, and at random, selected such that $r_1, r_2, \ldots, r_q$ are sampled from the interaction urn ($E$) and $r'_1, r'_2, \ldots, r'_q$ are sampled from the false interaction urn ($E'$). These are sampled such that $r_k + r'_k = r_{k,obs} \; \forall \, k \in [1, q]$ and $\sum_{k=1}^{q} r_k = (1 - \kappa) s_{obs}$.

For each possible κ (along with a collection of 1,000 sampled experiment sizes) and each $m \in [0, m_{obs}]$ the average number of distinct interactions, $i$, is found through simulation and forms a possible solution for $m$ and κ only if $i = i_{obs}$. The multiple coupon model is simulated in order to assess the effect of sampling from experiments of different sizes, in contrast to the simple with replacement model in Section 5.2.3. The model is also used to assess the possible interactome size, and FDR, predictions found from HTP and SSE data.
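The simulation described above can be sketched as follows. This is one possible reading of the procedure and not the thesis implementation: the function names are hypothetical, and the way true and false reports are allocated to experiments (shuffling a list of true/false labels across all reported slots) is an assumption about what selecting the split "uniformly, and at random" means. Each experiment draws without replacement from its urn, so every true experiment size must not exceed `m`.

```python
import random


def split_true_counts(sizes, kappa, rng):
    """Randomly allocate round((1 - kappa) * s_obs) true reports across the experiments."""
    s_obs = sum(sizes)
    n_true = round((1.0 - kappa) * s_obs)
    labels = [True] * n_true + [False] * (s_obs - n_true)
    rng.shuffle(labels)
    counts, start = [], 0
    for r_obs in sizes:
        counts.append(sum(labels[start:start + r_obs]))  # true reports in this experiment
        start += r_obs
    return counts


def simulate_distinct(m, m_false, sizes, kappa, rng):
    """One simulated count of distinct reported pairs (true plus false)."""
    seen_true, seen_false = set(), set()
    for r_obs, r_true in zip(sizes, split_true_counts(sizes, kappa, rng)):
        # Within an experiment, pairs are drawn without replacement from each urn.
        seen_true.update(rng.sample(range(m), r_true))
        seen_false.update(rng.sample(range(m_false), r_obs - r_true))
    return len(seen_true) + len(seen_false)


def mean_distinct(m, m_false, sizes, kappa, repeats=1000, seed=0):
    """Average number of distinct reported pairs over repeated simulations."""
    rng = random.Random(seed)
    total = sum(simulate_distinct(m, m_false, sizes, kappa, rng) for _ in range(repeats))
    return total / repeats


# Toy example with made-up experiment sizes, for illustration only.
print(mean_distinct(m=50, m_false=1000, sizes=[5, 3, 2, 10], kappa=0.25, repeats=200))
```

A pair of values for $m$ and κ would then be retained as a candidate solution when the simulated average is close to the observed number of distinct interactions.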



5.2.5 Finding true interactions

False interaction and true interaction data, using the single coupon model, can be generated for known protein pairs $m_{obs}$, interactome size $m$, and FDR κ (having found solutions κ and $m$ from data in Table 5.1). The effect on the number of times an interaction, $e$, is reported, $V(e)$, is simulated for different $s_{obs}$ values to assess how the repeats can be used to classify true interactions ($e \in E$) and false interactions ($e \in E'$).

Let $V(e)$ be the number of times an interaction, $e$, has been reported. For $e \in E$ and κ such that $s = (1 - \kappa) s_{obs}$ is large the probability that an interaction, $e \in E$, is reported $V(e)$ times is approximated as being Poisson distributed:

\[
P(V(e) = k \mid e \in E) = \binom{s}{k} \left(\frac{1}{m}\right)^k \left(1 - \frac{1}{m}\right)^{s-k} \approx \exp\left(-\frac{s}{m}\right) \frac{\left(\frac{s}{m}\right)^k}{k!}, \tag{5.20}
\]

and similarly if $e \in E'$,

\[
P(V(e) = k \mid e \in E') \approx \exp\left(-\frac{s'}{m'}\right) \frac{\left(\frac{s'}{m'}\right)^k}{k!}. \tag{5.21}
\]

For any reported interaction, $e$, $P(e \in E) = 1 - f$ and $P(e \in E') = f$. $V(e)$ is used to classify $e$ as being sampled from $E$ if:

\[
P(e \in E \mid V(e)) > P(e \in E' \mid V(e)). \tag{5.22}
\]

To find the threshold value, $k^*$, to ensure minimum error when using $V(e)$ for classification, the solution of the following is found,

\[
P(e \in E \mid V(e) = k^*) = \frac{P(V(e) = k^* \mid e \in E) \, P(e \in E)}{P(V(e) = k^*)}. \tag{5.23}
\]

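So using in turn Equations 5.20-5.23 the threshold value $k^*$ (noting $k^* \in [0, \min(s, s')]$) is the solution of,

\[
\begin{aligned}
m \exp\left(-\frac{s}{m}\right) \frac{\left(\frac{s}{m}\right)^{k^*}}{k^*!} (1 - \kappa) &= m' \exp\left(-\frac{s'}{m'}\right) \frac{\left(\frac{s'}{m'}\right)^{k^*}}{k^*!} \kappa, \\
\left(\frac{s}{m}\right)^{k^*} \left(\frac{m'}{s'}\right)^{k^*} &= \frac{\kappa}{1 - \kappa} \frac{m'}{m} \exp\left(\frac{s}{m} - \frac{s'}{m'}\right), \\
\left(\frac{s m'}{m s'}\right)^{k^*} &= \frac{\kappa m'}{(1 - \kappa) m} \exp\left(\frac{s}{m} - \frac{s'}{m'}\right), \\
\left(\frac{s (m_{obs} - m)}{m (s_{obs} - s)}\right)^{k^*} &= \frac{\kappa (m_{obs} - m)}{(1 - \kappa) m} \exp\left(\frac{s}{m} - \frac{s_{obs} - s}{m_{obs} - m}\right), \\
k^* &= \frac{\left(\frac{s}{m} - \frac{s_{obs} - s}{m_{obs} - m}\right) + \log\left(\frac{\kappa (m_{obs} - m)}{(1 - \kappa) m}\right)}{\log\left(\frac{s (m_{obs} - m)}{m (s_{obs} - s)}\right)}.
\end{aligned} \tag{5.24}
\]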

Equation 5.24 defines a $k^*$ (assuming an equal cost of misclassifying either class) for given $s_{obs}$, $m_{obs}$, $m$ and κ such that $e$ is classified as being in $E$ if,

\[
V(e) > k^*. \tag{5.25}
\]

Let $C(e \mid V(e))$ be the class predicted for interaction, $e$. Then,

\[
C(e \mid V(e)) =
\begin{cases}
E & \text{if } V(e) > k^*, \\
E' & \text{otherwise.}
\end{cases} \tag{5.26}
\]
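A threshold of this kind can be evaluated numerically. The sketch below is an illustration under stated assumptions and not the thesis code: it obtains $k^*$ by taking logarithms of the penultimate line of the derivation leading to Equation 5.24 (the line with both sides raised to the power $k^*$ or multiplied by the exponential), and the function names and the example parameter values are hypothetical.

```python
import math


def threshold_k_star(s_obs: float, m_obs: float, m: float, kappa: float) -> float:
    """Real-valued threshold k* obtained by taking logs of the balance condition."""
    s_true = (1.0 - kappa) * s_obs  # s, samples drawn from the true urn
    s_false = s_obs - s_true        # s', samples drawn from the false urn
    m_false = m_obs - m             # m', size of the false urn
    rate_true = s_true / m          # Poisson rate for a true interaction
    rate_false = s_false / m_false  # Poisson rate for a false interaction
    log_weight = math.log(kappa * m_false / ((1.0 - kappa) * m))
    log_rate_ratio = math.log(rate_true / rate_false)
    return ((rate_true - rate_false) + log_weight) / log_rate_ratio


def classify(times_reported: int, k_star: float) -> str:
    """Class predicted from the number of reports, as in Equation 5.26."""
    return "E" if times_reported > k_star else "E'"


# Hypothetical parameter values chosen only to exercise the functions.
k_star = threshold_k_star(s_obs=59956, m_obs=4967 * 4966 // 2, m=14667, kappa=0.47)
print(k_star, classify(1, k_star), classify(2, k_star))
```

The function assumes 0 < κ < 1 and that the two Poisson rates differ, otherwise the logarithms or the final division are undefined.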

In order to observe how the classifier, $C$, performs, $s_{obs}$ interactions are drawn from the single coupon model using the parameters above. $k^*$ is found for a given $s_{obs}$ using Equation 5.24 and each interaction, $e$, is classified as either a true interaction, $e \in E$, or false interaction, $e \in E'$.

Let $c_{FP}$ be the percentage of interactions misclassified as true interactions (a second false discovery rate) and $c_{TP}$ be the percentage of true interactions (which can be actually tested) correctly identified (the sensitivity). These are both equivalent to previously used notation for interactions set out in Section 1.6.2 on page 54, but they are now defined clearly to avoid confusion with the FDR, κ, used for the coupon models:

\[
\text{false discovery rate} = c_{FP} = \frac{\left|E' \cap \{e \in E_\Omega : V(e) > k\}\right|}{\left|\{e \in E_\Omega : V(e) > k\}\right|} \tag{5.27}
\]

and,

\[
\text{sensitivity} = c_{TP} = \frac{\left|E \cap \{e \in E_\Omega : V(e) > k\}\right|}{m}. \tag{5.28}
\]

The misclassification rate is computed for the observed interactome size, $m$, and FDR, κ, estimates found from the coupon model, for $s_{obs} \in [40000, 600000]$.
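As a rough analytical counterpart to the simulation, the expected values of $c_{FP}$ and $c_{TP}$ can be approximated from the Poisson forms in Equations 5.20 and 5.21. This is a sketch under assumptions that go beyond the text: it replaces simulated counts by their expectations, it treats an interaction as classified true when it is reported at least `k_int` times, and the function names and example parameters are hypothetical.

```python
import math


def poisson_tail(rate: float, k_min: int) -> float:
    """P(X >= k_min) for X ~ Poisson(rate)."""
    below = sum(math.exp(-rate) * rate ** j / math.factorial(j) for j in range(k_min))
    return 1.0 - below


def expected_rates(s_obs: float, m_obs: float, m: float, kappa: float, k_int: int):
    """Approximate (c_FP, c_TP) when interactions reported at least k_int times are kept."""
    s_true = (1.0 - kappa) * s_obs
    m_false = m_obs - m
    true_kept = m * poisson_tail(s_true / m, k_int)
    false_kept = m_false * poisson_tail((s_obs - s_true) / m_false, k_int)
    c_fp = false_kept / (true_kept + false_kept)  # share of kept interactions that are false
    c_tp = true_kept / m                          # share of true interactions that are kept
    return c_fp, c_tp


# Hypothetical parameter values chosen for illustration.
c_fp, c_tp = expected_rates(s_obs=59956, m_obs=4967 * 4966 // 2, m=29334, kappa=0.32, k_int=2)
print(f"c_FP = {c_fp:.4f}, c_TP = {c_tp:.2f}")
```

Increasing `s_obs` while holding the other parameters fixed shows how additional reports raise the sensitivity, which is the question the simulations address.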

5.3 Results

Section 5.3.1 presents the estimated S. cerevisiae interactome size along with an investigation into the interplay between FDR, interactome size, and the proportion of distinct interactions reported. Section 5.3.2 contrasts the multiple coupons model predictions with those of the simple single coupon model. These results show how the predicted error rates change, for a given interactome size, when the model takes account of experiment size. Section 5.3.3 makes use of the range of FDR, κ, and interactome size estimates, $\rho m$, found in Section 5.3.1 to see how the number of reported interactions, $s_{obs}$, changes the reliability of true interaction classification using the number of validations for each interaction.

5.3.1 Interactome size

The BioGRID physical interaction data, found in Table 5.1, were used to find the results for FDR and interactome size shown in Figure 5.2. The scaling factor, ρ, is approximately 1.36. Figure 5.2(a) shows the relationship between FDR, κ, and interactome size, $\rho m$. This shows that the FDR, for the complete data, could be between 0 and 0.6, whilst the interactome has fewer than 100,000 interactions. Using interactome size estimates guided by the literature of 20,000-40,000 interactions (from Table 1.4 on page 61) produces an estimated FDR across the complete data of 0.32-0.47. Similarly, using FDR estimates from the literature (which have predicted an FDR of larger than 0.2 in general) suggests that the interactome size has fewer than 60,000 interactions.

Figure 5.2(b) shows the proportion of the interactome, $\frac{i}{\rho m}$, that has been reported for the range of FDR estimates. Somewhere between 40% and 80% of the complete true interactome has been found depending on the FDR. A higher FDR, due to its associated lower interactome size (in Figure 5.2(a)), means that a higher proportion of the interactome has been reported. There are fewer unseen true interactions if there is more noise (interactions sampled from false interaction urn, $E'$), a result consistent with how validation information is used to find the FDR and interactome size in the coupon models.

[Figure 5.2, two panels. (a) Size, $m$, and FDR, κ: x-axis, interactome size; y-axis, false positive rate. (b) Proportion known, $\frac{i}{\rho m}$, and FDR, κ: x-axis, proportion of interactome found; y-axis, false positive rate.]

Figure 5.2: S. cerevisiae physical interactome size. The results found for S. cerevisiae using the single coupon model are shown in the two plots. Figure 5.2(a) displays the estimated FDR and size. Figure 5.2(b) shows how the FDR relates to the proportion of the complete interactome that has been reported.

Figure 5.3 shows the interactome size predictions for genetic interaction data (from Table 5.1). Although a similar number of distinct interactions are available for genetic and physical interactions, the proportion of known interactions is substantially lower. Figure 5.3 shows that the single coupon model estimates that less than 40% of the genetic interactions have been reported and that the FDR is less than 0.8 for any interactome size.

Genetic interactome size estimates, using the same range of plausible FDRs (0.32-0.47) suggested by published PPI interactome sizes, are 80,000-150,000. However, these FDR estimates have been found using different experimental methods which may not be applicable to the genetic data. The genetic interactome size estimates suggest that this interactome is much larger than the physical interactome (if the same FDR is assumed for each dataset). However, this estimate has been made by a model that assumes only the occurrence of $n$ proteins, rather than modelling the binding of proteins to DNA. This may act to underestimate the genetic interactome size as the number of possible interactions is doubled. Nevertheless, the results suggest that there are at least several times more genetic than protein-protein interactions in S. cerevisiae.

[Figure 5.3, two panels. (a) Size, $m$, and FDR, κ: x-axis, interactome size; y-axis, false positive rate. (b) Proportion known, $\frac{i}{\rho m}$, and FDR, κ: x-axis, proportion of interactome found; y-axis, false positive rate.]

Figure 5.3: S. cerevisiae genetic interactome size. Plots show the results found using the single coupon model, for the genetic interaction data found in BioGRID. Figure 5.3(a) displays the estimated FDR and size. Figure 5.3(b) shows how the FDR relates to the proportion of the complete interactome that has been reported.

5.3.2 Experiment size

Each type of dataset may have a different FDR (some published estimates for HTP studies are shown in Table 1.3 on page 61). To compare the noise found in HTP and SSE data, the complete interactome data are split by experiment, $P_k$, according to its experiment size, $r_{k,obs}$. The estimated FDR, κ, for a given interactome size is then compared between $SSE_\varepsilon = \{P_k : r_{k,obs} < \varepsilon\}$ and $HTP_\varepsilon = \{P_k : r_{k,obs} \geq \varepsilon\}$ data for $\varepsilon \in \{5, 10, 100, 1000\}$. Table 5.1 details the data used for parameter values in the single coupon model.

Figure 5.4 shows the FDR, κ, and interactome size, $\rho m$, results for these SSE and HTP datasets. The figures show, for a fixed interactome size, the variation in FDR between reported interaction sets.

In general, for a fixed interactome size, SSE experiments have a lower FDR than the HTP data (when defined using the same ε). HTP data produce a wider range of possible interactome sizes. For $HTP_{1000}$ (containing only 7 studies reporting more than 1,000 interactions) the possible maximal interactome size is 150,000, assuming that FDR is negligible.

[Figure 5.4, two panels. (a) $HTP_\varepsilon = \{P_k : r_{k,obs} \geq \varepsilon\}$, with curves for All, 5+, 10+, 100+ and 1000+. (b) $SSE_\varepsilon = \{P_k : r_{k,obs} < \varepsilon\}$, with curves for All, 5−, 10−, 100− and 1000−. In both panels the x-axis is interactome size and the y-axis is false positive rate.]

Figure 5.4: Experiment and interactome size. Plots show the interactome size, $\rho m$, and FDR, κ, estimates found for $SSE_\varepsilon$ and $HTP_\varepsilon$ data using the single coupon model. The $HTP_\varepsilon$ data are less reliable than the complete data whilst the $SSE_\varepsilon$ data are more reliable, although there is a general lack of validations in the smallest datasets ($SSE_5$) which may suggest either a higher FDR or poor modelling performance.

If only the smallest interaction experiments are considered, $SSE_5$, the single coupon model estimates a higher FDR than is found for the complete data. This suggests that the $SSE_5$ set is less reliable than the complete interaction set. However, it is more probable that for $SSE_5$ the scaling factor used or the model's dependence on multiple publications detailing exactly the same results are not as realistic as for the other datasets.

The single coupon model ignores the effect of experiment size. This will have a more profound effect on the HTP results, as the single coupon model provides a better description of data produced by smaller experiments. Section 5.2.4 described a multiple coupon model that takes explicit account of experiment sizes, $r_{1,obs}, r_{2,obs}, \ldots, r_{q,obs}$, (at the expense of simplicity) which is now used to estimate the size and FDR for the same datasets.

Figure 5.5 shows the difference between the predicted FDR and size values for the multiple and single coupon models. The figures show the minimal changes on the solutions when only smaller experiments are considered (in this example $SSE_{100}$), whilst the effect of using the $HTP_{100}$ data is more pronounced.

[Figure 5.5, two panels. (a) $SSE_{100} = \{P_k : r_{k,obs} < 100\}$. (b) $HTP_{100} = \{P_k : r_{k,obs} \geq 100\}$. Each panel has curves for single sample and multiple samples; the x-axis is interactome size and the y-axis is false positive rate.]

Figure 5.5: Single or multiple coupons. Plots show the differences between the interactome size, $\rho m$, and FDR, κ, estimates found using the single and multiple coupon models for: (a) $SSE_{100}$ and (b) $HTP_{100}$ data. The single coupon results are shown in red and the multiple coupon results are shown in blue. The effect on SSE data is small between the models, whilst there is a larger difference to the predictions made when considering the HTP experiments.

Figure 5.6 shows $HTP_{100}$ and $SSE_{100}$ results from the multiple coupon model, along with the results for the full physical data shown in black. The complete data results only show minor differences to the estimated FDR and sizes found for the single coupon model (shown in Figure 5.2). The maximal interactome size is about 50% larger for the $HTP_{100}$ set than found for the $SSE_{100}$. For interactome size estimates from recent publications of 20,000-40,000 the FDR estimates for each dataset are: 0.31-0.46 (all); 0.38-0.54 ($HTP_{100}$); and 0.24-0.42 ($SSE_{100}$). The lower FDR estimates relate to a higher estimated interactome size.

[Figure 5.6: false positive rate (vertical axis, 0.0 to 0.8) plotted against interactome size (horizontal axis, 0 to 150,000), with curves labelled All, 100− and 100+.]

Figure 5.6: Multiple coupon interactome size results. Plot shows the FDR and size results for the multiple coupon model. The results are shown for three datasets: complete physical data; SSE100; and HTP100. This shows the differences between the data, with the maximal size, ρm, being over 50% larger for the HTP100 data than for the SSE100 data.

5.3.3 Classification

Assuming that interactions are sampled by the single coupon model, a threshold can be found to classify a reported interaction, e, as being from E or E′ according to the number of times reported, V(e) (as set out in Section 5.2.5). In order to assess how the number of interactions, s_obs, influences the misclassification rates we use estimates for m_obs, m and κ from Section 5.3.1.

Two different sets of parameters for [ρm; κ] are used, taken from earlier results for the single coupon model: [20,000; 0.47] and [40,000; 0.32]. The scaling factor, ρ, is 1.36 as found for the complete BioGRID data. This is used to define the observable space of true interactions, m, along with m_obs. Equation 5.24 defines a k∗ for given s_obs, m_obs, m and κ such that e is classified in E or E′. From k∗, a k_int is found such that e ∈ E if,

\[
V(e) \geq k_{\mathrm{int}}. \tag{5.29}
\]

Tables 5.2-5.3 show how the misclassification rates change as the number of reported interactions, s_obs, changes. Using the currently available data, the model estimates that between 64% and 41% of the true interactions can be found using repeated interaction data. At the same time, the misclassification rate for this set is lower than 1%. If only the 7 experiments which report more than 1,000 interactions were repeated then the use of repeated information would be able to find approximately 88% of the observable true interactions. This assumes that the repeated experiments would yield a similar number of reported interactions.

s_obs     k_int   c_FP    [c_FP^0.05, c_FP^0.95]   c_TP    [c_TP^0.05, c_TP^0.95]
40,000    2       0.003   [0.002, 0.004]           0.424   [0.417, 0.431]
60,000    2       0.004   [0.003, 0.005]           0.638   [0.631, 0.644]
80,000    2       0.006   [0.004, 0.007]           0.784   [0.778, 0.789]
100,000   2       0.008   [0.006, 0.009]           0.876   [0.871, 0.880]
200,000   3       0       [0, 0]                   0.975   [0.973, 0.977]
300,000   3       0       [0, 0]                   0.999   [0.998, 0.999]
400,000   4       0       [0, 0]                   1       [0.999, 1]
500,000   4       0       [0, 0]                   1       [1, 1]
600,000   4       0       [0, 0]                   1       [1, 1]

Table 5.2: Classification performance for ρm = 20,000. Simulated classification performance of C(e | V(e)) for parameters ρm = 20,000 and κ = 0.47. The c_FP and c_TP percentages are shown as s_obs increases, including the 95% sample range. k_int is the minimum number of validations, V(e), required for an interaction, e, to be classified as being sampled from E.

The results show that even if the reported interaction data has a high FDR of 0.47, replicated experiments can be used to find the stochastic error that has been assumed to be inherent in the HTP technologies. This is also without the use of any corroborating biological evidence (as used to generate the CORE graph) on those interactions treated as being sampled from the true interaction set E.


s_obs     k_int   c_FP    [c_FP^0.05, c_FP^0.95]   c_TP    [c_TP^0.05, c_TP^0.95]
40,000    2       0.001   [0, 0.002]               0.238   [0.234, 0.241]
60,000    2       0.001   [0.001, 0.002]           0.405   [0.400, 0.410]
80,000    2       0.002   [0.001, 0.002]           0.553   [0.548, 0.558]
100,000   2       0.002   [0.002, 0.003]           0.673   [0.669, 0.678]
200,000   2       0.007   [0.006, 0.007]           0.945   [0.943, 0.947]
300,000   2       0.014   [0.013, 0.015]           0.992   [0.992, 0.993]
400,000   3       0       [0, 0]                   0.995   [0.994, 0.996]
500,000   3       0       [0, 0]                   0.999   [0.999, 1]
600,000   3       0       [0, 0.001]               1       [1, 1]

Table 5.3: Classification performance for ρm = 40,000. Simulated classification performance of C(e | V(e)) for parameters ρm = 40,000 and κ = 0.32. The c_FP and c_TP percentages are shown as s_obs increases, including the 95% sample range. k_int is the minimum number of validations, V(e), required for an interaction, e, to be classified as being sampled from E.
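The c_TP columns of Tables 5.2-5.3 can be roughly sanity-checked with a back-of-the-envelope calculation. The sketch below is not the simulation used in the thesis and does not implement Equation 5.24. It assumes that the observable true set has size m = ρm/ρ, that a fraction (1 − κ) of the s_obs reports fall uniformly on that set, and that the number of times each true interaction is reported is approximately Poisson. The function names are hypothetical. Under these assumptions the expected proportion of true interactions with V(e) ≥ k_int comes out close to the tabulated c_TP values (for example, about 0.64 for s_obs = 60,000, ρm = 20,000 and κ = 0.47).

```python
import math


def poisson_tail(lam, k_int):
    """Return P(V >= k_int) for V ~ Poisson(lam)."""
    below = sum(math.exp(-lam) * lam ** v / math.factorial(v) for v in range(k_int))
    return 1.0 - below


def expected_true_positive_rate(s_obs, rho_m, kappa, k_int, rho=1.36):
    """Approximate proportion of observable true interactions seen >= k_int times.

    Assumptions (not taken from the thesis): the observable true set has size
    m = rho_m / rho, true reports are spread uniformly over it, and per-interaction
    report counts are treated as Poisson with mean (1 - kappa) * s_obs / m.
    """
    m = rho_m / rho
    lam = (1.0 - kappa) * s_obs / m
    return poisson_tail(lam, k_int)


if __name__ == "__main__":
    for s_obs in (40_000, 60_000, 80_000, 100_000):
        rate = expected_true_positive_rate(s_obs, rho_m=20_000, kappa=0.47, k_int=2)
        print(s_obs, round(rate, 3))
```

The approximation says nothing about c_FP, which depends on the size of the false interaction space E′.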

5.4 Discussion

Coupon models have been used to find the association between FDR and interactome size for an interaction dataset. The models require validated interactions to form a non-negligible proportion of the data in order to provide plausible results. However, the available S. cerevisiae interaction data form a good dataset with a large number of validations. Results suggest that the maximal FDR for the physical data is 0.6, and that given an interactome size of 20,000-40,000 the FDR is 0.32-0.47. In contrast, the genetic interactome is predicted to be several times bigger than the S. cerevisiae PIN, and less than a third of the true interactions have been reported.

The coupon models require an FDR to predict the exact interactome size. However, the model can be used to verify published estimates for either FDR or size, as they should produce plausible estimates for the other parameters. For instance, reported FDRs for HTP data (see Table 1.3 on page 61) have been published with a rate of over 0.9. The coupon model, run on only the largest datasets (HTP1000), suggests that this is not possible and the FDR is less than 0.8 even in an extreme case (where observable interactome size, m, is minimal given the sample data).

The multiple coupon model produces a similar range of possible sizes as the single coupon model, although the FDR estimates are higher in the single coupon model for a fixed interactome size. For high estimates of interactome size (which only appear possible when viewing the HTP data) the difference between the two models is larger, showing the need to take account of the experiment size when only considering HTP data. The fact that SSE data present a smaller maximal size of the interactome perhaps indicates a difference in the coverage of these experiments when compared to the HTP results. The scaling factor used (which is found using the same method for all datasets) should reflect the size of the experiment in order to compensate for the apparent lower coverage of the small experiments.

Repeated interaction information can be used to find a reference set of PPIs that have not been classified by any biological characteristics. This requires uniform experimental testing across the observable space of protein pairs, thus resulting in the uniform sampling of true and false interactions that is assumed for the urn models. Whilst this may be unrealistic, more recent HTP techniques present an opportunity for all the observable protein pairs to be tested in this manner.

Uniform sampling of interactions, which is at least correct in the mean-field approximation, has been assumed in the implementation of the coupon model. This assumption is further supported by the technical set-up of the larger experiments that contributed the majority of the PPI data. However, the role of systematic error in any of the experimental methods is consequently ignored. If the sampling is significantly skewed towards a particular subset of proteins, or particular interactions, then the overall interactome size estimates will be lower than in reality, as will the FDR estimates.

The sampling factor has also made an implicit assumption about the coverage of each of the interaction datasets. This sampling factor means that it is assumed that each protein has at least one true or false interaction. If no reported interaction data can be found for a protein there is assumed to be no evidence that it has actually been tested. As a consequence of this assumption, the interactome size estimates may be overstated. Similarly, if studies do not test all protein pairs from the protein set inferred by the interaction reports, then the size may be understated. However, without further information on negative results from these studies, it is hard to determine the coverage for all the studies used. This further signifies the need for the community to report the results of protein-protein interaction studies more fully; in particular it is important to report protein pairs with negative results.

Differences in how the protein pairs have been sampled may make the comparison of error rates between SSE and HTP studies inaccurate. If the SSE size estimates are too low, then the relative FDR will increase for a given interactome size, further reducing the difference between the FDR seen for HTP and SSE data. Overall, the FDR is found to be up to 50% larger in the biggest HTP experiments than found in the smallest SSE.

The coupon model also assumes that errors are stochastic in nature, rather than systematic. This will lead to systematic errors being wrongly assigned as true interactions, as they will almost certainly appear more often than stochastic errors. Ignoring this potential set of errors will increase the interactome size estimate, whilst reducing the FDR presented by the coupon models. In order to take account of systematic errors from different techniques, the same multiple coupon model should be reapplied to all the data from each technique. Given enough data, the amount of systematic error from each technique can then be assessed and a more reliable interactome size estimate reached.

Over half a million sampled interactions may be required to classify all the reported interactions correctly. This would be lower if the FDR could be reduced in experimental replicates. However, these data could just be found by repeating already observed HTP experiments around 10 times and reporting all the data separately so that validations can be found. Then, if the scaling factor is appropriate, validations enable complete elucidation of the (approximately 73%) PPIs that are currently observable. Then further inference methods using biological characteristics, or new experimental methods, can use this PPI reference set to fully elucidate the S. cerevisiae interactome.


Chapter 6

Conclusions

This chapter gives a summary of the work described in this thesis, the conclusions drawn, and finally a general discussion of the areas which require further work.


6.1 Summary

This thesis has presented a collection of PIN-based analyses, using the S. cerevisiae interaction data as the illustrative example.

Chapter 2 presented the characteristics of currently available protein interaction data for S. cerevisiae. Each experimental technique probes subtly different protein pairs and thus also protein interactions. This makes a comparison of techniques fraught with difficulty when assessing reliability. The S. cerevisiae PIN is relatively highly clustered, forming a graph with over 5,000 nodes.

Chapter 3 described and analysed the properties of random graph ensembles that retain network and biological characteristics of the empirical data. Ensemble averages, for various traits, were found to differ significantly depending on whether the degree distribution, graph structure, or biological characteristics were fixed in the random ensembles. The variability of trait statistics is larger when the adjacency matrix of the empirical graph is fixed when compared to ER random graphs, showing the effect of a small number of nodes with high connectivity. The use of biological and network characteristics may affect subsequent analysis regarding how biological covariates are linked with PPIs and PINs. A range of suitable graph ensembles should be tested when assessing trait associations in order to appreciate better how these traits are linked to the observed graphs.

Chapter 4 used the random ensembles introduced in Chapter 3 to assess whether phylogenetic topologies of S. cerevisiae proteins are linked to PPIs. Although they are found to be more similar than the topologies of randomly selected protein pairs, if the random graph ensemble fixes the network structure this linkage disappears. Accordingly, it is hard to distinguish whether the topologies are linked to the PPIs, or to the network structure that is found for the empirical data.

Chapter 5 described a model for estimating the interactome size, false discovery rate and proportion of true interactions that have been found. This model showed that the physical and genetic interactomes are of substantially different sizes. The current knowledge of genetic interactions is more limited than that found for physical interactions. The false discovery rates found in HTP and SSE data are closer than previously thought, although HTP data are more noisy. However, replicated sampling can be used to eliminate errors and a doubling of the currently available reported interactions would greatly increase the number of true interactions that can be found using repeated information.


6.2 Conclusions

Graph structure plays an important role in the analysis of whether biological covariates are significantly correlated with PINs. Underlying assumptions in studies regarding the interactome’s structure, i.e. an ER or scale-free random graph, have the power to change the conclusions about the significance of a variety of biological traits. The choice of null model, without further knowledge about the true interactome, has to be clearly defined and understood when used to assess the significance of traits found in the empirical data.

Some graph ensembles have been shown to generate random graphs that show very similar biological properties to empirical data (Thorne and Stumpf, 2007). In these cases, if the generation method is linked to the biology, or is used to explain the evolution of the empirical data, there is a need to support the assertion more fully by showing that the graph ensemble does produce more similar traits than would be expected by chance. Although this is hard to define, depending on the ensemble method used, the graph distances between the empirical and random graphs can be used to help with this assessment. A graph distance measure, or comparison of the ensemble with those graphs that are a similar distance away, may give more confidence in the biological relevance of the proposed ensemble, or lead to the conclusion that the ensemble is just over-fitted to the empirical data.

Reported evidence for co-evolution between PPIs (Goh et al., 2000) is not backed up by an analysis of S. cerevisiae protein phylogenetic topologies. The network structure of the reported data, and how it is used to define a null graph set, is key to the resulting level of similarity found between the phylogenetic topologies of PPIs. Whilst, on average, the similarity is higher for an observed PPI than for a random protein-pair, the similarity is not significantly higher for the PIN when compared to networks with the same degree distribution. There is a need to clearly differentiate between analyses of particular families of PPIs, all PPIs, or features of the PIN in future work. Globally, the properties of protein pairs may be too diverse and insufficiently specific or informative to reliably predict PPIs when taken in isolation.

The available literature-curated data, although clearly more reliable than large-scale studies, appear to have non-negligible error rates. HTP data, which have always been assumed to be error prone (Schwartz et al., 2009), do however have coverage properties which are better understood. Ignoring the tested set of protein pairs used in these studies has led to an exaggeration of noise found in HTP studies (Gentleman and Huber, 2007), a conclusion backed up by the analysis presented here. The release of raw results from these HTP experiments will undoubtedly aid further research and help to improve the coupon models presented here, as well as enabling better measurement of the false-positive and false-negative rates.

The observable S. cerevisiae PPI network, under the assumption of an FDR lower than 0.5, has fewer than 40,000 interactions. Of these interactions, around half have been reported in BioGRID. PPI data have a significant level of false-positive data. The discovery of some novel interactions perhaps requires new experimental methods or further replicated HTP studies, although the level of error found in HTP is not too high to exclude the possibility that the observable set of interactions can be found solely from these studies.

6.3 Further work

Graph ensemble work presented in this thesis has shown how differences in the assumed graph structure can affect biological traits. This exploratory work should be extended to include additional rewiring algorithms and biological characteristics in order to understand better correlations between PPIs, PINs, and covariates. Distances between empirical data and the random graphs enable an assessment regarding how similar the trait values would be if any constraints were removed. This distance should be further used to assess the effect of published ensembles upon the stability of traits found in the empirical data. It will also provide a better guide as to how different the random graphs are from empirical data. If some graph distance between empirical data and the constructed random graphs is not used, there is a risk that reported results, and the credence given to particular PIN generation methods, may be over-interpreted.

The diversity of experimental techniques, and differences in how PPIs are reported, make comparison of SSE and HTP data difficult and potentially misleading. There is a need for a repository of interaction data that quantifies all experimental results (where possible) and lists both the set of reported positive interactions, and false interactions, when quantitative information is unavailable. SSE and HTP data should be separated in this database to enable easier analyses and also to contribute alternative training sets for future algorithms.

Phylogenetic topology results reiterate previous findings that have suggested a lack of co-evolutionary signal found from mirrortree analysis of protein interactions in the S. cerevisiae PIN (Hakes et al., 2007a). These analyses should also be completed on PIN data from other organisms. Together this will lead to a better understanding of how interactions are retained when independent observations, in different organisms such as C. glabrata, of the homologous proteins have been shown to either interact, or not interact. Analysis of interaction data for closely related species will also provide further means of assessing possible co-evolutionary effects between interacting proteins.

Finally, the coupon model provides a means of using all the reported interaction data to assess the FDR and interactome size. The model can further be used to compare error rates of different experimental techniques as well as data found from other model organisms. It also can serve as a means of estimating the number of replicated HTP experiments required to isolate the true interactions and false interactions that can be observed for each set of conditions. The coupon model can also be further used to compare the reliability of each biological technique for a given interactome, in order to help assess the value of interaction data from any methodology.


Appendix A

Mathematical techniques

This appendix provides further information on the random graph techniques introduced in Chapter 1.

A.1 Likelihood analysis of degree distributions

Suppose a probability model defines the observed degree distribution, P(d(v_i) = k; θ), where θ are the model’s parameters. Maximum likelihood estimation can be used to estimate the parameters which best reproduce the observed degree data, D = {d(v_1), . . . , d(v_n)}. The log-likelihood is defined as:

\[
\log L(\theta) = \sum_{i=1}^{n} \log P\left(d(v_i) = k; \theta\right). \tag{A.1}
\]

In order to compare the degree distribution models, likelihoods are measured for each model. Model selection is performed by comparing the likelihoods found for each distribution after taking account of the different numbers of parameters in each model. As the models are non-nested, an Akaike information criterion (AIC) is used to choose between the models (Burnham and Anderson, 1998). AIC, the measure used to choose the best model for P(d(v_i) = k; θ), is:

\[
\mathrm{AIC} = 2\left(-\log L(\hat{\theta}) + d\right), \tag{A.2}
\]

where θ̂ is the maximum likelihood estimate of θ and d is the number of parameters found in the model.
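The comparison described by Equations A.1 and A.2 can be sketched as follows. This is a minimal illustration, not the code used in the thesis: the two candidate models (Poisson and geometric), the function names and the toy degree sequence are all assumptions made for the example. Both models have a closed-form maximum likelihood estimate, so no numerical optimisation is needed.

```python
import math


def poisson_loglik(degrees, lam):
    """Log-likelihood of the degrees under a Poisson(lam) model."""
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in degrees)


def geometric_loglik(degrees, p):
    """Log-likelihood under P(k) = (1 - p)^k * p for k = 0, 1, 2, ..."""
    return sum(k * math.log(1.0 - p) + math.log(p) for k in degrees)


def aic(loglik, n_params):
    """Equation A.2: AIC = 2 * (-log L(theta_hat) + d)."""
    return 2.0 * (-loglik + n_params)


def compare_models(degrees):
    """Return the AIC of each candidate model; the lowest AIC is preferred."""
    mean = sum(degrees) / len(degrees)  # must be positive for the logs below
    lam_hat = mean                      # Poisson maximum likelihood estimate
    p_hat = 1.0 / (1.0 + mean)          # geometric maximum likelihood estimate
    return {
        "poisson": aic(poisson_loglik(degrees, lam_hat), 1),
        "geometric": aic(geometric_loglik(degrees, p_hat), 1),
    }


if __name__ == "__main__":
    # Hypothetical heavy-tailed degree sequence: many low degrees, a few hubs.
    toy_degrees = [1] * 50 + [2] * 20 + [3] * 10 + [10] * 5 + [40] * 2
    print(compare_models(toy_degrees))
```

For a degree sequence with a few hubs, the Poisson model assigns very low probability to the largest degrees, so its AIC is the larger of the two.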

A.2 Scaling degree random graphs

Graph ensemble models for generating graphs with scale-free properties have frequently been proposed in the literature, taking a variety of guises (Aiello et al., 2000; Gkantsidis et al., 2003; Li et al., 2005; Stumpf et al., 2007). Some of these methods have aimed to reproduce the observed biological systems using pseudo-evolutionary schemas. Biological concepts such as gene duplication events, the evolutionary divergence of similar genes, or functional importance of particular genes have been used to justify concepts such as preferential attachment and duplication-divergence. These methods start with a small network (e.g. two nodes with a single edge) and define an iterative scheme for generating network edges as more nodes are added.

These generative models use two primary features to produce graphs with scale-free degree sequences. The different features used generate graphs with various levels of clustering, dependent on the parameter values (the probability of an edge or duplication of a node) (Chung et al., 2003).

(a) Preferential attachment. The first generative model is preferential attachment (PA), where added nodes are more likely to connect to highly connected nodes (Dorogovtsev et al., 2000). This provides a method for generating graphs that exhibit power-law degree distributions. However, the aim of PA may not be to replicate the actual mechanism that drove evolution of empirical data.

(b) Duplication-divergence. A second technique, duplication-divergence (DD), takes some inspiration from actual duplication events found in biological systems (Chung et al., 2003). The new node is assumed to be a duplicate of a node in the network, thus preserving all its edges. The divergence stage then randomly mutates the edges that are preserved by this duplication event, mirroring the role of duplication and specialisation of proteins that has been observed in real biological systems.


(c) Duplication-attachment. Finally, duplication-attachment (DA) methods combine the properties of PA and DD to generate random graphs. Nodes are duplicated, although there is no divergence, and the inheritance of edges is random. Preferential attachment events occur as new nodes are added to the graph.

The degree distributions typical of these growth models exhibit similar scaling properties to observed PPIs. The models generate graphs which have a small collection of hubs connected to the majority of the nodes in the graph that have a small number of neighbours. Duplication-attachment produces a degree distribution most similar to the S. cerevisiae protein interaction network, primarily as the highest degrees are lower than in the preferential-attachment simulation.

There are a number of other methods, aside from these ‘evolution’ based schemes, that create graphs with scale-free degree sequences, including: generalised random graphs; power-law random graphs (Aiello et al., 2000); and random degree-preserving rewiring (Gkantsidis et al., 2003). However, these are not discussed any further as all of these methods have been found to be asymptotically equivalent (Li et al., 2005).

A.3 Exponential random graphs

Exponential random graph models (ERGMs) have been used to create graphs with the same properties as empirical social graphs (Pattison and Wasserman, 1999; Robins et al., 2007). Saul and Filkov (2007) have recently applied this technique to biological graphs. Typically, we have access to some measurements regarding the observed graphs, including either network or biological traits, and would like to make random graphs that share these properties. This technique is similar to the graph ensembles used to study traits and stability in Chapter 3.

Assuming that ERGMs form a reliable model of the data, they can be used for a variety of reasons. Their parameters can be interpreted as conferring information on the relative importance of traits, perhaps also enabling feature selection over the available traits when attempting to model the data effectively. The ERGMs are also easily extendable, as extra traits can be added to the model, enabling a better fit to the data.

Let G be a random graph, on n nodes, with adjacency matrix A. Let a be an observed adjacency matrix of a graph. Now consider a series of traits, z_1(a), z_2(a), . . . , z_m(a). These can be node or edge specific, including structural traits such as motifs. We assume a log-linear model,

\[
\log P(A = a) \propto \theta_1 z_1(a) + \ldots + \theta_m z_m(a), \tag{A.3}
\]

or, equivalently,

\[
P(A = a) = \frac{1}{\kappa(\theta)} \exp\left(\theta_1 z_1(a) + \ldots + \theta_m z_m(a)\right)
         = \frac{1}{\kappa(\theta)} \exp\left(\theta^{\top} z(a)\right), \tag{A.4}
\]

where θ = [θ_1, . . . , θ_m]^⊤, z(a) = [z_1(a), . . . , z_m(a)]^⊤, and κ(θ) is a normalising constant which ensures that P(A = a) is a true probability distribution.

The model can be related to a logistic regression model. Let A_{i,j} be the element (i, j) of the matrix, whilst A_{i,j}^c denotes the remaining entries. Then:

\[
\begin{aligned}
P\left(A_{i,j} = 1 \mid A_{i,j}^{c}\right)
  &= \frac{P\left(A_{i,j} = 1, A_{i,j}^{c}\right)}{P\left(A_{i,j}^{c}\right)} \\
  &= \frac{P\left(A_{i,j} = 1, A_{i,j}^{c}\right)}{P\left(A_{i,j} = 1, A_{i,j}^{c}\right) + P\left(A_{i,j} = 0, A_{i,j}^{c}\right)} \\
  &= \frac{P(A = a^{+})}{P(A = a^{+}) + P(A = a^{-})},
\end{aligned}
\tag{A.5}
\]

where a^+ is the graph where a_{i,j} = 1, and a^− the graph where a_{i,j} = 0.

Then, from Equation A.4:

\[
P\left(A_{i,j} = 1 \mid A_{i,j}^{c}\right) = \frac{\exp\left(\theta^{\top} z(a^{+})\right)}{\exp\left(\theta^{\top} z(a^{+})\right) + \exp\left(\theta^{\top} z(a^{-})\right)}. \tag{A.6}
\]

Using the similar expression for P(A_{i,j} = 0 | A_{i,j}^c), we can write,

\[
\log\left(\frac{P\left(A_{i,j} = 1 \mid A_{i,j}^{c}\right)}{P\left(A_{i,j} = 0 \mid A_{i,j}^{c}\right)}\right) = \theta^{\top}\left(z(a^{+}) - z(a^{-})\right), \tag{A.7}
\]

where δ = z(a^+) − z(a^−) is a vector known as the change statistic. Note that Equation A.7 has a similar form to a logistic regression model.

A vector of change statistics is used to find the parameters. Using a logistic regressor for the parameters assumes that the training data are independent, with no interdependence amongst the nodes. Markov chain Monte Carlo maximum likelihood estimation (MCMCMLE) (Snijders, 2002) incorporates dependencies into the estimation to eliminate this problem. An ERGM has been used for PINs showing a good fit to the observed graph (Saul and Filkov, 2007).
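The change statistic and the conditional edge probability of Equations A.6 and A.7 can be illustrated with a small sketch. The choice of traits (edge count and triangle count), the function names and the toy graph are assumptions made for the example; they are not the traits fitted by Saul and Filkov (2007). The traits are recomputed from scratch for a^+ and a^−, which is simple but only practical for small graphs.

```python
import math
from itertools import combinations


def traits(adj):
    """Trait vector z(a) = [number of edges, number of triangles]."""
    n = len(adj)
    edges = sum(adj[i][j] for i, j in combinations(range(n), 2))
    triangles = sum(
        adj[i][j] * adj[j][k] * adj[i][k] for i, j, k in combinations(range(n), 3)
    )
    return [edges, triangles]


def change_statistic(adj, i, j):
    """delta = z(a+) - z(a-), toggling the (i, j) entry of a symmetric matrix."""
    plus = [row[:] for row in adj]
    minus = [row[:] for row in adj]
    plus[i][j] = plus[j][i] = 1
    minus[i][j] = minus[j][i] = 0
    return [zp - zm for zp, zm in zip(traits(plus), traits(minus))]


def conditional_edge_probability(adj, i, j, theta):
    """P(A_ij = 1 | rest of graph) from the logit form theta^T delta."""
    delta = change_statistic(adj, i, j)
    logit = sum(t * d for t, d in zip(theta, delta))
    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    # Path 0 - 1 - 2: adding the edge (0, 2) creates one edge and one triangle.
    path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
    print(change_statistic(path, 0, 2))
    print(conditional_edge_probability(path, 0, 2, theta=[-1.0, 0.5]))
```

Collecting the change statistic for every node pair, together with the observed value of each entry, gives the design matrix for the logistic regression approach mentioned above.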

A.4 Further biological random graphs

Daudin et al. (2007) proposed an adaptation to the standard ER graph model to incorporate the observation of heterogeneity among the nodes found in biological graphs which, perhaps confusingly, they denoted ERMGs.

The ERMG model assumes that each node, v, of the graph, G, is in one of Q clusters with prior probabilities of α_1, . . . , α_Q. A variable incorporates the probability that nodes from different clusters have connections. Let π_{i,j} be the probability that a node from cluster i has an edge with a node from cluster j. This additional variable allows control over the connectivity of the graph, allowing the model to generate highly clustered subgraphs, whilst still having a sparse set of edges.
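The generative side of this model can be sketched as follows. This is an illustrative sampler only, assuming undirected graphs without self-loops and a symmetric matrix π; it does not cover the fitting procedure of Daudin et al. (2007), and the function and parameter names are hypothetical.

```python
import random


def sample_ermg(n_nodes, alpha, pi, rng):
    """Sample cluster labels and edges from a mixture of ER graphs.

    alpha[q] is the prior probability of cluster q; pi[q][r] is the probability
    of an edge between a node in cluster q and a node in cluster r.
    """
    clusters = rng.choices(range(len(alpha)), weights=alpha, k=n_nodes)
    edges = []
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):
            if rng.random() < pi[clusters[i]][clusters[j]]:
                edges.append((i, j))
    return clusters, edges


if __name__ == "__main__":
    rng = random.Random(0)
    # Two dense clusters that are only sparsely connected to each other.
    alpha = [0.5, 0.5]
    pi = [[0.30, 0.01], [0.01, 0.30]]
    clusters, edges = sample_ermg(200, alpha, pi, rng)
    within = sum(clusters[i] == clusters[j] for i, j in edges)
    print(len(edges), within)
```

Setting the off-diagonal entries of π well below the diagonal ones gives dense groups inside a graph that remains sparse overall, as described above.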

To fit data, the Bayesian Information Criterion (BIC) is used with an adapted Expectation-Maximisation (EM) algorithm to determine the optimal number of clusters. Then, the degree distributions are found (within each group) to be better modelled as Poisson degree distributions, as opposed to the scale-free distributions observed over the complete graph. This also raises an additional point as subgraphs of a scale-free graph are not necessarily scale-free (Stumpf et al., 2005b), perhaps limiting the conclusions when attempting to model these systems using incomplete data.

Another random graph model assigns a distance function between different nodes and the probability of edges according to the distance between nodes. Higham et al. (2008) developed a geometric random graph model that embeds PPI data into Euclidean space, testing whether the edges occur according to some distance function. This algorithm suggested that two-dimensional Euclidean space was as effective as higher dimensional space to explain the connectivity found in empirical PINs.
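For orientation, the simplest form of geometric random graph can be generated as below. This sketch only shows the forward model (points placed uniformly in the unit cube, with an edge whenever two points are closer than a fixed radius); it is not the embedding algorithm of Higham et al. (2008), which works in the opposite direction from an observed graph. The names and the hard distance threshold are assumptions made for the example.

```python
import math
import random


def geometric_random_graph(n_nodes, radius, rng, dim=2):
    """Place nodes uniformly in the unit cube and join pairs closer than radius."""
    points = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    edges = [
        (i, j)
        for i in range(n_nodes)
        for j in range(i + 1, n_nodes)
        if math.dist(points[i], points[j]) < radius
    ]
    return points, edges


if __name__ == "__main__":
    points, edges = geometric_random_graph(500, 0.05, random.Random(0))
    print(len(edges))
```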


Appendix B

Data tables for biological traits

This appendix includes further data and analysis to complement Chapter 2.

B.1 Experimental interaction techniques

This section presents some further information on the experimental techniques used to find and infer protein interactions. These are: the different types of computational interaction prediction method; a full list of the BioGRID experimental techniques; and the number of proteins with each annotation for each GO ontology.

Method                   Interaction   Association
Bayesian networks        Domain        Physical
Classification           Domain        Physical
Domain association       Domain        Physical
Domain pair exclusion    Domain        Physical
Gene co-expression       Protein       Functional
Gene neighbour           Protein       Functional
Phylogenetic profile     Protein       Functional
Rosetta stone            Protein       Functional
Sequence co-evolution    Domain        Functional
Synthetic lethality      Protein       Functional

Table B.1: Interaction prediction methodologies. This shows tools that can be used to predict interactions or association between proteins or domains using a combination of in vivo and in silico methods.


(a) Molecular function

Annotation                                       Proteins
Structural molecule activity                     204
Protein kinase activity                          107
Transporter activity                             248
Hydrolase activity                               267
Transcription regulator activity                 232
Oxidoreductase activity                          155
Transferase activity                             288
Isomerase activity                               34
RNA binding                                      185
Phosphoprotein phosphatase activity              42
Peptidase activity                               93
DNA binding                                      154
Translation regulator activity                   39
Protein binding                                  300
Nucleotidyltransferase activity                  62
Lyase activity                                   62
Ligase activity                                  73
Motor activity                                   17
Helicase activity                                52
Enzyme regulator activity                        116
Signal transducer activity                       54

(b) Cellular component

Annotation                                       Proteins
Ribosome                                         133
Nucleus                                          1289
Nucleolus                                        182
Plasma membrane                                  157
Mitochondrion                                    563
Vacuole                                          88
Peroxisome                                       43
Cytoplasm                                        1192
Cell wall                                        47
Membrane fraction                                44
Endoplasmic reticulum                            209
Mitochondrial membrane                           109
Chromosome                                       100
Cytoskeleton                                     61
Microtubule organizing center                    44
Membrane                                         89
Bud                                              81
Cell cortex                                      49
Endomembrane system                              68
Golgi apparatus                                  87
Cytoplasmic membrane-bound vesicle               52
Site of polarized growth                         70
Extracellular region                             10

(c) Biological process

Annotation                                       Proteins
Protein biosynthesis                             219
Morphogenesis                                    14
Transcription                                    260
Transport                                        362
Organelle organization and biogenesis            121
Lipid metabolism                                 85
Meiosis                                          98
Electron transport                               5
DNA metabolism                                   264
Amino acid and derivative metabolism             98
RNA metabolism                                   291
Ribosome biogenesis and assembly                 140
Cell wall organization and biogenesis            92
Protein modification                             271
Carbohydrate metabolism                          72
Pseudohyphal growth                              39
Cellular respiration                             58
Cell budding                                     24
Vitamin metabolism                               34
Protein catabolism                               64
Cytoskeleton organization and biogenesis         99
Generation of precursor metabolites and energy   57
Nuclear organization and biogenesis              47
Vesicle-mediated transport                       188
Cell cycle                                       111
Response to stress                               147
Signal transduction                              71
Sporulation                                      43
Cell homeostasis                                 41
Conjugation                                      44
Cytokinesis                                      51
Membrane organization and biogenesis             19

Table B.2: GO slim annotation classes. These tables detail the annotation classes for the three different Gene Ontology categories. For each class, the number of S. cerevisiae proteins within that class is also given.

173


(a) Physical method

Affinity Capture-MS
Affinity Capture-Western
Two-hybrid
Co-localization
FRET
Affinity Capture-RNA
Reconstituted Complex
Protein-peptide
Co-purification
Co-fractionation
Biochemical Activity
Co-crystal Structure
Far Western
Protein-RNA

(b) Genetic method

Synthetic Lethality
Synthetic Growth Defect
Synthetic Rescue
Dosage Lethality
Dosage Growth Defect
Dosage Rescue
Phenotypic Enhancement
Phenotypic Suppression

Table B.3: BioGRID experimental methods. These tables show the experimental techniques that produce either genetic or physical protein interactions.

B.2 Further Gene Ontology annotation analysis

Figure B.1 shows the proportion of matching GO slim annotations for interactions reported between 1990 and 2007. The contribution of HTP data is evident after 2000 and is associated with a lower level of annotation similarity among the more recently reported interactions.

As HTP techniques started to produce interaction data, a larger proportion of the reported interactions have been between proteins with dissimilar annotations. This trend may be due to: (a) a higher false-positive rate; (b) a bias in the discovery of interactions in earlier literature; or (c) GO annotations whose reliability itself needs to be assessed.

The choice between these explanations would be straightforward if matching annotations were found to be necessary for certain protein-protein interactions. However, evidence for this cannot come directly from prior assumed knowledge. Indeed, overconfidence in prior knowledge would lead to ignoring the possibilities that (b) or (c) may be producing the perceived errors, rather than the higher FDR implied by (a). Evidence of higher levels of matching annotations between interacting proteins should be shown through uniform random testing of protein pairs, or alternatively guided by validations of studies that probe the same (or complete) interaction space.

[Figure B.1: matching proportions (0.0 to 1.0) for Component, Process and Function annotations, by year from 1990 to 2007.
Interactions per year, 1990 to 2007: 79, 139, 179, 289, 428, 542, 831, 1358, 1640, 1643, 3310, 3470, 9949, 3714, 7777, 15434, 19223, 26045.
Studies per year, 1990 to 2007: 52, 77, 99, 161, 231, 262, 361, 419, 520, 527, 586, 597, 598, 571, 586, 553, 519, 488.]

Figure B.1: GO slim matching annotations through time. This shows the proportions of matching GO slim annotations for reported interactions between 1990 and 2007. The numbers in (green, red) are the (interactions, studies) for each year.

Ptacek et al. (2005) published a HTP study, contributing 4,182 protein-protein interactions, that focused on protein phosphorylation. This is a regulatory mechanism for basic processes that is thought to affect up to 30% of proteins at any given time. This single study showed below-average annotation similarity, with function and process annotations matching for interactors in only 6% of those with known information, whilst the component annotations matched in 37% of cases. This biochemical activity study explains the low level of similarity for interactions reported in 2005. Biochemical analysis studies show low levels of annotation similarity for function and process across all (240) available studies (shown in Figures 2.6-2.8). The additional 239 biochemical activity studies contribute a further 1,010 (19%) novel interactions, with the Ptacek et al. (2005) data contributing the majority of the data for this technique.
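The matching proportions quoted in this section can be illustrated with a minimal sketch. It assumes, for illustration only, that two interactors "match" when they share at least one GO slim term and that interactions with an unannotated protein are excluded from the denominator; the function name, protein identifiers and terms below are hypothetical.

```python
def matching_proportion(interactions, annotations):
    """Fraction of interactions whose two proteins share a GO slim term.

    Assumption: only interactions where both proteins have at least one
    known annotation are counted in the denominator.
    """
    known = matched = 0
    for a, b in interactions:
        terms_a, terms_b = annotations.get(a), annotations.get(b)
        if not terms_a or not terms_b:
            continue  # unknown annotation: excluded from the denominator
        known += 1
        if terms_a & terms_b:
            matched += 1
    return matched / known if known else float("nan")


# Hypothetical toy data: P4 has no known terms and P5 is absent entirely.
annotations = {
    "P1": {"transport"},
    "P2": {"transport", "cell cycle"},
    "P3": {"transcription"},
    "P4": set(),
}
interactions = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P1", "P5")]
print(matching_proportion(interactions, annotations))
```

In this toy example two interactions have known annotations on both sides and one of them matches, so the proportion is 0.5.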

[Figure B.2 panels: (a) Molecular function (known function proportion); (b) Biological process (known process proportion); (c) Cellular component (known component proportion). Each panel shows one bar per experimental technique on a 0.0 to 1.0 scale, grouped into physical interactions and genetic interactions.]

Figure B.2: Known GO annotations for PPIs by method. This shows the proportion of known GO annotations found for PPIs reported by each experimental technique. Dashed lines show the average proportion across the complete genetic or physical interaction set. Bar density shows the p-value of a binomial proportion test, assessing similarity, between the technique and the genetic or physical data set. FRET exhibits a far higher proportion of known GO annotations than other experimental techniques.

Appendix C

Graph ensemble output

This appendix details results for the CORE and DIP graphs to complement the LC results presented in the main text of Chapter 3. These show the same trait information presented for the LC graph data: GO annotation traits; the complex annotation matching trait; co-expression levels; and the clustering coefficient found for each of the ensemble methods.

[Figure C.1 panels, each comparing the network shuffle, node shuffle and random graph ensembles on a 0.0 to 1.0 scale: (a) Coexpression; (b) Complex; (c) Function (proportion matching function annotations); (d) Process (proportion matching process annotations); (e) Component (proportion matching component annotations); (f) Clustering (average clustering coefficient).]

Figure C.1: Graph ensemble traits for DIP data. Trait statistic values for the graph ensembles shown as a proportion of the trait found in DIP. The effects are similar to those seen for LC in Chapter 3. Node shuffle ensembles exhibit higher variability and similar characteristics to random graph whilst network shuffle ensembles replicate the trait values seen in the empirical data more closely.

[Figure C.2 panels, each comparing the network shuffle, node shuffle and random graph ensembles on a 0.0 to 1.0 scale: (a) Coexpression; (b) Complex; (c) Function (proportion matching function annotations); (d) Process (proportion matching process annotations); (e) Component (proportion matching component annotations); (f) Clustering (average clustering coefficient).]

Figure C.2: Graph ensemble traits for CORE data. Trait statistic values for the graph ensembles shown as a proportion of the trait found in CORE. The effects are similar to those seen for LC in Chapter 3. Node shuffle ensembles exhibit higher variability and similar characteristics to random graph whilst network shuffle ensembles replicate the trait values seen in the empirical data more closely.

Appendix D

Phylogenetic topology

Further results found using the methods presented in Chapter 4 are detailed in this appendix. These include analysis of the E. coli PPIs based on the protein phylogenetic tree topologies as well as how to find the number of possible bifurcating and multifurcating trees for any number of lineages.

D.1 Phylogenetic topologies

The number of possible distinct phylogenetic topologies depends on the number of species, or sequences, from which the phylogenetic tree is generated (Felsenstein, 2003). For rooted bifurcating trees the total number of possible topologies for $n \geq 2$ species, $T_n$, is:

$$T_n = \frac{(2n-3)!}{2^{n-2}\,(n-2)!} \qquad \text{(D.1)}$$

Multifurcating rooted trees can have any degree at each internal node of the tree, so the number of different topologies on $n$ species is greater than that found for bifurcating trees. The total number of different multifurcating trees is the sum over the number of internal nodes, $m$, of:

$$T_{n,m} = (n+m-2)\,T_{n-1,m-1} + m\,T_{n-1,m}, \qquad \text{(D.2)}$$

for $m \in [1, n-1]$, with $T_{n,1} = 1$ and $T_{n,m} = 0 \;\forall\, m \geq n$.

The maximum edit distance between two phylogenetic trees on $n$ species is defined by the recursion $M_{n+1} = M_n + (n-1)$, with $M_3 = 2$. The possible topologies, and associated maximum scores, $M_n$, between distinct trees on $n$ species are shown in Table D.1.

Species   Bifurcating trees   Multifurcating trees   Maximum score
1                         1                      1               −
2                         1                      1               −
3                         3                      4               2
4                        15                     26               4
5                       105                    236               7
6                       945                  2,752              11
7                    10,395                 39,208              16
8                   135,135                660,032              22
9                 2,027,025             12,818,912              29
10               34,459,425            282,137,824              37

Table D.1: Number of topologies. The number of rooted labelled trees for n species.
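The rows of Table D.1 can be regenerated with a short sketch. The function names are illustrative rather than taken from the thesis. The sketch assumes the double-factorial form of the bifurcating count, a multifurcating recursion whose second term is weighted by m, and a maximum score that grows by n − 1 when moving from n to n + 1 species; these are the forms that reproduce the tabulated values.

```python
from math import factorial


def bifurcating(n):
    """Rooted, labelled bifurcating trees on n species: (2n-3)!! for n >= 2."""
    if n < 2:
        return 1
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def multifurcating(n):
    """Rooted, labelled trees of any branching degree, summed over m internal nodes."""
    if n < 2:
        return 1
    t = {1: 1}  # t[m] = T_{k,m}; at k = 2 only T_{2,1} = 1 is non-zero
    for k in range(3, n + 1):
        nxt = {1: 1}  # T_{k,1} = 1: the single star tree
        for m in range(2, k):
            nxt[m] = (k + m - 2) * t.get(m - 1, 0) + m * t.get(m, 0)
        t = nxt
    return sum(t.values())


def max_score(n):
    """Maximum score M_n for n >= 3, from M_3 = 2 and M_{k+1} = M_k + (k - 1)."""
    if n < 3:
        return None
    score = 2
    for k in range(3, n):
        score += k - 1
    return score


for n in range(1, 11):
    print(n, bifurcating(n), multifurcating(n), max_score(n))
```

Each printed line corresponds to one row of the table, with `None` standing in for the undefined maximum score at n = 1 and n = 2.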

D.2 Supplementary phylogenetic results

This section details the phylogenetic tree results for the three different empirical graphs and tree construction methods. These show the same trait information presented for the LC graph data, and PROML trees, in Chapter 4.

[Figure D.1 panels, each comparing the network shuffle, node shuffle and random graph ensembles: (a) Topology matches; (b) Topology score (average score); (c) Topology similarity (average similarity); (d) Lineage comparisons (average orthologous species).]

Figure D.1: Phylogeny results for DIP (PROML trees). Boxplots show the distribution of a trait statistic for graph ensembles: matching topologies; topology score; topology similarity; average lineages for topology comparisons.

[Figure D.2 panels, each comparing the network shuffle, node shuffle and random graph ensembles: (a) Topology matches; (b) Topology score (average score); (c) Topology similarity (average similarity); (d) Lineage comparisons.]

Figure D.2: Phylogeny results for CORE (PROML trees). Boxplots show the distribution of a trait statistic for graph ensembles: matching topologies; topology score; topology similarity; average lineages for topology comparisons.

[Figure D.3 panels, each comparing the network shuffle, node shuffle and random graph ensembles: (a) Topology matches; (b) Topology score (average score); (c) Topology similarity (average similarity); (d) Lineage comparisons.]

Figure D.3: Phylogeny results for LC (PAML trees). Boxplots show the distribution of a trait statistic for graph ensembles: matching topologies; topology score; topology similarity; average lineages for topology comparisons.

[Figure D.4 panels, each comparing the network shuffle, node shuffle and random graph ensembles: (a) Topology matches; (b) Topology score (average score); (c) Topology similarity (average similarity); (d) Lineage comparisons.]

Figure D.4: Phylogeny results for LC (PARS trees). Boxplots show the distribution of a trait statistic for graph ensembles: matching topologies; topology score; topology similarity; average lineages for topology comparisons.


D.3 Escherichia coli phylogenetic trees

E. coli data (Pazos et al., 2005) produce similar results to those found for S. cerevisiae in Chapter 4. Using phylogenetic trees from 47 bacterial species, the fraction of matching topologies in the DIP network for E. coli and the graph ensemble simulations broadly confirm the results obtained for S. cerevisiae. The phylogenetic profiles are more similar among pairs of interacting proteins than among protein pairs found in graphs drawn from the network shuffle and node shuffle graph ensembles.

There is little evidence that the topology of the protein phylogeny provides any indication of an interaction in general. The protein phylogenetic topologies show more matches in the random graphs than in the empirical data, as shown in Figure D.5. Node shuffle results show the most matching topologies, as a result of a large proportion of interacting proteins sharing homologues in few species. This results in more matches as there are fewer possible topologies. However, the results are similar to those found for yeast across the data. A similar proportion of protein pairs match in empirical graphs and when two proteins' trees are randomly compared on the same number of shared homologous sequences (so the complexity of the trees is identical). The node shuffle graph ensemble tends to generate graphs with a lower propensity for phylogenetic topologies of proteins to match, although the graphs drawn from the network shuffle ensemble tend to produce more matches. This suggests that the trees which match are generally found from edges involving more highly connected nodes.


[Figure D.5: proportion of matches (0.0 to 0.5) against number of shared homologues (3 to 10); legend: Net Shuffle, Tree Shuffle.]

Figure D.5: Phylogenetic topology matches for E. coli data. This shows the output for E. coli interaction data, showing a lack of difference between the two types of graph ensemble method and the empirical graph over comparisons made on the same number of shared homologues. Network shuffle results are shown in yellow (Net Shuffle) and node shuffle results in green (Tree Shuffle).


Appendix E

Sampling schemes

The sampling scheme used in Chapter 5 assumes that each interaction is reported uniformly across all true interactions found in the complete set of protein pairs, $E_\Omega$. This edge sampling approach means that the reported interactions can be modelled as samples from an urn of interactions. However, in order to estimate the interactome size, an assessment of the proportion of $E_\Omega$ contained in either urn, the coverage of the experimental techniques, is required.

As the reported non-interactions are not known, sampling schemes are used to model the coverage of experimental techniques based on reported interactions. These sampling schemes can then be used to find estimates of the error rates described in Table 1.2. Three different schemes are introduced that may describe how experimental procedures have tested possible protein pairs: node sampling, in which proteins are selected at random and the pairs tested are the combinations of this set; edge sampling, in which reported interactions are picked at random from the true network, $G$, and, according to some rate, $f$, also from $G'$; and edge discovery, in which either fixed proteins or particular protein pairs are preferentially tested.

Node sampling. Each experiment, $P$, proposes a set of proteins, $V_P$, to probe for possible interactions. The sampled region is the complete network on these proteins, made up of $G_P \sim (V_P, E_P)$, with the reported interactions $E_P \subseteq \{(v_i, v_j) : v_i, v_j \in V_P\}$. The number of individual interactions tested is $\binom{|V_P|}{2}$ and the non-interactions are assumed to be $G_P' \sim (V_P, K_{V_P} \setminus E_P)$, where $K_{V_P}$ is the complete graph on the set of nodes $V_P$.

To estimate this from a reported experiment, $P$ (if no further evidence exists), the sample set is approximated by using the set of proteins that have an interaction to generate $V_P$. This will possibly be a subset of those proteins actually sampled in the study, understating the overall set of protein pairs tested.

Figure E.1: Node sampling. This shows how interactions are discovered using node sampling. The figures show: (a) complete set of proteins (200 nodes); (b) true underlying interaction network (144 edges on 200 nodes); (c) proteins sampled for each study (coloured by study); (d) interactions tested in a study (forming the coverage); (e) true interactions tested in studies; (f) true interactions not tested.

Figure E.1 shows how a set of studies, covering a large subset of the proteome, may only test a small subset of the true interactions. In this case, the proteins tested have been given, rather than inferred from the reported positive interactions.
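The node sampling scheme can be sketched as a small simulation. This is an illustration under stated assumptions, not the procedure used to draw Figure E.1: the network size (144 edges on 200 nodes) follows the figure, while the number of studies, the study size of 30 proteins, the random seed and the function name are hypothetical choices.

```python
import random
from itertools import combinations
from math import comb


def node_sampling(n_proteins, true_edges, study_sizes, rng):
    """Return (tested pairs, true interactions among them) under node sampling."""
    tested = set()
    for size in study_sizes:
        v_p = rng.sample(range(n_proteins), size)  # proteins V_P chosen by study P
        # Every pair within V_P is tested: the complete graph K_{V_P}.
        tested.update(frozenset(pair) for pair in combinations(v_p, 2))
    return tested, tested & true_edges


rng = random.Random(1)
n = 200
all_pairs = [frozenset(pair) for pair in combinations(range(n), 2)]
true_edges = set(rng.sample(all_pairs, 144))  # toy true network

tested, found = node_sampling(n, true_edges, [30, 30, 30, 30], rng)
coverage = len(tested) / comb(n, 2)
print(f"pairs tested: {len(tested)} of {comb(n, 2)} (coverage {coverage:.3f})")
print(f"true interactions tested: {len(found)} of {len(true_edges)}")
```

Even though the four hypothetical studies together touch a sizeable share of the proteins, the tested pairs are bounded by four times C(30, 2) = 435, a small part of the C(200, 2) = 19,900 possible pairs.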


Edge sampling. Unlike node sampling, this scheme supposes that interactions are sampled directly, rather than nodes being chosen as testing candidates. This could be hypothesis-driven, according to biological knowledge or other relationships. Each interaction is assumed to be independent and the discovery of any interaction is viewed as a random process that is not influenced by the set of proteins being assessed. Figure E.2 shows an example of how this would find true interactions if we assume that the FDR is zero.

Figure E.2: Edge sampling. This shows how interactions are discovered using edge sampling, assuming zero FDR. The figures show: (a) complete space of possible proteins; (b) true underlying interaction network (144 edges on 200 nodes); (c) edges discovered through randomly sampling interactions (coloured by study); (d) remaining edges not sampled.

Edge sampling is the easiest sampling scheme to use when interested solely in interaction validations and their ability to classify putative interactions, as used in Chapter 5. However, it requires extra information to determine the amount of non-interaction testing that has been completed. The set of tested edges cannot be reconstructed solely from the experiment data, $P$, and its associated interaction network, $G_P \sim (V_P, E_P)$, as no assumptions are made about the unseen protein pairs that have been tested.

One means of assessing the testing completed on non-interactions is to assume a non-zero FDR, as is apparent in the empirical data. Now the reported interaction graph is a subsample of true interactions, $G$, and non-interactions, $G'$. Given a reference set of reported interactions and non-interactions (which could be obtained using results from Chapter 5), an assessment of the coverage can be obtained. The non-interactions should be sampled uniformly across the proteome, so an assessment of the testing of each individual protein can be reconstructed from this non-interaction reference set. Bias in the testing of non-interactions, found through those that have been reported, can be used to estimate the set of tested proteins, and thus the coverage of the testing across the whole proteome.
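Edge sampling with a non-zero false-discovery rate can be sketched as draws from two urns. The sketch assumes, for illustration, that the rate f is the probability that any single reported interaction is drawn from the non-interactions G' rather than from the true interactions G; the function name, the toy network and the parameter values are hypothetical.

```python
import random
from itertools import combinations


def edge_sampling(true_edges, non_edges, n_reported, f, rng):
    """Draw distinct reported interactions from two urns.

    Each draw comes from the non-interactions G' with probability f
    and from the true interactions G otherwise.
    """
    reported = set()
    while len(reported) < n_reported:
        urn = non_edges if rng.random() < f else true_edges
        reported.add(rng.choice(urn))
    return reported


rng = random.Random(7)
pairs = list(combinations(range(200), 2))  # all protein pairs on 200 nodes
rng.shuffle(pairs)
true_edges, non_edges = pairs[:144], pairs[144:]  # toy network with 144 true edges

reported = edge_sampling(true_edges, non_edges, n_reported=60, f=0.2, rng=rng)
false_pos = reported - set(true_edges)
print(f"reported: {len(reported)}, false positives: {len(false_pos)}, "
      f"realised FDR: {len(false_pos) / len(reported):.2f}")
```

Setting f to zero recovers the zero-FDR case, in which every reported interaction is a true edge.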

Edge discovery. Figure E.3 shows a sampling scheme where each interaction reported is local to others reported. For reported interactions the proteins involved are more likely to be studied further or individual fixed proteins are tested against a large set of other proteins. This biological bias would result in increased sampling of specific proteins and perhaps their neighbours.

Figure E.3: Edge discovery. This shows how interactions are discovered using edge discovery, assuming zero FDR. This discovers the collection of interactions by moving from known interactions to local protein pairs. The figures show: (a) complete space of possible proteins; (b) true underlying interaction network (144 edges on 200 nodes); (c) edges discovered through edge discovery; (d) edges not sampled.

The reported interactions, from edge discovery, can easily generate hub proteins. This may explain some perceived false-positive proteins such as YBR111W-A (a protein involved in mRNA export coupled transcription activation) that appears in only 4 studies but has over 100 unique reported interactions. Millson et al. (2005) also searched the S. cerevisiae proteome for interactors of HSP90, finding 125 interactions. The tested set of protein pairs for these studies is the same as the interactions inferred from the protein complex spoke model. This is in contrast to the matrix interpretation presented by the node sampling scheme.
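The difference between the spoke and matrix interpretations can be made concrete by counting the protein pairs each one implies. The sketch below is illustrative only: it assumes the 125 reported HSP90 interactions correspond to 125 distinct partner proteins, and the prey identifiers and function names are hypothetical.

```python
from itertools import combinations
from math import comb


def spoke_pairs(bait, preys):
    """Pairs tested under the spoke model: the bait against each prey."""
    return {frozenset((bait, prey)) for prey in preys}


def matrix_pairs(bait, preys):
    """Pairs tested under the matrix model: every pair among bait and preys."""
    return {frozenset(pair) for pair in combinations([bait, *preys], 2)}


preys = [f"prey_{i}" for i in range(125)]  # hypothetical prey identifiers
spoke = spoke_pairs("HSP90", preys)
matrix = matrix_pairs("HSP90", preys)
print(len(spoke), len(matrix), comb(len(preys) + 1, 2))
```

Under these assumptions the spoke model implies 125 tested pairs, whereas the matrix model on the same 126 proteins implies C(126, 2) = 7,875, which is why the choice of interpretation matters for estimating coverage.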


References

Aebersold, R and Mann, M, 2003. Mass spectrometry-based proteomics. Nature 422:198–207. (page 68)

Agrafioti, I, Swire, J, Abbott, J, Huntley, D, Butcher, S, and Stumpf, MPH, 2005. Comparative analysis of the Saccharomyces cerevisiae and Caenorhabditis elegans protein interaction networks. BMC Evolutionary Biology 5:23. (pages 33, 40, 87, 119, and 137)

Aiello, W, Chung, F, and Lu, L, 2000. A random graph model for massive graphs. Proceedings of the ACM Symposium on Theory of Computing 171–180. (pages 51, 168, and 169)

Albert, I and Albert, R, 2004. Conserved network motifs allow protein-protein interaction prediction. Bioinformatics 20:3346–3352. (page 40)

Albert, R and Barabasi, AL, 2000. Topology of evolving networks: local events and universality. Physical Review Letters 85:5234–5237. (page 50)

Allen, SCH, Byron, A, Lord, JM, Davey, J, Roberts, LM, and Ladds, G, 2005. Utilisation of the budding yeast Saccharomyces cerevisiae for the generation and isolation of non-lethal ricin A chain variants. Yeast 22:1287–1297. (page 26)

Alm, E and Arkin, A, 2003. Biological networks. Current Opinion in Structural Biology 13:193–202. (page 21)

Almaas, E, 2007. Biological impacts and context of network theory. Journal of Experimental Biology 210:1548–1558. (page 50)

Aloy, P and Russell, RB, 2002. Potential artefacts in protein-interaction networks. FEBS Letters 530:253–254. (page 52)

Aloy, P and Russell, RB, 2006. Structural systems biology: modelling protein interactions. Nature Reviews Molecular Cell Biology 7:188–197. (page 52)

Altschul, S, Gish, W, Miller, W, Myers, E, and Lipman, D, 1990. Basic Local Alignment Search Tool. Journal of Molecular Biology 215:403–410. (page 27)

Andrews, D and Demidov, A, 1999. Resonance Energy Transfer. Wiley. (page 35)

Ashburner, M, Ball, C, Blake, J, and Botstein, D, 2000. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genetics 25:25–29. (pages 47 and 89)

Bader, GD, Donaldson, I, Wolting, C, and Ouellette, B, 2001. BIND—The Biomolecular Interaction Network Database. Nucleic Acids Research 29:242–245. (page 64)

Bader, GD and Hogue, CWV, 2002. Analyzing yeast protein-protein interaction data obtained from different sources. Nature Biotechnology 20:991–997. (pages 38 and 61)


Bader, JS, Chaudhuri, A, Rothberg, J, and Chant, J, 2004. Gaining confidence in high-throughput protein interaction networks. Nature Biotechnology 22:78–85. (pages 40, 46, and 68)

Bader, S, Kuhner, S, and Gavin, AC, 2008. Interaction networks for systems biology. FEBS Letters 582:1220–1224. (page 34)

Bangert, RK, Turek, RJ, Rehill, B, Wimp, GM, Schweitzer, JA, et al., 2006. A genetic similarity rule determines arthropod community structure. Molecular Ecology 15:1379–1391. (page 26)

Barabasi, AL and Albert, R, 1999. Emergence of scaling in random networks. Science 286:509–512. (pages 43, 50, and 51)

Barabasi, AL and Oltvai, Z, 2004. Network biology: understanding the cell's functional organization. Nature Reviews Genetics 5:101–113. (pages 21, 45, 50, and 52)

Batada, NN, Hurst, LD, and Tyers, M, 2006. Evolutionary and physiological importance of hub proteins. PLoS Computational Biology 2:e88. (page 50)

Batada, NN, Reguly, T, Breitkreutz, A, Boucher, L, Breitkreutz, BJ, et al., 2006. Stratus not altocumulus: a new view of the yeast protein interaction network. PLoS Biology 4:e317. (page 50)

Ben-Hur, A and Noble, WS, 2005. Kernel methods for predicting protein-protein interactions. Bioinformatics 21:i38–i46. (pages 40, 46, and 52)

Bender, E and Canfield, ER, 1978. The asymptotic number of labeled graphs with given degree sequences. Journal of Combinatorial Theory, Series A 24:296–307. (pages 89, 90, and 93)

Berg, J and Lassig, M, 2004. Local graph alignment and motif search in biological networks. Proceedings of the National Academy of Sciences 101:14689–14694. (page 87)

Bhardwaj, N and Lu, H, 2005. Correlation between gene expression profiles and protein-protein interactions within and across genomes. Bioinformatics 21:2730–2738. (pages 37, 46, and 89)

Boone, C, Bussey, H, and Andrews, BJ, 2007. Exploring genetic interactions and networks with yeast. Nature Reviews Genetics 8:437–449. (page 24)

Bork, P, Jensen, LJ, von Mering, C, Ramani, AK, Lee, I, and Marcotte, EM, 2004. Protein interaction networks from yeast to human. Current Opinion in Structural Biology 14:292–299. (page 40)

Breitkreutz, BJ, Stark, C, Reguly, T, Boucher, L, Breitkreutz, A, et al., 2008. The BioGRID Interaction Database: 2008 update. Nucleic Acids Research 36:D637–D640. (pages 34 and 65)

Brizzio, V, Khalfan, W, Huddler, D, Beh, CT, Andersen, SS, et al., 1999. Genetic interactions between KAR7/SEC71, KAR8/JEM1, KAR5, and KAR2 during nuclear fusion in Saccharomyces cerevisiae. Molecular Biology of the Cell 10:609–626. (page 69)

Bruggeman, FJ and Westerhoff, HV, 2007. The nature of systems biology. Trends in Microbiology 15:45–50. (page 21)

Bunge, J and Fitzpatrick, M, 1993. Estimating the Number of Species: A Review. Journal of the American Statistical Association 88:364–373. (page 142)

Burnham, K and Anderson, DR, 1998. Model Selection and Inference: A Practical Information-Theoretic Approach. Springer. (pages 82 and 167)

Camon, EB, Barrell, DG, Lee, V, Dimmer, E, and Apweiler, R, 2004. The Gene Ontology Annotation (GOA) Database–an integrated resource of GO annotations to the UniProt Knowledgebase. In Silico Biology 4:5–6. (pages 34 and 47)


Carlson, J and Doyle, J, 1999. Highly optimized tolerance: A mechanism for power laws in designed systems. Physical Review E 60:1412–1427. (pages 43 and 50)

Chao, A, 2001. An overview of closed capture-recapture models. Journal of Agricultural, Biological, and Environmental Statistics 6:158–175. (page 142)

Chen, PY, Deane, CM, and Reinert, G, 2007. A statistical approach using network structure in the prediction of protein characteristics. Bioinformatics 23:2314–2321. (page 87)

Chiang, T, Scholtens, D, Sarkar, D, and Gentleman, R, 2007. Coverage and error models of protein-protein interaction data by directed graph analysis. Genome Biology R186. (pages 52, 59, and 139)

Chinnasamy, A, Mittal, A, and Sung, WK, 2006. Probabilistic prediction of protein-protein interactions from the protein sequences. Computational Biological Medicine 36:1143–1154. (page 70)

Cho, R, Campbell, M, Winzeler, E, and Steinmetz, L, 1998. A Genome-Wide Transcriptional Analysis of the Mitotic Cell Cycle. Molecular Cell 2:65–73. (page 89)

Chung, F, Lu, L, Dewey, T, and Galas, D, 2003. Duplication models for biological networks. Journal of Computational Biology 10:677–687. (page 168)

Collins, SR, Kemmeren, P, Zhao, XC, Greenblatt, JF, Spencer, F, et al., 2007. Toward a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae. Molecular & Cellular Proteomics 6:439–450. (pages 23, 34, and 66)

Collins, SR, Miller, KM, Maas, NL, Roguev, A, Fillingham, J, et al., 2007. Functional dissection of protein complexes involved in yeast chromosome biology using a genetic interaction map. Nature 446:806–810. (page 66)

Copley, RR, 2008. The animal in the genome: comparative genomics and evolution. Philosophical Transactions of the Royal Society B 363:1453–1461. (page 139)

Cusick, ME, Yu, H, Smolyar, A, Venkatesan, K, Carvunis, AR, et al., 2009. Literature-curated protein interaction datasets. Nature Methods 6:39–46. (pages 34 and 36)

Daudin, JJ, Picard, F, and Robin, S, 2007. A mixture model for random graphs. Statistics for Systems Biology Group 5840. (pages 51 and 171)

de Silva, E, Thorne, T, Ingram, PJ, Agrafioti, I, Swire, J, et al., 2006. The effects of incomplete protein interaction data on structural and evolutionary inferences. BMC Biology 4:39. (pages 95 and 137)

Deane, CM, Salwinski, L, Xenarios, I, and Eisenberg, D, 2002. Protein interactions: two methods for assessment of the reliability of high throughput observations. Molecular & Cellular Proteomics 1:349–356. (pages 40, 47, 52, 65, and 79)

Deng, M, Sun, F, and Chen, T, 2003. Assessment of the reliability of protein-protein interactions and protein function prediction. Pacific Symposium on Biocomputing 140–151. (page 57)

D’haeseleer, P and Church, G, 2004. Estimating and improving protein interaction error rates. Proceedings of the IEEE Computational Systems Bioinformatics Conference. (pages 53, 56, 57, 59, 61, 74, and 139)

Dorogovtsev, SN and Mendes, JFF, 2001. Evolution of Networks. arXiv 0106144. (pages 45 and 49)

Dorogovtsev, SN, Mendes, JFF, and Samukhin, AN, 2000. Structure of growing networks with preferential linking. Physical Review Letters 85:4633–4636. (page 168)


Doyle, J, Alderson, D, Li, L, Low, S, Roughan, M, et al., 2005. The “robust yet fragile” nature of theInternet. Proceedings of the National Academy of Sciences 102:14497–14502. (pages 50 and 85)

Drummond, DA, Raval, A, and Wilke, CO, 2006. A single determinant dominates the rate of yeast proteinevolution. Molecular Biology and Evolution 23:327–337. (pages 33 and 137)

Eisen, MB, Spellman, PT, Brown, PO, and Botstein, D, 1998. Cluster analysis and display of genome-wideexpression patterns. Proceedings of the National Academy of Sciences 95:14863–14868. (page 47)

Erdos, P and Renyi, A, 1959. On random graphs. Publicationes Mathematicae Debrecen 6:290–297. (pages 48 and 49)

Felsenstein, J, 1984. Distance Methods for Inferring Phylogenies: A Justification. Evolution 38:16–24. (page 27)

Felsenstein, J, 1995. PHYLIP (Phylogeny Inference Package), version 3.57c. University of Washington. (pages 30 and 120)

Felsenstein, J, 2003. Inferring Phylogenies. Sinauer Associates. (pages 122 and 180)

Fitzpatrick, D, Logue, M, Stajich, J, and Butler, G, 2006. A fungal phylogeny based on 42 complete genomes derived from supertree and combined gene analysis. BMC Evolutionary Biology 6:99. (page 120)

Fraser, HB, Hirsh, A, Steinmetz, L, Scharfe, C, and Feldman, M, 2002. Evolutionary rate in the protein interaction network. Science 296:750–752. (pages 87 and 118)

Freifelder, D, 1982. Physical Biochemistry: Applications to Biochemistry and Molecular Biology. W.H. Freeman. (page 35)

Gaczynska, M, Osmulski, PA, Jiang, Y, Lee, JK, Bermudez, V, and Hurwitz, J, 2004. Atomic force microscopic analysis of the binding of the Schizosaccharomyces pombe origin recognition complex and the spOrc4 protein with origin DNA. Proceedings of the National Academy of Sciences 101:17952–17957. (page 35)

Gavin, AC, Aloy, P, Grandi, P, Krause, R, Boesche, M, et al., 2006. Proteome survey reveals modularity of the yeast cell machinery. Nature 440:631–636. (pages 21, 38, 66, 83, and 89)

Gavin, AC, Bosche, M, Krause, R, Grandi, P, Marzioch, M, et al., 2002. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415:141–147. (pages 34, 36, 57, 61, 66, and 80)

Gentleman, R and Huber, W, 2007. Making the most of high-throughput protein-interaction data. Genome Biology 8:112. (pages 59 and 164)

Gertz, J, Elfond, G, Shustrova, A, Weisinger, M, Pellegrini, M, et al., 2003. Inferring protein interactions from phylogenetic distance matrices. Bioinformatics 19:2039–2045. (pages 30, 31, 40, and 118)

Gilchrist, MA, Salter, LA, and Wagner, A, 2004. A statistical framework for combining and interpreting proteomic datasets. Bioinformatics 20:689–700. (page 21)

Gilles, C, Rousseau, P, Rouge, P, and Payan, F, 1996. Crystallization and preliminary x-ray analysis of pig porcine pancreatic alpha-amylase in complex with a bean lectin-like inhibitor. Acta Crystallography 581–582. (page 24)

Gkantsidis, C, Mihail, M, and Zegura, E, 2003. The Markov Chain Simulation Method for Generating Connected Power Law Random Graphs. Proceedings of the SIAM Alenex 16–25. (pages 51, 168, and 169)

Goh, CS, Bogan, A, Joachimiak, M, Walther, D, and Cohen, F, 2000. Co-evolution of Proteins with their Interaction Partners. Journal of Molecular Biology 299:283–293. (pages 31, 118, 137, and 164)

Goh, CS and Cohen, F, 2002. Co-evolutionary Analysis Reveals Insights into Protein-Protein Interactions. Journal of Molecular Biology 324:177–192. (pages 30, 118, and 137)

Grigoriev, A, 2003. On the number of protein–protein interactions in the yeast proteome. Nucleic Acids Research 31:4157–4161. (pages 57, 58, 61, 139, and 140)

Guldener, U, Munsterkotter, M, Oesterheld, M, Pagel, P, Ruepp, A, et al., 2006. MPact: the MIPS protein interaction resource on yeast. Nucleic Acids Research 34:D436–D441. (pages 58 and 64)

Gurunathan, S, David, D, and Gerst, JE, 2002. Dynamin and clathrin are required for the biogenesis of a distinct class of secretory vesicles in yeast. The EMBO Journal 21:602–614. (page 35)

Hakes, L, Lovell, SC, Oliver, SG, and Robertson, DL, 2007. Specificity in protein interactions and its relationship with sequence diversity and coevolution. Proceedings of the National Academy of Sciences 104:7999–8004. (pages 32, 33, 121, 137, and 166)

Hakes, L, Pinney, JW, Lovell, SC, Oliver, SG, and Robertson, DL, 2007. All duplicates are not equal: the difference between small-scale and genome duplication. Genome Biology 8:R209. (page 28)

Hakes, L, Robertson, DL, Oliver, SG, and Lovell, SC, 2007. Protein interactions from complexes: a structural perspective. Comparative and Functional Genomics 2007:5. (page 38)

Hamming, R, 1950. Error detecting and error correcting codes. Bell System Technical Journal 29:2. (page 96)

Harkness, TAA, Davies, GF, Ramaswamy, V, and Arnason, TG, 2002. The ubiquitin-dependent targeting pathway in Saccharomyces cerevisiae plays a critical role in multiple chromatin assembly regulatory steps. Genetics 162:615–632. (page 69)

Hart, GT, Ramani, AK, and Marcotte, EM, 2006. How complete are current yeast and human protein-interaction networks? Genome Biology 7:120. (pages 21, 59, and 61)

Harvey, P, Colwell, R, Silvertown, J, and May, R, 1983. Null Models in Ecology. Annual Reviews Ecology and Systematics 14:189–211. (page 87)

He, X and Zhang, J, 2006. Why do hubs tend to be essential in protein networks? PLoS Genetics 2:e88. (pages 43 and 50)

Hermjakob, H, Montecchi-Palazzi, L, and Lewington, C, 2004. IntAct: an open source molecular interaction database. Nucleic Acids Research 32:D452–D455. (page 34)

Higham, D, Rasajski, M, and Przulj, N, 2008. Fitting a geometric graph to a protein-protein interaction network. Bioinformatics 24:1093–1099. (pages 51, 80, and 171)

Hintze, A and Adami, C, 2008. Evolution of complex modular biological networks. PLoS Computational Biology 4:e23. (page 18)

Hirschman, JE, Balakrishnan, R, Christie, KR, Costanzo, MC, Dwight, SS, et al., 2006. Genome Snapshot: a new resource at the Saccharomyces Genome Database (SGD) presenting an overview of the Saccharomyces cerevisiae genome. Nucleic Acids Research 34:D442–D445. (pages 21 and 142)

Ho, Y, Gruhler, A, Heilbut, A, Bader, GD, Moore, L, et al., 2002. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415:180–183. (pages 36, 57, 61, 66, and 80)

Hong, EL, Balakrishnan, R, Dong, Q, Christie, KR, Park, J, et al., 2008. Gene Ontology annotations at SGD: new data sources and annotation methods. Nucleic Acids Research 36:D577–D581. (pages 21 and 34)

Huang, H, Jedynak, BM, and Bader, JS, 2007. Where have all the interactions gone? Estimating the coverage of two-hybrid protein interaction maps. PLoS Computational Biology 3:e214. (pages 59 and 61)

Ito, T, Chiba, T, Ozawa, R, Yoshida, M, Hattori, M, and Sakaki, Y, 2001. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proceedings of the National Academy of Sciences 98:4569–4574. (pages 34, 36, 37, 57, 58, 61, and 80)

Ito, T, Tashiro, K, Muta, S, Ozawa, R, Chiba, T, et al., 2000. Toward a protein-protein interaction map of the budding yeast: A comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. Proceedings of the National Academy of Sciences 97:1143–1147. (page 80)

Jansen, R and Gerstein, M, 2004. Analyzing protein function on a genomic scale: the importance of gold-standard positives and negatives for network prediction. Current Opinion in Microbiology 7:535–545. (pages 47 and 70)

Jansen, R, Yu, H, Greenbaum, D, Kluger, Y, Krogan, NJ, et al., 2003. A Bayesian Networks Approach for Predicting Protein-Protein Interactions from Genomic Data. Science 302:449–453. (pages 47 and 70)

Jordan, I, Wolf, Y, and Koonin, EV, 2003. No simple dependence between protein evolution rate and the number of protein-protein interactions: only the most prolific interactors tend to evolve slowly. BMC Evolutionary Biology 3:1. (pages 33, 87, 118, and 137)

Jothi, R, Cherukuri, PF, Tasneem, A, and Przytycka, TM, 2006. Co-evolutionary analysis of domains in interacting proteins reveals insights into domain-domain interactions mediating protein-protein interactions. Journal of Molecular Biology 362:861–875. (page 33)

Jothi, R, Kann, MG, and Przytycka, TM, 2005. Predicting protein-protein interaction by searching evolutionary tree automorphism space. Bioinformatics 21:i241–i250. (pages 33 and 118)

Juan, D, Pazos, F, and Valencia, A, 2008. Co-evolution and co-adaptation in protein networks. FEBS Letters 582:1225–1230. (pages 31 and 121)

Juan, D, Pazos, F, and Valencia, A, 2008. High-confidence prediction of global interactomes based on genome-wide coevolutionary networks. Proceedings of the National Academy of Sciences 105:934–939. (pages 31 and 118)

Kann, MG, Jothi, R, Cherukuri, PF, and Przytycka, TM, 2007. Predicting protein domain interactions from coevolution of conserved regions. Proteins 67:811–820. (page 32)

Kann, MG, Shoemaker, BA, Panchenko, AR, and Przytycka, TM, 2008. Correlated Evolution of Interacting Proteins: Looking Behind the Mirrortree. Journal of Molecular Biology 385:91–98. (page 32)

Kashtan, N, Itzkovitz, S, Milo, R, and Alon, U, 2004. Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics 20:1746–1758. (page 45)

Kelly, WP and Stumpf, MPH, 2008. Protein-protein interactions: from global to local analyses. Current Opinion in Biotechnology 19:396–403. (page 20)

Kemp, BE, Mitchelhill, KI, Stapleton, D, Michell, BJ, Chen, ZP, and Witters, LA, 1999. Dealing with energy demand: the AMP-activated protein kinase. Trends in Biochemical Sciences 24:22–25. (page 79)

Kerrien, S, Alam-Faruque, Y, Aranda, B, and Bancarz, I, 2006. IntAct - open source resource for molecular interaction data. Nucleic Acids Research 00:D1–D5. (page 64)

Kiel, C, Beltrao, P, and Serrano, L, 2008. Analyzing Protein Interaction Networks Using Structural Information. Annual Review Biochemistry 77:1–27. (page 35)

Kim, PM, Sboner, A, Xia, Y, and Gerstein, M, 2008. The role of disorder in interaction networks: a structural analysis. Molecular Systems Biology 4:179. (page 50)

Kim, WK and Marcotte, EM, 2008. Age-dependent evolution of the yeast protein interaction network suggests a limited role of gene duplication and divergence. PLoS Computational Biology 4:e1000232. (page 51)

Konagurthu, AS and Lesk, AM, 2008. On the origin of distribution patterns of motifs in biological networks. BMC Systems Biology 2:73. (pages 26 and 45)

Krogan, NJ, Cagney, G, Yu, H, Zhong, G, Guo, X, et al., 2006. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440:637–643. (pages 38, 47, 66, and 83)

Kuchin, S, Treich, I, and Carlson, MW, 2000. A regulatory shortcut between the Snf1 protein kinase and RNA polymerase II holoenzyme. Proceedings of the National Academy of Sciences 97:7916–7920. (page 79)

LaCount, DJ, Vignali, M, Chettier, R, Phansalkar, A, Bell, R, et al., 2005. A protein interaction network of the malaria parasite Plasmodium falciparum. Nature 438:103–107. (page 21)

Lappe, M and Holm, L, 2004. Unraveling protein interaction networks with near-optimal efficiency. Nature Biotechnology 22:98–103. (page 34)

Lee, S, Kim, P, and Jeong, H, 2006. Statistical properties of sampled networks. Physical Review E 73:016102. (pages 54 and 95)

Lehner, B and Fraser, AG, 2004. A first-draft human protein-interaction map. Genome Biology 5:R63. (page 27)

Lemos, B, Meiklejohn, C, and Hartl, D, 2004. Regulatory evolution across the protein interaction network. Nature Genetics 36:1059–1060. (page 87)

Li, L, Anderson, D, Tanaka, R, Doyle, J, and Willinger, W, 2005. Towards a Theory of Scale-Free Graphs: Definition, Properties, and Implications. Internet Mathematics 2:4. (pages 51, 168, and 169)

Li, S, Armstrong, C, Bertin, N, Ge, H, and Milstein, S, 2004. A Map of the Interactome Network of the Metazoan C. elegans. Science 303:540–543. (page 40)

Li, X, Chen, H, Huang, Z, Su, H, and Martinez, JD, 2007. Global mapping of gene/protein interactions in PubMed abstracts: a framework and an experiment with P53 interactions. Journal of Biomedical Informatics 40:453–464. (page 50)

Lin, N, Wu, B, Jansen, R, Gerstein, M, and Zhao, H, 2004. Information assessment on predicting protein-protein interactions. BMC Bioinformatics 5:154. (page 47)

Lo, WS, Duggan, L, Emre, NC, Belotserkovskya, R, Lane, WS, et al., 2001. Snf1–a histone kinase that works in concert with the histone acetyltransferase Gcn5 to regulate transcription. Science 293:1142–1146. (page 79)

Loganantharaj, R and Atwi, M, 2007. Towards validating the hypothesis of phylogenetic profiling. BMC Bioinformatics 8:s25. (page 31)

Lu, L, Xia, Y, Paccanaro, A, Yu, H, and Gerstein, M, 2005. Assessing the limits of genomic data integration for predicting protein networks. Genome Research 15:945–953. (page 40)

Luscombe, NM, Greenbaum, D, and Gerstein, M, 2001. What is bioinformatics? A proposed definition and overview of the field. Methods of Information in Medicine 40:346–358. (page 18)

MacDonald, N, 1979. Simple aspects of foodweb complexity. Journal of Theoretical Biology 80:577–588. (page 26)

Marcotte, EM, Pellegrini, M, Thompson, M, and Yeates, TO, 1999. A combined algorithm for genome-wide prediction of protein function. Nature 402:83–86. (page 47)

Maslov, S and Sneppen, K, 2002. Specificity and stability in topology of protein networks. Science296:910–913. (page 45)

May, RM, 2001. Stability and Complexity in Model Ecosystems. Princeton University Press. (page 87)

Meinke, G, Ezeokonkwo, C, Balbo, P, Stafford, W, Moore, C, and Bohm, A, 2008. Structure of yeast poly(A) polymerase in complex with a peptide from Fip1, an intrinsically disordered protein. Biochemistry 47:6859–6869. (page 35)

Mewes, HW, Frishman, D, Mayer, K, Munsterkotter, M, Noubibou, O, et al., 2006. MIPS: analysis and annotation of proteins from whole genomes in 2005. Nucleic Acids Research 34:D169–D172. (pages 65 and 119)

Mika, S and Rost, B, 2006. Protein-protein interactions more conserved within species than across species. PLoS Computational Biology 2:e79. (page 40)

Millson, SH, Truman, AW, King, V, Prodromou, C, Pearl, LH, and Piper, PW, 2005. A two-hybrid screen of the yeast proteome for Hsp90 interactors uncovers a novel Hsp90 chaperone requirement in the activity of a stress-activated mitogen-activated protein kinase, Slt2p (Mpk1p). Eukaryotic Cell 4:849–860. (page 192)

Milo, R, Shen-Orr, S, Itzkovitz, S, Kashtan, N, Chklovskii, D, and Alon, U, 2002. Network motifs: simple building blocks of complex networks. Science 298:824–827. (pages 45 and 87)

Mosch, HU and Fink, GR, 1997. Dissection of filamentous growth by transposon mutagenesis in Saccharomyces cerevisiae. Genetics 145:671–684. (page 69)

Mount, D, 2004. Bioinformatics: Sequence and Genome Analysis. CSHL Press. (page 30)

Nariai, N, Tamada, Y, Imoto, S, and Miyano, S, 2005. Estimating gene regulatory networks and protein-protein interactions of Saccharomyces cerevisiae from multiple genome-wide data. Bioinformatics 21:i206–i212. (page 37)

Newman, MEJ, 2003. Mixing patterns in networks. Physical Review E 67:026126. (page 45)

Newman, MEJ, 2005. Power laws, Pareto distributions and Zipf’s law. Contemporary Physics 46:323–351. (page 49)

Newman, MEJ and Park, J, 2003. Why social networks are different from other types of networks. arXiv 0305612. (page 46)

Overbeek, R, Fonstein, M, D’Souza, M, Pusch, GD, and Maltsev, N, 1999. The use of gene clusters to infer functional coupling. Proceedings of the National Academy of Sciences 96:2896–2901. (page 47)

Pamilo, P and Nei, M, 1988. Relationships between gene trees and species trees. Molecular Biology and Evolution 5:568–583. (page 121)

Pan, X, Ye, P, Yuan, DS, Wang, X, Bader, JS, and Boeke, JD, 2006. A DNA integrity network in the yeast Saccharomyces cerevisiae. Cell 124:1069–1081. (page 66)

Park, J and Newman, MEJ, 2004. The statistical mechanics of networks. arXiv 0405566. (page 51)

Pastor-Satorras, R, Smith, E, and Sole, RV, 2003. Evolving protein interaction networks through gene duplication. Journal of Theoretical Biology 222:199–210. (page 28)

Pattison, P and Wasserman, S, 1999. Logit models and logistic regressions for social networks: II. Multivariate relations. British Journal of Mathematical and Statistical Psychology 52:169–193. (pages 51, 80, and 169)

Pazos, F, Ranea, J, Juan, D, and Sternberg, M, 2005. Assessing Protein Co-evolution in the Context of the Tree of Life Assists in the Prediction of the Interactome. Journal of Molecular Biology 352:1002–1015. (pages 30, 32, 118, 134, 137, and 186)

Pazos, F and Valencia, A, 2001. Similarity of phylogenetic trees as indicator of protein-protein interaction. Protein Engineering 14:609–614. (pages 31 and 118)

Pazos, F and Valencia, A, 2008. Protein co-evolution, co-adaptation and interactions. The EMBO Journal 27:2648–2655. (page 121)

Pellegrini, M, Marcotte, EM, Thompson, M, Eisenberg, D, and Yeates, TO, 1999. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proceedings of the National Academy of Sciences 96:4285–4288. (pages 30, 31, and 118)

Phizicky, E, Bastiaens, PIH, Zhu, H, Snyder, M, and Fields, S, 2003. Protein analysis on a proteomic scale.Nature 422:208–215. (page 68)

Picard, F, Daudin, JJ, Koskas, M, Schbath, S, and Robin, S, 2008. Assessing the exceptionality of network motifs. Journal of Computational Biology 15:1–20. (page 45)

Pratt, RC, Morgan-Richards, M, and Trewick, SA, 2008. Diversification of New Zealand weta (Orthoptera: Ensifera: Anostostomatidae) and their relationships in Australasia. Philosophical Transactions of the Royal Society B 363:3427–3437. (page 26)

Przulj, N, 2007. Biological network comparison using graphlet degree distribution. Bioinformatics 23:e177–e183. (page 45)

Ptacek, J, Devgan, G, Michaud, G, Zhu, H, Zhu, X, et al., 2005. Global analysis of protein phosphorylation in yeast. Nature 438:679–684. (pages 66, 79, and 175)

Ptak, RG, Fu, W, Sanders-Beer, BE, Dickerson, JE, Pinney, JW, et al., 2008. Cataloguing the HIV type 1 human protein interaction network. AIDS Research and Human Retroviruses 24:1497–1502. (page 25)

Ramani, AK, Li, Z, Hart, GT, Carlson, MW, Boutz, DR, and Marcotte, EM, 2008. A map of human protein interactions derived from co-expression of human mRNAs and their orthologs. Molecular Systems Biology 4:180. (pages 26, 46, 47, and 89)

Ramani, AK and Marcotte, EM, 2003. Exploiting the Co-evolution of Interacting Proteins to Discover Interaction Specificity. Journal of Molecular Biology 327:273–284. (pages 118 and 137)

Raveh, A, Riven, I, and Reuveny, E, 2009. The Use of FRET Microscopy to Elucidate Steady State Channel Conformational Rearrangements and G Protein Interaction with the GIRK Channels. Methods in Molecular Biology 491:199–212. (page 35)

Reguly, T, Breitkreutz, A, Boucher, L, Breitkreutz, BJ, Hon, G, et al., 2006. Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae. Journal of Biology 5:11. (pages 50, 65, 80, and 82)

Robins, G, Pattison, P, Kalish, Y, and Lusher, D, 2007. An introduction to exponential random graph (p*) models for social networks. Social Networks 29:173–191. (pages 51 and 169)

Salwinski, L and Eisenberg, D, 2003. Computational methods of analysis of protein–protein interactions. Current Opinion in Structural Biology 13:377–382. (pages 46, 89, and 140)

Sanchez, C, Lachaize, C, Janody, F, Bellon, B, Roder, L, et al., 1999. Grasping at molecular interactions and genetic networks in Drosophila melanogaster using FlyNets, an Internet database. Nucleic Acids Research 27:89–94. (page 23)

Sato, T, Yamanishi, Y, Horimoto, K, Toh, H, and Kanehisa, M, 2003. Prediction of protein–protein interactions from phylogenetic trees using partial correlation coefficient. Genome Informatics 14:496–497. (pages 28 and 118)

Saul, ZM and Filkov, V, 2007. Exploring biological network structure using exponential random graph models. Bioinformatics 23:2604–2611. (pages 169 and 171)

Scholtens, D, Chiang, T, Huber, W, and Gentleman, R, 2008. Estimating node degree in bait-prey graphs. Bioinformatics 24:218–224. (pages 36 and 139)

Schwartz, AS, Yu, J, Gardenour, KR, Finley, RL, and Ideker, T, 2009. Cost-effective strategies for completing the interactome. Nature Methods 6:55–61. (page 164)

Schwikowski, B, Uetz, P, and Fields, S, 2000. A network of protein-protein interactions in yeast. Nature Biotechnology 18:1257–1261. (pages 18 and 52)

Sharp, P and Li, WH, 1987. The codon adaptation index - a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Research 15:1281–1295. (page 137)

Sharrocks, K, 2007. Host cell factors facilitating HIV-1 Integration. PhD Thesis. (page 25)

Shen, J, Zhang, J, Luo, X, Zhu, W, Yu, K, et al., 2007. Predicting protein-protein interactions based only on sequences information. Proceedings of the National Academy of Sciences 104:4337–4341. (pages 21 and 46)

Shen-Orr, S, Milo, R, Mangan, S, and Alon, U, 2002. Network motifs in the transcriptional regulation network of Escherichia coli. Nature Genetics 31:64–69. (page 45)

Shoemaker, BA and Panchenko, AR, 2007. Deciphering protein-protein interactions. Part I. Experimental techniques and databases. PLoS Computational Biology 3:e42. (page 36)

Shoemaker, BA and Panchenko, AR, 2007. Deciphering protein-protein interactions. Part II. Computational methods to predict protein and domain interaction partners. PLoS Computational Biology 3:e43. (page 37)

Shokouhi, M, Zobel, J, and Scholer, F, 2006. Capturing collection size for distributed non-cooperative retrieval. SIGIR Proceedings 316–323. (pages 139 and 142)

Skrabanek, L, Saini, HK, Bader, GD, and Enright, AJ, 2008. Computational prediction of protein-protein interactions. Molecular Biotechnology 38:1–17. (pages 46, 47, 70, and 89)

Small, M, Walker, DM, and Tse, CK, 2007. Scale-free distribution of avian influenza outbreaks. Physical Review Letters 99:188702. (page 50)

Small, M, Xu, X, Zhou, J, Zhang, J, Sun, J, and Lu, JA, 2008. Scale-free networks which are highly assortative but not small world. Physical Review E 77:066112. (page 50)

Smith, TF and Waterman, MS, 1981. Identification of common molecular subsequences. Journal of Molecular Biology 147:195–197. (page 27)

Snijders, T, 2002. Markov chain Monte Carlo estimation of exponential random graph models. Journal of Social Structure. (page 171)

Sprinzak, E, Sattath, S, and Margalit, H, 2003. How Reliable are Experimental Protein–Protein Interaction Data? Journal of Molecular Biology 919–923. (page 61)

Srinivasan, BS, Shah, NH, Flannick, JA, Abeliuk, E, Novak, AF, and Batzoglou, S, 2007. Current progress in network research: toward reference networks for key model organisms. Briefings in Bioinformatics 8:318–332. (page 46)

Stark, C, Breitkreutz, BJ, Reguly, T, Boucher, L, Breitkreutz, A, and Tyers, M, 2006. BioGRID: a general repository for interaction datasets. Nucleic Acids Research 34:D535–D539. (page 64)

Strong, DR, Simberloff, D, Abele, LG, and Thistle, AB, 1984. Ecological communities: Conceptual issues and the evidence. Princeton University Press. (page 87)

Stuart, JM, Segal, E, Koller, D, and Kim, SK, 2003. A gene-coexpression network for global discovery of conserved genetic modules. Science 302:249–255. (page 47)

Stumpf, MPH, Ingram, PJ, Nouvel, I, and Wiuf, C, 2005. Statistical Model Selection Methods Applied to Biological Networks. arXiv 0506013. (page 50)

Stumpf, MPH, Kelly, WP, Thorne, T, and Wiuf, C, 2007. Evolution at the system level: the natural history of protein interaction networks. Trends in Ecology & Evolution 22:366–373. (pages 20, 43, 51, 87, and 168)

Stumpf, MPH, Thorne, T, de Silva, E, Stewart, R, An, H, et al., 2008. Estimating the size of the human interactome. Proceedings of the National Academy of Sciences 105:6959–6964. (pages 59, 61, 139, 141, and 142)

Stumpf, MPH and Wiuf, C, 2005. Sampling properties of random graphs: The degree distribution. Physical Review E 72:036118. (pages 50 and 95)

Stumpf, MPH, Wiuf, C, and May, RM, 2005. Subnets of scale-free networks are not scale-free: Sampling properties of networks. Proceedings of the National Academy of Sciences 102:4221–4224. (pages 95 and 171)

Tajima, F, 1983. Evolutionary relationship of DNA sequences in finite populations. Genetics 105:437–460. (page 121)

Takemoto, K and Oosawa, C, 2005. Evolving networks by merging cliques. Physical Review E 72:046116. (page 49)

Tanaka, R, Yi, TM, and Doyle, J, 2005. Some protein interaction data do not exhibit power law statistics.FEBS Letters 579:5140–5144. (page 50)

Tarassov, K, Messier, V, Landry, CR, Radinovic, S, Molina, MMS, et al., 2008. An in vivo map of the yeast protein interactome. Science 320:1465–1470. (page 23)

Thomas, P, 2008. Generalising multiple capture-recapture to non-uniform sample sizes. SIGIR Proceedings 839–840. (page 142)

Thompson, JD, Gibson, TJ, and Higgins, DG, 2002. Multiple sequence alignment using ClustalW and ClustalX. Current Protocols in Bioinformatics Chapter 2:Unit 2.3. (page 120)

Thorne, T and Stumpf, MPH, 2007. Generating confidence intervals on biological networks. BMC Bioinformatics 8:467. (pages 40, 46, 87, 89, 93, and 164)

Tucker, CL, Gera, JF, and Uetz, P, 2001. Towards an understanding of complex protein networks. Trends in Cell Biology 11:102–106. (page 61)

Uetz, P, Giot, L, Cagney, G, Mansfield, TA, Judson, RS, et al., 2000. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403:623–627. (pages 34, 36, 37, 57, 58, 61, and 80)

Umemura, M, Fujita, M, Yoko-O, T, Fukamizu, A, and Jigami, Y, 2007. Saccharomyces cerevisiae CWH43 is involved in the remodeling of the lipid moiety of GPI anchors to ceramides. Molecular Biology of the Cell 18:4304–16. (page 69)

Valencia, A and Pazos, F, 2002. Computational methods for the prediction of protein interactions. Current Opinion in Structural Biology 12:368–373. (pages 46 and 89)

Vazquez, A, Pastor-Satorras, R, and Vespignani, A, 2002. Large-scale topological and dynamical properties of the Internet. Physical Review E 65:066130. (page 45)

von Mering, C, Krause, R, Snel, B, Cornell, M, Oliver, SG, et al., 2002. Comparative assessment of large-scale data sets of protein-protein interactions. Nature 417:399–403. (pages 37, 56, 61, and 140)

Wagner, A, 2001. The yeast protein interaction network evolves rapidly and contains few redundant duplicate genes. Molecular Biology and Evolution 18:1283–1292. (pages 87 and 120)

Wagner, A, 2005. Robustness and Evolvability in Living Systems. Princeton University Press. (page 50)

Watts, DJ, 2004. Small Worlds: The Dynamics of Networks Between Order and Randomness. Princeton University Press. (pages 45 and 49)

Watts, DJ and Strogatz, S, 1998. Collective dynamics of ‘small-world’ networks. Nature 393:440–442. (pages 48 and 50)

Wojcik, J, Boneca, IG, and Legrain, P, 2002. Prediction, assessment and validation of protein interaction maps in bacteria. Journal of Molecular Biology 323:763–770. (page 40)

Wolfe, K, 2006. Comparative genomics and genome evolution in yeasts. Philosophical Transactions of the Royal Society B 361:403–412. (pages 26, 29, and 120)

Xenarios, I, Salwinski, L, Duan, X, Higney, P, Kim, SM, and Eisenberg, D, 2002. DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Research 30:303–305. (pages 32, 64, and 65)

Xu, J, Wu, S, and Li, X, 2007. Estimating collection size with logistic regression. SIGIR Proceedings 789–790. (page 142)

Yang, Z, 2004. PAML: Phylogenetic Analysis by Maximum Likelihood. (page 30)

Yang, Z, 2006. Computational Molecular Evolution. Oxford University Press. (page 26)

Yang, Z, 2007. PAML 4: Phylogenetic Analysis by Maximum Likelihood. Molecular Biology and Evolution 24:1586–1591. (page 120)

Yook, SH, Jeong, H, and Barabasi, AL, 2002. Modeling the Internet’s large-scale topology. Proceedings of the National Academy of Sciences 99:13382–13386. (pages 43 and 50)

Yu, J and Fotouhi, F, 2006. Computational approaches for predicting protein-protein interactions: a survey. Journal of Medical Systems 30:39–44. (pages 46 and 89)

Yuan, C, Yongkiettrakul, S, Byeon, IJ, Zhou, S, and Tsai, MD, 2001. Solution structures of two FHA1-phosphothreonine peptide complexes provide insight into the structural basis of the ligand specificity of FHA1 from yeast Rad53. Journal of Molecular Biology 314:563–575. (page 35)

Zhang, J, 2003. Evolution by gene duplication: an update. Trends in Ecology & Evolution 18:292–298. (page 28)
