
Deconvolution of PPI Networks: Approximation Algorithms and Optimization Techniques

Dong Hyun Kim

School of Computer Science

McGill University

Montréal

May 2012

A thesis submitted to McGill University in partial fulfilment of the requirements for the degree of Doctor of Philosophy

© Dong Hyun Kim, 2012

Contents

Abstract
Résumé
Declaration
Acknowledgments
List of Figures

1 Introduction
1.1 Experimental Techniques
1.1.1 Yeast Two-Hybrid (Y2H) Method
1.1.2 Affinity Purification / Mass Spectrometry (AP-MS)
1.1.3 Indirect Approaches
1.2 Computational Analyses of PPI Data
1.2.1 Graph Models for PPI Networks
1.2.2 Identification of Protein-Protein Interactions
1.2.3 Identification of Protein Complexes
1.3 AP-MS based PPI Networks
1.3.1 Protein Complexes in Yeast
1.3.2 Modularity in Yeast Protein Complexes
1.4 Contributions and Thesis Outline

2 Approximation Algorithms and Optimization Techniques
2.1 NP-hard Optimization Problems
2.2 Graph Parameters
2.3 Approximation Algorithms
2.3.1 Lower-bounding the Optimum
2.3.2 Polynomial Time Approximation Schemes
2.3.3 LP-based Algorithms
2.4 Genetic Algorithms

3 Protein Quantification with Shared Peptides
3.1 Problem Formulation for Protein Quantification
3.2 Hardness of Protein Quantification Problems
3.3 Approximation Algorithms for Protein Quantification
3.3.1 An Algorithm for Multicover
3.3.2 An Algorithm for Minimum Protein Types
3.3.3 An Algorithm for Minimum Uniform Error
3.3.4 An Algorithm for Minimum Error Sum
3.4 Experimental Results
3.4.1 Performance on Simulated Data
3.4.2 Protein Quantification from AP-MS Data
3.5 Discussions
3.6 Bibliographic Notes

4 Direct PPI Networks from AP-MS Data
4.1 Background
4.2 Mathematical Modelling and Problem Formulation
4.2.1 A Probabilistic Model for AP-MS Data
4.2.2 Problem Formulation
4.3 Algorithm for A-DIGCOM
4.3.1 Identification of Weakly Connected Regions
4.3.2 Identification of Densely Connected Regions
4.3.3 Cut-based Genetic Algorithm
4.3.4 Restricting the Solution Space for GA
4.4.1 Randomized Hill-climbing Algorithm
4.4.2 Choosing a Tolerance Level δ and Handling Numerical Errors
4.4.3 Generation of Scale-free Networks
4.4.4 Calculation of Connectivity Matrix from Peptide Counts
4.4.5 Model Validation
4.4.6 Accuracy of the Algorithm
4.4.7 Inferring Direct Interactions from AP-MS Experimental Data
4.5 Discussions

5 Hypergraph Modelling of PPI Data
5.1 Preliminaries
5.2 Clique Cover for Graphs with Bounded Treewidth
5.2.1 Running Time of Treewidth-based Algorithm
5.2.2 Modifications for Clique Partition
5.3 Planar Clique Cover
5.3.1 Clique Cover for Planar Graphs with Bounded Branchwidth
5.3.2 Modifications for Clique Partition
5.3.3 Baker’s Technique on Planar Graphs
5.4.1 Simulated and Biological Networks
5.4.2 Performance of the Treewidth and Branchwidth-based Algorithms
5.4.3 Performance of PTAS for Planar Graphs
5.4.4 Clique Cover in Biological Networks
5.5 Discussions

6 Conclusion and Future Directions

Bibliography

Abstract

Understanding the organization of protein-protein interactions (PPIs) as a complex network is one of the main pursuits in proteomics today. With the help of high-throughput experimental techniques, a large amount of PPI data has become available, providing us with a rough picture of how proteins interact in biological systems. One of the leading technologies for identifying protein interactions is affinity purification followed by mass spectrometry (AP-MS). While the AP-MS method provides the ability to detect protein interactions at biologically reasonable expression levels, this technique still suffers from poor accuracy as well as the lack of a sound approach to interpreting the obtained interaction data.

In this thesis, we look for sources of systematic errors and limitations of the data from AP-MS experiments, and propose several approaches for improvements. In particular, we identify various problems that arise within the experimental pipeline, and propose combinatorial algorithms developed using the theory of approximation algorithms and discrete mathematics. The first part of the thesis deals with quantification of proteins from MS-based experiments. Existing approaches for protein quantification often use each detected peptide as an indicator for the originating protein. These approaches ignore peptides that belong to more than one protein family. We attack this problem of protein quantification by taking these shared peptides into consideration, and propose a framework for estimating protein abundance via linear programming.

In AP-MS data, the identified protein interactions contain a large number of indirect interactions as artifacts from a series of simultaneous physical interactions. The second part of this thesis studies the problem of distinguishing direct physical interactions from indirect interactions. To do this, we first propose a probabilistic graph model for the PPI data, and design a combinatorial algorithm suited for graphs with underlying structure that is evident in PPI networks.

While the traditional model for PPI networks is a binary graph representing pairwise interactions, a large number of interactions involve more than two interaction partners. Such collections of proteins interacting concertedly are known as protein complexes, and various approaches have been proposed to identify the complexes in the network. When these complexes are overlapping, however, the existing complex detection methods often fail to identify the constituent complexes. Taking one step further on this line of research, the last part of this thesis discusses the problem of modelling PPI networks as hypergraphs by studying the clique cover problem on sparse networks.

For each problem discussed throughout the thesis, we obtain either an exact algorithm, an algorithm with a provably good guarantee on the output quality, or a heuristic with efficient running time. Furthermore, each of the proposed algorithms is empirically tested against biological data as well as simulated data, in order to validate both computational efficiency and biological soundness.

Résumé

Understanding the organization of protein-protein interactions (PPIs) as a complex network is one of the greatest problems in modern proteomics. With the help of high-throughput experimental techniques, a large quantity of PPI data has become available, giving us an approximate picture of how interactions between proteins operate in biological systems. One of the leading-edge technologies for identifying protein interactions is affinity purification followed by mass spectrometry (AP-MS). Even though the AP-MS method allows us to detect protein interactions at biologically reasonable expression levels, this technique still suffers from deficient accuracy and from the lack of a sound approach for interpreting the interaction results it produces.

In this thesis, we look for sources of systematic errors and limitations of the data from AP-MS experiments, and we propose several approaches that bring improvements. In particular, we identify various problems present in the experimental procedure and propose combinatorial algorithms developed using the theory of approximation algorithms and discrete mathematics. The first part of the thesis studies the quantification of proteins from an experiment based on mass spectrometry. Existing approaches for protein quantification often use each detected peptide as an indicator of the originating protein. These approaches ignore peptides that belong to more than one protein. We attack this protein quantification problem by taking these shared peptides into consideration, and propose a framework for estimating protein abundance via linear programming.

The protein interactions identified in AP-MS data contain a large number of indirect interactions that are artifacts of a series of simultaneous physical interactions. The second part of this thesis studies the problem of distinguishing direct physical interactions from indirect ones. To do this, we first propose a probabilistic graph model for PPI data, and design a combinatorial algorithm suited to graphs with properties observed in real PPI networks.

Although the traditional model for PPI networks is a binary graph representing pairwise interactions, a large number of interactions involve more than two interaction partners. Such groups of proteins interacting in concert are called protein complexes. Several approaches have already been proposed to identify the complexes in a network. However, when these complexes overlap, the existing methods fail. The last part of this thesis discusses the problem of modelling PPI networks as hypergraphs by studying the clique cover problem on sparse networks.

For each problem discussed in this thesis, we obtain an exact algorithm, an algorithm with good provable guarantees on the output quality, or a heuristic with an efficient running time. In addition, each proposed algorithm is tested empirically on biological and simulated data in order to validate its computational efficiency and biological significance.

Declaration

This thesis contains no material which has been accepted in whole, or in part, for any other degree or diploma. Except for results whose authors are mentioned, the material presented in Chapters 3, 4, and 5 of this thesis is an original contribution to knowledge.

In Section 3.4 of Chapter 3, the biological data for experimental studies was generously provided by Benoit Coulombe’s lab at Institut de recherches cliniques de Montréal (IRCM).

In Section 4.3.1 of Chapter 4, the work presented in Theorems 4.1 and 4.3 was done jointly with Ashish Sabharwal and my thesis advisors.

All other parts were done independently under the supervision of my thesis advisors, Adrian Vetta and Mathieu Blanchette.


Acknowledgements

During my first visit to Montréal in February 2005, my old advisor Sue Whitesides told me a number of reasons why McGill is one of the best places to study computer science and discrete maths. Seven years later, I can acknowledge every one of those reasons, and possibly double the list with my own.

First and foremost, I express my deepest gratitude to my thesis advisors, Mathieu Blanchette and Adrian Vetta, for their tireless efforts to guide my research, and for making my life as a graduate student ever so enjoyable. Mathieu introduced me to the field of protein-protein interactions, invited me to his Barbados workshop on the subject, was always ready to suggest the next steps in research, and taught me all about what makes good science. Adrian introduced me to the field of approximation algorithms, spent countless hours discussing various lines of attack on problems, and provided me with a seemingly infinite amount of resources in discrete mathematics. Without their guidance, this thesis, or my own scientific becoming, would have been impossible.

I thank my collaborators and colleagues: Ashish Sabharwal’s visit to Montreal resulted in the first part of the work presented in Chapter 4; Mathieu L.-A., Javier, Mathieu R. and Javad taught me a thing or two about proteomics and machine learning; human PPI data used in Chapter 3 was generously provided by Benoit Coulombe’s lab at Institut de recherches cliniques de Montréal (IRCM).

I acknowledge NSERC, CIHR, Walter C. Sumner Fellowship, and my advisors’ generous funding for financial support.

Special thanks to the “family” for all the laughter over pints: js, mk, sh, jm, sc, ij, sy, sj, jy, the list goes on. And finally, to my parents, for their love and patience over the years. Thank you.


List of Figures

1.1 The yeast PPI network
1.2 A schematic of Y2H experiment
1.3 Pipeline of TAP experiment
1.4 Protein interactions from the perspective of Y2H and AP-MS
1.5 Degree distribution of yeast PPI networks

2.1 Tree decomposition of a graph
2.2 Branch decomposition of a graph
2.3 A characteristics matrix and its perfect phylogenetic tree

3.1 Bipartite graph model from peptide-protein relationships
3.2 Robustness of the algorithms for protein quantification
3.3 Peptide-protein graph from an AP-MS experiment (CDK9)
3.4 Peptide-protein graph from an AP-MS experiment (POLR2A)
3.5 Peptide-protein graph from an AP-MS experiment (RPAP3)
3.6 Result of our algorithm on CDK9 dataset
3.7 Result of our algorithm on RPAP3 dataset

4.1 A schematic for indirect interactions in AP-MS data
4.2 A direct interaction network and its connectivity matrix
4.3 The outcome of our 3 phase algorithm for direct PPI network
4.4 Comparison of results from our direct PPI data against Y2H interaction network from Yu et al.

5.1 Node types in nice tree decomposition
5.2 Sphere-cut decomposition
5.3 Planar clique cover using Baker’s technique
5.4 Performance of the clique cover algorithm for graphs with bounded treewidth
5.5 Performance comparison of treewidth-based algorithm vs. branchwidth-based algorithm

Chapter 1

Introduction

Upon completion of the Human Genome Project¹, one of the compelling topics in the biological sciences is the large-scale study of proteins encoded by the genes, a field referred to as proteomics. Indeed, it is the proteins that execute the functions programmed in the genes, and analyzing the proteome will thus lead us to a better understanding of how cellular processes work in living organisms.

While the genome is considered as a long sequence of DNA, the proteome is often regarded as a much more complex system. Because cellular processes are caused by proteins interacting concertedly within the cell, the proteins in the proteome form a large network called the protein-protein interaction (PPI) network, sometimes referred to as the interactome. As a result, both structural and functional analyses of the proteome often require detailed examinations of the PPI network. Efforts to construct the PPI network began with the development of various experimental techniques to identify the interactions. Using these technologies, Saccharomyces cerevisiae (the yeast) was shown to contain approximately 6000 proteins, and over 78,000 interactions have now been identified. To give an example, Figure 1.1 depicts a small portion of the yeast PPI network, using only the highest confidence interaction data from Krogan et al. [93]. Humans, being more complex organisms, are expected to have approximately 25,000 proteins sharing as many as 650,000 interactions [130]. Such a sheer volume of data is intractable to analyze manually, and thus computational analyses have become essential to a deeper understanding of the proteome.

¹ An international scientific research program whose goal was to sequence the DNA and identify the genes of the human genome; it began in 1990 and was completed in 2003. (http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml)

Figure 1.1: The yeast PPI network constructed using Cytoscape software (http://www.cytoscape.org) with high confidence PPI data from Krogan et al. [93]

More recently, it has been revealed that many of these interactions may occur only when three or more specified proteins are all present at the site of an interaction. Such a collection of proteins interacting concertedly is called a protein complex, and this theory around protein complexes opened a new set of computational problems, such as the identification and functional annotation of protein complexes. To make things even more interesting, many of these interactions depend on the phase of the cell development cycle during which the interaction takes place, as well as the localization within the cell [80, 97]. Consequently, such temporal and spatial dependence makes the structure of PPI networks much more dynamic than originally expected.


Due to the natural structure of the proteome as a network, graph theory has played a crucial role in computational analyses of PPI networks. As shall be demonstrated later, however, it is often difficult to formulate an optimization problem that correctly captures all the properties of PPI networks, and even when a clean optimization problem is formulated, the problem often turns out to be computationally intractable. In many of these cases, heuristic algorithms targeting local optima have been popular, and very few algorithms with a provable guarantee of output quality have been proposed. As such, clean mathematical models and sound algorithms remain in high demand in the field of protein-protein interactions.

This thesis and its contributions can be divided into three parts; each part addresses a particular challenge related to the analysis of PPI networks, and provides a novel approach to reduce errors and noise present in the data. The principal results of this thesis will be outlined in Section 1.4. Prior to that, however, we first give an introduction to the field of protein-protein interactions from the perspective of bioinformatics and discrete mathematics. This introductory chapter consists of three parts: in Section 1.1, we give a brief survey on various experimental techniques for identifying protein interactions. Section 1.2 then discusses computational approaches to analyze the datasets produced by those experimental techniques; one of the leading technologies that is prevalent in the literature is affinity purification followed by mass spectrometry (AP-MS). In Section 1.3, we review studies on PPI networks produced by AP-MS experiments, and point out the limitations of existing computational approaches, which will then lead us to the topics of this thesis, namely: (1) protein quantification; (2) predicting direct interaction network; and (3) hypergraph modelling of PPI networks.

1.1 Experimental Techniques

In this section, we introduce several medium to high throughput experimental methods that have been proposed to identify protein interactions, and discuss the nature and the quality of the data obtained by each technique. Each technique makes use of different mechanisms to identify the interactions, and bears its own advantages and disadvantages. For example, some techniques identify direct, physical interactions while others are designed to find indirect, functional relationships between proteins. Therefore, understanding the intrinsic differences between the datasets from various experimental techniques will allow us to realistically model the given data, and devise highly customized algorithms for the formulated problems.

1.1.1 Yeast Two-Hybrid (Y2H) Method

Yeast two-hybrid (Y2H) is an in vivo² technique to detect PPIs by testing physical interactions between two proteins [49, 75, 137]. It was discovered that transcription factors found in eukaryotic organisms have two distinct domains: (1) a binding domain (BD), and (2) an activating domain (AD) [49]. The binding domain is a module responsible for binding to a promoter DNA sequence while the activating domain activates the transcription. The transcription is inactivated while the two domains are far from each other, but it can be restored when a binding domain is in close proximity with an activating domain.

A Y2H experiment starts by fusing the binding domain into a protein X (known as the bait), and the activating domain into a protein Y (known as the prey). If the bait X and the prey Y interact physically, the activating domain is in close proximity to the binding domain. The binding domain then binds to the upstream activating sequence (UAS) of a promoter, and thus activates the transcription of the reporter gene³. Figure 1.2 shows a schematic representation of a Y2H system. To do this experiment in a high throughput manner, either a matrix of prey clones or a library of random cDNA fragments is used [11, 50, 142]: In the matrix approach, a matrix of prey clones is mated with baits, and the interacting prey proteins are identified by the expression of a reporter gene and the position on the matrix. In the library approach, each bait is screened against a library of random cDNA fragments, and the interaction partners are identified by DNA sequencing.

² An experiment is said to be in vivo (Latin for “within the living”) when the subject is a living organism, as opposed to in vitro, a controlled environment.

³ In order to determine whether a gene is expressed, a reporter gene (which is easily identifiable when expressed) is attached to a regulatory sequence of the gene of interest. Commonly used reporter genes often induce visually identifiable characteristics, e.g. LacZ, GFP, etc.

Figure 1.2: A Y2H experiment. Protein X (the bait) is fused with a BD that binds to the upstream activating sequence (UAS) of a promoter. The bait interacts with Protein Y (the prey) that is fused with an AD, activating the transcription of a reporter gene via the mediator complex.

One advantage of a Y2H experiment is that it allows us to identify physical (rather than functional) associations between interaction partners. Furthermore, being an in vivo high throughput technique, Y2H produces large scale PPI data that potentially includes even the weaker, transient interactions. As such, the Y2H PPI data provide us with a comprehensive snapshot of the proteome. However, because each interaction must occur between the proteins fused with the bait-prey pair, the interactions identified by Y2H experiments are restricted to binary interactions. There are other limitations of Y2H systems: proteins that initiate transcription on their own cannot be targeted since they would result in expression of the reporter gene regardless of the presence of bait proteins; the protein fusion may cause changes in the structure of the protein, and the use of such bait proteins may result in incorrect interaction partners. Such a non-native environment causes high false positive ratios for Y2H experiments. To rectify these issues for Y2H experiments, many in silico⁴ methods have been proposed to post-process Y2H experimental data (see Section 1.2.2). With the help of these computational analysis tools, Y2H experiments have been the most widely used approach to obtain physical interaction data.

⁴ Computational analyses are often called in silico methods as opposed to in vivo or in vitro.

Protein-fragment Complementation Assay (PCA). Another similar technique has been proposed by Michnick [54, 103, 134]: Protein-fragment Complementation Assay (PCA) is a technique in which fragments of a reporter protein can be separately expressed with proteins (bait and prey) that are to interact with each other. In this method, the targeted proteins (bait and prey) are fused with incomplete fragments of a third protein (reporter), and are separately expressed in vivo. The interaction between the bait and the prey brings the fragments of the reporter in close proximity, allowing the reporter to reform, and then to be identified. While similar to Y2H experiments, PCA experiments can test protein interactions at various subcellular compartments within a pathway of interest in a quantitative manner, making it desirable for drug discovery purposes. Typically, PCA detects many interactions for membrane proteins whereas Y2H identifies many interactions for nuclear proteins [78].

1.1.2 Affinity Purification / Mass Spectrometry (AP-MS)

Affinity Purification followed by Mass Spectrometry (AP-MS) [36, 58, 93, 119], sometimes referred to as tandem affinity purification (TAP), is a technique that allows us to determine co-complex membership of protein interactions. This method is a two-step process consisting of the purification of protein complexes, and the identification of the interaction partners via mass spectrometry.

Affinity purification. A TAP tag consists of a calmodulin binding peptide (CBP) followed by a tobacco etch virus (TEV) protease cleavage site and the IgG binding domains of Staphylococcus protein A [118, 119]. The open reading frame of the target protein (known as the bait) is fused with the TAP tag at the end, and is expressed in vivo within the cells of the host organism (e.g. yeast or human). Once the TAP-tagged bait protein is expressed, it can interact with other proteins (known as preys), and form protein complexes. These protein complexes go through two purification processes: initially, protein A binds tightly to an IgG matrix. After the contaminants are washed out, the link between protein A and the IgG matrix is released by the protease. The eluate of this step is then incubated with calmodulin-coated beads, and after washing for the second time, the targeted protein complexes are released. See Figure 1.3 for a schematic representation of the AP process.

Figure 1.3: Pipeline of TAP experiments. At the first affinity column, the bait protein is cut at the TEV protease cleavage site after washing. At the second affinity column, the bait-prey pairs are pulled down by calmodulin beads, and the resulting eluate forms native complexes which can then be identified using gel electrophoresis and mass spectrometry.


Mass spectrometry. The result of the purification processes is the target protein complexes formed by the interaction partners of the bait protein. These protein complexes go through gel electrophoresis, resulting in component peptide fragments that can be identified using mass spectrometers.

Mass spectrometers then ionize the peptide fragments and measure the mass-to-charge ratios of the resulting ions, from which the polypeptide sequences are identified. Various methods have been proposed to convert the peptide molecules into ions in the gas phase, including electrospray ionization (ESI) [144] and matrix-assisted laser desorption/ionization (MALDI) [85, 116]. Furthermore, there are several algorithms that analyze the mass spectra produced by MS and either identify the peptides present or quantify their abundance [59, 115, 135].
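To make the notion of a mass-to-charge ratio concrete, the following minimal sketch computes the m/z value of a protonated peptide ion from its sequence. It is an illustration only and not part of the thesis: the function names and the small residue table are hypothetical, the masses are standard monoisotopic values, and the ion is assumed to carry its charge solely through added protons.

```python
# Monoisotopic residue masses in daltons (standard reference values);
# only a few amino acids are listed to keep the sketch short.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "K": 128.09496}
WATER = 18.01056    # H2O accounting for the free N- and C-termini
PROTON = 1.007276   # mass of one proton


def peptide_mass(sequence):
    """Neutral monoisotopic mass of a peptide (hypothetical helper)."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def mass_to_charge(sequence, charge):
    """m/z of the peptide ion carrying `charge` extra protons."""
    return (peptide_mass(sequence) + charge * PROTON) / charge


if __name__ == "__main__":
    # The same peptide appears at different m/z values for different charges.
    for z in (1, 2):
        print("GA", z, round(mass_to_charge("GA", z), 4))
```

The point of the sketch is that the instrument observes m/z rather than mass directly, so the same peptide yields a different peak for each charge state.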

Due to its ability to perform at biologically reasonable expression levels in human cells, as well as the ability to detect protein complexes with fewer false positives [36], AP-MS approaches have become increasingly popular. One caveat of this approach is, however, that a significant number of the co-purified prey proteins are in fact indirect interaction partners of the bait protein. This is an artifact of chains of binary interactions that occur simultaneously, and AP-MS cannot distinguish such indirect interactions from direct physical interactions. For example, Figure 1.4 shows how the same complex can be perceived differently by different experimental techniques. In Chapter 4 of this thesis, we discuss the problem of identifying direct physical interactions from AP-MS data.

Added to the difficulty of interpretation from indirect interactions, AP-MS carries a

high chance of capturing contaminants at the AP level: gel contamination, nonspecific

binding to TAP columns, etc. To reduce the false positives from these contaminants,

several computational approaches have been proposed [124, 128]. Some of these ap-

proaches make use of the network topology to assign a purification score to each hit [35].

On the other hand, Lavallée-Adam et al. [95] recently provided a Bayesian model specif-

ically for contaminants to directly identify false positive interactions.


[Figure 1.4: schematic of three proteins X, Y and Z. Panels: Actual protein interaction; Binary interaction methods (Y2H, PCA); Co-complex membership methods (AP-MS).]

Figure 1.4: Different types of protein interactions among three proteins, and how they are viewed in different experimental techniques (assuming all three proteins have been tagged as a bait). Note that binary testing methods (e.g. Y2H and PCA) cannot distinguish the last two types of interactions, while co-complex membership testing approaches (e.g. AP-MS) identify the first two types of interactions as the same topology.

1.1.3 Indirect Approaches

Aside from the methods described above, there are indirect approaches to associate pro-

teins with functional interactions.

Correlated mRNA expression (Synexpression). Genes can be partitioned into distinct

groups depending on the mRNA levels measured under a variety of different cellular

conditions. The genes in the same partition are then enriched for those encoding proteins that may interact with each other [44]. Though this technique gives broad coverage⁵, the

accuracy of the data is poor as it is difficult to infer direct physical (or even functional)

interactions solely from gene expression levels.

5 Coverage refers to (correctly identified interactions) / (total number of true interactions). Accuracy, on the other hand, refers to (correctly identified interactions) / (number of identified interactions).
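The two ratios in the footnote above can be made concrete with a small sketch. It assumes that interactions are undirected pairs and that a reference set of true interactions is available; the function name and the toy data are hypothetical and are not taken from any dataset cited here.

```python
def coverage_and_accuracy(predicted, true):
    """Return (coverage, accuracy) for two collections of undirected pairs."""
    # Store each pair as a frozenset so that (a, b) and (b, a) are the same edge.
    pred = {frozenset(p) for p in predicted}
    ref = {frozenset(t) for t in true}
    correct = len(pred & ref)
    coverage = correct / len(ref) if ref else 0.0    # correct / true interactions
    accuracy = correct / len(pred) if pred else 0.0  # correct / identified interactions
    return coverage, accuracy


# Toy example (made-up proteins): two of the three predictions are correct,
# and two of the four true interactions are recovered.
predicted = [("X", "Y"), ("Y", "Z"), ("X", "W")]
true = [("Y", "X"), ("Y", "Z"), ("Z", "W"), ("X", "Z")]
cov, acc = coverage_and_accuracy(predicted, true)
```

In this toy case the coverage is 2/4 and the accuracy is 2/3, which shows how a technique can score differently on the two measures.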


Method                  Interaction Type
Y2H                     physical, binary
PCA                     physical, binary
AP-MS                   physical, complex
Synexpression           functional
Genetic interactions    functional
Genome analysis         functional

Table 1.1: Different techniques to identify protein-protein interactions, and types of interactions captured by each technique.

Genetic interactions. If two nonessential genes cause lethality when simultaneously

knocked out, they are often functionally associated, and thus their encoded proteins

may interact [12, 148]. This technique allows pairwise detection of functionally associ-

ated interactions.

Genome analysis. Some protein interactions can be detected via computational methods [112]. For example, (1) interacting proteins often show similar phylogenetic profiles,⁶ and (2) seemingly unrelated genes are sometimes fused together into one polypeptide chain. Though fast and inexpensive, these computational methods rely heavily on existing data, such as phylogenetic trees and orthology⁷ between proteins, which are prone to biases towards well-studied proteins.

Observe that different techniques have different goals in detecting interactions. While

Y2H, PCA, and AP-MS methods are designed to discover physical bindings between pro-

teins, the others seek to predict functional associations between entities such as tran-

scriptional regulators and their associated pathways. The different coverage and ac-

curacy of the data causes each technique to produce a unique distribution of interac-

tions [141]. Consequently, when comparing or merging PPI data from different experi-

mental techniques, one must carefully take these intrinsic differences into account. Ta-

ble 1.1 summarizes different aspects of the methods discussed in this section.

6 Across a number of species, each gene can be checked for presence, resulting in a binary vector called a phylogenetic profile. If two proteins interact in the same biological pathway, the corresponding genes would show similar phylogenetic profiles.

7 Homologous sequences are orthologous if they were separated by a speciation event, as opposed to paralogous if caused by a gene duplication event.


1.2 Computational Analyses of PPI Data

The traditional way to model PPI networks is to construct a combinatorial graph where

each node represents a protein, and two nodes are joined by an edge if and only if there is

an interaction between the corresponding proteins. While this classical model captures

the most basic information about the interactome, it does not accommodate various

sources of errors that are common in many experimental datasets.

On the other hand, protein complexes are hidden away within the constructed (bi-

nary) PPI network. Discovering these complexes would allow us to model the inter-

actome as a more complex system. Therefore, further post-experimental analyses are

imperative to obtain a proper view of PPI networks. In this section, we review compu-

tational methods for post-processing experimental PPI data, and we discuss how such

approaches can help characterize unknown properties of the interactome.

1.2.1 Graph Models for PPI Networks

One approach to the analysis of PPI networks is looking at topological properties of the

networks. While studying the topological structure of PPI networks may not immedi-

ately lead us to biological discoveries, understanding the characteristics of the networks

is typically a crucial step toward modelling real world networks.

Local Structure Models

Network motifs. Network motifs are patterns of interconnections (induced subgraphs)

that occur in a given biological network much more often than they would in random

networks. Shen-Orr et al. [127] have shown that the transcriptional regulatory network

of Escherichia coli contains three highly frequent motifs, each with a distinct function in

gene expression. This suggests that identifying different motifs in a PPI network would

help us describe different functional protein groups within the network and, further-


Figure 1.5: Degree distribution of yeast PPI network of 1210 proteins with 2357 interactions [93]. On a log-log plot, the distribution fits a power law $y = a x^{-\gamma}$, where $a = 874.87$ and $\gamma = 1.809$.

more, determine the function of uncharacterized proteins.

Graphlets. A similar notion of local structure called graphlets has been proposed by

Przulj et al. [117]. Graphlets are all possible simple graphs over 3 to 5 vertices (up to

isomorphism). In order to analyze the global structure of PPI networks, Przulj et al.

looked at the frequency distribution of each graphlet appearing in the PPI networks, and

compared it against other global model networks such as Erdős–Rényi random graphs,

scale-free networks, and geometric random networks (see below). While network mo-

tifs for a given network capture the most significant local structures within the network,

this bottom-up approach of graphlet distribution provides a metric to compare different

networks. More specifically, the distribution of each graphlet measured by the relative

frequency of graphlets allows us to analyze networks of different sizes. This is particu-

larly useful when PPI networks are compared to various graph models (see next section)

whose network sizes may vary.
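As a minimal illustration of a relative graphlet frequency, the sketch below is restricted to the two connected graphlets on 3 vertices (the path and the triangle). This restriction is a simplifying assumption: the approach of Przulj et al. [117] covers graphlets on up to 5 vertices, and the function name and toy network here are hypothetical.

```python
from itertools import combinations


def three_node_graphlet_frequencies(edges):
    """Relative frequency of induced 3-vertex paths and triangles."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    counts = {"path": 0, "triangle": 0}
    # Brute force over all vertex triples; adequate for small examples only.
    for a, b, c in combinations(sorted(adj), 3):
        k = (b in adj[a]) + (c in adj[a]) + (c in adj[b])
        if k == 3:
            counts["triangle"] += 1
        elif k == 2:
            counts["path"] += 1  # exactly two edges on three vertices is a path
    total = sum(counts.values())
    return {name: (n / total if total else 0.0) for name, n in counts.items()}


# Toy network: a triangle A-B-C with a pendant vertex D attached to C.
freq = three_node_graphlet_frequencies(
    [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
)
```

Because the counts are divided by their total, the resulting vector can be compared between networks with different numbers of vertices, which is the property the paragraph above relies on.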


Global Structure Models

Scale-free networks. In many real world networks, including PPI networks, the degree

distribution of the nodes follows the power law $P(k) \approx k^{-\gamma}$; that is, there are many more

nodes with a small number of neighbours than nodes with a very large number of neigh-

bours. Networks that exhibit such a degree distribution are called scale-free [10], and this

property has profound effects on real world networks. For example, scale-free networks

are resistant to random failures on the nodes due to the small number of high-degree

hubs – there are many more low-degree nodes which are equally likely to fail. This ro-

bustness can also be observed in PPI networks, and the connectivity of a node can be

correlated to the essentiality of the corresponding protein. For example, studies on the

yeast PPI network [79] have shown that:

1. 93% of the yeast proteins have degree at most 5, but only 21% of these cause lethality when deleted;

2. only 0.7% of the proteins have degree at least 15, but 62% of these are essential.

These results suggest that non-essential proteins, whose disruption is non-lethal,

tend to have lower degree than essential proteins. Moreover, proteins in the same func-

tional group tend to show similar number of interacting partners. Such structure-function

relationships can be exploited for the prediction of protein interactions or complexes

that are yet unknown, as we discussed in Section 1.1.3.
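The power law degree distribution described above can be checked numerically. The following sketch estimates the exponent by ordinary least squares on a log-log scale; this is only one crude estimator, chosen here for brevity, and it is an assumption rather than the fitting procedure used for Figure 1.5. The function name and the synthetic degree sequence are hypothetical.

```python
import math
from collections import Counter


def fit_power_law(degrees):
    """Fit count(k) = a * k**(-gamma) by least squares on log-log axes."""
    hist = Counter(d for d in degrees if d > 0)  # degree -> number of nodes
    if len(hist) < 2:
        raise ValueError("need at least two distinct positive degrees")
    xs = [math.log(k) for k in hist]
    ys = [math.log(hist[k]) for k in hist]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, gamma)


# Synthetic degree sequence with count(k) = 64 / k**2 for k = 1, 2, 4, 8.
degrees = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
a, gamma = fit_power_law(degrees)
```

On this synthetic input the fit recovers a close to 64 and gamma close to 2. On real PPI data the tail of the histogram is noisy, so the estimate should be read as indicative only.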

Geometric random networks. A geometric random network [113] is a graph model

where the nodes correspond to randomly distributed points in a metric space, and nearby

nodes are joined by an edge. Przulj et al. [117] used the graphlet distributions to charac-

terize PPI networks as a geometric random network. In particular, upon comparing the

yeast and fruitfly PPI networks against various random graph models (including scale-

free graphs), the graphlet distributions of these PPI networks were closer to that of ge-

ometric random networks than any other model. This result suggests that the geomet-


ric random network may be a better model for PPI networks than the widely accepted

scale-free networks.
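A geometric random network as defined above can be generated in a few lines. The sketch below assumes points drawn uniformly from the unit square with the Euclidean metric and a fixed connection radius; the choice of space, metric, and parameter names is illustrative and not prescribed by [113] or [117].

```python
import math
import random


def geometric_random_graph(n, radius, seed=0):
    """Place n points uniformly in the unit square; join pairs within radius."""
    rng = random.Random(seed)  # seeded for reproducibility
    points = [(rng.random(), rng.random()) for _ in range(n)]
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= radius:
                edges.add((i, j))
    return points, edges


points, edges = geometric_random_graph(50, 0.2, seed=1)
```

The radius controls the edge density, so it would have to be tuned so that the model network has roughly as many edges as the PPI network it is compared against.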

1.2.2 Identification of Protein-Protein Interactions

The PPI data from high throughput experiments such as Y2H, PCA or AP-MS contain a

significant amount of noise, and such high false positive/negative ratios deteriorate the

overall confidence level of identified interactions. These high throughput methods have all been used to perform large-scale studies of the yeast proteome [58, 70, 93, 134, 149],

but comparative assessments of those datasets continuously reveal that a significant

portion of identified interactions in one study is unique to that dataset and not observed

in others [78, 141]. Therefore, in silico approaches to improve the confidence level of PPI

data are an attractive topic of research.

Topology-based PPI detection. Inevitably, the PPI data from high throughput experiments are used as the initial input. Since the global view of the currently known interactome may be heavily biased due to incomplete experiments, authors often resort to examining the local topological structure. For example, Saito et al. [122, 123] developed a measure called interaction generality (IG1, IG2), which considers the restricted

neighbourhood for each protein in the network. The main idea behind this measure

is that if a protein interacts with many interaction partners but these interaction part-

ners exhibit no further interactions among themselves then these interactions are more

likely to be false positives. This idea works particularly well for PPI data from Y2H exper-

iments, since some “sticky” proteins in the Y2H assays have a tendency to turn on the

positive signals by themselves, irrespective of their interaction partners.

More recently, two other measures have been proposed by Chen et al. [27, 26], and

Pei and Zhang [111]. The method of Chen et al., called Interaction Reliability by Alter-

native Path (IRAP) [27], considers alternative paths between two interacting partners.

If a pair of proteins exhibit a direct interaction, it is likely that there are alternative se-


quences of interactions that join the two proteins. Hence, IRAP looks for the shortest

(measured by distance) such alternative path for each pair of proteins, and the final

confidence level is determined by the strength of the discovered alternative path.

On the other hand, Pei and Zhang [111] have shown that, by considering all alterna-

tive paths of different lengths, a better result can be achieved. For each pair of vertices,

they consider all paths of length k > 1. Then, the proposed measure called PathRatio

is computed as a weighted sum of all the paths between two vertices. Since enumerat-

ing all paths between two vertices is computationally hard, a rough estimate is obtained

by only computing for small values of k . Using the PPI data from various sources, the

PathRatio method achieved better results in functional homogeneity, localization ho-

mogeneity, and gene expression distance when compared against a simulation using

IRAP. On the other hand, the iterative hill-climbing version of IRAP, called IRAP∗ [26],

showed similar performance to PathRatio.

While these topology-based methods attempt to assign confidence levels to inter-

actions, the resulting PPI networks still contain high ratios of false positives and false

negatives. This may be due to the fact that the models used in these methods do not

truly reflect the nature of the experimental PPI data – for example, the input PPI data

may contain a large number of indirect interactions from AP-MS experiments, but no

known algorithm addresses this issue explicitly (the algorithms discussed in this section

are evaluated only using the PPI data from Y2H experiments).

Phylogeny-based PPI detection. The poor accuracy and coverage of topology-based

methods may also be due to the lack of information contained in the PPI data itself.

To address this, several approaches have been proposed using phylogenetic informa-

tion. For example, a protein interaction in one species may be inferred from Rosetta

Stone proteins (two genes fused into a single one) in another organism (Gene Fusion

Method [131]). Or, in the case of bacterial genomes, the conservation of the order of

genes may be a good indication of interactions (the Gene Order Conservation Method [37]).


A potentially more powerful approach is using a set of reference organisms’ genomes.

A phylogenetic profile of a protein A is a vector whose i -th entry represents the presence

or the absence of protein A in organism i . Under the assumption that physically in-

teracting proteins coevolve, the interaction can be inferred from the similarity of two

phylogenetic profiles (the Phylogenetic Profile Method [112]). Moreover, the phyloge-

netic trees for two proteins could also be used in comparative analyses. In this method,

a multiple sequence alignment (MSA) is first obtained from a set of reference organ-

isms, and the phylogenetic trees for protein A and B are constructed using the position

of orthologs in the MSA. Then the similarity score can be computed by comparing the

distance matrices of the two phylogenetic trees. For a more comprehensive exposition

of these methods see Jothi and Przytycka [82].
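To make the phylogenetic profile comparison concrete, the sketch below scores two binary profiles by the fraction of reference organisms on which they agree. This agreement score is an assumption made for illustration; the Phylogenetic Profile Method [112] may use a different similarity measure, and the profiles shown are invented.

```python
def profile_similarity(profile_a, profile_b):
    """Fraction of reference organisms in which two binary profiles agree."""
    if len(profile_a) != len(profile_b) or not profile_a:
        raise ValueError("profiles must be non-empty and of equal length")
    agree = sum(1 for a, b in zip(profile_a, profile_b) if a == b)
    return agree / len(profile_a)


# Entry i is 1 if the protein is present in reference organism i, else 0.
protein_a = [1, 0, 1, 1, 0, 1]
protein_b = [1, 0, 1, 0, 0, 1]  # differs from protein_a in one organism
protein_c = [0, 1, 0, 0, 1, 0]  # complementary to protein_a

score_ab = profile_similarity(protein_a, protein_b)
score_ac = profile_similarity(protein_a, protein_c)
```

Under the coevolution assumption stated above, the pair with the higher score (here A and B) would be the better candidate for an interaction.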

Protein-protein docking. Another approach for detecting protein interactions is done

by structural analysis of protein molecules: namely, protein-protein docking. Given a

pair, or a small set of proteins, accurate and efficient protein docking algorithms pro-

vide a useful tool for understanding both the tendency for protein interactions and the

structure of protein complexes. Various docking algorithms have been proposed, and

often work in two phases: (1) A population of candidate conformations are generated;

and (2) each of the candidates is scored in order to find the most likely structural confor-

mations (see, for example, Azé et al. [6] and Choi [31]). To verify the correctness of these

methods, they are often compared against a database of NMR and X-ray structures [77].

1.2.3 Identification of Protein Complexes

While the PPI networks represent binary interactions between two proteins, protein

complexes with three or more interacting partners are difficult to identify immediately

from these networks. However, the protein complexes are often highly connected sub-

graphs in PPI networks, and thus graph clustering algorithms are often used to detect

these complexes.


Molecular Complex Detection (MCODE) algorithm. Bader and Hogue [7] proposed

the three-stage algorithm, MCODE, to identify protein complexes. First, the vertices are

given weights according to a local density measure, called the core-clustering coefficient.

Let N [v ] denote the subgraph induced by a vertex v and its neighbours. Then, a k -core

of v is a subgraph of N [v ], induced by vertices with degree at least k . The highest k -core

refers to a nonempty k -core with the highest k , and the core-clustering coefficient of a

vertex v is then calculated as the edge density of the highest k -core of v .
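The vertex weighting stage can be sketched as follows. The code assumes a simple graph without self-loops, computes the highest k-core of N[v] by repeatedly removing vertices of degree less than k, and defines edge density as the number of edges divided by n(n-1)/2. The density convention and all function names are assumptions of this sketch, not a transcription of the MCODE implementation [7].

```python
def highest_k_core(adj):
    """Return (k, vertices) for the non-empty k-core with the largest k."""
    current = {v: set(ns) for v, ns in adj.items()}
    best_k, best = 0, set(adj)
    k = 1
    while True:
        changed = True
        while changed:  # peel vertices whose remaining degree is below k
            low = [v for v, ns in current.items() if len(ns) < k]
            changed = bool(low)
            for v in low:
                for u in current.pop(v):
                    if u in current:
                        current[u].discard(v)
        if not current:
            return best_k, best
        best_k, best = k, set(current)
        k += 1


def core_clustering_coefficient(graph, v):
    """Edge density of the highest k-core of N[v] (v plus its neighbours)."""
    closed = graph[v] | {v}
    induced = {u: graph[u] & closed for u in closed}
    _, core = highest_k_core(induced)
    n = len(core)
    if n < 2:
        return 0.0
    m = sum(len(induced[u] & core) for u in core) // 2  # each edge counted twice
    return m / (n * (n - 1) / 2)


# Toy graph: triangle A-B-C, with D attached to A and E attached to D.
graph = {
    "A": {"B", "C", "D"}, "B": {"A", "C"}, "C": {"A", "B"},
    "D": {"A", "E"}, "E": {"D"},
}
weight_a = core_clustering_coefficient(graph, "A")
weight_d = core_clustering_coefficient(graph, "D")
```

In the toy graph, the highest k-core of N[A] is the triangle, so A receives weight 1, whereas D sits on a path and receives the lower weight 2/3.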

After each vertex is assigned a weight, the second stage of the algorithm recursively

starts to expand the clusters. The vertex with the highest weight is initially set to form

a cluster, and all of its neighbours are joined to the cluster if (i) the weight of the neigh-

bour is above a given threshold; and (ii) the neighbour has not been previously explored.

When no further expansion is possible, the next heaviest unexplored vertex is set to form

a singleton cluster, and the process is repeated. Note that the clusters found so far are

disjoint at this stage of the algorithm.

Finally, the third phase of the algorithm is a post-processing step, where clusters

that are not sufficiently dense are discarded from the set of clusters. Furthermore, any

unexplored vertices are joined to nearby clusters. At this stage, these unexplored vertices

may be joined to multiple clusters, thereby creating overlapping clusters.

Markov Cluster algorithm. The Markov Cluster (MCL) algorithm, proposed by Van

Dongen [139], simulates random walks on the graph by alternating two operations, ex-

pansion and inflation, to transform one set of probabilities into another. First, a weight

matrix of G is normalized into a matrix M so that each column of M sums to 1. Then each column of M is a probability distribution, where each entry M(i, j) corresponds to the probability of going from vertex j to vertex i. The inflation $\Delta_r$ of M for a power coefficient r is defined as:

$$\Delta_r(M(p, q)) = \frac{M(p, q)^r}{\sum_{i=1}^{n} M(i, q)^r}$$


Hence, higher values of r would yield tighter clusters by favouring more probable walks

on the graph.

On the other hand, the expansion operation creates longer walks on the graph by

computing the e-th power of the associated matrix M, where e is the expansion parameter. This operation expands the information flow on the network. The MCL algorithm alternates the expansion and inflation operations until M stabilizes.

When the algorithm reaches its equilibrium, the resulting matrix would correspond to a

collection of star-like components, each of which is declared a protein complex.
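The two operations can be sketched directly from the definitions above. The code uses plain nested lists, a column-stochastic matrix, and a simple convergence test; the default parameters (e = 2, r = 2), the added self-loops, and the stopping tolerance are assumptions of this sketch and are not taken from Van Dongen [139].

```python
def normalize_columns(m):
    """Scale each column of a square matrix so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][q] for i in range(n)) for q in range(n)]
    return [[m[p][q] / sums[q] if sums[q] else 0.0 for q in range(n)]
            for p in range(n)]


def inflate(m, r):
    """Raise every entry to the power r, then renormalize each column."""
    return normalize_columns([[x ** r for x in row] for row in m])


def expand(m, e):
    """Return the e-th matrix power of m (e is a positive integer)."""
    n = len(m)
    result = m
    for _ in range(e - 1):
        result = [[sum(result[i][k] * m[k][j] for k in range(n))
                   for j in range(n)] for i in range(n)]
    return result


def mcl(weights, e=2, r=2.0, max_iter=100, tol=1e-9):
    """Alternate expansion and inflation until the matrix stops changing."""
    m = normalize_columns(weights)
    for _ in range(max_iter):
        nxt = inflate(expand(m, e), r)
        delta = max(abs(a - b) for row_new, row_old in zip(nxt, m)
                    for a, b in zip(row_new, row_old))
        m = nxt
        if delta < tol:
            break
    return m


# Toy input: two triangles joined by the edge 2-3, with self-loops added.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
weights = [[1.0 if i == j else 0.0 for j in range(6)] for i in range(6)]
for u, v in edges:
    weights[u][v] = weights[v][u] = 1.0
final = mcl(weights)
```

In the usual reading of the limit matrix, the non-zero entries of each remaining row list the vertices attached to one attractor, and each such group is reported as a cluster.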

Min-cut based algorithms. Though not applied directly to the protein complex detec-

tion problem, several approaches have been proposed to use minimum cut algorithms

when finding clusters in biological networks (e.g. [69, 126]). These approaches first start

with a similarity graph as an input, and repeatedly partition the set of vertices into clus-

ters until a partition satisfies a particular stopping criterion. For example, in Hartuv et

al. [69], the notion of highly connected subgraph (HCS) is used as a stopping criterion.

A subgraph H is a HCS if the global minimum s−t cut of H is at least $|V(H)|/2$. The initial

PPI network is then recursively partitioned into two clusters by minimum s−t cuts until

each connected component is a HCS.
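The recursive partitioning can be sketched as below. For brevity the minimum cut is found by brute force over all bipartitions, which is exponential and only suitable for toy graphs; a practical implementation would substitute a polynomial-time minimum cut algorithm. The threshold follows the definition stated above, and the function names are hypothetical.

```python
from itertools import combinations


def min_cut(vertices, edges):
    """Brute-force global minimum cut; returns (size, one side of the cut)."""
    vs = sorted(vertices)
    anchor, rest = vs[0], vs[1:]
    best_size, best_side = None, None
    # The anchor's side takes k of the other vertices; k < len(rest) keeps
    # the opposite side non-empty.
    for k in range(len(rest)):
        for chosen in combinations(rest, k):
            side = {anchor, *chosen}
            size = sum(1 for u, v in edges if (u in side) != (v in side))
            if best_size is None or size < best_size:
                best_size, best_side = size, side
    return best_size, best_side


def hcs(vertices, edges):
    """Split along minimum cuts until every part is highly connected."""
    vertices = set(vertices)
    if len(vertices) < 2:
        return [vertices]
    size, side = min_cut(vertices, edges)
    if size >= len(vertices) / 2:  # stopping criterion stated in the text
        return [vertices]
    other = vertices - side
    left = [(u, v) for u, v in edges if u in side and v in side]
    right = [(u, v) for u, v in edges if u in other and v in other]
    return hcs(side, left) + hcs(other, right)


# Toy input: two triangles joined by the single edge C-D.
edges = [("A", "B"), ("B", "C"), ("A", "C"),
         ("D", "E"), ("E", "F"), ("D", "F"), ("C", "D")]
clusters = hcs("ABCDEF", edges)
```

On the toy input the only cut of size 1 separates the two triangles, and each triangle has a minimum cut of 2, which meets the threshold of 3/2, so the recursion stops with two clusters.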

Simultaneous clustering algorithm. More recently, Narayanan et al. [106] proposed

a framework, called JointCluster, that finds a clustering of multiple networks that inte-

grates large-scale datasets including PPI data and gene expression data. Extending from

a previous study on clustering a single graph by Kannan et al. [84], JointCluster finds

sparse cuts to generate clusters that allow theoretical approximation guarantees on the

quality of the detected clustering relative to the optimal clustering. The biological sig-

nificance of the resulting clusters preserved across physical and coexpression networks

is verified by known protein complexes in the literature or enrichment of functionally

coherent pathways.


In a recent study by Brohee and van Helden [20], four clustering algorithms were evaluated against PPI networks built from the annotated MIPS database: MCL, Restricted Neighbourhood Search Clustering (RNSC) by King et al. [90], Super Paramagnetic Clustering (SPC) by Blatt et al. [15], and MCODE. In particular, to test for robustness of the

algorithms against false positives and false negatives, edges were randomly added and

removed to generate 41 altered graphs. After comparing against the hand-curated pro-

tein complexes from the MIPS database, the MCL algorithm clearly outperformed the

other methods under most conditions, while RNSC and MCL showed similar behaviours

in some tests such as altered graphs with edge removal but no edge addition.

Note that these complex detection methods use similarity matrices that are often

populated using the number of times each pairwise PPI is observed. However, the sim-

ilarity between two proteins can be interpreted in various ways, e.g., (1) the confidence

level of a direct interaction, (2) the strength of an interaction, or (3) the timing of an

interaction to distinguish those that are transient. The existing complex detection methods, however, do not reflect how these similarity matrices are interpreted. In particular, if the PPI data contain many indirect interactions, highly overlapping clusters will be difficult to separate using the existing PPI data. Furthermore, because the clustering algorithms tend to first create disjoint clusters, and then join the ones that are

close to each other, they often do not distinguish overlapping protein complexes well.

As such, methods to discover a collection of overlapping clusters are still in demand in

order to properly model PPI networks as hypergraphs⁸.

1.3 AP-MS based PPI Networks

In the previous section, we introduced the types of post-processing and analysis that

need to be done computationally, and reviewed different approaches proposed in the

literature. In general, these approaches do not take into account the experimental tech-

8 A hypergraph is a set system where, in the context of PPI networks, each set corresponds to a protein complex.


nique from which the dataset originates, and such generic models often fail to capture

the inherent nature of the produced data.

To give an example, the methods introduced in Sections 1.2.1 and 1.2.2 assume that

the dataset contains only direct, binary interactions. This is a reasonable assumption

when the dataset is from Y2H experiments, but when analyzing AP-MS data, indirect

interactions must be filtered out prior to applying these methods. On the other hand, the

complex detection methods discussed in Section 1.2.3 are designed to find individual

clusters within the network, and as is the case for most clustering algorithms, finding

overlapping complexes remains a difficult problem in general.

In this section, we take a brief look at how the analyses are carried out for PPI net-

works constructed from AP-MS experiments. In particular, we focus on two of the most

recent large scale studies on AP-MS based PPI networks from a bioinformatics point of

view.

1.3.1 Protein Complexes in Yeast

In Krogan et al. [93], the protein-protein interactions in the yeast were detected by the

AP-MS method. In order to improve the accuracy as well as the coverage, two MS pu-

rifications were done independently. Furthermore, both interacting partners for each

protein-protein interaction were tagged and purified in order to ensure greater data con-

sistency and reproducibility. After 2357 successful purifications of proteins, 4087 yeast

proteins were identified as preys with high confidence scores from the mass spectrom-

etry.

After the interactome with confidence scores had been established, protein com-

plexes were detected using the Markov Clustering algorithm (described in Section 1.2.3),

where the expansion and inflation operators were appropriately chosen to optimize the

overlap against the MIPS database. The Markov Clustering algorithm identified 547 dis-

joint protein complexes, half of which were not present in the MIPS database. Further-


more, new proteins were identified for most complexes that had been known previously.

For the purpose of the graph theoretic analysis of the protein complexes, a soft-

ware plug-in (http://genepro.ccb.sickkids.ca) for the Cytoscape environment

was written to model the interactome in two different ways: (1) the traditional PPI net-

work model, and (2) the protein complex network model, where each protein complex is

represented as a node, and two nodes are joined by an edge if and only if the correspond-

ing complexes share common subunits. Using these two models, the authors were able

to verify (or re-verify) several hypotheses and findings from the previous literature; for

example, (1) proteins in the same complex should have similar function and co-localize

to the same subcellular compartment, and (2) highly connected proteins within the net-

work tend to be highly conserved, and conversely, highly conserved proteins tend to be

more highly connected and central (measured by betweenness⁹) to the network.
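The protein complex network model described above is straightforward to construct once the complexes are known. The sketch below assumes each complex is given as a set of protein identifiers; the complex names and memberships are invented for illustration and do not come from Krogan et al. [93].

```python
from itertools import combinations


def complex_network(complexes):
    """Join two complexes by an edge iff they share at least one subunit."""
    edges = set()
    for a, b in combinations(sorted(complexes), 2):
        if complexes[a] & complexes[b]:
            edges.add((a, b))
    return edges


# Hypothetical complexes: C1 and C2 share subunit P3, C3 is isolated.
complexes = {
    "C1": {"P1", "P2", "P3"},
    "C2": {"P3", "P4"},
    "C3": {"P5", "P6"},
}
network = complex_network(complexes)
```

Note that this model is only informative when the detected complexes are allowed to overlap; a disjoint clustering would yield a network with no edges.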

On the other hand, as discussed in Section 1.2.3, the Markov Clustering algorithm

does not necessarily separate two or more complexes that share subunits, and the au-

thors point out that the identified clusters may contain multiple complexes. Identifying

such overlapping complexes remains to be explored, and better suited complex detec-

tion algorithms will provide a finer view of the interactome.

1.3.2 Modularity in Yeast Protein Complexes

Gavin et al. [58] also studied the protein-protein interactions in the yeast via the AP-

MS method, and successfully purified 1993 distinct proteins, for which 2760 interacting

partners were identified via mass spectrometry. In order to measure the reproducibil-

ity, 138 purifications were done repeatedly, and 69% of the repeated purifications were

common, giving rough estimates of the false-positive and false-negative ratios.

As acknowledged by the authors, the current graph clustering approaches for iden-

9 Betweenness is a measure of graph centrality of a node. For a given node v, Betweenness(v) is typically defined as the number of shortest s−t paths that pass through v (sometimes normalized by the total number of s−t paths), where $s \neq v \neq t$.



tifying protein complexes are inappropriate for the PPI data from AP-MS experiments

since they do not explicitly capture the nature of the purification output. To avoid this

problem, the socio-affinity index measure was devised to quantify the propensity of pro-

teins forming an interaction. This index measures the log-odds of the number of times

two proteins are observed together, relative to what would be expected from their fre-

quency in the data set. In particular, the socio-affinity index A(i , j ) is defined as

$$A(i, j) = S_{i,j \mid i=\mathrm{bait}} + S_{i,j \mid j=\mathrm{bait}} + M_{i,j},$$

where $S_{i,j \mid i=\mathrm{bait}}$ measures the tendency for protein j to be pulled down when protein i is used as a bait (known as the spoke model), and $M_{i,j}$ measures the tendency for two proteins to co-purify when other proteins are tagged as a bait (known as the matrix model).
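The difference between the spoke and matrix models can be illustrated by listing the protein pairs that each model extracts from a set of purifications. The sketch below only enumerates pairs and does not compute the log-odds scores themselves, whose exact form is not reproduced here. Following the description above, the matrix pairs are taken among preys only; this reading, the function names, and the toy purification are assumptions.

```python
from itertools import combinations


def spoke_pairs(purifications):
    """Bait-prey pairs: each prey is linked to the bait that pulled it down."""
    pairs = set()
    for bait, preys in purifications:
        for prey in preys:
            if prey != bait:
                pairs.add(frozenset((bait, prey)))
    return pairs


def matrix_pairs(purifications):
    """Prey-prey pairs: preys co-purified with the same (other) bait."""
    pairs = set()
    for bait, preys in purifications:
        others = sorted(set(preys) - {bait})
        for a, b in combinations(others, 2):
            pairs.add(frozenset((a, b)))
    return pairs


# One hypothetical purification: bait X pulls down preys Y and Z.
purifications = [("X", ["Y", "Z"])]
spoke = spoke_pairs(purifications)
matrix = matrix_pairs(purifications)
```

For the single purification shown, the spoke model yields X-Y and X-Z while the matrix model yields Y-Z, even though the data alone cannot tell whether Y and Z bind each other directly.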

The socio-affinity model was the first attempt to quantify physical measurements

of the interactions purely from the AP-MS data. These computed values can populate

an n ×n matrix representing a PPI network, and a graph clustering algorithm (not de-

scribed) is then used in order to define protein complexes within this network. However,

since each protein can belong to multiple protein complexes, a disjoint clustering from

a single run of the clustering algorithm is unlikely to form the appropriate protein com-

plexes. To avoid this pitfall, several runs of the clustering algorithm are carried out using

different parameters.

Using this approach, 1784 different sets of complexes were generated, and were com-

pared against hand-curated known complexes. Consequently, the sets of complexes

that score highly (> 70%) in coverage and accuracy show similar clusterings. Thus

the top scoring set of complexes is merged with other high scoring sets of complexes

to form variants of complex formulations called complex isoforms. Interestingly, these

complex isoforms showed that the proteins within each complex can be partitioned into

two types: (1) core components, which are the subunits of a complex that are present in

most isoforms, and (2) modules, which are present in only some of the isoforms, but al-

ways together. Furthermore, each module (which can be regarded as a building block


of a complex) appeared to be present in multiple complexes, where the core compo-

nents differ. The authors verified that this organization of complexes is due to biological

phenomena by looking at the co-expression level during the cell cycle, as well as co-

localization in the cell. Furthermore, the cores and modules that were detected within

the same complex tend to belong to the same functional category, or otherwise, the

core–module cross-talk between functional categories shows many known connections

such as between protein synthesis, transcription, and the cell cycle.

This modularity of protein complexes offers new insight into the structure of

PPI networks in that each protein complex can be further partitioned into functional

building blocks – cores and modules, and this structural property of protein complexes

may be beneficial to designing new complex detection algorithms. On the other hand,

while the socio-affinity index attempts to quantify the probability measure for direct

protein interactions, the spoke model and the matrix model described above only con-

sider 1- or 2-hop neighbours of each protein from the raw PPI data. From the bioin-

formatics standpoint, devising a better suited measure that fully models the stochastic

nature of the AP-MS PPI data would certainly improve our understanding of the pro-

teome.

1.4 Contributions and Thesis Outline

Due to its ability to test for protein co-complex memberships, AP-MS has been the

method of choice for various recent studies in high-throughput proteomics. However,

the AP-MS method has its own drawbacks that require systematic curation as well as

reinterpretation of the experimental data. In this thesis, we (1) investigate three such

limitations from a mathematical viewpoint, (2) formulate combinatorial optimization

problems for each identified limitation, and (3) study algorithmic approaches with the

appropriate applications in mind.

Chapter 2. We give an introduction to the mathematical tools that are used extensively


throughout this thesis. This includes techniques from approximation algorithms

and combinatorial optimization as well as graph parameters that are useful for

designing efficient algorithms for computationally intractable problems.

Chapter 3. Protein Quantification. When testing the existence and the abundance of

proteins, mass spectrometry can only look at short fragments of proteins (known

as peptides), and then identify which protein the peptide belongs to. However,

there are a number of “ambiguous” peptides that belong to several different pro-

teins, which makes the identification of constituent proteins a difficult task. We

study the problem of quantifying the protein abundance despite the presence of

these ambiguous peptides: given a set S of peptide sequences and a set W of protein sequences, together with the peptide abundance σ, find the protein abundance X for each protein. We formulate this problem as several mathematical programs, establish their hardness, and devise approximation algorithms that can predict the absolute abundance of all constituent proteins. This manuscript is to be submitted [89].

• Ethan Kim, Adrian Vetta, Mathieu Blanchette, “Protein quantification with

shared peptides using multi cover algorithms”, in preparation.

Chapter 4. Direct PPI network. Once the PPI data from AP-MS experiments are quan-

tified, one needs to be sure that the identified interactions are accurate. However,

as discussed in this chapter, a significant fraction of the detected interactions are

artifacts of indirect interactions, i.e. detected via chains of simultaneous interac-

tions. We study the problem of separating the direct interactions from indirect

ones, in order to minimize the false positives in the AP-MS data. Here, we intro-

duce a probabilistic graph model that captures the direct and indirect interactions

within the AP-MS data. Then, under this model, the problem of distinguishing di-

rect interactions from indirect ones is formulated, and we provide an algorithm

that is highly customized for the AP-MS data. This work is published in [88].


• Ethan Kim, Ashish Sabharwal, Adrian Vetta, Mathieu Blanchette, “Predict-

ing Direct Protein Interactions from Affinity Purification Mass Spectrometry

Data”, Algorithms for Molecular Biology, 5:34, (2010).

Chapter 5. Hypergraph modelling of PPI data. PPI networks are traditionally represented

as combinatorial graphs where each interaction is assumed to occur between two

proteins. Recent studies, however, revealed interactions involving several proteins

working simultaneously. Such discoveries of protein complexes as building blocks

of the proteome led to much interest in modelling PPI networks as hypergraphs.

We thus study the problem of constructing hypergraphs from binary PPI data us-

ing algorithms for the (edge) clique cover problem. This work resulted in various

efficient algorithms for both PPI networks and other restricted classes of graphs

such as graphs with bounded parameters, and an approximation algorithm in the

case of planar graphs, as published in [14].

• Mathieu Blanchette, Ethan Kim, Adrian Vetta, “Clique Cover on Sparse Net-

works”, The 9th SIAM Meeting on Algorithm Engineering & Experiments (ALENEX)

93-102, Kyoto, Japan (2012)

Chapter 6. We conclude and give a list of open problems.

Chapter 2

Approximation Algorithms and

Optimization Techniques

This chapter introduces topics from discrete mathematics that are preliminary to dis-

cussions throughout the thesis. In particular, we focus on algorithmic techniques that

are frequently used to tackle problems in computational biology. As such, during this

pedagogical exposition, we shall discuss problems with applications in computational

biology in order to motivate the reader. We shall first define NP-hard optimization

problems in Section 2.1. Then, Section 2.2 discusses graph parameters that often al-

low us to design efficient algorithms for computationally hard problems. Various tech-

niques to design approximation algorithms are discussed in Section 2.3, including α-

approximation algorithms, polynomial time approximation schemes, and LP-based meth-

ods. Finally we introduce genetic algorithms in Section 2.4.

2.1 NP-hard Optimization Problems

Combinatorial optimization problems ask to find an optimal object from a finite set

of objects [125]. Typical problems include shortest path, minimum spanning tree, and


the travelling salesman problem, and these problems can often be stated as: given A, find B of size k, where k is the size of the solution, which the problem seeks to either

minimize or maximize. If k is indeed the minimum (maximum, resp.) possible value

admitting a solution, we say k is optimum.

From the perspective of computational complexity, we say that a problem Π is in

P if there is an algorithm to solve Π efficiently.1 On the other hand, if a problem Π is

such that its solution can be verified efficiently, we say Π is in NP. By definition, this

implies that problems in P belong to NP. The hardest problems in NP deserve special attention: if every problem in NP can be reduced to a problem Π within polynomial time (without requiring that Π belongs to NP), we say that Π is NP-hard. Furthermore, if Π also belongs to NP itself, we say that Π is NP-complete. Consequently, if an NP-complete problem (one of the hardest problems in NP) can be solved in polynomial time, then every problem in NP can be solved in polynomial time. In this sense, NP-complete problems lie at the core of the class NP.

Consequently, it is of great interest in the theoretical computer science community

to decide whether NP \ P = ∅, and thereby P = NP. Since it is widely believed that this is not

true, algorithm designers face difficulties in finding efficient solutions to these prob-

lems. Indeed, as is the case for problems that are discussed in this thesis, optimization

problems in computational biology often turn out to be NP-hard. As a result, we must apply various techniques to either find plausible solutions to these problems, or restrict ourselves to special cases of the problem that allow efficient algorithms. We shall introduce some of these techniques in the following sections.

1 An algorithm is said to be efficient if its running time can be bounded by a polynomial in the input length. We avoid more rigorous definitions for these classes of problems; precise definitions involving Turing machines and certificates can be found in various complexity theory texts, for example [4, 56].


2.2 Graph Parameters

Throughout this thesis, let G = (V, E) be a finite graph with vertices V and edges E. Unless stated otherwise, we assume G is simple and undirected. The numbers of vertices and edges are |V| = n and |E| = m, respectively.

A graph parameter Γ is a function that assigns a non-negative number to every graph G. For a class of graphs 𝒢, Γ(𝒢) denotes the maximum Γ(G) over all graphs G ∈ 𝒢, and we say a graph class 𝒢 has bounded Γ if Γ(𝒢) ∈ O(1).

Graph parameters are useful when designing efficient algorithms with parameterized running time. In particular, a problem is fixed-parameter tractable (FPT) if it can be solved in f(k) · |I|^{O(1)} time, where f is a computable function that depends only on some parameter k and is independent of the input size |I|.

A good example is treewidth, which measures how “tree-like” a given graph is. To

define treewidth, we need the notion of tree decompositions.

Definition 2.1. [120] A tree decomposition of a graph G = (V, E) is a pair (X = {X_i | i ∈ N}, T = (N, L)) where each tree node i ∈ N is associated with a set of vertices X_i ⊆ V, such that

(1) ⋃_{i∈N} X_i = V.

(2) For each (v, w) ∈ E, there is an i ∈ N with v, w ∈ X_i.

(3) For each v ∈ V, the set of tree nodes {i ∈ N | v ∈ X_i} induces a subtree of T.

For clarity, we shall refer to (tree) nodes of the tree decomposition T, and vertices of the original graph G. The width of a tree decomposition (X, T) is defined as max_{i∈N} |X_i| − 1, and the treewidth of a graph G, denoted tw(G), is the minimum width over all tree decompositions of G. Following this definition, it is easy to see that all trees with at least one edge have treewidth 1. See Figure 2.1 for an example of a tree decomposition of width 3. In general, it is NP-complete to decide whether the treewidth of a graph is at most a given value [3]. However, for a fixed k, graphs with


Figure 2.1: A graph G and its tree decomposition T of width 3.

treewidth k can be recognized, and a width k tree decomposition can be constructed in

linear time [17].
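The three conditions of Definition 2.1 can be checked mechanically. The following is a minimal sketch in Python, not part of the thesis; the function names `is_tree_decomposition` and `width`, and the dictionary representation of the bags, are our own illustrative choices, and the tree edges passed in are assumed to form a tree.

```python
from collections import deque


def is_tree_decomposition(vertices, edges, bags, tree_edges):
    """Check conditions (1)-(3) of Definition 2.1.

    vertices   : iterable of graph vertices
    edges      : iterable of graph edges (v, w)
    bags       : dict mapping each tree node i to its vertex set X_i
    tree_edges : iterable of pairs of tree nodes (assumed to form a tree)
    """
    vertices = set(vertices)

    # (1) The union of all bags is exactly V.
    covered = set()
    for bag in bags.values():
        covered |= set(bag)
    if covered != vertices:
        return False

    # (2) Every graph edge lies inside at least one bag.
    for v, w in edges:
        if not any(v in bag and w in bag for bag in bags.values()):
            return False

    # (3) For each vertex, the tree nodes whose bags contain it are connected in T.
    adjacent = {i: set() for i in bags}
    for i, j in tree_edges:
        adjacent[i].add(j)
        adjacent[j].add(i)
    for v in vertices:
        nodes = {i for i, bag in bags.items() if v in bag}
        start = next(iter(nodes))
        seen, queue = {start}, deque([start])
        while queue:
            i = queue.popleft()
            for j in adjacent[i]:
                if j in nodes and j not in seen:
                    seen.add(j)
                    queue.append(j)
        if seen != nodes:
            return False
    return True


def width(bags):
    """Width of a decomposition: size of the largest bag, minus one."""
    return max(len(bag) for bag in bags.values()) - 1


# A path a-b-c-d and a decomposition whose bags are the three edges.
path_vertices = ["a", "b", "c", "d"]
path_edges = [("a", "b"), ("b", "c"), ("c", "d")]
path_bags = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c", "d"}}
path_tree = [(0, 1), (1, 2)]
print(is_tree_decomposition(path_vertices, path_edges, path_bags, path_tree))
print(width(path_bags))
```

This only verifies a given decomposition; it does not construct one, which is the hard part addressed by the linear-time algorithm cited above.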

A tree decomposition of a graph is especially useful for designing divide and con-

quer algorithms: given a decomposition tree T , one can pick an arbitrary tree node r

as root of T . Then, for each tree node X , the problem can be solved for the subgraph

induced by the vertices associated with X , and all its descendant nodes. Building up the

solution from leaf nodes to r often provides a polynomial time algorithm (parameter-

ized by the treewidth) if solutions to the subproblems can be combined using dynamic

programming2.

Another related parameter for graph sparsity is branchwidth.

Definition 2.2. [121] A branch decomposition (T, φ) of a graph G is characterized by a ternary tree3 T, and a bijection φ from the leaves of T to the edges of G.

2 Dynamic programming is a method for solving complex problems by subdividing the problem into simpler subproblems. This technique is useful when there are overlapping subproblems, so that one can solve the problem in a bottom-up fashion.

3 A tree T is a ternary tree if every non-leaf node has degree 3.

Figure 2.2: A graph G and its branch decomposition T. The e-separation shown on the decomposition tree T corresponds to a middle-set of {5, 6, 8}.

Let e denote a tree edge in T. Removing e from T partitions T into two subtrees T1 and T2, and this partition induces a partition of the edges of G into two subgraphs G1 and G2, called an e-separation, where G1 and G2 consist of the edges associated with the leaves of T1 and T2, respectively. The set of vertices in G that are shared by both G1 and G2 is called the middle-set of e, and the width of this separation is the number of vertices in the middle-set.
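As a small illustration of the middle-set, the sketch below computes it for a given e-separation. It is not taken from the thesis: the function name `middle_set` is hypothetical, the separation is supplied directly as two edge sets rather than derived from a ternary tree, and the particular split of the 3 × 3 grid is chosen by us for illustration (it is not necessarily the separation drawn in Figure 2.2).

```python
def middle_set(edges_1, edges_2):
    """Vertices that are endpoints of edges on both sides of an e-separation."""
    vertices_1 = {v for edge in edges_1 for v in edge}
    vertices_2 = {v for edge in edges_2 for v in edge}
    return vertices_1 & vertices_2


# The 3 x 3 grid graph on vertices 1..9 (12 edges), split into two edge sets.
side_1 = [(1, 2), (2, 3), (1, 4), (2, 5), (3, 6), (4, 5), (4, 7), (7, 8)]
side_2 = [(5, 6), (5, 8), (6, 9), (8, 9)]

shared = middle_set(side_1, side_2)
print(sorted(shared), "width of this separation:", len(shared))
```

The width of a whole branch decomposition would then be the maximum of `len(middle_set(...))` over the separations induced by all tree edges.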

Given a branch decomposition (T,φ), the width of this branch decomposition is the

maximum width over all e -separations in T , and the branchwidth of G , denoted bw(G ),

is the minimum width over all branch decompositions. Figure 2.2 gives an example of

a graph and a branch decomposition. It is well known that the branchwidth is closely

related to the treewidth of a graph by the sandwich theorem [121]:

\[ \mathrm{bw}(G) \le \mathrm{tw}(G) + 1 \le \tfrac{3}{2}\,\mathrm{bw}(G). \]

These graph parameters often allow us to design efficient algorithms to solve oth-

erwise NP-hard problems, and such problems come in different flavours: The input


graphs sometimes have bounded graph parameters due to the nature of the problem.

In some other cases, different techniques can be used when the graph has unbounded

parameters, so we can restrict ourselves to the bounded case. Sometimes, the graph

decomposition gives rise to efficient algorithms to find near-optimal solutions.

All these techniques will be exploited extensively for modelling PPI networks as hy-

pergraphs in Chapter 5. Indeed, graph decomposition finds applications in a number

of different areas within computational biology. The perfect phylogeny problem, which

will be discussed in the next section, is polynomially equivalent to the problem of trian-

gulating coloured graphs, and it can be solved in polynomial time when the treewidth is

bounded [16, 92]. Within the field of proteomics, several studies have shown that pro-

tein interaction networks often exhibit low treewidth (see, e.g. Yamaguchi et al. [147]

and Cheng et al. [28]). This opens the door to designing efficient algorithms for many

problems on PPI networks. For example, Dost et al. [43] study the problem of querying

the existence of a pathway Q within a base network G . While the problem of deciding if a

graph is a subgraph of another is an NP-hard problem in general [56], they focus on the

restricted class of graphs with bounded treewidth, and provide an efficient algorithm to

query such pathways in a database of molecular networks.4

2.3 Approximation Algorithms

Many optimization problems in bioinformatics are NP-hard, and often researchers re-

sort to heuristics for finding solutions that are close to optimal. A useful line of attack is

the theory of approximation algorithms, which offers compelling techniques for solving

NP-hard optimization problems with provable guarantees. For a minimization prob-

lem, an algorithm is an α-approximation algorithm if it can find in polynomial time

a solution of value at most α·OPT, where OPT is the value of an optimal solution. The

4 In Dost et al.’s work, similarity scores between vertices are computed using their sequence similarity, which is then used when matching a query network to the base network. The classical problem of subgraph isomorphism on graphs with bounded parameters has been studied by Eppstein [45, 46].


usage of approximation algorithms in bioinformatics includes genome assembly (short-

est superstring problem), multiple sequence alignment problem, and phylogenetic tree

construction (Steiner tree problem), just to name a few.

Despite the diverse application of approximation algorithms in bioinformatics, the

area of PPI networks has seen limited applications. In particular, to the best of our

knowledge, no combinatorial algorithm has been proposed so far to identify direct pro-

tein interactions (Chapter 4) or protein complexes (Chapter 5) purely from the topolog-

ical structure of the given PPI network. Such scarcity of applications may be due to the

fact that even defining a clean objective function is a nontrivial task. In fact, most ex-

isting approaches propose different objective functions, for which heuristic algorithms

are used. Moreover, it is often unclear whether these approaches directly capture the

experimental processes that produced the PPI data.

In this section, we present a glimpse of approximation algorithms by discussing a

few problems as examples. Since the focus of this section is to introduce techniques

for devising approximation algorithms, we refrain from discussing the latest results on

each problem with the best approximation guarantee. For more extensive treatments, Vazirani [140] and Williamson and Shmoys [145] give excellent tutorials on the field.

2.3.1 Lower-bounding the Optimum

When designing an approximation algorithm, one needs to compare the cost of the ob-

tained solution with the cost of an optimal solution in order to establish the approxima-

tion guarantee. This is typically done by finding a good lower bound (for minimization

problems) on the cost of an optimal solution. For example, consider the vertex cover

problem:

CARDINALITY VERTEX COVER. Given an undirected graph G = (V, E), find a set C ⊆ V such that every edge has at least one endpoint in C, and |C| is minimized.


A lower bound on the size of an optimal vertex cover can be found using a maximal matching5 of G. With respect to a maximal matching M, any vertex cover has to pick at least one endpoint of each edge in M, and these endpoints are distinct because no two edges of M share a vertex. Therefore, the size of M is at most the size of an optimal vertex cover. This gives rise to the following simple algorithm:

1. Find a maximal matching M in G.

2. Output both endpoints of edges in M.

Observe that the chosen vertices cover all the edges of G, as otherwise any uncovered edge could have been added to the matching. Furthermore, the size of the cover is exactly 2|M|. Since |M| ≤ OPT, the size of the cover is at most 2 · OPT, and thus the algorithm is a 2-approximation algorithm.
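The two steps above can be sketched in a few lines of Python. This is an illustrative sketch rather than code from the thesis; the function name `vertex_cover_2approx` is our own, and the graph is assumed to be simple (no self-loops) and given as a list of edges. The maximal matching is built greedily by scanning the edges once.

```python
def vertex_cover_2approx(edges):
    """Return (cover, matching) where cover has size exactly 2 * len(matching)."""
    matching = []
    cover = set()
    for u, v in edges:
        # An edge with both endpoints still uncovered can be added to the matching.
        if u not in cover and v not in cover:
            matching.append((u, v))
            cover.update((u, v))
    return cover, matching


# A path 1-2-3-4: an optimal cover is {2, 3}, the sketch returns all four vertices.
path = [(1, 2), (2, 3), (3, 4)]
cover, matching = vertex_cover_2approx(path)
print(sorted(cover), matching)
```

On this path the output has size 4 while the optimum has size 2, so the factor of 2 in the analysis is attained.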

Application: Phylogenetic Trees [18]. Consider a set of m species with n possible characteristics. We can represent this by a 0–1 matrix M where M_{i,j} = 1 if species i has characteristic j, and M_{i,j} = 0 otherwise. Then, we say M has a perfect phylogenetic tree if there exists a rooted tree T such that: (1) there are m leaves, one for each species; (2) each characteristic label corresponds to one edge in T (there may be edges with blank labels); (3) for each leaf s_i, the path from the root to s_i contains exactly those characteristics possessed by s_i. See Figure 2.3 for an example of M and a possible perfect phylogenetic tree.

Let S_j be the set of species that have characteristic c_j. Then the following characterization holds.

Theorem 2.3. M has a perfect phylogenetic tree if and only if {S_1, . . . , S_n} form a laminar family.6

Thus, we say that two characteristics c_i and c_j have a conflict if S_i and S_j intersect, while S_i \ S_j ≠ ∅ and S_j \ S_i ≠ ∅. If we remove characteristics so that there are no conflicts, then

5 Given a graph G = (V, E), a subset of edges M ⊆ E is a matching if no two edges in M share a vertex. We say a matching is maximal if no more edges of E \ M can be added to M.

6 A set system is a laminar family if, for each pair of sets S_i and S_j, either S_i ⊆ S_j, or S_j ⊆ S_i, or S_i ∩ S_j = ∅.


      c1  c2  c3  c4  c5
s1     1   1   0   0   0
s2     0   0   1   0   0
s3     1   1   0   0   1
s4     0   0   1   1   0
s5     0   1   0   0   0

Figure 2.3: Matrix M indicating the characteristics of a set of species (left) and a perfect phylogenetic tree for M (right). Tree edges with ∅ indicate unlabelled edges.

there is a perfect phylogenetic tree for the remaining characteristics. Thus, we want

to find the smallest set of characteristics to remove. We can turn this problem into an

instance of vertex cover: define a conflict graph whose vertices correspond to characteristics c_1, . . . , c_n, and vertices c_i and c_j share an edge if and only if they conflict. Then, a minimum cardinality vertex cover in the conflict graph is a minimum set of characteristics whose removal eliminates all conflicts.
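The construction of the conflict graph can be sketched as follows. This is an illustrative Python sketch, not code from the thesis; the function name `conflict_graph` is hypothetical, characteristics are indexed from 0, and the second matrix is a made-up example that contains a conflict.

```python
from itertools import combinations


def conflict_graph(matrix):
    """Edges (i, j) between characteristics whose species sets conflict.

    matrix[s][j] == 1 means that species s has characteristic j.
    """
    n_chars = len(matrix[0])
    species_sets = [
        {s for s, row in enumerate(matrix) if row[j] == 1} for j in range(n_chars)
    ]
    conflicts = []
    for i, j in combinations(range(n_chars), 2):
        a, b = species_sets[i], species_sets[j]
        # Conflict: the sets intersect, yet neither contains the other.
        if (a & b) and (a - b) and (b - a):
            conflicts.append((i, j))
    return conflicts


# The matrix of Figure 2.3 is laminar, so its conflict graph has no edges.
figure_matrix = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0],
]
print(conflict_graph(figure_matrix))

# A hypothetical matrix in which the first two characteristics conflict.
toy_matrix = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
]
print(conflict_graph(toy_matrix))
```

Any vertex cover routine can then be run on the returned edge list to select the characteristics to remove.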

2.3.2 Polynomial Time Approximation Schemes

Some NP-hard optimization problems are approximable to an arbitrary degree defined

by the error parameter ε > 0. For an NP-hard optimization problem Π with an objective function f, we say that algorithm A is an approximation scheme for Π if on input (I, ε), A outputs a solution s such that:

• f(I, s) ≤ (1 + ε) · OPT if Π is a minimization problem.

• f(I, s) ≥ (1 − ε) · OPT if Π is a maximization problem.

The algorithm A is called a polynomial time approximation scheme (PTAS) if, for each fixed ε > 0, the running time of A is bounded by a polynomial in the size of the instance I. Note that the running time of a PTAS may depend arbitrarily on ε; for example, it may be exponential in 1/ε. If, in addition, the running time is bounded by a polynomial in both |I| and 1/ε, we say the algorithm is a fully polynomial time approximation scheme (FPTAS). Here we give an example of an FPTAS for the knapsack problem.

KNAPSACK PROBLEM. Given a set S = {a_1, . . . , a_n} of objects with cost(a_i) ∈ Z+ and profit(a_i) ∈ Z+, and a budget B ∈ Z+, find a subset of S whose total cost is bounded by B, and whose total profit is maximized.

A PTAS is often designed by first mapping the problem instance to a coarser instance, which is then solved exactly by dynamic programming. Before describing the FPTAS for the knapsack problem, we introduce the notion of pseudo-polynomial time algorithms. An algorithm is efficient if its running time

is bounded by a polynomial in |I |, the size of the instance. However, some problems

contain numbers as part of the problem instance. For example, a knapsack problem

instance contains cost and profit for each object a_i. Let |I_u| denote the size of a problem instance I, where all the numbers in the instance are written in unary. Then, an algorithm is a pseudo-polynomial time algorithm if its running time is bounded by a polynomial in |I_u|. Because the knapsack problem is NP-hard, we do not know whether

an efficient algorithm exists; however, there is a pseudo-polynomial time algorithm for

this problem.

Let P be the profit of the most profitable object in S. Then nP is an upper bound on the profit of an optimal solution. For every 1 ≤ i ≤ n and 1 ≤ p ≤ nP, let S(i, p) denote a subset of {a_1, . . . , a_i} whose total profit is precisely p and whose total cost is minimized. Further, define A(i, p) to be the cost of the set S(i, p). If no such set exists, set A(i, p) = ∞. We also adopt the convention A(i, 0) = 0, since the empty set has profit 0 and cost 0. Then A(1, p) is known for each value of p, where 1 ≤ p ≤ nP, and the following recurrence holds.

\[
A(i+1, p) =
\begin{cases}
\min\{A(i, p),\ \mathrm{cost}(a_{i+1}) + A(i, p - \mathrm{profit}(a_{i+1}))\}, & \text{if } \mathrm{profit}(a_{i+1}) \le p \\
A(i, p), & \text{otherwise}
\end{cases}
\]

A dynamic programming algorithm using this recurrence will compute all values of A(i, p) in O(n²P) time. Finally, the optimal solution is found in time O(nP) by looking for max{p | A(n, p) ≤ B}. Thus, this is a pseudo-polynomial time algorithm for the knapsack problem.
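A minimal Python sketch of this dynamic program is given below. It is our own illustration rather than code from the thesis: the function name `knapsack_by_profit` is hypothetical, the table uses an extra row for "no objects" with A(0, 0) = 0 as the base case instead of starting from A(1, p), and the comparison is written with ≤ so that a single object of profit exactly p is handled.

```python
import math


def knapsack_by_profit(costs, profits, budget):
    """Exact knapsack via the table A(i, p) = min cost to reach profit exactly p.

    Returns (best profit, sorted list of chosen object indices).
    Runs in O(n^2 * P) time, where P is the largest single profit.
    """
    n = len(costs)
    top = n * max(profits)  # nP, an upper bound on any achievable profit

    A = [[math.inf] * (top + 1) for _ in range(n + 1)]
    A[0][0] = 0  # the empty set has profit 0 and cost 0
    for i in range(n):
        for p in range(top + 1):
            A[i + 1][p] = A[i][p]  # object i is not used
            if profits[i] <= p:    # object i is used
                A[i + 1][p] = min(A[i][p], costs[i] + A[i][p - profits[i]])

    best = max(p for p in range(top + 1) if A[n][p] <= budget)

    # Walk the table backwards to recover one optimal set.
    chosen, p = [], best
    for i in range(n, 0, -1):
        if A[i][p] != A[i - 1][p]:
            chosen.append(i - 1)
            p -= profits[i - 1]
    return best, sorted(chosen)


print(knapsack_by_profit(costs=[3, 4, 5], profits=[4, 5, 6], budget=7))
```

The table has (n + 1)(nP + 1) entries, which is polynomial in the unary size of the instance but not in its binary size.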

Observe that, if the profit of each object is bounded by a polynomial in n , the above

algorithm would run in polynomial time. This observation leads us to an FPTAS for the

knapsack problem. We first scale the profit down to small numbers so that the profit

of each object is polynomially bounded by n . Then we can run the above algorithm to

compute the optimal solution with respect to the scaled profit function. By carefully

scaling with respect to the error parameter ε, we shall obtain a solution whose profit is at least (1 − ε) · OPT in time polynomial in |I| and 1/ε.

Here is the FPTAS for the knapsack problem:

1. Let K = εP/n.

2. For each object a_i, define profit′(a_i) = ⌊profit(a_i)/K⌋.

3. Run the dynamic programming algorithm with the scaled profit function profit′ to obtain the most profitable set S′.

4. Return S′.
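The four steps can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the thesis implementation: the name `knapsack_fptas` is hypothetical, objects whose cost exceeds B are discarded first (so that the most profitable remaining object is feasible on its own), ε is converted to an exact fraction to avoid floating-point rounding in Step 2, and the dynamic program is written with a dictionary keyed by scaled profit rather than a full table.

```python
import math
from fractions import Fraction


def knapsack_fptas(costs, profits, budget, eps):
    """Return indices of a set with profit at least (1 - eps) * OPT."""
    eps = Fraction(eps)
    items = [i for i in range(len(costs)) if costs[i] <= budget]
    if not items:
        return []

    n = len(items)
    P = max(profits[i] for i in items)
    K = eps * P / n                                                  # Step 1
    scaled = {i: math.floor(Fraction(profits[i]) / K) for i in items}  # Step 2

    # Step 3: best[p] = (minimum cost, chosen objects) for scaled profit exactly p.
    best = {0: (0, [])}
    for i in items:
        # Iterate over a snapshot so that each object is used at most once.
        for p, (cost, chosen) in list(best.items()):
            q, new_cost = p + scaled[i], cost + costs[i]
            if new_cost <= budget and (q not in best or new_cost < best[q][0]):
                best[q] = (new_cost, chosen + [i])

    return best[max(best)][1]                                        # Step 4


picked = knapsack_fptas(costs=[3, 4, 5], profits=[40, 50, 60], budget=7,
                        eps=Fraction(1, 2))
print(picked, sum([40, 50, 60][i] for i in picked))
```

Smaller values of ε give finer scaling and therefore a larger range of scaled profits, which is where the 1/ε factor in the running time comes from.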

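We now show that profit(S′) ≥ (1 − ε) · OPT. Let S∗ denote the optimal set with respect to the original profit function. For each object a_i, due to the rounding down at Step 2, K · profit′(a_i) can be smaller than profit(a_i), but by at most K. Therefore,

\[ \mathrm{profit}(S^*) - K \cdot \mathrm{profit}'(S^*) \le nK. \]

Furthermore, S′ must be at least as good as S∗ under the scaled profits. By scaling the profits back up by K we have:

\[ \mathrm{profit}(S') \ge K \cdot \mathrm{profit}'(S') \ge K \cdot \mathrm{profit}'(S^*) \ge \mathrm{profit}(S^*) - nK = \mathrm{OPT} - \varepsilon P \ge (1 - \varepsilon) \cdot \mathrm{OPT}, \]

since OPT ≥ P (assuming that every object has cost at most B, so that the most profitable object alone is a feasible solution).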

The running time of the algorithm is O(n²⌊P/K⌋) = O(n³ · 1/ε), which is polynomial in both n and 1/ε.

Application: Target Protein Set. Due to limited resources, structural biologists often face the problem of choosing which proteins to study. Because of the diverse nature of protein structures, certain protein molecules are more difficult to crystallize than others. It would be beneficial, therefore, if the chosen set of proteins is not too expensive to investigate. Furthermore, certain proteins are more important than others due to biological relevance, homology against existing results on similar proteins, etc. As such, researchers want to find the set of target proteins that maximizes the total importance while staying within their budget.

We can model this simple problem of task selection as follows: for each protein a_i, define the cost of investigating it to be cost(a_i), and the importance of the particular protein to be profit(a_i). If we let B be the total resources that we have at our disposal, an optimum solution to this knapsack problem gives us the most profitable set of proteins we can study.


2.3.3 LP-based Algorithms

Linear programming provides a powerful framework in which approximation algorithms can be designed. Many NP-hard optimization problems can be stated as integer programs. While solving an integer program is NP-hard in general, a (fractional) linear program can be solved in polynomial time, and thus the linear relaxation of the integer program provides a natural way of bounding the cost of the optimal solution.

A widely used approach for designing LP-based approximation algorithms is LP-rounding. In this approach, a fractional solution to the LP-relaxation is first obtained, and it is then converted to an integral solution by a rounding scheme. The approximation guarantee is established by comparing the cost of the fractional and integral solutions. For a more detailed discussion of this technique, see [140]. Here we present a simple example of the LP-rounding technique via a weighted version of the set cover problem.

WEIGHTED SET COVER. Given a universe U of n elements, a collection of subsets of U, 𝒮 = {S_1, . . . , S_k}, and a cost function c : 𝒮 → Z+, find a minimum cost subcollection of 𝒮 that covers all elements of U.

We can formulate this problem as an integer program by assigning a binary variable x_S to each set S ∈ 𝒮, whose value is set to 1 if S is picked in the set cover, and 0 otherwise. Furthermore, for each element e ∈ U, at least one of the sets containing e must be picked. So we have:

\[
\begin{array}{lll}
\text{minimize} & \sum_{S \in \mathcal{S}} c(S) \cdot x_S & \\
\text{subject to} & \sum_{S : e \in S} x_S \ge 1 & \forall e \in U \\
& x_S \in \{0, 1\} & \forall S \in \mathcal{S}
\end{array}
\]


Let OPT denote the optimum value of this integer program. The LP-relaxation of this integer program is obtained by replacing the domain of variables x_S with 0 ≤ x_S ≤ 1 for all S ∈ 𝒮. The derived linear program can then be solved optimally to obtain a fractional solution of value OPT_f. By construction, OPT_f ≤ OPT since the optimum integral solution is feasible for the LP.

Then, we can apply the rounding scheme as follows: pick all sets S for which x_S ≥ 1/α in the optimal fractional solution, where α is the frequency of the most frequent element (i.e. the maximum number of sets containing an element). Let C denote the resulting subcollection of sets chosen by the rounding scheme. To see that C is a valid set cover, consider an arbitrary element e. Since e is in at most α sets and the values x_S of these sets sum to at least 1, one of these sets must have been assigned at least 1/α by the fractional solution. Thus, e is covered by C.

To check the approximation ratio, observe that the rounding scheme increases each x_S by a factor of at most α. Therefore, the cost of C is at most α · OPT_f ≤ α · OPT, and the rounding scheme is an α-approximation algorithm.
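The rounding step itself is short enough to sketch in Python. The sketch below is ours, not the thesis's: the function name `round_fractional_cover` is hypothetical, and the fractional solution x is taken as an input, assumed to come from any LP solver, so no particular solver API is used. The guarantee that the rounded cover costs at most α times the fractional cost holds for any feasible x.

```python
from fractions import Fraction


def round_fractional_cover(universe, sets, x):
    """Round a feasible fractional set cover x to an integral cover.

    universe : set of elements
    sets     : dict mapping a set name to its elements
    x        : dict mapping a set name to its fractional value in [0, 1]
    Returns (chosen set names, alpha).
    """
    # alpha = maximum number of sets that contain any single element.
    alpha = max(sum(1 for s in sets.values() if e in s) for e in universe)

    # Check that x satisfies every covering constraint of the LP relaxation.
    for e in universe:
        if sum(x[name] for name, s in sets.items() if e in s) < 1:
            raise ValueError(f"x does not fractionally cover element {e!r}")

    threshold = Fraction(1, alpha)
    chosen = {name for name in sets if x[name] >= threshold}
    return chosen, alpha


# Toy instance: three sets of unit cost, each element lies in exactly two sets.
universe = {1, 2, 3}
sets = {"A": {1, 2}, "B": {2, 3}, "C": {1, 3}}
cost = {"A": 1, "B": 1, "C": 1}
x = {name: Fraction(1, 2) for name in sets}  # a feasible fractional solution

chosen, alpha = round_fractional_cover(universe, sets, x)
fractional_cost = sum(cost[name] * x[name] for name in sets)
rounded_cost = sum(cost[name] for name in chosen)
print(sorted(chosen), alpha, rounded_cost <= alpha * fractional_cost)
```

In this instance the fractional solution costs 3/2 while any integral cover needs two sets, so the relaxation is strictly cheaper than the integer optimum.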

Application: Representative Protein Set. In Section 2.3.2, we introduced the problem of choosing a set of target proteins to investigate. Here we modify this problem slightly. Instead of maximizing the profit we gain by studying a set of proteins, we consider the notion of a representative protein set: from a large collection of protein molecules, we would like to study the ones that somehow represent the entire collection of proteins. Such a representative profile would provide us with a succinct, yet global picture of the given protein sample.

We can model this problem as follows: for each protein x, define the cost function w(x) ≥ 0 measured by the difficulty of crystallizing x. Then, for each pair (x, y) of proteins, define a distance function d(x, y) ≥ 0 measured by the sequence alignment of the two proteins. We can then construct a graph G = (V, E) where each vertex corresponds to a protein, and two vertices are joined by an edge if and only if the corresponding proteins are within some prescribed distance ∆. We say a subset V′ ⊆ V of vertices is representative if every vertex in V \ V′ is adjacent to at least one vertex in V′. Then the problem of finding a minimum cost representative set is precisely the dominating set problem.


WEIGHTED DOMINATING SET. Given a graph G = (V, E ) and a cost function c : V → Z+,

find a minimum cost subset D ⊆ V such that every vertex not in D is joined to at least

one vertex in D.

It is well known that dominating set is equivalent to set cover [83]. To reduce dominating set to set cover, construct a set cover instance as follows: the universe U is V. For each vertex v ∈ V(G), create a set S_v containing the vertex v and all vertices adjacent to v. For each set S_v, define its cost to be c(v). Now, if D is a dominating set for G, then C = {S_v : v ∈ D} is a feasible set cover with the same cost. Conversely, if C = {S_v : v ∈ D} is a set cover, then D is a dominating set for G with the same cost.
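The reduction can be written down directly. The following Python sketch is illustrative only; the function names and the adjacency-dictionary representation of G are our own assumptions. It builds the set cover instance and checks, on a small path, that a vertex set is dominating exactly when the corresponding sets form a cover.

```python
def dominating_set_to_set_cover(adjacency, cost):
    """Build the set cover instance: one set S_v (closed neighbourhood) per vertex v."""
    universe = set(adjacency)
    sets = {v: {v} | set(adjacency[v]) for v in adjacency}
    set_cost = {v: cost[v] for v in adjacency}
    return universe, sets, set_cost


def is_dominating(adjacency, D):
    """Every vertex outside D has at least one neighbour in D."""
    return all(v in D or any(u in D for u in adjacency[v]) for v in adjacency)


def is_cover(universe, sets, chosen):
    """The chosen sets together contain every element of the universe."""
    return set().union(*(sets[v] for v in chosen)) == universe


# A path a - b - c with unit costs.
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
cost = {"a": 1, "b": 1, "c": 1}
universe, sets, set_cost = dominating_set_to_set_cover(adjacency, cost)

for D in ({"b"}, {"a"}, {"a", "c"}):
    print(sorted(D), is_dominating(adjacency, D), is_cover(universe, sets, D))
```

The two checks agree on every choice of D, which is the equivalence used above.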

Therefore, we can solve the representative protein set problem using the LP-rounding

scheme for set cover.

2.4 Genetic Algorithms

In order to solve optimization problems, a search heuristic called a genetic algorithm

(GA) often offers useful solutions. Inspired by natural evolution, genetic algorithms are

especially useful when a large number of candidate solutions can be efficiently gener-

ated, and pairs of candidates can be somehow “merged” to obtain a better offspring

of the two parent solutions. Each candidate solution is typically encoded as a string of

characters. A genetic algorithm can then be designed via the following four main phases:

I. Initialization: Many individual solutions are randomly generated to form an initial

population that spans the entire search space.

II. Fitness function: Each candidate solution is evaluated by the fitness function that

gives a score based on the optimization criteria.

III. Crossover: Given two parent solutions, the crossover operation generates an off-

spring based on its parents. Various methods have been proposed for the crossover


operation, with one point “cut-and-swap” being the simplest form, while it can be

fully customized in a problem-specific way.

IV. Mutation: In order to maintain genetic diversity from one generation to the next,

a prescribed portion of the population goes through a mutation operation. The

mutation operation is typically done by randomly switching characters in the can-

didate string, which can be fine-tuned depending on the application.
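The four phases above can be sketched on a toy problem. The Python sketch below is not the customized algorithm of Chapter 4; it is a generic illustration with hypothetical names and parameter values, candidates encoded as bit strings, one-point crossover, per-character mutation, and (as an extra assumption not stated in the text) elitism, meaning the best candidate is copied unchanged into the next generation.

```python
import random


def genetic_algorithm(fitness, length, pop_size=30, generations=40,
                      mutation_rate=0.05, seed=0):
    """Maximize `fitness` over bit strings of the given length.

    Returns (best candidate, list of best fitness values per generation).
    """
    rng = random.Random(seed)

    # I. Initialization: a random population of bit strings.
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    history = []

    for _ in range(generations):
        # II. Fitness: rank candidates and keep the better half as parents.
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        parents = ranked[: pop_size // 2]

        children = [ranked[0][:]]  # elitism: the best candidate survives as is
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            # III. Crossover: one-point cut-and-swap.
            cut = rng.randrange(1, length)
            child = mother[:cut] + father[cut:]
            # IV. Mutation: flip each character with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children

    return max(population, key=fitness), history


# Toy objective: the number of ones in the string.
best, history = genetic_algorithm(fitness=sum, length=20)
print(sum(best), history[0], history[-1])
```

Because of elitism the best fitness never decreases between generations, but, as with any such heuristic, nothing guarantees that the optimum is reached.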

Using these operators, a large pool of candidate solutions is maintained from one generation to another. Since a genetic algorithm is a search heuristic, one typically cannot provide a theoretical bound on the output quality. Further, the fitness function only tells us whether one candidate solution is better than other existing candidates, and thus it is never clear when to stop the algorithm. However, when a large initial population can be easily generated, and the fitness function can be evaluated quickly, carefully designed GAs provide an efficient approach to find good solutions. In this thesis, we shall present a highly customized genetic algorithm to find the direct physical PPI network in Chapter 4. For more detailed discussions on this optimization technique, see [64, 94].

Chapter 3

Protein Quantification with Shared

Peptides

With the advance of mass spectrometry (MS) based techniques, biologists in proteomics

research are not only interested in determining the presence of proteins within a mix-

ture, but also the quantitative nature of the constituent proteins, i.e. their abundance.

MS-based shotgun proteomics approaches this problem by first shattering the (unknown)

protein molecules into shorter fragments known as peptides, and then reading the pep-

tides present in the mixture using a mass spectrometer.

Mass spectrometers typically work by first producing ions with charges based on

the masses of the peptides, and then separating the ions by their mass-to-charge ratios.

Various methods have been proposed to convert the peptide molecules into ions in the

gas phase using electrospray ionization (ESI) [144] and matrix assisted laser desorption

ionization (MALDI) [85, 116]. The produced ions can then be detected and processed

into mass spectra. There are several algorithms to analyze the mass spectra and either

identify the peptides present or quantify the peptide abundance [59, 115, 135].

Among these quantification methods, approaches known as label-free methods have

become popular due to the low cost and simplicity of the experiments. These methods


work under the premise that more abundant peptides would produce a larger number

of spectra. By counting the number of spectra, we can obtain measures such as spectral counting [30, 109] and peptide counting [87, 143]. Peptide counting measures how well a given protein is represented in a sample by counting the number of unique peptides detected for it. On the other hand, spectral counting simply counts the number of spectra

for each peptide. As such, while peptide counting measures the coverage of the parent

protein sequence, spectral counting provides a concrete measure of peptide abundance

that can later be translated into protein abundance.

This chapter deals with the next step in the protein quantification problem: translat-

ing the peptide abundance into protein abundance. In order to identify (and quantify)

constituent proteins in a mixture, the detected peptides often serve as indicators for

the parent proteins. For example, one of the most widely used approaches for MS protein

identification is the Mascot score¹ [114]. Given a set of peptide mass spectra, the

Mascot score measures the probability that the match between the mass spectra and a

particular protein sequence is a random event (thus a lower probability yielding a higher

score), and is calculated for each protein in the protein sequence database. To compute

this measure, each protein sequence in the database is digested and fragmented in sil-

ico, and the resulting “theoretical spectrum” is compared against the experimental spec-

trum observed by mass spectrometers. Consequently, the computation of Mascot scores

involves considering all possible digested peptides for each protein in the database.

However, such an approach creates a problem when there is a peptide that belongs

to more than one parent protein; in other words, the mapping of peptides to proteins

is no longer one-to-one. These ambiguous peptides are called shared peptides, and

Nesvizhskii and Aebersold [108] recently highlighted the complications in protein iden-

tification with the presence of shared peptides. Unfortunately, shared peptides are preva-

lent in several mass spectrometry datasets. For example, proteins in a large protein family

have similar sequences that may create identical peptides when trypsinized, and alternative

splicing may also create nearly identical protein isoforms. However, in most

studies of protein identification and quantification, shared peptides are often simply

ignored [100, 151].

1 Mascot score is the in silico method to identify proteins from a sequence database, used by the Mascot software (http://www.matrixscience.com).

In this chapter, we study the problem of identifying and quantifying the constituent

proteins from MS spectra, without having to exclude the existence of shared peptides.

Depending on the mixture of proteins under investigation, the shared peptides can con-

stitute as much as 50% of the peptides in the mixture [42, 81]. Consequently, incorporating

the shared peptides into the analysis should lead to a more accurate quantification

of constituent proteins.

Recently, shared peptides have been a useful tool for protein quantification in sev-

eral studies. For example, Jin et al. [81] proposed a statistical model that finds groups of

proteins sharing a significant number of peptides, namely peptide-sharing closure groups

(PSCG). In their work, proteins grouped together in the same PSCG were shown to be bi-

ologically related, and when applied to data from differential expression analysis, all the

peptides within each group showed consistent abundance relative to controls, justifying

the inclusion of shared peptides in protein quantification studies.

On the other hand, Dost et al. [42] recently proposed a clean combinatorial model

for quantifying relative protein abundance. Their model is built around peptide-protein

relationships and, using the relative peptide abundance data from multiple samples,

they set up a linear program to obtain relative protein abundance. As their input data

is relative (e.g. peptide s_i is expressed 5 times more in sample x than in sample y),

their output is also in relative ratios across multiple samples. As such, solving the linear

program optimally (which gives a fractional value to the abundance of each protein)

sufficed to give a viable solution.

As described, existing studies of protein quantification using shared peptides focus

primarily on relative abundance measures across different samples. This may be be-

cause the quality of peptide abundance measures is relatively poor, and thus most stud-

ies only consider the differential profiling aspect of protein abundance. However, vari-


ous methods for absolute peptide quantification have been recently proposed [9, 61, 91],

and the quality of peptide abundance measures is improving rapidly. In this chapter,

therefore, we consider the problem of absolute protein quantification under the as-

sumption that absolute peptide abundance is available. In Section 3.1, we modify the

model of Dost et al. to suit our needs for absolute protein quantification, and formu-

late various optimization problems. Then, in Section 3.2, we establish the computa-

tional hardness of these problems, and design approximation algorithms in Section 3.3.

In Section 3.4, we evaluate the performance of our approaches using both simulated

and real MS data. Finally, Section 3.5 discusses how our approaches can be modified

to handle degenerate cases, and incorporate additional biological data such as peptide

detectability.

3.1 Problem Formulation for Protein Quantification

In this section, we formulate optimization problems for protein quantification. Let
W = {w_1, ..., w_m} denote the set of all known protein sequences, and let S = {s_1, ..., s_n}
denote the set of peptides we observe from a given mass spectrometry experiment². We
assume that we are given the set of protein sequences and peptide sequences, together with
the abundance of each observed peptide sequence. We then want to find the protein
abundance that can be constructed from the given peptide abundance. For simplicity,
we make the following assumptions.

1. For each protein w_j, a peptide s_i may appear in w_j at most once.

2. The absolute abundance for each observed peptide is given as a positive integer,
and thus protein abundance should be a nonnegative integer.

Lifting the first assumption can be done easily, as shall be discussed in Section 3.5.
The absolute peptide abundance can be estimated by the spectral counting method, as
it measures the number of corresponding mass spectra for each peptide.

2 We use w for words and s for syllables as an analogy to proteins and peptides.

[Figure 3.1 (bipartite graph, image not reproduced): peptides s_1 = QYUA, s_2 = TUAIL, s_3 = QWTA, s_4 = LTAU with abundances σ_1 = 4, σ_2 = 5, σ_3 = 8, σ_4 = 7; proteins w_1 = QYUATUAIL, w_2 = LTAUQWTA, w_3 = QWTATUAIL with abundances x_1 = 4, x_2 = 7, x_3 = 1.]

Figure 3.1: An example of the bipartite graph from peptides (s_1, ..., s_4) and the parent proteins (w_1, ..., w_3). Peptides s_2 and s_3 are shared peptides. The abundances of the peptides are indicated by σ_i, while the x_j's denote the protein abundances.

Let us construct a bipartite graph G = (S, W; E) with |S| = n and |W| = m, where (i, j) ∈ E
if and only if peptide s_i is a substring of protein w_j. Let A be an n × m matrix where
A_{i,j} = 1 if (i, j) ∈ E, and 0 otherwise. Let σ_i be a positive integer that denotes the
multiplicity of peptide s_i, for 1 ≤ i ≤ n, and let x_j denote the unknown multiplicity
of protein w_j. Figure 3.1 gives an example of the formulation. We can then encode our
problem as a system of linear equations as follows:

Problem 3.1.

\[
\begin{aligned}
\sum_{j=1}^{m} A_{i,j} \cdot x_j &= \sigma_i, && i = 1, \dots, n \\
x_j &\in \mathbb{Z}^{+} \cup \{0\}, && j = 1, \dots, m
\end{aligned}
\]
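As a small illustration of this encoding, the following sketch builds the incidence matrix A for the toy instance of Figure 3.1 by substring tests and checks that the protein multiplicities shown there satisfy Problem 3.1. The function names `incidence_matrix` and `coverage` are hypothetical helpers introduced only for this example.

```python
# Toy instance from Figure 3.1: peptides s_1..s_4 and proteins w_1..w_3.
peptides = ["QYUA", "TUAIL", "QWTA", "LTAU"]
proteins = ["QYUATUAIL", "LTAUQWTA", "QWTATUAIL"]
sigma = [4, 5, 8, 7]   # observed peptide multiplicities
x = [4, 7, 1]          # protein multiplicities shown in the figure


def incidence_matrix(peptides, proteins):
    """A[i][j] = 1 iff peptide i is a substring of protein j, else 0."""
    return [[1 if s in w else 0 for w in proteins] for s in peptides]


def coverage(A, x):
    """Left-hand side of Problem 3.1: sum_j A[i][j] * x[j] for each peptide i."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


A = incidence_matrix(peptides, proteins)
print(A)
print(coverage(A, x) == sigma)
```

Rows of A with more than one nonzero entry (here those of s_2 and s_3) correspond to shared peptides.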

This model captures the dependencies of peptides on proteins, and our problem reduces
to solving a system of linear equations. However, there remain a few practical
issues. The multiplicity of peptides σ_i from MS contains a significant amount of noise,
which would lead to inaccurate solutions or to an unsatisfiable system of equations.
Further, even if the peptide multiplicities are assumed to be accurate, the system may not
have a unique, integral solution. We thus need a relaxed set of constraints and an
optimization criterion to select the most plausible solution. We propose to modify this
model in four different ways, each capturing the data being analyzed in an increasingly
realistic manner.

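MULTICOVER. We can relax the equality constraints: rather than requiring that the x_j's sum up
to precisely σ_i for each peptide s_i, we turn them into covering inequalities, and minimize
the sum of the x_j's.

\[
\begin{aligned}
\text{minimize} \quad & \sum_{j=1}^{m} x_j \\
\text{subject to:} \quad & \sum_{j=1}^{m} A_{i,j}\, x_j \ge \sigma_i, && i = 1, \dots, n \\
& x_j \in \mathbb{Z}^{+} \cup \{0\}, && j = 1, \dots, m
\end{aligned}
\]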

Observe that this problem is closely related to the set cover problem. In fact, our
formulation above is a generalization of set cover known as multicover: instead of requiring
each element (peptide, in our case) to be covered at least once, each element needs to be
covered at least the prescribed number of times. If each set (protein, in our case) can be
picked at most once, the problem is called constrained multicover, while if a set can be
picked arbitrarily many times, the problem is called unconstrained multicover, which
is precisely the formulation above.
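To make the unconstrained multicover objective concrete, here is a minimal brute-force sketch on the toy instance of Figure 3.1. It is exponential in the number of proteins and is meant only for illustration, not as the algorithm used in this chapter; the helper names are hypothetical.

```python
from itertools import product

# Incidence matrix and peptide multiplicities of the Figure 3.1 toy instance.
A = [[1, 0, 0], [1, 0, 1], [0, 1, 1], [0, 1, 0]]
sigma = [4, 5, 8, 7]


def is_cover(A, x, sigma):
    """True iff every peptide i is covered at least sigma[i] times by x."""
    return all(sum(a * xj for a, xj in zip(row, x)) >= s
               for row, s in zip(A, sigma))


def brute_force_multicover(A, sigma):
    """Enumerate all integer vectors and return one minimizing sum(x)."""
    m = len(A[0])
    bound = max(sigma)  # no x_j ever needs to exceed the largest requirement
    best = None
    for x in product(range(bound + 1), repeat=m):
        if is_cover(A, x, sigma) and (best is None or sum(x) < sum(best)):
            best = x
    return best


print(brute_force_multicover(A, sigma))
```

On this instance the covering relaxation recovers the multiplicities of Figure 3.1, since s_1 forces x_1 ≥ 4, s_4 forces x_2 ≥ 7, and the shared peptides then require one more protein copy.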

MINIMUMPROTEINTYPES. One further refinement from above is to look for a solution
with the smallest number of different proteins, that is, minimize the number of x_j's with
positive values.

\[
\begin{aligned}
\text{minimize} \quad & \sum_{j=1}^{m} \chi(x_j) \\
\text{subject to:} \quad & \sum_{j=1}^{m} A_{i,j}\, x_j \ge \sigma_i, && i = 1, \dots, n \\
& x_j \in \mathbb{Z}^{+} \cup \{0\}, && j = 1, \dots, m
\end{aligned}
\]

where χ(x_j) = 1 if x_j > 0, and χ(x_j) = 0 otherwise.

In this problem, we want to find a solution that uses the smallest possible number of
distinct protein types. Since the objective function is nonlinear, the resulting problem is
no longer an integer linear program. While nonlinear integer programs are a much harder
class of problems in general, as we shall see later, we can still solve the above program
via a reduction to set cover.
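The two covering objectives can prefer different solutions. The sketch below evaluates both objectives on two feasible vectors for the toy instance of Figure 3.1; the vector (5, 8, 0) is an illustrative choice made for this example, and the helper names are hypothetical.

```python
# Incidence matrix and peptide multiplicities of the Figure 3.1 toy instance.
A = [[1, 0, 0], [1, 0, 1], [0, 1, 1], [0, 1, 0]]
sigma = [4, 5, 8, 7]


def covers(A, x, sigma):
    """Covering constraints shared by MULTICOVER and MINIMUMPROTEINTYPES."""
    return all(sum(a * xj for a, xj in zip(row, x)) >= s
               for row, s in zip(A, sigma))


def total_abundance(x):
    """MULTICOVER objective: sum of all protein multiplicities."""
    return sum(x)


def protein_types(x):
    """MINIMUMPROTEINTYPES objective: number of x_j with positive value."""
    return sum(1 for xj in x if xj > 0)


for x in ([4, 7, 1], [5, 8, 0]):
    print(x, covers(A, x, sigma), total_abundance(x), protein_types(x))
```

Both vectors are feasible, but the first uses three protein types with total abundance 12, while the second uses only two types at the cost of total abundance 13.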

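MINIMUMUNIFORMERROR. Instead of simply looking for covers on the σ_i's, we can allow an
error term ε such that the x_j's sum to a value within σ_i ± ε.

\[
\begin{aligned}
\text{minimize} \quad & \varepsilon \\
\text{subject to:} \quad & \sigma_i - \varepsilon \le \sum_{j=1}^{m} A_{i,j}\, x_j \le \sigma_i + \varepsilon, && i = 1, \dots, n \\
& x_j \in \mathbb{Z}^{+} \cup \{0\}, && j = 1, \dots, m
\end{aligned}
\]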

This formulation looks for a solution such that, for each peptide s_i, the proposed
multiplicities for proteins containing s_i sum up to a value within ε of σ_i. This
additive error term helps us incorporate noise in the MS data while adhering to
the peptide abundance σ_i by minimizing the error term. One issue with this approach
is that some peptides are significantly harder to detect than others, and thus the errors
in those peptide multiplicities are much larger than for peptides that are easier to identify.
We can resolve this issue of non-uniform errors using the following formulation.

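MINIMUMERRORSUM. While MINIMUMUNIFORMERROR looks for a solution where the uniform
error term ε is minimized, here we introduce an individual error term ε_i for each
σ_i, and minimize the sum Σ_i ε_i.

\[
\begin{aligned}
\text{minimize} \quad & \sum_{i=1}^{n} \varepsilon_i \\
\text{subject to:} \quad & \sigma_i - \varepsilon_i \le \sum_{j=1}^{m} A_{i,j}\, x_j \le \sigma_i + \varepsilon_i, && i = 1, \dots, n \\
& x_j \in \mathbb{Z}^{+} \cup \{0\}, && j = 1, \dots, m
\end{aligned}
\]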

Using these error terms allows a different error for each peptide. We note that the error
terms ε_i can be weighted by “detectability”, which would penalize discrepancies
differently for different peptides. We discuss such extensions in Section 3.5.
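For a fixed candidate vector x, the smallest feasible error terms in the two formulations are the largest and the summed absolute residuals, respectively. The following sketch computes both; the noisy multiplicities `sigma_noisy` are made-up values for illustration and the function names are hypothetical.

```python
# Incidence matrix of the Figure 3.1 toy instance.
A = [[1, 0, 0], [1, 0, 1], [0, 1, 1], [0, 1, 0]]


def residuals(A, x, sigma):
    """Signed difference between achieved coverage and sigma_i per peptide."""
    return [sum(a * xj for a, xj in zip(row, x)) - s
            for row, s in zip(A, sigma)]


def uniform_error(A, x, sigma):
    """Smallest epsilon feasible for x in MINIMUMUNIFORMERROR."""
    return max(abs(r) for r in residuals(A, x, sigma))


def error_sum(A, x, sigma):
    """Smallest sum of epsilon_i feasible for x in MINIMUMERRORSUM."""
    return sum(abs(r) for r in residuals(A, x, sigma))


sigma_noisy = [4, 6, 8, 6]   # hypothetical noisy version of (4, 5, 8, 7)
x = [4, 7, 1]
print(residuals(A, x, sigma_noisy))
print(uniform_error(A, x, sigma_noisy), error_sum(A, x, sigma_noisy))
```

A detectability weighting of the kind mentioned above could be sketched by multiplying each absolute residual by a per-peptide weight before summing.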

Each of these four problems captures a different aspect of protein quantification,

and requires a different algorithmic technique. Although some problems clearly model

the mass spectrometry data more realistically than others, the theoretical results build

up progressively, and thus we shall discuss each problem in this chapter.

3.2 Hardness of Protein Quantification Problems

In this section, we study the computational hardness of the problems posed in Section 3.1.
To show the hardness of a given problem, we need to find a problem that is
known to be hard, and reduce its instances to our problem of interest. For that purpose,
we consider the following two NP-hard problems³.

SETCOVER. [56, 86] Given a set S of elements and a collection of sets W over the ground

set S, choose a minimum cardinality sub-collection of W that covers all elements in S.

EXACTCOVER. [56, 86] Given a collection W of sets over elements S, find a subset W ′ ⊆

W such that each element in S belongs to exactly one set in W ′.

Before showing the hardness of our four problems, we first consider the original for-

mulation; Problem 3.1 is simply a system of linear equations, but we can define an inte-

ger program by adding an objective function as in MULTICOVER.

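Problem 3.2.

\[
\begin{aligned}
\text{minimize} \quad & \sum_{j=1}^{m} x_j \\
\text{subject to:} \quad & \sum_{j=1}^{m} A_{i,j} \cdot x_j = \sigma_i, && i = 1, \dots, n \\
& x_j \ge 0, && j = 1, \dots, m \\
& x_j \in \mathbb{Z}
\end{aligned}
\]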

The hardness of this problem is evident from EXACTCOVER. Due to unavoidable er-

rors in MS measurements, this problem is of less practical interest, and we refrain from

further discussing algorithmic approaches.

Theorem 3.3. Problem 3.2 is NP-hard.

Proof. Take an instance Π of EXACTCOVER, and transform it into an instance φ(Π) of
Problem 3.2 by letting A_{i,j} = 1 exactly when element i belongs to set j, and assigning
σ_i = 1 for all 1 ≤ i ≤ n. If φ(Π) has a feasible solution of any size, then Π
has a solution for EXACTCOVER. Otherwise, Π has no solution for EXACTCOVER.

3 See Section 3.6 for inapproximability results on these problems.


Next, the hardness of the multicover problem is well known [140].

Theorem 3.4. [140]MULTICOVER is NP-hard.

Proof. Consider the decision problem MULTICOVER(k). Take an instance Π of SETCOVER(k),
and cast it into an instance φ(Π) of MULTICOVER(k), where σ_i = 1 for all i. If
Π contains a solution of size k, then the corresponding chosen sets form a solution for
φ(Π) of size k. Conversely, if φ(Π) has a multicover of size k, then there are at most k
x_j's with positive values. These x_j's correspond to a set cover of size at most k for Π.

The hardness of MINIMUMPROTEINTYPES comes from a reduction from SETCOVER.

Theorem 3.5. MINIMUMPROTEINTYPES is NP-hard.

Proof. Take an instance of set cover Π, and cast into an instance φ(Π) of MINIMUMPRO-

TEINTYPES by assigning the coverage requirement σi = 1 for each element. If Π admits a

set cover of size k , then the corresponding chosen proteins form a solution for φ(Π) of

size k . Conversely, if φ(Π) admits a solution of size k , the proteins chosen with positive

multiplicity correspond to a set cover of size k in Π.

The error minimization problems can be shown to be hard by reductions from EX-

ACTCOVER.

Theorem 3.6. MINIMUMUNIFORMERROR is NP-hard.

Proof. Take an instance Π of EXACTCOVER. Then, cast it into an instance φ(Π) of
MINIMUMUNIFORMERROR by assigning σ_i = 1 for 1 ≤ i ≤ n. If φ(Π) contains a solution with ε = 0,
then Π contains an exact cover. Otherwise, Π contains no solution.

A similar proof yields hardness for MINIMUMERRORSUM.

Theorem 3.7. MINIMUMERRORSUM is NP-hard.


3.3 Approximation Algorithms for Protein Quantification

In this section, we design approximation algorithms for the formulated problems. While

MULTICOVER and MINIMUMPROTEINTYPES are both covering problems, MINIMUMUNIFORMER-

ROR and MINIMUMERRORSUM share a common theme of minimizing the error terms. For

the covering problems, we obtain O(log n ) approximation guarantees, and we design

randomized algorithms for the error minimization problems.

3.3.1 An Algorithm for Multicover

The MULTICOVER problem formulated in Section 3.1 is an unconstrained version: the

solution vector x may contain any nonnegative integers. In other words, each protein

may be picked multiple times in order to meet the coverage requirement σ. For the

purpose of approximation algorithms, however, we can instead focus on the constrained

0−1 problem, where each protein can be either picked once, or not picked at all.

Lemma 3.8. If there is an α-approximation algorithm for constrained multicover, then

there is an α-approximation algorithm for MULTICOVER.

Proof. Consider the unconstrained multicover problem. We first take the LP relaxation
of the integer program, and solve the LP in polynomial time to obtain the fractional
optimum solution, denoted by OPT_f. From the fractional optimum OPT_f : x = (x_1, x_2, ..., x_m),
each x_j can be split into an integral part I_j and a fractional part F_j, where 0 ≤ F_j < 1.
Then, OPT_f = I + F. Now consider the “residual” IP:

\[
\begin{aligned}
\text{minimize} \quad & \sum_{j=1}^{m} x'_j \\
\text{subject to:} \quad & \sum_{j=1}^{m} A_{i,j} \cdot x'_j \ge \sigma'_i, && i = 1, \dots, n \\
& 0 \le x'_j \le 1, && j = 1, \dots, m \\
& x'_j \in \mathbb{Z}
\end{aligned}
\tag{3.1}
\]

where

\[
\sigma'_i = \max\Big\{ \sigma_i - \sum_{j=1}^{m} A_{i,j}\, I_j,\; 0 \Big\}
\]

to make sure σ'_i remains nonnegative. Then F = (F_1, F_2, ..., F_m) forms an optimal solution
for the LP relaxation of the above residual IP (since otherwise we could replace it with a
better solution for the relaxed residual problem and improve OPT_f).

Now, suppose we have an approximate solution A for the residual IP that is at most α · F
for some α > 1. Then our overall solution is:

\[
I + A \;\le\; I + \alpha F \;\le\; \alpha (I + F) \;=\; \alpha \cdot \mathrm{OPT}_f \;\le\; \alpha \cdot \mathrm{OPT}
\]

Therefore, from the standpoint of approximation algorithms, we can focus on the

constrained multicover problem. Due to their resemblance, standard techniques for

set cover can be applied with similar approximation guarantees, including the simple

greedy algorithm (which iteratively picks the set containing the largest number of un-

covered elements), or the LP rounding technique using the residual solution F .
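The greedy rule mentioned above can be sketched for constrained multicover as follows. Here "uncovered" is interpreted as "residual demand not yet met", which is an assumption of this sketch; ties are broken by the smallest protein index, and no approximation guarantee is claimed for the code itself.

```python
def greedy_constrained_multicover(A, demand):
    """Greedy sketch: each protein (column of A) may be picked at most once.

    In every round, pick the unused protein that helps the largest number
    of peptides whose residual demand is still positive.
    """
    n, m = len(A), len(A[0])
    remaining = list(demand)
    unused = list(range(m))
    chosen = []

    def gain(j):
        return sum(1 for i in range(n) if A[i][j] and remaining[i] > 0)

    while any(r > 0 for r in remaining):
        best = max(unused, key=gain, default=None)
        if best is None or gain(best) == 0:
            raise ValueError("demand cannot be met with each protein used once")
        unused.remove(best)
        chosen.append(best)
        for i in range(n):
            if A[i][best] and remaining[i] > 0:
                remaining[i] -= 1
    return chosen


# Incidence matrix of the Figure 3.1 toy instance with small residual demands.
A = [[1, 0, 0], [1, 0, 1], [0, 1, 1], [0, 1, 0]]
print(greedy_constrained_multicover(A, [1, 1, 1, 1]))
print(greedy_constrained_multicover(A, [1, 2, 1, 1]))
```

In the setting of Lemma 3.8, the `demand` argument would play the role of the residual requirements σ'_i left after fixing the integral part I.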

3.3.2 An Algorithm for Minimum Protein Types

In Section 3.2, we showed that MINIMUMPROTEINTYPES is NP-hard via a reduction from

set cover. While MINIMUMPROTEINTYPES is formulated as a nonlinear integer program,


a harder class of problems in general, our problem is in fact equivalent to the set cover

problem.

Theorem 3.9. Any algorithm for SETCOVER can be used to find a solution for MINIMUMPRO-

TEINTYPES of the same size.

Proof. Let Π denote an instance of MINIMUMPROTEINTYPES. Now consider the corresponding
set cover instance φ(Π) constructed by replacing the coverage requirement
with σ_i = 1 for each element i. Suppose φ(Π) contains a set cover of size k. Take the chosen
sets in φ(Π), and assign the corresponding x_j's arbitrarily large values so that
each coverage requirement σ_i is satisfied. Since the value of the objective function
is the number of x_j's with positive values, we have a solution for MINIMUMPROTEINTYPES
of value k.

This implies that any approximation algorithm for SETCOVER can be used to compute

a solution for MINIMUMPROTEINTYPES with the same approximation guarantee, and the

inapproximability results for SETCOVER also hold for MINIMUMPROTEINTYPES. See Sec-

tion 3.6 for inapproximability results on SETCOVER.

Finding the protein abundance x for Minimum Protein Types. Observe that, in the
proof of Theorem 3.9, our algorithm assigns arbitrarily large values to the x_j's chosen by
the SETCOVER algorithm. In practice, however, such arbitrarily large values would result
in solutions undesirable for biological purposes. To resolve this, we first run the
SETCOVER algorithm to choose a small set of proteins (which would be assigned positive
abundances), and then remove the unchosen proteins from the bipartite graph G.
Then, we apply the algorithm for MULTICOVER on the residual bipartite graph. This way,
we keep the number of chosen proteins as small as the SETCOVER algorithm allows, and further
assign small abundances to the chosen proteins.
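The two-stage procedure can be sketched as below. Both stages use simple greedy heuristics chosen for this illustration (greedy set cover, then a greedy for unconstrained multicover restricted to the chosen proteins); these are assumptions of the sketch and not necessarily the algorithms used in the experiments.

```python
def greedy_set_cover(A):
    """Stage 1: repeatedly pick the protein covering most uncovered peptides."""
    n, m = len(A), len(A[0])
    uncovered = set(range(n))
    chosen = []
    while uncovered:
        j = max(range(m), key=lambda c: sum(1 for i in uncovered if A[i][c]))
        newly = {i for i in uncovered if A[i][j]}
        if not newly:
            raise ValueError("some peptide has no parent protein")
        chosen.append(j)
        uncovered -= newly
    return chosen


def greedy_multicover(A, sigma, allowed):
    """Stage 2: add one copy at a time of an allowed protein until covered."""
    n = len(A)
    remaining = list(sigma)
    x = [0] * len(A[0])

    def gain(j):
        return sum(1 for i in range(n) if A[i][j] and remaining[i] > 0)

    while any(r > 0 for r in remaining):
        j = max(allowed, key=gain)
        if gain(j) == 0:
            raise ValueError("allowed proteins cannot cover all peptides")
        x[j] += 1
        for i in range(n):
            if A[i][j] and remaining[i] > 0:
                remaining[i] -= 1
    return x


# Figure 3.1 toy instance.
A = [[1, 0, 0], [1, 0, 1], [0, 1, 1], [0, 1, 0]]
sigma = [4, 5, 8, 7]
chosen = greedy_set_cover(A)
print(chosen, greedy_multicover(A, sigma, chosen))
```

On the toy instance the first stage keeps only w_1 and w_2, and the second stage then assigns them just enough copies to meet every σ_i.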


3.3.3 An Algorithm for Minimum Uniform Error

We reduce the problem to a {0, 1} problem.

Lemma 3.10. If there is an α-approximation algorithm for constrained MINIMUMUNI-

FORMERROR, then there is an α-approximation algorithm for unconstrained MINIMUMU-

NIFORMERROR.

Proof. We first spell out the original problem:

\[
\begin{aligned}
\text{minimize} \quad & \varepsilon \\
\text{subject to:} \quad & \sigma_i - \varepsilon \le \sum_{j=1}^{m} A_{i,j}\, x_j \le \sigma_i + \varepsilon, && i = 1, \dots, n \\
& x_j \in \mathbb{Z}^{+} \cup \{0\}, && j = 1, \dots, m
\end{aligned}
\tag{ILP1}
\]

which we shall call ILP1. Now, consider the LP relaxation of ILP1, namely LP1. Then, by
definition, OPT(LP1) ≤ OPT(ILP1). Each value assigned to x_j in OPT(LP1) is a fractional
value, which can be split into:

\[
\mathrm{OPT}(\mathrm{LP1}) = I + F
\]

where I_j is the integral part and 0 ≤ F_j < 1 for all j. We can then keep the solution given
by I, and consider the residual problem given by the following integer program, called ILP2:

\[
\begin{aligned}
\text{minimize} \quad & \varepsilon \\
\text{subject to:} \quad & \sigma'_i - \varepsilon \le \sum_{j=1}^{m} A_{i,j}\, x_j \le \sigma'_i + \varepsilon, && i = 1, \dots, n \\
& 0 \le x_j \le 1, && j = 1, \dots, m \\
& x_j \in \mathbb{Z}, && j = 1, \dots, m
\end{aligned}
\tag{ILP2}
\]

where σ'_i = max{σ_i − Σ_{j=1}^{m} A_{i,j} I_j, 0}.

Now, consider the optimum solution for the residual integer program, namely OPT(ILP2).
By definition, its linear relaxation LP2 satisfies OPT(LP2) ≤ OPT(ILP2). For
OPT(LP2), we claim that F ≤ OPT(LP2), and F is a feasible solution. To show F ≤ OPT(LP2),
suppose otherwise. Then, OPT(LP2) can be pasted onto I to form:

\[
I + \mathrm{OPT}(\mathrm{LP2}) \;<\; I + F \;\le\; \mathrm{OPT}(\mathrm{LP1})
\]

contradicting the optimality of OPT(LP1).

Finally, suppose there is an approximate solution A for the residual problem that is at most
α · F for some α > 1. Then our overall solution is:

\[
I + A \;\le\; I + \alpha \cdot F \;\le\; \alpha (I + F) \;=\; \alpha \cdot \mathrm{OPT}(\mathrm{LP1}) \;\le\; \alpha \cdot \mathrm{OPT}(\mathrm{ILP1})
\]

Therefore, the overall approach would yield an α-approximation algorithm.

We thus focus on the constrained 0−1 problem. We can design a randomized round-

ing scheme where the fractional solution F is used as a probability, as shown in Algo-

rithm 3.3.1.

Algorithm 3.3.1: Randomized Rounding Scheme for MINIMUMUNIFORMERROR

Input: Fractional solution vector F from OPT(LP1).
Output: Binary solution vector x.

for j = 1, ..., m do
    p_j ← F_j (each 0 < F_j < 1 is treated as a probability);
    Flip a p_j-biased coin;
    if heads then
        x_j ← 1;
    else
        x_j ← 0;
    end
end
Return x.
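A minimal Python sketch of this rounding step is given below. The function name `randomized_rounding` and the optional `rng` parameter (used only to make runs reproducible) are assumptions of this example.

```python
import random


def randomized_rounding(F, rng=None):
    """Round each fractional value F[j] in [0, 1] to 1 with probability F[j].

    Each coordinate is rounded independently (one biased coin flip per j).
    """
    rng = rng or random.Random()
    return [1 if rng.random() < p else 0 for p in F]


# Example: round a fractional residual solution with a fixed seed.
F = [0.2, 0.9, 0.5]
print(randomized_rounding(F, random.Random(0)))
```

Because `rng.random()` returns a value in [0, 1), an entry with F_j = 0 is never rounded up, and an entry with F_j = 1 would always be rounded up.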


To analyze the rounding scheme, we define, for each peptide s_i, a random variable
Y_i = Σ_{j : (i,j) ∈ E} x_j, where x_j ∈ {0, 1} is the result of tossing a coin for each j with
probability p_j (a Poisson trial). By linearity of expectation, we have:

\[
E[Y_i] \;=\; \sum_{j : (i,j) \in E} E[x_j] \;=\; \sum_{j : (i,j) \in E} p_j \;=\; \tau_i ,
\]

where τ_i = Σ_{j : (i,j) ∈ E} F_j is the coverage achieved by F. Since the fractional solution F
has at most the optimum error from the integer program, the expected coverage E[Y_i] = τ_i
deviates from the requirement by at most that optimum error; in this sense, the solution is
optimal in expectation.

We now show that Algorithm 3.3.1 gives a small tail probability.

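Lemma 3.11. Algorithm 3.3.1 returns a solution such that

\[
\Pr\Big[\, Y_i > (1+\delta)\tau_i + \frac{6 \log n}{\delta^2} \,\Big] < \frac{1}{n^2}
\]

and

\[
\Pr\Big[\, Y_i < (1-\delta)\tau_i - \frac{4 \log n}{\delta^2} \,\Big] < \frac{1}{n^2}
\]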

Proof. To estimate the tail probability, we apply the Chernoff bound [29, 104] on Y_i. For
the bound on the upper tail, we have:

\[
\Pr[\, Y_i > (1+\delta)\tau_i \,] < e^{-\tau_i \delta^2 / 3}
\]

and for the bound on the lower tail:

\[
\Pr[\, Y_i < (1-\delta)\tau_i \,] < e^{-\tau_i \delta^2 / 2}
\]

We show each case separately.


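Case I (Upper tail): Suppose τ_i ≥ 6 log n / δ². Then it follows that:

\[
\tau_i \ge \frac{6 \log n}{\delta^2}
\;\Rightarrow\; \frac{\tau_i \delta^2}{3} \ge 2 \log n
\;\Rightarrow\; \log e^{\tau_i \delta^2 / 3} \ge \log(n^2)
\;\Rightarrow\; e^{-\tau_i \delta^2 / 3} \le \frac{1}{n^2}
\]

and it follows that

\[
\Pr[\, Y_i > (1+\delta)\tau_i \,] < \frac{1}{n^2} \quad \text{if } \tau_i \ge \frac{6 \log n}{\delta^2}
\tag{1}
\]

Now consider the case τ_i < 6 log n / δ². We introduce an absolute error D such that
(1 + δ)τ_i = τ_i + D. Then we can express D as D = δτ_i, and rewrite the assumption with the
additive error D:

\[
\tau_i < \frac{6 \log n}{\delta^2}
\;\Rightarrow\; \tau_i < \frac{6 \log n}{(D/\tau_i)^2}
\;\Rightarrow\; D < \sqrt{\tau_i \cdot 6 \log n}
\]

Moreover, the Chernoff bound for the upper tail now becomes:

\[
\Pr[\, Y_i > \tau_i + D \,] \;\le\; e^{-\frac{D^2}{3\tau_i}}
\;<\; e^{-\frac{(\sqrt{\tau_i \cdot 6 \log n})^2}{3\tau_i}} \;\; \text{(by assumption)}
\;=\; e^{-2 \log n} \;=\; \frac{1}{n^2}
\]

Rewriting D in δ gives

\[
D < \sqrt{\tau_i \cdot 6 \log n} = \frac{6 \log n}{\delta^2},
\]

and thus we obtain

\[
\Pr\Big[\, Y_i > \tau_i + \frac{6 \log n}{\delta^2} \,\Big] < \frac{1}{n^2} \quad \text{if } \tau_i < \frac{6 \log n}{\delta^2}
\tag{2}
\]

Combining (1) and (2), we have

\[
\Pr\Big[\, Y_i > (1+\delta)\tau_i + \frac{6 \log n}{\delta^2} \,\Big] < \frac{1}{n^2}.
\]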

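Case II (Lower tail): Consider the lower tail as below.

\[
\Pr[\, Y_i < (1-\delta)\tau_i \,] < e^{-\tau_i \delta^2 / 2}
\]

First, suppose τ_i ≥ 4 log n / δ². Rearranging this gives e^{−τ_i δ²/2} ≤ 1/n², and it follows that

\[
\Pr[\, Y_i < (1-\delta)\tau_i \,] < \frac{1}{n^2} \quad \text{if } \tau_i \ge \frac{4 \log n}{\delta^2}
\tag{3}
\]

Now consider the case τ_i < 4 log n / δ². As before, let D be an additive error such that
(1 − δ)τ_i = τ_i − D. Then we have D = δτ_i, and it follows that

\[
\tau_i < \frac{4 \log n}{\delta^2}
\;\Rightarrow\; \tau_i < \frac{4 \log n}{(D/\tau_i)^2}
\;\Rightarrow\; \frac{D^2}{\tau_i} < 4 \log n
\;\Rightarrow\; D < \sqrt{\tau_i \cdot 4 \log n}
\]

Now, the Chernoff bound can be rewritten:

\[
\Pr[\, Y_i < \tau_i - D \,] \;<\; e^{-\tau_i \delta^2 / 2}
\;=\; e^{-\frac{D^2}{2\tau_i}}
\;<\; e^{-\frac{(\sqrt{\tau_i \cdot 4 \log n})^2}{2\tau_i}}
\;=\; e^{-2 \log n} \;=\; \frac{1}{n^2}
\]

Rewriting D in δ gives

\[
D < \sqrt{\tau_i \cdot 4 \log n} = \frac{4 \log n}{\delta^2},
\]

and thus we obtain

\[
\Pr\Big[\, Y_i < \tau_i - \frac{4 \log n}{\delta^2} \,\Big] < \frac{1}{n^2} \quad \text{if } \tau_i < \frac{4 \log n}{\delta^2}
\tag{4}
\]

Combining (3) and (4), we have

\[
\Pr\Big[\, Y_i < (1-\delta)\tau_i - \frac{4 \log n}{\delta^2} \,\Big] < \frac{1}{n^2}
\]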

As a result, for each s_i, flipping a coin independently with probability p_j for each
neighbour w_j results in a value within the error bound of

\[
\Big[\, (1-\delta)\tau_i - \frac{4 \log n}{\delta^2},\;\; (1+\delta)\tau_i + \frac{6 \log n}{\delta^2} \,\Big]
\]

with high probability for both the upper and lower tail. We can then put all the s_i's
together to build a bound on the randomized algorithm.
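To get a feeling for the size of this interval, the following sketch evaluates its two endpoints for given τ_i, δ and n. It assumes that log denotes the natural logarithm (consistent with the exponential form of the Chernoff bound); the function name and the sample values are hypothetical.

```python
import math


def rounding_interval(tau, delta, n):
    """Endpoints of the interval from Lemma 3.11 for one peptide.

    tau   : fractional coverage tau_i achieved by F
    delta : relative error parameter, delta > 0
    n     : number of peptides (n >= 2), log taken as natural logarithm
    """
    if delta <= 0 or n < 2 or tau < 0:
        raise ValueError("need delta > 0, n >= 2 and tau >= 0")
    lower_slack = 4 * math.log(n) / delta ** 2
    upper_slack = 6 * math.log(n) / delta ** 2
    return (1 - delta) * tau - lower_slack, (1 + delta) * tau + upper_slack


low, high = rounding_interval(tau=50, delta=0.5, n=100)
print(low, high)
```

For small τ_i the additive terms dominate, so the interval is informative mainly when the coverage τ_i is large compared with log n / δ².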

Theorem 3.12. The randomized rounding scheme finds a solution with optimal expected

value for MINIMUMUNIFORMERROR with high probability.

Proof. We use the union bound to put all s_i's together. Consider the upper tail: for
each i = 1, ..., n, let A_i denote the event that Y_i is greater than the specified bound. By
Lemma 3.11, Pr[A_i] ≤ 1/n² for each A_i. By the union bound, we have:

\[
\Pr\Big[ \bigcup_i A_i \Big] \;\le\; \sum_i \Pr[A_i] \;\le\; n \cdot \frac{1}{n^2} \;=\; \frac{1}{n}
\]

Hence, the probability that no s_i falls above the bound is at least 1 − 1/n.

Similarly, for the lower tail, the probability that any s_i falls below the specified bound
is at most 1/n. Hence, the probability that no s_i falls below the bound is at least 1 − 1/n.

3.3.4 An Algorithm for Minimum Error Sum

Here we discuss the MINIMUMERRORSUM problem, where we allow an independent error
term ε_i for each peptide abundance σ_i. As before, we focus on the constrained 0−1
problem via LP relaxation.

Lemma 3.13. If there is an α-approximation algorithm for constrained MINIMUMERROR-

SUM, then there is anα-approximation algorithm for unconstrained MINIMUMERRORSUM.

Proof. Similar to the proof of Lemma 3.10.

We can thus focus on the constrained, 0− 1 problem. The randomized rounding

scheme for MINIMUMUNIFORMERROR shown in Algorithm 3.3.1 can be applied here, too.

Theorem 3.14. There is a randomized rounding scheme that finds a solution with opti-

mal expected value for MINIMUMERRORSUM with high probability.

Proof. Similar to the proofs of Lemma 3.11 and Theorem 3.12.


3.4 Experimental Results

In this section, we discuss how our algorithms for the protein quantification problem

have been tested empirically against simulated data as well as biological data from AP-

MS experiments.

3.4.1 Performance on Simulated Data

String Trypsinization. To generate our simulated test data, we took a collection of
1000 random strings of length 150 over an alphabet of size 4. The abundance of each
string w_j was then set to a random value between 1 and 100. We call the resulting
multiset of strings x. Then, we cut the strings at random into approximately 30 fragments
to simulate the trypsinization of proteins. The string fragments are counted to form the
peptide abundance σ for this input string set.
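A data generator along these lines could be sketched as follows. The exact cutting procedure is not specified above, so this sketch makes several assumptions: the alphabet is taken to be "ACGT", each string is cut at exactly `fragments - 1` distinct random positions, a fragment is counted once per parent string, and only fragments produced by the cuts contribute to σ.

```python
import random
from collections import Counter


def simulate(num_strings=1000, length=150, alphabet="ACGT",
             fragments=30, max_abundance=100, seed=0):
    """Generate random 'proteins', their abundances, and fragment counts."""
    rng = random.Random(seed)
    strings = ["".join(rng.choice(alphabet) for _ in range(length))
               for _ in range(num_strings)]
    abundance = [rng.randint(1, max_abundance) for _ in strings]
    sigma = Counter()
    for w, x_j in zip(strings, abundance):
        # Choose distinct interior cut positions, then slice the string.
        cuts = sorted(rng.sample(range(1, length), fragments - 1))
        bounds = [0] + cuts + [length]
        pieces = {w[a:b] for a, b in zip(bounds, bounds[1:])}
        for piece in pieces:
            sigma[piece] += x_j   # every copy of w contributes one fragment
    return strings, abundance, sigma


strings, abundance, sigma = simulate(num_strings=5, length=20, fragments=4, seed=1)
print(len(sigma), sum(abundance))
```

Short fragments over a small alphabet are likely to occur in several strings, which is what produces shared peptides in the simulated bipartite graph.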

We acknowledge that the parameters chosen for the generation of protein / peptide

data are not based on a realistic model of protein sequences (e.g. alphabet of size 20,

realistic length for protein sequences, realistic trypsinization process, etc.). However,

note that the performance of our approaches depends directly on the edge density of

the bipartite graph constructed from the strings. The chosen parameters ensure that

each protein consists of up to 30 peptides, and that the edge density of the resulting

bipartite graph is similar to that seen in the real AP-MS data.

We first ran our algorithms on the synthetic data, and compared the results against

the protein abundance of the original set of strings x . Table 3.1 summarizes the results

of these comparisons. However, because each problem has its own distinct objective

function, comparing the output of our algorithms based on any one objective function

is not very meaningful. Furthermore, the abundance x of the original set of strings may

not be “optimal” for the input vector σ depending on the objective. In fact, as shown in
Table 3.1, the original abundance vector x does not achieve the optimum objective value
for MULTICOVER and MINIMUMPROTEINTYPES, while our approximation algorithms provide
slightly better solutions than x.

Problem                Solution      Objective
MULTICOVER             Original x    6439
MULTICOVER             Our result    5371
MINIMUMPROTEINTYPES    Original x    1000
MINIMUMPROTEINTYPES    Our result    887
MINIMUMUNIFORMERROR    Original x    0
MINIMUMUNIFORMERROR    Our result    74
MINIMUMERRORSUM        Original x    0
MINIMUMERRORSUM        Our result    782

Table 3.1: Output quality of the algorithms compared to the original abundance vector x (Objective indicates the value from the corresponding objective function). Note that in MULTICOVER and MINIMUMPROTEINTYPES, our solution achieves better objective values than the original abundance of strings, while in the error minimizing problems (i.e. MINIMUMUNIFORMERROR, MINIMUMERRORSUM) the original string abundances produce no error.

Consequently, we instead base our performance analyses on the distance between
the proposed solutions and the original abundance x. Let x, x′ denote two solution vectors.
Then, we define the distance between x and x′ as the L1-distance:

\[
d(x, x') = \sum_{i} |x_i - x'_i|
\]

where x_i denotes the i-th component of x. We can interpret this L1-distance between
our proposed solution x′ and the original abundance x as the error present in our prediction.
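The distance is straightforward to compute; a minimal sketch is shown below, using two abundance vectors for the toy instance of Figure 3.1 as sample input. The function name `l1_distance` is chosen for this example.

```python
def l1_distance(x, x_prime):
    """L1-distance between two abundance vectors of equal length."""
    if len(x) != len(x_prime):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(x, x_prime))


original = [4, 7, 1]
proposed = [5, 8, 0]
print(l1_distance(original, proposed))
```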

Robustness Analysis. When Gentleman and Huber [60] discussed errors that are prevalent
in proteomics, they classified the measurement errors as two distinct types: stochastic
errors and systematic errors. Stochastic errors have random variability, and thus can
be reduced by simply repeating the experiment. On the other hand, systematic errors
are recurrent in the measurements, which makes them difficult to identify and remove.
From the perspective of the protein quantification problem, stochastic errors are
embedded into the peptide abundance σ. In our formulation of the problems, MINIMUMUNIFORMERROR
and MINIMUMERRORSUM in particular, we attempt to remove the stochastic
errors by looking for a solution with the smallest error ε.


[Figure 3.2 (line plot, image not reproduced): x-axis Coverage (%), from 20 to 100; y-axis Errors, from 0 to 5500; one curve each for MUE, MES, MC and MPT.]

Figure 3.2: Robustness of the algorithms for MPT (MINIMUMPROTEINTYPES), MC (MULTICOVER), MES (MINIMUMERRORSUM) and MUE (MINIMUMUNIFORMERROR). Errors indicate the L1 distance from the original protein abundances x (averaged over 5 distinct partial datasets).

On the other hand, because some peptides are notoriously difficult to identify, and
the sensitivity of mass spectrometers is relatively poor, the mass spectrometry data is far
from comprehensive. As a result, certain peptides (and their abundances) are missing from
the data, and form a systematic error that recurs throughout the dataset. We test our
algorithms against such systematic errors by analyzing the robustness of the algorithms.

In order to test the robustness of our algorithms, we randomly selected and removed⁴
a specified proportion of constraints before running the algorithms. For example, if we
remove 25% of the constraints (the peptide abundance σ) from the system, solving the
problem with the remaining 75% of the constraints can simulate the problem of
quantifying the protein abundance with 75% sensitivity of the MS data.
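The constraint-removal step can be sketched as below: a random subset of peptide rows is kept, and the dropped peptides disappear from the system entirely instead of having their multiplicity set to zero. The function name `drop_constraints` and its parameters are hypothetical.

```python
import random


def drop_constraints(A, sigma, coverage, seed=None):
    """Keep a random fraction `coverage` of the peptide constraints.

    Returns the reduced matrix, the reduced sigma, and the kept row indices.
    """
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    rng = random.Random(seed)
    n = len(sigma)
    keep = sorted(rng.sample(range(n), round(coverage * n)))
    return [A[i] for i in keep], [sigma[i] for i in keep], keep


# Figure 3.1 toy instance with 75% input coverage.
A = [[1, 0, 0], [1, 0, 1], [0, 1, 1], [0, 1, 0]]
sigma = [4, 5, 8, 7]
A_part, sigma_part, kept = drop_constraints(A, sigma, coverage=0.75, seed=0)
print(kept, sigma_part)
```

Repeating this with several seeds and averaging the resulting errors would mimic the use of multiple distinct partial datasets.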

4 We note that, by removing a particular peptide abundance σ_i, we simply remove the corresponding constraint in the system rather than setting σ_i = 0 for that peptide.

After removing a fixed proportion of the input data (peptide abundance σ), we ran our algorithm for each problem and compared the results in terms of the errors produced, as estimated by the metric d (x , x ). Figure 3.2 shows the result of this comparison for different levels of input coverage (i.e. the percentage of the data remaining). Among the four problem formulations, MINIMUMPROTEINTYPES showed the closest fit to the original abundance x across the different levels of input coverage, while MINIMUMERRORSUM and MULTICOVER showed similar performance until the input coverage fell below 50%. Furthermore, when comparing the amount of error produced at 50% and at 100% input coverage, the algorithm for MINIMUMPROTEINTYPES is shown to be the most robust among the four problem formulations.
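The error measure plotted in Figure 3.2 (the L1 distance from the original abundances, averaged over several partial datasets) can be sketched as below. The function names and the dictionary representation are hypothetical, and treating a protein that is absent from one of the two vectors as having abundance 0 is an assumption of this sketch.

```python
def l1_distance(x_true, x_est):
    """L1 distance between two protein-abundance vectors.

    Both arguments map a protein identifier to an abundance. A protein
    missing from one vector is treated as abundance 0 (assumption).
    """
    proteins = set(x_true) | set(x_est)
    return sum(abs(x_true.get(p, 0.0) - x_est.get(p, 0.0)) for p in proteins)


def mean_error(x_true, estimates):
    """Average L1 error over the estimates from several partial datasets."""
    return sum(l1_distance(x_true, e) for e in estimates) / len(estimates)


# Toy example: two estimates obtained from two different partial datasets.
x_true = {"A": 3.0, "B": 1.0}
estimates = [{"A": 2.0, "B": 1.0}, {"A": 3.0, "C": 2.0}]
average = mean_error(x_true, estimates)
```

In this toy example the first estimate is off by 1.0 and the second by 3.0, so the averaged error is 2.0.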

3.4.2 Protein Quantification from AP-MS Data

To test our algorithms on real biological data, several datasets have been acquired from AP-MS experiments by our collaborators⁵. Spectral count is used to estimate the abundance of peptides, as it measures peptide concentrations in protein mixtures based on the number of tandem mass spectra obtained for each peptide [22, 30, 100]. Recall that each AP-MS experiment consists of a bait protein, together with peptides from its interacting partners, the prey proteins. We obtained the datasets from three separate mass spectrometry runs for three bait proteins: RPAP3, CDK9, and POLR2A. Since each of these datasets is produced by a distinct mass spectrometry experiment, we treat them as different samples.

As shown in Table 3.2, the three datasets from AP-MS experiments each contain roughly 2000 peptides that belong to roughly 1000 proteins. However, when restricted to only the connected components (in the bipartite graph) containing shared peptides, each dataset contains roughly 500 to 700 peptides over about 100 proteins.
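The restriction to connected components containing shared peptides can be illustrated with a small union-find sketch. The input format (a mapping from each peptide to the proteins that contain it) and the function name `nontrivial_components` are assumptions made for this example; the thesis does not specify how the components were computed.

```python
from collections import defaultdict


def nontrivial_components(peptide_to_proteins):
    """Return the connected components of the peptide-protein bipartite
    graph that contain at least one peptide shared by two or more proteins."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # Each (peptide, protein) incidence is an edge of the bipartite graph.
    for pep, prots in peptide_to_proteins.items():
        for prot in prots:
            union(("peptide", pep), ("protein", prot))

    # Group the vertices by the root of their component.
    components = defaultdict(lambda: {"peptides": set(), "proteins": set()})
    for kind, name in list(parent):
        side = "peptides" if kind == "peptide" else "proteins"
        components[find((kind, name))][side].add(name)

    shared = {pep for pep, prots in peptide_to_proteins.items()
              if len(set(prots)) > 1}
    return [c for c in components.values() if c["peptides"] & shared]


# Toy example: only p2 is shared (by B and C), so one component survives.
toy = {"p1": ["A"], "p2": ["B", "C"], "p3": ["C"], "p4": ["D"]}
result = nontrivial_components(toy)
```

Components consisting of a protein and peptides unique to it are discarded, which mirrors the distinction between the "Overall" and "Non-trivial" counts.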

While the nontrivial portion of the problem is smaller in terms of the number of peptides and proteins, the corresponding bipartite graphs are much denser due to the peptides shared by many proteins. In fact, peptides in our datasets are shared by as many as 19 proteins. Figures 3.3 through 3.5 show the bipartite graphs for the datasets with CDK9, POLR2A, and RPAP3 as the bait, respectively.

5 Our test data was generously provided by Benoit Coulombe’s lab at Institut de recherches cliniques de Montréal (IRCM), and consists of human PPI data from AP-MS experiments.

             Number        CDK9    POLR2A    RPAP3
Overall      proteins      1593      1386      901
             peptides      2592      2861     1689
             components    1320      1211      703
Non-trivial  proteins       125        70      125
             peptides       672       525      589
             components      21        19       26

Table 3.2: Network analysis of prey proteins from AP-MS experiments. Overall indicates the total number of proteins, peptides, and connected components for each dataset, while non-trivial indicates only the connected components containing shared peptides.

We executed our algorithms for MINIMUMPROTEINTYPES and MINIMUMERRORSUM on the nontrivial components (containing shared peptides) of each dataset. Furthermore, for smaller connected components with fewer than 20 proteins, we also computed exact solutions to the corresponding integer programs⁶, and found the results to be promising. In the case of MINIMUMPROTEINTYPES, the integral optimum solution and our approximate solution often picked the same set of proteins (FP ∼ 10%, FN ∼ 15%), with only marginal differences in the abundance. In the case of MINIMUMERRORSUM, the sum of errors (∑_i ε_i) in our approximate solution differs from that of the integral optimum solution by at most the maximum peptide abundance value (σ_i) in that component. While a pleasant result, this is no surprise, since we obtain the significant part of our protein abundance from the optimum of the LP relaxation.
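The two comparisons above can be sketched in a few lines. The exact definitions behind the reported FP and FN figures are not given here, so this sketch assumes that FP is the fraction of proteins selected by the approximate solution that the exact solution does not select, and FN is the fraction of proteins selected by the exact solution that the approximate solution misses. All function names are hypothetical.

```python
def selection_rates(exact, approx):
    """Compare which proteins receive positive abundance in two solutions.

    Returns (fp, fn) under the assumed definitions:
      fp = share of approx-selected proteins absent from the exact selection
      fn = share of exact-selected proteins absent from the approx selection
    """
    exact_set = {p for p, a in exact.items() if a > 0}
    approx_set = {p for p, a in approx.items() if a > 0}
    fp = len(approx_set - exact_set) / len(approx_set) if approx_set else 0.0
    fn = len(exact_set - approx_set) / len(exact_set) if exact_set else 0.0
    return fp, fn


def within_gap(err_approx, err_exact, sigma):
    """Check that two error sums differ by at most the largest peptide
    abundance in the component (the gap described in the text)."""
    return abs(err_approx - err_exact) <= max(sigma.values())


# Toy component with four candidate proteins.
exact = {"A": 4, "B": 2, "C": 0, "D": 1}
approx = {"A": 4, "B": 0, "C": 1, "D": 1}
fp, fn = selection_rates(exact, approx)
ok = within_gap(err_approx=7.0, err_exact=5.0, sigma={"p1": 3.0, "p2": 1.0})
```

Here each solution selects three proteins and they agree on two of them, so both rates come out to one third.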

Within our dataset, there were various instances where a group of peptides is shared

by a large number of proteins. In such instances, our algorithm for MINIMUMPROTEINTYPES

picked out only a few proteins as constituent proteins. For example, Figure 3.6

shows a connected component from the CDK9 dataset, where all the peptides are shared

by variants of the histone family. Among the 14 proteins that share the peptides in this

component, only 3 proteins were selected with positive abundance values. This partic-

6 Gurobi Optimizer (http://www.gurobi.com/) was used to solve the integer programs as well as their LP-relaxations.



[Bipartite graph figure (see Figures 3.3 through 3.5): node labels listing protein names and peptide sequences omitted.]

EIVHIQAGQCGNQIGAK,

YLTVATVFR,

TUBB6,

IMNTFSVMPSPK,

MASTFIGNSTAIQELFK,

SGPFGQLFRPDNFIFGQTGAGNNWAK, SGAFGHLFRPDNFIFGQSGAGNNWAK,

INVYYNESSSQK,

AILVDLEPGTMDSVR,

HNSVEDSVTK,MPIEGSENPERPFLEK,

TYSLSSSFSSSSSTR,

WLSSQPSFKLEPTQGHR,

AASSKPEEIKMR,

EQLENSPSR,

ISDHSSVKQEYTHK, WLSSQPSFK,

RIWNWR,

KLEHVIK,SILEQLAEKNFELVINLSMR,

KLETLDLDVR,

CGVEADKELSCR,

HHGPISTTPGIIPQK,

LFVEALGQIGPAPPLK,

YVAPPSLR,SSSLNFSFPSLPTMGQMPGHSSDTSGLSFSQPSCK,

WYVDGVEVHNAK, SCDTPPPCPR,

TPLGDTTHTCPR,

IGHG3,

VVSVLTVLHQDWLNGKEYK,

TPEVTCVVVDVSHEDPEVQFK,

TPEVTCVVVDVSHEDPDVQFKWYVDGVEVHNAK,

NQVSLTCLVK,

IGHG4,

FWEVISDEHGIDPAGGYVGDSALQLER,

GFYPSDIAVEWESNGQPEDNYK,

GFYPSDIAVEWESNGQPENNYK,

TTPPVLDSDGSFFLYSK,

GHYTEGAELVDAVLDVVR,

EAESCDCLQGFQLTHSLGGGTGSGMGTLLLSK,

EIVHIQAGQCGNQIGTK, FPGQLNADLR,

IREEYPDR,KEWESCDCLQGFQLTHSLGGGTGSGMGTLLISK,

FWEVISDEHGIDPTGTYHGDSDLQLDR,

LHFFMPGFAPLTSR,

MASTFIGNSTAIQELFKR,

FPGQLNADLRK,

TUBB,

TUBB8,

GHYTEGAELVDSVLDVVRK, GHYTEGAELVDSVLDVVR,

NSSYFVEWIPNNVK,

FWEVISDEHGIDPSGNYVGDSDLQLER,

LAVNMVPFPR,

IMNTFSVVPSPK,

ISEQFTAMFR,

NMMAACDPR,

TUBB2A,

YLTVAAIFR,

INVYYNEAAGNKYVPR, EQLENTPSRR,

VAHACLHPLEPLLDTK,

NIISSTALFLAAKVEEQAR,

CCNT1,

CCNT2,

EQLENTPSR,NIISSTALFLAAK,LNVSQLTINTAIVYMHR,

QYISSHNSVFNHPLPPPPPVTYQVGYGHLSTLVK,

CTQLVR,QETSLSGSQYNINFQQGPSISLHSGLHHRPDK,

FNKNIISSTALFLAAK, FYMIQSFTQFPGNSVAPAALFLAAK,

GPSEETGGAVFDHPAK,

HSSQTSNLAHK,

AKHAEELAAQK,

RWYFTR,

DITDPLSLNTCTDEGHVVLASPLK,

IWNWR,

FGVDPDKELSYR,

QLENMEANVK,

SCFPASLTASR,

HSHSQLPVGTGNKRPGDPK,

SGNTDKPRPPPLPSEPPPPLPPLPK,

GGGPQAQSHGEAR,

FQYGNYCK,

HHHHHPLPAAGFKK,

TPHVLVLGSGVYR,

LSLDDLLQR,

VPQFSFSR,

FWEVISDEHGIDPTGSYHGDSDLQLER,

MSATFIGNSTAIQELFK,

RPVINAGDGVGEHPTQALLDIFTIR,

QGQSQAASSSSVTSPIK,

ILALDCGLK,

HSADGIPPTVLR,

TLHEWLQQHGIPGLQGVDTR,

CAD,

GHYTEGAELVDAVLDVVRK, MAVTFIGNSTAIQELFK,

EAESCDCLQGFQLTHSLGGGTGSGMGTLLISK, STSESTAALGCLVK,

TTPPVLDSDGSFFLYSR, TAVCDIPPR,

CCVECPPCPAPPVAGPSVFLFPPKPK,

IGHG2,

VVSVLTVVHQDWLNGKEYK,

TPEVTCVVVDVSHEDPEVQFNWYVDGVEVHNAK, KCCVECPPCPAPPVAGPSVFLFPPKPK,

VVSVLTVVHQDWLNGK,

GPSVFPLAPCSR,

IGH@,

FWEVISDEHGIDPTGTYHGDSDLQLER,

MAATFIGNSTAIQELFK,

TUBB4A,

TUBB4B,

EIVHLQAGQCGNQIGAK, MREIVHLQAGQCGNQIGAK,

INVYYNEATGGK,

SGPFGQIFRPDNFVFGQSGAGNNWAK,

MAVTFIGNSTAIQELFKR,

YLTVAAVFR,

LTTPTYGDLNHLVSATMSGVTTCLR,

VLGTPVETIELTEDRR,

EVDEQMLNVQNK, LAADFSVPLIIDIK,

VYFLPITPHYVTQVIR,

MSATFIGNSTAIQELFKR,

SQYAYAAQNLLSHHDSHSSVILK, HAEELAAQK,

EQLENSPSRR,

IPVAGGDKAASSKPEEIK,

VSLKEYR,

TSENLALTGVDHSLPQDGSNAFISQK, AASSKPEEIK,

TYSLSSSFSSSSSTRK, MALLATVLGRF,

EIEYEVVR,

VTAVDWHFEEAVDGECPPQR,

KVLILGSGGLSIGQAGEFDYSGSQAIK,

LLESLGYSLYASLGTADFYTEHGVK,

HPQPGAVELAAK,LALGIPLPELR,

LVQNGTEPSSLPFLDPNARPLVPEVSIK,

LLDTIGISQPQWR,

VAHTCLHPQESLPDTR,

VIECNVR,RGPSEETGGAVFDHPAK,

WFESSGIHVAALVVGECCPTPSHWSATR,

HAEELAAQKR,HSHSQLPVGTGNK,

LPPQTLEGDPGAEGEEGTTTVRK,

ESPGAAATSSSGPQAQQHR,

QQAANLLQDMGQR,

GRDVLDLGCNVGHLTLSIACK,

HHHHHPLPAAGFK,

AAPPDVGEER,

MRIPVAGGDK,

NKPIPALR,

GFAFVEFETKEQAAK, FLREQIEK,KEYLALQK,EDNIQAKEENMDTSNTSISK,

LARP7, MEPCE,

RGGGGTELGPPAPPRPR,

TLNAETPK,

GQHHQQQQAAGGSESHPVPPTAPLTPLLHGEGASQQPR,

GGGGTELGPPAPPRPR,

NPSCEDGRLR,

GFQRPVYLFHK,DAPQPYELNTAINCRDEVVSPLPSALQGPSGSLSAPPAASVISAPPSSSSR,

FKTPEDAQAVINAYTEINK,

LEILSGDHEQR,

EENMDTSNTSISK,

QVLADIAK,

DTLAAISEVLYVDLLEGDTECHAR,

QVDFWFGDANLHK,

SRDGYVDISLLVSFNK,

KLTTDGK,

HCWKLEILSGDHEQR,

DPVEILIPK,

KFQYGNYCK,

GRDPVEILIPK,

HYLSEELR,

AAPPDVGEERR,

CAPSAGSPAAAVGR,

IQLKPEQFSSYLTSPDVGFSSYELVATPHNTSK,

ITPSYVAFTPEGER,

DNHLLGTFDLTGIPPAPR,

NQLTSNPENTVFDAK,

NELESYAYSLK,

TWNDPSVQQDIK,

TKPYIQVDIGGGQTK,

IDTRNELESYAYSLK,

LTPEEIER,

HSPA5,

RSRPTSEGSDIESTEPQK,

NVNHSWIER,EYLALQK,

HKMGEEVIPLR,

IISTEPLPGR,SRPTSEGSDIESTEPQKQCSK,

VNATGPQFVSGVIVKIISTEPLPGR, CGNVVYISIPHYK,

IISTEPLPGRK,

HYLSEELRLPPQTLEGDPGAEGEEGTTTVR,

HLRPGGILVLEPQPWSSYGK, WVHLNWGDEGLK, MVGLDIDSR,

LPPQTLEGDPGAEGEEGTTTVR, HRGQHHQQQQAAGGSESHPVPPTAPLTPLLHGEGASQQPR,

WVHLNWGDEGLKR, DVLDLGCNVGHLTLSIACK,

VTHAVVTVPAYFNDAQR,

IEIESFYEGEDFSETLTR,

IINEPTAAAIAYGLDKR,

ITITNDQNR,

SQIFSTASDNQPTVTIK,

IEWLESHQDADIEDFKAK, VKQVLADIAK,

LDKSQIHDIVLVGGSTR,

SQIHDIVLVGGSTR,

SINPDEAVAYGAAVQAAILSGDK,

MKEIAEAYLGK,

FEELNADLFR,TVYVELLPK,

TQEKVNATGPQFVSGVIVK,

IIQKDIIK,

DAPQPYELNTAINCR,

NSCNVGGGGGGFKHPAFK,

KTLTETIYK,

SSAVVELDLEGTR,

STAGDTHLGGEDFDNR, TTPSYVAFTDTER,

LLQDFFNGKELNK,

GTLDPVEK,

DAGTIAGLNVLR,ITITNDK, GPAVGIDLGTTYSCVGVFQHGK, QTQTFTTYSDNQPGVLIQVYEGER,

NSTIPTK,

HWPFQVINDGDKPK,

MKEIAEAYLGYPVTNAVITVPAYFNDSQR, QTQIFTTYSDNQPGVLIQVYEGER,

LVNHFVEEFKR,

LLQDFFDGRDLNK,

CQEVISWLDANTLAEKDEFEHK,

LLQDFFNGR,

AAAIGIDLGTTYSCVGVFQHGK,

HSPA1B,

LVNHFVEEFK,

NQVALNPQNTVFDAK,

AQIHDLVLVGGSTR, LDKAQIHDLVLVGGSTR,

LSKEEIER,

ELEQVCNPIISGLYQGAGGPGPGGFGAQGPK, HSPA1A,

YKAEDEVQR,

NQVALNPQNTVFDAKR,

DAGVIAGLNVLR,

QVDFWFGDANLHKDR,

TPEDAQAVINAYTEINKK,

VEASSLPEVR,

GFAFVEFETK,

VNATGPQFVSGVIVK,

KPGIFPK,

SSSEDAESLAPR,

AIEFLNNPPEEAPR,

SRPTSEGSDIESTEPQK,

DRVEASSLPEVR,

AIEFLNNPPEEAPRKPGIFPK,

IINEPTAAAIAYGLDKK,

VQVEYKGETK,

ARFEELNADLFR,

TVTNAVVTVPAYFNDSQR,

MVNHFIAEFK,

VEIIANDQGNR,

VCNPIITK,

HSPA8,

VQVEYK,

ILDKCNEIINWLDK,

LLQDFFNGK,

DNNLLGR,

ATAGDTHLGGEDFDNR,

STLEPVEK,

VEILANDQGNR,

IINEPTAAAIAYGLDR, HSPA6,

ARFEELCSDLFR,

AQLGVQAFADALLIIPK,

TKVHAELADVLTEAVVDSILAIK, AKLEEAILGSIGAR,

TTSGDDACNLTSFRPATLTVTNFFK,

NSLLASYIHYVFR, TVEFLNEK,ELQHDVQK,IEHYASR,

TPM3P4,

MELQEIQLKEAK, EVPDADSLQR,

MILIQDGSQNTNVDKPLR, DKVQAGDVITIDK, IPDEFDNDPILVQQLR, AINQQTGAFVEISR, TQGFLALFSGDTGEIKSEVR, DVYVPNTTYR, QSRLEQEEQQR,LTPEEEEILNK,

TLGILGLGR,

VLSIGDGIAR,

IIDVVYNASNNELVR, EVAAFAQFGSDLDAATQQLLSR, LAATNALLNSLEFTK,

VLANPGNSQVAR,QADVNLVNAK,

LLIVSTTPYSEKDTK, IGGGIDVPVPR,RAPDQAAEIGSR,YAIQLITAASLVCR, IGGDAATTVNNSTPDFGFGGQKR, TVLEHYALEDDPLAAFK, CDY2B,GSTA1,

GSTA3,NFIA,CDY1,

NFIX, KRTAP2-4,CCDC72,RPS27,EXOG, KRTAP2-1,

AQFEGIVTDLIR,AQFEGIVTDLIRR,

QAVTNPNNTFYATK,

LQQELDDLLVDLDHQR,

KLEGDSTDLSDQIAELQAQIAELK, RYDDPEVQK,

VISGVLQLGNIVFKK, RPS27P17, EKPYFPIPEEYTFIQNVPLEDR,

GYFEYIEENKYSR,

NVVHQLSVTLEDLYNGATR,

NVVHQLSVTLEDLYNGATRK,

TIVITSHPGQIVK,NQSQGYNQWQQGQFWGQKPWSQHYHQGYY, AHQANQLYPFAISLIESVR,

ITFTGEADQAPGVEPGDIVLLLQEK, GIFEALRPLETLPVEGLIR, SLSALGNVISALAEGSTYVPYR,

LYDILGVPPGASENELKK, ILQDSLGGNCR,

FQGEDTVVIASKPYAFDR, YGEQGLR,

AFHNEAQVNPERK,

GALQNIIPASTGAAK,

VFEVSLADLQNDEVAFRK,

YLSLANNK,VLCELADLQDKEVGDGTTSVVIIAAELLK,

LVINGNPITIFQERDPSK,

ACQSIYPLHDVFVR,

ASQVECTGAR,

LVVPELSSR, DVIVRAADCK, KLFCPQK,KDGAEIDK, LTPESTR, IFVDIEK,CRVVSR, RDGADIHSDLFISIAQALLGGTAR, VLVDSLVEDDRTLQSLLTPQPPLLK, ILPVNLKAVK, LLEEQLQHEISNK,AFAALPSSRPVYDIQSPDFAEELR, SESLESPRGER, EATSVPR,KQNELK, VQDFLQR,KEVDENK, QDSSLGGRAR,KCLPIR, AQTELDPPR,LREATEAK, QITEEQIK,NTRVYPCVWCK, NREMVDVRPR, EILRER, ARQEGGYR,LKNELLK, TDMIQALGGVEGILEHTLFK, SSNNNGSVRTA, LDLRSCR, LLLSYGASR,RYEEILDAR, KLQDLIK, HLPSTEPDPHVVR, LVSRCGAVR,SFAVGTLAETIQGLGAASAQFVSR, AAQVFALLFVTEYLTK, VYFGGFFIR, SKFLDALISLLS,APFQASSPQDLR, LINFIRLK,IIPGFMCQGGDFTR, LASYVEK,SFKPDFGAESIYGGFLLGVR, KFDVNTSAVQVLIEHIGNLDR, QVVNIPSFIVR,LVSQIDNTK,KSNVKPNSGELDPLYVVEVLLR, NVVHQLSVTLEDLYNGVTKK, FSDHVALLSVFQAWDDAR, VGRPSNIGQAQPIIDQLAEEAR, YQLLQLVEPFGVISNHLILNK, QLREEQR,LFHFLGIFLAK,LDLAGPLLAFLFR, AELATEEFLPVTPILEGFVILR, AIAYLFPSGLFEK, SSSRSPR, LALFPLLPK, GIAYVEFVDVSSVPLAIGLTGQR, NFIAVSAANR,VVELSYWEWLPLLK, VLSEAAISASLEKFEIPVK, KLDSLTTSFGFPVGAATLVDEVGVDVAK, ICFELLELLK,IFVEFTSVFDCQK, ILGTPDYLAPELLLGR, LFAQLAGEDAEISAFELQTILR, TSPY2,NT5DC4,IQCJ-SCHIP1, TSPY3,GFGFVLFK, TYQQSCVSSCRR,VLNSIISSLDLLPYGLR, HDADGQATLLNLLLR, VADLVHILTHLQLLR, SDVWSFGILLTELVTK, FVNLGIEPPKGVLLFGPPGTGK, IDLRPVLGEGVPILASFLR, DFVSEQLTSLLVNGVQLPALGENKK, FGAQNESLLPSILVLLQR, LVAIVDVIDQNR,GYYSLETFTYLLALK, NIVEAAAVR, GSAITGPVAK, WVGGPEIELIAIATGGR, TEALSVIELLLK, SVGDGETVEFDVVEGEKGAEAANVTGPGGVPVQGSK, ICHQIEYYFGDFNLPR,

ACSL3, MYLK2,HADHA, U2AF2,RBM39, SF3B1,GIGYF2, FASTKD5,ZBED1, MASTL,KRT23, MRPS9,DNAJA4, DHX9, PUF60,RPS9, HECTD1, MATR3,POLR2B, PSMD2, SREK1, IQGAP2,HNRNPD, IRAK1,PSMD3,CAPN2, FYN,KRTAP11-1, PSMC2, CAND1, KIAA0368, RPS26, PPP6C, RPL14, COPG2, YBX1, RARS, RPL23, VPERLR,CCT5, LEELKR,SSB,CHSY1, FBXL19, GOLGA4,CLCC1,IPO9, TNN,MYBPC1, DNAJA3, OR6K4P,TDRD6, CCDC22, TTN, CDC42BPB, COQ6,PRPF8,OFD1, DHX29, XPOT,BARD1, PANK4,FCHO1, ULK1, IPO4,ARHGEF2,RAD50, PPP1R15A,FSIP2, NUP205,PYGL, SCRIB, GTF3C4, ZMAT1, CASKIN1,UBR4,ZMYM3,KTN1,NET1, SKI, CLTC,PPIA,PRMT3, SKIV2L2,CDC45, KRT39, COPB2,HNRNPUL1,

KIVDQIRPDR,

ZFP57,

IEELEEELEAERTAR, MKPEFNVR,EAVSILK,

MYH6,USP17L4, DDX17,

MEKENQK,

TBC1D4,

KNAVVMDQEPAGNR, QCALAALR,

PFKFB4, MAPK6,ACAA2,KIR3DL3,

SSWPNSEK, HALREIK,AVASWSR,DRLHLR,

EDNRB,NOA1, BIVM-ERCC5,RAB11A,

IREFCQR,SDLRHLR,LDGSRLIK,TILEMLLGFLK, HGSGSGHSSSYGQHGSGSGWSSSSGR, QLATVNEQPLQNGFEELIQWTK, LVLDAFALPLTNLFK, ESLNASIVDAINQAADCWGIR, EWAAQYR, VELSEQQLQLWPSDVDKLSPTDNLPR, TNTPADVFIVFTDNETFAGGVHPAIALR, FAGGDYTTTIEAFISASGR, TDYNASVSVPDSSGPER, NKEEAAEYAK, TQLRNEFIGLQLLDVLAR, LPHLPGLEDLGIQATPLELK, LVEGILHAPDAGWGNLVYVVNYPK, IVSRPEELREDDVGTGAGLLEIK, INALTAASEAACLIVSVDETIK,

IINDLLQSLR,GGGGPGGGGPGGGSAGGPSQPPGGGGPGIRK, TIALNGVEDVR,AVLIAGQPGTGK,FVQCPDGELQK,SYTEDWAIVIR,SAIFSITYPSQDVFLVIK, AQGPQQQPGSEGPSYAK,

EEIKIK,MELQELQLKEAK,KLASQGDSISSQLGPIHPPPR, KPFSVSSTPTMSR,KNLEDGINNLK,

VQISPDSGGLPER,AIECLEK,

LLVKCLSLK, GLGLDDALEPR,VLIPIHEANR, GTSYQSPHGIPIDLLDR, AQGPQQQPGSEGPSYAKK, VLLDLSAFLK,

VATAQDDITGDGTTSNVLIIGELLK, STVAQLVK,

QNLSKEELIAELQDCEGLIVR, AAVENLPTFLVELSR, AVDSLVPIGR,

EHIKNPDWR,

GIDPFSLDALSK,

AGTGVDNVDLEAATR,

HVVFIAQR, LFNLSKEDDVR,FLESVEGNQNYPLLLLTLLEK, GPYESGSGHSSGLGHR, SGNYTVLQVVEALGSSLENPEPR, IEPLSPELVAAASAVADSLPFDK, APVPGTPDSLSSGSSR, VPNSNPPEYEFFWGLR, DKPELQFPFLQDEDTVATLLECK,

TLIQNCGASTIR,TARDBP, NEFIGLQLLDVLAR, TROVE2, INF2,CCT7, AFAFVTFADDQIAQSLCGEDLIIK, CCT3, TRDELEVIHLIEEHR, INALTAASEAACLIVSVDETIKNPR, STOML2, AHITLGCAADVEAVQTGLDLLEILR, RPS6,HRNR, GCIVDANLSVLNLVIVK, IIIPEIQK,CSE1L, ILEPGLNILIPVLDR, QSLGHGQHGSGSGQSPSPSR, CNP, EPRS,HNRNPK,NFDFEDVFVKIPQAIAQLSK, GSYGDLGGPIITTQVTIPK, LQAILEDIQVTLFTR, NDUFA9,SELLSQLQQHEEESR, TLTAVHDAILEDLVFPSEIVGKR, MRPS31, MMS19, EIDKNDHLYILLSTLEPTDAGILGTTK, RPS7,ACDSVTSNVLPLLLEQFHK, MAGED2,

RAB11B,IPTHLFTFIQFK, ILSISADIETIGEILKK, TSDLIVLGLPWK, YNVYPTYDFACPIVDSIEGVTHALR, SSVSGIVATVFGATGFLGR, SSQLLWEALESLVNR, TAVETAVLLLR,VHTVEDYQAIVDAEWNILYDKLEK, MYH7, PFKFB3,MYO5B,KIR2DL3, TBC1D1,MAPK4,DDX5,ERCC5, USP17L8,EDNRA,SNURF, PRKCE,

FAYKDEYEK,EATQELIEDLR,

CD209, AP1M1,RAB11FIP4,

AAVGELPEKSK, VFLSGMPELR,

TMEM120B,PFKM,SENP7,

EQWWLK,EELKLK,KNEIQK,

PAPD4,

RLLESLR, QEEQRR,

PNMA5,ROCK2, CDRT1,

AGARLDVR, ENLESRLK,IRSLLER,

CDKN2B,

APMTTVR,

KLHL9, TAOK1,

ITWEEK,EALASHLR,

ANGPT1,

GPSYSLR,

LOC338579,

LTIMLEK,

THEX1,

ELEKIK,

BAT2L,LOC100505868,MYH4,

AFLVRSR,KNMEQTVK, KSLFMLK,

LOC729562, LOC100509661,

QREVLR,

CGN,

EKSIFLVAHR,ELKENLDR,

NPHP3,

YEQAEHFR,

SHROOM3, LOC653566,

GSTA2,CDY1B, CDY2A, GSTA5, NFIB,NFIC, LOC729973,

ADGYVLEGKELEFYLR, LYDNPWR,

VIHDNFGIVEGLMTTVHAITATQK, NCIVLIDSTPYR,

VPTANVSVVDLTCR,

CDSDILPLR,PPP2R5A, KRTAP2-3, RPS27L,RPS27P9,KRTAP2-2, VQVALEELQDLKGVWSELSK,

VFEVSLADLQNDEVAFR, YPVNSVNILK,

TTDGYLLR,FATEAAITILR,

ILDDDTIITTLENLKR,

QAVTNPNNTFYATKR,

LLGQFTLIGIPPAPR,

QAQQERDELADEIANSSGK, TFHIFYYLLSGAGEHLK, VLENAEGAR,

SSGPTSLFAVTVAPPGAR, VSHLLGINVTDFTR,

AVVVCPK,DLPEHAVLK,

STNGDTFLGGEDFDQALLR,

VNFPENGFLSPDKLSLLEK, ISFLENNLEQLTK,

ITFTGEADQAPGVEPGDIVLLLQEKEHEVFQR, NTIQWLENELNR,ITFHGEGDQEPGLEPGDIIIVLDQK,

SLSALGNVISALAEGSTYVPYRDSK, IGLVEALCGFQFTFK,

VIEPGCVR,QISQAYEVLSDAK,

RUVBL2,EIVTNFLAGFEA, CCT6A,GALQYLVPILTQTLTK, ATP5A1, KPNB1, TPM3,VHAELADVLTEAVVDSILAIK, DOCK7, KHSRP,TIMM50, DMD, LVEPQISELNHR, SYNE2, AATEELLANK,TMF1, XIRP2,ELMQLEK, GEGLEYENIK,LTEGCSFR,TRISNLPTVK,ASANEMLIAGR, QADKVWR, KLEELK, CTRPICEPCR, RPS8, LLACIASRPGQCGR, DLPLLLFR,GAPDH, LISWYDNEFGYSNR, PHGDH,SLLVIPNTLAVNAAQDSTDLVAK, RPS3A,DYNC1H1, TCP1, LTLFGNSLK,LRRC15,LITEDVQGK,ENFIPTIVNFSAEEISDAIR, NFILDQTNVSAAAQR, MYH9,TTPSVVAFTADGER, QNKELK, HNRNPU,HSPA9, KIF5B,DNAJA2, TNLSVHEDKNR,NVLCSACSGQGGK, DNAJA1, LIIEFK,

TUFM,RCN2, MRPS34,AVQYLSSQDEK, FASN, ARHGEF1, PTCD3,EEF2,SFN,

U2AF1,ATAD3C,

WFNGQPIHAELSPVTDFR, AVIDLNNR,

RPS2,

GTGIVSAPVPK, TYSYLTPDLWK,

GLLLFVDEADAFLR,

SPNQNVQQAAAGALR,

YVEPIEDVPCGNIVGLVGVDQFLVK, AIGIEPSLATYHHIIR, DLEKPFLLPVEAVYSVPGR, GTVVTGTLER,ALLELQLEPEELYQTFQR, VSNSPSQAIEVVELASAFSLPICEGLTQR, HEEEAFTAFTPAPEDSLASVPYPPLLR, LPLFGLGR,SDYDREALLGVQEDVDEYVK, LSEEEILENPDLFLTSEATDYGR, LAEQAER, LLLKSHSR,SAYQEAMDISKK,ALGLGVEQLPVVFEDVVLHQATILPK, HLKAEVDAEKPGATDR,

LGLALNFSVFHYEIANSPEEAISLAK, DLVEAVAHILGIR,QRCPPGVVPACHNSK, PKP1, EMPPTNPIR, QESGYLIEEIGDVLLAR, EALLGVQEDVDEYVK, STAISLFYELSENDLNFIK, SAYESQPIR, VAVLQALASTVNR, LLQLLGRLPLFGLGR, AWGILTFK,QLHDDYFYHDEL, LLDAVDTYIPVPAR, TFCQLILDPIFK,HYAHTDCPGHADYVK, RQESGYLIEEIGDVLLAR,

RAB11FIP3,ANGPT4,PRRC2B, NPHP3-ACAD11,TMEM120A, SHROOM2,CLEC4M, AP1M2, SPCS2,

LOC100509022,ATAD3A,

TLLEGSGLESIISIIHSSLAEPR, RPS2P40, LOC652798,

ISALNIVGDLLR, IASLEVENQSLR,

SNRPB, GFPT2,EEF1DP1,

VLGLVLLR,

NDEL1,

VIQQLEGAFALVFK,

05-Sep, STRCP1,

RMQEMLQR, ASQPSAHISPR,MREENMER, YELLEEK,

BEX2,

IQDALSTVLQYAEDVLSGK,

EIF3FP3, TRPC7,

TAOK2,KLHL13,CDKN2A,USHBP1,

ARF3,

VELSDVQNPAISITENVLHFK, SALSGHLETVILGLLK,

ANXA2P2, LOC100291405,

MLAEDELRDAVLLVFANK, NQSFCPTVNLDKLWTLVSEQTR,

LOC643751, EIF4A1,RPL27AP6,

NVFDEAILAALEPPEPK, IPDWFLNR,

RPS18,

VLITTDLLAR,

ERI1,ANKRD30B,FBXW10, ZFHX2,MYH2, MOAP1,HERPUD1, OVOL3, SOS1, MTMR6, PFKP,NOS2,

TIGD6,RPP30,SVSTTGLQTAGR, YTISSALNLMQICKGK, HNF1B,

BEX1, STRC,SEPT5-GP1BB, EIF3F, TRPC6,

UGGT2, RNF167,LEGKMMNGLR, VQIDYK,PLXNA1,SSLKGLAR,GDSGGPLVTR,LTALFCCNASGTEK, TMPRSS11A, MAN2B2,

NR3C1, KCLQAGMNLEAR, KCNV1, SHPRH,KENPGGTGLEK, KELSETLDFK,BTNL8, TSGGVNPEPCPICAR, TVGATALPR,

HNRNPA1,

SESPKEPEQLR, NQGGYGGSSSSSSYGSGR, GFVENSYLAGLTPTEFFFHAMGGR, SIAANMTFAEIVTPFNIDRLQELVR,

CCT8,POLR2A,

LFVTNDAATILR, AIADTGANVVVTGGK,

SSGPYGGGGQYFAKPR, LSGEAFDWLLGEIESK, TTSNDIVEIFTVLGIEAVR, LFIGGLSFETTDESLR, NLRDIDEVSSLLR,

VITLSLAGR, RCCHSACGR,CVDPWLTQTR, MLNKQQSEDDVR, CPNE2,EAEKQR, DENND2D,WFDC5,CELF4,NUMA1, TVLRGSGNLLR, RPL7P34, AGNCYILAEPK,DYNC2H1,MPLQDLVTAEKPIPLFVEK, ARHGAP5, NMGYVQQR, DKC1,NPPAGYFQQK, BUB3, DDX12P,ILVESQNR, CCNL1,FSGVCPDSR, SERPINA10,NDSLRTNVFVR,LIELVDR,MBLAC2, PHLPP1, LRP2,IMLPGVLR, FEEIDLSGNK,QLNLTIR,VPS13D,

C9orf142,SYAPELGR, CPGESLINPGFKSK, PDGFRB, ISLIILIMLWQK, LPIN3, RSFHEVR,CMNCGKVFSR, RIT1, CDR2L, RGMSILR,ZNF83,AESHMQWAWGRLPK, APAF1, LOC442132,SGLLMKLINLCTR, GTF3C5,LLKFESCGK, EIEAAIERK, REQQER, LRDQEK,RELB, TCHH,IESLGLR, ZFP37,STCGLLETQIKK, WDSUB1,CCDC39,

EEF1D,ARF1, SNRPN,ANXA2, NDE1,CDC42, RPS18P5, RPL27A, EIF4A2, PTPLAD1, GFPT1,

KLEEATAAR,DUOX2, C14orf133,KEIEDIR, CSPP1,TMEM57, VCEKDPALK,CNTFR,KVEEELR, RYLTQGITQGK, ELLRER,SDLGVHLR, ILK,LSTGLMITSVAVELCKNVK, ST8SIA6, MLLTRK,PLSCR1,SLALETVQNDLR, QGKNVLAK,BCAR1,RVETAMEACSLPSSR, PLCB1, CCHCR1,ARLPGEPLVLLPGR, CDK11B, NEQESAVHPR,STK32C,VSMAPPR, DBNL,VSGQEGGLR,MSH5, ANKRD20A8P, TLALLDEEPK, SWTAADTAAQISKR, RLEEEVR, RFC1, C11orf51,MCC, HLA-G,RLQQTER, AFGGPVPTSALRTVNLGTPNQDA, HCRP1, SLRQSTIAK, CTCF,PHF3,QLCPESIK,

ITENAANKK,AKNAD1, EVQPAPELEIK, MRPL38,CCNO,NLQVSRVLR, QSPGSLKAVLK, ZSCAN29, LLTQSAPK, C17orf66, LTIKSEMR, HADH, IRDSEAQLQVLLR, ZNF148, LNTCAMKGGLLR,ARRDC5, NTFTPGEK,EDVKLHR, BCOR, EGGHPATK, FAM47B, KLEDAWAR, C9orf96,UCN, QSDQDITK,CHRDL2,CCT6P4,EYFGEK, ALLLLLAER, ESIAALHR, ICT1, CHTMAFVTR, SPVSAPKLPK,KLF10,LRDQTR,KIAA1217, JAKMIP3, LPCHPNDGTR,LOC100131193,

TGLYSKR,

LGAPAAGGEEEWGQQQR,

AENLQLLTENELHR,

QELIKEYLELEK,

VRELELELDR,RLGGDDAR,

HEXIM1,

HWRPYLELSWAEK,

IRAEMFAK,

LTWEEKK,

VATWVDETLSSVASKLEHTAQTYSELQGER,

LTWEEK,LTYQDAVNLQNYVEEK,

SGTISTSAAAAAAALEASNASSYLTSASSLAR,

AFPQLGGRPGPEGEGSLESQPPPLQTQACPESSCLR,

AASTAPSSTSTPAASSAGLIYIDPSNLR,

UBR5,

NDEELNKLLGK,

HIST2H2AC,

HIST1H2AJ,HIST1H2AH,HIST1H2AD,

HIST2H2AA3,HIST1H2AG,

NDEELNK,

RHWRPYLELSWAEK, TQSPGGCSAEAVLAR,

HIST1H2AI,

GQPVAPYNTTQFLMDDHDQEEPDLKTGLYSK,

LQQLQACTGQQSCR,

HLQLAIR,H2AFJ,

VTIAQGGVLPNIQAVLLPK, LSQAEEETRR,

ACTG1,ACTB,

VAPEEHPVLLTEAPLNPK,

GYSFTTTAER,

AGFAGDDAPR,

DSYVGDEAQSK,

IWHHTFYNELR,

VLQDWNALK,CTADILLLDTLLGTLVK,

WSESEPYR,LSAVEAIANAISVVSSNGPGNR,

SFYTAIAQAFLSNEK,

LOC728379,

USP17,USP17L3,

LOC100288520,

LOC100287144,USP17L2,

LOC100287327,

RAB35,

RAB3B,

RAB30,

RAB39B,RAB3C,

RAB8B,

LOC100287404,LOC100287441,

LOC100287478,LOC100287205,

LOC728369,

LOC100287364,LOC100287513,

USP17L1P,SYELPDGQVITIGNER,

LOC100287178,

USP17L5,

FLQEQNK,

IIAPPER,LOC100287238,

DSYVGDEAQSKR,

VPDCFQR,

ELELELDR,

ELELELDRLR,WSSGVGGSGGGSSGR,

LAGNTLGSR,

HWKPYYK,

HIST1H2AA, HIST3H2A,

HIST1H2AB,QELVRDYLELEK,

HIST1H2AE,

HLQLAIRNDEELNK, AGLQFPVGR,

LSQAEEETR,

TSGAPGSPQTPPER,

HIST1H2AC,

VAPDEHPILLTEAPLNPK,

ACTBL2,NDEELNKLLGR,H2AFX,

NDEELNKLLGGVTIAQGGVLPNIQAVLLPK,

DYLELEKR,

HVAYVFQALIYWIK,

NVAIFTAGQESPIILR,

TSDSPWFLSGSETLGR, NNPLYHAGAVAFSISAGIPK,

VDGAYVAVK,ILGLCLLQNELCPITLNR,

GEHLLLFLVQTVAR,

LRQENQMWNR,

HEXIM2,DFSETYER,

YHTESLQNMSK,

EGEKGQNGDDSSAGGDFPPPAEVEPTPEAELLAQPCHDSEASK, QVEELAAEVQR,

EYLELEK,

FHTESLQGR,

KDFSETYER,

GQNGDDSSAGGDFPPPAEVEPTPEAELLAQPCHDSEASK,

MEDENNRLR,GQPVAPYNTTQFLMDDHDQEEPDLK, DYLELEK,LRAENLQLLTENELHR,

RLQQLQACTGQQSCR, YFPTQALNFAFK,

GNLANVIR,

YFAGNLASGGAAGATSLCFVYPLDFAR,

SLC25A4,

SLC25A6,

DFLAGGIAAAISK,

TAVAPIER,

SLC25A5,DFLAGAVAAAVSK,

AAYFGIYDTAK,SFNRGEC,

VYACEVTHQGLSSPVTK,

VDNALQSGNSQESVTEQDSKDSTYSLSSTLTISK,

VDNALQSGNSQESVTEQDSK,

DSTYSLSSTLTLSK,

LLIYWASTR,

IGKC,

VDNALQSGNSQESVTEQDSKDSTYSLSSTLTNSK,

LYACEVTHQGLSSPVTK,

AASQSTQVPTITEGVAAALLLLK,

EQGVLSFWR,

LLIYGASSR,IGK@,

PSSLSASVGDRVTITCR,

SSGSAALLALTWTCLLVR,

RAB4A,MIA-RAB4B,

RAB3A,

RAB8A,

RAB1B,

RAB37,

RAB1C,

LQIWDTAGQER,

RAB3D,

RAB10,

SWEQKLEEMR, GCN1L1,

QIGSVIRNPEILAIAPVLLDALTDPSR,

CDC37,

HFGMLR,

KDSSSVVEWTQAPK, KPLVIIAEDVDGEALSTLVLNR,

INQQEEPGFEVITR,

VGLQVVAVK,

IDVEDILCKAEAISLQMVK,

TLLVNCQNK,

LVQDVANNTNEEAGDGTTTATVLAR, WSFLFSLTDLK,

LLIESLEK,

TBC1D15,

LQAEAQQLR,

CALMEQVAHQTIVMQFILELAK,

TEEDSEEVREQK,

LGPGGLDPVEVYESLPEELQK,

LQAEAQQLRK,

MEQFQK, TFEQLHSTIGHQALEDILPFLLK,

TGDEKDVSV,

NPEILAIAPVLLDALTDPSR,

TFVEKYEK,

LVLPSLLAALEEESWR,

VGKGEPGAAPLSAPAFSLVFPFLK, LSVADSQAEAK,

VLPLEALVTDAGEVTEAGK,

AALLETLSLLLAK,

RAB12,

RAB43,

GANPVEIRR,

HSPD1,

ALMLQGVDLLADAVAVTMGPK,

VTDALNATR,

RAB1A,

CEFQDAYVLLSEKK,

VGGTSDVEVNEKK, EFSFLDILR,TALLDAAGVASLLTTAEVVVTEIPKEEK,

VTNYIFDSLR,AAVEEGIVLGGGCALLR, GYISPYFINTSK, NDSPTQIPVSSDVCR,

TVAAPSVFIFPPSDEQLK, DFLAGGVAAAISK,

SGTASVVCLLNNFYPR,

VDNALQSGNSQESVTEQDSKDSTYSLSSTLTLSK,

HKVYACEVTHQGLSSPVTK, DSTYSLSSTLTISK,

GAWSNVLR,GLGDCLVK,

LFNDSSPVVLEESWDALNAITK,

FSSVQLLGDLLFHISGVTGK,

ILPEIIPILEEGLR, VLAFLSSVAGDALTR,

VLPQLISTITASVQNPALR,

ENVNSLLPVFEEFLK, ASEAKEGEEAGPGDPLLEAVPK, RLGPGGLDPVEVYESLPEELQK,

LKELEVAEGGK,

EGEEAGPGDPLLEAVPK,

RWDDSQK,

SMPWNVDTLSK,

ELEVAEGGKAELER,

SLSQSFENLLDEPAYGLIQK,

HYGFNEILK,SKWSFLFSLTDLK,

FLLGYFPWDSTKEER,

VLEKDAEVIVDWRPLDDALDSSSILYAR,

GANPVEIR,

KISSIQSIVPALEIANAHR,

IQEIIEQLDVTTSEYEKEK,

GVMLAVDAVIAELKK,

LKVGLQVVAVK,LSDGVAVLK,

ISSIQSIVPALEIANAHR, RAB4B,

HSQFIGYPITLFVEK, SIYYITGESK,

HSP90AA1,

RLSELLR,

HLEINPDHPIVETLR, FICIGALYSELLAVSSK,

HSP90AB1,

HSQFLGYPITLYLEKER,

HSP90AB2P,

ALLFVPR, LGIHEDSTNR, NTPVQSPVSLGEDLQWWPDKDGTK,

AKFENLCK,

DRGSGLLGSQPQPVIPASVIPEELISQAQVVLQGK, HGLEVIYMIEPIDEYCVQQLK,

RVFIMDSCDELIPEYLNFIR,

HSQFLGYPITLYLEK,

RVFIMDNCEELIPEYLNFIR, ELISNSSDALDK,ELISNASDALDK,

TLTIVDTGIGMTK,

HIYYITGETKDQVANSAFVER, TLTLVDTGIGMTK,

LISQIVSSITASLR, VVVITKHNDDEQYAWESSAGGSFTVR,

LVSSPCCIVTSTYGWTANMER,

TKPIWTRNPDDITNEEYGEFYK,

FENLCK, RAPFDLFENR, IVLLSANSIR,EKYIDQEELNK,HSQFIGYPITLYLEKER,

LGIHEDSTNRR, FYEQFSK,YIDQEELNK,

NPDDITQEEYGEFYK,

IRYESLTDPSK,EQVANSAFVER,GVVDSEDLPLNISR,

TKPIWTRNPDDITQEEYGEFYK,

SLTNDWEDHLAVK,

KHSQFIGYPITLYLEK, HLEINPDHSIIETLR,

HSQFIGYPITLYLEK, EDQTEYLEER,KHLEINPDHSIIETLR,

APFDLFENR,

HFSVEGQLEFR,

VMQMLLNGLYYIHR,

TKASPYNR,

FTLSEIKR,DPYALDLIDK,

FTLSEIK,

HENVVNLIEICR,

GSQITQQSTNQSRNPATTNQTEFER, IGQGTFGEVFK,

IGQGTFGEVFKAR,

LPAPPIGAAASSGGGGGGGSGGGGGGASAAPAPPGLSGTTSPR,

QYDSVECPFCDEVSKYEK,

EGFPITALR,

LLVLDPAQR,

VLMENEKEGFPITALR,

AANVLITR,CDK9,

GSQITQQSTNQSR,

ILQLLKHENVVNLIEICR,

IDSDDALNHDFFWSDPMPSDLK,

AYHEQLTVAEITNACFEPANQMVK, RSIQFVDWCPTGFK,

AVCMLSNTTAIAEAWAR,

TIGGGDDSFNTFFSETGAGK, TUBA1A,

RNLDIERPTYTNLNR,

FDGALNVDLTEFQTNLVPYPR, AYHEQLSVAEITNACFEPANQMVK,

SRSPYSSR,

AVFVDLEPTVIDEVR,

QLFHPEQLITGKEDAANNYAR, SIQFVDWCPTGFK,

TUBA1C,TUBA1B,

QLFHPEQLITGK,GLADQCTGLQGFLVFHSFGGGTGSGFTSLLMER,

LSVDYGK,

NLDIERPTYTNLNR,

TUBA4B,

IHFPLATYAPVISAEK,

QIFHPEQLITGK,

YMACCLLYR,VGINYQPPTVVPGGDLAK, LIGQIVSSITASLR,

Figure 3.3: Peptide-protein graph constructed from an AP-MS experiment with CDK9 as a bait. Proteins are drawn as red triangular vertices while peptides are drawn as blue round vertices. Note that there are various peptides shared by several different proteins.
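
As a minimal sketch of the structure shown in Figure 3.3, the peptide-protein graph can be stored as a bipartite graph in which each edge joins a peptide to a protein whose sequence contains it. The peptide and protein names below (pep1, PROT_A, and so on) are hypothetical placeholders rather than data from the CDK9 experiment, and the helper functions are illustrative assumptions, not the implementation used in this thesis.

```python
from collections import defaultdict

# Hypothetical (peptide, protein) pairs; placeholders only, not data
# taken from the CDK9 experiment shown in the figure.
edges = [
    ("pep1", "PROT_A"),
    ("pep2", "PROT_A"),
    ("pep2", "PROT_B"),
    ("pep3", "PROT_B"),
    ("pep3", "PROT_C"),
    ("pep4", "PROT_C"),
]


def build_peptide_protein_graph(edges):
    """Build both adjacency maps of the bipartite peptide-protein graph."""
    peptide_to_proteins = defaultdict(set)
    protein_to_peptides = defaultdict(set)
    for peptide, protein in edges:
        peptide_to_proteins[peptide].add(protein)
        protein_to_peptides[protein].add(peptide)
    return peptide_to_proteins, protein_to_peptides


def shared_peptides(peptide_to_proteins):
    """Return the peptides adjacent to more than one protein."""
    return {
        peptide: proteins
        for peptide, proteins in peptide_to_proteins.items()
        if len(proteins) > 1
    }


peptide_to_proteins, protein_to_peptides = build_peptide_protein_graph(edges)
shared = shared_peptides(peptide_to_proteins)
```

In this toy instance, pep2 and pep3 are the shared peptides: observing either one does not by itself determine which of its adjacent proteins was present in the sample.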



SLHDALCVLAQTVK,

ERCC4,

EKEAFEK,

FAVALDSEQNNIHVK,

SLGGDVASDGDFLIFEGNR,

DIVTESNKFDLVSFIPLLR,

VEENFVILFSDLTMHELK,

SLCO1B7,NLYIISVK,

RPL23P8,SLEQELASPILDIEDLVK,

PPAN-P2RY11,QLSLDVRR,

RAQHNEVER, EAMVKPFEK,

OR2M1P,PDCL2,

LOC653707,

USF2,

LICTTVPK,

IASLLGLLSK,HKLDVTSVEDYK,

RTEL1,

ETGANLAICQWGFDDEANHLLLQNNLPAVR,

PPAN,RPL23,

INALTAASEAACLIVSVDETIK,

MIIEDETEFCGEELLHSVLQCK,

ILVLTELLER,

FTSJD2,

TALLHALASSDGVQIHNTENIR,

DOCK6,

USF1,

SHSWVNSAYAPGGSK,

TLLACVLWVLK,

FRY,

LNDDAKR,

LWGIPDQAR,

ASB6, PSMD3,

HDADGQATLLNLLLR,

VYEFLDKLDVVR,

ALVSFSDYQIFQELIK,

ELFFLPLGFALK,

EFETAETLLNSEVHMLLEHR,

AQELIQATNQILSHR,

SCAF1,

KVEEELR,

IALREIR,

LPCAT1,

GRIA4,

QAELLEKR,SRPK1,

CDK11A,

QAQGLQQELGGLHSALLR,

LALFPLLPK,

HLQTYGEHYPLDHFDK, DLQSWINGIR,

TWFGTFIR,

INTS2,

QLLLELMGILPTVR,

NEFIGLQLLDVLAR, INF2,

SATDKQQELLVSLATVIFVASQK,

PSMC4,

ENAPAIIFIDEIDAIATK,

LQQELEFLEVQEEYIKDEQK,

LOC652826,

TYSYLTPDLWK,FAEFLLPLLIEK,

AKT2,

SSLLSATR,

HALIIYDDLSK,

FENAFLSHVVSQHQALLGTIR,

KEVIIAK,

GOLM1,

LOC100293491,

AALSSQQQQQLALLLQQFQTLK,

DIDEVSSLLR,

VINEEYK,

DVGRPNFEEGGPTSVGR,

VTADVINAAEK,

NAGNCLSPAVIVGLLK,

IDLLQAFSQLICTCNSLK, TQLRNEFIGLQLLDVLAR,

RBBP7,

ILQDGGLQVVEK,LAKENAPAIIFIDEIDAIATK,

ASHEEVEGLVEK,

TVALWDLR,

TVLEQVLPI,

DNAJB1,

FKEIAEAYDVLSDPR,

DLPLLLFR,

TLGILGLGR,

PHGDH,

QADVNLVNAK,

ICFELLELLK,

CCT8,

LKENQAR,

MYH6,

ILESSIPMEYAK,

KAVELDLVPGR,

FCSNLCLPK,

APDLFPTDFK,

SPISVAAAAIYMASQASAEK,

GTGAASFDEFGNSK,

EIGDIAGVADVTIR,

GTF2B,

VLNSIISSLDLLPYGLR,

SSTSNANDIIPECADKYYDALVK,

EIEDIIEEVTVGYIR,

SLIINTNPVEVYK,

FQATSSGPILREEFEAR,

IFYPETTDVYDRK,

IQGAP2,

NPNAVLTLVDDNLAPEYQK,

VLWLDEIQQAVDEANVDEDR,

VNAELQAR,GVLLDIDDLQTNQFK,

VLWLDEIQQAVDEANVDEDRAK,

LNDGSQITYEK,IYDVEQTR,

KYDYYYNTDSK,

CLIATGGTPR,SNIWVAGDAACFYDIK,

SITIIGGGFLGSELACALGR,

AIFM1,

DKVVVGIVLWNIFNR,

AALSASEGEEVPQDKAPSHVPFLLIGGGTAAFAAAR,

ALGTEVIQLFPEK,

ISGLGLTPEQK,

AIVFRPFKGEVVDAVVTQVNK,

HSIPSEMEFDPNSNPPCYK,

YFGPNLLNTVK,

VDKNDIFAIGSLMDDYLGLVS,

TMDEDIVIQQDDEIR, GFVLYPVK,

GEVVDAVVTQVNK,

POLR2G,

YGFVIAVTTIDNIGAGVIQPGR,

LFTEVEGTCTGK,

VGLFTEIGPMSCFISR, NLLIFENLIDLK,

SLGPPQGEEDSVPR,

DVLIQGLIDENPGLQLIIR,

LQSVQALTEIQEFISFISK,

GQAVTLLPFFTSLTGGSLEELRR,

AQNVTLEAILQNATSDNPVVQLSAVQAAR,

KPNA4,DAQVVQVVLDGLSNILK,

PRKDC,LOC731751,

KPNA3,

GALQYLVPILTQTLTK, FLEFQYLTGGLVDPEVHGR,

TDYNASVSVPDSSGPER,

HNRNPK,

HNRPDL,

PPP1CA,

AFHNEAQVNPERK,

PALM2-AKAP2,

AAVENLPTFLVELSR,

QLDAGEEATTTK,

SQVLDYLSYAVFQLGDLHR,

INTS5,

NNIFYILLR,

MEPLATVESLEQYLLK,

HECTD1,

ILQQIEEPLALASGALPDWCEQLTSK,

GQVKPSTSSQPILSAPGPTK,

RAD21L1,

NQGGYGGSSSSSSYGSGR,

YLVEVEELAEEVLADKR, TYEEGLKHEANNPQLK, GQGSVSASVTEGQQNEQ,

STIP1,

MSLLQLVEILQSK,

VKEEIIEAFVQELR,

URI1,

CBWD2,

INISEGNCPER,NDEELNKLLGGVTIAQGGVLPNIQAVLLPK, HIST2H2AB,

SQQVIVQGVHELYDLEETPVSWKDDTER,

SISCEEATCSDTSESILEEEPQENQKK,

LOC100291405,IPDWFLNR,

RPS18,

CDKL4,

LASYLDR,

BAT2L,

SESPKEPEQLR,

TDRD3, SF3B1,

ETEGDVTSVK,

TQTSDPAMLPTMIGLLAEAGVR,

AIFTGHSAVVEDVAWHLLHESLFGSVADDQK,

ELEKIK,

SLESSQTDLK,

POLR2D,

SNPEATNQPVTEQEILNIFQGVIGGDNIR,

TALGVAELTVTDLFR, ILALVPPWTR,

BCKDHA,

KFDVNTSAVQVLIEHIGNLDR,

ENAPAIIFIDEIDAIATKR,

RPS2,

TLNYTAR,

DSQVVQVVLDGLK, ILLGQNR,

ILDELDKDDSTHESLSQESESEEDGIHVDSQK,

DFLAGAVAAAVSK,

FAM75A1,

FAM75A6,

RGPD4,

NCDYQQEADNSCIYVNK,

QLDDKDEEINQQSQLVEK,

EISFAYEVLSNPEKR,

IGLVEALCGFQFTFK,

RQESGYLIEEIGDVLLAR, VIM,

SUPT5H,

SSVGETVYGGSDELSDDITQQQLLPGVKDPNLWTVK,

PKP2,

DIKPSNLLVGEDGHIK, TILEMLLGFLK,EVATAIR,

SPTAN1,MMS19,

SFTTTIHK,

MAD2L1,EFCAB7,DDX1,

LPLQDVYK,

WNPTAGVAFEYDPDNALR,

KLASQGDSISSQLGPIHPPPR,

HSVGVVIGR,

ELWFSDDPNVTK,

SEYSELDEDESQAPYDPNGKPER,

IPDEFDNDPILVQQLR,

MFYHISLEHEILLHPR,

TMDEDIVIQQDDEIRLK,

KIAA1598,

LLLQGEADQSLLTFIDK,

KELELK,

QVPLSAAEAAEVILR,

LVPGGGATEIELAK,

FEDEELQQILDDIQTKR,

AIQDEIR,

LATNAAVTVLR,

SQALEFPDQPDIAEDLKDLITR,

LLGRCSGTR,

AVDNILR,

SATEQSGTGIR,

VEAVLSR,LPNGEGTR, ISQKAER,

EMILIN1,COL16A1,

IDNSVLVLIVGLSTVGAGAYAYK,

LLPLHHMPTQLLSIEESLALQK,

SQDSEEHDSTFPLIDSSSQNQIR,

SVSSSVQVCPEVGKR,

SFCSNFCYQASK,

KAELEAAVR,

YVLGEETTK,

RPAP2,

SEITLVGISKK,

ESQNSLDESLPFR,

TNKVYDITER,

VHAELADVLTEAVVDSILAIK,

ALQFLEEVK,

ALHIVEQLLEENITEEFLMECGR, GIDPFSLDALSK,

GLQDVLR,

VATAQDDITGDGTTSNVLIIGELLK,

AQLGVQAFADALLIIPK, LTFLYLANDVIQNSKR,

TKVHAELADVLTEAVVDSILAIK, CCT6A,

HKSETDTSLIR,

SVTLLIK,

SVYGGEFIQQLK,

MLADFLR,

ALQDLENAASGDATVR,

RPRD1A,

DFAPVIVEAFK,SVYENDVLEQLK,

EFESVLVDAFSHVAR,

KLTFLYLANDVIQNSK,

RPRD1B,

IASLPVEVQEVSLLDK,

LTFLYLANDVIQNSK, QALYGDKKPR,

TYEQIKVDENENCSSLGSPSEPPQTLDLVR,

IQSLPDLSR,

KLSELSNSQQSVQTLSLWLIHHR,

HSQFIGYPITLYLEK,

FYEAFSK,

ALLFVPR,

HSP90AB1,

HNDDEQYAWESSAGGSFTVR,

KHLEINPDHPIVETLR, EQVANSAFVER,NPDDITNEEYGEFYK,

HSP90AA1,

ALLFIPR,

HFSVEGQLEFR,

ADLINNLGTIAK,

SLTNDWEDHLAVK,

YESLTDPSK,

HIYYITGETK,

HLEINPDHPIVETLR,

HSQFIGYPITLYLEKER,

HSQFLGYPITLYLEK, YHTSQSGDEMTSLSEYVSR,

QNTGVWLVK,

YLLQEQLK,

WAIIGWLLTTCTSNVAASNAK,

GLADQCTGLQGFLVFHSFGGGTGSGFTSLLMER,

DLVDITKQPVVYLK, TLTAEEAEEEWERR, AIECYTR, LVELYR,GMGGAFVLVLYDEIK, GAWSNVLR,

RGPD5,FAM75A2,

TAVAPIER,QDVCQSYSEK,

YDEAIDCYTK,

VLKIEEVSDTSSLQPQASLK, VLLDLSAFLK,AQGPQQQPGSEGPSYAK, NLDPDVFNQIVK,GHWDDVFLDSTQR,

KEQESCDCLQGFQLTHSLGGGTGSGMGTLLISK,

LSVNMVPFPR,

TIGGGDDSFNTFFSETGAGK,

GHYTIGKEIIDLVLDR, NLDIERPTYTNLNR,

EHYHATALGAK,

YMACCLLYR,

AVCMLSNTTAVAEAWAR,

TLGAPASSER,

EIEEAPDIRK,

NRDNDPNDYVEQDDILIVK, GGGGPGGGGPGGGSAGGPSQPPGGGGPGIRK, SVSLTGAPESVQK, TTPNSGDVQVTEDAVR,

TGLSSEQTVNVLAQILK,

FIIENTDLAVANSIR, TGLSSEQTVNVLAQILKR,

EAESCDCLQGFQLTHSLGGGTGSGMGTLLISK,

TUBB8,

GHYTEGAELVDSVLDVVR,

EEESCDCLQGFQLTHSLGGGTGSGMGTLLISK,

FPGQLNADLR,

TUBB6,

SCDTPPPCPGCPAPELLGGPSVFLFPPKPK,

SCDTPPPCPICPAPELLGGPSVFLFPPKPK,

TTPPVLDSDGSFFLYSK,


RISEQFTAMFR,

TUBB3,

YLTVAAVFR,

STSGGTAALGCLVK,

YDDPEVQKDIK,

NTTIPTKK,

MKETAENYLGHTAK,

QAVTNPNNTFYATK,

QAASSLQQASLK,ERVEAVNMAEGIIHDTETK,

QAVTNPNNTFYATKR,

VNVTHKEEIILTPIEVAIEDMQK,

EQQIVIQSSGGLSKDDIENMVK,

TTSGDDACNLTSFRPATLTVTNFFK,

AQFEGIVTDLIRR,

SLSNSNPDISGTPTSPDDEVR, STNGDTFLGGEDFDQALLR,

DAGQISGLNVLR,NAVITVPAYFNDSQR,

TTPSVVAFTADGER,

LYSPSQIGAFVLMK,

GPSVFPLAPSSK,

HSPA9,

SAVRPASLNLNR,

SQVFSTAADGQTQVEIK, KDSETGENIR,

AQFEGIVTDLIR, TPNEEIDRQNDDQR,

GSWACSIFDLK,

IDISPAPENPHYCLTPELLQVK,

VQQTVQDLFGR,

YDDPEVQK,

DKVQAGDVITIDK,

LLIVSTTPYSEK,

SAIFSITYPSQDVFLVIK,

LQEAFSK,

KTQELAFATHQDPADPK,

VQAGDVITIDKATGK, TQELAFATHQDPADPK,

GYQTSPDLR,

FMDDIAALVSTIASDIVSR, LKEALQPLINR,

DOCK7,

AKLEEAILGSIGAR,

YSDPQIK,

KQISGQYSGSPQLLK, NLLYIYPQSLNFANR,

DSTEVEISTGER,

NSLPDALLPNLLDR,

GLLRPHVPPAAITTLAR,

SNSWVNTGGPK,SYTEDWAIVIR,

NSLLASYIHYVFR,

NAEKYAEEDR,

RYDDPEVQK,

AHGELHEQFK,

TILTYAEEDLELR,LLGQFTLIGIPPAPR,

GIDPFSLDSLAK,

CCT6B,

NAIDDGCVVPGAGAVEVAMAEALIK,

VLAQNSGFDLQETLVK,

YVLGEETTKSQDSEEHDSTFPLIDSSSQNQIR,

EEQSGHSGEEVQLCSK,

KSFCSNFCYQASK,

SSQTSQNQGLGRPTLEGDEETSEVEYTVNKGPASSNR,

KIEFER,VYDITER,

SSQTSQNQGLGRPTLEGDEETSEVEYTVNK,

IFDSFAK,

AAIAECEEVRR,

AELEAAVR,

AAIAECEEVR,

IREFYR,

LDSQEKDATCELPLQK,

AAIAECEEVGR,

LKASENSESEYSR,

TNVLPFR,

QNYEEMQAK,

FVQCPDGELQK,

AVLIAGQPGTGK,SGSPISSEER,

GTSYQSPHGIPIDLLDR, NLTGLSSGTEK,

EYQDAFLFNELKGETMDTS, SELFNPVSLDCK, GFEPQAPEDLAQR, AIAEVDVGTDKAQNSDPILDTSSLVPGCSSVDNIK,

YAIQLITAASLVCR,

GLGLDDALEPR,

TQGFLALFSGDTGEIKSEVR,

SYNPEGESSGR,

ITIADQGEQQSEENASTK,

GTEVQVDDIKR,KSELFNPVSLDCK,

HRVSSQAEDTSSSFDNLFIDR,

YAIQLITAASLVCRK, FVQCPDGELQKR,TTEMETIYDLGTK,

DYDAMGSQTK,

IKEETEIIEGEVVEIQIDRPATGTGSK,

MIESLTK,

DKQHLDDITAAR,SGSPISSEERR,

VSSQAEDTSSSFDNLFIDR,

POLR2M,

DRVPPSSEASEHHPR,

RUVBL2,

IRCEEEDVEMSEDAYTVLTR,

ARDYDAMGSQTK,

KEVVHTVSLHEIDVINSR, IRGTSYQSPHGIPIDLLDR, TQGFLALFSGDTGEIK, QASQGMVGQLAAR,

TEALTQAFR,

VYSLFLDESR,

LLIVSTTPYSEKDTK, ETLIEWK,

FLDTLLEELHLK,

VNTQSSSNSTLPER,

ASENSESEYSR,

TSDIDNPSHFEK,

LCGYPLCQK,

FITPAHYSDVVDER,

VLPGLLVPLQITLGDIYTQLK,

CBWD5, VEFTEDLQK, TPADIYR,

CBWD3,

SIYYITGESK,

HSQFLGYPITLYLEKER,

HSP90AB2P,NPDDITQEEYGEFYK, IDIIPNPQER,YIDQEELNK,

ELISNASDALDK,

FENLCK,

RGPD1,FAM75A7,

EEERHPDFQLLK,

FFEAQIPK,

LOC100509836,

LOC100509749,

LPNVTGSHMHLPFAGDIYSED,

LAAEIDDR,

ALQDLENAASGDAAVHQR,

SPATA5L1,ASTPAILFLDEIDSILGAR, MSH6, IIDFLSALEGFK, PSMA2,YNEDLELEDAIHTAILTLK, NDUFA10, NUBPL,XPO1,LDINLLDNVVNCLYHGEGAQQR, LLQYSDALEHLLTTGQGVVLER, DLIDDATNLVQLYHVLHPDGQSAQGAK, LAQTLGLEVLGDIPLHLNIR, YBX1, EWSR1, MRPS22, C20orf4,LDHAL6B,LIIVSNPVDILTYVAWK, NYQQNYQNSESGEKNEGSESAPEGQAQQR, GDATVSYEDPPTAK, WISLTNFISEATVEK, AKAP8L, TVEDLDGLIQQIYR, ISYNA1, SVLVDFLIGSGLK, MAP1B, TPDTSTYCYETAEK, HEATR6, AVAPSIIFFDELDALAVER, SLFN11, NUBP2,STESLQANVQR,RPL13,ELIQSSDLQAFFETK, Incorrect!MKEIAEAYLGYPVTNAVITVPAYFNDSQR7:6 KEAESCDCLQGFQLTHSLGGGTGSGMGTLLISK4:3 FLDPDMYSLLEDSTSDLR, FQGSGAATETAESR, SETD1A,TVAQSQQLETNSQR, RFC3,TLEEGHDFIQEFPGSPAFAALTSIAQK, GPN3,SPATA5,TVPFLPLLGGCIDDTILSR, MRPL10,ACALQVLSAILEGSK,

SSGVPVDGFYTEEVR, NTPCR,

GFGFILFK,HNRNPAB,LANFGGLAVGLGFGALAEVAKK,

MKSLESQLEK,LOC643371,CISEQFTAMFRCK,

IHSDCAANQQVTYR, DSG1,

ADCK3,PMF1,

AVPPTYADLGKSAR, ASGR1,

LYTENR,

ASITPGTILIILTGR,

LONRF3, LRPMGFK, VDAC1,

MED1, VPLILNLIR,

TRMT11, SPFWILSIPSEDIAR, RPL6,

CEP170,VAVQGDVVR,NVIAVLEEFMK,C17orf75,VHSYNASETSQLLSVSEGELILHK, KPLTTSGFHHSEEGTSSSGSKR, EPRS,FASTKD5,

RBM39, DLEEFFSTVGK, KIAA0355,SQSAAVTPSSTTSSTR, ADRM1, ASTFLTDLFSTVFR, RPRD2, VEITPESILSALSK,

MMAA,

QTARGVK,

IPQERK,

POLR1A, ELL2,EDSTCISLPK,

PDE4D,SYSSGGEDGYVR,GISLNPEQWSQLKEQISDIDDAVR, NCL,LDSLTYK,

AFAP1, LQTKEEELLK, CNNM1, VEVEVGK, HAX1, ITKPDGIVEER, MUM1, LSQMQGAVR,

LPDFLQR, PRSS12,

ENO4,

MAST4,

MFHAS1,LDLDKR,AFAP1L1,TPGRPTSSQSYEQNIK, TAQLNISK,MRLSWFR,LOC100509073,SISDARAR, EIF4E2,TVPLDALR,BAI1,

SLCO1B3,ADIFALALTVVCAAGAEPLPR, WEE1,GPR112, IIPTVDR,DTLVISVK, RNF25, TKIDAIR,ABHD1, ZNF318,USP9Y, PIK3CA, VRAEGELEPR, NQSFCNKI,DMQLLSISR,IIEDCSNSEDTIK, LRBA,

CRTC2,MAGLTVR,CSTF2,IBTK,VLLGVGDPK,NOP56,LLLMGHK, QKCGATPK, LPSALNR,

LQIPGLTLDCR, C12orf34, ASSTSTPEPTR, LOC151760, KLDPLLK, PTPRU,

EKFGSFLCHK, MUC6,FGAACAPTCQMLATGVACVPTK, VAVDGNVIR, RGS3, ACVAAACTVAAR, PFKL, SEWGSLLEELVAEGK, ZEB1,

QAQPTSR,VIFGEENR,IYDQEK,

ACGTSIFQNR,

ASUN,

LTSDYLK,

SEMA4D,QQQEELEAEHGTGDKPAAPR,

MAGEB17,

C22orf28, AKR1C3,

DLEC1,

C9orf50, MVGGGGKR,

EILSAAEHFSMIR,

CASP8AP2,

DRLQYR,

MDVLQTDVSQNFGLELDTKR,

ANK2, GAS2L2,

PLXNA1,

XRN2,

EAQLQSR,

KACQACR,

RIYVGEK,ZNF429,

ARHGAP32,

MYLIP, GTPGTRGPR,

FXR2,

RQMPQGITSCMLHLDGK, EKDTLR,

CEVTVLDALPR,

EKDTIR,

ASB8,

IGLL3P,

QSISNSESGPR,

BYSL,

VIQQLEGAFALVFK,

DHGLLHLK,

GTQTKAEGPTIK,

DGAT2L6,

GDFITIGSRK,AGK,

MYCN,

QDVQMKPVK,

SEC23A,

VSGSNGLAAAR,

SQVISNAK,

SLIRP,

DOCK10,

NTS,

QSWVRAR,TLK2,

SSLPWLPK,PABPC1L,

ITGA7, TNFRSF25,

NFAT5,

C3orf15,

GMKDTQR,

NCAIISDVKVR,

RIVTGNPR,

LAPVPFFSLLQYE,

CDC42BPA,TATALADRAIR,

EEECSPVLR,

PRPF8,

EMB,

DGKI,

ATAAFILANEHNVALFK,

LPEAAAAK,

FSAFLDK,ALG13,

CD320,

IQKVEPAGR,

IPO5,

LACLAGELR,

LACLQR,

GSLQHCCCLLPK,

AVEDEATK,

SLIT2,

SINEMLLSR,

ENAH,

SKOR1,

CLSSYTSVKENFDK, KDM3B,

TPRN,

TMEM79, ABCE1,

C12orf51,

EIEKLK,

FEM1C, GADVNRK,

LCP2,

RAB23,

CNNM4,

WARS2, ACBD4, CALSLPR, CSF2RB, C8orf74,MQLCLEK,RMPPPPR, EALRLER,SRRM1, LEPRE1,

ISQERGK,FAAH2,TEGRLQQK,SMARCAL1,EXOSC9,GLYADTR,C17orf85, MICU1,ETPLSNCER, VMAGPGIKR,

GPI, NLVTEDVMRMLVDLAK, SPAHPTPLSR, LSQEINK, RARS, SCARF1, DRPARDGAAVSR,STIIGESISR,

FISHKK,ZNF687,TYKGGSVR,NEEAEALAKR,RBFA,

GASWLGRR,RNF146,TDMIQALGGVEGILEHTLFK,

ZNF85,DVLETPVDLAGFPVLLSDTAGLR, LQEIRLEVLK,

APRT,LRRC55,

LECLVHR,

FSTLTTHK,

C19orf45,

C8orf85, VEATLAR,

EAPPHGAPPR,

GTPBP3,

DOCK4,

MTUS1,

IALDCNLPR,

PHF1,LMDSVALK,C1orf123,

SCN4A,

QIAAPKAK,

NAALADL1,

DRFISGR,KNESLGQ,

TAGSVILR,

MDC1, KLAMQEFMEINER, OXCT1, MLGAMQVSK, CARS,TLN1, ADSSGQQAPDYR,

11-Mar, FAM120B, KSLHSCSVK,

LAPAMIR,

ACTR2,

LOC100133315,

BBS9,

EPB41L3,

SUV420H2,

C8orf82,

FADS6,

PIP4K2C,

MYBL1,

PVRL2,

FLDFITNIFA,

EPARSAHR,

MARAAALLPSR,

LELLVGCIAELR,

ITIDTNK,

KDGGGLYEK,

ERNFLR,

LSRSPLK,

QLYLER,

KPNPNTSKVVK,

PIGO,

MRS2,

KCNMA1,

SYNE2,

INCENP,

ARHGEF2,

FXYD3,

NRG1,

SKA3, MTAP,

CLCN5,

C3P1,

LSM14A,

MOAP1,

PREX2,

HEATR5A,

VSRPENEQLR,

MOBP,

FAM53B,

CARKD,

MASRFSR,

KTLLADR,

FIAANDK,

KNLEDGINNLK,

DASLPVPGQR,

CALPPRLK,

ATYKEMIR,

SLASTLDCETARLQR,

KEEQQR,

SESLESPRGER,

LAMTOR3, SPAG17, C6,

APARCFSAR, SLSLLLK,

VEILYR,

AEGHRVLIFTQMTR, CAGE1,TFDRSTK,GAD1,LSSLGCHLK,ARG2,LLQPGDK,

ERN1,RGMSILR, NVEDLSGGELQR,

ERBB2IP, MKL2,

MTEQETLALLEVK, DAALFPGCER,

LNPADIK,

KNLNLSK, HSD11B1L, DRASAAEAVR, SDHD, LSAVCGALGGR, DNAAF2, GIT1,WDHD1, TSLDGARR,

INALTAASEAACLIVSVDETIKNPR,

SQDAEVGDGTTSVTLLAAEFLK,

TEX264,

LGRAGGKPGGR,

CCT7,

PNMA1,

PRRC1,

QRSQIK,

QVKPYVEEGLHPQIIIR, LLSDDYEQVR,WVGGPEIELIAIATGGR,

QMETMGKR,

VYSLQHLDPQGAQELLEFTIR,

LSTLIEFLLHR, SPALQVLR, DCQLNAHKDHQYQFLEDAVR, EQAELTGLR, ATSPADALQYLLQFAR, AQETEAAPSQAPADEPEPESAAAQSQENQDTRPK, NVESGEEELASK, EIDKNDHLYILLSTLEPTDAGILGTTK, LSPPYSSPQEFAQDVGR,

CCT5,MAGED2,PGAM5,TRIM28,PTCD3, INTS4,

SPATA16,MAAAGPMTGAAASR, FAM99A,QVTQAGR,TLGSAVSPQQR,HOPX,SLLDNLRNK, PROCA1,

YEEAIKQTR, AJUBA, LSGLGGPRK, SEZ6L, CRYYSNLR, CD248, ACRELGGDLATPR,

C7orf59, SSSHSCFPGGR, IQSEC3, AACTIQTAFR, CCNB3, ZNHIT6,TNFKEDSLVK,

TASVLWPWINR, LELSNLEIPHQGVQVEGDGFSHAIR, EILSEVER, GIPEFWLTVFK,

VGAAAARAR,PENK,EPGPFVPR,BRPF3,TVDLQDAEEAVELVQYAYFK, TPIDYAR, MCM3,

VPNSNPPEYEFFWGLR, GPIAFWAR,VSTDLLR,HSQYHVDGSLEKDR, MNEAFGDTK,VLVNDAQKVTEGQQER, SAYESQPIR,VSNSPSQAIEVVELASAFSLPICEGLTQR, IAILTCPFEPPKPK,

NPPGFAFVEFEDPRDAADAVR,

VKEERPLFPQILSSIELLQHSLPK,

QSKDAPK,

CILPFDK,

LQLEGQVTELR,

IETYGGR,

ELEKLK,

QSDLDTLAK,

STDLTTHK,

ISSLEGR,

DYNC1LI2,

RAF1,

KIF1C,

ZNF619,

ZNF253,

AKAP17A,

OR2AT4,

PLEKHG6,

SRSF3,

CALCOCO1,

ELKILNLSK,

ISSAIDR,

IQDPQLR,

YLSQLLMR,

QLTLPEEDK,

LRRK2,

MTR,

LHX2,

PRAMEF10,

SLC9A5,

ALLARGTK,

ALEADFLTNMHTSKISK,

EQLLLDELVALVNKR,

TPESTKPGPVCQPPVSQSR,

LTPEELER,

KVDLGYISDVDK,

GRVPGVDR,

SNTQAPRLAPSHR,

KLCEEDLEK,

ARHGEF10L,

LRHPVSTK,

LLTELESPAWWPFSSK,

LONRF2,

UTS2,

SLFVQELLLSTLVR,

SMELYR,

MEIQEIQLKEAK,

PARP11,

EHBP1,

ISIMYSTLR,TAF1A,VSFQPKLSPDLTR,

RTWLESMAK,LOC388906,DGDEEALKTMIK, AGGCPSSLASSR,

DADPTWTLR,

DKDVTLSPVK,BOD1L,

DLLAKSLYSR,

IVTCAHR,

SSYLESPR,

MYO16,ISSLNKR, UTF1,

IMMT,

WAC,

OTC,

TTC17,

CDR2L,

GNPTG,

C10orf103,

RDPSPVSDR,

WGPTSNNPR,

VSRGPFSPR,

MAREPR,

SEMA6C,

EHALLAYTLGVK,

EVSTYIK,

THINIVVIGHVDSGK,

DNDPNDYVEQDDILIVK,

IGGIGTVPVGR,

YEEIVKEVSTYIK,

DCTCEEFCPECSVEFTLDVR,

YYVTIIDAPGHR,

EEF1A1,

VADWTGATYQDKR,

LDGLVETPTGYIESLPR,

SUCLG1,

ADAM15,

CDK16,

TLATDILMGVLK,

MED14,

KIPIIIR,

KAIDCLR,

QLKACHR,LRRC53, IKGQNLK,

QVSAINR,

CXCL2, ACLNPASPMVK,

ZNF285,

YLPM1,

ENVNSLLPVFEEFLK,

VLVVHDGFEGLAK,

TTN,

LVPLLLEDGGEAPAALEAALEEK,

FAISLPKAR,

PRX,MRPS31,

POLR2F,

RYLPDGSYEDWGVDELIITD, TQGNVFATDAILATLMSCTR,

DNAJA3,

QLLQANPILEAFGNAK,

RPS27L,

IIGLDQVAGMSETALPGAFKTR,

DGADIHSDLFISIAQALLGGTAR,

ALMLQGVDLLADAVAVTMGPK,

IQDKEGIPPDQQR,

TALLDAAGVASLLTTAEVVVTEIPKEEK,

ESTLHLVLR,

GINSSNVENQLQATQAAR,

VTEPSAPCQALVSIGDLQATFHGIR,

MQLFVK,UBC,

MQIFVK,

TLSDYNIQK,

TITLEVEPSDTIENVK,

QIGSVIRNPEILAIAPVLLDALTDPSR,

QDQIQQVVNHGLVPFLVSVLSK,

SHGIDLDHNR,

SMC4,

AAVEEGIVLGGGCALLR,

CEFQDAYVLLSEKK,

NPEILAIAPVLLDALTDPSR,

PFKFB3,

PESGAAAARPRR,

PPP2R5A,

VITSQGTVVK,

LVLGQEELR,

GTPBP5,

SLELLPIILTALATK,

POC1A,

JUP,

LNTIPLFVQLLYSSVENIQR,

EXOG,

MELQEIQLKEAK,

CAKFSPDGR,

CSPG4,

ACLGEPLAR,

QLSQSLLPAIVELAEDAKWR,

YFAQEALTVLSLA,

TFDP1,

KIF27,

SNAP47,

FKBP2,

C16orf7,

TPM1,

RTQSDLTIEK,

EVRDLK,

RCPSILGGLAPEK,

LQELEAR,

TVLWNPEDLIPLPIPK,

IYQTDR,

LHHQNQQQIQQQQQQLQR,

IAIYELLFK,

DGMLYLGSR,

HSPA4,

ZC3H4,

SRCAP,

TRIM22,

QDMLAEKVLAGK,

SSNNNGSVRTA,

ACADVL,

ASB2,

CHSY1,

C14orf49,

MED15,

RPS10,

SPAG5,

GNAL,

ZNRF1,

LRRIQ1,

NSSIASDIHGDDLIVTPFAQVLASLR,

MMHVQQLSLASLCVKK,

KEDMLLK,

PPTDSPLER,

FLLTGLLSGLPAPQFAIR,

MUC16,

GRIK1,

SMTN,

SALL4,

UNC5A,

ERC1,

PIGN,

ZBTB11,

FVSSLLTALFR,

TRIM6-TRIM34,

PSMA3,

VADLVHILTHLQLLR,

ESPNL,

AVENSSTAIGIR,

NSGLTEEEAK, SSPO,

RAE1,

AGSLAALEKR,

NFSSASALQIHERTHTGEKPFVCNICGR,

VTWRGEGR,

KIAELESH,

LQEEIAR,

VSAQQFDDAFLK,

YLTNER,

QLENQNPQGR,

IQTEPTSSLTLGLRK,

WDR11,

UACA,

DBF4B,

URB2,

LSLEGIVVQR,

WLNCPR,

YDSQVAEENR,

WKPPSLNSVDFR,

RNGTT,

TEVSFTLNEDLANIHDIGGKPASVSAPR,

DDX26B,NLQASGLTTLGQALR,

LKLGAIFLEGVTVK,

IIRPSETAGR,

ALIVVQQGMTPSAK,

TDLTVLVAHNDDPTDQMFVFFPEEPK,

LRENQLPR,

AECRPAASENYMR,

LQIEESSKPVR,

QHVLDMLFSAFEK,

HQYYNLK,

TIRDIIALNPLYR,

AQFGDKPSEGRPR,

EEVTELLAR,YFGIKR,

RTDLTVLVAHNDDPTDQMFVFFPEEPK,

POLR2E,

DIIALNPLYR,

FLDDLGNAK,TLQQNAESRFN,

BAG2,

IIDEVVNKFLDDLGNAK, LLESLDQLELRVEALR,

TLQQNAESR,

GLANPEPAPEPK,

KPVAIFK,

RECQL5,

NDRDQVSFLIR,

ANLFYDVQFK,

EEEENRLR,

FAM50A,

LOC100294412,

QLAVLTELPFVVPFEER, DKPELQFPFLQDEDTVATLLECK,

SEC16A,

VPHPLEHK,

POLR2J,

FENNSWVFMR,

NKPFFDICTSR,

GELDLTGAK,

UBE3C,MSH2,MYLK2,

KNFIAVSAANR,

LNLVEAFVEDAELR,

CNP,

VELSEQQLQLWPSDVDKLSPTDNLPR,

MEX3A,LASVPEYR,

ATGSANMTK,DQEEIEDQSPPCPR,

NBPF6, SPIN2A,

SPIN2B,NBPF4,

FOXK1,

HNRNPH1,

STGEAFVQFASQEIAEK,

LAQNILSYLDLKSDEWEY,

BAG5,

QLFEIDSVDTEGKGDIQQAR,

HTGPNSPDTANDGFVR,

IQELGDLYTPAPGR,

AVIEVQTLITYIDLK,

FEELLLQASK,

IRS4,

VLGNTSGLLQLLFNR,

ELQQAQTTRPESTQIQPQPGFCIK,

PIH1D1,

GQGCTAYDVAVNSDFYRR,

EALDVLGAVLK,ALEEAANSGGLNLSAR, FDDGAGGDNEVQR,

LNTEVASVVVQLLSIR,

LLQEEEQER,

DVWQVIVKPR,

ACLIFFDEIDAIGGAR,

ELYDRYGEQGLR,CFLSWLVK,

LRCH1,

GNTSGSHIVNHLVR,

DNAJA2,

PSMC2,

THYSNIEANESEEVR,

CAPNS1,

FSATEVTNK,

HDLPPYR,

VIEPGCVR,

ELFLTAGGAGGLHLWK,

LVATSLEGK,

EIINAIDGIGGLGIGEGAPEIVTGSR,

YLWNNIK,

ITFTGEADQAPGVEPGDIVLLLQEK,

VHLTPYTVDSPICDFLELQR,

DAQGQPGLER,

YVDVLNPSGTQR,

DDTHKVDVINFAQNK,

SSLSSHSHQSQIYR, SWSSYFSLPNPFR,

LNTEVASVVVQLLSIRR,

LLIVSNPVDILTYVAWK,

WFQPVANAADAEAVR,

IEEFVPPDENCPLKEASSR,

GSVSASEQGTLNPTAQDPFQLSAPGVSLK,

GTGVIQLYEIQHGDLK,

YLATGDFGGNLHIWNLEAPEMPVYSVK,

VVCAGYDNGDIKLFDLR,

WDR92,

STVWQVR,YEYPIQR,

EWNLFYQK,

YTGEEDGAGGHSPAPPQTEECLR,

UPF1,

NVFLLGFIPAK,

ISEPEADVESVLGVSNLLQVLQKPK,

SQLLKDPQVLFAGYK,

LLMEELVNFIER,

LTN1,LDHA,

LLGHEVEDVIIK,

KTIMQLCHDR, TENPLILIDEVDKIGR,

LLQNVAR,

LONP1, INTS3,

IVSGEAESVEVTPENLQDFVGKPVFTVER, DIIHNPQALSPQFTGILQLLQSR,

ILCFYGPPGVGK,

TQLVWLVR,

GHKEIINAIDGIGGLGIGEGAPEIVTGSR,

DCWTVAFGNAYNQEER,

AAYNPGQAVPWNAVK, LAEAEETAR,

KIAA1967,

QEGLDGGLPEEVLFGNLDLLPPPGK, ALVSHNGSLINVGSLLQR,

SPAPPLLHVAALGQK,

ILQDSLGGNCR,

ANLEAFTVDKDITLTNDKPATAIGVIGNFTDAER,

VLLLSSPGLEELYR,

QLEESVDALSEELVQLR,

KIF5B,

SLSALGNVISALAEGSTYVPYR,

LYLVDLAGSEK,

SAEIDSDDTGGSAAQK,

HVAVTNMNEHSSR,

TAAPGHDLSDTVQADLSK, IESEGLLSLTTQLVK,

MRPS27,

MQNSDFLR,

GQGCTAYDVAVNSDFYR, NRPFMGSISQQNIR,

ELVITIAR,

VVSRGNK,

VDNALQSGNSQESVTEQDSKDSTYSLSSTLTNSK,

VDNALQSGNSQESVTEQDSKDSTYSLSSTLTISK,

RTVAAPSVFIFPPSDEQLK,

VDNALQSGNSQESVTEQDSK,

IGKC,

TVAAPSVFIFPPSDEQLK,

DSTYSLSSTLTISK,

PCDHB8,

SGTASVVCLLNNFYPR,

VDNALQSGNSQESVTEQDSKDSTYSLSSTLTDSK,

NFIC,FAM90A9,SYSDPPLK,

HSGPNSADSANDGFVR, HNRNPF,

IEGDETSTEAATR,

ADQFEYVMYGK,

LSAYVSYGGLLMR,

LQGDANNLHGFEVDSR, LHCESESFK,

VYLLMK,

MELQELQLKEAK,TELQGLIGQLDEVSLEKNPCIR,

AVWNVLGNLSEIQGEVLSFDGNR,

YVELFLNSTAGASGGAYEHR,

ITGEAFVQFASQELAEK,

YVEVFK,

YLDLEEEADTTK,

ATENDIYNFFSPLNPVR,

CCT4,

IDDVVNTR,

LFAQLAGDDMEVSATELMNILNK, THYSNIEANESEEVRQFR,

TDGFGIDTCR,

ITFTGEADQAPGVEPGDIVLLLQEKEHEVFQR,

AFADAMEVIPSTLAENAGLNPISTVTELR,

LGFEEFK,

GKGAAAAAAASGAAGGGGGGAGAGAPGGGR, YSNENLDLAR,

VLFICTANVTDTIPEPLRDR,

NYLDWLTSIPWGK, LAQPYVGVFLK,

FSDLFSLAEEYEDSSTKPPK, QLEVEPEEPEAENKHKPR,

NKNPAPPIDAVEQILPTLVR,


LLGASELPIVTPALR,

FQGEDTVVIASKPYAFDR,

EAVFFQSHSAR,

ITHEVDELTQIIADVSQDPTLPR,

FAM90A19,

KPNA2,

PRPS1,

PRPS2,

KIAA0415,

CAMK2B,

DTNA,

USP13,

VNMSNQVPR,

LAD1,

HTR3B,

PRPF40A,

IEKMELALVPVSAHGNFYEGDCYVILSTR,

MKEASVNPSGAR,

SRESPGGDLR,

TQGGVSPR,

MEX3B,

CALM3,

QEKQER,

YLLDKDEK,

CALM2,

ARL9,

MOSPD2,

NCOA1,

EAFSLFDKDGDGTITTK,

CHADL,

DENND2D,

SIRPD,

ZMYM5,

C19orf55,CD36, VFNGKDNISK,

GQVCVMIHSGSRGLGHQVATDALVAMEK,

QSVLWSR,

ALHLSDLFSPLPILELTR,

WASGVDGGLYEHKTFMYPVAK,

QNLQEEKER,

LKEAGQGR,

WPFVPEK,

IDMPRK,

RVLSGATLPDK,

QAPESRK,

EERPLFPQILATIELLQR,

TPEEHELETGIK,

GHGEGRLGCR,

LPEITDLK,

APDSTLR,

ADPECCSILHGLVAAVETLCK,

C19orf24,

PDCD11,

PEG3,

ANKRD27,

EIF3L,

PLAGL2,

SNED1,

RPS24,

DNAH12,

SCRIB,

AKAGWAGDPR,

RGPGLGAR,

TTPDVIFVFGFR,

MPPLIADSPKAR,

LFSQRKPCK,

LSVAGNCR,

QEKEAGR,

AYDITNVLDGIDLVEK,

MQEEEARK,

HIASGNQK,

ALLLLLAER,

GRM7,

NPM1,

CEP44,

ERVFC1-1,

MTFR1,

AMVVDLVKR,

LLCGATLIAPR,

ZNF223,

SLWVINSALR,

ASB15,

KLCTHK,

TARS2,

RTL1,

AVIL,

HAPLN1,

SLC43A3,

IL4I1,

GCET2,

COL9A2,

ATP8B1,

KLK11,

PPIL4,

OSMR,

LLGAGRAR,

EQYDEKR,

KQAELDK,

AREMLLK,

NDWHGGAIVSALSQTGSLFKPR,

TAMEFQRR,

LOC729040,

RPL19,

DAK,

BBX,

ZNF250,

NAA16,

RVDYAAR,

LVEPQISELNHR,

LTQLAMLLAEISSVAHQK,

QLDSLRER,

DMD,

MRPL55,

TRIM37,

KANSL2,

FBXL13,

MRPS28,

ANGEL2,

CECR2,

EEF2,

FIGN,

KEISVNGK,

CVRLSDASVMK,

GTGKTLLGR,

CSPGEKSQLVR,

CSSQSLPMTR,

IADGSVMR,

LSSYSTTIR,

QEETPVLTR,

ALPALPSAR,

NPPAGYFQQK,

PKSQVSEEEGK,

LLGLTGLLSK,

GAAVLGVDHRVR,

LYENEK,

GVEITGFPEAQALGLEVFHAGTALK,

NQAIQALR,

LLVRPTSSETPSAAELVSAIEELVK,

ILKCYEQK,

RDFVEAPPPK,

SPGWSEEVVR,

RLASSVLR,

IKMATADEIVK,

KCVPVPR,

TCTTTRVSK,

SSESPGDQVMAAAR,

KLVEAIK,

LOC100131673,

LNNSSPCDIR,

RAD51AP2,

AIAYLFPSGLFEK,

C4orf27,

SALTVHCK,

EPB41,

SCGTTRELQK,

STK40,

PARN,

ACRV1,

PCLO,

MREG,

DAP3,

MITF,

SMARCA1,

TRIO, TLVLSNLSYSATEETLQEVFEK, SUB1,

RCOR3,

LRRC38,

KRT78,

ESLSSVCGLR,

C4orf29,

FLGATTDTTVLEANAVLLGIQESK, CLCC1,

YEEIAR, YEELAR, ARL1,

QVNSALK, DLSYTLK, POLR2K, LVVFDAR,

GTGLDEAMEWLVETLK, MGLL,

LELLEEVLALGLR,

MAAQRLGK,

RABGGTB,

LYSEQR,

LLKFESCGK, LRRTM2,

VWF,

RP1,

TEDAGLELLACPVLR,

PPP1R14A,SKFLDALISLLS,

ATXN7L1,NVGCLQEALQLATSFAQLR,

STX18,EFTRSR,

ZFP37,

TYGEPESAGPSR,EMD,

FFEVILIDPFHK,

DHRPPCAQEAPR,

TDVKNTLSEIK, LQREQQR, RP1L1, VQIDYK, MLPALGMACPPK,

LYATDGR,

TCQSLHINEMCQER,

MSEYLR,

ATR,

DFEPILEQQIHQDDFGESK,

MEPCE,ISVCPAMR,

NTN1,

EPHA8,

LEGVVTR,

EAYLLLQLFK,

TLLPLLLESSTESVAEISSNSLER, QAACTIVEALATIPSR, DASGFWSSLVNPATGLAEVEGENAQR,

EGMEIPNLR,VPS41,

QQLVKLEMNTLNVMLGTLNLGNLLLTGDK,

PCDHB13,

ELPQLVASVIESESEILHHEK,

MKIHSPSPHK,

COL6A1,

C16orf70,

FREM3,

FHL2,

C7orf34,

IQGAP3,

NAV2,QMLNDKK,BRCA2,NUP214,

YVFDIK,

RC3H1,

SSVLPSPSGR,

RERE,

AFMID,

CLPX,

GOLGA4,

GTF2H5,

GHRLOS2,

WDR82,

NCAM2,

AISHEHSPSDLEAHFVPLVK,

QLSQSLLPAIVELAEDAK,

AINTAMYHIMMFQLVKDHIR,

VCCELELR,

VGELADQNAFSLTQK,

KTEALLEAK,

LGAEEALR,

IEATNNPR,

EALNPETIEIK,

DCEGLSREK,

SMARCA2,

RELB,

CIITA,

MDLAGLLK,

CVHIGEMVDPVVELAAKR,

CADPS,

FNTA,

EVFGRGGQPGPK,

RAMELLK,

VAILAELDKEK,

ZMAT1, TMCO6,

TNFAIP8L1,

RPL36,

QSGPRTK,

CDC45,C9orf80,

SGCE, QANNFIISR,

ESRRA,

THOC2, KEENGTMGVSK,

AGPAT4,

TOB2,

PABPC4, EAAQKDSK,

TEQGTVLIGSEHGLIEVDPVSR,

EEASSLLK,

EEGMVPFIFVGTR,

EVEVTACLVWK,

LQGGPGLHLK,

SLGDLEK,

RPCAPTR,

VHELQGNAPSDPDAVSAEEALK,

MQEEEEVVDK,

NICEGGEEMDNKFGVEQPEGDEDLTK, MHGGGPPSGDSACPLR, NSINQVVQLR,LSGEAFDWLLGEIESK, CIESNMLTDMTLQGIEQISK, SMVVSGAK,

RSGLELYAEWK,

TLPHFIKDDYGPESR,

AEIQELAMVPR,

IKDILAK,

LTFASTLSHLRR,

RLDLAGPLLAFLFR,

FEQIYLSKPTHWER,

LRNLTYSAPLYVDITK, IVATLPYIKQEVPIIIVFR,

TAETGYIQRR,

RVQFGVLSPDELK,

SLSEYNNFK,

IVEDAPPIDLQAEAQHASGEVEEPPR, IPQIGDKFASR,

NLTYSAPLYVDITK,

ATAGDTHLGGEDFDNRLVNHFVEEFK,

QIALAVLSTELAVRK, LVQNGTEPSSLPFLDPNARPLVPEVSIK, WFESSGIHVAALVVGECCPTPSHWSATR, AIVHAVGQELQVTGPFNLQLIAKDDQLK,

CDFALFLGASSENAGTLGTVAGSAAGLK,

TLHEWLQQHGIPGLQGVDTR, VFNTGGAPR,

RPVINAGDGVGEHPTQALLDIFTIR,

ASDPGLPAEEPK,QTQTFTTYSDNQPGVLIQVYEGER,

LLDTIGISQPQWR,

SVGEVMGIGR,

RIIAHAQLLEQHR,AKFEELNMDLFR,

CAD,MVNHFIAEFKR,

VCNPIITK,NQLTSNPENTVFDAKR,

FDDAVVQSDMK,ITITNDK,

VEIIANDQGNR,

IEIESFYEGEDFSETLTR,

LSKEEIER,

AFYPEEISSMVLTK, NALESYAFNMK,

YKAEDEVQR,QATKDAGVIAGLNVLR, QAMGVYITNFHVR, KGFDQEEVFEKPTR,

VSANKGEIGDATPFNDAVNVQK, ARGPIQILNR,DCQIAHGAAQFLR, ICRPLLIVEK, LVNHFVEEFK,ITSQIFIGPTYYQR,

GEIGDATPFNDAVNVQK,

LLQFHVATMVDNELPGLPR,

QEDMPFTCEGITPDIIINPHAIPSR, TFAPEEISAMVLTK,

CQEVISWLDANTLAEKDEFEHK,

MKEIAEAYLGYPVTNAVITVPAYFNDSQR,

AQIHDLVLVGGSTR, LVNHFVEEFKR,

ELEQVCNPIISGLYQGAGGPGPGGFGAQGPK,

AAAIGIDLGTTYSCVGVFQHGK, KELEQVCNPIISGLYQGAGGPGPGGFGAQGPK, SAVEDEGLKGK,

SAVEDEGLK,

VQVSYKGETK,

NQVALNPQNTVFDAKR, HSPA1B,

QTQIFTTYSDNQPGVLIQVYEGER,

LLQDFFNGR,

HSPA1A,

EIAEAYLGYPVTNAVITVPAYFNDSQR,

FKEIAEAYLGYPVTNAVITVPAYFNDSQR,

HWPFQVINDGDKPK, LLQDFFDGRDLNK,

LLQDFFDGR,

IINEPTAAAIAYGLDR, STLEPVEK,

NQVALNPQNTVFDAK,

FEELCSDLFR,

VEILANDQGNR,

ARFEELCSDLFR,

VTHAVVTVPAYFNDAQR,

NSTIPTK,

HSPA1L,

SINPDEAVAYGAAVQAAILMGDK,

DNNLLGR,

DAGVIAGLNVLR,

RFDDAVVQSDMK, MVNHFIAEFK,

ATAGDTHLGGEDFDNR,

TKPYIQVDIGGGQTK,

TTPSYVAFTDTER,

EIAEAYLGK,ITITNDKGR,

IINEPTAAAIAYGLDK,

CNEIINWLDK,

IEWLESHQDADIEDFKAK,

KVTHAVVTVPAYFNDAQR,

AVEEKIEWLESHQDADIEDFK,

IINEPTAAAIAYGLDKR,

TWNDPSVQQDIK,

VYEGERPLTK,

SFYPEEVSSMVLTK,

HSPA5,

IDTRNELESYAYSLK,

MKEIAEAYLGK,

LLQDFFNGK,DAGTIAGLNVLR,

QTQTFTTYSDNQPGVFTQVYEGER, STAGDTHLGGEDFDNR,

HSPA6,

QATKDAGTIAGLNVLR,

LSKEDIER,

NQVAMNPTNTVFDAKR,

MVQEAEK,

TVTNAVVTVPAYFNDSQR,

LSKEEVER,

ELEEIVQPIISK,LSSEDKETMEK,DNHLLGTFDLTGIPPAPR,

GTLDPVEK,

ITITNDQNR,

DAGTIAGLNVMR,

ITPSYVAFTPEGER,

NKITITNDQNR,

LYGSAGPPPTGEEDTAEKDEL,

LLESLGYSLYASLGTADFYTEHGVK,

FHLPPR,

GQPLPPDLLQQAK,

MAEIGEHVAPSEAANSLEQAQAAAER,

HPQPGAVELAAK,

LALGIPLPELR,

QIALAVLSTELAVR,

VNEISVEVDSDPR,

GHNQPCLLVGSGR,

SQIFSTASDNQPTVTIK, NQLTSNPENTVFDAK, IEWLESHQDADIEDFK,

LTPEEIER,SDIDEIVLVGGSTR,

QISNLQQSISDAEQR,


LQAEIEGLK,

LQAEIEGLKGQR,

RSTSSFSCLSR,AQYEDIANR,

ISSSSFSR,

NSKIEISELNR,

QNLEPLFEQYINNLRR,

WTLLQEQGTK,

QNLEPLFEQYINNLR, AQYEEIANR,

QLDSIVGER,

KRT6C,KRT6A,

KRT5,

KRT75,

GFSSGSAVVSGGSR,

FASFIDK,

KRT79,

YEDEINKR,

SISISVAGGGGGFGAAGGFGGR,

NKLNDLEEALQQAK,

LEGLTDEINFLR,

SISISVAR,VDPEIQNVK,

KRT6B,

KLLEGEECR,GSGGGSSGGSIGGR,

TSQNSELNNMQDLVEDYKK,

SLDLDSIIAEVKAQYEDIAQK,

SLNNQFASFIDK,

LNDLEDALQQAKEDLAR,

AEAESLYQSKYEELQITAGR,

LQGEIAHVK,

AQYEDIAQK,

NLDLDSIIAEVK,

YEELQQTAGR,GRLDSELR,

VSLAGACGVGGYGSR,

DVDAAYMSK,

KRT7,

YEELQVTAGK,

TTSGYAGGLSSAYGGLTSPGLSYSLGSSFGSGAGSSSFSR,

ASLEAAIADAEQRGELAIK,

YEELQSLAGK,

KRT8,

LDSELKNMQDMVEDYR,

YEELQITAGR,DYQELMNTK,

DVDNAYMIK,TNAENEFVTIK,

AQYEEIAQR,VLYDAEISQIHQSVTDTNVILSMDNSR,

NERPDGVLLTFGGQTALNCGVELTK,

FLSSAAAVSKEHPVVISK,

NSVTGGTAAFEPSVDYCVVK,

EIEYEVVR,

RLAADFSVPLIIDIK,

LCPPGIPTPGSGLPPPR,

EELSALVAPAFAHTSQVLVDK,

GEVAYIDGQVLVPPGYGQDVR,

RLSSFVTK,

ELSDLESAR,

LSLDDLLQR,

LREQGSLLGK,

TPHVLVLGSGVYR,

VTAVDWHFEEAVDGECPPQR,

VYFLPITPHYVTQVIR,

IIAHAQLLEQHR,

LSSFVTK,

SLNNQFASFIDKVR,

DVDGAYMTK,

THNLEPYFESFINNLRR,

SKAEAESLYQSK,

FLEQQNQVLQTKWELLQQVDTSTR,


NKLNDLEDALQQAK,

SLVNLGGSK,

IEISELNR,EIKIEISELNR,

FLEQQNQVLQTK,TAAENDFVTLKK,

GSYGSGGSSYGSGGGSYGSGGGGGGHGSYGSGSSSGGYR,

FSSCGGGGGSFGAGGGFGSR, VDLLNQEIEFLK,

YEELQVTVGR,

GGSGGGGSISGGGYGSGGGSGGR,


THNLEPYFESFINNLR, SLDLDSIIAEVK,

LLRDYQELMNTK,AEAESLYQSK,

SDQSRLDSELK,

SKEEAEALYHSK,

TLLEGEESR,

HGGGGGGFGGGGFGSR,

FGGFGGPGGVGGLGGPGGFGPGGYPGGIHEVSVNQSLLQPLNVK,

GFSSGSAVVSGGSRR,

TNAENEFVTIKK,

KRT2,

WELLQQVDTSTR,

GGSISGGGYGSGGGK,

KRT1,YLDGLTAER,GGGFGGGSSFGGGSGFSGGGFGGGGFGGGR,

TAAENDFVTLK,

MALLATVLGRF,ATGYPLAYVAAK,

VLGTSPEAIDSAENR,

VLSEPNPRPVFGICLGHQLLALAIGAK,

KVLILGSGGLSIGQAGEFDYSGSQAIK,

VLILGSGGLSIGQAGEFDYSGSQAIK,

AIVHAVGQELQVTGPFNLQLIAK,

TLGVDLVALATR,

VIECNVR,ILALDCGLK,

TSSSFAAAMAR,LAADFSVPLIIDIK,

EHPVVISK,

QDVEALWENMAVIDCFASDHAPHTLEEK,

ALKEENIQTLLINPNIATVQTSQGLADK,

KNILLTIGSYK,

EATAGNPGGQTVR,

AAFALGGLGSGFASNR, LFVEALGQIGPAPPLK,

VQVEYKGETK,

NSLESYAFNMK,

ARFEELNADLFR,

VQVEYK,

FEELNADLFR,VNHFIAEFK,

ILPWSTFR,

SINPDEAVAYGAAVQAAILSGDK,

SQIHDIVLVGGSTR,

TVIKEGEEQLQTQHQK,

HISPGDTK,

ASDPGLPAEEPKEK,

SFEEAFQK,

TLSSSTQASIEIDSLYEGIDFYTSITR,

NQTAEKEEFEHQQK, HSPA8,

QTQTFITYSDNQPGVLIQVYEGER,

DNNLLGK,

HWPFMVVNDAGRPK, GPAVGIDLGTTYSCVGVFQHGK, NTTIPTKQTQTFTTYSDNQPGVLIQVYEGER, NQVAMNPTNTVFDAK,

LOC100287551,

ELDDRDHYGNK,

ATVEDEKLQGK, LDKSQIHDIVLVGGSTR,


FWEVISDEHGIDPTGTYHGDSDLQLER, MREIVHLQAGQCGNQIGAK,


FWEVISDEHGIDPTGTYHGDSDLQLDR, INVYYNEATGGK,

KEPESCDCLQGFQLTHSLGGGTGSGMGTLLISK,

KEAESCDCLQGFQLTHSLGGGTGSGMGTLLISK,

MRENVHLQAGQCGNQIGAK, MAVTFIGNSTAIQELFKR, MSATFIGNSTAIQELFKR,

IMNTFSVVPSPK,TUBB4B,

MAVTFIGNSTAIQELFK,

MAATFIGNSTAIQELFKR, EVDEQMLNVQNK,

INVYYNEATGGKYVPR, KEWESCDCLQGFQLTHSLGGGTGSGMGTLLISK,

TUBB,TUBB4A,

MREIVHIQAGQCGNQIGAK,

MSATFIGNSTAIQELFK,

MSMKEVDEQMLNVQNK,

AVLVDLEPGTMDSVR, NSSYFVEWIPNNVK,

VREEYPDR,

MAATFIGNSTAIQELFK,

TAVCDIPPR,TAVCDIPPRGLK,

GHYTEGAELVDAVLDVVR,

IREEYPDR,GHYTEGAELVDSVLDVVRK,

AILVDLEPGTMDSVR,

KLAVNMVPFPR,

MASTFIGNSTAIQELFKR,

Figure 3.4: Peptide-protein graph constructed from an AP-MS experiment with POLR2A as a bait. Proteins are drawn as red triangular vertices while peptides are drawn as blue round vertices.



SFQQSSLSR,ITPENLPQILLQLK,DSFDDRGPSLNPVLDYDHGSR,

VVHIMDFQR,DLPEHAVLK,

SSGPTSLFAVTVAPPGAR, ISKEVLAGRPLFPHVLCHNCAVEFNFGQK,

LLEQYKEESK,

AINTLNGLR,

DYGHSSSRDDYPSR,

GFGFVTMTNYDEAAMAIASLNGYR, GFAFVTFESPADAK,

ELAVL2,

ALEAVFGK,

ELAVL4,

RBMX,

VEQATKPSFESGR,

DSYESYGNSR,

LFIGGLNTETNEK, DRDYSDHPSGGSYR, QTGIVLNRPVLR,

GPPPSYGGSSR, DVYLSPR,AAVLLEQER,SSLGQSASETEEDTVSVSKK,

KPGDLSDELR,GIEKPPFELPDFIK,

QSLGHGQHGSGSGQSPSPSR,

GPYESGSGHSSGLGHR,

HGSGSGQSSSYGPYGSGSGWSSSR, HGSGSGHSSSHGQHGSGSSYSYSR, GSGSGQSPSSGQHGTGFGR,

GHYESGSGQTSGFGQHESGSGQSSGYSK,

LAQHITYVHQHSR, SITVLVEGENTR, SKGYAFIEFASFEDAK, SISLYYTGEK,

EVVNKDVLDVYIEHR, IAQPGDHVSVTGIFLPILR,

QVVQGLLSETYLEAHR,

SLEQNIQLPAALLSR,

MVDVVEKEDVNEAIR,

FLQEFYQDDELGKK,

NDLAVVDVR,

GFGFVDFNSEEDAK,

TLLAILR,

GRHGSGSGHSSSYGQHGSGSGWSSSSGR, LFADAVQELLPQYK,

SSSRGPYESGSGHSSGLGHQESR, AGILTTLNAR,FGYVDFESAEDLEK,

LELQGPR,

GPYESGSGHSSGLGHQESR,

HGSGSGHSSSYGQHGSGSGWSSSSGR,

ISLGMPVGPNAHK,

HRNR,

HGSGSGQSSSYSPYGSGSGWSSSR,

FDLLWLIQDRPDR,

FTVAELK,

RFELYFQGPSSNKPR,

SF3B2, QSSSYGPHGYGSGR,

VGEPVALSEEERLK, TGIQEMR,

AIKVEQATKPSFESGR, LTIHGDLYYEGK,

LHDAFFK,

EKKPGDLSDELR,

IPQALEK,

GIEKPPFELPDFIKR,

LAEIGAPIQGNREELVER,

HGSGSGQSSSYGPYR, SSSGQSSGYSQHGSGSGHSSGYGQHGSR,

SSSSGQHGSGLGESSGFGHHESSSGQSSSYSQHGSGSGHSSGYGQHGSR, YGPPPSYPNLK,

VEGTEPTTAFNLFVGNLNFNK,

VFGNEIK,

KFGYVDFESAEDLEK, ALELTGLK,

VFGNEIKLEKPK,

TGISDVFAK,

SISLYYTGEKGQNQDYR,

TEADAEKTFEEK,

NCL,

VTQDELKEVFEDAAEIR,

QKVEGTEPTTAFNLFVGNLNFNK,

ALLLLLVGGVDQSPR, MCM7,

TQRPADVIFATVR,

SLGTADVHFER,

DIVQFVPFR,

TFIIDYYFEVVQK,

CPNE3,

SPLGEVAIRDIVQFVPFR,

LYGPTNFSPIINHVAR,

SFLLDLLNATGK,

KGADSLEDFLYHEGYACTSIHGDR,

DDX3X,

NINITKDLLDLLVEAK,

DLLDLLVEAK,VGSTSENITQK,

AHQCGDDDKTRPLVK,

HAIPIIK,

INAGFGDDLNCIFNDDNAEKLVLR, YSGSYNDYLR,

VAVGELTDEDVK,QINIHNLSAFYDSELFR,

ESLVVNYEDLAAR,

MCM2,

IFASIAPSIYGHEDIKR,

SGYGFNEPEQSR,

RDNNELLLFILK,AVPKEDIYSGGGGGGSR,

HLSDHLSELVEQTLSDLEQSK,

HLILPEKYPPPTELLDLQPLPVSALR,

KLFVGGLK,

YAQAGFEGFK,

SNRNP200,

SNIDALLSR,

VLAGQTLDINMAGEPKPDRPK,

LQASNVTNKNDPK,

RALY,

SELKELLTR, EFEEESKQPGVSEQQR,

CPNE1,S100A4,

SDPFLEFFR,

DAINAETVLVLVNAVYFK,

FCFDLFQEIGKDDR, VQDQDLPNTPHSK,

ASSLSESSPPK,

LIMA1,

SSELQAIKTELTQIK,

VFIGNLNTALVK,

TGYTLDVTTGQR,

HRADHPPAEVTSHAASGAK,

NLATTVTEEILEK,JUP,

NKTLVTQNSGVEALIHAILR,

LLLSYGASR,

BARD1, DYNC1H1,

SQELEVKNAAANDK,

NLSDVATKQEGLESVLK,

TMQNTSDLDTAR,

AGPIWDLR,

LRWLLDNVR,

INVS,

LVQIASR,

GYAFITFCGK,

LFVGSIPK,

GALQNIIPASTGAAK, EQGIQDPIKGGDVR,

LIGLSATLPNYEDVATFLR,

DILCGAADEVLAVLKNEK,

SSQSNYGQHGSGSSQSSGYGQHGSSSGQTTGFGQHR,

FLG2,

SVVTVIDVFYK,LFVGGLKGDVAEGDLIEHFSQFGTVEK,

LGLKEFYILWTK,

GHRHQEEESETEEDEEDTPGHK,

QSSYGQHGSGSSQSSGYGQYGSR,

FGEVVDCTIKTDPVTGR,

DAASVDKVLELK,

SRGFGFVLFK,

HNRPDL,

GFCFITYTDEEPVKK,

YHQIGSGKCEIK,

MFIGGLSWDTSKK,

DLTEYLSR,

FGEVVDCTIK,EVYQQQQYGSGGR, GFGFILFK,

FHTVSGSKCEIK,

IFVGGLNPEATEEK, HNRNPAB,

IREYFGEFGEIEAIELPMDPK,

HSGPNSADSANDGFVR,

ITGEAFVQFASQELAEK,

VTGEADVEFATHEDAVAAMSK,

YIEVFK,

YGDGGSTFQSTTGHCVHMR,

ATENDIYNFFSPLNPVR,

HNRNPF,

VHIEIGPDGR,

HNRNPH1,

EGRPSGEAFVELESEDEVK,

VTGEADVEFATHEDAVAAMAK,


SNNVEMDWVLK,

DLNYCFSGMSDHR,

HNRNPH2,STGEAFVQFASQEIAEK,

GLPWSCSADEVQR,

QEPSQGTTTFAVTSILR,

YVEVFK,

SEDTAMFFCAR,

WLQGSQELPR,

IGHA2,

VAAEDWK,

FFSDCK,

IQNGAQGIR,

HTGPNSPDTANDGFVR,

ENDLSVLR,

TRPV5,

SRGFGFILFK,

MFVGGLSWDTSKK, EYFGEFGEIEAIELPMDPK,

DLKDYFTK,

HSEAATAQREEWK,

FGEVVDCTLK,

GFGFVLFK,IREYFGGFGEVESIELPMDNK,

HNRNPD,MFIGGLSWDTTKK,

FGEVVDCTLKLDPITGR,

IFVGGLSPDTPEEK,

IDASKNEEDEGHSNSSPR,

ESESVDKVMDQK,

VTISVDTSK,

NSLYLQMNSLR,GLEWIGR,

AEDTALYYCAK,

AEDTAVYYCAR,

VDDTAVYYCAR,

LSCAASGFTFR,LRAEDTAVYYCAR,

TTPPVLDSDGSFFLYSK,

EPQVYTLPPSRDELTK,

GPSVFPLAPSSK,

THTCPPCPAPELLGGPSVFLFPPKPK,

SDDTAVYYCAR,

VTVSSASTK,

WQQGNVFSCSVMHEGLHNHYTQK,

SRWQQGNVFSCSVMHEALHNHYTQK,

TPEVTCVVVDVSHEDPEVK,


TPEVTCVVVDVSHEDPEVKFNWYVDGVEVHNAK,

DTLMISR,

IGHM,SDDTAVYFCAR,

GTTVVVSSASTK,

AEDTGVYYCAK,

GLEWVAFIR,

WQQGNVFSCSVMHEALHDHYTQK,

GTLVIVSSASTK,

IGHG1,

LSCAASGFTFSR,

IGHV4-31,

IGHV1OR15-1,

ALPAPIEK,

TEDTAVYYCTR,

QAPGQGLEWMGR,

VVSVLTVLHQDWLNGKEYK,

TTPPMLDSDGSFFLYSK,

IGHG4,

VVSVLTVLHQDWLNGK,

YGPPCPSCPAPEFLGGPSVFLFPPKPK,

SDDTVVYYCAR,

TPLGDTTHTCPR,

SCDTPPPCPR,

STSGGTAALGCLVK,

NTLYLQMNSLR,

DNSKNTLYLQMNSLR,

SLRSDDTAVYYCAR,

GTAVTVSSASTK,

ADDTAVYYCAR,

CPAPELLGGPSVFLFPPKPK,

GRFTISR,

LSSVTAADTAVYYCAR,


WQEGNVFSCSVMHEALHNHYTQK,

WQQGNVFSCSVMHEALHNHYTQK,

GTTVIVSSASTK,

LTVDKSR,

GDDTAVYYCAR,

VVSVLTVVHQDWLNGK,

EPQVYTLPPSR,LOC100290146,

GRFTVSR,

SEDSALYYCAR,SEDTAVYYCAR,

GTLVTVSSASTK,

FNWYVDGVEVHNAK, SEDTAVYFCAR,

GPSVFPLAPSSKSTSGGTAALGCLVK,

GLVWVSR,

FTISRDDSK,

GLEWVGFIR,

VTISLDTSK, IGH@,

IGHA1,

TPLTATLSK,

NQFSLR,

DASGVTFTWTPSSGK,

VEDTAVYYCAR,

TFTCTAAYPESK,

LSISIDTSK,

TLVTQNSGVEALIHAILR,

ILVNQLSVDDVNVLTCATGTLSNLTCNNSK,

TSFSPCVPQCQTQGSYGSFTEQHR,

KPRP,

CPVEIPPIR,

RLDQCPESPLQR,

GRPAVCQPQGR,

VSVELTNSLFK,


TUBB2B,

QDLMIEDNLLK, IDEVPSR,

LOC100289196,

HNRNPL,

NDQDTWDYTNPNLSGQGDPGSNPNKR, CSGEEQSLEQCQHR,

FSAFLDK,

EATLQDCPSGPWGK, QINVSTDDSEVK,

SASPQRAARPR,

TDNAGDQHGGGGGGGGGAGAAGGGGGGENYDDPHKTPASPVVHIR,

YYGGGSEGGRAPK,

HQNQWYTVCQTGWSLR,

KFVIHPESNNLIIIETDHNAYTEATK,

GOLGA6D,

EKADGTEQVER, PRKDC, LAAVVSACK,

LWIANYSLPR,

SLGTIQQCCDAIDHLCR, DFSAFINLVEFCR,

QLFSSLFSGILK,

RBBP6,

KSNSSPSR,

EVDDLGPEVGDIK,

SNLGSVVLQLK,

TLATDILMGVLK,

HIANYISGIQTIGHR,

H3F3B,

RVTIMPK,UBA52,

DAALATALGDKK,UBB,

TITLEVEPSDTIENVK,

EATADDLIKVVEELTR,

DDB1,

KTEPATGFIDGDLIESFLDISRPK,

RHPN2,VPS13D,MAGI2, CCDC39,

QENKMR,

GFAFVTFDDHDSVDK,

SHFEQWGTLTDCVVMR,

IEVIEIMTDR,DYFEQYGKIEVIEIMTDR,

SRGFGFVTYATVEEVDAAMNARPHK,

SSGPYGGGGQYFAKPR, LFIGGLSFETTDDSLR,

YHTINGHNCEVKK,

RGFAFVTFDDHDSVDK, IFVGGIKEDTEEYNLR,

LFIGGLSFETTDDSLREHFEK,

GFGFVTYSCVEEVDAAMCARPHK,

YGKIETIEVMEDR,

SDVEAIFSK,

VDSLLENLEK,

MIASQVVDINLAAEPK,

GDDLQAIKK,

HNRNPCL1,

VPPPPPIAR,

MYSYPAR,

HNRNPC,

STAGDTHLGGEDFDNR,

QTQTFTTYSDNQPGVLIQVYEGER,

TTPSYVAFTDTER,

NQVAMNPTNTVFDAK, DAGTIAGLNVLR,

VQVEYKGETK,

VEIIANDQGNR,

FEELNADLFR,

IINEPTAAAIAYGLDK,

VSYARPSSEVIKDANLYISGLPR,

VAPEEHPVLLTEAPLNPK,

GYSFTTTAER,

ACTB,

TTGIVMDSGDGVTHTVPIYEGYALPHAILR,

AGFAGDDAPR,

IIAPPER,

ISGLIYEETR,VFLENVIRDAVTYTEHAK,

HIST1H4E,DNIQGITKPAIR,

DAVTYTEHAK,

HIST2H4A,TVTAMDVVYALK,

VLRDNIQGITKPAIR,

EIQTAVR,HIST1H2BO,

HIST1H2BB,

ISGLIYEETRGVLK,

KLFIGGLSFETTDDSLR, SRGFGFVTYSCVEEVDAAMCARPHK,

EDSVKPGAHLTVKK,

HNRNPA3,

YHTINGHNCEVK,

GFAFVTFDDHDTVDK,

WGTLTDCVVMRDPQTK,

VAPDEHPILLTEAPLNPK,

ACTBL2,

GFSSGSAVVSGGSR,

DVDAAYMNKVELEAK,


SLVNLGGSK,

NKLNDLEEALQQAK, LDNLQQEIDFLTALYQAELSQMQTQISETNVILSMDNNR,

TSFTSVSR,

KRT6B,KRT6A,

KRT5, SGFSSVSVSR,

YEELQVTAGR,

QNLEPLFEQYINNLRR,

SFSTASAITPSVSR,

QSSVSFRSGGSR,ATGGGLSSVGGGSSTIK,

VSLAGACGVGGYGSR,

SRGSGGLGGACGGAGFGSR,

NTKQEIAEINR,

WTLLQEQGTK,FSGSGSGTDFTLK,

SGTASVVCLLNNFYPR,

QNLEPLFEQYINNLR,

TLNNQFASFIDKVR,

LQGEIAHVK,

DVDNAYMIK,KRT73,

YLDFSSIITEVR,

VIM,

KRT78,

LLEGEECR,

NVQDAIADAEQRGEHALK,

FLEQQNK,FASFIDKVR,

SLDLDSIIAEVR,

FASFIDK,

KLLEGEECR,

GGSSSGGGYGSGGGGSSSVK,

SLVGLGGTKSISISVAGGGGGFGAAGGFGGR,

KRT3,

DVDAAYMSK,

KDVDAAYMSK,

KRT4,

LALDIDIATYRK,

KRT7,

LLRDYQELMNTK,


SKAEAESLYQSK,


SGGGGGRFSSCGGGGGSFGAGGGFGSR,

GGGFGGGSSFGGGSGFSGGGFGGGGFGGGR,

DSGVPDRFSGSGSGTDFTLK,

HKVYACEVTHQGLSSPVTK,

DSTYSLSSTLTLSK,

VQWKVDNALQSGNSQESVTEQDSK,

RTVAAPSVFIFPPSDEQLK, VDNALQSGNSQESVTEQDSK,

TVAAPSVFIFPPSDEQLK,

SISISVAGGGGGFGAAGGFGGR, TEAESWYQTKYEELQQTAGR, LALDVEIATYRK,

YEELQQTAGR,

SRAEAESWYQTK,GLGVGFGSGGGSSSSVK,

KIAA1598, ENO1, TCEB3C, CUL4A, ZNF20, CHD5, RBFOX3,

LSGEAFDWLLGEIESK, RLPDAHSDYAR,FRHENIIGINDIIR,FSGVPDR,MPCQLHQVIVAR,DTSSVAVNLPVPSGVAFPAVFLR,

RBFOX2,CHD3,ZNF571,TCEB3CL,ENO2, CAPN3,

ALYREF,SRRM2,DDX1,NCOA5,C19orf43,KRT23,ZNF326,POLR2A, RBM3,ANXA6,

RBFOX1,CUL3,ENO3,AKAP9, LOC100506888, CHD4,ZNF461,

YSGGNYRDNYDN, QQLSAEELDAQLDAYNAR, ALIEILATR,NQGGSSWEAPYSR, GGHPPAIQSLINLLADNR, QRQEEPPPGPQRPDQSAAAAGPGDPK, AAFGISDSYVDGSSFDPQRR, LASYVEKVR, FLICTDVAAR,

HNRNPUL2, DSC1, SERPINB12, GAPDH, CHERP,

VTIFTVPENCR, VIHDNFGIVEGLMTTVHAITATQK, LLAAVEAFYSPPSHDRPR, ALDVMVSTFHK,DIVQFVPYRR,TLLADQGEIR, LLIYETEAK,

ILDVEIIFNER,HHYEQQQEDLAR, QCGKAFIR,LEGMFR,TEKDNAALAR,SGETEDTFIADLVVGLCTGQIK,

RBM14,MAPK1,IGKV2-24,CAPN2,RPAP1,RBM25,MTA2,

FAANPNQNK,

RGPPPPPR,



LTIHGDLYYEGKEFETR,

VLVDQTTGLSR,

TFGQGTKVEIK,

ASQSVSSSYLAWYQQKPGQAPR,

LLIYGASSR,

ATGIPDRFSGSGSGTDFTLTISR,

AEDTAVYYCAK,

LYACEVTHQGLSSPVTK,

HKLYACEVTHQGLSSPVTK,

ATLFCR,

ATGVPDRFSGSGSATDFTLTISR,

LLIYVASSR,

IGKV3D-20,

FSGSGSGTDFTLTISR,

FSGSGSGTDFTLTITR,

SCDTPPPCPYCPAPELLGGPSVFLFPPKPK,

SCDTPPPCPNCPAPELLGGPSVFLFPPKPK,

TPEVTCVVVDVSHEDPDVQFK,

SCDTPPPCPGCPAPELLGGPSVFLFPPKPK,

SCDTPPPCPICPAPELLGGPSVFLFPPKPK,

IGHG3,

WQQGNIFSCSVMHEALHNR,

FDPWGQGTLVTVSSASTK,

IGHG2,

TPEVTCVVVDVSHEDPEVQFK,

GLPAPIEK,

WYVDGVEVHNAK,

IRVESLLVTAISK,

DVLIQGLIDENPGLQLIIR, LQETLSAADR,

GOLGA6B,

GMKPTEFFQSLGGDGER, QNVNILIDTEK,

TKAPDDLVAPVVK,

ILIGTVFMK, STLDINEMVR,

AITHLNNNFMFGQK,

RFEPCSSSYLPLRPSEGFPNYCTPPR,

ELGCGAASGTPSGILYEPPAEK,

LCFSTAQHAS,

IWSLDSDEPVADIEGHTVR, PRPF4,

GQWGTVCDDGWDIKDVAVLCR,

FEPIHGNFLLTGAYDNTAK,

MKRN4P,GIAYVEFVDVSSVPLAIGLTGQR, RHPYFYAPELLFFAK, AFGYYGPLR, QYGLGPNGGIVTSLNLFATR, KIAA1949,GVDEVTIVNILTNR, ASAISVTVLNVIEGPVFRPGSK,

WTLNSR,

LVQLLVK,

WLLDNVR,

VTVPLVR,

DLYEDELVPLFEK,

SNIDALLSRLEQIAAEQK,

HNRNPR,

LCDSYEIRPGK,

MDLRQACR, SALEQMR,

SKGIAYVEFVDVSSVPLAIGLTGQR,

LRPEPCISLEPRPRPLPR,

YLPM1,SUPT16H, DLEEFFSTVGK,ELAAQLNEEAK,

IWLDNVR,LVGGDNLCSGRLEVLHK,

PPP1R18,

LTGHLHAQGTPPYVINLDPAVHEVPFPANIDIR,

SMSLVLDEFYSSLR, TNQELQEINR,

ANXA2,

IHSDCAANQQVTYR,

SALSGHLETVILGLLK, DSG1, GPN1,

PABPC3,

ESSNVVVTER,

HPYFYAPELLFFAK, SRAEAALEEESR,

SMARCE1,

DSCPLDCK,

NPPGFAFVEFEDPRDAADAVR, SRSF3,

BRD4,

QLSLDINKLPGEK,

CSNK2A1,

GGPNIITLADIVKDPVSR,

SHCIAEVENDEMPADLPSLAADFVESK,

EQVANSAFVER,

MAVAEVSR,PLA2G4B,AGGGQALSR,

HSP90AB1,

CSPP1,KNETGVLR, NOS2, PTPRG,

ALAEFEEK,

ABHD11,DGGLAMLPILVSK, C6orf99,TSVEAAVAPR,PPP1R3F, TAMLLALQR,

ABCC13, SAAKGSDSR, SRCAP, IFCFILSTR, CMA1, LLLFLLCSR,

LOC100506191,

LIICVFLEK,

AK3,

MACROD1,

C1orf144, ITQKESR, LEQLASR, MLLT4, ELQALSR,MVSMMEGVIQVR, PLEKHJ1,

EDDKPETVINR,TVTEDFPK,NAP1L2,KSNTLLR,YIPF6,FAM105B,

RFASLSR,

EPEKEAGAGALPR,

VARS,

EPLHVVATK,

SSLKGLAR,

HVEDVPAFQALGSLNDLQFFR,

MPGTVATLR,

HTATSF1,

LLLSFWK,

LREAMAALR,

PLD5,

JMJD5,

KLTGCSR,

SETDB1,

MANLTELELIR, CRH,

QLKGLEEK, CLVALKER,

ESLTLLR,

DACT2, MYOM3,

GPLD1,

LMEIIGK, C1orf87,

C10orf47,

SVLNNSEVK,

ITLGEEKDR,

TSDLIVLGLPWK,

DRWTAAGALPR,

SLAEKQNLEK,

PABPC1,

ESRP2,

TVLDQAR,

FLJ22184, KLHL34,

PSD4,

TRIM37,ANSLVTLGSHR,

ZKSCAN3,

RBBP4,FAM133B,

KIF26B,

RTWLESMAK,

SMARCC2,

RBBP7,

NSIVVPSER,

HSP90AB3P,

KVDCLESTLEK,

RASAL2,

CDH11,

VFQTEAELQEVISDLQSK,

KLSEVGR,

MESRGLR, ANKRD26,

TXLNG2P,

OR1Q1, KPNA2,

MAANSSGQGFQNK,

SMARCC1,

NLDVGANIFIGNLDPEIDEK,

LEQRDR,

YSFLRGTK,

SF3B4,

TQDECILHFLR,

VTGQHQGYGFVEFLSEEDADYAIK,

THEMIS,

IENLSYDAK, FNKDAQR,

USP24,

CEP350, LEGQLEAGEPK, SSSHSGREVVMR,

DTX4,

RPAVVYIGSAGKPHER, GVPTPSGDR, DFPTISR, MYH10,SLEAEILQLQEELASSER, LASYTVR,RAD51AP2, SNED1,C10orf114, MSSVYLK,APBA3,

LILEHQEK,SCLT1,SLIMEAPR,SGCZ,ETFA,AAELEKK, SPDTFVR,DIAPH1,NVEAVRNAK,

LRHLPSR,GLOD5,

ELKSIPR,

NIETIINTFHQYSVK,

EAAVLQEIR,

YEDLLK,TAQEYGGIRHGAK,

GLG1,

CCDC87,

JMJD1C,

VKLAQLAEK,

DEIIER, IRGC,

QCGKAFLR,

AQHQQALSSLELLNVLFR,

S100A9,LPFETFR,

PCCB,

AAVENLPTFLVELSR,

YGPIVDVYVPLDFYTR,

LAQAYHR,

NWGVMKMILMK,

RPQDKVAVSDMK,

TLQCLGLR,

CAIGVTHFQLVQMK,

LAGSKDPR,

TKILATGGASHNR,

NIPAL3,

XYLB,

FFEVILIDPFHK,

VFIGNLNTAIVK,

SDSVLLGMEGSVK, GBE1,EMENKETLK, IPO7,LAMAIPDK,

ZMAT1,

IL18RAP,

MYBBP1A,

PTPRZ1,

GBP1,

PABPN1,

RPUSD3,DAEWTTVFK,ARVCF,

ACO1,MSAPSLR,LRIG3,

SPGSAQSLGK,

PITRM1,

MAST4,

C14orf166,

EFCAB7,

LOC646214,

ACAT2,

S100G,

PRPF40A,

HDAC9,

ZNF687,

TRIM66,

FBLN7,

FISHKK, PTPRB,

NBPF8,

ARFGEF2,

MYLIP,

SSIVPEVK,

YVFDIK,

TLKINDEK,

IGEEARR,

ISFLLGASR,

FUT8,

DLKAENLLLDANLNIK,

VLDEEGSER, KLRAP1,

FBXO18,NGHTDCVR,

EQILTNKTLK,RMAESLR,

RBBP8, TMEM198, VGSAAQTR,

IFEIFGPVAHPFYVLR,

KCNEVK,FBXO22,GWALYCAGK,

RRP1B,

SPR,LMDAAELVK,

DWKEVLR,

KLEEMR,

QGLCVLRR,

PPP2R3C,

WNT8A,

PRPF31,

MAP3K1,

PDZD2,AEMNILQINEK,

ATF2,

DDX23,

SPMDKVLR,

SPEELKR,

NSKLMEPNLIK,

SLDWNSLLRQK,

QLDAGLAR,

ILLGDDSQK,

VAVLSQNR,

GBP2, MLEEIQK,

BRP44L, MAP1A,

TLR7,

VAIDRSR,

TMESESLRTLEFR,

LKQHFNNALPK,

VSTMLDRLHSTR,

ZHX3,

ANKRD12,

GKNLSKPK,

LESGDRIR,

PARPBP,

C14orf38,

RVEGIQYEDISK,

VPSLREAMEK,

LAALQNSVGR,

RTLSLPR,

TRIO,

BEGAIN,

LAMA4,MLVAENGKK,

PYLGSEDVVKELK,

SVLVDFLIGSGLK, CSE1L,

YNEDLELEDAIHTAILTLK,

LQDAGIAR,

ARHGEF1,

RPL15,NLEQENQNLR,POF1B,TSLALDESLFR,BBS9,

C1orf141,

HILGFDTGDAVLNEAAQILR, AGQAVSSGGILR, YJEFN3,

MRCLTTPMLLR,

YPEEQLK, PCM1,

LYST,

PIK3R6,

KPNB1,

EAGEAPKPGEEVK,

GLAAALLLCQNK,

VLGTEAVQDPTKVEAHVR,

VAPPERR,

SRSF10,

YBX1, RALYL,RPS6,GCIVDANLSVLNLVIVK, NYQQNYQNSESGEKNEGSESAPEGQAQQR, SART1,

PRPF3,

HIVEP3,

ETSLLLR,RAPH1,

ZNF14,

IMKNEIQDLQTK, SWAP70,

DQNEAQTAKEFIK, CAPZA1, EHD2,

MGSLKEELLK,

TSFIQYLLEQEVPGSR,

LTVLAVNR,

LPEMEPLVPR,

CEP250,

09-Mar,

ELSEARIR,

TCP10,

QKLGGASQGR,HDAC6, MYRIP,

ATP13A5, ELKEAAR,

MDTLAVALR,

KNVLQGGESTK,

LQATEVR,

MORF4L1,

ANPEP,

AKAP13,

BCL2L10, PRKCE,

TSGLQQKNVEVK,

TRMT112,

YESLLR,

GLDIESTSK,

QGHYESLKER,

LKLLEEER,

GLQTVHINENFAK,

MAPAYLPR,

EGECLTLCK,

LLLVAWDR,

SH3TC2,

ITGLCPFGPR,

NCAPD3,

SCEL,

GSVMPSVIR,

MNKSLNR,

SLC6A10P,MLANTGRK, EGTPDTAPTSR, TRIM56,

MAAPTLGR,

VVDVSVPR,

KIAA0825,

KVHGDVVK,

LALTDAIK,AQIEDTLR, ASXL3,

ITPR3, PCDH17,

VLQVPMLAHGGCCREDAVVASR, CEP97,

MQVAMAR,

SERPINA12,

CCDC124,

DAGFPFSQDINSHLASLSMAR,

LNASDLRLPSR,

RYR3,

KLDQQYK,

SLC52A1,

FPSNIIVTNGAAR,

SSAWKTPR,

LEKAEAR,

VLVDMSR,

PRPS1L1, SRSPISAK,

ATRDPFAPSR,

MKTTSTK,

EEPRVPPLK,

ZNF284,

SNW1,

ABCF1,

MTR,

PLXNA3,

DPYGFGDSR,FLVLDEADGLLSQGYSDFINR,

HVINFDLPSDIEEYVHR,

SLSYSPVER,

SSPVEFECINEK,

SEIDLLNIRR, YYDSRPGGYGYGYGR,

C9orf174,LOC100508206,

CAMK2B,

KIAA1529,

SNRPN,

SRSF8,

VGDVYIPR,

CEL,MYT1L,

SNRPB,SNRPD2P1,

SGQSSYGQHSSGSSQSSGYGQHGSR, EWTLEAGALVLADR,

DNNELLLFILK,

LASYVEK,

SLSGCPRAK,

MYT1,

LGLLGDSVDIFKGIPFAAPTK,

FSNSSSSNEFSK,

VADPDHDHTGFLTEYVATR, FSGVPDRFSGSGAGTDFTLK,

LTAIDILTTCAADIQR,

DILCGAADEVLAVLK,

GFGFVYFQNHDAADK,

HNRNPA0,

TRIP10,

MKRN1,

QLEMEK,

MXRA5,

CCT2,

TVALWDLR,

REEPQR,

EEA1,

SGGGGGGGGSSWGGR, EDIYSGGGGGGSR,

IIERDTSSVAVNLPVPSGVAFPAVFLR, EYSSELNAPSQESDSHPR,

HSSWSEGEEHGYSSGHSR,

LFIGGLNVQTSESGLR,

TEDPKLR,

MLSAYLR,MLSSFLR,

LEKEAAR,

NNTQVLINCR,

SNRPD2, SRSF2,

RBM15,

LTQYIDGQGRPR,

DPP7,

SVAAWLR,

LNYTLSQGHR,

CAMK2G,

DLKPENLLLASK,

KLGELQEK,GOLGA2,

WASGILR,

VLGLVLLR,

LOC100507381,

ARMCX6,SSKVNDVEAIR,

HERC1,FLQQTGGR,

SUDS3,SCYL1,

CCDC112,

ADCK5,

ANSTITQLTANITK, LAMB4,GGLSGPQGGR,ARL6IP4,QLGLPTALKR,MRTO4,KNSASAER,ALSLWPR,

NQDATVYVGGLDEK,

DOCK10,

CST8,

FAM133A,

RYLTQGITQGK,

ABCF3,

LOC729175,

TNK2,

INAEPSESDTAR, IPDVAALSMGFSVK, FANCM,

SHEGETAYIR,

TCLQASR,

SFVEFILEPLYK,

EAMYNQLGLTGCAGVASNK,

VVQGDIGEANEDVTQIVEILHSGPSK, FYVPPTQEDGVDPVEAFAQNVLSK, MSVQPTVSLGGFEITPPVVLR,

SSRP1,

GLAEDIENEVVQITWNR,

ESRP1,

EAGDVCYADVYRDGTGVVEFVR, DSNFAGDLVR,KSEYTQPTPIQCQGVPVALSGR, QVTITGSAASISLAQYLINVR, INISEGNCPER, LLYDTFSAFGVILQTPK, LVEGILHAPDAGWGNLVYVVNYPK, ELQCLTPR,

Figure 3.5: Peptide-protein graph constructed from an AP-MS experiment with RPAP3 as a bait. Proteinsare drawn as red triangular vertices while peptides are drawn as blue round vertices.


Figure 3.6: A connected component of the bipartite graph from the CDK9 dataset, where all peptides in the component are shared peptides that belong to histones. Our algorithm for MINIMUMPROTEINTYPES chooses only 3 proteins (HIST1H2AH, HIST3H2A and H2AFX) to be in the solution. Blue round vertices indicate peptides while proteins are drawn as red triangular vertices.

ular selection of proteins may not be very accurate due to the relatively small number of

peptides. However, the existing approach of simply ignoring shared peptides within MS

data would lead us to miss this instance entirely. On the other hand, using the naive ap-

proach of assigning each protein the average value of constituent peptide abundances

would lead us to selecting all the histones in this component, resulting in poor quanti-

tative data.

Such an effect is more prominent in the example shown in Figure 3.7. Here, the peptide abundances are taken from an AP-MS run with RPAP3 as the bait protein. While the peptides are shared by four distinct proteins, our algorithms for MINIMUMPROTEINTYPES and MINIMUMERRORSUM both picked HNRNPH1 and HNRNPH2 (heterogeneous nuclear ribonucleoproteins) with positive abundances. This is indeed a reasonable identification, as the remaining candidate proteins are supported by only 1 or 2 constituent peptides.


Figure 3.7: A connected component of the bipartite graph from the RPAP3 dataset. Our algorithms for MINIMUMPROTEINTYPES and MINIMUMERRORSUM both select HNRNPH1 and HNRNPH2 in the solution while discarding the less supported proteins.

3.5 Discussions

In this chapter, we studied the protein quantification problem by first modelling the mass spectrometry data using mathematical programs. While the four problems formulated are all expressed as integer programs, we studied their hardness from a combinatorial standpoint by reductions from SETCOVER and EXACTCOVER. Then, we devised approximation algorithms for the corresponding problems, which also showed promising empirical performance when tested on synthetic as well as biological data.

There remain various extensions to the models discussed here. For example, all four problems (MULTICOVER, MINIMUMPROTEINTYPES, MINIMUMUNIFORMERROR, MINIMUMERRORSUM) model the mass spectrometry data under the assumption that each peptide may occur in a protein at most once. Lifting this assumption can be achieved easily by modifying the adjacency matrix $A$ so that $A_{i,j}$ is the number of times peptide $s_i$ appears in protein $w_j$, and the resulting problems can be attacked using the same approaches.
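As a minimal sketch of this modification, assuming peptides and proteins are available as plain amino-acid strings, the entries $A_{i,j}$ can be obtained by counting the occurrences of peptide $s_i$ in protein $w_j$. The function names and the toy sequences below are illustrative assumptions and are not taken from the datasets discussed in this chapter.

```python
def count_occurrences(peptide: str, protein: str) -> int:
    """Count (possibly overlapping) occurrences of `peptide` in `protein`."""
    if not peptide:
        return 0
    count = 0
    start = protein.find(peptide)
    while start != -1:
        count += 1
        # Resume one position later so overlapping matches are also counted.
        start = protein.find(peptide, start + 1)
    return count


def multiplicity_matrix(peptides, proteins):
    """Return A with A[i][j] = number of times peptides[i] occurs in proteins[j]."""
    return [[count_occurrences(s, w) for w in proteins] for s in peptides]


# Toy input (hypothetical sequences): the first protein contains the first
# peptide twice, so the corresponding entry is 2 rather than 1.
toy_peptides = ["AGLQFPVGR", "HLQLAIR"]
toy_proteins = ["AGLQFPVGRKAGLQFPVGR", "HLQLAIR"]
A = multiplicity_matrix(toy_peptides, toy_proteins)
print(A)  # [[2, 0], [0, 1]]
```

With such a matrix, an entry greater than 1 simply scales the contribution of one copy of protein $w_j$ to the abundance of peptide $s_i$.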

Moreover, one may wish to incorporate into our models the “detectability” $c(i)$ of each peptide$^7$, indicating how difficult it is to detect a particular peptide correctly. Recent studies have shown that peptide detectability can be estimated reliably [2, 102]. For example, Tang et al. proposed a machine learning approach to estimate the detectability of a peptide using its sequence length and its neighbouring regions in the parent protein sequence [133]. To incorporate this into our model of MINIMUMERRORSUM, we can minimize the objective function $\sum_i c(i) \cdot \varepsilon_i$. The algorithm for the modified objective function need not change, as our randomized rounding scheme simply rounds the fractional solution from the LP relaxation.

In the case of MULTICOVER and MINIMUMPROTEINTYPES, we can attach this notion of peptide detectability to each protein $w_j$ by defining $p(j) = \sum_{(i,j) \in E} c(i)$. Then, given this penalty function for each protein, we can ask to minimize the total cost of chosen proteins: $\sum_j p(j)\, x_j$ for MULTICOVER, and $\sum_j p(j)\, \chi(x_j)$ for MINIMUMPROTEINTYPES. The resulting problems can be solved under the same framework, using algorithms for the corresponding weighted set cover problems.
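To illustrate how the penalties $p(j)$ enter a weighted covering problem, the following sketch computes them from the bipartite edge set and then runs the classical greedy heuristic for weighted set cover. This is an assumption-laden illustration rather than the algorithm used in this chapter: it covers every peptide once (plain set cover, not multicover), and all names and the toy instance are hypothetical.

```python
def protein_penalties(edges, detectability):
    """p(j) = sum of c(i) over all peptides i adjacent to protein j."""
    penalty = {}
    for peptide, protein in edges:
        penalty[protein] = penalty.get(protein, 0.0) + detectability[peptide]
    return penalty


def greedy_weighted_cover(edges, detectability):
    """Greedy weighted set cover: repeatedly pick the protein with the
    smallest penalty per newly covered peptide."""
    penalty = protein_penalties(edges, detectability)
    covers = {}
    for peptide, protein in edges:
        covers.setdefault(protein, set()).add(peptide)
    uncovered = {peptide for peptide, _ in edges}
    chosen = []
    while uncovered:
        candidates = [j for j in covers if covers[j] & uncovered]
        # Ties are broken by protein name to keep the output deterministic.
        best = min(
            candidates,
            key=lambda j: (penalty[j] / len(covers[j] & uncovered), j),
        )
        chosen.append(best)
        uncovered -= covers[best]
    return chosen


# Hypothetical instance: (peptide, protein) edges and detectability values c(i).
toy_edges = [("s1", "w1"), ("s2", "w1"), ("s2", "w2"), ("s3", "w2"), ("s3", "w3")]
toy_c = {"s1": 1.0, "s2": 1.0, "s3": 4.0}
print(protein_penalties(toy_edges, toy_c))      # {'w1': 2.0, 'w2': 5.0, 'w3': 4.0}
print(greedy_weighted_cover(toy_edges, toy_c))  # ['w1', 'w3']
```

In this toy instance the greedy rule first selects w1 (penalty 2 for two new peptides) and then w3, whose penalty for covering the remaining peptide s3 is lower than that of w2.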

Another factor that can be incorporated into our model is the set of peptides that were not observed in the MS data. For every unobserved peptide $s_k$, we can introduce its absence to our model by adding $s_k$ to the bipartite graph with the abundance value $\sigma_k = 0$. As these are additional constraints to the integer programs, they would help us obtain more accurate solutions, albeit with a longer running time. However, as discussed in Section 3.4.1, the coverage of a protein by the identified peptides is typically low due to the low sensitivity of mass spectrometers. As a result, we cannot simply interpret undetected peptides as having zero abundance, and the addition of these constraints should be avoided until the sensitivity of MS data improves.

7 The detectability $c(i)$ for peptide $s_i$ can be set higher if, say, the correct abundance of $s_i$ is less likely to be observed.


3.6 Bibliographic Notes

The set cover problem and its generalization, the multicover problem, have been studied extensively in the approximation algorithms community. Lund and Yannakakis [101] showed that SETCOVER cannot be approximated within a factor of $o(\log n)$ unless $\mathrm{P} = \mathrm{NP}$. Moreover, Feige [48] showed that, unless $\mathrm{NP} \subset \mathrm{DTIME}(n^{O(\log \log n)})$, there is no $(1 - o(1)) \ln n$ approximation. Thus the approximability of the set cover problem is essentially resolved if $\mathrm{P} \neq \mathrm{NP}$. When the input set system admits a special underlying structure, the lower bound on the approximation ratio for general set cover (and multicover) can be circumvented: for example, Brönnimann and Goodrich [21] showed an improved $O(1)$ approximation ratio when the elements are points on the plane while the sets correspond to disks on the plane. Further, there are various algorithms specialized for different restricted cases of set systems [25, 47] that may provide better approximation ratios. Unfortunately, however, the set systems constructed from our PPI dataset do not admit these underlying structures.

The results discussed in this chapter are to be submitted [89].

Chapter 4

Direct PPI Networks from AP-MS Data

With the help of high-throughput experimental techniques, a large amount of PPI data

has recently become available, providing us with a rough picture of how proteins inter-

act in biological systems. However, interaction data from high-throughput experiments

often suffers from relatively high error rates and protocol-specific biases. Therefore, in-

ferring the physical PPI network from high-throughput data remains a challenge in sys-

tems biology.

As discussed in Chapter 1, affinity purification followed by mass spectrometry iden-

tification (AP-MS) is an increasingly popular approach to observe protein-protein inter-

actions (PPI) in vivo. One drawback of AP-MS, however, is that it is prone to detecting

indirect interactions mixed with direct physical interactions. Therefore, the ability to

distinguish direct interactions from indirect ones is of much interest. In this chapter,

we propose a simple probabilistic model for the interactions captured by AP-MS exper-

iments, under which the problem of separating direct interactions from indirect ones is

formulated. Then, given idealized quantitative AP-MS data, we propose an approach to

identify the most likely set of direct interactions that produced the observed data.

This chapter is organized as follows. In Section 4.1, we give a brief overview of related

research around the AP-MS technology. Then, Section 4.2 describes the mathematical


modelling of an AP-MS experiment, and formulates an optimization problem. In Sec-

tion 4.3, we describe the overall algorithm, which is based on a collection of graph theo-

retic approaches that succeed at inferring a large fraction of the network nearly exactly,

followed by a genetic algorithm that infers the remainder of the network. Finally, in Sec-

tion 4.4, we test the accuracy of our method using both biological and simulated PPI

networks. Here, we apply our algorithm to the prediction of direct interactions based on

a large set of AP-MS PPI data in yeast [93].

4.1 Background

In an AP-MS experiment, a protein of interest (the bait) is first tagged and expressed in vivo. The bait is then immuno-precipitated (IP), together with all of its interacting partners (the preys), and finally, preys are identified using mass spectrometry. Like Y2H and other high-throughput experimental methods, however, AP-MS suffers from experimental noise. A number of approaches have been proposed to separate true interactions from false positives. These approaches mostly focus on reducing false positives due to protein misidentification from MS data [63, 107, 150], on detecting contaminants [19], or a combination of both [26, 33, 35, 111, 122, 123]. These methods often make use of the guilt-by-association principle, and quantify the confidence level of an interaction by considering alternative paths between two protein molecules. In this context, an interaction between bait $b$ and prey $p$ is considered a true positive if, at some point in the set of cells considered, there exists a complex that contains both $b$ and $p$. One concern with this line of reasoning is that, as the sensitivity of AP-MS methods improves (and thus the stability of the complexes that can be detected decreases), the transience of detectable interactions will increase to a point where, eventually, every protein may be shown to marginally interact with every other protein.

A key property of AP-MS approaches is that a significant number of the co-purified prey proteins are in fact indirect interaction partners of the bait protein, in the sense that they do not interact physically and directly with the bait, but interact with it through a chain of physical interactions involving other proteins in the complex. Therefore, it is critical, when interpreting AP-MS-derived PPI networks, to correctly interpret the meaning of the term “interaction”. Although AP-MS was not designed to identify direct physical interactions, the increasing amount of AP-MS data available justifies the need for separating direct physical interactions from indirect ones. This is the problem we consider in this chapter:

“Given the results of a set of AP-MS experiments, filtered for protein misiden-

tifications and contaminants, how can we distinguish direct (physical) from

indirect interactions?”

Note that since the false-positive filtering methods listed above consider indirect in-

teractions as true-positives, they cannot be used to address this problem. Gordân et

al. [65] study the related problem of distinguishing direct vs. indirect interactions be-

tween transcription factors (TF) and DNA. While the objective of their study is similar

to ours, their method makes use of information specific to TF-DNA interactions (e.g. TF

binding data, motifs from protein binding microarrays), and thus is not immediately ap-

plicable to the problem on general PPI networks. In fact, to our knowledge, no existing

approach seems directly applicable.

4.2 Mathematical Modelling and Problem Formulation

Throughout this chapter, we make the assumption that appropriate methods have been used to reduce protein misidentifications and contaminants as much as possible, in such a way that all interactions detected are either direct or indirect interactions. Our task is to separate the former from the latter. To avoid confusion, therefore, we note that false positives (resp. false negatives) henceforth refer to falsely detected (resp. undetected) direct interactions inferred by our algorithm.


4.2.1 A Probabilistic Model for AP-MS Data

We first describe a simple model of the AP-MS PPI data that shall be used throughout

this chapter. Though admittedly rather simplistic, our model has the benefit of allowing

the formulation of a well-defined computational problem.

Let $G_{\mathrm{direct}} = (V, E_{\mathrm{direct}})$ be the graph of direct (physical) interactions over the set $V$ of proteins considered, and let $N(b) = \{p \in V : (b, p) \in E_{\mathrm{direct}}\}$ be the set of direct interaction partners of protein $b$. We model the physical process through which PPIs are identified in an AP-MS experiment as follows. If a bait protein $b$ is in contact with a direct interaction partner $p \in N(b)$, the affinity purification (AP) process (see Section 1.1.2 in Chapter 1) on $b$ will pull down $p$, which will then be identified through mass spectrometry. In addition, if $p$ interacts with $p' \in N(p)$ at the same time as it interacts with $b$, protein $p'$ may also be pulled down, even though the two proteins $b$ and $p'$ only indirectly interact. In general, any protein $x$ that is connected to $b$ by a series of simultaneous direct interactions may be pulled down by $b$. As a result, all interaction partners of $b$ (direct or indirect) will be identified together. Figure 4.1 depicts an example of this effect.
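The pull-down model described above can be sketched as a graph search, under the simplifying assumption that every direct interaction is present simultaneously and none is disrupted: the preys of a bait are then exactly the other proteins in its connected component of the direct interaction graph. The function name and the toy edge list are hypothetical and serve only as an illustration.

```python
from collections import deque


def pulled_down(direct_edges, bait):
    """Return (direct, indirect) prey sets for `bait`.

    Assumes all direct interactions co-occur and all survive purification,
    so every protein reachable from the bait is pulled down.
    """
    neighbours = {}
    for u, v in direct_edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    direct = set(neighbours.get(bait, ()))  # N(bait)
    seen = {bait}
    queue = deque([bait])
    while queue:  # breadth-first search over direct interactions
        u = queue.popleft()
        for v in neighbours.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)

    preys = seen - {bait}
    return direct, preys - direct


# Toy network: b-p1, p1-p2, b-p3, and an unrelated pair p4-p5.
toy_direct_edges = [("b", "p1"), ("p1", "p2"), ("b", "p3"), ("p4", "p5")]
direct, indirect = pulled_down(toy_direct_edges, "b")
print(sorted(direct), sorted(indirect))  # ['p1', 'p3'] ['p2']
```

An AP-MS experiment reports the union of both sets, which is why the direct edges cannot be read off from presence or absence alone.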

In order to distinguish direct physical interactions from indirect ones, the availabil-

ity of quantitative AP-MS data is helpful. Although quantitative AP-MS remains in its in-

fancy, prey abundance can be estimated fairly accurately using various approaches, e.g.

the peptide count [55], spectral count [100], sequence coverage [51], protein abundance

index [73], and our own approach of quantifying absolute protein abundance discussed

in Chapter 3. Combined with the increasing accuracy and sensitivity of mass spectrom-

eters, these methods are becoming more reliable. Throughout the discussions in this

chapter, we thus assume that this quantitative data is available to us.

To model the quantitative AP-MS data, we first consider the strength of a direct in-

teraction, which can be measured by the energy required to break it. To give an example,

suppose the PPI network consists of only two proteins, b and p . Then, let A(b , p ) denote


b

p1 p2 p3 p4 p5 p6 p7 p8

b

p1 p2

p7

p6

p5

b

p5p6

p8

p1

p2

p3

p4

b

bp1

p2

p3

p4

p5

p6

p7 p8

(a)

(b)

(c)

b

Cell

Figure 4.1: A schematic for indirect interactions in AP-MS data. (a) In a cell, multiple copies of a baitprotein b are expressed, and interact (directly or indirectly) with other prey proteins p1, . . . , p8. (b) Afterthe pull-down on the bait b , MS detects all prey proteins, including indirect interaction partners p4 andp8. (c) The direct interaction network should, however, contain only edges between direct interactionpartners.

the observed abundance of a prey protein p obtained by the bait b , and let χ(b , p ) denote

the true abundance of interactions between b and p . Since protein interactions may be

disrupted by the AP process, we expect that A(b , p ) is correlated with the strength of

the interaction between b and p , as well as the true abundance χ(b , p ). We model the

strength of an interaction using the probability p (b , p ) that the interaction between b

and p survives the purification process, and assume that the interaction breaks with

probability 1− p (b , p ). Then, the observed abundance of protein p obtained from the

pull-down on b would be

A(b , p )∝χ(b , p ) · p (b , p ).

As another example, consider now a system of three proteins b, p1, p2, where (b, p1) and (p1, p2) form direct interactions, but b and p2 do not interact directly. As before, let χ(b, p1, p2) denote the true abundance of protein interactions that occur simultaneously, i.e., the number of times both (b, p1) and (p1, p2) occur together. Then, A(b, p2) ∝ χ(b, p1, p2) · p(b, p1) · p(p1, p2). In general, the observed amount of protein p obtained upon pull-down of b will be proportional to the probability that b and p remain connected after each edge (u, v) ∈ E(G_direct) is broken with probability 1 − p(u, v), i.e., survives with probability p(u, v). Our goal is then to infer G_direct from the set of observed abundances A(u, v). In this chapter, we make the following simplifying assumptions:

1. All direct interactions (u, v) ∈ E_direct survive with the uniform probability p, and fail independently with probability 1 − p.

2. All possible direct interactions take place at the same time, irrespective of the presence of other interactions, and with the same frequency.

Albeit rather strong, these assumptions provide a useful starting point for separating

direct interactions from indirect ones (see Section 4.5 for possible relaxation of these

assumptions). Despite its simplicity, our mathematical modelling of AP-MS does fit ex-

isting biological data reasonably well (see Section 4.4.5). We note that Asthana et al. [5]

have proposed an approach similar to our probabilistic graph model to identify novel

members of known protein complexes. However, their goal is not to identify indirect

interactions, so their approach is not applicable to our problem.

4.2.2 Problem Formulation

We are now ready to formulate the algorithmic problem addressed in this chapter. We henceforth consider the (unknown) direct interaction network G_direct as a probabilistic graph, where each edge in G_direct survives the AP-MS process with probability p, and fails otherwise. Let G denote a random graph obtained from G_direct by removing edges in E_direct independently with probability 1 − p. Then, define P_{G_direct}(u, v) to be the probability that vertices u and v remain connected (directly or indirectly) in G:

P_{G_direct}(u, v) = Pr[ there exists at least one path from u to v in G ].


(a) G_Direct: a graph on the vertices a, b, c, d, e, f (drawing not reproduced).

(b) P_{G_Direct}:

        a      b      c      d      e      f
   a    1     5/8    5/8    5/16   5/32   5/32
   b           1     5/8    5/16   5/32   5/32
   c                  1     1/2    1/4    1/4
   d                         1     1/2    1/2
   e                                1     1/4
   f                                       1

Figure 4.2: An example of (a) a direct interaction network G_Direct and (b) its connectivity matrix P_{G_Direct} calculated with p = 1/2. Assuming each edge of G_Direct survives with probability p, the probability of connectivity between each pair of proteins can be estimated via sampling of the probabilistic network.

We call P_{G_direct} the connectivity matrix of G_direct. See Figure 4.2 for an example of a direct interaction network and its connectivity matrix. In Figure 4.2(b), the entry P_{G_direct}(b, d) is assigned 5/16, because, out of all 64 subgraphs of G_direct (which are equally likely when p = 1/2), only 20 keep the vertices b and d in the same component. In general, P_{G_direct}(u, v) can be estimated from G_direct by straightforward Monte Carlo sampling. However, its exact computation (known as the two-terminal network reliability problem [34, 138]) is #P-complete [138].
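The 20-out-of-64 count can be reproduced with a short Python sketch. The edge list below is not read off the drawing; it is an assumption, inferred as the graph consistent with the matrix in Figure 4.2(b) (a triangle a, b, c with c joined to d, and d joined to e and f). The function names are illustrative only. Exhaustive enumeration is exponential in the number of edges, in line with the hardness result, so the sampling estimator is what one would use on larger graphs.

```python
import random
from fractions import Fraction
from itertools import combinations

# Assumed edge list, inferred from the Figure 4.2 matrix (not taken from the drawing).
EDGES = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e"), ("d", "f")]


def connected(edges, u, v):
    """Return True if u and v lie in the same component of the graph given by edges."""
    adjacency = {}
    for x, y in edges:
        adjacency.setdefault(x, set()).add(y)
        adjacency.setdefault(y, set()).add(x)
    seen, stack = {u}, [u]
    while stack:
        x = stack.pop()
        for y in adjacency.get(x, ()):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return v in seen


def exact_connectivity(edges, u, v, p):
    """Sum the probability of every edge subset in which u and v stay connected."""
    m = len(edges)
    total = Fraction(0)
    for k in range(m + 1):
        for kept in combinations(edges, k):
            if connected(kept, u, v):
                total += p ** k * (1 - p) ** (m - k)
    return total


def sampled_connectivity(edges, u, v, p, samples, rng):
    """Monte Carlo estimate: keep each edge with probability p, count connected samples."""
    hits = 0
    for _ in range(samples):
        kept = [e for e in edges if rng.random() < p]
        hits += connected(kept, u, v)
    return hits / samples


print(exact_connectivity(EDGES, "b", "d", Fraction(1, 2)))  # 5/16
```

With 20000 samples the estimate from `sampled_connectivity` should land close to 0.3125, although the exact value depends on the random seed.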

A set of AP-MS experiments where all proteins have been tagged and used as baits yields an approximation of A(x, y) for all pairs of proteins (x, y), which can be transformed into an estimate M(x, y) of P_{G_direct}(x, y) through an appropriate normalization. We are thus interested in inferring G_direct from M:

EXACT DIRECT INTERACTION GRAPH FROM CONNECTIVITY MATRIX (E-DIGCOM)

Given: An n ×n connectivity matrix M

Find: A graph G = (V, E ) such that PG (u , v ) =M (u , v ) for each u , v ∈V .


In a more realistic setting, the connectivity matrix PG would not be observed pre-

cisely, and the E-DIGCOM problem may not admit a solution. We are thus interested in

an approximate, optimization version of the problem:

APPROXIMATE DIRECT INTERACTION GRAPH FROM CONNECTIVITY MATRIX (A-DIGCOM)

Given: An n ×n connectivity matrix M and a tolerance level 0≤δ≤ 1

Find: A graph G = (V, E ) such that the number of pairs (u , v )∈V×V such that |PG (u , v )−

M (u , v )| ≤δ is maximized.
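The A-DIGCOM objective can be written down in a few lines. The sketch below assumes that both matrices are stored as nested dictionaries and that each unordered pair {u, v} with u ≠ v is counted once; the problem statement ranges over V × V, so this is one possible convention, and the name `agreement_count` is hypothetical.

```python
def agreement_count(P, M, delta):
    """Count unordered pairs {u, v}, u != v, with |P[u][v] - M[u][v]| <= delta."""
    vertices = sorted(M)
    return sum(
        1
        for i, u in enumerate(vertices)
        for v in vertices[i + 1:]
        if abs(P[u][v] - M[u][v]) <= delta
    )


# Toy input: the candidate's connectivity P differs from M only on the pair (x, z).
M = {"x": {"y": 0.5, "z": 0.25}, "y": {"x": 0.5, "z": 0.5}, "z": {"x": 0.25, "y": 0.5}}
P = {"x": {"y": 0.5, "z": 0.4}, "y": {"x": 0.5, "z": 0.5}, "z": {"x": 0.4, "y": 0.5}}

print(agreement_count(P, M, 0.05))  # 2
print(agreement_count(P, M, 0.2))   # 3
```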

Note that although the complexity of the DIGCOM problems is currently unknown, the reverse problem of finding the connectivity matrix M given an input graph G (the network reliability problem) is well studied. Furthermore, related network design problems have been studied extensively in the computer networking community. For example, the reliable network design problem is to choose a minimal set of edges over a set of nodes so that the resulting network has at least the prescribed all-pairs terminal reliability; various algorithms including branch-and-bound heuristics [76] and genetic algorithms [38, 39] have been proposed.

4.3 Algorithm for A-DIGCOM

Our algorithm for the A-DIGCOM problem has three main phases, outlined below.

PHASE I. We start by identifying, based on the connectivity matrix M, vertices from G_direct with low degree, together with edges incident to them. As most PPI networks exhibit the properties of scale-free networks [10], this resolves the edges incident to a significant portion of the vertices (∼75%, in our networks).

PHASE II. At the other end of the spectrum, G_direct contains dense clusters that often correspond to protein complexes. We use a quasi-clique finding heuristic to identify such dense clusters from the connectivity matrix M.


Figure 4.3: The outcome of our direct PPI detection algorithm after each phase. (a) The original solution space; (b) After detecting weakly connected regions; (c) After detecting dense clusters; (d) The genetic algorithm detects the remaining interconnecting regions.

PHASE III. To infer the remainder of the network, we use a novel genetic algorithm (see Sec-

tion 2.4). This highly customized genetic algorithm makes use of the findings from

the previous two steps in order to dramatically reduce the dimension of the prob-

lem space, and to guide the mating process between parent candidates to create

good offspring solutions.

Figure 4.3 gives an example of the output after each phase of our algorithm. In what

follows, we describe each phase of the algorithm in detail.


4.3.1 Identification of Weakly Connected Regions

I-a. Finding cut edges

A cut edge in a graph G is an edge (u, v) whose removal would result in u and v belonging to two distinct connected components (e.g. edge (c, d) in Figure 4.2(a)). The following theorem allows the identification of all cut edges based on the connectivity matrix P_G.

Theorem 4.1. A pair of vertices u and v from V forms a cut edge in G if and only if the

following two conditions hold:

(i) PG (u , v ) = p

(ii) V can be partitioned into V = Vu ∪ Vv , where Vu = {x ∈ V : PG (x , u ) ≥ PG (x , v )}

and Vv = {x ∈ V : PG (x , u ) < PG (x , v )}, such that ∀s ∈ Vu and ∀t ∈ Vv , PG (s , t ) =

PG (s , u ) · p ·PG (v, t ).

Proof. Necessity is trivial. For sufficiency, suppose the conditions (i) and (ii) hold, and (u, v) is not a cut edge. Then, to keep the graph connected, there must be an edge (s, t) ≠ (u, v) joining V_u and V_v. Since (s, t) is an edge, P_G(s, t) ≥ p. However, by assumption, we have P_G(s, t) = P_G(s, u) · p · P_G(v, t) < p, which is a contradiction.

The above theorem immediately provides an efficient algorithm to test whether a pair of vertices forms a cut edge in time O(|V|²): check conditions (i) and (ii) of Theorem 4.1 directly on the connectivity matrix. Observe that removing a cut edge (u, v) allows us to decompose the graph into two connected components (the subgraphs induced by V_u and V_v, respectively), and the probability of connectivity between every pair of vertices in V_u (V_v, resp.) remains the same after removing (u, v). Therefore, the submatrices that correspond to V_u and V_v can be treated as independent subproblems, and one can recursively detect cut edges in the remaining subproblems.
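The test of Theorem 4.1 can be sketched directly on the matrix of Figure 4.2(b). The snippet below is a minimal illustration in Python using exact fractions; the helper names (`conn`, `is_cut_edge`) are hypothetical, and the recursive decomposition into submatrices is not shown. The optional `delta` argument is an assumption about how a tolerance could be threaded through the equality checks.

```python
from fractions import Fraction as F

VERTICES = "abcdef"
# Upper triangle of the connectivity matrix from Figure 4.2(b), with p = 1/2.
UPPER = {
    ("a", "b"): F(5, 8), ("a", "c"): F(5, 8), ("a", "d"): F(5, 16),
    ("a", "e"): F(5, 32), ("a", "f"): F(5, 32),
    ("b", "c"): F(5, 8), ("b", "d"): F(5, 16),
    ("b", "e"): F(5, 32), ("b", "f"): F(5, 32),
    ("c", "d"): F(1, 2), ("c", "e"): F(1, 4), ("c", "f"): F(1, 4),
    ("d", "e"): F(1, 2), ("d", "f"): F(1, 2),
    ("e", "f"): F(1, 4),
}


def conn(x, y):
    """Symmetric lookup of P_G(x, y); the diagonal is 1."""
    if x == y:
        return F(1)
    return UPPER[(x, y)] if (x, y) in UPPER else UPPER[(y, x)]


def is_cut_edge(u, v, p, delta=0):
    """Check conditions (i) and (ii) of Theorem 4.1 for the pair (u, v)."""
    if abs(conn(u, v) - p) > delta:                      # condition (i)
        return False
    side_u = [x for x in VERTICES if conn(x, u) >= conn(x, v)]
    side_v = [x for x in VERTICES if conn(x, u) < conn(x, v)]
    return all(                                          # condition (ii)
        abs(conn(s, t) - conn(s, u) * p * conn(v, t)) <= delta
        for s in side_u
        for t in side_v
    )


cut_edges = [
    (u, v)
    for i, u in enumerate(VERTICES)
    for v in VERTICES[i + 1:]
    if is_cut_edge(u, v, F(1, 2))
]
print(cut_edges)  # [('c', 'd'), ('d', 'e'), ('d', 'f')]
```

Each call inspects at most |V_u| · |V_v| pairs, which matches the O(|V|²) bound per candidate pair.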

Note that a special case of cut edges is an edge whose endpoint is a degree-1 vertex.

Because removing a cut edge does not change the connectivity between any two nodes


on the same side of the cut, this allows us to repeatedly identify and remove degree-1 ver-

tices, i.e., the corresponding rows and columns in the input matrix. Hence, if the entire

graph is assumed to be a tree, this algorithm can reconstruct the graph completely. On

the other hand, PPI networks tend to contain many cut edges due to their sparsity. Con-

sequently, this simple characterization for cut edges allows a significant simplification

of our problem by identifying all cut edges.

Theorem 4.2. If G is a tree, E-DIGCOM can be solved efficiently.

I-b. Finding degree-2 vertices

We now consider the problem of identifying degree-2 vertices from the connectivity ma-

trix M . After degree-1 vertices, which are identified in the previous step, they constitute

the next most frequent vertices in the biological networks we studied. While we do not

have a full characterization of these vertices, the following theorem gives a set of neces-

sary conditions that lead to a heuristic to predict degree-2 vertices.

Theorem 4.3. Let s be a degree-2 vertex in G such that N(s) = {u, v}. Then, the following four conditions must hold.

(i) Low connectivity: for each t ∈ V − {s}, P_G(s, t) < 2p − p².

(ii) Symmetry: P_G(s, u) = P_G(s, v).

(iii) Neighbourhood: for each t ∈ V − {s, u, v}, P_G(s, t) < P_G(s, u).

(iv) “Triangle” inequality: for each t ∈ V − {s, u, v}, P_G(s, t) < max{ P_G(u, t), P_G(v, t) }.

Before we prove this, consider the following results on graph composition. Let G1 =

(V1, E1) and G2 = (V2, E2) be two graphs. Then the following hold.

Fact 4.4. (Series composition) Suppose V1 ∩V2 = {c}, and a new graph G is constructed

by joining G1 and G2 at c . Then, for any s ∈ V1 − {c} and for any t ∈ V2 − {c}, PG (s , t ) =

PG1(s , c ) ·PG2(c , t ).


Proof. Since c is a cut vertex, PG (s , c ) = PG1(s , c ), and PG (c , t ) = PG2(c , t ). For the vertices s

and t to be connected, all of s , c , t must be in the same connected component. Because

every edge removal is independent, we have PG (s , t ) = PG1(s , c ) ·PG2(c , t ).

Fact 4.5. (Parallel composition) Suppose V1∩V2 = {s , t }, and a new graph G is constructed

by joining G1 and G2 at s and t (possibly leading to parallel edges between s and t ). Then,

PG (s , t ) = PG1(s , t )+PG2(s , t )−PG1(s , t ) ·PG2(s , t ).

Proof. The event that s and t are not connected is when s and t are disconnected in

both G1 and G2. Thus 1−PG (s , t ) = (1−PG1(s , t )) · (1−PG2(s , t )) = 1−PG1(s , t )−PG2(s , t )+

PG1(s , t ) ·PG2(s , t ).
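As a quick check of both facts, consider a triangle on the vertices a, b, c with p = 1/2. This is the subgraph that the a, b, c entries of the matrix in Figure 4.2(b) are consistent with; the edge set is inferred from the matrix and should be read as an assumption. The path a, c, b is a series composition of two edges, and putting it in parallel with the edge (a, b) gives:

```latex
\begin{align*}
P_{a\text{-}c\text{-}b}(a,b) &= p \cdot p = \tfrac{1}{4}
  && \text{(Fact 4.4, series)} \\
P_G(a,b) &= p + \tfrac{1}{4} - p \cdot \tfrac{1}{4}
          = \tfrac{1}{2} + \tfrac{1}{4} - \tfrac{1}{8} = \tfrac{5}{8}
  && \text{(Fact 4.5, parallel)}
\end{align*}
```

The value 5/8 agrees with the entry P(a, b) listed in Figure 4.2(b).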

Now we are ready to prove Theorem 4.3.

Proof. We prove each condition separately.

Condition (i) Low connectivity. Since s has degree 2, it becomes disconnected from the rest of the graph with probability (1 − p)². Thus, P_G(s, t) ≤ 1 − (1 − p)² = 2p − p². The equality can only hold if P_G(s, u) = P_G(s, v) = 1, which is impossible.

Condition (ii) Symmetry. Let P_{G−{(s,u)}}(s, u) denote the connectivity of s and u when edge (s, u) is removed from E. Then, we have:

\begin{align*}
P_G(s,u) &= p + P_{G-\{(s,u)\}}(s,u) - p \cdot P_{G-\{(s,u)\}}(s,u) && \text{(by Fact 4.5)} \\
 &= p + p \cdot P_{G-\{(s,u),(s,v)\}}(v,u) - p^2 \cdot P_{G-\{(s,u),(s,v)\}}(v,u) && \text{(by Fact 4.4)} \\
 &= p + p \cdot P_{G-\{(s,u),(s,v)\}}(u,v) - p^2 \cdot P_{G-\{(s,u),(s,v)\}}(u,v) \\
 &= p + P_{G-\{(s,v)\}}(s,v) - p \cdot P_{G-\{(s,v)\}}(s,v) && \text{(by Fact 4.4)} \\
 &= P_G(s,v) && \text{(by Fact 4.5)}
\end{align*}


Condition (iii) Neighbourhood. Next, we show that for any t ∈ V − {s, u, v}, P_G(s, t) < P_G(s, u). For a subgraph H ⊆ G, let p(H) denote the probability of observing H from a probabilistic graph G. We can write this probability as

\[ p(H) = p^{|E(H)|} \cdot (1-p)^{|E(G)| - |E(H)|} \]

Thus, for any two subgraphs H_i, H_j ⊆ G such that |E(H_i)| = |E(H_j)|, p(H_i) = p(H_j). Let H(s, t) be the set of subgraphs of G where s and t are connected. We can then write the probabilities P_G(s, t) and P_G(s, u) as the sum of the probabilities of these subgraphs.

\[ P_G(s,t) = \sum_{H_i \in H(s,t)} p(H_i), \qquad P_G(s,u) = \sum_{H_j \in H(s,u)} p(H_j) \]

Observe that these two sets of subgraphs may overlap:

\[ H(s,t) = H(s,t,u) \cup H(s,t,\bar{u}), \qquad H(s,u) = H(s,t,u) \cup H(s,\bar{t},u) \]

where H(s, t, u) is the set of subgraphs such that s, t, and u are all connected, while H(s, t, ū) is the set of subgraphs where s, t belong to the same connected component that does not contain u (and H(s, t̄, u) is defined analogously, with s, u in a component that does not contain t). Thus, we now focus on H(s, t, ū) and H(s, t̄, u). The subgraphs in H(s, t̄, u) can be partitioned as follows, depending on which component v belongs to:

(i) {s, u, v}, {t}: Vertices s, u, v belong to the same component, and t belongs to another component.

(ii) {s, u}, {v}, {t}: Vertices s, u belong to the same component, and each of v and t belongs to a distinct component.

(iii) {s, u}, {v, t}: Vertices s, u belong to the same component, and v, t belong to another component.

It is easy to see that the cases (i) and (ii) are nonempty, since (s, u), (s, v) ∈ E(G).

Finally, consider the set of subgraphs in H(s, t, ū). Let H be a subgraph in this set. Since s and u belong to distinct components, (s, u) ∉ E(H), and thus (s, v) ∈ E(H) in order to have s and t connected. We then perform the following operation on H to construct H′: (1) remove (s, v) from H, and (2) insert (s, u). Then, H′ has s and u in the same component, and v and t in another component. Therefore, H′ belongs to Case (iii) of H(s, t̄, u). Furthermore, note that H and H′ have the same number of edges, and that H can be recovered from H′ by reversing the operation. Therefore, there is an injective mapping from subgraphs in H(s, t, ū) to subgraphs in Case (iii) of H(s, t̄, u) with an equal number of edges. Since the cases (i) and (ii) are nonempty, it follows that P_G(s, t) < P_G(s, u).

Condition (iv) “Triangle” inequality. Given the probabilistic graph G, we partition the subgraphs of G into four cases depending on the existence of the two edges (s, u) and (s, v).

• (s, u), (s, v) ∈ E: occurs with probability p².

• (s, u) ∈ E, (s, v) ∉ E: occurs with probability p(1 − p).

• (s, u) ∉ E, (s, v) ∈ E: occurs with probability p(1 − p).

• (s, u), (s, v) ∉ E: occurs with probability (1 − p)².

We can thus rewrite the probability P_G(s, t) as follows.

\[ P_G(s,t) = p(1-p) \cdot P_{G-\{s\}}(u,t) + p(1-p) \cdot P_{G-\{s\}}(v,t) + p^2 \cdot P_{G-\{s\}}(u \text{ or } v, t) \tag{4.1} \]

where P_{G−{s}}(u or v, t) denotes the probability that u or v is connected to t in the graph G − {s}. We write P_G(u, t) and P_G(v, t) similarly:

\[ P_G(u,t) = (1-p^2) \cdot P_{G-\{s\}}(u,t) + p^2 \cdot P_{G-\{s\}}(u \text{ or } v, t) \tag{4.2} \]

\[ P_G(v,t) = (1-p^2) \cdot P_{G-\{s\}}(v,t) + p^2 \cdot P_{G-\{s\}}(u \text{ or } v, t) \tag{4.3} \]

Subtracting (4.1) from (4.2), and (4.1) from (4.3) gives:

\[ P_G(u,t) - P_G(s,t) = (1-p) \cdot P_{G-\{s\}}(u,t) - p(1-p) \cdot P_{G-\{s\}}(v,t) \]

\[ P_G(v,t) - P_G(s,t) = (1-p) \cdot P_{G-\{s\}}(v,t) - p(1-p) \cdot P_{G-\{s\}}(u,t) \]

If P_{G−{s}}(u, t) > p · P_{G−{s}}(v, t), it follows that P_G(u, t) > P_G(s, t). Otherwise, P_{G−{s}}(u, t) ≤ p · P_{G−{s}}(v, t) < P_{G−{s}}(v, t), so p · P_{G−{s}}(u, t) < P_{G−{s}}(v, t), and it follows that P_G(v, t) > P_G(s, t). This completes the proof.

These necessary conditions allow us to rule out vertices that cannot have degree 2. As a result, Theorem 4.3 provides a heuristic (Algorithm 4.3.1) for predicting the set of degree-2 vertices as well as the edges incident to them. It is easy to see that Algorithm 4.3.1 runs in time O(|V|²), and in practice, our studies have shown that vertices satisfying these conditions while having degree higher than two are rare (see Table 4.1).

Unlike the case of degree-1 vertices, we cannot simply remove degree-2 vertices

without affecting the remaining entries in the matrix. Therefore, as shown in Algo-

rithm 4.3.1, degree-2 vertices and their incident edges are simply marked as such in the

solution, but are not removed from the input matrix.

Algorithm 4.3.1: Prediction of degree-2 vertices

Input: Probability matrix M and vertex set V
Output: Degree-2 vertices V^2 and their incident edges E^2

foreach vertex s ∈ V do
    if s satisfies the conditions in Theorem 4.3 then
        V^2 ← V^2 ∪ {s};
        u, v ← the two vertices with M_{s,u} = M_{s,v} = max{M_{s,t} : t ∈ V − {s}};
        E^2 ← E^2 ∪ {(s, u), (s, v)};
    end
end
Return V^2 and E^2
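Algorithm 4.3.1 can be sketched in Python on the matrix of Figure 4.2(b). This is an illustration under simplifying assumptions, not the thesis implementation: comparisons are exact (no tolerance δ), degree-1 vertices are not removed beforehand, and the names `conn` and `predict_degree2` are hypothetical.

```python
from fractions import Fraction as F

VERTICES = "abcdef"
# Upper triangle of the connectivity matrix from Figure 4.2(b), with p = 1/2.
UPPER = {
    ("a", "b"): F(5, 8), ("a", "c"): F(5, 8), ("a", "d"): F(5, 16),
    ("a", "e"): F(5, 32), ("a", "f"): F(5, 32),
    ("b", "c"): F(5, 8), ("b", "d"): F(5, 16),
    ("b", "e"): F(5, 32), ("b", "f"): F(5, 32),
    ("c", "d"): F(1, 2), ("c", "e"): F(1, 4), ("c", "f"): F(1, 4),
    ("d", "e"): F(1, 2), ("d", "f"): F(1, 2),
    ("e", "f"): F(1, 4),
}


def conn(x, y):
    """Symmetric lookup of M(x, y); the diagonal is 1."""
    if x == y:
        return F(1)
    return UPPER[(x, y)] if (x, y) in UPPER else UPPER[(y, x)]


def predict_degree2(p):
    """Return (V2, E2): vertices passing the four tests of Theorem 4.3 and their edges."""
    bound = 2 * p - p * p
    V2, E2 = set(), set()
    for s in VERTICES:
        others = sorted((t for t in VERTICES if t != s),
                        key=lambda t: conn(s, t), reverse=True)
        u, v, rest = others[0], others[1], others[2:]
        passes = (
            all(conn(s, t) < bound for t in others)              # (i) low connectivity
            and conn(s, u) == conn(s, v)                         # (ii) symmetry
            and all(conn(s, t) < conn(s, u) for t in rest)       # (iii) neighbourhood
            and all(conn(s, t) < max(conn(u, t), conn(v, t))     # (iv) "triangle"
                    for t in rest)
        )
        if passes:
            V2.add(s)
            E2.update({frozenset((s, u)), frozenset((s, v))})
    return V2, E2


V2, E2 = predict_degree2(F(1, 2))
print(sorted(V2))  # ['a', 'b']
```

On this matrix, vertex c passes the first three tests and is rejected only by condition (iv), since P(c, d) = 1/2 exceeds both P(a, d) and P(b, d).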


4.3.2 Identification of Densely Connected Regions

We now turn to the problem of finding densely connected regions in the network. These

regions may correspond to protein complexes, where tagging any one of the members

of the complex results in the identification of all other members with high probability.

While correctly predicting the physical interactions within each complex is a difficult

task, separating these dense regions from the remainder of the network is key to im-

proving the accuracy of the genetic algorithm (Phase III).

In order to build an algorithm for detecting dense regions, we consider the connec-

tivity between nodes in an n-clique Kn , which we denote as CliqueConn(n). First, let us

recall a classical result in graph enumeration.

Lemma 4.6. [62, 67] Let γ_n denote the number of connected simple graphs over n labeled vertices. We can count this number by the recurrence:

\[
\gamma_n =
\begin{cases}
1 & \text{if } n = 1, \\[4pt]
2^{\binom{n}{2}} - \dfrac{1}{n} \displaystyle\sum_{i=1}^{n-1} i \binom{n}{i} \cdot 2^{\binom{n-i}{2}} \cdot \gamma_i & \text{otherwise.}
\end{cases}
\]

Using this combinatorial result, we can construct the formula for CliqueConn(n) when p = 1/2, which is the value we used for our analyses. The formula for other values of p can be derived using a similar proof.

Lemma 4.7. Let G be a complete, probabilistic graph on n vertices with p = 1/2. Then, for any two vertices in G, the probability that they are connected is:

\[ \mathrm{CliqueConn}(n) = \sum_{i=2}^{n} \binom{n-2}{i-2} \frac{\gamma_i}{2^{\binom{n}{2} - \binom{n-i}{2}}} \]

Proof. We write the probability as follows.

\[ \mathrm{CliqueConn}(n) = \frac{\sigma_2 + \sigma_3 + \ldots + \sigma_n}{2^{\binom{n}{2}}} \]

where σ_i denotes the number of subgraphs of K_n in which the connected component containing the two given vertices u and v has exactly i vertices. To compute σ_i, there are (n−2 choose i−2) ways to pick i − 2 vertices (in addition to u and v) for the connected component, γ_i ways to keep them all connected, and finally, 2^(n−i choose 2) ways to assign edges among the remaining vertices. So we have

\[ \sigma_i = \binom{n-2}{i-2} \cdot \gamma_i \cdot 2^{\binom{n-i}{2}}, \]

and finally, we have

\[ \mathrm{CliqueConn}(n) = \sum_{i=2}^{n} \binom{n-2}{i-2} \frac{\gamma_i}{2^{\binom{n}{2} - \binom{n-i}{2}}} \]

Algorithm 4.3.2: Detecting dense clusters in the network

Input: Probability matrix M, and minimum cluster size k
Output: Set of possible k-cliques in G_direct

t ← CliqueConn(k)
G_t ← (V, E_t), where E_t = {(u, v) : u, v ∈ V and M(u, v) ≥ t}
foreach connected component S ⊆ V of G_t do
    Find cliques of size k in S
end
Return cliques discovered
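The recurrence of Lemma 4.6 and the clique connectivity for p = 1/2 can be evaluated exactly with a short Python sketch. It counts, for each component size i, the ways to choose the i − 2 extra vertices from the n − 2 vertices other than u and v, the γ_i connected graphs on the component, and the free edges among the remaining n − i vertices. The function names are illustrative.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def gamma(n):
    """Number of connected simple graphs on n labelled vertices (Lemma 4.6)."""
    if n == 1:
        return 1
    correction = sum(
        i * comb(n, i) * 2 ** comb(n - i, 2) * gamma(i) for i in range(1, n)
    )
    return 2 ** comb(n, 2) - correction // n  # the sum is always divisible by n


def clique_conn(n):
    """Probability that two fixed vertices of K_n are connected when p = 1/2."""
    favourable = sum(
        comb(n - 2, i - 2) * gamma(i) * 2 ** comb(n - i, 2) for i in range(2, n + 1)
    )
    return Fraction(favourable, 2 ** comb(n, 2))


print([gamma(n) for n in range(1, 6)])  # [1, 1, 4, 38, 728]
print(clique_conn(3))                   # 5/8
print(clique_conn(4))                   # 3/4
```

The value for n = 3 agrees with the 5/8 entries among a, b, c in Figure 4.2(b), which is a convenient sanity check.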

Lemma 4.7 gives rise to a heuristic (see Algorithm 4.3.2) that finds a set of dense regions covering the network. While the algorithm is guaranteed to identify all cliques of size k′ ≥ k contained within G_direct, sets of vertices that do not form a k-clique may also be reported, provided that they are sufficiently connected among themselves, possibly via vertices outside the dense cluster. However, for sufficiently large values of k, we found this to be a very rare occurrence. While finding cliques in a graph is a computationally intensive task in general, the construction of G_t (in Algorithm 4.3.2) for large values of k creates only a few small connected components and leaves the remaining vertices isolated. Therefore, in practice, Algorithm 4.3.2 can be implemented to run in a reasonable amount of time (see Table 4.5(ii)).
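A brute-force version of Algorithm 4.3.2 fits in a few lines of Python. The sketch below is run on the matrix of Figure 4.2(b) and takes the threshold t as an argument; in the algorithm it would be CliqueConn(k), which equals 5/8 for k = 3 and p = 1/2. Enumerating all k-subsets of a component is an assumption made for brevity and is only practical when the components of G_t are small.

```python
from fractions import Fraction as F
from itertools import combinations

VERTICES = "abcdef"
# Upper triangle of the connectivity matrix from Figure 4.2(b), with p = 1/2.
UPPER = {
    ("a", "b"): F(5, 8), ("a", "c"): F(5, 8), ("a", "d"): F(5, 16),
    ("a", "e"): F(5, 32), ("a", "f"): F(5, 32),
    ("b", "c"): F(5, 8), ("b", "d"): F(5, 16),
    ("b", "e"): F(5, 32), ("b", "f"): F(5, 32),
    ("c", "d"): F(1, 2), ("c", "e"): F(1, 4), ("c", "f"): F(1, 4),
    ("d", "e"): F(1, 2), ("d", "f"): F(1, 2),
    ("e", "f"): F(1, 4),
}


def conn(x, y):
    """Symmetric lookup of M(x, y)."""
    return UPPER[(x, y)] if (x, y) in UPPER else UPPER[(y, x)]


def dense_clusters(k, t):
    """Threshold M at t, then list every k-clique inside each component of G_t."""
    adjacent = {
        x: {y for y in VERTICES if y != x and conn(x, y) >= t} for x in VERTICES
    }
    cliques, seen = [], set()
    for root in VERTICES:
        if root in seen:
            continue
        component, stack = {root}, [root]
        while stack:  # depth-first search for the component of root in G_t
            x = stack.pop()
            for y in adjacent[x] - component:
                component.add(y)
                stack.append(y)
        seen |= component
        for group in combinations(sorted(component), k):
            if all(b in adjacent[a] for a, b in combinations(group, 2)):
                cliques.append(group)
    return cliques


print(dense_clusters(3, F(5, 8)))  # [('a', 'b', 'c')]
```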

Based on the connectivity matrix M, our algorithm identifies (possibly overlapping) clusters of proteins of size at least k such that, for every pair u, v in each cluster, M(u, v) ≥ t_k for some threshold t_k. For appropriately chosen values of k and t_k, the discovered clusters correspond to cliques in G_direct with high accuracy (see Section 4.4.6).

The dense regions discovered at this phase provide us with (1) the set of edges within

each dense region; and (2) sparse cuts between disjoint dense regions. The edge set

within each cluster will be used for initial candidates in the genetic algorithm, whereas

the cuts defined by the clusters will be used as crossover points during the crossover

operation in the genetic algorithm.

4.3.3 Cut-based Genetic Algorithm

To predict the remaining section of the network, we use a customized genetic algorithm

that aims at finding an optimal solution to the A-DIGCOM problem. We first devise a

solution to a generalization of the A-DIGCOM problem, and then show how the results

of Phase I and Phase II of the algorithm are used to improve the performance.

Genetic algorithms have been shown to be an effective family of heuristics for a

wide variety of optimization problems [64], including network design under connec-

tivity constraints [38, 39]. A genetic algorithm models a set of candidate solutions as in-

dividuals of a population. From this population, pairs of promising candidate solutions

are mated, and their offspring solutions inherit properties of the parents with some ran-

dom mutations. Over generations, this process of natural selection improves the fitness

of the population.

The A-DIGCOM problem is a hard optimization problem, because (i) the size of the search space is huge, namely 2^(n choose 2) for a graph of size n, and (ii) there is no known polynomial-time algorithm to evaluate a proposed candidate solution (i.e. compute P_G from G). For these reasons, a straightforward genetic algorithm implementation failed to produce satisfactory results (data not shown). Instead, we use a more sophisticated approach by making use of the results obtained in previous sections in order to reduce the search space and to guide the mating operations for more effective search.

The genetic algorithm aims to solve a generalization of A-DIGCOM. First, we allow each edge (u, v) in the network to survive with a non-uniform probability p(u, v), instead of one constant probability p over all edges. Secondly, we assume that we are given two sets of edges E_YES and E_NO that indicate the set of edges that are guaranteed to be in the solution, and guaranteed not to be in the solution, respectively. This will later allow us to factor in the outcomes of the previous sections. Therefore, the edges whose presence remains to be determined are those in neither set, that is, E_MAYBE is the complement of E_YES ∪ E_NO among all vertex pairs.

Encoding of candidate solutions. To represent a candidate solution, we first create a hash table that maps each putative edge in E_MAYBE to an integer. Each candidate is then encoded as a list of integers (edges). Edges in E_YES, which are part of all solutions, are not explicitly listed, in order to save space. Since the networks we consider are sparse (|E| = O(|V|)), such an encoding technique significantly reduces the space requirements.

Initial population. The initial population of candidates is generated using a preferential attachment model [10] using the following observations: (i) The average connectivity of vertex u, avgCon(u) = (1/|V|) · Σ_{v ∈ V−{u}} M(u, v), is strongly positively correlated with the degree of u in G_direct; (ii) the age of a vertex, measured by when the vertex was introduced to the graph, is positively correlated with the degree of the vertex. Therefore, during the generation of each candidate, we choose the next vertex to be added with probability proportional to its average connectivity. This results in a candidate solution where the degree of most vertices is likely to be close to their true degree in G_direct. Furthermore, in order to create candidates that are clustered similarly to the true direct interaction graph, we include the set of edges predicted by Algorithm 4.3.2 in each initial candidate.


Fitness function. The fitness of a candidate solution G, fitness(G), is obtained by first estimating the probability matrix P_G using 500 Monte Carlo samples. To compute these samples, we keep each edge independently with probability p (and break it otherwise), and count the number of times each pair of vertices remains connected. Given 500 samples, we then count the number of vertex pairs (u, v) whose estimated connectivity P_G(u, v) is within the tolerance level δ, i.e., M(u, v) ± δ (see Section 4.4.2 on how to choose δ).
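The fitness evaluation can be sketched as follows. This is an illustrative Python version under several assumptions: a uniform survival probability p (the genetic algorithm described here also allows per-edge probabilities), a union-find structure to label components in each sample, unordered pairs counted once, and an edge list for Figure 4.2 that is inferred from its matrix. All names are hypothetical.

```python
import random


def sample_components(vertices, edges, p, rng):
    """Keep each edge with probability p; return a vertex -> component label map."""
    parent = {x: x for x in vertices}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        if rng.random() < p:
            parent[find(u)] = find(v)
    return {x: find(x) for x in vertices}


def fitness(vertices, edges, M, p, delta, samples=500, seed=0):
    """Number of pairs whose sampled connectivity is within delta of M."""
    rng = random.Random(seed)
    pairs = [(u, v) for i, u in enumerate(vertices) for v in vertices[i + 1:]]
    hits = dict.fromkeys(pairs, 0)
    for _ in range(samples):
        label = sample_components(vertices, edges, p, rng)
        for u, v in pairs:
            hits[u, v] += label[u] == label[v]
    return sum(abs(hits[u, v] / samples - M[u, v]) <= delta for u, v in pairs)


VERTICES = "abcdef"
# Assumed edge list, inferred from the Figure 4.2 matrix.
EDGES = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e"), ("d", "f")]
# Target matrix M: the entries of Figure 4.2(b) as floats.
M = {
    ("a", "b"): 0.625, ("a", "c"): 0.625, ("a", "d"): 0.3125,
    ("a", "e"): 0.15625, ("a", "f"): 0.15625,
    ("b", "c"): 0.625, ("b", "d"): 0.3125, ("b", "e"): 0.15625, ("b", "f"): 0.15625,
    ("c", "d"): 0.5, ("c", "e"): 0.25, ("c", "f"): 0.25,
    ("d", "e"): 0.5, ("d", "f"): 0.5,
    ("e", "f"): 0.25,
}

print(fitness(VERTICES, [], M, 0.5, 0.25))  # 7: the empty graph matches only entries <= 0.25
```

With the true edge list and a tolerance near 0.04, most of the 15 pairs should be matched, though the exact count depends on the random seed.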

Crossover. The crossover operation needs to hybridize two parent candidates to produce offspring that preserve the good properties of the parents. This operation will be guided by a randomly chosen balanced cut V = V1 ∪ V2. Let G1 and G2 denote the two parent networks and let E1(G_i) and E2(G_i) denote the edges of G_i such that both endpoints lie in V1 and V2, respectively. Furthermore, let E1,2(G_i) denote the edges of G_i that cross between V1 and V2. Mating G1 and G2 results in two children G′ = (V, E′) and G′′ = (V, E′′) such that:

E ′ = E1(G1)∪E2(G2)∪ (E1,2(G1)∩E1,2(G2))

E ′′ = E2(G1)∪E1(G2)∪ (E1,2(G1)∩E1,2(G2))
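The two formulas translate directly into set operations. In the Python sketch below, edges are represented as two-element frozensets and the cut is given by the vertex set V1; this representation and the name `crossover` are assumptions made for illustration.

```python
def crossover(edges1, edges2, side1):
    """Return the children E' and E'' of parents edges1, edges2 for the cut (side1, rest)."""

    def split(edges):
        inside1 = {e for e in edges if e <= side1}         # both endpoints in V1
        inside2 = {e for e in edges if not (e & side1)}    # both endpoints in V2
        crossing = edges - inside1 - inside2               # one endpoint on each side
        return inside1, inside2, crossing

    a1, a2, a12 = split(edges1)
    b1, b2, b12 = split(edges2)
    shared_crossing = a12 & b12
    child1 = a1 | b2 | shared_crossing   # E'  = E1(G1) ∪ E2(G2) ∪ (E1,2(G1) ∩ E1,2(G2))
    child2 = a2 | b1 | shared_crossing   # E'' = E2(G1) ∪ E1(G2) ∪ (E1,2(G1) ∩ E1,2(G2))
    return child1, child2


def edge(x, y):
    return frozenset((x, y))


g1 = {edge(1, 2), edge(2, 3), edge(3, 4)}
g2 = {edge(2, 3), edge(1, 4)}
c1, c2 = crossover(g1, g2, {1, 2})
print(sorted(sorted(e) for e in c1))  # [[1, 2], [2, 3]]
print(sorted(sorted(e) for e in c2))  # [[2, 3], [3, 4]]
```

Only crossing edges present in both parents survive, which is why the edge (1, 4) of the second parent appears in neither child.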

While choosing a random cut as the crossover point is a reasonable strategy to construct a new pair of offspring, our studies have shown that a planned strategy in choosing the crossover points results in better performance with less chance of premature convergence. In particular, if the crossover point is chosen as a dense cut in the parent networks, then the connectivity among vertices within each partition deteriorates significantly. This results in offspring with much poorer fitness than their parents. On the other hand, if the parents are hybridized at a sparse cut, the connectivity among vertices within each partition is disrupted much less. Therefore, crossover operations are best done by selecting sparse balanced cuts (|V1| ≈ |V2|). Finding sparse balanced cuts is a well-studied problem in combinatorial optimization, for which various approximation algorithms exist [96, 140]. However, these algorithms assume that the graph itself, not the connectivity matrix M, is given as input. We therefore use a simple heuristic that avoids cutting through the dense regions of the network. To generate these sparse cuts, we contract each dense region identified in Algorithm 4.3.2 to a single vertex, and then generate a weighted (by the number of vertices in each dense region) balanced partition of the vertices at random.

Mutation. In order to introduce variability to the population of candidates, a small number of edges (5 to 10%) are randomly inserted or deleted. Moreover, observe that the child network constructed as above may not remain connected. Aside from the random mutation, therefore, we employ a simple local search that greedily adds edges to keep the network connected.

Genetic algorithm parameter selection. The various parameters of the genetic algorithm were selected based on the resulting performance on the Yu et al. data set. The two main parameters that affect the performance significantly are the population size and the selection criterion. We tested several selection criteria, each defined by the probability of choosing a candidate as a parent. The best compromise between running time and accuracy was obtained using a population size of 500 and a selection probability for a parent proportional to fitness(c_i) − minFit, where minFit is the fitness of the worst candidate in the population.

4.3.4 Restricting the Solution Space for GA

While our genetic algorithm offers a plausible method for the A-DIGCOM problem, one can reduce the size of the solution space, which typically results in faster convergence to better solutions, using the results in Theorems 4.1 and 4.3. First, recall that finding all cut edges decomposes the problem into independent subproblems on 2-edge-connected components. Second, the identification of degree-2 vertices defines two sets of edges E_YES and E_NO that constitute all putative edges incident to the identified degree-2 vertices. In other words, E_MAYBE forms the subgraph of G induced by the set V^{3+} of vertices with degree ≥ 3.

Furthermore, observe that the edges in E_YES form parallel paths between vertices in V^{3+}. A classical result in network reliability (see Fact 4.5) suggests that these parallel paths can be merged into a single meta-edge whose reliability can be efficiently computed. To be more formal, consider the set

{P_1(u, v), P_2(u, v), . . . , P_k(u, v)}

which is the set of paths between u and v in E_YES. These paths can then be replaced by a single edge (u, v) with survival probability

\[ p(u,v) = 1 - \prod_{i=1}^{k} \left(1 - p^{|P_i(u,v)|}\right). \]

By merging every set of parallel paths, we obtain a compact network over V^{3+} that efficiently encodes the edges in E_YES. Since our genetic algorithm handles the case where the edge survival probability is non-uniform, this compact encoding results in substantial gains in the running time for estimating the fitness of the candidates, as well as in time and space requirements for handling large population sizes. In our applications, this allows us to remove approximately 70 to 75% of the original set of vertices.
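The meta-edge survival probability is a one-line computation. The Python sketch below takes the lengths (numbers of edges) of the parallel paths between u and v and a uniform per-edge survival probability p; the function name is hypothetical.

```python
from fractions import Fraction
from math import prod


def meta_edge_probability(path_lengths, p):
    """Survival probability of the meta-edge replacing parallel paths of the given lengths."""
    return 1 - prod(1 - p ** length for length in path_lengths)


# A direct edge in parallel with a two-edge path, p = 1/2.
print(meta_edge_probability([1, 2], Fraction(1, 2)))  # 5/8
```

A direct edge in parallel with a two-edge path is exactly the situation of two vertices of a triangle, so the result 5/8 matches the triangle entries of Figure 4.2(b).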


In this section, we provide the details on how our algorithm is tested on various datasets.

Then, the accuracy of the predicted direct PPI network is evaluated using simulated

data. Finally, our algorithm is tested on a well-known yeast AP-MS dataset to compare

against PPI datasets from other sources.


4.4.1 Randomized Hill-climbing Algorithm

For performance comparisons, we tested our algorithm against a simple randomized hill-climbing approach. In this approach, we start with a randomly chosen spanning tree G_1 of the vertex set V. At the i-th iteration, we first sample the connectivity probability P_{G_i} of G_i, using Monte Carlo simulation. Then, we randomly pick a vertex pair u, v with probability

\[ \frac{D(u,v)}{\sum_{i,j \in V} D(i,j)}, \]

where D(u, v) = |M(u, v) − P_{G_i}(u, v)|. If the selected pair u, v is joined by an edge in G_i, but M(u, v) < P_{G_i}(u, v) (the candidate is too well connected), then we remove (u, v) from G_i. On the other hand, if u and v are not joined by an edge, but M(u, v) > P_{G_i}(u, v), then we add (u, v) to G_i. We repeat this local optimization heuristic while making sure the candidate solution remains connected.

4.4.2 Choosing a Tolerance Level δ and Handling Numerical Errors

In order to deal with numerical errors from the Monte Carlo sampling, we use an additive tolerance level δ. Note that the sampling process for estimating the probability matrix P_G is a binomial process, which, by the central limit theorem, is closely approximated by a normal distribution. The confidence interval is largest when the estimated probability is equal to 0.5, in which case we obtain a confidence interval of

\[ \hat{p} \pm \delta = \hat{p} \pm \frac{z_{1-\alpha/2}}{2\sqrt{n}}, \]

where p̂ denotes the fraction of samples where the two vertices are connected after n samples, and z_{1−α/2} is the z-value for the desired level of confidence. Using this formula, we conclude:


1. When $n = 20000$ (used to obtain the input matrix $M$ by sampling the test networks), we obtain a 95% confidence interval of size at most $2\delta = 2 \cdot 0.007 = 0.014$.

2. When $n = 500$ (used to obtain the connectivity matrix by sampling each candidate solution in our genetic algorithm), the 95% confidence interval is of size at most $2\delta = 2 \cdot 0.04 = 0.08$.
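As a quick numerical check of the two bounds above, the following minimal sketch evaluates the worst-case half-width $\delta = z_{1-\alpha/2} / (2\sqrt{n})$, that is, the case where the estimated probability is 0.5. The function name `tolerance` is hypothetical and only serves this illustration.

```python
from math import sqrt
from statistics import NormalDist


def tolerance(n_samples, confidence=0.95):
    """Worst-case half-width (estimated probability 0.5) of the normal
    approximation to a binomial proportion estimated from n_samples."""
    # Two-sided z-value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z / (2 * sqrt(n_samples))


for n in (20000, 500):
    print(n, round(tolerance(n), 4))
```

For $n = 20000$ this gives a half-width just under 0.007, and for $n = 500$ a half-width of roughly 0.04, in line with the values quoted above.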

With the chosen tolerance level $\delta$, we modify our algorithm as appropriate each time we compare two connectivity probabilities. For example, in Theorem 4.1, the first condition $P_G(u,v) = p$ is modified to $P_G(u,v) \in [p - \delta, p + \delta]$; and in Theorem 4.3, we modify the first condition $P_G(s,t) < 2p - p^2$ to $P_G(s,t) < 2p - p^2 + \delta$.

4.4.3 Generation of Scale-free Networks

In order to generate artificial scale-free networks, we used two generation models: the preferential attachment model and the duplication model. In the preferential attachment model, we evaluated the degree distribution of the two biological networks ($G_{\mathrm{Yu}}$ and $G_{\mathrm{DIP}}$) and used the Barabási-Albert algorithm to construct a scale-free network with attachment factor 1.5 (each iteration adds a new vertex with 1 to 2 edges attached to existing vertices). In the duplication model, at each iteration, we randomly pick a vertex to duplicate with probability proportional to its degree and randomly drop the duplicated edges with probability 0.5 in order to fit the degree distributions and sparsity of biological networks. While various other random graph models have been proposed for PPI networks, these scale-free network models provided a sufficiently diverse initial population for our genetic algorithm.
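The duplication step described above can be sketched as follows. This is an illustrative reading of the model rather than the implementation used in our experiments: the seed graph (a single edge), the rule that a new vertex keeps at least one edge, and the name `duplication_model` are all assumptions made for this example.

```python
import random


def duplication_model(n_vertices, drop_prob=0.5, seed=None):
    """Grow a graph by degree-proportional vertex duplication (sketch)."""
    rng = random.Random(seed)
    # Assumed seed graph: one edge between vertices 0 and 1.
    adj = {0: {1}, 1: {0}}
    while len(adj) < n_vertices:
        vertices = list(adj)
        degrees = [len(adj[v]) for v in vertices]
        # Pick the vertex to duplicate with probability proportional to degree.
        source = rng.choices(vertices, weights=degrees, k=1)[0]
        # Copy each edge of the source, dropping it with probability drop_prob.
        neighbours = sorted(adj[source])
        kept = [w for w in neighbours if rng.random() >= drop_prob]
        if not kept:
            # Assumption: keep one edge so that the new vertex is not isolated.
            kept = [rng.choice(neighbours)]
        new = len(adj)
        adj[new] = set(kept)
        for w in kept:
            adj[w].add(new)
    return adj


graph = duplication_model(1000, seed=0)
print(len(graph), sum(len(nbrs) for nbrs in graph.values()) // 2)
```

The adjacency-set representation keeps both the degree-proportional sampling and the edge copying simple; a real experiment would additionally tune the drop probability against the observed degree distribution.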

4.4.4 Calculation of Connectivity Matrix from Peptide Counts

The peptide count of a prey protein in an AP-MS experiment is the number of different peptides that have been observed by mass spectrometry for that protein. We note that the peptide counts are biased towards preys with longer protein sequences, and to rectify this propensity, we normalized the abundance data by the protein sequence lengths to obtain the abundance ratios $R(i,j)$. In order to turn the normalized abundance ratios into the connectivity matrix for our probabilistic graph model, we used a simple logistic function

$$M(i,j) = \frac{1}{1 + e^{-\frac{R(i,j) - \alpha}{\beta}}},$$

where the parameters $\alpha, \beta$ are chosen so that the computed distribution of $p$ fits the simulated connectivity distribution of $G_{\mathrm{Yu}}$, using a $\chi^2$ test ($\alpha = 2.8921$, $\beta = -0.6318$). In the cases where $R(i,j)$ differs from $R(j,i)$, we choose the average of the two entries to symmetrize the matrix.

4.4.5 Model Validation

To test our approach, we first sought to validate our model of AP-MS data. To this end, we used one of the most comprehensive yeast AP-MS networks, obtained by Krogan et al. [93]. The dataset reports the Mascot score [114] and the number of peptides detected for each bait-prey pair. The complete set of interactions reported contains 2186 proteins and 5496 interactions (Krogan et al. Table S6); we call the resulting network $G_{\mathrm{KroganFull}}$. The authors identified a subset of these interactions as high-confidence, based on their Mascot scores (Krogan et al. Table S5). We call this set of high-confidence interactions $G_{\mathrm{KroganHigh}}$; this network consists of 1210 proteins and 2357 interactions. We expect that $G_{\mathrm{KroganHigh}}$ is relatively rich in direct interactions, whereas the complete set of interactions $G_{\mathrm{KroganFull}}$ consists in part of indirect interactions.

Considering $G_{\mathrm{KroganHigh}}$ as a direct interaction network, we simulated Monte Carlo sampling to estimate $P_{G_{\mathrm{KroganHigh}}}$, using $p = 0.5$ and 50,000 samples, which yields a 95% confidence interval of size at most 0.007 on each $P_{G_{\mathrm{KroganHigh}}}(u,v)$ entry. Next, we normalized the peptide counts of the interactions in $G_{\mathrm{KroganFull}}$ using protein lengths. We then compared $P_{G_{\mathrm{KroganHigh}}}$ to the normalized peptide counts of the interactions in $G_{\mathrm{KroganFull}}$.
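The Monte Carlo estimation of a connectivity matrix can be sketched as below. The sketch assumes the model in which every edge of the direct interaction graph survives independently with probability $p$, and it estimates, for each vertex pair, the fraction of samples in which the pair stays connected. The function name `estimate_connectivity` and the union-find bookkeeping are illustrative choices, not a description of our actual implementation.

```python
import random


def estimate_connectivity(vertices, edges, p=0.5, n_samples=500, seed=None):
    """Estimate, for every vertex pair, the probability of remaining connected
    when each edge independently survives with probability p."""
    rng = random.Random(seed)
    index = {v: k for k, v in enumerate(vertices)}
    counts = {}
    for _ in range(n_samples):
        parent = list(range(len(vertices)))  # union-find forest for this sample

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        # Keep each edge with probability p and merge its endpoints.
        for u, v in edges:
            if rng.random() < p:
                parent[find(index[u])] = find(index[v])
        # Count the pairs that ended up in the same component.
        for a in range(len(vertices)):
            for b in range(a + 1, len(vertices)):
                if find(a) == find(b):
                    key = (vertices[a], vertices[b])
                    counts[key] = counts.get(key, 0) + 1
    # Pairs that were never connected are simply absent (probability 0).
    return {pair: c / n_samples for pair, c in counts.items()}


triangle = [("a", "b"), ("b", "c"), ("a", "c")]
print(estimate_connectivity(["a", "b", "c"], triangle, p=0.5, n_samples=2000, seed=0))
```

The all-pairs loop is quadratic in the number of vertices per sample, which is acceptable for a sketch but would be replaced by a per-component count in a large-scale run.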


We expect that a significant fraction of low-confidence interactions in $G_{\mathrm{KroganFull}} - G_{\mathrm{KroganHigh}}$ are likely to be indirect interactions. If our model is correct, their peptide counts should then be correlated with the corresponding entries in $P_{G_{\mathrm{KroganHigh}}}$. Indeed, the positive linear correlation between the predicted connectivity $P_{G_{\mathrm{KroganHigh}}}$ and the observed normalized peptide counts is very significant (regression p-value of $8.17 \times 10^{-11}$, Student $t$-test). Furthermore, this correlation is strongest when $p \approx 0.5$, as compared to $p = 0.3$ or $0.7$, justifying the use of this value in our subsequent analyses.

4.4.6 Accuracy of the Algorithm

The ideal validation of the accuracy of our algorithm would involve (i) constructing a connectivity matrix $M$ using actual quantitative AP-MS data; (ii) predicting direct interactions based on $M$ using our algorithm; and then (iii) comparing our predictions to experimentally generated direct interaction data. Yeast 2-Hybrid (Y2H) experiments are less prone to detect indirect interactions than are AP-MS methods, and several large-scale efforts have been reported [53, 74, 137]. Unfortunately, for a number of technical reasons, the overlap between AP-MS PPI networks and Y2H networks remains very small [149]. As a consequence, Y2H data cannot be used directly to validate predictions made on AP-MS data. Instead, we had to rely on a partially synthetic data set, where an actual network of high-quality Y2H interactions is assumed to form the direct interaction graph, and a connectivity matrix is generated from it using Monte Carlo sampling, under our model. Two sets of Y2H interactions were used: (i) $G_{\mathrm{Yu}}$ is the network constructed from the gold standard dataset of Yu et al. [149], consisting of 1090 proteins and 1318 interactions with high confidence of being direct; (ii) $G_{\mathrm{DIP}}$ is the core, high-quality, physical interaction network of yeast, available at the DIP database, version 20090126CR [146], consisting of 1406 proteins and 1967 interactions.

These biological networks were complemented with two artificial 1000-vertex networks. The first was generated using the preferential attachment model (PAM) [10]. For the second, we used the duplication model (DM) [13], which, in contrast to the PAM, generates graphs containing several dense clusters. The resulting artificial “direct” interaction graphs are called $G_{\mathrm{PAM}}$ and $G_{\mathrm{DM}}$ and contain 1500 to 2000 interactions each. We then used the Monte Carlo sampling approach described above to estimate the connectivity matrices $P_{G_{\mathrm{Yu}}}$, $P_{G_{\mathrm{DIP}}}$, $P_{G_{\mathrm{PAM}}}$, and $P_{G_{\mathrm{DM}}}$. These form the input to our inference algorithm, whose output is then compared to the corresponding direct interaction graph. It is important to note that these input matrices are not perfectly accurate and may contain sampling errors. However, it is easy to bound the size of the errors with high probability and use this as a tolerance level within our algorithm. We also note that the results presented in this section only aim at evaluating the performance of the inference algorithm on input data that was generated exactly according to our probabilistic model. As such, the error rates reported may be considered as lower bounds for those on actual biological data. An assumption-free evaluation is provided later in this section.

Identification of weakly connected vertices

Theorem 4.1 provides an efficient algorithm that guarantees the identification of all cut

edges, provided that the given connectivity matrix is precise. We say that a vertex v is a

1-cut vertex if all edges incident on v are cut edges. By applying Theorem 4.1 recursively

to detect cut edges and decomposing the graph into two connected components, we

can detect and remove all 1-cut vertices from the input connectivity matrix. Table 4.1 (i)

reports the number of 1-cut vertices that are detected by the recursive algorithm from

Theorem 4.1. In both the Yu and the DIP network, 1-cut vertices constitute approxi-

mately 50% of the network, and identifying them allows a significant reduction in the

problem size. We note that the inaccuracies in the input connectivity matrices could, in

principle, have introduced errors in the detection of cut edges. However, this rare event

was never observed on any of our networks.
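To make the definition of a 1-cut vertex concrete, the sketch below computes cut edges and 1-cut vertices on a graph that is already known. This only illustrates the definition: it is not the matrix-based detection of Theorem 4.1, which has access to the connectivity matrix alone. The helper names `find_cut_edges` and `one_cut_vertices` are hypothetical, and the recursive search is suitable for small examples only.

```python
def find_cut_edges(adj):
    """Return the cut edges (bridges) of a simple undirected graph given as
    a dict mapping each vertex to the set of its neighbours."""
    order, low, cut_edges = {}, {}, set()

    def dfs(v, parent):
        order[v] = low[v] = len(order)
        for w in adj[v]:
            if w == parent:
                continue  # do not reuse the tree edge we arrived by
            if w in order:
                low[v] = min(low[v], order[w])  # back edge
            else:
                dfs(w, v)
                low[v] = min(low[v], low[w])
                if low[w] > order[v]:
                    # No back edge from w's subtree reaches v or above.
                    cut_edges.add(frozenset((v, w)))

    for v in adj:
        if v not in order:
            dfs(v, None)
    return cut_edges


def one_cut_vertices(adj):
    """Vertices all of whose incident edges are cut edges."""
    cut_edges = find_cut_edges(adj)
    return {v for v, nbrs in adj.items()
            if nbrs and all(frozenset((v, w)) in cut_edges for w in nbrs)}


# A triangle a-b-c with a pendant path c-d-e attached.
example = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e"}, "e": {"d"},
}
print(sorted(one_cut_vertices(example)))
```

In this example the edges c-d and d-e are cut edges, so d and e are 1-cut vertices, while c is not because it also lies on the triangle.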

Algorithm 4.3.1 (see Methods) is guaranteed to efficiently identify all degree-2 vertices (again, provided that the connectivity matrix is known), but may also incorrectly flag some higher-degree vertices. As seen in Table 4.1 (ii), nearly all degree-2 vertices were identified, with a low false-discovery rate ranging from 6% to 9%. Moreover, the false positives incorrectly detected as degree-2 vertices indeed had small degrees, and their predicted neighbours were mostly correct (but incomplete) predictions. Flagging degree-2 vertices reduces the problem size further by 15% to 36%.

| Network | Total | (i) 1-cut real | (i) 1-cut pred. | (i) FDR (%) | (i) FNR (%) | (ii) Degree-2 real | (ii) Degree-2 pred. | (ii) FDR (%) | (ii) FNR (%) | Remaining |
|---|---|---|---|---|---|---|---|---|---|---|
| Yu | 1090 | 552 | 552 | 0 | 0 | 195 | 207 | 7.7 | 2.05 | 331 |
| DIP | 1406 | 656 | 656 | 0 | 0 | 309 | 326 | 5.82 | 0.64 | 424 |
| PAM | 1000 | 457 | 457 | 0 | 0 | 351 | 363 | 3.58 | 0.28 | 180 |
| DM | 1000 | 323 | 323 | 0 | 0 | 117 | 126 | 11.9 | 5.12 | 551 |

Table 4.1: Performance of detecting weakly connected vertices. Number of vertices detected as (i) 1-cut vertices, and (ii) degree-2 vertices in the real network and the predicted network. False discovery (FDR) and false negative (FNR) ratios are given in percentage. Remaining: the number of vertices remaining after identifying 1-cut vertices and degree-2 vertices.
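The error measures used in the tables of this section can be computed from a real and a predicted set as sketched below, taking the false-discovery ratio as the percentage of false positives in the predicted set and the false-negative ratio as the percentage of false negatives in the real set. The function name `error_rates` and the toy sets are assumptions made for this illustration.

```python
def error_rates(real, predicted):
    """Return (FDR, FNR) in percent for a predicted set against a real set."""
    real, predicted = set(real), set(predicted)
    true_positives = len(real & predicted)
    # FDR: share of the predictions that are wrong.
    fdr = 100.0 * (len(predicted) - true_positives) / len(predicted) if predicted else 0.0
    # FNR: share of the real items that were missed.
    fnr = 100.0 * (len(real) - true_positives) / len(real) if real else 0.0
    return fdr, fnr


# Toy sizes: 195 real items, 207 predicted, of which 191 are assumed correct.
real = set(range(195))
predicted = set(range(191)) | set(range(1000, 1016))
fdr, fnr = error_rates(real, predicted)
print(round(fdr, 2), round(fnr, 2))
```

With these illustrative counts the ratios come out at about 7.7% and 2.05%, which matches the order of magnitude reported for degree-2 vertices on the Yu network.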

After repeatedly detecting and removing 1-cut vertices and degree-2 vertices from

the problem space, the edges adjacent to approximately 70% of the vertices are detected

with very low error rate. The remaining vertices only constitute approximately 30% of

the original network. We call this remaining subset the hard core of the connectivity

matrix. Because it is more densely connected than the rest of the network, the topology of the hard core is more difficult to reconstruct.

Running our algorithm on the PAM simulated data yields similar resolution and error

rates as on the Y2H networks. However, our DM network is found to be less amenable to

these strategies, leaving 55% of vertices unresolved and resulting in an error rate approx-

imately twice that seen for other networks. This is simply due to the fact that networks

generated by the duplication model do not contain as many 1-cut vertices or degree 2

vertices when compared to other networks, including biological Y2H networks.

Identification of dense regions

Our dense region detection algorithm aims at identifying all edges that belong to a $k$-clique in $G_{\mathrm{Direct}}$, for a given value of $k$. Table 4.2 reports the accuracy of the algorithm.


| Network | k = 7 real | k = 7 pred. | k = 7 FD (%) | k = 7 FN (%) | k = 6 real | k = 6 pred. | k = 6 FD (%) | k = 6 FN (%) | k = 5 real | k = 5 pred. | k = 5 FD (%) | k = 5 FN (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Yu | 42 | 54 | 22.22 | 0 | 66 | 104 | 36.53 | 0 | 146 | 308 | 55.19 | 5.48 |
| DIP | 0 | 42 | 100 | 0 | 86 | 112 | 26.79 | 4.65 | 184 | 266 | 35.34 | 6.52 |
| PAM | 0 | 31 | 100 | 0 | 0 | 45 | 100 | 0 | 0 | 96 | 100 | 0 |
| DM | 254 | 346 | 28.32 | 2.36 | 194 | 267 | 29.21 | 2.57 | 488 | 718 | 34.96 | 4.30 |

Table 4.2: Performance of quasi-clique predictions. Number of edges that belong to maximal cliques of size $k$. Real: actual number of edges that belong to maximal cliques of size $k$; pred.: predicted number of maximal $k$-clique edges; FD (false discovery ratio): percentage of false positives in the predicted set; FN (false negative ratio): percentage of false negatives in the real set.

As expected, our algorithm achieves extremely high sensitivity for clique edges. How-

ever, the false-discovery rate is quite high, especially for the case of DIP and PAM, k = 7.

This is due to the fact that distinguishing a 5-clique from, say, a quasi-clique of size 7

is extremely difficult, causing false-positive predictions. We note however that these er-

roneous predictions are mostly inconsequential, as the intra-cluster topology of each

dense region shall only be used in generating the initial candidate solutions for the ge-

netic algorithm.

Cut-based genetic algorithm

The various parameters of the genetic algorithm (population size, mate selection probability, mutation rate, etc.) were optimized for the running time and accuracy of the solution based on $G_{\mathrm{Yu}}$. Although our genetic algorithm could in principle be used on any

connectivity matrix, running it on the full matrix of > 1000 proteins is impossible: the

search space is huge, and the amount of time required to evaluate the fitness of a given

candidate solution is too large. However, as discussed previously, applying first the 1-

cut and degree-2 vertex detection algorithms significantly reduces the problem size and

makes it accessible to our genetic algorithm. Table 4.3 (i) reports the accuracy of the

genetic algorithm predictions on the hard core of each connectivity matrix. We note

that since the network to be inferred is relatively highly connected, the problem is sig-

nificantly more difficult than the identification of 1-cuts and degree-2 vertices. Indeed,

the false-discovery and false-negative rates range from 35% to 55% for most datasets.


| Network | (i) reduced: real | (i) reduced: pred. | (i) FDR (%) | (i) FNR (%) | (ii) overall: real | (ii) overall: pred. | (ii) FDR (%) | (ii) FNR (%) |
|---|---|---|---|---|---|---|---|---|
| Yu | 563 | 552 | 43.65 | 44.76 | 1318 | 1390 | 14.96 | 10.31 |
| DIP | 931 | 890 | 35.50 | 38.34 | 1967 | 2041 | 17.34 | 14.23 |
| PAM | 473 | 421 | 49.88 | 55.39 | 1538 | 1462 | 16.14 | 20.28 |
| DM | 1138 | 1295 | 43.39 | 35.58 | 1869 | 1804 | 32.81 | 35.15 |

Table 4.3: Performance of genetic algorithm and overall algorithm. (i) Performance of the genetic algorithm on the reduced network obtained from removing 1-cut vertices and degree-2 vertices. (ii) Overall performance of the combined prediction pipeline on the complete connectivity matrix. Real and predicted describe the number of edges in the real and predicted networks, respectively.

For a comparison, an algorithm that would pick edges randomly would achieve 98.75%

false-discovery and false-negative rates.

Combining the three phases of the algorithm, the overall error rate obtained on each

data set ranges from 10 to 20% false-discovery and false-negative rates, except for the

DM data set, which fares considerably worse, for the reasons explained earlier.

To the best of our knowledge, there have been no other efforts to solve the DIGCOM problems (neither the exact nor the approximate version). We thus compared our approach to a simple hill-climbing search algorithm on the Yu et al. data set (see Methods). We let this algorithm run over several days (as opposed to the few hours spent using our approach), with multiple restarts, and discovered that it provides very poor sensitivity and specificity (see Table 4.4 for the best results obtained). This is not surprising since the hill-climbing method is highly dependent on the initial solution (in this case, a spanning tree chosen randomly based on the connectivity matrix) and the search space is simply too large to exhaustively search for a good initial solution. We also tested the hill-climbing approach in the same setting as the genetic algorithm, i.e. combining it with the 1-cut edges and degree-2 vertices detection algorithms. Here, the hill-climbing approach showed a better sensitivity and specificity than the pure hill-climbing approach, but still performed much worse than our genetic algorithm (Table 4.4). Furthermore, the improvement over the pure hill-climbing approach was mostly due to the high sensitivity and specificity of our algorithm for detecting weakly connected vertices.

| Method | Real | Predicted | FDR (%) | FNR (%) |
|---|---|---|---|---|
| (i) Hill-climbing | 1318 | 2108 | 86.67 | 78.68 |
| (ii) Hill-climbing + weakly conn. nodes | 1318 | 1517 | 43.57 | 35.05 |
| (iii) Our approach (GA) | 1318 | 1390 | 14.96 | 10.31 |

Table 4.4: Comparison of our method to simple hill-climbing approach. (i) Accuracy of the hill-climbing approach used over the complete network; (ii) Accuracy of the hill-climbing approach after fixing the weakly connected nodes using our algorithm; (iii) Accuracy of our combined pipeline using the genetic algorithm.

| Network | (i) 1-cut & degree-2 vertices | (ii) Quasi-clique predictions | (iii) Genetic algorithm |
|---|---|---|---|
| Yu | 0:00:29 | 0:31:02 | 15:00:00 |
| DIP | 0:00:41 | 0:48:14 | 15:00:00 |
| PAM | 0:00:19 | 0:29:49 | 15:00:00 |
| DM | 0:00:24 | 1:18:03 | 15:00:00 |

Table 4.5: Running times for each phase of our algorithm. (i) Detecting 1-cut vertices and degree-2 vertices; (ii) Predicting quasi-clique clusters; and (iii) Running the genetic algorithm. We are reporting the average run time over three runs on each network. The implementation was tested on a Power Mac G5 2 GHz with 4 GB of RAM. Note that the genetic algorithm was run for a fixed amount of time, and the top scoring candidates achieved the quality as shown in Table 4.3. The times are shown in hh:mm:ss format.

To provide an idea of the running times, Table 4.5 gives the empirical data from our experiments. The first phase (detecting weakly connected vertices) ran within seconds and the second (recognizing dense regions) took between roughly half an hour and 80 minutes, while the genetic algorithm was run for a fixed amount of time.

4.4.7 Inferring Direct Interactions from AP-MS Experimental Data

In order to apply our algorithm to biological data from AP-MS experiments, we used the raw data reported by Krogan et al. [93] for the 2186 putative interactions of $G_{\mathrm{KroganFull}}$. We only considered the subnetwork of tagged proteins, and further focussed our efforts on the analysis of 77 proteins that are well separated in the tag-induced subnetwork. Quantitative abundance estimates were derived from the peptide counts reported for each prey, and an experimentally derived connectivity matrix $M$ was obtained after normalization (see Section 4.4.4). Our full prediction algorithm was then run on the estimated connectivity matrix, resulting in a direct interaction graph prediction we call $G_{\mathrm{Kim}}$ that consists of 164 interactions (see Table 4.6). The network $G_{\mathrm{Kim}}$ was compared to $G_{\mathrm{KroganHigh}}$, the set of high-confidence interactions reported by Krogan et al., and to $G^{\mathrm{Top}}_{\mathrm{KroganHigh}}$, a subset of $G_{\mathrm{KroganHigh}}$ consisting of the 164 most confident interactions they reported (the same number as in $G_{\mathrm{Kim}}$, to allow a direct comparison).

Both $G_{\mathrm{KroganFull}}$ and $G_{\mathrm{KroganHigh}}$ overlap $G_{\mathrm{Kim}}$ quite substantially. These three sets of predictions were then compared against a set of high-quality binary interactions from $G_{\mathrm{Yu}}$. In Y2H experiments, the interaction partners are separately screened using a genetic readout. Therefore, interactions from $G_{\mathrm{Yu}}$ are believed to be direct, and are thus used to test against the predictions from AP-MS data. On the other hand, these interactions may reflect only a subset of all direct interactions among the 77 proteins.

As shown in Figure 4.4, our results show that the high-confidence AP-MS data $G_{\mathrm{KroganHigh}}$ exhibited very little overlap with the direct binary interaction set $G_{\mathrm{Yu}}$. 72.6% of interactions in $G_{\mathrm{KroganHigh}}$ are disjoint from $G_{\mathrm{Yu}}$, and 25% of $G_{\mathrm{Yu}}$ remains undetected by $G_{\mathrm{KroganHigh}}$. Furthermore, even the top scoring set of interactions $G^{\mathrm{Top}}_{\mathrm{KroganHigh}}$ showed high discrepancy ratios against $G_{\mathrm{Yu}}$. In contrast, $G_{\mathrm{Kim}}$ produced by our algorithm coincides with $G_{\mathrm{Yu}}$ with better sensitivity and specificity. Given the crudeness of the method in translating the AP-MS data into a connectivity matrix, our algorithm has thus performed relatively well in predicting direct interactions from real AP-MS data.
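The overlap statistics can be recomputed from the three counts shown in each panel of Figure 4.4 (interactions only in the Y2H set, shared interactions, and interactions only in the AP-MS-based set). The sketch below is a small arithmetic aid; the function name `overlap_summary` and the dictionary keys are hypothetical.

```python
def overlap_summary(only_y2h, shared, only_apms):
    """Summarize the overlap between a Y2H set and an AP-MS-based set
    from the three region counts of a two-set Venn diagram."""
    size_y2h = only_y2h + shared
    size_apms = shared + only_apms
    return {
        "y2h_undetected": only_y2h / size_y2h,    # share of Y2H missed
        "apms_unsupported": only_apms / size_apms,  # share of AP-MS not in Y2H
    }


# Region counts as read from the three panels of Figure 4.4.
panels = {
    "KroganHigh": (31, 90, 242),
    "TopKroganHigh": (42, 79, 85),
    "Kim": (15, 106, 58),
}
for name, counts in panels.items():
    print(name, overlap_summary(*counts))

# Ratio of shared interactions: our prediction vs. the top-scoring set.
print(106 / 79)
```

The last ratio, 106 shared interactions against 79, is the basis for comparing predictions of equal size (164 interactions each) against the Y2H data.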

4.5 Discussions

Approaches for determining bait-prey abundance remain in their infancy, and to date,

no large-scale PPI networks have this type of quantitative data. As these approaches

improve in accuracy, so will the results of our method. Furthermore, as the sensitiv-

ity of AP-MS pipelines improves, the fraction of indirect interactions detected will also

increase, thereby making the ability to distinguish them even more critical.


| Protein A | Protein B | Protein A | Protein B | Protein A | Protein B |
|---|---|---|---|---|---|
| YOL012C | YKR048C | YNL031C | YJL115W | YLR418C | YIL035C |
| YBL003C | YGL241W | YNL031C | YBL003C | YDR054C | YGL173C |
| YBL003C | YKR048C | YNL031C | YLL022C | YDR054C | YLR418C |
| YBL003C | YGL207W | YNL031C | YIL035C | YER112W | YCR077C |
| YBL003C | YIL035C | YNL031C | YGL207W | YER112W | YDR432W |
| YGR103W | YIL035C | YNL031C | YER164W | YMR229C | YDR432W |
| YGR103W | YMR229C | YOR039W | YER164W | YOR206W | YMR229C |
| YGR103W | YOR272W | YER164W | YGL207W | YHL034C | YOR310C |
| YGR103W | YKR048C | YBL002W | YOL012C | YHL034C | YDR432W |
| YGR103W | YNL189W | YBL002W | YDR224W | YHL034C | YLR175W |
| YGR103W | YLR347C | YBL002W | YKR048C | YHL034C | YMR229C |
| YMR304W | YGR159C | YBL002W | YBR245C | YHL034C | YGL173C |
| YMR304W | YDR432W | YBL002W | YGL207W | YOL076W | YGR159C |
| YMR304W | YLR197W | YBL002W | YIL035C | YOL076W | YDR432W |
| YMR075W | YBL002W | YLL022C | YPL001W | YNL088W | YGR159C |
| YMR075W | YDR432W | YBR010W | YBR245C | YCR077C | YDR432W |
| YMR075W | YNL189W | YBR010W | YGL207W | YER146W | YCR077C |
| YOL004W | YAL034C | YBR010W | YDR432W | YHR064C | YGL173C |
| YOL004W | YMR075W | YBR010W | YKR048C | YOR048C | YDR432W |
| YOL004W | YDR432W | YBR010W | YPL001W | YGL241W | YKR048C |
| YOL004W | YIL035C | YBR010W | YBL002W | YDR225W | YGL207W |
| YOR310C | YGL120C | YBR010W | YLL022C | YDR225W | YKR048C |
| YOR310C | YNL189W | YCR060W | YGR159C | YDR225W | YGL241W |
| YLR455W | YOR310C | YJR041C | YNL189W | YGL207W | YOR039W |
| YLR455W | YOR206W | YJR041C | YDR432W | YGL207W | YGL244W |
| YLR455W | YNL088W | YJR041C | YMR125W | YGL207W | YIL035C |
| YLR455W | YMR229C | YJR041C | YPL178W | YGL207W | YBR279W |
| YLR455W | YDR496C | YNL209W | YDR432W | YGL207W | YOL145C |
| YBR009C | YPL001W | YNL209W | YER146W | YGL207W | YNL088W |
| YBR009C | YGL207W | YNL209W | YKR048C | YGL207W | YGL173C |
| YNL030W | YPL001W | YGL244W | YLR418C | YGL207W | YOR061W |
| YNL030W | YBR009C | YGL244W | YOL145C | YDR224C | YGL207W |
| YNL030W | YLL022C | YOR123C | YGL207W | YDR224C | YKR048C |
| YNL030W | YGL207W | YOR123C | YGL244W | YGL120C | YGR159C |
| YKR024C | YMR229C | YDR234W | YHR064C | YGL120C | YDR432W |
| YKR024C | YGL173C | YOR061W | YOR039W | YMR128W | YGL120C |
| YBR103W | YDR432W | YIL035C | YBR009C | YGR272C | YHL034C |
| YDL229W | YBR169C | YIL035C | YDR172W | YGR272C | YDR432W |
| YDL229W | YMR229C | YIL035C | YOR061W | YMR014W | YDR432W |
| YDL229W | YGL207W | YIL035C | YER164W | YMR014W | YGR272C |
| YDL229W | YDR432W | YPL153C | YGR159C | YNL147W | YGL173C |
| YDL229W | YKR048C | YPL153C | YNL189W | YDR432W | YDL060W |
| YDL229W | YGR159C | YJL115W | YOR061W | YGL147C | YDR432W |
| YDL229W | YHR064C | YJL115W | YIL035C | YPL178W | YLR347C |
| YOL145C | YBR279W | YJL115W | YPL153C | YPL178W | YMR125W |
| YOL145C | YDR172W | YLR410W | YGR159C | YGR159C | YOR310C |
| YOL145C | YOR123C | YLR410W | YKR048C | YML062C | YNL189W |
| YBR279W | YLR418C | YLR197W | YOR310C | YML062C | YGR159C |
| YBR279W | YIL035C | YLR197W | YLR347C | YMR125W | YNL189W |
| YBR279W | YGL244W | YDL040C | YMR229C | YNL139C | YLR347C |
| YBR279W | YOR123C | YDL040C | YLR175W | YNL139C | YNL189W |
| YDR365C | YMR229C | YDL040C | YLR197W | YNL189W | YLR347C |
| YDR365C | YGR159C | YLR418C | YGL207W | YDR289C | YNL189W |
| YNL031C | YPL001W | YLR418C | YOR123C | YDR289C | YLR347C |
| YNL031C | YBR009C | YLR418C | YNL189W | | |

Table 4.6: The set of direct interactions among 77 yeast proteins, predicted by our algorithm. The connectivity matrix is generated from normalized peptide counts, and then our algorithm is run to predict the direct interactions.


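[Figure 4.4: three two-set Venn diagrams, each comparing $G_{\mathrm{Yu}}$ (121 interactions) with an AP-MS-based network. (a) $G_{\mathrm{KroganHigh}}$ (332 interactions): 31 only in $G_{\mathrm{Yu}}$, 90 shared, 242 only in $G_{\mathrm{KroganHigh}}$. (b) $G^{\mathrm{Top}}_{\mathrm{KroganHigh}}$ (164 interactions): 42 only in $G_{\mathrm{Yu}}$, 79 shared, 85 only in $G^{\mathrm{Top}}_{\mathrm{KroganHigh}}$. (c) $G_{\mathrm{Kim}}$ (164 interactions): 15 only in $G_{\mathrm{Yu}}$, 106 shared, 58 only in $G_{\mathrm{Kim}}$.]

Figure 4.4: Inferring direct interactions from actual AP-MS dataset. Overlap between the Y2H interaction network of Yu et al. and various AP-MS-based networks: (a) High-confidence set of interactions from Krogan et al. (b) The set of 164 highest scoring interactions from Krogan et al. (c) The set of 164 interactions predicted as direct interactions by our algorithms, based on the AP-MS data from Krogan et al.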

In this chapter, we lay the groundwork for modelling indirect interactions in AP-

MS experiments. We formulate the DIGCOM problems, which aim at distinguishing

direct interactions from indirect ones, and provide a set of theoretical and heuristic

approaches that are shown to be highly accurate on both biological PPI networks and

simulated networks. Despite the unrealistic assumptions that should be relaxed, our re-

sults show that the predicted set of interactions fits the experimental data reasonably

well. In addition, applying our algorithms to a large-scale AP-MS data set from Krogan

et al. results in predictions that overlap Y2H data approximately 35% more often than

the equivalent number of top-scoring interactions reported by these authors.

The DIGCOM problems raise a number of challenging, yet fascinating computa-

tional and mathematical problems. Is the solution to the exact DIGCOM problem, if

it exists, always unique? We suspect it is. What is the computational complexity of the

exact and approximate DIGCOM problems? We believe they are NP-hard, and possi-

bly not even in NP. Are there types of graph substructures, other than those discussed

here, that can be unambiguously inferred from $P_G$? Are there special properties of PPI

networks, other than the power-law degree distribution, of which an algorithm can take

advantage to make more accurate predictions and/or provide approximation or proba-

bilistic guarantees?


The model and algorithm proposed here are only a first step toward an accurate de-

tection of direct interactions from AP-MS data. Several generalizations and improve-

ments are worth investigating. First, the abundance of an interaction is not constant

and needs to be modelled more accurately. Second, the strength of all physical inter-

actions is non-uniform, and some interactions may be more prone to disruption by the

affinity purification process than others. Given sufficient quantitative AP-MS data, one

may study a generalization of the DIGCOM problems that aims at identifying not only

the set of direct interactions, but also their individual strengths and abundances. While

modelling these aspects is in theory possible, the amount and quality of experimental

data required is currently unavailable, and the computational complexity of the resulting problems is likely to be daunting.

Perhaps a more significant limitation of our model is that all direct interactions are

assumed to occur simultaneously, though it is clear that certain interactions are either

mutually exclusive, or restricted to specific subcellular compartments or conditions.

One can investigate approaches to decompose the observed network into a family of

simultaneously occurring interactions. In order to do so, complementary experimental

data, such as comprehensive protein localization assays or cell cycle expression data,

would be required to reduce the space of possible solutions in a biologically meaningful

manner.

An additional assumption that may need to be relaxed is the independence of the

edge failures, which may not hold in cases where the loss of an interaction between two

proteins causes a significant destabilization of the larger complex they belong to. Un-

fortunately, in the presence of strong dependencies between edge failures, it becomes

almost impossible to distinguish direct from indirect interactions. Nonetheless, it may

be possible to at least identify complexes where such dependencies hold, by studying

subsets of proteins for which the AP-MS data differs significantly from our model.



The results discussed in this chapter have been published in [88].

Chapter 5

Hypergraph Modelling of PPI Data

PPI networks are traditionally modelled as graphs whose vertices represent proteins,

and two vertices are joined by an edge if and only if the two corresponding proteins

interact with each other. However, many recent discoveries of protein complexes as

building blocks of the proteome led us to believe that protein-protein interactions may

involve any number of interaction partners [7, 58, 93]. Consequently, various methods

have been proposed for detecting protein complexes (e.g. [7, 90, 129]). We discussed

some of these methods in detail in Chapter 1.

One of the limitations of the existing approaches is that the identified protein com-

plexes are often disjoint, and the sparse inter-cluster bridges remain as binary interac-

tions. Some recently proposed methods do allow finding overlapping clusters [1, 7, 98],

but most of these approaches are designed to first identify dense clusters, and then try

to merge the nearby clusters. While intuitive, such a strategy often fails to find a globally

optimum structure if, for example, the optimization criterion is to minimize the number

of clusters. In fact, none of these methods guarantees to find the optimum solution for

the problem they are designed for. As a result, finding a comprehensive model for PPI

networks as a collection of protein complexes remains unsolved to date.

This chapter is concerned with the problem of modelling binary PPI data as a hyper-


graph, i.e., a set system where each set represents a protein complex. In order to model

this problem, let G = (V, E ) denote the combinatorial graph constructed from the widely

available binary PPI data (e.g. Y2H interaction data, or the direct AP-MS data produced

from Chapter 4). Then, each protein complex in the PPI network corresponds to a set S

of vertices in G , where each pair of vertices in S share an edge between them. In other

words, each protein complex corresponds to a clique in G . Our problem of finding a

hypergraph (with possibly a small number of sets, or hyperedges) can then be stated as

finding a clique cover of minimum cardinality.

(EDGE) CLIQUE COVER. Given a graph G = (V, E ), find a smallest collection Q of cliques

in G such that every edge in E belongs to at least one clique in Q .

A similar problem can be defined as a partitioning problem:

(EDGE) CLIQUE PARTITION. Given a graph G = (V, E ), find a smallest collection Q of

cliques in G such that every edge in E belongs to precisely one clique in Q .
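The two problem statements differ only in whether an edge may be covered more than once. The following sketch checks whether a given collection of vertex sets is a clique cover or a clique partition of a graph; it verifies candidate solutions and does not attempt to find a smallest one. The function names and the example graph are assumptions made for this illustration.

```python
from itertools import combinations


def is_clique(adj, vertices):
    """True if every pair of the given vertices is adjacent."""
    return all(v in adj[u] for u, v in combinations(vertices, 2))


def is_clique_cover(adj, cliques):
    """True if every set is a clique and every edge lies in at least one."""
    edges = {frozenset((u, v)) for u in adj for v in adj[u]}
    covered = set()
    for clique in cliques:
        if not is_clique(adj, clique):
            return False
        covered |= {frozenset(pair) for pair in combinations(clique, 2)}
    return covered == edges


def is_clique_partition(adj, cliques):
    """True if the sets form a clique cover and no edge is covered twice."""
    if not is_clique_cover(adj, cliques):
        return False
    n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    n_covered = sum(len(c) * (len(c) - 1) // 2 for c in cliques)
    return n_covered == n_edges


# Two triangles a-b-c and b-c-d sharing the edge b-c.
g = {"a": {"b", "c"}, "b": {"a", "c", "d"}, "c": {"a", "b", "d"}, "d": {"b", "c"}}
print(is_clique_cover(g, [{"a", "b", "c"}, {"b", "c", "d"}]))
print(is_clique_partition(g, [{"a", "b", "c"}, {"b", "c", "d"}]))
print(is_clique_partition(g, [{"a", "b", "c"}, {"b", "d"}, {"c", "d"}]))
```

In the example, two triangles cover all five edges, but the shared edge b-c is covered twice, so a partition needs three cliques instead of two.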

Therefore, we draw our attention to the clique cover problem, with an application

to hypergraph modelling of PPI data. In fact, our theoretical pursuit on the clique cover

problem gives rise to efficient algorithms for sparse networks (including PPI networks),

where the graph sparsity is measured by various graph parameters.

Related Work. In addition to graph theoretic aspects, the clique cover problem has

been studied extensively from the standpoint of computational complexity. In particu-

lar, there are numerous studies concerning approximability and fixed-parameter tractabil-

ity. Clique cover is NP-hard in general [110], even when the input graph is planar [24]

or has bounded degree [71]. Furthermore, Lund and Yannakakis have shown that clique

cover is not approximable within a factor of $|V|^{\varepsilon}$ for some $\varepsilon > 0$ unless P = NP [101],

thereby removing the hope of good approximation algorithms in the general case. In the

case of clique partition problem, the problem has also been shown to be NP-complete


for various restricted classes of graphs [23, 56, 57, 72].

A parameterized problem is fixed-parameter tractable (FPT) if it can be solved in

$f(k) \cdot |I|^{O(1)}$ time, where $f$ is a computable function depending on some parameter $k$, independent of the input size $|I|$. Recently, Gramm et al. [66] showed that the clique

cover problem is FPT when the size of the cover is chosen for the parameter k . Similarly,

Mujuni and Rosamond [105] have shown that the clique partition problem is FPT with

the output size chosen as the parameter. These algorithms run in polynomial time in

the input size but exponential time in the number of cliques in the solution. As a result,

these algorithms are well suited for dense graphs where a few cliques can cover the entire

graph, but they are not suitable for sparse graphs that require a large number of small

cliques in the solution.

In this chapter, we fill the gap by designing efficient algorithms for such sparse graphs.

We first introduce some definitions and related results in Section 5.1. Then, in Sec-

tion 5.2, we present an exact algorithm for clique cover when the input graph has bounded

treewidth. Section 5.3 discusses the problem restricted to planar graphs, and we provide

a polynomial time approximation scheme. Finally, we show the performance of our al-

gorithm from experimental studies in Section 5.4. In particular, our algorithm shows

efficient and practical running time for both real and simulated biological networks.

Furthermore, our PTAS for planar graphs shows a clear trade-off between the approxi-

mation ratio and the running time when tested against random planar graphs.

5.1 Preliminaries

Throughout this chapter, we focus on the clique cover problem for sparse networks.

Where possible, we shall discuss how the algorithms for the clique cover problem can

be modified to solve the clique partition problem. Various measures of network sparsity

have been proposed in the past, with possibly the best known being treewidth. We recall

the definition of tree decomposition from Section 2.2.


Definition 5.1. [120] A tree decomposition of a graph $G = (V, E)$ is a pair $(X = \{X_i \mid i \in I\},\ T = (I, F))$, where each node $i \in I$ is associated with a set of vertices $X_i \subseteq V$, such that

(1) Each vertex belongs to at least one node: $\bigcup_{i \in I} X_i = V$.

(2) Each edge is induced by at least one node: for every $(v, w) \in E$, there is an $i \in I$ with $v, w \in X_i$.

(3) For each $v \in V$, the set of nodes $\{i \in I \mid v \in X_i\}$ induces a subtree of $T$.

Then, the width of a tree decomposition $(X, T)$ is defined as $\max_{i \in I} |X_i| - 1$, and the treewidth of a graph $G$, denoted $\mathrm{tw}(G)$, is the minimum width over all tree decompositions of $G$.
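The three conditions of Definition 5.1 can be checked mechanically. The following is a minimal sketch, not part of the thesis implementation; the function names and the dictionary-based representation of the bags are illustrative assumptions, and the code assumes that `tree_edges` already forms a tree on the bag identifiers.

```python
def is_tree_decomposition(vertices, edges, bags, tree_edges):
    """Check conditions (1)-(3) of a tree decomposition.

    bags: dict mapping a tree node id to its vertex set X_i (assumed layout).
    tree_edges: list of (i, j) pairs; assumed to form a tree on the bag ids.
    """
    # (1) every vertex appears in at least one bag
    if set().union(*bags.values()) != set(vertices):
        return False
    # (2) every edge has both endpoints together in some bag
    for v, w in edges:
        if not any(v in bag and w in bag for bag in bags.values()):
            return False
    # (3) the bags containing a given vertex must be connected in T
    adj = {i: set() for i in bags}
    for i, j in tree_edges:
        adj[i].add(j)
        adj[j].add(i)
    for v in vertices:
        nodes = {i for i, bag in bags.items() if v in bag}
        start = next(iter(nodes))
        seen, stack = {start}, [start]
        while stack:
            i = stack.pop()
            for j in adj[i] & nodes:
                if j not in seen:
                    seen.add(j)
                    stack.append(j)
        if seen != nodes:
            return False
    return True


def width(bags):
    """Width of a decomposition: largest bag size minus one."""
    return max(len(bag) for bag in bags.values()) - 1
```

For instance, a 4-cycle a-b-c-d admits the two bags {a, b, c} and {a, c, d} joined by a single tree edge, giving width 2.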

An alternative definition of treewidth can be given using k -trees.

Definition 5.2. [3] A k -tree is defined recursively as follows:

(i) The complete graph on k vertices is a k -tree, and

(ii) A k -tree G with n + 1 vertices (n ≥ k ) can be constructed from a k -tree H with n

vertices by adding a vertex adjacent to exactly k vertices that form a k -clique in H.

Definition 5.3. A graph is a partial k -tree if it is a subgraph of a k -tree.

Then, the class of partial $k$-trees is equivalent to the class of graphs with treewidth

at most $k$. Note that this recursive definition provides a simple algorithm to construct a graph

of treewidth k ; we shall use this approach to generate a set of synthetic test data in

Section 5.4.1. In general, it is NP-complete to determine the treewidth of a graph [3].

However, when k is fixed, graphs with treewidth k can be recognized, and width k tree

decompositions can be constructed, in linear time [17].
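The recursive construction in Definition 5.2 translates almost directly into a generator. The sketch below is only an illustration under stated assumptions: the function names, the uniform choice of the attachment clique, and the independent edge-retention probability `keep_prob` are hypothetical choices, not the procedure used for the experiments.

```python
import random
from itertools import combinations


def random_k_tree(n, k, rng):
    """Grow a k-tree on vertices 0..n-1 following Definition 5.2 (needs n >= k >= 1)."""
    # Base case: the complete graph on k vertices.
    edges = set(combinations(range(k), 2))
    cliques = [tuple(range(k))]  # k-cliques that a new vertex may attach to
    for v in range(k, n):
        base = rng.choice(cliques)  # uniform choice is an assumption
        edges.update((u, v) for u in base)
        # Each (k-1)-subset of the chosen clique, together with v, is a new k-clique.
        cliques.extend(sub + (v,) for sub in combinations(base, k - 1))
    return edges


def random_partial_k_tree(n, k, keep_prob, rng):
    """Keep each edge of a random k-tree independently with probability keep_prob."""
    return {e for e in random_k_tree(n, k, rng) if rng.random() < keep_prob}
```

A k-tree built this way has $\binom{k}{2} + (n - k)k$ edges, so the edge density of the partial k-tree can be steered through `keep_prob`.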

From the empirical studies of our input PPI data (shown in Section 5.4, Table 5.1), the

treewidth of PPI networks is often small compared to the network sizes. We thus assume,


where applicable, that the input graph has a bounded treewidth, and its optimum tree

decomposition can be constructed efficiently. Furthermore, for ease of exposition, we

assume that the decomposition tree T admits a nice structure as defined below.

Definition 5.4. [92] A tree decomposition $(X, T)$ is called nice if the tree $T$ is rooted, and for each node $i \in I$, one of the following holds:

1. LEAF: node $i$ is a leaf of $T$, and $|X_i| = 1$.

2. JOIN: node $i$ has exactly two children $j_1$ and $j_2$ such that $X_i = X_{j_1} = X_{j_2}$.

3. INTRODUCE: node $i$ has exactly one child $j$, and $X_i = X_j \cup \{v\}$ for some vertex $v \notin X_j$.

4. FORGET: node $i$ has exactly one child $j$, and $X_j = X_i \cup \{v\}$ for some vertex $v \notin X_i$.

Figure 5.1 illustrates an example of these node types. It is easy to see that if $\mathrm{tw}(G) \le k$, then $G$ also admits a nice tree decomposition of width $\le k$, with $O(n)$ tree nodes: given an arbitrary decomposition tree $T$, one can repeatedly split each node $X_i$ until all nodes satisfy the conditions above.
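As a small illustration of Definition 5.4, the node type can be read off from a bag and the bags of its children. This is a sketch only; the function name and the representation of bags as Python sets are assumptions made for the example.

```python
def node_type(bag, child_bags):
    """Classify a node of a rooted tree decomposition per Definition 5.4.

    bag: set of vertices at the node; child_bags: list of the children's bags.
    Returns "LEAF", "JOIN", "INTRODUCE", "FORGET", or None if no type fits.
    """
    if not child_bags:
        return "LEAF" if len(bag) == 1 else None
    if len(child_bags) == 2:
        return "JOIN" if child_bags[0] == bag == child_bags[1] else None
    if len(child_bags) == 1:
        child = child_bags[0]
        if child < bag and len(bag - child) == 1:
            return "INTRODUCE"  # bag = child plus one new vertex
        if bag < child and len(child - bag) == 1:
            return "FORGET"     # child = bag plus one forgotten vertex
    return None
```

A decomposition is nice exactly when this function returns a type for every node.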

Another closely related graph parameter is branchwidth. We recall its definition from

Section 2.2.

Definition 5.5. [121] A branch decomposition $(T, \varphi)$ of a graph $G$ is characterized by a ternary tree¹ $T$, and a bijection $\varphi$ from the leaves of $T$ onto the edges of $G$.

Let $e$ be a tree edge in $T$. Removing $e$ from $T$ partitions $T$ into two subtrees $T_1$ and $T_2$, and this partition induces a partition of the edges of $G$, called an $e$-separation, according to whether an edge is mapped to a leaf of $T_1$ or of $T_2$. Let $G_1$ and $G_2$ denote the subgraphs of $G$ formed by these two edge sets. The set of vertices in $G$ that are shared by both $G_1$ and $G_2$ is called the middle-set of $e$, and the width of this separation is the number of vertices in the middle-set.
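The middle-set of a separation depends only on the two edge sets it induces. A minimal sketch, with a hypothetical function name and edges represented as vertex pairs:

```python
def middle_set(edges_side1, edges_side2):
    """Vertices incident to edges on both sides of an e-separation."""
    touched1 = {v for e in edges_side1 for v in e}
    touched2 = {v for e in edges_side2 for v in e}
    return touched1 & touched2
```

For a 4-cycle a-b-c-d split into the edge sets {ab, bc} and {cd, da}, the middle-set is {a, c}, so the separation has width 2.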

Given a branch decomposition (T,φ), the width of this branch decomposition is the

maximum width over all e -separations in T , and the branchwidth of G , denoted bw(G ),

1 A tree T is a ternary tree if every non-leaf node has degree 3.


Figure 5.1: Node types in a nice tree decomposition: (1) introduce node: $X_i = X_j \cup \{v\}$; (2) forget node: $X_j = X_i \cup \{v\}$; (3) join node: $X_i = X_{j_1} = X_{j_2}$.

is the minimum width over all branch decompositions. It is well known that the branchwidth is closely related to the treewidth of a graph [121]: $\mathrm{bw}(G) \le \mathrm{tw}(G) + 1 \le \frac{3}{2}\,\mathrm{bw}(G)$. For planar graphs, Fomin and Thilikos gave an upper bound on the branchwidth:

Theorem 5.6. [52] For any planar graph $G$, $\mathrm{bw}(G) \le \sqrt{4.5n} \approx 2.122\sqrt{n}$.

5.2 Clique Cover for Graphs with Bounded Treewidth

In this section, we design a dynamic programming algorithm for finding a minimum

clique cover for a graph G where a nice tree decomposition (X , T ) is given. Let k denote

the width of that decomposition. First, let us define $E(X_i)$ to be the set of edges in the subgraph induced by the vertices in $X_i$. Furthermore, we let $V_i$ denote the union of the vertices in $X_i$ and in all of its descendant nodes. Similarly, let $G_i$ denote the subgraph of $G$ induced by the vertices $V_i$. Finally, for a vertex $v \in V(G)$, let $\delta(v)$ denote the set of edges that are incident to $v$ in $G$.

We shall in fact design an algorithm for a generalization of the clique cover problem,

where we are given a subset S of edges that are already covered, i.e., our solution need not

cover S, but may use these edges in the cliques. Then, the original clique cover problem

is a special case where $S = \emptyset$. Since our dynamic programming is formulated around the decomposition tree $T$, we often speak of a subgraph $G_i$ where a certain subset $S$ of edges is already covered, denoted by $G_i(S)$. Now we can define a cost function:

$C_i(S)$ = minimum size of clique cover for $G_i$ where $S$ is already covered.

Then, our final solution is precisely $C_r(\emptyset)$, where $r$ is the root of $T$. Our dynamic programming algorithm will proceed from the leaves of $T$ up to its root, computing, for each node $X_i$, the value of $C_i(S)$ for every possible subset $S$ of $E(X_i)$. Depending on the type of node, $C_i(S)$ is computed differently.
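To make the cost function concrete, the following brute-force reference computes the minimum number of cliques needed to cover all edges that are not already covered. It is an exponential-time sketch suitable only for very small graphs, intended as a sanity check rather than as the dynamic program of this section; the function names are hypothetical.

```python
from itertools import combinations


def cliques_of(vertices, edges):
    """All vertex sets of size >= 2 that induce a complete subgraph."""
    edge_set = {frozenset(e) for e in edges}
    vs = sorted(vertices)
    return [frozenset(c)
            for r in range(2, len(vs) + 1)
            for c in combinations(vs, r)
            if all(frozenset(p) in edge_set for p in combinations(c, 2))]


def min_cover_size(vertices, edges, covered=()):
    """Smallest number of cliques covering every edge not listed in `covered`.

    Edges in `covered` need not be covered but may still be used inside cliques.
    """
    edge_set = {frozenset(e) for e in edges}
    todo = edge_set - {frozenset(e) for e in covered}
    candidates = cliques_of(vertices, edges)
    for size in range(len(todo) + 1):
        for choice in combinations(candidates, size):
            got = {frozenset(p) for c in choice for p in combinations(c, 2)}
            if todo <= got:
                return size
    return None  # unreachable: single edges always form a valid cover
```

On a triangle the result is 1, on a 4-cycle it is 4, and marking one edge of the 4-cycle as already covered reduces the cost to 3.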

LEAF NODE. Suppose $i$ is a leaf node. Then, by the definition of a nice tree decomposition, $|X_i| = 1$, and thus $C_i(S) = 0$ for all $S \subseteq E(X_i)$, trivially.

FORGET NODE. Suppose $i$ is a forget node. Then, it has one child $j$ with $X_j = X_i \cup \{v\}$ for some unique vertex $v$.

Lemma 5.7. If $i$ is a Forget node, then for any subset $S \subseteq E(X_i)$, $C_i(S) = C_j(S)$.

Proof. Note that since $X_i \subset X_j$, the corresponding graphs $G_i$ and $G_j$ are the same. Furthermore, since $v$ is not in $X_i$, $S \cap \delta(v) = \emptyset$. Therefore, $S \cap E(X_i) = S \cap E(X_j)$ and we have $C_i(S) = C_j(S)$.

INTRODUCE NODE. Suppose $i$ is an introduce node. Then, it has one child $j$ such that $X_i = X_j \cup \{v\}$ for some unique vertex $v$. Consider an arbitrary clique cover $\mathcal{W}$ for $G_i(S)$. Since the cliques in $\mathcal{W}$ can be partitioned into $Q_v$, the cliques containing $v$, and $\overline{Q}_v = \mathcal{W} - Q_v$, the following recurrence relation holds for each introduce node.

Lemma 5.8. If $i$ is an Introduce node, then for any subset $S \subseteq E(X_i)$:

$$C_i(S) = \min\{\, |Q_v| + C_j(S \cup E(Q_v)) : Q_v \text{ is a clique cover for } \delta(v) - S \,\}$$

Proof. Let $\mathcal{W}$ be a clique cover for $G_i(S)$. We partition the cover $\mathcal{W} = Q_v \cup \overline{Q}_v$ as defined, and consider the edge set $E(Q_v)$ of the cliques in $Q_v$. Since these edges are covered by $Q_v$, $\overline{Q}_v$ needs only cover $G_i - (S \cup E(Q_v))$. Moreover, since the cliques in $\overline{Q}_v$ do not contain $v$, $\overline{Q}_v$ is a cover for $G_j - (S \cup E(Q_v))$. Therefore, $|\overline{Q}_v| \ge C_j(S \cup E(Q_v))$, and the lemma follows.

To compute $C_i(S)$, we need to consider all possible clique covers, $Q_v$, for $\delta(v) - S$. Since $|X_i| \le k$, this is the number of ways to partition $k$ vertices, and is given by the $k$th Bell number $B(k) \le k!\,2^k$.
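The Bell numbers grow quickly, which is why the Introduce step dominates for larger bags. The sketch below computes them with the Bell triangle and enumerates the set partitions themselves; both helpers are generic illustrations with hypothetical names, not the thesis implementation.

```python
def bell(n):
    """n-th Bell number, computed with the Bell triangle."""
    row = [1]
    for _ in range(n):
        nxt = [row[-1]]          # each row starts with the last entry of the previous row
        for x in row:
            nxt.append(nxt[-1] + x)
        row = nxt
    return row[0]


def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        # put `first` into each existing block in turn ...
        for idx in range(len(part)):
            yield part[:idx] + [[first] + part[idx]] + part[idx + 1:]
        # ... or into a new block of its own
        yield [[first]] + part
```

The number of partitions enumerated for a set of size k equals `bell(k)`, so the two helpers can be cross-checked against each other.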

JOIN NODE. Finally, suppose $i$ is a join node. Then, it has two child nodes $j_1$ and $j_2$ such that $X_i = X_{j_1} = X_{j_2}$, and $G_i = G_{j_1} \cup G_{j_2}$. Therefore, a clique cover for $G_i$ contains cliques that belong to $G_{j_1}$ or $G_{j_2}$. We need to ensure that cliques belonging to both $G_{j_1}$ and $G_{j_2}$ are not double counted.

Lemma 5.9. Let $S \subseteq E(X_i)$ be the set of already covered edges, and let $R = E(X_i) - S$ be the edges to be covered. Then,

$$C_i(S) = \min\{\, C_{j_1}(S \cup R_2) + C_{j_2}(S \cup R_1) : R_1 \subseteq R \text{ and } R_2 = R - R_1 \,\}$$

Proof. Assuming $S$ is already covered, let $\mathcal{W}$ be a minimum clique cover for $G_i(S)$ with cost $C_i(S)$. By definition, any clique that belongs to both $G_{j_1}$ and $G_{j_2}$ must also belong to $X_i$. Thus, the cliques in $\mathcal{W}$ can be partitioned as $\mathcal{W} = Q_1 \cup Q_2 \cup Q_3$, where

$$Q_1 = \{q \in \mathcal{W} \mid q \subseteq V_{j_1} \text{ and } q \not\subseteq X_i\}$$

$$Q_2 = \{q \in \mathcal{W} \mid q \subseteq V_{j_2} \text{ and } q \not\subseteq X_i\}$$

$$Q_3 = \{q \in \mathcal{W} \mid q \subseteq X_i\}.$$

Thus, $|\mathcal{W}| = |Q_1| + |Q_2| + |Q_3|$. Now, the edge set $R$ can be partitioned into $R = R_1 \cup R_2$ such that:

$$R_1 = \{e \in R \mid e \text{ covered by } Q_1 \text{ or } Q_3\}$$

$$R_2 = \{e \in R \mid e \text{ covered only by } Q_2\} = R - R_1$$

By definition, $Q_1 \cup Q_3$ needs to cover the edges in $R_1$. Furthermore, $Q_1$ needs to cover the edges in $G_{j_1} - E(X_{j_1})$, and thus $|Q_1 \cup Q_3| \ge C_{j_1}(S \cup R_2)$. On the other hand, the cliques in $Q_2$ only need to cover the edges in $R_2$ together with $G_{j_2} - E(X_{j_2})$, and thus $|Q_2| \ge C_{j_2}(S \cup R_1)$, and the result follows.

Therefore, we can compute $C_i(S)$ for any given subset $S \subseteq E(X_i)$. Observe that the recurrence relation looks at all possible bipartitions of $R$. Since the number of edges in $E(X_i)$ is at most $\binom{k}{2}$, we need to check at most $2^{\binom{k}{2}}$ different partitions of $R$. Once the bipartition of $R$ is fixed, it takes constant time to look up the values from $C_{j_1}$ and $C_{j_2}$.
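The join recurrence of Lemma 5.9 amounts to a minimum over all bipartitions of the uncovered bag edges. The sketch below assumes, purely for illustration, that each child's table is a dictionary keyed by `frozenset`s of bag edges; the names `subsets` and `join_cost` are hypothetical.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for r in range(len(items) + 1):
        for chosen in combinations(items, r):
            yield frozenset(chosen)


def join_cost(S, bag_edges, table_j1, table_j2):
    """C_i(S) at a join node: minimise over bipartitions R = R1 + R2 of the uncovered edges.

    table_j1 and table_j2 map frozensets of bag edges to the children's costs.
    """
    S = frozenset(S)
    R = frozenset(bag_edges) - S
    # Child j1 is told that R2 = R - R1 is covered elsewhere, child j2 that R1 is.
    return min(table_j1[S | (R - R1)] + table_j2[S | R1] for R1 in subsets(R))
```

Each lookup is a constant-time dictionary access, so the cost of one call is proportional to the number of subsets of R.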

Note that, while these recurrence relations calculate the size of clique covers, they

are also constructive: a little bookkeeping at each node will allow us to construct the

optimal cover at the root node.

5.2.1 Running Time of Treewidth-based Algorithm

For each node $i \in I$, we compute $C_i(S)$ for every $S \subseteq E(X_i)$. Since $\mathrm{tw}(G) = k$, each node contains at most $k$ vertices. Let $\rho$ denote the maximum number of edges induced in any node. Then we need to consider $2^{\rho}$ cases. Then, for each fixed $S \subseteq E(X_i)$, we carry out one of the four recurrence relations. Leaf nodes and Forget nodes can be computed in constant time. An Introduce node can be computed in $O(B(k) \cdot k)$ time, where $B(k)$ is the $k$th Bell number. Finally, a Join node takes $O(2^{\rho})$ time to compute. Therefore, the dynamic programming algorithm takes $2^{\rho} \cdot \max\{B(k) \cdot k,\ 2^{\rho}\} \cdot O(n) = O(4^{\rho} n)$ time overall. While $\rho$ can be as large as $\binom{k}{2}$ in theory, this is rarely the case as shown in our experimental tests (see Section 5.4, Table 5.1).

Theorem 5.10. There is a linear time algorithm for computing minimum clique cover for

graphs with fixed treewidth k .

5.2.2 Modifications for Clique Partition

It is straightforward to modify the above algorithm to solve the clique partition problem:

instead of assuming that edges already covered can be re-used to form other cliques,

we simply delete those edges and solve for the remaining edge set. If we redefine the cost function $C_i(S)$ to be the size of the minimum clique partition for the graph $G_i - S$, the same recurrence holds for each node type. The only difference is that, at an Introduce node, when finding a local solution for $\delta(v)$, we look for a clique partition rather than a cover.

5.3 Planar Clique Cover

In this section, we study the clique cover problem restricted to planar graphs, and present

a PTAS for planar graphs. While planar graphs are possibly the most restricted class of

interest for the clique cover problem (the largest clique being just K4), the problem re-

mains NP-hard [136]. Furthermore, as treewidth is unbounded in planar graphs [68,

120], simply applying our algorithm from Section 5.2 would result in an exponential

running time.

Instead, we shall design an exact polynomial time algorithm for planar graphs with

bounded branchwidth. As we shall see, the algorithm runs in $2^{O(k)} \cdot n$ time with $k$ being

the branchwidth, and since $\mathrm{bw}(G) \le \sqrt{4.5n}$ when $G$ is planar, this would be the first

subexponential algorithm for the clique cover problem on planar graphs.

Then, we will use our exact algorithm to construct a polynomial time approximation

scheme. Baker [8] has proposed a divide-and-conquer technique to design approxima-

tion schemes for various optimization problems on planar graphs. We will show that her


technique can be applied to the planar clique cover problem, using our exact algorithm

as a subroutine, resulting in a (1+ε) approximation algorithm.

5.3.1 Clique Cover for Planar Graphs with Bounded Branchwidth

As with its counterpart, treewidth, it is NP-complete to determine if a graph has a branch

decomposition of width at most k in general, but this decomposition can be found in

linear time when k is fixed. We thus assume that the input graph G is given together

with a branch decomposition $(T, \varphi)$ of width at most $k$. Now pick an arbitrary edge $e$ of $T$, and subdivide it to create a root node $r$. Then each tree node $X$ is associated with a subset of edges in $E(G)$, namely those mapped by $\varphi$ to the leaf nodes of the subtree rooted at $X$. We let $E(S)$ denote the edges in the subgraph induced by a subset of vertices $S$.

Define the middle-set of $X$, denoted by $\mathrm{mid}(X)$, to be the middle-set of the edge between $X$ and its parent node. Since the root node $r$ has no parent, set $\mathrm{mid}(r) = \emptyset$. Then, we create a table $W_X[\cdot]$ indexed by a subset $\mathcal{F}$ of edges as follows:

$W_X[\mathcal{F}]$ = minimum size of a clique cover for the edges in $X$ with $\mathcal{F}$ already covered.

Since $\mathrm{mid}(X)$ is a cutset in $G$, we can paste together solutions from each subproblem by computing only the entries $W_X[\mathcal{F}]$ where $\mathcal{F}$ is a subset of edges in $E(\mathrm{mid}(X))$.

Before describing the recurrence relation for this table, we study the middle-set of

three adjacent edges. Consider the sphere-cut branch decomposition for planar graphs,

as studied by Dorn et al. [41]; here, each middle-set defines a closed curve (namely, a

noose) on the planar embedding of the input graph that intersects only the vertices in

the middle-set. Let $X$ be a tree node with two child nodes $X_L$ and $X_R$, and a parent node $X_P$. The three edges adjacent to $X$ define 3 middle-sets, which we denote by $O_P$, $O_L$, $O_R$ for the parent edge, left and right child edge, respectively. Since $O_R - (O_P \cup O_L) = \emptyset$ and $O_L - (O_P \cup O_R) = \emptyset$, the vertices of $O_P \cup O_L \cup O_R$ can be partitioned as follows:

• Portal vertices $P = O_L \cap O_R \cap O_P$

• Intersection vertices $I = O_L \cap O_R - P$

• Symmetric Difference vertices $D = O_P - (P \cup I)$

Figure 5.2 gives an example of a sphere cut decomposition at a tree node $X$.

Figure 5.2: A sphere cut decomposition at a node $X$ and its planar embedding. (a) A tree node $X$ has three adjacent edges, each of which defines a middle-set that forms a simple curve called a noose; (b) The middle-set of $O_P$ is drawn as solid lines. Inside the noose $O_P$ are $O_L$ and $O_R$. Here, portal vertices $P$ are drawn as red circles, intersection vertices are drawn as blue squares, and symmetric difference vertices are drawn as green diamonds.

Lemma 5.11. Let $X_1$ and $X_2$ denote the two children of $X$ (that is, $X_L$ and $X_R$ above). The table $W_X[\mathcal{F}]$ can be calculated as

$$W_X[\mathcal{F}] = \min\{\, W_{X_1}[\mathcal{F} \cup F_2] + W_{X_2}[\mathcal{F} \cup F_1] : F_1 \cup F_2 \text{ is a partition of } E(I \cup P) \,\}$$

Proof. For an arbitrary clique cover for $X$ with $\mathcal{F}$ already covered, consider the cliques covering the edges $E(I \cup P)$. Observe that, by planarity of $G$, any clique intersecting with $I \cup P$ contains either only vertices of $X_1$ or only vertices of $X_2$. Therefore, we can partition the edges in $E(I \cup P)$ into $F_1$ and $F_2$, where $F_1$ is to be covered by cliques in $X_1$, and $F_2$ is covered by cliques in $X_2$. For each partition, the solution from the subproblem $W_{X_1}[\mathcal{F} \cup F_2]$ together with the solution from $W_{X_2}[\mathcal{F} \cup F_1]$ gives the solution for $W_X[\mathcal{F}]$. Since we consider all possible partitions of $E(I \cup P)$, the lemma follows.


The algorithm runs in a bottom-up manner. Observe that to compute each state $\mathcal{F}$ of a node, we need to consider all bipartitions of $E(I \cup P)$. Since $I \cup P \subseteq \mathrm{mid}(X_1)$ and a planar graph on at most $k$ vertices has $O(k)$ edges, $|E(I \cup P)| = O(k)$, and thus there are $2^{O(k)}$ such partitions. Moreover, to compute for all states $\mathcal{F}$, we consider $2^{O(k)}$ subsets of edges in $E(\mathrm{mid}(X))$. Altogether, the above dynamic programming algorithm runs in $2^{O(k)} O(n)$ time.

Lemma 5.12. There is a linear time algorithm to compute a minimum clique cover for

planar graphs with bounded branchwidth.

5.3.2 Modifications for Clique Partition

As with our treewidth-based algorithm, the above algorithm can be easily modified to

solve the clique partition problem: rather than assuming $\mathcal{F}$ is already covered, one can

simply delete those edges and solve for the remaining edges.

5.3.3 Baker’s Technique on Planar Graphs

Baker [8] has proposed a general approach to design approximation algorithms for var-

ious NP-hard problems on planar graphs. Here, we show how this technique, together

with our exact algorithm in Section 5.3.1, results in a (1+ε) approximation algorithm.

Baker’s technique is a divide-and-conquer approach, where the input graph is de-

composed into layers of subgraphs defined by the distance from a chosen vertex. Ap-

plying this technique to the planar clique cover problem, we obtain Algorithm 5.3.1. In

short, we pick an arbitrary vertex r , and define the level of each vertex of G −{r } as dis-

tance from r . Then, we slice up the graph into layers of subgraphs, where each subgraph

is solved independently. The clique covers for the layers are then merged to construct a

solution for the entire graph G . The edges on the boundary of each layer are covered by

cliques from adjacent layers, which can be bounded to obtain an approximation ratio.

See Algorithm 5.3.1 for details, and Figure 5.3 shows a schematic of the approach.


Algorithm 5.3.1: PTAS for Planar Clique Cover

Input: A planar graph G = (V, E), 0 < ε < 1
Output: A clique cover C of G.

r ← an arbitrary vertex of G;
foreach vertex v ∈ V \ {r} do
    level(v) ← distance from r;
end
k ← ⌈2/ε⌉;
for i = 0, 1, ..., k − 1 do
    for j = 0, 1, 2, ... do
        G_ij ← subgraph of G induced by the vertices at levels jk + i through (j + 1)k + i;
        C_ij ← minimum clique cover for G_ij;
    end
end
for i = 0, 1, ..., k − 1 do
    C_i ← ⋃_j C_ij;
end
Return a clique cover from {C_0, ..., C_{k−1}} with the minimum size.
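The layering step of the scheme can be sketched as follows. This is an illustration only, under several assumptions: the graph is given as an adjacency dictionary, the root is placed at level 0, and for shifts i > 0 an initial slice starting below level i is added so that the lowest levels are not dropped. The function names are hypothetical, and the exact solver applied to each slice is omitted.

```python
from collections import deque


def bfs_levels(adj, root):
    """Distance from `root` to every reachable vertex (root has level 0)."""
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in level:
                level[w] = level[u] + 1
                queue.append(w)
    return level


def baker_slices(adj, root, k):
    """For each shift i in 0..k-1, the vertex sets of slices spanning levels lo..lo+k.

    Consecutive slices overlap in exactly one level (congruent to i mod k),
    so every edge lies inside at least one slice.
    """
    level = bfs_levels(adj, root)
    top = max(level.values())
    slices = {}
    for i in range(k):
        parts = []
        lo = i - k if i > 0 else 0  # assumed: start early so levels below i are kept
        while lo < top:
            parts.append({v for v, l in level.items() if lo <= l <= lo + k})
            lo += k
        slices[i] = parts
    return slices
```

Each slice would then be handed to the exact branchwidth-based algorithm, and the per-shift unions compared by size.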

Lemma 5.13. Algorithm 5.3.1 finds a clique cover of size at most $(1+\varepsilon)\,\mathrm{OPT}$.

Proof. Let $Q^*$ denote the optimum solution, and given some congruence class $i \bmod k$, let us assume that the above approximation scheme divides the problem into $m$ pieces. Since each piece is solved exactly, we have

$$|C_i| = \sum_{j=1}^{m} |C_{ij}| \le \sum_{j=1}^{m} |Q^*_{ij}|,$$

where $Q^*_{ij}$ denotes the optimum solution restricted to the graph $G_{ij}$. We therefore need to compare the two values $\sum_{j=1}^{m} |Q^*_{ij}|$ and $|Q^*|$. To compare these, consider the number of cliques in $Q^*$ that contain vertices at levels congruent to $i \bmod k$. For planar graphs, each clique belongs to at most 2 consecutive levels. Therefore, for at least one value of $i$, $0 \le i \le k-1$, there are at most $\frac{2}{k}|Q^*|$ cliques containing a vertex at a level congruent to $i \bmod k$. Since these cliques are double counted in the sum $\sum_{j=1}^{m} |Q^*_{ij}|$, we have

$$\sum_{j=1}^{m} |Q^*_{ij}| \le \left(1 + \frac{2}{k}\right)|Q^*|$$

and it follows that $|C_i| \le (1 + \frac{2}{k})|Q^*|$. Since $k = \lceil 2/\varepsilon \rceil$, we have $2/k \le \varepsilon$, which gives the claimed bound.

Figure 5.3: Planar clique cover using Baker's technique. (a) For each value of $0 \le i \le k-1$, the graph is sliced into layers; (b) Each layer is treated as an independent subproblem; (c) Each layer is solved using our clique cover algorithm on planar graphs with bounded branchwidth; (d) The solutions from the layers are pasted together. The number of edges doubly-covered by cliques from adjacent layers can be bounded to give the approximation ratio.

What remains now is an algorithm to compute the clique cover for each $G_{ij}$. Note that Lemma 5.12 provides an algorithm for planar graphs with bounded branchwidth, and thus it suffices to show that each $G_{ij}$ also has bounded branchwidth. Tamaki's theorem [132] directly provides an answer.

Definition 5.14. Given a plane-embedded graph $G$, the face-incidence graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ of $G$ has one vertex in $\mathcal{V}$ for each face of $G$, and two vertices in $\mathcal{V}$ are joined by an edge if and only if the corresponding faces in $G$ share a vertex.
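Definition 5.14 can be illustrated with a short sketch that builds the face-incidence graph from a list of faces and computes its radius, the quantity that bounds the branchwidth in the theorem that follows. The input format (each face given as a collection of its boundary vertices) and the function names are assumptions made for this example.

```python
from collections import deque
from itertools import combinations


def face_incidence_graph(faces):
    """Edges (a, b) between face indices whose faces share at least one vertex."""
    face_sets = [set(f) for f in faces]
    return {(a, b)
            for a, b in combinations(range(len(face_sets)), 2)
            if face_sets[a] & face_sets[b]}


def graph_radius(n, edges):
    """Minimum eccentricity over vertices 0..n-1 (infinite if disconnected)."""
    adj = {i: set() for i in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    def eccentricity(s):
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        return max(dist.values()) if len(dist) == n else float("inf")

    return min(eccentricity(s) for s in range(n))
```

For a plane embedding of $K_4$, all four triangular faces pairwise share vertices, so the face-incidence graph is complete and has radius 1.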

Theorem 5.15. [132] Given a planar graph G , there is a linear time algorithm that finds

a branch decomposition with width at most the radius of the face-incidence graph of G .

Since the face-incidence graph of $G_{ij}$ has a bounded radius, each $G_{ij}$ is a planar graph with bounded branchwidth. Therefore, our algorithm from Lemma 5.12 can compute the minimum clique cover for each subgraph $G_{ij}$.

Recall that our dynamic programming algorithm for graphs with branchwidth $\le k$ runs in time $2^{O(k)} O(n)$. Due to Theorem 5.6, directly applying this algorithm to planar graphs gives an exact solution in time $2^{O(\sqrt{n})} O(n)$. On the other hand, the PTAS using Baker's technique involves decomposing the graph into $n/k$ layers, and solving each piece exactly in time $2^{O(k)} O(n/k)$. Trying every congruence class between $0$ and $k-1$, the overall algorithm takes $k \cdot \lceil n/k \rceil \cdot 2^{O(k)} O(n/k) \approx 2^{O(k)} O(n^2/k)$ time to obtain a solution of value at most $(1 + \frac{2}{k})\,\mathrm{OPT}$. Therefore, while our PTAS provides a range of approximation ratios, if one wants to get a solution any closer than $(1 + \frac{1}{\sqrt{n}})\,\mathrm{OPT}$, one may be better off running our dynamic programming algorithm directly on the graph to obtain an exact solution. This trade-off is clearly shown empirically in Table 5.2.
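The crossover point mentioned above can be made explicit by substituting the target accuracy into the two running-time bounds. The following is a worked sketch; the labels $T_{\mathrm{PTAS}}$ and $T_{\mathrm{exact}}$ are introduced here only for illustration, and the comparison assumes that the constants hidden in the exponents are comparable.

```latex
\begin{align*}
\frac{2}{k} = \frac{1}{\sqrt{n}} &\iff k = 2\sqrt{n},\\
T_{\mathrm{PTAS}}\big(k = 2\sqrt{n}\big) &= 2^{O(k)}\, O\!\left(\frac{n^{2}}{k}\right)
  = 2^{O(\sqrt{n})}\, O\!\left(n^{3/2}\right),\\
T_{\mathrm{exact}} &= 2^{O(\sqrt{n})}\, O(n).
\end{align*}
```

Under these assumptions, once $k$ reaches the order of $\sqrt{n}$ the PTAS no longer has an asymptotic advantage over the exact algorithm.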

Theorem 5.16. There is a PTAS for planar clique cover.


We have implemented the described algorithms and tested them against both real bi-

ological PPI data and simulated data. As a comparison, we considered Gramm et al.’s

algorithm [66] for the decision problem version of clique cover: given the size of clique


cover k as an input parameter, their algorithm works by first applying a set of reduc-

tion rules to reduce the problem instance, and then using a search tree algorithm on the

reduced instance, in time exponential in k . While their algorithm works well in cases

where the solution size is small, it performed poorly against our test data which con-

tains several hundreds of cliques. This is mainly because their reduction rules did not

significantly decrease the size of our input graphs (especially biological networks): the

solutions for the reduced instances would still contain a large number of cliques, result-

ing in inefficient running time for the search tree algorithm. As their input parameter is

the size of the minimum clique cover, it motivates our development of approaches using

edge sparsity for these networks.

5.4.1 Simulated and Biological Networks

Each of our algorithms was tested on both simulated and actual biological networks.

First, Krogan et al. [93] obtained an extensive dataset on yeast protein interactions. Tak-

ing the largest connected component from the dataset, with 323 vertices and 742 edges,

we formed a model network, $G_{Krogan}$. Various studies have shown that PPI networks exhibit the properties of scale-free networks [10]. Many generative models for scale-free networks have also been proposed, and we used the two most frequently used models [10, 32] to create test graphs for our algorithm: (1) the preferential attachment model (denoted by $G_{PAM}$), and (2) the duplication model (denoted by $G_{DM}$). In both cases, we set the parameters of the generative models so that the resulting networks show characteristics similar to those of real PPI data: a density of $|E| \approx 2|V|$, and a degree distribution $P(k) \sim k^{-\gamma}$ where $\gamma \approx 1.7$.
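For readers unfamiliar with the preferential attachment model, a minimal generator is sketched below. It is a textbook-style illustration, not the generator used for the experiments: the seed clique, the parameter `m` (edges added per new vertex), and the function name are assumptions, and no attempt is made here to tune the degree exponent. Choosing `m = 2` gives roughly $|E| \approx 2|V|$.

```python
import random


def preferential_attachment(n, m, rng):
    """Grow a graph on vertices 0..n-1 (n >= m + 1).

    Each new vertex links to m distinct existing vertices, chosen with
    probability proportional to their current degree.
    """
    # Seed: a clique on the first m + 1 vertices (an assumption of this sketch).
    edges = {(u, v) for u in range(m + 1) for v in range(u + 1, m + 1)}
    # `ends` lists every vertex once per incident edge, so a uniform draw
    # from it is a degree-proportional draw.
    ends = [x for e in sorted(edges) for x in e]
    for v in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(ends))
        for u in sorted(targets):
            edges.add((u, v))
            ends += [u, v]
    return edges
```

With `n = 300` and `m = 2`, the graph has 3 + 297 · 2 = 597 edges.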

To investigate the behaviour of our algorithms on denser graphs, we generated a set

of partial k -trees. Recall that a k -tree is a maximal graph with treewidth k such that no

edge can be inserted without increasing its treewidth, and a graph is a partial k -tree if it

is a subgraph of a k -tree. The partial k -trees of given treewidth have been generated by

first generating a k -tree, and randomly removing edges to obtain desired edge density.


Graph    n    m    tw   ρ    tree decomp. (hh:mm)   clique cover algo. (hh:mm)   # of cliques
Krogan   323  742  11   21   2:13                   6:08                         482
PAM      300  634  14   29   2:41                   9:21                         421
DM       300  794  21   43   4:13                   17:47                        583

Table 5.1: Performance of treewidth-based exact clique cover algorithm; we show the treewidth of each graph, maximum # of edges per tree node (ρ), time taken to compute an optimal tree decomposition and a clique cover, and size of the solution.

Since the generation process of k -trees is similar to that of the preferential attachment

model, this allows us to create graphs with higher density than GPAM while preserving

low treewidth.

5.4.2 Performance of the Treewidth and Branchwidth-based Algorithms

Table 5.1 reports results obtained on real and simulated biological networks. Both Kro-

gan’s PPI network and simulated networks exhibit relatively low treewidths for their size.

Figure 5.4 gives a more comprehensive view of the running times from empirical testing.

As expected, the running time increases linearly with n for graphs with fixed treewidths

(Figure 5.4(a)), but exponentially with treewidth k for graphs with fixed n (Figure 5.4(b)).

Running times for partial k -trees (Figures 5.4(c) and (d)) follow the same trends, al-

though they are somewhat higher due to the higher edge density.

While our branchwidth-based exact algorithm was designed for planar graphs, it

is easy to modify the algorithm to handle non-planar graphs with bounded branch-

width (but at the expense of higher time complexity due to non-planarity). Figure 5.5

shows that, in practice, the running time of the treewidth-based algorithm grows slower

than that of the branchwidth-based algorithm, allowing it to handle graphs with larger

treewidths.


[Figure 5.4: four plots of running time in seconds. Panels (a) and (c) plot time against the number of vertices for tw = 8, 9, 10, 11, 12; panels (b) and (d) plot time against treewidth (8 to 12) for several graph sizes n.]

Figure 5.4: Performance of the treewidth-based algorithm on simulated networks: (a) scale-free networks (PAM) with fixed treewidth; (b) scale-free networks (PAM) with fixed graph size; (c) partial k-trees with fixed treewidth; (d) partial k-trees with fixed graph size.


[Figure 5.5: two plots of log2(time) against treewidth (8 to 12) for the tw-based and bw-based algorithms. Fitted lines shown in panel (a): y = 1.4219x − 0.2377 (R² = 0.9914) and y = 1.0604x + 0.8255 (R² = 0.9789). Fitted lines shown in panel (b): y = 1.3651x + 1.3206 (R² = 0.9838) and y = 0.8246x + 4.5411 (R² = 0.9959).]

Figure 5.5: Performance comparison of treewidth-based algorithm vs. branchwidth-based algorithm on scale-free networks; y-axes are shown in log scale to exemplify the difference in exponents in the running time. (a) scale-free networks with 50 vertices; (b) scale-free networks with 100 vertices; data points with the same treewidth values in each chart are taken from the same network.

5.4.3 Performance of PTAS for Planar Graphs

To test our PTAS on planar graphs, we generated a set of random planar graphs us-

ing the simple algorithm by Denise and Vasconcellos [40]. Then, we ran both our ex-

act branchwidth-based algorithm and the PTAS with varying values of ε to explore the

trade-off between running time and quality of the solution. As shown in Table 5.2, the

promised approximation ratios are almost exactly realized and substantial speed-ups

are obtained for relatively large values of ε, compared to the exact branch-decomposition

based algorithm. The running time increases exponentially with 1/ε, and the exact al-

gorithm starts to become faster than the PTAS when ε becomes sufficiently small.
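The claim that the promised ratios are realized can be checked directly from the figures reported in Table 5.2. The short sketch below uses only those reported clique counts; the variable and function names are illustrative.

```python
# (epsilon, number of cliques found by the PTAS), as reported in Table 5.2
ptas_rows = [(0.5, 159), (0.2, 124), (0.1, 116), (0.05, 113)]
opt_size = 109  # size of the optimal cover reported in Table 5.2


def realized_ratios(rows, opt):
    """Return (epsilon, realized ratio, promised ratio 1 + epsilon) per row."""
    return [(eps, size / opt, 1 + eps) for eps, size in rows]


for eps, realized, promised in realized_ratios(ptas_rows, opt_size):
    print(f"eps={eps}: realized {realized:.3f}, promised {promised:.2f}")
```

In every row the realized ratio stays below the promised bound of 1 + ε.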

ε      # of cliques   time (hh:mm:ss)
0.5    159            00:12:08
0.2    124            00:39:17
0.1    116            02:36:18
0.05   113            03:53:42
OPT    109            02:46:31

Table 5.2: Performance of PTAS for planar graph with n = 200, m = 407, tw = 10. OPT was obtained using the branch-decomposition based algorithm.

5.4.4 Clique Cover in Biological Networks

When executed on the yeast protein-protein interaction network of Krogan et al. [93], our treewidth-based algorithm finds a clique cover that includes 93 cliques of size 5 or more. While the PPI data may not admit a unique clique cover on the network, we manually verified that most discovered cliques correspond to known complexes, such as the

RNA polymerase II, the RSC complex, the mediator complex, and the 20S proteasome.

In addition, three highly overlapping complexes, SWR1 (a chromatin remodelling com-

plex), NuA4 (a histone acetyltransferase complex), and INO80 (another chromatin re-

modelling complex) are correctly identified, despite the fact that SWR1 and NuA4 share

three subunits (ARP4, GOD1, and YAF9) and SWR1 and INO80 share four (ARP4, GOD1,

RVB1, RVB2). This suggests that our algorithm is capable of identifying biologically rel-

evant protein complexes, even those that share a significant number of subunits.

5.5 Discussions

In this chapter, we studied the clique cover problem on sparse networks as measured

by treewidth and branchwidth, with an application to protein-complex discovery in PPI

networks. We gave exact polynomial-time algorithms for graphs with bounded treewidth

and bounded branchwidth, and built on the latter using Baker’s technique to obtain a

polynomial time approximation scheme for planar graphs.

Our empirical studies show that the biological networks as well as synthetic net-

works with similar characteristics (e.g. edge density, degree distribution) indeed exhibit

low treewidth, and our algorithms showed practical running times on these test net-

works. Moreover, our branchwidth based PTAS algorithm shows practical running time

for computing solutions close to the optimal.


In proteomics research, existing experimental methods for detecting binary interac-

tions often suffer from false negatives, i.e., some edges are not detected in the experi-

ments. Therefore, in the direction towards hypergraph modelling of PPI networks, one

may wish to cover the edges with quasi-cliques: Here, the optimization function needs

to be modified slightly. One possible formulation may be to find a minimum cardinal-

ity quasi-clique cover, where a quasi-clique is defined by some lower bound on the edge

density. On the other hand, there may be classes of graphs other than the ones discussed

here that admit polynomial time exact algorithms, for example, graphs with bounded

genus or bounded degree.
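The density-based notion of a quasi-clique mentioned above can be stated in a few lines. This is one possible formalization, sketched under the assumption that density is measured as the fraction of present edges among all vertex pairs; the threshold name `gamma` and the function names are hypothetical.

```python
def edge_density(vertices, edges):
    """Fraction of vertex pairs within `vertices` that are joined by an edge."""
    vs = set(vertices)
    n = len(vs)
    if n < 2:
        return 1.0  # convention of this sketch: trivial sets count as complete
    inside = {frozenset(e) for e in edges
              if set(e) <= vs and len(set(e)) == 2}
    return len(inside) / (n * (n - 1) / 2)


def is_quasi_clique(vertices, edges, gamma):
    """True if the induced subgraph has edge density at least gamma."""
    return edge_density(vertices, edges) >= gamma
```

For example, $K_4$ with one edge missing has density 5/6, so it is a quasi-clique for a threshold of 0.8 but not for 0.9.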


The results discussed in this chapter have been published in [14].

Chapter 6

Conclusion and Future Directions

Researchers in systems biology strive to uncover complex interactions in biological sys-

tems, and protein-protein interactions lie at the heart of their efforts. As opposed to the

classical reductionist paradigm, systems biologists consider biological phenomena as a

complex system, and thus high throughput experimental technologies are vital to their

studies. Indeed, the recent development of high throughput techniques has provided a

huge momentum – the amount of PPI data available is rapidly growing, and large-scale

studies of the human proteome are also underway.

Unfortunately, high throughput technologies remain in their infancy, and produce a

large amount of data with relatively poor accuracy. As a result, the accumulating data in

the literature presents us with both an opportunity and a challenge. While the analysis

of the PPI networks provides invaluable information on the inner workings of the cell,

the significant amount of noise within the data makes it difficult to handle. AP-MS is a

good example of such a technology that suffers from noisy output data.

Noise in proteomics data can often be classified into two distinct types [60]: stochas-

tic errors and systematic errors. Stochastic errors are measurement errors with random

variability, which can be reduced by simply repeating the experiment. Systematic er-

rors, on the other hand, are recurrent in the measurements, and thus require a careful

modelling of every step of the experiment in order to correctly interpret the data.

In this thesis, we considered the overall pipeline of the AP-MS method, and looked

for sources of systematic errors. Here we highlight the novel contributions put forward

in this thesis, and discuss future research directions.

Protein Quantification with Shared Peptides. Chapter 3 is concerned with protein

quantification from quantitative MS data. While significant progress has been made, the

problem of protein identification (and quantification) remains unresolved due to

shared peptides prevalent in many MS datasets [108]. We proposed several approaches

to protein quantification in the presence of shared peptides by defining a set of opti-

mization problems based on a clean combinatorial model.

One of the limitations in our approach comes from the assumption that every pep-

tide can be detected by MS equally well – and thus the errors in the peptide abundances

are equally likely. While we have empirically tested the robustness of our algorithms by

varying the coverage of the data (see Section 3.4.1), an approach that directly models the

noisy MS data would likely yield more practical results.

Open Problem 6.1. Does there exist an approach for protein quantification with shared

peptides that incorporates peptide detectability?

Peptide detectability may be estimated either using a suitably large MS/MS dataset [2,

133] or using the physical properties of each peptide [99]. In Section 3.5, we discussed

possible ways to extend our combinatorial approach by incorporating peptide detectabil-

ity as weighted errors. Alternatively, we may also attack this problem using a probabilis-

tic framework where observed peptide abundances are thought to be random variables

that incorporate detectability. Then, we can look for protein abundances maximizing

the probability of observing the given peptide abundances using various Bayesian ap-

proaches.
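
One way to make the probabilistic idea above concrete, offered only as a sketch: assume each observed peptide abundance equals its detectability times the total abundance of the proteins containing it, plus Gaussian noise of equal variance. Under that assumption, maximizing the likelihood reduces to a non-negative least-squares fit, which the snippet below solves with projected gradient descent. The noise model, the function name fit_abundances, and the step and iterations defaults are assumptions made for illustration; this is a maximum-likelihood stand-in rather than a full Bayesian treatment.

```python
def fit_abundances(peptides, observed, detectability, n_proteins,
                   step=0.1, iterations=2000):
    """Estimate non-negative protein abundances from peptide abundances.

    peptides[j]      : indices of the proteins that contain peptide j
    observed[j]      : measured abundance of peptide j
    detectability[j] : assumed detectability of peptide j, in (0, 1]
    Minimizes sum_j (observed[j] - detectability[j] * sum_i x[i])**2,
    where i ranges over peptides[j], subject to x >= 0.
    The step size must be small enough for the data at hand.
    """
    x = [0.0] * n_proteins
    for _ in range(iterations):
        grad = [0.0] * n_proteins
        for members, y, d in zip(peptides, observed, detectability):
            residual = y - d * sum(x[i] for i in members)
            for i in members:
                grad[i] -= 2.0 * d * residual
        # Gradient step followed by projection onto the non-negative orthant.
        x = [max(0.0, xi - step * g) for xi, g in zip(x, grad)]
    return x


# Two proteins; the middle peptide is shared by both.
peptides = [[0], [0, 1], [1]]
detectability = [1.0, 0.5, 0.8]
true_abundance = [10.0, 4.0]
observed = [d * sum(true_abundance[i] for i in members)
            for members, d in zip(peptides, detectability)]
estimate = fit_abundances(peptides, observed, detectability, n_proteins=2)
print(estimate)
```

On this noise-free toy input the fit recovers the abundances used to generate the data; with real measurements the detectability values would themselves have to be estimated, for instance as in [2, 99, 133].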

Predicting Direct PPI Network. In Chapter 4, we studied the problem of distinguish-

ing direct interactions from indirect ones within the PPI data from AP-MS experiments.

We tackled this problem by proposing a probabilistic graph model, and formulated the

DIGCOM problems to identify the direct interaction network from quantitative PPI data.

Our algorithm consists of three main phases: (1) Identify weakly connected vertices (and

their neighbours); (2) Discover dense clusters in the network; (3) Identify remaining di-

rect interactions via a genetic algorithm.

As discussed in Section 4.5, the DIGCOM problems raise a number of challenging

computational questions for further investigation. From the standpoint of computational

complexity, the hardness of the problems is not yet known, and we conjecture that they

are NP-hard.

Open Problem 6.2. What is the computational complexity of E-DIGCOM and A-DIGCOM?

On the other hand, the probabilistic graph model is built around several simplifying

assumptions that need not hold in practice. For example, the abundance of protein

complexes is not constant, and the strength of physical interactions is non-uniform,

as some interactions may be more prone to disruption by the affinity purification

process than others.

Open Problem 6.3. Given sufficient AP-MS data, can we generalize the DIGCOM prob-

lems to handle individual interaction strengths and abundances?

In fact, if we were given the individual interaction strength for each pair of proteins

(possibly from a complementary dataset such as protein co-crystallization or physical

models), the second and third phases of our algorithm could be easily modified to handle

non-uniform interaction probabilities. The first phase of the algorithm would require

more careful modifications to handle an arbitrary survival probability for each edge.
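
To illustrate what per-edge survival probabilities would mean computationally, here is a hedged Monte Carlo sketch. It assumes, as one reading of the probabilistic graph model, that each direct interaction survives purification independently with its own probability and that a bait pulls down a prey whenever a path of surviving interactions joins them. The function names, the dictionary format and the trials and seed parameters are hypothetical, and the sampling estimator is not the method used in Chapter 4.

```python
import random
from collections import deque


def connected(edges_alive, source, target):
    """Breadth-first search over the surviving edges."""
    adj = {}
    for u, v in edges_alive:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return True
        for w in adj.get(u, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return False


def estimate_pulldown_probability(survival, bait, prey, trials=20000, seed=0):
    """Estimate P(bait and prey stay connected) by repeated sampling.

    survival maps each direct interaction (u, v) to its own survival
    probability, so non-uniform interaction strengths are allowed.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        alive = [edge for edge, p in survival.items() if rng.random() < p]
        if connected(alive, bait, prey):
            hits += 1
    return hits / trials


# Path a - b - c: the indirect pair (a, c) needs both edges to survive.
survival = {("a", "b"): 0.5, ("b", "c"): 0.5}
print(estimate_pulldown_probability(survival, "a", "c"))
```

For the two-edge path the exact value is 0.5 × 0.5 = 0.25, and the estimate approaches it as the number of trials grows. Computing such connectivity probabilities exactly is a network reliability problem [34], which is why a sampling estimate is used in this sketch.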

PPI Networks as Hypergraphs. In Chapter 5, we considered the problem of modelling

PPI networks as hypergraphs via the (edge) clique cover problem on the binary PPI net-

work. In particular, we devised an exact algorithm for graphs with bounded treewidth

which showed promising performance on both simulated data and real PPI data. Fur-

thermore, our theoretical study of clique cover on planar graphs with bounded branch-

width resulted in a PTAS for planar graphs.
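
For readers unfamiliar with the (edge) clique cover problem, the sketch below checks whether a family of vertex sets is a valid edge clique cover and builds one with a simple greedy heuristic. The heuristic is a stand-in chosen for brevity: it is neither the exact bounded-treewidth algorithm nor the PTAS of Chapter 5, and it gives no guarantee on the size of the cover. The adjacency-dictionary input (a simple graph without self-loops) and all function names are assumptions.

```python
from itertools import combinations


def is_clique(adj, vertices):
    return all(v in adj[u] for u, v in combinations(vertices, 2))


def is_edge_clique_cover(adj, cliques):
    """True if every set is a clique and every edge lies inside some set."""
    covered = set()
    for clique in cliques:
        if not is_clique(adj, clique):
            return False
        covered.update(frozenset(e) for e in combinations(clique, 2))
    edges = {frozenset((u, v)) for u in adj for v in adj[u]}
    return edges <= covered


def greedy_edge_clique_cover(adj):
    """Repeatedly grow a clique around some still-uncovered edge."""
    uncovered = {frozenset((u, v)) for u in adj for v in adj[u]}
    cover = []
    while uncovered:
        u, v = sorted(min(uncovered, key=sorted))
        clique = {u, v}
        # Only common neighbours of u and v can extend the clique.
        for w in sorted(adj[u] & adj[v]):
            if all(w in adj[x] for x in clique):
                clique.add(w)
        cover.append(clique)
        uncovered -= {frozenset(e) for e in combinations(clique, 2)}
    return cover


# A triangle A-B-C (one complex) plus a separate pairwise interaction C-D.
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
cover = greedy_edge_clique_cover(adj)
print(cover)
```

Each clique in the resulting cover corresponds to one hyperedge of the hypergraph model, that is, one putative protein complex.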

However, existing experimental techniques for detecting binary interactions often

suffer from false negatives, i.e., some edges are missing in the PPI network. Conse-

quently, protein complexes that should ideally form cliques instead appear as quasi-

cliques or simply dense subgraphs. When modelling PPI networks as hypergraphs, we may

therefore wish to relax the assumption that each protein complex forms a clique.

Open Problem 6.4. Does there exist an efficient quasi-clique cover algorithm?

Quasi-cliques can be defined in various ways: one possible definition of a quasi-

clique is a dense subgraph with a lower bound on the edge density. Using this defini-

tion, one may generalize our algorithm for graphs with bounded treewidth. Following

our algorithm, the same type of dynamic programming algorithm can be conceived –

however, our approach of finding cliques in the local neighbourhood of a vertex should

now be extended to multi-hop neighbours depending on the edge density, resulting in

an increase in the running time. On the other hand, one may devise an algorithm spe-

cialized for graphs with other properties of PPI networks, for example the scale-freeness

of networks.
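
The multi-hop extension mentioned above can be pictured with a short breadth-first search that collects every vertex within k hops of a given vertex. This is only a sketch of the neighbourhood computation; the parameter k, its relation to the chosen density bound and the function name are assumptions, not results from the thesis.

```python
from collections import deque


def k_hop_neighbourhood(adj, source, k):
    """Vertices at distance between 1 and k from `source`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # do not expand beyond k hops
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist) - {source}


# Path A - B - C - D.
path = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
print(k_hop_neighbourhood(path, "A", 2))
```

With k = 1 this reduces to the ordinary neighbourhood used when searching for cliques; larger values of k enlarge the candidate set and hence the running time.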

Many problems concerning PPI networks merit further study, and the work

presented in this thesis opens the door to several opportunities with important impli-

cations in systems biology. One of the main pursuits in the field is a comprehensive

understanding of the proteome as a complex dynamic system, and any improvements

in tackling the problems discussed here would take us one step closer towards this goal.

Bibliography

[1] B. Adamcsek, G. Palla, I. J. Farkas, I. Derényi, and T. Vicsek. CFinder: locating cliques and overlapping modules in biological networks. Bioinformatics (Oxford, England), 22(8):1021–1023, Apr. 2006.

[2] P. Alves, R. J. Arnold, M. V. Novotny, P. Radivojac, J. P. Reilly, and H. Tang. Advancement in protein inference from shotgun proteomics using peptide detectability. Pacific Symposium on Biocomputing, pages 409–420, 2007.

[3] S. Arnborg, D. G. Corneil, and A. Proskurowski. Complexity of finding embeddings in a k-tree. Society for Industrial and Applied Mathematics. Journal on Algebraic and Discrete Methods, 8(2):277–284, 1987.

[4] S. Arora. Computational Complexity: A Modern Approach. Cambridge University Press, 1 edition, 2009.

[5] S. Asthana. Predicting protein complex membership using probabilistic network reliability. Genome Research, 14(6):1170–1175, 2004.

[6] J. Azé, T. Bourquard, S. Hamel, A. Poupon, and D. Ritchie. Using Kendall-Tau Meta-Bagging to Improve Protein-Protein Docking Predictions. In Lecture Notes in Bioinformatics 7036, PRIB 2011, pages 284–295, 2011.

[7] G. D. Bader and C. W. V. Hogue. An automated method for finding molecular complexes in large protein interaction networks. BMC bioinformatics, 4:2, Jan. 2003.

[8] B. Baker. Approximation algorithms for NP-complete problems on planar graphs. Journal of the ACM, 41(1), 1994.

[9] M. Bantscheff, M. Schirle, G. Sweetman, J. Rick, and B. Kuster. Quantitative mass spectrometry in proteomics: a critical review. Analytical and bioanalytical chemistry, 389(4):1017–1031, Oct. 2007.

[10] A. L. Barabasi and R. Albert. Emergence of scaling in random networks. Science (New York, N.Y.), 286(5439):509–512, Oct. 1999.

[11] P. L. Bartel, J. A. Roecklein, D. SenGupta, and S. Fields. A protein linkage map of Escherichia coli bacteriophage T7. Nature Genetics, 12(1):72–77, Jan. 1996.

[12] H. M. Berman, T. N. Bhat, P. E. Bourne, Z. Feng, G. Gilliland, H. Weissig, and J. Westbrook. The protein data bank and the challenge of structural genomics. Nature structural biology, 7 Suppl:957–959, Nov. 2000.

[13] A. Bhan, D. Galas, and T. Dewey. A duplication growth model of gene expression networks. Bioinformatics, 18:1486–1493, 2002.

[14] M. Blanchette, E. Kim, and A. Vetta. Clique Cover on Sparse Networks. In The 9th SIAM Meeting on Algorithm Engineering & Experiments (ALENEX), Kyoto, Japan, 2012.

[15] M. Blatt, S. Wiseman, and E. Domany. Superparamagnetic Clustering of Data. Phys. Rev. Lett., 76:3251–3254, Apr 1996.

[16] H. Bodlaender and T. Kloks. A simple linear time algorithm for triangulating three-colored graphs. Journal of Algorithms, 15(1):160–172, 1993.

[17] H. L. Bodlaender. A linear-time algorithm for finding tree-decompositions of small treewidth. SIAM Journal on Computing, 25(6):1305–1317, 1996.

[18] H. L. Bodlaender, M. R. Fellows, M. T. Hallett, H. T. Wareham, and T. J. Warnow. The hardness of perfect phylogeny, feasible register assignment and other problems on thin colored graphs. Theoretical Computer Science, 244(1-2):167–188, 2000.

[19] A. Breitkreutz, H. Choi, J. Sharom, L. Boucher, V. Neduva, B. Larsen, Z. Y. Lin, B. Breitkreutz, C. Stark, G. Liu, J. Ahn, D. Dewar-Darch, Z. Qin, T. Pawson, A. Gingras, A. I. Nesvizhskii, and M. Tyers. Global architecture of the yeast kinome interaction network. Science, 2010.

[20] S. Brohée and J. van Helden. Evaluation of clustering algorithms for protein-protein interaction networks. BMC bioinformatics, 7:488, 2006.

[21] H. Brönnimann and M. Goodrich. Almost optimal set covers in finite VC-dimension. Discrete & Computational Geometry, 14:463–479, 1995. 10.1007/BF02570718.

[22] P. C. Carvalho, J. Hewel, V. C. Barbosa, and J. R. Yates. Identifying differences in protein expression levels by spectral counting and feature selection. Genetics and molecular research : GMR, 7(2):342–356, 2008.

[23] M. R. Cerioli, L. Faria, T. O. Ferreira, C. A. J. Martinhon, F. Protti, and B. Reed. Partition into cliques for cubic graphs: planar case, complexity and approximation. Discrete Applied Mathematics, 156(12):2270–2278, 2008.

[24] M.-S. Chang and H. Müller. On the tree-degree of graphs. In Graph-theoretic Concepts in Computer Science (Boltenhagen, 2001), pages 44–54, 2001.

[25] C. Chekuri, K. L. Clarkson, and S. Har-Peled. On the set multi-cover problem in geometric settings. In Proceedings of the 25th Annual Symposium on Computational geometry, SCG ’09, pages 341–350, New York, NY, USA, 2009. ACM.

[26] J. Chen, W. Hsu, M. L. Lee, and S. Ng. Increasing confidence of protein interactomes using network topological metrics. Bioinformatics, 22(16):1998–2004, Aug. 2006.

[27] J. Chen, W. Hsu, M. L. Lee, and S.-K. Ng. Systematic assessment of high-throughput experimental data for reliable protein interactions using network topology. In Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence, ICTAI ’04, pages 368–372, Washington, DC, USA, 2004. IEEE Computer Society.

[28] Q. Cheng, P. Berman, R. Harrison, and A. Zelikovsky. Efficient alignments of metabolic networks with bounded treewidth. In IEEE International Conference on Data Mining Workshops (ICDMW), pages 687–694, 2010.

[29] H. Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, 23(4):493–507, 1952.

[30] H. Choi, D. Fermin, and A. I. Nesvizhskii. Significance analysis of spectral count data in label-free shotgun proteomics. Molecular & Cellular Proteomics, 7(12):2373–2385, December 2008.

[31] V. Choi. Yucca: an efficient algorithm for small-molecule docking. Chemistry & biodiversity, 2(11):1517–1524, Nov. 2005.

[32] F. Chung, L. Lu, T. G. Dewey, and D. J. Galas. Duplication models for biological networks. Journal of Computational Biology, 10(5):677–687, Oct. 2003.

[33] P. Cloutier, R. Al-Khoury, M. Lavallée-Adam, D. Faubert, H. Jiang, C. Poitras, A. Bouchard, D. Forget, M. Blanchette, and B. Coulombe. High-resolution mapping of the protein interaction network for the human transcription machinery and affinity purification of RNA polymerase II-associated complexes. Methods (San Diego, Calif.), 48(4):381–386, Aug. 2009.

[34] C. J. Colbourn. The Combinatorics of Network Reliability. Oxford University Press, Inc., 1987.

[35] S. R. Collins, P. Kemmeren, X.-C. Zhao, J. F. Greenblatt, F. Spencer, F. C. P. Holstege, J. S. Weissman, and N. J. Krogan. Toward a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae. Molecular & cellular proteomics : MCP, 6(3):439–450, Mar. 2007.

[36] B. Coulombe, M. Blanchette, and C. Jeronimo. Steps towards a repertoire of comprehensive maps of human protein interaction networks: the Human Proteotheque Initiative (HuPI). Biochemistry and Cell Biology, 86(2):149–156, Apr. 2008.

[37] T. Dandekar, B. Snel, M. Huynen, and P. Bork. Conservation of gene order: a fingerprint of proteins that physically interact. Trends in Biochemical Sciences, 23(9):324–328, 1998.

[38] B. Dengiz, F. Altiparmak, and A. Smith. Efficient optimization of all-terminal reliable networks, using an evolutionary approach. Reliability, IEEE Transactions on, 46(1):18–26, 1997.

[39] B. Dengiz, F. Altiparmak, and A. Smith. Local search genetic algorithm for optimal design of reliable networks. Evolutionary Computation, IEEE Transactions on, 1(3):179–188, 1997.

[40] A. Denise and M. Vasconcellos. The random planar graph. Congressus Numerantium, 113:61–79, 1996.

[41] F. Dorn, E. Penninkx, H. Bodlaender, and F. Fomin. Efficient exact algorithms on planar graphs: exploiting sphere cut branch decompositions. Algorithms–ESA 2005, 3669:95–106, 2005.

[42] B. Dost, N. Bandeira, X. Li, Z. Shen, S. P. Briggs, and V. Bafna. Accurate mass spectrometry based protein quantification via shared peptides. Journal of Computational Biology, 19, 2012.

[43] B. Dost, T. Shlomi, N. Gupta, E. Ruppin, V. Bafna, and R. Sharan. QNet: a tool for querying protein interaction networks. Journal of Computational Biology, 15(7):913–925, 2008.

[44] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences of the United States of America, 95(25):14863–14868, Dec. 1998.

[45] D. Eppstein. Subgraph isomorphism in planar graphs and related problems. J. Graph Algorithms and Applications, 3(3):1–27, 1999.

[46] D. Eppstein. Diameter and treewidth in minor-closed graph families. Algorithmica, 27:275–291, 2000.

[47] G. Even, D. Rawitz, and S. Shahar. Hitting sets when the VC-dimension is small. Inf. Process. Lett., 95(2):358–362, July 2005.

[48] U. Feige. A threshold of ln n for approximating set cover. J. ACM, 45(4):634–652, 1998.

[49] S. Fields and O. Song. A novel genetic system to detect protein-protein interactions. Nature, 340(6230):245–6, 1989.

[50] R. L. Finley and R. Brent. Interaction mating reveals binary and ternary connections between Drosophila cell cycle regulators. Proceedings of the National Academy of Sciences of the United States of America, 91(26):12980–12984, Dec. 1994.

[51] L. Florens, M. Washburn, J. Raine, R. Anthony, M. Grainger, J. Haynes, J. Moch, N. Muster, J. Sacci, D. Tabb, et al. A proteomic view of the Plasmodium falciparum life cycle. Nature, 419(6906):520–526, 2002.

[52] F. V. Fomin and D. M. Thilikos. New upper bounds on the decomposability of planar graphs. Journal of Graph Theory, 51(1):53–81, 2006.

[53] M. Fromont-Racine, J. Rain, and P. Legrain. Toward a functional analysis of the yeast genome through exhaustive two-hybrid screens. Nat Genet, 16(3):277–282, July 1997.

[54] A. Galarneau, M. Primeau, L.-E. Trudeau, and S. W. Michnick. β-Lactamase protein fragment complementation assays as in vivo and in vitro sensors of protein-protein interactions. Nature Biotechnology, 20(6):619–622, June 2002.

[55] J. Gao, G. Opiteck, M. Friedrichs, A. Dongre, and S. Hefta. Changes in the protein expression of yeast as a function of carbon source. J Proteome Res, 2(6):643–649, 2003.

[56] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979.

[57] M. R. Garey, D. S. Johnson, and L. Stockmeyer. Some simplified NP-complete graph problems. Theoretical Computer Science, 1(3):237–267, 1976.

[58] A. Gavin, M. Bösche, R. Krause, P. Grandi, M. Marzioch, A. Bauer, J. Schultz, J. M. Rick, A. Michon, C. Cruciat, M. Remor, C. Höfert, M. Schelder, M. Brajenovic, H. Ruffner, A. Merino, K. Klein, M. Hudak, D. Dickson, T. Rudi, V. Gnau, A. Bauch, S. Bastuck, B. Huhse, C. Leutwein, M. Heurtier, R. R. Copley, A. Edelmann, E. Querfurth, V. Rybin, G. Drewes, M. Raida, T. Bouwmeester, P. Bork, B. Seraphin, B. Kuster, G. Neubauer, and G. Superti-Furga. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415(6868):141–147, 2002.

[59] L. Y. Geer, S. P. Markey, J. A. Kowalak, L. Wagner, M. Xu, D. M. Maynard, X. Yang, W. Shi, and S. H. Bryant. Open mass spectrometry search algorithm. Journal of proteome research, 3(5):958–964, Aug. 2004.

[60] R. Gentleman and W. Huber. Making the most of high-throughput protein-interaction data. Genome Biology, 8(10):112, 2007.

[61] S. A. Gerber, J. Rush, O. Stemman, M. W. Kirschner, and S. P. Gygi. Absolute quantification of proteins and phosphoproteins from cell lysates by tandem MS. Proceedings of the National Academy of Sciences, 100(12):6940–6945, 2003.

[62] E. Gilbert. Enumeration of labeled graphs. Canad. J. Math, 8:405–411, 1956.

[63] J. Gilmore, D. Auberry, J. Sharp, A. White, K. Anderson, and D. Daly. A bayesian estimator of protein-protein association probabilities. Bioinformatics, 24(13):1554–5, 2008.

[64] D. E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley Professional, January 1989.

[65] R. Gordân, A. J. Hartemink, and M. L. Bulyk. Distinguishing direct versus indirect transcription factor-DNA interactions. Genome Research, 19(11):2090–2100, Nov. 2009.

[66] J. Gramm, J. Guo, F. Hüffner, and R. Niedermeier. Data reduction and exact algorithms for clique cover. ACM Journal of Experimental Algorithmics, 13, 2009.

[67] J. Gross and J. Yellen. Handbook of Graph Theory (Discrete Mathematics and its Applications). CRC, 2003.

[68] R. Halin. S-functions for graphs. Journal of Geometry, 8:171–186, 1976. 10.1007/BF01917434.

[69] E. Hartuv, A. O. Schmitt, J. Lange, S. Meier-Ewert, H. Lehrach, and R. Shamir. An algorithm for clustering cDNA fingerprints. Genomics, 66(3):249–256, 2000.

[70] Y. Ho, A. Gruhler, A. Heilbut, G. D. Bader, L. Moore, S. Adams, A. Millar, P. Taylor, K. Bennett, K. Boutilier, L. Yang, C. Wolting, I. Donaldson, S. Schandorff, J. Shewnarane, M. Vo, J. Taggart, M. Goudreault, B. Muskat, C. Alfarano, D. Dewar, Z. Lin, K. Michalickova, A. R. Willems, H. Sassi, P. A. Nielsen, K. J. Rasmussen, J. R. Andersen, L. E. Johansen, L. H. Hansen, H. Jespersen, A. Podtelejnikov, E. Nielsen, J. Crawford, V. Poulsen, B. D. Sørensen, J. Matthiesen, R. C. Hendrickson, F. Gleeson, T. Pawson, M. F. Moran, D. Durocher, M. Mann, C. W. V. Hogue, D. Figeys, and M. Tyers. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415(6868):180–183, 2002.

[71] D. N. Hoover. Complexity of graph covering problems for graphs of low degree. Journal of Combinatorial Mathematics and Combinatorial Computing, 11:187–208, 1992.

[72] H. B. Hunt III, M. V. Marathe, V. Radhakrishnan, and R. E. Stearns. The complexity of planar counting problems. SIAM Journal on Computing, 27(4):1142–1167 (electronic), 1998.

[73] Y. Ishihama, Y. Oda, T. Tabata, T. Sato, T. Nagasu, J. Rappsilber, and M. Mann. Exponentially modified protein abundance index (emPAI) for estimation of absolute protein amount in proteomics by the number of sequenced peptides per protein. Mol Cell Proteomics, 4(9):1265–1272, 2005.

[74] T. Ito, T. Chiba, R. Ozawa, M. Yoshida, M. Hattori, and Y. Sakaki. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proceedings of the National Academy of Sciences of the United States of America, 98(8):4569–4574, 2001.

[75] T. Ito, K. Tashiro, S. Muta, R. Ozawa, T. Chiba, M. Nishizawa, K. Yamamoto, S. Kuhara, and Y. Sakaki. Toward a protein-protein interaction map of the budding yeast: A comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. Proceedings of the National Academy of Sciences of the United States of America, 97(3):1143–1147, Feb. 2000.

[76] R.-H. Jan, F.-J. Hwang, and S.-T. Chen. Topological optimization of a communication network subject to a reliability constraint. Reliability, IEEE Transactions on, 42(1):63–70, 1993.

[77] J. Janin, K. Henrick, J. Moult, L. T. Eyck, M. J. E. Sternberg, S. Vajda, I. Vakser, S. J. Wodak, and Critical Assessment of PRedicted Interactions. CAPRI: a Critical Assessment of PRedicted Interactions. Proteins, 52(1):2–9, July 2003.

[78] L. J. Jensen and P. Bork. Not comparable, but complementary. Science, 322(5898):56–57, 2008.

[79] H. Jeong, S. P. Mason, A. L. Barabasi, and Z. N. Oltvai. Lethality and centrality in protein networks. Nature, 411(6833):41–42, 05 2001.

[80] R. Jin, S. Mccallen, C.-C. Liu, Y. Xiang, E. Almaas, and X. J. Zhou. Identifying dynamic network modules with temporal and spatial constraints. In Pacific Symposium on Biocomputing, pages 203–214, 2009.

[81] S. Jin, D. Daly, D. Springer, and J. Miller. The effects of shared peptides on protein quantitation in label-free proteomics by LC/MS/MS. Journal of Proteome Research, 7(1):164–169, 2008.

[82] R. Jothi and T. M. Przytycka. Computational approaches to predict protein-protein and domain-domain interactions. In I. Mandoiu and A. Zelikovsky, editors, Bioinformatics Algorithms: Techniques and Applications. Wiley Press, 2008.

[83] V. Kann. On the Approximability of NP-complete Optimization Problems. PhD thesis, Department of Numerical Analysis and Computing Science, Royal Institute of Technology, Stockholm, 1992.

[84] R. Kannan, S. Vempala, and A. Vetta. On clusterings: good, bad and spectral. Journal of the ACM, 51(3):497–515, 2004.

[85] M. Karas and F. Hillenkamp. Laser desorption ionization of proteins with molecular masses exceeding 10,000 daltons. Analytical Chemistry, 60(20):2299–2301, 1988.

[86] R. M. Karp. Reducibility among combinatorial problems. Complexity of Computer Computations, 40(4):85–103, 1972.

[87] S. M. Khan, B. Franke-Fayard, G. R. Mair, E. Lasonder, C. J. Janse, M. Mann, and A. P. Waters. Proteome analysis of separated male and female gametocytes reveals novel sex-specific Plasmodium biology. Cell, 121(5):675–687, June 2005.

[88] E. Kim, A. Sabharwal, A. Vetta, and M. Blanchette. Predicting direct protein interactions from affinity purification mass spectrometry data. Algorithms for Molecular Biology, 5(1):34, 2010.

[89] E. Kim, A. Vetta, and M. Blanchette. Protein quantification with shared peptides using multi cover algorithms. in preparation, 2011.

[90] A. D. King, N. Przulj, and I. Jurisica. Protein complex prediction via cost-based clustering. Bioinformatics (Oxford, England), 20(17):3013–3020, Nov. 2004.

[91] D. S. Kirkpatrick, S. A. Gerber, and S. P. Gygi. The absolute quantification strategy: a general procedure for the quantification of proteins and post-translational modifications. Methods, 35(3):265–273, 2005.

[92] T. Kloks. Treewidth: Computations and Approximations (Lecture Notes in Computer Science). Springer, 1 edition, Sept. 1994.

[93] N. J. Krogan, G. Cagney, H. Yu, G. Zhong, X. Guo, A. Ignatchenko, J. Li, S. Pu, N. Datta, A. P. Tikuisis, T. Punna, J. M. Peregrín-Alvarez, M. Shales, X. Zhang, M. Davey, M. D. Robinson, A. Paccanaro, J. E. Bray, A. Sheung, B. Beattie, D. P. Richards, V. Canadien, A. Lalev, F. Mena, P. Wong, A. Starostine, M. M. Canete, J. Vlasblom, S. Wu, C. Orsi, S. R. Collins, S. Chandran, R. Haw, J. J. Rilstone, K. Gandi, N. J. Thompson, G. Musso, P. St Onge, S. Ghanny, M. H. Y. Lam, G. Butland, A. M. Altaf-Ul, S. Kanaya, A. Shilatifard, E. O’Shea, J. S. Weissman, C. J. Ingles, T. R. Hughes, J. Parkinson, M. Gerstein, S. J. Wodak, A. Emili, and J. F. Greenblatt. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature, 440(7084):637–643, Mar. 2006.

[94] W. B. Langdon and R. Poli. Foundations of Genetic Programming. Springer, Dec. 2010.

[95] M. Lavallée-Adam, P. Cloutier, B. Coulombe, and M. Blanchette. Modeling Contaminants in AP-MS/MS Experiments. Journal of Proteome Research, 10(2):886–895, 2011.

[96] T. Leighton and S. Rao. Multicommodity max-flow min-cut theorems and their use in designing approximation algorithms. Journal of the ACM, 48:787–832, 1999.

[97] E. D. Levy and J. B. Pereira-Leal. Evolution and dynamics of protein interactions and networks. Current Opinion in Structural Biology, 18(3):349–357, 2008. Nucleic acids / Sequences and topology.

[98] M. Li, J. Wang, J. Chen, Z. Cai, and G. Chen. Identifying the overlapping complexes in protein interaction networks. International Journal of Data Mining and Bioinformatics, 4(1):91–108, 2010.

[99] Y. F. Li, R. J. Arnold, H. Tang, and P. Radivojac. The Importance of Peptide Detectability for Protein Identification, Quantification, and Experiment Design in MS/MS Proteomics. Journal of Proteome Research, 9(12):6288–6297, 2010.

[100] H. Liu, R. G. Sadygov, and J. R. Yates. A model for random sampling and estimation of relative protein abundance in shotgun proteomics. Analytical chemistry, 76(14):4193–4201, July 2004.

[101] C. Lund and M. Yannakakis. On the hardness of approximating minimization problems. Journal of the ACM, 41(5):960–981, 1994.

[102] P. Mallick, M. Schirle, S. S. Chen, M. R. Flory, H. Lee, D. Martin, J. Ranish, B. Raught, R. Schmitt, T. Werner, B. Kuster, and R. Aebersold. Computational prediction of proteotypic peptides for quantitative proteomics. Nature Biotechnology, 25(1):125–131, Jan. 2007.

[103] S. W. Michnick. Protein fragment complementation strategies for biochemical network mapping. Current Opinion in Biotechnology, 14(6):610–617, 2003.

[104] M. Mitzenmacher and E. Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, Jan. 2005.

[105] E. Mujuni and F. Rosamond. Parameterized complexity of the clique partition problem. In Proc. Fourteenth Computing: The Australasian Theory Symposium (CATS 2008), 2008.

[106] M. Narayanan, A. Vetta, E. E. Schadt, and J. Zhu. Simultaneous clustering of multiple gene expression and physical interaction datasets. PLoS Computational Biology, 6(4):e1000742, 13, 2010.

[107] A. Nesvizhskii. Protein identification by tandem mass spectrometry and sequence database searching. Methods Mol Biol., 367:87–119, 2007.

[108] A. I. Nesvizhskii and R. Aebersold. Interpretation of shotgun proteomic data: the protein inference problem. Molecular & cellular proteomics : MCP, 4(10):1419–1440, Oct. 2005.

[109] W. M. Old, K. Meyer-Arendt, L. Aveline-Wolf, K. G. Pierce, A. Mendoza, J. R. Sevinsky, K. A. Resing, and N. G. Ahn. Comparison of label-free methods for quantifying human proteins by shotgun proteomics. Molecular & Cellular Proteomics, 4(10):1487–1502, 2005.

[110] J. Orlin. Contentment in graph theory: covering graphs with cliques. Indagationes Mathematicae (Proceedings), 80(5):406–424, 1977.

[111] P. Pei and A. Zhang. A topological measurement for weighted protein interaction network. In Computational Systems Bioinformatics Conference, 2005. Proceedings. 2005 IEEE, pages 268–278, 2005.

[112] M. Pellegrini, E. M. Marcotte, M. J. Thompson, D. Eisenberg, and T. O. Yeates. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proceedings of the National Academy of Sciences of the United States of America, 96(8):4285–4288, Apr. 1999.

[113] M. Penrose. Random Geometric Graphs (Oxford Studies in Probability). Oxford University Press, USA, July 2003.

[114] D. N. Perkins, D. J. C. Pappin, D. M. Creasy, and J. S. Cottrell. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis, 20(18):3551–3567, 1999.

[115] P. A. Pevzner, V. Dancík, and C. L. Tang. Mutation-tolerant protein identification by mass spectrometry. Journal of computational biology : a journal of computational molecular cell biology, 7(6):777–787, 2000.

[116] U. Pieles, W. Zürcher, M. Schär, and H. Moser. Matrix-assisted laser desorption ionization time-of-flight mass spectrometry: a powerful tool for the mass and sequence analysis of natural and modified oligonucleotides. Nucleic Acids Research, 21(14):3191–3196, 1993.

[117] N. Przulj, D. G. Corneil, and I. Jurisica. Modeling interactome: scale-free or geometric? Bioinformatics, 20(18):3508–3515, 2004.

[118] O. Puig, F. Caspary, G. Rigaut, B. Rutz, E. Bouveret, E. Bragado-Nilsson, M. Wilm, and B. Séraphin. The tandem affinity purification (TAP) method: a general procedure of protein complex purification. Methods (San Diego, Calif.), 24(3):218–229, July 2001.

[119] G. Rigaut, A. Shevchenko, B. Rutz, M. Wilm, M. Mann, and B. Séraphin. A generic protein purification method for protein complex characterization and proteome exploration. Nature Biotechnology, 17(10):1030–1032, 1999.

[120] N. Robertson and P. D. Seymour. Graph minors. II. Algorithmic aspects of tree-width. Journal of algorithms, 7(3):309–322, 1986.

[121] N. Robertson and P. D. Seymour. Graph minors. X. Obstructions to tree-decomposition. Journal of Combinatorial Theory. Series B, 52(2):153–190, 1991.

[122] R. Saito, H. Suzuki, and Y. Hayashizaki. Interaction generality, a measurement to assess the reliability of a protein-protein interaction. Nucl. Acids Res., 30(5):1163–1168, Mar. 2002.

[123] R. Saito, H. Suzuki, and Y. Hayashizaki. Construction of reliable protein-protein interaction networks with a new interaction generality measure. Bioinformatics, 19(6):756–763, Apr. 2003.

[124] M. E. Sardiu, Y. Cai, J. Jin, S. K. Swanson, R. C. Conaway, J. W. Conaway, L. Florens, and M. P. Washburn. Probabilistic assembly of human protein interaction networks from label-free quantitative proteomics. Proceedings of the National Academy of Sciences of the United States of America, 105(5):1454–1459, Feb. 2008.

[125] A. Schrijver. Combinatorial Optimization (3 volume, A, B, & C). Springer, 1 edition, Feb. 2003.

[126] R. Sharan and R. Shamir. CLICK: a clustering algorithm with applications to gene expression analysis. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology. AAAI Press, Aug. 2000.

[127] S. S. Shen-Orr, R. Milo, S. Mangan, and U. Alon. Network motifs in the transcriptional regulation network of Escherichia coli. Nat Genet, 31(1):64–68, 05 2002.

[128] M. E. Sowa, E. J. Bennett, S. P. Gygi, and J. W. Harper. Defining the human deubiquitinating enzyme interaction landscape. Cell, 138(2):389–403, July 2009.

[129] V. Spirin and L. A. Mirny. Protein complexes and functional modules in molecular networks. Proceedings of the National Academy of Sciences of the United States of America, 100(21):12123–12128, Oct. 2003.

[130] M. P. H. Stumpf, T. Thorne, E. de Silva, R. Stewart, H. J. An, M. Lappe, and C. Wiuf. Estimating the size of the human interactome. Proceedings of the National Academy of Sciences, 105(19):6959–6964, 2008.

[131] K. Suhre. Inference of gene function based on gene fusion events. Springer Protocols, 396, November 2007.

[132] H. Tamaki. A linear time heuristic for the branch-decomposition of planar graphs. In G. D. Battista and U. Zwick, editors, European Symposium on Algorithms (ESA), volume 2832 of Lecture Notes in Computer Science, pages 765–775. Springer, 2003.

[133] H. Tang, R. J. Arnold, P. Alves, Z. Xun, D. E. Clemmer, M. V. Novotny, J. P. Reilly, and P. Radivojac. A computational approach toward label-free protein quantification using predicted peptide detectability. Bioinformatics (Oxford, England), 22(14):e481–8, 2006.

[134] K. Tarassov, V. Messier, C. R. Landry, S. Radinovic, M. M. S. Molina, I. Shames, Y. Malitskaya, J. Vogel, H. Bussey, and S. W. Michnick. An in vivo map of the yeast protein interactome. Science, 320(5882):1465–1470, 2008.

[135] J. A. Taylor and R. S. Johnson. Sequence database searches via de novo peptide sequencing by tandem mass spectrometry. Rapid Communications in Mass Spectrometry, 11(9):1067–1075, 1997.

[136] R. Uehara. NP-complete problems on a 3-connected cubic planar graph and their application. Technical report, Tokyo Woman’s Christian University, Tokyo, Sept. 1996.

[137] P. Uetz, L. Giot, G. Cagney, T. A. Mansfield, R. S. Judson, J. R. Knight, D. Lockshon, V. Narayan, M. Srinivasan, P. Pochart, A. Qureshi-Emili, Y. Li, B. Godwin, D. Conover, T. Kalbfleisch, G. Vijayadamodar, M. Yang, M. Johnston, S. Fields, and J. M. Rothberg. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature, 403(6770):623–627, 2000.

[138] L. G. Valiant. The complexity of enumeration and reliability problems. SIAM Journal on Computing, 8(3):410–421, 1979.

[139] S. M. van Dongen. Graph clustering by flow simulation. PhD thesis, University of Utrecht, The Netherlands, 2000.

[140] V. Vazirani. Approximation Algorithms. Springer, 2004.

[141] C. von Mering, R. Krause, B. Snel, M. Cornell, S. G. Oliver, S. Fields, and P. Bork. Comparative assessment of large-scale data sets of protein-protein interactions. Nature, 417(6887):399–403, May 2002.

[142] A. J. Walhout, R. Sordella, X. Lu, J. L. Hartley, G. F. Temple, M. A. Brasch, N. Thierry-Mieg, and M. Vidal. Protein interaction mapping in C. elegans using proteins involved in vulval development. Science (New York, N.Y.), 287(5450):116–122, Jan. 2000.

[143] X. Wang, J. Venable, P. LaPointe, D. M. Hutt, A. V. Koulov, J. Coppinger, C. Gurkan, W. Kellner, J. Matteson, H. Plutner, J. R. Riordan, J. W. Kelly, J. R. Yates, and W. E. Balch. Hsp90 cochaperone Aha1 downregulation rescues misfolding of CFTR in cystic fibrosis. Cell, 127(4):803–815, Nov. 2006.

[144] C. M. Whitehouse, R. N. Dreyer, M. Yamashita, and J. B. Fenn. Electrospray interface for liquid chromatographs and mass spectrometers. Analytical Chemistry, 57(3):675–679, 1985. PMID: 2581476.

[145] D. P. Williamson and D. B. Shmoys. The design of approximation algorithms. Cambridge University Press, 2011.

[146] I. Xenarios, L. Salwínski, X. J. Duan, P. Higney, S. M. Kim, and D. Eisenberg. DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res, 30(1):303–305, January 2002.

[147] A. Yamaguchi, K. F. Aoki, and H. Mamitsuka. Graph complexity of chemical compounds in biological pathways. In Genome Informatics Vol. 14, pages 376–377, 2003.

[148] P. Ye, B. D. Peyser, X. Pan, J. D. Boeke, F. A. Spencer, and J. S. Bader. Gene function prediction from congruent synthetic lethal interactions in yeast. Molecular systems biology, 1:2005.0026, 2005.

[149] H. Yu, P. Braun, M. A. Yildirim, I. Lemmens, K. Venkatesan, J. Sahalie, T. Hirozane-Kishikawa, F. Gebreab, N. Li, N. Simonis, T. Hao, J. Rual, A. Dricot, A. Vazquez, R. R. Murray, C. Simon, L. Tardivo, S. Tam, N. Svrzikapa, C. Fan, A. de Smet, A. Motyl, M. E. Hudson, J. Park, X. Xin, M. E. Cusick, T. Moore, C. Boone, M. Snyder, F. P. Roth, A. Barabasi, J. Tavernier, D. E. Hill, and M. Vidal. High-quality binary protein interaction map of the yeast interactome network. Science, 322(5898):104–110, Oct. 2008.

[150] B. Zhang, B. Park, T. Karpinets, and N. Samatova. From pull-down data to protein interaction networks and complexes with biological relevance. Bioinformatics, 24(7):979–86, 2008.

[151] B. Zhang, N. C. VerBerkmoes, M. A. Langston, E. Uberbacher, R. L. Hettich, and N. F. Samatova. Detecting differential and correlated protein expression in label-free shotgun proteomics. Journal of Proteome Research, 5(11):2909–2918, Nov. 2006.