Co-evolutionary Analysis of Domains in Interacting Proteins Reveals Insights Into Domain-domain...

15
Co-evolutionary Analysis of Domains in Interacting Proteins Reveals Insights into DomainDomain Interactions Mediating ProteinProtein Interactions Raja Jothi 1 , Praveen F. Cherukuri 1,2 , Asba Tasneem 3 and Teresa M. Przytycka 1 1 National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA 2 Bioinformatics Program Boston University, Boston, MA 02215, USA 3 Booz Allen Hamilton Inc., Rockville, MD 20852, USA Recent advances in functional genomics have helped generate large-scale high-throughput protein interaction data. Such networks, though extremely valuable towards molecular level understanding of cells, do not provide any direct information about the regions (domains) in the proteins that mediate the interaction. Here, we performed co-evolutionary analysis of domains in interacting proteins in order to understand the degree of co- evolution of interacting and non-interacting domains. Using a combination of sequence and structural analysis, we analyzed proteinprotein interac- tions in F1-ATPase, Sec23p/Sec24p, DNA-directed RNA polymerase and nuclear pore complexes, and found that interacting domain pair(s) for a given interaction exhibits higher level of co-evolution than the non- interacting domain pairs. Motivated by this finding, we developed a computational method to test the generality of the observed trend, and to predict large-scale domaindomain interactions. Given a proteinprotein interaction, the proposed method predicts the domain pair(s) that is most likely to mediate the protein interaction. We applied this method on the yeast interactome to predict domaindomain interactions, and used known domaindomain interactions found in PDB crystal structures to validate our predictions. Our results show that the prediction accuracy of the proposed method is statistically significant. Comparison of our prediction results with those from two other methods reveals that only a fraction of predictions are shared by all the three methods, indicating that the proposed method can detect known interactions missed by other methods. We believe that the proposed method can be used with other methods to help identify previously unrecognized domaindomain interactions on a genome scale, and could potentially help reduce the search space for identifying interaction sites. Published by Elsevier Ltd. *Corresponding authors Keywords: co-evolution; proteinprotein interaction; domaindomain interaction Introduction Post-genomic advances in molecular biology have helped uncover the intricate interplay between proteins in metabolic, signaling and regulatory pathways. Identification of proteinprotein interac- tions is an essential step towards a better under- standing of various cellular processes. Various high- throughput experimental techniques such as mass spectrometry, yeast two-hybrid, and tandem affinity purification have been used to discover and gen- erate large scale protein interaction data. 18 In addition, several computational approaches towards predicting proteinprotein interactions have been proposed in an effort to complement experimental methods. 923 Proteinprotein interaction networks, though extremely valuable towards molecular level under- standing of cells, do not provide insights on Abbreviations used: MLE, Maximum Likelihood Estimation; PDB, Protein Data Bank; RCDP, Relative Co-evolution of Domain Pairs; SLA, Sequence Lengths Assigned; DPEA, Domain Pair Exclusion Analysis; RDFF, Random Decision Forest Framework. E-mail addresses of the corresponding authors: [email protected]; [email protected] doi:10.1016/j.jmb.2006.07.072 J. Mol. Biol. (2006) 362, 861875 0022-2836/$ - see front matter. Published by Elsevier Ltd.

description

protein

Transcript of Co-evolutionary Analysis of Domains in Interacting Proteins Reveals Insights Into Domain-domain...

Page 1: Co-evolutionary Analysis of Domains in Interacting Proteins Reveals Insights Into Domain-domain Interactions Mediating Protein-protein Interactions

doi:10.1016/j.jmb.2006.07.072 J. Mol. Biol. (2006) 362, 861–875

Co-evolutionary Analysis of Domains in InteractingProteins Reveals Insights into Domain–DomainInteractions Mediating Protein–Protein Interactions

Raja Jothi1⁎, Praveen F. Cherukuri1,2, Asba Tasneem3 andTeresa M. Przytycka1⁎

1National Center forBiotechnology Information,National Library of Medicine,National Institutes of Health,Bethesda, MD 20894, USA2Bioinformatics ProgramBoston University, Boston,MA 02215, USA3Booz Allen Hamilton Inc.,Rockville, MD 20852, USA

Abbreviations used: MLE, MaximEstimation; PDB, Protein Data BankCo-evolution of Domain Pairs; SLA,Assigned; DPEA, Domain Pair ExcluRandom Decision Forest FrameworkE-mail addresses of the correspon

[email protected]; przytyck@n

0022-2836/$ - see front matter. Publishe

Recent advances in functional genomics have helped generate large-scalehigh-throughput protein interaction data. Such networks, though extremelyvaluable towards molecular level understanding of cells, do not provideany direct information about the regions (domains) in the proteins thatmediate the interaction. Here, we performed co-evolutionary analysis ofdomains in interacting proteins in order to understand the degree of co-evolution of interacting and non-interacting domains. Using a combinationof sequence and structural analysis, we analyzed protein–protein interac-tions in F1-ATPase, Sec23p/Sec24p, DNA-directed RNA polymerase andnuclear pore complexes, and found that interacting domain pair(s) for agiven interaction exhibits higher level of co-evolution than the non-interacting domain pairs. Motivated by this finding, we developed acomputational method to test the generality of the observed trend, and topredict large-scale domain–domain interactions. Given a protein–proteininteraction, the proposed method predicts the domain pair(s) that is mostlikely to mediate the protein interaction. We applied this method on theyeast interactome to predict domain–domain interactions, and used knowndomain–domain interactions found in PDB crystal structures to validate ourpredictions. Our results show that the prediction accuracy of the proposedmethod is statistically significant. Comparison of our prediction results withthose from two other methods reveals that only a fraction of predictions areshared by all the three methods, indicating that the proposed method candetect known interactions missed by other methods. We believe that theproposed method can be used with other methods to help identifypreviously unrecognized domain–domain interactions on a genome scale,and could potentially help reduce the search space for identifyinginteraction sites.

Published by Elsevier Ltd.

*Corresponding authors

Keywords: co-evolution; protein–protein interaction; domain–domain interaction

Introduction

Post-genomic advances in molecular biology havehelped uncover the intricate interplay betweenproteins in metabolic, signaling and regulatory

um Likelihood; RCDP, RelativeSequence Lengthssion Analysis; RDFF,.ding authors:cbi.nlm.nih.gov

d by Elsevier Ltd.

pathways. Identification of protein–protein interac-tions is an essential step towards a better under-standing of various cellular processes. Various high-throughput experimental techniques such as massspectrometry, yeast two-hybrid, and tandem affinitypurification have been used to discover and gen-erate large scale protein interaction data.1–8 Inaddition, several computational approachestowards predicting protein–protein interactionshave been proposed in an effort to complementexperimental methods.9–23

Protein–protein interaction networks, thoughextremely valuable towards molecular level under-standing of cells, do not provide insights on

Page 2: Co-evolutionary Analysis of Domains in Interacting Proteins Reveals Insights Into Domain-domain Interactions Mediating Protein-protein Interactions

862 Co-evolution of Interacting Domains

interaction specificity at the domain level. Mostoften, it is only a fraction of a protein that directlyinteracts with its biological partners. Since two thirdsof proteins in prokaryotes and four fifths of proteinsin eukaryotes are multidomain proteins,24,25 inter-action between two proteins (either stably ortransiently) often involves binding of pair(s) ofdomains. Importantly, understanding interaction atthe domain level is a critical step towards a thoroughunderstanding of the protein–protein interactionnetworks and their evolution.Statistical analysis of domain co-occurrence in

interacting proteins (known as the AssociationMethod) have been used towards predicting pro-tein–protein interactions.26–29 The underlying ideabehind this type of approach is to identify domainpairs that co-occur significantly more often ininteracting proteins than in non-interacting proteins,and use this information to identify new protein–protein interactions. Deng et al.30 applied theMaximum Likelihood Estimation (MLE) method toinfer domain–domain interactions from a givenprotein interaction network. They estimated theprobability of interaction between every domainpair, and used it to predict protein–protein interac-tions. Instead of identifying domain interactionswith the goal of predicting new protein–proteininteractions, Nye et al.31 proposed the lowest p-valuemethod, which focuses on identifying domain–domain interactions that are most likely to mediatea given protein–protein interaction. They performeda comprehensive comparative analysis of theassociation,26 MLE,30 and the lowest p-value31

methods and showed that (i) the overall predictionaccuracy of the MLE and the lowest p-valuemethods is only as good as a random method,which is about 55%, and (ii) the association methodwas the worst among the four with a predictionaccuracy of about 52%. Chen and Liu32 usedRandom Decision Forest Framework (RDFF) topredict domain–domain interactions, which arethen used to predict protein–protein interactions.Recently, Riley et al.33 extended ideas from earliermethods26–28,30 by adding a likelihood ratio test toassess the contribution of each potential domain–domain interaction to the likelihood of a given set ofprotein–protein interactions. They demonstratedthat their method performs considerably betterthan the association and MLE methods.At the structural level, considerable amount of

work has been done to understand protein/domaininteractions. Littler and Hubbard 34 performed acomprehensive analysis of domain–domain interac-tions observed in protein structures in an effort tounderstand comparative orientation and interactingsurfaces of structurally unsolved domain pairs.Gong et al.35 used geometric properties/tools suchas accessible surface area and Voronoi diagram onProtein Data Bank (PDB) crystal structures of proteincomplexes to detect domain interaction interfaces.Shoemaker et al.36 used a conserved binding modeanalysis on domain–domain interactions inferredfrom PDB crystal structures to detect binding

surfaces of biological relevance. Neduva et al.37

studied interactions that involve binding between aglobular domain in one protein and a short linearmotif (pattern of three to eight residues), andproposed a systematic approach to discover thesemotifs. For a comprehensive review of structure-based interaction studies, we refer the reader to Aloyand Russell's recent review on protein interactions.38

Structural information is necessary and critical fora full understanding of molecular/domain levelinteraction. However, as the number of interactionswith known protein structures is far fewer than thenumber of interactions, it makes it difficult tounderstand domain level interactions at the genomicscale. Here, we attempt to understand domain–domain interactions at the sequence level. Specifi-cally, we investigate the relative degree of co-evolution of domains in interacting proteins tounderstand whether or not interacting domainpairs exhibit higher level of co-evolution that thosethat are non-interacting. The concept of co-evolutionhas been widely applied in predicting protein–protein interactions solely based on sequenceinformation14,16,39–44 as well as the gene expressiondata.45,46 The underlying assumption behind thisconcept is that the interacting partners must co-evolve so that changes in a protein's binding surfaceare complemented in the interface of its part-ner.14,47,48 We use co-evolutionary analysis tostudy protein–protein interactions in F1-ATPase,Sec23p/Sec24p, DNA-directed RNA polymerase,and nuclear pore complexes, and show that thedegree of co-evolution of interacting domain pair(s)between two interacting proteins is higher than thatof non-interacting domain pairs. We then develop acomputational method, Relative Co-evolution ofDomain Pairs (RCDP), to test the generality of theobserved behavior, and to predict large-scaledomain–domain interactions using the yeast inter-actome. The proposed method, given a pair ofinteracting proteins, predicts the domain pair(s) thatis most likely to mediate the interaction. Predicteddomain–domain interactions are validated against aset of known domain–domain interactions found inPDB49 crystal structures (as reported in iPfam50). Wefinally compare RCDP's prediction results withthose from two other methods, and show that theRCDP method can predict known domain interac-tions that are missed by the other methods.

Results and Discussion

Co-evolution of domain pairs in interactingproteins

To assess the degree of co-evolution of domainpairs in interacting proteins, we considered protein–protein interactions in the Saccharomyces cerevisiae(yeast) genome. A schematic overview of assessingthe degree of co-evolution between two protein/domain families is shown in Figure 1. In this type of

Page 3: Co-evolutionary Analysis of Domains in Interacting Proteins Reveals Insights Into Domain-domain Interactions Mediating Protein-protein Interactions

Figure 1. A schematic overview of the co-evolutionaryanalysis. Multiple sequence alignments of two yeastproteins for a common set of species are constructed,followed by the construction of their phylogenetic treesand similarity matrices. The extent of agreement betweenthe evolutionary histories of the two yeast proteins isassessed by computing a linear correlation coefficientbetween the two similarity matrices.

863Co-evolution of Interacting Domains

analysis, multiple sequence alignments of twoproteins/domains for a common set of species areused to construct phylogenetic trees and similaritymatrices. The degree of co-evolution of the twodomains is measured by computing a linearcorrelation coefficient of the two similarity matrices,which implicitly compares the evolutionary his-tories of the two domains.To study the relative degree of co-evolution of

domain pairs in interacting proteins, interactingproteins are first assignedwithPfamHiddenMarkovModel (HMM)51 profiles. Then, as shown in Figure2, the correlation (agreement) scores, measuring thedegree of co-evolution of all possible domain pairsbetween the two interacting proteins, are com-puted. Multiple sequence alignment for domain D

in protein P is constructed by extracting thoseregions in P's multiple sequence alignment thatcorrespond to D (see Materials and Methods formore details). Under the co-evolution hypothesis,which assumes that interacting domains undergocorrelated mutations, domain pairs that are med-iating the interaction between two proteins areexpected to have co-evolved, and thus are expectedto have high correlation score. To test this hypoth-esis, we began by examining yeast interactionssupported by at least one PDB crystal structure. Fora given protein–protein interaction, correlationscores of the interacting domain pairs (inferredfrom crystal structures) are compared against thosedomain pairs not known to interact to see whetheror not the interacting domain pairs do reallyexhibit relatively high level of co-evolution. If twointeracting proteins, P and Q, have two domainseach, then there are a total of four domain–domaininteraction possibilities between them.

Interactions in F1-ATPase complex

The F1-ATP synthase is a five-subunit catalyticcore (in a stoichiometry of 3α, 3β, 1γ, 1δ, and 1ε),which uses transmembrane proton motive forcegenerated by photosynthesis or oxidative phosphor-ylation to drive the synthesis of ATP from ADP andinorganic phosphate. The central stalk, comprising3α, 3β, and 1γ subunits, links F1 complex to the nine-subunit transmembrane channel through which theprotons are pumped (F0 complex). The rod-shapedasymmetrical γ-subunit rotates inside a cylindermade of three α and β-subunits, arranged alter-nately52–56, making contacts with α and β-subunits.In this complex, we focused our attention on three

interactions among the α, β, and γ chains (genesATP1, ATP2, ATP3, respectively). The correspondingyeast proteins for the α, β, and γ chains (YBL099w,YJR121w, YBR039w, respectively) were assignedwith Pfam domains. The α-subunit (YBL099w) was

Figure 2. Relative degree of co-evolution of domains in interactingproteins. (a) Domain architecture ofproteins P andQ (shown using grayboxes) that are known to interact(interaction sites are shown as blackboxes). (b) Correlation (agreement)scores, measuring the degree of co-evolution, for all possible domainpairs in P and Q. Domain pairs thatmediate the interaction betweenproteins P and Q are expected tohave co-evolved, and thus areexpected to have a high correlationscore.

Page 4: Co-evolutionary Analysis of Domains in Interacting Proteins Reveals Insights Into Domain-domain Interactions Mediating Protein-protein Interactions

864 Co-evolution of Interacting Domains

inferred to contain three domains: beta-barreldomain (PF02874, 2e-18), nucleotide-bindingdomain (PF00006, 3e-122), and C-terminal domain(PF00306, 2e-37). The β-subunit (YJR121w), a closehomolog of the α-subunit, was inferred to containthe same three domains as well. The γ-subunit(YBR039w) contained just the ATP synthase domain(PF00231, 1e-130). Figure 3(a) depicts the domainarchitecture of the sequences along with the truedomain–domain interactions between the subunits.The results from co-evolutionary analysis (usingorthologs from 22 species) are listed as tables inFigure 3(b), which clearly shows that those domainpairs that interact do, in fact, have relatively highcorrelation scores. A cartoon of one of several bovinemitochondrial F1-ATPase crystal structures (PDB:1h8e), supporting the interactions, is shown inFigure 3(c). Despite the complexity of this system,it is remarkable that the degree of co-evolutionamong the interacting domain pairs is clearly higherthan that among the non-interacting domain pairs.

Figure 3. Interactions among alpha (ATP1), beta (ATP2)sequences are shown using thick colored lines: red for the alphand black for alpha or beta chain. Pfam domain annotations arnames of the protein sequences are to the left of the domain arcare known to be true from PDB crystal structures (as inferred inarchitecture. (b) The correlation scores of all possible domain plisted as tables. Domain pairs that are known to interact, dendegree of co-evolution. (c) A bottom view of the cartoon of bov1h8e), supporting the interactions, is shown with alpha, berespectively.

In particular, between the α and the β chains, sevenout of the nine domain pairs that are known tointeract have higher correlation than the two non-interacting domain pairs. In comparison, anapproach that picks seven domain pairs out of thenine possibilities at random (without replacement)will have a 0.028 probability (p-value) of getting allits seven picks correct (truly interacting domainpairs).

Yeast Sec23p/Sec24p heterodimer

Sec23p (YPR181c) and Sec24p (YIL109c) arecomponents of the Sec23p-Sec24p heterodimericcomplex of the COPII vesicle, which carries proteinsfrom the endoplasmic reticulum (ER) to the Golgicomplex.57 YPR181c and YIL109c are structurallyrelated proteins, consisting of five distinct Pfamdomains: zinc finger (PF04810,<1e-19), alpha/betatrunk domain (PF04811,<1e-124), beta-barreldomain (PF08033,<2e-36), helical domain (PF04815,

, and gamma (ATP3) chains of the ATPase. (a) Proteina chain, green for the beta chain, blue for the gamma chain,e shown using rectangular boxes (not drawn to scale). Thehitecture. Inter-chain domain–domain interactions, whichiPfam), are shown using double-arrow lines in the domainairs between two proteins, sorted in descending order, areoted with Y, have high correlation scores exhibiting highine mitochondrial F1-ATPase PDB crystal structure (PDB:ta, and gamma chains colored in red, green, and blue,

Page 5: Co-evolutionary Analysis of Domains in Interacting Proteins Reveals Insights Into Domain-domain Interactions Mediating Protein-protein Interactions

865Co-evolution of Interacting Domains

<1e-43), and C-terminal gelsolin-like domain(PF00626, 1e-17). The domain architectures forboth the proteins are given in Figure 4(a), showingthe lone inter-chain domain–domain interactionbetween the two alpha/beta trunk domains.There are several intra-chain domain–domain

interactions within each chain, involving all fivedomains (not shown in the Figure). Since, ouranalysis considers only inter-chain domain pairs(one in each chain), we considered only the loneinter-chain interaction between the alpha/betatrunk domains of the Sec23p and Sec24 chains. Thecorrelation scores for all possible inter-chain domainpairs are listed as a table in Figure 4(b), and thecartoon of the PDB crystal structure (PDB: 1m2v),supporting the interaction, is shown in Figure 4(c). Itis evident from our analysis that the interactingdomain pair indeed has relatively high correlation.However, it was not the pair with the highestcorrelation. The domain pair of beta-barrel domain(PF04811) of Sec23p and alpha/beta trunk domain(PF08033) of Sec24p had the highest correlation.Since the trunk and beta-barrel domains interactwithin each chain, and both chains are remarkablyconserved through evolution and are related, it maynot be unreasonable to expect them to have a highcorrelation. Considering the fact that the lone trueinteracting domain pair has the second highestcorrelation score, an approach that picks twodomain pairs out of the 25 possibilities at random

Figure 4. Interaction between Sec23 (YPR181c) and Sec2vesicles. (a) Protein sequences are shown using thick gray linerectangular boxes (not drawn to scale). The names of the proteinter-chain domain–domain interaction, which is known to beshown using a double arrow line. (b) The correlation scores ofin descending order, are listed as a table. The domain pair thatscore, exhibiting high degree of co-evolution. (c) A cartooninteraction, is shown with domain colors consistent with the

(without replacement) will have a 0.04 probability(p-value) of picking the lone interacting domain pair.

Interactions in DNA-directed RNA polymerasecomplex

From the DNA-directed RNA polymerase complexin yeast, we considered two protein–protein interac-tions involving subunit Rpb8. There are several PDBcrystal structures that show Rpb8 subunit interactingwith smaller subunits Rpb3 and Rpb11, and thelargest subunit Rpb1. The results of our domain-levelco-evolutionary analysis for interactions betweensubunits Rpb8 (YOR224c) and Rpb3 (YIL021w), andsubunits Rpb8 and Rpb1 (YDL140c) are shown inFigure 5(a) and (b), respectively. YOR224c wasinferred to contain just one domain: RNA_pol_Rpb8(PF03870, 4e-92). YIL021wwas assignedwith dimeri-sation domain (PF01193, 1e-19) and insert domain(PF01000, 7e-41), and YDL140c was assigned withseven domains: clamp domain (PF04997, 7e-178),active site domain (PF00623, 5e-188), pore domain(PF04983, 2e-69), funnel domain (PF05000, 2e-50),cleft domain (PF04998, 2e-170), and two mobile mo-dule domains (PF04992, 4e-97 and PF04990, 1e-78).The correlation results for the interaction between

Rpb3 and Rpb8 subunits (using orthologs from acommon set of 15 species), listed as a table in Figure5(a), show that the interacting domain pair(PF01193, PF03870), shown in the cartoon of crystal

4 (YIL109c) components of the COPII coat of ER-golgis, and Pfam domain annotations are shown using coloredin sequences are to the left of the domain architecture. Antrue from a PDB crystal structure (as inferred in iPfam), isall possible domain pairs between the two proteins, sortedis known to interact, denoted with Y, has a high correlationof PDB crystal structure (PDB: 1m2v), supporting the

domain architecture.

Page 6: Co-evolutionary Analysis of Domains in Interacting Proteins Reveals Insights Into Domain-domain Interactions Mediating Protein-protein Interactions

Figure 5. Inferred domain–domain interactions in DNA-directed RNA polymerase complex. Protein sequences areshown using thick gray lines, and the domain annotations are shown using colored rectangular boxes (not drawn to scale).The names of the protein sequences are to the left of the domain architecture. The correlation scores of all possible domainpairs between the two proteins, sorted in descending order, are listed as a table. Inter-chain domain–domain interactions,which are known to be true from PDB crystal structures (as inferred in iPfam), are shown using double-arrow lines in thedomain architecture, and Y in the table. Domain pairs that are known to interact have high correlation scores, exhibitinghigh degree of co-evolution. Cartoons of PDB crystal structures, supporting the interactions, are shownwith domain colorsconsistent with the domain architecture. (a) Interaction between subunits 3 and 8 of the DNA-directed RNA polymerase(PDB: 1y1v). (b) Interaction between subunits 1 and 8 of the DNA-directed RNA polymerase (PDB : 1y1v). Since PF04998contains nested domain PF04992, interaction between PF04998 and PF03870 is considered to be true (denoted by ¶).

866 Co-evolution of Interacting Domains

structure (PDB: 1y1v), has a higher level of co-evolution when compared to the non-interactingdomain pair (PF01000, PF03870). The correlationresults are not that clear for the interaction betweenthe Rpb1 and Rpb8 subunits (refer to the table inFigure 5(b)). While two out of the three interactingdomain pairs have high correlation scores, theinteraction between the funnel domain (PF05000)and RNA_pol_Rpb8 (PF03870) has the lowestcorrelation score among all possible domain pairs.An approach that picks four domain pairs out of thesix possibilities at random (without replacement)will have a 0.086 probability (p-value) of gettingthree out its four picks correct (truly interactingdomain pairs).An interacting domain pair with a low correlation

(false negative in some sense) could be explainedusing one or both of the following reasons. Whenassessing the degree of co-evolution between twodomains, we tend to ignore the number of interact-ing partners a domain may have. Even though the

co-evolutionary hypothesis for interacting domainsassumes that the interacting domains undergocorrelated mutations, more specifically, it is actuallythe binding surfaces that undergo correlated muta-tions. A domain having multiple interacting partnersmay use distinct patches on its surface to interactwith each of its partners.34,58 Each binding region ofa domain is highly specific to its interacting partner.Thus, the surface patches used by a domain tointeract with many interacting partners mayundergo independent correlated mutations withtheir corresponding interacting partner. As a result,mutations at different surface patches of a domainneed not be correlated (see Figure 6). Consequently,the degree of co-evolution between two interactingdomains, one or both with multiple interactingpartners, may actually be suppressed, resulting in alow correlation score. Thus, it may not always be thecase that a pair of interacting domains, each of whichhas multiple interacting partners, has a high correla-tion score.

Page 7: Co-evolutionary Analysis of Domains in Interacting Proteins Reveals Insights Into Domain-domain Interactions Mediating Protein-protein Interactions

Figure 6. Uncorrelated set of correlated mutations.Each rectangular box is a cartoon representation of amultiple sequence alignment of a family of orthologousproteins/domains. There are a total of six families, A, B, C,D, E, and F. The binding residues of interaction, referred toas binding surface, between family A and each of the otherfive families are highlighted using distinct colors. Underthe co-evolutionary hypothesis, which states interactingdomains undergo correlated mutations, mutations at eachof A's five surface patches must be correlated with those atthe binding surface in the corresponding interactingpartners. However, mutations at A's five surface patchesneed not be correlated. As a result, for example, it may beunreasonable to expect A and E to have similar evolu-tionary histories even though the corresponding bindingsurfaces in A and E may have high correlation.

867Co-evolution of Interacting Domains

Mutations occurring in interacting domains maynot be correlated due to many other biologicalconstraints imposed on them. There may be cases inwhich the mutations at a binding surface in a domainmay not be followed by compensatory mutations atthe binding surface of its interacting partner. Sinceprotein binding surfaces are relatively more con-served than the rest of the sequence,59,60 domainswith many interacting partners, and thus manysurface patches, are likely to be relatively moreconserved than those with few interactingpartners.61,62 For each interacting domain, from itsmultiple sequence alignment, we computed thesequence identity of each orthologous domain inreference to the yeast domain. The alignments forRpb8 and Rpb1 had orthologs in a common set of 17species, including yeast. The average sequenceidentities for the interacting domains, along with thenumber of known interacting partners are listed inTable 1. Interestingly, at least for this example, thosedomainswithmany interacting partners are relativelymore conserved than those with few partners.

Exportin Cse1p complexed with its cargo

Nuclear pore complexes serve as a medium forexchange of macromolecules between the nucleus

Table 1. The average sequence identities of interacting domainof DNA-directed RNA polymerase, along with the number of

There is a good correlation between the number of interacting partndomains with many partners are relatively more conserved than thcorrelation are highlighted in black.

and the cytoplasm. Carrier proteins that shuttlebetween the nucleus and the cytoplasm enableactive transport of large molecules through thesepore complexes. Importin-alpha Srp1 (YNL189w),which acts as a carrier for many nuclear traffickingprocesses, binds cargo in the cytoplasm, movesthrough the nuclear pore and releases the cargo inthe nucleus. The nuclear envelope protein Cse1p(YGL238w), a yeast homolog of mammalian CAS,recycles importin-alpha from the nucleus back to thecytoplasm, thereby allowing it to participate inmultiple rounds of nuclear import.63,64

To understand the degree of co-evolutionbetween the interacting domains in this complex,we first assigned Pfam domains to the proteins.Cse1 (YGL238w) was assigned with importin-betaN-terminal domain (PF03810, 2e-22), Cse1 domaincontaining HEAT repeats (PF08506, 7e-275), andCas/Cse C terminus domain (PF03378, 5e-77).Importin-alpha Srp1 (YNL189w) was assignedwith importin beta binding domain (PF01749, 1e-45) and eight Armadillo repeats (PF00514, <5e-6).The domain architecture and the domain-levelinteractions in this complex are shown in Figure7. The co-evolution scores for all domain pairsbetween these chains are listed as a table in Figure7(b). Three out of the five interacting domain pairshave high correlation, implying high level of co-evolution. The remaining two interacting domainpairs do not have high correlation scores. Anapproach that picks five domain pairs out of the 29possibilities at random (without replacement) willhave a 0.019 probability (p-value) of getting twoout its five picks correct (truly interacting domainpairs).For each interacting domain, from its multiple

sequence alignment, we computed the sequenceidentity of each orthologous domain in reference tothe yeast domain. The average sequence identitiesfor the interacting domains, along with the numberof known interacting partners are listed in Table 2,which shows that those domains with manyinteracting partners are relatively more conservedthan those with few partners, which is consistentwith similar other findings suggesting that hubs(those proteins/domains with numerous interactingpartners) are relatively more conserved.59–62 If true,this could possibly explain why domains PF03378and PF08506, with average sequence identities∼34% and ∼46%, respectively, have low correlation

s between subunits Rpb1 (YDL140c) and Rpb8 (YOR224c)known interacting partners for each domain

ers and the average sequence identity, which suggests that thoseose with few partners. The interacting domain pairs with low

Page 8: Co-evolutionary Analysis of Domains in Interacting Proteins Reveals Insights Into Domain-domain Interactions Mediating Protein-protein Interactions

Figure 7. Interaction between importin alpha Srp1 (YNL189w) and nuclear export receptor Cse1 (YGL238w). (a)Protein sequences are shown using thick gray lines, and Pfam domain annotations are shown using colored rectangularboxes (not drawn to scale). The names of the protein sequences are to the left of the domain architecture. Inter-chaindomain–domain interactions, which are known to be true from PDB crystal structures (as inferred in iPfam), are shownusing a double arrow line. (b) The correlation scores of all possible domain pairs between two proteins, sorted indescending order, are listed as a table. Two of the five domain pairs, which are known to interact (denoted with Y), havehigh correlation scores, exhibiting high degree of co-evolution. The reason for the other three known interacting domainpairs not having high correlation scores could be attributed to “uncorrelated set of correlated-mutations” illustrated inFigure 4. (c) A cartoon of the PDB crystal structure (PDB: 1wa5), supporting the interaction, is shown with domain colorsconsistent with the domain architecture. A subset of the interaction sites is shown using dotted spheres.

868 Co-evolution of Interacting Domains

with domain PF00514_8, which happens to have arelatively high average sequence identity at ∼75%.

Predicting large-scale domain–domaininteractions from the yeast interactome

Motivated by our results that interacting domainpairs (in interacting proteins) have higher correla-tion compared to non-interacting domain pairs, wedeveloped a method to test the generality of the

Table 2. The average sequence identities of interacting domaexport receptor Cse1 (YGL238w), along with the number of k

There is a clear correlation between the number of interacting partnedomains with many partners are relatively more conserved than thcorrelation are highlighted in black.

observed trend, and to predict large-scale domain–domain (i.e. Pfam–Pfam) interactions using the yeastinteractome. First, we assigned the interacting yeastprotein sequences with Pfam domains (see Materialsand Methods). We then considered only thoseinteractions involving proteins with at least 50% oftheir sequence lengths assigned (SLA) with Pfamdomain(s), which we refer to as “test set SLA≥50%”. A cutoff of 50% was chosen as a compro-mise between being sufficiently small to provide

ins between importin alpha Srp1 (YNL189w) and nuclearnown interacting partners for each domain

rs and the average sequence identity, which suggests that thoseose with few partners. The interacting domain pairs with low

Page 9: Co-evolutionary Analysis of Domains in Interacting Proteins Reveals Insights Into Domain-domain Interactions Mediating Protein-protein Interactions

869Co-evolution of Interacting Domains

enough interactions, and large enough for assigneddomains to contain sufficient binding sites. Inaddition, we imposed a restriction that the interact-ing proteins have orthologs in at least a common setof ten species. This resulted in a set of 1180interactions among 654 proteins.For each protein–protein interaction, we com-

puted the correlation scores of all possible domainpairs between the two proteins, and inferred thedomain pair(s) with the highest correlation score tobe the interacting pair that is most likely to mediatethe protein–protein interaction. For our set of 1180protein–protein interactions, we inferred a total of1222 domain–domain interactions (Supplementarymaterial S3), 960 of which are unique. In order tovalidate our predictions, we used the iPfamdatabase,50 which contains the list of knowndomain–domain interactions inferred from PDBcrystal structures. We found that 206 out of our1222 predictions (109 out of the 960 unique predic-tions ≈11.35%) are in iPfam.If we restricted our set to only those interactions

involving proteins with at least 75% (instead of 50%before) of their sequence lengths assigned with Pfamdomain(s), the percentage of predictions in iPfamjumps by 52%. Our restricted set, referred to as “testset SLA ≥75%”, contained a total of 374 protein–protein interactions among 298 proteins. For this set,we inferred a total of 392 domain–domain interac-tions (Supplementary material S5), 336 of which areunique. 58 out of the 336 unique predictions(≈17.26%) are in iPfam. The increase in the predic-tion accuracy from 11.35% (for the test set SLA≥50%) to 17.26% (for the test set SLA ≥75%)indicates that the higher the SLA, the better theprediction accuracy. This is understandable consid-ering the fact that the binding region in aninteracting protein needs to be contained in one ofthat protein's domains in order to have anypossibility of it being identified.We compared our prediction results with those of

Chen and Liu's RDFF method32 and Riley et al.'sDPEA method.33 The objective of this comparison isto find what percent of RCDP's predictions areconfirmed by the other two methods. Chen and Liuused 9834 yeast protein interactions to infer 4366domain–domain interactions, out of which 2475 arebetween Pfam-A domains while the rest involvePfam-B domains. Riley et al.'s DPEA method is astatistical approach, which uses expectation max-imization (EM) algorithm as a subroutine. Theyused a network of 26,032 protein–protein interac-tions from 69 organisms to infer a total of 3005domain–domain interactions, out of which 1812 ofthem are between Pfam-A domain pairs. Thecomparison summary is shown in Figure 8(a), inwhich we refer to our method as RCDP (relative co-evolution of domain pairs). Although our analysisshows that the predictions by the RCDP method aremore likely to be in iPfam than those by RDFF orDPEA methods, one needs to keep in mind that theprediction accuracies of the three methods are notdirectly comparable as they all use different datasets

of varying sizes. However, the dataset used in ourstudy is a subset of that used by Chen and Liu, andRiley et al. So, one would expect a good fraction ofour predictions to be confirmed by the other twomethods. Interestingly, only about 5% of RCDP'spredictions are confirmed by both the RDFF andDPEA methods (Figure 8(b)). About 14% of RCDP'spredictions are confirmed by DPEA alone, andabout 23% of RCDP's predictions are confirmed byRDFF alone. Overall, 31% of RCDP's predictions areconfirmed by DPEA and/or RDFF, indicating thatRCDP can predict known domain–domain interac-tions missed by the other two. Thus, the RCDPmethod can be used with other methods to detectunrecognized domain–domain interactions on agenome scale with wider coverage.

Validation of predicted domain–domaininteractions and estimating the true predictivepower of the RCDP method

We used domain pairs found to interact in PDB49

crystal structures, as reported in the iPfamdatabase,50 as our gold standard to verify thepredicted domain–domain interactions. The iPfamdatabase defines two domains from two differentchains to be interacting if and only if they are closeenough in at least one PDB complex to form aninteraction. We consider a predicted interactionbetween domain Pi in protein P and domain Qj inprotein Q to be a true interaction (true positive) ifand only if iPfam lists this pair of domain to beinteracting based on one or more PDB crystalstructure evidences.An interacting protein pair P and Q is said to

contain an iPfam domain–domain interaction xy ifdomain x is present in protein P and domain y ispresent in protein Q, or vice versa. A given protein–protein interaction may contain more than oneiPfam domain–domain interaction, i.e. out of allpossible domain pairs between the two interactingproteins, there may be more than one domain pairlisted in iPfam.We consider only those domain pairs found to

interact in PDB crystal structures, as reported iniPfam, as true positives. Absence of a domain pair iniPfam does not necessarily mean that the twodomains do not interact. Thus, it may not be fair toconsider those predictions (without PDB evidence)as false positives. It could very well be that a goodfraction of them could be biologically occurringdomain interactions. A simple case would be a trueprotein–protein interaction, none of whose possibledomain pairs are in iPfam. In order to estimate thetrue predictive power of the RCDP method, wetested it on a validation set comprising only thoseprotein–protein interactions satisfying all of thefollowing conditions: (i) is between proteins with atleast 50% of their sequence lengths assigned withPfam domain(s), (ii) is not between two one-domainproteins, (iii) contains a domain pair that is known tointeract as per iPfam, and (iv) is between proteinshaving orthologs in at least a common set of ten

Page 10: Co-evolutionary Analysis of Domains in Interacting Proteins Reveals Insights Into Domain-domain Interactions Mediating Protein-protein Interactions

Figure 8. (a) An indirect comparison of RCDP's prediction results with those of RDFF32 and DPEA33 methods. Thepredictions were validated against the known domain–domain interactions found in PDB crystal structures (as inferred iniPfam50). The prediction accuracies of the three methods are not directly comparable as the results are from datasets ofvarying sizes. However, the dataset used to test RCDP is a subset of that used by Chen and Liu, and Riley et al. (b) Onlyabout 5% of RCDP's predictions are confirmed by both DPEA and RDFF methods. Overall, about 31% of RCDP'spredictions are confirmed by either DPEA or RDFF, with about 14% and 23% of RCDP's predictions confirmed by DPEAand RDFF, respectively. This indicates that each of these three methods can detect known domain–domain interactionsmissed by the other two.

870 Co-evolution of Interacting Domains

species. The validation set contained a total of 109protein–protein interactions (Supplementary mate-rial S6), comprising a total of 109 unique domaininteractions that are in iPfam.Ideally, a good domain–domain interaction pre-

diction method should be able to recover all 109unique known domain–domain interactions presentin the validation set. To measure the percentage ofrecovery, we use the sensitivity measure, which isthe ratio of the number of unique true positives tothe number of unique positives (which is 109). Onthe validation set of 109 protein–protein interac-tions, RCDP predicted 109 unique domain–domaininteractions, out of which 63 are in iPfam. Thisresulted in a sensitivity of 63/109=57.8%. This is anunderestimation because there may more than onedomain pair mediating a given protein–proteininteraction, and since RCDP is designed to findonly the pair(s) with the highest correlation, it maynot be able to recover all 109 unique domain–domain interactions present in the set.While it is important for any good method to be

able to recover the 109 known domain–domaininteractions from the validation set, it is equallyimportant that every predicted domain–domain

interaction is correct. To measure the accuracy ofour predictions, we used the positive predictivevalue (PPV) defined as:

PPV ¼ TPTPþ FP

where TP is the number of predicted domain pairsthat are known to be true (in iPfam), and FP is thenumber of predicted domain pairs that are not in theiPfam. RCDP predicted a total of 147 domain–domain interactions (109 of them are unique) forthe validation set, out of which 94 are in iPfamwith aPPVof 63.95%. To ensure that the 63.95% predictionaccuracy of RCDP is not by chance, we compared itagainst a randommethod. For this, we used the exactsame random strategy used by Nye et al.31, which,for a given protein–protein interaction, picks adomain pair at random out of all possible domainpairs. Since there could be more than one interactingdomain pair within each interacting protein pair,there is a certain probability that the domain pairpicked at random is a true interaction.We performed100,000 runs of this random method on our valida-tion set. The p-value of obtaining a prediction

Page 11: Co-evolutionary Analysis of Domains in Interacting Proteins Reveals Insights Into Domain-domain Interactions Mediating Protein-protein Interactions

Figure 9. Domain–domain interaction predictions results for 109 yeast protein–protein interactions, each of which (i)is between proteins with at least 50% of their sequence lengths assigned with Pfam domain(s), (ii) is not an interactionbetween two one-domain proteins, (iii) contains a domain pair that is known to interact (as reported iPfam), and (iv) isbetween proteins having orthologs in at least a common set of ten species. The performance of RCDP versus a method thatpicks a domain pair at random among all possible domain pairs is plotted. The results are broken down according to thenumber of potential domain–domain contacts between an interacting protein pair. RCDP clearly outperforms randompicks by about 9%, which is significant (p-value 1.05×10−2) considering the fact that it has been shown before (Figure 4 inNye et al.31), on a different dataset, that random performs as good as three other popular methods26,30,31 for inferringdomain–domain interactions.

871Co-evolution of Interacting Domains

accuracy of ≥63.95% by chance is 1.05×10−2 (z-score: 2.39). The performance of RCDP versus therandom method is shown in Figure 9. On average,the random method is expected to have 55.19(±0.01)% of its predictions to be in iPfam (Figure 9).RCDP outperforms random by about 9%, which issignificant, considering the fact that Nye et al.showed, on a different dataset, that the randommethod performs as good as three other popularmethods for predicting domain–domain interac-tions. In particular, Nye et al. showed that one canexpect their “lowest p-value” method, Deng et al.'sMLE method,30 and the random method to haveabout 55% of their predictions to be true, andSprinzak et al.'s association method26 to have about52% of its predictions to be true. Since the interactiondataset and domain annotations (SCOP domains)used in Nye et al.'s study are different from thoseused in this study, the results are not directlycomparable.

Conclusion

Here, we performed co-evolutionary analysis ofdomains in interacting proteins to assess whether ornot interacting domain pairs exhibit higher level ofco-evolution than non-interacting domain pairs of agiven protein–protein interaction. We used yeast

protein–protein interactions from DNA-directedRNA polymerase complex and F1-ATPase complexamong others in our investigation. Our resultsindicated that interacting domain pairs exhibithigher level of co-evolution than the non-interactingdomain pairs. Motivated by the results, we designeda method, called RCDP, to confirm the observedtrend, and to predict large-scale domain–domain (i.e.Pfam-Pfam) interactions using the yeast interactome.A total of 1222 domain–domain interactions from1180 protein–protein interactions were predicted,out of which 109 are found in PDB (as reported iniPfam). Through comparison of our predictions withthose from two other methods, we showed that theRCDP method can predict known domain–domaininteractions missed by the other two methods.The proposed RCDP method may not be suitable

for predicting domain–domain interactions betweenhomodimers (interaction between two copies of thesame protein). The reason for this is that the domainpairs with the highest correlation will be inter-chainhomodomains, which will have the maximumcorrelation score of one. Although this makesRCDP's results predictable for homodimers, inreality, it is mostly the case that homodimers aremediated by inter-chain homodomain interactions.In our set of 1180 protein–protein interactions (SLA≥50%), we had 71 cases of a protein interacting withitself. For this set of 71 interactions, we predicted 112

Page 12: Co-evolutionary Analysis of Domains in Interacting Proteins Reveals Insights Into Domain-domain Interactions Mediating Protein-protein Interactions

†http://www.ncbi.nlm.nih.gov/CBBresearch/Przytycka/RCDP/

872 Co-evolution of Interacting Domains

domain–domain interactions, out of which 84 (75%)are found in PDB (as reported in iPfam).Although relative co-evolution of interacting

domains can be used to predict domain–domaininteractions between two interacting proteins, thereare some limitations that apply to any method basedon co-evolution, which could cause false positivesand false negatives that one should be aware of. Firstof all, this type of analysis assumes that interactingdomains/proteins co-evolve, i.e. undergo correlatedmutations, which may not be always true due tonumerous other biological constraints on the inter-acting domains/proteins. If domain Awith multipleinteracting partners undergoes correlated mutationswith its interacting partners, then there is a dangerof it not having a high correlation with its partnersdue to “uncorrelated set of correlated mutations”(see Figure 4).Predicted domain–domain interactions are only as

good as the accuracy of the protein–protein interac-tions used. Domain–domain interactions are pre-dicted under the premise that the given protein–protein interaction is accurate. If a protein–proteininteraction is a false positive, then one should notexpect the predicted domain–domain interaction tobe true. Various studies65–68 have reported thatanywhere between 40–60% of the reported protein–protein interactions could be false-positives, whichcould potentially explain the false-positives (≈34%)in predictions by the RCDP method.Not all interactions are mediated by pairs of

globular domains. There are many that involvebinding of a domain in one protein to a short region(approximately three to eight residues) in ano-ther.69,70 Detecting these short length “linear motifs”using sequence comparison is difficult due to theirtendency to reside in disordered regions in proteins,and limited conservation outside of closely relatedspecies.37 Thus, there is a possibility that the set ofinteracting residues may not be part of the domainsassigned to a protein. This could lead to incorrectprediction of domain–domain interactions in suchcases. And, there may be more than one domain pairmediating a protein–protein interaction, and sinceRCDP is designed to find only the pair(s) with thehighest correlation, it may not be able to recover allinteracting domain pairs.The orthology detection procedure used in this

study may not be sufficiently rigorous for detectingorthologs. We did attempt to use a very stringentreciprocal BLAST best-hits approach. But, becauseof its stringent nature, and our requirement thatinteracting proteins have at least ten orthologs froma common set of species, we were unable to obtain alarge enough dataset to make any statistical conclu-sion. Another issue is that of closely relatedparalogs. Since many genes in Eukaryotes areknown to have numerous in-paralogs (due to recentduplications), it makes it difficult to establish one-to-one orthology relationships. Our tests on a few casesto see the effect of including one in-paralog overanother had little or no effect on the co-evolutionaryanalysis.

Despite these limitations, the RCDP methodproves to be extremely useful for inferringdomain–domain interactions. Unlike sophisticatedstatistical methods, which require a training set, theRCDP method can directly be used on a givenprotein–protein interaction to predict the domainpair that is most likely to mediate the interaction.Since RCDP, DPEA, and RDFF methods share asmall fraction of their predictions, indicating thatthey can detect known domain–domain interactionsmissed by the other, together they can be used todetect unrecognized domain–domain interactionson a genome scale with wider coverage.The RCDP method is simple and easy to imple-

ment (an implementation of the RCDP algorithm isavailable†), and can be used as a tool to guideexperimentalists in discovering previously unrecog-nized domain–domain interactions. In the future,we would like to investigate whether there is apossibility of transitivity in co-evolution. That is, ifAinteracts with B, and B interacts with C, will A and Cexhibit a high degree of co-evolution (assuming thatA and C do not interact) because of their associationwith a common interacting partner B? If they do, itwould be interesting to know the biologicalreasons/constraints that require them to co-evolve.

Materials and Methods

Construction of multiple sequence alignments,phylogenetic trees, and similarity matrices

For each protein–protein interaction, multiple sequencealignments for the two proteins were constructed usingMUSCLE71 by searching for their respective orthologs in 93eukaryotic genomes (Supplementary material S1). Ortho-logs were obtained by performing a stringent BLASTsearch.72 For a given query protein, the best hit in a genomewith e-value <1e-5, sequence identity of at least 35% and analignment length of at least 75% of the length of both thequery and the hit sequence was considered to be anortholog. Sequence identity and alignment length con-straints were enforced to eliminate partial hits fromconsideration. Multiple sequence alignment for domainD in protein P is constructed by extracting those regions inP's multiple sequence alignment that corresponds to D.In order to be able to compare the evolutionary histories

of two domains, we require that both domains haveorthologs in at least a common set of ten species. Multiplesequence alignments of both domains for a common set ofspecies were constructed, followed by the construction ofphylogenetic trees and similarity matrices using thealgorithms provided in the ClustalW suite.73

Assessment of the agreement between theevolutionary histories of two domains

The extent of agreement between the evolutionaryhistories of two domains is assessed by comparing theirphylogenetic trees. For comparison of phylogenetic trees,

Page 13: Co-evolutionary Analysis of Domains in Interacting Proteins Reveals Insights Into Domain-domain Interactions Mediating Protein-protein Interactions

873Co-evolution of Interacting Domains

we follow the standard practice of comparing thecorresponding similarity matrices.14,16,39–44 The extent ofagreement between two similarity matrices, A and B, isevaluated using Pearson's correlation coefficient, given by:

rAB ¼

Xn�1

i¼1

Xnj¼iþ1

ðAij � AÞðBij � BÞffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiPn�1

j¼iþ1

Pnj¼iþ1

ðAij � AÞ2Pn�1

i¼1

Pnj¼iþ1

ðBij � BÞ2vuut

;

where n is the number of species (rows/columns)represented in the matrices,Aij and Bij are the evolutionarydistances between species i and j in the tree of domains Aand B, respectively, and Ā and B- are the mean values of allAij and Bij

, respectively. The value of r ranges from −1.0 to+1.0, with higher r indicating greater agreement betweenthe two matrices, and thus higher level of co-evolutionbetween the corresponding families.

Inferring domain–domain interactions

For every interacting protein pair, P and Q, all possibledomain–domain interactions between them are consid-ered. Let protein P contain domains {P1, P2, …, Pm} andprotein Q contain domains {Q1,Q2, …,Qn}. The correlationof evolutionary histories of all possible domain pairsbetween P and Q is computed, and the domain pair PiQjwith the highest level of co-evolution (whose evolution-ary histories correlate the most) is inferred to be the one(or one of many domain–domain contacts) that is mostlikely to mediate the interaction between P and Q. Incases of more than one domain pair having the highestcorrelation score, all domain pairs with the highest scoreare inferred to be interacting. Interestingly, and moreoften, more than one domain pair mediate a givenprotein–protein interaction.

Protein–protein interaction test set

Protein–protein interaction data for Saccharomyces cere-visiae (yeast) from the DIP database74 (February 2005release) were used. This set contained a total of 17,471interactions underlying 4931 yeast proteins. For domaindefinition, we used the Pfam database of Hidden MarkovModel (HMM) profiles.51 Only Pfam-A profiles were usedto assign domain definitions to the 4931 interactingproteins, using e-value cutoff of 1e-3.Only interacting proteins with at least 50% of their

sequence lengths assigned onto Pfam domain(s), andinteractions involving them, were considered. Thisreduced the number of interactions to 3266 among 1397proteins. Because of the limitation that interacting proteinpairs have orthologs in at least a common set of tenspecies, our final test set contained 1180 interactions,underlying 654 proteins (Supplementary material S2). Wealso considered a restricted set of interactions, with eachinteracting protein having at least 75% of its sequencelength assigned onto Pfam domain(s). This restricted setcontained a total of 374 interactions among 298 proteins(Supplementary material S4).

Supplementary information

Supplementary data associated with this article, and animplementation of the RCDPalgorithm is available atURL:

http://www.ncbi.nlm.nih.gov/CBBresearch/Przytycka/RCDP/. The supplementary data comprises: S1, organismsused in ortholog search; S2, test set 1, containing 1180 yeastprotein interactions with SLA ≥50%; S3, RCDP predictionresults for the test set 1, containing 1222 domain–domaininteractions; S4; test set 2, containing 374 yeast proteininteractions with SLA ≥75%; S5, RCDP prediction resultsfor the test set 2, containing 394 domain–domain interac-tions; S6, validation set, containing 109 yeast proteininteractions with SLA ≥50%.

Acknowledgements

We thank S. Balaji for useful comments andsuggestions. This work was supported by theintramural research program of the National Libraryof Medicine, National Institutes of Health.

References

1. Uetz, P., Giot, L., Cagney, G., Mansfield, T. A., Judson,R. S., Knight, J. R. et al. (2000). A comprehensiveanalysis of protein-protein interactions in Saccharo-myces cerevisiae. Nature, 403, 623–627.

2. Ito, T., Chiba, T., Ozawa, R., Yoshida, M., Hattori, M. &Sakaki, Y. (2001). A comprehensive two-hybrid ana-lysis to explore the yeast protein interactome. Proc.Natl Acad. Sci. USA, 98, 4569–4574.

3. Gavin, A. C., Bosche, M., Krause, R., Grandi, P.,Marzioch, M., Bauer, A. et al. (2002). Functionalorganization of the yeast proteome by systematicanalysis of protein complexes. Nature, 415, 141–147.

4. Ho, Y., Gruhler, A., Heilbut, A., Bader, G. D., Moore,L., Adams, S. L. et al. (2002). Systematic identificationof protein complexes in Saccharomyces cerevisiae bymass spectrometry. Nature, 415, 180–183.

5. Giot, L., Bader, J. S., Brouwer, C., Chaudhuri, A.,Kuang, B., Li, Y. et al. (2003). A protein interaction mapof Drosophila melanogaster. Science, 302, 1727–1736.

6. Li, S., Armstrong, C. M., Bertin, N., Ge, H., Milstein, S.,Boxem, M. et al. (2004). A map of the interactomenetwork of the metazoan C. elegans. Science, 303,540–543.

7. Butland, G., Peregrin-Alvarez, J. M., Li, J., Yang, W.,Yang, X., Canadien, V. et al. (2005). Interaction networkcontaining conserved and essential protein complexesin Escherichia coli. Nature, 433, 531–537.

8. Krogan, N. J., Cagney, G., Yu, H., Zhong, G., Guo, X.,Ignatchenko, A. et al. (2006). Global landscape ofprotein complexes in the yeast Saccharomyces cerevisiae.Nature, 440, 637–643.

9. Dandekar, T., Snel, B., Huynen, M. & Bork, P. (1998).Conservation of gene order: a fingerprint of proteinsthat physically interact.Trends Biochem. Sci. 23, 324–328.

10. Enright, A. J., Iliopoulos, I., Kyrpides, N. C. &Ouzounis, C. A. (1999). Protein interaction maps forcomplete genomes based on gene fusion events.Nature, 402, 86–90.

11. Marcotte, E. M., Pellegrini, M., Ng, H. L., Rice, D. W.,Yeates, T. O. & Eisenberg, D. (1999). Detecting proteinfunction and protein-protein interactions from gen-ome sequences. Science, 285, 751–753.

12. Overbeek, R., Fonstein, M., D'Souza, M., Pusch, G. D.& Maltsev, N. (1999). Use of contiguity on the

Page 14: Co-evolutionary Analysis of Domains in Interacting Proteins Reveals Insights Into Domain-domain Interactions Mediating Protein-protein Interactions

874 Co-evolution of Interacting Domains

chromosome to predict functional coupling. In SilicoBiol. 1, 93–108.

13. Pellegrini, M., Marcotte, E. M., Thompson, M. J.,Eisenberg, D. & Yeates, T. O. (1999). Assigning proteinfunctions by comparative genome analysis: proteinphylogenetic profiles. Proc. Natl Acad. Sci. USA, 96,4285–4288.

14. Goh, C. S., Bogan, A. A., Joachimiak, M., Walther, D. &Cohen, F. E. (2000). Co-evolution of proteins with theirinteraction partners. J. Mol. Biol. 299, 283–293.

15. Wojcik, J. & Schachter, V. (2001). Protein-proteininteraction map inference using interacting domainprofile pairs. Bioinformatics, 17(Suppl. 1), S296–S305.

16. Pazos, F. & Valencia, A. (2001). Similarity of phyloge-netic trees as indicator of protein-protein interaction.Protein Eng. 14, 609–614.

17. Pazos, F. & Valencia, A. (2002). In silico two-hybridsystem for the selection of physically interactingprotein pairs. Proteins: Struct. Funct. Genet. 47,219–227.

18. Date, S. V. & Marcotte, E. M. (2003). Discovery ofuncharacterized cellular systems by genome-wideanalysis of functional linkages. Nature Biotechnol. 21,1055–1062.

19. Jansen, R., Yu, H., Greenbaum, D., Kluger, Y., Krogan,N. J., Chung, S., Emili, A. et al. (2003). A Bayesiannetworks approach for predicting protein-proteininteractions from genomic data. Science, 302, 449–453.

20. Lappe, M. & Holm, L. (2004). Unraveling proteininteraction networks with near-optimal efficiency.Nature Biotechnol. 22, 98–103.

21. Pagel, P., Wong, P. & Frishman, D. (2004). A domaininteraction map based on phylogenetic profiling. J.Mol. Biol. 344, 1331–1346.

22. Kim, Y. & Subramaniam, S. (2006). Locally definedprotein phylogenetic profiles reveal previouslymissed protein interactions and functional relation-ships. Proteins: Struct. Funct. Genet. 62, 1115–1124.

23. Sharan, R., Suthram, S., Kelley, R. M., Kuhn, T.,McCuine, S., Uetz, P. et al. (2005). Conserved patternsof protein interaction in multiple species. Proc. NatlAcad. Sci. USA, 102, 1974–1979.

24. Apic, G., Gough, J. & Teichmann, S. A. (2001). Domaincombinations in archaeal, eubacterial and eukaryoticproteomes. J. Mol. Biol. 310, 311–325.

25. Chothia, C., Gough, J., Vogel, C. & Teichmann, S. A.(2003). Evolution of the protein repertoire. Science,300, 1701–1703.

26. Sprinzak, E. & Margalit, H. (2001). Correlatedsequence-signatures as markers of protein-proteininteraction. J. Mol. Biol. 311, 681–692.

27. Kim, W. K., Park, J. & Suh, J. K. (2002). Large scalestatistical prediction of protein-protein interaction bypotentially interacting domain (PID) pair. GenomeInform. Ser. Workshop Genome Inform. 13, 42–50.

28. Ng, S. K., Zhang, Z. & Tan, S. H. (2003). Integrativeapproach for computationally inferring proteindomain interactions. Bioinformatics, 19, 923–929.

29. Albrecht, M., Huthmacher, C., Tosatto, S. C. &Lengauer, T. (2005). Decomposing protein networksinto domain-domain interactions. Bioinformatics, 21(Suppl. 2), ii220–ii221.

30. Deng, M., Mehta, S., Sun, F. & Chen, T. (2002).Inferring domain-domain interactions from protein-protein interactions. Genome Res. 12, 1540–1548.

31. Nye, T. M., Berzuini, C., Gilks, W. R., Babu, M. M. &Teichmann, S. A. (2005). Statistical analysis ofdomains in interacting protein pairs. Bioinformatics,21, 993–1001.

32. Chen, X. W. & Liu, M. (2005). Prediction of protein-protein interactions using random decision forestframework. Bioinformatics, 21, 4394–4400.

33. Riley, R., Lee, C., Sabatti, C. & Eisenberg, D. (2005).Inferring protein domain interactions from databasesof interacting proteins. Genome Biol. 6, R89.

34. Littler, S. J. & Hubbard, S. J. (2005). Conservation oforientation and sequence in protein domain-domaininteractions. J. Mol. Biol. 345, 1265–1279.

35. Gong, S., Park, C., Choi, H., Ko, J., Jang, I., Lee, J. et al.(2005). A protein domain interaction interface data-base: InterPare. BMC Bioinformatics, 6, 207.

36. Shoemaker, B. A., Panchenko, A. R. & Bryant, S. H.(2006). Finding biologically relevant protein domaininteractions: conserved binding mode analysis. ProteinSci. 15, 352–361.

37. Neduva, V., Linding, R., Su-Angrand, I., Stark, A., deMasi, F., Gibson, T. J. et al. (2005). Systematic discoveryof new recognition peptides mediating protein inter-action networks. PLoS Biol. 3, e405.

38. Aloy, P. & Russell, R. B. (2006). Structural systemsbiology: modelling protein interactions. Nature Rev.Mol. Cell Biol. 7, 188–197.

39. Goh, C. S. & Cohen, F. E. (2002). Co-evolutionaryanalysis reveals insights into protein-protein interac-tions. J. Mol. Biol. 324, 177–192.

40. Ramani, A. K. & Marcotte, E. M. (2003). Exploiting theco-evolution of interacting proteins to discover inter-action specificity. J. Mol. Biol. 327, 273–284.

41. Gertz, J., Elfond, G., Shustrova, A., Weisinger, M.,Pellegrini, M., Cokus, S. & Rothschild, B. (2003).Inferring protein interactions from phylogenetic dis-tance matrices. Bioinformatics, 19, 2039–2045.

42. Jothi, R., Kann, M. G. & Przytycka, T. M. (2005).Predicting protein-protein interaction by searchingevolutionary tree automorphism space. Bioinformatics,21(Suppl. 1), i241–i250.

43. Pazos, F., Ranea, J. A., Juan, D. & Sternberg, M. J.(2005). Assessing protein co-evolution in the contextof the tree of life assists in the prediction of theinteractome. J. Mol. Biol. 352, 1002–1015.

44. Sato, T., Yamanishi, Y., Kanehisa, M. & Toh, H. (2005).The inference of protein-protein interactions by co-evolutionary analysis is improved by excluding theinformation about the phylogenetic relationships.Bioinformatics, 21, 3482–3489.

45. Fraser, H. B., Hirsh, A. E., Wall, D. P. & Eisen, M. B.(2004). Coevolution of gene expression among interact-ing proteins. Proc. Natl Acad. Sci. USA, 101, 9033–9038.

46. Tirosh, I. & Barkai, N. (2005). Computational verifica-tion of protein-protein interactions by orthologous co-expression. BMC Bioinformatics, 6, 40.

47. Moyle, W. R., Campbell, R. K., Myers, R. V., Bernard,M. P., Han, Y. & Wang, X. (1994). Co-evolution ofligand-receptor pairs. Nature, 368, 251–255.

48. Pazos, F., Helmer-Citterich, M., Ausiello, G. &Valencia, A. (1997). Correlated mutations containinformation about protein-protein interaction. J. Mol.Biol. 271, 511–523.

49. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G.,Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P.E. (2000). The Protein Data Bank. Nucl. Acids Res. 28,235–242.

50. Finn, R. D., Marshall, M. & Bateman, A. (2005). iPfam:visualization of protein-protein interactions in PDB atdomain and amino acid resolutions. Bioinformatics, 21,410–412.

51. Bateman, A., Coin, L., Durbin, R., Finn, R. D.,Hollich, V., Griffiths-Jones, S. et al. (2004). The Pfam

Page 15: Co-evolutionary Analysis of Domains in Interacting Proteins Reveals Insights Into Domain-domain Interactions Mediating Protein-protein Interactions

875Co-evolution of Interacting Domains

protein families database. Nucl. Acids Res. 32,D138–D141.

52. Boyer, P. D. & Kohlbrenner, W. E. (1981). In EnergyCoupling in Photosynthesis (Selman, B. & Selman-Reiner, S., eds), pp. 231–240, Elsevier Science Publish-ing Co, New York.

53. Cox, G. B., Jans, D. A., Fimmel, A. L., Gibson, F. &Hatch, L. (1984). Hypothesis. The mechanism of ATPsynthase. Conformational change by rotation of thebeta-subunit. Biochim. Biophys. Acta, 768, 201–208.

54. Mitchell, P. (1985). Molecular mechanics of proto-nmotive F0F1 ATPases. Rolling well and turnstilehypothesis. FEBS Letters, 182, 1–7.

55. Oosawa, F. & Hayashi, S. (1986). The loose couplingmechanism in molecular machines of living cells.Advan. Biophys. 22, 151–183.

56. Abrahams, J. P., Leslie, A. G., Lutter, R. & Walker, J. E.(1994). Structure at 2.8 A resolution of F1-ATPase frombovine heart mitochondria. Nature, 370, 621–628.

57. Lederkremer, G. Z., Cheng, Y., Petre, B. M., Vogan, E.,Springer, S., Schekman, R., Walz, T. & Kirchhausen, T.(2001). Structure of the Sec23p/24p and Sec13p/31pcomplexes of COPII. Proc. Natl Acad. Sci. USA, 98,10704–10709.

58. Kim,W. K. & Ison, J. C. (2005). Survey of the geometricassociation of domain-domain interfaces. Proteins:Struct. Funct. Genet. 61, 1075–1088.

59. Caffrey, D. R., Somaroo, S., Hughes, J. D., Mintseris, J.& Huang, E. S. (2004). Are protein-protein interfacesmore conserved in sequence than the rest of theprotein surface? Protein Sci. 13, 190–202.

60. Wuchty, S. (2004). Evolution and topology in the yeastprotein interactionnetwork.GenomeRes.14, 1310–1314.

61. Wuchty, S., Oltvai, Z. N. & Barabasi, A. L. (2003).Evolutionary conservation of motif constituents in theyeast protein interaction network. Nature Genet. 35,176–179.

62. Fraser, H. B., Wall, D. P. &Hirsh, A. E. (2003). A simpledependence between protein evolution rate and thenumber of protein-protein interactions. BMC Evol.Biol. 3, 11.

63. Hood, J. K. & Silver, P. A. (1998). Cse1p is required forexport of Srp1p/importin-alpha from the nucleus inSaccharomyces cerevisiae. J. Biol. Chem. 273, 35142–35146.

64. Schroeder, A. J., Chen, X. H., Xiao, Z. & Fitzgerald-Hayes, M. (1999). Genetic evidence for interactionsbetween yeast importin alpha (Srp1p) and itsnuclear export receptor. Cse1p. Mol. Gen. Genet.261, 788–795.

65. Mrowka, R., Patzak, A. & Herzel, H. (2001). Is there abias in proteome research? Genome Res. 11, 1971–1973.

66. Deane, C. M., Salwinski, L., Xenarios, I. & Eisenberg,D. (2002). Protein interactions: two methods forassessment of the reliability of high throughputobservations. Mol. Cell. Proteomics, 1, 349–356.

67. von Mering, C., Krause, R., Snel, B., Cornell, M.,Oliver, S. G., Fields, S. & Bork, P. (2002). Comparativeassessment of large-scale data sets of protein-proteininteractions. Nature, 417, 399–403.

68. Sprinzak, E., Sattath, S. & Margalit, H. (2003). Howreliable are experimental protein-protein interactiondata? J. Mol. Biol. 327, 919–923.

69. Pawson, T. & Scott, J. D. (1997). Signaling throughscaffold, anchoring, and adaptor proteins. Science, 278,2075–2080.

70. Sudol, M. (1998). From Src homology domains toother signaling modules: proposal of the ‘proteinrecognition code’. Oncogene, 17, 1469–1474.

71. Edgar, R. C. (2004). MUSCLE: a multiple sequencealignment method with reduced time and spacecomplexity. BMC Bioinformatics, 5, 113.

72. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. &Lipman, D. J. (1990). Basic local alignment search tool.J. Mol. Biol. 215, 403–410.

73. Chenna, R., Sugawara, H., Koike, T., Lopez, R.,Gibson, T. J., Higgins, D. G. & Thompson, J. D.(2003). Multiple sequence alignment with the Clustalseries of programs. Nucl. Acids Res. 31, 3497–3500.

74. Salwinski, L., Miller, C. S., Smith, A. J., Pettit, F. K.,Bowie, J. U. & Eisenberg, D. (2004). The database ofinteracting proteins: 2004 update. Nucl. Acids Res. 32,D449–D451.

Edited by F. E. Cohen

(Received 25 April 2006; received in revised form 19 June 2006; accepted 14 July 2006)Available online 1 August 2006