The Role of RNA Sequence and Structure in RNA–Protein Interactions



doi:10.1016/j.jmb.2011.04.007 J. Mol. Biol. (2011) 409, 574–587

Contents lists available at www.sciencedirect.com

Journal of Molecular Biology
journal homepage: http://ees.elsevier.com.jmb

The Role of RNA Sequence and Structure in RNA–Protein Interactions

Aditi Gupta and Michael Gribskov*
Department of Biological Sciences, Purdue University, Hockmeyer Hall of Structural Biology, 240 South Martin Jischke Drive, West Lafayette, IN 47907, USA

Received 29 October 2010; received in revised form 7 February 2011; accepted 4 April 2011
Available online 15 April 2011

Edited by J. Doudna

Keywords: RNA–protein binding; statistical analysis; protein recognition

*Corresponding author. E-mail address: [email protected].
Abbreviations used: PDB, Protein Data Bank; WC, Watson–Crick.

We investigate the sequence and structural properties of RNA–protein interaction sites in 211 RNA–protein chain pairs, the largest set of RNA–protein complexes analyzed to date. Statistical analysis confirms and extends earlier analyses made on smaller data sets. There are 24.6% of hydrogen bonds between RNA and protein that are nucleobase specific, indicating the importance of both nucleobase-specific and -nonspecific interactions. While there is no significant difference between RNA base frequencies in protein-binding and non-binding regions, distinct preferences for RNA bases, RNA structural states, protein residues, and protein secondary structure emerge when nucleobase-specific and -nonspecific interactions are considered separately. Guanine nucleobase and unpaired RNA structural states are significantly preferred in nucleobase-specific interactions; however, nonspecific interactions disfavor guanine, while still favoring unpaired RNA structural states. The opposite preferences of nucleobase-specific and -nonspecific interactions for guanine may explain discrepancies between earlier studies with regard to base preferences in RNA–protein interaction regions. Preferences for amino acid residues differ significantly between nucleobase-specific and -nonspecific interactions, with nonspecific interactions showing the expected bias towards positively charged residues. Irregular protein structures are strongly favored in interactions with the protein backbone, whereas there is little preference for specific protein secondary structure in either nucleobase-specific interaction or -nonspecific interaction. Overall, this study shows strong preferences for both RNA bases and RNA structural states in protein–RNA interactions, indicating their mutual importance in protein recognition.

0022-2836/$ - see front matter © 2011 Elsevier Ltd. All rights reserved.

Introduction

RNA–protein interactions are involved in many cellular and viral processes and can involve either transient complexes, such as the exon junction complex, or stable complexes, such as the ribosome. The structures of representative RNA–protein complexes used in this work are shown in Fig. 1.

Disruption of these interactions often leads to disease (see Lukong et al. [1] or Cooper et al. [2] for reviews), due to the absence of RNA-binding proteins (e.g., fragile X syndrome [3], spinal muscular atrophy [4], and paraneoplastic neurologic syndromes [5]) or sequestration of RNA-binding proteins by expanded triplet repeat regions (e.g., myotonic dystrophy [6] and oculopharyngeal muscular dystrophy [7]). Altered expression of RNA-binding proteins such as SF2/ASF [8], eIF4E [9], Sam68 [10], and QK1 [11] has been implicated in tumorigenesis, and alterations in ribonucleoprotein complexes involved in RNA splicing and stability have been implicated in numerous diseases including Prader–Willi syndrome [12] and retinitis pigmentosa [13]. Both transient and stable ribonucleoprotein complexes involve base-specific, protein-side-chain-specific, and nonspecific interactions between the protein and RNA.

Fig. 1. Representative RNA–protein complexes. Selected RNA–protein complexes representative of the structural classes in our data set are shown. The panels represent RNA–protein complexes between 5S rRNA and ribosomal proteins L5, L18, L21E, and L30 [PDB ID: 1JJ2 (a)], GLN tRNA and its synthetase [PDB ID: 1QTQ, (b)], domain IV of 4.5S RNA and SRP protein [PDB ID: 1HQ1, (c)], ribosomal protein L1 and its mRNA fragment [PDB ID: 2VPL, (d)], hairpin loop IV of U2 snRNA and U2B″/U2A′ protein complex [PDB ID: 1A9N, (e)], box C/D snoRNA and L7Ae protein [PDB ID: 1RLG, (f)], fragment of pre-miRNA and RISC-loading complex subunit [PDB ID: 3ADL, (g)], P3 domain of RNase P/MRP RNA and POP6 and POP7 protein subunits [PDB ID: 3IAB, (h)], and C-rich fragment of telomeric RNA and polyC binding protein [PDB ID: 2PY9, (i)]. This figure was generated using PyMOL Molecular Graphics System, Version 1.3, Schrödinger, LLC.

Protein–RNA interactions have been studied from three major perspectives: comparison of DNA–protein interactions with RNA–protein interactions, analysis of the RNA–protein interface, and investigation of how proteins select parts of RNA for binding, that is, identification of RNA-binding motifs within proteins. Comparison of RNA and DNA protein binding suggests that RNA–protein interfaces are less well packed than DNA–protein interfaces and that RNA bases make more direct contacts with proteins than do DNA bases [14–16]. These differences have been attributed, in part, to the ability of RNA to form secondary structures [15]. Atomic level studies have pointed out differences in the hydrogen-bonding patterns of nucleotides and protein side chains [17–19], suggesting that both geometry and steric constraints play roles in RNA–protein recognition.

Analysis of the RNA-binding regions of proteins has identified several RNA recognition units in proteins. The best understood are RNA recognition motifs (RRMs), KH domains, double-stranded RNA-binding domains, and Zinc fingers, all of which have conserved sequence signatures and structural architectures [20–23]. RS domains (clusters of Arg–Ser dipeptides) and RGG/GAR domains (clusters of Gly–Gly mixed with aromatic and Arg residues) are more loosely defined, and little information is available about their structure and binding modes [23].

In contrast, protein recognition features in RNA are poorly understood. There is a broad consensus that the RNA backbone interacts with protein more frequently than do the nucleotide bases, indicating that sequence-independent features play an important role in RNA–protein interactions [14,16,17,24]. However, sequence-specific interactions can and do play important roles in some interactions, for example, the binding of FOX1 protein to the UGCAUG motif in pre-mRNA [25]. Statistical analyses of RNA–protein interfaces in crystallographic structures disagree on which bases and amino acids are favored in interaction regions, and which amino acid–base interaction pairs are most common [14–17,24,26]. However, the number of RNA–protein complexes in the Protein Data Bank (PDB) has significantly increased and analysis of the comprehensive data currently available may resolve the differences in previous analyses.

In this study, we investigate whether RNA sequence–structure properties in protein-binding regions are distinguishable from non-binding regions by examining intermolecular hydrogen bonding in ribonucleoprotein complexes. We focus primarily on the RNA component of the interaction in order to better understand the role of RNA sequence and structure in RNA–protein interactions. Our results suggest that different bases are preferred in nucleobase-specific and -nonspecific interactions, possibly explaining the discrepancies between conclusions of previous analyses. We find that the RNA structure in protein-binding regions is significantly distinguishable from non-binding regions. The identification of statistically distinguishable RNA sequence and structure properties in protein-binding regions could be used as the basis for computational methods for predicting protein-binding sites in novel RNA. We demonstrate that the nucleobases in certain structural states are strongly favored for protein binding, indicating that RNA sequence and structure collectively influence protein recognition and binding.

Table 1. Structural states defined for nucleobases in RNA and amino acid residues in protein

Code    Description
RNA
W       Watson–Crick or Wobble base paired (WC)
WS      WC and Sugar/Hoogsteen base paired (SH)
Ws      WC and stacked
WSs     WC and SH base paired, as well as stacked
S       SH base paired
Ss      SH and stacked
s       Stacked
u       Unpaired
Protein
B       β-sheet
h1      Right-handed α-helix
h5      Right-handed 3/10 helix
o       Other (irregular regions)

Protein structural states not found in this data set are omitted.

Results

In order to distinguish between sequence-specific and -nonspecific interactions, we use nucleobase to refer to the four nitrogenous bases (adenine, uracil, cytosine, and guanine) in a base-specific sense. We use the term RNA backbone to refer to the ribose and phosphate moieties exclusive of the nucleobase (i.e., the sequence nonspecific part of the RNA molecule). Each position in the RNA chain, comprising a nucleobase, sugar, and phosphate, is called a base when no implication about sequence specificity is intended. Similarly, backbone refers to the protein main-chain atoms (nitrogen, carbonyl carbon, and carbonyl oxygen), the nonspecific part of the protein, and side chain refers to amino acid R groups, the sequence-specific part of the protein.

Base stacking is defined as an interaction between the planar faces of nucleobases and can occur in all bases whether or not they are base paired. Watson–Crick (WC) describes conventional A–U, G–C, and G–U base pairs. Bases may also be described as paired in the Sugar/Hoogsteen sense.

Each base can take part in multiple hydrogen bond and stacking interactions. We define eight exclusive structural states for the bases (Table 1) based on base pairing and base stacking. Note that base stacking is implicit in WC base-paired structures, and the structural state stacked indicates stacking interactions other than the ones accompanying base pairing in helices.
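As an illustrative sketch (not part of the original analysis), the eight exclusive states of Table 1 can be derived from three yes/no annotations per base. The class name `BaseAnnotation`, its fields, and the function `structural_state` are hypothetical names introduced here; how pairing and stacking are detected in the first place is not covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseAnnotation:
    """Hypothetical per-base annotation; field names are assumptions."""
    wc_paired: bool  # Watson-Crick or wobble base paired
    sh_paired: bool  # Sugar/Hoogsteen base paired
    stacked: bool    # stacking beyond that implied by helical WC pairing


def structural_state(base: BaseAnnotation) -> str:
    """Return one of the eight Table 1 codes: W, WS, Ws, WSs, S, Ss, s, u."""
    code = ""
    if base.wc_paired:
        code += "W"
    if base.sh_paired:
        code += "S"
    if base.stacked:
        code += "s"
    # A base with no pairing and no stacking is unpaired.
    return code or "u"
```

Because each flag contributes at most one letter, the three flags yield exactly 2 x 2 x 2 = 8 mutually exclusive codes, matching the count given in the text.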

Protein residues are assigned to 1 of 12 possible structural categories: β-sheet, 10 variations of helices labeled h1 to h10, and disulfide bonded, as described in the HELIX, SHEET, and SSBOND (disulfide) records of the PDB entries. Amino acids that are not defined as helix, sheet, or disulfide bonded are assigned as other; this includes turn, coil, irregularly structured, and unordered regions (Table 1).

RNA bases and protein residues involved in interactions are referred to as RNP, while non-interacting bases and residues are termed nonRNP. The term overall is used to refer to the entire RNA–protein data set and includes both RNP and nonRNP regions.
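The assignment of protein residues to structural categories can be sketched as a small parser over the fixed-column HELIX and SHEET records of a PDB file. This is a simplified illustration under stated assumptions, not the authors' code: the function names are hypothetical, SSBOND records and insertion codes are ignored, each helix or strand is assumed to stay within one chain, and a blank helix class is treated as class 1.

```python
def protein_states(pdb_lines):
    """Map (chain ID, residue number) to a state code from HELIX/SHEET records."""
    states = {}
    for line in pdb_lines:
        if line.startswith("HELIX "):
            chain = line[19]                    # initChainID, column 20
            start = int(line[21:25])            # initSeqNum, columns 22-25
            end = int(line[33:37])              # endSeqNum, columns 34-37
            helix_class = line[38:40].strip()   # helixClass, columns 39-40
            code = "h" + (helix_class or "1")   # assumption: blank means 1
        elif line.startswith("SHEET "):
            chain = line[21]                    # initChainID, column 22
            start = int(line[22:26])            # initSeqNum, columns 23-26
            end = int(line[33:37])              # endSeqNum, columns 34-37
            code = "B"
        else:
            continue
        for resnum in range(start, end + 1):
            states[(chain, resnum)] = code
    return states


def state_of(states, chain, resnum):
    """Residues absent from every HELIX/SHEET record fall into 'o' (other)."""
    return states.get((chain, resnum), "o")
```

The default in `state_of` mirrors the rule in the text that residues not defined as helix, sheet, or disulfide bonded are assigned as other.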

Preferred RNA bases and structures in protein-binding regions

Our data set (see Materials and Methods) consists of 211 RNA–protein chain pairs and includes 9977 bases from 77 RNA–protein complexes. Of these, 2207 bases (22%) form hydrogen bonds with protein and 44% of these interacting bases make multiple hydrogen bonds with protein. The base frequencies in RNP regions do not significantly differ from the overall base frequencies (α>0.1). However, the frequency of RNA structural states in RNP regions is significantly different from overall structural frequencies, being enriched in unpaired structural states (α<0.001). The preference for unpaired bases in RNP regions makes the nucleobases especially accessible for interaction.

RNA base frequencies for protein neighboring bases (van der Waals contacts) do not differ from overall RNA base frequencies; however, RNA structural frequencies are significantly distinguishable, with higher preference for unpaired nucleobases (χ² test for base frequencies: α>0.1; structural frequencies: α<0.001).

Fig. 2. The four categories of RNA–protein interactions. The left panel shows the interaction between 23S rRNA and ribosomal protein L11 (PDB ID: 1MMS). The nucleobases colored blue and enclosed in an ellipse interact with the protein and are magnified in the right panel. NN, NS, SN, and SS indicate the interaction modes N_RNA N_prot, N_RNA S_prot, S_RNA N_prot, and S_RNA S_prot, respectively. Here, the SN interaction is between N2 of 1059G and backbone oxygen of 127ILE, the SS interaction is between O4 of 1060U and side chain of 131THR, the NN interaction represents hydrogen bonding between 1060U OP1 and main-chain N of 75ALA, and the NS interaction is between OP2 of 1061U and 10LYS side chain. This figure was generated using PyMOL Molecular Graphics System, Version 1.3, Schrödinger, LLC.

Modes of interaction between RNA and protein

Interactions are specific for RNA when nucleobases form hydrogen bonds with the protein and nonspecific when hydrogen bonds involve the RNA backbone. Similarly, interactions are specific for protein when side chains form hydrogen bonds with RNA and nonspecific when hydrogen bonds are made by the protein backbone. Thus, based on which parts of the RNA and protein associate in intermolecular interactions, we define the following four modes of interaction:

• S_RNA S_prot or SS: Interaction is specific for both RNA and protein.

• S_RNA N_prot or SN: Interaction is specific for RNA, but nonspecific for protein.

• N_RNA S_prot or NS: Interaction is nonspecific for RNA, but specific for protein.

• N_RNA N_prot or NN: Interaction is nonspecific for both RNA and protein.
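The four modes above can be sketched as a lookup on the two atoms that form a hydrogen bond. This is a minimal illustration, not the authors' implementation: the atom-name sets follow standard PDB naming and are an assumption about how backbone atoms would be recognized, and `interaction_mode` is a hypothetical helper. The protein set lists only the three main-chain atoms named in the text.

```python
# Ribose and phosphate atoms in PDB naming (assumed backbone definition).
RNA_BACKBONE_ATOMS = {
    "P", "OP1", "OP2", "OP3",
    "O5'", "C5'", "C4'", "O4'", "C3'", "O3'", "C2'", "O2'", "C1'",
}
# Main-chain nitrogen, carbonyl carbon, and carbonyl oxygen.
PROTEIN_MAINCHAIN_ATOMS = {"N", "C", "O"}


def interaction_mode(rna_atom: str, protein_atom: str) -> str:
    """Return SS, SN, NS, or NN for one RNA-protein hydrogen bond."""
    rna_side = "N" if rna_atom in RNA_BACKBONE_ATOMS else "S"
    protein_side = "N" if protein_atom in PROTEIN_MAINCHAIN_ATOMS else "S"
    return rna_side + protein_side
```

Applied to the contacts described for the 23S rRNA and L11 complex, `interaction_mode("N2", "O")` gives SN and `interaction_mode("OP1", "N")` gives NN. For the two side-chain contacts the source does not name the protein atom; assuming the usual hydroxyl and amine names, `interaction_mode("O4", "OG1")` gives SS and `interaction_mode("OP2", "NZ")` gives NS. Note that the nucleobase atom O4 of uracil and the ribose atom O4' are distinct names and fall on opposite sides of the classification.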

Figure 2 illustrates the above interaction categories in 23S ribosomal RNA and L11 protein complex (PDB ID: 1MMS).

The 2207 interacting bases (out of a total of 9977) make 3833 hydrogen bonds with protein atoms. The most frequent interaction mode is N_RNA S_prot, interaction of specific protein side chains with the RNA backbone (Fig. 3b). About a quarter (24.6%) of all hydrogen bonds between protein and RNA involve nucleobases and are therefore potentially sequence specific.

Fig. 3. RNA base frequencies and hydrogen bonds. (a) RNA base frequencies in overall (entire data set), protein-binding (RNP), and non-binding (nonRNP) regions; the four interaction modes: NN, NS, SN, and SS; base-specific (SS SN) and base-nonspecific (NS NN) interactions; and neighbor RNA–protein interactions. When a base makes multiple hydrogen bonds in different categories, it is counted once in each category. Neither RNP nor nonRNP nucleobase frequency distributions differ from the overall base frequencies (χ² test, α>0.1), but nucleobase frequencies in specific interactions (SS, SN, and base specific) are significantly distinguishable, showing strong preference for guanine (α<0.001). Base-nonspecific interactions disfavor guanine, while neighbor interactions show base frequencies similar to overall distribution. (b) The relative frequency of hydrogen bonds in the four modes of interaction, with NS interactions between RNA backbone and protein side chains constituting 61% of all RNA–protein hydrogen bonds.

Nucleobase-specific interactions

In specific interactions between nucleobases and amino acid side chains (S_RNA S_prot), nucleobase frequencies in RNP regions are significantly different from overall and nonRNP frequencies (α<0.001, Fig. 3a), with uracil and guanine being significantly overrepresented (α<0.01) and cytosine being underrepresented (α<0.005).

Nearly all (97.7%) specific side chain–nucleobase interactions involve charged and polar amino acid residues (C, D, E, H, K, N, Q, R, S, T, and Y), which preferentially interact with guanine and avoid cytosine (Table 2): Asp–G, Asn–U, Glu–G, Ser–A, Thr–A, and Arg–C residue–base interactions are significantly overrepresented, while Asn–G and Thr–G are underrepresented (Table 2). The Glu–G, Thr–A, and Arg–C pairs have each been identified individually in previous studies, but never simultaneously [14,15,27].

Table 2. Base-specific H bonds

S_RNA S_prot interactions
Residue     A      U      C      G      Fraction
Arg*        27     47     47     65     0.285
Asn*        12     31     11     14     0.104
Asp*        5      8      1      28     0.064
Cys*        0      0      0      1      0.002
Gln*        17     19     11     20     0.103
Glu*        5      2      1      47     0.084
His*        6      3      4      2      0.023
Lys*        12     25     18     44     0.152
Ser*        26     9      11     20     0.101
Thr*        15     4      9      5      0.051
Tyr*        2      3      1      8      0.021
Trp         0      0      1      0      0.002
Met         1      2      0      2      0.008
Fraction    0.196  0.235  0.176  0.393

S_RNA N_prot interactions
Protein state  A      U      C      G      Fraction
B              22     10     5      6      0.147
h1             4      6      10     35     0.188
h5             0      1      4      10     0.052
o              35     40     33     71     0.613
Fraction       0.209  0.195  0.178  0.418

Polar and charged residues [marked with an asterisk (*)] make up 97.7% of hydrogen-bond interactions with RNA. Interactions not listed were not found in our data set. Asp–G (α<0.005), Asn–U (α<0.001), Glu–G (α<0.001), Ser–A (α<0.001), Thr–A (α<0.001), and Arg–C (α<0.015) residue–base interactions are significantly overrepresented, while Asn–G (α<0.015) and Thr–G (α<0.025) are underrepresented.

RNA shows significant structural preferences in S_RNA S_prot interactions, favoring the unpaired state (α<0.001) over WC base-paired states in RNP regions compared to overall (Fig. 4a). The frequencies of RNA structural states in RNP regions are significantly different from both overall and nonRNP structural preferences (α<0.001, Fig. 4a). However, protein structural states in S_RNA S_prot interactions are not distinguishable from overall protein structural frequencies (α>0.1, Fig. 4b).

For S_RNA N_prot interactions between nucleobases and the protein backbone (Table 2), nucleobase frequencies as well as RNA structure are significantly different from overall (α<0.001) and nonRNP (α<0.001) nucleobase and RNA structural distributions. Guanine nucleobases (α<0.001) and the unpaired structural state (α<0.001) are preferred while cytosine is disfavored (α<0.005), as in S_RNA S_prot interactions. Protein residues in S_RNA N_prot interactions are more frequently found in irregular structural states (structural state “other”) compared to overall protein structural frequencies (α<0.001). The S_RNA N_prot together with the S_RNA S_prot interaction mode show the strongest preference for the unpaired RNA structural state.
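As a worked arithmetic check, the hydrogen-bond counts in Table 2 can be summed to recover the nucleobase-specific share of the 3833 hydrogen bonds reported above. The counts are transcribed from Table 2; the variable and function names are hypothetical, and the script is only a consistency check, not the authors' procedure.

```python
# Hydrogen-bond counts per nucleobase (A, U, C, G), transcribed from Table 2.
SS_COUNTS = {
    "Arg": (27, 47, 47, 65), "Asn": (12, 31, 11, 14), "Asp": (5, 8, 1, 28),
    "Cys": (0, 0, 0, 1), "Gln": (17, 19, 11, 20), "Glu": (5, 2, 1, 47),
    "His": (6, 3, 4, 2), "Lys": (12, 25, 18, 44), "Ser": (26, 9, 11, 20),
    "Thr": (15, 4, 9, 5), "Tyr": (2, 3, 1, 8), "Trp": (0, 0, 1, 0),
    "Met": (1, 2, 0, 2),
}
SN_COUNTS = {
    "B": (22, 10, 5, 6), "h1": (4, 6, 10, 35),
    "h5": (0, 1, 4, 10), "o": (35, 40, 33, 71),
}
TOTAL_HBONDS = 3833  # all RNA-protein hydrogen bonds in the data set


def table_total(table):
    """Total number of hydrogen bonds in one sub-table."""
    return sum(sum(row) for row in table.values())


def base_fractions(table):
    """Column fractions for A, U, C, G, rounded to three decimals."""
    total = table_total(table)
    columns = zip(*table.values())
    return tuple(round(sum(col) / total, 3) for col in columns)


def percent_of_all(count):
    """Share of all hydrogen bonds, as a percentage with one decimal."""
    return round(100 * count / TOTAL_HBONDS, 1)


ss_total = table_total(SS_COUNTS)  # 652 bonds
sn_total = table_total(SN_COUNTS)  # 292 bonds
```

With these counts, `percent_of_all(ss_total)` is 17.0, `percent_of_all(sn_total)` is 7.6, and their combined share is 24.6, in agreement with the "about a quarter" figure in the text. `base_fractions(SS_COUNTS)` reproduces the bottom row of the S_RNA S_prot sub-table.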

Looking together at nucleobase-specific interactions (S_RNA S_prot and S_RNA N_prot), we find that nucleobase and RNA structural frequencies are significantly different from overall and nonRNP distributions (Table 3). Guanine is preferred and cytosine is disfavored, while unpaired nucleobases are the preferred structural state (Figs. 3a and 4a). These sequence–structure preferences are highly significant, suggesting that nucleobase-specific interactions are sensitive to both the type of base and the RNA structure in the RNP region.
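The comparisons in this section rest on χ² tests of frequency distributions. The sketch below computes a Pearson χ² statistic for two rows of counts. It is an illustration of the kind of test used, not a reproduction of the authors' analysis: their tests compare against overall and nonRNP frequencies, which are in the Supplementary Data and not available here, so the usage example instead compares the two nucleobase-specific modes using the column totals of Table 2. The function name is hypothetical.

```python
def chi_square_homogeneity(row_a, row_b):
    """Pearson chi-square statistic and degrees of freedom for a 2 x k table.

    Tests whether two rows of counts could share one underlying distribution.
    """
    column_totals = [a + b for a, b in zip(row_a, row_b)]
    grand_total = sum(column_totals)
    statistic = 0.0
    for row in (row_a, row_b):
        row_total = sum(row)
        for observed, column_total in zip(row, column_totals):
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = len(row_a) - 1  # (2 - 1) * (k - 1)
    return statistic, degrees_of_freedom


# Hydrogen bonds to A, U, C, G (column sums of Table 2).
SS_BASES = [128, 153, 115, 256]
SN_BASES = [61, 57, 52, 122]
statistic, dof = chi_square_homogeneity(SS_BASES, SN_BASES)
CRITICAL_005_DF3 = 7.815  # chi-square critical value, alpha = 0.05, 3 d.f.
differs = statistic > CRITICAL_005_DF3
```

For these counts the statistic is roughly 1.9, well below the critical value, so the two nucleobase-specific modes are not distinguishable from each other by base composition in this simple test. That is consistent with the text's observation that both modes prefer guanine and disfavor cytosine.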

Nucleobase-nonspecific interactions

N_RNA S_prot, RNA backbone interacting with protein side chains, is the most frequent mode of interaction between RNA and protein (Fig. 3b). Nucleobase frequencies are statistically different from both overall (α<0.03, Fig. 3a) and nonRNP (α<0.025) frequencies. In strong contrast to nucleobase-specific interactions, guanine is disfavored in N_RNA S_prot interactions (α<0.005). RNA structural states also differ significantly from overall (α<0.015) and nonRNP (α<0.001, Fig. 4a) structural frequencies, but the difference is much less dramatic than that in the case of nucleobase-specific interactions. The amino acid frequency distributions in S_RNA S_prot (Table 2) and N_RNA S_prot (Table 4) interactions are significantly different, with an increased number of positively charged and aromatic residues and a reduced frequency of negatively charged residues in N_RNA S_prot (χ² test, α<0.001). This reflects the favorable electrostatic interactions between the positively charged side chains and the negatively charged RNA backbone.

For N_RNA N_prot interactions between RNA and protein backbones, unlike other interaction categories, nucleobase frequencies are indistinguishable from overall (α<0.055, Fig. 3a) and nonRNP (α<0.04), but RNA structural states show a significant difference from nonRNP (α<0.01, Fig. 4a). Again, this difference is much less dramatic than that of nucleobase-specific interactions.

Protein secondary structure in N_RNA N_prot interactions shows the most pronounced preference for irregular structure of any interaction mode, compared to overall protein structural frequencies (α<0.001, Fig. 4b). In S_RNA N_prot interactions (see above), irregularly structured regions are also significantly more frequent, and regular secondary-structure elements such as α-helices and β-sheets are underrepresented (α<0.001, Fig. 4b). This reflects the difficulty of accommodating interactions between the protein backbone and RNA within regular α-helical and β-sheet structures in both of these interaction modes.

Fig. 4. RNA and protein structure at the RNA–protein interface. (a) RNA base frequencies in overall (entire data set), protein-binding (RNP), and non-binding (nonRNP) regions; the four interaction modes: NN, NS, SN, and SS; base-specific (SS SN) and base-nonspecific (NS NN) interactions; and neighbor RNA–protein interactions. When a base makes multiple hydrogen bonds in different categories, it is counted once in each category. Structural frequencies in RNP regions are significantly different from overall and nonRNP RNA structure (χ² test, α<0.001). SS, SN, and base-specific interactions show a strong preference for unpaired structural state, while NS, NN, and base-nonspecific interactions do not significantly prefer any particular RNA structural state. Neighbor interactions show indistinguishable RNA structural frequencies from overall distribution. (b) Protein structural frequencies in the four modes of interaction. Irregular protein structure is preferred and α-helix disfavored in interactions between the RNA and protein backbone.

When we combine nucleobase-nonspecific interactions (N_RNA S_prot and N_RNA N_prot), we see that nucleobase frequencies are not different from either overall or nonRNP nucleobase distributions (Fig. 3a) but that RNA structural states are significantly different from nonRNP (Fig. 4a).
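The shift toward positively charged residues in N_RNA S_prot interactions, described earlier in this section, can be illustrated from the published counts. The per-residue totals below are row sums of Table 2 (S_RNA S_prot) and Table 4 (N_RNA S_prot); the dictionary and function names are hypothetical, and the calculation is a descriptive check rather than the χ² test reported in the text.

```python
# Hydrogen bonds per residue type: row sums of Table 2 (SS) and Table 4 (NS).
SS_RESIDUE_TOTALS = {
    "Arg": 186, "Asn": 68, "Asp": 42, "Cys": 1, "Gln": 67, "Glu": 55,
    "His": 15, "Lys": 99, "Ser": 66, "Thr": 33, "Tyr": 14, "Trp": 1, "Met": 5,
}
NS_RESIDUE_TOTALS = {
    "Arg": 1056, "Asn": 104, "Gln": 98, "His": 81, "Lys": 534,
    "Ser": 185, "Thr": 166, "Trp": 14, "Tyr": 81,
}
POSITIVE = ("Arg", "Lys")
NEGATIVE = ("Asp", "Glu")


def residue_fraction(totals, residues):
    """Fraction of a mode's hydrogen bonds made by the given residue types."""
    selected = sum(totals.get(name, 0) for name in residues)
    return selected / sum(totals.values())


ss_positive = residue_fraction(SS_RESIDUE_TOTALS, POSITIVE)  # about 0.437
ns_positive = residue_fraction(NS_RESIDUE_TOTALS, POSITIVE)  # about 0.686
ss_negative = residue_fraction(SS_RESIDUE_TOTALS, NEGATIVE)  # about 0.149
ns_negative = residue_fraction(NS_RESIDUE_TOTALS, NEGATIVE)  # none listed
```

Arg and Lys account for about 44% of side-chain hydrogen bonds to nucleobases but about 69% of those to the RNA backbone, while Asp and Glu make roughly 15% of the former and do not appear among the listed backbone contacts.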

Table 3. Significance of RNA–protein interactions

                                              RNA base frequency          RNA structural state frequency
Interaction type               Bases    vs. overall   vs. nonRNP     vs. overall   vs. nonRNP

Base-specific interactions
S_RNA S_prot                   493      α<0.001       α<0.001        α<0.001       α<0.001
S_RNA N_prot                   225      α<0.001       α<0.001        α<0.001       α<0.001
S_RNA S_prot + S_RNA N_prot    653      α<0.001       α<0.001        α<0.001       α<0.001

Base-nonspecific interactions
N_RNA S_prot                   1570     α<0.03        α<0.025        α<0.015       α<0.001
N_RNA N_prot                   478      α<0.055       α<0.04         α<0.085       α<0.01
N_RNA S_prot + N_RNA N_prot    1838     α<0.055       α<0.04         α<0.015       α<0.001

Amino-acid-side-chain-specific interactions
S_RNA S_prot + N_RNA S_prot    1871     α>0.1         α>0.1          α<0.001       α<0.001

Protein-backbone-specific interactions
S_RNA N_prot + N_RNA N_prot    679      α>0.1         α>0.1          α<0.001       α<0.001

χ² tests: interaction region base and structural state frequencies, compared to overall and nonRNP base and structural state frequencies. Null hypothesis: No significant difference exists between RNP and overall/nonRNP frequency distributions. Interaction region base and structural frequencies for specific RNA interactions are significantly different from overall frequencies. Significant differences are highlighted in boldface. Note that when a base is involved in multiple interactions (H bonds) with the protein, it is counted in all the interaction categories in which it participates.

Table 4. Base-nonspecific H bonds

N_RNA S_prot interactions
Residue     W      WS     Ws     WSs    S      Ss     s      u      Fraction
Arg         444    221    60     18     147    14     11     141    0.455
Asn         52     16     4      0      22     1      1      8      0.045
Gln         59     10     7      1      10     1      0      10     0.042
His         38     15     8      0      11     0      1      8      0.035
Lys         237    77     46     17     61     11     9      76     0.230
Ser         86     30     14     0      22     2      3      28     0.080
Thr         73     23     16     2      20     1      3      28     0.072
Trp         6      3      2      0      0      0      0      3      0.006
Tyr         27     9      7      2      12     2      1      21     0.035
Fraction    0.441  0.174  0.071  0.017  0.132  0.014  0.012  0.139

Protein state
b           172    54     30     10     48     4      1      38     0.154
h1          400    97     48     9      101    7      11     148    0.354
h5          69     41     5      0      18     2      0      17     0.066
o           381    212    81     21     138    19     17     120    0.426
Fraction    0.441  0.174  0.071  0.017  0.132  0.014  0.012  0.139

N_RNA N_prot interactions
Protein state
b           21     1      5      0      7      0      0      7      0.076
h1          42     4      3      1      5      1      1      33     0.167
h5          16     6      2      0      9      0      0      5      0.071
o           177    67     26     4      53     3      4      36     0.686
Fraction    0.475  0.145  0.067  0.009  0.137  0.008  0.009  0.150

Positively charged residues (Arg and Lys) are preferred for interactions with negatively charged RNA backbone. For interactions involving protein main chain, residues in unstructured regions are overrepresented.

For protein-side-chain-specific (S_RNA S_prot and N_RNA S_prot) and -nonspecific interactions (S_RNA N_prot and N_RNA N_prot), RNA structural frequencies are significantly different from overall and nonRNP (Fig. 4a and Supplementary Table S2) and are significantly different from each other (α<0.001).

While both protein-specific and -nonspecific interactions prefer unpaired nucleotides compared to overall, when compared with each other, protein-specific interactions significantly favor WC paired bases. This suggests that protein side chains are more likely to bind to helical regions in RNA than are protein backbone atoms. This is not too surprising given the constraints on the conformation of the protein backbone. Table 5 summarizes the differences in the preferred RNA and protein sequence and structural properties in the four modes of RNA–protein interaction.

Table 5. Summary of significant differences between interaction modes

SS
  Hydrogen bonds: 17%
  Base composition: ↓AC ↑UG
  RNA structure: ↓W ↓WS ↓Ws ↑↑u
  Protein composition: (Arg, Asn, Gln, Glu, Lys, Ser)↑
  Protein structure: no preferences

SN
  Hydrogen bonds: 7.6%
  Base composition: ↓C ↑G
  RNA structure: ↓W ↓WS ↓Ws ↑S ↑↑u
  Protein composition: (Gly, Lys, Thr)↑
  Protein structure: ↑irregular ↓α-helix

Nucleobase specific (SS SN)
  Hydrogen bonds: 24.6%
  Base composition: ↓C ↑G
  RNA structure: ↓W ↓WS ↓Ws ↑S ↑↑u
  Protein composition: (Arg, Lys, Ser, Thr, Gln, Asn)↑
  Protein structure: ↑irregular ↓α-helix

NS
  Hydrogen bonds: 61.3%
  Base composition: ↓G
  RNA structure: ↑u
  Protein composition: (Arg, Lys, His, Ser, Thr)↑
  Protein structure: ↓β-sheet ↑3/10 helix

NN
  Hydrogen bonds: 14.1%
  Base composition: ↓G ↑U
  RNA structure: no preferences
  Protein composition: (Arg, Gly, Lys)↑ (Glu, Ile, Leu)↓
  Protein structure: ↓β-sheet ↑3/10 helix ↓α-helix ↑irregular

Nucleobase nonspecific (NS NN)
  Hydrogen bonds: 75.4%
  Base composition: ↓G
  RNA structure: no preferences
  Protein composition: (Arg, Lys)↑
  Protein structure: ↓β-sheet ↑3/10 helix ↓α-helix ↑irregular

Protein specific (SS NS)
  Hydrogen bonds: 78.3%
  Base composition: no preferences
  RNA structure: ↓Ws ↑u
  Protein composition: (Arg, Lys, Ser)↑
  Protein structure: ↓β-sheet ↑3/10 helix

Protein nonspecific (SN NN)
  Hydrogen bonds: 21.7%
  Base composition: no preferences
  RNA structure: ↓WS ↓Ws ↑S ↑↑u
  Protein composition: (Arg, Gly, Lys)↑
  Protein structure: ↓β-sheet ↑3/10 helix ↓α-helix ↑irregular

Arrows indicate significantly favored (↑) or disfavored (↓) features in each interaction mode. Doubled arrows indicate a larger effect. The preferred RNA bases and structure in the four interaction modes suggest that guanine is favored and cytosine is disfavored in nucleobase-specific interactions, while unpaired is the preferred RNA structural state. Arg and Lys are almost universally present in all four interaction modes, while irregular protein structures are preferred in nucleobase-nonspecific interactions. RNA base frequencies (Table S1), RNA structural frequencies (Table S2), protein composition (Fig. S1), and protein structure frequencies (Table S3) can be found in Supplementary Data.
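The combined categories in Table 5 are marginal sums over the four basic modes. A minimal sketch of that bookkeeping is given below, using the hydrogen-bond percentages reported for SS, SN, NS, and NN; `MODE_SHARE` and `combined_share` are hypothetical names introduced for illustration.

```python
# Percentage of all RNA-protein hydrogen bonds in each basic mode (Table 5).
# First letter: RNA side (S = nucleobase, N = backbone).
# Second letter: protein side (S = side chain, N = main chain).
MODE_SHARE = {"SS": 17.0, "SN": 7.6, "NS": 61.3, "NN": 14.1}


def combined_share(rna=None, protein=None):
    """Sum the shares of all modes matching the requested RNA/protein side."""
    total = 0.0
    for mode, share in MODE_SHARE.items():
        if rna is not None and mode[0] != rna:
            continue
        if protein is not None and mode[1] != protein:
            continue
        total += share
    return round(total, 1)
```

Here `combined_share(rna="S")` gives 24.6 (nucleobase specific), `combined_share(rna="N")` gives 75.4, `combined_share(protein="S")` gives 78.3 (protein specific), and `combined_share(protein="N")` gives 21.7, matching the combined rows of Table 5.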

Associating RNA nucleobases with structure

To investigate whether RNA structure influences the tendency of a nucleobase to interact with protein, we look at the relative frequency of nucleobase–structural state pairs in RNP and nonRNP regions (Table 6).

The log-odds scores suggest that stacked cytosines are preferred in RNP regions; this is supported by statistical significance (χ² test, null hypothesis: there is no difference in the frequency of the state in RNP versus nonRNP regions, α < 0.001 for cytosine-s). All bases in unpaired states are preferred in RNP regions (α < 0.08 for unpaired guanine and α < 0.001 for unpaired adenine, uracil, and cytosine), although the preference for guanine is not significant at the α = 0.05 level. Guanine in the Sugar/Hoogsteen state (S) is also significantly preferred in RNP regions (α < 0.02), while uracil and cytosine in the Ws state (WC and stacked) and adenine in the Sugar/Hoogsteen state are preferred in nonRNP regions (α < 0.005 for uracil-Ws and cytosine-Ws, α < 0.05 for adenine-S). Eleven tRNAs in our data set have CCA tails that are involved in binding to aminoacyl tRNA synthetases. In order to remove any bias due to this overrepresentation, we removed these CCA tails and repeated the analysis. The overrepresentation of unpaired cytosine in RNP regions and the preference for adenine-S in nonRNP regions are diminished, but the significance of other interactions is unchanged (α < 0.01 for unpaired cytosine, α < 0.065 for adenine-S).

Table 6. Log-odds preference of sequence–structure states for RNP versus nonRNP

State         A          U          C          G
W        0.0459     0.1253    −0.1197    −0.0797
WS      −0.2292    −0.1405     0.1939    −0.1151
Ws       0.0488    −0.7484    −0.5814    −0.4378
WSs     −1.7229    −3.7635    −0.3608    −0.2259
S       −0.2366     0.2269     0.3214     0.3457
Ss      −0.3712    −1.4010    −0.7635     0.3434
s        0.1585     0.5990     1.5990    −1.2375
u        0.5824     1.0314     0.5870     0.3324

Log2-odds of sequence–structure states of nucleotides in protein-interacting regions versus non-interacting regions. Positive values indicate preference of sequence–structure state pairs for protein-interacting regions, and negative values indicate preference for non-interacting regions. Sequence–structure states that are significantly preferred in RNP or nonRNP are in boldface (α < 0.05). Log-odds scores for sequence–structure states with fewer than five occurrences are shown in italics and should be considered inconclusive.
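The χ² comparison described above can be illustrated with a short sketch. It assumes a 2×2 contingency table (nucleotides in a given sequence–structure state versus all other nucleotides, in RNP versus nonRNP regions) and no continuity correction; the text does not specify either choice, and the counts below are hypothetical, not taken from the data set.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 degree of freedom) for
        [[a, b],    a = RNP, in state       b = RNP, other states
         [c, d]]    c = nonRNP, in state    d = nonRNP, other states
    No continuity correction is applied (an assumption of this sketch)."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be positive")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts for one sequence-structure state.
chi2, p = chi_square_2x2(120, 880, 60, 940)
print(f"chi2 = {chi2:.2f}, significant at 0.001: {p < 0.001}")
```

A state would be reported as significantly preferred when the p-value falls below the chosen α and the log-odds score has the corresponding sign.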

Discussion

The current data set of 77 RNA–protein complexes is the largest nonredundant set of RNA–protein complexes analyzed to date. In order to avoid biases due to overrepresented protein or RNA sequences, and in contrast to previous studies,14–17,24,26 we restricted both RNA and protein sequence identity to less than 80% and considered only a single copy of each RNA–protein complex. However, our conclusions are only as representative as our data set, which is biased towards transfer RNA and ribosomal RNA classes (tRNA: 14% of all nucleotides; rRNA: 75% of all nucleotides in the data set). Our data set, and crystallographic structures in general, are strongly biased towards stable ribonucleoprotein complexes and may not be representative of transient interactions.

This study, for the first time, has investigated preferred RNA structural states in protein-binding regions. We find that RNA structural frequencies at RNP regions are significantly different from overall structural frequencies, with unpaired bases being overrepresented at the binding regions. This accessibility of bases for interaction with proteins is a salient difference between protein–RNA and protein–DNA interactions.

Previous studies agree that protein contacts with the RNA ribose-phosphate backbone are more common than interactions with the nucleobases and that interactions with protein side chains predominate over interactions with the protein backbone.14–17,26 However, there is considerable disagreement regarding whether specific bases or amino acid residues are preferred in RNP interactions and, in the studies that have seen a preference, over which bases or residues are preferred (summarized in Table IX of Ellis et al.24). We attribute these disparities partially to relatively small sample sizes but, more importantly, to differences between nonspecific (ribose-phosphate backbone in RNA or main-chain atoms in protein) and specific interactions (nucleobase and amino acid side chain) on the parts of both RNA and protein.

In this study, we classify interactions as specific or nonspecific with respect to both the RNA and protein components of the interaction. We concentrate on hydrogen bonding between the RNA and protein because it is likely to confer specificity to the interaction.18 We did not consider water-mediated hydrogen bonding because, as pointed out by Lejeune et al.,14 a crystal structure resolution of 2 Å or better is necessary to accurately determine these interactions, and few available RNA–protein structures have been determined at sufficiently high resolution.

When the nucleobase specificity of the RNA–protein interaction is not considered, base frequencies at binding regions are not significantly different from overall base frequencies. This is in agreement with Treger and Westhof17 but not with other studies that find significant differences between overall and RNP base frequencies.15,16,26 However, when interactions are categorized as nucleobase specific, involving hydrogen bonds to nucleobases, or nucleobase nonspecific, involving contacts to the RNA backbone, clear base preferences, particularly for guanine, are seen. We believe that this difference accounts for much of the disparity in previous work, although differences in data sets and the distribution to which the comparison is made may also be partially responsible. Jones et al.15 determined expected base distributions from the average solvent accessibility of bases, and Ellis et al.24 compared RNP base frequencies to a random distribution.

Nucleobase-specific and -nonspecific interactions make up 24.6% and 75.4%, respectively, of RNA–protein hydrogen bonds, similar to the majority14,17,28 but not to all15,26 of the previous analyses. We observe striking differences in RNA sequence–structure preferences in nucleobase-specific and -nonspecific interactions. While guanine is significantly preferred and cytosine is disfavored in nucleobase-specific interactions, guanine is underrepresented in nucleobase-nonspecific interactions. The origin of the preference for G and avoidance of C in nucleobase-specific interactions is unclear, and underrepresentation of guanine in nucleobase-nonspecific interactions is surprising because these interactions do not involve the nucleobase directly. It has been suggested that base preferences in base-nonspecific interaction regions may be affected by non-hydrogen bond interactions such as stacking interactions between nucleobases and planar residues such as tryptophan and tyrosine,18,24 but it is not clear that this entirely explains the observed preference for guanine.

Nucleobase-specific interactions show a stronger preference for unpaired structural states than nucleobase-nonspecific interactions. Because the bases are largely inaccessible in helical structures, it is not surprising that the WC base-paired state is disfavored in base-specific interactions and that unpaired structures are favored. This suggests that bases in unpaired regions, such as hairpin loops and bulges, are more likely to bind to protein in a sequence-specific manner. The availability of hydrogen bond donors and acceptors in the RNA backbone is not adversely affected when base pairing occurs, possibly explaining the similarities in RNA structure in the overall RNA and in nucleobase-nonspecific interactions (Fig. 4a). However, the unpaired structural state is preferred in nucleobase-nonspecific interactions as well, suggesting that the conformational flexibility of unpaired bases facilitates both nucleobase-specific and -nonspecific interactions.

Looking at the protein component of RNA–protein interactions, 78% of hydrogen bonds involve amino acid side chains, and the remaining 22% involve the protein backbone. In terms of specificity, this is the reverse of what we observe for RNA, indicating that sequence specificity plays a more important role for the protein component. As expected, we find that both positively and negatively charged residues (Arg, Lys, Asp, and Glu) are preferred in interactions with nucleobases, but when hydrogen bonds are made to the negatively charged RNA backbone, the frequencies of positively charged Arg and Lys are higher. This preference for Arg and Lys in protein–RNA backbone interactions has been almost universally acknowledged in previous studies.14,15,17,24,26,28
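The two-sided classification used in this study (specific or nonspecific on the RNA side, and specific or nonspecific on the protein side) can be sketched from the atom names of a hydrogen-bonded pair. This is a minimal illustration under stated assumptions: the function names are hypothetical, atom names are assumed to follow PDB nomenclature (ribose atoms carry a prime or, in older files, an asterisk), and the two-letter codes put the RNA side first, as in the SS, SN, NS, and NN labels of Table 5.

```python
# Protein main-chain atom names in PDB nomenclature; any other atom is
# treated as a side-chain atom (an assumption of this sketch).
PROTEIN_BACKBONE = {"N", "CA", "C", "O", "OXT"}
PHOSPHATE_ATOMS = {"P", "OP1", "OP2", "OP3", "O1P", "O2P", "O3P"}

def is_rna_backbone_atom(atom_name):
    """True for ribose atoms (primed names such as O2' or C4') and phosphate atoms."""
    name = atom_name.strip().replace("*", "'")  # older PDB files write O2* for O2'
    return name.endswith("'") or name in PHOSPHATE_ATOMS

def interaction_mode(rna_atom, protein_atom):
    """Two-letter mode: first letter for the RNA side, second for the protein side.
    S = specific (nucleobase or side chain), N = nonspecific (backbone)."""
    rna_code = "N" if is_rna_backbone_atom(rna_atom) else "S"
    protein_code = "N" if protein_atom.strip() in PROTEIN_BACKBONE else "S"
    return rna_code + protein_code

print(interaction_mode("N7", "NH1"))   # guanine base atom with an Arg side-chain atom
print(interaction_mode("OP1", "NZ"))   # phosphate oxygen with a Lys side-chain atom
print(interaction_mode("O2'", "O"))    # ribose hydroxyl with a main-chain carbonyl
```

Under these conventions, nucleobase-specific bonds are those whose code starts with S, and protein-nonspecific bonds are those whose code ends with N.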

Irregularly structured regions are the predominant protein structural state in RNP regions. This is in significant contrast to DNA–protein interactions in which α-helices play a prominent role.29 RNA recognition by several RNA-binding proteins involves β-sheets, which led to the hypothesis that these structural elements are preferred due to the large surface area they provide for interaction,30,31 but we find β-sheets to actually be significantly less common in interactions with the RNA backbone (which constitute 75.4% of all RNA–protein interactions) than in the overall data set.

When we look at the RNA in protein-specific and -nonspecific interactions, we find that, in both cases, RNA base frequencies do not differ from overall but that RNA structural frequencies differ significantly, with more bases in unpaired conformations in RNP regions. As mentioned before, this probably reflects the relative inaccessibility of nucleobases in helical structures.

Accessible surface area analyses indicate that RNA–protein interfaces are not as well packed as protein–DNA and protein–protein interfaces.15,16 The prevalence of unpaired bases in interfaces has been suggested as a possible explanation for loose packing of these interfaces.15,31 We find significant overrepresentation of both unpaired bases and irregularly structured amino acids in RNP regions, supporting the idea of loosely packed RNA–protein interfaces. The high level of irregular structures in both RNA and protein undoubtedly contributes to the difficulty in identifying RNA structural motifs involved in protein binding: to date, only a few (kink turns,32 tetraloops,33 and RNA tripods28) have been proposed.

To investigate the combined influence of RNA sequence and structure on protein recognition, we calculate log-odds scores for all RNA sequence–structure pairs, indicating the preference of each sequence–structure pair for RNP versus nonRNP regions. Some states are preferred over others at RNP regions (Table 6), indicating that certain structural states favor protein binding more than others. The strong preference for stacked cytosines in protein binding is interesting because neither cytosine nor the stacked structural state shows this preference when nucleobases and RNA structural states are considered separately. This suggests that RNA sequence and structure may have cooperative or nonlinear effects on protein binding. All bases in unpaired states are preferred in protein-binding regions. The unpaired RNA structural state is significantly preferred in protein-binding regions; this preference could be the reason that we see an overrepresentation of all the bases in unpaired states in RNP regions when we look at sequence and structure together.

The importance of binding to single-stranded regions is exemplified by proteins such as nucleolin, a protein involved in folding and packaging of pre-rRNA in the nucleolus, which binds to a conserved sequence motif (U/G)CCCG(A/G) found in a hairpin loop and to another single-stranded 11-nucleotide sequence motif GAUCGAUGUGG.34
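As a worked reading of the log2-odds scores in Table 6 discussed above, a score $s$ corresponds to a frequency ratio of $2^{s}$ between RNP and nonRNP regions. The three values below are taken from Table 6 (stacked cytosine, unpaired uracil, and uracil in the Ws state); the conversions are plain arithmetic and add no new data.

```latex
\begin{align*}
\text{cytosine-s:}  \quad & 2^{1.5990}  \approx 3.03 \\
\text{uracil-u:}    \quad & 2^{1.0314}  \approx 2.04 \\
\text{uracil-Ws:}   \quad & 2^{-0.7484} \approx 0.60
\end{align*}
```

Read this way, stacked cytosines are roughly three times as frequent in RNP regions as in nonRNP regions, whereas uracil in the Ws state occurs in RNP regions at about 60% of its nonRNP frequency.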

While nucleobase-nonspecific interactions with RNA play a dominant role in protein–RNA interactions, sequence-specific interactions are nonetheless important. Hydrogen bonding to nucleobases confers specificity to interactions since hydrogen-bonded bases have sequence as well as structural properties that can be distinguished from the overall RNA sequence and structure. We find that specific RNA structural states are preferred in RNP regions, intimating the important role that RNA structure plays in positioning the interacting moieties and determining their accessibility for interaction.

In this article, we have examined RNA base and structural preferences in RNP regions and compared those to both overall and nonRNP region frequencies. As further structural data accumulate, it will also be interesting to study higher-order interactions, such as preferences for dinucleotides and trinucleotides in RNP regions. A closer look at preferred RNA structural states may lead to identification of additional RNA secondary structural motifs, such as kink turns, preferred for interactions with protein.32 RNA structure in RNP regions is clearly distinguishable from the RNA structure in overall and nonRNP regions, suggesting that consensus protein-binding structural motifs in RNA can be found by further analysis.

Development of models that can predict RNP regions in uncharacterized RNAs based on sequence and structural properties in these binding regions is an interesting future research area. Knowledge of preferred protein sequence and structure properties has led to development of computational models that can predict RNA-binding regions in a protein.35–40 A similar attempt using 3D physicochemical patterns of binding pockets tries to predict unpaired nucleotides and dinucleotides in local regions where the RNA structure could be in a protein-binding conformation.41 The distinct sequence and structural properties observed for nucleobase-specific and -nonspecific interactions suggest that these interaction modes should be modeled separately to achieve higher accuracy in predicting protein-binding regions in RNA.

Materials and Methods

Data set collection

We selected 77 RNA–protein complexes (Table 7) from the PDB for this analysis according to the following criteria:

(1) Crystallographic resolution of 3.0 Å or better, to assure good quality of crystal structures allowing unambiguous identification of intermolecular interactions between RNAs and proteins.

(2) Sequence similarity < 80% for RNA.

(3) Sequence similarity < 80% for protein.

To avoid any bias in downstream statistical analysis due to the presence of redundant sequences, we used CD-HIT42 to remove RNAs and proteins with highly similar sequences from the data set. For each RNA–protein complex, we used a single copy of the known biologically significant oligomerization state of each molecule, as determined from BIOMT records (REMARK 350) in the PDB files. However, BIOMT records of 1MJI, 1WSU, 2FMT, 1ZBH, and 3EPH reflected multiple copies of the biological assembly, and we picked representative RNA and protein chains that form a single assembly. 1GTN has 11 identical protein chains interacting with a (GAGCC)11 repeat RNA, and we consider binding of only one such protein to a single repeat sequence to avoid overcounting redundant interactions. We also applied symmetry operations to generate the RNA chains (marked by ⁎ in Table 7) that complete the RNA–protein binding interface in 3ADL, 1K8W, and 1SDS.

Our resulting data set has 211 RNA–protein chain pairs, comprising 9977 nucleotides, out of which 2207 nucleotides make 3833 RNA–protein hydrogen bonds. This is the largest RNA–protein interaction data set to date and further differs from previous data sets in reducing the redundancy of both RNA and protein sequences.
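The BIOMT records (REMARK 350) mentioned above list, for each biological assembly, the chains involved and the transformation operators applied to them. The sketch below shows one way such records could be summarized; it is an illustration and not the procedure used in this work. The function name `parse_biomt` and the sample text are hypothetical, and only the common single-number `BIOMOLECULE:` form is handled.

```python
def parse_biomt(pdb_text):
    """Summarize REMARK 350 records as
    {biomolecule_id: {"chains": [...], "n_operators": int}}."""
    assemblies = {}
    current = None
    for line in pdb_text.splitlines():
        if not line.startswith("REMARK 350"):
            continue
        body = line[10:].strip()
        if body.startswith("BIOMOLECULE:"):
            current = int(body.split(":", 1)[1])
            assemblies[current] = {"chains": [], "operators": set()}
        elif current is None:
            continue  # free-text remarks before the first biomolecule
        elif "CHAINS:" in body:
            # Covers "APPLY THE FOLLOWING TO CHAINS:" and "AND CHAINS:" lines.
            chain_part = body.split("CHAINS:", 1)[1]
            assemblies[current]["chains"] += [
                c.strip() for c in chain_part.split(",") if c.strip()
            ]
        elif body.startswith("BIOMT"):
            # The second field is the operator serial number (rows 1-3 share it).
            assemblies[current]["operators"].add(int(body.split()[1]))
    return {
        key: {"chains": val["chains"], "n_operators": len(val["operators"])}
        for key, val in assemblies.items()
    }

# Hypothetical excerpt: one assembly built from chains A and B with two operators.
SAMPLE = """\
REMARK 350 BIOMOLECULE: 1
REMARK 350 APPLY THE FOLLOWING TO CHAINS: A, B
REMARK 350   BIOMT1   1  1.000000  0.000000  0.000000        0.00000
REMARK 350   BIOMT2   1  0.000000  1.000000  0.000000        0.00000
REMARK 350   BIOMT3   1  0.000000  0.000000  1.000000        0.00000
REMARK 350   BIOMT1   2 -1.000000  0.000000  0.000000        0.00000
REMARK 350   BIOMT2   2  0.000000 -1.000000  0.000000        0.00000
REMARK 350   BIOMT3   2  0.000000  0.000000  1.000000        0.00000
"""
print(parse_biomt(SAMPLE))
```

An entry with more than one operator, or with more than one biomolecule, is the kind of case in which a single representative assembly has to be chosen by hand, as described above for 1MJI, 1WSU, 2FMT, 1ZBH, and 3EPH.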

Structural states of RNA bases and protein residues

We used RNAView43 to determine RNA structural states in each crystallographic structure. We define eight exclusive structural states for the bases (Table 1) based on base pairing and base stacking. Structural states of a small fraction of bases (0.65%) could not be determined by the program and are not considered further.
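The eight state labels used in Table 6 (W, WS, Ws, WSs, S, Ss, s, u) are consistent with combining three yes/no properties of a base. The sketch below makes that reading explicit. It is an assumption inferred from the labels and from phrases in the text such as "Ws state (WC and stacked)" and "Sugar/Hoogsteen state (S)"; the function name is hypothetical and the code is not derived from RNAView output.

```python
from itertools import product

def structural_state(wc_paired, sugar_hoogsteen_paired, stacked):
    """Compose a state label from three flags (assumed meaning of each letter):
    W = Watson-Crick edge paired, S = Sugar/Hoogsteen edge paired,
    s = stacked, u = none of these (unpaired and unstacked)."""
    label = ""
    if wc_paired:
        label += "W"
    if sugar_hoogsteen_paired:
        label += "S"
    if stacked:
        label += "s"
    return label or "u"

# Enumerating all flag combinations yields eight mutually exclusive labels.
all_states = [structural_state(*flags) for flags in product([False, True], repeat=3)]
print(all_states)
```

Because every base receives exactly one label, state frequencies computed over a set of bases sum to one, which is what allows them to be compared between RNP and nonRNP regions.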


Table 7. Data set of RNA–protein complexes

PDB ID  RNA type(a)    RNA chain ID(b)  RNA bases(c)  Resolution(d) (Å)
2J02    Ribosomal      A                1483          2.8
1JJ2    Ribosomal      0,9              2876          2.4
2ZJR    Ribosomal      X,Y              2808          2.91
1MMS    Ribosomal      C                58            2.57
2I91    Ribosomal      C,D              22            2.65
1DFU    Ribosomal      M,N              38            1.8
1MJI    Ribosomal      D                34            2.5
1MZP    Ribosomal      B                55            2.65
1I6U    Ribosomal      C                37            2.6
1JBS    Ribosomal      C                17            1.97
2ATW    Ribosomal      B                12            2.25
3FTE    Ribosomal      C,D              44            3.0
2PJP    mRNA           B                23            2.3
1GTN    mRNA           W                5             2.5
1WSU    mRNA           E                22            2.3
2GJW    mRNA           E,F,H            35            2.85
1CVJ    mRNA           M                9             2.6
3FHT    mRNA           C                6             2.2
2VPL    mRNA           B                48            2.3
2F8K    mRNA           B                16            2.0
3BSX    mRNA           C                10            2.32
3K49    mRNA           B                10            2.5
1S03    mRNA           A                47            2.7
1ZBH    mRNA           E                16            3.0
2J0S    mRNA           E                6             2.21
1G2E    mRNA           B                10            2.3
1ZHO    mRNA           B                38            2.6
3F73    mRNA           H                12            3.0
1B7F    mRNA           P                12            2.6
1XOK    Viral RNA      A,B              36            3.0
2QUX    Viral RNA      C                25            2.44
1HYS    Viral RNA      E                23            3.0
3KS8    Viral RNA      E,F              36            2.4
2EC0    Viral RNA      B,C              15            2.75
2R7R    Viral RNA      X                7             2.6
3H5X    Viral RNA      T,P              16            1.77
3ADL    Pre-microRNA   B,C,B⁎,C⁎        40            2.2
3A6P    Pre-microRNA   D,E              46            2.92
1SER    tRNA           T                50            2.9
1VFG    tRNA           C                31            2.8
1H4S    tRNA           T                64            2.85
1K8W    tRNA           B,B⁎             42            1.85
1TTT    tRNA           D                62            2.7
1QTQ    tRNA           B                74            2.25
1GAX    tRNA           C                75            2.9
2ZM5    tRNA           C                74            2.55
1QF6    tRNA           B                69            2.9
1F7U    tRNA           B                63            2.2
1QU2    tRNA           T                74            2.2
2ZUF    tRNA           B                74            2.3
2CSX    tRNA           C                74            2.7
2DVI    tRNA           B                34            2.61
3EPH    tRNA           E                69            2.95
2B3J    tRNA           E,F              28            2.0
2FMT    tRNA           C                71            2.8
1C0A    tRNA           B                68            2.4
2DXI    tRNA           C                74            2.2
2ZZM    tRNA           B                72            2.65
2BTE    tRNA           B                71            2.9
1ZE2    tRNA           C                21            3.0
1Q2R    tRNA           E                19            2.9
2IPY    IRE            C                30            2.8
1A9N    snRNA          Q                24            2.38
2OZB    snRNA          C                33            2.6
1URN    snRNA          P                20            1.92
2V3C    SRP            M                96            2.5
1HQ1    SRP            B                48            1.52
1JID    SRP            B                26            1.8
2HVY    snoRNA         E                61            2.3
1SDS    snoRNA         F,F⁎             30            1.8
1RLG    snoRNA         C                22            2.7
1YVP    Y RNA          C,D,G            21            2.2
2F8S    siRNA          C,D              44            3.0
1RPU    siRNA          C,D              42            2.5
2BGG    Guide RNA      P,Q              16            2.2
3IAB    RNase RNA      R                46            2.7
2PY9    Telomeric RNA  E                12            2.56

RNA chains generated by symmetry operations to obtain the complete RNA–protein complex are marked with an asterisk (⁎).

(a) IRE is iron-responsive element RNA, snRNA is small nuclear RNA, SRP indicates the signal recognition particle RNA, snoRNA is small nucleolar RNA, Y RNAs are part of Ro ribonucleoparticles, and siRNA is silencing RNA.

(b) PDB file RNA chain ID selected for analysis based on BIOMT records.

(c) Number of bases with coordinates in the PDB entry (which may not correspond to chain length in SEQRES record due to the absence of some bases in the crystal structure).

(d) Crystallographic resolution as recorded in the PDB entry.


Hydrogen bonds and van der Waals interactions between RNAs and proteins

We used HBPLUS44 to identify intermolecular hydrogen bonds between the RNA and protein molecules. The maximum donor–acceptor distance was set to 3.35 Å and the maximum hydrogen–acceptor distance was set to 2.7 Å, as in previous studies.15,24

Bases that are within 4 Å of protein atoms but do not form hydrogen bonds are identified as "neighbors". These neighbor bases can be thought of as forming van der Waals contacts with the protein.15
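The distance thresholds above can be expressed as a small sketch. It covers only the distance cutoffs quoted in the text; HBPLUS applies further geometric checks that are not reproduced here, so this is not a substitute for the program. Function names and coordinates are hypothetical, and coordinates are assumed to be (x, y, z) tuples in Å.

```python
import math

HBOND_MAX_DONOR_ACCEPTOR = 3.35     # Å, maximum donor-acceptor distance
HBOND_MAX_HYDROGEN_ACCEPTOR = 2.7   # Å, maximum hydrogen-acceptor distance
NEIGHBOR_CUTOFF = 4.0               # Å, base atom to protein atom

def passes_hbond_distances(donor, hydrogen, acceptor):
    """Check only the two distance cutoffs quoted in the text."""
    return (
        math.dist(donor, acceptor) <= HBOND_MAX_DONOR_ACCEPTOR
        and math.dist(hydrogen, acceptor) <= HBOND_MAX_HYDROGEN_ACCEPTOR
    )

def is_neighbor(base_atoms, protein_atoms, has_hbond):
    """A base is a 'neighbor' if it makes no hydrogen bond to the protein but
    has at least one atom within the cutoff of a protein atom."""
    if has_hbond:
        return False
    return any(
        math.dist(b, p) <= NEIGHBOR_CUTOFF
        for b in base_atoms
        for p in protein_atoms
    )

# Hypothetical coordinates along the x axis.
print(passes_hbond_distances((0, 0, 0), (1, 0, 0), (2.9, 0, 0)))
print(is_neighbor([(0, 0, 0)], [(3.9, 0, 0)], has_hbond=False))
```

The all-pairs loop is quadratic in the number of atoms; for ribosome-sized complexes a spatial grid or k-d tree would be the usual replacement.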

Log-odds score for sequence–structure states

If $N_{\mathrm{rnp}}(i)$ is the frequency of a sequence–structure state $i$ in RNP regions and $N_{\mathrm{nonRNP}}(i)$ is its frequency in nonRNP regions, then the log-odds of state $i$ is

$$\text{log-odds}(i) = \log_2\left(\frac{N_{\mathrm{rnp}}(i)}{N_{\mathrm{nonRNP}}(i)}\right)$$

Thus, log-odds > 0 indicates preference of nucleobase–structural state $i$ for RNP regions, log-odds < 0 indicates preference for nonRNP regions, and log-odds = 0 indicates no preference.

Supplementary materials related to this article can be found online at doi:10.1016/j.jmb.2011.04.007
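The log-odds score defined in this section can be computed as in the following sketch. It assumes that "frequency" means the relative frequency of a state within its region (count divided by the region total), which the definition does not spell out; the function name and all counts are hypothetical.

```python
import math

def log_odds(rnp_counts, nonrnp_counts):
    """log2 of the ratio of relative frequencies (RNP over nonRNP) per state.
    States with a zero count in either region get None (score undefined)."""
    rnp_total = sum(rnp_counts.values())
    nonrnp_total = sum(nonrnp_counts.values())
    scores = {}
    for state, n_rnp in rnp_counts.items():
        n_non = nonrnp_counts.get(state, 0)
        if n_rnp == 0 or n_non == 0:
            scores[state] = None
            continue
        scores[state] = math.log2((n_rnp / rnp_total) / (n_non / nonrnp_total))
    return scores

# Hypothetical counts of three sequence-structure states in each region.
rnp = {"C-s": 30, "U-u": 120, "U-Ws": 50}
nonrnp = {"C-s": 20, "U-u": 120, "U-Ws": 260}
for state, score in log_odds(rnp, nonrnp).items():
    print(state, round(score, 3))
```

Normalizing by region totals keeps the score independent of how many nucleotides fall in RNP versus nonRNP regions, so a score of zero means the state is equally common in both.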

Acknowledgements

We would like to thank Prof. Barbara Golden for her valuable suggestions. We would also like to thank Kejie Li, Greg Zeigler, and Reazur Rahman for their comments on the manuscript. This work was supported by the National Science Foundation (DBI-0515986 and DBI-0850148) to M.G.

References

1. Lukong, K. E., Chang, K. W., Khandjian, E. W. & Richard, S. (2008). RNA-binding proteins in human genetic disease. Trends Genet. 24, 416–425.

2. Cooper, T. A., Wan, L. L. & Dreyfuss, G. (2009). RNA and disease. Cell, 136, 777–793.

3. Darnell, J. C., Fraser, C. E., Mostovetsky, O., Stefani, G., Jones, T. A., Eddy, S. R. & Darnell, R. B. (2005). Kissing complex RNAs mediate interaction between the Fragile-X mental retardation protein KH2 domain and brain polyribosomes. Genes Dev. 19, 903–918.

4. Kolb, S. J., Battle, D. J. & Dreyfuss, G. (2007). Molecular functions of the SMN complex. J. Child Neurol. 22, 990–994.

5. Bolognani, F., Qiu, S., Tanner, D. C., Paik, J., Perrone-Bizzozero, N. I. & Weeber, E. J. (2007). Associative and spatial learning and memory deficits in transgenic mice overexpressing the RNA-binding protein HuD. Neurobiol. Learn. Mem. 87, 635–643.

6. Kanadia, R. N., Shin, J., Yuan, Y., Beattie, S. G., Wheeler, T. M., Thornton, C. A. & Swanson, M. S. (2006). Reversal of RNA missplicing and myotonia after muscleblind overexpression in a mouse poly(CUG) model for myotonic dystrophy. Proc. Natl Acad. Sci. USA, 103, 11748–11753.

7. Brais, B., Bouchard, J. P., Xie, Y. G., Rochefort, D. L., Chretien, N., Tome, F. M. et al. (1998). Short GCG expansions in the PABP2 gene cause oculopharyngeal muscular dystrophy. Nat. Genet. 18, 164–167.

8. Karni, R., de Stanchina, E., Lowe, S. W., Sinha, R., Mu, D. & Krainer, A. R. (2007). The gene encoding the splicing factor SF2/ASF is a proto-oncogene. Nat. Struct. Mol. Biol. 14, 185–193.

9. Wendel, H. G., Silva, R. L., Malina, A., Mills, J. R., Zhu, H., Ueda, T. et al. (2007). Dissecting eIF4E action in tumorigenesis. Genes Dev. 21, 3232–3237.

10. Busa, R., Paronetto, M. P., Farini, D., Pierantozzi, E., Botti, F., Angelini, D. F. et al. (2007). The RNA-binding protein Sam68 contributes to proliferation and survival of human prostate cancer cells. Oncogene, 26, 4372–4382.

11. Chenard, C. A. & Richard, S. (2008). New implications for the QUAKING RNA binding protein in human disease. J. Neurosci. Res. 86, 233–242.

12. Kishore, S. & Stamm, S. (2006). The snoRNA HBII-52 regulates alternative splicing of the serotonin receptor 2C. Science, 311, 230–232.

13. McKie, A. B., McHale, J. C., Keen, T. J., Tarttelin, E. E., Goliath, R., van Lith-Verhoeven, J. J. et al. (2001). Mutations in the pre-mRNA splicing factor gene PRPC8 in autosomal dominant retinitis pigmentosa (RP13). Hum. Mol. Genet. 10, 1555–1562.

14. Lejeune, D., Delsaux, N., Charloteaux, B., Thomas, A. & Brasseur, R. (2005). Protein–nucleic acid recognition: statistical analysis of atomic interactions and influence of DNA structure. Proteins: Struct., Funct., Bioinform. 61, 258–271.

15. Jones, S., Daley, D. T. A., Luscombe, N. M., Berman, H. M. & Thornton, J. M. (2001). Protein–RNA interactions: a structural analysis. Nucleic Acids Res. 29, 943–954.

16. Bahadur, R. P., Zacharias, M. & Janin, J. (2008). Dissecting protein–RNA recognition sites. Nucleic Acids Res. 36, 2705–2716.

17. Treger, M. & Westhof, E. (2001). Statistical analysis of atomic contacts at RNA–protein interfaces. J. Mol. Recognit. 14, 199–214.

18. Morozova, N., Allers, J., Myers, J. & Shamoo, Y. (2006). Protein–RNA interactions: exploring binding patterns with a three-dimensional superposition analysis of high resolution structures. Bioinformatics, 22, 2746–2752.

19. Cheng, A. C., Chen, W. W., Fuhrmann, C. N. & Frankel, A. D. (2003). Recognition of nucleic acid bases and base-pairs by hydrogen bonding to amino acid side-chains. J. Mol. Biol. 327, 781–796.

20. Hall, T. M. (2005). Multiple modes of RNA recognition by zinc finger proteins. Curr. Opin. Struct. Biol. 15, 367–373.

21. Perez-Canadillas, J. M. & Varani, G. (2001). Recent advances in RNA–protein recognition. Curr. Opin. Struct. Biol. 11, 53–58.

22. Kielkopf, C. L., Lucke, S. & Green, A. R. (2004). U2AF homology motifs: protein recognition in the RRM world. Genes Dev. 18, 1513–1526.

23. Birney, E., Kumar, S. & Krainer, A. R. (1993). Analysis of the RNA-recognition motif and RS and RGG domains: conservation in metazoan pre-mRNA splicing factors. Nucleic Acids Res. 21, 5803–5816.

24. Ellis, J. J., Broom, M. & Jones, S. (2007). Protein–RNA interactions: structural analysis and functional classes. Proteins, 66, 903–911.

25. Jin, Y., Suzuki, H., Maegawa, S., Endo, H., Sugano, S., Hashimoto, K. et al. (2003). A vertebrate RNA-binding protein Fox-1 regulates tissue-specific splicing via the pentanucleotide GCAUG. EMBO J. 22, 905–912.

26. Kim, H., Jeong, E., Lee, S. & Han, K. (2003). Computational analysis of hydrogen bonds in protein–RNA complexes for interaction patterns. FEBS Lett. 552, 231–239.

27. Jeong, E., Kim, H., Lee, S. W. & Han, K. (2003). Discovering the interaction propensities of amino acids and nucleotides from protein–RNA complexes. Mol. Cell, 16, 161–167.

28. Ciriello, G., Gallina, C. & Guerra, C. (2010). Analysis of interactions between ribosomal proteins and RNA structural motifs. BMC Bioinformatics, 11, S41.

29. Luscombe, N. M., Austin, S. E., Berman, H. M. & Thornton, J. M. (2000). An overview of the structures of protein–DNA complexes. Genome Biol. 1, REVIEWS001.

30. Draper, D. E. (1999). Themes in RNA–protein recognition. J. Mol. Biol. 293, 255–270.

31. Nagai, K. (1996). RNA–protein complexes. Curr. Opin. Struct. Biol. 6, 53–61.

32. Klein, D., Schmeing, T., Moore, P. & Steitz, T. (2001). The kink-turn: a new RNA secondary structure motif. EMBO J. 20, 4214–4221.

33. Yang, X. J., Gerczei, T., Glover, L. & Correll, C. C. (2001). Crystal structures of restrictocin-inhibitor complexes with implications for RNA recognition and base flipping. Nat. Struct. Biol. 8, 968–973.

34. Allain, F. H. T., Bouvet, P., Dieckmann, T. & Feigon, J. (2000). Molecular basis of sequence-specific recognition of pre-ribosomal RNA by nucleolin. EMBO J. 19, 6870–6881.

35. Kumar, M., Gromiha, M. M. & Raghava, G. P. (2008). Prediction of RNA binding sites in a protein using SVM and PSSM profile. Proteins, 71, 189–194.

36. Terribilini, M., Lee, J. H., Yan, C., Jernigan, R. L., Honavar, V. & Dobbs, D. (2006). Prediction of RNA binding sites in proteins from amino acid sequence. RNA, 12, 1450–1462.

37. Terribilini, M., Sander, J. D., Lee, J. H., Zaback, P., Jernigan, R. L., Honavar, V. & Dobbs, D. (2007). RNABindR: a server for analyzing and predicting RNA-binding sites in proteins. Nucleic Acids Res. 35, W578–W584.

38. Tong, J., Jiang, P. & Lu, Z. H. (2008). RISP: a web-based server for prediction of RNA-binding sites in proteins. Comput. Methods Programs Biomed. 90, 148–153.

39. Wang, Y., Xue, Z., Shen, G. & Xu, J. (2008). PRINTR: prediction of RNA binding sites in proteins using SVM and profiles. Amino Acids, 35, 295–302.

40. Wang, L. J. & Brown, S. J. (2006). BindN: a web-based tool for efficient prediction of DNA and RNA binding sites in amino acid sequences. Nucleic Acids Res. 34, W243–W248.

41. Shulman-Peleg, A., Shatsky, M., Nussinov, R. & Wolfson, H. J. (2008). Prediction of interacting single-stranded RNA bases by protein-binding patterns. J. Mol. Biol. 379, 299–316.

42. Li, W. Z. & Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22, 1658–1659.

43. Yang, H. W., Jossinet, F., Leontis, N., Chen, L., Westbrook, J., Berman, H. & Westhof, E. (2003). Tools for the automatic identification and classification of RNA base pairs. Nucleic Acids Res. 31, 3450–3460.

44. McDonald, I. K. & Thornton, J. M. (1994). Satisfying hydrogen bonding potential in proteins. J. Mol. Biol. 238, 777–793.