Specificity landscapes unmask submaximal binding site ...Specificity landscapes unmask submaximal...

10
Specificity landscapes unmask submaximal binding site preferences of transcription factors Devesh Bhimsaria a,b,1 , José A. Rodríguez-Martínez a,2 , Junkun Pan c , Daniel Roston c , Elif Nihal Korkmaz c , Qiang Cui c,3 , Parameswaran Ramanathan b , and Aseem Z. Ansari a,d,4 a Department of Biochemistry, University of WisconsinMadison, Madison, WI 53706; b Department of Electrical and Computer Engineering, University of WisconsinMadison, Madison, WI 53706; c Department of Chemistry, University of WisconsinMadison, Madison, WI 53706; and d The Genome Center of Wisconsin, University of WisconsinMadison, Madison, WI 53706 Edited by Michael Levine, Princeton University, Princeton, NJ, and approved September 24, 2018 (received for review July 13, 2018) We have developed Differential Specificity and Energy Landscape (DiSEL) analysis to comprehensively compare DNAprotein inter- actomes (DPIs) obtained by high-throughput experimental plat- forms and cutting edge computational methods. While high-affinity DNA binding sites are identified by most methods, DiSEL uncovered nuanced sequence preferences displayed by homologous transcription factors. Pairwise analysis of 726 DPIs uncovered homolog-specific dif- ferences at moderate- to low-affinity binding sites (submaximal sites). DiSEL analysis of variants of 41 transcription factors revealed that many disease-causing mutations result in allele-specific changes in binding site preferences. We focused on a set of highly homologous factors that have different biological roles but readDNA using iden- tical amino acid side chains. Rather than direct readout, our results indicate that DNA noncontacting side chains allosterically contribute to sculpt distinct sequence preferences among closely related mem- bers of transcription factor families. Differential Specificity and Energy Landscapes | cognate site identification | DNAprotein interactome | DNA sequence recognition | allostery G enome-wide binding profiles of hundreds of transcription factors (TFs) have made it abundantly clear that these proteins bind to a large spectrum of sequences to manifest their biological functions (13). The affinity for different biologically relevant binding sites can vary dramatically. Surprisingly, only a fraction of the genomic sites occupied in living cells can be an- notated using high-affinity motifs assigned to a given TF (1). To further confound annotation, high-affinity sites can be bound interchangeably by TFs that bear a common DNA binding fold (46). This is especially true for highly homologous TFs that often bind indistinguishably to consensus high-affinity sites (6, 7). Increasingly, moderate- to low-affinity (submaximal or sub- optimal affinity) binding sites have been shown to guide selective binding of individual TFs to distinct genomic loci (811). In other words, energetically subtle preferences for different mod- erate- to low-affinity sites govern selective binding and distinct biological roles of closely related homologous TFs (811). The quest to identify consensus binding sites of all DNA (and RNA) binding proteins encoded within the human genome is being driven by high-throughput experimental platforms and new computational approaches (12, 13). Each experimental and computational approach has inbuilt advantages and limitations (1416). While high-affinity sites are readily identified, binding to submaximal affinity sites is nontrivial and is often overlooked. However, an unexpected result from recent analyses is that high- affinity consensusbinding sites often do not predict in vivo genome-wide binding profiles (chromatin immunoprecipitation followed by sequencing or ChIP-seq) as effectively as models that include sequences of submaximal affinities (17). Pairwise com- parisons of DNAprotein interactomes (DPIs) suggest that most experimental platforms capture high-affinity sites with remarkable fidelity (1823). However, the extent to which platform-dependent idiosyncrasies thwart the identification of submaximal binding sites is underscrutinized and poorly understood. Here, we report the development of Differential Specificity and Energy Landscapes (DiSEL) to compare experimental platforms, computational methods, and interactomes of TFs, especially those factors that bind identical consensus motifs. Our results reveal that (i ) most high-throughput experimental plat- forms reliably identify high-affinity motifs but yield less reliable information on submaximal sites; (ii ) with few exceptions, com- putational methods model DPIs with a focus on high-affinity sites; (iii ) submaximal sites improve the annotation of biologi- cally relevant binding sites across genomes; (iv) among members of TF families, homolog-specific preferences are most evident at submaximal affinity sites rather than high-affinity motifs; (v) among closely related homologs that use identical side chains to interact with DNA, the residues that face away from the DNA can allosterically confer homolog-specific preferences for sub- maximal sites; and (vi ) among naturally occurring alleles of specific factors, several disease-causing alleles impact binding to submaximal affinity sites (24). Taken together, DiSEL analysis readily unmasks the differences between experimental platforms and computational models and identifies submaximal sites that Significance Several experimental platforms and computational methods have been developed to identify DNA binding sites of over 1,000 transcription factors. Often, high-affinity (maximal) binding sites are reported as consensus motifs. Differences between experimental platforms contribute to uncertainty in ascribing binding to sub- maximal sites. However, biological studies emphasize the impor- tance of submaximal binding sites in shaping regulatory functions of transcription factors. To bridge this gap, we developed Differential Specificity and Energy Landscapes to unmask differences between experimental and computational methods as well as capture distinct submaximal binding site preferences of transcription factors. Our results suggest that subtle variation in protein structure can allo- sterically confer homolog-specific differences in binding to sub- maximal affinity sites. Author contributions: D.B., J.A.R.-M., Q.C., P.R., and A.Z.A. designed research; D.B., J.A.R.-M., J.P., D.R., E.N.K., Q.C., P.R., and A.Z.A. performed research; D.B., P.R., and A.Z.A. contributed new reagents/analytic tools; D.B., J.A.R.-M., J.P., D.R., and E.N.K. ana- lyzed data; and D.B., J.A.R.-M., J.P., D.R., E.N.K., Q.C., P.R., and A.Z.A. wrote the paper. Conflict of interest statement: A.Z.A. is the sole member of VistaMotif, LLC and founder of the educational nonprofit WINStep Forward. This article is a PNAS Direct Submission. Published under the PNAS license. 1 Present address: Bio Informaticals, Jaipur, Rajasthan 302016, India. 2 Present address: Department of Biology, University of Puerto RicoRio Piedras, San Juan, Puerto Rico 00925. 3 Present address: Departments of Chemistry and Physics & Biomedical Engineering, Boston University, Boston, MA 02215. 4 To whom correspondence should be addressed. Email: [email protected]. This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. 1073/pnas.1811431115/-/DCSupplemental. Published online October 19, 2018. E10586E10595 | PNAS | vol. 115 | no. 45 www.pnas.org/cgi/doi/10.1073/pnas.1811431115 Downloaded by guest on October 5, 2020

Transcript of Specificity landscapes unmask submaximal binding site ...Specificity landscapes unmask submaximal...

Page 1: Specificity landscapes unmask submaximal binding site ...Specificity landscapes unmask submaximal binding site preferences of transcription factors Devesh Bhimsariaa,b,1, José A.

Specificity landscapes unmask submaximal binding sitepreferences of transcription factorsDevesh Bhimsariaa,b,1, José A. Rodríguez-Martíneza,2, Junkun Panc, Daniel Rostonc, Elif Nihal Korkmazc, Qiang Cuic,3,Parameswaran Ramanathanb, and Aseem Z. Ansaria,d,4

aDepartment of Biochemistry, University of Wisconsin–Madison, Madison, WI 53706; bDepartment of Electrical and Computer Engineering, University ofWisconsin–Madison, Madison, WI 53706; cDepartment of Chemistry, University of Wisconsin–Madison, Madison, WI 53706; and dThe Genome Center ofWisconsin, University of Wisconsin–Madison, Madison, WI 53706

Edited by Michael Levine, Princeton University, Princeton, NJ, and approved September 24, 2018 (received for review July 13, 2018)

We have developed Differential Specificity and Energy Landscape(DiSEL) analysis to comprehensively compare DNA–protein inter-actomes (DPIs) obtained by high-throughput experimental plat-forms and cutting edge computational methods. While high-affinityDNA binding sites are identified by most methods, DiSEL uncoverednuanced sequence preferences displayed by homologous transcriptionfactors. Pairwise analysis of 726 DPIs uncovered homolog-specific dif-ferences at moderate- to low-affinity binding sites (submaximal sites).DiSEL analysis of variants of 41 transcription factors revealed thatmany disease-causing mutations result in allele-specific changes inbinding site preferences. We focused on a set of highly homologousfactors that have different biological roles but “read” DNA using iden-tical amino acid side chains. Rather than direct readout, our resultsindicate that DNA noncontacting side chains allosterically contributeto sculpt distinct sequence preferences among closely related mem-bers of transcription factor families.

Differential Specificity and Energy Landscapes | cognate site identification |DNA–protein interactome | DNA sequence recognition | allostery

Genome-wide binding profiles of hundreds of transcriptionfactors (TFs) have made it abundantly clear that these

proteins bind to a large spectrum of sequences to manifest theirbiological functions (1–3). The affinity for different biologicallyrelevant binding sites can vary dramatically. Surprisingly, only afraction of the genomic sites occupied in living cells can be an-notated using high-affinity motifs assigned to a given TF (1). Tofurther confound annotation, high-affinity sites can be boundinterchangeably by TFs that bear a common DNA binding fold(4–6). This is especially true for highly homologous TFs thatoften bind indistinguishably to consensus high-affinity sites (6, 7).Increasingly, moderate- to low-affinity (submaximal or sub-optimal affinity) binding sites have been shown to guide selectivebinding of individual TFs to distinct genomic loci (8–11). Inother words, energetically subtle preferences for different mod-erate- to low-affinity sites govern selective binding and distinctbiological roles of closely related homologous TFs (8–11).The quest to identify consensus binding sites of all DNA (and

RNA) binding proteins encoded within the human genome isbeing driven by high-throughput experimental platforms and newcomputational approaches (12, 13). Each experimental andcomputational approach has inbuilt advantages and limitations(14–16). While high-affinity sites are readily identified, bindingto submaximal affinity sites is nontrivial and is often overlooked.However, an unexpected result from recent analyses is that high-affinity “consensus” binding sites often do not predict in vivogenome-wide binding profiles (chromatin immunoprecipitationfollowed by sequencing or ChIP-seq) as effectively as models thatinclude sequences of submaximal affinities (17). Pairwise com-parisons of DNA–protein interactomes (DPIs) suggest that mostexperimental platforms capture high-affinity sites with remarkablefidelity (18–23). However, the extent to which platform-dependentidiosyncrasies thwart the identification of submaximal binding sitesis underscrutinized and poorly understood.

Here, we report the development of Differential Specificityand Energy Landscapes (DiSEL) to compare experimentalplatforms, computational methods, and interactomes of TFs,especially those factors that bind identical consensus motifs. Ourresults reveal that (i) most high-throughput experimental plat-forms reliably identify high-affinity motifs but yield less reliableinformation on submaximal sites; (ii) with few exceptions, com-putational methods model DPIs with a focus on high-affinitysites; (iii) submaximal sites improve the annotation of biologi-cally relevant binding sites across genomes; (iv) among membersof TF families, homolog-specific preferences are most evident atsubmaximal affinity sites rather than high-affinity motifs; (v)among closely related homologs that use identical side chains tointeract with DNA, the residues that face away from the DNAcan allosterically confer homolog-specific preferences for sub-maximal sites; and (vi) among naturally occurring alleles ofspecific factors, several disease-causing alleles impact binding tosubmaximal affinity sites (24). Taken together, DiSEL analysisreadily unmasks the differences between experimental platformsand computational models and identifies submaximal sites that

Significance

Several experimental platforms and computational methodshave been developed to identify DNA binding sites of over 1,000transcription factors. Often, high-affinity (maximal) binding sites arereported as consensus motifs. Differences between experimentalplatforms contribute to uncertainty in ascribing binding to sub-maximal sites. However, biological studies emphasize the impor-tance of submaximal binding sites in shaping regulatory functions oftranscription factors. To bridge this gap, we developed DifferentialSpecificity and Energy Landscapes to unmask differences betweenexperimental and computational methods aswell as capture distinctsubmaximal binding site preferences of transcription factors. Ourresults suggest that subtle variation in protein structure can allo-sterically confer homolog-specific differences in binding to sub-maximal affinity sites.

Author contributions: D.B., J.A.R.-M., Q.C., P.R., and A.Z.A. designed research; D.B.,J.A.R.-M., J.P., D.R., E.N.K., Q.C., P.R., and A.Z.A. performed research; D.B., P.R., andA.Z.A. contributed new reagents/analytic tools; D.B., J.A.R.-M., J.P., D.R., and E.N.K. ana-lyzed data; and D.B., J.A.R.-M., J.P., D.R., E.N.K., Q.C., P.R., and A.Z.A. wrote the paper.

Conflict of interest statement: A.Z.A. is the sole member of VistaMotif, LLC and founderof the educational nonprofit WINStep Forward.

This article is a PNAS Direct Submission.

Published under the PNAS license.1Present address: Bio Informaticals, Jaipur, Rajasthan 302016, India.2Present address: Department of Biology, University of Puerto Rico–Rio Piedras, San Juan,Puerto Rico 00925.

3Present address: Departments of Chemistry and Physics & Biomedical Engineering, BostonUniversity, Boston, MA 02215.

4To whom correspondence should be addressed. Email: [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1811431115/-/DCSupplemental.

Published online October 19, 2018.

E10586–E10595 | PNAS | vol. 115 | no. 45 www.pnas.org/cgi/doi/10.1073/pnas.1811431115

Dow

nloa

ded

by g

uest

on

Oct

ober

5, 2

020

Page 2: Specificity landscapes unmask submaximal binding site ...Specificity landscapes unmask submaximal binding site preferences of transcription factors Devesh Bhimsariaa,b,1, José A.

are preferred by homologous proteins with indistinguishable high-affinity target sites. Our results highlight the importance of non-obvious allosteric contributors in conferring differential sequencespecificity. While widely ignored, such allosteric effects likely con-tribute to sequence specificity beyond current models of direct andindirect readout of DNA sequence and shape. Increased evaluationof differential binding to submaximal affinity sites will undoubtedlyimprove the ability to decipher how genomic information is utilizedby TFs to manifest their regulatory functions in vivo.

ResultsSpecificity and Energy Landscapes Display Binding Affinities for anEntire Sequence Space. DPIs from high-throughput experimentalmethods are typically distilled down to a position weight matrix(PWM)-based “consensus motif” or a limited set of motifs (12,25, 26) (Fig. 1A). While PWM-based motifs efficiently summa-rize sequence preferences of a DNA binding protein, they compressrelated sequences into a consensus, overlook the impact of flankingsequences, and underestimate the full spectrum of cognate sitescontained within a given interactome. We utilize sequence specificitylandscapes (SSLs) to visualize individual interactomes (19, 27) (Fig.1B). When binding affinities are measured and correlated with cog-nate sites within an interactome, the resulting plots display bindingenergy landscapes [Specificity and Energy Landscapes (SELs)] ofindividual TFs (27, 28). In SSL/SEL plots, the binding affinities for ak-mer sequence space are represented in a series of concentric ringsorganized by a “seed motif.” All sequences in a DPI are then placedat different positions along the concentric circles based on sequencesimilarity to the seed motif (Fig. 1B and SI Appendix, Fig. S1). SELsof different classes of proteins reveal the range of binding modesdisplayed by a given TF and impact of flanking sequences and mis-matches on binding (6, 19).To elucidate similarities and differences in DPIs obtained by

various experimental methods and sequence preferences ofhighly homologous TFs, we now report the development ofDiSELs (Fig. 1B) (29). To perform DiSEL, we normalize pairedDPI datasets and scale the dynamic range of one DPI against theother (Methods has details). An automated peak finding algorithmto systematically identify and rank order the binding site preferencesof TFs was also developed (Fig. 1C and Methods). We then utilizedDiSEL to compare 568 pairs of DPIs of different DNA bindingdomains, 129 pairs of DPIs of different alleles of a given proteinand, 29 DPIs obtained by different experimental platforms.

DPIs Captured by Different Experimental Platforms. Over the pastdecade, several high-throughput platforms have been developedto chart the sequence specificity of DNA binding proteins.Among those, the most widely used methods can be grouped asDNA microarray-based platforms [Cognate Site Identifier (CSI)(18), protein binding microarray (PBM) (20), high-throughputsequencing–fluorescent ligand interaction profiling (HiTS-FLIP)(23)], massively parallel sequencing-enabled methods [for RNA: invitro selection, high-throughput sequencing of RNA, and sequencespecificity landscapes (SEQRS) (30); for DNA: high-throughputsystematic evolution of ligands by exponential enrichment (HT-SELEX) (22, 31), systematic evolution of ligands by exponentialenrichment with massively parallel sequencing (SELEX-seq) (7),Bind-n-Seq (32)], microfluidics-based protein arrays [mechanicallyinduced trapping of molecular interactions (MITOMI), selectivemicrofluidics-based ligand enrichment followed by sequencing(SMiLE-seq)] (13, 33), and cell-based bacterial one-hybridmethods (34). Each method has its advantages and limitations(12); however, submaximal affinity binding sites are typically notexamined due to uncertainty about whether such binding is in-dicative of biologically relevant affinities or simply arises as ar-tifacts of the experimental platform.We use PBM data as a benchmark to compare DPIs obtained

via different platforms. A common protein that was examined by

both CSI array and PBM is Gzf3, a C4-class zinc finger (35). Theconsensus motifs for Gzf3 from both sets of DPI data are nearlyidentical, and the scatterplot of binding intensities shows remarkable

A

B

C

Fig. 1. Comprehensive display of DPI data using SELs and DiSELs. (A) High-throughput methods to study DNA–protein interactions can be classified asarray based (CSI and PBM), microfluidics based (MITOMI), or based on nextgeneration DNA sequencing (high-throughput sequencing) (12). Computa-tional methods can summarize DPIs into PWMs. Sequence affinities de-termined by high-throughput methods correlate with equilibrium bindingconstants measured by standard methods. (B) The entire DPI is displayed as SEL.Shown is an SEL representation (19, 27) of the DPI of Gata4 by CSI array (19)using 5′WGATAA3′ as seed motif, where W = A or G. Sequences placed withinthe zero-mismatch ring have an exact match to the seed motif. The one-mismatch ring contains all sequences that differ from the seed motif at anyone position or a Hamming distance of one. The sequences are placed in aclockwise manner starting with mismatches at the first position of the motif andending with mismatches at the last position of the motif. Within each sector, themismatches at a given position “x” are organized in an alphabetical order (A-C-G-T). The two-mismatch ring contains all permutations with two positional dif-ferences with the seed. The height of each color-coded peak corresponds to theDNA binding intensity scale of plots: red to blue to gray corresponds to highestto median to lowest binding, respectively. Differences between any two DPIs canbe readily visualized through a DiSEL. DiSEL comparison between DPIs ofLhx4 and Lhx2 proteins with seed 5′TAATTA3′ is shown; Lhx4-preferred DNAbinding compared with Lhx2 is pointed out as peaks. Here, x means a mismatchat that the position of seed. (C) Domainwise distribution of DPIs of 568 homol-ogous TFs and 129 alleles of 41 TFs that are compared via DiSEL in the text.

Bhimsaria et al. PNAS | vol. 115 | no. 45 | E10587

BIOPH

YSICSAND

COMPU

TATIONALBIOLO

GY

Dow

nloa

ded

by g

uest

on

Oct

ober

5, 2

020

Page 3: Specificity landscapes unmask submaximal binding site ...Specificity landscapes unmask submaximal binding site preferences of transcription factors Devesh Bhimsariaa,b,1, José A.

correlation (r = 0.88) between the two platforms. SELs visuallydisplay the extent of similarity between the two DPIs, whereasDiSELs quantitatively highlight the differences (Fig. 2A).Whether these differences arise due to platform-specific differ-ences or if these are functional sites that are identified by oneplatform but not the other can now be clarified using a focusedset of sequences.Despite significantly deeper representation of sequence space

in HiTS-FLIP, the Gcn4 motif and interactome obtained byHiTS-FLIP are remarkably similar to those obtained by PBM(Fig. 2B) (correlation r = 0.65). However, impact of flankingsequences and other subtle contributions are better resolved inHiTS-FLIP due to the depth afforded by the sheer number ofDNA sequences available on the Illumina platform (23).MITOMI utilizes microfluidic approaches to examine 1,440

different DNA sequences with a given protein by capturingDNA–protein complexes using surface-tethered antibodies. Thelevels of trapped complex, as detected by fluorescence, reflectequilibrium binding affinities of the examined protein for a givenDNA sequence (33). Focused study of bHLH half sites yielded aset of sequences that correlated well with PBM data. However,encoding the entire 8-mer space within 1,440 oligonucleotides

yielded scatterplots and DiSELs that highlight the limitation ofusing the De Bruijn approach to represent the entire sequencespace of a given binding site. Despite the poor correlation (r =0.30) across the DPI, the motif and several high-affinity sites forCbf1, a bHLH protein, are congruent between the two platformsas shown by coinciding peaks in both SELs. However, DiSELshighlight the differences and provide clusters of sites detected inone platform vs. the other (Fig. 2C).Comparison of HT-SELEX interactome for FOXJ3 protein

(human) with the interactome of Foxj3 (mouse) obtained viaPBM shows that these two different experimental platforms yieldnearly identical motifs and highly comparable SELs. However,DiSEL analysis reveals the underrepresentation of a cognate site(5′GGTAAACA3′) that was previously identified as a part of theprimary Foxj3 binding motif (Fig. 2D and SI Appendix, Fig. S3)(14, 21). In other words, in HT-SELEX, if a sequence is un-derrepresented in early rounds of enrichment or not amplified orsequenced efficiently, it might be lost from the repertoire ofbona fide cognate sites of a given protein. To avoid potential lossof relevant sites, several groups have sequenced the initialmembers of the library bound by the protein without additionalrounds of enrichment (HT-SELEX) (31) or enriched complexes

A

B

C

D

Fig. 2. SEL/DiSEL to compare different high-throughput experimental platforms. SELs (Left), scatterplots of quantile-normalized DNA binding intensities for all 8-mers (Center), and DiSELs (Right) comparing DPIs obtained through high-throughput platforms with PBM. (A) CSI vs. PBM data for Saccharomyces cerevisiae Gzf3(seed motif: 5′GATAAG3′). (B) HiTS-FLIP vs. PBM data for S. cerevisiae Gcn4 (seed motif: 5′TGACTCA3′). (C) MITOMI vs. PBM data for S. cerevisiae Cbf1 (seed motif:5′CACGTG3′). (D) HT-SELEX for Homo sapiens FOXJ3 vs. PBM for Mus musculus Foxj3 (seed motif: 5′AAACA3′). DNA logos derived from PBM were downloadedfrom UniPROBE. SEL peaks represent DNA binding preferences of the protein as measured by the experimental platform, whereas DiSEL peaks correspond tobinding preference identified by one platform but not the other. Few differences are pointed out on DiSELs (arrows). All data are displayed as z scores.

E10588 | www.pnas.org/cgi/doi/10.1073/pnas.1811431115 Bhimsaria et al.

Dow

nloa

ded

by g

uest

on

Oct

ober

5, 2

020

Page 4: Specificity landscapes unmask submaximal binding site ...Specificity landscapes unmask submaximal binding site preferences of transcription factors Devesh Bhimsariaa,b,1, José A.

by EMSA and sequenced bound and unbound DNA from everyround of selection [SELEX-seq (7) or Spec-seq (36)].A major constraint in comparing a comprehensive set of TF

DPIs across all experimental platforms is the paucity of DNAinteractome data beyond a handful of TFs that have been sys-tematically tested on different experimental platforms. Evenwhen the same TF was examined across different high-throughputplatforms, on closer inspection, a number of confounding differ-ences emerged [for example, differences in species (mouse vs.human TFs), protein size (DNA binding domains vs. full-lengthTFs), sample purity (highly purified preparations vs. overexpressedTFs in crude whole-cell lysates or “in vitro transcription–translation”extracts), library design, depth of sequence coverage, bindingreaction conditions under which the experiments were conducted].Despite these constraints, we were able to find 29 TF inter-actomes that permitted meaningful comparisons by DiSEL(Fig. 2 and Dataset S1).

SELEX Enrichment Depletes Submaximal Sites.The ability to probe largerbinding sites, ease of use, widespread access to high-throughputsequencing, and cost-effectiveness through multiplexing are some ofthe reasons that have made systematic evolution of ligands byexponential enrichment (SELEX)-based methods widely used incampaigns to obtain DPIs for hundreds of DNA binding proteins.Different variations on the theme have been developed recently;some rely on limited rounds of enrichment, whereas others do asingle round of enrichment and sequence both bound and unboundDNA (7, 22, 31, 32, 36). At present, the decision on howmany roundsof enrichment are required to define the sequence specificity ofproteins is arbitrary. Moreover, the nature of cognate sites that aregained or lost per round is rarely examined. We, therefore, used SELand DiSEL to evaluate the DPIs obtained from each round ofFOXJ3 enrichment and sequencing.The consensus motifs derived from the top 500 sequences of

successive rounds of enrichment and amplification in HT-SELEX

A

B

Fig. 3. Evaluation of sequences enriched in different rounds of HT-SELEX. (A) Submaximal affinity sequences are lost during consecutive rounds of selectionin SELEX experiments for FOXJ3. (Top) PWMs by MEME (multiple EM for motif elicitation) algorithm derived from the top 500 8-mer binding sequences forrounds 1–5 (counts). (Middle) SELs for FOXJ3 with 5′AAACA3′ as seed motif. (Bottom) DiSELs comparing different selection rounds. Sequences with highintensity are pointed to in SELs; sequences lost or gained in different rounds are pointed out on DiSELs. (B) ROC curves for sensitivity–specificity analysis ofmapping of the interactome data from different rounds of HT-SELEX enrichment to FOXJ3 ChIP-seq peaks from U2OS cell line. Round 1 best fits the ChIPpeaks (AUC = 0.674). The AUC reduces significantly when excluding peaks with 5′AAAAA3′ (AUC = 0.547) or 5′AAATA3′ (AUC = 0.596) sites. However, ex-cluding peaks with a random 5-mer such as 5′ACGAC3′, which shows no binding by FOXJ3, does not alter the ROC values (AUC = 0.672).

Bhimsaria et al. PNAS | vol. 115 | no. 45 | E10589

BIOPH

YSICSAND

COMPU

TATIONALBIOLO

GY

Dow

nloa

ded

by g

uest

on

Oct

ober

5, 2

020

Page 5: Specificity landscapes unmask submaximal binding site ...Specificity landscapes unmask submaximal binding site preferences of transcription factors Devesh Bhimsariaa,b,1, José A.

are nominally different (Fig. 3A) (37). In contrast from SEL plots, itis evident that, with each round of enrichment, submaximal sites areweeded out until only high-affinity sites dominate the landscape (Fig.3A and SI Appendix, Fig. S4). DiSEL analysis reveals the nature ofsites that are successively eliminated in each round of enrichment.For example, 5′AAACATT3′ occurs prominently in the first round,but it is lost by round 3 and certainly missing in round 5; however, arelated sequence, 5′AAACATAA3′, is successively enriched, and byround 5, two very sharp high-affinity peaks with this motif are evidentin the final DPI. DiSEL thus captures the high-affinity motifs as wellas provides a view of the evolving specificity landscape and the pro-gressive loss of submaximal affinity sites that may well be biologicallyrelevant in vivo. In doing so, SEL and DiSEL assist in defining theoptimal rounds of enrichment that might best capture the range ofDNA cognate sites bound by a given protein or small molecule.To determine whether the submaximal sites identified by DiSEL

analysis are bound by TFs in cells, we examined genome-widebinding profiles of FOXJ3 in U2OS, a human osteosarcoma cell line(38). The receiver operator characteristic (ROC) to retrieve FOXJ3ChIP-seq peaks with all sites identified in the first round of HT-SELEX enrichment was far greater than subsequent rounds ofenrichment (Fig. 3B). Removing peaks bearing two submaximal5′AAAAA3′ and 5′AAATA3′ sites eliminates the advantageoffered by round 1-enriched sequences to annotate ChIP-seqpeaks. In contrast, removing regions bearing unrelated sites 5′ACGAC3′ (Fig. 3B), 5′TAACA3′, or 5′GTATG3′ (SI Appendix,Fig. S4C) does not impact area under the curve (AUC) values ofROC. These results highlight the biological relevance of thesubmaximal sites identified by the DiSEL approach. Thus, witheach round of SELEX, the best binding sites are enriched at theexpense of biologically relevant submaximal binding sites.

Comparison of Computational Models. Two different algorithmsreported related but not identical motifs for the same Foxj3DNA interactome data (14–16). These motifs display differinglevels of success in capturing cognate sites within the inter-actome. Protein binding microarray–binding energy and expec-tation maximization likelihood (PBM-BEEML) was designed forPBM data and considers thermodynamic binding in generating aPWM for a motif (14). Seed-n-Wobble also works on PBM data,but it identifies a seed and builds a motif by considering substi-tutions to the seed sequence and by adding nucleotides on the 5′and 3′ ends to determine if any given nucleotide leads to anincreased overall binding intensity (20). The consequence is thatSeed-n-Wobble builds high information content motifs, and re-lated motifs are designated secondary motifs: for example, 5′GTAAACA3′ vs. 5′CAAAACA3′. In direct comparison, PBM-BEEML successfully identifies a larger fraction of Foxj3 bindingsequences, because it defines the core 5′AAACA3′ as the motif.Displaying the Foxj3 interactome in an SEL and using either 5′AAACA3′ or 5′RTAAACA3′ (R = G/A) as seed motifs showthat the PBM-BEEML model is too permissive, whereas Seed-n-Wobble may be too restrictive (Fig. 4A).In recent reports, several models are simultaneously applied to

discover consensus motifs (5). The assumptions inherent to eachcomputational method may impact the inclusion or exclusion ofsubmaximal cognate sites in unanticipated ways as is evidentfrom the SEL in Fig. 4A. SELs and DiSELs could serve as anunbiased tool to evaluate how accurately motifs derived fromdifferent computational models capture the full affinity andspecificity profiles of DNA binding proteins. From the crenella-tions in the zero-mismatch ring, one can immediately identify theimpact of different flanking sequences on the ability of a protein tobind a perfectly matched core cognate site. Recent studies haveshown that such context effects are important in modulatingbinding of TFs to different genomic loci in cells (19, 39–41).Another consistent pattern that emerges from SELs is that

many TFs bind more than one motif: for example, PAX6 binds 5′

TAATTA3′ and 5′TGCACA3′. Specificity profiles of such TFscan be analyzed in two different ways: first by plotting differentSELs using each motif as a seed and second by combining bothmotifs as seeds in a single SEL (Fig. 4B). The advantages in thefirst strategy are that one can focus on one motif at a time andthat binding sites encapsulating the other motif would manifestthemselves as peaks in appropriate mismatch rings. The secondstrategy offers the advantage of comparative analysis of morethan one motif on a single SEL plot. In either case, the bindingaffinity associated with any given sequence is not altered in anyway, only the position with respect to seed changes.

Differential Preferences of Highly Homologous TFs. To determine ifDiSELs can tease out specificity differences between homolo-gous TFs, we examined the entire interactomes of 568 differentpairs of DNA binding proteins representing different classes ofDNA binding domains (Figs. 1C and 5A, Datasets S2–S5, and SIAppendix, Figs. S5 and S6). In particular, we carefully scrutinizedthree pairs of homologous TFs: Lhx2 and Lhx4 of the Homeo-domain family, Hnf4a and Rxra of the Nuclear Receptor family,and Irf4 and Irf5 of the tryptophan pentad repeat members of thewinged helix-turn-helix family of DNA binding proteins. Theadded benefit of examining these three pairs of TFs is that se-quences preferred by one member over the other have beenmapped previously (21, 42). Thus, this prior set of homolog-specific sites serves as a benchmark for DiSEL-based identifi-cation of sequences preferred by closely related homologs.As expected, the derived PWM from each interactome shows

that related TFs yield nearly identical consensus motifs (Fig. 5Aand SI Appendix, Figs. S5A and S6A). SEL representation of theentire DPI for all three pairs further emphasizes the extent of theoverlap in sequence preferences of matched pairs (Fig. 5B and SIAppendix, Figs. S5B and S6B). This is not surprising, especially in thecase of Lhx2 and Lhx4, because both proteins utilize identical aminoacid side chains to read the high-affinity 5′TAATTA3′ cognate site.In DiSELs, by virtue of the sequence placement within the

landscapes, related sequences preferred by one homolog over theother cluster together and are automatically identified using apurpose-built software package (provided here). Such sites can alsobe visually identified, and underlying sequences can be queried

A

B

Fig. 4. SELs to compare specificity models derived from PBM-BEEML andSeed-n-Wobble computational methods. (A) SELs for the DNA interactomeof Foxj3 with PWMs representing specificity models of two computationalmethods derived from PBM data (z scores). Seed motifs derived by PBM-BEEML (5′AAACA3′; Left) and Seed-n-Wobble (5′RTAAACA3′; Right), whereR = A/G (14). Positions of the two sequences from Seed-n-Wobble seeds arepointed out in both SELs 5′GTAAACAA3′ (1) and 5′CAAAACAA3′ (2). (B) Topviews of SELs of PAX6 protein using seed 5′TAATTA3′ in Left, seed 5′TGCACA3′in Center, and both 5′TAATTA3′ and 5′TGCACA3′ as seed in Right. The dashedblack lines demarcate the landscape using one motif or the other as a seed.

E10590 | www.pnas.org/cgi/doi/10.1073/pnas.1811431115 Bhimsaria et al.

Dow

nloa

ded

by g

uest

on

Oct

ober

5, 2

020

Page 6: Specificity landscapes unmask submaximal binding site ...Specificity landscapes unmask submaximal binding site preferences of transcription factors Devesh Bhimsariaa,b,1, José A.

using an interactive graphical user interface (Fig. 5C, Dataset S6,and SI Appendix, Figs. S5C and S6C).Previous efforts to identify homolog-preferred sites relied on

either a subjective manual curation or a Bayesian ANOVAmodel (21, 42, 43). Our analysis of DiSEL plots readily capturedthe site preferences identified by previous methods [such asmanually curated (5′GGTCCA3′ preferred by Hnf4a comparedwith Rxra, 5′TGAAAG3′ preferred by Irf4 compared with 5′CGA-GAC3′ preferred by Irf5) and ANOVA based (5′TAACGA3′, 5′TAATGG3′, and 5′TAATGA3′ preferred by Lhx2 vs. 5′TAATCA3′,5′TGATTG3′, and 5′TAATCT3′ by Lhx4)]. More important, auto-mated DiSEL analysis revealed homolog-preferred submaximal sitesthat were missed by previous studies (Datasets S2–S5). Displaying theidentified sites on a scatterplot makes plain the challenges of identi-fying these homolog preferences through current approaches. DiSELplots, however, cluster and highlight submaximal binding sites (Fig. 5Dand SI Appendix, Figs. S5D and S6D). Another striking observation thatemerges from this analysis is that homolog-specific sequences primarilyappear in the mismatch rings and comprise submaximal sites.In addition to capturing previously mapped differences in

vitro, an additional feature supports the conclusion that thesubmaximal sequences identified by DiSEL analysis are re-flective of preferential binding in vivo. As in the example ofFoxj3 above, genome-wide binding profiles in biologically rele-vant cells, as identified by ChIP-seq, show that submaximal sitesimprove the annotation of ChIP-seq peaks significantly. We ex-amined the ChIP-seq peaks of LHX2 in hair follicle cells (44) todetermine whether Lhx2- or Lhx4-preferred binding sites betterannotate these ChIP peaks. Focusing on the submaximal sites 5′TAATG3′ preferred by Lhx2 and 5′TAATC3′ preferred by Lhx4,we report that ROC plots show that retaining 5′TAATG3′containing ChIP peaks led to better annotation with Lhx2 (AUC0.811) vs. the Lhx4 (AUC 0.754) interactome data (Fig. 5E,Upper Left). Conversely, removing peaks bearing 5′TAATG3′ sitesled to indistinguishable annotation using either Lhx2 (AUC 0.607)or Lhx4 (AUC 0.619) interactome data (Fig. 5E, Lower Left); thisobservation was also true for other replicates (SI Appendix, Fig.S7). Furthermore, removal of peaks bearing Lhx4-preferred 5′TAATC3′ sites improves the annotation of ChIP-seq data with theLhx2 interactome data (Fig. 5E, Right), supporting the fact that,consistent with in vitro sequence preferences, 5′TAATC3′ sitesare less favorably bound by Lhx2 in comparison with Lhx4.

Amino Acids That Differ Between Closely Related Homologs Do Not“Read” DNA. Aligning the amino acid sequence of Lhx2 andLhx4 with the prototypical homeodomain protein engrailed showsthat, in each homolog, side chains that make direct base pair con-tacts are identical and so are over one-half of the residues thatinteract with DNA backbone (Fig. 6A). However, homologs vary inthe residues that face away from the DNA and are packed againstthe hydrophobic core of the homeodomain (Fig. 6 B and C). Inparticular, the substitution of residues L45 and R52 of Lhx2 (blue inFig. 6) with V45 and A52 of Lhx4 (orange in Fig. 6) could poten-tially tilt the DNA recognition helix 3 with respect to the majorgroove. Such a change in the docking angle of the recognition helixcould change the energetic dependence on specific positions withinan otherwise identical DNA consensus motif. Remarkably, in Lhx2,L45 packs against M16, whereas in Lhx4, V45 shows poorer packingagainst L16. Such variation in residues that pack and stabilize theprotein core is not limited to Lhx2 and Lhx4; similar changes atpositions 45 and 52 are observed in 30 of 255 homeodomains (SIAppendix, Fig. S8C). We also note that other residues that packagainst the recognition helix differ between closely related homo-logs and may similarly contribute to homolog-specific preference fordistinct submaximal binding sites (45). Such differences are typicallyoverlooked, because the side chains that do not contact DNA aremostly ignored in defining the sequence specificity (42). A few

A

B

C

D

E

Fig. 5. DiSEL to compare interactomes of highly homologous DNA bindingproteins. (A and B) Lhx2 (Left) and Lhx4 (Right); 8-mer PBM data are used forcomparison. All units in B–D are E scores. (A) PWM motifs determined bySeed-n-Wobble and (B) SELs for Lhx2 and Lhx4 using 5′TAATTA3′ as seedmotif. (C) DiSELs [(Left) Lhx2 over Lhx4; (Right) Lhx4 over Lhx2] with seed 5′TAATTA3′ used to highlight subtle differences in DNA binding preferences inthe form of small blue peaks. (D) Selected differences identified via DiSELsare highlighted in a scatterplot [(Left) Lhx2 preferred sequences; (Right)Lhx4 preferred sequences]. (E) ROC curves plotted to obtain sensitivity–specificity analysis of LHX2 ChIP-seq peaks in hair follicle stem cells. Leftcompares ROCs with ChIP peaks containing the Lhx2 submaximal site 5′TAATG3′ (Upper Left) vs. ChIP data where 5′TAATG3′-containing peaks werecomputationally removed (Lower Left). Right compares ROCs with peakscontaining the Lhx4 submaximal site 5′TAATC3′ (Upper Right) and without it(Lower Right). ChIP peaks containing 5′TAATG3′ are better annotated by theLhx2 rather than the Lhx4 interactome. In contrast, peaks bearing 5′TAATC3′are not preferentially annotated, validating the robustness of our approach inidentifying biologically relevant submaximal sites preferred by each homolog.

Bhimsaria et al. PNAS | vol. 115 | no. 45 | E10591

BIOPH

YSICSAND

COMPU

TATIONALBIOLO

GY

Dow

nloa

ded

by g

uest

on

Oct

ober

5, 2

020

Page 7: Specificity landscapes unmask submaximal binding site ...Specificity landscapes unmask submaximal binding site preferences of transcription factors Devesh Bhimsariaa,b,1, José A.

studies have indicated that non–DNA-contacting positions mightaffect DNA binding (24, 34, 45, 46).We further examined the specificity landscapes of the 30

homeodomains with identical DNA contacting residues R5, V47,Q50, N51, A54, and K55, similar to Lhx2 and Lhx4 (SI Appendix,Fig. S9A). Of the 30 homologs, DPIs of 20 were obtained underthe same experimental conditions (Dataset S5). We examinedspecificity landscapes of all 20 homologs; of these, Lhx3 has thesame V45–R52 pair as Lhx4, whereas Lhx9 has the L45–A52 pairfound in Lhx2. In agreement with our predictions, we find thatLhx3–Lhx4 and Lhx9–Lhx2 show striking overlap in the SEL and

DiSEL profiles (SI Appendix, Fig. S9 B–E). The identification ofsubmaximal sites from Lhx3 and Lhx9 further reveals nearlyidentical homolog-specific preferences as Lhx4 and Lhx2, re-spectively (SI Appendix, Fig. S9). These results strongly validateour approach, where four distinct proteins purified, tested, andanalyzed independently converge on the same specificity profilesand cross-validate each homology pairing.

Molecular Dynamics Simulations Reveal Differences in Helix Orientations.A series of molecular dynamics (MD) simulations tested the pro-posal that mutations that do not directly interface with DNA affectspecificity by altering the orientation of helix 3 with respect to theDNA. Both Lhx2 and Lhx4 sequences were modeled onto homeo-domain template structures (Fig. 6B and SI Appendix, Fig. S10A).The distances between the backbone alpha carbons for both ho-mologs do not exceed 0.3 Å rmsd in the modeled structures. Weconducted equilibrium MD simulations of both structures usingthree different force fields: the AMBER ff14SB force field (47) withgeneralized Born solvation model, the AMBER force field withexplicit solvent, and the CHARMM force field (48, 49) with explicitsolvent (Methods).The simulations reveal important differences in the overall

structures and flexibilities of the two proteins (Fig. 6 and SIAppendix, Fig. S10). Notably, all simulation schemes show per-sistent differences between the proteins in the relative orienta-tions of the three helical axes. Thus, while the residues of helix3 that read DNA are identical between the two proteins, thedifferences in orientation relative to the other helices will alterhow those residues interact with the major groove of DNA, af-fecting the sequence specificity. Fig. 6E shows how structuresfrom the CHARMM simulations might interact with DNA andthat differences in relative orientations of the helices alter theinteractions of helix 3 with the DNA.

Altered Sequence Preferences of Disease-Causing Variants. Thehomolog-specific differences that we observed even among TFsthat read DNA using identical amino acid residues motivated usto explore the impact of TF variants observed in populations. Re-cently, DPIs of 41 TFs and their alleles, including many diseases-causing variants, became publicly available (24). We used DiSEL tocompare these TFs against all of their tested alleles (129 alleles) (Fig.1C and Dataset S7). While variants that showed drastic alterations inDNA binding preferences were readily identified, many allelesexhibited binding profiles that were similar but nonidentical to thereference protein. These nuanced shifts in specificity, as highlighted inthe three examples below, were identified by DiSEL. In the first ex-ample, the DNA interactomes of HOXD13 protein and its alleleQ325R, a variant that is causally linked to syndactyly type V and abrachydactyly-syndactyly syndrome (50), were examined by DiSEL(Fig. 7); 5′GTAAA3′ and 5′GTACA3′, which are two 5-mers thatdisplay greater preference forHOXD13 andQ325R allele, respectively,were identified by DiSEL analysis. ROCs were plotted for ChIP-seqpeaks with and without 5′GTAAA3′ or 5′GTACA3′ sequences (51).The DNA interactome for Q325R allele better annotates ChIP-seqpeaks (AUC for HOXD13 REF (reference) is 0.616 vs. 0.721 forQ325R). Excluding peaks bearing 5′GTAAA3′ has a minor impact,but excluding peaks bearing 5′GTACA3′ dramatically reduced thegains (AUC for HOXD13 REF is 0.656 vs. 0.692 for Q325R) (Fig. 7Dand SI Appendix, Fig. S11). This analysis supports the pathologicalimportance of the 5′GTACA3′ sequence identified by DiSEL analysis.In the case of the R90W allele of CRX (linked to the disease

Leber Congenital Amaurosis 7), DiSEL revealed minimal impacton binding to sites with the 5′GGATTA3′ core motif (the in-nermost zero-mismatch ring of DiSEL is shown in SI Appendix,Fig. S12). In contrast, the R90W allele showed a precipitous lossof binding to submaximal sites with a single mismatch to the 5′GGATTA3′ core. A focus on consensus motifs and high-affinitybinding sites would fail to identify the disease-causing loss of

A

B C

D

E

Fig. 6. Structural models and differential dynamics of Lhx2 and Lhx4. (A)Homeodomain amino acid sequences of engrailed, Lhx2 (blue), and Lhx4 (or-ange) proteins. Residues that make DNA base contacts (▴) or DNA backbonecontacts (▪) or are identified as important for DNA binding by alanine shotgunscanning experiments (▾) are marked. (B and C) Structural models for Lhx2(blue) and Lhx4 (orange) were aligned to engrailed DNA structure (PDB IDcode 1HDD) to highlight amino acid differences. Helices are numbered forreference. (D) Angles between helices during 200-ns MD simulations with theCHARMM force field in explicit solvent. Analogous analyses for other simula-tion models are in SI Appendix, Fig. S10. (E) Representative structures from theCHARMM simulations overlaid on the engrailed structure with DNA using thebackbone atoms of the three helices for alignment.

E10592 | www.pnas.org/cgi/doi/10.1073/pnas.1811431115 Bhimsaria et al.

Dow

nloa

ded

by g

uest

on

Oct

ober

5, 2

020

Page 8: Specificity landscapes unmask submaximal binding site ...Specificity landscapes unmask submaximal binding site preferences of transcription factors Devesh Bhimsariaa,b,1, José A.

binding to submaximal sites by this allele of CRX. The final ex-ample of such subtle but relevant differences between two vari-ants of VSX1 is displayed in SI Appendix, Fig. S13 (VSX1. REFvs. VSX1. Q175H). We highlight this particular pairwise compar-ison, because the two protein variants display a remarkably highcorrelation across the affinity spectrum (r2 = 0.985), and yet,submaximal sites preferred by one allele over the other are readilyidentified by an algorithmic examination of the DiSEL plots. Wemap these differentially preferred sites onto scatterplots that arecommonly used to visualize allele-preferred binding sites. Therobustness of the algorithmic examination is such that even subtlepreferences become readily apparent and can be rank ordered.

DiscussionThe past decade has witnessed rapid expansion in the develop-ment of experimental and computational methods to compre-hensively map DNA and RNA recognition properties of proteinsand small molecules (12, 25, 52). The information that emerges

from various experimental platforms or computational analyses,while remarkably similar at the most robust cognate sites, doesnot always conform at submaximal affinity ranges. It is in-creasingly apparent that submaximal sites have been retainedduring evolution to optimize differential regulation and combi-natorial control over different genes (8–10). We developedDiSEL to identify such submaximal sites and to compare dif-ferent experimental and computational methods. DiSEL permitsan unbiased and unsupervised comparison of the entire DPIobtained by different experimental methods. Moreover, usingmotifs identified by different computational methods as an or-ganizing “seed” permits a comprehensive view of how well agiven method captures the binding profile of the entire inter-actome. Applying DiSEL to five prominent experimental plat-forms shows that array-based methods, such as HiTS-FLIP, CSI,and PBM, provide the most comprehensive view of the speci-ficity and affinity landscape of a given DNA binding protein orsmall molecule. Among sequencing-based approaches, such asHT-SELEX, SELEX-seq, Spec-seq, and Bind-n-Seq, the speci-ficity landscapes immediately make it apparent that early roundsof enrichment capture a wider range of cognate sites and thatincreasing rounds of enrichment yield high-affinity motifs at theexpense of submaximal sites. Circumventing the loss of sub-maximal affinity sites by sequencing the first round, as is done inHT-SELEX (31), is also fraught with the challenge of siftingthrough and identifying bona fide low-affinity cognate sites fromthe much larger pool of noncognate “encounter complexes” thatare also captured under those conditions. This is not surprisinggiven that electrostatic affinity for any DNA fragment of suffi-cient length is typically 10−6 M, and binding to a library bearing1015 different sequence permutations is typically performed atthese concentrations. In this context, DiSEL provides a facileand valuable approach to rapidly identify clusters of submaximalcognate sites from a larger pool of sequences captured in earlyrounds of sequencing-based approaches. Beyond these plat-forms, DiSEL can be applied to compare a range of experi-mental methods that provide data-rich protein–nucleic acidinteractomes, and the approach can identify differences thatemerge from platform-specific biases or from genuine differ-ences in recognition properties of DNA binders.In addition, our approach enables the evaluation of motifs

returned by burgeoning varieties of computational methods.More inclusive or more constrained motifs capture differentfeatures of the entire interactome. Rather than relying on mul-tiple different algorithms and manual “intuition” to identify theappropriate motif, SEL/DiSEL can provide a more systematicpath to comparing and winnowing k-mers to select set of se-quences that capture the full spectrum of binding sites preferredby a given protein or small molecule. While motifs aggregaterelated sequences, obfuscate the role of flanking sequences orsubtle variations within binding sites, and overlook relevant low-affinity sites, using all of the k-mers within an interactome maylead to the inclusion of platform-dependent biases, overfitting ofdata as well as other sources of experimental error. Thus, SEL/DiSEL provides a balance between using consensus motifs andusing the entire set of k-mers of the DPI.In this context, it is important to note that, at the low-affinity

range, distinguishing a bona fide submaximal site from a non-specific site is nearly irresolvable. The challenge of this task isespecially apparent in scatterplots, where arbitrary cutoffs areused to anoint sites as specific or discard them as nonspecificsites. The key advantage offered by our approach is that DiSELclusters related sequences that might individually be indistin-guishable from experimental noise. Clustering reveals those se-quences with near-cognate motifs that recur in the moderate- tolow-affinity range. This greatly enables the identification oflow-affinity binding sites, and SEL/DiSEL plots extend wellbeyond PWM-based motifs in providing a comprehensive but

A

B

C

D

Fig. 7. DiSEL to compare interactomes of HOXD13 protein (Left) and itsallele Q325R (Right); 8-mer E-score PBM data are used for comparison. (A)SELs for HOXD13 REF (reference protein) and Q325R allele using seed motif5′ATAAA3′. (B) DiSELs [(Left) HOXD13 REF over HOXD13 Q325R; (Right)HOXD13Q325R overHOXD13 REF] with seedmotif 5′ATAAA3′. (C) Differencesare identified through automated analysis, and DiSELs are highlighted in thescatterplot. (D) ROC curves mapping DPIs of HOXD13 protein and its alleleQ325R to ChIP-seq peaks ofHOXD13 in the chicken mesenchymal stem cell line(without 5′GTAAA3′ and 5′GTACA3′). ROCs reflect binding preference of 5′GTACA3′ sequence for HOXD13 Q325R allele over HOXD13 REF in vivo.

Bhimsaria et al. PNAS | vol. 115 | no. 45 | E10593

BIOPH

YSICSAND

COMPU

TATIONALBIOLO

GY

Dow

nloa

ded

by g

uest

on

Oct

ober

5, 2

020

Page 9: Specificity landscapes unmask submaximal binding site ...Specificity landscapes unmask submaximal binding site preferences of transcription factors Devesh Bhimsariaa,b,1, José A.

intuitive view of complex binding landscapes of DNA/RNAbinding molecules.In essence, DiSEL analysis reliably identified differential pref-

erences to submaximal sites between different alleles of proteins,several of which are disease-causing point mutations that show noalteration in their affinity for consensus motifs. Additional appli-cation of DiSEL to homologous TFs highlighted the differentialdependence on positions within high- and medium-affinity bindingsites. Importantly, the results highlight the importance of non-obvious allosteric contributors in conferring differential sequencespecificity at submaximal affinity sites. While widely overlooked,such allosteric effects likely contribute to sequence specificitybeyond current models of direct and indirect readout of DNAsequence and shape. Systematic mutational analysis of a set ofnoncontacting amino acids identified additional positions thatwould perturb DNA binding (45). While MD simulations wereused to qualitatively rationalize the importance of protein allosteryto DNA binding affinity, it is worthwhile to pursue free energy sim-ulations in the future to further dissect the energetic contributions tosuch allosteric effects; a deeper mechanistic understanding will inspirestrategies for modulating binding specificities of TFs using, for ex-ample, small molecules.As high-resolution DNA interactomes of thousands of TFs are

being reported, we provide the DiSEL-generating software torapidly evaluate and identify differences as well as commonali-ties in specificity preferences of alleles of same protein andclosely related TFs. We anticipate that such interactome-basedevaluations will provide unprecedented insights into how sub-maximal affinity sites are differential recognized by related orunrelated TFs. This, in turn, will provide a far more effectivemeans to annotate genomes and unmask subtle variations thatplay a quintessential role in selective binding by specific TFs innormal gene regulation as well as in diseased states.While DiSEL analysis provides an important means to un-

derstand complex specificity landscapes, the approach faceschallenges in encapsulating different modes of sequence recog-nition by a given protein: for example, the ability of proteins toform homo- or heterodimers or higher-order oligomers withdifferent spacing configurations and half-site orientations. Thesechallenges guide our future efforts in the development of acomprehensive view of molecular forces that sculpt sequencespecificities of proteins and small molecules (6, 19, 53, 54).

MethodsDiSELs. A DiSEL displays the difference between two DPIs. By keeping thesame arrangement as an SEL, differences between two DPIs are readilyidentified. Before generating a DiSEL, the two DPIs are rescaled by first sub-tracting their respectivemean intensity and thendividingall binding intensities bythe maximum binding intensity. Differences in binding intensities between anytwo DPIs are calculated by subtracting the corresponding intensities.

Data Processing. To enable comparisons using SELs/DiSELs, DPI data for dif-ferent TFs were rescaled to have mean = 0 and maximum binding intensity =1. To get box plots, SDs for DPI data were used as one. Scatterplots for Fig. 2were obtained by quantile normalizing the DPI data obtained from thecompared experimental platform (CSI, MITOMI, HiTS-FLIP, and HT-SELEX)against the DPI data from PBM for the same protein. All SELs, DiSELs, andscatterplots were made using MATLAB.

Automated Peak Finding Algorithm. There are two kinds of peaks picked bythe program. (i) Mismatch peak. These are the peaks that are present inmismatch rings (>0) generally, where median intensity of all sequences withthe same mismatch to the seed sequence is X units higher (or lower) than Y,where X = user-defined mismatch peak cutoff percentage × maximum ab-solute intensity for the whole SEL/DiSEL and Y = 0 for DiSEL and medianintensity of the whole mismatch ring for SEL. (ii) Flanking peak. These arepeaks present among all of the sequences having the same mismatch tothe seed. They are calculated as the sequence with Z-unit higher intensitythan median intensity of sequences having the same mismatch to the seed.

Z = user-defined flanking peak cutoff percentage × maximum absolute in-tensity for the whole SEL/DiSEL.

ROC Curve. ChIP-seq peaks from previously published datasets were used as atrue positive set, whereas two scrambled versions of DNA of each positivepeak were used to make a true negative set. The fractions of regions in thepositive vs. negative sets with scores above a varying DPI intensity cutoff wereplotted to generate ROC curves (true positive rate vs. false positive rate). ROCcurves and heat maps were generated in MATLAB. To plot ROCs with aspecific sequence, peak set is created with ChIP-seq peaks containing thesequence of interest or its reverse complement, and then, ROCs were plottedas described above. Similarly, the set of peaks left behind is used to plot ROCswithout that sequence.

Lhx2 and Lhx4 Homology Models. Homology models for Lhx2 and Lhx4 werebuilt by threading onto the Drosophila melanogaster engrailed proteinbound to DNA [Protein Data Bank (PDB) ID code 3HDD] using Phyre2 (55).Both models were predicted with >99.6% confidence using default settings.Lhx2 and Lhx4 models were superimposed to the structure of engrailedbound to DNA (3HDD) using PyMOL (56).

MD Simulations.Due to the potential sensitivity of protein structural dynamicsto computational model, it is generally important to repeat the simulationwith different force fields and simulation protocols. Therefore, we havecompared the structural dynamics of Lhx2 and Lhx4 using three distinctmodels: CHARMM force field with explicit solvent and AMBER force fieldwith both explicit solvent and implicit solvent (Generalized Born) models;the explicit solvent models are generally more reliable, while the implicitsolvent model can be simulated to substantially longer timescales (1 μs vs.200 ns). The results from the three distinct simulations indeed differ in finedetails; the robust trend, however, is that the orientations of the three helicesare different between Lhx2 and Lhx4. Therefore, by integrating the resultsfrom three different simulations, the conclusion that subtle sequence varia-tions away from the DNA binding site lead to considerable differences in theorientation and flexibility of the helices in Lhx2 and Lhx4 is further supported.

Initial structures were created in Modeler Homology Modeling Package(57) using structures of homologous homeodomains. After sequence align-ment, Lhx2 and Lhx4 structures are built based on the template homeo-domain structures obtained from the PDB (with PDB ID codes 3A02, 3A03,1ENH, and 3K2A) (58). Next, we sought to perform MD simulations longenough to ensure diverse sampling and with a variety of simulationmodels to ensure that any conclusions are not model dependent. Simu-lations used implicit or explicit solvation and either the AMBER ff14SB forcefield (47) or the CHARMM36 force field (48, 49). We used the Amber v14Molecular Dynamics package (59, 60) for simulations with the AMBER forcefield and OpenMM (61) for those with the CHARMM force field. Both pro-grams offer graphics processing unit (GPU)-accelerated simulations, allowingus to reach timescales of hundreds of nanoseconds to microseconds. Details ofthe three simulations schemes are as follows.

The implicit solvent simulations used the AMBER ff14SB force field (47) andgeneralized Born model (62). Production simulations were carried out for aminimum of 1,000 ns using 1-fs time steps. Langevin dynamics were appliedusing a collision frequency of 20 ps−1. The temperature was maintained at300 K. The SHAKE algorithm was applied to bonds involving hydrogens with atolerance of 10−5 Å. The nonbonded cutoff was set to 9,999 Å. The maximumdistance between atom pairs for Born radii calculations was chosen as 12 Å.

The same AMBER force field was also used in simulations with explicitTIP3P (transferable intermolecular potential with 3 points) water (63) in theNPT (constant particle number, constant pressure, and constant tempera-ture) ensemble. The simulations used periodic boundary conditions (PBCs) ina cubic system with the edge of the box at least 10 Å from the protein. Thesystem was neutralized with chloride ions resulting in ca. 16,400 atoms,depending on protein. Simulations were conducted at 300 K with 1-fs timesteps using SHAKE to constrain bonds to hydrogen.

Finally, analogous simulations were conducted using the CHARMM36 forcefield (48, 49) with explicit TIP3P water. These used a similar PBC setup as thosewith AMBER, and production simulations were conducted in both the NVT(constant particle number, constant volume, and constant temperature) en-semble and the NPT ensemble at 300 K with 1-fs time steps using SHAKE toconstrain bonds to hydrogen. Because the two ensembles were nearly in-distinguishable, we show only the NVT results.

Software Description. MATLAB code for SEL/DiSEL software is provided here.Details of implementation are available in the text. The software uses in-formation fromUS patents US 20100159457 A1 andUS 20100160178 A1; thus,

E10594 | www.pnas.org/cgi/doi/10.1073/pnas.1811431115 Bhimsaria et al.

Dow

nloa

ded

by g

uest

on

Oct

ober

5, 2

020

Page 10: Specificity landscapes unmask submaximal binding site ...Specificity landscapes unmask submaximal binding site preferences of transcription factors Devesh Bhimsariaa,b,1, José A.

additional use of this software should be in compliance with these patents.Code can be modified and used as long as change is stated clearly and ref-erenced to this publication.

Code Availability. The computer code and the data used for the paper can bedownloaded from the website https://ansarilab.biochem.wisc.edu/computation.html.

ACKNOWLEDGMENTS. We thank current and former members of the labo-ratory of A.Z.A. for helpful discussions and Laura Vanderploeg for helpwith the artwork. This study was supported by NIH Grant GM120625 and

the W. M. Keck Medical Research Award (to P.R. and A.Z.A.). J.A.R.-M. wassupported by NIH Grant National Human Genome Research Institute Train-ing Grant of the Genome Sciences Training Program T32 HG002760. TheMD simulations work was supported in part by National Science Founda-tion (NSF) Grants CHE-1300209 (to Q.C.) and CHE-1829555 (to Q.C.). Compu-tational resources from the Extreme Science and Engineering DiscoveryEnvironment, which is supported by NSF Grant OCI-1053575, are also greatlyappreciated; computations are also supported in part by NSF Instrumenta-tion Grant CHE-0840494 (to the Department of Chemistry), and theGPU computing facility was supported by Army Research OfficeGrant W911NF-11-1-0327.

1. Dunham I, et al.; ENCODE Project Consortium (2012) An integrated encyclopedia ofDNA elements in the human genome. Nature 489:57–74.

2. Kittler R, et al. (2013) A comprehensive nuclear receptor network for breast cancercells. Cell Rep 3:538–551.

3. Xie D, et al. (2013) Dynamic trans-acting factor colocalization in human cells. Cell 155:713–724.

4. Jolma A, et al. (2013) DNA-binding specificities of human transcription factors. Cell152:327–339.

5. Weirauch MT, et al. (2014) Determination and inference of eukaryotic transcriptionfactor sequence specificity. Cell 158:1431–1443.

6. Rodríguez-Martínez JA, Reinke AW, Bhimsaria D, Keating AE, Ansari AZ (2017)Combinatorial bZIP dimers display complex DNA-binding specificity landscapes. eLife6:e19272.

7. Slattery M, et al. (2011) Cofactor binding evokes latent differences in DNA bindingspecificity between Hox proteins. Cell 147:1270–1282.

8. Ptashne M (2004) A Genetic Switch Phage Lambda Revisited (Cold Spring Harbor LabPress, Cold Spring Harbor, NY).

9. Crocker J, et al. (2015) Low affinity binding site clusters confer hox specificity andregulatory robustness. Cell 160:191–203.

10. Tanay A (2006) Extensive low-affinity transcriptional interactions in the yeast ge-nome. Genome Res 16:962–972.

11. Farley EK, et al. (2015) Suboptimization of developmental enhancers. Science 350:325–328.

12. Stormo GD, Zhao Y (2010) Determining the specificity of protein-DNA interactions.Nat Rev Genet 11:751–760.

13. Isakova A, et al. (2017) SMiLE-seq identifies binding motifs of single and dimerictranscription factors. Nat Methods 14:316–322.

14. Zhao Y, Stormo GD (2011) Quantitative analysis demonstrates most transcriptionfactors require only simple models of specificity. Nat Biotechnol 29:480–483.

15. Morris Q, Bulyk ML, Hughes TR (2011) Jury remains out on simple models of tran-scription factor specificity. Nat Biotechnol 29:483–484.

16. Weirauch MT, et al.; DREAM5 Consortium (2013) Evaluation of methods for modelingtranscription factor sequence specificity. Nat Biotechnol 31:126–134.

17. Ruan S, Swamidass SJ, Stormo GD (2017) BEESEM: Estimation of binding energymodels using HT-SELEX data. Bioinformatics 33:2288–2295.

18. Warren CL, et al. (2006) Defining the sequence-recognition profile of DNA-bindingmolecules. Proc Natl Acad Sci USA 103:867–872.

19. Carlson CD, et al. (2010) Specificity landscapes of DNA binding molecules elucidatebiological function. Proc Natl Acad Sci USA 107:4544–4549.

20. Berger MF, et al. (2006) Compact, universal DNA microarrays to comprehensivelydetermine transcription-factor binding site specificities. Nat Biotechnol 24:1429–1435.

21. Badis G, et al. (2009) Diversity and complexity in DNA recognition by transcriptionfactors. Science 324:1720–1723.

22. Jolma A, et al. (2010) Multiplexed massively parallel SELEX for characterization ofhuman transcription factor binding specificities. Genome Res 20:861–873.

23. Nutiu R, et al. (2011) Direct measurement of DNA affinity landscapes on a high-throughput sequencing instrument. Nat Biotechnol 29:659–664.

24. Barrera LA, et al. (2016) Survey of variation in human transcription factors revealsprevalent DNA binding changes. Science 351:1450–1454.

25. Alipanahi B, Delong A, Weirauch MT, Frey BJ (2015) Predicting the sequence speci-ficities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol 33:831–838.

26. Setty M, Leslie CS (2015) SeqGL identifies context-dependent binding signals ingenome-wide regulatory element maps. PLoS Comput Biol 11:e1004271.

27. Tietjen JR, Donato LJ, Bhimisaria D, Ansari AZ (2011) Sequence-specificity and energylandscapes of DNA-binding molecules. Methods Enzymol 497:3–30.

28. Puckett JW, et al. (2007) Quantitative microarray profiling of DNA-binding molecules.J Am Chem Soc 129:12310–12319.

29. Erwin GS, et al. (2016) Synthetic genome readers target clustered binding sites acrossdiverse chromatin states. Proc Natl Acad Sci USA 113:E7418–E7427.

30. Campbell ZT, et al. (2012) Cooperativity in RNA-protein interactions: Global analysisof RNA binding specificity. Cell Rep 1:570–581.

31. Zhao Y, Granas D, Stormo GD (2009) Inferring binding energies from selected bindingsites. PLoS Comput Biol 5:e1000590.

32. Zykovich A, Korf I, Segal DJ (2009) Bind-n-Seq: High-throughput analysis of in vitroprotein-DNA interactions using massively parallel sequencing. Nucleic Acids Res 37:e151.

33. Fordyce PM, et al. (2010) De novo identification and biophysical characterization oftranscription-factor binding sites with microfluidic affinity analysis. Nat Biotechnol 28:970–975.

34. Noyes MB, et al. (2008) Analysis of homeodomain specificities allows the family-wideprediction of preferred recognition sites. Cell 133:1277–1289.

35. Badis G, et al. (2008) A library of yeast transcription factor motifs reveals a widespreadfunction for Rsc3 in targeting nucleosome exclusion at promoters. Mol Cell 32:878–887.

36. Stormo GD, Zuo Z, Chang YK (2015) Spec-seq: Determining protein-DNA-bindingspecificity by sequencing. Brief Funct Genomics 14:30–38.

37. Bailey TL, Johnson J, Grant CE, Noble WS (2015) The MEME suite. Nucleic Acids Res 43:W39–W49.

38. Sokolova M, et al. (2017) Genome-wide screen of cell-cycle regulators in normal andtumor cells identifies a differential response to nucleosome depletion. Cell Cycle 16:189–199.

39. Gordân R, et al. (2013) Genomic regions flanking E-box binding sites influence DNA bindingspecificity of bHLH transcription factors through DNA shape. Cell Rep 3:1093–1104.

40. Keles S, Warren CL, Carlson CD, Ansari AZ (2008) CSI-Tree: A regression tree approachfor modeling binding properties of DNA-binding molecules based on cognate siteidentification (CSI) data. Nucleic Acids Res 36:3171–3184.

41. Abe N, et al. (2015) Deconvolving the recognition of DNA shape from sequence. Cell161:307–318.

42. Berger MF, et al. (2008) Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence preferences. Cell 133:1266–1276.

43. Jiang B, Liu JS, Bulyk ML (2013) Bayesian hierarchical model of protein-binding mi-croarray k-mer data reduces noise and identifies transcription factor subclasses andpreferred k-mers. Bioinformatics 29:1390–1398.

44. Folgueras AR, et al. (2013) Architectural niche organization by LHX2 is linked to hairfollicle stem cell function. Cell Stem Cell 13:314–327.

45. Sato K, Simon MD, Levin AM, Shokat KM, Weiss GA (2004) Dissecting the engrailedhomeodomain-DNA interaction by phage-displayed shotgun scanning. Chem Biol 11:1017–1023.

46. Nakagawa S, Gisselbrecht SS, Rogers JM, Hartl DL, Bulyk ML (2013) DNA-bindingspecificity changes in the evolution of forkhead transcription factors. Proc NatlAcad Sci USA 110:12349–12354.

47. Maier JA, et al. (2015) ff14SB: Improving the accuracy of protein side chain andbackbone parameters from ff99SB. J Chem Theory Comput 11:3696–3713.

48. MacKerell AD, et al. (1998) All-atom empirical potential for molecular modeling anddynamics studies of proteins. J Phys Chem B 102:3586–3616.

49. Best RB, et al. (2012) Optimization of the additive CHARMM all-atom protein forcefield targeting improved sampling of the backbone φ, ψ and side-chain χ(1) and χ(2)dihedral angles. J Chem Theory Comput 8:3257–3273.

50. Zhao X, et al. (2007) Mutations in HOXD13 underlie syndactyly type V and a novelbrachydactyly-syndactyly syndrome. Am J Hum Genet 80:361–371.

51. Ibrahim DM, et al. (2013) Distinct global shifts in genomic binding profiles of limbmalformation-associated HOXD13 mutations. Genome Res 23:2091–2102.

52. Pelossof R, et al. (2015) Affinity regression predicts the recognition code of nucleicacid-binding proteins. Nat Biotechnol 33:1242–1249.

53. Siggers T, Gordân R (2014) Protein-DNA binding: Complexities and multi-protein co-des. Nucleic Acids Res 42:2099–2111.

54. Jolma A, et al. (2015) DNA-dependent formation of transcription factor pairs alterstheir binding specificity. Nature 527:384–388.

55. Kelley LA, Mezulis S, Yates CM, Wass MN, Sternberg MJE (2015) The Phyre2 webportal for protein modeling, prediction and analysis. Nat Protoc 10:845–858.

56. Schrödinger L (2015) The PyMOL Molecular Graphics System (Schrödinger, New York),Version 2.0.

57. Sali A, Potterton L, Yuan F, van Vlijmen H, Karplus M (1995) Evaluation of compar-ative protein modeling by MODELLER. Proteins 23:318–326.

58. Bernstein FC, et al. (1978) The protein data bank: A computer-based archival file formacromolecular structures. Arch Biochem Biophys 185:584–591.

59. Case DA, et al. (2014) Amber 14 (University of California, San Francisco).60. Salomon-Ferrer R, Case DA, Walker RC (2013) An overview of the Amber biomolecular

simulation package. Wiley Interdiscip Rev Comput Mol Sci 3:198–210.61. Eastman P, et al. (2013) OpenMM 4: A reusable, extensible, hardware independent

library for high performance molecular simulation. J Chem Theory Comput 9:461–469.62. Nguyen H, Roe DR, Simmerling C (2013) Improved generalized born solvent model

parameters for protein simulations. J Chem Theory Comput 9:2020–2034.63. Jorgensen WL, Chandrasekhar J, Madura JD, Impey RW, Klein ML (1983) Comparison

of simple potential functions for simulating liquid water. J Chem Phys 79:926–935.

Bhimsaria et al. PNAS | vol. 115 | no. 45 | E10595

BIOPH

YSICSAND

COMPU

TATIONALBIOLO

GY

Dow

nloa

ded

by g

uest

on

Oct

ober

5, 2

020