

Determining the specificity of protein–DNA interactions

Gary D. Stormo and Yue Zhao

Abstract | Proteins, such as many transcription factors, that bind to specific DNA sequences are essential for the proper regulation of gene expression. Identifying the specific sequences that each factor binds can help to elucidate regulatory networks within cells and how genetic variation can cause disruption of normal gene expression, which is often associated with disease. Traditional methods for determining the specificity of DNA-binding proteins are slow and laborious, but several new high-throughput methods can provide comprehensive binding information much more rapidly. Combined with in vivo determinations of transcription factor binding locations, this information provides more detailed views of the regulatory circuitry of cells and the effects of variation on gene expression.

Department of Genetics, Washington University School of Medicine, 4444 Forest Park Blvd, St. Louis, Missouri 63110‑8510, USA. Correspondence to G.D.S. e‑mail: [email protected] doi:10.1038/nrg2845 Published online 28 September 2010

REVIEWS

NATURE REVIEWS | GENETICS VOLUME 11 | NOVEMBER 2010 | 751

© 2010 Macmillan Publishers Limited. All rights reserved

In vivo. In this context we use the term in vivo to refer to any experiments performed on living cells, whether within or outside a whole organism (sometimes referred to as ex vivo).

Chromatin immunoprecipitation (ChIP). A technique that is used to identify the location of DNA-binding proteins and epigenetic marks in the genome. Genomic sequences containing the mark of interest are enriched by binding soluble DNA chromatin extracts (complexes of DNA and protein) to an antibody that recognizes the mark. ChIP can be followed by analysis of the precipitated DNA through hybridization to a microarray (ChIP–chip) or by sequencing of the precipitated DNA (ChIP–seq).

Motif. In this Review motif refers to a representation of the specificity of a transcription factor. For instance, a position weight matrix is a motif for a transcription factor, but there could be other ways to represent the specificity.

Microfluidic device. A device in which fluids are conveyed to samples in channels with diameters in the order of 1 μm; these chambers can be used to precisely and dynamically control the microenvironment to which cells are exposed.

Kd. The dissociation constant between two molecules (here, for a transcription factor and a DNA sequence). It is the ratio of the off-to-on rate for the formation and dissolution of the complex.

Consensus sequence. A single sequence (possibly degenerate) that represents the specificity of a transcription factor. It is usually the highest affinity sequence and would be the most frequent base at each position in a collection of binding sites. It is also possible to use degeneracies, for example, ‘R = A or G’, if two (or more) bases are equivalent.

Many proteins interact with DNA to accomplish a wide variety of cellular processes. These include the crucial tasks of DNA replication, repair and recombination in all cells. In eukaryotes there are, in addition, histones and other proteins that determine the structure of chromatin within the nucleus. The expression of genes requires transcription by RNA polymerase, and usually there are a variety of associated proteins, referred to generally as transcription factors, that facilitate and regulate the transcription process. One class of transcription factors interacts directly with the DNA in a sequence-specific manner, binding only to locations within the genome that contain the appropriate DNA sequences. Upon binding, these transcription factors may regulate the expression of associated (usually nearby) genes, either activating or repressing their RNA synthesis.

Recently it has also become clear that many of the binding sites within the genome do not affect the expression of nearby genes and it is unknown whether they perform some other function or are just non-functional binding events¹⁻³. The analysis of the genomic locations of transcription factor binding sites, and how they differ between distinct cell types, provides a wealth of information about the dynamics of the regulatory network of cells, but that is not the topic of this Review. Rather, the focus of this Review is the intrinsic specificity of sequence-specific transcription factors (TFs): their ability to distinguish different DNA sequences in the absence of any other cooperative or competitive interactions. In many species, from bacteria to humans, TFs represent 5–10% of the genes encoded in the genome⁴,⁵ and are a major determinant of the regulated and controlled expression of all genes.

In the last few years technological advances have greatly increased our knowledge of the locations of the TF binding sites within genomes and the sequence-specific binding preferences for many TFs. These advances include both in vivo and in vitro experimental methods and the development of new computational analyses. The in vivo approaches, such as chromatin immunoprecipitation followed by microarray (ChIP–chip) or by sequencing (ChIP–seq), determine the locations within the genome that the transcription factors bind and provide candidate genes that they are likely to regulate⁶⁻⁸. The locations can be determined for different cell types, at different stages of development and in different environmental conditions to help uncover the regulatory changes that are associated with changes in cell physiology¹. This information alone can provide a framework for the regulatory network, indicating which factors are likely to regulate which genes. However, the resolution of the binding locations is not sufficient to identify the binding sites precisely, only to indicate a region of 100 or more base pairs in which it resides. Applying motif discovery algorithms to the sequences of those binding regions can usually identify the main features of the binding patterns and the highest affinity binding sequences for the factors⁹,¹⁰, but the accuracy of those specificity models is often limited owing to confounding factors that occur in vivo. Knowing the intrinsic specificity of the factors, together with the binding locations, provides additional valuable information. For example, if strong binding sites are not bound it indicates that those regions of the genome are inaccessible to the factor, probably owing to the local chromatin organization. Conversely, if the factor binds to regions of the genome that lack strong binding sites, this indicates that the factor is binding indirectly (that is, through interaction with another factor) or is binding to a weak binding site that requires cooperativity with other factors¹¹. In both cases, that information leads to the search for additional interacting factors that are involved in the network.

For many years scientists have been measuring the binding affinity of TFs to specific DNA sequences, but these experiments have generally been undertaken one at a time, each determining the affinity of the TF to a single DNA sequence (BOX 1). However, in vivo the affinity of the TF is not as crucial as its specificity. Inside a bacterial cell or a eukaryotic nucleus, the concentration of DNA is so high (typically millimolar for potential binding sites) that TFs will essentially always be bound to DNA, even if there are no high-affinity sites. Binding to their regulatory sites (those that are crucial for the proper regulation of gene expression) requires that the TFs can distinguish those sites from the vast majority of non-regulatory DNA sequences. The information required for understanding and modelling the regulatory network is not the affinity to the preferred binding sites but the differences in binding affinity for all of the potential binding sites, which is what we mean by the specificity of the TF (BOX 1). This requires knowing the affinity to the entire set of potential binding sites, or at least enough of them to construct models that allow the complete specificity to be estimated (BOX 2).

Several recent technological advances have facilitated high-throughput determination of the intrinsic specificity of transcription factors. This information is useful not only for more detailed analysis of the regulatory networks within genomes and the effects of genetic variations on those networks but also for aiding the design of new proteins with novel specificity. This Review focuses on the recent technological advances that facilitate the rapid and accurate determination of the specificity of TFs, including several new high-throughput experimental methods and how computational modelling can be used to interpret those data.

Direct affinity measurements

Microfluidics. One recent method that uses a microfluidic device is the mechanically induced trapping of molecular interactions (MITOMI). It provides a means for determining binding specificities accurately through direct measurements in a relatively high-throughput process, obtaining the binding affinities of a TF to each of a few hundred different DNA sites per device¹² (FIG. 1A). Synthetic genes for the TFs are flowed into each chamber along with material to support the synthesis of the proteins within the chamber, thus avoiding the problem of purifying the TF (a step that is required in many in vitro assays). Each chamber of the device has antibodies to attach the TF to its surface and is seeded with a specific DNA binding site at a specific concentration and with a fluorescent tag. Overall, the device contains hundreds of different DNA binding site sequences, each at multiple different concentrations. The total DNA binding site concentration in each chamber is determined by its fluorescent signal, and any DNA that is not bound to the TF that is attached to the surface is then flushed out. The amount of protein is determined by its fluorescent signal, and the amount of DNA bound to the TF is determined by the remaining DNA signal. By determining the fraction of DNA that is bound at several different DNA concentrations, the relative affinities of each different sequence can be determined. This does not, by itself, determine the absolute dissociation constant (Kd) between the TF and the binding site (BOX 1), but comparison to a reference with known Kd allows determination of the absolute Kd for each binding site sequence.
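The fitting step described above can be sketched in a few lines. This is a minimal illustration, not the analysis used in REF. 12: it assumes a single-site isotherm, bound = bmax · D / (D + Kd), where D is the free DNA concentration, and it scans Kd on a logarithmic grid. The function name `fit_binding_isotherm`, the grid limits and the synthetic data are all hypothetical.

```python
def fit_binding_isotherm(dna_conc, bound_signal, kd_min=1e-3, kd_max=1e3, steps=4000):
    """Least-squares fit of bound = bmax * D / (D + Kd).

    Kd is scanned on a log-spaced grid (an assumption made for simplicity);
    for each trial Kd the best amplitude bmax has a closed form.
    Returns (kd, bmax) in the same concentration units as dna_conc.
    """
    best = None
    for s in range(steps + 1):
        kd = kd_min * (kd_max / kd_min) ** (s / steps)
        occupancy = [d / (d + kd) for d in dna_conc]
        # Linear least squares for the amplitude at this fixed Kd.
        bmax = (sum(x * y for x, y in zip(occupancy, bound_signal))
                / sum(x * x for x in occupancy))
        sse = sum((y - bmax * x) ** 2 for x, y in zip(occupancy, bound_signal))
        if best is None or sse < best[0]:
            best = (sse, kd, bmax)
    return best[1], best[2]


# Synthetic, noise-free example (made-up numbers, arbitrary concentration units).
true_kd, true_bmax = 50.0, 1.0
concentrations = [5, 10, 25, 50, 100, 250, 500]
signals = [true_bmax * d / (d + true_kd) for d in concentrations]
kd_fit, bmax_fit = fit_binding_isotherm(concentrations, signals)
```

Because the fitted value is only as good as the concentration calibration, it is best read as a relative affinity; the comparison to a reference site of known Kd mentioned in the text would be a separate step.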

Using 17 such devices, Maerkl and Quake¹² measured the binding affinities to 464 different DNA sites for four different proteins, all of which were members of the eukaryotic basic helix–loop–helix (bHLH) TF family. Members of that family generally bind to sites with CANNTG consensus sequences (in which N represents any of the four bases). They took advantage of the fact that these bHLH TFs bind as homodimers and have symmetric specificities. By keeping one half-site fixed, their set of 464 binding sites contained all possible four-base-long sequences (256) for the other half-site plus many additional variants of the adjacent bases. They showed that the specificities for two yeast proteins, centromere-binding protein 1 (Cbf1) and phosphate system positive regulatory protein 4 (Pho4), are similar but have differences, especially in their preferences in the flanking sequences, that could help to distinguish their in vivo binding sites and better define their independent regulatory roles.
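As a small illustration of what a degenerate consensus such as CANNTG means in practice, the sketch below expands a consensus into a regular expression and reports every match in a DNA string. Only a few degeneracy codes are included (N for any base, R for A or G, Y for C or T); the function names and the test sequence are hypothetical.

```python
import re

# Partial table of degenerate base codes; extend as needed.
DEGENERATE = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "N": "[ACGT]",
}


def consensus_to_pattern(consensus):
    """Translate a (possibly degenerate) consensus into a regex string."""
    return "".join(DEGENERATE[base] for base in consensus)


def find_sites(consensus, dna):
    """Return the 0-based start of every match, including overlapping ones."""
    pattern = consensus_to_pattern(consensus)
    # A lookahead consumes no characters, so overlapping matches are all found.
    return [m.start() for m in re.finditer("(?=" + pattern + ")", dna)]


positions = find_sites("CANNTG", "TTCACGTGAACAGCTGCC")
```

A consensus match is all-or-nothing, which is why the quantitative models of BOX 2 are needed to rank the 256 half-site variants measured in such an experiment.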

Surface plasmon resonance. Surface plasmon resonance (SPR) is often used to study protein interactions with ligands (including other proteins) but can also be used to measure protein–DNA interactions. As an approach for determining the affinity of a TF for a DNA sequence, SPR can be made moderately high-throughput by using arrays of DNA sequences¹³⁻¹⁵. SPR is based on the fact that the angle of light reflection from a thin surface of gold, for example, depends on the mass of molecules attached to the other side of the surface. DNA can be attached to the surface, the TF can be added to the solution and the change in reflection can be measured over time as the TF–DNA complex accumulates, until equilibrium is attained. Then the TF can be allowed to dissociate from the DNA by washing the array and the change in reflection can be monitored until equilibrium is reached again. This ability to measure binding and unbinding over time is one of the main advantages of SPR; the rate constants kon and koff (BOX 1) can be determined and from them the affinity, Kd, can be determined. Multiple DNAs can be arrayed on the surface so that the affinities of many different DNAs to the same protein can be determined in parallel¹³⁻¹⁵. So far, only medium-sized arrays have been used but it seems feasible to increase the number of simultaneous measurements substantially.
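The relationship between the two rate constants and the affinity can be made concrete with a short sketch. It assumes simple 1:1 binding with the TF in excess over the surface-attached DNA (so the free TF concentration stays constant); the function names and the rate values are illustrative only and are not taken from REFS 13–15.

```python
import math


def kd_from_rates(k_on, k_off):
    """Kd = koff / kon (k_on in M^-1 s^-1, k_off in s^-1, Kd in M)."""
    return k_off / k_on


def association_phase(t, tf_conc, k_on, k_off):
    """Fraction of surface DNA occupied t seconds after the TF is added."""
    k_obs = k_on * tf_conc + k_off          # observed approach-to-equilibrium rate
    f_eq = k_on * tf_conc / k_obs           # equilibrium occupancy, [TF]/([TF]+Kd)
    return f_eq * (1.0 - math.exp(-k_obs * t))


def dissociation_phase(t, f_start, k_off):
    """Occupancy t seconds after washing, starting from occupancy f_start."""
    return f_start * math.exp(-k_off * t)


k_on, k_off = 1.0e6, 1.0e-2                 # hypothetical rate constants
kd = kd_from_rates(k_on, k_off)             # about 1e-8 M, that is 10 nM
plateau = association_phase(1.0e4, kd, k_on, k_off)
```

With the TF concentration set equal to Kd the association curve plateaus at half occupancy, which matches the definition of Kd given in BOX 1.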


Gibbs standard free energy. The difference in free energy between the equilibrium state and the standard state (all reactants and products at 1 M concentration).

Gas constant (represented by R). The thermodynamic constant relating energy per mole to temperature.

Box 1 | Binding affinity and specificity

[Box 1 figure. Part a: schematic of a TF binding a single DNA sequence. Part b: binding probability (0.0–1.0) plotted against [TF]. Part c: schematic of a TF with many competing DNA sequences. Part d: three-dimensional plot of binding probability (0.0–1.0) against [TF] and relative binding energy.]

Affinity. The binding of protein to DNA is a bimolecular process governed by two rates: an on-rate for the formation of the complex and an off-rate for its dissociation. This is shown in the figure (see the figure, part a; the transcription factor (TF) is blue and DNA sequence (S) is red) and by the equation:

$$\mathrm{TF} + S \;\underset{k_{\mathrm{off}}}{\overset{k_{\mathrm{on}}}{\rightleftharpoons}}\; \mathrm{TF}\cdot S$$

The affinity of the TF for the sequence S is usually defined as the dissociation constant Kd (or its reciprocal, the association constant):

$$K_d = \frac{[\mathrm{TF}][S]}{[\mathrm{TF}\cdot S]} = e^{\Delta G^{\circ}/RT} = \frac{k_{\mathrm{off}}}{k_{\mathrm{on}}}$$

The square brackets represent the concentrations of those entities at equilibrium, ΔG° is the Gibbs standard free energy, R is the gas constant and T is the temperature in degrees Kelvin. The Kd is the concentration of free TF for which the DNA is half bound. Such experiments are usually performed at excess TF so that the free TF is approximately the same as the total TF. The probability of the sequence S being bound is:

$$P(S\ \text{bound}) = \frac{[\mathrm{TF}\cdot S]}{[\mathrm{TF}\cdot S] + [S]} = \frac{[\mathrm{TF}]}{[\mathrm{TF}] + K_d}$$

The binding probability to the consensus sequence of the A isoform of the human TF MAX for different concentrations of MAX isoform A is also shown (see the figure, part b).
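A short numerical sketch may help connect these quantities. It assumes the standard relation ΔG° = RT ln Kd for complex formation, with Kd expressed in molar units relative to the 1 M standard state, and a temperature of 298.15 K; the function names and the example Kd are hypothetical.

```python
import math

R_KJ = 8.314462618e-3   # gas constant in kJ mol^-1 K^-1


def binding_probability(tf_free, kd):
    """P(bound) = [TF] / ([TF] + Kd); both arguments in the same units."""
    return tf_free / (tf_free + kd)


def delta_g_standard(kd_molar, temperature_k=298.15):
    """Standard free energy of binding in kJ/mol, assuming dG = RT ln(Kd)."""
    return R_KJ * temperature_k * math.log(kd_molar)


kd = 1.0e-9                                   # an illustrative 1 nM site
half = binding_probability(kd, kd)            # free TF equal to Kd
dg = delta_g_standard(kd)
curve = [binding_probability(kd * x, kd) for x in (0.1, 1.0, 10.0)]
```

The curve rises from about 9% to about 91% occupancy over a 100-fold range of [TF] centred on Kd, which is the sigmoidal shape (on a logarithmic concentration axis) of a single-site titration.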

Specificity. The term ‘specificity’ refers to how well a protein can distinguish between different sequences. In vivo the TF has the entire genome of potential binding sites, and in order for the regulatory network to function properly the TF must be able to distinguish its functional binding sites from the vast majority of competing non-functional potential sites (see the figure, part c; each colour represents a different DNA sequence). The complete specificity of a TF can be defined by the list of Kds to all possible binding sites, but it is also convenient to have a single number to quantify the specificity. A useful definition is:

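$$\mathrm{Specificity} \equiv \sum_i \frac{K_A(S_i)}{\sum_j K_A(S_j)} \,\log \frac{K_A(S_i)}{\langle K_A \rangle}$$

where K_A(S_i) = 1/Kd(S_i) is the association constant for sequence S_i and ⟨K_A⟩ is its average over all sequences.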

This is equal to the ‘information content’ of the binding sites⁴⁹ under certain simplifying conditions, such as assuming additivity of the positions (BOX 2) and that the binding sites are selected in proportion to their binding affinities⁵⁰. It is also useful to view the specificity graphically with a three-dimensional plot (see the figure, part d). This is similar to the two-dimensional plot in part b, but is for every different sequence. In the two-dimensional plot, a single sequence is used and the concentration of TF ([TF]) is varied to obtain the Kd for that sequence; this is equivalent to one of the lines parallel to the [TF] axis in the three-dimensional plot for one particular sequence. However, in many of the high-throughput methods described in this Review, the [TF] is constant and the probabilities are measured over all sequences. This corresponds to one of the lines parallel to the binding energy axis in the three-dimensional plot. By employing a model for binding, such as a position weight matrix (BOX 2), it is possible to estimate all of the parameters of the model, including [TF] and the nonspecific binding energy, from the data³⁷. Data in parts b and d are from REF. 12.


[Box 2 figure. Sequence logo: information content (0–2) plotted against position (1–6).]

Example binding sites and their total scores:

Sequence  Score
GCGTGG    0
GCGGGG    0.8
ACGTGG    1.4
GTGTGG    1.7
GCGTAG    2.1
ACGGGG    2.2
GAGTGG    2.4
GTGGGG    2.5
GCGGAG    2.9
…         …

Position weight matrix:

Position  1    2    3    4    5    6
A         1.4  2.4  5.8  4.8  2.1  6.7
C         2.4  0    6.5  4.0  4.3  6.9
G         0    3.7  0    0.8  0    0
T         3.6  1.7  7.1  0    4.8  6.8

Microarray. A high-density array of DNA molecules, typically with each element of the array containing a different DNA sequence. Single-stranded DNA arrays are used to hybridize to labelled DNA (or RNA) sequences to determine the relative abundances of different sequences. Double-stranded DNA arrays are used in protein-binding microarray and cognate site identifier methods to determine the binding preferences of transcription factors or other DNA-binding molecules.

Microarray-based methods

Protein-binding microarrays. Protein-binding microarrays (PBMs) are a technology developed in the last 10 years that has greatly increased the throughput for assessing the binding specificities of TFs¹⁶⁻¹⁸ (FIG. 1B). As with microarrays for gene expression, this technology has made possible large-scale, high-throughput analyses to collect information that previously was much more laborious to acquire (typically on a gene-by-gene basis)¹⁹,²⁰. The current version of PBM uses arrays of over 44,000 spots that contain all possible ten-base-long DNA binding sites once on each array, which means that every eight-base-long sequence occurs 32 times, taking both orientations into account. A TF, either purified from cells or synthesized in vitro, is added to the array, which is then washed to remove nonspecific binding. The amount of protein binding to any specific DNA spot is determined with a fluorescent antibody to the protein.
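The count of 32 follows from simple combinatorics: if every ten-base-long sequence occurs once, a given eight-base-long sequence is embedded in 4² = 16 of them at any fixed offset, and the same number again for its reverse complement. The sketch below checks this on a scaled-down case (all 5-mers once, counting 3-mers), which keeps the same two-base difference. It assumes the compact layout is a cyclic de Bruijn sequence, which is an assumption about the array design rather than something stated above; the function names are hypothetical.

```python
def de_bruijn(alphabet, n):
    """Cyclic de Bruijn sequence containing every length-n word exactly once.

    Standard construction by concatenating Lyndon words.
    """
    k = len(alphabet)
    a = [0] * (k * n)
    out = []

    def db(t, p):
        if t > n:
            if n % p == 0:
                out.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in out)


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def cyclic_count(seq, word):
    """Occurrences of word in seq, treating seq as circular."""
    extended = seq + seq[:len(word) - 1]
    return sum(extended[i:i + len(word)] == word for i in range(len(seq)))


sequence = de_bruijn("ACGT", 5)               # 4**5 = 1,024 bases
word = "ACG"
both_orientations = (cyclic_count(sequence, word)
                     + cyclic_count(sequence, reverse_complement(word)))
```

On a real array the long sequence is cut into overlapping probes, so the bookkeeping differs in detail, but the counting argument is the same.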

The specificity of the TF can be represented as position weight matrices (PWMs; BOX 2), but it is also common to report enrichment scores for all eight-base-long sequences. These scores are based on the rank median intensities of array sequences that contain a particular

Box 2 | Modelling specificity

The specificity of a DNA binding protein is determined by its relative affinity to all possible binding sites, or at least to a large number of high affinity sites, such that one can estimate the probability of binding to any particular site given the competition from all of the other sites (BOX 1). Using the high-throughput methods described in this Review it is possible to simply determine the binding affinity for all possible binding sites up to the length allowed by the method. However, even when that information is available, determining a model for binding can have advantages, as described in the main text.

The simplest model to represent the specificity of a DNA-binding protein, other than just using the consensus sequence for its highest affinity sites, is a position weight matrix (PWM). In the general form of this model there is a score assigned to each possible base at each position in the binding site (see the figure). The sum of the elements that correspond to a specific sequence gives a total score for that sequence. This allows the model to provide a score to all possible binding sites for the protein. The ‘logo’ provides a convenient graphical representation of the PWM⁵¹.
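The additive scoring rule can be written out directly. The sketch below uses the six-position matrix shown in the Box 2 figure, read with rows in the order A, C, G, T; the function names are hypothetical, and the scores are treated simply as numbers to be summed.

```python
# Matrix values as read from the Box 2 figure (rows A, C, G, T; columns 1-6).
PWM = {
    "A": [1.4, 2.4, 5.8, 4.8, 2.1, 6.7],
    "C": [2.4, 0.0, 6.5, 4.0, 4.3, 6.9],
    "G": [0.0, 3.7, 0.0, 0.8, 0.0, 0.0],
    "T": [3.6, 1.7, 7.1, 0.0, 4.8, 6.8],
}
SITE_LENGTH = 6


def site_score(site):
    """Sum the matrix element for the base found at each position."""
    if len(site) != SITE_LENGTH:
        raise ValueError("site must be %d bases long" % SITE_LENGTH)
    return sum(PWM[base][pos] for pos, base in enumerate(site))


def reference_site():
    """The sequence built from the lowest-scoring base at every position."""
    return "".join(min("ACGT", key=lambda b: PWM[b][pos])
                   for pos in range(SITE_LENGTH))


scores = {s: round(site_score(s), 1)
          for s in ("GCGTGG", "GCGGGG", "ACGTGG", "GCGGAG")}
```

Because the model is additive, the score of a double variant such as GCGGAG (2.9) is the sum of the two single-variant scores, GCGGGG (0.8) and GCGTAG (2.1).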

The PWM model assumes that each position independently contributes to binding, but more complex models are easily generated that do not assume additivity of the individual bases. For example, it is possible to make a matrix model that includes scores for adjacent dinucleotides (this requires a PWM-like model with 16 rows for all of the possible dinucleotides at each position) or other higher-order contributions to the binding⁴⁴. Simple fitting measures can determine which model is best, given the number of parameters⁴⁵,⁵²,⁵³. One can also employ models composed of two or more PWMs if the TF has multiple modes for binding to DNA²⁴. Different methods have been developed to assign the elements of the PWM. The most useful are those methods that treat the elements of the PWM as binding energy contributions (which makes the PWM an energy matrix), because this enables the probability of binding to be calculated for any protein concentration³⁷,⁵⁴ (BOX 1):

$$P(S_i\ \text{bound}) = \frac{1}{1 + e^{E_i - \mu}}$$

E_i is the difference in binding energy between the sequence S_i and the reference (usually the preferred) sequence (which by convention is defined as having energy = 0, see the PWM above). The logarithm of the ratio of the free concentration of the TF to the Kd of the reference sequence is:

$$\mu = \ln \frac{[\mathrm{TF}]}{K_d(S_{\mathrm{ref}})}$$

In general, even if the total concentration of the TF is known, [TF] is not known but μ can be estimated from the data along with the parameters of the PWM³⁷.
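To see how an energy matrix and μ combine, the following sketch evaluates the logistic form P = 1 / (1 + exp(E − μ)), taking energies in units of RT and μ as the natural logarithm of free [TF] divided by the Kd of the reference sequence. Those units, the function names and the example energies are assumptions made for illustration.

```python
import math


def mu_from_concentration(tf_free, kd_reference):
    """mu = ln([TF] / Kd of the reference sequence)."""
    return math.log(tf_free / kd_reference)


def binding_probability(energy, mu):
    """Probability that a site with relative energy `energy` (in RT) is bound."""
    return 1.0 / (1.0 + math.exp(energy - mu))


# Free TF equal to the reference Kd gives mu = 0.
mu = mu_from_concentration(1.0e-8, 1.0e-8)
p_reference = binding_probability(0.0, mu)    # reference site, E = 0
p_weaker = binding_probability(2.0, mu)       # a site 2 RT weaker

# Raising [TF] tenfold shifts mu by ln(10) and raises occupancy everywhere.
mu_high = mu_from_concentration(1.0e-7, 1.0e-8)
p_weaker_high = binding_probability(2.0, mu_high)
```

Setting E = 0 recovers the single-site expression [TF] / ([TF] + Kd) from BOX 1, so the two boxes describe the same curve in different variables.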


[Figure 1 panel labels. A: PDMS membrane, microfluidic channel, substrate; equilibrium binding, mechanical trapping, wash off unbound molecules; penta-histidine antibody, Cy5-labelled target DNA, transcription factor dimer. C: random pool of oligonucleotides, selection using target protein (immobilized protein; bound and unbound fractions), amplification of bound sequences, sequencing. D (a and b): transcription factor, RNA polymerase (subunits α, α, β, β′, σ and ω), randomized region, weak promoter, HIS3.]

Figure 1 | Different methods for studying DNA–protein interactions. A | An overview of the mechanically induced trapping of molecular interactions (MITOMI) device for binding affinity measurements. The transcription factor (TF) is bound to the surface by antibodies and some fraction of the DNA binds to the TF. The unbound DNA is expelled by washing, so the bound fraction can be measured. B | An overview of universal protein-binding microarray (PBM) design and use. All possible ten-base-long sequences are included on the array (left panel). Primer-directed DNA synthesis creates dsDNA on the array (middle panel). Proteins (shown by purple crescents) are bound, nonspecific binding is minimized by washing of the array and the remaining protein is quantified with a fluorescent antibody (right panel). Fluorescent groups are indicated by stars. C | An overview of the high-throughput (HT)-SELEX procedure. DNA molecules from the random library are exposed to the TF; some sequences are bound, whereas others flow through. The bound fraction is sequenced to determine the probability of being bound. Rounds of amplification of bound sequences may be used (see the main text). D | An overview of the bacterial one-hybrid (B1H) system. The TF is fused to the ω subunit of RNA polymerase and a sequence from a randomized library is inserted upstream of the promoter of the HIS3 gene (which encodes a component of the histidine biosynthesis pathway) (Da). When the TF binds to the randomized sequence, the HIS3 promoter becomes more active (Db) and this increases the growth rate of the cells. PDMS, polydimethylsiloxane. Part A is modified, with permission, from REF. 12 © (2007) Highwire Press. Part B is modified, with permission, from REF. 16 © (2006) Macmillan Publishers Ltd. All rights reserved. Part D is modified, with permission, from REF. 43 © (2008) Oxford University Press.


Gel filtration. A permeable gel, such as polyacrylamide or agarose, is used to separate molecules and molecular complexes based on their size. DNA will migrate faster through a gel than the same DNA that is bound to a protein, so the protein–DNA complex can be separated from the free DNA.

Barcoding. The process of adding the same unique DNA sequence to the ends of DNA molecules from the same experiment, so that the resulting DNA reads can be traced back to the experiment that generated them, allowing several experiments to be sequenced simultaneously (multiplexed).

8-mer compared to a background average. An online database, UniPROBE²¹, contains information from all of the published data sets from Martha Bulyk’s group, including the 8-mer enrichment scores and PWMs for each TF that has been tested. The optimal method for modelling the specificity of TFs based on PBM data remains an open question, with different methods sometimes leading to different conclusions (see the section on computational modelling). TF specificities determined by PBM data have been compared to in vivo binding location experiments in yeast to assess whether certain TFs appear to bind directly or indirectly through interactions with other factors¹¹,²². PBMs have recently been used to determine the variations in specificity among most of the 42 bHLH proteins of Caenorhabditis elegans (including those that form heterodimers), which has helped to distinguish their functional roles²³. The binding specificities for many mouse TFs have also recently been analysed²⁴.

Cognate site identifier. A similar method, called the cognate site identifier (CSI), also uses arrayed DNA sequences to measure relative binding by TFs; current arrays include all possible ten-base-long sequences²⁵,²⁶. One difference between PBMs and CSIs is that in CSIs, single-stranded DNAs are synthesized that fold back to form dsDNA binding sites, thereby eliminating the need for primer-directed DNA synthesis to generate dsDNA, which is required for PBMs (FIG. 1B). The CSI developers also have an alternative method of visualizing the specificity of the TFs, besides determining PWMs, that is called ‘specificity landscapes’ and that may provide useful additional information in some cases²⁷. CSI fluorescence intercalation displacement (CSI-FID) is a modification in which the TF and a DNA-binding fluorescent dye compete for binding to the DNA; this eliminates the need for tagged proteins or antibodies to the TF and may simplify the procedure²⁸.

In vitro selection

Using purified proteins to select high-affinity binding sites from random libraries in vitro is a very powerful technique. Although the approach was invented independently multiple times, SELEX seems to be the most commonly used name29–32 (FIG. 1C). The general strategy is to create a library of potential binding sites, which may be randomly synthesized DNA or may be created from genomic sequences. Both ends of the library sequences can carry primer binding sites so that they can be amplified by PCR. Purified TF is added to the library of DNA sites, and the bound and unbound sequences are separated by various means, such as gel filtration, filter binding or binding to immobilized protein. Although higher affinity sites have a higher probability of being bound by the TF, after a single selection most of the bound sequences are still low affinity because low-affinity sequences greatly outnumber high-affinity sequences in the library. To increase the fraction of high-affinity sites, the bound fraction can be amplified and rebound, and those steps can be repeated as many times as needed. Typically, after several rounds, the selected sites would be cloned and sequenced, often obtaining fewer than 100 independent sites33. Alternatively, if the library is created from genomic DNA, the bound fragments can be detected by hybridization to genomic microarrays34. These methods are capable of determining important aspects of the binding specificity, including the consensus sequence and the relative variability in affinity for different bases at different positions within the binding sites, but if multiple rounds of selection are used it is not straightforward to determine binding energy distributions directly. As sequencing methods are capable of much longer read lengths than the size of a binding site, ligating several sites together and sequencing them all at once, a technique known as SELEX-serial analysis of gene expression (SELEX-SAGE)35, has made the method much more efficient. SELEX-SAGE is capable of obtaining thousands of sequenced binding sites and requires fewer rounds of selection, which makes quantitative modelling more accurate35,36.
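The enrichment argument can be made concrete with a toy calculation. The sketch below assumes a hypothetical library with only two classes of sites, each bound with a fixed probability in every round; the function name `enrich` and all numbers are illustrative assumptions, not measured values.

```python
def enrich(fractions, bind_prob, rounds):
    """Return the library composition before selection and after each round.

    fractions: dict mapping site class -> fraction of the library
    bind_prob: dict mapping site class -> probability of being bound
               (assumed constant across rounds, which ignores changes
               in the free TF concentration)
    """
    current = dict(fractions)
    history = [dict(current)]
    for _ in range(rounds):
        bound = {k: current[k] * bind_prob[k] for k in current}
        total = sum(bound.values())
        # Amplification restores the pool size but keeps the proportions
        current = {k: v / total for k, v in bound.items()}
        history.append(dict(current))
    return history

# Hypothetical numbers: 1 in 10,000 sites is high affinity and is bound
# 100-fold more often than a low-affinity site.
library = {"high": 1e-4, "low": 1 - 1e-4}
p_bound = {"high": 0.5, "low": 0.005}

for r, comp in enumerate(enrich(library, p_bound, rounds=4)):
    print(f"round {r}: high-affinity fraction = {comp['high']:.4f}")
```

Under these assumptions the high-affinity class is still a small minority of the bound pool after one round and dominates only after several rounds, which is the behaviour described above.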

By utilizing new sequencing technologies it is now possible to derive binding energy profiles from SELEX methods quite efficiently using a method called high-throughput SELEX (HT-SELEX)37 or Bind-n-Seq38. An advantage of this approach is that the output (the number of counts observed for each sequence) is digital and there is a very large dynamic range. In a total set of hundreds of thousands or millions of individual sequences there will be many nonspecific sites, but they usually occur only once, whereas the highest affinity sites may occur thousands of times. From millions of reads one can estimate the binding model (such as a PWM), as well as nonspecific binding energies and the free TF concentration, after a single round of selection and obtain models that fit the data well37. Performing additional rounds of selection may provide more information about specific segments of the energy distribution and may give more accurate models, particularly for low-specificity TFs or those with large non-independent contributions (see the section on computational modelling). For example, the input to the second round of selection is enriched for higher affinity sites and the number of nonspecific sites is reduced. By comparing the distribution of sites before and after selection, one can obtain better resolution of the binding energy differences among the high-affinity sites. Multiple rounds may provide highly accurate specificity models even for low-specificity TFs.
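As a rough illustration of comparing site distributions before and after selection, the sketch below converts read counts into relative binding energies. It is not the maximum likelihood approach of REF. 37; it assumes binding is far from saturation, so that enrichment is proportional to affinity. The function name, the pseudocount and the example counts are all hypothetical.

```python
import math
from collections import Counter

def relative_energies(pre_counts, post_counts, pseudocount=1.0):
    """Estimate relative binding energies (in units of RT) from read counts.

    Assumes the unsaturated regime, where enrichment is proportional to
    affinity. The most enriched site gets energy 0; weaker sites get
    positive energies.
    """
    pre_total = sum(pre_counts.values())
    post_total = sum(post_counts.values())
    ratios = {}
    for seq in pre_counts:
        pre_f = (pre_counts[seq] + pseudocount) / pre_total
        post_f = (post_counts.get(seq, 0) + pseudocount) / post_total
        ratios[seq] = post_f / pre_f
    best = max(ratios.values())
    return {seq: -math.log(r / best) for seq, r in ratios.items()}

# Hypothetical counts for three sites before and after one round of selection
pre = Counter({"GCGTGGGCG": 100, "GCGTAGGCG": 100, "ACGTACGTA": 100})
post = Counter({"GCGTGGGCG": 5000, "GCGTAGGCG": 500, "ACGTACGTA": 5})

energies = relative_energies(pre, post)
for seq, energy in sorted(energies.items(), key=lambda item: item[1]):
    print(seq, round(energy, 2))
```

The pseudocount keeps sites that were never observed after selection from producing an infinite energy; a full analysis would also model saturation and nonspecific binding.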

Another advantage is that there is no inherent limit to the length of the binding site that can be selected, although the library size will limit the length that is covered comprehensively; using a nanomole of DNA (about 10^15 individual sequences), nearly all possible 25-base-long potential binding sites are included. Of course, the number of sites that are sequenced is limited to about 10^8, but they can be selected from an enormously larger set of possible binding sites, so one can study TFs with very long binding sites, such as bacterial TFs that commonly have sites longer than 16 base pairs. A recent publication39 has pushed this approach much further. Using tagged proteins, the authors performed HT-SELEX from cell extracts (rather than purifying the TF), and by barcoding individual experiments they collected binding sites for several TFs in parallel.


Multiplexing. The process of mixing several experimental samples together and sequencing them (or subjecting them to some other process) simultaneously. Each sample can be barcoded so that its experimental origin can be recovered.

In total they obtained binding site data for 19 different TFs, many of which have low specificity and required multiple rounds of selection. This demonstrated that HT-SELEX can have very high throughput and generate enormous amounts of specificity data quite rapidly. The optimal method for extracting specificity models from the data, especially after multiple rounds of selection, remains an open question (see the section on computational modelling).
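The barcoding step can be sketched as follows. This is a minimal illustration that assumes each read begins with a fixed-length barcode that matches exactly; the barcodes, experiment names and reads are hypothetical, and real pipelines typically also tolerate sequencing errors in the barcode.

```python
from collections import defaultdict

def demultiplex(reads, barcodes):
    """Group reads by experiment using a barcode at the start of each read.

    reads: iterable of sequence strings
    barcodes: dict mapping barcode -> experiment name (all barcodes are
              assumed to have the same length and to match exactly)
    Returns a dict of experiment name -> list of reads with the barcode removed.
    """
    length = len(next(iter(barcodes)))
    by_experiment = defaultdict(list)
    for read in reads:
        tag, insert = read[:length], read[length:]
        by_experiment[barcodes.get(tag, "unassigned")].append(insert)
    return dict(by_experiment)

# Hypothetical barcodes and reads
barcodes = {"ACGT": "TF_A", "TGCA": "TF_B"}
reads = ["ACGTGCGTGGGCG", "TGCATTGACGTCA", "ACGTGCGTAGGCG", "GGGGAAAAAAAAA"]

for experiment, sites in demultiplex(reads, barcodes).items():
    print(experiment, sites)
```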

Bacterial one-hybrid selections

Bacterial one-hybrid (B1H) selections are not in vitro assays (unlike the methods described above) and can be used for any TF that can be cloned and expressed in Escherichia coli; this method has the advantage that the TF need not be purified or synthesized in vitro40–42. The approach uses a library of randomized binding sites upstream of a weak promoter that drives the expression of a selectable gene, typically the yeast HIS3 (which encodes a component of the histidine biosynthesis pathway; the E. coli strain lacks the bacterial homologue) (FIG. 1D). When the cells are grown in medium lacking histidine, expression of the HIS3 gene is required for growth, and the stringency of the selection can be modulated by the addition of 3-amino-1,2,4-triazole (3AT), an inhibitor of the HIS3 enzyme. The TF is fused to the non-essential ω subunit of RNA polymerase, so that TF binding recruits RNA polymerase and increases promoter activity.

In published work the binding sites from selected colonies were sequenced individually, typically obtaining about 50 sequences for each selection42,43. However, the application of new sequencing methods allows one to collect all of the cells on the plate (including the large colonies, the microcolonies and even cells that were plated but did not grow) and sequence them all en masse, obtaining millions of binding sites for each experiment (G.D.S., unpublished observations). The high-affinity sites lead to more HIS3 expression and so allow the cells with those sites to grow the fastest, resulting in a higher number of sequence reads. Therefore, coupling B1H to high-throughput sequencing provides, as with HT-SELEX, a digital read-out with a very large dynamic range, with the highest affinity sites occurring hundreds to thousands of times and the lowest affinity sites typically occurring only once or not at all. It can also be multiplexed so that the data from several experiments, for different TFs or under different selection stringencies, can be obtained in parallel.

In B1H there is no real limit to the length of the randomized site that can be selected for, although the total library size is limited by the efficiency of transformation into E. coli to about 10^8 independent sites. That is enough to include all possible 12-base-long sequences, but it is also possible to select from incomplete libraries of longer randomized regions42,43. Using the data to model the binding specificity is not as straightforward as with PBM or HT-SELEX data because the counts are not directly related to binding affinity but are confounded by the effects of selection for cell growth over many generations. By making the simplifying assumption that the growth rate is proportional to the TF occupancy of the binding site, up to some saturation level, one can treat the counts similarly to HT-SELEX or PBM data and obtain good quantitative models (see below).
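The relationship between library size and the site length that is covered comprehensively can be estimated with a short calculation. The sketch below assumes that every library member is an independent, uniformly random sequence (a Poisson approximation), which real libraries only approximate; the function name is hypothetical.

```python
import math

def expected_coverage(library_size, site_length):
    """Expected fraction of all possible sites present at least once.

    Assumes each library member is an independent, uniformly random
    sequence; real libraries may be biased.
    """
    possible = 4 ** site_length
    return 1.0 - math.exp(-library_size / possible)

# A library of about 10^8 independent sites, as quoted for B1H
for length in (10, 12, 14, 16):
    print(length, f"{expected_coverage(1e8, length):.4f}")
```

Under this assumption a library of 10^8 sites contains more than 99% of all 12-base-long sequences, but only a minority of 14-base-long sequences, consistent with the statement that longer randomized regions are sampled incompletely.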

Computational modelling of specificity

Given the data produced by each of the experimental methods, one can model the specificity using computational approaches (BOX 2). One can ask why a model is needed if the affinities to all possible binding sites can be determined directly. However, even in cases in which the binding sites are short enough that all possible variations can be measured, there are still some advantages to having a model. The model parameters are averaged over many independent measurements and can reduce the uncertainty for any particular sequence compared to the raw data, which are often fairly noisy. A simple model provides an easy method for scanning a genome and predicting the most likely binding sites as well as the effects of genetic variations. A model has fewer parameters than the entire data set, and the complexity of the model required for a good fit to the data can provide insight into the recognition mechanism. For example, if a simple PWM (BOX 2) fits the data very well, this indicates that the protein binds in an additive manner in which each position contributes independently to the binding energy. If such a model does not fit the data well, it indicates more complex interactions between the protein and the DNA. Simple models lend themselves to predicting the effects of variants, both in the binding sites and in the protein itself, and can facilitate the design of proteins with novel specificity.
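To illustrate how an additive model is used for scanning, the sketch below scores every window of a sequence with a small energy matrix. The matrix values and the sequence are hypothetical, energies are in arbitrary units with the preferred base set to 0, and only one strand is scanned for brevity.

```python
def site_energy(matrix, site):
    """Sum independent per-position energy contributions (additive model)."""
    return sum(matrix[i][base] for i, base in enumerate(site))

def scan(matrix, sequence):
    """Score every window; return (position, site, energy) tuples, best first."""
    width = len(matrix)
    hits = [
        (i, sequence[i:i + width], site_energy(matrix, sequence[i:i + width]))
        for i in range(len(sequence) - width + 1)
    ]
    # Lower energy means predicted tighter binding
    return sorted(hits, key=lambda hit: hit[2])

# Hypothetical 3-position energy matrix; the preferred base has energy 0
matrix = [
    {"A": 0.0, "C": 2.0, "G": 1.5, "T": 2.5},
    {"A": 2.0, "C": 0.0, "G": 2.0, "T": 1.0},
    {"A": 3.0, "C": 2.5, "G": 0.0, "T": 2.0},
]

best = scan(matrix, "TTACGGATG")[0]
print(best)  # (2, 'ACG', 0.0)

# Predicted effect of a C-to-T variant at the middle position of the best site
print(site_energy(matrix, "ATG") - site_energy(matrix, "ACG"))  # 1.0
```

Because the model is additive, the predicted effect of any single-base variant is simply the difference between two matrix entries.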

Consider the example of studying the bHLH proteins using MITOMI12. A simple additive model assumes that the binding energy changes at each position can be summed to provide the binding energies to any sequence with multiple changes. Under such a model one would only need to determine the Kd of the consensus sequence and each single base variant of that sequence, a total of 3L + 1 measurements for a binding site of length L. Maerkl and Quake concluded that such an additive model does not fit the data very well for the bHLH TFs they studied. However, we showed44 that a simple extension to the additive model fits the data well, within the variance of the measurements. The extension is that energy changes are needed for each adjacent dinucleotide pair but not for more distant pairs; the binding seems to be additive over adjacent dinucleotides. For the four-base-long half sites the extended model requires only 40 measurements to capture all of the important variables (the nonspecific binding energy may need to be determined separately44). In fact, a TF that binds to a ten-base-long site could be assayed directly for binding to the preferred site and all possible adjacent dinucleotide variants with only 112 different sequences. This opens up a method like MITOMI to comprehensive studies of TFs with much longer binding sites than have previously been studied, provided that the preferred binding site is known. This is also within the range of existing array sizes for SPR methods14. At this time it is not known what fraction of TFs can be well modelled using adjacent base-pair energy contributions.
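The measurement counts quoted above can be checked by enumerating the required sequences. The sketch below uses arbitrary placeholder consensus sequences (the counts do not depend on the particular consensus); the function names are hypothetical.

```python
from itertools import product

BASES = "ACGT"

def single_variants(consensus):
    """All sequences that differ from the consensus at exactly one position."""
    variants = set()
    for i, ref in enumerate(consensus):
        for base in BASES:
            if base != ref:
                variants.add(consensus[:i] + base + consensus[i + 1:])
    return variants

def adjacent_dinucleotide_variants(consensus):
    """The consensus plus every sequence altered within one adjacent pair."""
    variants = {consensus}
    for i in range(len(consensus) - 1):
        for pair in product(BASES, repeat=2):
            variants.add(consensus[:i] + "".join(pair) + consensus[i + 2:])
    return variants

half_site = "ACGT"        # placeholder four-base-long consensus
long_site = "ACGTACGTAC"  # placeholder ten-base-long consensus

print(len(single_variants(half_site)) + 1)             # 13, i.e. 3L + 1 for L = 4
print(len(adjacent_dinucleotide_variants(half_site)))  # 40
print(len(adjacent_dinucleotide_variants(long_site)))  # 112
```

The set removes single-base variants that are shared between neighbouring pairs, so the total is 1 + 3L + 9(L - 1) = 12L - 8, which gives 40 for L = 4 and 112 for L = 10.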


The other three methods described above do not determine affinities directly but rather measure something related to affinity: the fluorescence intensities for microarray-based methods and the counts of individual binding sites for in vitro selection and bacterial one-hybrid methods (FIG. 2a–c). The data are usually collected at a fixed concentration of the TF, although the experiment can be repeated at different concentrations. Even if the total concentration of the TF is known, this is not the concentration of free TF, which depends on both the concentration of TF and the concentrations and affinities of all the potential binding sites. It is the free concentration of TF that is needed to relate the binding energy to the binding probability (BOXES 1,2).
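The distinction between total and free TF can be illustrated with the standard two-state equilibrium relation, in which the occupancy of a site is [TF]free / ([TF]free + Kd). The sketch below treats the simplest case of a single class of site competing for the TF; the function names and concentrations are hypothetical, and a real library with many site types would require a numerical solution.

```python
def binding_probability(free_tf, kd):
    """Equilibrium probability that a site is occupied (two-state model).

    free_tf and kd must be in the same concentration units.
    """
    return free_tf / (free_tf + kd)

def free_tf_single_site(total_tf, total_site, kd):
    """Free TF concentration when a single class of site binds the TF.

    Solves the mass-action quadratic (T - C)(S - C) = Kd * C for the
    concentration C of the complex.
    """
    b = total_tf + total_site + kd
    bound = (b - (b * b - 4.0 * total_tf * total_site) ** 0.5) / 2.0
    return total_tf - bound

# Hypothetical values in nM
total_tf, total_site, kd = 100.0, 100.0, 10.0
free = free_tf_single_site(total_tf, total_site, kd)

print(f"free TF = {free:.1f} nM of {total_tf:.0f} nM total")
print(f"occupancy using free TF  = {binding_probability(free, kd):.2f}")
print(f"occupancy using total TF = {binding_probability(total_tf, kd):.2f}")
```

Using the total rather than the free concentration overestimates the occupancy whenever a substantial fraction of the TF is bound, which is why the free concentration is treated as a parameter to be estimated.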

The raw data provide relative binding probabilities for PBM and SELEX data but are related to the growth rate for B1H data. If one uses the simplifying assumption that the growth rate is proportional to the occupancy of the binding site, up to a certain saturation level (or the point at which the concentration of HIS3 is no longer limiting for growth), B1H data can be treated very similarly to PBM and SELEX data, which are equivalent to the curves on the three-dimensional plot of BOX 1 at a fixed value of the concentration of TF ([TF]). With sufficient data, and by employing a model of how the binding energy depends on the sequence (such as a PWM), it is possible to estimate all of the parameters of the model, including the [TF] and the nonspecific binding energy37. Models with various complexities can be compared to see which fits the data the best, taking into account the number of estimated parameters; more complex models should always provide a better fit, but there may be over-fitting (that is, fitting the experimental noise) if there are excess parameters. For example, one could test whether using energy terms for adjacent dinucleotides provides a substantially better fit than a simple PWM, as was observed for the bHLH proteins using MITOMI12,37. As noted above, the UniProbe database of the Bulyk group21 reports PWMs for each TF that they have analysed using PBMs and also 8-mer enrichment scores, which are another form of model for the specificity of the TF. Using all 8-mer scores produces a binding model with 32,896 parameters (all unique 8-mers accounting for both orientations), but in general only a small number of highly enriched 8-mers are used to define the specificity. We have found that in most cases a PWM, perhaps extended to include adjacent dinucleotides, fits the PBM data as well as 8-mer enrichment scores (G.D.S. and Y.Z., unpublished observations), but there can be exceptions that indicate complex interactions, perhaps multiple modes of binding.
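The figure of 32,896 parameters can be reproduced by counting 8-mers after merging each sequence with its reverse complement. The sketch below does this by brute-force enumeration; the function names are hypothetical.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def count_unique_kmers(k):
    """Count k-mers when a sequence and its reverse complement count as one."""
    canonical = set()
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        # Keep one representative per reverse-complement pair
        canonical.add(min(kmer, reverse_complement(kmer)))
    return len(canonical)

print(count_unique_kmers(8))  # 32896
```

The same number follows from a closed form: of the 4^8 = 65,536 possible 8-mers, 4^4 = 256 are their own reverse complement, so the number of unique 8-mers is (65,536 + 256) / 2 = 32,896.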

FIGURE 2 shows the data and results for PBM, HT-SELEX and B1H using the same protein, the zinc finger protein ZIF268 (also known as early growth response protein 1). As with other microarrays, PBMs tend to have higher backgrounds than the digital counting methods, but there is still a very strong signal with the highest intensities being about 15-fold higher than the mean background (about 65 standard deviations above the background distribution). ZIF268 has previously been shown to be well represented by a simple PWM, with only a small energy contribution from adjacent dinucleotides45. Furthermore, each method leads to very similar parameters when each data set is analysed using the same method for estimating binding energies37 (FIG. 2d–f), and in each case they fit the experimental data well (FIG. 2g–i). This suggests that the choice of method should depend on factors other than the quality of the resulting data, such as the length of the binding site, the ease of purifying or synthesizing the TF and the cost of sequencing compared to microarrays.

Conclusions

The specificity of TFs, that is, their ability to discriminate between different sequences, is essential to the proper functioning of regulatory networks and gene expression. Recent technological advances in several different approaches have greatly increased the speed at which quantitative affinity measurements can be obtained for many different sequences. Methods such as MITOMI and SPR can determine binding affinities directly but have relatively low throughput compared to the other methods described here and require specialized equipment. Using purified or in vitro synthesized proteins, PBM and CSI can be used on microarray equipment that is commonly available, and these approaches have many of the attributes of microarrays for gene expression studies. The cost of microarrays is now quite low, which is one advantage of these approaches. HT-SELEX also requires purified TFs, although recent results show that it can also be performed on cellular extracts of tagged proteins39. The results are counts of individual binding sites, which give HT-SELEX a large dynamic range and allow binding sites of any length to be assayed. Single rounds of selection are often sufficient, but low-specificity TFs may require multiple rounds to determine good models. B1H methods do not require purified protein and are performed by simple selections in bacterial cells. Sites of unlimited size can be selected, although library sizes are limited by transformation efficiencies. Both HT-SELEX and B1H need only standard laboratory equipment and access to high-throughput sequencing facilities, which are now widely available.

In the few years that these methods have been used, information has rapidly accumulated regarding the specificities of human TFs as well as the TFs of several important model organisms. Many are still unknown, but in the coming years our knowledge of TF specificities should expand enormously. That information will be very useful for the next phase of elucidating regulatory networks: the identification of interacting TFs in vivo and the combinations of TFs that control specific genes and pathways. Comparison of the intrinsic specificities with location experiments, such as ChIP–chip and ChIP–seq, will help in that analysis. In addition, as more specificity information is obtained it will facilitate the prediction of specificities of proteins without experimental analysis and aid the design of new proteins with novel specificities27,42,46,47. There is already considerable effort devoted to the design of zinc finger proteins coupled with nucleases to target specific locations in the genome48. As more specificity information emerges for more families of TF proteins, it is likely that many more can be modelled well enough to develop design criteria.


[Figure 2 image not reproduced. Axes as printed: parts a–c plot frequency against 8-mer median intensity (a) or read count (b, c); parts d–f plot information content against position; parts g–i plot observed values against PWM-predicted values (8-mer median intensity or counts), with R-squared values of 0.75, 0.92 and 0.91 shown on the panels.]

Figure 2 | Comparison of data and analysis from three methods. a | Histogram of fluorescence intensities from protein-binding microarray (PBM) data for the zinc finger transcription factor ZIF268 (also known as early growth response protein 1). b | Histogram of counts from high-throughput (HT)-SELEX data for ZIF268 (the total count is about 260,000). c | Histogram of counts from bacterial one-hybrid (B1H) data for ZIF268 in which just the first six positions were randomized (G.D.S., unpublished observations) (experimental conditions were 1 mM 3-amino-1,2,4-triazole and 50 μM isopropyl β-D-1-thiogalactopyranoside (IPTG); the total count is approximately 160,000). d | Logo of position weight matrix (PWM) for data shown in part a. e | Logo of PWM for data shown in part b. f | Logo of PWM for data shown in part c. g | Fit of the predicted to observed 8-mer median intensities using the model of part d on the data from part a. h | Fit of the predicted to observed counts using the model of part e on the data from part b. i | Fit of the predicted to observed counts using the model of part f on the data from part c. Data in part a are from REF. 16. Data in part b are from REF. 37.


1. Farnham, P. J. Insights from genomic profiling of transcription factors. Nature Rev. Genet. 10, 605–616 (2009).

2. Li, X. Y. et al. Transcription factors bind thousands of active and inactive regions in the Drosophila blastoderm. PLoS Biol. 6, e27 (2008).

3. Zhang, X. et al. Genome-wide analysis of cAMP-response element binding protein occupancy, phosphorylation, and target gene activation in human tissues. Proc. Natl Acad. Sci. USA 102, 4459–4464 (2005).

4. Madan Babu, M., Teichmann, S. A. & Aravind, L. Evolutionary dynamics of prokaryotic transcriptional regulatory networks. J. Mol. Biol. 358, 614–633 (2006).

5. Vaquerizas, J. M., Kummerfeld, S. K., Teichmann, S. A. & Luscombe, N. M. A census of human transcription factors: function, expression and evolution. Nature Rev. Genet. 10, 252–263 (2009).

6. Kharchenko, P. V., Tolstorukov, M. Y. & Park, P. J. Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nature Biotech. 26, 1351–1359 (2008).

7. Park, P. J. ChIP-seq: advantages and challenges of a maturing technology. Nature Rev. Genet. 10, 669–680 (2009).

8. Johnson, D. S., Mortazavi, A., Myers, R. M. & Wold, B. Genome-wide mapping of in vivo protein–DNA interactions. Science 316, 1497–1502 (2007).

9. Pepke, S., Wold, B. & Mortazavi, A. Computation for ChIP-seq and RNA-seq studies. Nature Methods 6, S22–S32 (2009).

10. Ji, H. et al. An integrated software system for analyzing ChIP-chip and ChIP-seq data. Nature Biotech. 26, 1293–1300 (2008).

11. Gordan, R., Hartemink, A. J. & Bulyk, M. L. Distinguishing direct versus indirect transcription factor–DNA interactions. Genome Res. 19, 2090–2100 (2009).

12. Maerkl, S. J. & Quake, S. R. A systems approach to measuring the binding energy landscapes of transcription factors. Science 315, 233–237 (2007). This paper introduced the MITOMI method and demonstrated its application on four bHLH TFs.

13. Paul, S., Vadgama, P. & Ray, A. K. Surface plasmon resonance imaging for biosensing. IET Nanobiotechnol. 3, 71–80 (2009).

14. Shumaker-Parry, J. S., Aebersold, R. & Campbell, C. T. Parallel, quantitative measurement of protein binding to a 120-element double-stranded DNA array in real time using surface plasmon resonance microscopy. Anal. Chem. 76, 2071–2082 (2004).

15. Campbell, C. T. & Kim, G. SPR microscopy and its applications to high-throughput analyses of biomolecular binding events and their kinetics. Biomaterials 28, 2380–2392 (2007). A review of SPR methods and applications, including the study of protein–DNA interactions.

16. Berger, M. F. et al. Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nature Biotech. 24, 1429–1435 (2006). This paper introduced the universal PBM that includes all possible ten-base-long binding sites and its application on several TFs.

17. Mukherjee, S. et al. Rapid analysis of the DNA-binding specificities of transcription factors with DNA microarrays. Nature Genet. 36, 1331–1339 (2004).

18. Bulyk, M. L., Huang, X., Choo, Y. & Church, G. M. Exploring the DNA-binding specificities of zinc fingers with DNA microarrays. Proc. Natl Acad. Sci. USA 98, 7158–7163 (2001).

19. Berger, M. F. & Bulyk, M. L. Universal protein-binding microarrays for the comprehensive characterization of the DNA-binding specificities of transcription factors. Nature Protoc. 4, 393–411 (2009).

20. Philippakis, A. A., Qureshi, A. M., Berger, M. F. & Bulyk, M. L. Design of compact, universal DNA microarrays for protein binding microarray experiments. J. Comput. Biol. 15, 655–665 (2008).

21. Newburger, D. E. & Bulyk, M. L. UniPROBE: an online database of protein binding microarray data on protein-DNA interactions. Nucleic Acids Res. 37, D77–D82 (2009).

22. Zhu, C. et al. High-resolution DNA-binding specificity analysis of yeast transcription factors. Genome Res. 19, 556–566 (2009).

23. Grove, C. A. et al. A multiparameter network reveals extensive divergence between C. elegans bHLH transcription factors. Cell 138, 314–327 (2009).

24. Badis, G. et al. Diversity and complexity in DNA recognition by transcription factors. Science 324, 1720–1723 (2009).

25. Puckett, J. W. et al. Quantitative microarray profiling of DNA-binding molecules. J. Am. Chem. Soc. 129, 12310–12319 (2007).

26. Warren, C. L. et al. Defining the sequence-recognition profile of DNA-binding molecules. Proc. Natl Acad. Sci. USA 103, 867–872 (2006). This paper introduced the CSI method and its application to TFs as well as small DNA-binding molecules.

27. Carlson, C. D. et al. Specificity landscapes of DNA binding molecules elucidate biological function. Proc. Natl Acad. Sci. USA 107, 4544–4549 (2010).

28. Hauschild, K. E., Stover, J. S., Boger, D. L. & Ansari, A. Z. CSI-FID: high throughput label-free detection of DNA binding molecules. Bioorg. Med. Chem. Lett. 19, 3779–3782 (2009).

29. Oliphant, A. R., Brandl, C. J. & Struhl, K. Defining the sequence specificity of DNA-binding proteins by selecting binding sites from random-sequence oligonucleotides: analysis of yeast GCN4 protein. Mol. Cell. Biol. 9, 2944–2949 (1989).

30. Tuerk, C. & Gold, L. Systematic evolution of ligands by exponential enrichment: RNA ligands to bacteriophage T4 DNA polymerase. Science 249, 505–510 (1990).

31. Blackwell, T. K. & Weintraub, H. Differences and similarities in DNA-binding preferences of MyoD and E2A protein complexes revealed by binding site selection. Science 250, 1104–1110 (1990).

32. Wright, W. E., Binder, M. & Funk, W. Cyclic amplification and selection of targets (CASTing) for the myogenin consensus binding site. Mol. Cell. Biol. 11, 4104–4110 (1991).

33. Fields, D. S., He, Y., Al-Uzri, A. Y. & Stormo, G. D. Quantitative specificity of the Mnt repressor. J. Mol. Biol. 271, 178–194 (1997).

34. Liu, X., Noll, D. M., Lieb, J. D. & Clarke, N. D. DIP-chip: rapid and accurate determination of DNA-binding specificity. Genome Res. 15, 421–427 (2005).

35. Roulet, E. et al. High-throughput SELEX SAGE method for quantitative modeling of transcription-factor binding sites. Nature Biotech. 20, 831–835 (2002).

36. Nagaraj, V. H., O’Flanagan, R. A. & Sengupta, A. M. Better estimation of protein–DNA interaction parameters improve prediction of functional sites. BMC Biotechnol. 8, 94 (2008).

37. Zhao, Y., Granas, D. & Stormo, G. D. Inferring binding energies from selected binding sites. PLoS Comput. Biol. 5, e1000590 (2009). Introduction of HT-SELEX and the maximum likelihood method ‘binding energy estimates using maximum likelihood’ (BEEML) for obtaining binding energy models from the data.

38. Zykovich, A., Korf, I. & Segal, D. J. Bind-n-Seq: high-throughput analysis of in vitro protein-DNA interactions using massively parallel sequencing. Nucleic Acids Res. 37, e151 (2009).

39. Jolma, A. et al. Multiplexed massively parallel SELEX for characterization of human transcription factor binding specificities. Genome Res. 20, 861–873 (2010). This study describes the use of HT-SELEX in parallel to determine the binding specificities of several human TFs.

40. Meng, X., Brodsky, M. H. & Wolfe, S. A. A bacterial one-hybrid system for determining the DNA-binding specificity of transcription factors. Nature Biotech. 23, 988–994 (2005). The introduction of an efficient B1H approach for determining TF binding specificities.

41. Meng, X. & Wolfe, S. A. Identifying DNA sequences recognized by a transcription factor using a bacterial one-hybrid system. Nature Protoc. 1, 30–45 (2006).

42. Noyes, M. B. et al. Analysis of homeodomain specificities allows the family-wide prediction of preferred recognition sites. Cell 133, 1277–1289 (2008).

43. Noyes, M. B. et al. A systematic characterization of factors that regulate Drosophila segmentation via a bacterial one-hybrid system. Nucleic Acids Res. 36, 2547–2560 (2008).

44. Stormo, G. D. & Zhao, Y. Putting numbers on the network connections. Bioessays 29, 717–721 (2007).

45. Benos, P. V., Bulyk, M. L. & Stormo, G. D. Additivity in protein–DNA interactions: how good an approximation is it? Nucleic Acids Res. 30, 4442–4451 (2002).

46. Alleyne, T. M. et al. Predicting the binding preference of transcription factors to individual DNA k-mers. Bioinformatics 25, 1012–1018 (2009).

47. Benos, P. V., Lapedes, A. S. & Stormo, G. D. Probabilistic code for DNA recognition by proteins of the EGR family. J. Mol. Biol. 323, 701–727 (2002).

48. Cathomen, T. & Joung, J. K. Zinc-finger nucleases: the next generation emerges. Mol. Ther. 16, 1200–1207 (2008).

49. Schneider, T. D., Stormo, G. D., Gold, L. & Ehrenfeucht, A. Information content of binding sites on nucleotide sequences. J. Mol. Biol. 188, 415–431 (1986).

50. Stormo, G. D. & Fields, D. S. Specificity, free energy and information content in protein–DNA interactions. Trends Biochem. Sci. 23, 109–113 (1998).

51. Schneider, T. D. & Stephens, R. M. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 18, 6097–6100 (1990).

52. Stormo, G. D., Schneider, T. D. & Gold, L. Quantitative analysis of the relationship between nucleotide sequence and functional activity. Nucleic Acids Res. 14, 6661–6679 (1986).

53. Lee, M. L., Bulyk, M. L., Whitmore, G. A. & Church, G. M. A statistical model for investigating binding probabilities of DNA nucleotide sequences using microarrays. Biometrics 58, 981–988 (2002).

54. Djordjevic, M., Sengupta, A. M. & Shraiman, B. I. A biophysical approach to transcription factor binding site discovery. Genome Res. 13, 2381–2390 (2003).

55. Fordyce, P. M. et al. De novo identification and biophysical characterization of transcription-factor binding sites with microfluidic affinity analysis. Nature Biotech. 28, 970–975 (2010).

Acknowledgements

We thank S. Wolfe for providing the plasmids and cells for the B1H experiments and useful comments on the manuscript. We thank members of the Stormo Laboratory for comments on the manuscript and especially R. Christensen, D. Granas, Z. Zuo and L. Schriefer for providing the data from the B1H selections with ZIF268.

Competing interests statement

The authors declare no competing financial interests.

FURTHER INFORMATION

Gary D. Stormo’s homepage: http://ural.wustl.edu
Berkeley database of Drosophila TF specificities: http://bdtnp.lbl.gov/Fly-Net/
Database of Drosophila TF DNA-binding specificities: http://pgfe.umassmed.edu/TFDBS
JASPAR database of TF binding specificities: http://jaspar.genereg.net
PAZAR database: http://www.pazar.info
SELEX-SAGE database: http://www.isrec.isb-sib.ch/htpselex
UniProbe database: http://the_brain.bwh.harvard.edu/uniprobe

Note added in proof

A recent paper extended the capability of the MITOMI approach by increasing the number of chambers in the array and by binding to longer oligonucleotides, such that all eight-base-long binding sites are included in the analysis55. This increases the throughput considerably over the initial method, but the throughput is still substantially less than that of PBM and CSI, in which all ten-base-long sites are included, or of HT-SELEX and B1H, which allow even longer selected sites.
