Post on 08-Aug-2020
Microarrays can show us when and where genes are expressed. But what regulates this expression?
Mechanisms of transcriptional regulation
regulation in trans: transcription factors
regulation in cis: promoters & enhancers, binding sites
Identifying transcription factor binding sites
Usually, binding sites are first determined empirically:
- DNase I footprinting
- EMSA
- SELEX (Systematic Evolution of Ligands by EXponential enrichment)
- protein-binding microarrays
Identifying transcription factor binding sites
Most transcription factors can bind to a range of similar sequences. We can represent these in either of two ways: as a consensus sequence, or as a position weight matrix (PWM).
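Binding site (motif) representations
7 characterized binding sites for a certain transcription factor:
TCCGGAAGC
TCCGGATGC
TCCGGATCT
CATGGATGC
CCAGGAAGT
GGTGGATGC
ACCGGATGC
consensus sequence:
[T/C]C[C/T]GGA[T/A]G[C/T]
PWM (counts per position; the slide also shows the matching sequence logo):
     1  2  3  4  5  6  7  8  9
A    1  1  1  0  0  7  2  0  0
T    3  0  2  0  0  0  5  0  2
G    1  1  0  7  7  0  0  6  0
C    2  5  4  0  0  0  0  1  5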
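As a minimal sketch of how the count matrix follows from the seven sites, the Python below tallies each base at each position and reports the most frequent base per position. The function names are illustrative, and the simple majority consensus is an assumption: it does not reproduce the degenerate positions shown in the slide's consensus.

```python
# The seven characterized binding sites from the slide.
SITES = [
    "TCCGGAAGC", "TCCGGATGC", "TCCGGATCT", "CATGGATGC",
    "CCAGGAAGT", "GGTGGATGC", "ACCGGATGC",
]

def count_matrix(sites):
    """Return {base: [count at each position]} for equal-length sites."""
    length = len(sites[0])
    matrix = {base: [0] * length for base in "ATGC"}
    for site in sites:
        for pos, base in enumerate(site):
            matrix[base][pos] += 1
    return matrix

def majority_consensus(matrix):
    """Most frequent base per position (ties go to the first of A, T, G, C)."""
    length = len(matrix["A"])
    return "".join(
        max("ATGC", key=lambda base: matrix[base][pos]) for pos in range(length)
    )

matrix = count_matrix(SITES)
for base in "ATGC":
    print(base, " ".join(str(count) for count in matrix[base]))
print("majority consensus:", majority_consensus(matrix))
```

The printed rows should agree with the count table on the slide, which is a quick way to check a matrix that was built by hand.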
Identifying transcription factor binding sites
Once we know the binding site, we can search the genome to find all of the (predicted) binding sites.
Finding binding sites in the genome
consensus: [T/C]C[C/T]GGA[T/A][G/C][C/T]
Consensus sequences make searching easy, e.g. by using regular expressions in Perl:
while (<SEQUENCE>) {
    # [TC] means T or C; a "|" inside brackets would match a literal pipe
    if (/[TC]C[TC]GGA[TA][GC][CT]/) { print; }    # do something with the match
}
All positions in the motif are treated the same.
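A consensus search usually needs to cover both DNA strands. The sketch below is one way to do that in Python with the same consensus pattern; `find_sites` and `reverse_complement` are illustrative names, and the input is assumed to contain only A, C, G, and T in upper case.

```python
import re

# Same consensus as the slide's regular expression.
CONSENSUS = re.compile(r"[TC]C[TC]GGA[TA][GC][CT]")
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_sites(seq):
    """Return sorted (start, strand, matched_text) tuples.

    start is 0-based on the forward strand; matches on one strand are
    non-overlapping because re.finditer is used.
    """
    hits = [(m.start(), "+", m.group()) for m in CONSENSUS.finditer(seq)]
    for m in CONSENSUS.finditer(reverse_complement(seq)):
        # Convert the reverse-strand coordinate back to the forward strand.
        hits.append((len(seq) - m.end(), "-", m.group()))
    return sorted(hits)

print(find_sites("AAATCCGGAAGCAAA"))
```

As with the Perl version, every position in the motif is treated the same: a sequence either matches the pattern or it does not.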
Finding binding sites in the genome
     1  2  3  4  5  6  7  8  9
A    1  1  1  0  0  7  2  0  0
T    3  0  2  0  0  0  5  0  2
G    1  1  0  7  7  0  0  6  0
C    2  5  4  0  0  0  0  1  5
A PWM allows us to assign more importance to more invariant positions. We can calculate a score based on the probability of a given nucleotide being in a given position.
TCCGGAAGC scores higher than TCCGGATCT, as GC is preferred over CT in the last two positions.
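One common way to turn such counts into a score is a log-odds matrix. The sketch below is written under stated assumptions, not as the method used by PATSER or ScanACE: a pseudocount of 1 per cell and a uniform background of 0.25 per base are both choices made here for illustration, and the function names are hypothetical.

```python
import math

# Count matrix from the slide (rows: base, columns: motif positions 1-9).
COUNTS = {
    "A": [1, 1, 1, 0, 0, 7, 2, 0, 0],
    "T": [3, 0, 2, 0, 0, 0, 5, 0, 2],
    "G": [1, 1, 0, 7, 7, 0, 0, 6, 0],
    "C": [2, 5, 4, 0, 0, 0, 0, 1, 5],
}
PSEUDOCOUNT = 1.0   # assumption: avoids log(0) for unseen bases
BACKGROUND = 0.25   # assumption: equal base frequencies in the genome

def log_odds_matrix(counts, pseudocount=PSEUDOCOUNT, background=BACKGROUND):
    """Convert counts to log2(frequency / background) weights."""
    length = len(counts["A"])
    weights = {base: [] for base in counts}
    for pos in range(length):
        total = sum(counts[b][pos] for b in counts) + pseudocount * len(counts)
        for base in counts:
            freq = (counts[base][pos] + pseudocount) / total
            weights[base].append(math.log2(freq / background))
    return weights

def score(site, weights):
    """Sum of per-position weights for a site of the motif's length."""
    return sum(weights[base][pos] for pos, base in enumerate(site))

def scan(seq, weights, threshold):
    """Return (start, site, score) for every window scoring >= threshold."""
    width = len(weights["A"])
    hits = []
    for start in range(len(seq) - width + 1):
        site = seq[start:start + width]
        site_score = score(site, weights)
        if site_score >= threshold:
            hits.append((start, site, site_score))
    return hits

WEIGHTS = log_odds_matrix(COUNTS)
print(round(score("TCCGGAAGC", WEIGHTS), 2), round(score("TCCGGATCT", WEIGHTS), 2))
```

Unlike the consensus search, a mismatch at an invariant position (such as the GGA core) costs much more here than a mismatch at a variable one. The score threshold passed to `scan` is a free parameter that has to be chosen for each matrix.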
Some common programs for this are PATSER (Hertz and Stormo [1999]. Bioinformatics 15:563) and ScanACE (Hughes et al. [2000]. J. Mol. Biol. 296:1205)
Finding binding sites in the genome
Matrices for known TFs have been collected into the TRANSFAC (http://www.gene-regulation.de/) and JASPAR (http://jaspar.cgb.ki.se/cgi-bin/jaspar_db.pl) databases.
TRANSFAC is much bigger, but JASPAR is more carefully curated and of higher quality.
Issues with finding binding sites in the genome
Many predicted sites end up not being functional in vivo. Wasserman and Sandelin (2004) have termed this the “futility theorem.” It might be a little overstated, but still represents a significant problem.
Futility Theorem:
Essentially all (predicted) TFBSs will have no functional role
What accounts for this, and how do we get around it?
Issues with finding binding sites in the genome
Technical issues:
• dependencies between nucleotides in motif
  - in practice, doesn’t seem too important
• limited data/small matrices
• better algorithms?
Biological issues:
• degeneracy of motifs
• cooperative binding
• in vivo binding doesn’t always correlate with in vitro binding (as seen by in vivo footprinting and ChIP)
Transcription factor binding can be affected by local concentration, by chromatin structure, and by interactions with other transcription factors
ChIP and ChIP-chip
Issues with finding binding sites in the genome
Some ways to help get around the “futility theorem” are to:
• search for paired motifs
• apply constraints as to position, etc.
• use phylogenetic footprinting or phylogenetic shadowing (i.e., comparative genomics)
Finding motifs ab initio
What if you don’t know what TFs are involved?
Binding site motifs can be predicted computationally from the regulatory regions of genes with similar expression patterns.
For instance, the promoter regions of genes that cluster in a microarray experiment can be used.
seq1: TTTTTATTTTTCTGAATCACCACTTGATATTGCTTCACAGAACT
seq2: CGGGCGGTGAGGCAGAGAAAGAGACCACTTGAAATGTAGTAATA
seq3: CACTTGAATTTTTCTGCACGCAGTTTTTATTTTTACTTTTCTTG
seq4: CGCGTTCGTTATTTGTTGTTGACCACTTGAATTGATTGCTTTAT
seq5: ATCCCGGTCGAGGTGCACTTGATGTTTTCAATGGAAATGTTGCC
seq6: TCTGCAGATTTATGGCCCAACGCTCATTTAACAATTAAAGTGGG
seq7: GCATTAACTCTCACTTCAAAAAATCATATAAACACCTCTAATAT
seq8: TATATTTTCTCGCCACTTAAATAGTTTTCAATGCCAATGGCAGG
seq9: ATCCTTATCGAAGCACTTGGATTTTAAAGCAATCTTTTGAACAC
Finding motifs ab initio
With the higher eukaryotes, there’s a lot more sequence to look at: you must consider 5’, 3’, and intronic sequence.
Finding motifs ab initio
Some common methods of motif discovery:
• Gibbs sampling: Gibbs sampler, AlignACE
• Expectation maximization: MEME
• Simulated annealing: GLAM
• enumeration (word counting): YMF
There are many other programs that use similar strategies. All of these tend to miss a lot of motifs and also find a lot that don’t seem to be relevant. Your best bet is to use more than one algorithm, to use additional criteria to help evaluate the output, and to do empirical evaluation.
Of course, you must then discover which transcription factor binds the motif [how?].
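The simplest of these strategies, enumeration, can be sketched in a few lines. The example below counts, for every 7-mer, how many of the nine promoter sequences shown earlier contain it. It is only a toy word count under assumptions made here (exact matches, forward strand only, fixed k = 7), not the YMF algorithm, and `kmer_support` is an illustrative name.

```python
from collections import Counter

# The nine promoter sequences from the slide.
PROMOTERS = {
    "seq1": "TTTTTATTTTTCTGAATCACCACTTGATATTGCTTCACAGAACT",
    "seq2": "CGGGCGGTGAGGCAGAGAAAGAGACCACTTGAAATGTAGTAATA",
    "seq3": "CACTTGAATTTTTCTGCACGCAGTTTTTATTTTTACTTTTCTTG",
    "seq4": "CGCGTTCGTTATTTGTTGTTGACCACTTGAATTGATTGCTTTAT",
    "seq5": "ATCCCGGTCGAGGTGCACTTGATGTTTTCAATGGAAATGTTGCC",
    "seq6": "TCTGCAGATTTATGGCCCAACGCTCATTTAACAATTAAAGTGGG",
    "seq7": "GCATTAACTCTCACTTCAAAAAATCATATAAACACCTCTAATAT",
    "seq8": "TATATTTTCTCGCCACTTAAATAGTTTTCAATGCCAATGGCAGG",
    "seq9": "ATCCTTATCGAAGCACTTGGATTTTAAAGCAATCTTTTGAACAC",
}

def kmer_support(seqs, k):
    """For each k-mer, count how many sequences contain it at least once."""
    support = Counter()
    for seq in seqs:
        # A set ensures each k-mer is counted once per sequence.
        support.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return support

support = kmer_support(PROMOTERS.values(), 7)
best_kmer, best_count = support.most_common(1)[0]
print(best_kmer, best_count)
```

Exact word counting finds CACTTGA in only five of the nine sequences; the others carry near matches such as CACTTAA or CACTTGG. That is one reason the probabilistic methods in the list work with matrices rather than exact words.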
Finding binding sites in the genome
How meaningful are the sites we find?
• Only experiments can tell us for sure
• However, we can get some hints using statistical analysis
Example 1:
We just found the motif CACTTGA upstream of co-expressed genes. Is it over-represented in this set compared to a random selection of genes?
Search 100 random sets of genes. Find the mean and standard deviation.
$$z = \frac{\text{observed} - \text{expected}}{\text{standard deviation}}$$
Here, expected is the mean count over the random sets.
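The z-score calculation can be sketched as follows. The helper names are illustrative, the list of background promoters is something you would have to supply yourself, and using the sample standard deviation is an assumption made here.

```python
import random
import statistics

def count_with_motif(promoters, motif):
    """Number of promoter sequences containing the motif at least once."""
    return sum(1 for seq in promoters if motif in seq)

def sample_background_counts(all_promoters, motif, set_size, n_sets=100, seed=0):
    """Motif counts in n_sets random gene sets drawn from all_promoters (a list)."""
    rng = random.Random(seed)
    return [
        count_with_motif(rng.sample(all_promoters, set_size), motif)
        for _ in range(n_sets)
    ]

def z_score(observed, background_counts):
    """z = (observed - expected) / standard deviation of the background."""
    expected = statistics.mean(background_counts)
    spread = statistics.stdev(background_counts)  # sample standard deviation
    if spread == 0:
        raise ValueError("background counts have zero variance")
    return (observed - expected) / spread

# Toy numbers: the motif is seen in 5 co-expressed genes, while three
# random sets of the same size contained it 1, 2, and 3 times.
print(z_score(5, [1, 2, 3]))
```

In practice the random sets should be the same size as the co-expressed set and drawn from comparable regions (for example, the same length of upstream sequence).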
Finding binding sites in the genome
Example 2:
Many regulatory regions contain multiple binding sites for the same transcription factor. Is the motif found an unusually large number of times in a short stretch of sequence?
(very) crudely:
Probability of finding a 7 bp motif at a given position (assuming equal base frequencies): 4^-7 = 1/16,384, i.e., expect only about 1 motif every 16 kb.
Thus, finding several close together is very unlikely.
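To put a rough number on "very unlikely", the sketch below extends the crude estimate with a Poisson approximation. The Poisson model, the equal base frequencies, and the function names are all assumptions introduced here; real genomes are not random sequence, so treat the result as an order-of-magnitude guide only.

```python
import math

def expected_hits(motif_length, window_bp, both_strands=False):
    """Expected number of exact motif matches in a random window."""
    per_position = 0.25 ** motif_length          # 4^-k for equal base frequencies
    positions = window_bp - motif_length + 1     # possible start positions
    strands = 2 if both_strands else 1
    return per_position * positions * strands

def prob_at_least(n, mean):
    """Poisson probability of seeing n or more events given the mean."""
    below = sum(math.exp(-mean) * mean ** k / math.factorial(k) for k in range(n))
    return 1.0 - below

mean_500bp = expected_hits(7, 500)
print(mean_500bp, prob_at_least(3, mean_500bp))
```

For a 7 bp motif, three or more exact matches within 500 bp of random sequence come out on the order of a few in a million under these assumptions.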
Transcription factors, binding sites, and target genes
• identify transcription factors: genetic screens; one-hybrid assays; sequence motifs/homology
• identify binding motif: bioinformatics (e.g., Gibbs sampling on microarray data); molecular biology using purified protein or protein extracts
• find all motifs in genome: computational searching; ChIP-chip
• identify target genes: computational searching; microarrays; genetic screens
Finding cis-Regulatory Modules (enhancers)
Genes are often regulated in a modular fashion: discrete cis-regulatory elements (CRMs, “enhancers”) dictate a specific spatio-temporal expression pattern.
[Figure: map of 3’ regulatory region of eve (Fujioka et al. 1999)]
Typically, CRMs are identified empirically, by deletion analysis and reporter gene assays.
However, we can also search for CRMs computationally.
Finding cis-Regulatory Modules (enhancers)
Most methods to identify CRMs search for co-occurrence of one or more TFBSs.
All of the current methods are based on one or more of the following:
• knowledge of one or more relevant TFBSs
• availability of training set of ≥ 10 known CRMs
• phylogenetic footprinting
• assumption that one or more TFBSs are over-represented
This last assumption appears generally true, but is possibly a result of selection bias; it has not been seriously tested.
In fact, a major drawback to much of this work is an absence of thorough empirical validation.
Finding CRMs—examples
simple comparative genomics:
align sequence of two or more species
[Figure: plot of % identity (seq1 vs seq2) against sequence position, with a predicted regulatory element marked]
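A percent-identity profile of this kind can be computed from a pairwise alignment with a sliding window. The sketch below assumes the two sequences are already aligned (equal length, with "-" for gaps); the window size, the cutoff, and the function names are illustrative choices, not values from the lecture.

```python
def window_identity(aligned1, aligned2, window=10, step=1):
    """Percent identity per window for two gapped, equal-length aligned sequences.

    Returns a list of (window_start, percent_identity). Gap columns never
    count as matches.
    """
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have the same length")
    results = []
    for start in range(0, len(aligned1) - window + 1, step):
        chunk1 = aligned1[start:start + window].upper()
        chunk2 = aligned2[start:start + window].upper()
        matches = sum(1 for a, b in zip(chunk1, chunk2) if a == b and a != "-")
        results.append((start, 100.0 * matches / window))
    return results

def conserved_windows(aligned1, aligned2, window=10, min_identity=80.0):
    """Window starts whose identity meets the cutoff (candidate regulatory regions)."""
    return [
        start
        for start, identity in window_identity(aligned1, aligned2, window)
        if identity >= min_identity
    ]

print(window_identity("ACGTACGT", "ACGTTTTT", window=4, step=4))
```

Peaks in such a profile outside of coding sequence are the candidates that phylogenetic footprinting flags as predicted regulatory elements.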
Finding CRMs—examples
motif density/clustering: e.g., Berman et al. (2002) PNAS 99:757
• assemble PWMs
• determine density cutoff based on training set
• search genome (needs PWM cutoff)
• look for ≥ 13 motifs/500 bp
• (apply sequence conservation filters)
requires a certain number of motifs per sequence window
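The density step can be sketched as a scan over the positions of predicted sites. This is a simplified illustration, not the published implementation: it assumes the PWM search has already produced a list of hit positions, and `dense_windows` is a hypothetical helper. The defaults of 500 bp and 13 hits are taken from the bullet above.

```python
from bisect import bisect_left

def dense_windows(hit_positions, window_bp=500, min_hits=13):
    """Find windows that start at a hit and contain at least min_hits hits.

    Returns (first_hit, last_hit, n_hits) for each qualifying window.
    """
    hits = sorted(hit_positions)
    found = []
    for i, start in enumerate(hits):
        # Index of the first hit that falls outside [start, start + window_bp).
        j = bisect_left(hits, start + window_bp)
        if j - i >= min_hits:
            found.append((start, hits[j - 1], j - i))
    return found

# Toy data: 13 predicted sites spaced 30 bp apart, plus two isolated sites.
toy_hits = list(range(1000, 1000 + 13 * 30, 30)) + [5000, 9000]
print(dense_windows(toy_hits))
```

Overlapping windows that pass the cutoff would normally be merged into one predicted CRM before any conservation filter is applied.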
Finding CRMs—examples
logistic regression analysis (LRA): e.g., Krivan and Wasserman (2001) Genome Res. 11:1599
• assemble PWMs
• sum scores for training set and background set
• search genome
• (apply sequence conservation filters)
Logistic Regression: a classification technique used to divide between binary choices
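To make the classification idea concrete, here is a minimal logistic regression fitted by plain gradient descent. It is a sketch under assumptions, not the Krivan and Wasserman implementation: the two features stand for summed PWM scores of two hypothetical transcription factors in a sequence window, and the training numbers are invented toy values.

```python
import math

def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def predict(features, weights, bias):
    """Probability that a window with these summed scores is a CRM."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))

def fit_logistic(rows, labels, learning_rate=0.1, epochs=5000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in zip(rows, labels):
            error = predict(features, weights, bias) - label
            for k in range(n_features):
                grad_w[k] += error * features[k]
            grad_b += error
        weights = [w - learning_rate * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(rows)
    return weights, bias

# Toy training data (invented): summed scores for two hypothetical TFs.
known_crms = [[8.0, 7.0], [9.0, 6.0], [7.0, 9.0]]      # label 1
background = [[1.0, 2.0], [2.0, 1.0], [0.0, 3.0]]      # label 0
weights, bias = fit_logistic(known_crms + background, [1, 1, 1, 0, 0, 0])
print(round(predict([8.0, 8.0], weights, bias), 3), round(predict([1.0, 1.0], weights, bias), 3))
```

Once fitted, the model is applied to every window in the genome search, and windows with a probability above a chosen cutoff are reported as candidate CRMs.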
Finding CRMs—examples
Stubb: Sinha et al. (2003) Bioinformatics 19 Suppl 1: i292
Sinha et al. (2004) BMC Bioinformatics 5:129
• assemble PWMs
• score sequence windows
• can incorporate phylogenetic data
• can incorporate relationships between motifs
• would probably be my method of choice
Uses a form of Hidden Markov Model to score how likely it is that a sequence S was generated by a probabilistic process that uses motifs from set W, rather than by a random background process
Finding CRMs—examples
PFR-sampler/searcher: Grad et al. (2004) Bioinformatics. In press: doi:10.1093/bioinformatics/bth320
• does not require PWMs or knowledge of TFs!
• requires set of similarly expressed genes
• incorporates phylogenetic data
• can discover motifs
• untested but very promising
Uses subsequence profiling (k-mer frequency) to find phylogenetically conserved regions that are most similar to one another
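The idea of comparing regions by their k-mer frequencies can be illustrated with a small sketch. This is not the PFR-sampler algorithm itself: the choice of k = 3 and of cosine similarity as the comparison measure are assumptions made here, and the function names are illustrative.

```python
import math
from collections import Counter

def kmer_profile(seq, k=3):
    """Count every overlapping k-mer in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def cosine_similarity(profile1, profile2):
    """Cosine of the angle between two k-mer count vectors (0 to 1)."""
    # Counter returns 0 for k-mers that are absent from profile2.
    dot = sum(count * profile2[kmer] for kmer, count in profile1.items())
    norm1 = math.sqrt(sum(c * c for c in profile1.values()))
    norm2 = math.sqrt(sum(c * c for c in profile2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

region_a = "CACTTGAATTGACCACTTGA"
region_b = "TTGACCACTTGAATCACTTG"
region_c = "GGGGCGCGCGGGGCGCGCGG"
print(cosine_similarity(kmer_profile(region_a), kmer_profile(region_b)))
print(cosine_similarity(kmer_profile(region_a), kmer_profile(region_c)))
```

Regions that share many short words score close to 1 even when the words occur in a different order, which is what lets a profile-based method group candidate regions without knowing the motifs in advance.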
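Let’s look at the paper:
An even skipped Muscle and Heart Enhancer (MHE)
[Figure: map of the eve locus with positions +6.0, +6.3, and +12.0 marked, showing the MHE]
An even skipped muscle and heart enhancer (MHE)
[Figure: the same map, with MHE binding sites labeled: Mad, Tin, dTCF, Ets (RTK/Ras), Twi]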
Similarity between D. melanogaster and D. virilis eve MHE
(Uppercase in the D. melanogaster row marks bases identical in D. virilis; "-" marks a gap. Site labels are listed per block.)

Sites: Ets1, 2; Ets3
D. melan.   actcatccccattggcggctgggagagtcgTtgGGcaTCCGGAAgcGgggGCagCcAAaAataTACccac----------ATCcCATGGA
D. virilis  ------------------------------TATGG-TTCCGGAATTGT--GCGCCAAATATAGTACAAGTATCTTTGAGTATCTCATGGA

Sites: Tin1
D. melan.   TGC---------------------cATCAATTAGCATACAATTAAAAAATGCTTAAACAgGGAAATtgtCT---------------
D. virilis  TGCGCAACGGTTTTGTGACAATGCAATCAATTAGCATACAATTAAAAAATGCTTAAACAAGGAAATCAACTCGAGCTGCAGACGCA

Sites: Mad4; Twi1; Mad5
D. melan.   ---------------tgggaTGcgagTggTtCGGcCGCAgaTGcAgCcGcAgCaGCATtTGTATCT------------------------
D. virilis  CACACAGCCTTATTTGTATCTGTATCTAATGCGGGCGCAACTGTATCTGTAACTGCATCTGTATCTGTATCTCTGTCTGTGTCTGTATCT

Sites: Tin2; dTCF; Mad6; Tin3
D. melan.   -------------------------ccAAGTGGcggGCaGCAGATCAAAGCGACGAC---AACATAATTGCTgcTTCACTTCACAGT------
D. virilis  CTGTCTGTGTCTGTGCAGATGCCTTGTAAGTGG---GCCGCAGATCAAAGCGACGACGACAACATAATTGCTCGTTCACTTCACAGTTTTTTG

Sites: Tin4; Twi2; Ets4
D. melan.   ------------------tctCAggCACTTAAgataTaCATATgtATgTtgcaTACA-TatctAttgcgAgtCcGgatctgcA
D. virilis  ATGACTTCGACATGGCCACAGCAAACACTTAATG-CTCCATAT--ATTTCAGCTACACTGCAAAAAAATAAACGGTCCTAAAA
A hbr dorsal mesodermal enhancer
[Figure panels: α-Hbr / α-Eve; α-Hbr / α-βgal; α-Eve / α-βgal. Labels: TP]
The Twi and Ets sites in the hbr DME are required for proper enhancer function
[Figure panels: Hbr and β-gal in WT, Twi k.o., and Ets k.o. Labels: tracheal pits (non-mesodermal); P2, P15 (mesodermal)]
[Figure panels: WT MHE and motif A mut; α-Eve, α-βgal]
“Systems Biology”: Gene Regulatory Networks
“. . . despite all the examples of how individual genes affect the developmental process, there is yet no case where the lines of causality can be mapped from the genomic sequence to a major process of bilaterian development. . . .
. . . biological approaches have focused on determining the functions of one or a few genes at a time, an approach that is not adequate for analysis of large regulatory control systems organized as networks. The heart of such networks consists of genes encoding transcription factors and the cis-regulatory elements that control the expression of those genes. Each of these cis-regulatory elements receives multiple inputs from other genes in the network. . .
By determining the succession of DNA sequence-based cis-regulatory transactions that govern spatial gene expression, closure can be brought to the question of why any particular piece of development actually happens.”
Davidson et al. (2002) Science 295:1669
“Systems Biology”: Gene Regulatory Networks
Putting it all together:
Even knowing just a little of this gets incredibly complicated:
[Figure: regulatory gene network for sea urchin endomesoderm specification]
Davidson et al. (2002) Science 295:1669