Mechanisms of transcriptional regulation
halfonlab.ccr.buffalo.edu/courses/BCH512/BCH512_2004/...



Mechanisms of transcriptional regulation

Microarrays can show us when and where genes are expressed. But what regulates this expression?

regulation in trans: transcription factors

regulation in cis: promoters & enhancers; binding sites

Identifying transcription factor binding sites

Usually, binding sites are first determined empirically:
- DNase I footprinting
- EMSA
- SELEX (Systematic Evolution of Ligands by EXponential enrichment)
- protein-binding microarrays

Identifying transcription factor binding sites

Most transcription factors can bind to a range of similar sequences. We can represent these in either of two ways, as a consensus sequence, or as a position weight matrix (PWM).

Binding site (motif) representations

7 characterized binding sites for a certain transcription factor:

TCCGGAAGC
TCCGGATGC
TCCGGATCT
CATGGATGC
CCAGGAAGT
GGTGGATGC
ACCGGATGC

consensus sequence: (T/C) C (C/T) G G A (T/A) (G/C) (C/T)

PWM and logo (logo not reproduced here):

pos  1 2 3 4 5 6 7 8 9
A    1 1 1 0 0 7 2 0 0
T    3 0 2 0 0 0 5 0 2
G    1 1 0 7 7 0 0 6 0
C    2 5 4 0 0 0 0 1 5
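The count matrix on this slide can be rebuilt directly from the seven sites. The following is a minimal Python sketch; the function name count_matrix is made up for illustration and is not taken from any of the programs mentioned in these slides.

```python
from collections import Counter

# The seven characterized binding sites from the slide.
SITES = [
    "TCCGGAAGC",
    "TCCGGATGC",
    "TCCGGATCT",
    "CATGGATGC",
    "CCAGGAAGT",
    "GGTGGATGC",
    "ACCGGATGC",
]


def count_matrix(sites):
    """Return {base: [count at each position]} for equal-length sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    # One Counter per alignment column.
    columns = [Counter(site[i] for site in sites) for i in range(length)]
    return {base: [column[base] for column in columns] for base in "ATGC"}


matrix = count_matrix(SITES)
for base in "ATGC":
    print(base, " ".join(str(n) for n in matrix[base]))
```

Each row of the printout should reproduce the corresponding row of the matrix shown on the slide.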

Identifying transcription factor binding sites

Once we know the binding site, we can search the genome to find all of the (predicted) binding sites.


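Consensus sequences make searching easy, e.g. by using regular expressions in Perl:

    while (<SEQUENCE>) {
        # Inside a character class "|" is a literal character,
        # so alternatives are written [TC], not [T|C].
        if ($_ =~ /[TC]C[TC]GGA[TA][GC][CT]/) {
            # do something;
        }
    }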

All positions in the motif are treated the same.

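Finding binding sites in the genome

consensus sequence: (T/C) C (C/T) G G A (T/A) (G/C) (C/T)

Finding binding sites in the genome

pos  1 2 3 4 5 6 7 8 9
A    1 1 1 0 0 7 2 0 0
T    3 0 2 0 0 0 5 0 2
G    1 1 0 7 7 0 0 6 0
C    2 5 4 0 0 0 0 1 5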

A PWM allows us to assign more importance to more invariant positions. We can calculate a score based on the probability of a given nucleotide being in a given position.

TCCGGAAGC scores higher than TCCGGATCT as GC is preferred over CT in the last two positions

Some common programs for this are PATSER (Hertz and Stormo [1999]. Bioinformatics 15:563) and ScanACE (Hughes et al. [2000]. J. Mol. Biol. 296:1205)
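To make the scoring idea concrete, here is a minimal Python sketch of log-odds PWM scoring and scanning. It is not how PATSER or ScanACE are implemented; the pseudocount of 0.25, the uniform background of 0.25 per base, the threshold, and the test sequence are all assumptions chosen for illustration.

```python
import math

# Count matrix from the slide (7 sites, 9 positions).
COUNTS = {
    "A": [1, 1, 1, 0, 0, 7, 2, 0, 0],
    "T": [3, 0, 2, 0, 0, 0, 5, 0, 2],
    "G": [1, 1, 0, 7, 7, 0, 0, 6, 0],
    "C": [2, 5, 4, 0, 0, 0, 0, 1, 5],
}
N_SITES = 7


def log_odds_matrix(counts, n_sites, pseudocount=0.25, background=0.25):
    """Convert counts to log2(frequency / background).

    The pseudocount (an assumed value) avoids log(0) for unseen bases.
    """
    total = n_sites + 4 * pseudocount
    return {
        base: [math.log2(((c + pseudocount) / total) / background) for c in row]
        for base, row in counts.items()
    }


def score(site, weights):
    """Sum the position-specific weights for one candidate site."""
    return sum(weights[base][i] for i, base in enumerate(site))


def scan(sequence, weights, threshold):
    """Return (start, window, score) for every window scoring >= threshold."""
    width = len(next(iter(weights.values())))
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        s = score(window, weights)
        if s >= threshold:
            hits.append((start, window, s))
    return hits


WEIGHTS = log_odds_matrix(COUNTS, N_SITES)
print(score("TCCGGAAGC", WEIGHTS), score("TCCGGATCT", WEIGHTS))

# Hypothetical test sequence with one embedded site; threshold is arbitrary.
hits = scan("AAATCCGGAAGCTTT", WEIGHTS, threshold=10.0)
print(hits)
```

With these assumed settings, TCCGGAAGC scores higher than TCCGGATCT, in line with the slide. This sketch scans one strand only; a real search would also scan the reverse complement.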

Finding binding sites in the genome

Matrices for known TFs have been collected into the TRANSFAC (http://www.gene-regulation.de/) and JASPAR (http://jaspar.cgb.ki.se/cgi-bin/jaspar_db.pl) databases.

TRANSFAC is much bigger, but JASPAR is more carefully curated and of higher quality.

Issues with finding binding sites in the genome

Many predicted sites end up not being functional in vivo. Wasserman and Sandelin (2004) have termed this the “futility theorem.” It might be a little overstated, but still represents a significant problem.

Futility Theorem:

Essentially all (predicted) TFBSs will have no functional role

What accounts for this, and how do we get around it?

Issues with finding binding sites in the genome

Technical issues:
• dependencies between nucleotides in motif
  - in practice, doesn’t seem too important
• limited data/small matrices
• better algorithms?

Biological issues:
• degeneracy of motifs
• cooperative binding
• in vivo binding doesn’t always correlate with in vitro binding (as seen by in vivo footprinting and ChIP)

Transcription factor binding can be affected by local concentration, by chromatin structure, and by interactions with other transcription factors

ChIP and ChIP-chip


Issues with finding binding sites in the genome

Some ways to help get around the “futility theorem” are to:

• search for paired motifs

• apply constraints as to position, etc.

• use phylogenetic footprinting or phylogenetic shadowing (i.e., comparative genomics)

Finding motifs ab initio

What if you don’t know what TFs are involved?

Binding site motifs can be predicted computationally from the regulatory regions of genes with similar expression patterns.

For instance, the promoter regions of genes that cluster in a microarray experiment can be used.

seq1: TTTTTATTTTTCTGAATCACCACTTGATATTGCTTCACAGAACT
seq2: CGGGCGGTGAGGCAGAGAAAGAGACCACTTGAAATGTAGTAATA
seq3: CACTTGAATTTTTCTGCACGCAGTTTTTATTTTTACTTTTCTTG
seq4: CGCGTTCGTTATTTGTTGTTGACCACTTGAATTGATTGCTTTAT
seq5: ATCCCGGTCGAGGTGCACTTGATGTTTTCAATGGAAATGTTGCC
seq6: TCTGCAGATTTATGGCCCAACGCTCATTTAACAATTAAAGTGGG
seq7: GCATTAACTCTCACTTCAAAAAATCATATAAACACCTCTAATAT
seq8: TATATTTTCTCGCCACTTAAATAGTTTTCAATGCCAATGGCAGG
seq9: ATCCTTATCGAAGCACTTGGATTTTAAAGCAATCTTTTGAACAC

Finding motifs ab initio

With the higher eukaryotes, there’s a lot more sequence to look at: you must consider 5’, 3’, and intronic sequence.

Finding motifs ab initio

Some common methods of motif discovery:

• Gibbs sampling: Gibbs sampler, AlignACE

• Expectation maximization: MEME

• Simulated annealing: GLAM

• Enumeration (word counting): YMF

There are many other programs that use similar strategies. All of these tend to miss a lot of motifs and also find a lot that don’t seem to be relevant. Your best bet is to use more than one algorithm, to use additional criteria to help evaluate the output, and to do empirical evaluation.

Of course, you must then discover which transcription factor binds the motif [how?].
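The simplest of these strategies, word counting, can be sketched in a few lines of Python using the nine example sequences from the earlier slide. This is only a bare enumeration, assuming exact 7-mers on one strand; it does not reproduce the statistics YMF or any other program applies, and the function name is made up for illustration.

```python
from collections import Counter

# The nine example regulatory sequences from the slide.
SEQS = [
    "TTTTTATTTTTCTGAATCACCACTTGATATTGCTTCACAGAACT",
    "CGGGCGGTGAGGCAGAGAAAGAGACCACTTGAAATGTAGTAATA",
    "CACTTGAATTTTTCTGCACGCAGTTTTTATTTTTACTTTTCTTG",
    "CGCGTTCGTTATTTGTTGTTGACCACTTGAATTGATTGCTTTAT",
    "ATCCCGGTCGAGGTGCACTTGATGTTTTCAATGGAAATGTTGCC",
    "TCTGCAGATTTATGGCCCAACGCTCATTTAACAATTAAAGTGGG",
    "GCATTAACTCTCACTTCAAAAAATCATATAAACACCTCTAATAT",
    "TATATTTTCTCGCCACTTAAATAGTTTTCAATGCCAATGGCAGG",
    "ATCCTTATCGAAGCACTTGGATTTTAAAGCAATCTTTTGAACAC",
]


def kmer_sequence_counts(seqs, k):
    """Count, for every k-mer, how many sequences contain it at least once."""
    counts = Counter()
    for seq in seqs:
        # A set, so that repeats within one sequence are counted once.
        words = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(words)
    return counts


counts = kmer_sequence_counts(SEQS, 7)
for word, n in counts.most_common(5):
    print(word, n)
print("CACTTGA is present in", counts["CACTTGA"], "of", len(SEQS), "sequences")
```

CACTTGA occurs exactly in five of the nine sequences; the remaining sequences carry only near matches (for example CACTTCA or CACTTAA), which is one reason exact word counting misses degenerate motifs.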

Finding binding sites in the genome

How meaningful are the sites we find?
• Only experiments can tell us for sure
• However, we can get some hints using statistical analysis

Example 1:
We just found the motif CACTTGA upstream of co-expressed genes. Is it over-represented in this set compared to a random selection of genes?

Search 100 random sets of genes. Find the mean and standard deviation.

z = (observed − expected) / standard deviation
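As a sketch of Example 1, the z-score can be computed as below. The background counts are invented placeholders, not real data; in practice they would be the motif counts from each of the 100 random gene sets, and their mean serves as the expected value.

```python
import statistics


def z_score(observed, background_counts):
    """z = (observed - mean of background) / sample standard deviation."""
    expected = statistics.mean(background_counts)
    sd = statistics.stdev(background_counts)
    if sd == 0:
        raise ValueError("background counts have zero variance")
    return (observed - expected) / sd


# Hypothetical motif counts from five random gene sets (placeholder data).
background = [1, 2, 3, 4, 5]
# Hypothetical count in the co-expressed set.
print(z_score(6, background))
```

A large positive z suggests over-representation, although the interpretation assumes the background counts are roughly normally distributed.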

Finding binding sites in the genome

Example 2:
Many regulatory regions contain multiple binding sites for the same transcription factor. Is the motif found an unusually large number of times in a short stretch of sequence?

(very) crudely:
Probability of finding a 7 bp motif: 4^-7 = 1/16,384
i.e., expect only about 1 motif every 16 kb.
Thus, finding several close together is very unlikely.
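The crude estimate can be pushed one step further with a Poisson approximation. The sketch below assumes equal base frequencies, independent positions, one strand, and a 500 bp window chosen only as an example; none of these assumptions come from the slide beyond the 4^-7 figure.

```python
import math


def expected_hits(motif_length, window_bp):
    """Expected chance matches, assuming each base has probability 1/4."""
    return window_bp * 0.25 ** motif_length


def prob_at_least(k, lam):
    """Poisson probability of seeing k or more events when the mean is lam."""
    below = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return 1.0 - below


lam = expected_hits(7, 500)  # about 0.03 chance matches per 500 bp
print(lam)
print(prob_at_least(3, lam))  # chance of three or more matches in that window
```

Under these assumptions, three or more chance occurrences of a 7 bp motif within 500 bp have a probability below one in 100,000, which is the sense in which a cluster is "very unlikely".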


Transcription factors, binding sites, and target genes

[Diagram. Steps shown:]
• identify transcription factors
• identify binding motif
• find all motifs in genome
• identify target genes

[Approaches listed beside the steps:]
• computational searching; ChIP-chip
• computational searching; microarrays; genetic screens
• bioinformatics (e.g., Gibbs sampling on microarray data); molecular biology using purified protein or protein extracts
• genetic screens; one-hybrid assays; sequence motifs/homology

Finding cis-Regulatory Modules (enhancers)

Genes are often regulated in a modular fashion: discrete cis-regulatory elements (CRMs, “enhancers”) dictate a specific spatio-temporal expression pattern.

[Figure: map of 3’ regulatory region of eve (Fujioka et al. 1999)]

Typically, CRMs are identified empirically, by deletion analysis and reporter gene assays.

However, we can also search for CRMs computationally.

Finding cis-Regulatory Modules (enhancers)

Most methods to identify CRMs search for co-occurrence of one or more TFBSs.

All of the current methods are based on one or more of the following:

• knowledge of one or more relevant TFBSs

• availability of training set of ≥ 10 known CRMs

• phylogenetic footprinting

• assumption that one or more TFBSs are over-represented

This last assumption appears generally true, but is possibly a result of selection bias; it has not been seriously tested.

In fact, a major drawback to much of this work is an absence of thorough empirical validation.

Finding CRMs—examples

simple comparative genomics:

align sequence of two or more species

[Figure: % identity (seq1 vs seq2) plotted along the sequence, with a predicted regulatory element marked]
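A plot of this kind can be computed from a pairwise alignment with a sliding window. The sketch below assumes the two sequences are already aligned (gaps written as "-") and counts a gap column as a mismatch; the function name and the example strings are hypothetical.

```python
def window_identity(aligned_a, aligned_b, window, step=1):
    """Return (start, percent identity) for each window of an alignment.

    Both inputs must be rows of the same alignment, so equal in length.
    Gap columns ("-") never count as matches.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    a, b = aligned_a.upper(), aligned_b.upper()
    results = []
    for start in range(0, len(a) - window + 1, step):
        pairs = zip(a[start:start + window], b[start:start + window])
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        results.append((start, 100.0 * matches / window))
    return results


# Hypothetical toy alignment, for illustration only.
for start, identity in window_identity("ACGTACGTAA--TTGCA",
                                       "ACGTTCGTAAGGTTGCA", window=8, step=3):
    print(start, round(identity, 1))
```

Windows whose identity stays above a chosen cutoff would be the candidate conserved regions; picking the window size and cutoff is a judgment call that depends on the evolutionary distance between the species.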

Finding CRMs—examples

motif density/clustering: e.g., Berman et al. (2002) PNAS 99:757

• assemble PWMs

• determine density cutoff based on training set

• search genome (needs PWM cutoff)

• look for ≥ 13 motifs/500 bp

• (apply sequence conservation filters)

requires a certain number of motifs per sequence window
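The density criterion itself is easy to sketch once motif hit positions are available. The Python below is not the published implementation of Berman et al.; it only illustrates the "at least N hits within a window" test, with the slide's 13 hits per 500 bp as defaults and hit positions assumed to come from a prior PWM scan.

```python
from bisect import bisect_left


def dense_windows(hit_positions, window_bp=500, min_hits=13):
    """Return (window start, number of hits) for windows meeting the cutoff.

    A window begins at a hit and covers [start, start + window_bp).
    """
    positions = sorted(hit_positions)
    found = []
    for i, start in enumerate(positions):
        # Index of the first hit that falls outside this window.
        end_index = bisect_left(positions, start + window_bp)
        n_hits = end_index - i
        if n_hits >= min_hits:
            found.append((start, n_hits))
    return found


# Hypothetical hit coordinates with a lowered cutoff, to keep the example small.
print(dense_windows([0, 100, 200, 1000], window_bp=500, min_hits=3))
```

Overlapping windows that pass the cutoff would normally be merged into one predicted CRM; that step is left out here.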

Finding CRMs—examples

logistic regression analysis (LRA): e.g., Krivan and Wasserman (2001) Genome Res. 11:1599

• assemble PWMs

• sum scores for training set and background set

• search genome

• (apply sequence conservation filters)

Logistic Regression: a classification technique used to divide between binary choices
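For orientation, the sketch below shows how a fitted logistic model turns per-PWM scores for a sequence window into a probability of the "CRM" class. The coefficients and feature values are invented placeholders; fitting them from training and background sets is not shown, and this is not the published model of Krivan and Wasserman.

```python
import math


def logistic_probability(features, coefficients, intercept):
    """p = 1 / (1 + exp(-(intercept + sum(b_i * x_i))))."""
    if len(features) != len(coefficients):
        raise ValueError("need one coefficient per feature")
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical summed PWM scores for three factors in one window,
# with made-up coefficients standing in for fitted values.
window_scores = [4.2, 0.0, 7.5]
coefficients = [0.3, 0.8, 0.1]
p = logistic_probability(window_scores, coefficients, intercept=-2.0)
print(p, "CRM" if p >= 0.5 else "background")
```

The 0.5 decision cutoff is also an assumption; in a genome scan a stricter cutoff would usually be chosen to control false positives.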


Finding CRMs—examples

Stubb: Sinha et al. (2003) Bioinformatics 19 Suppl 1: i292; Sinha et al. (2004) BMC Bioinformatics 5:129

• assemble PWMs

• score sequence windows

• can incorporate phylogenetic data

• can incorporate relationships between motifs

• would probably be my method of choice

Uses a form of Hidden Markov Model to score how likely it is that a sequence S was generated by a probabilistic process that uses motifs from set W, rather than by a random background process

Finding CRMs—examples

PFR-sampler/searcher: Grad et al. (2004) Bioinformatics. In press: doi:10.1093/bioinformatics/bth320

• does not require PWMs or knowledge of TFs!

• requires set of similarly expressed genes

• incorporates phylogenetic data

• can discover motifs

• untested but very promising

Uses subsequence profiling (k-mer frequency) to find phylogenetically conserved regions that are most similar to one another

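Let’s look at the paper:
An even skipped Muscle and Heart Enhancer (MHE)

[Figure: map of the eve locus with positions +6.0, +6.3, and +12.0 marked; MHE indicated]

An even skipped muscle and heart enhancer (MHE)

[Figure: same map, with MHE binding sites labeled Mad, Tin, dTCF, Ets (RTK/Ras), and Twi]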

Similarity between D. melanogaster and D. virilis eve MHE

[Alignment shown in five blocks; the match line and the exact positions of the site labels were lost in extraction. Upper sequence: D. melan.; lower sequence: D. virilis.]

Sites labeled: Ets1, 2; Ets3
D. melan.   actcatccccattggcggctgggagagtcgTtgGGcaTCCGGAAgcGgggGCagCcAAaAataTACccac----------ATCcCATGGA
D. virilis  ------------------------------TATGG-TTCCGGAATTGT--GCGCCAAATATAGTACAAGTATCTTTGAGTATCTCATGGA

Sites labeled: Tin1
D. melan.   TGC---------------------cATCAATTAGCATACAATTAAAAAATGCTTAAACAgGGAAATtgtCT---------------
D. virilis  TGCGCAACGGTTTTGTGACAATGCAATCAATTAGCATACAATTAAAAAATGCTTAAACAAGGAAATCAACTCGAGCTGCAGACGCA

Sites labeled: Mad4; Twi1; Mad5
D. melan.   ---------------tgggaTGcgagTggTtCGGcCGCAgaTGcAgCcGcAgCaGCATtTGTATCT------------------------
D. virilis  CACACAGCCTTATTTGTATCTGTATCTAATGCGGGCGCAACTGTATCTGTAACTGCATCTGTATCTGTATCTCTGTCTGTGTCTGTATCT

Sites labeled: Tin2; dTCF; Mad6; Tin3
D. melan.   -------------------------ccAAGTGGcggGCaGCAGATCAAAGCGACGAC---AACATAATTGCTgcTTCACTTCACAGT------
D. virilis  CTGTCTGTGTCTGTGCAGATGCCTTGTAAGTGG---GCCGCAGATCAAAGCGACGACGACAACATAATTGCTCGTTCACTTCACAGTTTTTTG

Sites labeled: Tin4; Twi2; Ets4
D. melan.   ------------------tctCAggCACTTAAgataTaCATATgtATgTtgcaTACA-TatctAttgcgAgtCcGgatctgcA
D. virilis  ATGACTTCGACATGGCCACAGCAAACACTTAATG-CTCCATAT--ATTTCAGCTACACTGCAAAAAAATAAACGGTCCTAAAA


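A hbr dorsal mesodermal enhancer

[Figure panels: α-Hbr/α-Eve; α-Hbr/α-βgal; α-Eve/α-βgal. Panels carry the label TP.]

The Twi and Ets sites in the hbr DME are required for proper enhancer function

[Figure panels: Hbr and β-gal staining for WT, Twi k.o., and Ets k.o. constructs; expression scored in tracheal pits (non-mesodermal) and in P2, P15 (mesodermal)]

[Figure panels: WT MHE and motif A mut, stained with α-Eve and α-βgal]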

“Systems Biology”: Gene Regulatory Networks

“. . . despite all the examples of how individual genes affect the developmental process, there is yet no case where the lines of causality can be mapped from the genomic sequence to a major process of bilaterian development. . . .

. . . biological approaches have focused on determining the functions of one or a few genes at a time, an approach that is not adequate for analysis of large regulatory control systems organized as networks. The heart of such networks consists of genes encoding transcription factors and the cis-regulatory elements that control the expression of those genes. Each of these cis-regulatory elements receives multiple inputs from other genes in the network. . .

By determining the succession of DNA sequence-based cis-regulatory transactions that govern spatial gene expression, closure can be brought to the question of why any particular piece of development actually happens.”

Davidson et al. (2002) Science 295:1669

“Systems Biology”: Gene Regulatory Networks

Even knowing just a little of this gets incredibly complicated:

[Figure: regulatory gene network for sea urchin endomesoderm specification. Davidson et al. (2002) Science 295:1669]

Putting it all together: