Prader-Willi & Angelman Syndromes
• Both of these genetic disorders are caused by deletion of a region of chromosome 15.
• However, the syndromes differ:
– Prader-Willi Syndrome (PWS) - obesity, intellectual disability, short stature.
– Angelman Syndrome (AS) - uncontrollable laughter, jerky movements, and other motor and mental symptoms.
• Which syndrome develops depends upon the parent that provided the mutant chromosome.
Goal : Identify loci associated with variation in expression levels
[Figure labels: genomic DNA, mRNA, nucleus, regulators, target]
Data
Centre d'Etude du Polymorphisme Humain (CEPH) families are Utah residents with ancestry from northern and western Europe.
• 14 families with genotype and expression data available for all parents and a mean of eight offspring (range 7-9)
Method: Linkage analysis
[Figure: three pedigrees, each with parents A1A2 and A3A4 and a pair of siblings]
• Siblings A1A3 and A1A3: IBD=2
• Siblings A1A3 and A1A4: IBD=1
• Siblings A1A3 and A2A4: IBD=0
IBD: identical-by-descent
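The IBD counts in the pedigrees above can be reproduced with a few lines of Python. This is only a sketch under the assumption made in the figure, namely that the four parental alleles are distinct, so that identical labels imply identity by descent. The function names are illustrative and do not come from the study.

```python
from collections import Counter
from itertools import product

def sib_genotypes(father, mother):
    """All possible offspring as (paternal allele, maternal allele) pairs."""
    return list(product(father, mother))

def ibd_shared(sib1, sib2):
    """Number of alleles two full sibs share identical-by-descent.

    Assumes the four parental alleles are distinct, so equal labels mean IBD.
    """
    return int(sib1[0] == sib2[0]) + int(sib1[1] == sib2[1])

father, mother = ("A1", "A2"), ("A3", "A4")

# The three sibling pairs shown in the figure.
print(ibd_shared(("A1", "A3"), ("A1", "A3")))  # both alleles shared
print(ibd_shared(("A1", "A3"), ("A1", "A4")))  # one allele shared
print(ibd_shared(("A1", "A3"), ("A2", "A4")))  # no allele shared

# Distribution of IBD sharing over all equally likely sib pairs.
kids = sib_genotypes(father, mother)
ibd_distribution = Counter(ibd_shared(a, b) for a in kids for b in kids)
```

Over the 16 equally likely ordered sib pairs, sharing of 0, 1, and 2 alleles occurs in the ratio 1:2:1. Linkage analysis compares observed sharing against this baseline.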
Under criterion 1:
• 27/142 (19%) expression phenotypes have only a single cis-regulator.
• 110/142 (77.5%) expression phenotypes have only a single trans-regulator.
• 2/142 have both a cis- and a trans-acting regulator.
• 3/142 have two trans-acting regulators.
Under criterion 2, 164/984 (16.7%) have multiple regulators.
Cis- and trans-regulation
GAL Genes: Eukaryotic Transcriptional Regulation
• Unlike prokaryotes, eukaryotes do not have genes in operons (most mRNAs are not polycistronic).
• The GAL genes of S. cerevisiae are the paradigm for eukaryotic gene regulation
• Galactose is metabolized by GAL gene products:
Galactose → Gal-1-P (Gal1p)
Gal-1-P + UDP-Glu → Glu-1-P + UDP-Gal (Gal7p)
UDP-Gal → UDP-Glu (Gal10p)
Glu-1-P → Glu-6-P (Gal5p)
Glu-6-P → Glycolysis
Eukaryotic Transcription
[Figure labels: distal, proximal]
• Proteins bind to distal elements called ENHANCERS.
• DNA folding allows these elements to be far from the start site for transcription.
• Proteins bound to the distal sites promote the binding of RNA polymerase to the proximal elements.
GAL Genes: A Transcriptional Program
• The response to galactose is very complex, with a number of genes being turned on or off.
• The central regulator is a protein called Gal4p.
– Gal4p binds to enhancer elements in DNA and activates transcription under some circumstances.
Gal4p: A Transcriptional Regulator
• Gal4p binds to enhancer elements near genes that it regulates (e.g., GAL1).
• Gal4p also binds to Gal80p.
– Gal80p is necessary for the regulation of gene expression by galactose.
• In the presence of galactose, the Gal4p-Gal80p complex can activate transcription.
– This activation has now been studied at the level of the whole genome:
• This figure shows data from a microarray experiment (Science 290:2306 [2000]).
Examining Transcriptional Regulation
• MICROARRAYS have become very popular as tools to study gene regulation.
– A microarray is a small glass slide on which cDNAs of many (or all) genes in an organism have been spotted.
– cDNA is made using mRNAs present under certain conditions (or in a certain tissue) and labeled with fluorescent dyes.
– Then, the labeled cDNAs are hybridized to the microarray and the fluorescence is measured.
• There is a nice animation describing this at:
– http://www.bio.davidson.edu/courses/genomics/chip/chip.html
– Does this examine transcriptional regulation?
Examining Transcriptional Regulation
• This basic method was extended for the Gal4p study that we have been discussing.
– For this study, the researchers tagged the Gal4p protein so they could purify it from the cell.
– Then, they chemically cross-linked it to DNA and purified it.
– This allowed them to purify the DNA that Gal4p was bound to in the cell.
– The DNA that Gal4p was bound to in the cell was labeled and used to probe the microarray.
– Does this examine transcriptional regulation?
Examining Transcriptional Regulation
• This study established several interesting facts:
– Some Gal4p binding sites in the DNA are bound by Gal4p even in the absence of galactose; others are bound only in the presence of galactose.
– So the trigger is more complex than simply whether or not the Gal4p protein can bind.
– This more complex regulation involves Gal80p, an inhibitor.
Two possible models for regulation of the Gal4p-Gal80p complex by galactose. The models differ only in the exact binding sites for Gal80p.
How do Eukaryotic Transcriptional Regulators Work?
• There are a few specific types of proteins that act to increase transcriptional activity:
– Many proteins have an acidic domain.
• Surprisingly, these “acid-blob” proteins often require a hydrophobic residue embedded in an acidic region.
• Both Gal4p and the herpes simplex virus VP16 protein (a transcriptional regulator for this virus) have acid blobs.
– Glutamine-rich and proline-rich transcriptional activation domains have been characterized.
• These protein regions activate transcription when fused to other DNA-binding domains.
– Alternatively, they can be recruited by protein-protein interactions - e.g., a DNA-binding protein binds the enhancer, and it contains a region that recruits an acid-blob protein.
Using Eukaryotic Transcriptional Regulators
• The yeast 2-hybrid system exploits these features of eukaryotic transcription factors to examine protein-protein interactions.
– The DNA-binding and transcription-activating regions of Gal4p can be separated.
– Interestingly, if you fuse one protein to the Gal4p DNA-binding domain (BD) and a second protein that physically interacts with it to the Gal4p transcriptional activating domain (AD), you can see transcriptional activation:
How do Eukaryotic Transcriptional Regulators Work?
• Another interesting phenomenon that is sometimes seen with transcription factors is SQUELCHING.
– Overexpression of transcription activators like Gal4p can result in a general inhibition of transcriptional activity.
– How does this happen?
– Presumably, specific transcription factors like Gal4p act by recruiting “basal” transcription factors.
• In fact, some basal factors that physically interact with these transcription activating domains have been found.
• Basal factors are factors involved in recruiting RNA polymerase II to a large number of promoters.
– So overexpressing proteins with these transcription activating domains can actually turn gene expression off, by competing for these factors.
How do Eukaryotic Transcriptional Regulators Work?
• At least one way is by altering the packing of DNA into chromatin.
• The role of chromatin structure in the regulation of transcription is an area of very active investigation.
• However, two important factors that play clear roles in transcriptional regulation are known:
– DNA METHYLATION - A subset of cytosine (C) residues are modified by methylation.
– HISTONE ACETYLATION - Histones can be modified by acetylation.
Chromatin
• Remember, DNA in eukaryotes packs into CHROMATIN.
• HISTONES form the NUCLEOSOME, which DNA loops around.
• EUCHROMATIN - less compact; actively transcribed
• HETEROCHROMATIN - more compact; transcriptionally inactive.
– Heterochromatin can be either constitutive or facultative.
DNA Methylation
• Genes that are transcriptionally inactive are often METHYLATED.
– In eukaryotes, cytosine residues are modified by methylation.
• Typically, the sites of methylation are CG dinucleotides (vertebrates).
– This allows maintenance through replication.
[Figure: chemical structures of CYTOSINE and METHYL-C (cytosine carrying a CH3 group)]
Histone Acetylation
• HISTONES in transcriptionally active genes are often ACETYLATED.
• Acetylation is the modification of lysine residues in histones.
– Reduces positive charge, weakens the interaction with DNA.
– Makes DNA more accessible to RNA polymerase II
• Enzymes that ACETYLATE HISTONES are recruited to actively transcribed genes.
• Enzymes that remove acetyl groups from histones are recruited to methylated DNA.
– There are additional types of histone modification as well, such as methylation of the histones.
Genetic Imprinting
• Remember that DNA methylation can be maintained through replication.
• This allows the packing of chromatin to be passed on - just like a gene sequence.
– However, differences in chromatin packing are not as stable as gene sequences.
• Heritable but potentially reversible changes in gene expression are called EPIGENETIC phenomena.
– Vertebrates use these differences in chromatin packing to IMPRINT certain patterns of gene regulation.
– Some genes show MATERNAL IMPRINTING while others show PATERNAL IMPRINTING.
• The alleles of some genes that are inherited from the relevant parent are methylated, and therefore are not expressed.
Prader-Willi & Angelman Syndromes
• Both of these genetic disorders are caused by deletion of a region of chromosome 15.
• However, the syndromes differ:
– Prader-Willi Syndrome (PWS) - obesity, intellectual disability, short stature.
– Angelman Syndrome (AS) - uncontrollable laughter, jerky movements, and other motor and mental symptoms.
• Which syndrome develops depends upon the parent that provided the mutant chromosome.
Prader-Willi & Angelman Syndromes
• Prader-Willi Syndrome - develops when the abnormal copy of chromosome 15 is inherited from the father.
• Angelman Syndrome - develops when the abnormal copy of chromosome 15 is inherited from the mother.
• The differences reflect the fact that some loci are IMPRINTED - so only the allele inherited from one parent is expressed.
– The region contains both maternally and paternally imprinted genes.
Methylation and Gene Regulation
• For imprinted genes, the pattern of gene regulation is dependent upon the parent that donated the chromosome.
– The methylation pattern is “reprogrammed” in the germ line.
• There are other examples of methylation changes that regulate gene expression.
– In mammals, one of the two X chromosomes in females is inactivated.
– The inactivated X is methylated.
Genomics, Bioinformatics, and Gene Regulation
Marc S. Halfon, [email protected]
Department of Biochemistry
Center of Excellence in Bioinformatics and the Life Sciences
Based on presentation for UB/CCR Summer Program in Bioinformatics 2004
As of 6/25/04 (figures in parentheses: as of 7/25/05)
1128 (1496) genome projects:
• 199 (274) complete (includes 28 (36) eukaryotes)
• 508 (728) prokaryotic genomes in progress
• 421 (494) eukaryotic genomes in progress
Genome sizes:
• Nanoarchaeum equitans (archaebacterium; smallest): 500 kb
• Bacillus anthracis (anthrax): 5,228 kb
• S. cerevisiae (yeast): 12,069 kb
• Arabidopsis thaliana: 115,428 kb
• Drosophila melanogaster (fruit fly): 137,000 kb
• Anopheles gambiae (malaria mosquito): 278,000 kb
• Oryza sativa (rice): 420,000 kb
• Mus musculus (mouse): 2,493,000 kb
• Homo sapiens (human): 2,900,000 kb
http://www.genomesonline.org/
Genome Sequencing
Genome sequencing helps in:
• identifying new genes (“gene discovery”)
• looking at chromosome organization and structure
• finding gene regulatory sequences
• comparative genomics
These in turn lead to advances in:
• medicine
• agriculture
• biotechnology
• understanding evolution and other basic science questions
Because of the vast amounts of data that are generated, we need new approaches:
• high throughput assays
• robotics
• high speed computing
• statistics
• bioinformatics
What’s in a genome?
Genes (i.e., protein coding)
But . . . less than 2% of the human genome encodes proteins
Other than protein coding genes, what is there?
• genes for noncoding RNAs (rRNA, tRNA, miRNAs, etc.)
• structural sequences (scaffold attachment regions)
• regulatory sequences
• “junk” (including transposons, retroviral insertions, etc.)
It’s still uncertain/controversial how much of the genome is composed of any of these classes
The answers will come from experimentation and bioinformatics. We will discuss further only gene regulation.
Gene expression must be regulated in:
TIME
Wolpert, L. (2002) Principles of Development New York: Oxford University Press. p. 31
What happens when gene regulation goes awry?
• Disease
– chronic myeloid leukemia
– rheumatoid arthritis
• Developmental abnormalities (birth defects)
photo credits: Wolpert, L. (2002) Principles of Development New York: Oxford University Press. pp. 183, 340
Genes can be regulated at many levels
The “Central Dogma”: DNA → (TRANSCRIPTION) → RNA → (TRANSLATION) → PROTEIN
• transcription
• post transcription (RNA stability)
• post transcription (translational control)
• post translation (not considered gene regulation)
usually, when we speak of gene regulation, we are referring to transcriptional regulation
[Figure label: the “transcriptome”]
One way of looking at the transcriptome is with DNA microarrays. With microarrays, the expression of thousands of genes can be assessed in a single experiment.
cDNAs or oligonucleotides representing all genes in the genome are deposited on a glass slide using a robotic arrayer:
Looking at the transcriptome: DNA microarrays
Benfey, P. and Protopapas, A. Genomics. 2005. New Jersey: Pearson Prentice Hall. pp. 131-2
Exploring the Metabolic and Genetic Control of Gene Expression on a Genomic Scale
Joseph L. DeRisi, Vishwanath R. Iyer, Patrick O. Brown*
MicroArray
• Allows measuring the mRNA level of thousands of genes in one experiment -- system level response
• The data generation can be fully automated by robots
• Common experimental themes:
– Time Course (when)
– Tissue Type (where)
– Response (under what conditions)
– Perturbation: Mutation/Knockout, Knock-in, Over-expression
Looking at the transcriptome: DNA microarrays
[Figure: extract mRNA from cell type A and cell type B, make labeled cDNA, and hybridize to the microarray; spots indicate genes expressed more in “A”, more in “B”, or equally in A & B]
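A two-color comparison like the one in this figure is commonly summarized as a log2 ratio of the two channel intensities. The sketch below illustrates that idea only. The two-fold cut-off and the function name `classify_spot` are assumptions made for the example, not part of the slides.

```python
from math import log2

def classify_spot(intensity_a, intensity_b, fold=2.0):
    """Label one microarray spot from its two channel intensities.

    `fold` is an assumed cut-off: a gene counts as differentially
    expressed when one channel is at least `fold` times the other.
    """
    ratio = log2(intensity_a / intensity_b)
    cutoff = log2(fold)
    if ratio >= cutoff:
        return "more in A"
    if ratio <= -cutoff:
        return "more in B"
    return "equal in A & B"

# Hypothetical intensities for three spots.
for a, b in [(800.0, 100.0), (100.0, 400.0), (120.0, 100.0)]:
    print(a, b, classify_spot(a, b))
```

Working on the log scale makes a two-fold increase and a two-fold decrease symmetric around zero.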
Looking at the transcriptome: microarrays
[Figure: expression matrix (genes × conditions) built from condition 1, condition 2, and condition 3, followed by statistical processing and analysis]
Which genes to select?
• For each gene (row) compute a score defined by
score = (sample mean of X - sample mean of Y) / (standard deviation of X + standard deviation of Y)
• X = ALL, Y = AML
• Genes (rows) with highest scores are selected.
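The score above is easy to compute per gene. The following is a minimal sketch, assuming the sample (n - 1) standard deviation. The expression values and gene names are hypothetical and serve only to show the ranking step.

```python
from statistics import mean, stdev

def selection_score(x, y):
    """(mean of X - mean of Y) / (sd of X + sd of Y) for one gene.

    Uses the sample standard deviation (n - 1 denominator); this choice
    is an assumption, since the slide does not specify it.
    """
    return (mean(x) - mean(y)) / (stdev(x) + stdev(y))

# Hypothetical expression levels: gene -> (ALL samples, AML samples).
expression = {
    "geneA": ([8.0, 9.0, 10.0], [2.0, 3.0, 4.0]),
    "geneB": ([5.0, 6.0, 7.0], [5.0, 7.0, 6.0]),
}

scores = {gene: selection_score(x, y) for gene, (x, y) in expression.items()}
ranked = sorted(scores, key=scores.get, reverse=True)  # highest score first
print(ranked, scores)
```

A gene scores highly when the two class means are far apart relative to the spread within each class. Here geneA separates the classes and geneB does not.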
Seems to work! Improvement?
• 34 new leukemia samples
• 29 are predicted with 100% accuracy; 5 are weak prediction cases
That seems to work well.
They have a method
Study of cell-cycle regulated genes
• Rate of cell growth and division varies
• Yeast (120 min), insect egg (15-30 min); nerve cell (no); fibroblast (healing wounds)
• Regulation: irregular growth causes cancer
• Goal: find what genes are expressed at each stage of the cell cycle
• Yeast cells; Spellman et al (2000)
• Fourier analysis: cyclic pattern
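One simple way to look for a cyclic pattern is to measure the strength of a single Fourier component at the expected cell-cycle period. The sketch below is illustrative and is not the analysis used in the cited study. The 120-minute period is taken from the yeast figure above, and the sampling times and expression profiles are hypothetical.

```python
from math import cos, sin, pi, sqrt

def periodicity_score(values, times, period):
    """Magnitude of the Fourier component of a time course at `period`."""
    centre = sum(values) / len(values)
    a = sum((v - centre) * cos(2 * pi * t / period) for v, t in zip(values, times))
    b = sum((v - centre) * sin(2 * pi * t / period) for v, t in zip(values, times))
    return sqrt(a * a + b * b)

# Hypothetical sampling every 10 minutes over two 120-minute cycles.
times = list(range(0, 240, 10))
cycling_gene = [sin(2 * pi * t / 120) for t in times]
flat_gene = [1.0 for _ in times]

cycling_score = periodicity_score(cycling_gene, times, period=120)
flat_score = periodicity_score(flat_gene, times, period=120)
print(cycling_score, flat_score)
```

A gene whose expression rises and falls once per cycle gets a large score, while a constantly expressed gene scores zero.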
Why does clustering make sense biologically?
Rationale behind massive gene expression analysis:
The rationale is that genes with a high degree of expression similarity are likely to be functionally related and may participate in common pathways.
They may be co-regulated by common upstream regulatory factors.
Simply put, profile similarity implies functional association.
• Pearson's correlation coefficient, a simple way of describing the strength of linear association between a pair of random variables, has become the most popular measure of gene expression similarity.
1. Cluster analysis: average linkage, self-organizing map, K-means, ...
2. Classification: nearest neighbor, linear discriminant analysis, support vector machine, ...
3. Dimension reduction methods: PCA (SVD)
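Pearson's correlation between two expression profiles can be computed directly from its definition. A minimal sketch follows. The three profiles are hypothetical values chosen so that the results are easy to check by hand.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical expression profiles over four conditions.
profiles = {
    "gene1": [1.0, 2.0, 3.0, 4.0],
    "gene2": [2.0, 4.0, 6.0, 8.0],   # same shape as gene1, twice the scale
    "gene3": [4.0, 3.0, 2.0, 1.0],   # mirror image of gene1
}

r12 = pearson(profiles["gene1"], profiles["gene2"])
r13 = pearson(profiles["gene1"], profiles["gene3"])
print(r12, r13)
```

The coefficient ignores scale and offset: gene1 and gene2 correlate perfectly although their absolute levels differ, which is why it is often used as a similarity measure before clustering.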
Gene profiles and correlation
The correlation coefficient (CC) has been used by Gauss, Bravais, Edgeworth, ... Its sweeping impact in data analysis is due to Galton (1822-1911), “Typical laws of heredity in man”.
Karl Pearson modified and popularized its use.
A building block in multivariate analysis, of which clustering, classification, and dimension reduction are recurrent themes
As a statistician, how can you ignore the time order? (Isn’t it true that the use of sample correlation relies on the assumption that the data are i.i.d.?)
Mechanisms of transcriptional regulation
regulation in trans: transcription factors
regulation in cis: promoters & enhancers (binding sites)
Identifying transcription factor binding sites
Usually, binding sites are first determined empirically.
Most transcription factors can bind to a range of similar sequences. We can represent these in either of two ways, as a consensus sequence, or as a position weight matrix (PWM).
Once we know the binding site, we can search the genome to find all of the (predicted) binding sites.
Binding site (motif) representations
7 characterized binding sites for a certain transcription factor:
    TCCGGAAGC
    TCCGGATGC
    TCCGGATCT
    CATGGATGC
    CCAGGAAGT
    GGTGGATGC
    ACCGGATGC
consensus sequence:
    [T/C] C [C/T] G G A [T/A] [G/C] [C/T]
PWM and logo:
    position  1 2 3 4 5 6 7 8 9
    A         1 1 1 0 0 7 2 0 0
    T         3 0 2 0 0 0 5 0 2
    G         1 1 0 7 7 0 0 6 0
    C         2 5 4 0 0 0 0 1 5
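The count matrix follows directly from the seven aligned sites: each column simply tallies how often each base occurs at that position. A short sketch, with an illustrative function name:

```python
SITES = [
    "TCCGGAAGC",
    "TCCGGATGC",
    "TCCGGATCT",
    "CATGGATGC",
    "CCAGGAAGT",
    "GGTGGATGC",
    "ACCGGATGC",
]

def count_matrix(sites):
    """Count each base at each position of equal-length aligned sites."""
    length = len(sites[0])
    counts = {base: [0] * length for base in "ATGC"}
    for site in sites:
        for position, base in enumerate(site):
            counts[base][position] += 1
    return counts

counts = count_matrix(SITES)
for base in "ATGC":
    print(base, " ".join(str(n) for n in counts[base]))
```

Each column sums to 7, the number of sites. Invariant positions such as 4 to 6 (G, G, A) show up as columns with a single non-zero entry.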
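Consensus sequences make searching easy, e.g. by using regular expressions in Perl:
    while (my $line = <SEQUENCE>) {
        # [TC] matches T or C; a "|" inside the brackets would match a literal pipe
        if ($line =~ /[TC]C[TC]GGA[TA][GC][CT]/) {
            # do something with the matching line
        }
    }
All positions in the motif are treated the same.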
Finding binding sites in the genome
Finding binding sites in the genome
    position  1 2 3 4 5 6 7 8 9
    A         1 1 1 0 0 7 2 0 0
    T         3 0 2 0 0 0 5 0 2
    G         1 1 0 7 7 0 0 6 0
    C         2 5 4 0 0 0 0 1 5
A PWM allows us to assign more importance to more invariant positions. We can calculate a score based on the probability of a given nucleotide being in a given position.
TCCGGAAGC scores higher than TCCGGATCT, as GC is preferred over CT in the last two positions
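One common way to turn the counts into a score is a log-odds sum against a background base frequency. The sketch below assumes a pseudocount of 1 and a uniform background of 0.25 per base; both are choices made for this example, not values given in the slides.

```python
from math import log2

# Count matrix from the seven characterized sites (positions 1-9).
COUNTS = {
    "A": [1, 1, 1, 0, 0, 7, 2, 0, 0],
    "T": [3, 0, 2, 0, 0, 0, 5, 0, 2],
    "G": [1, 1, 0, 7, 7, 0, 0, 6, 0],
    "C": [2, 5, 4, 0, 0, 0, 0, 1, 5],
}
N_SITES = 7

def pwm_score(seq, pseudocount=1.0, background=0.25):
    """Sum of per-position log-odds scores for a candidate site.

    The pseudocount (assumed) keeps unseen bases from scoring minus infinity.
    """
    score = 0.0
    for position, base in enumerate(seq):
        freq = (COUNTS[base][position] + pseudocount) / (N_SITES + 4 * pseudocount)
        score += log2(freq / background)
    return score

score_gc = pwm_score("TCCGGAAGC")
score_ct = pwm_score("TCCGGATCT")
print(score_gc, score_ct)
```

Unlike the regular expression, which accepts both sequences equally, the PWM ranks them. The GC ending outweighs the less common A at position 7, so TCCGGAAGC comes out ahead.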
Finding binding sites in the genome
Binding site motifs can be predicted computationally from the regulatory regions of genes with similar expression patterns.
For instance, the promoter regions of genes that cluster in a microarray experiment can be used.
(How can the promoter regions be extracted? You should know enough Perl at this point to be able to do this, given a well-annotated sequence database.)
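As one possible answer to the question above, here is a sketch in Python rather than Perl. It assumes the annotation supplies a 0-based transcription start position and a strand for each gene; the function names and the toy chromosome are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string (upper-case A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def upstream_region(chromosome, tss, strand, length=1000):
    """Return up to `length` bases upstream of a transcription start site.

    `tss` is the 0-based index of the first transcribed base. The result
    is always written 5'->3' relative to the gene, so minus-strand genes
    are reverse-complemented.
    """
    if strand == "+":
        return chromosome[max(0, tss - length):tss]
    if strand == "-":
        end = min(len(chromosome), tss + 1 + length)
        return reverse_complement(chromosome[tss + 1:end])
    raise ValueError("strand must be '+' or '-'")

# Toy example on a 10-base "chromosome".
chromosome = "ACGTTGCAAT"
print(upstream_region(chromosome, tss=4, strand="+", length=3))
print(upstream_region(chromosome, tss=4, strand="-", length=3))
```

The regions are truncated at the chromosome ends instead of raising an error, so genes near the start or end of a sequence yield shorter promoters.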
seq1: TTTTTATTTTTCTGAATCACCACTTGATATTGCTTCACAGAACT
seq2: CGGGCGGTGAGGCAGAGAAAGAGACCACTTGAAATGTAGTAATA
seq3: CACTTGAATTTTTCTGCACGCAGTTTTTATTTTTACTTTTCTTG
seq4: CGCGTTCGTTATTTGTTGTTGACCACTTGAATTGATTGCTTTAT
seq5: ATCCCGGTCGAGGTGCACTTGATGTTTTCAATGGAAATGTTGCC
seq6: TCTGCAGATTTATGGCCCAACGCTCATTTAACAATTAAAGTGGG
seq7: GCATTAACTCTCACTTCAAAAAATCATATAAACACCTCTAATAT
seq8: TATATTTTCTCGCCACTTAAATAGTTTTCAATGCCAATGGCAGG
seq9: ATCCTTATCGAAGCACTTGGATTTTAAAGCAATCTTTTGAACAC
Finding binding sites in the genome
[Figure: the same nine sequences, seq1 to seq9, with the shared sub-sequence highlighted]
A Gibbs sampling algorithm can then find the common sub-sequences:
Of course, we must now discover which transcription factor binds this sequence.
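A Gibbs sampler is stochastic and tolerates mismatches. As a much simpler, deterministic stand-in (not Gibbs sampling), the sketch below counts how many of the nine sequences contain each exact 7-mer and reports the most widely shared one. The motif length of 7 is an assumption, and only the forward strand is searched.

```python
from collections import Counter

SEQUENCES = [
    "TTTTTATTTTTCTGAATCACCACTTGATATTGCTTCACAGAACT",
    "CGGGCGGTGAGGCAGAGAAAGAGACCACTTGAAATGTAGTAATA",
    "CACTTGAATTTTTCTGCACGCAGTTTTTATTTTTACTTTTCTTG",
    "CGCGTTCGTTATTTGTTGTTGACCACTTGAATTGATTGCTTTAT",
    "ATCCCGGTCGAGGTGCACTTGATGTTTTCAATGGAAATGTTGCC",
    "TCTGCAGATTTATGGCCCAACGCTCATTTAACAATTAAAGTGGG",
    "GCATTAACTCTCACTTCAAAAAATCATATAAACACCTCTAATAT",
    "TATATTTTCTCGCCACTTAAATAGTTTTCAATGCCAATGGCAGG",
    "ATCCTTATCGAAGCACTTGGATTTTAAAGCAATCTTTTGAACAC",
]

def most_shared_kmer(seqs, k=7):
    """Return (k-mer, number of sequences containing it) for the top k-mer."""
    support = Counter()
    for seq in seqs:
        # A set, so each sequence counts at most once per k-mer.
        support.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return support.most_common(1)[0]

motif, n_sequences = most_shared_kmer(SEQUENCES)
print(motif, n_sequences)
```

Exact matching recovers CACTTGA from seq1 to seq5 but misses the one-mismatch variants in the other sequences (for example CACTTCA in seq7), which is the kind of degeneracy a probabilistic method such as Gibbs sampling is designed to handle.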
Finding binding sites in the genome
How meaningful are the sites we find?
• Only experiments can tell us for sure
• However, we can get some hints using statistical analysis
Example 1: We just found the motif CACTTGA upstream of co-expressed genes. Is it over-represented in this set compared to a random selection of genes?
Search 100 random sets of genes. Find the mean and standard deviation of the motif count.
z = (observed - expected) / standard deviation
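The z-score calculation can be sketched as follows. The background counts are hypothetical placeholders; in practice they would come from searching the random gene sets (100 of them in the example above, only 10 here for brevity).

```python
from statistics import mean, stdev

def z_score(observed, background_counts):
    """z = (observed - expected) / standard deviation.

    `expected` and the standard deviation are estimated from motif counts
    in random gene sets (sample standard deviation, an assumed choice).
    """
    expected = mean(background_counts)
    return (observed - expected) / stdev(background_counts)

# Hypothetical motif counts in 10 random gene sets of the same size.
background = [2, 4, 3, 5, 1, 3, 4, 2, 3, 3]
observed = 12  # hypothetical count in the co-expressed set

z = z_score(observed, background)
print(round(z, 2))
```

A large positive z means the motif occurs far more often in the co-expressed set than in comparable random sets. It is a hint of over-representation and does not prove function.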
Finding binding sites in the genome
Example 2: Many regulatory regions contain multiple binding sites for the same transcription factor. Is the motif found an unusually large number of times in a short stretch of sequence?
Crudely: Probability of finding a 7 bp motif at a given position: 4^-7 = 1/16,384, i.e., expect only about 1 motif every 16 kb. Thus, finding several close together is very unlikely by chance.
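The crude estimate can be pushed one step further with a Poisson approximation for the number of chance matches in a window. This sketch assumes equal base frequencies, independent positions, and a single strand. The 500 bp window is a hypothetical choice for illustration.

```python
from math import exp, factorial

def match_probability(k=7):
    """Chance that a fixed k-mer matches at one position (uniform bases)."""
    return 4.0 ** -k

def prob_at_least(n, window, k=7):
    """Poisson approximation to P(at least n chance matches in a window)."""
    lam = (window - k + 1) * match_probability(k)  # expected number of matches
    below = sum(exp(-lam) * lam ** i / factorial(i) for i in range(n))
    return 1.0 - below

print(1 / match_probability())      # 16384.0: about one match every 16 kb
print(prob_at_least(3, window=500))  # chance of 3 or more matches in 500 bp
```

With these assumptions, three or more chance matches in 500 bp have a probability of a few in a million, which is why clusters of sites are taken as a hint of a real regulatory region.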
Transcription factors, binding sites, and target genes
[Figure: workflow linking four steps (identify transcription factors; identify binding motif; find all motifs in genome; identify target genes), annotated with the methods used:]
• computational searching; ChIP-chip
• computational searching; microarrays; genetic screens
• bioinformatics (e.g., Gibbs sampling on microarray data); molecular biology using purified protein or protein extracts
• genetic screens; one-hybrid assays; sequence motifs/homology
How well does it work?
•Although not always that difficult computationally, these approaches are complex biologically
•Predicted and in vitro binding data do not always accurately reflect what takes place in vivo
•Transcription factor binding can be affected by local concentration, by chromatin structure, and by interactions with other transcription factors
•Many predicted sites may therefore have no actual role
•Functional testing of predictions is very important
Gene regulation is combinatorial— several transcription factors bind simultaneously
We can search for co-occurrence of multiple transcription factors to try to identify regulatory modules
Another way to try to find regulatory modules is through comparative genomics
Putting things together: cis-Regulatory Modules (enhancers)
[Figure: % identity (seq 1 vs seq 2) plotted along the sequence, with a predicted regulatory element marked]
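A plot like this can be produced by sliding a window along two aligned sequences and recording the percent identity in each window. The sketch below assumes the sequences are already aligned to equal length, with "-" marking gaps. The window size, step, and toy sequences are hypothetical.

```python
def window_identity(aligned1, aligned2, window=10, step=1):
    """Percent identity in sliding windows over two aligned sequences.

    Returns a list of (window start, percent identity). Gap columns ("-")
    never count as matches.
    """
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have equal length")
    profile = []
    for start in range(0, len(aligned1) - window + 1, step):
        a = aligned1[start:start + window]
        b = aligned2[start:start + window]
        matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
        profile.append((start, 100.0 * matches / window))
    return profile

# Toy alignment: one mismatch in the first window, none in the second.
profile = window_identity("ACGTACGTAC", "ACGTTCGTAC", window=5, step=5)
print(profile)
```

Windows whose identity stays well above the surrounding non-coding sequence are candidates for conserved regulatory elements, which still need functional testing.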
Why bother?
Ultimately, we’d like to be able to describe all of development in terms of gene expression and regulation.
That is, in every cell, at every time, which genes are on or off, and why?
Davidson et al. (2002) Science 295:1669
Even knowing just a little of this gets incredibly complicated:
Regulatory gene network for sea urchin endomesoderm specification
Gene Regulatory Networks
But imagine understanding how we go from here . . .
. . . to here . . .
. . . to here!
http://nobelprize.org/medicine
http://www.alphascientists.com/embryology_images/cleavage_stage_embryos.html
Further Reading:
Wasserman, W. W. and A. Sandelin (2004). "Applied Bioinformatics For The Identification Of Regulatory Elements." Nature Reviews Genetics 5(4): 276-287.
Halfon, M. S. and A. M. Michelson (2002). "Exploring Genetic Regulatory Networks in Metazoan Development: Methods and Models." Physiol Genomics 10(3): 131-43.
Davidson, E. H. (2001). Genomic Regulatory Systems. San Diego, Academic Press.
Carroll, S. B., J. K. Grenier, et al. (2001). From DNA to Diversity. Molecular Genetics and the Evolution of Animal Design. Massachusetts, Blackwell Science.