1BCB 444/544 F07 ISU Dobbs #28- Promoter Prediction 10/29/07
BCB 444/544
Lecture 28
Gene Prediction - finish it
Promoter Prediction
#28_Oct29
Required Reading (before lecture)
Mon Oct 29 - Lecture 28
Promoter & Regulatory Element Prediction
• Chp 9 - pp 113 - 126
Wed Oct 31 - Lecture 29
Phylogenetics Basics
• Chp 10 - pp 127 - 141
Thurs Nov 1 - Lab 9
Gene & Regulatory Element Prediction
Fri Nov 2 - Lecture 30
Phylogenetic Tree Construction Methods & Programs
• Chp 11 - pp 142 - 169
Assignments & Announcements
Mon Oct 29 - HW#5 - will be posted today
HW#5 = Hands-on exercises with phylogenetics and tree-building software
Due: Mon Nov 5 (not Fri Nov 1 as previously posted)
BCB 544 "Team" Projects
Last week of classes will be devoted to Projects
• Written reports due: Mon Dec 3 (no class that day)
• Oral presentations (20-30 min) will be: Wed-Fri Dec 5, 6, 7
• 1 or 2 teams will present during each class period
See Guidelines for Projects posted online
BCB 544 Only: New Homework Assignment
544 Extra#2
Due: √PART 1 - ASAP
PART 2 - meeting prior to 5 PM Fri Nov 2
Part 1 - Brief outline of Project, email to Drena & Michael
after response/approval, then:
Part 2 - More detailed outline of project
Read a few papers and summarize status of problem
Schedule meeting with Drena & Michael to discuss ideas
Seminars this Week
BCB List of URLs for Seminars related to Bioinformatics: http://www.bcb.iastate.edu/seminars/index.html
• Nov 1 Thurs - BBMB Seminar 4:10 in 1414 MBB
  • Todd Yeates, UCLA: TBA - something cool about structure and evolution?
• Nov 2 Fri - BCB Faculty Seminar 2:10 in 102 ScI
  • Bob Jernigan, BBMB, ISU: Control of Protein Motions by Structure
Chp 8 - Gene Prediction
SECTION III GENE AND PROMOTER PREDICTION
Xiong: Chp 8 Gene Prediction
• Categories of Gene Prediction Programs
• Gene Prediction in Prokaryotes
• Gene Prediction in Eukaryotes
Computational Gene Prediction: Approaches
• Ab initio methods
  • Search by signal: find DNA sequences involved in gene expression
  • Search by content: test statistical properties distinguishing coding from non-coding DNA
• Similarity-based methods
  • Database search: exploit similarity to proteins, ESTs, cDNAs
  • Comparative genomics: exploit aligned genomes
    • Do other organisms have similar sequence?
• Hybrid methods - best
Computational Gene Prediction: Algorithms
1. Neural Networks (NNs) (more on these later…)
e.g., GRAIL
2. Linear discriminant analysis (LDA) (see text)
e.g., FGENES, MZEF
3. Markov Models (MMs) & Hidden Markov Models (HMMs)
e.g., GeneSeqer - uses MMs
GENSCAN - uses 5th order HMMs - (see text)
HMMgene - uses conditional maximum likelihood (see text)
Signals Search
Approach: Build models (PSSMs, profiles, HMMs, …) and search against DNA. Detected instances provide evidence for genes
Content Search
Observation: Encoding a protein affects statistical properties of DNA sequence:
• Nucleotide/amino acid distribution
• GC content (CpG islands, exon/intron)
• Uneven usage of synonymous codons (codon bias)
• Hexamer frequency - most discriminative of these for identifying coding potential
Method: Evaluate these differences (coding statistics) to differentiate between coding and non-coding regions
Human Codon Usage
Predicting Genes based on Codon Usage Differences
Algorithm: Process sliding window
• Use codon frequencies to compute probability of coding versus non-coding
• Plot log-likelihood ratio: $\log\left(\dfrac{P(S \mid \text{coding})}{P(S \mid \text{non-coding})}\right)$
[Figure: Coding Profile of ß-globin gene, with exons marked]
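The sliding-window idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the codon frequencies below are made up (real values would come from a codon usage table for the organism), only one reading frame is scored, and the function names `window_log_likelihood` and `coding_profile` are hypothetical.

```python
import math

def window_log_likelihood(window, coding_freq, noncoding_freq, floor=1e-6):
    """Sum log(P(codon | coding) / P(codon | non-coding)) over in-frame codons."""
    score = 0.0
    for i in range(0, len(window) - 2, 3):
        codon = window[i:i + 3]
        p_coding = coding_freq.get(codon, floor)        # floor avoids log(0)
        p_noncoding = noncoding_freq.get(codon, floor)
        score += math.log(p_coding / p_noncoding)
    return score

def coding_profile(seq, coding_freq, noncoding_freq, window=30, step=3):
    """Return (start, score) for each window; positive scores favor coding."""
    return [
        (start, window_log_likelihood(seq[start:start + window],
                                      coding_freq, noncoding_freq))
        for start in range(0, len(seq) - window + 1, step)
    ]

# Toy (made-up) frequencies: GCC enriched in coding DNA, AAA depleted.
coding = {"GCC": 0.04, "AAA": 0.01}
noncoding = {"GCC": 1 / 64, "AAA": 1 / 64}   # uniform background assumption
toy_seq = "GCC" * 10 + "AAA" * 10
profile = coding_profile(toy_seq, coding, noncoding)
```

Plotting the second element of each pair against the first gives a coding profile of the kind shown for ß-globin; a step of 3 keeps every window in the same reading frame.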
Similarity-Based Methods: Database Search
In different genomes: Translate DNA into all 6 reading frames and search against proteins (TBLASTX, BLASTX, etc.)
Within same genome: Search with EST/cDNA database (EST2genome, BLAT, etc.).
Problems:
• Will not find “new” or RNA genes (non-coding genes)
• Limits of similarity are hard to define
• Small exons might be overlooked
ATTGCGTAGGGCGCTTAACGCATCCCGCGA
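"Translate DNA into all 6 reading frames" can be made concrete with a short sketch. It assumes the standard genetic code and uppercase A/C/G/T input; the helper names (`translate`, `six_frames`) are hypothetical and this is not how BLASTX or TBLASTX are implemented internally.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Translate the three forward and three reverse-strand reading frames."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames

# The example sequence shown on the slide.
frames = six_frames("ATTGCGTAGGGCGCTTAACGCATCCCGCGA")
```

Each of the six translations would then be compared against a protein database; stop codons appear as `*`.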
Similarity-Based Methods: Comparative Genomics
Idea: Functional regions are more conserved than non-functional ones; high similarity in alignment indicates gene
Advantages:
• May find uncharacterized or RNA genes
Problems:
• Finding suitable evolutionary distance
• Finding limits of high similarity (functional regions)
human GGTTTT--ATGAGTAAAGTAGACACTCCAGTAACGCGGTGAGTAC----ATTAA
        |     ||||| ||||| |||   ||||||||||||| |||||    |  |
mouse C-TCAGGAATGAGCAAAGTCGAC---CCAGTAACGCGGTAAGTACATTAACGA-
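"High similarity in alignment" is usually quantified as percent identity. A minimal sketch, assuming the two sequences are already aligned with "-" for gaps and that identity is counted over columns where neither sequence has a gap (other conventions exist); `percent_identity` is a hypothetical helper name.

```python
def percent_identity(aln_a, aln_b):
    """Percent of gap-free alignment columns in which both sequences agree."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aln_a, aln_b) if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)

# The human/mouse alignment shown on the slide.
human = "GGTTTT--ATGAGTAAAGTAGACACTCCAGTAACGCGGTGAGTAC----ATTAA"
mouse = "C-TCAGGAATGAGCAAAGTCGAC---CCAGTAACGCGGTAAGTACATTAACGA-"
identity = percent_identity(human, mouse)
```

Where the limits of "high" similarity lie is exactly the problem listed on the slide: the useful cutoff depends on the evolutionary distance between the species compared.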
Human-Mouse Homology
Comparison of 1196 orthologous genes
• Sequence identity between genes in human vs mouse
  • Exons: 84.6%
  • Protein: 85.4%
  • Introns: 35%
  • 5’ UTRs: 67%
  • 3’ UTRs: 69%
Thanks to Volker Brendel, ISU for the following Figs & Slides
Slightly modified from:
BBSI Genome Informatics Module http://www.bioinformatics.iastate.edu/BBSI/course_desc_2005.html#moduleB
V Brendel
Brendel et al (2004) Bioinformatics 20: 1157
Spliced Alignment Algorithm
GeneSeqer - Brendel et al. - ISU http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
• Perform pairwise alignment with large gaps in one sequence (due to introns)
  • Align genomic DNA with cDNA, ESTs, protein sequences
• Score semi-conserved sequences at splice junctions
  • Using Bayesian probability model & 1st order MM
• Score coding constraints in translated exons
  • Using Bayesian model
[Figure: intron bounded by splice sites, GT at the donor site and AG at the acceptor site]
Brendel et al (2004) Bioinformatics 20: 1157 http://bioinformatics.oxfordjournals.org/cgi/content/abstract/20/7/1157
Brendel 2005
Splice Site Detection
Do DNA sequences surrounding splice "consensus" sequences contribute to splicing signal? YES
• Information Content $I_i$:
  $I_i = 2 + \sum_{B \in \{U, C, A, G\}} f_{iB} \log_2(f_{iB})$
• Extent of Splice Signal Window:
  $I_i \le \bar{I} + 1.96\,\sigma_{\bar{I}}$
  i: ith position in sequence
  Ī: avg information content over all positions >20 nt from splice site
  σĪ: avg sample standard deviation of Ī
Brendel 2005
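The information content calculation can be sketched as follows. Assumptions: f_iB is taken to be the observed frequency of base B at position i across a set of aligned splice sites, the four toy sites are made up, and treating positions above Ī + 1.96σ as part of the signal is one reading of the window criterion rather than the published GeneSeqer procedure. The function names are hypothetical.

```python
import math
from statistics import mean, stdev

def information_content(column):
    """I_i = 2 + sum over B in {U, C, A, G} of f_iB * log2(f_iB) for one column."""
    n = len(column)
    total = 2.0
    for base in "UCAG":
        f = column.count(base) / n
        if f > 0:                      # 0 * log2(0) is taken as 0
            total += f * math.log2(f)
    return total

def information_profile(sites):
    """Information content at each position of equal-length aligned sites."""
    return [information_content(col) for col in zip(*sites)]

def signal_positions(ic_values, background):
    """Positions whose information content exceeds mean + 1.96 * sd of background."""
    threshold = mean(background) + 1.96 * stdev(background)
    return [i for i, value in enumerate(ic_values) if value > threshold]

# Toy donor-site alignment (made up); columns 1-2 are the invariant GU.
toy_sites = ["GGUAAG", "AGUGAG", "GGUAAG", "CGUAUG"]
toy_profile = information_profile(toy_sites)
```

A fully conserved position scores 2 bits and a position with all four bases equally frequent scores 0 bits, which is what the information content plots for donor (GT) and acceptor (AG) sites display.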
Information Content vs Position
[Figure: two panels, HumanT2_GT and HumanT2_AG; x-axis: position, -50 to 50; y-axis: information content, 0.0 to 0.8]
Which sequences are exons & which are introns? How can you tell?
Brendel 2005
Markov Model for Spliced Alignment
[Figure: state diagram with exon states e_n, e_n+1 and intron states i_n, i_n+1; transition labels: PG, PA(n)PG, (1-PG)PD(n+1), (1-PG)(1-PD(n+1)), 1-PA(n)]
Brendel 2005
Evaluation of Splice Site Prediction
Fig 5.11, Baxevanis & Ouellette 2005
TP = positive instance correctly predicted as positive
FP = negative instance incorrectly predicted as positive
TN = negative instance correctly predicted as negative
FN = positive instance incorrectly predicted as negative
Right!
Evaluation of Predictions
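                   Actual True     Actual False
Predicted True     TP              FP              PP = TP + FP
Predicted False    FN              TN              PN = FN + TN
                   AP = TP + FN    AN = FP + TN

• Sensitivity (Coverage, Recall): $\mathrm{TP}/\mathrm{AP}$
• Misclassification rates: $\alpha = \mathrm{FN}/\mathrm{AP}$, $\beta = \mathrm{FP}/\mathrm{AN}$
• Specificity: $\mathrm{TP}/\mathrm{PP} = \dfrac{1-\alpha}{1-\alpha+r\beta}$, where $r = \mathrm{AN}/\mathrm{AP}$
• Normalized specificity: $\sigma = \dfrac{1-\alpha}{1-\alpha+\beta}$
[Figure labels: Predicted Positives, True Positives, False Positives]
Do not memorize this!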
Evaluation of Predictions - in English
• Sensitivity (= Coverage = Recall)
  In English? Sensitivity is the fraction of all positive instances having a true positive prediction.
• Specificity
  In English? Specificity is the fraction of all predicted positives that are, in fact, true positives.
IMPORTANT: in medical jargon, Specificity is sometimes defined differently (what we define here as "Specificity" is sometimes referred to as "Positive predictive value")
IMPORTANT: Sensitivity alone does not tell us much about performance because a 100% sensitivity can be achieved trivially by labeling all test cases positive!
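These definitions are easy to check numerically. The sketch below uses the lecture's definitions (sensitivity = TP/AP, specificity = TP/PP) and reads the normalized specificity as (1 - α)/(1 - α + β); the counts are made up and `evaluate` is a hypothetical helper name.

```python
def evaluate(tp, fp, tn, fn):
    """Return the lecture's measures from raw confusion-matrix counts."""
    ap = tp + fn            # actual positives
    an = fp + tn            # actual negatives
    pp = tp + fp            # predicted positives
    alpha = fn / ap         # fraction of actual positives that were missed
    beta = fp / an          # fraction of actual negatives called positive
    return {
        "sensitivity": tp / ap,
        "specificity": tp / pp,   # positive predictive value in medical jargon
        "alpha": alpha,
        "beta": beta,
        "normalized_specificity": (1 - alpha) / (1 - alpha + beta),
    }

# A reasonable predictor on 100 true sites and 900 non-sites (made-up counts).
balanced = evaluate(tp=80, fp=20, tn=880, fn=20)

# Labeling every test case positive: perfect sensitivity, poor specificity.
all_positive = evaluate(tp=100, fp=900, tn=0, fn=0)
```

The second call illustrates the warning above: sensitivity is 1.0 while specificity drops to 0.1, so the two measures have to be reported together.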
Best Measures for Comparison?
• ROC curves (Receiver Operating Characteristic) (?!!)
  http://en.wikipedia.org/wiki/Roc_curve
  In signal detection theory, a receiver operating characteristic (ROC), or ROC curve, is a plot of sensitivity vs (1 - specificity) for a binary classifier system as its discrimination threshold is varied. The ROC can also be represented equivalently by plotting fraction of true positives (TPR = true positive rate) vs fraction of false positives (FPR = false positive rate).
  Note: in this ROC definition, specificity has its usual statistical meaning (TN/AN), not the positive predictive value that these slides call "Specificity".
• Correlation Coefficient: Matthews correlation coefficient (MCC)
  MCC = 1 for a perfect prediction
        0 for a completely random assignment
       -1 for a "perfectly incorrect" prediction
Do not memorize this!
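For reference, the MCC and a single ROC point can be computed from the same four counts. This sketch uses the standard MCC formula, (TP·TN - FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)); returning 0.0 when the denominator is zero is a common convention and an assumption here, and the function names are hypothetical.

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0                     # convention when a row or column is empty
    return (tp * tn - fp * fn) / denominator

def roc_point(tp, fp, tn, fn):
    """(FPR, TPR) for one threshold; a ROC curve joins these over all thresholds."""
    fpr = fp / (fp + tn)               # 1 - (TN / AN)
    tpr = tp / (tp + fn)               # sensitivity
    return fpr, tpr

perfect = mcc(tp=50, fp=0, tn=50, fn=0)
random_like = mcc(tp=25, fp=25, tn=25, fn=25)
inverted = mcc(tp=0, fp=50, tn=0, fn=50)
```

Sweeping a score threshold and calling `roc_point` at each setting traces out the ROC curve; the three MCC values reproduce the 1, 0, and -1 cases listed above.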
GeneSeqer: Input http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi (Brendel 2005)
GeneSeqer: Output (Brendel 2005)
GeneSeqer: Gene Evidence Summary (Brendel 2005)
Gene Prediction - Problems & Status?
Common errors?
• False positive intergenic regions:
  • 2 annotated genes actually correspond to a single gene
• False negative intergenic region:
  • One annotated gene structure actually contains 2 genes
• False negative gene prediction:
  • Missing gene (no annotation)
• Other:
  • Partially incorrect gene annotation
  • Missing annotation of alternative transcripts
Current status?
• For ab initio prediction in eukaryotes: HMMs have better overall performance for detecting intron/exon boundaries
  • Limitation? Training data: predictions are organism specific
• Combined ab initio/homology based predictions: Improved accuracy
  • Limitation? Availability of identifiable sequence homologs in databases
Recommended Gene Prediction Software
• Ab initio
  • GENSCAN: http://genes.mit.edu/GENSCAN.html
  • GeneMark.hmm: http://exon.gatech.edu/GeneMark/
  • others: GRAIL, FGENES, MZEF, HMMgene
• Similarity-based
  • BLAST, GenomeScan, EST2Genome, Twinscan
• Combined:
  • GeneSeqer, http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
  • ROSETTA
Consensus: because results depend on organisms & specific task, always use more than one program!
• Two servers that report consensus predictions
  • GeneComber
  • DIGIT
Other Gene Prediction Resources: at ISU http://www.bioinformatics.iastate.edu/bioinformatics2go/
Other Gene Prediction Resources: GaTech, MIT, Stanford, etc.
Current Protocols in Bioinformatics (BCB/ISU owns a copy - currently in my lab!)
Chapter 4 Finding Genes
4.1 An Overview of Gene Identification: Approaches, Strategies, and Considerations
4.2 Using MZEF To Find Internal Coding Exons
4.3 Using GENEID to Identify Genes
4.4 Using GlimmerM to Find Genes in Eukaryotic Genomes
4.5 Prokaryotic Gene Prediction Using GeneMark and GeneMark.hmm
4.6 Eukaryotic Gene Prediction Using GeneMark.hmm
4.7 Application of FirstEF to Find Promoters and First Exons in the Human Genome
4.8 Using TWINSCAN to Predict Gene Structures in Genomic DNA Sequences
4.9 GrailEXP and Genome Analysis Pipeline for Genome Annotation
4.10 Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences
Lists of Gene Prediction Software
http://www.bioinformaticsonline.org/links/ch_09_t_1.html
http://cmgm.stanford.edu/classes/genefind/
Chp 9 - Promoter & Regulatory Element Prediction
SECTION III GENE AND PROMOTER PREDICTION
Xiong: Chp 9 Promoter & Regulatory Element Prediction
• Promoter & Regulatory Elements in Prokaryotes
• Promoter & Regulatory Elements in Eukaryotes
• Prediction Algorithms
Eukaryotes vs Prokaryotes: Genomes
Eukaryotic genomes
• Are packaged in chromatin & sequestered in a nucleus
• Are larger and have multiple linear chromosomes
• Contain mostly non-protein coding DNA (98-99%)
Prokaryotic genomes
• DNA is associated with a nucleoid, but no nucleus
• Much smaller, usually single, circular chromosome
• Contain mostly protein encoding DNA
Eukaryotes vs Prokaryotes: Gene Structure
Eukaryotes vs Prokaryotes: Genes
Eukaryotic genes
• Are larger and more complex than in prokaryotes
• Contain introns that are “spliced” out to generate mature mRNAs*
• Often undergo alternative splicing, giving rise to multiple RNAs*
• Are transcribed by 3 different RNA polymerases (instead of 1, as in prokaryotes)
* In biology, statements such as this include an implicit “usually” or “often”
Eukaryotes vs Prokaryotes: Levels of Gene Regulation
Primary level of control?
• Prokaryotes: Transcription initiation
• Eukaryotes: Transcription is also very important, but
  • Expression is regulated at multiple levels, many of which are post-transcriptional:
    • RNA processing, transport, stability
    • Translation initiation
    • Protein processing, transport, stability
    • Post-translational modification (PTM)
    • Subcellular localization
Recent important discoveries: small regulatory RNAs (miRNA, siRNA) are abundant and play very important roles in controlling gene expression in eukaryotes, often at post-transcriptional levels
Eukaryotes vs Prokaryotes: Regulatory Elements
• Prokaryotes:
  • Promoters & operators (for operons) - cis-acting DNA signals
  • Activators & repressors - trans-acting proteins (we won't discuss these…)
• Eukaryotes:
  • Promoters & enhancers (for single genes) - cis-acting
  • Transcription factors - trans-acting
• Important difference?
  • What the RNA polymerase actually binds
Prokaryotic Promoters
• RNA polymerase complex recognizes promoter sequences located very close to and on 5’ side (“upstream”) of transcription initiation site
• Prokaryotic RNA polymerase complex binds directly to promoter, by virtue of its sigma subunit - no requirement for “transcription factors” binding first
• Prokaryotic promoter sequences are highly conserved:
  • -10 region
  • -35 region
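Because the -35 and -10 regions are so well conserved, even a naive scan can find candidate prokaryotic promoters. The sketch below is an illustration only: the consensus hexamers TTGACA (-35) and TATAAT (-10) and the 15-19 bp spacer are the textbook E. coli sigma-70 values, which are not given on the slide, and `find_promoter_candidates` is a hypothetical helper, not BPROM or any published tool.

```python
MINUS_35 = "TTGACA"   # textbook E. coli sigma-70 consensus (assumption)
MINUS_10 = "TATAAT"   # textbook E. coli sigma-70 consensus (assumption)

def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def find_promoter_candidates(seq, max_mismatch=1, spacers=range(15, 20)):
    """Return (pos_35, pos_10, spacer) for each -35/-10 pair near consensus."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 5):
        if hamming(seq[i:i + 6], MINUS_35) > max_mismatch:
            continue
        for gap in spacers:
            j = i + 6 + gap                      # start of candidate -10 box
            box = seq[j:j + 6]
            if len(box) == 6 and hamming(box, MINUS_10) <= max_mismatch:
                hits.append((i, j, gap))
    return hits

# Toy sequence: perfect -35 box, 17 bp spacer, perfect -10 box.
toy = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GG"
candidates = find_promoter_candidates(toy)
```

Real promoters deviate from consensus, so allowing more mismatches finds more true sites but also more false positives.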
Eukaryotic Promoters
• Eukaryotic RNA polymerase complexes do not bind directly to promoter sequences
• Transcription factors must bind first and serve as landmarks recognized by RNA polymerase complexes
• Eukaryotic promoter sequences are less highly conserved, but many promoters (for RNA polymerase II) contain:
  • -30 region "TATA" box
  • -100 region "CCAAT" box
Eukaryotic Promoters vs Enhancers
Both promoters & enhancers are binding sites for transcription factors (TFs)
• Promoters
  • essential for initiation of transcription
  • located “relatively” close to start site (usually <200 bp upstream, but can be located within gene, rather than upstream!)
• Enhancers
  • needed for regulated transcription (differential expression in specific cell types, developmental stages, in response to environment, etc.)
  • can be very far from start site (sometimes > 100 kb)
Eukaryotic genes are transcribed by 3 different RNA polymerases
(Location of promoter regions, TFBSs & TFs differ, too)
BIOS Scientific Publishers Ltd, 1999
Brown Fig 9.18
[Figure labels: mRNA; rRNA; tRNA, 5S RNA]
Prokaryotic Genes & Operons
• Genes with related functions are often clustered within operons (e.g., lac operon)
• Operons = genes with related functions that are transcribed and regulated as a single unit; one promoter controls expression of several proteins
• mRNAs produced from operons are “polycistronic” - a single mRNA encodes several proteins; i.e., there are multiple ORFs, each with its own AUG (START) & STOP codons, linked within one mRNA molecule
Promoter of lac operon in E. coli
(Transcribed by prokaryotic RNA polymerase)
BIOS Scientific Publishers Ltd, 1999
Brown Fig 9.17
Eukaryotic genes
• Genes with related functions are occasionally, but not usually clustered; instead, they share common regulatory regions (promoters, enhancers, etc.)
• Chromatin structure must also be “active” for transcription to occur
Eukaryotic genes have large & complex regulatory regions
• Cis-acting regulatory elements include: Promoters, enhancers, silencers
• Trans-acting regulatory factors include: Transcription factors (TFs), chromatin remodeling complexes, small RNAs
BIOS Scientific Publishers Ltd, 1999
Brown Fig 9.17
Eukaryotic Promoters: DNA sequences required for initiation, usually <200 bp from start site
Eukaryotic RNA polymerases bind by recognizing a complex of TFs bound at promoter
~250 bp
Pre-mRNA
First, TFs must bind short motifs (TFBSs) within promoters; then RNA polymerase can bind and initiate transcription of RNA
Eukaryotic promoters & enhancer regions often contain many different TFBS motifs
Fig 9.13, Mount 2004
Simplified View of Promoters in Eukaryotes
Fig 5.12, Baxevanis & Ouellette 2005
Eukaryotic Activators vs Repressors
Regions far from the promoter can act as "enhancers" or "repressors" of transcription by serving as binding sites for activator or repressor proteins (TFs)
• Activator proteins (TFs) bind to enhancers & interact with RNAP to stimulate transcription
• Repressors block the action of activators
[Figure labels: enhancer; repressor; promoter; RNAP; Gene; 100 - 50,000 bp; enhancer proteins interact with RNAP; repressor prevents binding of activator; transcription]
Eukaryotic Transcription Factors (TFs)
• Transcription factors = proteins that interact with the RNA polymerase complex to activate or repress transcription
• TFs often contain both:
  • a trans-activating domain
  • a DNA binding domain or motif (here motif = amino acid sequence in protein)
• TFs recognize and bind specific short DNA sequence motifs called “transcription factor binding sites” (TFBSs) (here motif = nucleotide sequence in DNA)
• Databases for TFs & TFBSs include:
  • TRANSFAC, http://www.generegulation.com/cgibin/pub/databases/transfac
  • JASPAR
Zinc Finger Proteins - Transcription Factors
• Common in eukaryotic proteins
• ~ 1% of mammalian genes encode zinc-finger proteins (ZFPs)
• In C. elegans, there are > 500 !
• Can be used as highly specific DNA binding modules
• Potentially valuable tools for directed genome modification (esp. in plants) & human gene therapy - one clinical trial will begin soon!
BIOS Scientific Publishers Ltd, 1999
Brown Fig 9.12
• Did you go to Dave Segal's seminar?
• Your TAs Pete & Jeff work on designing better ZFPs!
Promoter Prediction Algorithms & Software
Xiong -
Eukaryotes vs Prokaryotes: Promoter Prediction
Promoter prediction is much easier in prokaryotes
Why?
• Highly conserved
• Simpler gene structures
• More sequenced genomes! (for comparative approaches)
Methods?
• Previously: mostly HMM-based
• Now: similarity-based comparative methods, because so many genomes available
Xiong textbook:
1) "Manual method" = rules of Wang et al (see text)
2) BPROM - uses linear discriminant function
Predicting Promoters in Eukaryotes
Closely related to gene prediction!
• Obtain genomic sequence
• Use sequence-similarity based comparison (BLAST, MSA) to find related genes
  But: "regulatory" regions are much less well-conserved than coding regions
• Locate ORFs
• Identify Transcription Start Site (TSS) (if possible!)
• Use Promoter Prediction Programs
• Analyze motifs, etc. in DNA sequence (TRANSFAC, JASPAR)
Predicting promoters: Steps & Strategies
Identify TSS - if possible?
• One of biggest problems is determining exact TSS! Not very many full-length cDNAs!
• Good starting point? (human & vertebrate genes) Use FirstEF, found within UCSC Genome Browser, or submit to FirstEF web server
Fig 5.10, Baxevanis & Ouellette 2005
Automated Promoter Prediction Strategies
1) Pattern-driven algorithms (ab initio)
2) Sequence-driven algorithms (homology based)
3) Combined "evidence-based"
BEST RESULTS? Combined, sequential
1) Pattern-driven Algorithms
• Success depends on availability of collections of annotated transcription factor binding sites (TFBSs)
• Tend to produce very large numbers of false positives (FPs)
• Why?
  • Binding sites for specific TFs are often variable
  • Binding sites are short (typically 6-10 bp)
  • Interactions between TFs (& other proteins) influence both affinity & specificity of TF binding
  • One binding site often recognized by multiple TFs
  • Biology is complex: gene activation is often specific to organism/cell/stage/environmental condition; promoter and enhancer elements must mediate this
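A pattern-driven search typically scores every window of a sequence against a position weight matrix built from known binding sites. A minimal sketch, assuming a uniform background, a pseudocount of 1, and four made-up TATA-like sites (real matrices would come from a collection such as TRANSFAC or JASPAR); `build_pwm` and `scan` are hypothetical helper names, and the input must contain only A, C, G, T.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds position weight matrix from equal-length binding sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm

def scan(seq, pwm, threshold):
    """Return (position, score) for every window scoring at or above threshold."""
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        score = sum(pwm[k][seq[i + k]] for k in range(width))
        if score >= threshold:
            hits.append((i, round(score, 2)))
    return hits

# Made-up training sites and target sequence.
toy_sites = ["TATAAA", "TATAAT", "TATATA", "TATAAA"]
pwm = build_pwm(toy_sites)
strict_hits = scan("GCGCTATAAAGCGC", pwm, threshold=5.0)
loose_hits = scan("GCGCTATAAAGCGC", pwm, threshold=-3.0)
```

Lowering the threshold admits more windows, which is the false positive problem in miniature: a 6-10 bp motif is short enough to match by chance many times in a genome.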
Ways to Reduce FPs in ab initio Prediction
• Take sequence context/biology into account
  • Eukaryotes: clusters of TFBSs are common
  • Prokaryotes: knowledge of σ (sigma) factors helps
• Probability of "real" binding site higher if annotated transcription start site (TSS) is nearby
  But: What about enhancers? (no TSS nearby!) & only a small fraction of TSSs have been experimentally determined
• Do the wet lab experiments!
  But: Promoter-bashing can be tedious…
2) Sequence-driven Algorithms
• Assumption: Common functionality can be deduced from sequence conservation (Homology)
• Alignments of co-regulated genes should highlight elements involved in regulation
Careful: How to determine co-regulation?
1. Orthologous genes from different species
2. Genes experimentally shown to be co-regulated (using microarrays??)
Comparative promoter prediction:
• Phylogenetic footprinting
• Expression Profiling
Phylogenetic Footprinting
• Based on increasing availability of whole genome DNA sequences from many different species
• Selection of organisms for comparison is important
  • not too close, not too far: good = human vs mouse
• To reduce FPs, must extract non-coding sequences and then align them; prediction depends on good alignment
  • use MSA algorithms (e.g., CLUSTAL)
  • more sensitive methods
    • Gibbs sampling
    • Expectation Maximization (EM) methods
• Examples of programs:
  • Consite, rVISTA, PromH(W), Bayes aligner, Footprinter
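Once non-coding sequences from two species have been aligned, the footprinting step amounts to finding windows that are more conserved than their surroundings. A minimal sketch, assuming a pre-computed pairwise alignment with "-" for gaps and a fixed identity cutoff; the window size, cutoff, and the name `conserved_windows` are illustrative choices, not those of ConSite, rVISTA, or the other programs listed.

```python
def conserved_windows(aln_a, aln_b, window=10, min_identity=0.9):
    """Return (start, identity) for alignment windows at or above min_identity."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    hits = []
    for start in range(len(aln_a) - window + 1):
        pairs = zip(aln_a[start:start + window], aln_b[start:start + window])
        matches = sum(1 for a, b in pairs if a == b and a != "-")
        identity = matches / window          # gap columns count as mismatches
        if identity >= min_identity:
            hits.append((start, identity))
    return hits

# Toy alignment: a conserved 10 bp block followed by a diverged region.
species_a = "ACGTACGTAC" + "TTTTTTTTTT"
species_b = "ACGTACGTAC" + "GGGGGGGGGG"
footprints = conserved_windows(species_a, species_b, window=10, min_identity=1.0)
```

If the two species are too closely related, nearly every window passes the cutoff and the comparison carries no information, which is why the choice of organisms matters.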
Expression Profiling
• Based on increasing availability of whole genome mRNA expression data, esp., microarray data
• High-throughput simultaneous monitoring of expression levels of thousands of genes
• Assumptions: (sometimes valid, sometimes NOT)
  1. Co-expression implies co-regulation
  2. Co-regulated genes share common regulatory elements
• Drawbacks:
  1. Signals are short & weak! Requires Gibbs sampling or EM: e.g., MEME, AlignACE, Melina
  2. Prediction depends on determining which genes are co-expressed - usually by clustering - which can be error prone
• Examples of programs:
  • INCLUSive - combined microarray analysis & motif detection
  • PhyloCon - combined phylo footprinting & expression profiling
Problems with Sequence-driven Algorithms
• Need sets of co-regulated genes
• For comparative (phylogenetic) methods
  • Must choose appropriate species
  • Different genomes evolve at different rates
  • Classical alignment methods have trouble with translocations or inversions that change order of functional elements
  • If background conservation of entire region is high, comparison is useless
  • Not enough data (but Prokaryotes >>> Eukaryotes)
• Complexity: many regulatory elements are not conserved across species!
TRANSFAC Matrix Entry: for TATA box
Fields:
• Accession & ID
• Brief description
• TFs associated with this entry
• Weight matrix
• Number of sites used to build
• Other info
Fig 5.13, Baxevanis & Ouellette 2005
Global Alignment of Human & Mouse Obese Gene Promoters (200 bp upstream from TSS)
Fig 5.14, Baxevanis & Ouellette 2005
Annotated Lists of Promoter Databases & Promoter Prediction Software
• URLs from Mount textbook: Table 9.12 http://www.bioinformaticsonline.org/links/ch_09_t_2.html
• Table in Wasserman & Sandelin Nat Rev Genet article http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.htm
• URLs from Baxevanis & Ouellette textbook: http://www.wiley.com/legacy/products/subject/life/bioinformatics/ch05.htm#links
More lists:
• http://www.softberry.com/berry.phtml?topic=index&group=programs&subgroup=promoter
• http://bioinformatics.ubc.ca/resources/links_directory/?subcategory_id=104
• http://www3.oup.co.uk/nar/database/subcat/1/4/
Check out Optional Review & Try Associated Tutorial:
Wasserman WW & Sandelin A (2004) Applied bioinformatics for identification of regulatory elements. Nat Rev Genet 5:276-287
http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.html
Check this out: http://www.phylofoot.org/NRG_testcases/
Bottom line: this is a very "hot" area - new software for computational prediction of gene regulatory elements published every day!