Gene Prediction: Statistical...
Transcript of Gene Prediction: Statistical...
![Page 1: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/1.jpg)
Gene Prediction:
Statistical Approaches
![Page 2: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/2.jpg)
1. Central Dogma and Codons
2. Discovery of Split Genes
3. Splicing
4. Open Reading Frames
5. Codon Usage
6. Splicing Signals
7. TestCode
Outline
![Page 3: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/3.jpg)
Section 1:
Central Dogma and
Codons
![Page 4: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/4.jpg)
Gene Prediction: Computational Challenge
• Gene: A sequence of nucleotides coding for a protein.
• Gene Prediction Problem: Determine the beginning and end
positions of genes in a genome.
![Page 5: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/5.jpg)
Protein
RNA
DNA
transcription
translation
CCTGAGCCAACTATTGATGAA
PEPTIDE
CCUGAGCCAACUAUUGAUGAA
Central Dogma: DNA -> RNA -> Protein
![Page 6: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/6.jpg)
Central Dogma: Doubts
• Central Dogma was proposed in 1958 by Francis Crick.
• However, he had very little evidence.
• Before Crick‟s seminal paper, all possible information
transfers were considered viable.
• Crick postulated that some of them are not viable.
Pre-Crick Crick’s Proposal
![Page 7: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/7.jpg)
Codons
• 1961: Sydney Brenner and Francis Crick
discover frameshift mutations:
• These systematically delete nucleotides
from DNA.
• Single and double deletions
dramatically alter protein product.
• However, they noted that the effect of
triple deletions was minor.
• Conclusion: Every codon (triplet of
nucleotides) codes for exactly one
amino acid in a protein. Francis Crick
Sydney Brenner
![Page 8: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/8.jpg)
The Sly Fox
• In the following string:
THE SLY FOX AND THE SHY DOG
• Delete 1, 2, and 3 nucleotides after the first „S‟:
• 1 Nucleotide: THE SYF OXA NDT HES HYD OG
• 2 Nucleotides: THE SFO XAN DTH ESH YDO G
• 3 Nucleotides: THE SOX AND THE SHY DOG
• Which of the above makes the most sense?
• This is the idea behind each codon coding one amino acids.
![Page 9: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/9.jpg)
Translating Nucleotides into Amino Acids
• There are 43 = 64 possible codons, since there are four choices
for each of the three nucleotides in a codon.
• Genetic code is degenerative and redundant.
• Includes start and stop codons, whose only purpose is to
represent the beginning or end of an important sequence.
• Despite there being 64 codons, there are only 20 amino
acids.
• Therefore, an amino acid may be coded by more than one
codon.
![Page 10: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/10.jpg)
Great Discovery Provoking Wrong Assumption
• 1964: Charles Yanofsky and Sydney
Brenner prove collinearity in the order of
codons with respect to amino acids in proteins.
• 1967: Yanofsky and colleagues further
prove that the sequence of codons in a
gene determines the sequence of amino
acids in a protein.
• As a result, it was incorrectly assumed that the triplets
encoding for amino acid sequences form contiguous strips of
information.
Charles Yanofsky
![Page 11: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/11.jpg)
Section 2:
Discovery of Split Genes
![Page 12: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/12.jpg)
Discovery of Split Genes
• 1977: Phillip Sharp and Richard
Roberts experiment with mRNA of
hexon, a viral protein.
• They mapped hexon mRNA in viral
genome by hybridization to
adenovirus DNA and electron
microscopy.
• mRNA-DNA hybrids formed three
curious loop structures instead of
contiguous duplex segments.
Phillip Sharp Richard Roberts
![Page 13: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/13.jpg)
Discovery of Split Genes
• 1977: “Adenovirus Amazes at
Cold Spring Harbor” (Nature)
documents "mosaic molecules
consisting of sequences
complementary to several non-
contiguous segments of the viral
genome.”
• In other words, coding for a
protein occurs at disjoint,
nonconnected locations in the
genome.
![Page 14: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/14.jpg)
Exons and Introns
• In eukaryotes, the gene is a combination of coding segments
(exons) that are interrupted by non-coding segments (introns).
• Prokaryotes don‟t have introns—genes in prokaryotes are
continuous.
• Upshot: Introns make computational gene prediction in
eukaryotes even more difficult.
…AGGGTCTCATTGTAGACAGTGGTACTGATCAACGCAGGACTT…
Coding Non-coding Coding Non-coding
![Page 15: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/15.jpg)
aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcgg
Gene Prediction: Computational Challenge
![Page 16: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/16.jpg)
aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcgg
Gene Prediction: Computational Challenge
![Page 17: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/17.jpg)
aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcgg
Gene Prediction: Computational Challenge
Gene!
![Page 18: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/18.jpg)
Section 3:
Splicing
![Page 19: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/19.jpg)
exon1 exon2 exon3intron1 intron2
transcription
translation
splicing
Batzoglou
Central Dogma and Splicing
![Page 20: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/20.jpg)
Central Dogma and Splicing
![Page 21: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/21.jpg)
Splicing Signals
• Exons are interspersed with introns and typically flanked by
splicing signals: GT and AG.
• Splicing signals can be helpful in identifying exons.
• Issue: GT and AG occur so often that it is almost impossible to
determine when they occur as splicing signals and when they
don‟t.
![Page 22: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/22.jpg)
5’ 3’Donor site
Position
% -8 … -2 -1 0 1 2 … 17
A 26 … 60 9 0 1 54 … 21
C 26 … 15 5 0 1 2 … 27
G 25 … 12 78 99 0 41 … 27
T 23 … 13 8 1 98 3 … 25
From lectures by Serafim Batzoglou (Stanford)
Splice Site Deletions
![Page 23: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/23.jpg)
Donor: 7.9 bits
Acceptor: 9.4 bits
Consensus Splice Sites
![Page 24: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/24.jpg)
5’Promoter 3’
Promoters
• Promoters are DNA segments upstream of transcripts that
initiate transcription.
• A promoter attracts RNA Polymerase to the transcription start
site.
![Page 25: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/25.jpg)
(http://genes.mit.edu/chris/)
Splicing Mechanism
![Page 26: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/26.jpg)
From lectures by Chris Burge (MIT)
Splicing Mechanism
1. Adenine recognition site marks intron.
![Page 27: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/27.jpg)
Splicing Mechanism
From lectures by Chris Burge (MIT)
1. Adenine recognition site marks intron.
2. snRNPs bind around adenine recognition site.
![Page 28: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/28.jpg)
Splicing Mechanism
From lectures by Chris Burge (MIT)
1. Adenine recognition site marks intron.
2. snRNPs bind around adenine recognition site.
3. The spliceosome thus forms and excises introns in the
mRNA.
![Page 29: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/29.jpg)
Splicing Mechanism
From lectures by Chris Burge (MIT)
1. Adenine recognition site marks intron.
2. snRNPs bind around adenine recognition site.
3. The spliceosome thus forms and excises introns in the
mRNA.
![Page 30: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/30.jpg)
Two Approaches to Gene Prediction
1. Statistical: Exons have typical sequences on either end and
use different subwords than introns.
• Therefore, we can run statistical analysis on the subwords
of a sequence to locate potential exons.
2. Similarity-based: Many human genes are similar to genes in
mice, chicken, or even bacteria.
• Therefore, already known mouse, chicken, and bacterial
genes may help to find human genes.
![Page 31: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/31.jpg)
Gene Prediction Analogy
• Say we have a newspaper written in unknown language.
• Certain pages contain an encoded message, say 99 letters on
page 7, 30 letters on page 12 and 63 letters on page 15.
• How do you recognize the message?
• You could probably distinguish between the ads and the story
(ads contain the “$” sign often).
• Statistics-based approach to gene prediction tries to make
similar distinctions between exons and introns.
![Page 32: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/32.jpg)
Statistical Approach: Metaphor
• Noting the differing
frequencies of symbols
(e.g. „%‟, „.‟, „-‟) and
numerical symbols
could you distinguish
between a story and
the stock report in a
foreign newspaper?
![Page 33: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/33.jpg)
Similarity-Based Approach: Metaphor
• If you could compare
the day‟s news in
English, side-by-side
to the same news in a
foreign language,
some similarities may
become apparent.
![Page 34: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/34.jpg)
Genetic Code and Stop Codons
• UAA, UAG and UGA
correspond to 3 Stop codons
that (together with Start codon
ATG) delineate Open Reading
Frames.
![Page 35: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/35.jpg)
Section 4:
Open Reading Frames
![Page 36: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/36.jpg)
Stop and Start Codons
• Codons often appear exclusively to start/stop transcription:
• Start Codon: ATG
• Stop Codons: TAA, TAG, TGA
![Page 37: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/37.jpg)
Genomic Sequence
Open reading frame
ATG TGA
Open Reading Frames (ORFs)
• Detect potential coding regions by looking at Open Reading
Frames (ORFs):
• A genome of length n is comprised of (n/3) codons.
• Stop codons break genome into segments between
consecutive stop codons.
• The subsegments of these segments that start from the Start
codon (ATG) are ORFs.
• ORFs in different frames may overlap.
![Page 38: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/38.jpg)
GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG
CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC
GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG
GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG
GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG
CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC
CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC
CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC
6 Possible Frames for ORFs
• There are six total frames in which to find ORFs:
• Three possible ways of splitting the sequence into codons.
• We can “read” a DNA sequence either forward or backward.
• Illustration:
![Page 39: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/39.jpg)
GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG
CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC
GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG
GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG
GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG
CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC
CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC
CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC
6 Possible Frames for ORFs
• There are six total frames in which to find ORFs:
• Three possible ways of splitting the sequence into codons.
• We can “read” a DNA sequence either forward or backward.
• Illustration:
![Page 40: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/40.jpg)
GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG
CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC
GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG
GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG
GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG
CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC
CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC
CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC
6 Possible Frames for ORFs
• There are six total frames in which to find ORFs:
• Three possible ways of splitting the sequence into codons.
• We can “read” a DNA sequence either forward or backward.
• Illustration:
![Page 41: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/41.jpg)
Long vs. Short ORFs
• At random, we should expect one stop codon every (64/3) ~=
21 codons.
• However, genes are usually much longer than this.
• An intuitive approach to gene prediction is to scan for ORFs
whose length exceeds a certain threshold value.
• Issue: This method is naïve because some genes (e.g. some
neural and immune system genes) are not long enough to be
detected.
![Page 42: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/42.jpg)
Section 5:
Codon Usage
![Page 43: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/43.jpg)
Testing ORFs: Codon Usage
• Idea: Amino acids typically are coded by more than one
codon, but in nature certain codons occur more commonly.
• Therefore, uneven codon occurrence may characterize a
real gene.
• Solution: Create a 64-element hash table and count the
frequencies of codons in an ORF.
• This compensates for pitfalls of the ORF length test.
![Page 44: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/44.jpg)
Codon Occurrence in Human Genome
![Page 45: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/45.jpg)
AA codon /1000 frac
Ser TCG 4.31 0.05
Ser TCA 11.44 0.14
Ser TCT 15.70 0.19
Ser TCC 17.92 0.22
Ser AGT 12.25 0.15
Ser AGC 19.54 0.24
Pro CCG 6.33 0.11
Pro CCA 17.10 0.28
Pro CCT 18.31 0.30
Pro CCC 18.42 0.31
AA codon /1000 frac
Leu CTG 39.95 0.40
Leu CTA 7.89 0.08
Leu CTT 12.97 0.13
Leu CTC 20.04 0.20
Ala GCG 6.72 0.10
Ala GCA 15.80 0.23
Ala GCT 20.12 0.29
Ala GCC 26.51 0.38
Gln CAG 34.18 0.75
Gln CAA 11.51 0.25
Codon Occurrence in Mouse Genome
![Page 46: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/46.jpg)
How to Find Best ORFs
• An ORF is more “believable” than another if it has more
“likely” codons.
• Do sliding window calculations to find best ORFs.
• Allows for higher precision in identifying true ORFs; much
better than merely testing for length.
• However, average vertebrate exon length is 130
nucleotides, which is often too small to produce reliable
peaks in the likelihood ratio.
• Further improvement: In-frame hexamer count (examines
frequencies of pairs of consecutive codons).
![Page 47: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/47.jpg)
Section 6:
Splicing Signals
![Page 48: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/48.jpg)
Splicing Signals
• Try to recognize location of splicing signals at exon-intron
junctions, which are simply small subsequence of DNA that
indicate potential transcription..
• This method has yielded a weakly conserved donor splice site
and acceptor splice site.
• Unfortunately, profiles for such sites are still weak, and lends
the problem to the Hidden Markov Model (HMM) approaches,
which capture the statistical dependencies between sites.
![Page 49: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/49.jpg)
exon 1 exon 2
GT AC
AcceptorSite
DonorSite
Donor and Acceptor Sites: GT and AG
• The beginning and end of exons are signaled by donor and
acceptor sites that usually have GT and AG dinucleotides.
• Detecting these sites is difficult, because GT and AG appear
very often without indicating splicing.
![Page 50: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/50.jpg)
(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)
Donor: 7.9 bits
Acceptor: 9.4 bits
(Stephens & Schneider, 1996)
Donor and Acceptor Sites: Motif Logos
![Page 51: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/51.jpg)
-10
STOP
0 10-35
ATG
TATACTPribnow Box
TTCCAA GGAGGRibosomal binding site
Transcription start site
Gene Prediction and Motifs
• Upstream regions of genes often contain motifs.
• These motifs can be used to supplement the method of splicing
signals.
• Illustration:
![Page 52: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/52.jpg)
Section 7:
TestCode
![Page 53: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/53.jpg)
TestCode
• 1982: James Fickett develops TestCode.
• Idea: There is a tendency for nucleotides
in coding regions to be repeated with
periodicity of 3.
• TestCode judges randomness instead of
codon frequency.
• Finds “putative” coding regions, not
introns, exons, or splice sites.
James Fickett
![Page 54: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/54.jpg)
TestCode Statistics
• Define a window size no less than 200 bp, and slide the
window the sequence down 3 bases at a time.
• In each window:
• Calculate the following formula for each base {A, T, G,
C}:
• Use these values to obtain a probability from a lookup
table (which was previously defined and determined
experimentally with known coding and noncoding
sequences).
max n3k,n
3 k 1,n
3 k 2
min n3k,n
3 k 1,n
3 k 2
![Page 55: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/55.jpg)
TestCode Statistics
• Probabilities can be classified as indicative of “coding” or
“noncoding” regions, or “no opinion” when it is unclear what
level of randomization tolerance a sequence carries.
• The resulting sequence of probabilities can be plotted.
![Page 56: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/56.jpg)
Coding
No opinion
Non-coding
TestCode Sample Output
![Page 57: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/57.jpg)
Promoter Structure in Prokaryotes (E.Coli)
• Transcription starts at offset 0.
• Pribnow Box (-10)
• Gilbert Box (-30)
• Ribosomal Binding Site (+10)
![Page 58: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/58.jpg)
Ribosomal Binding Site
![Page 59: Gene Prediction: Statistical Approachesbioinformaticsinstitute.ru/sites/default/files/ch06_genepred.pdf · • 1964: Charles Yanofsky and Sydney Brenner prove collinearity in the](https://reader030.fdocuments.in/reader030/viewer/2022040904/5e77d95710fabd6df4272f22/html5/thumbnails/59.jpg)
Popular Gene Prediction Algorithms
• GENSCAN: Uses Hidden Markov Models (HMMs).
• TWINSCAN: Uses both HMM and similarity (e.g., between
human and mouse genomes).