BIOS816/VBMS818 Lecture 7 – Gene Prediction Guoqing Lu Office: E115 Beadle Center Tel: (402)...
BIOS816/VBMS818
Lecture 7 – Gene Prediction
Guoqing Lu
Office: E115 Beadle Center
Tel: (402) 472-4982
Email: [email protected]
Website: http://biocore.unl.edu
Genes
• Protein-coding genes
– ORF
– Regulatory signals
• Depend on organism
• RNA genes
– rRNA
– tRNA
– snRNA, others…
Prokaryotic Gene Expression
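[Figure: a prokaryotic operon (Promoter, Cistron1, Cistron2 … CistronN, Terminator) is transcribed by RNA polymerase into a single mRNA (5’ to 3’). Ribosomes, tRNAs, and protein factors translate it into separate polypeptides 1, 2 … N, each with its own N and C terminus.]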
Eukaryotic Gene Expression
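[Figure: a eukaryotic gene (Enhancer, Promoter, Transcribed Region with Exon1, Intron1, Exon2, Terminator) is transcribed by RNA polymerase II into a primary transcript (5’ to 3’). The transcript is capped (7mG), spliced, and cleaved/polyadenylated (An), then transported and translated into a polypeptide (N to C).]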
Gene Finding
• Comparative
– Compare your sequence to what is already known
– BLASTN, BLASTX
• Predictive: stitch together a consensus
– HMM, GRAIL…
– Frames, Testcode
– Findpatterns…
• Empirical approach
– cDNA OR protein OR genetic evidence
ORF Characteristics
• Primary characteristics
– Start codon (ATG)
– Stop codons (TAA, TAG, TGA)
• Secondary characteristics
– Codon bias
– Biased nucleotide distribution
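The primary characteristics can be turned into a very small ORF scanner. The sketch below is only an illustration, not the algorithm used by Vector NTI or GeneMark: it scans the three forward reading frames for an ATG followed in frame by a stop codon. The function name `find_orfs` and the `min_codons` cutoff are assumptions made for this example; scanning the reverse strand would additionally require the reverse complement.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) tuples for ATG...stop ORFs on the forward strand.

    start is 0-based, end is exclusive and includes the stop codon.
    min_codons is a hypothetical length cutoff (stop codon included).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATAGGG"))
```

Because only the first ATG after a stop is kept, each reported ORF is the longest one ending at that stop codon; nested starts are ignored in this sketch.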
Vector NTI - ORF
[Figure: Vector NTI map of the E. coli lac operon (7477 bp, GI: 146575) showing the ORFs of the lac operon: CDS 1 (lacI), CDS 2 (lacZ), CDS 3 (lacY), CDS 4 (lacA).]
Codon Bias
• The genetic code is degenerate
• Codon usage varies
– organism to organism
– gene to gene
• High bias correlates with high-level expression
• Bias correlates with tRNA isoacceptor abundance
• Change the bias or the tRNAs, and expression changes
Codon Bias
Amino acid  Codon  Count  Fraction
Gly         GGG    6      0.21
Gly         GGA    5      0.17
Gly         GGT    11     0.38
Gly         GGC    7      0.24

Gene differences (fraction of Gly codons):

Amino acid  Codon  GAL4  ADH1
Gly         GGG    0.21  0
Gly         GGA    0.17  0
Gly         GGT    0.38  0.93
Gly         GGC    0.24  0.07

GCG: CodonFrequency
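The fractions on this slide are each codon's count divided by the total count for its synonymous family. A minimal sketch of that arithmetic follows, using the glycine counts from the slide (6, 5, 11, 7). The helper names `codon_counts` and `family_fractions` are made up for this example and are not part of the GCG CodonFrequency program.

```python
from collections import Counter


def codon_counts(cds):
    """Count in-frame codons of a coding sequence that starts at position 0."""
    cds = cds.upper()
    return Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))


def family_fractions(counts, family):
    """Fraction of each codon within its synonymous family."""
    total = sum(counts.get(codon, 0) for codon in family)
    if total == 0:
        return {codon: 0.0 for codon in family}
    return {codon: counts.get(codon, 0) / total for codon in family}


GLY = ["GGG", "GGA", "GGT", "GGC"]

if __name__ == "__main__":
    slide_counts = {"GGG": 6, "GGA": 5, "GGT": 11, "GGC": 7}
    for codon, frac in family_fractions(slide_counts, GLY).items():
        print(codon, round(frac, 2))
```

Comparing such fractions between two genes, as the slide does for GAL4 and ADH1, shows how strongly each gene favors particular synonymous codons.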
Codon Bias Calculation
Pref = (codon frequency / synonymous family frequency) / (codon frequency in random sequence / family frequency in random sequence)

• Bias > 1 in the CORRECT frame
• Bias < 1 in an incorrect frame

Amino acid  Codon  Count  Fraction
Gly         GGG    6      0.21
Gly         GGA    5      0.17
Gly         GGT    11     0.38
Gly         GGC    7      0.24
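As a worked example of the preference ratio, the sketch below applies it to the glycine counts from the slide. It assumes, unless told otherwise, that a random sequence uses all codons equally, so the random ratio for a family of n codons is 1/n; that assumption, the function name `codon_preference`, and the optional `random_freqs` argument are all illustrative choices rather than the exact procedure of any particular program.

```python
def codon_preference(family_counts, random_freqs=None):
    """Preference of each codon in one synonymous family.

    family_counts: codon -> observed count in the gene or reference set.
    random_freqs:  optional codon -> frequency in random sequence;
                   if omitted, all codons are assumed equally likely.
    """
    total = sum(family_counts.values())
    if total == 0:
        raise ValueError("family has no observed codons")
    prefs = {}
    for codon, count in family_counts.items():
        observed = count / total  # frequency / synonymous family frequency
        if random_freqs is None:
            expected = 1 / len(family_counts)
        else:
            family_random = sum(random_freqs[c] for c in family_counts)
            expected = random_freqs[codon] / family_random
        prefs[codon] = observed / expected
    return prefs


if __name__ == "__main__":
    gly = {"GGG": 6, "GGA": 5, "GGT": 11, "GGC": 7}
    for codon, pref in codon_preference(gly).items():
        print(codon, round(pref, 2))
```

Under the uniform assumption GGT comes out above 1 and GGA below 1, which is the kind of signal that separates the correct reading frame from the incorrect ones when averaged over a window of codons.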
Fickett’s Statistic
[Figure: plot labeled with the genes rpsB and tsf.]
– Analyzes the local nonrandomness at every third base in the sequence, in a frame-independent fashion
– Does not use codon frequency statistics
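To make "nonrandomness at every third base" concrete, the sketch below computes a position asymmetry value for each base: the base is counted separately at sequence positions 0, 3, 6…, 1, 4, 7…, and 2, 5, 8…, and the largest count is divided by the smallest count plus one. This is only one ingredient of the statistic as it is commonly described; the published method also uses base composition and lookup tables of probabilities and weights, which are deliberately not reproduced here. The function name `position_asymmetry` is an assumption for this example.

```python
def position_asymmetry(seq):
    """Per-base asymmetry across the three codon-position classes.

    Returns base -> max(count by position mod 3) / (min(count) + 1).
    Values near or below 1 suggest an even spread; larger values
    suggest period-3 structure typical of coding sequence.
    """
    seq = seq.upper()
    result = {}
    for base in "ACGT":
        counts = [0, 0, 0]
        for i, b in enumerate(seq):
            if b == base:
                counts[i % 3] += 1
        result[base] = max(counts) / (min(counts) + 1)
    return result


if __name__ == "__main__":
    print(position_asymmetry("ATGATGATGATG"))
    print(position_asymmetry("ACGTACGTACGT"))
```

Because only the maximum and minimum of the three counts are used, the value does not depend on which of the three positions is the true first codon position, which is what makes the measure frame independent.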
ORF Found, Now What?
• ORFs are the biggest target, but the easiest to find
• Find promoter elements
– Should be upstream of the 5’-most ORF
– Remember, one promoter can regulate expression of multiple cistrons
– May have ambiguous sequence
• Find ribosome binding site(s) and start codon(s)
– One WITHIN each ORF (cistron), near the 5’ end
– RBS is close to (~5-10 nt) and upstream of the start codon
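The RBS rule above (close to and upstream of the start codon, roughly 5-10 nt away) can be sketched as a small search. The example assumes a Shine-Dalgarno-like motif AGGAGG as the RBS consensus; that motif, the function name `best_rbs`, and the simple match-count scoring are illustrative assumptions, and real sites are often only partial matches.

```python
SD_MOTIF = "AGGAGG"  # assumed RBS consensus for this sketch


def best_rbs(seq, start, min_gap=5, max_gap=10, motif=SD_MOTIF):
    """Find the best motif match upstream of a start codon.

    start is the 0-based index of the A in ATG. The gap is the number of
    nucleotides between the end of the motif and the start codon.
    Returns (motif_position, gap, matching_bases) or None if no window fits.
    """
    seq = seq.upper()
    best = None
    for gap in range(min_gap, max_gap + 1):
        pos = start - gap - len(motif)
        if pos < 0:
            continue  # window would run off the 5' end
        window = seq[pos:pos + len(motif)]
        matches = sum(a == b for a, b in zip(window, motif))
        if best is None or matches > best[2]:
            best = (pos, gap, matches)
    return best


if __name__ == "__main__":
    example = "TT" + "AGGAGG" + "TTTTTTT" + "ATGAAA"
    print(best_rbs(example, example.index("ATG")))
```

A start codon with a strong upstream match is a more convincing translation start than an ATG with none, which helps when an ORF contains several candidate ATGs.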
ORF Found, Now What?
Eukaryotic Gene Complexity
• More complex signals/regulatory elements
• More genes
• Combinatorial regulation common
• Introns/exons
• Yeast
– Introns rare
– Promoters adjacent
– Genome dense
Eukaryotes, cont’d
• “Higher” eukaryotes
– Introns common, LONGER than exons
– Promoter/enhancer
– Genome sparse
• Fungi
– Introns common, short relative to exons
– Promoter/enhancer
– Genome dense
Fungi and “higher” eukaryotes
• Sew together exons
– ORF regions
– Consensus sequences
– Domain/polypeptide matches
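One consensus that helps when sewing exons together is the canonical splice-site rule: most introns begin with GT and end with AG. The sketch below lists every GT…AG span as a candidate intron and shows how removing one joins the flanking exons. It is a deliberately naive illustration: the names `candidate_introns` and `splice` and the `min_len` cutoff are assumptions, and real predictors score much longer splice-site patterns and check that the joined exons keep an open reading frame.

```python
def candidate_introns(seq, min_len=6, max_len=None):
    """Return (start, end) spans that begin with GT and end with AG.

    start is 0-based and end is exclusive, so seq[start:end] is the
    candidate intron. min_len and max_len bound the intron length.
    """
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [j for j in range(len(seq) - 1) if seq[j:j + 2] == "AG"]
    spans = []
    for d in donors:
        for a in acceptors:
            length = a + 2 - d
            if length >= min_len and (max_len is None or length <= max_len):
                spans.append((d, a + 2))
    return spans


def splice(seq, span):
    """Remove one candidate intron and join the flanking exons."""
    start, end = span
    return seq[:start] + seq[end:]


if __name__ == "__main__":
    gene = "ATGGTAAACAGCC"
    for span in candidate_introns(gene):
        print(span, gene[span[0]:span[1]], splice(gene, span))
```

In practice the number of GT…AG pairs grows quickly with sequence length, which is why ORF regions and domain/polypeptide matches are needed to choose among the candidates.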
How do we know what sequences to look for?
• Promoter sites
• Intron/Exon
• Transcription Termination/PolyA
• Translation initiation
Finding Functional Sequences
• Known Consensus Sequences
• Consensus Sequence Generation
– Position Weight Matrices
– Sequence Logos
– Hidden Markov Models
• Functional Tests
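As an illustration of consensus sequence generation, the sketch below builds a position weight matrix from a handful of aligned sites and uses it to scan a sequence. The four example sites are invented for this sketch (they merely resemble a TATAAT-like element), and the pseudocount of 1 and the uniform background of 0.25 per base are assumptions, not values taken from the lecture.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds position weight matrix from equal-length aligned sites."""
    length = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(length):
        column = [site[pos].upper() for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def score(pwm, window):
    """Sum of per-position log-odds scores for a window of the same length."""
    return sum(col[base] for col, base in zip(pwm, window.upper()))


def best_hit(pwm, seq):
    """Start index of the highest-scoring window (seq must be at least as long as the matrix)."""
    n = len(pwm)
    return max(range(len(seq) - n + 1), key=lambda i: score(pwm, seq[i:i + n]))


if __name__ == "__main__":
    sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]  # hypothetical aligned sites
    pwm = build_pwm(sites)
    target = "GGCGTATAATGCC"
    hit = best_hit(pwm, target)
    print(hit, target[hit:hit + len(pwm)], round(score(pwm, target[hit:hit + len(pwm)]), 2))
```

Unlike a single consensus string, the matrix keeps track of how much variation each position tolerates, so a site that differs at a variable position is penalized less than one that differs at a conserved position.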
Gene Finding Tools (WWW)
• GRAIL II: integrated gene parsing
• GenLang
• GENIE
• HMMGene
• GENSCAN
• GeneMark
YOU are the best universal gene finder…
• You understand the “rules”
– ORF, promoter, RBS
– Organism specific
• You understand relationships/sequences
– 5’ to 3’
• You are a good sequence finder
– Search patterns
• You can resolve ambiguities
• EXPERIENCE
Exercise
• ORF analysis using Vector NTI:
• Open Vector NTI
• Retrieve the E. coli lac operon sequence
– Find Tools -> Open Link -> GID in the molecular display window
– Type 146575 in the GenBank ID required window
• Do ORF analysis
– Find Analysis -> ORF in the molecular display window
– Use the default Start & Stop setting
• Present a figure showing your ORF analysis result, and report the start and stop positions and lengths of the ORFs.
Exercise (cont’d)
• ORF analysis using GeneMark
• Go to the GeneMark web site: http://opal.biology.gatech.edu/GeneMark/genemark24.cgi
• Paste in the lac operon sequence
• Choose E. coli as the organism
• Report the start and stop positions and lengths of the predicted ORFs and compare them to those found with the Vector NTI ORF