BIOS816/VBMS818 Lecture 7 – Gene Prediction Guoqing Lu Office: E115 Beadle Center Tel: (402) 472-4982 Email: [email protected] Website: http://biocore.unl.edu
Date posted: 20-Dec-2015

BIOS816/VBMS818

Lecture 7 – Gene Prediction

Guoqing Lu

Office: E115 Beadle Center

Tel: (402) 472-4982

Email: [email protected]

Website: http://biocore.unl.edu

Genes

• Protein coding genes
  – ORF
  – Regulatory signals
• Depend on organism
• RNA genes
  – rRNA
  – tRNA
  – snRNA, others…

Prokaryotic Gene Expression

[Figure: a promoter, cistrons 1 to N, and a terminator are transcribed by RNA polymerase into a single mRNA (5’ to 3’). Ribosomes, tRNAs, and protein factors translate that mRNA into polypeptides 1 to N, each with its own N and C terminus.]

Eukaryotic Gene Expression

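[Figure: an enhancer and a promoter precede the transcribed region (Exon1, Intron1, Exon2), which ends at the terminator. RNA polymerase II produces the primary transcript (5’ to 3’), which is capped (7mG), spliced, and cleaved/polyadenylated (An). The mature mRNA is transported and translated into a polypeptide (N to C).]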

Gene Finding

• Comparative
  – Compare your sequence to what is already known
  – BLASTN, BLASTX
• Predictive: Stitch together a consensus
  – HMM, GRAIL…
  – Frames, Testcode
  – Findpatterns …
• Empirical approach
  – cDNA OR protein OR genetic evidence

ORF Characteristics

• Primary characters
  – Start codon (ATG)
  – Stop codon (TAA, TAG, TGA)
• Secondary characters
  – Codon bias
  – Biased nucleotide distribution
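The primary characters above are enough for a first-pass ORF scan. The following is a minimal sketch, not how Vector NTI or ORF Finder work internally: it scans only the forward strand, and the function name `find_orfs` and the `min_codons` cutoff are assumptions made for illustration.

```python
# Minimal ORF scan using only the primary characters:
# a start codon (ATG) followed by an in-frame stop codon (TAA, TAG, TGA).
# Forward strand only; a real tool would also scan the reverse complement.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) tuples.

    start is 0-based, end is exclusive and includes the stop codon,
    frame is 1, 2, or 3. min_codons (hypothetical cutoff) counts the
    codons from ATG up to, but not including, the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i                      # first ATG opens the ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame + 1))
                start = None                   # look for the next ORF
    return orfs

# Toy sequence (made up): ATG AAA TTT TAG starting at position 2.
print(find_orfs("CCATGAAATTTTAGGG"))
```

The secondary characters (codon bias, biased nucleotide distribution) are what help separate real genes from ORFs that occur by chance.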

ORF finding tools

• GCG
  – Frames, Map
• VectorNTI
  – ORF
• WWW tools
  – ORF Finder (NCBI)
  – …

Vector NTI - ORF

[Figure: Vector NTI ORF map of the E. coli lac operon (7477 bp, GI: 146575), showing CDS 1 (lacI), CDS 2 (lacZ), CDS 3 (lacY), and CDS 4 (lacA).]

ORFs of the lac operon

Statistical analysis as a means to find genes

• ORF example

• Codon Bias

• Fickett’s Statistic

Codon Bias

• Genetic code is degenerate
• Codon usage varies
  – organism to organism
  – gene to gene
• High bias correlates with high-level expression
• Bias correlates with tRNA isoacceptor abundance
• Change bias or tRNAs, change expression

Codon Bias

Gene Differences (GCG: CodonFrequency)

Amino acid  Codon  Count  Fraction
Gly         GGG    6      0.21
Gly         GGA    5      0.17
Gly         GGT    11     0.38
Gly         GGC    7      0.24

Amino acid  Codon  GAL4  ADH1
Gly         GGG    0.21  0
Gly         GGA    0.17  0
Gly         GGT    0.38  0.93
Gly         GGC    0.24  0.07
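The fractions in the glycine table are each codon's count divided by the total for its synonymous family (6 + 5 + 11 + 7 = 29). A small sketch of that arithmetic follows. The function names are hypothetical, and this is not the GCG CodonFrequency implementation.

```python
from collections import Counter

def codon_counts(cds):
    """Count codons in a coding sequence read in frame from position 0."""
    cds = cds.upper()
    return Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))

def family_fractions(counts, family):
    """Fraction of each codon within its synonymous family."""
    total = sum(counts[c] for c in family)
    return {c: (counts[c] / total if total else 0.0) for c in family}

# Glycine family with the counts shown on the slide.
GLY = ["GGG", "GGA", "GGT", "GGC"]
counts = Counter({"GGG": 6, "GGA": 5, "GGT": 11, "GGC": 7})
fractions = family_fractions(counts, GLY)
print({c: round(f, 2) for c, f in fractions.items()})
# {'GGG': 0.21, 'GGA': 0.17, 'GGT': 0.38, 'GGC': 0.24}
```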

Codon Bias: Organism Differences

Codon  Pc    Ml
CCU    12.8  3.4
CCC    1.7   17.6
CCA    22.4  1.2
CCG    4.9   26.2

Codon Bias Calculation

$$\mathrm{Pref} = \frac{\text{frequency} \,/\, \text{synonymous family frequency}}{\text{frequency in random} \,/\, \text{family frequency in random}}$$

• Bias > 1 in CORRECT frame
• Bias < 1 in INCORRECT frame

Amino acid  Codon  Count  Fraction
Gly         GGG    6      0.21
Gly         GGA    5      0.17
Gly         GGT    11     0.38
Gly         GGC    7      0.24
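As a worked illustration of the ratio above, here is a hedged sketch for a single codon. It assumes that "random" means a sequence with independent bases at a given composition (uniform here); the slide does not define the random model, and the function names are hypothetical.

```python
def random_codon_freq(codon, base_freq):
    """Expected codon frequency if bases were drawn independently."""
    p = 1.0
    for base in codon:
        p *= base_freq[base]
    return p

def codon_preference(codon, family, codon_freq, base_freq):
    """(frequency / family frequency) over the same ratio in random sequence."""
    observed = codon_freq[codon] / sum(codon_freq[c] for c in family)
    family_random = sum(random_codon_freq(c, base_freq) for c in family)
    expected = random_codon_freq(codon, base_freq) / family_random
    return observed / expected

# Glycine counts from the slide; uniform base composition is an assumption.
GLY = ["GGG", "GGA", "GGT", "GGC"]
GLY_COUNTS = {"GGG": 6, "GGA": 5, "GGT": 11, "GGC": 7}
UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

for codon in GLY:
    print(codon, round(codon_preference(codon, GLY, GLY_COUNTS, UNIFORM), 2))
```

With uniform bases each glycine codon has an expected share of 0.25, so GGT (0.38 observed) scores above 1 and GGA (0.17 observed) scores below 1.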

Codon-Biased Gene: Ribosomal Protein S2, EF-Ts

[Figure: codon bias plots for Frame 2 and Frame 3, with the rpsB and tsf genes marked.]

Fickett’s Statistic

[Figure: Fickett’s statistic plotted across the rpsB and tsf genes.]

• Analyzes the local nonrandomness at every third base in the sequence in a frame-independent fashion
• Does not use codon frequency statistics
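The idea of nonrandomness at every third base can be sketched as follows. This is only a simplified illustration and not the full statistic: it computes one asymmetry value per base as max / (min + 1) over the three codon positions, which is assumed here to be the position measure. The complete method also weighs base content and converts the values through lookup tables that are not reproduced here.

```python
def position_asymmetry(seq):
    """For each base, compare its counts at positions 0, 1, 2 modulo 3.

    Returns max(count) / (min(count) + 1) per base. Shifting the reading
    frame only permutes the three counts, so the value is frame independent.
    """
    seq = seq.upper()
    result = {}
    for base in "ACGT":
        counts = [seq[offset::3].count(base) for offset in range(3)]
        result[base] = max(counts) / (min(counts) + 1)
    return result

# Toy sequence (made up): a strict 3-base repeat is maximally asymmetric.
print(position_asymmetry("ATGATGATG"))
```

Coding regions tend to give higher asymmetry than noncoding sequence, and no codon usage table is needed to compute it.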

Error-rich DNA: Fickett’s

[Figure: Fickett’s statistic for the normal sequence and for a corrupted copy (1% substitution, 2 indels).]

ORF Found, Now What?

• Find ORFs: the biggest target, and the easiest to find
• Find promoter elements
  – Should be upstream of the 5’-most ORF
  – Remember, one promoter can regulate expression of multiple cistrons
  – May have ambiguous sequence
• Find Ribosome Binding Site(s) and Start Codon(s)
  – One WITHIN each ORF (cistron), near the 5’ end
  – RBS is close to (~5-10 nt) and upstream of the start codon
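The spacing rule for the RBS can be turned into a simple pattern search. The sketch below is an illustration under stated assumptions: the motif `AGGAGG` (the Shine-Dalgarno consensus) is not given on the slide, real sites often match it only partially, and `find_rbs_starts` is a hypothetical name.

```python
def find_rbs_starts(seq, motif="AGGAGG", min_gap=5, max_gap=10):
    """Find ATG codons with `motif` ending min_gap..max_gap nt upstream.

    Returns (motif_start, atg_start, gap) tuples with 0-based positions,
    where gap is the number of bases between the motif and the ATG.
    Exact matching only; a real search would tolerate mismatches.
    """
    seq = seq.upper()
    hits = []
    start = seq.find("ATG")
    while start != -1:
        for gap in range(min_gap, max_gap + 1):
            m_start = start - gap - len(motif)
            if m_start >= 0 and seq[m_start:m_start + len(motif)] == motif:
                hits.append((m_start, start, gap))
        start = seq.find("ATG", start + 1)
    return hits

# Toy sequence (made up): motif, 6-nt spacer, then ATG.
print(find_rbs_starts("TTAGGAGGACAGCTATGAAA"))
```

Combining this with an ORF scan helps choose among several candidate ATG codons at the 5’ end of a cistron.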

ORF Found, Now What?

Eukaryotic Gene Complexity

• More complex signals/regulatory elements
• More genes
• Combinatorial regulation common
• Introns/exons

• Yeast
  – introns rare
  – promoters adjacent
  – genome dense

Eukaryotes, cont’d

• “Higher” eukaryotes
  – introns common, LONGER than exons
  – promoter/enhancer
  – genome sparse
• Fungi
  – introns common, short relative to exons
  – promoter/enhancer
  – genome dense

Fungi and “higher” eukaryotes

Sew together exons
  – ORF regions
  – consensus sequences
  – domain/polypeptide matches

Exon/Intron Structure

CCACATTgtn(30-10,000)an(5-20)agCAGAA

...CCACATTCAGAA...

...ProHisSerGlu...

Alternative Splice

CCACATTgtn(30-10,000)an(5-20)agcagAA

...CCACATTAA...

...ProHisSTOP
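The two splice outcomes above can be checked with a short sketch. It follows the slide's convention that lowercase letters are intron and uppercase letters are exon. The intron filler (`c` repeated) is made up to satisfy the n(30-10,000) and n(5-20) lengths, and the output uses one-letter amino acid codes with `*` for stop.

```python
# Standard genetic code, built in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def splice(pre_mrna):
    """Remove intron bases, written in lowercase as on the slide."""
    return "".join(ch for ch in pre_mrna if ch.isupper())

def translate(mrna):
    """Translate in frame from position 0, stopping at the first stop."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        protein.append(aa)
        if aa == "*":
            break
    return "".join(protein)

# gt + n(30) + a + n(5): hypothetical filler bases.
INTRON_BODY = "gt" + "c" * 30 + "a" + "c" * 5
normal = "CCACATT" + INTRON_BODY + "ag" + "CAGAA"
alternative = "CCACATT" + INTRON_BODY + "agcag" + "AA"

print(splice(normal), translate(splice(normal)))            # CCACATTCAGAA PHSE
print(splice(alternative), translate(splice(alternative)))  # CCACATTAA PH*
```

Moving the 3’ splice site by only three bases puts TAA in frame, which is why exon boundaries have to be placed exactly.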

How do we know what sequences to look for?

• Promoter sites

• Intron/Exon

• Transcription Termination/PolyA

• Translation initiation

Finding Functional Sequences

• Known Consensus Sequences

• Consensus Sequence Generation
  – Position Weight Matrices
  – Sequence Logos
  – Hidden Markov Models

• Functional Tests
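To make the position weight matrix idea concrete, here is a minimal sketch that builds a log-odds matrix from a few aligned sites and slides it along a sequence. The four example sites are made up (variants of the TATAAT promoter box), and the pseudocount and uniform background are assumptions.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log2-odds position weight matrix from equal-length aligned sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm

def score(pwm, window):
    """Sum of per-position log-odds for one window."""
    return sum(col[base] for col, base in zip(pwm, window))

def best_match(pwm, seq):
    """0-based start of the highest-scoring window in seq."""
    width = len(pwm)
    return max(range(len(seq) - width + 1),
               key=lambda i: score(pwm, seq[i:i + width]))

# Hypothetical training sites, not taken from the lecture.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
pwm = build_pwm(sites)
print(best_match(pwm, "GCGCTATAATGCGC"))  # 4
```

A sequence logo displays the same column frequencies graphically, and a hidden Markov model extends the idea to sites with variable spacing.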

Gene finding Tools-WWW

• GRAIL II: integrated gene parsing

• GenLang

• GENIE

• HMMGene

• GENSCAN

• GENEMARK

GLIMMER for gene-finding in bacteria (www.tigr.org)

YOU are the best universal gene finder…

• You understand the “rules”
  – ORF, Promoter, RBS
  – Organism specific
• You understand relationships/sequences
  – 5’ to 3’
• You are a good sequence finder
  – search patterns
• You can resolve ambiguities
• EXPERIENCE

Exercise

• ORF analysis using Vector NTI:
• Open Vector NTI
• Retrieve the E. coli lac operon sequence
  – Find Tools -> Open Link -> GID in the molecular display window
  – Type in 146575 in the Genbank ID required window
• Do ORF analysis
  – Find Analysis -> ORF in the molecular display window
  – Use the default Start & Stop setting
• Present a figure showing your ORF analysis result and report the start and stop positions and lengths of the ORFs.

Exercise (cont’d)

• ORF analysis using GeneMark
• Go to the GeneMark web site: http://opal.biology.gatech.edu/GeneMark/genemark24.cgi
• Paste in the lac operon sequence
• Choose E. coli as the organism
• Report the start and stop positions and lengths of the predicted ORFs and compare them to those found with the Vector NTI ORF analysis

Assignment #2

• Download from Blackboard
  – Go to “Assignment” page
  – Open “Assignment #2”
  – Download the file “Assignment1”
• Submit to Blackboard
  – Go to “Assignment” page
  – Open “Assignment #2”
  – Submit your answer through Tools->Digital Drop Box
• Assignment #2 – due March 12