Gene Prediction Chengwei Luo, Amanda McCook, Nadeem Bulsara, Phillip Lee, Neha Gupta, and Divya Anjan Kumar

Gene Prediction

• Introduction

• Protein-coding gene prediction

• RNA gene prediction

• Modification and finishing

• Project schema

Why gene prediction?

Why not the experimental way?

Exponential growth of sequences

Metagenomics: only ~1% of microbes grow in the lab

New sequencing technology

How to do it?

It is a complicated task; let's break it into parts.

Genome

How to do it? Protein-coding gene prediction

Phillip Lee & Divya Anjan Kumar

Homology Search

ab initio approach

Nadeem Bulsara & Neha Gupta

How to do it? RNA gene prediction

Amanda McCook & Chengwei Luo

tRNA

rRNA

sRNA

Homology Search

Strategy

Open reading frame (ORF)

How/Why find ORF?

Protein Database Searches

Domain searches

Limits of Extrinsic Prediction

ab initio Prediction

Homology Search is not Enough!

Biased and incomplete databases

Sequenced genomes are not evenly distributed across the tree of life, so they do not reflect its diversity either.

ab initio Gene Prediction

Features

ORFs (6 frames)
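
The six-frame ORF scan named on this slide can be sketched as follows. This is a minimal illustration, not the method of any tool discussed here: the function names, the choice of start codons, and the `min_codons` threshold are assumptions made for the example.

```python
from typing import Iterator, Tuple

STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}  # assumed start set for this sketch
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq: str, frame: int, min_codons: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) for start-to-stop ORFs in one frame.

    Coordinates are 0-based, end is exclusive and includes the stop codon.
    """
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon in STARTS:
            start = i  # first start codon opens the ORF
        elif start is not None and codon in STOPS:
            if (i - start) // 3 >= min_codons:
                yield start, i + 3
            start = None  # look for the next ORF in this frame


def six_frame_orfs(seq: str, min_codons: int = 30) -> Iterator[Tuple[str, int, int, int]]:
    """Yield (strand, frame, start, end) over all six reading frames.

    Minus-strand coordinates refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons):
                yield strand, frame, start, end
```

Calling `list(six_frame_orfs(genome))` on a sequence string gives the candidate ORFs that later steps (codon statistics, homology search) would then score.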

Codon Statistics
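
As a rough illustration of codon statistics, the sketch below counts codon frequencies in a known coding sequence and scores a candidate ORF by a log-odds ratio against a background table. The function names, the `floor` value for unseen codons, and the log-odds form are assumptions for this example; the tools covered later use richer models.

```python
import math
from collections import Counter
from typing import Dict


def codon_frequencies(cds: str) -> Dict[str, float]:
    """Relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def log_odds(orf: str, coding: Dict[str, float],
             background: Dict[str, float], floor: float = 1e-6) -> float:
    """Sum of log(P_coding / P_background) over the codons of an ORF.

    Codons missing from a table get the small probability `floor`
    (an assumption made here to avoid division by zero).
    """
    orf = orf.upper()
    score = 0.0
    for i in range(0, len(orf) - 2, 3):
        codon = orf[i:i + 3]
        score += math.log(coding.get(codon, floor) / background.get(codon, floor))
    return score
```

A positive score means the ORF's codon usage looks more like the coding training set than like the background.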

Features (Contd.)

Probabilistic View

Supervised Techniques

Unsupervised Techniques

Commonly Used Tools

GeneMark

Glimmer

EasyGene

PRODIGAL

GeneMark

GeneMark.hmm

GeneMarkS

Glimmer

Glimmer Journey

Glimmer3.02

PRODIGAL: Prokaryotic Dynamic Programming Gene Finding Algorithm

Developed at Oak Ridge National Laboratory and the University of Tennessee

Features

EasyGene

Developed at University of Copenhagen

Statistical significance is the measure for gene prediction.

• A high-quality data set is extracted from the genome based on similarity to Swiss-Prot entries.

• This data set is used to estimate the HMM; statistical significance is then calculated from the ORF score and length.

Problem:

• No standalone version available

Comparison of Different Tools

RNA Gene Prediction

Why Predict RNA?

Regulatory sRNA

sRNA Challenges

Fundamental Methodology

RFAM

What Is Covariance?
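
One common way to quantify covariation between two columns of an RNA alignment is their mutual information: paired positions that mutate together (for example G-C changing to A-U) score high, while fully conserved or independent columns score near zero. The sketch below is a generic illustration with an assumed function name, not the scoring used by any particular tool on these slides.

```python
import math
from collections import Counter
from typing import Sequence


def mutual_information(col_i: Sequence[str], col_j: Sequence[str]) -> float:
    """Mutual information (in bits) between two alignment columns.

    col_i[k] and col_j[k] are the bases of sequence k at the two positions.
    """
    if len(col_i) != len(col_j) or not col_i:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_i)
    f_i = Counter(col_i)
    f_j = Counter(col_j)
    f_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in f_ij.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((f_i[a] / n) * (f_j[b] / n)))
    return mi
```

For instance, the columns `"GCAU"` and `"CGUA"` always form a Watson-Crick pair while both positions vary, so they carry the maximum signal for four sequences; two invariant columns carry none.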

Fig: Christian Weile et al. BMC Genomics (2007) 8:244

Noncomparative Prediction

Fig: James A. Goodrich & Jennifer F. Kugel, Nature Rev. Mol. Cell Biol. (2006) 7:612

Noncomparative Prediction

*Rolf Backofen & Wolfgang R. Hess, RNA Biol. (2010) 7:1

Comparative+Noncomparative

Effective sRNA prediction in V. cholerae

• Non-enterobacteria

• sRNAPredict2

• 32 novel sRNAs predicted

• 9 tested

• 6 confirmed

Jonathan Livny et al. Nucleic Acids Res. (2005) 33:4096

Software

*Rolf Backofen & Wolfgang R. Hess, RNA Biol. (2010) 7:1

Eva K. Freyhult et al. Genome Res. (2007) 17:117

Modification & finishing

• Consensus strategy to integrate ab initio results

• Broken gene recruiting

• TIS correcting

• IS (insertion sequence) calling

• Operon annotating

• Gene presence/absence analysis

Modification & finishing: Consensus strategy

[Figure: consensus decision flow; outcomes labeled pass, pass, fail]
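
A consensus step over several ab initio predictors could look like the sketch below. The voting rule is an assumption for illustration: a gene is keyed by strand and stop coordinate (predictors often agree on the stop but not the start), and it passes when at least `min_votes` tools call it. The coordinates in the usage note are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Gene = Tuple[str, int, int]  # (strand, start, stop)


def consensus_calls(predictions: Dict[str, Iterable[Gene]],
                    min_votes: int = 2) -> Dict[Tuple[str, int], str]:
    """Label each (strand, stop) as "pass" or "fail" by counting supporting tools."""
    votes: Dict[Tuple[str, int], Set[str]] = defaultdict(set)
    for tool, genes in predictions.items():
        for strand, _start, stop in genes:
            votes[(strand, stop)].add(tool)  # each tool votes once per gene
    return {
        key: ("pass" if len(tools) >= min_votes else "fail")
        for key, tools in votes.items()
    }
```

With hypothetical input such as `{"GeneMark": [("+", 100, 400)], "Glimmer": [("+", 130, 400), ("-", 900, 600)]}`, the gene ending at 400 passes and the one ending at 600 fails. Failed calls are natural candidates for the homology-based rescue described under broken gene recruiting.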

Broken gene recruiting

ab initio results

homology search

candidate fragments

Modification & finishing: TIS correcting

Start codon redundancy: ATG, GTG, TTG, CTG

Markov iteration, experimentally verified data

Leaderless genes
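
The start-codon redundancy above is why the translation initiation site (TIS) is ambiguous: a single ORF can contain several in-frame candidates. The sketch below only enumerates those candidates; choosing among them is the job of the model-based or experimentally guided correction mentioned on the slide. The function name and coordinate convention are assumptions for this example.

```python
from typing import List

START_CODONS = ("ATG", "GTG", "TTG", "CTG")


def candidate_starts(seq: str, orf_start: int, stop_pos: int) -> List[int]:
    """Return 0-based positions of in-frame start codons in [orf_start, stop_pos).

    orf_start is the first base of the reading frame to scan and
    stop_pos is the first base of the stop codon.
    """
    seq = seq.upper()
    return [
        i for i in range(orf_start, stop_pos, 3)
        if seq[i:i + 3] in START_CODONS
    ]
```

For the toy sequence `"TTGAAAATGCCCTAA"`, scanning from 0 to the stop at 12 yields two candidates, the TTG at 0 and the ATG at 6.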

Modification & finishing: IS calling, operon annotating

ISfinder database

Modification & finishing: Gene presence/absence analysis

Schema (proposed)

assembly group