Alignment scoring matrices and evolution


ORF Graphs: types of edges

ATG → TAG (single exon gene)
ATG → GT (initial coding exon)
GT → AG (intron)
AG → GT (internal exon)
AG → TAG (terminal coding exon)
TAG → ATG (intergenic region)


Conceptual framework for gene finding with ORF graphs

1. Given an input sequence S, compute the ORF graph G for S.

2. Score the vertices and edges in G using some scoring strategy or function, f.

3. Extract the highest-scoring valid parse from G according to f.
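The three steps above can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not any particular gene finder's method: only the forward strand is scanned, TAG stands in for all stop codons as on the edge-type slide, reading frame and phase are ignored, and `toy_score` is a hypothetical stand-in for the scoring function f.

```python
from collections import defaultdict

# Allowed (left signal, right signal) pairs, taken from the edge-type slide.
EDGE_TYPES = {
    ("ATG", "TAG"): "single exon gene",
    ("ATG", "GT"): "initial coding exon",
    ("GT", "AG"): "intron",
    ("AG", "GT"): "internal exon",
    ("AG", "TAG"): "terminal coding exon",
    ("TAG", "ATG"): "intergenic region",
}

def find_signals(seq):
    """Step 1a: vertices are (position, signal) pairs, sorted by position."""
    signals = []
    for sig in ("ATG", "TAG", "GT", "AG"):
        start = seq.find(sig)
        while start != -1:
            signals.append((start, sig))
            start = seq.find(sig, start + 1)
    return sorted(signals)

def build_orf_graph(seq):
    """Step 1b: connect every left-to-right signal pair of an allowed type."""
    vertices = find_signals(seq)
    edges = []  # (index of left vertex, index of right vertex, feature type)
    for i, (pos_u, sig_u) in enumerate(vertices):
        for j in range(i + 1, len(vertices)):
            pos_v, sig_v = vertices[j]
            feature = EDGE_TYPES.get((sig_u, sig_v))
            if feature is not None and pos_v >= pos_u + len(sig_u):
                edges.append((i, j, feature))
    return vertices, edges

def toy_score(feature, length):
    """Step 2 (hypothetical f): reward exon length, penalize everything else."""
    return float(length) if "exon" in feature else -0.5 * length

def best_parse(seq, score=toy_score):
    """Step 3: highest-scoring path from any ATG to any TAG (dynamic programming)."""
    vertices, edges = build_orf_graph(seq)
    incoming = defaultdict(list)
    for u, v, feature in edges:
        incoming[v].append((u, feature))
    best = {}  # vertex index -> (score, predecessor index, feature type)
    for v, (pos_v, sig_v) in enumerate(vertices):
        if sig_v == "ATG":
            best[v] = (0.0, None, None)  # a parse may start at any ATG
        for u, feature in incoming[v]:
            if u in best:
                cand = best[u][0] + score(feature, pos_v - vertices[u][0])
                if v not in best or cand > best[v][0]:
                    best[v] = (cand, u, feature)
    ends = [v for v in best if vertices[v][1] == "TAG" and best[v][1] is not None]
    if not ends:
        return None, []
    v = max(ends, key=lambda k: best[k][0])
    total, path = best[v][0], []
    while best[v][1] is not None:  # walk predecessors back to the starting ATG
        _, u, feature = best[v]
        path.append((vertices[u][0], vertices[v][0], feature))
        v = u
    return total, path[::-1]
```

Calling `best_parse("CCATGAAACCCTAGCC")` returns the score and a list of `(start, end, feature)` triples. Because vertices are sorted by position and edges only run left to right, a single left-to-right pass is enough to find the best path.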


Common assumptions

• No overlapping genes
  – reasonably safe

• No nested genes
  – reasonably safe

• No partial genes
  – limiting; needs to be relaxed for real gene finders


Common assumptions (continued)

• No non-canonical splice sites or start codons
  – GTG and TTG are common start sites in bacteria
  – GC-AG and AT-AC introns are present in plants and animals

• No frame shifts or sequencing errors
  – OK for finished sequence


Common assumptions (continued)

• Optimal parse only
  – useful, but if we allow the gene finder more guesses it may do better

• Constraints on exon/intron lengths
  – max sizes needed for practical reasons
  – but human introns can be >100,000 bp
  – for now we just miss those


Common assumptions (continued)

• No split start/stop codons
  – introns can occur right in the middle of an ATG or TAG/TGA/TAA!
  – (most of) today’s gene finders will miss these

• No alternative splicing
  – some species have an average of 2 alternative splicing products per gene
  – researchers are actively working on this


Common assumptions (continued)

• No selenocysteine codons
  – encoded by TGA, which normally reads as a stop codon

• No ambiguity codes
  – R = purine = A or G
  – Y = pyrimidine = C or T
  – other codes for all possible ambiguous base calls
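To make the ambiguity-code assumption concrete, here is a small sketch of the standard IUPAC nucleotide codes and a check for whether an ambiguous stretch could still be a given signal. The helper name `could_match` is hypothetical, chosen for this illustration only.

```python
# Standard IUPAC nucleotide codes: each code maps to the bases it may stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

def could_match(observed, signal):
    """True if every (possibly ambiguous) observed code is compatible with `signal`."""
    return len(observed) == len(signal) and all(
        base in IUPAC[code] for code, base in zip(observed.upper(), signal)
    )
```

A gene finder that assumes only A, C, G, and T would treat `RTG` as a non-signal, whereas `could_match("RTG", "ATG")` shows it may still be a start codon.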


The (Eukaryotic) Gene Prediction Problem

[Figure: gene schematic ATG … GT … AG … GT … AG … TAG, with the exons, the introns, and the two flanking UTRs labeled.]

Gene prediction is the problem of parsing a sequence into nonoverlapping coding segments (CDSs) consisting of exons separated by introns. Untranslated regions (UTR) are rarely predicted.

This parsing problem can be visualized as one of choosing the best path through the graph of all open reading frames (ORFs):

slide courtesy of Bill Majoros


Some Eukaryotic Gene-finding Programs

GenScan SNAP GeneZilla

TwinScan GenomeScan DoubleScan

Genie FGenesH SGP1

SGP2 SLAM GlimmerM

TWAIN GlimmerHMM GeneMark

Augustus EuGène HMMgene

JIGSAW Unveil Grail

VEIL Morgan Phat

GrailEXP Exonomy GAZE


gray boxes = developed in Salzberg lab


Similarities Between Gene-finding Programs

[Figure: gene finders grouped by approach. Group labels: GHMMs, HMMs, homology-based, combiners, ad hoc. Programs shown: GeneZilla, GenScan, Phat, SNAP, GlimmerHMM, Augustus, Genie, Exonomy, TWAIN, SLAM, TWINSCAN, GenomeScan, DoubleScan, SGP-1, SGP-2, Unveil, HMMgene, VEIL, GlimmerM, Grail, GrailEXP, MORGAN, GeneMark, FGenesH, Jigsaw, Ensembl, GAZE.]


Gene Finding Strategies

Gene finding programs can be classified into several types:

(1) ad hoc. These apply an ad hoc scoring function to the set of all ORFs and then predict only those ORFs scoring above a predefined threshold. Examples are GRAIL and GlimmerM.

(2) probabilistic. These adopt a rigorous probabilistic model of sequence structure and choose the most probable parse according to that probabilistic model. Examples are GenScan and GeneZilla.

(3) homology-based. These use evidence in the form of homology, and can be either ad hoc (e.g., GrailEXP) or probabilistic (e.g., TwinScan, SLAM, TWAIN).

(4) combiners. These combine multiple forms of evidence, such as the predictions of other gene finders, and use ad hoc methods to arrive at a consensus prediction. Examples include Ewan Birney’s Ensembl and Jonathan Allen’s JIGSAW.


Probabilistic Gene Finders


Review of Probability Theory

P(x) denotes the probability of event x occurring. P(x)=0.25 means, for example, that x occurs 25% of the time. P(x)=1 implies that the event x is certain to occur. P(x)=0 implies that x cannot occur.

The probability of events x and y both occurring is denoted by:

P(x, y) or P(x ∧ y)   (joint probability)

P(x | y) denotes the probability of event x occurring, given that y has occurred, or given that y is a true statement. If P(y)≠0 then:

P(x | y) = P(x, y) / P(y)   (conditional probability)

so that:

P(x, y) = P(x | y) × P(y)   (multiplication rule)

If x and y are independent then:

P(x, y) = P(x) × P(y), P(x | y) = P(x)   (independence)
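A short worked example may help fix these rules. The joint distribution below is hypothetical (toy numbers over two bases and two region types, not taken from the slides). Exact fractions are used so the identities hold without rounding.

```python
from fractions import Fraction

# Hypothetical joint distribution P(x, y): x = base, y = region (toy numbers).
joint = {
    ("G", "exon"): Fraction(3, 20), ("G", "intron"): Fraction(1, 10),
    ("A", "exon"): Fraction(1, 4),  ("A", "intron"): Fraction(1, 2),
}

def marginal_x(x):
    """P(x): sum the joint probability over all values of y."""
    return sum(p for (xx, _), p in joint.items() if xx == x)

def marginal_y(y):
    """P(y): sum the joint probability over all values of x."""
    return sum(p for (_, yy), p in joint.items() if yy == y)

def conditional(x, y):
    """P(x | y) = P(x, y) / P(y), defined only when P(y) != 0."""
    return joint[(x, y)] / marginal_y(y)

# Conditional probability: P(G | exon) = (3/20) / (2/5) = 3/8
p_g_given_exon = conditional("G", "exon")

# Multiplication rule: P(G, exon) = P(G | exon) * P(exon)
rebuilt_joint = p_g_given_exon * marginal_y("exon")

# Independence would require P(G, exon) == P(G) * P(exon); here it fails.
independent = joint[("G", "exon")] == marginal_x("G") * marginal_y("exon")
```

In this toy table G is more frequent inside exons than overall, so base and region are not independent.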


Review of Probability Theory

If x and y are mutually exclusive (i.e., can’t both happen), then:

P(x ∨ y) = P(x) + P(y)   (the addition rule)

If an experiment is guaranteed to yield exactly one of a set of mutually exclusive events {x_1, x_2, …, x_n} then:

$\sum_{i=1}^{n} P(x_i) = 1$   (partitioning rule)

If the events {x_1, x_2, …, x_n} are mutually independent (pairwise independence alone is not enough) then:

$P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i)$   (independence)

P_M(x) = P(x | M) is an estimate of P(x) according to model M.   (prob. model)
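As an illustration of the partitioning and independence rules together, the sketch below scores a sequence under a model M that draws each base independently. Both base-composition models are hypothetical toy parameters. Logarithms are used because a product of many small probabilities underflows quickly.

```python
import math

def check_partition(probs, tol=1e-9):
    """Partitioning rule: mutually exclusive, exhaustive outcomes sum to 1."""
    return abs(sum(probs.values()) - 1.0) <= tol

def log_prob(seq, base_probs):
    """log P_M(S), assuming each base is drawn independently (product rule)."""
    return sum(math.log(base_probs[base]) for base in seq)

# Two hypothetical models of base composition (toy numbers, not from the slides).
UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
AT_RICH = {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}
```

Comparing `log_prob("ATAT", AT_RICH)` with `log_prob("ATAT", UNIFORM)` shows how the same sequence receives different probabilities under different models.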


Ab initio Gene Finding = Modeling

Computational gene prediction is generally carried out as follows:

1. We formulate a mathematical model M which describes some method for generating DNA sequences and their gene structures. That is, M generates pairs of the form (S, φ) for sequence S and gene structure φ.

2. Using a set T of known genes, we customize or “train” the model so that the sequences and gene structures which M generates have the same statistical properties as the real genes in T.

3. Given an un-annotated sequence S, we pretend that M generated S, even though we know that it did not. (Evolution and the cellular machinery generated it, not our model!)

4. Since we assume that M generated S, we can determine how M most likely generated it, i.e., the precise intron/exon boundaries that M would most probably have imposed while it was generating S.


A Model Generates Sequences and Gene Structures

The underlying model of a gene finder generates both sequences and their gene structures. These can be denoted as pairs of the form (S, φ) for sequence S and gene structure φ.

[Figure: the model M emits a sequence (AGCTAGCAGTCGATCATGGCATTATCGGCCGTAGTACGTAGCAGTAGCTAGTAGCAGTCGATAGTA) together with its gene structure, shown as exon 1, exon 2, and exon 3 marked along the sequence.]


The Model Parameters Affect the Model’s Output

The parameters to M determine the statistical properties of the sequences and gene models which it generates (both the structural properties of the gene models and the frequencies of individual bases).

[Figure: the same model M run with three different parameter sets (6,23,843,924…; 45,2,8214,32…; 11,413,7,235,8…) emits three different sequences.]


Tuning the Model to Obtain Optimal Behavior

The parameters to M can be tuned to make its outputs most similar to the set of known genes:

[Figure: training loop. The model's predicted output under each parameter set (6,23,843,924…; 45,2,8214,32…; 11,413,7,235,8…) is compared with the known genes. While the two are still dissimilar, the parameters are modified and the comparison is repeated, until predicted and known are similar.]
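The slide depicts tuning as an iterative loop. For a simple model with fully labeled training data there is a shortcut worth knowing: the maximum-likelihood emission probabilities are just relative frequencies, so they can be obtained by counting. The sketch below swaps in that counting approach. It assumes a hypothetical training format in which each sequence comes with a per-base label string, and an optional pseudocount smooths bases that were never observed.

```python
from collections import Counter

def estimate_emissions(training_pairs, alphabet="ACGT", pseudocount=1.0):
    """Estimate P(base | state) from (sequence, labels) pairs by counting."""
    counts = {}
    for seq, labels in training_pairs:
        for base, state in zip(seq, labels):
            counts.setdefault(state, Counter())[base] += 1
    probs = {}
    for state, tally in counts.items():
        total = sum(tally[b] + pseudocount for b in alphabet)
        probs[state] = {b: (tally[b] + pseudocount) / total for b in alphabet}
    return probs
```

For example, `estimate_emissions([("AACG", "NNCC")], pseudocount=0.0)` assigns all of state N's probability mass to A and splits state C's evenly between C and G.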


Using the Model to Predict Gene Structure

Given an input sequence, we can ask what gene structure the model is most likely to have generated at the same time that it generated that sequence:

Suppose M was invoked 1,000,000,000,000,000,000 times, yielding pairs (S_i, φ_i). Collect those pairs where S_i = S. Among those pairs, which gene structure φ is most common? Emit that φ. (maximum a posteriori, MAP)

[Figure: Gene Finder XYZ reads a parameter file (11,413,7,235,8…) and a sequence S (AGCTAGTACG…); its model M generates pairs (S_i, φ_i), and the most common gene structure φ is emitted.]
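The thought experiment can be run literally on a toy model. The sketch below assumes a hypothetical two-state model (N = noncoding, C = coding) with made-up parameters. It samples many (sequence, labels) pairs, keeps those whose sequence equals the target, and reports the most common labeling. For comparison it also computes the same answer exactly by maximizing the joint probability, which works because P(φ | S) = P(S, φ) / P(S) and P(S) does not depend on φ. Practical gene finders do not sample; they compute this maximum directly, typically by dynamic programming.

```python
import itertools
import random
from collections import Counter

# Hypothetical toy model M (parameters are illustrative, not from the slides).
START = "N"                      # assumption: every sequence starts in state N
STAY = {"N": 0.6, "C": 0.8}      # probability of remaining in the same state
EMIT = {
    "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}

def sample_pair(length, rng):
    """Invoke M once: return one (sequence, labels) pair."""
    state, seq, labels = START, [], []
    for _ in range(length):
        bases = list(EMIT[state])
        seq.append(rng.choices(bases, weights=[EMIT[state][b] for b in bases])[0])
        labels.append(state)
        if rng.random() >= STAY[state]:
            state = "C" if state == "N" else "N"
    return "".join(seq), "".join(labels)

def map_by_simulation(target, trials=50_000, seed=0):
    """Most common labeling among sampled pairs whose sequence equals target."""
    rng = random.Random(seed)
    tally = Counter()
    for _ in range(trials):
        seq, labels = sample_pair(len(target), rng)
        if seq == target:
            tally[labels] += 1
    return tally.most_common(1)[0][0] if tally else None

def joint(seq, labels):
    """P(S, phi) under the toy model."""
    if labels[0] != START:
        return 0.0
    p = EMIT[labels[0]][seq[0]]
    for i in range(1, len(seq)):
        same = labels[i] == labels[i - 1]
        p *= STAY[labels[i - 1]] if same else 1.0 - STAY[labels[i - 1]]
        p *= EMIT[labels[i]][seq[i]]
    return p

def map_exact(seq):
    """argmax over all labelings of P(S, phi), by brute-force enumeration."""
    candidates = ("".join(c) for c in itertools.product("NC", repeat=len(seq)))
    return max(candidates, key=lambda labels: joint(seq, labels))
```

Brute-force enumeration grows as 2^n in the sequence length, so `map_exact` is only usable on very short sequences like the three-base target used here.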
