Gene Finding. Finding Genes Prokaryotes –Genome under 10Mb –>85% of sequence codes for proteins...

Gene Finding
  • Date posted: 19-Dec-2015

Gene Finding

Finding Genes

• Prokaryotes
  – Genome under 10Mb
  – >85% of sequence codes for proteins

• Eukaryotes
  – Large genomes (up to 10Gb)
  – 1-3% of sequence codes for proteins in vertebrates

Introns

• Humans
  – 95% of genes have introns
  – 10% of genes have more than 20 introns
  – Some have more than 60
  – Largest gene (Duchenne muscular dystrophy locus) spans >2Mb (larger than some prokaryotic genomes)
  – Average exon = 150bp
  – Introns can interrupt an Open Reading Frame at any position, even within a codon
  – ORF finding is not sufficient for eukaryotic genomes

Open Reading Frames in Bacteria

• Without introns, look for a long open reading frame (start codon ATG, … , stop codon TAA, TAG or TGA)

• Short genes are missed (<300 nucleotides)

• Shadow genes (overlapping open reading frames on opposite DNA strands) are hard to detect

• Some genes use UUG, AUA, UUA or CUG as the start codon

• Some genes use TGA to encode selenocysteine, so there it is not a stop codon
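The long-ORF search described on this slide can be sketched in a few lines of Python. This is only an illustrative sketch, not the course's reference implementation: the names `find_orfs` and `reverse_complement` are made up here, only ATG is treated as a start codon, and the 300-nucleotide cutoff is the threshold quoted above. Alternative start codons and selenocysteine are deliberately ignored.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len=300):
    """Scan all six reading frames for ATG...stop stretches.

    Returns (strand, start, end, orf) tuples; start/end are 0-based
    offsets on the strand that was scanned ('+' or '-').
    Assumption: only ATG counts as a start codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, dna in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(dna) - 2, 3):
                codon = dna[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    end = i + 3                    # include the stop codon
                    if end - start >= min_len:
                        orfs.append((strand, start, end, dna[start:end]))
                    start = None                   # look for the next ATG
    return orfs


toy = "CCATGAAATTTGGGTAACC"
print(find_orfs(toy, min_len=9))   # short cutoff so the toy ORF is reported
print(find_orfs(toy))              # default cutoff: the short gene is missed
```

With the default cutoff the toy ORF is dropped, which is exactly the "short genes are missed" limitation listed above.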

Eukaryotes

• Maps are used as scaffolding during sequencing

• Recombination is used to predict the distance genes are from each other (the further apart two loci are on the chromosome, the more likely they are to be separated by recombination during meiosis)

• Pedigree analysis

Gene Finding in Eukaryotes

• Look for strongly conserved regions

• RNA blots - map expressed RNA to DNA

• Identification of CpG islands
  – Short stretches of CG-rich DNA are associated with the promoters of vertebrate genes

• Exon trapping - put the questionable clone between two exons that are expressed. If the clone contains an exon, it will be spliced into the mature transcript

Computational methods

• Signals - TATA box and other sequences
  – TATA box is found 30bp upstream of about 70% of genes

• Content - coding DNA and non-coding DNA differ in hexamer frequency (the frequency with which specific 6-nucleotide strings are used)
  – Some organisms prefer different codons for the same amino acid

• Homology - BLAST for the sequence in other organisms
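The "content" idea above can be made concrete with a small hexamer counter. The sketch below is an assumption-laden illustration: `hexamer_frequencies` and `coding_log_odds` are hypothetical helper names, the log-odds scoring scheme and the pseudo-frequency of 1e-6 for unseen hexamers are choices made here for the example, and real gene finders train the two frequency tables on large sets of known coding and non-coding sequence.

```python
import math
from collections import Counter


def hexamer_frequencies(seq, k=6):
    """Relative frequency of every overlapping k-mer (k=6: hexamers) in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def coding_log_odds(seq, coding_freq, noncoding_freq, pseudo=1e-6, k=6):
    """Sum of log(coding / non-coding) frequency over the hexamers of seq.

    Positive scores mean seq looks more like the coding training data.
    `pseudo` is an assumed stand-in frequency for hexamers never seen.
    """
    seq = seq.upper()
    score = 0.0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        score += math.log(coding_freq.get(kmer, pseudo)
                          / noncoding_freq.get(kmer, pseudo))
    return score


# Toy "training" data, far too small for real use.
coding = hexamer_frequencies("ATGATGATGATG")
noncoding = hexamer_frequencies("TTTTTTTTTTTT")
print(coding_log_odds("ATGATGATG", coding, noncoding))   # positive
print(coding_log_odds("TTTTTTTT", coding, noncoding))    # negative
```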

Genome Browser

• http://genome.ucsc.edu/

• Tables

• Genome browser

Non-coding RNA genes

• Ribosomal RNA (rRNA) and transfer RNA (tRNA) can be recognized by stochastic context-free grammars

• Detection is still an open problem

Hidden Markov Models (HMMs)

• Provide a probabilistic view of a process that we don’t fully understand

• The model can be trained with data we don’t understand to learn patterns

• You get to implement one for the first lab!!

State Transitions

Markov Model Example
  – x = States of the Markov model
  – a = Transition probabilities
  – b = Output probabilities
  – y = Observable outputs

– How does this differ from a finite state machine?
– Why is it a Markov process?

Example

• A distant friend tells you every day what he did (walk, shop, clean)

• You believe that the weather is a discrete Markov chain (no memory: tomorrow's weather depends only on today's) with two states (rainy, sunny), but you can't observe them directly. You know the average weather patterns

Formal Description

states = ('Rainy', 'Sunny')

observations = ('walk', 'shop', 'clean')

start_probability = {'Rainy': 0.6, 'Sunny': 0.4}

transition_probability = {
    'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
    'Sunny': {'Rainy': 0.4, 'Sunny': 0.6},
}

emission_probability = {
    'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
    'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1},
}

Observations

• Given (walk, shop, clean)
  – What is the probability of this sequence of observations? (Is he really still at home, or did he skip the country?)
  – What was the most likely sequence of rainy/sunny days?

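Matrix

Each cell is (start or transition probability) * (emission probability). For shop and clean, the first product is the step coming from Rainy and the second is the step coming from Sunny.

          Rainy            Sunny
walk      .6*.1            .4*.6
shop      .7*.4  .4*.4     .3*.3  .6*.3
clean     .7*.5  .4*.5     .3*.1  .6*.1

Sunny, Rainy, Rainy = (.4*.6)(.4*.4)(.7*.5)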
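Both questions from the Observations slide can be answered mechanically from the matrix. The sketch below reuses the parameters from the Formal Description slide; the function names `forward` and `viterbi` are chosen here for illustration, and this is a minimal version with no log-space arithmetic, so it is only suitable for short observation sequences.

```python
states = ("Rainy", "Sunny")
start_probability = {"Rainy": 0.6, "Sunny": 0.4}
transition_probability = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
emission_probability = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}


def forward(obs):
    """Probability of the observations, summed over all weather sequences."""
    alpha = {s: start_probability[s] * emission_probability[s][obs[0]]
             for s in states}
    for o in obs[1:]:
        alpha = {s: emission_probability[s][o]
                 * sum(alpha[r] * transition_probability[r][s] for r in states)
                 for s in states}
    return sum(alpha.values())


def viterbi(obs):
    """Return (probability, path) of the single most likely weather sequence."""
    best = {s: (start_probability[s] * emission_probability[s][obs[0]], [s])
            for s in states}
    for o in obs[1:]:
        step = {}
        for s in states:
            # Best way to arrive in state s from any previous state r.
            prob, path = max(
                ((best[r][0] * transition_probability[r][s], best[r][1])
                 for r in states),
                key=lambda candidate: candidate[0],
            )
            step[s] = (prob * emission_probability[s][o], path + [s])
        best = step
    return max(best.values(), key=lambda candidate: candidate[0])


obs = ("walk", "shop", "clean")
print(forward(obs))   # about 0.033612
print(viterbi(obs))   # about 0.01344 for Sunny, Rainy, Rainy
```

The Viterbi result is the same Sunny, Rainy, Rainy product written out under the matrix: (.4*.6)(.4*.4)(.7*.5) = 0.01344.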

The CpG island problem

• Methylation in human genome
  – “CG” -> “TG” happens in most places except “start regions” of genes and within genes
  – CpG islands = 100-1,000 bases before a gene starts

• Question
  – Given a long sequence, how would we find the CpG islands in it?

Hidden Markov Model

X = ATTGATGCAAAAGGGGGATCGGGCGATATAAAATTTG

(Figure: a CpG island inside X, flanked by "Other" regions on both sides.)

How can we identify a CpG island in a long sequence?

Idea 1: Test each window of a fixed number of nucleotides

Idea 2: Classify the whole sequence
  Class label S1: OOOO………….……O
  Class label S2: OOOO…………. OCC
  …
  Class label Si: OOOO…OCC..CO…O
  …
  Class label SN: CCCC……………….CC

$S^* = \arg\max_S P(S \mid X) = \arg\max_S P(S, X)$

S* = OOOO…OCC..CO…O
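Idea 1 can be sketched as a sliding-window scan. Everything specific in this sketch is an assumption made for illustration rather than something stated on the slides: the function name `cpg_windows`, the window length, and the two thresholds (GC fraction of at least 0.5 and observed/expected CpG ratio of at least 0.6, which are commonly quoted rule-of-thumb values).

```python
def cpg_windows(seq, window=200, step=1, min_gc=0.5, min_oe=0.6):
    """Return (start, end, gc_fraction, obs_exp_ratio) for windows that look CpG-rich.

    obs_exp_ratio = (#CG dinucleotides) / (#C * #G / window length).
    The default thresholds are illustrative assumptions.
    """
    seq = seq.upper()
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        w = seq[start:start + window]
        c, g = w.count("C"), w.count("G")
        observed = w.count("CG")
        expected = c * g / window
        gc_fraction = (c + g) / window
        ratio = observed / expected if expected else 0.0
        if gc_fraction >= min_gc and ratio >= min_oe:
            hits.append((start, start + window, gc_fraction, ratio))
    return hits


# Toy input: an AT-only stretch, a CG-only stretch, then AT-only again.
toy = "AT" * 10 + "CG" * 10 + "AT" * 10
print(cpg_windows(toy, window=20, step=20))
```

The weakness of this approach is the one that motivates Idea 2: the answer depends on the arbitrary window length and thresholds, whereas labelling the whole sequence lets the model decide where an island starts and ends.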

HMM is just one way of modeling p(X,S)…

A simple HMM

Parameters

Initial state prob: p(B) = 0.5; p(I) = 0.5

State transition prob: p(B→B) = 0.7, p(B→I) = 0.3, p(I→B) = 0.5, p(I→I) = 0.5

Output prob:

P(x|H_Other) = p(x|B): P(a|B) = 0.25, P(t|B) = 0.40, P(c|B) = 0.10, P(g|B) = 0.25

P(x|H_CpG) = p(x|I): P(a|I) = 0.25, P(t|I) = 0.25, P(c|I) = 0.25, P(g|I) = 0.25

(Figure: state diagram with the two states B and I and the initial and transition probabilities listed above.)

A General Definition of HMM

$HMM = (S, V, B, A, \Pi)$

N states: $S = \{s_1, \ldots, s_N\}$

M symbols: $V = \{v_1, \ldots, v_M\}$

Initial state probability: $\Pi = \{\pi_1, \ldots, \pi_N\}$, $\sum_{i=1}^{N} \pi_i = 1$, where $\pi_i$ is the prob of starting at state $s_i$

State transition probability: $A = \{a_{ij}\}$, $1 \le i, j \le N$, $\sum_{j=1}^{N} a_{ij} = 1$, where $a_{ij}$ is the prob of going $s_i \to s_j$

Output probability: $B = \{b_i(v_k)\}$, $1 \le i \le N$, $1 \le k \le M$, $\sum_{k=1}^{M} b_i(v_k) = 1$, where $b_i(v_k)$ is the prob of "generating" $v_k$ at $s_i$

How to “Generate” a Sequence?

Given a model, follow a path to generate the observations.

Model: states B and I, with P(B)=0.5, P(I)=0.5; transitions B→B 0.7, B→I 0.3, I→B 0.5, I→I 0.5

P(a|B)=0.25, P(t|B)=0.40, P(c|B)=0.10, P(g|B)=0.25

P(a|I)=0.25, P(t|I)=0.25, P(c|I)=0.25, P(g|I)=0.25

Sequence: a c g t t …

States: at each position of the sequence the hidden state is either B or I
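Following a path through the model to generate observations can be sketched as a small sampler. The parameters are the ones on this slide; the helper names `draw` and `generate` and the optional `seed` argument are hypothetical and only there to make the example reproducible.

```python
import random

INITIAL = {"B": 0.5, "I": 0.5}
TRANSITION = {"B": {"B": 0.7, "I": 0.3}, "I": {"B": 0.5, "I": 0.5}}
EMISSION = {
    "B": {"a": 0.25, "t": 0.40, "c": 0.10, "g": 0.25},
    "I": {"a": 0.25, "t": 0.25, "c": 0.25, "g": 0.25},
}


def draw(dist, rng):
    """Sample one key of `dist` with probability equal to its value."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]


def generate(length, seed=None):
    """Return (state_path, sequence) of the given length sampled from the HMM."""
    rng = random.Random(seed)
    path, seq = [], []
    state = draw(INITIAL, rng)                    # pick the first hidden state
    for _ in range(length):
        path.append(state)
        seq.append(draw(EMISSION[state], rng))    # emit a nucleotide from it
        state = draw(TRANSITION[state], rng)      # move to the next state
    return "".join(path), "".join(seq)


path, seq = generate(10, seed=1)
print(path)
print(seq)
```

In gene finding only the emitted sequence is observed; the state path printed first is exactly the part that has to be inferred.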

How to “Generate” a Sequence?

Model: the same B/I model and probabilities as on the previous slide.

Sequence: a c g t t …

States:   B I I I B

P(“BIIIB”, “acgtt”) = p(B)p(a|B) p(I|B)p(c|I) p(I|I)p(g|I) p(I|I)p(t|I) p(B|I)p(t|B)

= (0.5)(0.25) · (0.3)(0.25) · (0.5)(0.25) · (0.5)(0.25) · (0.5)(0.40)
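The product on this slide can be checked numerically. The following is a minimal sketch using the slide's parameters; `joint_probability` is a name chosen here, not one from the course material.

```python
INITIAL = {"B": 0.5, "I": 0.5}
TRANSITION = {"B": {"B": 0.7, "I": 0.3}, "I": {"B": 0.5, "I": 0.5}}
EMISSION = {
    "B": {"a": 0.25, "t": 0.40, "c": 0.10, "g": 0.25},
    "I": {"a": 0.25, "t": 0.25, "c": 0.25, "g": 0.25},
}


def joint_probability(path, seq):
    """P(path, seq): start in path[0], then alternate transition and emission."""
    prob = INITIAL[path[0]] * EMISSION[path[0]][seq[0]]
    for prev, cur, symbol in zip(path, path[1:], seq[1:]):
        prob *= TRANSITION[prev][cur] * EMISSION[cur][symbol]
    return prob


print(joint_probability("BIIIB", "acgtt"))   # about 2.93e-05
```

Even for five symbols the probability is already small, which is why longer sequences are usually handled with log probabilities.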

HMM as a Probabilistic Model

Sequential data

Time/Index: t1 t2 t3 t4 …

Data: o1 o2 o3 o4 …

Random variables/process

Observation variable: O1 O2 O3 O4 …

Hidden state variable: S1 S2 S3 S4 …

State transition prob:

$p(S_1, S_2, \ldots, S_T) = p(S_1)\, p(S_2 \mid S_1) \cdots p(S_T \mid S_{T-1})$

Probability of observations with known state transitions:

$p(O_1, O_2, \ldots, O_T \mid S_1, S_2, \ldots, S_T) = p(O_1 \mid S_1)\, p(O_2 \mid S_2) \cdots p(O_T \mid S_T)$

Joint probability (complete likelihood):

$p(O_1, O_2, \ldots, O_T, S_1, S_2, \ldots, S_T) = p(S_1)\, p(O_1 \mid S_1)\, p(S_2 \mid S_1)\, p(O_2 \mid S_2) \cdots p(S_T \mid S_{T-1})\, p(O_T \mid S_T)$

Here $p(S_1)$ is the init state distr., each $p(S_t \mid S_{t-1})$ is a state trans. prob., and each $p(O_t \mid S_t)$ is an output prob.

Probability of observations (incomplete likelihood):

$p(O_1, O_2, \ldots, O_T) = \sum_{S_1, \ldots, S_T} p(O_1, O_2, \ldots, O_T, S_1, \ldots, S_T)$
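The incomplete likelihood is a sum over every possible state path, which can be written out directly for short sequences. The sketch below does that by brute force for the two-state B/I model from the earlier slides and compares it with the forward recursion, which computes the same sum without enumerating paths. The function names are illustrative, and the brute-force version takes time exponential in the sequence length, so it is only a check.

```python
from itertools import product

STATES = ("B", "I")
INITIAL = {"B": 0.5, "I": 0.5}
TRANSITION = {"B": {"B": 0.7, "I": 0.3}, "I": {"B": 0.5, "I": 0.5}}
EMISSION = {
    "B": {"a": 0.25, "t": 0.40, "c": 0.10, "g": 0.25},
    "I": {"a": 0.25, "t": 0.25, "c": 0.25, "g": 0.25},
}


def joint(path, seq):
    """Complete likelihood p(O_1..O_T, S_1..S_T) for one state path."""
    prob = INITIAL[path[0]] * EMISSION[path[0]][seq[0]]
    for prev, cur, symbol in zip(path, path[1:], seq[1:]):
        prob *= TRANSITION[prev][cur] * EMISSION[cur][symbol]
    return prob


def likelihood_brute_force(seq):
    """Incomplete likelihood: sum the joint over all N**T state paths."""
    return sum(joint(path, seq) for path in product(STATES, repeat=len(seq)))


def likelihood_forward(seq):
    """Same quantity via the forward recursion, in O(T * N**2) time."""
    alpha = {s: INITIAL[s] * EMISSION[s][seq[0]] for s in STATES}
    for symbol in seq[1:]:
        alpha = {s: EMISSION[s][symbol]
                 * sum(alpha[r] * TRANSITION[r][s] for r in STATES)
                 for s in STATES}
    return sum(alpha.values())


seq = "acgtt"
print(likelihood_brute_force(seq))
print(likelihood_forward(seq))   # agrees with the brute-force sum
```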