Frog’s eye view of the jungle (time frozen) Push to restart time.

Frog’s eye view of the jungle(time frozen)

Push to restart time

Frog’s eye view of the jungle(time moving)

Frog’s eye view of the jungle(time frozen)

Frog’s eye view of the jungle(through movement filter)

Push to restart time

Frog’s eye view of the jungle(through movement filter)

Filters: Information reducers

Movement filter

Filters: Information reducers

Sequence filter

TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC TTAGATAAAC GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCACGCCC CTCCGTAAAC CTCTAACATG ATGTCAGCAA ATATTAAAAA TGAATAAACT TTGTTAAAGG TACAAATGAA AATTAGCAAA AAGAGTTTAA AGTTAAAAAC GAATTGCAGT CATTCTAGGG AAACCTGTAT GGTTACATGA ACTGCCTAAA AAACAAGCTA TTATATATTT TAAGAAATTA ATTGCAATTA ATTTCCTGGG CCCCAGCTGT CATTAAAAAG AGGCAAATAC AGCCAAGGAC GACAGCACTG ACCCTCAAGA AGGCACCGGC TGACAGACAG GCTGAAATTC CGCTGAGAGC AGAGTGGTAC ATTGAACCCT CCCTGCACCA GGTCTTTCCT GTGGGCACTG AGTGCAGACA ATGAATGACT GAACGAACGA TTGAATGAAA AGAAATGAGA TATGAGGCAA TCACAGCATC AGGTGACCTT AGTATCTATT CTCGGGAGCG CACGGCTCTA AAGAGGCCCA TATCCAGGCA CCTTTAGATG CAAGAAGGAG GAAACAGCTC GAAATCCCTG AGGCCGGAGG GTCAAGAACT CTCCACCGGC GGCAGCGGCC CCCCGGCCTA AGGCTGCCTG TGCTATAAAT ACGCGGCCCA TTCCCTGGGC TCGGCGGGAC AGATAACATG AATGTGCCCT

CTCCGTAAAC CTCTAAC...

How organism is made

How organism works

From Sequence to Organism: How does Nature do it?

ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...

Genetic code Rules of folding

Active site
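The translation shown on this slide can be reproduced with a few lines of code. This is only a minimal sketch of the "Genetic code" step, assuming the standard genetic code; the names `translate` and `three_letter` are illustrative and not taken from any tool mentioned in the talk. The "Rules of folding" step is not modeled.

```python
# Standard genetic code, built by enumerating codons with T, C, A, G
# at each of the three positions (the usual compact layout).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "*": "Stop",
}


def translate(dna):
    """Translate a DNA coding strand, read in frame from its first base."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def three_letter(protein):
    """Render a one-letter protein string in three-letter notation."""
    return "-".join(THREE_LETTER[aa] for aa in protein)


if __name__ == "__main__":
    print(three_letter(translate("ATGACTTATGATCAACGCACAGGGCTA")))
```

Run on the slide's sequence, the sketch prints the same nine residues shown on the slide.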

From Sequence to Organism: How does Nature do it?

ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...

Active site

Cell interaction

Metabolism,Architecture

Genetic code Rules of folding

From Sequence to Organism: How does Nature do it?

ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...

Genetic code

Active site

Gives us:

• Custom antibiotics

Genetic code Rules of folding

From Sequence to Organism: How does Nature do it?

ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...

Gives us:

• Custom antibiotics
• Custom antibodies
• Custom enzymes
• New materials

Genetic code Rules of folding

Active site

From Sequence to Organism: How does Nature do it?

ATGACTTATGATCAACGCACAGGGCTA Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...

Genetic code

Rules of transcriptional and post-transcriptional control

• Transcriptional initiation
• Transcriptional termination / polyA tailing
• Splicing
• Translational initiation

?

TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC TTAGATAAAC GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCACGCCC CTCCGTAAAC CTCTAACATG ATGTCAGCAA ATATTAAAAA TGAATAAACT TTGTTAAAGG TACAAATGAA AATTAGCAAA AAGAGTTTAA AGTTAAAAAC GAATTGCAGT CATTCTAGGG AAACCTGTAT GGTTACATGA ACTGCCTAAA AAACAAGCTA TTATATATTT TAAGAAATTA ATTGCAATTA ATTTCCTGGG CCCCAGCTGT CATTAAAAAG AGGCAAATAC AGCCAAGGAC GACAGCACTG ACCCTCAAGA AGGCACCGGC TGACAGACAG GCTGAAATTC CGCTGAGAGC AGAGTGGTAC ATTGAACCCT CCCTGCACCA GGTCTTTCCT GTGGGCACTG AGTGCAGACA ATGAATGACT GAACGAACGA TTGAATGAAA AGAAATGAGA

ATGACTTATGATCAACGCACAGGGCTA 3%

TCTACTTATATTCAATCCACAGGGCTACACCTAGTTCTTGAAGAGTCTGTTGAATGAACACATACATGGTTTATCTGTTTTTCTGTCTGCTCTGACCTCTGGCAGCTT

TAGCCTGCCCCACTCTTAGATAAACGAACCTTAGTGACTTCTGCTATACCAAAGTCTCCACGCCCCTCCGTAAACCTCTAACATGATGTCAGCAAATATTAAAAATGA

97%

From Sequence to Organism: How does Nature do it?

Natural filters/transformations

• Selective transcription

• Selective processing

• Translation

• Folding

DNA Functional protein

From Sequence to Organism: How does Nature do it?

Natural filters/transformations

DNA Functional protein

From Sequence to Organism: How can WE do it?

Simulation of Nature

Utterance of Wm Shakespeare

Utterance of George W Bush

“Whether ‘tis nobler in the mind to suffer the slings and arrows

of outrageous fortune...”

“We must give our military every tool and weapon it needs to prevail...”

???

From Sequence to Organism: How can WE do it?

Surrogate Processes

Utterance of Wm Shakespeare

Utterance of George W Bush

“Whether ‘tis nobler in the mind to suffer the slings and arrows

of outrageous fortune...”

“We must give our military every tool and weapon it needs to prevail...”

Words/sentence; Choice of words; Sentence structure; …

From Sequence to Organism: How can WE do it?

Natural filters/transformations

• Selective transcription

• Selective processing

• Translation

• Folding

Surrogate filters

Characteristics of coding sequences/introns

My sequence

• Gene finders

Predicted coding regions

From Sequence to Organism: How can WE do it?

Natural filters/transformations

• Selective transcription

• Selective processing

• Translation

• Folding

Surrogate filters

• Gene finders

• Similarity finders

Sequence/motif Databases My sequence

From Sequence to Organism: How can WE do it?

Natural filters/transformations

• Selective transcription

• Selective processing

• Translation

• Folding

Surrogate filters

• Gene finders

• Similarity finders

• Feature finders

Predicted features

Characteristics of features

My sequence

From Sequence to Organism: How can WE do it?

Natural filters/transformations

• Selective transcription

• Selective processing

• Translation

• Folding

Surrogate filters

• Gene finders

• Similarity finders

• Feature finders

• Pattern finders

My sequences Statistical engine

Surrogate Filters

• Gene finders

• Similarity finders

• Feature finders

• Pattern finders

How do they work?

Case studies

• Real problems

• Mixed strategies

You do it

Surrogate Filters: Gene finders

Class 1: Start/Stop codon search (Map, Frames, OrfFinder)

CTCCACGCCCCTCCGTACACCTCTAACATGATGTCAGCAAATATTAAAAATGAATAAACTTTGTGACATGTACAAATGGAAATATGCAA

CTC CAC GCC CCT CCG TAC ACC TCT AAC ATG ATC TCA GCA AAT ATT AAA AAT GAA TAA ACT TTG TGA CAT GTA CAA ATG GAA ATA TGC AA

C TCC ACG CCC CTC CGT ACA CCT CTA ACA TGA TCT CAG CAA ATA TTA AAA ATG AAT AAA CTT TGT GAC ATG TAC AAA TGG AAA TAT GCA A

CT CCA CGC CCC TCC GTA CAC CTC TAA CAT GAT CTC AGC AAA TAT TAA AAA TGA ATA AAC TTT GTG ACA TGT ACA AAT GGA AAT ATG CAA

Look for start codons (ATG) (GTG,TTG)

Look for stop codons (TAA,TAG,TGA)

CTCCACGCCCCTCCGTACACCTCTAACATGATGTCAGCAAATATTAAAAATGAATAAACTTTGTGACATGTACAAATGGAAATATGCAA

TTGCATATTTCCATTTGTACATGTCACAAAGTTTATTCATTTTTAATATTTGCTGAGATCATGTTAGAGGTGTACGGAGGGGCGTGGAG
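The Class 1 search (three reading frames on each strand, start codon to the next in-frame stop codon) can be sketched as follows. This is a minimal illustration, not the algorithm used by Map, Frames, or OrfFinder; the function names and the choice to report only the first start codon before each stop are assumptions made for the example.

```python
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG"}  # pass {"ATG", "GTG", "TTG"} to allow the alternative starts

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, starts=STARTS):
    """Yield (start, end, orf) for each frame; start is 0-based, end exclusive."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in starts:
                start = i                      # first start codon seen in this frame
            elif start is not None and codon in STOPS:
                yield start, i + 3, seq[start:i + 3]
                start = None                   # look for the next ORF in this frame


def find_orfs(seq, starts=STARTS):
    """Scan all six reading frames; '-' coordinates refer to the reverse strand."""
    seq = seq.upper()
    found = [("+", s, e, o) for s, e, o in orfs_in_strand(seq, starts)]
    found += [("-", s, e, o)
              for s, e, o in orfs_in_strand(reverse_complement(seq), starts)]
    return found


if __name__ == "__main__":
    for strand, start, end, orf in find_orfs("CCATGAAATAGGG"):
        print(strand, start, end, orf)
```

The sketch inherits the weaknesses of Class 1 methods: it cannot tell which of several in-frame start codons is real, and it reports short open reading frames that may not be genes.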

Surrogate Filters: Gene finders

Class 1: Start/Stop codon search (Map, Frames, OrfFinder)

Look for start codons (ATG) (GTG,TTG)

Look for stop codons (TAA,TAG,TGA)

Pro: Quick, simple

Con: Useless for eukaryotic genomic sequences (introns)
Inaccurate (start codon problem)
Inaccurate (doubtful short open reading frames)

Surrogate Filters: Gene finders

Genetic Code

UUU Phe   UCU Ser   UAU Tyr     UGU Cys
UUC Phe   UCC Ser   UAC Tyr     UGC Cys
UUA Leu   UCA Ser   UAA ochre   UGA opal
UUG Leu   UCG Ser   UAG amber   UGG Trp
CUU Leu   CCU Pro   CAU His     CGU Arg
CUC Leu   CCC Pro   CAC His     CGC Arg
CUA Leu   CCA Pro   CAA Gln     CGA Arg
CUG Leu   CCG Pro   CAG Gln     CGG Arg
AUU Ile   ACU Thr   AAU Asn     AGU Ser
AUC Ile   ACC Thr   AAC Asn     AGC Ser
AUA Ile   ACA Thr   AAA Lys     AGA Arg
AUG Met   ACG Thr   AAG Lys     AGG Arg
GUU Val   GCU Ala   GAU Asp     GGU Gly
GUC Val   GCC Ala   GAC Asp     GGC Gly
GUA Val   GCA Ala   GAA Glu     GGA Gly
GUG Val   GCG Ala   GAG Glu     GGG Gly

The code is degenerate

Class 2: Codon bias recognition (TestCode)

Are codons equally used?

Surrogate Filters: Gene finders

Genetic Code (human)

UUU Phe   UCU Ser   UAU Tyr     UGU Cys
UUC Phe   UCC Ser   UAC Tyr     UGC Cys
UUA Leu   UCA Ser   UAA ochre   UGA opal
UUG Leu   UCG Ser   UAG amber   UGG Trp
CUU Leu   CCU Pro   CAU His     CGU Arg
CUC Leu   CCC Pro   CAC His     CGC Arg
CUA Leu   CCA Pro   CAA Gln     CGA Arg
CUG Leu   CCG Pro   CAG Gln     CGG Arg
AUU Ile   ACU Thr   AAU Asn     AGU Ser
AUC Ile   ACC Thr   AAC Asn     AGC Ser
AUA Ile   ACA Thr   AAA Lys     AGA Arg
AUG Met   ACG Thr   AAG Lys     AGG Arg
GUU Val   GCU Ala   GAU Asp     GGU Gly
GUC Val   GCC Ala   GAC Asp     GGC Gly
GUA Val   GCA Ala   GAA Glu     GGA Gly
GUG Val   GCG Ala   GAG Glu     GGG Gly

Codon usage is biased

Most frequently used codons

Class 2: Codon bias recognition (TestCode)

Codon bias universal?
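Whether codons are used equally can be checked by simply counting them in a known coding sequence. The sketch below is only an illustration of that counting step; it is not the TestCode statistic, and the function names are made up for the example.

```python
from collections import Counter


def codon_counts(cds):
    """Count codons in a coding sequence read in frame from its first base."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def codon_fractions(cds):
    """Fraction of all codons in the sequence accounted for by each codon."""
    counts = codon_counts(cds)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


if __name__ == "__main__":
    # Toy sequence: Lys is encoded twice by AAA and once by AAG.
    for codon, fraction in sorted(codon_fractions("ATGAAAAAGAAA").items()):
        print(codon, round(fraction, 2))
```

Counting over many genes from one organism, then comparing synonymous codons (for example AAA against AAG for Lys), gives the kind of bias the slide refers to. Repeating the count for another organism addresses the question of whether the bias is universal.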

Surrogate Filters: Gene finders

Class 2: Codon bias recognition (TestCode)

Pro: Quick, simple, available through GCG
Better than Class 1 in excluding false open reading frames

Con: Useless for eukaryotic genomic sequences (introns)
Gives only general areas of open reading frames

Surrogate Filters: Gene finders

Class 3: Hidden Markov Model (HMM)-based recognition

Principle

Step 1: Create model through extensive training set
* Training set = proven or suspected genes
* Organism-specific

Step 2: Assess candidate genes through filter of model

Step 1: Create model through extensive training set

Surrogate Filters: Gene finders

Class 3: Hidden Markov Model (HMM)-based recognition

AAA AAC AAG AAT ACA . . . TTG TTT

Training set

AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATCAATGACTATCAGACAGAGAATCATCGTGCTGTCAGTAAAACCTCTGATTTCGATCTTTACCATAATTGTTATGTTGTAATGACTAACCAGACTATCTTTTACAGAGCTTCTGGTTAACACTTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTCATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCTATGAGACGCTCCGCCAACGAGCAGTGTCTCTTAAAGAACGTTATGAGCGCTCAGTTAACTTCAGAAATTCACGGCGGAAATCCATAGTTATTATTACTTATGACTAAAACAAAATTACTATGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATGACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTATATTTCGACTTTAAAACTTATAGTAGATGGCTTAATTCTCAAATAACAAACTCATTTTTAGTAGATATTTCATGCAAACTGAGGTTTTTAGTGATATTTTCCCCTTATTGAGTACAGCCACTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGATCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGATGCCTGGGGTAATGCAGTTTATTTCGTTGTATCTGGATGGGTAAAAGTTCGGCGCACCTGTGGAGATGATTCGGTAGCTTT

Step 1: Create model through extensive training set

AAAA: 33%

AAAC: 25%

AAAG: 12%

AAAT: 30%
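Step 1 amounts to counting, for every three-nucleotide context, how often each nucleotide follows it in the training set. The sketch below shows that counting under simple assumptions (no pseudocounts, no distinction between codon positions); `train_markov` is an illustrative name, not a function of GeneMark or any other tool.

```python
from collections import defaultdict


def train_markov(training_seqs, order=3):
    """Estimate P(next base | previous `order` bases) from training sequences."""
    counts = defaultdict(lambda: dict.fromkeys("ACGT", 0))
    for seq in training_seqs:
        seq = seq.upper()
        for i in range(len(seq) - order):
            context, nxt = seq[i:i + order], seq[i + order]
            # Skip windows containing ambiguous characters such as N.
            if nxt in "ACGT" and set(context) <= set("ACGT"):
                counts[context][nxt] += 1
    model = {}
    for context, row in counts.items():
        total = sum(row.values())
        model[context] = {base: n / total for base, n in row.items()}
    return model


if __name__ == "__main__":
    # Toy training set: AAA is followed by A three times and by C once.
    model = train_markov(["AAAAAAC"])
    print(model["AAA"])
```

With a real training set of proven or suspected genes, the row for AAA would hold values like the 33%, 25%, 12%, and 30% shown on the slide.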

Surrogate Filters: Gene finders

Class 3: Hidden Markov Model (HMM)-based recognition

AAA AAC AAG AAT ACA . . . TTG TTT

Training set

AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATCAATGACTATCAGACAGAGAATCATCGTGCTGTCAGTAAAACCTCTGATTTCGATCTTTACCATAATTGTTATGTTGTAATGACTAACCAGACTATCTTTTACAGAGCTTCTGGTTAACACTTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTCATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCTATGAGACGCTCCGCCAACGAGCAGTGTCTCTTAAAGAACGTTATGAGCGCTCAGTTAACTTCAGAAATTCACGGCGGAAATCCATAGTTATTATTACTTATGACTAAAACAAAATTACTATGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATGACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTATATTTCGACTTTAAAACTTATAGTAGATGGCTTAATTCTCAAATAACAAACTCATTTTTAGTAGATATTTCATGCAAACTGAGGTTTTTAGTGATATTTTCCCCTTATTGAGTACAGCCACTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGATCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGATGCCTGGGGTAATGCAGTTTATTTCGTTGTATCTGGATGGGTAAAAGTTCGGCGCACCTGTGGAGATGATTCGGTAGCTTT

Step 1: Create model through extensive training set

AACA: 30%

AACC: 20%

AACG: 15%

AACT: 35%

AAA AAC AAG AAT ACA . . . TTG TTT

Surrogate Filters: Gene finders

Class 3: Hidden Markov Model (HMM)-based recognition

Training set

AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATCAATGACTATCAGACAGAGAATCATCGTGCTGTCAGTAAAACCTCTGATTTCGATCTTTACCATAATTGTTATGTTGTAATGACTAACCAGACTATCTTTTACAGAGCTTCTGGTTAACACTTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTCATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCTATGAGACGCTCCGCCAACGAGCAGTGTCTCTTAAAGAACGTTATGAGCGCTCAGTTAACTTCAGAAATTCACGGCGGAAATCCATAGTTATTATTACTTATGACTAAAACAAAATTACTATGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATGACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTATATTTCGACTTTAAAACTTATAGTAGATGGCTTAATTCTCAAATAACAAACTCATTTTTAGTAGATATTTCATGCAAACTGAGGTTTTTAGTGATATTTTCCCCTTATTGAGTACAGCCACTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGATCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGATGCCTGGGGTAATGCAGTTTATTTCGTTGTATCTGGATGGGTAAAAGTTCGGCGCACCTGTGGAGATGATTCGGTAGCTTT

Step 2: Assess candidate genes

    A    C    G    T
AAA 0.33 0.25 0.12 0.30
AAC 0.30 0.20 0.15 0.35
AAG 0.35 0.15 0.20 0.30
AAT 0.30 0.15 0.20 0.25
ACA 0.25 0.20 0.15 0.35
. . .
TTG 0.25 0.30 0.15 0.30
TTT 0.30 0.25 0.10 0.35

Candidate gene

AAAGCAA…

0.12

3rd order Markov model

Surrogate Filters: Gene finders

Class 3: Hidden Markov Model (HMM)-based recognition

Step 2: Assess candidate genes

AAAGCAA…

0.12 x 0.15

3rd order Markov model

Surrogate Filters: Gene finders

Class 3: Hidden Markov Model (HMM)-based recognition

    A    C    G    T
AAA 0.33 0.25 0.12 0.30
AAC 0.30 0.20 0.15 0.35
AAG 0.35 0.15 0.20 0.30
AAT 0.30 0.15 0.20 0.25
ACA 0.25 0.20 0.15 0.35
. . .
TTG 0.25 0.30 0.15 0.30
TTT 0.30 0.25 0.10 0.35

Candidate gene

Step 2: Assess candidate genes

AAAGCTA…

0.12 x 0.15 . . .

So far, not a good candidate!

3rd order Markov model
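The running product 0.12 x 0.15 comes from sliding a three-nucleotide context along the candidate and looking up the probability of the next nucleotide. A minimal sketch of that lookup follows; `MODEL` contains only the rows visible on the slide, and the function name is illustrative.

```python
import math

# Rows copied from the slide's table; columns are A, C, G, T.
# Rows hidden behind ". . ." on the slide are not included.
MODEL = {
    "AAA": {"A": 0.33, "C": 0.25, "G": 0.12, "T": 0.30},
    "AAC": {"A": 0.30, "C": 0.20, "G": 0.15, "T": 0.35},
    "AAG": {"A": 0.35, "C": 0.15, "G": 0.20, "T": 0.30},
    "AAT": {"A": 0.30, "C": 0.15, "G": 0.20, "T": 0.25},
    "ACA": {"A": 0.25, "C": 0.20, "G": 0.15, "T": 0.35},
    "TTG": {"A": 0.25, "C": 0.30, "G": 0.15, "T": 0.30},
    "TTT": {"A": 0.30, "C": 0.25, "G": 0.10, "T": 0.35},
}


def sequence_probability(candidate, model, order=3):
    """Multiply P(next base | previous `order` bases) along the candidate."""
    prob = 1.0
    for i in range(len(candidate) - order):
        context, nxt = candidate[i:i + order], candidate[i + order]
        prob *= model[context][nxt]  # raises KeyError for a context not in the table
    return prob


if __name__ == "__main__":
    # AAA -> G contributes 0.12, then AAG -> C contributes 0.15.
    p = sequence_probability("AAAGC", MODEL)
    print(p, math.log10(p))
```

For sequences of realistic length the product underflows quickly, so a practical implementation would sum logarithms instead of multiplying probabilities.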

Surrogate Filters: Gene finders

Class 3: Hidden Markov Model (HMM)-based recognition

    A    C    G    T
AAA 0.33 0.25 0.12 0.30
AAC 0.30 0.20 0.15 0.35
AAG 0.35 0.15 0.20 0.30
AAT 0.30 0.15 0.20 0.25
ACA 0.25 0.20 0.15 0.35
. . .
TTG 0.25 0.30 0.15 0.30
TTT 0.30 0.25 0.10 0.35

Candidate gene

Surrogate Filters: Gene finders

Class 3: Hidden Markov Model (HMM)-based recognition

Pro: Among the most accurate methods known

Con: Needs a large training set
May miss genes of foreign origin
Will miss very small genes

Surrogate Filters: Scenario I – Case of the Hidden Heterocyst

Case of the Hidden Heterocyst

heterocysts

Matveyev and Elhai (unpublished)

N2

NH3

NH3

O2

Case of the Hidden Heterocyst: Strategy to find heterocyst differentiation genes

Nostoc genome Transposon

1. Use transposon mutagenesis

Case of the Hidden Heterocyst: Strategy to find heterocyst differentiation genes

Nostoc genome Transposon

1. Use transposon mutagenesis to find a mutant defective in heterocyst differentiation

Case of the Hidden Heterocyst: Strategy to find heterocyst differentiation genes

Nostoc genome

2. Sequence out from transposon

AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATCAATGACTATCAGACAGAGAATCATCGTGCTGTCAGTAAAACCTCTGATTTCGATCTTTACCATAATTGTTATGTTGTAATGACTAACCAGACTATCTTTTACAGAGCTTCTGGTTAACACTTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTCATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCTATGAGACGCTCCGCCAACGAGCAGTGTCTCTTAAAGAACGTTATGAGCGCTCAGTTAACTTCAGAAATTCACGGCGGAAATCCATAGTTATTATTACTTATGACTAAAACAAAATTACTATGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATGACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTATATTTCGACTTTAAAACTTATAGTAGATGGCTTAATTCTCAAATAACAAACTCATTTTTAGTAGATATTTCATGCAAACTGAGGTTTTTAGTGATATTTTCCCCTTATTGAGTACAGCCACTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGATCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGATGCCTGGGGTAATGCAGTTTATTTCGTTGTATCTGGATGGGTAAAAGTTCGGCGCACCTGTGGA

1. Use transposon mutagenesis to find a mutant defective in heterocyst differentiation

Case of the Hidden Heterocyst: Strategy to find heterocyst differentiation genes

Nostoc genome

2. Sequence out from transposon

AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATCAATGACTATCAGACAGAGAATCATCGTGCTGTCAGTAAAACCTCTGATTTCGATCTTTACCATAATTGTTATGTTGTAATGACTAACCAGACTATCTTTTACAGAGCTTCTGGTTAACACTTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTCATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCTATGAGACGCTCCGCCAACGAGCAGTGTCTCTTAAAGAACGTTATGAGCGCTCAGTTAACTTCAGAAATTCACGGCGGAAATCCATAGTTATTATTACTTATGACTAAAACAAAATTACTATGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATGACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTATATTTCGACTTTAAAACTTATAGTAGATGGCTTAATTCTCAAATAACAAACTCATTTTTAGTAGATATTTCATGCAAACTGAGGTTTTTAGTGATATTTTCCCCTTATTGAGTACAGCCACTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGATCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGATGCCTGGGGTAATGCAGTTTATTTCGTTGTATCTGGATGGGTAAAAGTTCGGCGCACCTGTGGA

1. Use transposon mutagenesis to find a mutant defective in heterocyst differentiation

3. Find gene boundaries

4. Identify gene

Do it

Case of the Hidden Heterocyst: Strategy to find heterocyst differentiation genes

1. Go to http://www.vcu.edu/~elhaij/BioInf

2. Open second browser (Ctrl-N in Netscape)

Go to same site (copy and paste URL)

3. In 1st browser, go to Program List, click on Gene Finders, and open GeneMark

4. In 2nd browser, open Nostoc sequence

Case of the Hidden Heterocyst: Strategy to find heterocyst differentiation genes

Mission successful:

>Translation: 397..639 (direct), 81 amino acids
VLGSKIEEGPKHIILDLSQIDFIDSSGLGALVQLAKQAQTAEGTLQIVTNARVTQTVKLVRLEKFLSLQKSVEEALENVK*

… or was it?

Check predicted protein against databases

Surrogate Filters: Similarity finders

Blast

• BlastP: Protein sequence to search protein database

• BlastN: Nucleotide sequence to search nucleotide database

• BlastX: Nucleotide sequence (translated) to search protein database

• TBlastN: Protein sequence to search (translated) nucleotide database

• Blast2Seq: Compare two sequences you specify

Do it

FastA

• (Various flavors)

Pfam (Protein motif families)
Finds conserved motifs similar to protein sequence

Case of the Hidden Heterocyst: Strategy to find heterocyst differentiation genes

Mission successful:

>Translation: 397..639 (direct), 81 amino acids
VLGSKIEEGPKHIILDLSQIDFIDSSGLGALVQLAKQAQTAEGTLQIVTNARVTQTVKLVRLEKFLSLQKSVEEALENVK*

Why?

• GeneMark correct: Conservation of noncoding regions

VLGSK

• GeneMark wrong: Fooled by weird aa sequence or start codon

Case of the Hidden Heterocyst: Strategy to find heterocyst differentiation genes

Moral

Automated gene finders are wonderful, but common sense is better

Don’t trust automated annotation

Surrogate Filters: Feature finders

Hidden Markov model-based methods

• Good for contiguous features (e.g. signal sequences)
• Not good for features with gaps (e.g. promoters)

Ad hoc methods

• Feature-specific rules (e.g. tandem repeats, terminators)

Position-dependent frequency tables = Position-specific scoring matrix (PSSM) = Weight table

Surrogate Filters: Feature finders

Position-dependent frequency tables

CCCTATATAAGGC... histone H1t
CGCTATAAAAACT... HMG-17
GGGTATATAAGCG... β-tubulin β2
GGCTATATAAAAC... α-actin skel-m.
TTCTATAAAGCGG... α-cardiac actin
CCCTATAAAACCC... β-actin
GAGTATAAAGCAC... keratin I 50K
GGTTATAAAAACA... vimentin
CAGTATAAAAGGG... α1(I) collagen
CCGTATAAATAGG... α2(I) collagen
TCCCATATAAGCC... fibronectin

Some of 106 aligned human promoter sequences (near -26)

Consensus TATAAA
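A position-dependent frequency table is built by counting bases column by column through the alignment. The sketch below does this for the eleven sequences listed on the slide (not the full set of 106, so the numbers differ from the slide's table); the function names are illustrative.

```python
from collections import Counter

# The eleven aligned promoter fragments shown on the slide.
ALIGNED = [
    "CCCTATATAAGGC",  # histone H1t
    "CGCTATAAAAACT",  # HMG-17
    "GGGTATATAAGCG",  # beta-tubulin
    "GGCTATATAAAAC",  # alpha-actin skel-m.
    "TTCTATAAAGCGG",  # alpha-cardiac actin
    "CCCTATAAAACCC",  # beta-actin
    "GAGTATAAAGCAC",  # keratin I 50K
    "GGTTATAAAAACA",  # vimentin
    "CAGTATAAAAGGG",  # alpha1(I) collagen
    "CCGTATAAATAGG",  # alpha2(I) collagen
    "TCCCATATAAGCC",  # fibronectin
]


def frequency_table(aligned):
    """Return one dict per alignment column mapping base -> relative frequency."""
    n = len(aligned)
    table = []
    for column in zip(*aligned):
        counts = Counter(column)
        table.append({base: counts[base] / n for base in "ACGT"})
    return table


def consensus(table):
    """Most frequent base at each position (ties resolved in A, C, G, T order)."""
    return "".join(max("ACGT", key=lambda b: col[b]) for col in table)


if __name__ == "__main__":
    table = frequency_table(ALIGNED)
    print(consensus(table)[3:9])  # the six central columns of the alignment
```

The six central columns give the consensus TATAAA, while the flanking columns are much less conserved, which is the information a single consensus string throws away and a frequency table keeps.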

Surrogate Filters: Feature finders

Position-dependent frequency tables

A  21  29   0 100   0 100  81  91  57  32  15  26
T  16  22  87   0 100   0  19   0  21   6  10  11
C  28  24  13   0   0   0   0   0   0  15  33  28
G  35  25   0   0   0   0   0   9  22  47  42  34

CCCTATATAAGGC... histone H1t
CGCTATAAAAACT... HMG-17
GGGTATATAAGCG... β-tubulin β2
GGCTATATAAAAC... α-actin skel-m.
TTCTATAAAGCGG... α-cardiac actin
CCCTATAAAACCC... β-actin
GAGTATAAAGCAC... keratin I 50K
GGTTATAAAAACA... vimentin
CAGTATAAAAGGG... α1(I) collagen
CCGTATAAATAGG... α2(I) collagen
TCCCATATAAGCC... fibronectin

Some of 106 aligned human promoter sequences (near -26)

aceB ACTATGGAGCATCTGCACATGAAAACC
atpI ACCTCGAAGGGAGCAGGAGTGAAAAAC
bioB ACGTTTTGGAGAAGCCCCATGGCTCAC
glnA ATCCAGGAGAGTTAAAGTATGTCCGCT
glnH TAGAAAAAAGGAAATGCTATGAAGTCT
lacZ TTCACACAGGAAACAGCTATGACCATG
rpsJ AATTGGAGCTCTGGTCTCATGCAGAAC
serC GCAACGTGGTGAGGGGAAATGGCTCAA
sucA GATGCTTAAGGGATCACGATGCAGAAC
trpE CAAAATTAGAGAATAACAATGCAAACA

Position-Specific Scoring Matrix in action

Surrogate Filters: Feature finders

Experimentally proven

start sites

unknown

aceB ACTATGGAGCATCTGCACATGAAAACC
atpI ACCTCGAAGGGAGCAGGAGTGAAAAAC
bioB ACGTTTTGGAGAAGCCCCATGGCTCAC
glnA ATCCAGGAGAGTTAAAGTATGTCCGCT
glnH TAGAAAAAAGGAAATGCTATGAAGTCT
lacZ TTCACACAGGAAACAGCTATGACCATG
rpsJ AATTGGAGCTCTGGTCTCATGCAGAAC
serC GCAACGTGGTGAGGGGAAATGGCTCAA
sucA GATGCTTAAGGGATCACGATGCAGAAC
trpE CAAAATTAGAGAATAACAATGCAAACA

Position-Specific Scoring Matrix in action

Surrogate Filters: Feature finders

Experimentally proven

start sites

unknown

aceB ACCACATAACTATGGAGCATCTGCACATGAAAACC
atpI ACCTCGAAGGGAGCAG.....GAGTGAAAAAC
bioB ACGTTTTGGAGAAGC...CCCATGGCTCAC
glnA ATCCAGGAGAGTTA.AAGTATGTCCGCT
glnH TAGAAAAAAGGAAATG.....CTATGAAGTCT
lacZ TTCACACAGGAAACAG....CTATGACCATG
rpsJ AATTGGAGCTCTGGTCTCATGCAGAAC
serC GCAACGTGGTGAGGG...GAAATGGCTCAA
sucA GATGCTTAAGGGATCA....CGATGCAGAAC
trpE CAAAATTAGAGAATA...ACAATGCAAACA

Surrogate Filters: Feature finders

Position-Specific Scoring Matrix in action

ACGT

aceB ACCACATAACTATGGAGCATCT.GCACATGAAAACC
atpI ACCTCGAAGGGAGCAG.....GAGTGAAAAAC
bioB ACGTTTTGGAGAAGC...CCCATGGCTCAC
glnA ATCCAGGAGAGTTA.AAGTATGTCCGCT
glnH TAGAAAAAAGGAAATG.....CTATGAAGTCT
lacZ TTCACACAGGAAACAG....CTATGACCATG
rpsJ AATTGGAGCTCTGGTCTCATGCAGAAC
serC GCAACGTGGTGAGGG...GAAATGGCTCAA
sucA GATGCTTAAGGGATCA....CGATGCAGAAC
trpE CAAAATTAGAGAATA...ACAATGCAAACA

Surrogate Filters: Feature finders

Position-Specific Scoring Matrix in action

ACGT

Surrogate Filters: Pattern finders

Specified patterns (FindPatterns, PatScan)
e.g. Find instances of restriction sites

New pattern discovery (Meme, Gibbs sampler)

snRNA U1 (pU1-6) AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14 CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1 GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1 CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
nucleolin GCAGGCTCAGTCTTTCGCCTCAGTCTCGAGCTCTCGCTGG
snRNP E TGCCGCCGCGTGACCTTCACACTTCCGCTTCCGGTTCTTT
rp S14 GACACGGAAGTGACCCCCGTCGCTCCGCCCTCTCCCACTC
rp S17 TGGCCTAAGCTTTAACAGGCTTCGCCTGTGCTTCCTGTTT
ribosomal p. S19 ACCCTACGCCCGACTTGTGCGCCCGGGAAACCCCGTCGTT
α-tubulin bα1 GGTCTGGGCGTCCCGGCTGGGCCCCGTGTCTGTGCGCACG
β-tubulin β2 GGGAGGGTATATAAGCGTTGGCGGACGGTCGGTTGTAGCA
α-actin skel-m. CCGCGGGCTATATAAAACCTGAGCAGAGGGACAAGCGGCC
α-cardiac actin TCAGCGTTCTATAAAGCGGCCCTCCTGGAGCCAGCCACCC
β-actin CGCGGCGGCGCCCTATAAAACCCAGCGGCGCGACGCGCCA

Human sequences 5’ to transcriptional start

Surrogate Filters: Pattern finders

How do pattern finders work?

snRNA U1 (pU1-6) AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14 CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1 GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1 CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT

Step 1. Arbitrarily choose candidate pattern from a sequence

Step 2. Find best matches to pattern in all sequences

Step 3. Construct position-dependent frequency table based on matches

Step 4. Calculate relative probability of matches from frequency table

GACAGGGCAGAA
GCCCGGGTGTTT
GCCGGGGACGCG
GCCCCCGGGCCT
GCCGCAGAGCTG

A 0.208 0.292 0.000 0.999 0.000 0.999 0.811 0.905 0.575 0.321 0.151 0.264
T 0.160 0.217 0.867 0.000 0.999 0.000 0.189 0.000 0.208 0.057 0.104 0.113
C 0.283 0.236 0.132 0.000 0.000 0.000 0.000 0.000 0.000 0.151 0.330 0.283
G 0.349 0.255 0.000 0.000 0.000 0.000 0.000 0.095 0.217 0.472 0.415 0.340

Surrogate Filters: Pattern finders

How do pattern finders work?

snRNA U1 (pU1-6) AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14 CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1 GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1 CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT

Step 1. Arbitrarily choose candidate pattern from a sequence

Step 2. Find best matches to pattern in all sequences

Step 3. Construct position-dependent frequency table based on matches

Step 4. Calculate relative probability of matches from frequency table

Step 5. If probability score high, remember pattern and score

Step 6. Repeat Steps 1 - 5
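Steps 1 to 6 can be sketched in a few functions. This is a deliberately simplified, deterministic stand-in, not the algorithm of Meme or the Gibbs sampler: instead of choosing candidates arbitrarily it tries every window of the first sequence, it takes exactly one best match per sequence, and it scores matches as log-odds against an assumed uniform background of 0.25 per base. All names are illustrative.

```python
import math
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_match(pattern, seq):
    """Step 2: the window of seq with the fewest mismatches to pattern."""
    width = len(pattern)
    windows = [seq[i:i + width] for i in range(len(seq) - width + 1)]
    return min(windows, key=lambda w: hamming(pattern, w))


def frequency_table(sites, pseudocount=1.0):
    """Step 3: per-position base frequencies, smoothed with a pseudocount."""
    n = len(sites)
    table = []
    for column in zip(*sites):
        counts = Counter(column)
        table.append({b: (counts[b] + pseudocount) / (n + 4 * pseudocount)
                      for b in "ACGT"})
    return table


def log_odds(sites, table, background=0.25):
    """Step 4: how much likelier the sites are under the table than by chance."""
    return sum(math.log2(table[i][base] / background)
               for site in sites for i, base in enumerate(site))


def find_pattern(seqs, width):
    """Steps 1, 5, 6: try each candidate, keep the best-scoring consensus."""
    best = (None, float("-inf"))
    first = seqs[0]
    for start in range(len(first) - width + 1):
        candidate = first[start:start + width]            # Step 1
        sites = [best_match(candidate, s) for s in seqs]  # Step 2
        table = frequency_table(sites)                    # Step 3
        score = log_odds(sites, table)                    # Step 4
        if score > best[1]:                               # Step 5
            consensus = "".join(max("ACGT", key=lambda b: col[b]) for col in table)
            best = (consensus, score)
    return best                                           # Step 6 is the loop


if __name__ == "__main__":
    toy = ["GGCTATAAAGC", "CTATAAAGGCC", "ACGTTATAAAC"]
    print(find_pattern(toy, 6))
```

On the toy input, the only six-letter word shared exactly by all three sequences is TATAAA, so it receives the highest score. Real pattern finders differ mainly in how they choose candidates and in how they allow a sequence to contain zero or several copies of the motif.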

Surrogate Filters: Scenario II – Case of the Masked Motif

• You’ve found a gene related to Purple Tongue Syndrome

• BlastP: Encoded protein related to cAMP-binding proteins

• Are the similarities trivial? Related to cAMP binding?

• Does your protein contain cAMP-binding site?

• What IS a cAMP-binding site?

Task

1. Determine what is a cAMP-binding site

2. Determine if your protein has one

Surrogate Filters: Scenario II – Case of the Masked Motif

1. Collect sequences of known cAMP-binding proteins

2. Run Meme, a pattern-finding program
Ask it to find any significant motifs

3. Rerun Meme. Demand that every protein has identified motifs

4. Run Pfam over known sequence to check

Do it

Strategy

Surrogate Filters: Scenario III – Case of the Mortal Mitochondrion

Progressive External Ophthalmoplegia (PEO)
• Slow paralysis of voluntary eye muscles
• Many other symptoms (e.g., frequent deafness)
• Loss of mitochondrial DNA

Inheritance
• Mendelian
• Autosomal dominant
• Linked to chromosome 4q34

Your task
• Examine sequence of 4q34 region

• Assess likelihood that a gene in the area could cause disease symptoms

Surrogate Filters: Scenario III – Case of the Mortal Mitochondrion

Examining Sequence of 4q34 Regiontctacttatattcaatccacagggctacacctagttcttggtacacagtacatgctcagcaagagtctgttgaatgaacacatacatggtttatctgtttgtctcttccgagttcttgacttctgtctgctctgacctctggcagctttccactagtttctagctttcattctgcttacctggatttcggaactctagcctgccccactcttagataaacgcatgccctctgtggccctggaaccttagtgacttctgctataccaaagtctccacgcccagggtgacacgcagctgcagctccgtaaacctctaacatgatgtcagcaaatattaaaaaaaaaaagtttataaaaacaatgaataaactttgttaaaggtacaaatgaaaattagcaaacatgggaagataattgagtaaagagtttaaagttaaaaacgaattgcagtcattctaggggaaggaacagttgtatttgaaaacctgtatggttacatgaactgcctaaaaaacaagctaaggaaaattaaagctcagatttatatattttaagaaattaattgcaattaatttcctgggattaaatagcatttcctcaaccccagctgtcattaaaaagaggcaaatacagccaaggactggatcttctccggaaggctgacagcactgaccctcaagaaggcaccggctgacagacagaacattctgccctaatatgtgctgaaattccgctgagagcagagtggtacattgaaccctttaggggcttacaaaagaagtgtcctgtgttttagagtcacagagttttgcagaaacaagtatgaattcacctagtggccccctgcaccaggtctttcctgtgggcactgagtgcagacacatcaatatgtaatagcagaatgaatgactgaacgaacgattgaatgaaaagaaatgagaggcagcaggttgtcagattctatgaggcaatcacagcatcaggtgaccttagtatctatttgagaggactgccatttattctcgggagcgcacggctctaaagaggcccatatccaggcagtgagctctggtggggggcgcctttagatgcaagaaggaggaaacagctcgaaatccctgggcctgagcgcggcccgtgcaggccggagggtcaagaactctccaccggcggcagcggcccggtgtctgccccggcttcgccccggcctaaggctgcctgtgctataaatacgcggcccacatgccgcggtgacacggtgttccctgggctcggcgggacagataacatgaatgtgccctttaaacgtcccaagttgcagggacagcccccggcccagcctcgctcccggaagcgccttcgcccccgatgccctctgcagctgggaggagggggcgccccgcacctgcccagccaatgcgcggcgcgagcgccggccgcgacccgcctcctctcgcgagagcccggcggggatataagggggagctgcgggccaggcggcggccccctagcgtcgcgcagggtcggggactgcgcgcggtgccaggccgggcgtgggcgagagcacgaacgggctgcctgcgggctgagagcgtcgagctgtcaccatgggtgatcacgcttggagcttcctaaaggacttcctggccgggggcgtcgccgctgccgtctccaagaccgcggtcgcccccatcgagagggtcaaactgctgctgcaggtgaggaccgcgcggtgcaagaggcgggcgcgggcgcggcgggccgggcggggcgcgcgatgcggcgcgagctgcagggcgcggggcgccgcggaaaatctgcgccaggccacaggcccgggcgcccgcccgcccgcgggggaagaaggtgccctctgcgtagagacaggtccagcgtcagtcgcagattcctggtgtcgggtggcgcccggcgttcgggtgtctatatatggaaacccacccggagccg
gtttacgtgtgccagatcctgcgcccgtgacagcacgggcgtgcactcaggcccggaggcacctagtgattgccagtatttttggcaccgtcttatgcgcacgcacctttacaataaaaacatcaaaataatcatcacccaagaattcccttatcgtatctcatgcacaatgctgtatgtaggctgacgccttcatctttatgtaacctctgtgagagagttattcttctccattttacagatgaagctgaggttttgaaatattaagaaacaattttcggaataaactcagatcatcctgtctccaaatcttttcctcccctacctggtcgctgaatggtttatcatcctctcgtgttttcctccacctgcccaaaaggtcagggcccctcaatgaggaagagcccaatttgggagtcagaattactaacaacaaaacccccacaaattgctcacaacggcagcaaacccttaataattgattacttggattatctgcttgaaaactttggaggcctaatgtttagtggatttattctccttcctctattagagcatctagtagagatcctcatctccagggtgatcagagtgacactgagaaattgtcattttttggccatcatgtctattaaatccaaagccctttgaagcagggagtgttactcatttctgtcccccagtaagcccctcatacagttctcaaacctagggaaagtgaaataaataaatggctatagctttatataattcaatcaccttttcagtttatttggggcaatacctttccctcaaataccctaataattgaagcaacattggattattttggcttgttatccagtaactaacatggataacagtatccatttacacgtcctcgtatccatttgatttcctcatcctttttttcttcaaaaaaaaaatctaggaagtgcaaaccttttttttttctcctgtcctcttcccttctctctaccctgcctgtcctctgtcacccaccctcccctccaccaggtccagcatgccagcaaacagatcagtgctgagaagcagtacaaagggatcattgattgtgtggtgagaatccctaaggagcagggcttcctctccttctggaggggtaacctggccaacgtgatccgttacttccccacccaagctctcaacttcgccttcaaggacaagtacaagcagctcttcttagggggtgtggatcggcataagcagttctggcgctactttgctggtaacctggcgtccggtggggccgctggggccacctccctttgctttgtctacccgctggactttgctaggaccaggttggctgctgatgtgggcaagggcgccgcccagcgtgagttccatggtctgggcgactgtatcatcaagatcttcaagtctgatggcctgagggggctctaccagggtttcaacgtctctgtccaaggcatcattatctatagagctgcctacttcggagtctatgatactgccaagggtgagagaggggcatcggggagaaggagggtggtgtggaaagaggatcctatgggatctataactcacaaaggacctgatatatattgatcttgttttttctagtctctgggataattgaggcttctgaatgaggaggtgatgtgcataagttaatagctgaagcgttccttgtgtcctctactgaaataaactctggcctttagttattcagagaggaggaggggggagcctgtctccctctagacacagccatagcagttactgagtttaacttgaagccacttccaatgccctgtatacaagctgagcactgcccctccggggtccggagagggcagcagccacctttgctgtctgcctggtcatatgtgaagcacctgcacaggggcaggttccccgcaaggtcagagcatggagctggaggtgcagtggcctctctccctccacctgctttctgctgagaacaggcacttcatagccgttcggcttctggg
ctctgtccacagggatgctgcctgaccccaagaacgtgcacatttttgtgagctggatgattgcccagagtgtgacggcagtcgcagggctggtgtcctacccctttgacactgttcgtcgtagaatgatgatgcagtccggccggaaagggggtaagcttgtgctctactcatctaaacttgtttggttttgcccgaggagaacattttacagggctcctttcagtcttccttactggaaattaattttcaaaattatttgataaggacttagggaagaaagatggtattaattccccctaacgttctcaactatcctattagggaaaagtattttccattttattagagatgataagaacatgaatagtaagacatttagatgtgaatttaactaggtatccagcattatagagaccctaggccctcttcccttagagcctgggtgcaaaagctagggaaaagaagtagttagctacttcttacaaagaactcttgcttccctcctagttacaggtgttagtgggatggggtgtttagctgggtagagatggcctgaagcaatctgttgtgccagagaaagttttggcttctataggttgaaccatatgaaattgccactttaaaagtcaaaaacagtccaatgttagcagtttcgtatgtttcaacgaatagttacagccttttatttagactgcataacctcgtgcaggatcatctgaggctcagcctcagttcggtcctccataaaaaaaggtaaccgcgtagcataatactcctgctccactgcgcccttcttgtttcgcagttgggcagtccatgaattacttggttaattgccccagttcttcactgaccttgaactaatggagtaggaatgacaggagacccagcctgccagtgaagcaaggaaggagatgtccagtgggatgttgcatggagctgggactccatgcccagatgaccctgattttataaaactggtaacagtgtgtacagatatgtttcaggggaaaagtctctttcctccagcgttacggagccctcaccagcatttgtttccacagccgatattatgtacacggggacagttgactgctggaggaagattgcaaaagacgaaggagccaaggccttcttcaaaggtgcctggtccaatgtgctgagaggcatgggcggtgcttttgtattggtgttgtatgatgagatcaaaaaatatgtctaatgtaattaaaacacaagttcacagatttacatgaacttgatctacaagttcacagatccattgtgtggtttaatagactattcctaggggaagtaaaaagatctgggataaaaccagactgaaggaatacctcagaagagatgcttcattgagtgttcattaaaccacacatgtattttgtatttattttacatttaaattcccacagcaaatagaaaataatttatcatacttgtacaattaactgaagaattgataataactgaatgtgaaacatcaataaagaccacttaatgcacgctttctattttattgaactcttattaactgtaaaatgcatttttaaaagatcaaaaatgcatattttctagcatgattcatgtatcagtcagcagccaagcttctaaatgccagatattatattgagaatgtattatatgagaacgtacaatgcttaaagttccggttttcaaacttaggcaggtcatattctatctatcttatccagcgttactgtaggctagaaagtgataatggctttcataatcctgccttgtcttaggcactttcctgcag

Strategy

• Protein has function associated with mitochondrial location?

• Protein has structure associated with mitochondrial location?

• Assume that encoded protein is in mitochondria

– Use Gene finder to identify protein sequence(s)
– Use Similarity finder to identify possible function
– Use Feature finders to identify pertinent regions
– (What ARE pertinent regions?)

Surrogate Filters: Scenario III – Case of the Mortal Mitochondrion

Name: PEO-related_gene?
First three lines of sequence:
tctacttatattcaatccacagggctacacctagttcttggtacacagtacatgctcagcaagagtctgttgaatgaacacatacatggtttatctgtttgtctcttccgagttcttgacttctgtctgctctgacctctggcagctttccactagtttctagctttcattctgcttacctggatttcggaactctagcctgccccactcttagataaacgcatg

fgene Wed Feb 27 16:55:29 GMT 2002
>PEO-related_gene?
length of sequence - 5768
number of predicted exons - 5
positions of predicted exons:
1607 - 1717 w= 17.84 ORF: 1607 - 1717
2985 - 3231 w= 9.13 ORF: 2985 - 3230
3421 - 3471 w= 6.08 ORF: 3423 - 3470
3980 - 4120 w= 12.62 ORF: 3982 - 4119
5035 - 5192 w= 1.93 ORF: 5037 - 5192
Length of Coding region- 708bp
Amino acid sequence - 235aa
MGDHAWSFLKDFLAGGVAAAVSKTAVAPIERVKLLLQVQHASKQISAEKQYKGIIDCVVRIPKEQGFLSFWRGNLANVIRYFPTQALNFAFKDKYKQLFLGGVDRHKQFWRYFAGNLASGIIIYRAAYFGVYDTAKGMLPDPKNVHIFVSWMIAQSVTAVAGLVSYPFDTVRRRMMMQSGRKGADIMYTGTVDCWRKIAKDEGAKAFFKGAWSNVLRGMGGAFVLVLYDEIKKYV*

Surrogate Filters

Scenario III – Case of the Mortal Mitochondrion

Run 4q34 region through FGene
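A quick consistency check on a gene finder's report is to add up the predicted exons and confirm that the total matches the reported coding length and protein length. The following is a minimal sketch in Python, assuming the exon lines are pasted in by hand from the fgene report; the names `parse_exons` and `coding_length` are hypothetical helpers, not part of fgene.

```python
import re

# Exon lines copied from the fgene report above.
FGENE_EXON_LINES = """\
1607 - 1717 w= 17.84 ORF: 1607 - 1717
2985 - 3231 w= 9.13 ORF: 2985 - 3230
3421 - 3471 w= 6.08 ORF: 3423 - 3470
3980 - 4120 w= 12.62 ORF: 3982 - 4119
5035 - 5192 w= 1.93 ORF: 5037 - 5192
"""

# start - end, then the exon weight after "w="
EXON_RE = re.compile(r"^\s*(\d+)\s*-\s*(\d+)\s+w=\s*([\d.]+)")


def parse_exons(text):
    """Return a list of (start, end, weight) tuples, one per exon line."""
    exons = []
    for line in text.splitlines():
        match = EXON_RE.match(line)
        if match:
            start, end, weight = match.groups()
            exons.append((int(start), int(end), float(weight)))
    return exons


def coding_length(exons):
    """Total number of bases covered by the exons (coordinates are inclusive)."""
    return sum(end - start + 1 for start, end, _ in exons)


exons = parse_exons(FGENE_EXON_LINES)
total_bp = coding_length(exons)
residues = total_bp // 3 - 1  # the last codon is the stop codon

print(f"{len(exons)} exons, {total_bp} bp coding, {residues} residues")
```

The totals agree with the report (708 bp, 235 aa), so the exon list and the protein are at least internally consistent. That says nothing yet about whether the exons are the right ones.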

Fgenesh Wed Feb 27 16:59:14 GMT 2002
FGENESH 1.0 Prediction of potential genes in Human genomic DNA
Time: Wed Feb 27 16:59:14 2002
Seq name: PEO-related_gene?
Length of sequence: 5768 GC content: 48 Zone: 2
Positions of predicted genes and exons:

G Str   Feature   Start    End     Score   ORF            Len
1 +     TSS       1216             -2.70
1 +   1 CDSf      1607 - 1717      18.01   1607 - 1717    111
1 +   2 CDSi      2985 - 3471      52.41   2985 - 3470    486
1 +   3 CDSi      3980 - 4120      20.99   3982 - 4119    138
1 +   4 CDSl      5035 - 5192       2.32   5037 - 5192    156
1 +     PolA      5471              0.92

Predicted protein(s):
>FGENESH 1 4 exon (s) 1607 - 5192 298 aa, chain +
MGDHAWSFLKDFLAGGVAAAVSKTAVAPIERVKLLLQVQHASKQISAEKQYKGIIDCVVRIPKEQGFLSFWRGNLANVIRYFPTQALNFAFKDKYKQLFLGGVDRHKQFWRYFAGNLASGGAAGATSLCFVYPLDFARTRLAADVGKGAAQREFHGLGDCIIKIFKSDGLRGLYQGFNVSVQGIIIYRAAYFGVYDTAKGMLPDPKNVHIFVSWMIAQSVTAVAGLVSYPFDTVRRRMMMQSGRKGADIMYTGTVDCWRKIAKDEGAKAFFKGAWSNVLRGMGGAFVLVLYDEIKKYV

FGENE output
1607 - 1717 w= 17.84
2985 - 3231 w= 9.13
3421 - 3471 w= 6.08
3980 - 4120 w= 12.62
5035 - 5192 w= 1.93

Surrogate Filters

Scenario III – Case of the Mortal Mitochondrion

Run 4q34 region through FGeneSH
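The two gene finders disagree in one place only: FGene reports two exons (2985-3231 and 3421-3471) where FGeneSH reports a single exon (2985-3471). A short sketch, using only the coordinates from the two reports, shows that this one difference may account for the whole gap between the 235 aa and 298 aa proteins. The helper names are hypothetical.

```python
# Exon coordinates (1-based, inclusive) copied from the two reports.
FGENE_EXONS = [(1607, 1717), (2985, 3231), (3421, 3471), (3980, 4120), (5035, 5192)]
FGENESH_EXONS = [(1607, 1717), (2985, 3471), (3980, 4120), (5035, 5192)]


def span_length(spans):
    """Total bases covered by a list of inclusive (start, end) spans."""
    return sum(end - start + 1 for start, end in spans)


def residues(spans):
    """Protein length implied by the spans, not counting the stop codon."""
    return span_length(spans) // 3 - 1


fgene_bp = span_length(FGENE_EXONS)
fgenesh_bp = span_length(FGENESH_EXONS)

# Bases that FGene treats as intron but FGeneSH treats as coding.
gap_bp = FGENE_EXONS[2][0] - FGENE_EXONS[1][1] - 1

print(f"FGene:   {fgene_bp} bp -> {residues(FGENE_EXONS)} aa")
print(f"FGeneSH: {fgenesh_bp} bp -> {residues(FGENESH_EXONS)} aa")
print(f"Gap: {gap_bp} bp = {gap_bp // 3} codons")
```

The 189 bases between positions 3231 and 3421 are a multiple of three, so either reading keeps the downstream frame intact. The coordinates alone cannot say which reading is correct.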

How to decide where exons are?

[Figure: DNA (P, Exon, Intron, Exon, Intron, Exon), hnRNA, and mRNA with poly(A) tail (AAAAAAAA)]

Strategy

• Compare sequence of 4q34 region to sequence of mRNA

• Sequence of mRNA may be in cDNA library

• Expressed Sequence Tag (EST) library

Problems

• Library may not exist

• Expression of gene may be low

MORAL: Trust, but verify.
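The idea of comparing genomic sequence to mRNA can be sketched with a toy example: lay the mRNA along the genomic DNA from left to right, and each stretch that matches is an exon. The sketch below assumes exact matches and made-up sequences. It is not how BlastN works, which tolerates mismatches and reports scored local alignments, and `map_exons` is a hypothetical helper.

```python
def map_exons(genomic, mrna, min_block=4):
    """Greedy left-to-right placement of an mRNA on genomic DNA.

    At each step, take the longest prefix of the remaining mRNA that occurs
    exactly in the genomic sequence downstream of the previous block.
    Returns 1-based inclusive (start, end) genomic coordinates per block.
    """
    blocks = []
    g_pos = 0  # where the search resumes in the genomic sequence
    m_pos = 0  # how much of the mRNA has been placed
    while m_pos < len(mrna):
        length = len(mrna) - m_pos
        hit = -1
        while length >= min_block:
            hit = genomic.find(mrna[m_pos:m_pos + length], g_pos)
            if hit != -1:
                break
            length -= 1
        if hit == -1:
            raise ValueError(f"no exact match for mRNA position {m_pos + 1}")
        blocks.append((hit + 1, hit + length))
        g_pos = hit + length
        m_pos += length
    return blocks


# Made-up sequences for illustration only.
EXONS = ["ATGGCCAAGC", "TTCGACGGAT", "CCTGAACTAA"]
INTRONS = ["GTAAGTCTTTTCCCAG", "GTGAGTATTTTTACAG"]
GENOMIC = "TTTTT" + EXONS[0] + INTRONS[0] + EXONS[1] + INTRONS[1] + EXONS[2] + "GGGGG"
MRNA = "".join(EXONS)

print(map_exons(GENOMIC, MRNA))
```

A greedy exact matcher like this can overshoot an exon boundary when the first bases of an intron happen to equal the first bases of the next exon, which is one reason real comparisons rely on alignment tools rather than string search.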

Final Score Card for Gene Finders

Feature                    FGene                   FGeneSH                BlastN of EST library
                           (splice site            (FGene + HMM model)    (compare with known)
                           recognition)

Transcription Start Site                           1216                   1501
Exon 1                     …1607-1717              …1607-1717             …1607-1717
Exon 2                     2985-3231, 3421-3471    2985-3471              2985-3471
Exon X                     3980-4120               3980-4120              3980-4120
Exon 3                     5035-5192…              5035-5192…             5035-5192…
PolyA site                 ? ?                     ? ?                    ? ?

Surrogate Filters

Scenario III – Case of the Mortal Mitochondrion

Run 4q34 region through BlastN (vs. human ESTs)
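The score card can also be read as numbers: count how many predicted exons match the EST evidence exactly. The sketch below treats the EST column as the reference, which is an assumption made for illustration, and compares coding coordinates only (the score card marks the outer ends of the first and last exons as open). The function name `exon_agreement` is hypothetical.

```python
# Exon coordinates as listed in the score card. Using the EST column as the
# reference set is an assumption of this example.
EST_REFERENCE = {(1607, 1717), (2985, 3471), (3980, 4120), (5035, 5192)}

PREDICTIONS = {
    "FGene": {(1607, 1717), (2985, 3231), (3421, 3471), (3980, 4120), (5035, 5192)},
    "FGeneSH": {(1607, 1717), (2985, 3471), (3980, 4120), (5035, 5192)},
}


def exon_agreement(predicted, reference):
    """Exon-level agreement: an exon counts only if both ends match exactly."""
    exact = predicted & reference
    return {
        "exact": len(exact),
        "sensitivity": len(exact) / len(reference),  # reference exons recovered
        "precision": len(exact) / len(predicted),    # predicted exons confirmed
    }


scores = {name: exon_agreement(pred, EST_REFERENCE) for name, pred in PREDICTIONS.items()}

for name, result in scores.items():
    print(f"{name}: {result['exact']} exact, "
          f"sensitivity {result['sensitivity']:.2f}, precision {result['precision']:.2f}")
```

Exact-match scoring is strict: FGene's two partial exons both overlap the reference exon 2985-3471 yet earn no credit. A base-level overlap measure would be more forgiving.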

Strategy

• Protein has function associated with mitochondrial location?

• Protein has structure associated with mitochondrial location?

• Assume that encoded protein is in mitochondria

– Use Gene finder to identify protein sequence(s)

– Use Similarity finder to identify possible function

– Use Feature finders to identify pertinent structures

– (What ARE pertinent structures?)

Surrogate Filters

Scenario III – Case of the Mortal Mitochondrion

Run 4q34 region through BlastP

Summary

• One protein in region

• Contains mitochondrial carrier motifs

• Similar to ATP/ADP transporter

• Mitochondrial signal sequence?

Reasonable candidate for PEO-related protein
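One way to look at the "mitochondrial carrier motifs" point directly is to scan the FGeneSH protein for a short signature. The sketch below uses the simplified pattern P-x-[DE]-x-x-[KR], a commonly quoted short form of the mitochondrial carrier signature. The choice of pattern is an assumption of this example and is looser than the curated patterns a motif database would apply.

```python
import re

# FGeneSH-predicted protein (298 aa), copied from the report above.
PROTEIN = (
    "MGDHAWSFLKDFLAGGVAAAVSKTAVAPIERVKLLLQVQHASKQISAEKQ"
    "YKGIIDCVVRIPKEQGFLSFWRGNLANVIRYFPTQALNFAFKDKYKQLFL"
    "GGVDRHKQFWRYFAGNLASGGAAGATSLCFVYPLDFARTRLAADVGKGAA"
    "QREFHGLGDCIIKIFKSDGLRGLYQGFNVSVQGIIIYRAAYFGVYDTAKG"
    "MLPDPKNVHIFVSWMIAQSVTAVAGLVSYPFDTVRRRMMMQSGRKGADIM"
    "YTGTVDCWRKIAKDEGAKAFFKGAWSNVLRGMGGAFVLVLYDEIKKYV"
)

# Simplified carrier signature P-x-[DE]-x-x-[KR] (assumed for illustration).
CARRIER_MOTIF = re.compile(r"P.[DE]..[KR]")


def find_motifs(protein, pattern=CARRIER_MOTIF):
    """Return (1-based start position, matched residues) for each hit."""
    return [(m.start() + 1, m.group()) for m in pattern.finditer(protein)]


hits = find_motifs(PROTEIN)
for position, residues in hits:
    print(position, residues)
```

The scan finds three hits, spaced roughly 100 residues apart. The middle hit lies in the segment that is missing from the shorter FGene protein, so the choice of gene model changes what the feature finder sees.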

Complex gene discovery

Your turn: Repeat and extend characterization of PEO-related gene

1. Take same sequence (FastA format) e-mailed to you

2. Get better estimate of promoter and polyA site (e.g. by TSSW and PolyASH) (Is there a TATA box upstream from the predicted promoter?)

3. Find encoded protein sequence by suitable method (e.g. FGeneSH(GC) or comparison with cDNA)

4. Continue characterization of protein

* Contains signal sequence?

* Contains transmembrane domains?
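For the transmembrane question in step 4, a common first look is a sliding-window hydropathy profile. The sketch below uses the Kyte-Doolittle hydropathy scale with a 19-residue window and a cutoff of 1.6. The window and cutoff are conventional choices assumed here rather than anything prescribed by the exercise, and the peptides are made up.

```python
# Kyte-Doolittle hydropathy values per residue.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=19):
    """Mean hydropathy of every full window along the protein."""
    scores = [KYTE_DOOLITTLE[residue] for residue in protein]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def candidate_tm_windows(protein, window=19, threshold=1.6):
    """1-based start positions of windows whose mean reaches the threshold."""
    profile = hydropathy_profile(protein, window)
    return [i + 1 for i, value in enumerate(profile) if value >= threshold]


# Made-up peptides: a hydrophobic stretch flanked by charged residues,
# and a fully charged peptide with no hydrophobic stretch.
toy_membrane = "DDDDD" + "L" * 19 + "KKKKK"
toy_soluble = "DEKR" * 6

profile = hydropathy_profile(toy_membrane)
print(max(profile), profile.index(max(profile)) + 1)
print(candidate_tm_windows(toy_soluble))
```

Calling `candidate_tm_windows` on the protein found in step 3 gives a list of window start positions to inspect. A hydropathy peak is only a hint, and dedicated transmembrane predictors should have the final word.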

Filter limitation

Inevitable…but whose filter?

Filters controlled by outside programmers

Filters controlled by you