Post on 20-Mar-2016
Elements of Bioinformatics (14F001)
TP2: Gene prediction
22 October 2012
CORRECTIONS
Notice:
During this practical, you will need to use ‘raw’ and ‘fasta’ sequence formats.
For additional information on the different sequence formats available, please have a look at http://www.genomatix.de/online_help/help/sequence_formats.html
ncRNA gene prediction
Choose: eukaryotic tRNA; the general tRNA model does not give any result!
CpG island prediction
CpG island in the C. elegans cosmid
Length: 219 bp; positions 21954 to 22172
cgttttctgtggtcaca cacgagtatc cggatcttct ggatcaactt gttctcgtct gcaacgtctt tgcaagaatg gcaccagaac agaaacaact actcgtggaa caccttcaag acgttgggca gacggtcgct atgtgtggcg atggagctaa tgattgtgct gctctgaaag cagctcacgc gggaatctca ctatcggagg ctgaagcatc ga
To confirm that this sequence could be part of a promoter (> 80 % of CpG islands extend into the 5’ flanking region of the associated genes), check from its positions whether this CpG island lies in a gene promoter region (see later).
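As a rough cross-check of a predicted island, its length, GC content and observed/expected CpG ratio can be computed directly. The sketch below is only an illustration: the function names are hypothetical, and the thresholds (length >= 200 bp, GC > 50 %, observed/expected > 0.6) are the commonly quoted Gardiner-Garden and Frommer criteria, which are assumed here and are not necessarily those of the prediction tool used in the practical.

```python
def cpg_stats(seq):
    """Return (length, GC fraction, observed/expected CpG ratio) for a DNA string."""
    s = "".join(seq.split()).upper()   # drop the spaces used to group bases
    n = len(s)
    c, g = s.count("C"), s.count("G")
    cpg = s.count("CG")                # CpG dinucleotides on this strand
    gc_fraction = (c + g) / n if n else 0.0
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return n, gc_fraction, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the assumed thresholds (hypothetical defaults, see the text above)."""
    n, gc, obs_exp = cpg_stats(seq)
    return n >= min_len and gc > min_gc and obs_exp > min_obs_exp


# Island length from the reported coordinates (both ends included).
island_length = 22172 - 21954 + 1
print(island_length)  # 219
```

Pasting the 219 bp sequence above into `cpg_stats` gives the three values to compare with the output of the prediction tool.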
Gene prediction
with HMM on the complete cosmid sequence
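[Figure: map of the complete cosmid with the four predicted genes]
Gene 1, Gene 2, Gene 3, Gene 4 (CDS labelled 1, 4, 3, 2 on the map; one is marked ‘Wrong CDS?’)
3 HMM models: firstex, exon_n, lastex
tRNA: 169 to 238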
Predicted CpG island: 21954 to 22172 -> in the middle of CDS4: not a ‘classical’ CpG island (not in the 5’ region of a gene)
Summary:
Gene 1
Gene 1 prediction with HMMgene
One gene found
Gene 1 prediction with HMMgene
With ‘human’: 2 genes found, one on each strand (the minus strand with lower scores).
The programs are ‘trained’ with sequences from specific organisms. The ‘codon bias’, for example, differs between species.
Example of codon usage tables (-> codon bias): http://www.kazusa.or.jp/codon/
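To make ‘codon bias’ concrete, the frequency of each codon in a coding sequence can be counted and expressed per thousand codons. This is a minimal sketch with a hypothetical function name and a toy CDS; it is not how any of the prediction programs is trained.

```python
from collections import Counter


def codon_usage_per_thousand(cds):
    """Count the codons of an in-frame CDS and return their frequency per 1000 codons."""
    s = "".join(cds.split()).upper()
    usable = len(s) - len(s) % 3              # ignore an incomplete trailing codon
    codons = [s[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: 1000 * count / total for codon, count in counts.items()}


# Toy CDS (assumption for illustration): ATG AAG AAG TAA
usage = codon_usage_per_thousand("ATGAAGAAGTAA")
print(usage)
```

Comparing such tables computed from genes of two organisms shows why a model trained on one species can score the genes of another species poorly.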
Gene 1 prediction with Netgene2
Netgene2 gives the positions of the first and last nucleotides of the intron (donor and acceptor splice sites)
[Figure: intron starting with GT (donor site) and ending with AG (acceptor site)]
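The GT...AG rule can be checked on any candidate intron once its first and last positions are known. The sketch below uses a hypothetical function name and a toy genomic sequence; it only tests the canonical dinucleotides and does not reproduce the scoring done by Netgene2.

```python
def has_canonical_splice_sites(genomic, intron_start, intron_end):
    """True if the intron (1-based, inclusive coordinates) starts with GT and ends with AG."""
    intron = genomic[intron_start - 1:intron_end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


# Toy sequence (assumption): exon 1 = 1-6, intron = 7-18, exon 2 = 19-24
toy = "ATGGCC" + "GTAAGTTTTCAG" + "GTCTAA"
print(has_canonical_splice_sites(toy, 7, 18))   # True
print(has_canonical_splice_sites(toy, 8, 18))   # False: shifted by one base
```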
Gene 1 prediction with GeneBuilder (organism: no choice… human; option: first and last exon disabled)
Matrix: miscellaneous
One gene found
Gene 1 prediction with GenScan
!! No choice of organism except: vertebrate, maize and Arabidopsis!
Two genes found
FGENESH
One gene found
Summary (gene prediction)
[Figure: Gene 1 predictions drawn on a 5’ to 3’ axis]
Axis positions: 1003, 1083, 1305, 1406, 1452, 1661, 1914, 1997, 2000
Netgene2 splice sites (DO: donor site; AC: acceptor site; score in parentheses):
DO 1084 (1.00), AC 1304 (0.77)
DO 1407 (0.89), AC 1451 (0.90)
DO 1662 (1.00), AC 1913 (1.00)
Programs shown: HMMgene, GeneBuilder, Netgene2 and GenScan (organism = human!)
GeneMark: finds a second gene in 3’!
FGENESH: one gene, + another potential gene from positions 2000 to 2900
Other positions labelled on the figure: 977, 1557, 163211
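The summary coordinates can be checked for internal consistency. The sketch below assumes that the axis positions describe four exons (1003-1083, 1305-1406, 1452-1661, 1914-1997); this reading of the figure is an assumption, and the function names are hypothetical. The introns derived from these exons should start at the donor (DO) positions and end at the acceptor (AC) positions, and the summed exon length should be a multiple of 3.

```python
# Assumed exon model for Gene 1 (1-based, inclusive), read from the summary figure.
exons = [(1003, 1083), (1305, 1406), (1452, 1661), (1914, 1997)]


def introns_between(exons):
    """Introns are the gaps between consecutive exons (1-based, inclusive)."""
    return [(prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(exons, exons[1:])]


def coding_length(exons):
    """Total number of nucleotides covered by the exons."""
    return sum(end - start + 1 for start, end in exons)


introns = introns_between(exons)
length = coding_length(exons)
print(introns)              # [(1084, 1304), (1407, 1451), (1662, 1913)]
print(length, length // 3)  # 477 159
```

Under this reading, the 477 coding nucleotides correspond to 159 codons, which agrees with the 159 AA proteins reported for FGENESH and GenScan if the stop codon lies just after position 1997.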
ID FGENESH Unreviewed; 159 AA.
SQ SEQUENCE 159 AA; 17780 MW; F9A2C7DE9614425C CRC64;
MKVETCVYSG YKIHPGHGKR LVRTDGKVQI FLSGKALKGA KLRRNPRDIR WTVLYRIKNK KGTHGQEQVT RKKTKKSVQV VNRAVAGLSL DAILAKRNQT EDFRRQQREQ AAKIAKDANK
AVRAAKAAAN KEKKASQPKT QQKTAKNVKT AAPRVGGKR//
ID GENESCAN1 Unreviewed; 159 AA.
SQ SEQUENCE 159 AA; 17780 MW; F9A2C7DE9614425C CRC64;
MKVETCVYSG YKIHPGHGKR LVRTDGKVQI FLSGKALKGA KLRRNPRDIR WTVLYRIKNK KGTHGQEQVT RKKTKKSVQV VNRAVAGLSL DAILAKRNQT EDFRRQQREQ AAKIAKDANK
AVRAAKAAAN KEKKASQPKT QQKTAKNVKT AAPRVGGKR//
ID GENESCAN2 Unreviewed; 202 AA.
SQ SEQUENCE 202 AA; 23684 MW; 98A69FA21823F2F3 CRC64;
MRTLRIAQYS VLTVGFAIYM YRLIEEIPID IRNLNSDSLE GIINSDELCD VTVSNRNRGL LVRNDSLDLD ILKAKFTTFF SKRYLTRFLS EQVPFLHVID EALLVKRFVM CACFMVFCLT VIWFLVIRRM GNLIKRLSVL NQLEDAESVE WARCIREFTQ EKLAVLCFCI VPPFAQTDKL
VSDKIKLFRE HKILRIRSVQ HI//
ID GENEMARK1 Unreviewed; 184 AA.
SQ SEQUENCE 184 AA; 20255 MW; 85BB0234E6C14EA0 CRC64;
MGRCGSSGKR DGYGAKDSSS EGLSTMKVET CVYSGYKIHP GHGKRLVRTD GKVQIFLSGK ALKGAKLRRN PRDIRWTVLY RIKNKKGTHG QEQVTRKKTK KSVQVVNRAV AGLSLDAILA KRNQTEDFRR QQREQAAKIA KDANKAVRAA KAAANKEKKA SQPKTQQKTA KNVKTAAPRV
GGKR//
ID GENEMARK2 Unreviewed; 183 AA.
SQ SEQUENCE 183 AA; 21336 MW; 64F65D472A58046E CRC64;
MRTLRIAQYS VLTVGFAIYM YRLIEEIPID IRNLNSDSLE GIINSDELCD VTVSNRNRGL LVRNDSLDLD ILKAKFTTFF SKRYLTRFLS EQVPFLHVID EALLVKRFVM CACFMVFCLT VIWFLVIRRM GNLIKRLSVL NQLEDAESVE WARCIREFTQ EKLAVLCFCI VPPFAQTDNV
QHI//
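The predicted proteins listed above differ mainly at their ends, which a few lines of code can make explicit. This is a sketch with hypothetical helper names; the strings are short fragments copied from the N-terminal and C-terminal ends of the records above, not the full sequences.

```python
def common_prefix_length(a, b):
    """Number of leading residues shared by two protein sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def extension_length(shorter, longer):
    """Number of residues preceding `shorter` inside `longer`, or None if absent."""
    index = longer.find(shorter)
    return None if index < 0 else index


# Fragments of the records above (truncated for brevity).
fgenesh_start = "MKVETCVYSGYKIHPGHGKR"
genemark1_start = "MGRCGSSGKRDGYGAKDSSSEGLSTMKVETCVYSGYKIHPGHGKR"
genscan2_end = "VPPFAQTDKLVSDKIKLFREHKILRIRSVQHI"
genemark2_end = "VPPFAQTDNVQHI"

print(extension_length(fgenesh_start, genemark1_start))    # 25
print(common_prefix_length(genscan2_end, genemark2_end))   # 8
```

The first value matches the length difference between GENEMARK1 (184 AA) and FGENESH (159 AA): GeneMark starts the protein 25 residues upstream. The second shows that GENESCAN2 and GENEMARK2 diverge only in their last residues, which is what a different final exon would produce.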
For fun…
Compare the predictions of the same program (GeneMark) run with different parameters (HMM trained on eukaryota or prokaryota)
Two genes found
Gene 1 prediction with GeneMark (prokaryota specific; E.coli K12)
Protein 1, Protein 2
Gene 1 prediction with GeneMark (prokaryota specific)
A CDS corresponds roughly to an ‘exon’: prokaryotic protein-coding genes have no spliceosomal introns!
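Without introns, a prokaryotic CDS is simply an open reading frame running from a start codon to the first in-frame stop codon. The sketch below is a naive ORF scan on one strand with a hypothetical function name; GeneMark relies on trained statistical models, so this only illustrates why a prokaryotic model reports uninterrupted CDSs.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return ORFs on the given strand as (start, end), 1-based, stop codon included.

    An ORF starts at the first ATG seen in a frame and ends at the next in-frame
    stop codon; it is kept if it has at least `min_codons` codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3))
                start = None
    return sorted(orfs)


# Toy sequence (assumption): ATG AAA TAG in the third reading frame
print(find_orfs("CCATGAAATAGGG"))  # [(3, 11)]
```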
Summary (prokaryota gene prediction)
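[Figure: prokaryotic and eukaryotic predictions for Gene 1 drawn on a 5’ to 3’ axis]
Axis positions: 1003, 1083, 1305, 1406, 1452, 1661, 1914, 1997, 2000
Netgene2 splice sites (DO: donor site; AC: acceptor site; score in parentheses):
DO 1084 (1.00), AC 1304 (0.77)
DO 1407 (0.89), AC 1451 (0.90)
DO 1662 (1.00), AC 1913 (1.00)
Programs shown: HMMgene, GeneBuilder, Netgene2, GenScan, GeneMark (euka)
GeneMark (proka): Protein 1 and Protein 2; positions 1254, 1433, 1437, 1688
Other position labelled on the figure: 1557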
Alignment between the ‘eukaryota and prokaryota’ predicted sequences
Gene prediction: similarity searches with ESTs
ESTs: expressed sequence tags (cDNAs sequenced quickly in a single pass, hence error-prone)
Blast 2012
Gene A Gene B
Two genes found
Blast 2010
Gene A Gene B
EST1
>gi|47590759|gb|BJ750997.1|BJ750997 BJ750997 unpublished oligo-capped cDNA library Caenorhabditis elegans cDNA clone yk1360e06 5', mRNA sequence
GGTTTAATTACCCAAGTTTGAGATTCGTCAAGCGAGGGCCTATCAGCAATGAAGGTCGAAACCTGCGTTTACTCCGGATACAAGATCCACCCAGGACACGGAAAGAGACTTGTCCGTACTGACGGAAAGGTGAGTTCAGTTTCTCTTTGAAAGGCGTTAGCATGCTGTTAGAGCTCGTAAGGTATATTGTAATTTTACGAGTGTTGAAGTATTGCAAAAGTAAAGCATAATCACCTTATGTATGTGTTGGTGCTATATCTTCTAGTTTTTAGAAGTTATACCATCGTTAAGCATGCCACGTGTTGAGTGCGACAAACTACCGTTTCATGATTTATTTATTCAAATTTCAGGTCCAAATCTTCCTCAGTGGAAAGGCACTCAAGGGAGCCAAGCTTCGCCGTAACCCACGTGACATCAGATGGACTGTCCTCTACAGAATCAAGAACAAGAAGGGAACCCACGGACAAGAGCAAGTCACCAGAAAGAAGACCAAGAAGTCCGTCCAGGTTGTTAACCGCGCCGTCGCTGGACTTTCCCTTGATGCTATCCTTGCCAAGAGAAACCAGACCGAAGACTTCCGTCGCCAACAGCGTGAACAAGCCGCTAAGATCGCCAA
EST2
>gi|47646579|gb|BJ775052.1|BJ775052 BJ775052 unpublished oligo-capped cDNA library Caenorhabditis elegans cDNA clone yk1360e06 3', mRNA sequence
ATAACGGGACCGAGAACGTTTATCGCTTTCCTCCGACACGTGGAGCAGCAGTCTTCACATTCTTGGCGGTCTTTTGCTGGGTCTTTGGCTGAGAGGCCTTCTTTTCCTTGTTGGCAGCAGCCTTGGCGGCACGGACAGCCTTGTTGGCATCCTTGGCGATCTTAGCGGCTTGTTCACGCTGTTGGCGACGGAAGTCTTCGGTCTGGTTTCTCTTGGCAAGGATAGCATCAAGGGAAAGTCCAGCGACGGCGCGGTTAACAACCTGGACGGACTTCTTGGTCTTCTTTCTGGTGACTTGCTCTTGTCCGTGGGTTCCCTTCTTGTTCTTGATTCTGTAGAGGACAGTCCATCTGATGTCACGTGGGTTACGGCGAAGCTTGGCTCCCTTGAGTGCCTTTCCACTGAGGAAGATTTGGACCTGAAATTTGAATAAATAAATCATGAAACGGTAGTTTGTCGCACTCAACACGTGGCATGCTTAACGATGGTATAACTTCTAAAAACTAGAAGATATAGCACCAACACATACATAAGGTGATTATGCTTTACTTTTGCAATACTTCAACACTCGTAAAATTACAATATACCTTACGAGCTCTAACAGCATGCTAACGCCTTTCAAAGAGAAACTGAACTCACCTTTCCGTCAGTACGGACAAGTCTCTTTCCGTGTCCTGGGTGGATCTTGTATCCGGAGTAAACGCAGGTTTCGACCTTCATTGCTGATANGCCCTCGCTTGACGAATCTCAAACTTGGGTAATTAAACCCCA
EST3
>gi|47727995|gb|BJ818152.1|BJ818152 BJ818152 unpublished oligo-capped cDNA library, stage L4 Caenorhabditis elegans cDNA clone yk1685h11 3', mRNA sequence TAACGGGACCGAGAACGTTTATCGCTTTCCTCCGACACGTGGAGCAGCAGTCTTCACATTCTTGGCGGTC TTTTGCTGGGTCTTTGGCTGAGAGGCCTTCTTTTCCTTGTTGGCAGCAGCCTTGGCGGCACGGACAGCCT TGTTGGCATCCTTGGCGATCTTAGCGGCTTGTTCACGCTGTTGGCGACGGAAGTCTTCGGTCTGGTTTCT CTTGGCAAGGATAGCATCAAGGGAAAGTCCAGCGACGGCGCGGTTAACAACCTGGACGGACTTCTTGGTC TTCTTTCTGGTGACTTGCTCTTGTCCGTGGGTTCCCTTCTTGTTCTTGATTCTGTAGAGGACAGTCCATC TGATGTCACGTGGGTTACGGCGAAGCTTGGCTCCCTTGAGTGCCTTTCCACTGAGGAAGATTTGGACCTT TCCGTCAGTACGGACAAGTCTCTTTCCGTGTCCTGGGTGGATCTTGTATCCGGAGTAAACGCAGGTTTCG ACCTTCATTGTTGATAGGCCCTCGCTTGACGAATCTCAAACTTGGGTAATTAAACCTACAAATAAAAATG AGATAAAGCATACTGCCATTCTACAACCGGAGAATAAGAAAACCGAAAACGAGAAAATTATTCTATTATG ACAGATAGAATAAGTTAAAATGGGAAGAGTGCATTTGTCACTGATTTACTTGGTGACTTGGTGGAGAGCG TGGGCAAGGTAAGCGACATTGTTCGATGAA
Gene A
975-1407 1450-1615 1692-1865
Blast result with EST1
BUT: Blast does not take intron-exon boundaries into account when aligning genomic DNA with RNA (cDNA) -> we have to use a dedicated tool: SIM4
The 3rd part of the EST1 is of very bad quality
SIM4 alignment
Example with EST 1 BJ750997
(partial)
The 3rd part of EST1 is of very bad quality: it is not aligned by SIM4 -> EST1 is considered partial!
EST 3 BJ818152
SIM4 alignment results
EST 1 BJ750997(partial)
EST 2 BJ775052
summary (ESTs)
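[Figure: EST alignments for Gene A drawn on a 5’ to 3’ axis]
Axis positions: 1003, 1083, 1305, 1406, 1452, 1661, 1914, 1997
EST1 BJ750997.1 (position 1615 marked)
EST2 BJ775052.1
EST3 BJ818152.1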
Alternative splicing event (intron retention)-> 2 different mRNAs
(EST BJ750997.1 is partial)
…
Gene A
Translation and BLASTp
Translation
(beware of the EST sequence orientation!)
>gi|47590759|gb|BJ750997.1|BJ750997 BJ750997 unpublished oligo-capped cDNA library Caenorhabditis elegans cDNA clone yk1360e06 5', mRNA sequence (sequence as given above)
EST1
MIYLFKFQVQIFLSGKALKGAKLRRNPRDIRWTVLYRIKNKKGTHGQEQVTRKKTKKSVQ
VVNRAVAGLSLDAILAKRNQTEDFRRQQREQAAKIA
Blastp results
>gi|47646579|gb|BJ775052.1|BJ775052 BJ775052 unpublished oligo-capped cDNA library Caenorhabditis elegans cDNA clone yk1360e06 3', mRNA sequence (sequence as given above)
EST2
MIYLFKFQVQIFLSGKALKGAKLRRNPRDIRWTVLYRIKNKKGTHGQEQVTRKKTKKSVQ
VVNRAVAGLSLDAILAKRNQTEDFRRQQREQAAKIAKDANKAVRAAKAAANKEKKASQPK
TQQKTAKNVKTAAPRVGGKR
Blastp results
>gi|47727995|gb|BJ818152.1|BJ818152 BJ818152 unpublished oligo-capped cDNA library, stage L4 Caenorhabditis elegans cDNA clone yk1685h11 3', mRNA sequence (sequence as given above)
EST3
EST1 is partial at the C-terminus
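The warning about orientation matters because a 3’ EST is read from the opposite strand: it must be reverse-complemented before translation. The sketch below uses the standard genetic code and hypothetical function names; the two short fragments are taken from the 5’ EST BJ750997 and the 3’ EST BJ775052, and the third from the retained intron.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Reverse-complement a DNA string (N stays N)."""
    return seq.upper().translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def translate(seq):
    """Translate from the first base, stopping at the first stop codon."""
    seq = seq.upper()
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")   # X for codons containing N
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


five_prime = "ATGAAGGTCGAAACCTGCGTT"     # fragment of the 5' EST (sense strand)
three_prime = "GGTTTCGACCTTCAT"          # fragment of the 3' EST (antisense strand)

print(translate(five_prime))                        # MKVETCV
print(translate(reverse_complement(three_prime)))   # MKVET
print(translate("ATGATTTATTTATTC"))                 # MIYLF (start inside the retained intron)
```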
Gene A
EST1 is partial.
EST3 corresponds to the UniProtKB/Swiss-Prot RL24_CAEEL sequence
Gene A
Some prediction programs give the correct protein sequence.
None has predicted the alternative splicing event (EST2; retention of intron 1084-1304)
Gene A
summary (ESTs)
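[Figure: EST alignments for Gene A drawn on a 5’ to 3’ axis]
Axis positions: 1003, 1083, 1305, 1406, 1452, 1661, 1914, 1997
EST BJ775052.1
EST BJ818152
Alternative splicing events (intron retention) -> 2 different mRNAs
MKVET… (position 1010)
MIYLF… (position 1284)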
Gene A
Gene 1 is on C. elegans chromosome I
BLAT results
Isoform 2, EST2
Gene B, Gene A
>NP_491399 length=159 MKVETCVYSGYKIHPGHGKRLVRTDGKVQIFLSGKALKGAKLRRNPRDIR WTVLYRIKNKKGTHGQEQVTRKKTKKSVQVVNRAVAGLSLDAILAKRNQT EDFRRQQREQAAKIAKDANKAVRAAKAAANKEKKASQPKTQQKTAKNVKT AAPRVGGKR
RefSeq sequence
InterPro scan results: the protein contains a ribosomal L24e domain
Conclusions (1)
There are 2 different protein sequences due to alternative splicing: the shortest isoform results from an intron retention and is rarely expressed (only 2 ESTs)
Gene A
Conclusions (2)
Gene prediction programs cannot predict an alternative splicing event (they can only predict the alternative splice junctions)
The protein (Gene A) is a ribosomal protein which belongs to the ribosomal protein L24e family (UniProtKB/Swiss-Prot O01868).
The alternatively spliced sequence is not yet in the protein sequence databases, because it is ‘derived’ from EST sequences, which are submitted to public DNA/RNA databases without an annotated CDS
Non coding region analysis
3’end of chromosome Y EMBL #AJ271736
Example of Alu sequence
Gene 2
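Summary diagram
[Figure: Gene 2 predictions drawn on a 5’ to 3’ axis]
Axis positions: 789, 1111, 1410, 1636, 1688, 1845
Netgene2 splice sites (DO: donor; AC: acceptor; score in parentheses):
AC 1112 (0.56), DO 1409 (0.92)
DO 1556 (0.96), AC 1637 (0.61)
Programs shown: HMMgene, Netgene2
Exon 1, Exon 2, Exon 3: positions 1112, 1407, 1557, 1637, 1688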
GeneBuilder prediction is not confirmed anywhere else
CDS2 (3 exons)
RefSeq NP_491393 (AF272397)
UniProtKB/TrEMBL: G5EC89
237 AA; 3 exons
MMMEYGGYFS SSAVAQQSGD VPTTAPSAVT NSFFYTPQSH NIYHQYATPY LQSGRALTTA HNTSSSSAGN STSSSSSSSN YRNTTHDSLQ AFFNTGLQYQ LYQKSQLIGS DTIQRTSSNV LNGLPRSSLV GALCSTGGAP LNPAERRKQR RIRTTFTSGQ LKELERSFCE THYPDIYTRE EIAMRIDLTE ARVQVWFQNR RAKYRKQEKI RRVKDEEEDP LKKEPGQISL EEIIDQI
A probable nuclear protein with a DNA binding domain (homeobox)
Gene 3
Numbering: ‘direct strand’
CDS3
>tr|O01864|O01864_CAEEL Hypothetical protein - Caenorhabditis elegans. METEVMKSFNNELSSLFDSKNMSKNKIQDITKAAIKAKSQYKHVVFSVEKLINKCKPDQR LNVLYVIDSIVRASKHQLKEKDTFGPRFMKQFDKFLMPLLKCGQKEKMRTVRTLNLWMSN KVFKESEIQPLREMCKASGLTIDFEEVELAVKGKQADMSIYSGVYKKKPKRSSSSSQPKS RTPTNPHPDDGLLGAGPSSALRSVPDIPNFVLSEDYFLGTISEREMLELVQKFGIDRSGV LSKDKNLLQRALQIFAGSLSQKVEEVLAENNRINGSSIQNVLTKDFEYSDDEEEKEKEPQ PEKQKNLPHAQVLLLAQSLLTQPQILAKLAEVLIPQGNPFGLPFPGEHIVPTSSAALTLG APPPNLMALQQSLPPGFPNQQLGLPNLSGLNQAQLMNVQNAQNMLQLQQRAAQLQALQGN PNAQRNLLMLGNPLLNPFALQHGVNPMLNDLQAAAAAQQQAMLNEAAQSPEKKILELSGG NSGINNSGDVERARLREKEKERESKERRRMGLPPVRIGFTIIASRTLWLKKIPTNIVEND LKQAVESCGEASRVKVIGNRACAYITMENRRSANDVVSKMREVSVAKKMVKVYWARSPGM DSDQFSDLWDSNRGVLEIPYEKLPLDLVALCEGAMLDIESLPIEKKLLYKETGETVISIP PPNIQPPVPHPPPMGFPFQHQLTQLPGQPRPAGLPPGVPPMFNLNAPPPPGIPGYPPAPP PPGVGPPPPQGIPPMGFDPNKPPPPMFQQGFNAGAPPPPFGRGAGPMSSFPPPPRGGMHH MPPPPSFRGGRGGHGGPPPPHFDRRGGGGPPFRPENGRGRLLDQSEMWNREQREMRGGGG AGRDGGREHRDYDRDRSQIDRRRQDDMGARRRSRWGDDDRRDDDRRDDRRDDRRESRRRS PRSPRSPDRRTRRSPSYEREEPPVKKTSVEEETVSSTTLDELKPSVEPTPVPAPIPAPAP
ELKAAEEPVKIVAEHHEDQTDEVPMDLE
Gene 4
Removed from gene 4: 1412-1691, 1795-5682, 5842-6048, 6865-6907, 7133-7413, 7518-7589, 7754-7799, 7912-7958, 8154-8222, 8414-8496, 8660-8709, 9043-9114, 9529-9573, 9706-9769, 9943-9996
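Removing a list of intervals from the genomic sequence is how the spliced sequence is obtained from such a list. The sketch below uses a hypothetical function name and a toy sequence; coordinates are 1-based and inclusive, as in the list above. The overlap check is there because a mistyped coordinate would otherwise silently remove exon bases.

```python
def remove_intervals(seq, intervals):
    """Delete 1-based inclusive intervals from seq and return what remains."""
    kept = []
    pos = 0                                   # 0-based index of the next base to keep
    for start, end in sorted(intervals):
        if start - 1 < pos:
            raise ValueError(f"interval {start}-{end} overlaps a previous one")
        kept.append(seq[pos:start - 1])       # bases before the removed interval
        pos = end                             # resume just after the removed interval
    kept.append(seq[pos:])
    return "".join(kept)


# Toy example (assumption): remove positions 4-6 and 10-12
print(remove_intervals("AAACCCGGGTTT", [(4, 6), (10, 12)]))  # AAAGGG
```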
EST HMMgene WebGene Netgene2 (AG) (GT)
1346 1411
1695 1794 1691 1795
5405 5449
5679 5841 5668 5859 5683 5841 5682 5842
6049 6080 6049 6864 6049 6864 6048 6865
6908 6993 6908 7132 6908 7132 6907 7133
7187 7328 7187 7328 7186 7329
7411 7520 7414 7517 7414 7517 7413 7518
7564 7589
7959 8153 7958 8154
7589 7753 7589 7754
7800 7911 7800 7911 7799 7912
7954 8113 7959 8135
8223 8413 8223 8413 8222 8414
8497 8659 8497 8659 8496 8660
8710 9042 8710 9042 8709 9043
9115 9528 9115 9528 9114 9529
9631 9705 9574 9705 9574 9705 9573 9706
9770 9943 9770 9946 9770 9942 9943
9997 10350 9996
Protein Q9N323
>tr|Q9N323|Q9N323_CAEEL Hypothetical protein - Caenorhabditis elegans. MSTNNYQTLSQNKADRMGPGGSRRPRNSQHATASTPSASSCKEQQKDVEHEFDIIAYKTT FWRTFFFYALSFGTCGIFRLFLHWFPKRLIQFRGKRCSVENADLVLVVDNHNRYDICNVY YRNKSGTDHTVVANTDGNLAELDELRWFKYRKLQYTWIDGEWSTPSRAYSHVTPENLASS APTTGLKADDVALRRTYFGPNVMPVKLSPFYELVYKEVLSPFYIFQAISVTVWYIDDYVW YAALIIVMSLYSVIMTLRQTRSQQRRLQSMVVEHDEVQVIRENGRVLTLDSSEIVPGDVL VIPPQGCMMYCDAVLLNGTCIVNESMLTGESIPITKSAISDDGHEKIFSIDKHGKNIIFN GTKVLQTKYYKGQNVKALVIRTAYSTTKGQLIRAIMYPKPADFKFFRELMKFIGVLAIVA FFGFMYTSFILFYRGSSIGKIIIRALDLVTIVVPPALPAVMGIGIFYAQRRLRQKSIYCI SPTTINTCGAIDVVCFDKTGTLTEDGLDFYALRVVNDAKIGDNIVQIAANDSCQNVVRAI ATCHTLSKINNELHGDPLDVIMFEQTGYSLEEDDSESHESIESIQPILIRPPKDSSLPDC QIVKQFTFSSGLQRQSVIVTEEDSMKAYCKGSPEMIMSLCRPETVPENFHDIVEEYSQHG YRLIAVAEKELVVGSEVQKTPRQSIECDLTLIGLVALENRLKPVTTEVIQKLNEANIRSV MVTGDNLLTALSVARECGIIVPNKSAYLIEHENGVVDRRGRTVLTIREKEDHHTERQPKI VDLTKMTNKDCQFAISGSTFSVVTHEYPDLLDQLVLVCNVFARMAPEQKQLLVEHLQDVG QTVAMCGDGANDCAALKAAHAGISLSEAEASIAAPFTSKVADIRCVITLISEGRAALVTS YSAFLCMAGYSLTQFISILLLYWIATSYSQMQFLFIDIAIVTNLAFLSSKTRAHKELAST PPPTSILSTASMVSLFGQLAIGGMAQVAVFCLITMQSWFIPFMPTHHDNDEDRKSLQGTA IFYVSLFHYIVLYFVFAAGPPYRASIASNKAFLISMIGVTVTCIAIVVFYVTPIQYFLGC LQMPQEFRFIILAVATVTAVISIIYDRCVDWISERLREKIRQRRKGA
Prediction of mitochondrial genes (human)
NC_012920.1
Mitochondrial genome NC_012920.1 annotation
tRNA scan prediction
tRNA scan lists:
1- all the tRNAs in the current strand
2- all the tRNAs in the complement strand
This tRNA is found at the end of the list
Conclusion
• Good tRNA prediction
• If you try: very bad protein-coding gene prediction…
– The mitochondrial genome does not have the same sequence content (codon bias, signals) as the nuclear genome.
– You might try a ‘prokaryota’-like gene model, but the results are not perfect!
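One further difference, not spelled out in the slides, is that vertebrate mitochondria use a slightly different genetic code (NCBI translation table 2): TGA codes for Trp, ATA for Met, and AGA/AGG are stop codons. The sketch below, with hypothetical names and a toy sequence, shows how a predictor assuming the standard code would truncate a mitochondrial reading frame at the first TGA.

```python
from itertools import product

BASES = "TCAG"
STANDARD_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]

standard_code = dict(zip(CODONS, STANDARD_AMINO))
# Vertebrate mitochondrial code: the four codons that differ from the standard code.
vertebrate_mito_code = dict(standard_code)
vertebrate_mito_code.update({"TGA": "W", "ATA": "M", "AGA": "*", "AGG": "*"})


def translate(seq, table):
    """Translate from the first base with the given codon table, stopping at a stop codon."""
    seq = seq.upper()
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = table.get(seq[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


toy = "ATGTGAATATAA"   # toy reading frame (assumption): ATG TGA ATA TAA
print(translate(toy, standard_code))          # M   (TGA read as a stop)
print(translate(toy, vertebrate_mito_code))   # MWM
```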