COST workshops 2011 Annotation Mitrovic Kube
Transcript of COST workshops 2011 Annotation Mitrovic Kube
Jelena Mitrovic & Michael Kube
Contact: <[email protected]> <[email protected]>
What’s an Annotation?
Annotation: the process of assigning information to a location on a sequence.
* Location: positions have to be addressed on the sequence (e.g. 101..1000).
* Assignment of information: features are classified by keys and described by corresponding additional information, the so-called qualifiers.
* Sequence: given in 5'-3' orientation.
⇒ We need a language with a controlled vocabulary to submit and to exchange the data. The data in *.embl and *.gbk files are arranged as a table in a plain text file.
(JMMK0711)
Minimal Requirements- NCBI/EMBL-EBI
[Requirements for Every Submission]
* Contact information: name, address, phone number, fax number and e-mail address of the submitter, or sequencing center code.
* Release date information (private until)
* Reference information
  o Sequence authors: list of authors credited with the sequence.
  o Citation(s) associated with the sequence (publication or preliminary title).
* Source description
  o Scientific name (Genus species) of the source organism or description (e.g., uncultured bacterium). For synthetic sequences, please provide a specific name (e.g., cloning vector pRB223).
  o Unique source modifiers (e.g., clone, strain, isolate, cultivar, specimen voucher name). These are especially important if the scientific name is not known.
  o Identification of the organelle from which any non-nuclear nucleotide sequence originates (e.g., chloroplast, mitochondrion).
* Input DNA sequence
  o A contiguous nucleotide sequence of at least 50 base pairs, sequenced by the submitter(s).
  o Type of molecule sequenced (e.g. genomic DNA, genomic RNA, mRNA).
  o Description of the sequence/annotation
(JMMK0711)
Common nomenclature for DNA sequences but different dialects
* DDBJ: DNA Data Bank of Japan, Mishima, Japan [http://www.ddbj.nig.ac.jp/]
* EMBL: European Molecular Biology Laboratory, Nucleotide Sequence Database, Cambridge, UK / EMBL-EBI, Hinxton [http://www.ebi.ac.uk/]
* GenBank: NCBI, National Center for Biotechnology Information, part of the National Institutes of Health, Bethesda, MD, USA [http://www.ncbi.nlm.nih.gov/]
(JMMK0711)
Minimal requirements header section (e.g. CU469464)
(JMMK0711)
Minimal requirements DNA sequence- Annotation feature header (e.g. CU469464)
FEATURES             Location/Qualifiers
     source          1..601943
                     /organism="Candidatus Phytoplasma mali"
                     /mol_type="genomic DNA"
                     /strain="AT"
                     /db_xref="taxon:37692"
* source: the key
* 1..601943: total sequence length
* /db_xref="taxon:37692": taxon identifier. It is obligatory, but works only more or less. However, taxon IDs are always linked to a general overview.
The taxon ID is assigned by the annotator. If no taxon ID is available, a general taxon ID will be assigned.
(JMMK0711)
Example: Simple entry
Feature         Location/Qualifiers
CDS             23..400
                /product="alcohol dehydrogenase"
                /gene="adhI"
                /locus_tag="ATP_00001"
Left-hand side, the key:
* CDS: coding sequence
* examples of other typical keys: tRNA (transfer ribonucleic acid), rRNA (ribosomal ribonucleic acid), gene
* the key is followed by the location
Right-hand side, location and qualifiers:
* location: 23..400 on the forward strand of the sequence (reverse strand: complement(23..400))
* /product="...": the function of the protein in this case
* /gene="...": abbreviation, if possible
* /locus_tag="ATP_00001": strain key + identifier (street number)
KEEP IN MIND! Qualifiers start with a slash (/).
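The location syntax from the simple entry can be made concrete with a few lines of Python. This is only a minimal sketch and not part of the original slides: the helper names `parse_location` and `extract` are hypothetical, and only the two simple forms shown above (`23..400` and `complement(23..400)`) are handled. Joined or partial locations are deliberately left out.

```python
import re

# Only the two simple forms from the slide are supported (assumption):
# "23..400" and "complement(23..400)".
_LOCATION = re.compile(r"^(?:(\d+)\.\.(\d+)|complement\((\d+)\.\.(\d+)\))$")
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def parse_location(location):
    """Return (start, end, strand); coordinates are 1-based and inclusive."""
    match = _LOCATION.match(location.replace(" ", ""))
    if match is None:
        raise ValueError(f"unsupported location: {location!r}")
    if match.group(1) is not None:
        return int(match.group(1)), int(match.group(2)), 1
    return int(match.group(3)), int(match.group(4)), -1


def extract(sequence, location):
    """Cut the feature out of the sequence, reverse-complementing if needed."""
    start, end, strand = parse_location(location)
    fragment = sequence[start - 1:end]
    if strand == -1:
        fragment = fragment.translate(_COMPLEMENT)[::-1]
    return fragment
```

Note that feature-table coordinates count from 1 and include the end position, which is why the slice starts at `start - 1`.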
Feature Table Terminology- Coding sequence
(JMMK0711)
Feature Table Terminology- Features & Qualifiers
~60 features &
~126 qualifiers
http://www.ebi.ac.uk/ena/WebFeat/ (JMMK0711)
Example- Feature with CDS (coding sequence) key
CDS             1..891
                /gene="repA"
                /locus_tag="ETA_pET460010"
                /note="silverDB:46p00001"
                /codon_start=1
                /transl_table=11
                /product="Replication protein"
                /protein_id="YP_001905924.1"
                /db_xref="GI:188535864"
                /db_xref="GeneID:6302942"
                /translation="MAFIRHHDWCRNPDLIALRRKGYTPYSRTFDRDFRPKPMRITAR
                SESREALSALSMVLAANCDYSPDSEYMFETMLPVEEMARRMGVLHV
                YESGRKAYDVLLALRVLEQMEYVVVHRDRDSDSGQHKPMRIFLTES
                FFTSRGMTVENVRSWLHKYRQWAVASGVAESMREKYERHQIKMARL
                GISIERHHSLKNRLKKIKRWVVSPDLRAEKQRVTSDLERALDGHAG
                SVRPLRPRAGSGRYRQAWLRWSASAETYPAECWKLEQAVKAEHPQLH
                VTDPEKYHRLLLDRAGVTPE"
* 1..891: location
* /gene: gene name
* /locus_tag: street number
* /note: note, e.g. reference to an external database (protein DB)
* /codon_start: translation start
* /transl_table: translation table (11 = bacterial code)
* /protein_id: reference to internal database: unique protein identifier & version number
* /db_xref: GenInfo Identifier (GI), formally linked to external databases (e.g. EMBL). Reference links to internal data (e.g. NCBI processed genome). Two entries are present (gene and protein). PLEASE KEEP IN MIND: GIs are not stable. Never use them as a reference.
* /translation: deduced amino acid sequence from the given location
(JMMK0711)
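Because every qualifier starts with a slash and has the form /name=value, a CDS entry like the repA example can be split into its qualifiers mechanically. The following is a rough sketch under simplifying assumptions: the function name `parse_qualifiers` is hypothetical, quoted values are assumed not to contain double quotes themselves, and repeated qualifiers such as /db_xref are collected in a list.

```python
import re
from collections import defaultdict

# /name="quoted value" or /name=bare_value (e.g. /codon_start=1)
_QUALIFIER = re.compile(r'/(\w+)=(?:"([^"]*)"|(\S+))')


def parse_qualifiers(text):
    """Map each qualifier name to the list of its values, in order of appearance."""
    qualifiers = defaultdict(list)
    for match in _QUALIFIER.finditer(text):
        name, quoted, bare = match.groups()
        qualifiers[name].append(quoted if quoted is not None else bare)
    return dict(qualifiers)


entry = (
    '/gene="repA" /locus_tag="ETA_pET460010" /codon_start=1 '
    '/transl_table=11 /db_xref="GI:188535864" /db_xref="GeneID:6302942"'
)
parsed = parse_qualifiers(entry)
```

All values are kept as strings; converting /codon_start or /transl_table to integers is left to the caller.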
Translation table code: The Standard Code
For overview see http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG1
Alternative Initiation Codons
In rare cases, translation in eukaryotes can be initiated from codons other than AUG. A well documented case (including direct protein sequencing) is the GUG start of a ribosomal P protein of the fungus Candida albicans or in NAT1 (Takahashi et al. 2005). Other examples can be found. The standard code currently allows initiation from UUG and CUG in addition to AUG. By default all translation tables in GenBank flatfiles are equal to id 1, and this is not shown. When the translation table is not equal to id 1, it is shown as a qualifier on the CDS feature.
Codons encoding methionine (M) or in some cases leucine (L, but only UUG/TTG and CUG/CTG) can act as initiation codons.
(JMMK0711)
Translation table code: Bacteria
For overview see http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
TTT F Phe      TCT S Ser      TAT Y Tyr      TGT C Cys
TTC F Phe      TCC S Ser      TAC Y Tyr      TGC C Cys
TTA L Leu      TCA S Ser      TAA * Ter      TGA * Ter
TTG L Leu i    TCG S Ser      TAG * Ter      TGG W Trp

CTT L Leu      CCT P Pro      CAT H His      CGT R Arg
CTC L Leu      CCC P Pro      CAC H His      CGC R Arg
CTA L Leu      CCA P Pro      CAA Q Gln      CGA R Arg
CTG L Leu i    CCG P Pro      CAG Q Gln      CGG R Arg

ATT I Ile i    ACT T Thr      AAT N Asn      AGT S Ser
ATC I Ile i    ACC T Thr      AAC N Asn      AGC S Ser
ATA I Ile i    ACA T Thr      AAA K Lys      AGA R Arg
ATG M Met i    ACG T Thr      AAG K Lys      AGG R Arg

GTT V Val      GCT A Ala      GAT D Asp      GGT G Gly
GTC V Val      GCC A Ala      GAC D Asp      GGC G Gly
GTA V Val      GCA A Ala      GAA E Glu      GGA G Gly
GTG V Val i    GCG A Ala      GAG E Glu      GGG G Gly
(i = codon that can act as initiation codon)
Codons encoding methionine (M) or, in some cases, isoleucine (I), leucine (L, but only UUG/TTG and CUG/CTG) and valine (GUG/GTG) can act as initiation codons.
(JMMK0711)
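To see how the bacterial code is applied in practice, the sketch below translates a CDS in Python. It is an illustration rather than a reference implementation: the codon table is built from the compact 64-letter string in the TCAG order used on the NCBI genetic code page, the set of initiation codons is taken from the slide, and the function name `translate_cds` is hypothetical. An alternative initiation codon at the first position is translated as methionine; internally the same codon keeps its normal meaning.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for TTT, TTC, TTA, TTG, TCT, ... GGG (same for tables 1 and 11).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
# Initiation codons of the bacterial code as listed on the slide.
BACTERIAL_STARTS = {"TTG", "CTG", "ATT", "ATC", "ATA", "ATG", "GTG"}


def translate_cds(dna, starts=BACTERIAL_STARTS):
    """Translate a CDS from its first codon up to (not including) the first stop.

    Ambiguous bases such as N are not handled and raise a KeyError.
    """
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if i == 0 and codon in starts:
            protein.append("M")  # every initiation codon is read as Met
            continue
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

For example, `translate_cds("GTGGTGTAA")` reads the first GTG as methionine and the second as valine.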
Looking for a Coding Sequence- ORF/ open reading frame (initiation- termination signals)
Just an example:
In Artemis, in-frame termination signals are indicated as:
-> + TAG (UAG)
-> # TAA (UAA)
-> * TGA (UGA)
(JMMK0711)
Initiation and termination signals occur frequently
Example: Open the result from the last session in Artemis (File -> Open, Files -> All files, select your example file) and switch off the feature entries. Right-click within a reading frame and choose "Start Codons". Go back to the Artemis entry page (Options) and change the genetic code to see the differences. Enable the features to see the optional start codons within the CDS features.
(JMMK0711)
ORF prediction by definition will result in conflicts and overlaps
The example does not take into consideration the different initiation sites within the ORFs.
(JMMK0711)
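Why a pure ORF scan produces conflicts becomes obvious once all six frames are searched independently. The following naive scanner is a sketch for illustration only, not the method used by Artemis or Glimmer: the start codon set is an assumed bacterial subset, the function names are hypothetical, and each ORF is reported from its most upstream start codon to the stop codon.

```python
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}  # assumption: a common bacterial subset
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def _orfs_on_strand(seq, min_codons):
    """Yield 0-based, end-exclusive (start, end) pairs, stop codon included."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOPS:
                if start is not None and (i - start) // 3 >= min_codons:
                    yield start, i + 3
                start = None
            elif start is None and codon in STARTS:
                start = i  # keep the most upstream start only


def find_orfs(seq, min_codons=1):
    """Return feature-table style locations for ORFs in all six frames."""
    seq = seq.upper()
    n = len(seq)
    locations = [f"{s + 1}..{e}" for s, e in _orfs_on_strand(seq, min_codons)]
    for s, e in _orfs_on_strand(reverse_complement(seq), min_codons):
        locations.append(f"complement({n - e + 1}..{n - s})")
    return locations
```

On a real genome, raising `min_codons` removes many short spurious ORFs, but overlapping candidates in different frames remain and still have to be resolved by a gene finder or by manual inspection.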
ORF Determination
Extrinsic or evidence-based systems for gene identification (experimentally driven approach)
- identification of the mRNA
- N-terminal protein sequencing (Edman sequencing)
- peptide mass fingerprinting (PMF)
Limits:
- mRNA: difficult to catch the 5'-end
- Edman sequencing: elaborate and not always possible (protein isolation, N-terminal modification ...)
- PMF: needs at least a partial separation (2D gel electrophoresis)
(JMMK0711)
ORF Prediction- ab initio approaches (from the beginning; predictions)
Bacteria
- identification of promoter regions (initiation of transcription), e.g.
  * AT-rich UP-elements (upstream of the −35 region),
  * −35 region, target consensus sequence 5'-TTGACA-3',
  * −10 region or Pribnow box, consensus sequence 5'-TATAAT-3' (similar to the TATA box of eukaryotes).
- identification of the RBS (ribosome binding site)
- identification of long ORFs
  * The length of an ORF can be used as an informative signal. 3 of the 64 possible triplets encode termination signals, so in a random sequence a termination signal every 20-25 triplets, or 60-75 bp, would be normal.
- complex probabilistic models, e.g. Hidden Markov Models (HMMs), taking into account signals (e.g. promoter structures) and sequence motifs (e.g. protein models, amino acid distributions, GC content ...) ⇒ e.g. GLIMMER (based on interpolated Markov models), common for gene prediction in Archaea and Bacteria
- Innovative approaches improve the initial prediction by taking into account known conserved syntenies (SEED).
(JMMK0711)
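The argument that long ORFs are informative rests on simple arithmetic, which can be checked in a few lines. The sketch below assumes a random sequence with equal base frequencies and independent codons, which real genomes violate (GC content alone shifts the numbers), so the values are only a rough guide. The function names are hypothetical.

```python
STOP_FRACTION = 3 / 64  # TAA, TAG and TGA out of 64 possible triplets


def expected_codons_between_stops():
    """Mean spacing of stop codons in one reading frame of random sequence."""
    return 1 / STOP_FRACTION


def probability_open(n_codons):
    """Chance that n consecutive codons in one frame contain no stop codon."""
    return (1 - STOP_FRACTION) ** n_codons


spacing = expected_codons_between_stops()   # 64 / 3, about 21.3 codons
spacing_bp = 3 * spacing                    # 64 bp
chance_100 = probability_open(100)          # below 1 % for 100 codons (300 bp)
```

Under these assumptions a stop-free stretch of 100 codons arises by chance in fewer than 1 in 100 cases per frame position, which is why ORF length is a useful first filter.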
Glimmer
http://www.cbcb.umd.edu/software/glimmer/
http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi
ONLINE VERSION
(JMMK0711)
Example based on orf prediction
(JMMK0711)
http://blast.ncbi.nlm.nih.gov/Blast.cgi
(JMMK0711)
Artemis- Annotation platform
(JMMK0711)
Other all in one solutions- examples
academic
commercial
(JMMK0711)
-> fully-automated service for annotating bacterial and archaeal genomes
-> provides high quality genome annotations
-> SEED-quality automated annotation service
-> automated quality gene calling
-> functional annotation
-> free service
-> turn-around 12-24 hours
-> integration in SEED possible; the genome annotation provided includes a mapping of genes to subsystems and a metabolic reconstruction
=> Open the results in Artemis and review the automatic annotation!
This sounds like an advertisement, but it works fine! However, the results need manual inspection! Do not underestimate the work!
(JMMK0711)
Genome comparison via ACT (Artemis Comparison Tool)
http://www.sanger.ac.uk/Software/ACT/ (full functionality by big_blast)
http://www.webact.org/WebACT/generate (online, limited options)
- an in silico fragmented genome sequence is aligned via BLAST to a reference
- genome sequences and BLAST results are opened in ACT
- ACT supports a graphical overview and an editor (similar to Artemis)
- assigned regions are illustrated by connecting lines
- problems result from the BLAST approach (unspecific hits)
(JMMK0711)
Data submission
Depending on the kind of data & submitter:
* NCBI: e.g. Sequin or BankIt http://www.nlm.nih.gov/pubs/factsheets/sdgenbk.html
* EMBL-EBI: Webin, but Sequin is also supported (in parts) http://www.ebi.ac.uk/embl/Submission/index.html
  -> low amounts of DNA sequences, HTS-sequences and ESTs
* However, direct submission upload (bulk submission) is possible for sequencing centers.
  -> genome submission (European genome centers)
Major problem! Upload of feature tables is not supported (except direct bulk submission)! However, feature tables can be uploaded as an update!
(JMMK0711)
Training part 1
Introduction- Artemis an editor for annotation
Artemis- Annotation platform
(JMMK0711)
Artemis- Getting started
Open your anno_train folder!
(JMMK0711)
Artemis- Download an entry from EBI
e.g. add the acc. no. DQ119295 and start download
(JMMK0711)
Open Tutorial 1 in Artemis
select
Select all files!
(JMMK0711)
Artemis- Entry page downloaded entry
position
scaling of the selection
Table view
forward frames
reverse frames
Overview + features
Black bars indicate stop codons
(JMMK0711)
Short training- Most essential topics of this tutorial
Click to select
(JMMK0711)
Editing
Add qualifiers in the text box, e.g. /locus_tag="Mel_00001" or /colour=2
(JMMK0711)
Create New feature (1)
Choose key -> repeat_region
Set position -> complement(1..1000)
Add qualifier -> /note="getting started"
(JMMK0711)
OK
Create New feature (2)
(JMMK0711)
First assignment using BLAST
=> The BLAST family is included in Artemis (-> Run), with direct access to NCBI
Examples for applications during annotation:
* BlastP (protein blast): search a protein database using a protein query
  -> CDS features
* BlastN (nucleotide blast): search a nucleotide database using a nucleotide query
  -> rRNA operons
  -> overall comparison (high homologies)
  -> intergenic regions
* BlastX (translated blast): search a protein database using a translated nucleotide query
  -> searching for unpredicted CDS regions in intergenic regions
  -> searching for disrupted CDS regions
* TblastN (translated blast): search a translated nucleotide database using a protein query
  -> searching (NGS draft) sequences for known candidate genes
* TblastX (translated blast): search a translated nucleotide database using a translated nucleotide query
  -> mRNA assignment
==> Please keep in mind! BLAST hits have to be reviewed:
  -> e-value!
  -> identity/similarity!
  -> alignment length!
Kube, APVW 2010 (JMMK0711)
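The three review criteria (e-value, identity, alignment length) can be applied as a first automatic filter before the manual check. The sketch below assumes BLAST results saved in the standard 12-column tabular format (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function name and all threshold values are hypothetical choices for illustration, not recommendations from the slides.

```python
def filter_hits(lines, query_length, max_evalue=1e-10,
                min_identity=40.0, min_coverage=0.8):
    """Keep hits that pass e-value, identity and query coverage thresholds.

    Returns tuples of (subject id, percent identity, query coverage, e-value).
    """
    kept = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        subject = fields[1]
        identity = float(fields[2])
        qstart, qend = int(fields[6]), int(fields[7])
        evalue = float(fields[10])
        # Fraction of the query covered by this alignment.
        coverage = (abs(qend - qstart) + 1) / query_length
        if (evalue <= max_evalue and identity >= min_identity
                and coverage >= min_coverage):
            kept.append((subject, identity, coverage, evalue))
    return kept


example = [
    "q1\tsubjA\t100.00\t260\t0\t0\t1\t260\t1\t260\t1e-150\t520",
    "q1\tsubjB\t35.00\t80\t50\t2\t10\t89\t5\t84\t0.002\t40",
    "q1\tsubjC\t95.00\t50\t2\t0\t1\t50\t1\t50\t1e-20\t100",
]
hits = filter_hits(example, query_length=260)
```

In the made-up example only the first hit survives: the second fails on identity and e-value, the third has a good e-value but covers only a small part of the query. Hits that pass still need to be looked at by eye.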
Training part 2
Practice- annotation (E. coli sequence, select code 11)
PART 1- Predictions
1. Download the sequence tutorial_2.fa. Copy the sequence, paste it into the Windows Editor (!) and save the file in the artemis folder.
2. Run predictions
2.1 Start the ORF (CDS) prediction at the Glimmer site: http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi Choose bacterial! Your template is linear!
2.2 Start the tRNA prediction at the tRNAscan-SE site: http://lowelab.ucsc.edu/tRNAscan-SE/ Choose bacterial!
2.3 Start the rRNA prediction at the RNAmmer site: http://www.cbs.dtu.dk/services/RNAmmer/ Choose Bacteria!
3. While these analyses are running, open tutorial_2.fa in Artemis. Select the bacterial code (11) -> Artemis main window, Options
4. Transfer the prediction results to Artemis
4.1 Generate the features: use the applicable keys (2.1, 2.2 & 2.3) and enter the positions
4.2 CDS: add the qualifier locus_tag to give street numbers to the CDS features
4.3 tRNAs: add the qualifier "product" to describe the predicted tRNA (/product="tRNA-???") and the qualifier "note" to add information on the prediction, e.g. /note="tRNAScan-SE score 92.73"
4.4 rRNA: add the qualifier "product" to describe the predicted rRNA
If you have finished this part, please check whether your colleagues need help. If they don't, proceed with PART 2. (JMMK0711)
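Entering every predicted position by hand (step 4) is error-prone for larger sequences, so the transfer can also be scripted. The sketch below turns gene prediction lines into EMBL-style feature table lines with consecutive locus tags. Several things are assumptions here: the input is taken to look like Glimmer3 prediction lines (identifier, start, end, frame, score, with start greater than end on the reverse strand), the template is linear as in the tutorial, and the function name and the locus tag prefix are placeholders.

```python
QUALIFIER_INDENT = "FT" + " " * 19  # qualifiers start in column 22


def glimmer_to_features(predict_lines, locus_prefix="Mel"):
    """Convert prediction lines into EMBL feature table lines (linear template)."""
    lines = []
    number = 0
    for raw in predict_lines:
        if not raw.strip() or raw.startswith(">"):
            continue  # skip blank lines and the sequence header
        _orf_id, start, end, frame, score = raw.split()[:5]
        low, high = sorted((int(start), int(end)))
        if frame.startswith("-"):
            location = f"complement({low}..{high})"
        else:
            location = f"{low}..{high}"
        number += 1
        lines.append(f"FT   {'CDS':<16}{location}")
        lines.append(f'{QUALIFIER_INDENT}/locus_tag="{locus_prefix}_{number:05d}"')
        lines.append(f'{QUALIFIER_INDENT}/note="Glimmer3 score {score}"')
    return lines


prediction = [
    ">tutorial_2",
    "orf00001      100      400  +1     9.50",
    "orf00002     1500     1000  -2     8.10",
]
features = glimmer_to_features(prediction)
```

Joining `features` with newlines gives text that can be saved next to the sequence and inspected in a text editor before loading it as an additional entry. The predicted coordinates should still be checked against the frame display, since the script does nothing to verify start codons.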
Practice- annotation (E. coli sequence)
PART 2- Assign CDS functions
5. CDS functional assignment. Start with the CDS feature at the 3'-end of the sequence! Select the CDS feature, then choose 'Run' from the Artemis menu bar and select BlastP. Examine the BlastP results and add the qualifiers product and gene (for the gene name) to the features. When you reach the last CDS (at the 5'-end of the sequence), please look more carefully at the BlastP-derived alignment.
6. Save results. Ensure that WordPad is closed. Choose 'File' from the Artemis menu bar and save the entries as EMBL.
7. View results in WordPad. The format has changed to EMBL.
Again, if you have finished this part, please check whether your colleagues need help. Results will be compared together.
(JMMK0711)
Results
ORF Prediction (1)
or
http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi (JMMK0711)
ORF/CDS (2)
(A) New feature
(B) CDS key
(C) add location and complement (-2!)
Kube, APVW 2010 (JMMK0711)
tRNAs (1)
(JMMK0711)
tRNAs (2)
start/stop position
-> use tRNA key
-> add note for software and score
-> use product qualifier to describe, e.g. /product="tRNA-Arg"
(JMMK0711)
rRNA
Results:
-> use rRNA key
-> add note for software
-> use product qualifier to describe, e.g. /product="16S rRNA"
-> add position (reverse strand!)
Summary
(JMMK0711)
Functional Annotation
Assign products and gene names
Please start from this point!
(JMMK0711)
How to run BlastP on Artemis
Please keep in mind: the first hit (and also the following ones) may not be the correct assignment. Additional analysis is needed for most of the entries.
select CDS
An NCBI window will appear in your web browser
Results from the BLASTP
Query length 260 aa, a hit over the whole sequence with 100% identity
(JMMK0711)
Example how to annotate the first CDS
Use the shortcut Ctrl+E or go to Edit and choose "Selected Features in Editor"
Select qualifiers or add the text in the field
(JMMK0711)
Save results- use EMBL Format
File name should end with ".embl". After this, please close Artemis.
Results can be downloaded here: http://ws.molgen.mpg.de/ws/693356/toanalyse_faster_anno.embl
Results & Questions
This CDS feature was predicted with a very low score. BlastP shows a strong hit (with a different start & stop) to conserved hypothetical proteins. However, there is no functional assignment and no hit in Pfam. This region was not masked for the CDS prediction (a common problem: double coding). Select and erase!
(JMMK0711)
Pfam- Collection of protein families
(JMMK0711)
Functional Annotation- Additional information
The Flagellar system in KEGG
(JMMK0711)
The Flagellar system in KEGG
next window
(JMMK0711)
The Flagellar system in KEGG
Kube, APVW 2010
“Genome” comparison in ACT
Starting ACT
Start -> sact_v9.jar
(JMMK0711)
Adjust results in ACT
red lines indicate alignments in the same orientation, blue ones are inverted