Genome Annotation, Gene Ontology, Sequence Ontology
Arthur Gruber
Instituto de Ciências Biomédicas, Universidade de São Paulo
AG-ICB-USP
Sequence annotation
• Annotation is the process of adding information to a DNA sequence.
• The information is usually anchored to coordinates on the DNA sequence.
• Features could be repeats, genes, promoters, protein domains…
• Features can be linked to other databases, e.g. Pfam or PubMed.
Public databases
• GenBank, EMBL and DDBJ.
• All databases update each other automatically.
The Feature Definition
• Format definition
• Covers DDBJ/EMBL/GenBank
• Defines all accepted annotation terms and hierarchy
Annotation file
Contains:
• A header with:
  • Information about the sequence
  • Organism
  • Authors
  • References
  • Comments
• A feature table containing:
  • Sequence features and coordinates
NCBI Header
LOCUS       NC_008685            1347714 bp    DNA     linear   INV 26-FEB-2009
DEFINITION  Eimeria tenella str. Houghton chromosome 1, ordered contigs.
ACCESSION   NC_008685
VERSION     NC_008685.1  GI:153816670
DBLINK      Project:18295
KEYWORDS    .
SOURCE      Eimeria tenella str. Houghton
  ORGANISM  Eimeria tenella str. Houghton
            Eukaryota; Alveolata; Apicomplexa; Coccidia; Eucoccidiorida;
            Eimeriorina; Eimeriidae; Eimeria.
REFERENCE   1
  AUTHORS   Ling,K.H., Rajandream,M.A., Rivailler,P., Ivens,A., Yap,S.J.,
            Madeira,A.M., Mungall,K., Billington,K., Yee,W.Y., Bankier,A.T.,
            Carroll,F., Durham,A.M., Peters,N., Loo,S.S., Mat Isa,M.N.,
            Novaes,J., Quail,M., Rosli,R., Nor Shamsudin,M., Sobreira,T.J.,
            Tivey,A.R., Wai,S.F., White,S., Wu,X., Kerhornou,A., Blake,D.,
            Mohamed,R., Shirley,M., Gruber,A., Berriman,M., Tomley,F.,
            Dear,P.H. and Wan,K.L.
  TITLE     Sequencing and analysis of chromosome 1 of Eimeria tenella
            reveals a unique segmental organization
  JOURNAL   Genome Res. 17 (3), 311-319 (2007)
   PUBMED   17284678
REFERENCE   2  (bases 1 to 1347714)
  CONSRTM   NCBI Genome Project
  TITLE     Direct Submission
  JOURNAL   Submitted (20-FEB-2008) National Center for Biotechnology
            Information, NIH, Bethesda, MD 20894, USA
REFERENCE   3  (bases 1 to 1347714)
  AUTHORS   Rajandream,M.A.
  TITLE     Direct Submission
  JOURNAL   Submitted (26-MAY-2006) Rajandream M.A., The Pathogen Sequencing
            Unit, The Wellcome Trust Sanger Institute, Wellcome Trust genome
            Campus, Hinxton, Cambridge CB10 1SA, UNITED KINGDOM
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. The reference sequence was derived from AM269894.
EMBL Header
ID   AM269894; SV 1; linear; genomic DNA; STD; INV; 1347714 BP.
XX
AC   AM269894;
XX
DT   16-JUN-2006 (Rel. 88, Created)
DT   23-OCT-2008 (Rel. 97, Last updated, Version 5)
XX
DE   Eimeria tenella chromosome 1, ordered contigs
XX
KW   .
XX
OS   Eimeria tenella
OC   Eukaryota; Alveolata; Apicomplexa; Coccidia; Eucoccidiorida; Eimeriorina;
OC   Eimeriidae; Eimeria.
XX
RN   [1]
RP   1-1347714
RA   Rajandream M.A.;
RT   ;
RL   Submitted (26-MAY-2006) to the EMBL/GenBank/DDBJ databases.
RL   Rajandream M.A., The Pathogen Sequencing Unit, The Wellcome Trust Sanger
RL   Institute, Wellcome Trust genome Campus, Hinxton, Cambridge CB10 1SA,
RL   UNITED KINGDOM.
XX
RN   [2]
RX   DOI; 10.1101/gr.5823007.
RX   PUBMED; 17284678.
RA   Ling K.H., Rajandream M.A., Rivailler P., Ivens A., Yap S.J.,
RA   Madeira A.M.B.N., Mungall K., Billington K., Yee W.Y., Bankier A.T.,
RA   Carroll F., Durham A.M., Peters N., Loo S.S., MatIsa M.N., Novaes J.,
RA   Quail M., Rosli R., Shamsudin M.N., Sobreira T.J.P., Tivey A.R., Wai S.F.,
RA   White S., Wu X., Kerhornou A.X., Blake D., Mohamed R., Shirley M.,
RA   Gruber A., Berriman M., Tomley F., Dear P.H., Wan K.L.;
RT   "Sequencing and analysis of chromosome 1 of Eimeria tenella reveals a
RT   unique segmental organization";
RL   Genome Res. 17(3):311-319(2007).
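Every EMBL header line starts with a two-letter line code (ID, AC, DE, OS, OC, ...), and XX lines act as spacers. A minimal sketch of how such a header could be grouped by line code follows. The function name group_embl_header is an illustrative assumption, not part of any library, and only a few lines of the header above are used as input.

```python
from collections import defaultdict


def group_embl_header(text: str) -> dict:
    """Group EMBL header lines by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        # The line code is everything before the first space.
        code, _, value = line.partition(" ")
        value = value.strip()
        if code == "XX" or not value:  # XX lines are empty spacers
            continue
        fields[code].append(value)
    return dict(fields)


# A shortened copy of the header shown on the slide.
header = """ID   AM269894; SV 1; linear; genomic DNA; STD; INV; 1347714 BP.
XX
AC   AM269894;
XX
DE   Eimeria tenella chromosome 1, ordered contigs
XX
OS   Eimeria tenella
OC   Eukaryota; Alveolata; Apicomplexa; Coccidia; Eucoccidiorida; Eimeriorina;
OC   Eimeriidae; Eimeria."""

fields = group_embl_header(header)
print(fields["AC"])             # accession line
print(" ".join(fields["OC"]))   # taxonomy, joined across its two lines
```

For real records, an established parser (for example the one in Biopython) is a safer choice than a hand-written one.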
Feature
• Region of DNA that was annotated with a key/qualifier
• Keys: CDS, intron, miscellaneous, etc.
• Qualifier: notes or extra information about a feature, e.g. exon (key) /gene="adh" (qualifier)
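The key/qualifier idea can be sketched as a small data structure: one key, one location, and any number of qualifiers. The class name Feature, its render method and the coordinates 100..250 are hypothetical and chosen only for illustration. The output layout is simplified and is not the exact column layout of the feature table.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """One annotated region: a key, a location and optional qualifiers."""
    key: str                      # e.g. "exon", "CDS", "intron"
    start: int                    # 1-based, inclusive
    end: int                      # 1-based, inclusive
    qualifiers: dict = field(default_factory=dict)

    def render(self) -> str:
        """Return the feature in a simplified feature-table-like layout."""
        lines = [f"{self.key:<16}{self.start}..{self.end}"]
        for name, value in self.qualifiers.items():
            lines.append(f'{"":<16}/{name}="{value}"')
        return "\n".join(lines)


# Hypothetical exon carrying the /gene qualifier from the slide.
exon = Feature("exon", 100, 250, {"gene": "adh"})
print(exon.render())
```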
Feature keys
attenuator, C_region, CAAT_signal, CDS, conflict, D-loop, D_segment, enhancer, exon, GC_signal, gene, iDNA, intron, J_segment, LTR, mat_peptide, misc_binding,
misc_difference, misc_feature, misc_recomb, misc_RNA, misc_signal, misc_structure, modified_base, mRNA, N_region, old_sequence, polyA_signal, polyA_site, precursor_RNA, prim_transcript,
primer_bind, promoter, protein_bind, RBS, repeat_region, repeat_unit, rep_origin, rRNA, S_region, satellite, scRNA, sig_peptide, snRNA, snoRNA, source, stem_loop, STS, TATA_signal, terminator,
transit_peptide, tRNA, unsure, V_region, V_segment, variation, 3'clip, 3'UTR, 5'clip, 5'UTR, -10_signal, -35_signal
Feature qualifier
Additional information about a feature
/allele="text"
/citation=[number]
/codon=(seq:"text",aa:<amino_acid>)
/codon_start=<1
/db_xref="<database>:<identifier>"
/EC_number="text"
/evidence=<evidence_value>
/exception="text"
/function="text"
/gene="text"
/label=feature_label
/map="text"
/note="text"
/number=unquoted
/product="text"
/protein_id="<identifier>"
/pseudo
/standard_name="text"
/translation="text"
/transl_except=(pos:<base_range>,aa:<amino_acid>)
/transl_table
/usedin=accnum:feature_label
Features (NCBI)
FH   Key             Location/Qualifiers
FH
FT   source          1..1347714
FT                   /organism="Eimeria tenella"
FT                   /chromosome="1"
FT                   /strain="Houghton"
FT                   /mol_type="genomic DNA"
FT                   /db_xref="taxon:5802"
FT   misc_feature    1..168039
FT                   /note="Contig01.7197"
FT   repeat_region   1..79
FT                   /rpt_type=TANDEM
FT                   /rpt_unit_seq="AACCCTA(11.3)"
FT                   /note="TRF parameters 2 1000 1000 80 10 25 1000"
FT                   /note="TRF score is 158"
FT                   /inference="ab initio prediction:Tandem Repeats Finder:4.0
FT                   0"
FT   repeat_region   78..138
FT                   /rpt_type=TANDEM
FT                   /rpt_unit_seq="AAACCCT(8.7)"
FT                   /note="TRF parameters 2 1000 1000 80 10 25 1000"
FT                   /note="TRF score is 122"
FT                   /inference="ab initio prediction:Tandem Repeats Finder:4.0
FT                   0"
FT   CDS             1424..1717
FT                   /locus_tag="eimer1623h09.tmp0001"
FT                   /product="hypothetical protein"
FT                   /note="no EST match to serve as supporting evidence for
FT                   this feature"
FT                   /db_xref="UniProtKB/TrEMBL:C8TDJ1"
FT                   /inference="protein motif:Seg:1999"
FT                   /protein_id="CAK51327.1"
FT                   /translation="MSSKLIVLTGTHCRGTDSTRSLKSNVCSAGQQAASTSSSTTQAYF
FT                   VQSAHVEIERHMCLAAFEAPFSTSPHGQASSLRLPQQRLAAYSKRRPWGNKN"
CDS features
• CDS stands for coding sequence and is used to denote genes and pseudogenes.
• These features are automatically translated on submission and the protein added to the protein databases.
/note
• The note field contains all the evidence for a gene call... plus anything else.
• Similarity (fasta or blast)
• Domain/motif information (Pfam, TMHMM, etc.)
• Unusual features (repeats, aa richness)
/product
• The name of the gene product, e.g. Alcohol dehydrogenase
• Unless there is proof we must qualify...
  • Putative
  • Possible
• Always be conservative! e.g. Putative dehydrogenase, Dehydrogenase-like protein
• This is the only piece of annotation added to the protein databases.
Naming protocols
• Hypothetical protein: unknown function and no homology
• Conserved hypothetical protein: unknown function WITH homology
• Alcohol dehydrogenase-like: looks a bit like it, but may not be.
• Putative alcohol dehydrogenase: probably an alcohol dehydrogenase
• Alcohol dehydrogenase: this has previously been characterised and shown to be alcohol dehydrogenase in this organism.
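The naming protocol above is essentially a decision rule, and it can be sketched as a small function. The function name product_name, its parameters and the confidence labels "weak", "probable" and "characterised" are assumptions made for this illustration. They are not terms defined by any database.

```python
def product_name(function=None, has_homology=False, confidence="weak"):
    """Pick a conservative /product value following the naming protocol.

    function     -- candidate function name, or None if unknown
    has_homology -- True if similar sequences exist in other organisms
    confidence   -- "weak", "probable" or "characterised" (hypothetical labels)
    """
    if function is None:
        if has_homology:
            return "conserved hypothetical protein"
        return "hypothetical protein"
    if confidence == "characterised":
        # Experimentally shown in this organism: no qualifier needed.
        return function
    if confidence == "probable":
        return f"putative {function}"
    # Weak similarity only: looks a bit like it, but may not be.
    return f"{function}-like protein"


print(product_name())
print(product_name(has_homology=True))
print(product_name("alcohol dehydrogenase"))
print(product_name("alcohol dehydrogenase", confidence="probable"))
print(product_name("alcohol dehydrogenase", confidence="characterised"))
```

The default is deliberately the most cautious name, in line with "always be conservative".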
/gene
• The gene name, e.g. ADH1
• Only transfer a gene name if it is meaningful
• Never transfer a gene name like Et0034a.
• Is it a gene family? Make sure two genes have the same name.
Transitive Annotation
• AKA annotation catastrophe
• Junk in = Junk out
• Misannotations spread through incorrect database submissions.
How can we standardize the annotation terms?
Through a dynamic controlled vocabulary
So what does that mean?
From a practical view, an ontology is the representation of something we know about. "Ontologies" consist of a representation of things that are detectable or directly observable, and the relationships between those things.
Ontology Structure
[Figure: graph with "cell" at the top, "membrane" and "chloroplast" below it, and "mitochondrial membrane" and "chloroplast membrane" at the bottom]
Directed Acyclic Graph (DAG): multiple parentage allowed
GO topology
• The ontologies are structured as directed acyclic graphs.
• They are similar to hierarchies but differ in that a more specialized term (child) can be related to more than one less specialized term (parent).
• For example, hexose biosynthetic process has two parents, hexose metabolic process and monosaccharide biosynthetic process.
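A DAG with multiple parentage can be sketched as a dictionary that maps each term to a list of parents. The two parents of hexose biosynthetic process come from the example above. The shared grandparent, monosaccharide metabolic process, is added here as an illustrative assumption so that the two paths converge. The helper name ancestors is also hypothetical.

```python
# child term -> list of parent terms (a term may have more than one parent)
PARENTS = {
    "hexose biosynthetic process": [
        "hexose metabolic process",
        "monosaccharide biosynthetic process",
    ],
    # Assumed for illustration: both parents share one grandparent.
    "hexose metabolic process": ["monosaccharide metabolic process"],
    "monosaccharide biosynthetic process": ["monosaccharide metabolic process"],
    "monosaccharide metabolic process": [],
}


def ancestors(term, parents):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


for name in sorted(ancestors("hexose biosynthetic process", PARENTS)):
    print(name)
```

The seen set matters in a DAG: a shared ancestor is reached along several paths but should be reported once.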
True Path Violations Create Incorrect Definitions
"The pathway from a child term all the way up to its top-level parent(s) must always be true."
• Start with: chromosome part_of nucleus.
• Add: mitochondrial chromosome is_a chromosome.
• Following the path upward now implies that a mitochondrial chromosome is part_of a nucleus. A mitochondrial chromosome is not part of a nucleus!
• Corrected structure: nuclear chromosome is_a chromosome and part_of nucleus; mitochondrial chromosome is_a chromosome and part_of mitochondrion.
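The true path rule can be checked mechanically: collect every part_of statement a term inherits through its is_a ancestors and see whether all of them are still true for that term. The sketch below uses the chromosome example. The function name inherited_part_of is hypothetical, and for brevity the sketch follows only is_a links upward, not chains of part_of.

```python
def inherited_part_of(term, is_a, part_of):
    """Collect every 'whole' a term is part_of, directly or via is_a ancestors."""
    wholes, seen, stack = set(), set(), [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        wholes.update(part_of.get(current, []))
        stack.extend(is_a.get(current, []))
    return wholes


# Violating structure: part_of nucleus is attached to the general term.
is_a_bad = {"mitochondrial chromosome": ["chromosome"]}
part_of_bad = {"chromosome": ["nucleus"]}
print(inherited_part_of("mitochondrial chromosome", is_a_bad, part_of_bad))

# Repaired structure: part_of is attached to the specialised terms.
is_a_ok = {
    "nuclear chromosome": ["chromosome"],
    "mitochondrial chromosome": ["chromosome"],
}
part_of_ok = {
    "nuclear chromosome": ["nucleus"],
    "mitochondrial chromosome": ["mitochondrion"],
}
print(inherited_part_of("mitochondrial chromosome", is_a_ok, part_of_ok))
```

In the first structure the mitochondrial chromosome inherits "nucleus", which is false. In the second it inherits only "mitochondrion".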
GO Definitions: Each GO term has 2 definitions
• A definition written by a biologist: necessary and sufficient conditions; a written definition (not computable)
• Graph structure: necessary conditions; formal (computable)
Term-term relationship
• is_a
• The is_a relationship is a simple class-subclass relationship, where A is_a B means that A is a subclass of B.
• For example, nuclear chromosome is_a chromosome.
GO:0043232 : intracellular non-membrane-bound organelle
  GO:0005694 : chromosome
    GO:0000228 : nuclear chromosome
Term-term relationship
• part_of
• C part_of D means that whenever C is present, it is always a part of D, but C does not always have to be present.
• For example, periplasmic flagellum part_of periplasmic space.
GO:0044464 : cell part
  GO:0042995 : cell projection
    GO:0019861 : flagellum
      GO:0009288 : flagellin-based flagellum
        GO:0055040 : periplasmic flagellum
  GO:0042597 : periplasmic space
    GO:0055040 : periplasmic flagellum
Current Ontologies
• Molecular function: tasks performed by gene product
• Biological process: broad biological goals accomplished by ordered assemblies of molecular functions
• Cellular component: subcellular structures, locations and macromolecular complexes
Search result for toxin
Relationships in GO
•“is-a”
•“part of”
GO paths to terms
GO definitions
Pyruvate dehydrogenase
Why the interest in GO?
● Universal ontology
● Functional classification scheme with many different levels in a DAG
● Widespread interest from the scientific community
● Mappings to SP keywords and gene product annotation are already available for some organisms
GO Evidence codes
• Experimental Evidence Codes
  • EXP: Inferred from Experiment
  • IDA: Inferred from Direct Assay
  • IPI: Inferred from Physical Interaction
  • IMP: Inferred from Mutant Phenotype
  • IGI: Inferred from Genetic Interaction
  • IEP: Inferred from Expression Pattern
• Computational Analysis Evidence Codes
  • ISS: Inferred from Sequence or Structural Similarity
  • ISO: Inferred from Sequence Orthology
  • ISA: Inferred from Sequence Alignment
  • ISM: Inferred from Sequence Model
  • IGC: Inferred from Genomic Context
  • RCA: Inferred from Reviewed Computational Analysis
• Author Statement Evidence Codes
  • TAS: Traceable Author Statement
  • NAS: Non-traceable Author Statement
• Curator Statement Evidence Codes
  • IC: Inferred by Curator
  • ND: No biological Data available
• Automatically assigned Evidence Codes
  • IEA: Inferred from Electronic Annotation
• Obsolete Evidence Codes
  • NR: Not Recorded
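Evidence codes are often used to filter annotations, for instance to keep only those backed by experiments. The sketch below groups the codes listed above and applies such a filter. The group labels, the function name evidence_group and the three annotation tuples are hypothetical examples, not real annotation data.

```python
# Evidence codes grouped as on the slide.
EVIDENCE_GROUPS = {
    "experimental": {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"},
    "computational": {"ISS", "ISO", "ISA", "ISM", "IGC", "RCA"},
    "author statement": {"TAS", "NAS"},
    "curator statement": {"IC", "ND"},
    "automatic": {"IEA"},
    "obsolete": {"NR"},
}


def evidence_group(code):
    """Return the group name for an evidence code, or None if unknown."""
    for group, codes in EVIDENCE_GROUPS.items():
        if code in codes:
            return group
    return None


# Hypothetical annotations: (gene product, GO term, evidence code).
annotations = [
    ("geneA", "GO:0005694", "IDA"),
    ("geneB", "GO:0005694", "IEA"),
    ("geneC", "GO:0000228", "ISS"),
]

experimental = [a for a in annotations if evidence_group(a[2]) == "experimental"]
print(experimental)
```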
Current Mappings to GO
• Consortium mappings: MGD, SGD, FlyBase
• Swiss-Prot keywords
• EC numbers
• InterPro entries
• Medline ID
• Commercial companies: CompuGen, Proteome
InterPro to GO
EC number to GO
SP keyword to GO
GO Slims
• GO slims are cut-down versions of the GO ontologies containing a subset of the terms in the whole GO.
• They give a broad overview of the ontology content without the detail of the specific fine-grained terms.
• GO slims are particularly useful for giving a summary of the results of GO annotation of a genome, microarray, or cDNA collection when broad classification of gene product function is required.
• GO slims are created by users according to their needs, and may be specific to species or to particular areas of the ontologies.
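Mapping annotations onto a GO slim amounts to walking up from each specific term until a slim term is reached. The sketch below does this for the chromosome terms used earlier. The one-term slim, the gene names and the function name map_to_slim are hypothetical, and dedicated slimming tools handle more cases than this.

```python
from collections import Counter

# child -> parents, using the is_a chain shown earlier in the deck
PARENTS = {
    "GO:0000228": ["GO:0005694"],   # nuclear chromosome -> chromosome
    "GO:0005694": ["GO:0043232"],   # chromosome -> non-membrane-bound organelle
    "GO:0043232": [],
}

SLIM = {"GO:0043232"}  # hypothetical one-term slim


def map_to_slim(term, parents, slim):
    """Return the nearest slim terms at or above the given term."""
    hits, seen, stack = set(), set(), [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in slim:
            hits.add(current)          # stop climbing along this path
        else:
            stack.extend(parents.get(current, []))
    return hits


# Hypothetical gene -> GO term annotations, summarised per slim term.
gene_terms = {"geneA": "GO:0000228", "geneB": "GO:0005694"}
summary = Counter()
for gene, term in gene_terms.items():
    summary.update(map_to_slim(term, PARENTS, SLIM))
print(summary)
```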
Sequence Ontology
• How to edit an ontology file?
• OBO-Edit – an ontology editor for biologists
• OBO-Edit compliant format
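Ontology files in OBO format are plain text made of [Term] stanzas with tag: value lines such as id, name and is_a. A minimal sketch of reading those three tags follows. The function name parse_obo_terms is hypothetical, the two-stanza sample was written for this illustration, and real files contain many more tags that this sketch ignores.

```python
def parse_obo_terms(text):
    """Read id, name and is_a tags from the [Term] stanzas of OBO text."""
    terms, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are kept.
            current = {"is_a": []} if line == "[Term]" else None
            if current is not None:
                terms.append(current)
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            value = value.split(" ! ")[0]   # drop the trailing "! comment"
            if tag == "is_a":
                current["is_a"].append(value)
            elif tag in ("id", "name"):
                current[tag] = value
    return terms


sample = """[Term]
id: GO:0005694
name: chromosome

[Term]
id: GO:0000228
name: nuclear chromosome
is_a: GO:0005694 ! chromosome
"""

for term in parse_obo_terms(sample):
    print(term["id"], term["name"], term["is_a"])
```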
GO doesn’t cover…
• Gene products: e.g. cytochrome c is not in the ontologies, but attributes of cytochrome c, such as oxidoreductase activity, are.
• Processes, functions or components that are unique to mutants or diseases: e.g. oncogenesis is not a valid GO term because causing cancer is not the normal function of any gene.
• Attributes of sequence such as intron/exon parameters: these are not attributes of gene products and will be described in a separate sequence ontology (see Sequence Ontology).
• Protein domains or structural features.
• Protein-protein interactions.
• Environment, evolution and expression.
• Anatomical or histological features above the level of cellular components, including cell types.
Sequence Ontology
• The four major aspects of the complete Sequence Ontology are:
  • located sequence features, for objects that can be located on sequence in coordinates,
  • sequence attributes, for describing the properties of features,
  • consequences of mutation, for the annotation of the effects of a mutation,
  • chromosome variation, to describe large-scale variations.
Generic feature format 3
• Generic format for sequence annotation interchange
• Tab-delimited text file
• Represents features in a hierarchical view
• Uses a controlled vocabulary that complies with the Sequence Ontology
• The tab-delimited file has 9 columns:
  • Column 1: "seqid"
  • Column 2: "source"
  • Column 3: "type"
  • Columns 4 & 5: "start" and "end"
  • Column 6: "score"
  • Column 7: "strand". The strand of the feature: + for positive strand (relative to the landmark), - for minus strand
  • Column 8: "phase"
  • Column 9: "attributes"
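The nine-column layout can be sketched as a small line parser. The function name parse_gff3_line is hypothetical. The sketch splits on tabs, converts the coordinates to integers and breaks column 9 into tag=value pairs, with comma-separated values kept as lists. Directive lines starting with ## and other parts of the format are not handled.

```python
from urllib.parse import unquote

COLUMNS = ["seqid", "source", "type", "start", "end",
           "score", "strand", "phase", "attributes"]


def parse_gff3_line(line):
    """Split one GFF3 feature line into a dictionary keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    if record["attributes"] != ".":          # "." means no attributes
        for pair in record["attributes"].split(";"):
            if pair:
                tag, _, value = pair.partition("=")
                # Values may be comma-separated lists and URL-escaped.
                attributes[tag] = [unquote(v) for v in value.split(",")]
    record["attributes"] = attributes
    return record


line = "ctg123\t.\texon\t1050\t1500\t.\t+\t.\tID=exon00002;Parent=mRNA00001,mRNA00002"
record = parse_gff3_line(line)
print(record["type"], record["start"], record["end"], record["strand"])
print(record["attributes"]["Parent"])
```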
How to annotate these splicing variants using Sequence Ontology terms and GFF3?
• The annotated genome region is named "ctg123"
• A gene named EDEN extends from coordinates 1000 to 9000
• The gene encodes three alternatively spliced variants: EDEN.1, EDEN.2 and EDEN.3
• Transcript EDEN.3 has two alternative translation start points
• There is a transcription factor binding site (a promoter) located 50 bp upstream of the transcriptional start site of EDEN.1
##gff-version 3
##sequence-region ctg123 1 1497228
ctg123 . gene            1000 9000 . + . ID=gene00001;Name=EDEN
ctg123 . TF_binding_site 1000 1012 . + . ID=tfbs00001;Parent=gene00001
ctg123 . mRNA            1050 9000 . + . ID=mRNA00001;Parent=gene00001;Name=EDEN.1
ctg123 . mRNA            1050 9000 . + . ID=mRNA00002;Parent=gene00001;Name=EDEN.2
ctg123 . mRNA            1300 9000 . + . ID=mRNA00003;Parent=gene00001;Name=EDEN.3
ctg123 . exon            1300 1500 . + . ID=exon00001;Parent=mRNA00003
ctg123 . exon            1050 1500 . + . ID=exon00002;Parent=mRNA00001,mRNA00002
ctg123 . exon            3000 3902 . + . ID=exon00003;Parent=mRNA00001,mRNA00003
ctg123 . exon            5000 5500 . + . ID=exon00004;Parent=mRNA00001,mRNA00002,mRNA00003
ctg123 . exon            7000 9000 . + . ID=exon00005;Parent=mRNA00001,mRNA00002,mRNA00003
ctg123 . CDS             1201 1500 . + 0 ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123 . CDS             3000 3902 . + 0 ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123 . CDS             5000 5500 . + 0 ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123 . CDS             7000 7600 . + 0 ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123 . CDS             1201 1500 . + 0 ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
ctg123 . CDS             5000 5500 . + 0 ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
ctg123 . CDS             7000 7600 . + 0 ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
ctg123 . CDS             3301 3902 . + 0 ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
ctg123 . CDS             5000 5500 . + 1 ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
ctg123 . CDS             7000 7600 . + 1 ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
ctg123 . CDS             3391 3902 . + 0 ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
ctg123 . CDS             5000 5500 . + 1 ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
ctg123 . CDS             7000 7600 . + 1 ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
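Column 8 (phase) of a CDS line gives the number of bases to skip at the start of the segment to reach the first base of the next codon. For a multi-segment CDS it can be derived from the lengths of the preceding segments. The sketch below does this for plus-strand segments listed 5' to 3', assuming the first segment begins on a codon boundary. The function name cds_phases is hypothetical, and the coordinates are those of cds00001 and cds00003.

```python
def cds_phases(segments):
    """Compute the phase of each CDS segment.

    segments -- list of (start, end) pairs, 1-based inclusive, plus strand,
                ordered 5' to 3'; the first segment is assumed to start a codon.
    """
    phases, phase = [], 0
    for start, end in segments:
        phases.append(phase)
        length = end - start + 1
        # Bases left over after whole codons determine how many bases the
        # next segment must contribute to finish the split codon.
        phase = (3 - (length - phase) % 3) % 3
    return phases


# cds00001: every segment length is a multiple of 3.
print(cds_phases([(1201, 1500), (3000, 3902), (5000, 5500), (7000, 7600)]))

# cds00003: the first segment is 602 bases long, so a codon is split.
print(cds_phases([(3301, 3902), (5000, 5500), (7000, 7600)]))
```

The first call prints [0, 0, 0, 0] and the second prints [0, 1, 1].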
• If you write a GFF3 file, you can test it! There is an online validator:
http://dev.wormbase.org/db/validate_gff3/validate_gff3_online
Testing the GFF3 Validator
Let’s change the feature names
Annotation viewing and editing: Artemis
• Artemis is a free genome viewer and annotation tool developed by Kim Rutherford (Sanger Institute, UK).
• It allows for visualization of sequence features and results of analyses, in the context of the sequence and its six-frame translation.
• Artemis is written in Java, and is available for UNIX, GNU/Linux, BSD, Macintosh and MS Windows systems.
• It can read complete EMBL and GENBANK database entries or sequence in FASTA or raw format. Extra sequence features can be in EMBL, GENBANK or GFF format.
AG-FMVZ-USP