Genome Annotation, Gene Ontology, Sequence Ontology

Arthur Gruber

Instituto de Ciências Biomédicas, Universidade de São Paulo

AG-ICB-USP

Sequence annotation

• Annotation is the process of adding information to a DNA sequence.

• The information is usually tied to DNA coordinates.

• Features could be repeats, genes, promoters, protein domains…

• Features can be linked to other databases, e.g. Pfam or PubMed.

Public databases

• GenBank, EMBL and DDBJ.

• All three databases update each other automatically.

The Feature Definition

• Format definition

• Covers DDBJ/EMBL/GenBank 

• Defines all accepted annotation terms and hierarchy 

Annotation file

Contains:
• A header with:
  • Information about the sequence
  • Organism
  • Authors
  • References
  • Comments
• A feature table containing:
  • Sequence features and coordinates

NCBI Header

LOCUS       NC_008685            1347714 bp    DNA     linear   INV 26-FEB-2009
DEFINITION  Eimeria tenella str. Houghton chromosome 1, ordered contigs.
ACCESSION   NC_008685
VERSION     NC_008685.1  GI:153816670
DBLINK      Project:18295
KEYWORDS    .
SOURCE      Eimeria tenella str. Houghton
  ORGANISM  Eimeria tenella str. Houghton
            Eukaryota; Alveolata; Apicomplexa; Coccidia; Eucoccidiorida;
            Eimeriorina; Eimeriidae; Eimeria.
REFERENCE   1
  AUTHORS   Ling,K.H., Rajandream,M.A., Rivailler,P., Ivens,A., Yap,S.J.,
            Madeira,A.M., Mungall,K., Billington,K., Yee,W.Y., Bankier,A.T.,
            Carroll,F., Durham,A.M., Peters,N., Loo,S.S., Mat Isa,M.N.,
            Novaes,J., Quail,M., Rosli,R., Nor Shamsudin,M., Sobreira,T.J.,
            Tivey,A.R., Wai,S.F., White,S., Wu,X., Kerhornou,A., Blake,D.,
            Mohamed,R., Shirley,M., Gruber,A., Berriman,M., Tomley,F.,
            Dear,P.H. and Wan,K.L.
  TITLE     Sequencing and analysis of chromosome 1 of Eimeria tenella reveals
            a unique segmental organization
  JOURNAL   Genome Res. 17 (3), 311-319 (2007)
   PUBMED   17284678
REFERENCE   2  (bases 1 to 1347714)
  CONSRTM   NCBI Genome Project
  TITLE     Direct Submission
  JOURNAL   Submitted (20-FEB-2008) National Center for Biotechnology
            Information, NIH, Bethesda, MD 20894, USA
REFERENCE   3  (bases 1 to 1347714)
  AUTHORS   Rajandream,M.A.
  TITLE     Direct Submission
  JOURNAL   Submitted (26-MAY-2006) Rajandream M.A., The Pathogen Sequencing
            Unit, The Wellcome Trust Sanger Institute, Wellcome Trust genome
            Campus, Hinxton, Cambridge CB10 1SA, UNITED KINGDOM
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. The reference sequence was derived from AM269894.

EMBL Header

ID   AM269894; SV 1; linear; genomic DNA; STD; INV; 1347714 BP.
XX
AC   AM269894;
XX
DT   16-JUN-2006 (Rel. 88, Created)
DT   23-OCT-2008 (Rel. 97, Last updated, Version 5)
XX
DE   Eimeria tenella chromosome 1, ordered contigs
XX
KW   .
XX
OS   Eimeria tenella
OC   Eukaryota; Alveolata; Apicomplexa; Coccidia; Eucoccidiorida; Eimeriorina;
OC   Eimeriidae; Eimeria.
XX
RN   [1]
RP   1-1347714
RA   Rajandream M.A.;
RT   ;
RL   Submitted (26-MAY-2006) to the EMBL/GenBank/DDBJ databases.
RL   Rajandream M.A., The Pathogen Sequencing Unit, The Wellcome Trust Sanger
RL   Institute, Wellcome Trust genome Campus, Hinxton, Cambridge CB10 1SA,
RL   UNITED KINGDOM.
XX
RN   [2]
RX   DOI; 10.1101/gr.5823007.
RX   PUBMED; 17284678.
RA   Ling K.H., Rajandream M.A., Rivailler P., Ivens A., Yap S.J.,
RA   Madeira A.M.B.N., Mungall K., Billington K., Yee W.Y., Bankier A.T.,
RA   Carroll F., Durham A.M., Peters N., Loo S.S., Mat-Isa M.N., Novaes J.,
RA   Quail M., Rosli R., Shamsudin M.N., Sobreira T.J.P., Tivey A.R., Wai S.F.,
RA   White S., Wu X., Kerhornou A.X., Blake D., Mohamed R., Shirley M.,
RA   Gruber A., Berriman M., Tomley F., Dear P.H., Wan K.L.;
RT   "Sequencing and analysis of chromosome 1 of Eimeria tenella reveals a
RT   unique segmental organization";
RL   Genome Res. 17(3):311-319(2007).

Feature

• A region of DNA that has been annotated with a key and qualifiers
• Keys: CDS, intron, miscellaneous (misc_feature), etc.
• Qualifier: notes or extra information about a feature
  e.g. exon (key) /gene="adh" (qualifier)
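
As a rough illustration of the key/qualifier idea, a feature can be modelled as a key, a pair of coordinates and a set of qualifiers. The sketch below is Python; the class name `Feature`, its fields and the coordinates 100..250 are assumptions made up for this example and do not come from any database toolkit.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """One annotated region: a feature key, coordinates and qualifiers."""
    key: str                      # e.g. "exon", "CDS", "intron"
    start: int                    # 1-based, inclusive
    end: int                      # 1-based, inclusive
    qualifiers: dict = field(default_factory=dict)  # name -> list of values

    def to_lines(self):
        """Render the feature roughly the way a feature table shows it."""
        lines = [f"{self.key:<16}{self.start}..{self.end}"]
        for name, values in self.qualifiers.items():
            for value in values:
                lines.append(" " * 16 + f'/{name}="{value}"')
        return lines


# Hypothetical exon carrying the /gene="adh" qualifier from the slide.
exon = Feature("exon", 100, 250, {"gene": ["adh"]})
print("\n".join(exon.to_lines()))
```

Keeping qualifier values in lists allows a qualifier such as /note to appear more than once on the same feature.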

Feature keys

attenuator, C_region, CAAT_signal, CDS, conflict, D-loop, D_segment, enhancer, exon, GC_signal, gene, iDNA, intron, J_segment, LTR, mat_peptide, misc_binding,

misc_difference, misc_feature, misc_recomb, misc_RNA, misc_signal, misc_structure, modified_base, mRNA, N_region, old_sequence, polyA_signal, polyA_site, precursor_RNA, prim_transcript,

primer_bind, promoter, protein_bind, RBS, repeat_region, repeat_unit, rep_origin, rRNA, S_region, satellite, scRNA, sig_peptide, snRNA, snoRNA, source, stem_loop, STS, TATA_signal, terminator,

transit_peptide, tRNA, unsure, V_region, V_segment, variation, 3'clip, 3'UTR, 5'clip, 5'UTR, -10_signal, -35_signal

Feature qualifier

Additional information about a feature

/allele="text"
/citation=[number]
/codon=(seq:"text",aa:<amino_acid>)
/codon_start=<1
/db_xref="<database>:<identifier>"
/EC_number="text"
/evidence=<evidence_value>
/exception="text"
/function="text"
/gene="text"
/label=feature_label
/map="text"
/note="text"
/number=unquoted
/product="text"
/protein_id="<identifier>"
/pseudo
/standard_name="text"
/translation="text"
/transl_except=(pos:<base_range>,aa:<amino_acid>)
/transl_table
/usedin=accnum:feature_label

Features (EMBL)

FH   Key             Location/Qualifiers
FH
FT   source          1..1347714
FT                   /organism="Eimeria tenella"
FT                   /chromosome="1"
FT                   /strain="Houghton"
FT                   /mol_type="genomic DNA"
FT                   /db_xref="taxon:5802"
FT   misc_feature    1..168039
FT                   /note="Contig01.7197"
FT   repeat_region   1..79
FT                   /rpt_type=TANDEM
FT                   /rpt_unit_seq="AACCCTA(11.3)"
FT                   /note="TRF parameters 2 1000 1000 80 10 25 1000"
FT                   /note="TRF score is 158"
FT                   /inference="ab initio prediction:Tandem Repeats Finder:4.0
FT                   0"
FT   repeat_region   78..138
FT                   /rpt_type=TANDEM
FT                   /rpt_unit_seq="AAACCCT(8.7)"
FT                   /note="TRF parameters 2 1000 1000 80 10 25 1000"
FT                   /note="TRF score is 122"
FT                   /inference="ab initio prediction:Tandem Repeats Finder:4.0
FT                   0"
FT   CDS             1424..1717
FT                   /locus_tag="eimer1623h09.tmp0001"
FT                   /product="hypothetical protein"
FT                   /note="no EST match to serve as supporting evidence for
FT                   this feature"
FT                   /db_xref="UniProtKB/TrEMBL:C8TDJ1"
FT                   /inference="protein motif:Seg:1999"
FT                   /protein_id="CAK51327.1"
FT                   /translation="MSSKLIVLTGTHCRGTDSTRSLKSNVCSAGQQAASTSSSTTQAYF
FT                   VQSAHVEIERHMCLAAFEAPFSTSPHGQASSLRLPQQRLAAYSKRRPWGNKN"
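
A feature table like this one is regular enough to read with a few lines of code. The following is a minimal sketch in Python, assuming well-formed FT lines in which the feature key starts in column 6; the function name `parse_feature_table` and the dictionary layout are choices made for this example, not an existing library API.

```python
def parse_feature_table(lines):
    """Collect EMBL-style FT lines into a list of feature dictionaries."""
    features = []
    for line in lines:
        if not line.startswith("FT"):
            continue  # FH header lines and anything else are ignored
        if line[5:6].strip():
            # A feature key starts in column 6, followed by the location.
            key, location = line[5:].split(None, 1)
            features.append(
                {"key": key, "location": location.strip(), "qualifiers": []}
            )
            continue
        text = line[2:].strip()
        if not text or not features:
            continue
        current = features[-1]
        if text.startswith("/"):
            # New qualifier: /name=value (or a bare flag such as /pseudo).
            name, _, value = text[1:].partition("=")
            current["qualifiers"].append((name, value))
        elif current["qualifiers"]:
            # Continuation of the previous qualifier value.
            name, value = current["qualifiers"][-1]
            joiner = "" if name == "translation" else " "
            current["qualifiers"][-1] = (name, value + joiner + text)
        else:
            current["location"] += text  # wrapped location
    for feature in features:
        feature["qualifiers"] = [
            (name, value.strip('"')) for name, value in feature["qualifiers"]
        ]
    return features


sample = [
    "FH   Key             Location/Qualifiers",
    "FT   CDS             1424..1717",
    'FT                   /product="hypothetical protein"',
    'FT                   /note="no EST match to serve as supporting evidence for',
    'FT                   this feature"',
]
for feature in parse_feature_table(sample):
    print(feature["key"], feature["location"], feature["qualifiers"])
```

Qualifiers are kept as an ordered list of pairs rather than a dictionary because the same qualifier (for example /note) may occur several times in one feature.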

CDS features

• CDS stands for coding sequence and is used to denote genes and pseudogenes.

• These features are automatically translated on submission and the protein added to the protein databases.
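
The automatic translation step can be sketched in a few lines. This Python example assumes the standard genetic code (translation table 1) and a CDS on the forward strand; the function name `translate_cds` is hypothetical, and real submissions also honour qualifiers such as /transl_table and /transl_except, which this sketch ignores.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
# (third position varying fastest).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(dna, codon_start=1):
    """Translate a forward-strand CDS, stopping at the first stop codon."""
    dna = dna.upper()[codon_start - 1:]
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X for ambiguous codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Made-up 15-base sequence: ATG TCT TCT AAA TAA
print(translate_cds("ATGTCTTCTAAATAA"))
```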

/note

• The note field contains all the evidence for a gene call, plus anything else.
  • Similarity (FASTA or BLAST)
  • Domain/motif information (Pfam, TMHMM, etc.)
  • Unusual features (repeats, amino acid richness)

/product

• The name of the gene product, e.g. alcohol dehydrogenase

• Unless there is proof, we must qualify the name:
  • Putative
  • Possible

• Always be conservative! e.g. putative dehydrogenase, dehydrogenase-like protein

• This is the only piece of annotation added to the protein databases.

Naming protocols

• Hypothetical protein: unknown function and no homology

• Conserved hypothetical protein: unknown function WITH homology

• Alcohol dehydrogenase-like: looks a bit like it, but may not be

• Putative alcohol dehydrogenase: probably an alcohol dehydrogenase

• Alcohol dehydrogenase: previously characterised and shown to be alcohol dehydrogenase in this organism
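
The naming ladder above can be written down as a small decision function. This is only a sketch in Python: the evidence labels ("none", "homology_only", "weak_similarity", "strong_similarity", "characterised") and the function name `product_name` are invented for this example, and deciding which level applies to a gene remains the annotator's judgement.

```python
def product_name(evidence, name=None):
    """Return a conservative /product value for a given level of evidence."""
    if evidence == "none":
        return "hypothetical protein"
    if evidence == "homology_only":
        return "conserved hypothetical protein"
    if name is None:
        raise ValueError("a product name is needed for this evidence level")
    if evidence == "weak_similarity":
        return f"{name}-like protein"
    if evidence == "strong_similarity":
        return f"putative {name}"
    if evidence == "characterised":
        return name
    raise ValueError(f"unknown evidence level: {evidence}")


for level in ("none", "homology_only", "weak_similarity",
              "strong_similarity", "characterised"):
    print(level, "->", product_name(level, "alcohol dehydrogenase"))
```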

/gene 

• The gene name, e.g. ADH1

• Only transfer a gene name if it is meaningful

• Never transfer a gene name like Et0034a.

• Is it a gene family? Make sure two genes have the same name.

Transitive Annotation

• AKA annotation catastrophe

• Junk in = junk out

• Misannotations spread through incorrect database submissions.

How can we standardize the annotation terms?

Through a dynamic controlled vocabulary

So what does that mean?

From a practical view, an ontology is the representation of something we know about. "Ontologies" consist of a representation of things that are detectable or directly observable, and the relationships between those things.

Ontology Structure

[Diagram: cell at the top; below it, membrane and chloroplast; below them, mitochondrial membrane and chloroplast membrane]

Directed Acyclic Graph (DAG): multiple parentage allowed

GO topology

• The ontologies are structured as directed acyclic graphs

• They are similar to hierarchies but differ in that a more specialized term (child) can be related to more than one less specialized term (parent).

• For example, hexose biosynthetic process has two parents, hexose metabolic process and monosaccharide biosynthetic process.
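
Because a term may have several parents, collecting its ancestors means walking a graph rather than climbing a single chain. Below is a minimal Python sketch; the two parents of hexose biosynthetic process come from the slide, while the shared grandparent "monosaccharide metabolic process" and the function name `ancestors` are assumptions added to make the example complete.

```python
# Child term -> list of parent terms (a small, partly assumed slice of GO).
PARENTS = {
    "hexose biosynthetic process": [
        "hexose metabolic process",
        "monosaccharide biosynthetic process",
    ],
    "hexose metabolic process": ["monosaccharide metabolic process"],
    "monosaccharide biosynthetic process": ["monosaccharide metabolic process"],
    "monosaccharide metabolic process": [],
}


def ancestors(term, parents):
    """Return every term reachable by following parent links upward."""
    found = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current in found:
            continue  # already reached through another parent
        found.add(current)
        stack.extend(parents.get(current, ()))
    return found


print(sorted(ancestors("hexose biosynthetic process", PARENTS)))
```

The `found` set matters here: the grandparent is reachable through both parents, and without the check it would be visited twice.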

True Path Violations Create Incorrect Definitions

"...the pathway from a child term all the way up to its top-level parent(s) must always be true."

[Diagram: chromosome part_of nucleus]

True Path Violations

"...the pathway from a child term all the way up to its top-level parent(s) must always be true."

[Diagram: mitochondrial chromosome is_a chromosome]

True Path Violations

"...the pathway from a child term all the way up to its top-level parent(s) must always be true."

[Diagram: mitochondrial chromosome is_a chromosome; chromosome part_of nucleus]

A mitochondrial chromosome is not part of a nucleus!

True Path Violations

"...the pathway from a child term all the way up to its top-level parent(s) must always be true."

[Diagram: nuclear chromosome is_a chromosome and part_of nucleus; mitochondrial chromosome is_a chromosome and part_of mitochondrion]
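
The violation and its fix can be checked mechanically: a term inherits every part_of statement made about its is_a ancestors. The Python sketch below assumes this simple inheritance rule only (it does not chain part_of through further part_of links), and the function name `inferred_wholes` and the two toy graphs are made up for the example.

```python
def inferred_wholes(term, is_a, part_of):
    """Everything `term` is inferred to be part of: its own part_of parents
    plus those inherited from all of its is_a ancestors."""
    wholes, seen, stack = set(), set(), [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        wholes.update(part_of.get(current, ()))
        stack.extend(is_a.get(current, ()))
    return wholes


IS_A = {
    "nuclear chromosome": ["chromosome"],
    "mitochondrial chromosome": ["chromosome"],
}

# Faulty design: part_of is attached to the general term.
BAD_PART_OF = {"chromosome": ["nucleus"]}

# Repaired design: part_of is attached to the specific terms.
GOOD_PART_OF = {
    "nuclear chromosome": ["nucleus"],
    "mitochondrial chromosome": ["mitochondrion"],
}

print(inferred_wholes("mitochondrial chromosome", IS_A, BAD_PART_OF))
print(inferred_wholes("mitochondrial chromosome", IS_A, GOOD_PART_OF))
```

With the faulty graph the mitochondrial chromosome is inferred to be part of the nucleus, which is exactly the false path; with the repaired graph it is part of the mitochondrion only.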

GO Definitions: Each GO term has 2 Definitions

• A definition written by a biologist: necessary and sufficient conditions; written definition (not computable)

• Graph structure: necessary conditions; formal (computable)

Term-term relationship

• is_a
  • The is_a relationship is a simple class-subclass relationship, where A is_a B means that A is a subclass of B
  • For example, nuclear chromosome is_a chromosome.

GO:0043232 : intracellular non-membrane-bound organelle
    GO:0005694 : chromosome
        GO:0000228 : nuclear chromosome

Term-term relationship

• part_of
  • C part_of D means that whenever C is present, it is always a part of D, but C does not always have to be present
  • For example, periplasmic flagellum part_of periplasmic space

GO:0044464 : cell part
    GO:0042995 : cell projection
        GO:0019861 : flagellum
            GO:0009288 : flagellin-based flagellum
                GO:0055040 : periplasmic flagellum
    GO:0042597 : periplasmic space
        GO:0055040 : periplasmic flagellum

Current Ontologies

• Molecular function: tasks performed by gene product

• Biological process: broad biological goals accomplished by ordered assemblies of molecular functions

• Cellular component: subcellular structures, locations and macromolecular complexes

Search result for toxin

Relationships in GO

• "is_a"

• "part_of"

GO paths to terms

GO definitions

Pyruvate dehydrogenase

Why the interest in GO?
● Universal ontology
● Functional classification scheme with many different levels in a DAG
● Widespread interest from the scientific community
● Mappings to SP keywords and gene product annotations already exist for some organisms

GO Evidence codes

• Experimental Evidence Codes
  • EXP: Inferred from Experiment
  • IDA: Inferred from Direct Assay
  • IPI: Inferred from Physical Interaction
  • IMP: Inferred from Mutant Phenotype
  • IGI: Inferred from Genetic Interaction
  • IEP: Inferred from Expression Pattern

• Computational Analysis Evidence Codes
  • ISS: Inferred from Sequence or Structural Similarity
  • ISO: Inferred from Sequence Orthology
  • ISA: Inferred from Sequence Alignment
  • ISM: Inferred from Sequence Model
  • IGC: Inferred from Genomic Context
  • RCA: Inferred from Reviewed Computational Analysis

• Author Statement Evidence Codes
  • TAS: Traceable Author Statement
  • NAS: Non-traceable Author Statement

• Curator Statement Evidence Codes
  • IC: Inferred by Curator
  • ND: No biological Data available

• Automatically-assigned Evidence Codes
  • IEA: Inferred from Electronic Annotation

• Obsolete Evidence Codes
  • NR: Not Recorded
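
Evidence codes are mostly used to filter annotations by how much they can be trusted. The Python sketch below groups the codes listed above and keeps only annotations whose code falls in the chosen groups. The group labels, the function name `keep_by_evidence`, the gene names and the annotation tuples are all invented for this example, and placing ND with the curator statements is an assumption about how the slide is organised.

```python
EVIDENCE_CATEGORIES = {
    "experimental": ["EXP", "IDA", "IPI", "IMP", "IGI", "IEP"],
    "computational": ["ISS", "ISO", "ISA", "ISM", "IGC", "RCA"],
    "author": ["TAS", "NAS"],
    "curator": ["IC", "ND"],
    "automatic": ["IEA"],
    "obsolete": ["NR"],
}

# Reverse lookup: evidence code -> category label.
CATEGORY_OF = {
    code: category
    for category, codes in EVIDENCE_CATEGORIES.items()
    for code in codes
}


def keep_by_evidence(annotations, allowed=("experimental",)):
    """Keep (gene, go_id, evidence_code) tuples whose code is in `allowed`."""
    return [a for a in annotations if CATEGORY_OF.get(a[2]) in allowed]


# Hypothetical annotations (gene product, GO term, evidence code).
annotations = [
    ("geneA", "GO:0005694", "IDA"),
    ("geneB", "GO:0005694", "IEA"),
    ("geneC", "GO:0000228", "ISS"),
]
print(keep_by_evidence(annotations))
print(keep_by_evidence(annotations, allowed=("experimental", "computational")))
```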

Current Mappings to GO

• Consortium mappings: MGD, SGD, FlyBase

• Swiss-Prot keywords

• EC numbers

• InterPro entries

• Medline ID

• Commercial companies: CompuGen, Proteome

InterPro-to-GO

EC number-to-GO

SP keyword-to-GO

GO Slims

• GO slims are cut-down versions of the GO ontologies containing a subset of the terms in the whole GO.

• They give a broad overview of the ontology content without the detail of the specific fine-grained terms.

• GO slims are particularly useful for giving a summary of the results of GO annotation of a genome, microarray, or cDNA collection when broad classification of gene product function is required.

• GO slims are created by users according to their needs, and may be specific to species or to particular areas of the ontologies.
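
Summarising annotations with a slim amounts to replacing each fine-grained term by the nearest slim term(s) above it in the graph. The following Python sketch shows one way to do that. The toy graph, the slim set, the gene names and the function names `map_to_slim` and `slim_summary` are assumptions chosen for illustration; they are not an official GO slim or tool.

```python
from collections import Counter

# Child term -> parent terms (toy graph, assumed for this example).
PARENTS = {
    "hexose biosynthetic process": [
        "hexose metabolic process",
        "monosaccharide biosynthetic process",
    ],
    "hexose metabolic process": ["monosaccharide metabolic process"],
    "monosaccharide biosynthetic process": [
        "monosaccharide metabolic process",
        "carbohydrate biosynthetic process",
    ],
    "monosaccharide metabolic process": ["carbohydrate metabolic process"],
    "carbohydrate biosynthetic process": [
        "carbohydrate metabolic process",
        "biosynthetic process",
    ],
}
SLIM = {"carbohydrate metabolic process", "biosynthetic process"}


def map_to_slim(term, parents, slim_terms):
    """Return the nearest slim terms at or above `term`."""
    found, seen, stack = set(), set(), [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in slim_terms:
            found.add(current)  # stop climbing along this path
        else:
            stack.extend(parents.get(current, ()))
    return found


def slim_summary(annotations, parents, slim_terms):
    """Count annotated gene products per slim term."""
    counts = Counter()
    for _gene, term in annotations:
        for slim_term in map_to_slim(term, parents, slim_terms):
            counts[slim_term] += 1
    return counts


annotations = [
    ("geneA", "hexose biosynthetic process"),
    ("geneB", "hexose metabolic process"),
]
print(slim_summary(annotations, PARENTS, SLIM))
```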

Sequence Ontology

• How to edit an ontology file?
  • OBO-Edit – an ontology editor for biologists

• OBO-Edit compliant format

GO doesn't cover…

• Gene products: e.g. cytochrome c is not in the ontologies, but attributes of cytochrome c, such as oxidoreductase activity, are.

• Processes, functions or components that are unique to mutants or diseases: e.g. oncogenesis is not a valid GO term because causing cancer is not the normal function of any gene.

• Attributes of sequence such as intron/exon parameters: these are not attributes of gene products and will be described in a separate sequence ontology (see Sequence Ontology).

• Protein domains or structural features.

• Protein-protein interactions.

• Environment, evolution and expression.

• Anatomical or histological features above the level of cellular components, including cell types.

Sequence Ontology

• The four major aspects of the complete Sequence Ontology are:
  • located sequence features, for objects that can be located on sequence in coordinates,
  • sequence attributes, for describing the properties of features,
  • consequences of mutation, for the annotation of the effects of a mutation,
  • chromosome variation, to describe large-scale variations.

Generic feature format 3

• Generic format for sequence annotation interchange

• Tab-delimited text file

• Represents features in a hierarchical view

• Uses a controlled vocabulary and is compliant with the Sequence Ontology

Generic feature format 3

• The tab-delimited file has 9 columns:
  • Column 1: "seqid"
  • Column 2: "source"
  • Column 3: "type"
  • Columns 4 & 5: "start" and "end"
  • Column 6: "score"
  • Column 7: "strand"
    • The strand of the feature: + for the positive strand (relative to the landmark), - for the minus strand
  • Column 8: "phase"
  • Column 9: "attributes"

Generic feature format 3

• Column 1: "seqid"
• Column 2: "source"
• Column 3: "type"
• Columns 4 & 5: "start" and "end"
• Column 6: "score"
• Column 7: "strand"
• Column 8: "phase"
• Column 9: "attributes"
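
The nine columns map directly onto a small record type. Here is a minimal Python sketch of a line parser, assuming a single well-formed, tab-delimited feature line; the names `GffRecord` and `parse_gff3_line` are made up for this example. Attribute values are split on commas (as in multi-valued Parent attributes) and percent-decoded, since GFF3 escapes reserved characters that way.

```python
from dataclasses import dataclass
from urllib.parse import unquote


@dataclass
class GffRecord:
    seqid: str
    source: str
    type: str
    start: int        # 1-based, inclusive
    end: int          # 1-based, inclusive
    score: str        # "." when undefined
    strand: str       # "+", "-" or "."
    phase: str        # "0", "1", "2" or "."
    attributes: dict  # tag -> list of values


def parse_gff3_line(line):
    """Split one GFF3 feature line into its nine columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    attributes = {}
    for pair in cols[8].split(";"):
        if pair:
            tag, _, value = pair.partition("=")
            attributes[tag] = [unquote(v) for v in value.split(",")]
    return GffRecord(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                     cols[5], cols[6], cols[7], attributes)


line = "ctg123\t.\texon\t1050\t1500\t.\t+\t.\tID=exon00002;Parent=mRNA00001,mRNA00002"
record = parse_gff3_line(line)
print(record.type, record.start, record.end, record.attributes["Parent"])
```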

How to annotate these splicing variants using Sequence Ontology terms and the GFF3?

• The annotated genome region is named "ctg123"
• A gene named EDEN extends from coordinates 1000 to 9000
• The gene encodes three alternatively spliced variants: EDEN.1, EDEN.2 and EDEN.3
• Transcript EDEN.3 has two alternative translation start points
• There is a transcription factor binding site (a promoter) located 50 bp upstream of the transcriptional start site of EDEN.1

##gff-version 3
##sequence-region ctg123 1 1497228
ctg123 . gene            1000  9000  .  +  .  ID=gene00001;Name=EDEN
ctg123 . TF_binding_site 1000  1012  .  +  .  ID=tfbs00001;Parent=gene00001
ctg123 . mRNA            1050  9000  .  +  .  ID=mRNA00001;Parent=gene00001;Name=EDEN.1
ctg123 . mRNA            1050  9000  .  +  .  ID=mRNA00002;Parent=gene00001;Name=EDEN.2
ctg123 . mRNA            1300  9000  .  +  .  ID=mRNA00003;Parent=gene00001;Name=EDEN.3

ctg123 . exon            1300  1500  .  +  .  ID=exon00001;Parent=mRNA00003
ctg123 . exon            1050  1500  .  +  .  ID=exon00002;Parent=mRNA00001,mRNA00002
ctg123 . exon            3000  3902  .  +  .  ID=exon00003;Parent=mRNA00001,mRNA00003
ctg123 . exon            5000  5500  .  +  .  ID=exon00004;Parent=mRNA00001,mRNA00002,mRNA00003
ctg123 . exon            7000  9000  .  +  .  ID=exon00005;Parent=mRNA00001,mRNA00002,mRNA00003

ctg123 . CDS             1201  1500  .  +  0  ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123 . CDS             3000  3902  .  +  0  ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123 . CDS             5000  5500  .  +  0  ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123 . CDS             7000  7600  .  +  0  ID=cds00001;Parent=mRNA00001;Name=edenprotein.1

ctg123 . CDS             1201  1500  .  +  0  ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
ctg123 . CDS             5000  5500  .  +  0  ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
ctg123 . CDS             7000  7600  .  +  0  ID=cds00002;Parent=mRNA00002;Name=edenprotein.2

ctg123 . CDS             3301  3902  .  +  0  ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
ctg123 . CDS             5000  5500  .  +  1  ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
ctg123 . CDS             7000  7600  .  +  1  ID=cds00003;Parent=mRNA00003;Name=edenprotein.3

ctg123 . CDS             3391  3902  .  +  0  ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
ctg123 . CDS             5000  5500  .  +  1  ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
ctg123 . CDS             7000  7600  .  +  1  ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
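
Column 8 (phase) is the number of bases to skip at the start of a CDS segment to reach the first base of the next complete codon, so it can be derived from the segment lengths. The Python sketch below assumes that the first segment of each CDS begins with a complete codon (phase 0); the function name `cds_phases` is hypothetical.

```python
def cds_phases(segments, strand="+"):
    """Compute the phase of each CDS segment.

    `segments` is a list of (start, end) pairs, 1-based and inclusive,
    all belonging to the same CDS. Phases are returned in 5' to 3' order.
    """
    # On the minus strand the 5'-most segment has the highest coordinates.
    ordered = sorted(segments, reverse=(strand == "-"))
    phases = []
    phase = 0  # assumption: the CDS starts on a complete codon
    for start, end in ordered:
        phases.append(phase)
        length = end - start + 1
        # Bases left over after the last complete codon of this segment
        # must be completed at the start of the next segment.
        leftover = (length - phase) % 3
        phase = (3 - leftover) % 3
    return phases


# A CDS starting at 3301: the first segment is 602 bases long (200 codons
# plus 2 bases), so the next segment must skip 1 base.
print(cds_phases([(3301, 3902), (5000, 5500), (7000, 7600)]))
# A CDS whose segment lengths are all multiples of three stays in phase 0.
print(cds_phases([(1201, 1500), (3000, 3902), (5000, 5500), (7000, 7600)]))
```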

Generic feature format 3

• If you write a GFF file, you can test it! There is an online validator:

http://dev.wormbase.org/db/validate_gff3/validate_gff3_online

Testing the GFF3 Validator

Testing the GFF3 Validator

Let’s change the feature names

Annotation viewing and editing: Artemis

• Artemis is a free genome viewer and annotation tool developed by Kim Rutherford (Sanger Institute, UK).

• It allows visualization of sequence features and the results of analyses in the context of the sequence and its six-frame translation.

Annotation viewing and editing: Artemis

• Artemis is written in Java, and is available for UNIX, GNU/Linux, BSD, Macintosh and MS-Windows systems.

• It can read complete EMBL and GENBANK database entries or sequence in FASTA or raw format. Extra sequence features can be in EMBL, GENBANK or GFF format.

AG-FMVZ-USP
