
Page 1: Protein Annotation Ontology The BioSapiens Virtual Institute for Genome Annotations Janet Thornton & Gabby Reeves AFP/BioSapiens Vienna: July 07.

Protein Annotation Ontology

The BioSapiens Virtual Institute for Genome Annotations

Janet Thornton & Gabby Reeves

AFP/BioSapiens

Vienna: July 07


Outline

• Integrating annotations -- why it is so important to think about it.

• Progress made by BioSapiens towards the virtual institute for genome annotations.

• Creating the ontology

– ontology rules

– software (OBO)

• The Ontology – a brief outline


The European Virtual Institute for Genome Annotation

Funded by the European Commission


BioSapiens

Network of Excellence

26 partners in 14 different countries

The objective of the BIOSAPIENS Network of Excellence is to provide a large-scale, concerted effort to annotate genome data by laboratories distributed around Europe, using both informatics tools and input from experimentalists.


BIOSAPIENS

• Many tools have been developed for the annotation of proteins – many make similar predictions.

• These tools come from a number of different labs in different locations.


BioSapiens Genome Annotation

DNA Annotation

• Gene definition / alternative splicing
• Regulators and promoters
• Expression
• Variation (haplotypes and SNPs)

Proteome Annotation

• Protein families, orthologues
• Membrane proteins and ligands
• 3D protein structure
• Post-translational modification and localisation

Functional Annotation

• Sequence and structure to function
• Protein-protein complexes
• Pathways and networks

How can we provide an integrated view of this information for the biologist?


Functional Grouping of Annotations

• 69 sources from 19 partner sites, providing approximately 330 annotations.

• Information is provided but not functionally ordered.

• Without a defined ontology, accurate interpretation of these annotations is impossible.

• The servers providing annotations also need sensible IDs to allow adequate identification and administration.


Integrating Annotations

Sequencing projects, structural genomics initiatives, ever-increasing experimentally based knowledge of biological systems.

1. Additional information needs to be added to already existing entries.

e.g. EMBL/GenBank/DDBJ

• Third Party Annotation pilot study

• Entries via the website, marked as TPA entries

• Checked carefully by curators before being published.


Integrating Annotations

2. Manually curated databases are struggling with the influx of information.

UniProt - proposals

• The “adopt a protein” scheme: a research community in a particular area would be responsible for updating the information.

• Making use of “grey matter”: using the growing population of retired scientists at home, with broadband accounts and nothing to do.

• Quality and uniformity of curation is an issue: input fields are free text or drop-down menus.

Distributed Annotation System

• Allows a system of decentralised annotation.


DAS, the distributed annotation system

• What it is

– The distributed annotation system (DAS) is a specification of a client-server system for sharing various types of sequence annotations.

– An “annotation” is an entity which is anchored to a reference subsequence with a start and a stop position, together with some information about the type and method of annotation, and possibly some other textual information.

– Today, DAS is used for serving positional annotations on genomes and on proteins, and for serving “global annotations” on genes.
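The definition of an “annotation” given on this slide can be sketched as a small data structure. This is only an illustration under stated assumptions: the class name, field names, the 1-based inclusive coordinates and the example accession are all hypothetical and are not taken from the DAS specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Annotation:
    """One positional annotation, following the slide's wording (names are illustrative)."""
    reference_id: str            # identifier of the reference sequence
    start: int                   # first annotated position (assumed 1-based, inclusive)
    stop: int                    # last annotated position (inclusive)
    annotation_type: str         # what is being annotated
    method: str                  # how the annotation was produced
    note: Optional[str] = None   # optional free-text information

    def __post_init__(self) -> None:
        # Reject ranges that cannot be anchored to a subsequence.
        if self.start < 1 or self.stop < self.start:
            raise ValueError(f"invalid range {self.start}-{self.stop}")

    def overlaps(self, other: "Annotation") -> bool:
        """True when both annotations sit on the same reference and share a position."""
        return (
            self.reference_id == other.reference_id
            and self.start <= other.stop
            and other.start <= self.stop
        )


# Made-up example values, for illustration only.
site = Annotation("P00000", 42, 42, "metal_binding_site", "manual_curation", "zinc")
domain = Annotation("P00000", 10, 120, "domain", "profile_search")
print(site.overlaps(domain))
```

Keeping the type and the method as separate fields mirrors the slide's distinction between what is annotated and how the annotation was derived.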


Distributed Annotation System

[Diagram labels: Viewer, DAS Protocol]


Dasty2

Rafael Jimenez


Spice

Andreas Prlic


What will the ontology do?

1. Cluster like annotations together to aid comparison between sources.


Information on metal binding sites from two sources


What will the ontology do?

1. Cluster like annotations together to aid comparison between sources.

2. Facilitate the identification of exact duplications in the data (e.g. Pfam domains are provided by InterPro and UniProt).


Duplications in the data.


What will the ontology do?

1. Cluster like annotations together to aid comparison between sources.

2. Facilitate the identification of exact duplications in the data (e.g. Pfam domains are provided by InterPro and UniProt).

3. Standardise the vocabulary used by each partner. This will allow us to manipulate the data in a more powerful way.


Standardisation of information provided by all DAS servers.

Sometimes annotation types on some servers are exactly the same as names on other servers.

Server

Annotation
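One way to picture this standardisation is a lookup from each server's local type name to a single shared term. The sketch below is an assumption-laden illustration: the server names, local labels and shared term names are invented, and the slides do not say how the mapping is actually stored.

```python
from typing import Optional

# Hypothetical mapping: (server, local type name) -> shared ontology term.
LOCAL_TO_SHARED = {
    ("server_a", "METAL"): "metal_binding_site",
    ("server_b", "metal binding"): "metal_binding_site",
    ("server_b", "domain"): "polypeptide_domain",
}


def standardise(server: str, local_type: str) -> Optional[str]:
    """Return the shared term for a server-specific type name, or None if unmapped."""
    return LOCAL_TO_SHARED.get((server, local_type.strip()))


# Two servers using different local names resolve to the same shared term.
print(standardise("server_a", "METAL") == standardise("server_b", "metal binding"))
# An unmapped name is reported as None rather than guessed.
print(standardise("server_c", "Zn site"))
```

Keying the lookup on the server as well as the local name matters because, as noted above, the same string can be a type on one server and a name on another.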


What will the ontology do?

1. Cluster like annotations together to aid comparison between sources.

2. Facilitate the identification of exact duplications in the data (e.g. Pfam domains are provided by InterPro and UniProt).

3. Standardise the vocabulary used by each partner site. This will allow us to manipulate the data in a more powerful way.

4. Provide evidence for each annotation to give an indication of how the information can be used.
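Points 1 and 2 can be sketched in a few lines once every annotation carries a shared term. This is a minimal illustration, not the project's implementation: the `Feature` fields, the term name, the accession and the coordinates are all made up, and "exact duplicate" is assumed here to mean the same protein, term, start and stop reported by more than one source.

```python
from collections import defaultdict
from typing import NamedTuple


class Feature(NamedTuple):
    source: str    # server that supplied the annotation
    protein: str   # reference protein identifier
    term: str      # shared ontology term
    start: int
    stop: int


def cluster_by_term(features):
    """Point 1: group annotations that share an ontology term."""
    clusters = defaultdict(list)
    for feature in features:
        clusters[feature.term].append(feature)
    return dict(clusters)


def exact_duplicates(features):
    """Point 2: find identical annotations supplied by more than one source."""
    sources_by_key = defaultdict(set)
    for f in features:
        sources_by_key[(f.protein, f.term, f.start, f.stop)].add(f.source)
    return {key: sorted(srcs) for key, srcs in sources_by_key.items() if len(srcs) > 1}


# Made-up data echoing the slide's example of one domain served by two sources.
features = [
    Feature("InterPro", "P00000", "polypeptide_domain", 10, 120),
    Feature("UniProt", "P00000", "polypeptide_domain", 10, 120),
    Feature("UniProt", "P00000", "metal_binding_site", 42, 42),
]
print(sorted(cluster_by_term(features)))
print(exact_duplicates(features))
```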


Evidence codes

• Each annotation must have at least one evidence code associated with it.

• Evidence codes can be selected from the Evidence Code Ontology (ECO).

It is up to each partner to decide the evidence codes for their own annotations, as each case is very individual.

http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=ECO
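The rule that every annotation needs at least one evidence code lends itself to a simple check. The sketch below assumes annotations are held as dictionaries with hypothetical `id` and `evidence` keys, and that evidence codes look like `ECO:` followed by seven digits; the identifiers shown are placeholders and were not looked up in ECO.

```python
import re

# Assumed identifier shape: "ECO:" followed by seven digits.
ECO_ID = re.compile(r"^ECO:\d{7}$")


def missing_evidence(annotations):
    """Return the ids of annotations lacking at least one well-formed evidence code."""
    flagged = []
    for annotation in annotations:
        codes = annotation.get("evidence", [])
        if not any(ECO_ID.match(code) for code in codes):
            flagged.append(annotation["id"])
    return flagged


# Placeholder records; the ECO identifiers are illustrative only.
annotations = [
    {"id": "a1", "evidence": ["ECO:0000001"]},
    {"id": "a2", "evidence": []},
    {"id": "a3"},
    {"id": "a4", "evidence": ["predicted"]},
]
print(missing_evidence(annotations))
```

A check like this only confirms that a code is present and well formed; choosing the right code remains a decision for each partner.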


Designing an Ontology

• The provision of a controlled vocabulary which can be shared between data sources.

• Needs approval of the community.

• The creation of terms and clustering can only be done properly by an expert in the field rather than an expert in ontologies.

• Clear goals are essential: which relationships are necessary, and what should they show?

• Increased complexity becomes laborious and time-intensive.

• Continuous evolution.

Once agreed, the ontology will be deposited with the Sequence Ontology (SO) for maintenance.


Ontology Rules

Terms: computer friendly

– Phrase spacing: terms do not include white space, e.g. binding_site.

– Case: terms are always in lowercase except where demanded by context, e.g. mRNA.

– Abbreviations: if there is a common abbreviation, it is used for the name of the term, e.g. UTR.

– Symbols: symbols and Greek letters are generally spelled out in full. Full stops, slashes and hyphens are not allowed; underscores are used instead. Brackets (){}[] are not allowed.
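The term rules above are mechanical enough to check automatically. The function below is a rough sketch under assumptions: the name `term_problems` is hypothetical, the list of permitted mixed-case words holds only the two examples from the slide, and the rule about spelling out Greek letters is not checked.

```python
# Mixed-case words permitted "where demanded by context"; only the slide's examples.
ALLOWED_MIXED_CASE = {"mRNA", "UTR"}

# Full stops, slashes, hyphens and brackets are not allowed in term names.
FORBIDDEN_CHARS = "./\\-(){}[]"


def term_problems(name: str) -> list:
    """Return a list of rule violations for a proposed term name (empty if none)."""
    problems = []
    if any(ch.isspace() for ch in name):
        problems.append("contains white space")
    bad = sorted({ch for ch in name if ch in FORBIDDEN_CHARS})
    if bad:
        problems.append("forbidden characters: " + " ".join(bad))
    # Each underscore-separated part must be lowercase unless explicitly allowed.
    for part in name.split("_"):
        if part != part.lower() and part not in ALLOWED_MIXED_CASE:
            problems.append(f"unexpected capitals in '{part}'")
    return problems


for candidate in ["binding_site", "five_prime_UTR", "binding site", "Binding-Site"]:
    print(candidate, term_problems(candidate))
```

Synonyms are deliberately not passed through a check like this, since they do not have to be computer friendly.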

Synonyms: they facilitate searching the ontology.

– Types of synonym: the long version of the words in the abbreviated phrase spelled out; different words that mean the same thing.

– Synonym rules: there is no limit on synonym number; one synonym can be used more than once; synonyms do not have to be computer friendly. They can begin with numbers and include punctuation such as hyphens.

Definitions: Each term should have a definition. A definition must have a reference to its origin (PubMed, database, website, or the person who created it). The format of a definition:

a bicycle -- has two wheels

a tandem -- is a bicycle with two saddles and two sets of handlebars (inherits all the features of bicycle; therefore the definition for bicycle cannot state “a saddle and a set of handlebars”)

Understanding relationships: Currently there are 3 types of relationship in SO: is_a, part_of and derived_from.
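The bicycle and tandem example can be extended into a tiny relationship graph to show how is_a links are followed upwards, which is what lets a child term inherit its parent's features. Everything beyond "tandem is_a bicycle" is an invented illustration: the "vehicle" and "wheel" terms and the `ancestors` function are assumptions, not part of the ontology.

```python
# (child, relationship, parent) triples; only the first comes from the slide's example.
RELATIONS = [
    ("tandem", "is_a", "bicycle"),
    ("bicycle", "is_a", "vehicle"),   # hypothetical parent term
    ("wheel", "part_of", "bicycle"),  # hypothetical part_of link
]


def ancestors(term, relations, rel_type="is_a"):
    """Follow one relationship type upwards and return every term reached."""
    parents = {}
    for child, rel, parent in relations:
        if rel == rel_type:
            parents.setdefault(child, []).append(parent)
    found, stack = [], [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in found:
                found.append(parent)
                stack.append(parent)
    return found


print(ancestors("tandem", RELATIONS))            # terms whose features a tandem inherits
print(ancestors("wheel", RELATIONS, "part_of"))  # what a wheel is part of
```

Traversing each relationship type separately keeps is_a inheritance distinct from part_of containment: a wheel is part of a bicycle but is not a kind of bicycle.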


The OBO Editor


The Ontology

Still in draft form.


Acknowledgements

Gaby Reeves

Midori Harris (GO), Karen Eilbeck (SO)

Luisa Montecchi, Henning Hermjakob, Eugene Kulesha, Andreas Prlic

Members of UniProt (EBI and SIB):

Alan Bridge, Michele Magrane, Clare O’Donovan, and Anne-Lise Veuthey

BioSapiens Workshop held in February:

University of Bologna, CNIO, University of Dundee, EBI, ENZIM Hungary, Hebrew University MPI, Sanger and UCL