Introduction to the Gene Ontology and GO annotation resources Rachael Huntley UniProt-GOA EBI Programmatic Access to Biological Databases 3 rd October 2012


Introduction to the Gene Ontology and GO annotation resources

Rachael Huntley

UniProt-GOA

EBI

Programmatic Access to Biological Databases

3rd October 2012

What is an Ontology?

What is the difference between an ontology and a controlled vocabulary?

An ontology formally represents knowledge as a set of concepts within a domain, and the relationships among those concepts. It can be used to reason about the entities within that domain and may be used to describe the domain.

A controlled vocabulary provides a way to organize knowledge for subsequent retrieval but does not allow reasoning about the entities.

Ontologies and CVs in Biology

Ontologies/CVs are heavily used in biological databases

• Allow organisation of data within a database

• Enable linking between databases

• Enable searches across databases

www.ebi.ac.uk/ols

What is GO?

The Gene Ontology

• A way to capture biological knowledge for individual gene products in a written and computable form

• A set of concepts and their relationships to each other, arranged as a hierarchy from less specific concepts to more specific concepts

www.ebi.ac.uk/QuickGO

The Concepts in GO

1. Molecular Function: An elemental activity or task or job

• protein kinase activity

• insulin receptor activity

2. Biological Process: A commonly recognised series of events

• cell division

3. Cellular Component: Where a gene product is located

• mitochondrion

• mitochondrial matrix

• mitochondrial inner membrane

Anatomy of a GO term

Unique identifier

Term name

Definition

Synonyms

Cross-references
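
The parts of a GO term listed above can be pictured as a simple record. This is a minimal Python sketch, not the format used by any GO tool: the class name `GOTerm` and its field names are assumptions made for illustration, and the definition text is a placeholder rather than the official wording.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GOTerm:
    """Illustrative record for one GO term (field names are assumptions)."""
    go_id: str                      # unique identifier, e.g. "GO:0004069"
    name: str                       # term name
    definition: str                 # free-text definition
    synonyms: List[str] = field(default_factory=list)
    xrefs: List[str] = field(default_factory=list)  # cross-references


term = GOTerm(
    go_id="GO:0004069",
    name="aspartate transaminase activity",
    definition="<definition text goes here>",  # placeholder, not the official text
)
print(term.go_id, term.name)
```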

Ontology structure

• Directed acyclic graph: terms can have more than one parent

• Terms are linked by relationships

is_a

part_of

regulates (and +/- regulates)

occurs_in

has_part

These relationships allow for complex analysis of large datasets

www.ebi.ac.uk/QuickGO
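
To make the graph idea concrete, here is a small Python sketch of a directed acyclic graph with typed relationships. It is a toy under stated assumptions: most term names are borrowed from examples in these slides, but the "membrane" parent, the edge list and the helper names `build_parents` and `ancestors` are invented for illustration and are not a copy of the real ontology.

```python
from collections import defaultdict

# Toy edges as (child, relationship, parent). Most names come from the
# slides' examples; "membrane" and the exact edges are illustrative only.
EDGES = [
    ("mitochondrial matrix", "part_of", "mitochondrion"),
    ("mitochondrial inner membrane", "part_of", "mitochondrion"),
    ("mitochondrial inner membrane", "is_a", "membrane"),
    ("replication fork", "part_of", "chromosome"),
    ("cell cycle checkpoint", "regulates", "cell cycle"),
]


def build_parents(edges):
    """Index the edges by child so each term lists its (relationship, parent) pairs."""
    parents = defaultdict(list)
    for child, relationship, parent in edges:
        parents[child].append((relationship, parent))
    return parents


def ancestors(term, parents, relations=("is_a", "part_of")):
    """Collect every term reachable from `term` by following the chosen relationships."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for relationship, parent in parents.get(current, []):
            if relationship in relations and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


PARENTS = build_parents(EDGES)
print(sorted(ancestors("mitochondrial inner membrane", PARENTS)))
```

Because a term may have several parents, the traversal keeps a set of visited terms instead of assuming a single path to the root. Restricting `relations` lets an analysis decide which relationship types it wants to follow.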

Relations Between GO Terms

is_a: If A is a B, then A is a subtype of B

part_of (B part_of A): Wherever B exists, it is as part of A. But A does not necessarily have B as a part.

e.g. All replication forks are part of a chromosome. Not all chromosomes have replication forks.

http://www.geneontology.org/GO.ontology-ext.relations.shtml

Relations Between GO Terms: Regulates

• One process directly affects another process or quality

• Necessarily regulates: if both A and B are present, B always regulates A, but A may not always be regulated by B

e.g. All cell cycle checkpoints regulate the cell cycle. The cell cycle is not solely regulated by cell cycle checkpoints.

http://www.geneontology.org/GO.ontology-ext.relations.shtml

Process-Function Links in GO

• GO was originally three completely independent hierarchies, with no relationships between them

• Biological processes are ordered assemblies of molecular functions

• As of 2009 we have started making relationships between biological process and molecular function in the live ontology

functions that are part_of processes, e.g. transporter part_of transport

functions that regulate processes, e.g. transcription regulator regulates transcription

Searching for GO terms

http://www.ebi.ac.uk/QuickGO

Search GO terms or proteins

Exercise

Search for a GO term Exercise 1 (pg.16)

Why do we need GO?

Reasons for the Gene Ontology

www.geneontology.org

• Inconsistency in English language

Inconsistency in English language

• Same name for different concepts

[Figure: two different images, either of which could be labelled 'Cell']

• Different names for the same concept

Eggplant, Aubergine, Brinjal, Melongene

Same for biological concepts

Comparison is difficult, in particular across species or across databases

Just one reason why the Gene Ontology (GO) is needed…

Reasons for the Gene Ontology

www.geneontology.org

• Inconsistency in English language

• Increasing amounts of biological data available

• Increasing amounts of biological data to come

Increasing amounts of biological data available

Search on ‘DNA repair’... get over 68,000 results

Expansion of sequence information

Reasons for the Gene Ontology

www.geneontology.org

• Inconsistency in English language

• Large datasets need to be interpreted quickly

• Increasing amounts of biological data available

• Increasing amounts of biological data to come

Aims of the GO project

• Compile the ontologies
- currently over 38,000 terms
- constantly increasing and improving

• Annotate gene products using ontology terms
- around 30 groups provide annotations

• Provide a public resource of data and tools
- regular releases of annotations
- tools for browsing/querying annotations and editing the ontology

Reactome

http://www.geneontology.org

GO Annotation

UniProt-Gene Ontology Annotation (UniProt-GOA) project at the EBI

• Largest open-source contributor of annotations to GO

• Provides annotation for more than 350,000 species

• Our priority is to annotate the human proteome

A GO annotation is a statement that a gene product:

1. has a particular molecular function, or is involved in a particular biological process, or is located within a certain cellular component

2. as determined by a particular method

3. as described in a particular reference

Accession   Name   GO ID        GO term name                      Reference      Evidence code
P00505      GOT2   GO:0004069   aspartate transaminase activity   PMID:2731362   IDA
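
The three parts of an annotation (what, how, and where it was reported) map naturally onto a small record. The sketch below uses the example row from this slide; the `Annotation` type, its field names and the `as_statement` helper are hypothetical names chosen for illustration, not part of any GO software.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    """One GO annotation, using the column headings shown on the slide."""
    accession: str
    name: str
    go_id: str
    go_term_name: str
    reference: str
    evidence_code: str


example = Annotation(
    accession="P00505",
    name="GOT2",
    go_id="GO:0004069",
    go_term_name="aspartate transaminase activity",
    reference="PMID:2731362",
    evidence_code="IDA",
)


def as_statement(annotation):
    """Spell the annotation out as the statement it represents."""
    return (
        f"{annotation.name} ({annotation.accession}) has "
        f"{annotation.go_term_name} ({annotation.go_id}), "
        f"as determined by {annotation.evidence_code}, "
        f"as described in {annotation.reference}"
    )


print(as_statement(example))
```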

UniProt-GOA incorporates annotations made using two methods

Electronic Annotation

• Quick way of producing large numbers of annotations

• Annotations use less-specific GO terms

• Only source of annotation for many non-model organism species

Manual Annotation

• Time-consuming process producing lower numbers of annotations

• Annotations tend to use very specific GO terms

Electronic annotation methods

1. Mapping of external concepts to GO terms

GO:0004707: MAP kinase activity

GO:0005634: Nucleus

GO:0009734: Auxin mediated signaling pathway

Electronic annotation methods

2. Automatic transfer of manual annotations to orthologs

e.g. Human: Macaque, Mouse, Dog, Cow, Guinea Pig, Chimpanzee, Rat, Chicken ...and more (Ensembl compara)

Plants: Arabidopsis, Rice, Brachypodium, Maize, Poplar, Grape …and more (Ensembl compara)

Annotations are high-quality and have an explanation of the method (GO_REF)

http://www.geneontology.org/cgi-bin/references.cgi
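
The idea of transferring manual annotations to orthologs can be sketched in a few lines of Python. This is only an illustration of the principle: the protein identifiers, the ortholog table and the function name `transfer_to_orthologs` are hypothetical, and a production pipeline would apply further restrictions (for example on which evidence and which ortholog relationships qualify) that are not modelled here. The transferred copies are tagged with the IEA evidence code to mark them as electronic.

```python
def transfer_to_orthologs(manual_annotations, orthologs):
    """Copy each manual annotation to every ortholog of the annotated protein.

    manual_annotations: list of (protein, go_id, evidence_code) tuples
    orthologs: dict mapping a protein to a list of its orthologs
    """
    transferred = []
    for protein, go_id, _evidence in manual_annotations:
        for ortholog in orthologs.get(protein, []):
            # The copy is an electronic inference, so it does not inherit
            # the experimental evidence code of the source annotation.
            transferred.append((ortholog, go_id, "IEA"))
    return transferred


# Hypothetical identifiers, for illustration only.
manual = [("HUMAN_PROT_1", "GO:0004069", "IDA")]
ortholog_table = {"HUMAN_PROT_1": ["MOUSE_PROT_1", "RAT_PROT_1"]}

electronic = transfer_to_orthologs(manual, ortholog_table)
for row in electronic:
    print(row)
```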

Manual annotation by GOA

High-quality, specific annotations made using:

• Full text peer-reviewed papers

• A range of evidence codes to categorise the types of evidence found in a paper, e.g. IDA, IMP, IPI

http://www.ebi.ac.uk/GOA

Number of annotations in UniProt-GOA database

Sep 2012 Statistics

Electronic annotations: 102,205,043

Manual annotations*: 1,149,802

* Includes manual annotations integrated from external model organism and specialist groups

How to access and use GO annotation data

Where can you find annotations?

UniProtKB

Ensembl

Entrez gene

UniProt vs. QuickGO annotation display

QuickGO

UniProt

UniProt vs. QuickGO annotation display

Filtering mechanism

• Exclude root terms (e.g. Molecular Function)

• Exclude annotations with qualifier (e.g. NOT, contributes_to)

• Exclude annotations to less granular terms

• Exclude annotations to GO:0005515 protein binding

• Exclude lower quality assignments for same data, e.g. UniProt taken in preference to MGI

• Add electronic annotations that cover ground not covered by manual annotation
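
Three of the filtering rules above are simple enough to express as a predicate. The following Python sketch is an illustration only, not the actual display code of either resource: the dictionary keys and the function name `keep_for_display` are assumptions, and the rules about term granularity, source preference and added electronic annotations are left out because they need ontology and provenance information.

```python
# Root terms of the three GO ontologies.
ROOT_TERMS = {
    "GO:0003674",  # molecular_function
    "GO:0008150",  # biological_process
    "GO:0005575",  # cellular_component
}
PROTEIN_BINDING = "GO:0005515"


def keep_for_display(annotation):
    """Return True if the annotation passes the three modelled rules."""
    if annotation["go_id"] in ROOT_TERMS:
        return False                      # exclude root terms
    if annotation.get("qualifier"):
        return False                      # exclude NOT, contributes_to, ...
    if annotation["go_id"] == PROTEIN_BINDING:
        return False                      # exclude protein binding
    return True


sample = [
    {"go_id": "GO:0004069", "qualifier": ""},
    {"go_id": "GO:0003674", "qualifier": ""},
    {"go_id": "GO:0004707", "qualifier": "NOT"},
    {"go_id": "GO:0005515", "qualifier": ""},
]
displayed = [a for a in sample if keep_for_display(a)]
print(displayed)
```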

Gene Association Files

17 column files containing all information for each annotation

GO Consortium website

UniProt-GOA website: Numerous species-specific files

http://www.ebi.ac.uk/GOA/downloads.html
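
A 17-column, tab-separated gene association file is straightforward to read programmatically. The sketch below assumes the GAF 2.0 column order and treats lines starting with `!` as comments; the short Python key names and the function `parse_gaf` are my own. The sample line reuses the GOT2 example from earlier in the slides, with the date and assigning group filled in as made-up illustrative values.

```python
# Column order assumed from the GAF 2.0 format; key names are illustrative.
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_or_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type", "taxon",
    "date", "assigned_by", "annotation_extension", "gene_product_form_id",
]


def parse_gaf(lines):
    """Yield one dict per annotation line, skipping comments and blank lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue
        fields = line.split("\t")
        if len(fields) != len(GAF_COLUMNS):
            raise ValueError(f"expected 17 columns, found {len(fields)}")
        yield dict(zip(GAF_COLUMNS, fields))


# Illustrative line: date and assigned_by are invented for the example.
SAMPLE_LINE = "\t".join([
    "UniProtKB", "P00505", "GOT2", "", "GO:0004069", "PMID:2731362",
    "IDA", "", "F", "", "", "protein", "taxon:9606", "20120901",
    "UniProt", "", "",
])

records = list(parse_gaf(["!gaf-version: 2.0\n", SAMPLE_LINE + "\n"]))
print(records[0]["db_object_symbol"], records[0]["go_id"], records[0]["evidence_code"])
```

To read a downloaded file, the same function can be given an open file handle, for example `parse_gaf(open(path, encoding="utf-8"))`, where `path` points at an uncompressed file.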

GO browsers

http://www.ebi.ac.uk/QuickGO

The EBI's QuickGO browser

Search GO terms or proteins

Find sets of GO annotations

Exercise

Find annotations to a protein Exercise 2 (pg.16)

Find annotations to a list of proteins Exercise 1 and 2 (pg.22)

How scientists use the GO

• Access gene product functional information

• Analyse high-throughput genomic or proteomic datasets

• Validation of experimental techniques

• Get a broad overview of a proteome

• Obtain functional information for novel gene products

Some examples…

Term enrichment

• Most popular type of GO analysis

• Determines which GO terms are more often associated with a specified list of genes/proteins compared with a control list or rest of genome

• Many tools available to do this analysis

• User must decide which is best for their analysis
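
As a rough illustration of what an enrichment tool computes for a single GO term, the sketch below uses a one-sided hypergeometric test. The slides do not name a particular statistic, so this choice is an assumption (it is one common option), and the function name and the counts in the example are invented. Real tools also correct for testing many terms at once, which is not shown.

```python
from math import comb


def hypergeom_enrichment_p(study_hits, study_size, population_hits, population_size):
    """Probability of seeing at least `study_hits` annotated genes in the study list.

    The study list is treated as `study_size` genes drawn without replacement
    from a population of `population_size` genes, of which `population_hits`
    are annotated to the GO term of interest.
    """
    total = comb(population_size, study_size)
    upper = min(study_size, population_hits)
    tail = sum(
        comb(population_hits, k)
        * comb(population_size - population_hits, study_size - k)
        for k in range(study_hits, upper + 1)
    )
    return tail / total


# Invented counts: 20 genes in the genome, 5 annotated to the term,
# and 3 of the 5 genes in the study list carry that annotation.
p_value = hypergeom_enrichment_p(
    study_hits=3, study_size=5, population_hits=5, population_size=20
)
print(round(p_value, 4))
```

A small p-value suggests the term is associated with the study list more often than expected by chance against the chosen background, which is why the choice of control list or background genome matters.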

Analysis of high-throughput genomic datasets

MicroArray data analysis

[Figure: gene tree of all genes (14010), comparing attacked and control samples over time; clusters are labelled with: Puparial adhesion, Molting cycle, Hemocyanin; Defense response, Immune response, Response to stimulus, Toll regulated genes, JAK-STAT regulated genes; Immune response, Toll regulated genes; Amino acid catabolism, Lipid metabolism; Peptidase activity, Protein catabolism, Immune response]

Bregje Wertheim at the Centre for Evolutionary Genomics, Department of Biology, UCL and Eugene Schuster Group, EBI.

Annotating novel sequences

• Can use BLAST queries to find similar sequences with GO annotation which can be transferred to the new sequence

• Two tools currently available;

AmiGO BLAST – searches the GO Consortium database

BLAST2GO – searches the NCBI database

Using the GO to provide a functional overview for a large dataset

• Many GO analysis tools use GO slims to give a broad overview of the dataset

• GO slims are cut-down versions of the GO and contain a subset of the terms in the whole GO

• GO slims usually contain less-specialised GO terms

Slimming the GO using the ‘true path rule’

Many gene products are associated with a large number of descriptive, leaf GO nodes:

Slimming the GO using the ‘true path rule’

…however annotations can be mapped up to a smaller set of parent GO terms:
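
Mapping annotations up to a slim can be sketched as a walk from each annotated term towards the root, keeping any slim terms met on the way. The Python below is a toy under stated assumptions: the hierarchy, the slim, the gene product name and the function `map_to_slim` are all invented for illustration (the term names are borrowed from the cellular component examples earlier in the slides).

```python
def map_to_slim(term, parents, slim_terms):
    """Return the slim terms that are `term` itself or one of its ancestors."""
    found = set()
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in slim_terms:
            found.add(current)
        stack.extend(parents.get(current, []))
    return found


# Toy hierarchy (child -> parents) and slim, for illustration only.
TOY_PARENTS = {
    "mitochondrial matrix": ["mitochondrion"],
    "mitochondrial inner membrane": ["mitochondrion"],
    "mitochondrion": ["cellular component"],
}
SLIM = {"mitochondrion"}

annotations = {
    "gene_product_1": ["mitochondrial matrix", "mitochondrial inner membrane"],
}

slimmed = {
    gene_product: set().union(
        *(map_to_slim(term, TOY_PARENTS, SLIM) for term in terms)
    )
    for gene_product, terms in annotations.items()
}
print(slimmed)
```

Two specific leaf annotations collapse into a single broader term, which is what gives a slim its overview quality.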

GO slims

Custom slims are available for download;

http://www.geneontology.org/GO.slims.shtml

or you can make your own using;

• AmiGO's GO slimmer http://amigo.geneontology.org/cgi-bin/amigo/slimmer

• QuickGO http://www.ebi.ac.uk/QuickGO

www.ebi.ac.uk/QuickGO

Map-up annotations with GO slims

The EBI's QuickGO browser

Search GO terms or proteins

Find sets of GO annotations

Precautions when using GO annotations for analysis

• Recommended that ‘NOT’ annotations are removed before analysis
- only ~3000 out of 57 million annotations are ‘NOT’
- can confuse the analysis

• The Gene Ontology is always changing and GO annotations are continually being created
- always use a current version of both
- if publishing your analyses please report the versions/dates you used

http://www.geneontology.org/GO.cite.shtml

Precautions when using GO annotations for analysis

• Unannotated is not unknown
- where there is no evidence in the literature for a process, function or location the gene product is annotated to the appropriate ontology’s root node with an ‘ND’ evidence code (no biological data), thereby distinguishing between unannotated and unknown

• Pay attention to under-represented GO terms
- a strong under-representation of a pathway may mean that normal functioning of that pathway is necessary for the given condition
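
The distinction between unannotated and unknown can be checked mechanically. The sketch below is one possible reading of the rule, with assumptions: annotations are given as (GO ID, evidence code) pairs, the labels returned by the hypothetical function `knowledge_status` are my own wording, and a gene product counts as "unknown" only if every annotation it has is an ND annotation to a root term.

```python
# Root terms of the three GO ontologies.
ROOT_TERMS = {
    "GO:0003674",  # molecular_function
    "GO:0008150",  # biological_process
    "GO:0005575",  # cellular_component
}


def knowledge_status(annotations):
    """Classify a gene product from its list of (go_id, evidence_code) pairs."""
    if not annotations:
        # Nobody has looked yet, or nothing has been recorded.
        return "unannotated"
    if all(code == "ND" and go_id in ROOT_TERMS for go_id, code in annotations):
        # A curator looked and found no evidence in the literature.
        return "unknown"
    return "annotated"


print(knowledge_status([]))
print(knowledge_status([("GO:0003674", "ND")]))
print(knowledge_status([("GO:0004069", "IDA")]))
```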

Exercise

Use the QuickGO web services to retrieve annotations

Exercise 1 Pg.46
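
As a starting point for this kind of retrieval, the sketch below builds a query URL for a protein accession. It rests on an assumption that should be checked against the QuickGO documentation: the endpoint and the `geneProductId` parameter are those of the QuickGO REST service as I understand it today, which postdates these slides and may differ from the interface used in the exercise. The helper name `annotation_search_url` is hypothetical, and the network call is left as a comment so the example runs offline.

```python
from urllib.parse import urlencode

# Assumed endpoint of the current QuickGO REST service; verify before use.
QUICKGO_ANNOTATION_SEARCH = "https://www.ebi.ac.uk/QuickGO/services/annotation/search"


def annotation_search_url(accession):
    """Build a query URL asking for annotations to one protein accession."""
    query = urlencode({"geneProductId": accession})
    return f"{QUICKGO_ANNOTATION_SEARCH}?{query}"


url = annotation_search_url("P00505")
print(url)

# To actually fetch the results (needs network access):
#   import json
#   import urllib.request
#   request = urllib.request.Request(url, headers={"Accept": "application/json"})
#   with urllib.request.urlopen(request) as response:
#       data = json.load(response)
```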

The UniProt-GOA group

Curators:

Software developer:

Team leaders:

Rachael Huntley

Rolf Apweiler

Email: [email protected]

http://www.ebi.ac.uk/GOA

Claire O’Donovan

Tony Sawford

Yasmin Alam-Faruque

Prudence Mutowo

Project leader:

Acknowledgements

Ensembl

Ensembl Genomes

GO Consortium

Members of;

UniProtKB

InterPro

IntAct

HAMAP

Funding

National Human Genome Research Institute (NHGRI)

British Heart Foundation

EMBL