Introduction to the Gene Ontology and GO annotation resources
Rachael Huntley
UniProt-GOA
EBI
Programmatic Access to Biological Databases
3rd October 2012
What is an Ontology?
What is the difference between an ontology and a controlled vocabulary?
An ontology formally represents knowledge as a set of concepts within a domain, and the relationships among those concepts. It can be used to reason about the entities within that domain and may be used to describe the domain.
A controlled vocabulary provides a way to organize knowledge for subsequent retrieval but does not allow reasoning about the entities.
Ontologies and CVs in Biology
Ontologies/CVs are heavily used in biological databases
•Allow organisation of data within a database
•Enable linking between databases
•Enable searches across databases
www.ebi.ac.uk/ols
What is GO?
• A way to capture biological knowledge for individual gene products in a written and computable form
The Gene Ontology
• A set of concepts and their relationships to each other arranged as a hierarchy
www.ebi.ac.uk/QuickGO
Less specific concepts
More specific concepts
The Concepts in GO
1. Molecular Function: an elemental activity or task or job
• protein kinase activity
• insulin receptor activity
2. Biological Process: a commonly recognised series of events
• cell division
3. Cellular Component: where a gene product is located
• mitochondrion
• mitochondrial matrix
• mitochondrial inner membrane
Anatomy of a GO term
Unique identifier
Term name
Definition
Synonyms
Cross-references
Ontology structure
• Directed acyclic graph: terms can have more than one parent
• Terms are linked by relationships
is_a
part_of
regulates (and +/- regulates)
occurs_in
has_part
www.ebi.ac.uk/QuickGO
These relationships allow for complex analysis of large datasets
Relations Between GO Terms
http://www.geneontology.org/GO.ontology-ext.relations.shtml
is_a: If A is a B, then A is a subtype of B
part_of: Wherever B exists, it is as part of A, but not every A has B as a part
Example (A = chromosome, B = replication fork): all replication forks are part of a chromosome, but not all chromosomes have replication forks.
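The relationships can be treated as typed edges in a graph and followed upwards from a term. A minimal Python sketch, assuming a tiny hand-written graph: the edges are illustrative only and are not loaded from a GO release, and `EDGES` and `ancestors` are hypothetical names.

```python
from collections import deque

# Toy graph: child term -> list of (relationship, parent term).
# Illustrative edges only; a real analysis would load a GO release file.
EDGES = {
    "mitochondrial matrix": [("part_of", "mitochondrion")],
    "mitochondrial inner membrane": [
        ("part_of", "mitochondrion"),
        ("is_a", "organelle inner membrane"),  # a term can have several parents
    ],
    "mitochondrion": [("is_a", "organelle")],
    "replication fork": [("part_of", "chromosome")],
    "chromosome": [("is_a", "organelle")],
    "organelle": [("is_a", "cellular component")],
}


def ancestors(term, relations=("is_a", "part_of")):
    """Return every term reachable from `term` by following the given relationships upwards."""
    seen = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for relation, parent in EDGES.get(current, []):
            if relation in relations and parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(ancestors("mitochondrial matrix")))
print(sorted(ancestors("mitochondrial matrix", relations=("is_a",))))
```

Restricting the traversal to is_a gives an empty result for mitochondrial matrix in this toy graph, because its only parent is reached through part_of.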
Relations Between GO Terms: Regulates
• One process directly affects another process or quality
• Necessarily regulates: if both A and B are present, B always regulates A, but A may not always be regulated by B
Example (A = cell cycle, B = cell cycle checkpoint): all cell cycle checkpoints regulate the cell cycle, but the cell cycle is not solely regulated by cell cycle checkpoints.
Process-Function Links in GO
• GO was originally three completely independent hierarchies, with no relationships between them
• Biological processes are ordered assemblies of molecular functions
• As of 2009 we have started making relationships between biological process and molecular function in the live ontology
• Functions that are part_of processes, e.g. transporter part_of transport
• Functions that regulate processes, e.g. transcription regulator regulates transcription
Searching for GO terms
http://www.ebi.ac.uk/QuickGO
Search GO terms or proteins
Exercise
Search for a GO term Exercise 1 (pg.16)
Why do we need GO?
Reasons for the Gene Ontology
www.geneontology.org
• Inconsistency in English language
Inconsistency in English language
• Same name for different concepts
[Figure: two different things that are both called "Cell"]
Comparison is difficult – in particular across species or across databases
Just one reason why the Gene Ontology (GO) is needed…
• Different names for the same concept
Eggplant
Aubergine
Brinjal
Melongene
Same for biological concepts
• Increasing amounts of biological data available
• Increasing amounts of biological data to come
Search on ‘DNA repair’...get over 68,000 results
Increasing amounts of biological data available
Expansion of sequence information
• Large datasets need to be interpreted quickly
Aims of the GO project
• Compile the ontologies
- currently over 38,000 terms
- constantly increasing and improving
• Annotate gene products using ontology terms
- around 30 groups provide annotations
• Provide a public resource of data and tools
- regular releases of annotations
- tools for browsing/querying annotations and editing the ontology
Reactome
http://www.geneontology.org
GO Annotation
UniProt-Gene Ontology Annotation (UniProt-GOA) project at the EBI
• Largest open-source contributor of annotations to GO
• Provides annotation for more than 350,000 species
• Our priority is to annotate the human proteome
A GO annotation is a statement that a gene product:
1. has a particular molecular function, or is involved in a particular biological process, or is located within a certain cellular component
2. as determined by a particular method
3. as described in a particular reference
Accession  Name  GO ID       GO term name                     Reference     Evidence code
P00505     GOT2  GO:0004069  aspartate transaminase activity  PMID:2731362  IDA
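As a sketch, the example annotation above can be held as a small record with one field per column. The class name, field names and the `describe` helper are hypothetical; only the values come from the slide.

```python
from typing import NamedTuple


class GOAnnotation(NamedTuple):
    """One GO annotation: a gene product, a GO term, a method and a reference."""
    accession: str
    name: str
    go_id: str
    go_term_name: str
    reference: str
    evidence_code: str


# Values taken from the example row on the slide.
example = GOAnnotation(
    accession="P00505",
    name="GOT2",
    go_id="GO:0004069",
    go_term_name="aspartate transaminase activity",
    reference="PMID:2731362",
    evidence_code="IDA",
)


def describe(annotation):
    """Spell the annotation out as the three-part statement described above."""
    return (
        f"{annotation.name} ({annotation.accession}) has "
        f"{annotation.go_term_name} [{annotation.go_id}], "
        f"as determined by {annotation.evidence_code}, "
        f"as described in {annotation.reference}"
    )


print(describe(example))
```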
UniProt-GOA incorporates annotations made using two methods
Electronic Annotation
• Quick way of producing large numbers of annotations
• Annotations use less-specific GO terms
• Only source of annotation for many non-model organism species
Manual Annotation
• Time-consuming process producing lower numbers of annotations
• Annotations tend to use very specific GO terms
Electronic annotation methods
1. Mapping of external concepts to GO terms, e.g.
GO:0004707: MAP kinase activity
GO:0005634: Nucleus
GO:0009734: Auxin mediated signaling pathway
Annotations are high-quality and have an explanation of the method (GO_REF)
2. Automatic transfer of manual annotations to orthologs (Ensembl Compara)
[Figure: species shown, e.g. Human, Macaque, Mouse, Dog, Cow, Guinea Pig, Chimpanzee, Rat, Chicken ...and more; Arabidopsis, Rice, Brachypodium, Maize, Poplar, Grape ...and more]
http://www.geneontology.org/cgi-bin/references.cgi
Manual annotation by GOA
High-quality, specific annotations made using:
• Full text peer-reviewed papers
• A range of evidence codes to categorise the types of evidence found in a paper, e.g. IDA, IMP, IPI
http://www.ebi.ac.uk/GOA
Number of annotations in UniProt-GOA database (Sep 2012 statistics)
Electronic annotations: 102,205,043
Manual annotations*: 1,149,802
* Includes manual annotations integrated from external model organism and specialist groups
How to access and use GO annotation data
Where can you find annotations?
UniProtKB
Ensembl
Entrez gene
UniProt vs. QuickGO annotation display
QuickGO
UniProt
Filtering mechanism
• Exclude root terms (e.g. Molecular Function)
• Exclude annotations with qualifier (e.g. NOT, contributes_to)
• Exclude annotations to less granular terms
• Exclude annotations to GO:0005515 protein binding
• Exclude lower quality assignments for same data, e.g. UniProt taken in preference to MGI
• Add electronic annotations that cover ground not covered by manual annotation
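Three of these rules can be written as a simple predicate over annotation records. This is a rough Python sketch, not the actual UniProt display code: the record layout and the `keep_for_display` name are assumptions, and the granularity, source-preference and electronic back-fill rules are left out because they need the ontology graph and provenance data.

```python
# Root terms of the three ontologies.
ROOT_TERMS = {
    "GO:0003674",  # molecular_function
    "GO:0008150",  # biological_process
    "GO:0005575",  # cellular_component
}
PROTEIN_BINDING = "GO:0005515"


def keep_for_display(annotation):
    """Apply the root-term, qualifier and protein-binding exclusions."""
    if annotation["go_id"] in ROOT_TERMS:
        return False
    if annotation["qualifier"]:  # e.g. "NOT" or "contributes_to"
        return False
    if annotation["go_id"] == PROTEIN_BINDING:
        return False
    return True


# Hypothetical records for illustration.
annotations = [
    {"go_id": "GO:0004069", "qualifier": ""},
    {"go_id": "GO:0003674", "qualifier": ""},
    {"go_id": "GO:0005515", "qualifier": ""},
    {"go_id": "GO:0005634", "qualifier": "NOT"},
]

shown = [a for a in annotations if keep_for_display(a)]
print([a["go_id"] for a in shown])
```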
Gene Association Files: 17-column files containing all information for each annotation
• GO Consortium website
• UniProt-GOA website: numerous species-specific files
http://www.ebi.ac.uk/GOA/downloads.html
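A gene association file can be read one tab-separated line at a time. The sketch below assumes the 17-column GAF 2.0 layout; the snake_case keys and the `parse_gaf_line` name are this example's own, and the example line reuses the P00505 annotation with placeholder values where the slides give none.

```python
# Column order of a 17-column gene association file (GAF 2.0 assumed).
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type", "taxon",
    "date", "assigned_by", "annotation_extension", "gene_product_form_id",
]


def parse_gaf_line(line):
    """Return a dict for one annotation line, or None for a '!' comment line."""
    if line.startswith("!"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GAF_COLUMNS):
        raise ValueError(f"expected {len(GAF_COLUMNS)} columns, got {len(fields)}")
    return dict(zip(GAF_COLUMNS, fields))


# Example line built from the annotation shown earlier in the slides.
example_line = "\t".join([
    "UniProtKB", "P00505", "GOT2", "", "GO:0004069", "PMID:2731362", "IDA",
    "", "F", "", "", "protein",
    "taxon:9606",  # assumption: P00505 taken to be the human protein
    "20120101",    # placeholder date, not the real annotation date
    "UniProt",     # placeholder assigning group
    "", "",
])

record = parse_gaf_line(example_line)
print(record["db_object_id"], record["go_id"], record["evidence_code"])
```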
GO browsers
http://www.ebi.ac.uk/QuickGO
The EBI's QuickGO browser
Search GO terms or proteins
Find sets of GO annotations
Exercise
Find annotations to a protein Exercise 2 (pg.16)
Find annotations to a list of proteins Exercise 1 and 2 (pg.22)
How scientists use the GO
• Access gene product functional information
• Analyse high-throughput genomic or proteomic datasets
• Validation of experimental techniques
• Get a broad overview of a proteome
• Obtain functional information for novel gene products
Some examples…
Term enrichment
• Most popular type of GO analysis
• Determines which GO terms are more often associated with a specified list of genes/proteins compared with a control list or rest of genome
• Many tools available to do this analysis
• User must decide which is best for their analysis
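One common way to score over-representation is a one-sided hypergeometric test. The slides do not name a specific test, so this is an assumption, and the counts in the example are made up. Real tools also differ in how they choose the background set and correct for multiple testing.

```python
from math import comb


def enrichment_p_value(population_size, population_hits, sample_size, sample_hits):
    """Probability of seeing at least `sample_hits` annotated genes in the list.

    population_size: genes in the background (control list or rest of genome)
    population_hits: background genes annotated to the GO term
    sample_size:     genes in the list of interest
    sample_hits:     genes in the list annotated to the GO term
    """
    total = comb(population_size, sample_size)
    upper = min(population_hits, sample_size)
    tail = sum(
        comb(population_hits, k)
        * comb(population_size - population_hits, sample_size - k)
        for k in range(sample_hits, upper + 1)
    )
    return tail / total


# Made-up numbers: 20 genes in total, 5 annotated to the term,
# and 3 of the 4 genes in the list of interest carry the annotation.
p = enrichment_p_value(population_size=20, population_hits=5,
                       sample_size=4, sample_hits=3)
print(round(p, 4))
```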
Analysis of high-throughput genomic datasets
MicroArray data analysis
[Figure: gene tree of all genes (14010), attacked vs. control over time, with clusters labelled: Puparial adhesion, Molting cycle, Hemocyanin; Defense response, Immune response, Response to stimulus, Toll regulated genes, JAK-STAT regulated genes; Immune response, Toll regulated genes; Amino acid catabolism, Lipid metabolism; Peptidase activity, Protein catabolism, Immune response]
Bregje Wertheim at the Centre for Evolutionary Genomics, Department of Biology, UCL and Eugene Schuster Group, EBI.
Annotating novel sequences
• Can use BLAST queries to find similar sequences with GO annotation which can be transferred to the new sequence
• Two tools currently available:
- AmiGO BLAST: searches the GO Consortium database
- BLAST2GO: searches the NCBI database
Using the GO to provide a functional overview for a large dataset
• Many GO analysis tools use GO slims to give a broad overview of the dataset
• GO slims are cut-down versions of the GO and contain a subset of the terms in the whole GO
• GO slims usually contain less-specialised GO terms
Slimming the GO using the ‘true path rule’
Many gene products are associated with a large number of descriptive, leaf GO nodes; however, annotations can be mapped up to a smaller set of parent GO terms.
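Mapping up can be sketched as walking from each annotated term towards the root and keeping any slim term met on the way. The parent links, slim set and gene product names below are hypothetical toy data, and real slimming tools may treat relationship types and overlapping slim terms differently.

```python
# Toy parent links (child -> parents); illustrative, not a GO release.
PARENTS = {
    "mitochondrial matrix": ["mitochondrion"],
    "mitochondrial inner membrane": ["mitochondrion"],
    "mitochondrion": ["organelle"],
    "nucleolus": ["nucleus"],
    "nucleus": ["organelle"],
    "organelle": ["cellular component"],
}

# Hypothetical slim and annotations.
SLIM = {"mitochondrion", "nucleus"}
ANNOTATIONS = {
    "product_1": ["mitochondrial matrix", "mitochondrial inner membrane"],
    "product_2": ["nucleolus"],
}


def map_to_slim(term, slim):
    """Return the slim terms that are `term` itself or one of its ancestors."""
    found, seen, stack = set(), set(), [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in slim:
            found.add(current)
        stack.extend(PARENTS.get(current, []))
    return found


def slim_annotations(annotations, slim):
    """Replace each product's specific terms with the slim terms they map up to."""
    slimmed = {}
    for product, terms in annotations.items():
        slim_terms = set()
        for term in terms:
            slim_terms |= map_to_slim(term, slim)
        slimmed[product] = slim_terms
    return slimmed


print(slim_annotations(ANNOTATIONS, SLIM))
```

In this toy data the two leaf annotations of product_1 collapse to the single slim term mitochondrion.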
GO slims
Custom slims are available for download:
http://www.geneontology.org/GO.slims.shtml
or you can make your own using:
• AmiGO's GO slimmer http://amigo.geneontology.org/cgi-bin/amigo/slimmer
• QuickGO http://www.ebi.ac.uk/QuickGO
www.ebi.ac.uk/QuickGO
Map-up annotations with GO slims
The EBI's QuickGO browser
Search GO terms or proteins
Find sets of GO annotations
Precautions when using GO annotations for analysis
• Recommended that ‘NOT’ annotations are removed before analysis
- only ~3000 out of 57 million annotations are ‘NOT’
- can confuse the analysis
• The Gene Ontology is always changing and GO annotations are continually being created
- always use a current version of both
- if publishing your analyses please report the versions/dates you used
http://www.geneontology.org/GO.cite.shtml
Precautions when using GO annotations for analysis
• Unannotated is not unknown
- where there is no evidence in the literature for a process, function or location, the gene product is annotated to the appropriate ontology’s root node with an ‘ND’ evidence code (no biological data), thereby distinguishing between unannotated and unknown
• Pay attention to under-represented GO terms
- a strong under-representation of a pathway may mean that normal functioning of that pathway is necessary for the given condition
Exercise
Use the QuickGO web services to retrieve annotations
Exercise 1 Pg.46
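A hedged Python sketch of retrieving annotations over HTTP. It assumes the present-day QuickGO REST endpoint and its `geneProductId` and `limit` query parameters, which are not given in these slides and may differ from the service used in the 2012 exercise; check the QuickGO documentation before relying on it. The function names are this example's own.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint; not taken from the slides.
BASE_URL = "https://www.ebi.ac.uk/QuickGO/services/annotation/search"


def build_annotation_url(gene_product_id, limit=25):
    """Build the query URL for annotations to one gene product."""
    query = urlencode({"geneProductId": gene_product_id, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_annotations(gene_product_id, limit=25):
    """Request the annotations as JSON and return the parsed response."""
    request = Request(
        build_annotation_url(gene_product_id, limit),
        headers={"Accept": "application/json"},
    )
    with urlopen(request, timeout=30) as response:
        return json.load(response)


# Building the URL needs no network access; fetch_annotations("P00505") does.
print(build_annotation_url("P00505"))
```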
The UniProt-GOA group
Curators:
Software developer:
Team leaders:
Rachael Huntley
Rolf Apweiler
Email: [email protected]
http://www.ebi.ac.uk/GOA
Claire O’Donovan
Tony Sawford
Yasmin Alam-Faruque
Prudence Mutowo
Project leader:
Acknowledgements
Members of:
Ensembl
Ensembl Genomes
GO Consortium
UniProtKB
InterPro
IntAct
HAMAP
Funding
National Human Genome Research Institute (NHGRI)
British Heart Foundation
EMBL