Translating research data into Gene Ontology annotations
Pascale Gaudet, SIB – Swiss Institute of Bioinformatics
GO Consortium
Gene Ontology Consortium: What we provide
Ontology + Annotations = Model of biology
Ontology: a structured representation of biology, composed of:
• Classes
• Relations
• Definitions
Annotations: statements about the functions of specific gene products. 3 aspects:
• Molecular function
• Biological process
• Cellular component
Examples:
IGHA1 (Immunoglobulin heavy constant alpha 1)
- Antigen binding
- Adaptive immune response
- Extracellular
QARS (Gln tRNA synthetase)
- Glutamine-tRNA ligase activity
- Translation
- Cytoplasm
Model of biology: representation of current knowledge in a manner that is:
• Human understandable
• Machine computable
GO “annotations”
§ An annotation is a statement linking a gene to some aspect of its function (a GO ontology term)
§ Each annotation is based on some evidence, recorded as part of the annotation
- Evidence code (type of evidence)
- Reference (published journal article)
Examples:
Annotation 1: INSR + ‘receptor activity’
Annotation 2: INSR + ‘plasma membrane’
Annotation 3: INSR + ‘insulin receptor signaling pathway’
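The structure of an annotation described above (gene, GO term, evidence code, reference) can be sketched as a small record type. This is a minimal illustration, not the GO Consortium's file format: the class and field names are assumptions made for this example, and because the slide gives no evidence codes or references for the INSR examples, those fields are left as "unspecified" rather than filled in.

```python
from dataclasses import dataclass
from enum import Enum


class Aspect(Enum):
    """The three aspects of the Gene Ontology."""
    MOLECULAR_FUNCTION = "molecular function"
    BIOLOGICAL_PROCESS = "biological process"
    CELLULAR_COMPONENT = "cellular component"


@dataclass(frozen=True)
class Annotation:
    """Hypothetical record type: one statement linking a gene to a GO term."""
    gene: str           # gene or gene product symbol, e.g. "INSR"
    term: str           # GO term label
    aspect: Aspect      # which of the three aspects the term belongs to
    evidence_code: str  # type of evidence
    reference: str      # published journal article supporting the statement


# The slide does not state evidence or references for these examples,
# so both fields are deliberately left unspecified.
UNSPECIFIED = "unspecified"

insr_annotations = [
    Annotation("INSR", "receptor activity",
               Aspect.MOLECULAR_FUNCTION, UNSPECIFIED, UNSPECIFIED),
    Annotation("INSR", "plasma membrane",
               Aspect.CELLULAR_COMPONENT, UNSPECIFIED, UNSPECIFIED),
    Annotation("INSR", "insulin receptor signaling pathway",
               Aspect.BIOLOGICAL_PROCESS, UNSPECIFIED, UNSPECIFIED),
]


def aspects_covered(annotations):
    """Return the set of GO aspects touched by a list of annotations."""
    return {a.aspect for a in annotations}
```

Taken together, the three INSR annotations cover all three aspects, which is what `aspects_covered(insr_annotations)` returns.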
Semantics of a GO annotation
The association of a GO class with a gene product is a statement that means:
§ molecular function: molecular activities of gene products
§ cellular component: where gene products are active
§ biological process: pathways and larger processes made up of the activities of multiple gene products
§ In other words, annotations represent the normal, in vivo biological role of gene products
How are annotations generated?
Manual - Literature-based (>500,000 annotations)
An expert reviews the literature and assigns functions, processes and cellular components to gene products.
Manual - Sequence-based (>3M annotations)
An expert analyses a sequence and makes a prediction concerning the gene function based on known functions of related sequences. The predictions can be based on the known function of evolutionarily related sequences (phylogenetic relationships).
Algorithmic (unreviewed) (>65M annotations)
A computer program analyses sequences and makes a prediction based on some decision criteria, for example:
- protein domain (InterPro2GO)
- sequence similarity (BLAST2GO)
Evidence types
Manual - Literature-based
EXP: experimental evidence
IDA: inferred from direct assay
IPI: inferred from physical interaction
IMP: inferred from mutant phenotype
Manual - Sequence-based
ISS: inferred from sequence similarity
ISO: inferred from sequence orthology
IBA: inferred from biological aspect of ancestor
Algorithmic (unreviewed)
IEA: inferred from electronic annotation
Chibucos MC, Siegele DA, Hu JC, Giglio M. (2017) Evidence and conclusion ontology. PMID: 27812948
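The evidence codes above can be held in a small lookup table, which makes it easy to separate reviewed from unreviewed annotations. This is a minimal sketch: the grouping of codes into three categories follows the slide layout as read here, and the constant and function names are assumptions introduced for this example.

```python
# Category labels, following the three column headings on the slide.
MANUAL_LITERATURE = "manual, literature-based"
MANUAL_SEQUENCE = "manual, sequence-based"
ALGORITHMIC = "algorithmic (unreviewed)"

# Evidence code -> (description, category).
EVIDENCE_CODES = {
    "EXP": ("experimental evidence", MANUAL_LITERATURE),
    "IDA": ("inferred from direct assay", MANUAL_LITERATURE),
    "IPI": ("inferred from physical interaction", MANUAL_LITERATURE),
    "IMP": ("inferred from mutant phenotype", MANUAL_LITERATURE),
    "ISS": ("inferred from sequence similarity", MANUAL_SEQUENCE),
    "ISO": ("inferred from sequence orthology", MANUAL_SEQUENCE),
    "IBA": ("inferred from biological aspect of ancestor", MANUAL_SEQUENCE),
    "IEA": ("inferred from electronic annotation", ALGORITHMIC),
}


def category_of(code):
    """Return the category of an evidence code (case-insensitive)."""
    return EVIDENCE_CODES[code.upper()][1]


def is_reviewed(code):
    """True if the code denotes an annotation reviewed by an expert."""
    return category_of(code) != ALGORITHMIC


def codes_in(category):
    """Return the evidence codes of one category, sorted alphabetically."""
    return sorted(c for c, (_, cat) in EVIDENCE_CODES.items() if cat == category)
```

For example, `is_reviewed("IEA")` is false while `is_reviewed("IBA")` is true, so a filter built on `is_reviewed` would keep phylogenetic annotations and drop purely electronic ones.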
Who produces GO annotations?
• Model organism databases (SGD, FlyBase, WormBase, MGI, etc.)
• Generalist databases, e.g. UniProtKB, IntAct
• Domain-specific projects: Cardiovascular project (UCL), synapse project (VU), etc.
• Anyone who wishes to contribute their expertise and data to the project
Best practices for generating literature-based GO annotations
§ Ensure consistency of usage across a broad consortium of contributors
§ Improve inferencing capabilities
Focus on the research hypothesis
§ Use prior knowledge to understand the hypothesis being tested and its relation to the experimental observation
DFFB (O76075)
- Known roles: DNase
- Hypothesis: The nuclease activity of DFFB is required for nuclear DNA fragmentation during apoptosis
- Assay result: Apoptotic DNA fragmentation increased in the presence of DFFB
- Conclusion for GO: DFFB mediates nuclear DNA fragmentation during apoptosis = apoptotic DNA fragmentation (GO:0006309)
FOXL2 (P58012)
- Known roles: Transcription factor
- Hypothesis: Mutations in FOXL2 are known to cause premature ovarian failure, which may be due to increased apoptosis
- Assay result: Apoptotic DNA fragmentation increased in the presence of FOXL2
- Conclusion for GO: FOXL2 increases the rate of apoptosis = positive regulation of apoptotic process (GO:0043065)
Annotate the conclusion, not the assay
1) Rubidium is often used to assay potassium transport, because the radioactive form is more readily available
- the physiologically relevant substrate is potassium
2) Protein kinases are often tested with non-physiologically relevant substrates, such as histone
- if the authors do not discuss the physiological relevance, one cannot annotate the substrate
On the in vivo relevance of phenotypes
• Phenotypes can help understand the function of proteins
• Phenotypes can provide insights into mechanisms leading to disease
• The scope of the GO, though, is to capture the normal function of proteins
Indirect effects of a mutation:
- RNA polymerase affects essentially all cellular processes (cell proliferation, development, etc.) but does not mediate these processes
Lack of hypothesis for a role of a protein in a process:
- Knockdown of Tmem234 in zebrafish results in defects in pronephric glomerulus formation. Annotation by IMP to glomerulus formation is not supported by any cellular/molecular data
Get the wider perspective
• Favor a gene-by-gene or pathway-by-pathway approach for curation rather than paper-by-paper
• Read recent publications
• Remove incorrect annotations based on invalidated hypotheses
Guidelines for high quality annotations
• Annotate the conclusion of the experiment
• Use the biological context to interpret the experiments
• Carefully select publications. Read recent publications
• Ensure consistency with existing annotations
• Keep annotations up to date: remove obsolete annotations
Other approaches for quality control
• Annotation consistency exercises
• Taxonomic constraints
• Co-occurrence of annotations
• Phylogenetic annotations
• User feedback
- from GO website
- from PubMed
- from databases