Evolving consensus-based curatorial strategies


www.guidetopharmacology.org

Will the real drugs and targets please stand up? Evolving consensus-based curatorial strategies

Chris Southan, IUPHAR/BPS Guide to PHARMACOLOGY Web portal Group, Centre for Integrative Physiology, School of Biomedical Sciences, University of Edinburgh, Hugh Robson Building, Edinburgh, EH8 9XD, UK. [email protected]

Presented to the Gloriam/GPCRDB Team and the Dept. of Pharmaceutical Sciences, University of Copenhagen, 6th May 2014


GToPdb: receptors, ligands, targets and drugs

• An expert-curated database overseen by the IUPHAR Nomenclature Committee (NC-IUPHAR)

• >70 subcommittees comprising ~700 international scientists working on individual target families.

• 4 full-time curators, 1 part-time admin, 1 developer.

• NC-IUPHAR publishes nomenclature recommendations and reviews on various topics in pharmacological journals and through the IUPHAR database.

• Subcommittees update their database pages annually.

• Continuously expanding to incorporate new data types, new targets and ligands, and new domain committees.

• Public database releases every 3-4 months.

Content

Detailed annotation

Pharmacological and clinical data

Wellcome Trust Grant 099156/Z/12/Z

• Key objective: “encompass all the human targets of current prescription medicines and the likely targets of future medicines”

• Conceptually familiar from our established receptor/channel-centric database

• But - needed to re-define curatorial approaches, caveats and end-points

• Balance between theoretical rigour and pragmatic utility

• Four foci - grant fulfilment, user value, data mining, data consumption

• Discuss and document changes in curatorial strategies with practical guidelines

• Add enhancements, new relationships and features

• Control activity-mapping stringencies and relationship distributions

• QC legacy content, harmonise and remediate where necessary

• Aim for small, but perfectly-formed, data content vs. complete coverage


Technical implementation

• Restrict relationships to citable/provenanced quantitative mappings (typically IC50, Ki, Kd)

• Formally tag data-supported “primary targets”

• Only data-supported polypharmacology

• Mask nutraceuticals, metabolites or endogenous hormones from bloating drug > target relationship space

• Limit drug > multiple subunit mappings to direct interactions

• Normalise targets to UniProt IDs (Swiss-Prot entries for human)

• Normalise drugs and ligands to PubChem compound records (CIDs)

• Extend useful relationships e.g. drug > prodrug, drug > active metabolite, ligand = target (antibody > cytokine)

• Flexibility to handle edge cases (e.g. heparinoids)

• Options for selective expansion (e.g. kinases, proteases and Alzheimer’s)
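The first rule above (only citable, quantitative mappings) can be sketched as a simple record filter. This is a minimal illustration, not the GToPdb implementation: the `Interaction` record, its field names and the sample identifiers are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional

# Activity types the slide names as typical quantitative end-points.
QUANTITATIVE_TYPES = {"IC50", "Ki", "Kd"}


@dataclass(frozen=True)
class Interaction:
    """Hypothetical ligand-target record; field names are illustrative."""
    ligand_cid: int              # PubChem compound identifier (CID)
    target_accession: str        # UniProt accession
    activity_type: str           # e.g. "IC50", "Ki", "Kd", or a free-text note
    value_nm: Optional[float]    # measured value in nM, if reported
    pmid: Optional[int]          # citation that provides provenance


def is_curatable(rec: Interaction) -> bool:
    """True only for mappings that are both quantitative and cited."""
    return (
        rec.activity_type in QUANTITATIVE_TYPES
        and rec.value_nm is not None
        and rec.pmid is not None
    )


def filter_interactions(records: List[Interaction]) -> List[Interaction]:
    """Drop every record that fails the citable/quantitative rule."""
    return [r for r in records if is_curatable(r)]


# Placeholder records: identifiers and values are made up for illustration.
records = [
    Interaction(1001, "X00001", "Ki", 12.0, 11111111),
    Interaction(1002, "X00002", "IC50", 250.0, None),          # no citation
    Interaction(1003, "X00003", "inhibitor", None, 22222222),  # not quantitative
]
kept = filter_interactions(records)
print([r.ligand_cid for r in kept])
```

Only the first placeholder record passes, because the second lacks a citation and the third lacks a quantitative value.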


Defining limits for curation

• The good news: capture of targets and drugs in databases and literature reports is continuously expanding

• The bad news: no one agrees on numbers, relationship definitions, curatorial rules, identifiers, exact molecular structures, choices of primary sources or provenance attribution

• More bad news: source proliferation < “circular” annotation

• Human target range: 186 approved drugs in 2006 (PMID:17139284) < 3,044 in ChEMBL_18

• Approved drug ranges: 1,216 FDA Maximum Daily Dose (PubChem Assay ID 1195) < 2,750 for the NCGC Pharmaceutical Collection (PMID:21525397)

• Outer bioactivity ranges: 8,057 INNs < 928,875 actives in PubChem BioAssays < 6.3 million from GVKBIO with SAR from papers and patents


Evolution of our consensus strategy

Based on many collective years of curatorial engagement and deep source knowledge we now pursue a consensus approach for the following reasons:

1. Concordant sources are generally more likely to be right than wrong

2. Curatorial efficiency of starting with solid consensus sets

3. Multiple sources are informatically synergistic (if truly independent)

4. Approach is flexible via source updates and testing different filters

5. We control total numbers for matching to curatorial capacity

6. The concept can easily be explained to users

7. The exercise of comparing sources is very informative

8. It forces entity identifier normalisation (via cross-mapping if necessary)

9. Consensus lists per se have value for users (e.g. hosting on website)


Will the real targets please stand up?

• Compared as human Swiss-Prot IDs for 2013 database releases

• Intersect is 351; the union is 3,046 (i.e. 15% of the 20,265-protein human proteome)

• Lists included approved, clinical and research targets

Figure 7d from: “Comparing the chemical structure and protein content of ChEMBL, DrugBank, Human Metabolome Database and the Therapeutic Target Database” PMID: 24533037
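The intersect, union and proteome-coverage figures quoted on this slide come from ordinary set operations on accession lists. A minimal sketch follows, assuming each source has already been reduced to a set of human Swiss-Prot accessions. The source names and toy accessions are placeholders; only the 3,046 and 20,265 counts come from the slide.

```python
from typing import Dict, Set, Tuple


def intersect_and_union(sources: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Return (IDs present in every source, IDs present in any source).

    Expects at least one source.
    """
    sets = list(sources.values())
    return set.intersection(*sets), set.union(*sets)


def percent_of(part: int, whole: int) -> float:
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


# Toy accession sets standing in for the real target lists.
toy_sources = {
    "source_a": {"T1", "T2", "T3"},
    "source_b": {"T2", "T3", "T4"},
    "source_c": {"T3", "T5"},
}
common, combined = intersect_and_union(toy_sources)

# Coverage figure quoted on the slide: union of 3,046 out of 20,265 proteins.
coverage = percent_of(3046, 20265)
print(sorted(common), len(combined), round(coverage, 1))
```

The gap between a small intersect and a large union is the quantity of interest, since it shows how much the sources disagree.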


Gene Ontology comparison indicates source selectivity


Use a target consensus to populate the database

• ChEMBL 17, 252 approved

• Mathias Rask-Andersen et al., July 2013, 481 approved

• Southan et al., 2013, 3-way human DrugBank/ChEMBL/TTD 352

• 3-way or 2-way, 19 + 40 + 143 = 202 Targets Of Approved Drugs (TOADS) set selected for GToP upload
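A "3-way or 2-way" selection can be expressed as a vote count per target: keep every ID that at least two of the three lists support. The sketch below assumes that reading. The set contents are toy placeholders rather than real target lists, and the slide does not say what each of the three summed terms represents, so only their total is checked.

```python
from collections import Counter
from typing import Dict, Set


def support_counts(sources: Dict[str, Set[str]]) -> Counter:
    """Count how many sources list each ID (one vote per source)."""
    counts: Counter = Counter()
    for ids in sources.values():
        counts.update(ids)
    return counts


def consensus_set(sources: Dict[str, Set[str]], min_sources: int = 2) -> Set[str]:
    """IDs supported by at least min_sources of the input lists."""
    return {i for i, n in support_counts(sources).items() if n >= min_sources}


# Toy lists: the labels echo the slide, the contents are placeholders.
toy = {
    "chembl": {"A", "B", "C"},
    "drugbank": {"B", "C", "D"},
    "ttd": {"C", "D", "E"},
}
three_way = consensus_set(toy, min_sources=3)
two_or_more = consensus_set(toy, min_sources=2)

# Total quoted on the slide for the TOADS upload set.
toads_total = 19 + 40 + 143
print(sorted(three_way), sorted(two_or_more), toads_total)
```

Raising or lowering `min_sources` is one way to match the size of the consensus set to curatorial capacity.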


Will the real drugs please stand up?

• Work up the following CID triage inside PubChem

• Select DrugBank 1504 “approved” drug structures

• Select two additional sources TTD and ChEMBL

• Filter to remove salts and mixtures

• Select synonym INN (WHO International Non-proprietary Name).

• The final step was the Boolean intersect between all five
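The triage was run inside PubChem's own query interface. The sketch below only mimics the logic with plain Python sets, recording the surviving count after each step. It assumes the five sets are the three sources plus the salt/mixture and INN filters, which the slide does not spell out, and all CIDs shown are toy values.

```python
from typing import Callable, List, Set, Tuple

Step = Tuple[str, Callable[[Set[int]], Set[int]]]


def triage(start: Set[int], steps: List[Step]) -> Tuple[Set[int], List[Tuple[str, int]]]:
    """Apply each filter in order and log how many CIDs survive it."""
    current = set(start)
    log = [("start", len(current))]
    for label, step in steps:
        current = step(current)
        log.append((label, len(current)))
    return current, log


# Toy CID sets (placeholders, not real PubChem content).
drugbank_approved = {1, 2, 3, 4, 5, 6}
ttd = {2, 3, 4, 5, 7}
chembl = {1, 2, 3, 4, 5, 8}
salts_or_mixtures = {4}
has_inn_synonym = {1, 2, 3, 4}

steps: List[Step] = [
    ("also in TTD", lambda s: s & ttd),
    ("also in ChEMBL", lambda s: s & chembl),
    ("remove salts and mixtures", lambda s: s - salts_or_mixtures),
    ("has INN synonym", lambda s: s & has_inn_synonym),
]
final, log = triage(drugbank_approved, steps)
for label, count in log:
    print(f"{label}: {count}")
```

Logging the count at every step makes it easy to see which filter removes the most structures, and to test the effect of dropping a source.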


Observations and caveats

• This set of 923 drugs can be accessed via the MyNCBI open URL http://www.ncbi.nlm.nih.gov/sites/myncbi/collections/public/1Fo7u3apR1bzS_UWr1YhHOTkZ/

• TTD was last submitted in Feb 2012, so drug content is capped to before that date (dropping TTD gives 1,117 CIDs)

• Some metabolites (e.g. amino acids) come through the filters

• Older drugs have no INN (e.g. aspirin)

• Some peptide drug CIDs are missing (suggesting low concordance)

• Approved fixed-mixtures are excluded (they do not get an INN)

• The computed CID identity is actually a hash-code match, rather than via InChIKey (but this should give similar numbers)

• Each of the 923 had 76 submissions (SIDs)

• Applying “same (bond) connectivity” gives 18,749 but removing the virtual deuterated entries reduces this to 6,919 (i.e. the 923 have, on average, 7.5 alternative stereo CIDs)
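The figures on this slide can be cross-checked with a few lines of arithmetic. The variable names are illustrative; the inputs are the counts quoted above.

```python
# Counts quoted on the slide.
consensus_drugs = 923
same_connectivity_all = 18749
same_connectivity_no_deuterated = 6919
cids_without_ttd = 1117

# Entries dropped when the virtual deuterated records are removed.
deuterated_removed = same_connectivity_all - same_connectivity_no_deuterated

# Average number of same-connectivity CIDs per consensus drug.
mean_alternatives = same_connectivity_no_deuterated / consensus_drugs

# Extra CIDs gained if TTD is dropped from the intersect.
gain_without_ttd = cids_without_ttd - consensus_drugs

print(deuterated_removed, round(mean_alternatives, 1), gain_without_ttd)
```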





Closing consensus drugs > targets

• From Phase 1 for targets > drugs we have moved to Phase 2 for drugs > targets

• Current stats = 228 TOADS (inward mapping expanded the set by ~10%)

• Current stats = 996 approved drugs (need to complete the activity mappings)

• Note that antibodies and larger peptides (with no PubChem CIDs) are subsumed in the 996

• 2013 new drug CIDs loaded http://cdsouthan.blogspot.se/2014/03/the-drugs-of-2013-in-pubchem.html

• Will back-fill 2010-2012 new approvals as ligands, targets and activities (but most already there)



GPCRdb/GToPdb collaborative opportunity

• Inspect which GPCRs are concordant or discordant between the target lists

• Might be able to do a similar exercise for GPCR-active drug/compound lists – depending on what we can find with linkage (e.g. GLIDA)

• Work up a triage for alert triggers for new GPCR ligand structures in PDB (e.g. via MMDB)


References and Acknowledgments

The database team: Adam Pawson, Joanna Sharman, Helen Benson, Elena Faccenda
