EXPLORING CHEMICAL AND BIOLOGICAL KNOWLEDGE SPACES WITH PUBCHEM
Dr. Paul A. Thiessen, NCBI
2013/03/21 draft
What is a “Knowledge Space”?
• May be a database
• But may be a concept not encapsulated in a database
Literature (PubMed)
Chemicals (PubChem)
Targets (sequences)
Genes
Diseases
Patents
Drugs
Assays (PubChem)
Connecting the Spaces
• Database cross-links
Cross-links shown between Literature (PubMed), Chemicals (PubChem), Targets (sequences), and Assays (PubChem): Active, Inactive, MeSH, Depositor
Moving Within a Space
Neighbors… some examples
Chemicals (PubChem)
• Same connectivity
• Same parent
• Similar by 2D or 3D
Assays (PubChem)
• Similar target (BLAST)
• Similar sets of screened chemicals
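As an illustration of "similar by 2D", the sketch below computes a Tanimoto coefficient between two fingerprints represented as sets of bit positions. Tanimoto is a commonly used 2D similarity measure; the slides do not say how the neighbors are computed, so treat the measure, the 0.9 threshold, and the toy fingerprints as assumptions.

```python
from typing import Dict, Iterable, List


def tanimoto(a: Iterable[int], b: Iterable[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as set-bit positions."""
    sa, sb = set(a), set(b)
    union = sa | sb
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(sa & sb) / len(union)


def neighbors(query: Iterable[int],
              library: Dict[str, Iterable[int]],
              threshold: float = 0.9) -> List[str]:
    """Names of library entries whose similarity to the query meets the threshold."""
    q = set(query)
    return sorted(name for name, fp in library.items()
                  if tanimoto(q, fp) >= threshold)


# Hypothetical toy fingerprints (bit positions), not real PubChem fingerprints.
library = {
    "cmpd_a": set(range(1, 11)),
    "cmpd_b": set(range(1, 10)),
    "cmpd_c": {1, 2, 50, 51},
}
query = set(range(1, 11))
similar = neighbors(query, library)
```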
Drug Repurposing as a Spatial Transformation
One possible route…
Diagram labels: Drugs, Search, Targets, Diseases (known), Diseases (hypothesized), Similarity
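One way to read such a route is as a chain of link lookups: drug to targets, targets to similar targets, targets to diseases, then subtracting the diseases already known. This is a minimal sketch under that reading; the link tables, identifiers, and the helper `follow` are all hypothetical toy data, not PubChem records or an actual PubChem workflow.

```python
from typing import Dict, Iterable, Set

Links = Dict[str, Set[str]]


def follow(ids: Iterable[str], links: Links) -> Set[str]:
    """Map a set of identifiers through one link table (one hop between spaces)."""
    out: Set[str] = set()
    for i in ids:
        out |= links.get(i, set())
    return out


# Hypothetical toy link tables.
drug_to_target: Links = {"drugA": {"T1"}, "drugB": {"T2"}}
target_neighbors: Links = {"T1": {"T1", "T3"}, "T2": {"T2"}}  # similarity within the target space
target_to_disease: Links = {"T1": {"disease1"}, "T3": {"disease2"}, "T2": {"disease3"}}


def hypothesized_diseases(drug: str) -> Set[str]:
    """Diseases reachable via similar targets that are not already known for the drug."""
    targets = follow([drug], drug_to_target)
    similar = follow(targets, target_neighbors)
    candidates = follow(similar, target_to_disease)
    known = follow(targets, target_to_disease)
    return candidates - known
```

Each hop is just a set-valued lookup, so the same helper works for any pair of spaces that has a link table.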
What is in PubChem
• 117M Substances (SIDs)
○ Information from depositors, including links to PubMed, sequences, structures, patents, etc.
• 47M Compounds (CIDs)
○ Derived from Substances (including links)
○ Computed properties
• 650k Assays (AIDs)
○ ~200M test results on SIDs
○ Links to target sequences
Some PubChem Statistics
• All CIDs: 46,814,409
• Unique parents by connectivity: 36,806,372
• Rule of 5: 34,343,056
• Rule of 5 but MW 250-800: 31,483,865
• Active in any BioAssay: 824,028
• Tested in any BioAssay: 1,872,313
• Experimental 3D (mainly PDB): 41,406
• Computed 3D (multiple confs + neighbors): 42,252,570
• Pharmacological Actions: 11,531
• Biosystems: 9,703
• Chemical vendors: 28,852,943
• NIH Molecular Libraries: 402,076
• Patent sources: 14,512,499
• Patent links: 5,978,538
… as of 2013/03/20
What is in NCBI Entrez
• Many other databases…
○ PubMed
○ Protein/Nucleotide sequences
○ Genes
○ Biosystems (metabolic pathways)
○ PDB structures (with VAST neighbors)
• Text and numeric search fields
• Cross-links
○ Between databases
○ Within databases (neighbors)
How Entrez Works
• Search results = list of identifiers
• Boolean operations on lists (query refinement)
• Links from one database to another
Diagram: two PubChem searches each yield a CID list; a link to PubMed turns a CID list into a PMID list
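The search, refine, link pattern can be sketched against the NCBI E-utilities endpoints `esearch.fcgi` and `elink.fcgi`. The block below only builds request URLs and combines ID lists locally, so it runs without network access. The database names `pccompound` and `pubmed` are Entrez database names; the search terms and CIDs in the usage lines are arbitrary placeholders, and `refine` is a hypothetical helper standing in for Entrez's own Boolean operations.

```python
from typing import Iterable, List
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 500) -> str:
    """URL for a text search that returns a list of identifiers."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?" + urlencode(params)


def elink_url(dbfrom: str, db: str, ids: Iterable[int]) -> str:
    """URL for following links from one database to another."""
    params = {"dbfrom": dbfrom, "db": db, "id": ",".join(str(i) for i in ids)}
    return f"{EUTILS}/elink.fcgi?" + urlencode(params)


def refine(a: Iterable[int], b: Iterable[int], op: str = "AND") -> List[int]:
    """Boolean combination of two ID lists (local stand-in for query refinement)."""
    sa, sb = set(a), set(b)
    if op == "AND":
        return sorted(sa & sb)
    if op == "OR":
        return sorted(sa | sb)
    if op == "NOT":
        return sorted(sa - sb)
    raise ValueError(f"unknown operator: {op}")


# Placeholder usage: two searches, one refinement, one link to PubMed.
search_1 = esearch_url("pccompound", "placeholder term one")
search_2 = esearch_url("pccompound", "placeholder term two")
cids = refine([2244, 3672, 5090], [3672, 5090, 9999], "AND")
to_pubmed = elink_url("pccompound", "pubmed", cids)
```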
Limitations of Entrez
• Only text or numeric search
• Search fields hard to discover
• Search fields and defaults vary by database
• Chemical structure search, and other specialized algorithms, must be done outside Entrez
• The kicker: links are incomplete
○ Only 500-10,000 ids!
○ Limit also varies by database
Working Around the Limitations
• Scripting
○ E-Utils, PUG SOAP/REST, etc.
○ Break queries into smaller chunks
• Specialized services
○ PubChem’s ID Exchange
○ Classification trees (with associated IDs)
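Breaking a query into smaller chunks might look like the sketch below, which splits a long CID list into batches and builds one PUG REST property URL per batch. The batch size of 200 and the helper names are assumptions for illustration, not documented limits; no requests are actually sent.

```python
from typing import Iterator, List, Sequence

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def chunked(ids: Sequence[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive batches of at most `size` identifiers."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def property_urls(cids: Sequence[int],
                  props: Sequence[str] = ("MolecularWeight",),
                  size: int = 200) -> List[str]:
    """One PUG REST compound-property URL per batch of CIDs."""
    prop_list = ",".join(props)
    urls = []
    for batch in chunked(cids, size):
        cid_list = ",".join(str(c) for c in batch)
        urls.append(f"{PUG_REST}/compound/cid/{cid_list}/property/{prop_list}/JSON")
    return urls


# Example: 450 placeholder CIDs become three requests of 200, 200, and 50 IDs.
urls = property_urls(list(range(1, 451)))
```

Results from the individual batches would then be concatenated by the calling script.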
What is not in Entrez
… as a database per se, but which may be imported and linked to PubChem
• Drugs (sort of but not really)
• Targets (again sort of)
• Diseases
• Patents
Some Public Sources of Information Relevant to Drugs and Repurposing
• United States (FDA, NLM, NCBI, …): ClinicalTrials.gov, NDF(-RT), RxNorm, HSDB, MeSH, DailyMed, PubMed, PubMed Health, USPTO
• Europe: ChEBI / ChEMBL, EPO / WIPO
• Canada: DrugBank
• Japan: KEGG
… not an exhaustive list
… some are linked to PubChem
… some are works in progress
MeSH and ChEBI
Chemical structure classification
Biological role
Pharmacological action
KEGG and DrugBank
Drug classification
Targets
Patents
• PubChem depositors
• Per SID:
○ Patent IDs
○ PubMed IDs
• Classifications: ECLA, IPC, USPC, CPC
Aside: Patent Summaries
NDF-RT
• Molecular interactions
• Drug ingredients
• Diseases (with drugs)
• Physiological effects
Has links to MeSH… which leads to CIDs
NDF-RT linked to SID, CID
Classifications as Navigation Tools
Where are the CIDs in the tree?
• Example: chemicals affecting serotonin transporters according to KEGG
Classifications for Query Refinement
Where are MY CIDs in the tree?
• Example: what diseases are linked by NDF to KEGG’s serotonin transport drugs?
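"Where are MY CIDs in the tree?" amounts to intersecting a user's CID set with the CIDs attached at or below each node. A minimal sketch, assuming a small in-memory tree; the `Node` class, node names, and CIDs are hypothetical and do not reproduce any KEGG or NDF-RT classification.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Node:
    name: str
    cids: Set[int] = field(default_factory=set)
    children: List["Node"] = field(default_factory=list)


def all_cids(node: Node) -> Set[int]:
    """CIDs attached to this node or any of its descendants."""
    result = set(node.cids)
    for child in node.children:
        result |= all_cids(child)
    return result


def locate(node: Node, my_cids: Set[int], path: Tuple[str, ...] = ()) -> Dict[str, int]:
    """Map each tree path containing any of my CIDs to the number of hits beneath it."""
    here = path + (node.name,)
    hits = all_cids(node) & my_cids
    found: Dict[str, int] = {}
    if hits:
        found[" > ".join(here)] = len(hits)
        for child in node.children:
            found.update(locate(child, my_cids, here))
    return found


# Hypothetical toy classification.
tree = Node("root", children=[
    Node("A", {1, 2}, [Node("A1", {3})]),
    Node("B", {4}),
])
where = locate(tree, {2, 3})
```

Branches with no hits are pruned, so only the parts of the tree relevant to the query are reported.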
Big Classifications… Some Engineering Required
WIPO IPC
• 72,000 tree nodes
• 6,000,000 CIDs
• 124,000,000 node-CID links
Filtering on the fly:
• 22,000 CIDs from PDB
… interactive!
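One way such filtering could stay interactive is to index the node-CID links by CID, so that the work scales with the size of the filter set rather than with the total number of links. This is an assumed design for illustration only, not a description of how PubChem implements it; the function names and toy links are hypothetical, and links are assumed to be unique (node, CID) pairs.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def build_index(node_cid_links: Iterable[Tuple[str, int]]) -> Dict[int, List[str]]:
    """Invert (node, CID) links into a CID -> nodes lookup, built once up front."""
    index: Dict[int, List[str]] = defaultdict(list)
    for node, cid in node_cid_links:
        index[cid].append(node)
    return index


def filtered_counts(index: Dict[int, List[str]], filter_cids: Iterable[int]) -> Counter:
    """Per-node CID counts restricted to the filter set."""
    counts: Counter = Counter()
    for cid in set(filter_cids):
        counts.update(index.get(cid, ()))
    return counts


# Hypothetical toy links: (tree node, CID).
links = [("n1", 1), ("n1", 2), ("n2", 2), ("n3", 3)]
index = build_index(links)
counts = filtered_counts(index, [2, 3, 99])
```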
More Space to Explore
Literature (PubMed)
Chemicals (PubChem)
Targets (sequences)
Genes
Diseases
Patents
Drugs
Assays (PubChem)
… and beyond
Conclusions
PubChem is…
• A very generalized system
• Based on open data
• Part of the larger Entrez collection
We strive to…
• Make analysis across multiple knowledge spaces accessible and powerful
• Enable hypothesis generation for drug repurposing (as one scenario among many)
Feedback is always welcome!
info@ncbi.nlm.nih.gov
Acknowledgements
Evan Bolton
Steve Bryant
Asta Gindulyte (classification front end)
Chris Southan
… Thank You!