Exploring Chemical and Biological Knowledge Spaces with PubChem


Dr. Paul A. Thiessen, NCBI

2013/03/21 draft

What is a “Knowledge Space”?

• May be a database
• But may be a concept not encapsulated in a database

• Literature (PubMed)
• Chemicals (PubChem)
• Targets (sequences)
• Genes
• Diseases
• Patents
• Assays (PubChem)

Connecting the Spaces

Database cross-links

[Diagram: Literature (PubMed), Chemicals (PubChem), Targets (sequences), and Assays (PubChem), connected by links labeled Active, Inactive, and Depositor]

Moving Within a Space

Neighbors… some examples

Chemicals (PubChem)
• Same connectivity
• Same parent
• Similar by 2D or 3D

Assays (PubChem)
• Similar target (BLAST)
• Similar sets of screened chemicals

Drug Repurposing as a Spatial Transformation

One possible route…

[Diagram labels: Search, Drugs, Similarity, Targets, Diseases (known), Diseases (hypothesized)]
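The idea of repurposing as a walk across spaces can be illustrated with a small sketch. It assumes one particular route (drug → similar drugs → targets → diseases, minus the diseases already known for the drug); the slide does not spell out the route, so this ordering is an assumption. All identifiers (`drug_A`, `target_1`, `disease_X`, …) and the helper names `follow` and `hypothesize_diseases` are hypothetical placeholders, not PubChem data or API.

```python
def follow(ids, links):
    """Follow a link table from a set of ids to the union of their linked ids."""
    out = set()
    for i in ids:
        out |= set(links.get(i, ()))
    return out


def hypothesize_diseases(drug, similar, drug_targets, target_diseases, known_diseases):
    """Walk drug -> similar drugs -> targets -> diseases, then drop known diseases."""
    neighbors = follow({drug}, similar)                  # move within the drug space
    targets = follow(neighbors | {drug}, drug_targets)   # cross into the target space
    candidates = follow(targets, target_diseases)        # cross into the disease space
    return candidates - set(known_diseases.get(drug, ()))


# Toy link tables (placeholders only, not real data).
similar = {"drug_A": ["drug_B"]}
drug_targets = {"drug_A": ["target_1"], "drug_B": ["target_1", "target_2"]}
target_diseases = {"target_1": ["disease_X"], "target_2": ["disease_Y"]}
known_diseases = {"drug_A": ["disease_X"]}

hypotheses = hypothesize_diseases(
    "drug_A", similar, drug_targets, target_diseases, known_diseases
)
```

Each hop is just a lookup in a link table, so swapping in a different route only means changing the order of the `follow` calls.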

What is in PubChem

• 117M Substances (SIDs)
  ○ Information from depositors, including links to PubMed, sequences, structures, patents, etc.
• 47M Compounds (CIDs)
  ○ Derived from Substances (including links)
  ○ Computed properties
• 650k Assays (AIDs)
  ○ ~200M test results on SIDs
  ○ Links to target sequences

Some PubChem Statistics

All CIDs                                    46,814,409
Unique parents by connectivity              36,806,372
Rule of 5                                   34,343,056
Rule of 5 but MW 250-800                    31,483,865
Active in any BioAssay                         824,028
Tested in any BioAssay                       1,872,313
Experimental 3D (mainly PDB)                    41,406
Computed 3D (multiple confs + neighbors)    42,252,570
Pharmacological Actions                         11,531
Biosystems                                       9,703
Chemical vendors                            28,852,943
NIH Molecular Libraries                        402,076
Patent sources                              14,512,499
Patent links                                 5,978,538

… as of 2013/03/20

What is in NCBI Entrez

• Many other databases…
  ○ PubMed
  ○ Protein/Nucleotide sequences
  ○ Genes
  ○ Biosystems (metabolic pathways)
  ○ PDB structures (with VAST neighbors)
• Text and numeric search fields
• Cross-links
  ○ Between databases
  ○ Within databases (neighbors)

How Entrez Works

• Search results = list of identifiers
• Boolean operations on lists (query refinement)
• Links from one database to another

PubChem Search → CID List

PubChem Search → CID List → to PubMed → PMID List
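The search, refine, and link steps map onto the public E-utilities endpoints (`esearch.fcgi` and `elink.fcgi`). The following is a minimal sketch, not the presenter's code: it only builds request URLs (no network access) and performs the Boolean list operations locally. The helper names `esearch_url`, `elink_url`, and `refine` are illustrative, and the `retmax` default of 500 is an assumption.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=500):
    """URL for an Entrez text search; the response is a list of identifiers."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS}/esearch.fcgi?{query}"


def elink_url(dbfrom, db, ids):
    """URL that follows cross-links from ids in one database to another."""
    id_list = ",".join(str(i) for i in ids)
    query = urlencode({"dbfrom": dbfrom, "db": db, "id": id_list})
    return f"{EUTILS}/elink.fcgi?{query}"


def refine(list_a, list_b, op="AND"):
    """Boolean operation on two identifier lists (local query refinement)."""
    a, b = set(list_a), set(list_b)
    if op == "AND":
        result = a & b
    elif op == "OR":
        result = a | b
    elif op == "NOT":
        result = a - b
    else:
        raise ValueError(f"unknown operator: {op}")
    return sorted(result)


# PubChem Search -> CID List -> to PubMed -> PMID List, as request URLs.
search = esearch_url("pccompound", "aspirin")
link = elink_url("pccompound", "pubmed", [2244, 3672])
```

Here `pccompound` and `pubmed` are Entrez database names; fetching and parsing the responses is left out to keep the sketch self-contained.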

Limitations of Entrez

• Only text or numeric search
• Search fields hard to discover
• Search fields and defaults vary by database
• Chemical structure search, and other specialized algorithms, must be done outside Entrez
• The kicker: links are incomplete
  ○ Only 500-10,000 ids!
  ○ Limit also varies by database

Working Around the Limitations

• Scripting
  ○ E-Utils, PUG SOAP/REST, etc.
  ○ Break queries into smaller chunks
• Specialized services
  ○ PubChem’s ID Exchange
  ○ Classification trees (with associated IDs)
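Breaking a query into smaller chunks might look like the sketch below. It splits a long CID list into batches and builds one PUG REST property URL per batch, without making any network calls. The batch size of 100 and the helper names `chunks` and `property_urls` are assumptions chosen for illustration; the appropriate batch size depends on the service and its limits.

```python
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def chunks(ids, size):
    """Yield successive batches of at most `size` identifiers."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def property_urls(cids, properties=("MolecularFormula", "MolecularWeight"), size=100):
    """Build one PUG REST property request URL per batch of CIDs."""
    props = ",".join(properties)
    urls = []
    for batch in chunks(list(cids), size):
        id_list = ",".join(str(cid) for cid in batch)
        urls.append(f"{PUG_REST}/compound/cid/{id_list}/property/{props}/JSON")
    return urls


# Five CIDs in batches of two give three requests.
example_urls = property_urls([1, 2, 3, 4, 5], size=2)
```

The results of the individual requests would then be concatenated client-side to recover the answer for the full list.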

What is not in Entrez

… as a database per se, but which may be imported and linked to PubChem

• Drugs (sort of but not really)
• Targets (again sort of)
• Diseases
• Patents

Some Public Sources of Information Relevant to Drugs and Repurposing

• United States (FDA, NLM, NCBI, …)
  ○ ClinicalTrials.gov
  ○ NDF(-RT)
  ○ RxNorm
  ○ HSDB
  ○ MeSH
  ○ DailyMed
  ○ PubMed, PubMed Health
  ○ USPTO
• Europe
  ○ ChEBI / ChEMBL
  ○ EPO / WIPO
• Canada
  ○ DrugBank
• Japan
  ○ KEGG

… not an exhaustive list

… some are linked to PubChem

… some are works in progress

MeSH and ChEBI

Chemical structure classification

Biological role

Pharmacological action

KEGG and DrugBank

Drug classification

Targets

Patents

• PubChem depositors
  ○ Per SID: Patent IDs, PubMed IDs
• Classifications
  ○ ECLA
  ○ IPC
  ○ USPC
  ○ CPC

Aside: Patent Summaries

NDF-RT

• Molecular interactions
• Drug ingredients
• Diseases (with drugs)
• Physiological effects

Has links to MeSH… which leads to CIDs

NDF-RT linked to SID, CID

Classifications as Navigation Tools

Where are the CIDs in the tree?

• Example: chemicals affecting serotonin transporters according to KEGG

Classifications for Query Refinement

Where are MY CIDs in the tree?

• Example: what diseases are linked by NDF to KEGG’s serotonin transport drugs?
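Both questions ("where are the CIDs" and "where are MY CIDs") come down to counting CIDs per tree node, optionally after intersecting with a user's list. The sketch below is one simple way to do that on a small in-memory tree; it is not how PubChem's classification browser is implemented. The function name `count_cids_per_node` and the toy tree are hypothetical.

```python
from collections import defaultdict


def count_cids_per_node(parent_of, node_cids, my_cids=None):
    """Return {node: number of distinct CIDs at or below that node}.

    parent_of maps each node to its parent (None for the root).
    node_cids maps a node to the CIDs attached directly to it.
    my_cids, if given, restricts the count to that set of CIDs.
    """
    wanted = None if my_cids is None else set(my_cids)
    subtree = defaultdict(set)
    for node, cids in node_cids.items():
        selected = set(cids) if wanted is None else set(cids) & wanted
        current = node
        while current is not None:  # push CIDs up to every ancestor
            subtree[current] |= selected
            current = parent_of.get(current)
    return {node: len(cids) for node, cids in subtree.items() if cids}


# Toy classification tree (placeholder nodes and CIDs).
parent_of = {"root": None, "A": "root", "B": "root", "A1": "A"}
node_cids = {"A1": [1, 2], "A": [3], "B": [2, 4]}

all_counts = count_cids_per_node(parent_of, node_cids)
my_counts = count_cids_per_node(parent_of, node_cids, my_cids={2, 3})
```

Using sets means a CID attached to several nodes is counted once per ancestor, which matches how a count shown beside a tree node is usually read.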

Big Classifications… Some Engineering Required

WIPO IPC

• 72,000 tree nodes

• 6,000,000 CIDs

• 124,000,000 node-CID links

Filtering on the fly:

• 22,000 CIDs from PDB

… interactive!

More Space to Explore

• Literature (PubMed)
• Chemicals (PubChem)
• Targets (sequences)
• Diseases
• Patents
• Assays (PubChem)

… and beyond

Conclusions

PubChem is…
• A very generalized system
• Based on open data
• Part of the larger Entrez collection

We strive to…
• Make analysis across multiple knowledge spaces accessible and powerful
• Enable hypothesis generation for drug repurposing (as one scenario among many)

Feedback is always welcome!
info@ncbi.nlm.nih.gov

Acknowledgements

• Evan Bolton
• Steve Bryant
• Asta Gindulyte (classification front end)
• Chris Southan

… Thank You!
