Exploring Chemical and Biological Knowledge Spaces with PubChem

Post on 10-May-2015


My presentation for the Drug Repurposing workshop at the upcoming Bio-IT World Expo. http://www.bio-itworldexpo.com/Bio-It_Expo_Content.aspx?id=124256 Presentation abstract: PubChem has a wealth of chemical structure and biological activity information. In conjunction with NCBI’s other resources such as PubMed and GenBank, PubChem is a vast source of information relevant to repurposing not only of established drugs but any compounds with in vivo pharmacology and/or clinical results. The challenge is how to take advantage of this knowledge. The ability to explore not only chemical similarity but relationships between diseases and disease targets has crucial value in repurposing. While focused investigations are already possible within the existing Entrez system, navigation across these linked information spaces can be difficult to do on a large scale with current tools. We are actively developing new infrastructure to support such analyses, and pursuing new methods of exploring inter- and intra-database relationships between chemicals, targets, diseases, and patents. Progress and some future direction in these areas will be presented.

EXPLORING CHEMICAL AND BIOLOGICAL KNOWLEDGE SPACES WITH PUBCHEM

Dr. Paul A. Thiessen, NCBI

2013/03/21 draft

What is a “Knowledge Space”?

• May be a database
• But may be a concept not encapsulated in a database

• Literature (PubMed)
• Chemicals (PubChem)
• Targets (sequences)
• Genes
• Diseases
• Patents
• Drugs
• Assays (PubChem)

Connecting the Spaces

Database cross-links

Spaces shown: Literature (PubMed), Chemicals (PubChem), Targets (sequences), Assays (PubChem)

Link labels shown: Active, Inactive, MeSH, Depositor

Moving Within a Space

Neighbors… some examples

Chemicals (PubChem)
• Same connectivity
• Same parent
• Similar by 2D or 3D

Assays (PubChem)
• Similar target (BLAST)
• Similar sets of screened chemicals
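As an illustration of "similar by 2D" neighboring, 2D similarity is commonly scored with a Tanimoto coefficient over structural fingerprints. The sketch below is only a toy: the fingerprints are made-up sets of bit positions, and the 0.9 cutoff and the names `tanimoto` and `neighbors` are assumptions for this example, not a description of PubChem's own code.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty fingerprints: treat as not similar
    return len(a & b) / len(union)


def neighbors(query_id, fingerprints, threshold=0.9):
    """Return the IDs whose fingerprint is at least `threshold` similar to the query."""
    query_fp = fingerprints[query_id]
    return sorted(
        other_id
        for other_id, fp in fingerprints.items()
        if other_id != query_id and tanimoto(query_fp, fp) >= threshold
    )


# Hypothetical fingerprints keyed by hypothetical compound labels.
toy_fingerprints = {
    "A": set(range(1, 11)),   # bits 1..10
    "B": set(range(1, 12)),   # bits 1..11, shares 10 of 11 bits with A
    "C": {1, 2, 3, 20, 21},   # shares only 3 bits with A
}

print(neighbors("A", toy_fingerprints))
```

With this toy data only "B" clears the cutoff for "A"; lowering `threshold` widens the neighborhood.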

Drug Repurposing as a Spatial Transformation

One possible route…

Diagram labels: Search, Drugs, Targets, Diseases (known), Diseases (hypothesized), Similarity

What is in PubChem

• 117M Substances (SIDs)
  ○ Information from depositors, including links to PubMed, sequences, structures, patents, etc.
• 47M Compounds (CIDs)
  ○ Derived from Substances (including links)
  ○ Computed properties
• 650k Assays (AIDs)
  ○ ~200M test results on SIDs
  ○ Links to target sequences

Some PubChem Statistics

All CIDs                                   46,814,409
Unique parents by connectivity             36,806,372
Rule of 5                                  34,343,056
Rule of 5 but MW 250-800                   31,483,865
Active in any BioAssay                        824,028
Tested in any BioAssay                      1,872,313
Experimental 3D (mainly PDB)                   41,406
Computed 3D (multiple confs + neighbors)   42,252,570
Pharmacological Actions                        11,531
Biosystems                                      9,703
Chemical vendors                           28,852,943
NIH Molecular Libraries                       402,076
Patent sources                             14,512,499
Patent links                                5,978,538

… as of 2013/03/20
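To put the counts above in proportion, here is a small calculation using only the figures from this slide. The variable names are chosen for the example; the ratios are simple arithmetic on the listed counts, not additional PubChem data.

```python
# Counts copied from the statistics slide (as of 2013/03/20).
all_cids = 46_814_409
tested_in_any_assay = 1_872_313
active_in_any_assay = 824_028
unique_parents = 36_806_372

# Share of all compounds that have been tested in at least one BioAssay.
tested_fraction = tested_in_any_assay / all_cids

# Share of tested compounds that were active in at least one BioAssay.
active_of_tested = active_in_any_assay / tested_in_any_assay

# Share of CIDs that remain after collapsing to unique parents by connectivity.
parent_fraction = unique_parents / all_cids

print(f"tested / all CIDs:     {tested_fraction:.1%}")
print(f"active / tested:       {active_of_tested:.1%}")
print(f"unique parents / CIDs: {parent_fraction:.1%}")
```

Roughly 4% of compounds have any assay data, and roughly 44% of those were active in at least one assay.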

What is in NCBI Entrez

• Many other databases…
  ○ PubMed
  ○ Protein/Nucleotide sequences
  ○ Genes
  ○ Biosystems (metabolic pathways)
  ○ PDB structures (with VAST neighbors)
• Text and numeric search fields
• Cross-links
  ○ Between databases
  ○ Within databases (neighbors)

How Entrez Works

• Search results = list of identifiers
• Boolean operations on lists (query refinement)
• Links from one database to another

Diagram: PubChem Search → CID List (shown twice); Link to PubMed → PMID List
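The model on this slide (identifier lists, Boolean refinement, cross-database links) can be sketched in memory as follows. Everything here is a toy: the CIDs, PMIDs, link table, and the helper `follow_links` are hypothetical stand-ins for what Entrez does server-side.

```python
# Two search results, each just a set of identifiers (hypothetical CIDs).
search_a = {2244, 3672, 5090, 1983}
search_b = {3672, 1983, 2519}

# Boolean operations on the lists refine the query.
both = search_a & search_b        # AND
either = search_a | search_b      # OR
only_a = search_a - search_b      # NOT

# A cross-database link table: CID -> set of PMIDs (hypothetical values).
cid_to_pmids = {
    1983: {111, 222},
    3672: {222, 333},
    2519: {444},
}


def follow_links(ids, link_table):
    """Map a list of IDs in one database to the linked IDs in another."""
    linked = set()
    for identifier in ids:
        linked |= link_table.get(identifier, set())
    return linked


pmids = follow_links(both, cid_to_pmids)
print(sorted(both), "->", sorted(pmids))
```

IDs without an entry in the link table simply contribute nothing, so a linked list can be smaller or larger than the list it came from.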

Limitations of Entrez

• Only text or numeric search
• Search fields hard to discover
• Search fields and defaults vary by database
• Chemical structure search, and other specialized algorithms, must be done outside Entrez
• The kicker: links are incomplete
  ○ Only 500-10,000 ids!
  ○ Limit also varies by database

Working Around the Limitations

• Scripting
  ○ E-Utils, PUG SOAP/REST, etc.
  ○ Break queries into smaller chunks
• Specialized services
  ○ PubChem’s ID Exchange
  ○ Classification trees (with associated IDs)
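One way to "break queries into smaller chunks" is to split a long ID list and issue one E-Utils link request per chunk. The sketch below only builds the request URLs and makes no network calls. The chunk size of 200 and the helper names `chunked` and `elink_urls` are assumptions for illustration, and the appropriate chunk size depends on the database limits mentioned earlier.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def chunked(ids, size):
    """Yield consecutive slices of at most `size` IDs."""
    ids = list(ids)
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def elink_urls(cids, chunk_size=200):
    """Build one ELink URL per chunk, linking PubChem Compound IDs to PubMed."""
    urls = []
    for chunk in chunked(cids, chunk_size):
        params = {
            "dbfrom": "pccompound",   # Entrez name of PubChem Compound
            "db": "pubmed",
            "id": ",".join(str(cid) for cid in chunk),
            "retmode": "json",
        }
        urls.append(f"{EUTILS_BASE}/elink.fcgi?{urlencode(params)}")
    return urls


# 450 hypothetical CIDs become three requests of 200, 200, and 50 IDs.
urls = elink_urls(range(1, 451))
print(len(urls))
print(urls[-1][:80])
```

The per-chunk results would then be merged client-side, for example by taking the union of the returned PMID lists.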

What is not in Entrez

… as a database per se, but which may be imported and linked to PubChem

• Drugs (sort of but not really)
• Targets (again sort of)
• Diseases
• Patents

Some Public Sources of Information Relevant to Drugs and Repurposing

• United States (FDA, NLM, NCBI, …)
  ○ ClinicalTrials.gov
  ○ NDF(-RT)
  ○ RxNorm
  ○ HSDB
  ○ MeSH
  ○ DailyMed
  ○ PubMed, PubMed Health
  ○ USPTO
• Europe
  ○ ChEBI / ChEMBL
  ○ EPO / WIPO
• Canada
  ○ DrugBank
• Japan
  ○ KEGG

… not an exhaustive list

… some are linked to PubChem

… some are works in progress

MeSH and ChEBI

Chemical structure classification

Biological role

Pharmacological action

KEGG and DrugBank

Drug classification

Targets

Patents

• PubChem depositors
  Per SID:
  ○ Patent IDs
  ○ PubMed IDs
• Classifications
  ○ ECLA
  ○ IPC
  ○ USPC
  ○ CPC

Aside: Patent Summaries

NDF-RT

• Molecular interactions
• Drug ingredients
• Diseases (with drugs)
• Physiological effects

Has links to MeSH… which leads to CIDs

NDF-RT linked to SID, CID

Classifications as Navigation Tools

Where are the CIDs in the tree?

• Example: chemicals affecting serotonin transporters according to KEGG

Classifications for Query Refinement

Where are MY CIDs in the tree?

• Example: what diseases are linked by NDF to KEGG’s serotonin transport drugs?
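The "where are MY CIDs in the tree?" question amounts to intersecting a user's CID list with the CIDs attached to each classification node. A minimal sketch follows, assuming a toy tree with made-up node names and CIDs; `Node`, `subtree_cids`, and `locate` are hypothetical names for this example and do not reflect the actual classification service.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    cids: set = field(default_factory=set)       # CIDs attached directly to this node
    children: list = field(default_factory=list)


def subtree_cids(node):
    """All CIDs attached to a node or to any of its descendants."""
    found = set(node.cids)
    for child in node.children:
        found |= subtree_cids(child)
    return found


def locate(node, my_cids, depth=0):
    """List (depth, node name, hit count) for every node containing any of my CIDs."""
    hits = subtree_cids(node) & my_cids
    if not hits:
        return []  # prune branches that contain none of the query CIDs
    rows = [(depth, node.name, len(hits))]
    for child in node.children:
        rows.extend(locate(child, my_cids, depth + 1))
    return rows


# Hypothetical classification tree and query list.
tree = Node("Root", children=[
    Node("Class A", cids={1, 2, 3}),
    Node("Class B", cids={2, 4}),
    Node("Class C", cids={5}),
])
my_cids = {2, 3}

for depth, name, count in locate(tree, my_cids):
    print("  " * depth + f"{name}: {count}")
```

A CID can sit under several nodes, so the per-node counts need not add up to the size of the query list. For very large trees, recomputing `subtree_cids` at every level would be too slow, and the counts would need to be precomputed or indexed.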

Big Classifications… Some Engineering Required

WIPO IPC

• 72,000 tree nodes

• 6,000,000 CIDs

• 124,000,000 node-CID links

Filtering on the fly:

• 22,000 CIDs from PDB

… interactive!

More Space to Explore

Literature(PubMed)

Chemicals(PubChem)

Targets(sequences)

Genes

Diseases

Patents

Drugs

Assays(PubChem)

… and beyond

Conclusions

PubChem is…
• A very generalized system
• Based on open data
• Part of the larger Entrez collection

We strive to…
• Make analysis across multiple knowledge spaces accessible and powerful
• Enable hypothesis generation for drug repurposing (as one scenario among many)

Feedback is always welcome! info@ncbi.nlm.nih.gov

Acknowledgements

• Evan Bolton
• Steve Bryant
• Asta Gindulyte (classification front end)
• Chris Southan

… Thank You!