Big Data in Drug Discovery
David J. Wild, Assistant Professor & Director, Cheminformatics Program
Indiana University School of Informatics and Computing
http://djwild.info
Big Data in Drug Discovery David Wild, July 2010. http://djwild.info.
Epochs in drug discovery
Empirical – up until 1960’s
754: first pharmacy opened in Baghdad
Late 1800’s – major pharmaceutical companies, mass production
1900-1960 – major discoveries (insulin, penicillin, the pill …)
Rational – 1960’s to 1990’s
Designing molecules to target protein active sites – “lock and key”
Computational Drug Discovery
Biggest success HIV (RT, protease inhibitors)
Big Experiment – 1990’s to 2000’s
High throughput screening
Microarray Assays
Gene Sequencing and Human Genome Project
Big Data – 2010’s onwards
Informatics-driven drug discovery
Accepting the body is amazingly complex and we don’t understand it well
Everything is connected
The metabolic pathways of a single cell
The inner life of the cell
http://video.google.com/videoplay?docid=-2351549868099343381&hl=en#
Big Data in the public domain
There is now an incredibly rich body of public information relating compounds, targets, genes, pathways, and diseases. To start with, the public domain holds information on:
69 million compounds and 449,392 bioassays (PubChem)
4,763 drugs (DrugBank)
9 million protein sequences (SwissProt) and 58,000 3D structures (PDB)
14 million human nucleotide sequences (EMBL)
19 million life science publications - 800,000 new each year (PubMed)
Multitude of other sets (drugs, toxicogenomics, chemogenomics, SAR, …)
Even more important are the relationships between these entities. For example a chemical compound can be linked to a gene or a protein target in a multitude of ways:
Biological assay with percent inhibition, IC50, etc
Crystal structure of ligand/protein complex
Co-occurrence in a paper abstract
Computational experiment (docking, predictive model)
Statistical relationship
System association (e.g. involved in the same pathways or cellular processes)
PubChem growth since 2005
Chart: PubChem Substance Size 2005-2010 (y-axis 0 to 80,000,000; x-axis 2005-01 to 2010-07). Labeled values: 2,824,265; 35,379,748; 56,774,950; 69,088,100. Annotation: addition of ChemSpider.
Chart: PubChem Bioassays 2005-2010 (logarithmic y-axis, 1 to 1,000,000; x-axis 2005-01 to 2010-07). Labeled value: 434,635. Annotation: addition of ChEMBL.
Large amount of data and links for each compound
Proteins & Genes
http://www.genome.jp/en/db_growth.html
Chem2Bio2RDF: The FaceBook of Drug Discovery
You are a big pile of data too!
Large-scale predictive modeling adds even more data
Range of ROCV values from three different classes of BioAssay data set for original models and models built with additional inactive compounds (“improved”).
Chen, B. and Wild, D.J. PubChem BioAssays as a data source for predictive models, Journal of
Molecular Graphics and Modeling. 2010; 28, 420-426.
Informatics-based drug discovery
Predicting new molecular targets for known drugs. Nature 462, 175-181 (12 November 2009)
“Systems chemical biology” and chemogenomics
Recent enabling technologies for SCB / Chemogenomics
Cloud computing allows processing and data mining on a vast scale
Integrative cheminformatics & bioinformatics connects compounds, targets, genes, pathways, diseases and side effects
Health informatics (PHRs and EHRs) allows integration of the molecular and patient models (QP)
Semantic technologies and complex systems tools allow seamless integration and human-scale data mining
Analysis: Visualization, projection, data mining, hypothesis generation, network tools
Integration: RDF, XML, Triple Stores, Ontologies, SPARQL, Graph algorithms
Access: Web Services, RPC, Information extraction
ChemBioGrid.org: Web service infrastructure for cheminformatics
Dong, X., Gilbert, K.E., Guha, R., Heiland, R., Kim, J., Pierce, M.E., Fox, G.C. and Wild, D.J. Web service infrastructure for chemoinformatics, J. Chem. Inf. Model., 2007; 47(4) pp 1303-1307.
The Semantic Web – meaning & relationships
Chem2Bio2RDF – RDF integration & SPARQL querying
Chen, B., Dong, X., Jiao, D., Wang, H., Zhu, Q., Ding, Y., Wild, D.J. Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data. BMC Bioinformatics 2010, 11, 255.
Chem2Bio2RDF context
Chem2Bio2RDF Relationships
Linked Open Data Cloud (linkeddata.org)
Converting data into RDF
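As a rough illustration of what converting a tabular record into RDF involves, the sketch below turns one row of a hypothetical compound-target table into N-Triples lines. The namespace `http://example.org/c2b2r/`, the predicate names, the column names and the row values are all placeholders invented for this example; they are not the actual Chem2Bio2RDF schema.

```python
# Minimal sketch: one table row -> RDF triples in N-Triples syntax.
# The namespace, predicates and row values are hypothetical placeholders.
BASE = "http://example.org/c2b2r/"


def row_to_ntriples(row):
    """Return N-Triples lines describing one compound-target record."""
    subject = f"<{BASE}compound/{row['cid']}>"
    gene_uri = f"<{BASE}gene/{row['gene']}>"
    value = row["ic50_nM"]
    return [
        # Link to another resource: the object is a URI.
        f"{subject} <{BASE}hasTarget> {gene_uri} .",
        # Plain literal value: the object is a quoted string.
        f'{subject} <{BASE}ic50_nM> "{value}" .',
    ]


example_row = {"cid": "0000", "gene": "GENE_X", "ic50_nM": 50}
ntriples = row_to_ntriples(example_row)
print("\n".join(ntriples))
```

Each line is one subject-predicate-object statement, which is what lets records from different source databases be merged into a single graph once they share identifiers.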
Finding multi-target inhibitors of MAPK pathway with a SPARQL query
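The query on the original slide is not reproduced in this transcript. The sketch below only illustrates the kind of join such a query performs, written in plain Python over a toy list of triples rather than in SPARQL. All compound and gene identifiers, and the predicate names `inhibits` and `in_pathway`, are made up for illustration.

```python
from collections import defaultdict

# Toy (subject, predicate, object) triples; every identifier is hypothetical.
triples = [
    ("cpd_A", "inhibits", "gene_1"),
    ("cpd_A", "inhibits", "gene_2"),
    ("cpd_B", "inhibits", "gene_1"),
    ("cpd_C", "inhibits", "gene_3"),
    ("gene_1", "in_pathway", "MAPK"),
    ("gene_2", "in_pathway", "MAPK"),
    ("gene_3", "in_pathway", "other_pathway"),
]


def multi_target_inhibitors(triples, pathway, min_targets=2):
    """Compounds inhibiting at least `min_targets` genes of `pathway`."""
    # Step 1: collect the genes that belong to the pathway.
    pathway_genes = {s for s, p, o in triples
                     if p == "in_pathway" and o == pathway}
    # Step 2: group inhibited pathway genes by compound.
    hits = defaultdict(set)
    for s, p, o in triples:
        if p == "inhibits" and o in pathway_genes:
            hits[s].add(o)
    # Step 3: keep compounds that hit enough distinct genes.
    return {c: sorted(g) for c, g in hits.items() if len(g) >= min_targets}


result = multi_target_inhibitors(triples, "MAPK")
print(result)
```

In SPARQL the same idea would be expressed as graph patterns sharing variables, with a grouping and count filter on the number of distinct targets.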
Finding compounds with similar polypharmacology using SPARQL
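The transcript does not say how similarity of polypharmacology was scored. One simple possibility, assumed here purely for illustration, is to compare the sets of targets each compound is active against using the Jaccard coefficient. The compound names and target sets below are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty profiles
    return len(a & b) / len(a | b)


# Hypothetical target profiles: compound -> set of targets it is active against.
profiles = {
    "cpd_A": {"gene_1", "gene_2", "gene_3"},
    "cpd_B": {"gene_1", "gene_2"},
    "cpd_C": {"gene_4"},
}


def most_similar(query, profiles):
    """Rank all other compounds by target-profile similarity to `query`."""
    scores = [(jaccard(profiles[query], targets), name)
              for name, targets in profiles.items() if name != query]
    return sorted(scores, reverse=True)


ranking = most_similar("cpd_A", profiles)
print(ranking)
```

Compounds that share most of their targets rank first, regardless of how similar their chemical structures are.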
Projecting queries into chemical space
GTM / MDS projection and embedding of all PubChem using clouds
Plotting and embedding unknown compounds with SCB property labels
Dynamic querying and projection into chemical space
Choi, J.Y., Bae, S.H., Qiu, J., Fox, G., Chen, B., Wild, D.J. Browsing Large Scale Cheminformatics Data with Dimension Reduction. Emerging Computational Methods for the Life Sciences Workshop, ACM Symposium for High Performance Distributed Computing, Jun 21-25, 2010, Chicago IL.
“Doppler Radar Plot” – Kinase Specificity
Choi, J.Y., Bae, S.H., Qiu, J., Fox, G., Chen, B., Wild, D.J. Browsing Large Scale Cheminformatics Data with Dimension Reduction. Emerging Computational Methods for the Life Sciences Workshop, ACM Symposium for High Performance Distributed Computing, Jun 21-25, 2010, Chicago IL.
Chem2Bio2RDF Dashboard: finding paths
Pathfinder
Dexamethasone
Triamcinolone
NFKB1
Glucocorticoid Receptor
http://ella.slis.indiana.edu/~yuysun/flex/pathfinder.html
Dynamic exploration with clouds and Cytoscape
Virtuoso runs Chem2Bio2RDF queries on the cloud
Cytoscape plugins give access to Chem2Bio2RDF, LPG and chemical structure visualization
Dynamic exploration in Cytoscape
Hydrocortisone – Dexamethasone links
Fig. Use Case 1. Network diagram of the paths obtained between Hydrocortisone and Dexamethasone using ChemBioScape. DrugBank interaction records contain information about every drug’s target. In this case, DB00741 and DB01234 share common targets through several different DrugBank interaction IDs.
Tolcapone and Entacapone links
Fig. Use Case 2. Tolcapone and Entacapone are connected to each other through DrugBank interactions 2348 and 1962. The two drugs also appear in PubMed articles 8119326 and 8223912 via their CIDs (compound IDs).
Isoniazid and Ethionamide – replicate paper results
Banerjee, A., Dubnau, E., Quemard, A., Balasubramanian, V., Um, K., Wilson, T., et al.: inhA, a gene encoding a target for isoniazid
and ethionamide in Mycobacterium tuberculosis. Science, 263(5144), 227-230 (1994).
WENDI – exploring compound knowledge space
Doxorubicin (anthracycline antibiotic)
WENDI v1.0 - insights from the literature
WENDI v2.0 - Automated reasoning with RDF
Simple OWL ontology for relationships
Large RDF network expands out from Query
RDF inference engines applied & results filtered / prioritized
Figure: RDF network expanding out from the query compound. Node labels: QUERY, CID 86427, CID 8642, AID 328, PubMed 12856, Breast Cancer, HER2. Edge labels: similar_to, active_against, contained_in, contains_term, predicted_inactive_against.
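To make the idea of applying inference rules to an RDF network concrete, here is a minimal forward-chaining sketch in plain Python. The predicate names `similar_to`, `active_against` and `contains_term` follow the edge labels on the slide, but the node names, the wiring between them, the two rules, and the derived predicates `maybe_active_against` and `associated_with` are assumptions made for illustration; they are not WENDI's actual ontology or reasoner.

```python
# Hypothetical facts; which nodes each edge connects is an assumption.
facts = {
    ("QUERY", "similar_to", "compound_1"),
    ("compound_1", "active_against", "assay_1"),
    ("assay_1", "contains_term", "disease_term"),
}


def infer(facts):
    """Apply two illustrative rules repeatedly until nothing new is derived."""
    known = set(facts)
    changed = True
    while changed:
        new = set()
        for a, p1, b in known:
            for c, p2, d in known:
                if b != c:
                    continue  # the two triples must chain: a -> b -> d
                # Rule 1: a compound similar to an active compound may be active.
                if p1 == "similar_to" and p2 == "active_against":
                    new.add((a, "maybe_active_against", d))
                # Rule 2: (possible) activity in an assay mentioning a term
                # suggests an association with that term.
                if (p1 in ("active_against", "maybe_active_against")
                        and p2 == "contains_term"):
                    new.add((a, "associated_with", d))
        changed = not new <= known
        known |= new
    # Return only the triples that were not given as input.
    return known - set(facts)


derived = infer(facts)
for triple in sorted(derived):
    print(triple)
```

Filtering and prioritizing the inferred triples, as the slide describes, would then be a separate step over the returned set.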
WENDI OWL/RDF Network & Inferred Associations
Semantic text mining of journal articles
Jiao, D. and Wild, D.J. Extraction of CYP Chemical Interactions from Biomedical Literature Using Natural Language Processing Methods, Journal of Chemical Information and Modeling, 49(2); pp 263-269.
Chemical & Biological Literature Extraction
Validating topics by experimental relationships
Topic 26: cell, expression, cancer, tumor, …
Related Disease: DNA Damage, Melanoma, Glioblastoma, …
Figure legend: unproven link; link proven by c2b2r_chemogenomics; Target; Drug
Bio-LDA III
Entropy
In information theory, entropy is a measure of the uncertainty associated with a random variable. Here we compute the entropy of each bio-term over topics.
Kullback-Leibler divergence (KL divergence)
A non-symmetric measure of the difference between two probability distributions. Here we use the KL divergence as a non-symmetric distance measure between two bio-terms over topics.
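Both quantities have standard definitions: for a bio-term with topic distribution p, the entropy is H(p) = -Σ p_i log p_i, and for two bio-terms with distributions p and q, KL(p || q) = Σ p_i log(p_i / q_i). The sketch below computes both in bits. The two-topic distributions are invented for illustration and are not taken from the Bio-LDA model.

```python
import math


def entropy(p):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def kl_divergence(p, q):
    """KL(p || q) in bits. Assumes q[i] > 0 wherever p[i] > 0."""
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)


# Hypothetical topic distributions for two bio-terms over two topics.
term_a = [0.5, 0.5]    # spread evenly across topics: high entropy
term_b = [0.25, 0.75]  # concentrated on one topic: lower entropy

print("H(a) =", entropy(term_a))
print("H(b) =", entropy(term_b))
print("KL(a||b) =", kl_divergence(term_a, term_b))
print("KL(b||a) =", kl_divergence(term_b, term_a))
```

A term concentrated in few topics has low entropy, and the two KL values differ, which is the non-symmetry mentioned above.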
Combining path finding and Bio-LDA
Detect semantic association
Path finding algorithm
Millions of RDF triples from Chem2Bio2RDF
Assess semantic association
Bio-LDA model
Entropy and KL divergence
Additional knowledge base: 50, 100 and 200 topics using 336,899 recent MEDLINE abstracts, which contain 13,338 distinct bio-terms
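The transcript does not specify which path finding algorithm is used. As a stand-in, the sketch below runs a breadth-first search over a small set of triples treated as an undirected graph, returning the shortest chain of nodes and predicates between two entities. The entity and predicate names are hypothetical.

```python
from collections import defaultdict, deque


def build_graph(triples):
    """Index triples by node, allowing traversal in both directions."""
    graph = defaultdict(list)
    for s, p, o in triples:
        graph[s].append((p, o))
        graph[o].append((p, s))
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns [node, predicate, node, ...] or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for pred, nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [pred, nxt])
    return None


# Hypothetical triples: two drugs sharing one protein target.
triples = [
    ("drug_1", "targets", "protein_P"),
    ("drug_2", "targets", "protein_P"),
    ("protein_P", "encoded_by", "gene_G"),
]
graph = build_graph(triples)
path = shortest_path(graph, "drug_1", "drug_2")
print(path)
```

Each path found this way is a candidate semantic association; scoring it, for example with the entropy and KL divergence measures, is a separate assessment step.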
Summary
Drug discovery is entering a new era that is arguably centered on informatics analysis of the vast amount of biological and chemical data now being produced, and which looks at the effect of drugs on biological systems as a whole. This new approach underlies the new fields of systems chemical biology and chemogenomics.
Analyzing this data, and particularly the relationships between compounds, drugs, proteins, genes, diseases, pathways and people, promises to provide important understanding of the nature of disease and treatment.
The Semantic Web provides an effective framework for logically managing the data, and cloud computing provides a physical framework for computation and searching.
Early-stage methods developed at Indiana allow integrated access to this data, path finding between any two points, visualization in chemical space and with network tools, and advanced handling of the scholarly literature.
Critical next steps include ranking and intelligent filtering of paths and relationships to provide aggregate evidence-based approaches, and integration of NGS and patient data.
Cheminformatics group at Indiana University
http://djwild.info