Building a Chemical Informatics Grid Marlon Pierce Community Grids Laboratory Indiana University.
![Page 1: Building a Chemical Informatics Grid Marlon Pierce Community Grids Laboratory Indiana University.](https://reader036.fdocuments.in/reader036/viewer/2022062618/55151447550346a80c8b5ce3/html5/thumbnails/1.jpg)
Building a Chemical Informatics Grid
Marlon Pierce
Community Grids Laboratory
Indiana University
![Page 2: Building a Chemical Informatics Grid Marlon Pierce Community Grids Laboratory Indiana University.](https://reader036.fdocuments.in/reader036/viewer/2022062618/55151447550346a80c8b5ce3/html5/thumbnails/2.jpg)
CICC Project Information
“Chemical Informatics and Cyberinfrastructure Collaboratory” (CICC) is an NIH- and MS-funded research project to combine the two CIs: chemical informatics and cyberinfrastructure.
Project web site and more information:
- www.chembiogrid.org
- www.chembiogrid.org/wiki
Team members include:
- Computer Science: Geoffrey Fox (PI), Dennis Gannon, Beth Plale, Marlon Pierce, Yuqing (Melanie) Wu, Malika Mahoui, Jake Kim
- Chemical Informatics and Chemistry: Gary Wiggins, Mu-Hyun (Mookie) Baik, David Wild, Rajarshi Guha, Kevin Gilbert
- I have stolen slides and content from these fine people.
We collaborate with several groups:
- Peter Murray-Rust’s group at the University of Cambridge
- University of Michigan’s MACE group
- Chemistry Development Kit (CDK) project
- NCI DTP at NIH
- Scripps High Throughput Screening Center
![Page 3: Building a Chemical Informatics Grid Marlon Pierce Community Grids Laboratory Indiana University.](https://reader036.fdocuments.in/reader036/viewer/2022062618/55151447550346a80c8b5ce3/html5/thumbnails/3.jpg)
Chemical Informatics and Cyberinfrastructure Building Blocks
Chemical informatics resources:
- Deluge of experimental data
  - More than 100,000 compounds screened by 10 publicly funded high-throughput screening centers using various assay techniques (molecular to cellular)
  - Molecular Libraries Screening Center Network
- Chemical databases maintained by various groups
  - NIH PubChem, NIH DTP
- Chemical informatics and computational chemistry
  - Data clustering, data mining, descriptor calculations, toxicity prediction, docking, molecular modeling, and quantum chemistry
- Visualization tools
- Web resources: journal articles, etc.
A Chemical Informatics Grid will need to integrate these into a common, loosely coupled, open, distributed computing environment.
![Page 4: Building a Chemical Informatics Grid Marlon Pierce Community Grids Laboratory Indiana University.](https://reader036.fdocuments.in/reader036/viewer/2022062618/55151447550346a80c8b5ce3/html5/thumbnails/4.jpg)
Our Solution Stack
- Domain-specific Web Services
  - VOTables, CDK services
- Grid services, cyberinfrastructure for computationally intensive applications
  - Clustering, quantum chemistry
- Workflow and service management
  - We work with Taverna
  - Many solutions: Kepler, BPEL engines, etc.
- Portlets and other user interfaces
  - Rich desktop apps
  - Ubiquitous clients
[Diagram layers: Portals and Other User Interfaces; Workflow and Service Management; Web and Grid Services]
Each level is a subject for research and development, as is their integration.
![Page 5: Building a Chemical Informatics Grid Marlon Pierce Community Grids Laboratory Indiana University.](https://reader036.fdocuments.in/reader036/viewer/2022062618/55151447550346a80c8b5ce3/html5/thumbnails/5.jpg)
A Library of Chemical Informatics Web Services
![Page 6: Building a Chemical Informatics Grid Marlon Pierce Community Grids Laboratory Indiana University.](https://reader036.fdocuments.in/reader036/viewer/2022062618/55151447550346a80c8b5ce3/html5/thumbnails/6.jpg)
All Services Great and Small
Like most Grids, a Chemical Informatics Grid will have the classic styles:
- Data Grid Services: these provide access to data sources like PubChem, etc.
- Execution Grid Services: used for running cluster analysis programs, molecular modeling codes, etc., on TeraGrid and similar places.
But we also need many additional services:
- Handling format conversions (InChI<->SMILES)
- Shipping and manipulating tabular data
- Determining toxicity of compounds
- Generating batch 2D images
So one of our core activities is “build lots of services.”
![Page 7: Building a Chemical Informatics Grid Marlon Pierce Community Grids Laboratory Indiana University.](https://reader036.fdocuments.in/reader036/viewer/2022062618/55151447550346a80c8b5ce3/html5/thumbnails/7.jpg)
VOTables: Handling Tabular Data
- Developed by the Virtual Observatory community for encoding astronomy data.
- The VOTable format is an XML representation of tabular data (here, data coming from BCI, NIH DTP databases, and so on).
- VOTable-compatible tools have been built; we just inherit them.
- The SAVOT and JAVOT Java parser APIs for VOTable allow us to easily build VOTable-based applications:
  - Web Services
  - Spreadsheets
  - Plotting applications: VOPlot and TopCat are two.
![Page 8: Building a Chemical Informatics Grid Marlon Pierce Community Grids Laboratory Indiana University.](https://reader036.fdocuments.in/reader036/viewer/2022062618/55151447550346a80c8b5ce3/html5/thumbnails/8.jpg)
mrtd1.txt: SMILES representation of chemical compounds along with their properties
![Page 9: Building a Chemical Informatics Grid Marlon Pierce Community Grids Laboratory Indiana University.](https://reader036.fdocuments.in/reader036/viewer/2022062618/55151447550346a80c8b5ce3/html5/thumbnails/9.jpg)
votable.xml: XML representation of the mrtd1.txt file
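As a rough illustration of the text-to-VOTable step, the sketch below wraps a small table in the standard VOTABLE / RESOURCE / TABLE / FIELD / DATA / TABLEDATA element nesting using only the Python standard library. The column names (SMILES, Mass, PSA), the sample rows, the table name, and the function name `rows_to_votable` are assumptions made for this example; the actual layout of mrtd1.txt is not shown in the slides, and the project itself relied on the SAVOT and JAVOT Java APIs rather than code like this.

```python
import xml.etree.ElementTree as ET

# Hypothetical columns and rows; the real mrtd1.txt layout is not given in the slides.
FIELDS = [("SMILES", "char"), ("Mass", "double"), ("PSA", "double")]
ROWS = [("CCO", 46.07, 20.23), ("c1ccccc1", 78.11, 0.0)]


def rows_to_votable(fields, rows):
    """Serialize (name, datatype) fields and row tuples as a VOTable XML string."""
    votable = ET.Element("VOTABLE", version="1.1")
    resource = ET.SubElement(votable, "RESOURCE")
    table = ET.SubElement(resource, "TABLE", name="compounds")
    for name, datatype in fields:
        attrs = {"name": name, "datatype": datatype}
        if datatype == "char":
            # Variable-length strings are declared with arraysize="*".
            attrs["arraysize"] = "*"
        ET.SubElement(table, "FIELD", attrs)
    tabledata = ET.SubElement(ET.SubElement(table, "DATA"), "TABLEDATA")
    for row in rows:
        tr = ET.SubElement(tabledata, "TR")
        for value in row:
            ET.SubElement(tr, "TD").text = str(value)
    return ET.tostring(votable, encoding="unicode")


if __name__ == "__main__":
    print(rows_to_votable(FIELDS, ROWS))
```

Each FIELD declares one column and each TR holds one compound, which is the kind of structure that lets generic VOTable tools plot one column against another.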
![Page 10: Building a Chemical Informatics Grid Marlon Pierce Community Grids Laboratory Indiana University.](https://reader036.fdocuments.in/reader036/viewer/2022062618/55151447550346a80c8b5ce3/html5/thumbnails/10.jpg)
VOPlot application using the generated votable.xml file: graph plotted with Mass on the X-axis and PSA on the Y-axis
![Page 11: Building a Chemical Informatics Grid Marlon Pierce Community Grids Laboratory Indiana University.](https://reader036.fdocuments.in/reader036/viewer/2022062618/55151447550346a80c8b5ce3/html5/thumbnails/11.jpg)
More Services: WWMM Services
| Services | Descriptions | Input | Output |
| --- | --- | --- | --- |
| InChIGoogle | Search an InChI structure through Google | inchiBasic, type | Search result in HTML format |
| InChIServer | Generate InChI | version, format | An InChI structure |
| OpenBabelServer | Transform a chemical format to another using Open Babel | format, inputData, outputData, options | Converted chemical structure string |
| CMLRSSServer | Generate CMLRSS feed from CML data | mol, title, description, link, source | Converted CMLRSS feed of CML data |
![Page 12: Building a Chemical Informatics Grid Marlon Pierce Community Grids Laboratory Indiana University.](https://reader036.fdocuments.in/reader036/viewer/2022062618/55151447550346a80c8b5ce3/html5/thumbnails/12.jpg)
CDK-Based Services
| Service | Description |
| --- | --- |
| Common Substructure | Calculates the common substructure between two molecules. |
| CDKsim | Takes two SMILES and evaluates the Tanimoto coefficient (ratio of intersection to union of their fingerprints). |
| CDKdesc | Calculates a variety of molecular and atomic descriptors for QSAR modeling. |
| CDKws | Fingerprint generation. |
| CDKsdg | Creates a JPEG of the compound’s 2D structure. |
| CDKStruct3D | Generates 3D coordinates of a molecule from its SMILES. |
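To make the Tanimoto coefficient mentioned for CDKsim concrete, here is a minimal sketch that treats a fingerprint as the set of its "on" bit positions and returns the ratio of intersection size to union size. It does not generate fingerprints from SMILES (that is the CDK's job); the bit positions below are made up for illustration, and returning 0.0 for two empty fingerprints is a convention chosen here, not something stated in the slides.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as iterables of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        # Both fingerprints are empty; 0.0 is an arbitrary convention for this sketch.
        return 0.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    # Made-up bit positions: 2 shared bits out of 5 distinct bits in total.
    print(tanimoto({1, 4, 7, 9}, {1, 4, 8}))
```

A value of 1.0 means the two fingerprints set exactly the same bits, and 0.0 means they share none.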
![Page 13: Building a Chemical Informatics Grid Marlon Pierce Community Grids Laboratory Indiana University.](https://reader036.fdocuments.in/reader036/viewer/2022062618/55151447550346a80c8b5ce3/html5/thumbnails/13.jpg)
ToxTree Service
- The Threshold of Toxicological Concern (TTC) establishes a level of exposure for all chemicals below which there would be no appreciable risk to human health.
- ToxTree implements the Cramer Decision Tree approach to estimate TTC.
- We have converted this into a service.
  - Uses SMILES as input.
  - Note that the GUI must be separated from the library for it to be a service.
http://ecb.jrc.it/QSAR/home.php?CONTENU=/QSAR/qsar_tools/qsar_tools_toxtree.php
![Page 14: Building a Chemical Informatics Grid Marlon Pierce Community Grids Laboratory Indiana University.](https://reader036.fdocuments.in/reader036/viewer/2022062618/55151447550346a80c8b5ce3/html5/thumbnails/14.jpg)
OSCAR3 Service
- OSCAR3 is a tool for shallow, chemistry-specific natural language parsing of chemical documents (i.e. journal articles).
- It identifies (or attempts to identify):
  - Chemical names: singular nouns, plurals, verbs, etc., as well as formulae and acronyms.
  - Chemical data: spectra, melting/boiling point, yield, etc., in experimental sections.
  - Other entities: things like N(5)-C(3) and so on.
- Results are exported as an XML file.
- There is a larger effort, SciBorg, in this area: http://www.cl.cam.ac.uk/~aac10/escience/sciborg.html
- It also has potentially very interesting workflows.
http://wwmm.ch.cam.ac.uk/wikis/wwmm/index.php/Oscar3
![Page 15: Building a Chemical Informatics Grid Marlon Pierce Community Grids Laboratory Indiana University.](https://reader036.fdocuments.in/reader036/viewer/2022062618/55151447550346a80c8b5ce3/html5/thumbnails/15.jpg)
Use Cases and Workflows
Putting data and clustering together in a distributed environment.
![Page 16: Building a Chemical Informatics Grid Marlon Pierce Community Grids Laboratory Indiana University.](https://reader036.fdocuments.in/reader036/viewer/2022062618/55151447550346a80c8b5ce3/html5/thumbnails/16.jpg)
A Workflow Scenario: HTS Data Organization and Flagging
- This workflow demonstrates how screening data can be flagged and organized for human analysis.
- The compounds and data values for a particular screen are retrieved from the NIH DTP database and then filtered to remove compounds with reactive groups, etc.
- A tumor cell line is selected. The activity results for all the compounds in the DTP database in the given range are extracted from the PostgreSQL database.
- OpenEye FILTER is used to calculate biological and chemical properties of the compounds that are related to their potential effectiveness as drugs.
- ToxTree is used to flag the potential toxicities of compounds.
- Divkmeans is used to add a column of cluster numbers.
- Finally, the results are visualized using VOPlot and the 2D viewer applet.
![Page 17: Building a Chemical Informatics Grid Marlon Pierce Community Grids Laboratory Indiana University.](https://reader036.fdocuments.in/reader036/viewer/2022062618/55151447550346a80c8b5ce3/html5/thumbnails/17.jpg)
HTS data organization & flagging
- A tumor cell line is selected. The activity results for all the compounds in the DTP database in the given range are extracted from the PostgreSQL database.
- The compounds are clustered on chemical structure similarity, to group similar compounds together.
- The compounds, along with property and cluster information, are converted to VOTable format and displayed in VOPlot.
- OpenEye FILTER is used to calculate biological and chemical properties of the compounds that are related to their potential effectiveness as drugs.
![Page 18: Building a Chemical Informatics Grid Marlon Pierce Community Grids Laboratory Indiana University.](https://reader036.fdocuments.in/reader036/viewer/2022062618/55151447550346a80c8b5ce3/html5/thumbnails/18.jpg)
Web Services
![Page 19: Building a Chemical Informatics Grid Marlon Pierce Community Grids Laboratory Indiana University.](https://reader036.fdocuments.in/reader036/viewer/2022062618/55151447550346a80c8b5ce3/html5/thumbnails/19.jpg)
Example plots of our workflow output using VOPlot and VOTables
![Page 20: Building a Chemical Informatics Grid Marlon Pierce Community Grids Laboratory Indiana University.](https://reader036.fdocuments.in/reader036/viewer/2022062618/55151447550346a80c8b5ce3/html5/thumbnails/20.jpg)
Chemical Informatics and the TeraGrid
![Page 21: Building a Chemical Informatics Grid Marlon Pierce Community Grids Laboratory Indiana University.](https://reader036.fdocuments.in/reader036/viewer/2022062618/55151447550346a80c8b5ce3/html5/thumbnails/21.jpg)
A Workflow for IU’s Big Red Demo
- PubMed abstracts
  - 555,007 PubMed abstracts from 2005–2006 (part)
  - 1,000 abstracts per node
  - 511 nodes × 1,000 input abstracts used for the demo
- OSCAR3
  - Extracts chemical information from text and produces an XML instance highlighting the chemical information
- SMILES extraction
  - Extracting SMILES elements from OSCAR’s XML output files
  - Unique SMILES list within a batch
- Use this to drive docking and molecular modeling applications.
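The batching and per-batch SMILES deduplication described for this demo can be sketched as follows. This is an illustration under assumptions, not the demo's actual code: it assumes OSCAR3's XML marks up each recognized entity as an element carrying a `SMILES` attribute (the `ne` element and `PAPER` wrapper in the sample documents are guesses at that markup), and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

BATCH_SIZE = 1000  # abstracts per node, as stated on the slide


def batches(items, size=BATCH_SIZE):
    """Yield consecutive slices of `items`, each holding at most `size` entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def unique_smiles(xml_documents):
    """Return the distinct SMILES strings found in one batch, in first-seen order."""
    seen = set()
    ordered = []
    for text in xml_documents:
        root = ET.fromstring(text)
        for element in root.iter():
            # Assumption: SMILES is carried as an attribute on annotated elements.
            smiles = element.get("SMILES")
            if smiles and smiles not in seen:
                seen.add(smiles)
                ordered.append(smiles)
    return ordered


# Hypothetical OSCAR-style output for two abstracts in the same batch.
DOC_A = ('<PAPER><P>Mix <ne type="CM" SMILES="CCO">ethanol</ne> with '
         '<ne type="CM" SMILES="c1ccccc1">benzene</ne>.</P></PAPER>')
DOC_B = ('<PAPER><P>Add <ne type="CM" SMILES="CCO">EtOH</ne> and '
         '<ne type="CM">catalyst</ne>.</P></PAPER>')

if __name__ == "__main__":
    print(unique_smiles([DOC_A, DOC_B]))
    # 555,007 abstracts at 1,000 per batch gives 556 batches; the demo ran 511 of them.
    print(len(list(batches(range(555007)))))
```

Deduplicating within a batch keeps each node's output independent of the others, at the cost of possible repeats across batches that a later merge step would need to handle.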
![Page 22: Building a Chemical Informatics Grid Marlon Pierce Community Grids Laboratory Indiana University.](https://reader036.fdocuments.in/reader036/viewer/2022062618/55151447550346a80c8b5ce3/html5/thumbnails/22.jpg)
Bigger Picture for the Workflow
[Diagram. Components: NIH PubMed Database; OSCAR Text Analysis; POV-Ray Parallel Rendering; Initial 3D Structure Calculation; Toxicity Filtering; Cluster Grouping; Docking; Molecular Mechanics Calculations; Quantum Mechanics Calculations; IU’s Varuna Database; NIH PubChem Database. Labeled regions: Big Red Demo; High Throughput Screening (HTS) Data Organization and Flagging.]
![Page 23: Building a Chemical Informatics Grid Marlon Pierce Community Grids Laboratory Indiana University.](https://reader036.fdocuments.in/reader036/viewer/2022062618/55151447550346a80c8b5ce3/html5/thumbnails/23.jpg)
A Workflow for Big Red Demo
Final HTML pages
![Page 24: Building a Chemical Informatics Grid Marlon Pierce Community Grids Laboratory Indiana University.](https://reader036.fdocuments.in/reader036/viewer/2022062618/55151447550346a80c8b5ce3/html5/thumbnails/24.jpg)
VARUNA – Towards a Grid-based Molecular Modeling Environment
Taking the Big Red demo from stunt to science.
![Page 25: Building a Chemical Informatics Grid Marlon Pierce Community Grids Laboratory Indiana University.](https://reader036.fdocuments.in/reader036/viewer/2022062618/55151447550346a80c8b5ce3/html5/thumbnails/25.jpg)
Automatic Quantum Mechanical Curation of Structure Data (AutoGeFF)
- Chemical research logic is often driven by molecular structure.
- Large-scale, small-molecule DBs (such as PubChem and, through OSCAR, PubMed) have low-resolution structure data.
- Often key properties are not consistently available, e.g., rotation barriers, redox potentials, polarizabilities, IR frequencies, reactivity towards nucleophiles.
- QM web services will provide tools for generating high-resolution data.
- This produces a new, curated database of QM results.
- These can then be combined with databases of proteins (PDB, MOAD, PDB-Bind) for docking and other detailed simulation studies.
![Page 26: Building a Chemical Informatics Grid Marlon Pierce Community Grids Laboratory Indiana University.](https://reader036.fdocuments.in/reader036/viewer/2022062618/55151447550346a80c8b5ce3/html5/thumbnails/26.jpg)
Prototype-Project: Controlling the TGF pathway
[Diagram. Labels: PDB; 1IAS (inactive TGF); VARUNA; experiments in the Zhang Lab; active TGF with inhibitor; PubChem; in-house molecules in Varuna; simulations; AutoGeFF; conceptual understanding of TGF inhibition.]
Questions:
- What molecular feature controls inhibitor binding?
- How do mutations impact binding?
![Page 27: Building a Chemical Informatics Grid Marlon Pierce Community Grids Laboratory Indiana University.](https://reader036.fdocuments.in/reader036/viewer/2022062618/55151447550346a80c8b5ce3/html5/thumbnails/27.jpg)
More Information
Contact me: [email protected]
Website and wiki: www.chembiogrid.org, www.chembiogrid.org/wiki
We have project plus collaborator mailing lists if you really are interested.