http://www.chembiogrid.org

Gary Wiggins for Geoffrey Fox

April 30, 2007

Computer Science, Informatics, Physics

Pervasive Technology Laboratories

Indiana University, Bloomington IN 47401

http://www.infomall.org



Indiana University Focus

Creating a comprehensive, easily accessible infrastructure for cheminformatics tools and data sources

Becoming a central hub of cheminformatics education


CICC Web Service Infrastructure

Cheminformatics services

Statistics services

Database services

Grid services

Portal services


Web Services Vision

Web services provide a neutral approach to exposing functionality

They can be located anywhere:

• On your desktop

• Intranet

• Internet

Literally anything can be made into a web service:

• Libraries

• Standalone programs

• Commercial code

• Open-source code
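
As a rough illustration of the claim that a library can be made into a web service, the following Python sketch wraps the standard-library function statistics.mean in a small WSGI application. It is only a sketch under stated assumptions: the names mean_service and call_locally, the query parameter x, and the JSON response shape are invented for this example, and plain HTTP with JSON is used for brevity rather than whatever protocol the CICC services actually use.

```python
import json
import statistics
from urllib.parse import parse_qs
from wsgiref.util import setup_testing_defaults


def mean_service(environ, start_response):
    """Hypothetical WSGI app: GET ?x=1&x=2 returns the mean of the x values as JSON."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    try:
        values = [float(v) for v in query.get("x", [])]
        body = {"mean": statistics.mean(values)}  # the wrapped library call
        status = "200 OK"
    except (ValueError, statistics.StatisticsError) as exc:
        # Non-numeric input or no values at all
        body = {"error": str(exc)}
        status = "400 Bad Request"
    payload = json.dumps(body).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(payload)))])
    return [payload]


def call_locally(app, query_string):
    """Invoke a WSGI app in-process, without opening a network socket."""
    environ = {}
    setup_testing_defaults(environ)
    environ["QUERY_STRING"] = query_string
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], json.loads(body)
```

To serve the same function over HTTP, the app could be passed to wsgiref.simple_server.make_server("localhost", 8000, mean_service) and started with serve_forever(); the host and port here are arbitrary.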


Modes of Access

Web Pages

Workflow Tools

• Taverna, Pipeline Pilot, Xbaya, etc.

GUIs

• Chimera

RSS Feeds

• Feeds include 2D/3D structures in CML

• Viewable in Bioclipse, Jmol as well as Sage etc.

• Two feeds currently available:

SynSearch – get structures based on full or partial chemical names

DockSearch – get best N structures for a target


Where Does Our Functionality Come From?

Indiana University: VOTables, NCI DTP predictions, Database services

Cambridge University: InChI generation / search, OSCAR

OpenEye: Docking

Digital Chemistry: BCI fingerprints, DivKMeans

CDK: Cheminformatics

Univ. of Michigan: PkCell

R Foundation: R package

NIH: PubChem, PubMed

gNova Consulting

European Chemicals Bureau: ToxTree toxicity predictions


Methods Development at the CICC

Tagging methods for web-based annotation exploiting del.icio.us and Connotea

Development of QSAR model interpretability and applicability methods

RNN-Profiles for exploration of chemical spaces

VisualiSAR - SAR through visual analysis

• http://www.daylight.com/meetings/mug99/Wild/Mug99.html

Visual Similarity Matrices for High Volume Datasets

• http://www.osl.iu.edu/~chemuell/new/bioinformatics.php

Fast, accurate clustering using parallel Divisive K-means

Mapping of Natural Language queries to use cases and workflows
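
The divisive K-means item above can be illustrated with a short Python sketch. This is a serial, textbook-style bisecting k-means under simple assumptions (Euclidean distance, always splitting the cluster with the largest within-cluster sum of squares). It is not the parallel method developed at the CICC or the Digital Chemistry DivKMeans code, and every function name is hypothetical.

```python
import random
from math import dist  # Python 3.8+


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sse(points):
    """Within-cluster sum of squared distances to the centroid."""
    c = centroid(points)
    return sum(dist(p, c) ** 2 for p in points)


def two_means(points, rng, iterations=20):
    """Split one cluster in two with plain k-means (k=2); None if no split is possible."""
    first = rng.choice(points)
    second = max(points, key=lambda p: dist(p, first))  # farthest point as second seed
    centers = [first, second]
    halves = None
    for _ in range(iterations):
        new = ([], [])
        for p in points:
            nearer = 0 if dist(p, centers[0]) <= dist(p, centers[1]) else 1
            new[nearer].append(p)
        if not new[0] or not new[1]:
            break  # keep the last valid split, if any
        halves = new
        centers = [centroid(h) for h in new]
    return halves


def divisive_kmeans(points, k, seed=0):
    """Start from one cluster and bisect the loosest cluster until k clusters exist."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(clusters, key=sse)
        if sse(worst) == 0:
            break  # every remaining cluster is a single point or exact duplicates
        halves = two_means(worst, rng)
        if halves is None:
            break
        clusters.remove(worst)
        clusters.extend(halves)
    return clusters
```

Because each step only runs a two-way k-means on one cluster, the splits of different clusters are independent of each other, which is one reason divisive schemes are attractive for parallel execution.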


Algorithm Development

Goals

• Focus on interpretability and applicability

• Devise novel approaches to clustering problems

• Investigate the utility of low dimensional representations for a variety of problems

Examples

• Ensemble feature selection (JCIM, in press)

• Cluster counting with R-NN curves (in revision)
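
For readers unfamiliar with R-NN curves, the sketch below shows one common way to describe them: for a single point, count the fraction of the other points that fall within radius R, as R grows in equal steps up to the maximum pairwise distance in the dataset. This is an assumption about the general idea only. The cluster-counting procedure itself is not described on the slide and is not reproduced here, and the function name rnn_curve and its parameters are hypothetical.

```python
from math import dist  # Python 3.8+


def rnn_curve(points, index, steps=10):
    """Fraction of the other points within radius R of points[index].

    R runs over `steps` equal increments up to the maximum pairwise distance.
    """
    if len(points) < 2:
        raise ValueError("need at least two points")
    d_max = max(dist(a, b) for a in points for b in points)
    others = [p for i, p in enumerate(points) if i != index]
    curve = []
    for step in range(1, steps + 1):
        radius = d_max * step / steps
        inside = sum(1 for p in others if dist(points[index], p) <= radius)
        curve.append(inside / len(others))
    return curve
```

A point in a dense region gains neighbours at small radii, so its curve rises early, while an isolated point stays near zero until the radius is large.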


Chemical Data Mining

Working on screening data with Scripps, FL

• Random forests (modeling & feature selection)

• Naïve Bayes (modeling)

• Identifying features indicative of toxicity

• Domain applicability

NCI DTP Cell line activity predictions

• Random forest models for 60 cell lines

All available as

• downloadable R models

• web services (supply SMILES, get prediction) with web page clients


Computational Infrastructure

R, CDK, and PubChem

Goals

• Access cheminformatics from within R

• Access PubChem data from within R

The rcdk package makes CDK functionality available for cheminformatics within R

rpubchem provides access to PubChem compound data and bioassay data

• Searchable via assay ID, keywords

J. Stat. Soft., 2007, 18(6)


Example: R Statistics applied to PubChem data

By exposing the R statistical package and the Chemistry Development Kit (CDK) as web services and integrating them with PubChem, we can quickly and easily perform statistical analysis and virtual screening of PubChem assay data.

Predictive models for particular screens are exposed as web services, and can be used either as simple web tools or integrated into other applications.

This example uses DTP Tumor Cell Line screens: a predictive model built with Random Forests in R predicts the probability of activity across multiple cell lines (available at http://www.chembiogrid.org/cheminfo/ncidtp/dtp).


Databases

Our databases aim to add value to PubChem or link into PubChem

3D structures (MMFF94)

• Searchable by CID, SMARTS, 3D similarity

Docked ligands (FRED)

• 960,000 drug-like compounds into 7 targets

• Will eventually cover ~2000 targets


Example: PubDock

Database of 960K PubChem structures (the most drug-like) docked into proteins taken from the PDB

Available as a web service, so structures can be accessed in your own programs, or using workflow tools like Pipeline Pilot

Several interfaces developed, including one based on Chimera, which integrates the database with the PDB to allow browsing of compounds in different targets, or different compounds in the same target


How do we use all of this?

Percent Inhibition or IC50 data is retrieved from HTS

Question: Was this screen successful?

Question: What should the active/inactive cutoffs be?

Question: What can we learn about the target protein or cell line from this screen?

Compounds submitted to PubChem

Workflows encoding distribution analysis of screening results

Grids can link data analysis (e.g. image processing developed in existing Grids), traditional cheminformatics tools, as well as annotation tools (Semantic Web, del.icio.us) and enhance lead ID and SAR analysis

A Grid of Grids linking collections of services at PubChem, ECCR centers, MLSCN centers

Workflows encoding plate & control well statistics, distribution analysis, etc.

Workflows encoding statistical comparison of results to similar screens, docking of compounds into proteins to correlate binding with activity, literature search of active compounds, etc.

[Diagram labels: CHEMINFORMATICS, PROCESS, GRIDS]
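
The questions about screen success and active/inactive cutoffs are often approached with control-well statistics. The Python sketch below computes the widely used Z'-factor, Z' = 1 - 3(σ_pos + σ_neg) / |μ_pos - μ_neg|, and a simple cutoff of the negative-control mean plus three standard deviations. These are common conventions chosen for illustration; the slide does not say which statistics the CICC workflows encode, and the function names and example numbers are hypothetical.

```python
from statistics import mean, stdev


def z_prime(positive_controls, negative_controls):
    """Z'-factor of an assay plate computed from its control wells."""
    window = abs(mean(positive_controls) - mean(negative_controls))
    if window == 0:
        raise ValueError("control means are identical; Z' is undefined")
    spread = 3 * (stdev(positive_controls) + stdev(negative_controls))
    return 1 - spread / window


def activity_cutoff(negative_controls, n_sd=3):
    """Percent-inhibition threshold: negative-control mean plus n_sd standard deviations."""
    return mean(negative_controls) + n_sd * stdev(negative_controls)


# Illustrative percent-inhibition readings (made-up values)
positive = [90, 100, 110]   # fully inhibited control wells
negative = [0, 5, 10]       # uninhibited control wells
quality = z_prime(positive, negative)
cutoff = activity_cutoff(negative)
```

A Z' above roughly 0.5 is conventionally read as a good separation between controls, so a value like this would support a "yes" to the first question, and wells above the cutoff would be flagged as candidate actives.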


Example HTS workflow: Finding cell-protein relationships

A protein implicated in tumor growth with a known ligand is selected (in this case HSP90 taken from the PDB 1Y4 complex).

The screening data from a cellular HTS assay is similarity searched for compounds with 2D structures similar to the ligand.

Similar structures to the ligand can be browsed using client portlets.

Similar structures are filtered for druggability, are converted to 3D, and are automatically passed to the OpenEye FRED docking program for docking into the target protein.

Once docking is complete, the user visualizes the high-scoring docked structures in a portlet using the Jmol applet.

Docking results and activity patterns are fed into R services for building of activity models and correlations

Least Squares Regression

Random Forests

Neural Nets
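
The 2D similarity search step in this workflow is commonly done by comparing structural fingerprints with the Tanimoto coefficient. The Python sketch below shows that idea with fingerprints represented as sets of "on" bit positions. The fingerprints, compound names, threshold, and function names are all made up for illustration, and the slide does not state which fingerprint or similarity measure the actual workflow uses.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as collections of on-bit positions."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_search(query_fp, library, threshold=0.7):
    """Return (name, score) pairs scoring at or above threshold, best first."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    hits = [hit for hit in scored if hit[1] >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical ligand fingerprint and screening library
ligand = {1, 2, 3, 4}
screened = {
    "cmpd_A": {1, 2, 3, 4},
    "cmpd_B": {1, 2, 3, 5},
    "cmpd_C": {1, 2, 3},
    "cmpd_D": {9},
}
hits = similarity_search(ligand, screened)
```

In a workflow like the one above, the returned hits would be the structures passed on to the filtering and docking steps.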


Varuna environment for molecular modeling (Baik, IU)

[Diagram labels: Researcher; Simulation Service (FORTRAN code, scripts); DB Service (queries, clustering, curation, etc.); QM Database; QM/MM Database; Reaction DB; ChemBioGrid; PubChem, PDB, NCI, etc.; Chemical Concepts; Experiments; Papers, etc.; Condor "Flocks"; TeraGrid; Supercomputers]


Cheminformatics Education at IU

School of Informatics degree programs: BS, MS, PhD

• Cheminformatics MS and track on PhD in Informatics

• Informatics undergraduates can choose a chemistry cognate (minor in chemistry)

Also Bioinformatics MS and Bioinformatics and Complex Systems tracks on PhD in Informatics

Good employer interest but modest student understanding of value of Cheminformatics degree

3 core graduate courses in Cheminformatics plus seminars and independent study courses

Significant interest in distance education versions of courses, promising for the Graduate Certificate in Chemical Informatics

http://www.informatics.indiana.edu


Spreading cheminformatics education with distance education

Partnered with the University of Michigan to offer our introductory graduate cheminformatics course at IU and Michigan as a CIC CourseShare

• UM pharmacy, chemistry and engineering students can be trained in cheminformatics for course credit at UM

Individual students in academia, government, and small and large life science companies have taken the class remotely from all over the country for credit towards the graduate certificate

Uses mixture of web conferencing (Breeze), videoconferencing, and online resources for maximum flexibility

• Most recent course wiki is available at http://cheminfo.informatics.indiana.edu/djwild/I571_2006_wiki

Giving a class remotely to UM students with video and web conferencing


CICC Infrastructure Vision

Drug Discovery and other academic chemistry and pharmacology research will be aided by powerful modern information technology.

ChemBioGrid is set up as a distributed cyberinfrastructure in the eScience model.

ChemBioGrid will provide user interfaces (portals) to distributed databases, results of high throughput screening instruments, results of computational chemical simulations and other analyses.

ChemBioGrid will provide services to manipulate this data and combine it in workflows; it will have convenient ways to submit and manage multiple jobs.

ChemBioGrid will include access to PubChem, PubMed, PubMed Central, the Internet and its derivatives like Microsoft Academic Live and Google Scholar.

The services include open-source software like CDK, commercial code from vendors such as Digital Chemistry, OpenEye, and Google, and any user contributed programs.

ChemBioGrid will define open interfaces to use for a particular type of service allowing plug and play choices between different implementations.


CICC Senior Personnel

Geoffrey C. Fox, Mu-Hyun (Mookie) Baik, Dennis B. Gannon, Kevin E. Gilbert, Rajarshi Guha, Marlon Pierce, Beth A. Plale, Gary D. Wiggins, David J. Wild, Yuqing (Melanie) Wu

Peter T. Cherbas, Mehmet M. Dalkilic, Charles H. Davis, A. Keith Dunker, Kelsey M. Forsythe, John C. Huffman, Malika Mahoui, Daniel J. Mindiola, Santiago D. Schnell, William Scott, Craig A. Stewart, David R. Williams

From Biology, Chemistry, Computer Science, Informatics at IU Bloomington and IUPUI (Indianapolis)