Open Source Cheminformatics

Open Source Cheminformatics: Tools and Data. Rajarshi Guha, School of Informatics, Indiana University. Bio IT World, 29th April, 2009

Upcoming presentation at BioIT world

Page 1: Open Source Cheminformatics

Open Source Cheminformatics

Rajarshi Guha

Open Source

Open Standards

Open Data

Open Source Cheminformatics

Tools and Data

Rajarshi Guha

School of Informatics, Indiana University

Bio IT World

29th April, 2009

Page 2: Open Source Cheminformatics

Open Source Cheminformatics

- Been around for some time; a niche field
- OSS snippets/code based on closed source APIs versus fully open source tools

Why use OSS cheminformatics?

- Articulated nicely by DeLano
- The reverse is also articulated nicely by Stahl

Goal

- Not to argue for or against Open Source
- Show what’s there and how it fits in with other technologies

DeLano, W. L., Drug Discovery Today, 2005, 10, 213–217

Stahl, M. T., Drug Discovery Today, 2005, 10, 219–222

Page 4: Open Source Cheminformatics

Cheminformatics Software

- The ecosystem is composed of developer- and user-oriented software
- Most applications will depend on lower level functionality
- Choice of toolkit influences
  - robustness
  - performance
  - ease of distribution
  - integration with other libraries
- Won’t be talking about user-oriented software

Page 5: Open Source Cheminformatics

The Toolkit Ecosystem

Guidelines

OEChem and its sister libraries for molecular modeling are fast, flexible, powerful and complete (except for fingerprints). It is designed for high-end users who know the nuances of cheminformatics. Expensive. My choice for C++, Java and Python.

CDK is the toolkit to use if you are on the JDK and OEChem is too pricey. It has a strong structure and structural biology component, close ties with 2D and 3D display programs, and integration with Bioclipse, Taverna, and Knime.

RDKit is relatively new and with a small user community. The software engineering skills are the best of the free projects. Includes 2D layout, 2D→3D, QSAR, forcefield, shape and machine learning components. Worth a look!

OpenBabel is the most community driven. Its strength is file format conversion, for both small molecules and biomolecules. It is expanding towards more modeling support, including several forcefield implementations. Often used as a test-bed for new algorithms. Code quality is variable, reflecting the diverse contributor base.

Do not use the Daylight toolkit for new code. It is expensive, there's very little new development, and you can get nearly all of its functionality elsewhere.

[Figure: Timeline of cheminformatics toolkits (those that run on Unix and support SMILES and SMARTS), from 1995 and earlier to 2008. Toolkits shown: Daylight, DayPerl, DaySWIG, PyDaylight, frowns, Babel, OELib, OpenBabel, OEChem, JOELib, CDK, RDKit, Pybel and cinfony. Annotations give implementation languages and language bindings, and mark wrappers and developers who moved between projects.]

Source: Andrew Dalke’s EuroQSAR 2008 poster

Page 6: Open Source Cheminformatics

What’s available?

- CDK (Java)
- OpenBabel (C++)
- RDKit (C++)
- Licensing varies
- A large degree of overlap

Page 7: Open Source Cheminformatics

Toolkits - A Comparison

Feature                     CDK       OpenBabel   RDKit
License                     LGPL      GPL         new BSD
Language                    Java      C++         C++ / Python
SLOC                        188,554   194,358     173,219
Fingerprints
  Hashed                    ✓✓✓       ✓✓✓         ✓✓✓
  Substructure              ✓✓✓       ✓✓✓         ✓✓✓
File format support         ✓✓        ✓✓✓         ✓
Aromaticity models          ✓         ✓           ✓
Stereochemistry             ✓         ✓✓          ✓✓✓
Canonicalization            ✓✓✓       ✓✓✓         ✓✓✓
Descriptors                 ✓✓✓       ✓           ✓✓✓
2D coordinate generation    ✓✓✓       ✗           ✓✓✓
3D coordinate generation    ✓         ✓✓✓         ✓✓✓
2D depictions               ✓✓✓       ✗           ✓✓✓
Conformer generation        ✗         ✓           ✓
Rigid alignment             ✓✓✓       ✓✓✓         ✓✓✓
SMARTS searching            ✓✓✓       ✓✓✓         ✓✓✓
Pharmacophore searching     ✓✓        ✗           ✓✓✓
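
As a rough illustration of what the "Hashed" fingerprint row refers to, here is a minimal Python sketch. It assumes a molecule has already been reduced to a list of feature strings (the path strings below are hand-written and hypothetical). It is not the algorithm used by CDK, OpenBabel or RDKit, and the names `hashed_fingerprint` and `tanimoto` are made up for this example.

```python
import hashlib


def hashed_fingerprint(features, n_bits=64):
    """Fold an iterable of feature strings into a set of 'on' bit positions."""
    bits = set()
    for feature in features:
        # Use a stable hash; the built-in hash() is salted per process.
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        bits.add(int.from_bytes(digest[:4], "big") % n_bits)
    return bits


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared bits divided by bits set in either one."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


# Hypothetical, hand-written path features for two small alcohols.
ethanol = ["C", "O", "C-C", "C-O", "C-C-O"]
propanol = ["C", "O", "C-C", "C-O", "C-C-O", "C-C-C", "C-C-C-O"]

fp_ethanol = hashed_fingerprint(ethanol)
fp_propanol = hashed_fingerprint(propanol)
print(tanimoto(fp_ethanol, fp_propanol))
```

Because every ethanol feature also occurs in propanol, the ethanol bits are a subset of the propanol bits. Real toolkits differ mainly in how the features are enumerated and how many bits are used.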

Page 8: Open Source Cheminformatics

CDK Overview

Category: Functionality

Input / Output: Support for various formats including SDF, SMILES, CML, PDB, InChI and PubChem XML formats; canonical SMILES support; pharmacophore serialization

Visualization: 2D coordinate generation and depiction

Properties: Fingerprinting; Gasteiger-Marsili and MMFF94 partial charges; atom, bond and molecular descriptors; NMR prediction via HOSE codes; aromaticity perception

Graph: Isomorphism and sub-graph isomorphism detection; SMARTS support; ring perception; pharmacophore searching; a variety of graph theoretical algorithms (including traversal, shortest paths, distance matrix)
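
The distance matrix listed under Graph can be sketched with a breadth-first search over the molecular graph. This is a generic Python illustration, not CDK code; the adjacency-list representation and the function name `distance_matrix` are assumptions made for this example.

```python
from collections import deque


def distance_matrix(adjacency):
    """Topological distances (bond counts) between all atom pairs via BFS."""
    n = len(adjacency)
    matrix = [[-1] * n for _ in range(n)]  # -1 marks "not reachable"
    for start in range(n):
        matrix[start][start] = 0
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for neighbor in adjacency[atom]:
                if matrix[start][neighbor] == -1:
                    matrix[start][neighbor] = matrix[start][atom] + 1
                    queue.append(neighbor)
    return matrix


# Heavy-atom graph of isobutane: atom 1 is the central carbon.
isobutane = [[1], [0, 2, 3], [1], [1]]
for row in distance_matrix(isobutane):
    print(row)
```

Since all bonds count as length one, breadth-first search is enough here; a weighted shortest-path algorithm would only be needed if bonds carried different weights.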

Page 9: Open Source Cheminformatics

Data Visualization

- Lots of OSS molecular visualization tools available
- Needs to be combined with data analysis tools
- R is great for analytics, has powerful graphics
- Not cheminformatics aware, not user-friendly

Possibilities

- Rattle
- GGobi
- Processing: developer oriented, good for ad-hoc, multiple data type visualizations
- Bioclipse

Page 10: Open Source Cheminformatics

Data Visualization - Bioclipse

Page 11: Open Source Cheminformatics

Open Source Cheminformatics Workflows

Requirements

- Core cheminformatics
- Analytics
- Database backends
- Integration

Can it be done?

- Yes, in various ways
- For the non-expert user, pipeline tools provide a nice platform for integrating all the above
- For expert users, it’s useful to go lower level
- Integration between R and the CDK provides a cheminformatics enhanced modeling platform

Page 12: Open Source Cheminformatics

CDK and R

- R is oriented towards statistical modeling and computations
- Cheminformatics agnostic
- rcdk integrates the CDK into the R environment
  - Read and process molecular structure information
  - Descriptors
  - Fingerprints
  - General molecule manipulation
- Provides access to CDK functionality in idiomatic R

http://cran.r-project.org/web/packages/rcdk/index.html

Page 13: Open Source Cheminformatics

Accessing Chemical Information from R

- rcdk is good for processing and manipulating molecules in R
- Also useful to be able to access chemical information directly from databases
- rpubchem provides access to PubChem compound, substance and bioassay collections
  - By compound, substance, assay IDs
  - By keyword searches
  - Packages assay information into a data.frame and includes associated metadata
- Supplements the rcdk package
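
To give a feel for programmatic access to PubChem by compound ID, here is a hedged Python sketch. It builds a request URL for PubChem's PUG REST interface, which is not mentioned in the slides and is not necessarily what rpubchem uses, and flattens a response into table-like rows. The helper names `property_url` and `rows_from_response` are hypothetical, and the sample response is hand-written rather than fetched live.

```python
import json
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(cids, properties):
    """Build a PUG REST URL requesting properties for a list of compound IDs."""
    cid_part = ",".join(str(cid) for cid in cids)
    prop_part = ",".join(quote(name) for name in properties)
    return f"{PUG_REST}/compound/cid/{cid_part}/property/{prop_part}/JSON"


def rows_from_response(text):
    """Flatten a PropertyTable JSON response into a list of row dicts."""
    return json.loads(text)["PropertyTable"]["Properties"]


url = property_url([2244], ["MolecularFormula", "MolecularWeight"])
print(url)

# Hand-written sample shaped like a PropertyTable response (an assumption).
sample = (
    '{"PropertyTable": {"Properties": '
    '[{"CID": 2244, "MolecularFormula": "C9H8O4"}]}}'
)
print(rows_from_response(sample))
```

The list of row dicts plays the same role as the data.frame mentioned above: one record per compound, one key per property.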

http://cran.r-project.org/web/packages/rpubchem/index.html

Page 14: Open Source Cheminformatics

Standards for Cheminformatics?

- Open standards/specifications help everybody
- Most refer to file formats
  - CML, JCAMP-DX
  - InChI, AniML
- Who sets them? How are they constructed?
- Are there usage restrictions?

Page 15: Open Source Cheminformatics

Standards for Cheminformatics

Open definition

- Public participation in defining the standard
- Mailing lists, wikis for transparency
- Possibility of forking the standard
- FlexMol, OpenSMILES, JCAMP-DX

Open use

- No royalties for usage
- No patents, trademarks, copyrights etc.
- SMILES, SDF, InChI, SLN

Page 16: Open Source Cheminformatics

Standards for Cheminformatics

De facto standard

- In wide use, few or no variants
- Data exchange is easy and reliable
- SDF, SMILES, PDB

Formal standard

- Endorsed by some sort of recognized group, academic, or government body
- InChI, OpenSMILES, JCAMP-DX
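
Part of what makes a de facto standard such as SDF easy to exchange is that its record structure is simple. The Python sketch below splits SD-file text into records and pulls out the data fields. It handles only the common layout (a molfile block, `> <FIELD>` data items ended by a blank line, and a `$$$$` terminator) and is not a complete or validating parser. The function names and the two sample records are made up for this example.

```python
def parse_record(lines):
    """Split one SD record into its molfile block and a dict of data fields."""
    fields = {}
    name = None
    mol_end = len(lines)
    for i, line in enumerate(lines):
        if line.startswith(">") and "<" in line:
            mol_end = min(mol_end, i)
            # The field name sits between angle brackets: "> <NAME>"
            name = line[line.index("<") + 1:line.rindex(">")]
            fields[name] = []
        elif name is not None:
            if line.strip():
                fields[name].append(line)
            else:
                name = None  # a blank line ends the current data item
    molblock = "\n".join(lines[:mol_end])
    return molblock, {key: "\n".join(value) for key, value in fields.items()}


def read_sd_records(text):
    """Yield (molblock, fields) for each '$$$$'-terminated record."""
    record = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            yield parse_record(record)
            record = []
        else:
            record.append(line)


# Two hand-written single-atom records (hydrogens implicit).
SAMPLE = "\n".join([
    "methane", "  hand-written example", "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END", "> <NAME>", "methane", "", "$$$$",
    "water", "  hand-written example", "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END", "> <NAME>", "water", "", "$$$$",
])

for molblock, fields in read_sd_records(SAMPLE):
    print(fields["NAME"], len(molblock.splitlines()))
```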

Page 17: Open Source Cheminformatics

The Blue Obelisk

- Umbrella for a variety of OSS projects
- Covers code, data, standards
- Open to everybody
- OpenSMILES is a recent project aiming to provide an explicit description of the SMILES grammar

http://blueobelisk.sourceforge.net/

http://www.opensmiles.org/

Page 18: Open Source Cheminformatics

The Pistoia Alliance

. . . established to streamline noncompetitive elements of the pharmaceutical drug discovery workflow by the specification of common business terms, relationships and processes . . .

- An opportunity for the Open Source cheminformatics community to link with industrial users
  - ontology developments
  - web service interfaces
  - database schema

http://pistoiaalliance.sourceforge.net/

Page 19: Open Source Cheminformatics

The Distributed Future

Page 20: Open Source Cheminformatics

The Distributed Future

- Web services, cloud computing, . . .
- The OSS cheminformatics ecosystem integrates with these scenarios very easily
- Cost and licenses are one aspect
- Redundancy is a big benefit
- Data / functionality mashups can lead to innovative solutions

Cheminformatics web services

- CDK based services (hosted at various places)
- Daylight web services
- NCI, ChemSpider

Page 21: Open Source Cheminformatics

There’s Data in Them Thar Internets

- Many significant public resources of chemical information
  - PubChem
  - ChemSpider
  - NMRShiftDB
- Use anything to access them
- Does OSS have a role to play here?
- Open Access is likely more important in this case

Page 22: Open Source Cheminformatics

Data Access

- Good to have access to data in open fashion
- What about adding value to the data?
  - Could replicate databases
    - Easier if the data source is built on an OSS stack
    - Raw data dumps obviate this need
  - But open, well defined APIs are preferable
    - Avoiding hosting/update hassles
    - Easier to mash multiple data sources
- Made easier when data sources support standards

Page 23: Open Source Cheminformatics

Benchmark Datasets

- Benchmarking is vital
- Some sub-fields have collections of benchmark datasets
  - Docking (DUD)
  - Virtual screening (MUV)
- No general datasets or attempts for benchmarking core cheminformatics operations
  - Initial attempt at cheminfbenchmark on GitHub
  - Restricted to Java libraries at this point (CDK, MX)
  - Uses datasets taken from PubChem
  - Fingerprinting, SD parsing, SMARTS parsing, substructure searching
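
A benchmark of core operations boils down to timing the same function over the same dataset several times. The Python sketch below shows one such harness. It is not the cheminfbenchmark code; the `benchmark` helper is hypothetical, and `count_atoms_naive` is a stand-in workload rather than real SMILES parsing.

```python
import statistics
import time


def benchmark(func, inputs, repeats=5):
    """Return the median wall-clock time (seconds) to run func over inputs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for item in inputs:
            func(item)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def count_atoms_naive(smiles):
    """Stand-in workload: count letters in a SMILES string (not a parser)."""
    return sum(ch.isalpha() for ch in smiles)


# A tiny repeated dataset; a real benchmark would use a fixed public dataset.
dataset = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"] * 1000
elapsed = benchmark(count_atoms_naive, dataset)
print(f"median of 5 runs: {elapsed:.4f} s for {len(dataset)} molecules")
```

Reporting the median rather than the mean reduces the influence of a single slow run, and to compare toolkits only the function passed to `benchmark` would change.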

Rohrer, S. G. et al., J. Chem. Inf. Model., 2009, 49, 169–184

Page 24: Open Source Cheminformatics

Open Source & Open Notebook Science

- ONS is a paradigm whereby some or all experimental results are published in an open form with little or no lag time
- Championed by Jean-Claude Bradley, Cameron Neylon, Raf Aerts and others
- Using closed source rather than open source cheminformatics doesn’t necessarily hinder ONS practice
- But open source cheminformatics makes life easier

Page 25: Open Source Cheminformatics

ONS Solubility Challenge

- Led by Jean-Claude Bradley (Drexel U.)
- Solubility measurements in various non-aqueous solvents
- Part of a larger project to identify anti-malarial compounds
- Very distributed
  - Multiple groups generating and modeling data
  - Data hosted on wikis and Google spreadsheets
  - Multiple views, enhanced via cheminformatics web services

Page 26: Open Source Cheminformatics

ONS Solubility Challenge

[Diagram: components of the ONS Solubility Challenge, labelled Data Generation, Data Storage, Data Views, Data Modeling and Web Services]

Page 27: Open Source Cheminformatics

What’s Holding OSS Cheminformatics Back?

- Niche field
- Comprehensiveness, polish
- Funding

Page 28: Open Source Cheminformatics

Conclusions

- The ecosystem is alive with activity
- Distributed systems are important; OSS cheminformatics fits in nicely
- OSS projects should coordinate with users
  - industrial and academic
- Quality and effectiveness will be the final arbiter