C2D Cheminformatics: Methods, Tools and Results. By OSDD-Cheminformatics team.

Page 1:

C2D Cheminformatics: Methods, Tools and Results

By OSDD-Cheminformatics team

Page 2:

The burden of TB

• About 9 million people were infected with TB in 2009, and 1.7 million died.

• India is the world's TB capital, with an estimated 1.9 million cases reported every year.

• India has the second-largest estimated number of MDR-TB cases (99,000 in 2008).

• By July 2010, 58 countries had reported at least 1 case of XDR-TB.

Page 3:

Cheminformatics: What?

• Computers have been applied to solve problems almost everywhere. When we use them in chemistry, we call it cheminformatics.

• Cheminformatics is applied mostly to large numbers of molecules.

• Deals with:

– Storage, retrieval and crosslinking of chemical structures and associated data.

– Prediction of physical, chemical and biological properties of compounds.

– Analysis and prediction of reactions.

– Drug design...

Page 4:

Steps in drug development

Page 5:

Cheminformatics in drug design

[Diagram labels: Target, Data, Data Mining, Virtual Screening, Hit Identification, Lead Identification, Lead Optimization]

Building computational models for the drug discovery process.

Page 6:

Aim of Cheminformatics Project

• To screen molecules interacting with the potential TB targets using classifiers.

• Dock the selected molecules with the targets to further screen them for leads.

• Use cheminformatics techniques such as QSAR, 3D QSAR and ADMET to look for potential leads, and design drugs from the leads by building combinatorial libraries.

Page 7:

Ways to perform virtual screening

• Use a previously derived mathematical model that predicts the biological activity of each structure

• Run substructure queries to eliminate molecules with undesirable functionality

• Use a docking program to identify structures predicted to bind strongly to the active site of a protein (if the target structure is known)

• Apply filters in a succession of screening methods, each removing unwanted structures

Page 8:

Main Classes of Virtual Screening Methods

• Depend on the amount of structural and bioactivity data available:

– One active molecule known: perform similarity search (ligand-based virtual screening)

– Several active molecules known: try to identify a common 3D pharmacophore, then do a 3D database search

– Reasonable number of active and inactive structures known: train a machine learning technique (with the help of molecular descriptors or molecular properties)

– 3D structure of the protein known: use protein-ligand docking
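The similarity search named for the "one active molecule known" case can be sketched as follows. This is a minimal illustration, assuming each molecule has already been reduced to a fingerprint represented as a set of "on" bit positions and using the Tanimoto coefficient as the similarity measure; the slides do not say which measure or fingerprint the team used, and the compound names, fingerprints and threshold are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of 'on' bits."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def similarity_search(query_fp, library, threshold=0.7):
    """Return (name, score) pairs at or above the threshold, best match first."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda hit: hit[1], reverse=True)


# Hypothetical fingerprints: each set lists the bit positions switched on.
known_active = {1, 4, 7, 9, 12}
library = {
    "cmpd_A": {1, 4, 7, 9, 12},  # identical to the query
    "cmpd_B": {1, 4, 7, 9, 15},  # shares 4 of 6 distinct bits
    "cmpd_C": {2, 3, 5},         # no bits in common
}

hits = similarity_search(known_active, library, threshold=0.6)
for name, score in hits:
    print(name, round(score, 3))
```

Compounds that score above the threshold would then go forward to the next filter in the screening succession.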

Page 9:

Molecule Properties

SPC: Structure Property Correlation

INTRINSIC PROPERTIES: Molar Volume, Connectivity Indices, Charge Distribution, Molecular Weight, Polar Surface Area

CHEMICAL PROPERTIES: pKa, Log P, Solubility, Stability

BIOLOGICAL PROPERTIES: Activity, Toxicity, Biotransformation, Pharmacokinetics

Page 10:

Molecular descriptors used for machine learning

Molecular descriptors are numerical values that characterize properties of molecules.

The descriptors fall into four classes:

a) Topological

b) Geometrical

c) Electronic

d) Hybrid or 3D descriptors

Page 11:

Descriptors Used For Classification

Name of descriptors used      Number of descriptors
Pharmacophore Fingerprints    147
Weighted Burden Number        24
Properties                    8

Page 12:

Data mining

According to David Hand et al. (MIT Press, 2001): “Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner”.

Data mining... but why? Data → Information → Knowledge

The main aim of a user is always to extract knowledge from the information obtained from data.

Data mining is one of the key steps in the knowledge discovery process, although it is sometimes confused with knowledge discovery itself!

A user always wants to find more information while spending the least possible time exploring the resources.

Page 13:

Data mining in Cheminformatics

• Data mining approaches are an integral part of cheminformatics and pharmaceutical research.

• Their use will tend to increase as computational methods for biology and chemistry grow.

• Data mining has found major use in the virtual screening process of cheminformatics.

Page 14:

Data Mining Taxonomy

Page 15:

CLASSIFIER ALGORITHMS USED

• Bayes classifier: Naïve Bayes

• Trees: J48, Random Forest

• Functions: SMO

Page 16:

WORKFLOW

Page 17:

[Flowchart: workflow steps and tools]

1) Accessing the HTS bioassay data (all compounds sdf file; bioassay result for all compounds)

2) Upload the sdf file

3) Generate descriptor file

4) Open the CSV file in Excel

5) Append the bioassay result corresponding to the compounds

6) Select the active and inactive compounds

7) File splitting: training and testing

8) Remove the useless attributes

9) Apply classifier algorithms

10) Selection of best classifier model: TP %, FP <20%, Accuracy >70%

Tools: PubChem, PowerMV, Excel, WEKA
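The labelling and file-splitting steps of the workflow can be sketched in a few lines. This is only an illustration under stated assumptions: the team did these steps in Excel, whereas the sketch uses in-memory dictionaries; the 80/20 split fraction, the per-class (stratified) split and all compound IDs and descriptor values are hypothetical.

```python
import random


def label_and_split(descriptors, outcomes, test_fraction=0.2, seed=0):
    """Attach an outcome to each compound, keep active/inactive only, split per class."""
    rng = random.Random(seed)
    by_class = {"active": [], "inactive": []}
    for cid, values in descriptors.items():
        outcome = outcomes.get(cid)
        if outcome in by_class:  # drops inconclusive or unmatched compounds
            by_class[outcome].append((cid, values, outcome))
    train, test = [], []
    for rows in by_class.values():
        rng.shuffle(rows)
        n_test = round(len(rows) * test_fraction)
        test.extend(rows[:n_test])
        train.extend(rows[n_test:])
    return train, test


# Hypothetical toy data: compound ID -> descriptor values, compound ID -> outcome.
descriptors = {cid: [cid * 0.1, cid % 3] for cid in range(1, 21)}
outcomes = {cid: ("active" if cid % 4 == 0 else "inactive") for cid in range(1, 19)}
outcomes[19] = "inconclusive"  # compound 20 has no outcome at all

train, test = label_and_split(descriptors, outcomes)
print(len(train), "training compounds,", len(test), "testing compounds")
```

Splitting each class separately keeps the active-to-inactive ratio similar in both files, which matters when actives are much rarer than inactives.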

Page 18:

Molecular Descriptor generation

• Chemistry Development Kit (CDK) – http://rguha.net/code/java/cdkdesc.html

• PowerMV – http://nisla05.niss.org/PowerMV/?q=PowerMV

Page 19:

PowerMV

• A software environment for molecular viewing, descriptor generation, data analysis and hit evaluation.

• An operating environment for biologists and statisticians for viewing or browsing medium to large molecular SD files and computing descriptors.

Page 20:

Features

• Importing, viewing and sorting SD files.

• Capacity is limited only by available memory.

• Compound structures and attributes can be easily exported to Microsoft Excel.

Page 21:

Pre-requisites

• Requires .NET framework.

Limitation

• Windows based

Page 22:

Weka toolkit

• Collection of machine learning algorithms for data analysis and classification experiments.

• Tools available for data pre-processing, classification, regression, clustering, association rules, and visualization.

Page 23:

Weka – on GARUDA

Page 24:

The Script file

• RemoveUselessAttributes

java -cp <CLASSPATH> -Xmx4000m weka.filters.unsupervised.attribute.RemoveUseless -i <in.csv> -o <out.csv>
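To show what removing useless attributes means in practice, here is a minimal sketch of the simplest case: dropping descriptor columns whose value never changes across compounds. It is an illustration in plain Python, not Weka's implementation, and it covers only constant columns; the column names and values are hypothetical.

```python
def remove_constant_columns(header, rows):
    """Drop every column whose value is identical in all rows."""
    keep = [i for i in range(len(header))
            if len({row[i] for row in rows}) > 1]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows


# Hypothetical descriptor table: column "d2" never varies, so it carries no information.
header = ["d1", "d2", "d3", "class"]
rows = [
    [0.5, 0, 1.2, "active"],
    [0.7, 0, 1.1, "inactive"],
    [0.9, 0, 1.2, "inactive"],
]

new_header, new_rows = remove_constant_columns(header, rows)
print(new_header)
```

A column that is the same for every compound cannot help a classifier separate actives from inactives, so removing it shrinks the file without losing information.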

• Using cost-sensitive classification

java -cp <CLASSPATH> -Xmx4000m weka.classifiers.meta.CostSensitiveClassifier -cost-matrix "[0.0 10.0; 1.0 0.0]" -t AID1626train.arff -x 5 -d smo.model -W weka.classifiers.functions.SMO -i -- -M
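The effect of a cost matrix such as the one above can be illustrated with a small expected-cost calculation. This sketch assumes rows are the actual class and columns the predicted class, with class 0 standing for "active" and class 1 for "inactive"; the class order is an assumption, since it depends on the ARFF file. It shows the minimum-expected-cost idea only and makes no claim about which internal strategy the command uses.

```python
# Cost matrix from the command: rows = actual class, columns = predicted class.
# Assumption: class 0 is "active" and class 1 is "inactive".
COST = [[0.0, 10.0],
        [1.0, 0.0]]


def expected_costs(probabilities, cost=COST):
    """Expected cost of predicting each class, given per-class probabilities."""
    n = len(cost)
    return [sum(probabilities[actual] * cost[actual][pred] for actual in range(n))
            for pred in range(n)]


def min_cost_class(probabilities, cost=COST):
    """Index of the class whose prediction has the lowest expected cost."""
    costs = expected_costs(probabilities, cost)
    return costs.index(min(costs))


# A compound that the base model considers only 20% likely to be active.
probs = [0.2, 0.8]
print(expected_costs(probs), min_cost_class(probs))
```

With a false-negative cost ten times the false-positive cost, predicting "active" becomes the cheaper choice as soon as the estimated probability of activity exceeds 1/11, which is why both the TP rate and the FP rate rise.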

Page 25:

Case Study: AID899

To get trained in using different classifiers in Weka and analyzing the results

Page 26:

CYP450: a novel target against Mycobacterium tuberculosis

Page 27:

Why CYP450

• The P450s are mono-oxygenase enzymes that generally interact with flavoprotein and/or iron–sulphur centre redox partners for catalysis.

• The Mtb genome sequence reveals a plethora of P450s: it is "P450 dense" by comparison with eukaryotic genomes.

Organism          P450s   Genome size      Ratio
Humans            57      3.3 billion bp   1 : 58 million bp
D. melanogaster   84      123 million bp   1 : 1.5 million bp
A. thaliana       249     115 million bp   1 : 462,000 bp
M. tuberculosis   20      4.4 million bp   1 : 220,000 bp

• The most effective azoles have extremely tight binding constants for one of the Mtb P450s (CYP121).

• Analysis of Mtb CYP51 revealed that P420 is an irreversibly inactivated and structurally disrupted species.

• Mutations were largely located not in the active site area itself, but instead in regions that are conformationally mobile, where entry and exit of substrate to the active site is facilitated.

• Thus, acquired resistance could be mediated by mutations that enhance flexibility and conformational rearrangements, leading to increased activity.

Page 28:

Objectives

To develop a model from the AID 899 HTS data to study compound/drug interaction with human CYP450.

Why?

1) A lead molecule should not interact with human CYP450, because of a) drug metabolism and b) effects on CYP450.

2) It should work against the CYP450 of M. tuberculosis.

Page 29:

Work plan

1) Select active/inactive compounds against human CYP450 from PubChem HTS data

2) Generate model for lead compound screening

3) Screen the compounds via model

4) Select the inactives

5) Go for testing against mycobacterium CYP450 (model)

6) Select active lead compound

7) Go for in silico drug designing

8) In vitro and in vivo studies

Legend: Current working; To be worked

Page 30:

Confusion Matrix

TP: Active classified as active

FN: Active classified as inactive

FP: Inactive classified as active

TN: Inactive classified as inactive

Base Classifier and Cost Sensitive Classifier (CSC)

In the CSC, raising the cost factor for false negatives increases both the TP rate and the FP rate.

So FN is more important than FP.
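The quantities used to judge a model follow directly from the four confusion-matrix cells. The sketch below uses the standard definitions of TP rate, FP rate and accuracy, and checks them against the selection thresholds quoted on the workflow slide (FP <20%, Accuracy >70%). The counts are hypothetical and are not results from the study.

```python
def rates(tp, fn, fp, tn):
    """TP rate, FP rate and accuracy (all in percent) from confusion-matrix counts."""
    tp_rate = 100.0 * tp / (tp + fn)
    fp_rate = 100.0 * fp / (fp + tn)
    accuracy = 100.0 * (tp + tn) / (tp + fn + fp + tn)
    return tp_rate, fp_rate, accuracy


def passes_selection(tp, fn, fp, tn, max_fp_rate=20.0, min_accuracy=70.0):
    """True when a model meets the FP-rate and accuracy thresholds."""
    _, fp_rate, accuracy = rates(tp, fn, fp, tn)
    return fp_rate < max_fp_rate and accuracy > min_accuracy


# Hypothetical counts, not results from the study.
tp, fn, fp, tn = 80, 20, 150, 850
tp_rate, fp_rate, accuracy = rates(tp, fn, fp, tn)
print(round(tp_rate, 2), round(fp_rate, 2), round(accuracy, 2))
print(passes_selection(tp, fn, fp, tn))
```

Because inactives usually outnumber actives, accuracy alone can look good even when most actives are missed, which is why the TP rate is reported alongside it.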

Page 31:

Problems Faced

Data Redundancy

Computational Power

Communication – need an alternative to Skype

Institutional limitations – ban on media streaming, social networks, chatting, etc.

Page 32:

Data Redundancy

We tried two approaches for processing the AID to obtain the train and test data sets.

Method 1: We downloaded the sdf file containing all tested compounds and the bioassay data files for the same compounds, then matched them in MS Excel. The data contained active, inactive, inconclusive and discrepancy outcomes. We selected only the active and inactive compounds and ran them through PowerMV to get a csv file. After converting it to arff, we produced the test and train files from it. We loaded the two files in Weka and used different algorithms to build the best model.

Method 2: We downloaded the active and inactive SDF files separately from the same PubChem page. After processing in PowerMV, both files were combined into one. The remaining steps were the same as in Method 1.

Problem: The final numbers of active and inactive compounds differ between the methods.

            Active   Inactive   Discrepancy   Inconclusive
Method I    1767     6255       230           1127
Method II   1901     6441       Nil           1279
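One way to trace such a mismatch is to compare the compound identifiers produced by each method rather than only the counts. The sketch below assumes each method's output has been reduced to a set of compound IDs per outcome; the IDs are hypothetical and the slides do not describe how the team investigated the difference.

```python
def compare_sets(method_one, method_two):
    """Report which compound IDs appear in only one of the two processed sets."""
    return {
        "only_method_one": sorted(method_one - method_two),
        "only_method_two": sorted(method_two - method_one),
        "shared": len(method_one & method_two),
    }


# Hypothetical compound IDs standing in for the active sets from each method.
actives_one = {101, 102, 103, 104}
actives_two = {102, 103, 104, 105, 106}

report = compare_sets(actives_one, actives_two)
print(report)
```

Listing the IDs unique to each method shows whether the extra compounds in one set were filed under discrepancy or inconclusive in the other.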

AID 899 is not curated. The problem has been reported to PubChem; the Director will be looking at it.

Page 33:

Progress & Results

1) We understood the basic working of Weka.

2) We learned how to derive results from the confusion matrix.

3) A usually ignored classifier type (lazy) gives good results.

4) We got good results with Random Forest, etc., unlike those reported in the Virtual bioassay paper.

5) Maximum accuracy of 86.16%

Page 34:

Strategy followed

From the preliminary investigation it is clear that AID 899 is not a properly curated dataset.

In Method I many classifiers were applied, and the results are presented below.

In Method II many classifiers can still be run and results generated.

Page 35:

List of best classifiers: FP <20%, Accuracy >75%

Page 36:

Sincere thanks to OSDD