Building QSAR Models using Spotfire Qsar... · 2006-01-10 · QSAR Workflow. Substructure Search...


Computational Chemistry Group

Building QSAR Models using Spotfire

Dr. Julian Cherryman, Computational Chemistry Group Leader

www.IntertekASG.com


Abstract

In previous years we showed how we integrated our data capture and retrieval using Spotfire’s information model (IIM) and guides. Since then we have focused on deriving the maximum amount of information from that data, using a combination of Spotfire’s statistical tools, our own tools and guides. We will show how this combination has allowed us to build a complete QSAR model from start to finish: selecting the initial molecules, retrieving and cleaning the testing data, generating and rationalising descriptors, and building an MLR statistical model. The model has then been compared with the initial testing data and taken forward to identify new molecules to be tested. The talk will also highlight some recent enhancements to the PCA tool that were added based, in part, on this work.


Intertek Group

• Network of 283 laboratories and 531 offices serving 102 countries worldwide

• 12,900 employees with a diversity of skills

• FY2003 group revenue >£471m

• Successful IPO on 29th May 2002 (Market Cap. $1 Billion)

• Serving the world’s leading oil, chemical and petrochemical companies since 1885

• Global Presence in most ports and many major cities worldwide

• ITS division for petroleum, petrochemicals and chemicals analysis

• Yr 2003 revenue > £169m; 5 years compound growth ~15% pa

• Analytical Testing from Outsourcing growing – already >30% sales


Intertek ASG

• ASG has a long history of advanced, innovative computational and analytical science
– The major central computational and analytical laboratory for Avecia from June 1999 to April 2004
    • R&D projects to build state-of-the-art capability in key areas
    • Increased use of informatics, statistics, data manipulation and visualisation
    • Development focus on capabilities to support the Pharmaceutical and Biotechnology businesses
    • Translation into efficient on-going capabilities and services
– Significant component of corporate computational and analytical infrastructure within Zeneca / ICI prior to formation of Avecia
• Intertek ASG capabilities and services now accessible to a much broader client base


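QSAR Workflow

[Diagram: workflow steps, listed in the order they are covered in the talk]
• Substructure Search
• Retrieve Testing Data
• Clean Testing Data
• Generate Descriptors
• Clean Descriptors Data
• Data Reduction
• Merge Data
• Build Model
• Predict for Unknowns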


Substructure Search

• Compound library options
– Real molecules
– Virtual molecules
• Library enumeration
• ISIS substructure search with specific fragment gives ~900 molecules
• Testing data
– Present or absent for those molecules

[Figure: substructure query fragment containing two N atoms and three R positions]


Retrieve Testing Data

• Greatly assisted by previous projects
– Automatic uploading of data into database with validation
– Spotfire information model built with link to database
– Guides implemented to allow retrieval by different criteria, e.g. molecular identifier
– ISIS Direct not currently implemented; otherwise search and retrieval would be a single step
• This yields 6000 rows of test data based on 300 structures
• However, test protocols have changed over time
– We need to select a consistent dataset


What Data Have We Got?

• 6000 rows of data, split by
– Repeats
– Different tests, protocols and properties
• Consistent dataset
– 200 rows of data
– 60 different molecules
• Remaining molecules
– Present (60): used to build QSAR model
– Absent (840): used to test QSAR model

[Figure: breakdown of the data rows by Test and Protocol]
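The present/absent split above can be sketched in a few lines of Python. This is only an illustration: the identifiers, the `mol_id` field name and the `split_library` helper are hypothetical and do not come from the talk.

```python
def split_library(library_ids, tested_ids):
    """Partition library identifiers by whether consistent test data exists."""
    tested = set(tested_ids)
    present = [m for m in library_ids if m in tested]
    absent = [m for m in library_ids if m not in tested]
    return present, absent


# Hypothetical identifiers and results, for illustration only.
library = ["M001", "M002", "M003", "M004", "M005"]
consistent_rows = [
    {"mol_id": "M002", "result": 31.5},
    {"mol_id": "M002", "result": 33.0},  # repeat measurement of the same molecule
    {"mol_id": "M004", "result": 40.2},
]

present, absent = split_library(library, (r["mol_id"] for r in consistent_rows))
print(present)  # molecules with testing data: candidates for model building
print(absent)   # molecules without testing data: candidates for prediction
```

Repeat measurements collapse naturally because membership is tested against a set of identifiers.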


Clean Testing Data

[Figure: box plots of "Rodloss pcent Performance" (axis range 5 to 65) grouped by S Number; Standard 1 and Standard 2 are marked]

S Number   F005900   Other   S187284
Count      56        81      51
Mean       33.7      33.2    40.4
StdDev     3.8       14.1    4.0

• Outliers
– Box plots
– Normalisation
• Missing data
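A box plot flags outliers visually. One common numerical equivalent is Tukey's rule (values more than 1.5 interquartile ranges beyond the quartiles), sketched below. The rule, the threshold `k=1.5` and the sample values are assumptions for illustration; the talk does not say which criterion was applied. Missing values are represented as `None` and counted separately rather than flagged.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return the (low, high) limits beyond which a value counts as an outlier."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Flag outliers; missing values (None) are skipped and never flagged."""
    observed = [v for v in values if v is not None]
    low, high = tukey_fences(observed, k)
    return [v is not None and not (low <= v <= high) for v in values]


# Hypothetical performance results with one missing value and one suspect value.
results = [33.1, 34.0, None, 32.5, 35.2, 33.8, 61.0, 34.4]
flags = flag_outliers(results)
missing = sum(v is None for v in results)
print(flags, missing)
```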


Generate Descriptors

• For the 900 molecules
• Dragon molecular descriptors
– Numerical representation of molecular structure
– Allows for multiple occurrences of the same molecular feature
• ISIS keys
– Binary: only show presence or absence
– Short range


Dragon Software

• Descriptors
– Arranged in blocks
– 1D, 2D, 3D = 1630
– 1D, 2D only = 760
• Input from
– MDL (.mol, .sdf)
– Sybyl (.mol, .mol2, .ml2, .sm2)
– SMILES (no 3D)
• Commercial software from R. Todeschini
– Milano Chemometrics and QSAR Research Group (http://www.disat.unimib.it/chm/)


Descriptors Procedure

• Export SDFile
– Must include molecular identifier
    • Allows data merging at later stage
– Structure cleaning may be required
    • Depends on data source
    • Specific hydrogen atoms
• Process with Dragon
• Results exported as a text file
• This can be loaded into Spotfire
– Very few software packages can cope with large numbers of columns, particularly when > 1024


Clean Descriptors Data

• Generate Correlation Matrix
– Highlights columns with very similar content
• Column Clean Tool
– Reduce the amount of redundant and duplicated data


Correlation Matrix

• JavaScript tool
– Connects to Spotfire
– Uses maths functions:
    • sfsmStdDev
    • sfsmCovariance
– Creates new recordset
– Opens in Spotfire
• New column from expression
– Abs (correlation)
– Colour by ….
– (greyscale best for viewing)
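The slide names a standard deviation and a covariance function, which suggests the usual Pearson correlation, covariance divided by the product of the two standard deviations. The sketch below assumes that formula and uses plain Python stand-ins rather than the Spotfire functions themselves; the column names and values are hypothetical.

```python
import math


def covariance(xs, ys):
    """Sample covariance of two equal-length columns."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def std_dev(xs):
    """Sample standard deviation, expressed through the covariance."""
    return math.sqrt(covariance(xs, xs))


def abs_correlation_matrix(columns):
    """Return {(a, b): |r|} for every pair of named columns."""
    matrix = {}
    for a, xs in columns.items():
        for b, ys in columns.items():
            denom = std_dev(xs) * std_dev(ys)
            # A constant column has zero spread, so its correlation is undefined.
            matrix[(a, b)] = abs(covariance(xs, ys) / denom) if denom else float("nan")
    return matrix


# Hypothetical descriptor columns.
columns = {
    "d1": [1.0, 2.0, 3.0, 4.0],
    "d2": [2.0, 4.0, 6.0, 8.0],   # exact multiple of d1
    "d3": [4.0, 3.0, 2.0, 1.0],   # exact inverse trend of d1
    "d4": [1.0, 0.0, 1.0, 0.0],
}
corr = abs_correlation_matrix(columns)
print(round(corr[("d1", "d4")], 3))
```

Taking the absolute value treats strongly negative and strongly positive correlations alike, which is what matters when looking for redundant columns.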


Correlation > 0.95


Column Clean Tool

• Written in-house, runs within Spotfire

• Reduces data set size by deleting:
– Consistent columns (171)
    • Only one value in column
– Near-consistent columns (9)
    • Only two values in column
    • One value occurs only once
– Correlated columns (319)
    • Highly correlated columns effectively replicate information
    • Remove the columns with the highest correlation
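The three deletion rules can be sketched as follows. This is not the in-house tool: the greedy choice (keep the first column seen, drop later columns that correlate with a kept one) and the 0.95 threshold are assumptions, and the descriptor names are hypothetical.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation of two non-constant columns."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def column_clean(columns, threshold=0.95):
    """Return (kept columns, {removed column name: reason})."""
    kept, removed = {}, {}
    for name, values in columns.items():
        counts = Counter(values)
        if len(counts) == 1:
            removed[name] = "consistent"
        elif len(counts) == 2 and min(counts.values()) == 1:
            removed[name] = "near consistent"
        elif any(abs(pearson(values, k)) > threshold for k in kept.values()):
            removed[name] = "correlated"
        else:
            kept[name] = values
    return kept, removed


# Hypothetical descriptor table with five molecules.
columns = {
    "nAtoms": [10, 12, 15, 11, 14],
    "const": [1, 1, 1, 1, 1],
    "rare": [0, 0, 0, 1, 0],
    "nAtoms_x2": [20, 24, 30, 22, 28],
    "logP": [1.2, 3.4, 0.5, 2.8, 1.9],
}
kept, removed = column_clean(columns)
print(sorted(kept), removed)
```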


Further Data Reduction

• Column Clean Tool
– Reduced the number of columns from 760 to 260
• Normalisation
– Z-score for continuous data
– Unit range for non- or semi-continuous data? (still needs to be implemented)
• Principal Component Analysis
– Reduces the number of columns to a smaller number of principal components
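The two normalisation options can be written out as below. This is a minimal sketch that assumes each column holds at least two distinct values (constant columns having already been removed); the function names are illustrative.

```python
import statistics


def z_score(values):
    """Centre on the mean and scale by the sample standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [(v - mu) / sigma for v in values]


def unit_range(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


z = z_score([2.0, 4.0, 6.0])
u = unit_range([0, 1, 3, 4])
print(z, u)
```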


Principal Component Analysis

• With 260 columns and multiple PCs, the loadings plot on the HTML report gets very crowded
• Requested enhancements to tool (now available in version 8.0)
– Return PCA results to new Spotfire instance
– Default: scree and scores plots
– Transpose data: loadings plot for each pair of PCs
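To make the three outputs concrete (scree values, scores, loadings), here is a small pure-Python PCA. It is not the Spotfire PCA tool: it uses power iteration with deflation on the covariance matrix, a simple technique chosen only to keep the sketch dependency-free, and the data are invented.

```python
import math


def mat_vec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def centre_and_covariance(rows):
    """Return mean-centred rows and the sample covariance matrix."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(p)]
           for i in range(p)]
    return centred, cov


def leading_eigenpair(matrix, iterations=1000):
    """Power iteration for the largest eigenvalue of a symmetric PSD matrix."""
    vec = [1.0 + 0.1 * i for i in range(len(matrix))]  # arbitrary start vector
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = mat_vec(matrix, vec)
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm < 1e-12:  # no variance left to extract
            return 0.0, vec
        vec = [x / norm for x in nxt]
    value = sum(v * w for v, w in zip(vec, mat_vec(matrix, vec)))
    return value, vec


def pca(rows, n_components=2):
    """Return (explained variance ratios, loadings, scores)."""
    centred, cov = centre_and_covariance(rows)
    total = sum(cov[i][i] for i in range(len(cov)))
    explained, loadings = [], []
    for _ in range(n_components):
        value, vec = leading_eigenpair(cov)
        explained.append(value / total)
        loadings.append(vec)
        # Deflate: remove the component just found before looking for the next.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(len(vec))]
               for i in range(len(vec))]
    scores = [[sum(c * l for c, l in zip(row, vec)) for vec in loadings]
              for row in centred]
    return explained, loadings, scores


# Invented data: the first two columns are strongly correlated.
rows = [[1, 2.0, 0.5], [2, 4.1, 1.5], [3, 5.9, 0.2], [4, 8.2, 1.8], [5, 9.9, 1.0]]
explained, loadings, scores = pca(rows)
print([round(e, 3) for e in explained])
```

The `explained` list is what a scree plot shows, `scores` gives one point per molecule, and `loadings` gives one weight per original column for each PC.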


PCA ~ Scree Plot


PCA ~ Scores Plot

[Figure: scores plot, axes PC1 and PC2]


PCA ~ Loadings Plot

[Figure: loadings plot, axes PC1 and PC2]


Merge Data

• Load performance testing data

• Use ‘add columns’ tool to add the descriptor data (multiple times)
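Conceptually, adding columns is a left join of the descriptor table onto the testing data by molecular identifier. The sketch below illustrates that idea only; it is not the Spotfire tool, and the `mol_id` key and descriptor names (`MW`, `nN`) are hypothetical.

```python
def add_columns(test_rows, descriptor_rows, key="mol_id"):
    """Left join: keep every test row, attach descriptors where the key matches."""
    lookup = {row[key]: row for row in descriptor_rows}
    merged = []
    for row in test_rows:
        extra = lookup.get(row[key], {})
        merged.append({**row, **{k: v for k, v in extra.items() if k != key}})
    return merged


test_rows = [
    {"mol_id": "M002", "performance": 31.5},
    {"mol_id": "M004", "performance": 40.2},
]
descriptor_rows = [
    {"mol_id": "M002", "MW": 180.2, "nN": 2},
    {"mol_id": "M003", "MW": 195.1, "nN": 2},
    {"mol_id": "M004", "MW": 210.3, "nN": 3},
]
merged = add_columns(test_rows, descriptor_rows)
print(merged)
```

This is why the exported SDFile must carry the molecular identifier: it is the only key linking descriptors back to test results.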


Build Model

• Currently performed externally
• Multi-linear regression
– Stepwise
– Six parameters
– Model returned to Spotfire using ‘New column from expression’
• PLS using SIMPLS algorithm also considered
– Hard to transfer back into Spotfire by expression
– Didn’t provide a better model
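An MLR model transfers easily because it is just an intercept plus a weighted sum of columns. The sketch below fits an ordinary least-squares model through the normal equations and renders it as an expression string. It omits the stepwise variable selection and uses two invented descriptors rather than six; the `[column]` bracket syntax in the output is an assumption about how the expression would be written.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_mlr(rows, names, target):
    """Least-squares coefficients [intercept, b1, b2, ...] via X'X b = X'y."""
    design = [[1.0] + [row[n] for n in names] for row in rows]
    y = [row[target] for row in rows]
    k = len(names) + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


def as_expression(coeffs, names):
    """Render the fitted model as a column expression string."""
    terms = [f"{coeffs[0]:.4g}"]
    terms += [f"({c:.4g} * [{n}])" for c, n in zip(coeffs[1:], names)]
    return " + ".join(terms)


# Invented data generated from y = 2 + 3*x1 - 1*x2.
rows = [
    {"x1": 1, "x2": 2, "y": 3}, {"x1": 2, "x2": 1, "y": 7},
    {"x1": 3, "x2": 5, "y": 6}, {"x1": 4, "x2": 3, "y": 11},
    {"x1": 5, "x2": 4, "y": 13},
]
coeffs = fit_mlr(rows, ["x1", "x2"], "y")
expression = as_expression(coeffs, ["x1", "x2"])
print(expression)
```

A PLS model, by contrast, is defined through latent variables, which is consistent with the slide's remark that it was harder to write back as a single expression.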


Experimental vs Predicted

[Figure: scatter plot of Predicted against Experimental values]


Predict for Unknowns

• Use generated model and new column from expression
• Use this to:
– Compare experimental to predicted
    • How good is the model?
– Use predictively for unknowns
    • What properties would we expect for untested materials?
– Consider diversity
• Initial results look promising
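"How good is the model?" is commonly answered with the coefficient of determination (R²) and the root-mean-square error between experimental and predicted values. The talk does not state which measures were used, so both are assumptions here, and the numbers are invented.

```python
import math


def r_squared(experimental, predicted):
    """Fraction of the variance in the experimental values explained by the model."""
    mean_exp = sum(experimental) / len(experimental)
    ss_res = sum((e - p) ** 2 for e, p in zip(experimental, predicted))
    ss_tot = sum((e - mean_exp) ** 2 for e in experimental)
    return 1.0 - ss_res / ss_tot


def rmse(experimental, predicted):
    """Typical size of a prediction error, in the units of the measurement."""
    squared = [(e - p) ** 2 for e, p in zip(experimental, predicted)]
    return math.sqrt(sum(squared) / len(squared))


experimental = [30.0, 35.0, 40.0, 45.0]
predicted = [31.0, 34.0, 41.0, 44.0]
r2 = r_squared(experimental, predicted)
err = rmse(experimental, predicted)
print(round(r2, 3), round(err, 3))
```

Because these figures are computed on the same 60 molecules used to fit the model, they describe goodness of fit rather than predictive ability for the untested molecules.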


Histogram of Predicted Values

• Bar charts are much improved in v8.0, particularly auto-binning and trellis


Diversity of Data by PCA


Alternative Diversity Plot

• Using profile search to generate similarity value

• Allows for multiple PC columns to be reduced into one axis
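One way to collapse several PC columns into a single axis is to measure each molecule's distance from a reference profile in PC space. The sketch below uses Euclidean distance to the centroid of the tested molecules; both choices are assumptions, since the similarity measure used by the profile search is not described in the talk, and the scores are invented.

```python
import math


def centroid(score_rows):
    """Mean position of a set of molecules in PC space."""
    n = len(score_rows)
    return [sum(row[j] for row in score_rows) / n for j in range(len(score_rows[0]))]


def profile_distance(score_rows, reference):
    """Euclidean distance of each molecule's PC scores from the reference profile."""
    return [math.dist(row, reference) for row in score_rows]


# Invented PC scores: tested molecules define the reference, unknowns are compared to it.
tested_scores = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]
unknown_scores = [[1.0, 1.0], [4.0, 5.0]]

reference = centroid(tested_scores)
distances = profile_distance(unknown_scores, reference)
print(distances)
```

A molecule far from the tested set on this axis is one for which the model is extrapolating, so its prediction deserves more caution.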


Thanks

• Daniel Tackley (Intertek ASG)

• Prof. E. Martin (CPACT)

• P. English (CPACT)