Building QSAR Models using Spotfire Qsar... · 2006-01-10 · QSAR Workflow. Substructure Search...

Computational Chemistry Group

Building QSAR Models using Spotfire

Dr. Julian Cherryman
Computational Chemistry Group Leader

www.IntertekASG.com


Abstract

In previous years we have shown how we integrated our data capture and retrieval using Spotfire’s information model (IIM) and guides. Since then we have focused on deriving the maximum amount of information from that data, using a combination of Spotfire’s statistical tools, our own tools and guides. We will show how this combination has allowed us to build a complete QSAR model from start to finish: selecting the initial molecules, retrieving and cleaning the testing data, generating and rationalising descriptors, and building an MLR statistical model. The model has then been compared with the initial testing data and taken forward to identify new molecules to be tested. The talk will also highlight some recent enhancements to the PCA tool that were added based, in part, on this work.


Intertek Group

• Network of 283 laboratories and 531 offices serving 102 countries worldwide

• 12,900 employees with a diversity of skills

• FY2003 group revenue >£471m

• Successful IPO on 29th May 2002 (Market Cap. $1 Billion)

• Serving the world’s leading oil, chemical and petrochemical companies since 1885

• Global Presence in most ports and many major cities worldwide

• ITS division for petroleum, petrochemicals and chemicals analysis

• Yr 2003 revenue > £169m; 5 years compound growth ~15% pa

• Analytical Testing from Outsourcing growing – already >30% sales


Intertek ASG

• ASG has a long history of advanced, innovative computational and analytical science

– The major central computational and analytical laboratory for Avecia from June 1999 to April 2004

• R&D projects to build state of the art capability in key areas
• Increased use of informatics, statistics, data manipulation and visualisation
• Development focus on capabilities to support the Pharmaceutical and Biotechnology businesses
• Translation into efficient on-going capabilities and services

– Significant component of corporate computational and analytical infrastructure within Zeneca / ICI prior to formation of Avecia

• Intertek ASG capabilities and services now accessible to a much broader client base


QSAR Workflow

• Substructure Search
• Retrieve Testing Data
• Clean Testing Data
• Generate Descriptors
• Clean Descriptors Data
• Data Reduction
• Merge Data
• Build Model
• Predict for Unknowns


Substructure Search

• Compound library options
  – Real molecules
  – Virtual molecules

• Library enumeration

• ISIS Substructure search with specific fragment gives ~900 molecules

• Testing data
  – Present or absent for those molecules

[Figure: substructure search fragment containing two N atoms and three R groups]


Retrieve Testing Data

• Greatly assisted by previous projects
  – Automatic uploading of data into database with validation
  – Spotfire information model built with link to database
  – Guides implemented to allow retrieval by different criteria, e.g. molecular identifier
  – ISIS Direct not currently implemented, otherwise search and retrieval would be a single step.

• This yields 6000 rows of test data based on 300 structures

• However, test protocols have changed over time.
  – We need to select a consistent dataset.


What Data Have We Got?

• 6000 rows of data, split by
  – Repeats
  – Different tests, protocols and properties

• Consistent dataset
  – 200 rows of data
  – 60 different molecules

• Remaining molecules
  – Present (60) – used to build QSAR model
  – Absent (840) – used to test QSAR model

[Figure: data split by Test Protocol]
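Selecting a consistent dataset can be sketched as a filter on a single test, protocol and property, followed by grouping the repeats for each molecule. This is only an illustration: the field names (`mol_id`, `test`, `protocol`, `property`, `value`) and the sample rows are hypothetical, and averaging the repeats is an assumption rather than something stated on the slide.

```python
from collections import defaultdict
from statistics import mean

def consistent_dataset(rows, test, protocol, prop):
    """Keep rows for one test/protocol/property and average repeats per molecule."""
    repeats = defaultdict(list)
    for row in rows:
        if (row["test"], row["protocol"], row["property"]) == (test, protocol, prop):
            repeats[row["mol_id"]].append(row["value"])
    # Averaging repeats is an assumption made for this sketch.
    return {mol_id: mean(values) for mol_id, values in repeats.items()}

# Hypothetical rows: two repeats for M1, one row measured under an older protocol.
rows = [
    {"mol_id": "M1", "test": "T1", "protocol": "P2", "property": "performance", "value": 33.0},
    {"mol_id": "M1", "test": "T1", "protocol": "P2", "property": "performance", "value": 35.0},
    {"mol_id": "M2", "test": "T1", "protocol": "P2", "property": "performance", "value": 40.0},
    {"mol_id": "M3", "test": "T1", "protocol": "P1", "property": "performance", "value": 12.0},
]
consistent = consistent_dataset(rows, "T1", "P2", "performance")
```

Molecules measured only under other protocols (M3 here) fall out of the model-building set, which mirrors the present/absent split described above.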


Clean Testing Data

[Figure: box plots with axis labels "S Number 1", "Rodloss pcent" and "Performance", showing Standard 1 and Standard 2; axis ticks run from 5 to 65]

          F005900   Other   S187284
Count     56        81      51
Mean      33.7      33.2    40.4
StdDev    3.8       14.1    4.0

• Outliers
  – Box Plots
  – Normalisation

• Missing Data
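The two cleaning steps can be illustrated outside Spotfire with a small sketch: rows with a missing result are dropped, and the remaining values are flagged using the usual box-plot whisker rule (1.5 times the interquartile range). The 1.5 multiplier, the field names and the sample values are assumptions; the slide does not say which rule or which treatment of missing data was actually used.

```python
import statistics

def tukey_fences(values, k=1.5):
    """Return (lower, upper) box-plot whisker limits using the k * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def clean_results(rows):
    """Drop rows with a missing result, then flag box-plot outliers."""
    present = [r for r in rows if r["result"] is not None]
    lower, upper = tukey_fences([r["result"] for r in present])
    return [dict(r, outlier=not (lower <= r["result"] <= upper)) for r in present]

# Hypothetical test results: one missing value and one suspiciously high value.
rows = [
    {"id": "M1", "result": 33.0},
    {"id": "M2", "result": 35.0},
    {"id": "M3", "result": None},
    {"id": "M4", "result": 34.0},
    {"id": "M5", "result": 32.0},
    {"id": "M6", "result": 90.0},
    {"id": "M7", "result": 36.0},
]
cleaned = clean_results(rows)
outliers = [r["id"] for r in cleaned if r["outlier"]]
```

Flagging rather than deleting keeps the decision about each outlier with the analyst, as it would be when inspecting the box plot by eye.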






Generate Descriptors

• For the 900 molecules

• Dragon molecular descriptors
  – Numerical representation of molecular structure
  – Allows for multiple occurrences of the same molecular feature

• ISIS keys
  – Binary – only show presence or absence
  – Short range


Dragon Software

• Descriptors
  – Arranged in blocks
  – 1D, 2D, 3D = 1630
  – 1D, 2D only = 760

• Input from
  – MDL (.mol, .sdf)
  – Sybyl (.mol, .mol2, .ml2, .sm2)
  – Smiles (no 3D)

• Commercial software from R. Todeschini
  – Milano Chemometrics and QSAR Research Group (http://www.disat.unimib.it/chm/)


Descriptors Procedure

• Export SDFile
  – Must include molecular identifier
    • Allows data merging at later stage
  – Structure cleaning may be required
    • Depends on data source
    • Specific hydrogen atoms

• Process with Dragon
• Results exported as a text file
• This can be loaded into Spotfire.
  – Very few software packages can cope with large numbers of columns, particularly when > 1024.


Clean Descriptors Data

• Generate Correlation Matrix
  – Highlights columns with very similar content

• Column Clean Tool
  – Reduce the amount of redundant and duplicated data


Correlation Matrix

• Javascript tool
  – Connects to Spotfire
  – Use maths functions:
    • sfsmStdDev
    • sfsmCovariance
  – Create new recordset
  – Open in Spotfire

• New column from expr.
  – Abs (correlation)
  – Colour by ….
  – (greyscale best for viewing)
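The calculation behind the matrix is the covariance of two columns divided by the product of their standard deviations, which matches the two maths functions named above. The sketch below reproduces that arithmetic in plain Python rather than through the Spotfire API; the function names, descriptor names and values are hypothetical.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Sample covariance of two equal-length columns."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def std_dev(xs):
    """Sample standard deviation (square root of the column's own covariance)."""
    return math.sqrt(covariance(xs, xs))

def abs_correlation_matrix(columns):
    """Return {(a, b): |r|} for every pair of named columns."""
    matrix = {}
    for a, xs in columns.items():
        for b, ys in columns.items():
            denom = std_dev(xs) * std_dev(ys)
            r = covariance(xs, ys) / denom if denom else float("nan")
            matrix[(a, b)] = abs(r)
    return matrix

# Hypothetical descriptor columns: "MW" and "nAtoms" move together exactly.
columns = {
    "MW": [100.0, 150.0, 200.0, 250.0],
    "nAtoms": [10.0, 15.0, 20.0, 25.0],
    "logP": [1.0, 3.0, 2.0, 0.5],
}
matrix = abs_correlation_matrix(columns)
```

Taking the absolute value means strongly negative correlations are highlighted in the same way as strongly positive ones when the matrix is coloured.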


Correlation > 0.95


Column Clean Tool

• Written in-house, runs within Spotfire

• Reduces data set size by deleting:
  – Consistent Columns (171)
    • Only one value in column
  – Near Consistent (9)
    • Only two values in column
    • One value occurs only once
  – Correlated Columns (319)
    • Highly correlated columns effectively replicate information
    • Remove the columns with the highest correlation
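The three deletion rules can be sketched as follows. This is not the in-house tool: the greedy order (keep the first column of any highly correlated pair) and the 0.95 threshold are assumptions chosen to match the "Correlation > 0.95" view, and the column names are hypothetical.

```python
import math
from collections import Counter

def is_constant(values):
    """Consistent column: only one distinct value."""
    return len(set(values)) == 1

def is_near_constant(values):
    """Near consistent column: two distinct values, one of which occurs once."""
    counts = Counter(values)
    return len(counts) == 2 and min(counts.values()) == 1

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def column_clean(columns, threshold=0.95):
    """Drop constant, near-constant and highly correlated columns."""
    kept = {}
    for name, values in columns.items():
        if is_constant(values) or is_near_constant(values):
            continue
        # Greedy rule (an assumption): drop a column if it duplicates one already kept.
        if any(abs(pearson(values, other)) > threshold for other in kept.values()):
            continue
        kept[name] = values
    return kept

columns = {
    "const": [1, 1, 1, 1, 1],
    "near": [0, 0, 0, 0, 1],
    "a": [1, 2, 3, 4, 5],
    "a_copy": [2, 4, 6, 8, 10.5],
    "b": [5, 1, 4, 2, 3],
}
kept = column_clean(columns)
```

Constant columns are removed before any correlation is computed, which also avoids dividing by a zero variance.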


Further Data Reduction

• Column Clean Tool
  – Reduced the number of columns from 760 to 260

• Normalisation
  – Z-score for continuous data
  – Unit range for non- or semi-continuous data? (still needs to be implemented)

• Principal Component Analysis
  – Reduces the number of columns to a smaller number of principal components
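The two normalisation options mentioned above are simple enough to show directly. The sketch assumes the sample standard deviation for the z-score and a 0 to 1 target for the unit range; neither detail is stated on the slide.

```python
import statistics

def z_score(values):
    """Centre on the mean and scale by the sample standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [(v - mu) / sigma for v in values]

def unit_range(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

continuous = z_score([2.0, 4.0, 6.0])
counts = unit_range([0.0, 1.0, 3.0, 4.0])
```

Both functions assume the column is not constant, which holds once constant columns have been removed.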


Principal Component Analysis

• With 260 columns and multiple PCs, the loadings plot on the HTML report gets very crowded

• Requested enhancements to tool (now available in version 8.0)
  – Return PCA results to new Spotfire instance
  – Default: Scree and scores plots
  – Transpose Data: Loadings plot for each pair of PCs
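For readers unfamiliar with what the tool returns, the quantities behind the scree, scores and loadings plots can be sketched in plain Python. This uses power iteration with deflation on the covariance matrix, which is one simple way to obtain the leading components and not necessarily the algorithm used by the Spotfire PCA tool; the data values are hypothetical.

```python
import math

def centre(rows):
    """Subtract each column mean so the data are centred on the origin."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in rows]

def covariance_matrix(centred):
    n, p = len(centred), len(centred[0])
    return [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(p)]
            for i in range(p)]

def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def leading_eigenpair(m, iterations=500):
    """Power iteration: repeated multiplication converges on the top eigenvector."""
    v = [1.0 / (j + 1) for j in range(len(m))]  # arbitrary non-zero start vector
    for _ in range(iterations):
        w = mat_vec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(m, v)))
    return eigenvalue, v

def pca(rows, n_components=2):
    centred = centre(rows)
    cov = covariance_matrix(centred)
    p = len(cov)
    total_variance = sum(cov[i][i] for i in range(p))
    eigenvalues, loadings = [], []
    for _ in range(n_components):
        value, vector = leading_eigenpair(cov)
        eigenvalues.append(value)
        loadings.append(vector)
        # Deflation: remove the component just found before extracting the next.
        cov = [[cov[i][j] - value * vector[i] * vector[j] for j in range(p)]
               for i in range(p)]
    # Scores: projection of each centred row onto each loading vector.
    scores = [[sum(a * b for a, b in zip(row, vec)) for vec in loadings]
              for row in centred]
    explained = [value / total_variance for value in eigenvalues]  # scree plot heights
    return explained, loadings, scores

# Hypothetical data: the first two columns are nearly proportional.
rows = [[1.0, 2.0, 0.1], [2.0, 4.1, 0.0], [3.0, 5.9, 0.2], [4.0, 8.0, 0.1], [5.0, 10.1, 0.0]]
explained, loadings, scores = pca(rows)
```

Here `explained` corresponds to the scree plot, `scores` to the scores plot (one point per molecule) and `loadings` to the loadings plot (one point per descriptor column).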


PCA ~ Scree Plot


PCA ~ Scores Plot

[Figure: scores plot, PC1 against PC2]


PCA ~ Loadings Plot

[Figure: loadings plot, PC1 against PC2]






Merge Data

• Load performance testing data

• Use ‘add columns’ tool to add the descriptor data (multiple times)
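Conceptually, adding the descriptor columns is a join on the molecular identifier: every test row picks up the descriptors of its molecule, so repeats of the same molecule share the same descriptor values. The sketch below shows that join in plain Python; the key name `mol_id`, the descriptor `MW` and the values are hypothetical, and this is not the Spotfire tool itself.

```python
def add_columns(test_rows, descriptor_rows, key="mol_id"):
    """Attach descriptor columns to each test row by matching on the identifier."""
    lookup = {d[key]: d for d in descriptor_rows}
    # Rows with no matching descriptors are kept unchanged.
    return [{**lookup.get(row[key], {}), **row} for row in test_rows]

test_rows = [
    {"mol_id": "M1", "result": 33.7},
    {"mol_id": "M1", "result": 34.1},  # repeat measurement of M1
    {"mol_id": "M2", "result": 40.4},
]
descriptor_rows = [
    {"mol_id": "M1", "MW": 250.3},
    {"mol_id": "M2", "MW": 310.4},
]
merged = add_columns(test_rows, descriptor_rows)
```

This is why the exported SDFile must carry the molecular identifier: without it there is nothing to match on.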


Build Model

• Currently performed externally

• Multi-Linear Regression
  – Stepwise
  – Six parameters
  – Model returned to Spotfire using ‘New column from expression’

• PLS using SIMPLS algorithm also considered
  – Hard to transfer back into Spotfire by expression
  – Didn’t provide a better model
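Since the regression was run in an external package that is not named, the following is only a sketch of how a forward stepwise multi-linear regression can work: at each step the descriptor that most reduces the residual sum of squares is added, until the F-to-enter statistic falls below a threshold or the term limit is reached. The threshold of 4.0, the forward-only strategy and all data are assumptions; the default limit of six terms echoes the six-parameter model mentioned above.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit(columns, names, y):
    """Least-squares fit of y on an intercept plus the named columns."""
    design = [[1.0] + [columns[n][i] for n in names] for i in range(len(y))]
    p = len(names) + 1
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    coef = solve(xtx, xty)
    rss = sum((yi - sum(c * v for c, v in zip(coef, r))) ** 2
              for r, yi in zip(design, y))
    return coef, rss

def forward_stepwise(columns, y, max_terms=6, f_to_enter=4.0):
    """Add one descriptor at a time while the F-to-enter test is passed."""
    selected = []
    _, rss = fit(columns, selected, y)  # intercept-only model
    while len(selected) < max_terms:
        dof = len(y) - (len(selected) + 2)  # residual d.o.f. after adding a term
        candidates = [n for n in columns if n not in selected]
        if dof <= 0 or not candidates:
            break
        trials = {n: fit(columns, selected + [n], y)[1] for n in candidates}
        best = min(trials, key=trials.get)
        new_rss = trials[best]
        f_stat = float("inf") if new_rss == 0 else (rss - new_rss) / (new_rss / dof)
        if f_stat < f_to_enter:
            break
        selected.append(best)
        rss = new_rss
    coef, rss = fit(columns, selected, y)
    return selected, coef, rss

# Hypothetical descriptors; y was generated as 3 + 2*x1 - 4*x3 plus small noise.
columns = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [5.0, 3.0, 8.0, 1.0, 7.0, 2.0, 6.0, 4.0],
    "x3": [1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0],
}
y = [1.1, 6.9, 5.05, 10.95, 13.1, 10.9, 12.95, 19.05]
selected, coef, rss = forward_stepwise(columns, y)
```

The result is an intercept plus one coefficient per selected descriptor, which is the form that can be written back as a single column expression. The solver assumes the candidate columns are not exactly collinear, which is one reason the column cleaning step comes first.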


Experimental vs Predicted

[Figure: scatter plot of Predicted against Experimental values]


Predict for Unknowns

• Use generated model and new column from expression

• Use this to:
  – Compare experimental to predicted
    • How good is the model?
  – Use predictively for unknowns
    • What properties would we expect for untested materials?
  – Consider diversity

• Initial results look promising
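Applying the model amounts to evaluating a linear expression for each molecule, and "how good is the model?" can be summarised with the coefficient of determination between experimental and predicted values. The coefficients, descriptor names and values below are hypothetical and are not the model from the talk; R² is one common summary, not necessarily the one the authors used.

```python
def predict(intercept, coefficients, descriptors):
    """Evaluate intercept + sum(coefficient * descriptor) for one molecule."""
    return intercept + sum(c * descriptors[name] for name, c in coefficients.items())

def r_squared(experimental, predicted):
    """Coefficient of determination between experimental and predicted values."""
    mean_exp = sum(experimental) / len(experimental)
    ss_res = sum((e - p) ** 2 for e, p in zip(experimental, predicted))
    ss_tot = sum((e - mean_exp) ** 2 for e in experimental)
    return 1.0 - ss_res / ss_tot

# Hypothetical two-term model applied to an untested molecule.
coefficients = {"x1": 2.0, "x3": -4.0}
unknown = predict(3.0, coefficients, {"x1": 1.0, "x3": 1.0})

# Hypothetical comparison for molecules that do have test data.
fit_quality = r_squared([1.0, 2.0, 3.0], [1.1, 1.9, 3.2])
```

The same `predict` call serves both uses listed above: scoring the tested molecules for comparison and estimating properties for the untested ones.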


Histogram of Predicted Values

• Bar charts are much improved in v8.0, particularly auto-binning and trellis


Diversity of Data by PCA


Alternative Diversity Plot

• Using profile search to generate similarity value

• Allows for multiple PC columns to be reduced into one axis
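One way to picture how several PC columns collapse onto a single axis is to measure each molecule's distance from a reference profile in PC-score space. The sketch below uses Euclidean distance from the centroid of the tested molecules; the measure actually used by Spotfire's profile search is not stated on the slide, so both the distance measure and the choice of reference are assumptions, as are the score values.

```python
import math

def centroid(profiles):
    """Mean PC score in each dimension across a set of molecules."""
    n = len(profiles)
    return [sum(p[j] for p in profiles) / n for j in range(len(profiles[0]))]

def distance_to_profile(scores, reference):
    """Euclidean distance between one molecule's PC scores and the reference."""
    return math.sqrt(sum((s - r) ** 2 for s, r in zip(scores, reference)))

# Hypothetical PC scores for tested molecules and two untested ones.
training_scores = [[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, -2.0, 0.0]]
reference = centroid(training_scores)
untested = {"U1": [3.0, 4.0, 0.0], "U2": [0.5, 0.0, 0.0]}
distances = {name: distance_to_profile(s, reference) for name, s in untested.items()}
```

Molecules with a large distance (U1 here) lie outside the region covered by the tested set, so their predictions deserve more caution.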






Thanks

• Daniel Tackley (Intertek ASG)

• Prof. E. Martin (CPACT)

• P. English (CPACT)
