Building QSAR Models using Spotfire Qsar... · 2006-01-10 · QSAR Workflow. Substructure Search...
Computational Chemistry Group
Building QSAR Models using Spotfire
Dr. Julian Cherryman
Computational Chemistry Group Leader
www.IntertekASG.com
Abstract
In previous years we showed how we integrated our data capture and retrieval using Spotfire’s information model (IIM) and guides. Since then we have focused on deriving the maximum amount of information from that data, using a combination of Spotfire’s statistical tools, our own tools and guides. We will show how this combination has allowed us to build a complete QSAR model from start to finish. This includes: selecting the initial molecules, retrieving and cleaning the testing data, generating and rationalising descriptors, and building an MLR statistical model. The model has then been compared with the initial testing data and taken forwards to identify new molecules to be tested. The talk will also highlight some of the recent enhancements to the PCA tool that were added, based partly on this work.
Intertek Group
• Network of 283 laboratories and 531 offices serving 102 countries worldwide
• 12,900 employees with a diversity of skills
• FY2003 group revenue >£471m
• Successful IPO on 29th May 2002 (Market Cap. $1 Billion)
• Serving the world’s leading oil, chemical and petrochemical companies since 1885
• Global Presence in most ports and many major cities worldwide
• ITS division for petroleum, petrochemicals and chemicals analysis
• Yr 2003 revenue > £169m; 5 years compound growth ~15% pa
• Analytical Testing from Outsourcing growing – already >30% sales
Intertek ASG
• ASG has a long history of advanced, innovative computational and analytical science
  – The major central computational and analytical laboratory for Avecia from June 1999 to April 2004
    • R&D projects to build state of the art capability in key areas
    • Increased use of informatics, statistics, data manipulation and visualisation
    • Development focus on capabilities to support the Pharmaceutical and Biotechnology businesses
    • Translation into efficient on-going capabilities and services
  – Significant component of corporate computational and analytical infrastructure within Zeneca / ICI prior to formation of Avecia
• Intertek ASG capabilities and services now accessible to a much broader client base
QSAR Workflow
• Substructure Search
• Retrieve Testing Data
• Clean Testing Data
• Generate Descriptors
• Clean Descriptors Data
• Data Reduction
• Merge Data
• Build Model
• Predict for Unknowns
Substructure Search
• Compound library options
  – Real molecules
  – Virtual molecules
• Library enumeration
• ISIS Substructure search with specific fragment gives ~900 molecules
• Testing data
  – Present or absent for those molecules
[Figure: substructure query fragment; atom labels N, N and three R groups]
Retrieve Testing Data
• Greatly assisted by previous projects
  – Automatic uploading of data into database with validation
  – Spotfire information model built with link to database
  – Guides implemented to allow retrieval by different criteria, e.g. molecular identifier
  – ISIS Direct not currently implemented, otherwise search and retrieval would be a single step.
• This yields 6000 rows of test data based on 300 structures
• However, test protocols have changed over time.
  – We need to select a consistent dataset.
What Data Have We Got?
• 6000 rows of data, split by
  – Repeats
  – Different tests, protocols and properties
• Consistent dataset
  – 200 rows of data
  – 60 different molecules
• Remaining molecules
  – Present (60) – used to build QSAR model
  – Absent (840) – used to test QSAR model
[Figure: data split by Test and Protocol]
Clean Testing Data
• Outliers
  – Box Plots
  – Normalisation
• Missing Data
[Figure: box plots by S Number, with Standard 1 and Standard 2 labelled; axis labels as extracted: "Rodloss pcent", "Performance"; y-axis ticks 5 to 65]
S Number   F005900   Other   S187284
Count      56        81      51
Mean       33.7      33.2    40.4
StdDev     3.8       14.1    4.0
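The box-plot approach to outliers listed on this slide can be sketched as follows. This is a minimal illustration in Python, assuming the common 1.5 × IQR whisker rule; the function name `iqr_outliers` and the sample values are hypothetical, and the talk does not state which rule or threshold was actually applied.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values lying outside the box-plot whiskers (assumed k * IQR rule)."""
    # statistics.quantiles with n=4 returns the three quartile cut points
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical repeat measurements for a single standard
measurements = [33.1, 34.0, 32.8, 35.2, 33.9, 34.4, 61.0]
flagged = iqr_outliers(measurements)
print(flagged)
```

Flagged values would still need a chemist's judgement before being dropped; the rule only identifies candidates.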
Generate Descriptors
• For the 900 molecules
• Dragon molecular descriptors
  – Numerical representation of molecular structure
  – Allows for multiple occurrences of the same molecular feature
• ISIS keys
  – Binary – only show presence or absence
  – Short range
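The difference between a count-style descriptor and a binary key can be illustrated with a toy example. This sketch does not reproduce the Dragon or ISIS algorithms; the feature labels, the vocabulary and both function names are hypothetical and exist only to show why counts retain information that presence/absence keys discard.

```python
from collections import Counter

# Hypothetical structural features found in one molecule (labels are illustrative only)
features = ["aromatic_ring", "aromatic_ring", "nitro", "hydroxyl", "hydroxyl", "hydroxyl"]

def count_descriptor(found):
    """Count-style descriptor: how many times each feature occurs."""
    return dict(Counter(found))

def binary_key(found, vocabulary):
    """Key-style fingerprint: 1 if the feature is present at all, else 0."""
    present = set(found)
    return [1 if name in present else 0 for name in vocabulary]

vocabulary = ["aromatic_ring", "nitro", "hydroxyl", "carboxyl"]
counts = count_descriptor(features)
key = binary_key(features, vocabulary)
print(counts)
print(key)
```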
Dragon Software
• Descriptors
  – Arranged in blocks
  – 1D, 2D, 3D = 1630
  – 1D, 2D only = 760
• Input from
  – MDL (.mol, .sdf)
  – Sybyl (.mol, .mol2, .ml2, .sm2)
  – SMILES (no 3D)
• Commercial software from R. Todeschini
  – Milano Chemometrics and QSAR Research Group (http://www.disat.unimib.it/chm/)
Descriptors Procedure
• Export SDFile
  – Must include molecular identifier
    • Allows data merging at later stage
  – Structure cleaning may be required
    • Depends on data source
    • Specific hydrogen atoms
• Process with Dragon
• Results exported as a text file
• This can be loaded into Spotfire.
  – Very few software packages can cope with large numbers of columns, particularly when > 1024.
Clean Descriptors Data
• Generate Correlation Matrix
  – Highlights columns with very similar content
• Column Clean Tool
  – Reduce the amount of redundant and duplicated data
Correlation Matrix
• Javascript tool
  – Connects to Spotfire
  – Use maths functions:
    • sfsmStdDev
    • sfsmCovariance
  – Create new recordset
  – Open in Spotfire
• New column from expr.
  – Abs (correlation)
  – Colour by ….
  – (greyscale best for viewing)
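A correlation matrix can be built from standard deviations and covariances, since the Pearson correlation is the covariance divided by the product of the two standard deviations. The sketch below shows that calculation in Python rather than in the Javascript tool; it does not call `sfsmStdDev` or `sfsmCovariance` (their signatures are not given in the talk), and the function names and sample columns are hypothetical stand-ins.

```python
import statistics

def covariance(x, y):
    """Sample covariance of two equal-length columns."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def abs_correlation_matrix(columns):
    """columns: dict of name -> list of values. Returns dict of (a, b) -> |r|."""
    names = list(columns)
    sd = {n: statistics.stdev(columns[n]) for n in names}
    result = {}
    for a in names:
        for b in names:
            if sd[a] == 0 or sd[b] == 0:
                result[(a, b)] = float("nan")  # correlation undefined for a constant column
            else:
                r = covariance(columns[a], columns[b]) / (sd[a] * sd[b])
                result[(a, b)] = abs(r)
    return result

# Hypothetical descriptor columns
columns = {
    "d1": [1.0, 2.0, 3.0, 4.0],
    "d2": [2.1, 3.9, 6.2, 7.8],
    "d3": [4.0, 3.0, 2.0, 1.0],
}
matrix = abs_correlation_matrix(columns)
print(matrix[("d1", "d3")])
```

Taking the absolute value, as on the slide, treats strongly negative and strongly positive correlations as equally redundant.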
Correlation > 0.95
Column Clean Tool
• Written in-house, runs within Spotfire
• Reduces data set size by deleting:
  – Consistent Columns (171)
    • Only one value in column
  – Near Consistent (9)
    • Only two values in column
    • One value occurs only once
  – Correlated Columns (319)
    • Highly correlated columns effectively replicate information
    • Remove the columns with the highest correlation
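The three deletion rules can be sketched in a few lines of Python. This is not the in-house tool: the function names, the 0.95 threshold and the sample data are assumptions, and the sketch greedily keeps the first column of each correlated group, which may differ from how the real tool decides which correlated column to remove.

```python
import statistics
from collections import Counter

def pearson(x, y):
    """Pearson correlation of two non-constant columns."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def clean_columns(columns, threshold=0.95):
    """Split columns into kept and dropped, recording the reason for each drop."""
    kept, dropped = {}, {}
    for name, values in columns.items():
        counts = Counter(values)
        if len(counts) == 1:
            dropped[name] = "consistent"        # only one value in column
        elif len(counts) == 2 and min(counts.values()) == 1:
            dropped[name] = "near consistent"   # two values, one occurring once
        elif any(abs(pearson(values, other)) > threshold for other in kept.values()):
            dropped[name] = "correlated"        # replicates a column already kept
        else:
            kept[name] = values
    return kept, dropped

# Hypothetical descriptor columns
columns = {
    "a": [1, 1, 1, 1, 1],
    "b": [0, 0, 0, 0, 1],
    "c": [1, 2, 3, 4, 5],
    "d": [2, 4, 6, 8, 10.5],
    "e": [5, 1, 4, 2, 3],
}
kept, dropped = clean_columns(columns)
print(sorted(kept), dropped)
```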
Further Data Reduction
• Column Clean Tool
  – Reduced the number of columns from 760 to 260
• Normalisation
  – Z-score for continuous data
  – Unit range for non- or semi-continuous data? (still needs to be implemented)
• Principal Component Analysis
  – Reduces the number of columns to a smaller number of principal components
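The two normalisation options can be written out as below. This is a minimal sketch: the function names are hypothetical, the z-score uses the sample standard deviation (the talk does not say whether sample or population was used), and both functions assume a non-constant column, which holds once consistent columns have been removed.

```python
import statistics

def z_score(values):
    """Centre on the mean and scale by the sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def unit_range(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

z = z_score([1.0, 2.0, 3.0])
u = unit_range([0.0, 2.0, 4.0])
print(z, u)
```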
Principal Component Analysis
• With 260 columns and multiple PCs, the loadings plot on the HTML report gets very crowded
• Requested enhancements to tool (now available in version 8.0)
  – Return PCA results to new Spotfire instance
  – Default: scree and scores plots
  – Transpose Data: loadings plot for each pair of PCs
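The quantities behind the scree, scores and loadings plots can be computed generically as follows. This is a sketch using NumPy's singular value decomposition, not a description of how Spotfire's PCA tool works internally; the function name `pca` and the synthetic data are assumptions.

```python
import numpy as np

def pca(X):
    """PCA via SVD of the column-centred matrix X (rows = molecules, columns = descriptors)."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)  # fraction of variance per PC (scree plot)
    scores = U * s                   # coordinates of each row on the PCs (scores plot)
    loadings = Vt.T                  # weight of each column in each PC (loadings plot)
    return explained, scores, loadings

# Synthetic data: two nearly collinear columns plus one low-variance column
rng = np.random.default_rng(0)
t = rng.normal(size=50)
X = np.column_stack([t, 2 * t + 0.01 * rng.normal(size=50), 0.1 * rng.normal(size=50)])
explained, scores, loadings = pca(X)
print(explained)
```

With many columns, plotting loadings one pair of PCs at a time, as the enhanced tool does, keeps each plot readable.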
PCA ~ Scree Plot
PCA ~ Scores Plot
[Figure: scores plot, axes PC1 and PC2]
PCA ~ Loadings Plot
[Figure: loadings plot, axes PC1 and PC2]
Merge Data
• Load performance testing data
• Use ‘add columns’ tool to add the descriptor data (multiple times)
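Conceptually, adding the descriptor columns is a left join on the molecular identifier: every test row, including repeats, picks up the descriptors of its molecule. The sketch below illustrates this in plain Python; it is not Spotfire's tool, and the key name `mol_id`, the column names and all values are hypothetical.

```python
def add_columns(test_rows, descriptor_rows, key="mol_id"):
    """Left join: append descriptor columns to every test row with a matching key."""
    by_id = {row[key]: row for row in descriptor_rows}
    merged = []
    for row in test_rows:
        extra = by_id.get(row[key], {})  # rows without descriptors are kept unchanged
        merged.append({**row, **{k: v for k, v in extra.items() if k != key}})
    return merged

# Hypothetical performance data (with a repeat for M1) and descriptor data
test_rows = [
    {"mol_id": "M1", "performance": 33.5},
    {"mol_id": "M1", "performance": 34.1},
    {"mol_id": "M2", "performance": 40.2},
]
descriptor_rows = [
    {"mol_id": "M1", "MW": 250.3, "nN": 2},
    {"mol_id": "M2", "MW": 310.4, "nN": 3},
    {"mol_id": "M3", "MW": 198.2, "nN": 2},
]
merged = add_columns(test_rows, descriptor_rows)
print(merged)
```

This is why the exported SDFile must carry the molecular identifier: without it there is no key to join on.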
Build Model
• Currently performed externally
• Multi-Linear Regression
  – Stepwise
  – Six parameters
  – Model returned to Spotfire using ‘New column from expression’
• PLS using SIMPLS algorithm also considered
  – Hard to transfer back into Spotfire by expression
  – Didn’t provide a better model
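A stepwise multi-linear regression whose result is written out as a column expression might look like the sketch below. Several things here are assumptions: the talk does not name the external package or its selection criterion, so a simple greedy forward selection on residual sum of squares is swapped in; the square-bracket column syntax in the expression string is illustrative; and the synthetic example selects two terms rather than the six used in the talk.

```python
import numpy as np

def forward_stepwise(X, y, max_terms):
    """Greedy forward selection: repeatedly add the column that most reduces the RSS."""
    max_terms = min(max_terms, X.shape[1])
    chosen = []
    while len(chosen) < max_terms:
        best = None
        for j in range(X.shape[1]):
            if j in chosen:
                continue
            A = np.column_stack([np.ones(len(y)), X[:, chosen + [j]]])
            coef = np.linalg.lstsq(A, y, rcond=None)[0]
            rss = float(np.sum((y - A @ coef) ** 2))
            if best is None or rss < best[0]:
                best = (rss, j)
        chosen.append(best[1])
    return chosen

def fit_mlr(X, y, names):
    """Ordinary least squares with intercept; returns coefficients and an expression string."""
    A = np.column_stack([np.ones(len(y)), X])
    coef = np.linalg.lstsq(A, y, rcond=None)[0]
    terms = [f"{coef[0]:.4g}"] + [f"({c:.4g}) * [{n}]" for c, n in zip(coef[1:], names)]
    return coef, " + ".join(terms)

# Synthetic data: y depends on columns d1 and d3 only
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
y = 3.0 + 2.0 * X[:, 1] - 1.5 * X[:, 3] + 0.01 * rng.normal(size=60)
names = ["d0", "d1", "d2", "d3", "d4"]

chosen = forward_stepwise(X, y, max_terms=2)
coef, expression = fit_mlr(X[:, chosen], y, [names[j] for j in chosen])
print(expression)
```

A linear model reduces to a single arithmetic expression over columns, which is what makes it easy to return to Spotfire; a PLS model does not collapse so conveniently.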
Experimental vs Predicted
[Figure: scatter plot of Predicted against Experimental values]
Predict for Unknowns
• Use generated model and new column from expression
• Use this to:
  – Compare experimental to predicted
    • How good is the model?
  – Use predictively for unknowns
    • What properties would we expect for untested materials?
  – Consider diversity
• Initial results look promising
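One way to put a number on "how good is the model?" is to compute the coefficient of determination and the root-mean-square error between experimental and predicted values. The talk does not say which statistics were used, so this choice of metrics, the function names and the sample values are all assumptions.

```python
import math
import statistics

def r_squared(experimental, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean = statistics.fmean(experimental)
    ss_res = sum((e - p) ** 2 for e, p in zip(experimental, predicted))
    ss_tot = sum((e - mean) ** 2 for e in experimental)
    return 1 - ss_res / ss_tot

def rmse(experimental, predicted):
    """Root-mean-square error, in the same units as the measured property."""
    squared = [(e - p) ** 2 for e, p in zip(experimental, predicted)]
    return math.sqrt(statistics.fmean(squared))

# Hypothetical values for four tested molecules
experimental = [30.0, 35.0, 40.0, 45.0]
predicted = [31.0, 34.0, 41.0, 44.0]
print(r_squared(experimental, predicted), rmse(experimental, predicted))
```

Statistics computed on the molecules used to build the model tend to flatter it, so agreement on molecules tested later is the more meaningful check.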
Histogram of Predicted Values
• Bar charts are much improved in v8.0, particularly auto-binning and trellis
Diversity of Data by PCA
Alternative Diversity Plot
• Using profile search to generate similarity value
• Allows for multiple PC columns to be reduced into one axis
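Reducing several PC columns to one similarity axis amounts to measuring how far each molecule's scores lie from a reference profile. The sketch below uses Euclidean distance for this; Spotfire's profile search offers its own similarity measures, so the metric, the function name and the sample scores are assumptions made for illustration.

```python
import math

def distance_to_profile(pc_rows, profile):
    """Euclidean distance from each row of PC scores to a reference profile."""
    return [math.dist(row, profile) for row in pc_rows]

# Hypothetical scores on three PCs; the profile could be, say, a well-tested molecule
pc_rows = [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]]
profile = [0.0, 0.0, 0.0]
distances = distance_to_profile(pc_rows, profile)
print(distances)
```

Plotting this single value against the predicted property lets diversity be read from one axis instead of from several pairwise PC plots.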
Thanks
• Daniel Tackley (Intertek ASG)
• Prof. E. Martin (CPACT)
• P. English (CPACT)