Using AI to Extend QSAR Models - Society of …


Page 1

Using AI to Extend QSAR Models

Chaoyang (Joe) Zhang

School of Computing Sciences and Computer Engineering
University of Southern Mississippi

[email protected]

Page 2

Conflict of Interest Statement

I have no perceived conflicts of interest with the research described in this presentation.

Page 3

Outline/Objectives

• To introduce SAR-based chemical toxicity prediction
• To develop machine learning approaches for QSAR modeling
• To extend QSAR models using deep learning
• To address the challenges of these approaches and identify future efforts for predictive toxicity analysis

Page 4

Toxicology Study

• In vitro: studies on cell lines
• In vivo: studies on animal subjects
• In silico: computational experiments

Challenges
• Ethical (inhumane)
• Economic (time-consuming and expensive)
• Use AI approaches for Structure-Activity Relationship (SAR) modeling

Chemicals disrupt normal cell functions by binding and altering:
• Proteins
• DNA
• Lipids
Or they react with oxygen to form free radicals, which can damage cells.

Page 5

SAR-Based Predictive Toxicology

[Figure: molecular descriptors (rows of binary values, e.g. 1 0 1 0 1) are passed through a function f to give a prediction: Active or Inactive.]
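The mapping f from molecular descriptors to an Active/Inactive prediction can be sketched with a very simple stand-in model. The snippet below uses a 1-nearest-neighbor rule with Tanimoto similarity on binary descriptor vectors. The choice of classifier, the function names, and the training labels are illustrative assumptions, not the method used in this talk.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two equal-length binary vectors."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 1.0


def predict(query, train_x, train_y):
    """Hypothetical f: return the label of the most similar training compound."""
    best = max(range(len(train_x)), key=lambda i: tanimoto(query, train_x[i]))
    return train_y[best]


# Toy binary descriptor rows; the labels are made up for illustration.
train_x = [
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 1, 0, 1, 1],
    [1, 0, 0, 0, 1],
]
train_y = ["Inactive", "Active", "Inactive", "Active", "Active"]

query = [1, 0, 1, 0, 1]
print(predict(query, train_x, train_y))
```

In practice f would be a trained model such as a random forest or a deep neural network; the nearest-neighbor rule is only the shortest way to show descriptors going in and a class label coming out.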

Page 6

Types of Machine Learning

• Supervised (inductive) learning. Given: training data + desired outputs (labels)
• Unsupervised learning. Given: training data (without desired outputs)
• Semi-supervised learning. Given: training data + a few desired outputs
• Reinforcement learning. Given: rewards from a sequence of actions; learns what actions an agent should take in a particular situation

Page 7

Supervised Learning Framework and Steps

• Gather a training set
– Data type, size, characteristics
• Determine the input feature representation
– Curse of dimensionality
– Feature engineering (selection and extraction)
• Choose learning algorithms
– RF, SVM, Bayesian, Deep Learning
• Complete the design
– Optimization, cross-validation
• Evaluate the accuracy
– Which evaluation metrics?

Page 8

Tox21 Data Challenges

• The Toxicology in the 21st Century (Tox21) program is a federal collaboration involving NIH, EPA, and FDA.
– The goal of the challenge is to "crowdsource" data analysis by independent researchers to reveal how well they can predict compounds' interference in biochemical pathways using only chemical structure data.
• The aim is to determine which environmental chemicals and drugs are of the greatest potential concern to human health.

https://tripod.nih.gov/tox21/challenge/

Page 9

Data Statistics–Highly Imbalanced Data

The imbalance ratio (IR) is the ratio of the number of instances in the majority class to the number of instances in the minority class.
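As a small worked example of the IR definition, the sketch below counts class labels and divides the largest class count by the smallest. The label counts are hypothetical and do not come from the Tox21 data.

```python
from collections import Counter


def imbalance_ratio(labels):
    """IR = (# majority-class instances) / (# minority-class instances)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical assay: 950 inactive (0) and 50 active (1) compounds.
labels = [0] * 950 + [1] * 50
ir = imbalance_ratio(labels)
print(ir)  # 19.0
```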

Page 10

Machine Learning Approaches and Workflow

Page 11

Imbalance Handling Techniques

• Random undersampling
• Synthetic minority over-sampling technique (SMOTE)
• SMOTEENN (i.e., a combination of the SMOTE and Edited Nearest Neighbor (ENN) algorithms)
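The first two techniques can be sketched in plain Python. This is a simplified illustration, not a reference implementation: `random_undersample` and `smote_like` are hypothetical helper names, the data are toy values, and the ENN cleaning step of SMOTEENN is not shown. Production work would more likely rely on an established library.

```python
import random


def random_undersample(X, y, minority=1, seed=0):
    """Keep all minority samples plus an equally sized random subset of the rest."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    majority_idx = [i for i, label in enumerate(y) if label != minority]
    kept = rng.sample(majority_idx, len(minority_idx))
    idx = sorted(minority_idx + kept)
    return [X[i] for i in idx], [y[i] for i in idx]


def smote_like(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic points, each on the segment between a minority
    sample and one of its k nearest minority neighbors (the core SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(X_min)
        others = [p for p in X_min if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return synthetic


# Toy data: 3 minority (label 1) and 9 majority (label 0) samples.
X = [[float(i), float(i)] for i in range(12)]
y = [1, 1, 1] + [0] * 9

X_rus, y_rus = random_undersample(X, y)
X_new = smote_like([x for x, label in zip(X, y) if label == 1], n_new=6)
print(len(y_rus), sum(y_rus), len(X_new))
```

Undersampling discards majority-class data, while SMOTE adds interpolated minority points; SMOTEENN would additionally remove samples whose nearest neighbors disagree with their label.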

Page 12

Classification Methods

Four Classification Models
• RF: RF without imbalance handling
• RUS: RF with random undersampling
• SMO: RF with SMOTE
• SMN: RF with SMOTEENN

Random forest (RF) is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees.
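The "mode of the trees' outputs" rule amounts to a majority vote. A minimal sketch, assuming the per-tree predictions are already available as a list (the votes below are made up, and `forest_predict` is a hypothetical name):

```python
from collections import Counter


def forest_predict(tree_votes):
    """Return the mode of the class labels predicted by the individual trees."""
    return Counter(tree_votes).most_common(1)[0][0]


votes = ["inactive", "active", "active", "inactive", "active"]
print(forest_predict(votes))  # active
```

Tie-breaking is left to `Counter.most_common` here; a real implementation would usually define it explicitly or average class probabilities instead.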

Page 13

Model Evaluation Metrics

$\text{Precision} = \frac{TP}{TP + FP}$

$\text{Recall} = \frac{TP}{TP + FN}$

$\text{F1-score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$

$\text{Specificity} = \frac{TN}{TN + FP}$

$\text{Balanced Accuracy (BA)} = \frac{\text{Recall} + \text{Specificity}}{2}$

$\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$

AUROC: area under the receiver operating characteristic (ROC) curve

AUPRC: area under the precision-recall curve

We are more interested in the minority class of active compounds.

AUROC is not a good metric for evaluating performance on imbalanced classification problems!
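The count-based metrics on this slide can be computed directly from the four confusion-matrix counts. The sketch below assumes all denominators are non-zero, and the counts are invented for illustration (a hypothetical assay with 50 active and 950 inactive compounds).

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute the slide's count-based metrics; assumes non-zero denominators."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    ba = (recall + specificity) / 2
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "ba": ba,
        "mcc": mcc,
    }


# Invented counts: 30 actives found, 20 missed, 20 false alarms, 930 true inactives.
m = binary_metrics(tp=30, fp=20, tn=930, fn=20)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these counts the specificity is high because inactives dominate, while precision, recall and F1 for the active class stay at 0.6, which illustrates why metrics focused on the minority class are more informative here.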

Page 14

Performance Comparison

Average Friedman ranks for the four classification methods based on the F1 score, AUPRC, AUROC, MCC, or BA metrics

p-values for multiple and pair-wise comparisons

F1 score, MCC and Brier score are more sensitive and consistent metrics
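Average Friedman ranks are obtained by ranking the methods within each dataset (rank 1 = best) and averaging each method's rank across datasets. The sketch below shows only this ranking step, not the p-value computation. The scores are invented for illustration and are not results from this study; `average_ranks` is a hypothetical helper name.

```python
def average_ranks(scores):
    """scores: {dataset: {method: score}}; rank 1 = best (highest score).
    Tied methods receive the mean of the ranks they span."""
    methods = list(next(iter(scores.values())))
    totals = {m: 0.0 for m in methods}
    for per_method in scores.values():
        ordered = sorted(methods, key=lambda m: -per_method[m])
        i = 0
        while i < len(ordered):
            j = i
            while (j + 1 < len(ordered)
                   and per_method[ordered[j + 1]] == per_method[ordered[i]]):
                j += 1
            mean_rank = (i + j) / 2 + 1
            for m in ordered[i:j + 1]:
                totals[m] += mean_rank
            i = j + 1
    return {m: totals[m] / len(scores) for m in methods}


# Invented F1 scores on three hypothetical assays.
f1 = {
    "assay_a": {"RF": 0.20, "RUS": 0.35, "SMO": 0.40, "SMN": 0.45},
    "assay_b": {"RF": 0.10, "RUS": 0.30, "SMO": 0.30, "SMN": 0.50},
    "assay_c": {"RF": 0.25, "RUS": 0.40, "SMO": 0.35, "SMN": 0.45},
}
ranks = average_ranks(f1)
print(ranks)
```

A lower average rank means a method was more consistently among the best across datasets; the Friedman test then asks whether the differences between these average ranks are larger than chance would explain.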

Page 15

Comparison with Tox21 Data Challenge Winners

Red: the highest among all the classifiers (both this study and the Tox21 Data Challenge)

Bold font: the best among the Tox21 Challenge participating teams.

Page 16

Impact of Imbalance Ratio (IR)

There exists a strong negative correlation between the prediction accuracy and the imbalance ratio (IR)

Page 17

Summary–Handling Imbalance Data in QSAR

• The performance of SAR-based, imbalanced chemical toxicity classification can be significantly improved through imbalance handling.

• There exists a strong negative correlation between the prediction accuracy and the imbalance ratio (IR). All methods became less effective when IR exceeded a certain threshold (e.g., >40).

• F1 score, MCC and Brier score are sensitive metrics and are better for performance evaluation than other metrics.

Page 18

Deep Learning for Toxicity Prediction

• Mayr et al. (2016) DeepTox: Toxicity Prediction using Deep Learning. Front. Environ. Sci.
– DeepTox pipeline won 7 out of 12 sub-challenges (12 bioassays)
• Liu, R. et al. (2018) Assessing Deep and Shallow Learning Methods for Quantitative Prediction of Acute Chemical Toxicity. Toxicol. Sci. 164, 512–526.
• Gabriel Idakwo et al. (2019) Deep Learning-Based Structure-Activity Relationship Modeling for Multi-Category Toxicity Classification: A Case Study of 10K Tox21 Chemicals with High-Throughput Cell-Based Androgen Receptor Bioassay Data. Front. Physiol.

Page 19

Data Curation and Preprocessing

Combine agonistic and antagonistic Androgen Receptor (AR) assays

Four assay outcomes:
• Agonist
• Antagonist
• Inactive
• Inconclusive

A data frame with:
• 7665 unique compounds
• 2544 features
• 4 classes

Page 20

Deep Learning Model

• Four class labels
• ReLU activation function
• Auto feature engineering
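A forward pass through a tiny network of this kind can be sketched as follows. The ReLU hidden layer and the four class labels follow the slides; the softmax output layer, the layer sizes, and all weights are assumptions made for illustration and do not describe the trained model.

```python
import math


def relu(v):
    """Rectified Linear Unit applied element-wise."""
    return [max(0.0, x) for x in v]


def dense(x, weights, bias):
    """One fully connected layer; weights holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(z):
    """Turn raw output scores into class probabilities (assumed output layer)."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


CLASSES = ["agonist", "antagonist", "inactive", "inconclusive"]

# Made-up weights for a 3-input, 2-hidden-unit, 4-output network.
W1 = [[1.0, -1.0, 0.5], [-0.5, 1.0, 1.0]]
b1 = [0.0, -1.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0], [0.5, 0.5]]
b2 = [0.0, 0.0, 1.0, 0.0]

x = [1.0, 0.0, 1.0]  # toy descriptor vector
hidden = relu(dense(x, W1, b1))
probs = softmax(dense(hidden, W2, b2))
predicted = CLASSES[probs.index(max(probs))]
print(hidden, predicted)
```

The hidden layer is where "auto feature engineering" happens: the network learns its own intermediate representation of the descriptors instead of relying on hand-selected features.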

Page 21

Bayesian Hyperparameter Optimization in DL

Implemented in Hyperas, a tool that combines the Keras DL library with Hyperopt

The search space included
– hidden layers {2,3,4}
– neurons {32,64,128,256,512,1024}
– optimization methods {mini-batch gradient descent, Adam, RMSprop, Adagrad}
– batch size {8,16,32,64,128}
– learning rate {random uniform distribution between 0 and 1}

DL Hyperparameters
– Number of layers
– Number of neurons
– Learning rate
– Number of epochs (passes over the training data)
– Choice of activation function: sigmoid function, Rectified Linear Unit (ReLU)
– Dropout parameter
– Batch size
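The search space listed above can be written down as a plain data structure and sampled. The sketch below uses simple random search as a stand-in: it is not Bayesian optimization and does not use the Hyperas API. The objective function is a placeholder, since a real objective would train a network and return its validation loss; all function names are hypothetical.

```python
import random

# Search space as listed on the slide.
SEARCH_SPACE = {
    "hidden_layers": [2, 3, 4],
    "neurons": [32, 64, 128, 256, 512, 1024],
    "optimizer": ["mini-batch gradient descent", "Adam", "RMSprop", "Adagrad"],
    "batch_size": [8, 16, 32, 64, 128],
}


def sample_config(rng):
    """Draw one candidate configuration from the search space."""
    config = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
    config["learning_rate"] = rng.uniform(0.0, 1.0)
    return config


def toy_objective(config):
    """Placeholder loss (assumption); lower is better."""
    return abs(config["learning_rate"] - 0.01) + 1.0 / config["neurons"]


def random_search(objective, n_trials, seed=0):
    """Evaluate n_trials random configurations and return the best one."""
    rng = random.Random(seed)
    trials = [sample_config(rng) for _ in range(n_trials)]
    return min(trials, key=objective)


best = random_search(toy_objective, n_trials=20)
print(best)
```

A Bayesian optimizer differs from this sketch in one key way: it uses the results of earlier trials to decide which configuration to try next, rather than sampling each trial independently.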

Page 22

Workflow

Overview of the machine learning-based SAR approach with a nested double-loop cross-validation strategy for model construction, validation and evaluation
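The nested double-loop idea can be sketched as index bookkeeping: an outer loop holds out a test fold for final evaluation, and an inner loop splits the remaining data into training and validation folds for model selection. The fold counts (5 outer, 3 inner) and the absence of shuffling or stratification are simplifying assumptions, not details taken from the study.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def nested_cv_splits(n_samples, outer_k=5, inner_k=3):
    """Yield (inner_train, inner_val, outer_test) index lists.
    Inner folds are built only from the outer training portion, so the
    outer test fold never influences model selection."""
    for test in k_fold_indices(n_samples, outer_k):
        held_out = set(test)
        outer_train = [i for i in range(n_samples) if i not in held_out]
        for val_positions in k_fold_indices(len(outer_train), inner_k):
            val = [outer_train[p] for p in val_positions]
            val_set = set(val)
            train = [i for i in outer_train if i not in val_set]
            yield train, val, test


splits = list(nested_cv_splits(n_samples=20))
print(len(splits))  # 15 = 5 outer folds x 3 inner folds
```

For imbalanced toxicity data, a stratified split that preserves the class proportions in every fold would normally be preferred over the contiguous folds used here.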

Page 23

Algorithm Comparison (Default)

KNN, k-nearest neighbors; RF, random forest; CART, classification and regression trees; NB, Naïve Bayes; SVM, support vector machine; DNN, deep neural network

Page 24

Prediction Results of Optimized RF

Macro-averages of five evaluation metrics derived using random forest.

RF parameter optimization

Parameter           Initial distribution                           Optimized
Max_depth           2, 3, None                                     None
Criterion           "gini", "entropy"                              gini
Min_samples_leaf    0.5, 1, 5, 10, 20, 25                          10
N_estimators        50, 100, 200, 300, 400                         200
Max_features        "auto", "log2", None, 0.8, 0.5, 0.2, 0.1       0.2

Page 25

Comparison of DL and RF

The DL model achieved a macro-average F1-score of 0.83 and performed better than RF, which achieved 0.56.
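A macro-average F1-score is the unweighted mean of the per-class F1-scores, so each of the four classes counts equally regardless of its size. The sketch below computes it from a multi-class confusion matrix; the matrix is invented for illustration and is not the one reported in the study.

```python
def macro_f1(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    n = len(confusion)
    scores = []
    for c in range(n):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(n)) - tp
        fn = sum(confusion[c]) - tp
        denom = 2 * tp + fp + fn  # F1 = 2TP / (2TP + FP + FN)
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


# Invented confusion matrix; rows and columns are ordered as
# agonist, antagonist, inactive, inconclusive.
cm = [
    [8, 0, 2, 0],
    [1, 4, 4, 1],
    [1, 1, 16, 2],
    [0, 0, 4, 6],
]
print(round(macro_f1(cm), 3))
```

Because every class has equal weight, a poorly predicted small class (such as the second row in this toy matrix) pulls the macro-average down more than it would affect overall accuracy.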

Page 26

Comparison of RF and DL: Confusion Matrix

[Figure: confusion matrices for RF and DL.]

Page 27

Summary–Extend QSAR Modeling Using DL

• Deep learning has great potential to significantly improve the accuracy of in silico predictive toxicology.
– The hyperparameters of the deep learning model must be optimized.
• The DL model achieved a macro-average F1-score of 0.83 and performed better than RF, which achieved 0.56.
• Both the DL and RF algorithms had difficulty predicting the antagonist outcome correctly, but DL did better.

Page 28

Discussion and Future Efforts

• Large benchmark dataset and quality
• Improve quantitative toxicity prediction using novel descriptors derived from molecular dynamics simulation, docking and other information
• Feature engineering (feature selection and extraction)
– Autoencoder
• Develop multi-model deep learning framework for in silico predictive toxicology
• Ensemble methods

Page 29

References

• Mayr A, Klambauer G, Unterthiner T, et al (2016) DeepTox: Toxicity Prediction using Deep Learning. Front Environ Sci 3:1–15.
• Huang R, Xia M, Nguyen D-T, et al (2016) Tox21Challenge to Build Predictive Models of Nuclear Receptor and Stress Response Pathways as Mediated by Exposure to Environmental Chemicals and Drugs. Front Environ Sci 3:85.
• Idakwo G, Thangapandian S, Luttrell J, et al (2019) Deep Learning-Based Structure-Activity Relationship Modeling for Multi-Category Toxicity Classification: A Case Study of 10K Tox21 Chemicals With High-Throughput Cell-Based Androgen Receptor Bioassay Data. Front Physiol 10:1044.
• Mayr et al. (2018) conducted a large-scale comparison of drug target prediction. Chemical Science
• Liu R, et al (2018) Assessing Deep and Shallow Learning Methods for Quantitative Prediction of Acute Chemical Toxicity. Toxicol Sci 164:512–526.

Page 30

Acknowledgements

– Dr. Ping Gong & Dr. Sundar Thangapandian (ERDC), Environmental Laboratory, U.S. Army Engineer Research and Development Center

– Dr. Huixiao Hong (NCTR), Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration

– Gabriel Idakwo & Joseph Luttrell (PhD students at USM), School of Computing Sciences and Computer Engineering, University of Southern Mississippi