Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected...


Page 1: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases
Page 2: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules for Neglected Diseases

Sean Ekins

Collaborative Drug Discovery, Inc., Burlingame, CA.

Collaborations Pharmaceuticals, Inc. Fuquay Varina, NC.

Collaborations in Chemistry, Inc. Fuquay Varina, NC.

Wikipedia

Page 3: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

Machine Learning Examples

• Data is BIG for neglected diseases

• To discover new leads

• Tuberculosis – from public data to open models to create IP

• Chagas Disease - from public data to create new IP

• Ebola virus – from little data to create open data and IP

• Other diseases, emerging diseases?

Page 4: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

Neglected Disease Drug Discovery: An urgent need for new therapeutics

http://www.mm4tb.org/

Page 5: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

Tuberculosis kills 1.6-1.7m/yr (~1 every 8 seconds)

1/3 of the world's population infected

streptomycin (1943), para-aminosalicylic acid (1949), isoniazid (1952), pyrazinamide (1954), cycloserine (1955), ethambutol (1962), rifampicin (1967)

Multidrug resistance in 4.3% of cases

Extensively drug-resistant TB: increasing incidence

One new drug (bedaquiline) in 40 years

Tuberculosis

Page 6: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

Tested >350,000 molecules Tested ~2M 2M >300,000

>1500 active and non toxic Published 177 100s 800

Bigger Open Data: Screening for New Tuberculosis Treatments

~350,000 accessible

TBDA screened over 1 million, 1 million more to go

TB Alliance + Japanese pharma screens

R43 LM011152-01

Page 7: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

Over 8 years analyzed in vitro data and built models

Workflow (figure):

1. High-throughput phenotypic Mtb screening feeds the Mtb screening molecule database/s

2. Descriptors + bioactivity (+ cytotoxicity) are used to build a Bayesian machine learning classification Mtb model

3. A molecule database (e.g. GSK malaria actives) is virtually scored using the Bayesian models

4. Top-scoring molecules are assayed for Mtb growth inhibition

5. New bioactivity data may enhance the models

Identify in vitro hits and test models:

• 3 published prospective tests: ~750 molecules were tested in vitro, 198 actives were identified (>20% hit rate)

• Multiple retrospective tests: 3-10 fold enrichment
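The scoring step in this workflow can be illustrated with a small Laplacian-corrected naive Bayes sketch. This is not the Discovery Studio implementation used in the cited papers; it assumes the commonly described per-feature estimate log((A_f + 1) / (N_f × base rate + 1)), and the feature labels and molecule names below are placeholders standing in for real fingerprint bits.

```python
import math
from collections import Counter


def train_laplacian_bayes(samples):
    """samples: list of (feature_set, is_active) pairs.

    Returns a dict mapping each feature to its log contribution.
    Assumed formula: log((A_f + 1) / (N_f * base_rate + 1)).
    """
    total = len(samples)
    actives = sum(1 for _, active in samples if active)
    base_rate = actives / total  # overall fraction of actives
    n_with = Counter()  # molecules containing each feature
    a_with = Counter()  # active molecules containing each feature
    for features, active in samples:
        for f in set(features):
            n_with[f] += 1
            if active:
                a_with[f] += 1
    return {
        f: math.log((a_with[f] + 1) / (n_with[f] * base_rate + 1))
        for f in n_with
    }


def score(model, features):
    """Sum the contributions of known features; unseen features add 0."""
    return sum(model.get(f, 0.0) for f in set(features))


# Toy training set (hypothetical features, not real screening data)
training = [
    ({"amine", "aryl"}, True),
    ({"amine", "ring"}, True),
    ({"nitro", "aryl"}, False),
    ({"nitro", "ring"}, False),
]
model = train_laplacian_bayes(training)

# Virtually score a small "library" and rank it, highest score first
library = {
    "mol_a": {"amine", "aryl"},
    "mol_b": {"nitro", "ring"},
    "mol_c": {"amine", "nitro"},
}
ranked = sorted(library, key=lambda name: score(model, library[name]), reverse=True)
print(ranked)
```

Features seen mostly in actives get positive contributions, features seen mostly in inactives get negative ones, and the top of the ranked list is what would be sent for in vitro testing.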

Ekins et al., Pharm Res 31: 414-435, 2014

Ekins, et al., Tuberculosis 94; 162-169, 2014

Ekins, et al., PLOSONE 8; e63240, 2013

Ekins, et al., Chem Biol 20: 370-378, 2013

Ekins, et al., JCIM, 53: 3054−3063, 2013

Ekins and Freundlich, Pharm Res, 28, 1859-1869, 2011

Ekins et al., Mol BioSyst, 6: 840-851, 2010

Ekins, et al., Mol. Biosyst. 6, 2316-2324, 2010,

R43 LM011152-01

Page 8: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

5 active compounds vs Mtb in a few months

7 tested, 5 active (70% hit rate)

Ekins et al., Chem Biol 20, 370–378, 2013

1. Virtually screen 13,533-member GSK antimalarial hit library

2. Bayesian Model = SRI TAACF-CB2 dose response + cytotoxicity model

3. Top 46 commercially available compounds visually inspected

4. 7 compounds chosen for Mtb testing based on

- drug-likeness

- chemotype diversity

GSK # | Bayesian score | Mtb H37Rv MIC (µg/mL) | GSK reported % inhibition HepG2 @ 10 µM cmpd
TCMDC-123868 | 5.73 | >32 | 40
TCMDC-125802 | 5.63 | 0.0625 | 5
TCMDC-124192 | 5.27 | 2.0 | 4
TCMDC-124334 | 5.20 | 2.0 | 4
TCMDC-123856 | 5.09 | 1.0 | 83
TCMDC-123640 | 4.66 | >32 | 10
TCMDC-124922 | 4.55 | 1.0 | 9

(Chemical structure column: images not preserved)
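The hit rates quoted on these slides are simple ratios, and a short check makes the rounding explicit. The counts come from the slides (5 of 7 here; 198 of ~750 across the prospective tests); the function name is only illustrative.

```python
def hit_rate(actives, tested):
    """Percentage of tested molecules that were active."""
    if tested <= 0:
        raise ValueError("tested must be positive")
    return 100.0 * actives / tested


gsk_rate = hit_rate(5, 7)              # this slide: 7 tested, 5 active
prospective_rate = hit_rate(198, 750)  # earlier slide: ~750 tested, 198 actives

print(f"GSK antimalarial set: {gsk_rate:.1f}%")
print(f"Prospective tests:    {prospective_rate:.1f}%")
```

The first value is about 71%, which the slide rounds to 70%; the second is consistent with the ">20% hit rate" statement.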

R43 LM011152-01

Page 9: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

• BAS00521003/TCMDC-125802 reported to be a P. falciparum lactate dehydrogenase inhibitor

• Only one report of antitubercular activity, from 1969

- solid agar MIC = 1 µg/mL (“wild strain”)

- “no activity” in mouse model up to 400 mg/kg

- however, activity was judged solely by extension of survival!

Bruhin, H. et al., J. Pharm. Pharmac. 1969, 21, 423-433.

• MIC of 0.0625 µg/mL

• 64X MIC affords 6 logs of kill

• Resistance and/or drug instability beyond 14 d

• Vero cells: CC50 = 4.0 µg/mL

• Selectivity Index SI = CC50/MIC(Mtb) = 16 – 64

• In mouse: no toxicity but also no efficacy in GKO model, probably metabolized
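The selectivity index quoted here is a plain ratio, so the upper bound can be checked directly. The sketch assumes CC50 and MIC are expressed in the same concentration units; the slide does not say which MIC gives the lower bound of 16, so the 0.25 used below is only the value that the arithmetic implies.

```python
def selectivity_index(cc50, mic):
    """SI = CC50 / MIC, with both values in the same concentration units."""
    if mic <= 0:
        raise ValueError("MIC must be positive")
    return cc50 / mic


si_upper = selectivity_index(4.0, 0.0625)  # Vero CC50 over the reported MIC
si_lower = selectivity_index(4.0, 0.25)    # MIC implied by SI = 16 (not stated on the slide)

print(si_upper, si_lower)
```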

Ekins et al.,Chem Biol 20, 370–378, 2013

Taking a compound in vivo identifies issues

R43 LM011152-01

Page 10: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

Modeling and mapping Mouse in vivo data

Mouse TB model data spanning over 70 years: 784 in the training set and 60 in the test set

Extended earlier study J Chem Inf Model. 2014 Apr 28;54(4):1070-82

Page 11: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

Optimizing the triazine series as part of this project: improve solubility and show in vivo efficacy

1U19AI109713-01

Page 12: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

Chagas Disease

• About 7 million to 8 million people estimated to be infected worldwide

• Vector-borne transmission occurs in the Americas.

• A triatomine bug carries the parasite Trypanosoma cruzi which causes the disease.

• The disease is curable if treatment is initiated soon after infection.

• No FDA-approved drug; pipeline sparse

Hotez et al., PLoS Negl Trop Dis. 2013 Oct 31;7(10):e2300

R41-AI108003-01

Page 13: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

• Modeled data with over 300,000 cpds but focused on a smaller set

• Dataset from PubChem AID 2044 – Broad Institute data

• Dose response data (1853 actives and 2203 inactives)

• Dose response and cytotoxicity (1698 actives and 2363 inactives)

• EC50 values less than 1 µM were selected as actives.

• For cytotoxicity, a greater than 10-fold difference compared with EC50 was required

• Models generated using: molecular function class fingerprints of maximum diameter 6 (FCFP_6), AlogP, molecular weight, number of rotatable bonds, number of rings, number of aromatic rings, number of hydrogen bond acceptors, number of hydrogen bond donors, and molecular fractional polar surface area.

• 5-fold cross validation or leave out 50% x 100 fold cross validation was used to calculate the ROC for the models generated
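The activity rules in the bullets above can be written as a small labelling function. This is a sketch under stated assumptions: concentrations are taken to be micromolar (the unit symbol on the slide appears to have been garbled in extraction), the 10-fold rule is read as CC50/EC50 > 10, and the example values are invented for illustration.

```python
def is_active(ec50_um, threshold_um=1.0):
    """Dose-response rule: active if EC50 is below the threshold (assumed µM)."""
    return ec50_um < threshold_um


def is_dual_active(ec50_um, cc50_um, threshold_um=1.0, min_fold=10.0):
    """Dose-response + cytotoxicity rule: potent and at least min_fold selective."""
    return is_active(ec50_um, threshold_um) and (cc50_um / ec50_um) > min_fold


# Hypothetical (EC50, CC50) pairs in µM
examples = {
    "potent_selective": (0.2, 5.0),    # 25-fold window
    "potent_cytotoxic": (0.2, 1.0),    # only 5-fold window
    "weak": (3.0, 100.0),              # EC50 above threshold
}
labels = {
    name: (is_active(ec50), is_dual_active(ec50, cc50))
    for name, (ec50, cc50) in examples.items()
}
print(labels)
```

A compound can therefore be an active in the dose-response model but an inactive in the dual model, which is why the two training sets on the slide have different active counts.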

T. cruzi Machine Learning models

R41-AI108003-01 Ekins et al., PLoS Negl Trop Dis. 2015 Jun 26;9(6):e0003878

Page 14: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

5-fold cross validation:

Model | Best cutoff | Leave-one-out ROC | 5-fold cross validation ROC | 5-fold cross validation sensitivity (%) | 5-fold cross validation specificity (%) | 5-fold cross validation concordance (%)
Dose response (1853 actives, 2203 inactives) | -0.676 | 0.81 | 0.78 | 77 | 89 | 84
Dose response and cytotoxicity (1698 actives, 2363 inactives) | -0.337 | 0.82 | 0.80 | 80 | 88 | 84

Dual event, leave out 50% x 100 fold cross validation:

External ROC | Internal ROC | Concordance (%) | Specificity (%) | Sensitivity (%)
0.79 ± 0.01 | 0.80 ± 0.01 | 73.48 ± 1.05 | 79.08 ± 3.73 | 65.68 ± 3.89
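For readers less familiar with the statistics in this table, the sketch below shows how sensitivity, specificity and concordance relate to one another. It uses the standard definitions (the slide does not spell them out), and the confusion-matrix counts in the first call are hypothetical.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity %, specificity %, concordance %) from confusion counts."""
    sensitivity = 100.0 * tp / (tp + fn)   # actives correctly predicted
    specificity = 100.0 * tn / (tn + fp)   # inactives correctly predicted
    concordance = 100.0 * (tp + tn) / (tp + fn + tn + fp)  # overall accuracy
    return sensitivity, specificity, concordance


def concordance_from_rates(sens_pct, spec_pct, n_actives, n_inactives):
    """Concordance is the class-size-weighted average of sensitivity and specificity."""
    return (sens_pct * n_actives + spec_pct * n_inactives) / (n_actives + n_inactives)


toy = classification_metrics(tp=8, fn=2, tn=9, fp=1)  # hypothetical counts

# Dose-response model from the table: 77% sensitivity, 89% specificity
dose_response = concordance_from_rates(77, 89, 1853, 2203)
print(toy, round(dose_response, 1))
```

Applied to the dose-response row, the weighted average comes out at roughly 83.5%, in line with the reported 84% given that the inputs are themselves rounded.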

R41-AI108003-01 Ekins et al., PLoS Negl Trop Dis. 2015 Jun 26;9(6):e0003878

Page 15: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

T. cruzi Dose Response and cytotoxicity Machine Learning model features

Good: Tertiary amines, piperidines and aromatic fragments with basic nitrogen

Bad: Cyclic hydrazines and electron-poor chlorinated aromatics

Ekins et al., PLoS Negl Trop Dis. 2015 Jun 26;9(6):e0003878

R41-AI108003-01

Page 16: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

Bayesian Machine Learning Models

- Selleck Chemicals natural product lib. (139 molecules)

- GSK kinase library (367 molecules)

- Malaria box (400 molecules)

- Microsource Spectrum (2320 molecules)

- CDD FDA drugs (2690 molecules)

- Prestwick Chemical library (1280 molecules)

- Traditional Chinese Medicine components (373 molecules)

7569 molecules

99 molecules
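The screening funnel on this slide can be checked with a few lines of arithmetic. The library sizes and the 99 selected molecules are from this slide, and the 17 hits figure is the one reported on the in vivo efficacy slide that follows; the variable names are only illustrative.

```python
libraries = {
    "Selleck Chemicals natural product lib.": 139,
    "GSK kinase library": 367,
    "Malaria box": 400,
    "Microsource Spectrum": 2320,
    "CDD FDA drugs": 2690,
    "Prestwick Chemical library": 1280,
    "Traditional Chinese Medicine components": 373,
}

total_scored = sum(libraries.values())  # molecules scored by the Bayesian models
selected = 99                           # molecules selected for testing
hits = 17                               # hits reported on a later slide

selected_pct = 100.0 * selected / total_scored
hit_pct = 100.0 * hits / selected

print(total_scored, round(selected_pct, 2), round(hit_pct, 1))
```

The library sizes do sum to the 7569 molecules quoted, of which about 1.3% were selected and roughly 17% of those were hits.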

R41-AI108003-01 Ekins et al., PLoS Negl Trop Dis. 2015 Jun 26;9(6):e0003878

Page 17: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

Synonyms | Infection ratio | EC50 (µM) | EC90 (µM) | Hill slope | Cytotoxicity CC50 (µM) | Chagas mouse model (4 days treatment, luciferase): in vivo efficacy at 50 mg/kg bid (IP) (%)
(±)-Verapamil hydrochloride, 715730, SC-0011762 | 0.02, 0.02 | 0.0383 | 0.143 | 1.67 | >10.0 | 55.1
29781612, Pyronaridine | 0.00, 0.00 | 0.225 | 0.665 | 2.03 | 3.0 | 85.2
511176, Furazolidone | 0.00, 0.00 | 0.257 | 0.563 | 2.81 | >10.0 | 100.5
501337, SC-0011777, Tetrandrine | 0.00, 0.00 | 0.508 | 1.57 | 1.95 | 1.3 | 43.6
SC-0011754, Nitrofural | 0.01, 0.01 | 0.775 | 6.98 | 1.00 | >10.0 | 78.5*

* Used hydroxymethylnitrofurazone for in vivo study (nitrofural pro-drug)
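The EC50, EC90 and Hill slope columns in this table are mutually consistent with a standard Hill dose-response model, under which ECx = EC50 × (x / (100 − x))^(1/h). The source does not state how EC90 was obtained, so the sketch below is only a consistency check under that assumption, using the values from the table.

```python
def ec_x(ec50, hill_slope, x=90.0):
    """Concentration giving x% effect under a Hill model (assumed, not stated in the source)."""
    if not 0.0 < x < 100.0:
        raise ValueError("x must be between 0 and 100")
    return ec50 * (x / (100.0 - x)) ** (1.0 / hill_slope)


# (EC50 µM, Hill slope, reported EC90 µM) from the table
table = {
    "Verapamil": (0.0383, 1.67, 0.143),
    "Pyronaridine": (0.225, 2.03, 0.665),
    "Furazolidone": (0.257, 2.81, 0.563),
    "Tetrandrine": (0.508, 1.95, 1.57),
    "Nitrofural": (0.775, 1.00, 6.98),
}

predicted = {name: ec_x(ec50, h) for name, (ec50, h, _) in table.items()}
for name, (_, _, reported) in table.items():
    print(f"{name}: predicted EC90 {predicted[name]:.3f} µM, reported {reported} µM")
```

A steeper Hill slope narrows the gap between EC50 and EC90, which is why nitrofural (slope 1.00) has a ninefold gap while furazolidone (slope 2.81) has only about a twofold one.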

Ekins et al., PLoS Negl Trop Dis. 2015 Jun 26;9(6):e0003878


In vitro and in vivo data for compounds selected

R41-AI108003-01

Page 18: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

7,569 cpds => 99 cpds => 17 hits (5 in nM range)

In vivo efficacy of the 5 tested compounds

[Figure: timeline over days 0-7 showing infection, treatment and reading; groups shown are Pyronaridine, Furazolidone, Verapamil, Nitrofural, Tetrandrine, Benznidazole and Vehicle]

Ekins et al., PLoS Negl Trop Dis. 2015 Jun 26;9(6):e0003878 R41-AI108003-01

Page 19: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

Pyronaridine: New anti-Chagas and known anti-Malarial

• EMA approved in combination with artesunate

• IC50 value of 2 nM against the growth of KT1 and KT3 P. falciparum

• Known P-gp inhibitor

• Active against the tick-transmitted parasites Babesia and Theileria

R41-AI108003-01

Work provided starting point for a phase II and phase I grant (submitted)


Broad group missed this cpd

Page 20: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

2014-2015 Ebola outbreak

March 2014, the World Health Organization (WHO) reported a major Ebola outbreak in Guinea, a western African nation

8 August 2014, the WHO declared the epidemic to be an international public health emergency

“I urge everyone involved in all aspects of this epidemic to openly and rapidly report their experiences and findings. Information will be one of our key weapons in defeating the Ebola epidemic.” (Peter Piot)

Wikipedia

Wikipedia

Page 21: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

Madrid PB, et al. (2013) A Systematic Screen of FDA-Approved Drugs for Inhibitors of Biological Threat Agents. PLoS ONE 8(4): e60579. doi:10.1371/journal.pone.0060579

Chloroquine in mouse

Page 22: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

Machine Learning for EBOV

• 868 molecules from the viral pseudotype entry assay and the EBOV replication assay

• Salts were stripped and duplicates removed using Discovery Studio 4.1 (Biovia, San Diego, CA)

• IC50 values less than 50 µM were selected as actives.

• Models generated using: molecular function class fingerprints of maximum diameter 6 (FCFP_6), AlogP, molecular weight, number of rotatable bonds, number of rings, number of aromatic rings, number of hydrogen bond acceptors, number of hydrogen bond donors, and molecular fractional polar surface area.

• Models were validated using five-fold cross validation (leave out 20% of the database).

• Bayesian, Support Vector Machine, Recursive Partitioning Forest and single tree models were built.

• RP Forest and RP Single Tree models used the standard protocol in Discovery Studio.

• 5-fold cross validation or leave out 50% x 100 fold cross validation was used to calculate the ROC for the models generated
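The two validation schemes named above (5-fold cross validation and leave out 50% x 100) can be sketched in plain Python. This is an illustration of the procedures, not the Discovery Studio protocol: the "model" is a deliberately trivial one-descriptor scorer, and the data are invented.

```python
import random
from statistics import mean


def roc_auc(scores, labels):
    """Chance that a random active outscores a random inactive (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need both classes")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n, k=5, seed=0):
    """Shuffle indices and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_auc(fit, predict, data, labels, folds):
    """Score every item with a model that never saw it, then pool into one AUC."""
    scores = [0.0] * len(data)
    for test_idx in folds:
        held_out = set(test_idx)
        train = [i for i in range(len(data)) if i not in held_out]
        model = fit([data[i] for i in train], [labels[i] for i in train])
        for i in test_idx:
            scores[i] = predict(model, data[i])
    return roc_auc(scores, labels)


def repeated_holdout_auc(fit, predict, data, labels, repeats=100, seed=0):
    """Leave out 50%, train on the rest, repeat; return the mean test AUC."""
    rng = random.Random(seed)
    n, aucs = len(data), []
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        train, test = idx[: n // 2], idx[n // 2:]
        # Skip splits where either half lacks one of the two classes
        if len({labels[i] for i in train}) < 2 or len({labels[i] for i in test}) < 2:
            continue
        model = fit([data[i] for i in train], [labels[i] for i in train])
        test_scores = [predict(model, data[i]) for i in test]
        aucs.append(roc_auc(test_scores, [labels[i] for i in test]))
    return mean(aucs)


# Trivial hypothetical model: learn whether higher descriptor values mean "active"
def fit_direction(xs, ys):
    actives = [x for x, y in zip(xs, ys) if y]
    inactives = [x for x, y in zip(xs, ys) if not y]
    return 1.0 if mean(actives) >= mean(inactives) else -1.0


def predict_direction(direction, x):
    return direction * x


data = [0.9, 0.8, 0.7, 0.6, 0.2, 0.3, 0.1, 0.4, 0.5, 0.05]  # invented descriptor
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

folds = k_fold_indices(len(data), k=5, seed=1)
cv_auc = cross_validated_auc(fit_direction, predict_direction, data, labels, folds)
holdout_auc = repeated_holdout_auc(fit_direction, predict_direction, data, labels)
print(cv_auc, holdout_auc)
```

With so few actives in the Ebola sets (20 and 41 out of 868), individual splits can end up with very few actives in the test portion, which is one reason to report both schemes.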

Page 23: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

Ebola HTS machine learning model cross validation Receiver Operator Curve statistics (training set: 868 compounds)

Models | RP Forest (out-of-bag ROC) | RP Single Tree (5-fold cross validation ROC) | SVM (5-fold cross validation ROC) | Bayesian (5-fold cross validation ROC) | Bayesian (leave out 50% x 100 ROC) | Open Bayesian (5-fold cross validation ROC)
Ebola replication (actives = 20) | 0.70 | 0.78 | 0.73 | 0.86 | 0.86 | 0.82
Ebola pseudotype (actives = 41) | 0.85 | 0.81 | 0.76 | 0.85 | 0.82 | 0.82

Ekins et al., F1000Res 4:1091, (2015)

Page 24: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

[Figure, panel A: Discovery Studio pseudotype Bayesian model, good and bad features]

[Figure, panel B: Discovery Studio EBOV replication model, good and bad features]

Ekins et al., F1000Res 4:1091, (2015)

Page 25: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

Effect of drug treatment on infection with Ebola-GFP

3 molecules selected from the MicroSource Spectrum virtual screen and tested in vitro; all of them showed nM activity

[Figure: dose-response curves, % Ebola infection versus log concentration (M), for Chloroquine, Pyronaridine, Quinacrine, Tilorone and untreated control]

Compound | EC50 (µM) [95% CI] | Cytotoxicity CC50 (µM)
Chloroquine | 4.0 [1.0 – 15] | 250
Pyronaridine | 0.42 [0.31 – 0.56] | 3.1
Quinacrine | 0.35 [0.28 – 0.44] | 6.2
Tilorone | 0.23 [0.09 – 0.62] | 6.2

Duplicate experiments

Ekins et al., F1000Res 4:1091, (2015)

Page 26: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

Making Ebola models available

• From data published by others …to proposing target

• Collaborated with lab to open up their screening data, build models, identified more active inhibitors

• To date the most potent drugs and drug-like molecules

• Still a need for a drug that could be used ASAP

• Models in MMDS http://molsync.com/ebola/

More data continues to be published

• We collated 55 molecules from the literature

• A second review lists 60 hits – Picazo, E. and F. Giordanetto, Drug Discovery Today. 2015 Feb;20(2):277-86

• Additional screens have identified 53 hits and 80 hits respectively

– Kouznetsova, J., et al., Emerg Microbes Infect, 2014. 3(12): p. e84.

– Johansen, L.M., et al., Sci Transl Med, 2015. 7(290): p. 290ra89.

Litterman N, Lipinski C and Ekins S 2015 F1000Research 2015, 4:38

Page 27: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

1000s of models from ChEMBL

• Skipped targets with > 100,000 assays and sets with < 100 measurements

• Converted data to –log

• Dealt with duplicates

• 2152 datasets

• Cutoff determination

• Balance active/ inactive ratio

• Favor structural diversity and activity distribution
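Two of the preparation steps listed above (the −log conversion and the minimum dataset size) are easy to make concrete. The sketch assumes activities are molar concentrations and that duplicates are merged by averaging their −log values; the slide only says duplicates were "dealt with", so the averaging is an assumption, and the function names are illustrative.

```python
import math
from collections import defaultdict


def p_activity(molar):
    """Convert a molar concentration to its -log10 value (e.g. 1 µM -> 6)."""
    if molar <= 0:
        raise ValueError("concentration must be positive")
    return -math.log10(molar)


def merge_duplicates(records):
    """records: iterable of (molecule_id, molar). Averages -log values per molecule.

    Averaging is an assumption; the source does not say how duplicates were handled.
    """
    grouped = defaultdict(list)
    for mol_id, molar in records:
        grouped[mol_id].append(p_activity(molar))
    return {mol_id: sum(vals) / len(vals) for mol_id, vals in grouped.items()}


def keep_dataset(measurements, min_size=100):
    """Mirror the slide's rule of skipping sets with fewer than 100 measurements."""
    return len(measurements) >= min_size


merged = merge_duplicates([("a", 1e-6), ("a", 1e-8), ("b", 1e-5)])
print(merged, keep_dataset(range(99)), keep_dataset(range(100)))
```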

Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60

http://molsync.com/bayesian2

Page 28: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

What do 2000 ChEMBL models look like

[Figure: average ROC as a function of folding bit size]
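"Folding" here refers to collapsing a sparse set of fingerprint hash codes into a fixed number of bits. A minimal sketch, assuming a power-of-two bit size and bitmask folding (the slide itself only names the axis, so the exact scheme is an assumption), shows why smaller sizes cause collisions:

```python
def fold_fingerprint(hashes, n_bits=1024):
    """Fold integer fingerprint hashes into n_bits positions using a bitmask."""
    if n_bits <= 0 or n_bits & (n_bits - 1):
        raise ValueError("n_bits must be a positive power of two")
    mask = n_bits - 1
    return sorted({h & mask for h in hashes})


raw = [5, 1029, 2055]  # hypothetical hash codes
folded_4096 = fold_fingerprint(raw, 4096)
folded_1024 = fold_fingerprint(raw, 1024)
print(folded_4096, folded_1024)
```

At 4096 bits the three hashes stay distinct; at 1024 bits two of them land on the same bit, so information is lost. That trade-off between compactness and collisions is what a plot of ROC against folding size explores.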

http://molsync.com/bayesian2 Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60

Page 29: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

PolyPharma a new free app for drug discovery

Page 30: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

#ZikaOpen

Image by John Liebler

Page 31: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

Proposed workflow for rapid drug discovery against Zika virus

Ekins S, Mietchen D, Coffee M et al. F1000Research 2016, 5:150 (doi: 10.12688/f1000research.8013.1)

Page 32: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

HOMOLOGY MODELS FOR ZIKA

Models developed with SWISS-MODEL

Will dock millions of compounds vs these models

Ekins et al., F1000Research 5:275 (2016)

Page 33: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

Ekins S, Mietchen D, Coffee M et al. 2016 F1000Research 2016, 5:150 (doi: 10.12688/f1000research.8013.1)

Compounds and chemical libraries suggested for testing against Zika virus

Page 34: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

• Data is out there to produce models for neglected diseases

• Also modeled Marburg, Lassa, Dengue...

• Computational and experimental collaborations with open data have led to:

– New hits and leads

– New IP

– New grants for collaborators

• Even Ebola had enough data to build models and suggest compounds to test in 2014

• Zika = starting from scratch – no data – need to use other approaches

• Make findings open and published immediately

• Need for easier facilities to test compounds

• Challenges still – sharing and accessing information / knowledge

• How do we prepare for the next BIG ONE

Conclusions

Page 35: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

Alex Clark, Jair Lage de Siqueira-Neto, Joel Freundlich, Peter Madrid, Robert Davey, Megan Coffee, Robert Reynolds, Nadia Litterman, Christopher Lipinski, Christopher Southan, Antony Williams, Carolyn Talcott, Malabika Sarker, Steven Wright, Mike Pollastri, Ni Ai, Barry Bunin and all colleagues at CDD

Acknowledgments and contact info

[email protected]

collabchem

Page 36: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

ZIKAOPEN ACKNOWLEDGMENTS

Tom Stratton

Priscilla L. Yang

Page 37: Using Machine Learning Models Based on Phenotypic Data to Discover New Molecules For neglected diseases

Software on GitHub

Models can be accessed at

• http://molsync.com/bayesian1

• http://molsync.com/bayesian2

• http://molsync.com/transporters

• http://molsync.com/ebola/