SMILES Based Autoencoders: a base for QSAR feature modelling and generative modelling
Esben Jannik Bjerrum, Principal Scientist
Eighth Joint Sheffield Conference on Chemoinformatics, 17-19 June 2019
Introduction - Outline
The Players:
SMILES
Recurrent Neural Networks (RNN)
Long Short-Term Memory cells (LSTM)
SMILES enumeration
RNN as SMILES readers and generators
Combining it:
Autoencoders
Heteroencoder training trick
Latent space properties
Sampling in a 1:many setting
Use in QSAR
Use in Generative Modelling
Conclusion
Toolkits and Code
SMILES, a Chemical Language and Information System
Recurrent Neural Networks (RNN)
● Sequences of features as inputs
● The same task for every element of a sequence, with the output being affected by the previous computations
● Modeling of sequences such as text, tweets, time series, etc.
• Neural networks learn from vectors, matrices or tensors.
• One-hot encoding with a defined vocabulary converts SMILES strings into 2D matrices
One hot encoding
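A minimal sketch of the one-hot encoding step; the vocabulary and maximum length here are toy assumptions for illustration, not the ones used in the actual models:

```python
import numpy as np

# Toy vocabulary; a real model builds this from the training set and
# adds start/stop tokens. " " is used as the padding token.
VOCAB = ["C", "c", "1", "(", ")", "O", "N", "=", " "]
CHAR2IDX = {ch: i for i, ch in enumerate(VOCAB)}

def one_hot_smiles(smiles, max_len=12):
    """Encode a SMILES string as a (max_len, vocab_size) one-hot matrix."""
    padded = smiles.ljust(max_len)  # pad to a fixed sequence length
    mat = np.zeros((max_len, len(VOCAB)), dtype=np.float32)
    for i, ch in enumerate(padded):
        mat[i, CHAR2IDX[ch]] = 1.0
    return mat

m = one_hot_smiles("c1ccccc1")  # benzene
print(m.shape)        # (12, 9)
print(m.sum(axis=1))  # exactly one "hot" entry per sequence position
```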
Long Short-Term Memory cells (LSTM)
Hochreiter, Sepp, and Jürgen Schmidhuber. 1997. “Long Short-Term Memory.” Neural Computation.
RNNs as an encoder
The internal LSTM state is updated from step to step, so the full sequence influences the final vector used for the prediction task.
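A toy illustration of the encoding idea: a plain tanh RNN with random weights stands in for the LSTM cells used in the talk, and all sizes are made up. The point is that the final hidden state is a fixed-size vector summarizing the whole sequence:

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_encode(seq_onehot, W_xh, W_hh, b_h):
    """Run a simple tanh RNN over a one-hot sequence and return the
    final hidden state: a fixed-size vector influenced by every step."""
    h = np.zeros(W_hh.shape[0])
    for x in seq_onehot:                        # one step per token
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)  # state carries the history
    return h

vocab_size, hidden = 9, 16
W_xh = rng.normal(size=(hidden, vocab_size)) * 0.1
W_hh = rng.normal(size=(hidden, hidden)) * 0.1
b_h = np.zeros(hidden)

seq = np.eye(vocab_size)[[0, 1, 0, 0, 2]]  # token indices -> one-hot rows
vec = rnn_encode(seq, W_xh, W_hh, b_h)
print(vec.shape)  # (16,)
```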
• Canonical SMILES ensures a 1:1 relationship between molecule and SMILES
• I go the other way and generate multiple SMILES for the same molecule
• Works as data augmentation
Enumeration of non-canonical SMILES
Live DEMO of generation
Bjerrum, Esben Jannik. 2017. “SMILES Enumeration as Data Augmentation for Neural Network
Modeling of Molecules.” http://arxiv.org/abs/1703.07076
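The atom-renumbering trick from the cited paper can be sketched with RDKit roughly like this (the function name is mine; the aspirin example is just an illustration):

```python
import random
from rdkit import Chem

def randomize_smiles(smiles, rng=random):
    """Generate a random non-canonical SMILES for the same molecule by
    shuffling the atom order before writing the string."""
    mol = Chem.MolFromSmiles(smiles)
    atom_order = list(range(mol.GetNumAtoms()))
    rng.shuffle(atom_order)
    mol = Chem.RenumberAtoms(mol, atom_order)
    return Chem.MolToSmiles(mol, canonical=False)

smi = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
variants = {randomize_smiles(smi) for _ in range(20)}
# The strings differ, but canonicalizing maps them back to one molecule:
canon = {Chem.MolToSmiles(Chem.MolFromSmiles(v)) for v in variants}
print(len(variants), len(canon))
```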
Random SMILES in practice
RNNs as generators
REINVENT: Olivecrona et al. Molecular de-novo design through deep reinforcement learning, J. Cheminf. 2017
SMILES Based Autoencoders
Example: Gómez-Bombarelli, Rafael et al. 2018. “Automatic Chemical Design Using a Data-Driven
Continuous Representation of Molecules.” ACS Central Science 4(2): 268–76.
Projecting non-canonical SMILES in latent space
[Figure: the two architectures, Conv2RNN and RNN2RNN]
A SMILES string is not a Molecule
[Figure: a molecule (benzene) next to its SMILES string, c1ccccc1]
HeteroEncoders
Also possible with InChIs and CAS names: Winter et al. 2018
From chemical images to SMILES: Bjerrum & Sattarov 2018
Projecting non-canonical SMILES in the latent space
Latent space molecular similarities
SMILES sequence similarity
Morgan FP Tanimoto similarity
Model       R² (FP metric)   R² (Seq metric)
Can2Can     0.24             0.58
Enum2Can    0.37             0.53
Can2Enum    0.58             0.55
Enum2Enum   0.49             0.40
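The two comparisons above pair a latent-space similarity against a molecular similarity. A rough sketch of the underlying measures, with placeholder fingerprint bit lists and latent vectors (not outputs of the real models):

```python
import numpy as np

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as bit-index lists."""
    a, b = set(fp_a), set(fp_b)
    return len(a & b) / len(a | b)

def latent_similarity(z_a, z_b):
    """Cosine similarity between two latent vectors."""
    return float(z_a @ z_b / (np.linalg.norm(z_a) * np.linalg.norm(z_b)))

# Placeholder data: two fingerprints sharing 2 of 4 total bits
print(tanimoto([1, 5, 9], [1, 5, 7]))  # 0.5

# Identical latent vectors have cosine similarity 1
z = np.array([0.3, -1.2, 0.7])
print(latent_similarity(z, z))
```

In the study, R² is the correlation of latent-space similarity against each molecular metric across many molecule pairs.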
Non-deterministic sampling of molecules

                                      Can2Can   Enum2Enum (2l)
Unique SMILES                         1         111
% Correct molecule                    100       57
Unique SMILES for correct molecule    1         42
Unique molecules                      1         17
1:Many sampling from latent space point
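The 1:many decoding above relies on sampling each output token from the decoder's softmax distribution instead of taking the argmax, so repeated decodes from the same latent point yield different SMILES. A sketch with made-up logits:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_token(logits, temperature=1.0):
    """Sample one token index from a softmax over the decoder logits.
    Argmax would always return the same token; sampling varies."""
    p = np.exp(logits / temperature)
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

logits = np.array([2.0, 1.0, 0.5, -1.0])  # made-up per-step decoder output
draws = [sample_token(logits) for _ in range(1000)]
print(len(set(draws)))  # more than one distinct token gets drawn
```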
Latent vectors as a base for QSAR models
Figure adapted from : Bjerrum, Esben Jannik, and Boris Sattarov. 2018. “Improving Chemical Autoencoder Latent Space and Molecular
De Novo Generation Diversity with Heteroencoders.” Biomolecules.
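Using latent vectors as QSAR descriptors can be sketched as follows; here a closed-form ridge regression on synthetic stand-in data replaces both the real encoder outputs and the deep neural network models used in the deck:

```python
import numpy as np

rng = np.random.default_rng(0)

def ridge_fit(Z, y, alpha=1.0):
    """Closed-form ridge regression: latent vectors Z as descriptors,
    measured activity y as the target."""
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + alpha * np.eye(d), Z.T @ y)

# Hypothetical data standing in for encoder outputs and activities
Z = rng.normal(size=(200, 8))   # 200 molecules, 8-dim latent vectors
w_true = rng.normal(size=8)
y = Z @ w_true + 0.1 * rng.normal(size=200)

w = ridge_fit(Z, y)
rmse = np.sqrt(np.mean((Z @ w - y) ** 2))
print(rmse < 0.2)  # True: the latent features recover the signal
```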
Latent space gets more relevant for QSAR modelling tasks
RMSEP on 5 datasets modelled using deep neural networks (lower is better):

Model       IGC50   LD50   BCF    Solubility   MP    Norm. mean
Enum2Enum   0.43    0.54   0.71   0.65         37    0.75
Can2Enum    0.46    0.54   0.69   0.69         37    0.77
Enum2Can    0.46    0.57   0.71   0.66         38    0.78
Can2Can     0.53    0.62   0.79   0.87         43    0.89
ECFP4       0.62    0.59   0.94   1.21         43    1.00

ECFP4 performance is low compared to the literature; Enum2Enum is close to it.
Bjerrum, Esben Jannik, and Boris Sattarov. 2018. “Improving Chemical Autoencoder Latent Space
and Molecular De Novo Generation Diversity with Heteroencoders.” Biomolecules.
Trends in searching for Hyperparameters
Larger is better for the decoder (except for the linear QSAR model)
Final Architecture
Validity: 99%
Reconstruction: 76 (80) %
QSAR performance using Linear regression model
Datasets from ExCAPE-DB
QSAR performance using SVM models
The latent space is non-linear: it works with non-linear models (SVM, NN), but not so well with linear ones (MLR)
Optimization of molecular properties
Decoder based. We currently use REINVENT.
Conclusions
• SMILES based autoencoders can be improved by training on non-canonical to different non-canonical SMILES (heteroencoders)
• Binomial sampling makes it possible for the RNN to solve the 1:many task on the output level
  • But reconstruction of molecules is non-deterministic
• The latent space is more relevant with respect to QSAR tasks than that of plain autoencoders
  • Nearly on par with ECFP4
• The latent space is non-linear and works best with non-linear ML models
• Can be used to optimize molecules
  • Benefits over current approaches at AZ (REINVENT) still not resolved
Toolkits – Source code - Links
Molvecgen: github.com/EBjerrum/molvecgen
Blogposts: www.wildcardconsulting.com
Deep Drug Coder (to be released)
https://github.com/EBjerrum/Deep-Drug-Coder.git
Acknowledgements
De Novo Design group
Ola Engkvist, Associate Director, Molecular AI
Christian Tyrchan, Team Leader - Computational Chemistry
Atanas Patronov, Associate Principal Scientist, Molecular AI
Michael Withnall, Ph.D student, Molecular AI
Rocio Mercado, Post.doc. Molecular AI
Jiazhen He, post.doc. Molecular AI
Josep Arus Pous, Ph.D student, BIGCHEM
Dhanushka Weerakoon, Graduate Scientist, IMED Graduate Programme
Simon Johansson, Master student
Oleksii Prydkhodko, Master student
Panagiotis-Christos Kotsias, Graduate Scientist, IMED Graduate Programme
Boris Sattarov, Informatics Programmer, Science Data Software LLC (Ext.)
Hongming Chen, Principal Scientist, Molecular AI
Thank you for listening
Confidentiality Notice
This file is private and may contain confidential and proprietary information. If you have received this file in error, please notify us and remove
it from your system and note that you must not copy, distribute or take any action in reliance on it. Any unauthorized use or disclosure of the
contents of this file is not permitted and may be unlawful. AstraZeneca PLC, 1 Francis Crick Avenue, Cambridge Biomedical Campus,
Cambridge, CB2 0AA, UK, T: +44(0)203 749 5000, www.astrazeneca.com
Figure on Slides 15 - 21 from Open Access: Bjerrum, Esben Jannik, and Boris Sattarov. 2018. “Improving Chemical Autoencoder Latent
Space and Molecular De Novo Generation Diversity with Heteroencoders.” Biomolecules.