SMILES Based Autoencoders: a base for QSAR feature modelling and generative modelling
Esben Jannik Bjerrum, Principal Scientist
Eighth Joint Sheffield Conference on Chemoinformatics, 17-19 June 2019
Introduction - Outline
The Players:
SMILES
Recurrent Neural Networks (RNN)
Long Short-Term Memory cells (LSTM)
SMILES enumeration
RNN as SMILES readers and generators
Combining it:
Autoencoders
Heteroencoder training trick
Latent space properties
Sampling in a 1:many setting
Use in QSAR
Use in Generative Modelling
Conclusion
Toolkits and Code
SMILES, a Chemical Language and Information System
Recurrent Neural Networks (RNN)
● Sequences of features as inputs
● The same task for every element of a sequence, with the output being affected by the previous computations
● Modeling of sequences such as text, tweets, time series, etc.
• Neural networks learn from vectors, matrices or tensors.
• One-hot encoding with a defined vocabulary converts SMILES strings into 2D matrices
One hot encoding
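A minimal sketch of the one-hot encoding step; the vocabulary and maximum length here are toy assumptions for illustration, not the ones used in the actual models:

```python
import numpy as np

# Toy vocabulary; a real model builds this from the training set and
# adds start/stop tokens. " " is used as the padding token.
VOCAB = ["C", "c", "1", "(", ")", "O", "N", "=", " "]
CHAR2IDX = {ch: i for i, ch in enumerate(VOCAB)}

def one_hot_smiles(smiles, max_len=12):
    """Encode a SMILES string as a (max_len, vocab_size) one-hot matrix."""
    padded = smiles.ljust(max_len)  # pad to a fixed sequence length
    mat = np.zeros((max_len, len(VOCAB)), dtype=np.float32)
    for i, ch in enumerate(padded):
        mat[i, CHAR2IDX[ch]] = 1.0
    return mat

m = one_hot_smiles("c1ccccc1")  # benzene
print(m.shape)        # (12, 9)
print(m.sum(axis=1))  # exactly one "hot" entry per sequence position
```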
Long Short-Term Memory cells (LSTM)
Hochreiter, Sepp, and Jürgen Schmidhuber. 1997. “Long Short-Term Memory.” Neural Computation.
RNNs as an encoder
The internal LSTM state is updated from step to step, so the full sequence influences the final vector used for the prediction task.
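A toy illustration of the encoding idea: a plain tanh RNN with random weights stands in for the LSTM cells used in the talk, and all sizes are made up. The point is that the final hidden state is a fixed-size vector summarizing the whole sequence:

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_encode(seq_onehot, W_xh, W_hh, b_h):
    """Run a simple tanh RNN over a one-hot sequence and return the
    final hidden state: a fixed-size vector influenced by every step."""
    h = np.zeros(W_hh.shape[0])
    for x in seq_onehot:                        # one step per token
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)  # state carries the history
    return h

vocab_size, hidden = 9, 16
W_xh = rng.normal(size=(hidden, vocab_size)) * 0.1
W_hh = rng.normal(size=(hidden, hidden)) * 0.1
b_h = np.zeros(hidden)

seq = np.eye(vocab_size)[[0, 1, 0, 0, 2]]  # token indices -> one-hot rows
vec = rnn_encode(seq, W_xh, W_hh, b_h)
print(vec.shape)  # (16,)
```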
• Canonical SMILES ensures a 1:1 relationship between molecule and SMILES
• I go the other way and generate multiple SMILES for the same molecule
• Works as data augmentation
Enumeration of non-canonical SMILES
Live DEMO of generation
Bjerrum, Esben Jannik. 2017. “SMILES Enumeration as Data Augmentation for Neural Network
Modeling of Molecules.” http://arxiv.org/abs/1703.07076
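The atom-renumbering trick from the cited paper can be sketched with RDKit roughly like this (the function name is mine; the aspirin example is just an illustration):

```python
import random
from rdkit import Chem

def randomize_smiles(smiles, rng=random):
    """Generate a random non-canonical SMILES for the same molecule by
    shuffling the atom order before writing the string."""
    mol = Chem.MolFromSmiles(smiles)
    atom_order = list(range(mol.GetNumAtoms()))
    rng.shuffle(atom_order)
    mol = Chem.RenumberAtoms(mol, atom_order)
    return Chem.MolToSmiles(mol, canonical=False)

smi = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
variants = {randomize_smiles(smi) for _ in range(20)}
# The strings differ, but canonicalizing maps them back to one molecule:
canon = {Chem.MolToSmiles(Chem.MolFromSmiles(v)) for v in variants}
print(len(variants), len(canon))
```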
Random SMILES in practice
RNNs as generators
REINVENT: Olivecrona et al. Molecular de-novo design through deep reinforcement learning, J. Cheminf. 2017
SMILES Based Autoencoders
Example: Gómez-Bombarelli, Rafael et al. 2018. “Automatic Chemical Design Using a Data-Driven
Continuous Representation of Molecules.” ACS Central Science 4(2): 268–76.
Projecting non-canonical SMILES in latent space
[Figure: the two architectures, Conv2RNN and RNN2RNN]
A SMILES string is not a Molecule
[Figure: a molecule (benzene) next to its SMILES string, c1ccccc1]
HeteroEncoders
Also possible with InChIs and CAS names: Winter et al. 2018
From chemical images to SMILES: Bjerrum & Sattarov 2018
Projecting non-canonical SMILES in the latent space
Latent space molecular similarities
SMILES sequence similarity
Morgan FP Tanimoto similarity
Model       R² (FP metric)   R² (Seq metric)
Can2Can     0.24             0.58
Enum2Can    0.37             0.53
Can2Enum    0.58             0.55
Enum2Enum   0.49             0.40
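The two comparisons above pair a latent-space similarity against a molecular similarity. A rough sketch of the underlying measures, with placeholder fingerprint bit lists and latent vectors (not outputs of the real models):

```python
import numpy as np

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as bit-index lists."""
    a, b = set(fp_a), set(fp_b)
    return len(a & b) / len(a | b)

def latent_similarity(z_a, z_b):
    """Cosine similarity between two latent vectors."""
    return float(z_a @ z_b / (np.linalg.norm(z_a) * np.linalg.norm(z_b)))

# Placeholder data: two fingerprints sharing 2 of 4 total bits
print(tanimoto([1, 5, 9], [1, 5, 7]))  # 0.5

# Identical latent vectors have cosine similarity 1
z = np.array([0.3, -1.2, 0.7])
print(latent_similarity(z, z))
```

In the study, R² is the correlation of latent-space similarity against each molecular metric across many molecule pairs.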
Non-deterministic sampling of molecules

                                      Can2Can   Enum2Enum (2l)
Unique SMILES                         1         111
% Correct molecule                    100       57
Unique SMILES for correct molecule    1         42
Unique molecules                      1         17
1:Many sampling from latent space point
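The 1:many decoding above relies on sampling each output token from the decoder's softmax distribution instead of taking the argmax, so repeated decodes from the same latent point yield different SMILES. A sketch with made-up logits:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_token(logits, temperature=1.0):
    """Sample one token index from a softmax over the decoder logits.
    Argmax would always return the same token; sampling varies."""
    p = np.exp(logits / temperature)
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

logits = np.array([2.0, 1.0, 0.5, -1.0])  # made-up per-step decoder output
draws = [sample_token(logits) for _ in range(1000)]
print(len(set(draws)))  # more than one distinct token gets drawn
```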
Latent vectors as a base for QSAR models
Figure adapted from : Bjerrum, Esben Jannik, and Boris Sattarov. 2018. “Improving Chemical Autoencoder Latent Space and Molecular
De Novo Generation Diversity with Heteroencoders.” Biomolecules.
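Using latent vectors as QSAR descriptors can be sketched as follows; here a closed-form ridge regression on synthetic stand-in data replaces both the real encoder outputs and the deep neural network models used in the deck:

```python
import numpy as np

rng = np.random.default_rng(0)

def ridge_fit(Z, y, alpha=1.0):
    """Closed-form ridge regression: latent vectors Z as descriptors,
    measured activity y as the target."""
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + alpha * np.eye(d), Z.T @ y)

# Hypothetical data standing in for encoder outputs and activities
Z = rng.normal(size=(200, 8))   # 200 molecules, 8-dim latent vectors
w_true = rng.normal(size=8)
y = Z @ w_true + 0.1 * rng.normal(size=200)

w = ridge_fit(Z, y)
rmse = np.sqrt(np.mean((Z @ w - y) ** 2))
print(rmse < 0.2)  # True: the latent features recover the signal
```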
Latent space gets more relevant for QSAR modelling tasks
RMSEP on 5 datasets modelled using deep neural networks (lower is better):

Model       IGC50   LD50   BCF    Solubility   MP    Norm. mean
Enum2Enum   0.43    0.54   0.71   0.65         37    0.75
Can2Enum    0.46    0.54   0.69   0.69         37    0.77
Enum2Can    0.46    0.57   0.71   0.66         38    0.78
Can2Can     0.53    0.62   0.79   0.87         43    0.89
ECFP4       0.62    0.59   0.94   1.21         43    1.00

ECFP4 performance is low compared to the literature; Enum2Enum is close to it.
Bjerrum, Esben Jannik, and Boris Sattarov. 2018. “Improving Chemical Autoencoder Latent Space
and Molecular De Novo Generation Diversity with Heteroencoders.” Biomolecules.
Trends in searching for Hyperparameters
Larger is better for the decoder (except for the linear QSAR model)
Final Architecture
Validity: 99%
Reconstruction: 76 (80) %
QSAR performance using Linear regression model
Datasets from ExCAPE-DB
QSAR performance using SVM models
The latent space is non-linear: it works with non-linear models (SVM, NN), but not so well with linear ones (MLR)
Optimization of molecular properties
Decoder based. We currently use REINVENT.
Conclusions
• SMILES based autoencoders can be improved by training on non-canonical to different non-canonical SMILES (heteroencoders)
• Binomial sampling makes it possible for the RNN to solve the 1:many task on the output level
  • But reconstruction of molecules is non-deterministic
• The latent space is more relevant with respect to QSAR tasks than that of plain autoencoders
  • Nearly on par with ECFP4
• The latent space is non-linear and works best with non-linear ML models
• Can be used to optimize molecules
  • Benefits over current approaches at AZ (REINVENT) still not resolved
Toolkits – Source code - Links
Molvecgen: github.com/EBjerrum/molvecgen
Blogposts: www.wildcardconsulting.com
Deep Drug Coder (to be released)
https://github.com/EBjerrum/Deep-Drug-Coder.git
Acknowledgements
De Novo Design group
Ola Engkvist, Associate Director, Molecular AI
Christian Tyrchan, Team Leader - Computational Chemistry
Atanas Patronov, Associate Principal Scientist, Molecular AI
Michael Withnall, Ph.D student, Molecular AI
Rocio Mercado, Post.doc. Molecular AI
Jiazhen He, post.doc. Molecular AI
Josep Arus Pous, Ph.D student, BIGCHEM
Dhanushka Weerakoon, Graduate Scientist, IMED Graduate Programme
Simon Johansson, Master student
Oleksii Prydkhodko, Master student
Panagiotis-Christos Kotsias, Graduate Scientist, IMED Graduate Programme
Boris Sattarov, Informatics Programmer, Science Data Software LLC (Ext.)
Hongming Chen, Principal Scientist, Molecular AI
Thank you for listening
Confidentiality Notice
This file is private and may contain confidential and proprietary information. If you have received this file in error, please notify us and remove
it from your system and note that you must not copy, distribute or take any action in reliance on it. Any unauthorized use or disclosure of the
contents of this file is not permitted and may be unlawful. AstraZeneca PLC, 1 Francis Crick Avenue, Cambridge Biomedical Campus,
Cambridge, CB2 0AA, UK, T: +44(0)203 749 5000, www.astrazeneca.com
Figure on Slides 15 - 21 from Open Access: Bjerrum, Esben Jannik, and Boris Sattarov. 2018. “Improving Chemical Autoencoder Latent
Space and Molecular De Novo Generation Diversity with Heteroencoders.” Biomolecules.