Convolutional Neural Networks for Protein-Ligand...

Methods Results

ConclusionWe have shown that a convolutional neural network with a well-optimized model andappropriate training dataset has great potential in aiding with drug discovery. Creatingvariety in the poses through rotation, translation and shuffling during training areimportant in training the model. Other parameters such as network depth and widthcan also reduce overfitting. Visualizations highlight the pose sensitivity of the CNNmodel and can emphasize regions of interest in the protein-ligand complex.

Our best model performs better than Autodock Vina at pose selection when evaluatedfor pose predication performance (CSAR) and virtual screening performance (DUD-E),although the nature of the training data greatly influences the result.

ConvolutionalNeuralNetworksforProtein-LigandScoringMattRagoza1,2, ElisaIdrobo3,6,JoshuaHochuli2,4,JocelynSunseri5 andDavidKoes5

1DepartmentofNeuroscience,2DepartmentofComputerScience,3TECBioREU@Pitt,4DepartmentofBiologicalSciences,5Dept.ofComputationalandSystemsBiology,UniversityofPittsburgh,Pittsburgh,PA15260,6TheCollegeofNewJersey,Ewing,NJ08618

AbstractComputational approaches to drug discovery reduce the time and cost associated withexperimental assays and enable the screening of novel chemotypes. Structure-baseddrug design methods rely on scoring functions to rank and predict binding affinities andposes. The ever expanding amount of protein-ligand binding and structural dataenables deep machine learning techniques for protein-ligand scoring.

We describe a convolutional neural network (CNN) scoring function that takes as inputa comprehensive 3D representation of a protein-ligand interaction. A CNN scoringfunction automatically learns the key features of protein-ligand interactions thatdetermine binding. We train and optimize our CNN scoring functions to discriminatebetween correct and incorrect binding poses and known binders and nonbinders. Wefind that our CNN scoring function outperforms the AutoDock Vina scoring functionwhen ranking poses both for pose prediction and virtual screening.

BackgroundProtein-ligand scoring provides a metric of binding strengthbetween small molecules and target proteins. This has a widearray of uses such as virtual screening, which filters largedatabases of candidate molecules for potential hits, anddocking, which predicts the binding pose of a ligand.

Machine learning strategies have been used to treat scoring as a classification problembetween “good” and “bad” binding states, though this often requires manuallyselecting properties that the model uses for discrimination, for example pairwiseinteractions and global counts of typical chemical interactions. However, other machinelearning models can learn the most important features directly from data.

Input data are fed forward through the network, and a prediction is output by the lastlayer. A neural network is trained by iteratively updating its weights by minimization ofan objective function, for example, the mean squared deviation between predictionsand their ground truth labels.

Within the last decade convolutional neural networkshave become the state-of-the-art in imageclassification. Convolutional layers only have connectionweights to small spatial subsets of the previous layer,and apply these weight kernels across the entire inputto produce feature maps.

The fact that convolutional layers learn local featuresand apply them across the entire input space leads tofaster training and improved accuracy on data with aspatial structure.

The rise of GPU computing in combination with otheradvances has made training networks with many morelayers feasible, leading to the surge of research in deeplearning. Each successive layer in a deep neural networklearns features at a higher level of abstraction.

Protein-ligand scoring is a natural generalization of image recognition where the full 3D“images” of protein-ligand complexes are used for training. Convolutional neural netstrained on protein-ligand interactions have the potential to provide substantially moreaccurate scoring functions for improved docking and virtual screening.

DatasetsAllposesgeneratedwithsmina/AutoDock VinaCommunityStructure-ActivityResource(CSAR)§ CSARversion2010with2011update§ 337targets,eliminateweakbinders§ Crystalstructuresà Referenceposes§ Poses<2ÅRMSDà actives§ Poses>4ÅRMSDà decoysDatabaseofUsefulDecoys-Enhanced(DUD-E)§ 101targets§ Activeanddecoyligands§ Unknowncorrectposes

InputFormatVoxelgridcenteredatactivesitecalculatedfrommoleculardata34atomtypechannels§ 16receptoratomtypes§ 18ligandatomtypes23.5Å3 Gaussianatomgrid§ 0.5Åresolution§ 483 points

TrainingCaffe DeepLearningFrameworkTrainto10,000iterationsLayer-wisemodeldefinition§ N-dimensionalinputlayer§ Convolutionallayers§ Non-linearlayers(rectifiedlinearunits)§ Fully-connectedlayers§ Softmax(converttoprobabilities)§ Multinomiallogisticloss(2-class)Performance§ Mini-batchparallelism(batchsize=10)§ Multi-GPUsupport

ModelEvaluationReceiveroperatorcharacteristic(ROC)§ Falsepositivevs.truepositiverate§ AreaunderROCcurve(AUC)Clustered3-foldcross-validation§ Splittargetsinto3balancedfolds§ DUD-Etargetswith80%similarityweregroupedintosamefold

§ Train3models,leavingonefoldout§ Combineperformanceontestsets§ Avoidstestingontargets/ligandssimilar totrainingset

BootstrappedAUC

OptimizationCSARsetChangesingleparameterrelativetoareferencemodel§ Widthofnetwork§ Depthofnetwork§ Resolutionofgrid§ Rotating,translating,balancingandshufflingduringtraining

§ Typesofpoolinglayers§ Numericalvsbinaryoccupancies§ Valuesofradiusmultiplier§ FullyconnectedlayerattheendEvaluatebyaccuracyandtrainingtimeCombinebestchangesintonewreferencemodelandrepeat

DUD-EEvaluationCross-validationwasdonewiththebestmodelDifferentratiosofDUD-EtoCSARdatawereusedtotrainMultipleposesofligandsvssingleposesweretested§ Themaximumscoreofaligand’sposeswastakenastheligand’sscore

§ Eachroundofoptimizationincreasedaccuracyanddecreasedtrainingtime

§ Rotations,smalltranslations,balancingbetweenactivesanddecoysandshufflingorderduringtrainingreduceoverfittingtodata

§ Reducingdimensionslowerstrainingtime

§ Higherresolutionincreasesaccuracybutalsosignificantlyincreasestrainingtime

§ OptimizationincreasedtheAUCofthebestmodelfrom0.78to0.82

§ BestmodelhasabetterAUCthanVina’s scoringfunction

Optimization

LigandbindingpredictionwithDUD-ENeural networks are a supervised machine learningalgorithm inspired by the nervous system. A basicnetwork consists of an input layer, one or more hiddenlayers, and an output layer of interconnected nodes. Eachhidden node computes a feature that is a function of theweighted input it receives from the nodes of the previouslayer.

ThisworkissupportedbyR01GM108340fromtheNationalInstituteofGeneralMedicalSciences.TECBio REU@PittissupportedbytheNationalScienceFoundationunderGrantDBI-1263020andisco-fundedbytheDepartmentofDefenseinpartnershipwiththeNSFREUprogram.

FutureWork• Explore alternative network topologies, such as residual neural networks• Evaluate the use of noise models when training• Investigate the use of CNN scoring for affinity prediction• More informative visualizations from backpropagated gradients• Extract positional gradients from neural network to support energy minimization• Use CNN energy minimization to implement CNN-based pose generation• Use reinforcement learning to iteratively refine CNN models for pose generation• Deploy an open-source comprehensive CNN-based molecular docking and energy

minimization software package (http://github.com/gnina)

GPUAcceleratedMolecularGridding§ Parallelizeoveratoms toobtainamaskofatomsthatoverlapeachgridregion

§ Useexclusivescantoobtainalistofatomindicesfromthemask

§ Parallelizeovergridpoints,usingreducedatomlisttoavoidO(Natoms)check

VisualizationMethod

CNNscoringtrainedwithDUD-EhasbetterperformancethanVina for90%oftargets

TrainingwithaDUD-E/CSAR2:1mixisbetterfor81%oftargets

• TrainingwithCSARandtestingonDUD-Eorvice-versaisnotveryeffective• Scoringmultipleposes,insteadoftheposetop-rankedbyVina,andtakingthemaximumscore

increasesaccuracyforvirtualscreening,suggestingtheCNNmodelcanselectbetterposes• UsingbothCSARandDUD-Etotrainincreasesthisgain

Ligand§ Singleatomdecomposition• Removeatomsonebyoneandscore• Computescoredifference• Setatomwithscoredifference

§ Fragmentdecomposition• FragmentligandwithRDKit• Foreachfragment

• Scoreligandwithoutfragment• Accumulatescoredifferenceonfragmentatoms

§ Averagesingleatomandfragmentscores

Protein§ Removewholeresiduesatatime§ Computescoredifference§ SetallresidueatomstoscoredifferenceStoredifferenceinb-factorfieldof.pdb fileVisualizewithPyMOL

Spatiallyaccurate:Whencomparedwithcrystalposes,portionsofthetestposesthatareinthesamelocationscorewell.Atomsfarremovedfromtheircrystallocationoftenscorepoorly.

Ligandandreceptoragreement:Effectsareoftenshownonbothmoleculesatinteractinglocations.

Ligandmoresignificantthanreceptor:Ligandsoftenshowmoresignificantnegativeandpositivevaluesthanreceptors,exceptinthepresenceofmetalions,whichoftendominatethescore.

Littleconsensusoncarbonatoms:Modelsfromdifferenttrainingsetsdisagreeonwhichcarbonssignificantlycontributetooverallscore.

Insights

single atom

fragment

Convolutional Neural Networks for Protein-Ligand...

Documents

Transcript of Convolutional Neural Networks for Protein-Ligand...

LIGAND PHARMACEUTICALS INCORPORATED · Hovione Hovione Farmaciencia IPR&D In-Process Research and Development Ligand Ligand Pharmaceuticals Incorporated, including subsidiaries

Assessment of Programs for Ligand Binding Afﬁnity Predictionpwp.gatech.edu/.../385/2018/...of-Programs-for-Ligand-Binding-Affinit… · observed binding energies for ligand–receptor

Ligand chemistry

9.protein ligand interactions2

Ligand Field n MOT

Ligand Pharmaceuticals - Appendix

P-Selectin Glycoprotein Ligand 1 Is a Ligand for L-Selectin on

Ligand Design

Protein-ligand docking

Ligand-Focused HTS-NMRbionmr.unl.edu/files/publications/119.pdf · detecting protein-ligand interactions, for identifying the ligand-binding site, for calculating dissociation con-

Ligand Substitution

Protein – ligand ppt

Ligand-Controlled Regioselectivity in the …digital.csic.es/bitstream/10261/64805/4/Ligand-Controlle...alkenyl ligand and subsequently the branched vinyl sulfide. Particularly, RhI

Literacy Learning for Lifelong Readers BTS Family Literacy Night December 3, 2013 Hosted by Reading Specialists Donna Turso Lucy Ragoza.

Visualization of Convolutional Neural Network …bits.csb.pitt.edu/files/CNN_Viz_ACS17.pdf1Department of Computer Science,2Department of Biological Sciences, 3Department of Computational

Metal-Ligand Bond

Docking Studies in the Inhibition of Human Serine ...bits.csb.pitt.edu/files/Raj_docking_studies_shmt.pdf · Docking Studies in the Inhibition of Human Serine Hydroxymethyltransferase

Study of the ligand effects on the metal-ligand bond in some new … · 2013. 12. 9. · Study of the ligand effects on the metal-ligand bond in some new organometallic complexes

Fas and Fas Ligand in Embryos and Adult Mice: Ligand ...

Ligand Binding Domain