Shuxing Zhang, Alexander Golbraikh and Alex Tropsha The Laboratory for Molecular Modeling

Development of Novel Geometrical Chemical Descriptors and Their Application to the Prediction of Ligand-Protein Binding Affinity

Shuxing Zhang, Alexander Golbraikh and Alex Tropsha

The Laboratory for Molecular Modeling, School of Pharmacy

University of North Carolina at Chapel Hill

April 19, 2023

Problem

Given a protein-ligand complex, predict ligand binding affinity.

Knowledge-based (Statistical) Potentials

• Two-body potentials

PMF: Muegge, I.; Martin, Y. C. J. Med. Chem. 1999, 42, 791-804

BLEEP: Mitchell, J. B.; Laskowski, R. A.; Alex, A.; Thornton, J. M. J. Comp. Chem. 1999, 20, 1165-1176

DrugScore: Gohlke, H.; Hendlich, M.; Klebe, G. J. Mol. Biol. 2000, 295, 337-356

SMoG: DeWitte, R. S.; Shakhnovich, E. I. J. Am. Chem. Soc. 1996, 118, 11733-11744

SMoG2001: Ishchenko, A. V.; Shakhnovich, E. I. J. Med. Chem. 2002, 45, 2770-2780

• Four-body contact potential (by Jun Feng)

Full Atom-based Delaunay tessellation of Protein-ligand Interface (5HVP)

An example of active site tessellation: the ribbon diagram represents the two chains of HIV-1 protease. The ligand, acetyl-pepstatin, is shown in spacefill mode, and the yellow tetrahedra are those formed by protein and ligand atoms together.

RRRL: formed by 3 receptor atoms and 1 ligand atom

RRLL: formed by 2 receptor atoms and 2 ligand atoms

RLLL: formed by 1 receptor atom and 3 ligand atoms

Three Types of Tetrahedra at Protein-ligand Interface
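The three interfacial types can be told apart simply by counting ligand vertices. The following is a minimal sketch, assuming the tessellation has already been computed and that each tetrahedron is available as four labels, 'R' for a receptor atom and 'L' for a ligand atom. The function name and the input layout are hypothetical, not taken from the slides.

```python
from collections import Counter


def classify_tetrahedron(vertex_sides):
    """Return 'RRRL', 'RRLL' or 'RLLL', or None if the tetrahedron is not interfacial.

    vertex_sides: four labels, each 'R' (receptor atom) or 'L' (ligand atom).
    """
    if len(vertex_sides) != 4 or any(s not in ("R", "L") for s in vertex_sides):
        raise ValueError("expected four labels, each 'R' or 'L'")
    n_ligand = sum(1 for s in vertex_sides if s == "L")
    if n_ligand in (0, 4):
        return None  # all-receptor or all-ligand: not at the interface
    return "R" * (4 - n_ligand) + "L" * n_ligand


# Hypothetical tessellation output: one tuple of vertex labels per tetrahedron.
tetrahedra = [
    ("R", "R", "L", "R"),
    ("L", "R", "L", "R"),
    ("R", "R", "R", "R"),
    ("L", "L", "R", "L"),
    ("L", "L", "L", "L"),
]
labels = [classify_tetrahedron(t) for t in tetrahedra]
counts = Counter(label for label in labels if label is not None)
print(counts)
```

Tetrahedra made only of receptor atoms or only of ligand atoms are dropped, since they do not span the interface.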

[Equations: one score per tetrahedron type, E_RRRL, E_RRLL and E_RLLL, each defined through the natural logarithm (ln) of frequency terms f for that type]

Earlier work: Four-Body Statistical Contact Scoring Function Based on Delaunay Tessellation

[Figure: scatter plot of calculated vs. experimental ΔΔG (axes "ΔΔG, calc" and "ΔΔG, exp", each from -100 to 0); R2 = 0.4678]

$E = E_{RRRL} + E_{RRLL} + E_{RLLL}$

Correlation between experimental and calculated binding free energy for PMF dataset using four-body scoring function

Scoring function   Training set size   Test set size   Test set R2
BLEEP              351                 90              0.53
PMF                697                 77              0.61
SMoG96             120                 46              0.42
SMoG2001           725                 111             0.436
DT2001             319                 67              0.71
DT2002             319                 107             0.54

Comparison of Current Scoring Functions

Multiple CG descriptors of protein-ligand interface and correlation with ligand affinity

• Define the ligand-receptor interface by means of Delaunay tessellation (DT)

• Calculate chemical descriptors for nearest-neighbor atom quadruplets

• Use a statistical data modeling approach to correlate descriptors with affinity

µ: Electronegativity (chemical potentials) of atoms

Q: Partial charges on atoms

η: Hardness kernel

Descriptors derived from atomic electronegativity

According to a study from Dr. Berkowitz's lab, electronegativity (EN) is closely related to the energy of molecules (see formula). Qualitatively, we also know that it is related to hydrogen bonding, polarity and polarization. We hope to be able to describe structure and binding with this parameter by applying it to Delaunay tessellation. There are several ways to apply EN to our geometrical method.

Ligand Atom Types

O: EN = 3.4

N: EN = 3.0

C: EN = 2.5

S: EN = 2.4

X: P and halogens, EN = 2.0 ~ 2.4, 4.0

M: Metals and all other unexpected atom types, EN = 0.6 ~ 1.6

Receptor Atom Types

O: EN = 3.4

N: EN = 3.0

C: EN = 2.5

S: EN = 2.4

There are 554 possible interfacial quadruplet composition types. After processing 517 complexes, 100 are found to occur with high frequency (at least 50 times).
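The count of 554 can be reproduced by enumeration. The sketch below assumes that a composition type is an unordered set of receptor atom types combined with an unordered set of ligand atom types (repeats allowed), over the three interfacial patterns RRRL, RRLL and RLLL. That definition is an assumption, but it does yield 554 with the six ligand and four receptor atom types listed above.

```python
from itertools import combinations_with_replacement

LIGAND_TYPES = ["O", "N", "C", "S", "X", "M"]  # six ligand atom types
RECEPTOR_TYPES = ["O", "N", "C", "S"]          # four receptor atom types


def composition_types():
    """Enumerate (receptor multiset, ligand multiset) pairs for RRRL, RRLL and RLLL."""
    types = []
    for n_receptor in (3, 2, 1):
        n_ligand = 4 - n_receptor
        for rec in combinations_with_replacement(RECEPTOR_TYPES, n_receptor):
            for lig in combinations_with_replacement(LIGAND_TYPES, n_ligand):
                types.append((rec, lig))
    return types


all_types = composition_types()
print(len(all_types))  # 554 = 120 (RRRL) + 210 (RRLL) + 224 (RLLL)
```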

Atom Type Definition Based on EN Values

In order to generate descriptors, the atom types must be defined. Here we use EN as the criterion; the reasons will be discussed on the next slides. Basically, we want our descriptors to make more physico-chemical sense, in the hope of explaining the complicated binding process mechanistically. A second reason is to keep the number of descriptors from growing too large.

m: m-th tetrahedral composition type

j: Vertex of a tetrahedron

n: Number of tetrahedra of the m-th composition type

Thus, there are 100 descriptors for each protein-ligand complex

Descriptor Calculation

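[Figure: example interfacial tetrahedron with vertices S_L (EN = 2.4), C_R (EN = 2.5), O_L (EN = 3.4) and N_R (EN = 3.0)]

$EN_m = \sum_{i=1}^{n} \sum_{j=1}^{4} EN_{ij}$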
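As a rough illustration of the descriptor calculation, the sketch below assumes that the descriptor for composition type m is the sum of the EN values over the four vertices of every tetrahedron of that type. The data layout, the function names and the restriction to O, N, C and S (the types with a single EN value on the slides) are assumptions made for the example.

```python
from collections import defaultdict

# EN values for the atom types that have a single value on the slides.
EN = {"O": 3.4, "N": 3.0, "C": 2.5, "S": 2.4}


def composition_key(tetrahedron):
    """Order-independent label, e.g. ('C_R', 'N_R', 'O_L', 'S_L').

    tetrahedron: four (atom_type, side) pairs, side being 'R' or 'L'.
    """
    return tuple(sorted(f"{atom}_{side}" for atom, side in tetrahedron))


def en_descriptors(tetrahedra):
    """Sum vertex EN values over all tetrahedra of each composition type."""
    descriptors = defaultdict(float)
    for tet in tetrahedra:
        descriptors[composition_key(tet)] += sum(EN[atom] for atom, _ in tet)
    return dict(descriptors)


# The example tetrahedron from the slide: S_L, C_R, O_L, N_R.
example = [("S", "L"), ("C", "R"), ("O", "L"), ("N", "R")]
desc = en_descriptors([example, example])  # two occurrences of the same type
print(desc)
```

One such tetrahedron contributes 2.4 + 2.5 + 3.4 + 3.0 = 11.3 to its composition type, so two occurrences give a descriptor value of about 22.6.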


Flowchart of Novel Descriptor Generation

264 complexes

Process files and assign atom types based on EN value

Define the interaction interface with DT and record all interfacial tetrahedra

Classify interfacial tetrahedra into different composition types and calculate their EN values (descriptors)

Correlate with binding

Data Modeling

Structure      Binding     CG Descriptors
Comp.1         Value1      D1   D2   D3   D4
Comp.2         Value2      "    "    "    "
Comp.3         Value3      "    "    "    "
- - -          - - -       - - - - - - - -
Comp.N (264)   Value264    "    "    "    "

Goal: Establish correlations between descriptors and the binding affinity capable of predicting binding of novel complexes

{Binding affinity} = K̂{descriptor diversity}

[Figure: bar chart of the number of complexes (y-axis, 0 to 30) in each complex family (x-axis)]

Diversity of the dataset: 264 Complexes, 33 families

Our structures and protein families are highly diverse, which makes their binding affinities hard to predict for most current scoring functions.

Data Modeling Workflow

264 complexes

Randomly exclude 24 complexes as an external set

Split the remaining 240 into multiple training sets and multiple test sets

Multiple training sets: variable selection kNN to build models, with Y-randomization

Multiple test sets: binding prediction

Only accept models that have q2 > 0.6, R2 > 0.6, etc.

Validate predictive models with the randomly selected external set (24)
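The partitioning step can be sketched as follows. The slides state that the 24 external complexes are excluded at random; how the remaining 240 were divided into training and test sets is not stated, so the random split below is an assumption, and the 49-complex test set size is taken from the results reported later. The function name and parameters are hypothetical.

```python
import random


def split_dataset(ids, n_external=24, n_test=49, seed=0):
    """Randomly set aside an external set, then split the rest into training and test."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    external = shuffled[:n_external]
    modeling = shuffled[n_external:]
    test = modeling[:n_test]
    training = modeling[n_test:]
    return training, test, external


complex_ids = list(range(264))  # placeholder identifiers for the 264 complexes
training, test, external = split_dataset(complex_ids)
print(len(training), len(test), len(external))  # 191 49 24
```

Calling the function with different seeds gives different training/test divisions, which is one simple way to obtain multiple training and test sets.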

k Nearest Neighbor (kNN) with Variable Selection

Randomly select a subset of descriptors (a hypothetical descriptor pharmacophore)

Leave out a complex: leave out one complex from the training set and calculate the distance between the eliminated complex and all remaining complexes (in the original 100-descriptor space)

Find the k nearest neighbors in the training set

Predict the binding affinity of the eliminated complex by weighted kNN using the identified k nearest neighbors

Calculate the predictive ability (q2) of the model

Select acceptable models (with q2 > 0.6)

[Flowchart loop labels: "N times" on two loops, and "SA"]
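The leave-one-out loop above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: inverse-distance weighting is an assumption (the slides say only "weighted kNN"), and the descriptor columns used for the distance are passed in as a parameter, so the same function covers either a selected subset or all 100 descriptors.

```python
import math


def loo_knn_q2(X, y, selected, k=3):
    """Leave-one-out q2 of a weighted kNN model on the given descriptor columns.

    X: list of descriptor vectors, y: binding affinities,
    selected: indices of the descriptor columns used for distances.
    """
    predictions = []
    for i in range(len(X)):
        # Distances from the left-out complex to every remaining complex.
        dists = []
        for j in range(len(X)):
            if j == i:
                continue
            d = math.sqrt(sum((X[i][c] - X[j][c]) ** 2 for c in selected))
            dists.append((d, y[j]))
        dists.sort(key=lambda pair: pair[0])
        neighbors = dists[:k]
        # Inverse-distance weights (assumed scheme); epsilon avoids division by zero.
        weights = [1.0 / (d + 1e-9) for d, _ in neighbors]
        pred = sum(w * value for w, (_, value) in zip(weights, neighbors)) / sum(weights)
        predictions.append(pred)
    mean_y = sum(y) / len(y)
    press = sum((p - t) ** 2 for p, t in zip(predictions, y))
    ss_total = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - press / ss_total


# Toy data: one descriptor that tracks affinity exactly.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
q2 = loo_knn_q2(X, y, selected=[0], k=2)
print(q2 > 0.6)  # model would pass the q2 > 0.6 acceptance threshold
```

In a full run, this function would be called repeatedly with different descriptor subsets, keeping only those subsets whose q2 exceeds 0.6.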

[Figure: scatter plot of predicted vs. actual pKi, both axes from 0 to 12]

Correlation of Actual ~ Predicted Binding Affinity for 49 Test Set Complexes

Predictions were made with multiple models; this plot shows the best model. R2 is about 0.783 and RMSD is about 0.91 (I will let you know the equivalent binding energy).

[Figure: scatter plot of predicted vs. actual pKi, both axes from 0 to 12]

Correlation of Actual ~ Predicted Binding Affinity for 24 Complexes with Best Model

United consensus prediction: combine the training and test sets and make a consensus prediction for the 24 external complexes. R2 is about 0.70 and RMSD is about 0.89.

Scoring function   Training set size   Test set size   Test set R2
BLEEP              351                 90              0.53
PMF                697                 77              0.61
SMoG96             120                 46              0.42
SMoG2001           725                 111             0.436
DT2001             319                 67              0.71
DT2002             319                 107             0.54
CG                 191                 49              0.78

Comparison of Current Scoring Functions

• Novel geometrical chemical descriptors have been developed

• These simple yet fundamental descriptors can be used to predict binding affinity using correlation approaches, and they have high predictive power for diverse ligand-protein structures

• The statistical models can be used for fast and accurate scoring of complexes resulting from docking studies

Conclusions
