Similarity search and QSAR...

Click here to load reader

  • date post

    18-Jun-2020
  • Category

    Documents

  • view

    3
  • download

    0

Embed Size (px)

Transcript of Similarity search and QSAR...

  • Similarity search and QSAR modeling

    Pavel Polishchuk

    Institute of Molecular and Translational Medicine Faculty of Medicine and Dentistry

    Palacky University

    [email protected] qsar4u.com

    November-December 2016, Palacky University, Olomouc, Czech republic for the short course of “introduction to ligand-based drug discovery”

  • Computer-aided drug design (CADD)

    Protein structure

    Unknown Known

    Liga

    nd

    str

    uct

    ure

    Unknown CADD of no use De novo design

    Known

    Ligand-based design

    QSAR, pharmacophore modeling,

    similarity searching

    Structure-based design

    Docking, molecular dynamics,

    pharmacophore modeling

  • Similarity principle

    Similia similibus curantur

    Similar chemicals are solved in similar ones

    Similar structures exhibit similar properties

  • Are they similar?

    morphine codeine

    S-thalidomide R-thalidomide

    tirofibane

    eptifibatide

    tirofibane ethyl ester

  • How to measure similarity?

    Structure representation (encoding)

    Similarity measure

  • Structure representation and similarity

    Structural keys

    Feature vectors

    Pharmacophore

    Shape similarity

    Field similarity

    Fingerprints

    10001001010101110111001

    11 4 4 21 … 5 4 0 0 1

    Similarity

  • Structure encoding. Structural similarity.

    Nomenclature

    Line notation (e.g. SMILES, InChi)

    Molecular graph (2D descriptors)

    Fragments count (e.g. structural keys, fingerprints)

    3D structures (geometrical descriptors)

    glucose

    OC[[email protected]]1OC(O)[[email protected]](O)[[email protected]@H](O)[[email protected]@H]1O

    InChI=1S/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2/t2-,3-,4+,5-,6?/m1/s1

    n(OH) = 5; n(ring6) = 1, etc

    shape, volume, surface, etc

    topological indices, etc

  • Structural keys

    Structural keys are predefined patterns (fixed length).

    • The presence/absence of each element, or if an element is common (nitrogen, for example), several bits might represent "at least 1 N", "at least 2 N", "at least 4 N", and so forth.

    • Unusual or important electronic configurations, such as "sp3 carbon" or "triple-bonded nitrogen.“

    • Rings and ring systems, such as cyclohexane, pyridine, or napthalene.

    • Common functional groups, such as alcohols, amines, hydrocarbons, and so forth.

    Examples:

    MACCS (166 or 960 bits) BCI CACTVS

  • MACCS keys (166 structural keys)

    49: [!+0], # CHARGE

    50: [#6]=[#6](~[#6])~[#6], # C=C(C)C

    51: [#6]~[#16]~[#8], # CSO

    52: [#7]~[#7], # NN

    53: [!#6;!#1;!H0]~*~*~*~[!#6;!#1;!H0], # QHAAAQH

    54: [!#6;!#1;!H0]~*~*~[!#6;!#1;!H0], # QHAAQH

    55: [#8]~[#16]~[#8], # OSO

    … … … … … … 0 1 1 0 … … … … … … … … …

    charge

    fixed length, binary or counts

  • Structural keys: advantages and drawbacks

    + fixed length(easy to store and update data base) - lack of generality (can have little discriminating power for certain data sets due to too dense or too sparse bit strings)

  • Fingerprints

    No predefined patterns, patterns are generated separately for each structure.

    Usually binary, but maybe counts (feature vectors)

  • Atom-pair fingerprints

    Atom type: element + # heavy neighbors + # pi electrons

    Atom-pair: atom type – topological distance – atom type

    O_1_1

    O_1_0

    N_2_1

    C_2_1

    C_3_1

    N_2_1-4-O_1_1 C_3_1-2-O_1_0 N_2_1-2-C_2_1

    R.E. Carhart, D.H. Smith, R. Venkataraghavan JCICS 25: 64-73 (1985)

  • Sequence-based fingerprints (Daylight)

    Generate all paths up to certain length considering atom and bond types (usually length is 2-7)

    0-bond paths N C O

    1-bond paths N-C C-C C-O C=O

    2-bond paths N-C-C C-C-O C-C=O

    3-bond paths N-C-C-O N-C-C=O

  • Atom-centric (augmented atoms) fingerprints

    Generate substructures starting from each atom and considering all its neighbors up to the specified distance (radius or diameter). Morgan fingerprints, Extended-connectivity fingerprints (ECFP). Functional-class fingerprints (FCFP) , etc.

    radius=2 (diameter=4)

    Rogers, D. & Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 50, 742-754 (2010)

  • Atom-centric (augmented atoms) fingerprints

    Morgan fingerprints like Functional-class fingerprints (FCFP)

    Rogers, D. & Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 50, 742-754 (2010)

    Feature Code SMARTS (RDKit implementation)

    Donor hd [$([N;!H0;v3,v4&+1]),$([O,S;H1;+0]),n&H1&+0]

    Acceptor ha [$([O,S;H1;v2;!$(*-*=[O,N,P,S])]),$([O,S;H0;v2]),$([O,S;-]),$([N;v3;!$(N-*=[O,N,P,S])]),n&H0&+0,$([o,s;+0;!$([o,s]:n);!$([o,s]:c:n)])]

    Aromatic a [a]

    Halogen h [F,Cl,Br,I]

    Basic b [#7;+,$([N;H2&+0][$([C,a]);!$([C,a](=O))]),$([N;H1&+0]([$([C,a]);!$([C,a](=O))])[$([C,a]);!$([C,a](=O))]),$([N;H0&+0]([C;!$(C(=O))])([C;!$(C(=O))])[C;!$(C(=O))])]

    Acidic ac [$([C,S](=[O,S,P])-[O;H1,-1])]

  • Fingerprints

    Each molecule has variable length set of substructures – variable length fingerprints

    N-C-C C-C-O C-C=O C-C-C C:C:C C:C:N C:N:C

    1 1 1 0 0 0 0

    1 1 1 1 0 0 0

    0 0 0 0 1 1 1

    2-bond sequences

  • Hashed fingerprints

    Have fixed length (usually 512, 1024 or 2048 bits)

    hash code

    N-C-C C-C-O C-C=O C-C-C C:C:C C:C:N

    13823 9740 37278 28478 874 283764

    0 0 0 1 0 0 1 1 0 1 0 0 1 0 1 0 0 0 1 1

    pseudo-random number generator

    fixed-length bit string

    Each substructure activates several bits (usually 4-5) to avoid collisions and produce bit string of enough density Missing bits mean that certain substructures are not presented, Active bits mean that certain substructure may be present (but due to possible collisions one cannot be sure)

  • Folding of fingerprints and keys bit strings

    To increase density of very sparse bit vectors

    0 0 0 1 0 0 1 1 0 1 0 0 1 0 1 0 0 0 1 1

    0 0 0 1 0 0 1 1 0 1

    0 0 1 0 1 0 0 0 1 1

    OR

    n bit fingerprints

    n/2 bits

    n/2 bits

    0 0 1 1 1 0 1 1 1 1 n/2 bit fingerprint

    Usually optimal density is 0.2-0.3

  • Feature vectors

    May contain counts of substructures as well as other features derived from the structure (calculated logP, MW, etc)

    Feature vectors usually used in QSAR modeling than in simple similarity search

    N-C-C C-C-O C-C=O C-C-C C:C:C C:C:N C:N:C

    1 1 1 0 0 0 0

    2 1 1 1 0 0 0

    0 0 0 0 3 2 1

    2-bond sequences

  • Maximum common substructure (MCS)

    furosemid (diuretic)

    sulfadiazine (antibiotic)

    aspirin (anti-coagulant, COX inhibitor)

    clopidogrel (anti-coagulant, P2Y12 inhibitor)

  • Similarity indices

  • Similarity indices for binary fingerprints

    Index name Formula Range

    Tanimoto (Jaccard) [0; 1]

    Tversky depends on α and β

    Dice [0; 1]

    Cosine [0; 1]

    Manhattan distance [0; n]

    Euclidian distance [0; n]

    cba

    cS BA

    ,

    ccbca

    cS BA

    )()(,

    ba

    cS BA

    2,

    ab

    cS BA ,

    cbaD BA 2,

    cbaD BA 2,

    a, b – number of bits “on” (1) in A and B bit strings; c – number of common bits “on” (1) in both A and B strings

    P. Willet, J.M. Barnard and G.M. Downs. J. Chem. Inf. Comput. Sci., 1998, 38 (6), pp 983–996

  • Similarity indices for numeric fingerprints

    Index name Formula Range

    Tanimoto (Jaccard) [-1/3; 1]

    Tversky depends on α and β

    Dice [-1; 1]

    Cosine [-1; 1]

    Manhattan distance [0; ∞]

    Euclidian distance [0; ∞]

    n

    i BiAi

    n

    i Bi

    n

    i Ai

    n

    i BiAi

    BA

    xxxx

    xxS

    1 ,,1

    2

    ,1

    2

    ,

    1 ,,

    ,

    n

    i BiAi

    n

    i Bi

    n

    i Ai

    n

    i BiAi

    BA

    xxxx

    xxS

    1 ,,1

    2

    ,1

    2

    ,

    1 ,,

    ,

    n

    i Bi

    n

    i Ai

    n

    i BiAi

    BA

    xx

    xxS

    1

    2

    ,1

    2

    ,

    1 ,,

    ,

    2

    n

    i Bi

    n

    i Ai

    n

    i BiAi

    BA

    xx

    xxS

    1

    2

    ,1

    2

    ,

    1 ,,

    ,

    n

    i

    BiAiBA xxD1

    ,,,

    n

    i

    BiAiBA xxD1

    ,,,

    xi,M – value of i-th descriptor of a molecule M; n – overall number of features in a numeric vector

    P. Willet, J.M. Barnard and G.M. Downs. J. Chem. Inf. Comput. Sci., 1998, 38 (6), pp 983–996

  • Example of Tanimoto similarity calculation

    0 0 0 1 0 0 1 1 0 1

    0 0 1 0 1 0 1 0 1 1 cba

    cS BA

    ,

    A

    B

    a = 4 b = 5 c = 2

    2857.07

    2

    254

    2,

    BAS

  • Example of Tanimoto similarity calculation on MCS

    cba

    cS BA

    ,

    a = 21 atoms b = 17 atoms

    c = 11 atoms

    4074.027

    11

    111721

    11,

    BAS

  • Example

    Tanimoto Dice

    MACCS 0.94 0.969

    Atom pairs 0.994 0.997

    Morgan (ECFP4-like) 0.759 0.878

    Morgan (FCFP4-like) 0.782 0.878

    *calculated with RDKit

    morphine codeine

  • Conclusion

    Results of similarity search will depend on selected fingerprints and similarity measure It is extremely fast and may be applied towards millions of compounds

  • QSAR modeling

  • Activity = F(structure)

    M – mapping function E – encoding function

    Modeling of compounds properties

    Activity = M(E(structure))

    D1 D2 D3 D4 D5 D6 … DN

    1 0 9 0 11 1 … 1

    4 0 1 0 0 0 … 1

    0 0 0 0 0 4 … 6

    0 2 3 6 0 0 … 3

    … … … … … … … …

    4 0 0 0 1 2 … 1

  • QSAR modeling workflow

    D1 D2 D3 D4 D5 D6 … DN

    1 0 9 0 11 1 … 1

    4 0 1 0 0 0 … 1

    0 0 0 0 0 4 … 6

    0 2 3 6 0 0 … 3

    … … … … … … … …

    4 0 0 0 1 2 … 1

    Structure Descriptors (features) Model

    Encoding (represent structure with

    numerical features)

    Mapping (machine learning)

  • 1) a defined endpoint

    2) an unambiguous algorithm

    3) a defined domain of applicability

    4) appropriate measures of goodness-of–fit, robustness and predictivity

    5) a mechanistic interpretation, if possible

    OECD principles of QSAR modeling

  • Overall QSAR workflow

    Input data

    Bioassays

    Databases

    Preprocessing Feature

    engineering Model

    learning Model

    validation

    Classification

    Regression

    Clustering

    Cross-validation

    Bootstrap

    Test set

    Applicability Domain

    Feature selection

    Feature combination

    Data normalization

    Feature extraction

    Interpretation

    j

    j

    ii

    z

    xxx '

  • Step 1. Data collection

    Scientific literature and patents Databases (ChEMBL, PubChem, BindingDB, etc)

    Traditionally modeled compounds should have the same mechanism of action, however using of complex non-linear machine learning method allows to model data sets with mixed or even unknown mechanism of action with reasonable accuracy.

    Conditions may substantially influence the results of bioassays (change in temperature, activators, detectors, etc)

    Units checking

  • Step 2. Data curation (normalization)

    Removal of mixtures, inorganics, metalorganics, etc

    Strip of salts, counterions, etc

    Ionization, if necessary (at the particular pH level)

    Chemotype normalization, resonance structure and tautomers

    Duplicates removal

    Manual checking

  • Step 3. Descriptors: classification Object type:

    molecular descriptors (single molecules) descriptors of molecular ensemble (mixtures, materials) reaction descriptors (reactions)

    Descriptor origin: calculated from the structure empirical (Hammet constants, lipophilicity chemical shifts in NMR, etc)

    Locality: local (atom charge) global (molecular weight, molecular volume, lipophilicity, etc)

    Dimensionality: 1D (number of methyl groups, molecular weight, etc) 2D (topological indices, fragmental descriptors) 3D (molecular volume, quantum chemical descriptors) 4D (based on a set of conformers)

    Calculation method: physico-chemical (lipophilicity, etc) topological (invariants of molecular graph, Randic index, Wiener index, etc) fragmental (fingerprints, etc) pharmacophore spatial (moment of inertia, etc) quantum-chemical (energy of HOMO/LUMO, etc) etc. R. Todeschini and V. Consonni Handbook of Molecular Descriptors, 2008

  • Step 4. Feature processing

    Feature transformations: linear and non-linear scaling

    Feature combinations: Feature selection:

    removal of correlated descriptors (important e.g. for linear regression models) removal of descriptors with missing values removal constant and near constant features removal of descriptors with low correlation with the target property step-wise feature selection evolutionary feature selection (genetic algorithm, etc)

    sd

    xxz ii

    n

    i

    i xxn

    nsd

    1

    2)(1

    ixi ez

    1

    1range (0; 1)

    jiij xxz

    2

    ii xz add quadratic term

  • Step 5. Model building

    Unsupervised clustering

    Supervised

    Regression Classification

    Multiple linear regression (MLR)

    Partial linear regression (PLS) Logistic regression

    Gaussian Process (GP) Naïve Bayes (NB)

    Decision trees (DT)

    Support vector machine (SVM)

    Neural nets (NN)

    Random forest (RF)

    k-Nearest neighbors (kNN)

  • Step 6. Validation

    Test set (usually 20-25% of the work set)

    1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2 working set

    random test set 1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2

    stratified test set

    1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2

    Cross-validation

    1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2 working set

    1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2

    1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2

    1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2

    fold 1

    fold 2

    fold 3

    predictions of different folds are combined to calculate the final predictive measure

  • Step 6. Measures of predictive ability of models

    Classification

    FPTN

    TNySpecificit

    FNTP

    TPySensitivit

    2

    ySensitivitySpecificitaccuracy Balanced

    baseline-1

    baseline -accuracy Kappa

    N

    TNTPaccuracy

    2N

    FP)FN)(TP(TPFN)FP)(TN(TNbaseline

    Confusion matrix

    Predicted

    positive class (1)

    negative class (0)

    observed

    positive class (1)

    true positive

    (TP)

    false negative

    (FN)

    negative class (0)

    false positive

    (FP)

    true negative

    (TN)

    ))()()(( FNTNFPTNFNTPFPTP

    FNFPTNTPMCC

    [-1; 1]

    [0; 1]

    [0; 1]

    [0; 1]

    [0; 1]

    [0; 1]

  • Step 6. Measures of predictive ability of models

    Regression

    i

    2

    obspredi,

    i

    2

    obsi,predi,2

    )y(y

    )y(y

    1Q

    1N

    )y(y

    RMSE i

    2

    obsi,predi,

    N

    1i

    obsi,predi, yyN

    1MAE

    Determination coefficient

    Root mean squared error

    Mean absolute error

  • Step 7. Applicability domain (AD) Extrapolation to very distant objects is dangerous

    There is a need to define the domain where our model is reliable (models are not universal!)

    Only compounds which are similar to the training set compounds should be included in applicability domain of the model. One should estimate similarity of new compounds (test set, etc) to the training set compounds.

  • Step 7. Applicability domain (AD) measures

    Bounding box - based on descriptor range

    Desc_1

    Desc_2

    AD • internal regions are usually empty, especially if the number of descriptors is big

    • it doesn’t take into account descriptor correlation

    Distance from training set compounds in descriptor space

    Desc_1

    Desc_2 Distance from training set compounds in model space

    Model_1

    Model_2

    Requires several models (e.g. consensus model, bootstrap models)

    One-class SVM

    Conformal predictions

  • Step 7. Applicability domain (AD) example

    J. Chem. Inf. Model., Vol. 50, No. 12, 2010

  • Step 8. Interpretation of QSAR models

    1/C = 4.08π – 2.14π2 + 2.78σ + 3.38

    π = logPX – logPH

    σ - Hammet constant

    plant growth inhibition activity of phenoxyacetic acids

    electronic factors rate of penetration of membranes in the plant cell

    Hansch equation

    R is H or CH3;

    X is Br, Cl, NO2 and

    Y is NO2, NH2, NHC(=O)CH3

    Inhibition activity of compounds against Staphylococcus aureus

    Act = 75RH – 112RCH3 + 84XCl – 16XBr – 26XNO2 + 123YNH2 + 18YNHC(=O)CH3 – 218YNO2

    Free-Wilson models

  • Step 8. Interpretation of QSAR models

    347 agonists of 5-HT1A receptor

    Ar - substituted (hetero)aryls L - polymethylene chain R - various (poly)cyclic residues

    O O OMe Cl Cl CF3

    N

    N

    CH3

    F

    O2N

    Cl

    PLS 0.84 0.18 0.03 -0.04 -0.06 -0.09 -0.11 -0.66 -0.73 -0.94 -0.96 RF 0.27 0.24 0.04 0.07 -0.02 0.11 0.04 -0.04 -0.55 -0.66 -0.66

    Ar

    L -(CH2)6- -(CH2)5- -(CH2)4- -(CH2)3- -(CH2)2- -CH2-

    PLS 0.8 0.71 0.81 0.08 -0.04 0.06

    RF 0.14 0.19 0.14 -0.01 -0.03 0.05

  • Step 8. Interpretation of QSAR models

    δ)f(xδ)f(xC

    f(x)δ)f(xC

    δ)f(xf(x)C

    forward difference

    backward difference

    central difference

    Partial derivatives is used to calculate contributions of single features. It is applicable to any model built on meaningful (interpretable) descriptors. May be calculated analytically or numerically.

    Estimated contributions of separate substructural features can be mapped back on the structure to reveal contribution of fragments.

    Marcou, G.; Horvath, D.; Solov'ev, V.; Arrault, A.; Vayer, P.; Varnek, A. Interpretability of SAR/QSAR Models of any Complexity by Atomic Contributions. Molecular Informatics 2012, 31, 639-642

  • Step 8. Interpretation of QSAR models

    = –

    A B C

    Activitypred(A) Activitypred(B) Contribution(C)

    f(A) = x f(B) = y W(C) = x – y

    Polishchuk, P. G.; Kuz'min, V. E.; Artemenko, A. G.; Muratov, E. N. Universal Approach for Structural Interpretation of QSAR/QSPR Models. Molecular Informatics 2013, 32, 843-853

    Universal approach

  • Step 8. Interpretation of QSAR models

    Polishchuk P, Tinkov O, Khristova T, Ognichenko L, Kosinskaya A, Varnek A, Kuz’min V. Journal of Chemical Information and Modeling 2016, 56, 1455-1469.

    Acute oral toxicity on rats

    ?

    ?

  • Overall QSAR workflow

    Input data

    Bioassays

    Databases

    Preprocessing Feature

    engineering Model

    learning Model

    validation

    Classification

    Regression

    Clustering

    Cross-validation

    Bootstrap

    Test set

    Applicability Domain

    Feature selection

    Feature combination

    Data normalization

    Feature extraction

    Interpretation

    j

    j

    ii

    z

    xxx '