Molecular Similarity Searching Using Inference Network


  • Molecular Similarity Searching Using Inference Network

    Ammar Abdo, Naomie Salim* Faculty of Computer Science & Information Systems

    Universiti Teknologi Malaysia

  • Molecular Similarity Searching

    •  Search for chemical compounds with similar structure or properties to a known compound
    •  A variety of methods is used in these searches
       –  Graph theory
       –  1D, 2D and 3D shape similarity, docking similarity, electrostatic similarity and others
       –  Machine learning methods, e.g. BKD, SVM, NBC, NN
    •  The vector space model using 2D fingerprints and the Tanimoto coefficient is one of the most widely used molecular similarity measures
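
As a concrete point of reference, here is a minimal sketch of the continuous (non-binary) Tanimoto coefficient over count-based fingerprints, the baseline measure used later in the experiments; the toy vectors are assumptions for illustration only.

```python
def tanimoto(x, y):
    """Continuous (non-binary) Tanimoto coefficient between two
    count-based fingerprints x and y of equal length."""
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    return dot / denom if denom else 0.0

# Example with two toy count fingerprints (assumed values):
print(tanimoto([1, 0, 2, 3], [1, 1, 2, 0]))  # 0.333...
```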

  • Rationale for Chemical Similarity

    •  Similar property principle ― structurally similar molecules are likely to have similar properties
    •  Given a known active molecule, a similarity search can identify further molecules in the database for testing

  • Probabilistic models (Alternative approach)

    •  Why probabilistic models?
       –  Information retrieval deals with uncertain information
          •  Query and compound characterizations are incomplete
       –  Probability theory seems to be the most natural way to quantify uncertainty
       –  Already applied in IR for text documents

  • Why Bayesian Networks

    –  Bayesian networks are the most popular way of doing probabilistic inference in AI
    –  Clear formalism to combine evidence
    –  Modularize the world (dependencies)
    –  Bayesian network models for IR:
       •  Inference Network (Turtle & Croft, 1991)
       •  Belief Network (Ribeiro-Neto & Muntz, 1996)
    –  Simple

  • Bayesian inference

    •  Bayes’ rule: the heart of Bayesian techniques

       P(H|E) = P(E|H)P(H) / P(E)

       where H is a hypothesis and E is evidence:
       –  P(H) : prior probability
       –  P(H|E) : posterior probability
       –  P(E|H) : probability of E if H is true
       –  P(E) : a normalizing constant, so we can write P(H|E) ∝ P(E|H)P(H)
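
A small worked example of this rule, expanding the normalizing constant P(E) over H and ¬H; the numbers are assumed for illustration.

```python
def posterior(prior_h, p_e_given_h, p_e_given_not_h):
    """P(H|E) via Bayes' rule, with P(E) expanded over H and not-H."""
    p_e = p_e_given_h * prior_h + p_e_given_not_h * (1 - prior_h)
    return p_e_given_h * prior_h / p_e

# Toy numbers (assumed): P(H)=0.01, P(E|H)=0.9, P(E|not-H)=0.05
print(posterior(0.01, 0.9, 0.05))  # ~0.154: evidence raises a 1% prior to ~15%
```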

  • Bayesian Networks

    •  What is a Bayesian network?
       –  A directed acyclic graph (DAG) in which nodes represent random variables
       –  The parents of a node are those judged to be direct causes for it
       –  The roots of the network are the nodes without parents
       –  The arcs represent causal relationships between these variables, and the strengths of these causal influences are expressed by conditional probabilities

    [Figure: x1 … xn are parent nodes (here, root nodes) forming the set X of parents of the child node y; each xi causes y. The influence of X on y can be quantified by any function (conditional probabilities) F(y,X) = P(y|X).]

  • Bayesian networks

    [Figure: two root nodes a and b, with priors p(a) and p(b), are parents of node c; the conditional dependence is given by p(c|a,b) for all values of a, b, c.]

    •  Running Bayesian nets:
       –  Given probability distributions for the roots and conditional probabilities of the nodes, we can compute the a priori probability of any instance
       –  Changes in parents (e.g., b was observed) cause recomputation of probabilities
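
A minimal sketch of exactly this a/b/c network, computed by enumeration; all prior and conditional values are assumed toy numbers.

```python
from itertools import product

# Priors for root nodes a and b, and the CPT p(c=True|a,b); toy values
p_a = {True: 0.3, False: 0.7}
p_b = {True: 0.6, False: 0.4}
p_c = {(True, True): 0.9, (True, False): 0.5,
       (False, True): 0.4, (False, False): 0.1}

def prob_c(evidence_b=None):
    """Marginal P(c=True), optionally with root b observed as evidence."""
    total = 0.0
    for a, b in product([True, False], repeat=2):
        if evidence_b is not None and b != evidence_b:
            continue  # state inconsistent with the observation of b
        # when b is observed, its prior p(b) drops out (weight 1.0)
        w = p_a[a] * (1.0 if evidence_b is not None else p_b[b])
        total += w * p_c[(a, b)]
    return total

print(prob_c())                 # a priori P(c)        -> 0.418
print(prob_c(evidence_b=True))  # recomputed P(c|b=T)  -> 0.55
```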

  • Bayesian networks approach to molecular similarity searching

    •  How to describe and compare molecules
       –  Network model generation
          •  Description of the system in a suitable network form
       –  Representation of the importance of descriptors (weighting schemes)
       –  Probability estimation for the network model
       –  Calculation of the similarity scores

  • Bayesian inference network

    •  Nodes
       –  compounds (cj)
       –  features (fi)
       –  queries (q1, q2, …, qr)
       –  target (A)
    •  Edges
       –  from cj to its feature nodes fi, indicating that the observation of cj increases the belief in the variables fi

  • Definitions

    •  fi, cj, and qk are random variables
    •  F = (f1, f2, …, fn) is an n-dimensional vector (n equal to the fingerprint length)
    •  fi ∈ {0, 1} for all i, so F has 2^n possible states
    •  cj ∈ {0, 1} for all j; q ∈ {0, 1}
    •  The rank of a compound cj is computed as P(q = true | cj = true)
    •  (cj stands for the state where cj = true and ci = false for all i ≠ j, because we observe one compound at a time)

  • Construct compound network (once)

    •  Directed acyclic graph (DAG) with:
       –  compound nodes as roots, containing the prior probability of observing each compound
       –  feature nodes as leaves, containing the probability associated with the node given its set of parent compounds
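
A minimal sketch of the bookkeeping this one-time step implies: a compound-to-feature incidence map (the ffij counts) plus the inverted feature index needed later for inverse compound frequencies. All compound and feature names are hypothetical.

```python
from collections import defaultdict

# Hypothetical compound -> {feature: frequency} map (the ff_ij values)
compounds = {
    "c1": {"f1": 2, "f3": 1},
    "c2": {"f2": 1, "f3": 4},
    "c3": {"f1": 1, "f2": 2, "f3": 1},
}

# Invert once: feature -> set of compounds containing it
feature_index = defaultdict(set)
for cj, feats in compounds.items():
    for fi in feats:
        feature_index[fi].add(cj)

# Compound frequency cf_i, the input to icf-style weights
cf = {fi: len(cjs) for fi, cjs in feature_index.items()}
print(cf)  # {'f1': 2, 'f3': 3, 'f2': 2}
```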

  • Construct query network (for each query)

    •  Inverted DAG with a single leaf for the target molecule and multiple roots that correspond to the features expressing the query
    •  A set of intermediate query nodes may also be used when multiple queries are used to express the target
    •  Attach it to the compound network

  • Similarity calculation

    –  Find the probability that the target molecule (A) is satisfied given that compound cj has been observed
       •  Instantiate each cj, which corresponds to attaching evidence to the network by stating that cj is true and the rest of the compounds are false
    –  Find the subset of cj’s that maximizes the probability value of node A (the best subset)
    –  Retrieve these cj’s as the answer to the query
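
A minimal sketch of this retrieval loop. `bel_A` stands in for the full network evaluation of P(A=true | cj=true); the toy scores are assumptions.

```python
def rank_compounds(compound_ids, bel_A):
    """Instantiate each compound in turn (c_j=true, all others false),
    score it with the target belief bel(A), and rank by that score."""
    scores = {cj: bel_A(cj) for cj in compound_ids}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Usage with dummy beliefs (placeholders for the real network output):
toy_beliefs = {"c1": 0.42, "c2": 0.77, "c3": 0.31}
ranking = rank_compounds(toy_beliefs, bel_A=toy_beliefs.get)
print(ranking)  # [('c2', 0.77), ('c1', 0.42), ('c3', 0.31)]
```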

  • Bayesian inference network

    •  The retrieval of an active compound relative to a given target structure is obtained by means of an inference process through a network of dependences
    •  To achieve the inference process
       –  we need to estimate the strength of the relationships represented by the network
       –  this involves estimating and encoding a set of conditional probability distributions
    •  The inference network described here comprises four different layers of nodes (four different random variables); the first layer comprises the compound nodes (roots)
    •  The probability associated with these nodes is defined as:

       P(cj) = 1/(collection size) ⇒ prior probability

  • Bayesian inference network

    •  The second layer comprises the feature nodes, so we need to compute P(fi)
    •  P(fi|cj) is computed as follows, since the dependency is based on the first layer (parent nodes)
    •  A weighting function is used to estimate the probability P(fi|cj)
    •  where α is a constant (experiments using the inference network (Turtle, 1991) show that the best value for α is 0.4), ffij is the frequency of the ith feature within the jth compound, icfi is the inverse compound frequency of the ith feature in the collection, clj is the size of the jth compound, total_cl is the total length of compounds in the collection, and m is the total number of compounds in the collection (this equation has been adapted from the Okapi retrieval system (Robertson et al., 1995))
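
The weighting equation itself was an image and did not survive extraction. Based on the variables listed above and the Okapi/INQUERY formulations the slide cites, a plausible reconstruction looks like the following; the exact functional form is an assumption, not the authors' published equation.

```python
import math

def feature_belief(ff_ij, cf_i, cl_j, total_cl, m, alpha=0.4):
    """Plausible reconstruction of P(f_i|c_j): an INQUERY/Okapi-style
    belief combining a length-normalized feature frequency with a
    normalized inverse compound frequency. Form is an assumption."""
    avg_cl = total_cl / m                                  # average compound size
    tf = ff_ij / (ff_ij + 0.5 + 1.5 * cl_j / avg_cl)       # Okapi-style TF part
    icf = math.log((m + 0.5) / cf_i) / math.log(m + 1)     # normalized ICF part
    return alpha + (1 - alpha) * tf * icf

# e.g. a feature occurring twice in an average-sized compound,
# present in 100 of 40751 compounds (toy inputs):
print(feature_belief(ff_ij=2, cf_i=100, cl_j=50, total_cl=50 * 40751, m=40751))
```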

  • Bayesian inference network

    •  The third layer comprises only the query nodes, p(qk)
    •  where cjk is the set of features in common between the jth compound and the kth query, clj is the size of the jth compound, ffij is the frequency of the ith feature within the jth compound, nffik is the normalized frequency of the ith feature within the kth query, nicfi is the normalized inverse compound frequency of the ith feature in the collection, and pi is the estimated probability at the ith feature node
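
The query-node equation is likewise missing from the transcript. A hypothetical sketch consistent with the variable list: each shared feature contributes its feature-node belief, scaled by the normalized query frequency and normalized icf, with the sum normalized by compound size. The exact published form is an assumption.

```python
def query_belief(common_features, cl_j, nff, nicf, p):
    """Hypothetical sketch of p(q_k|c_j): features i in c_jk (shared by
    compound j and query k) contribute p[i] * nff[i] * nicf[i],
    normalized by the compound size cl_j. Form is an assumption."""
    return sum(p[i] * nff[i] * nicf[i] for i in common_features) / cl_j

# Toy values for two shared features (all assumed):
p    = {"f1": 0.55, "f3": 0.62}
nff  = {"f1": 1.0, "f3": 0.5}
nicf = {"f1": 0.8, "f3": 0.6}
print(query_belief(["f1", "f3"], cl_j=4, nff=nff, nicf=nicf, p=p))
```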

  • Bayesian inference network

    •  The last layer comprises only the activity-need node (target), or bel(A) in the case where more than one query is used
    •  Two combination rules are used: weighted MAX and weighted SUM
    •  where cjk is the set of features in common between the jth compound and the kth query, qlk is the size of the kth query, pjk is the estimated probability that the kth query is met by the jth compound, and r is the number of queries
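
A sketch of the two combination operators over the r query beliefs pjk. The weighted-SUM form follows the standard INQUERY #wsum operator; the weighted-MAX normalization shown is an assumption, and the weights and beliefs are toy values.

```python
def bel_weighted_sum(p, w):
    """Weighted-SUM combination of the r query beliefs p_jk."""
    return sum(wk * pk for wk, pk in zip(w, p)) / sum(w)

def bel_weighted_max(p, w):
    """Weighted-MAX combination: the best single query dominates.
    Normalization by max(w) is an assumption."""
    return max(wk * pk for wk, pk in zip(w, p)) / max(w)

# r = 3 reference queries with equal weights (assumed):
p = [0.62, 0.48, 0.71]
w = [1.0, 1.0, 1.0]
print(bel_weighted_sum(p, w), bel_weighted_max(p, w))  # 0.603... 0.71
```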

  • Experimental details

    •  Subset of the MDDR with 40751 molecules
       –  12 activity classes; in all, 6804 actives in the 12 classes
       –  10 sets of 10 randomly chosen compounds from each activity class (to form a set of queries)
       –  For comparison purposes, similarity calculation is also done using the non-binary Tanimoto coefficient
       –  Six different types of weighted fingerprints from Scitegic:
          •  atom type extended-connectivity counts (ECFC)
          •  functional class extended-connectivity counts (FCFC)
          •  atom type atom environment counts (EEFC)
          •  functional class atom environment counts (FEFC)
          •  atom type hashed atom environment counts (EHFC)
          •  functional class hashed atom environment counts (FHFC)

  • MDDR data

    | Code | Activity class                    | Actives | Unique AFᵃ | Unique MFᵇ | Av. no. mols./AF | Av. no. mols./MF | Diversity mean | Diversity SD |
    |------|-----------------------------------|---------|------------|------------|------------------|------------------|----------------|--------------|
    | 5H3  | 5HT3 antagonists                  | 213     | 133        | 87         | 1.60             | 2.45             | 0.8537         | 0.008        |
    | 5HA  | 5HT1A agonists                    | 116     | 67         | 54         | 1.73             | 2.15             | 0.8496         | 0.007        |
    | D2A  | D2 antagonists                    | 143     | 109        | 75         | 1.31             | 1.91             | 0.8526         | 0.005        |
    | Ren  | Renin inhibitors                  | 993     | 542        | 328        | 1.83             | 3.03             | 0.7188         | 0.002        |
    | Ang  | Angiotensin II AT1 antagonists    | 1367    | 698        | 396        | 1.96             | 3.45             | 0.7762         | 0.002        |
    | Thr  | Thrombin inhibitors               | 885     | 528        | 335        | 1.68             | 2.64             | 0.8283         | 0.002        |
    | SPA  | Substance P antagonists           | 264     | 119        | 78         | 2.22             | 3.38             | 0.8284         | 0.006        |
    | HIV  | HIV-1 protease inhibitors         | 715     | 455        | 330        | 1.57             | 2.17             | 0.8048         | 0.004        |
    | Cyc  | Cyclooxygenase inhibitors         | 162     | 83         | 44         | 1.95             | 3.68             | 0.8717         | 0.006        |
    | Kin  | Tyrosine protein kinase inhibitors| 453     | 247        | 162        | 1.83             | 2.80             | 0.8699         | 0.006        |
    | PAF  | PAF antagonists                   | 716     | 381        | 252        | 1.88             | 2.84             | 0.8669         | 0.004        |
    | HMG  | HMG-CoA reductase inhibitors      | 777     | 337        | 168        | 2.31             | 4.63             | 0.8230         | 0.002        |

    ᵃ Unique AF is the number of unique atomic frameworks present in the class.
    ᵇ Unique MF is the number of unique molecular frameworks present in the class.

  • Use of a single reference structure

    [Figure: results for the highest diversity class]

  • Use of a single reference structure

    [Figure: comparison of the average percentage of unique atomic frameworks obtained in the top 5% of the ranked test set using BIN and Tan with EHFC_4; highest diversity class]

  • Use of multiple reference structures

    [Figure: comparison between BIN and Tan using the MAX rule and ECFC_4]

  • Use of multiple reference structures

    [Figure: comparison of the average percentage of atomic frameworks retrieved in the top 5% of the ranked test set using BIN-MAX and Tan-MAX with ECFC_4]

  • BIN with multiple molecular descriptors

    •  So far we have considered using just a single molecular descriptor and multiple reference structures as the basis for a search
    •  Further work
       –  to search with multiple molecular descriptors (ECFC4, EHFC4, FHFC4, FPFC4, PHPFC3) with single and multiple reference structures

  • Use of a single molecular descriptor and a single reference structure

    [Figure: network with compound nodes c1, c2, …, cm as roots; one layer of feature nodes f1 … fn per descriptor D1, D2, …, Ds; query nodes q1 … qr per descriptor; weighted-max link matrices wmax1, wmax2, …, wmaxs combining the query nodes of each descriptor; and a weighted-sum (wsum) link into the single target node A.]

  • Use of multiple molecular descriptors and a single reference structure

    [Figure: comparison between multiple descriptors and a single descriptor with a single reference structure using BIN]

  • Use of multiple molecular descriptors and multiple reference structures

    [Figure: comparison between multiple descriptors and a single descriptor with multiple reference structures using BIN]

  • Summary I

    •  The BIN method with a single active reference structure outperforms the Tanimoto similarity method in 11 classes (by between 6% and 71%)
       –  19% overall improvement
       –  only in one activity class (cyclooxygenase inhibitors) is BIN slightly inferior to Tan (−5%)
    •  BIN with multiple reference structures is superior to Tan in all activity classes (by between 5% and 118%), significantly outperforming Tan
       –  a 35% improvement in the overall average recall rate

  • Summary II

    •  BIN with multiple descriptors and a single reference structure slightly outperforms BIN with a single descriptor and a single reference structure
    •  BIN with multiple descriptors and multiple reference structures slightly outperforms BIN with a single descriptor and multiple reference structures
    •  BIN with multiple descriptors enhances performance by a high percentage when the sought actives are structurally heterogeneous
       –  but it only slightly enhances performance when the sought actives are structurally homogeneous

  • Summary III

    •  There is some evidence to suggest that BIN is more effective at scaffold hopping for the more diverse data sets
    •  The networks do not impose additional costs because they do not include cycles
    •  The major strength is the network’s combination of distinct evidential sources to support the ranking of a given compound
    •  BIN provides the ability to integrate several descriptors and several references into a single framework

  • Thank you