Molecular Similarity Searching Using Inference Network · 2013-04-09 · Molecular Similarity...
Molecular Similarity Searching Using Inference Network
Ammar Abdo, Naomie Salim* Faculty of Computer Science & Information Systems
Universiti Teknologi Malaysia
Molecular Similarity Searching
• Search for chemical compounds with similar structure or properties to a known compound
• A variety of methods are used in these searches
  – Graph theory
  – 1D, 2D and 3D shape similarity, docking similarity, electrostatic similarity and others
  – Machine learning methods, e.g. BKD, SVM, NBC, NN
• The vector space model using 2D fingerprints and Tanimoto coefficients is one of the most widely used molecular similarity measures
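A minimal sketch of the binary Tanimoto coefficient, assuming each 2D fingerprint is held as a Python set of on-bit positions. The function name and the example bit sets are illustrative and do not come from the slides.

```python
def tanimoto(fp_a, fp_b):
    """Binary Tanimoto coefficient |A & B| / |A | B| for sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        # Convention chosen for this sketch: two empty fingerprints score 0.
        return 0.0
    return len(a & b) / union


# Illustrative fingerprints: 3 shared bits out of 6 distinct bits.
query = {1, 4, 7, 9}
candidate = {1, 4, 8, 9, 12}
score = tanimoto(query, candidate)
print(score)
```

A similarity search then amounts to scoring every database compound against the reference structure and sorting by score.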
Rationale for Chemical Similarity
• Similar property principle: structurally similar molecules are likely to have similar properties
• Given a known active molecule, a similarity search can identify further molecules in the database for testing
Probabilistic models (Alternative approach)
• Why probabilistic models
  – Information Retrieval deals with uncertain information
    • Query and compound characterizations are incomplete
  – Probability theory seems to be the most natural way to quantify uncertainty
  – Applied in IR for text documents
Why Bayesian Networks
– Bayesian networks are the most popular way of doing probabilistic inference in AI
– Clear formalism to combine evidence
– Modularize the world (dependencies)
– Bayesian network models for IR
  • Inference Network (Turtle & Croft, 1991)
  • Belief Network (Ribeiro-Neto & Muntz, 1996)
– Simple
Bayesian inference
• Bayes’ rule: the heart of Bayesian techniques
  P(H|E) = P(E|H)P(H) / P(E)
  where H is a hypothesis and E is evidence
  – P(H): prior probability
  – P(H|E): posterior probability
  – P(E|H): probability of E if H is true
  – P(E): a normalizing constant, so we can write P(H|E) ∝ P(E|H)P(H)
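A small worked example of Bayes' rule, with P(E) expanded by the law of total probability. The numbers are made up for illustration, and the function and parameter names are assumptions rather than anything defined in the slides.

```python
def posterior(prior_h, p_e_given_h, p_e_given_not_h):
    """Return P(H|E) = P(E|H) P(H) / P(E).

    P(E) is expanded as P(E|H) P(H) + P(E|not H) P(not H).
    """
    p_e = p_e_given_h * prior_h + p_e_given_not_h * (1.0 - prior_h)
    if p_e == 0.0:
        raise ValueError("P(E) is zero; the posterior is undefined")
    return p_e_given_h * prior_h / p_e


# Illustrative values: a rare hypothesis (1%) and fairly reliable evidence.
p = posterior(prior_h=0.01, p_e_given_h=0.9, p_e_given_not_h=0.05)
print(round(p, 4))
```

Even with strong evidence, the small prior keeps the posterior well below 1. This is why the normalizing constant matters when absolute probabilities are needed, although it can be dropped when only a ranking is required.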
Bayesian Networks
• What is a Bayesian network?
  – It is a directed acyclic graph (DAG) in which nodes represent random variables.
  – The parents of a node are those judged to be direct causes of it.
  – The roots of the network are the nodes without parents.
  – The arcs represent causal relationships between these variables, and the strengths of these causal influences are expressed by conditional probabilities.
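• x1 … xn: parent nodes; X is the set of parents of y (in this case, root nodes)
• y: child node; each xi causes y
• The influence of X on y can be quantified by any function (conditional probabilities): F(y, X) = P(y|X)
[Figure: parent nodes x1, x2, …, xn with arcs to the child node y]
[Figure: root nodes a and b with priors p(a) and p(b), and child node c with p(c|a,b) for all values of a, b, c]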
Conditional dependence
• Running Bayesian nets:
• Given probability distributions for the roots and conditional probabilities for the other nodes, we can compute the a priori probability of any instance
• Changes in parents (e.g., b was observed) will cause recomputation of probabilities
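The three-node network (roots a and b, child c) can be sketched by brute-force enumeration. All probability values below are invented for illustration, and the helper names are hypothetical.

```python
from itertools import product

# Illustrative priors for the roots and P(c=True | a, b) for the child.
p_a = {True: 0.3, False: 0.7}
p_b = {True: 0.6, False: 0.4}
p_c_true = {
    (True, True): 0.9,
    (True, False): 0.7,
    (False, True): 0.5,
    (False, False): 0.1,
}


def joint(a, b, c):
    """P(a, b, c) = P(a) P(b) P(c | a, b) for this DAG."""
    pc = p_c_true[(a, b)]
    return p_a[a] * p_b[b] * (pc if c else 1.0 - pc)


def prob_c_true(b_observed=None):
    """P(c=True), or P(c=True | b=b_observed) once b has been observed."""
    b_values = [True, False] if b_observed is None else [b_observed]
    states = list(product([True, False], b_values))
    numerator = sum(joint(a, b, True) for a, b in states)
    evidence = sum(joint(a, b, c) for a, b in states for c in (True, False))
    return numerator / evidence


print(round(prob_c_true(), 3))      # a priori belief in c
print(round(prob_c_true(True), 3))  # belief in c after observing b = True
```

Observing b changes the belief in c without touching the stored tables: only the set of states that is summed over changes.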
Bayesian networks
• How to describe and compare molecules
  – Network model generation
    • Description of the system in a suitable network form
  – Representation of the importance of descriptors (weighting schemes)
  – Probability estimation for the network model
  – Calculation of the similarity scores
Bayesian networks approach to molecular similarity searching
Bayesian inference network
• Nodes
  – compounds (cj)
  – features (fi)
  – queries (q1, q2, …, qr)
  – target (A)
• Edges
  – edges from cj to its feature nodes fi indicate that the observation of cj increases the belief in the variables fi
Definitions
• fi, cj, and qk are random variables.
• F = (f1, f2, ..., fn) is an n-dimensional vector (n equals the fingerprint length)
• fi ∈ {0, 1} for all i, so F has 2^n possible states
• cj ∈ {0, 1} for all j; q ∈ {0, 1} for all q
• The rank of a compound cj is computed as P(q = true | cj = true)
• (cj stands for a state where cj = true and ci = false for all i ≠ j, because we observe one compound at a time)
Directed acyclic graph (DAG) of
• compound nodes as roots, each containing the prior probability of observing that compound
• feature nodes as leaves, each containing the probability associated with the node given its set of parent compounds
Construct Compound Network (once)
• Inverted DAG with a single leaf for the target molecule and multiple roots that correspond to the features that express the query.
• A set of intermediate query nodes may also be used when multiple queries are used to express the target.
• Attach it to the compound network
Construct Query Network for each query
– Find the probability that the target molecule (A) is satisfied given that compound cj has been observed
  • Instantiate each cj in turn, which corresponds to attaching evidence to the network by stating that cj is true and all other compounds are false
– Find the subset of cj’s that maximizes the probability value of node A (best subset).
– Retrieve these cj’s as the answer to the query.
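The retrieval loop above can be sketched as follows. This is a simplification that approximates the "best subset" by taking the top-scoring compounds, and it assumes some callable `belief_given_compound` that returns the belief in A when only that compound is instantiated as true. Both the function name and the toy belief values are hypothetical.

```python
def rank_compounds(compound_ids, belief_given_compound, top_k=None):
    """Instantiate one compound at a time and rank by the belief in target A.

    belief_given_compound(cid) is assumed to return P(A | cid = true, all
    other compounds = false); how it is computed is outside this sketch.
    """
    scored = [(cid, belief_given_compound(cid)) for cid in compound_ids]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored if top_k is None else scored[:top_k]


# Toy beliefs standing in for a real network evaluation.
toy_beliefs = {"c1": 0.42, "c2": 0.77, "c3": 0.55}
ranking = rank_compounds(toy_beliefs, toy_beliefs.get, top_k=2)
print(ranking)
```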
Similarity calculation
Bayesian inference network
• The retrieval of an active compound compared to a given target structure is obtained by means of an inference process through a network of dependences.
• To achieve the inference process
  – We need to estimate the strength of the relationships represented by the network
  – This involves estimating and encoding a set of conditional probability distributions
• The inference network we have described comprises four different layers of nodes (four different random variables); the first layer comprises the compound nodes (roots)
• The probability associated with these nodes is defined as:
  P(cj) = 1/(collection size) ⇒ prior probability
• The second layer comprises the feature nodes, so we need to compute P(fi).
• P(fi|cj) is computed as follows, since the dependency is based on the first layer (parent nodes).
• A weighting function is used to estimate the probability P(fi|cj)
• where α is a constant (experiments using the inference network (Turtle, 1991) show that the best value for α is 0.4), ffij is the frequency of the ith feature within the jth compound, icfi is the inverse compound frequency of the ith feature in the collection, clj is the size of the jth compound, total_cl is the total length of the compounds in the collection, and m is the total number of compounds in the collection (this equation has been adapted from the Okapi retrieval system (Robertson et al., 1995))
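The weighting equation itself is not reproduced in this transcript. The sketch below is therefore an assumption: it uses the InQuery-style belief function commonly paired with Turtle's inference network, with an Okapi-style frequency component. The constants 0.5 and 1.5, the exact form of the icf term, and the extra symbol `cf_i` (the number of compounds containing feature i) are not taken from the slides.

```python
import math


def feature_belief(ff_ij, cf_i, cl_j, total_cl, m, alpha=0.4):
    """Assumed estimate of P(f_i | c_j): alpha + (1 - alpha) * nff * nicf.

    ff_ij    frequency of feature i in compound j
    cf_i     number of compounds containing feature i (assumed symbol)
    cl_j     size of compound j
    total_cl total length of all compounds in the collection
    m        number of compounds in the collection
    """
    if m < 1 or cf_i < 1 or total_cl <= 0:
        raise ValueError("m, cf_i and total_cl must be positive")
    avg_cl = total_cl / m
    # Okapi-style frequency component, damped for long compounds (assumed form).
    nff = ff_ij / (ff_ij + 0.5 + 1.5 * cl_j / avg_cl)
    # Normalized inverse compound frequency (assumed form).
    nicf = math.log((m + 0.5) / cf_i) / math.log(m + 1.0)
    return alpha + (1.0 - alpha) * nff * nicf


# Illustrative call: feature seen twice in a compound of average size.
b = feature_belief(ff_ij=2, cf_i=10, cl_j=50, total_cl=5000, m=100)
print(round(b, 4))
```

Under this assumed form, an absent feature (ff_ij = 0) yields the default belief α, and the belief rises towards 1 for frequent features that are rare in the collection.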
Bayesian inference network
Bayesian inference network
• The third layer comprises only the query nodes, P(qk)
• where cjk is the set of features in common between the jth compound and the kth query, clj is the size of the jth compound, nffik is the normalized frequency of the ith feature within the kth query, nicfi is the normalized inverse compound frequency of the ith feature in the collection, pi is the estimated probability at the ith feature node, and ffij is the frequency of the ith feature within the jth compound.
Bayesian inference network
• The last layer comprises only the activity-need node (target), or bel(A) in the case where more than one query is used.
• Two forms are given: weighted MAX and weighted SUM
• where cjk is the set of features in common between the jth compound and the kth query, qlk is the size of the kth query, pjk is the estimated probability that the kth query is met by the jth compound, and r is the number of queries.
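The exact weighted MAX and weighted SUM expressions are not reproduced in this transcript. As a rough illustration only, the sketch below shows the generic weighted-sum and max belief operators from the inference network literature, applied to the per-query beliefs pjk of one compound. The weights are hypothetical and may differ from those used on the slide.

```python
def bel_wsum(beliefs, weights):
    """Weighted-sum operator: sum(w_k * p_k) / sum(w_k)."""
    if not beliefs or len(beliefs) != len(weights):
        raise ValueError("beliefs and weights must be non-empty and equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * p for w, p in zip(weights, beliefs)) / total_weight


def bel_max(beliefs):
    """Max operator: the strongest single-query belief wins."""
    if not beliefs:
        raise ValueError("beliefs must be non-empty")
    return max(beliefs)


# Beliefs that r = 3 queries are met by one compound (illustrative values).
p_jk = [0.6, 0.8, 0.5]
w_k = [1.0, 2.0, 1.0]
print(bel_wsum(p_jk, w_k), bel_max(p_jk))
```

The sum rewards compounds that resemble many of the reference structures, while the max rewards a compound that closely matches any single one.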
• Subset of MDDR with 40751 molecules
  – 12 activity classes
• In all, 6804 actives in the 12 classes
  – 10 sets of 10 randomly chosen compounds from each activity class (to form a set of queries)
  – For comparison purposes, similarity calculation is also done using the non-binary Tanimoto coefficient
  – Six different types of weighted fingerprints from Scitegic
    • atom type extended-connectivity counts (ECFC)
    • functional class extended-connectivity counts (FCFC)
    • atom type atom environment counts (EEFC)
    • functional class atom environment counts (FEFC)
    • atom type hashed atom environment counts (EHFC)
    • functional class hashed atom environment counts (FHFC)
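For count-based (weighted) fingerprints such as these, a non-binary Tanimoto coefficient can be sketched as below. It assumes the usual continuous form, dot(a, b) / (|a|² + |b|² − dot(a, b)), and a dict-of-counts representation. The feature keys and counts are illustrative.

```python
def tanimoto_counts(fp_a, fp_b):
    """Non-binary Tanimoto for fingerprints stored as {feature: count}."""
    dot = sum(count * fp_b.get(feature, 0) for feature, count in fp_a.items())
    norm_a = sum(count * count for count in fp_a.values())
    norm_b = sum(count * count for count in fp_b.values())
    denominator = norm_a + norm_b - dot
    if denominator == 0:
        # Convention chosen for this sketch: two empty fingerprints score 0.
        return 0.0
    return dot / denominator


# Illustrative count fingerprints sharing a single feature.
fp_query = {"f1": 2, "f2": 1}
fp_candidate = {"f1": 1, "f3": 3}
sim = tanimoto_counts(fp_query, fp_candidate)
print(round(sim, 4))
```

With 0/1 counts this reduces to the binary Tanimoto coefficient on the sets of present features.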
Experimental details
                                                  no. of unique av. no. mols. diversity
Code Activity class                      Actives  AF(a)  MF(b)  AF     MF     mean    SD
5H3  5HT3 antagonists                    213      133    87     1.60   2.45   0.8537  0.008
5HA  5HT1A agonists                      116      67     54     1.73   2.15   0.8496  0.007
D2A  D2 antagonists                      143      109    75     1.31   1.91   0.8526  0.005
Ren  Renin inhibitors                    993      542    328    1.83   3.03   0.7188  0.002
Ang  Angiotensin II AT1 antagonists      1367     698    396    1.96   3.45   0.7762  0.002
Thr  Thrombin inhibitors                 885      528    335    1.68   2.64   0.8283  0.002
SPA  Substance P antagonists             264      119    78     2.22   3.38   0.8284  0.006
HIV  HIV-1 protease inhibitors           715      455    330    1.57   2.17   0.8048  0.004
Cyc  Cyclooxygenase inhibitors           162      83     44     1.95   3.68   0.8717  0.006
Kin  Tyrosine protein kinase inhibitors  453      247    162    1.83   2.80   0.8699  0.006
PAF  PAF antagonists                     716      381    252    1.88   2.84   0.8669  0.004
HMG  HMG-CoA reductase inhibitors        777      337    168    2.31   4.63   0.8230  0.002
(a) Unique AF is the number of unique atomic frameworks present in the class.
(b) Unique MF is the number of unique molecular frameworks present in the class.
MDDR Data
Use of a single reference structure
Highest diverse class
Use of a single reference structure
Comparison of the average percentage of unique atomic frameworks obtained in the top 5% of the ranked test set using BIN & Tan with EHFC_4
Highest diverse class
Use of multiple reference structures
Comparison between BIN & Tan using MAX rule and ECFC_4
Use of multiple reference structures
Comparison of the average percentage of atomic frameworks retrieved in the top 5% of the ranked test set using BIN-MAX & Tan-MAX with ECFC_4
• So far we have considered using just a single molecular descriptor and multiple reference structures as the basis for a search
• Further work
  – to search with multiple molecular descriptors (ECFC4, EHFC4, FHFC4, FPFC4, PHPFC3) with single and multiple reference structures
BIN with multiple molecular descriptors
Use of a single molecular descriptor and a single reference structure
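[Figure: inference network with multiple descriptors D1, D2, …, Ds. Layers shown: compound nodes (c1, c2, …, cm), feature nodes (f1 … fn for each descriptor), query nodes (q1 … qr for each descriptor), and the target node (A). Link labels: wmax1, wmax2, …, wmaxs (weighted-max link matrices) and wsum.]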
Use of multiple molecular descriptors and a single reference structure
Comparison between multiple descriptors and single descriptor with single reference structure using BIN
Comparison between multiple descriptors and single descriptor with multiple reference structures using BIN
Use of multiple molecular descriptors and multiple reference structures
• The BIN method with a single active reference structure outperforms the Tanimoto similarity method in 11 classes (by between 6% and 71%)
  – 19% overall improvement
  – only in one activity class (cyclooxygenase inhibitors) is BIN slightly inferior to Tan (-5%)
• BIN with multiple reference structures significantly outperforms Tan in all activity classes (by between 5% and 118%)
  – a 35% improvement in the overall average recall rate
Summary I
• BIN with multiple descriptors and a single reference structure slightly outperforms BIN with a single descriptor and a single reference
• BIN with multiple descriptors and multiple reference structures slightly outperforms BIN with a single descriptor and multiple references
• BIN with multiple descriptors enhances performance by a high percentage when the sought actives are structurally heterogeneous
  – but it only slightly enhances performance when the sought actives are structurally homogeneous
Summary II
• There is some evidence to suggest that BIN is more effective at scaffold hopping for the more diverse data sets.
• The networks do not impose additional costs because they do not include cycles.
• The major strength is the combination of distinct evidential sources to support the rank of a given compound.
• BIN provides the ability to integrate several descriptors and several references into a single framework.
Summary III
Thank you