THE LEARNING AND USE OF GRAPHICAL MODELS FOR
IMAGE INTERPRETATION
Thesis for the degree of
Master of Science
By
Leonid Karlinsky
Advisor
Professor Shimon Ullman
October 2004
Submitted to the Scientific Council of the
Weizmann Institute of Science
Rehovot, Israel
Acknowledgments
First of all, I would like to thank my advisor, Prof. Shimon Ullman, without
whom this work would never have seen the light of day. I would also like to
thank my family: my mother and father, who taught me everything I know; my wife
Anna-Odelia, for her never-ending love and support; my grandparents; my
sister Irena; and last, but not least, my mother-in-law. Finally, I would like to
thank my friends, without whom this work's presentation would have been
much worse.
Abstract
This work deals with the construction, training and use of graphical
models for the purpose of image interpretation. The work has two main
contributions. The first is the construction of maximally informative
hierarchical models. We develop methods for both constructing
hierarchical models and learning their optimal parameters, in a manner
that maximizes the mutual information between the feature set and the class.
The second contribution of this work is a novel method called "slow
connections" for computing or approximating an optimal (MAP)
interpretation on loopy graphical models. Computing the MAP on general
loopy networks is known to be NP-hard. We introduce a method that,
under specified conditions, finds either the global optimum or a local
optimum of the problem. In empirical experiments, this "slow connections"
method outperformed the Belief Revision algorithm, which is commonly used to
approximate the MAP computation in loopy graphical models.
Table of contents
1. Introduction
2. Probabilistic Models
3. Solving inference problems on singly connected networks
   3.1. Generalized Distributive Law (GDL) algorithm
   3.2. EM model parameter learning using GDL
   3.3. Belief Propagation (BP), Factor Graphs
      3.3.1. Belief Propagation (BP)
      3.3.2. Sum-Product (Factor Graphs) algorithm
4. Maximum MI Training
   4.1. MaxMI
   4.2. MaxMI approximation on observed & unobserved models
   4.3. MaxMI & TAN Restructuring
   4.4. Combining MaxMI and TAN restructuring
   4.5. Maximizing MI vs. Minimizing PE
      4.5.1. Maximizing MI and Minimizing PE in the "ideal" training case
      4.5.2. Disadvantages of Minimizing PE
      4.5.3. MI(C;F) maximization as a classification model training criterion
5. Existing approaches for coping with loopy networks
   5.1. Triangulation
   5.2. Loopy Belief Revision
   5.3. CCCP: Minimizing Bethe-Kikuchi approximation of Free Energy
6. Using "Slow Connections" for solving MAP on loopy networks
   6.1. General overview of the approach
   6.2. Approaches for obtaining a local optimum
      6.2.1. Iterative fixing
      6.2.2. Local optimum assumption
   6.3. Assumption for obtaining a global optimum
   6.4. Coping with general networks – from theory to practice
      6.4.1. Partial iterative approximation
      6.4.2. The hybrid approach
   6.5. Clique Carving
7. Applying "Slow Connections" approaches in practice
8. Experimental results
   8.1. Max-MI classification model training
   8.2. "Slow Connections" approximation
9. Summary and conclusions
10. Future work
   10.1. Information based training
      10.1.1. Using observed and unobserved in the models
      10.1.2. Complete training approaches
      10.1.3. Bottom-up training
      10.1.4. Maximizing MI vs. minimizing PE
   10.2. Slow connections MAP approximation
      10.2.1. Slow connections selection methodology
      10.2.2. Convergence criteria for slow connections
11. Appendices
12. References
1. Introduction
This work is concerned with the development of methods for learning and using
graphical models for performing visual interpretation of images. We describe below
what we mean by the interpretation problem, which graphical models we use to
approach this problem, and the main goals of the current study. We also briefly list the
main results obtained in this work.
Visual Interpretation
By "visual interpretation" we refer to a generalization of visual object classification.
Given a specific class of visual objects, for example faces, we wish to construct an
approach that will allow us not only to classify an image as containing or not containing
an object from the class, but also to specify the identities and locations of
meaningful parts of the object (for instance "eyes", "nose", etc. for the "faces" class) in
the image.
The models we use in this work are feature-based graphical models. In these models, the
class object is represented by an interrelated set of features, which comprise the set of
“meaningful parts” of the class object we wish to interpret. The features that we use in
these models are based on fragments, which are informative image patches selected
during a training phase (for a more detailed discussion see [8]). The graphical models that
are used to achieve the interpretation task are hierarchical, namely, they represent the
structure of the interpreted class object in terms of its meaningful parts at multiple levels.
As an example consider decomposition of an “eyes and nose” region within a face into
the “left eye”, “nose” and a “right eye”, each of which is composed of several sub-parts;
for instance the “left eye” is decomposed into “eyebrow”, “left eye-corner”, “pupil”,
“right eye-corner” and etc.
A major difficulty in the interpretation task is that this task cannot be achieved separately
for each feature, as by itself, the smaller sub-features are highly ambiguous even for the
trained human eye. For instance consider the face decomposition example in Figure 1:
Figure 1: Ambiguity of part identification. Put together, one immediately identifies
each face part from the resulting face image. However, taken separately – each face
part appears highly ambiguous, even to a human observer.
Each of the sub-images is highly ambiguous if taken separately, but when put together,
the identity of each part becomes clear. Successful part interpretation therefore
depends on properly interconnecting the features in the graphical model, and on
using these connections to link features together when learning and using object models.
Graphical Models and their Training
Graphical models use a graph structure to represent probability distributions and other
forms of interrelations between model parts. For instance, directed graphical models can
be used to compactly represent a decomposition of a joint distribution into a set of
conditional distribution factors. In this case, the nodes of the graph represent random
variables and directed edges (parent-child relations) represent the dependence relations in
the decomposition. Graphical models of this kind are called Belief Networks (BNs), and
they will be discussed below in greater detail.
The usual manner of representing graphical models is via a graph. In this
representation, every element of the model is represented by a node of the model graph
G = (V, E), where the edges of G represent dependencies between the model elements.
Usually, graphical model nodes are divided into two groups: observed and unobserved.
Observed nodes are assigned input values based on which the “best” values for the
unobserved nodes are chosen (definition of “best” varies between problems for which the
models are constructed). As an example, consider a classification model in which all the
features are observed and connected to an unobserved “Class” node C. When this model
is applied to a classification problem instance, all the features are assigned values, based
on which, the (binary) value of C is decided. Decision tasks of this kind are called
“inference problems” on the graphical models.
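The classification decision just described can be sketched in a few lines of code. The star-shaped model below (a class node C with conditionally independent binary features), the number of features, and all probability values are illustrative assumptions, not a model from this work:

```python
import math

# A toy star-shaped classification model: a binary class node C with
# conditionally independent binary features F1..F3 (all values made up).
p_c = {0: 0.5, 1: 0.5}                        # prior p(C)
p_f_given_c = [                               # p(Fi = 1 | C) per feature
    {0: 0.2, 1: 0.8},
    {0: 0.3, 1: 0.9},
    {0: 0.6, 1: 0.4},
]

def decide(features):
    """Infer the 'best' value of the unobserved node C given observed features."""
    scores = {}
    for c in (0, 1):
        s = math.log(p_c[c])
        for pf, f in zip(p_f_given_c, features):
            s += math.log(pf[c] if f == 1 else 1.0 - pf[c])
        scores[c] = s                         # log p(C=c) + sum of log p(Fi | C=c)
    return max(scores, key=scores.get)        # argmax over the unobserved node

print(decide([1, 1, 0]))  # -> 1
```

The decision here is exactly an inference problem in the sense above: the feature nodes are clamped to their observed values, and the "best" value of C is the one maximizing the posterior.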
One of the key tasks in the context of graphical models is training. Usually we are given a
set of model parts, for instance features for the classification problem. By training we
refer to the task of combining these parts into a graph underlying the model and learning
an optimal set of parameters for each part (the notion of optimality varies between the
different uses of the model).
Results related to model construction: MaxMI(F;C)
Our first result, presented in this work, provides a novel method for training Belief
Networks in the context of the classification and visual interpretation problems. The
essence of our training method is the selection of the model, its features and feature
parameters, in a way that the Mutual Information (MI) between the model and the class is
maximized. Roughly, for a set of features F and a class C, our method constructs a
graphical model using the features and sets their optimal parameters (such as thresholds),
so as to maximize MI(F;C).
The models which are constructed by our technique are the so-called loop free models,
i.e. models having a Junction Tree (JT). The notions of loop free models and junction
trees will be explained in greater detail in later sections. In the so-called hybrid variant of
our novel training algorithm, the log-probability of the training data is also
maximized, which causes the Kullback-Leibler divergence between the model and the
true joint probability of the features and the class to be minimized.
The proposed training method developed in this section can provide a general approach,
which under some assumptions can be viewed as an optimal approach for selecting a
feature model for classification. This view is based on the following argument.
We want to construct a model based on a set of features F, and determine their
parameters, to solve a given classification problem for a class C.
We argue (in section 4.5.3) that a useful criterion, which can be viewed as optimal,
is to select F so as to maximize MI(F;C).
We also want the model to allow an efficient computation of the class C given the
features. This can be formulated as efficiently computing the most likely
interpretation given F, that is, argmax_C p(C|F). A general method for computing
argmax_x p(x|y) is available when p(x,y) can be decomposed and expressed as a junction
tree (with a limited tree width).
As a result, a useful approach to feature selection and model construction is therefore
to select a set of features F, with a joint distribution p(C,F), so that:
(i) MI(F;C) is maximal.
(ii) p(C,F) has a decomposition into a junction tree (with low tree width).
This is what our method accomplishes.
In addition, the same method also reduces the Kullback-Leibler divergence between
the model and the true joint probability of the features and the class, and guarantees
the robustness of the model. By robustness of the model we refer to avoiding overfitting
to the training data, and hence making the trained model better applicable to novel
examples.
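As an illustration of criterion (i), MI(F;C) can be computed directly from a joint probability table. The toy joint over one binary feature and a binary class below is an illustrative assumption, not data from this work:

```python
import numpy as np

def mutual_information(joint):
    """MI(F;C) in nats from a joint probability table p(f, c).

    Rows index feature configurations f, columns index class values c.
    """
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()              # normalize, in case counts were passed
    pf = joint.sum(axis=1, keepdims=True)    # marginal p(f)
    pc = joint.sum(axis=0, keepdims=True)    # marginal p(c)
    mask = joint > 0                         # convention: 0 * log 0 = 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (pf @ pc)[mask])))

# A toy joint over one binary feature F and a binary class C:
p = [[0.4, 0.1],
     [0.1, 0.4]]
print(mutual_information(p))
```

Feature-selection criteria of the kind discussed above compare such MI values across candidate feature sets and parameter settings (e.g. thresholds), keeping the most informative configuration.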
Loop-free and loopy models
In dealing with hierarchical models, a distinction is often made between loop-free and
loopy models. In the first part of this work we deal with loop-free models. This is a
family of models that allows efficient computation: the model is constructed as a
loop-free tree, where information propagates upwards from the leaf features (which are
the smallest, most ambiguous patches) to the root feature (which usually represents
the class object) and downwards to the leaves again. We refer to this type of computation
as a two-pass algorithm. Loop-free models and the two-pass computation are used in
various applications (see, for example, [3] and [16]).
In many domains, a simple hierarchical tree representation may not be enough as a
realistic model, because it disregards the dependencies present between different
meaningful image-parts that we want to interpret. In order to represent these
interdependencies, loopy connections are introduced into the model. Such models introduce
computational difficulties, since inference in loopy models has long been known to be a hard
problem (NP-hard in general [17, 18]).
In light of the above, it is of high theoretical interest to approximate the solutions of
inference problems on loopy networks. Our second result presented in this work is a novel
algorithm for performing efficient MAP approximation on loopy networks.
Results regarding loopy graphs: the use of "slow connections"
In this part of the work we develop a method for performing efficient MAP computations
on certain classes of loopy graphical models.
We show that the proposed loopy-MAP approximation method is guaranteed to converge
to either the global MAP solution or a local optimum on a restricted class of loopy
networks that satisfies several assumptions. However, empirical experiments
described at the end of this work suggest that it is a good approximation in the general
loopy case as well. In comparative testing, our approximation technique outperformed the
approach usually used to approximate the MAP in the loopy case – loopy Belief Revision.
Structure of the Thesis
The thesis is organized into two main parts. These parts describe the two themes covered
by the thesis. Part I, which is comprised of sections 3 and 4, focuses on the training of
probabilistic models and describes our novel training technique – Maximal Mutual
Information training (MaxMI). Part II, comprised of sections 5 to 7, deals with inference
on loopy networks, and describes our MAP approximation algorithm for loopy networks.
Following is a section-wise description of the thesis structure.
Section 2 briefly describes the theoretical background behind the two popular graphical
models: Belief Networks (BNs) and Markov Random Fields (MRFs).
Section 3 provides a summary of the popular approaches for solving inference problems
on loop-free graphical models: Generalized Distributive Law (GDL) [1], Belief
Propagation (BP) [3], Belief Revision (BR) [3] and Factor Graphs (FG) [19]. Section 3
also shows the equivalence between these approaches, by showing them to be equivalent
to GDL. Moreover, Section 3 covers an important training technique – Expectation
Maximization (EM) [20] and shows a method of applying its two popular variants (hard-
EM and soft-EM) on loop-free BNs, using GDL as a tool for calculating marginals in
intermediate steps.
Section 4 describes our novel information-based training method – Maximum Mutual
Information (MaxMI) – and discusses its possible extensions: an extension to the case of
observed and unobserved graphical models, an extension to a hybrid approach that
includes model construction, and an extension to a similar algorithm with better
convergence properties than the hybrid approach. Furthermore, in this section we
describe a possible approach to
complete model training. This approach provides a method for building the complete
trained model from the training data, involving feature selection together with structure
and parameter training. In section 4.5 we derive some interesting analytical results
comparing the maximizing Mutual Information (MI) and minimizing
Probability of Error (PE) training criteria. These results are used to explain why
maximizing MI is a useful, and in many cases optimal, criterion for learning
classification models.
Section 5 provides a brief summary of the most popular approaches for coping with loopy
graphical models: Triangulation (a clustering technique) [1], Loopy Belief Revision
(LBR) [13] and the so-called Convergent Convex Concave Procedure (CCCP) [7].
Section 6 deals with our novel method for efficient loopy-MAP approximation, the “slow
connections” technique. In this section we also briefly discuss some possible extensions
to our technique and give some preliminary results regarding its computational
efficiency.
Section 7 covers some practical aspects of applying our loopy-MAP approximation
technique and describes our implementation of this technique.
Section 8 summarizes the empirical results obtained when applying our novel training
and loopy-MAP approximation algorithms in practice in the context of visual
interpretation task models. We provide results for several versions of our algorithms and
also provide a comparison with a popular loopy-MAP approximation algorithm – Loopy
Belief Revision.
Section 9 gives a brief summary of the main results developed in the thesis.
Section 10 summarizes the general directions for future research, some of which are
mentioned in various parts of the thesis.
2. Probabilistic Models
Graphical models are commonly used to represent probability distributions and perform
efficient inference using these representations. This section briefly reviews the
background material on graphical models that is relevant to the current work. It describes
the most well known examples of (probabilistic) graphical models, which are Bayesian
Networks (BN) and Markov Random Fields (MRF), explained briefly below.
In general, behind each graphical model is a set of variables, some observed, y =
y1,…,yk, and some unobserved, x = x1,…,xn, distributed with a joint
distribution p(x1,…,xn, y1,…,yk). Given the values of y1,…,yk, we wish to solve probabilistic
inference problems, i.e. compute some aspects of the probability of the unobserved x.
Examples of such aspects are the Maximum A-Posteriori (MAP) assignment, or marginals. By
marginals we refer to the joint distributions of subsets of {x1,…,xn} given {y1,…,yk}, which
result from p(x1,…,xn | y1,…,yk) by summing over the remaining variables. The interest in
marginals comes, for instance, from minimum variance estimation.
Inference in probabilistic models is impractical in general, unless there are some
restrictions on the probability distribution p. If there are some independence relations
between the variables, it may become possible to decompose p into a product of simpler
functions. Graphical models deal with different cases of such decompositions. The graph
structure describes the decomposition or, equivalently, certain independence relations
between the variables, and provides methods for exploiting this decomposition for
efficient inference.
Belief Networks
The Belief Network (BN) [3] makes use of a Directed Acyclic Graph (DAG)
representation, wherein the nodes of the graph are the Random Variables (RVs) of the model
and directed edges stand for the conditional independence relations expressed by the
decomposition. The DAG G = (V, E) underlying the BN represents a possible
decomposition of the joint Probability Density Function (PDF) of all the RVs of the
model. The essence of the decomposition is that the joint PDF is represented as a product
of a set of local kernels, each of which is a conditional PDF of a node given its parents.
The parents of a node v ∈ V in G are the neighboring nodes of v from which there are
directed edges "pointing" at v. An illustration of the BN representation is given in Figure
2: 2(a) depicts a simpler loop free BN in which there are no undirected loops (loops in the
graph disregarding the edge directions), while 2(b) gives a more complicated example of
a loopy BN. The complexity of the loopy case over the loop free case will be discussed in
greater detail later in this work; here it is sufficient to say that exact inference on a
general loopy BN is well known to be NP-hard [17, 18].
Figure 2: BN illustration. (a) Belief Network without undirected loops. This is a loop free
network. (b) Belief Network without directed loops, but with an undirected loop. Such a
network is still considered to be loopy.
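The BN decomposition described above can be sketched on a tiny loop-free network similar in spirit to Figure 2(a). The structure (A -> B, A -> C) and all conditional probability values below are illustrative assumptions:

```python
from itertools import product

# A toy loop-free Belief Network over binary variables: A -> B, A -> C.
cpt_A = {(): {0: 0.6, 1: 0.4}}                            # p(A)
cpt_B = {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.2, 1: 0.8}}  # p(B | A)
cpt_C = {(0,): {0: 0.7, 1: 0.3}, (1,): {0: 0.5, 1: 0.5}}  # p(C | A)

def joint(a, b, c):
    """The joint PDF as a product of local kernels: p(A) p(B|A) p(C|A)."""
    return cpt_A[()][a] * cpt_B[(a,)][b] * cpt_C[(a,)][c]

# Because each local kernel is a proper conditional PDF, the product
# defines a proper joint distribution: it sums to 1 over all assignments.
total = sum(joint(a, b, c) for a, b, c in product([0, 1], repeat=3))
print(total)  # -> 1.0 (up to floating point)
```

For a DAG with more nodes the same principle applies: one local kernel per node, conditioned on that node's parents.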
Markov Random Fields
The structure underlying the Markov Random Field (MRF) representation is an
undirected graph, with a node for each RV of the model. The structure of the MRF
represents assumptions on the conditional independence between RVs of the model. If
two nodes of the MRF: u and v are “separated” by a set S of MRF nodes (i.e. every path
connecting u and v in the graph underlying the MRF has at least one of the nodes from S
on it), then RVs represented by u and v are independent given S:
)|()|()|,( SvpSupSvup
In particular, p(u | the entire graph) = p(u | immediate neighbors). The MRF
representation of the model also gives a decomposition of the joint PDF of the model
RVs. A well known result, the Hammersley-Clifford theorem, states that if C is the set
of cliques of the graph underlying the MRF representation, then the joint PDF
decomposes as follows:

P(x) = (1/Z) ∏_{c∈C} ψ_c(x_c)

where x stands for the vector of all RVs of the model and x_c stands for the vector of
RVs in the clique c ∈ C. The functions ψ_c are called compatibility functions and Z is a
normalizing constant. An example of an MRF is given in Figure 3, where 3(a) depicts a
simpler loop free case, while 3(b) shows the more complicated loopy MRF case. A loop
free MRF is a tree or a forest (a disconnected graph, each connected component of which
is a tree), and it can be seen as a special case of the BN in which the BN is a directed tree
(i.e. each node has exactly one parent). Conversely, any directed tree BN can be
represented by an MRF by removing the directions from the edges.
Figure 3: MRF illustration. (a) Loop-free Markov Random Field, in fact it is an undirected
tree. (b) Loopy Markov Random Field – contains a loop A,B,E,C.
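The Hammersley-Clifford decomposition can be sketched on a tiny loopy MRF. The three-node loop and the pairwise compatibility values below are illustrative assumptions (a smaller loop than the one in Figure 3(b)):

```python
from itertools import product

# A toy loopy MRF over binary nodes with one pairwise clique per edge.
nodes = ["A", "B", "C"]
edges = [("A", "B"), ("B", "C"), ("A", "C")]   # a single loop A-B-C

def psi(xu, xv):
    # One shared pairwise compatibility function that favors agreement.
    return 2.0 if xu == xv else 1.0

def unnorm(assign):
    """Product of clique compatibility functions for one assignment."""
    p = 1.0
    for u, v in edges:
        p *= psi(assign[u], assign[v])
    return p

# Z turns the unnormalized product into a joint PDF (Hammersley-Clifford form).
states = [dict(zip(nodes, vals)) for vals in product([0, 1], repeat=3)]
Z = sum(unnorm(s) for s in states)
P = {tuple(s.values()): unnorm(s) / Z for s in states}
assert abs(sum(P.values()) - 1.0) < 1e-12
```

Note that, unlike the BN case, the compatibility functions need not be normalized conditionals; the global constant Z absorbs the normalization.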
Inference in graphical models
A particularly interesting inference problem, usually posed under the described
probabilistic setting, is MPF (Marginalize a Product Function) [1]. Roughly speaking,
MPF is the problem of finding specific marginals of a product-decomposition of a given
function. Under the more general setting of a commutative semi-ring, this problem can be
transformed into other very interesting inference problems, like the MAP (Maximum A-
Posteriori) problem, in which we want to find a maximizing assignment to a sum-
decomposition of a given function. Both these inference problems are very interesting
in our context, as their solutions can be used to derive different kinds of interpretations of
a given image. They will be covered in more detail in later sections. We will also
introduce some novel approximation techniques for the MAP problem on loopy models.
Moreover, we also use MPF solving algorithms, like GDL [1], in our novel model
training techniques.
The probabilistic interpretations of the inference problems that are the main focus of
this work are:
MAP: finding the Maximum A-Posteriori (MAP) sequence of RV values, i.e. the
most probable assignment to the RVs given the evidence.
MPF: recovering marginal probability distributions of the joint PDF represented
graphically by the model.
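On a model small enough to enumerate, both problems can be solved by brute force; the exact loop-free algorithms discussed below reproduce these answers without exhaustive enumeration. The joint table over two binary unobserved variables is an illustrative assumption:

```python
from itertools import product

# Illustrative joint p(x1, x2 | evidence), given as an explicit table
# over two binary unobserved variables (the values are made up).
p = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

# MAP: the single most probable joint assignment to the unobserved RVs.
map_assign = max(p, key=p.get)
print(map_assign)  # -> (1, 1)

# MPF: marginals, obtained by summing out the remaining variables.
p_x1 = {v: sum(p[(v, x2)] for x2 in (0, 1)) for v in (0, 1)}
print(p_x1)  # marginal of x1, summing out x2
```

In general, the MAP assignment need not coincide with the variable-wise maxima of the marginals, which is why the two problems are treated separately.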
There are well known (and largely equivalent) methods for obtaining exact solutions for
these problems under the loop free setting (of either BN or MRF): Belief Propagation
(BP) and Belief Revision (BR) [3], Generalized Distributive Law (GDL) [1] and Factor
Graphs (FG) [19].
As we will show in the next section of this work, FG is completely equivalent to GDL.
Moreover, we will show that in the case of Belief Networks, BP is equivalent to GDL as
well (BP's messages are in fact normalized GDL messages). However, both GDL and FG
are built for the more general commutative semi-ring case and for non-normalized
decompositions, while BP, although equivalent, is not designed for the more general
case.
When the underlying graph contains loops, computing MAP or MPF becomes
considerably more difficult. The standard algorithms used for loop-free graphs are no
longer guaranteed to find correct solutions. Under the loopy setting, these algorithms are
known to obtain only approximate solutions to the inference problems, and their
convergence properties are still largely unknown. Even in the case of a DAG with
undirected loops, these methods are not guaranteed to converge; hence, the "loopy
setting" includes the case of a DAG with undirected loops. A well known result is that if
the standard inference algorithms (BP / GDL / FG) converge on a loopy network (a model
represented by a loopy graph), then they converge to a stationary point of the so-called
Bethe approximation to the free energy [4, 5] (or Bethe-Kikuchi free energy). In addition,
the desired solution to the inference problem is given by the global minimum of the free
energy on the given network. Hence, it is known that if the standard inference algorithms
converge, they converge to an approximation of the solution of the desired inference
problem, which may or may not be accurate. Another problem is that they are not
guaranteed to converge in the general case (although there are known results stating that
BP is guaranteed to converge in "single loop" graphs; see [6] for a detailed explanation).
Our approach to loopy graphs
To cope with the problems posed by loopy networks, several approaches have been
proposed. They include clustering techniques [3], like triangulation [1, 9], or more
complex approaches based on results from statistical physics, like CCCP [7]. In this work
I will describe our novel technique – “slow lateral connections”, as well as a hybrid
approach which involves both triangulation and our proposed techniques. Some of our
techniques require special properties of local kernels (conditional PDFs in BN and clique
compatibility functions in MRF) in the decomposition of the target function (the joint
PDF in both BN and MRF cases). When these requirements are not fulfilled we suggest
an alternative iterative approach which could be used in some cases. We also provide
experimental results obtained when applying the suggested approaches to both simulated
models and models arising from real life problems of visual interpretation and feature
based classification.
The next section will review the standard inference algorithms – GDL, BP / BR and FG –
as well as their application to the well known hard-EM and soft-EM training algorithms.
In section 4 we use GDL in our novel information based training technique, MaxMI.
Part I: Training Probabilistic Models
3. Solving inference problems on singly connected networks
In this section I review the well known techniques for solving inference problems on
loop-free networks. In the case of the BN, a loop-free network is called singly
connected, or a poly-tree, and can be thought of as a tree or a forest of trees with each edge
arbitrarily directed. In the case of the loop-free MRF, we refer to an undirected tree or a
forest of undirected trees. By inference problems we refer to the MAP (finding the Maximum
A-Posteriori assignment) and the MPF (finding marginals) mentioned above. A second
issue that I will present in this section is a method for efficiently using the GDL
algorithm as a tool for learning model parameters with the “soft” version of the
Expectation Maximization (EM) algorithm [20]. The GDL will be used to solve MPF
problems that arise in the maximization step of the soft EM.
The GDL algorithm, as well as the other algorithms presented below, is in fact more general
than the probabilistic setting assumed by the BN or MRF models. It can be applied to
the more general setting of decompositions into non-normalized factors, or even to non-
product decompositions (sum decompositions, etc.). In the next section I will describe
these generalizations in greater detail.
3.1. Generalized Distributive Law (GDL) algorithm
The GDL algorithm was first presented in [1]. Its purpose is solving Marginalize a
Product Function (MPF) problems on various commutative semi-rings.
The GDL is designed to be used for any general function that is decomposable into a
product of local kernels – functions (not necessarily normalized) whose support is a
subset of the support of the decomposed function. It can be used to solve the MPF
problem in this general setting, and is especially efficient if the supports of the local
kernels can be organized into a junction tree, as described below. Moreover, as neatly
described in [1] and [19], the MPF problem can be cast from its usual “sum-product”
commutative semi-ring (in which we operate on a function decomposable into a product
of local kernels and we want to find its marginals, i.e. find a summary on part of the
function variables) to other commutative semi-rings in which we substitute the sum and
- 17 -
product operations to other operations. For instance changing sum operations into max
operations, changing product operations into sum operations and changing the original
local kernels of the product decomposition to their logarithms will transform the GDL
from MPF solving algorithm into the MAP solving algorithm. Moreover, both the GDL
and the Factor Graphs [19] algorithms can solve MPF under any commutative semi-ring.
Hence, due to reasons laid out above, in the rest of this work GDL will play a key role, as
the selected inference algorithm in the loop-free scenarios. As we will see in following
sections, other well known inference algorithms, like BP [3] and Factor Graphs are its
special cases.
The GDL method will not be described here in full detail; for a full description see [1]. One reason for selecting GDL as the main algorithm for solving inference problems in loop-free scenarios is that other algorithms can be cast into the GDL form more naturally than vice versa. Another reason is that the GDL has a built-in technique for coping with loopy situations. The technique is called triangulation, and it will be described in more detail in later sections. In the worst case, this technique of coping with loops can result in an exponential increase in time complexity, but it is still useful in many situations. In particular, it can be used together with our novel loopy MAP approximation algorithm (the "slow connections" algorithm) to form what we call a "hybrid approach", which expands the range of cases in which we can efficiently apply our algorithm.
We will now give a short description of the GDL, working in the sum-product semi-ring and solving the original MPF problem. As mentioned above, the transition from this to solving the MAP problem is straightforward.
Let $f(x_1,\dots,x_n)$ be a function which has the following decomposition:

$$f(x_1,\dots,x_n) = \prod_j g_j(S_j)$$

where the $S_j$ are subsets of $\{x_1,\dots,x_n\}$. In other words, $f$ can be decomposed into a product of simpler functions $g_j$, each of which depends only on a subset $S_j$ of the whole set of variables. These subsets of variables are called the 'local domains'. Moreover, assume that the local domains $S_j$ can be arranged as nodes of a so-called 'junction tree'
[9]. A Junction Tree (JT) $T$ for $f$ is a tree (or a forest) such that, for every variable $x_k$, the sub-graph of the nodes of $T$ containing $x_k$ is a connected subtree of $T$. A more formal definition: a junction tree $T$ of $f$ is an undirected tree (or forest) such that every node $j$ of $T$ corresponds to the set $S_j$, every edge $(i,j)$ of $T$ is labeled by $S_i \cap S_j$, and if nodes $k$ and $m$ are connected in $T$, then for any node $l$ on the path connecting them: $S_k \cap S_m \subseteq S_l$.
When such a JT $T$ exists, the GDL can be applied to solve the MPF problem for $f$ and find the marginals whose supports are the sets $S_j$ corresponding to the node labels of $T$, and the sets $S_i \cap S_j$ corresponding to the labels of the edges of $T$.
The GDL is a message passing algorithm which usually operates on $T$ in a two-pass schedule: a bottom-up pass sends messages from the leaves of $T$ towards the node chosen as the root, and a top-down pass sends messages from the root towards the leaves of $T$. The messages passed in the GDL are functions; the message that node $i$ sends to node $j$ is denoted by $m_{i\to j}$ and is a function of the variables $S_i \cap S_j$. At the beginning of the GDL run, all the messages are initialized to unity functions: $m_{i\to j} \equiv 1$. Whenever a node $i$ needs to send a message to a node $j$, this message is calculated as follows:

$$m_{i\to j}(S_i \cap S_j) = \sum_{S_i \setminus S_j} g_i(S_i) \prod_{l \in N_i \setminus \{j\}} m_{l\to i}(S_l \cap S_i)$$
where $N_i$ is the set of neighbors of $i$ in $T$. This message is a function whose support contains all the variables mutual to both local domains, $S_i$ and $S_j$. It is formed by multiplying all the messages received so far by node $i$ from its neighbors (other than $j$) by $i$'s local kernel, and summing over all the "non-message" variables (i.e. all the variables which are not in $S_i \cap S_j$). Note also that the meaning of the sum and product operators depends on the commutative semi-ring over which we operate: they are the ordinary sum and product when we use GDL to solve the MPF problem, and max and sum when we use the GDL for solving the MAP problem.
Evidence from observations is incorporated into the GDL scheme by fixing the values of the observed variables (and not summing over them). This means that in every message computation, the observed variables of the involved local domains are not summed over, but instead are assigned their fixed values from the evidence. Whenever observed data is present, the marginals computed by the GDL include it, i.e. if an observed (fixed) data vector $y$ is incorporated into the GDL run, the marginal for the local domain $S_j$ will be $p(S_j, y)$ and is obtained as:

$$p(S_j, y) = g_j(S_j) \prod_{i \in N_j} m_{i\to j}(S_i \cap S_j)$$

This means that the result of the GDL run at node $j$ will not provide us with the probability distribution of $j$'s local domain $S_j$. Instead, it will give us a function $p(S_j, y)$, which is proportional to the measure of belief in a specific configuration of $S_j$ given the evidence $y$.
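The message rule and the belief computation above can be sketched on the smallest non-trivial junction tree: two local domains sharing one variable. The kernels below are arbitrary illustrative numbers, and the message-passed marginal is checked against brute-force summation:

```python
# Toy decomposition f(x1,x2,x3) = g1(x1,x2) * g2(x2,x3) over binary
# variables; the local domains S1={x1,x2}, S2={x2,x3} form a 2-node JT
# whose single edge is labeled by the intersection {x2}.
g1 = {(a, b): 0.1 + 0.2 * a + 0.3 * b for a in (0, 1) for b in (0, 1)}
g2 = {(b, c): 0.5 - 0.1 * b + 0.05 * c for b in (0, 1) for c in (0, 1)}

# GDL message from node 1 to node 2: sum g1 over S1 \ S2 = {x1}.
m12 = {b: sum(g1[(a, b)] for a in (0, 1)) for b in (0, 1)}

# Belief at node 2: own kernel times the incoming message -> marginal on S2.
belief_S2 = {(b, c): g2[(b, c)] * m12[b] for b in (0, 1) for c in (0, 1)}

# Brute force: marginalize the full product over x1 and compare.
f = {(a, b, c): g1[(a, b)] * g2[(b, c)]
     for a in (0, 1) for b in (0, 1) for c in (0, 1)}
brute_S2 = {(b, c): f[(0, b, c)] + f[(1, b, c)]
            for b in (0, 1) for c in (0, 1)}
assert all(abs(belief_S2[k] - brute_S2[k]) < 1e-12 for k in belief_S2)
```

On a larger tree the same rule is simply applied along the two-pass (leaves-to-root, root-to-leaves) schedule.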
Of course, a junction tree $T$ with the above properties does not necessarily exist for every decomposition of $f$. Given a decomposition, there are simple criteria to test whether the local domains can be arranged on a junction tree. These criteria will also be useful to us when we describe our framework for loopy MAP approximation. They are briefly described here (see [1] for more details):
Construct a "local domain graph": a complete graph $G$ with nodes $S_j$. Assign a weight to every edge of $G$; the edge connecting nodes $S_j$ and $S_i$ receives the weight $w_{j,i} = |S_j \cap S_i|$.
Then a JT $T$ exists for a given decomposition iff a maximum weight spanning tree of $G$ has weight $\sum_j |S_j| - n$. Moreover, if a JT exists, then any maximum weight spanning tree of $G$ is a JT, and vice versa.
The complexity of the GDL, in terms of the total number of multiplications and additions, can be expressed as $\sum_e \chi(e)$, where $\chi(e)$ is the complexity of a JT edge $e$. In turn, for an edge $e$ connecting nodes $S_j$ and $S_i$ in a JT, $\chi(e) = |A(S_i)| \cdot |A(S_j)| \,/\, |A(S_i \cap S_j)| = |A(S_i \cup S_j)|$. The term $A(S)$ stands for the set of all possible assignments to the variables of the local domain $S$.
Hence, when a JT exists for a given decomposition, an optimal JT can be found by modifying the standard Prim greedy algorithm for finding maximum weight spanning trees so that it selects an edge of minimum complexity whenever multiple edges could be equivalently selected. As mentioned earlier, in cases where a JT does not exist for a given decomposition, clustering methods (such as triangulation, which will be described later) can be used.
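The existence criterion above can be sketched as follows; `jt_exists` is a hypothetical helper name and the domain sets are illustrative. It builds the complete local domain graph, grows a maximum weight spanning tree Prim-style, and compares the resulting weight to the target value:

```python
def jt_exists(domains):
    """Test whether the local domains admit a junction tree:
    a maximum weight spanning tree of the local domain graph
    (edge weights |S_i & S_j|) must have weight sum|S_j| - n."""
    n = len(set().union(*domains))      # total number of variables
    k = len(domains)
    in_tree = {0}                       # Prim: start from an arbitrary node
    weight = 0
    while len(in_tree) < k:
        best, best_w = None, -1
        for i in in_tree:
            for j in range(k):
                if j not in in_tree:
                    w = len(domains[i] & domains[j])
                    if w > best_w:
                        best, best_w = j, w
        in_tree.add(best)
        weight += best_w
    return weight == sum(len(S) for S in domains) - n

# Chain-shaped domains admit a junction tree ...
assert jt_exists([{'x1', 'x2'}, {'x2', 'x3'}, {'x3', 'x4'}])
# ... while the domains of a 3-cycle do not (loops require triangulation).
assert not jt_exists([{'x1', 'x2'}, {'x2', 'x3'}, {'x3', 'x1'}])
```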
To conclude the GDL description, let us mention that loop-free (singly connected) BN and MRF networks all have corresponding JTs. For instance, let

$$p(x_1,\dots,x_n) = \prod_j p\big(x_j \mid S_j \setminus \{x_j\}\big), \qquad S_j = \{x_j\} \cup Par(x_j)$$

be a loop-free BN decomposition. Then the sets $S_j$ form a JT for the decomposition if we connect every two non-disjoint sets. The structure of the resulting JT is exactly the same as the structure of the original BN. Figure 4(a) shows a BN with circles around the sets forming the JT nodes and the edges of the JT drawn in a different color, while 4(b) depicts the resulting JT separately. A loop-free MRF will have its junction tree constructed in a similar manner.
Figure 4: From BN to Junction Tree. The junction tree of a loop-free Belief Network has the same structure as the Belief Network itself. The JT can be constructed by replacing each BN node by a local domain consisting of the replaced node and its parents.
3.2. EM model parameter learning using GDL
One of the most popular approaches for learning a-posteriori probabilistic model parameters is Expectation Maximization (EM) [20]. In this section I briefly review EM and describe an approach to applying it to loop-free graphical models using the GDL algorithm.
The general idea behind EM is the following: given a set of observed training data on the model, we try to obtain the set of parameters that maximizes the likelihood of the training data. In general this problem is exponentially hard, but it can be approximated iteratively, and that is what EM does.
The general setting in which EM operates is a model given as a PDF $p(x, y;\theta)$, where $x$ is a vector of hidden variables, $y$ is a vector of observed variables, and $\theta$ denotes the parameters of the model that we wish to obtain (for instance, the conditional probability tables in the BN case). We are also given a set of independent training data $Y = \{y_1,\dots,y_n\}$, each sample containing the values of the observed variables of the model. The quantity that EM approximates is therefore:

$$\theta^* = \arg\max_\theta \sum_{y\in Y} \log p(y;\theta) = \arg\max_\theta \sum_{y\in Y} \log \sum_x p(x, y;\theta)$$
The two most popular forms of EM are the so-called "hard" EM and "soft" EM; the following is a brief summary of each, together with a GDL-based implementation in the loop-free graphical model case.
Hard EM
Hard EM tries to approximate

$$(X^*, \theta^*) = \arg\max_{X,\theta} \sum_{i=1}^n \log p(x_i, y_i;\theta)$$

(where $X = (x_1,\dots,x_n)$ and $x_i$ is the value of $x$ that maximizes $p(x_i, y_i;\theta)$ for a given $\theta$), from which the optimal (i.e. closest to the true) model parameters are obtained as the $\theta$ part of the argmax.
The process starts from some (arbitrary) $\theta^0$. At each step of the process, the current $\theta$ is replaced by the next-step set of parameters $\hat\theta$ by solving MAP for $\theta$, i.e. finding

$$\hat X = \arg\max_{x_1,\dots,x_n} \sum_{i=1}^n \log p(x_i, y_i;\theta)$$

and then re-estimating the parameters using $Y$ and $\hat X$ to form $\hat\theta$.
Usually, $\theta$ represents the values of the marginals or conditional distributions (CPTs) which can be combined to form $p(x, y;\theta)$. Thus, to calculate $\hat\theta$ using $Y$ and $\hat X$, one can use the maximum likelihood approximation (which is also asymptotically correct). To do so, the histograms of $Y$ and $\hat X$ are calculated, and from them the new CPTs or marginals forming $\hat\theta$ are readily obtained.
Note that if $\theta$ is as described and is updated using the histograms, then since

$$\hat\theta = \arg\max_\theta \log P(\hat X, Y;\theta) = \arg\max_\theta \sum_{i=1}^n \log p(\hat x_i, y_i;\theta)$$

and since, as can easily be shown,

$$\log P(\hat X, Y;\hat\theta) = \sum_{i=1}^n \log p(\hat x_i, y_i;\hat\theta) \;\ge\; \sum_{i=1}^n \log p(\hat x_i, y_i;\theta) = \log P(\hat X, Y;\theta)$$

we get that if $\hat X^1, \hat X^2, \hat X^3, \hat X^4, \dots$ is the series of $\hat X$ produced in subsequent steps and $\hat\theta^1, \hat\theta^2, \hat\theta^3, \hat\theta^4, \dots$ is the corresponding series of $\hat\theta$, then:

$$\log P\big(\hat X^{i+1}, Y; \hat\theta^{i+1}\big) \;\ge\; \log P\big(\hat X^{i}, Y; \hat\theta^{i}\big)$$

Hence $\log P(\hat X^i, Y;\hat\theta^i)$ is a non-decreasing sequence, bounded above by

$$\max_{X,\theta} \sum_{i=1}^n \log p(x_i, y_i;\theta)$$

and thus can be thought of as approximating the latter, as desired.
Thus, when the model has, for instance, a loop-free (singly connected) BN decomposition, the MAP stage can be done using the GDL algorithm, and the application of hard EM becomes an iterative application of GDL (to solve the MAP) with intermediate steps of parameter re-estimation. The final value of $\theta$ to which we converge is then the result of hard EM training.
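As a minimal sketch of the hard-EM loop just described (MAP assignment of the hidden variables, then ML re-estimation), consider a toy two-coin mixture. The model, data, and 10-toss setup are illustrative assumptions, not taken from the thesis:

```python
from math import comb

# Each sample y_i is the number of heads in 10 tosses of one of two coins
# with unknown biases theta = (p0, p1); the coin identity x_i is hidden.
def lik(y, p, n=10):
    return comb(n, y) * p**y * (1 - p)**(n - y)

Y = [1, 2, 1, 8, 9, 7, 2, 8]
theta = [0.3, 0.6]                        # arbitrary theta^0
for _ in range(20):
    # MAP step: pick the most likely hidden assignment x_i per sample.
    X = [0 if lik(y, theta[0]) >= lik(y, theta[1]) else 1 for y in Y]
    # Re-estimation step: ML biases from the hard assignments (histograms).
    for c in (0, 1):
        ys = [y for y, x in zip(Y, X) if x == c]
        if ys:
            theta[c] = sum(ys) / (10 * len(ys))

# Low-count samples end up assigned to coin 0, high-count ones to coin 1.
assert X == [0, 0, 0, 1, 1, 1, 0, 1]
```

In a loop-free BN the MAP step would be done by GDL in the max-sum semi-ring instead of the exhaustive per-sample comparison used here.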
Soft EM
Soft EM tries to maximize the likelihood of the observed data using the full distribution of the unobserved $x$ variables (in contrast with the hard version, which uses only the most likely values of the $x$ variables):

$$\theta^* = \arg\max_\theta \sum_{y\in Y} \log p(y;\theta)$$

The process starts from some (arbitrary) $\theta^0$. At each step of the process, the current $\theta^n$ is replaced by the next-step set of parameters $\theta^{n+1}$ by solving:

$$\theta^{n+1} = \arg\max_\theta \tilde E_{\theta^n}\Big(\sum_{y\in Y} \log p(x_y, y;\theta)\Big)$$

where $\tilde E_{\theta^n}$ is a conditional expectation taken with respect to the PDF $p(X \mid Y;\theta^n)$, and where $X = \{x_y \mid y \in Y\}$ is the set of unobserved RV vectors corresponding to each observed data instance from $Y$.
In fact, we can show that:

$$\theta^{n+1} = \arg\max_\theta \sum_{y\in Y} E^y_{\theta^n}\big(\log p(x_y, y;\theta)\big)$$

where $E^y_{\theta^n}$ is an expectation taken with respect to the PDF $p(x_y \mid y;\theta^n)$.
Proof:
Following directly from the definitions above:

$$\theta^{n+1} = \arg\max_\theta \sum_X p(X\mid Y;\theta^n)\sum_{y\in Y}\log p(x_y, y;\theta) = \arg\max_\theta \sum_X \prod_{z\in Y} p(x_z\mid z;\theta^n)\sum_{y\in Y}\log p(x_y, y;\theta)$$

$$= \arg\max_\theta \sum_{y\in Y}\sum_{x_y} p(x_y\mid y;\theta^n)\,\log p(x_y, y;\theta)\sum_{X\setminus\{x_y\}}\;\prod_{z\in Y\setminus\{y\}} p(x_z\mid z;\theta^n) = \arg\max_\theta \sum_{y\in Y}\sum_{x_y} p(x_y\mid y;\theta^n)\log p(x_y, y;\theta)$$

$$= \arg\max_\theta \sum_{y\in Y} E^y_{\theta^n}\big(\log p(x_y, y;\theta)\big)$$

(the inner sum of products equals 1, as each factor is a normalized distribution). Hence the derivation

$$\theta^{n+1} = \arg\max_\theta \sum_{y\in Y} E^y_{\theta^n}\big(\log p(x_y, y;\theta)\big)$$

is correct ▄
Soft EM on Belief Networks
In the following we show how the soft EM algorithm can be applied in the general BN case and, in particular, how it can be efficiently applied using GDL in the loop-free BN case.
Assume our model has the BN decomposition:

$$p(x, y;\theta) = \prod_{i=1}^m q_i\big(x_i \mid Par(x_i)\big)\prod_{j=1}^k r_j\big(y_j \mid Par(y_j)\big)$$

where $m$ and $k$ are the sizes of the vectors $x$ and $y$ respectively, $Par$ denotes the set of parents of a random variable in the BN decomposition, and $\theta$ denotes the conditional probability tables $\{q_i\}$ and $\{r_j\}$. Then the expectation term takes the form:

$$E^y_{\theta^n}\big(\log p(x, y;\theta)\big) = \sum_x p(x\mid y;\theta^n)\Big[\sum_{i=1}^m \log q_i\big(x_i\mid Par(x_i)\big) + \sum_{j=1}^k \log r_j\big(y_j\mid Par(y_j)\big)\Big]$$
Now if we rearrange the terms, we get that the coefficient of the element $\log q_i(x_i \mid Par(x_i))$ (for specific values of $x_i$ and $Par(x_i)$) is the marginal $p(x_i, Par(x_i) \mid y;\theta^n)$. If we also assume $x_i$ is binary (i.e. takes values from the set $\{0,1\}$), then for a specific assignment to $Par(x_i)$:
Denote $t_i = q_i(x_i = 0 \mid Par(x_i))$, an element of $\theta$ (for some fixed value of $Par(x_i)$). Then $q_i(x_i = 1 \mid Par(x_i)) = 1 - t_i$.
Taking the gradient of $\sum_{y\in Y} E^y_{\theta^n}\big(\log p(x, y;\theta)\big)$ and equating it to zero, the equation corresponding to $t_i$ is:

$$0 = \frac{d}{dt_i}\sum_{y\in Y}\Big[p\big(x_i=0, Par(x_i)\mid y;\theta^n\big)\log t_i + p\big(x_i=1, Par(x_i)\mid y;\theta^n\big)\log(1-t_i)\Big]$$

and hence, using elementary calculus, $\hat t_i$, the corresponding element of $\theta^{n+1}$, equals:
$$\hat t_i = \frac{\sum_{y\in Y} p\big(x_i=0, Par(x_i)\mid y;\theta^n\big)}{\sum_{y\in Y} p\big(x_i=0, Par(x_i)\mid y;\theta^n\big) + \sum_{y\in Y} p\big(x_i=1, Par(x_i)\mid y;\theta^n\big)} = \frac{\sum_{y\in Y} p\big(x_i=0, Par(x_i)\mid y;\theta^n\big)}{\sum_{y\in Y} p\big(Par(x_i)\mid y;\theta^n\big)}$$

Note that $Par(x_i)$ here stands for the fixed values of $x_i$'s parents corresponding to the current $t_i$ choice, and hence the denominator is not equal to one, as we do not sum over $Par(x_i)$.
The $\theta^{n+1}$ elements corresponding to $r_j(y_j\mid Par(y_j))$ are computed in a similar fashion, using marginals of the form $p(y_j, Par(y_j)\mid y;\theta^n)$. Note also that if $x_i$ were not binary, $t_i$ would be obtained as part of the solution of a system of linear equations (see Appendix A1 for more details on the multi-valued $x_i$ case).
Hence, we can conclude that in order to apply soft EM, all we need is to be able to compute the marginals $p(x_i, Par(x_i), y;\theta^n)$ and $p(y_j, Par(y_j), y;\theta^n)$, i.e. solve the MPF problem (under the sum-product semi-ring) for the local domains $\{x_i, Par(x_i)\}$ and $\{y_j, Par(y_j)\}$ with fixed values of the $y$ variables. It is also easy to see that if the above BN decomposition is singly connected, the required marginals are exactly the ones calculated via GDL. Thus soft EM in the loop-free case is an iterative application of GDL with intermediate steps of parameter recalculation using the equations described above.
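The CPT update formula derived above can be sketched on the smallest instance: a single hidden binary node x with an observed child y and no other parents, so $Par(x_i)$ is empty and the denominator reduces to the number of samples. All numbers are illustrative:

```python
# One soft-EM update step for the BN x -> y (x hidden, y observed):
# the new CPT entry t = q(x=0) is the sum of posterior marginals
# p(x=0 | y; theta^n) over the data, divided by the number of samples.
q = {0: 0.5, 1: 0.5}                       # current prior over hidden x
r = {(0, 0): 0.9, (1, 0): 0.1,             # r[(y, x)] = p(y | x), held fixed
     (0, 1): 0.2, (1, 1): 0.8}
Y = [0, 0, 1, 0, 1]                        # observed training samples

def posterior_x0(y):
    """p(x=0 | y; theta^n), the marginal a GDL run would deliver."""
    joint0 = q[0] * r[(y, 0)]
    joint1 = q[1] * r[(y, 1)]
    return joint0 / (joint0 + joint1)

t_hat = sum(posterior_x0(y) for y in Y) / len(Y)   # updated q(x=0)
```

In a larger loop-free BN the only change is that `posterior_x0` is replaced by the marginals computed by GDL with the evidence clamped.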
Conclusion
As we have shown, both popular forms of EM are equivalent to an iterative process of solving MPF under an appropriate semi-ring (max-sum for MAP in hard EM, and sum-product for the original MPF in soft EM) with well-defined intermediate re-estimation steps. Hence, providing an algorithm for solving MPF (or at least MAP) in loopy network models readily provides a method for learning those models.
In Section 4 we also provide an additional scheme for learning model parameters using GDL; this scheme operates in loop-free scenarios, and its goal is maximizing the Mutual Information (MI) between the model and the class of objects it represents. We also compare the performance of this learning scheme to EM learning in the experimental results section.
3.3. Belief Propagation (BP), Factor Graphs
For the sake of completeness, we briefly review two additional popular inference algorithms: BP and the Sum-Product (Factor Graphs) algorithm. Both algorithms are guaranteed to converge to the correct solution (of the MPF problem) in loop-free scenarios. In this section we show that these algorithms can be expressed as forms of the GDL algorithm.
3.3.1. Belief Propagation (BP)
The BP algorithm [3], first presented by J. Pearl in 1988, was originally designed for inference on the BN model. Like GDL, BP is a message passing algorithm. The BP messages are communicated on the original BN and represent conditional probabilities relating BN variables (nodes) and parts of the evidence. When the BN
$$p(v_1,\dots,v_n) = \prod_{i=1}^n q_i\big(v_i\mid Par(v_i)\big)$$

($Par$ being the set of a node's parents) is singly connected, the messages communicated by BP on a directed edge $(v_i, v_j)$ of the BN can be interpreted as follows.
The causal parameter $v_i$ sends to $v_j$:

$$\pi_{v_i}^{v_j}(v_i) = p\big(v_i\mid C_{v_j}^{+}\big)$$

where $C_{v_j}^{+}$ is the vector of all the observed variable values "above" $v_j$.
The diagnostic parameter $v_j$ sends to $v_i$:

$$\lambda_{v_j}^{v_i}(v_i) = p\big(C_{v_j}^{-}\mid v_i\big)$$

where $C_{v_j}^{-}$ is the vector of all the observed variable values "below" $v_j$.
In the above description, "above" means all the nodes reachable from $v_j$ by undirected paths through $v_i$, and "below" means the rest of the nodes. As for the message update rules, they are as follows:
$$\lambda_{v_j}^{v_i}(v_i) = \alpha\sum_{v_j}\sum_{Par(v_j)\setminus\{v_i\}} q_j\big(v_j\mid Par(v_j)\big)\prod_{v_l\in Par(v_j)\setminus\{v_i\}}\pi_{v_l}^{v_j}(v_l)\prod_{v_k\in Par^{-1}(v_j)}\lambda_{v_k}^{v_j}(v_j)$$

where $Par^{-1}(v_j)$ denotes all the "children" of $v_j$, i.e. all the nodes $v_k$ such that there is an edge $(v_j, v_k)$ in the BN, and $\alpha$ is a normalization constant which normalizes $\lambda_{v_j}^{v_i}(v_i)$ to sum to 1. Now if we rename $\pi_{v_i}^{v_j}$ to $m_{i\to j}$ and $\lambda_{v_j}^{v_i}$ to $m_{j\to i}$, we immediately get:

$$m_{j\to i}(v_i) = \alpha\sum_{S_j\setminus\{v_i\}} q_j\big(v_j\mid Par(v_j)\big)\prod_{v_l\in N(v_j)\setminus\{v_i\}} m_{l\to j}\big(v_l \text{ if } v_l\in Par(v_j),\ v_j \text{ otherwise}\big)$$

where $N(v_j)$ is the set of neighbors of $v_j$ in the BN. Finally, notice that the local domain of $q_j$ is $S_j = \{v_j\}\cup Par(v_j)$, and hence $\lambda_{v_j}^{v_i}$ is clearly a normalized GDL message sent from the JT node corresponding to $q_j$ to its JT neighbor node corresponding to $q_i$ (whose local domain is clearly $S_i = \{v_i\}\cup Par(v_i)$, and hence $S_i\cap S_j = \{v_i\}$). As stated earlier, the JT corresponding to a singly connected BN has the same form as the BN.
Similarly,

$$\pi_{v_i}^{v_j}(v_i) = \alpha\sum_{Par(v_i)} q_i\big(v_i\mid Par(v_i)\big)\prod_{v_l\in Par(v_i)}\pi_{v_l}^{v_i}(v_l)\prod_{v_k\in Par^{-1}(v_i)\setminus\{v_j\}}\lambda_{v_k}^{v_i}(v_i)$$

and hence, using the same renaming as for $\lambda_{v_j}^{v_i}$, we equivalently see that $\pi_{v_i}^{v_j}$ is also a normalized GDL message.
Hence we see that messages communicated by BP are in fact normalized GDL messages,
thus BP is a variant of GDL (with message normalization).
3.3.2. Sum-Product (Factor Graphs) algorithm
The Factor Graphs (FG) algorithm [19] is a message passing algorithm that was developed in parallel to GDL and is essentially equivalent to it in form and spirit. The FG algorithm was developed (as was GDL) to generalize inference on (loop-free) networks under a single unifying framework. Like GDL, the goal of FG is solving the MPF problem under various commutative semi-rings (and hence also solving the MAP problem, etc.). The framework in which FG operates is essentially the same as that of GDL: given the function decomposition

$$f(x_1,\dots,x_n) = \prod_j g_j(S_j), \qquad S_j\subseteq\{x_1,\dots,x_n\}$$

solve the MPF problem and obtain the marginals:

$$f_i(x_i) = \sum_{\{x_1,\dots,x_n\}\setminus\{x_i\}} f(x_1,\dots,x_n)$$

The difference between FG and GDL is that FG does not construct a JT for the above problem, but instead constructs a similar structure called a "factor graph": a graph with nodes corresponding to the set $\{x_1,\dots,x_n\}\cup\{g_j\}$ and undirected edges connecting the node corresponding to $x_i$ and the node corresponding to $g_j$ iff $x_i\in S_j$. In this "factor graph" the messages passed are updated as follows:
Let the message that the node corresponding to $g_j$ sends to the node corresponding to $x_i$ be denoted by $m_{j\to i}$, and the message sent in the reverse direction be denoted by $\hat m_{i\to j}$. Then both $m_{j\to i}$ and $\hat m_{i\to j}$ are functions of $x_i$ and are calculated as follows:

$$m_{j\to i}(x_i) = \sum_{S_j\setminus\{x_i\}} g_j(S_j)\prod_{x_k\in S_j\setminus\{x_i\}}\hat m_{k\to j}(x_k)$$

$$\hat m_{i\to j}(x_i) = \prod_{g_k\in N(x_i)\setminus\{g_j\}} m_{k\to i}(x_i)$$

where $N(x_i)$ represents all the function nodes which are neighbors of $x_i$.
We easily see that if the "factor graph" is loop-free, then any two functions that share a variable share only one variable (otherwise loops are formed), and thus if we combine the definitions of $m_{j\to i}$ and $\hat m_{i\to j}$ we get:

$$m_{j\to i}(x_i) = \sum_{S_j\setminus\{x_i\}} g_j(S_j)\prod_{x_k\in S_j\setminus\{x_i\}}\;\prod_{g_l\in N(x_k)\setminus\{g_j\}} m_{l\to k}(x_k)$$

This is exactly the GDL message passed from the node corresponding to the local kernel $g_j$ to its neighboring node which shares the variable $x_i$ with $g_j$, in a JT formed directly from the "factor graph" by connecting all the function nodes which share a variable by an edge.
Thus, as the FG algorithm is guaranteed to converge to a correct solution only in the loop-free case, and as we have shown that in the loop-free case the messages sent by FG on the "factor graph" are GDL messages on the corresponding JT, we conclude that in loop-free cases FG is a special case of GDL (with no gain in computational complexity).
4. Maximum MI Training
In this section we present our novel algorithm for simultaneous, information-driven structure and parameter learning on a loop-free BN. It is a training algorithm which draws conclusions about the optimal parameters and structure from a given set of training examples. As we work in the context of classification and interpretation problems, it is natural to consider the parameters and structure to be optimal if they maximize the
mutual information between the model and the class. The reasons for this choice are given later in this section, when we describe Ullman's unpublished "Inverse Fano Inequality".
Although EM is also a training algorithm, it is fundamentally different from our approach. One obvious difference is that EM tries to maximize the log-probability of the data, and hence make the trained model asymptotically correct, while our algorithm maximizes the information the model carries about the class. Note also that an extension of our algorithm maximizes both the log-probability and the model-to-class information. Another difference becomes clear when one considers the following example.
Assume we have a face classification model comprising a fixed BN of observed feature nodes, with the class node connected as an additional parent to every one of its nodes (if the BN is loop-free, this is exactly the TAN model described later). Suppose we have N sets of Normalized Cross Correlation (NCC) scores, one NCC score per feature, taken from N independent images, and suppose we wish to
EM for such a task would be problematic, as EM deals with fixed training data and
parameters that it trains should affect only the distribution and not the data. However,
here it is not the case, if we change the thresholds, the data from which we can obtain the
CPTs changes. For instance, if we use maximum likelihood principle to choose the CPTs
for a given set of thresholds, then one trivially notes that the histograms, from which we
- 30 -
should derive the CPTs, change with different choice of thresholds. Hence, we cannot use
EM in this setting, as using it will cause it to set all the thresholds to 1 or -1 and get the
data which has a probability one, but this is of course not our goal.
As we will later show, our algorithm is tailored for situations of the kind described above and can be used to obtain solutions to them efficiently, under several restricting assumptions.
The algorithm operates over what is usually referred to as a TAN (Tree Augmented Naïve Bayes) classification model [2]. The schematic structure of the TAN model is depicted in Figure 5. As can be seen from the illustration, TAN is not exactly a loop-free BN, as the class node, being a parent of every node in the network, introduces undirected loops into the graph. However, one can easily note that the local domain graph corresponding to the TAN is loop-free and has the TAN's underlying tree structure.
Figure 5: TAN model. Similar to the BN model, but with a class node connected
as an additional parent to each node.
If we consider the TAN structure as a special case of a BN, then every feature node, except the root node, has two parents: its parent in the feature tree and the class node. A more detailed description of the TAN model and its construction can be found in [Friedman et al., 1997].
The goal of the algorithm is to learn a set of optimal local parameters for each feature node of the TAN model. Unlike learning by EM, our learning approach determines the optimal model parameters by Mutual Information (MI) maximization. The mutual information maximized during learning is between the model (the set of feature nodes arranged in a TAN network) and the class random variable. The rationale is that the parameters which maximize the MI will also be better in the sense that they provide better classification results for the MAP decision scheme. One theoretical result which supports this intuition is the "Inverse Fano inequality" (an unpublished result by S. Ullman), summarized below.
Claim (Inverse Fano inequality): given a binary random variable C and a general random variable F, the probability of classification error $P_E$ in the MAP classification scheme is bounded from above as follows:

$$P_E \le \frac{1}{2} H(C\mid F)$$

In words, the probability of classification error is bounded by half the residual entropy.
Since we refer here to the MAP decision rule, the probability of an error in classifying C in the case $F = F_i$ is obviously $q_i = P\big(C \ne \arg\max_C P(C \mid F=F_i) \mid F=F_i\big)$, and hence, using the Bayes rule:

$$P_E = \sum_{F_i} P(F=F_i)\, P\big(C \ne \arg\max_C P(C\mid F=F_i) \mid F=F_i\big) = \sum_{F_i} P(F=F_i)\, q_i$$
Proof: The proof follows directly from the concavity of the logarithm. For $p \le \frac12$:

$$H(p) = -p\log_2 p - (1-p)\log_2(1-p) \;\ge\; -p\log_2 p - p\log_2(1-p) = -p\log_2\big(p(1-p)\big) \;\ge\; -p\log_2\tfrac14 = 2p$$

Meaning that for $p \le \frac12$, $H(p) \ge 2p$. Applying the above, since by definition $q_i \le \frac12$ for all $i$:

$$P_E = \sum_{F_i} P(F=F_i)\, q_i \;\le\; \frac12\sum_{F_i} P(F=F_i)\, H(q_i) = \frac12 H(C\mid F) \;\blacksquare$$
Assume we are given a classifier for a (binary) class C, with a feature vector F, which uses the MAP decision scheme. As H(C) is constant, we conclude from the Inverse Fano Inequality that as the mutual information, given by I(C;F) = H(C) - H(C|F), becomes higher, the residual entropy H(C|F) becomes lower, and therefore the upper bound on $P_E$ provided by the inequality becomes lower.
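The inequality can also be checked numerically. The sketch below draws random joint distributions of a binary C and a 4-valued F and verifies that the MAP error never exceeds half the residual entropy:

```python
import random
from math import log2

random.seed(0)
for _ in range(100):
    # Random joint distribution P(C, F), C in {0,1}, F in {0,...,3}.
    w = [random.random() for _ in range(8)]
    z = sum(w)
    P = [[w[2 * f + c] / z for c in (0, 1)] for f in range(4)]  # P[f][c]

    # MAP error: for each F value, the mass of the less likely class.
    p_err = sum(min(P[f]) for f in range(4))

    # Residual entropy H(C|F) = -sum P(f,c) log2 P(c|f).
    h_cond = 0.0
    for f in range(4):
        pf = P[f][0] + P[f][1]
        for c in (0, 1):
            if P[f][c] > 0:
                h_cond -= P[f][c] * log2(P[f][c] / pf)

    assert p_err <= h_cond / 2 + 1e-12   # Inverse Fano bound holds
```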
In the subsequent sections I describe a new method for learning model parameters in loop-free graphical models by maximizing mutual information. Section 4.1 deals with models with all-observable nodes, and Section 4.2 shows extensions of this technique to models with unobserved variables. Finally, in Sections 4.3 and 4.4 I discuss a hybrid approach for training and constructing the network, with the goal of maximizing both the MI and the log-probability of the model. We also show a possible extension of the latter hybrid approach which is guaranteed to converge to a local optimum of its score function.
4.1. MaxMI
As an example of the kind of problems targeted by our learning technique, the reader is referred to the threshold-learning example above. Algorithms used in the past to solve problems of this kind set the thresholds (parameters) for only one feature at a time. The goal in the above example, and in the rest of our discussion, is to set all the thresholds (parameters) simultaneously by maximizing MI(F;C).
Let us now describe the TAN setting for our MI(F;C) maximization in greater detail. Assume a BN decomposition of the joint distribution of the network nodes and the class (the class is denoted by the random variable C):

$$P(C, F_1,\dots,F_n;\theta) = \prod_{j=1}^n P\big(F_j \mid \Pi_j, C;\theta_{S_j}\big)$$

where $\Pi_j$ is the set of parents of $F_j$ and $S_j = \{F_j\}\cup\Pi_j$. Our BN is actually a TAN, which means that every feature node is affected by C; therefore we connect C as a parent to every node of the BN. In this structure, C is included in every conditional distribution factor of the decomposition. By $\theta = \{\theta_1,\dots,\theta_n\}$ we denote the parameters we wish to learn, one parameter for each BN node, and by $\theta_{S_j}$ we denote the set of parameters of the nodes in $S_j$. Our proof of the convergence of our algorithm to an MI-maximizing solution requires the following assumptions:
Assumption 1: $P(C, S_j)$ depends only on $\theta_{S_j}$.
Assumption 2: The above BN is such that if we remove the C node from it, the structure of the decomposition changes in the following way:

$$P(F_1,\dots,F_n;\theta) = \prod_{j=1}^n P\big(F_j\mid\Pi_j;\theta_{S_j}\big)$$

I.e. the structure of the BN remains the same (in the sense of parent/child relations), just without the C node. For an interesting implication of this assumption in a special case, so-called partial conditional independence in the class, and a way to resolve the arising difficulty in that case, please refer to Appendix A5. By partial conditional independence in the class we refer to the case in which the model consists of several parts (subsets of random variables) that are conditionally independent given the class variable C.
Assumption 3: Assume also that we have a set of training data from which $P(S_j, C;\theta_{S_j})$ can be inferred given $\theta_{S_j}$, for every j. This assumption means that there is an efficient way to approximate the marginal $P(S_j, C;\theta_{S_j})$ for a fixed value of $\theta_{S_j}$ from the training data (the previous assumption required that this marginal depend only on $\theta_{S_j}$, so this assumption is usually a natural extension of the previous one). For instance, in the thresholds example, when the thresholds are fixed, $P(S_j, C;\theta_{S_j})$ could be set to the maximum likelihood approximation (determined by the appropriate histogram calculated from the data) for each j.
The goal of this algorithm is to find $\theta = \{\theta_1,\dots,\theta_n\}$ for which

$$MI(C; F_1,\dots,F_n;\theta) = H(F_1,\dots,F_n;\theta) - H(F_1,\dots,F_n\mid C;\theta)$$

is maximal. In order to achieve this, we will show that under the assumptions above, the mutual information has a simple decomposition that can be used for the maximization.
$$H(F_1,\dots,F_n;\theta) = -E\big(\log P(F_1,\dots,F_n;\theta)\big) = -\sum_{F_1,\dots,F_n}\log\big(P(F_1,\dots,F_n;\theta)\big)\,P(F_1,\dots,F_n;\theta)$$

$$= -\sum_{F_1,\dots,F_n}\Big[\sum_{j=1}^n \log P\big(F_j\mid\Pi_j;\theta_{S_j}\big)\Big] P(F_1,\dots,F_n;\theta) = -\sum_{j=1}^n\sum_{S_j}\log P\big(F_j\mid\Pi_j;\theta_{S_j}\big)\,P\big(S_j;\theta_{S_j}\big)$$

The last equality holds because when we sum $P(F_1,\dots,F_n;\theta)$ for a fixed value of $S_j$, we get $P(S_j;\theta_{S_j})$.
Given our assumptions, $f_j(\theta_{S_j}) = -\sum_{S_j}\log P\big(F_j\mid\Pi_j;\theta_{S_j}\big)\,P\big(S_j;\theta_{S_j}\big)$ is a function of $\theta_{S_j}$ which can be calculated from the training data for each assignment of $\theta_{S_j}$. Since $P(S_j, C;\theta_{S_j})$ can be calculated from the training data, obviously $P(S_j;\theta_{S_j})$ and $P(F_j\mid\Pi_j;\theta_{S_j})$ can also be inferred from it.
We conclude that:

$$H(F_1,\dots,F_n;\theta) = \sum_j f_j(\theta_{S_j})$$

That is, $H(F_1,\dots,F_n;\theta)$ decomposes into a sum of local terms that depend on the local domains only.
A similar decomposition holds for $H(F_1,\dots,F_n\mid C;\theta)$:

$$H(F_1,\dots,F_n\mid C;\theta) = -E\big(\log P(F_1,\dots,F_n\mid C;\theta)\big) = -\sum_{C,F_1,\dots,F_n}\log\big(P(F_1,\dots,F_n\mid C;\theta)\big)\,P(C,F_1,\dots,F_n;\theta)$$

$$= -\sum_{C,F_1,\dots,F_n}\Big[\sum_{j=1}^n\log P\big(F_j\mid\Pi_j,C;\theta_{S_j}\big)\Big] P(C,F_1,\dots,F_n;\theta) = -\sum_{j=1}^n\sum_{C,S_j}\log P\big(F_j\mid\Pi_j,C;\theta_{S_j}\big)\,P\big(S_j,C;\theta_{S_j}\big)$$

Again, under our assumptions, $g_j(\theta_{S_j}) = -\sum_{C,S_j}\log P\big(F_j\mid\Pi_j,C;\theta_{S_j}\big)\,P\big(S_j,C;\theta_{S_j}\big)$ is a function of $\theta_{S_j}$ which can be calculated from the training data for each assignment of $\theta_{S_j}$.
We conclude that, under the above assumptions, the MI(F;C) maximization problem reduces to the following one:
Find an assignment of $\theta = \{\theta_1,\dots,\theta_n\}$ for which

$$\sum_{j=1}^n f_j(\theta_{S_j}) - \sum_{j=1}^n g_j(\theta_{S_j}) = \sum_{j=1}^n\big(f_j(\theta_{S_j}) - g_j(\theta_{S_j})\big)$$

is maximized.
Under this decomposition, the problem is equivalent to a MAP problem (or an MPF problem over the max-sum commutative semi-ring) for the unknown values of $\theta = \{\theta_1,\dots,\theta_n\}$. The local kernels for the MAP are $f_j(\theta_{S_j}) - g_j(\theta_{S_j})$, and the structure of this $\theta$-network is exactly the same as that of the original BN without the C node. The standard algorithm for computing MAP in loop-free graphical models (models which have a junction tree, as in our case) can therefore be used to determine the optimal values of $\theta = \{\theta_1,\dots,\theta_n\}$.
We conclude with a short description of our algorithm in light of the above:
1. For each j = 1,…,n, calculate $f_j(\theta_{S_j}) - g_j(\theta_{S_j})$ for each assignment to $\theta_{S_j}$ from the training data.
2. Apply an algorithm to solve the MAP problem of finding:

$$\theta^* = \arg\max_\theta \sum_{j=1}^n\big(f_j(\theta_{S_j}) - g_j(\theta_{S_j})\big)$$

3. Return $\theta^*$ as the optimal set of parameters.
Note that if the original BN was loop-free, i.e. had a JT that could be constructed from $\{S_j\}$, then the second step can be performed using GDL.
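A minimal sketch of these steps on a two-node chain TAN: the thresholds play the role of the $\theta_j$, the kernels are empirical MI estimates, and step 2 is the max-sum recursion over the chain (which, by the distributive law, matches exhaustive search). The score distributions and candidate thresholds are illustrative assumptions:

```python
import random
from collections import Counter
from math import log2

random.seed(1)
# Synthetic training set: class c in {0,1} with two feature scores.
data = [(c, random.gauss(c, 1.0), random.gauss(2 * c, 1.0))
        for c in (0, 1) for _ in range(200)]
thetas = [-0.5, 0.0, 0.5, 1.0]      # candidate thresholds theta_1, theta_2

def H(counts):
    tot = sum(counts.values())
    return -sum(v / tot * log2(v / tot) for v in counts.values() if v)

def kernel_sum(t1, t2):
    """Step 1: empirical MI(F1;C) + MI(F2;C|F1) for thresholds (t1, t2)."""
    joint = Counter((c, s1 > t1, s2 > t2) for c, s1, s2 in data)
    def marg(idx):
        m = Counter()
        for k, v in joint.items():
            m[tuple(k[i] for i in idx)] += v
        return m
    k1 = H(marg((0,))) + H(marg((1,))) - H(marg((0, 1)))       # MI(F1;C)
    k2 = (H(marg((0, 1))) + H(marg((1, 2)))
          - H(marg((1,))) - H(joint))                          # MI(F2;C|F1)
    return k1 + k2

# Step 2 as max-sum over the chain: per theta_1, take the best theta_2,
# then the best theta_1; equals the exhaustive maximum over all pairs.
table = {(t1, t2): kernel_sum(t1, t2) for t1 in thetas for t2 in thetas}
best_chain = max(max(table[(t1, t2)] for t2 in thetas) for t1 in thetas)
best_brute = max(table.values())
assert abs(best_chain - best_brute) < 1e-12
```

On a longer chain or tree the same recursion becomes a full GDL run in the max-sum semi-ring, rather than the nested `max` used here.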
Finally, note that:

$$g_j(\theta_{S_j}) = -\sum_{C,S_j}\log P\big(F_j\mid\Pi_j,C;\theta_{S_j}\big)\,P\big(S_j,C;\theta_{S_j}\big) = H\big(F_j\mid\Pi_j,C;\theta_{S_j}\big)$$

$$f_j(\theta_{S_j}) = -\sum_{S_j}\log P\big(F_j\mid\Pi_j;\theta_{S_j}\big)\,P\big(S_j;\theta_{S_j}\big) = H\big(F_j\mid\Pi_j;\theta_{S_j}\big)$$

Hence, the MAP local kernels are the mutual information between each BN node and the class given the node's parents, i.e. they are of the form:

$$f_j(\theta_{S_j}) - g_j(\theta_{S_j}) = H\big(F_j\mid\Pi_j;\theta_{S_j}\big) - H\big(F_j\mid\Pi_j,C;\theta_{S_j}\big) = MI\big(F_j; C\mid\Pi_j;\theta_{S_j}\big)$$

and hence we have also obtained the following useful equation:

$$(4.1.1)\qquad MI(F;C) = \sum_j MI\big(F_j; C\mid\Pi_j\big)$$

An application of the above MaxMI algorithm to a loop-free BN is described in Section 8.1, on the "feature threshold and ROI learning problem", for our all-observed visual-interpretation feature-based model.
4.2. MaxMI approximation on observed & unobserved models
In the previous section our goal was to maximize $MI(C; F_1,\dots,F_n;\theta)$ where $F_1,\dots,F_n$ were observed features (observed nodes of the BN) of the class C. And we achieved this (under the assumptions stated above) using the MaxMI algorithm.
However, the situation is different if we use a BN involving both observed nodes (feature nodes) and unobserved nodes. The goal remains the same, we still want to maximize $MI(C; F_1,\dots,F_n;\theta)$, but now $F_1,\dots,F_n$ are not the only nodes of the BN.
We will examine next the use of a model involving both unobserved ($X_i$) and observed ($Y_i$) nodes combined in a tree structure as follows:
Figure 6: TAN with unobserved nodes. All the observed and un-observed nodes have the
class node as their parent.
In the above illustration xi are unobserved nodes, yi are observed nodes, C node is the
class node and the abbreviation Par(xi) stands for parents of xi.
Moreover, we adopt the MaxMI assumptions:
Assumption 1: Removing the class node C leaves the model otherwise unchanged, i.e. the underlying graph representing the decomposition of the distribution of {x_i} and {y_i} alone (without C) has the same structure as the original graph with the C node and all of its edges removed.
Assumption 2: Given parameters $\theta_i$ and $\theta_j$ corresponding to $y_i$ and $y_j$ s.t. $x_i \in par(x_j)$, we can approximate the marginal $P(x_i, x_j, y_i, y_j;\theta_i,\theta_j)$ from the training data.
This approximation (for a fixed set of parameters) can be achieved by EM over the model restricted to the sub-graph containing the nodes $\{x_i, x_j, y_i, y_j\}$ alone. This is true by the definition of EM and our description of how it can be efficiently implemented in the loop-free cases (as ours here). In fact, we could also use a more involved EM technique, if we mix EM with the applied variant of MaxMI training. During the bottom-up pass of MaxMI, when the approximation of $P(x_i, x_j, y_i, y_j)$ is needed, the parameters for the nodes of the subtree rooted at $x_i$ which are best suited for $\theta_j$ are already established. Thus EM for this whole subtree can be applied in order to get a better approximation of the marginal.
Our goal is to maximize the information provided by the observed nodes regarding the class variable. That is, during learning we wish to maximize:
$$MI(Y; C) = H(Y) - H(Y \mid C)$$
where $Y = (y_1,\dots,y_n)$ is the vector of the observed variables.
We next use the fact that:
$$\begin{aligned}
1.\quad H(Y) &= -\sum_{Y} P(Y)\log P(Y) = -\sum_{X,Y} P(X,Y)\log\frac{P(X,Y)}{P(X \mid Y)}\\
&= -\sum_{X,Y} P(X,Y)\log P(X,Y) + \sum_{X,Y} P(X,Y)\log P(X \mid Y)
\end{aligned}$$
$$\begin{aligned}
2.\quad H(Y \mid C) &= -\sum_{C,Y} P(Y,C)\log P(Y \mid C) = -\sum_{C,X,Y} P(X,Y,C)\log\frac{P(X,Y \mid C)}{P(X \mid Y,C)}\\
&= -\sum_{C,X,Y} P(X,Y,C)\log P(X,Y \mid C) + \sum_{C,X,Y} P(X,Y,C)\log P(X \mid Y,C)
\end{aligned}$$
Thus $MI(Y;C)$ decomposes into a sum of two terms. The first is:
$$MI(X,Y;C) = -\sum_{X,Y} P(X,Y)\log P(X,Y) + \sum_{C,X,Y} P(X,Y,C)\log P(X,Y \mid C)$$
which can be decomposed into a sum of local contributions using the previous MaxMI technique, under the above assumptions. The decomposition is obtained exactly as in the previous derivation of equation (4.1.1):
$$MI(X,Y;C) = \sum_j MI(x_j; C \mid par(x_j);\theta_{par(j)}) + \sum_j MI(y_j; C \mid x_j;\theta_j)$$
The more problematic second term is:
$$\sum_{X,Y} P(X,Y)\log P(X \mid Y) - \sum_{C,X,Y} P(X,Y,C)\log P(X \mid Y,C) = \sum_{C,X,Y} P(X,Y,C)\log\frac{P(X \mid Y)}{P(X \mid Y,C)}$$
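The split of $MI(Y;C)$ into these two terms is an instance of the chain rule: the second term equals $-MI(X; C \mid Y)$, so $MI(Y;C) = MI(X,Y;C) - MI(X;C \mid Y)$. A quick numeric check on a made-up joint distribution (the alphabet sizes and random probabilities are arbitrary, chosen only for the test):

```python
import itertools
import math
import random

random.seed(1)

# A made-up joint distribution P(x, y, c) over small alphabets, used only to
# check the identity numerically; axis 0 is X, axis 1 is Y, axis 2 is C.
X, Y, C = range(2), range(3), range(2)
p = {k: random.random() for k in itertools.product(X, Y, C)}
z = sum(p.values())
p = {k: v / z for k, v in p.items()}

def marg(axes):
    """Marginal distribution over the listed axes."""
    out = {}
    for k, v in p.items():
        kk = tuple(k[i] for i in axes)
        out[kk] = out.get(kk, 0.0) + v
    return out

def mi(a_axes, b_axes):
    """MI(A;B) = sum P(a,b) log P(a,b) / (P(a) P(b))."""
    pa, pb, pab = marg(a_axes), marg(b_axes), marg(a_axes + b_axes)
    return sum(v * math.log(v / (pa[k[:len(a_axes)]] * pb[k[len(a_axes):]]))
               for k, v in pab.items() if v > 0)

def cond_mi(a_axes, b_axes, c_axes):
    """MI(A;B|C) via the chain rule: MI(A; B,C) - MI(A; C)."""
    return mi(a_axes, b_axes + c_axes) - mi(a_axes, c_axes)

mi_y_c = mi((1,), (2,))                     # MI(Y;C), the training objective
first_term = mi((0, 1), (2,))               # MI(X,Y;C)
second_term = -cond_mi((0,), (2,), (1,))    # -MI(X;C|Y)
assert abs(mi_y_c - (first_term + second_term)) < 1e-9
```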
Note that $P(X \mid Y)$ can be decomposed as follows:
$$(4.2.1)\qquad P(X \mid Y) = \prod_i P(x_i \mid par(x_i), Y_i)$$
where $Y_i$ is the subset of Y including all the observed nodes in the subtree rooted at $x_i$. For a detailed proof of (4.2.1) see Appendix A2.
Hence, we can extend the above decomposition to:
$$\sum_{C,X,Y} P(X,Y,C)\log\frac{P(X \mid Y)}{P(X \mid Y,C)} = \sum_{C,X,Y} P(X,Y,C)\sum_i \log\frac{P(x_i \mid par(x_i), Y_i)}{P(x_i \mid par(x_i), Y_i, C)}$$
$$= \sum_i \sum_{x_i, par(x_i), Y_i, C} P(x_i, par(x_i), Y_i, C)\log\frac{P(x_i \mid par(x_i), Y_i)}{P(x_i \mid par(x_i), Y_i, C)}$$
This decomposition resembles a sum of local terms, but there is one major problem with it: $P(x_i, par(x_i), Y_i, C)$ depends (in the most general case) on the parameters corresponding to all of the observed nodes in $Y_i$.
The contributing terms are therefore not local, as in the all-observable case examined before. However, under some additional simplifying assumptions we can use an approximation by local terms. It is natural to consider an approximation for $P(x_i \mid par(x_i), Y_i)$ in which we assume that given $par(x_i)$ and some of the $Y_i$, $x_i$ no longer depends on the rest of the $Y_i$. In particular, one can assume that $P(x_i \mid par(x_i), Y_i) \approx P(x_i \mid par(x_i), Y_{d_i})$, where $Y_{d_i}$ is the subset of $Y_i$ containing only $y_i$ (the observed node of $x_i$ itself) and the observed nodes of $d_i$ - the set of direct children of $x_i$. Under the latter assumption the above decomposition takes a simplified form:
$$\sum_i \sum_{x_i, par(x_i), Y_{d_i}, C} P(x_i, par(x_i), Y_{d_i}, C)\log\frac{P(x_i \mid par(x_i), Y_{d_i})}{P(x_i \mid par(x_i), Y_{d_i}, C)}$$
Now if we assume that $P(x_i, par(x_i), Y_{d_i}, C)$ can be inferred from the training data given the set of all the $Y_{d_i}$ parameters, we get that the above is a sum of local contributions (over the trained parameters). The inference of $P(x_i, par(x_i), Y_{d_i}, C)$ from the training data given the necessary parameters can be achieved using EM, for instance. This sum decomposition is organized in a tree of TREEWIDTH equal to the number of the learned parameters which affect $P(x_i, par(x_i), Y_{d_i}, C)$; in fact it is:
$$\max_i |Y_{d_i}| + 2$$
Keeping the TREEWIDTH low is of crucial importance for the computational complexity of the approximation. The TREEWIDTH, or the size of the maximal clique in the triangulated moral graph, controls the complexity of the most demanding message construction and passing operation during the run of the GDL we use for the maximization step of the training.
Summary
We conclude this subsection with a short summary of maximal information training for the un-observed & observed model. We've seen that when un-observed variables are present, the previous simple MaxMI decomposition doesn't apply. We've developed an alternative method for training in this case and provided a generally correct decomposition of the training objective into a sum of (large) local kernels, which gives a foundation for other (application-dependent) approximations. Further development of the "un-observed & observed model maximal information training" framework discussed here is one of the themes for future work. Empirical tests of this framework will be necessary to fully establish its usefulness.
Alternative to observed & un-observed model training
For applications in the field of visual interpretation, it is also interesting to consider the
following alternative for construction and training of observed & un-observed (O&U)
models.
In the visual interpretation application we assign to each observed node of the O&U model the meaning of a detector measuring the presence, in the target image, of the feature template residing in that node. At the same time, the un-observed node attached to the observed node is considered to be a binary RV taking the value 1 iff the object part which "stands behind" the feature template is present.
For example, an observed node may detect the presence of an "eye" feature, being an image patch with a corresponding NCC threshold. The value of the observed node is calculated regardless of the rest of the model. The un-observed node corresponding to this observed node will detect the presence of the "eye" face part. In order to calculate its value, it will use not only the information provided by its observed node, but also the data conveyed to it from its children and parents through the un-observed to un-observed edges of the model. However, if the "eye" feature is sufficiently "good", the "eye" un-observed node will rely on the value of its corresponding "eye" observed node as a good initial guess.
In light of the above, it seems reasonable that an O&U model of the form depicted in Figure 6, constructed from the all-observed model (with the same features) using the following steps, will perform well in visual interpretation applications:
1. Train the all-observed TAN model using Max-MI and restructuring techniques
discussed in subsequent sections.
2. Construct the O&U model by replacing each node of the all-observed model with an
un-observed node and attaching the replaced observed node to it as its corresponding
observed node detector.
3. Initialize the CPTs between the un-observed nodes and their observed nodes so that they show a strong dependence between the un-observed nodes and their corresponding observed detectors.
4. Run EM on the resulting O&U model (for instance soft EM) in order to increase the log-probability of the training data for this model and hence increase the model's "correctness", that is, its applicability to cases resembling the training data.
Please refer to the experimental results section for the empirical test results for the O&U
construction and training scheme suggested above.
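A minimal sketch of steps 2 and 3 above, under illustrative assumptions (binary nodes, CPTs stored as plain dictionaries; the feature names and the 0.9/0.1 "strong dependence" values are arbitrary choices, not from the thesis):

```python
# Hypothetical all-observed model: one detector feature per node.
observed_features = ["eye", "nose", "mouth"]

# Step 2: replace each observed node with an un-observed node x_<name> and
# re-attach the original observed node y_<name> as its detector child.
model = {f"x_{name}": {"detector": f"y_{name}"} for name in observed_features}

# Step 3: initialize the CPT P(y | x) of every detector edge with a strong
# dependence between the un-observed node and its observed detector
# (0.9 is an arbitrary illustrative value).
for node in model.values():
    node["cpt_y_given_x"] = {0: {0: 0.9, 1: 0.1},
                             1: {0: 0.1, 1: 0.9}}

# Step 4 would then run (soft) EM on the full model to refine these CPTs.
for node in model.values():
    for row in node["cpt_y_given_x"].values():
        assert abs(sum(row.values()) - 1.0) < 1e-12  # each row is a PDF
```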
4.3. MaxMI & TAN Restructuring
In the previous sections we developed a method for deriving the optimal parameters for
the classification model. The model itself was assumed to be fixed and given, and was
assumed to have the structure of a TAN, that is, a standard tree BN, but with a class node
attached to every node of the network. In this section we consider the dual problem of
constructing an optimal TAN model for the data, together with the assignment of optimal
parameters to the model.
The TAN model, together with an algorithm for inferring the "optimal" TAN structure and conditional PDFs for a given class and a given set of features (with fixed parameters), was introduced in [Friedman et al. 1997]. Here "optimal" means having the maximal log-probability of the training data. This notion of optimality guarantees "asymptotic correctness". This means that if the "true" model joint distribution (the one used to generate the training / test data) is a TAN, then given enough training data we will get back the original model used to generate it.
For a given set of features and a fixed set of feature parameters $\theta$, Friedman's algorithm selects the optimal TAN structure as the MST (Maximal weight Spanning Tree) of the complete graph with the features in the nodes and the edges weighted by the following weight function:
$$w_{TAN}(F_j, F_k) = MI(F_j; F_k \mid C;\theta_{F_j},\theta_{F_k})$$
The TAN structuring algorithm selects the tree which maximizes $\sum w_{TAN}(F_j, F_k)$.
As was described in the MaxMI sub-section above, the contribution of an edge to $MI(C; F_1,\dots,F_n;\theta)$ under the MaxMI assumptions is:
$$w_{MaxMI}(F_j, F_k) = MI(F_j; C \mid F_k;\theta_{F_j},\theta_{F_k})$$
where $F_k = Par(F_j)$, i.e. $F_k$ is the parent of $F_j$. Also note that the score maximized by MaxMI was $\sum w_{MaxMI}(F_j, F_k)$.
These comparisons suggest the use of a hybrid of the two schemes: the TAN restructuring for selecting the optimal structure given the model parameters, and the MaxMI for selecting the optimal model parameters given the model structure. This will be a scheme that attempts to maximize both the log-probability of the training data and $MI(C; F_1,\dots,F_n;\theta)$. It has to choose a TAN structure and parameters $\theta$ in such a way that, for the selected set of parameters, the TAN tree is the MST($\theta$) of the complete graph with $w_{TAN}(F_j, F_k)$ edge weights, and at the same time the sum of the $w_{MaxMI}(F_j, F_k)$ edge weights is maximal over all $\theta$ and the corresponding MST($\theta$).
In order to better understand the above reasoning, consider the following dilemma. Suppose that we have a fixed $\theta$. For it, there is an optimal TAN, which is MST($\theta$). For this fixed TAN, $\theta$ is no longer the optimal choice of parameters that maximizes $MI(C; F_1,\dots,F_n;\theta)$ or our approximation to it in the form of the sum of $w_{MaxMI}(F_j, F_k)$ over the edges of the TAN. However, as for any given set of parameters $\theta$, the optimal TAN for $\theta$ is the closest choice to the "true" joint distribution of the trained features, we give maximizing TAN precedence over maximizing MI. Hence, we require that the result of the "hybrid" training will produce on the one hand an optimal TAN for the resulting $\theta$, and on the other hand this optimal TAN will have the maximal MaxMI approximation weight over all the optimal TANs for other choices of $\theta$.
The above problem is complex and has no simple closed-form solution. In our experiments, we've tried simply iterating MaxMI parameter training and TAN restructure steps (more reasons for doing so are given in the following sub-section). The addition of TAN restructure steps caused a substantial improvement over the MaxMI results alone. Hence, intuitively, there is a very good reason for future research in the direction of unifying these two schemes under a single framework.
In the subsequent section we'll further develop the connection between the MaxMI and TAN restructure scores and suggest several possible methods for combining them under a unified hybrid approach.
4.4. Combining MaxMI and TAN restructuring
One possible approach for combining the MaxMI and TAN restructure algorithms is to define a weighted (with fixed normalized weights $\alpha$ and $(1-\alpha)$) average weight function for the edges of the complete graph:
$$w_{H1}(F_j, F_k) = \alpha\, w_{TAN}(F_j, F_k) + (1-\alpha)\, w_{MaxMI}(F_j, F_k)$$
where $0 \le \alpha \le 1$. When each edge of the complete graph has only a single weight $w_{H1}(F_j, F_k)$, the hybrid training algorithm can iteratively increase its score (the sum of $w_{H1}(F_j, F_k)$ over the edges of the MST of the complete graph) by:
1. Finding the MST of the complete graph for the current set of parameters.
2. Using GDL to choose the parameters to maximize the sum of $w_{H1}(F_j, F_k)$ over the MST edges.
3. Iterating steps 1 and 2 until convergence to a (local) optimum occurs.
The algorithm has to converge, as after each iteration the sum of $w_{H1}(F_j, F_k)$ over the current MST increases. The hope is that an appropriately chosen $\alpha$ will give good results.
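The three steps above can be sketched as follows. The tiny feature set, the discrete parameter grid, and the random table standing in for $w_{H1}$ are illustrative assumptions; Kruskal's algorithm plays the role of the MST step, and a brute-force search over the tree stands in for the GDL maximization:

```python
import itertools
import random

random.seed(2)
n_feat, K = 4, 3   # 4 features, 3 candidate parameter values per feature
# Random stand-in for w_H1(F_j, F_k) as a function of (theta_j, theta_k).
w = {(j, k): {(a, b): random.random() for a in range(K) for b in range(K)}
     for j in range(n_feat) for k in range(j + 1, n_feat)}

def tree_score(tree, theta):
    return sum(w[j, k][theta[j], theta[k]] for (j, k) in tree)

def max_spanning_tree(theta):
    """Step 1: Kruskal on the complete graph, maximizing total weight."""
    parent = list(range(n_feat))
    def root(i):
        while parent[i] != i:
            i = parent[i]
        return i
    tree = []
    for (j, k) in sorted(w, key=lambda e: w[e][theta[e[0]], theta[e[1]]],
                         reverse=True):
        rj, rk = root(j), root(k)
        if rj != rk:               # keep the edge if it joins two components
            parent[rj] = rk
            tree.append((j, k))
    return tree

def best_theta(tree):
    """Step 2: brute-force stand-in for the GDL max over parameters."""
    return max(itertools.product(range(K), repeat=n_feat),
               key=lambda theta: tree_score(tree, theta))

theta, prev = (0,) * n_feat, float("-inf")
for _ in range(20):                 # step 3: iterate until convergence
    tree = max_spanning_tree(theta)
    theta = best_theta(tree)
    cur = tree_score(tree, theta)
    if cur <= prev + 1e-12:         # the score is monotone, so this stops
        break
    prev = cur
assert len(tree) == n_feat - 1      # a spanning tree over the features
```

Each pass through steps 1 and 2 can only raise the score, which is why the loop must terminate.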
Another possibility is to try a greedy approach based on the weight functions, which adds the tree nodes one by one, each time adding the node which gives the best $w_{MaxMI}$ score while attaching it to the "parent" node s.t. the $w_{TAN}$ score is maximized.
Next we continue to develop the relationship between the two weights maximized by the MaxMI and TAN restructuring approaches, $w_{MaxMI}$ and $w_{TAN}$. A closer look at the MaxMI edge scoring function reveals the following facts:
$$\begin{aligned}
w_{MaxMI}(F_j, F_k) &= MI(F_j; C \mid F_k;\theta_{F_j},\theta_{F_k})\\
&= MI(F_j; C, F_k;\theta_{F_j},\theta_{F_k}) - MI(F_j; F_k;\theta_{F_j},\theta_{F_k})\\
&= MI(F_j; C;\theta_{F_j}) + MI(F_j; F_k \mid C;\theta_{F_j},\theta_{F_k}) - MI(F_j; F_k;\theta_{F_j},\theta_{F_k})\\
&= w_{TAN}(F_j, F_k) + MI(F_j; C;\theta_{F_j}) - MI(F_j; F_k;\theta_{F_j},\theta_{F_k})
\end{aligned}$$
Note that $w_{TAN}(F_j, F_k)$ is included in $w_{MaxMI}(F_j, F_k)$ as a positive summand. This, together with the special structure of their difference, which contains $MI(F_j; F_k;\theta_{F_j},\theta_{F_k})$ (the $w_{TAN}$ of the "feature only" joint distribution under the MaxMI assumptions), suggests an approach for combining MaxMI with TAN restructure. The approach would be to simply iterate MaxMI and TAN restructure steps one after another. The MaxMI steps will set the parameters for the subsequent TAN restructure step, and the TAN step will set the structure for the subsequent MaxMI steps. Each restructure step would increase $\sum w_{TAN}(F_j, F_k)$ and thus increase the model log-probability (and hence asymptotic "correctness"), and each subsequent MaxMI step would increase:
$$\sum w_{MaxMI}(F_j, F_k) = \sum\bigl(w_{TAN}(F_j, F_k) + MI(F_j; C;\theta_{F_j}) - MI(F_j; F_k;\theta_{F_j},\theta_{F_k})\bigr)$$
and hence the MI of the model to the class.
However, each MaxMI step can potentially decrease the TAN score, just as each TAN step can decrease the MaxMI score. This is due to the negative summand $-MI(F_j; F_k;\theta_{F_j},\theta_{F_k})$ in the score of the MaxMI step. Hence, the above algorithm isn't guaranteed to converge on its own. Therefore, if we use this algorithm to train our model, it will require some stopping criterion, for instance reaching a fixed number of iterations or iterating until the results stop improving. As will be seen from the experimental results, the latter hybrid approach gives better results than MaxMI alone.
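The expansion of $w_{MaxMI}$ used above is an exact identity, $MI(F_j;C \mid F_k) = MI(F_j;F_k \mid C) + MI(F_j;C) - MI(F_j;F_k)$, and it can be checked numerically on any joint distribution of $(F_j, F_k, C)$; the random binary joint below is made up purely for the check:

```python
import itertools
import math
import random

random.seed(3)
# Made-up joint P(f_j, f_k, c) over binary alphabets (axis 0 = F_j,
# axis 1 = F_k, axis 2 = C), used only to verify the identity.
p = {v: random.random() for v in itertools.product(range(2), repeat=3)}
z = sum(p.values())
p = {v: q / z for v, q in p.items()}

def H(axes):
    """Joint entropy of the selected axes."""
    m = {}
    for v, q in p.items():
        key = tuple(v[i] for i in axes)
        m[key] = m.get(key, 0.0) + q
    return -sum(q * math.log(q) for q in m.values() if q > 0)

def I(a, b):        # MI(A;B) = H(A) + H(B) - H(A,B)
    return H(a) + H(b) - H(a + b)

def Ic(a, b, c):    # MI(A;B|C) = MI(A; B,C) - MI(A;C)
    return I(a, b + c) - I(a, c)

w_maxmi = Ic((0,), (2,), (1,))   # MI(F_j; C | F_k)
w_tan = Ic((0,), (1,), (2,))     # MI(F_j; F_k | C)
rhs = w_tan + I((0,), (2,)) - I((0,), (1,))
assert abs(w_maxmi - rhs) < 1e-9
```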
We next use the relation between $w_{MaxMI}$ and $w_{TAN}$ to derive a new optimization criterion and an alternative hybrid learning procedure.
Alternative hybrid approach
One of the terms in the expansion of $w_{MaxMI}$ above is $-MI(F_j; F_k;\theta_{F_j},\theta_{F_k})$. This term can be viewed as a TAN score of a model that is closely related to the TAN model: a model consisting only of the feature nodes (without the class node), structured in the same way as the TAN. Recall that this model, which consists of the observable features only, was also used in developing the MaxMI learning above. We have assumed that this "feature nodes only" model represents the "true" joint PDF decomposition of the feature nodes without the class. Maximizing the $\sum MI(F_j; F_k;\theta_{F_j},\theta_{F_k})$ score, by the results of Chow and Liu [21], maximizes the log-probability of the "feature nodes only" model and thus makes it more asymptotically correct. That is, if the true model of the joint PDF of the feature nodes alone is a tree, then it will be obtained given a sufficient amount of training data.
We conclude that since the applicability of MaxMI relies on the assumption of structural invariance to class node removal, it makes sense to add $\sum MI(F_j; F_k;\theta_{F_j},\theta_{F_k})$ to the MaxMI score, which gives a higher preference to models that are compatible with this assumption. We therefore propose to learn a class model by maximizing the following score:
$$(4.4.1)\qquad \sum w_{MaxMI}(F_j, F_k) + \sum MI(F_j; F_k;\theta_{F_j},\theta_{F_k})$$
which is also equivalent to maximizing:
$$(4.4.2)\qquad \sum w_{TAN}(F_j, F_k) + \sum MI(F_j; C;\theta_{F_j})$$
To maximize this score, we can use the following iterative procedure. We maximize the new score using the bottom-up, top-down MaxMI procedure. This results in new values for the parameters $\theta$ which give the global maximum of the score for the current tree structure. We next use the features with the fixed parameters obtained by the MaxMI procedure, and apply the TAN restructure algorithm to maximize the $\sum w_{TAN}(F_j, F_k)$ part of the score by changing the structure of the feature tree. Note that we cannot use the MaxMI step for maximizing $\sum MI(F_j; C;\theta_{F_j})$ alone, as this could potentially decrease $\sum w_{TAN}(F_j, F_k)$ with the change of parameters.
By iterating these steps we obtain an algorithm which is guaranteed to increase the score at each step. MaxMI steps increase the full score and TAN steps increase just the $\sum w_{TAN}(F_j, F_k)$ summand, hence no step will ever decrease the score. Hence, the suggested approach is guaranteed to converge to a local maximum of the score function.
To summarize, the alternative hybrid algorithm uses the following procedure:
1. Start with some initial set of feature parameters $\theta_0$.
2. Construct an optimal TAN in the usual way (Friedman), which just maximizes $\sum w_{TAN}(F_j, F_k)$.
3. Apply the maximization stage on the TAN which resulted from step 2. At this stage we maximize the score (4.4.2): $\sum w_{TAN}(F_j, F_k) + \sum MI(F_j; C;\theta_{F_j})$. The maximization can be done by GDL, since the sum to be maximized can be decomposed into local domains that form a junction tree.
4. Return to step 2. Note that we need not maximize the full (4.4.2) score in order to guarantee its monotonicity, as changing the structure of the TAN doesn't affect the $\sum MI(F_j; C;\theta_{F_j})$ summand, and hence step 2 only increases the score.
5. The above iterations continue until the score stops increasing.
The above procedure maximizes the score (4.4.2) and thus also the score (4.4.1): $\sum w_{MaxMI}(F_j, F_k) + \sum MI(F_j; F_k;\theta_{F_j},\theta_{F_k})$, which maximizes MI(F;C) (by the first term), but also makes sure (using the second term) that the tree we get will be as close to the MaxMI requirement as possible.
The complete approach for feature, structure and parameter selection training
At this point we will sketch two approaches for so-called complete model training. The complete approach receives only a set of training examples, say a set of training images, and finds the "best" features, say image patches, together with their parameters, say thresholds and ROIs, and the model structure, e.g. an appropriate loop-free BN.
Let us now discuss two possible complete approaches involving our novel techniques of maximal information training.
Constrained TAN with feature selection:
This approach incorporates a novel feature selection technique developed recently by B. Epshtein and S. Ullman, see [8] for reference. The technique selects features in a hierarchical manner, each time breaking the lowest-level features of the feature tree into a set of sub-features which comprise the subsequent tree level. The complete approach uses this technique for the feature selection. After breaking a feature, we apply the hybrid approach to the resulting tree (with the sub-features of the currently broken features attached to their parent feature). The hybrid approach involves the usual MaxMI steps, but the TAN steps are replaced by a so-called "constrained TAN" step, which is restricted to allow restructuring of only the sub-features of the currently broken feature. This constrained form of the TAN step does not allow it to change the hierarchical relationships in a free manner and hopefully results in constructing a more intuitive model. The development of this approach is an interesting theme for future research.
MaxMI for feature selection:
This approach does not differ from the MaxMI approach discussed in the previous sections. In fact, here we suggest a method for using MaxMI not only for parameter training, but also for feature selection. The suggested technique simply regards the features residing in the nodes of the all-observed TAN model as parameters. This means that we apply the hybrid approach described previously, while the MaxMI steps retain the model structure and select features together with their parameters. For instance, assume we use this technique to select image patch features and train threshold and ROI parameters for them. Then the MaxMI steps of the hybrid algorithm will regard the training image number, x, y, width and height of the image patches as part of their parameters, just like the threshold and ROI. This means that the MaxMI steps will only use the model structure as the skeleton for the next set of features, which is filled in by the MaxMI together with their threshold and ROI parameters.
Of course, learning features together with their parameters considerably enlarges the support of the local kernels of the MaxMI. This in turn will have a significant effect on the computational cost of this approach. However, several heuristics can be used which restrict the search scope of the different MaxMI steps in order to get a much more efficient variant of this approach. For instance, subsequent MaxMI steps could search only around the features found in previous steps in order to get a so-called coarse-to-fine approach. Further study of this technique is also an interesting topic for future research.
4.5. Maximizing MI vs. Minimizing PE
In previous sections, we have described our training technique, whose goal was to train the model in all its aspects (structure, parameters and choice of features) by maximizing the Mutual Information (MI) between the class and the model. Another possible criterion for model selection is based on the Probability of Error (PE) rather than on maximizing MI. The MAP classifier decides on the appropriate class value by choosing the most probable class value given a specific assignment to the evidence (observed) variables. Under the MAP decision rule, the "best" classifier is the one having the minimal PE. Natural questions that arise in this context are:
In which cases does maximizing MI also minimize PE?
Is minimizing PE superior to maximizing MI under non-MAP decision rules?
In this section we will deal with these questions. We will describe the governing dynamics behind maximizing MI and minimizing PE in the so-called "ideal" training case. By the ideal case, we refer to the case when the training algorithm selects a model from all possible models, i.e. models producing all possible CPTs with the learned class. In the ideal case we will give a simple description of the models maximizing MI and the models minimizing PE. These descriptions, which are interesting even outside the scope of the present comparison, will allow us to discuss the cases in which maximizing MI also produces models which minimize PE.
In this section we will also discuss the disadvantages of using the “minimize PE”
paradigm compared with maximal MI. The disadvantages include computational
efficiency issues, as well as the use of non-MAP decision rules.
Finally, we will provide several reasons for using MI maximization as a classification
model training criterion. We will argue that MI maximization can be viewed as an
optimal criterion for training classification models.
4.5.1. Maximizing MI and Minimizing PE in the “ideal” training case
Let us first describe the notation. Assume we have a class represented by an n-valued Random Variable (RV) C taking values from the set $\{1,\dots,n\}$. Denote by $c_j = P(C = j)$ the values of the prior probability of C. W.l.o.g. we assume that the values of C are ordered in a way such that the following is true:
$$c_1 \ge c_2 \ge \dots \ge c_n$$
Assume that we want to represent, or measure, C by a k-valued RV F taking values from the set $\{1,\dots,k\}$, where $k \le n$. The representation of C using F is expressed by the CPT of C given F and by the probability distribution of F. We denote by $p_{ij} = P(C = j \mid F = i)$ the values of the CPT of the representation, and by $r_i = P(F = i)$ the values of F's probability distribution. We assume that we operate under the "ideal" case scenario, in which the CPT of C given F and the probability distribution of F can be set to any desired value. That is, for any CPT and probability distribution we can find an appropriate F which is distributed according to the distribution and produces such a CPT with the given C.
The following are direct consequences of the above notation:
$$\sum_{j=1}^{n} c_j = \sum_{j=1}^{n} P(C = j) = 1$$
$$\sum_{j=1}^{n} p_{ij} = \sum_{j=1}^{n} P(C = j \mid F = i) = 1$$
$$\sum_{i=1}^{k} r_i\, p_{ij} = \sum_{i=1}^{k} P(F = i)\, P(C = j \mid F = i) = \sum_{i=1}^{k} P(C = j, F = i) = c_j$$
We will now describe the form of C's representation using F which is obtained by selecting the best representation under the minimizing PE training paradigm.
First we make the term PE explicit:
$$P_E = P_E(C, F) = \sum_{(c,f)\in B(C,F)} P(C = c, F = f) = \sum_f P(F = f)\, P\bigl(C \ne \arg\max_v P(C = v \mid F = f) \,\big|\, F = f\bigr)$$
where
$$B(C,F) = \bigl\{(c,f) \,\big|\, c \ne \arg\max_v P(C = v \mid F = f)\bigr\}$$
is the set of all pairs $(c,f)$ for which c would be misclassified by the MAP decision rule, if it is known that the feature has the value f.
Claim 4.5.1.1: Structure of the min. PE solution
A k-valued feature F obtains min. PE, that is, for a given n-valued class C the minimum possible value of the function $P_E(C,F)$ is obtained in F, iff the CPT of C given F assumes the following form:
There exists $\pi: \{1,\dots,k\} \to \{1,\dots,k\}$, a (one-to-one) permutation of the set $\{1,\dots,k\}$, such that:
$$\pi(i) = \arg\max_j p_{ij} = \arg\max_j P(C = j \mid F = i)$$
$$\forall\, 1 \le j \le k,\ \forall\, i \ne j:\quad p_{i\pi(j)} = P(C = \pi(j) \mid F = i) = 0$$
In other words, for any value i of F there is a corresponding class value $\pi(i)$ (one of the k most probable in the distribution of C), which is the most probable value of the CPT for $F = i$. At the same time, the probability of obtaining the class value $\pi(i)$ for any value of F different from i is zero.
The global minimum of PE which is obtained by using such an F is equal to:
$$\min_F P_E(C, F) = 1 - \sum_{j=1}^{k} c_j$$
Proof: See Appendix A3 ▄
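The claim can be illustrated numerically. The priors and the structured joint below are made-up examples; the check computes the MAP-rule PE, $1 - \sum_f \max_c P(C = c, F = f)$, for a feature with the claimed structure and compares it against random features with the same class marginals:

```python
import random

random.seed(4)
c = [0.30, 0.25, 0.20, 0.15, 0.10]   # made-up class priors, c_1 >= ... >= c_n
k = 2                                 # a binary feature (k = 2)

def pe(joint):
    """PE of the MAP rule: 1 - sum_f max_c P(C = c, F = f)."""
    return 1.0 - sum(max(joint[j][f] for j in range(len(c)))
                     for f in range(k))

# A joint P(C = j, F = i) with the claimed min-PE structure (pi = identity):
# class 1 occurs only with F = 1 and class 2 only with F = 2, while the
# remaining classes are split so each column's maximum stays on pi(i).
structured = [[0.300, 0.000],
              [0.000, 0.250],
              [0.100, 0.100],
              [0.075, 0.075],
              [0.050, 0.050]]
assert abs(pe(structured) - (1.0 - (c[0] + c[1]))) < 1e-12  # 1 - sum_{j<=k} c_j

# Random features with the same class marginals never do better.
for _ in range(200):
    q = [random.random() for _ in c]
    joint = [[cj * qj, cj * (1 - qj)] for cj, qj in zip(c, q)]
    assert pe(joint) >= pe(structured) - 1e-12
```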
We next consider the maximizing MI training paradigm. A feature F with k values best approximates the n-valued class C in the max. MI training paradigm, if the expression:
$$MI(C; F) = H(C) - H(C \mid F)$$
is maximized, which is equivalent (as for a given C, $H(C)$ is constant) to demanding that the residual entropy $H(C \mid F)$ is minimized by the best F. In our notation, the residual entropy takes the following form:
$$H(C \mid F) = -\sum_{c,f} P(C = c, F = f)\log P(C = c \mid F = f) = -\sum_{i=1}^{k}\sum_{j=1}^{n} r_i\, p_{ij}\log p_{ij}$$
Claim 4.5.1.2: Structure of the max. MI solution
A k-valued feature F solves max. MI, that is, for a given n-valued class C the minimal possible value of $H(C \mid F)$ is obtained in F, iff the CPT of C given F assumes the following form:
There exists a group of sets $A_1, A_2,\dots,A_k$ such that $\bigcup_{i=1}^{k} A_i = \{1,\dots,n\}$ as a disjoint union, that is, for every two sets $A_i$ and $A_j$ ($i \ne j$): $A_i \cap A_j = \emptyset$.
If we define a random variable A taking values from the set $\{A_1, A_2,\dots,A_k\}$, distributed according to:
$$P(A = A_i) = P(C \in A_i) = \sum_{j \in A_i} P(C = j)$$
then A has the maximum entropy over all possible choices of $A_1, A_2,\dots,A_k$. That is, $H(A)$ obtains its maximal possible value (over all possible choices of such a group of sets) for this choice of $A_1, A_2,\dots,A_k$.
The entries of the CPT of C given F are as follows:
$$\forall\, 1 \le i \le k,\ 1 \le j \le n:\quad p_{ij} = P(C = j \mid F = i) = \begin{cases} \dfrac{c_j}{P(A_i)} & j \in A_i \\ 0 & j \notin A_i \end{cases}$$
where $P(A_i) = \sum_{j \in A_i} P(C = j) = \sum_{j \in A_i} c_j$.
The probability distribution of F is: $\forall\, 1 \le i \le k:\ r_i = P(F = i) = P(A_i)$.
In other words, F divides the class values into k disjoint sets $A_1, A_2,\dots,A_k$, and this grouping, without knowing the value of F, has the maximal possible entropy. The CPT of C given F is such that knowing the value of F determines to which set of the $A_1, A_2,\dots,A_k$ the true value of C belongs.
The global minimum of $H(C \mid F)$ which is obtained using such an F is equal to:
$$\min_F H(C \mid F) = H(C) - H(A)$$
where A is the random variable described above, taking values from $\{A_1, A_2,\dots,A_k\}$. Thus the global maximum of $MI(C; F)$ which is obtained by using such an F is:
$$\max_F MI(C; F) = H(C) - \min_F H(C \mid F) = H(C) - \bigl(H(C) - H(A)\bigr) = H(A)$$
Proof: See Appendix A4. We give a partial proof for the general case, and a full proof of the case $k = 2$. The full proof for an arbitrary value of k should be similar and is a theme for future research ▄
The result above about the structure of the maximum MI solution has an intuitive explanation. The entropy of A is subtracted from the entropy of C when the value of F is known: knowing the value of F removes the uncertainty (i.e. the entropy) of choosing the right value of A from the consideration.
Using the above claims regarding the structure of the max. MI and min. PE solutions, we derive the following simple rule for deciding when a feature F solving max. MI also solves min. PE. This rule follows directly from the two claims:
For a given n-valued class C, a k-valued feature F which maximizes $MI(C; F)$ also minimizes $P_E(C, F)$ iff the most probable class values 1, 2,…,k each reside in a different set of the sets $A_1, A_2,\dots,A_k$ which correspond to F.
Using this rule leads us to the following conclusions:
There are cases in which a feature F which solves max. MI and which can be obtained in the "ideal" training case will not be a solution to min. PE. As an example, consider a 5-valued class distributed according to $\{\frac{1}{4}, \frac{1}{4}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}\}$ and a 2-valued feature. Using Claim 4.5.1.2, one of the two sets which correspond to the max. MI solution will contain both of the most probable values (the ones with probability $\frac{1}{4}$), and thus, by the above rule, the solution to max. MI will not be a min. PE solution.
When the probabilities of the most probable class values are sufficiently large, or when k is sufficiently large with respect to n, then it is reasonable, by the pigeonhole (Dirichlet) principle, that the most probable class values will be distributed among the different sets. One possible example is when the class is uniformly distributed over $\{1,\dots,n\}$. Here, all the class values can be considered most probable; hence no matter how they are distributed between the sets of the max. MI solution, this solution will always be a min. PE solution.
Another example is that of a binary feature in the classification problem of some natural class, say faces. Usually, in such a problem, the most probable class value would be the non-class value (for instance the non-face value for a C able to distinguish between 99 face types and a non-face value). Thus it is natural to assume that the non-class probability would be larger than $\frac{1}{2}$ (which is usually correct considering the variability of the natural examples), which immediately assigns it to be the only element of one of the two sets of the binary feature's max. MI solution. Hence, in such a case the max. MI solution will also be the min. PE solution.
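The 5-valued example above can be verified numerically. The code enumerates all partitions of the class values into two non-empty sets, confirms that the unique maximum-entropy partition groups the two $\frac{1}{4}$ values together, and compares the resulting PE ($\frac{7}{12}$) with the minimum achievable PE ($\frac{1}{2}$); this is a check of the example, not thesis code, and exact fractions are used to avoid rounding:

```python
import math
from fractions import Fraction

priors = [Fraction(1, 4), Fraction(1, 4),
          Fraction(1, 6), Fraction(1, 6), Fraction(1, 6)]

def entropy(probs):
    return -sum(float(p) * math.log(float(p)) for p in probs if p > 0)

# Enumerate all partitions of {0..4} into two non-empty sets A1, A2
# (class 0 is pinned to A1 so each partition is counted once).
best = None
for mask in range(2 ** 4):
    a1 = {0} | {j + 1 for j in range(4) if mask >> j & 1}
    a2 = set(range(5)) - a1
    if not a2:
        continue
    h = entropy([sum(priors[j] for j in a1), sum(priors[j] for j in a2)])
    if best is None or h > best[0]:
        best = (h, a1, a2)

h_a, a1, a2 = best
assert a1 == {0, 1}                      # both 1/4-classes share one set
assert abs(h_a - math.log(2)) < 1e-12    # P(A1) = P(A2) = 1/2

# The MAP rule can only pick the largest class inside the revealed set:
pe_maxmi = 1 - (Fraction(1, 4) + Fraction(1, 6))   # = 7/12
pe_min = 1 - (Fraction(1, 4) + Fraction(1, 4))     # = 1/2, the min-PE value
assert pe_maxmi == Fraction(7, 12) and pe_min == Fraction(1, 2)
assert pe_maxmi > pe_min                  # max. MI here is not min. PE
```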
4.5.2. Disadvantages of Minimizing PE
In the previous sections we examined when max. MI and min. PE coincide. In this section we consider the case when they are different. Let us first make explicit the meaning of PE. By the Probability of Error, PE, we refer to the probability of making a mistake when answering a single classification query using the MAP decision strategy. The analytic expression of PE is:
$$P_E = P_E(C, F) = \sum_f P(F = f)\, P\bigl(C \ne \arg\max_c P(C = c \mid F = f) \,\big|\, F = f\bigr)$$
The first drawback of using PE minimization as the goal of the training scheme, is that
there is no known way to represent PE as a sum or product of local kernels (functions of
small local domains) in the general case. Furthermore, there are no known general
conditions under which such a decomposition exists. In contrast, we have shown that under
the general assumptions specified at the beginning of section 4.1, the analytic expression
of MI has a decomposition into a sum of local kernels. This introduces a major
computational difference between PE minimization and MI maximization. It is much more
efficient to maximize a decomposable function (using GDL, for example) than to minimize a
general non-continuous function. The minimized function PE is usually discontinuous,
because the set of training examples is finite and thus the marginal distribution tables,
usually approximated using histograms, are step functions of the feature parameters.
Moreover, the minimization problem in the general case (when there is no decomposition)
can be exponential. Therefore, minimizing PE suffers from a computational inefficiency
compared with maximizing MI: if the assumptions for MI decomposition are satisfied, then
maximizing MI is exponentially more efficient than minimizing PE.
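The computational gap can be illustrated on a toy chain decomposition (arbitrary random kernels; this sketch is our own, not from the thesis). A GDL-style max-sum forward pass maximizes the decomposable objective in time linear in the chain length, and exhaustive search confirms the result.

```python
import itertools
import random

random.seed(0)
K, N = 4, 6                                    # variable arity, chain length
# chain kernels g_i(x_i, x_{i+1}), so f(x) = sum_i g_i(x_i, x_{i+1})
g = [[[random.random() for _ in range(K)] for _ in range(K)] for _ in range(N - 1)]

def f(x):                                      # the decomposable objective
    return sum(g[i][x[i]][x[i + 1]] for i in range(N - 1))

# GDL-style max-sum forward pass: m[v] = best score of x_0..x_i with x_i = v
m = [0.0] * K
for i in range(N - 1):
    m = [max(m[u] + g[i][u][v] for u in range(K)) for v in range(K)]
dp_max = max(m)                                # O(N * K^2) work

brute_max = max(f(x) for x in itertools.product(range(K), repeat=N))  # O(K^N)
```

Both computations return the same maximum, but the dynamic program scales linearly in N while the brute-force search is exponential.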
The second drawback of the min. PE training paradigm lies in the definition of PE as the
error of a "single query MAP decision" scheme. In many practical situations, we do not
want the classification scheme to give only a single "best" guess; instead we would like
it, for instance, to arrive at the correct answer in a minimal number of guesses.
Well-known information-theoretic results imply that to achieve a minimal number of
guesses, the best strategy is the one that maximizes MI, rather than the one that
minimizes PE. Furthermore, if we gradually increase the number of allowed guesses, then
the min. PE solution will tend to the max. MI solution.
We conclude that the criteria of maximizing MI and minimizing PE in training are often
closely related, and that the decision as to which of the two is superior as a training
goal is application dependent.
4.5.3. MI(C;F) maximization as a classification model training criterion
Consider a given classification problem for a class C and assume that our task is to
construct a model based on a set of features F, and determine their optimal parameters, in
order to solve this classification problem. In this section we argue that a useful criterion
for achieving this task is selecting F, feature parameters and model structure so that
MI(C;F) is maximized. This claim is based on the following arguments:
The inverse Fano inequality (the proof of which is given in section 4) states that:

P_E ≤ (1/2) · (H(C) − MI(C;F))

where P_E is the probability of an error in the MAP classification scheme and H(C) is
the constant entropy of the class. Therefore, maximizing the mutual information
between the model and the class, that is MI(C;F), reduces the upper bound on P_E.
Moreover, for a binary class, the Fano inequality states that:

H(C) − MI(C;F) ≤ H(P_E)

Thus having MI(C;F) not equal to its maximal possible value, H(C), gives a non-zero
lower bound on P_E. Therefore we argue that an optimal classification model must
have MI(C;F) maximized in order to achieve the smallest possible P_E.
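The two Fano-type bounds above can be checked numerically. The joint distribution below is arbitrary, chosen only for illustration; the computation evaluates the MAP error, H(C), and MI(C;F), so both bounds can be compared directly.

```python
from math import log2

# joint P(C=c, F=f) for a binary class C and a 3-valued feature F
# (arbitrary illustrative numbers, summing to 1)
P = [[0.30, 0.10, 0.05],     # C = 0
     [0.05, 0.15, 0.35]]     # C = 1

pf = [P[0][f] + P[1][f] for f in range(3)]        # marginal of F
pc = [sum(P[c]) for c in range(2)]                # marginal of C

h = lambda ps: -sum(p * log2(p) for p in ps if p > 0)   # entropy of a distribution

H_C  = h(pc)
H_CF = sum(pf[f] * h([P[c][f] / pf[f] for c in range(2)]) for f in range(3))
MI   = H_C - H_CF                                 # MI(C;F) = H(C) - H(C|F)

# probability of MAP error: mass not captured by the best guess per feature value
PE = sum(pf[f] - max(P[0][f], P[1][f]) for f in range(3))
```

For these numbers PE = 0.2, which indeed lies below (H(C) − MI(C;F))/2, while the binary entropy of PE lies above H(C) − MI(C;F), as the two inequalities require.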
In section 4.5.1 we further explored the connection between MI(C;F) maximization and
P_E minimization. In that section we gave simple descriptions of the structures of the
solutions of both MI(C;F) maximization and P_E minimization in the ideal case, that is,
the case in which a model exhibiting any joint distribution P(C,F) can be selected.
Using these simple descriptions we derived a simple rule for checking whether a max.
MI(C;F) solution is a min. P_E solution in the ideal case. Using this rule we also gave
an intuition why solving max. MI(C;F) in some natural cases approximates min. P_E
solutions.
In section 4.5.2 we described several disadvantages of straightforward P_E minimization.
A major disadvantage of P_E minimization is that there is no known decomposition of P_E
into a sum or a product of local factors (which depend only on a small subset of the
trained model parameters). This gives a major computational advantage to MI(C;F)
maximization, since in section 4.1 we specified several general assumptions under which
MI(C;F) can be decomposed into a sum of local factors, on which efficient training
algorithms can be applied.
Even when MI(C;F) is maximal, in order to efficiently perform inference, such as
computing MAP, on a general model, we need that model to be loop free. Here,
"efficiently" means in non-exponential time unless P = NP.
Reducing the Kullback-Leibler divergence between the model and the true joint
probability of the features and the class guarantees the robustness of the model. By
robustness of the model we refer to avoiding overfitting to the training data and hence
making the trained model better applicable to unseen examples.
Based on the above arguments, we are convinced that maximizing MI(C;F) can be viewed
as an optimal criterion for classification model learning, i.e. feature selection, model
construction and parameter training.
Part II: Inference on Loopy Networks
5. Existing approaches for coping with loopy networks
In this part of the thesis we consider the problem of solving or approximating inference
problems on loopy networks.
As discussed earlier, some interesting algorithms on network models (BN, MRF, etc.),
such as EM distribution learning or the various MaxMI-based training algorithms, require
solving the Marginalize a Product Function (MPF, [1]) problem (regardless of the loops
that may or may not be present). Furthermore, an efficient method for solving the MPF
problem immediately gives rise to efficient versions of these algorithms.
Many real-life problems in various areas of research can be thought of as instances of
the inference problem on loopy networks. Examples can be found in the coding theory [13]
(Turbo Codes), vision (which is the focus of the current work) and artificial intelligence
[3] communities. For instance, many natural models of visual classification naturally
include loops.
Inference on loopy networks is known to be NP-hard [17, 18]. Various approximations to
the solution have been suggested, and some of them will be described in the following
section. In general, the GDL-like algorithms discussed in the previous section are also
known to provide (often surprisingly good) approximations in some cases (as for Turbo
Codes [13] for instance).
Recent work by Yedidia et al. [4] has shed some light on these cases by showing that
when BP converges, it converges to an extreme point of the so-called Bethe-Kikuchi
approximation to the Variational Free Energy. In fact, the entire problem of finding
marginals (MPF), which is the goal of the BP algorithm, can be cast as the problem of
finding the Free Energy of a system - an expression having a fundamental significance in
statistical physics. Another classical result from statistical physics shows that another
expression, namely the Variational Free Energy, has the Free Energy as its global minimum.
The Bethe-Kikuchi approximation to the Variational Free Energy is the expression which is
potentially minimized by the BP algorithm in case of convergence, and in light of the
above is an approximation to the MPF inference problem.
Not surprisingly (since BP is a special case of the general GDL scheme, as was explicitly
shown in Section 3) the same fact is true for the GDL, as was shown in [5] by McEliece
et al. (who originally developed the GDL algorithm).
Unfortunately, neither BP nor GDL is guaranteed to converge for a given inference
problem. Although convergence to the exact solution cannot be guaranteed, there are
some recent methods that use the Free Energy formulation and provide algorithms that
approximate the minimum of the Bethe-Kikuchi approximation to the Variational Free
Energy while guaranteeing convergence (see for instance the work of Yuille [7], reviewed
briefly in sub-section 5.3).
5.1. Triangulation
One of the first and basic methods (suggested initially by Pearl [3]) for coping with
problems imposed by introducing loops to BN / MRF models using the BP / BR methods,
was to artificially enlarge the support of some of the local kernels, so that loops in the
corresponding junction graph are eliminated (i.e. the resulting decomposition has a JT).
One of the basic methods to obtain such an enlargement is Triangulation. The procedure
behind the triangulation scheme is quite simple: given a loopy moral graph (a graph with
variables in the nodes, which connects all variables that share any local domain), we
need to add a set of edges so that every loop in the graph of length more than 3 will
have a chord (i.e. an edge connecting two non-consecutive loop nodes).
After the moral graph is triangulated, a new decomposition having a JT is constructed
from the resulting graph [22]. Each local kernel of this new decomposition corresponds
to a clique of the triangulated moral graph. It is constructed by multiplying all the
local kernels of the original decomposition whose local domains are contained in the
corresponding clique of the triangulated graph.
On the JT obtained by the triangulation, the standard GDL algorithm can be applied in
order to solve inference problems such as MAP or MPF. However, the price paid for using
triangulation is an increase in the size of the local domains of the decomposition.
Consequently, the computational cost, which is exponential in the size of the largest
local domain, is considerably increased.
As for the optimality of triangulation, the problem of finding an optimal triangulation
is often referred to as the TREEWIDTH problem of the graph. The TREEWIDTH of a graph
is defined to be the size of the largest clique after triangulation, minimized over all
possible triangulations. For instance, the TREEWIDTH of a tree is 2. Note that all the
GDL-related algorithms are exponential in the TREEWIDTH of a graph, hence the
crucial importance of minimizing it in real-life applications.
The problem of finding the TREEWIDTH of a general graph is known to be NP-hard
[Arnborg et al., 1987]. However, several approximations exist, for example, see [12] and
[15].
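In practice, a simple upper bound on the TREEWIDTH (in the convention above, the largest clique size after triangulation) is obtained by greedy min-fill elimination. The sketch below is our own minimal implementation of this standard heuristic, not a procedure taken from the thesis:

```python
def min_fill_triangulate(adj):
    """Greedy min-fill elimination. `adj` maps vertex -> set of neighbours.
    Returns (list of added fill-in edges, size of the largest clique formed)."""
    adj = {v: set(ns) for v, ns in adj.items()}       # work on a copy
    fill_edges, max_clique = [], 1
    while adj:
        def fill_cost(v):                             # missing edges among neighbours of v
            ns = list(adj[v])
            return sum(1 for i in range(len(ns)) for j in range(i + 1, len(ns))
                       if ns[j] not in adj[ns[i]])
        v = min(adj, key=fill_cost)                   # cheapest vertex to eliminate
        ns = list(adj[v])
        max_clique = max(max_clique, len(ns) + 1)     # v together with its neighbours
        for i in range(len(ns)):                      # connect the neighbours pairwise
            for j in range(i + 1, len(ns)):
                if ns[j] not in adj[ns[i]]:
                    adj[ns[i]].add(ns[j]); adj[ns[j]].add(ns[i])
                    fill_edges.append((ns[i], ns[j]))
        for u in ns:                                  # eliminate v from the graph
            adj[u].discard(v)
        del adj[v]
    return fill_edges, max_clique

# a 4-cycle A-B-C-D needs one chord; a path is already triangulated
cycle = {'A': {'B', 'D'}, 'B': {'A', 'C'}, 'C': {'B', 'D'}, 'D': {'A', 'C'}}
path  = {'a': {'b'}, 'b': {'a', 'c'}, 'c': {'b'}}
```

On the 4-cycle one chord is added and the largest clique has size 3; on the path no fill-in edges are needed and the largest clique has size 2, matching the tree value quoted above.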
5.2. Loopy Belief Revision
Belief Revision is often used for approximating MAP on loopy networks. Loopy Belief
Revision (LBR) uses the same message-passing algorithm as the BR / BP discussed for the
loop-free case (i.e. LBR is the original BR applied on a loopy network with some
message-passing schedule).
However, although the messages passed are the same as in the loop-free case, there is a
problem with algorithm termination: in the loop-free case any BR schedule which follows
the BR message-passing rules is bound to eventually terminate, but this is not the case
for LBR.
Consider for instance the simplest case of a loopy network – a single loop. As shown in
[6], on this network binary BP is guaranteed to converge to the correct marginals. In the
non-binary case, a simple criterion is provided for BP convergence over this single-loop
network. However, in the case of LBR, convergence over the single-loop network is not
guaranteed. Indeed, consider even the simplest schedule of a single message going
around the loop: if all the (directed) edge weights of the loopy belief network are
positive, the "looping" message will trivially diverge to infinity, unless the process is
terminated by some criterion other than the non-increase of a message in one of the
nodes.
In practice we can consider different termination conditions for the LBR. For instance, in
our experiments we used two such conditions, described later.
5.3. CCCP: Minimizing Bethe-Kikuchi approximation of Free Energy
As mentioned above, the Bethe-Kikuchi (BK) Free Energy approximation was found to be of
key importance to the analysis of Loopy Belief Propagation (LBP), following the
developments of Yedidia et al. [4], who showed that if LBP converges, then it converges
to a stationary point of the BK approximation. This discovery led to the development of
new algorithms, which, unlike the LBP, are guaranteed to converge to a local minimum
of the BK approximation.
One such algorithm is the so-called CCCP – Convergent Convex Concave Procedure
developed by Yuille [7]. This algorithm exploits the decomposition of the BK
approximation into a sum of a convex term and a concave term. Yuille uses this fact to
derive a message passing algorithm which relies on simple analytical properties of such a
decomposition. CCCP is guaranteed to converge to a set of beliefs comprising a local
minimum of the BK approximation. The BK is in turn an approximation to the minimum
of the Free Energy which is the true set of beliefs. Yuille reports good results even in
simulations in which LBP failed to converge.
It is worth stressing that minimizing the BK approximation only provides a method for
approximating the local marginals in a BN / MRF network (solving the MPF problem), and
not the MAP on these networks. Another paper by Yuille [23] suggests solving MAP by
using Temperature Annealing - introducing a temperature factor into the BK approximation
and letting it go to 0, each time using CCCP (or any other BK minimization method) to
calculate the initial beliefs for the next step.
Another important point is that the BK approximation is not guaranteed to be a good
approximation of the real Free Energy, or the real solution we are seeking. In the general
case it can be arbitrarily bad.
6. Using “Slow Connections” for solving MAP on loopy networks
In this section we introduce our novel scheme for coping with loopy networks. Our main
effort will be directed towards solving the MAP problem on loopy function
decompositions over the max-sum commutative semi-ring. The common aspect of our
techniques is the use of what we call “slow lateral connections” in the loopy network.
The use of such „slow lateral connections‟ is motivated in part by properties of biological
brain circuits. In the brain‟s cortex, lateral connections within a cortical area are typically
considerably slower than between neurons in different areas (see [10] and [11] for further
reference). This difference in conductance speed affects the message passing scheduling
in processing that involves these neurons. In our proposed techniques, we designate some
of the loopy network connections to be “slow”, i.e. being updated in a slower schedule
than the rest of the network. We present several conditions under which this approach is
guaranteed to converge to local or global maximum of the (loopy) function.
The conditions that we assume to guarantee convergence may not always be applicable to
a given problem. We introduce several methods to cope with these situations. One
method we introduce is an iterative approach approximating MAP, over a series of
functions, which, under some conditions, converge to the desired function, thus solving
the MAP problem for this function. Another approach we propose is a hybrid approach,
which uses triangulation (introduced previously) together with our techniques. The latter
approach will always converge to the global maximum, but the efficiency of the
improvement will depend on the problem at hand.
Finally, in section 7 we will introduce one possible method by which our techniques can
be applied in practice, with experimental results given in section 8. It involves
breaking the application of the "slow connections" algorithm into several steps. Each
step uses the "slow connections" technique to achieve its goal, but only on a fraction
of the whole network. We will also show how some well-known theoretical results from
triangulation-related graph theory can be used to give an upper bound on the complexity
improvement (over standard triangulation) that can be achieved using our techniques.
6.1. General overview of the approach
To introduce our approach, let us first describe the general setting used in subsequent
sections. Consider a function f(x), where x = (x_1, …, x_n) is the vector of its
variables, and assume that f(x) has the following sum-decomposition:

f(x) = Σ_j g_j(S_j)

where the S_j ⊆ {x_k}_{k=1}^n are subsets of the variables of f. In other words, f is
the sum of simpler functions g_j that depend on small subsets of the variables. Our goal
is to solve the MAP inference problem for f, i.e. find an assignment x̂ for x s.t.
x̂ = argmax_x f(x).
Let us first draw the "local domain graph" of the decomposition of f (as described in
the GDL section). The nodes of this graph are the local domains {S_j}, and every two
nodes S_j and S_k s.t. S_j ∩ S_k ≠ ∅ are connected. The weight of the edge connecting
them is set to w_jk = |S_j ∩ S_k|. As stated in the GDL section (and shown in [1]), if
the maximal weight spanning tree of the graph has weight equal to Σ_j |S_j| − n, then
the decomposition has a corresponding JT, which can be used to solve the MAP problem
using the GDL algorithm.
However, in the case of a loopy decomposition (i.e. when the corresponding moral graph
has un-triangulated loops), the weight of the maximal weight spanning tree (MST) of the
"local domain graph" will be smaller than Σ_j |S_j| − n and hence there will be no JT
for the graph.
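This junction-tree test is easy to run mechanically. The sketch below (our own helper, not code from the thesis) builds the local domain graph, computes the maximal-weight spanning forest with Kruskal's algorithm, and compares its weight with Σ_j |S_j| − n; the loopy example uses the domains of the Figure 7 decomposition.

```python
from itertools import combinations

def max_spanning_weight(domains):
    """Weight of the maximal-weight spanning forest of the local domain graph,
    where the edge weight between two domains is the size of their overlap."""
    edges = sorted(
        ((len(domains[i] & domains[j]), i, j)
         for i, j in combinations(range(len(domains)), 2)
         if domains[i] & domains[j]),
        reverse=True)                              # greedy Kruskal, heaviest edges first
    parent = list(range(len(domains)))
    def find(i):                                   # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    total = 0
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            total += w
    return total

def has_junction_tree(domains):
    n = len(set().union(*domains))                 # number of distinct variables
    return max_spanning_weight(domains) == sum(map(len, domains)) - n

loopy = [{'A', 'B'}, {'A', 'C'}, {'B', 'D'}, {'B', 'E'}, {'C', 'E'}]
chain = [{'A', 'B'}, {'B', 'C'}, {'C', 'D'}]
```

For the Figure 7 domains the MST weight is 4 while Σ_j |S_j| − n = 10 − 5 = 5, so no JT exists; for the simple chain the two quantities agree and a JT exists.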
Intuition
Consider, for example, the loopy MRF depicted in Figure 7. Nodes A, B, C and E clearly
form a loop, and hence the GDL or Belief Revision (BR) algorithms cannot be applied to
solve MAP on this network in the straightforward manner. All the cliques of the network
depicted in Figure 7 are of size two. Thus, the joint distribution decomposition
represented by Figure 7 is of the form:

(1/c) · φ(A,B) φ(A,C) φ(B,D) φ(B,E) φ(C,E)

where c is a normalizing constant. Hence, the logarithm of the joint distribution is:

−log c + log φ(A,B) + log φ(A,C) + log φ(B,D) + log φ(B,E) + log φ(C,E)

We denote ψ(·,·) = log φ(·,·).
Figure 7: A loopy MRF. Nodes A, B, E and C form a loop.
Informally, we suggest approximating the MAP assignment computation using a scheme of
"freezing" connections, allowing them to pass messages only between the maximization
rounds applied on the rest of the network. The "frozen" connections are the so-called
"slow connections"; they are called this way since they are updated more slowly than the
others. For example, consider the illustration in Figure 8.
Figure 8: Slow connection example. The edge (A,C) is “opened” and
replaced by a “normal speed” edge (A,ZC) and a “slow speed” edge (C,ZC).
Messages are passed on the “slow” connection after each maximization
step performed on the rest of the graph.
In this example, the slow connection is between node A and node C. The essence of our
approach, which will be described later in full detail, is as follows. During each
maximization round, A assumes a fixed value of C (the one obtained at node C in the
previous round). The assumption made by A is depicted in Figure 8, where the node ZC is
a so-called "evidence" (observed) node, the value of which is fixed during each
maximization step. Clearly, the network depicted in Figure 8 is a tree. Hence, standard
algorithms, such as GDL, can be applied to calculate the MAP assignment on this network.
Between the maximization steps, the MAP value at node C is transmitted to the evidence
node ZC via the "slow" connection between them. This value serves as the fixed value of
the evidence node ZC for the next GDL maximization round.
Removing loops by variable replication
Let us now formally introduce our approach of removing loops from the graph by
replicating some of the variables. At each step we choose a variable x_i that is counted
in the weight of at least one of the current MST edges, and for which the sub-graph
induced by all the nodes S_j containing it (x_i ∈ S_j) and by the edges of the current
MST has two or more connected components. We then choose a leaf S_j belonging to the
smallest of these connected components (all the connected components are sub-trees of
the MST and hence must have leaves) and replace x_i in S_j by a new variable z_ij. Note
that in each step Σ_j |S_j| − n decreases, as n is increased (we add a new variable, and
x_i remains in some other node that was connected to S_j in the current MST) while
Σ_j |S_j| stays unchanged. Moreover, the weight of every edge between S_j and its
current neighbors in which x_i participated is decreased by one (edges whose weight
becomes zero are removed).
If the new MST for the updated graph has the same weight, then the difference between
the weight of the current MST and Σ_j |S_j| − n (where n is the current number of
variables and {S_j} are the current nodes of the "local domain graph") decreases by one.
If the weight of the new MST decreases, then it decreases by no more than 1, as S_j was
a leaf of the connected component and hence there was only one edge in the original MST
which was affected by the removal of x_i from S_j. Moreover, if the connected component
was the node S_j alone, then the weight of the new MST clearly remains the same as the
weight of the original one. Finally, we note that every step reduces the size of at
least one connected component; hence, arriving at points in which there are single-node
connected components is inevitable. Therefore, this process is guaranteed to converge to
a point in which the weight of the current MST is equal to Σ_j |S_j| − n for the current
value of n and the current sets S_j.
Assume the above process converges after m steps, and let z = (z_{i_1 j_1}, …,
z_{i_m j_m}) denote the vector of the variables added in the process. Let us denote by
x̃ the vector of all variables from the set {x_{i_1}, …, x_{i_m}}, and by y the vector
of all variables from the set {x_1, …, x_n} \ {x_{i_1}, …, x_{i_m}}. Then the "local
domain graph" which results after the final step of the process represents a
decomposition of a function g(y, x̃, z) (with the same local kernels as in the
decomposition of f(x), but with updated domains of the local kernels - some of the
variables are replaced by z variables) which is loop-free, i.e. has a JT (as this was
the termination condition of the process). Moreover, the MAP problem for f(x) can be
restated in the new context as a constrained MAP (CMAP) problem:

(ŷ, x̂̃) = argmax_{y, x̃, z = x̃} g(y, x̃, z)

where the (consistency) constraint z = x̃ means that z_{i_k j_k} = x_{i_k} for all
1 ≤ k ≤ m (note that x̃ and y together compose the original vector x).
The choice of x̃ and z in the above (a cut-set of the loopy network) is not unique in
general; some choices can be better than others, as we will see in the following
sections, where we discuss assumptions under which the CMAP problem can be solved or
approximated, together with approaches that make use of these assumptions. To conclude
this point, Figure 9 illustrates two possible choices of z for removing loops from an
exemplar loopy MRF model. Each choice is depicted by coloring the local kernel variables
that are replaced with z variables.
Figure 9: Breaking loops via creating "slow connections". (a) A network containing a
loop. (b) One of the connections is "frozen" during a part of the computation; during
this computation the graph is effectively opened. (c) Another choice of "slow
connection".
Finally we introduce a few notations that will make the discussion in the following
section more readable:
Instead of writing g(y, x̃, z) we will write g(y, x, z), where we assume (w.l.o.g.) that
x and z are of the same size (and ordered accordingly). All our results and approaches
can be extended in a straightforward fashion from this case to the more general case
described above (in which z is potentially larger than x and several z variables can
correspond to the same x variable). Under this notation, the CMAP problem is the problem
of finding (ŷ, x̂) = argmax_{y, x, z=x} g(y, x, z), that is, the argmax under the
constraint z = x. We will also refer to the constrained maximization as the problem of
finding "legal" optimums of g(y, x, z), i.e. maximal points of the form (y, x, x).
For any fixed z we denote: (y_z, x_z) = argmax_{y, x} g(y, x, z).
We denote the original function (the one over which we are interested in obtaining the
MAP assignment) by g(y, x, x). In all the following discussions we assume that
g(y, x, x) is loopy, while g(y, x, z) is loop-free. In fact, the original function that
we started with was g(y, x), out of which we construct a loop-free g(y, x, z) by
replacing some of the occurrences of variables from x by variables from z. For example,
x_2 can occur in more than one place, but we may replace it by z_2 in just one of those
places. The function g(y, x, z) is, strictly speaking, different from g(y, x), but
g(y, x, x) = g(y, x) for any x and y.
We assume that g(y, x, z) is discrete and bounded.
6.2. Approaches for obtaining a local optimum
In this section we describe several approaches for approximating the CMAP via iterative
processes which converge to so-called "local optimum" points of g of the form (y, x, x).
Here "local optimum" means that certain local changes (changes in specific subsets of
the whole set of variables) of the optimum point are guaranteed to decrease g.
For example, the function variables in the scheme described in the subsequent section
are divided into two subsets. In each round of this iterative scheme, one of the subsets
is assumed fixed (i.e. its variables are treated as evidence variables), and in the next
round the situation is reversed. The value achieved in the previous round for each of
the subsets serves as the fixed value for the next round. The local optimum for this
scheme is achieved with respect to each of these subsets.
6.2.1. Iterative fixing
An approach that can be used to approximate the CMAP is the "iterative fixing" approach.
This approach assumes that the selection of z is "symmetric", i.e. both the
decompositions of g(y, x_0, x_0) (for any fixed x_0) and g(y_0, x, x) (for any fixed
y_0) are loop-free. In order to better understand the symmetry assumption, refer to
Figure 7, where we may consider (C) to be the vector of x variables and (A, B, D, E) to
be the vector of y variables. Then the above symmetry assumption clearly holds (fixing
either of the vectors, we arrive at a loop-free decomposition).
Given such a selection of z and the corresponding function g(y, x, z), we may
approximate (ŷ, x̂) = argmax_{y, x, z=x} g(y, x, z) by initializing z = z_0 and
iterating the following steps:
Fix z = z_k and calculate y_k = argmax_y g(y, z_k, z_k); as g(y, z_k, z_k) has a
loop-free decomposition, this can be achieved using GDL, for instance.
Fix y = y_k and calculate z_{k+1} = argmax_z g(y_k, z, z). The latter maximization is
also over a loop-free decomposition, due to our assumptions on g(y, x, z) (the
"symmetry" assumption).
This iterative process terminates when y_{k+1} = y_k or z_{k+1} = z_k. Each step of the
process described above does not decrease g relative to the previous point, and
therefore:

g(y_k, z_k, z_k) ≤ g(y_k, z_{k+1}, z_{k+1}) ≤ g(y_{k+1}, z_{k+1}, z_{k+1})

and since g(y, x, z) is bounded, the process is guaranteed to terminate. The termination
point of the process fits our "local optimum" description above, as at the point of
termination (y_l, z_l, z_l) we have:

∀y, z:  g(y, z_l, z_l) ≤ g(y_l, z_l, z_l)  and  g(y_l, z, z) ≤ g(y_l, z_l, z_l)

hence the only way to improve (i.e. increase g) from (y_l, z_l, z_l) is by changing both
y and z.
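For a toy objective the scheme reduces to coordinate ascent on the legal function f(y, z) = g(y, z, z). The sketch below is our own illustration, with a hand-picked payoff table and exhaustive argmax standing in for the GDL maximizations:

```python
# toy "legal" objective f(y, z) = g(y, z, z); values are hand-picked for illustration
f = [[1.0, 3.0],
     [4.0, 2.0]]          # rows: y in {0, 1}; columns: z in {0, 1}

def iterative_fixing(z, max_rounds=20):
    """Alternate y_k = argmax_y g(y, z_k, z_k) and z_{k+1} = argmax_z g(y_k, z, z)."""
    trace = []
    for _ in range(max_rounds):
        y = max((0, 1), key=lambda y_: f[y_][z])       # fix z, maximize over y
        z_new = max((0, 1), key=lambda z_: f[y][z_])   # fix y, maximize over z
        trace.append(f[y][z_new])                      # value never decreases
        if z_new == z:                                 # termination: z_{k+1} = z_k
            return (y, z), trace
        z = z_new
```

Starting from z = 1 the process stops at (y, z) = (0, 1) with value 3, a "local optimum" in the above sense, while starting from z = 0 it reaches the global legal maximum 4, illustrating that the termination point depends on the initialization.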
The above method was successfully used in [14] to calculate MAP (used for training their
model with "hard" EM) over a biological probabilistic model. The model in [14] was not
discrete, but the general steps of the algorithm were the same. As reported in [14],
this approach produced very good results in their application.
Note that this approach is different from our proposed "slow connections" approach: in
the slow connections approach, the nodes with slow connections to other nodes are not
fixed and participate in the maximization steps. In the "slow connections" approach,
nodes are fixed only partially; that is, they are assumed to have some fixed value only
by the other nodes connected to them via slow connections.
6.2.2. Local optimum assumption
From this section on, we discuss our novel techniques for MAP approximation. These
techniques form what we previously informally called the "slow connections" approach.
We derive several iterative processes using the following general approach, as
informally introduced above: opening the loopy graph by duplicating some of the
variables, and iterating GDL and variable updates. These processes are guaranteed to
converge under some assumptions about the original function. We will now describe the
assumptions and the processes.
The most basic of our approaches uses the following assumption to approximate the CMAP
and obtain a "local optimum" of g(y, x, z):
Assumption (A2) - weak z-minor:

∀Z, ∀y:  g(y_Z, x_Z, Z) − g(y_Z, x_Z, x_Z)  ≤  g(y_Z, x_Z, Z) − g(y, Z, Z)

where (y_Z, x_Z) = argmax_{y,x} g(y, x, Z), and the inequality is strict unless x_Z = Z.
This assumption has the following meaning. For a fixed z = Z, we can maximize g, and
the maximum is obtained at (y_Z, x_Z, Z). This is not a "legal" point, since by
definition a legal point has the form (y, x, x). We can "legalize" the point in two
different ways: either by changing Z or by changing x_Z. The assumption essentially says
that z is a "less effective" variable: changing it from Z to x_Z at the points
(y_Z, x_Z, Z) has a smaller effect than changing x from x_Z to Z and changing y.
Under the above assumption, a simple iterative process is guaranteed to converge to a
"local optimum" approximation of the CMAP over g(y, x, z). The maximize-and-legalize
process (denoted by P1) is initialized by setting z = Z_0 (for instance, we could first
find the global maximum (Y_0, X_0, Z_0) of the loop-free g(y, x, z) and take Z_0 from
there) and iterates the following steps:
Maximization: Fix z = Z_k and calculate (y_{Z_k}, x_{Z_k}) = argmax_{y,x} g(y, x, Z_k).
As g(y, x, z) is loop-free, the latter maximization can be done using GDL.
Point legalization: Set Z_{k+1} = x_{Z_k}.
The iterative process terminates when x_{Z_k} = Z_k.
Claim 2: Assuming A2, the above iterative process converges to a point (ỹ, x̃, x̃)
which is a "local optimum" of g(y, x, z) in the sense that for any y and any Z_k passed
during the process: g(y, Z_k, Z_k) ≤ g(ỹ, x̃, x̃).
Proof: By induction. The induction hypothesis is:

∀m, ∀k ≤ m, ∀y:  g(y, Z_k, Z_k) ≤ g(y_{Z_m}, Z_{m+1}, Z_{m+1})

Initialization step: for m = 0 the hypothesis follows from the fact that
(y_{Z_0}, x_{Z_0}) = argmax_{y,x} g(y, x, Z_0), and thus ∀y: g(y, Z_0, Z_0) ≤
g(y_{Z_0}, x_{Z_0}, Z_0). Moreover, due to the A2 assumption:

∀y:  g(y_{Z_0}, x_{Z_0}, Z_0) − g(y_{Z_0}, x_{Z_0}, x_{Z_0})  ≤  g(y_{Z_0}, x_{Z_0}, Z_0) − g(y, Z_0, Z_0)

and hence:

∀y:  g(y, Z_0, Z_0) ≤ g(y_{Z_0}, x_{Z_0}, x_{Z_0})

Now recall that Z_1 = x_{Z_0}, and the induction hypothesis follows. Moreover, note
that the inequality in the hypothesis is strict unless x_{Z_0} = Z_0.
Inductive step: Assume that the inductive hypothesis holds for m − 1, and let us show
it for m. By the hypothesis we get that:

∀k ≤ m − 1, ∀y:  g(y, Z_k, Z_k) ≤ g(y_{Z_{m−1}}, Z_m, Z_m)

As (y_{Z_m}, x_{Z_m}) = argmax_{y,x} g(y, x, Z_m), we get that ∀y: g(y, Z_m, Z_m) ≤
g(y_{Z_m}, x_{Z_m}, Z_m), and in particular g(y_{Z_{m−1}}, Z_m, Z_m) ≤
g(y_{Z_m}, x_{Z_m}, Z_m). Moreover, due to A2:

∀y:  g(y_{Z_m}, x_{Z_m}, Z_m) − g(y_{Z_m}, x_{Z_m}, x_{Z_m})  ≤  g(y_{Z_m}, x_{Z_m}, Z_m) − g(y, Z_m, Z_m)

and hence:

∀y:  g(y, Z_m, Z_m) ≤ g(y_{Z_m}, x_{Z_m}, x_{Z_m})

Finally, as Z_{m+1} = x_{Z_m}, we get ∀k ≤ m, ∀y: g(y, Z_k, Z_k) ≤
g(y_{Z_m}, Z_{m+1}, Z_{m+1}) ▄
Note that unless x_{Z_m} = Z_m, the inequality above is strict, i.e. in this case:

∀k ≤ m, ∀y:  g(y, Z_k, Z_k) < g(y_{Z_m}, Z_{m+1}, Z_{m+1})

Conclusion: Since, as we have shown by induction, the sequence {g(y_{Z_m}, Z_m, Z_m)}
is strictly increasing (until termination), the process must converge (as g is assumed
to be bounded). As (ỹ, x̃, x̃) is the final point of the process, there is some l for
which (y_{Z_l}, Z_l, Z_l) = (ỹ, x̃, x̃), and hence the claim follows immediately ▄
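The P1 process can be traced on a hand-picked toy function g(y, x, z) = a(y, x) + b(y, z), our own construction with the "slow" kernel b deliberately weak so that z is a secondary variable in the sense of A2 (we checked A2 directly for these numbers). Exhaustive argmax stands in for the GDL maximization over the opened tree:

```python
from itertools import product

# toy kernels, hand-picked so that the weak z-minor assumption A2 holds:
# g(y, x, z) = a(y, x) + b(y, z); the "legal" function is f(y, x) = g(y, x, x)
a = {(0, 0): 0.0, (0, 1): 5.0, (1, 0): 3.0, (1, 1): 1.0}
b = {(0, 0): 0.0, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.0}   # weak z-coupling
g = lambda y, x, z: a[y, x] + b[y, z]

def maximize_and_legalize(Z, max_rounds=20):
    """P1: maximize g(., ., Z) over (y, x), then legalize by setting Z <- x."""
    for _ in range(max_rounds):
        y, x = max(product((0, 1), repeat=2), key=lambda yx: g(yx[0], yx[1], Z))
        if x == Z:                       # legal point reached: (y, x, x)
            return y, x
        Z = x                            # point legalization step

y, x = maximize_and_legalize(Z=0)
legal_max = max(g(y_, x_, x_) for y_, x_ in product((0, 1), repeat=2))
```

Starting from Z = 0, one maximization gives (y, x) = (0, 1), legalization sets Z = 1, and the next round terminates at the legal point (0, 1, 1), which for these numbers is also the global legal maximum.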
6.3. Assumption for obtaining a global optimum
In this section we present another assumption on ),,( zxyg which is stronger then A2, but
at the same time guarantees that the process P1 described in the previous section
converges to the global maximum of ),,( xxyg (i.e. to the correct solution of the CMAP)
in a single step.
Assumption (A1) – strong z-minor:
),,(),,(),,(),,(:),(),(, ZxygZxygZxygzxygxyxyZz ZZZZZZZZ
where ),,(maxarg),(,
Zxygxyxy
ZZ .
Roughly speaking, A1 demands that, starting from a maximal point of the form
$(y_Z, x_Z, Z)$, for any fixed Z and arbitrary z, changing Z to z has a smaller effect on the value of
g than changing the pair $(y_Z, x_Z)$ to any other value $(y, x)$ around the point $(y_Z, x_Z, Z)$.
Obviously A1 would not hold for a continuous function, but for a discrete function it
means that the z variable is "lateral", i.e. secondary, near the points $(y_Z, x_Z, Z)$.
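For small discrete domains, A1 can be verified by brute force directly from a table of g. The checker below is an illustrative sketch (the toy functions and domains are assumptions); note that the check enumerates all variable combinations, which is exactly the verification cost discussed in section 6.4.

```python
from itertools import product

def satisfies_A1(g, y_dom, x_dom, z_dom):
    """Brute-force check of the strong z-minor assumption A1."""
    for Z in z_dom:
        # The maximal point (y_Z, x_Z) for this fixed Z.
        y_Z, x_Z = max(product(y_dom, x_dom), key=lambda p: g(p[0], p[1], Z))
        base = g(y_Z, x_Z, Z)
        for z in z_dom:
            z_change = abs(g(y_Z, x_Z, z) - base)
            for y, x in product(y_dom, x_dom):
                if (y, x) == (y_Z, x_Z):
                    continue
                if z_change >= base - g(y, x, Z):   # A1 violated
                    return False
    return True

# Toy examples: with a tiny z-weight A1 holds; with full weight it fails.
g_weak = lambda y, x, z: -(y - 1) ** 2 - (x - 2) ** 2 - 0.01 * (x - z) ** 2
g_strong = lambda y, x, z: -(y - 1) ** 2 - (x - 2) ** 2 - (x - z) ** 2
dom = range(4)
```

The contrast between the two toy functions illustrates the "lateral variable" intuition: A1 holds only when z perturbs g much less than any x, y change does.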
Claim 1: Assuming A1, the maximize-and-legalize (P1) process will converge in a single
step to a global maximum of $g(y,x,x)$, i.e. the correct solution to the CMAP problem.
Proof: Denote by $(\hat{y}, \hat{Z}, \hat{Z})$ a global maximum of $g(y,x,x)$. We start the process P1 by
selecting an arbitrary Z value. We first maximize over y and x to obtain $(y_Z, x_Z, Z)$, and then
legalize to obtain $(y_Z, x_Z, x_Z)$. We will show that this process leads us to the global
maximum $(\hat{y}, \hat{Z}, \hat{Z})$.

We first show that it must hold that $x_{\hat{Z}} = \hat{Z}$. Assume, for the sake of contradiction, that
$x_{\hat{Z}} \neq \hat{Z}$. By A1 (applied with $Z = \hat{Z}$, $z = x_{\hat{Z}}$ and $(y, x) = (\hat{y}, \hat{Z})$):

$|g(y_{\hat{Z}}, x_{\hat{Z}}, x_{\hat{Z}}) - g(y_{\hat{Z}}, x_{\hat{Z}}, \hat{Z})| < g(y_{\hat{Z}}, x_{\hat{Z}}, \hat{Z}) - g(\hat{y}, \hat{Z}, \hat{Z})$

We also know that $g(\hat{y}, \hat{Z}, \hat{Z}) \le g(y_{\hat{Z}}, x_{\hat{Z}}, \hat{Z})$, from the definition of $y_{\hat{Z}}$ and $x_{\hat{Z}}$. We
conclude that $g(y_{\hat{Z}}, x_{\hat{Z}}, x_{\hat{Z}})$ is strictly closer to $g(y_{\hat{Z}}, x_{\hat{Z}}, \hat{Z})$ than $g(\hat{y}, \hat{Z}, \hat{Z})$ is, and
therefore $g(\hat{y}, \hat{Z}, \hat{Z}) < g(y_{\hat{Z}}, x_{\hat{Z}}, x_{\hat{Z}})$, in contradiction to $g(\hat{y}, \hat{Z}, \hat{Z})$ being a global
maximum of $g(y,x,x)$.

Figure 10: z-change vs. x,y-change. By A1, a z-change from a
maximal point is always smaller than an x,y-change.

Hence $x_{\hat{Z}} = \hat{Z}$, and the global maximum point $(\hat{y}, \hat{Z}, \hat{Z})$ is a fixed point of P1. Since
$g(\hat{y}, \hat{Z}, \hat{Z}) \le g(y_{\hat{Z}}, x_{\hat{Z}}, \hat{Z}) = g(y_{\hat{Z}}, x_{\hat{Z}}, x_{\hat{Z}})$, and since $(\hat{y}, \hat{Z}, \hat{Z})$ is a global maximum of
$g(y,x,x)$, then so is $(y_{\hat{Z}}, x_{\hat{Z}}, x_{\hat{Z}})$.

Now let us choose some $Z \neq \hat{Z}$ as the starting point of the process P1. Following the first
step of P1 we reach the point $(y_Z, x_Z, x_Z)$. We will show that this is in fact the global
maximum $(\hat{y}, \hat{Z}, \hat{Z})$. Assume, for the sake of contradiction, that $(y_Z, x_Z) \neq (y_{\hat{Z}}, x_{\hat{Z}})$.
Consider the points $(y_Z, x_Z, \hat{Z})$ and $(y_{\hat{Z}}, x_{\hat{Z}}, Z)$; by A1 applied to $z = \hat{Z}$ and $z = Z$ respectively:

(1) $|g(y_Z, x_Z, \hat{Z}) - g(y_Z, x_Z, Z)| < g(y_Z, x_Z, Z) - g(y_{\hat{Z}}, x_{\hat{Z}}, Z)$

(2) $|g(y_{\hat{Z}}, x_{\hat{Z}}, Z) - g(y_{\hat{Z}}, x_{\hat{Z}}, \hat{Z})| < g(y_{\hat{Z}}, x_{\hat{Z}}, \hat{Z}) - g(y_Z, x_Z, \hat{Z})$

From (1) we get that $g(y_Z, x_Z, \hat{Z}) > g(y_{\hat{Z}}, x_{\hat{Z}}, Z)$. This is because
$g(y_{\hat{Z}}, x_{\hat{Z}}, Z) \le g(y_Z, x_Z, Z)$ (from the definition of $y_Z, x_Z$), and from (1),
$g(y_Z, x_Z, \hat{Z})$ is strictly closer to $g(y_Z, x_Z, Z)$ than $g(y_{\hat{Z}}, x_{\hat{Z}}, Z)$ is. Similarly, from (2) we
get that $g(y_{\hat{Z}}, x_{\hat{Z}}, Z) > g(y_Z, x_Z, \hat{Z})$, which is a contradiction. Hence
$(y_Z, x_Z) = (y_{\hat{Z}}, x_{\hat{Z}})$. Therefore, P1 starting from an arbitrary $z = Z$ converges in a single
step to the point $(y_{\hat{Z}}, x_{\hat{Z}}, x_{\hat{Z}})$, which is the global maximum of $g(y,x,x)$ ▄

Remark: our proof also shows that the global maximum of a function which admits A1 is
unique.
6.4. Coping with general networks – from theory to practice
The assumptions and processes discussed in the previous sections provide a useful tool for
MAP inference in certain types of loopy networks. However, two major problems arise
when applying them to the class of general loopy networks:
1. In general, the strong z-minor (A1) or weak z-minor (A2) assumptions may simply not
hold. That is, not every function has a representation admitting an appropriate selection
of z variables such that A1 or A2 is satisfied.
2. Even if, for the given function representation, an appropriate selection of z variables
exists, finding it can be computationally intractable. That is, when the function's
support includes a large number of variables, selecting z variables and verifying A1
or A2 for a given selection can be exponentially hard. This is because we
potentially have to consider all combinations of the z and the x, y variables.
In the following sections we will develop methods for addressing these issues, which can
be used for the practical application of A1/A2-based techniques. In addition, we will
derive an upper bound on the decrease in computation complexity that can be achieved
by using our techniques, compared with standard triangulation.
6.4.1. Partial iterative approximation
The first issue that we will address is the problem of assumptions A1 / A2 not holding for
a specific selection of z variables. Suppose we have a function whose decomposition
(here we assume that the function has a summation decomposition) can be expressed as
the sum of two parts: $f(y,x) + g(y,x,x)$, where x and y are vectors of variables.
Moreover, assume that the decomposition of the $f(y,x)$ part is loop free (i.e. has
a JT), while the decomposition of $g(y,x,x)$ is loopy. However, if we replace the second
(vector) x by z, we arrive at $g(y,x,z)$, which is loop free (i.e. $g(y,x,x)$ admits the z
selection as previously discussed). As an example, consider a function
$f(y_1,y_2,x_1,x_2) + g(x_1,x_2,x_3)$ where $f(y_1,y_2,x_1,x_2) = f_1(y_1,x_1) + f_2(y_2,x_2)$ and
$g(x_1,x_2,x_3) = g_1(x_1,x_2) + g_2(x_2,x_3) + g_3(x_3,x_1)$; clearly $g(x_1,x_2,x_3)$ is loopy, while
$f(y_1,y_2,x_1,x_2)$ is loop free. The problem we examine is that for the full function
$f(y,x) + g(y,x,z)$, assumptions A1 / A2 do not hold.

Now assume there exists some small constant $a_1$ such that assumption A1 holds for the
function $[f(y,x) + (1-a_1)\,g(y,x,x)] + a_1\,g(y,x,z)$. The latter assumption is much less
demanding than assuming A1 for the original function $f(y,x) + g(y,x,z)$. This is
because the maximal function change caused by z is now controlled by $a_1$ (multiplied by
it), and under relatively broad assumptions we can easily show that such an $a_1$ exists. A
sufficient assumption is that there are no zero differences for x, y changes, that is, for any z
and any different pairs $(y_1, x_1)$ and $(y_2, x_2)$:

$f(y_1,x_1) + g(y_1,x_1,z) \neq f(y_2,x_2) + g(y_2,x_2,z)$
Recall that the maximize-and-legalize process (P1) under the A1 assumption converges to
the global maximum in a single step. The P1 process has two steps: first, fixing z to
some value, and second, maximizing over y, x. In the discussion above, this maximization
over y, x was performed using GDL, since it was performed over a loop-free
decomposition. In the current derivation we do not immediately get a loop-free function,
and we will deal with this in several steps.

The proof of P1 convergence implies that we can apply P1 by fixing $z = Z_1$ and
maximizing $[f(y,x) + (1-a_1)\,g(y,x,x)] + a_1\,g(y,x,Z_1)$ to find the maximizing
assignments of x and y. If we can find this maximum, then after the maximization we simply
assign $z = \hat{x}$ (where $\hat{y}, \hat{x}$ is the maximizing y, x assignment) and terminate with a
maximal solution $(\hat{y}, \hat{x}, \hat{x})$.

To perform the maximization stage, we still need to maximize:

$[f(y,x) + a_1\,g(y,x,Z_1)] + (1-a_1)\,g(y,x,x)$

which is still not loop free. However, the first part (i.e. $f(y,x) + a_1\,g(y,x,Z_1)$) has a
loop-free decomposition. As for the remaining part, $(1-a_1)\,g(y,x,x)$, we can proceed by
applying the same logic over again. We summarize the proposed iterative procedure by
describing step k of the iterative process:
1. At the beginning of step k, the function that needs to be maximized is of the form:

$f(y,x) + \sum_{i=1}^{k-1} a_i\,g(y,x,Z_i) + \left(1 - \sum_{i=1}^{k-1} a_i\right) g(y,x,x)$

2. Find $a_k$ so that the function:

$f(y,x) + \sum_{i=1}^{k-1} a_i\,g(y,x,Z_i) + \left(1 - \sum_{i=1}^{k} a_i\right) g(y,x,x) + a_k\,g(y,x,z)$

satisfies A1.

3. Fix $z = Z_k$ and proceed to the next step (step k+1), in which we maximize the
function:

$f(y,x) + \sum_{i=1}^{k} a_i\,g(y,x,Z_i) + \left(1 - \sum_{i=1}^{k} a_i\right) g(y,x,x)$
The above process terminates when one of the following two conditions is fulfilled:

1. $\sum_{i=1}^{k} a_i = 1$. In this case we have to maximize $f(y,x) + \sum_{i=1}^{k} a_i\,g(y,x,Z_i)$, which is
loop free, and hence this can be done by GDL. The x, y pair which results from the
maximization is the correct solution to the original MAP problem, as follows
immediately from Claim 1 applied iteratively.

2. We also terminate in case we cannot find an appropriate $a_k$ in some step. In this case
we only arrive at a solution to the MAP problem for the function $f(y,x) + c \cdot g(y,x,x)$,
where $c = \sum_{i=1}^{k-1} a_i$. This can be thought of as an approximation of the original MAP
solution that lies between completely ignoring the loopy terms and computing the
maximal assignment for the original function.
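The bookkeeping of this iterative process can be sketched as follows. The sketch is illustrative only: the callbacks `find_ak` (which in reality must verify A1 for the mixture, the expensive part) and `choose_Zk` (the heuristic constant selection) are stubbed-out assumptions, and the maximizations are brute force rather than GDL.

```python
from itertools import product

def partial_iterative_approx(f, g, dom, find_ak, choose_Zk, max_steps=50):
    """Skeleton of the partial iterative approximation process.
    f(y, x) is the loop-free part, g(y, x, z) the loopy part (toy scalar
    arguments here). find_ak returns a_k for the current mixture, or None
    if no valid a_k exists; choose_Zk selects the constant Z_k."""
    a, Zs = [], []
    for _ in range(max_steps):
        c = sum(a)
        if c >= 1.0:                               # termination condition 1
            break
        mix = lambda y, x, z: (f(y, x)
                               + sum(ai * g(y, x, Zi) for ai, Zi in zip(a, Zs))
                               + (1.0 - c) * g(y, x, z))
        ak = find_ak(mix)
        if ak is None:                             # termination condition 2
            break
        a.append(ak)
        Zs.append(choose_Zk(mix))
    # Final loop-free maximization of f + sum_i a_i g(., Z_i) (brute force here).
    final = lambda y, x: f(y, x) + sum(ai * g(y, x, Zi) for ai, Zi in zip(a, Zs))
    return max(product(dom, dom), key=lambda p: final(p[0], p[1]))

# Toy instance with stubbed-out callbacks (no real A1 verification).
f = lambda y, x: -(y - 1) ** 2 - (x - 2) ** 2
g = lambda y, x, z: -(x - z) ** 2
state = {"left": 1.0}
def find_ak(mix):
    ak = min(0.5, state["left"])
    state["left"] -= ak
    return ak if ak > 0 else None
choose_Zk = lambda mix: 2
res = partial_iterative_approx(f, g, range(5), find_ak, choose_Zk)
```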
The heuristic choices that can be made in the above process are the selections of the $Z_k$
constants, which affect the subsequent steps of the process. Of course, the above approach
still suffers from the second problem of the A1 / A2 approaches, namely that verifying that
A1 holds for a specific intermediate function (which is required for the $a_k$ selection) can
still be computationally expensive.

A final point to note about the process described in this section is that it can
also be applied using the A2 assumption instead of A1. In this case each maximization step
will be a sequence of iterations, one for each successive fixed value of z. Moreover, each
iteration will recursively apply the successive "constant choosing" step of the modified
(for use of A2) algorithm. That is, since under A2 the P1 process does not converge in a
single step, we will need several iterative P1 steps for the maximization step (step 3) of
each intermediate function which arises in the partial iterative approximation process.
Each P1 step will recursively invoke the partial iterative approximation process for all
successive intermediate functions.
Example

As an example of an application of the above algorithm, consider the following simple
scenario. Assume we want to maximize a function $g(x_1,x_2,x_3)$ such that:

$g(x_1,x_2,x_3) = f_1(x_1,x_2) + f_2(x_2,x_3) + f_3(x_3,x_1)$

Moreover, assume that the functions (local kernels) $f_1$, $f_2$ and $f_3$ are such that the strong z-minor
(A1) assumption does not apply in this case. However, assume that the local
kernels are such that we can apply the partial iterative approximation algorithm. Assume
that there exists $a_1$ such that:

$g_1(x_1,x_2,x_3,z_1) = f_1(x_1,x_2) + f_2(x_2,x_3) + (1-a_1)\,f_3(x_3,x_1) + a_1\,f_3(x_3,z_1)$

satisfies A1 around $z_1 = Z_1$. Also assume that:

$g_2(x_1,x_2,x_3,z_2) = f_1(x_1,x_2) + f_2(x_2,x_3) + a_1\,f_3(x_3,Z_1) + (1-a_1)\,f_3(x_3,z_2)$

satisfies A1 around $z_2 = Z_2$. Note that $g_2(x_1,x_2,x_3,Z_2)$ is loop free, and the procedure we
used to transform g into $g_2$ is exactly the 2-step partial iterative approximation algorithm.

If the A1 assumptions made in our example are satisfied, then the partial iterative
approximation algorithm applied to this example will terminate after a single
maximization step, as the resulting assignment to $x_1, x_2, x_3$ will be the maximal
assignment to the original function, due to Claim 1 in section 6.3. Also note that $Z_1 \neq Z_2$ in general,
as otherwise the original function would satisfy A1 with $x_1$ replaced by z in $f_3$.
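To make the example concrete, the following sketch runs the maximize-and-legalize process on a toy instance of the triangle decomposition above. The kernels f1, f2, f3 and the domains are illustrative assumptions; both maximizations are done by brute force, whereas in practice the loop-free chain would be handled by GDL.

```python
from itertools import product

# Toy local kernels of the loopy triangle g = f1 + f2 + f3 (values illustrative).
f1 = lambda x1, x2: -(x1 - x2) ** 2
f2 = lambda x2, x3: -(x2 - x3) ** 2
f3 = lambda x3, x1: -0.1 * (x3 - x1) ** 2 - x3   # weak loop-closing kernel

dom = range(4)

def g_cut(x1, x2, x3, z1):
    # The loop is cut by replacing x1 with z1 inside f3; with z1 fixed, a chain remains.
    return f1(x1, x2) + f2(x2, x3) + f3(x3, z1)

# Maximize-and-legalize: fix z1, maximize the cut function, legalize z1 := x1.
Z = 3
for _ in range(20):
    x1, x2, x3 = max(product(dom, dom, dom),
                     key=lambda p: g_cut(p[0], p[1], p[2], Z))
    if x1 == Z:
        break
    Z = x1

# Exact loopy maximum, for comparison (brute force).
exact = max(product(dom, dom, dom),
            key=lambda p: f1(p[0], p[1]) + f2(p[1], p[2]) + f3(p[2], p[0]))
```

On this toy instance the process legalizes once and lands on the exact loopy maximum; in general only the A1/A2-style guarantees discussed above ensure this.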
However, if we replace one or both of the A1 assumptions made in our example with the A2
assumption, then the run of the partial iterative approximation algorithm will not
terminate in a single step. Instead, it will run several maximize-and-legalize (P1) iterations
over $g_2$, starting at $z_2 = Z_2$, until P1 converges; it will then legalize $z_1$ (which started
from $Z_1$) to a new fixed value and will run maximize-and-legalize (again with respect
to $z_2$) on the new function, and so on. Both the A1 and A2 variants of the partial iterative
approximation algorithm are illustrated in Figure 11.
Figure 11: Partial Iterative Approximation algorithm illustration. The nodes $z_1$ and
$z_2$ start from the values $Z_1$ and $Z_2$ respectively. Under the A1 assumption (applied twice)
the process terminates in a single step and slow connections are not needed. In case
we assume only A2, we run maximize-and-legalize on $z_2$ only, with $z_1$ fixed, then
legalize $z_1$ over the slow connection, then run maximize-and-legalize again, and so
on until convergence. The slow connections used in the process have different
update "speeds": the red slow connection is considerably slower than the green
one.
6.4.2. The hybrid approach
This section addresses the second problem arising in our approach for dealing with loopy
inference, which is the complexity of choosing the subset z of variables to keep fixed
during the maximization step. It is often hard and inefficient to select a set of appropriate
z variables to satisfy A1 / A2 in the ),,( zxyg construct. We approach this problem by
using the fact that, as noted briefly in the previous section, P1 can be used regardless of
whether the function in the maximization step is loop free or not. Indeed, assume we
express the function ),,( zxyg as ),,(),,(),( 21 zxygxxygxyf and assume that A1
holds for this function ),,(2 zxyg , while 1g is still loopy. If we have some way of
maximizing this function with z fixed, we can still use the maximize-and-legalize
approach: maximize the function with z fixed, then legalize it by setting xz ˆ where x is
the maximizing x assignment.
Using this fact, a plausible approach (in the efficiency sense) for solving MAP over a
given loopy decomposition of $f(x)$ can be obtained by an iterative selection of variables
into the z vector. The variables selected into z are variables that participate in
the loopy part of the decomposition of $f(x)$, i.e. each candidate variable should participate in
at least one loop. The resulting scheme has the following form:

1. First construct the local domain graph for the decomposition of $f(x)$.

2. At each step select at least one variable of one (or several) of the local domain nodes
of the current graph, so that the variable, taken as the z variable, satisfies A1. In other
words, if we denote the selected variable by x, then the function $g(y,x,z)$, which results
from $f(x)$ if we replace x by z in the selected local domain(s) while denoting by y
the set of the remaining variables, satisfies A1 for z. We discuss below useful
heuristics for selecting these variables.

3. Fix $z = Z$ for some constant Z. By Claim 1 we know that:

$\arg\max_{y,x} g(y,x,Z) = \arg\max_{y,x} g(y,x,x)$
4. Advance to the next step (either terminate if the termination condition is satisfied, or
return to step 1) by updating the local domain graph to represent the decomposition of the new
$g(y,x,Z)$ into the sum (or product) of the updated set of local kernels. After
fixing the z variable in the selected local domain(s), the local kernel for each such local
domain is updated accordingly, becoming a function of a larger set of variables (including
the new z variable).

5. Terminate when the local domain graph has a JT (i.e. represents a loop-free
decomposition), or when no new z variable (or set of variables) can be selected.
As an example of the above method, consider the experiments we ran to test the "slow
connections" approaches. The setting of these experiments, together with a detailed
description of the applied algorithms, is given in section 7, while the empirical results are
given in section 8.2.
Note that we could replace the requirement of satisfying condition A1 in the above discussion by satisfying
A2. We can do so by making the maximization steps iterative. That is, each maximization
step (step 3), which operates on some $z_i$ selected in step 1 of the current iteration of the
scheme, would consist of several P1 iterations. Each P1 iteration would perform
maximization by recursively applying the following "z selection" steps (recursively
invoking step 1 on the updated local domain graph), and legalization by assigning a new
value to the $z_i$ variable.

A more efficient version of working under A2 would be to iterate on the whole set of z
variables selected by all the steps, changing their values in each subsequent P1 iteration to
the corresponding x values. That is, instead of recursively applying P1 for each additional
variable selected into z, select the whole z vector and only then apply P1 for the whole z.
The latter version is not equivalent to the former in the general case, as the selection of
subsequent z variables depends on the constants selected in the previous steps. However,
it can be used as a more efficient heuristic.
As an example of the above scheme augmented with A2, consider the "slow connections"
experiments whose results are given in section 8.2. The "different slow speed" approach
uses a recursive application of P1, selecting variables into z one by one and applying itself
recursively for each successive fixed value (resulting from the legalization step of P1) of
each selected variable. The "same slow speed" approach selects the whole set of z
variables a priori and only then applies the P1 algorithm.
Although it can be beneficial in some cases, the above procedure has two limitations:

1. Selecting even a single variable, or a small set of variables, satisfying A1 / A2 can be
problematic in some cases. This is because verifying A1, for instance, involves
estimating two quantities. The first is $\max_z |g(y_Z, x_Z, z) - g(y_Z, x_Z, Z)|$, which usually
can be easily estimated for a single variable z which appears in only a single local
kernel. The second is the gap $g(y_Z, x_Z, Z) - \max_{(y,x) \neq (y_Z,x_Z)} g(y, x, Z)$, which can be exponentially hard
to estimate in the general case. However, a useful heuristic might be to select the z variable at
each step as the one with the minimum value of:

$M_z = \max_{Z,z} |g(y_Z, x_Z, z) - g(y_Z, x_Z, Z)|$

The reasoning behind this heuristic is that if we restrict ourselves to a single z
variable selection, then selecting a z variable such that $M_z$ is not minimal means that
there is a non-z variable (the one that would give the minimal $M_z$ if selected into z) with a
smaller change than z, which potentially contradicts A1 / A2. Moreover, the ease of
selection of subsequent z variables can be manipulated by an appropriate selection of the
Z constants at each step.
2. The procedure might terminate before all the loops are eliminated from the local
domain graph, that is, before the graph has a JT to which the GDL algorithm can be
applied. To address this issue, we combine the method proposed above with the
standard JT technique for coping with loops, which is the method of triangulation. As
discussed earlier in this work, using triangulation, a JT can be constructed for any
loopy network. In our case, triangulation can be applied, after the z selection process
can no longer proceed, to the "moral graph" which results from the decomposition of
the function of the final step of the process. From the triangulated moral graph,
maximal cliques are extracted to form the nodes of the JT (as described in section
5.1, which discusses triangulation). This combination of z variable selection with
the complementary triangulation, applied when the z selection can no longer proceed,
forms what we call below the "Hybrid Approach". The advantage of this method over
standard triangulation is the potential reduction of the treewidth of the moral
graph. Note that the complexity of GDL applied to the triangulated network (i.e. on
the JT resulting after triangulation) is exponential in the treewidth; hence, fixing the
"lateral connections" potentially results in an exponential decrease in the final GDL
computation complexity.
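The $M_z$ selection heuristic from point 1 above can be sketched for small tabulated functions. The toy kernels below are illustrative assumptions, and everything is brute force (feasible only for tiny domains); the candidate with the smaller $M_z$ corresponds to cutting the weak loop-closing kernel.

```python
from itertools import product

def Mz_score(g, v_dom, z_dom):
    """M_z = max over Z, z of |g(v_Z, z) - g(v_Z, Z)|, where v_Z = argmax_v g(v, Z)."""
    score = 0.0
    for Z in z_dom:
        v_Z = max(v_dom, key=lambda v: g(v, Z))
        base = g(v_Z, Z)
        score = max(score, max(abs(g(v_Z, z) - base) for z in z_dom))
    return score

# Loopy toy function over (x0, x1, x2): strong kernels on (x0,x1) and (x1,x2),
# a weak loop-closing kernel on (x2,x0).
strong = lambda a, b: -(a - b) ** 2
weak = lambda a, b: -0.1 * (a - b) ** 2
dom = list(product(range(3), repeat=3))

# Candidate A: cut the loop inside the weak kernel (replace x0 there by z).
gA = lambda v, z: strong(v[0], v[1]) + strong(v[1], v[2]) + weak(v[2], z)
# Candidate B: cut a strong kernel instead (replace x1 in it by z).
gB = lambda v, z: strong(v[0], z) + strong(v[1], v[2]) + weak(v[2], v[0])
```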
In the following section we will present an upper bound on the decrease in GDL
complexity that can be achieved by our techniques, when operating over a complete
"moral graph" with each edge represented by a local kernel.
6.5. Clique Carving
Assume we have a function $f(x)$, where $x = (x_1, \ldots, x_n)$ is its vector of variables. Assume
$f(x)$ decomposes into a sum (a product would do equally well) of local kernels of two variables each, so
that the resulting "moral graph" is complete. I.e., $f(x)$ has the following decomposition:

$f(x) = \sum_{i < j} f_{ij}(x_i, x_j)$

The treewidth (the size of the largest clique after triangulation) of the resulting moral
graph is n, and the GDL complexity, for MAP for instance, is clearly exponential in n. Now
assume we are using our hybrid approach on this problem:
Each time a variable $x_i$ from a local kernel $f_{ij}(x_i, x_j)$ is chosen and fixed, exactly one
edge of the moral graph is removed.

Suppose we have succeeded in removing only one edge from the moral graph. The
resulting graph has exactly two maximal cliques (the first containing one of the
removed edge's end nodes, and the second containing the other). Each clique is of size
n-1 (thus the treewidth of the resulting graph is n-1), and most importantly, the graph
is triangulated (obviously, as any cycle of length more than three must contain at least
one node not adjacent to the removed edge; such a node is connected to all the other
nodes on the cycle, and thus the cycle has chords).

However, if we remove exactly two edges, the treewidth of the resulting graph remains
n-1, as triangulating it would yield back the graph with only one edge
removed (since the four nodes adjacent to the removed edges form a chordless cycle).
In general, if we remove enough edges so that the sub-graph induced (in the remaining moral graph) by
the nodes adjacent to the removed edges is a tree (or a forest) of m nodes, then the resulting moral graph
will be:

o Triangulated – as any cycle of length larger than three will involve at least one
node not from the tree. Such a node is connected to all the nodes of the
graph (as none of its edges were removed), and thus in particular it will be connected
to all the nodes on the cycle, giving it chords.

o Of treewidth n-m+2 – as the largest cliques of the resulting moral
graph consist of all the non-tree nodes and exactly two tree nodes (taking
three or more tree nodes will not form a clique, as at least one edge among them will be
missing).

Thus the decrease in GDL complexity over the JT for the resulting moral graph is
exponential in m-2.

Moreover, the number of edges needed to "carve" a tree of size m (hence the name
"Clique Carving") out of a complete graph is $O(m^2)$ (obviously). Hence the upper
bound on the decrease in computation complexity that can be achieved by the hybrid
approach which succeeds in removing m edges (in the complete graph case) is
exponential in $\sqrt{m}$.
Even more generally, we can rely on a well-known graph-theoretic result from [12],
which states that if every node of a graph $G_1$ is connected to every node of a graph $G_2$,
then the treewidth of the resulting graph $G_1 \oplus G_2$ is given by:

$treewidth(G_1 \oplus G_2) = \min\{treewidth(G_1) + |V_2|,\; treewidth(G_2) + |V_1|\}$

where $V_1$ and $V_2$ are the vertex sets of $G_1$ and $G_2$ respectively. Thus, if we
"carve" out a graph of treewidth k with m nodes, then the treewidth of the
resulting moral graph will be $\min\{k + (n-m),\; m + (n-m)\} = n - m + k$, as $k \le m$. Thus
the upper bound on the GDL complexity decrease that the hybrid approach can provide
us in this case is exponential in m-k.
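The clique-carving bound above can be checked by brute force on a toy instance (n = 7, m = 4 here; a real moral graph would of course be far larger). We keep only a path (a tree) among the m carved nodes of a complete graph and measure the resulting largest clique:

```python
from itertools import combinations

def max_clique_size(adj):
    """Brute-force maximum clique size (fine for the tiny n used here)."""
    n = len(adj)
    for k in range(n, 0, -1):
        for cand in combinations(range(n), k):
            if all(v in adj[u] for u, v in combinations(cand, 2)):
                return k
    return 0

n, m = 7, 4
adj = {u: {v for v in range(n) if v != u} for u in range(n)}   # complete moral graph K_n
tree_edges = {(0, 1), (1, 2), (2, 3)}                          # a path over the m carved nodes
for u, v in combinations(range(m), 2):                         # carve: keep only the tree among them
    if (u, v) not in tree_edges:
        adj[u].discard(v)
        adj[v].discard(u)
carved_treewidth = max_clique_size(adj)
```

The largest remaining clique consists of the n-m non-tree nodes plus one adjacent tree pair, matching the n-m+2 bound.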
We conclude this section by noting that, in the case of a complete moral graph, the worst case
for the hybrid approach is the one discussed above, i.e. the
case of local kernels whose domains are of size two (obviously, as in this case the hybrid approach
eliminates one edge for each z variable selection).
7. Applying “Slow Connections” approaches in practice
We performed a set of computational experiments to test the performance of the "slow
connections" MAP approximation schemes described above. These tests utilized a
graphical model that we call a "clique-tree" graph. In the clique-tree there are several
cliques, connected together in the form of a tree. The clique-tree results from a
tree by connecting every node to all of its siblings (children of its parent) in the original
tree. The function maximized over the clique-tree was the sum of the logarithms of the
edge weights of the tree, plus the sum of the local weights of the nodes. That is, in a
clique-tree G, the function to be maximized was:

$f(G) = \sum_{(v_i, v_j) \in E(G)} \log(w_{ij}(v_i, v_j)) + \sum_{v_i \in V(G)} f_i(v_i)$

where the arguments of the function (such as $v_i, v_j$) are variables residing in the nodes
of G.
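For concreteness, this objective can be written out on a minimal toy clique-tree (a root with two siblings; the weights below are arbitrary illustrative values, not those used in the actual experiments):

```python
import math
from itertools import product

# Toy clique-tree: root 0 with children 1 and 2; the sibling edge (1, 2)
# closes the clique. All variables are binary.
edges = [(0, 1), (0, 2), (1, 2)]
w = lambda a, b: 1.0 + a + 2 * b      # positive edge weight w_ij (illustrative)
f_loc = lambda a: 0.5 * a             # local node weight (already log-domain here)

def objective(assign):
    return (sum(math.log(w(assign[i], assign[j])) for i, j in edges)
            + sum(f_loc(v) for v in assign))

best = max(product([0, 1], repeat=3), key=objective)
```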
The experimental testing used the algorithm described in the "hybrid approach" section.
This computation can also be regarded, in a more simplified manner, as a message passing
algorithm in which some messages proceed more slowly than others. Moreover, suppose the
junction graph consisting of only the "faster" links is a tree (i.e. it has no loops). Then the
algorithm can be viewed as iteratively running GDL (i.e. the standard bottom-up, top-down
algorithm) on the "faster"-links tree, and then passing the values obtained from the GDL over
the slow links, to serve as "initial messages" (for the next GDL iteration) received over these
links. These messages affect the values of the GDL local kernels. In this general setting,
several factors may be varied to produce different algorithms:

1. The selection of the subset of slow edges (as well as their directions) is a key factor
for the performance of the general algorithm above as a MAP estimator. In our
experiments we compared several approaches to slow-edge selection. The tested
variants included "iterative A2 selection" and "random selection" (explained
further below).
In addition, the slow edges may be fixed (i.e. the same slow edges are used during the
entire run of the algorithm), or they may be re-selected during runtime, thereby
affecting the message passing schedule. In our experiments we tested both the
fixed and the varying slow-edge schemes.

2. The "speed" of the slow edges may vary. The slow edges may be updated together, or
updated by some specific schedule. If we allow the slow-edge speeds to vary, we may
further improve the results under the A2 assumption, as we can then apply A2
recursively, fixing one edge at a time and iterating over it (recursively applying the
same fixing algorithm) as long as there is improvement. However, this approach is
potentially much less efficient in cases where there are many loops, because every loop
appears in the recursion, and the run-time complexity is then potentially exponential in
the number of loops. In our experiments we tested both alternatives.
All the tested algorithms iterated over three kinds of steps:

1. Edge removal step – at this step an edge or a set of edges is selected, together
with fixing their directions. By fixing the direction of an edge $(v_i, v_j)$ with edge weight
$\log(w_{ij}(v_i, v_j))$, we refer to selecting either $v_i$ or $v_j$ to be "fixed" and to be the sender
of the slow-edge update after the GDL step. After one of the edge directions is fixed,
the weight of the fixed edge becomes part of the local weight of the node which was
not fixed (i.e. if $v_i$ was selected in the direction selection, then $v_j$'s local weight is
updated).
Figure 12: Edge removal step. $G_1$ and $G_2$ are connected components of the graph. The
functions $f_i(v_i)$ and $f_j(v_j)$ are local weights of $v_i$ and $v_j$, while $Z_i$ is the fixed z
value selected for $v_i$ when edge $(v_i, v_j)$ is removed. The "slow connection", drawn
using green dots below, supplies $v_j$ with new values of $Z_i$, which are the maximizing
values of $v_i$ from the previous maximization iteration.
2. Contraction step – after one or more edges are fixed, some nodes may become
connected to the rest of the graph by means of a single edge only. The algorithm
readily runs the GDL update step from these nodes and removes them from the graph.
The messages passed in the update steps, which are the standard GDL messages over
the max-sum semiring, are incorporated into the local weights of the nodes receiving them,
as they are functions of the receiving node alone (this follows directly from the
definition of the GDL messages).

Figure 13: Edge contraction step. The functions $f_i(v_i)$ and $f_j(v_j)$ are local
weights of $v_i$ and $v_j$, while $m_{ij}(v_j)$ is the standard GDL message, defined as:

$m_{ij}(v_j) = \max_{v_i} \left[ f_i(v_i) + \log w_{ij}(v_i, v_j) \right]$
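The contraction message can be sketched directly (toy binary domains and illustrative weights; this is a one-line instance of the max-sum GDL message, not the thesis's actual experiment code):

```python
import math

def contract_message(f_i, log_w_ij, dom_i, dom_j):
    """GDL message over the max-sum semiring:
    m_ij(v_j) = max over v_i of [f_i(v_i) + log w_ij(v_i, v_j)].
    After the contraction, m_ij is simply added to v_j's local weight."""
    return {vj: max(f_i[vi] + log_w_ij[(vi, vj)] for vi in dom_i) for vj in dom_j}

# Toy binary node with illustrative weights.
f_i = {0: 0.0, 1: 0.3}
log_w = {(vi, vj): math.log(1.0 + vi + 2 * vj) for vi in (0, 1) for vj in (0, 1)}
m = contract_message(f_i, log_w, (0, 1), (0, 1))
```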
3. Split step – after one or more edges are fixed, some edges may become "splitting
edges" of the graph. A splitting edge is an edge whose removal disconnects the
graph into two connected components. When splitting edges arise, we view each of
the resulting (loopy) connected components as a node of a "super" tree connected via
the splitting edges. We then run the standard GDL algorithm over the "super" tree,
using our algorithm inside each connected component for the maximization steps of
the GDL.

Figure 14: Split step. $G_1$ and $G_2$ are connected components of the graph, connected by a
single edge $(v_i, v_j)$. The maximizing assignment is computed using GDL on the "super"
JT below. The "super" JT has vertices $G_1$, $G_2$ and $\{v_i, v_j\}$; the latter has the local kernel
$f_i(v_i) + f_j(v_j) + \log w_{ij}(v_i, v_j)$. The maxima in the $G_1$ and $G_2$ nodes are computed using the
"slow connections" algorithms.
Each edge removal step can lead to a contraction or split step, or alternatively to the
following edge removal step. A contraction step may lead to further contraction steps. In
general, it is better to run all the possible contraction steps prior to the split steps in order
to keep the process simpler, as otherwise single nodes may be unnecessarily regarded as
connected components – candidates for contraction – which is less efficient.
8. Experimental results
The next sections present the experimental results we obtained for our two novel
approaches: the Max-MI training method and the Slow Connections MAP
approximation. The results for the Slow Connections algorithm also include comparisons
with other MAP approximations, such as the commonly used Loopy Belief
Revision.
8.1. Max-MI classification model training
Problem setting
The problem we consider is object recognition. We construct and train feature-based
models that are used to solve this problem. In order to use a feature-based model, we need
a way to find a visual feature in an input image. The features that we use in our
experiments are image patches. For example, in Figure 15 the features are parts of a face,
and in Figure 16, parts of a cow. Each feature is represented hierarchically in terms of
simpler sub-features. Each feature is searched for in the image by normalized cross-correlation,
and it has two parameters: a threshold θ and a region of interest (ROI). A
feature Fi is detected (Fi = 1) if its correlation exceeds its threshold θi. It is searched for
within a limited window, given by the ROI, with respect to the position of its
parent node. We are given a set of features, and the problem we consider is the
construction of an optimal TAN structure and the optimal setting of all the threshold
and ROI values, such that the resulting model will have the maximum prediction power
for object recognition.
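The detection step described above (normalized cross-correlation of a patch over an ROI window, thresholded by θ) can be sketched as follows. The patch, image, ROI and threshold are toy-sized illustrative assumptions, not the actual features or databases used in the experiments:

```python
def ncc(window, patch):
    """Normalized cross-correlation between two equal-sized 2-D lists."""
    a = [v for row in window for v in row]
    b = [v for row in patch for v in row]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den if den else 0.0

def detect(image, patch, roi, theta):
    """Return (1, best position) if the max NCC inside the ROI exceeds theta."""
    ph, pw = len(patch), len(patch[0])
    (r0, r1), (c0, c1) = roi               # ROI window (relative to the parent position)
    best, pos = -1.0, None
    for r in range(r0, r1 - ph + 1):
        for c in range(c0, c1 - pw + 1):
            window = [row[c:c + pw] for row in image[r:r + ph]]
            score = ncc(window, patch)
            if score > best:
                best, pos = score, (r, c)
    return (1 if best > theta else 0), pos

patch = [[0, 9], [9, 0]]
image = [[0] * 6 for _ in range(6)]
image[2][3], image[3][2] = 9, 9            # the patch pattern placed at (2, 2)
detected, where = detect(image, patch, roi=((0, 6), (0, 6)), theta=0.8)
```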
Our goal is to recognize a visual object in unseen images using the trained models. The
visual objects that we recognize in our experiments are parts of a face and parts of a cow,
and the models are trained over a face image database and a cow image database
respectively.
Experimental setting
The results in this part consist of experiments conducted on two models of different sizes.
The first was a "face parts" model consisting of 12 feature nodes, as depicted in Figure
15.
Figure 15: Original face parts model. Features are image patches which form a
hierarchy in which the larger patches appear at the top and the smaller patches appear
at the bottom. This model was constructed for face vs. non-face (binary class)
classification.
The second model was a cow parts model, consisting of 26 feature nodes as depicted in
Figure 16.
Figure 16: Original cow parts model. This model was constructed for cow vs. non-cow
(binary class) classification.
The trained parameters were the feature thresholds and ROIs. We tested both
thresholds + ROI training and thresholds-only training. In the experiments which trained
thresholds alone, the ROI was set to some fixed, preset value. Note that although we
used binary features (one threshold per feature) and a single ROI window per feature, we
could equally well train several thresholds and ROI windows for each feature, using the same
learning approach.
Learning the ROI parameter poses a special problem for the MaxMI combined with TAN
restructure scheme. The problem comes from the nature of ROI parameter. The ROI is a
search window of a feature, which is specified with respect to the feature‟s parent
location. Thus it is problematic to apply the conventional TAN restructure algorithm
introduced by Friedman et al. in [2]. As even when ROI is fixed, one needs to assign Fi
parent Fj or Fj parent Fi in order to compute the edge weight MI(Fi,Fj;C), since the result
will be potentially different for each choice of edge direction.
In the context of the visual interpretation problem described above, the original feature hierarchy is rather strict; that is, it usually makes no sense to reverse parent-to-descendant relationships. Hence, a possible solution to the ROI training problem in this context is to use a "constrained TAN" restructure step instead of the conventional TAN restructure step. This means that, instead of computing an MST on a full graph, we compute a directed MST on a "layered" graph consisting of several fully connected layers. Each layer consists of the features residing at the same depth in the current feature tree. For each edge, the parent node is the node with the smaller layer number, where the layer number of a node is its depth in the feature tree. The results of applying the constrained TAN heuristic are summarized in Tables 1 and 2 below. Schematic representations of the models restructured using the constrained TAN heuristic are given in Figures 17 and 18.
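The constrained TAN restructure step above can be sketched as follows. Because every node's parent must come from a strictly lower layer, each node can pick its best-scoring parent independently without ever creating a cycle, so no general directed-MST machinery is needed. This is only an illustrative sketch: the node names and MI values are invented, and `mi` stands in for the conditional mutual information edge weights computed during training.

```python
def constrained_tan_restructure(layers, mi):
    """layers: dict node -> layer number (depth in the feature tree, root = 0).
    mi: dict (i, j) -> symmetric edge weight (conditional MI of the pair).
    Returns dict node -> chosen parent; the root keeps no parent."""
    def weight(i, j):
        # the MI table is symmetric, so look the edge up in either order
        return mi.get((i, j), mi.get((j, i), 0.0))

    parents = {}
    for node, layer in layers.items():
        if layer == 0:
            continue
        # candidate parents are exactly the nodes with a smaller layer number
        candidates = [p for p, lp in layers.items() if lp < layer]
        parents[node] = max(candidates, key=lambda p: weight(p, node))
    return parents

layers = {"root": 0, "a": 1, "b": 1, "c": 2}
mi = {("root", "a"): 0.5, ("root", "b"): 0.4, ("root", "c"): 0.1,
      ("a", "c"): 0.3, ("b", "c"): 0.6}
parents = constrained_tan_restructure(layers, mi)
# "c" attaches to "b" (weight 0.6), and the layer hierarchy is preserved
```

Since the total edge weight decomposes as a sum over nodes and the layer constraint rules out cycles by itself, this per-node choice is in fact the optimal constrained spanning structure.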
Figure 17: Constrained TAN restructured face parts model. The feature hierarchy, in terms of each feature's layer number, was preserved in this case. The resulting model performed slightly better than the one trained with MaxMI + greedy TAN restructure.
Figure 18: Constrained TAN restructured cow parts model. The TAN restructure step of the training caused many of the 3rd-layer features to change their parent. However, the performance of the resulting model was the same as for the model trained with MaxMI + greedy TAN restructure.
In our experiments which included the ROI parameter, we also implemented a greedy variant of Friedman's TAN restructure algorithm. In each of the greedy algorithm's iterations one node was chosen and connected to the tree. The chosen node was the node that gave the maximum contribution to the TAN restructure score, MI(Fj;Fk|C).
That is, if the set of tree nodes added before iteration i is denoted by Ti, then the node chosen in iteration i was:

Fm = argmax over Fj in S \ Ti of [ max over Fk in Ti of MI(Fj;Fk|C) ]

where S stands for the set of all the nodes from which the tree is formed. The chosen node Fm was connected to the node of Ti at which the maximum of the inner term was obtained.
Increasing the TAN restructure score in turn increases the log-probability of the TAN model, so applying even the greedy variant of the algorithm is still reasonable. Schematic representations of the greedily TAN-restructured face and cow models are given in Figures 19 and 20. The numerical results of applying the greedy TAN heuristic are summarized in Tables 1 and 2.
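The greedy iteration above can be sketched as a small Prim-style loop. The sketch is illustrative only: `mi` plays the role of the conditional MI edge weights MI(Fj;Fk|C), and the node names and numbers are invented.

```python
def greedy_tan(nodes, root, mi):
    """Grow a tree from `root`; at each iteration attach the outside node
    whose best connection to some tree node has maximal edge weight."""
    def weight(i, j):
        # symmetric edge-weight lookup
        return mi.get((i, j), mi.get((j, i), 0.0))

    tree, parents = {root}, {}
    while len(tree) < len(nodes):
        # Fm = argmax over Fj outside the tree of max over Fk inside the tree
        w, j, k = max((weight(j, k), j, k)
                      for j in nodes - tree for k in tree)
        parents[j] = k          # connect the chosen node to the maximizing tree node
        tree.add(j)
    return parents

nodes = {"r", "a", "b", "c"}
mi = {("r", "a"): 0.5, ("r", "b"): 0.2, ("r", "c"): 0.05,
      ("a", "b"): 0.4, ("a", "c"): 0.1, ("b", "c"): 0.3}
parents = greedy_tan(nodes, "r", mi)
# grows r -> a (0.5), then a -> b (0.4), then b -> c (0.3)
```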
Examining the different options for TAN learning, it can be seen from these results that on the cow image database, constrained TAN performed slightly better than the greedy TAN heuristic in terms of the error rate.
However, it performed worse than greedy TAN over the faces image database, again in terms of the error rate. The reason is probably overfitting, since the MI of the model trained using constrained TAN is higher than that of the greedy TAN (over the faces image database), and so is the error rate over the training images. We conclude that the choice of a proper TAN restructure heuristic, in cases where the original TAN restructure step cannot be applied (e.g., when we train ROI parameters), is implementation dependent. More experiments with various image databases may also shed some light on the governing dynamics of the best TAN heuristic choice in our context.
Figure 19: Greedy TAN restructured face parts model. The resulting model performs better than the model with the original structure (both trained with MaxMI). However, the resulting tree structure is not as intuitive as the structure obtained with the constrained TAN restructure algorithm.
Figure 20: Greedy TAN restructured cow parts model. The greedy restructure step has "flattened" the model, in the sense that most of the lowest-level features became direct children of the root node. Again, although the performance was improved relative to the original structure, this tree construct is not intuitive.
Note also that in the experiments involving the threshold parameters only, the original TAN restructure (MST over the MI(Fj;Fk|C) edge weights) was applied.
Although the alternative hybrid approach presented at the end of Section 4.4 looks more promising than the MaxMI & TAN restructure iterations suggested at the beginning of that section, our empirical experiments have shown otherwise. The results of the alternative hybrid approach were worse than those of the MaxMI & TAN restructure iterations. One possible reason is data-specific; it is possible that more tests with other test / training data sets would show otherwise. We suggest that more empirical and theoretical research on the alternative hybrid approach would shed more light on its performance.
The numerical results are summarized in Tables 1 and 2 below. The original training refers to parameters obtained by another method (computed during feature selection in [8]).
Results summary
It can be seen that several versions of the MaxMI and TAN learning significantly outperformed the original training method used in [8] during the feature selection. The original training method chose the feature parameters by maximizing the local mutual information terms MI(Fi;C), where C is the class variable and Fi is each feature taken separately. Due to the specific choice of the faces image database, the difference between MaxMI (without TAN restructure) training and the original training is insignificant on the faces image database. However, it is highly significant (~36% improvement using MaxMI) on the "more difficult" cow image database.
In addition, the MaxMI with constrained TAN and greedy TAN methods gave significantly better error rates than the original training on both image databases. Moreover, using TAN restructure steps improved the error rates of the MaxMI algorithm alone. The performance improvement over MaxMI training alone, introduced by the TAN restructure steps, was especially significant on the faces image database (~45% with greedy TAN restructure) and less significant on the cow image database (~16% improvement with constrained TAN restructure).
Face Parts Model
(test DB size = 2257, training DB size = 767, class entropy on training DB = 0.792690834)

Training method                                     MI model to class   Error rate        Error rate
                                                    on training DB      on test DB        on training DB
MaxMI Training                                      0.758242464         135               25
Original Training                                   0.722429352         136               35
MaxMI Training with constrained TAN restructure     0.756855168         Miss=62, FA=36    Miss=15, FA=3
MaxMI Training with greedy TAN restructure          0.746516913         Miss=30, FA=44    Miss=16, FA=3
Alternative MaxMI Training with TAN restructure     0.74711484          Miss=33, FA=109   N/A
Threshold only training (without restructure)       0.738676981         Miss=84, FA=46    Miss=30, FA=5
Observed & Un-observed model training
  (constructed from the all-observed model
  and soft EM)                                      N/A                 67                N/A

Table 1: Information based training results summary for the face parts model
Cow Parts Model
(test DB size = 2256, training DB size = 961, class entropy on training DB = 0.46535663;
the MI of the model to the class on the training DB was not measured (N/A) for all methods)

Training method                                     Error rate        Error rate
                                                    on test DB        on training DB
Original Training                                   Miss=84, FA=64    Miss=36, FA=16
MaxMI Training                                      Miss=53, FA=42    Miss=25, FA=17
MaxMI Training with constrained TAN restructure     Miss=32, FA=48    Miss=17, FA=12
MaxMI Training with greedy TAN restructure          Miss=59, FA=30    Miss=23, FA=16
Observed & Un-observed model training
  (constructed from the all-observed model
  and trained using soft EM)                        89                N/A

Table 2: Information based training results summary for the cow parts model
8.2. "Slow Connections" approximation
Problem Setting and implementation details
Our experiments were conducted on a special class of loopy networks, the so-called "clique-tree" networks. The essence of a clique-tree network structure is that it is a tree of cliques, in which each two "connected" cliques are joined by a single edge between a node in one clique and a node in the other clique. The structure of a clique-tree network is illustrated in Figure 21.
A structure similar to the clique-tree network arises in many interesting applications. A clustering technique, such as triangulation, can be applied to any loopy belief network to produce a junction tree with enlarged local domains. The local kernels for the enlarged domains are aggregates of several original local kernels whose domains were clustered together to form the enlarged local domain.
We denote by T the "tree" of the clique-tree network, i.e. the tree whose nodes are the cliques of the network. In fact, T is a junction tree of the clique-tree network, and the local domains of T are the network's cliques. Hence, in order to compute the MAP assignment to the nodes of the network, we ran the GDL algorithm on T.
The maximization steps of the GDL, which are used to compute the messages sent from some clique Ci, need to maximize the sum of all the weights of the edges between the nodes of Ci plus the single-node messages received from neighboring cliques. We used our slow connections algorithm to perform this maximization. The messages received from neighboring cliques (all of which are functions of a single node, due to the structure of the junction tree T) were incorporated into the local weights of the appropriate clique nodes prior to the slow connections run. The slow connections run iterated (in this order) the edge removal, contraction and split steps described in detail in Section 7, until the approximate maximum and maximum assignment were computed.
The different slow connection techniques differed in the split step iterations: each technique used a different paradigm for selecting the edges to remove and replace by a slow connection.
Figure 21: Structure of the clique-tree network. The network is a tree of cliques, with each two neighboring cliques connected via a single edge. Each node v_i of the resulting graph has a local weight f_i(v_i) attached to it, and the weight of an edge (v_i, v_j) is denoted w_ij = log ψ_ij(v_i, v_j). Our goal is to maximize Σ_(i,j) log ψ_ij(v_i, v_j) + Σ_i f_i(v_i) and to find the maximizing assignment (argmax) to all the nodes {v_i}; that is, to solve the MAP problem on this loopy network.
Experimental Setting
Our experiments compared the performance of different variants of our "slow connections" algorithm with the standard Loopy Belief Revision (LBR) algorithm, which is a popular approach for dealing with loopy graphical models. In our experiments we used two stopping conditions for the LBR:
A node will not forward messages (updated with its local kernel) if the argmax (with the node variables as arguments) over all messages received from its neighbors, including the last message, does not change. That is, the last message did not change the node's maximizing assignment, although it could change the maximum value reached.
A node will stop forwarding messages after it has reached its maximal allowed quota of forwarded messages. That is, each node is allowed to forward at most k messages, where k is a pre-determined parameter. In our experiments we used k=50 and k=10.
Our empirical experiments indicated that this variant of LBR produces reasonable results in many cases. However, it was outperformed by our proposed "slow connections" approach.
We compared the following cases:
Weak z-minor (A2) with different "slow" speeds
The edge removal step of this algorithm greedily selected, from all the available edges, the edges with minimal "variation" (difference), and selected the "fixed" directions which gave the minimal difference. As mentioned in Section 6.4.2, selecting edges that can be made slow, that is, edges whose weights satisfy one of the A1 or A2 assumptions relative to the rest of the graph, is problematic in general. The problem lies in the efficiency of verifying that the A1 or A2 assumption indeed holds. Hence, we used the following heuristic for the selection of slow edges and their directions.
At each edge removal step, the edge selected to be removed and replaced by a slow edge was:

(v_i, v_j) = argmin over edges (v_i, v_j) of [ max over v_j of min over Z_i of max over v_i of | log ψ_ij(v_i, v_j) − log ψ_ij(Z_i, v_j) | ]
The reasoning behind this selection is that we want to find an edge with minimal variability, i.e. with minimum effect when we fix one of its directions by turning one of its nodes into a z-variable. We measure variability by maximizing, over all the non-fixed variable values, the minimum over all possible fixed variable values of the maximum z-difference. The maximum z-difference for a given fixed value Z_i is

max over v_i of | log ψ_ij(v_i, v_j) − log ψ_ij(Z_i, v_j) |

(here we consider v_i as a candidate for becoming the z-variable of the edge (v_i, v_j)).
The initial fixed value of the z-variable of the edge (v_i, v_j) made slow at an edge removal step was the one providing the minimum, over all values of v_j, of the maximum z-difference:

Z_i = argmin over Z_i of [ max over v_j of max over v_i of | log ψ_ij(v_i, v_j) − log ψ_ij(Z_i, v_j) | ]
At subsequent steps of the slow connections algorithm the fixed value was updated in
the legalization steps of the maximize-and-legalize (P1) algorithm.
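The selection rule above can be sketched directly from its definition. The names and values below are illustrative; `w[(vi, vj)]` stands for the edge weight log ψ_ij(vi, vj) of a candidate edge.

```python
def z_difference(w, values_i, zi, vj):
    """max over vi of |w(vi, vj) - w(zi, vj)| for a candidate fixed value zi."""
    return max(abs(w[vi, vj] - w[zi, vj]) for vi in values_i)

def variability(w, values_i, values_j):
    """max over vj of min over zi of the maximum z-difference."""
    return max(min(z_difference(w, values_i, zi, vj) for zi in values_i)
               for vj in values_j)

def select_slow_edge(edges):
    """edges: dict name -> (w, values_i, values_j); pick the edge least
    affected by fixing its i-side to a z value."""
    return min(edges, key=lambda e: variability(*edges[e]))

vals = (0, 1)
flat = {(0, 0): 1.0, (1, 0): 1.0, (0, 1): 2.0, (1, 1): 2.0}   # constant in vi
steep = {(0, 0): 0.0, (1, 0): 5.0, (0, 1): 0.0, (1, 1): 5.0}  # strongly varies in vi
chosen = select_slow_edge({"flat": (flat, vals, vals),
                           "steep": (steep, vals, vals)})
# the "flat" edge has zero variability, so fixing its direction is harmless
```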
If our heuristic succeeded in choosing slow edges such that each of them satisfied A1 or A2 with respect to the network state at the time the choice was made (during each edge removal step), then the applied slow connections algorithm is guaranteed to converge to a local or a global optimum, depending on which of the assumptions, A1 or A2, is satisfied at each slow edge. We call the assumption that all the selected edges satisfy A1 (or satisfy A2) with respect to the state of the network at the time they were selected "iterative A1" (or "iterative A2", respectively).
By using different slow connection speeds we refer to applying A1 or A2 at each edge selected to be slow, separately. This means that we apply the maximize-and-legalize (P1) algorithm on each slow edge by itself, with respect to the state of the network at the time this edge was selected, i.e. when the corresponding edge removal step was applied.
Using different slow connection speeds, we get a recursive algorithm which is guaranteed to achieve the global optimum under the "iterative A1" assumption and a local optimum under the "iterative A2" assumption. However, the convergence speed of this algorithm is not guaranteed under the A2 assumption, and even tends to be exponential in the number of loops. Under A1 it is linear in the size of the graph, as each P1 application terminates in one step.
Weak z-minor (A2) with same "slow" speeds
This method is the same as above, except that all the slow edges were updated simultaneously. That is, in this case the slow connections were selected once, with their fixed directions and initial fixed values, and the slow connections algorithm applied was the original maximize-and-legalize (P1), in which the fixed ends of all the slow connections served as the z-variables. This algorithm can be regarded as a synchronous distributed algorithm operating under a global clock. At even clock ticks (starting from the zero tick) the algorithm runs a GDL message passing algorithm on all the fast (that is, non-slow) edges of the network, i.e. it runs the maximize step of P1. At odd clock ticks the algorithm propagates the values obtained in the previous (GDL) tick over the slow connections, in the fixed direction, i.e. from the z-node to the other node, so that at the successive (even) clock tick these values are incorporated as the new fixed z values in the local kernels over which the GDL is run.
Experimentally, this approach operated in linear time; that is, the number of iterations needed until the function being maximized (over the whole clique) stopped increasing was relatively small.
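The even/odd clock scheme above can be illustrated on a toy one-loop model. The brute-force maximization below merely stands in for the exact GDL pass on the fast edges, and the weights are invented for illustration.

```python
from itertools import product

# one loop x0 - x1 - x2 - x0 over binary variables; edge (x2, x0) is the slow one
w01 = {(0, 0): 2, (0, 1): 0, (1, 0): 0, (1, 1): 2}
w12 = {(0, 0): 2, (0, 1): 0, (1, 0): 0, (1, 1): 2}
w20 = {(0, 0): 1, (0, 1): 0, (1, 0): 0, (1, 1): 1}

def score(s, z):
    # fast edges see the full assignment; the slow edge sees x2 only as the fixed z
    return w01[s[0], s[1]] + w12[s[1], s[2]] + w20[z, s[0]]

def same_speed_toy(z=0, max_ticks=10):
    value, s = None, None
    for _ in range(max_ticks):
        # even tick: exact maximization on the fast part with z frozen
        s = max(product((0, 1), repeat=3), key=lambda t: score(t, z))
        new_value = score(s, z)
        if value is not None and new_value <= value:
            break                       # the maximized value stopped increasing
        # odd tick: propagate the new x2 value over the slow edge
        value, z = new_value, s[2]
    return value, s

value, assignment = same_speed_toy()
# converges in one round to the global maximum of the full loopy objective
```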
Random Slow Connections
The slow connections and the fixed directions were selected at random and were updated simultaneously (i.e. they were all of the same speed).
The numerical results are summarized in the following tables. The rows represent different kinds of test sets. The randomly generated test sets differ in the number of possible values for each node. In addition, we tested clique-trees generated from face classification models, with edge and local weights based on the "natural" distributions of the features in the models.
Following are explanations of the terms used in the tables:
Model Size – the depth and the branching of the clique-tree. The depth is the number of levels in the tree of cliques plus one. The branching is the size of each clique minus one. For example, a clique-tree with depth = 3 and branching = 5 has six cliques, each of size 6.
Node Count – the number of nodes of the clique-tree, i.e. the number of variables in the maximized function.
Value Count – the number of values that each variable can take. For example, if the value count is two then we maximize a function with binary variables, while if it is four then every node has four possible values.
Sample Count – the number of clique-tree networks that were generated. The
average approximation rates were calculated over all these networks. The networks
were either randomly generated or constructed from “natural” examples such as
feature trees used in MaxMI experiments. All the corresponding rows of all of the
tables represent experiments performed on the same generated set of networks.
Average Approximation – the percent of the true maximum value obtained, averaged over all the sampled clique-tree networks. The percent is measured over the difference between the true maximum value and the true minimum value. The true maximum and minimum values were calculated using exhaustive search.
Average Mismatch – the average number of values in the approximate maximal
assignment which differ from the values of the true maximal assignment. The average
is calculated over all the generated samples.
Average Match % – the average percent of the values of the approximate maximal assignment that match the values of the true maximal assignment. The percent is taken over the node count.
Models based on natural feature trees – the clique-tree networks generated from observed & unobserved feature trees similar to the ones used in the MaxMI experiments. The clique-trees were constructed from the unobserved nodes of the feature tree. The clique-tree structure and edge weights of all the sample models were the same; the only difference between the models was the varying local weights, which represented the evidence input. That is, for each test image of the feature tree, the local weight of a node was set to the probability of the corresponding unobserved node being 1 or 0 given the value of its attached observed node, calculated from the test image.
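For concreteness, the approximation figure for a single sample can be written as the following helper (illustrative, not thesis code):

```python
def approximation_rate(value, true_max, true_min):
    """Percent of the true maximum achieved, measured over the range
    between the true maximum and true minimum (both obtained by
    exhaustive search)."""
    return 100.0 * (value - true_min) / (true_max - true_min)

# e.g. reaching 9 when exhaustive search gives max 10 and min 0 scores 90%
rate = approximation_rate(9.0, 10.0, 0.0)
```

The Average Approximation entries in the tables are these per-sample rates averaged over all generated networks.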
We compared the results obtained by the different methods using statistical tests. We compared, by a paired t-test, the significance of the performance differences between A2 (same "slow" speed) and Belief Revision (50 messages) when applied to the depth 3, branching 5, 3-valued model. The difference in performance between the slow connections technique and Belief Revision was highly significant: p < 10^-10, n = 1000, two-tailed paired t-test. In addition, the slow connections method was superior, p < 10^-10, one-tailed paired t-test. Moreover, the t-test established with confidence > 0.9999 that the true interval for the difference mean is 5.2% to 6.2% in favor of the slow connections scheme (the units of the difference mean are percents of the difference between the true maximum and true minimum of the test models).
This result was confirmed by Wilcoxon's Signed Rank Test, yielding a probability of 10^-150 for the means of the corresponding performance data to be equal.
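The paired t statistic used above can be sketched in a few lines; the per-sample scores below are invented, not the experimental data.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(xs, ys):
    """Paired t statistic: mean of per-sample differences divided by the
    standard error of the differences (n - 1 degrees of freedom)."""
    d = [x - y for x, y in zip(xs, ys)]
    return mean(d) / (stdev(d) / sqrt(len(d)))

# toy per-sample approximation rates of two methods on the same networks
t = paired_t([96.0, 98.0, 97.0], [95.0, 96.0, 94.0])
```

Pairing by network is what makes the test appropriate here: both methods were run on the same generated sample of clique-trees.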
Summary
From the results obtained in the experiments, we can see that the slow connections based algorithms significantly outperform both simple loopy MAP approximations, like ignoring the loopy links and maximizing on the resulting tree, and more complex algorithms like the popular Loopy Belief Revision (LBR).
Among all the slow connections algorithms, the most promising one is, in our opinion, the weak z-minor based "same speed" slow connections algorithm. There are two reasons that make it the preferred choice. One is that it is far more efficient than the different-speed variant, which is likely to be exponential in the number of loops of the network. The other reason that makes it interesting is that there is reason to believe that in the human brain there are constructs of fast and slow links, where the fast links perform up-and-down computations and the slow links pass their messages between such computations. This closely parallels what the "same speed" algorithm does; hence an interesting research direction is trying to explain some brain functions using this algorithm.
A2 (different "slow" speed)

Model Size                                    Node   Value  Sample  Average        Average    Average
                                              Count  Count  Count   Approximation  Mismatch   Match (%)
Depth=3, Branching=5                          31     4      1000    98.26%         10-11      65.22%
Depth=3, Branching=5                          31     3      1000    98.08%         7-8        74.51%
Depth=3, Branching=5                          31     2      1000    98.55%         3-4        88.62%
Natural feature trees, 4 cliques of size 7    25     2      ~2000   97.85%         3-4        86.14%

A2 (same "slow" speed)

Model Size                                    Node   Value  Sample  Average        Average    Average
                                              Count  Count  Count   Approximation  Mismatch   Match (%)
Depth=3, Branching=5                          31     4      1000    94.11%         15-16      50.31%
Depth=3, Branching=5                          31     3      1000    94.55%         11-12      63.70%
Depth=3, Branching=5                          31     2      1000    97.16%         4-5        84.60%
Natural feature trees, 4 cliques of size 7    25     2      ~2000   98.34%         1-2        93.62%

Random Slow Connections

Model Size                                    Node   Value  Sample  Average        Average    Average
                                              Count  Count  Count   Approximation  Mismatch   Match (%)
Depth=3, Branching=5                          31     4      1000    82.70%         20-21      34.58%
Depth=3, Branching=5                          31     3      1000    81.52%         16-17      45.48%
Depth=3, Branching=5                          31     2      1000    79.37%         11-12      62.23%
Natural feature trees, 4 cliques of size 7    25     2      ~2000   N/A            N/A        N/A

Loopy Belief Revision (50 messages per node)

Model Size                                    Node   Value  Sample  Average        Average    Average
                                              Count  Count  Count   Approximation  Mismatch   Match (%)
Depth=3, Branching=5                          31     4      1000    N/A            N/A        N/A
Depth=3, Branching=5                          31     3      1000    89.17%         13-14      55.31%
Depth=3, Branching=5                          31     2      1000    88.73%         8-9        72.80%
Natural feature trees, 4 cliques of size 7    25     2      ~2000   93.34%         3-4        87.73%

Loopy Belief Revision (10 messages per node)

Model Size                                    Node   Value  Sample  Average        Average    Average
                                              Count  Count  Count   Approximation  Mismatch   Match (%)
Depth=3, Branching=5                          31     4      1000    87.65%         17-18      41.95%
Depth=3, Branching=5                          31     3      1000    86.74%         14-15      54.02%
Depth=3, Branching=5                          31     2      1000    85.78%         8-9        71.80%
Natural feature trees, 4 cliques of size 7    25     2      ~2000   N/A            N/A        N/A

Ignore Sibling Loopy Links

Model Size                                    Node   Value  Sample  Average        Average    Average
                                              Count  Count  Count   Approximation  Mismatch   Match (%)
Depth=3, Branching=5                          31     4      1000    74.04%         21-22      29.25%
Depth=3, Branching=5                          31     3      1000    71.89%         19-20      38.56%
Depth=3, Branching=5                          31     2      1000    69.38%         13-14      56.09%
Natural feature trees, 4 cliques of size 7    25     2      ~2000   73.45%         9-10       63.88%
9. Summary and conclusions
In this section we summarize the novel results developed as part of this work. The results
presented in this thesis are divided into two topics. The first topic is information based
training, under which we have developed several novel graphical models training
algorithms based on what we call the MaxMI training framework. The second topic is
loopy MAP approximation, for which we have developed a family of the so-called slow
connections algorithms. Following is a list which covers all the main results developed in
the thesis in their presentation order.
MaxMI based training
We have developed a MaxMI training algorithm for training the parameters of graphical models for the purpose of classification. MaxMI is an information-maximizing training algorithm, designed for training the feature parameters of all-observed TAN classification models. It can also be applied to general loopy belief networks, but it is efficient only in cases where the model's graphical representation has low treewidth.
We have shown that, under specific assumptions, the MaxMI algorithm maximizes the mutual information between the model and the class. This means that if the all-observed model consists of a vector of features F and a class variable C, then the parameters trained by the MaxMI algorithm maximize MI(C;F). The main difference between the MaxMI algorithm and other information based training techniques, such as maximizing the mutual information for each feature separately, as in [8], or maximizing the minimum pair-wise information increase MI(C;Fi,Fj) − MI(C;Fj) (where Fi, Fj are elements of F), as in [24], is that MaxMI, if its assumptions are satisfied, is guaranteed to find the feature parameters that maximize the mutual information of the entire feature vector with the class (for a given model structure).
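The distinction above can be illustrated with a small hypothetical numerical example: in XOR-like data each feature alone carries zero mutual information with the class, so per-feature training as in [8] has nothing to maximize, while the MI of the whole feature vector with the class is a full bit.

```python
from collections import Counter
from math import log2

def mi(samples):
    """Empirical mutual information MI(X; C) from a list of (x, c) pairs."""
    n = len(samples)
    pxc = Counter(samples)
    px = Counter(x for x, _ in samples)
    pc = Counter(c for _, c in samples)
    return sum((k / n) * log2((k / n) / ((px[x] / n) * (pc[c] / n)))
               for (x, c), k in pxc.items())

# the class is the XOR of two binary features, uniformly distributed
data = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)] * 25
per_feature = mi([(f1, c) for f1, f2, c in data])         # 0 bits
whole_vector = mi([((f1, f2), c) for f1, f2, c in data])  # 1 bit
```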
Experiments performed to test the performance of the MaxMI algorithm revealed that it is in fact superior to the previous information based approaches. In our opinion, this algorithm has the potential of becoming one of the state-of-the-art training algorithms for loop-free graphical models.
We have presented extensions of the MaxMI training algorithm for the case in which the classification model is constructed from both observed and unobserved (O&U) nodes. These types of models are especially useful for solving visual interpretation problems, since their unobserved nodes can be regarded as representing the interpreted parts of the visual object. We have presented two extensions of the MaxMI algorithm to this case.
The first was a straightforward augmentation of the MaxMI algorithm to support unobserved nodes. However, it is inefficient when applied to a special case of O&U models, the observed-in-leaves-only case, in which all the observed nodes of the loop-free O&U model are attached only to the leaf unobserved nodes and none of them is attached to the inner unobserved tree nodes.
The second was to apply the original MaxMI algorithm to the observed nodes of the model alone, obtaining the optimal parameters for that case, and then to train the unobserved-to-unobserved and unobserved-to-(trained)-observed parts of the original model with soft EM. Learning the feature parameters using EM alone is infeasible, since changing the observed feature parameters changes the EM training data. A method of applying soft EM to TAN models was also developed in the background coverage part of this work.
The second technique of O&U model training was tested as part of our empirical experiments and exhibited an improvement in performance over the all-observed model. This suggests that it has the potential of contributing to visual interpretation research in the future.
We have developed two hybrid techniques involving both MaxMI and N. Friedman's optimal TAN construction algorithm [2]. These techniques provide a method not only for training optimal feature parameters, but also for constructing optimal TAN model structures.
The first hybrid technique involved iterative application of a MaxMI parameter training step followed by a TAN restructure step. The MaxMI step searched for optimal feature parameters for the given model structure, and the TAN restructure step searched for an optimal structure for the given feature parameters. Iterating these steps was not guaranteed to converge, since the merits of the MaxMI and TAN restructure algorithms, although somewhat related, are still different. Hence, maximizing the merit of MaxMI could potentially decrease the TAN restructure merit and vice versa.
The second hybrid technique also iterated MaxMI and TAN restructure steps, but this time the MaxMI merit was augmented, in such a way that it could only be increased by TAN restructure steps. Therefore, the second hybrid technique guaranteed convergence to a model with maximal TAN restructure merit and maximal augmented MaxMI merit. This update was possible due to the relative similarity between the MaxMI and TAN restructure merits. The update to the MaxMI merit was in fact the addition of the Chow and Liu merit [21] of the TAN model with the class node removed. The addition of this term to the MaxMI merit only supports the use of one of the MaxMI assumptions, which is invariance to class node removal.
We have performed several experiments in order to test the performance of these
hybrid approaches.
The first approach exhibited very good performance relative to the other tested training schemes. It was also used as part of the training of the O&U models in our experiments (it participated in training the observed part of the model and decided on the final model structure prior to the soft EM training).
However, the second (convergent) approach exhibited poorer performance than expected. It may be that its relative lack of success was due to local properties of the image databases used in our experiments. We think that more empirical experiments and analytical inquiries are necessary in order to fully discover its potential.
We have suggested two so-called complete training approaches that make use of the
MaxMI based algorithms not only for training feature parameters, but also for
selecting the features themselves.
The first approach was what we call the constrained TAN based approach. Its essence was to combine MaxMI with a feature selection technique such as the one used in [8], gradually adding features and re-training the model using the hybrid approach based on the so-called constrained TAN restructure step instead of the original TAN step. The difference between the original TAN restructure and the constrained TAN restructure steps is that in constrained TAN restructure we are not allowed to change a feature's layer number in the hierarchy (in other words, we are not allowed to change parent-descendant relations). The merit of adding a new feature candidate to the model is defined to be the increase in the hybrid score after re-training the model with the new feature added to it.
Apart from being used in this complete approach, our experiments have shown that constrained TAN is a good heuristic for replacing the original TAN restructure step in our hybrid approaches, in cases when the trained parameters are affected by structural changes made by the original TAN restructure algorithm. One such case is the training of the ROI parameters that appeared in our experiments.
Training feature parameters is, in a sense, feature selection, since we can regard features with different parameters as different features. The second approach is a straightforward generalization of this remark. It refers to selecting features by training the feature-defining parameters, such as size and location in the training images, as part of the parameters trained by the MaxMI based algorithms. Of course, in order to use this algorithm, we need a systematic way of approaching the best trained parameter values in a coarse-to-fine manner. Otherwise, the algorithm will be inefficient due to the excessively large sets of possible values for the local domains of the MI decomposition.
The final result related to information based training is the analytical characterization and comparison of the maximal MI and minimum PE problem solutions in the so-called ideal scenario cases. By the ideal scenario we refer to the case in which we can select any k-valued feature F with which to classify a given n-valued class C. Here, by any feature F we refer to a (purely information theoretic) scenario in which we are able to set F's distribution, together with the CPT of C given F, to any desired functions.
We have shown that the optimal minimum PE solution in the ideal scenario case is obtained when the k most probable values of C are distributed among the k values of F, with each most probable C value being the most probable choice given its corresponding F value.
We have also shown that the optimal maximal-MI solution in the ideal
scenario case is obtained by dividing all the C values among k sets, each
corresponding to an F value, such that the entropy of choosing among the sets is
maximal. The set corresponding to F_i consists of all the C values having non-zero
probability given F_i. This is a very intuitive result, since the entropy of choosing
among the sets is removed from C's entropy when the value of F is known.
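Both ideal-scenario characterizations can be checked numerically on a toy class distribution (the numbers below are invented). For a deterministic, non-split grouping of C values into k groups, the MAP error is one minus the total mass of the per-group most probable C values, and I(C;F) equals the entropy of the group masses:

```python
from itertools import product
from math import log

c = [0.4, 0.3, 0.2, 0.1]   # P(C), hypothetical, sorted in decreasing order
k = 2

def entropy(ps):
    return -sum(p * log(p) for p in ps if p > 0)

best_pe, best_mi = 1.0, 0.0
for grouping in product(range(k), repeat=len(c)):
    groups = [[c[j] for j in range(len(c)) if grouping[j] == i] for i in range(k)]
    if any(not g for g in groups):
        continue                            # every F value must be used
    # MAP error: each group's most probable C value is the correct guess.
    best_pe = min(best_pe, 1.0 - sum(max(g) for g in groups))
    # I(C;F) for a deterministic grouping is the entropy of the group masses.
    best_mi = max(best_mi, entropy([sum(g) for g in groups]))
```

For this distribution the minimum PE is attained by letting each of the two most probable C values own an F value, while the maximum MI is attained by the balanced grouping {0.4, 0.1} vs {0.3, 0.2}.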
In addition, we argued that for general classification purposes, using the maximal-MI
problem solution is a better training paradigm than using the minimum-PE problem
solution. We gave several reasons for this; the following are two of them:
o There are no known general assumptions for the existence of a PE decomposition
that would allow us to train using PE minimization as a merit estimate. This is in
contrast to MI maximization, where such assumptions are the MaxMI
assumptions, under which MI decomposes into a sum of local terms, which in turn
allows us to maximize it using the GDL algorithm.
o When we increase the number of allowed guesses for obtaining the true C value
from the known F value, the min. PE solution tends to the max. MI solution.
Slow connections based loopy MAP approximation
We have developed the maximize-and-legalize algorithm, which allows us to obtain a
local or global maximum assignment (MAP assignment) of a function having a loopy
decomposition into a sum or a product of local kernels. By a function decomposition
we refer to a representation of the function as a sum or a product of smaller
functions, called local kernels, each operating on a local domain: a small subset of
the whole set of variables. By a loopy decomposition we refer to a decomposition
whose set of local domains does not have a junction tree.
Convergence of the maximize-and-legalize algorithm to a local or a global maximum
is guaranteed under the weak z-minor or strong z-minor assumptions, respectively.
The weak z-minor assumption requires the following:
o The loopy function has a subset of variables, which we call the x variables, such
that replacing them by so-called z variables turns the loopy decomposition into a
loop-free one, that is, a decomposition whose local domains have a junction tree.
The variables not replaced by z variables are called the y variables.
o At special points, obtained by fixing the z variables to a fixed value Z and maximizing
over x and y, the function of x, y and z has a special property: changing z from Z
to the value of x has a smaller effect on the function value than changing the value
of x to Z and changing the value of y to any value.
If a function has a decomposition which satisfies the weak z-minor assumption, then
the maximize-and-legalize algorithm is guaranteed to converge to a local optimum
of the maximized loopy function. By a local optimum here we refer to a point at
which changing the values of the y variables to any values, and changing the values
of the x variables to any of the fixed Z values passed during the run of the algorithm,
decreases the value of the function.
The strong z-minor assumption has a stronger requirement than its "weak"
counterpart. The additional requirement is that at the maximal points (for fixed z = Z),
as above, changing the x or y values to any other values has a bigger effect on the
value of the function than changing the z variables' values to the value of x. If a
function has a decomposition which satisfies the strong z-minor assumption, then the
maximize-and-legalize algorithm is guaranteed to converge to a global optimum point
of the maximized loopy function in a single step.
Since the weak and strong z-minor assumptions are not satisfied by general function
decompositions, we have developed the partial iterative approximation algorithm.
This algorithm allows us to approximate the maximal assignment of a function
decomposition f(x, y) + g(x, y), where f(x, y) is a loop-free part of the
decomposition and g(x, y) is its complement part, the one that introduces loops. The
approximate maximal assignment returned by the partial iterative
approximation algorithm is the global or a local maximal assignment of the function
f(x, y) + \alpha g(x, y), with 0 < \alpha \le 1. The value of \alpha largely depends on how well this
decomposition satisfies the strong or weak z-minor assumptions; optimally we try to
obtain \alpha = 1. The main idea behind this algorithm is to use the maximize-and-legalize
algorithm recursively on parts of the full decomposition, multiplied by small
constants. The constants are chosen small enough that the weak or strong z-minor
assumptions are satisfied on them.
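The role of α can be illustrated on a toy objective (both functions below are invented for the example; f stands in for the loop-free part and g for the loop-introducing complement). The exact maximizer of f + αg coincides with that of the full objective f + g only once α is close enough to 1:

```python
from itertools import product

# Invented toy parts: f is smooth with a peak at (2, 1); g rewards x == 0.
f = lambda x, y: -(x - 2) ** 2 - (y - 1) ** 2
g = lambda x, y: 6.0 * (x == 0)

def argmax(a):
    # Brute-force maximizer of f + a*g over a tiny discrete domain.
    return max(product(range(4), repeat=2), key=lambda xy: f(*xy) + a * g(*xy))

sols = {a: argmax(a) for a in (0.0, 0.5, 1.0)}
```

Here the partial objectives with α = 0 and α = 0.5 both select (2, 1), while the full objective (α = 1) is maximized at (0, 1), showing why one would like to push α as close to 1 as the z-minor assumptions allow.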
This algorithm can be informally viewed as chipping away parts of the local kernels
whose local domains generate loops. If the constants and the "loop introducing"
decomposition parts are chosen so that the strong z-minor assumption is satisfied at each
step, then the global maximal assignment of the loopy function is obtained in a single
step of the algorithm. However, if at some steps only weak z-minor satisfaction
occurs, then the algorithm will potentially run exponentially slowly in the number of
those (only weak z-minor satisfying) steps.
In order to make it possible to apply the maximize-and-legalize algorithm in practice,
we have developed a family of so-called slow connections algorithms. These
algorithms are message passing algorithms, which apply maximize-and-legalize steps
on different “loop introducing” parts of the decomposition. If the weak or strong z-
minor assumptions are satisfied iteratively on some ordering of these parts, then the
slow connections algorithms are guaranteed to converge to a local or a global
optimum.
Various slow connections algorithms were experimentally tested in this work. The
loopy network type on which we performed our experiments was a so-called clique-
tree network. The applied slow connections algorithms differed in the selection
methodology and the speed of the slow connections.
In the clique-tree, each local kernel of the complete network's function
decomposition corresponds to an edge of the network. Each edge of the network to
which the maximize-and-legalize algorithm was applied was called a slow connection,
and it forwarded messages at a slower rate than the other, "faster" edges.
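The schedule just described can be sketched in a heavily simplified, hypothetical form (this is a toy illustration, not the thesis's exact algorithm): max-sum message passing on a 3-node cycle, where one designated "slow" edge forwards its messages only every few sweeps while the remaining "fast" edges update every sweep:

```python
from itertools import product

K = 3                                                        # states per variable
theta = [[0.0, 1.0, 0.2], [0.5, 0.0, 0.0], [0.0, 0.3, 0.9]]  # unary log-kernels
edges = [(0, 1), (1, 2), (0, 2)]
nbrs = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
pair = lambda a, b: 0.4 if a == b else 0.0                   # pairwise log-kernel
slow, period = {(0, 2), (2, 0)}, 3                           # edge (0,2) is "slow"

msg = {(i, j): [0.0] * K for i, j in edges + [(j, i) for i, j in edges]}
for t in range(12):
    for (i, j) in list(msg):
        if (i, j) in slow and t % period != 0:
            continue                        # the slow connection skips this sweep
        # Max-sum message i -> j (unnormalized; fine for a short run).
        msg[(i, j)] = [max(theta[i][a] + pair(a, b)
                           + sum(msg[(n, i)][a] for n in nbrs[i] if n != j)
                           for a in range(K))
                       for b in range(K)]

# Decode each variable from its local belief (unary term plus incoming messages).
decode = tuple(max(range(K),
                   key=lambda a: theta[i][a] + sum(msg[(n, i)][a] for n in nbrs[i]))
               for i in range(3))

def score(x):                               # the full loopy objective
    return (sum(theta[i][x[i]] for i in range(3))
            + sum(pair(x[i], x[j]) for i, j in edges))

exact = max(product(range(K), repeat=3), key=score)
```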
Our experiments exhibited good performance of the slow connections algorithms. In
these experiments, the best slow connections algorithms significantly
outperformed our implementations of the commonly used Loopy Belief Revision
algorithm (confidence > 0.9999, two-tailed paired t-test).
The theoretical interest in slow connections algorithms, stemming from their good
performance, is complemented by hints that the brain contains structures which
exhibit similar functionality (that is, have slow connections between some of their
neurons). It is an established fact (see [10] and [11]) that some neural connections
in the brain are about ten times slower than others. Thus the slow connections
algorithms seem to have biological roots and can potentially be used to explain
some brain functions.
The slow connections algorithms can be used in a hybrid with standard techniques
for coping with loops, such as triangulation. In the hybrid, the use of slow connections
(which are loopy parts of the decomposition on which the maximize-and-legalize
algorithm is applied) is complemented by the use of triangulation, which deals with the
remaining loops. That is, triangulation deals with the loops that cannot be removed
using slow connections, because the weak and strong z-minor assumptions do not apply
to any of their edges.
We have shown that the upper bound on the runtime complexity decrease achieved
using slow connections together with triangulation on a single-clique loopy network is
exponential in O(m), where m is the number of slow connections used.
10. Future work
In this section we will summarize several interesting research directions related to the
ideas and results presented in this thesis.
10.1. Information based training
In the following sub-sections we will discuss research topics related to our novel
information based training approach – MaxMI and its extensions.
10.1.1. Using observed and unobserved nodes in the models
In section 4 we developed algorithms for constructing and training optimal TAN models.
Initially the model was based on all-observed nodes, but we also developed some
schemes for construction and training of models involving both observed and unobserved
nodes.
Using TAN models augmented with unobserved nodes is very useful in situations where
the classification model is used not only for classification, but also for so-called
interpretation purposes. In the case of visual interpretation, we not only have to determine
whether the class is present, but also need to decide which of its meaningful
parts are present. In cases where we are interested only in classification, it is possible to use a
single unobserved node in the graphical model: the class node. However, when we need
to determine the presence of meaningful parts of the class, we usually require more
unobserved nodes to be present in the model, at least one for each "interpreted" part.
As a result, the development of techniques for constructing and training observed &
unobserved (O&U) graphical models is of fundamental importance for the interpretation
problems. In preliminary work, we have introduced two methods for performing these
tasks. One is the MaxMI technique, but augmented to support O&U models. The other is
a simpler technique based on the standard hybrid MaxMI training of all-observed model,
combined with a method for augmenting the all-observed model with unobserved nodes
using soft EM. We have also performed computational experiments with the simpler,
second, method of all-observed model augmentation. These results were given above in
the experimental results section.
There are several directions for future research on this subject. One is experimental; more
empirical experiments are needed for testing and comparing the two suggested methods
of O&U model construction and training. One particularly interesting application for
experiments is the visual interpretation model.
Another research direction is theoretical; an especially interesting case of O&U models is
the observed-in-leaves-only model. In this model all the “inner” nodes of the model are
unobserved and the observed nodes reside only in the leaves of the O&U TAN model.
This case is particularly interesting, as it is related to the biological structure of human
visual cortex. In the cortex, the observed data from the eyes enters the brain primarily
through the visual area known as V1. Area V1 consists of a network of simple features,
each being a small edge or a corner attached to some fixed location in the visual field.
These simple features are the only “observed” nodes of the visual cortex “model”; the
rest of the visual cortex consists of “unobserved” nodes, not linked directly to the sensory
input of the eyes. The techniques we have developed so far are not suited to handle the
observed-in-leaves-only cases. There is a need to develop new methods for coping with
these situations. The development of such methods is part of our current research effort.
In addition, a research direction requiring both theoretical and empirical research is
comparing TAN all-observed or O&U models against singly connected O&U models
with class node attached to the root of the tree as a single parent. An interesting question
is whether there are some advantages to the TAN model, and if so what they are.
10.1.2. Complete training approaches
Any classification framework based on features and graphical models has to describe a
so-called complete training approach. This means a method for selecting features,
organizing them into a model structure and training their parameters, all using the
training data set. At the end of the discussion regarding our novel information based
training approach – MaxMI, given in section 4, we described two possible complete
training schemes. One was using MaxMI itself to select features, by including among the
trained parameters also a set of parameters that perform feature selection (such as
feature size and location). The other was selecting features using the ideas described in
[8], together with using MaxMI and constrained TAN for approximating maximal MI for
decision making in intermediate steps.
Empirically testing these approaches is an interesting experimental research direction for
future work.
10.1.3. Bottom-up training
As was mentioned in sub-section 10.1.1, in the structure of the visual cortex, almost all
the sensory input from the eyes enters the visual cortex through the V1 region, which is
in turn, organized as a system of simple features each being a small local feature attached
to some fixed location in the visual field. It is intuitive to suspect that most of the
training done by the brain to achieve its remarkable classification capabilities is obtained
in a bottom-up fashion. That is, the model is constructed by building structures of
increasing complexity until the desired complexity of the general class is reached.
Part of our current ongoing research effort therefore involves devising methods for
reproducing this process by using our novel information based training techniques in the
bottom-up construction. Hopefully, this research direction will allow us to understand
better the visual cortex and its learning mechanisms.
10.1.4. Maximizing MI vs. minimizing PE
We have derived simple rules describing the form of max. MI and min. PE solutions in
the “ideal” training case, where for any CPT of class given the model and any probability
distribution of the model, an appropriate model can be found. However, in natural
classification and interpretation problems, the training is done in a “non-ideal” scenario.
That is, the hypothesis space (the space out of which the trained model is selected) does
not cover all the possible CPT and distributions. There is an interesting theoretical issue
of describing what happens in the non-ideal training case, and possibly providing some
general requirements under which the solution of the non-ideal cases approximates the
solution of the ideal case.
Another theme for future research is developing the max. MI to min. PE relations in the
"ideal" scenario. An interesting question in this context is: in which cases does the
solution of max. MI approximate the solution of min. PE, and if it does, to what
extent?
10.2. Slow connections MAP approximation
In the following sub-sections we discuss research topics related to our novel MAP
approximation techniques – Slow Connections.
10.2.1. Slow connections selection methodology
In section 6 we have given reasons why selecting slow-connections to be the least
significant edges in the loopy decomposition has a good potential of success. These
reasons were based on the weak z-minor and strong z-minor assumptions. If these
assumptions are satisfied, then the maximize-and-legalize algorithm will converge to a
local or global maximum, respectively. This was also confirmed by our empirical
experiments, given above in section 8.
However, an interesting theoretical and experimental research direction is the further
development of slow connections techniques. In light of the good performance of the
slow connections approaches in our experiments, it is also interesting to characterize
cases in which slow connections will exhibit good performance, and to what extent this
performance will be good. This means giving more concrete numerical bounds on the
slow connections performance in different cases of loopy models used in practice in
various research areas.
10.2.2. Convergence criteria for slow connections
We have shown two criteria that guarantee convergence of the slow connections method
to local or global optimum: the weak and strong z-minor, respectively. However, these
are strong assumptions to make for a general function, even if we apply them in a more
relaxed iterative manner, such as in several techniques discussed in section 6. Therefore,
an interesting theoretical research direction is developing weaker assumptions for the
general case, as well as for smaller families of optimized functions.
11. APPENDICES
A1 – Multi-valued soft EM maximization step
Assume x_i takes values from the set \{v_1^i, v_2^i, \ldots, v_{k_i}^i\} and denote by Par(x_i) some
fixed value of x_i's parent. Following is the derivation of the next-step values of the
t_j^i = q(x_i = v_j^i \,|\, Par(x_i)) — the elements of \theta_{n+1}.
First note that:

t_{k_i}^i = q(x_i = v_{k_i}^i \,|\, Par(x_i)) = 1 - \sum_{j=1}^{k_i - 1} t_j^i

Taking the gradient of \sum_{y \in Y} E_n(\log p(x, y; \theta)) and making it equal to zero, the equation
corresponding to t_j^i will be:

0 = \frac{d}{d t_j^i} \sum_{y \in Y} \left[ p(x_i = v_j^i, Par(x_i) \,|\, y; \theta_n) \log t_j^i + p(x_i = v_{k_i}^i, Par(x_i) \,|\, y; \theta_n) \log\Big(1 - \sum_{j=1}^{k_i - 1} t_j^i\Big) \right]

Hence, using elementary calculus, the elements \hat{t}_j^i of \theta_{n+1} will adhere to:

\hat{t}_j^i = \frac{\sum_{y \in Y} p(x_i = v_j^i, Par(x_i) \,|\, y; \theta_n)}{\sum_{y \in Y} p(x_i = v_{k_i}^i, Par(x_i) \,|\, y; \theta_n)} \left(1 - \sum_{m=1}^{k_i - 1} \hat{t}_m^i\right)

Now, denoting:

C_j^i = \frac{\sum_{y \in Y} p(x_i = v_{k_i}^i, Par(x_i) \,|\, y; \theta_n)}{\sum_{y \in Y} p(x_i = v_j^i, Par(x_i) \,|\, y; \theta_n)}

we arrive at the following system of linear equations, whose solution gives us the next-step
vector \hat{t}^i:

\begin{pmatrix} 1 + C_1^i & 1 & \cdots & 1 \\ 1 & 1 + C_2^i & \cdots & 1 \\ \vdots & & \ddots & \vdots \\ 1 & \cdots & 1 & 1 + C_{k_i - 1}^i \end{pmatrix} \hat{t}^i = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}

which is the system of linear equations mentioned in section 3.2 ▄
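Each equation of this system reads sum_m t_m + C_j t_j = 1, which gives the closed-form solution t_j = (1/C_j) / (1 + sum_m 1/C_m); a quick numeric check (the C_j values below are arbitrary positive constants chosen for the example):

```python
C = [2.0, 0.5, 4.0]                      # C_1 .. C_{k-1}, hypothetical
s = sum(1.0 / c for c in C)
t = [(1.0 / c) / (1.0 + s) for c in C]   # closed-form solution of the system

# Each equation of the system: sum of all t's plus C_j * t_j must equal 1.
residuals = [sum(t) + C[j] * t[j] - 1.0 for j in range(len(C))]
```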
A2 – Proof of (4.2.1)
The equation (4.2.1) follows by induction from the fact that if x_r is the root of a tree such
as drawn in Figure 7, and T_1, T_2, \ldots, T_k are the subtrees rooted at x_r's direct children
x_1, x_2, \ldots, x_k, then P(X|Y) can be decomposed as:

P(X|Y) = \frac{P(X, Y)}{P(Y)} = \frac{P(x_r, Y) \, P(T_1, T_2, \ldots, T_k \,|\, x_r, Y)}{P(Y)} = \frac{P(x_r, Y)}{P(Y)} \prod_{i=1}^{k} P(T_i \,|\, x_r, Y) = P(x_r | Y) \prod_{i=1}^{k} P(T_i \,|\, x_r, Y_i)

Here the 3rd and the 4th equalities are due to the conditional independence of T_1, T_2, \ldots, T_k
given x_r (which follows directly from the structure of the BN illustrated in Figure 7)
and, again due to conditional independence, P(T_i \,|\, x_r, Y) = P(T_i \,|\, x_r, Y_i), where Y_i and Y_r
denote the subsets of observed nodes contained in the subtrees rooted at x_i and x_r
respectively. The rest of the proof is by induction (applying the above step for every
subtree). Eventually, we end up with equation (4.2.1) ▄
A3 - Proof of the claim 4.5.1.1
This proof is given in the notations for the CPT and the probability distribution of F introduced in section
4.5.1.
Let F be some k-valued feature which is used to classify an n-valued class C using a MAP
decision logic. Let \sigma : \{1, \ldots, k\} \to \{1, \ldots, n\} be a function such that:

\sigma(i) = \arg\max_j p_{ij} = \arg\max_j P(C = j \,|\, F = i)

Then the PE can be re-written as follows:

P_E(C, F) = \sum_{i=1}^{k} r_i \sum_{j \neq \sigma(i)} p_{ij} = \sum_{j : \nexists i \text{ s.t. } j = \sigma(i)} \; \sum_{i=1}^{k} r_i p_{ij} \;+\; \sum_{j : \exists i \text{ s.t. } j = \sigma(i)} \; \sum_{i \notin \sigma^{-1}(j)} r_i p_{ij}

In the first term of the latter sum, we sum over all j for which there is no i such that
j = \sigma(i); in its second term we sum over all j for which such an i exists, and in the inner
sum of the second term we sum over all i not contained in the inverse image \sigma^{-1}(j) of j
(if the function is not one-to-one the inverse image is a set).
Now note that, as directly follows from our notations, the first term is in fact:

\sum_{j \notin \sigma(\{1, \ldots, k\})} c_j

Moreover, as the r_i = P(F = i) are non-zero (otherwise the feature F would be less than k-
valued, which clearly does not decrease its PE), and as the second term is non-negative,
P_E(C, F) is minimized iff the second term is zero, or equivalently: if \exists i : j = \sigma(i) and
i \notin \sigma^{-1}(j), then p_{ij} = 0. Hence, we conclude that the global minimum value of P_E(C, F)
is not smaller than:

1 - \sum_{j=1}^{k} c_j

The only thing that remains to be shown is that there exist a CPT and a probability
distribution of F fulfilling the claim's requirements. To see this, consider the following
example (with the c_j sorted in non-increasing order):

p_{ij} = \begin{cases} c_j / r_i & j = i, \; 1 \le i \le k - 1 \\ c_j / r_k & i = k, \; k \le j \le n \\ 0 & \text{otherwise} \end{cases}

and:

r_i = c_i \;\; (1 \le i \le k - 1), \qquad r_k = \sum_{j=k}^{n} c_j

The requirements of the claim are trivially satisfied with \sigma being the identity mapping.
Moreover:

\sum_{i=1}^{k} r_i = \sum_{i=1}^{k-1} c_i + \sum_{j=k}^{n} c_j = \sum_{j=1}^{n} c_j = 1

and:

\sum_{i=1}^{k} r_i p_{ij} = c_j

in any case. Thus we conclude that this example CPT and
probability distribution of F satisfy the claim's requirements ▄
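The example construction can be verified numerically for a hypothetical sorted class distribution and k = 2; the check confirms that the constructed CPT is a valid conditional distribution with the correct class marginal, and that its MAP error probability attains the bound 1 - (c_1 + ... + c_k):

```python
c = [0.4, 0.3, 0.2, 0.1]                  # class distribution, hypothetical
k, n = 2, len(c)
r = c[:k - 1] + [sum(c[k - 1:])]          # r_i = c_i for i < k, r_k = remaining mass

p = [[0.0] * n for _ in range(k)]
for i in range(k - 1):
    p[i][i] = c[i] / r[i]                 # equals 1: value i of F owns C = i
for j in range(k - 1, n):
    p[k - 1][j] = c[j] / r[k - 1]         # F = k spreads over the remaining C values

pe = sum(r[i] * (1.0 - max(p[i])) for i in range(k))   # MAP error probability
```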
A4 - Proof of the claim 4.5.1.2
The residual entropy H(C|F) under our notation takes the following form:

H(C|F) = -\sum_{i=1}^{k} \sum_{j=1}^{n} P(C = j, F = i) \log P(C = j \,|\, F = i) = -\sum_{i=1}^{k} \sum_{j=1}^{n} r_i p_{ij} \log p_{ij}

Note that, as stated in section 4.5.1, the following must hold:

c_j = \sum_{i=1}^{k} r_i p_{ij} \quad \Longrightarrow \quad p_{kj} = \frac{1}{r_k}\Big(c_j - \sum_{i=1}^{k-1} r_i p_{ij}\Big)

and:

p_{in} = 1 - \sum_{j=1}^{n-1} p_{ij}

and finally:

r_k = 1 - \sum_{i=1}^{k-1} r_i

Thus H(C|F) is in fact a function of the p_{ij}, where 1 \le i \le k-1 and 1 \le j \le n-1, and of the r_i,
where 1 \le i \le k-1. Substituting the above expressions into the expression of
H(C|F), we can calculate the derivatives of H(C|F) with respect to these variables:

\frac{\partial}{\partial p_{ij}} H(C|F) = -r_i \log p_{ij} + r_i \log p_{in} + r_i \log p_{kj} - r_i \log p_{kn}

Thus, in order to make the derivative with respect to p_{ij}, 1 \le i \le k-1 and 1 \le j \le n-1,
equal to zero, we require that:

0 = -r_i \log p_{ij} + r_i \log p_{in} + r_i \log p_{kj} - r_i \log p_{kn}
\log p_{ij} + \log p_{kn} = \log p_{kj} + \log p_{in}
p_{ij} \, p_{kn} = p_{kj} \, p_{in}

And since we require this for all j, then due to the fact that \sum_{j=1}^{n} p_{ij} = \sum_{j=1}^{n} p_{kj} = 1, we get that
for the derivative to be zero we require p_{kn} = p_{in} and hence p_{ij} = p_{kj}. Finally, from
c_j = \sum_{i=1}^{k} r_i p_{ij}, we get c_j = p_{kj} \sum_{i=1}^{k} r_i = p_{kj} = p_{ij}. Thus at the extremum of
H(C|F), for 1 \le i \le k and 1 \le j \le n, p_{ij} = c_j. Also note that for any r_i, where
1 \le i \le k-1:

\frac{\partial}{\partial r_i} H(C|F) = -\sum_{j=1}^{n} p_{ij} \log p_{ij} + \sum_{j=1}^{n} p_{kj} \log p_{kj}

and thus if for 1 \le i \le k and 1 \le j \le n, p_{ij} = c_j, then \frac{\partial}{\partial r_i} H(C|F) = 0 for any legal
assignment to the r_i, 1 \le i \le k.
The point p_{ij} = c_j, for any 1 \le i \le k and 1 \le j \le n, is a maximum point of H(C|F) and
it is also the only extremum in the closed set of legal assignments of a concave (in that
set) function H(C|F). Hence, the minimum of H(C|F) is obtained at the boundaries
of the closed set of legal assignments to p_{ij} and r_i, 1 \le i \le k and 1 \le j \le n. Due to the
equality c_j = \sum_{i=1}^{k} r_i p_{ij}, for any fixed assignment to the r_i, 1 \le i \le k, and for any 1 \le i \le k and
1 \le j \le n, the boundaries for p_{ij} are:

0 \le p_{ij} \le \min\Big(\frac{c_j}{r_i}, 1\Big)

Assume now a fixed assignment to the r_i, 1 \le i \le k. For an assignment to the p_{ij} to be on the
boundary, at least one of the p_{ij} must assume one of its boundary values. As, over the entire
set of all legal assignments to the p_{ij}, H(C|F) is concave, fixing some of its free p_{ij}
variables will still create a concave function, again with a single maximum. Hence, we
can continue minimizing the function by fixing variables to their boundary values. Thus
there is a way of arriving at the global minimum of H(C|F) by iteratively fixing
variables p_{ij} to their boundary values. Note that although the minimum value of p_{ij} is
always 0, the maximum value need not be \min(c_j / r_i, 1) at each step of the
iteration; it can potentially be smaller than that, depending on the value of r_i. If at some
step some p_{ij} assumes the maximal value c_j / r_i, then p_{mj} = 0 for all m \ne i, and the
value i of F solely "owns" the value j of C. However, if, due to previously fixed values of
the p_{ij}, the fixed value of the current p_{ij} is smaller than c_j / r_i (this can only happen if making
p_{ij} equal to c_j / r_i would cause some of the previously fixed p_{im}'s, together with this p_{ij}, to
sum up to more than one), then the value j of C is "split" between several values of F.
Let us now consider the simpler case of k = 2. We will prove that the global minimum of
H(C|F) for this case is obtained when no value of C is "split". We suspect the same is true
for the general case, but we don't currently have a short proof for this, so we leave it to
future research.
We denote by A_1 the set of all j for which p_{1j} was assigned the value c_j / r_1 (and hence, for
all j \in A_1, p_{2j} was assigned the value 0). Similarly, denote by A_2 the set of all j for
which p_{2j} was assigned the value c_j / r_2 (and p_{1j} was assigned 0). Clearly, A_1 and A_2 are
disjoint. We denote a_1 = P(A_1) = P(C \in A_1) and a_2 = P(A_2) = P(C \in A_2).
If no value of C is "split" then A_1 \cup A_2 = \{1, \ldots, n\} and the residual entropy is equal to:

H(C|F) = -\sum_{i=1}^{2} \sum_{j \in A_i} r_i \frac{c_j}{r_i} \log \frac{c_j}{r_i} = -\sum_{j=1}^{n} c_j \log c_j + \sum_{i=1}^{2} \Big(\sum_{j \in A_i} c_j\Big) \log r_i = H(C) + \sum_{i=1}^{2} a_i \log r_i

In this case, the normalization \sum_j p_{ij} = 1 forces r_i = a_i, and the minimum H(C|F) in this case is
equal to H(C) + \sum_{i=1}^{2} a_i \log a_i. Note also
that the last derivation holds for the general case: if no j was "split" between
different F values, then the minimum residual entropy is obtained for r_i = a_i, where
1 \le i \le k.
Now assume some value of C was "split" for k = 2. As already stated, there can be only one
such value, as we can approach the minimum via "fixing" iterations, each time fixing a
point on the boundary, and if a value of C was split by fixing p_{1j} or p_{2j}, then it can only
be the case that all the rest (yet "un-fixed") values of p_{1j} or p_{2j}, respectively, are
zero; hence no other value of C can be "split". Moreover, as \sum_{j=1}^{n} p_{ij} = 1, we have that, if
we denote the "split" value of C by m, then:

p_{1m} = 1 - \sum_{j \in A_1} p_{1j} = 1 - \sum_{j \in A_1} \frac{c_j}{r_1} = \frac{r_1 - a_1}{r_1}

and similarly p_{2m} = \frac{r_2 - a_2}{r_2}. Thus, in this case, the residual entropy is:

H(C|F) = -r_1 p_{1m} \log p_{1m} - r_2 p_{2m} \log p_{2m} - \sum_{i=1}^{2} \sum_{j \in A_i} c_j \log \frac{c_j}{r_i}
= -(r_1 - a_1) \log \frac{r_1 - a_1}{r_1} - (r_2 - a_2) \log \frac{r_2 - a_2}{r_2} - \sum_{i=1}^{2} \sum_{j \in A_i} c_j \log c_j + \sum_{i=1}^{2} a_i \log r_i

If we compute the derivative of the latter expression with respect to r_1 (with r_2 = 1 - r_1) and
equate it to zero, we arrive at:

r_1 = \frac{a_1}{a_1 + a_2}, \qquad r_2 = \frac{a_2}{a_1 + a_2}

Hence, the minimal H(C|F) in this case is equal to:

H(C) + a_1 \log \frac{a_1}{a_1 + a_2} + a_2 \log \frac{a_2}{a_1 + a_2}

The only thing left in order to prove the claim is to show that for every "split" minimum
of the form

m_S = H(C) + a_1 \log \frac{a_1}{a_1 + a_2} + a_2 \log \frac{a_2}{a_1 + a_2}

corresponding to some A_1 and A_2, there
is a corresponding "non-split" selection of A_1 and A_2 (so their disjoint union is the whole
set of C values) with a smaller minimum. Let m be the C value which is in neither A_1 nor A_2; then the "split"
minimum is of the form:

m_S = H(C) + a_1 \log a_1 + a_2 \log a_2 - (a_1 + a_2) \log (a_1 + a_2) = H(C) + a_1 \log a_1 + a_2 \log a_2 - (1 - c_m) \log (1 - c_m)

W.l.o.g. assume 0 \le a_1 \le \frac{1 - c_m}{2}, and define \hat{A}_2 = A_2 and \hat{A}_1 = A_1 \cup \{m\}; then the
minimum corresponding to this "non-split" selection of \hat{A}_1 and \hat{A}_2 has the form:

m_{NS} = H(C) + (a_1 + c_m) \log (a_1 + c_m) + a_2 \log a_2

Finally, we claim that for any value of 0 \le c_m \le 1 and 0 \le a_1 \le \frac{1 - c_m}{2}, m_{NS} \le m_S. To see
this, consider the following:

m_{NS} - m_S = (a_1 + c_m) \log (a_1 + c_m) - a_1 \log a_1 + (1 - c_m) \log (1 - c_m)

Thus:

\frac{\partial}{\partial a_1}(m_{NS} - m_S) = \log(a_1 + c_m) - \log a_1 = \log\Big(1 + \frac{c_m}{a_1}\Big)

and hence at all its extremum points c_m = 0, and thus m_{NS} - m_S = 0. At the boundary
points where c_m = 0 or c_m = 1, m_{NS} - m_S = 0. At the boundary points where a_1 = 0,
m_{NS} - m_S = c_m \log c_m + (1 - c_m) \log (1 - c_m) \le 0, and at the boundary points where a_1 = \frac{1 - c_m}{2}, the difference m_{NS} - m_S
assumes the following form:

m_{NS} - m_S = \frac{1 + c_m}{2} \log \frac{1 + c_m}{2} - \frac{1 - c_m}{2} \log \frac{1 - c_m}{2} + (1 - c_m) \log (1 - c_m)

and thus:

\frac{d}{d c_m}(m_{NS} - m_S) = \frac{1}{2} \log \frac{1 + c_m}{1 - c_m} - \log 2

which is zero iff \frac{1 + c_m}{1 - c_m} = 4, that is, iff c_m = 0.6.
Assigning c_m = 1 into the latter m_{NS} - m_S expression yields a zero value, as does
c_m = 0. Assigning c_m = 0.6 we get a negative value of approximately -0.223. Thus at the
boundary points where a_1 = \frac{1 - c_m}{2}, m_{NS} - m_S is always non-positive (otherwise there
would be a positive extremum due to Rolle's theorem). Hence, we conclude that m_{NS} - m_S
is less than or equal to 0 at all the boundary points and is zero at its extremum points; thus it is
always non-positive (that is, m_{NS} - m_S \le 0 for all 0 \le c_m \le 1 and 0 \le a_1 \le \frac{1 - c_m}{2}), as the
surface m_{NS} - m_S is continuous over the closed set 0 \le c_m \le 1, 0 \le a_1 \le \frac{1 - c_m}{2}. Finally,
we conclude that using a "non-split" \hat{A}_2 = A_2 and \hat{A}_1 = A_1 \cup \{m\} gives a smaller
minimal value of the residual entropy than using the "split" A_1 and A_2, which in turn, as
explained above, concludes the proof of the claim for k = 2. Note also that for arbitrary k
we have proved the claim up to ruling out the possibility of "split" solutions. Showing
that the "non-split" solutions can produce a smaller value of the residual entropy in the
arbitrary k case is a topic for future research ▄
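The extremum analysis above can be sanity-checked numerically (the numbers below are invented): at the interior extremum p_ij = c_j the residual entropy H(C|F) equals H(C), i.e. the extremum is the maximum, while a boundary "non-split" assignment in which each F value owns a subset of the C values is strictly lower:

```python
from math import log

c = [0.4, 0.3, 0.2, 0.1]                  # class distribution, hypothetical
r = [0.5, 0.5]                            # feature distribution for k = 2

def H_cond(r, p):                         # H(C|F) = -sum_ij r_i p_ij log p_ij
    return -sum(ri * pij * log(pij)
                for ri, row in zip(r, p) for pij in row if pij > 0)

H_C = -sum(cj * log(cj) for cj in c)      # H(C)

flat = [list(c), list(c)]                 # p_ij = c_j: the interior extremum
boundary = [[0.8, 0.0, 0.0, 0.2],         # F = 1 owns C values {1, 4}: p_1j = c_j / r_1
            [0.0, 0.6, 0.4, 0.0]]         # F = 2 owns C values {2, 3}: p_2j = c_j / r_2
```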
A5 – MaxMI in case of partial conditional independence in class
This appendix discusses an implication of Assumption 2 of section 4.1, which introduced
the MaxMI algorithm. The notations used here are the ones introduced in section 4.1.
The assumption was that the BN which is being trained using MaxMI is such that if we
remove the C node from it, the structure of the decomposition changes in the following
way:

P(F_1, \ldots, F_n; \theta) = \prod_{j=1}^{n} P(F_j \,|\, S_j; \theta_j)

I.e., the structure of the BN remains the same (in the sense of parent / child relations), just
without the C node.
The implications of this assumption are particularly interesting for the special case of partial
conditional independence in class. Assume a special case of the P(C, F_1, \ldots, F_n; \theta)
decomposition where the underlying BN is a set of disjoint sub-graphs (sub-BNs)
conditionally independent in C:

P(C, F_1, \ldots, F_n; \theta) = P(C) \prod_{i=1}^{m} P(A_i \,|\, C; \theta_{A_i})

where A_1 \cup \ldots \cup A_m = \{F_1, \ldots, F_n\} as a disjoint union,

P(A_i \,|\, C; \theta_{A_i}) = \prod_{j : F_j \in A_i} P(F_j \,|\, S_j, C; \theta_j)

and if F_j \in A_i then S_j \subseteq A_i as well. Here \theta_{A_i} denotes the parameters of all F_j \in A_i. This
situation is illustrated in Figure 22.

Figure 22: Partial conditional independence in class. Given the class value, the model joint
PDF decomposes into a product of conditional PDFs, one for each component A_i.

In general it is unnatural to assume that we can safely remove the class node C from the
above decomposition in order to get the (approximate) decomposition of P(F_1, \ldots, F_n; \theta).
I.e., in the above case usually

P(F_1, \ldots, F_n; \theta) \neq \prod_{i=1}^{m} P(A_i; \theta_{A_i})

as conditional independence in class doesn't mean general independence.
However, we need not make such strong assumptions as the general independence of the A_i
(as vector random variables). Instead we can make the weaker assumption of the distribution of
the A_i having a BN decomposition:

(*) \quad P(A_1, \ldots, A_m) = \prod_{i=1}^{m} P(A_i \,|\, \Pi(A_i); \theta_{A_i \cup B_i})

where \Pi(A_i) denotes the parents of A_i in the latter decomposition, B_i = \bigcup_{A_k \in \Pi(A_i)} A_k,
and \theta_{B_i} stands for the parameters of all F_j \in B_i. Making this assumption, we get a
decomposition of P(C, F_1, \ldots, F_n; \theta) which satisfies Assumption 2, as:

P(C, F_1, \ldots, F_n; \theta) = P(C) \prod_{i=1}^{m} P(A_i \,|\, C; \theta_{A_i}) = P(C) \prod_{i=1}^{m} P(A_i \,|\, C, \Pi(A_i); \theta_{A_i \cup B_i})

because conditional independence in C implies that we can add knowledge of other A_k's
without any effect on the conditional distribution. Using the above and (*) we get the
correctness of Assumption 2 for this decomposition (i.e. removing C preserves the
structure of the underlying BN).
An important implementation note at this point is that if we want to efficiently use the
BN structure of the A_i's themselves, the connections between the nodes in the A_i's BN
should be more specific. For instance, if A_k is a parent of A_i in the A_i's BN, then it
would be more efficient to establish a specific subset of the F_j elements of A_k, each with a
set of its specific children which are elements of A_i. If we do so, we can work with a
decomposition of P(C, F_1, \ldots, F_n; \theta) over the F_j space, rather than over the A_i space, which
would usually be more efficient due to the reduced local kernel size.
12. References
1. Aji, S. M., McEliece, R. J. The Generalized Distributive Law. IEEE Trans. Inform.
Theory, vol. 46, no. 2 (March 2000), pp. 325--343.
2. Aji, S. M., McEliece, R. J. The Generalized Distributive Law and Free Energy
Minimization. Presented at 39th Allerton Conference, October 4, 2001.
3. Amir, E. Efficient Approximation for Triangulation of Minimum Treewidth.
Proceedings of 17th Conference on Uncertainty in Artificial Intelligence (UAI '01), p.
7-15.
4. Bodlaender, H.L. Necessary edges in k-chordalizations of graphs. Technical Report
UU-CS-2000-27, Utrecht University.
5. Bringuier, V., Chavane, F., Glaeser, L. & Frégnac, Y. Horizontal Propagation of
Visual Activity in the Synaptic Integration Field of Area 17 Neurons. Science, 283,
695-699, 1999.
6. Chow, C. K. and C. N. Liu (1968). Approximating discrete probability distributions
with dependence trees. IEEE Transaction on Information Theory 14, 462-467.
7. Cooper, G. F. The computational complexity of probabilistic inference using
Bayesian belief networks. Artificial Intelligence, vol.42, pp.393-405, 1990.
8. Epshtein, B., Ullman, S. Hierarchical features are better than whole features.
Unpublished.
9. Friedman, N., Geiger, D., Goldszmidt, M. Bayesian Network Classifiers. Machine
Learning, 29:2/3, 1997.
10. Girard, P., Hupé, J.M. & Bullier, J. Feedforward and Feedback Connections Between
Areas V1 and V2 of the Monkey Have Similar Rapid Conduction Velocities. J
Neurophysiol 85, 1328-1331, 2001.
11. Jensen, F. V. An Introduction to Bayesian Networks. New York: Springer-Verlag,
1996.
12. Kschischang, F. R., Frey B. J., Loeliger, H.-A. Factor graphs and the sum-product
algorithm. IEEE Transactions on Information Theory 47:2, pp. 498-519, February
2001.
13. Laferte, J.-M., Perez, P., Heitz, F. Discrete Markov Image Modeling and Inference on
the Quadtree. IEEE Transactions on Image Processing, vol. 9, no. 3, March 2000.
14. Lauritzen, S. L., Spiegelhalter, D. J. Local computation with probabilities on
graphical structures and their application to expert systems. J. Roy. Statist. Soc. B, pp.
157–224, 1988.
15. McEliece, R. J., MacKay, D. J. C., Cheng, J.-F. Turbo Decoding as an Instance of
Pearl's 'Belief Propagation' Algorithm. IEEE J. Sel. Areas Comm., vol. 16, no. 2,
pp. 140-152, February 1998.
16. Pearl, J. Probabilistic Reasoning in Intelligent Systems. San Francisco: Morgan
Kaufmann, 1988.
17. Redner, R. A., Walker, H. F. Mixture densities, maximum likelihood and the EM
algorithm. SIAM Rev., vol. 26, no. 2, pp. 195-239, 1984.
18. Segal, E., Battle, A., Koller, D. Decomposing Gene Expression into Cellular
Processes. Proceedings of the 8th Pacific Symposium on Biocomputing (PSB), Kaua'i,
January 2003.
19. Shimony, S. E. Finding MAPs for belief networks is NP-hard. Artificial Intelligence,
vol.68, pp.399--410, 1994.
20. Vidal-Naquet, M., and Ullman, S. Object Recognition with Informative Features and
Linear Classification. Proceedings of the 9th International Conference on Computer
Vision, 281-288. Nice, France, 2003.
21. Weiss, Y. Belief Propagation and Revision in Networks with Loops. Technical
Report AIM-1616, MIT, 1997.
22. Yedidia, J. S., Freeman, W. T., Weiss, Y. Bethe free energy, Kikuchi approximations,
and belief propagation algorithms. Available at http://www.merl.com/papers/TR2001-16/
23. Yuille, A.L. CCCP Algorithms to Minimize the Bethe and Kikuchi Free Energies:
Convergent Alternatives to Belief Propagation. Neural Computation, v.14 n.7,
p.1691-1722, July 2002.
24. Yuille, A.L. A Double-Loop Algorithm to Minimize the Bethe Free Energy.
Proceedings of the Third International Workshop on Energy Minimization Methods in
Computer Vision and Pattern Recognition, pp. 3-18, 2001.