Shafer - Probabilistic Expert Systems



CBMS-NSF REGIONAL CONFERENCE SERIES IN APPLIED MATHEMATICS

A series of lectures on topics of current research interest in applied mathematics under the direction of the Conference Board of the Mathematical Sciences, supported by the National Science Foundation and published by SIAM.



Probabilistic Expert Systems


Glenn Shafer
Rutgers University
Newark, New Jersey

Probabilistic Expert Systems

SOCIETY FOR INDUSTRIAL AND APPLIED MATHEMATICS

PHILADELPHIA


Copyright © 1996 by the Society for Industrial and Applied Mathematics.


All rights reserved. Printed in the United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688.

Library of Congress Cataloging-in-Publication Data

Shafer, Glenn, 1946-
Probabilistic expert systems / Glenn Shafer.

p. cm. -- (CBMS-NSF regional conference series in applied mathematics ; 67)

"Sponsored by Conference Board of the Mathematical Sciences"--Cover.

Includes bibliographical references and index.
ISBN 0-89871-373-0 (pbk.)
1. Expert systems (Computer science) 2. Probabilities.

I. Conference Board of the Mathematical Sciences. II. Title. III. Series.
QA76.76.E95S486 1996
006.3'3--dc20   96-18757

SIAM is a registered trademark.


Contents

Preface vii

Chapter 1. Multivariate Probability 1
    1.1 Probability distributions 2
    1.2 Marginalization 3
    1.3 Conditionals 5
    1.4 Continuation 7
    1.5 Posterior distributions 10
    1.6 Expectation 12
    1.7 Classifying probability distributions 13
    1.8 A limitation 14

Chapter 2. Construction Sequences 17
    2.1 Multiplying conditionals 18
    2.2 DAGs and belief nets 20
    2.3 Bubble graphs 27
    2.4 Other graphical representations 30

Chapter 3. Propagation in Join Trees 35
    3.1 Variable-by-variable summing out 37
    3.2 The elementary architecture 41
    3.3 The Shafer-Shenoy architecture 44
    3.4 The Lauritzen-Spiegelhalter architecture 50
    3.5 The Aalborg architecture 56
    3.6 COLLECT and DISTRIBUTE 63
    3.7 Scope and alternatives 66

Chapter 4. Resources and References 69
    4.1 Meetings 69
    4.2 Software 69
    4.3 Books 70


    4.4 Review articles 71
    4.5 Other sources 73

Index 79


Preface

Based on lectures at an NSF/CBMS Regional Conference at the University of North Dakota at Grand Forks during the week of June 1-5, 1992, this monograph analyzes join-tree methods for the computation of prior and posterior probabilities in belief nets. These methods, pioneered by Pearl [42], [8], Lauritzen and Spiegelhalter [37], and Shafer, Shenoy, and Mellouli [45] in the late 1980s, continue to be central to the theory and practice of probabilistic expert systems.

In the North Dakota lectures, I began with the topics discussed here and then moved on in two directions. First, I discussed how the basic architectures for join-tree computation apply to other methods for combining evidence, especially the belief-function (Dempster-Shafer) method, and also how they apply to many other problems in applied mathematics and operations research. Second, I looked at other aspects of computation in expert systems, especially Markov chain Monte Carlo approximation, computation for model selection, and computation for model evaluation.

I completed a draft of the three chapters that form the body of this monograph in the summer of 1992, shortly after delivering the lectures. Unfortunately, I set the project aside at the end of that summer, expecting to return in a few months to write additional chapters covering at least the other major topics I had discussed in Grand Forks. As it turned out, my return to the project was delayed for three years, as I found myself increasingly concerned with another set of ideas—the use of probability trees to understand probability and causality. Rather than extend this monograph, I completed a new and much longer book, The Art of Causal Conjecture (MIT Press, 1996).

The field of probabilistic expert systems has continued to flourish in the past three years, yet the understanding of join-tree architectures set out in my original three chapters is still missing from the literature. Moreover, the broader research question that motivated my presentation—how well a general theory of propagation along the same lines can account for the wide variety of recursive computation in applied mathematics—remains open. I have decided, therefore, to publish these three chapters on their own, essentially as they were written in 1992. I have resisted even attempting a brief survey of related topics. Instead I


have added a brief chapter on resources, which gives information on software and includes an annotated bibliography. I have also added some exercises that will help the reader begin to explore the problem of generalizing from probability to broader domains of recursive computation.

The resulting monograph should be useful to scholars and students in artificial intelligence, operations research, and the various branches of applied statistics that use probabilistic methods. Probabilistic expert systems are now used in areas ranging from diagnosis (in medicine, software maintenance, and space exploration) and auditing to tutoring, and the computational methods described here are basic to nearly all implementations in all these areas.

I wish to thank Lonnie Winnrich, who organized the conference in North Dakota, as well as the other participants. They made the week very pleasant and productive for me. I also wish to thank the many students and colleagues, at the University of Kansas and around the world, who helped me learn about expert systems in the late 1980s and early 1990s. Foremost among them is Prakash P. Shenoy, my colleague in the School of Business at the University of Kansas from 1984 to 1992. I am grateful for his steadfast friendship and indispensable collaboration.

Augustine Kong and A. P. Dempster, who joined with Shenoy and me in the early 1980s in the study of join-tree computation for belief functions, were also important in the development of the ideas reported here. Section 3.1 is inspired by an unpublished memorandum by Kong. Other colleagues and students with whom I collaborated particularly closely during this period include Khalid Mellouli, Debra K. Zarley, and Rajendra P. Srivastava.

Special thanks are due Niven Lianwen Zhang, Chingfu Chang, and the late George Kryrollos, all of whom made useful comments on the 1992 draft of the monograph.

I would also like to acknowledge the friendship and encouragement of many other scholars whose work is reported here, especially A. P. Dawid, Finn V. Jensen, Steffen L. Lauritzen, Judea Pearl, and David Spiegelhalter. The field of probabilistic expert systems has benefited not only from their energy, intellect, and vision, but also from their generosity and good humor.

Finally, at an even more personal level, I would like to thank my wife, Nell Irvin Painter, who has supported this and my other scholarly work through thick and thin.


CHAPTER 1

Multivariate Probability

This chapter reviews the basic ingredients of the theory of multivariate probability: marginals, conditionals, and expectations. These will be familiar topics for many readers, but our approach will take us down some relatively unexplored paths. One of these paths opens when we develop an explicit notation for marginalization. This notation allows us to recognize properties of marginalization that are shared by many types of recursive computation. Another path opens when we distinguish among probability distributions on the basis of how they are stored. We distinguish between tabular distributions, which are simply tables of probabilities, and algorithmic distributions, which are algorithms for computing probabilities. A parametric distribution is a special kind of algorithmic distribution; it consists of a few numerical parameters and a relatively simple algorithm, usually a formula, for computing probabilities from those parameters.

The most complex topic in this chapter is conditional probability. Our purposes require that we understand conditional probability from several viewpoints, and we rely on some careful terminology to keep the viewpoints distinct. We distinguish between conditional probabilities in general, which can stand on their own, without reference to any prior probability distribution, and posterior probabilities, which are conditional probabilities obtained by conditioning a probability distribution on observations. And we distinguish two kinds of tables of conditional probabilities: conditionals and posterior distributions. A conditional consists of many probability distributions for a set of variables (the conditional's head)—one for each configuration of another set of variables (its tail). A posterior distribution is a single probability distribution consisting of posterior probabilities.

In the next chapter, we study how to construct a probability distribution by multiplying conditional probabilities—or, more precisely, by multiplying conditionals. When we multiply the conditionals in an appropriate order, each multiplication produces a larger marginal of the final distribution. This means that each conditional is a continuer for the final distribution; it continues it from a smaller to a larger set of variables. The concept of a continuer will help us minimize complications arising from the presence of zero probabilities, which are unavoidable in expert systems, where much of our knowledge is in the form of rules that do not admit exceptions. Continuers will also help us, in Chapter 3, to understand architectures for recursive computation.

This chapter is about multivariate probability, not about probability in general. Not all probability models are multivariate. The chapter concludes with a brief explanation of why multivariate models are sometimes inadequate.

TABLE 1.1. A discrete tabular probability distribution for three variables.

                          female                        male
                  Dem     ind     Rep           Dem     ind     Rep
young             .08     .16     .08           .02     .04     .02
middle-aged       .05     .05     .05           .00     .00     .00
old               .05     .05     .05           .10     .10     .10

1.1. Probability distributions.

The quickest way to orient those not familiar with multivariate probability is to give an example. Table 1.1 gives a probability distribution for three variables: Age, Sex, and Party. Notice that the numbers are nonnegative and add to one. This is what it takes to be a discrete probability distribution.

We will write $\Omega_X$ for the set of possible values of a variable $X$, and we will write $\Omega_x$ for the set of configurations of a set of variables $x$. We call $\Omega_X$ and $\Omega_x$ the frames for $X$ and $x$, respectively. In general, $\Omega_x$ is the Cartesian product of the frames of the individual variables: $\Omega_x = \prod_{X \in x} \Omega_X$. In Table 1.1, we assume that

$\Omega_{Age} = \{\text{young, middle-aged, old}\},$

$\Omega_{Sex} = \{\text{male, female}\},$

and

$\Omega_{Party} = \{\text{Democrat, independent, Republican}\}.$

Thus the frame $\Omega_{\{Age,Sex,Party\}}$ consists of eighteen configurations:

$(\text{young, male, Democrat}), (\text{old, male, independent}), \ldots$

and Table 1.1 gives a probability for each of them. In general, as in this example, a discrete probability distribution for $x$ gives a probability to every element of $\Omega_x$; abstractly, it is a nonnegative function on $\Omega_x$ whose values add to one.

If we add together the numbers for males and females in Table 1.1, we get marginal probabilities for Age and Party, as in Table 1.2. Adding further, we get marginal probabilities for Age, as in Table 1.3.

Some readers may be puzzled by the name "marginal." The name is derived from the example of a bivariate table, where it is convenient and conventional to write the sums of the rows and columns in the margins. In Table 1.4, for



We can write a formula for $P^{\downarrow w}$:

$P^{\downarrow w}(c) = \sum_{d \in \Omega_{x \setminus w}} P(c.d)$

for each configuration $c$ of $w$. Here $x \setminus w$ consists of the variables in $x$ but not $w$, and $c.d$ is the configuration of $x$ that we get by combining the configuration $c$ of $w$ and the configuration $d$ of $x \setminus w$. For example, if $x = \{Age, Sex, Party\}$ and $w = \{Age, Party\}$, then $x \setminus w = \{Sex\}$; if $c = (\text{old, Democrat})$ and $d = (\text{male})$, then $c.d = (\text{old, male, Democrat})$.

The arrow notation emphasizes the variables that remain when we marginalize. Sometimes we use instead a notation that emphasizes the variables we sum out: $P^{-y}$ is the marginal obtained when we sum out the variables in $y$. Thus when $x = w \cup y$, where $w$ and $y$ are disjoint sets of variables, and $P$ is a probability distribution on $x$, both $P^{\downarrow w}$ and $P^{-y}$ will represent $P$'s marginal on $w$.

Though we are concerned primarily with probability distributions, any numerical² function $f$ on a set of variables $x$ has a marginal $f^{\downarrow w}$ for every subset $w$ of $x$. The function $f$ need not be nonnegative or sum to one. If $w$ is not empty, then $f^{\downarrow w}$ is a function on $w$:

$f^{\downarrow w}(c) = \sum_{d \in \Omega_{x \setminus w}} f(c.d)$   (1.2)

for each configuration $c$ of $w$. If $w$ is empty, then $f^{\downarrow w}$ is simply a number:

$f^{\downarrow \emptyset} = \sum_{c \in \Omega_x} f(c).$

The number $f^{\downarrow \emptyset}$ will be equal to one if $f$ is a probability distribution. The function $f^{\downarrow w}$ will be equal to $f$ if $w = x$.

Here are two important properties of marginalization:

Property 1. If $f$ is a function on $y$, and $w \subseteq v \subseteq y$, then $(f^{\downarrow v})^{\downarrow w} = f^{\downarrow w}$.

Property 2. If $f$ is a function on $x$, and $g$ is a function on $y$, then $(fg)^{\downarrow x} = f\,(g^{\downarrow x \cap y})$.³

We leave it to the reader to derive these properties from equation (1.2).

It is informative to rewrite Properties 1 and 2 using the $f^{-y}$ notation. This gives the following:

² A numerical function is one that takes real numbers as values. We will consider only numerical functions in this monograph.

³ In order to understand this equation, we must recognize that the product $fg$ is a function on $x \cup y$. Its value for a configuration $c$ of $x \cup y$ is given by $(fg)(c) = f(c^{\downarrow x})\, g(c^{\downarrow y})$, where $c^{\downarrow x}$ is the result of dropping from $c$ the values for variables not in $x$. For example, if $f$ is a function on $\{Age, Party\}$ and $g$ is a function on $\{Sex, Party\}$, then $(fg)(\text{old, male, Democrat}) = f(\text{old, Democrat})\, g(\text{male, Democrat})$.


FIG. 1.1. Removing $y \setminus x$ from $y$ leaves $x \cap y$; removing $y \setminus x$ from $x \cup y$ leaves $x$.

Property 1. If $f$ is a function on $y$, and $u$ and $v$ are disjoint subsets of $y$, then $(f^{-u})^{-v} = f^{-(u \cup v)}$.

Property 2. If $f$ is a function on $x$, and $g$ is a function on $y$, then $(fg)^{-(y \setminus x)} = f\,(g^{-(y \setminus x)})$.

This version of Property 2 makes it clear that we are summing out the same variables on both sides of the equation $(fg)^{\downarrow x} = f\,(g^{\downarrow x \cap y})$. Summing these variables out of $fg$, which is a function on $x \cup y$, leaves the variables in $x$, but summing them out of $g$, which is a function on $y$, leaves the variables in $x \cap y$ (see Figure 1.1).

The second version of Property 2 also suggests the following generalization:

Property 3. If $f$ is a function on $x$, $g$ is a function on $y$, and $x \subseteq w \subseteq x \cup y$, then $(fg)^{\downarrow w} = f\,(g^{\downarrow w \cap y})$.

We leave it to the reader to derive this property also from equation (1.2).

As we will see in Chapter 3, Properties 1 and 2 are responsible for the possibility of recursively computing marginals of probability distributions given as products of tables. These properties also hold and justify recursive computation in other domains, where we work with different objects and different meanings for marginalization and multiplication. Because of their generality, we call Properties 1 and 2 axioms; Property 1 is the transitivity axiom, and Property 2 is the combination axiom.

The definition of marginalization, equation (1.2), together with the proofs of Properties 1, 2, and 3, can be adapted to the continuous case by replacing summation with integration. We leave this to the reader. We also leave aside complications that arise if infinities are allowed—if the sum or integral is over an infinite frame or an unbounded function. Our primary interest is in distributions given by tables, and here the frames are both discrete and finite.
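To make tabular marginalization concrete, here is a minimal Python sketch (not from the monograph; the dict-based representation and the name `marginalize` are illustrative choices). It stores Table 1.1 as a dict keyed by (Age, Sex, Party) configurations, sums out positions as in equation (1.2), and checks the transitivity axiom numerically.

```python
from math import isclose

AGE, SEX, PARTY = 0, 1, 2   # positions of the variables in each configuration

# Table 1.1, the distribution P for (Age, Sex, Party)
P = {
    ('young', 'female', 'Dem'): .08, ('young', 'female', 'ind'): .16, ('young', 'female', 'Rep'): .08,
    ('middle-aged', 'female', 'Dem'): .05, ('middle-aged', 'female', 'ind'): .05, ('middle-aged', 'female', 'Rep'): .05,
    ('old', 'female', 'Dem'): .05, ('old', 'female', 'ind'): .05, ('old', 'female', 'Rep'): .05,
    ('young', 'male', 'Dem'): .02, ('young', 'male', 'ind'): .04, ('young', 'male', 'Rep'): .02,
    ('middle-aged', 'male', 'Dem'): .00, ('middle-aged', 'male', 'ind'): .00, ('middle-aged', 'male', 'Rep'): .00,
    ('old', 'male', 'Dem'): .10, ('old', 'male', 'ind'): .10, ('old', 'male', 'Rep'): .10,
}

def marginalize(table, keep):
    """Sum out every position not listed in `keep`, as in equation (1.2)."""
    out = {}
    for config, p in table.items():
        key = tuple(config[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

P_age_sex = marginalize(P, (AGE, SEX))   # Table 1.8
P_age     = marginalize(P, (AGE,))       # marginal probabilities for Age

# Transitivity axiom (Property 1): marginalizing in two steps agrees with one step.
two_step = marginalize(P_age_sex, (0,))  # Age sits at position 0 of P_age_sex
assert all(isclose(P_age[k], two_step[k]) for k in P_age)
print(P_age)  # roughly {('young',): .40, ('middle-aged',): .15, ('old',): .45}
```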

1.3. Conditionals.

Table 1.5 gives conditional probabilities for Party given Age and Sex. We call these numbers conditional probabilities because they are nonnegative and each group of three (the three probabilities for Party given each Age-Sex configuration) sums to one. In other words, the marginal for {Age, Sex}, Table 1.6, consists of ones.

TABLE 1.5. A conditional for Party given Age and Sex.

                          female                        male
                  Dem     ind     Rep           Dem     ind     Rep
young             1/4     1/2     1/4           1/4     1/2     1/4
middle-aged       1/3     1/3     1/3           1/5     1/5     3/5
old               1/3     1/3     1/3           1/3     1/3     1/3

TABLE 1.6. The marginal of Table 1.5 on its tail.

                  female    male
young                1        1
middle-aged          1        1
old                  1        1

We call Table 1.5 as a whole a conditional. We call {Party} its head, and we call {Age, Sex} its tail. In general, a conditional is a nonnegative function $Q$ on the union of two disjoint sets of variables, its head $h$ and its tail $t$, with the property that $Q^{\downarrow t} = 1_t$, where $1_t$ is the function on $t$ that is identically equal to one.

Two special cases deserve mention. If $t$ is empty, then $Q$ is a probability distribution for $h$. If $h$ is empty, then $Q = 1_t$. We are interested in conditionals not for their own sake but because we can multiply them together to construct probability distributions. This is the topic of the next chapter.

Frequently, we are interested only in a subtable of a conditional. In Table 1.5, for example, we might be interested only in the conditional probabilities for females—the subtable shown in Table 1.7. We call such a subtable a slice. In general, if $f$ is a table on $x$ and $c$ is a configuration of a subset $w$ of $x$, then we write $f|_{w=c}$ for the table on $x \setminus w$ given by

$f|_{w=c}(d) = f(c.d)$   (1.4)

for each configuration $d$ of $x \setminus w$, and we call $f|_{w=c}$ the slice of $f$ on $w = c$. We leave it to the reader to verify the following proposition.

PROPOSITION 1.1. Suppose $Q$ is a conditional with head $h$ and tail $t$, and suppose $w \subseteq t$. Then $Q|_{w=c}$ is a conditional with head $h$ and tail $t \setminus w$.

Table 1.7 illustrates Proposition 1.1; it is a conditional with {Party} as its head and {Age} as its tail.

We will sometimes find it convenient to generalize the notation for slicing by allowing the variables whose values we fix to include variables that are outside the domain of the table and hence have no effect on the result. In general, if $f$ is a table on $x$, $w$ is a set of variables, and $c$ is a configuration of $w$, then we write $f|_{w=c}$ for the table on $x \setminus w$ given by

$f|_{w=c}(d) = f((c^{\downarrow w \cap x}).d)$

for each configuration $d$ of $x \setminus w$.


TABLE 1.7. The slice of Table 1.5 on Sex = female.

                  Dem     ind     Rep
young             1/4     1/2     1/4
middle-aged       1/3     1/3     1/3
old               1/3     1/3     1/3

TABLE 1.8. The marginal of Table 1.1 for Age and Sex.

                  female    male
young               .32      .08
middle-aged         .15      .00
old                 .15      .30

1.4. Continuation.

If $f$ is a function on $x$, $w \subseteq x$, and

$f = f^{\downarrow w}\, Q,$   (1.6)

then we say that $Q$ continues $f$ from $w$ to $x$.

Here is an example. Suppose $x = \{Age, Sex, Party\}$ and $w = \{Age, Sex\}$, and consider the probability distribution $P$ given by Table 1.1 and the conditional $Q$ given by Table 1.5. The marginal $P^{\downarrow w}$ is given by Table 1.8, and the reader can easily check that $P = P^{\downarrow w} Q$.⁴ Thus $Q$ continues $P$ from $w$ to $x$.

When do continuers exist, and when are they unique?

PROPOSITION 1.2. Suppose $f$ is a function on $x$, and suppose $w \subseteq x$.
1. If all of $f$'s values are positive, then there is a unique function $Q$ on $x$ that continues $f$ from $w$ to $x$. This continuer $Q$ is a conditional.
2. If all of $f$'s values are nonnegative, then there is at least one function $Q$ on $x$ that continues $f$ from $w$ to $x$. We can choose $Q$ to be a conditional.

Proof. Try to divide both sides of equation (1.6) by $f^{\downarrow w}$ to obtain

$Q = \dfrac{f}{f^{\downarrow w}},$   (1.7)

or

$Q(c.d) = \dfrac{f(c.d)}{f^{\downarrow w}(c)},$   (1.8)

where $c$ is a configuration of $w$ and $d$ is a configuration of $x \setminus w$. If the values of $f$ are all positive, then the values of $f^{\downarrow w}$ are as well, and the division succeeds;

⁴ Bear in mind that $P = P^{\downarrow w} Q$ means $P(c) = P^{\downarrow w}(c^{\downarrow w})\, Q(c)$. Thus each entry in Table 1.8 multiplies a whole row (three entries) in Table 1.5.


it produces the unique $Q$ on $x$ satisfying equation (1.6). Using the combination axiom, we find that

$Q^{\downarrow w} = \left(\frac{f}{f^{\downarrow w}}\right)^{\downarrow w} = \frac{f^{\downarrow w}}{f^{\downarrow w}} = 1_w,$

so $Q$ is a conditional with tail $w$. If the values of $f$ are merely all nonnegative, then the division in equation (1.8) may fail for some $c$, but if $f^{\downarrow w}(c) = 0$, then $f(c.d) = 0$ for all $d$, and hence equation (1.6) will be satisfied with arbitrary values of $Q(c.d)$ for that $c$. In particular, we may choose the $Q(c.d)$ to be nonnegative and add to one for each such $c$, so that $Q$ is a conditional.

Since a probability distribution has nonnegative but not necessarily all positive values, it has continuers but not necessarily unique continuers. In our example, the nonuniqueness is in the conditional probabilities for middle-aged males. Since middle-aged males have probability zero in Table 1.8, we can change the numbers 1/5, 1/5, and 3/5 in Table 1.5 however we want without falsifying equation (1.6).
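Continuing the sketch above (it reuses the dict `P` and the position constants defined there), the hypothetical function `continuer` below carries out the division of equation (1.8), falling back to an arbitrary—here uniform—row wherever the marginal is zero, as the proof of Proposition 1.2 allows.

```python
def continuer(f, keep, rest_frame_size):
    """A continuer Q of f from the variables at positions `keep` to all of f's
    variables, via Q(c.d) = f(c.d) / f^{w}(c).  Rows whose marginal is zero get
    a uniform (arbitrary but legitimate) distribution over the remaining frame."""
    marg = {}
    for config, v in f.items():
        key = tuple(config[i] for i in keep)
        marg[key] = marg.get(key, 0.0) + v
    Q = {}
    for config, v in f.items():
        key = tuple(config[i] for i in keep)
        Q[config] = v / marg[key] if marg[key] > 0 else 1.0 / rest_frame_size
    return Q

# Continuing Table 1.1 from {Age, Sex} reproduces Table 1.5, except that the
# middle-aged males (probability zero in Table 1.8) get 1/3, 1/3, 1/3 here
# instead of the 1/5, 1/5, 3/5 shown in Table 1.5.
Q = continuer(P, keep=(AGE, SEX), rest_frame_size=3)
print(Q[('young', 'female', 'Dem')], Q[('middle-aged', 'male', 'Rep')])  # about 0.25 and 0.333
```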

In addition to continuation to the whole domain of a function, we are also interested in continuation to subsets. So we generalize the definition of continuation: If $f$ is a function on $y$, $w \subseteq x \subseteq y$, and

$f^{\downarrow x} = f^{\downarrow w}\, Q,$   (1.9)

then we say that $Q$ continues $f$ from $w$ to $x$.

In the following chapters, we will frequently be interested in marginals and continuers for probability distributions that are proportional to a given function. The next proposition lists some relatively obvious but important aspects of this situation.

PROPOSITION 1.3.
1. Suppose $f$ is proportional to $g$. Then any marginal of $f$ is proportional to the corresponding marginal of $g$, with the same constant of proportionality. In other words, if $f = kg$ and $w$ is a subset of the domain of $f$, then $f^{\downarrow w} = k g^{\downarrow w}$.
2. Suppose $f$ is proportional to $g$; $f = kg$ for some nonzero constant $k$. Then any continuer for $f$ is also a continuer for $g$.
3. Suppose the probability distribution $P$ is proportional to the function $f$ on $x$. Then the constant of proportionality is $1/f^{\downarrow \emptyset}$, and $P$ is $f$'s unique continuer from $\emptyset$ to $x$:

$P = \frac{1}{f^{\downarrow \emptyset}}\, f$   (1.10)

and

$f = f^{\downarrow \emptyset}\, P.$   (1.11)

Moreover,


4. A probability distribution is its own unique continuer from the empty set to its domain.

Proof. Statement 1 follows directly from the definition of marginalization, equation (1.2).

To prove statement 2, we substitute $kg$ for $f$ in equation (1.9), obtaining $(kg)^{\downarrow x} = (kg)^{\downarrow w} Q$. By the combination axiom, this becomes $k g^{\downarrow x} = k g^{\downarrow w} Q$, or $g^{\downarrow x} = g^{\downarrow w} Q$.

Again by the combination axiom, $P = kf$ implies $P^{\downarrow \emptyset} = k f^{\downarrow \emptyset}$. Since $P$ is a probability distribution, $P^{\downarrow \emptyset} = 1$, whence $k = 1/f^{\downarrow \emptyset}$. So equation (1.10) holds. Since $f^{\downarrow \emptyset}$ is a positive number, equation (1.10) is the unique solution of equation (1.11); $P$ is the unique continuer of $f$ from $\emptyset$ to $x$.

To prove statement 4, substitute $P$ for $f$ in equation (1.11) and again apply the combination axiom.

Equations (1.6) and (1.9) do not require that $Q$ be a function on $x$. They require only that $Q$'s domain, say $v$, should satisfy $x = w \cup v$ or, equivalently, $x \setminus w \subseteq v \subseteq x$. In some cases (when the right-hand side of equation (1.8) does not depend on all the coordinates of $c$), there is a continuer with a domain $v$ that is smaller than $x$. The situation is illustrated in Figure 1.2, where we have written $u_1$ for $w \setminus v$, $u_2$ for $w \cap v$, and $u_3$ for $v \setminus w$. We may say, in this situation, that $u_2$ is sufficient for the continuation from $w$ to $x$; the other variables in $w$, those in $u_1$, can be neglected.

If the function $f$ that we are continuing is a probability distribution, then the idea of sufficiency can be elaborated in terms of the meaning of the probabilities. If we give the probabilities an objective interpretation, then we can say that once the configuration of $u_2$ is determined, the configuration of $u_1$ will not affect the determination of the configuration of $u_3$. If we give the probabilities a subjective interpretation, then we can say that once we know the configuration of $u_2$, information about the configuration of $u_1$ will not affect our beliefs about the configuration of $u_3$.

The philosophy of probability that underlies this monograph is neither strictly objective nor strictly subjective. Instead, it is constructive. We see a probability distribution as something we deliberately construct in order to make predictions. Though these predictions may be the best we can do, we need not be fully committed to them as beliefs. And though they should be evaluated empirically, they need not individually represent stable frequencies. In terms of this constructive interpretation, sufficiency simply means adequacy for prediction. Once the configuration of $u_2$ is specified, we ignore information about $u_1$ when we predict $u_3$.

Instead of saying that $u_2$ is sufficient for the continuation from $w$ to $x$, we may say that $u_3$ is independent of $u_1$ given $u_2$. The concept of conditional independence thus defined is mathematically interesting. Its properties include the symmetry suggested by Figure 1.2: if $u_3$ is independent of $u_1$ given $u_2$, then $u_1$ is independent of $u_3$ given $u_2$ (see Dawid [27], Pearl [8], or Appendix F of Shafer [9]). Conditional independence is an important concept for both the objective and subjective interpretations of probability. In the objective interpretation, a


FIG. 1.2. Sufficiency and conditional independence.

conditional independence relation is a hypothesis about population frequencies or perhaps about causation. In the subjective interpretation, it is a hypothesis about a person's beliefs. It is also important for the constructive interpretation of probability, but it does not play a large role in the purely computational issues considered in this monograph.

1.5. Posterior distributions.

Suppose the probability distribution $P$ on $x$ expresses our beliefs about the values of the variables in $x$. And suppose we now observe the values of the variables in a subset $w$ of $x$; we observe that $w$ has the configuration $c$. How should this change our beliefs about the remaining variables, the variables in $x \setminus w$?

The standard answer is that we should change our beliefs by conditioning $P$ on $w = c$. This means that we should change our belief that $x \setminus w = d$ from $P^{\downarrow x \setminus w}(d)$ to

$\dfrac{P(c.d)}{P^{\downarrow w}(c)}.$   (1.13)

We call this number $P$'s posterior probability for $d$ given $c$. It exists only if $P^{\downarrow w}(c) > 0$, but we may suppose that if $P^{\downarrow w}(c)$ is zero we will not observe $w = c$.

Equation (1.13) defines a whole probability distribution—a distribution on $x \setminus w$ that we may designate by $P^{x \setminus w \mid w = c}$:

$P^{x \setminus w \mid w = c}(d) = \dfrac{P(c.d)}{P^{\downarrow w}(c)}$   (1.14)

for each configuration $d$ of $x \setminus w$. We call this distribution $P$'s posterior distribution for $w = c$. As the following proposition notes, it is proportional to a subtable of $P$, and it is equal to a subtable of any continuer of $P$ from $w$ to $x$.

PROPOSITION 1.4. Suppose $P$ is a probability distribution on $x$, $w \subseteq x$, and $c$ is a configuration of $w$ such that $P^{\downarrow w}(c) > 0$.

1. $P^{x \setminus w \mid w = c} \propto P|_{w=c}$.
2. If $Q$ continues $P$ from $w$ to $x$, then $P^{x \setminus w \mid w = c} = Q|_{w=c}$.

Proof. Statement 1 follows from equation (1.14) and the definition of slice, equation (1.4). Statement 2 follows from equations (1.8) and (1.14).


Sometimes it is convenient to consider the posterior probability distribution not just for $x \setminus w$ but for the entire set of variables $x$. This is the probability distribution $P^{\mid w = c}$ on $x$ given by

$P^{\mid w = c}(e) = \begin{cases} P(e)/P^{\downarrow w}(c) & \text{if } e^{\downarrow w} = c, \\ 0 & \text{otherwise,} \end{cases}$   (1.15)

for each configuration $e$ of $x$. We will refer to $P^{\mid w = c}$ as $P$'s extended posterior distribution for $w = c$. It consists mostly of zeros. The posterior for the remaining variables, $P^{x \setminus w \mid w = c}$, is related to $P^{\mid w = c}$ in two ways. It is a slice:

$P^{x \setminus w \mid w = c} = P^{\mid w = c}|_{w=c}.$

And it is also a marginal:

$P^{x \setminus w \mid w = c} = (P^{\mid w = c})^{\downarrow x \setminus w}.$

Equation (1.15) says that $P^{\mid w = c}$ is equal to the product of $P$ and the function on $w$ that assigns the value $1/P^{\downarrow w}(c)$ to the configuration $c$ and the value $0$ to all other configurations. It follows that $P^{\mid w = c}$ is proportional to the product of $P$ and a function on $w$ that assigns $1$ to $c$ and $0$ to all other configurations. This point is sufficiently important to merit being stated in symbols. To this end, we write $I_{w=c}$ for the function on $w$ that assigns $1$ to $c$ and $0$ to all other configurations:

$I_{w=c}(e) = \begin{cases} 1 & \text{if } e = c, \\ 0 & \text{otherwise,} \end{cases}$

and we state the following proposition.

PROPOSITION 1.5. If $P$ is a probability distribution on $x$ and $c$ is a configuration of a subset $w$ of $x$ such that $P^{\downarrow w}(c) > 0$, then

$P^{\mid w = c} \propto I_{w=c}\, P,$

and the constant of proportionality is $1/P^{\downarrow w}(c)$.

In the following chapters, we will be interested in a probability distribution

$P$ given in factored form, say

$P \propto f_1 f_2 \cdots f_k,$   (1.17)

where the $f_i$ are tables of reasonable size, but the number of variables involved altogether is too large to allow the actual computation and storage of the table $P$. (It will not be difficult to compute the value of $P$ for a particular configuration, at least if we know the constant of proportionality. But there may be too many configurations for us to compute the value of $P$ for all of them.) In this situation, as we will see, we can often work from the factorization to find marginals for $P$, even though we cannot compute $P$ itself. We may also be interested in computing marginals for posteriors of $P$, and therefore we will be interested in transforming


(1.17) into a factorization of the posterior. The following proposition tells us how to do this.

PROPOSITION 1.6. Suppose $P$ is a probability distribution on $x$,

$P \propto f_1 f_2 \cdots f_k,$

and $c$ is a configuration of a subset $w$ of $x$ such that $P^{\downarrow w}(c) > 0$. Suppose $w = \{X_1, \ldots, X_n\}$ and $c = \{c_1, \ldots, c_n\}$. Then

$P^{x \setminus w \mid w = c} \propto f_1|_{w=c} \cdots f_k|_{w=c}$   (1.18)

and

$P^{\mid w = c} \propto f_1 \cdots f_k\, I_{X_1=c_1} \cdots I_{X_n=c_n}.$   (1.19)

Proof. Equation (1.18) follows from statement 1 of Proposition 1.4, together with the fact that a slice of a product is the product of the corresponding slices of the factors.

Equation (1.19) follows from Proposition 1.5, together with the fact that $I_{w=c} = I_{X_1=c_1} \cdots I_{X_n=c_n}$.
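A small illustration of Proposition 1.6 in the same dict-based style (a sketch, not the book's method; `slice_factor`, `posterior`, and the brute-force enumeration are illustrative names and choices). The two factors below are the Age marginal of Table 1.1 and a continuer of its {Age, Sex} marginal from {Age} (Table 1.8 divided row-wise by the Age marginal); observing Sex = male, the posterior for Age is obtained by slicing each factor and renormalizing only at the end.

```python
from itertools import product

frames = {'Age': ['young', 'middle-aged', 'old'], 'Sex': ['female', 'male']}

f1 = (['Age'], {('young',): .40, ('middle-aged',): .15, ('old',): .45})
f2 = (['Age', 'Sex'], {('young', 'female'): .8, ('young', 'male'): .2,
                       ('middle-aged', 'female'): 1.0, ('middle-aged', 'male'): 0.0,
                       ('old', 'female'): 1/3, ('old', 'male'): 2/3})

def slice_factor(variables, table, obs):
    """Fix the observed variables, keeping only agreeing entries (a slice)."""
    keep = [i for i, v in enumerate(variables) if v not in obs]
    sliced = {}
    for config, value in table.items():
        if all(config[i] == obs[v] for i, v in enumerate(variables) if v in obs):
            sliced[tuple(config[i] for i in keep)] = value
    return [variables[i] for i in keep], sliced

def posterior(factors, frames, obs):
    """Multiply the sliced factors over the unobserved frame, then renormalize.
    Brute force over all configurations -- for illustration only."""
    sliced = [slice_factor(v, t, obs) for v, t in factors]
    rest = [v for v in frames if v not in obs]
    joint = {}
    for config in product(*(frames[v] for v in rest)):
        assignment = dict(zip(rest, config))
        value = 1.0
        for vars_, table in sliced:
            value *= table[tuple(assignment[v] for v in vars_)]
        joint[config] = value
    total = sum(joint.values())
    return {k: v / total for k, v in joint.items()}

print(posterior([f1, f2], frames, {'Sex': 'male'}))
# roughly {('young',): .21, ('middle-aged',): .00, ('old',): .79}
```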

1.6. Expectation.

Most readers will be familiar with the idea of the expectation of a function $V$ on $x$ with respect to a probability distribution $P$ on $x$. This is a number, usually denoted by $E_P(V)$. In the discrete case, it is obtained by multiplying corresponding values of $P$ and $V$ and adding the products. Thus

$E_P(V) = \sum_{c \in \Omega_x} P(c)\, V(c).$

Expectation generalizes to conditional expectation. If $w$ is a subset of $x$, and $Q$ is a continuer of $P$ from $w$ to $x$, then we call the function $E_P(V \mid w)$ on $w$ given by

$E_P(V \mid w)(c) = \sum_{d \in \Omega_{x \setminus w}} Q(c.d)\, V(c.d)$   (1.21)

a conditional expectation of $V$ given $w$ (if we are out of breath, we may neglect to say "with respect to $P$"). If $P$ is strictly positive, so that it has only one continuer from $w$ to $x$, then the conditional expectation for $V$ given $w$ is also unique; in fact, equation (1.21) can be written

$E_P(V \mid w) = \frac{(PV)^{\downarrow w}}{P^{\downarrow w}}.$   (1.22)

The ratio on the right-hand side of (1.22) is unchanged if we substitute for $P$ any function $f$ proportional to $P$.

If $w$ is not empty, then the conditional expectation $E_P(V \mid w)$ is a function, not a single number. It assigns a value to every configuration $c$ of $w$. Usually, however, we write $E_P(V \mid w = c)$ instead of $(E_P(V \mid w))(c)$. If $P^{\downarrow w}(c) > 0$, then $E_P(V \mid w = c)$ is uniquely defined; it is equal to $(PV)^{\downarrow w}(c)/P^{\downarrow w}(c)$.
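A short sketch of this section in the same style (not from the monograph; the score function V is made up purely for illustration): the expectation is a sum of products, and the conditional expectation is the ratio $(PV)^{\downarrow w}/P^{\downarrow w}$ of equation (1.22), computed wherever the denominator is positive.

```python
P_age_sex = {('young', 'female'): .32, ('young', 'male'): .08,     # Table 1.8
             ('middle-aged', 'female'): .15, ('middle-aged', 'male'): .00,
             ('old', 'female'): .15, ('old', 'male'): .30}
# A made-up numerical function V on the same frame: 1 for young, 2 for
# middle-aged, 3 for old (illustration only).
V = {c: {'young': 1.0, 'middle-aged': 2.0, 'old': 3.0}[c[0]] for c in P_age_sex}

expectation = sum(P_age_sex[c] * V[c] for c in P_age_sex)   # E_P(V)

def conditional_expectation(P, V, keep):
    """(PV)^{w} / P^{w}, as in (1.22), wherever the denominator is positive."""
    num, den = {}, {}
    for c, p in P.items():
        key = tuple(c[i] for i in keep)
        num[key] = num.get(key, 0.0) + p * V[c]
        den[key] = den.get(key, 0.0) + p
    return {k: num[k] / den[k] for k in den if den[k] > 0}

print(expectation)                                       # 2.05
print(conditional_expectation(P_age_sex, V, keep=(1,)))  # E_P(V | Sex), one value per sex
```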


1.7. Classifying probability distributions.

The probability distributions we have been studying are tabular. A tabular distribution is a table that gives a probability for each configuration. We will find it useful to distinguish tabular distributions from algorithmic distributions. An algorithmic distribution consists of an algorithm, together possibly with some numerical information, that enables us to compute the probabilities of individual configurations. Algorithmic distributions can involve more or less complex algorithms and more or less numerical information. At one extreme are distributions such as the Poisson, which are specified by a single number (the mean in the case of the Poisson) and a simple formula. At another extreme are the posterior distributions that arise in Bayesian statistics, which may involve many numbers and complicated algorithms. In the next few chapters, we will be concerned with an intermediate case; we define a distribution for a large number of variables as the product of many tables of numbers, each involving only a few variables. Here there are many numbers but a simple algorithm: multiply.

The line between tabular and algorithmic distributions cuts across the line between discrete and continuous distributions. A continuous distribution, like a discrete distribution, can be either tabular or algorithmic. In the tabular case, we store the values of the density at a sufficiently large number of configurations. In the algorithmic case, we store instead a formula or algorithm that enables us to compute the value of the density at any configuration. To some extent, the line also cuts across the line between numerical and categorical variables. (Variables like Age, Sex, and Party are called categorical, because they have categories—e.g., young, old, and middle-aged—rather than numbers as possible values.) Distributions for categorical variables are usually tabular, but distributions for numerical variables can be tabular or algorithmic.

When an algorithmic distribution involves only a few numbers, we call the numbers parameters, and we call the distribution parametric. The distributions with names—Poisson, multinomial, Gaussian, and so on—are parametric.

The terms tabular, parametric, and algorithmic can be applied to conditionals and other functions as well as to distributions. These terms can help us keep track of complications involved in finding marginals and continuers of distributions and in multiplying conditionals. Figure 1.3 shows the main points. When we compute marginals, we generally stay in the same class of distributions; a marginal of a table is a table, a marginal of a Gaussian is a Gaussian, and so on. A continuer or posterior for a tabular distribution is tabular, but only in a few cases (such as the multinomial and the Gaussian) do continuers or posteriors stay in the same parametric family as their distributions. Multiplication usually takes us out of the class of tabular distributions. Given a collection of tables for the same small set of variables, we can perform the multiplication to obtain a new table, but given tables for many different small sets of variables, the size of the frame for all the variables may prevent us from computing and storing the product; we may have to settle for thinking of the multiplication as an algorithm that allows us to find the probability for a particular configuration when we want it.
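In code, the distinction looks like this (a minimal sketch; the table and the Poisson mean are made-up numbers for illustration): a tabular distribution is stored as explicit probabilities, while a parametric one is a parameter plus a formula.

```python
from math import exp, factorial

tabular = {('young',): .40, ('middle-aged',): .15, ('old',): .45}   # a table

def poisson(mean):
    """A parametric distribution: one number plus a simple formula."""
    return lambda k: exp(-mean) * mean**k / factorial(k)

p = poisson(2.0)
print(tabular[('old',)], p(3))   # .45 and about 0.180
```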


FIG. 1.3. The effect of computation.

The distinction between tabular and algorithmic distributions is based on the handling of probabilities or density values for individual configurations. It is only probabilities for individual configurations that are explicitly stored by a tabular distribution; probabilities for sets of configurations must still be computed. This emphasis on individual configurations is appropriate for expert systems, but it is not appropriate for all applications of probability. It is inappropriate for advanced mathematical probability, which is concerned with infinitely many variables.

1.8. A limitation.

Though the multivariate framework for probability is widely used, it has its limitations. A principal limitation is that it requires every variable to have a value no matter how matters come out. This is often appropriate in statistical work; in our example, every individual has an age and a sex, and we invent the category "independent" so that every individual will have a party affiliation. It is less appropriate in expert-system work, where the meaningfulness of a variable often depends on the values of other variables. A particular medical test or procedure only has a result if it is carried out, and we carry it out only for some patients. A particular phoneme has a certain characteristic in the seventh millisecond only if it lasts that long, and sometimes it may not. "Number of pregnancies" is applicable only to women, not to men and children. We can pretend that these variables always have values, but when there are many of them, this is computationally awkward as well as artificial.

It is one thing to recognize this limitation and another to correct it. The multivariate framework is flexible as well as expressive, and the obvious alternatives lack much of its flexibility. A tree, for example, allows us to represent some variables as being meaningful only if others have certain values but allows access to the variables only in a certain order. Consequently, most work in probability—both theory and application—is carried out within the multivariate framework, and extensions to the framework are developed and used on a fairly ad hoc basis.

The graphical models that we will study in the following chapters are squarely within the multivariate framework. For some ideas about going beyond it, see Dempster [16] and Chapter 16 of Shafer [9].


Exercises.

EXERCISE 1.1. Derive the three properties of marginalization listed in §1.2 from equation (1.2).

EXERCISE 1.2. Here are some familiar problems, each with its own concept of combination and its own concept of marginalization. Discuss, in each case, how to formalize the problems so that the axioms of transitivity and combination are satisfied.

1. Systems of equations (or, more generally, systems of constraints on numerical variables) are combined by pooling and marginalized (we usually say "reduced") by eliminating variables.

2. Linear programming problems can be combined by adding (or perhaps multiplying) their objective functions and pooling their constraints. They can be reduced by maximizing their objective functions over variables that are eliminated.

3. Discrete belief functions are combined by Dempster's rule and marginalized by restricting the events for which beliefs are demanded. (One formalization is provided by Shafer, Shenoy, and Mellouli [45] and another by Shenoy and Shafer [48].)

In which of these problems do continuers exist?

EXERCISE 1.3. Fix a set of variables $X$, and consider all pairs of the form $(f, V)$, where $f$ is a strictly positive table on some subset $x$ of $X$, and $V$ is an arbitrary table on the same set of variables $x$. Call $x$ the domain of $(f, V)$. Define multiplication for such pairs by setting

$(f_1, V_1)(f_2, V_2) = (f_1 f_2,\; V_1 + V_2).$

Define marginalization by setting

$(f, V)^{\downarrow w} = \left(f^{\downarrow w},\; \frac{(fV)^{\downarrow w}}{f^{\downarrow w}}\right).$

Show that these operations satisfy the axioms of transitivity and combination. (Compare equation (1.22).) This example, suggested to the author by Robert Cowell, is relevant to computation in decision theory, where $f$ may represent a probability distribution and $V$ may represent a utility function.

EXERCISE 1.4. Consider a function $f$ on a set of variables $x$, together with a collection $\{h_X\}_{X \in x}$ of functions on the individual variables in $x$. For each subset $w$ of $x$, let $f^{\downarrow w}$ be the marginal on $w$ of the function obtained by multiplying $f$ by the $h_X$ for $X$ not in $w$. In symbols,


The function $f^{\downarrow w}$ is called the out-marginal of $f$ on $w$, since it involves leaving certain factors out (Cowell and Dawid [25]).

Show that out-marginalization and multiplication satisfy the axioms of transitivity and combination. What is the meaning of out-marginalization in the context of equation (1.19)?

EXERCISE 1.5. The numerical functions on a given set of discrete variables and its subsets form a commutative semigroup under multiplication. The sets of variables themselves form a lattice. Each element of the semigroup is labeled by an element of the lattice. Marginalization reduces an element of the semigroup to an element with a smaller label.

Formulate axioms of transitivity and combination in the abstract setting of a commutative semigroup and associated lattice. Give examples where continuers do and do not exist.

EXERCISE 1.6. In unpublished work [28], A. P. Dempster has shown how the Kalman filter can be understood in terms of the combination and marginalization of belief functions. Dempster calls the belief functions involved normal belief functions. A normal belief function on a given linear space of variables consists of a linear functional and an inner product on a subspace of the linear space. Intuitively, the linear functional tells the expected values of variables in the subspace, and the inner product tells their covariances. Marginalization amounts to restricting the linear functional and inner product to a yet smaller subspace. Combination is most easily described in the dual of the linear space of variables—the linear space of configurations. Here the normal belief function looks like an inner product (the dual of the covariance inner product) on a hyperplane, and combination amounts to intersecting hyperplanes and adding the inner products.

Verify that the axioms of transitivity and combination are satisfied in this geometric framework.


CHAPTER 2

Construction Sequences

Under certain conditions on the heads and tails of a sequence of conditionals, the product of the conditionals will be a probability distribution. We call a sequence of conditionals satisfying these conditions a construction sequence.

As we will see, the conditionals in a construction sequence are continuers for the probability distribution obtained by multiplying them together. Initial segments of the sequence produce marginals of this probability distribution. Thus the construction sequence represents a step-by-step construction of the probability distribution.

After constructing a probability distribution, we may want to find a marginal for it or one of its posteriors. This may be difficult computationally, especially if the joint frame of all the variables is too large to permit us to carry out the multiplication of the conditionals. Were we able to carry out this multiplication, we could store the resulting table and work directly with it to find marginals. But if we are obliged to keep the probability distribution stored as a product of tables, then we must look for less direct methods.

In some cases, as we will see in this chapter, a computationally inexpensive adaptation of a construction sequence will produce a construction sequence for the marginal we desire. To obtain the marginal for the variables in an initial segment of a construction sequence, we need only omit the later factors from the construction sequence. To obtain the posterior for later variables given values of the variables in an initial segment, we need only slice the later factors. If the construction sequence is a chain, then we can find a construction sequence for the variables in a final segment by a simple forward propagation. The general case, however, requires the more general methods that we will study in the next chapter—methods that apply to any distribution stored as a product of tables, whether or not the tables form a construction sequence.

If each new conditional in a construction sequence involves a single new variable, then the most essential qualitative aspects of the construction sequence can be represented by a directed acyclic graph (DAG). Such graphs have been widely used for knowledge acquisition for probabilistic expert systems, and on the theoretical side, they have been studied as a representation of conditional independence relations (Pearl [8]). Here we emphasize the value of DAGs for representing alternative construction sequences—construction sequences that use the same conditionals but order them differently. By bringing these alternative orderings into the picture, a DAG enlarges the number of marginals and posteriors that we can find by simple manipulations. In the general case, where each new conditional is allowed to involve more than one new variable, we can similarly indicate alternative orderings with a bubble graph, which is slightly more general than a DAG.

TABLE 2.1. $Q_1$, a probability distribution for Age. (This is a conditional with an empty tail and with Age as its head.)

young            .40
middle-aged      .15
old              .45

2.1. Multiplying conditionals.

Table 2.1 gives a probability distribution $Q_1$ for Age (its single column adds to one), and Table 2.2 gives a conditional $Q_2$ for Sex given Age (each row adds to one). When we multiply these two tables, we get Table 2.3, which qualifies as a probability distribution for Age and Sex (its six entries add to one). Notice that $Q_1$ is a marginal of this probability distribution and hence $Q_2$ is a continuer.

We need not carry out the numerical multiplication in order to see that the product $Q_1 Q_2$ is a probability distribution. We can instead perform an abstract computation:

$\sum_{Age,\,Sex} Q_1 Q_2 = \sum_{Age} \sum_{Sex} Q_1 Q_2 = \sum_{Age} Q_1 \sum_{Sex} Q_2 = \sum_{Age} Q_1 = 1.$   (2.1)

Here we have first broken the summation into a summation over Sex followed by a summation over Age. Since $Q_1$ does not involve Sex, it can be factored out of the first summation, leaving $Q_2$, which sums to one over Sex because it is a conditional. This leaves us with the sum of $Q_1$ over Age, which is one because $Q_1$ is a probability distribution.

TABLE 2.2. $Q_2$, a conditional with Age as its tail and Sex as its head.

                  female    male
young               4/5      1/5
middle-aged          1        0
old                 1/3      2/3

TABLE 2.3. $Q_1 Q_2$, a probability distribution for Age and Sex.

                  female    male
young               .32      .08
middle-aged         .15      .00
old                 .15      .30
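The same conclusion can be checked numerically. This minimal sketch (not from the monograph) multiplies Tables 2.1 and 2.2 entry by entry and confirms that the product sums to one and that $Q_1$ is its marginal on {Age}.

```python
Q1 = {'young': .40, 'middle-aged': .15, 'old': .45}                 # Table 2.1
Q2 = {('young', 'female'): 4/5, ('young', 'male'): 1/5,             # Table 2.2
      ('middle-aged', 'female'): 1.0, ('middle-aged', 'male'): 0.0,
      ('old', 'female'): 1/3, ('old', 'male'): 2/3}

# Entry-by-entry product: a table on {Age, Sex}
product_table = {(age, sex): Q1[age] * Q2[(age, sex)] for (age, sex) in Q2}

assert abs(sum(product_table.values()) - 1.0) < 1e-9         # a distribution
for age in Q1:                                                # Q1 is the marginal on {Age}
    assert abs(sum(p for (a, s), p in product_table.items() if a == age)
               - Q1[age]) < 1e-9
print(product_table[('old', 'male')])                         # 0.30, as in Table 2.3
```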

Consider more generally any two conditionals $Q_1$ and $Q_2$. Write $t_i$ for the tail, $h_i$ for the head, and $d_i$ for the domain of $Q_i$. (Recall that $d_i = t_i \cup h_i$.) Our example generalizes to the following proposition.

PROPOSITION 2.1. Suppose $t_1$ is empty, $t_2$ is contained in $d_1$, and $h_2$ is disjoint from $d_1$.

1. The product $Q_1 Q_2$ is a probability distribution on $d_1 \cup d_2$.
2. The conditional $Q_1$ is $Q_1 Q_2$'s marginal on $d_1$.
3. The conditional $Q_2$ continues $Q_1 Q_2$ from $d_1$ to $d_1 \cup d_2$.

Proof. Since we do not have symbols for individual variables, we will not use summations like those in equation (2.1); instead, we will use our notation for marginalization. We prove statement 1 by writing

$(Q_1 Q_2)^{\downarrow \emptyset} = \bigl((Q_1 Q_2)^{\downarrow d_1}\bigr)^{\downarrow \emptyset} = \bigl(Q_1\, Q_2^{\downarrow t_2}\bigr)^{\downarrow \emptyset} = \bigl(Q_1\, 1_{t_2}\bigr)^{\downarrow \emptyset} = Q_1^{\downarrow \emptyset} = 1.$

Here we have used both the transitivity and the combination axioms.

Since $Q_1$ has an empty tail, it is a probability distribution. By the combination axiom,

$(Q_1 Q_2)^{\downarrow d_1} = Q_1\, Q_2^{\downarrow d_1 \cap d_2} = Q_1\, Q_2^{\downarrow t_2} = Q_1\, 1_{t_2} = Q_1.$

Thus $Q_1$ is $Q_1 Q_2$'s marginal on $d_1$, and therefore, by the definition of continuer, $Q_2$ continues $Q_1 Q_2$ from $d_1$ to $d_1 \cup d_2$.

Now consider a sequence of $n$ conditionals, $Q_1, \ldots, Q_n$. Proposition 2.1 generalizes, by induction, as follows.

PROPOSITION 2.2. Suppose $t_1$ is empty. Suppose $t_i$ is contained in $d_1 \cup \cdots \cup d_{i-1}$ and $h_i$ is disjoint from $d_1 \cup \cdots \cup d_{i-1}$ for $i = 2, \ldots, n$.

1. $Q_1 \cdots Q_n$ is a probability distribution with domain $d_1 \cup \cdots \cup d_n$.
2. For $i = 1, \ldots, n-1$, $Q_1 \cdots Q_i$ is the marginal of $Q_1 \cdots Q_n$ on $d_1 \cup \cdots \cup d_i$.
3. For $i = 2, \ldots, n$, $Q_i$ continues $Q_1 \cdots Q_n$ from $d_1 \cup \cdots \cup d_{i-1}$ to $d_1 \cup \cdots \cup d_i$.
4. More generally, if $1 \le i \le j \le n$, then $Q_i \cdots Q_j$ continues $Q_1 \cdots Q_n$ from $d_1 \cup \cdots \cup d_{i-1}$ to $d_1 \cup \cdots \cup d_j$.

When the hypotheses of Proposition 2.2 are satisfied, we call the sequence $Q_1, \ldots, Q_n$ a construction sequence for the probability distribution $Q_1 \cdots Q_n$,


and we say that the construction sequence represents this probability distribution. The restrictions on the head-tail structure of a construction sequence are illustrated in Figure 2.1.

FIG. 2.1. Left: the first tail is empty. The second tail is contained in the first domain, and the second head is disjoint from the first domain. Right: two more head-tail pairs have been added. Each time, the new tail is contained in the existing domain, and the new head is disjoint from it.

Statement 2 of Proposition 2.2 indicates one way that we can exploit a construction sequence. If we are interested only in the variables in $d_1 \cup \cdots \cup d_i$ and not in the remaining variables—those in $h_{i+1} \cup \cdots \cup h_n$—then we can simply omit the last $n - i$ conditionals from the construction sequence: $Q_1, \ldots, Q_i$ is a construction sequence for the marginal probability distribution on $d_1 \cup \cdots \cup d_i$.

Another way to exploit a construction sequence is to fix the values of variables we have observed. If these variables appear at the beginning of the construction sequence, then this produces a construction sequence for the posterior distribution.

PROPOSITION 2.3. Suppose $Q_1, \ldots, Q_n$ is a construction sequence. Suppose $1 \le i < n$. Write $d$ for $\cup_{j=1}^{n} h_j$, the domain of $Q_1 \cdots Q_n$, and write $t$ for $\cup_{j=1}^{i} h_j$, the domain of $Q_1 \cdots Q_i$. Suppose $c$ is a configuration of $t$. Then

$(Q_1 \cdots Q_n)^{d \setminus t \mid t = c} = Q_{i+1}|_{t=c} \cdots Q_n|_{t=c}.$

Proof. By statement 4 of Proposition 2.2, $Q_{i+1} \cdots Q_n$ continues $Q_1 \cdots Q_n$ from $t$ to $d$. So the proposition follows from Proposition 1.4, together with the fact that a slice of a product is equal to the product of the corresponding slices of the factors.

2.2. DAGs and belief nets.

The expert-systems literature has devoted considerable attention to construction sequences that add one new variable at a time—i.e., construction sequences in which each head consists of a single variable. In this case, we can write

$P = Q_1 Q_2 \cdots Q_n,$   (2.2)

where $P$ is the probability distribution being constructed, $X_i$ is the single variable in the head of $Q_i$, and $t_i \subseteq \{X_1, \ldots, X_{i-1}\}$. We began the chapter with an example of equation (2.2):

$P_{Age,Sex} = Q_1 Q_2.$   (2.3)


TABLE 2.4. A conditional for Party given Age.

                  Dem     ind     Rep
young             1/4     1/2     1/4
middle-aged       1/3     1/3     1/3
old               1/3     1/3     1/3

We leave it to the reader to check that if we also multiply in the conditional $Q_3$ given by Table 1.5, then we obtain the probability distribution $P_{Age,Sex,Party}$ given by Table 1.1:

$P_{Age,Sex,Party} = Q_1 Q_2 Q_3.$   (2.4)

Notice that if we use instead the conditional $Q_3'$ given by Table 2.4, then we obtain the same probability distribution $P_{Age,Sex,Party}$:

$P_{Age,Sex,Party} = Q_1 Q_2 Q_3'.$   (2.5)

Like equation (2.3), equations (2.4) and (2.5) represent one-new-variable-at-a-time construction sequences.

When one new variable is added at a time, the head-tail structure of the construction sequence can be represented by a directed acyclic graph (DAG for short). This graph has the variables as nodes, and it has arrows to $X_i$ from each element of $t_i$, for $i = 2, \ldots, n$. We call this graph directed because the links between the nodes are arrows, and we call it acyclic because there are no cycles following the arrows.⁵ (Since the arrows we draw to each $X_i$ are all from $X_j$ with $j < i$, any path following the arrows always goes in the direction of increasing indices; it cannot cycle back to a smaller index.) Figure 2.2 shows DAGs for the construction sequences represented by equations (2.3), (2.4), and (2.5), respectively. Figure 2.3 shows the DAG for the more complex construction sequence represented by the equation

The middle graph in Figure 2.2 and the graph in Figure 2.3 both have cycles, but not cycles following the arrows. The cycle $X_1, X_3, X_4, X_1$ in Figure 2.3, for example, goes against an arrow on its last step.

A belief net is a finite DAG with variables as nodes, together with, for each node $X$, a conditional that has $X$ as its head and $X$'s immediate predecessors

⁵ Some authors prefer the name acyclic directed graph in order to emphasize that only directed cycles are forbidden; a path that does not always follow the arrows is allowed to be a cycle. But the name directed acyclic graph and the acronym DAG are strongly established in the literature.


FIG. 2.2. DAGs for the numerical example.

FIG. 2.3. A more complex DAG.

in the DAG as its tail.⁶ We have just explained how a construction sequence determines a belief net. It is also true that the conditionals in a belief net can always be ordered so as to form a construction sequence. This follows from the following lemma.

LEMMA 2.1. The nodes of a finite DAG can always be ordered so that each variable's immediate predecessors in the DAG precede it in the ordering. In other words, we can find an ordering $X_1, \ldots, X_n$ such that the immediate predecessors of $X_i$ in the DAG are a subset of $\{X_1, \ldots, X_{i-1}\}$. (In particular, $X_1$ has no predecessors in the DAG.)

Proof. The simplest proof is by induction on $n$, the number of variables in the DAG. There is at least one node in the DAG that has no successors; if every node had a successor, then we could form a cycle by going from each node to a successor until (because there are only finitely many nodes) we repeated ourselves. If we choose a node with no successors as $X_n$, and if we then remove this node and the arrows to it, then we obtain a DAG with only $n - 1$ nodes which, by the inductive hypothesis, has an ordering $X_1, \ldots, X_{n-1}$ satisfying the condition. The ordering $X_1, \ldots, X_n$ then also satisfies the condition.
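The proof is constructive, and a minimal sketch of it is easy to write down (the dictionary representation and the five-node DAG below are hypothetical, not Figure 2.3, whose arrows are not reproduced here): repeatedly pick a node that no remaining node lists as a parent, put it at the back, and reverse at the end.

```python
def construction_ordering(dag):
    """dag maps each node to the set of its immediate predecessors (the tails).
    Returns a DAG construction ordering as in the proof of Lemma 2.1."""
    remaining = set(dag)
    ordering = []
    while remaining:
        # a node is successor-free if no remaining node lists it as a parent
        last = next(n for n in remaining
                    if all(n not in dag[m] for m in remaining if m != n))
        ordering.append(last)
        remaining.remove(last)
    ordering.reverse()   # we built the ordering from the back
    return ordering

# A hypothetical five-node DAG:
dag = {'X1': set(), 'X2': {'X1'}, 'X3': {'X1'}, 'X4': {'X2', 'X3'}, 'X5': {'X4'}}
print(construction_ordering(dag))   # e.g. ['X1', 'X2', 'X3', 'X4', 'X5']
```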

We may call an ordering of the nodes of a DAG that satisfies the conditions of Lemma 2.1 a DAG construction ordering. Unless a DAG is merely a chain, it has more than one DAG construction ordering. The DAG in Figure 2.3, for example, has five:

⁶ A variety of other names are also in use, including Bayesian network and graphical model.


Every DAG construction ordering for the DAG of a belief net gives, of course, an ordering of its conditionals that is a construction sequence for the probability distribution represented by the belief net. Thus the five DAG construction orderings we just listed produce five construction sequences for the probability distribution in equation (2.6)—five ways to permute the $Q_i$ and still have a construction sequence.

We can talk about a belief net representing a probability distribution, without reference to any particular construction sequence: a belief net represents a probability distribution $P$ if $P$ is equal to the product of the conditionals attached to its DAG. We can also talk about a DAG by itself representing a probability distribution: a DAG represents $P$ if by attaching appropriate conditionals we can make it into a belief net representing $P$—i.e., if $P$ factors into conditionals in the way indicated by the DAG.

Considered abstractly, a belief net represents a probability distribution more concisely than a construction sequence does. It provides the same conditionals, but it refrains from ordering them completely. For this reason, belief nets are considered more fundamental than construction sequences in much of the literature on probabilistic expert systems. As a practical matter, however, belief nets arise from a step-by-step construction that provides a complete ordering, and we usually preserve this ordering when we store a belief net. Moreover, as we will see in the next section, there is no practical advantage in considering only construction sequences that introduce one new variable at a time. So in this monograph, we take construction sequences as fundamental, and we treat belief nets as secondary tools—tools that help us see alternative orderings for particular one-new-variable-at-a-time construction sequences. In small problems, where we can actually draw the DAG, it enables us to see alternative orderings at a glance. In larger problems, the idea of the DAG reminds us of the existence of alternative orderings.

Marginals and posteriors. From a computational point of view, the alternative construction sequences that we can discern by studying a DAG are important because they broaden the application of Propositions 2.2 and 2.3. Since we can apply these propositions to any construction sequence consistent with the DAG, we can obtain construction sequences for a much larger class of marginals and posteriors than we can obtain by working with a single construction sequence.

Propositions 2.2 and 2.3 are concerned with initial segments of a construction sequence. We may also talk about initial segments of a DAG. We say that a set $w$ of nodes of a DAG is an initial segment of the DAG if all the immediate predecessors of each element of $w$ are also in $w$.

LEMMA 2.2. A set w of nodes in a finite DAG is an initial segment of the DAG if and only if the DAG has a DAG construction ordering X_1, ..., X_n such that

    w = {X_1, ..., X_k}    (2.7)

for some k.


Proof. It is obvious that if a DAG construction ordering satisfying the two conditions exists, then w is an initial segment in the DAG. To derive the existence of such an ordering from the assumption that w is an initial segment in the DAG, we adapt the proof of Lemma 2.1. We argue by induction on m, the number of nodes not in w. If m = 0, then the ordering exists by Lemma 2.1. If m ≠ 0—i.e., w does not include all the nodes in the DAG—then there is at least one node outside w that has no successors, for if every node outside w had a successor, this successor would also be outside w, and we could form a cycle of nodes outside w by going from each node to a successor until we repeated ourselves. If we choose a node that lies outside w and has no successors as X_n, and if we then remove this node and the arrows to it, then we obtain a DAG with only m - 1 nodes outside w which, by the inductive hypothesis, has a DAG construction ordering X_1, ..., X_{n-1} satisfying (2.7). By adding X_n to the end of this ordering, we obtain a DAG construction ordering X_1, ..., X_n for the original DAG that also satisfies (2.7).

The definition of initial segment in a DAG, together with Lemma 2.2 andPropositions 2.2 and 2.3, yields the following proposition.

PROPOSITION 2.4. Suppose w is an initial segment of a belief net that rep-resents a probability distribution P.

1. Suppose we delete the nodes not in w, together with the arrows to them, and the conditionals associated with them. Then the resulting belief net represents P's marginal on w.

2. Suppose c is a configuration of w. Suppose we delete the nodes in w, together with the arrows from them and the conditionals associated with them, and suppose we change the conditional on each of the remaining nodes by slicing it on w = c. Then the resulting belief net represents P's posterior given w = c.

The simplicity and visual clarity of this proposition account for much of the appeal of belief nets.

Proposition 2.4 can be thought of as a statement about alternative construc-tion sequences. It says that if we begin with one construction sequence (the onewe used to construct the belief net), then we can shift to an alternative one toget marginals and conditionals. We can say this without reference to the beliefnet as follows.

PROPOSITION 2.5. Suppose Q_1, ..., Q_n is a one-new-variable-at-a-time construction sequence for a probability distribution P. Suppose i_1, ..., i_k is a sequence of distinct integers between 1 and n such that t_{i_1} is empty and t_{i_j} is contained in {X_{i_1}, ..., X_{i_{j-1}}} for j = 2, ..., k. Write w for {X_{i_1}, ..., X_{i_k}}.

1. Q_{i_1}, ..., Q_{i_k} is a construction sequence for P^{↓w}.

2. Suppose c is a configuration of w. Suppose we modify the sequence Q_1, ..., Q_n by deleting each Q_{i_j} and by slicing each of the other conditionals on w = c. Then the result is a construction sequence for P's posterior given w = c.
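For readers who like to check such statements numerically, here is a small illustration of part 2, written in Python with NumPy. It is not from the monograph: the chain X_1 → X_2 → X_3, the numbers, and the array layout (Q2[a, b] = P(X_2 = b | X_1 = a)) are all hypothetical. Observing X_1 = c, we delete Q_1, slice Q_2 on X_1 = c, and leave Q_3 alone; the product of what remains is the posterior given X_1 = c.

    import numpy as np

    Q1 = np.array([0.3, 0.7])                      # marginal of X1
    Q2 = np.array([[0.9, 0.1], [0.2, 0.8]])        # P(X2 | X1)
    Q3 = np.array([[0.5, 0.5], [0.4, 0.6]])        # P(X3 | X2)

    c = 1                                          # the observation X1 = 1
    posterior_Q2 = Q2[c]                           # sliced: a table on X2 with empty tail
    posterior_Q3 = Q3                              # does not involve X1, so unchanged

    # Compare with conditioning the joint distribution directly.
    joint = Q1[:, None, None] * Q2[:, :, None] * Q3[None, :, :]
    direct = joint[c] / joint[c].sum()
    print(np.allclose(posterior_Q2[:, None] * posterior_Q3, direct))   # True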

Forward propagation in chains. As we have seen, it is trivial to reduce abelief net to a belief net for an initial segment. If the belief net is a chain, thenwith a bit of work we can also reduce it to a belief net for a final segment.


FIG. 2.4. A belief chain.

We call a DAG a chain if its nodes can be ordered, as in Figure 2.4, so that the first has no immediate predecessors in the DAG and each of the others has its predecessor in the ordering as its only immediate predecessor in the DAG. Notice that a chain has only one DAG construction ordering: X_1, ..., X_n is the unique DAG construction ordering for the chain X_1 → ··· → X_n.

We call a belief net a belief chain if its DAG is a chain. Thus a belief chain consists of a chain X_1 → ··· → X_n and corresponding conditionals Q_1, ..., Q_n. The first conditional has X_1 as its head and an empty tail; the ith conditional has X_i as its head and X_{i-1} as its tail. The idea of forward propagation in such a chain is based on the following lemma.

LEMMA 2.3. In a belief chain,

    P^{↓{X_2, ..., X_n}} = (Q_1 Q_2)^{↓{X_2}} Q_3 ··· Q_n.    (2.8)

Thus (Q_1 Q_2)^{↓{X_2}}, Q_3, ..., Q_n is a construction sequence for the marginal on {X_2, ..., X_n}.

Proof. Since {X_2} is the intersection of {X_2, ..., X_n} with the domain of Q_1 Q_2, equation (2.8) is an instance of the combination axiom.

By applying Lemma 2.3 repeatedly, we can reduce our initial construction sequence Q_1, ..., Q_n to a construction sequence for any final segment of the belief chain. Indeed, once we have a construction sequence R_i, Q_{i+1}, ..., Q_n for X_i → ··· → X_n, we can obtain a construction sequence R_{i+1}, Q_{i+2}, ..., Q_n for X_{i+1} → ··· → X_n by setting R_{i+1} = (R_i Q_{i+1})^{↓{X_{i+1}}}.
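When the frames are finite, each step of this forward propagation is just a vector-matrix product. The sketch below is not from the monograph and uses hypothetical numbers; R_1 is the marginal of X_1, and each Q_{i+1} is the matrix with entries P(X_{i+1} = b | X_i = a).

    import numpy as np

    R1 = np.array([0.3, 0.7])
    Q2 = np.array([[0.9, 0.1], [0.2, 0.8]])
    Q3 = np.array([[0.5, 0.5], [0.4, 0.6]])

    def forward(R, Qs):
        # Each pass of the loop forms R_{i+1}: multiply R_i by Q_{i+1} and
        # sum out the old head, which is exactly the matrix product R @ Q.
        for Q in Qs:
            R = R @ Q
        return R

    print(forward(R1, [Q2, Q3]))   # the marginal distribution of X3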

The point of this step-by-step computation is that the tables will generally be small enough for it to be implemented. In theory, we can move directly from the construction sequence Q_1, ..., Q_n to a construction sequence for the marginal on {X_i, ..., X_n}, for the combination axiom implies that

    P^{↓{X_i, ..., X_n}} = (Q_1 ··· Q_i)^{↓{X_i}} Q_{i+1} ··· Q_n.

But Q_1 ··· Q_i may be too large a table to compute.

Markov chains and hidden Markov models. Readers familiar with thetheory of Markov chains may find it illuminating to note that a finite Markovchain is a special kind of belief net. It is a belief chain such that each variable hasthe same frame and all the conditionals after the first are identical. Figure 2.5shows a simple Markov chain.

Most of the theory of Markov chains is concerned with their repetitive nature and hence does not extend to belief nets in general or even to belief chains in general. For example, a Markov chain is sometimes described in terms of its state graph. This is a directed graph (not usually acyclic) with the states (elements of the common frame) as nodes and with an arrow from state i to state j whenever the (i, j)th entry of the common conditional is positive. (Figure 2.6 shows the state graph for the Markov chain of Figure 2.5.)


FIG. 2.6. The state graph for the Markov chain in Figure 2.5.

In general, we cannot draw a state graph for a belief chain because the successive variables may have different frames. Even if the frames are the same, the possible transitions or at least their probabilities will vary.
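Reading the state graph off the common conditional is a one-line computation; the following fragment, with hypothetical numbers, lists the arrows of the state graph of a three-state chain.

    import numpy as np

    A = np.array([[0.0, 1.0, 0.0],        # common conditional: A[i, j] is the
                  [0.5, 0.0, 0.5],        # probability of moving from state i
                  [0.0, 0.3, 0.7]])       # to state j
    arrows = [(i, j) for i in range(3) for j in range(3) if A[i, j] > 0]
    print(arrows)   # [(0, 1), (1, 0), (1, 2), (2, 1), (2, 2)]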

In recent years, considerable use has been made of belief nets of a type slightly more general than Markov chains—hidden Markov models. To form a hidden Markov model, we begin with a Markov chain, say X_1 → ··· → X_n, and from each node X_i we add an arrow to a new node, say Y_i, so as to obtain a DAG as in Figure 2.7. All the Y_i have the same frame (possibly different from the frame for the X_i) and the same conditional. In applications, the Y_i are observed, while the X_i are not—the Markov chain X_1 → ··· → X_n is hidden. We are interested in finding posterior probabilities for the X_i. We may, for example, want to find the most likely configuration of X_1, ..., X_n. Since the Y_i do not form an initial segment of the belief net, we cannot use Proposition 2.4 to find posterior probabilities for the X_i. But efficient methods for finding posterior probabilities (and for finding most likely configurations) have been developed in the literature on hidden Markov models, and these methods, as it turns out, are special cases of more general methods that we will study in Chapter 3.
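One of those methods, the classical forward recursion, already shows the flavor of the Chapter 3 computations: it sums the hidden variables out one at a time instead of enumerating all their configurations. The sketch below is not from the monograph; the two-state model, the array layout (A[i, j] = P(X_{t+1} = j | X_t = i), B[i, k] = P(Y_t = k | X_t = i)), and the numbers are hypothetical, and the recursion computes only the probability of the observed sequence.

    import numpy as np

    pi = np.array([0.6, 0.4])              # distribution of X1
    A = np.array([[0.7, 0.3],              # transition conditional
                  [0.2, 0.8]])
    B = np.array([[0.9, 0.1],              # observation conditional
                  [0.3, 0.7]])

    def likelihood(pi, A, B, ys):
        # P(Y1 = ys[0], ..., Yn = ys[-1]), summing X1, ..., Xn out one at a time
        alpha = pi * B[:, ys[0]]
        for y in ys[1:]:
            alpha = (alpha @ A) * B[:, y]
        return alpha.sum()

    print(likelihood(pi, A, B, [0, 0, 1]))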

Figure 2.7 represents only the simplest type of hidden Markov model; in practice, the model is elaborated in various ways. One common elaboration involves attaching more than one observable variable to each X_i. There may be a fixed number of observable variables for each X_i, or this number itself may be an observable variable. In speech recognition, for example, each X_i represents


FIG. 2.5. A Markov chain.


FIG. 2.7. A hidden Markov model.

a phoneme, and we observe features of the sound every successive millisecondthat the phoneme lasts. Since the length of a phoneme varies, the number ofobservations will vary; it itself will be an observed variable. Strictly speaking,this takes us outside the framework of the belief net—it even takes us outsidethe multivariate framework. Fortunately, the computational methods needed arenatural extensions of the multivariate methods we will study in Chapter 3.

2.3. Bubble graphs.

Though the visual clarity of belief nets is very attractive, there is no practi-cal reason to limit ourselves to construction sequences involving only one newvariable at a time. All the computational ideas we considered in the precedingsection generalize to the general case, and we can also generalize the graphicalrepresentation itself.

The simplest graphical representation of a general construction sequence isthe bubble graph. This graph has a node for each conditional. This node—calleda bubble—contains all the variables in the head and has an arrow to it from eachvariable in the tail. Figure 2.8 shows a bubble graph for a construction sequencefor ten variables:

A bubble graph is acyclic in the same sense that a DAG is acyclic—we cannotgo in a cycle following the arrows. Moreover, a bubble graph, like a DAG,permits us to pick out alternative construction orderings for the nodes i.e.,alternative construction sequences for the probability distribution. In Figure 2.8,for example, the bubbles can be ordered in seven different ways:

And hence there are seven ways of ordering the conditionals to form a construc-tion sequence:


FIG. 2.8. A bubble graph.

Marginals and posteriors. In the general case, as in the one-new-variable-at-a-time case, we can exploit alternative construction sequences to find priormarginals for initial segments or posterior marginals given initial segments, andwe can propagate forward in chains to find prior marginals for final segments.

The idea of initial segments is defined for bubble graphs just as for DAGs,and Proposition 2.4 continues to hold. Translating this proposition into a di-rect statement about alternative construction sequences, we get the followinggeneralization of Proposition 2.5.

PROPOSITION 2.6. Suppose Q_1, ..., Q_n is a construction sequence for P. Suppose i_1, ..., i_k is a sequence of distinct integers between 1 and n such that t_{i_1} is empty and t_{i_j} is contained in h_{i_1} ∪ ··· ∪ h_{i_{j-1}} for j = 2, ..., k. Write w for h_{i_1} ∪ ··· ∪ h_{i_k}.

1. Q_{i_1}, ..., Q_{i_k} is a construction sequence for P^{↓w}.

2. Suppose c is a configuration of w. Suppose we modify the sequence Q_1, ..., Q_n by deleting each Q_{i_j} and by slicing each of the other conditionals on w = c. Then the result is a construction sequence for P's posterior given w = c.

A construction sequence Q_1, ..., Q_n is a construction chain if each t_i is contained in h_{i-1} for i = 2, ..., n. Figure 2.9 shows a bubble graph for a construction chain: the bubbles are ordered, and each bubble has arrows only from variables in the preceding bubble.

Lemma 2.3 generalizes as follows.

LEMMA 2.4. Suppose Q_1, ..., Q_n is a construction chain. Then

    P^{↓(h_2 ∪ ··· ∪ h_n)} = (Q_1 Q_2)^{↓h_2} Q_3 ··· Q_n.

Thus (Q_1 Q_2)^{↓h_2}, Q_3, ..., Q_n is a construction chain for the marginal on h_2 ∪ ··· ∪ h_n.

Forward propagation proceeds, based on this lemma, just as in the one-new-variable-at-a-time case: from the sequence R_i, Q_{i+1}, ..., Q_n for the marginal on h_i ∪ ··· ∪ h_n, we obtain the sequence R_{i+1}, Q_{i+2}, ..., Q_n for the marginal on h_{i+1} ∪ ··· ∪ h_n by setting R_{i+1} = (R_i Q_{i+1})^{↓h_{i+1}}.

FIG. 2.9. A bubble graph for a chain.

FIG. 2.10. The join chain for Figure 2.9.

Figure 2.10 shows an alternative to the bubble graph in Figure 2.9. Here, instead of showing arrows from the individual variables, we put these variables in the following bubble. They can still be identified; they constitute the intersection of the two bubbles. A graph of the type shown in Figure 2.10 is called a join chain. It has the property that the variables that a given node has in common with any of the preceding nodes are all in the immediately preceding node. In the next chapter, we will generalize the idea of a join chain to the idea of a join tree.

A more difficult example. For a concrete example of a construction sequence for which we cannot so easily find the marginals we want, consider the external audit of an organization's financial statement. Figure 2.11 sketches, in a simplified form, the structure of the evidence in one such audit. The auditor is concerned with the accounts receivable, and she has distinguished between the accounts receivable not allowing for bad debts and the net accounts receivable, which do allow for bad debts. The accounts receivable are fairly stated only if they are complete, properly classified, and properly valued. The auditor has obtained evidence for completeness by tracing a sample from a subsidiary ledger. Customer confirmations have provided evidence that the accounts are properly classified and properly valued. In addition, the auditor's assessment of the internal accounting system ("review of the environment") provides evidence for the accounts receivable being correct, and her assessment of the state of the economy ("analytic review") provides evidence for the adequacy of the allowance for bad debts.

The bubble graph in Figure 2.12 depicts a probability model for the situationdescribed by Figure 2.11. Using the abbreviations indicated in Figure 2.13, wewrite


FIG. 2.11. The audit evidence.

Each abbreviation represents a variable corresponding to an assertion or item of evidence shown in Figure 2.11. The variable N, for example, might be a binary variable indicating whether the net accounts receivable are fairly stated (N = 1) or not (N = 0).

The auditor's evidence consists of observed values of the variables E, R, T, and CC, which we may designate by corresponding lower case letters. We are interested in the posterior distribution of the remaining variables given these observations, and according to equation (1.18) in Proposition 1.6, this is proportional to the function obtained by substituting the observations in the right-hand side of equation (2.11):

We are particularly interested in the marginal of this posterior for the variableN, which corresponds to an overall judgment that the financial statement is fairlystated. Since the observed variables do not form an initial segment of the bubblegraph, we cannot find this marginal using the methods we have studied in thischapter. Instead, we must use the methods of the next chapter, which apply toarbitrary factorizations.

2.4. Other graphical representations.

There are a number of alternatives to the bubble graph for representing the head-tail structure of construction sequences, including chain graphs (Wermuth and Lauritzen [50]) and valuation networks (Shenoy [47]). Figure 2.14 shows a chain graph and Figure 2.15 shows a valuation network corresponding to the bubble graph of Figure 2.12. Both types of graph have uses beyond that of representing construction sequences. In the chain graph for a construction sequence, all the


FIG. 2.12. A bubble graph for the audit.

FIG. 2.13. Variables for the construction sequence.

variables in each head are linked with each other, but by omitting some of these links, we can represent additional conditional independence relations. By varying the shape of the relational nodes and their arrows in a valuation network, we can represent a wide variety of relations.

Another more complex graphical representation has been developed by Heck-erman [30] under the name similarity network. A similarity network is a tool forknowledge acquisition; it allows someone constructing a probability distributionto allow certain variables in a construction sequence to be sufficient for othervariables given some values for earlier variables but not given other values forthese earlier variables.

Exercises.

EXERCISE 2.1. The idea of a construction sequence for a probability distri-bution generalizes to the idea of a construction sequence for a conditional. In


FIG. 2.14. A chain graph for the audit.

FIG. 2.15. A valuation network for the audit.

this generalization, we no longer require that the first tail be empty and that each new tail be contained in the existing domain. We require only that each new head be disjoint from the existing domain.

Consider first two conditionals Q_1 and Q_2. Under the hypothesis that h_2 is disjoint from d_1 (Figure 2.16), prove the following statements:

1. The product Q_1 Q_2 is a conditional with head h_1 ∪ h_2 and domain d_1 ∪ d_2.

2. The product Q_1 1_{t_2} is Q_1 Q_2's marginal on d_1 ∪ t_2.

3. The conditional Q_2 continues Q_1 Q_2 from d_1 ∪ t_2 to d_1 ∪ d_2.

Then consider a sequence of conditionals Q_1, ..., Q_n. Under the hypothesis that h_i is disjoint from d_1 ∪ ··· ∪ d_{i-1} for i = 2, ..., n, prove the following statements:

1. The product Q_1 ··· Q_n is a conditional with head h_1 ∪ ··· ∪ h_n and domain d_1 ∪ ··· ∪ d_n.

2. For i = 2, ..., n, Q_1 ··· Q_{i-1} 1_{(d_1 ∪ ··· ∪ d_n) \ (h_i ∪ ··· ∪ h_n)} is the marginal of Q_1 ··· Q_n on (d_1 ∪ ··· ∪ d_n) \ (h_i ∪ ··· ∪ h_n).


FIG. 2.16. Here we ask only that the second head be disjoint from the first domain.

FIG. 2.17. The "and" structure of the audit.

3. For i = 2, ..., n, the conditional Q_i continues Q_1 ··· Q_n from (d_1 ∪ ··· ∪ d_n) \ (h_i ∪ ··· ∪ h_n) to (d_1 ∪ ··· ∪ d_n) \ (h_{i+1} ∪ ··· ∪ h_n).

4. More generally, if 1 < i ≤ j ≤ n, then the product Q_i ··· Q_j continues Q_1 ··· Q_n from (d_1 ∪ ··· ∪ d_n) \ (h_i ∪ ··· ∪ h_n) to (d_1 ∪ ··· ∪ d_n) \ (h_{j+1} ∪ ··· ∪ h_n).

When h_i is disjoint from d_1 ∪ ··· ∪ d_{i-1} for i = 2, ..., n, we say that Q_1, ..., Q_n is a construction sequence for the conditional Q_1 ··· Q_n. Notice that any subsequence of a construction sequence is itself a construction sequence.

EXERCISE 2.2. Discuss how the idea of a state graph for a Markov chain canbe generalized so as to apply to more general belief chains.

EXERCISE 2.3. Devise graphical representations for hidden Markov modelsin which the number of observed variables attached to a node in the Markovchain is itself an observed variable.

EXERCISE 2.4. The basic graph in Figure 2.11 can be interpreted as an "and graph": N = 1 if and only if A = 1 and B = 1, and A = 1 if and only if C = 1, PC = 1, and PV = 1. This suggests arrows pointing the other way, as in Figure 2.17. Show that the marginal on {N, A, B, C, PC, PV} of a probability distribution of the form provided by equation (2.11) will not, in general, be represented by the DAG in Figure 2.17.

EXERCISE 2.5. The conditionals involving a particular set of variables formonly a partial commutative semigroup, since products and marginals are not al-ways conditionals.

Generalize the axioms of transitivity and combination you formulated in Ex-ercise 1.5 to the case where the semigroup may be only partial. Consider alsothe case where labels are binary—head and tail.


CHAPTER 3

Propagation in Join Trees

In this chapter, we study the problem, which we encountered in the preceding chapter, of computing marginals of a function given as a product of tables on different sets of variables, say

    f = f_1 f_2 ··· f_k,    (3.1)

where f_i is a table on x_i. We want to compute f's marginal on a particular variable X, on one of the sets x_i, or on some other set x of variables. The frame of all the variables, Ω_{∪x_i}, is too large for us to compute the table f and then sum variables out of this table. So our task is to compute marginals for f without computing f itself.

The approach we take in this chapter is the obvious one: we exploit the factorization as we sum variables out. We sum variables out one at a time, and we deal each time only with factors that involve the variable we are summing out; the others we factor out of the summation. Each step produces a new product of the same form as the right-hand side of equation (3.1), possibly involving some larger clusters of variables (when we sum Y out, we must multiply together all the f_i involving Y, and the resulting cluster may be large even after Y is removed). The next step must deal with these larger clusters, but with luck and a good choice of the order in which we sum variables out, we may be able to compute a given marginal without encountering a prohibitively large cluster.
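The step just described can be written down directly. The following sketch is not from the monograph; it assumes a hypothetical representation in which a factor is a pair (vars, table), vars being a tuple of variable names and table a dictionary from configurations (tuples of values, one per variable) to numbers, with the frames given separately.

    from itertools import product

    def sum_out(variable, factors, frames):
        # Multiply together the factors that involve `variable`, sum it out,
        # and return the new factor list; the other factors are left alone.
        touching = [f for f in factors if variable in f[0]]
        rest = [f for f in factors if variable not in f[0]]
        # the cluster created by this step, minus the variable summed out
        new_vars = tuple(sorted({v for vs, _ in touching for v in vs} - {variable}))
        new_table = {}
        for cfg in product(*(frames[v] for v in new_vars)):
            assignment = dict(zip(new_vars, cfg))
            total = 0.0
            for value in frames[variable]:
                assignment[variable] = value
                term = 1.0
                for vs, table in touching:
                    term *= table[tuple(assignment[v] for v in vs)]
                total += term
            new_table[cfg] = total
        return rest + [(new_vars, new_table)]

    # A tiny example with two binary variables and two tables.
    frames = {"X1": [0, 1], "X2": [0, 1]}
    f1 = (("X1",), {(0,): 0.3, (1,): 0.7})
    f2 = (("X1", "X2"), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8})
    print(sum_out("X1", [f1, f2], frames))   # a single factor on ("X2",)

Calling sum_out once for each variable to be eliminated carries out exactly the variable-by-variable summing out described in section 3.1.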

As it turns out, this variable-by-variable summing out produces a join tree,and the process can be understood directly in terms of the join tree. A join treeis a tree with clusters of variables as nodes, with the property that any variablein two nodes is also in any node on the path between the two (equivalently, thenodes containing any particular variable are connected). The join tree producedby summing variables out in a given order has the clusters produced by thesumming out as its nodes, and each summing out can be thought of in terms ofa message passed (or "propagated") from one node to a neighbor in this tree.7

7The name "join tree" was coined in the theory of relational databases in the early 1980s(Beeri et al. [22]). An alternative, "junction tree," is also current in the literature on beliefnets.


There are a number of ways to arrange the details of propagation in a jointree. We can sum out more than one variable at a time. We can carry outa multiplication after each summing out, or we can leave the multiplicationsuntil they are required for a new summing out. In some cases, we can re-duce the number of multiplications by judicious divisions. Thus we can distin-guish different architectures for join-tree marginalization. In this chapter, westudy four: the elementary, Shafer-Shenoy, Lauritzen-Spiegelhalter, and Aal-borg architectures. The elementary architecture produces the marginal for asingle node of the join tree. The other architectures produce marginals for allnodes of the tree. The Shafer-Shenoy architecture achieves this by storing theresults of each summing out so they can be used for propagation in any di-rection. This architecture is very general; it applies not only to the problemwe study in this chapter but also to other problems of recursive computationinvolving unrestricted combination and marginalization operations that satisfythe transitivity and combination axioms. It is somewhat wasteful, however, inits appetite for multiplication. The Lauritzen Spiegelhalter and Aalborg archi-tectures eliminate some of the multiplication by substituting a smaller number ofdivisions.

If we are concerned only with calculating marginals of factored probabilitydistributions, the Aalborg architecture is the architecture of choice. Moreover,the Aalborg architecture handles new evidence quite flexibly. Once it has com-puted marginals for given observations, it can adjust the marginal for a particularvariable X after the further observation of a variable Y using only the part ofthe join tree that lies between X and Y. But the alternative architectures comeinto play for a wide variety of collateral problems that do not, for one reasonor another, satisfy all the assumptions made by the Aalborg architecture. Forexample, when observations are subject to retraction, the Aalborg architecturecannot be used because it does not retain the original inputs; Jensen [32] resortsto the Shafer-Shenoy architecture in this case.

The methods of this chapter require only that the function f be given as a product of tables; it need not be a probability distribution, and even if it is, the tables need not be conditionals. (In the case of the elementary and Shafer-Shenoy architectures, they can even have negative entries.) But we are most interested in the case where f is equal or proportional to a probability distribution. If f is only proportional to a probability distribution P, it is usually the marginals of P, not the marginals of f, that we want, but most of the work will be in finding the marginals of f; we can obtain P's marginals from f's by equation (1.2).

As noted in the preface and in the exercises at the end of this chapter, join-tree computation is much broader and older than the problem of finding marginalposterior probabilities in probabilistic expert systems. In fact, techniques similarto each of the architectures studied in this chapter have been applied to a varietyof problems in applied mathematics and operations research. Perhaps the oldestsuch problem is that of solving a "sparse" set of linear equations—one in whichonly a few variables appear in each equation. Other examples include the four-color problem, dynamic programming, and constraint propagation (Diestel [2]).


The feasibility and efficiency of join-tree computation depends, of course, on the nodes of the tree being sufficiently small. In the case of probability propagation, they must be small enough that multiplication and marginalization within nodes is inexpensive. Roughly speaking, this means that the sum of the frame sizes must be small, or even more roughly, that the largest frame must be small. Finding a join tree that achieves either of these minima exactly is an NP-complete problem, but it is known that such minima are always achieved by join trees that are produced by summing variables out in some order (Mellouli [39]). Moreover, there are good heuristics for finding reasonable join trees if they exist (Kong [36], Kjærulff [35]).

3.1. Variable-by-variable summing out.

A simple example will suffice to show how variable-by-variable summing outproduces a join tree and how the summing out can be interpreted as message-passing in this join tree.

Here is a function on seven variables given as a product of five tables:

The clusters of variables involved in the tables are shown in Panel 1 of Figure 3.1.Let us imagine summing the variables out in the reverse of the order in whichthey are numbered, keeping track as we go of the new clusters we create.

Summing X_7 out yields

where we have written f_2^7(X_5) for Σ_{X_7} f_2(X_5, X_7). The clusters in this new factorization are shown in Panel 2. Above them, we have begun to construct a join tree by drawing a node representing the variables involved in the summation, X_5 and X_7. We temporarily link this node to the single variable X_5, which is the only variable involved in the new table resulting from the summation.

Next, we sum X_6 out, obtaining


The result is shown in Panel 3, where we have added to the join-tree-to-be asecond node consisting of the variables involved in the summation on this step.We have linked the new node to the cluster of variables involved in the new tableresulting from the summation.

The next step, which produces Panel 4, is more interesting. Here we sum X_5 out, obtaining

FIG. 3.1. Constructing the join tree.

Page 49: Shafer - Probabilistic Expert Systems

We again add a node consisting of the variables involved in the summation. Weremove the clusters for tables absorbed in the summation, replacing them with asingle cluster for the new table resulting from the summation. One node alreadyin the picture was linked to a cluster removed from the list; it is now linked tothe new node.

The reader can write down the formulas for the remaining steps, which arerepresented by Panels 5-8. At each step, we pull out from our product the factorsinvolving the variable we are summing out, multiply them together, perform thesummation, and give a new name to the resulting table (our system for namingidentifies the original tables involved in the subscript and the variables summedout in the superscript, but this is of no importance). We add to our picture a noderepresenting the variables involved in the summation. We remove from the listall the clusters corresponding to tables absorbed into the summation, replacingthem with the single cluster for the new table resulting from the summation—this is the union of the clusters removed minus the variable summed out. Welink the node created to the cluster added. When a linked cluster is removedfrom the list, the link is inherited by the new node that absorbs it.

The final result in Panel 8 is indeed a join tree. It is a tree with sets fornodes, and whenever a variable is contained in two nodes, it is also contained inall the nodes on the path joining the two. For example, the variable 2, which iscontained in both 23 and 1245, is also contained in the two nodes between them,12 and 124.
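The defining property is easy to test mechanically. The sketch below, which is not from the monograph and uses hypothetical names, checks that in a given tree of clusters the nodes containing each variable are connected.

    from collections import deque

    def is_join_tree(nodes, edges):
        # nodes: a list of frozensets of variables; edges: pairs of node indices.
        neighbors = {i: set() for i in range(len(nodes))}
        for i, j in edges:
            neighbors[i].add(j)
            neighbors[j].add(i)
        for v in set().union(*nodes):
            holding = {i for i, node in enumerate(nodes) if v in node}
            start = next(iter(holding))
            seen, queue = {start}, deque([start])
            while queue:                      # breadth-first search that stays
                i = queue.popleft()           # inside the nodes containing v
                for j in neighbors[i] - seen:
                    if j in holding:
                        seen.add(j)
                        queue.append(j)
            if seen != holding:
                return False
        return True

    # A three-node chain that is a join tree, and one that is not.
    print(is_join_tree([frozenset({1, 2}), frozenset({2, 3}), frozenset({3, 4})],
                       [(0, 1), (1, 2)]))   # True
    print(is_join_tree([frozenset({1, 2}), frozenset({3}), frozenset({2, 4})],
                       [(0, 1), (1, 2)]))   # False: the nodes containing 2 are separated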

Though we have worked in terms of an example, we have spelled out a generalalgorithm. This algorithm applies to any product of tables and to any order forsumming the variables out of such a product. It identifies the clusters involved inthe variable-by-variable summing out, and it arranges these clusters in a graph.Is this graph always a join tree?

Certainly the graph is always a tree—i.e., it is always connected and acyclic.We introduce the nodes in a sequence. Each node except the last is linked withsome later node, so the graph is connected. (Since we can follow the links fromany node to the last node, we can follow them from one node to the last nodeand then back to any other node we please.) Each node is linked with only onelater node, so there cannot be any cycles. (If there were a cycle, the earliestnode in it would have to be linked with two later nodes.)

To see that the tree is always a join tree, consider Figure 3.2, where the linkshave become arrows pointing from old to new nodes, and each arrow is labeledwith the variable that was summed out when the node from which the arrow


FIG. 3.2. The join tree with arrows to the root.

comes was created. The node to which an arrow points always includes all the variables in the node from which the arrow comes, except the variable that was summed out. For any particular variable X, any node n containing X must be connected to the node n' created when X is summed out, because the tables created as we go downward from n continue to contain X until it is summed out. It follows that all the nodes containing X are connected in the tree (i.e., they form a subtree), and this is equivalent to the tree being a join tree.

The join tree that we construct in this way is interesting because it can be interpreted as a picture of the computations involved in the variable-by-variable summing out. We interpret a node x as a register that can store a table for its variables, and we interpret an arrow from x to y as an instruction to sum out a variable from x's table and multiply y's table by the result.

We begin by putting tables in the storage registers; in Figure 3.2, for example, we put the table f_1 in 23, the table f_2 in 57, the product f_3 f_4 in 1234, and the table f_5 in 146. We put tables of ones in the other three nodes. The number beside each arrow tells us which variable to sum out of the table in the node preceding the arrow. Figure 3.3 shows the summations we perform when we follow these instructions.

We summed the variables out in the reverse of the order in which they werenumbered: 7, 6, 5, 4, 3, 2. Figures 3.2 and 3.3 make it clear, however, thatthis order can be varied to some extent without changing the join tree or thecomputations performed. The only constraint is that we sum out of a given nodeonly after the node has absorbed messages from all nodes with arrows pointingto it. Only the three nodes 23, 57, and 146 can begin the computation, 1245 canact after 57, 124 can act after 1245 and 146, and so on.

We do not need the numbers beside the arrows in Figure 3.2. These numberstell us which variable to sum out, but we can also find this information bycomparing the node sending the message to the node receiving it. The senderalways sums out the variable it has that its neighbor does not have. In otherwords, it marginalizes to its intersection with the neighbor.

The final result of the computation is f^{↓X_1}, the marginal of f for X_1. If we continue by summing X_1 out of this table, then we obtain f^{↓∅}, the marginal of f on the empty set. Figure 3.2 can be extended to include this final summation; we simply add ∅ as a node, with an arrow to it from 1.


FIG. 3.3. The successive summations.

3.2. The elementary architecture.

Marginalization in join trees can be understood directly, without any referenceto an ordering of the variables. If we place tables in the nodes of an arbitraryjoin tree and propagate to a root following the algorithm just described, thenthe final table on the root will always be the marginal on the root of the productof the initial tables. It is not necessary that the join tree or the placement of thetables should have been determined by an ordering of the variables.

In this section, we will spell out the marginalization algorithm in terms of anarbitrary join tree. Then we will prove, using only the transitivity and combi-nation axioms, that the algorithm always produces the marginal on the root.

Before beginning the algorithm, we place in each node x of the join tree a table on x, say φ_x. We write φ for the product of the φ_x; φ = ∏_{x∈N} φ_x, where N is the set consisting of all the nodes in the tree. The purpose of the algorithm is to find the marginal φ^{↓r} for a particular node r, which we call the root of the tree.

To begin the algorithm, we make all the links in the tree into arrows in thedirection of r. (Each node other than r will then have exactly one arrow outward,pointing to its unique neighbor in the direction of r.) Then we have each nodepass a message to its neighbor nearer r according to the rules we learned inFigure 3.1:


Rule 1. Each node waits to send its message to its neighbor nearerto r until it has received messages from all its other neighbors.

Rule 2. When a node is ready to send its message, it computes themessage by summing out of its current table any variables it has butthe neighbor to whom it is sending the message does not have. (Thiswas always a single variable in Figure 3.1, but it could be severalvariables or none.) In other words, it marginalizes its current tableto its intersection with the neighbor.

Rule 3. When a node receives a message, it replaces its current tablewith the product of that table and the message.

Eventually, all the nodes except r will have sent messages, and r will have re-ceived a message from each of its neighbors and will have multiplied its originaltable by all these messages.
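The three rules translate almost line for line into code. The sketch below is not from the monograph; it uses hypothetical names, represents each node as a tuple of variable names, and stores each node's table as a dictionary from configurations (tuples of values in the node's variable order) to numbers.

    def marginalize(table, vars, keep):
        # sum out of `table` every variable not in `keep` (Rule 2)
        out, idx = {}, [vars.index(v) for v in keep]
        for cfg, p in table.items():
            key = tuple(cfg[i] for i in idx)
            out[key] = out.get(key, 0.0) + p
        return out

    def absorb(table, vars, message, msg_vars):
        # multiply `table` by an incoming message (Rule 3)
        idx = [vars.index(v) for v in msg_vars]
        return {cfg: p * message[tuple(cfg[i] for i in idx)]
                for cfg, p in table.items()}

    def propagate_to_root(nodes, edges, tables, root):
        neighbors = {i: {j for e in edges for j in e if i in e and j != i}
                     for i in range(len(nodes))}
        order, seen = [], {root}
        def collect(i):                      # leaves first, then inward (Rule 1)
            for j in neighbors[i]:
                if j not in seen:
                    seen.add(j)
                    collect(j)
                    order.append((j, i))     # j sends its message to i
        collect(root)
        for sender, receiver in order:
            common = tuple(v for v in nodes[sender] if v in nodes[receiver])
            message = marginalize(tables[sender], nodes[sender], common)
            tables[receiver] = absorb(tables[receiver], nodes[receiver],
                                      message, common)
        return tables[root]                  # the marginal of the product on root

    # Two nodes {X, Y} and {Y, Z}, with {Y, Z} as the root.
    nodes = [("X", "Y"), ("Y", "Z")]
    tables = [{(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.2, (1, 1): 0.8},
              {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.3, (1, 1): 0.7}]
    print(propagate_to_root(nodes, {frozenset({0, 1})}, tables, root=1))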

Here is the proposition we need to prove.

PROPOSITION 3.1. At the end of the algorithm just described, the table on r will be φ^{↓r}, the marginal on r of the product of the initial tables.

Proof. Imagine for the moment that the nodes are peeled away from the join tree as they send their messages, so that in the end only r remains. Thus a single step of the algorithm consists of three parts: (1) a node t computes the marginal of its table to b ∩ t, (2) the neighbor b multiplies this marginal into its current table, and (3) the node t is removed from the tree. This allows us to state the following lemma.

LEMMA 3.1. After each step, the product of the tables that remain is themarginal to the variables that remain of the product of the tables before the step.

To see that Lemma 3.1 is true, write N_1 for the set of nodes in the tree before the step, N_2 for the set of nodes in the tree after the step, and ψ_x for the table in node x before the step. Thus the product of the tables before the step is ∏_{x∈N_1} ψ_x, and the product of the tables after the step is (∏_{x∈N_2} ψ_x) ψ_t^{↓b∩t} (see Figure 3.4). Since the tree is a join tree, b ∩ t = (∪N_2) ∩ t. So we find, using the combination axiom, that

    (∏_{x∈N_1} ψ_x)^{↓∪N_2} = (∏_{x∈N_2} ψ_x) ψ_t^{↓b∩t},

which is a restatement of Lemma 3.1. Lemma 3.1, together with the transitivity axiom, yields the next lemma.

LEMMA 3.2. After each step, the product of the tables that remain is themarginal to the variables that remain of the product of the initial tables.


FIG. 3.4. The loaded join tree before and after t sends its inward message to b.

At the end of the algorithm, we have only one table, the table on the root,and so we obtain Proposition 3.1 as a special case of Lemma 3.2.

We can gain some further insight into the algorithm by noting that when a node b receives a message from a neighbor t, it is also receiving, indirectly, information from the nodes on the other side of t. After any step (message-passing and multiplication) in the algorithm, we can identify the nodes from which a given node b has received information, either directly or indirectly. These nodes, together with b itself, form a subtree, which we may call b's information branch at that point (see Figure 3.5). The steps we have taken within this subtree are the same as the steps we would have taken had we implemented the algorithm on it alone, with b as the root. So as a corollary of Proposition 3.1, we have the following proposition.

PROPOSITION 3.2. After each step, the table on a given node b will be themarginal on b of the product of the initial tables in b's current information branch.

This is a generalization of Proposition 3.1, because at the end of the algo-rithm, the root's information branch is the whole tree.

In the course of explaining our algorithm, we have found ourselves talkingabout the nodes of the join tree as storage registers and even as individualprocessors. Each node can store tables for a certain set of variables, multiplysuch tables, and marginalize them. In effect, we have made the join tree, togetherwith the algorithm, into an architecture for marginalization. We call it theelementary architecture. In the next few sections, we consider some alternativearchitectures, based on the same join tree, that are able to compute marginalsfor all the nodes, not merely for a single root node.

Join-tree architectures are potentially applicable to any instance of the gen-eral problem of computing marginals of a function given as a product of tables,as in equation (3.1), but in order to apply a join-tree architecture to such a prob-lem, we first find a join tree that covers the product, one that includes for eachfactor a node containing the domain of that factor. (If we want the marginal for a


FIG. 3.5. The dashed arrows are those over which messages have already been sent. The circled subtree is b's information branch at this point.

cluster of variables that is not the domain of one of the factors, then we must make sure that the join tree also has a node containing this cluster.) Once we have such a join tree, we place each factor in a node containing its domain. If a node x receives more than one factor, we multiply them together, and we also multiply by 1_x if necessary in order to obtain a table that involves all the variables in x. If a node x does not receive a factor, we simply assign it the table 1_x.

If the join tree has more than one node containing the domain of a particularfactor, we can put the factor in whichever of these nodes we please. In Figure 3.2,for example, we have two different nodes that can accept a table on 124. Tominimize computation, we should choose the node with the smaller frame size,but this is a minor consideration.

The choice of the join tree is much more important. We want a join-treecover with nodes small enough to permit computation. If such a join-tree coverdoes not exist, we will have to turn to alternative methods for marginalization,such as Markov-chain Monte Carlo.

As we noted at the beginning of the chapter, there are heuristics that doproduce reasonable choices for join-tree covers. Some of these heuristics doinvolve choosing an order for eliminating (summing out) the variables. This notonly produces a join-tree cover; it also determines a placement of the factors inthe join tree—each factor goes as close as possible to the root.

3.3. The Shafer-Shenoy architecture.

The elementary architecture allows us to find the marginal for an arbitrary rootof a join tree. If we then want to find the marginal for another node, we canuse the same join tree, but we must repeat the algorithm using the new node


FIG. 3.6. The partial Shafer-Shenoy architecture. Like the elementary architecture, it finds the marginal for a single root node. In each separator, we have indicated the set of variables involved in the messages that will be stored there; this is always the intersection of the two neighboring nodes.

as the root. This usually involves a great deal of duplication. In Figure 3.4, forexample, most of the steps for computing the marginal on w will be the same asthose for computing the marginal on r.

The Shafer- Shenoy architecture provides one way to eliminate much of thisduplication. In this architecture, each node sends messages in all directions. Itis allowed to send its message to a particular neighbor as soon as it has messagesfrom all its other neighbors. In order that the computations for a message in onedirection should not interfere with those for a message in another direction, anode no longer replaces its table each time it receives a message. Instead, it keepsits initial table, stores the incoming messages, and performs multiplications onlyas needed for computing outgoing messages.

As a first step in describing the Shafer-Shenoy architecture, we will describe a partial version, in which, as in the elementary architecture, messages are propagated only to a single root r. Figure 3.6 shows this partial architecture. The squares on the arrows in this figure are called separators; they contain storage registers for storing the messages sent in the direction of the arrows. As in the elementary architecture, we begin with a table φ_x on each node x and we want to find φ^{↓r} for a particular node r, where φ is the product of the φ_x. The storage registers in the separators are initially empty.

Here are the rules for propagation in the partial Shafer-Shenoy architecture:

Rule 1. Each node waits to send its message to its neighbor nearer to r until it has received messages from all its other neighbors. (More precisely, it waits until messages have been received by the separators between it and these other neighbors.)


Rule 2. When a node is ready to send its message to its neighbornearer r (or, more precisely, to the separator between it and its neigh-bor nearer r), it computes the message by collecting all its messagesfrom neighbors farther from r, multiplying its own table by thesemessages, and marginalizing the product to its intersection with theneighbor nearer r.

Rule 1 is the same as in the elementary architecture. Here, however, the messagesare intercepted by the separators, where they are stored until they are collectedin accordance with Rule 2. Rule 3, which provides for changing the tables onnodes, has been omitted. In this architecture, propagation only has the effect offilling the storage registers in the separators. It does not change the tables onthe nodes.

Since the rules for message-passing are the same in the partial Shafer-Shenoyarchitecture as in the elementary architecture, the course of the propagation andthe messages sent will be the same. At the end of the propagation, the root rwill have a message from each neighbor stored in the separator it shares withthat neighbor. Thus we have the following proposition.

PROPOSITION 3.3. At the end of the partial Shafer-Shenoy propagation, we can get φ^{↓r} by collecting all of r's incoming messages and multiplying r's table by them.

The full Shafer-Shenoy architecture extends the partial architecture byputting two storage registers in each separator, one for a message in each direc-tion, as in Figure 3.7. Each node sends messages to all its neighbors, followingthese rules:

Rule 1. Each node waits to send its message to a given neighbor untilit has received messages from all its other neighbors.

Rule 2. When a node is ready to send its message to a particularneighbor, it computes the message by collecting all its messages fromother neighbors, multiplying its own table by these messages, andmarginalizing the product to its intersection with the neighbor towhom it is sending.

Here, as in the partial architecture, the tables on the nodes do not change. At the end of the propagation, each node x still has its initial table φ_x. The only effect of the propagation is to fill all the storage registers in the separators.

A comparison of the rules for the full and partial architectures makes it clearthat the full architecture produces the same messages towards any particularnode as the partial architecture with that node as root. So once we have com-pleted the propagation in the full architecture, we can find the marginal forany particular node by collecting all its incoming messages and multiplying thenode's table by them.

PROPOSITION 3.4. At the end of the full Shafer-Shenoy propagation, we can get φ^{↓x} for any node x by collecting all of x's incoming messages and multiplying x's table by them.
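The full architecture can be sketched in a few lines of code. As before, the sketch is not from the monograph and its names are hypothetical; nodes are tuples of variable names, tables are dictionaries from configurations to numbers, and the marginalize and absorb helpers are repeated from the earlier sketch of the elementary architecture so that this fragment stands alone. Each ordered pair of neighbors gets one storage register, and the messages are computed lazily, which respects Rule 1 automatically.

    def marginalize(table, vars, keep):
        out, idx = {}, [vars.index(v) for v in keep]
        for cfg, p in table.items():
            key = tuple(cfg[i] for i in idx)
            out[key] = out.get(key, 0.0) + p
        return out

    def absorb(table, vars, message, msg_vars):
        idx = [vars.index(v) for v in msg_vars]
        return {cfg: p * message[tuple(cfg[i] for i in idx)]
                for cfg, p in table.items()}

    def shafer_shenoy(nodes, neighbors, tables):
        registers = {}                       # two registers per separator
        def message(n, x):
            # Rule 2: n's own table times the messages from its other
            # neighbors, marginalized to the intersection of n and x.
            if (n, x) not in registers:
                t = tables[n]
                for m in neighbors[n] - {x}:
                    common = tuple(v for v in nodes[n] if v in nodes[m])
                    t = absorb(t, nodes[n], message(m, n), common)
                keep = tuple(v for v in nodes[n] if v in nodes[x])
                registers[(n, x)] = marginalize(t, nodes[n], keep)
            return registers[(n, x)]
        def marginal(x):
            # Proposition 3.4: x's table times all of x's incoming messages.
            t = tables[x]
            for n in neighbors[x]:
                common = tuple(v for v in nodes[x] if v in nodes[n])
                t = absorb(t, nodes[x], message(n, x), common)
            return t
        return message, marginal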


FIG. 3.7. The full Shafer-Shenoy architecture. The arrow in each storage register indicates the direction of the message to be stored there.

We will find it useful, when we compare the Shafer-Shenoy architecture to other architectures, to express its computations in formulas. Let us write m_{n→x} for the Shafer-Shenoy message to x from neighbor n. Then Rule 2 says that the message from x to neighbor w is given by

    m_{x→w} = (φ_x ∏_{n ∈ N_x \ {w}} m_{n→x})^{↓x∩w},    (3.2)

where N_x consists of x's neighbors.

Because of Rule 1, the computation must begin with the leaves, the nodes that have only one neighbor. In Figure 3.7, for example, the leaves are 1, 23, 57, and 146. Any of these leaves can begin, and the message they send is the only message they send in the course of the computation. The situation for the other nodes is more complicated. Node 12, for example, can send a message to 124 as soon as it has heard from leaves 1 and 23, but it must then wait to hear back from 124 before it can send messages back to 1 and 23.

Figure 3.8 shows one sequence in which messages might be sent in the archi-tecture of Figure 3.7. The messages first move inward to the node 1 and thenback outward again. The inward pass is identical to propagation to 1 in thepartial Shafer-Shenoy architecture of Figure 3.6.


FIG. 3.8. One order in which messages might be sent in the full Shafer-Shenoy architec-ture.

If the computations are performed serially, there will necessarily be one node,such as 1 in Figure 3.8, that is the first to receive messages from all its neighbors.This node can be considered the root. The propagation consists of a pass inwardto the root and another pass back outward. It is not necessary, however, tospecify the root in advance. If the computations are performed in parallel (apossibility suggested when we talk as if the nodes were individual processors),then which node is the first to receive all its messages will depend on the paceof the computations for the different nodes farther out in the tree, and it is evenpossible that two nodes will tie for first. This happens in Figure 3.9, wherethe computations proceed in parallel and in synchrony, and 124 and 12 receivemessages from each other simultaneously on the third step of the computation.


FIG. 3.9. An example of parallel computation.

By comparing Figures 3.6 and 3.8, we can understand better why the Shafer-Shenoy architecture stores so many messages. The elementary architecture uses and discards each message when it is sent. But what would happen if we were to follow the inward pass of the elementary architecture with an outward pass? In the case of Figures 3.6 and 3.8, this means that after 1 absorbed the message from 12, it would send a message back to 12. By the usual rule, the message back would simply be its current table, which was obtained by multiplying its original table by the message (no marginalization is needed, because the intersection of 1 with 12 is simply 1). Intuitively, this is wrong, because it forces 12 to absorb again the message it just sent, effectively counting it twice. The Shafer-Shenoy architecture sends instead only the original table, uncontaminated with the message from 12. It is able to do this because it has kept both its original table and the message. The same thing happens at each further step on the outward pass. Node 12, for example, since it still has both its original table and the messages from 23 and 1, is able to send a message back to 124 that is not contaminated with the message it received from 124.

Roughly speaking, the Shafer-Shenoy architecture computes marginals for all the nodes at about three times the price for a single marginal. We double the computation because we compute two messages instead of one for each link,


and then we increase it by about the same amount again when we do the finalmultiplications to get the marginal for each node. This contrasts with repeat-ing the elementary architecture for each node, which multiplies the amount ofcomputation for a single marginal by the number of nodes.

Unfortunately, the Shafer-Shenoy architecture is still rather wasteful in its demand for multiplication. Each node computes a message for each of its neighbors only once (in contrast to what happens if we use the elementary architecture over and over), but the multiplication a node performs to compute the message to one neighbor still duplicates much of the multiplication it performs to compute the message to another. In Figure 3.7, for example, node 124 will multiply its original table by the message from 1245 once when it sends its message to 146 and again when it sends a message to 12. With yet more storage, we could reduce this remaining duplication somewhat, but it is more effective to take another tack. Instead of trying to keep the message a node sends on the inward pass from being included in the message it gets back, we can allow for the message's later return by dividing it out of the node's current table as it is sent. This is the tack taken by the Lauritzen-Spiegelhalter architecture.

3.4. The Lauritzen-Spiegelhalter architecture.

The Lauritzen-Spiegelhalter architecture explicitly designates a particular noder as the root of the propagation. It does not use separators. It begins with apass inward to r that duplicates the elementary architecture, except that whena node sends a message, it divides its own table by that message. It then followswith a pass outward from r, during which it follows the elementary architecture'srule for propagation, without the division. This is illustrated by Figure 3.10.

Here is a precise statement of the rules for the inward pass.

Rule 1. Each node waits to send its message to its neighbor nearer r until it has received messages from all its other neighbors.

Rule 2. When a node is ready to send its message to its neighbor nearer to r, it computes the message by marginalizing its current table to its intersection with its neighbor. It sends this marginal to the neighbor nearer to r, and then it divides its own current table by it.

Rule 3. When a node receives a message, it replaces its current tablewith the product of that table and the message.

These rules are the same as the rules for the elementary architecture, except forthe addition of the italicized phrase in Rule 2. For the outward pass, we use thesame rules, without the divisions:

Rule 1. Each node waits to send its message to a particular neigh-bor outward from r until it has received messages from all its otherneighbors.

Rule 2. When a node is ready to send its message to a particularneighbor outward from r, it computes the message by marginalizingits current table to its intersection with this neighbor.


FIG. 3.10. Rules for the Lauritzen-Spiegelhalter architecture. The message, In or Out, isalways the marginal of the sender's current table to the sender's intersection with the receiver.

Rule 3. When a node receives a message, it replaces its current tablewith the product of that table and the message.

Since each node received messages from all its outward neighbors on the inwardpass, we can restate Rule 1 for the outward pass in a simpler way: Each nodewaits to send its messages outward until it has received a message from its uniqueneighbor nearer to r. (This neighbor may be r itself; r must begin the outwardpass by sending one or more messages.)
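The local operations of the two passes can be sketched as follows. This is not the monograph's code; names are hypothetical, tables are again dictionaries from configurations (in the node's variable order) to numbers, and 0/0 is read as 0, which is legitimate here because the tables are nonnegative (the point of the continuers discussed below).

    def marginalize(table, vars, keep):
        out, idx = {}, [vars.index(v) for v in keep]
        for cfg, p in table.items():
            key = tuple(cfg[i] for i in idx)
            out[key] = out.get(key, 0.0) + p
        return out

    def send_inward(sender, s_vars, receiver, r_vars):
        # Inward pass: marginalize to the intersection, multiply the message
        # into the receiver, and divide it out of the sender.
        common = tuple(v for v in s_vars if v in r_vars)
        msg = marginalize(sender, s_vars, common)
        s_idx = [s_vars.index(v) for v in common]
        r_idx = [r_vars.index(v) for v in common]
        new_sender = {cfg: (0.0 if msg[tuple(cfg[i] for i in s_idx)] == 0
                            else p / msg[tuple(cfg[i] for i in s_idx)])
                      for cfg, p in sender.items()}
        new_receiver = {cfg: p * msg[tuple(cfg[i] for i in r_idx)]
                        for cfg, p in receiver.items()}
        return new_sender, new_receiver

    def send_outward(sender, s_vars, receiver, r_vars):
        # Outward pass: the same update without the division.
        common = tuple(v for v in s_vars if v in r_vars)
        msg = marginalize(sender, s_vars, common)
        r_idx = [r_vars.index(v) for v in common]
        return {cfg: p * msg[tuple(cfg[i] for i in r_idx)]
                for cfg, p in receiver.items()}

Applying send_inward along every link toward the root, in the order allowed by Rule 1, and then send_outward along every link away from the root, leaves each node's table equal to the marginal of the product of the initial tables, which is what is verified next.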

Let us check that the Lauritzen-Spiegelhalter architecture produces the ap-propriate marginals for all the nodes.

PROPOSITION 3.5. At the end of the Lauritzen-Spiegelhalter propagation, the table on each node x is φ^{↓x}.

Proof. First consider the situation at the end of the inward pass. On the inward pass, the messages sent are the same as in the elementary architecture and hence also the same as in the Shafer-Shenoy architecture. If x is not equal to r, then during the inward pass, x sends its inward neighbor w the Shafer-Shenoy message m_{x→w}. At the end of the inward pass, x has received messages from all its own outward neighbors (if any) and has sent only the message to w. This gives the following lemma.

LEMMA 3.3. At the end of the inward pass, a node x not equal to the root has as its table

    (φ_x ∏_{n ∈ N_x \ {w}} m_{n→x}) / m_{x→w},    (3.3)

where w is x's inward neighbor.

The root r, on the other hand, receives messages from all its neighbors and sends no messages on the inward pass. So at the end of the inward pass, it has the same table as at the end of the elementary architecture.


LEMMA 3.4. At the end of the inward pass, the table on r is φ^{↓r}.

Now consider the outward pass. On the outward pass, each node except the root receives just one message: the message from its inward neighbor. The root itself sends messages but does not receive any. So the table on the root does not change, and each of the other tables changes exactly once, when it is multiplied by the message from its inward neighbor. Since the propagation moves outward from the root, Proposition 3.5 follows by induction from Lemma 3.4 together with the following lemma.

LEMMA 3.5. Suppose w has φ^{↓w} as its table when it sends its message to outward neighbor x. Then after absorbing the message, x will have φ^{↓x} as its table.

To prove Lemma 3.5, we need a formula for the message w sends to x.

LEMMA 3.6. If w has φ^{↓w} as its table when it sends its message to outward neighbor x, then the message it sends is the product of the Shafer-Shenoy messages in both directions: m_{w→x} m_{x→w}.

To prove Lemma 3.6, we note that by its hypothesis and equation (3.3), the table on w is

The message w sends out to x is the marginal of this table to w ∩ x, which is equal, by the combination axiom and equation (3.2), to

When we multiply the expressions in Lemmas 3.6 and 3.3, we obtain

which proves Lemma 3.5 (see Figure 3.11).

Since the hypothesis of Lemma 3.6 is always true, its conclusion is too: the Lauritzen-Spiegelhalter message from w back out to x is always the product of the Shafer-Shenoy messages in both directions. This substantiates the intuitive characterization of the Lauritzen-Spiegelhalter architecture with which we began: dividing out the inward message when we send it compensates for the fact that it will be part of the message that comes back.

Another equally important way of describing the message from w back out to x is to say that it is the marginal of φ on w ∩ x. This is because w has the

Page 63: Shafer - Probabilistic Expert Systems

53

After w has received messages from all its neighbors,including x and its neighbor nearer r, and before it sendsa message back to x.

PROPAGATION IN JOIN TREES

After w sends a message back to x.

FlG. 3.11. The node x and its neighbor w nearer the root before and after w sends amessage back to x.

marginal of (p on w as its table before sending the message, and it computes themessage by marginalizing this table to w fl x.

Using continuers. The alert reader will have noticed that we glossed over the problem of zero probabilities in our description of the Lauritzen-Spiegelhalter architecture. If the table m_{x→w} has zero values, then we will not be able to perform the division in equation (3.4). Fortunately, it is not really necessary to perform this division. The reasoning with which we proved Proposition 3.5 will work if we can find a continuer, say Q_{x∩w→x}, of φ_x ∏_{n∈N_x\{w}} m_{n→x} from x ∩ w to x, for we can use Q_{x∩w→x} as x's table after it has sent its message inward to w, and this will have the same effect as the division. When the message m_{w→x} m_{x→w} comes back, we obtain

    Q_{x∩w→x} m_{x→w} m_{w→x} = φ_x ∏_{n∈N_x} m_{n→x} = φ↓x     (3.5)

as our table on x, so that Lemma 3.3 and Proposition 3.5 still hold.

The requirement that continuers should exist makes the Lauritzen-Spiegelhalter architecture slightly less general than the Shafer-Shenoy architecture, which allows negative entries in the tables φ_x. Continuers may fail to exist when negative values are allowed. But if the product of the φ_x is proportional to a probability distribution, then we can take it for granted that the entries are all nonnegative, because dropping minus signs will not change the product. And, in this case, continuers exist by Proposition 1.1.

Notice the other implication of Proposition 1.1: we can choose the continuers to be conditionals. More precisely, we can choose the continuer Q_{x∩w→x} to be a conditional with head x \ w and tail x ∩ w.

When we look beyond probability to other problems satisfying the transitivity and combination axioms (see the exercises at the end of Chapter 1 and at the end of this chapter), we find that the Shafer-Shenoy and Lauritzen-Spiegelhalter architectures have overlapping but distinct ranges of application. The Shafer-Shenoy architecture works whenever there are no restrictions on multiplication and marginalization, even if continuers do not exist. The Lauritzen-Spiegelhalter architecture, on the other hand, can sometimes work under restrictions on multiplication or marginalization that prevent the use of the Shafer-Shenoy architecture.

The new construction sequence. One interesting feature of the Lauritzen-Spiegelhalter architecture is that the product of the tables on the nodes remains equal to φ during the inward pass. This is clear when we divide: each time we divide one of the tables by a message, we multiply another by the same message, so the product does not change. It is equally clear in terms of continuers: each time we factor a table into a marginal and a continuer and remove the continuer from the node, we add it as a factor in another node.

Suppose we always choose the continuers to be conditionals. Then at the end of the inward pass, we have transformed the original factorization of φ, φ = ∏_{x∈N} φ_x, into a new factorization,

    φ = φ↓r ∏_{x∈N\{r}} Q_{x∩w(x)→x},     (3.6)

where w(x) is x's inward neighbor. This new factorization, as it turns out, can be interpreted as a construction sequence.

In order to make the interpretation as a construction sequence precise, let us take one more step, continuing the inward pass, as it were, from r to the empty set ∅. In other words, we factor the marginal φ↓r into the product of φ↓∅ and a continuer from ∅ to r. Since φ is proportional to a probability distribution P, φ↓∅ ≠ 0, and hence the continuer is unique; it is the marginal P↓r. So equation (3.6) becomes

    φ = φ↓∅ P↓r ∏_{x∈N\{r}} Q_{x∩w(x)→x}.     (3.7)

If we imagine a node ∅ added to the join tree, with an arrow to it from r, then at the end of the inward pass, we have the factors on the right-hand side of equation (3.7) on the nodes of the tree (see Figure 3.12).

FIG. 3.12. The tables at the end of the inward pass.

By Proposition 1.3, the probability distribution P is equal to φ/φ↓∅. So equation (3.7) tells us that

    P = P↓r ∏_{x∈N\{r}} Q_{x∩w(x)→x}.     (3.8)

It is the conditionals on the right-hand side of this equation that can be arranged in a construction sequence for P. Indeed, suppose x_1, ..., x_m is an ordering of the nodes of the join tree that moves outward from the root—i.e., such that x_1 is the root and each later x_i is an outward neighbor of one of x_1, ..., x_{i-1}. (Such orderings exist in any tree.) Write Q_i for Q_{x_i∩w(x_i)→x_i}, for i = 2, ..., m. Then we have the following lemma.

LEMMA 3.7. P↓r, Q_2, ..., Q_m is a construction sequence for P.

Proof. Equation (3.8) says that P is the product of P↓r, Q_2, ..., Q_m, and the union of their heads is clearly equal to the whole set of variables, the domain of P. So to prove the lemma, we need only show that the head of each conditional is disjoint from the domain of the preceding ones. But this is an obvious property of join trees: whenever we order the nodes in a sequence moving outward from a root, the intersection of each node x_i with the preceding nodes is always contained in its inward neighbor w(x_i), and hence x_i \ w(x_i) is disjoint from x_1 ∪ ··· ∪ x_{i-1}.

Lemma 3.7 says that at the end of the inward pass, the tables on the nodes are conditionals, and any outward sequence is a construction sequence.


FIG. 3.13. The propagation back from r to x.

The outward pass of the Lauritzen-Spiegelhalter architecture can be understood in terms of the construction sequences produced by the inward pass. Consider, for example, the action of the outward pass on the path going outward from the root r to a particular node x (Figure 3.13). It is evident that the conditionals along this path form a construction chain for the marginal of P on the variables involved, and the propagation outward in this chain is forward propagation in the sense of Chapter 2.

3.5. The Aalborg architecture.

As we have seen, the message from x in to w in the Lauritzen-Spiegelhalter architecture is the Shafer-Shenoy message in that direction, m_{x→w}, while the message from w back out to x is the product of the Shafer-Shenoy messages in both directions, m_{x→w} m_{w→x}. When we send m_{x→w} inward, we divide it out of the table on x in order to compensate for its later return.

The Aalborg architecture takes a more direct tack. In this architecture, we do not divide m_{x→w} out of the table on x as we send it inward. Instead, we save m_{x→w} and divide it out of m_{x→w} m_{w→x} when this message comes back. This requires more storage, but it saves computation, because the division is now in the domain w ∩ x rather than in the larger domain x. Each entry in m_{x→w} divides a whole row, as it were, in the table on x, but only a single entry in the table m_{x→w} m_{w→x}.

Messages are stored in separators, just as in the Shafer-Shenoy architecture. Each message is computed as in the Lauritzen-Spiegelhalter architecture: the node marginalizes its current table to the intersection with the node to which it is sending the message. On the inward pass, we both store the messages in the separators (as in the Shafer-Shenoy architecture) and multiply them into the receiving nodes (as in the elementary and Lauritzen-Spiegelhalter architectures). On the outward pass, the separator divides the outward message by the message it has stored before passing it on to be multiplied into the table on the receiving node. (See Figure 3.14.) By the end, the initial table on each node x will be multiplied by the Shafer-Shenoy messages from all of x's neighbors. So the final table on x will be the marginal φ↓x.

When a node w computes a message for its outward neighbor x, its own table is already its marginal, φ↓w. So the message it sends to the separator is φ↓(x∩w).


FIG. 3.14. The inward and outward action of the Aalborg architecture between x and its inward neighbor w. Here ψ_x and ψ_w are the tables on x and w, respectively, just before x computes its message to w, and ψ'_x and ψ'_w are the tables just before w computes a message to send back. The table on w may have changed one or more times as a result of messages from other outward neighbors and its own inward neighbor.

Since we are more interested in this marginal than in the Shafer-Shenoy message, we store it in the separator after we forward its quotient by the old message.

The action of the separator on the inward pass seems different from its action on the outward pass, but Figure 3.15 shows how to describe it in a way that makes it similar. Instead of beginning with the separator empty, we begin with it containing 1_{w∩x}, a table of ones. Since In is the same as In/1_{w∩x}, we can say that here too the separator is sending forward a quotient rather than merely sending forward the message it receives. Thus we have the uniform action shown in Figure 3.16; the separator always stores New but sends forward New/Old.

In summary, the Aalborg architecture uses a rooted tree with a separator between each pair of neighboring nodes. Initially, each node x has a table φ_x, and each separator has a table of ones. The propagation follows these rules:

Rule 1. Each nonroot node waits to send its message to a given neighbor until it has received messages from all its other neighbors.

Rule 2. The root waits to send messages to its neighbors until it has received messages from them all.

Rule 3. When a node is ready to send its message to a particular neighbor, it computes the message by marginalizing its current table to its intersection with this neighbor, and then it sends the message to the separator between it and the neighbor.

Rule 4. When a separator receives a message New from one of its two nodes, it divides the message by its current table Old, sends the quotient New/Old on to the other node, and then replaces Old with New.

Rule 5. When a node receives a message, it replaces its current table with the product of that table and the message.


FIG. 3.15. If we suppose that the separator begins with a table of ones, then the inward action is the same as the outward.

FIG. 3.16. The uniform action of the Aalborg architecture: When u sends New to its neighbor v, the message is intercepted by the separator, which divides it by Old and passes the quotient on.

Rules 1 and 2 force the propagation to move in to the root and then back out. At the end of the propagation, the tables on all the nodes and separators are marginals of φ, where φ = ∏_x φ_x.
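As an illustration, here is a small self-contained sketch in Python (not from the book) that runs Rules 1-5 on the simplest possible join tree: two nodes {A, B} and {B, C} with one separator on B, all variables binary, and invented nonnegative potentials. Tables are dictionaries from configurations to numbers, and the helper functions stand in for marginalization, pointwise multiplication, and the entry-by-entry quotient New/Old.

```python
from copy import deepcopy
from itertools import product

FRAME = (0, 1)                       # every variable is binary in this toy example

def configs(vars):
    return product(FRAME, repeat=len(vars))

def marginal(vars, table, target):
    """Sum a table on `vars` down to the variables in `target`."""
    idx = [vars.index(v) for v in target]
    out = {c: 0.0 for c in configs(target)}
    for c, val in table.items():
        out[tuple(c[i] for i in idx)] += val
    return out

def multiply(vars1, t1, vars2, t2):
    """Pointwise product of two tables; the result lives on the union of the domains."""
    union = tuple(dict.fromkeys(vars1 + vars2))
    return union, {c: t1[tuple(c[union.index(v)] for v in vars1)]
                      * t2[tuple(c[union.index(v)] for v in vars2)]
                   for c in configs(union)}

def quotient(new, old):
    """Entry-by-entry New/Old on a common domain; 0/0 is taken to be 0."""
    return {c: (new[c] / old[c] if old[c] != 0 else 0.0) for c in new}

# Two nodes, one separator; the potentials phi_x are invented nonnegative numbers.
root, leaf, sep = ('A', 'B'), ('B', 'C'), ('B',)
nodes = {root: {(0, 0): 2., (0, 1): 1., (1, 0): 3., (1, 1): 4.},
         leaf: {(0, 0): 1., (0, 1): 5., (1, 0): 2., (1, 1): 1.}}
phi_factors = deepcopy(nodes)                # keep the inputs for the check below
sep_table = {c: 1.0 for c in configs(sep)}   # the separator starts with ones

def send(sender, receiver):
    """One message: Rule 3 at the sender, Rule 4 at the separator, Rule 5 at the receiver."""
    global sep_table
    new = marginal(sender, nodes[sender], sep)      # Rule 3
    forwarded = quotient(new, sep_table)            # Rule 4: send New/Old ...
    sep_table = new                                 # ... and store New
    _, nodes[receiver] = multiply(receiver, nodes[receiver], sep, forwarded)   # Rule 5

send(leaf, root)    # inward pass (Rule 1)
send(root, leaf)    # outward pass (Rule 2)

# Check against brute-force marginals of phi = phi_root * phi_leaf.
joint_vars, joint = multiply(root, phi_factors[root], leaf, phi_factors[leaf])
print(nodes[root] == marginal(joint_vars, joint, root))   # True
print(nodes[leaf] == marginal(joint_vars, joint, leaf))   # True
print(sep_table == marginal(joint_vars, joint, sep))      # True
```

After the two calls to send, the tables on both nodes and on the separator agree with brute-force marginals of φ = φ_{AB} φ_{BC}, which is what the text asserts for the architecture in general.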

Dealing with zeros. We have again been making the simplifying assumption that there are no negative or zero values in the φ_x, so that division is always possible. Now let us relax this to the assumption that there are no negative values, which is sufficient for continuers to exist.

When zeros are not allowed in the table Old, the quotient New/Old is the unique solution ψ of the equation Old · ψ = New. As it turns out, this equation can still be solved when we allow zeros; the solution is not unique, but it does not matter what solution we use. So there are two ways we can proceed. We can stop talking about division—we can talk instead about solving the equation Old · ψ = New. Or we can extend the definition of division by picking out a particular solution of the equation Old · ψ = New and calling it the quotient New/Old.


We will explore both approaches. First, let us see what happens when we drop talk about division. Since division appears only in Rule 4, all we need to do is replace that rule with the following rule:

Rule 4'. When a separator containing Old receives a new message, say New, it solves the equation

    Old · ψ = New     (3.9)

for ψ and sends ψ on to its other node. It then discards Old and stores New in its place.

As the following proposition shows, this works; it is always possible to solve equation (3.9), and doing so produces the result we want.

PROPOSITION 3.6. If there are no negative values in the initial tables on the nodes, then propagation under Rules 1, 2, 3, 4', and 5 will result in each node and separator containing its marginal of φ.

Proof. Since the propagation proceeds inward just as in the elementary architecture, the root will have its marginal at the end of the inward pass. So we can prove the proposition by induction on the outward pass. Suppose propagation to w on the outward pass has resulted in the table φ↓w on w, and let us show that the next step will produce φ↓x on w's outward neighbor x. On the inward pass, x had sent in m_{x→w}, and w now sends back φ↓(x∩w), or m_{x→w} m_{w→x}. So equation (3.9) can be rewritten as

    m_{x→w} · ψ = φ↓(x∩w)     (3.10)

or

    m_{x→w} · ψ = m_{x→w} m_{w→x}.     (3.11)

Equation (3.11) obviously has a solution, but it may have more than one. We need to show that any solution will produce the marginal on x when it multiplies the table now on x. To this end, let Q_{x∩w→x} be a Lauritzen-Spiegelhalter continuer for x. The current table on x is Q_{x∩w→x} m_{x→w}, so the result of multiplying it by any solution of equation (3.10) is

    Q_{x∩w→x} m_{x→w} m_{w→x},

which is equal, by equation (3.5), to φ↓x.

Though the solution ψ of equation (3.11) may not be unique, the range of choice is simple. Since all the tables involved in the equation are the same size, the multiplications are all entry-by-entry. When an entry in m_{x→w} is nonzero, the corresponding entry in ψ is unique; we obtain it by division. When an entry in m_{x→w} is zero, the corresponding entry in m_{x→w} m_{w→x} is also zero, and so we can choose the entry in ψ however we please. It is this fact—the fact that we can choose the entries of ψ arbitrarily when they are not fully determined—that allows us to handle the situation by extending the definition of division.


In the case at hand, we want to divide one table by another of the same size, but with an eye to further developments, let us consider a more general situation, where we want to divide one table by another of the same or possibly smaller size. Say we want to divide a table B on y by a table A on x, where x ⊆ y. We will show how to do so under the assumption that whenever an entry in A is zero, everything in the corresponding row in B is zero—i.e.,

    A(c↓x) = 0 implies B(c) = 0 for every configuration c of y,

or, equivalently,

    B(c) ≠ 0 implies A(c↓x) ≠ 0.

We will say that A supports B when this condition is met. Given a table A on x that supports a table B on y, we define a table B/A on y by

    (B/A)(c) = B(c)/A(c↓x) if A(c↓x) ≠ 0, and (B/A)(c) = 0 otherwise.     (3.14)

Here we have set the value of the quotient equal to zero when the value of the denominator is zero. Any other number would do just as well for our immediate purpose, but zero will prove convenient later.
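In code, the extended quotient is a one-line change to ordinary entry-by-entry division. The sketch below (hypothetical Python, with invented tables and binary variables, using the same dictionary representation as the earlier sketch) implements the definition just given and checks that multiplying back by A recovers B.

```python
from itertools import product

FRAME = (0, 1)   # binary variables, as in the earlier sketch

def extended_divide(y, B, x, A):
    """The extended quotient B/A: B is a table on y, A a table on x, with x a subset of y.

    Entries over zeros of A are set to zero; this is the convention described
    in the text, and it is harmless when A supports B."""
    idx = [y.index(v) for v in x]
    out = {}
    for c in product(FRAME, repeat=len(y)):
        a = A[tuple(c[i] for i in idx)]
        out[c] = B[c] / a if a != 0 else 0.0
    return out

# A supports B: the row of B over b = 1 is identically zero.
A = {(0,): 2.0, (1,): 0.0}
B = {(0, 0): 4.0, (0, 1): 6.0, (1, 0): 0.0, (1, 1): 0.0}
Q = extended_divide(('B', 'C'), B, ('B',), A)
print(Q)   # {(0, 0): 2.0, (0, 1): 3.0, (1, 0): 0.0, (1, 1): 0.0}

# Multiplying back recovers B entry by entry.
print(all(A[(c[0],)] * Q[c] == B[c] for c in B))   # True
```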

This extended definition of division immediately yields the following lemma.

LEMMA 3.8. If A supports B, then A · (B/A) = B.

This lemma, in turn, yields the following proposition.

PROPOSITION 3.7. If there are no negative values in the initial tables on the nodes, and we use equation (3.14) as the definition of division, then propagation under Rules 1, 2, 3, 4, and 5 will result in each node and separator containing its marginal of φ.

Proof. Since Old is m_{x→w} and New is m_{x→w} m_{w→x}, Old supports New. So by Lemma 3.8, New/Old, defined as in equation (3.14), is a solution of equation (3.9). So Rule 4 with our extended definition of division is a special case of Rule 4', and the proposition follows from Proposition 3.6.

As the following lemma asserts, we can work with extended division in much the same way that we work with ordinary division. We can combine numerators and denominators (statement 5), and we can cancel factors in denominators by multiplication (statement 6).

LEMMA 3.9.
1. (B/A)(c) = 0 if and only if B(c) = 0.
2. If A supports B, then A supports BC.
3. If A supports B and C supports D, then AC supports BD.
4. If B is a table on y and x ⊆ y, then B↓x supports B.
5. If A supports B, then (B/A) · C = (BC)/A.
6. If A supports B and C supports D, then (B/A) · (D/C) = (BD)/(AC).
7. If A and C both support B, then (B/(AC)) · C = B/A.
8. If A and C both support B, then (B/A) · (C/C) = B/A. (This may not be true if C does not support B.)
9. If A on x supports B on y, then (B/A)↓x = B↓x/A.

We leave the proofs of these statements to the reader. In contrast to Lemma 3.8, most of them (namely, 1 and 5-9) do depend on our having chosen zero as the value of a quotient when the denominator is zero.

The Aalborg formula. Let us return, for just a moment, to the assumption that our tables never have zero entries. Write N for the set of nodes, S for the set of separators, T_x for the current table on the node x, and U_s for the current table on the separator s. At the beginning of the propagation, T_x = φ_x and U_s = 1_s, and hence

    ∏_{x∈N} T_x / ∏_{s∈S} U_s = φ.

At each step, we change the table on one node and on one separator. The table on the node is multiplied by New/Old, and the table on the separator is changed from Old to New—i.e., it also is multiplied by New/Old. Since the table on the node is multiplied by the same factor as the table on the separator, the ratio

    ∏_{x∈N} T_x / ∏_{s∈S} U_s     (3.16)

never changes; it is always equal to φ:

    φ = ∏_{x∈N} T_x / ∏_{s∈S} U_s.     (3.17)

This is the Aalborg formula. In words, the function whose marginals we want is always the ratio of the product of the tables on the nodes to the product of the tables on the separators.

The Aalborg formula still holds even if zero entries are allowed in our tables, but the reasoning with which we established it holds only if we plug a couple of holes.

First, we must check that ∏_{s∈S} U_s always supports ∏_{x∈N} T_x, so that the ratio (3.16) is defined. To check this, we write x(s) for the outward neighbor of the separator s. Since U_s, if it is not equal to 1_s, is a marginal of T_{x(s)}, U_s supports T_{x(s)} (statement 4 of Lemma 3.9). Hence ∏_{s∈S} U_s supports ∏_{s∈S} T_{x(s)} (statement 3) and also T_r ∏_{s∈S} T_{x(s)} (statement 2), which is equal to ∏_{x∈N} T_x.

Second, we must check that multiplying the top and bottom of the ratio (3.16) by New/Old will not change it. This follows from statements 6 and 8 of Lemma 3.9, together with the fact that New/Old supports the numerator. We know that New/Old supports the numerator because New is a marginal of one of its factors, and by statement 1 of Lemma 3.9, New/Old supports whatever New supports.


There is one point of notation that should be clarified in connection with the Aalborg formula. For simplicity, we have been using a notation that identifies each node x with a set of variables. We could also identify each separator with a set of variables—we could say that the separator s between the nodes u and v is equal to u ∩ v. It is better, however, to assume that the names of the separators are distinct from the sets of variables involved, for two or more separators might involve the same set of variables. (We might have one pair of neighboring nodes u_1 and v_1 and another pair u_2 and v_2 with u_1 ∩ v_1 = u_2 ∩ v_2.) It would burden our notation unnecessarily for us to introduce distinct symbols for the separator and its set of variables, but the distinction should be kept in mind, even when, as will happen shortly, we write as if they are the same.

Loading the separators. Though we have presented the Aalborg architecture under the assumption that the tables on the separators are initially tables of ones, this assumption too can be relaxed. Suppose we put nonnegative tables T_x and U_s on the nodes and separators in such a way that the table on each separator supports the tables on the neighboring nodes. Then the denominator in equation (3.16) supports the numerator. If we set the quotient equal to φ and propagate by the Aalborg rules, then we have the following proposition.

PROPOSITION 3.8. At the end of the propagation, the tables on the nodes and separators will be the corresponding marginals of φ.

Proof. By statements 5 and 6 of Lemma 3.9,

    ∏_{x∈N} T_x / ∏_{s∈S} U_s = T_r ∏_{s∈S} (T_{x(s)}/U_s),

where x(s) is the outward neighbor of the separator s. This suggests that we compare propagation with U_s on s and T_x on x to propagation with 1_s on s, T_r on r, and T_{x(s)}/U_s on x(s). Call the former the loaded propagation (because the separators are loaded at the beginning) and the latter the adjusted propagation (because the tables on the nodes are adjusted). We know that the adjusted propagation results in the marginals of φ on all the nodes and separators; let us show that the loaded propagation gives the same results.

For the moment, we reserve T_x and U_s for the initial tables in the loaded propagation; we write T_x^loaded and U_s^loaded for the current tables in the loaded propagation and T_x^adjusted and U_s^adjusted for the current tables in the adjusted propagation. Initially,

and

These equations will hold throughout the inward pass, for if they hold before an inward step, they hold after it. To see this, write M_{x(s)→s} for the message from


x(s) to s on the inward pass. We have

the inward loaded message from x(s) is multiplied by U_s in comparison with the inward adjusted message. Since this is the new table for s, equation (3.20) will still hold. But the loaded propagation divides U_s out before sending the message on to the neighbor w; hence the message multiplied into w is the same in the two propagations, and the relation between T_w^loaded and T_w^adjusted (equation (3.19) or (3.21)) will also be unaffected.

Since the root has the same table at the end of the inward pass in the two propagations, it sends the same messages back out. So we can complete the proof by induction on the outward pass. We need only show that if the message from w out back to s is the same in the two propagations, then the table on x(s) will end up the same. But if we write M_{w→s} for the message from w back to s, then the table we get on x(s) in the loaded propagation is

which is the table we get in the adjusted propagation.

The Aalborg formula can be used to find a probability distribution that has given marginals.

PROPOSITION 3.9. Suppose we are given a probability distribution T_x for each node x in a join tree. And suppose these distributions are consistent in the sense that for neighboring nodes x and y, T_x↓(x∩y) = T_y↓(x∩y). Set U_s for the separator s between x and y equal to this common marginal. Then the function f given by equation (3.17) is a probability distribution with the T_x as its marginals.

Proof. When we run the Aalborg propagation, nothing changes. The tables on the separators are already the marginals of the tables on the nodes, so the message to the separator is always identical with the table already there, and the ratio, which is passed on to the neighboring node, is always a table of ones. So the tables are already the marginals of φ. And any nonnegative table with a probability distribution as a marginal is itself a probability distribution.
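A small numerical check of this construction (hypothetical Python, not from the book): the join tree has just the two nodes {A, B} and {B, C}, the given marginals are invented dyadic numbers so that the comparisons below are exact, and the function f is built directly from the Aalborg formula.

```python
# Consistent node marginals on a two-node join tree over binary variables A, B, C.
T_AB = {(0, 0): 0.125, (0, 1): 0.25, (1, 0): 0.375, (1, 1): 0.25}
T_BC = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.375}

# The common marginal on B becomes the separator table U.
U_B = {0: T_AB[(0, 0)] + T_AB[(1, 0)], 1: T_AB[(0, 1)] + T_AB[(1, 1)]}
assert U_B[0] == T_BC[(0, 0)] + T_BC[(0, 1)] and U_B[1] == T_BC[(1, 0)] + T_BC[(1, 1)]

# The Aalborg formula: product of node tables over product of separator tables.
f = {(a, b, c): T_AB[(a, b)] * T_BC[(b, c)] / U_B[b]
     for a in (0, 1) for b in (0, 1) for c in (0, 1)}

print(sum(f.values()) == 1.0)                                   # f is a probability distribution
print(all(sum(f[(a, b, c)] for c in (0, 1)) == T_AB[(a, b)]     # with T_AB ...
          for a in (0, 1) for b in (0, 1)))
print(all(sum(f[(a, b, c)] for a in (0, 1)) == T_BC[(b, c)]     # ... and T_BC as its marginals
          for b in (0, 1) for c in (0, 1)))
```

All three checks print True: f sums to one and has the given tables as its marginals, as the proposition asserts.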

3.6. COLLECT and DISTRIBUTE.

The three major architectures we have studied in this chapter—the Shafer-Shenoy, Lauritzen-Spiegelhalter, and Aalborg architectures—move inward in a tree and then back outward. How should we organize or program this movement? This is a very general question, for many computations are tree recursive. But we should take a moment to consider it.

We have described each of the three architectures by giving, along with rules for what the nodes do, rules for when they are allowed to do it. The simplicity of this description made it convenient for the theoretical understanding we have been seeking, but at the programming level, it suggests rather expensive control regimes. Were the nodes independent processors, we seem to be suggesting a regime in which each node constantly checks on whether it is allowed to act. In a serial machine, we seem to be suggesting a regime (as in a rule-based program) in which we constantly search for nodes that are ready to act (rules that are ready to fire).

A more economical approach is to use the connections of the tree to propagate signals to act as well as the results of actions. To trigger the inward pass, we can have the root ask for inward messages from its neighbors, which, in order to comply with the request, must ask for inward messages from their other neighbors, and so on. To trigger the outward pass, we can have the root send messages to its neighbors, together with the request that they pass messages on to their other neighbors, and so on.

If we run the propagation in this way, the root need not be specified in the data structure representing the tree; it is merely the node at which we begin the propagation. Having propagated with one node as the root, and perhaps then having made changes in the input tables, we can propagate with a different node as the root.

The tree itself can be represented in object-oriented fashion, with each node as an object. Each node has a list of neighbors and the ability to communicate with these neighbors. At a coarse level of description that is common to all three architectures, a node has two actions, COLLECT, which is used on the inward pass, and DISTRIBUTE, which is used on the outward pass. Both actions can be called from outside the system or from a neighboring node. These actions are recursive, and they also trigger a more basic action, SENDMESSAGE.

When the action COLLECT is called in a node from outside the system, that node in turn calls COLLECT in all its neighbors. When COLLECT is called in a node by a neighbor, that node calls COLLECT in all its other neighbors and also, after the neighbors have completed their action, performs SENDMESSAGE to the neighbor that made the call. This means that we can trigger the inward pass simply by calling COLLECT in the node that we want to act as the root. The call is automatically relayed out toward the leaves, and when it has reached the leaves, the messages come back in (Figure 3.17).

When the action DISTRIBUTE is called in a node from outside the system, that node performs SENDMESSAGE to each neighbor and then calls DISTRIBUTE in that neighbor. When DISTRIBUTE is called in a node by a neighbor, the node performs SENDMESSAGE to and calls DISTRIBUTE in its other neighbors. So we can trigger the outward pass by calling DISTRIBUTE in the node we have chosen to be the root. The call will automatically move outward in the tree, preceded by outward messages (Figure 3.18).
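The control regime just described fits in a few lines of object-oriented code. The sketch below (hypothetical Python; the class and method names are invented, and SENDMESSAGE is reduced to a trace statement because its content depends on the architecture) shows how a single call at the chosen root triggers the whole inward or outward pass.

```python
class Node:
    """One node of the join tree; neighbors are other Node objects."""

    def __init__(self, name):
        self.name = name
        self.neighbors = []

    def send_message(self, receiver):
        # Architecture-specific work (Shafer-Shenoy, Lauritzen-Spiegelhalter,
        # or Aalborg) would go here; we just trace the call.
        print(f"message {self.name} -> {receiver.name}")

    def collect(self, caller=None):
        # Relay the call outward, then send the message back toward the caller.
        for n in self.neighbors:
            if n is not caller:
                n.collect(self)
        if caller is not None:
            self.send_message(caller)

    def distribute(self, caller=None):
        # Send each outward message, then relay the call to that neighbor.
        for n in self.neighbors:
            if n is not caller:
                self.send_message(n)
                n.distribute(self)

# Build a small tree and run both passes with 'r' as the root.
r, a, b, c = Node('r'), Node('a'), Node('b'), Node('c')
for u, v in [(r, a), (r, b), (b, c)]:
    u.neighbors.append(v)
    v.neighbors.append(u)

r.collect()      # inward pass: prints a -> r, c -> b, b -> r
r.distribute()   # outward pass: prints r -> a, r -> b, b -> c
```

Choosing a different root requires no change to the data structure, as noted above; we simply call collect and distribute on another node.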

The action SENDMESSAGE differs from architecture to architecture. In the Lauritzen-Spiegelhalter architecture, there are actually two distinct SENDMESSAGE actions, SENDMESSAGEIN, which is used by COLLECT, and SENDMESSAGEOUT, which is used by DISTRIBUTE. But the other two architectures, the Shafer-Shenoy architecture and the Aalborg architecture, use the same SENDMESSAGE in COLLECT as in DISTRIBUTE.


FIG. 3.17. After COLLECT is called outward from the root, messages move inward.

In the Lauritzen-Spiegelhalter architecture, SENDMESSAGEIN affects both the sending node and the receiving node. The message sent is divided out of the table in the first and multiplied into the table in the second. The action SENDMESSAGEOUT, on the other hand, affects only the receiving node.

The description of SENDMESSAGE in the Shafer-Shenoy and Aalborg architectures is affected by where we place the separators. In the case of the Shafer-Shenoy architecture, it is most convenient to split the separator and put each storage register in the node to which its messages are directed, so that the effect of SENDMESSAGE is to fill the storage register in the receiving node. In the case of the Aalborg architecture, it seems most appropriate to place copies of the separator in both nodes; when a message is sent, it is stored in the copy in the sending node and then sent to the receiving node, where it is stored again after being used to compute the quotient that is multiplied into the node's main table.

To complete the picture, we can also provide each node with a REPORT action, which results in the node's marginal being sent to the user of the system. In the Lauritzen-Spiegelhalter and Aalborg architectures, this action involves no computation, but in the Shafer-Shenoy architecture, it requires the node to collect the messages in its separators and multiply them all into its main table. We can make REPORT an action that is called from outside the system, or we can make it part of DISTRIBUTE, so that marginals are reported as the outward pass proceeds.


FIG. 3.18. As DISTRIBUTE is called outward from the root, messages move outward.

3.7. Scope and alternatives.

Join-tree propagation may or may not succeed in finding marginals of a particular product of tables. It will not succeed if the belief net is so highly connected that no feasible join-tree cover exists. In this case, we may be able to use approximate rather than exact methods. Presently, the most widely used approximate methods are Gibbs sampling and its cousins—methods now collectively called "Markov-chain Monte Carlo." These methods were proposed for probabilistic expert systems by Pearl [43], but they have been less successful for expert systems than for vision (Geman and Geman [29]) and Bayesian statistics (Besag et al. [13]). The small or zero conditional probabilities often encountered in expert systems—where a priori knowledge is stronger—tend to violate the conditions that allow the Markov-chain methods to converge. A recent candidate to fill the gap left by the weakness of Markov-chain methods for expert systems is mean-field theory, also borrowed from statistical physics (Saul et al. [44]).

In this chapter, we have discussed only the problem of finding marginals of probability distributions given as products of tables. In principle, join-tree propagation is applicable to finding marginals in any other problem in which the transitivity and combination axioms are satisfied. (Examples are given in the exercises.) There are, however, problems in which the axioms are satisfied but the operations are not feasible. Join-tree propagation depends on marginalization and multiplication being computationally feasible in small domains (small numbers of variables), and sometimes it is not. Continuous probability densities provide an example. We know how to marginalize (integrate) in many parametric families of densities, but multiplication usually takes us outside the parametric family, producing densities that are difficult to integrate, even if only a few variables are involved. As a practical matter, join-tree propagation for continuous densities has been limited mainly to the multivariate normal distribution, where it is often discussed in connection with the Kalman filter.

We should also note another limitation of the join-tree method—in general, it only helps us find marginals for small clusters of variables. In many problems, we want to compute other numbers: probabilities involving many variables and expectations. Markov-chain Monte Carlo, when it works, allows us to compute these numbers as well.

Exercises.

EXERCISE 3.1. How great is the computational advantage of the Lauritzen-Spiegelhalter architecture over the Shafer-Shenoy architecture? For a first pass at answering this question, you may wish to assume that each nonleaf in the join tree has the same number of neighbors (the tree's "branching factor"), that each variable has the same number of elements in its frame, and that each node has the same number of variables in common with its branch as well as the same number of new variables.

EXERCISE 3.2. Compare the three architectures on the basis of their storage requirements. Consider the case where we need to keep the initial inputs and the case where we do not.

EXERCISE 3.3. Show how to use join-tree computation to find P↓w(x) for any set w of variables and any single configuration x of w—even if w is too large to be contained in any node of the join tree. (Hint: Pretend x is observed, and exploit the fact that P↓w(x) is the inverse of the normalizing constant for the posterior probabilities.)

EXERCISE 3.4. Discuss ways of measuring the amount of computation required by a join tree. (In the introduction to Chapter 3, two measures were suggested: the sum of the sizes of the frames, and the size of the largest frame.) Discuss the issue separately for probability propagation and for each of the problems listed in Exercise 1.2.

EXERCISE 3.5. Verify that the elementary and Shafer-Shenoy architectures always work in the abstract framework you formulated in Exercise 1.5.

EXERCISE 3.6. Explore the analogy between the outward pass of the Lauritzen-Spiegelhalter architecture and the outward pass in recursive dynamic programming, in which solutions of reduced problems are used to build up an overall solution (Mitten [40], Bertele and Brioschi [1], Shenoy [46]). Formulate an abstract theory that includes both examples as special cases.


EXERCISE 3.7. What constraints must be imposed on the placement of conditionals in the nodes of a join tree in order for the results of Shafer-Shenoy computations to remain within the partial semigroup of conditionals? (See Exercise 2.5.) Explore conditions on the existence of continuers that allow the Lauritzen-Spiegelhalter architecture to work in this context.

EXERCISE 3.8. In some problems, the mathematical objects that one combines can be embedded in a larger class that comes closer to being a group, so that the division required by the Aalborg architecture is possible. Discuss the extent to which this is possible in the examples considered in Exercise 1.2.


CHAPTER 4

Resources and References

4.1. Meetings.

The annual Conference on Uncertainty in Artificial Intelligence (UAI) plays a leading role in the development of probabilistic, belief-function, fuzzy, and qualitative expert systems. Papers given in its first six years (1985-1990) were collected and published by North-Holland in a series entitled Uncertainty in Artificial Intelligence. Proceedings of subsequent meetings have been published by Morgan Kaufmann. The Association for Uncertainty in Artificial Intelligence, the sponsor of the conference, has a site on the World-Wide Web:

http://www.auai.org/

This site gives instructions for subscribing to the association's electronic mailing list and includes links to many other sources of information about the management of uncertainty in expert systems.

The biennial International Workshop on Artificial Intelligence and Statistics is also devoted in part to uncertainty in expert systems. The Web site for its sponsor, the Society for Artificial Intelligence and Statistics, is

http://www.vuse.vanderbilt.edu/~dfisher/ai-stats/society.html

This site is maintained by Douglas H. Fisher at Vanderbilt University.

Another important conference for this community is the International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU), which has been held biennially since 1986. The proceedings of the most recent conference, held in Paris in 1994, were published by Springer-Verlag in 1995 under the title Advances in Intelligent Computing, edited by Bernadette Bouchon-Meunier, Ronald R. Yager, and Lotfi A. Zadeh.

4.2. Software.

A number of software packages for probabilistic expert systems are available. The most highly developed is the commercial product HUGIN. Developed at Aalborg, Denmark, it uses the Aalborg architecture described in Chapter 3. Information on HUGIN can be obtained at:

http://www.hugin.dk



The most thorough implementation of the Shafer-Shenoy architecture is Pulcinella. Developed by the IRIDIA research group in Brussels, it handles belief functions, categorical judgments, and possibility measures as well as probabilities. It is implemented in Common Lisp and is distributed free. Information is available from IRIDIA's Web site:

http://iridia.ulb.ac.be/pulcinella/

Further information on these and other packages, some commercial and some free, is available at a Web site maintained by Russell Almond:

http://bayes.stat.washington.edu/almond/belief.html

4.3. Books.

There are now many excellent books on probabilistic expert systems and related topics.

[1] Bertele, Umberto, and Francesco Brioschi (1972). Nonserial Dynamic Programming. Academic Press. New York. A readable treatment of join-tree computation for decomposable dynamic programming problems.

[2] Diestel, R. (1990). Graph Decompositions. Clarendon Press. Ox-ford. A general perspective on decompositions of the type exemplifiedby join trees, with hints at the diversity of the applied problems thatinspire these decompositions.

[3] Jensen, Finn V. (1996). An Introduction to Bayesian Networks.University College Press. London. An engaging and readable intro-duction to probabilistic networks, with an emphasis on constructionand computation within the Aalborg architecture.

[4] Judd, J. Stephen (1990). Neural Network Design and the Com-plexity of Learning. MIT Press. Cambridge. This interesting andreadable book demonstrates the relevance of join-tree ideas to theproblem of learning in neural networks.

[5] Lauritzen, Steffen L. (1996). Graphical Models. Oxford Univer-sity Press. London. A superb treatment of probabilistic networks asmodels for data, this book marries probabilistic expert systems withup-to-date statistical methodology. Relatively comprehensive, it cov-ers undirected as well as directed graphs, and continuous (normal)as well as discrete probability distributions. Its greatest originalitylies in its treatment of mixed cases: chain graphs, which combinedirected and undirected graphs, and models with both discrete andcontinuous variables.

[6] Neapolitan, E. (1990). Probabilistic Reasoning in Expert Systems. John Wiley. New York. This readable book covered the state of the art in computation in probabilistic expert systems at the time of its publication. It is now somewhat dated.

[7] Oliver, Robert M., and James Q. Smith, eds. (1990). InfluenceDiagrams, Belief Nets, and Decision Analysis. John Wiley. NewYork. Still a good introduction to the motivations behind influencediagrams, which generalize probabilistic expert systems by includingvariables representing a user's decisions. It includes an introduc-tory essay by Ron Howard, the most influential proponent of thesediagrams.

[8] Pearl, Judea (1988). Probabilistic Reasoning in Intelligent Sys-tems. Morgan Kaufmann. San Mateo, California. In a series ofarticles preceding this book, its author initiated the study and useof probabilistic expert systems as the term is now understood. Thebook, lively and energetic, introduced them to a wide audience.

[9] Shafer, Glenn (1996). The Art of Causal Conjecture. MIT Press.Cambridge. A study of causality in terms of the dynamics of proba-bility, this book shows that the causal interpretation of probabilisticexpert systems, like the causal interpretation of other statistical mod-els, is often complex: models may have more than one possible causalinterpretation. This book also explores some generalizations of theDAG structure.

[10] Shafer, Glenn, and Judea Pearl, eds. (1990). Readings in Un-certain Reasoning. Morgan Kaufmann. San Mateo, California. Thisvolume collects classic and recent papers on uncertain reasoning inartificial intelligence. Probabilistic, belief-function, fuzzy, and quali-tative approaches are included.

[11] Spirtes, Peter, Clark Glymour, and Richard Scheines (1993). Causation, Prediction, and Search. Lecture Notes in Statistics 81. Springer-Verlag. New York. This monograph explores a variety of non-Bayesian ideas for constructing belief nets from data. The emphasis is on using limited a priori assumptions about causal relations among variables together with observed independencies among those variables.

[12] Whittaker, J. (1990). Graphical Models in Applied MultivariateStatistics. John Wiley. Chichester. A pioneering statistical treat-ment of belief nets, emphasizing the multivariate normal distribution.Many examples.

4.4. Review articles.

These articles review several topics mentioned in preceding chapters.

[13] Besag, Julian, Peter Green, David Higdon, and Kerrie Mengersen (1995). Bayesian computation and stochastic systems (with discussion). Statistical Science. 10, pp. 1-66. A review of Markov-chain Monte Carlo methods, with an emphasis on Bayesian statistical problems.

[14] Buntine, Wray (1996). A guide to the literature on learninggraphical models. IEEE Transactions on Knowledge and Data En-gineering. An excellent review of the problem of selecting graphicalmodels for probabilistic expert systems on the basis of data.

[15] Charniak, Eugene (1991). Bayesian networks without tears. AIMagazine. Winter 1991, pp. 50-63. A nontechnical introductionto belief nets, especially useful for students with limited interest inmathematical probability theory.

[16] Dempster, A. P. (1971). An overview of multivariate data anal-ysis. Journal of Multivariate Analysis. 1, pp. 316-346. This classicarticle includes a discussion of the limitations of the multivariateframework, limitations still not overcome in the main body of workin statistics and probabilistic expert systems.

[17] Neal, Radford M. (1993). Probabilistic inference using Markovchain Monte Carlo methods. Technical Report. Department of Com-puter Science. University of Toronto. In contrast to Besag et al., thisreview emphasizes probabilistic expert systems.

[18] Rabiner, L. R. (1989). A tutorial on hidden Markov models andselected applications in speech recognition. Proceedings of the IEEE.77, pp. 257-286. Still one of the best introductions to hidden Markovmodels.

[19] Spiegelhalter, David J., A. Philip Dawid, Steffen L. Lauritzen,and Robert G. Cowell (1993). Bayesian analysis in expert systems(with discussion). Statistical Science. 8, pp. 219-283. Currentlythe best brief overview of the state of the art of probabilistic expertsystems.

[20] Tatman, J. A., and Ross Shachter (1990). Dynamic program-ming and influence diagrams. IEEE Transactions on Systems, Man,and Cybernetics. 20, pp. 365-379. This article reviews influence dia-grams, which generalize belief nets by including nodes for decisions,and shows how dynamic programming can be understood within theframework of influence diagrams.

[21] Xu, Hong, and Robert Kennes (1994). Steps towards an efficient implementation of Dempster-Shafer theory. Advances in the Dempster-Shafer Theory of Evidence. R. R. Yager, M. Fedrizzi, and J. Kacprzyk, eds. John Wiley. New York. Pp. 153-174. This article reviews various ways of making the Shafer-Shenoy architecture as efficient as possible for belief functions.


4.5. Other sources.

This is not a comprehensive bibliography of the very extensive work on proba-bilistic expert systems, but it contains the articles and dissertations that havemost engaged the author's attention.

[22] Beeri, Catriel, Ronald Fagin, David Maier, and Mihalis Yan-nakakis (1983). On the desirability of acyclic database schemes.Journal of the Association for Computing Machinery. 30, pp. 479-513. This very widely cited paper first introduced the idea of a jointree into the literature on relational databases. It is also responsiblefor the name "join tree."

[23] Cano, Jose, Miguel Delgado, and Serafin Moral (1993). An axiomatic framework for propagating uncertainty in directed acyclic networks. International Journal of Approximate Reasoning. 8, pp. 253-280. This article extends the axioms for join-tree computation, discussed in Chapter 1 and in Shenoy and Shafer [48], to computation within directed acyclic graphs, in the style developed in Pearl's Probabilistic Reasoning in Intelligent Systems [8].

[24] Cooper, Gregory F., and Edward Herskovits (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning. 9, pp. 309-347. An influential exposition of a straightforward Bayesian approach to choosing and parametrizing a DAG from data for a given set of variables. The method developed in this article can be contrasted with the non-Bayesian methods developed in Spirtes, Glymour, and Scheines's Causation, Prediction, and Search [11].

[25] Cowell, Robert G., and A. Philip Dawid (1992). Fast retractionof evidence in a probabilistic expert system. Statistics and Com-puting. 2, pp. 37-40. Using out-marginalization (see Exercise 1.4),this article gives a quick join-tree algorithm for adjusting marginalprobabilities to allow for the omission of previously included obser-vations. The algorithm allows efficient computation of statistics formonitoring the performance of a belief net.

[26] Cox, David R., and Nanny Wermuth (1993). Linear dependenciesrepresented by chain graphs (with discussion). Statistical Science. 8,pp. 204-283. Taking DAGs and chain graphs as a starting point,this article discusses a wide variety of graphical representations ofmultivariate probability distributions.

[27] Dawid, A. Philip (1980). Conditional independence for statisticaloperations. Annals of Statistics. 8, pp. 598-617. This pioneeringarticle studies general properties of conditional independence thatwere later studied as axioms by Judea Pearl.


[28] Dempster, A. P. (1990). Normal belief functions and the Kalmanfilter. Technical Report. Department of Statistics. Harvard Univer-sity.

[29] Geman, Stuart, and Donald Geman (1984). Stochastic relax-ation, Gibbs distributions, and the Bayesian restoration of images.IEEE Transactions on Pattern Analysis and Machine Intelligence. 6,pp. 721-741. This article shows how image-analysis problems can bemodeled so that the computation problems are susceptible to reso-lution by Gibbs sampling. Very much influenced by the work of UlfGrenander, the article was in turn very influential in vision, artificialintelligence, and Bayesian statistics.

[30] Heckerman, David (1990). Probabilistic similarity networks. Networks. 20, pp. 607-636. This article explores an interesting generalization of belief networks, in which the factorization that permits representation by a DAG may apply only conditionally on some values of the preceding variables.

[31] Jensen, Finn V. (1991). Calculation in HUGIN of probabilitiesfor specific configurations—a trick with many applications. Scandi-navian Conference on Artificial Intelligence 91. IOS Press. Burke,Virginia. Pp. 176-186. This article puts the trick of Exercise 3.3 touse for practical tasks in probabilistic expert systems: comparison ofcompeting hypotheses, analysis of conflicts in data, and evaluationof approximate calculations.

[32] Jensen, Finn V. (1995). Cautious propagation in Bayesian networks. Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence. Philippe Besnard and Steve Hanks, eds. Morgan Kaufmann. San Mateo, California. Pp. 323-328. This article uses the Shafer-Shenoy architecture to supply a more general solution to the problem considered by Cowell and Dawid [25].

[33] Jensen, Finn V., and Frank Jensen (1994). Optimal junction trees. Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence. R. L. Mantaras and D. Poole, eds. Morgan Kaufmann. San Mateo, California. Pp. 360-366. Even when sets of variables can be arranged in a join tree, there may be more than one arrangement, some more efficient than others. This paper presents an algorithm for choosing an optimal one.

[34] Jensen, Finn V., Steffen L. Lauritzen, and K. G. Olesen (1990).Bayesian updating in causal probabilistic networks by local compu-tation. Computational Statistics Quarterly. 4, pp. 269-282. Thisarticle, all of whose authors work at the University of Aalborg inAalborg, Denmark, introduced the architecture named after that cityin Chapter 3.


[35] Kjærulff, Uffe (1992). Optimal decomposition of probabilistic networks by simulated annealing. Statistics and Computing. 2, pp. 7-17. This article suggests a sophisticated heuristic for near-optimal join trees (or, in the terminology it uses, near-optimal "decompositions" or "triangulations"). It also gives references to other heuristics.

[36] Kong, Augustine (1986). Multivariate belief functions and graphical models. Doctoral dissertation. Department of Statistics. Harvard University. This dissertation spells out how the concept of join-tree cover is related to the concept of triangulation, which is used more often in the older literature. It also studies some heuristics for finding join-tree covers or triangulations.

[37] Lauritzen, Steffen, and David Spiegelhalter (1988). Local com-putations with probabilities on graphical structures and their ap-plication to expert systems (with discussion). Journal of the RoyalStatistical Society, Series B. 50, pp. 157-224. This classic article in-troduced probabilistic expert systems to the statistical community.It is the source of the Lauritzen-Spiegelhalter architecture discussedin Chapter 3. The reader of this article should be cautioned that theheuristic it uses for finding join-tree covers, maximum cardinalitysearch, gives rather poor results in general. See [35] and [36].

[38] Li, Zhaoyu, and Bruce D'Ambrosio (1994). Efficient inference in Bayes networks as a combinatorial optimization problem. International Journal of Approximate Reasoning. 11, pp. 55-81. The authors formulate the problem of finding an optimal order for summing variables out as a combinatorial problem.

[39] Mellouli, Khaled (1987). On the propagation of beliefs in networks using the Dempster-Shafer theory of evidence. Doctoral dissertation. School of Business. University of Kansas. This dissertation includes a demonstration that the class of join-tree covers obtained by summing out is always large enough to include optimal join-tree covers.

[40] Mitten, L. G. (1964). Composition principles for synthesis of optimal multistage processes. Operations Research. 12, pp. 610-619. An early exploration of the extent of applicability of recursive methods for optimization such as those described in Bertele and Brioschi's book.

[41] Ndilikilikesha, Pierre C. (1994). Potential influence diagrams.International Journal of Approximate Reasoning. 10, pp. 251-285.This article shows how influence diagrams can be solved using arooted join tree.

[42] Pearl, Judea (1986). Fusion, propagation, and structuring in belief networks. Artificial Intelligence. 29, pp. 241-288. An extremely influential contribution to computation in belief nets, emphasizing methods that preserve a net's directed semantics in the course of the computation. The material in this article was incorporated into Pearl's 1988 book [8].

[43] Pearl, Judea (1987). Evidential reasoning using stochastic simulation. Artificial Intelligence. 32, pp. 245-257. This may be the first proposal to use Markov-chain Monte Carlo for computations in belief nets. The method had long been used in statistical physics and in operations research.

[44] Saul, Lawrence K., Tommi Jaakkola, and Michael I. Jordan(1995). Mean field theory for sigmoid belief networks. Computa-tional Cognitive Science Technical Report 9501, Center for Biologicaland Computational Learning. Massachusetts Institute of Technol-ogy. This article sketches a program for borrowing the idea of mean-field theory from statistical physics in order to address the prob-lem of approximate computation in belief nets with extremely highconnectivity.

[45] Shafer, Glenn, Prakash P. Shenoy, and Khaled Mellouli (1987). Propagating belief functions in qualitative Markov trees. International Journal of Approximate Reasoning. 1, pp. 349-400. This paper explores a way of understanding constraint propagation and belief-function computation abstractly, without variables.

[46] Shenoy, Prakash P. (1991). Valuation based systems for discrete optimization. Uncertainty in Artificial Intelligence 6. P. P. Bonissone, M. Henrion, L. N. Kanal, and J. F. Lemmer, eds. North-Holland. Amsterdam. Pp. 385-400. The abstract understanding of inward and outward passes in join-tree computation in this article generalizes the method of nonserial dynamic programming discussed by Bertele and Brioschi [1].

[47] Shenoy, Prakash P. (1994). Representing conditional indepen-dence relations by valuation networks. International Journal of Un-certainty, Fuzziness and Knowledge-Based Systems. 2, pp. 143-165.This article advances a general framework for propagating informa-tion in expert systems. Shenoy's framework applies not only to prob-ability but also to belief functions and other calculi satisfying theaxioms of Chapter 1.

[48] Shenoy, Prakash P., and Glenn Shafer (1990). Axioms for probability and belief-function propagation. Uncertainty in Artificial Intelligence 4. R. D. Shachter, T. S. Levitt, L. N. Kanal, and J. F. Lemmer, eds. North-Holland. Amsterdam. Pp. 169-198. The axioms for join-tree computation, discussed in Chapter 1, were first isolated in this article. The article also describes the Shafer-Shenoy architecture.


[49] Srivastava, Rajendra P., and Glenn Shafer (1992). Belief-function formulas for audit risk. The Accounting Review. 67, pp. 249-283. This article discusses the propagation of evidence for financial audits, using belief functions rather than probabilities.

[50] Wermuth, Nanny, and Steffen L. Lauritzen (1990). On substantive research hypotheses, conditional independence graphs, and graphical chain models (with discussion). Journal of the Royal Statistical Society, Series B. 52, pp. 21-50. This wide-ranging article includes a good introduction to the uses of chain graphs.

[51] Xu, Hong, and Philippe Smets (1996). Reasoning in evidential networks with conditional belief functions. International Journal of Approximate Reasoning. 14, pp. 158-185. This article adds a concept of conditionals to the theory of belief functions and shows how they can be implemented in join-tree computation.

[52] Zhang, Nevin Lianwen, Runping Qi, and David Poole (1994). A computational theory of decision networks. International Journal of Approximate Reasoning. 11, pp. 83-158. This article extends join-tree computation to influence diagrams and even to slightly more general networks; forgetting is allowed.


Index

Aalborg architecture, 56
Aalborg formula, 61
audit evidence, 29

Bayesian network, 22
Bayesian statistics, 66
belief chain, 25, 33
belief functions, 15
belief net, 21
bubble graph, 27

categorical variables, 13
chain, 25
chain graph, 30
COLLECT, 64
combination axiom, 5
computational cost, 50, 67
conditional, 5, 18
conditional probabilities, 5
conditioning, 10
configuration, 2
constraint propagation, 36
construction chain, 28
construction sequence, 19, 54
constructive interpretation of probability, 9
continuer, 7, 15, 16, 18, 53

DAG, 21
  construction ordering, 22
  initial segment, 23
density, 3
directed acyclic graph, 21
DISTRIBUTE, 64
domain, 3
dynamic programming, 36

elementary architecture, 43
expectation, 12
extended division, 60

factorization, 35, 54
four-color problem, 36
frame, 2

Gibbs sampling, 66
graphical model, 22

head, 5
heuristics, 37
hidden Markov model, 26, 33

independence, 9
information branch, 43

join graph, 29
join tree, 35, 39
  cover, 43
  heuristics, 37
  root, 41
junction tree, 35

Kalman filter, 16, 67

lattice, 16
Lauritzen-Spiegelhalter architecture, 50
linear programming, 15

marginal, 2, 3, 18
Markov chain, 25
Markov-chain Monte Carlo, 66
mean field theory, 66
multivariate framework, 2, 14

object-oriented computation, 64
out-marginal, 16

parallel computation, 48
parameter, 13
posterior probability, 10
probability distribution, 2
  algorithmic, 13
  continuous, 3
  discrete, 2
  parametric, 13
  posterior, 10
  tabular, 13
  with given marginals, 63

recursive computation, 5
recursive dynamic programming, 67
relational database, 35
rules, 63

semigroup, 16, 33, 68
SENDMESSAGE, 64
separator, 45, 56, 62
Shafer-Shenoy architecture, 45
similarity network, 31
slice, 6
state graph, 25, 33
sufficient, 9
support, 60
systems of equations, 15, 36

tail, 5
transitivity axiom, 5

valuation network, 30
variable, 2
vision, 66

zeros, 58

