
Structured Probabilistic Reasoning (Incomplete draft)

Bart Jacobs
Institute for Computing and Information Sciences, Radboud University Nijmegen,

P.O. Box 9010, 6500 GL Nijmegen, The Netherlands.

[email protected] http://www.cs.ru.nl/∼bart

Version of: August 4, 2019


Contents

Preface page v

1 Collections and Channels 1
1.1 Cartesian products 2
1.2 Lists 4
1.3 Powersets 9
1.4 Multisets 13
1.5 Probability distributions 22
1.6 Frequentist learning: from multisets to distributions 29
1.7 Channels 33
1.8 Parallel products 40
1.9 A Bayesian network example 50
1.10 The role of category theory 56

2 Predicates and Observables 62
2.1 Validity 63
2.2 The structure of observables 71
2.3 Conditioning 79
2.4 Transformation of observables 88
2.5 Reasoning along channels 95
2.6 Discretisation, and coin bias learning 104
2.7 Inference in Bayesian networks 108
2.8 Validity-based distances 112
2.9 Variance, covariance and correlation 119
2.10 Dependence and covariance 126

3 Directed Graphical Models 131
3.1 String diagrams 133
3.2 Equations for string diagrams 141
3.3 Accessibility and joint states 146
3.4 Hidden Markov models 150
3.5 Disintegration 161
3.6 Disintegration for states 167
3.7 Bayesian inversion 170
3.8 Categorical aspects of Bayesian inversion 179
3.9 Factorisation of joint states 186
3.10 Inference in Bayesian networks, reconsidered 194
3.11 Updating and adapting: Pearl versus Jeffrey 199

4 Learning of States and Channels 210
4.1 Learning a channel from tabular data 211
4.2 Learning with missing data 214
4.3 Increasing validity via updating 219
4.4 A general sum-increase method 222
4.5 Learning Markov chains 229
4.6 Learning hidden Markov models 234
4.7 Data, validity, and learning 241
4.8 Learning along a channel 250
4.9 Expectation Maximisation 260
4.10 A clustering example 273

References 279


Preface

The phrase ‘structure in probability’ sounds like a contradictio in terminis: it seems that probability is about randomness, like in the tossing of coins, in which one may not expect to find much structure. Still, as we know since the seventeenth century, via the pioneering work of Christiaan Huygens, Pierre Fermat, and Blaise Pascal, there is quite some mathematical structure in the area of probability. The raison d’être of this book is that there is more structure — especially algebraic and categorical — than is commonly emphasised.

The scientific roots of this book’s author lie outside probability theory, in type theory and logic (including some quantum logic), in semantics and specification of programming languages, in computer security and privacy, in state-based computation (coalgebra), and in category theory. This scientific distance to probability theory has advantages and disadvantages. Its obvious disadvantage is that there is no deeply engrained familiarity with the field and with its development. But at the same time this distance may be an advantage, since it provides a fresh perspective, without sacred truths and without adherence to common practices and notations. For instance, the terminology and notation in this book are somewhat influenced by quantum theory: in using the word ‘state’ for ‘distribution’ or the word ‘observable’ for an R-valued function on a sample space, in using ket notation | − 〉 for discrete probability distributions, or in using daggers as reversals in analogy with conjugate transposes (for Hilbert spaces).

It should be said: for someone trained in formal methods, the area of probability theory can be rather sloppy: everything is called ‘P’, types are hardly ever used, crucial ingredients (like distributions in expected values) are left implicit, basic notions (like conjugate prior) are introduced only via examples, calculation recipes and algorithms are regularly just given, without explanation, goal or justification, etc. This hurts, especially because there is so much beautiful mathematical structure around. For instance, the notion of a channel formalises the idea of a conditional probability and carries a rich mathematical structure that can be used in compositional reasoning, with both sequential and parallel composition. The Bayesian inversion (‘dagger’) of a channel does not only come with appealing mathematical (categorical) properties — e.g. smooth interaction with sequential and parallel composition — but is also extremely useful in inference (in Jeffrey’s rule) and in connecting forward and backward inference via reversal (see Theorem 3.7.7, the author’s favourite), and in learning. It is a real pity that such pearls of the field receive relatively little attention.

We even dare to think that this ‘sloppiness’ is ultimately a hindrance to further development of the field, especially in computer science. For instance, it is hard to even express a result like Theorem 3.7.7 in standard probabilistic notation. One can speculate that states/distributions are kept implicit in traditional probability theory because in many examples they are used as a fixed implicit assumption in the background. Indeed, in mathematical notation one tends to omit — for efficiency — the least relevant (implicit) parameters. But the essence of probabilistic computation is state transformation, where it has become highly relevant to write explicitly in which state one is working at which moment. Similarly, the difference between a ‘multiple-state’ and a ‘copied-state’ perspective — which plays an important role in this book, e.g. in learning — comes naturally once states are made explicit. The notation developed in this book helps in such situations — and in many other situations as well, we hope.

This book uses the notion of a (probabilistic) channel as cornerstone of its ‘channel-based’ account of probabilistic reasoning. It does so while emphasising the use of channels (or: Kleisli maps, in categorical terms) for other forms of computation. Indeed, the book starts by laying out the similarities between the basic ‘collection’ data types of lists, subsets, multisets, and (probability) distributions, with their own properties and associated channels, for instance for non-deterministic computation. Especially the interplay between multisets and distributions, notably in learning, is a recurring theme.

But it is not just mathematical aesthetics that drives the developments in this book. Probability theory nowadays forms the basis of large parts of big data analytics and of artificial intelligence. These areas are of increasing societal relevance and provide the basis of the modern view of the world — more based on correlation than on causation — and also provide the basis for much of modern decision making that may affect the lives of billions of people in profound ways. There are increasing demands for justification of such probabilistic reasoning methods and decisions, for instance in the legal setting provided by Europe’s General Data Protection Regulation (GDPR). Its recital 71 is about automated decision-making and talks about a right to obtain an explanation:


In any case, such processing should be subject to suitable safeguards, which should include specific information to the data subject and the right to obtain human intervention, to express his or her point of view, to obtain an explanation of the decision reached after such assessment and to challenge the decision.

These and other developments have led to a new area called Explainable Artificial Intelligence (XAI), which strives to provide decisions with explanations that can be understood easily by humans, without bias or discrimination. Although this book will not contribute to XAI as such, it aims to provide a mathematically solid basis for such explanations.

In this context it is appropriate to quote Judea Pearl [81] from 1989 about a divide that is still wide today.

To those trained in traditional logics, symbolic reasoning is the standard, and non-monotonicity a novelty. To students of probability, on the other hand, it is symbolic reasoning that is novel, not nonmonotonicity. Dealing with new facts that cause probabilities to change abruptly from very high values to very low values is a commonplace phenomenon in almost every probabilistic exercise and, naturally, has attracted special attention among probabilists. The new challenge for probabilists is to find ways of abstracting out the numerical character of high and low probabilities, and cast them in linguistic terms that reflect the natural process of accepting and retracting beliefs.

This book does not pretend to fill this gap. One of the big embarrassments of the field is that there is no widely accepted symbolic logic for probability, together with proof rules and a denotational semantics. Such a logic will be non-trivial, because it will have to be non-monotonic¹ — a property that many logicians shy away from. This book does aim to contribute towards bridging the divide mentioned by Pearl, by providing a mathematical basis for such a symbolic probabilistic logic, consisting of channels, states, predicates, transformations, conditioning, disintegration, etc.

From the perspective of this book, the structured categorical approach to probability theory began with the work of Bill Lawvere (already in the 1960s) and his student Michèle Giry. They recognised that taking probability distributions has the structure of a monad, which was published in the early 1980s in [33]. Roughly at the same time Dexter Kozen started the systematic investigation of probabilistic programming languages and logics, published in [62, 63]. The monad introduced back then is now called the Giry monad G, whose restriction to finite discrete probability distributions is written as D. Much of this book, certainly in the beginning, concentrates on this discrete form. The language and notation that is used, however, covers both discrete and continuous probability — and quantum probability too (inspired by the general categorical notion of effectus, see [14, 41]).

¹ Informally, a logic is non-monotonic if adding assumptions may make a conclusion less true. For instance, I may think that scientists are civilised people, until, at some conference dinner, a heated scientific debate ends in a fist fight.

Since the early 1980s the area remained relatively calm. It is only in the new millennium that there has been renewed attention, sparked in particular by several developments.

• The growing interest in probabilistic programming languages that incorporate updating (conditioning) and/or higher order features, see e.g. [18, 19, 78, 95].

• The compositional approach to Bayesian networks [17, 29] and to Bayesian reasoning [53, 55].

• The use of categorical and diagrammatic methods in quantum foundations, including quantum probability, see [16] for an overview.

This book builds on these developments.

The intended audience consists of students and professionals — in mathematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic — and with an interest in formal, logically oriented approaches. This book’s goal is not to provide intuitive explanations of probability, like [97], but to provide clear and precise formalisations. Mathematical abstraction (esp. categorical air guitar playing) is not a goal in itself: instead, the book tries to uncover relevant abstractions in concrete problems. It includes several basic algorithms, with a focus on the algorithms’ correctness, not their efficiency. Each section ends with a series of exercises, so that the book can also be used for teaching and/or self-study. It aims at an undergraduate level. No familiarity with category theory is assumed. The basic, necessary notions are explained along the way. People who wish to learn more about category theory can use the references in the text, consult modern introductory texts like [4, 65], or use online resources such as ncatlab.org or Wikipedia.

The first chapter of the book covers introductory material that is meant to set the scene. It starts from basic collection types like lists and powersets, and continues with multisets and distributions, which will receive much attention. The chapter introduces the concept of a channel and shows how channels can be used for state transformation and how channels can be composed, both sequentially and in parallel. This culminates in an illustration of Bayesian networks in terms of (probabilistic) channels. At the same time it is shown how predictions are made about such Bayesian networks via state transformation and via compositional reasoning, basically by translating the network structure into (sequential and parallel) composites of channels.

The second chapter is more logically oriented, via observables (including factors, predicates and events) that can be defined on sample spaces, providing numerical information. The chapter’s three key notions are validity, conditioning, and transformation of observables. The validity, or expected value, ω |= p of an observable p in a state (distribution) ω captures the level of truth. Conditioning ω|p allows us to update a state with evidence given by p. It satisfies several basic properties, including Bayes’ rule. Where the first chapter introduced state transformation along a channel in a forward direction, the second chapter adds observable (predicate) transformation in a backward direction. These two operations are of fundamental importance in program semantics, and also in quantum computation — where they are distinguished as Schrödinger’s (forward) and Heisenberg’s (backward) approach, see [45]. With these three basic notions in place, the techniques of forward inference and backward inference can be defined in a precise manner, via suitable combinations of conditioning and transformation. These inference techniques are illustrated in many examples, including Bayesian networks. The chapter ends with some further illustrations of the use of validity, in uniformly defining distances between states and between predicates, and in precisely defining (co)variance and correlation.

The third chapter is on (directed) graphical techniques. It first introduces string diagrams, which form a graphical language for channels — or, more abstractly, for symmetric monoidal categories. These string diagrams are similar to the graphs used for Bayesian networks, but they have explicit operations for copying and discarding and are thus more expressive. But most importantly, string diagrams have a clear semantics, namely in terms of channels. The chapter illustrates these string diagrams in a channel-based description of the basics of Markov chains and of hidden Markov models. But the most fundamental technique that is introduced in this chapter, via string diagrams, is disintegration. In essence, it is the well-known procedure of extracting a conditional probability p(y | x) from a joint probability p(x, y). One of the key features of the area of probability is that the components of a joint state ‘listen’ to each other, in the sense that if you update in one component, you see a change in another component — unless the joint state is not entwined. One of the themes running through this book is how such ‘crossover’ influence can be captured via channels — extracted from joint states via disintegration — in particular via forward and backward inference. This phenomenon is what makes (reasoning in) Bayesian networks work. Disintegration is of interest in itself, but also leads to the Bayesian inversion of a channel. It gives a technique for reversing the direction of a channel. This corresponds to turning a conditional probability p(y | x) into p(x | y), essentially via Bayes’ rule. This Bayesian inversion is also called the ‘dagger’ of a channel, since it satisfies, as will be shown, the properties of a dagger operation (or conjugate transpose), in Hilbert spaces (and in quantum theory). This dagger operation is actually of logical interest too, since it captures an alternative form of inference along a channel, commonly attributed to Richard Jeffrey. When precisely to use which inference rule is an open question.

Essentially all of the material in these first three chapters is known from the literature, but typically not in the channel-based form in which it is presented here. This book includes many examples, often copied from familiar sources, with the deliberate aim of illustrating how the channel-based approach actually works. Since many of these examples are taken from the literature, the interested reader may wish to compare the channel-based description used here with the original description.

The fourth chapter on learning is mostly new work, of which only a small part has been published so far (in [48]). One way of learning is to ask experts. However, it is more efficient and cheaper to learn from data, if available. This forms the basis of the whole data-driven economy that is now developing. One commonly distinguishes structure learning from parameter learning, where the former is about learning graphical structure in the data. This chapter is about parameter learning, for states and channels, based on what is called Maximum Likelihood Estimation (MLE). In essence, ‘learning’ is seen in this chapter as increasing a validity ω |= p, by adjusting the state ω while keeping the evidence/data p fixed. Hence, learning is about adjusting one’s state to the evidence at hand. The learning techniques used in this chapter are all based on a basic sum-increase lemma from [7]. It is shown how this technique can be used to learn the initial state and channel(s) for Markov chains and hidden Markov models. Such learning is known from the literature, but is described here in channel-based form. This fourth chapter distinguishes data in various forms — such as multisets of points or multisets of factors — and uses different forms of validity, each with their own form of learning. In this way one can recognise different forms of learning, which in the literature are all labeled as Expectation-Maximisation (EM). This EM technique is explored here from a channel-based perspective, and is illustrated in many ways.

Status of the current incomplete version

An incomplete version of this book is made available online, in order to generate feedback and to justify a pause in the writing process. Feedback is most welcome, both positive and negative, especially when it suggests concrete improvements of the text. This may lead to occasional updates of this text. The date on the title page indicates the current version.

Some additional points.


• The (non-trivial) calculations in this book have been carried out with the EfProb library [12] for channel-based probability. When this book reaches its final form, the sources for these calculations will be made public. Also, several calculations in this book are (or: can be) done by hand, typically when the outcomes are described as fractions, like 117/2012. Such calculations are meant to be reconstructable by a motivated reader who really wishes to learn the ‘mechanics’ of the field. Outcomes obtained via EfProb are typically written in decimal notation, like 0.1234, as approximations, or as plots. They serve to give an impression of the results of a computation.

• The plans for the rest of this book, beyond Chapter 4, are not completely fixed yet, but will most likely involve additional chapters on: undirected graphical models, with (reversible) multiset-channels as underlying semantics, probabilistic programming and sampling, continuous probability, and possibly also a bit of causality and quantum probability.

Bart Jacobs, July 28, 2019.


1 Collections and Channels

There are several ways to combine elements from a given set, for instance as lists, subsets, multisets, and as probability distributions. This introductory chapter takes a systematic look at such collections and seeks to bring out many similarities. For instance, lists, subsets and multisets all form a monoid, essentially by suitable unions of collections. Unions of distributions are more subtle and take the form of convex combinations. Also, subsets, multisets and distributions, but not lists, can be combined naturally as parallel products ⊗.

These collections are important in themselves, in many ways, but also as outputs of channels. Channels are functions of the form input → T(output), where T is a ‘collection’ operator, for instance, for lists, subsets, multisets, or distributions. Such channels capture a form of computation, directly related to the form of collection that is produced on the outputs. For instance, channels where T is powerset are used as interpretations of non-deterministic computations, where each input element produces a subset of possible output elements. In the probabilistic case these channels produce distributions, for a different instantiation of the operator T. Channels can be used as elementary units of computation, which can be combined into more complicated computations via sequential and parallel composition.
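As a first impression — a small Python sketch, not taken from the book or from its EfProb library; the function names are ad hoc — a channel is simply a function whose outputs live in a collection type, for example a set of possible results, or a distribution represented as a dictionary of probabilities. Channels are treated properly from Section 1.7 onwards.

```python
# A non-deterministic channel: input -> P(output), returning a set of results.
def neighbours(n: int) -> frozenset:
    return frozenset({n - 1, n + 1})

# A probabilistic channel: input -> D(output), with probabilities as dict values.
def noisy_copy(bit: int) -> dict:
    return {bit: 0.9, 1 - bit: 0.1}

assert neighbours(3) == frozenset({2, 4})
assert abs(sum(noisy_copy(0).values()) - 1.0) < 1e-12   # probabilities add up to 1
```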

Eventually, the chapter will concentrate on distributions, especially in Section 1.9 where an example of a Bayesian network is analysed. It is shown that the conditional probability tables that are associated with nodes in a Bayesian network are instances of probabilistic channels. As a result, one can systematically organise computations in Bayesian networks as suitable (sequential and/or parallel) compositions of channels. This is illustrated via calculation of predicted probabilities.

First, the reader is exposed to some general considerations that set the scene for what is coming. This requires some level of patience. But it is useful to see the similarities between distributions and other collections first, so that constructions, techniques, notation, terminology and intuitions that we use for distributions can be put in a wider perspective and thus may become natural.

The final section of this chapter explains where the abstractions that we use come from, namely from category theory. It gives a quick overview of the most relevant parts of this theory and also how category theory will be used in the remainder of this book to elicit structure in probability.

1.1 Cartesian products

This section briefly reviews some (standard) terminology and notation related to Cartesian products of sets.

Let X1 and X2 be two arbitrary sets. We can form their Cartesian product X1 × X2, as the new set containing all pairs of elements from X1 and X2, as in:

    X1 × X2 ≔ {(x1, x2) | x1 ∈ X1 and x2 ∈ X2}.

(We use the sign ≔ for mathematical definitions.) We thus write (x1, x2) for the ‘pair’ or ‘tuple’ of elements x1 ∈ X1 and x2 ∈ X2. We have just defined a binary product set, constructed from two given sets X1, X2. We can also do this in n-ary form, for n sets X1, . . . , Xn. We then get an n-ary Cartesian product:

    X1 × · · · × Xn ≔ {(x1, . . . , xn) | x1 ∈ X1 and · · · and xn ∈ Xn}.

The tuple (x1, . . . , xn) is sometimes called an n-tuple. For convenience, it may be abbreviated as a vector ~x. The product X1 × · · · × Xn is sometimes written differently using the sign ∏, as:

    ∏_{1≤i≤n} Xi     or more informally as:     ∏ Xi.

In the latter case it is left implicit what the range of the index element i is. We allow n = 0. The resulting ‘empty’ product is then written as a singleton set 1, containing the empty tuple () as sole element, as in:

    1 ≔ {()}.

For n = 1 the product X1 × · · · × Xn is (isomorphic to) the set X1. Notice that if one of the sets Xi in a product X1 × · · · × Xn is empty, then the whole product is empty. Also, if all of the sets Xi are finite, then so is the product X1 × · · · × Xn. In fact, the number of elements of X1 × · · · × Xn is then obtained by multiplying the numbers of elements of the sets Xi.
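For finite sets, the product construction and the counting remark just made can be checked directly in Python with itertools.product (an illustration only; the sample sets are made up):

```python
from itertools import product

X1, X2, X3 = {'a', 'b'}, {0, 1, 2}, {True, False}

# The n-ary product as a set of tuples.
prod = set(product(X1, X2, X3))

# Its size is the product of the sizes of the components.
assert len(prod) == len(X1) * len(X2) * len(X3) == 12

# If one component is empty, the whole product is empty.
assert set(product(X1, set(), X3)) == set()
```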


1.1.1 Projections and tuples

If we have sets X1, . . . , Xn as above, then for each number i with 1 ≤ i ≤ n there is a projection function πi out of the product to the set Xi, as in:

    πi : X1 × · · · × Xn → Xi     given by     πi(x1, . . . , xn) ≔ xi.

This gives us functions out of a product. We also wish to be able to define functions into a product, via tuples of functions: if we have a set Y and n functions f1 : Y → X1, . . . , fn : Y → Xn, then we can form a new function Y → X1 × · · · × Xn, namely:

    〈f1, . . . , fn〉 : Y → X1 × · · · × Xn     via     〈f1, . . . , fn〉(y) ≔ (f1(y), . . . , fn(y)).

There is an obvious result about projecting after tupling of functions:

    πi ◦ 〈f1, . . . , fn〉 = fi.     (1.1)

This is an equality of functions. It can be proven easily by applying both sides to an arbitrary element y ∈ Y.

There are some more ‘obvious’ equations about tupling of functions:

    〈f1, . . . , fn〉 ◦ g = 〈f1 ◦ g, . . . , fn ◦ g〉         〈π1, . . . , πn〉 = id.     (1.2)

The latter function id is the identity function on the product X1 × · · · × Xn.

In a Cartesian product we put sets ‘in parallel’. We can also put functions between them in parallel. Suppose we have n functions fi : Xi → Yi. Then we can form the parallel composition:

    f1 × · · · × fn : X1 × · · · × Xn → Y1 × · · · × Yn

via:

    f1 × · · · × fn = 〈f1 ◦ π1, . . . , fn ◦ πn〉     so that     (f1 × · · · × fn)(x1, . . . , xn) = (f1(x1), . . . , fn(xn)).

The latter formulation clearly shows how the functions fi are applied in parallel to the elements xi.

We overload the product symbol ×, since we use it both for sets and for functions. This may be a bit confusing at first, but it is in fact quite convenient.
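To make these operations concrete, here is a small Python sketch (not from the book; the helpers proj, tuple_map and parallel are ad hoc names) implementing projections, tupling and the parallel product, and checking equation (1.1) on sample functions:

```python
from typing import Callable

def proj(i: int):
    """Projection pi_i on tuples (1-indexed, as in the text)."""
    return lambda xs: xs[i - 1]

def tuple_map(*fs: Callable):
    """Tupling <f1, ..., fn> : Y -> X1 x ... x Xn."""
    return lambda y: tuple(f(y) for f in fs)

def parallel(*fs: Callable):
    """Parallel product f1 x ... x fn, defined via tupling and projections."""
    return tuple_map(*[lambda xs, f=f, i=i: f(proj(i)(xs))
                       for i, f in enumerate(fs, start=1)])

# Check equation (1.1), pi_i o <f1, f2> = f_i, on a few sample elements.
f1 = lambda y: y + 1
f2 = lambda y: y * 2
for y in range(5):
    assert proj(1)(tuple_map(f1, f2)(y)) == f1(y)
    assert proj(2)(tuple_map(f1, f2)(y)) == f2(y)

# The parallel product acts componentwise: (f1 x f2)(3, 4) = (f1(3), f2(4)).
assert parallel(f1, f2)((3, 4)) == (4, 8)
```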


1.1.2 Powers

Let X be an arbitrary set. A power of X is an n-product of X’s, for some n. We write the n-th power of X as Xⁿ, in:

    Xⁿ ≔ X × · · · × X (n times) = {(x1, . . . , xn) | xi ∈ X for each i}.

As special cases we have X¹ = X and X⁰ = 1, where 1 = {()} is the singleton set with the empty tuple () as sole inhabitant. Since powers are special cases of Cartesian products, they come with projection functions πi : Xⁿ → X and tuple functions 〈f1, . . . , fn〉 : Y → Xⁿ for n functions fi : Y → X. Finally, for a function f : X → Y we write fⁿ : Xⁿ → Yⁿ for the obvious n-fold parallelisation of f.

Exercises

1.1.1 Check what a tuple function 〈π2, π3, π6〉 does on a product set X1 × · · · × X8. What is the codomain of this function?

1.1.2 Check that, in general, the tuple function 〈f1, . . . , fn〉 is the unique function h : Y → X1 × · · · × Xn with πi ◦ h = fi for each i.

1.1.3 Prove, using Equations (1.1) and (1.2) for tuples and projections, that:

    (g1 × · · · × gn) ◦ (f1 × · · · × fn) = (g1 ◦ f1) × · · · × (gn ◦ fn).

1.1.4 Check that for each set X there is a unique function X → 1. Because of this property the set 1 is sometimes called ‘final’ or ‘terminal’. The unique function is often denoted by !.
    Check also that a function 1 → X corresponds to an element of X.

1.1.5 Define functions in both directions, using tuples and projections, that yield isomorphisms:

    X × Y ≅ Y × X         1 × X ≅ X         X × (Y × Z) ≅ (X × Y) × Z.

    Try to use Equations (1.1) and (1.2) to prove these isomorphisms, without reasoning with elements.

1.2 Lists

The datatype of (finite) lists of elements from a given set is well-known in computer science, especially in functional programming. This section collects some basic constructions and properties.

For an arbitrary set X we write L(X) for the set of all finite lists [x1, . . . , xn] of elements xi ∈ X, for arbitrary n ∈ N. Notice that we use square brackets [−] for lists, to distinguish them from tuples, which are typically written with round brackets (−).

Thus, the set of lists over X can be defined as a union of all powers of X:

    L(X) ≔ ⋃_{n∈N} Xⁿ.

When the elements of X are letters of an alphabet, then L(X) is the set of words — the language — over this alphabet. The set L(X) is alternatively written as X*, and called the Kleene star of X.

We zoom in on some trivial cases. One has L(0) ≅ 1, since one can only form the empty word over the empty alphabet 0 = ∅. If the alphabet contains only one letter, a word consists of a finite number of occurrences of this single letter. Thus: L(1) ≅ N.

We consider lists as an instance of what we call a collection data type, since L(X) collects elements of X in a certain manner. What distinguishes lists from other collection types is that elements may occur multiple times, and that the order of occurrence matters. The lists [a, b, a] and [a, a, b] and [a, b] differ. As we shall see later on, within a subset, order and multiplicities do not matter, see Section 1.3; and in a multiset the order of elements does not matter, but multiplicities do matter, see Section 1.4.

Let f : X → Y be an arbitrary function. It can be used to map lists over X into lists over Y by applying f element-wise. This is what functional programmers call map-list. Here we like overloading, so we write L(f) : L(X) → L(Y) for this function, defined as:

    L(f)([x1, . . . , xn]) ≔ [f(x1), . . . , f(xn)].

Thus, L is an operation that not only sends sets to sets, but also functions to functions. It does so in such a way that identity maps and compositions are preserved:

    L(id) = id         L(g ◦ f) = L(g) ◦ L(f).

We shall say: L is functorial.

Functoriality can be used to define the marginal of a list on a product set, via L(πi), where πi is a projection map. For instance, let α ∈ L(X × Y) be of the form α = [(x1, y1), . . . , (xn, yn)]. The first marginal L(π1)(α) ∈ L(X) is then computed as:

    L(π1)(α) = L(π1)([(x1, y1), . . . , (xn, yn)]) = [π1(x1, y1), . . . , π1(xn, yn)] = [x1, . . . , xn].
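As a quick illustration in Python (a sketch, not part of the text; list_map is an ad hoc name for the functorial action), mapping is element-wise application and the first marginal of a list of pairs is mapping with the first projection:

```python
def list_map(f, xs):
    """The functorial action L(f): apply f element-wise."""
    return [f(x) for x in xs]

# Functor laws on a sample list: L(id) = id and L(g o f) = L(g) o L(f).
xs = [1, 2, 3]
f = lambda x: x + 1
g = lambda x: 2 * x
assert list_map(lambda x: x, xs) == xs
assert list_map(lambda x: g(f(x)), xs) == list_map(g, list_map(f, xs))

# First marginal of a list of pairs, via the projection pi_1.
pairs = [("x1", "y1"), ("x2", "y2"), ("x3", "y3")]
assert list_map(lambda p: p[0], pairs) == ["x1", "x2", "x3"]
```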


1.2.1 Monoids

A monoid is a very basic mathematical structure, sometimes called semigroup. For convenience we define it explicitly.

Definition 1.2.1. A monoid consists of a set M with a special ‘unit’ element u ∈ M and a binary operation M × M → M, written for instance as infix +, which is associative and has u as unit on both sides. That is, for all a, b, c ∈ M,

    a + (b + c) = (a + b) + c     and     u + a = a = a + u.

The monoid is called commutative if a + b = b + a, for all a, b ∈ M. It is called idempotent if a + a = a for all a ∈ M.

Let (M, u, +) and (N, v, ·) be two monoids. A function f : M → N is called a homomorphism of monoids if f preserves the unit and binary operation, in the sense that:

    f(u) = v     and     f(a + b) = f(a) · f(b), for all a, b ∈ M.

For brevity we also say that such an f is a map of monoids, or simply a monoid map.

The natural numbers N with addition form a commutative monoid (N, 0, +). But also with multiplication they form a commutative monoid (N, 1, ·). The function f(n) = 2ⁿ is a homomorphism of monoids f : (N, 0, +) → (N, 1, ·).

Various forms of collection types form monoids, with ‘union’ as binary operation. We start with lists, in the next result. The proof is left as (easy) exercise to the reader.

Lemma 1.2.2. 1 For each set X, the set L(X) of lists over X is a monoid, with the empty list [] ∈ L(X) as unit element, and with concatenation ++ : L(X) × L(X) → L(X) as binary operation:

    [x1, . . . , xn] ++ [y1, . . . , ym] ≔ [x1, . . . , xn, y1, . . . , ym].

This monoid (L(X), [], ++) is neither commutative nor idempotent.

2 For each function f : X → Y the associated map L(f) : L(X) → L(Y) is a homomorphism of monoids.
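A minimal Python check of Lemma 1.2.2 (illustrative only; list_map is the ad hoc name used before): concatenation gives the monoid structure on lists, and mapping a function over lists preserves the empty list and concatenation:

```python
def list_map(f, xs):
    """The functorial action L(f)."""
    return [f(x) for x in xs]

f = str  # any function X -> Y will do
alpha, beta = [1, 2, 1], [3, 1]

# Monoid structure: unit [] and concatenation ++ (Python's +).
assert alpha + [] == alpha == [] + alpha

# L(f) is a homomorphism of monoids: it preserves [] and ++.
assert list_map(f, []) == []
assert list_map(f, alpha + beta) == list_map(f, alpha) + list_map(f, beta)
```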

1.2.2 Unit and flatten for lists

Each element x ∈ X can be turned into a singleton list [x] ∈ L(X). We introduce a separate function unit : X → L(X) for this, so unit(x) ≔ [x].

There is also a ‘flatten’ function which turns a list of lists into a list by removing inner brackets [−]. This function is written as flat : L(L(X)) → L(X). It is defined as:

    flat([[x11, . . . , x1n1], . . . , [xk1, . . . , xknk]]) ≔ [x11, . . . , x1n1, . . . , xk1, . . . , xknk].

The next result contains some basic properties about unit and flatten. These properties will first be formulated in terms of equations, and then, alternatively, as commuting diagrams. The latter style is preferred in this book.

Lemma 1.2.3. 1 For each function f : X → Y one has:

    unit ◦ f = L(f) ◦ unit     and     flat ◦ L(L(f)) = L(f) ◦ flat.

Equivalently, the following two diagrams commute.

          unit                             flat
     X ---------> L(X)           L(L(X)) --------> L(X)
     |              |               |                |
     | f            | L(f)          | L(L(f))        | L(f)
     v              v               v                v
     Y ---------> L(Y)           L(L(Y)) --------> L(Y)
          unit                             flat

2 One further has:

    flat ◦ unit = id = flat ◦ L(unit)         flat ◦ flat = flat ◦ L(flat).

These two equations can equivalently be expressed via commutation of:

            unit               L(unit)                              flat
    L(X) ---------> L(L(X)) <---------- L(X)        L(L(L(X))) ----------> L(L(X))
        \              |              /                   |                    |
         \             | flat        /                    | L(flat)            | flat
       id \            v            / id                  v                    v
           `-------> L(X) <--------´                 L(L(X)) ----------------> L(X)
                                                                  flat

Proof. We shall do the first cases of each point, leaving the second cases to the interested reader. First, for f : X → Y and x ∈ X one has:

    (L(f) ◦ unit)(x) = L(f)(unit(x)) = L(f)([x]) = [f(x)] = unit(f(x)) = (unit ◦ f)(x).

Next, for the second point we take an arbitrary list [x1, . . . , xn] ∈ L(X). Then:

    (flat ◦ unit)([x1, . . . , xn]) = flat([[x1, . . . , xn]]) = [x1, . . . , xn]
    (flat ◦ L(unit))([x1, . . . , xn]) = flat([unit(x1), . . . , unit(xn)]) = flat([[x1], . . . , [xn]]) = [x1, . . . , xn].


The equations in point 1 of this lemma are so-called naturality equations. They express that unit and flat work uniformly, independent of the set X involved. The equations in point 2 show that L is a monad, see Section 1.10 for more information.
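The unit and flatten maps for lists are easy to program; the following Python sketch (the names unit and flat follow the text, nothing else is assumed) checks the naturality and monad equations of Lemma 1.2.3 on sample inputs:

```python
def list_map(f, xs):
    return [f(x) for x in xs]

def unit(x):
    """unit : X -> L(X), forming a singleton list."""
    return [x]

def flat(xss):
    """flat : L(L(X)) -> L(X), removing inner brackets."""
    return [x for xs in xss for x in xs]

f = lambda n: n * 10
xs = [1, 2, 3]
xss = [[1, 2], [], [3]]
xsss = [[[1], [2, 3]], [[4]]]

# Naturality: unit o f = L(f) o unit, and flat o L(L(f)) = L(f) o flat.
assert unit(f(5)) == list_map(f, unit(5))
assert flat(list_map(lambda ys: list_map(f, ys), xss)) == list_map(f, flat(xss))

# Monad equations: flat o unit = id = flat o L(unit), flat o flat = flat o L(flat).
assert flat(unit(xs)) == xs == flat(list_map(unit, xs))
assert flat(flat(xsss)) == flat(list_map(flat, xsss))
```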

Exercises

1.2.1 Let X = {a, b, c} and Y = {u, v} be sets with a function f : X → Y given by f(a) = u = f(c) and f(b) = v. Write α = [c, a, b, a] and β = [b, c, c, c]. Compute consecutively:

    • α ++ β
    • β ++ α
    • α ++ α
    • α ++ (β ++ α)
    • (α ++ β) ++ α
    • L(f)(α)
    • L(f)(β)
    • L(f)(α) ++ L(f)(β)
    • L(f)(α ++ β).

1.2.2 We write log for the logarithm function with base 2, so that log(x) = y iff x = 2ʸ. Verify that logarithm is a map of monoids:

    log : (R>0, 1, ·) → (R, 0, +).

    Often the log function is used to simplify a computation, by turning multiplications into additions. Then one uses that log is precisely this homomorphism of monoids. (An additional useful property is that log is monotone: it preserves the order.)

1.2.3 Define a length function len : L(X) → N on lists in the obvious way as len([x1, . . . , xn]) = n.

    1 Prove that len is a map of monoids (L(X), [], ++) → (N, 0, +).
    2 Write ! for the unique function X → 1 and check that len is L(!). Notice that the previous point can then be seen as an instance of Lemma 1.2.2 (2).

1.2.4 Let X be an arbitrary set and (M, 0, +) an arbitrary monoid, with a function f : X → M. Prove that there is a unique homomorphism of monoids f̄ : (L(X), [], ++) → (M, 0, +) with f̄ ◦ unit = f.

    This expresses that L(X) is the free monoid on the set X. The homomorphism f̄ is called the free extension of f.


1.3 Powersets

The next collection type that will be studied is powerset P. We will see that there are many similarities with lists L from the previous section.

For an arbitrary set X we write P(X) for the set of all subsets of X and Pfin(X) for the set of finite subsets. Thus:

    P(X) ≔ {U | U ⊆ X}     and     Pfin(X) ≔ {U ∈ P(X) | U is finite}.

If X is a finite set itself, there is no difference between P(X) and Pfin(X). In the sequel we shall talk mostly about P, but basically all properties of interest hold for Pfin as well.

First of all P is functorial. Given a function f : X → Y we can define a new function P(f) : P(X) → P(Y) by taking the image of f on a subset. Explicitly, for U ⊆ X,

    P(f)(U) = {f(x) | x ∈ U}.

The right-hand side is clearly a subset of Y, and thus an element of P(Y). We have equations:

    P(id) = id         P(g ◦ f) = P(g) ◦ P(f).

Again we can use functoriality for marginalisation: for a subset (relation) R ⊆ X × Y on a product set we get its first marginal P(π1)(R) ∈ P(X) as subset:

    P(π1)(R) = {π1(z) | z ∈ R} = {π1(x, y) | (x, y) ∈ R} = {x | ∃y. (x, y) ∈ R}.
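Concretely, in Python (a sketch with ad hoc names, using frozensets for subsets), the functorial action P(f) is the image of a subset, and the first marginal of a relation is the image under the first projection:

```python
def pow_map(f, u):
    """The functorial action P(f): the image of the subset u under f."""
    return frozenset(f(x) for x in u)

# A relation R on {1, 2} x {'a', 'b'}, as a subset of the product.
R = frozenset({(1, 'a'), (1, 'b'), (2, 'a')})

# First marginal P(pi_1)(R) = {x | exists y. (x, y) in R}.
assert pow_map(lambda p: p[0], R) == frozenset({1, 2})
```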

The next topic is the monoid structure on powersets.

Lemma 1.3.1. 1 For each set X, the powerset P(X) is a commutative and idempotent monoid, with the empty subset ∅ ∈ P(X) as neutral element and union ∪ of subsets of X as binary operation.

2 Each P(f) : P(X) → P(Y) is a map of monoids, for f : X → Y.

Next we define unit and flatten maps for powerset, much like for lists. The function unit : X → P(X) sends an element to a singleton subset: unit(x) ≔ {x}. The flatten function flat : P(P(X)) → P(X) is given by union: for A ⊆ P(X),

    flat(A) ≔ ⋃ A = {x ∈ X | ∃U ∈ A. x ∈ U}.

We mention, without proof, the following analogue of Lemma 1.2.3.


Lemma 1.3.2. 1 For each function f : X → Y the following ‘naturality’ diagrams commute.

          unit                             flat
     X ---------> P(X)           P(P(X)) --------> P(X)
     |              |               |                |
     | f            | P(f)          | P(P(f))        | P(f)
     v              v               v                v
     Y ---------> P(Y)           P(P(Y)) --------> P(Y)
          unit                             flat

2 Additionally, the ‘monad’ diagrams below commute.

            unit               P(unit)                              flat
    P(X) ---------> P(P(X)) <---------- P(X)        P(P(P(X))) ----------> P(P(X))
        \              |              /                   |                    |
         \             | flat        /                    | P(flat)            | flat
       id \            v            / id                  v                    v
           `-------> P(X) <--------´                 P(P(X)) ----------------> P(X)
                                                                  flat
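Again a small Python check (illustrative only) of the unit and flatten maps for powersets, with frozensets so that sets of sets can be represented:

```python
def pow_map(f, u):
    return frozenset(f(x) for x in u)

def unit(x):
    """unit : X -> P(X), forming a singleton subset."""
    return frozenset({x})

def flat(a):
    """flat : P(P(X)) -> P(X), the union of a set of subsets."""
    return frozenset(x for u in a for x in u)

u = frozenset({1, 2, 3})

# Monad equations: flat o unit = id = flat o P(unit).
assert flat(unit(u)) == u == flat(pow_map(unit, u))

# flat o flat = flat o P(flat), on a set of sets of sets.
aa = frozenset({frozenset({frozenset({1}), frozenset({2, 3})}),
                frozenset({frozenset({3, 4})})})
assert flat(flat(aa)) == flat(pow_map(flat, aa))
```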

1.3.1 From list to powerset

We have seen that lists and subsets behave in a similar manner. We strengthen this connection by defining a support function supp : L(X) → Pfin(X) between them, via:

    supp([x1, . . . , xn]) ≔ {x1, . . . , xn}.

Thus, the support of a list is the subset of elements occurring in the list. The support function removes order and multiplicities. The latter happens implicitly, via the set notation, above on the right-hand side. For instance,

    supp([b, a, b, b, b]) = {a, b} = {b, a}.

Notice that there is no way to go in the other direction, namely Pfin(X) → L(X). Of course, one can for each subset choose an order of the elements in order to turn the subset into a list. However, this process is completely arbitrary and is not uniform (natural).

The support function interacts nicely with the structures that we have seen so far. This is expressed in the result below, where we use the same notation unit and flat for different functions, namely for L and for P. The context, and especially the types, of these operations will make clear which one is meant.

Lemma 1.3.3. Consider the support map supp : L(X)→ Pfin(X) defined above.

1 It is a map of monoids (L(X), [],++)→ (P(X), ∅,∪).


2 It is natural, in the sense that for f : X → Y one has:

                supp
     L(X) --------------> Pfin(X)
      |                      |
      | L(f)                 | Pfin(f)
      v                      v
     L(Y) --------------> Pfin(Y)
                supp

3 It commutes with the units and flattens of list and powerset, as in:

      X ================ X                          L(supp)               supp
      |                  |             L(L(X)) -------------> L(Pfin(X)) ------> Pfin(Pfin(X))
      | unit             | unit           |                                           |
      v                  v                | flat                                      | flat
     L(X) --- supp ---> Pfin(X)           v                                           v
                                        L(X) ---------------------------------------> Pfin(X)
                                                              supp

Proof. The first point is easy and skipped. For point 2,

    (Pfin(f) ◦ supp)([x1, . . . , xn]) = Pfin(f)(supp([x1, . . . , xn]))
                                      = Pfin(f)({x1, . . . , xn})
                                      = {f(x1), . . . , f(xn)}
                                      = supp([f(x1), . . . , f(xn)])
                                      = supp(L(f)([x1, . . . , xn]))
                                      = (supp ◦ L(f))([x1, . . . , xn]).

In point 3, commutation of the first diagram is easy:

    (supp ◦ unit)(x) = supp([x]) = {x} = unit(x).

The second diagram requires a bit more work. Starting from a list of lists we get:

    (flat ◦ supp ◦ L(supp))([[x11, . . . , x1n1], . . . , [xk1, . . . , xknk]])
        = (⋃ ◦ supp)([supp([x11, . . . , x1n1]), . . . , supp([xk1, . . . , xknk])])
        = (⋃ ◦ supp)([{x11, . . . , x1n1}, . . . , {xk1, . . . , xknk}])
        = ⋃ ({{x11, . . . , x1n1}, . . . , {xk1, . . . , xknk}})
        = {x11, . . . , x1n1, . . . , xk1, . . . , xknk}
        = supp([x11, . . . , x1n1, . . . , xk1, . . . , xknk])
        = (supp ◦ flat)([[x11, . . . , x1n1], . . . , [xk1, . . . , xknk]]).
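The support map and the properties of Lemma 1.3.3 can be tested mechanically; a brief Python illustration (ad hoc helper names, frozensets for subsets):

```python
def list_map(f, xs):
    return [f(x) for x in xs]

def pow_map(f, u):
    return frozenset(f(x) for x in u)

def supp(xs):
    """supp : L(X) -> Pfin(X), forgetting order and multiplicities."""
    return frozenset(xs)

f = lambda c: c.upper()
alpha, beta = ['b', 'a', 'b'], ['c', 'b']

# Monoid map: supp(alpha ++ beta) = supp(alpha) u supp(beta).
assert supp(alpha + beta) == supp(alpha) | supp(beta)

# Naturality: Pfin(f) o supp = supp o L(f).
assert pow_map(f, supp(alpha)) == supp(list_map(f, alpha))
```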

1.3.2 Extraction

So far we have concentrated on how similar lists and subsets are: the only structural difference that we have seen up to now is that subsets form an idempotent and commutative monoid. But there are other important differences. Here we look at subsets of product sets, also known as relations.

The observation is that one can extract functions from a relation R ⊆ X × Y, namely functions of the form extr1(R) : X → P(Y) and extr2(R) : Y → P(X), given by:

    extr1(R)(x) = {y ∈ Y | (x, y) ∈ R}         extr2(R)(y) = {x ∈ X | (x, y) ∈ R}.

In fact, one can easily reconstruct the relation R from extr1(R), and also from extr2(R), via:

    R = {(x, y) | y ∈ extr1(R)(x)} = {(x, y) | x ∈ extr2(R)(y)}.

This all looks rather trivial, but such function extraction is less trivial for other data types, as we shall see later on, for distributions, where it will be called disintegration.

In general, we write B^A for the set of functions from A to B. Using this notation we can summarise the situation as: there are isomorphisms:

    P(Y)^X ≅ P(X × Y) ≅ P(X)^Y.     (1.3)

Functions of the form A → P(B) will later be called ‘channels’ from A to B, see Section 1.7. What we have just seen will then be described in terms of ‘extraction of channels’.
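In code, extraction and reconstruction are one-liners; the following Python sketch (the names extr1 and extr2 follow the text, the sample relation is made up) shows the round trip for a small relation:

```python
def extr1(R):
    """extr1(R) : X -> P(Y), sending x to {y | (x, y) in R}."""
    return lambda x: frozenset(y for (x2, y) in R if x2 == x)

def extr2(R):
    """extr2(R) : Y -> P(X), sending y to {x | (x, y) in R}."""
    return lambda y: frozenset(x for (x, y2) in R if y2 == y)

X, Y = {1, 2}, {'a', 'b'}
R = frozenset({(1, 'a'), (1, 'b'), (2, 'a')})

# Reconstruct R from either extracted 'channel'.
assert R == frozenset((x, y) for x in X for y in extr1(R)(x))
assert R == frozenset((x, y) for y in Y for x in extr2(R)(y))
```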

Exercises

1.3.1 Continuing Exercise 1.2.1, compute:

    • supp(α)
    • supp(β)
    • supp(α ++ β)
    • supp(α) ∪ supp(β)
    • supp(L(f)(α))
    • Pfin(f)(supp(α)).

1.3.2 We have used finite unions (∅, ∪) as monoid structure on P(X) in Lemma 1.3.1 (1). Intersections (X, ∩) give another monoid structure on P(X).

    1 Show that the negation/complement function ¬ : P(X) → P(X), given by:

        ¬U = X − U = {x ∈ X | x ∉ U},

      is a homomorphism of monoids between (P(X), ∅, ∪) and (P(X), X, ∩), in both directions. In fact, it forms an isomorphism of monoids, since ¬¬U = U.

    2 Prove that the intersection monoid structure is not preserved by maps P(f) : P(X) → P(Y).
      Hint: Look at preservation of the unit X ∈ P(X).

1.3.3 Let X be a set and (M, 0, +) a commutative idempotent monoid, with a function f : X → M between them. Prove that there is a unique homomorphism of monoids f̄ : (Pfin(X), ∅, ∪) → (M, 0, +) with f̄ ◦ unit = f.

1.4 Multisets

So far we have discussed two collection data types, namely lists and subsets of elements. In lists, elements occur in a particular order, and may occur multiple times (at different positions). Both properties are lost when moving from lists to sets. In this section we look at multisets, which are ‘sets’ in which elements may occur multiple times. Hence multisets are somehow in between lists and subsets, since they do allow multiple occurrences, but no order.

The list and powerset examples are somewhat remote from probability theory. But multisets are much more directly relevant: first, because we use a similar notation for multisets and distributions; and second, because observed data can be organised nicely in terms of multisets. For instance, for statistical analysis, a document is often seen as a multiset of words, in which one keeps track of the words that occur in the document together with their frequency (multiplicity); in that case, the order of the words is ignored. Also, tables with observed data can be organised naturally as multisets; learning from such tables will be described in Section 1.5 as a (natural) transformation from multisets to distributions.

We start with an introduction about notation, terminology and conventions for multisets. Consider a set A = {a, b, c}. An example of a multiset over A is:

2|a〉 + 5|b〉 + 0|c〉

In this multiset the element a occurs 2 times, b occurs 5 times, and c does not occur. The funny brackets | − 〉 are called ket notation; this is frequently used in quantum theory. Here it is meaningless notation, used to separate the natural numbers, used as multiplicities, from the elements of A.

Let X be an arbitrary set. Following the above ket-notation, a (finite) multiset over X is an expression of the form:

    n1| x1 〉 + · · · + nk| xk 〉     where ni ∈ N and xi ∈ X.

This expression is a formal sum, not an actual sum (for instance in R). We may write it as ∑_i ni| xi 〉. We use the convention:

• 0| x〉 may be omitted; but it may also be written explicitly in order to emphasise that the element x does not occur in a multiset;
• a sum n| x〉 + m| x〉 is the same as (n + m)| x〉;
• the order and brackets (if any) in a sum do not matter.

Thus, for instance, there is an equality of multisets:

    2|a〉 + (5|b〉 + 0|c〉) + 4|b〉 = 9|b〉 + 2|a〉.

There is an alternative description of multisets. A multiset can be defined as a function ϕ : X → N that has finite support. The support supp(ϕ) ⊆ X is the set supp(ϕ) = {x ∈ X | ϕ(x) ≠ 0}. For each element x ∈ X the number ϕ(x) ∈ N tells how often x occurs in the multiset ϕ. Such a function ϕ can also be written as a formal sum ∑_x ϕ(x)| x〉, where x ranges over supp(ϕ).

For instance, the multiset 9|b〉 + 2|a〉 over A = {a, b, c} corresponds to the function ϕ : A → N given by ϕ(a) = 2, ϕ(b) = 9, ϕ(c) = 0. Its support is thus {a, b} ⊆ A.

We shall freely switch back and forth between the ket-description and the function-description of multisets, and use whichever form is most convenient for the goal at hand.

Having said this, we stretch the idea of a multiset and do not only allow natural numbers n ∈ N as multiplicities, but also allow non-negative numbers r ∈ R≥0. Thus we can have a multiset of the form 3/2 |a〉 + π|b〉, where π ∈ R≥0 is the famous Archimedes’ constant. This added generality will be useful at times, although many examples of multisets will simply have natural numbers as multiplicities. We call such multisets natural.

Definition 1.4.1. For a set X we shall write M(X) for the set of all multisets over X. Thus, using the function approach:

    M(X) ≔ {ϕ : X → R≥0 | supp(ϕ) is finite}.

The elements of M(X) may be called mass functions, as in [89].

We shall write N(X) ⊆ M(X) for the subset of ‘natural’ multisets, with natural numbers as multiplicities. Thus, N(X) contains functions ϕ ∈ M(X) with ϕ(x) ∈ N, for all x ∈ X.


We shall write M∗(X) for the set of non-empty multisets. Thus:

    M∗(X) ≔ {ϕ ∈ M(X) | supp(ϕ) is non-empty} = {ϕ : X → R≥0 | supp(ϕ) is finite and non-empty}.

Similarly, N∗(X) ⊆ M∗(X) contains the non-empty natural multisets.

All of M, N, M∗, N∗ are functorial, in the same way. Hence we concentrate on M. For a function f : X → Y we can define M(f) : M(X) → M(Y) in two equivalent ways:

    M(f)(∑_i ri| xi 〉) ≔ ∑_i ri| f(xi)〉     or as     M(f)(ϕ)(y) ≔ ∑_{x ∈ f⁻¹(y)} ϕ(x).

It may take a bit of effort to see that these descriptions are the same, see Exercise 1.4.1 below. Notice that in the sum ∑_i ri| f(xi)〉 it may happen that f(xi1) = f(xi2) for xi1 ≠ xi2, so that ri1 and ri2 are added. Thus, the support of M(f)(∑_i ri| xi 〉) may have fewer elements than the support of ∑_i ri| xi 〉, but the sum of all multiplicities is the same in M(f)(∑_i ri| xi 〉) and ∑_i ri| xi 〉.
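Both descriptions of M(f) are easy to implement; the Python sketch below (plain dicts from elements to multiplicities, an ad hoc representation with made-up helper names) computes the functorial action in ket style and in function style and checks that they agree:

```python
from collections import defaultdict

def mset_map_ket(f, phi):
    """M(f) in ket style: send each r|x> to r|f(x)> and add up coinciding kets."""
    out = defaultdict(float)
    for x, r in phi.items():
        out[f(x)] += r
    return dict(out)

def mset_map_fun(f, phi, codomain):
    """M(f) in function style: M(f)(phi)(y) = sum of phi(x) over x with f(x) = y."""
    out = {}
    for y in codomain:
        s = sum(r for x, r in phi.items() if f(x) == y)
        if s != 0:
            out[y] = s
    return out

phi = {'a': 2.0, 'b': 5.0, 'c': 1.0}              # the multiset 2|a> + 5|b> + 1|c>
f = lambda x: 'u' if x in {'a', 'c'} else 'v'     # a and c are identified by f

assert mset_map_ket(f, phi) == mset_map_fun(f, phi, {'u', 'v'}) == {'u': 3.0, 'v': 5.0}
```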

Lemma 1.4.2. 1 The set M(X) of multisets over X is a commutative monoid. In functional form, addition and the zero (unit) element 0 ∈ M(X) are defined as:

    (ϕ + ψ)(x) ≔ ϕ(x) + ψ(x)         0(x) ≔ 0.

These sums restrict to N(X).

2 The set M(X) is also a cone: it is closed under ‘scalar’ multiplication with non-negative numbers r ∈ R≥0, via:

    (r · ϕ)(x) ≔ r · ϕ(x).

This scalar multiplication r · (−) : M(X) → M(X) preserves the sums (0, +) from the previous point, and is thus a map of monoids.

3 For each f : X → Y, the function M(f) : M(X) → M(Y) is a map of monoids and also of cones. The latter means: M(f)(r · ϕ) = r · M(f)(ϕ).

The element 0 ∈ M(X) used in point (1) is the empty multiset. Similarly, the sum + used there may be understood as an appropriate union of multisets. This union is implicit in the ket-notation. The set N(X) is not closed in general under scalar multiplication with r ∈ R≥0. It is closed under scalar multiplication with n ∈ N, but such multiplications add nothing new since they can also be described via repeated addition.
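With the same dict representation as before, the monoid and cone structure of Lemma 1.4.2 is pointwise addition and scaling; a brief sketch with hypothetical helper names:

```python
def mset_add(phi, psi):
    """Pointwise sum of two multisets: (phi + psi)(x) = phi(x) + psi(x)."""
    return {x: phi.get(x, 0) + psi.get(x, 0) for x in set(phi) | set(psi)}

def mset_scale(r, phi):
    """Scalar multiplication: (r . phi)(x) = r * phi(x)."""
    return {x: r * m for x, m in phi.items()}

phi, psi = {'a': 2, 'b': 5}, {'b': 4, 'c': 1}
assert mset_add(phi, psi) == {'a': 2, 'b': 9, 'c': 1}
assert mset_scale(0.5, phi) == {'a': 1.0, 'b': 2.5}
```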


1.4.1 Tables of data as multisets

Consider the table (1.4) below where we have combined numeric information about blood pressure (either high H or low L) and certain medicines (either type 1 or type 2 or no medicine, indicated as 0). There is data about 100 study participants:

              no medicine    medicine 1    medicine 2    totals
    high           10            35            25           70
    low             5            10            15           30
    totals         15            45            40          100
                                                                     (1.4)

We claim that we can capture this table as a (natural) multiset. To do so, we first form sets B = {H, L} for blood pressure values, and M = {0, 1, 2} for types of medicine. The above table can then be described as a multiset τ over the product B × M, that is, as an element τ ∈ N(B × M), namely:

τ = 10|H, 0〉 + 35|H, 1〉 + 25|H, 2〉 + 5|L, 0〉 + 10|L, 1〉 + 15|L, 2〉.

We see that Table (1.4) contains ‘totals’ in its vertical and horizontal margins. They can be obtained from τ as marginals, using the functoriality of N. This works as follows.

    N(π1)(τ) = 10|π1(H, 0)〉 + 35|π1(H, 1)〉 + 25|π1(H, 2)〉 + 5|π1(L, 0)〉 + 10|π1(L, 1)〉 + 15|π1(L, 2)〉
             = 10|H 〉 + 35|H 〉 + 25|H 〉 + 5|L〉 + 10|L〉 + 15|L〉
             = 70|H 〉 + 30|L〉.

    N(π2)(τ) = (10 + 5)|0〉 + (35 + 10)|1〉 + (25 + 15)|2〉
             = 15|0〉 + 45|1〉 + 40|2〉.

In Section 1.5 we describe how to obtain probabilities from tables in a systematic manner.
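The table-to-multiset step and the two marginals can be reproduced in a few lines of Python (same ad hoc dict representation as before; mset_map is the functorial action sketched earlier):

```python
from collections import defaultdict

def mset_map(f, phi):
    """Functorial action M(f) (equivalently N(f)) on a multiset-as-dict."""
    out = defaultdict(int)
    for x, r in phi.items():
        out[f(x)] += r
    return dict(out)

# The multiset tau from Table (1.4), over B x M with B = {H, L}, M = {0, 1, 2}.
tau = {('H', 0): 10, ('H', 1): 35, ('H', 2): 25,
       ('L', 0): 5,  ('L', 1): 10, ('L', 2): 15}

# Marginals via the projections, reproducing the 'totals' of the table.
assert mset_map(lambda p: p[0], tau) == {'H': 70, 'L': 30}
assert mset_map(lambda p: p[1], tau) == {0: 15, 1: 45, 2: 40}
```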

1.4.2 Unit and flatten for multisets

As may be expected by now, there are also unit and flatten maps for multisets. The unit function unit : X → M(X) is simply unit(x) ≔ 1| x〉. Flattening involves turning a multiset of multisets into a multiset. Concretely, this is done as:

    flat( 1/3 | 2|a〉 + 2|c〉 〉 + 5 | 1|b〉 + 1/6 |c〉 〉 ) = 2/3 |a〉 + 5|b〉 + 3/2 |c〉.


More generally, flattening is the function flat : M(M(X)) → M(X) with:

    flat(∑_i ri|ϕi 〉) ≔ ∑_x (∑_i ri · ϕi(x)) | x〉.

Notice that the big outer sum ∑_x is a formal one, whereas the inner sum ∑_i is an actual one, in R≥0, see the earlier example.

The following result does not come as a surprise anymore.

Lemma 1.4.3. 1 For each function f : X → Y one has:

          unit                             flat
     X ---------> M(X)           M(M(X)) --------> M(X)
     |              |               |                |
     | f            | M(f)          | M(M(f))        | M(f)
     v              v               v                v
     Y ---------> M(Y)           M(M(Y)) --------> M(Y)
          unit                             flat

2 Moreover:

            unit               M(unit)                              flat
    M(X) ---------> M(M(X)) <---------- M(X)        M(M(M(X))) ----------> M(M(X))
        \              |              /                   |                    |
         \             | flat        /                    | M(flat)            | flat
       id \            v            / id                  v                    v
           `-------> M(X) <--------´                 M(M(X)) ----------------> M(X)
                                                                  flat
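A small Python check of unit and flat for multisets, reproducing the numerical example above; the outer multiset of multisets is represented as a list of (coefficient, multiset) pairs, since dicts are not hashable (an ad hoc choice, not the book's notation):

```python
def mset_unit(x):
    """unit : X -> M(X), the singleton multiset 1|x>."""
    return {x: 1}

def mset_flat(outer):
    """flat : M(M(X)) -> M(X); outer is a list of (r_i, phi_i) pairs."""
    out = {}
    for r, phi in outer:
        for x, m in phi.items():
            out[x] = out.get(x, 0) + r * m
    return out

# flat( 1/3|2|a> + 2|c>>  +  5|1|b> + 1/6|c>> ) = 2/3|a> + 5|b> + 3/2|c>.
outer = [(1/3, {'a': 2, 'c': 2}), (5, {'b': 1, 'c': 1/6})]
result = mset_flat(outer)
assert abs(result['a'] - 2/3) < 1e-12
assert abs(result['b'] - 5) < 1e-12
assert abs(result['c'] - 3/2) < 1e-12
```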

1.4.3 From list to multiset, and then to powerset

For a multiset ϕ we have already used the ‘support’ definition supp(ϕ) = {x | ϕ(x) ≠ 0}. This yields a map supp : M(X) → Pfin(X), which is well-behaved, in the sense that it is natural and preserves the monoid structures on M(X) and Pfin(X).

In the previous section we have seen a support map from lists L to finite powersets Pfin. This support map factorises through multisets: the triangle of support functions

   L(X) −supp→ N(X) −supp→ Pfin(X)   =   L(X) −supp→ Pfin(X)      (1.5)

commutes.

What is missing is the support map supp : L(X) → N(X). Intuitively, it counts how many times an element occurs in a list, while ignoring the order of occurrences. Thus, for a list α ∈ L(X),

   supp(α)(x) ≔ n   if x occurs n times in the list α.      (1.6)

More concretely:

   supp([a, b, a, b, c, b, b]) = 2|a〉 + 4|b〉 + 1|c〉.


The above diagram (1.5) expresses what we have stated informally at the beginning of this section, namely that multisets are somehow in between lists and subsets.

1.4.4 Max and argmax

Given a multiset we can determine the maximum multiplicity and the set of elements for which this maximum is reached. The latter is referred to as ‘argmax’. These notions are well-defined since we only consider finite multisets. Thus, for:

   ϕ = 4|a〉 + 3/2|b〉 + 4|c〉 + 3|d〉   one has   max ϕ = 4   and   argmax ϕ = {a, c}.

This is formalised next.

Definition 1.4.4. For an arbitrary set X there are functions:

   max : M(X) → R≥0       argmax : M(X) → Pfin(X)

defined as follows.

• max ϕ is the least number s ∈ R≥0 with ϕ(x) ≤ s for all x ∈ X; this means that max ϕ is the supremum of the subset {ϕ(x) | x ∈ X} ⊆ R≥0.
• argmax ϕ = {x ∈ X | ϕ(x) = max ϕ}.

The following equivalent notations are used:

   max ϕ = max_{x∈X} ϕ(x) = max {ϕ(x) | x ∈ X}       argmax ϕ = argmax_{x∈X} ϕ(x).
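As a quick illustration, the following sketch (our own helper functions, in the dictionary representation used earlier) computes max and argmax of the multiset from the example above.

```python
from fractions import Fraction as F

def mset_max(phi):
    # max of a (finite, non-empty) multiset: its largest multiplicity
    return max(phi.values())

def mset_argmax(phi):
    # the set of elements at which the maximal multiplicity is reached
    m = mset_max(phi)
    return {x for x, r in phi.items() if r == m}

phi = {'a': 4, 'b': F(3, 2), 'c': 4, 'd': 3}
print(mset_max(phi))      # 4
print(mset_argmax(phi))   # {'a', 'c'}
```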

For future use we collect some basic results about max and argmax. They are a bit technical and may be skipped at this stage, until they are really needed.

Lemma 1.4.5. Let ϕ ∈ M(X), ψ ∈ M(Y) and χ ∈ M(X × Y). For a binary operator ⋆ ∈ {+, ·} and for r ∈ R≥0 we have the following results.

1 Maxima can be taken separately, in:

   max_{x∈X, y∈Y} ϕ(x) ⋆ ψ(y) = ( max_{x∈X} ϕ(x) ) ⋆ ( max_{y∈Y} ψ(y) )
   max_{x∈X} r · ϕ(x) = r · ( max_{x∈X} ϕ(x) ).

2 For argmax one has, when r > 0,

   argmax_{x∈X, y∈Y} ϕ(x) ⋆ ψ(y) = ( argmax_{x∈X} ϕ(x) ) × ( argmax_{y∈Y} ψ(y) )
   argmax_{x∈X} r · ϕ(x) = argmax_{x∈X} ϕ(x).

3 For (u, v) ∈ argmax χ one has:

   χ(u, v) = max_{x∈X, y∈Y} χ(x, y) = max_{y∈Y} χ(u, y) = max_{x∈X} χ(x, v).

4 max_{x∈X, y∈Y} ϕ(x) ⋆ χ(x, y) = max { ϕ(x) ⋆ χ(x, v_x) | x ∈ X, v_x ∈ argmax_{y∈Y} χ(x, y) }.

5 (u, v) ∈ argmax_{x∈X, y∈Y} ϕ(x) ⋆ χ(x, y) ⟺ u ∈ argmax_{x∈X} ϕ(x) ⋆ ( max_{y∈Y} χ(x, y) ) and v ∈ argmax_{y∈Y} χ(u, y).

Proof. The first two points are easy and left to the interested reader. For point (3) the only case that requires some care is:

   max_{x∈X, y∈Y} χ(x, y) ≤ max_{y∈Y} χ(u, y).

It is obtained as below, where the first equation holds by the fact that the pair (u, v) is in the argmax:

   max_{x∈X, y∈Y} χ(x, y) = χ(u, v) ≤ max_{y∈Y} χ(u, y).

For point (4) we first abbreviate:

   r ≔ max_{x∈X, y∈Y} ϕ(x) ⋆ χ(x, y)
   s ≔ max { ϕ(x) ⋆ χ(x, v_x) | x ∈ X, v_x ∈ argmax_{y∈Y} χ(x, y) }.

We prove r = s via the next two inequalities.
(r ≤ s) For arbitrary x ∈ X, y ∈ Y we have:

   χ(x, y) ≤ max_{z∈Y} χ(x, z) = χ(x, v_x)   for each v_x ∈ argmax_{y∈Y} χ(x, y).

But then ϕ(x) ⋆ χ(x, y) ≤ ϕ(x) ⋆ χ(x, v_x) ≤ s, and thus r ≤ s, since r is the least upper bound.
(s ≤ r) For x ∈ X and v_x ∈ argmax_{y∈Y} χ(x, y) we obviously have ϕ(x) ⋆ χ(x, v_x) ≤ r and thus s ≤ r.

Finally, for point (5) we need to prove two implications.
(⇒) Let (u, v) ∈ argmax_{x∈X, y∈Y} ϕ(x) ⋆ χ(x, y). Then:

   max_{x∈X} ϕ(x) ⋆ max_{y∈Y} χ(x, y) = max_{x∈X, y∈Y} ϕ(x) ⋆ χ(x, y)
                                      = ϕ(u) ⋆ χ(u, v)
                                      = max_{y∈Y} ϕ(u) ⋆ χ(u, y)      by point (3)
                                      = ϕ(u) ⋆ max_{y∈Y} χ(u, y).

From this we may conclude:

   u ∈ argmax_{x∈X} ϕ(x) ⋆ max_{y∈Y} χ(x, y).

Using (3) once again yields:

   ϕ(u) ⋆ χ(u, v) = max_{y∈Y} ϕ(u) ⋆ χ(u, y) = ϕ(u) ⋆ max_{y∈Y} χ(u, y).

By removing ϕ(u), via subtraction or division, we get χ(u, v) = max_{y∈Y} χ(u, y), so that v ∈ argmax_{y∈Y} χ(u, y).
(⇐) Let

   u ∈ argmax_{x∈X} ϕ(x) ⋆ max_{y∈Y} χ(x, y)   and   v ∈ argmax_{y∈Y} χ(u, y).

Then χ(u, v) = max_{y∈Y} χ(u, y), so that:

   ϕ(u) ⋆ χ(u, v) = ϕ(u) ⋆ max_{y∈Y} χ(u, y)
                  = max_{x∈X} ϕ(x) ⋆ max_{y∈Y} χ(x, y)      since u is in the argmax
                  = max_{x∈X, y∈Y} ϕ(x) ⋆ χ(x, y).

Hence we are done.

1.4.5 Extraction

At the end of the previous section we have seen how to extract a function (channel) from a binary subset, that is, from a relation. It turns out that one can do the same for a binary multiset, that is, for a table. More specifically, there are isomorphisms:

   M(Y)^X ≅ M(X × Y) ≅ M(X)^Y.      (1.7)

This is analogous to (1.3) for powerset. How does this work in detail? Suppose we have an arbitrary multiset/table σ ∈ M(X × Y). From σ one can extract a function extr1(σ) : X → M(Y), and also extr2(σ) : Y → M(X), via:

   extr1(σ)(x) = ∑y σ(x, y) |y〉       extr2(σ)(y) = ∑x σ(x, y) |x〉.

Notice that we are — conveniently — mixing ket and function notation for multisets. Conversely, σ can be reconstructed from extr1(σ), and also from extr2(σ), via σ(x, y) = extr1(σ)(x)(y) = extr2(σ)(y)(x).

Functions of the form A → M(B) will also be used as channels A → B, see Section 1.7. That’s why we speak about ‘channel extraction’.


As illustration, we apply extraction to the medicine – blood pressure Table (1.4), described as multiset τ ∈ M(B × M). It gives rise to two channels extr1(τ) : B → M(M) and extr2(τ) : M → M(B). Explicitly:

   extr1(τ)(H) = ∑_{x∈M} τ(H, x) |x〉 = 10|0〉 + 35|1〉 + 25|2〉
   extr1(τ)(L) = ∑_{x∈M} τ(L, x) |x〉 = 5|0〉 + 10|1〉 + 15|2〉.

We see that this extracted function captures the two rows of Table (1.4). Similarly we get the columns via the second extracted function:

   extr2(τ)(0) = 10|H〉 + 5|L〉
   extr2(τ)(1) = 35|H〉 + 10|L〉
   extr2(τ)(2) = 25|H〉 + 15|L〉.
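A small sketch of these extraction functions in code (again using dictionaries for multisets; the helper names extr1/extr2 follow the text, the implementation is ours):

```python
from collections import defaultdict

tau = {('H', 0): 10, ('H', 1): 35, ('H', 2): 25,
       ('L', 0): 5,  ('L', 1): 10, ('L', 2): 15}

def extr1(sigma):
    """Turn a joint multiset on X x Y into a function X -> M(Y)."""
    out = defaultdict(dict)
    for (x, y), n in sigma.items():
        out[x][y] = n
    return dict(out)

def extr2(sigma):
    """Turn a joint multiset on X x Y into a function Y -> M(X)."""
    out = defaultdict(dict)
    for (x, y), n in sigma.items():
        out[y][x] = n
    return dict(out)

print(extr1(tau)['H'])   # {0: 10, 1: 35, 2: 25} -- the 'high' row of the table
print(extr2(tau)[0])     # {'H': 10, 'L': 5}     -- the 'no medicine' column
```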

Exercises

1.4.1 In the setting of Exercise 1.2.1, consider multisets ϕ = 3|a〉 + 2|b〉 + 8|c〉 and ψ = 3|b〉 + 1|c〉. Compute:

   • ϕ + ψ
   • ψ + ϕ
   • M(f)(ϕ), both in ket-formulation and in function-formulation
   • idem for M(f)(ψ)
   • M(f)(ϕ + ψ)
   • M(f)(ϕ) + M(f)(ψ).

1.4.2 Consider, still in the context of Exercise 1.2.1, the ‘joint’ multiset ϕ ∈ M(X × Y) given by ϕ = 2|a, u〉 + 3|a, v〉 + 5|c, v〉. Determine the marginals M(π1)(ϕ) ∈ M(X) and M(π2)(ϕ) ∈ M(Y).

1.4.3 Check that M(id) = id and M(g ◦ f) = M(g) ◦ M(f).
1.4.4 Prove Lemma 1.4.3.
1.4.5 Check that the support map supp : L(X) → M(X) is a homomorphism of monoids. Do the same for supp : M(X) → Pfin(X).
1.4.6 Prove that the support maps supp : L(X) → M(X) and supp : M(X) → Pfin(X) are natural: for an arbitrary function f : X → Y both rectangles below commute, that is:

   M(f) ◦ supp = supp ◦ L(f) : L(X) → M(Y),
   Pfin(f) ◦ supp = supp ◦ M(f) : M(X) → Pfin(Y).


1.4.7 Show, along the lines of the proof of Lemma 1.4.5 (5), that:

   (u, v, w) ∈ argmax_{x∈X, y∈Y, z∈Z} ϕ(x) ⋆ ψ(x, y) ⋆ χ(y, z)
   ⟺ u ∈ argmax_{x∈X} ϕ(x) ⋆ ( max_{y∈Y} ψ(x, y) ⋆ ( max_{z∈Z} χ(y, z) ) )
      and v ∈ argmax_{y∈Y} ψ(u, y) ⋆ ( max_{z∈Z} χ(y, z) )
      and w ∈ argmax_{z∈Z} χ(v, z).

1.4.8 Verify that the support map supp : M(X) → Pfin(X) commutes with extraction functions, in the sense that the following diagram commutes:

      supp^X ◦ extr1 = extr1 ◦ supp : M(X × Y) → Pfin(Y)^X.

   Equationally, this amounts to showing that for τ ∈ M(X × Y) and x ∈ X one has:

      extr1( supp(τ) )(x) = supp( extr1(τ)(x) ).

   Here we use that supp^X(f) ≔ supp ◦ f, so that supp^X(f)(x) = supp(f(x)).
1.4.9 Let X be a set and (M, 0, +) a commutative monoid, with a function f : X → M.

   1 Prove that there is a unique homomorphism f̄ : (N(X), 0, +) → (M, 0, +) of monoids with f̄ ◦ unit = f.
   2 Assume now that M is also a cone, i.e. comes with scalar multiplication r · (−) : M → M, for each r ∈ R≥0, that is a homomorphism of monoids. Prove that there is now a unique homomorphism of cones f̄ : M(X) → M with f̄ ◦ unit = f.

1.5 Probability distributions

This section is finally about probability, albeit in elementary form. It introduces finite discrete probability distributions, which we often simply call distributions¹ or states. The notation and definitions that we use for distributions are very much like for multisets, except that multiplicities are required to add up to one.

This section follows to a large extent the structure of previous sections, in order to emphasise that distributions are not so different from lists, subsets and multisets. But there are also crucial differences: distributions do not form a monoid, but a ‘convex’ set, in which convex combinations can be formed.

¹ In the literature they are also called ‘multinomial’ or ‘categorical’ distributions, but we will not use that terminology.

A distribution, or a probability distribution, or a state, over a set X is a finite formal convex sum of the form:

   r1|x1〉 + · · · + rn|xn〉   where xi ∈ X, ri ∈ [0, 1] with ∑i ri = 1.

We can write such an expression as a sum ∑i ri|xi〉. It is called a convex sum since the ri add up to one. Thus, a distribution is a special ‘probabilistic’ multiset.

We write D(X) for the set of distributions on a set X of possible outcomes,so that D(X) ⊆ M(X). Via this inclusion, we use the same conventions fordistributions, as for multisets; they were described in the three bullet points inthe beginning of Section 1.4. This set X is often called the sample space, seee.g. [87].

For instance, for a coin we use the set {H, T} with elements for head and tail as sample space. A fair coin is described on the left below, as a distribution over this set; the distribution on the right gives a coin with a slight bias.

   1/2|H〉 + 1/2|T〉       0.51|H〉 + 0.49|T〉.

In general, for a finite set X = {x1, . . . , xn} we use the Greek letter ‘upsilon’ for the uniform distribution υX ∈ D(X), given by υX = ∑_{1≤i≤n} 1/n |xi〉. The above fair coin is a uniform distribution on the two-element set {H, T}. Similarly, a fair dice can be described as υ_pips, where pips = {1, 2, 3, 4, 5, 6}. Figure 1.1 shows bar charts of several distributions.

Example 1.5.1. We shall explicitly describe biased coins and binomials using the above formal convex sum notation.

1 The coin that we have seen above can be parametrised via a ‘bias’ probability r ∈ [0, 1]. The resulting coin is often called flip and is defined as:

   flip(r) ≔ r|1〉 + (1 − r)|0〉

where 1 may be understood as ‘head’ and 0 as ‘tail’. We may thus see flip as a function flip : [0, 1] → D({0, 1}) from probabilities to distributions over the sample space {0, 1} of Booleans.

Figure 1.1 Plots of a slightly biased coin distribution 0.51|H〉 + 0.49|T〉 and a fair (uniform) dice distribution on {1, 2, 3, 4, 5, 6} in the top row, together with a random distribution on the space {0, 1, 2, . . . , 24} at the bottom.

2 For each number K ∈ N and probability r ∈ [0, 1] there is the familiar binomial distribution binom[K](r) ∈ D({0, 1, . . . , K}). It captures probabilities for iterated coin flips, and is given by the convex sum:

   binom[K](r) ≔ ∑_{0≤k≤K} ( K! / (k! · (K−k)!) ) · r^k · (1 − r)^(K−k) |k〉.

The multiplicity (probability) before |k〉 in this expression is the chance of getting k heads out of K coin flips, where each flip has bias r ∈ [0, 1]. In this way we obtain a function binom[K] : [0, 1] → D({0, 1, . . . , K}). These multiplicities are plotted as bar charts in Figure 1.2, for two binomial distributions, both on the sample space {0, 1, . . . , 15}.

We should check that binom[K](r) really is a distribution, that is, that its multiplicities add up to one. This is a direct result of the binomial expansion theorem:

   ∑_{0≤k≤K} ( K! / (k! · (K−k)!) ) · r^k · (1 − r)^(K−k) = ( r + (1 − r) )^K = 1^K = 1.
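As a sanity check, here is a short sketch (our own function names) that builds binom[K](r) as a dictionary and verifies numerically that its multiplicities add up to one:

```python
from math import comb

def binom_dist(K, r):
    """The distribution binom[K](r) on {0, 1, ..., K}, as a dictionary."""
    return {k: comb(K, k) * r**k * (1 - r)**(K - k) for k in range(K + 1)}

d = binom_dist(15, 0.25)
print(d[3])                                # probability of 3 heads out of 15 flips
print(abs(sum(d.values()) - 1) < 1e-12)    # multiplicities add up to one
```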

Figure 1.2 Plots of the binomial distributions binom[15](1/3) and binom[15](3/4).

So far we have described distributions as formal convex sums. But they can be described, equivalently, in functional form. This is done in the definition below, which is much like Definition 1.4.1 for multisets. Also for distributions we shall freely switch between the above ket-formulation and the function-formulation given below.

Definition 1.5.2. The set D(X) of all distributions over a set X can be defined as:

   D(X) ≔ { ω : X → [0, 1] | supp(ω) is finite, and ∑x ω(x) = 1 }.

Such a function ω : X → [0, 1], with finite support and values adding up to one, is often called a probability mass function, abbreviated as pmf.

This D is functorial: for a function f : X → Y we have D(f) : D(X) → D(Y), defined either as:

   D(f)( ∑i ri|xi〉 ) ≔ ∑i ri | f(xi) 〉    or as:    D(f)(ω)(y) ≔ ∑_{x∈f⁻¹(y)} ω(x).

A distribution of the form D(f)(ω) ∈ D(Y), for ω ∈ D(X), is sometimes called an image distribution.

One has to check that D(f)(ω) is a distribution again, that is, that its multiplicities add up to one. This works as follows.

   ∑y D(f)(ω)(y) = ∑y ∑_{x∈f⁻¹(y)} ω(x) = ∑x ω(x) = 1.

We present two examples where functoriality of D is frequently used, but not always in explicit form.

Example 1.5.3. 1 Computing marginals of ‘joint’ distributions involves functoriality of D. In general, one speaks of a joint distribution if its sample space is a product set, of the form X1 × X2, or more generally, X1 × · · · × Xn, for n ≥ 2. The i-th marginal of ω ∈ D(X1 × · · · × Xn) is defined as D(πi)(ω), via the i-th projection function πi : X1 × · · · × Xn → Xi.

For instance, the first marginal of the joint distribution

   ω = 1/12|H, 0〉 + 1/6|H, 1〉 + 1/3|H, 2〉 + 1/6|T, 0〉 + 1/12|T, 1〉 + 1/6|T, 2〉

on the product space {H, T} × {0, 1, 2} is obtained as:

   D(π1)(ω) = 1/12|H〉 + 1/6|H〉 + 1/3|H〉 + 1/6|T〉 + 1/12|T〉 + 1/6|T〉
            = 7/12|H〉 + 5/12|T〉.

2 Let ω ∈ D(X) be a distribution. In Chapter 2 we shall discuss random variables in detail, but here it suffices to know that a random variable involves a function R : X → R from the sample space of the distribution ω to the real numbers. Often, then, the notation

   P[R = r] ∈ [0, 1]

is used to indicate the probability that the random variable R takes value r ∈ R.

Since we now know that D is functorial, we may apply it to the function R : X → R. It gives another function D(R) : D(X) → D(R), so that D(R)(ω) is a distribution on R. We observe:

   P[R = r] = D(R)(ω)(r) = ∑_{x∈R⁻¹(r)} ω(x).

Thus, P[R = (−)] is an image distribution, on R. In the notation P[R = r] the distribution ω on the sample space X is left implicit.

Here is a concrete example. Recall that we write pips = {1, 2, 3, 4, 5, 6} for the sample space of a dice. Let M : pips × pips → pips take the maximum, so M(i, j) = max(i, j). We consider M as a function pips × pips → R via the inclusion pips ↪→ R. Then, using the uniform distribution υ ∈ D(pips × pips),

   P[M = k] = the probability that the maximum of two dice throws is k
            = D(M)(υ)(k)
            = ∑_{i,j with max(i,j)=k} υ(i, j) = ∑_{i≤k} υ(i, k) + ∑_{j<k} υ(k, j) = (2k − 1)/36.

In the notation of this book we write the image distribution D(M)(υ) on pips as:

   D(M)(υ) = 1/36|1〉 + 3/36|2〉 + 5/36|3〉 + 7/36|4〉 + 9/36|5〉 + 11/36|6〉
           = P[M = 1]|1〉 + P[M = 2]|2〉 + P[M = 3]|3〉 + P[M = 4]|4〉 + P[M = 5]|5〉 + P[M = 6]|6〉.


Since in our notation the underlying (uniform) distribution υ is explicit, we can also change it to another distribution ω ∈ D(pips × pips) and still do the same computation. In fact, once we have seen product distributions in Section 1.8 we can compute D(M)(ω1 ⊗ ω2), where ω1 is a distribution for the first dice (say a uniform, fair one) and ω2 a distribution for the second dice (which may be unfair). This notation, in which states are written explicitly, gives much flexibility in what we wish to express and compute, see also Subsection 2.1.1 below.
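The image-distribution computation for the maximum of two dice can be replayed mechanically; the sketch below (our own helper D_map, mirroring D(f)) pushes the uniform distribution on pips × pips forward along M.

```python
from collections import defaultdict
from fractions import Fraction as F
from itertools import product

def D_map(f, omega):
    """Functorial action D(f): the image distribution of omega along f."""
    out = defaultdict(F)
    for x, p in omega.items():
        out[f(x)] += p
    return dict(out)

pips = [1, 2, 3, 4, 5, 6]
uniform = {(i, j): F(1, 36) for i, j in product(pips, pips)}

print(D_map(max, uniform))
# {1: 1/36, 2: 1/12, 3: 5/36, 4: 7/36, 5: 1/4, 6: 11/36}, i.e. P[M = k] = (2k-1)/36
```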

At this stage one may expect a result saying that D(X) forms a monoid, corresponding to the idea of taking sums of multisets. But that does not work. Instead, one can take convex sums of distributions. This works as follows. Suppose we have two distributions ω, ρ ∈ D(X) and a number s ∈ [0, 1]. Then we can form a new distribution σ ∈ D(X), as convex combination of ω and ρ, namely:

   σ ≔ s · ω + (1 − s) · ρ    that is    σ(x) = s · ω(x) + (1 − s) · ρ(x).

At this stage we shall not axiomatise structures with such convex sums; they are sometimes called simply ‘convex sets’ or ‘barycentric algebras’, see [96] or [40] for details. A brief historical account occurs in [59, Remark 2.9].

1.5.1 Unit and flatten for distributions

The unit and flatten maps for multisets from Subsection 1.4.2 restrict to distributions. The unit function unit : X → D(X) is simply unit(x) ≔ 1|x〉. Flattening is the function flat : D(D(X)) → D(X) with:

   flat( ∑i ri|ωi〉 ) ≔ ∑x ( ∑i ri · ωi(x) ) |x〉.

The only thing that needs to be checked is that the right-hand side is again a convex sum, i.e. that its probabilities add up to one. This is easy:

   ∑x ( ∑i ri · ωi(x) ) = ∑i ri · ∑x ωi(x) = ∑i ri · 1 = 1.

The familiar properties of unit and flatten hold for distributions too: the analogue of Lemma 1.4.3 holds, with ‘multiset’ replaced by ‘distribution’.

We conclude with an observation about the size of supports of distributions.

Remark 1.5.4. We have restricted ourselves to finite probability distributions, by requiring that the support of a distribution is a finite set. This is fine in many situations of practical interest, such as in Bayesian networks. But there are relevant distributions that have infinite support, like the Poisson distribution pois[λ] on N, with ‘mean’ or ‘rate’ parameter λ ≥ 0. It can be described as the infinite formal convex sum:

   pois[λ] ≔ ∑_{k∈N} e^(−λ) · (λ^k / k!) |k〉.      (1.8)

This does not fit in D(N). Therefore we sometimes use D∞ instead of D, where the finiteness-of-support requirement has been dropped:

   D∞(X) ≔ { ϕ : X → [0, 1] | ∑x ϕ(x) = 1 }.

This D∞ is also functorial and comes with unit and flat maps, just like for D. Notice, by the way, that the multiplicities add up to one in the Poisson distribution because of the basic formula: e^λ = ∑_{k∈N} λ^k / k!. The Poisson distribution is typically used for counts of rare events.

Exercises

1.5.1 Check that a marginal of a uniform distribution is again a uniform distribution; more precisely, D(π1)(υ_{X×Y}) = υX.

1.5.2 1 Prove that flip : [0, 1] → D(2) is an isomorphism.
      2 Check that flip(r) is the same as binom[1](r).
      3 Describe the distribution binom[3](1/4) concretely and interpret this distribution in terms of coin flips.

1.5.3 We often write n ≔ {0, . . . , n−1}, so that 0 = ∅, 1 = {0} and 2 = {0, 1}. Verify that:

      L(0) ≅ 1    P(0) ≅ 1      M(0) ≅ 1          D(0) ≅ 0
      L(1) ≅ N    P(1) ≅ 2      M(1) ≅ R≥0        D(1) ≅ 1
                  P(n) ≅ 2^n    M(n) ≅ (R≥0)^n    D(2) ≅ [0, 1].

   The set D(n + 1) is often called the n-simplex. Describe it as a subset of R^(n+1), and also as a subset of R^n.

1.5.4 Formulate and prove the analogue of Lemma 1.4.3 for D, instead of for M.

1.5.5 For a (non-zero) probability r ∈ (0, 1] one defines the geometric distribution geom[r] ∈ D∞(N) as:

      geom[r] ≔ ∑_{k∈N>0} r · (1 − r)^(k−1) |k〉.

   It captures the probability of being successful for the first time after k − 1 unsuccessful tries. Prove that this is a distribution indeed: its multiplicities add up to one.
   Hint: Recall the summation formula for geometric series.


1.5.6 Prove that a distribution ϕ ∈ D∞(X) necessarily has countable support.
   Hint: Use that each set {x ∈ X | ϕ(x) > 1/n}, for n > 0, can have only finitely many elements.

1.5.7 For K ∈ N, let’s write N[K](X) ⊆ M(X) for the subset of natural multisets with K elements, that is, of ∑i ni|xi〉 with ∑i ni = K.

   1 Check that N[K](2) ≅ {0, 1, . . . , K}.
   2 Re-describe the binomial distribution from Example 1.5.1 (2) as a mapping

      binom[K] : D(2) → D( N[K](2) ).

   3 Consider the ‘trinomial’ generalisation:

      trinom[K] : D(3) → D( N[K](3) )

   given by:

      trinom[K]( r0|0〉 + r1|1〉 + r2|2〉 )
         = ∑_{k0,k1,k2 with k0+k1+k2=K} ( K! / (k0! · k1! · k2!) ) · r0^k0 · r1^k1 · r2^k2 | k0|0〉 + k1|1〉 + k2|2〉 〉.

   Check that this yields a probability distribution. Interpret the trinomial distribution in analogy with the binomial distribution.
   Hint: Use the multinomial theorem.

   4 Generalise to ‘multinomial’ distributions:

      multinom[K] : D(m) → D( N[K](m) ).

   (An answer will appear in Example 2.1.5.)

1.6 Frequentist learning: from multisets to distributions

In earlier sections we have seen several (natural) mappings between collection types, in the form of support maps, see the overview in Diagram (1.5). We are now introducing mappings Flrn : M∗(X) → D(X) which are of a different nature. The name Flrn stands for ‘frequentist learning’, and may be pronounced as ‘eff-learn’. The frequentist interpretation of probability theory views probabilities as long term frequencies of occurrences. Here, these occurrences are given via multisets, which form the inputs of the Flrn function. Recall that M∗(X) is the collection of non-empty multisets ∑i ri|xi〉, with ri ≠ 0 for at least one index i. Equivalently one can require that the sum s ≔ ∑i ri is non-zero.

The Flrn map turns a (non-empty) multiset into a distribution, essentially by normalisation:

   Flrn( r1|x1〉 + · · · + rk|xk〉 ) ≔ (r1/s)|x1〉 + · · · + (rk/s)|xk〉    where s ≔ ∑i ri.      (1.9)

The normalisation step forces the sum on the right-hand side to be a convex sum, with factors adding up to one. Clearly, from an empty multiset we cannot learn a distribution — technically because the above sum s is then zero, so that we cannot divide by s.

Using scalar multiplication from Lemma 1.4.2 (2) we can define more succinctly:

   Flrn(ϕ) ≔ (1/|ϕ|) · ϕ    where |ϕ| ≔ ∑x ϕ(x).

We call this form Flrn of learning ‘frequentist’, following [48], because it boils down to counting.

Example 1.6.1. We present two illustrations of frequentist learning.

1 Suppose we have some coin of which the bias is unknown. There are experimental data showing that after tossing the coin 50 times, 20 tosses come up head (H) and 30 yield tail (T). We can present these data as a multiset σ = 20|H〉 + 30|T〉 ∈ M∗({H, T}). When we wish to learn the resulting probabilities, we apply the frequentist learning map Flrn and get a distribution in D({H, T}), namely:

   Flrn(σ) = 20/(20+30)|H〉 + 30/(20+30)|T〉 = 2/5|H〉 + 3/5|T〉.

Thus, the bias (towards head) is 2/5. In this simple case we could have obtained this bias immediately from the data, but the Flrn map captures the general mechanism.

Notice that with frequentist learning, more (or less) consistent data gives the same outcome. For instance, if we knew that 40 out of 100 tosses were head, or 2 out of 5, we would still get the same bias. Intuitively, these data give more (or less) confidence about the data. These aspects are not covered by frequentist learning, but by the more sophisticated form of ‘Bayesian’ learning. Another disadvantage of the rather primitive form of frequentist learning is that prior knowledge, if any, about the bias is not taken into account.


2 Recall the medical table (1.4) captured by the multiset τ ∈ N(B × M). Learning from τ yields the following joint distribution:

   Flrn(τ) = 0.1|H, 0〉 + 0.35|H, 1〉 + 0.25|H, 2〉 + 0.05|L, 0〉 + 0.1|L, 1〉 + 0.15|L, 2〉.

Such a distribution, directly derived from a table, is sometimes called an empirical distribution [20].
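Frequentist learning is just normalisation, so it is a one-liner in code; the sketch below (function name ours, working on natural multisets) reproduces both illustrations above.

```python
from fractions import Fraction as F

def flrn(phi):
    """Flrn: turn a non-empty natural multiset into a distribution by normalising."""
    s = sum(phi.values())
    if s == 0:
        raise ValueError("cannot learn a distribution from the empty multiset")
    return {x: F(n, s) for x, n in phi.items()}

print(flrn({'H': 20, 'T': 30}))          # {'H': 2/5, 'T': 3/5}
tau = {('H', 0): 10, ('H', 1): 35, ('H', 2): 25,
       ('L', 0): 5,  ('L', 1): 10, ('L', 2): 15}
print(flrn(tau)[('H', 1)])               # 7/20, i.e. 0.35
```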

In [32] it is argued that in general, people are not very good at probabilistic (esp. Bayesian) reasoning, but that they are much better at reasoning with “frequency formats”. To put it simply: the information that there is a 0.04 probability of getting a disease is more difficult to process than the information that 4 out of 100 people get the disease. In the current setting these frequency formats would correspond to natural multisets; they can be turned into distributions via the frequentist learning map Flrn. In the sequel we regularly return to the relationship between multisets and distributions in relation to various forms of learning, see esp. Chapter 4.

It turns out that the learning map Flrn is ‘natural’, in a mathematical sense.

Lemma 1.6.2. The learning maps Flrn : M∗(X) → D(X) from (1.9) are natural in X. This means that for each function f : X → Y the following naturality square commutes:

   D(f) ◦ Flrn = Flrn ◦ M∗(f) : M∗(X) → D(Y).

As a result, frequentist learning commutes with marginalisation.

Proof. Pick an arbitrary non-empty multiset σ = ∑i ri|xi〉 in M∗(X) and write s ≔ ∑i ri. By non-emptiness of σ we have s ≠ 0. Then:

   ( Flrn ◦ M∗(f) )(σ) = Flrn( ∑i ri | f(xi) 〉 )
                       = ∑i (ri/s) | f(xi) 〉
                       = D(f)( ∑i (ri/s) |xi〉 )
                       = ( D(f) ◦ Flrn )(σ).

We can apply this basic result to the medical data in Table (1.4), via the multiset τ ∈ N(B × M). We have already seen in Section 1.4 that the multiset-marginals N(πi)(τ) produce the marginal columns and rows with totals. We can learn the distributions from the columns as:

   Flrn( M(π1)(τ) ) = Flrn( 70|H〉 + 30|L〉 ) = 0.7|H〉 + 0.3|L〉.


We can also take the distribution-marginal of the ‘learned’ distribution from the table, as described in Example 1.6.1 (2):

   M(π1)( Flrn(τ) ) = (0.1 + 0.35 + 0.25)|H〉 + (0.05 + 0.1 + 0.15)|L〉 = 0.7|H〉 + 0.3|L〉.

Hence the basic operations of learning and marginalisation commute. This is a simple result, which many practitioners in probability are surely aware of, at an intuitive level, but maybe not in the mathematically precise form of Lemma 1.6.2.

An obvious next question is whether multiset-extraction — see the end of Section 1.4 — also commutes with ‘distribution-extraction’. The latter is called disintegration, and will be studied as a separate topic in Section 3.5; it will be related to frequentist learning in Section 4.1.

We conclude with an elementary operation in probability, which is commonly described as drawing an object from an urn. We shall describe it via frequentist learning Flrn. An urn typically contains finitely many objects, where the order does not matter, but multiplicities do matter. Hence we can describe the contents of an urn as a (natural) multiset σ ∈ N(X), where X is the sample space of objects that may occur in the urn. For instance, we can use X = {r, b} when the urn contains red and blue balls.

For x ∈ X we define a draw operation Dx : N(X) → N(X) which removes the element x from a multiset ϕ, if present. This draw is defined as:

   Dx(ϕ) = ϕ                                              if ϕ(x) = 0, that is, if x is not in ϕ
   Dx(ϕ) = (ϕ(x) − 1)|x〉 + ∑_{x′≠x} ϕ(x′)|x′〉             otherwise.

These draws of individual elements are used in a general draw operation on a non-empty multiset. We describe its outcome as a distribution over pairs, consisting of the drawn element and the remaining multiset.

Definition 1.6.3. There is a draw operation:

   D : N∗(X) → D( X × N(X) )

given by:

   D(ϕ) ≔ ∑_{x∈supp(ϕ)} Flrn(ϕ)(x) | x, Dx(ϕ) 〉.

For instance, suppose we have an urn/multiset ϕ = 2|r〉 + 1|b〉 with two red balls and one blue ball; then performing a draw yields a distribution:

   D(ϕ) = 2/3 | r, 1|r〉 + 1|b〉 〉 + 1/3 | b, 2|r〉 〉.


The first term describes the 2/3 probability of drawing a red ball, with a multiset of one red and one blue ball remaining in the urn. The second term captures the 1/3 probability of drawing a blue ball with two red ones remaining.

Exercises

1.6.1 Check that frequentist learning from a constant multiset yields a uniform distribution.

1.6.2 Show that Diagram (1.5) can be refined to the commuting diagram

      L∗(X) −supp→ M∗(X) −Flrn→ D(X) −supp→ Pfin(X)   =   L∗(X) −supp→ Pfin(X),      (1.10)

   where L∗(X) ⊆ L(X) is the subset of non-empty lists.
1.6.3 Show that frequentist learning commutes with units, i.e. that Flrn ◦ unit = unit, but that it does not commute with flattens (like support commutes with flattens, in Lemma 1.3.3 (3)).
   Hint: Start e.g. from Φ = 1 | 2|a〉 + 4|c〉 〉 + 2 | 1|a〉 + 1|b〉 + 1|c〉 〉 in M∗(M∗({a, b, c})).

1.7 Channels

The previous sections covered the collection types of lists, subsets, multisets and distributions, with much emphasis on the similarities between them. In this section we will exploit these similarities in order to introduce the concept of channel, in a uniform approach, for all of these collection types at the same time. This will show that these data types are not only used for certain types of collections, but also for certain types of computation. Much of the rest of this book builds on the concept of a channel.

Let T be either L, P, M or D. A state of type Y is an element ω ∈ T(Y); it collects a number of elements of Y in a certain way. In this section we abstract away from the particular type of collection. A channel, or sometimes more explicitly T-channel, is a collection of states, parameterised by a set. Thus, a channel is a function of the form c : X → T(Y). Such a channel turns an element x ∈ X into a certain collection c(x) of elements of Y. Just like an ordinary function f : X → Y can be seen as a computation, we see a T-channel as a computation of type T. For instance, a channel X → P(Y) is a non-deterministic computation, a channel X → M(Y) is a resource-sensitive


computation, and a channel X → D(Y) is a probabilistic computation. We have already seen several examples of such probabilistic channels in Example 1.5.1, namely flip : [0, 1] → D({0, 1}) and binom[K] : [0, 1] → D({0, 1, . . . , K}).

When it is clear from the context what T is, we often write a channel using functional notation, as c : X → Y, with a circle on the shaft of the arrow. Notice that a channel 1 → X from the singleton set 1 = {0} can be identified with a state on X.

Definition 1.7.1. Let T ∈ {L, P, M, D}, each of which is functorial, with its own flatten operation, as described in previous sections.

1 For a state ω ∈ T(X) and a channel c : X → T(Y) we can form a new state c ≫ ω of type Y. It is defined as:

   c ≫ ω ≔ ( flat ◦ T(c) )(ω)    where    T(X) −T(c)→ T(T(Y)) −flat→ T(Y).

This operation c ≫ ω is called state transformation, sometimes with the addition: along the channel c.

2 Let c : X → Y and d : Y → Z be two channels. Then we can compose them and get a new channel d ◦· c : X → Z via:

   ( d ◦· c )(x) ≔ d ≫ c(x)    so that    d ◦· c = flat ◦ T(d) ◦ c.

We first look at some examples of state transformation.

Example 1.7.2. Take X = {a, b, c} and Y = {u, v}.

1 For T = L an example of a state ω ∈ L(X) is ω = [c, b, b, a]. An L-channel f : X → L(Y) can for instance be given by:

   f(a) = [u, v]    f(b) = [u, u]    f(c) = [v, u, v].

State transformation f ≫ ω amounts to using ‘map list’ via f and then flattening: turning a list of lists into a list:

   f ≫ ω = flat( L(f)(ω) )
         = flat( [f(c), f(b), f(b), f(a)] )
         = flat( [[v, u, v], [u, u], [u, u], [u, v]] )
         = [v, u, v, u, u, u, u, u, v].

2 We consider the analogous example for T = P. We thus take as state ω = {a, b, c} and as channel f : X → P(Y) with:

   f(a) = {u, v}    f(b) = {u}    f(c) = {u, v}.

Then:

   f ≫ ω = flat( P(f)(ω) ) = ⋃ { f(a), f(b), f(c) } = ⋃ { {u, v}, {u}, {u, v} } = {u, v}.

3 For multisets a state in M(X) could be of the form ω = 3|a〉 + 2|b〉 + 5|c〉 and a channel f : X → M(Y) could have:

   f(a) = 10|u〉 + 5|v〉    f(b) = 1|u〉    f(c) = 4|u〉 + 1|v〉.

We then get as state transformation:

   f ≫ ω = flat( M(f)(ω) )
         = flat( 3| f(a) 〉 + 2| f(b) 〉 + 5| f(c) 〉 )
         = flat( 3 | 10|u〉 + 5|v〉 〉 + 2 | 1|u〉 〉 + 5 | 4|u〉 + 1|v〉 〉 )
         = 30|u〉 + 15|v〉 + 2|u〉 + 20|u〉 + 5|v〉
         = 52|u〉 + 20|v〉.

4 Finally, for distributions, let’s take as state ω = 1/6|a〉 + 1/2|b〉 + 1/3|c〉 ∈ D(X) and as D-channel f : X → D(Y),

   f(a) = 1/2|u〉 + 1/2|v〉    f(b) = 1|u〉    f(c) = 3/4|u〉 + 1/4|v〉.

We then get as state transformation:

   f ≫ ω = flat( D(f)(ω) )
         = flat( 1/6| f(a) 〉 + 1/2| f(b) 〉 + 1/3| f(c) 〉 )
         = flat( 1/6 | 1/2|u〉 + 1/2|v〉 〉 + 1/2 | 1|u〉 〉 + 1/3 | 3/4|u〉 + 1/4|v〉 〉 )
         = 1/12|u〉 + 1/12|v〉 + 1/2|u〉 + 1/4|u〉 + 1/12|v〉
         = 5/6|u〉 + 1/6|v〉.
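Probabilistic state transformation and channel composition are easy to implement on top of the dictionary representation; the sketch below (helper names ours) redoes the computation of f ≫ ω above and also shows composition d ◦· c.

```python
from fractions import Fraction as F
from collections import defaultdict

def transform(c, omega):
    """State transformation c >> omega for a D-channel c and state omega:
    (c >> omega)(y) = sum_x c(x)(y) * omega(x)."""
    out = defaultdict(F)
    for x, p in omega.items():
        for y, q in c(x).items():
            out[y] += p * q
    return dict(out)

def compose(d, c):
    """Channel composition d o. c, defined by (d o. c)(x) = d >> c(x)."""
    return lambda x: transform(d, c(x))

omega = {'a': F(1, 6), 'b': F(1, 2), 'c': F(1, 3)}
f = {'a': {'u': F(1, 2), 'v': F(1, 2)},
     'b': {'u': F(1)},
     'c': {'u': F(3, 4), 'v': F(1, 4)}}.get

print(transform(f, omega))   # {'u': 5/6, 'v': 1/6}
```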

We now prove some general properties about channels.

Lemma 1.7.3. 1 Channel composition ◦· has a unit, namely unit : Y → Y, so that:

   unit ◦· c = c    and    d ◦· unit = d,

for all channels c : X → Y and d : Y → Z. Another way to write the second equation is: d ≫ unit(y) = d(y).

2 Channel composition ◦· is associative:

   e ◦· (d ◦· c) = (e ◦· d) ◦· c

for all channels c : X → Y, d : Y → Z and e : Z → W.


3 State transformation via a composite channel is the same as two consecutive transformations:

   ( d ◦· c ) ≫ ω = d ≫ ( c ≫ ω ).

4 Each ordinary function f : Y → Z gives rise to a ‘trivial’ or ‘deterministic’ channel «f» ≔ unit ◦ f : Y → Z. This construction «−» satisfies:

   «f» ≫ ω = T(f)(ω),

where T is the type of channel involved. Moreover:

   «g» ◦· «f» = «g ◦ f»       «f» ◦· c = T(f) ◦ c       d ◦· «f» = d ◦ f,

for all functions g : Z → W and channels c : X → Y and d : Y → W.

Proof. We can give generic proofs, without knowing the type T ∈ {L, P, M, D} of channel, by using earlier results like Lemmas 1.2.3, 1.3.2 and 1.4.3 and its analogue for D. Notice that we carefully distinguish channel composition ◦· and ordinary function composition ◦.

1 Both equations follow from the flat−unit law. By Definition 1.7.1 (2):

   unit ◦· c = flat ◦ T(unit) ◦ c = id ◦ c = c.

For the second equation we use naturality of unit in:

   d ◦· unit = flat ◦ T(d) ◦ unit = flat ◦ unit ◦ d = id ◦ d = d.

2 The proof of associativity uses naturality and also the commutation of flatten with itself (the ‘flat−flat law’), expressed as flat ◦ flat = flat ◦ T(flat).

   e ◦· (d ◦· c) = flat ◦ T(e) ◦ (d ◦· c)
                = flat ◦ T(e) ◦ flat ◦ T(d) ◦ c
                = flat ◦ flat ◦ T(T(e)) ◦ T(d) ◦ c        by naturality of flat
                = flat ◦ T(flat) ◦ T(T(e)) ◦ T(d) ◦ c     by the flat−flat law
                = flat ◦ T( flat ◦ T(e) ◦ d ) ◦ c         by functoriality of T
                = flat ◦ T( e ◦· d ) ◦ c
                = (e ◦· d) ◦· c.


3 Along the same lines:

   ( d ◦· c ) ≫ ω = ( flat ◦ T(d ◦· c) )(ω)
                  = ( flat ◦ T(flat ◦ T(d) ◦ c) )(ω)
                  = ( flat ◦ T(flat) ◦ T(T(d)) ◦ T(c) )(ω)     by functoriality of T
                  = ( flat ◦ flat ◦ T(T(d)) ◦ T(c) )(ω)        by the flat−flat law
                  = ( flat ◦ T(d) ◦ flat ◦ T(c) )(ω)           by naturality of flat
                  = ( flat ◦ T(d) )( ( flat ◦ T(c) )(ω) )
                  = ( flat ◦ T(d) )( c ≫ ω )
                  = d ≫ ( c ≫ ω ).

4 All these properties follow from elementary facts that we have seen before:

   «f» ≫ ω = ( flat ◦ T(unit ◦ f) )(ω)
            = ( flat ◦ T(unit) ◦ T(f) )(ω)      by functoriality of T
            = T(f)(ω)                           by a flat−unit law

   «g» ◦· «f» = flat ◦ T(unit ◦ g) ◦ unit ◦ f
              = flat ◦ unit ◦ (unit ◦ g) ◦ f    by naturality of unit
              = unit ◦ g ◦ f                    by a flat−unit law
              = «g ◦ f»

   «f» ◦· c = flat ◦ T(unit ◦ f) ◦ c
            = flat ◦ T(unit) ◦ T(f) ◦ c         by functoriality of T
            = T(f) ◦ c

   d ◦· «f» = flat ◦ T(d) ◦ unit ◦ f
            = flat ◦ unit ◦ d ◦ f               by naturality of unit
            = d ◦ f                             by a flat−unit law.

In the sequel we often omit writing the brackets «−» that turn an ordinary function f : X → Y into a channel «f». For instance, in a state transformation f ≫ ω, it is clear that we use f as a channel, so that the expression should be read as «f» ≫ ω.

1.7.1 Graphical representation of probabilistic channels

Graphs, of various sorts, will play an important role in this book. A channel c : X → Y can be represented as an edge from vertices X to Y in a directed graph. This representation will often be used. But there is a completely different representation of a probabilistic channel c : X → Y, namely as a transition graph. The elements x ∈ X and y ∈ Y will then be used as vertices. An edge between them is a labeled arrow x −c(x)(y)→ y, where the number c(x)(y) ∈ [0, 1] is seen as the probability of a transition from x to y.

For instance, the state ω and channel f : {a, b, c} → {u, v} from Example 1.7.2 (4) can be represented as a transition graph of the form:

   • −1/6→ a    • −1/2→ b    • −1/3→ c        (initial edges, carrying ω)
   a −1/2→ u    a −1/2→ v
   b −1→ u                                                                  (1.11)
   c −3/4→ u    c −1/4→ v

The black dot functions as initial node.

Exercises

1.7.1 Consider the D-channel f : {a, b, c} → {u, v} from Example 1.7.2 (4), together with a new D-channel g : {u, v} → {1, 2, 3, 4} given by:

      g(u) = 1/4|1〉 + 1/8|2〉 + 1/2|3〉 + 1/8|4〉       g(v) = 1/4|1〉 + 1/8|3〉 + 5/8|3〉.

   Describe the composite g ◦· f : {a, b, c} → {1, 2, 3, 4} concretely.
1.7.2 Notice that a state of type X can be identified with a channel 1 → T(X) with the singleton set 1 as domain. Check that under this identification, state transformation c ≫ ω corresponds to channel composition c ◦· ω.

1.7.3 1 Check that for M- and D-channels, state transformation can be described as:

         (c ≫ ω)(y) = ∑x c(x)(y) · ω(x),

      that is, as:

         c ≫ ω = ∑y ( ∑x c(x)(y) · ω(x) ) |y〉,

      where c : X → Y and ω is a state of type X. We shall often use this formula.
      2 Notice that we can also describe state transformation as a (convex) sum of scalar multiplications, that is, as:

         c ≫ ω = ∑x ω(x) · c(x).


1.7.4 Identify the channel f and the state ω in Example 1.7.2 (4) with matrices:

      M_f = ( 1/2   1   3/4 )          M_ω = ( 1/6 )
            ( 1/2   0   1/4 )                ( 1/2 )
                                             ( 1/3 )

   Such matrices are sometimes called stochastic, since each of their columns has non-negative entries that add up to one.

   1 Check that the matrix associated with the transformed state f ≫ ω is the matrix-column multiplication M_f · M_ω.
   (A general description appears in Remark 2.4.5.)
   2 Check also what state transformation f ≫ ω means in the transition graph (1.11).

1.7.5 Draw transition graphs for the powerset and multiset channels in Example 1.7.2.
1.7.6 Prove that state transformation along D-channels preserves convex combinations, that is, for r ∈ [0, 1],

      f ≫ (r · ω + (1 − r) · ρ) = r · (f ≫ ω) + (1 − r) · (f ≫ ρ).

1.7.7 Let f : X → Y.

   1 Prove that if f is a Pfin-channel, then the state transformation function f ≫ (−) : Pfin(X) → Pfin(Y) can also be defined via freeness, namely as the unique function f̄ in Exercise 1.3.3.
   2 Similarly, show that f ≫ (−) = f̄ when f is an M-channel, as in Exercise 1.4.9.

1.7.8 Let c : X → D(Y) be a D-channel and σ ∈ M(X) be a multiset. Because D(Y) ⊆ M(Y) we can also consider c as an M-channel, and use c ≫ σ. Prove that:

      Flrn(c ≫ σ) = c ≫ Flrn(σ) = Flrn(c ≫ Flrn(σ)).

1.7.9 1 Describe how (non-deterministic) powerset channels can be reversed, via a bijective correspondence between functions X → P(Y) and functions Y → P(X).
      (A description of this situation in terms of ‘dagger functors’ will appear in Example 3.8.1.)
      2 Show that for finite sets X, Y there is a similar correspondence for multiset channels.


      3 Let X, Y be finite sets and c : X → D(Y) be a D-channel. We can then define an M-channel c∗ : Y → M(X) by swapping arguments: c∗(y)(x) = c(x)(y). We call c a ‘bi-channel’ if c∗ is also a D-channel, i.e. if ∑x c(x)(y) = 1 for each y ∈ Y.
      Prove that the identity channel is a bi-channel and that bi-channels are closed under composition.

Prove that the identity channel is a bi-channel and that bi-channelsare closed under composition.

4 Take A = {a0, a1} and B = {b0, b1} and define a channel bell : A ×2→ D(B × 2) as:

bell(a0, 0) = 12 |b0, 0〉 + 3

8 |b1, 0〉 + 18 |b1, 1〉

bell(a0, 1) = 12 |b0, 1〉 + 1

8 |b1, 0〉 + 38 |b1, 1〉

bell(a1, 0) = 38 |b0, 0〉 + 1

8 |b0, 1〉 + 18 |b1, 0〉 + 3

8 |b1, 1〉bell(a1, 1) = 1

8 |b0, 0〉 + 38 |b0, 1〉 + 3

8 |b1, 0〉 + 18 |b1, 1〉

Check that bell is a bi-channel.(It captures the famous Bell table from quantum theory; we havedeliberately used open spaces in the above description of the chan-nel bell so that non-zero entries align, giving a ‘bi-stochastic’ ma-trix, from which one can read bell∗ vertically.)

1.8 Parallel products

In the first section of this chapter we have seen Cartesian, parallel products X × Y of sets X, Y. Here we shall look at parallel products of states, and also at parallel products of channels. These new products will be written as tensors ⊗. They express parallel combination. These tensors exist for P, M and D, but not for lists L. The reason for this absence will be explained below, in Remark 1.8.3. Increasingly, we are concentrating our attention on the probabilistic case.

Definition 1.8.1. Tensors, also called parallel products, will be defined first for states, for each type separately, and then for channels, in a uniform manner.

1 Let X,Y be arbitrary sets.

a For U ∈ P(X) and V ∈ P(Y) we define U ⊗ V ∈ P(X × Y) as:

      U ⊗ V ≔ {(x, y) ∈ X × Y | x ∈ U and y ∈ V}.

   This product U ⊗ V is often written simply as a product of sets U × V, but we prefer to have a separate notation for this product of subsets.


b For ϕ ∈ M(X) and ψ ∈ M(Y) we get ϕ ⊗ ψ ∈ M(X × Y) as:

      ϕ ⊗ ψ ≔ ∑_{x,y} (ϕ(x) · ψ(y)) |x, y〉,    that is    (ϕ ⊗ ψ)(x, y) = ϕ(x) · ψ(y).

c For ω ∈ D(X) and ρ ∈ D(Y) we get ω ⊗ ρ ∈ D(X × Y) as:

      ω ⊗ ρ ≔ ∑_{x,y} (ω(x) · ρ(y)) |x, y〉,    that is    (ω ⊗ ρ)(x, y) = ω(x) · ρ(y).

   In this case it needs to be checked that the above expression forms a convex sum, but that’s easy.

2 Let c : X → Y and d : A → B be two channels, both of the same type T ∈ {P, M, D}. A channel c ⊗ d : X × A → Y × B is defined via:

      ( c ⊗ d )(x, a) ≔ c(x) ⊗ d(a).

   The right-hand side uses the tensor product of the appropriate type T, as defined in the previous point.

We shall use tensors not only in binary form ⊗, but also in n-ary form ⊗ · · · ⊗, both for states and for channels.

We see that tensors ⊗ involve the products × of the underlying domains. A simple illustration of a (probabilistic) tensor product is:

   ( 1/6|u〉 + 1/3|v〉 + 1/2|w〉 ) ⊗ ( 3/4|0〉 + 1/4|1〉 )
      = 1/8|u, 0〉 + 1/24|u, 1〉 + 1/4|v, 0〉 + 1/12|v, 1〉 + 3/8|w, 0〉 + 1/8|w, 1〉.

These tensor products tend to grow really quickly in size, since the numbers of entries of the two parts have to be multiplied.
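A sketch of the tensor of states and of channels in the dictionary representation (names ours), reproducing the illustration above:

```python
from fractions import Fraction as F

def tensor_state(omega, rho):
    """Parallel product of states: (omega ⊗ rho)(x, y) = omega(x) * rho(y)."""
    return {(x, y): p * q for x, p in omega.items() for y, q in rho.items()}

def tensor_channel(c, d):
    """Parallel product of channels: (c ⊗ d)(x, a) = c(x) ⊗ d(a)."""
    return lambda xa: tensor_state(c(xa[0]), d(xa[1]))

omega = {'u': F(1, 6), 'v': F(1, 3), 'w': F(1, 2)}
rho = {0: F(3, 4), 1: F(1, 4)}
print(tensor_state(omega, rho))
# {('u', 0): 1/8, ('u', 1): 1/24, ('v', 0): 1/4, ('v', 1): 1/12,
#  ('w', 0): 3/8, ('w', 1): 1/8}
```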

Parallel products are well-behaved, as described below.

Lemma 1.8.2. 1 Transformation of parallel states via parallel channels is the parallel product of the individual transformations:

   (c ⊗ d) ≫ (ω ⊗ ρ) = (c ≫ ω) ⊗ (d ≫ ρ).

2 Parallel products of channels interact nicely with unit and composition:

   unit ⊗ unit = unit       (c1 ⊗ d1) ◦· (c2 ⊗ d2) = (c1 ◦· c2) ⊗ (d1 ◦· d2).

3 The tensor of trivial, deterministic channels is obtained from their product:

   «f» ⊗ «g» = «f × g»,

where f × g = 〈 f ◦ π1, g ◦ π2 〉, see Subsection 1.1.1.


Proof. We shall do the proofs of point (1) for M and for D at once, using Exercise 1.7.3 (1), and leave the remaining cases to the reader.

   ( (c ⊗ d) ≫ (ω ⊗ ρ) )(y, b) = ∑_{x,a} (c ⊗ d)(x, a)(y, b) · (ω ⊗ ρ)(x, a)
                               = ∑_{x,a} (c(x) ⊗ d(a))(y, b) · ω(x) · ρ(a)
                               = ∑_{x,a} c(x)(y) · d(a)(b) · ω(x) · ρ(a)
                               = ( ∑x c(x)(y) · ω(x) ) · ( ∑a d(a)(b) · ρ(a) )
                               = (c ≫ ω)(y) · (d ≫ ρ)(b)
                               = ( (c ≫ ω) ⊗ (d ≫ ρ) )(y, b).

As promised we look into why parallel products don’t work for lists.

Remark 1.8.3. Suppose we have two lists [a, b, c] and [u, v] and we wish to form their parallel product. Then there are many ways to do so. For instance, two obvious choices are:

   [〈a, u〉, 〈a, v〉, 〈b, u〉, 〈b, v〉, 〈c, u〉, 〈c, v〉]
   [〈a, u〉, 〈b, u〉, 〈c, u〉, 〈a, v〉, 〈b, v〉, 〈c, v〉]

There are many other possibilities. The problem is that there is no canonical choice. Since the order of elements in a list matters, there is no commutativity property which makes all options equivalent, like for multisets. Technically, the tensor for L does not exist because L is not a commutative (i.e. monoidal) monad; this is an early result in category theory, going back to [60].

We illustrate the use of parallel composition for iterated drawing.

Example 1.8.4. Recall the draw operation D : N∗(X) → D(X × N(X)) from Definition 1.6.3. We will ignore the non-emptiness requirement and consider this operation as a probabilistic channel:

   D : N(X) → X × N(X).

If we wish to draw twice, consecutively, we can describe this as composition of channels, namely as a probabilistic channel:

   N(X) −D→ X × N(X) −id ⊗ D→ X × X × N(X).

Concretely, for a multiset ϕ ∈ N(X) with at least two elements:

   ( (id ⊗ D) ◦· D )(ϕ) = ∑_{x,y∈X} Flrn(ϕ)(x) · Flrn(Dx(ϕ))(y) | x, y, Dy(Dx(ϕ)) 〉.


For instance, for an urn with two red (r) and one blue (b) ball:

   ( (id ⊗ D) ◦· D )(2|r〉 + 1|b〉)
      = (id ⊗ D) ≫ ( 2/3 | r, 1|r〉 + 1|b〉 〉 + 1/3 | b, 2|r〉 〉 )
      = 2/3 · 1/2 | r, r, 1|b〉 〉 + 2/3 · 1/2 | r, b, 1|r〉 〉 + 1/3 · 1 | b, r, 1|r〉 〉
      = 1/3 | r, r, 1|b〉 〉 + 1/3 | r, b, 1|r〉 〉 + 1/3 | b, r, 1|r〉 〉.

∣∣∣b, r, 1|r 〉⟩.Next we illustrate the use of parallel products ⊗ of probabilistic channels for

a standard ‘summation’ property of Poisson distributions pois[λ], see (1.8).This property is commonly expressed in terms of random variables Xi as: ifX1 ∼ pois[λ1] and X2 ∼ pois[λ2] then X1 + X2 ∼ pois[λ1 + λ2]. We have notdiscussed random variables yet, but we do not need them for the channel-basedreformulation below.

Recall that the Poisson distribution has infinite support, so that we need touseD∞ instead ofD, see Remark 1.5.4, but that difference is immaterial here.Tensors work forD∞ too. We now use the mapping λ 7→ pois[λ] as a functionR≥0 → D∞(N) and as a channel pois : R≥0 → N.

Lemma 1.8.5. Poisson commutes with finite sums, in the sense that the following two squares of channels commute:

   pois ◦· (+) = (+) ◦· (pois ⊗ pois)   as channels R≥0 × R≥0 → N,
   pois ◦· 0 = 0 ◦· unit                as channels 1 → N,

in which the sums (+, 0) are used (twice) as deterministic channels. Notice, by the way, that all the arrows in these squares are channels.

This can be read as: the Poisson channel is a homomorphism of commutative monoids, from (R≥0, +, 0) to (N, +, 0).

Proof. We first consider the first of the two equations, for which we pick arbitrary λ1, λ2 ∈ R≥0 and k ∈ N. We start from its right-hand side and work towards its left-hand side:

   ( (+) ◦· (pois ⊗ pois) )(λ1, λ2)(k)
      = D(+)( pois[λ1] ⊗ pois[λ2] )(k)                                         by Lemma 1.7.3 (4)
      = ∑_{k1,k2 with k1+k2=k} ( pois[λ1] ⊗ pois[λ2] )(k1, k2)
      = ∑_{m≤k} pois[λ1](m) · pois[λ2](k − m)
      = ∑_{m≤k} ( e^(−λ1) · λ1^m / m! ) · ( e^(−λ2) · λ2^(k−m) / (k − m)! )    by (1.8)
      = ( e^(−(λ1+λ2)) / k! ) · ∑_{m≤k} ( k! / (m! (k − m)!) ) · λ1^m · λ2^(k−m)
      = ( e^(−(λ1+λ2)) / k! ) · (λ1 + λ2)^k                                    by the binomial expansion theorem
      = pois[λ1 + λ2](k)                                                       by (1.8)
      = ( pois ◦· (+) )(λ1, λ2)(k).

Commutation of the second equation, describing preservation of zero, amounts to the trivial equation pois[0] = 1|0〉.
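The summation property can also be checked numerically; the sketch below (our own code) pushes pois[λ1] ⊗ pois[λ2] forward along + on a truncated support and compares the result with pois[λ1 + λ2].

```python
from math import exp, factorial

def pois(lam, cutoff=60):
    """Truncated Poisson distribution pois[lam] on {0, ..., cutoff}."""
    return {k: exp(-lam) * lam**k / factorial(k) for k in range(cutoff + 1)}

def pushforward_sum(p, q):
    """Image of the product distribution p ⊗ q under (k1, k2) |-> k1 + k2."""
    out = {}
    for k1, a in p.items():
        for k2, b in q.items():
            out[k1 + k2] = out.get(k1 + k2, 0.0) + a * b
    return out

lam1, lam2 = 1.5, 2.25
lhs = pushforward_sum(pois(lam1), pois(lam2))
rhs = pois(lam1 + lam2)
print(max(abs(lhs.get(k, 0) - rhs[k]) for k in range(40)))   # ~ 0, up to truncation
```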

The fact that the Poisson channel forms a homomorphism of monoids requires the wider perspective of category theory, which allows us to say that both (R≥0, +, 0) and (N, +, 0) are monoids in the symmetric monoidal category of D∞-channels. The interested reader is referred to the literature for details, see e.g. [4, 68], since this is beyond the current goals.

1.8.1 Marginalisation and entwinedness

In Section 1.1 we have seen projection maps πi : X1 × · · · × Xn → Xi for Cartesian products of sets. Since these πi are ordinary functions, they can be turned into channels «πi» = unit ◦ πi : X1 × · · · × Xn → Xi. We will be sloppy and typically omit these brackets «−» for projections. The kind of arrow (plain or with a circle), or the type of operation at hand, will then indicate whether πi is used as a function or as a channel.

The next result only works in the probabilistic case, for D. The exercises below will provide counterexamples for P and M.

Lemma 1.8.6. 1 Projection channels take parallel products of probabilistic states apart, that is, for ωi ∈ D(Xi) we have:

   πi ≫ ( ω1 ⊗ · · · ⊗ ωn ) = D(πi)( ω1 ⊗ · · · ⊗ ωn ) = ωi.

Thus, marginalisation of a parallel product yields its components.

Thus, marginalisation of parallel product yields its components.2 Similarly, projection channels commute with parallel products of probabilis-

tic channels, in the following manner:

X1 × · · · × Xn

◦πi��

◦c1⊗···⊗cn // Y1 × · · · × Yn

◦ πi��

Xi ◦ci

// Yi

Proof. We only do point (1), since point (2) then follows easily, using that parallel products of channels are defined pointwise, see Definition 1.8.1 (2). The first equation in point (1) follows from Lemma 1.7.3 (4), which yields «πi» ≫ (−) = D(πi)(−). We restrict to the special case where n = 2 and i = 1. Then:

   D(π1)(ω1 ⊗ ω2)(x) = ∑y (ω1 ⊗ ω2)(x, y)        see Definition 1.5.2
                     = ∑y ω1(x) · ω2(y)
                     = ω1(x) · ( ∑y ω2(y) )
                     = ω1(x) · 1 = ω1(x).

The last line of this proof relies on the fact that probabilistic states (distributions) involve a convex sum, with multiplicities adding up to one. This does not work for multisets, see Exercise 1.8.5 below.

We introduce special, post-fix notation for marginalisation via ‘masks’.

Definition 1.8.7. A mask M is a finite list of 0’s and 1’s, that is, an element M ∈ L({0, 1}). For a state ω of type T ∈ {L, P, M, D} on X1 × · · · × Xn and a mask M of length n we write:

   ω[M]

for the marginal with mask M. Informally, it keeps all the parts of ω at a position where there is a 1 in M, and it projects away the parts where there is a 0. This is best illustrated via an example:

   ω[1, 0, 1, 0, 1] = T(〈π1, π3, π5〉)(ω) = 〈π1, π3, π5〉 ≫ ω ∈ T(X1 × X3 × X5).

More generally, for a channel c : Y → X1 × · · · × Xn and a mask M of length n we use pointwise marginalisation via the same postcomposition:

   c[M] is the channel y ↦ c(y)[M].


With Cartesian products one can take an arbitrary tuple t ∈ X × Y apart into π1(t) ∈ X, π2(t) ∈ Y. By re-assembling these parts the original tuple is recovered: 〈π1(t), π2(t)〉 = t. This does not work for collections: a joint state is typically not the parallel product of its marginals, see Example 1.8.9 below. We introduce a special name for this.

Definition 1.8.8. A joint state ω on X × Y is called non-entwined if it is the parallel product of its marginals:

   ω = ω[1, 0] ⊗ ω[0, 1].

Otherwise it is called entwined. This notion of entwinedness may be formulated with respect to n-ary states, via a mask, see for example Exercise 1.8.1, but it may then need some re-arrangement of components.

Lemma 1.8.6 (1) shows that a probabilistic product state of the form ω1 ⊗ ω2 is non-entwined. But in general, joint states are entwined, so that the different parts are correlated and can influence each other. This is a mechanism that will play an important role in the sequel.

Example 1.8.9. Take X = {u, v} and A = {a, b} and consider the state ω ∈ D(X × A) given by:

    ω = 1/8 |u, a⟩ + 1/4 |u, b⟩ + 3/8 |v, a⟩ + 1/4 |v, b⟩.

We claim that ω is entwined. Indeed, ω has first and second marginals ω[1, 0] = D(π1)(ω) ∈ D(X) and ω[0, 1] = D(π2)(ω) ∈ D(A), namely:

    ω[1, 0] = 3/8 |u⟩ + 5/8 |v⟩   and   ω[0, 1] = 1/2 |a⟩ + 1/2 |b⟩.

The original state ω differs from the product of its marginals:

    ω[1, 0] ⊗ ω[0, 1] = 3/16 |u, a⟩ + 3/16 |u, b⟩ + 5/16 |v, a⟩ + 5/16 |v, b⟩.

This entwinedness follows from a general characterisation, see Exercise 1.8.7 below.
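The check can also be done mechanically. Below is a small Python sketch (not from the book) verifying that the state ω of Example 1.8.9 differs from the product of its marginals; the dictionary representation of states is an assumption of this sketch.

    from fractions import Fraction as F

    omega = {('u', 'a'): F(1, 8), ('u', 'b'): F(1, 4),
             ('v', 'a'): F(3, 8), ('v', 'b'): F(1, 4)}

    m1, m2 = {}, {}                       # omega[1,0] and omega[0,1]
    for (x, a), p in omega.items():
        m1[x] = m1.get(x, 0) + p
        m2[a] = m2.get(a, 0) + p

    product = {(x, a): m1[x] * m2[a] for x in m1 for a in m2}
    print(m1, m2)            # {u: 3/8, v: 5/8} and {a: 1/2, b: 1/2}
    print(product == omega)  # False, so omega is entwined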

1.8.2 Copying

For an arbitrary set X there is a copy function ∆n : X → Xn = X × · · · × X (n times), given as:

    ∆n(x) ≔ [x, . . . , x]   (n times x).

We often omit the subscript n, when it is clear from the context, especially when n = 2. This copy function can be turned into a copy channel ⟪∆n⟫ = unit ◦ ∆n : X → Xn, but, recall, we often omit writing ⟪−⟫ for simplicity. These ∆n are alternatively called copiers or diagonals.

As functions one has πi ◦ ∆n = id, and thus as channels πi ◦· ∆n = unit. State transformation with a copier, ∆ ≫ ω, is different from taking a tensor product ω ⊗ ω, as the following simple example illustrates. For clarity we now write the brackets ⟪−⟫ explicitly. First,

    ⟪∆⟫ ≫ (1/3|0⟩ + 2/3|1⟩)
        = ∑_{x,y} ( ∑_z ⟪∆⟫(z)(x, y) · (1/3|0⟩ + 2/3|1⟩)(z) ) |x, y⟩
        = ∑_{x,y} ( unit(∆(0))(x, y) · 1/3 + unit(∆(1))(x, y) · 2/3 ) |x, y⟩
        = ∑_{x,y} ( unit(0, 0)(x, y) · 1/3 + unit(1, 1)(x, y) · 2/3 ) |x, y⟩
        = 1/3|0, 0⟩ + 2/3|1, 1⟩.                                      (1.12)

In contrast:

    (1/3|0⟩ + 2/3|1⟩) ⊗ (1/3|0⟩ + 2/3|1⟩)
        = 1/9|0, 0⟩ + 2/9|0, 1⟩ + 2/9|1, 0⟩ + 4/9|1, 1⟩.              (1.13)
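A small Python sketch (not from the book) contrasting (1.12) and (1.13): pushing a state through the copy channel versus tensoring it with itself, with states as dictionaries of fractions.

    from fractions import Fraction as F

    omega = {0: F(1, 3), 1: F(2, 3)}

    copied = {(x, x): p for x, p in omega.items()}          # Delta >> omega
    tensored = {(x, y): p * q                                # omega (x) omega
                for x, p in omega.items() for y, q in omega.items()}

    print(copied)    # {(0, 0): 1/3, (1, 1): 2/3}
    print(tensored)  # {(0, 0): 1/9, (0, 1): 2/9, (1, 0): 2/9, (1, 1): 4/9}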

One might expect a commuting diagram like in Lemma 1.8.6 (2) for copiers, but that does not work: channels do not commute with diagonals, see Exercise 1.8.8 below for a counterexample. Copiers do commute with deterministic channels of the form ⟪f⟫ = unit ◦ f, as in:

    ∆ ◦· ⟪f⟫ = (⟪f⟫ ⊗ ⟪f⟫) ◦· ∆   because   ∆ ◦ f = (f × f) ◦ ∆.

In fact, this commutation property may be used as a definition for a channel to be deterministic, see Exercise 1.8.11.

For an ordinary function f : X → Y one can form the graph of f as the relation gr(f) ⊆ X × Y given by gr(f) = {(x, y) | f(x) = y}. There is a similar thing that one can do for channels, given a state. But please keep in mind that this kind of ‘graph’ has nothing to do with the graphical models that we will be using throughout, like Bayesian networks or string diagrams.

Definition 1.8.10. 1 For two channels c : X → Y and d : X → Z with the same domain we can define a tuple channel:

    ⟨c, d⟩ ≔ (c ⊗ d) ◦· ∆ : X → Y × Z.

2 In particular, for a state σ on Y and a channel d : Y → Z we define the graph as the joint state on Y × Z defined by:

    gr(σ, d) ≔ ⟨id, d⟩ ≫ σ   so that   gr(σ, d)(y, z) = σ(y) · d(y)(z).
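A minimal Python sketch (not from the book) of the graph state gr(σ, d); states and channels are plain dictionaries, and the numbers happen to be those of Exercise 1.8.12 below.

    from fractions import Fraction as F

    def graph(sigma, d):
        """gr(sigma, d)(y, z) = sigma(y) * d(y)(z)."""
        return {(y, z): p * q
                for y, p in sigma.items()
                for z, q in d[y].items()}

    sigma = {'H': F(7, 10), 'L': F(3, 10)}
    d = {'H': {0: F(1, 7), 1: F(1, 2), 2: F(5, 14)},
         'L': {0: F(1, 6), 1: F(1, 3), 2: F(1, 2)}}
    print(graph(sigma, d))   # a joint state on {H, L} x {0, 1, 2}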


Please note that we are overloading the tuple notation ⟨c, d⟩. Above we use it for the tuple of channels, so that ⟨c, d⟩ has type X → D(Y × Z). Interpreting ⟨c, d⟩ as a tuple of functions, like in Subsection 1.1.1, would give a type X → D(Y) × D(Z). For channels we standardly use the channel interpretation for the tuple.

In Subsection 1.4.4 we have looked at maxima for multiplicities of multisets, via the functions max and argmax. These functions can be applied to distributions as well, via the inclusions D(X) ⊆ M(X). Often we are interested in finding such maxima in joint states, especially when they are of the graph form gr(σ, d) as described above.

Lemma 1.8.11. For a channel d : Y → Z and a state σ ∈ D(Y) one has:

1 max(gr(σ, d)) = max{σ(y) · d(y)(z) | y ∈ Y, z ∈ argmax d(y)};
2 (u, v) ∈ argmax(gr(σ, d)) ⟺ u ∈ argmax_{y∈Y} σ(y) · max d(y) and v ∈ argmax d(u).

Proof. Directly by Lemma 1.4.5 (4) and (5).

The formulation in the above second point forms the basis for a mini-algorithm for computing argmax(gr(σ, d)), namely:

1 compute for each y ∈ Y the maximum m_y ≔ max d(y) ∈ R≥0;
2 determine the subset argmax_y σ(y) · m_y ⊆ Y;
3 for each u in this subset, determine the subset argmax d(u) ⊆ Z;
4 return such pairs (u, v) ∈ Y × Z where u ∈ argmax_y σ(y) · m_y and v ∈ argmax d(u).

The important thing is that this can be done without actually using the (possibly big) joint state gr(σ, d). It applies in particular in Bayesian networks.
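The following Python sketch (not from the book) implements the four steps literally; states and channels are plain dictionaries and the helper name argmax_graph is ad hoc.

    def argmax_graph(sigma, d):
        """Return the pairs (u, v) maximising sigma(u) * d(u)(v),
        without building the joint state gr(sigma, d)."""
        m = {y: max(d[y].values()) for y in d}                 # step 1
        best = max(sigma[y] * m[y] for y in sigma)             # step 2
        us = [y for y in sigma if sigma[y] * m[y] == best]
        return [(u, v) for u in us                             # steps 3 and 4
                for v, q in d[u].items() if q == m[u]]

    sigma = {'H': 0.7, 'L': 0.3}
    d = {'H': {0: 0.25, 1: 0.5, 2: 0.25}, 'L': {0: 0.25, 1: 0.25, 2: 0.5}}
    print(argmax_graph(sigma, d))   # [('H', 1)]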

Exercises

1.8.1 Check that the distribution ω ∈ D({a, b} × {a, b} × {a, b}) given by:

    ω = 1/24|aaa⟩ + 1/12|aab⟩ + 1/12|aba⟩ + 1/6|abb⟩
          + 1/6|baa⟩ + 1/3|bab⟩ + 1/24|bba⟩ + 1/12|bbb⟩

satisfies:

    ω = ω[1, 1, 0] ⊗ ω[0, 0, 1].

1.8.2 Check that for finite sets X, Y, the uniform distribution on X × Y is non-entwined — see also Exercise 1.5.1.


1.8.3 Describe, in the style of Example 1.8.4, how to draw three balls from the urn/multiset 2|r⟩ + 1|b⟩.

1.8.4 Check that:

    ω[0, 1, 1, 0, 1, 1][0, 1, 1, 0] = ω[0, 0, 1, 0, 1, 0].

What is the general result behind this?

1.8.5 1 Let U ∈ P(X) and V ∈ P(Y). Show that the equation

    (U ⊗ V)[1, 0] = U

does not hold in general, in particular not when V = ∅ (and U ≠ ∅).

2 Similarly, check that for arbitrary ϕ ∈ M(X) and ψ ∈ M(Y) one does not have:

    (ϕ ⊗ ψ)[1, 0] = ϕ.

It fails when ψ is but ϕ isn't the empty multiset 0. But it also fails in non-zero cases, e.g. for the multisets ϕ = 2|a⟩ + 4|b⟩ + 1|c⟩ and ψ = 3|u⟩ + 2|v⟩. Check this by doing the required computations.

1.8.6 Prove that for a channel c with suitably typed codomain,

    c[1, 0, 1, 0, 1] = ⟨π1, π3, π5⟩ ◦· c.

1.8.7 Let X = {u, v} and A = {a, b} as in Example 1.8.9. Prove that a state ω = r1|u, a⟩ + r2|u, b⟩ + r3|v, a⟩ + r4|v, b⟩ ∈ D(X × A), where r1 + r2 + r3 + r4 = 1, is non-entwined if and only if r1 · r4 = r2 · r3.

1.8.8 Consider the probabilistic channel f : X → Y from Example 1.7.2 (4) and show that on the one hand ∆ ◦· f : X → Y × Y is given by:

    a ↦ 1/2|u, u⟩ + 1/2|v, v⟩
    b ↦ 1|u, u⟩
    c ↦ 3/4|u, u⟩ + 1/4|v, v⟩.

On the other hand, (f ⊗ f) ◦· ∆ : X → Y × Y is described by:

    a ↦ 1/4|u, u⟩ + 1/4|u, v⟩ + 1/4|v, u⟩ + 1/4|v, v⟩
    b ↦ 1|u, u⟩
    c ↦ 9/16|u, u⟩ + 3/16|u, v⟩ + 3/16|v, u⟩ + 1/16|v, v⟩.

1.8.9 Check that for ordinary functions f : X → Y and g : X → Z one can relate channel tupling and ordinary tupling via the operation ⟪−⟫ as:

    ⟨⟪f⟫, ⟪g⟫⟩ = ⟪⟨f, g⟩⟫   i.e.   ⟨unit ◦ f, unit ◦ g⟩ = unit ◦ ⟨f, g⟩.

Conclude that the copy channel ∆ is ⟨unit, unit⟩.


1.8.10 Check that the tuple of channels from Definition 1.8.10 satisfies both:

    πi ◦· ⟨c1, c2⟩ = ci   and   ⟨π1, π2⟩ = unit

but not, essentially as in Exercise 1.8.8,

    ⟨c1, c2⟩ ◦· d = ⟨c1 ◦· d, c2 ◦· d⟩.

1.8.11 Let c : X → Y commute with diagonals, in the sense that ∆ ◦· c = (c ⊗ c) ◦· ∆. Prove that c is deterministic, i.e. of the form c = unit ◦ f for a function f : X → Y.

1.8.12 Consider the joint state Flrn(τ) from Example 1.6.1 (2).

1 Take σ = 0.7|H⟩ + 0.3|L⟩ ∈ D({H, L}) and c : {H, L} → {0, 1, 2} defined by:

    c(H) = 1/7|0⟩ + 1/2|1⟩ + 5/14|2⟩       c(L) = 1/6|0⟩ + 1/3|1⟩ + 1/2|2⟩.

Check that gr(σ, c) = Flrn(τ).
2 Find argmax(gr(σ, c)) using the formula of Lemma 1.8.11 (2).

1.8.13 Prove that for multisets σ ∈ M(X) and τ ∈ M(Y) one has:

    Flrn(σ ⊗ τ) = Flrn(σ) ⊗ Flrn(τ).

1.8.14 Consider the bi-channel bell : A × 2 → B × 2 from Exercise 1.7.9 (4). Show that both bell[1, 0] and bell∗[1, 0] are channels with uniform states only.

1.8.15 Prove that tensoring of multisets is linear, in the sense that for σ ∈ M(X) the operation ‘tensor with σ’, σ ⊗ (−) : M(Y) → M(X × Y), is linear w.r.t. the cone structure of Lemma 1.4.2 (2): for τ1, . . . , τn ∈ M(Y) and r1, . . . , rn ∈ R≥0 one has:

    σ ⊗ (r1 · τ1 + · · · + rn · τn) = r1 · (σ ⊗ τ1) + · · · + rn · (σ ⊗ τn).

The same holds in the other coordinate, for (−) ⊗ τ. As a special case we obtain that when σ is a probability distribution, then σ ⊗ (−) preserves convex sums of distributions.

1.8.16 Prove also that marginalisation (−)[1, 0] : M(X × Y) → M(X) is linear — and preserves convex sums when restricted to distributions.

1.9 A Bayesian network example

This section uses the previous sections, especially their probabilistic parts, to give semantics for a Bayesian network example. At this stage we just describe an example, without giving the general approach, but we hope that it clarifies what is going on and illustrates to the reader the relevance of (probabilistic) channels and their operations. A general description of what a Bayesian network is appears much later, in Definition 3.1.2.

[Diagram: winter (A) → sprinkler (B), winter (A) → rain (C), sprinkler (B) → wet grass (D), rain (C) → wet grass (D), rain (C) → slippery road (E)]

Figure 1.3 The wetness Bayesian network (from [20, Chap. 6]), with only the nodes and edges between them; the conditional probability tables associated with the nodes are given separately in the text.

The example that we use is a standard one, copied from the literature, namely from [20]. Bayesian networks were introduced in [80], see also [5, 8, 9, 57, 61].

Consider the diagram/network in Figure 1.3. It is meant to capture probabilistic dependencies between several wetness phenomena in the oval boxes. For instance, in winter it is more likely to rain (than when it's not winter), and also in winter it is less likely that a sprinkler is on. Still the grass may be wet by a combination of these occurrences. Whether a road is slippery depends on rain, not on sprinklers.

The letters A, B, C, D, E in this diagram are written exactly as in [20]. Here they are not used as sets of Booleans, with inhabitants true and false, but instead we use these sets with elements:

    A = {a, a⊥}   B = {b, b⊥}   C = {c, c⊥}   D = {d, d⊥}   E = {e, e⊥}

The notation a⊥ is read as ‘not a’. In this way the name of an element suggests to which set the element belongs.

The diagram in Figure 1.3 becomes a Bayesian network when we provide it with conditional probability tables. For the lower three nodes they look as follows.

    winter:        a     a⊥
                  3/5   2/5

    sprinkler:   A  |  b     b⊥
                 a  | 1/5   4/5
                 a⊥ | 3/4   1/4

    rain:        A  |  c      c⊥
                 a  | 4/5    1/5
                 a⊥ | 1/10   9/10


And for the upper two nodes we have:

    wet grass:   B   C  |   d      d⊥
                 b   c  | 19/20   1/20
                 b   c⊥ |  9/10   1/10
                 b⊥  c  |  4/5    1/5
                 b⊥  c⊥ |  0      1

    slippery road:   C  |   e      e⊥
                     c  | 7/10    3/10
                     c⊥ |  0       1

This Bayesian network is thus given by nodes, each with a conditional probability table, describing likelihoods in terms of previous ‘ancestor’ nodes in the network (if any).

How to interpret all this data? How to make it mathematically precise? It is not hard to see that the first ‘winter’ table describes a probability distribution on the set A, which, in the notation of this book, is given by:

    wi = 3/5 |a⟩ + 2/5 |a⊥⟩.

Thus we are assuming with a probability of 60% that we are in a winter situation. This is often called the prior distribution, or also the initial state.

Notice that the ‘sprinkler’ table contains two distributions on B, one for the element a ∈ A and one for a⊥ ∈ A. But this is a channel, namely a channel A → B. This is a crucial insight! We abbreviate this channel as sp, and define it as:

    sp : A → B   with   sp(a)  = 1/5 |b⟩ + 4/5 |b⊥⟩
                        sp(a⊥) = 3/4 |b⟩ + 1/4 |b⊥⟩.

We read them as: if it's winter, there is a 20% chance that the sprinkler is on, but if it's not winter, there is a 75% chance that the sprinkler is on.

Similarly, the ‘rain’ table corresponds to a channel:

    ra : A → C   with   ra(a)  = 4/5 |c⟩ + 1/5 |c⊥⟩
                        ra(a⊥) = 1/10 |c⟩ + 9/10 |c⊥⟩.

Before continuing we can see that the formalisation (partial, so far) of the wetness Bayesian network in Figure 1.3 in terms of states and channels already allows us to do something meaningful, namely state transformation ≫. Indeed, we can form distributions:

    sp ≫ wi on B   and   ra ≫ wi on C.

These (transformed) distributions capture the derived probabilities that the sprinkler is on, and that it rains. Using Exercise 1.7.3 (1) we get:

    (sp ≫ wi)(b)  = ∑_x sp(x)(b) · wi(x)  = sp(a)(b) · wi(a) + sp(a⊥)(b) · wi(a⊥)
                  = 1/5 · 3/5 + 3/4 · 2/5 = 21/50
    (sp ≫ wi)(b⊥) = ∑_x sp(x)(b⊥) · wi(x) = sp(a)(b⊥) · wi(a) + sp(a⊥)(b⊥) · wi(a⊥)
                  = 4/5 · 3/5 + 1/4 · 2/5 = 29/50.

Thus the overall distribution for the sprinkler (being on or not) is:

    sp ≫ wi = 21/50 |b⟩ + 29/50 |b⊥⟩.

In a similar way one can compute the probability distribution for rain as:

    ra ≫ wi = 13/25 |c⟩ + 12/25 |c⊥⟩.

Such distributions for non-initial nodes of a Bayesian network are called predictions. As will be shown here, they can be obtained via forward state transformation, following the structure of the network.
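The following Python sketch (not from the book) computes these two predictions via state transformation; states and channels are plain dictionaries of fractions, and the element names a_, b_, c_ stand for a⊥, b⊥, c⊥.

    from fractions import Fraction as F

    def transform(c, w):
        """Forward state transformation: (c >> w)(y) = sum_x c(x)(y) * w(x)."""
        out = {}
        for x, p in w.items():
            for y, q in c[x].items():
                out[y] = out.get(y, 0) + p * q
        return out

    wi = {'a': F(3, 5), 'a_': F(2, 5)}
    sp = {'a': {'b': F(1, 5), 'b_': F(4, 5)},
          'a_': {'b': F(3, 4), 'b_': F(1, 4)}}
    ra = {'a': {'c': F(4, 5), 'c_': F(1, 5)},
          'a_': {'c': F(1, 10), 'c_': F(9, 10)}}

    print(transform(sp, wi))   # {'b': 21/50, 'b_': 29/50}
    print(transform(ra, wi))   # {'c': 13/25, 'c_': 12/25}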

But first we still have to translate the upper two nodes of the network from Figure 1.3 into channels. In the conditional probability table for the ‘wet grass’ node we see 4 distributions on the set D, one for each combination of elements from the sets B and C. The table thus corresponds to a channel:

    wg : B × C → D   with   wg(b, c)   = 19/20 |d⟩ + 1/20 |d⊥⟩
                            wg(b, c⊥)  = 9/10 |d⟩ + 1/10 |d⊥⟩
                            wg(b⊥, c)  = 4/5 |d⟩ + 1/5 |d⊥⟩
                            wg(b⊥, c⊥) = 1 |d⊥⟩.

Finally, the table for the ‘slippery road’ node gives:

    sr : C → E   with   sr(c)  = 7/10 |e⟩ + 3/10 |e⊥⟩
                        sr(c⊥) = 1 |e⊥⟩.

We illustrate how to obtain predictions for ‘wet grass’ and for ‘slippery road’. We start from the latter. Looking at the network in Figure 1.3 we see that there are two arrows between the initial node ‘winter’ and our node of interest ‘slippery road’. This means that we have to do two successive state transformations, giving:

    (sr ◦· ra) ≫ wi = sr ≫ (ra ≫ wi) = sr ≫ (13/25 |c⟩ + 12/25 |c⊥⟩)
                    = 91/250 |e⟩ + 159/250 |e⊥⟩.

The first equation follows from Lemma 1.7.3 (3). The second one involves elementary calculations, where we can use the distribution ra ≫ wi that we calculated earlier.


Getting the predicted wet grass probability requires some care. Inspection of the network in Figure 1.3 is of some help, but leads to some ambiguity — see below. One might be tempted to form the parallel product ⊗ of the predicted distributions for sprinkler and rain, and do state transformation on this product along the wet grass channel wg, as in:

    wg ≫ ((sp ≫ wi) ⊗ (ra ≫ wi)).

But this is wrong, since the winter probabilities are now not used consistently, see the different outcomes in the calculations (1.12) and (1.13). The correct way to obtain the wet grass prediction involves copying the winter state, via the copy channel ∆, see:

    (wg ◦· (sp ⊗ ra) ◦· ∆) ≫ wi = wg ≫ ((sp ⊗ ra) ≫ (∆ ≫ wi)) = 1399/2000 |d⟩ + 601/2000 |d⊥⟩.

Such calculations are laborious, but essentially straightforward. We shall do this one in detail, just to see how it works. Especially, it becomes clear that all summations are automatically done at the right place. We proceed in two steps, where for each step we only elaborate the first case.

    ((sp ⊗ ra) ≫ (∆ ≫ wi))(b, c)
        = ∑_{x,y} (sp ⊗ ra)(x, y)(b, c) · (∆ ≫ wi)(x, y)
        = ∑_{x,y} (sp(x) ⊗ ra(y))(b, c) · (∑_z ∆(z)(x, y) · wi(z))
        = sp(a)(b) · ra(a)(c) · 3/5 + sp(a⊥)(b) · ra(a⊥)(c) · 2/5
        = 1/5 · 4/5 · 3/5 + 3/4 · 1/10 · 2/5
        = 63/500
    ((sp ⊗ ra) ≫ (∆ ≫ wi))(b, c⊥)  = 147/500
    ((sp ⊗ ra) ≫ (∆ ≫ wi))(b⊥, c)  = 197/500
    ((sp ⊗ ra) ≫ (∆ ≫ wi))(b⊥, c⊥) = 93/500

We conclude that:

    (sp ⊗ ra) ≫ (∆ ≫ wi)
        = 63/500 |b, c⟩ + 147/500 |b, c⊥⟩ + 197/500 |b⊥, c⟩ + 93/500 |b⊥, c⊥⟩.


This distribution is used in the next step:

    (wg ≫ ((sp ⊗ ra) ≫ (∆ ≫ wi)))(d)
        = ∑_{x,y} wg(x, y)(d) · ((sp ⊗ ra) ≫ (∆ ≫ wi))(x, y)
        = wg(b, c)(d) · 63/500 + wg(b, c⊥)(d) · 147/500
            + wg(b⊥, c)(d) · 197/500 + wg(b⊥, c⊥)(d) · 93/500
        = 19/20 · 63/500 + 9/10 · 147/500 + 4/5 · 197/500 + 0 · 93/500
        = 1399/2000
    (wg ≫ ((sp ⊗ ra) ≫ (∆ ≫ wi)))(d⊥) = 601/2000.

We have thus shown that:

    wg ≫ ((sp ⊗ ra) ≫ (∆ ≫ wi)) = 1399/2000 |d⟩ + 601/2000 |d⊥⟩.
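A Python sketch (not from the book) of this full prediction, using exact fractions and the same dictionary representation as before; the names a_, b_, c_, d_ stand for the ⊥-elements.

    from fractions import Fraction as F

    wi = {'a': F(3, 5), 'a_': F(2, 5)}
    sp = {'a': {'b': F(1, 5), 'b_': F(4, 5)}, 'a_': {'b': F(3, 4), 'b_': F(1, 4)}}
    ra = {'a': {'c': F(4, 5), 'c_': F(1, 5)}, 'a_': {'c': F(1, 10), 'c_': F(9, 10)}}
    wg = {('b', 'c'): {'d': F(19, 20), 'd_': F(1, 20)},
          ('b', 'c_'): {'d': F(9, 10), 'd_': F(1, 10)},
          ('b_', 'c'): {'d': F(4, 5), 'd_': F(1, 5)},
          ('b_', 'c_'): {'d': F(0), 'd_': F(1)}}

    # Delta >> wi, then (sp (x) ra) >> (-), then wg >> (-).
    copied = {(x, x): p for x, p in wi.items()}
    joint_bc = {}
    for (x, y), p in copied.items():
        for b, q in sp[x].items():
            for c, r in ra[y].items():
                joint_bc[(b, c)] = joint_bc.get((b, c), 0) + p * q * r
    prediction = {}
    for bc, p in joint_bc.items():
        for d, q in wg[bc].items():
            prediction[d] = prediction.get(d, 0) + p * q

    print(joint_bc)    # {(b,c): 63/500, (b,c_): 147/500, (b_,c): 197/500, (b_,c_): 93/500}
    print(prediction)  # {'d': 1399/2000, 'd_': 601/2000}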

1.9.1 Redrawing Bayesian networks

We have illustrated how prediction computations for Bayesian networks can be done, basically by following the graph structure and translating it into suitable sequential and parallel compositions (◦· and ⊗) of channels. The match between the graph and the computation is not perfect, and requires some care, especially wrt. copying. Since we have a solid semantics, we like to use it in order to improve the network drawing and achieve a better match between the underlying mathematical operations and the picture. Therefore we prefer to draw Bayesian networks slightly differently, making some minor changes:

• copying is written explicitly, for instance via an explicit copy node in the binary case; in general one can have n-ary copying, for n ≥ 2;
• the relevant sets/types — like A, B, C, D, E — are not included in the nodes, but are associated with the arrows (wires) between the nodes;
• final nodes have outgoing arrows, labeled with their type.

The original Bayesian network in Figure 1.3 is then changed according to these points in Figure 1.4. In this way the nodes are clearly recognisable as channels, of type A1 × · · · × An → B, where A1, . . . , An are the types on the incoming wires, and B is the type of the outgoing wire. Initial nodes have no incoming wires, which formally leads to a channel 1 → B, where 1 is the empty product. As we have seen, such channels 1 → B correspond to distributions/states on B. In the adapted diagram one easily forms sequential and parallel compositions of channels, see the exercise below.


[Diagram: the winter node has an outgoing wire of type A into a copier; the two copies feed the sprinkler and rain nodes; the rain output of type C is copied again and goes both to wet grass — together with the sprinkler output of type B — and to slippery road; wet grass and slippery road have outgoing wires of types D and E.]

Figure 1.4 The wetness Bayesian network from Figure 1.3 redrawn in a manner that better reflects the underlying channel-based semantics, via explicit copiers and typed wires.

Exercises

1.9.1 In [20, §6.2] the (predicted) joint distribution on D × E that arises from the Bayesian network example in this section is represented as a table. It translates into:

    30,443/100,000 |d, e⟩ + 39,507/100,000 |d, e⊥⟩ + 5,957/100,000 |d⊥, e⟩ + 24,093/100,000 |d⊥, e⊥⟩.

Following the structure of the diagram in Figure 1.4, it is obtained in the present setting as:

    ((wg ⊗ sr) ◦· (id ⊗ ∆) ◦· (sp ⊗ ra) ◦· ∆) ≫ wi
        = (wg ⊗ sr) ≫ ((id ⊗ ∆) ≫ ((sp ⊗ ra) ≫ (∆ ≫ wi))).

Perform the calculations and check that this expression equals the above distribution.

(Readers with access to [20] may wish to compare the different calculation methods, using sequential and parallel composition of channels — as here — or using multiplications of tables — as in [20].)

1.10 The role of category theory

The previous sections have highlighted several structural properties of, and similarities between, the collection types list, powerset, multiset and distribution. By now readers may ask: what is the underlying structure? Surely someone must have axiomatised what makes all of this work!

Indeed, this is called category theory. It provides a foundational language for mathematics, which was first formulated in the 1940s by Samuel Eilenberg and Saunders Mac Lane (see the first overview book [68]). Category theory focuses on the structural aspects of mathematics and shows that many mathematical constructions have the same underlying structure. It brings forward many similarities between different areas (see e.g. [69]). Category theory has become very useful in (theoretical) computer science too, since it involves a clear distinction between specification and implementation, see books like [4, 6, 65, 85]. We refer to those sources for more information.

The role of category theory in capturing the mathematical essentials and establishing connections also applies to probability theory. William Lawvere, another founding father of the area, first worked in this direction. Lawvere published little himself, but this line of work was picked up, extended and published by his PhD student Giry, whose name continues in the ‘Giry monad’ G of continuous probability distributions, see Section ??. The precise source of the distribution monad D for discrete probability theory, introduced in Section 1.5, is less clear, but it can be regarded as the discrete version of G. Probabilistic automata have been studied in categorical terms as coalgebras, see e.g. [93] and [42] for general background information on coalgebra. There is a recent interest in more foundational, semantically oriented studies in probability theory, through the rise of probabilistic programming languages [34] and probabilistic Bayesian reasoning [55]. Probabilistic methods have received wider attention, for instance via the current interest in data analytics (see the essay [2]), in quantum probability [77, 16] and in cognition theory [39, 92].

Readers who know category theory will have recognised its implicit use in earlier sections. For readers who are not familiar (yet) with category theory, some basic concepts will be explained informally in this section. This is in no way a serious introduction to the area. The remainder of this book will continue to make implicit use of category theory, but will make this usage increasingly explicit. Hence it is useful to know the basic concepts of category, functor, natural transformation and monad. Category theory is sometimes seen as a difficult area to get into. But our experience is that it is easiest to learn category theory by recognising its concepts in constructions that you already know. That's why this chapter started with concrete descriptions of various collections and their use in channels. For more solid expositions of category theory we refer to the sources listed above.


1.10.1 Categories

A category is a mathematical structure given by a collection of ‘objects’ with ‘morphisms’ between them. The requirements are that these morphisms are closed under (associative) composition and that there is an identity morphism on each object. Morphisms are also called ‘maps’, and are written as f : X → Y, where X, Y are objects and f is a homomorphism from X to Y. It is tempting to think of morphisms in a category as actual functions, but there are plenty of examples where this is not the case.

A category is like an abstract context of discourse, giving a setting in which one is working, with properties of that setting depending on the category at hand. We shall give a number of examples.

1 There is the category Sets, whose objects are sets and whose morphisms are ordinary functions between them. This is a standard example.

2 One can also restrict to finite sets as objects, in the category FinSets, with functions between them. This category is more restrictive, since for instance it contains objects n = {0, 1, . . . , n − 1} for each n ∈ N, but not N itself. Also, in Sets one can take arbitrary products ∏_{i∈I} Xi of objects Xi, over arbitrary index sets I, whereas in FinSets only finite products exist. Hence FinSets is a more restrictive world.

3 Monoids and monoid maps have been mentioned in Definition 1.2.1. They can be organised in a category Mon, whose objects are monoids, and whose homomorphisms are monoid maps. We now have to check that monoid maps are closed under composition and that identity functions are monoid maps; this is easy. Many mathematical structures can be organised into categories in this way, where the morphisms preserve the relevant structure. For instance, one can form a category PoSets, with partially ordered sets (posets) as objects, and monotone functions between them as morphisms (also closed under composition, with identity).

4 For T ∈ {L, P, M, D} we can form the category Chan(T). Its objects are arbitrary sets X, but its morphisms from X to Y are T-channels, X → T(Y), written as X → Y. We have already seen that channels are closed under composition ◦· and have unit as identity, see Lemma 1.7.3. We can now say that Chan(T) is a category. In the sequel we often write Chan for Chan(D), since that is our most important example.

These categories of channels form good examples of the idea that a category forms a universe of discourse. For instance, in Chan(P) we are in the world of non-deterministic computation, whereas Chan = Chan(D) is the world of probabilistic computation.


We will encounter several more examples of categories later on in the book. Occasionally, the following construction will be used. Given a category C, a new ‘opposite’ category C^op is formed. It has the same objects as C, but its morphisms are reversed. Thus f : Y → X in C^op means f : X → Y in C.

1.10.2 Functors

Category theorists like abstraction, hence the question: if categories are so important, then why not organise them as objects themselves in a superlarge category Cat, with morphisms between them preserving the relevant structure? The latter morphisms between categories are called ‘functors’. More precisely, given categories C and D, a functor F : C → D between them consists of two mappings, both written F, sending an object X in C to an object F(X) in D, and a morphism f : X → Y in C to a morphism F(f) : F(X) → F(Y) in D. This mapping F should preserve composition and identities, as in: F(g ◦ f) = F(g) ◦ F(f) and F(idX) = idF(X).

Earlier we have already called some operations ‘functorial’ for the fact that they preserve composition and identities. We can now be a bit more precise.

1 Each T ∈ {L, P, Pfin, M, M∗, N, N∗, D, D∞} is a functor T : Sets → Sets. This has been described in the beginning of each of the sections 1.2 – 1.5.

2 Taking lists is also a functor L : Sets → Mon. This is in essence the content of Lemma 1.2.2. One can also view P, Pfin and M as functors Sets → Mon, see Lemmas 1.3.1 and 1.4.2. Even more, one can describe P, Pfin as functors Sets → PoSets, by considering each set of subsets P(X) and Pfin(X) with its subset relation ⊆ as partial order. In order to verify this claim one has to check that P(f) : P(X) → P(Y) is a morphism of posets, that is, forms a monotone function. But that's easy.

3 There is also a functor J : Sets → Chan(T), for each T. It is the identity on sets/objects: J(X) ≔ X. But it sends a function f : X → Y to the channel J(f) ≔ ⟪f⟫ = unit ◦ f : X → Y. We have seen, in Lemma 1.7.3 (4), that J(g ◦ f) = J(g) ◦ J(f) and that J(id) = id, where the latter identity id is unit in the category Chan(T). This functor J shows how to embed the world of ordinary computations (functions) into the world of computations of type T (channels).

1.10.3 Natural transformations

Let's move one further step up the abstraction ladder and look at morphisms between functors. They are called natural transformations. We have already seen examples of those as well. Given two functors F, G : C → D, a natural transformation α : F → G is a collection of maps αX : F(X) → G(X) in D, indexed by objects X in C. Naturality means that α works in the same way on all objects and is expressed as follows: for each morphism f : X → Y in C, the naturality rectangle in D commutes:

    αY ◦ F(f) = G(f) ◦ αX : F(X) → G(Y).

We briefly review some of the examples of natural transformations that we have seen.

1 The various support maps can now be described as natural transformations supp : L → Pfin, supp : M → Pfin and supp : D → Pfin, see the overview diagram 1.5.

2 The naturality of the frequentist learning map Flrn : M∗ → D has occurred earlier as an explicit result, see Lemma 1.6.2.

3 For each T ∈ {L, P, M, D} we have described maps unit : X → T(X) and flat : T(T(X)) → T(X) and have seen naturality results about them. We can now state more precisely that they are natural transformations unit : id → T and flat : (T ◦ T) → T. Here we have used id as the identity functor Sets → Sets, and T ◦ T as the composite of T with itself, also as a functor Sets → Sets.

Monads

A monad on a category C is a functor T : C → C that comes with two natural transformations unit : id → T and flat : (T ◦ T) → T satisfying flat ◦ unit = id = flat ◦ T(unit) and flat ◦ flat = flat ◦ T(flat). All the collection functors L, P, Pfin, M, M∗, N, N∗, D, D∞ that we have seen so far are monads, see e.g. Lemmas 1.2.3, 1.3.2 or 1.4.3. For each monad T we can form a category Chan(T) of T-channels, that capture computations of type T. In category theory this is called the Kleisli category of T. Monads have become popular in functional programming [73] as mechanisms for including special effects (e.g. for input-output, writing, side-effects, continuations) into a functional programming language². The structure of probabilistic computation is also given by monads, namely by the discrete distribution monads D, D∞ and by the continuous distribution monad G.

² See the online overview https://wiki.haskell.org/Monad_tutorials_timeline
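As an informal illustration (not from the book), here is a small Python sketch of the unit and flat operations for the discrete distribution monad D, with distributions as dictionaries; since dictionaries are not hashable, a distribution of distributions is represented here as a list of (inner distribution, probability) pairs. It checks one of the monad equations, flat ◦ T(unit) = id, on a small example.

    def unit(x):
        """The point distribution 1|x>."""
        return {x: 1.0}

    def flat(dist_of_dists):
        """Flatten a distribution of distributions (list of (dist, prob) pairs)."""
        out = {}
        for inner, p in dist_of_dists:
            for x, q in inner.items():
                out[x] = out.get(x, 0.0) + p * q
        return out

    omega = {'a': 0.25, 'b': 0.75}
    assert flat([(unit(x), p) for x, p in omega.items()]) == omega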

Exercises

1.10.1 We have seen the functor J : Sets → Chan(T). Check that there is also a functor Chan(T) → Sets in the opposite direction, which is X ↦ T(X) on objects, and c ↦ c ≫ (−) on morphisms. Check explicitly that composition is preserved, and find the earlier result that stated that fact implicitly.

1.10.2 Recall from Exercise 1.5.7 the subset N[K](X) ⊆ M(X) of natural multisets with K elements. Prove that N[K] is a functor Sets → Sets.


2

Predicates and Observables

After a general introduction in the previous chapter, the focus now lies firmly on probability. We have seen (discrete probability) distributions ω = ∑_i ri|xi⟩ in D(X) and channels X → Y, as functions X → D(Y), describing probabilistic states and computations. This section develops the tools to reason about such states and computations, via what we call observables. They are functions from X to (a subset of) the real numbers R that associate some ‘observable’ numerical information with an element x ∈ X of the sample space X of a distribution. The following table gives an overview, where X is a set (used as sample space).

    name                                            type
    observable / utility function                   X → R
    factor / potential                              X → R≥0
    (fuzzy) predicate / (soft/uncertain) evidence   X → [0, 1]
    sharp predicate / event                         X → {0, 1}        (2.1)

We shall use the term ‘observable’ as a generic expression for the entries in this table. A function X → R is thus the most general type of observable, and a sharp predicate is the most specific one. Predicates are the most appropriate observable for probabilistic logical reasoning. Often attention is restricted to subsets U ⊆ X as predicates (or events [87]), but here the fuzzy versions X → [0, 1] are the default. A technical reason for this choice is that these fuzzy predicates are closed under predicate transformation, and the sharp predicates are not, see below for details.

This chapter starts with the definition of what can be seen as probabilistic truth ω |= p, namely the validity of an observable p in a state ω. We shall see that many basic concepts can be defined in terms of validity, including mean, average, entropy, distance, (co)variance. The algebraic, logical and categorical structure of the various observables in Table 2.1 will be investigated in Section 2.2. For a factor (or predicate or event) p and a state ω, both on the same sample space, one can form a new updated (conditioned) state ω|p. It is the state ω which is updated in the light of evidence p. This updating satisfies various properties, including Bayes' rule. A theme that will run through this book is that probabilistic updating (conditioning) has ‘crossover’ influence. This means that if we have a joint state on a product sample space, then updating in one component typically changes the state in the other component (the marginal). This crossover influence is characteristic for a probabilistic setting and depends on correlation between the two components. As we shall see later, channels play an important role in this phenomenon.

In the previous chapter we have seen that a state ω on the domain X of a channel c : X → Y can be transformed into a state c ≫ ω on the channel's codomain Y. Analogous to such state transformation ≫ there is also observable transformation ≪, acting in the opposite direction: for an observable q on the codomain Y of a channel c : X → Y, there is a transformed observable c ≪ q on X. When ≪ is applied to predicates, it is called predicate transformation. It is a basic operation in programming logics. The power of conditioning becomes apparent when it is combined with transformation, especially for inference in probabilistic reasoning. We shall distinguish two forms. Forward inference involves conditioning followed by state transformation. Backward inference is observable transformation followed by conditioning. We illustrate the usefulness of these inference techniques in many examples in Sections 2.5 and 2.6, but also in Section 2.7 for Bayesian networks. The last three sections of this chapter use validity |= for the definition of distances — both between states and between predicates — in a dual manner, and for a systematic account of (co)variance and correlation.

2.1 Validity

This section introduces the basic facts and terminology for observables, as described in Table 2.1. Recall that we write Y^X for the set of functions from X to Y. We will use notations:

• Obs(X) ≔ R^X for the set of observables on X;
• Fact(X) ≔ (R≥0)^X for the factors on X;
• Pred(X) ≔ [0, 1]^X for the predicates on X;
• SPred(X) ≔ {0, 1}^X for the sharp predicates (events) on X.

There are inclusions:

    SPred(X) ⊆ Pred(X) ⊆ Fact(X) ⊆ Obs(X).

The first set SPred(X) = {0, 1}^X of sharp predicates can be identified with the powerset P(X) of subsets of X, see below. We first define some special observables.

Definition 2.1.1. Let X be an arbitrary set.

1 For a subset U ⊆ X we write 1U : X → {0, 1} for the characteristic function of U, defined as:

    1U(x) ≔ 1 for x ∈ U, and 1U(x) ≔ 0 otherwise.

This function 1U : X → [0, 1] is the (sharp) predicate associated with the subset U ⊆ X.

2 We use special notation for two extreme cases U = X and U = ∅, giving the truth predicate 1 : X → [0, 1] and the falsity predicate 0 : X → [0, 1] on X. Explicitly:

    1 ≔ 1X   and   0 ≔ 1∅   so that   1(x) = 1 and 0(x) = 0, for all x ∈ X.

3 For a singleton subset {x} we simply write 1x for 1{x}. Such functions 1x : X → [0, 1] are also called point predicates, where the element x ∈ X is seen as a point.

4 There is a (sharp) equality predicate Eq : X × X → [0, 1] defined in the obvious way as:

    Eq(x, x′) ≔ 1 if x = x′, and Eq(x, x′) ≔ 0 otherwise.

5 If the set X can be mapped into R in an obvious manner, we write this inclusion function as inclX : X ↪→ R and may consider it as a random variable on X. This applies for instance if X = n = {0, 1, . . . , n − 1}.

This mapping U ↦ 1U forms the isomorphism P(X) ≅ {0, 1}^X that we mentioned just before Definition 2.1.1. In analogy with point (3) we sometimes call a state of the form unit(x) = 1|x⟩ a point state.

Definition 2.1.2. Let X be a set with an observable p : X → R on X.

1 For a distribution/state ω = ∑_i ri|xi⟩ on X we define its validity ω |= p as:

    ω |= p ≔ ∑_i ri · p(xi) = ∑_{x∈X} ω(x) · p(x).

Elsewhere, this validity is often called expected value.

2 If X is a finite set one can define the average avg(p) of the observable p as its validity in the uniform state υX on X, i.e.,

    avg(p) ≔ υX |= p = ∑_{x∈X} p(x)/n,

where n is the number of elements of X.
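The definition can be turned into code directly. A minimal Python sketch (not from the book), with a state as a dictionary of probabilities and an observable as an ordinary function; the helper name validity is ad hoc.

    def validity(w, p):
        """w |= p = sum_x w(x) * p(x)."""
        return sum(prob * p(x) for x, prob in w.items())

    # Average of a fair dice throw: validity of the inclusion pips -> R
    # in the uniform state.
    dice = {i: 1/6 for i in range(1, 7)}
    print(validity(dice, lambda x: x))    # 3.5 (up to floating-point rounding)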

The validity ω |= p is non-negative (is in R≥0) if p is a factor and lies in the unit interval [0, 1] if p is a predicate (whether sharp or not). The fact that the multiplicities ri in a distribution ω = ∑_i ri|xi⟩ add up to one means that the validity ω |= 1 is one. Notice that for a point predicate one has ω |= 1x = ω(x) and similarly, for a point state, unit(x) |= p = p(x).

As an aside, we typically do not write brackets in equations like (ω |= p) = a, but use the convention that |= has higher precedence than =, so that the equation can simply be written as ω |= p = a. Similarly, one can have validity c ≫ ω |= p in a transformed state, which should be read as (c ≫ ω) |= p.

We add two more notions that are defined in terms of validity.

Definition 2.1.3. Let ω be a distribution/state on a set X.

1 Write I(ω) : X → R≥0 for the information content of the distribution ω, defined as:

    I(ω)(x) ≔ −log(ω(x))   where log = log2 is the logarithm with base 2.

The Shannon entropy H(ω) of ω is the validity of the factor I(ω):

    H(ω) ≔ ω |= I(ω) = −∑_x ω(x) · log(ω(x)).

(Sometimes we also use the natural logarithm ln instead of log.)

2 In the presence of a map inclX : X ↪→ R, one can define the mean mean(ω) of the distribution ω on X as the validity of inclX, considered as a random variable:

    mean(ω) ≔ ω |= inclX = ∑_{x∈X} ω(x) · x.

Example 2.1.4. 1 Let flip(0.3) = 0.3|1⟩ + 0.7|0⟩ be a biased coin. Suppose there is a game where you can throw the coin and win €100 if head (1) comes up, but you lose €50 if the outcome is tail (0). Is it a good idea to play the game?

The possible gain can be formalised as a random variable v : {0, 1} → R with v(0) = −50 and v(1) = 100. We get an answer to the above question by computing the validity:

    flip(0.3) |= v = ∑_x flip(0.3)(x) · v(x) = flip(0.3)(0) · v(0) + flip(0.3)(1) · v(1)
                   = 0.7 · −50 + 0.3 · 100 = −35 + 30 = −5.

Hence it is wiser not to play.

2 Write pips for the set {1, 2, 3, 4, 5, 6}, considered as a subset of R, via the map incl : pips ↪→ R, which in its turn is seen as an observable. The average of the latter is its validity dice |= incl for the (uniform) fair dice state dice = υpips = 1/6|1⟩ + 1/6|2⟩ + 1/6|3⟩ + 1/6|4⟩ + 1/6|5⟩ + 1/6|6⟩. This average is 3.5. It is at the same time the expected outcome after a throw of the dice.

3 Suppose that we claim that in a throw of a (fair) dice the outcome is even. How likely is this claim? We formalise it as a (sharp) predicate e : pips → [0, 1] with e(1) = e(3) = e(5) = 0 and e(2) = e(4) = e(6) = 1. Then, as expected:

    dice |= e = ∑_x dice(x) · e(x)
              = 1/6 · 0 + 1/6 · 1 + 1/6 · 0 + 1/6 · 1 + 1/6 · 0 + 1/6 · 1 = 1/2.

Now consider a more subtle claim that the even pips are more likely than the odd ones, where the precise likelihoods are described by the (non-sharp, fuzzy) predicate p : pips → [0, 1] with:

    p(1) = 1/10   p(2) = 9/10   p(3) = 3/10
    p(4) = 8/10   p(5) = 2/10   p(6) = 7/10.

This new claim p happens to be equally probable as the even claim e, since:

    dice |= p = 1/6 · 1/10 + 1/6 · 9/10 + 1/6 · 3/10 + 1/6 · 8/10 + 1/6 · 2/10 + 1/6 · 7/10
              = (1 + 9 + 3 + 8 + 2 + 7)/60 = 30/60 = 1/2.

4 Recall the binomial distribution binom[K](r) on {0, 1, . . . , K} from Example 1.5.1 (2). There is an inclusion function {0, 1, . . . , K} ↪→ R that allows us to compute the mean of the binomial distribution as:

    mean(binom[K](r)) = ∑_{0≤k≤K} binom[K](r)(k) · k
        = ∑_{0≤k≤K} (K choose k) · r^k · (1 − r)^{K−k} · k
        = ∑_{1≤k≤K} k · K!/(k! · (K − k)!) · r^k · (1 − r)^{K−k}
              since the term for k = 0 can be dropped
        = ∑_{1≤k≤K} K · (K − 1)!/((k − 1)! · ((K − 1) − (k − 1))!) · r · r^{k−1} · (1 − r)^{(K−1)−(k−1)}
        = K · r · ∑_{0≤j≤K−1} (K − 1)!/(j! · (K − 1 − j)!) · r^j · (1 − r)^{K−1−j}
        = K · r · ∑_{0≤j≤K−1} binom[K − 1](r)(j)
        = K · r.

In the two plots of binomial distributions binom[15](1/3) and binom[15](3/4) in Figure 1.2 (on page 25) one can see that the associated means 15/3 = 5 and 45/4 = 11.25 make sense.
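A Python sketch (not from the book) checking the two dice claims of item (3) and the binomial mean of item (4), with exact fractions; the helper names validity and binom are ad hoc.

    from fractions import Fraction as F
    from math import comb

    def validity(w, q):
        return sum(w[x] * q[x] for x in w)

    dice = {i: F(1, 6) for i in range(1, 7)}
    even = {1: 0, 2: 1, 3: 0, 4: 1, 5: 0, 6: 1}
    p = {1: F(1, 10), 2: F(9, 10), 3: F(3, 10),
         4: F(8, 10), 5: F(2, 10), 6: F(7, 10)}
    print(validity(dice, even), validity(dice, p))   # both 1/2

    def binom(K, r):
        return {k: comb(K, k) * r**k * (1 - r)**(K - k) for k in range(K + 1)}

    mean = sum(prob * k for k, prob in binom(15, F(1, 3)).items())
    print(mean)    # 5 = K * r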

We elaborate multinomial distributions in a separate example.

Example 2.1.5. In Exercise 1.5.7 we have suggested generalisations of the binomial distribution to the multinomial case. Here we shall describe this generalisation in detail, and describe the associated mean value calculations.

Fix a natural number K > 0. Recall that we write N[K](X) ⊆ M(X) for the subset of multisets with K elements. For ~k = (k1, . . . , km) ∈ N[K](m) we have ∑_i ki = K. One writes:

    (K choose ~k) ≔ K!/(k1! · · · km!) = K!/∏_i ki!.

For ~x = (x1, . . . , xm) ∈ D(m), so that ∑_i xi = 1, we write:

    ~x^~k ≔ x1^{k1} · · · xm^{km} = ∏_i xi^{ki}.

We use these notations to define the multinomial map:

    multinom[K] : D(m) → D(N[K](m))

as the following distribution on multisets:

    multinom[K](~x) ≔ ∑_{~k∈N[K](m)} (K choose ~k) · ~x^~k |~k⟩.

This multinom[K](~x) is a distribution on multisets: its multiplicities add up to one by the multinomial theorem,

    ∑_{~k∈N[K](m)} (K choose ~k) · ~x^~k = (x1 + · · · + xm)^K = 1^K = 1.

Our aim is to calculate the mean of this distribution multinom[K](~x). We do so via the following auxiliary result:

    ∑_{~k∈N[K](m)} ki · (K choose ~k) · ~x^~k = K · xi.        (2.2)

The proof involves some basic shifting:

    ∑_{~k∈N[K](m)} ki · (K choose ~k) · ~x^~k
        = ∑_{~k∈N[K](m), ki>0} ki · K!/(k1! · · · ki! · · · km!) · x1^{k1} · · · xi^{ki} · · · xm^{km}
        = ∑_{~k∈N[K](m), ki>0} K · (K − 1)!/(k1! · · · (ki − 1)! · · · km!) · x1^{k1} · · · xi · xi^{ki−1} · · · xm^{km}
        = K · xi · ∑_{~k∈N[K−1](m)} (K − 1)!/(k1! · · · ki! · · · km!) · x1^{k1} · · · xi^{ki} · · · xm^{km}
        = K · xi.

We can use inclusions N[K](m) ⊆ M(m) ⊆ (R≥0)^m to see multisets on m as m-tuples ~k ∈ (R≥0)^m. Thus we have m projection factors πi : N[K](m) → R≥0, for which we can obtain the validity:

    multinom[K](~x) |= πi = ∑_{~k∈N[K](m)} (K choose ~k) · ~x^~k · ki  =  K · xi,   by (2.2).

The m-tuple of these validities is sometimes considered as the mean of the multinomial multinom[K](~x).
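A Python sketch (not from the book) of the multinomial distribution on natural multisets with K elements, with a check that its mean in each coordinate is K · xi; multisets are represented as tuples of counts, which is an assumption of this sketch.

    from fractions import Fraction as F
    from math import factorial
    from itertools import product

    def multinom(K, xs):
        """Distribution over tuples (k1,...,km) with sum K, given xs in D(m)."""
        dist = {}
        for ks in product(range(K + 1), repeat=len(xs)):
            if sum(ks) == K:
                prob = F(factorial(K))
                for k, x in zip(ks, xs):
                    prob *= x ** k / factorial(k)
                dist[ks] = prob
        return dist

    xs = (F(1, 2), F(1, 3), F(1, 6))
    d = multinom(4, xs)
    print(sum(d.values()))                                   # 1
    print(sum(p * ks[0] for ks, p in d.items()), 4 * xs[0])  # both 2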

Remark 2.1.6. In the previous example we have considered a multiset as a factor. This works, since a multiset on X can be seen as a function X → R≥0 with finite support. In the same way one can regard a distribution on X, seen as a function X → [0, 1] with finite support, as a predicate on X. Thus there are inclusions:

    M(X) ⊆ Fact(X) = (R≥0)^X   and   D(X) ⊆ Pred(X) = [0, 1]^X.

The first inclusion ⊆ may actually be seen as an equality = when the set X is finite.

We are however reluctant to identify distributions with certain predicates, since they belong to different universes and have quite different algebraic properties. For instance, distributions form convex sets, whereas predicates are effect modules carrying a commutative monoid, see the next section. Keeping states and predicates apart is a matter of mathematical hygiene¹. We have already seen state transformation ≫ along a channel; it preserves the convex structure. Later on in this chapter we shall also see predicate (or: observable) transformation ≪ along a channel, in the opposite direction; this operation preserves the effect module structure on predicates.

Apart from these mathematical differences, states and predicates play entirely different roles and are understood in different ways: states play an ontological role and describe ‘states of affairs’; predicates are epistemological in nature, and describe evidence (what we know).

Whenever we make use of the above inclusions, we shall make this usage explicit.

¹ Moreover, in continuous probability there are no inclusions of states in predicates.

2.1.1 Random variables

So far we have avoided discussing the concept of ‘random variable’. It turns out to be a difficult notion for people who start studying probability theory. One often encounters phrases like: let Y be a random variable with expectation E[Y]. How should this be understood? We choose to interpret a random variable as a pair.

Definition 2.1.7. 1 A random variable on a sample space (set) X consists of two parts:

• a state ω ∈ D(X);
• an observable R : X → R.

2 The probability mass function P[R = (−)] : R → [0, 1] associated with the random variable (ω, R) is the image distribution on R given by:

    P[R = (−)] ≔ R ≫ ω = D(R)(ω).

Hence:

    P[R = r] = (R ≫ ω)(r) = R ≫ ω |= 1r.

(We use the function R : X → R as a deterministic channel in writing R ≫ ω, see Lemma 1.7.3 (4).)

3 The expected value of a random variable (ω, R) is the mean:

    mean(R ≫ ω) = ω |= R.

The last equation holds since:

    mean(R ≫ ω) = ∑_r (R ≫ ω)(r) · r                see Definition 2.1.3 (2)
                = ∑_r D(R)(ω)(r) · r                 by Lemma 1.7.3 (4)
                = ∑_r (∑_{x∈R⁻¹(r)} ω(x)) · r
                = ∑_x ω(x) · R(x)
                = ω |= R.

The interpretation of a random variable as a pair, of a state ω and an observable R, makes sense in the light of the habit in traditional probability to keep the state implicit, and thus only talk about the observable R. However, expressions like P[R = r] and E[R] only make sense in the presence of a state on the underlying sample space of the observable R. Here, we make this state explicit.

Example 2.1.8. An easy illustration is the expected value for the sum of two dice. In this situation we have an observable S : pips × pips → R, on the sample space pips = {1, 2, 3, 4, 5, 6}, given by S(i, j) = i + j. It forms a random variable together with the product dice ⊗ dice ∈ D(pips × pips) of the uniform distribution dice = υ = ∑_{i∈pips} 1/6 |i⟩.

The expected value of the random variable (dice ⊗ dice, S) is, according to Definition 2.1.7 (3),

    mean(S ≫ (dice ⊗ dice)) = dice ⊗ dice |= S
        = ∑_{i,j∈pips} (dice ⊗ dice)(i, j) · S(i, j)
        = ∑_{i,j∈pips} 1/6 · 1/6 · (i + j)
        = (∑_{i,j∈pips} i + j)/36 = 252/36 = 7.
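A short Python sketch (not from the book) of this two-dice expected value: the validity of S(i, j) = i + j in the product state, using exact fractions.

    from fractions import Fraction as F

    dice = {i: F(1, 6) for i in range(1, 7)}
    joint = {(i, j): p * q for i, p in dice.items() for j, q in dice.items()}
    expected = sum(prob * (i + j) for (i, j), prob in joint.items())
    print(expected)   # 7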

In the context of this book, and more broadly, in probabilistic programming, state transformation ≫ plays an important role. Therefore we insist on writing states explicitly, so that it is always clear in which state we are doing what. In mathematical contexts, the state/distribution is often the same (not transformed) and can be understood from the context — which justifies omitting it. That's why we shall use ω |= p in the sequel, instead of E[p].


Exercises

2.1.1 Check that the average of the set n + 1 = {0, 1, . . . , n}, considered as a random variable, is n/2.

2.1.2 In Example 2.1.8 we have seen that dice ⊗ dice |= S = 7, for the observable S : pips × pips → R given by S(x, y) = x + y.

1 Now define T : pips^3 → R by T(x, y, z) = x + y + z. Prove that dice ⊗ dice ⊗ dice |= T = 21/2.
2 Can you generalise and show that summing on pips^n yields validity 7n/2?

2.1.3 Consider an arbitrary distribution ω ∈ D(X).

1 Check that for a function f : X → Y and an observable q ∈ Obs(Y),

    f ≫ ω |= q = ω |= q ◦ f.

2 Observe that we get as a special case, for an observable q : X → R,

    q ≫ ω |= id = ω |= q,

where the identity function id : R → R is used as an observable.

2.1.4 The point of this exercise is to show that "learning after multinomial" is the identity, where "after" refers to composition of probabilistic channels.

1 Describe frequentist learning as a channel Flrn : M∗(m) → m and multinomial as a channel multinom[K] : D(m) → M∗(m), for a fixed K > 0. Further, notice that the identity function D(m) → D(m) forms a channel id : D(m) → m.
2 Use (2.2) to prove that the following triangle of channels commutes:

    Flrn ◦· multinom[K] = id : D(m) → m.

2.2 The structure of observables

This section describes the algebraic structures that the various types of observables have, without going too much into mathematical details. We don't want to turn this section into a sequence of formal definitions. Hence we present the essentials and refer to the literature for details. We include the observation that mapping a set X to observables on X is functorial, in a suitable sense. This will give rise to the notion of weakening. It plays the same (or dual) role for observables that marginalisation plays for states.

It turns out that our four types of observables are all (commutative) monoids, via multiplication / conjunction, but in different universes. The findings are summarised in the following table.

    name              type           monoid in
    observable        X → R          ordered vector spaces
    factor            X → R≥0        ordered cones
    predicate         X → [0, 1]     effect modules
    sharp predicate   X → {0, 1}     join semilattices        (2.3)

The least well-known structures in this table are effect modules. They will thus be described in greatest detail, in Subsection 2.2.3.

2.2.1 Observables

Observables are R-valued functions on a sample space which are often written as capitals X, Y, . . .. Here, these letters are typically used for sample spaces themselves. We shall use letters p, q, . . . for observables in general, and in particular for predicates. In some settings one allows observables X → R^n into the n-dimensional space of real numbers. Whenever needed, we shall use such maps as n-ary tuples ⟨p1, . . . , pn⟩ : X → R^n of random variables pi : X → R, see also Section 1.1.

Let's fix a set X, and consider the collection Obs(X) = R^X of observables on X. What structure does it have?

• Given two observables p, q ∈ Obs(X), we can add them pointwise, giving p + q ∈ Obs(X) via (p + q)(x) = p(x) + q(x).
• Given an observable p ∈ Obs(X) and a scalar s ∈ R we can form a ‘re-scaled’ observable s · p ∈ Obs(X) via (s · p)(x) = s · p(x). In this way we get −p = (−1) · p and 0 = 0 · p for any observable p, where 0 = 1∅ ∈ Obs(X) is taken from Definition 2.1.1 (2).
• For observables p, q ∈ R^X we have a partial order p ≤ q defined by: p(x) ≤ q(x) for all x ∈ X.

Together these operations of sum + and scalar multiplication · make R^X into a vector space over the real numbers, since + and · satisfy the appropriate axioms of vector spaces. Moreover, this is an ordered vector space by the third bullet.

One can restrict to bounded observables p : X → R for which there is a positive bound B ∈ R such that −B ≤ p(x) ≤ B for all x ∈ X. The collection of such bounded observables forms an order unit space [1, 51, 75, 79].

On Obs(X) = R^X there is also a commutative monoid structure (1, &), where & is pointwise multiplication: (p & q)(x) = p(x) · q(x). We prefer to write this operation as logical conjunction because that is what it is when restricted to predicates. Besides, having yet another operation that is written as dot · might be confusing. We will occasionally write p^n for p & · · · & p (n times).

2.2.2 Factors

We recall that a factor is a non-negative observable and that we write Fact(X) = (R≥0)^X for the set of factors on a set X. Factors occur in probability theory for instance in the so-called sum-product algorithms for computing marginals and posterior distributions in graphical models. These matters are postponed; here we concentrate on the mathematical properties of factors.

The set Fact(X) looks like a vector space, except that there are no negatives. Using the notation introduced in the previous subsection, we have:

    Fact(X) = {p ∈ Obs(X) | p ≥ 0}.

Factors can be added pointwise, with unit element 0 ∈ Fact(X), like random variables. But a factor p ∈ Fact(X) cannot be re-scaled with an arbitrary real number, but only with a non-negative number s ∈ R≥0, giving s · p ∈ Fact(X). Such structures are often called cones.

The monoid (1, &) on Obs(X) restricts to Fact(X), since 1 ≥ 0 and if p, q ≥ 0 then also p & q ≥ 0.

2.2.3 Predicates

We first note that the set Pred (X) = [0, 1]X of predicates on a set X containsthe falsity 0 and truth 1 random variables, which are always 0 (resp. 1). Butthere are some noteworthy differences between predicates on the one hand andobservables and factors on the other hand.

• Predicates are not closed under addition, since the sum of two numbers in[0, 1] may ly outside [0, 1]. Thus, addition of predicates is a partial opera-tion, and is then written as p > q. Thus: p > q is defined if p(x) + q(x) ≤ 1for all x ∈ X, and in that case (p > q)(x) = p(x) + q(x).

73

Page 86: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

74 Chapter 2. Predicates and Observables74 Chapter 2. Predicates and Observables74 Chapter 2. Predicates and Observables

This operation > is commutative and associative in a suitably partialsense. Moreover, it has 0 has unit element: p > 0 = p = 0 > p. This isstructure (Pred (X), 0,>) is called a partial commutative monoid.

• There is a ‘negation’ of predicates, written as p⊥, and called orthosupple-ment. It is defined as p⊥ = 1−p, that is, as p⊥(x) = 1−p(x). Then: p>p⊥ = 1and p⊥⊥ = p.

• Predicates are closed under scalar multiplication s·p, but only if one restrictsthe scalar s to be in the unit interval [0, 1]. It interacts nicely with partialaddition >, in the sense that s · (p > q) = (s · p) > (s · q).

The combination of these points means that the set Pred (X) carries the struc-ture of an effect module [41], see also [28]. These structures arose in mathe-matical physics [30] in order to axiomatise the structure of quantum predicateson Hilbert spaces.

The effect module Pred (X) also carries a commutative monoid structure forconjunction, namely (1,&). Indeed, when p, q ∈ Pred (X), then also p & q ∈Pred (X). We have p & 0 = 0 and p & (q1 > q2) = (p & q1) > (p & q2).

Since these effect structures are not so familiar, we include more formaldescriptions.

Definition 2.2.1. 1 A partial commutative monoid (PCM) consists of a set Mwith a zero element 0 ∈ M and a partial binary operation > : M × M → Msatisfying the three requirements below. They involve the notation x ⊥ y for:x > y is defined; in that case x, y are called orthogonal.

1 Commutativity: x ⊥ y implies y ⊥ x and x > y = y > x;2 Associativity: y ⊥ z and x ⊥ (y > z) implies x ⊥ y and (x > y) ⊥ z and

also x > (y > z) = (x > y) > z;3 Zero: 0 ⊥ x and 0 > x = x;

2 An effect algebra is a PCM (E, 0,>) with an orthosupplement. The latter isa total unary ‘negation’ operation (−)⊥ : E → E satisfying:

1 x⊥ ∈ E is the unique element in E with x > x⊥ = 1, where 1 = 0⊥;2 x ⊥ 1⇒ x = 0.

A homomorphism E → D of effect algebras is given by a function f : E →D between the underlying sets satisfying f (1) = 1, and if x ⊥ x′ in E thenboth f (x) ⊥ f (x′) in D and f (x > x′) = f (x) > f (x′). Effect algebras andtheir homomorphisms form a category, denoted by EA.

4 An effect module is an effect algebra E with a scalar multiplication s · x, fors ∈ [0, 1] and x ∈ E forming an action:

1 · x = x (r · s) · x = r · (s · x),

74

Page 87: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

2.2. The structure of observables 752.2. The structure of observables 752.2. The structure of observables 75

and preserving sums (that exist) in both arguments:

0 · x = 0 (r + s) · x = r · x > s · xs · 0 = 0 s · (x > y) = s · x > s · y.

We write EMod for the category of effect modules, where morphisms aremaps of effect algebras that preserve scalar multiplication (i.e. are ‘equivari-ant’).

2.2.4 Sharp predicates

The structure of the set SPred (X) = {0, 1}X is most familiar to logicians: theyform Boolean algebras. The join p ∨ q of p, q ∈ SPred (X) is the point-wisejoin (p ∨ q)(x) = p(x) ∨ q(x), where the latter disjunction ∨ is the obvious oneon {0, 1}. One can see these disjunctions (0,∨) as forming an additive structurewith negation (−)⊥. Conjunction (1,&) forms a commutative structure on top.

Formally one can say that SPred (X) also has scalar multiplication, withscalars from {0, 1}, in such a way that 0 · p = 0 and 1 · p = p.

In summary, observables all share the same multiplicative structure (1,&)for conjunction, but the additive structures and scalar multiplications differ.The latter, however, are preserved under taking validity, and not (1,&).

Lemma 2.2.2. Let ω ∈ D(X) be a distribution on a set X. Operations onobservables p, q on X satisfy, whenever defined,

1 ω |= 0 = 0;2 ω |= (p + q) = (ω |= p) + (ω |= q);3 ω |= (s · p) = s · (ω |= p).

2.2.5 Parallel products and weakening

Earlier we have seen parallel products ⊗ of states and of channels. This ⊗ canalso be defined for observables, and is then often called parallel conjunction.The difference between parallel conjunction ⊗ and sequential conjunction &is that ⊗ acts on observables on different sets X,Y and yields an outcome onthe product set X × Y , whereas & works for observables on the same set Z,and produces a conjunction observable again on Z. These ⊗ and & are inter-definable, via transformation� of observables, see Section 2.4 — in particularExercise 2.4.5.

75

Page 88: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

76 Chapter 2. Predicates and Observables76 Chapter 2. Predicates and Observables76 Chapter 2. Predicates and Observables

Definition 2.2.3. 1 Let p be an observable on a set X, and q on Y . Then wedefine a new observable p ⊗ q on X × Y by:(

p ⊗ q)(x, y) B p(x) · q(x).

2 Suppose we have an observable p on a set X and we like to use p on theproduct X ×Y . This can be done by taking p⊗ 1 instead. This p⊗ 1 is calleda weakening of p. It satisfies (p ⊗ 1)(x, y) = p(x).

More generally, for a projection πi : X1 × · · · ×Xn → Xi and an observablep on Xi, we can take as weakening of p the predicate on X1 × · · · × Xn givenby:

1 ⊗ · · · ⊗ 1︸ ︷︷ ︸i−1 times

⊗ p ⊗ 1 ⊗ · · · ⊗ 1︸ ︷︷ ︸n−i times

.

Weakening is a structural operation in logic which makes it possible to usea predicate p(x) depending on a single variable x in a larger context whereone has for instance two variables x, y by ignoring the additional variable y.Weakening is usually not an explicit operation, except in settings like linearlogic where one has to be careful about the use of resources. Here, we needweakening as an explicit operation in order to avoid type mismatches betweenobservables and underlying sets.

Recall that marginalisation of states is an operation that moves a state toa smaller underlying set by projecting away. Weakening can be seen as adual operation, moving an observable to a larger context. There is a closerelationship between marginalisation and weakening via validity: for a stateω ∈ D(X1 × · · · × Xn) and an observable p on Xi we have:

ω |= 1 ⊗ · · · ⊗ 1 ⊗ p ⊗ 1 ⊗ · · · ⊗ 1 = D(πi)(ω) |= p= ω

[0, . . . , 0, 1, 0, . . . , 0

]|= p,

(2.4)

where the 1 in the latter marginalisation mask is at position i. Soon we shall seean alternative description of weakening in terms of predicate transformation.The above equation then appears as a special case of a more general result,namely of Proposition 2.4.3.

2.2.6 Functoriality

We like to conclude this section with some categorical observations. They arenot immediately relevant for the sequel and may be skipped. We shall be usingfour (new) categories:

• Vect, with vector spaces (over the real numbers) as objects and linear mapsas morphisms between them (preserving addition and scalar multiplication);

76

Page 89: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

2.2. The structure of observables 772.2. The structure of observables 772.2. The structure of observables 77

• Cone, with cones as objects and also with linear maps as morphisms, butthis time preserving scalar multiplication with non-negative reals only;

• EMod, with effect modules as objects and homomorphisms of effect mod-ules as maps between them (see Definition 2.2.1);

• BA, with Boolean algebras as objects and homomorphisms of Boolean al-gebras (preserving finite joins and negations, and then also finite meets).

Recall from Subsection 1.10.1 that we write Cop for the opposite of categoryC, with arrows reversed. This opposite is needed in the following result.

Proposition 2.2.4. Taking particular observables on a set is functorial: thereare functors:

1 Obs : Sets→ Vectop;2 Fact : Sets→ Coneop;3 Pred : Sets→ EModop;4 SPred : Sets→ BAop.

On maps f : X → Y in Sets these functors are all defined by the ‘pre-composewith f ’ operation q 7→ q ◦ f . They thus reverse the direction of morphisms,which necessitates the use of opposite categories (−)op.

The above functors all preserve the partial order on observables and alsothe commutative monoid structure (1,&), since they are defined point-wise.

Proof. We consider the first instance of observables in some detail. The othercases are similar. For a set X we have have seen that Obs(X) = RX is a vectorspace, and thus an object of the category Vect. Each function f : X → Y in Setsgives rise to a function Obs( f ) : Obs(Y)→ Obs(X) in the opposite direction. Itmaps an observable q : Y → R on Y to the observable q ◦ f : X → R on X. It isnot hard to see that this function Obs( f ) = (−) ◦ f preserves the vector spacestructure. For instance, it preserves sums, since they are defined point-wise.We shall prove this in a precise, formal manner. First Obs( f )(0) is the functionthat map x ∈ X to:

Obs( f )(0)(x) = (0 ◦ f )(x) = 0( f (x)) = 0.

Hence Obs( f )(0) maps everything to 0 and is thus equal to the zero functionitself: Obs( f )(0) = 0. Next, addition + is preserved since:

Obs( f )(p + q) = (p + q) ◦ f=

[x 7→ (p + q)( f (x))

]=

[x 7→ p( f (x)) + q( f (x))

]=

[x 7→ Obs( f )(p)(x) + Obs( f )(q)(x)

]= Obs( f )(p) + Obs( f )(q).

77

Page 90: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

78 Chapter 2. Predicates and Observables78 Chapter 2. Predicates and Observables78 Chapter 2. Predicates and Observables

We leave preservation of scalar multiplication to the reader and conclude thatObs( f ) is a linear function, and thus a morphism Obs( f ) : Obs(Y) → Obs(X)in Vect. Hence Obs( f ) is a morphism Obs(X) → Obs(Y) in the opposite cat-egory Vectop. We still need to check that identity maps and composition ispreserved. We do the latter. For f : X → Y and g : Y → Z in Sets we have, forr ∈ Obs(Z),

Obs(g ◦ f )(r) = r ◦ (g ◦ f ) = (r ◦ g) ◦ f= Obs(g)(r) ◦ f= Obs( f )

(Obs(g)(r)

)=

(Obs( f ) ◦ Obs(g)

)(r)

=(Obs(g) ◦op Obs( f )

)(r).

This yields Obs(g ◦ f ) = Obs(g) ◦op Obs( f ), so that we get a functor of theform Obs : Sets→ Vectop.

Notice that saying that we have a functor like Pred : Sets → EModop con-tains remarkably much information, about the mathematical structure on ob-jects Pred (X), about preservation of this structure by maps Pred ( f ), and aboutpreservation of identity maps and composition by Pred (−) on morphisms. Thismakes the language of category theory both powerful and efficient.

Exercises

2.2.1 Find examples of predicates p, q on a set X and a distribution ω on Xsuch that ω |= p & q and (ω |= p) · (ω |= q) are different.

2.2.2 1 Check that (sharp) indicator predicates 1E : X → [0, 1], for subsetsE ⊆ X, satisfy:

• 1E∩D = 1E & 1D;• 1E∪D = 1E > 1D, if E,D are disjoint;• (1E)⊥ = 1¬E , where ¬E = {x ∈ X | x < E} is the complement of

E.

Formally, the function 1(−) : P(X) → Pred (X) = [0, 1]X is a homo-morphism of effect algebras, see [28, 41] for details.

2 Now consider subsets E ⊆ X and D ⊆ Y of different sets X,Ytogether with their product subset E×D ⊆ X×Y . Show that 1E×D =

1E ⊗ 1D, with as special case 1(x,y) = 1x ⊗ 1y.2.2.3 Show that 1 ⊗ 1 = 1.2.2.4 1 Let pi, qi be observables on Xi. Prove that:(

p1 ⊗ · · · ⊗ pn)

&(q1 ⊗ · · · ⊗ qn

)= (p1 & q1) ⊗ · · · ⊗ (pn & qn).

78

Page 91: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

2.3. Conditioning 792.3. Conditioning 792.3. Conditioning 79

2 Conclude that parallel composition p ⊗ q of observables can bedefined in terms of weakening and conjunction, namely as (p⊗1) &(1 ⊗ q).

2.2.5 Prove that(σ ⊗ τ |= p ⊗ q

)=

(σ |= p

)·(τ |= q

).

2.2.6 Check that the weakening 1 ⊗ · · · ⊗ 1 ⊗ p ⊗ 1 ⊗ · · · ⊗ 1 of p in Defini-tion 2.2.3 (2) can also be described as Pred (πi)(p), using the functo-riality of Pred from Proposition 2.2.4.

2.2.7 Let p be a predicate on a finite set X = {x1, . . . , xn}. Check that:

p = >x∈X

p(x) · 1x = p(x1) · 1x1 > · · ·> p(xn) · 1xn .

2.2.8 For a random variable (ω, p), show that the validity of the observablep − (ω |= p) · 1 is zero, i.e.,

ω |=(p − (ω |= p) · 1

)= 0.

2.3 Conditioning

What we call conditioning is sometimes called update, or belief update. It is theincorporation of evidence into a distribution, where the evidence is given by apredicate (or more generally by a factor). This section describes the definitionand basic results, including Bayes’ theorem. The relevance of conditioning inprobabilistic reasoning will be demonstrated in many examples later on in thischapter.

Definition 2.3.1. Let ω ∈ D(X) be a distribution on a sample space X andp ∈ Fact(X) = (R≥0)X be a factor, on the same space X.

1 If the validity ω |= p is non-zero, we define a new distribution ω|p ∈ D(X)as:

ω|p(x) Bω(x) · p(x)ω |= p

i.e. ω|p =∑

x

ω(x) · p(x)ω |= p

∣∣∣ x⟩.

This distribution ω|p may be pronounced as: ω, given p. It is the result ofconditioning/updating ω with p as evidence.

2 The conditional expectation or conditional validity of an observable q on X,given p and ω, is the validity:

ω|p |= q.

79

Page 92: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

80 Chapter 2. Predicates and Observables80 Chapter 2. Predicates and Observables80 Chapter 2. Predicates and Observables

3 For a channel c : X → Y and a factor q on Y we define the updated channelc|q : X → Y via pointwise updating:

c|q(x) B c(x)|q.

In writing c|q we assume that validity c(x) |= q = (c � q)(x) is non-zero, foreach x ∈ X.

Often, the state ω before updating is called the prior or the a priori state,whereas the state ω|p is called the posterior or the a posteriori state.

The conditioning c|q of a channel in point (3) is in fact a generalisation of theconditioning of a state ω|p in point (1), since the state ω ∈ D(X) can be seenas a channel ω : 1 → X with a one-element set 1 = {0} as domain. We shallsee the usefulness of conditioning of channels in Subsection 2.5.1. For now weonly notice that updating the identity channel has no effect: unit |q = unit .

The conditional probability notation P(E | D) for events E,D ⊆ X corre-sponds to the validityω|1D |= 1E of the sharp predicate 1E in the stateω updatedwith the sharp predicate 1D. Indeed,

ω|1D |= 1E =∑

xω|1D (x) · 1E(x)

=∑

x∈E

ω(x) · 1D(x)ω |= 1D

=

∑x∈E∩D ω(x)∑

x∈D ω(x)=

P(E ∩ D)P(D)

= P(D | E).

The formulation ω|p of conditioning that is used above involves no restrictionto sharp predicates, but works much more generally for fuzzy predicates/factorsp. This is sometimes called updating with soft or uncertain evidence [11, 20,49]. It is what we use right from the start.

Example 2.3.2. 1 Let’s take the numbers of a dice as sample space: pips =

{1, 2, 3, 4, 5, 6}, with a fair/uniform dice distribution dice = υpips = 16 |1〉 +

16 |2〉 + 1

6 |3〉 + 16 |4〉 + 1

6 |5〉 + 16 |6〉. We consider the predicate evenish ∈

Pred (pips) = [0, 1]pips expressing that we are fairly certain of the outcomebeing even:

evenish (1) = 15 evenish (3) = 1

10 evenish (5) = 110

evenish (2) = 910 evenish (4) = 9

10 evenish (6) = 45

80

Page 93: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

2.3. Conditioning 812.3. Conditioning 812.3. Conditioning 81

We first compute the validity of evenish for our fair dice:

dice |= evenish =∑

x dice(x) · evenish (x)

= 16 ·

15 + 1

6 ·9

10 + 16 ·

110 + 1

6 ·9

10 + 16 ·

110 + 1

6 ·45

=2+9+1+9+1+8

60=

1

2.

If we take evenish as evidence, we can update our state and get:

dice∣∣∣evenish =

∑x

dice(x) · evenish (x)dice |= evenish

∣∣∣ x⟩=

1/6 · 1/5

1/2|1〉 +

1/6 · 9/10

1/2|2〉 +

1/6 · 1/10

1/2|3〉

+1/6 · 9/10

1/2|4〉 +

1/6 · 1/10

1/2|5〉 +

1/6 · 4/5

1/2|6〉

= 115 |1〉 +

310 |2〉 +

130 |3〉 +

310 |4〉 +

130 |5〉 +

415 |6〉.

As expected, the probabilities of the even pips is now, in the posterior, higherthan the odd ones.

2 The following alarm example is due to Pearl [80]. It involves an ‘alarm’ setA = {a, a⊥} and a ‘burglary’ set B = {b, b⊥}, with the following a priori jointdistribution ω ∈ D(A × B).

0.000095|a, b〉 + 0.009999|a, b⊥ 〉 + 0.000005|a⊥, b〉 + 0.989901|a⊥, b⊥ 〉.

The a priori burglary distribution is the second marginal:

ω[0, 1

]= 0.0001|b〉 + 0.9999|b⊥ 〉.

Someone reports that the alarm went off, but with only 80% certaintybecause of deafness. This can be described as a predicate p : A → [0, 1]with p(a) = 0.8 and p(a⊥) = 0.2. We see that there is a mismatch betweenthe p on A and ω on A × B. This can be solved via weakening p to p ⊗ 1, sothat it becomes a predicate on A × B. Then we can take the update:

ω|p⊗1

In order to do so we first compute the validity:

ω |= p ⊗ 1 =∑

x,y ω(x, y) · p(x) = 0.206.

We can then compute the updated joint state as:

ω|p⊗1 =∑

x,y

ω(x, y) · p(x)ω |= p

∣∣∣ x, y⟩= 0.0003688|a, b〉 + 0.03882|a, b⊥ 〉

+ 0.000004853|a⊥, b〉 + 0.9608|a⊥, b⊥ 〉.

81

Page 94: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

82 Chapter 2. Predicates and Observables82 Chapter 2. Predicates and Observables82 Chapter 2. Predicates and Observables

The resulting posterior burglary distribution — with the alarm evidencetaken into account — is obtained by taking the second marginal of the up-dated distribution:(

ω|p⊗1)[

0, 1]

= 0.0004|b〉 + 0.9996|b⊥ 〉.

We see that the burglary probability is four times higher in the posteriorthan in the prior. What happens is noteworthy: evidence about one compo-nent A changes the probabilities in another component B. This ‘crossoverinfluence’ (in the terminology of [54]) or ‘crossover inference’ happens pre-cisely because the joint distribution ω is entwined, so that the different partscan influence each other. We shall return to this theme repeatedly, see forinstance in Corollary 2.5.9 below.

One of the main results for conditioning is Bayes’ theorem. We present ithere for factors, and not just for sharp predicates (events).

Proposition 2.3.3. Let ω be distribution on a sample space X, and let p, q befactors on X.

1 The product rule holds:

ω|p |= q =ω |= p & qω |= p

.

2 Bayes’ rule holds:

ω|p |= q =(ω|q |= p) · (ω |= q)

ω |= p.

This result carefully distinguishes a product rule, in point 1, and Bayes’rule, in point (2). This distinction is not always made, and the product rule issometimes also called Bayes’ rule.

Proof. 1 We straightforwardly compute:

ω|p |= q =∑

x ω|p(x) · q(x) =∑

xω(x) · p(x)ω |= p

· q(x)

=

∑x ω(x) · p(x) · q(x)

ω |= p

=

∑x ω(x) · (p & q)(x)

ω |= p=ω |= p & qω |= p

.

2 This follows directly by using the previous point twice, in combination withthe commutativity of &, in:

ω|p |= q(1)=

ω |= p & qω |= p

=ω |= q & pω |= p

(1)=

(ω|q |= p) · (ω |= q)ω |= p

.

82

Page 95: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

2.3. Conditioning 832.3. Conditioning 832.3. Conditioning 83

Example 2.3.4. We instantiate Proposition 2.3.3 with sharp predicates 1E , 1D

for subsets/events E,D ⊆ X. Then the familiar formulations of the prod-uct/Bayes rule appear.

1 The product rule specialises to the definition of conditional probability:

P(E | D) = ω|1D |= 1E =ω |= 1D & 1E

ω |= 1D=ω |= 1D∩E

P(D)=

P(D ∩ E)P(D)

.

2 Bayes rule, in the general formulation of Proposition 2.3.3 (2) specialisesto:

P(E | D) = ω|1D |= 1E =(ω|1E |= 1D) · (ω |= 1D)

ω |= 1E=

P(D | E) · P(E)P(D)

.

We add a few more basic facts about conditioning.

Lemma 2.3.5. Let ω be distribution on X, with factors p, q ∈ Fact(X).

1 Conditioning with truth has no effect:

ω|1 = ω.

2 Conditioning with a point predicate gives a point state: for a ∈ X,

ω|1a = 1|a〉.

3 Iterated conditionings commute:(ω|p

)|q = ω|p&q =

(ω|q

)|p.

4 Conditioning is stable under multiplication of the factor with a scalar s > 0:

ω|s·p = ω|p.

5 Conditioning can be done component-wise, for product states and factors:

(σ ⊗ τ)∣∣∣(p⊗q) =

(σ|q

)⊗

(τ|p

).

6 Marginalisation of a conditioning with a (similarly) weakened predicate isconditioning of the marginalised state:

ω|1⊗q[0, 1

]= ω

[0, 1

]∣∣∣q.

7 For a function f : X → Y, used as deterministic channel,(f � ω

)∣∣∣q = f �

(ω∣∣∣q◦ f

).

Proof. 1 Trivial since ω |= 1 = 1.2 Assuming ω(a) , 0 we get for each x ∈ X,

ω|1a (x) =ω(x) · 1a(x)ω |= 1a

=ω(a) · 1|a〉(x)

ω(a)= 1|a〉(x).

83

Page 96: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

84 Chapter 2. Predicates and Observables84 Chapter 2. Predicates and Observables84 Chapter 2. Predicates and Observables

3 It suffices to prove:(ω|p

)|q(x) =

ω|p(x) · q(x)ω|p |= q

=ω(x)·p(x)/ω|=p · q(x)

ω|=p&q/ω|=pby Proposition 2.3.3 (1)

=ω(x) · (p & q)(x)ω |= p & q

= ω|p&q(x).

4 First we have:

ω |= s · p =∑

x ω(x) · (s · p)(x) =∑

x ω(x) · s · p(x)= s ·

∑x ω(x) · p(x) = s · (ω |= p).

Next:

ω|s·p(x) =ω(x) · (s · p)(x)ω |= s · p

=ω(x) · s · p(x)

s · (ω |= p)=ω(x) · p(x)ω |= p

= ω|p(x).

5 For states σ ∈ D(X), τ ∈ D(Y) and factors p on X and q on Y one has:((σ ⊗ τ)

∣∣∣(p⊗q)

)(x, y) =

(σ ⊗ τ)(x, y) · (p ⊗ q)(x, y)(σ ⊗ τ) |= (p ⊗ q)

=σ(x) · τ(y) · p(x) · q(y)

(σ |= p) · (τ |= q)by Exercise 2.2.5

=σ(x) · p(x)σ |= p

·τ(y) · q(y)τ |= q

=(σ|p

)(x) ·

(τ|q

)(y)

=((σ|p

)⊗

(τ|q

))(x, y).

6 Let ω ∈ D(X × Y) and q be a factor on Y; then:(ω|1⊗q

[0, 1

])(y) =

∑xω|1⊗q(x, y)

=∑

x

ω(x, y) · (1 ⊗ q)(x, y)ω |= 1 ⊗ q

(2.4)=

(ω[0, 1

])(y) · q(y)

ω[0, 1

]|= q

=(ω[0, 1

]∣∣∣q

)(y).

7 For f : X → Y , ω ∈ D(X) and q ∈ Fact(Y),(( f � ω)|q

)(y) = D( f )(ω)|q(y) =

D( f )(ω)(y) · q(y)D( f )(ω) |= q

=∑

x∈ f −1(y)

ω(x) · q(y)∑y,x∈ f −1(y) ω(x) · p(y)

=∑

x∈ f −1(y)

ω(x) · q( f (x))∑x ω(x) · q( f (x))

=∑

x∈ f −1(y)ω|q◦ f (x)

= D( f )(ω|q◦ f

)(y) =

(f � (ω|q◦ f )

)(y).

84

Page 97: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

2.3. Conditioning 852.3. Conditioning 852.3. Conditioning 85

We conclude with the use of conditioning in definining a draw from a subset.It gives a refinement of the draw operation D from Definition 1.6.3.

Definition 2.3.6. Let U ⊆ X be a non-empty subset. It gives rise to a ‘drawfrom U’ mapping

N∗(X)DU // D

(X × N(X)

)given by:

DU(ϕ) B D(ϕ)∣∣∣1U⊗1 =

∑x∈U

Flrn(ϕ)(x)∑y∈U Flrn(ϕ)(y)

∣∣∣ x,Dx(ϕ)⟩,

Notice that DX = D, for the largest subset X ⊆ X.

Example 2.3.7. The so-called Monty Hall problem is a famous riddle in prob-ability due to [91], see also e.g. [36, 97]:

Suppose you’re on a game show, and you’re given the choice of three doors: Behind onedoor is a car; behind the others, goats. You pick a door, say No. 1, and the host, whoknows what’s behind the doors, opens another door, say No. 3, which has a goat. Hethen says to you, “Do you want to pick door No. 2?” Is it to your advantage to switchyour choice?

We describe the three doors via a set X = {1, 2, 3} and the situation in which tomake the first choice as a multiset ϕ = 1|1〉 + 1|2〉 + 1|3〉. One may also viewϕ as an urn from which to draw numbered balls. Let’s assume without loss ofgenerality that the car is behind door 2. We shall represent this logically byusing a (sharp) ‘goat’ predicate 1G : {1, 2, 3} → [0, 1], for G = {1, 3}.

We analyse the situation via the draw maps D from Definition 1.6.3 and DG

from Definition 2.3.6. This D gives, after your first choice, the first distributionbelow, on {1, 2, 3} ×M({1, 2, 3}).

D(ϕ) = 13

∣∣∣1, 1|2〉 + 1|3〉⟩

+ 13

∣∣∣2, 1|1〉 + 1|3〉⟩

+ 13

∣∣∣3, 1|1〉 + 1|2〉⟩.

In the outer ket expressions∣∣∣ · · · , · · · ⟩ the first element is your choice, and the

second element is the multiset with remaining options.The host’s opening of a door with a goat is modeled by another draw, but

now using the draw map DG from Definition 2.3.6, which ensures that theoutcome is in G = {1, 3}. It yields on the three multisets obtained above:

DG(1|2〉 + 1|3〉

)= 1|3, 1|2〉〉

DG(1|1〉 + 1|3〉

)= 1

2 |1, 1|3〉〉 +12 |3, 1|1〉〉

DG(1|1〉 + 1|2〉

)= 1|1, 1|2〉〉.

85

Page 98: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

86 Chapter 2. Predicates and Observables86 Chapter 2. Predicates and Observables86 Chapter 2. Predicates and Observables

Your choice, followed by the host’s choice, can be described via channel com-position, like in Example 1.8.4:(

(id ⊗ DG) ◦· D)(ϕ)

= 13

∣∣∣1, 3, 1|2〉⟩ + 16

∣∣∣2, 1, 1|3〉⟩ + 16

∣∣∣2, 3, 1|1〉⟩ + 13

∣∣∣3, 1, 1|2〉⟩.Inside

∣∣∣ − ⟩the first component is your choice, the second component is the

host’s choice, and the third component contains the remaining option.Inspection of this outcome shows that in two out of three cases — when you

choose 1 or 3, in the first and last term — it makes sense to change your choice,because the car is behind the remaining door 2. Hence changing is better thansticking to your original choice.

We have given a formal account of the situation. More informally, the hostknows where the car is, so his choice is not arbitrary. By opening a door witha goat behind it, the host is giving you information that you can exploit toimprove your choice: two of your possible choices are wrong, but in those twoout three cases the host gives you information how to correct your choice.

Exercises

2.3.1 In the setting of Example 2.3.2 (1) define a new predicate oddish =

evenish⊥ = 1 − evenish .

1 Compute dice|oddish

2 Prove the equation below, involving a convex sum of states on theleft-hand-side (see before Subsection 1.5.1).

(dice |= evenish ) · ω|evenish + (dice |= oddish ) · ω|oddish = dice.

2.3.2 Let p1, . . . , pn be an n-tuple of predicates on X with p1 > · · ·> pn = 1.Such a tuple is also called an n-test or simply a test. Let ω ∈ D(X)and q ∈ Fact(X).

1 Check that 1 =∑

i ω |= pi.2 Prove what is called the law of total probability:

ω =∑

i (ω |= pi) · ω|pi . (2.5)

What happens to the expression on the right-hand-side if one ofthe pi has validity zero? Check that this equation generalises Exer-cise 2.3.1.(The expression on the right-hand-side in (2.5) is used to turn a testinto a ‘denotation’ functionD(X)→ D(D(X)) in [71, 72], namelyas ω 7→

∑i (ω |= pi)

∣∣∣ω|pi

⟩. This proces is described more abstractly

in terms of ‘hyper normalisation’ in [43].)

86

Page 99: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

2.3. Conditioning 872.3. Conditioning 872.3. Conditioning 87

3 Show that:

ω |= q =∑

i ω |= q & pi.

4 Prove now:

ω|q |= pi =ω |= q & pi∑j ω |= q & p j

.

2.3.3 Check that an n-test on a set X, as described in the previous exercise,can be identified with a channel X → n.

Check that this generalises Exercise 2.4.8, where a single predicatep on X can be identified with a channel X → 2, by considering p as a2-test p, p⊥.

2.3.4 Let ω ∈ D(X) with x ∈ supp(ω). Prove that:

ω |= 1x = ω(x) and ω|1x = 1| x〉 if ω(x) , 0.

2.3.5 Let c : X → Y be a channel, with state ω ∈ D(X × Z).

1 Prove that for a factor p ∈ Fact(Z),((c ⊗ id) � ω

)∣∣∣1⊗p = (c ⊗ id) �

(ω∣∣∣1⊗p

).

2 Show also that for q ∈ Fact(Y × Z),(((c ⊗ id) � ω

)∣∣∣q

)[0, 1

]=

(ω∣∣∣(c⊗id)�q

)[0, 1

].

2.3.6 This exercise will demonstrate that conditioning may both create andremove entwinedness.

1 Write yes = 11 : 2→ [0, 1], where 2 = {0, 1}, and no = yes⊥ = 10.Prove that the following conditioning of a non-entwined state,

τ B (flip ⊗ flip)∣∣∣(yes⊗yes)>(no⊗no)

is entwined.2 Consider the state ω ∈ D(2 × 2 × 2) given by:

ω = 118 |111〉 + 1

9 |110〉 + 29 |101〉 + 1

9 |100〉+ 1

9 |011〉 + 29 |010〉 + 1

9 |001〉 + 118 |000〉

Prove that ω’s first and third component are entwined:

ω[1, 0, 1

], ω

[1, 0, 0

]⊗ ω

[0, 0, 1

].

3 Now let ρ be the following conditioning of ω:

ρ B ω|1⊗yes⊗1.

Prove that ρ’s first and third component are non-entwined.

87

Page 100: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

88 Chapter 2. Predicates and Observables88 Chapter 2. Predicates and Observables88 Chapter 2. Predicates and Observables

The phenomenon that entwined states become non-entwined via con-ditioning is called screening-off, whereas the opposite, non-entwinedstates becoming entwined via conditioning, is called explaining away.

2.3.7 Show that for ω ∈ D(X) and p1, p2 ∈ Fact(X) one has:(∆ � ω

)∣∣∣p1⊗p2

= ∆ �(ω|p1&p2

).

Note that this is a consequence of Lemma 2.3.5 (7).2.3.8 We have mentioned (right after Definition 2.3.1) that updating of the

identity channel has no effect. Prove more generally that for an ordi-nary function f : X → Y , updating the associated deterministic chan-nel � f � : X → Y has no effect:

� f �|q = � f �.

2.3.9 Let c : Z → X and d : Z → Y be two channels with a common domainZ, and with factors p ∈ Fact(X) and q ∈ Fact(Y) on their codomains.Prove that the update of a tuple channel is the tuple of the updates:

〈c, d〉|p⊗q = 〈c|p, d|q〉.

Prove also that for e : U → X and f : V → Y ,

(e ⊗ f )|p⊗q = (e|p) ⊗ ( f |q).

2.4 Transformation of observables

One of the basic operations that we have seen so far is state transformation�.It is used to transform a state ω on the domain X of a channel c : X → Y into astate c � ω on the codomain Y of the channel. This section introduces trans-formation of observables�. It works in the opposite direction of a channel: anobservable q on the codomain Y of the channel is transformed into an observ-able c � q on the domain X of the channel. Thus, state transformation worksforwardly, in the direction of the channel, whereas observable transformation� works backwardly, against the direction of the channel. This operation� isoften applied only to predicates — and then called predicate transformation —but here we apply it more generally to observables.

This section introduces observable transformation� and lists its key math-ematical properties. In subsequent sections it will be used for probabilistic rea-soning.

88

Page 101: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

2.4. Transformation of observables 892.4. Transformation of observables 892.4. Transformation of observables 89

Definition 2.4.1. Let c : X → Y be a channel. An observable q ∈ Obs(Y) istransformed into c � q ∈ Obs(X) via the definition:(

c � q)(x) B c(x) |= q =

∑y

c(x)(y) · q(y).

There is a whole series of basic facts about�.

Lemma 2.4.2. 1 The operation c � (−) : Obs(Y) → Obs(X) of transform-ing random variables along a channel c : X → Y restricts to factors c �(−) : Fact(Y) → Fact(X), and also to fuzzy predicates c � (−) : Pred (Y) →Pred (X), but not to sharp predicates.

2 Observable transformation c � (−) is linear: it preserves sums (0,+) ofobservables and scalar multiplication of observables.

3 Observable transformation preserves truth 1, but not conjunction &.4 Observable transformation preserves the (point-wise) order on observables:

q1 ≤ q2 implies (c � q1) ≤ (c � q2).5 Transformation along the unit channel is the identity: unit � q = q.6 Transformation along a composite channel is successive transformation:

(d ◦· c) � q = c � (d � q).7 Transformation along a trivial, deterministic channel is pre-composition:

f � q = q ◦ f .

Proof. 1 If q ∈ Fact(Y) then q(y) ≥ 0 for all y. But then also (c � q)(x) =∑y c(x)(y) · q(y) ≥ 0, so that c � q ∈ Fact(X). If in addition q ∈ Pred (Y), so

that q(y) ≤ 1 for all y, then also (c � q)(x) =∑

y c(x)(y) ·q(y) ≤∑

y c(x)(y) =

1, since c(x) ∈ D(Y), so that c � q ∈ Pred (X).The fact that a transformation c � p of a sharp predicate p need not be

sharp is demonstrated in Exercise 2.4.2.2 Easy.3 (c � 1)(x) =

∑y c(x)(y) · 1(y) =

∑y c(x)(y) · 1 = 1. The fact that & is not

preserved follows from Exercise 2.4.1.4 Easy.5 Recall that unit(x) = 1| x〉, so that (unit � q)(x) =

∑y unit(x)(y) · q(y) =

q(x).6 For c : X → Y and d : Y → Z and q ∈ Obs(Z) we have:(

(d ◦· c) � q)(x) =

∑z(d ◦· c)(x)(z) · q(z)

=∑

z(∑

y c(x)(y) · d(y)(z))· q(z)

=∑

y c(x)(y) ·(∑

z d(y)(z) · q(z))

=∑

y c(x)(y) ·(d � q

)(y)

=(c � (d � q)

)(x).

89

Page 102: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

90 Chapter 2. Predicates and Observables90 Chapter 2. Predicates and Observables90 Chapter 2. Predicates and Observables

7 For a function f : X → Y ,(� f � � q

)(x) =

∑y unit( f (x))(y) · q(y) = q( f (x)) = (q ◦ f )(x).

There is the following fundamental relationship between transformations�,� and validity |=.

Proposition 2.4.3. Let c : X → Y be a channel with a state ω ∈ D(X) on itsdomain and an observable q ∈ Obs(Y) on its codomain. Then:

c � ω |= q = ω |= c � q. (2.6)

Proof. The result follows simply by unpacking the relevant definitions:

c � ω |= q =∑

y (c � ω)(y) · q(y) =∑

y(∑

x c(x)(y) · ω(x))· q(y)

=∑

x ω(x) ·(∑

y c(x)(y) · q(y))

=∑

x ω(x) · (c � q)(x)= ω |= c � q.

We have already seen several instances of this basic result.

• Earlier we mentioned that marginalisation (of states) and weakening (of ob-servables) are dual to each other, see Equation (2.4). We can now see this asan instance of (2.6), using a projection πi : X1 × · · · × Xn → Xi as (trivial)channel, in:

πi � ω |= p = ω |= πi � p

On the left-hand-side the i-th marginal πi � ω ∈ D(Xi) of state ω ∈ D(X1 ×

· · · ×Xn) is used, whereas on the right-hand-side the observable p ∈ Obs(Xi)is weakenend to πi � p on the product set X1 × · · · × Xn.

• The first equation in Exercise 2.1.3 is also an instance of (2.6), namely for atrivial channel f : X → Y given by a function f : X → Y , as in:

f � ω |= p(2.6)= ω |= f � p= ω |= q ◦ f by Lemma 2.4.2 (7).

Remark 2.4.4. In a programming context, where a channel c : X → Y is seenas a program taking inputs from X to outputs in Y , one may call c � q theweakest precondition of q, commonly written as wp(c, q), see e.g. [26, 62, 70].We briefly explain this view.

A precondition of q, w.r.t. channel c : X → Y , may be defined as an observ-able p on the channel’s domain X for which:

ω |= p ≤ c � ω |= q, for all states ω.

90

Page 103: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

2.4. Transformation of observables 912.4. Transformation of observables 912.4. Transformation of observables 91

Proposition 2.4.3 tells that c � q is then a precondition of q. It is also theweakest, since if p is a precondition of q, as described above, then in particular:

p(x) = unit(x) |= p≤ c � unit(x) |= q = unit(x) |= c � q = (c � q)(x).

As a result, p ≤ c � q.Lemma 2.4.2 (6) expresses a familiar compositionality property in the the-

ory of weakest preconditions:

wp(d ◦· c, q) = (d ◦· c) � q = c � (d � q) = wp(c,wp(d, q)).

We close this section with two topics that dig deeper into the nature of trans-formations. First, we relate transformation of states and observables in termsof matrix operations. Then we look closer at the categorical aspects of trans-formation of observables.

Remark 2.4.5. Let c be a channel with finite sets as domain and codomain.For convenience we write these as n = {0, . . . , n − 1} and m = {0, . . . ,m − 1},so that the channel c has type n → m. For each i ∈ n we have that c(i) ∈ D(m)is given by an m-tuple of numbers in [0, 1] that add up to one. Thus we canassociate an m × n matrix Mc with the channel c, namely:

Mc =

c(1)(1) · · · c(n − 1)(1)

......

c(1)(m − 1) · · · c(n − 1)(m − 1)

.By construction, the columns of this matrix add up to one. Such matrices areoften called stochastic.

A state ω ∈ D(n) may be identified with a column vector Mω of length n, ason the left below. It is then easy to see that the matrix Mc�ω of the transformedstate, is obtained by matrix-column multiplication, as on the right:

Mω =

ω(0)...

ω(n − 1)

so that Mc�ω = Mc · Mω.

Indeed,

(c � ω)( j) =∑

ic(i)( j) · ω(i) =

∑i

(Mc

)j,i ·

(Mω

)i =

(Mc · Mω

)j.

An observable q : m → R on m can be identified with a row vector Mq =

91

Page 104: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

92 Chapter 2. Predicates and Observables92 Chapter 2. Predicates and Observables92 Chapter 2. Predicates and Observables(q(0) · · · q(m − 1)

). Transformation c � q then corresponds to row-matrix

multiplication:

Mc�q = Mq · Mc.

Again, this is checked easily:

(c � q)(i) =∑

jq( j) · c(i)( j) =

∑j

(Mq

)j ·

(Mc

)j,i =

(Mq · Mc

)i.

We close this section by making the functoriality of observable transfor-mation � explicit, in the style of Proposition 2.2.4. The latter deals withfunctions, but we now consider functoriality wrt. channels, using the categoryChan = Chan(D) of probabilistic channels. Notice that wrt. Proposition 2.2.4the case of sharp predicates is omitted, simply because sharp predicates are notclosed under predicate transformation (see Exercise 2.4.2). Also, conjunction& is not preserved under transformation, see Exercise 2.4.1 below.

Proposition 2.4.6. Taking particular observables on a set is functorial

1 Obs : Chan→ Vectop;2 Fact : Chan→ Coneop;3 Pred : Chan→ EModop;

On a channel c : X → Y, all these functors are given by transformation c �(−), acting in the opposite direction.

Proof. Essentially, all the ingredients are already in Lemma 2.4.2: transfor-mation restricts appropriately (point (1)), transformation preserves identies(point (5)) and composition (point (6)), and the relevant structure (points (2)and (3)).

Exercises

2.4.1 Consider the channel f : {a, b, c} → {u, v} from Example 1.7.2 (4),given by:

f (a) = 12 |u〉 +

12 |v〉 f (b) = 1|u〉 f (c) = 3

4 |u〉 +14 |v〉.

Take as predicates p, q : {u, v} → [0, 1],

p(u) = 12 p(v) = 2

3 q(u) = 14 q(v) = 1

6 .

Compute:

• f � p• f � q• f � (p > q)

92

Page 105: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

2.4. Transformation of observables 932.4. Transformation of observables 932.4. Transformation of observables 93

• ( f � p) > ( f � q)• f � (p & q)• ( f � p) & ( f � q)

This will show that predicate transformation� does not preserve con-junction &.

2.4.2 Still in the context of the previous exercise, consider the sharp (point)predicate 1u on {u, v}. Show that the transformed predicate f � 1u on{a, b, c} is not sharp. This proves that sharp predicates are not closedunder predicate transformation.

2.4.3 Let h : X → Y be an ordinary function. Recall from Lemma 2.4.2 (7)that h � q = q ◦ h, when h is considered as a deterministic chan-nel. Show that transformation along such deterministic channels doespreserve conjunctions:

h � (q1 & q2) = (h � q1) & (h � q2),

in contrast to the findings in Exercise 2.4.1 for arbitrary channels.2.4.4 Recall that a state ω ∈ D(X) can be identified with a channel 1 → X

with a trivial domain, and also that a predicate p : X → [0, 1] can beidentified with a channel X → 2, see Exercise 2.4.8. Check that underthese identifications validity ω |= p can be identified with:

• state transformation p � ω;• predicate transformation ω � p;• channel composition p ◦· ω.

2.4.5 This exercises shows how parallel conjunction ⊗ and sequential con-junction & are inter-definable via transformation�, using projectionchannels πi and copy channels ∆, that is, using weakening and con-traction.

1 Let observables p1 on X1 and p2 on X2 be given. Show that onX1 × X2,

p1 ⊗ p2 = (π1 � p1) & (π2 � p2) = (p1 ⊗ 1) & (1 ⊗ p2).

The last equation occurred already in Exercise 2.2.4.2 Let q1, q2 be observables on the same set Y . Prove that on Y ,

q1 & q2 = ∆ � (q1 ⊗ q2).

2.4.6 Prove that:

〈c, d〉 � (p ⊗ q) = (c � p) & (d � q)(e ⊗ f ) � (p ⊗ q) = (e � p) ⊗ ( f � q).

93

Page 106: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

94 Chapter 2. Predicates and Observables94 Chapter 2. Predicates and Observables94 Chapter 2. Predicates and Observables

2.4.7 Let c : X → Y be a channel, with observables p on Z and q on Y × Z.Check that:

(c ⊗ id) �((1 ⊗ p) & q

)= (1 ⊗ p) &

((c ⊗ id) � q

).

(Recall that & is not preserved by�, see Exercise 2.4.1.)2.4.8 1 Check that a predicate p : X → [0, 1] can be identified with a chan-

nel p : X → 2. Describe this p explicitly in terms of p.2 Define a channel orth : 2→ 2 such that orth ◦· p = p⊥.3 Define also a channel conj : 2 × 2 → 2 such that conj ◦· 〈 p, q〉 =

p & q.4 Finally, define also a channel scal(r) : 2 → 2, for r ∈ [0, 1], so that

scal(r) ◦· p = r · p.2.4.9 From a categorical perspective, predicate transformation allows us to

adapt the functor

Sets Pred // EModop

from Proposition 2.2.4 to a functor:

Chan(D) Pred // EModop

by redefining Pred on morphisms c : X → D(Y) in Chan(D) asPred (c) : Pred (Y)→ Pred (X) with:

Pred (c)(q) B c � q.

1 Check in detail that Pred (c) is a morphism of effect modules. Checkalso that channel composition and identities are preserved — andthus that we have functors indeed.

2 Show that the functor Pred is faithful, in the sense that Pred (c) =

Pred (c′) implies c = c′, for channels c, c′ : X → Y .3 Let Y be a finite set. Show that for each map h : Pred (Y)→ Pred (X)

in the category EMod there is a unique channel c : X → D(Y) withPred (c) = h.Hint: Write a predicate p as finite sum

∑y p(y) · 1y like in Exer-

cise 2.2.7 and use the relevant preservation property.One says that the functor Pred : ChanN(D) → EModop is full andfaithful when restricted to the category ChanN(D) with finite setsas objects. In the context of programming (logics) this property iscalled healthiness, see [25, 26, 70], or [37] for an abstract account.

94

Page 107: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

2.5. Reasoning along channels 952.5. Reasoning along channels 952.5. Reasoning along channels 95

2.5 Reasoning along channels

The combination of � and � with coniditioning of states from Section 2.3gives rise to the powerful techniques of forward and backward inference. Thiswill be illustrated in the current section. At the end, these channel-based infer-ence mechanisms are related to crossover inference for joint states.

The next definition captures the two basic patterns (first formulated in [53]).We shall refer to them jointly as channel-based inference, or as reasoning alongchannels.

Definition 2.5.1. Let ω ∈ D(X) be a state on the domain of a channel c : X →Y .

1 For a factor p ∈ Fact(X), we define forward inference as transformationalong c of the state ω updated with p, as in:

c �(ω|p

).

This is also called prediction or causal reasoning.2 For a factor q ∈ Fact(Y), backward inference is updating of ω with the

transformed factor:

ω|(c�q).

This is also called explanation or evidential reasoning.

In both cases the distribution ω is often called the prior distribution or simplythe prior. Similarly, c �

(ω|p

)and ω|(c�q) are called posterior distributions or

just posteriors.

Thus with forward inference one first conditions and then performs (for-ward, state) transformation, whereas for backward inference one first performs(backward, factor) transformation, and then one conditions. We shall illustratethese fundamental inferences mechanisms in several examples. An importantfirst step in these examples is to recognise the channel that is hidden in thedescription of the problem at hand. It is instructive to try and do this, beforereading the analysis and the solution.

Example 2.5.2. We start with the following question from [87, Example 1.12].

Consider two urns. The first contains two white and seven black balls, and the secondcontains five white and six black balls. We flip a coin and then draw a ball from thefirst urn or the second urn depending on whether the outcome was heads or tails. Whatis the conditional probability that the outcome of the toss was heads given that a whiteball was selected?

95

Page 108: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

96 Chapter 2. Predicates and Observables96 Chapter 2. Predicates and Observables96 Chapter 2. Predicates and Observables

Our analysis involves two sample spaces {H,T } for the sides of the coin and{W, B} for the colors of the balls in the urns. The coin distribution γ is uni-form: γ = 1

2 |H 〉 + 12 |T 〉. The above description implicitly contains a channel

c : {H,T } → {W, B}, namely:

c(H) = 29 |W 〉 +

79 |B〉 and c(T ) = 5

11 |W 〉 +611 |B〉.

As in the above quote, the first urn is associated with heads and the second onewith tails. Notice that the distributions c(H), c(T ) are obtained via frequentistlearning, from the two urns, seen as multisets, see Section 1.6.

The evidence that we have is described in the quote after the word ‘given’.It is the singleton predicate 1W on the set of colours {W, B}, indicating that awhite ball was selected. This evidence can be pulled back (transformed) alongthe channel c, to a predicate c � 1W on the sample space {H,T }. It is given by:(

c � 1W)(H) =

∑x∈{W,B} c(H)(x) · 1W (x) = c(H)(W) = 2

9 .

Similarly we get(c � 1W

)(T ) = c(T )(W) = 5

11 .The answer that we are interested in is obtained by updating the prior γ with

the transformed evidence c � 1W , as given by γ|c�1H . This is an instance ofbackward inference.

In order to get this answer, we first we compute the validity:

γ |= c � 1W = γ(H) · (c � 1W )(H) + γ(T ) · (c � 1W )(T )= 1

2 ·29 + 1

2 ·511 = 1

9 + 522 = 67

198 .

Then:

γ|c�1W =1/2 · 2/9

67/198|H 〉 +

1/2 · 5/11

67/198|T 〉 = 22

67 |H 〉 +4567 |T 〉.

Thus, the conditional probability of heads is 2267 . The same outcome is obtained

in [87], of course, but there via an application of Bayes’ rule.

Example 2.5.3. Consider the following classical question from [98].

A cab was involved in a hit and run accident at night. Two cab companies, Green andBlue, operate in the city. You are given the following data:

• 85% of the cabs in the city are Green and 15% are Blue• A witness identified the cab as Blue. The court tested the reliability of the witness

under the circumstances that existed on the night of the accident, and concluded thatthe witness correctly identified each one of the two colors 80% of the time and failed20% of the time.

What is the probability that the cab involved in the accident was Blue rather than Green?

96

Page 109: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

2.5. Reasoning along channels 972.5. Reasoning along channels 972.5. Reasoning along channels 97

We use as colour set C = {G, B} for Green and Blue. There is a prior ‘baserate’ distribution ω = 17

20 |G 〉 + 320 |B〉 ∈ D(C), as in the first bullet above.

The reliability information in the second bullet translates into a ‘correctness’channel c : {G, B} → {G, B} given by:

c(G) = 45 |G 〉 +

15 |B〉 c(B) = 1

5 |G 〉 +45 |B〉.

The second bullet also gives evidence of a Blue car. It translates into a pointpredicate 1B on {G, B}. It can be used for backward inference, giving the answerto the query, as posterior:

ω|c�1B = 1729 |G 〉 +

1229 |B〉 ≈ 0.5862|G 〉 + 0.4138|B〉.

Thus the probability that the Blue car was actually involved in the incident isa bit more that 41%. This may seem like a relatively low probability, giventhat the evidence says ‘Blue taxicab’ and that observations are 80% accurate.But this low percentage is explained by the fact that are relatively few Bleutaxicabes in the first place, namely only 15%. This is in the prior, base ratedistribution ω. It is argued in [98] that humans find it difficult to take such baserates (or priors) into account. They call this “base rate neglect”, see also [32].

Notice that the channel c is used to accomodate the uncertainty of obser-vations: the point observation ‘Blue taxicab’ is transformed into a prediatec � 1B. The latter is G 7→ 0.2 and B 7→ 0.8. Updating ω with this predicategives the claimed outcome.

Example 2.5.4. Recall the Medicine-Blood Table (1.4) with data on differenttypes of medicine via a set M = {0, 1, 2} and blood pressure via the set B =

{H, L}. From the table we can extract a channel b : M → B describing theblood pressure distribution for each medicine type. This channel is obtainedby column-wise frequentist learning:

b(0) = 23 |H 〉 +

13 |L〉 b(1) = 7

9 |H 〉 +29 |L〉 b(2) = 5

8 |H 〉 +38 |L〉.

The prior medicine distribution ω = 320 |0〉 +

920 |1〉 +

25 |2〉 is obtained from the

totals row in the table.The predicted state b � ω is 7

10 |H 〉 + 310 |L〉. It is the distribution that is

learnt from the totals column in Table (1.4). Suppose we wish to focus on thepeople that take either medicine 1 or 2. We do so by conditioning, via thesubset E = {1, 2} ⊆ B, with associated sharp predicate 1E : B→ [0, 1]. Then:

ω |= 1E = 920 + 2

5 = 1720 so ω|1E =

9/20

17/20|1〉 +

2/5

17/20|2〉 = 9

17 |1〉 +8

17 |2〉.

97

Page 110: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

98 Chapter 2. Predicates and Observables98 Chapter 2. Predicates and Observables98 Chapter 2. Predicates and Observables

Forward reasoning, precisely as in Definition 2.5.1 (1), gives:

b �(ω|1E

)= b �

( 917 |1〉 +

817 |2〉

)= ( 9

17 ·79 + 8

17 ·58 )|H 〉 + ( 9

17 ·29 + 8

17 ·38 )|H 〉

= 1217 |H 〉 +

517 |L〉.

This shows the distribution of high and low blood pressure among people usingmedicine 1 or 2.

We turn to backward reasoning. Suppose that we have evidence 1H on {H, L}of high blood pressure. What is then the associated distribution of medicineusage? It is obtained in several steps:(

b � 1H)(x) =

∑y

b(x)(y) · 1H(y) = b(x)(H)

ω |= b � 1H =∑

xω(x) · (b � 1H)(x) =

∑xω(x) · b(x)(H)

= 320 ·

23 + 9

20 ·79 + 2

5 ·58 = 7

10

ω|b�1H =∑

x

ω(x) · (b � 1H)(x)ω |= b � 1H

∣∣∣ x⟩=

3/20 · 2/3

7/10|0〉 +

9/20 · 7/9

7/10|1〉 +

2/5 · 5/8

7/10|2〉

= 17 |0〉 +

12 |1〉 +

514 |2〉 ≈ 0.1429|0〉 + 0.5|1〉 + 0.3571|2〉.

We can also reason with ‘soft’ evidence, using the full power of fuzzy pred-icates. Suppose we are only 95% sure that the blood pressure is high, due tosome measurement uncertainty. Then we can use as evidence the predicateq : B→ [0, 1] given by q(H) = 0.95 and q(L) = 0.05. It yields:

ω|b�q ≈ 0.1434|0〉 + 0.4963|1〉 + 0.3603|2〉.

We see a slight difference wrt. the outcome with sharp evidence.

Example 2.5.5. The following question comes from [97, §6.1.3] (and is alsoused in [78]).

One fish is contained within the confines of an opaque fishbowl. The fish is equallylikely to be a piranha or a goldfish. A sushi lover throws a piranha into the fish bowlalongside the other fish. Then, immediately, before either fish can devour the other,one of the fish is blindly removed from the fishbowl. The fish that has been removedfrom the bowl turns out to be a piranha. What is the probability that the fish that wasoriginally in the bowl by itself was a piranha?

Let’s use the letters ‘p’ and ‘g’ for piranha and goldfish. We are looking at asituation with multiple fish in a bowl, where we cannot distingish the order.Hence we describe the contents of the bowl as a (natural) muliset over {p, g},

98

Page 111: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

2.5. Reasoning along channels 992.5. Reasoning along channels 992.5. Reasoning along channels 99

that is, as an element of N({p, g}). The initial situation can then be describedas a distribution ω ∈ D(N({p, g})) with:

ω = 12

∣∣∣ 1| p〉⟩

+ 12

∣∣∣ 1|g〉⟩.

Adding a piranha to the bowl involves a function A : N({p, g}) → N({p, g}),such that A(σ) = (σ(p) + 1)| p〉 + σ(g)|g〉. It forms a deterministic channel.

We use a piranha predicate P : N({p, g}) → [0, 1] that gives the likelihoodP(σ) of taking a piranha from a multiset/bowl σ. Thus:

P(σ) B Flrn(σ)(p)(1.9)=

σ(p)σ(p) + σ(g)

.

We have now collected all ingredients to answer the question via backwardinference along the deterministic channel A. It involves the following steps.

(A � P)(σ) = P(A(σ)) =σ(p) + 1

σ(p) + 1 + σ(g)ω |= A � P = 1

2 · (A � P)(1| p〉) + 12 · (A � P)(1|g〉)

= 12 ·

1+11+1+0 + 1

2 ·0+1

0+1+1 = 12 + 1

4 = 34

ω|A�P =1/2 · 1

3/4

∣∣∣ 1| p〉⟩

+1/2 · 1/2

3/4

∣∣∣ 1|g〉⟩

= 23

∣∣∣ 1| p〉⟩

+ 13

∣∣∣ 1|g〉⟩.

Hence the answer to the question in the beginning of this example is: 23 .

The previous illustrations of channel-based inference are rather concrete.The next example is more abstract and involves the binomial distribution aschannel.

Example 2.5.6. Capture and recapture is a methodology used in ecology toestimate the size of a population. So imagine we are looking at a pond and wewish to learn the number of fish. We catch twenty of them, mark them, andthrow them back. Subsequently we catch another twenty, and find out that fiveof them are marked. What do we learn about the number of fish?

The number of fish in the pond must be at least 20. Let’s assume the maximalnumber is 300. We will be considering units of 10 fish. Hence the underlyingsample space F together with the uniform ‘prior’ state υF is:

F = {20, 30, 40, . . . , 300} with υF =∑

x∈F1

29 | x〉.

We now assume that K = 20 of the fish in the pond are marked. We can thencompute for each value 20, 30, 40, . . . in the fish space F the probability offinding 5 marked fish when 20 of them are caught. In order not to complicatethe calculations too much, we catch these 20 fish one by one, check if they are

99

Page 112: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

100 Chapter 2. Predicates and Observables100 Chapter 2. Predicates and Observables100 Chapter 2. Predicates and Observables

marked, and then throw them back. This means that the probability of catchinga marked fish remains the same, and is described by a binomial distribution,see Example 1.5.1 (2). Its parameters are K = 20 with probability i

20 of catch-ing a marked fish, where i ∈ F is the assumed total number of fish. This isincorporated in the following ‘catch’ channel c : F → K + 1 = {0, 1, . . . ,K}.

    c(i) := binom[K](K/i) = ∑_{k∈K+1} (K choose k) · (K/i)^k · ((i − K)/i)^(K−k) |k⟩.

Once this is set up, we construct a posterior state by updating the prior with the information that five marked fish have been found. The latter is expressed as the point predicate 1_5 ∈ Pred(K + 1) on the codomain of the channel c. We can then do backward inference, as in Definition 2.5.1 (2), and obtain the updated uniform distribution:

    υF|(c ≪ 1_5).

The bar chart of this posterior is in Figure 2.1; it indicates the likelihoods of the various numbers of fish in the pond. One can also compute the expected value (mean) of this posterior; it’s 116.5 fish. In case we had caught 10 marked fish out of 20, the expected number would be 47.5.

Note that taking a uniform prior corresponds to the idea that we have no idea about the number of fish in the pond. But possibly we already had a good estimate from previous years. Then we could have used such an estimate as prior distribution, and updated it with this year’s evidence.
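The whole capture–recapture computation is also easy to reproduce mechanically. The following Python sketch (with hypothetical helper names, not taken from the text) updates the uniform prior on F with the binomial likelihood of seeing 5 marked fish; its printed posterior mean can be compared with the value 116.5 reported above.

    from math import comb

    F = range(20, 301, 10)                  # fish space {20, 30, ..., 300}
    K = 20                                  # number of marked fish

    def catch(i, k):                        # c(i)(k): chance of k marked fish among K caught
        r = K / i                           # probability of catching a marked fish
        return comb(K, k) * r**k * (1 - r)**(K - k)

    prior = {i: 1 / len(F) for i in F}      # uniform prior on F
    likelihood = {i: catch(i, 5) for i in F}            # the point predicate 1_5, pulled back along c
    validity = sum(prior[i] * likelihood[i] for i in F)
    posterior = {i: prior[i] * likelihood[i] / validity for i in F}

    print(sum(i * posterior[i] for i in F))             # posterior mean; compare with 116.5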

2.5.1 Conditioning after state transformation

After all these examples we develop some general results. We start with a fundamental result about conditioning of a transformed state. It can be reformulated as a combination of forward and backward reasoning. This has many consequences.

Theorem 2.5.7. Let c : X → Y be a channel with a state ω ∈ D(X) on its domain and a factor q ∈ Fact(Y) on its codomain. Then:

    (c ≫ ω)|q = c|q ≫ (ω|(c ≪ q)).    (2.7)


Figure 2.1 The posterior fish number distribution after catching 5 marked fish.

Proof. For each y ∈ Y ,

    (c ≫ ω)|q(y) = ((c ≫ ω)(y) · q(y)) / (c ≫ ω |= q)
      = ∑_x (ω(x) · c(x)(y) · q(y)) / (ω |= c ≪ q)
      = ∑_x (ω(x) · (c ≪ q)(x)) / (ω |= c ≪ q) · (c(x)(y) · q(y)) / ((c ≪ q)(x))
      = ∑_x (ω|(c ≪ q))(x) · (c(x)(y) · q(y)) / (c(x) |= q)
      = ∑_x (ω|(c ≪ q))(x) · c(x)|q(y)
      = ∑_x (ω|(c ≪ q))(x) · c|q(x)(y)
      = (c|q ≫ (ω|(c ≪ q)))(y).
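Equation (2.7) can also be checked numerically on small random instances. The sketch below is a minimal Python illustration with ad-hoc helpers for state transformation (≫), predicate transformation (≪) and conditioning; none of these names come from the text.

    import random

    X, Y = ['a', 'b', 'c'], ['u', 'v']

    def rand_dist(space):
        w = [random.random() for _ in space]
        s = sum(w)
        return {x: wi / s for x, wi in zip(space, w)}

    omega = rand_dist(X)                        # state on X
    c = {x: rand_dist(Y) for x in X}            # channel X -> Y
    q = {y: random.random() for y in Y}         # factor on Y

    def push(chan, state):                      # state transformation  c >> omega
        return {y: sum(state[x] * chan[x][y] for x in state) for y in Y}

    def pull(chan, fact):                       # predicate transformation  c << q
        return {x: sum(chan[x][y] * fact[y] for y in Y) for x in X}

    def update(state, fact):                    # conditioning  state|fact
        v = sum(state[x] * fact[x] for x in state)
        return {x: state[x] * fact[x] / v for x in state}

    lhs = update(push(c, omega), q)                            # (c >> omega)|q
    c_q = {x: update(c[x], q) for x in X}                      # conditioned channel c|q
    rhs = push(c_q, update(omega, pull(c, q)))                 # c|q >> (omega|(c << q))
    print(all(abs(lhs[y] - rhs[y]) < 1e-12 for y in Y))        # True, up to rounding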

This result has a number of useful consequences.

Corollary 2.5.8. For appropriately typed channels, states, and factors:

1 (d ∘· c)|q = d|q ∘· c|(d ≪ q);
2 (⟨c, d⟩ ≫ ω)|p⊗q = ⟨c|p, d|q⟩ ≫ ω|(c ≪ p)&(d ≪ q);
3 ((e ⊗ f) ≫ ω)|p⊗q = (e|p ⊗ f|q) ≫ ω|(e ≪ p)⊗(f ≪ q).


Proof. 1  (d ∘· c)|q(x) = (d ∘· c)(x)|q = (d ≫ c(x))|q
      = d|q ≫ (c(x)|(d ≪ q))    by (2.7)
      = d|q ≫ (c|(d ≪ q)(x))
      = (d|q ∘· c|(d ≪ q))(x).

2  (⟨c, d⟩ ≫ ω)|p⊗q = ⟨c, d⟩|p⊗q ≫ (ω|(⟨c, d⟩ ≪ (p ⊗ q)))    by (2.7)
      = ⟨c|p, d|q⟩ ≫ (ω|(c ≪ p)&(d ≪ q))    by Exercises 2.3.9 and 2.4.6.

3  Using the same exercises we also get:

    ((e ⊗ f) ≫ ω)|p⊗q = (e ⊗ f)|p⊗q ≫ (ω|((e ⊗ f) ≪ (p ⊗ q)))    by (2.7)
      = (e|p ⊗ f|q) ≫ ω|(e ≪ p)⊗(f ≪ q).

Earlier we have seen ‘crossover update’ for a joint state, whereby one updates in one component, and then marginalises in the other, see for instance Example 2.3.2 (2). We shall do this below for joint states gr(σ, c) = ⟨id, c⟩ ≫ σ that arise as graph, see Definition 1.8.10, via a channel. The result below appeared in [13], see also [53, 54]. It says that crossover updates on joint graph states can also be done via forward and backward inferences. This is also a consequence of Theorem 2.5.7.

Corollary 2.5.9. Let c : X → Y be a channel with a state σ ∈ D(X) on its domain. Then:

1 for a factor p on X,    (⟨id, c⟩ ≫ σ)|p⊗1 [0, 1] = c ≫ (σ|p);
2 for a factor q on Y,    (⟨id, c⟩ ≫ σ)|1⊗q [1, 0] = σ|(c ≪ q).

Proof. Since:

    (⟨id, c⟩ ≫ σ)|p⊗1 [0, 1]
      = π2 ≫ (⟨id, c⟩|p⊗1 ≫ (σ|(⟨id, c⟩ ≪ (p ⊗ 1))))    by (2.7)
      = (π2 ∘· ⟨id|p, c⟩) ≫ (σ|(id ≪ p)&(c ≪ 1))    by Exercises 2.3.9 and 2.4.6
      = c ≫ (σ|p)

    (⟨id, c⟩ ≫ σ)|1⊗q [1, 0]
      = π1 ≫ (⟨id, c⟩|1⊗q ≫ (σ|(⟨id, c⟩ ≪ (1 ⊗ q))))    by (2.7)
      = (π1 ∘· ⟨id, c|q⟩) ≫ (σ|(id ≪ 1)&(c ≪ q))
      = σ|(c ≪ q).

This result will tell us how to do inference in a Bayesian network, see Section 3.10.


Exercises

2.5.1 We consider some disease with an a priori probability of 1%. There is a test for the disease with the following ‘sensitivity’. If someone has the disease, then the test is 90% positive; but if someone does not have the disease, there is still a 5% chance that the test is positive.

1 Take as disease space D = {d, d⊥}; describe the prior as a distribution on D;
2 Take as test space T = {p, n} and describe the sensitivity as a channel s : D → T;
3 Show that the predicted positive test probability is almost 6%.
4 Assume that a test comes out positive. Use backward reasoning to prove that the probability of having the disease (the posterior) is then a bit more than 15% (to be precise: 18/117). Explain why it is so low — remembering Example 2.5.3.

2.5.2 Give a channel-based analysis and answer to the following question from [87, Chap. I, Exc. 39]. Stores A, B, and C have 50, 75, and 100 employees, and respectively, 50, 60, and 70 percent of these are women. Resignations are equally likely among all employees, regardless of sex. One employee resigns and this is a woman. What is the probability that she works in store C?

2.5.3 Check that the ‘piranha’ predicate P : M({p, g}) → [0, 1] in Example 2.5.5 can also be described via the draw function D from Definition 1.6.3 as:

    P(σ) = D(σ) |= 1p ⊗ 1.

2.5.4 The following situation about the relationship between eating hamburgers and having Creutzfeldt-Jakob disease is inspired by [5, §1.2]. We have two sets: E = {H, H⊥} about eating hamburgers (or not), and D = {K, K⊥} about having Creutzfeldt-Jakob disease (or not). The following distributions on these sets are given: half of the people eat hamburgers, and only one in a hundred thousand have Creutzfeldt-Jakob disease, which we write as:

    ω = 1/2 |H⟩ + 1/2 |H⊥⟩    and    σ = 1/100,000 |K⟩ + 99,999/100,000 |K⊥⟩.

1 Suppose that we know that 90% of the people who have Creutzfeldt-Jakob disease eat hamburgers. Use this additional information to define a channel c : D → E with c ≫ σ = ω.
2 Compute the probability of getting Creutzfeldt-Jakob disease for someone eating hamburgers (via backward inference).


2.5.5 Let c : X → Y be a channel with state ω ∈ D(X) and factors p ∈ Fact(X), q ∈ Fact(Y). Check that:

    c|q ≫ (ω|p&(c ≪ q)) = (c ≫ (ω|p))|q.

Describe what this means for p = 1.

2.5.6 Let p ∈ Fact(X) and q ∈ Fact(Y) be two factors on spaces X, Y.

1 Show that for two channels c : Z → X and d : Z → Y with a state σ ∈ D(Z) on their (common) domain one has:

    (⟨c, d⟩ ≫ σ)|p⊗q [1, 0] = (c ≫ (σ|(d ≪ q)))|p
    (⟨c, d⟩ ≫ σ)|p⊗q [0, 1] = (d ≫ (σ|(c ≪ p)))|q

2 For channels e : U → X and f : V → Y with a joint state ω ∈ D(U × V) one has:

    ((e ⊗ f) ≫ ω)|p⊗q [1, 0] = (e ≫ (ω|1⊗(f ≪ q) [1, 0]))|p
    ((e ⊗ f) ≫ ω)|p⊗q [0, 1] = (f ≫ (ω|(e ≪ p)⊗1 [0, 1]))|q.

2.6 Discretisation, and coin bias learning

So far we have been using discrete probability distributions, with a finite domain. We shall be looking at continuous distributions later on, in Chapter ??. In the meantime we can approximate continuous distributions by discretisation, namely by chopping them up into finitely many parts, like in Riemann integration. This section first introduces this discretisation and then uses it in an extensive example on learning the bias of a coin via successive backward inference.

We start with discretisation.

Definition 2.6.1. Let a, b ∈ R with a < b and N ∈ N with N > 0 be given.

1 We write [a, b]N ⊆ [a, b] ⊆ R for the interval [a, b] reduced to N elements:

    [a, b]N := {a + (1/2)s, a + (3/2)s, . . . , a + ((2N−1)/2)s}    where s := (b − a)/N
             = {a + (i + 1/2)s | i ∈ N}.

2 Let f : S → R≥0 be a function, defined on a finite subset S ⊆ R. We write Disc(f, S) ∈ D(S) for the discrete distribution defined as:

    Disc(f, S) := ∑_{x∈S} (f(x)/t) |x⟩    where t := ∑_{x∈S} f(x).


Often we combine the notations from these two points and use discretised states of the form Disc(f, [a, b]N).

To see an example of point (1), consider the interval [1, 2] with N = 3. The step size s is then s = (2 − 1)/3 = 1/3, so that:

    [1, 2]3 = {1 + 1/2 · 1/3, 1 + 3/2 · 1/3, 1 + 5/2 · 1/3} = {1 + 1/6, 1 + 1/2, 1 + 5/6}.

We choose to use internal points only and exclude the end-points in this finite subset since the end-points sometimes give rise to boundary problems, with functions being undefined. When N goes to infinity, the smallest and largest elements in [a, b]N will approximate the end-points a and b — from above and from below, respectively.

The ‘total’ number t in point (2) normalises the formal sum and ensures that the multiplicities add up to one. In this way we can define a uniform distribution on [a, b]N as υ[a,b]N, like before, or alternatively as Disc(1, [a, b]N), where 1 is the constant-one function.
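In code, both constructions of Definition 2.6.1 are one-liners. The Python sketch below uses hypothetical helper names (`interval_points`, `disc`) that are not part of the text; the same idea is reused in the coin example that follows.

    def interval_points(a, b, N):
        """The N internal points a + (i + 1/2)*s of [a, b]_N, with s = (b - a)/N."""
        s = (b - a) / N
        return [a + (i + 0.5) * s for i in range(N)]

    def disc(f, points):
        """Disc(f, S): normalise a non-negative function f on the finite set S."""
        t = sum(f(x) for x in points)
        return {x: f(x) / t for x in points}

    print(interval_points(1, 2, 3))                        # [1.1666..., 1.5, 1.8333...]
    uniform = disc(lambda x: 1, interval_points(0, 1, 100))   # Disc(1, [0,1]_100)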

Example 2.6.2. We look at the following classical question: suppose we are given a coin with an unknown bias, and we observe the following list of heads (1) and tails (0):

[0, 1, 1, 1, 0, 0, 1, 1].

What can we then say about the bias of the coin?

The frequentist approach that we have seen in Section 1.6 would turn the above list into a multiset and then into a distribution, by frequentist learning, see Diagram (1.10). This would give:

    [0, 1, 1, 1, 0, 0, 1, 1] ⟼ 5|1⟩ + 3|0⟩ ⟼ 5/8 |1⟩ + 3/8 |0⟩.

Here we do not use this frequentist approach to learning the bias parameter, but take a Bayesian route. We assume that the bias parameter itself is given by a distribution, describing the likelihoods of various bias values. We assume no prior knowledge and therefore start from the uniform distribution. It will be updated based on successive observations, using the technique of backward inference, see Definition 2.5.1 (2).

The bias b of a coin is a number in the unit interval [0, 1], giving rise to a coin distribution flip(b) = b|1⟩ + (1 − b)|0⟩. Thus we can see flip as a channel flip : [0, 1] → 2 = {0, 1}. At this stage we avoid continuous distributions and discretise the unit interval. We choose N = 100 in the chop up, giving as underlying space [0, 1]N with N points, on which we take the uniform distribution as prior:

    ω := Disc(1, [0, 1]N) = ∑_{x∈[0,1]N} 1/N |x⟩.


We use the flip operation as channel, restricted to the discretised space:

    flip : [0, 1]N → 2    given by    flip(b) = b|1⟩ + (1 − b)|0⟩.

We write yes, no : 2 → [0, 1] for the (sharp) predicates yes = 1_1 = id and no = yes⊥ = 1_0. Given the above sequence of head/tail observations [0, 1, 1, 1, 0, 0, 1, 1], we perform successive backward inferences, mapping a 1 in the sequence to an update with predicate flip ≪ yes, and a 0 to an update with flip ≪ no, as in:

    ω|(flip ≪ no)
    ω|(flip ≪ no)|(flip ≪ yes) = ω|(flip ≪ no)&(flip ≪ yes)
    ω|(flip ≪ no)|(flip ≪ yes)|(flip ≪ yes) = ω|(flip ≪ no)&(flip ≪ yes)&(flip ≪ yes)
    ...

An overview of the resulting distributions is given in Figure 2.2. These distributions approximate (continuous) beta distributions, technically because “beta is conjugate prior to flip”, see Section ?? or e.g. [47]. These beta functions form a smoothed-out version of the bar charts in Figure 2.2. We shall look more systematically into ‘learning along a channel’ in Section 4.8, where the current coin-bias computation re-appears in Example 4.8.4.

In the end, after these eight updates, let’s write ρ = ω|(flip ≪ no)&···&(flip ≪ yes) for the resulting distribution. We now ask two questions.

1 What is the predicted coin distribution? The outcome, with truncated multiplicities, is:

    flip ≫ ρ = 0.6|1⟩ + 0.4|0⟩.

2 What is the expected value of ρ? Here we get:

    mean(ρ) = 0.6.

For mathematical reasons² the exact outcome is 0.6. However, we have used approximation via discretisation. The value computed with this discretisation is 0.599999985316273. We can conclude that chopping the unit interval up with N = 100 already gives a fairly good approximation.

We conclude with a number of observations. The distribution that is obtained by successive backward inferences is determined by the number of 0’s and 1’s in the list of observations, and not by their order. This is because the order of conditioning does not matter. Hence it makes more sense to use multisets

2 The distribution ρ is an approximation of the probability density function β(6, 4), which has mean 6/(6 + 4) = 0.6.


Figure 2.2 Coin bias distributions arising from the prior uniform distribution by successive updates after coin observations [0, 1, 1, 1, 0, 0, 1, 1].

n1|1⟩ + n0|0⟩ ∈ M(2) as observations (data), instead of lists, where n1 ∈ N is the observed number of heads and n0 the number of tails. In the above example we have n1 = 5 and n0 = 3. Such abstractions from data that keep only the information that is relevant for updating are called sufficient statistics.
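The whole discretised coin-bias computation fits in a few lines of Python. The sketch below is self-contained (the helper `update` is an ad-hoc name, not from the text) and should reproduce the value 0.6 up to the discretisation error mentioned above.

    N = 100
    points = [(i + 0.5) / N for i in range(N)]            # the space [0, 1]_N
    omega = {x: 1 / N for x in points}                    # uniform prior

    observations = [0, 1, 1, 1, 0, 0, 1, 1]

    def update(state, pred):                              # backward inference  state|pred
        v = sum(p * pred(x) for x, p in state.items())
        return {x: p * pred(x) / v for x, p in state.items()}

    for obs in observations:
        # flip << yes is the predicate x -> x; flip << no is x -> 1 - x
        pred = (lambda x: x) if obs == 1 else (lambda x: 1 - x)
        omega = update(omega, pred)

    # (flip >> rho)(1) and mean(rho) are the same number, compare Exercise 2.6.3
    print(sum(p * x for x, p in omega.items()))           # approximately 0.6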

Exercises

2.6.1 Recall the N-element set [a, b]N from Definition 2.6.1 (1).

1 Show that its largest element is b − (1/2)s, where s = (b − a)/N.
2 Prove that ∑_{x∈[a,b]N} x = N(a + b)/2. Remember: ∑_{i∈N} i = N(N − 1)/2.

2.6.2 Consider coin parameter learning in Example 2.6.2.

1 Show that the prediction in the prior (uniform) state ω on [0, 1]N gives a fair coin, i.e.

    flip ≫ ω = 1/2 |1⟩ + 1/2 |0⟩.

This equality is independent of N.

2 Prove that, also independently of N,

    ω |= flip ≪ yes = 1/2.

3 Show next that:

    ω|(flip ≪ yes) = ∑_{x∈[0,1]N} (2x/N) |x⟩.


4 Use the ‘square pyramidal’ formula ∑_{i∈N} i² = N(N−1)(2N−1)/6 to prove that:

    (flip ≫ (ω|(flip ≪ yes)))(1) = 2/3 − 1/(6N²).

Conclude that flip ≫ (ω|(flip ≪ yes)) approaches 2/3 |1⟩ + 1/3 |0⟩ as N goes to infinity.

2.6.3 Prove that the probabilities (flip ≫ ρ)(1) and mean(ρ) are the same, in the two points at the end of Example 2.6.2.

2.7 Inference in Bayesian networks

In previous sections we have seen several examples of channel-based inference, in forward and backward form. This section shows how to apply these inference methods to Bayesian networks, via an example that is often used in the literature: the ‘Asia’ Bayesian network, originally from [64]. It captures the situation of patients with a certain probability of smoking and of an earlier visit to Asia; this influences certain lung diseases and the outcome of an xray test. Later on, in Section 3.10, we shall look more systematically at inference in Bayesian networks.

The Bayesian network example considered here is described in several steps: Figure 2.3 gives the underlying graph structure, in the style of Figure 1.4, introduced at the end of Section 1.9, with diagonals and types of wires written explicitly. Figure 2.4 gives the conditional probability tables associated with the nodes of this network. The way in which these tables are written is different from Section 1.9: we now only write the probabilities r ∈ [0, 1] and omit the values 1 − r. Next, these conditional probability tables are described in Figure 2.5 as states smoke ∈ D(S) and asia ∈ D(A) and as channels lung : S → L, tub : A → T, bronc : S → B, xray : E → X, dysp : B × E → D, either : L × T → E.

Our aim in this section is to illustrate channel-based inference in this Bayesian network. It is not so much the actual outcomes that we are interested in, but more the systematic methodology that is used to obtain these outcomes.

Probability of lung cancer, given no bronchitis

Let’s start with the question: what is the probability that someone has lung cancer, given that this person does not have bronchitis? The latter information is the evidence. It takes the form of a singleton predicate 1b⊥ = (1b)⊥ : B → [0, 1] on the set B = {b, b⊥} used for presence and absence of bronchitis.

In order to obtain this updated probability of lung cancer we ‘follow the graph’.


[Figure 2.3 shows the directed graph of the Asia network: smoke → lung, smoke → bronc, asia → tub, lung and tub → either, either → xray, bronc and either → dysp.]

Figure 2.3 The graph of the Asia Bayesian network, with node abbreviations: bronc = bronchitis, dysp = dyspnea, lung = lung cancer, tub = tuberculosis. The wires all have 2-element sets of the form A = {a, a⊥} as types.

In Figure 2.3 we see that we can transform (pull back) the evidence along the bronchitis channel bronc : S → B, and obtain a predicate bronc ≪ 1b⊥ on S. The latter can be used to update the smoke distribution on S. Subsequently, we can push the updated distribution forward along the lung channel lung : S → L via state transformation. Thus we follow the ‘V’ shape in the relevant part of the graph.

Combining these down-update-up steps gives the required outcome:

    lung ≫ (smoke|(bronc ≪ 1b⊥)) = 0.0427|ℓ⟩ + 0.9573|ℓ⊥⟩.

We see that this calculation combines forward and backward inference, see Definition 2.5.1.
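In code, this ‘V-shaped’ inference is just a pull-back, an update and a push-forward. The Python sketch below is only an illustration; the encodings of states and channels are ad-hoc (s_ stands for s⊥, and so on), with the probabilities taken from Figure 2.5.

    smoke = {'s': 0.5, 's_': 0.5}
    bronc = {'s': {'b': 0.6, 'b_': 0.4}, 's_': {'b': 0.3, 'b_': 0.7}}
    lung  = {'s': {'l': 0.1, 'l_': 0.9}, 's_': {'l': 0.01, 'l_': 0.99}}

    evidence = lambda b: 1.0 if b == 'b_' else 0.0          # point predicate 1_{b⊥}

    pulled = {x: sum(bronc[x][b] * evidence(b) for b in ('b', 'b_')) for x in smoke}  # bronc << 1_{b⊥}
    v = sum(smoke[x] * pulled[x] for x in smoke)
    updated = {x: smoke[x] * pulled[x] / v for x in smoke}                            # smoke updated
    answer = {y: sum(updated[x] * lung[x][y] for x in smoke) for y in ('l', 'l_')}    # push along lung
    print(answer)      # {'l': 0.0427..., 'l_': 0.9572...}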

Probability of smoking, given a positive xray

In Figures 2.4 and 2.5 we see a prior smoking probability of 50%. We would like to know what this probability becomes if we have evidence of a positive xray. The latter is given by the point predicate 1x ∈ Pred(X) for X = {x, x⊥}.

Now there is a long path from xray to smoking, see Figure 2.3, that we need to use for (backward) predicate transformation. Along the way there is a complication, namely that the node ‘either’ has two parent nodes, so that pulling back along the either channel yields a predicate on the product set L × T.


    P(smoke) = 0.5                      P(asia) = 0.01

    smoke | P(lung)                     asia | P(tub)
      s   |  0.1                          a  |  0.05
      s⊥  |  0.01                         a⊥ |  0.01

    smoke | P(bronc)                    either | P(xray)
      s   |  0.6                           e   |  0.98
      s⊥  |  0.3                           e⊥  |  0.05

    bronc  either | P(dysp)             lung  tub | P(either)
      b      e    |  0.9                  ℓ     t  |  1
      b      e⊥   |  0.7                  ℓ     t⊥ |  1
      b⊥     e    |  0.8                  ℓ⊥    t  |  1
      b⊥     e⊥   |  0.1                  ℓ⊥    t⊥ |  0

Figure 2.4 The conditional probability tables of the Asia Bayesian network.

The only sensible thing to do is to continue predicate transformation downwards, but now with the parallel product channel lung ⊗ tub : S × A → L × T. The resulting predicate on S × A can be used to update the product state smoke ⊗ asia. Then we can take the first marginal to obtain the desired outcome. Thus we compute:

    ((smoke ⊗ asia)|((lung ⊗ tub) ≪ (either ≪ (xray ≪ 1x)))) [1, 0]
      = ((smoke ⊗ asia)|((xray ∘· either ∘· (lung ⊗ tub)) ≪ 1x)) [1, 0]
      = 0.6878|s⟩ + 0.3122|s⊥⟩.

Thus, a positive xray makes it more likely — w.r.t. the uniform prior — that the patient smokes, as is to be expected. This is obtained by backward inference.

Probability of lung cancer, given both dyspnoea and tuberculosis

Our next inference challenge involves two evidence predicates, namely 1d on D for dyspnoea and 1t on T for tuberculosis. We would like to know the updated lung cancer probability.

The situation looks complicated, because of the ‘closed loop’ in Figure 2.3. But we can proceed in a straightforward manner and combine evidence via conjunction & at a suitable meeting point. We now more clearly separate the forward and backward stages of the inference process.


    smoke = 0.5|s⟩ + 0.5|s⊥⟩                  asia = 0.01|a⟩ + 0.99|a⊥⟩

    lung(s)  = 0.1|ℓ⟩ + 0.9|ℓ⊥⟩               tub(a)  = 0.05|t⟩ + 0.95|t⊥⟩
    lung(s⊥) = 0.01|ℓ⟩ + 0.99|ℓ⊥⟩             tub(a⊥) = 0.01|t⟩ + 0.99|t⊥⟩

    bronc(s)  = 0.6|b⟩ + 0.4|b⊥⟩              xray(e)  = 0.98|x⟩ + 0.02|x⊥⟩
    bronc(s⊥) = 0.3|b⟩ + 0.7|b⊥⟩              xray(e⊥) = 0.05|x⟩ + 0.95|x⊥⟩

    dysp(b, e)   = 0.9|d⟩ + 0.1|d⊥⟩           either(ℓ, t)   = 1|e⟩
    dysp(b, e⊥)  = 0.7|d⟩ + 0.3|d⊥⟩           either(ℓ, t⊥)  = 1|e⟩
    dysp(b⊥, e)  = 0.8|d⟩ + 0.2|d⊥⟩           either(ℓ⊥, t)  = 1|e⟩
    dysp(b⊥, e⊥) = 0.1|d⟩ + 0.9|d⊥⟩           either(ℓ⊥, t⊥) = 1|e⊥⟩

Figure 2.5 The conditional probability tables from Figure 2.4 described as states and channels.

We first move the prior states forward to a point that includes the set L — the one that we need to marginalise on to get our conclusion. We abbreviate this state on B × L × T as:

    σ := (bronc ⊗ lung ⊗ id) ≫ ((∆ ⊗ tub) ≫ (smoke ⊗ asia)).

Recall that we write ∆ for the copy channel, in this expression of type S → S × S.

Going in the backward direction we can form a predicate, called p below, on the set B × L × T, by predicate transformation and conjunction:

    p := (1 ⊗ 1 ⊗ 1t) & ((id ⊗ either) ≪ (dysp ≪ 1d)).

The result that we are after is now obtained via updating and marginalisation:

    σ|p [0, 1, 0] = 0.0558|ℓ⟩ + 0.9442|ℓ⊥⟩.    (2.8)

There is an alternative way to describe the same outcome, using that certain channels can be ‘shifted’. In particular, in the definition of the above state σ, the channel bronc is used for state transformation. It can also be used in a different role, namely for predicate transformation. We then use a slightly different state, now on S × L × T,

    τ := (id ⊗ lung ⊗ id) ≫ ((∆ ⊗ tub) ≫ (smoke ⊗ asia)).

The bronc channel is now used for predicate transformation in the predicate:

    q := (1 ⊗ 1 ⊗ 1t) & ((bronc ⊗ either) ≪ (dysp ≪ 1d)).

The same updated lung cancer distribution is now obtained as:

    τ|q [0, 1, 0] = 0.0558|ℓ⟩ + 0.9442|ℓ⊥⟩.    (2.9)


The reason why the outcomes (2.8) and (2.9) are the same is the topic of Exercise 2.7.2.

We conclude that inference in Bayesian networks can be done compositionally via a combination of forward and backward inference, basically by following the network structure.
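As a cross-check of the last computation, the same posterior can be obtained by brute-force enumeration of the joint distribution. This is not the compositional method advocated above, only an independent numerical check; the Python sketch below conditions on dyspnoea and tuberculosis and marginalises onto lung cancer.

    from itertools import product

    # Probabilities of the 'positive' value, from Figure 2.4.
    p_smoke, p_asia = 0.5, 0.01
    p_lung  = {True: 0.1,  False: 0.01}            # given smoke
    p_tub   = {True: 0.05, False: 0.01}            # given asia
    p_bronc = {True: 0.6,  False: 0.3}             # given smoke
    p_dysp  = {(True, True): 0.9, (True, False): 0.7,
               (False, True): 0.8, (False, False): 0.1}   # given (bronc, either)

    def bern(p, value):                            # probability of a Boolean value
        return p if value else 1 - p

    posterior = {True: 0.0, False: 0.0}
    for s, a, l, t, b in product([True, False], repeat=5):
        e = l or t                                 # 'either' is deterministic
        weight = (bern(p_smoke, s) * bern(p_asia, a) * bern(p_lung[s], l) *
                  bern(p_tub[a], t) * bern(p_bronc[s], b) *
                  p_dysp[(b, e)])                  # evidence dysp = d enters as a factor
        if t:                                      # evidence tub = t
            posterior[l] += weight

    print(posterior[True] / (posterior[True] + posterior[False]))   # approximately 0.0558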

Exercises

2.7.1 Consider the wetness Bayesian network from Section 1.9. Write down the channel-based inference formulas for the following inference questions and check the outcomes that are given below.

1 The updated sprinkler distribution, given evidence of a slippery road, is 63/260 |b⟩ + 197/260 |b⊥⟩.
2 The updated wet grass distribution, given evidence of a slippery road, is 4349/5200 |d⟩ + 851/5200 |d⊥⟩.

(Answers will appear later, in Example 3.3.3, in two different ways.)

2.7.2 Check that the equality of the outcomes in (2.8) and in (2.9) can be explained via Exercises 2.3.5 (2) and 2.4.7.

2.8 Validity-based distances

There are standard notions of distance between states, and also between predicates. It turns out these distances can both be formulated in terms of validity |=, in a dual form. For two states ω1, ω2 ∈ D(X), we take the join of the distances in [0, 1] between validities:

    d(ω1, ω2) := ⋁_{p∈Pred(X)} |(ω1 |= p) − (ω2 |= p)|.    (2.10)

Similarly, for two predicates p1, p2 ∈ Pred(X),

    d(p1, p2) := ⋁_{ω∈D(X)} |(ω |= p1) − (ω |= p2)|.    (2.11)

This section collects some standard results about these distances (for discrete probability), in order to have a useful overview. We note that the above formulations involve predicates only, not observables in general.


2.8.1 Distance between states

The distance defined in (2.10) is commonly called the total variation distance, which is a special case of the Kantorovich distance, see e.g. [31, 10, 71, 67]. Its characterisations below are standard. We refer to [52] for more information about the validity-based approach.

Proposition 2.8.1. Let X be an arbitrary set, with states ω1, ω2 ∈ D(X). Then:

    d(ω1, ω2) = max_{U⊆X} (ω1 |= 1U) − (ω2 |= 1U) = 1/2 ∑_{x∈X} |ω1(x) − ω2(x)|.

We write maximum ‘max’ instead of join ⋁ to express that the supremum is actually reached by a subset (sharp predicate).

Proof. Let ω1, ω2 ∈ D(X) be two discrete probability distributions on the same set X. We will prove the two inequalities labeled (a) and (b) in:

    1/2 ∑_{x∈X} |ω1(x) − ω2(x)|  ≤(a)  max_{U⊆X} (ω1 |= 1U) − (ω2 |= 1U)
        ≤  ⋁_{p∈Pred(X)} |(ω1 |= p) − (ω2 |= p)|
        ≤(b)  1/2 ∑_{x∈X} |ω1(x) − ω2(x)|.

This proves Proposition 2.8.1 since the inequality in the middle is trivial.

We start with some preparatory definitions. Let U ⊆ X be an arbitrary subset.

We shall write ωi(U) = ∑_{x∈U} ωi(x) = (ωi |= 1U). We partition U into three disjoint parts, and take the relevant sums:

    U> = {x ∈ U | ω1(x) > ω2(x)}        U↑ = ω1(U>) − ω2(U>) ≥ 0
    U= = {x ∈ U | ω1(x) = ω2(x)}        U↓ = ω2(U<) − ω1(U<) ≥ 0.
    U< = {x ∈ U | ω1(x) < ω2(x)}

We use this notation in particular for U = X. In that case we can use:

    1 = ω1(X) = ω1(X>) + ω1(X=) + ω1(X<)
    1 = ω2(X) = ω2(X>) + ω2(X=) + ω2(X<)

Hence by subtraction we obtain, since ω1(X=) = ω2(X=),

    0 = (ω1(X>) − ω2(X>)) + (ω1(X<) − ω2(X<)).

That is,

X↑ = ω1(X>) − ω2(X>) = ω2(X<) − ω1(X<) = X↓ .


As a result:

    1/2 ∑_{x∈X} |ω1(x) − ω2(x)|
      = 1/2 ( ∑_{x∈X>} (ω1(x) − ω2(x)) + ∑_{x∈X<} (ω2(x) − ω1(x)) )
      = 1/2 ( (ω1(X>) − ω2(X>)) + (ω2(X<) − ω1(X<)) )
      = 1/2 (X↑ + X↓) = X↑.                                              (2.12)

We have prepared the ground for proving the above inequalities (a) and (b).

(a) We will see that the above maximum is actually reached for the subset U = X>, first of all because:

    1/2 ∑_{x∈X} |ω1(x) − ω2(x)| = X↑ = ω1(X>) − ω2(X>)    by (2.12)
      = (ω1 |= 1X>) − (ω2 |= 1X>)
      ≤ max_{U⊆X} (ω1 |= 1U) − (ω2 |= 1U).

(b) Let p ∈ Pred(X) be an arbitrary predicate. We have (1U & p)(x) = 1U(x) · p(x), which is p(x) if x ∈ U and 0 otherwise. Since ω1 and ω2 agree on X=, we have ω1 |= 1X= & p = ω2 |= 1X= & p. Writing (∗) for the case that ω1 |= 1X> & p − ω2 |= 1X> & p ≥ ω2 |= 1X< & p − ω1 |= 1X< & p, we then get:

    |(ω1 |= p) − (ω2 |= p)|
      = | (ω1 |= 1X> & p + ω1 |= 1X= & p + ω1 |= 1X< & p)
            − (ω2 |= 1X> & p + ω2 |= 1X= & p + ω2 |= 1X< & p) |
      = | (ω1 |= 1X> & p − ω2 |= 1X> & p) − (ω2 |= 1X< & p − ω1 |= 1X< & p) |
      = (ω1 |= 1X> & p − ω2 |= 1X> & p) − (ω2 |= 1X< & p − ω1 |= 1X< & p)    if (∗)
        (ω2 |= 1X< & p − ω1 |= 1X< & p) − (ω1 |= 1X> & p − ω2 |= 1X> & p)    otherwise
      ≤ ω1 |= 1X> & p − ω2 |= 1X> & p    if (∗)
        ω2 |= 1X< & p − ω1 |= 1X< & p    otherwise
      = ∑_{x∈X>} (ω1(x) − ω2(x)) · p(x)    if (∗)
        ∑_{x∈X<} (ω2(x) − ω1(x)) · p(x)    otherwise
      ≤ ∑_{x∈X>} ω1(x) − ω2(x)    if (∗)
        ∑_{x∈X<} ω2(x) − ω1(x)    otherwise
      = X↑    if (∗)
        X↓ = X↑    otherwise
      = X↑ = 1/2 ∑_{x∈X} |ω1(x) − ω2(x)|    by (2.12).


This completes the proof.
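The agreement of the two outer formulations in Proposition 2.8.1 is easy to test empirically. The sketch below is only a numerical check on random states over a small set (subset enumeration is exponential, so it does not scale); the helper names are ad-hoc.

    import random
    from itertools import chain, combinations

    X = list(range(5))

    def rand_state():
        w = [random.random() for _ in X]
        s = sum(w)
        return [wi / s for wi in w]

    w1, w2 = rand_state(), rand_state()

    half_sum = 0.5 * sum(abs(a - b) for a, b in zip(w1, w2))

    subsets = chain.from_iterable(combinations(X, r) for r in range(len(X) + 1))
    max_subset = max(sum(w1[x] - w2[x] for x in U) for U in subsets)

    print(abs(half_sum - max_subset) < 1e-12)      # True: the two formulations agree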

The sum-formulation in Proposition 2.8.1 is useful in many situations, for instance in order to prove that the above distance function d between states is a metric.

Lemma 2.8.2. The distance d(ω1, ω2) between states ω1, ω2 ∈ D(X) in (2.10) turns the set of distributions D(X) into a metric space, with [0, 1]-valued metric.

Proof. If d(ω1, ω2) = 1/2 ∑_{x∈X} |ω1(x) − ω2(x)| = 0, then |ω1(x) − ω2(x)| = 0 for each x ∈ X, so that ω1(x) = ω2(x), and thus ω1 = ω2. Obviously, d(ω1, ω2) = d(ω2, ω1). The triangle inequality holds for d since it holds for the standard distance on [0, 1]:

    d(ω1, ω3) = 1/2 ∑_{x∈X} |ω1(x) − ω3(x)|
      ≤ 1/2 ∑_{x∈X} (|ω1(x) − ω2(x)| + |ω2(x) − ω3(x)|)
      = 1/2 ∑_{x∈X} |ω1(x) − ω2(x)| + 1/2 ∑_{x∈X} |ω2(x) − ω3(x)|
      = d(ω1, ω2) + d(ω2, ω3).

We use this same sum-formulation for the following result.

Lemma 2.8.3. State transformation is non-expansive: for a channel c : X → Y one has:

    d(c ≫ ω1, c ≫ ω2) ≤ d(ω1, ω2).

Proof. Since:

    d(c ≫ ω1, c ≫ ω2) = 1/2 ∑_y |(c ≫ ω1)(y) − (c ≫ ω2)(y)|
      = 1/2 ∑_y |∑_x ω1(x) · c(x)(y) − ∑_x ω2(x) · c(x)(y)|
      = 1/2 ∑_y |∑_x c(x)(y) · (ω1(x) − ω2(x))|
      ≤ 1/2 ∑_y ∑_x c(x)(y) · |ω1(x) − ω2(x)|
      = 1/2 ∑_x (∑_y c(x)(y)) · |ω1(x) − ω2(x)|
      = d(ω1, ω2).

For two states ω1, ω2 ∈ D(X) a coupling is a joint state σ ∈ D(X × X) that marginalises to ω1 and ω2, i.e. σ[1, 0] = ω1 and σ[0, 1] = ω2. Such couplings give an alternative formulation of the distance between states, which is often called the Wasserstein distance, see [100]. These couplings will also be used in Section 3.8 as probabilistic relations. The proof of the next result is standard and is included in order to be complete.


Proposition 2.8.4. For states ω1, ω2 ∈ D(X),

    d(ω1, ω2) = ⋀ {σ |= Eq⊥ | σ is a coupling between ω1, ω2},

where Eq : X × X → [0, 1] is the equality predicate from Definition 2.1.1 (4).

Proof. We use the notation and results from the proof of Proposition 2.8.1. We first prove the inequality (≤). Let X> = {x ∈ X | ω1(x) > ω2(x)} and let σ ∈ D(X × X) be an arbitrary coupling of ω1, ω2. Then:

    ω1(x) = ∑_y σ(x, y) = σ(x, x) + ∑_{y≠x} σ(x, y) ≤ ω2(x) + (σ |= (1x ⊗ 1) & Eq⊥).

This means that ω1(x) − ω2(x) ≤ σ |= (1x ⊗ 1) & Eq⊥ for x ∈ X>. We similarly have:

    ω2(x) = ∑_y σ(y, x) = σ(x, x) + ∑_{y≠x} σ(y, x) ≤ ω1(x) + (σ |= (1 ⊗ 1x) & Eq⊥).

Hence ω2(x) − ω1(x) ≤ σ |= (1 ⊗ 1x) & Eq⊥ for x ∉ X>. Putting this together gives:

    d(ω1, ω2) = 1/2 ∑_{x∈X} |ω1(x) − ω2(x)|
      = 1/2 ∑_{x∈X>} (ω1(x) − ω2(x)) + 1/2 ∑_{x∉X>} (ω2(x) − ω1(x))
      ≤ 1/2 ∑_{x∈X>} σ |= (1x ⊗ 1) & Eq⊥ + 1/2 ∑_{x∉X>} σ |= (1 ⊗ 1x) & Eq⊥
      = 1/2 σ |= (1X> ⊗ 1) & Eq⊥ + 1/2 σ |= (1 ⊗ 1¬X>) & Eq⊥
      ≤ 1/2 σ |= Eq⊥ + 1/2 σ |= Eq⊥
      = σ |= Eq⊥.

For the inequality (≥) one uses what is called an optimal coupling ρ ∈ D(X × X) of ω1, ω2. It can be defined as:

    ρ(x, y) :=  min(ω1(x), ω2(x))                                                    if x = y,
                max(ω1(x) − ω2(x), 0) · max(ω2(y) − ω1(y), 0) / d(ω1, ω2)            otherwise.

We first check that this ρ is a coupling. Let x ∈ X> so that ω1(x) > ω2(x); then:

    ∑_y ρ(x, y) = ω2(x) + (ω1(x) − ω2(x)) · ( ∑_{y≠x} max(ω2(y) − ω1(y), 0) ) / d(ω1, ω2)
      = ω2(x) + (ω1(x) − ω2(x)) · ( ∑_{y∈X<} ω2(y) − ω1(y) ) / d(ω1, ω2)
      = ω2(x) + (ω1(x) − ω2(x)) · X↓ / d(ω1, ω2)
      = ω2(x) + (ω1(x) − ω2(x)) · 1       see the proof of Proposition 2.8.1
      = ω1(x).


If x ∉ X>, so that ω1(x) ≤ ω2(x), then it is obvious that ∑_y ρ(x, y) = ω1(x) + 0 = ω1(x). This shows ρ[1, 0] = ω1. In a similar way one obtains ρ[0, 1] = ω2.

Finally,

    ρ |= Eq = ∑_x ρ(x, x) = ∑_x min(ω1(x), ω2(x))
      = ∑_{x∈X>} ω2(x) + ∑_{x∉X>} ω1(x)
      = ω2(X>) + 1 − ω1(X>)
      = 1 − (ω1(X>) − ω2(X>))
      = 1 − d(ω1, ω2).

Hence d(ω1, ω2) = 1 − (ρ |= Eq) = ρ |= Eq⊥.
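The optimal coupling from this proof can be written down directly in code. The sketch below (with ad-hoc names, and assuming the two random states differ so that d is non-zero) builds ρ for two random states and checks both that it is a coupling and that it realises the distance.

    import random

    X = list(range(4))

    def rand_state():
        w = [random.random() for _ in X]
        s = sum(w)
        return [wi / s for wi in w]

    w1, w2 = rand_state(), rand_state()          # assumed to be different states
    d = 0.5 * sum(abs(w1[x] - w2[x]) for x in X)

    def rho(x, y):                               # the optimal coupling from the proof
        if x == y:
            return min(w1[x], w2[x])
        return max(w1[x] - w2[x], 0) * max(w2[y] - w1[y], 0) / d

    m1 = [sum(rho(x, y) for y in X) for x in X]  # first marginal
    m2 = [sum(rho(x, y) for x in X) for y in X]  # second marginal
    print(all(abs(m1[x] - w1[x]) < 1e-12 for x in X))   # True: rho[1, 0] = w1
    print(all(abs(m2[y] - w2[y]) < 1e-12 for y in X))   # True: rho[0, 1] = w2

    eq_neg = sum(rho(x, y) for x in X for y in X if x != y)   # rho |= Eq⊥
    print(abs(eq_neg - d) < 1e-12)               # True: the coupling realises the distance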

2.8.2 Distance between predicates

What we have to say about the validity-based distance (2.11) between predicates is rather brief. First, there is also a point-wise formulation.

Lemma 2.8.5. For two predicates p1, p2 ∈ Pred(X),

    d(p1, p2) = ⋁_{x∈X} |p1(x) − p2(x)|.

This distance function d makes the set Pred(X) into a metric space.

Proof. First, we have for each x ∈ X,

    d(p1, p2) = ⋁_{ω∈D(X)} |(ω |= p1) − (ω |= p2)|    by (2.11)
      ≥ |(unit(x) |= p1) − (unit(x) |= p2)|
      = |p1(x) − p2(x)|.

Hence d(p1, p2) ≥ ⋁_x |p1(x) − p2(x)|. The other direction follows from:

    |(ω |= p1) − (ω |= p2)| = |∑_z ω(z) · p1(z) − ∑_z ω(z) · p2(z)|
      ≤ ∑_z ω(z) · |p1(z) − p2(z)|
      ≤ ∑_z ω(z) · ⋁_{x∈X} |p1(x) − p2(x)|
      = (∑_z ω(z)) · ⋁_{x∈X} |p1(x) − p2(x)|
      = ⋁_{x∈X} |p1(x) − p2(x)|.

The fact that we get a metric space is now straightforward.

We have the analogue of Lemma 2.8.3.


Lemma 2.8.6. Predicate transformation is non-expansive: for a channel c : X → Y one has, for predicates p1, p2 ∈ Pred(Y),

    d(c ≪ p1, c ≪ p2) ≤ d(p1, p2).

Proof. Via the formulation of Lemma 2.8.5 we get:

    d(c ≪ p1, c ≪ p2) = ⋁_x |(c ≪ p1)(x) − (c ≪ p2)(x)|
      = ⋁_x |∑_y c(x)(y) · p1(y) − ∑_y c(x)(y) · p2(y)|
      ≤ ⋁_x ∑_y c(x)(y) · |p1(y) − p2(y)|
      ≤ ⋁_x (∑_y c(x)(y)) · d(p1, p2)
      = d(p1, p2).

Exercises

2.8.1 In [54] the influence of a predicate p on a state ω is measured via the distance d(ω, ω|p). This influence can be zero, for the truth predicate p = 1.

Consider the set {H, T} with state σ(ε) = ε|H⟩ + (1 − ε)|T⟩, and with predicate p = 1H. Prove that d(σ(ε), σ(ε)|p) → 1 as ε → 0.

2.8.2 This exercise uses the distance between a joint state and the product of its marginals as measure of entwinedness, like in [44].

1 Take σ2 := 1/2 |00⟩ + 1/2 |11⟩ ∈ D(2 × 2), for 2 = {0, 1}. Show that:

    d(σ2, σ2[1, 0] ⊗ σ2[0, 1]) = 1/2.

2 Take σ3 := 1/2 |000⟩ + 1/2 |111⟩ ∈ D(2 × 2 × 2). Show that:

    d(σ3, σ3[1, 0, 0] ⊗ σ3[0, 1, 0] ⊗ σ3[0, 0, 1]) = 3/4.

3 Now define σn ∈ D(2^n) for n ≥ 2 as:

    σn := 1/2 |0···0⟩ + 1/2 |1···1⟩    (with n digits in each ket).

Show that:

• each marginal πi ≫ σn equals 1/2 |0⟩ + 1/2 |1⟩;
• the product ⊗_i (πi ≫ σn) of the marginals is the uniform state on 2^n;
• d(σn, ⊗_i (πi ≫ σn)) = (2^(n−1) − 1) / 2^(n−1).


Note that the latter distance goes to 1 as n goes to infinity.

2.8.3 Show that D(X), with the total variation distance (2.10), is a complete metric space when the set X is finite.

2.9 Variance, covariance and correlation

This section describes the standard notions of variance, covariance and correlation within the setting of this book. It uses the validity relation |= and the operations on observables from Section 2.2. Recall that we understand a random variable here as a pair (ω, p) consisting of a state ω ∈ D(X) and an observable p : X → R. It turns out that it is highly relevant which combination of state and observable is used: the notions of covariance and correlation are defined for two random variables. The question then comes up: do these two random variables involve the same state, like in (ω, p1) and (ω, p2), or do we have two random variables with different states, like in (ω1, p1) and (ω2, p2), or is there an underlying joint state? These differences are significant, but are usually not dealt with explicitly when the state associated with a random variable is left implicit.

Let (ω, p) be a random variable on a set X. The validity ω |= p is a real number, and can thus be used as a scalar, in the sense of Section 2.2. The truth predicate forms an observable 1 ∈ Obs(X); scalar multiplication yields a new observable (ω |= p) · 1 ∈ Obs(X). It can be subtracted³ from p, and then squared, giving an observable:

    (p − (ω |= p) · 1)² = (p − (ω |= p) · 1) & (p − (ω |= p) · 1) ∈ Obs(X).

This observable denotes the function that sends x ∈ X to (p(x) − (ω |= p))² ∈ R. Its validity in the original state ω is called variance. It captures how far the values of p differ from their expected value.

Definition 2.9.1. For a random variable (ω, p), the variance Var(ω, p) is the non-negative number defined by:

    Var(ω, p) := ω |= (p − (ω |= p) · 1)².

This number is non-negative since it is the validity of a factor: the right-hand-side of |= is a square and is thus R≥0-valued.

3 Subtraction expressions like these occur more frequently in mathematics. For instance, an eigenvalue λ of a matrix M may be defined as a scalar for which det(M − λ · 1) = 0, where 1 is the identity matrix. A similar expression is used to define the elements in the spectrum of a C∗-algebra. See also Exercise 2.2.8.


When the underlying sample space X is a subset of R, say via incl : X ↪→ R, then one simply writes Var(ω) for Var(ω, incl).

The name standard deviation is used for the square root of the variance; thus:

    StDev(ω, p) := √Var(ω, p).

Example 2.9.2. 1 We recall Example 2.1.4 (1), with distribution flip(0.3) = 0.3|1⟩ + 0.7|0⟩ and observable v(0) = −50 and v(1) = 100. We had ω |= v = −5, and so we get:

    Var(flip(0.3), v) = ∑_x flip(0.3)(x) · (v(x) + 5)²
      = 0.3 · (100 + 5)² + 0.7 · (−50 + 5)² = 4725.

The standard deviation is around 68.7.

2 For a (fair) dice we have pips = {1, 2, 3, 4, 5, 6} ↪→ R and mean(dice) = 7/2, so that:

    Var(dice) = ∑_x dice(x) · (x − 7/2)²
      = 1/6 · ((5/2)² + (3/2)² + (1/2)² + (1/2)² + (3/2)² + (5/2)²) = 70/24.

The following result is sometimes useful in calculations, see the subsequent binomial example.

Lemma 2.9.3. Variance satisfies:

    Var(ω, p) = (ω |= p²) − (ω |= p)².

Proof. We have:

    Var(ω, p) = ω |= (p − (ω |= p) · 1)²
      = ∑_x ω(x) · (p(x) − (ω |= p))²
      = ∑_x ω(x) · (p(x)² − 2(ω |= p) p(x) + (ω |= p)²)
      = (∑_x ω(x) p(x)²) − 2(ω |= p)(∑_x ω(x) p(x)) + (∑_x ω(x)(ω |= p)²)
      = (ω |= p²) − 2(ω |= p)(ω |= p) + (ω |= p)²
      = (ω |= p²) − (ω |= p)².

We use this result to obtain the variance of a binomial distribution.

Example 2.9.4. We fix K ∈ N and r ∈ [0, 1] and use incl : {0, 1, . . . , K} ↪→ R as observable. We recall from Example 2.1.4 (4) that:

    mean(binom[K](r)) = binom[K](r) |= incl = K · r.


By re-using the same trick as for the mean calculation, we first compute:

    binom[K](r) |= incl & (incl − 1)
      = ∑_{0≤k≤K} binom[K](r)(k) · k · (k − 1)
      = ∑_{0≤k≤K} k · (k − 1) · K!/(k! · (K − k)!) · r^k · (1 − r)^(K−k)
      = K · (K − 1) · r² · ∑_{2≤k≤K} (K−2)!/((k−2)! · ((K−2) − (k−2))!) · r^(k−2) · (1 − r)^((K−2)−(k−2))
      = K · (K − 1) · r² · ∑_{0≤j≤K−2} (K−2)!/(j! · ((K−2) − j)!) · r^j · (1 − r)^((K−2)−j)
      = K · (K − 1) · r².

Then, using the reformulation of Lemma 2.9.3,

    Var(binom[K](r)) = (binom[K](r) |= incl & incl) − (binom[K](r) |= incl)²
      = (binom[K](r) |= incl & (incl − 1)) + (binom[K](r) |= incl) − (K · r)²
      = K · (K − 1) · r² + K · r − (K · r)²
      = K · r · (1 − r).

We then notice that the variance of the binomial is ‘symmetric’ in the probability, in the sense that Var(binom[K](r)) = Var(binom[K](1 − r)).
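A quick numerical check of the formula Var(binom[K](r)) = K · r · (1 − r) is easy to perform; the sketch below writes out the binomial distribution explicitly and needs nothing beyond the standard library.

    from math import comb

    def binom(K, r):
        return [comb(K, k) * r**k * (1 - r)**(K - k) for k in range(K + 1)]

    K, r = 10, 0.3
    dist = binom(K, r)
    mean = sum(k * p for k, p in enumerate(dist))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(dist))

    print(mean, K * r)                 # both 3.0
    print(var, K * r * (1 - r))        # both 2.1, up to rounding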

We continue with covariance and correlation, which involve two random variables, instead of one, as for variance. We distinguish two forms of these, namely:

• The two random variables are of the form (ω, p1) and (ω, p2), where they share their state ω.
• There is a joint state τ ∈ D(X1 × X2) together with two observables q1 : X1 → R and q2 : X2 → R on the two components X1, X2. This situation can be seen as a special case of the previous point by first weakening the two observables to the product space, via: π1 ≪ q1 = q1 ⊗ 1 and π2 ≪ q2 = 1 ⊗ q2. Thus we obtain two random variables with a shared state:

    (τ, π1 ≪ q1) and (τ, π2 ≪ q2).

These observable transformations πi ≪ qi along a deterministic channel πi : X1 × X2 → Xi can also be described simply as function composition qi ∘ πi : X1 × X2 → R, see Lemma 2.4.2 (7).

We start with the situation in the first bullet above, and deal with the second bullet in Definition 2.9.8.


Definition 2.9.5. Let (ω, p1) and (ω, p2) be two random variables with a shared state ω ∈ D(X).

1 The covariance of these random variables is defined as the validity:

    Cov(ω, p1, p2) := ω |= (p1 − (ω |= p1) · 1) & (p2 − (ω |= p2) · 1).

2 The correlation between (ω, p1) and (ω, p2) is the covariance divided by the product of their standard deviations:

    Cor(ω, p1, p2) := Cov(ω, p1, p2) / (StDev(ω, p1) · StDev(ω, p2)).

Notice that variance is a special case of covariance, namely where p = p1 = p2. Hence if there is an inclusion incl : X ↪→ R and we would use this inclusion twice to compute covariance, we are in fact computing variance.

Before we go on, we state the following analogue of Lemma 2.9.3, leaving the proof to the reader.

Lemma 2.9.6. Covariance can be reformulated as:

    Cov(ω, p1, p2) = (ω |= p1 & p2) − (ω |= p1) · (ω |= p2).

Example 2.9.7. 1 We have seen in Definition 2.1.2 (2) how the average of an observable can be computed as its validity in a uniform state. The same approach is used to compute the covariance (and correlation) in a uniform state. Consider the following two lists a and b of numerical data, of the same length.

    a = [5, 10, 15, 20, 25]    b = [10, 8, 10, 15, 12]

We will identify a and b with random variables, namely with a, b : 5 → R, where 5 = {0, 1, 2, 3, 4}. Hence there are obvious definitions: a(0) = 5, a(1) = 10, a(2) = 15, a(3) = 20, a(4) = 25, and similarly for b. Then we can compute their averages as validities in the uniform state υ5 on the set 5:

    avg(a) = υ5 |= a = 15    avg(b) = υ5 |= b = 11.

We will calculate the covariance between a and b wrt. the uniform state υ5, as:

    Cov(υ5, a, b) = υ5 |= (a − (υ5 |= a) · 1) & (b − (υ5 |= b) · 1)
      = ∑_i 1/5 · (a(i) − 15) · (b(i) − 11)
      = 11.


2 In order to obtain the correlation between a, b, we first need to compute their variances:

    Var(υ5, a) = ∑_i 1/5 · (a(i) − 15)² = 50
    Var(υ5, b) = ∑_i 1/5 · (b(i) − 11)² = 5.6

Then:

    Cor(υ5, a, b) = Cov(υ5, a, b) / (√Var(υ5, a) · √Var(υ5, b)) = 11 / (√50 · √5.6) ≈ 0.66.

We will next define covariance and correlation for a joint state with observables on its component spaces, as a special case of what we have seen so far. We shall refer to this instance as joint covariance and joint correlation, with notation JCov and JCor.

Definition 2.9.8. Let τ ∈ D(X1 × X2) be a joint state on sets X1, X2 and let q1 ∈ Obs(X1) and q2 ∈ Obs(X2) be two observables on these two sets separately. In this situation the joint covariance is defined as:

    JCov(τ, q1, q2) := Cov(τ, π1 ≪ q1, π2 ≪ q2).

Similarly, the joint correlation is:

    JCor(τ, q1, q2) := Cor(τ, π1 ≪ q1, π2 ≪ q2).

In both these cases, if there are inclusions X1 ↪→ R and X2 ↪→ R, then one can use these inclusions as random variables and write just JCov(τ) and JCor(τ).

Joint covariance can be reformulated in different ways, including in the style that we have seen before, in Lemmas 2.9.3 and 2.9.6.

Lemma 2.9.9. Joint covariance can be reformulated as:

    JCov(τ, q1, q2) = τ |= (q1 − (τ[1, 0] |= q1) · 1) ⊗ (q2 − (τ[0, 1] |= q2) · 1)
                    = (τ |= q1 ⊗ q2) − (τ[1, 0] |= q1) · (τ[0, 1] |= q2).

Proof. The first equation follows from:

    JCov(τ, q1, q2) = Cov(τ, π1 ≪ q1, π2 ≪ q2)
      = τ |= ((π1 ≪ q1) − (τ |= π1 ≪ q1) · 1) & ((π2 ≪ q2) − (τ |= π2 ≪ q2) · 1)
      = τ |= ((π1 ≪ q1) − (τ[1, 0] |= q1) · (π1 ≪ 1)) & ((π2 ≪ q2) − (τ[0, 1] |= q2) · (π2 ≪ 1))
      = τ |= (π1 ≪ (q1 − (τ[1, 0] |= q1) · 1)) & (π2 ≪ (q2 − (τ[0, 1] |= q2) · 1))
      = τ |= (q1 − (τ[1, 0] |= q1) · 1) ⊗ (q2 − (τ[0, 1] |= q2) · 1).


The last equation follows from Exercise 2.4.5. Via this first equation in Lemma 2.9.9 we prove the second one:

    JCov(τ, q1, q2)
      = τ |= (q1 − (τ[1, 0] |= q1) · 1) ⊗ (q2 − (τ[0, 1] |= q2) · 1)
      = ∑_{x,y} τ(x, y) · (q1 − (τ[1, 0] |= q1) · 1)(x) · (q2 − (τ[0, 1] |= q2) · 1)(y)
      = ∑_{x,y} τ(x, y) · (q1(x) − (τ[1, 0] |= q1)) · (q2(y) − (τ[0, 1] |= q2))
      = (∑_{x,y} τ(x, y) · q1(x) · q2(y)) − (∑_{x,y} τ(x, y) · q1(x) · (τ[0, 1] |= q2))
          − (∑_{x,y} τ(x, y) · (τ[1, 0] |= q1) · q2(y)) + (τ[1, 0] |= q1)(τ[0, 1] |= q2)
      = (τ |= q1 ⊗ q2) − (τ[1, 0] |= q1)(τ[0, 1] |= q2).

We turn to some illustrations. Many examples of covariance are actually of joint form, especially if the underlying sets are subsets of the real numbers. In the joint case it makes sense to leave these inclusions implicit, as will be illustrated below.

Example 2.9.10. 1 Consider sets X = {1, 2} and Y = {1, 2, 3} as subsets of R, together with a joint distribution τ ∈ D(X × Y) given by:

    τ = 1/4 |1, 1⟩ + 1/4 |1, 2⟩ + 1/4 |2, 2⟩ + 1/4 |2, 3⟩.

Its two marginals are:

    τ[1, 0] = 1/2 |1⟩ + 1/2 |2⟩    τ[0, 1] = 1/4 |1⟩ + 1/2 |2⟩ + 1/4 |3⟩.

Since both X ↪→ R and Y ↪→ R we get means as validities of the inclusions:

    mean(τ[1, 0]) = 3/2    mean(τ[0, 1]) = 2.

Now we can compute the joint covariance in the joint state τ as:

    JCov(τ) = τ |= (incl − mean(τ[1, 0]) · 1) ⊗ (incl − mean(τ[0, 1]) · 1)
      = ∑_{x,y} τ(x, y) · (x − 3/2) · (y − 2)
      = 1/4 · (−1/2 · −1 + 1/2 · 1) = 1/4.

2 In order to calculate the (joint) correlation of τ we first need the variances of its marginals:

    Var(τ[1, 0]) = ∑_x τ[1, 0](x) · (x − 3/2)² = 1/4
    Var(τ[0, 1]) = ∑_y τ[0, 1](y) · (y − 2)² = 1/2.

Then:

    JCor(τ) = JCov(τ) / (√Var(τ[1, 0]) · √Var(τ[0, 1])) = (1/4) / (1/2 · 1/√2) = (1/2)√2.
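For completeness, the same joint covariance and correlation in code, with the joint state τ represented as a dictionary on pairs (an ad-hoc encoding, not from the text):

    from math import sqrt

    tau = {(1, 1): 0.25, (1, 2): 0.25, (2, 2): 0.25, (2, 3): 0.25}

    mean_x = sum(p * x for (x, y), p in tau.items())      # 1.5
    mean_y = sum(p * y for (x, y), p in tau.items())      # 2.0

    jcov = sum(p * (x - mean_x) * (y - mean_y) for (x, y), p in tau.items())   # 0.25

    var_x = sum(p * (x - mean_x) ** 2 for (x, y), p in tau.items())            # 0.25
    var_y = sum(p * (y - mean_y) ** 2 for (x, y), p in tau.items())            # 0.5
    print(jcov / (sqrt(var_x) * sqrt(var_y)))             # 0.7071..., that is (1/2)*sqrt(2)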


We have defined joint covariance as a special case of ordinary covariance. We conclude this section by showing that ordinary covariance can also be seen as joint covariance, namely for a copied state.

Proposition 2.9.11. Ordinary covariance can be expressed as joint covariance:

    Cov(ω, p1, p2) = JCov(∆ ≫ ω, p1, p2).

Proof. We use that the marginals of the copied state ∆ ≫ ω are ω itself, in:

    JCov(∆ ≫ ω, p1, p2)
      = ∆ ≫ ω |= (p1 − ((∆ ≫ ω)[1, 0] |= p1) · 1) ⊗ (p2 − ((∆ ≫ ω)[0, 1] |= p2) · 1)    by Lemma 2.9.9
      = ω |= ∆ ≪ ((p1 − (ω |= p1) · 1) ⊗ (p2 − (ω |= p2) · 1))
      = ω |= (p1 − (ω |= p1) · 1) & (p2 − (ω |= p2) · 1)    by Exercise 2.4.5 (2)
      = Cov(ω, p1, p2).

We started with ‘ordinary’ covariance in Definition 2.9.5 for two random variables with a shared state. It was subsequently used to define the ‘joint’ version in Definition 2.9.8. The above result shows that we could have done this the other way around: obtain the ordinary formulation in terms of the joint version. In the light of this result we shall also refer to the ordinary version of covariance as the ‘copied state’ formulation — because of the role of copying in ∆ ≫ ω. The same terminology appears in Section 4.7 in the context of learning from data.

Exercises

2.9.1 Let τ ∈ D(X × Y) be a joint state with an observable p : X → R. We can turn them into a random variable in two ways, by marginalisation and weakening:

    (τ[1, 0], p)    and    (τ, π1 ≪ p),

where π1 ≪ p = p ⊗ 1 is an observable on X × Y.

These two random variables have the same expected value, by (2.4). Show that they have the same variance too:

    Var(τ[1, 0], p) = Var(τ, π1 ≪ p).

As a result, the standard deviations are also the same.


2.9.2 Prove that:

    JCor(τ, q1, q2) = JCov(τ, q1, q2) / (StDev(τ[1, 0], q1) · StDev(τ[0, 1], q2)).

2.9.3 Prove Lemma 2.9.6 along the lines of the proof of Lemma 2.9.3.
2.9.4 Find examples of covariance and correlation computations in the literature (or online) and determine if they are of copied-state or joint-state form.

2.10 Dependence and covariance

This section explains the notion of dependence between random variables in the current setting. We shall follow the approach of the previous section and introduce two versions of dependence, also called copied-state and joint-state. Independence is related to the property ‘covariance is zero’, but in a subtle way. This will be elaborated below.

Suppose we have two random variables describing the number of ice cream sales and the temperature. Intuitively one expects a dependence between the two, and that the two variables are correlated (in an informal sense). The opposite, namely independence, is usually formalised as follows. Two random variables p1, p2 are called independent if:

    P[p1 = a, p2 = b] = P[p1 = a] · P[p2 = b]    (2.13)

holds for all real numbers a, b. In this formulation a distribution is assumed in the background. We like to use it explicitly. How should the above equation (2.13) then be read?

As we described in Subsection 2.1.1, the expression P[p = a] can be interpreted via transformation along the observable p, considered as deterministic channel:

    P[p = a] = p ≫ ω |= 1a = ω |= p ≪ 1a,    using (2.6),

where ω is the implicit distribution. We can then translate the above equation (2.13) into:

    ⟨p1, p2⟩ ≫ ω = (p1 ≫ ω) ⊗ (p2 ≫ ω).    (2.14)

But this says that the joint state ⟨p1, p2⟩ ≫ ω on R × R, obtained by transforming ω along ⟨p1, p2⟩ : X → R × R, is non-entwined: it is the product of its marginals. Indeed,


its (first) marginal is:

    (⟨p1, p2⟩ ≫ ω)[1, 0] = π1 ≫ (⟨p1, p2⟩ ≫ ω) = (π1 ∘· ⟨p1, p2⟩) ≫ ω = p1 ≫ ω.

This brings us to the following formulation. We formulate it for two random variables, but it can easily be extended to n-ary form.

Definition 2.10.1. Let (ω, p1) and (ω, p2) be two random variables with a common state ω. They will be called independent if the transformed state

    ⟨p1, p2⟩ ≫ ω = (p1 ⊗ p2) ≫ (∆ ≫ ω)

on R × R is non-entwined, as in Equation (2.14): it is the product of its marginals.

We sometimes call this the copied-state form of independence, in order to distinguish it from a later joint-state version.

We give an illustration of non-independence, that is, of dependence.

Example 2.10.2. Consider a fair coin flip = 1/2 |1⟩ + 1/2 |0⟩. We are going to use it twice, first to determine how much we will bet (either €100 or €50), and secondly to determine whether the bet is won or not. Thus we use the distribution ω = flip ⊗ flip with underlying set 2 × 2, where 2 = {0, 1}. We will define two observables p1, p2 : 2 × 2 → R on this same distribution ω.

We first define an auxiliary observable p : 2 → R for the amount of the bet:

    p(1) = 100    p(0) = 50.

We then define p1 = p ⊗ 1 = p ∘ π1 : 2 × 2 → R, as on the left below. The random variable p2 is defined on the right:

    p1(x, y) := 100 if x = 1, and 50 if x = 0;        p2(x, y) := p(x) if y = 1, and −p(x) if y = 0.

We claim that (ω, p1) and (ω, p2) are not independent. Intuitively this may be clear, since the observable p forms a connection between p1 and p2. Formally, we can see this by doing the calculations:

    ⟨p1, p2⟩ ≫ ω = D(⟨p1, p2⟩)(flip ⊗ flip)
      = 1/4 |p1(1, 1), p2(1, 1)⟩ + 1/4 |p1(1, 0), p2(1, 0)⟩ + 1/4 |p1(0, 1), p2(0, 1)⟩ + 1/4 |p1(0, 0), p2(0, 0)⟩
      = 1/4 |100, 100⟩ + 1/4 |100, −100⟩ + 1/4 |50, 50⟩ + 1/4 |50, −50⟩.


The two marginals are:

    p1 ≫ ω = (⟨p1, p2⟩ ≫ ω)[1, 0] = 1/2 |100⟩ + 1/2 |50⟩
    p2 ≫ ω = (⟨p1, p2⟩ ≫ ω)[0, 1] = 1/4 |100⟩ + 1/4 |−100⟩ + 1/4 |50⟩ + 1/4 |−50⟩.

It is not hard to see that the parallel product ⊗ of these two marginal distributions differs from ⟨p1, p2⟩ ≫ ω.

Proposition 2.10.3. The copied-state covariance of copied-state independent random variables is zero: if random variables (ω, p1) and (ω, p2) are independent, then Cov(ω, p1, p2) = 0.

The converse does not hold.

Proof. If (ω, p1) and (ω, p2) are independent, then, by definition, ⟨p1, p2⟩ ≫ ω = (p1 ≫ ω) ⊗ (p2 ≫ ω). The calculation below shows that the covariance is then zero. It uses multiplication &: R × R → R as observable. It can also be described as the parallel product id ⊗ id of the observable id : R → R with itself.

    Cov(ω, p1, p2) = (ω |= p1 & p2) − (ω |= p1) · (ω |= p2)                                 by Lemma 2.9.6
      = (ω |= & ∘ ⟨p1, p2⟩) − (p1 ≫ ω |= id) · (p2 ≫ ω |= id)                              by Exercise 2.1.3
      = (⟨p1, p2⟩ ≫ ω |= &) − ((p1 ≫ ω) ⊗ (p2 ≫ ω) |= id ⊗ id)                             by Exercises 2.2.5, 2.1.3
      = (⟨p1, p2⟩ ≫ ω |= &) − (⟨p1, p2⟩ ≫ ω |= &)                                          by assumption
      = 0.

The claim that the converse does not hold follows from Example 2.10.4, right below.

Example 2.10.4. We continue Example 2.10.2. The set-up used there involves two dependent random variables (ω, p1) and (ω, p2), with shared state ω = flip ⊗ flip. We show here that they (nevertheless) have covariance zero. This proves the second claim of Proposition 2.10.3, namely that zero covariance need not imply independence, in the copied-state context.

We first compute the validities (expected values):

    ω |= p1 = 1/4 · p1(1, 1) + 1/4 · p1(1, 0) + 1/4 · p1(0, 1) + 1/4 · p1(0, 0)
            = 1/4 · 100 + 1/4 · 100 + 1/4 · 50 + 1/4 · 50 = 75
    ω |= p2 = 1/4 · 100 + 1/4 · (−100) + 1/4 · 50 + 1/4 · (−50) = 0.


Then:

Cov(ω, p1, p2)= ω |=

(p1 − (ω |= p1) · 1

)&

(p2 − (ω |= p2) · 1

)= 1

4

((100−75) · 100 + (100−75) · −100 + (50−75) · 50 + (50−75) · −50

)= 0.

We now turn to independence in joint-state form, in analogy with joint-statecovariance in Definition 2.9.8.

Definition 2.10.5. Let τ ∈ D(X1×X2) be a joint state and with two observablesq1 ∈ Obs(X1) and q2 ∈ Obs(X2). We say that there is joint-state independenceof q1, q2 if the two random variables (τ, π1 � q1) and (τ, π2 � q2) are (copied-state) independent, as described in Definition 2.10.1.

Concretely, this means that:

(q1 ⊗ q2) � τ =(q1 � τ

[1, 0

])⊗

(q2 � τ

[0, 1

]). (2.15)

Equation (2.15) is an instance of the formulation (2.14) used in Defini-tion 2.10.1 since:

〈q1 ◦ π1, q2 ◦ π2〉 � τ = D(〈q1 ◦ π1, q2 ◦ π2〉

)(τ)

= D(q1 × q2

)(τ)

= (q1 ⊗ q2) � τ((q1 ◦ π1) � τ

)⊗

((q2 ◦ π2) � τ

)=

(q1 � (π1 � τ)

)⊗

(q2 � (π2 � τ)

)=

(q1 � τ

[1, 0

])⊗

(q2 � τ

[0, 1

]).

In the joint-state case — unlike in the copied state situation — there is atight connection between non-entwinedness, independence and covariance be-ing zero.

Theorem 2.10.6. For a joint state τ the following three statements are equiv-alent.

1 τ is non-entwined, i.e. is the product of its marginals;2 all observables qi ∈ Obs(Xi) are binary independent wrt. τ;3 the joint-state covariance JCov(τ, q1, q2) is zero, for all observables qi ∈

Obs(Xi) — or equivalently, all correlations JCor(τ, q1, q2) are zero.

Proof. Let joint state τ ∈ D(X1 × X2) be given. We write τi B πi � τ ∈ D(Xi)for its marginals.

(1) ⇒ (2). If τ is non-entwined, then τ = τ1 ⊗ τ2. Hence for all observ-ables q1 ∈ Obs(X1) and q2 ∈ Obs(X2) we have that σ B (q1 ⊗ q2) � τ

129

Page 142: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

130 Chapter 2. Predicates and Observables130 Chapter 2. Predicates and Observables130 Chapter 2. Predicates and Observables

is non-entwined. To see this, first note that πi � σ = qi � τi. Then, byLemma 1.8.2 (1):

(π1 � σ) ⊗ (π2 � σ) = (q1 � τ1) ⊗ (q2 � τ2)= (q1 ⊗ q2) � (τ1 ⊗ τ2) = (q1 ⊗ q2) � τ = σ.

(2) ⇒ (3). Let q1 ∈ Obs(X1) and q2 ∈ Obs(X2) be two observables. Wemay assume that q1, q2 are independent wrt. τ, that is, (q1 ⊗ q2) � τ = (q1 �

τ1) ⊗ (q2 � τ2) as in (2.15). We must prove JCov(τ, q1, q2) = 0. Consider,like in the proof of Proposition 2.10.3, the multiplication map &: R×R→ R,given by &(r, s) = r · s, as an observable on R × R. We consider its validity:

(q1 ⊗ q2) � τ |= & =∑

r,s((q1 ⊗ q2) � τ

)(r, s) · &(r, s)

=∑

r,sD(q1 × q2)(τ)(r, s) · r · s=

∑r,s

(∑(x,y)∈(q1×q2)−1(r,s) τ(x, y)

)· r · s

=∑

r,s∑

x∈q−11 (r),y∈q−1

2 (s) τ(x, y) · q1(x) · q2(y)=

∑x,y τ(x, y) · (q1 ⊗ q2)(x, y)

= τ |= q1 ⊗ q2.

In the same way one proves (q1 � τ1) ⊗ (q2 � τ2) |= & = (τ1 |= q1) ·(τ2 |= q2). But then we are done via the formulation of binary covariance fromLemma 2.9.9:

JCov(τ, q1, q2) =(τ |= q1 ⊗ q2

)− (τ1 |= q1) · (τ2 |= q2)

=((q1 ⊗ q2) � τ |= &

)−

((q1 � τ1) ⊗ (q2 � τ2) |= &

)=

((q1 ⊗ q2) � τ |= &

)−

((q1 ⊗ q2) � τ |= &

)= 0.

(3) ⇒ (1). Let joint-state covariance JCov(τ, q1, q2) be zero for all observ-ables q1, q2. In order to prove that τ is non-entwined, we have to show τ(x, y) =

τ1(x) · τ2(y) for all (x, y) ∈ X1 ×X2. We choose as random variables the observ-ables 1x and 1y and use again the formulation of covariance from Lemma 2.9.9.Then, since, by assumption, the binary covariance is zero, so that:

τ(x, y) = τ |= 1x ⊗ 1y = (τ1 |= 1x) · (τ2 |= 1y) = τ1(x) · τ2(y).

In essence this result says that joint-state independence and joint-state co-variance being zero are not properties of observables, but of joint states.

Exercises

2.10.1 Prove, in the setting of Definition 2.10.5 that the first marginal((q1 ⊗

q2) � τ)[

1, 0]

of the transformed state along q1 ⊗ q2 is equal to thetransformed marginal q1 � τ

[1, 0

].

130

Page 143: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3

Directed Graphical Models

In the previous two chapters we have seen several examples of graphs in prob-abilistic reasoning, see for instance Figures 1.3, 1.4 or 2.3. At this stage a moresystematic view will be developed in terms of so-called string diagrams. Thelatter are directed graphs that are built up inductively from boxes (as nodes)with wires (as edges) between them. These boxes can be interpreted as proba-bilistic channels. The wires are typed, where a type is a finite set that is asso-ciated with a wire. On a theoretical level we make a clear distinction betweensyntax and semantics, where graphs are introduced as syntactic objects, de-fined inductively, like terms in some formal language based on a signature orgrammar. Graphs can be given a semantics via an interpretion as suitable col-lections of channels. In practice, the distinction between syntax and semanticsis not so sharp: we shall frequently use graphs in interpreted form, essentiallyby describing a structured collection of channels as interpretation of a graph.This is like what happens in (mathematical) practice with algebraic notionslike group or monoid: one can precisely formulate a syntax for monoids withoperation symbols that are interpreted as actual functions. But one can also usemonoids as sets with operations. The latter form is often most convenient.

String diagrams are similar to the graphs used for Bayesian networks. A no-table difference is that they have explicit operations for copying and discarding.This makes them more expressive (and more useful). String diagrams are thelanguage of choice to reason about a large class of mathematical models, calledsymmetric monoidal categories, see [90], including categories of channels (ofpowerset, multiset or distribution type). Once we have seen string diagrams,we will describe Bayesian networks only in string diagrammatic form. Themost notable difference it that we then write copying as an explicit operation,as we have already started doing in Section 1.9. The language of string dia-grams also makes it easier to turn a Bayesian network into a joint distribution

131

Page 144: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

132 Chapter 3. Directed Graphical Models132 Chapter 3. Directed Graphical Models132 Chapter 3. Directed Graphical Models

(state). We shall use string diagrams also for other artifacts, like Markov chainsand hidden Markov models in Section 3.4.

One of the main themes in this chapter is how to move back and forth be-tween a joint state/distribution, say on A, B, and a channel A → B. The pro-cess of extracting a channel form a joint state is called disintegration. It isthe process of turning a joint distribution p(a, b) into a conditional distributionp(b | a). It gives direction to a probabilistic relationship. It is this disintegra-tion technique that allows us to represent a joint state, on multiple sets, via agraph structure with multiple channels, like in Bayesin networks, by perform-ing disintegration multiple times. More explicitly, it is via disintegration thatthe conditional probability tables of a Bayesian network — which, recall, cor-respond to channels — can be obtained from a joint distribution. The lattercan be presented as a (possibly large) empirical distribution, analogously tothe (small) medicine - blood pressure example in Subsections 1.4.1 and 1.4.5.

Via the correspondence between joint states and channels one can also revertchannels. Roughly it works as follows. First one turns a channel A→ B into ajoint state on A, B, then into a joint state on B, A by swapping, and then againinto a channel B → A. It amounts to turning a conditional probability p(b | a)into p(a | b), essentially by Bayes’ rule. This channel inversion is a fundamen-tal operation in Bayesian theory, which bears resemblance to inversion (adjointtranspose) for matrices in quantum theory. Since the superscript notation (−)†

is often used in such situations, a ‘dagger’ terminology has become common.It is used for instance in the notion of a ‘dagger’ category, see e.g. [16, 90]. Inline with this tradition we often talk about the dagger of a channel, instead ofabout its Bayesian inversion. This Bayesian inversion is an extremely usefuloperation, not only in classification, but also in inference. It allows us in Sub-section 3.7.1 to establish a tight connection between the direction of a channeland the direction of inference: backward inference along a channel c can beexpressed as forward inference along the inverted channel c†, and similarly,forward inference along c can be expressed as backward inference along c†.This fundamental result (Theorem 3.7.7) gives the channel-based frameworkinternal consistency.

For readers interested in the more categorical aspects of Bayesian inver-sion, Section 3.8 elaborates how a suitable category of channels has a well-behaved dagger functor. This category of channels is shown to be equivalentto a category of couplings. This mimics the categorical structure underlyingthe relationship between functions X → P(Y) and relations on X × Y . Thiscategorically-oriented section can be seen as an intermezzo that is not neces-sary for the remainder; it could be skipped.

Sections 3.9 and 3.10 return to (directed) graphical modelling. Section 3.9

132

Page 145: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.1. String diagrams 1333.1. String diagrams 1333.1. String diagrams 133

uses string diagrams to express intrinsic graphical structure, called shape, injoint probability distributions. This is commonly expressed in terms of inde-pendence and d-separation, but we prefer to stick at this stage to a purely string-diagrammatic account. String diagrams are also used in Section 3.10 to give aprecise account of Bayesian inference, including a (high-level) channel-basedinference algorithm.

The chapter concludes with an alternative form of backward probabilisticreasoning, called Jeffrey’s rule. The approach that we have used up to thispoint, namely backward inference along a channel, may be called Pearl’s rule,in contrast. It uses a predicate on the codomain of the channel as new informa-tion that is used for updating via predicate transformation. Thus, this approachis evidence-based. In contrast, Jeffrey’s rule is state-based: it uses a state on thecodomain of the channel and reasons backward via state transformation usingthe dagger of the channel. These two rules, of Pearl and of Jeffrey, give quitedifferent outcomes in the same situation. It is not well-understood which ruleshould be used in which situation. Our account focuses on a precise descriptionof these two rules in terms of channels.

3.1 String diagrams

In Chapter 1 we have seen how (probabilistic) channels can be composed se-quentially, via ◦· , and in parallel, via ⊗. These operations ◦· and ⊗ satisfy certainalgebraic laws, such as associativity of ◦· . Parallel composition ⊗ is also asso-ciative, in a suitable sense, but the equations involved are non-trivial to workwith. These equations are formalised via the notion of symmetric monoidal cat-egory that axiomatises the combination of sequential and parallel composition.

Instead of working with such somewhat complicated categories we shallwork with string diagrams. These diagrams form a graphical language that of-fers a more convenient and intuitive way of reasoning with sequential and par-allel composition. In particular, these diagrams allow obvious forms of topo-logical stretching and shifting that are more complicated, and less intuitive, tocapture via equations.

The main reference for string diagrams is [90]. Here we present a morefocused account, concentrating on what we need in a probabilistic setting. Inthis first section we describe string diagrams with directed edges between. Inthe next section we simply write undirected edges | instead of directed edges↑, with the convention that information always flows upward. But we have tokeep in mind that we do this for convenience only, and that the graphs involvedare actually directed.

133

Page 146: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

134 Chapter 3. Directed Graphical Models134 Chapter 3. Directed Graphical Models134 Chapter 3. Directed Graphical Models

String diagrams look a bit like Bayesian networks, but differ in subtle ways,esp. wrt. copying, discarding and joint states, see Subsection 3.1.1. The shiftthat we are making from Bayesian networks to string diagrams was alreadyprepared in Subsection 1.9.1.

Definition 3.1.1. The collection Type of types is the smallest set with:

1 1 ∈ Type, where 1 represents the trivial singleton type;2 F ∈ Type for any finite set F;3 if A, B ∈ Type, then also A × B ∈ Type.

In fact, the first point can easily be seen as a special case of the second one.However, we like to emphasise that there is a distinguished type that we write1. More generally, we write, as before, n = {0, 1, . . . , n − 1}, so that n ∈ Type.

The nodes of a string diagram are ‘boxes’ with input and output wires:

· · ·

· · ·A1 An

B1 Bm

(3.1)

Here the A1, . . . , An ∈ Type are the types of the input wires, and B1, . . . , Bm ∈

Type are the types of the output wires. We allow the cases n = 0 (no inputwires) and m = 0 (no output wires), which are written as:

· · ·B1 Bmand

· · ·A1 An

A diagram signature, or simply a signature, is a collection of boxes as in (3.1).The interpretation of such a box is a channel A1 × · · · × An → B1 × · · · × Bm.In particular, the interpretation of a box with no input channels is a distribu-tion/state. We like to make a careful distinction between syntax (boxes in adiagram) and semantics (their interpretation as channels).

Informally, a string diagram is a diagram that is built-up inductively, byconnecting wires of the same type between several boxes. These connectionsmay include copying and swapping, see below for details. A string diagraminvolves the requirement that it contains no cycles: it should be acyclic. Wewrite string diagrams with arrows flow going upwards, suggesting that there isa flow from bottom to top. This direction is purely a convention. Elsewhere inthe literature the flow may be from top to bottom, of from left to right.

More formally, given a signature Σ of boxes of the form (3.1), the set SD(Σ)of string diagrams over Σ is defined as the smallest set satisfying the pointsbelow. At the same time we define two ‘domain’ and ‘codomain’ functions

134

Page 147: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.1. String diagrams 1353.1. String diagrams 1353.1. String diagrams 135

from string diagrams to lists of types, as in:

SD(Σ)dom //

cod// L(Type)

Recall that L is the list functor.

(1) Σ ⊆ SD(Σ), that is, every box in f ∈ Σ in the signature as in (3.1) is a stringdiagram in itself, with dom( f ) = [A1, . . . , An] and cod ( f ) = [B1, . . . , Bm].

(2) For all A, B ∈ Type, the following diagrams are in SD(Σ).

(a) The identity diagram:

idA = A with

dom(idA) = [A]cod (idA) = [A].

(b) The swap diagram:

swapA,B =

A B

B Awith

dom(swapA,B) = [A, B]cod (swapA,B) = [B, A].

(c) The copy diagram:

copyA =A A

Awith

dom(copyA) = [A]cod (copyA) = [A, A].

(d) The discard diagram:

discardA = A with

dom(discardA) = [A]cod (discardA) = [].

(e) The uniform state diagram:

uniformA = A with

dom(uniformA) = []cod (uniformA) = [A].

(f) For each type A and element a ∈ A a point state diagram:

pointA(a) =a

Awith

dom(pointA(a)) = []cod (pointA(a)) = [A].

(g) Boxes for the logical operations conjunction, orthosupplement andscalar multiplication, of the form:

2

r

2 2

2

&

2

2 2

135

Page 148: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

136 Chapter 3. Directed Graphical Models136 Chapter 3. Directed Graphical Models136 Chapter 3. Directed Graphical Models

where r ∈ [0, 1]. These string diagrams have obvious domains andcodomains. We recall that predicates on A can be identified with chan-nels A→ 2, see Exercise 2.4.8.

(3) If S 1, S 2 ∈ SD(Σ) are string diagrams, then the parallel composition (jux-taposition) S 1 S 2 is also string diagram, with dom(S 1 S 2) = dom(S 1) ++

dom(S 2) and cod (S 1 S 2) = cod (S 1) ++ cod (S 2).

(4) If S 1, S 2 ∈ SD(Σ) and cod (S 1) ends with ~C = [C1, . . . ,Ck] whereasdom(S 2) starts with ~C, then there is a sequential composition string di-agram T of the form:

T =···

S 1

S 2···

···

···

···

~Cwith

dom(T )= dom(S 1) ++

(dom(S 2) − ~C

)cod (T )=

(cod (S 1) − ~C

)++ cod (S 2).

(3.2)

From this basic form, different forms of composition can be obtained byreordening wires via swapping.

In this definition of string diagrams all wires point upwards. This excludesthat string diagrams contain cycles: they are acyclic by construction. One couldadd more basic constructions to point (2), such as convex combination, seeExercise 3.2.3, but the above constructions suffices for now.

Below we redescribe the ‘wetness’ and ‘Asia’ Bayesian networks, from Fig-

136

Page 149: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.1. String diagrams 1373.1. String diagrams 1373.1. String diagrams 137

ures 1.4 and 2.3, as string diagrams.

sprinkler

wet grass

rain

winter

A

BC

D E

slippery road

smoking

asia

tublung

eitherbronc

dysp xray

S

B

L T

E

D X

A

(3.3)

What are the signatures for these string diagrams?

3.1.1 String diagrams versus Bayesian networks

One can ask: why are these strings diagrams any better than the diagrams thatwe have seen so far for Bayesian networks? An important point is that stringdiagrams make it possible to write a joint state, on an n-ary product type. Forinstance, for n = 2 we can have a write a joint state as:

(3.4)

In Bayesian networks joint states cannot be expressed. It is possible to write aninitial node with multiple outgoing wires, but as we have seen in Section 1.9,this should be interpreted as copy of a single outgoing wire, coming out of anon-joint state:

�� ��initial

__ ??

=•

gg 77

�� ��initial

OO

In string diagrams one can additionally express marginals via discarding :if a joint state ω is the interpretation of the above string diagram (3.4), then thediagram on the left below is interpreted as the marginal ω

[1, 0

], whereas the

137

Page 150: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

138 Chapter 3. Directed Graphical Models138 Chapter 3. Directed Graphical Models138 Chapter 3. Directed Graphical Models

diagram on the right is interpreted as ω[0, 1

].

We shall use this marginal notation with masks also for boxes in general — andnot just for states, without incoming wires. This is in line with Definition 1.8.7where marginalisation is defined both for states and for channels.

An additional advantage of string diagrams is that they can be used for equa-tional reasoning. This will be explained in the next section. First we give mean-ing to string diagrams.

3.1.2 Interpreting string diagrams as channels

String diagrams will be used as representations of (probabilistic) channels. Wethus consider string diagrams as a special, graphical term calculus (syntax).These diagrams are formal constructions, which are given mathematical mean-ing via their interpretation as channels.

This interpretation of string diagrams can be defined in a compositional way,following the structure of a diagram, using sequential ◦· and parallel ⊗ compo-sition of channels. An interpretation of a string diagram S ∈ SD(Σ) is a channelthat is written as [[ S ]]. It forms a channel from the product of the types in thedomain dom(S ) of the string diagram to the product of types in the codomaincod (S ). The product over the empty list is the singleton element 1 = {0}. Suchan interpretation [[ S ]] is parametrised by an interpretation of the boxes in thesignature Σ as channels.

Given such an interpretation of the boxes in Σ, the points below describethis interpretation function [[− ]] inductively, acting on the whole of SD(Σ), byprecisely following the above four points in the inductive definition of stringdiagrams.

(1) The interpretation of a box (3.1) in Σ is assumed. It is some channel oftype A1 × · · · × An → B1 × · · · × Bm.

(2) We use the following interpretation of primitive boxes.

(a) The interpretation of the identity wire of type A is the identity/unitchannel id : A→ A, given by point states: id(a) = 1|a〉.

(b) The swap string diagram is interpreted as the (switched) pair of pro-jections 〈π2, π1〉 : A × B→ B × A. It sends (a, b) to 1|b, a〉.

(c) The interpretation [[ copyA ]] of the copy string diagram is the copychannel ∆ : A→ A × A with ∆(a) = 1|a, a〉.

138

Page 151: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.1. String diagrams 1393.1. String diagrams 1393.1. String diagrams 139

(d) The discard channel of type A is interpreted as the unique channel! : A → 1. Using that D(1) � 1, this function sends every elementa ∈ A to the sole element 1|0〉 ofD(1) for 1 = {0}.

(e) The uniform string diagram of type A is interpreted as the uniformdistribution/state υA ∈ D(A), considered as channel 1 → A. Recall, ifA has n-elements, then υA =

∑a∈A

1n |a〉.

(f) The point string diagram pointA(a), for a ∈ A is interpreted as the pointstate 1|a〉 ∈ D(A).

(g) The ‘logical’ channels for conjunction, orthosupplement, and scalarmultiplication are interpreted by the corresponding channels conj, orthand scal(r) from Exercise 2.4.8.

(3) The tensor product ⊗ of channels is used to interpret juxtaposition of stringdiagrams [[ S 1 S 2 ]] = [[ S 1 ]] ⊗ [[ S 2 ]].

(4) If we have string diagrams S 1, S 2 in a composition T as in (3.2), then:

[[ T ]] =(id ⊗ [[ S 2 ]]

)◦· ([[ S 1 ]] ⊗ id),

where the identity channels have the appropriate product types.

At this stage we can say more precisely what a Bayesian network is, withinthe setting of string diagrams.

Definition 3.1.2. A Bayesian network is given by:

1 a string diagram G ∈ SD(Σ) over a finite signature Σ, with dom(G) = [];2 an interpretation of each box in Σ as a channel.

The string diagram G in the first point is commonly referred to as the (under-lying) graph of the Bayesian network. The channels that are used as interpre-tations are the conditional probability tables of the network.

Given the definition of the boxes in (the signature of) a Bayesian network,the whole network can be interpreted as a state/distribution [[ G ]]. This is ingeneral not the same thing as the joint distribution associated with the network,see Definition 3.3.1 (3) later on. In addition, for each node A in G, we can lookat the subgraph GA of cumulative parents of A. The interpretation [[ GA ]] isthen also a state, namely the one that one obtains via state transformation. Thishas been described as ‘prediction’ in Section 1.9.

The above definition of Bayesian network is more general than usual, forinstance because it allows the graph to contain joint states, like in (3.4), ordiscarding , see also the discussion in [50].

139

Page 152: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

140 Chapter 3. Directed Graphical Models140 Chapter 3. Directed Graphical Models140 Chapter 3. Directed Graphical Models

Exercises

3.1.1 For a state ω ∈ D(X1×X2×X3×X4×X5), describe its marginalisationω[1, 0, 1, 0, 0

]as string diagram.

3.1.2 Check in detail how the two string diagrams in (3.3) are obtained byfollowing the construction rules for string diagrams.

3.1.3 Consider the boxes in the wetness string diagram in (3.3) as a signa-ture. An interpretation of the elements in this signature is describedin Section 1.9 as states and channels wi , sp etc. Check that this in-terpretation extends to the whole wetness string diagram in (3.3) as achannel 1→ D × E given by:

(wg ⊗ sr) ◦· (id ⊗ ∆) ◦· (sp ⊗ ra) ◦· ∆ ◦· wi .

3.1.4 Extend the interpretation from Section 2.7 of the signature of the Asiastring diagram in (3.3) to a similar interpretation of the whole Asiastring diagram.

3.1.5 For A = {a, b, c} check that:[[A

]]= 1

3 |a, a〉 +13 |b, b〉 +

13 |c, c〉.

3.1.6 Check that there are equalities of interpretations:

[[ A B ]] = [[ A×B ]] [[ A B ]] = [[ A×B ]]

3.1.7 Verify that for each box, interpreted as a channel, one has:[[A

B ]]=

[[A

]].

3.1.8 Recall Exercise 2.4.4 and check that the interpretation of

f

g

2

is [[ f ]] |= [[ g ]].

Similarly, check that the interpretation of

f

g

2

h

is[[ g ]] � [[ f ]] |= [[ h ]]= [[ f ]] |= [[ g ]] � [[ h ]],

140

Page 153: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.2. Equations for string diagrams 1413.2. Equations for string diagrams 1413.2. Equations for string diagrams 141

see Proposition 2.4.3.3.1.9 Notice that the difference between ω |= p1 ⊗ p2 and σ |= q1 & q2 can

be clarified graphically as:

&

A B

2 2

2

versus

&

A

2 2

2

See also Exercise 2.4.5.

3.2 Equations for string diagrams

This section introduces equations S 1 = S 2 between string diagrams S 1, S 2 ∈

SD(Σ) over the same signature Σ. These equation are ‘sound’, in the sense thatthey are respected by the string diagram interpretation of Subsection 3.1.2:S 1 = S 2 implies [[ S 1 ]] = [[ S 2 ]]. From now on we simplify the writing ofstring diagrams in the following way.

Convention 3.2.1. 1 We will write the arrows ↑ on the wires of string dia-grams simply as | and assume that the flow is allways upwards in a stringdiagram, from bottom to top. Thus, eventhough we are not writing edges asarrows, we still consider string diagrams as directed graphs.

2 We drop the types of wires when they are clear from the context. Thus, wiresin string diagrams are always typed, but we do not always write these typesexplicitly.

3 We become a bit sloppy about writing the interpretation function [[− ]] forstring diagrams explicitly. Sometimes we’ll say “the interpretation of S ”instead of just [[ S ]], but also we sometimes simply write a string diagram,where we mean its interpretation as a channel. This can be confusing at first,so we will make it explicit when we start blurring the distinction betweensyntax and semantics.

Shift equationsString diagrams allow quite a bit of topological freedom, for instance, in thesense that parallel boxes may be shifted up and down, as expressed by the

141

Page 154: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

142 Chapter 3. Directed Graphical Models142 Chapter 3. Directed Graphical Models142 Chapter 3. Directed Graphical Models

following equations.

f

g f

g== f g

Semantically, this is justified by the equation:([[ f ]] ⊗ id

)◦·(id ⊗ [[ g ]]

)=

([[ f ]] ⊗ [[ g ]]

)=

(id ⊗ [[ g ]]

)◦·([[ f ]] ⊗ id

).

Similarly, boxes may be pulled trough swaps:

=

f

f

=

g

g

These four equations do not only hold for boxes, but also for and , sincewe consider them simply as boxes but with a special notation.

Equations for wiresWires can be stretched arbitrarily in length, and successive swaps cancel eachother out:

= =

Wires of product type correspond to parallel wires, and wires of the emptyproduct type 1 do nothing — as represented by the empty dashed box.

=A × B A B =1

Equations for copyingThe copy string diagram also satisfies a number of equations, expressingassociativity, commutativity and a co-unit property. Abstractly this makes acomonoid. This can be expressed graphically by:

= = = (3.5)

142

Page 155: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.2. Equations for string diagrams 1433.2. Equations for string diagrams 1433.2. Equations for string diagrams 143

Because copying is associative, it does not matter which order of copying weuse in successive copying. Therefor we feel free to write such n-ary copyingas coming from a single node •, in:

· · ·

and ∆n B

· · · (3.6)

Copying of products A × B and of the empty product 1 comes with thefollowing equations.

A × B=

A B 1=

We should be careful that boxes can in general not be “pulled through”copiers, as expressed on the left below (see also Exercise 1.8.8).

,f

f fbut =

a a a

An exception to this is when copying follows a point state 1|a〉 for a ∈ A, asabove, on the right.

Equations for discarding and for uniform and point statesThe most important discarding equation is the one immediately below, say-ing that discarding after a box/channel is the same as discarding of the inputaltogether. We refer to this as: boxes are unital1.

· · ·

· · ·A1 An

=A1 An

· · · in particular· · ·

=

Discarding only some (not all) of the outgoing wires amounts to marginalisa-tion.

Discarding a uniform state also erases everything. This may be seen as aspecial case of the above equation on the right, but still we like to formulate itexplicitly, below on the left. The other equations deal with trivial cases.

=A = =11

1 In a quantum setting one uses ‘causal’ instead of ‘unital’, see e.g. [16].

143

Page 156: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

144 Chapter 3. Directed Graphical Models144 Chapter 3. Directed Graphical Models144 Chapter 3. Directed Graphical Models

For product wires we have the following discard equations.

= = A BA × B A B

Similarly we have the following uniform state equations.

= = A BA × B A B

For point states we have for elements a ∈ A and b ∈ B,

=(a, b) a b

Equations for logical operationsAs is well-known, conjunction & is associative and commutative:

&

&

&

&= =

& &

Moreover, conjunction has truth and falsity as unit and zero elements:

&=

1

&=

0

0

Orthosupplement (−)⊥ is idempotent and turns 0 to 1 (and vice versa):

=

=0

1

Finally we look at the rules for scalar multiplication for predicates:

=

r

s

r · s =1 =00

&=

r &

r

144

Page 157: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.2. Equations for string diagrams 1453.2. Equations for string diagrams 1453.2. Equations for string diagrams 145

These equations be used for diagrammatic equational reasoning. For in-stance we can now derive a variation on the last equation, where scalar multi-plication on the second input of conjunction can be done after the conjunction:

&

& &

r= = =

r

r &

r

=

r

&

All of the above equations between string diagrams are sound, in the sensethat they hold under all interpretations of string diagrams — as described atthe end of the previous section. There are also completeness results in thisarea, but they require a deeper categorical analysis. For a systematic accountof string diagrams we refer to the overview paper [90]. In this book we usestring diagrams as convenient, intuitive notation with a precise semantics.

Exercises

3.2.1 Give a concrete description of the n-ary copy channel ∆n : A → An

in (3.6).3.2.2 Prove by diagrammatic equational reasoning that:

&=

r &

r · s

s

3.2.3 Consider for r ∈ [0, 1] the convex combination channel cc(r) : A ×A → A given by cc(r)(a, a′) = r|a〉 + (1 − r)|a′ 〉. It can be used asinterpretation of an additional convex combination box:

r

A A

A

Prove that it satisfies the following ‘barycentric’ equations, due to [96].

1 r

r rs= = 1 − r

s r(1−s)1−rs

=

145

Page 158: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

146 Chapter 3. Directed Graphical Models146 Chapter 3. Directed Graphical Models146 Chapter 3. Directed Graphical Models

For the last equation we must assume that rs , 1, i.e. not r = s = 1.

3.3 Accessibility and joint states

Intuitively, a string diagram is called accessible if all its internal connectionsare also accessible externally. This property will be defined formally below,but it is best illustrated via an example. Below on the left we see the wetnessBayesian network as string diagram, like in (3.3) — but with lines insteadof arrows. On the right is an accessible version of this string diagram, with(single) external connections to all its internal wires.

sprinkler

wet grass

rain

winter

A

BC

D E

slippery road

sprinkler

wet grass

rain

winter

A B CD E

slippery road

(3.7)

We are interested in such accessible versions because they lead in an easy wayto joint states: the joint state for the wetness Bayesian network is the interpre-tation of the above accessible string diagram on the right, giving a distributionon the product set A× B×D×C × E. Such joint distributions are conceptuallyimportant, but their size quickly grows out of hand. For instance, the joint dis-tribution ω for the wetness network is (with zero-probability terms omitted):

0.06384|a, b, d, c, e〉 + 0.02736|a, b, d, c, e⊥ 〉 + 0.0216|a, b, d, c⊥, e⊥ 〉+ 0.00336|a, b, d⊥, c, e〉 + 0.00144|a, b, d⊥, c, e⊥ 〉 + 0.0024|a, b, d⊥, c⊥, e⊥ 〉+ 0.215|a, b⊥, d, c, e〉 + 0.09216|a, b⊥, d, c, e⊥ 〉 + 0.05376|a, b⊥, d⊥, c, e〉+ 0.02304|a, b⊥, d⊥, c, e⊥ 〉 + 0.096|a, b⊥, d⊥, c⊥, e⊥ 〉 + 0.01995|a⊥, b, d, c, e〉+ 0.00855|a⊥, b, d, c, e⊥ 〉 + 0.243|a⊥, b, d, c⊥, e⊥ 〉 + 0.00105|a⊥, b, d⊥, c, e〉+ 0.00045|a⊥, b, d⊥, c, e⊥ 〉 + 0.027|a⊥, b, d⊥, c⊥, e⊥ 〉 + 0.0056|a⊥, b⊥, d, c, e〉+ 0.0024|a⊥, b⊥, d, c, e⊥ 〉 + 0.0014|a⊥, b⊥, d⊥, c, e〉 + 0.0006|a⊥, b⊥, d⊥, c, e⊥ 〉+ 0.09|a⊥, b⊥, d⊥, c⊥, e⊥ 〉.

This distribution is obtained as outcome of the interpreted accessible graph

146

Page 159: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.3. Accessibility and joint states 1473.3. Accessibility and joint states 1473.3. Accessibility and joint states 147

in (3.7):

ω =(id ⊗ id ⊗ wg ⊗ id ⊗ sr

)◦·(id ⊗ ∆2 ⊗ ∆3

)◦·(id ⊗ sp ⊗ sr

)◦· ∆3 ◦· wi .

Recall that ∆n is written for the n-ary copy channel, see (3.6).Later on in this section we use our new insignts into accessible string dia-

grams to re-examine the relation between crossover inference on joint statesand channel-based inference, see Corollary 2.5.9. But we start with a moreprecise description.

Definition 3.3.1. 1 An accessible string diagram is defined inductively viathe string diagram definitions steps (1) – (3) from Section 3.1, with sequen-tial composition diagram (3.2) in step (4) extended with copiers on the in-ternal wires:

···

S 1

S 2···

···

···

···

···

(3.8)

In this way each ‘internal’ wire between string diagrams S 1 and S 2 is ‘ex-ternally’ accessible.

2 Each string diagram S can be made accessible by replacing each occur-rence (3.2) of step (4) in its construction with the above modified step (3.8).We shall write S for a choice of accessible version of S .

3 A joint state or joint distribution associated with a Bayesian network G ∈SD(Σ) is the interpretation [[ G ]] of an accessible version G ∈ SD(Σ) of thestring diagram G. It re-uses G’s interpretation of its boxes.

There are several subtle points related to the last two points.

Remark 3.3.2. 1 The above definition of accessible string diagram allowsthat there are multiple external connections to an internal wire, for instanceif one of the internal wires between S 1 and S 2 in (3.8) is made accessibleinside S 2, for instance because it is one of the outgoing wires of S 2. Suchmultiple accessibility is unproblematic, but in most cases we are happy withaccessibility via a single wire, as in (3.7) on the right. Anyway, we can al-ways purge unnecessary outgoing wires via discard .

147

Page 160: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

148 Chapter 3. Directed Graphical Models148 Chapter 3. Directed Graphical Models148 Chapter 3. Directed Graphical Models

2 The order of the outgoing wires in an accessible graph involves some levelof arbitrairiness. For instance, in the accessible wetness graph in (3.7) onecould also imagine putting C to the left of D at the top of the diagram.This would involve some additional crossing of wires. The result would beessentially the same. The resulting joint state, obtained by interpretation,would then be in D(A × B × C × D × E). These differences are immaterial,but do require some care if one performs marginalisation.

Thus, strictly speaking, the accessible version S of a string diagram Sis determined only up to permutation of outgoing wires. In concrete caseswe try to write accessible string diagrams in such a way that the number ofcrossings of wires is minimal.

With these points in mind we allow ourselves to speak about the joint stateassociated with a Bayesian network, in which duplicate outgoing wires aredeleted. This joint state is determined up-to permutation of outgoing wires.

In the crossover inference corollary 2.5.9 we have seen how component-wiseupdates on a joint state correspond to channel-based inference. This theoreminvolved a binary state and only one channel. Now that we have a better handleon joint states via string diagrams, we can extend this correspondence to n-aryjoint states and multiple channels. This is a theme that will be pursued furtherin this chapter. At this stage we only illustrate how this works for the wetnessBayesian network.

Example 3.3.3. In Exercise 2.7.1 we have formulated two inference questionsfor the wetness Bayesian network, namely:

1 What is the updated sprinkler distribution, given evidence of a slippery road?Using channel-based inference it is computed as:

sp �(wi

∣∣∣ra�(sr�1e)

)= 63

260 |b〉 +197260 |b

⊥ 〉.

2 What is the updated wet grass distribution, given evidence of a slipperyroad? It’s:

wg �((

(sp ⊗ ra) � (∆ � wi))∣∣∣

1⊗(sr�1e)

)= 4349

5200 |d 〉 +8515200 |d

⊥ 〉.

Via (multiple applications) of crossover inference, see Corollary 2.5.9, thesesame updated probabilities can be obtained via the joint state ω ∈ D(A × B ×D×C × E) of the wetness Bayesian network, as mentioned at the beginning ofthis section.

Concretely, this works as follows. In the above first case we have (point)evidence 1e : E → [0, 1] on the set E = {e, e⊥}. It is weakened (extended) toevidence 1⊗1⊗1⊗1⊗1e on the product A×B×D×C×E. Then it can be used to

148

Page 161: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.3. Accessibility and joint states 1493.3. Accessibility and joint states 1493.3. Accessibility and joint states 149

update the joint state ω. The required outcome then appears by marginalisingon the sprinkler set B in second position, as in:

ω∣∣∣1⊗1⊗1⊗1⊗1e

[0, 1, 0, 0, 0

]= 63

260 |b〉 +197260 |b

⊥ 〉

≈ 0.2423|b〉 + 0.7577|b⊥ 〉.

For the second inference question we use the same evidence, but we nowmarginalise on the wet grass set D, in third position:

ω∣∣∣1⊗1⊗1⊗1⊗1e

[0, 0, 1, 0, 0

]= 4349

5200 |d 〉 +8515200 |d

⊥ 〉

≈ 0.8363|d 〉 + 0.1637|d⊥ 〉.

We see that crossover inference is easier to express than channel-based infer-ence, since we do not have to form the more complicated expressions with�and� in the above two points. Instead, we just have to update and marginaliseat the right positions. Thus, it is easier from a mathematical perspective. Butfrom a computational perspective crossover inference is more complicated,since these joint states grow exponentially in size (in the number of nodes inthe graph). This topic will be continued in Section 3.10.

Exercises

3.3.1 Draw an accessible verion of the Asia string diagram in (3.3). Writealso the corresponding composition of channels that produces the as-sociated joint state.

149

Page 162: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

150 Chapter 3. Directed Graphical Models150 Chapter 3. Directed Graphical Models150 Chapter 3. Directed Graphical Models

3.4 Hidden Markov models

Let’s start with a graphical description of an example of what is called a hiddenMarkov model:

Sunny

Cloudy

Rainy

0.8

0.5

0.150.2

0.20.2

0.3

0.05

0.6

Stay-in

Go-out

0.5

0.5

0.2

0.8

0.1

0.9(3.9)

This model has three ‘hidden’ elements, namely Cloudy, Sunny, and Rainy,representing the weather condition on a particular day. There are ‘temporal’transitions with associated probabilities between these elements, as indicatedby the labeled arrows. For instance, if it is cloudy today, then there is a 50%chance that it will be cloudy again tomorrow. There are also two ‘visible’ el-ements on the right: Stay-in, and Stay-out, describing two possible actions ofa person, depending on the weather condition. There are transitions with prob-abilities from the hidden to the visible elements. The idea is that with everytime step a transition is made between hidden elements, resulting in a visibleoutcome. Such steps may be repeated for a finite number of times — or evenforever. The interaction between what is hidden and what can be observed isa key element of hidden Markov models. For instance, one may ask: given acertain initial state, how likely is it to see a consecutive sequence of the fourvisible elements: Stay-in, Stay-in, Go-out, Stay-in?

Hidden Markov models are simple statistical models that have many appli-cations in temporal pattern recognition, in speech, handwriting or gestures, butalso in robotics and in biological sequences. This section will briefly look intohidden Markov models, using the notation and terminology of channels. In-deed, a hidden Markov model can be defined easily in terms of channels andforward and backward transformation of states and observables. In addition,conditioning of states by observables can be used to formulate and answer ele-mentary questions about hidden Markov models. Learning for Markov modelswill be described separately in Sections 4.5 and 4.6.

150

Page 163: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.4. Hidden Markov models 1513.4. Hidden Markov models 1513.4. Hidden Markov models 151

Definition 3.4.1. A Markov model (or a Markov chain) is given by a set X of‘internal positions’, typically finite, with a ‘transition’ channel t : X → X andan initial state/distribution σ ∈ D(X).

A hidden Markov model, often abbreviated as HMM, is a Markov model, asjust described, with an additional ‘emission’ channel e : X → Y , where Y is aset of ‘outputs’.

In the above illustration (3.9) we have as sets of positions and outputs:

X ={Cloudy,Sunny,Rainy

}Y =

{Stay-in,Go-out

},

with transition channel t : X → X,

t(Cloudy) = 0.5∣∣∣Cloudy

⟩+ 0.2

∣∣∣Sunny⟩

+ 0.3∣∣∣Rainy

⟩t(Sunny) = 0.15

∣∣∣Cloudy⟩

+ 0.8∣∣∣Sunny

⟩+ 0.05

∣∣∣Rainy⟩

t(Rainy) = 0.2∣∣∣Cloudy

⟩+ 0.2

∣∣∣Sunny⟩

+ 0.6∣∣∣Rainy

⟩and emission channel e : X → Y ,

e(Cloudy) = 0.5∣∣∣Stay-in

⟩+ 0.5

∣∣∣Go-out⟩

e(Sunny) = 0.2∣∣∣Stay-in

⟩+ 0.8

∣∣∣Go-out⟩

e(Rainy) = 0.9∣∣∣Stay-in

⟩+ 0.1

∣∣∣Go-out⟩

An initial state is missing in the picture (3.9).In the literature on Markov models, the elements of the set X are often called

states. This clashes with the terminology in this book, since we use ‘state’ assynomym for ‘distribution’. So, here we call σ ∈ D(X) an (initial) state, andwe call elements of X (internal) positions. At the same time we may call X thesample space. The transition channel t : X → X is an endo-channel on X, thatis a channel from X to itself. As a function, it is of the form t : X → D(X); it isan instance of a coalgebra, that is, a map of the form A → F(A) for a functorF, see [93] or [42] for more information.

In a Markov chain/model one can iteratitively compute successor states. Foran initial state σ and transition channel t one can form successor states via statetransformation:

σ

t � σ

t � (t � σ) = (t ◦· t) � σ...

tn � σ

where tn =

unit if n = 0

t ◦· tn−1 if n > 0.

In these transitions the state at stage n + 1 only depends on the state at stage n:

151

Page 164: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

152 Chapter 3. Directed Graphical Models152 Chapter 3. Directed Graphical Models152 Chapter 3. Directed Graphical Models

in order to predict a future step, all we need is the immediate predecessor state.This makes HMMs relatively easy dynamical models. Multi-stage dependen-cies can be handled as well, by enlarging the sample space, see Exercise 3.4.6below.

One interesting problem in the area of Markov chains is to find a ‘stationary’state σ∞ with t � σ∞ = σ∞, see Exercise 3.4.2.

Here we are more interested in hidden Markov models 1σ→ X

t→ X

e→Y . The

elements of the set Y are observable — and hence sometimes called signals —whereas the elements of X are hidden. Thus, many questions related to hiddenMarkov models concentrate on what one can learn about X via Y , in a finitenumber of steps. Hidden Markov models are examples of models with latentvariables.

We briefly discuss some basic issues related to HMMs in separate subsec-tions. A recurring theme is the relationship between constructions involving ajoint state and sequential constructions.

3.4.1 Validity in hidden Markov models

The first question that we like to address is: given a sequence of observables,what is their probability (validity) in a HMM? Standardly in the literature, oneonly looks at the probability of a sequence of point observations (elements),but here we use a more general approach. After all, one may not be certainabout observing a specific point at a particular state, or some point observationsmay be missing; in the latter case one may wish to replace them by a constant(uniform) observation.

We proceed by defining validity of the sequence of observables in a jointstate first; subsequently we look at (standard) algorithms for computing thesevalidities more efficiently. We thus start by defining a relevant joint state.

We fix a HMM 1σ→ X

t→ X

e→Y . For each n ∈ N a channel 〈e, t〉n : X →

Yn × X is defined in the following manner:

〈e, t〉0 B(X ◦

id// X � 1 × X � Y0 × X

)〈e, t〉n+1 B

(X ◦〈e,t〉n // Yn × X ◦

idn⊗〈e,t〉// Yn × (Y × X) � Yn+1 × X

) (3.10)

We recall that the tuple 〈e, t〉 of channels is (e⊗ t)◦· ∆, see Definition 1.8.10 (1).With these tuples we can form a joint state 〈e, t〉n � σ ∈ D(Yn × X). As an

152

Page 165: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.4. Hidden Markov models 1533.4. Hidden Markov models 1533.4. Hidden Markov models 153

(interpreted) string diagram it looks as follows.

σ

t

...

tee · · ·e

Y Y Y X

(3.11)

We consider the combined likelihood of a sequence of observables on the setY in a hidden Markov model. In the literature these observables are typicallypoint predicates 1y : Y → [0, 1], for y ∈ Y , but, as mentioned, here we allowmore general observables Y → R.

Definition 3.4.2. Let H =(1

σ→ X

t→ X

e→Y

)be a hidden Markov model and

let ~p = p1, . . . , pn be a list of observables on Y . The validity H |= ~p of thissequence ~p in the modelH is defined via the tuples (3.10) as:

H |= ~p B(〈e, t〉n � σ

)[1, . . . , 1, 0

]|= p1 ⊗ · · · ⊗ pn

(2.4)= 〈e, t〉n � σ |= p1 ⊗ · · · ⊗ pn ⊗ 1

(2.6)= σ |= 〈e, t〉n �

(p1 ⊗ · · · ⊗ pn ⊗ 1

).

(3.12)

The marginalisation mask [1, . . . , 1, 0] contains n times the number 1. It en-sures that the X outcome in (3.11) is discarded.

We describe an alternative way to formulate this validity without using the(big) joint state on Yn × X. It forms the essence of the classical ‘forward’ and‘backward’ algorithms for validity in HMMs, see e.g. [87] or [58, App. A]. Analternative algorithm is described in Exercise 3.4.4.

Proposition 3.4.3. The HMM-validity (3.12) can be computed as:

H |= ~p = σ |= (e � p1) &t �

((e � p2) &t �

((e � p3) & · · ·

t � (e � pn) · · ·)).

(3.13)

This validity can be calculated recursively in forward manner as:

σ |= α(~p)

where

α([q]

)= e � q

α([q] ++ ~q

)= (e � q) & (t � α

(~q)).

153

Page 166: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

154 Chapter 3. Directed Graphical Models154 Chapter 3. Directed Graphical Models154 Chapter 3. Directed Graphical Models

Alternatively, this validity can be calculated recursively in backward manneras:

σ |= β(~p, 1

)where

β([q]

)= q

β(~q ++ [qn, qn+1]

)= β

(~q ++ [(e � qn) & (t � qn+1)]

).

Proof. We first prove, by induction on n ≥ 1 that for observables pi on Y andq on X one has:

〈e, t〉n � (p1 ⊗ · · · ⊗ pn ⊗ q)= (e � p1) & t � ((e � p2) & t � (· · · t � ((e � pn) & t � q) · · · ))

(∗)

The base case n = 1 is easy:

〈e, t〉1 � (p1 ⊗ q) = ∆ �((e ⊗ t) � (p1 ⊗ q)

)= ∆ �

((e � p1) ⊗ (t � q)

)by Exercise 2.4.6

= (e � p1) & (t � q) by Exercise 2.4.5.

For the induction step we reason as follows.

〈e, t〉n+1 �(p1 ⊗ · · · ⊗ pn ⊗ pn+1 ⊗ q

)= 〈e, t〉n �

((idn ⊗ 〈t, e〉) �

(p1 ⊗ · · · ⊗ pn ⊗ pn+1 ⊗ q

))= 〈e, t〉n �

(p1 ⊗ · · · ⊗ pn ⊗ (〈e, t〉 � (pn+1 ⊗ q)

)= 〈e, t〉n �

(p1 ⊗ · · · ⊗ pn ⊗ ((e � pn+1) & (t � q))

)as just shown

(IH)= (e � p1) & t � ((e � p2) & t � (· · ·

t � ((e � pn) & t � ((e � pn+1) & (t � q))) · · · )).

We can now prove Equation (3.13):

H |= ~p(3.12)= σ |= 〈e, t〉n �

(p1 ⊗ · · · ⊗ pn ⊗ 1

)(∗)= (e � p1) & t � ((e � p2) & t � (· · · t � ((e � pn) & t � 1) · · · ))= (e � p1) & t � ((e � p2) & t � (· · · t � (e � pn) · · · )).

3.4.2 Filtering

Given a sequence ~p of factors, one can compute their validity H |= ~p in aHMM H , as described above. But we can also use these factors to ‘guide’the evolution of the HMM. At each state i the factor pi is used to update thecurrent state, via backward inference. The new state is then moved forward viathe transition function. This process is called filtering, after the Kalman filterfrom the 1960s that is used for instance in trajectory optimisation in navigationand in rocket control (e.g. for the Apollo program). The system can evolve

154

Page 167: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.4. Hidden Markov models 1553.4. Hidden Markov models 1553.4. Hidden Markov models 155

autonomously via its transition function, but observations at regular intervalscan update (correct) the current state.

Definition 3.4.4. Let H =(1

σ→ X

t→ X

e→Y

)be a hidden Markov model and

let ~p = p1, . . . , pn be a list of factors on Y . It gives rise to the filtered sequenceof states σ1, σ1, . . . , σn+1 ∈ D(X) following the observe-update-proceed prin-ciple:

σ1 B σ and σi+1 B t �(σi

∣∣∣e�pi

).

In the terminology of Definition 2.5.1, the definition of the state σi+1 in-volves both forward and backward inference. Below we show that the finalstate σn+1 in the filtered sequence can also be obtained via crossover inferenceon a joint state, obtained via the tuple channels (3.10). This fact gives a theoret-ical justification, but is of little practical relevance — since joint states quicklybecome too big to handle.

Proposition 3.4.5. In the context of Definition 3.4.4,

σn+1 =(〈e, t〉n � σ

)∣∣∣p1⊗ ··· ⊗pn⊗1

[0, . . . , 0, 1

]

The marginalisation mask [0, . . . , 0, 1] has n zero’s.

Proof. By induction on n ≥ 1. The base case with 〈e, t〉1 = 〈e, t〉 is handled asfollows.

(〈e, t〉 � σ

)∣∣∣p1⊗1

[0, 1

]= π2 �

(〈e|p1 , t〉 �

(σ|〈e,t〉�(p1⊗1)

))by Corollary 2.5.8 (2)

= t �(σ|e�p1

)= σ2.

155

Page 168: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

156 Chapter 3. Directed Graphical Models156 Chapter 3. Directed Graphical Models156 Chapter 3. Directed Graphical Models

The induction step requires a bit more work:(〈e, t〉n+1 � σ

)∣∣∣p1⊗ ··· ⊗pn⊗pn+1⊗1

[0, . . . , 0, 0, 1

]= πn+2 �

(((idn ⊗ 〈e, t〉) � (〈e, t〉n � σ)

)∣∣∣p1⊗ ··· ⊗pn⊗pn+1⊗1

)(2.7)= πn+2 �

((idn ⊗ 〈e, t〉)

∣∣∣p1⊗ ··· ⊗pn⊗pn+1⊗1

�(〈e, t〉n � σ

)∣∣∣(idn⊗〈e,t〉)�(p1⊗ ··· ⊗pn⊗pn+1⊗1)

)=

(πn+2 � (idn ⊗ 〈e|pn+1,t〉)

)�

((〈e, t〉n � σ

)∣∣∣p1⊗ ··· ⊗pn⊗((e�pn+1)⊗(t�1))

)=

(t ◦· πn+1

)�

((〈e, t〉n � σ

)∣∣∣p1⊗ ··· ⊗pn⊗(e�pn+1)

)= t �

(((〈e, t〉n � σ)

∣∣∣p1⊗ ··· ⊗pn⊗1

∣∣∣1⊗ ··· ⊗1⊗(e�pn+1)

)[0, . . . , 0, 1

])by Lemma 2.3.5 (3) and Exercise 2.2.4

= t �((

(〈e, t〉n � σ)∣∣∣p1⊗ ··· ⊗pn⊗1

)[0, . . . , 0, 1

]∣∣∣e�pn+1

)by Lemma 2.3.5 (6)

(IH)= t �

(σn+1

∣∣∣e�pn+1

)= σn+2.

The next consequence of the previous proposition may be understood asBayes’ rule for HMMs — or more accurately, the product rule for HMMs, seeProposition 2.3.3.

Corollary 3.4.6. Still in the context of Definition 3.4.4, let q be a predicate onX. Its validity in the final state in the filtered sequence is given by:

σn+1 |= q =〈e, t〉n � σ |= p1 ⊗ · · · ⊗ pn ⊗ q

H |= ~p.

Proof. This follows from Bayes’ rule, in Proposition 2.3.3 (1):

σn+1 |= q =(〈e, t〉n � σ

)∣∣∣p1⊗···⊗pn⊗1

[0, . . . , 0, 1

]|= q

=(〈e, t〉n � σ

)∣∣∣p1⊗···⊗pn⊗1 |= 1 ⊗ · · · ⊗ 1 ⊗ q by (2.4)

=

(〈e, t〉n � σ

)|=

(p1 ⊗ · · · ⊗ pn ⊗ 1

)&

(1 ⊗ · · · ⊗ 1 ⊗ q

)(〈e, t〉n � σ

)|= p1 ⊗ · · · ⊗ pn ⊗ 1

=

(〈e, t〉n � σ

)|= p1 ⊗ · · · ⊗ pn ⊗ qH |= ~p

.

The last equation uses Exercise 2.2.4 and Definition 3.4.2.

3.4.3 Finding the most likely sequence of hidden elements

We briefly consider one more question for HMMs: given a sequence of obser-vations, what is the most likely path of internal positions that produces these

156

Page 169: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.4. Hidden Markov models 1573.4. Hidden Markov models 1573.4. Hidden Markov models 157

observations? This is also known as the decoding problem. It explicitly asksfor the most likely path, and not for the most likely individual position at eachstage. Thus it involves taking the argmax of a joint state. We concentrate ongiving a definition of the solution, and only sketch how to obtain it efficiently.

We start by constructing the state below, as intepreted string diagram, for agiven HMM 1

σ→ X

t→ X

e→Y .

σ

t

...

t e e· · · e

YYYX

· · ·

XXX

(3.14)

We proceed, much like in the beginning of this section, this time not usingtuples but triples. We define the above state as element of the subset:

〈id, t, e〉n � σ for X〈id,t,e〉n // Xn × X × Yn

These maps 〈id, t, e〉n are obtained by induction on n:

〈id, t, e〉0 B(X ◦

id// X � 1 × X × 1 � X0 × X × Y0

)〈id, t, e〉n+1 B

(X ◦〈id,t,e〉n// Xn × X × Yn ◦

idn⊗〈id,t,e〉⊗idn

// Xn × (X × X × Y) × Yn

o

Xn+1 × X × Yn+1)

Definition 3.4.7. Let 1σ→ X

t→ X

e→Y be a HMM and let ~p = p1, . . . , pn be a

sequence of factors on Y . Given these factors as successive observations, themost likely path of elements of the sample space X, as an n-tuple in Xn, isobtained as:

argmax((〈id, t, e〉n � σ

)∣∣∣1⊗ ··· ⊗1⊗1⊗pn⊗ ··· ⊗p1

[1, . . . , 1, 0, 0, . . . , 0

]). (3.15)

This expression involves updating the joint state (3.14) with the factors pi,in reversed order, and then taking its marginal so that the state with the first noutcomes in X remains. The argmax of the latter state gives the sequence in Xn

that we are after.The expression (3.15) is important conceptually, but not computationally.

157

Page 170: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

158 Chapter 3. Directed Graphical Models158 Chapter 3. Directed Graphical Models158 Chapter 3. Directed Graphical Models

It is inefficient to compute, first of all because it involves a joint state thatgrows exponentionally in n, and secondly because the conditioning involvesnormalisation that is irrelevant when we take the argmax.

We given an impression of what (3.15) amounts to for n = 3. We first focuson the expression within argmax

(−

). The marginalisation and conditioning

amount produce an state on X × X × X of the form:∑x,y1,y2,y3

(〈id, t, e〉3 � σ)(x1, x2, x3, x, y3, y2, y1) · p3(y3) · p2(y2) · p1(y1)〈id, t, e〉3 � σ |= 1 ⊗ 1 ⊗ 1 ⊗ 1 ⊗ p3 ⊗ p2 ⊗ p1

∼∑

x,y1,y2,y3

σ(x1) · t(x1)(x2) · t(x2)(x3) · t(x3)(x)

· e(x3)(y3) · e(x2)(y2) · e(x1)(y1) · p3(y3) · p2(y2) · p1(y1)= σ(x1) ·

(∑y1

e(x1)(y1) · p1(y1))· t(x1)(x2) ·

(∑y2

e(x2)(y2) · p2(y2))

· t(x2)(x3) ·(∑

y3e(x3)(y3) · p3(y3)

)·(∑

x t(x3)(x))

= σ(x1) · (e � p1)(x1) · t(x1)(x2) · (e � p2)(x2) · t(x2)(x3) · (e � p3)(x3).

The argmax of the latter expression can then be computed as in Exercise 1.4.7.This approach is known as the Viterbi algorithm, see e.g. [87, 58].

Exercises

3.4.1 Consider the HMM example 3.9 with initial state σ = 1|Cloudy〉.

1 Compute successive states tn � σ for n = 0, 1, 2, 3.2 Compute successive observations e � (tn � σ) for n = 0, 1, 2, 3.3 Check that the validity of the sequence of (point-predicate) obser-

vations Go-out,Stay-in,Stay-in is 0.1674.4 Show that filtering, as in Definition 3.4.4, with these same three

(point) observations yields as final outcome:

18676696 |Cloudy〉 + 347

1395 |Sunny〉 + 1581733480 |Rainy〉

≈ 0.2788|Cloudy〉 + 0.2487|Sunny〉 + 0.4724|Rainy〉.

3.4.2 Consider the transition channel t associated with the HMM exam-ple 3.9. Check that in order to find a stationary stateσ∞ = x|Cloudy〉+y|Sunny〉 + z|Rainy〉 one has to solve the equations:

x = 0.5x + 0.15y + 0.2zy = 0.2x + 0.8y + 0.2zz = 0.3x + 0.05y + 0.6z

Deduce that σ∞ = 0.25|Cloudy〉 + 0.5|Sunny〉 + 0.25|Rainy〉 anddouble-check that t � σ∞ = σ∞.

158

Page 171: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.4. Hidden Markov models 1593.4. Hidden Markov models 1593.4. Hidden Markov models 159

3.4.3 (The set-up of this exercise is copied from Machine Learning lecturenotes of Doina Precup.) Consider a 5-state hallway of the form:

321 4 5

Thus we use a space X = {1, 2, 3, 4, 5} of positions, together with aspace Y = {2, 3} of outputs, for the number of surrounding walls. Thetransition and emission channels t : X → X and e : X → Y for a robotin this hallway are given by:

t(1) = 34 |1〉 +

14 |2〉 e(1) = 1|3〉

t(2) = 14 |1〉 +

12 |2〉 +

14 |3〉 e(2) = 1|2〉

t(3) = 14 |2〉 +

12 |3〉 +

14 |4〉 e(3) = 1|2〉

t(4) = 14 |3〉 +

12 |3〉 +

14 |5〉 e(4) = 1|2〉

t(5) = 14 |4〉 +

34 |5〉 e(5) = 1|3〉.

We use σ = 1|3〉 as start state, and we have a sequence of observa-tions α = [2, 2, 3, 2, 3, 3], formally as a sequence of point predicates[12, 12, 13, 12, 13, 13].

1 Check that (σ, t, e) |= α = 3512 .

2 Next we filter with the sequence α. Show that it leads succesivelyto the following states σi as in Definition 3.4.4

σ1 B σ = 1|3〉σ2 = 1

4 |2〉 +12 |3〉 +

14 |4〉

σ3 = 116 |1〉 +

14 |2〉 +

38 |3〉 +

14 |4〉 +

116 |5〉

σ4 = 38 |1〉 +

18 |2〉 +

18 |4〉 +

38 |5〉

σ5 = 18 |1〉 +

14 |2〉 +

14 |3〉 +

14 |4〉 +

18 |5〉

σ6 = 38 |1〉 +

18 |2〉 +

18 |4〉 +

38 |5〉

σ7 = 38 |1〉 +

18 |2〉 +

18 |4〉 +

38 |5〉.

3 Show that a most likely sequence of states giving rise to the se-quence of observations α is [3, 2, 1, 2, 1, 1]. Is there an alternative?

3.4.4 Apply Bayes’ rule to the validity formulation (3.13) in order to provethe correctness of the following HMM validity algorithm.

(σ, t, e) |= [] B 1(σ, t, e) |= [p] ++ ~p B

(σ |= e � p

)·((t � (σ|e�p), t, e) |= ~p

).

(Notice the connection with filtering from Definition 3.4.4.)

159

Page 172: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

160 Chapter 3. Directed Graphical Models160 Chapter 3. Directed Graphical Models160 Chapter 3. Directed Graphical Models

3.4.5 A random walk is a Markov model d : Z → Z given by d(n) = r|n −1〉 + (1 − r)|n + 1〉 for some r ∈ [0, 1]. This captures the idea that astep-to-the-left or a step-to-the-right are the only possible transitions.(The letter ‘d’ hints at modeling a drunkard.)

1 Start from initial state σ = 1|0〉 ∈ D(Z) and describe a coupleof subsequent states d � σ, d2 � σ, d3 � σ, . . . Which patternemerges?

2 Prove that for K ∈ N,

dK � σ =∑

0≤k≤K

binom[K](1 − r)(k)∣∣∣2k − K

⟩=

∑0≤k≤K

(Kk

)(1 − r)krK−k

∣∣∣2k − K⟩

3.4.6 A Markov chain X → X has a ‘one-stage history’ only, in the sensethat the state at stage n + 1 depends only on the state at stage n. Thefollowing situation from [87, Chap. III, Ex. 4.4] involves a two-stagehistory.

Suppose that whether or not it rains today depends on wheather conditionsthrough the last two days. Specifically, suppose that if it has rained for thepast two days, then it will rain tomorrow with probability 0.7; if it rainedtoday but not yesterday, then it will rain tomorrow with probability 0.5; ifit rained yesterday but not today, then it will rain tomorrow with probability0.4; if it has not rained in the past to days, then it will rain tomorrow withprobability 0.2.

1 Write R = {r, r⊥} for the state space of rain and no-rain outcomes,and capture the above probabilities via a channel c : R × R→ R.

2 Turn this channel c into a Markov chain 〈π2, c〉 : R × R → R × R,where the second component of R × R describes whether or not itrains on the current day, and the first component on the previousday. Describe 〈π2, c〉 both as a function and as a string diagram.

3 Generalise this approach to a history of length N > 1: turn a chan-nel XN → X into a Markov model XN → XN , where the relevanthistory is incorporated into the sample space.

3.4.7 Use the approach of the prevous exercise to turn a hidden Markovmodel into a Markov model.

3.4.8 LetH1 andH2 be two HMMs. Define their parallel productH1 ⊗H2

using the tensor operation ⊗ on states and channels.

160

Page 173: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.5. Disintegration 1613.5. Disintegration 1613.5. Disintegration 161

3.5 Disintegration

In Subsections 1.3.2 and 1.4.5 we have seen how a binary relation R ∈ P(A×B)on A × B corresponds to a P-channel A → P(B), and similarly how a multisetψ ∈ M(A × B) corresponds to anM-channel A → M(B). This phenomenonwas called: extraction of a channel from a joint state. This section describesthe analogue for probabilistic binary/joint states and channels. It turns out tobe more subtle because probabilistic extraction requires normalisation — inorder to ensure that the unitality requirement of a D-channel: multiplicitiesneed to add up to one.

In the probabilistic case this extraction is also called disintegration. It corre-sponds to a well-known phenomenon, namely that one can write a joint prob-ability P(x, y) in terms of a conditional probability as:

P(x, y) = P(y | x) · P(x).

The mapping x 7→ P(y | x) is the channel involved, and P(x) refers to the firstmarginal of P(x, y).

Our description of disintegration makes use of the graphical calculus thatwe introduced earlier in this chapter. The formalisation of disintegration incategory theory is not there (yet), but the closely related concept of Bayesianinversion has a nice categorical description as a ‘dagger’, see Section 3.8 be-low.

We start by formulating the property of ‘full support’ in string diagrammaticterms. It is needed as pre-condition in disintegration.

Definition 3.5.1. Let X be a non-empty finite set. We say that a distribution ormultiset ω on X has full support if ω(x) > 0 for each x ∈ X.

More generally, we say that a D- orM-channel c : Y → X has full supportif the distribution/multiset c(y) has full support for each y. This thus means thatc(y)(x) > 0 for all y ∈ Y and x ∈ X.

Lemma 3.5.2. Let ω ∈ D(X), where X is non-empty and finite. The followingthree points are equivalent:

1 ω has full support;2 there is a predicate p on X and probability r ∈ [0, 1] such that ω(x) · p(x) = r

for each x ∈ X.3 there is a predicate p on X such that ω|p is the uniform distribution on X.

Proof. Suppose that the finite set X has N ≥ 1 elements. For the implication(1) ⇒ (2), let ω ∈ D(X) have full support, so that ω(x) > 0 for each x ∈ X.

161

Page 174: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

162 Chapter 3. Directed Graphical Models162 Chapter 3. Directed Graphical Models162 Chapter 3. Directed Graphical Models

Since ω(x) ≤ 1, we get 1ω(x) ≥ 1 so that t B

∑x

1ω(x) ≥ N and t · ω(x) ≥ 1, for

each x ∈ X. Take r = 1t ≤

1N ≤ 1, and put p(x) = 1

t·ω(x) . Then:

ω(x) · p(x) = ω(x) · 1t·ω(x) = 1

t = r.

Next, for (2) ⇒ (3), suppose ω(x) · p(x) = r, for some predicate p andprobability r. Then ω |= p =

∑x ω(x) · p(x) = N · r. Hence:

ω|p(x) =ω(x) · p(x)ω |= p

=r

N · r=

1N

= υX(x).

For the final step (3) ⇒ (1), let ω|p = υX . This requires that ω |= p , 0 andthus that ω(x) · p(x) =

ω|=pN > 0 for each x ∈ X. But then ω(x) > 0.

This result allows us to express the full support property in diagrammaticterms. This will be needed as precondition later on.

Definition 3.5.3. A state box s has full support if there is a predicate box pand a scalar box r, for some r ∈ [0, 1], with an equation as on the left below.

=

s

p

2 A

r

2 A=

f

p

2 A2 A

B

B

q

Similarly, we say that a channel box f has full support if there are predicatesp, q with an equation as above on the right.

Strictly speaking, we didn’t have to distinguish states and channels in thisdefinition, since that state-case is a special instance of the channel-case, namelyfor A = 1. In the sequel we shall no longer make this distinction.

We now come to the main definition of this section. It involves turing aconditional probability P(b, c | a) into P(c | b, a) such that

P(b, c | a) = P(c | b, a) · P(b | a).

This is captured diagrammatically in (3.16) below.

Definition 3.5.4. Consider a box f

f

A

CB

for which f[1, 0

]= f has full support.

162

Page 175: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.5. Disintegration 1633.5. Disintegration 1633.5. Disintegration 163

We say that f admits disintegration if there is a unique ‘disintegration’ box f ′

from B, A to C satisfying the equation:

f

f ′

CB

A

= f

A

CB

(3.16)

Uniqueness of f ′ in this definition means: for each box g from B, A to C andfor each box h from C to A with full support one has:

g

h= f

A

CB

=⇒ h = f and g = f ′

The first equation h = f[1, 0

]in the conclusion is obtained simply by applying

discard to two right-most outgoing wires in the assumption (and using thatg is unital):

g

hh= =h = f

The second equation g = f ′ in the conclusion expresses uniqueness of thedisintegration box f ′.

We shall use the notation extr1( f ) or f[0, 1

∣∣∣ 1, 0] for this disintegration boxf ′. This notation will be explained in greater generality below.

The first thing to do is to check that this newly introduced construction issound.

Proposition 3.5.5. Disintegration as described in Definition 3.5.4 exists forprobabilistic channels.

163

Page 176: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

164 Chapter 3. Directed Graphical Models164 Chapter 3. Directed Graphical Models164 Chapter 3. Directed Graphical Models

Proof. Let f : A → B × C be a channel such that f[1, 0

]: A → B has full

support. The latter means that for each a ∈ A and b ∈ B one has f[1, 0

](a)(b) =∑

c∈C f (a)(b, c) , 0. Then we can define f ′ : B × A→ C as:

f ′(b, a) B∑

c

f (a)(b, c)f[1, 0

](a)(b)

∣∣∣c⟩. (3.17)

By the full support assumption this is well-defined.The left-hand-side of Equation (3.16) evaluates as:(

(id ⊗ f ′) ◦· (∆ ⊗ id) ◦· ( f[1, 0

]⊗ id) ◦· ∆

)(a)(b, c)

=∑

x,y id(y)(b) · f ′(y, x)(c) · f[1, 0

](a)(y) · id(x)(a)

= f ′(b, a)(c) · f[1, 0

](a)(b)

= f (a)(b, c).

For uniqueness, assume (id ⊗ g) ◦· (∆ ⊗ id) ◦· (h ⊗ id) ◦· ∆ = f . As already men-tioned, we can obtain h = f

[1, 0

]via diagrammatic reasoning. Here we reason

element-wise. By assumption, g(b, a)(c) · h(a)(b) = f (a)(b, c). This gives:

h(a)(b) = 1 · h(a)(b) =(∑

c g(b, a)(c))· h(a)(b) =

∑c g(b, a)(c) · h(a)(b)

=∑

c f (a)(b, c)= f

[1, 0

](a)(b).

Hence h = f[1, 0

]. But then g = f ′ since:

g(b, a)(c) =f (a)(b, c)h(a)(b)

=f (a)(b, c)

f[1, 0

](a)(b)

= f ′(b, a)(c).

Without the full support requirement, disintegrations may still exist, but theyare not unique, see Example 3.6.1 (2) for an illustration.

Next we elaborate on the notation that we will use for disintegration.

Remark 3.5.6. In traditional notation in probability theory one simply omitsvariables to express marginalisation. For instance, for a distributionω ∈ D(X1×

X2 × X3 × X4), considered as function ω(x1, x2, x3, x4) in four variables xi, onewrites:

ω(x2, x3) for the marginal∑x1,x4

ω(x1, x2, x3, x4).

We have been using masks instead: lists containing as only elements 0 or 1.We then write ω

[0, 1, 1, 0

]∈ D(X2 × X3) to express the above marginalisation,

with a 0 (resp. a 1) at position i in the mask meaning that the i-th variable inthe distribution is discarded (resp. kept).

We like to use a similar notation mask-style for disintegration, mimickingthe traditional notation ω(x1, x4 | x2, x3). This requires two masks, separated

164

Page 177: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.5. Disintegration 1653.5. Disintegration 1653.5. Disintegration 165

by the sign ‘|’ for conditional probability. We sketch how it works for a box ffrom A to B1, . . . , Bn in a string diagram. The notation

f [N | M]

will be used for a disintegration channel, in the following manner:

1 masks M,N must both be of length n;2 M,N must be disjoint: there is no position i with a 1 both in M and in N;3 the marginal f [M] must have full support;4 the domain of the disintegration box f [N | M] is ~B ∩ M, A, where ~B ∩ M

contain precisely those Bi with a 1 in M at position i;5 the codomain of f [N | M] is ~B ∩ N.6 f [N |M] is unique in satisfying an “obvious” adaptation of Equation (3.16),

in which f [M ∪ N] is equal to a string diagram consisting of f [M] suitablyfollowed by f [N | M].

How this really works is best illustrated via a concrete example. Consider thebox f on the left below, with five output wires. We elaborate the disintegrationf[1, 0, 0, 0, 1

∣∣∣ 0, 1, 0, 1, 0], as on the right.

f

B1 B2 B3 B4 B5

A

f[1, 0, 0, 0, 1

∣∣∣ 0, 1, 0, 1, 0]B1

B2 B4

B5

A

The disintegration box f[1, 0, 0, 0, 1

∣∣∣ 0, 1, 0, 1, 0] is unique in satisfying:

f[1, 0, 0, 0, 1

∣∣∣ 0, 1, 0, 1, 0]

f

= f

In traditional notation one could express this equation as:

f (b1, b5 | b2, b4, a) · f (b2, b4 | a) = f (b1, b2, b4, b5 | a).

Two points are still worth noticing.

165

Page 178: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

166 Chapter 3. Directed Graphical Models166 Chapter 3. Directed Graphical Models166 Chapter 3. Directed Graphical Models

• It is not required that at position i there is a 1 either in mask M or in maskN when we form f [N | M]. If there is a 0 at i both in M and in N, then thewire at i is discarded altogether. This happens in the above illustration forthe third wire: the string diagram on the right-hand-side of the equation isf [M ∪ N] = f

[1, 1, 0, 1, 1

].

• The above disintegration f[1, 0, 0, 0, 1

∣∣∣ 0, 1, 0, 1, 0] can be obtained from the‘simple’, one-wire version of disintegration in Definition 3.5.4 by first suit-ably rearranging wires and combining them via products. How to do thisprecisely is left as an exercise (see below).

Disintegration is a not a ‘compositional’ operation that can be obtained bycombining other string diagrammatic primitives. The reason is that disintegra-tion involves normalisation, via division in (3.17). Still it would be nice to beable to use disintegration in diagrammatic form. For this purpose one may usea trick: in the setting of Definition 3.5.4 we ‘bend’ the relevant wire down-wards and put a gray box around the result in order to suggest that its interioris closed off and has become inaccessible. Thus, we write the disintegration of:

f

A

CB

as f[0, 1

∣∣∣ 1, 0] = f

C

AB

This ‘shaded box’ notation can also be used for more complicated forms ofdisintegration, as described above.

Exercises

3.5.1 1 Prove that if a joint state has full support, then each of its marginalshas full support.

2 Consider the joint state ω = 12 |a, b〉 + 1

2 |a⊥, b⊥ 〉 ∈ D(A × B) for

A = {a, a⊥} and B = {b, b⊥}. Check that both marginals ω[1, 0

]∈

D(A) and ω[0, 1

]∈ D(B) have full support, but ω itself does not.

Conclude that the converse of the previous point does not hold.3.5.2 Consider the box f in Remark 3.5.6. Write done the equation for the

disintegration f[0, 1, 0, 1, 0

∣∣∣ 1, 0, 1, 0, 1]. Formulate also what unique-ness means.

3.5.3 Show how to obtain the disintegration f[0, 0, 0, 1, 1

∣∣∣ 0, 1, 1, 0, 0] inRemark 3.5.6 from the formulation in Definition 3.5.4 via rearrangingand combining wires (via ×).

166

Page 179: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.6. Disintegration for states 1673.6. Disintegration for states 1673.6. Disintegration for states 167

3.6 Disintegration for states

The previous section has described disintegration for channels, but in the lit-erature it is mostly used for states only, that is, for boxes without incomingwires. This section will look at this special case, following [13].

In its most basic form disintegration involves extracting a channel from a(binary) joint state, so that the state can be reconstructed as a graph — as inDefinition 1.8.10. This works as follows.

Fromω

A Bextract c

A

B

so that:

ω

A B=

ω

A Bc

(3.18)

Using the notation for disintegration from Remark 3.5.6 we can write the ex-tracted channel c as c = ω

[0, 1

∣∣∣ 1, 0]. Occasionaly we also write it as c =

dis1(ω). If we also use our notation for marginalisation and graphs we canexpress the equation on the right in (3.18) as:

ω = gr(ω[1, 0

], ω

[0, 1

∣∣∣ 1, 0]). (3.19)

This is a special case of Equation (3.16), where the incoming wire is trivial(equal to 1). We should not forget the side-condition of disintegration, whichin this case is: the first marginal ω

[1, 0

]must have full support. Then one can

define the channel c = dis1(ω) : X → Y as disintegration, explicitly, as a spe-cial case of (3.17):

c(a) B∑

b

ω(a, b)∑y ω(a, y)

∣∣∣b⟩=

∑b

ω(a, b)ω[1, 0

](a)

∣∣∣b⟩. (3.20)

From a joint state ω ∈ D(X×Y) we can extract a channel ω[0, 1

∣∣∣ 1, 0] : X →Y and also a channel ω

[1, 0

∣∣∣ 0, 1] : Y → X in the opposite direction, providedthat ω’s marginals have full support. The direction of the channels is thus ina certain sense arbitrary, and does not reflect any form of causality; for moreinformation, see [83, 84, 94] (or [50] for a string-diagrammatic account).

Example 3.6.1. We shall look at two examples, involving spaces A = {a, a⊥}and B = {b, b⊥}.

167

Page 180: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

168 Chapter 3. Directed Graphical Models168 Chapter 3. Directed Graphical Models168 Chapter 3. Directed Graphical Models

1 Consider the following state ω ∈ D(A × B),

ω = 14 |a, b〉 +

12 |a, b

⊥ 〉 + 14 |a

⊥, b⊥ 〉.

We have as first marginal σ B ω[1, 0

]= 3

4 |a〉+14 |a

⊥ 〉 with full support. Theextracted channel c B ω

[0, 1

∣∣∣ 1, 0] : A→ B is given by:

c(a) =ω(a, b)σ(a)

|b〉 +ω(a, b⊥)σ(a)

|b⊥ 〉 =1/4

3/4|b〉 +

1/2

3/4|b⊥ 〉 = 1

3 |b〉 +23 |b

⊥ 〉

c(a⊥) =ω(a⊥, b)σ(a⊥)

|b〉 +ω(a⊥, b⊥)σ(a⊥)

|b⊥ 〉 =0

1/4|b〉 +

1/4

1/4|b⊥ 〉 = 1|b⊥ 〉.

Then indeed, gr(σ, c) = 〈id, c〉 � σ = ω.

2 Now let’s start from:

ω = 13 |a, b〉 +

23 |a, b

⊥ 〉.

Then σ B ω[1, 0

]= 1|a〉. It does not have full support. Let τ ∈ D(B) be an

arbitrary state. We define c : A→ B as:

c(a) = 13 |b〉 +

23 |b

⊥ 〉 and c(a⊥) = τ.

We then still get 〈id, c〉 � σ = ω, no matter what τ is.

More generally, if we don’t have full support, disintegrations still exist,but they are not unique. We generally avoid such non-uniqueness by requir-ing full support.

3.6.1 Disintegration of states and conditioning

With disintegration (of states) we can express conditioning (updating) of statesin a new way. Also, the earlier results on crossover inference can now be refor-mulated via conditioning.

For the next result, recall that we can identify a predicate X → [0, 1] with achannel X → 2, see Exercise 2.4.8.

Proposition 3.6.2. Let ω ∈ D(A) be state and p ∈ Pred (A) be predicate, bothon the same finite set A. Assumme that the validity ω |= p is neither 0 nor 1.

Then we can express the conditionings ω|p ∈ D(A) and ωp⊥ ∈ D(A) as

168

Page 181: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.6. Disintegration for states 1693.6. Disintegration for states 1693.6. Disintegration for states 169

(interpreted) string diagrams:

ω|p =

ω

p

1

A

2

and ω|p⊥ =

ω

p

0

A

2

Proof. First, let’s write σ B 〈p, id〉 � ω ∈ D(2 × A) for the joint statethat is disintegrated in the above string diagrams. The pre-conditioning fordisintegration is that the marginal σ

[1, 0

]∈ D(2) has full support. Explicitly,

this means for b ∈ 2 = {0, 1},

σ[1, 0

](b) =

∑a∈A σ(b, a) =

∑a∈A ω(a) · p(a)(b) =

ω |= p if b = 1

ω |= p⊥ if b = 0.

The full support requirement that σ[1, 0

](b) > 0 for each b ∈ 2 means that both

ω |= p and ω |= p⊥ = 1 − (ω |= p) are non-zero. This holds by assumption.We elaborate the above string diagram on the left. Let’s write c for the ex-

tracted channel. It is, according to (3.20),

c(1) =∑

a

σ(1, a)∑a σ(1, a)

∣∣∣a⟩=

∑a

ω(a) · p(a)ω |= p

∣∣∣a⟩= ω|p.

We move to crossover inference, as described in Corollary 2.5.9. There, ajoint state ω ∈ D(X × Y) is used, which can be described as a graph ω =

gr(σ, c). But we can drop this graph assumption now, since it can be obtainedfrom ω via disintegration — assuming that ω’s first marginal has full support.

Thus we come to the following reformulation of Corollary 2.5.9.

Theorem 3.6.3. Let ω ∈ D(X×Y) be joint state whose first marginal ω[1, 0

]∈

D(X) has full support, so that the channel ω[0, 1

∣∣∣ 1, 0] : X → Y exists by dis-integration, and thus ω = gr(ω

[1, 0

], ω

[0, 1

∣∣∣ 1, 0]). Then:

1 for a factor p on X,(ω|p⊗1

)[0, 1

]= ω

[0, 1

∣∣∣ 1, 0] � (ω[1, 0

] ∣∣∣p

).

2 for a factor q on Y, (ω|1⊗q

)[1, 0

]= ω

[1, 0

] ∣∣∣ω[0,1 | 1,0]�q.

169

Page 182: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

170 Chapter 3. Directed Graphical Models170 Chapter 3. Directed Graphical Models170 Chapter 3. Directed Graphical Models

Exercises

3.6.1 Let σ ∈ D(X) have full support and consider ω B σ ⊗ τ for someτ ∈ D(Y). Check that the channel X → Y extracted from ω by disin-tegration is the constant function x 7→ τ. Give a string diagrammaticaccount of this situation.

3.6.2 Consider an extracted channel ω[1, 0, 0

∣∣∣ 0, 0, 1] for some state ω.

1 Write down the defining equation for this channel, as string dia-gram.

2 Check that ω[1, 0, 0

∣∣∣ 0, 0, 1] is the same as ω[1, 0, 1

][1, 0

∣∣∣ 0, 1].3.6.3 Check that a marginalisation ωM can also be described as disintegra-

tion ω[M

∣∣∣ 0, . . . , 0] where the number of 0’s equals the length of themask/list M.

3.6.4 Disintegrate the distribution Flrn(τ) ∈ D({H, L} × {1, 2, 3}) in Exam-ple 4.1.2 to a channel {H, L} → {1, 2, 3}.

3.7 Bayesian inversion

Bayesian inversion is a mechanism for turning a channel A → B into a chan-nel B → A in the opposite direction, in presence of a state on A. Bayesianinversion can be defined via disintegration, but also in the other direction, dis-integration can be defined via Bayesian inversion. Thus, Bayesian inversionand disintegration are interdefinable.

This section introduces the basics of Bayesian inversion, following [15]and [13]. A categorical analysis of Bayesian inversion is given in the next sec-tion. Bayesian inversion will be used later on in Jeffrey’s adaptation rule andin learning.

Definition 3.7.1. Let c : A → B be a channel with a state ω ∈ D(A) suchthat c � ω ∈ D(B) has full support. The Bayesian inversion of c wrt. ω is achannel written as c†ω : B → A, and defined as disintegration of the joint state

170

Page 183: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.7. Bayesian inversion 1713.7. Bayesian inversion 1713.7. Bayesian inversion 171

〈c, id〉 � ω in:

c†ω B

ω

c

A

B

such that =

ω

c

c†ω

(3.21)

By construction, c†ω is the unique box giving the above equation. By applyingto the left outgoing line on both sides of the above equation we get:

ω = c†ω �(c � ω

). (3.22)

The superscript-dagger notation (−)† is used since Bayesian inversion issimilar to the adjoint-transpose of a linear operator A between Hilbert spaces,see [15] (and Exercise 3.7.6 below). This transpose is typically written as A†,or also as A∗. The dagger notation is more common in quantum theory, and hasbeen formalised in terms of dagger categories [16]. This categorical approachis sketched in Section 3.8 below.

We know from Proposition 3.5.5 that disintegrations exist, so Bayesian in-versions also exist, as special case. The next result gives a concrete description,in terms of backward inference with a point predicate.

Proposition 3.7.2. For a channel c : A → B and state ω ∈ D(A) such thatc � ω has full support, one has:

c†ω(b) = ω|c�1b =∑

a

ω(a) · c(a)(b)(c � ω)(b)

∣∣∣a⟩=

∑a

ω(a) · c(a)(b)∑x ω(x) · c(x)(b)

∣∣∣a⟩. (3.23)

Proof. By uniqueness of Bayesian inversion, it suffices to show that the for-mulation (3.23) ensures that the equation on the right in (3.21) holds. This iseasy: (

〈id, c†ω〉 � (c � ω))(b, a) = c†ω(b)(a) · (c � ω)(b)

=ω(a) · c(b)(b)(c � ω)(b)

· (c � ω)(a)

= ω(a) · c(a)(b)=

(〈c, id〉 � ω

)(b, a).

Now that we know that Bayesian inversion has this backward-inference-with-point-evidence form, we can recognise that we have been using severaltimes it before. We briefly review some of these examples.

171

Page 184: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

172 Chapter 3. Directed Graphical Models172 Chapter 3. Directed Graphical Models172 Chapter 3. Directed Graphical Models

Example 3.7.3. 1 In Example 2.5.2 we used a channel c : {H,T } → {W, B}from sides of a coin to colors of balls in an urn, together with a fair coinγ ∈ D({H,T }). We had evidence of a white ball and wanted to know theupdated coin distribution. The outcome that we computed can be describedvia a dagger channel, since:

c†γ(W) = γ|c�1W = 2267 |H 〉 +

4567 |T 〉.

The same redescription in terms of Bayesian inversion can be used for Ex-amples 2.5.3 and 2.5.6. In Example 2.5.4 we can use Bayesian inversionfor the point evidence case, but not for the illustration with soft evidence,i.e. with fuzzy predicates. This matter will be investigated further in Sec-tion 3.11.

2 In Definition 3.4.4 we have seen filtering for hidden Markov models, startingfrom a sequence of factors ~p as observations. In practice these factors areoften point predicates 1yi for a sequence of elements ~y of the visible spaceY . In that case one can describe filtering via Bayesian inversion as follows.Suppose we have a hidden Markov model with transition channel t : X →X, emission channel e : X → Y and initial state σ ∈ D(X). The filteredsequence of states for point observations ~y in Y is given by:

σ1 = σ and σi+1 = t �(σi|e�1yi

)= t � e†σi (yi)= (t ◦· e†σi )(yi).

Above, we have defined Bayesian inversion in terms of disintegration. Infact, the two notions are inter-definable, since disintegration can also be definedin terms of Bayesian inversion. This will be shown next.

Lemma 3.7.4. Let ω ∈ D(A × B) be such that its first marginal π1 � ω =

ω[1, 0

]∈ D(A) has full support, where π1 : A × B→ A is the projection chan-

nel. The second marginal of the Bayesian inversion:(π1

)†ω

[0, 1

]for

(π1

)†ω : A→ A × B,

is then the disintegration channel A→ B for ω.

Proof. We first note that the projection channel π1 : A× B→ A can be writtenas string diagram:

π1 =

172

Page 185: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.7. Bayesian inversion 1733.7. Bayesian inversion 1733.7. Bayesian inversion 173

Outlook Temperature Humidity Windy Play

Sunny hot high false noSunny hot high true no

Overcast hot high false yesRainy mild high false yesRainy cool normal false yesRainy cool normal true no

Overcast cool normal true yesSunny mild high false noSunny cool normal false yesRainy mild normal false yesSunny mild normal true yes

Overcast mild high true yesOvercast hot normal false yes

Rainy mild high true no

Figure 3.1 Weather and play data, copied from [101].

We need to prove the equation in (3.18). It can be obtained via purely diagram-matic reasoning:

= =

ωω

ω

(π1)†ω

The first equation is an instance of (3.21).

We illustrate the use of both disintegration and Bayesian inversion in anexample of ‘naive’ Bayesian classification from [101]; we follow the analysisof [13].

Example 3.7.5. Consider the table in Figure 3.1. It collects data about certainweather conditions and whether or not there is playing (outside). The ques-tion asked in [101] is: given this table, what can be said about the probabilityof playing if the outlook is Sunny, the temperature is Cold, the humidity isHigh and it is Windy? This is a typical Bayesian update question, starting from(point) evidence. We will first analyse the situation in terms of channels.

We start by extracting the underlying spaces for the columns/categories inthe table in Figure 3.1. We choose obvious abbreviations for the entries in thetable:

O = {s, o, r} T = {h,m, c} H = {h, n}. W = {t, f } P = {y, n}

173

Page 186: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

174 Chapter 3. Directed Graphical Models174 Chapter 3. Directed Graphical Models174 Chapter 3. Directed Graphical Models

These sets are joined into a single product space:

S B O × T × H ×W × P.

It combines the five columns in Figure 3.1. The table itself can now be consid-ered as a multiset inM(S ) with 14 elements, each with multiplicity one. Wewill turn it immediately into an empirical distribution — formally via frequen-tist learning. It yields a distribution τ ∈ D(S ), with 14 entries, each with thesame probability, written as:

τ = 114 | s, h, h, f , n〉 + 1

14 | s, h, h, t, n〉 + · · · +114 |r,m, h, t, n〉.

We use a ‘naive’ Bayesian model in this situation, which means that weassume that all weather features are independent. This assumption can be vi-sualised via the following string diagram:

Play

TemperatureOutlook Humidity Windy

(3.24)

This model oversimplifies the situation, but still it often leads to good (enough)outcomes.

We take the above perspective on the distribution τ, that is, we ‘factorise’ τaccording to this string diagram (3.24). Obviously, the play state π ∈ D(P) isobtained as the last marginal:

π B τ[0, 0, 0, 0, 1

]= 9

14 |y〉 +514 |n〉.

Next, we extract four channels cO, cT , cH , cW via appropriate disintegrations,from the Play column to the Outlook / Temperature / Humidity / Windy columns.

cO B τ[1, 0, 0, 0, 0

∣∣∣ 0, 0, 0, 0, 1] cH B τ[0, 0, 1, 0, 0

∣∣∣ 0, 0, 0, 0, 1]cT B τ

[0, 1, 0, 0, 0

∣∣∣ 0, 0, 0, 0, 1] cW B τ[0, 0, 0, 1, 0

∣∣∣ 0, 0, 0, 0, 1].For instance the ‘outlook’ channel cO : P→ O looks as follows.

cO(y) = 29 | s〉 +

29 |o〉 +

39 |r 〉 cO(n) = 3

5 | s〉 +25 |r 〉. (3.25)

It is analysed in greater detail in Exercise 3.7.3 below.Now we can form the tuple channel of these extracted channels, called c in:

P ◦cB 〈cO,cT ,cH ,cW 〉 // O × T × H ×W

Recall the question that we started from: what is the probability of playing

174

Page 187: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.7. Bayesian inversion 1753.7. Bayesian inversion 1753.7. Bayesian inversion 175

if the outlook is Sunny, the temperature is Cold, the humidity is High andit is Windy? These features can be translated into an element (s, c, h,w) of thecodomain O×T×H×W of this tuple channel — and thus into a point predicate.Hence our answer can be obtained by Bayesian inversion of the tuple channel,as:

c†π(s, c, h, t

) (3.23)= π

∣∣∣c�1(s,c,h,t)

= 125611 |y〉 +

486611 |n〉

≈ 0.2046|y〉 + 0.7954|n〉.

This corresponds to the probability 20.5% calculated in [101] — without anydisintegration or Bayesian inversion.

In the end one can reconstruct a joint state with space S via the extractedchannel, as graph:

〈cO, cT , cH , cW , id〉 � π.

This state differs considerably from the original table/state τ. It shows that theshape (3.24) does not really fit the data that we have in Figure 3.1. But recallthat this approach is called naive. We shall soon look closer into such mattersof shape in Section 3.9.

After this concrete application of Bayesian inversion we turn to some of itsmore abstract properties.

Lemma 3.7.6. For channels c : X → Y and d : X → Z we have:

〈c, d〉†ω(y, z) = ω|(c�1y)&(d�1z) = d†c†ω(y)

(z)

= ω|(d�1z)&(c�1y) = c†c†ω(z)

(y).

Proof. We use the explicit formulation (3.23) of Bayesian inversion in:

〈c, d〉†ω(y, z) = ω|〈c,d〉�1(y,z) = ω|∆�((c⊗d)�(1y⊗1z))

= ω|∆�((c�1y)⊗(d�1z))

= ω|(c�1y)&(d�1z)

= ω|c�1y |d�1z = c†ω(y)|d�1z = d†c†ω(y)

(z).

The outcome c†d†ω(z)

(y) can be obtained by using that conjunction & is commu-tative in this proof.

3.7.1 Bayesian inversion and inference

In Definition 2.5.1 we have carefully distinguished forward inference c �(ω|p) from backward inference ω|c�q. Since Bayesian inversion turns chan-nels around, the question arises whether it also turns inference around: can

175

Page 188: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

176 Chapter 3. Directed Graphical Models176 Chapter 3. Directed Graphical Models176 Chapter 3. Directed Graphical Models

one express forward (resp. backward) inference along c in terms of backward(resp. forward) inference along the Bayesian inversion of c? The next resultfrom [49] shows that this is indeed the case. It demonstrates that the directionsof inference and of channels are intimately connected.

Theorem 3.7.7. Let c : X → Y be a channel with a state ω ∈ D(X) on itsdomain, such that the transformed state σ B c � ω has full support.

1 Given a factor q on Y, we can express backward inference as forward infer-ence via:

ω|c�q = c†ω �(σ|q

).

2 Given a factor p on X, we can express forward inference as backward infer-ence:

c �(ω|p

)= σ

∣∣∣c†ω�p.

Proof. 1 For x ∈ X we have:(c†ω �

(σ|q

))(x) =

∑y

c†ω(y)(x) · (c � ω)|q(y)

=∑

y

c(x)(y) · ω(x)(c � ω)(y)

·(c � ω)(y) · q(y)

c � ω |= q

=ω(x) ·

(∑y c(x)(y) · q(y)

)ω |= c � q

=ω(x) · (c � q)(x)ω |= c � q

= ω|c�q(x).

2 Similarly, for y ∈ Y ,(σ∣∣∣c†ω�p

)(y) =

(c � ω)(y) · (c†ω � p)(y)

c � ω |= c†ω � p

=∑

x

(c � ω)(y) · c†ω(y)(x) · p(x)

c†ω � (c � ω) |= p

=∑

x

(c � ω)(y) · ω(x)·c(x)(y)(c�ω)(y) · p(x)

ω |= p

=∑

xc(x)(y) ·

ω(x) · p(x)ω |= p

=∑

xc(x)(y) · ω|p(x)

=(c � (ω|p)

)(y).

We conclude with a follow-up of the crossover update corollary 2.5.9. There

176

Page 189: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.7. Bayesian inversion 1773.7. Bayesian inversion 1773.7. Bayesian inversion 177

we look at marginalisation after update. Here we look at extraction of a chan-nel, namely at the same position of the original channel. It can be expressed interms of an updated channel, see Definition 2.3.1 (3).

Theorem 3.7.8. Let c : X → Y be a channel with state σ ∈ D(X) on itsdomain, and let p ∈ Fact(X) and q ∈ Fact(Y) be factors.

1 The extraction on an updated graph state yields:(〈id, c〉 � σ

)∣∣∣p⊗q

[0, 1

∣∣∣ 1, 0] = c|q where c|q(x) B c(x)|q.

2 And as a result: (〈id, c〉 � σ

)∣∣∣p⊗q = 〈id, c|q〉 �

(σ|p&(c�q)

).

Proof. 1 For x ∈ X and y ∈ Y we have:(〈id, c〉 � σ

)∣∣∣p⊗q

[0, 1

∣∣∣ 1, 0](x)(y)

(3.23)=

(〈id, c〉 � σ

)∣∣∣p⊗q(x, y)(

〈id, c〉 � σ)∣∣∣

p⊗q

[1, 0

](x)

=

(〈id, c〉 � σ

)(x, y) · (p ⊗ q)(x, y)∑

v(〈id, c〉 � σ

)(x, v) · (p ⊗ q)(x, v)

=σ(x) · c(x)(y) · p(x) · q(y)∑v σ(x) · c(x)(v) · p(x) · q(v)

=c(x)(y) · q(y)∑v c(x)(v) · q(v)

=c(x)(y) · q(y)

c(x) |= q= c(x)|q(y)= c|q(x)(y).

2 A joint state like (〈id, c〉 � σ)|p⊗q can always be written as graph gr(τ, d) =

〈id, d〉 � τ. The channel d is obtained by distintegration of the joint state,and equals c|q by the previous point. The state τ is the first marginal of thejoint state. Hence we are done by:(

〈id, c〉 � σ)∣∣∣

p⊗q

[1, 0

]=

(〈id, c〉 � σ

)∣∣∣(1⊗q)&(p⊗1)

[1, 0

]by Exercise 2.4.5

=(〈id, c〉 � σ

)∣∣∣1⊗q

∣∣∣p⊗1

[1, 0

]by Lemma 2.3.5 (3)

=(〈id, c〉 � σ

)∣∣∣1⊗q

[1, 0

]∣∣∣p by Lemma 2.3.5 (6)

= σ|c�q|p by Corollary 2.5.9 (2)= σ|p&(c�q) by Lemma 2.3.5 (3).

177

Page 190: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

178 Chapter 3. Directed Graphical Models178 Chapter 3. Directed Graphical Models178 Chapter 3. Directed Graphical Models

In Section 3.11 we will use Bayesian inversion for a new form of beliefrevision, called Jeffrey’s adaptation.

Exercises

3.7.1 Check that Proposition 3.6.2 can be reformulated as:

ω|p = p†ω(1) and ω|p⊥ = p†ω(0).

3.7.2 1 Prove Equation (3.22) concretely, using the backward inferenceformulation (3.23) of Bayesian inversion.

2 Check that Equation (3.22) can be obtained as special case of The-orem 3.7.7 (1), namely when q is the truth predicate 1.

3.7.3 Write σ ∈ M(O × T × H × W × P) for the weather-play table inFigure 3.1, as multidimensional multiset.

1 Compute the marginal multiset σ[1, 0, 0, 0, 1

]∈ M(O × P).

2 Reorganise this marginalised multiset as a 2-dimensional table withonly Outlook (horizontal) and Play (vertical) data, as given below,and check how this ‘marginalised’ table relates to the original onein Figure 3.1.

Sunny Overcast Rainy

yes 2 4 3no 3 0 2

3 Deduce a channel P → O from this table, and compare it to thedescription (3.25) of the channel cO given in Example 3.7.5 — seealso Lemma 1.6.2 and Proposition 4.1.1.

4 Do the same for the marginal tables σ[0, 1, 0, 0, 1

], σ

[0, 0, 1, 0, 1

],

σ[0, 0, 0, 1, 1

], and the corresponding channels cT , cH , cW in Ex-

ample 3.7.5.

3.7.4 Prove that the equation(〈id, c〉 � σ

)∣∣∣p⊗1

[0, 1

∣∣∣ 1, 0] = c.

can be obtained both from Theorem 2.5.7 and from Theorem 3.7.8.3.7.5 Prove that for a state ω ∈ D(X × Y) with full support one has:

ω[1, 0

∣∣∣ 0, 1] =(ω[0, 1

∣∣∣ 1, 0])† : Y → X.

3.7.6 For ω ∈ D(X), c : X → Y , p ∈ Obs(X) and q ∈ Obs(Y) prove:

c � ω |= (c†ω � p) & q = ω |= p & (c � q).

178

Page 191: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.8. Categorical aspects of Bayesian inversion 1793.8. Categorical aspects of Bayesian inversion 1793.8. Categorical aspects of Bayesian inversion 179

This is a reformulation of [15, Thm. 6]. If we ignore the validity signs|= and the states before them, then we can recognise in this equationthe familiar ‘adjointness property’ of daggers (adjoint transposes) inthe theory of Hilbert spaces: 〈A†(x) | y〉 = 〈x | A(y)〉.

3.7.7 Recall from Exercise 1.7.9 what it means for c : X → Y to be a bi-channel. Check that c∗ : Y → Y , as defined there, is c†υ, where υ is theuniform distribution on X (assuming that X is finite).

3.8 Categorical aspects of Bayesian inversion

As suggested in the beginning of this section the ‘dagger’ of a channel — i.e. itsBayesian inversion — can also be described categorically. It turns out to be aspecial ‘dagger’ functor. Such reversal is quite common for non-deterministiccomputation, see Example 3.8.1 below. The fact that this same abstract struc-ture exists for probabilistic computation demonstrates once again that Bayesianinversion is a canonical operation — and that category theory provides a usefullanguage for making such similarities explicit.

This section goes a bit deeper into the category-theoretic aspects of proba-bilistic computation in general and of Bayesian inversion in particular. It is notessential for the rest of this book, but provides deeper insight into the underly-ing structures.

Example 3.8.1. Recall the category Chan(P) of non-deterministic compu-tations. Its objects are sets X and its morphisms f : X → Y are functionsf : X → P(X). The identity morphism unit X : X → X in Chan(P) is the sin-gleton function unit(x) = {x}. Composition of f : X → Y and g : Y → Z is thefunction g ◦· f : X → Z given by:(

g ◦· f)(x) = {z ∈ Z | ∃y ∈ Y. y ∈ f (x) and z ∈ g(y)}.

It is not hard to see that ◦· is associative and has unit as neutral element. In fact,this has already been proven more generally, in Lemma 1.7.3.

There are two aspects of the category Chan(P) that we wish to illustrate,namely (1) that it has an ‘inversion’ operation, in the form of a dagger functor,and (2) that it is isomorphic to the category Rel of sets with relations betweenthem (as morphisms). Probabilistic analogues of these two points will be de-scribed later.

1 We start from a very basic observation, namely that morphisms in Chan(P)can be reversed. There is a bijective correspondence, indicated by the double

179

Page 192: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

180 Chapter 3. Directed Graphical Models180 Chapter 3. Directed Graphical Models180 Chapter 3. Directed Graphical Models

lines, between morphisms in Chan(P):

X ◦ // Y============Y ◦ // X

that is, between functions:X

f// P(Y)

===============Y g

// P(X)

This correspondence sends f : X → P(Y) to the function f † : Y → P(X)with f †(y) B {x | y ∈ f (x)}. Hence y ∈ f (x) iff x ∈ f †(y). Similarly onesends g : Y → P(X) to g† : X → P(Y) via g†(x) B {y | x ∈ g(y)}. Clearly,f †† = f and g†† = g.

It turns out that this dagger operation (−)† interacts nicely with compo-sition: one has unit† = unit and also (g ◦· f )† = f † ◦· g†. This means thatthe dagger is functorial. It can be described as a functor (−)† : Chan(P) →Chan(P)op, which is the identity on objects: X† = X. The opposite (−)op

category is needed for this functor since it reverses arrows.2 We write Rel for the category with sets X as objects and relations R ⊆ X ×Y

as morphisms X → Y . The identity X → X is given by the equality relationEqX ⊆ X × X, with EqX = {(x, x) | x ∈ X}. Composition of R ⊆ X × Y andS ⊆ X × Z is the ‘relational’ composition S • R ⊆ X × Z given by:

S • R B {(x, z) | ∃y ∈ Y.R(x, y) and S (y, z)}.

It is not hard to see that we get a category in this way.There is a ‘graph’ functor G : Chan(P) → Rel, which is the identity

on objects: G(X) = X. On a morphism f : X → Y , that is, on a functionf : X → P(Y), we define G( f ) ⊆ X × Y to be G( f ) = {(x, f (x)) | x ∈ X}.Then: G(unit X) = EqX and G(g ◦· f ) = G( f ) • G( f ).

In the other direction there is also a functor F : Rel → Chan(P), whichis again the identity on objects: F(X) = X. On a morphism R : X → Yin Rel, that is, on a relation R ⊆ X × Y , we define F(R) : X → P(Y) asF(R)(x) = {y | R(x, y)}. This F preserves identities and composition.

These two functors G and F are each other’s inverses, in the sense that:

F ◦ G = id : Chan(P)→ Chan(P) and G ◦ F = id : Rel→ Rel.

This establishes an isomorphism Chan(P) � Rel of categories.Interestingly, Rel is also a dagger category, via the familar operation of

reversal of relations: for R ⊆ X × Y one can form R† ⊆ Y × X via R†(y, x) =

R(x, y). This yields a functor (−)† : Rel→ Relop, obviously with (−)†† = id.Moreover, the above functors G and F commute with the daggers of

Chan(P) and Rel, in the sense that:

G( f †) = G( f )† and F(R†) = F(R)†.

180

Page 193: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.8. Categorical aspects of Bayesian inversion 1813.8. Categorical aspects of Bayesian inversion 1813.8. Categorical aspects of Bayesian inversion 181

We shall prove the first equation and leave the second one to the interestedreader. The proof is obtained by carefully unpacking the right definition ateach stage. For a function f : X → P(Y) and elements x ∈ X, y ∈ Y ,

G( f †)(y, x) ⇔ x ∈ f †(y) ⇔ y ∈ f (x) ⇔ G( f )(x, y) ⇔ G( f )†(y, x).

We now move from non-deterministic to probabilistic computation. Our aimis to obtain analogous results, namely inversion in the form of a dagger functoron a category of probabilistic channels, and an isomorphism of this categorywith a category of probabilistic relations. One may expect that these resultshold for the category Chan(D) of probabilistic channels. But the situation isa bit more subtle. Recall from the previous section that the dagger (Bayesianinversion) c†ω : Y → X of a probabilistic channel c : X → Y requires a stateω ∈ D(X) on the domain. In order to conveniently deal with this situation weincorporate these states ω into the objects of our category. We follow [15] anddenote this category as Krn; its morphisms are ‘kernels’.

Definition 3.8.2. The category Krn of kernels has:

• objects: pairs (X, σ) where X is a finite set and σ ∈ D(X) is a distribution onX with full support;

• morphisms: f : (X, σ) → (Y, τ) are probabilistic channels f : X → Y withf � σ = τ.

Identity maps (X, σ) → (X, σ) in Krn are identity channels unit : X → X,given by unit(x) = 1| x〉, which we write simply as id. Composition in Krn isordinary composition ◦· of channels.

Theorem 3.8.3. Bayesian inversion forms a dagger functor (−)† : Krn →Krnop which is the identity on objects and which sends:(

(X, σ)f−→ (Y, τ)

)7−→

((Y, τ)

f †σ−→ (X, σ)

)This functor is its own inverse: f †† = f .

Proof. We first have to check that the dagger functor is well-defined, i.e. thatthe above mapping yields another morphism in Krn. This follows from (3.22):

f †σ � τ = f †σ � ( f � σ) = σ.

Aside: this does not mean that f † ◦· f = id.We next show that the dagger preserves identities and composition. We use

the concrete formulation of Bayesian inversion of Proposition 3.7.2. A string

181

Page 194: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

182 Chapter 3. Directed Graphical Models182 Chapter 3. Directed Graphical Models182 Chapter 3. Directed Graphical Models

diagrammatic proof is also an option, see below. The identity map id : (X, σ)→(X, σ) in Krn satisfies:

id†σ(x) = σ|id�1x = σ|1x = 1| x〉 see Exercise 2.3.4= id(x).

For morphisms (X, σ)f−→ (Y, τ)

g−→ (Z, ρ) in Krn we have, for z ∈ Z,(

g ◦· f)†σ(z) = σ|(g◦· f )�1z = σ| f�(g�1z)

= f †σ �(( f � σ)|g�1z

)by Theorem 3.7.7 (1)

= f †σ �(τ|g�1z

)= f †σ � g†τ(z)=

(f †σ ◦· g†τ

)(z).

Finally we have:(f †σ

)†τ(x) = τ| f †σ�1x

= ( f � σ)| f †σ�1x

= f � (σ|1x ) by Theorem 3.7.7 (2)= f � 1| x〉= f (x).

One can also prove this result via equational reasoning with string diagrams.For instance preservation of composition ◦· by the dagger follows by unique-ness from:

σ

f

g

σ

g

=f †σ

f

=

τ

g

f †σ

τ

f †σ

==g†τ

g

f

f †σ

g†τ

g

σ

We now turn to probabilistic relations, with the goal of finding a categoryof such relations that is isomorphic to Krn. For this purpose we use what arecalled couplings. The definition that we use below differs only in inessentialways from the formulation that we used in Subsection 2.8.1 for the Wassersteindistance.

Definition 3.8.4. We introduce a category Cpl of couplings with the sameobjects as Krn. A morphism (X, σ)→ (Y, τ) in Cpl is a joint state ϕ ∈ D(X×Y)

182

Page 195: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.8. Categorical aspects of Bayesian inversion 1833.8. Categorical aspects of Bayesian inversion 1833.8. Categorical aspects of Bayesian inversion 183

with ϕ[1, 0

]= σ and ϕ

[0, 1

]= τ. Such a distribution which marginalises to σ

and τ is called a coupling between σ and τ.Composition of ϕ : (X, σ) → (Y, τ) and ψ : (Y, τ) → (Z, ρ) is the distribution

ψ • ϕ ∈ D(X × Z) defined as:

(ψ • ϕ) B 〈ϕ[1, 0

∣∣∣ 0, 1], ψ[0, 1 ∣∣∣ 1, 0]〉 � τ

=∑x,z

(∑yϕ(x, y) · ψ(y, z)

τ(y)

) ∣∣∣ x, z⟩. (3.26)

The identity coupling Eq(X,σ) : (X, σ)→ (X, σ) is the distribution ∆ � σ.

The essence of the following result is due to [15], but there it occurs inslightly different form, namely in a setting of continuous probability. Here it istranslated to the discrete situation.

Theorem 3.8.5. Couplings as defined above indeed form a category Cpl.

1 This category carries a dagger functor (−)† : Cpl → Cplop which is theidentity on objects; on morphisms it is defined via swapping:(

(X, σ)ϕ−→ (Y, τ)

)†B

((Y, τ)

〈π2,π1〉�ϕ−−−−−−−→ (X, σ)

).

More concretely, this dagger is defined by swapping arguments, as in: ϕ† =∑x,y ϕ(x, y)|y, x〉.

2 There is an isomorphism of categories Krn � Cpl, in one direction by takingthe graph of a channel, and in the other direction by disintegration. Thisisomorphism commutes with the daggers on the two categories.

Proof. We first need to prove that ψ • ϕ is a distribution:∑x,z

(ψ • ϕ)(x, z) =∑

x,y,z

ϕ(x, y) · ψ(y, z)τ(y)

=∑

y,z

(∑x ϕ(x, y)

)· ψ(y, z)

τ(y)

=∑

y,z

ϕ[0, 1

](y) · ψ(y, z)τ(y)

=∑

y,z

τ(y) · ψ(y, z)τ(y)

=∑

y,zψ(y, z) = 1.

We leave it to the reader to check that Eq(X,σ) = ∆ � σ is neutral element for•. We do verify that • is associative — and thus that Cpl is indeed a category.Let ϕ : (X, σ) → (Y, τ), ψ : (Y, τ) → (Z, ρ), χ : (Z, ρ) → (W, κ) be morphisms in

183

Page 196: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

184 Chapter 3. Directed Graphical Models184 Chapter 3. Directed Graphical Models184 Chapter 3. Directed Graphical Models

Cpl. Then:(χ • (ψ • ϕ)

)(x,w) =

∑z

(ψ • ϕ)(x, z) · χ(z,w)ρ(z)

=∑

y,z

ϕ(x, y) · ψ(y, z) · χ(z,w)τ(y) · ρ(z)

=∑

y

ϕ(x, y) · (χ • ψ)(y,w)τ(y)

=((χ • ψ) • ϕ

)(x,w).

We turn to the dagger. It is obvious that (−)†† is the identity functor, and alsothat (−)† preserves identity maps. It also preserves composition in Cpl since:(

ψ • ϕ)†(z, x) =

(ψ • ϕ

)(x, z)

=∑

y

ϕ(x, y) · ψ(y, z)τ(y)

=∑

y

ψ†(z, y) · ϕ†(y, x)τ(y)

=(ϕ† • ψ†

)(z, x).

The graph operation gr on channels from Definition 1.8.10 gives rise to anidentity-on-objects ‘graph’ functor G : Krn→ Cpl via:

G((X, σ)

f−→ (Y, τ)

)B

((Y, τ)

gr(σ, f )−−−−−→ (X, σ)

).

where gr(σ, f ) = 〈id, f 〉 � σ. This yields a functor since:

G(id(X,σ)

)= 〈id, id〉 � σ

= Eq(X,σ)

G(g ◦· f

)(x, z) = gr(σ, g ◦· f )(x, z)

= σ(x) · (g ◦· f )(x)(z)=

∑yσ(x) · f (x)(y) · g(y)(z)

=∑

y

σ(x) · f (x)(y) · τ(y) · g(y)(z)τ(y)

=∑

y

gr(σ, f )(x, y) · gr(τ, g)(y, z)τ(y)

=(gr(τ, g) • gr(σ, f )

)(x, z)

=(G(g) • G( f )

)(x, z).

In the other direction we define a functor F : Cpl → Krn which is theidentity on objects and uses disintegration on morphisms: for ϕ : (X, σ) →(Y, τ) in Cpl one gets a channel F(ϕ) B dis1(ϕ) = ϕ

[0, 1

∣∣∣ 1, 0] : X → Y whichsatisfies, by construction (3.19):

ϕ = gr(ϕ[1, 0

], dis1(ϕ)

)= gr

(σ, F(ϕ)

)= GF(ϕ).

184

Page 197: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.8. Categorical aspects of Bayesian inversion 1853.8. Categorical aspects of Bayesian inversion 1853.8. Categorical aspects of Bayesian inversion 185

Moreover, F(ϕ) is a morphism (X, σ)→ (Y, τ) in Krn since:

F(ϕ) � σ = dis1(ϕ) �(ϕ[1, 0

])=

(gr

(ϕ[1, 0

], dis1(ϕ)

))[0, 1

] (3.18)= ϕ

[0, 1

]= τ.

We still need to prove that F preserves identities and composition. This fol-lows by uniqueness of disintegration:

〈id, F(Eq(X,σ))〉 � σ

= Eq(X,σ) by definition= ∆ � σ

= 〈id, id〉 � σ

〈id, F(ψ • ϕ)〉 � σ

= ψ • ϕ by definition(3.26)= 〈ϕ

[1, 0

∣∣∣ 0, 1], ψ[0, 1 ∣∣∣ 1, 0]〉 � τ

= (id ⊗ ψ[0, 1

∣∣∣ 1, 0]) � (〈ϕ

[1, 0

∣∣∣ 0, 1], id〉 � (ϕ[1, 0

∣∣∣ 0, 1] � σ))

= (id ⊗ F(ψ)) �(〈F(ϕ)†, id〉 �

(F(ϕ) � σ

))by Exercise 3.7.5

(3.21)= (id ⊗ F(ψ)) �

(〈id, F(ϕ)〉 � σ

)= 〈id, F(ψ) ◦· F(ϕ)〉 � σ.

We have seen that, by construction G ◦ F is the identity functor on thecategory Cpl. In the other direction we also have F ◦ G = id : Krn → Krn.This follows directly from uniqueness of disintegration.

In the end we like to show that the functors G and F commute with the dag-gers, on kernels and couplings. This works as follows. First, for f : (X, σ) →(Y, τ) we have:

G(f †

)(y, x) = gr(τ, f †)(y, x) = τ(y) · f †σ(y)(x)

= ( f � σ)(y) ·σ(x) · f (x)(y)( f � σ)(y)

= σ(x) · f (x)(y)= gr(σ, f )(x, y)= G( f )(x, y) = G( f )†(y, x).

185

Page 198: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

186 Chapter 3. Directed Graphical Models186 Chapter 3. Directed Graphical Models186 Chapter 3. Directed Graphical Models

Next, for ϕ : (X, σ)→ (Y, τ) in Cpl,

F(ϕ†

)(y)(x) = dis1

(ϕ†

)(y)(x)

(3.20)=

ϕ†(y, x)ϕ†

[1, 0

](y)

=ϕ(x, y)ϕ[0, 1

](y)

= ϕ[1, 0

∣∣∣ 0, 1](y)(x)=

(ϕ[0, 1

∣∣∣ 1, 0])†(y)(x) by Exercise 3.7.5= F(ϕ)†(y)(x).

We have done a lot of work in order to be able to say that Krn and Cpl areisomorphic dagger categories, or, more informally, that there is a one-one cor-respondence between probabilistic computations (channels) and probabilisticrelations.

Exercises

3.8.1 Prove in the context of the powerset channels of Example 3.8.1 that:

1 unit† = unit and (g ◦· f )† = f † ◦· g†.2 G(unit X) = EqX and G(g ◦· f ) = G( f ) • G( f ).3 F(EqX) = unit X and F(S • R) = F(S ) ◦· F(R).4 (S • R)† = R† • S †.5 F(R†) = F(R)†.

3.8.2 Give a string diagrammatic proof of the property f †† = f in Theo-rem 3.8.3. Prove also that ( f ⊗ g)† = f † ⊗ g†.

3.8.3 Prove the equation = in the definition (3.26) of composition • in thecategory Cpl. Give also a string diagrammatic description of •.

3.9 Factorisation of joint states

Earlier, in Subsection 1.8.1, we have called a binary joint state non-entwinedif it is the product of its marginals. This can be seen as an intrinsic property ofthe state, which we will express in terms of a string diagram, called its shape.For instance, the state:

ω = 14 |a, b〉 +

12 |a, b

⊥ 〉 + 112 |a

⊥, b〉 + 16 |a

⊥, b⊥ 〉

186

Page 199: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.9. Factorisation of joint states 1873.9. Factorisation of joint states 1873.9. Factorisation of joint states 187

is non-entwined: it is the product of its marginals ω[1, 0

]= 3

4 |a〉 + 14 |a

⊥ 〉 andω[0, 1

]= 1

3 |b〉 +23 |b

⊥ 〉. We will formulate this as:

ω has shapeor as:

ω factorises asand write this as: ω |≈ .

We shall give a formal definition of |≈ below, but at this stage it suffices to readσ |≈ S , for a state σ and a string diagram S , as: there is an interpretation of theboxes in S such that σ = [[ S ]].

In the above case of ω |≈ we obtain ω = ω1 ⊗ω2, for some state ω1 thatinterpretes the box on the left, and some ω2 interpreting the box on the right.But then:

ω[1, 0

]=

(ω1 ⊗ ω2

)[1, 0

]= ω1.

Similarly, ω[0, 1

]= ω2. Thus, in this case the interpretations of the boxes are

uniquely determined, namely as first and second marginal of ω.We conclude that non-entwinedness of an arbitrary binary joint state ω can

be expressed as: ω |≈ . Here we are interested in similar intrinsic ‘shape’properties of states and channels that can be expressed via string diagrams.These matters are often discussed in the literature in terms of (conditional)independencies. Here we postpone such independencies and only use stringdiagrams, for the time being.

In general there may be several interpretations of a string diagram (as shape).Consider for instance:

1425 |H 〉 +

1125 |T 〉 |≈

This state has this form in multiple ways, for instance as:

c1 � σ1 = 0.56|H 〉 + 0.44|T 〉 = c2 � σ2

for:

σ1 = 15 |1〉 +

45 |0〉 σ2 = 2

5 |1〉 +35 |0〉

c1(1) = 45 |H 〉 +

15 |T 〉 and c2(1) = 1

2 |H 〉 +12 |T 〉

c1(0) = 12 |H 〉 +

12 |T 〉 c2(0) = 3

5 |H 〉 +25 |T 〉.

We note that is not an accessible string diagram: the wire inbetween the two

boxes cannot be accessed from the outside. If these wires are accessible, thenwe can access the individual boxes of a string diagram and use disintegrationto compute them. We illustrate how this works.

187

Page 200: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

188 Chapter 3. Directed Graphical Models188 Chapter 3. Directed Graphical Models188 Chapter 3. Directed Graphical Models

Example 3.9.1. Consider two-element sets A = {a, a⊥}, B = {b, b⊥}, C = {c, c⊥}and D = {d, d⊥} and an (accessible) string diagram S of the form:

S =

σ

gf

A BC D

(3.27)

Now suppose we have a joint state ω ∈ D(A ×C × D × B) given by:

ω = 0.04|a, c, d, b〉 + 0.18|a, c, d, b⊥ 〉 + 0.06|a, c, d⊥, b〉 + 0.02|a, c, d⊥, b⊥ 〉+ 0.04|a, c⊥, d, b〉 + 0.18|a, c⊥, d, b⊥ 〉 + 0.06|a, c⊥, d⊥, b〉+ 0.02|a, c⊥, d⊥, b⊥ 〉 + 0.024|a⊥, c, d, b〉 + 0.018|a⊥, c, d, b⊥ 〉+ 0.036|a⊥, c, d⊥, b〉 + 0.002|a⊥, c, d⊥, b⊥ 〉 + 0.096|a⊥, c⊥, d, b〉+ 0.072|a⊥, c⊥, d, b⊥ 〉 + 0.144|a⊥, c⊥, d⊥, b〉 + 0.008|a⊥, c⊥, d⊥, b⊥ 〉

We ask ourselves: does ω |≈ S hold? More specifically, can we somehow ob-tain interpretations of the boxes σ, f and g in S so that ω = [[ S ]]? We shallshow that by appropriately using marginalisation and disintegration we can‘factorise’ this joint state according to the above string diagram.

First we can obtain the state σ ∈ D(A× B) by discarding the C,D outputs inthe middle, as in:

σ

gf

σ

σ= =

We can thus compute σ as:

σ = ω[1, 0, 0, 1

]=

(∑u,v ω(a, u, v, b)

)|a, b〉 +

(∑u,v ω(a, u, v, b⊥)

)|a, b⊥ 〉

+(∑

u,v ω(a⊥, u, v, b))|a⊥, b〉 +

(∑u,v ω(a⊥, u, v, b⊥)

)|a⊥, b⊥ 〉

=(0.04 + 0.06 + 0.04 + 0.06

)|a, b〉

+(0.18 + 0.02 + 0.18 + 0.02

)|a, b⊥ 〉

+(0.024 + 0.036 + 0.096 + 0.144

)|a⊥, b〉

+(0.018 + 0.002 + 0.072 + 0.008

)|a⊥, b⊥ 〉

= 0.2|a, b〉 + 0.4|a, b⊥ 〉 + 0.3|a⊥, b〉 + 0.1|a⊥, b⊥ 〉.

Next we concentrate on the channels f and g, from A to C and from B to D.

188

Page 201: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.9. Factorisation of joint states 1893.9. Factorisation of joint states 1893.9. Factorisation of joint states 189

We first illustrate how to restrict the string diagram to the relevant part viamarginalisation. For f we concentrate on:

σ

f

σ

f= =

σ

gf

The string diagram on the right tells us that we can obtain f via disintegrationfrom the marginal ω

[1, 1, 0, 0

], using that extracted channels are unique, in dia-

grams of this form, see (3.18). In the same way one obtains g from the marginalω[0, 0, 1, 1

]. Below we directly give the outcomes, and leave the details of the

calculations to the reader.

f = ω[0, 1, 0, 0

∣∣∣ 1, 0, 0, 0]=

a 7→

∑v,y ω(a, c, v, y)∑

u,v,y ω(a, u, v, y)|c〉 +

∑v,y ω(a, c⊥, v, y)∑u,v,y ω(a, u, v, y)

|c⊥ 〉

a⊥ 7→

∑v,y ω(a⊥, c, v, y)∑

u,v,y ω(a⊥, u, v, y)|c〉 +

∑v,y ω(a⊥, c⊥, v, y)∑u,v,y ω(a⊥, u, v, y)

|c⊥ 〉

=

a 7→ 0.5|c〉 + 0.5|c⊥ 〉a⊥ 7→ 0.2|c〉 + 0.8|c⊥ 〉

g = ω[0, 0, 1, 0

∣∣∣ 0, 0, 0, 1]=

b 7→

∑x,u ω(x, u, d, b)∑

x,u,v ω(x, u, v, b)|d 〉 +

∑x,u ω(x, u, d⊥, b)∑x,u,v ω(x, u, v, b)

|d⊥ 〉

b⊥ 7→∑

x,u ω(x, u, d, b⊥)∑x,u,v ω(x, u, v, b⊥)

|d 〉 +∑

x,v ω(x, u, d⊥, b⊥)∑x,u,v ω(x, u, v, b⊥)

|d⊥ 〉

=

b 7→ 0.4|d 〉 + 0.6|d⊥ 〉b⊥ 7→ 0.9|d 〉 + 0.1|d⊥ 〉

At this stage one can check that the joint state ω can be reconstructed fromthese extracted state and channels, namely as:

[[ S ]] = (id ⊗ f ⊗ g ⊗ id) �((∆ ⊗ ∆) � σ

)= ω.

This proves ω |≈ S .

We now come to the definition of |≈. We will use it for channels, and not justfor states, as used above.

Definition 3.9.2. Let A1, . . . , An, B1, . . . , Bm be finite sets. Consider a channelc : A1×· · ·×An → B1×· · ·×Bm and a string diagram S ∈ SD(Σ) over signature

189

Page 202: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

190 Chapter 3. Directed Graphical Models190 Chapter 3. Directed Graphical Models190 Chapter 3. Directed Graphical Models

Σ, with domain [A1, . . . , An] and codomain [B1, . . . , Bm]. We say that c and Shave the same type.

In this situation we write:

c |≈ S

if there is an interpretation of the boxes in Σ such that c = [[ S ]].We then say that c has shape S and also that c factorises according to S . Al-

ternative terminology is: c is Markov relative to S , or: S represents c. Given cand S , finding an interpretation of Σ such that c = [[ S ]] is called a factorisationproblem. It may have no, exactly one, or more than one solution (interpreta-tion), as we have seen above.

The following illustration is a classical one, showing how seemingly differ-ent shapes are related. It is often used to describe conditional independence —in this case of A,C, given B.

Lemma 3.9.3. Let a state ω ∈ D(A × B ×C) have full support. Then:

ω |≈ ⇐⇒ ω |≈ ⇐⇒ ω |≈

The string diagrams on the left is often called a fork; the other two are calledchain.

Proof. Since ω has full support, so have all its marginals. This allows us toperform all disintegrations below. We start on the left-hand-side, and assumean interpretation ω = 〈c, id, d〉 � τ, consisting of a state τ = ω

[0, 1, 0

]∈ D(B)

and channels c : B → A and d : B → C. We write σ = c � τ = ω[1, 0, 0

]and

take the Bayesian inversion c†τ : A → B. We now have an interpretation of thestring diagram in the middle, which is equal to ω, since by (3.21):

σ

c†τ

d

c

c†τ

d

=

τ

=

τ

d

=c

τ

d=

Similarly one obtains an interpretation of the string diagram on the right viaρ = d � τ = ω

[0, 0, 1

]and the inversion d†τ : C → B.

190

Page 203: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.9. Factorisation of joint states 1913.9. Factorisation of joint states 1913.9. Factorisation of joint states 191

In the direction (⇐) one uses Bayesian inversion in a similar manner totransform one interpretation into another one.

In Example 3.7.5 we have extracted certain structure from a joint state, givena particular string diagram (or shape) (3.24). This can be done quite generally,essentially as in Example 3.9.1. But note that the resulting interpretation of thestring diagram need not be equal to the original joint state — as Example 3.7.5shows.

Proposition 3.9.4. Let c be a channel with full support, of the same type asan accessible string diagram S ∈ SD(Σ) that does not contain . Then thereis a unique interpretation of Σ that can be obtained from c. In this way we canfactorise c according to S .

In the special case that c |≈ S holds, the factorisation interpretation of Sobtained in this way from c is the same one that gives [[ S ]] = c because ofc |≈ S and uniqueness of disintegrations.

Proof. We conveniently write multiple wires as a single wire, using producttypes. We use induction on the number of boxes in S . If this number is zero,then S only consists of ‘structural’ elements, and can be interpreted irrespec-tive of the channel c.

Now let S contain at least one box. Pick a box g at the top of S , with all ofits output wires directly coming out of S . Thus, since S is accessible, we mayassume that it has the form as described on the left below.

S =

g

hso that S ′ B

g

h

g

h=

By discarding the wire on the left we get a diagram S ′ as used in disintegra-tion, see Definition 3.5.4. But then a unique channel (interpretation) g can beextracted from [[ S ′ ]] = c

[0, 1, 1

].

We can now apply the induction hypothesis with the channel c[1, 1, 0

]and

corresponding string diagram from with the box g has been removed. It willgive an interpretation of all the other boxes — not equal to g — in S .

This result gives a way of testing whether a channel c has shape S : fac-torise c according to S and check if the resulting interpretation [[ S ]] equals c.Unfortunately, this is computationally rather expensive.

191

Page 204: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

192 Chapter 3. Directed Graphical Models192 Chapter 3. Directed Graphical Models192 Chapter 3. Directed Graphical Models

3.9.1 Shapes under conditioning

We have seen that a distribution can have a certain shape. An interesting ques-tion that arises is: what happens to such a shape when the distribution is up-dated? The result below answers this question, much like in [54], for threebasic shapes, called fork, chain and collider,

Proposition 3.9.5. Let ω ∈ D(X × Y × Z) be an arbitrary distribution and letq ∈ Fact(Y) be a factor on its middle component Y. We write a ∈ Y for anarbitrary element with associated point predicate 1a.

1 Let ω have fork shape:

ω |≈ then also ω|1⊗q⊗1 |≈ .

In the special case of conditioning with a point predicate we get:

ω|1⊗1a⊗1 |≈ .

2 Suppose ω has chain shape

ω |≈ then also ω|1⊗q⊗1 |≈ .

In the special case of a point predicate we get:

ω|1⊗1a⊗1 |≈ .

3 Let ω have collider shape:

ω |≈ then ω|1⊗q⊗1 |≈

For this shape it does not matter if q is a point predicate or not.

Proof. 1 Let’s assume we have an interpretation ω = 〈c, id, d〉 � σ, forc : Y → X, d : Y → Z and σ ∈ D(Y). Then:

ω|1⊗q⊗1

=(〈c, id, d〉 � σ

)|1⊗q⊗1

= 〈c, id, d〉|1⊗q⊗1 �(σ)|〈c,id,d〉�(1⊗q⊗1)

)by Corrolary 2.5.8 (2)

= 〈c, id, d〉 � (σ|q) by Exercises 2.3.9 and 2.4.6.

Hence we see the same shape that ω has.

192

Page 205: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.9. Factorisation of joint states 1933.9. Factorisation of joint states 1933.9. Factorisation of joint states 193

In the special case when q = 1a we can extend the above calculation andobtain a parallel product of states:

ω|1⊗1a⊗1 = 〈c, id, d〉 � (σ|1a ) as just shown=

((c ⊗ id ⊗ d) ◦· ∆3

)� 1|a〉 by Lemma 2.3.5 (2)

= (c ⊗ id ⊗ d) �(∆3 � 1|a〉

)= (c ⊗ id ⊗ d) �

(1|a〉 ⊗ 1|a〉 ⊗ 1|a〉

)= (c � 1|a〉) ⊗ 1|a〉 ⊗ (d � 1|a〉).

2 We now write the distribution ω via its interpretion of the chain shape as:

ω =(id ⊗ 〈id, d〉

)�

(〈id, c〉 � σ

)= 〈id, 〈id, d〉 ◦· c〉 � σ,

for σ ∈ D(X) with c : X → Y and d : Y → Z. Then:

ω|1⊗q⊗1 =(〈id, 〈id, d〉 ◦· c〉 � σ

)|1⊗q⊗1

= 〈id, 〈id, d〉 ◦· c〉|1⊗q⊗1 �(σ|〈id,〈id,d〉◦·c〉�(1⊗q⊗1)

)by Corrolary 2.5.8 (2)

= 〈id, (〈id, d〉 ◦· c)|q⊗1〉 �(σ|(id�1)&(〈id,d〉◦·c)�(q⊗1)

)= 〈id, 〈id, d〉|q⊗1 ◦· c|〈id,d〉�(q⊗1)〉 �

(σ|c�(〈id,d〉�(q⊗1))

)by Corrolary 2.5.8 (1)

= 〈id, 〈id, d〉 ◦· c|q〉 � (σ|c�q).

Thus we see the same chain shape again.

When q = 1a we get a parallel product of three states:

ω|1⊗1a⊗1(x, y, z) =(〈id, 〈id, d〉 ◦· c〉 � σ)(x, y, z) · (1 ⊗ 1a ⊗ 1)(x, y, z)

〈id, 〈id, d〉 ◦· c〉 � σ |= 1 ⊗ 1a ⊗ 1

=σ(x) · c(x)(y) · d(y)(z) · 1a(y)σ |= ((〈id, d〉 ◦· c) � (1a ⊗ 1))

=σ(x) · c(x)(a) · d(a)(z) · 1|a〉(y)σ |= c � (1a & (d � 1))

=σ(x) · (c � 1a)(x)σ |= c � 1a

· 1|a〉(y) · (d � 1|a〉)(z)

= σ|c�1a (x) · 1|a〉(y) · (d � 1|a〉)(z)=

(σ|c�1a ⊗ 1|a〉 ⊗ (d � 1|a〉)

)(x, y, z).

3 Let’s assume as interpretation of the collider shape:

ω =(id ⊗ c ⊗ id

)�

((∆ ⊗ ∆) � (σ ⊗ τ)

),

193

Page 206: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

194 Chapter 3. Directed Graphical Models194 Chapter 3. Directed Graphical Models194 Chapter 3. Directed Graphical Models

for states σ ∈ D(X) and τ ∈ D(Z) and channel c : X × Z → Y . Then:

ω|1⊗q⊗1

=((

id ⊗ c ⊗ id)�

((∆ ⊗ ∆) � (σ ⊗ τ)

))∣∣∣1⊗q⊗1

=(id ⊗ c|q ⊗ id

)�

(((∆ ⊗ ∆) � (σ ⊗ τ)

)∣∣∣1⊗(c�q)⊗1

)by Corrolary 2.5.8 (3)

=(id ⊗ c|q ⊗ id

)�

((∆ ⊗ ∆)|1⊗(c�q)⊗1 �

((σ ⊗ τ)

∣∣∣(∆⊗∆)�(1⊗(c�q)⊗1)

)by Theorem 2.5.7

=(id ⊗ c|q ⊗ id

)�

((∆ ⊗ ∆) �

((σ ⊗ τ)

∣∣∣(∆⊗∆)�(1⊗(c�q)⊗1)

)by Exercise 2.3.8.

The updated state (σ ⊗ τ)∣∣∣(∆⊗∆)�(1⊗(c�q)⊗1) is typically entwined, even if q is

a point predicate, see also Exercise 2.3.6.

The fact that conditioning with a point predicate destroys the shape, in theabove first two points, is an important phenomenon since it allows us to breakentwinedness / correlations. This is relevant in statistical analysis, esp. w.r.t.causality [83, 84], see Section ??. In such a context, conditioning on a pointpredicate is often expressed in terms of ‘controlling for’. For instance, if thereis a gender component G = {m, f } with elements m for male and f for fe-male, then conditioning with a point predicate 1m or 1 f , suitably weakenendvia tensoring with truth 1, amounts to controlling for gender. Via restriction toone gender value one fragments the shape and thus controls the influence ofgender in the situation at hand.

Exercises

3.9.1 Check the aim of Exercise 1.8.1 really is to prove

ω |≈

for the (ternary) state ω defined there.3.9.2 Let S ∈ SD(Σ) be given with an interpretation of the string diagram

signature Σ. Check that then [[ S ]] |≈ S .

3.10 Inference in Bayesian networks, reconsidered

Inference is one of the main topics in Bayesian probability. We have seen illus-trations of Bayesian inference in Section 2.7 for the Asia Bayesian network,using forward and back inference along a channel (see Definition 2.5.1). We

194

Page 207: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.10. Inference in Bayesian networks, reconsidered 1953.10. Inference in Bayesian networks, reconsidered 1953.10. Inference in Bayesian networks, reconsidered 195

are now in a position to approach the topic of inference more systematically,using string diagrams. We start by defining inference itself, namely as what wehave earlier called crossover inference.

Let ω ∈ D(X1 × · · · × Xn) be a joint state on a product space X1 × · · · × Xn.Suppose we have evidence p on one component Xi, in the form of a factor p ∈Fact(Xi). After updating with p, we are interested in the marginal distributionat component j. We will assume that i , j, otherwise the problem is simple andcan be reduced to updating the marginal at i = j with p, see Lemma 2.3.5 (6).

In order to update ω on X1 × · · · × Xn with p ∈ Fact(Xi) we first have toextend the factor p to a factor on the whole product space, by weakening it to:

πi � p = 1 ⊗ · · · ⊗ 1 ⊗ p ⊗ 1 ⊗ · · · ⊗ 1.

After the update with this factor we take the marginal at j in the form of:

π j �(ω|πi�p

)= ω|πi�p

[0, . . . , 0, 1, 0 . . . , 0

].

The latter provides a convenient form.

Definition 3.10.1. A basic inference query is of the form:

π j �(ω|πi�p

)(3.28)

for a joint state ω with an evidence factor p on its i-th component and with thej-th marginal as conclusion.

What is called inference is the activity of calculating the outcome of aninference query.

An inference query may also have multiple evidence factors pk ∈ Fact(Xik ),giving an inference query of the form:

π j �(ω|(πi1�p1) & ···& (πim�pim )

)= π j �

(ω|πi1�p1 · · · |πim�pim

).

By suitably swapping components and using products one can reduce such aquery to one of the from (3.28). Similarly one may marginalise on several com-ponents at the same time and again reduce this to the canonical form (3.28).

The notion of inference that we use is formulated in terms of joint states.This gives mathematical clarity, but not a practical method to actually computequeries. If the joint state has the shape of a Bayesian network, we can use thisnetwork structure to guide the computations. This is formalised in the nextresult: it described quite generally what we have been illustrating many timesalready, namely that inference can be done along channels — both forward andbackward — in particular along channels that interprete edges in a Bayesiannetwork.

195

Page 208: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

196 Chapter 3. Directed Graphical Models196 Chapter 3. Directed Graphical Models196 Chapter 3. Directed Graphical Models

Theorem 3.10.2. Let joint state ω have the shape S of a Bayesian network:ω |≈ S . An inference query π j �

(ω|πi�p

)can then be computed via forward

and backward inference along channels in the string diagram S .

Proof. We use induction on the number of boxes in S . If this number is one,S consists of a single state box, whose interpretation is ω. Then we are donethe crossover inference Corollary 2.5.9.

We now assume that S contains more than one box. We pick a box at the topof S so that we can write:

ω =(idX1×···×Xk−1 ⊗ c ⊗ idXk+1×···×Xn

)� ρ where ρ |≈ S ′,

and where S ′ is the string diagram obtained from S by removing the singlebox whose interpretation we have written as channel c, say of type Y → Xk.The induction hypothesis applies to ρ and S ′.

Consider the situation below wrt. the underlying product space X1 × · · · ×Xn

of ω,

evidencep��

marginal

��

X1 × · · · × Xi × · · · × Xk × · · · × X j × · · · × Xn

channelcOO

Below we distinguish three cases about (in)equality of i, k, j, where, recall, thatwe assume i , j.

1 i , k and j , k. In that case we can compute the inference query as:

π j �(ω|πi�p

)= π j �

((id ⊗ c ⊗ id) � ρ

)|πi�p

)= π j �

((id ⊗ c ⊗ id) �

(ρ|πi�p

))by Exercise 2.3.5 (1)

= π j �(ρ|πi�p

).

Hence we are done by the induction hypothesis.Aside: we are oversimplifying by assuming that the domain of the channel

c is not a product space; if it is, say Y1 × · · · × Ym of length m > 1 we shouldnot marginalise at j but at j + m − 1. A similar shift may be needed for theweakening πi � p if i < k.

2 i = k, and hence j , k. Then:

π j �(ω|πi�p

)= π j �

((id ⊗ c ⊗ id) � ρ

)|πi�p

)= π j �

(ρ|(id⊗c⊗id)�(πi�p)

)by Exercise 2.3.5 (2)

= π j �(ρ|πi�(c�p)

).

The induction hypothesis now applies with transformed evidence c � p.

196

Page 209: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.10. Inference in Bayesian networks, reconsidered 1973.10. Inference in Bayesian networks, reconsidered 1973.10. Inference in Bayesian networks, reconsidered 197

smoking

asia

tub

lung

either

bronc

dysp

xray

S D X

smoking

asia

tub

lung

either

bronc

dysp

xray

S D X

Figure 3.2 Two ‘stretchings’ of the Asia Bayesian network from Section 2.7, withthe same semantics.

Aside: again we are oversimplifying; when the channel c has a productspace as domain, the single projection πi in the conditioning-factor πi �

(c � p) may have to be replaced by an appropriate tuple of projections.3 j = k, and hence i , k. Then:

π j �(ω|πi�p

)= π j �

((id ⊗ c ⊗ id) � ρ

)|πi�p

)= π j �

((id ⊗ c ⊗ id) �

(ρ|πi�p

))by Exercise 2.3.5 (1)

= c �(π j �

(ρ|πi�p

)).

Again we are done by the induction hypothesis.Instead of a channel c, as used above, one can have a diagonal or a swap

map in the string diagram S . Since diagonals and swaps are given by func-tions f — forming trivial channels — Lemma 2.3.5 (7) applies, so that wecan proceed via evaluation of πi ◦ f .

Example 3.10.3. We shall recompute the query ‘smoking, given a positivexray’ in the Asia Bayesian network from Section 2.7, this time using the mech-anism of Theorem 3.10.2. In order to do this we have to choose a particularorder in the channels of the network. Figure 3.2 give two such orders, whichwe also call ‘stretchings’ of the network. These two string diagrams have the

197

Page 210: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

198 Chapter 3. Directed Graphical Models198 Chapter 3. Directed Graphical Models198 Chapter 3. Directed Graphical Models

same semantics, so it does not matter whether which one we take. For thisillustration we choose the one on the left.

This means that we can write the joint state ω ∈ D(S ×D×X) as composite:

ω = (id ⊗ dysp ⊗ id) ◦· (id ⊗ id ⊗ id ⊗ xray) ◦· (id ⊗ id ⊗ ∆2)◦· (id ⊗ id ⊗ either) ◦· (id ⊗ id ⊗ id ⊗ tub) ◦· (id ⊗ id ⊗ id ⊗ asia)◦· (id ⊗ id ⊗ lung) ◦· (id ⊗ dysp ⊗ id) ◦· ∆3 ◦· smoking

The positive xray evidence translates into a point predicate 1x on the set X =

{x, x⊥} of xray outcomes. It has to be weakened to π3 � 1x = 1 ⊗ 1 ⊗ 1x

on the product space S × D × X. We are interested in the updated smokingdistribution, which can be obtained as first marginal. Thus we compute thefollowing inference query, where the number k in k

= refers to the k-th of thethree distinctions in the proof of Theorem 3.10.2.

π1 �(ω∣∣∣π3�1x

)= π1 �

(((id ⊗ dysp ⊗ id) ◦· · · ·

)∣∣∣π3�1x

)1= π1 �

(((id ⊗ id ⊗ id ⊗ xray) ◦· · · ·

)∣∣∣π4�1x

)2= π1 �

(((id ⊗ id ⊗ ∆2) ◦· · · ·

)∣∣∣π4�(xray�1x)

)= π1 �

(((id ⊗ id ⊗ either) ◦· · · ·

)∣∣∣π3�(xray�1x)

)2= π1 �

(((id ⊗ id ⊗ id ⊗ tub) ◦· · · ·

)∣∣∣〈π3,π4〉�(either�(xray�1x))

)2= π1 �

(((id ⊗ id ⊗ id ⊗ asia) ◦· · · ·

)∣∣∣〈π3,π4〉�((id⊗tub)�(either�(xray�1x)))

)2= π1 �

(((id ⊗ id ⊗ lung) ◦· · · ·

)∣∣∣π3�((id⊗asia)�((id⊗tub)�(either�(xray�1x))))

)2= π1 �

(((id ⊗ dysp ⊗ id) ◦· · · ·

)∣∣∣π3�(lung�((id⊗asia)�((id⊗tub)�(either�(xray�1x)))))

)1= π1 �

((∆3 ◦· smoking

)∣∣∣π3�(lung�((id⊗asia)�((id⊗tub)�(either�(xray�1x)))))

)= smoking

∣∣∣lung�((id⊗asia)�((id⊗tub)�(either�(xray�1x))))

= smoking∣∣∣(xray◦·either◦· (id⊗tub)◦· (id⊗asia)◦· lung)�1x

= 0.6878| s〉 + 0.3122| s⊥ 〉.

It is a exercise below to compute the same inference query for the stretchingon the right in Figure 3.2. The outcome is necessarily the same, by Theo-rem 3.10.2, since there it is shown that all such inference computations areequal to the computation on the joint state.

The inference calculation in Example 3.10.3 is quite mechanical in natureand can thus be implemented easily, giving a channel-based inference algo-rithm, see [46] for Bayesian networks. The algorithm consists of two parts.

1 It first finds a stretching of the Bayesian network with a minimal width, that

198

Page 211: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.11. Updating and adapting: Pearl versus Jeffrey 1993.11. Updating and adapting: Pearl versus Jeffrey 1993.11. Updating and adapting: Pearl versus Jeffrey 199

is, a description of the network as a sequence of channels such that the statespace inbetween the channels has minimal size. As can be seen in Figure 3.2,there can be quite a bit of freedom in choosing the order of channels.

2 Perform the calculation as in Example 3.10.3, following the steps in theproof of Theorem 3.10.2.

The resulting algorithm’s performance compares favourably to the performanceof the pgmpy Python library for Bayesian networks [3]. By design in uses(fuzzy) factors and not (sharp) events and thus solves the “soft evidential up-date” problem [99].

Exercises

3.10.1 In Example 3.10.3 we have identified a state on a space X with a chan-nel 1 → X, as we have done before, e.g. in Exercise 1.7.2. Considerstates σ ∈ D(X) and τ ∈ D(Y) with a factor p ∈ Fact(X × Y). Provethat, under this identification,

σ|(id⊗τ)�p = (σ ⊗ τ)|p[1, 0

].

3.10.2 Compute the inference query from Example 3.10.3 for the stretchingof the Asia Bayesian network on the right in Figure 3.2.

3.10.3 From Theorem 3.10.2 one can conclude that the for the channel-basedcalculation of inference queries the particular stretching of a Bayesiannetwork does not matter — since they are all equal to inference onthe joint state. Implicitly, there is a ‘shift’ result about forward andbackward inference.

Let c : A → X and d : B → Y be channels, together with a jointstate ω ∈ D(A×X) and a factor q : Y → R≥0 on Y . Then the followingmarginal distributions are the same.((

(c ⊗ id) � ω)∣∣∣

(id⊗d)�(1⊗q)

)[1, 0

]=

((c ⊗ id) �

(((id ⊗ d) � ω)

∣∣∣1⊗q

))[1, 0

].

1 Prove this equation.2 Give an interpretation of this equation in relation to inference cal-

culations.

3.11 Updating and adapting: Pearl versus Jeffrey

The evidence that is standardly used in Bayesian updating is in the form ofevents, that is, of subsets, or equivalently, of sharp predicates. In this book, in

199

Page 212: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

200 Chapter 3. Directed Graphical Models200 Chapter 3. Directed Graphical Models200 Chapter 3. Directed Graphical Models

contrast, we have been using fuzzy evidence right from the start, with valuesin the unit interval [0, 1], or more generally, in R≥0. Updating has been definedfor such fuzzy evidence, see Section 2.3.

In the Bayesian literature however this approach is non-standard, and updat-ing with fuzzy (or: soft or uncertain) evidence has been seen as problematic.Two approaches emerged, one named after Pearl [80, 82], and one after Jef-frey [56], see e.g. [11, 20, 22, 99].

Pearl’s approach is described operationally: extend a Bayesian network withan auxiliary binary node, so that soft evidence can be emulated in terms ofhard evidence on this additional node, and so that the usual inference methodscan be applied. This extension of a Bayesian network at node X with a binarynode 2 and an associated conditional probability table, corresponds to adding achannel X → 2. But such channels correspond to fuzzy predicates X → [0, 1],sinceD(2) � [0, 1], see also Exercise 2.4.8.

What is called Pearl’s rule can be identified with what we call backwardinference, in Section 2.5. We have seen that backward inference forms a pow-erful reasoning method, where we have a channel c : X → Y that mediatesbetween a state ω ∈ D(X) on the channel’s domain X, that must be updatedwith evidence q ∈ Fact(Y) on the channel’s codomain Y . This works via theformula ω|c�q, where the factor q is first transformed along the channel c andthen used for updating the state ω.

Jeffrey’s rule forms an alternative, for which we shall use the word ‘adap-tation’ instead of ‘update’. It applies in a similar situation, with a channelc : X → Y and a state ω ∈ D(X). But now there is no evidence factor onY , but a state ρ ∈ D(Y). It turns out that there is snappy description of Jef-frey’s rule using the dagger (Bayesian inversion) c† of the channel c and statetransformation along c†, see [49]. One then adapts ω to a new state, namely:c†ω � ρ.

We summarise the situation.

Definition 3.11.1. Let c : X → Y be channel with a prior state ω ∈ D(X).

1 Pearl’s rule involves evidence-based updating of the prior ω with a factorq ∈ Fact(Y) to the posterior:

ω|c�q ∈ D(X).

2 Jeffrey’s rule involves state-based adaptation of the prior ω with a state ρ ∈D(Y) to the posterior:

c†ω � ρ ∈ D(X).

We shall illustrate the difference between Pearl’s and Jeffreys’s rules.

200

Page 213: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.11. Updating and adapting: Pearl versus Jeffrey 2013.11. Updating and adapting: Pearl versus Jeffrey 2013.11. Updating and adapting: Pearl versus Jeffrey 201

Example 3.11.2. Consider the Bayesian network / string diagram below, withits interpretation via two states and a channel alarm : B× E → A as defined onthe right, for 2-element sets B = {b, b⊥}, E = {e, e⊥} and A = {a, a⊥}.

burglary earthquake

alarm

B

A

E

burglary = 0.01|b〉 + 0.99|b⊥ 〉earthquake = 0.000001|e〉 + 0.999999|e⊥ 〉alarm(b, e) = 0.9999|a〉 + 0.0001|a⊥ 〉

alarm(b, e⊥) = 0.99|a〉 + 0.01|a⊥ 〉alarm(b⊥, e) = 0.99|a〉 + 0.01|a⊥ 〉

alarm(b⊥, e⊥) = 0.0001|a〉 + 0.9999|a⊥ 〉

The following question is asked in [5, Example 3.1 and 3.2]:

Imagine that we are 70% sure we heard the alarm sounding. What is the probability ofa burglary?

Via the backward inference approach that we have used so far, that is, viaPearl’s rule, we interpret the given evidence as a fuzzy/soft predicate p : A →[0, 1] with p(a) = 0.7 and p(a⊥) = 0.3. This predicate p can be seen as anextension of the above Bayesian network with a box and table/interpretation:

2

A

p(a) = 0.7p(a⊥) = 0.3.

The required outcome is then obtained via backward inference, followed bymarginalisation to the burglary case:(

burglary ⊗ earthquake)|alarm�p

[1, 0

]= 0.0229|b〉 + 0.9771|b⊥ 〉.

In contrast, the outcome given in [5] is:

0.693|b〉 + 0.307|b⊥ 〉.

The difference in outcomes is substantial: 2% versus 70%. What is going on?The second answer is motivated in [5] via ‘Jeffrey’s’ rule. As mentioned

above, this method can be formulated in terms of Bayesian inversion (dagger).First, we interpret the evidence as a state ρ = 0.7|a〉 + 0.3|a⊥ 〉. We can takethe Bayesian inversion of the channel alarm : B × E → A wrt. the productstate given by burglary and earthquake, and then do state transformation andmarginalisation:(

alarm†burglary⊗earthquake � ρ)[

1, 0]

= 0.693|b〉 + 0.307|b⊥ 〉.

This is the formulation of Jeffrey’s rule in Definition 3.11.1.

201

Page 214: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

202 Chapter 3. Directed Graphical Models202 Chapter 3. Directed Graphical Models202 Chapter 3. Directed Graphical Models

Figure 3.3 The outcomes of Jeffrey’s rule (JR) versus Pearl’s rule (PR) from Ex-ample 3.11.2: posterior burglary probability, given alarm evidence certainty, start-ing from a priori burglary probability of 0.01.

We look a bit closer at the difference between these two rules in this exam-ple. If we are 100% certain that we heard the alarm the burglary distributionbecomes:(

burglary ⊗ earthquake)|alarm�1a

[1, 0

]= 0.99|b〉 + 0.01|b⊥ 〉=

(alarm†burglary⊗earthquake � 1|a〉

)[1, 0

].

Indeed, as we shall see below, Pearl’s and Jeffrey’s rules agree on point evi-dence/states.

We next notice that the outcomes of Jeffrey’s rule look linear: about 70%burglar probability for 70% certainty about the alarm, and about 100% for100% certainty. This is indeed what the blue straight line for Jeffrey’s rule (JR)shows in Figure 3.3. It describes the function [0, 1]→ [0, 1] given by:

r 7−→(alarm†burglary⊗earthquake � (r|a〉 + (1 − r)|a⊥ 〉)

)[1, 0

](b).

The red bent line describes Pearl’s rule (PR) via the function:

r 7−→(burglary ⊗ earthquake

)|alarm�(r1a+(1−r)1a⊥ )

[1, 0

](b)

The plots in Figure 3.3 and 3.4 show that changing the prior burglary proba-bility gives different outcomes, but does not fundamentally change the shapes.

The question remains: is there a ‘right’ way to do update/adapt in this sit-uation, via Pearl’s or via Jeffrey’s rule, and if so, which one is right, under

202

Page 215: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.11. Updating and adapting: Pearl versus Jeffrey 2033.11. Updating and adapting: Pearl versus Jeffrey 2033.11. Updating and adapting: Pearl versus Jeffrey 203

Figure 3.4 Jeffrey’s rule (JR) versus Pearl’s rule (PR) in Example 3.11.2 but nowwith a priori burglary probability of 0.1.

which circumstances? This is an urgent question, since important decisions,e.g. about medical treatments, are based on probabilistic inference.

We briefly focuses on some commonalities and differences.

Lemma 3.11.3. Let c : X → Y be a channel with a prior state ω ∈ D(X).

1 Pearl’s rule and Jeffrey’s rule agree on points: for y ∈ Y, with associatedpoint predicate 1y and point state 1|y〉, one has:

ω|c�1y = c†ω(y) = c†ω � 1|y〉.

2 Pearl’s updating with a constant predicate (no information) does not changethe prior state ω:

ω|c�r·1 = ω, for r > 0.

3 Jeffreys’s adaptation with the predicted state c � ω does not change theprior state ω:

c†ω � (c � ω) = ω.

4 Multiple updates with Pearl’s rule commute, but multiple adaptations withJeffrey’s rule do not commute.

Proof. These points follow from earlier results.

1 Easy, by Proposition 3.7.2.

203

Page 216: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

204 Chapter 3. Directed Graphical Models204 Chapter 3. Directed Graphical Models204 Chapter 3. Directed Graphical Models

2 By Lemma 2.3.5 (1) and (4).3 This is Equation (3.22).4 By Lemma 2.3.5 (3) we have:

ω|c�q1 |c�q2 = ω|(c�q1)&(c�q2) = ω|c�q2 |c�q1 .

However, in general,

c†c†ω�ρ1

� ρ2 , c†c†ω�ρ2

� ρ1.

Probabilistic updating may be used as a model for what is called primingin cognition theory, see e.g. [35, 39]. It is well-known that the human mind issensitive to the order in which information is processed, that is, to the order ofpriming/updating. Thus, point (4) above suggests that Jeffrey’s rule might bemore appropriate in such a setting.

Points (2) and (3) present different views on ‘learning’, where learning isnow used in an informal sense: point (2) says that according to Pearl’s rule youlearning nothing when you get no information; but point (3) tells that accordingto Jeffrey you learn nothing when you are presented with what you alreadyknow. Both interpretations make sense.

This last fact about Jeffrey’s rule has led to the view that one should ap-ply Jeffrey’s adaptation rule in situations where one is confronted with ‘sur-prises’ [24] or with ‘unanticipated knowledge’ [23]. This does not give a pre-cise criterion about when to use which rule, but it does give an indication. Thewhole issue which is still poorly understood. We refer to [49] for more dis-cussion on this intruiging and important matter, in terms of ‘factoring in’ evi-dence (Pearl) versus ‘adjusting to’ or ‘adaptation to’ evidence (Jeffrey). Herewe include an example from [24] (also used in [49]) that makes this ‘surprise’perspective more concrete.

Example 3.11.4. The following setting is taken from [24]. Ann must decideabout hiring Bob, whose characteristics are described in terms of competence(c or c⊥) and experience (e or e⊥). The prior is a joint distribution on the productspace C × E, for C = {c, c⊥} and E = {e, e⊥}, given as:

ω = 410 |c, e〉 +

110 |c, e

⊥ 〉 + 110 |c

⊥, e〉 + 410 |c

⊥, e⊥ 〉.

The first marginal of ω is the uniform distribution 12 |c〉+

12 |c

⊥ 〉. It is the neutralbase rate for Bob’s competence.

We use the two projection functions Cπ1←− C × E

π2−→ E as deterministic

channels along which we reason, using both Pearl’s and Jeffrey’s rules.When Ann would learn that Bob has relevant work experience, given by

point evidence 1e, her strategy is to factor this in via Pearl’s rule / backward

204

Page 217: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.11. Updating and adapting: Pearl versus Jeffrey 2053.11. Updating and adapting: Pearl versus Jeffrey 2053.11. Updating and adapting: Pearl versus Jeffrey 205

inference: this gives as posterior ω|π2�1e = ω|1⊗1e , whose first marginal is45 |c〉 +

15 |c

⊥ 〉. It is then more likely that Bob is competent.Ann reads Bob’s letter to find out if he actually has relevant experience. We

quote from [24]:

Bob’s answer reveals right from the beginning that his written English is poor. Ann no-tices this even before figuring out what Bob says about his work experience. In responseto this unforeseen learnt input, Ann lowers her probability that Bob is competent from12 to 1

8 . It is natural to model this as an instance of Jeffrey revision.

Bob’s poor English is a new state of affairs — a surprise — which translates toa competence state ρ = 1

8 |c〉+78 |c

⊥ 〉. Ann wants to adapt to this new surprisingsituation, so she uses Jeffrey’s rule, giving a new joint state:

ω′ = (π1)†ω � ρ = 110 |c, e〉 +

140 |c, e

⊥ 〉 + 740 |c

⊥, e〉 + 710 |c

⊥, e⊥ 〉.

If the letter now tells that Bob has work experience, Ann will factor this in, inthis new situation ω′, giving, like above, via Pearl’s rule followed by marginal-isation,

ω′|π2�1e

[1, 0

]= 4

11 |c〉 +7

11 |c⊥ 〉.

The likelihood of Bob being competent is now lower than in the prior state,since 4

11 <12 . This example reconstructs the illustration from [24] in channel-

based form, with the associated formulations of Pearl’s and Jeffrey’s rules fromDefinition 3.11.1, and produces exactly the same outcomes as in loc. cit.

There are translations back-and-forth between the Pearl’s and Jeffrey’s rules,due to [11]; they are translated here to the current setting.

Proposition 3.11.5. Let c : X → Y be a channel with a prior state ω ∈ D(X)on its domain, such that τ = c � ω has full support.

1 Pearl’s updating can be expressed as Jeffrey’s adaptation, by turning a fac-tor q : Y → R≥0 into a state τ|q ∈ D(Y), so that:

c†ω �(τ|q

)= ω|c�q.

2 Jeffrey’s adaptation can also be expressed as Pearl’s updating: for a stateρ ∈ D(Y) write ρ/τ for the factor y 7→ ρ(y)

τ(y) ; then:

c†ω � ρ = ω|c�ρ/τ.

Proof. The first point is exactly Theorem 3.7.7 (1). For the second point wefirst note that τ |= ρ/τ = 1, since ρ is a state:

τ |= ρ/τ =∑

yτ(y) · ρ(y)

τ(y) =∑

yρ(y) = 1.

205

Page 218: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

206 Chapter 3. Directed Graphical Models206 Chapter 3. Directed Graphical Models206 Chapter 3. Directed Graphical Models

But then, for x ∈ X,(c†ω � ρ

)(x) =

∑yρ(y) · c†ω(y)(x)

(3.23)=

∑yρ(y) ·

ω(x) · c(x)(y)(c � ω)(y)

= ω(x) ·∑

yc(x)(y) ·

ρ(y)τ(y)

since τ = c � ω

=ω(x) · (c � ρ/τ)(x)

c � ω |= ρ/τas just shown

=ω(x) · (c � ρ/τ)(x)ω |= c � ρ/τ

=(ω|c�ρ/τ

)(x).

We have frequently seen that Pearl-style updating in one component of ajoint state, as ω|1⊗q, can be expressed via channels extracted from ω. A naturalquestion is whether the same can be done for Jeffrey’s approach. In order toclarify this, let’s marginalise ω ∈ D(X × Y) to σ B ω

[1, 0

]∈ D(X) and

τ B ω[0, 1

]∈ D(Y), and disintegrate ω to c B ω

[0, 1

∣∣∣ 1, 0] : X → Y andd B ω

[1, 0

∣∣∣ 0, 1] : Y → X in:

ω =

c d

τσ

=

As a result, d = c†σ and c = d†τ .In this situation we have, for a factor q ∈ Fact(Y),

ω|1⊗q = 〈d, id〉 �(τ|q

)=

d

τ|q

and ω|1⊗q[1, 0

]= σ|c�q.

The first equation is Corollary 2.5.8 (2) and the second one is Corollary 2.5.9 (2).A cross-over version of Jeffrey-style adaption of the joint prior state ω with

ρ ∈ D(Y) is then defined as the posterior:

〈d, id〉 � ρ =d

ρ

206

Page 219: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.11. Updating and adapting: Pearl versus Jeffrey 2073.11. Updating and adapting: Pearl versus Jeffrey 2073.11. Updating and adapting: Pearl versus Jeffrey 207

The first marginal of this adapted joint state is d � ρ = c†σ � ρ, as in Def-inition 3.11.1. The second marginal is the state ρ that we wish to adapt to.This explains why in the context of Jeffrey’s rule the ‘evidence’ state is oftendescribed as a posterior:ω is adapted in such a way that its second marginal be-comes equal to ρ. We shall see several examples of such Jeffrey-style updatingin the Exercises below, notably in 3.11.3 and 3.11.4.

Exercises

3.11.1 Check for yourself the claimed outcomes in Example 3.11.4:

1 ω|π2�1e

[1, 0

]= 4

5 |c〉 +15 |c

⊥ 〉;2 ω′ B (π1)†ω � ρ = 1

10 |c, e〉 +140 |c, e

⊥ 〉 + 740 |c

⊥, e〉 + 710 |c

⊥, e⊥ 〉;3 ω′|π2�1e

[1, 0

]= 4

11 |c〉 +7

11 |c⊥ 〉.

3.11.2 This example is taken from [22], where it is attributed to Whitworth:there are three contenders A, B,C for winning a race with a prori dis-tribution ω = 2

11 |A〉 + 411 |B〉 + 5

11 |C 〉. Surprising information comesin that A’s chances have become 1

2 . What are the adapted chances ofB and C?

1 Split up the sample space X = {A, B,C} into a suitable two-elementpartition via a function f : X → 2.

2 Use the uniform distribution υ = 12 |1〉 + 1

2 |0〉 on 2 and show thatJeffrey’s rule gives as adapted distribution:

f †ω � υ = 12 · ω|1{A} + 1

2 · ω|1{B,C} = 12 |A〉 +

29 |B〉 +

518 |C 〉.

3.11.3 The following alarm example is in essence due to Pearl [80], seealso [20, §3.6]; we have adapted the numbers in order to make thecalculations a bit easier. There is an ‘alarm’ set A = {a, a⊥} and a ‘bur-glary’ set B = {b, b⊥}, with the following a priori joint distribution onA × B.

ω = 1200 |a, b〉 +

7500 |a, b

⊥ 〉 + 11000 |a

⊥, b〉 + 98100 |a

⊥, b⊥ 〉.

Someone reports that the alarm went off, but with only 80% certaintybecause of deafness.

1 Translate the alarm information into a predicate p : A→ [0, 1] andshow that crossover updating leads to a burglary distribution:

ω|p⊗1[0, 1

]= 21

1057 |b〉 +10361057 |b

⊥ 〉.

207

Page 220: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

208 Chapter 3. Directed Graphical Models208 Chapter 3. Directed Graphical Models208 Chapter 3. Directed Graphical Models

2 Compute the extracted channel c = ω[1, 0

∣∣∣ 0, 1] : B→ A via disin-tegration, and express the answer in the previous point in terms ofbackward inference using c.

3 Use the Bayesian inversion/dagger d = c†ω[0,1] = ω[0, 1

∣∣∣ 1, 0] : A→B of this channel c, see Exercise 3.7.5, to calculate the outcome ofJeffrey’s adaption as:

d �( 4

5 |a〉 +15 |a

⊥ 〉)

= 1963993195 |b〉 +

7355693195 |b

⊥ 〉.

(Notice again the considerable difference in outcomes.)

3.11.4 The next illustration is attributed to Jeffrey, and reproduced for in-stance in [11, 20]. We consider three colors: green (g), blue (b) andviolet (v), which are combined in a space C = {g, b, v}. These colorsapply to cloths, which can additionally be sold or not, as representedby the space S = {s, s⊥}. There is a prior joint distribution τ on C × S ,namely:

τ = 325 |g, s〉 +

950 |g, s

⊥ 〉 + 325 |b, s〉 +

950 |b, s

⊥ 〉 + 825 |v, s〉 +

225 |v, s

⊥ 〉.

A cloth is inspected by candlelight and the following likelihoods arereported per color: 70% certainty that it is green, 25% that it is blue,and 5% that it is violet.

1 Show that the updated sales distribution via the evidence-based(crossover) inference is:

2661 | s〉 +

3561 | s

⊥ 〉.

2 Prove that Jeffrey’s adaptation approach, obtained via state trans-formation along c = τ

[0, 1

∣∣∣ 1, 0] : C → S , yields:

c �( 7

10 |g〉 +14 |b〉 +

120 |v〉

)= 21

50 | s〉 +2950 | s

⊥ 〉.

3 Check also that:

〈id, c〉 �( 7

10 |g〉 +14 |b〉 +

120 |v〉

)= 14

50 |g, s〉 +2150 |g, s

⊥ 〉)

+ 110 |b, s〉

+ 320 |b, s

⊥ 〉)

+ 125 |v, s〉 +

1100 |v, s

⊥ 〉).

The latter outcome is given in [20].

3.11.5 Jeffrey’s rule is frequently formulated (notably in [36], to which werefer for details) and used (like in Exercise 3.11.2), in situations wherethe channel involved is deterministic. Consider an arbitrary functionf : X → I, giving a partition of the set X via subsets Ui = f −1(i) =

{x ∈ X | f (x) = i}. Let ω ∈ D(X) be a prior.

208

Page 221: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

3.11. Updating and adapting: Pearl versus Jeffrey 2093.11. Updating and adapting: Pearl versus Jeffrey 2093.11. Updating and adapting: Pearl versus Jeffrey 209

1 Show that applying Jeffrey’s rule to a new state of affairs ρ ∈ D(I)gives as posterior:

f †ω � ρ =∑

i ρ(i) · ω|1Uisatisfying f �

(f †ω � ρ

)= ρ.

2 Prove that:

d(f †ω � ρ, ω) =

∧{d(ω,ω′) | ω′ ∈ D(X) with f � ω′ = ρ},

where d is the total variation distance from Section 2.8.

3.11.6 Prove that for a general, not-deterministic channel c : X → Y withprior state ω ∈ D(X) and state ρ ∈ D(Y) that there is an inequality:

d(c†ω � ρ, ω) ≤

∧ω′∈D(X)

d(ω,ω′) + d(c � ω′, ρ).

209

Page 222: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

4

Learning of States and Channels

The aim of this chapter is to present basic structure underlying the learningmethods for states and channels. We have already seen learning in a simple,frequentist form, namely as a natural transformation Flrn : M∗ ⇒ D from(non-empty) multisets to probability distributions. Below we shall see that thisleads to a maximal validity, in a suitable sense. In general, probabilistic learn-ing is described as consisting of small steps that need to be repeated in orderto reach a certain optimum. We concentrate on the nature of these small steps,and not on (convergence of) the repetitions: that’s a separate topic.

These small learning steps can be introduced as follows. Consider the valid-ity expression:

ω |= p

distribution / state

CC

observable / evidence

[[(4.1)

As we shall understand it here, learning involves increasing this validity, bychanging the state ω into a new state ω′ such that ω′ |= p ≥ ω |= p. Thus,in learning one takes the evidence p as a given, fixed datum that one needs toadjust to. Learning happens by changing the state ω so that it better fits theevidence.

This may be understood in an amateur-psychological sense: the state ω

somehow captures (a consistent portion of) one’s state of my mind. When con-fronted with evidence p, one learns p by changing one’s internal state ω intoω′, where the validity of p is higher in ω′ than in ω. Thus, in learning oneadapts one’s mind to the world, as given by the evidence p.

One important way to learn in the above situation is to update (condition) thestateωwith the evidence p, as introduced in Section 2.3. Indeed, a fundamental

210

Page 223: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

4.1. Learning a channel from tabular data 2114.1. Learning a channel from tabular data 2114.1. Learning a channel from tabular data 211

result in this chapter, namely Theorem 4.3.1 (2), tells that there is an inequality:

ω|p |= p ≥ ω |= p.

This captures an important intuition behind conditioning with p: it changes thestate so that p becomes ‘more true’.

Earlier we have seen validity ω |= p of an observable in a state, like in (4.1),but also validity H |= ~p of a list of observables ~p in a hidden Markov modelH , see Section 3.4. In this chapter we shall define new forms of validity |=M and|=C for multisets of data. In each of these cases we shall develop an associatedform of learning, via step-wise increase of validity, by introducing a ‘better’state. We provide explicit proofs for all of these increases, which all go backto an elementary method from [7]. It is reproduced here as a ‘sum-increaselemma’, see Lemma 4.4.2. This method tells us that we should look for a max-imum of a certain function, which is determined each time via elementary realanalysis, using the Lagrange multiplier method. We shall see many instancesof this powerful technique, which forms the basis for all learning results in thischapter.

These learning methods are applied to Markov chains and hidden Markovmodels in Section 4.5 and 4.6. In addition, they are used to the describe thegeneral Expectation-Maximisation (EM) learning method [21]. It consists oftwo parts, called Expectation (E) and Maximisation (M). We identify them aslearning a ‘latent’ state along a channel, and learning a channel itself. We shallidentify several forms of Expectation-Maximisation, for the different formsof validity |=M and |=C and also for their combination. Since there are so manyversions of learning, this chapter elaborates many examples — mostly takenfrom the literature — in order to clarify the situation.

4.1 Learning a channel from tabular data

So far we have concentrated on learning a state (as empirical distribution) fromdata, via the frequentist learning operation Flrn : M∗ ⇒ D, introduced in Sec-tion 1.6. This map Flrn is natural, which implies in particular that it commuteswith marginalisation. In this first, short section we concentrate on how to learna channel from multi-dimensional (tabular) data. There are in principle twoways of doing so:

1 turn the whole table into an empirical distribution, via Flrn, and then appydisintegration to the resulting joint state in order to extract the right channel;

2 extract the appropriate multiset channel directly from the table, and apply

211

Page 224: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

212 Chapter 4. Learning of States and Channels212 Chapter 4. Learning of States and Channels212 Chapter 4. Learning of States and Channels

frequentist learning pointwise, in order to turn this M-channel into a D-channel.

The main result of this section is that both ways coincide. This substantiatesthe claim from Section 3.5 that disintegration is the analogue of extraction forpowerset and multiset.

Recall from Subsection 1.4.5 that for a table, or joint multiset, τ ∈ M(X×Y)there is a function extr1(τ) : X → M(Y) given by: extr1(τ)(x) =

∑y τ(x, y)|y〉.

It describes the ‘row’ of the table τ at position x.

Proposition 4.1.1. Let’s use the ad hoc notationM∗1(X×Y) ⊆ M∗(X×Y) forthe subset of multisets τ for which the first marginal has full support: for eachx ∈ X there is a y ∈ Y with τ(x, y) > 0.

Extraction for multisets and disintegration for distributions commute withfrequentist learning, in the sense that the following diagram commutes.

M∗1(X × Y)Flrn��

extr1 //M∗(Y)X

FlrnX��

D(X × Y)extr1

// D(Y)X

where we write extr1 for the channel extraction function used in (3.18).Explicitly, this means that for τ ∈ M∗1(X × Y) and z ∈ X,

Flrn(extr1(τ)(z)

)= extr1

(Flrn(τ)

)(z).

Proof. By unfolding the relevant definitions one gets, for z ∈ X,(extr1 ◦ Flrn

)(τ)(z)

= extr1

∑x,y

τ(x, y)t

∣∣∣ x, y⟩ (z) where t =∑

x,y τ(x, y)

(3.20)=

∑y

τ(z,y)/t∑y τ(z,y)/t

∣∣∣y⟩=

∑y

τ(z, y)∑y τ(z, y)

∣∣∣y⟩= Flrn

∑y

τ(z, y)∣∣∣y⟩

= Flrn(extr1(τ)(z)

)=

(FlrnX ◦ extr1

)(τ)(z).

We review what this result means in an earlier illustration.

212

Page 225: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

4.1. Learning a channel from tabular data 2134.1. Learning a channel from tabular data 2134.1. Learning a channel from tabular data 213

Example 4.1.2. We return to the medicine / blood pressure table from Sub-section 1.4.1. It involves a joint multiset τ ∈ M({H, L} × {1, 2, 3}) that capturesthe table of data:

τ = 10|H, 0〉 + 35|H, 1〉 + 25|H, 2〉 + 5|L, 0〉 + 10|L, 1〉 + 15|L, 2〉.

As we have seen in Subsection 1.4.5 the first, multiset extraction captures thetwo rows of Table 1.4:

extr1(τ)(H) =∑

x∈M τ(H, x)| x〉 = 10|0〉 + 35|1〉 + 25|2〉extr1(τ)(L) =

∑x∈M τ(L, x)| x〉 = 5|0〉 + 10|1〉 + 15|2〉.

We can apply frequentist learning (normalisation) to these two multisets sepa-rately, and obtain as probabilistic channel c : {H, L} → {1, 2, 3},

c(H) = Flrn(extr1(τ)(H)

)= 1

7 |0〉 +12 |1〉 +

514 |2〉

c(L) = Flrn(extr1(τ)(L)

)= 1

6 |0〉 +13 |1〉 +

12 |2〉.

This channel is obtained by following the east-south part of the rectangle inProposition 4.1.1 above.

If we take the south-east route in the rectangle, we first apply frequentistlearning to τ itself, giving a joint state in Flrn(τ) ∈ D({H, L}×{1, 2, 3}). We canthen apply disintegration to Flrn(τ), using Equation (3.20). Proposition 4.1.1says that we will get the same channel c in this way.

Thus, the essence of Proposition 4.1.1 is: one can extract a (probabilistic)channel from a table (as joint multiset), either by normalising the rows sep-arately, or by first normalising the entire table and then disintegrating. Theoutcomes are the same.

Remark 4.1.3. In the south-east route in the diagram in Proposition 4.1.1 weperform disintegration extr1 after frequentist learning Flrn. Both these oper-ations involve normalisation. These efforts can be combined if we allow our-selves to perform disintegration not only on joint distributions ω ∈ D(X × Y),but also on joint multisets τ ∈ M(X × Y). Concretely, this means that we allowourselves the use of formulation (3.18) not only for states ω, but also for mul-tisets τ. When we apply this in the context of Example 4.1.2 we directly get

213

Page 226: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

214 Chapter 4. Learning of States and Channels214 Chapter 4. Learning of States and Channels214 Chapter 4. Learning of States and Channels

the (same) outcomes that we saw there:

extr1(τ)(H) = τ[0, 1

∣∣∣ 1, 0](H)(3.18)=

∑x∈M

τ(H, x)∑y τ(H, y)

∣∣∣ x⟩= 10

10+35+25 |0〉 +35

10+35+25 |0〉 +25

10+35+25 |0〉 = 17 |0〉 +

12 |1〉 +

514 |2〉

extr1(τ)(L) = τ[0, 1

∣∣∣ 1, 0](L)(3.18)=

∑x∈M

τ(L, x)∑y τ(L, y)

∣∣∣ x⟩= 5

5+10+15

∣∣∣0⟩+ 10

5+10+15

∣∣∣0⟩+ 15

5+10+15

∣∣∣0⟩= 1

6 |0〉 +13 |1〉 +

12 |2〉.

We now also allow ourselves to use the more complicated disintegration ex-pressions τ

[· · ·

∣∣∣ · · · ] from Remark 3.5.6 for multisets τ. This allows us tolearn a channel directly from a multidimensional table, as multiset.

Exercises

4.1.1 Prove in general what we have claimed and illustrated in Remark 4.1.3,namely that for a multiset τ ∈ M(X × Y) whose first marginal has fullsupport, one obtains an equality of disintegrations:

extr1(τ) = extr1(Flrn(τ)

): X → Y.

4.2 Learning with missing data

This section looks at learning from tables in which some entries are missing.We do not treat this topic exhaustively, but instead we elaborate two examplesfrom the literature in order to given an impression of the issues involved.

4.2.1 Learning with missing data, without prior

We look at an example from [57, §6.2.1]. It involves pregnancy of cows, whichcan be deduced from a urine test and a blood test. A simple Bayesian networkstructure is assumed, which we write as string diagram:

Pregnancy

Blood Test Urine test

P

B U

with sets

P = {p, p⊥}B = {b, b⊥}U = {u, u⊥}.

(4.2)

214

Page 227: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

4.2. Learning with missing data 2154.2. Learning with missing data 2154.2. Learning with missing data 215

The elements p and p⊥ represent ‘pregnancy’ and ‘no pregnancy’, respectively.Similarly, b, b⊥ and u, u⊥ represent a positive and negative blood/urine test.

The data in [57] involves 5 cases, as given in the following table, where aquestion mark is used for a missing item.

case Pregnancy Blood test Urine test

1 ? b u2 p b⊥ u3 p b ?4 p b u⊥

5 ? b⊥ ?

(4.3)

Earlier, we could simply read-off the multiset from a completely filled table,for instance in Subsection 1.4.1. Here, in Table (4.3), this can be done onlyfor the second and fourth line: the second line, for instance, corresponds to asingleton multiset 1| p, b⊥, u〉. The first step in our analysis is to see this as atensor of one-dimensional data/multisets, as in:

1| p, b⊥, u〉 = 1| p〉 ⊗ 1|b⊥ 〉 ⊗ 1|u〉.

The second step is to use uniform distributions in the ?-case, where an entry ismissing. We use these uniform distributions because we assume no prior infor-mation about the missing table entries. We can then use the fact that tensoringpreserves sums, see Exercise 1.8.15.

For instance, the first line in the above table (4.3) becomes:( 12 | p〉 +

12 | p

⊥ 〉)⊗ 1|b〉 ⊗ 1|u〉 = 1

2 | p〉 ⊗ 1|b〉 ⊗ 1|u〉 + 12 | p

⊥ 〉 ⊗ 1|b〉 ⊗ 1|u〉= 1

2 | p, b, u〉 +12 | p

⊥, b, u〉.

The fifth line has two missing entries so that our analysis involves preservationof convex sums in two tensor coordinates:( 1

2 | p〉 +12 | p

⊥ 〉)⊗ 1|b⊥ 〉 ⊗

( 12 |u〉 +

12 |u

⊥ 〉)

= 14 | p, b

⊥, u〉 + 14 | p, b

⊥, u⊥ 〉 + 14 | p

⊥, b⊥, u〉 + 14 | p

⊥, b⊥, u⊥ 〉.

Adding-up these five cases gives a multiset τ ∈ M(P × B × U) of the form:

τ = 12 | p, b, u〉 +

12 | p

⊥, b, u〉 + 1| p, b⊥, u〉 + 12 | p, b, u〉 +

12 | p, b, u

⊥ 〉

+ 1| p, b, u⊥ 〉 + 14 | p, b

⊥, u〉 + 14 | p, b

⊥, u⊥ 〉 + 14 | p

⊥, b⊥, u〉 + 14 | p

⊥, b⊥, u⊥ 〉= 1| p, b, u〉 + 3

2 | p, b, u⊥ 〉 + 5

4 | p, b⊥, u〉 + 1

4 | p, b⊥, u⊥ 〉

+ 12 | p

⊥, b, u〉 + 14 | p

⊥, b⊥, u〉 + 14 | p

⊥, b⊥, u⊥ 〉.

We can now learn the two ‘Blood test’ and ‘Urine Test’ channels P → B

215

Page 228: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

216 Chapter 4. Learning of States and Channels216 Chapter 4. Learning of States and Channels216 Chapter 4. Learning of States and Channels

and P → U in (4.2) from this table τ, directly via disintegration — see Exer-cise 4.1.1 — and marginalisation.

• We get c B τ[0, 1, 0

∣∣∣ 1, 0, 0] : P→ B with:

c(p)(3.18)=

τ(p, b, u) + τ(p, b, u⊥)τ(p, b, u) + τ(p, b, u⊥) + τ(p, b⊥, u) + τ(p, b⊥, u⊥)

∣∣∣b⟩+

τ(p, b⊥, u) + τ(p, b⊥, u⊥)τ(p, b, u) + τ(p, b, u⊥) + τ(p, b⊥, u) + τ(p, b⊥, u⊥)

∣∣∣b⊥ ⟩=

1 + 3/2

1 + 3/2 + 5/4 + 1/4

∣∣∣b⟩+

5/4 + 1/4

1 + 3/2 + 5/4 + 1/4

∣∣∣b⊥ ⟩= 5

8 |b〉 +38 |b

⊥ 〉.

c(p⊥)(3.18)= · · · = 1

2 |b〉 +12 |b

⊥ 〉.

We see that a pregnant cow has a slightly higher probability than a non-pregnant cow of getting a positive blood test.

• In the same way we obtain a channel d B τ[0, 0, 1

∣∣∣ 1, 0, 0] : P→ U with:

d(p) = 916 |u〉 +

716 |u

⊥ 〉 and d(p⊥) = 34 |u〉 +

14 |u

⊥ 〉.

Apparantly, the urine test is more likely to be positive for a non-pregnantcow.

These channels c and d coincide with the conditional probabilities computedin [57], without explicitly using tensors of multisets and disintegration.

4.2.2 Learning with missing data, with prior

In the previous example we used a uniform distribution for the missing dataitems — corresponding to the fact that we assumed no prior knowledge. Herewe look into an example, taken from [20, §17.3.1], where we do have priorinformation, namely in the form of a Bayesian network, involving four setsA = {a, a⊥}, B = {b, b⊥}, C = {c, c⊥} and D = {d, d⊥}, in the string diagram:

ω

f g

h

A

B

D

Cwith

ω = 15 |a〉 +

45 |a

⊥ 〉

f (a) = 34 |b〉 +

14 |b

⊥ 〉

f (a⊥) = 110 |b〉 +

910 |b

⊥ 〉

h(b) = 15 |d 〉 +

45 |d

⊥ 〉

h(b⊥) = 710 |d 〉 +

310 |d

⊥ 〉

g(a) = 12 |c〉 +

12 |c

⊥ 〉

g(a⊥) = 14 |c〉 +

34 |c

⊥ 〉

(4.4)

216

Page 229: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

4.2. Learning with missing data 2174.2. Learning with missing data 2174.2. Learning with missing data 217

We first turn this Bayesian network into a joint state τ ∈ D(A × B × C × D

),

as interpretation of the accessible string diagram on the left below — obtainedfrom the string diagram in (4.4).

τ B((id ⊗ id ⊗ swap) ◦· (id ⊗ id ⊗ h ⊗ id)

◦· (id ⊗ ∆2 ⊗ id) ◦· (id ⊗ f ⊗ g) ◦· ∆3

)� ω

That is:

τ =

ω

f g

h

A B DC

=

3200 |a, b, c, d 〉 +

350 |a, b, c, d

⊥ 〉 +3

200 |a, b, c⊥, d 〉 + 3

50 |a, b, c⊥, d⊥ 〉 +

7400 |a, b

⊥, c, d 〉 + 3400 |a, b

⊥, c, d⊥ 〉 +7

400 |a, b⊥, c⊥, d 〉 + 3

400 |a, b⊥, c⊥, d⊥ 〉 +

1250 |a

⊥, b, c, d 〉 + 2125 |a

⊥, b, c, d⊥ 〉 +3

250 |a⊥, b, c⊥, d 〉 + 6

125 |a⊥, b, c⊥, d⊥ 〉 +

63500 |a

⊥, b⊥, c, d 〉 + 27500 |a

⊥, b⊥, c, d⊥ 〉 +189500 |a

⊥, b⊥, c⊥, d 〉 + 81500 |a

⊥, b⊥, c⊥, d⊥ 〉.

This distribution τ will be used as prior.The table with data provided in [20] is:

case A B C D

1 ? b c⊥ ?2 ? b ? d⊥

3 ? b⊥ c d4 ? b⊥ c d5 ? b ? d⊥

(4.5)

At this stage we do not use uniform distributions as fill-in. Instead, we use theBayesian network to infer probabilities for the missing data. For instance, inthe first case we have point predicates 1b and 1c⊥ . From the crossover updatecorollary 2.5.9 we know that this inference can also be done via the joint stateτ, namely as:

τ1 B τ∣∣∣1⊗1b⊗1c⊥⊗1.

The missing probabilities for A,D are then obtained via the marginal, as:

τ1[1, 0, 0, 1

]= 0.1111|a, d 〉 + 0.4444|a, d⊥ 〉 + 0.08889|a⊥, d 〉 + 0.3556|a⊥, d⊥ 〉.

This distribution appears in [20, Fig. 17.4 (a)].

217

Page 230: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

218 Chapter 4. Learning of States and Channels218 Chapter 4. Learning of States and Channels218 Chapter 4. Learning of States and Channels

In a similar way we define:

τ2 B τ∣∣∣1⊗1b⊗1⊗1d⊥

τ3 B τ∣∣∣1⊗1b⊥⊗1⊗1d

τ4 B τ3 τ5 B τ2.

These five states τ1, . . . , τ5 are combined in a uniform convex combination,giving a new joint state:

τ′ = 15 · τ1 + 1

5 · τ2 + 15 · τ3 + 1

5 · τ4 + 15 · τ5 = 1

5 · τ1 + 25 · τ2 + 2

5 · τ3.

From τ′ we can obtain a newly learned interpretation of the Bayesian net-work (4.4), via marginalisation and disintegration. In detail, the new distribu-tion ω′ on A is:

ω′ B τ′[1, 0, 0, 0

]= 0.4208|a〉 + 0.5792|a⊥ 〉.

With newly learned channels:

f ′ B τ′[0, 1, 0, 0

∣∣∣ 1, 0, 0, 0] f ′(a) = 0.8841|b〉 + 0.1159|b⊥ 〉f ′(a⊥) = 0.3937|b〉 + 0.6063|b⊥ 〉

g′ B τ′[0, 0, 1, 0

∣∣∣ 1, 0, 0, 0] g′(a) = 0.4259|c〉 + 0.5741|c⊥ 〉g′(a⊥) = 0.6664|c〉 + 0.3336|c⊥ 〉

h′ B τ′[0, 0, 0, 1

∣∣∣ 0, 1, 0, 0] h′(b) = 0.06667|d 〉 + 0.9333|d⊥ 〉h′(b⊥) = 1|d 〉.

These outcomes are as in [20, Fig. 17.5] — up to some small differences, prob-ably due to rounding.

Interestingly, Table (4.5) gives no information about A, but still we learn anew distribution ω′ on A, basically from the prior information τ.

Exercises

4.2.1 Show that the pregnancy distribution in Subsection 4.2.1, obtainedfrom Table (4.3), is 4

5 | p〉 +15 | p

⊥ 〉.4.2.2 Calculate in detail that, in Subsection 4.2.2,

τ1[1, 0, 0, 1

]= 1

9 |a, d 〉 +49 |a, d

⊥ 〉 + 445 |a

⊥, d 〉 + 1645 |a

⊥, d⊥ 〉.

4.2.3 Apply the approach of Subsection 4.2.2 to the example in Subsec-tion 4.2.1, starting from the Bayesian network 4.2 interpreted withonly uniform distributions. Check that it gives the same outcome asgiven in Subsection 4.2.1.

218

Page 231: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

4.3. Increasing valdity via updating 2194.3. Increasing valdity via updating 2194.3. Increasing valdity via updating 219

4.3 Increasing valdity via updating

In the beginning of this section we have described a basic form of learning,where the aim is to increase a validity ω |= p into a higher validity ω′ |= p bychanging ω into a ‘better’ state ω′. The main result in this section is that onecan take ω′ = ω|p, the conditioning of ω by p. This requires that p is a factor(has non-negative values), see Section 2.3.

We provide a proof that ω|p increases the validity of p, see Theorem 4.3.1below. Another proof will be given later on, in Example 4.4.3 via a more gen-eral ‘sum-increase’ methodology, that will appear in Section 4.4.

The main result of this section, Theorem 4.3.1, can be used quite generally,to derive various other forms of validity increases, not only by updating states,but also by updating channels.

Theorem 4.3.1. Let ω ∈ D(X) and p ∈ Obs(X) be a state and an observableon the same set X. Then there are inequalities between validities:

1(ω |= p

)2= ω ⊗ ω |= p ⊗ p ≤ ω |= p & p;

2 ω |= p ≤ ω|p |= p, when p is a factor.

The second point says that we can increase the validity of p in ω by updatingω with p. This confirms the intuition that p becomes ‘more true’ by movingfrom state ω to ω|p.

An obvious generalisation of the first point to different predicates does nothold, see Exercise 4.3.1 below.

Proof. 1 We first give an elementary proof, and then provide an alternativeproof using variance. For the first proof we rely on the square of sums for-mula: (

v1 + · · · + vn)2

=∑

i

v2i +

∑j<i

2viv j. (4.6)

Let supp(ω) = {x1, . . . , xn} and write ωi = ω(xi) and pi = p(xi). We thenneed to prove the inequality in the middle:

(ω |= p

)2=

(∑i ωi pi

)2≤

∑i ωi p2

i = ω |= p & p.

219

Page 232: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

220 Chapter 4. Learning of States and Channels220 Chapter 4. Learning of States and Channels220 Chapter 4. Learning of States and Channels

We proceed as follows.∑i ωi p2

i =∑

i ωiωi p2i + (1 − ωi)ωi p2

i

=∑

i ω2i p2

i +∑

i(∑

j,i ω j)ωi p2

i

=∑

i ω2i p2

i +∑

i∑

j<i ωiω j(p2i + p2

j ) by re-ordering(∗)≥

∑i ω

2i p2

i +∑

i∑

j<i ωiω j2pi p j

=∑

i(ωi pi)2 +∑

j<i 2(ωi pi)(ω j p j)

=(∑

i ωi pi

)2by (4.6).

The marked inequality (≥) holds because:

0 ≤ (pi − p j)2 = p2i − 2pi p j + p2

j so 2pi p j ≤ p2i + p2

j .

There is an alternative way to prove this first point. We know from Def-inition 2.9.1 that the variance Var(ω, p) of the random variable (ω, p) isnon-negative, since it is defined as the validity of a factor. But Lemma 2.9.3tells that Var(ω, p) = (ω |= p2) − (ω |= p)2, so that (ω |= p2) ≥ (ω |= p)2, asclaimed above.

2 We now assume that p is a factor and that the validity ω |= p is non-zero,so that ω|p is defined. The first equation below is Bayes’ product rule, seeProposition 2.3.3 (2); the subsequent inequality comes from the previouspoint.

ωp |= p =ω |= p & pω |= p

(ω |= p

)2

ω |= p= ω |= p.

By n-fold iterating the updating ω|p · · · |p = ω|p&···&p = ω|pn we reach a statewhere the factor p has its maximum validity.

The previous theorem gives important leads about how to increase validitiesby updating both states and channels (see Definition 2.3.1).

Corollary 4.3.2. Let c : X → Y be a channel with ω ∈ D(X) and q ∈ Fact(Y).

1 There are inequalities between validities:

c � ω |= q

≥ ≥c � (ω|c�q) |= q c|q � ω |= q

≥ ≥c|q � (ω|c�q) |= q

220

Page 233: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

4.3. Increasing valdity via updating 2214.3. Increasing valdity via updating 2214.3. Increasing valdity via updating 221

2 For an additional channel d : X → Z and a factor r ∈ Fact(Z),

〈c, d〉 � ω |= q ⊗ r

≥ ≥〈c, d〉 � (ω|(c�q)&(d�r)) |= q ⊗ r 〈c|q, d|r〉 � ω |= q ⊗ r

≥ ≥〈c|q, d|r〉 � (ω|(c�q)&(d�r)) |= q ⊗ r

Proof. For both points it suffices to prove the upper two inequalities.

1 We do the upper-left inequality first:

c � ω |= q = ω |= c � q by Proposition 2.4.3≤ ω|c�q |= c � q by Theorem 4.3.1 (2)= c � (ω|c�q) |= q by Proposition 2.4.3 again.

For proving the upper-right inequality we unpack validity:

c|q � ω |= q =∑

y(c|q � ω)(y) · q(y)

=∑

x,yω(x) · c|q(x)(y) · q(y)

=∑

x,yω(x) ·

c(x)(y) · q(y)c(x) |= q

· q(y)

=∑

xω(x) ·

∑y c(x)(y) · q(y) · q(y)

c(x) |= q

=∑

xω(x) ·

c(x) |= q & qc(x) |= q

≥∑

xω(x) ·

(c(x) |= q)2

c(x) |= qby Theorem 4.3.1 (1)

=∑

xω(x) · (c(x) |= q)

=∑

x,yω(x) · c(x)(y) · q(y)

= c � ω |= q.

2 The upper-left inequality follows from the upper-left inequality in the pre-vious point since 〈c, d〉 � (q ⊗ r) = (c � q) & (d � r). This follows fromExercise 2.4.5:

〈c, d〉 � (q ⊗ r) =((c ⊗ d) ◦· ∆

)� (q ⊗ r)

= ∆ �((c ⊗ d) � (q ⊗ r)

)= ∆ �

((c � q) ⊗ (d � r)

)= (c � q) & (d � r).

221

Page 234: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

222 Chapter 4. Learning of States and Channels222 Chapter 4. Learning of States and Channels222 Chapter 4. Learning of States and Channels

Similarly, one obtains the upper-right inequality from point (1) and the equa-tion 〈c, d〉|q⊗r = 〈c|q, d|r〉, which holds by Exercise 2.3.9.

Recall that observables carry a partial order ≤, which is obtained pointwise,from the order of the reals: p ≤ q iff p(x) ≤ q(x) for all x. The above resultsabout inequalities between validities also have consequences for predicates.

Corollary 4.3.3. For a channel c : X → Y and a factor q on Y one has:

c � q ≤ c|q � q.

Proof. For each x ∈ X,(c � q

)(x) = 1| x〉 |= c � q

= c � (1| x〉) |= q≤ c|q � (1| x〉) |= q by Corollary 4.3.2 (1)= 1| x〉 |= c|q � q=

(c|q � q

)(x).

Exercises

4.3.1 Check that Theorem 4.3.1 (1) does not generalise to:

(ω |= p) · (ω |= q) ≤ (ω |= p & q).

Take for instance ω = 12 |H 〉 + 1

2 |T 〉 with p(H) = 12 , p(T ) = 4

5 andq(H) = 3

5 , q(T ) = 25 .

4.4 A general sum-increase method

The previous section has introduced several inequalities between validities.They were all derived from a basic observation, in Theorem 4.3.1 (1). Thissection introduces a more general approach to derive inequalities, from [7]. Itforms the basis for many learning methods. This approach involves a genericsum-increase result, see Lemma 4.4.2 below. In many situations a concretedescription of this increase can be given via some basic real analysis. This willbe illustrated, among others, by re-deriving, in a completely different manner,some of the inequalities of the previous section.

We begin with a classical result in this setting that we need below.

Lemma 4.4.1 (Jensen’s inequality). Let ω ∈ D(X) be state with a factor

222

Page 235: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

4.4. A general sum-increase method 2234.4. A general sum-increase method 2234.4. A general sum-increase method 223

p : X → R≥0 on its domain which is positive (non-zero) on the support ofω. For each function f : R>0 → R with f ′′ < 0 one has:

f(ω |= p

)≥ ω |=

(f ◦ p

).

The inequality is strict, except in trivial cases.This result holds in particular for f = ln, the natural logarithm.

The proof is standard but is included nevertheless, for convenience of thereader.

Proof. It suffices to prove that for a1, . . . , an ∈ R>0 and r1, . . . , rn ∈ [0, 1] with∑i ri = 1 one has f (

∑i riai) ≥

∑i ri f (ai). We shall provide a proof for n = 2.

The inequality is easily extended to n > 2, by induction.So let a, b ∈ R>0 be given, with r ∈ [0, 1]. We need to prove f (ra+(1−r)b) ≥

r f (a)+(1−r) f (b). The result is trivial if a = b or r = 0 or r = 1. So let, withoutloss of generality, a < b and r ∈ (0, 1). Write c = ra + (1 − r)b = b − r(b − a),so that a < c < b. By the mean value theorem we can find a < u < c andc < v < b with:

f (c) − f (a)c − a

= f ′(u) andf (b) − f (c)

b − c= f ′(v)

Since f ′′ < 0 we have that f ′ is strictly decreasing, so f ′(u) > f ′(v) becauseu > v. We can write:

c − a = (r − 1)a + (1 − r)b = (1 − r)(b − a) and b − c = r(b − a).

From f ′(u) > f ′(v) we deduce inequalities:

f (c) − f (a)(1 − r)(b − a)

>f (b) − f (c)

r(b − a)i.e. r( f (c) − f (a)) > (1 − r)( f (b) − f (c).

By reorganising the latter inequality we get f (c) > r f (a) + (1 − r) f (b), asrequired.

We now come to the main technical result which we shall call the sum-increase lemma. It is a special (discrete) case of a more general result [7,Thm. 2.1]. In the previous section we concentrated on increasing validitiesω |= p =

∑y ω(y) · p(y). The next result describes how to find increases for

sum expressions in general.

Lemma 4.4.2. Let X,Y be finite sets, and let F : X × Y → R≥0 be a givenfunction. For each x ∈ X, write F1(x) B

∑y∈Y F(x, y). Assume that there is an

x′ ∈ X with:

x′ ∈ argmaxz

G(x, z) where G(x, z) B∑

y∈YF(x, y) · ln

(F(z, y)

).

223

Page 236: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

224 Chapter 4. Learning of States and Channels224 Chapter 4. Learning of States and Channels224 Chapter 4. Learning of States and Channels

Then F1(x′) ≥ F1(x). We often say informally that x′ is better x.

In general, the inequality ≥ in this lemma gives a strict increase, like inJensen’s inequality in Lemma 4.4.1. It forms the basis for many forms of learn-ing. An actual maximum x′ can in many situation be determined analytically— using the Lagrange multiplier method — but it need not be unique and itcan be a local maximum. This will be illustrated in the remainder of this sec-tion. We will see that several validity increases from the previous section canbe obtained via the above lemma.

Proof. Let x′ be the element where G(x,−) : Y → R≥0 takes its maximum.This x′ satisfies F1(x′) ≥ F1(x), since:

ln(

F1(x′)F1(x)

)= ln

(∑y

F(x′, y)F1(x)

)= ln

(∑y

F(x, y)F1(x)

·F(x′, y)F(x, y)

)≥

∑y

F(x, y)F1(x)

· ln(

F(x′, y)F(x, y)

)by Lemma 4.4.1

=1

F1(x)·∑

yF(x, y) ·

(ln

(F(x′, y)

)− ln

(F(x, y)

))=

1F1(x)

·(G(x, x′) −G(x, x)

)≥ 0.

Example 4.4.3. Suppose we have a validity ω |= p, where ω ∈ D(Y) andp ∈ Fact(Y), which we would like to increase, by changing the state ω, whilekeeping p fixed. We apply Lemma 4.4.2 with F : D(Y) × Y → R≥0 given byF(ω, y) = ω(y) · p(y). Then F1(ω) =

∑y F(ω, y) =

∑y ω(y) · p(y) = ω |= p.

We have G(ω,ω′) =∑

y ω(y)·p(y)·ln(ω′(y)·p(y)

)and we wish to find a max-

imum of G(ω,−) : D(Y) → R≥0. Let Y = {y1, . . . , yn}; we will use variablesvi for the values ω′(yi) that we wish to find, by seeing G(ω,−) as a functionRn → R≥0 with constraints.

These constraints are handled via the Lagrange multiplier method for findingthe maximum (see e.g. [9, §2.2]). We keep ω fixed and consider a new functionH, also known as the Lagrangian, with an additional parameter κ.

H(~v, κ) B G(ω,~v) − κ ·((∑

i vi) − 1)

=∑

i ω(yi) · p(yi) · ln(vi · p(yi)

)− κ ·

((∑

i vi) − 1).

224

Page 237: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

4.4. A general sum-increase method 2254.4. A general sum-increase method 2254.4. A general sum-increase method 225

The partial derivatives of H are:

∂H∂vi

(~v, κ) =ω(yi) · p(yi)

vi · p(yi)· p(vi) − κ =

ω(yi) · p(yi)vi

− κ

∂H∂κ

(~v, κ) = (∑

i vi) − 1.

Setting all these derivatives to zero yields:

1 =∑

i vi =∑

iω(yi) · p(yi)

κ=ω |= pκ

.

Hence κ = ω |= p and thus:

vi =ω(yi) · p(yi)

κ=ω(yi) · p(yi)ω |= p

= ω|p(yi).

Hence, Lemma 4.4.2 gives as ‘better’ state the updated version ω|p of ω withp, as in Theorem 4.3.1 (2), where we already saw the inequality ω|p |= p ≥ω |= p.

Example 4.4.4. We will now try to find a ‘better’ version of a channel c in asituation c � ω |= q, where c : X → Y , ω ∈ D(X) and q ∈ Fact(Y). We defineF : D(Y)X × X × Y → R≥0 as:

F(c, x, y) B ω(x) · c(x)(y) · q(y)

Then F1(c) =∑

x,y F(c, x, y) =∑

y(c � ω)(y) · q(y) = c � ω |= q.Let X = {x1, . . . , xn} and Y = {y1, . . . , ym}. We shall use variables zi j for

c′(xi)(yi). We can thus describe G(c,−) : Rn×m → R≥0 as:

G(c,~z) =∑

i, j F(c, xi, y j) · ln(F(~z, xi, y j)

)=

∑i, j ω(xi) · c(xi)(y j) · q(y j) · ln

(ω(xi) · zi j · q(y j)

).

We now use the Lagrange multiplier method with n additional variables ~κ in anew function:

H(~z,~κ) B∑

i, j ω(xi) · c(xi)(y j) · q(y j) · ln(ω(xi) · zi j · q(y j)

)−

∑i κi ·

((∑

j zi j) − 1).

Its partial derivatives are:

∂H∂zi j

(~z,~κ) =ω(xi) · c(xi)(y j) · q(y j)ω(xi) · zi j · q(y j)

· ω(xi) · q(y j) − κi

=ω(xi) · c(xi)(y j) · q(y j)

zi j− κi

∂H∂κi

(~z,~κ) = (∑

j zi j) − 1.

225

Page 238: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

226 Chapter 4. Learning of States and Channels226 Chapter 4. Learning of States and Channels226 Chapter 4. Learning of States and Channels

Setting all these derivatives to zero yields for each i,

1 =∑

j zi j =∑

jω(xi) · c(xi)(y j) · q(y j)

κi=ω(xi) · (c(xi) |= q)

κi.

Hence κi = ω(xi) · (c(xi) |= q) so that:

zi j =ω(xi) · c(xi)(y j) · q(y j)

κi=ω(xi) · c(xi)(y j) · q(y j)ω(xi) · (c(xi) |= q)

=c(xi)(y j) · q(y j)

c(xi) |= q= c(xi)|q(y j) = c|q(xi)(y j).

Thus, the ‘better’ channel than c in the situation c � ω |= q is c|q. This matchesCorollary 4.3.2 (1) where we already saw that c|q � ω |= q ≥ c � ω |= q.

In the description of Lemma 4.4.2 the set X may be a product of other sets, sothat the sum-increase approach also works for multiple channels and/or states— giving a better version for each of them. This will be illustrated next.

Example 4.4.5. We again consider a situation c � ω |= q as in Example 4.4.3,but now we wish to get better versions of both c andω. The situation is much asbefore, except that we now use two channel arguments: we define F : D(Y)X ×

D(X) × X × Y → R≥0 as:

F(c, ω, x, y) B ω(x) · c(x)(y) · q(y)

Then F1(c, ω) =∑

x,y F(c, ω, x, y) =∑

y(c � ω)(y) · q(y) = c � ω |= q.Let X = {x1, . . . , xn} and Y = {y1, . . . , ym}. We shall use variables zi j for

c′(xi)(yi) and wi for ω′(xi). We can thus describe G(c, ω,−,−) : Rn×m × Rn →

R≥0 as:

G(c, ω,~z, ~w) =∑

i, j F(c, ω, xi, y j) · ln(F(~z, ~w, xi, y j)

)=

∑i, j ω(xi) · c(xi)(y j) · q(y j) · ln

(wi · zi j · q(y j)

).

We again use the Lagrange multiplier method, now with variables ~κ, λ in thefunction:

H(~z, ~w,~κ, λ) B∑

i, j ω(xi) · c(xi)(y j) · q(y j) · ln(wi · zi j · q(y j)

)−

∑i κi ·

((∑

j zi j) − 1)− λ ·

((∑

i wi) − 1).

226

Page 239: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

4.4. A general sum-increase method 2274.4. A general sum-increase method 2274.4. A general sum-increase method 227

The partial derivatives are:

∂H∂zi j

(~z, ~w,~κ, λ) =ω(xi) · c(xi)(y j) · q(y j)

wi · zi j · q(y j)· wi · q(y j) − κi

=ω(xi) · c(xi)(y j) · q(y j)

zi j− κi

∂H∂wi

(~z, ~w,~κ, λ) =∑

j

ω(xi) · c(xi)(y j) · q(y j)wi · zi j · q(y j)

· zi j · q(y j) − λ

=ω(xi) · (c � q)(xi)

wi− λ

∂H∂κi

(~z, ~w,~κ, λ) = (∑

j zi j) − 1

∂H∂λ

(~z, ~w,~κ, λ) = (∑

i wi) − 1.

By taking all these derivatives to be zero we can derive zi j = c|q(xi)(y j), asin Example 4.4.4. We claim that we can also derive wi = ω|c�q(xi), in thefollowing way.

1 =∑

i wi =∑

iω(xi) · (c � q)(xi)

λ=ω |= c � q

λ.

Thus λ = ω |= c � q and so:

wi =ω(xi) · (c � q)(xi)

λ=ω(xi) · (c � q)(xi)

ω |= c � q= ω|c�q(xi).

Thus, we rediscover the inequality c|q � ω|c�q |= q ≥ c � ω |= q fromCorollary 4.3.2 (1), in which better versions of both the channel c and the stateω are used.

The previous illustrations involved inequalities that we already knew. Weconclude this section with a new result.

Proposition 4.4.6. Let ω ∈ D(X) be state on a finite set X and let p1, . . . , pn

be factors on X, all with non-zero validity ω |= pi.

1 The state ω′ =∑

i1n · ω|pi satisfies:∏

i(ω′ |= pi) ≥

∏i(ω |= pi).

2 For k1, . . . , kn ∈ N write k =∑

i ki and ω′ =∑

ikik · ω|pi , where we assume

k > 0. Then: ∏i(ω′ |= pi)ki ≥

∏i(ω |= pi)ki .

227

Page 240: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

228 Chapter 4. Learning of States and Channels228 Chapter 4. Learning of States and Channels228 Chapter 4. Learning of States and Channels

Proof. 1 We shall do the proof for n = 2 and leave the general case as exerciseto the reader. We use Lemma 4.4.2 with function F : D(X) × X × X → R≥0

given by:

F(ω, x, y) B ω(x) · p1(x) · ω(y) · p2(y).

Then by distributivity of multiplication over addition:∑x,y F(ω, x, y) =

(∑x ω(x) · p1(x)

)·(∑

y ω(y) · p2(y))

= (ω |= p1) · (ω |= p2).

Let X = {x1, . . . , xn} and let the function H be given by:

H(~w, λ) B∑

i, j F(ω, xi, x j) · ln(wi · p1(xi) · w j · p2(x j)

)− λ ·

((∑

i wi) − 1).

Then:∂H∂wk

(~w, λ) =∑

i

F(ω, xk, xi) + F(ω, xi, xk)wk

− λ

∂H∂λ

(~w, λ) = (∑

i wi) − 1.

Setting these to zero gives:

1 =∑

k wk =

∑k,i F(ω, xk, xi) + F(ω, xi, xk)

λ=

2 · (ω |= p1) · (ω |= p2)λ

.

Hence λ = 2 · (ω |= p1) · (ω |= p2) so that:

wk =

∑i F(ω, xk, xi) + F(ω, xi, xk)

λ

= 12 ·

ω(xk) · p1(xk) · (ω |= p2)(ω |= p1) · (ω |= p2)

+ 12 ·

(ω |= p1) · ω(xk) · p2(xk)(ω |= p1) · (ω |= p2)

= 12 ·

ω(xk) · p1(xk)ω |= p1

+ 12 ·

ω(xk) · p2(xk)ω |= p2

= 12 · ω|p1 (xk) + 1

2 · ω|p2 (xk).

2 Directly from (1), by writing∏

i (ω′ |= pi)ki as a k-fold product, for k =∑i ki.

Exercises

4.4.1 Check that the better channel that appears in Example 4.4.5 is c|q, likein Example 4.4.4.

4.4.2 Prove Proposition 4.4.6 (1) in general, not just for n = 2.

228

Page 241: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

4.5. Learning Markov chains 2294.5. Learning Markov chains 2294.5. Learning Markov chains 229

4.5 Learning Markov chains

The current and next section continue the study of Markov chains and hiddenMarkov models (HMMs) from Section 3.4. This section will introduce learningfor Markov chains, whereas the next section extends the approach to hiddenMarkov models.

We recall that a Markov chain is a pair (σ, t) where σ ∈ D(X) is the initialstate and t : X → X is the transition channel. It can be seen as a special case ofan HMM, where the emission channel is the identity. Thus we will use validityfor HMMs also for Markov chains, where (σ, t) |= ~p should be understood as(σ, t, id) |= ~p, in the sense of Definition 3.4.2.

Below we first look at the simple situation where we like to learn a Markovchain directly from (point) observations, in the style of frequentist learningFlrn. Subsequently we look at the more difficult situation of increasing thevalidity of a given sequence of observations. This will be done via the sum-increase approach of the previous section. It leads to a channel-based descrip-tion of the Baum-Welch algorithm [7] (for Markov chains).

4.5.1 Learning by counting elements, without priors

We consider an example. Let X = {A,C,G,T } the letters for the four DNAnucleobase elements adenine (A), cytosine (C), guanine (G) and thymine (T).Suppose we are given two 8-letter sequences α1, α2 ∈ L(X), namely:

α1 = [C,G, A, A,T,T,G,C] α2 = [A,C,T,C,C, A,T,G].

What can we learn from this? In particular, is there a Markov chain that (best)fits these data?

For the initial state we only have the first entries in the two sequences,namely C and A. This gives a multiset 1|A〉 + 1|C 〉, which yields the initialstate σ = 1

2 |A〉 +12 |C 〉, via frequentist learning Flrn.

For the transition channel we turn the above two sequences of letters intosequences of consecutive pairs of letters, as in:

[(C,G), (G, A), (A, A), (A,T ), (T,T ), (T,G), (G,C)][(A,C), (C,T ), (T,C), (C,C), (C, A), (A,T ), (T,G)]

We combine these two lists and turn them into a multiset ψ ∈ M(X × X) ofpairs:

ψ = 1|A, A〉 + 1|A,C 〉 + 2|A,T 〉 + 1|C, A〉 + 1|C,C 〉 + 1|C,G 〉 + 1|C,T 〉+ 1|G, A〉 + 1|G,C 〉 + 1|T,C 〉 + 2|T,G 〉 + 1|T,T 〉.

229

Page 242: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

230 Chapter 4. Learning of States and Channels230 Chapter 4. Learning of States and Channels230 Chapter 4. Learning of States and Channels

Applying frequentist learning to ψ gives a joint state Flrn(ψ) ∈ D(X×X), fromwhich we can extract the transition channel t : X → X by disintegration. ByRemark 4.1.3 we also extract directly from ψ. This yields:

t(A) = 14 |A〉 +

14 |C 〉 +

12 |T 〉 t(G) = 1

2 |A〉 +12 |C 〉

t(C) = 14 |A〉 +

14 |C 〉 +

14 |G 〉 +

14 |T 〉 t(T ) = 1

4 |C 〉 +12 |G 〉 +

14 |T 〉.

We then obtain validities:

(σ, t) |= α1 = 12 ·

14 ·

12 ·

14 ·

12 ·

14 ·

12 ·

12 = 1

2048 ≈ 0.000488(σ, t) |= α2 = 1

2 ·14 ·

14 ·

14 ·

14 ·

14 ·

12 ·

12 = 1

8192 ≈ 0.000122.

Of course, one can also calculate the validities of other sequences in the Markovchain (σ, t) learned from α1, α2.

4.5.2 Finding a better Markov chain

We are going to apply the techniques from Section 4.4 to learn a new, better,Markov chain, starting form a given (prior) one. This given chain will be writ-ten as (σ, t), with initial state σ ∈ D(X) and transition channel t : X → X. Wefurther assume that a list of ` factors ~p = p1, . . . p` ∈ Fact(X) is given, of con-secutive observations, from which we like to learn. Our aim is to find a betterMarkov chain (σ′, t′) with:

(σ′, t′) |= ~p ≥ (σ, t) |= ~p.

We recall from Definition 3.4.4 the filtered sequence of states σ1, . . . , σ`+1 ∈

D(X) with σ1 B σ and σi+1 B t � (σi|pi ). We introduce two new sequences,of factors q1, . . . , q`+1 ∈ Fact(X), and of states γ1, . . . , γ`+1 ∈ D(X). First, thefactors are defined, ‘backwards’, as in:

q`+1 B 1 and qi−1 B pi−1 & (t � qi). (4.7)

We can now define the states γi, for 1 ≤ i ≤ `, via entry-wise updating, simplyas:

γi B σi

∣∣∣qi

and γ B∑

1≤i<` γi. (4.8)

Each γi is a state, by construction, but γ, as a sum of states, is a multiset.

Theorem 4.5.1. In the above setting, starting from a Markov chain (σ, t) wedefine the Baum-Welch improvement of (σ, t) with factors ~p = p1, . . . , p` as:

BW(σ, t, ~p) B (σ′, t′), where: σ′ B γ1 t′(x) B∑

1≤i<`

γi(x)γ(x)

· t|qi+1 (x).

Then: BW(σ, t, ~p) |= ~p ≥ (σ, t) |= ~p.

230

Page 243: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

4.5. Learning Markov chains 2314.5. Learning Markov chains 2314.5. Learning Markov chains 231

This result describes just one improvement step. What is called the Baum-Welch algorithm for Markov chains involves repeating this step multiple times,with the same factors ~p. The number of repetitions may depend on some preci-sion requirement, like the increase in validity being smaller than some epsilonvalue.

The rest of this section will be devoted mostly to the proof of this result,using Lemma 4.4.2.

We start with several technical properties for the q’s and γ’s in definitions (4.7)and (4.8). They are collected below, for laters use.

Lemma 4.5.2. In the above situation,

1 The (prior) validity (σ, t) |= ~p equals σ1 |= q1;

2 σi+1 |= qi+1 =σi |= qi

σi |= piand thus σi |= qi =

σ1 |= q1∏j<i σ j |= p j

;

3 γi(x) =

(∏j<i σ j |= p j

)· σi(x) · pi(x) · (t � qi+1)(x)σ1 |= q1

.

Proof. 1 Directly by Proposition 3.4.3.2 By unravelling the definitions and using Bayes’ product rule we get:

σi+1 |= qi+1 = t �(σi|pi

)|= qi+1

= σi|pi |= t � qi+1

=σi |= pi & (t � qi+1)

σi |= pi=

σi |= qi

σi |= pi.

The second part of point (2) follows by induction.3 We have:

γi(x) = σi|qi (x) =σi(x) · qi(x)σi |= qi

=σi(x) · pi(x) · (t � qi+1)(x)

σ1 |=q1/∏ j<i σ j |=p j

by (2)

=

(∏j<i σ j |= p j

)· σi(x) · pi(x) · (t � qi+1)(x)σ1 |= q1

.

We now turn to the sum-increase setting of Lemma 4.4.2, in order to get abetter version of both σ and t. We take F : X ×D(X)X × X` → R≥0, to be:

F(σ, t, ~x) B σ(x1) · p1(x1) · t(x1)(x2) · p2(x2) · · · t(x`−1)(x`) · p`(x`)

= σ(x1) · p1(x1) ·∏

1≤k<`t(xk)(xk+1) · pk+1(xk+1).

(4.9)

We also collect some results about F.

Lemma 4.5.3. 1 F1(σ, t) = (σ, t) |= ~p;

231

Page 244: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

232 Chapter 4. Learning of States and Channels232 Chapter 4. Learning of States and Channels232 Chapter 4. Learning of States and Channels

2 For each 1 ≤ k < ` one has:∑x1,...,xk−1

∑xk+2,...,x` F(σ, t, ~x) = (σ1 |= q1) · γk(xk) · t|qk+1 (xk)(xk+1).

Proof. 1 Since F1(σ, t) =∑~x F(σ, t, ~x) = σ |= q1 = (σ, t) |= ~p.

2 Straightforward:∑x1,...,xk−1

∑xk+2,...,x` F(σ, t, ~x)

=(∑

x1,...,xk−1σ(x1) · p1(x1) · t(x1)(x2) · · · pk−1(xk−1) · t(xk−1)(xk)

)· pk(xk) · t(xk)(xk+1) · pk+1(xk+1) ·

(∑xk+2,...,x` t(xk+1)(xk+2) · · · p`(x`)

)= (σ1 |= p1) · · · (σk−1 |= pk−1) · σk(xk)

· pk(xk) · t(xk)(xk+1) · pk+1(xk+1) · (t � qk+2)(xk+1)=

(∏j<k σ j |= p j

)· σk(xk) · pk(xk) · t(xk)(xk+1) · qk+1(xk+1)

= (σ1 |= q1) · γk(xk) ·t(xk)(xk+1) · qk+1(xk+1)

(t � qk+1)(xk)by Lemma 4.5.2 (3)

= (σ1 |= q1) · γk(xk) · t|qk+1 (xk)(xk+1).

We now turn to the actual proof of the main result of this section.

Proof. (of Theorem 4.5.1) According to Lemma 4.4.2 we seek a better stateand channel σ′, t′ as maximum of the function G(σ, t,−,−), where:

G(σ, t, σ′, t′) =∑~x F(σ, t, ~x) · ln

(F(σ′, t′, ~x)

).

Let X = {x1, . . . , xn}. We write wi as variable for σ′(xi) and zi j for t′(xi)(x j).We use additional variables ~κ and λ, as in Example 4.4.5, in a new function H,for the Lagrange multiplier method, where:

H(~w,~z, λ,~κ)B

∑~i F(σ, t, x~i) · ln

(wi1 · p(xi1 ) ·

∏1≤k<`

zik ik+1 · pk+1(xik+1 ))

− λ ·((∑

i wi) − 1)−

∑i κi ·

((∑

i′ zii′ ) − 1)

=∑~i F(σ, t, x~i) ·

(ln

(wi1 · p(xi1 )

)+

∑1≤k<`

ln(zik ik+1 · pk+1(xk+1 )

))− λ ·

((∑

i wi) − 1)−

∑i κi ·

((∑

i′ zii′ ) − 1).

The partial derivatives of H are as follows. First,

∂H∂wi

(~w,~z, λ,~κ) =∑

~i,i1=i

F(σ, t, x~i)wi1 · p1(xi1 )

· p1(xi1 ) − λ =σ(xi) · q1(xi)

wi− λ.

Next, the expression for zi j can be rewritten using Lemma 4.5.3 (2).

∂H∂zii′

(~w,~z, λ,~κ) =∑

1≤k<`

∑~i,ik=i,ik+1=i′

F(σ, t, x~i)zik ik+1 · pk+1(xik+1 )

· pk+1(xik+1 ) − κi

=(σ1 |= q1) ·

∑1≤k<` γk(xi) · t|qk+1 (xi)(xi′ )

zii′− κi.

232

Page 245: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

4.5. Learning Markov chains 2334.5. Learning Markov chains 2334.5. Learning Markov chains 233

The partial derivatives w.r.t. the auxiliary variables are:

∂H∂λ

(~w,~z, λ,~κ) = (∑

i wi) − 1∂H∂κi

(~w,~z, λ,~κ) = (∑

j zi j) − 1.

When we set all these partial derivatives to zero, we can derive the followingabout ~w and ~z. First we have:

1 =∑

i wi =∑

iσ(xi) · q1(xi)

λ=σ |= q1

λ.

Hence λ = σ |= q1 and thus:

wi =σ(xi) · q1(xi)

λ=σ(xi) · q1(xi)σ |= q1

= σ|q1 (xi) = γ1(xi).

Thus we conclude that the better initial state σ′ equals γ1, as in Theorem 4.5.1.For the better channel t′ we obtain, for each i,

1 =∑

i′ zii′ =

∑i′ (σ1 |= q1) ·

∑1≤k<` γk(xi) · t|qk+1 (xi)(xi′ )κi

=(σ1 |= q1) ·

∑1≤k<` γk(xi)

κi

(4.8)=

(σ1 |= q1) · γ(xi)κi

.

Hence κi = (σ1 |= q1) · γ(xi), so that:

zii′ =(σ1 |= q1) ·

∑1≤k<` γk(xi) · t|qk+1 (xi)(xi′ )

κi

=

∑1≤k<` γk(xi) · t|qk+1 (xi)(xi′ )

γ(xi)

=∑

1≤k<`

γk(xi)γ(xi)

· t|qk+1 (xi)(xi′ ).

This is the expression for t′ in Theorem 4.5.1. The claimed increase in validityfollows from Lemma 4.5.3 (1).

Remark 4.5.4. The description in Theorem 4.5.1 of the newly learned chan-nel t′, as convex sum of updated channels t|qi+1 is mathematically appealing,but is not optimal from a computational perspective: it involves several nor-malisations, namely for each channel update, and for the coefficients γi/γ of theconvex sum. It is more efficient to simply multiply all factors γi(x), t(x)(y) andqi+1(y) and then normalise at the very end. Getting into such implementationdetails is beyond the scope of this book.

An alternative way to compute the new transition channel t′ is described inExercise 4.5.2.

233

Page 246: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

234 Chapter 4. Learning of States and Channels234 Chapter 4. Learning of States and Channels234 Chapter 4. Learning of States and Channels

We shall elaborate an example in the next section.

Exercises

4.5.1 Prove that in the situation of Subsection 4.5.2 one has:

t|qi+1 � γi = γi+1.

Hint: Recall Exercise 2.5.5.Conclude that there is a chain of morphisms in the category of ker-

nels Krn of the form:

(X, γ1)t|q2 // (X, γ3)

t|q3 // · · ·t|qn // (X, γn+1).

4.5.2 In the context of Theorem 4.5.1, consider a new sequence of jointstates τ1, . . . , τ` ∈ D(X × X), via:

τi B(〈id, t〉 � σi

)∣∣∣pi⊗qi+1

= 〈id, t|qi+1〉 �(σi|pi&(t�qi+1)

)by Theorem 3.7.8 (2)

= 〈id, t|qi+1〉 �(σi|qi

)= 〈id, t|qi+1〉 � γi.

Then define τ B∑

1≤i<` τi and prove that the new channel t′ in Theo-rem 4.5.1 can also be defined via disintegration as:

t′ = τ[0, 1

∣∣∣ 1, 0].4.5.3 Write ωn = 〈id, t〉n � σ ∈ D(Xn × X), for the joint state defined

in (3.10) with e = id. Prove that the states γi ∈ D(X) in (4.8) arise asi-th marginal:

γi = ω`∣∣∣p1⊗ ··· ⊗p`⊗1

[0, . . . , 0, 1, 0, . . . , 0

],

with the (single) 1 in the mask on the i-th position, for 1 ≤ i ≤ `, inthe mask of length ` + 1.

4.6 Learning hidden Markov models

In this section we extend the Baum-Welch algorithm from Markov chains tohidden Markov models (HMM). This means that we have to deal with an ad-ditional emmission channel. We shall see that the improvement approach forMarkov chains in the previous section carries over HMMs. It only involvesa bit more book-keeping. We shall concentrate on what is new, namely the

234

Page 247: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

4.6. Learning hidden Markov models 2354.6. Learning hidden Markov models 2354.6. Learning hidden Markov models 235

emission channel. As before, we use a channel-based formulation and allowarbitrary factors as observables, and not just point observations.

After stating an proving the main result, we elaborate an example from theliterature.

4.6.1 Finding a better hidden Markov model

Throughout this subsection we shall use:

• a fixed Hidden Markov modelH =(1

σ→ X

t→ X

e→Y

);

• a list ~p = p1, . . . , p` of factors on Y , from which we wish to learn;• the filtered sequence of states σ1, . . . , σn+1 ∈ D(X) with σ1 B σ and σi+1 B

t � (σ|e�pi ), as introduced in Definition 3.4.4.

Our aim is to define a newly learned HMM (σ′, t′, e′).We introduce two new sequences, like in the previous section. We shall use

the same letters (q and γ), but their definition is slightly different. In fact, thedefinitions given below coincide with the ones from (4.7) and (4.8) in the pre-vious section where the emission channel e is the identity id.

We start with a series of factors q1, . . . , q`+1 ∈ Fact(X), defined in a ‘back-wards’ manner:

q`+1 B 1 and qi−1 B (e � pi−1) & (t � qi). (4.10)

We can now define states γi ∈ D(X), for 1 ≤ i ≤ ` via entry-wise updating,simply as:

γi B σi

∣∣∣qi

giving: γ B∑

1≤i<` γi. (4.11)

Theorem 4.6.1. In the above setting, starting from a Markov chain (σ, t, e) wedefine the Baum-Welch improvement of (σ, t, e) with factors ~p = p1, . . . , p` as:

BW(σ, t, e, ~p) B (σ′, t′, e′),

where σ′ B γ1 and:

t′(x) B∑

1≤i<`

γi(x)γ(x)

· t|qi+1 (x) e′(x) B∑

1≤i≤`

γi(x)γ(x) + γ`(x)

· e|pi (x).

Then: BW(σ, t, e, ~p) |= ~p ≥ (σ, t, e) |= ~p.

Notice that the new transition channel t′ is defined via a convex sum of`−1 updated channels, whereas the new emmission channel e′ involves ` sum-mands. This makes sense, since e can be updated via all ` factors pi. Updatingthe transition channel t proceeds via the pairs (pi, pi+1), of which there are only` − 1.

235

Page 248: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

236 Chapter 4. Learning of States and Channels236 Chapter 4. Learning of States and Channels236 Chapter 4. Learning of States and Channels

For convenience, we overload the operation BW and use it both for Markovchains (with three arguments) and for HMMs (with four arguments). It is nothard to see that the above newly learned state and transition channel σ′, t′

satisfy:

(σ′, t′) = BW(σ, t,−−−−−→e � pi

),

with BW as in Theorem 4.5.1.We use Lemma 4.4.2 with function F : D(X)×D(X)X ×D(Y)X ×X` ×Y` →

R≥0 defined by:

F(σ, t, e, ~x, ~y) B σ(x1) · e(x1)(y1) · p1(y1) · t(x1)(x2) · e(x2)(y2) · p2(y2) · · ·· · · t(x`−1)(x`) · e(x`)(y`) · p`(y`)

= σ(x1) · e(x1)(y1) · p1(y1)·∏

1≤k<`t(xk)(xk+1) · e(xk+1)(yk+1) · pk+1(yk+1).

Then:∑~x,~y F(σ, t, e, ~x, ~y)

=∑~x σ(x1) · t(x1)(x2) · (e � p1)(x1) · · · t(x`−1)(x`) · (e � p`)(x`)

= σ |= q1

= (σ, t, e) |= ~p.

We collect the main facts about the above function F that we need in thesetting of HMMs, as analogues of Lemmas 4.5.2 and 4.5.3. Proofs are verysimilar and are left to the interested reader.

Lemma 4.6.2. In the above situation,

1 γi(x) =

(∏j<i σ j |= e � p j

)· σi(x) · (e � pi)(x) · (t � qi+1)(x)σ1 |= q1

.

2 For each 1 ≤ k < ` one has:∑x1,...,xk−1

∑xk+2,...,x`

∑~y F(σ, t, e, ~x, ~y) = (σ1 |= q1) · γk(xk) · t|qk+1 (xk)(xk+1).

3 For each 1 ≤ k < ` one also has:∑x1,...,xk−1

∑xk+1,...,x`

∑y1,...,yk−1

∑yk+1,...,y` F(σ, t, e, ~x, ~y)

= (σ1 |= q1) · γk(xk) · e|pk (xk)(yk).

Proof. (of Theorem 4.6.1) The spaces X,Y are assumed to be finite, say of theform X = {x1, . . . , xn} and Y = {y1, . . . , ym}. We use (indexed) variables(

wi)1≤i≤n(zii′ )1≤i≤n,1≤i′≤n

(ui j)1≤i≤n,1≤ j≤m.

236

Page 249: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

4.6. Learning hidden Markov models 2374.6. Learning hidden Markov models 2374.6. Learning hidden Markov models 237

for the new state σ′, transition channel t′ and emmission channel e′. The func-tion that we we use for the Lagrange method is:

H(~w,~z, ~u, λ,~κ, ~µ) B∑~i,~j F(σ, t, e, x~i, y~j)

· ln(wi1 · ui1 j1 · p1(y j1 ) ·

∏1≤k<`

zik ik+1 · uik+1 jk+1 · pk+1(yik+1 ))

− λ ·((∑

i wi) − 1)−

∑i κi ·

((∑

i′ zii′ ) − 1)−

∑i µi ·

((∑

j ui j) − 1)

=∑~i,~j F(σ, t, e, x~i, y~j)·(

ln(wi1 · ui1 j1 · p1(y j1 )

)+

∑1≤k<`

ln(zik ik+1 · uik+1 jk+1 · pk+1(yik+1 )

))− λ ·

((∑

i wi) − 1)−

∑i κi ·

((∑

i′ zii′ ) − 1)−

∑i µi ·

((∑

j ui j) − 1).

As before, we take the partial derivatives of H, set them to zero, and solve theresulting equations in order to obtain the maximum. We shall only do so for thevariables ui j, corresponding to the new emmission channel e′. The other caseswork much as in the previous section. We use Lemma 4.6.2 (3) to obtain:

∂H∂ui j

(~w,~z, ~u, λ,~κ, ~µ)

=∑

~i,~j,i1=i, j1= j

F(σ, t, e, x~i, y~j)

wi1 · ui1 j1 · p1(y j1 )· wi1 · p1(y j1 )

+∑

1≤k<`

∑~i,~j,ik+1=i, jk+1= j

F(σ, t, e, x~i, y~j)

zik ik+1 · uik+1 jk+1 · pk+1(y jk+1 )· zik ik+1 · pk+1(y jk+1 ) − κi

=(σ1 |= q1) · γ1(xi) · e|p1 (xi)(y j)

ui j

+

∑1≤k<` (σ1 |= q1) · γk+1(xi) · e|pk+1 (xi)(y j)

ui j− κi

=

∑1≤k≤` (σ1 |= q1) · γk(xi) · e|pk (xi)(y j)

ui j− κi.

We then obtain, for each i,

1 =∑

j ui j =

∑j (σ1 |= q1) ·

∑1≤k≤` γk(xi) · e|pk (xi)(x j)κi

=(σ1 |= q1) ·

∑1≤k≤` γk(xi)

κi

(4.11)=

(σ1 |= q1) · (γ(xi) + γ`(xi)κi

.

237

Page 250: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

238 Chapter 4. Learning of States and Channels238 Chapter 4. Learning of States and Channels238 Chapter 4. Learning of States and Channels

Hence κi = (σ1 |= q1) · γ(xi), so that:

ui j =(σ1 |= q1) ·

∑1≤k≤` γk(xi) · e|pk (xi)(x j)

κi

=

∑1≤k≤` γk(xi) · e|pk (xi)(x j)

γ(xi) + γ`(xi)

=∑

1≤k≤`

γk(xi)γ(xi) + γ`(xi)

· e|pk (xi)(x j).

This is the formula for e′ in Theorem 4.6.1.

4.6.2 A soft drink example

We elaborate in detail the ‘crazy soft drink machine’ from [66, §9.2]. It hastwo internal positions cp and ip for ‘cola preference’ and ‘iced tea preference’,so X = {cp, ip}, with initial state:

σ = 1|cp 〉.

The transition channel t : X → X is:

t(cp) = 710 |cp 〉 + 3

10 | ip 〉 t(ip) = 12 |cp 〉 + 1

2 | ip 〉.

There are three output elements: c for ‘cola’, i for ‘iced tea’ and l for ‘lemon’,in a set Y = {c, i , l}. The machine is called crazy because it may give a lemondrink, with some probability, in both internal positions, see the transition chan-nel e : X → Y ,

e(cp) = 610 |c 〉 +

110 | i 〉 +

310 | l 〉 e(ip) = 1

10 |cp 〉 + 710 | ip 〉 +

210 | l 〉.

The sequence of observations that we wish to learn from is [l , i , c]. It forms asequence of point predicates p1 = 1l , p2 = 1i , p3 = 1c .

Pulling these predicates on Y along e : X → Y yields:

(e � p1)(cp) = 310 (e � p2)(cp) = 1

10 (e � p3)(cp) = 610

(e � p1)(ip) = 210 (e � p2)(ip) = 7

10 (e � p3)(ip) = 110 .

The sequence of filtered states starts with σ1 = σ = 1|cp 〉 and continues with:

σ2 = t � (σ1|e�p1 ) = 710 |cp 〉 + 3

10 | ip 〉

σ3 = t � (σ2|e�p2 ) = 1120 |cp 〉 + 9

20 | ip 〉.

The predicates q1, q2, q3, q4 from (4.10) on X satisfy q4 = 1 and:

q3(cp) = 610 q2(cp) = 9

200 q1(cp) = 63200

q3(ip) = 110 q2(ip) = 49

200 q1(ip) = 291000

238

Page 251: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

4.6. Learning hidden Markov models 2394.6. Learning hidden Markov models 2394.6. Learning hidden Markov models 239

The states γi = σi|qi on X from (4.11) are:

γ1 = 1|cp 〉 γ2 = 310 |cp 〉 + 7

10 | ip 〉 γ3 = 2225 |cp 〉 + 3

25 | ip 〉.

The new transition channel t′ : X → X is described by the following convexsum of two updated channels:

t′(cp) =γ1(cp)

γ1(cp)+γ2(cp) · t|q2 (cp) +γ2(cp)

γ1(cp)+γ2(cp) · t|q3 (cp)

=1

1 + 3/10·

(t(cp)(cp) · q2(cp)

(t � q2)(cp)

∣∣∣cp⟩

+t(cp)(ip) · q2(ip)

(t � q2)(cp)

∣∣∣ ip ⟩)+

3/10

1 + 3/10·

(t(cp)(cp) · q3(cp)

(t � q3)(ip)

∣∣∣cp⟩

+t(cp)(ip) · q3(ip)

(t � q3)(ip)

∣∣∣ ip ⟩)= 10

13 ·

(7/10 · 9/200

7/10 · 9/200 + 3/10 · 49/200

∣∣∣cp⟩

+3/10 · 49/200

7/10 · 9/200 + 3/10 · 49/200

∣∣∣ ip ⟩)+ 3

13 ·

(7/10 · 6/10

7/10 · 6/10 + 3/10 · 1/10

∣∣∣cp⟩

+3/10 · 1/10

7/10 · 6/10 + 3/10 · 1/10

∣∣∣ ip ⟩)= 10

13 ·(

63210

∣∣∣cp⟩

+ 147210

∣∣∣ ip ⟩)+ 3

13 ·(

4245

∣∣∣cp⟩

+ 345

∣∣∣ ip ⟩)=

(3

13 + 1465

)∣∣∣cp⟩

+(

713 + 1

65

)∣∣∣ ip ⟩= 29

65 |cp 〉 + 3665 | ip 〉 ≈ 0.4462|cp 〉 + 0.5538| ip 〉.

In a similar way we get:

t′(ip) = 67 |cp 〉 + 1

7 | ip 〉 ≈ 0.8571|cp 〉 + 0.1429| ip 〉.

The new emission channel e′ : X → Y is a convex sum of three updatedchannels:

e′(cp) =γ1(cp)

γ1(cp)+γ2(cp)+γ3(cp) · e|p1 (cp) +γ2(cp)

γ1(cp)+γ2(cp)+γ3(cp) · e|p2 (cp)

+γ3(cp)

γ1(cp)+γ2(cp)+γ3(cp) · e|p3 (cp)

=1

1 + 3/10 + 22/25· 1

∣∣∣ l ⟩ +3/10

1 + 3/10 + 22/25· 1

∣∣∣ i ⟩ +22/25

1 + 3/10 + 22/25· 1

∣∣∣c ⟩=

1109/50

∣∣∣ l ⟩ +3/10

109/50

∣∣∣ i ⟩ +22/25

109/50

∣∣∣c ⟩= 44

109 |c 〉 +15

109 | i 〉 +50

109 | l 〉 ≈ 0.4037|c 〉 + 0.1376| i 〉 + 0.4587| l 〉.

In a similar way one obtains:

e′(ip) = 641 |c 〉 +

3541 | i 〉 ≈ 0.1463|c 〉 + 0.8537| i 〉.

In the end we like to point out the rise in HMM-validity as a result of the singleBaum-Welch learning step:

(σ, t, e) |= [p1, p2, p3] = 0.0315 (γ1, t′, e′) |= [p1, p2, p3] = 0.0869.

239

Page 252: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

240 Chapter 4. Learning of States and Channels240 Chapter 4. Learning of States and Channels240 Chapter 4. Learning of States and Channels

After repeating the Baum-Welch steps ten times, this validity approaches 14 .

Remark 4.6.3. As mentioned in the beginning of this subsection, the soft drinkexample is taken from [66, §9.3.3]. The newly learned emission channel thatis computed there coincides with the channel e′ that we calculate above. Butthe new transition channel in [66] is different from the one that we obtain. Theoutcome of [66], written below as t∗, can be reconstructed as a convex sum ofthree updated channels, as in:

t∗(cp) =∑

1≤i≤3

γi(cp)∑1≤ j≤3 γ j(cp)

· t|qi+1 (cp) = 299545 |cp 〉 + 246

545 | ip 〉

≈ 0.5486|cp 〉 + 0.4514| ip 〉

t∗(ip) =∑

1≤i≤3

γi(ip)∑1≤ j≤3 γ j(ip)

· t|qi+1 (ip) = 3341 |cp 〉 + 8

41 | ip 〉

≈ 0.8049|cp 〉 + 0.1951| ip 〉.

However, this does not fit the Baum-Welch steps as given in Theorem 4.6.1.Moreover, in this way we get a lower validity than when one uses a convexsum of two, as in the above calculation of the newly learned channel t′, since:

(γ1, t′, e′) |= [p1, p2, p3] = 0.0869 (γ1, t∗, e′) |= [p1, p2, p3] = 0.0724.

Exercises

4.6.1 Prove in the general setting of Subsection 4.6.1:

(σ, t, e) |= ~p = σ1 |= q1 ≤ σ1|q1 |= q1 = (σ′, t, e) |= ~p

The first equation follows directly from (3.13).4.6.2 By repeatedly applying Baum-Welch to the crazy soda HMMH from

Subsection 4.6.2 we obtain an HMM H∞ = (σ∞, t∞, e∞) with σ∞ =

1|cp 〉 and with:

t∞(cp) = 1| ip 〉 e∞(cp) = 12 |c 〉 +

12 | l 〉

t∞(ip) = 1|cp 〉 e∞(ip) = 1| i 〉.

Check thatH∞ |= [1l , 1i , 1c] = 14 .

4.6.3 Compute the newly learned channel t′ in Subsection 4.6.2 also via theformulas in Exercise 4.5.2, as disintegration of τ = τ1 + τ2 + τ3 whereτi =

(〈id, t〉 � σi

)∣∣∣(e�pi)⊗qi+1

.4.6.4 This exercise gives an alternative way to the describe the newly learned

emission channel, in the style of Exercise 4.5.2. In the context de-scribed in Subsection 4.6.1, define for 1 ≤ i ≤ `,

ρi B(〈e, id〉 � σi

)∣∣∣pi⊗(t�qi+1) ∈ D(Y × X).

240

Page 253: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

4.7. Data, validity, and learning 2414.7. Data, validity, and learning 2414.7. Data, validity, and learning 241

1 Show that:

ρi = 〈e|pi , id〉 � γi.

2 Take ρ B∑

i ρi and prove that the disintegration ρ[1, 0

∣∣∣ 0, 1] equalsthe newly learned emission channel e′.

4.7 Data, validity, and learning

As we have described at the very beginning of this section, in learning oneadapts one’s state to given evidence, in the form of data. Specifically, onestrives to increase the validity of the data. Although we have talked severaltimes about data already, we have not yet made things precise. This is whatthis section will do: it will define what kind of data we’ll use, namely multi-sets of factors. It will then define validity of data in two fundamentally differentways, using either a ‘multiple-state’ perspective or a ‘copied-state’ perspective.This resembles the difference between drawing from an urn with replacement(multiple-state) or without replacement (copied-state). Finally the section endswith two learning theorems, one for each perspective. We shall see examplesof both forms of learning in later sections.

4.7.1 Data and its validity

Data that is used for learning often comes in the form of sequences or (multidi-mentional) tables. The order of the data items does not matter for updating, seeLemma 2.3.5 (3). Thus the order is irrelevant, but multiple occurrences of dataitems do matter. These two aspects suggest that data items should be organisedin the form of multisets, see Section 1.4. Suppose we have a state ω ∈ D(X) ona set X and we wish to improve it in order to increase the vality of data on X.What form should this data have? In the light of what we just said, we wouldexpect that data takes the form of a (natural) multiset ψ ∈ N(X) on X, with nat-ural numbers as frequencies. But a recurring theme in this book is that insteadof point predicates one can use more general fuzzy predicates or even factors.Indeed, in sections 4.5 and 4.6 we have described learning for Markov chainsand hidden Markov models with data given by lists of factors. This is useful incases when there is uncertainty about specific data items, or when data itemsare missing — so that a uniform predicate can be used, see Section 4.2. Alongthose lines we should use multisets of factors as data.

Definition 4.7.1. Data on a set X means a multiset of factors ψ ∈ M(Fact(X)

).

241

Page 254: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

242 Chapter 4. Learning of States and Channels242 Chapter 4. Learning of States and Channels242 Chapter 4. Learning of States and Channels

When such a ψ contains only point point predicates 1x, for x ∈ X, we speak ofpoint-data, and identify the data with a multiset ψ ∈ M(X).

Notice that we do not require that we have natural multisets, with naturalnumbers of frequencies, eventhough in many examples and results that will bethe case. We like to be as general as possible.

Next we wish to introduce validity for data. Before doing so we take a stepback and look at some elementary issues. Assume we have a fair coin σ =12 |H 〉 +

12 |T 〉 as state. Consider the following questions.

1 What is the probability of getting two heads? Most people will immediatelysay 1

4 . The idea behind this answer is that we flip the coin twice, where get-ting a head has probability 1

2 each time. More formally, this answer involvesa ‘multiple-state’ perspective, where the coin state σ is used twice, as in:

σ ⊗ σ |= 1H ⊗ 1H =(σ |= 1H

)·(σ |= 1H

)= 1

2 ·12 = 1

4 .

There is an alternative perspective in which the probability of getting twoheads is 1

2 , namely: the coin is tossed once, and two (honest, truthfull) ob-servers are asked to tell what they see. They will both report ’head’, so weget two heads, as required. This perspective is reflected by the computation:

σ |= 1H & 1H = σ |= 1H = 12 .

2 We can similarly ask: what is the probability of getting head and tail? Let’snot worry about taking the order into account and simply ask about firstseeing head, then tail. The first, multiple state perspective again gives 1

4 asprobability:

σ ⊗ σ |= 1H ⊗ 1T =(σ |= 1H

)·(σ |= 1T

)= 1

2 ·12 = 1

4 .

But if we toss the coin only once and look at the probability that the firstof two observers reports ‘head’ and the second one reports ‘tail’ we getoutcome zero:

σ |= 1H & 1T = σ |= 0 = 0.

3 Next assume it is dark and the observers cannot see the coin very clearly.We consider the predicate p = 0.8 · 1H + 0.2 · 1T giving 80% certainty forhead, and q = 0.7 · 1H + 0.3 · 1T with only 70% certainty. What is now theprobability of seeing p and q? The multiple-state perspective gives:

σ ⊗ σ |= p ⊗ q =(σ |= p

)·(σ |= q

)=

( 12 · 0.8 + 1

2 · 0.2)·( 1

2 · 0.7 + 12 · 0.3

)= 1

2 ·12 = 1

4 .

242

Page 255: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

4.7. Data, validity, and learning 2434.7. Data, validity, and learning 2434.7. Data, validity, and learning 243

The alternative perspective now gives:

σ |= p & q = σ |= 0.56 · 1H + 0.06 · 1T = 12 · 0.56 + 1

2 · 0.06 = 0.31.

We have referred to the first approach with tensor σ ⊗ σ of states as themultiple-state perspective. We will refer to the second approach as the copied-state perspective, since in general:

σ |= p & q = σ |= ∆ � (p ⊗ q) see Exercise 2.4.5 (2)= ∆ � σ |= p ⊗ q.

In the latter line we have a copied state. As we already know from Subsec-tion 1.8.2, in general ∆ � σ , σ ⊗ σ. Hence these two perspectives are reallydifferent. This difference also played a role in our description of covarianceand correlation in Section 2.9. These two perspectives also lead to differentnotions of validity of data.

Definition 4.7.2. Let ω ∈ D(X) be a state and ψ ∈ M(Fact(X)

)be data on X.

1 The multiple-state validity, or M-validity for short, of ψ in ω is defined as:

ω |=M ψ B∏

p

(ω |= p

)ψ(p). (4.12)

2 The copied-state validity, or C-validity, of ψ in ω is:

ω |=C ψ B ω |= &p pψ(p). (4.13)

When ψ consists of point-data, these two validities become∏

x∈supp(ψ) ω(x)ψ(x)

and ω |= &x∈supp(ψ) 1x, respectively.

For clarity: pn in point (2) is the n-fold conjunction p & · · · & p, with p0 =

1. When p is sharp, then pn = p, for n > 1. In particular, point-predicates 1x

are sharp. This formulation gives in general pn(x) = p(x)n. Similarly, r ∈ R≥0

we use pr(x) = p(x)r, see also Exercise 4.7.4.

Remark 4.7.3. What the above definition ignores is the multiple counting ofprobabilities that needs to take place in order to handle different orders. Forinstance, if we ask about the probability of getting one head and two tails fora fair coin, we should take into account the (sum of the) probabilities for thethree possible orders:

H,T,T T,H,T T,T,H.

This yields a probability of 38 , instead of 1

8 given by |=M .

243

Page 256: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

244 Chapter 4. Learning of States and Channels244 Chapter 4. Learning of States and Channels244 Chapter 4. Learning of States and Channels

More generally, for point-data ψ =∑

i ni| xi 〉 we should get as probability instate ω:

(∑

i ni)!∏i ni!

∏iω(xi)ni .

Here we use the multinomial coeffient:(N~n

)B

N!n1! · · · nk!

where N = n1 + · · · + nk.

It gives the number of ways to go from a multiset∑

i ni| xi 〉 to a list over{x1, , ..., xk}. It was already used in Example 2.1.5.

In the current setting of learning we are interested in increasing validities.Then these normalisation constants do not play a role. That’s why they areomitted in the definitions of |=M and |=C . Thus, to be precise, we should reallyrefer to |=M and |=C as unnormalised validities.

4.7.2 Learning steps for data

We collect earlier results and organise them in the form of learning steps, bothfor the multiple-state and the copied-state perspective.

Theorem 4.7.4. Let ω ∈ D(X) be a state, with data ψ ∈ N(Fact(X)

).

1 Learning from a multiple-state perspective, or M-learning, can be done asfollows.

Mlrn(ω, ψ) |=M ψ ≥ ω |=M ψ for Mlrn(ω, ψ) B∑

p

ψ(p)|ψ|· ω|p,

where |ψ| =∑

p ψ(p).2 Learning from a copied-state perspective, or C-learning, works via:

Clrn(ω, ψ) |=C ψ ≥ ω |=C ψ for Clrn(ω, ψ) B ω∣∣∣&p pψ(p) .

Proof. 1 This is a reformulation of Proposition 4.4.6 (2).2 Directly from Theorem 4.3.1 (2).

We have formulated M- and C-validity and M- and C-learning for naturalmultisets only, with natural numbers as frequencies. However, the above C-learning result works for arbitrary multisets. M-learning however depends onProposition 4.4.6 (2), which has been proven for natural numbers only.

Given data ψ ∈ N(Fact(X)

), the mappings

ω 7−→ Mlrn(ω, ψ) and ω 7−→ Clrn(ω, ψ)

in the above theorem form endofunctions D(X) → D(X). By iterating these

244

Page 257: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

4.7. Data, validity, and learning 2454.7. Data, validity, and learning 2454.7. Data, validity, and learning 245

functions one may hope to find a fixed point. This is an important part of thelearning process, which forms a topic in itself. It will not be investigated here.Instead, we concentrate on single learning steps.

When we apply point (1) to point-data ψ ∈ N(X) we get as better state:

Mlrn(ω, ψ) =∑

x

ψ(x)|ψ|

ω|1x =∑

x

ψ(x)|ψ|| x〉 = Flrn(ψ).

Hence we rediscover frequentist learning Flrn from Section 1.6. The aboveresult says that frequentist learning gives an increase of M-validity. In fact,it gives the highest possible validity. This is a standard result, see e.g. [61,Ex. 17.5], which we reproduce here.

Proposition 4.7.5. For non-empty point-data ψ ∈ M∗(X) the predicate “M-validity of ψ”

D(X)(−) |=M ψ

// R≥0

takes its maximum at the distribution Flrn(ψ) ∈ D(X) that is obtained by fre-quentist learning. Expressed differently, this means:

Flrn(ψ) ∈ argmaxω

ω |=M ψ.

Proof. Let ψ be a fixed non-empty multiset. We will seek the maximum of thefunction ω 7→ ω |=M ψ by taking the derivative with respect to ω. We will workwith the ‘log-validity’, that is, with the function ω 7→ ln(ω |=M ψ), where ln isthe monotone (natural) logarithm function. It reduces the product

∏of powers

in the definition of |=M to a sum∑

of multiplications, as in Exercise 1.2.2.Assume that the support of ψ =

∑i ri| xi 〉 is {x1, . . . , xn} ⊆ X. We look

at distributions ω ∈ D({x1, . . . , xn}); they may be identified with numbersv1, . . . , vn ∈ R≥0 with

∑i vi = 1. We thus seek the maximum of the log-validity

function:

k(~v) B ln(∑

i vi| xi 〉 |=M∑

i ri| xi 〉)

= ln(∏

i vrii

)=

∑i ri · ln(vi).

Since we have a constraint (∑

i vi) − 1 = 0 on the inputs, we can use theLagrange multiplier method for finding the maximum. We thus take anotherparameter λ in a new function:

K(~v, λ) B k(~y) − λ ·((∑

i vi) − 1)

=(∑

i ri ln(vi))− λ ·

((∑

i vi) − 1).

The partial derivatives of K are:

∂K∂vi

(~v, λ) =ri

vi− λ

∂K∂λ

(~v, λ) = 1 −∑

i vi.

245

Page 258: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

246 Chapter 4. Learning of States and Channels246 Chapter 4. Learning of States and Channels246 Chapter 4. Learning of States and Channels

Setting all of these to 0 and solving gives the required maximum:

1 =∑

i vi =∑

iri

λ=

∑i ri

λ.

Hence λ =∑

i ri and thus:

vi =ri

λ=

ri∑i ri

(1.9)= Flrn(ψ)(xi).

Informally, this proposition allows us to say that Flrn(ψ) is the distributionthat best fits the data ψ ∈ M∗(X). Alternatively, there is an inequality betweenM-validities:

ω |=M ψ ≤ Flrn(ψ) |=M ψ, for each ω ∈ D(X).

Notice that Proposition 4.7.5 requires that ψ consists of point-data. For non-point-data there is no such maximality result, see Exercise 4.7.3 below.

We conclude this section with some additional observations about frequen-tist learning. We first show how frequentist learning can be done by condition-ing a uniform state.

Proposition 4.7.6. Let multiset ϕ ∈ M∗(X) form point-data on X with (finite,non-empty) support S ⊆ X. We consider ϕ as a factor, see Remark 2.1.6. Then:

1 Flrn(ϕ) = υ|ϕ, where υ is the uniform distribution on S ;2 Flrn(r · ϕ) = Flrn(ϕ), for r ∈ R>0;3 Flrn(ϕ & ψ) = Flrn(ϕ)|ψ.

Proof. 1 Let the support S of ϕ have n elements, so that υ =∑

x∈S1n | x〉. We

simply follow the definition of conditioning:

υ|ϕ(x) =υ(x) · ϕ(x)υ |= ϕ

=1/n · ϕ(x)∑

y∈S 1/n · ϕ(y)=

ϕ(x)∑y∈S ϕ(y)

= Flrn(ϕ)(x).

2 Obvious.3 We use the point (1) twice and Lemma 2.3.5 (3) in:

Flrn(ϕ & ψ) = υ|ϕ&ψ = υ|ϕ|ψ = Flrn(ϕ)|ψ.

The last point gives a multiplicative compositionality result: in order to learnfrom the (pointwise) product of two multisets, we can learn from them succes-sively. This is nice, but it would be more useful to have an additive compo-sitionality result, involving a sum of multisets, as in Flrn(ϕ + ψ), since thenwe can add new data to what we have already learnt from an existing table(multiset) of data. Such an additive learning result will appear later in Propo-sition 4.8.7.

We conclude with a couple of additional points.

246

Page 259: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

4.7. Data, validity, and learning 2474.7. Data, validity, and learning 2474.7. Data, validity, and learning 247

Remark 4.7.7. Suppose we have a space Y = {y1, . . . , yn} and a multiset ψ =∑i ni|yi 〉 of data over Y . If ni = 0 for some i, then also Flrn(ψ)(yi) = 0. This

is sometimes undesirable, since such a zero value remains zero in subsequentupdates. This situation is referred to as the zero count problem.

A common solution is to use Laplacian learning Llrn : M(X) → D(X),which is defined as:

Llrn(ψ) B Flrn(ψ + 1), (4.14)

where 1 ∈ M(X) is the multiset∑

x 1| x〉 in which each element occurs once.Thus, for n = 3 and ψ = 3|y1 〉 + 2|y3 〉 we get:

Flrn(ψ) = 35 |y1 〉 +

25 |y3 〉 whereas Llrn(ψ) = 1

2 |y1 〉 +18 |y3 〉 +

38 |y3 〉.

Remark 4.7.8. At the beginning of this section we suggested that the multi-ple/copied state difference is comparable to the familiar urn-model with/withoutreplacement. We like to make this a bit more precise now.

Suppose that we have a state ω and we have a sequence of three predicatesp1, p2, p3 whose validity we like to determine consecutively, without replace-ment. This means that after each observation we update the state:

1 we determine ω |= p1 and proceed to the next, updated state ω|p1 ;2 we determine ω|p1 |= p2 in the updated state, and proceed to the next level

update ω|p1 |p2 = ω|p1&p2 , see Lemma 2.3.5 (3);3 we determine ω|p1&p2 |= p3.

The combined validity obtained after these steps can be rewritten via Bayes’product rule — see Proposition 2.3.3 (1) — as copied-state validity:(

ω |= p1)·(ω|p1 |= p2

)·(ω|p1&p2 |= p3

)=

(ω |= p1 & p2

)·(ω|p1&p2 |= p3

)by Bayes’ rule

= ω |= p1 & p2 & p3 again by Bayes’ rule= ω |=C 1| p1 〉 + 1| p2 〉 + 1| p3 〉.

Exercises

4.7.1 Check the following M-validities of point-data:12 |H 〉 +

12 |T 〉 |=M 1|H 〉 + 2|T 〉 = 1

814 |H 〉 +

34 |T 〉 |=M 1|H 〉 + 2|T 〉 = 9

6438 |H 〉 +

58 |T 〉 |=M 1|H 〉 + 2|T 〉 = 75

512 .

Conclude that in general, for an arbitrary multiset ψ ∈ M(X), theoperation (−) |=M ψ : D(X) → [0, 1] does not preserve convex sums ofstates.

247

Page 260: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

248 Chapter 4. Learning of States and Channels248 Chapter 4. Learning of States and Channels248 Chapter 4. Learning of States and Channels

4.7.2 Consider the channel c : {H,T } → {0, 1, 2} defined by:

c(H) = 13 |0〉 +

13 |1〉 +

13 |2〉 c(T ) = 1

2 |0〉 +13 |1〉 +

16 |2〉,

with state σ = 12 |H 〉 + 1

2 |T 〉 and point-data ψ = 3|1〉 + 6|2〉. Checkthat:

1 c � σ = 512 |0〉 +

13 |1〉 +

14 |2〉;

2 c � ψ = 3|H 〉 + 2|T 〉;3 c � σ |=M ψ = 1

110592 ;4 σ |=M c � ψ = 1

32 .

We conclude that c � σ |=M ψ and σ |=M c � ψ are different — whereasthe analogues for validity |= are the same, see Proposition 2.4.3. Aswe shall see later in Exercise 4.8.5, there is more to say.

4.7.3 In Proposition 4.7.5 we have seen that for point-data ψ ∈ M(X) thefunction (−) |=M ψ : D(X) → R≥0 takes its maximum at Flrn(ψ) =

Mlrn(ω, ψ). This exercise demonstrates that there is no analogousmaximality for non-point-data ϕ ∈ M(Fact(X)).

Take X = {0, 1} with predicates:

p1 = 12 · 10 + 3

4 · 11 p2 = 1 · 10 + 12 · 11.

We combined them into non-point-data ϕ = 1| p1 〉+1| p2 〉 ∈ M(Fact(X)).Consider the two states:

ω = 14 |0〉 +

34 |1〉 σ = 1

2 |0〉 +12 |1〉

Check that:

1 ω |=M ϕ = 55128 ≈ 0.43;

2 Mlrn(ω, ϕ) = 1655 |0〉 +

3955 |1〉;

3 Mlrn(ω, ϕ) |=M ϕ = 1057924200 ≈ 0.437;

4 σ |=M ϕ = 55128 ≈ 0.469.

4.7.4 1 Prove that for each set X, the collection Fact(X) of factors X → R≥0

on X is a multiplicative cone, via its multiplicative commutativemonoid structure (1,&) and its power scalar multiplication R≥0 ×

Fact(X)→ Fact(X) given by (s, p) 7→ ps, where ps(x) = p(x)s.2 Consider the ‘curried’ version of M-validity as a function:

M(Fact(X)) L // Fact(D(X)) given by L(ϕ)(ω) B ω |=M ϕ.

Prove that L is a morphism of cones, from multisets with their ad-ditive cone structure, to factors with their multiplicative cone struc-ture.

248

Page 261: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

4.7. Data, validity, and learning 2494.7. Data, validity, and learning 2494.7. Data, validity, and learning 249

Concretely, this amounts to:

ω |=M (ϕ + ψ) =(ω |=M ϕ

)·(ω |=M ψ

)ω |=M 0 = 1

ω |=M (s · ϕ) =(ω |=M ϕ

)s.

3 Conclude that L is the free extension — see Exercise 1.4.9 (2) —of the ‘factor-validity’ function fv : Fact(X) → Fact(D(X)) givenby:

fv(p)(ω) B ω |= p.

4.7.5 In Proposition 4.7.5 we have seen how the state Flrn(ψ) learned froma multiset ψ ∈ M(X) gives maximal M-validity |=M to ψ. There is an al-ternative way to describe Flrn(ψ) as maximal, using ordinary validity|= (plus entropy) instead of M-validity, via a formulation from [76]. Itinvolves turning the multiset ψ ∈ M(X) into an observable lnψ : X →R given by (lnψ)(x) = ln

(ψ(x)

).

The claim is:

Flrn(ψ) ∈ argmaxω

(ω |= lnψ + H(ω)

),

where H(ω) is the Shannon entropy, see Definition 2.1.3 (1), ex-pressed via the natural logarithm ln.

1 Check that:

ω |= lnψ + H(ω) = ω |= ln(ψ/ω

).

2 Write X = {x1, . . . , xn} and use variable wi for ω(xi). Consider theLagrangian:

F(~w, λ) B∑

i wi · ln(ψ(xi)/wi

)− λ ·

((∑

i wi) − 1).

Check that its partial derivatives are:

∂F∂wi

(~w, λ) = ln(ψ(xi)/wi

)− 1 − λ

∂F∂λ

(~w, λ) = (∑

i wi) − 1

3 Set these partial derivatives to zero and derive:

wi =ψ(xi)eλ+1 and eλ+1 = |ψ| =

∑i ψ(xi).

Conclude that wi = Flrn(ψ)(xi).

249

Page 262: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

250 Chapter 4. Learning of States and Channels250 Chapter 4. Learning of States and Channels250 Chapter 4. Learning of States and Channels

4.8 Learning along a channel

In the previous section we have looked at validity of data in a state and howto increase this validity by adapting the state. So far we have considered thesituation where the state and data are on the same set X. In practice, it oftenhappens that there is a difference, like in:

X ◦e // Y

state to be learned

CC

data

ZZ(4.15)

We will assume that there is a channel between the two spaces — as in theabove picture — that can be used to mediate between the given data and thestate that we wish to learn. This is what we call ‘learning along a channel’. Thislearning challenge is often described in terms of ‘hidden’ or ‘latent’ variables,since the elements of the space X are not directly accessible, but only indirectlyvia the ‘emmission’ channel e. In later chapters we shall deal with a differentsituation, where the channel e becomes a learning goal in itself. But for now,we assume that it is given.

We describe two different ways to handle this new learning situation alonga channel, depending on whether one takes the multiple-state or the copied-state perspective. In both cases we first move data ψ ∈ M

(Fact(Y)

)on Y to

data on X, via the channel e : X → Y . This can be done easily, by using thatthe multiset operation M is functorial — works not only on sets but also onfunctions between them, see Definition 1.4.1 — and that factor transformationyields a function e � (−) : Fact(Y) → Fact(X). Thus we get data on X via anew data transformation, which we shall write as:

e �M ψ B M(e � (−)

)(ψ) =

∑pψ(p)

∣∣∣e � p⟩. (4.16)

Definition 4.8.1. Let e : X → Y be a channel with data ψ ∈ M(Fact(Y)

)on

its codomain. We define M/C-learning along e from ψ as M/C-learning frome �M ψ in (4.16). Concretely this turns a state ω ∈ D(X) into a newly learnedstates on X, where:

1 for M-learning,

Mlrn(ω, e, ψ) B Mlrn(ω, e �M ψ) =∑

p

ψ(p)|ψ|· ω|e�p. (4.17)

2 for C-learning,

Clrn(ω, e, ψ) B Clrn(ω, e �M ψ)= ω

∣∣∣&p(e�p)ψ(p)

= ω|(e�p1)n1 · · · |(e�pk)nk when ψ =∑

1≤i≤k ni| pi 〉.

(4.18)

250

Page 263: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

4.8. Learning along a channel 2514.8. Learning along a channel 2514.8. Learning along a channel 251

Both formulations Mlrn(ω, e, ψ) and Clrn(ω, e, ψ) for the newly learned statescome directly from Theorem 4.7.4, using data of the form (4.16). They guar-antee a higher M/C-validty for the transformed data:

Mlrn(ω, e, ψ) |=M e �M ψ ≥ ω |=M e �M ψ

Clrn(ω, e, ψ) |=C e �M ψ ≥ ω |=C e �M ψ.

The last equation in (4.18) follows from Lemma 2.3.5 (3). Notice that we haveoverloaded the notation Mlrn and Clrn, where its meaning depends on thenumber of arguments: two for ordinary learning and three for learning along achannel.

Much of the remainder of this section will be devoted to examples from theliterature. The M/C distinction is made here, following [48], and is not part ofthe description of these examples in the literature.

We start with M-learning along a channel. We first point out that in the spe-cial case where we have point-data, M-learning along a channel in (4.17) canbe expressed via a dagger channel, as Jeffrey’s adaptation (see Section 3.11).

Lemma 4.8.2. For a state ω ∈ D(X) and channel e : X → Y with point-dataϕ ∈ M(X) we get:

Mlrn(ω, e, ϕ) = e†ω � Flrn(ϕ). (4.19)

Proof. Since:

Mlrn(ω, e, ϕ)(4.17)=

∑x

ϕ(x)|ϕ|· ω|e�1x

=∑

x

ϕ(x)|ϕ|· e†ω(x) by Proposition 3.7.2

=∑

xFlrn(ϕ)(x) · e†ω(x)

= e†ω � Flrn(ϕ) by Exercise 1.7.3 (2).

This special case occurs in the next two illustrations.

Example 4.8.3. Consider the following Bayesian network from [88, §20.3].

Bag

WrapperFlavour Holes

{0, 1}

{C, L} {R,G} {H,H⊥}

(4.20)

It involves two bags, labelled 0 and 1, containing candies. These candies can

251

Page 264: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

252 Chapter 4. Learning of States and Channels252 Chapter 4. Learning of States and Channels252 Chapter 4. Learning of States and Channels

have one of two flavours, cherry (C) or lime (L), with a red (R) or green (G)wrapper, and with a hole (H) or not (H⊥).

The network (4.20) involves three channels, written as f : {0, 1} → {C, L},w : {0, 1} → {R,G}, h : {0, 1} → {H,H⊥}. They have equal probabilities in [88]:

f (0) = 610 |C 〉 +

410 |L〉 f (1) = 4

10 |C 〉 +6

10 |L〉w(0) = 6

10 |R〉 +4

10 |G 〉 w(1) = 410 |R〉 +

610 |G 〉

h(0) = 610 |H 〉 +

410 |H

⊥ 〉 h(1) = 410 |H 〉 +

610 |H

⊥ 〉.

Here we combine these three channels into a single 3-tuple channel 〈 f ,w, h〉 =

( f ⊗ w ⊗ h) ◦· ∆3 : {0, 1} → {C, L} × {R,G} × {H,H⊥}, as in (4.20). On input 0this tuple is:

〈 f ,w, h〉(0) = f (0) ⊗ w(0) ⊗ h(0)= 216

1000 |C,R,H 〉 +1441000 |C,R,H

⊥ 〉 + 1441000 |C,G,H 〉 +

961000 |C,G,H

⊥ 〉

+ 1441000 |L,R,H 〉 +

961000 |L,R,H

⊥ 〉 + 961000 |L,R,H 〉 +

641000 |L,R,H

⊥ 〉.

The data to learn from in [88] involves 1000 data elements in total, about com-binations of flavours, wrappers and holes. From these data we like to learn a(better) distribution on bags, giving the probabilities that the candies that pro-duce these data come from bag 0 or 1. Here we formalise the data as a multisetψ ∈ M

({C, L} × {R,G} × {H,H⊥}

)with the following multiplicities.

ψ = 273|C,R,H 〉 + 93|C,R,H⊥ 〉 + 104|C,G,H 〉 + 90|C,G,H⊥ 〉

+ 79|L,R,H 〉 + 100|L,R,H⊥ 〉 + 94|L,R,H 〉 + 167|L,R,H⊥ 〉.

We are now set for M-learning along the tuple channel 〈 f ,w, h〉 from thesepoint-data on its codomain, via the dagger of the tuple channel, see Lemma 4.8.2.In [88] this uses a prior distribution ρ = 6

10 |0〉 + 410 |1〉. We thus compute the

newly learned distribution on {0, 1} via Jeffrey’s rule:

Mlrn(ρ, 〈 f ,w, h〉, ψ)(4.19)= 〈 f ,w, h〉†ρ � Flrn(ψ) = 30891

50440 |0〉 +1954950440 |1〉

≈ 0.6124|0〉 + 0.3876|1〉.

This probability 0.6124 is exactly as computed in [88, §20.3], but without anyof the channel and dagger machinery. Moreover, what we call the ‘multiple-state’ perspective is not made explicit there.

We add another example, which combines M-learning and C-learning alonga channel.

Example 4.8.4. The following is a classical data analysis challenge from [86]that is used for instance in [21] to motivate the Expectation-Maximisation al-gorithm (see Section 4.9 below for more information). We shall describe it first

252

Page 265: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

4.8. Learning along a channel 2534.8. Learning along a channel 2534.8. Learning along a channel 253

as M-learning along a channel, as direct translation of the description in [21].The channel that we use looks much like the one in the Bayesian learning of acoin-bias in Example 2.6.2.

The example involves point-data on four kinds of animals, which we shallsimply categorise via the 4-element set 4 = {0, 1, 2, 3}. The point-data is ψ =

125|0〉 + 18|1〉 + 20|2〉 + 34|3〉 ∈ M(4). A (fixed) channel e : [0, 1] → 4 isgiven with the unit interval as sample space, and:

e(r) B ( 12 + 1

4 · r)|0〉 + 14 · (1 − r)|1〉 + 1

4 · (1 − r)|2〉 + 14 · r|3〉

The aim is to find the value r ∈ [0, 1] that best fits the data ψ.We will discretise the unit interval, like in Section 2.6 and chop it up in

N = 100 parts, giving as subspace [0, 1]N ↪→ [0, 1], see Definition 2.6.1, so wecan consider e as a channel e : [0, 1]N → 4. We start without any assumption,via the uniform distribution ω0 = υ on [0, 1]N .

We can now successively learn new distributions on [0, 1]N via:

ωi+1 B Mlrn(ωi, e, ψ)(4.19)= e†ωi � Flrn(ψ).

After running these M-learning steps 100 times we get the distribution inthe bar chart on the left in Figure 4.1. Its expected value is 0.6268214581,which differs only slightly from the computed value, see Exercise 4.8.3 (copiedfrom [21]).

With the knowledge that we have accumlated by now about learning it isgood to look back at our earlier approach to coin bias learning in Exam-ple 2.6.2. There we had a channel flip : [0, 1]N → 2 where 2 = {0, 1} is usedfor head (1) and tail (0). We had a list of data items [0, 1, 1, 1, 0, 0, 1, 1] fromwhich we wished to learn the coin bias in [0, 1]. We would now consider thedata not as a list but as a multiset ϕ = 3|0〉 + 5|1〉. In Example 2.6.2 we (also)started from a uniform prior υ on [0, 1]N and performed eight successive con-ditionings, following the data [0, 1, 1, 1, 0, 0, 1, 1]

υ|c�10 |c�11 |c�11 . . . |c�11 = υ|(c�10)3&(c�11)5 by Lemma 2.3.5 (3)= Clrn(υ, c, ϕ) by (4.18).

Thus, without knowing at the time, we already saw an example of C-learningalong a channel in Example 2.6.2.

But having seen this, we can ask: what do we get if we apply C-learningin the original example from [86] that we started from. Doing 100 steps asabove, but now with ωi+1 = Clrn(ωi, e, ψ), gives the bar chart on the right inFigure 4.1. Since this involves a focused spike, we have use N = 1000 in thediscretisation in order to get a clearer picture. The expected value is now alittle bit different, namely: 0.6267803732. Thus, the distributions obtained via

253

Page 266: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

254 Chapter 4. Learning of States and Channels254 Chapter 4. Learning of States and Channels254 Chapter 4. Learning of States and Channels

Figure 4.1 Distribution obtained via M-learning along a channel (on the left) andvia C-learning along a channel (on the right) in Example 4.8.4, on the (discretised)unit interval [0, 1].

M- or C-learning along a channel are radically different, but not their expectedvalues.

We include a somewhat similar example, but use it to illustrate another dif-ference between M-learning and C-learning, namely the number of iterationsinvolved. Since this is only one example, without a general analysis, every-thing we can conclude from it is that the number of iterations to reach a certainlevel of stability may vary considerably.

Example 4.8.5. We analyse an image-processing illustration from [74] whichcan be structured clearly in terms of two different channels, namely of theform:

[0, 1] ◦e // 3 ◦

d // 2 (4.21)

The set 3 = {0, 1, 2} refers to three forms of shapes (round dark, square dark,light), which are mapped to the set 2 = {0, 1} for only dark and light. Thechannel d : 3→ 2 is deterministic, with d(0) = d(1) = 1|0〉 and d(2) = 1|1〉.

The channel e : [0, 1]→ 3 is given, following [74], as:

e(r) B 14 |0〉 +

14 (1 + r)|1〉 + 1

4 (2 − r)|2〉.

Like in Example 4.8.4 we discretise the unit interval [0, 1] into [0, 1]N for N =

100, with uniform prior state ω0 = υ on [0, 1]N .The data used in [74] involve 63 dark objects. The aim is to learn how many

of them are round or square. In this set-up, light objects are mapped to lightobjects by the channel d, via d(2) = 1|1〉, so the number of light objects in thedata does not matter very much. We choose 37, in order to get a round numberof 100 data items in total. Thus we have point-date ψ = 63|0〉+ 37|1〉 ∈ N(Y).

The computations in [74, Table 1] quickly lead to a division of the 63 darkobjects into 25 square and 38 round, for a learned value of r = 0.520000 ∈

254

Page 267: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

4.8. Learning along a channel 2554.8. Learning along a channel 2554.8. Learning along a channel 255

[0, 1]. In this computation an a priori equal division of 63 is somehow assumedwith r = 0.379562. Here we shall approximate these outcomes in a slightlydifferent way, starting from only a uniform prior on [0, 1]N .

What we first need to determine is along which channel to learn in Dia-gram 4.21. We have a prior ω0 on [0, 1]N ⊆ [0, 1] and point-data ψ on Y , so weshould learn along the composite channel d ◦· e : [0, 1]N → Y . In M-form thisproceeds via:

ωi+1 B Mlrn(ωi, d ◦· e, ψ)(4.19)= (d ◦· e)†ωi � Flrn(ψ).

Recall that the point of this example is to learn the actual division of squaredark objects (represented by the element 0 in the set 3) and round dark objects(represented by 1 in 3). In order to obtain this division we do state transforma-tion in Diagram 4.21 along e : [0, 1] → 3, in order to obtain the distributione � ωi on the set 3. In addition, we like to learn the expected value (mean) ofωi. The following table gives an impression.

M-learning steps e � ωi mean(ωi)

i = 0 0.25|0〉 + 0.3738|1〉 + 0.3762|2〉 0.4950

i = 10 0.25|0〉 + 0.375|1〉 + 0.375|2〉 0.5000

i = 25 0.25|0〉 + 0.3764|1〉 + 0.3736|2〉 0.5057

i = 100 0.25|0〉 + 0.3793|1〉 + 0.3707|2〉 0.5174

i = 250 0.25|0〉 + 0.38|1〉 + 0.37|2〉 0.5199

The last line reflects the outcomes in [74].We now do the same experiment with C-learning along the channel d ◦· e, so

that we now take ωi+1 B Clrn(ωi, d ◦· e, ψ). Then we get the following table:

C-learning steps e � ωi mean(ωi)

i = 0 0.25|0〉 + 0.3738|1〉 + 0.3762|2〉 0.4950

i = 10 0.25|0〉 + 0.3797|1〉 + 0.3703|2〉 0.5190

i = 25 0.25|0〉 + 0.3799|1〉 + 0.3701|2〉 0.5196

i = 60 0.25|0〉 + 0.38|1〉 + 0.37|2〉 0.5198

We thus see that in this example, C-learning along a channel requires in theorder of 4-5 times fewer steps than M-learning. The resulting distributions on

255

Page 268: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

256 Chapter 4. Learning of States and Channels256 Chapter 4. Learning of States and Channels256 Chapter 4. Learning of States and Channels

Figure 4.2 Distribution obtained after 250 M-learning steps (on the left) and af-ter 60 C-learning steps (on the right) in Example 4.8.5, on the (discretised) unitinterval [0, 1].

[0, 1]N are given as bar charts in Figure 4.2. They have the same shapes as inFigure 4.1.

We turn to another example, also with a comparison between M-learningand C-learning.

Example 4.8.6. Reference [27] explains Expectation-Maximisation in com-putational biology, mainly by providing an example. It involves classificationof a coin in one of two classes, depending on its bias. There are five separatesets of data (multisets), each with ten coin outcomes head (H) and tail (T ), seethe first column in Table (4.22) below. There is a channel e : {0, 1} → {H,T }that captures two coins, with slightly different biases:

e(0) = 35 |H 〉 +

25 |T 〉 and e(1) = 1

2 |H 〉 +12 |T 〉.

Thus, the distribution e(0) shows a slight bias towards head, whereas e(1) isunbiased. By learning along e the resulting distribution on {0, 1} tells whetherthe first or the second coin in e best fits the data. The idea is that if the datacontains more heads than tails, then the first coin e(0) is most likely: the learneddistribution r|0〉 + (1 − r)|1〉 has r > 1

2 .The second and third columns of the table below give the learned distri-

butions on {0, 1}, both via C-learning and M-learning along the coin channele. Each line describes a separate new learning action starting from the uni-form distribution, independent from the outcomes of earlier lines. Explicitly,for point-data ψ ∈ M({H,T }) these updates of the uniform state υ ∈ D({0, 1})

256

Page 269: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

4.8. Learning along a channel 2574.8. Learning along a channel 2574.8. Learning along a channel 257

are given by:

Clrn(υ, e, ψ) = υ|(e�1H )ψ(H)&(e�1T )ψ(T )

= υ|(e�1H )ψ(H) |(e�1T )ψ(T )

Mlrn(υ, e, ψ) =ψ(H)

ψ(H)+ψ(T ) · υ|e�1H +ψ(T )

ψ(H)+ψ(T ) · υ|e�1T

(4.19)= e†υ �

(ψ(H)

ψ(H)+ψ(T ) |H 〉 +ψ(T )

ψ(H)+ψ(T ) |T 〉).

These formulas give us the following table.

point-data ψ C-learning M-learning

5|H 〉 + 5|T 〉 0.4491|0〉 + 0.5509|1〉 0.4949|0〉 + 0.5051|1〉9|H 〉 + 1|T 〉 0.805|0〉 + 0.195|1〉 0.5354|0〉 + 0.4646|1〉8|H 〉 + 2|T 〉 0.7335|0〉 + 0.2665|1〉 0.5253|0〉 + 0.4747|1〉4|H 〉 + 6|T 〉 0.3522|0〉 + 0.6478|1〉 0.4848|0〉 + 0.5152|1〉7|H 〉 + 3|T 〉 0.6472|0〉 + 0.3528|1〉 0.5152|0〉 + 0.4848|1〉

(4.22)

C-learning seems to perform binary classification best: picking up the differ-ences between the numbers of heads and tails in the data. This C-learningcolumn is as reported in [27], but uses higher precision. The term C-learningor the copied-state perspective do not occur in [27].

We conclude this section with some general observations. The most interest-ing point is that C-learning along a channel is additively compositional. Thismeans that it can handle additional data as it arrives, by performing anotherC-learning step. This is a clear advantage of C-learning over M-learning.

Proposition 4.8.7. 1 Data transformation�M from (4.16) is functorial in thesense that:

id �M ψ = ψ and (d ◦· e) �M ψ = e �M(d �M ψ

).

2 M-learning interacts well with channel composition:

Mlrn(ω, id, ψ) = Mlrn(ω, ψ)Mlrn(ω, d ◦· e, ψ) = Mlrn(ω, e, d �M ψ).

3 C-learning is additively compositional, in the sense that:

Clrn(ω, e, φ + ψ) = Clrn(Clrn(ω, e, φ), e, ψ

).

Proof. 1 We recall from Lemma 2.4.2 (5) and (6) that observable transforma-tion is functorial, so that:

id � (−) = id and (d ◦· e) � (−) =(e � (−)

)◦

(d � (−)

).

257

Page 270: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

258 Chapter 4. Learning of States and Channels258 Chapter 4. Learning of States and Channels258 Chapter 4. Learning of States and Channels

The result then follows by functoriality ofM.2 By the previous point:

Mlrn(ω, id, ψ) = Mlrn(ω, id �M ψ) = Mlrn(ω, ψ).

And:Mlrn(ω, d ◦· e, ψ) = Mlrn(ω, (d ◦· e) �M ψ)

= Mlrn(ω, e �M (d �M ψ))= Mlrn(ω, e, d �M ψ).

3 Let’s use the ad hoc notation:

e ≪ ψ B &p (e � p)ψ(p).

We then have e ≪ (φ + ψ) = (e ≪ φ) & (e ≪ ψ) since:

e ≪ (φ + ψ) = &p (e � p)(φ+ψ)(p)

= &p (e � p)φ(p)+ψ(p)

= &p (e � p)ϕ(p) & (e � p)ψ(p)

= &p (e � p)φ(p) & &p (e � p)ψ(p)

=(e ≪ φ

)(x) &

(e ≪ ψ

)(x).

Now we are almost done:

Clrn(ω, e, φ + ψ)(4.18)= ω|e≪(φ+ψ)

= ω|(e≪φ)&(e≪ψ) as just shown= ω|e≪φ|e≪ψ by Lemma 2.3.5 (3)

(4.18)= Clrn(ω, e, φ)|e≪ψ

(4.18)= Clrn

(Clrn(ω, e, φ), e, ψ

).

Exercises

4.8.1 Consider the channel e : {v,w} → {a, b, c} given by:

e(v) = 13 |a〉 +

23 |b〉 and e(w) = 1

8 |a〉 +12 |b〉 +

38 |c〉.

with (uniform) state ω = 12 |u〉+

12 |v〉 and point-data ψ = 1|a〉+2|b〉+

1|c〉.

1 Show that:

e �M ψ = 1∣∣∣ 1

3 · 1v + 18 · 1w

⟩+ 2

∣∣∣ 23 · 1v + 1

2 · 1w⟩

+ 1∣∣∣ 3

8 · 1w⟩.

2 Check that:

ω |=M e �M ψ = 53936864 ≈ 0.0146

ω |=C e �M ψ = 3512 ≈ 0.00586

258

Page 271: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

4.8. Learning along a channel 2594.8. Learning along a channel 2594.8. Learning along a channel 259

3 Prove that:

Mlrn(ω, e, ψ) = 3677 |v〉 +

4177 |w〉

Mlrn(ω, e, ψ) |=M e �M ψ = 1334767718999178496 ≈ 0.0148.

4 Prove similarly that:

Clrn(ω, e, ψ) = 1|w〉Clrn(ω, e, ψ) |=C e �M ψ = 3

256 ≈ 0.0117.

Given an informal explanation for why Clrn(ω, e, ψ) = 1|w〉.

4.8.2 In the setting of the previous exercise, consider the point-data ψ =

1|a〉 + 2|b〉 + 1|c〉 from a missing-data perspective like in Subsec-tion 4.2.1:

X Y

? a? b? b? c

One can learn a distribution on X, by updating the following jointstate, for each of the four entries in this table:

τ B 〈id, e〉 � υ = 16 |v, a〉 +

13 |v, b〉 +

116 |w, a〉 +

14 |w, b〉 +

316 |w, c〉.

Show that following the methodology of Subsection 4.2.2 for obtain-ing a distribution on X via updates of τ gives the M-learned state.

4.8.3 In the setting of Example 4.8.4, consider the function:

f (r) B(e(r) |=M ψ

), for r ∈ [0, 1].

Prove that f takes its maximum at:

15 +√

53, 809394

≈ 0.6268214978709824

Hint: Compute the derivative of ln ◦ f and solve a second order equa-tion.

4.8.4 Prove that for point-data φ, ψ,

Mlrn(ω ⊗ ρ, c ⊗ d, φ ⊗ ψ

)= Mlrn(ω, c, φ) ⊗Mlrn(ρ, d, ψ).

259

Page 272: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

260 Chapter 4. Learning of States and Channels260 Chapter 4. Learning of States and Channels260 Chapter 4. Learning of States and Channels

4.8.5 Prove that:

e � ω |=M ψ = ω |=M e �M ψ,

but in general:

e � ω |=C ψ , ω |=C e �M ψ

Hint: Recall from Exercise 2.4.1 that observable transformation �does not preserve conjunctions &.

This last inequality , tells us that there is a second way to do C-learning along a channel, namely by increasing the validity e � ω |=C ψ

via updatingω toω|e�(&p pψ(p)). This however does not seem to make somuch sense, especially when ψ consists of point-data, since 1n

x = 1x.

4.9 Expectation Maximisation

Expectation-Maximisation (EM) is the name of a collection of learning tech-niques that typically use the sum-increase approach of Lemma 4.4.2 in a (joint)iteration of two separate steps, often referred to as ‘Expection’ (E) and ‘Max-imisation’ (M). For instance, Baum-Welch for learning new hidden Markovmodels (including Markov chains) in Sections 4.5 and 4.6, M-learning like inExample 4.8.3 and C-learning like in Example 4.8.6 are all called instancesof Expectation-Maximisation. The first systematic exposition is in [21], butsince then EM has developed into an umbrella term, see [74] for backgroundinformation.

In the current channel-based setting we develop our own interpretation andelaboration of Expectation-Maximisation, where the previous section preparedthe stage and developed the ‘E’ part, corresponding to learning a state along achannel. The ‘M’ part then consists of learning the channel as well. Hence forEM we can use the following picture, as refinement of (4.15):

channel to be learned

��X ◦ // Y

state to be learned

CC

data

ZZ(4.23)

This picture indicates that EM is (potentially) a powerful technique: given twosets X,Y with data on Y , it leads to a ‘latent’ state on X and to a channel X → Ythat fit the data. However, learning both is rather fragile and may lead to wildlydifferent outcomes — in contrast to the situation in Section 4.8, where we wereonly learning the state and got good results. Indeed, we shall see that some —

260

Page 273: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

4.9. Expectation Maximisation 2614.9. Expectation Maximisation 2614.9. Expectation Maximisation 261

naive but natural — approaches do not deliver as expected, for instance becausethey stabilise immediately, already after one iteration.

We continue the previous distinction between M-validity and C-validity todescribe two possibilities for what ‘fit’ means, leading to two forms of Expec-tation-Maximisation, namely MEM and CEM. We also continue our view ondata as consisting of a multiset of factors.

4.9.1 Multiple-state Expectation-Maximisation (MEM)

We first continue an earlier example of M-learning along a channel, in order toget a concrete idea of the issue at hand.

Example 4.9.1. In Example 4.8.3, with candies in a bag (from [88]), we startedfrom a channel 〈 f ,w, h〉 : {0, 1} → {C, L} × {R,G} × {H,H⊥} with data ψ on itscodomain. This is used in what is now called the ‘E’ part of ‘EM’ to turna prior state ρ ∈ D({0, 1}) into a new state ρ′ B d � Flrn(ψ), where d =

〈 f ,w, h〉†ρ : {C, L}×{R,G}×{H,H⊥} → {0, 1} is the Bayesian inversion (dagger).As part of Maximisation in EM, we now also wish to update the three chan-

nels f : {0, 1} → {C, L}, w : {0, 1} → {R,G} and h : {0, 1} → {H,H⊥} via a‘double dagger’. We take:

dd B d†Flrn(ψ) : {0, 1} → {C, L} × {R,G} × {H,H⊥}.

We then obtain the individually learned channels f ′,w′, h′ via marginalisationof the channels:

f ′ B dd[1, 0, 0

]: {0, 1} → {C, L}

w′ B dd[0, 1, 0

]: {0, 1} → {R,G}

h′ B dd[0, 0, 1

]: {0, 1} → {H,H⊥}

This yields precisely the values reported in [88]:

f ′(0) = 0.6684|C 〉 + 0.3316|L〉 f ′(1) = 0.3887|C 〉 + 0.6113|L〉w′(0) = 0.6483|R〉 + 0.3517|G 〉 w′(1) = 0.3817|R〉 + 0.6183|G 〉h′(0) = 0.6558|H 〉 + 0.3442|H⊥ 〉 h′(1) = 0.3827|H 〉 + 0.6173|H⊥ 〉.

We generalise this approach as follows.

Theorem 4.9.2. Let e : X → Y be a channel between finite sets X,Y, with astate ω ∈ D(X) on its domain and with data ψ ∈ N(Fact(Y)) on its codomain.We already know from (4.17) that ω′ = Mlrn(ω, e, ψ) =

∑pψ(p)|ψ|·ω|e�p gives a

new state with increased M-validity: ω′ |=M e �M ψ ≥ ω |=M e �M ψ.

261

Page 274: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

262 Chapter 4. Learning of States and Channels262 Chapter 4. Learning of States and Channels262 Chapter 4. Learning of States and Channels

1 We get a further increase ω′ |=M e′ �M ψ ≥ ω′ |=M e �M ψ for the ‘better’channel e′ : X → Y defined by:

e′(x) B∑

p

ψ(p)|ψ|·ω|e�p(x)ω′(x)

· e|p(x). (4.24)

2 In the special case when ψ consists of point-data, this formula gives thedouble-dagger:

e′ =(e†ω

)†Flrn(ψ) : X → Y. (4.25)

Proof. The proof proceeds as for Proposition 4.4.6. The result follows fromthe inequality, for factors p1, . . . , pn,∏

i(ω |= e � pi) ≤

∏i(ω |= e′ � pi)

for the channel e′ given by the following convex combination of updated chan-nels:

e′(x)(y) =∑

i

ω|e�pi (x)∑j ω|e�p j (x)

· e|pi (x)(y).

We shall derive this outcome for only two predicates p1, p2 and leave the n-arygeneralisation to the interested reader.

The proof uses Lemma 4.4.2 (again), now with function

F(e, x, x′, y, y′) B ω(x) · e(x)(y) · p1(y) · ω(x′) · e(x′)(y′) · p2(y′).

Then:∑

x,x′,y,y′ F(e, x, x′, y, y′) = (ω |= e � p1) · (ω |= e � p2).We write X = {x1, . . . , xn} and Y = {y1, . . . , ym} and use variables zi j for the

channel e′ that we are seeking. We do so via the auxiliary function:

H(~z, ~λ)B

∑i,i′, j, j′ F(e, xi, xi′ , y j, y j′ ) · ln

(ω(xi) · zi j · p1(y j) · ω(xi′ ) · zi′ j′ · p2(y j′ )

)−

∑i λi ·

((∑

j zi j) − 1).

We then have partial derivatives:

∂H∂zk`

(~z, ~λ) =∑

i′, j′F(e, xk, xi′ , y`, y j′ )

zk`+

∑i, j

F(e, xi, xk, y j, y`)zk`

− λk

=ω(xk) · e(xk)(y`) · p1(y`) · (ω |= e � p2)

zkl

+ω(xk) · e(xk)(y`) · p2(y`) · (ω |= e � p1)

)zkl

− λk

∂H∂λk

(~z, ~λ) = (∑

j zk j) − 1.

262

Page 275: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

4.9. Expectation Maximisation 2634.9. Expectation Maximisation 2634.9. Expectation Maximisation 263

We set all these partial derivatives to zero and deduce that for each k the fol-lowing fraction equals one:

ω(xk) · (e � p1)(xk) · (ω |= e � p2) + ω(xk) · (e � p2)(xk) · (ω |= e � p1)λk

.

Hence we get as better, newly learned channel:

e′(xk)(y`) = zk`

=ω(xk) · e(xk)(y`) · p1(y`) · (ω |= e � p2)

λk

+ω(xk) · e(xk)(y`) · p2(y`) · (ω |= e � p1)

)λk

=ω(xk) · (e � p1)(xk) · (ω |= e � p2)

λk· e|p1 (xk)(y`)

+ω(xk) · (e � p2)(xk) · (ω |= e � p1)

)λk

· e|p2 (xk)(y`)

=(ω |= e � p2) · (ω |= e � p2)

λk· ω|e�p1 (xk) · e|p1 (xk)(y`)

+(ω |= e � p2) · (ω |= e � p2)

λk· ω|e�p2 (xk) · e|p2 (xk)(y`)

=ω|e�p1 (xk)

ω|e�p1 (xk) + ω|e�p2 (xk)· e|p1 (xk)(y`)

+ω|e�p2 (xk)

ω|e�p1 (xk) + ω|e�p2 (xk)· e|p2 (xk)(y`).

We now turn to the special case where we have point-data ψ ∈ M(Y) andshow how to derive (4.25) from (4.24). Recall that Lemma 4.8.2 tells that inthis situation ω′ = e†ω � Flrn(ψ). Since e|1y (x) = e(x)|1y = 1|y〉 we get:

e′(x)(4.24)=

∑y

ψ(y)|ψ|·ω|e�1y (x)ω′(x)

· e|1y (x)

(3.23)=

∑y

Flrn(ψ)(y) · e†ω(y)

(e†ω � Flrn(ψ))(x)· 1

∣∣∣y⟩(3.23)=

∑y

(e†ω

)†Flrn(ψ)(x)(y)

∣∣∣y⟩=

(e†ω

)†Flrn(ψ)(x).

Point (2) of this result covers the double-dagger that we have seen in Exam-ple 4.9.1, with the candies from [88]. It is worth observing that, in the situationof point (2) we get maximal M-validity, since for point-data ψ we have:

e′ � ω′ |=M ψ =(e†ω

)†Flrn(ψ) �

(e†ω � Flrn(ψ)

)|=M ψ by Lemma 4.8.2

= Flrn(ψ) |=M ψ by (3.22).

263

Page 276: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

264 Chapter 4. Learning of States and Channels264 Chapter 4. Learning of States and Channels264 Chapter 4. Learning of States and Channels

The latter M-validity is maximal, by Proposition 4.7.5.In fact, repeating M-learning along a channel, with point-data, makes no

sense, because the process reaches a fixed point in one step already. That is thecontent of the following result.

Proposition 4.9.3. Consider a channel c : X → Y with states σ ∈ D(X) andτ ∈ D(Y). Define:

d B c†σ : Y → X σ′ B d � τ ∈ D(X).

If we take the double-dagger of c, like in Theorem 4.9.2 (2):

c′ B d †τ : X → Y,

then its dagger satisfies: c′ †σ′ = d = c†σ. We then also get c′ †σ′ � τ = d � τ = σ′.

Proof. We give both a concrete and an abstract proof. For the concrete proofwe first note:

c′ � σ′ = d †τ � (d � τ)(3.22)= τ.

But then:

c′ †σ′ (y)(x)(3.23)=

σ′(x) · c′(x)(y)(c′ � σ′)(y)

(3.23)=

(d � τ)(x) · τ(y)·d(y)(x)(d�τ)(x)

τ(y)= d(y)(x).

The more abstract proof uses the category Krn of kernels from Defini-tion 3.8.2. It allows us to write the channel d as a morphism in Krn of theform:

(Y, τ) d // (X, σ′).

Theorem 3.8.3 says that taking the dagger of a morphism twice returns theoriginal morphism. In this case it gives c′ = d †† = d.

This result puts cold water on MEM and severly limits its usability. Recallthat in Section 4.8 we did get progress via multiple iterations of M-learning —notably in Examples 4.8.3, 4.8.4 and 4.8.5 — but there we kept the channel efixed and iterated the learning of new states, along the fixed channel.

Remark 4.9.4. 1 It is an intruiging question whether the authors of [88] areaware that the form EM that they use (MEM) stabilises after one iteration.It seems they are not: they calculate log-likelihoods, amounting to:

ln(〈 f ,w, h〉 � ρ |=M ψ

)≈ −2044,

and also:

ln(〈 f ′,w′, h′〉 � ρ′ |=M ψ

)≈ −2021.

264

Page 277: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

4.9. Expectation Maximisation 2654.9. Expectation Maximisation 2654.9. Expectation Maximisation 265

This gives the expected increase (as in Theorem 4.9.2). But then we findin [88, §20.3]:By the tenth iteration, the learned model is a better fit than the original model (L =

−1982.214).

This is a mysterious remark. First of all, if you know that things stabiliseafter one iteration, you are not going to iterate ten times. Secondly, it isunclear where the likelihood L = −1982.214 comes from. The closes thingthat we can reconstruct is:

ln(dd � ρ′ |=M ψ

)≈ −1979,

where dd = d†Flrn(ψ) is the double-dagger introduced in Example 4.9.1.Aside: we may indeed expect this value −1979 to be higher than the

previous one −2021 since the latter involves a tuple of marginalisations〈 f ′,w′, h′〉 taken from dd.

2 Topic modelling is an area where one uses statistical techniques to identifya (previously fixed) number of topics in documents based on word occur-rences. One approach, called Probabilistic Latent Semantic Analysis [38], isan instance of the present MEM setting. We assume that we have finite setsD of documents, W of words, and T of topics. A document-word matrix isgiven as ψ ∈ N(D×W), such that ψ(d,w) ∈ N is the number of occurrencesof word w in document d. We consider ψ as point-data. The aim is to learntwo channels e : T → D and f : T → W which assign to a topic z ∈ T thedistribution e(z) ∈ D(D) of documents, associated with topic z, and the dis-tribution f (z) ∈ D(W) of words, associated with topic z. Thus e(z)(d) is theprobability that document d is on topic z and f (z)(w) is the probability thatword w belongs to topic z.

As usual, a prior distribution σ ∈ D(T ) of topics is assumed, possiblyuniform initially. We plan to learn an updated distribution along the tuplechannel 〈e, f 〉 : T → D × W, using point-data ψ. Following MEM we canupdate σ to:

σ′ B Mlrn(σ, 〈e, f 〉, ψ)(4.19)= 〈e, f 〉†σ � Flrn(ψ).

The dagger channel 〈e, f 〉†σ : D × W → T is described concretely in [38,Eqn. (6)] as conditional probability P(z | d,w), satisfying:

P(z | d,w) =P(z) · P(d | z) · P(w | z)∑

z′ P(z′) · P(d | z′) · P(w | z′).

This corresponds precisely to the dagger that we use, since:

〈e, f 〉†σ(d,w)(z)(3.23)=

σ(z) · 〈e, f 〉(z)(d,w)(〈e, f 〉 � σ)(d,w)

=σ(z) · e(z)(d) · f (z)(w)∑

z′ σ(z′) · e(z′)(d) · f (z′)(w).

265

Page 278: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

266 Chapter 4. Learning of States and Channels266 Chapter 4. Learning of States and Channels266 Chapter 4. Learning of States and Channels

Newly learned channels e′ : T → D and f ′ : T → W are obtained in oursetting via Theorem 4.9.2 (2), namely by first taking the double daggerg : T → D ×W, from which e′ and f ′ are obtained by marginalisation:

g B(〈e, f 〉†σ

)†Flrn(ψ) with

e′ B g[1, 0

]f ′ B g

[0, 1

]More concretely,

e′(z)(d)(3.23)=

∑w

Flrn(ψ)(d,w) · 〈e, f 〉†σ(d,w)(z)

(〈e, f 〉†σ � Flrn(ψ))(z)

=∑

w

ψ(d,w) · 〈e, f 〉†σ(d,w)(z)|ψ| · σ′(z)

.

This is the newly learned conditional probability P(d | z) described in [38,Eqn. (4)]. It is unclear to what extend it is recognized in [38] that this MEMapproach stabilises after one round — as shown in Proposition 4.9.3.

4.9.2 Copied-state Expectation-Maximisation (CEM)

For the next form of Expectation-Maximisation we also first continue an earlierillustration.

Example 4.9.5. Recall Example 4.8.6 on classification of coins, where wehave a channel e : {0, 1} → {H,T } and a uniform prior state υ on {0, 1}. In thissituation we have five pieces of point-data ψ1, . . . , ψ5 ∈ N({H,T }), giving riseto five newly learned states on {0, 1} via C-learning along e, namely:

ωi B Clrn(υ, e, ψi) = υ∣∣∣&y(e�1y)ψi (y) ,

see Table (4.22) for details about this expectation-part of EM.The maximisation-part of EM in [27] can be described as follows. Five prod-

uct states ωi ⊗ Flrn(ψi) ∈ D({0, 1} × {H,T }) are formed and added up in a(uniform) convex sum:

τ B∑

i15 ·

(ωi ⊗ Flrn(ψi)

).

A new channel e′ = τ[0, 1

∣∣∣ 1, 0] : {0, 1} → {H,T } is then extracted, which canbe described more concretely as follows (with the same parameters as in [27]):

e′(0) = 0.713|H 〉 + 0.287|T 〉 e′(1) = 0.5813|H 〉 + 0.4187|T 〉.

As newly learned state we can takeω′ = τ[1, 0

]=

∑i

15 ·ωi. What has happened

here?An aside: in the light of Remark 4.1.3 we notice that we do not need to force

266

Page 279: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

4.9. Expectation Maximisation 2674.9. Expectation Maximisation 2674.9. Expectation Maximisation 267

τ to be state, if we use it only for disintegration. We might as well define it as amultiset, of the form τ =

∑i ωi ⊗ ψi, where we view the states ωi as multisets.

We can then still define e′ = τ[0, 1

∣∣∣ 1, 0]. This makes the calculations a bitsimpler.

In the remainder of this section we analyse the above situation. We do not dothis in full generality, but only for point-data, since it is unclear how to handlethe general case — with multisets of factors as data.

We start with a negative, but revealing observation, telling that the sum-increase approach that we have used so succesfully so far, see Lemma 4.4.2, isnot very helpful here.

Lemma 4.9.6. Assume we have a channel e : X → Y with a state ω ∈ D(X)and point-data ψ ∈ M(Y). Finding a better channel e′ : X → Y for C-learningalong e, via the sum-increase method, yields the constant channel e′(x) =

Flrn(ψ).

Proof. We are interested in increasing the C-validity:

ω |=C e �M ψ = ω |= &y(e � 1y)ψ(y) =∑

x ω(x) ·∏

y e(x)(y)ψ(y).

Our aim is to do so via a better channel e′. Therefor we look at the function:

F(e, x) B ω(x) ·∏

y e(x)(y)ψ(y) =∏

y ω(x) · e(x)(y)ψ(y).

We write X = {x1, . . . , xn} and Y = {y1, . . . , ym} and look for zi j representinge′(xi)(y j). We use the function:

H(~z, ~λ) B∑

i F(e, xi) · ln(∏

j ω(xi) · zψ(y j)i j

)−

∑i λi ·

((∑

j zi j) − 1)

=∑

i, j F(e, xi) ·(

ln(ωi) + ψ(y j) · ln(zi j))−

∑i λi ·

((∑

j zi j) − 1).

It has partial derivatives:

∂H∂zk`

(~z, ~λ) =F(e, xk) · ψ(y`)

zk`− λk

∂H∂λk

(~z, ~λ) = (∑

j zk j) − 1.

When we set these to zero we get:

1 =∑` zk` =

∑`

F(e, xk) · ψ(y`)λk

=F(e, xk) · |ψ|

λk.

Hence:

zk` =F(e, xk) · ψ(y`)

λk=

F(e, xk) · ψ(y`)F(e, xk) · |ψ|

=ψ(y`)|ψ|

= Flrn(ψ)(y`).

Getting a constant channel is not very useful in an iterative learning process.But that is not what is used in Example 4.9.5. The crucial difference there isthat we have multiple point-data elements ψi from which we learn a single

267

Page 280: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

268 Chapter 4. Learning of States and Channels268 Chapter 4. Learning of States and Channels268 Chapter 4. Learning of States and Channels

channel. It turns out that what happens there is a mixture of C-learning — toget multiple states ωi, one for each ψi — and M-learning — to get the newchannel e′. This is what we will look at next.

4.9.3 Combined M- and C-versions of Expectation-Maximisation(MCEM)

Starting in Section 4.7 we have described data as a multiset of factors, with amultiset of elements as a special case — called point-data. But what if we havemultiple portions of data, as in Example 4.9.5, where there are five multisetsfrom which to learn. How do we handle such multiple portions more abstractly,and what is the appropriate form of validity? In this section we briefly look intothis matter, without claiming to have the definitive story on this matter.

We have identified data on a set X with a multiset of factors ψ ∈ M(Fact(X)).But what if we have multiple data ψi? How to organise them? Well, we cansimply climb the abstraction ladder a bit further, and use multisets of multisetsof factors, of the form Ψ ∈ M(M(Fact(X))). As special case we can use point-predicates as factors, which we can identify with a multiset of multisets Ψ ∈

M(M(X)). For instance, in Example 4.9.5 one has such a multiset of multisets,namely Ψ = 1|ψ1 〉 + 1|ψ2 〉 + 1|ψ3 〉 + 1|ψ4 〉 + 1|ψ5 〉.

Now let’s return to the general case of Ψ ∈ M(M(Fact(X))). What wouldbe its validity be in a state ω. One possible, quite natural, interpretation is touse a combination of M- and C-validity, as in:

ω |=MC Ψ B∏

ϕ

(ω |=C ϕ

)Ψ(ϕ) (4.13)=

∏ϕ

(ω |= &p pϕ(p)

)Ψ(ϕ). (4.26)

Here we use the M-validity-approach on the outside, with C-validity inside.There is no obvious way to it the other way around. This MC-validity includesboth M-validity and C-validity as special cases, see Exercise 4.9.3.

The form of this validity definition (4.26) suggests how to do MC-learning,namely as M-learning with the data Ψ ∈ M(Fact(X)) given by:

Ψ B∑

ϕΨ(ϕ)

∣∣∣ &p pϕ(p) ⟩so that ω |=MC Ψ = ω |=M Ψ.

Thus we can express MC-learning in terms of M-learning:

MClrn(ω,Ψ) B Mlrn(ω,Ψ)

=∑

ϕ

Ψ(ϕ)

|Ψ|· ω|&p pϕ(p) by Theorem 4.7.4 (1)

=∑

ϕ

Ψ(ϕ)|Ψ|· Clrn(ω, ϕ) by Theorem 4.7.4 (2).

(4.27)

The next question is: what if we have a channel e : X → Y , with a state

268

Page 281: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

4.9. Expectation Maximisation 2694.9. Expectation Maximisation 2694.9. Expectation Maximisation 269

ω ∈ D(X) and data Ψ ∈ M(M(Fact(Y)))? We can proceed very much likein (4.16), but now using functoriality ofM ◦M =M2.

e �M2 Ψ B M2(e � (−))(Ψ)

=∑

ϕΨ(ϕ)

∣∣∣M(e � (−)

)(ϕ)

⟩=

∑ϕ

Ψ(ϕ)∣∣∣e �M ϕ

⟩=

∑ϕ

Ψ(ϕ)∣∣∣ ∑p ϕ(p)|e � p〉

⟩.

(4.28)

We then get:

ω |=MC e �M2 Ψ(4.26)=

∏ϕ

(ω |=C e �M ϕ

)Ψ(ϕ)

=∏

ϕ

(ω |=C &p(e � p)ϕ(p)

)Ψ(ϕ).

(4.29)

The transformation e � (−) now occurs within the conjunction &. This is sig-nificant, since such factor transformations and conjunctions do not commute,see Exercise 2.4.1. As a result, we cannot reduce MC-learning along a chan-nel to M- and/or C-learning along a channel. However, we can say somethingwhen we have a multiset of point-data.

Theorem 4.9.7. Let e : X → Y be a channel with a state ω ∈ D(X). For amultiset Ψ ∈ N(N(Y)) of point-data we are interested in increasing the “MC”validity:

ω |=MC e �N2 Ψ =∏

ϕ

(ω |= &y (e � 1y)ϕ(y)

)Ψ(ϕ).

Then we can get a better state and channel via MC-learning:

ω′ B∑

ϕ

Ψ(ϕ)|Ψ|· Clrn(ω, e, ϕ)

e′(x)(y) B∑ϕ Ψ(ϕ) · ϕ(y) · Clrn(ω, e, ϕ)(x)∑ϕ Ψ(ϕ) · |ϕ| · Clrn(ω, e, ϕ)(x)

.

(4.30)

Proof. We keep things simple and shall prove the result for the special casewhere Ψ = 1|ϕ1 〉 + 1|ϕ2 〉. We thus wish to increase the product:(

ω |= e �M ϕ1)·(ω |= e �M ϕ2

),

by finding both a better ω and e. We thus look at the function:

F(ω, e, x, x′) B ω(x) ·(&y(e � 1y)ϕ1(y)(x) · ω(x′) ·

(&y′ (e � 1y′ )ϕ2(y′)(x′)

=∏

y,y′ω(x) · e(x)(y)ϕ1(y) · ω(x′) · e(x′)(y′)ϕ2(y′).

As we have done before, we write X = {x1, . . . , xn} and Y = {y1, . . . , ym} and

269

Page 282: Structured Probabilistic Reasoning · ematics, computer science, artificial intelligence and related fields — with a basic background in probability and in algebra and logic —

270 Chapter 4. Learning of States and Channels270 Chapter 4. Learning of States and Channels270 Chapter 4. Learning of States and Channels

use variables wi for ω′(xi) and zi j for e′(xi)(y j). In order to use Lemma 4.4.2we define:

H(~w,~z, κ, ~λ)B

∑i,i′ F(ω, e, xi, xi′ ) · ln

(∏j, j′ wi · z

ϕ1(y j)i j · wi′ · z

ϕ2(y j′ )i′ j′

)− κ ·

((∑

i wi) − 1)−

∑i λi ·

((∑

j zii) − 1)

=∑

i,i′, j, j′ F(ω, e, xi, xi′ ) ·(

ln(wi) + ln(wi′ ) + ϕ1(y j) · ln(zi j) + ϕ2(y j′ ) · ln(zi′ j′ ))

− κ ·((∑

i wi) − 1)−

∑i λi ·

((∑

j zii) − 1).

We first look at the following two partial derivatives.

∂H∂κ

(~w,~z, κ, ~λ) = (∑

i wi) − 1

∂H∂wk

(~w,~z, κ, ~λ)

=∑

i′F(ω, e, xk, xi′ )

wk+

∑i

F(ω, e, xi, xk)wk

− κ

=

∏j ω(xk) · (e � 1y j )(xk)ϕ1(y j) · (ω |= &y (e � 1y)ϕ2(y))

wk

+(ω |= &y (e � 1y)ϕ1(y)) ·

∏j ω(xk) · (e � 1y j )(xk)ϕ2(y j)

wk− κ

=(ω |= &y (e � 1y)ϕ1(y)) · (ω |= &y (e � 1y)ϕ2(y))

wk·(ω|&y(e�1y)ϕ1(y) (xk) + ω|&y(e�1y)ϕ2(y) (xk)

)− κ

=ω |=MC e �N2 Ψ

wk·(Clrn(ω, e, ϕ1)(xk) + Clrn(ω, e, ϕ2)(xk)

)− κ.

Setting these to zero gives:

   1 = ∑_k w_k
     = ∑_k [ ω |=MC e ≪_{N²} Ψ ] / κ · ( Clrn(ω, e, ϕ_1)(x_k) + Clrn(ω, e, ϕ_2)(x_k) )
     = (2/κ) · ( ω |=MC e ≪_{N²} Ψ ).

Hence κ = 2 · ( ω |=MC e ≪_{N²} Ψ ), so that:

   ω′(x_k) = w_k = 1/2 · Clrn(ω, e, ϕ_1)(x_k) + 1/2 · Clrn(ω, e, ϕ_2)(x_k).

Thus, the newly learned state ω′ is a convex combination of C-learned states along e.


We turn to the other two partial derivatives of H.

   ∂H/∂λ_k (w⃗, z⃗, κ, λ⃗) = (∑_j z_{kj}) − 1

   ∂H/∂z_{kℓ} (w⃗, z⃗, κ, λ⃗)
      = ∑_{i′} F(ω, e, x_k, x_{i′}) · ϕ_1(y_ℓ) / z_{kℓ} + ∑_i F(ω, e, x_i, x_k) · ϕ_2(y_ℓ) / z_{kℓ} − λ_k
      = [ ∏_j ω(x_k) · (e ≪ 1_{y_j})(x_k)^{ϕ_1(y_j)} · ( ω |= &_y (e ≪ 1_y)^{ϕ_2(y)} ) · ϕ_1(y_ℓ) ] / z_{kℓ}
        + [ ( ω |= &_y (e ≪ 1_y)^{ϕ_1(y)} ) · ∏_j ω(x_k) · (e ≪ 1_{y_j})(x_k)^{ϕ_2(y_j)} · ϕ_2(y_ℓ) ] / z_{kℓ} − λ_k
      = [ ( ω |= &_y (e ≪ 1_y)^{ϕ_1(y)} ) · ( ω |= &_y (e ≪ 1_y)^{ϕ_2(y)} ) ] / z_{kℓ}
        · ( ϕ_1(y_ℓ) · ω|_{&_y (e ≪ 1_y)^{ϕ_1(y)}}(x_k) + ϕ_2(y_ℓ) · ω|_{&_y (e ≪ 1_y)^{ϕ_2(y)}}(x_k) ) − λ_k
      = [ ω |=MC e ≪_{N²} Ψ ] / z_{kℓ} · ( ϕ_1(y_ℓ) · Clrn(ω, e, ϕ_1)(x_k) + ϕ_2(y_ℓ) · Clrn(ω, e, ϕ_2)(x_k) ) − λ_k.

When these partial derivatives are zero we can derive:

   1 = ∑_ℓ z_{kℓ}
     = ∑_ℓ [ ω |=MC e ≪_{N²} Ψ ] / λ_k · ( ϕ_1(y_ℓ) · Clrn(ω, e, ϕ_1)(x_k) + ϕ_2(y_ℓ) · Clrn(ω, e, ϕ_2)(x_k) )
     = [ ω |=MC e ≪_{N²} Ψ ] / λ_k · ( |ϕ_1| · Clrn(ω, e, ϕ_1)(x_k) + |ϕ_2| · Clrn(ω, e, ϕ_2)(x_k) ).

But then:

   e′(x_k)(y_ℓ) = z_{kℓ} = [ ϕ_1(y_ℓ) · Clrn(ω, e, ϕ_1)(x_k) + ϕ_2(y_ℓ) · Clrn(ω, e, ϕ_2)(x_k) ]
                           / [ |ϕ_1| · Clrn(ω, e, ϕ_1)(x_k) + |ϕ_2| · Clrn(ω, e, ϕ_2)(x_k) ].

What remains is to show that the ω′ and e′ in Example 4.9.5 are instances of the formulas (4.30) for Ψ = ∑_{i≤5} 1|ψ_i⟩. This is immediate for:

   ω′ := ∑_i (1/5) · ω_i = ∑_i (Ψ(ψ_i)/|Ψ|) · Clrn(ω, e, ψ_i).

For e′ = τ[0, 1 | 1, 0], where τ = ∑_i ω_i ⊗ ψ_i, we have:

   e′(x)(y) := τ(x, y) / ∑_z τ(x, z)
             = [ ∑_i ω_i(x) · ψ_i(y) ] / [ ∑_{i,z} ω_i(x) · ψ_i(z) ]
             = [ ∑_i Ψ(ψ_i) · Clrn(ω, e, ψ_i)(x) · ψ_i(y) ] / [ ∑_i Ψ(ψ_i) · Clrn(ω, e, ψ_i)(x) · |ψ_i| ].


We conclude that what is called Expectation-Maximisation in [27], as worked out in Example 4.9.5, is a mixture of C-learning and M-learning.

When this form of learning is repeated 10 times, each time with the same data, one gets the following state and channel.

   ω = 0.5376|0⟩ + 0.4624|1⟩
   e(0) = 0.7899|H⟩ + 0.2101|T⟩        e(1) = 0.5089|H⟩ + 0.4911|T⟩.

These outcomes are basically the same as in [27], where the small differences seem to be due to rounding errors: [27] uses only two decimals, probably also during the computations.
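As a rough sketch, this repetition can be expressed as a simple loop around the mc_update function from the sketch after Theorem 4.9.7 (assumed to be in scope here). The data below are placeholders, not the actual counts of Example 4.9.5; with that example's data one should obtain values close to those displayed above.

```python
# Repeated MC-learning with the same data, reusing mc_update from the sketch
# after Theorem 4.9.7 (assumed to be in scope). The data are placeholders.
omega = {0: 0.5, 1: 0.5}
e = {0: {'H': 0.6, 'T': 0.4}, 1: {'H': 0.5, 'T': 0.5}}
Psi = {(('H', 8), ('T', 2)): 1, (('H', 4), ('T', 6)): 1, (('H', 7), ('T', 3)): 1}

for _ in range(10):               # each round uses the same data
    omega, e = mc_update(omega, e, Psi)
print(omega, e)
```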

Exercises

4.9.1   Consider point-data ψ = 3|0⟩ + 2|1⟩ + 5|2⟩ ∈ M(3), together with the parameterised state flip(r) = r|1⟩ + (1 − r)|0⟩ ∈ D(2) and the parameterised channel c(s, t) : 2 → 3 given by:

            c(s, t)(0) = s|0⟩ + (1 − s)|1⟩        c(s, t)(1) = (1 − t)|1⟩ + t|2⟩.

        The parameters r, s, t are all in the unit interval [0, 1].

        1  Compute c(s, t) ≫ flip(r).
        2  Show that the equation c(s, t) ≫ flip(r) = Flrn(ψ) yields the equations:

               s = 3/(10r)        t = 1/(2(1 − r))

           for r ∈ (0, 1). This gives a family of solutions.
        3  Now let's take r = s = t = 1/2 and write ω = flip(1/2) and e = c(1/2, 1/2). Our aim is to learn better values for r, s, t via the M-learning version of EM, starting with ω, e. Compute e†_ω : 3 → 2 and show that:

               Mlrn(ω, e, ψ) = e†_ω ≫ Flrn(ψ) = 2/5|0⟩ + 3/5|1⟩,    using (4.19).

        4  Show that the double-dagger (e†_ω)†_{Flrn(ψ)} : 2 → 3 satisfies:

               (e†_ω)†_{Flrn(ψ)}(0) = 3/4|0⟩ + 1/4|1⟩        (e†_ω)†_{Flrn(ψ)}(1) = 1/6|1⟩ + 5/6|2⟩.

        5  Check that the values r = 2/5 and s = 3/4, t = 5/6 that emerge from MEM in points (3) and (4) are instances of the family of solutions described in (2).
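For points (3) and (4), the Bayesian inversions can be computed mechanically. The sketch below assumes the usual formula e†_ω(y)(x) ∝ ω(x) · e(x)(y) for the dagger; the helper names dagger and push are ad hoc, and the printed distributions correspond to the ones asked for in the exercise.

```python
# A small computational aid for points (3) and (4): Bayesian inversion (dagger)
# of a channel, and pushing the frequentist-learned state through it.
def dagger(c, sigma):
    """Bayesian inversion c†_sigma : Y -> X of c : X -> Y with respect to sigma."""
    ys = {y for x in c for y in c[x]}
    inv = {}
    for y in ys:
        weights = {x: sigma[x] * c[x].get(y, 0) for x in sigma}
        total = sum(weights.values())
        inv[y] = {x: w / total for x, w in weights.items()}
    return inv

def push(c, sigma):
    """State transformation c >> sigma."""
    out = {}
    for x, px in sigma.items():
        for y, py in c[x].items():
            out[y] = out.get(y, 0) + px * py
    return out

psi = {0: 3, 1: 2, 2: 5}                                   # point-data on 3 = {0,1,2}
flrn = {y: n / sum(psi.values()) for y, n in psi.items()}  # Flrn(psi)

omega = {0: 0.5, 1: 0.5}                                   # flip(1/2)
e = {0: {0: 0.5, 1: 0.5}, 1: {1: 0.5, 2: 0.5}}             # c(1/2, 1/2)

d = dagger(e, omega)          # e†_omega : 3 -> 2
print(push(d, flrn))          # M-learned state, point (3)
print(dagger(d, flrn))        # double dagger (e†_omega)†_{Flrn(psi)}, point (4)
```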


4.9.2   Consider the topic modelling situation from Remark 4.9.4 (2). There we have a joint distribution τ ∈ D(D × W) on documents and words given by:

            τ = ⟨e, f⟩ ≫ σ.

        Show that the channel τ[0, 1 | 1, 0] : D → W, extracted via disintegration, equals f ◦· e†_σ, see also [38, Eqn. (7)].

4.9.3   Let Φ ∈ M(Fact(X)) be data on a set X. There are two ways of turning Φ into a multiset in M(M(Fact(X))), via the unit maps unit : A → M(A), given by unit(a) = 1|a⟩, from Subsection 1.4.2. Concretely:

            unit(Φ) = 1|Φ⟩        M(unit)(Φ) = ∑_p Φ(p) |unit(p)⟩ = ∑_p Φ(p) |1|p⟩⟩.

        Check that:

            ω |=MC unit(Φ) = ω |=C Φ        ω |=MC M(unit)(Φ) = ω |=M Φ.

4.9.4   We now use the approach of the previous exercise in the setting of Theorem 4.9.7. Let ω ∈ D(X), e : X → Y and Φ ∈ M(Y).

        1  Show that the newly learned state and channel ω′ and e′ in (4.30), obtained from unit(Φ) ∈ M(M(Y)), are:

               ω′ = Clrn(ω, e, Φ)        e′(x) = Flrn(Φ).

        2  Show also that ω′ and e′ obtained from M(unit)(Φ) ∈ M(M(Y)) are:

               ω′ = Mlrn(ω, e, Φ)        e′ = (e†_ω)†_{Flrn(Φ)}.

4.10 A clustering example

This section describes some experiments with Expectation-Maximisation for classifying clustered points in a square grid. The grid that we use is discrete and has size 30 × 30. We randomly choose three centers in the plane and then randomly choose 10 points around these centers (which may possibly overlap). This gives pictures as in the column on the left in Figure 4.3, where chosen points are indicated by crosses ×.

Our aim is to classify the given points in the plane into one of three categories, indicated by the colours yellow (Y), red (R) or blue (B). Several example classifications are in the column on the right in Figure 4.3. Visually it is clear which classifications succeed (partially) or fail. We construct such a classification algorithm based on the techniques from the previous sections. There may be other techniques which perform far better, but we will use what we have seen so far, as an illustration.

This situation is an instance of the Expectation-Maximisation set-up in Diagram (4.23), with a channel of the form:

   {Y, R, B} = X → Y = 30 × 30.

The crossed points form the data on the codomain Y = 30 × 30. Our aim is to learn both a state ω on X = {Y, R, B} and a channel e : X → Y. For each crossed point y = (y_1, y_2) ∈ 30 × 30 we can then compute its classification in {Y, R, B} via the dagger of the channel e, namely via the distribution:

   e†_ω(y) ∈ D({Y, R, B}).

The colour that we assign to point y is the argmax of this distribution: the colour with the highest probability, see Subsection 1.4.4. This is an important methodological point: we classify via the dagger of the channel in an EM learning situation (4.23).
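In code, this classification step is short: the posterior over colours at a point y is proportional to ω(x) · e(x)(y), and the assigned colour is its argmax. The state, the row of channel values e(·)(y) and the numbers below are made-up examples for illustration only.

```python
# Classifying one grid point y via the dagger e†_omega: the posterior over
# colours is proportional to omega(x) * e(x)(y); we then take its argmax.
omega = {'Y': 0.3, 'R': 0.4, 'B': 0.3}
e_at_y = {'Y': 0.002, 'R': 0.0007, 'B': 0.004}   # hypothetical values e(x)(y)
posterior = {x: omega[x] * e_at_y[x] for x in omega}
total = sum(posterior.values())
posterior = {x: p / total for x, p in posterior.items()}
colour = max(posterior, key=posterior.get)       # argmax, see Subsection 1.4.4
print(posterior, colour)
```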

In order to learn the state ω and channel e we massage the data into the right form. Each crossed point y = (y_1, y_2) ∈ 30 × 30 is turned into a predicate p_y : 30 × 30 → [0, 1]. It is zero almost everywhere on 30 × 30, except on 9 places (1% only), as indicated by the sub-grid of probabilities below, with the crossed point y in the center.

   1/2   3/4   1/2
   3/4    1    3/4
   1/2   3/4   1/2

These crossed points are chosen to be never on the border of the grid, so there is enough room around them to define a predicate in this way. Explicitly, for arbitrary (x_1, x_2) ∈ 30 × 30,

   p_y(x_1, x_2) =
      1      if x_1 = y_1 and x_2 = y_2
      3/4    if (x_1 = y_1 and (x_2 = y_2 + 1 or x_2 = y_2 − 1)) or (x_2 = y_2 and (x_1 = y_1 + 1 or x_1 = y_1 − 1))
      1/2    if (x_1 = y_1 − 1 and (x_2 = y_2 + 1 or x_2 = y_2 − 1)) or (x_1 = y_1 + 1 and (x_2 = y_2 + 1 or x_2 = y_2 − 1))
      0      otherwise.


Figure 4.3  Four examples of data in a plane on the left and their classification into three groups (yellow, blue, red) on the right. The third row contains some errors, given by yellow dots in a region that should be entirely red. The last row describes a failed classification.


The probabilities around the crossed point y incorporate the idea of distance to y. In this way we can organise the data as a multiset of predicates, represented in the sequel as Ψ ∈ N(Pred(30 × 30)), with a predicate p_y for each crossed point y.
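A small sketch of this data preparation step, assuming the grid is represented by integer coordinates and predicates by dictionaries of their non-zero values; the crossed points listed are made-up examples.

```python
# Sketch of how the data multiset Psi of predicates can be built: each crossed
# point y gets the 3x3 predicate p_y described above, represented as a dict on
# the 30x30 grid (zero entries omitted).
GRID = 30
WEIGHTS = {(0, 0): 1.0,
           (1, 0): 0.75, (-1, 0): 0.75, (0, 1): 0.75, (0, -1): 0.75,
           (1, 1): 0.5, (1, -1): 0.5, (-1, 1): 0.5, (-1, -1): 0.5}

def predicate(y):
    """The predicate p_y around a crossed point y = (y1, y2), not on the border."""
    y1, y2 = y
    return {(y1 + d1, y2 + d2): w for (d1, d2), w in WEIGHTS.items()}

crossed_points = [(4, 7), (5, 7), (20, 22), (21, 23), (11, 15)]  # hypothetical data
Psi = [predicate(y) for y in crossed_points]   # one predicate per crossed point
```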

We now come to a crucial decision, namely which form of EM to use: M-learning, or C-learning, or some form of MC-learning. Experimenting with this example shows that none of these approaches works well if we start from an arbitrary state ω ∈ D(X) and channel e : X → Y and iterate towards improvements: this often leads to failed classifications like in the last row of Figure 4.3. Things improve if we 'help' the channel a little bit, by initially incorporating information about where the clusters are, while still choosing the initial state ω randomly. The learning approach that performs best, experimentally, is the form of MC-learning that we describe next.

Proposition 4.10.1. Let σ ∈ D(X) and c : X → Y be given, for finite sets X, Y, together with data Φ ∈ N(Fact(Y)). Consider these data as a multiset of point-data Φ′ ∈ N(M(Y)), via Remark 2.1.6, where:

   Φ′ := ∑_p Φ(p) | ∑_y p(y)|y⟩ ⟩.

Using MC-learning along the channel c with Φ′ yields, according to Theorem 4.9.7, as better state:

   σ′ = ∑_p (Φ(p)/|Φ|) · Clrn(σ, c, ∑_y p(y)|y⟩) = ∑_p (Φ(p)/|Φ|) · σ|_{&_y (c ≪ 1_y)^{p(y)}},

with better channel c′ = τ[0, 1 | 1, 0], where τ ∈ M(X × Y) is the multiset:

   τ := ∑_p (Φ(p)/|Φ|) · σ|_{&_y (c ≪ 1_y)^{p(y)}} ⊗ ( ∑_y p(y)|y⟩ ).

Proof. The equation for σ′ follows directly from the formula for ω′ in (4.30), where we use that for point-data ϕ ∈ M(Y) one has Clrn(σ, c, ϕ) = σ|_{&_y (c ≪ 1_y)^{ϕ(y)}}, see (4.18).

The above description of c′ matches e′ in (4.30) since:

   c′(x)(y) = τ(x, y) / ∑_z τ(x, z)
            = [ ∑_p Φ(p)/|Φ| · σ|_{&_y (c ≪ 1_y)^{p(y)}}(x) · p(y) ] / [ ∑_{z,p} Φ(p)/|Φ| · σ|_{&_y (c ≪ 1_y)^{p(y)}}(x) · p(z) ]
            = [ ∑_p Φ(p) · Clrn(σ, c, ∑_y p(y)|y⟩)(x) · p(y) ] / [ ∑_p Φ(p) · Clrn(σ, c, ∑_y p(y)|y⟩)(x) · |p| ].
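A minimal sketch of this update step, with the state as a dictionary over the three colours and the channel as three dictionaries over grid points (each predicate occurring with multiplicity 1, as in the experiment); the helper name update is ad hoc. In the experiment described next, such an update step is iterated 25 times.

```python
# Sketch of one update step from Proposition 4.10.1, for data given as a list
# of predicates on the grid (each with multiplicity 1, as produced above).
# Assumes every colour assigns non-zero probability to the predicate points.
from math import prod

def update(sigma, c, predicates):
    """One MC-learning step: returns the better state sigma' and channel c'."""
    size = len(predicates)
    # posterior over colours for each predicate p: sigma|_{&_y (c << 1_y)^{p(y)}}
    posteriors = []
    for p in predicates:
        w = {x: sigma[x] * prod(c[x].get(y, 0.0) ** v for y, v in p.items())
             for x in sigma}
        total = sum(w.values())
        posteriors.append({x: w[x] / total for x in w})

    new_sigma = {x: sum(post[x] for post in posteriors) / size for x in sigma}

    # tau(x, -) accumulated per colour, then normalised (the disintegration step)
    new_c = {}
    for x in sigma:
        tau_x = {}
        for p, post in zip(predicates, posteriors):
            for y, v in p.items():
                tau_x[y] = tau_x.get(y, 0.0) + post[x] * v / size
        norm = sum(tau_x.values())
        new_c[x] = {y: t / norm for y, t in tau_x.items()}
    return new_sigma, new_c
```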

We finish by describing how we get the classification results in the column on the right in Figure 4.3. The starting point is the data Ψ ∈ M(Pred(30 × 30)) as described at the beginning of this section. We then pick an arbitrary ω ∈ D({Y, R, B}) as initial state. A channel e : {Y, R, B} → 30 × 30 is chosen by picking three distributions e(Y), e(R), e(B) on the 30 × 30 grid. We then iterate 25 computations of better ω′ and e′ according to Proposition 4.10.1. Choosing the channel e completely randomly leads to many failed classification results, because the iterations get stuck in a local optimum. We need to point them in the right direction by 'helping' the channel e a little bit. We can do so by mixing the three cluster centers (chosen at the very beginning of each experiment) into the choices of the three states e(Y), e(R), e(B). One can steer the amount of randomness in this mix via a scaling factor. The more randomness, the poorer the classification results.

What we conclude at the end of this chapter is that in a general learning situation (4.23) with a channel X → Y, with data on Y, and with the methods that we have discussed:

• learning only the 'latent' state on X, when the channel X → Y is given (and fixed), works well;

• learning both the state on X and the channel X → Y works poorly in general, since the iterations can easily go off in a wrong direction. Still, one can get acceptable results if the channel is initially not chosen randomly but already fits the situation at hand in a reasonable manner.


References

[1] E. Alfsen and F. Shultz. State Spaces of Operator Algebras: Basic Theory, Orientations and C∗-products. Mathematics: Theory & Applications. Birkhäuser, Boston, 2001.
[2] C. Anderson. The end of theory: The data deluge makes the scientific method obsolete. Wired, 16:07, 2008.
[3] A. Ankan and A. Panda. Mastering Probabilistic Graphical Models using Python. Packt Publishing, Birmingham, 2015.
[4] S. Awodey. Category Theory. Oxford Logic Guides. Oxford Univ. Press, 2006.
[5] D. Barber. Bayesian Reasoning and Machine Learning. Cambridge Univ. Press, 2012. Publicly available via http://web4.cs.ucl.ac.uk/staff/D.Barber/pmwiki/pmwiki.php?n=Brml.HomePage.
[6] M. Barr and Ch. Wells. Category Theory for Computing Science. Prentice Hall, Englewood Cliffs, NJ, 1990.
[7] L. Baum, T. Petrie, G. Soules, and N. Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statistics, 41:164–171, 1970.
[8] J. Bernardo and A. Smith. Bayesian Theory. John Wiley & Sons, 2000.
[9] C. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, 2006.
[10] F. van Breugel, C. Hermida, M. Makkai, and J. Worrell. Recursively defined metric spaces without contraction. Theor. Comp. Sci., 380:143–163, 2007.
[11] H. Chan and A. Darwiche. On the revision of probabilistic beliefs using uncertain evidence. Artif. Intelligence, 163:67–90, 2005.
[12] K. Cho and B. Jacobs. The EfProb library for probabilistic calculations. In F. Bonchi and B. König, editors, Conference on Algebra and Coalgebra in Computer Science (CALCO 2017), volume 72 of LIPIcs. Schloss Dagstuhl, 2017.
[13] K. Cho and B. Jacobs. Disintegration and Bayesian inversion via string diagrams. Math. Struct. in Comp. Sci., 29(7):938–971, 2019.
[14] K. Cho, B. Jacobs, A. Westerbaan, and B. Westerbaan. An introduction to effectus theory. See arxiv.org/abs/1512.05813, 2015.
[15] F. Clerc, F. Dahlqvist, V. Danos, and I. Garnier. Pointless learning. In J. Esparza and A. Murawski, editors, Foundations of Software Science and Computation Structures, number 10203 in Lect. Notes Comp. Sci., pages 355–369. Springer, Berlin, 2017.
[16] B. Coecke and A. Kissinger. Picturing Quantum Processes. A First Course in Quantum Theory and Diagrammatic Reasoning. Cambridge Univ. Press, 2016.
[17] B. Coecke and R. Spekkens. Picturing classical and quantum Bayesian inference. Synthese, 186(3):651–696, 2012.
[18] F. Dahlqvist and D. Kozen. Semantics of higher-order probabilistic programs with conditioning. See arxiv.org/abs/1902.11189, 2019.
[19] V. Danos and T. Ehrhard. Probabilistic coherence spaces as a model of higher-order probabilistic computation. Information & Computation, 209(6):966–991, 2011.
[20] A. Darwiche. Modeling and Reasoning with Bayesian Networks. Cambridge Univ. Press, 2009.
[21] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journ. Royal Statistical Soc., 39(1):1–38, 1977.
[22] P. Diaconis and S. Zabell. Updating subjective probability. Journ. American Statistical Assoc., 77:822–830, 1982.
[23] P. Diaconis and S. Zabell. Some alternatives to Bayes' rule. Technical Report 339, Stanford Univ., Dept. of Statistics, 1983.
[24] F. Dietrich, C. List, and R. Bradley. Belief revision generalized: A joint characterization of Bayes' and Jeffrey's rules. Journ. of Economic Theory, 162:352–371, 2016.
[25] E. Dijkstra. A Discipline of Programming. Prentice Hall, Englewood Cliffs, NJ, 1976.
[26] E. Dijkstra and C. Scholten. Predicate Calculus and Program Semantics. Springer, Berlin, 1990.
[27] C. Do and S. Batzoglou. What is the expectation maximization algorithm? Nature Biotechnology, 26:897–899, 2008.
[28] A. Dvurečenskij and S. Pulmannová. New Trends in Quantum Structures. Kluwer Acad. Publ., Dordrecht, 2000.
[29] B. Fong. Causal theories: A categorical perspective on Bayesian networks. Master's thesis, Univ. of Oxford, 2012. See arxiv.org/abs/1301.6201.
[30] D. J. Foulis and M. K. Bennett. Effect algebras and unsharp quantum logics. Found. Physics, 24(10):1331–1352, 1994.
[31] A. Gibbs and F. Su. On choosing and bounding probability metrics. Int. Statistical Review, 70(3):419–435, 2002.
[32] G. Gigerenzer and U. Hoffrage. How to improve Bayesian reasoning without instruction: Frequency formats. Psychological Review, 102(4):684–704, 1995.
[33] M. Giry. A categorical approach to probability theory. In B. Banaschewski, editor, Categorical Aspects of Topology and Analysis, number 915 in Lect. Notes Math., pages 68–85. Springer, Berlin, 1982.
[34] A. Gordon, T. Henzinger, A. Nori, and S. Rajamani. Probabilistic programming. In Int. Conf. on Software Engineering, 2014.
[35] T. Griffiths, C. Kemp, and J. Tenenbaum. Bayesian models of cognition. In R. Sun, editor, Cambridge Handbook of Computational Cognitive Modeling, pages 59–100. Cambridge Univ. Press, 2008.
[36] J. Halpern. Reasoning about Uncertainty. MIT Press, Cambridge, MA, 2003.
[37] W. Hino, H. Kobayashi, I. Hasuo, and B. Jacobs. Healthiness from duality. In Logic in Computer Science. IEEE, Computer Science Press, 2016.
[38] T. Hofmann. Probabilistic latent semantic analysis. In UAI'99: Conf. on Uncertainty in Artificial Intelligence, pages 289–296, 1999.
[39] J. Hohwy. The Predictive Mind. Oxford Univ. Press, 2013.
[40] B. Jacobs. Convexity, duality, and effects. In C. Calude and V. Sassone, editors, IFIP Theoretical Computer Science 2010, number 82(1) in IFIP Adv. in Inf. and Comm. Techn., pages 1–19. Springer, Boston, 2010.
[41] B. Jacobs. New directions in categorical logic, for classical, probabilistic and quantum logic. Logical Methods in Comp. Sci., 11(3), 2015. See https://lmcs.episciences.org/1600.
[42] B. Jacobs. Introduction to Coalgebra. Towards Mathematics of States and Observations. Number 59 in Tracts in Theor. Comp. Sci. Cambridge Univ. Press, 2016.
[43] B. Jacobs. Hyper normalisation and conditioning for discrete probability distributions. Logical Methods in Comp. Sci., 13(3:17), 2017. See https://lmcs.episciences.org/3885.
[44] B. Jacobs. A note on distances between probabilistic and quantum distributions. In A. Silva, editor, Math. Found. of Programming Semantics, Elect. Notes in Theor. Comp. Sci. Elsevier, Amsterdam, 2017.
[45] B. Jacobs. A recipe for state and effect triangles. Logical Methods in Comp. Sci., 13(2), 2017. See https://lmcs.episciences.org/3660.
[46] B. Jacobs. A channel-based exact inference algorithm for Bayesian networks. See arxiv.org/abs/1804.08032, 2018.
[47] B. Jacobs. A channel-based perspective on conjugate priors. Math. Struct. in Comp. Science. See arxiv.org/abs/1707.00269, 2019.
[48] B. Jacobs. Learning along a channel: the Expectation part of Expectation-Maximisation. In B. König, editor, Math. Found. of Programming Semantics, Elect. Notes in Theor. Comp. Sci. Elsevier, Amsterdam, 2019. To appear.
[49] B. Jacobs. The mathematics of changing one's mind, via Jeffrey's or via Pearl's update rule. Journ. of Artif. Intelligence Res., 2019, to appear. See arxiv.org/abs/1807.05609.
[50] B. Jacobs, A. Kissinger, and F. Zanasi. Causal inference by string diagram surgery. In M. Bojanczyk and A. Simpson, editors, Foundations of Software Science and Computation Structures, number 11425 in Lect. Notes Comp. Sci., pages 313–329. Springer, Berlin, 2019.
[51] B. Jacobs, J. Mandemaker, and R. Furber. The expectation monad in quantum foundations. Information & Computation, 250:87–114, 2016.
[52] B. Jacobs and A. Westerbaan. Distances between states and between predicates. See arxiv.org/abs/1711.09740, 2017.
[53] B. Jacobs and F. Zanasi. A predicate/state transformer semantics for Bayesian learning. In L. Birkedal, editor, Math. Found. of Programming Semantics, number 325 in Elect. Notes in Theor. Comp. Sci., pages 185–200. Elsevier, Amsterdam, 2016.
[54] B. Jacobs and F. Zanasi. A formal semantics of influence in Bayesian reasoning. In K. Larsen, H. Bodlaender, and J.-F. Raskin, editors, Math. Found. of Computer Science, volume 83 of LIPIcs, pages 21:1–21:14. Schloss Dagstuhl, 2017.
[55] B. Jacobs and F. Zanasi. The logical essentials of Bayesian reasoning. In G. Barthe, J.-P. Katoen, and A. Silva, editors, Probabilistic Programming. Cambridge Univ. Press, 2019. See arxiv.org/abs/1804.01193, to appear.
[56] R. Jeffrey. The Logic of Decision. The Univ. of Chicago Press, 2nd rev. edition, 1983.
[57] F. Jensen and T. Nielsen. Bayesian Networks and Decision Graphs. Statistics for Engineering and Information Science. Springer, 2nd rev. edition, 2007.
[58] D. Jurafsky and J. Martin. Speech and Language Processing. Third edition draft, available at https://web.stanford.edu/~jurafsky/slp3/, 2018.
[59] K. Keimel and G. Plotkin. Mixed powerdomains for probability and nondeterminism. Logical Methods in Comp. Sci., 13(1), 2017. See https://lmcs.episciences.org/2665.
[60] A. Kock. Closed categories generated by commutative monads. Journ. Austr. Math. Soc., XII:405–424, 1971.
[61] D. Koller and N. Friedman. Probabilistic Graphical Models. Principles and Techniques. MIT Press, Cambridge, MA, 2009.
[62] D. Kozen. Semantics of probabilistic programs. Journ. Comp. Syst. Sci., 22(3):328–350, 1981.
[63] D. Kozen. A probabilistic PDL. Journ. Comp. Syst. Sci., 30(2):162–178, 1985.
[64] S. Lauritzen and D. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journ. Royal Statistical Soc., 50(2):157–224, 1988.
[65] T. Leinster. Basic Category Theory. Cambridge Studies in Advanced Mathematics. Cambridge Univ. Press, 2014.
[66] C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 2003.
[67] R. Mardare, P. Panangaden, and G. Plotkin. Quantitative algebraic reasoning. In Logic in Computer Science. IEEE, Computer Science Press, 2016.
[68] S. Mac Lane. Categories for the Working Mathematician. Springer, Berlin, 1971.
[69] S. Mac Lane. Mathematics: Form and Function. Springer, Berlin, 1986.
[70] A. McIver and C. Morgan. Abstraction, Refinement and Proof for Probabilistic Systems. Monographs in Comp. Sci. Springer, 2004.
[71] A. McIver, C. Morgan, and T. Rabehaja. Abstract hidden Markov models: A monadic account of quantitative information flow. In Logic in Computer Science, pages 597–608. IEEE, Computer Science Press, 2015.
[72] A. McIver, C. Morgan, G. Smith, B. Espinoza, and L. Meinicke. Abstract channels and their robust information-leakage ordering. In M. Abadi and S. Kremer, editors, Princ. of Security and Trust, number 8414 in Lect. Notes Comp. Sci., pages 83–102. Springer, Berlin, 2014.
[73] E. Moggi. Notions of computation and monads. Information & Computation, 93(1):55–92, 1991.
[74] T. Moon. The Expectation-Maximization algorithm. IEEE Signal Processing Magazine, 13(6):47–60, 1996.
[75] R. Nagel. Order unit and base norm spaces. In A. Hartkämper and H. Neumann, editors, Foundations of Quantum Mechanics and Ordered Linear Spaces, number 29 in Lect. Notes Physics, pages 23–29. Springer, Berlin, 1974.
[76] R. Neal. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. Jordan, editor, Learning in Graphical Models, pages 355–368. Springer, Dordrecht, 1998.
[77] M. Nielsen and I. Chuang. Quantum Computation and Quantum Information. Cambridge Univ. Press, 2000.
[78] F. Olmedo, F. Gretz, B. Lucien Kaminski, J.-P. Katoen, and A. McIver. Conditioning in probabilistic programming. ACM Trans. on Prog. Lang. & Syst., 40(1):4:1–4:50, 2018.
[79] V. Paulsen and M. Tomforde. Vector spaces with an order unit. Indiana Univ. Math. Journ., 58(3):1319–1359, 2009.
[80] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
[81] J. Pearl. Probabilistic semantics for nonmonotonic reasoning: A survey. In R. Brachman, H. Levesque, and R. Reiter, editors, First Intl. Conf. on Principles of Knowledge Representation and Reasoning, pages 505–516. Morgan Kaufmann, 1989.
[82] J. Pearl. Jeffrey's rule, passage of experience, and neo-Bayesianism. In H. Kyburg, Jr., editor, Knowledge Representation and Defeasible Reasoning, pages 245–265. Kluwer Acad. Publishers, 1990.
[83] J. Pearl. Causality. Models, Reasoning, and Inference. Cambridge Univ. Press, 2nd edition, 2009.
[84] J. Pearl and D. Mackenzie. The Book of Why. Penguin Books, 2019.
[85] B. Pierce. Basic Category Theory for Computer Scientists. MIT Press, Cambridge, MA, 1991.
[86] C. Rao. Estimation and tests of significance in factor analysis. Psychometrika, 20(2):93–111, 1955.
[87] S. Ross. Introduction to Probability Models. Academic Press, 9th edition, 2007.
[88] S. Russell and P. Norvig. Artificial Intelligence. A Modern Approach. Prentice Hall, Englewood Cliffs, NJ, 2003.
[89] A. Scibior, O. Kammar, M. Vákár, S. Staton, H. Yang, Y. Cai, K. Ostermann, S. Moss, C. Heunen, and Z. Ghahramani. Denotational validation of higher-order Bayesian inference. In Princ. of Programming Languages, pages 60:1–60:29. ACM Press, 2018.
[90] P. Selinger. A survey of graphical languages for monoidal categories. In B. Coecke, editor, New Structures in Physics, number 813 in Lect. Notes Physics, pages 289–355. Springer, Berlin, 2011.
[91] S. Selvin. A problem in probability (letter to the editor). Amer. Statistician, 29(1):67, 1975.
[92] S. Sloman. Causal Models. How People Think about the World and Its Alternatives. Oxford Univ. Press, 2005.
[93] A. Sokolova. Probabilistic systems coalgebraically: A survey. Theor. Comp. Sci., 412(38):5095–5110, 2011.
[94] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, Cambridge, MA, 2nd rev. edition, 2000.
[95] S. Staton, H. Yang, C. Heunen, O. Kammar, and F. Wood. Semantics for probabilistic programming: higher-order functions, continuous distributions, and soft constraints. In Logic in Computer Science. IEEE, Computer Science Press, 2016.
[96] M. Stone. Postulates for the barycentric calculus. Ann. Math., 29:25–30, 1949.
[97] H. Tijms. Understanding Probability: Chance Rules in Everyday Life. Cambridge Univ. Press, 2nd edition, 2007.
[98] A. Tversky and D. Kahneman. Evidential impact of base rates. In D. Kahneman, P. Slovic, and A. Tversky, editors, Judgement under Uncertainty: Heuristics and Biases, pages 153–160. Cambridge Univ. Press, 1982.
[99] M. Valtorta, Y.-G. Kim, and J. Vomlel. Soft evidential update for probabilistic multiagent systems. Int. Journ. of Approximate Reasoning, 29(1):71–106, 2002.
[100] C. Villani. Optimal Transport: Old and New. Springer, Berlin Heidelberg, 2009.
[101] I. Witten, E. Frank, and M. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Elsevier, Amsterdam, 2011.
