
Statistical Methods in Applied Computer Science

Lecture Notes
Preliminary for course 2D5342, Data Mining,

Jan-April 2006.

Stefan Arnborg ∗

KTH Nada

February 23, 2006

Abstract

We overview fundamental inference principles for hypothesis and decision choice, parameter (state) estimation and tracking methods. We also explore methods for finding dependencies and graphical models, latent variables and robust decision trees in a joint Bayesian framework. We also consider the most important methods for performing Bayesian inference in various settings: analytic integration using conjugate families of distributions and likelihoods, discretization, Monte Carlo as well as Markov Chain Monte Carlo and particle filters. We overview related probabilistic methods such as robust Bayesian analysis, evidence theory and PAC learning.

Contents

1 Introduction
1.1 Professional context of statistics in Computer Science
1.2 Summary

2 Schools of statistics
2.1 Bayesian Inference
2.1.1 Choosing between two alternatives
2.1.2 Finite set of alternatives
2.1.3 Infinite sets of alternatives and parameterized models
2.1.4 Recursive inference
2.1.5 Dynamic inference
2.1.6 Does Bayes give us the right answer?
2.1.7 A small Bayesian example
2.1.8 Bayesian decision theory and estimation

∗email: [email protected]; mail: NADA, KTH, SE-100 44 Stockholm, Sweden


2.1.9 Bayesian analysis of cross-validation
2.1.10 How to perform Bayesian inversion
2.2 Test based inference
2.2.1 A small hypothesis testing example
2.2.2 Multiple testing considerations
2.3 Discussion: Bayes versus frequentism
2.3.1 Model checking in Bayesian analysis
2.4 Evidence Theory
2.5 Beyond Bayes: PAC-learning and SVM
2.5.1 The Kernel Trick
2.6 Uncertainty Management

3 Data models
3.1 Univariate data
3.1.1 Discrete distributions and the Dirichlet prior
3.1.2 The normal and t distributions
3.1.3 Nonparametrics and mixtures
3.1.4 Piecewise constant distribution
3.1.5 Univariate Gaussian Mixture modeling
3.2 Multivariate and sequence data models
3.2.1 Multivariate models
3.2.2 Dependency tests for categorical variables using Dirichlet
3.2.3 The multivariate normal distribution
3.2.4 Dynamical systems and Bayesian time series analysis
3.2.5 Discrete states and observations: The hidden Markov model
3.3 Graphical Models
3.3.1 Causality and direction in graphical models
3.3.2 Segmentation and clustering
3.3.3 Graphical model choice - local analysis
3.3.4 Graphical model choice - global analysis
3.3.5 Categorical, ordinal and Gaussian variables
3.4 Missing values and errors in data matrix
3.4.1 Decision trees
3.5 Segmentation - Latent variables

4 Approximate analysis with Metropolis-Hastings simulation
4.1 Why MCMC and why does it work?
4.1.1 The particle filter
4.2 Basic classification with categorical attributes

5 Two cases
5.1 Case 1: Individualized media distribution - significance of profiles
5.2 Case 2: Schizophrenia research - hunting for causal relationships in complex systems

6 Conclusions


1 Introduction

Computers have since their invention been described as devices performing fast and precise computations. For this reason it might be easy to conclude that probability and statistics have little to do with computer science, as quite many of our students have indeed done. This attitude may have had some justification before computers became more than a prestigious and expensive piece of equipment. Today, computers are involved everywhere, and the ubiquitous uncertainty characterizing the real non-computer world can no longer be separated from computing. Automatically recorded data are collected in vast amounts, and the kind of interpretation of sounds, images and other recordings that was earlier made by humans, or under close supervision by humans, must inevitably be performed to a larger and larger extent by computers. If we emphasize the human performance of interpretation as a gold standard, we may call the methodology used in such applications artificial intelligence. If we emphasize that we must make machines learn from experience we may call it machine learning. Other popular names of the emerging technologies are data mining, emphasizing the large amounts of data and the importance of finding business opportunities in them, or perhaps uncertainty management, emphasizing that computer applications must be made more and more aware of the basic uncertainties in the real world. The terms pattern recognition and information fusion have also been used. Knowledge discovery is another term that evokes emotions, particularly in the natural sciences. These notes were made for a course in statistical methods in computer engineering, where the theme is to unify the methodology in all the above fields with probabilistic and statistical tools.

1.1 Professional context of statistics in Computer Science

Data acquired for analysis can have many different formats. We will describe the analysis of measurements that can be thought of as samples drawn from a larger population, and the conclusions will be phrased in terms of this larger population. I will focus on very simple models. My experience is that it is too easy to start working with complex models without understanding the basics, which can lead to all sorts of erroneous conclusions. As the investigator's understanding of a problem area improves, the statistical models tend to become complex. Some examples of such areas are genetic linkage studies[21], ecosystem studies[53] and functional MR investigations[90], where the signals extracted from measurements are very weak but potentially extremely useful for the application area. As an example, analyses of fMRI experiments have complex models for the signal distortion, noise, movement artifacts, variation of anatomy and function among different subjects, variations within the individual and even drift in a single experiment. All these disturbances can be modeled using basic physics, knowledge of anatomy and statistical summaries of inter-brain variation, and the total model is overwhelmingly complex[54]. Experiments are typically analyzed using a combination of visualization, Bayesian analysis and conventional test and confidence based statistics. When it becomes necessary to develop more sophisticated models, it is vital that the analyst communicates the developments to the non-statistical members of the research team. In engineering and commercial applications of data mining, the goal is not normally to arrive at eternal truths, but to support decisions in design and business. Nevertheless, because of the competitive nature of these activities, one can expect well founded analysis methods and understandable models to provide more useful answers than ad hoc ones. In any case, even with short deadlines it seems important that the methodological discussion in a project contains more than the question of which buttons to press in commercial data analysis packages.

If each sample point contains measurements of two variables, it is easy to produce a scatter plot in two dimensions where sometimes a conclusion is immediate. The human eye is very good at seeing structure in such plots. However, sometimes it is too good, and constructs structure that simply does not exist. If one makes a scatter plot with points drawn uniformly inside a polygon, the point cloud will typically be deemed dense in some areas and sparse in others; in other words one sees structure that does not exist in the underlying distribution, as exemplified in figure 1.

Figure 1: A sample of points distributed uniformly in the shown polygon. It is easy to find areas with apparently lower density of points. But there is no structure in the underlying distribution.

There are several other reasons to complement visualization methods with more sophisticated statistical methods. Most data sets will contain many more than two variables, and this leads to a dimensionality problem. Producing pairwise plots means that many plots are produced. It is difficult to examine all of them carefully, and some of them will inevitably, by accident, contain structure that would be deemed real in a single plot examination but not as an extreme among many plots. It would also mean that co-variation of several variables that cannot be explained as a number of two-variable co-variations cannot be found. Nevertheless, scatter plots are among the most useful explanatory devices once a structure in the data has been verified to be 'probably real'.

This text emphasizes characterization of data, and of the population from which it is drawn, by its statistical properties. Nonetheless, the application owners typically have very different concerns: they want to understand, they want to be able to predict and ultimately to control their objects of study. This means that the statistical investigation is a first phase, which must be accompanied by activities extracting meaning from the data. There is relatively little theory on these later activities, and it is probably fair to say that their outcome depends mostly on the intellectual climate in the team, of which the analyst is only one part.

1.2 Summary

Our purpose is to explain some advantages of the Bayesian approach and to show how probability models can capture the information or knowledge we are after in an application. It is also our intention to give a full account of the computations required. The text can serve as a survey of the area, although it focuses on techniques being investigated in present projects in medical informatics, defense science and customer segmentation. Several of the computations we describe have been analyzed at length elsewhere, although not exactly in the way and with the same conclusions as found here. The contribution here is a systematic treatment that is confined to pure Bayesian analysis and puts several established data mining methods in a joint Bayesian framework. We will see that, although many computations of Bayesian data mining are straightforward, one soon reaches problems where difficult integrals have to be evaluated, and presently only Markov Chain Monte Carlo (MCMC) methods are available. There are several recent books describing the Bayesian method from theoretical[13], ideological[79] and application oriented[16] perspectives. Particularly Ed Jaynes' unfinished lecture notes[48] have provided inspiration for me and numerous students using them all over the world. A current survey of MCMC methods, which can solve many complex evaluations required in advanced Bayesian modeling, can be found in the book[40]. Books explaining theory and use of graphical models are Lauritzen[50] and Cox and Wermuth[23]. A tutorial on Bayesian network approaches to data mining is found in Heckermann[44], and they are thoroughly covered in Jensen[49]. We will describe data mining in a relational data structure with discrete data (discrete data matrix) and the simplest generalizations to numerical data. A recent book with good coverage of visualization integrated with models and methods is the second edition of Gelman et al[38]. Good tables of probability distributions can be found in [63, 13, 38], and a fascinating book on probability theory is [34].

These lecture notes were written for a Data Mining graduate course given annually 1996-2006. A significant update was made when it was decided to give the course at the advanced (Masters) level. The contents of these notes have been influenced by the research undertaken by me and by colleagues and PhD students I worked with. A short version of these notes was published in Wang[1, Ch 1].

2 Schools of statistics

Statistical inference has a long history, and one should not assume that all scientists and engineers analyzing data have the same expertise and would reach the same type of conclusion using the objectively 'right' method in the analysis of a given data set. On the contrary, analysis proceeds by a trial-and-error process influenced by developments in the particular science or application community as well as in the basic statistical sciences. A lot of science could once be developed using very little statistical theory, but this is no longer true. Probability theory is the basis of statistics, and it links a probability model to an outcome. But this linking can be achieved by a number of different principles. A pure mathematician interested in mathematical probability would only consider abstract spaces equipped with a probability measure. Whatever is obtained by analyzing such mathematical structures has no immediate bearing on how we should interpret a given data set collected to give us knowledge about the world.

Figure 2: Andrei Kolmogorov (1903-1987) is the mathematician best known for shaping probability theory into a modern axiomatized theory. His axioms of probability tell how probability measures are defined, also on infinite and infinite-dimensional event spaces. He did not work on statistical inference about the real world.

When it comes to inference about real-world phenomena, there are two different and complementary views on probability that have competed for the position as 'the' statistical method. With both views, we consider models that tell how data are generated in terms of probability. The models used for analysis reflect our - or the application owner's - understanding of the problem area. In a sense they are hypotheses, and in inference a hypothesis is often more or less equated with a probability model. With probability theory we can find the probability of observing data with specified characteristics under a probability model. But inference is concerned with saying something about which probability model generated our data - for this reason inference was sometimes called inverse probability[26]. The name Bayesian, referring to inference in the 19th century style of inverse probability, is however more recent[35].

2.1 Bayesian Inference

The first applications of inference used Bayesian analysis[26], where we can directly talk about the probability that a hypothesis in the form of a probability model generated our observed data. The basic mechanism is easy: let our models be defined by their data generating probabilities. The notation used below


Figure 3: Thomas Bayes (-1762) was a Presbyterian clergyman and amateur mathematician, but he was also a member of the Royal Society. It is believed that he was admitted because he participated in a discussion on the rationality of the new, Newton type, mathematics when it was attacked by the bishop Berkeley[11]. Bayes was also the first to analyze a way to infer a probability model from observations, although his essay is technically weak by modern standards. It is not verified that this contemporary portrait of a priest actually is a likeness of Thomas Bayes, but it is often used as such.

is quite compact, but you should observe that when actual probability distributions are substituted into the formulas one usually has a quite demanding analysis to make. Examples are given in exercises and in Chapter 3.

2.1.1 Choosing between two alternatives

With two models H1 and H2, the probability of seeing data D is P(D|H1) and P(D|H2), respectively. Using the standard definition of conditional probability, P(A|B) = P(AB)/P(B), we can invert the conditional of the model:

P(H1|D) = P(D|H1)P(H1)/P(D).

The data probability P(D), regardless of generating model, is unsurprisingly difficult to assess and cannot be used. For this reason, Bayesian analysis does not normally consider a single hypothesis, but is used to compare different models


in such a way that the data probability regardless of model is not needed. Usingthe same equation for H2, we can eliminate the data probability:

P(H1|D)/P(H2|D) = (P(D|H1)/P(D|H2)) · (P(H1)/P(H2))    (1)

This rule says that the odds we assign to the choice between H1 and H2, the prior odds P(H1)/P(H2), are changed to the posterior odds P(H1|D)/P(H2|D), the odds after also seeing the data, by multiplication with the Bayes factor P(D|H1)/P(D|H2). In other words, the Bayes factor contains all information provided by the data relevant for choosing between the two hypotheses. The rule assumes that we have subjective probability, dependent on the information the observer holds, e.g., by having seen the outcome D of an experiment.

If we assume that exactly one of the hypotheses is true, we can normalize probabilities by P(H1|D) + P(H2|D) = 1 and find the unknown data probability P(D) = P(D|H1)P(H1) + P(D|H2)P(H2).
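The odds calculus of equation (1) fits in a few lines of code. A minimal sketch; the prior odds and likelihood values below are invented for illustration, not taken from the text.

```python
# Posterior odds via the Bayes factor (equation (1)).
# All numbers are hypothetical illustrations.

def posterior_odds(prior_odds, likelihood_h1, likelihood_h2):
    """Multiply the prior odds P(H1)/P(H2) by the Bayes factor
    P(D|H1)/P(D|H2) to obtain the posterior odds P(H1|D)/P(H2|D)."""
    bayes_factor = likelihood_h1 / likelihood_h2
    return prior_odds * bayes_factor

# Example: equal prior odds, data three times as probable under H1.
odds = posterior_odds(1.0, 0.06, 0.02)
print(odds)  # posterior odds 3.0 in favor of H1
```

With equal priors the posterior odds equal the Bayes factor itself, which is why the factor carries all the information in the data.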

2.1.2 Finite set of alternatives

If we have more than two hypotheses, a similar calculation leads to a formula defining a posterior probability distribution over the hypotheses that depends on the prior distribution:

P(Hi|D) = P(D|Hi)P(Hi) / ∑_j P(D|Hj)P(Hj)    (2)

The assumption behind the equation is that exactly one of the hypotheses is true. Then the data probability is P(D) = ∑_j P(D|Hj)P(Hj).
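The normalization in equation (2) is a one-liner over the hypothesis set. A sketch with invented priors and likelihoods:

```python
# Posterior over a finite hypothesis set (equation (2)).
# Priors and likelihoods are invented numbers for illustration.

def posterior(priors, likelihoods):
    """Normalize P(D|Hi)P(Hi) over all hypotheses."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)          # P(D) = sum_j P(D|Hj)P(Hj)
    return [j / evidence for j in joint]

post = posterior([0.5, 0.3, 0.2], [0.10, 0.40, 0.40])
print(post)  # a proper distribution: entries sum to one
```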

2.1.3 Infinite sets of alternatives and parameterized models

If we have a parameterized model Hλ, the data probability is also dependent on the value of the parameter λ, which is taken from some parameter space or possible world space Λ. In Bayesian analysis the parameter space can be overwhelmingly complex, such as in the case of image reconstruction where Λ is the set of possible images represented, e.g., by voxel vectors of significant length in 3D images. But the parameter space can also be simple, like a postulated 'true value' of a measured real valued quantity. We can also let the parameter space be a finite set, and so can regard equation (2) as a special case of equation (3) below.

There are also interesting cases where the parameter space has complex structure. Typically, a parameter value in such cases is a vector whose dimension is not fixed. Such models can describe a multiple target problem in defense applications, where one wants to make inference both about the number of incoming targets and the state (position and velocity) of each, or in genetic studies where one assumes that a disease is determined by combinations of an unknown number of genes, or simply in an effort to describe a probability distribution as a mixture of an unknown number of normal distributions. In the latter case, the parameter can be the number of such normal components followed by a sequence of mean values and variances of the components.


Consideration of parameterized models leads to the parameter inference rule, which can be explained as a limit form of equation (2):

f(λ|D) ∝ P(D|Hλ, λ) f(λ),    (3)

where f(λ) is the prior density and f(λ|D) is the posterior density, and ∝ is a sign that indicates that a normalization constant (independent of λ) has been omitted. For posteriors of parameter values the concept of credible set is important. A credible set is a set of parameter values among which the parameter has a high probability of lying, according to the posterior distribution. For a real-valued parameter, a credible set which is a (possibly open-ended) interval is called a credible interval. Likewise, a credible interval with posterior probability 0.95 is called a 95% credible interval. There are typically many 95% credible intervals for a given posterior. It may seem natural to choose that interval for which the posterior probability density is larger in the interval than outside it. This interval is not robust to rescaling, however. A robust credible interval is one that is centered in the sense that the probability is the same for the posterior to lie above as below the interval. In the case of a 95% credible interval, these two probabilities are thus 2.5% each.
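The centered credible interval can be computed numerically from any posterior density tabulated on a grid: cut off equal probability mass below and above. A sketch, using as an invented example the Beta-shaped posterior p^3 (1 - p)^9 (uniform prior, 3 successes, 9 failures):

```python
# Central 95% credible interval from a grid-tabulated posterior.
# The Beta-shaped posterior p^3 (1-p)^9 is a made-up illustration.

N = 100000
grid = [(i + 0.5) / N for i in range(N)]
dens = [p**3 * (1 - p)**9 for p in grid]
total = sum(dens)

def quantile(q):
    """Smallest grid point where the cumulative posterior reaches q."""
    acc = 0.0
    for p, d in zip(grid, dens):
        acc += d / total
        if acc >= q:
            return p
    return grid[-1]

lo, hi = quantile(0.025), quantile(0.975)
print(lo, hi)  # equal 2.5% tails below and above the interval
```

Because the interval is defined by quantiles of the posterior, it is invariant under monotone rescaling of the parameter, unlike the highest-density interval.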

There are many situations where the 'data' part of the analysis is not precisely known, but only a probability distribution is available, perhaps obtained as a posterior in a previous inference. In this case the inference rule will be

f(λ|f(x)) ∝ ∫ f(x|λ)f(x)f(λ) dx.    (4)

2.1.4 Recursive inference

Bayesian analysis for many observed quantities can be formulated recursively, because the posterior obtained after one observation can be used as a prior for the next observation, so-called recursive estimation:

f(λ|Dt) ∝ f(dt|λ) f(λ|Dt−1)    (5)
f(λ|D0) = f(λ)

Here Dt = (d1, . . . , dt) is the sequence of observations obtained at different times, and we have assumed they are independently generated, f(Dt|λ) = Π_{i=1}^{t} f(di|λ).
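The recursion (5) can be sketched on a discretized parameter space: the posterior after each observation becomes the prior for the next. The Bernoulli likelihood, uniform prior and toss sequence below are illustrative choices, not from the text.

```python
# Recursive estimation (equation (5)) on a parameter grid.
# Bernoulli likelihood with a uniform prior on p -- invented example.

N = 1000
grid = [(i + 0.5) / N for i in range(N)]
post = [1.0] * N                      # f(lambda | D_0): uniform prior

def update(post, d):
    """One step of f(lambda|D_t) ∝ f(d_t|lambda) f(lambda|D_{t-1})."""
    new = [(p if d == 1 else 1 - p) * w for p, w in zip(grid, post)]
    s = sum(new)
    return [w / s for w in new]

for d in [1, 0, 0, 1, 1]:             # observed tosses, one at a time
    post = update(post, d)

mean = sum(p * w for p, w in zip(grid, post))
print(mean)  # posterior mean, close to the exact Beta value 4/7
```

Since the tosses are independent, updating one at a time gives the same posterior as a single batch update with the whole sequence, which is the content of Exercise 1.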

Exercise 1 Show that (5) follows from (3) if the di are independently drawn from a common distribution.

2.1.5 Dynamic inference

The recursive formulation is the basis for the Chapman-Kolmogorov approach where the task is to track a system state that changes with time, having state λt at time t. This case is called dynamic inference, since it tries to estimate the state of a moving target at times indexed from 1 upwards. The Chapman-Kolmogorov equation is:


f(λt|Dt) ∝ f(dt|λt) ∫ f(λt|λt−1) f(λt−1|Dt−1) dλt−1    (6)

f(λ0|D0) = f(λ0),

where f(λt|λt−1) is the maneuvering (process innovation) noise assumed, often called the transition kernel. The latter is a pdf over the state λt dependent on the state at the previous time-step, λt−1. This distribution may depend on t, but often a stationary process is assumed where the distribution of λt depends on λt−1 but not explicitly on t. As an example, if we happen to know that the state does not change, we would use Dirac's delta δ(λt − λt−1) for the innovation term and equation (6) will simplify to (5). If we constrain the model to be linear with Gaussian process and measurement noise with known covariance, equation (6) simplifies to a linear algebra equation, whose solution is the classical Kalman filter (this is pedagogically explained in [10]).
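For a finite state space the integral in (6) becomes a sum, and one recursion step is a predict-correct pair. A minimal sketch with a two-state system; the transition kernel and observation likelihoods are invented numbers.

```python
# Discrete-state Chapman-Kolmogorov recursion (equation (6)):
# predict with the transition kernel, correct with the likelihood.
# All numbers are invented for illustration.

transition = [[0.9, 0.1],     # f(lambda_t | lambda_{t-1})
              [0.2, 0.8]]
likelihood = {0: [0.7, 0.3],  # f(d_t | lambda_t) for observation d_t
              1: [0.3, 0.7]}

def filter_step(belief, d):
    # Predict: sum the kernel against the previous posterior.
    predicted = [sum(transition[j][i] * belief[j] for j in range(2))
                 for i in range(2)]
    # Correct: multiply by the observation likelihood, renormalize.
    corrected = [likelihood[d][i] * predicted[i] for i in range(2)]
    s = sum(corrected)
    return [c / s for c in corrected]

belief = [0.5, 0.5]           # f(lambda_0)
for d in [0, 0, 1]:
    belief = filter_step(belief, d)
print(belief)  # posterior over the two states, sums to one
```

The Kalman filter is the special case where both steps can be carried out in closed form on Gaussian densities.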

In many applications, additional information can be obtained by recomputing the past trajectory of the system. This is because later observations can give information about the earlier behavior. This is known as retrodiction. When the system has been observed over T steps, the posterior of the trajectory λ = (λ0, λ1, . . . , λT) is given by:

f(λ|DT) ∝ f(λ0) Π_{t=1}^{T} f(dt|λt) f(λt|λt−1)    (7)

2.1.6 Does Bayes give us the right answer?

Above we have only explained how the Bayesian crank works. It would of course be nice to know also that we will get the right answers.

The Bayes factor (1) estimates the support given by the data to the hypotheses. Inevitably, random variation can give support to the 'wrong' hypothesis. A useful rule to apply when choosing between two hypotheses as in (1) is the following: if the Bayes factor is k in favor of H1, then the probability of getting this factor or larger from an experiment where H2 was the true hypothesis is less than 1/k. For many specific hypothesis pairs, the bound is much better[72].

There is also a nice characterization of long run properties of equation (5) that has an accessible proof in [38]: if the observations are generated by a common distribution g(d) and we try to find the parameter λ by using equation (5), then as the number of observations tends to infinity, the posterior f(λ|Dn) will concentrate around a number λ̂ that minimizes the Kullback-Leibler distance between the true and the estimated distribution of observations, λ̂ = argmin_λ KL(g(·), f(·|λ)), where KL(g, f) = ∫ g(x) log(g(x)/f(x)) dx. Since the KL distance is nonnegative and zero exactly when its two arguments coincide, if, for some unique λ, the real distribution g(x) is equal to the considered distribution f(x|λ), then the dynamic Bayesian inference will asymptotically give the right answer. Convergence rate is discussed in [38].
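The limit result can be illustrated numerically for Bernoulli models: with observations generated by g = Bernoulli(c) and the family f(·|λ) = Bernoulli(λ), the KL minimizer is λ = c. The value c = 0.3 below is an arbitrary illustration.

```python
# KL(g, f) = sum_x g(x) log(g(x)/f(x)) for two Bernoulli distributions;
# its minimizer over lambda recovers the true parameter c (here 0.3,
# an invented example).
import math

def kl_bernoulli(c, lam):
    """Kullback-Leibler distance from Bernoulli(c) to Bernoulli(lam)."""
    return (c * math.log(c / lam)
            + (1 - c) * math.log((1 - c) / (1 - lam)))

c = 0.3
grid = [i / 1000 for i in range(1, 1000)]
best = min(grid, key=lambda lam: kl_bernoulli(c, lam))
print(best)  # the argmin coincides (up to the grid) with c
```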

Exercise 2 Assume that we use equation (1) to choose between two models stating respectively probability a and b for heads in a coin tossing series. Assume also that the true probability of heads is c. For which values of c will the Bayes factor in favor of a against b go to zero when the number of tosses increases?


2.1.7 A small Bayesian example.

We will see how Bayes' method works with a small example, in fact very similar to the example used by Thomas Bayes (1703-1762). Assume we have found a coin among the belongings of a notorious gambling shark. Is this coin fair or unfair? The data we can obtain is a sequence of outcomes in a tossing experiment, represented as a binary string. Let one hypothesis be that the coin is fair, Hr. Then P(D|Hr) = 2^(-n), where n = |D| is the number of tosses made. Since the tosses are assumed independent, the number of ones, s, and the number of zeros, f, completely characterizes an experiment. Since every outcome has the same probability, we cannot evaluate the experiment with respect to only Hr, but we must have another hypothesis that can fit better or worse to an outcome. Bayes used a parameterized model where the parameter is the unknown probability, p, of getting a one in a toss. For this model Hp, we have P(D|Hp) = p^s (1 - p)^f, and the probability of an outcome is clearly a function of p. We can consider Hp a whole family of models for 0 ≤ p ≤ 1. If we assume, with Bayes, that the prior distribution of p is uniform in the interval from 0 to 1, we get a posterior distribution equal to the normalized likelihood, f(p|D) = c p^s (1 - p)^f, a Beta distribution where the normalization constant is (as can be found from a table or proved by double induction) c = (n + 1)!/(s! f!). This function has a maximum at the observed frequency s/n. We cannot say that the coin is unfair just because we have s ≠ f, since the normal variation makes inequality very much more likely than equality for a large number of tosses even if the coin is fair.

If we want to decide between fairness and unfairness we must introduce a composite hypothesis by specifying a probability distribution for the parameter p in Hp. A conventional choice is again the uniform distribution. Let Hu be the hypothesis of unfairness, expressed as Hp with a uniform distribution on the parameter p. By integration we find P(D|Hu) = ∫_0^1 P(D|Hp) dp = ∫_0^1 p^s (1 - p)^f dp = s! f!/(n + 1)!. Suppose now that we toss the coin twelve times and obtain the sequence 00011000001, three successes and nine failures. The Bayes factor in favor of unfairness will be 2^12 · 3! 9!/13! ≈ 1.4. This is too small a value to be of interest. Values above 3 are worth mentioning, above 30 significant, and factors above 300 would give strong support to the first hypothesis, whereas values below 1/3, 1/30 and 1/300 give similar support to the second hypothesis. But we are only comparing two hypotheses - this scheme cannot tell us that none of the alternatives is plausible.
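The arithmetic of the example can be checked directly; this snippet just evaluates P(D|Hu)/P(D|Hr) = 2^12 · 3! 9!/13! for the stated counts.

```python
# Bayes factor of the coin example: P(D|Hu) / P(D|Hr)
# with s = 3 successes and f = 9 failures in n = 12 tosses.
from math import factorial

s, f = 3, 9
n = s + f
p_fair = 0.5 ** n                                          # P(D | Hr)
p_unfair = factorial(s) * factorial(f) / factorial(n + 1)  # P(D | Hu)
bayes_factor = p_unfair / p_fair
print(round(bayes_factor, 2))  # about 1.4, as in the text
```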

We have here compared two data generating mechanisms, out of a potentially unbounded set of possibilities. We have simply assumed that the tosses are independent and identically distributed in the sense that the success probability is constant over the experiment. It is perfectly possible to assume that there is autocorrelation in the experiment, or that the success probability drifts during the experiment. In order to investigate these possibilities we need different data generating models (and longer experiments, because there are more parameters to consider). Once credible models of those aspects of an experiment we want to consider are available, the analysis follows the same steps as those above.

Exercise 3 A cheap test is available for a serious disease. Assume that this test is cheap in the sense of having 5% probability of giving a negative result if you have the disease, and 10% probability of giving a positive result if you do not have it. Moreover, in a population of individuals similar to you, 0.1% have the disease. How would you compute your probability of having the disease when the test gives

a) a positive result?
b) a negative result?
c) Can the precision be improved by repeating the test? What assumptions are reasonable for answering this question?

Exercise 4 The assumption that an unfair coin has a uniform distribution for p is not very convincing if the coin is physical and we have had a chance to look at it without actually tossing it. Assume that we instead assign a prior distribution for p by saying that the coin is expected to be roughly balanced in a series of 10 trials. This can be formalized by stating the prior to be proportional to p^5 (1 − p)^5 in the unbalanced case. How would such an assumption change the posterior and the interpretation of the example (with three successes and nine failures)?

Exercise 5 Two points are selected uniformly and independently on a 3D sphere of radius 1.
(i) Find the distribution of the (3D Euclidean) distance between the points.
(ii) Find the maximum probability, mean and median distances between the points.
(iii) What are the MAP, mean and median estimates for the squared distance? Hint: use a pdf whose variable is the squared distance.

2.1.8 Bayesian decision theory and estimation

The posterior odds give a numerical measure of belief in the two hypotheses compared. Suppose our task is to decide by choosing one of them. If the Bayes factor is greater than one, H1 is more likely than H2, assuming no prior preference for either. But this does not necessarily mean that H1 is true, since the data can be misleading by natural random fluctuation. Two types of error are possible: choosing H1 when H2 is true, and choosing H2 when H1 is true. Our choice must take the consequences of the two types of error into account. If it is much worse to act on H1 when H2 is true than vice versa, the choice of H1 must be penalized in some way, whereas if the consequences of both error types are similar we should indeed choose the alternative of largest posterior probability. Statistical Bayesian decision theory resolves the problem by introducing a cost for each pair of choice and true hypothesis. For finite hypothesis sets, the table of these costs forms a matrix, the cost matrix, which is column indexed by the true hypothesis and row indexed by our choice. The recipe for choosing is to make the choice with smallest expected cost [8]. This rule is applicable also when simultaneously making many model comparisons. The term utility is equally common in statistical decision making. It differs from cost only by a sign, and we thus strive to maximize it.

When making inference for the parameter value of a parameterized model, equation (3) gives only a distribution over the parameter value. If we want a point estimate λ̂ of the parameter value, we should also use Bayesian decision theory. We want to minimize the loss incurred by stating the estimate λ̂ when the true value is λ, L(λ̂, λ). But we do not know λ. As with a discrete set of decision alternatives we minimize the expected loss over the posterior for λ,


Figure 4: Pierre Simon de Laplace was a court (later Napoleon’s) mathematician contributing to a great many problems in mathematics, fluid mechanics and astronomy. He rediscovered Bayes’ method and developed it in a more mathematically mature way. He used the same uniform prior for an unknown probability as Bayes did, and used it to compute the probability that the sun will rise tomorrow, given that it has risen N times without exception. The answer, (N + 1)/(N + 2), is obtained by using the posterior mean value estimator, which in a discrete setting is known as the Laplace estimator: for a discrete distribution over d outcomes, where n_i observations of outcome i were observed, the Laplace estimate of the probability distribution (p_1, ..., p_d) is p_i = (n_i + 1)/(n + d), the relative frequencies after one more observation of each outcome has been added. In this course we can also use Laplace’s parallel combination: this term is sometimes used to describe the computation (5) for a finite Λ: take two vectors indexed by state, multiply them component-wise and normalize. In Matlab: posterior=likelihood.*prior; posterior=posterior/sum(posterior);.

∫ L(λ̂, λ) f(λ|D) dλ. If the loss function is the squared error, the optimal estimator is the mean of f(λ|D); if the loss is the absolute value of the error, the optimal estimator is the median; with a discrete parameter space, minimizing the probability of an error (no matter how large) gives the Maximum A Posteriori (MAP) estimate. As an example, when tossing a coin gives s heads and f tails, the posterior with a uniform prior is f(p|s, f) = c p^s (1 − p)^f; the MAP estimate for p is the observed frequency s/(s + f), the mean estimate is the Laplace estimator (s + 1)/(s + f + 2), and the median is a fairly complicated quantity expressible, when s and f are known, as the solution to an algebraic equation of high degree. The Laplace estimator is often preferable to the simple observed frequency estimator, see figure 4.
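The three estimators can be computed side by side for the coin posterior. The sketch below (Python for illustration; the function name and the grid-based median search are ours) assumes the uniform-prior Beta posterior of the example:

```python
from math import factorial

def beta_posterior_estimates(s, f, grid=100001):
    """MAP, mean (Laplace) and median estimates from the coin posterior
    f(p|s,f) = c p^s (1-p)^f with c = (s+f+1)!/(s! f!)."""
    map_est = s / (s + f)              # mode: minimizes the error probability
    mean_est = (s + 1) / (s + f + 2)   # Laplace estimator: minimizes squared error
    # The median (minimizing absolute error) has no simple closed form;
    # find it numerically by accumulating the density on a fine grid.
    c = factorial(s + f + 1) / (factorial(s) * factorial(f))
    h = 1.0 / (grid - 1)
    cum, median_est = 0.0, 1.0
    for i in range(grid):
        p = i * h
        cum += c * p ** s * (1 - p) ** f * h
        if cum >= 0.5:
            median_est = p
            break
    return map_est, mean_est, median_est
```

For s = 3, f = 9 this gives mode 0.25, mean 4/14 ≈ 0.286, and a median between the two, illustrating the usual mode < median < mean ordering for a right-skewed posterior.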

To see that minimizing expected squared error leads to the mean value estimate, consider the squared error loss function L(λ, λ̂) = (λ − λ̂)^2. Minimizing the expected loss with respect to λ̂ leads to finding a zero of

d/dλ̂ ∫_λ L(λ, λ̂) f(λ|x) dλ = −∫_λ 2(λ − λ̂) f(λ|x) dλ = −2λ̄ + 2λ̂.

This zero is λ̂ = λ̄, and it is easily verified to give a minimum, since the second derivative with respect to λ̂ is 2.


Exercise 6 Find the optimal estimator from the posterior when the loss function is

a) L(λ, λ̂) = |λ − λ̂|;
b) L(λ, λ̂) = −δ(λ − λ̂), where δ is Dirac’s delta function.

2.1.9 Bayesian analysis of cross-validation

*** (not ready)

2.1.10 How to perform Bayesian inversion

Development of Bayesian analysis was only enabled when desk-top computing power became available. Before computers were standard tools, Bayesian analysis was completely dependent on formulating the inference problem using equations (1) or (2) with a small number of alternatives, and on various analytical models that happened to be tractable because of conjugacy properties for the equations (3), (5) and (6). We have already mentioned the Kalman filter, where assuming all distributions Gaussian transforms the solution of (6) into a linear algebra problem. Several such special cases exist, and the corresponding families of functions with their properties are listed, e.g., in [13, App A] and [38]. In our analysis of graphical models (also known as influence diagrams or Bayesian networks) we will thoroughly analyze one such example, the use of Dirichlet distributions to make inference about discrete probability distributions. For inference about low-dimensional parameter spaces, it is possible to discretize the parameter space and reduce the problems (3), (5) and (6) to the discrete problem (2), either using some sophistication with numerical analysis tools or just assuming that both likelihoods and priors are (multidimensional) histograms.
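The histogram approach is just Laplace’s parallel combination applied to a binned parameter space. A minimal Python sketch (function and variable names are ours), again for the coin example with s = 3, f = 9:

```python
# Discretizing a one-dimensional parameter space: prior and likelihood
# become vectors over the bins, and (3) reduces to the finite case (2).
def grid_posterior(prior, likelihood):
    """Component-wise product followed by normalization."""
    post = [p * l for p, l in zip(prior, likelihood)]
    total = sum(post)
    return [p / total for p in post]

bins = 1000
ps = [(i + 0.5) / bins for i in range(bins)]       # bin midpoints for p
prior = [1.0 / bins] * bins                        # uniform prior histogram
likelihood = [p ** 3 * (1 - p) ** 9 for p in ps]   # s = 3, f = 9
posterior = grid_posterior(prior, likelihood)

# Any posterior summary is now a finite sum, e.g. P(p > 1/2 | D):
p_gt_half = sum(q for p, q in zip(ps, posterior) if p > 0.5)
```

The discretization makes arbitrary posterior queries trivial sums, at the price of a grid that grows exponentially with the parameter dimension.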

In general, however, the parameter and observation spaces of interest are very high-dimensional (in genetic linkage studies and imaging, the parameter dimension can be a couple of thousand and many millions, respectively). In these cases Markov Chain Monte Carlo (MCMC) is the only presently feasible technique. This method assumes that a probability density function on a continuous or discrete space is given in unnormalized form, π(x) = c f(x), where we can evaluate f but not c or π. A Markov chain is constructed in such a way that it has a unique stationary distribution which is π. This means that a long chain x_1, ..., x_N can be used as an approximation to the distribution, and the average of a function g(x) over π can be approximated by the sum Σ_{i=1}^{N} g(x_i)/N. The details will be given in Ch. 4. The chain is designed with a proposal distribution q(z|x) which is used to generate a proposed next state z when the last state of the chain is x. So the chain x_1, ..., x_t is extended by drawing a proposed new state z from the distribution q(z|x_t), and this proposal is evaluated by computing the ratio

r = f(z) q(x_t|z) / (f(x_t) q(z|x_t)).     (8)

If r > 1, the proposal is accepted. Otherwise the proposal is accepted with probability r, and rejected otherwise. If the proposal is accepted, x_{t+1} is set to z, otherwise x_{t+1} is set to x_t. This can be implemented by comparing a standard uniformly distributed pseudo-random number u with r and accepting the proposed state if r > u. Equation (8) shows that a move is preferred to the


Figure 5: Andrei Markov (1856–1922) was a mathematician developing, among other things, the theory of Markov processes. Markov assumptions are common assumptions in statistical modeling, saying that a variable’s distribution depends only on a subset of the variables in a model. Examples of Markov assumptions are found in Markov chains, where only the current state and not the full history of the process decides the probability distribution of the next state, and in graphical models, where the immediate ancestors (neighbors for undirected models) in a graph give all available information (among ancestors in the case of directed models) about a variable.

extent that it increases the probability of the state, but also if the proposal distribution makes it easy to return to the old state. A proposal distribution with q(z|x) = q(x|z) is called a symmetric proposal distribution. Symmetric proposals are the most common, and they simplify (8) a lot.

We can illustrate the MCMC method with a small example. Assume a distribution consisting of a mixture of two different normal distributions, so that with probability about 0.5 each is selected as the distribution of a new item. Figure 6 shows traces for a distribution with two high-probability regions (top frame). Then three traces with different proposal distributions are shown. In the first case, with a small proposal step, the trace has difficulty switching between the two high-probability regions, and therefore the estimate of their relative weights will be of low accuracy. The second trace mixes well and gives an excellent estimate. In the last trace, with a large step size, most of the proposals are rejected


because they fall in low-probability regions. Therefore the estimate becomes inaccurate – the trace has too many repeated values and looks like an estimate of a distribution over a finite set of values.

Figure 6: MCMC computation. Top frame: the target density on [0, 1]. Lower frames: three traces of 5·10^4 iterations with small, moderate and large proposal steps.
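A minimal random-walk Metropolis sampler for a bimodal target in the spirit of Figure 6 can be sketched as follows (Python for illustration; the mode locations, widths and step size are assumptions of ours, not taken from the figure):

```python
import math
import random

def target(x):
    """Unnormalized bimodal density f: a mixture of two narrow normals.
    We can evaluate f, but not the normalization constant c."""
    return (math.exp(-0.5 * ((x - 0.25) / 0.03) ** 2)
            + math.exp(-0.5 * ((x - 0.75) / 0.03) ** 2))

def metropolis(step, n=50000, seed=0):
    """Random-walk Metropolis: symmetric proposal z ~ U(x-step, x+step),
    so (8) reduces to r = f(z)/f(x); accept when r > u with u ~ U(0,1)."""
    rng = random.Random(seed)
    x, chain = 0.25, []
    for _ in range(n):
        z = x + rng.uniform(-step, step)
        if target(z) / target(x) > rng.random():
            x = z
        chain.append(x)
    return chain

# A step comparable to the mode separation mixes well; the fraction of
# time spent in the right-hand mode then estimates its relative weight.
chain = metropolis(step=0.5)
weight = sum(1 for x in chain if x > 0.5) / len(chain)
```

Rerunning with a much smaller step (say 0.02) reproduces the sticking behavior of the first trace: the chain rarely crosses the low-probability valley, and the weight estimate becomes unreliable.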

2.2 Test based inference

It is completely feasible to reject the idea of subjective probability; in fact, this is the more common approach in many areas of science. If a coin is balanced, the probability of heads is 0.5, and in the long run it will land heads about half the number of times it was tossed. However, it might not, but even then one can claim that the probability is still the same. In other words, we can claim that the updating of prior to posterior through the likelihood has nothing to do with the ’objective’ probability of the coin landing heads.

The perceived irrelevance of long-run properties of hypothesis probabilities made one school of statistics reject subjective probability altogether. This school works with what is usually known as objective probability or frequentist statistics. Data is generated in repeatable experiments with a fixed distribution of the outcome. Since we cannot talk about the objective probability of a hypothesis, we cannot use such probabilities to express our belief in hypotheses. In the development of probability, many intriguing developments occurred that are now not visible in standard engineering statistics and probability; a very readable account can be found in the recent book [78]. In particular, Cournot hypothesized that the connection between statistics and the real world is that an event (picked in advance) with probability zero will not occur. In a world of finite time and resources, this translates to the thesis that an event (also picked in advance of testing) with small probability will usually not happen. This is the basis for the modern hypothesis-testing framework that is standard in many applied disciplines, above all clinical testing.

One device used by a practitioner of objective probability is testing. For a


Figure 7: Antoine Augustin Cournot (1801–1877). He formulated ’Cournot’s bridge’, or Cournot’s principle, as a link between mathematical probability and the real world: an event, picked in advance, with very small or zero probability, will not happen. If we observe such an event, we can conclude that one (or more) of the assumptions made to find its probability is incorrect. In hypothesis testing we apply Cournot’s bridge to a case where the null hypothesis is the only questionable assumption we made in computing the p-value – thus we can reject the null hypothesis if we observe a low p-value. The principle was considered of fundamental importance by several pioneers of probability and inference, but it does not figure prominently in current historical surveys. It was recently brought to attention (for another purpose than ours) by Glenn Shafer [78].

single hypothesis H, a test statistic is designed as a mapping t of the possible outcomes to an ordered space, normally the real numbers. The data probability function P(D|H) will now induce a distribution of the test statistic on the real line. I continue by defining a rejection region, an interval with low probability, typically 5% or 1%. Next the experiment is performed or the data D is obtained, and if the test statistic t(D) falls in the rejection region, the hypothesis H is rejected. For a parameterized hypothesis, rejection depends on the value of the parameter. In objective probability inference about real-valued parameters we use the concept of a confidence interval, whose definition is unfortunately rather awkward and is omitted. It is discussed in all elementary statistics texts, but students often leave introductory statistics courses believing that the confidence interval is a credible interval. Fortunately, this does not matter a lot, since numerically they are practically the same. The central idea in testing is thus to reject models. If we have a problem of choosing between two hypotheses, one is singled out as the tested hypothesis, the null hypothesis,


Figure 8: R. A. Fisher (1890–1962) was a leading statistician as well as a geneticist. In his paper Inverse probability [36], he rejected the dependency on priors and scaling exhibited by Bayesian analysis, and launched an alternative concept, ’fiducial analysis’. Although this concept was not developed after Fisher’s time, the standard definition of confidence intervals (developed by J. Neyman) has a similar flavor. The fiducial argument was apparently the starting point for Dempster in developing evidence theory.

and this one is tested. The test statistic is chosen to make the probability of rejecting the null hypothesis maximal if data is in fact obtained through the alternative hypothesis. Unfortunately, there is no strong reason to accept the null hypothesis just because it could not be rejected, and there is no strong reason to accept the alternative just because the null was rejected. But this is how testing is usually applied.

The concept of p-value is important in testing. This value is used when the rejection region is an infinite interval on one side of the real line, say the left one. The p-value is the probability, under the null hypothesis, of obtaining a test statistic not larger than the one obtained, so that a p-value less than 0.01 allows one to reject the null hypothesis at the 1% level – the observed value is smaller than can plausibly occur under the null hypothesis.

2.2.1 A small hypothesis testing example

Let us analyze coin tossing again. We have again the two hypotheses Hf and Hu. Choose Hf, the coin is fair, as the null hypothesis. Choose the number of successes as test statistic. Under the null hypothesis we can easily compute the p-value, the probability of obtaining nine or more failures with a fair coin tossed twelve times, Σ_{j=9}^{12} C(12, j) 2^{-12} ≈ .073. This is 7.3%, so the experiment does not allow us to reject fairness at the 5% level. On the other hand, if the testing plan was to toss the coin until three outcomes of each type have been seen, the p-value should be computed as the probability of seeing nine or more failures


before the third success: Σ_{j=9}^{∞} C(j+2, j) 2^{-(j+3)} ≈ .0327. Since this is about 3.3%, we can now reject the fairness hypothesis at the 5% level. This is a feature of using test statistics: the result depends not only on the choice of null hypothesis and significance level, but also on the experimental design, i.e. on data we did not see but that could have been seen.
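Both p-values are finite binomial sums and can be checked in a few lines of Python (the variable names are ours); the infinite negative binomial sum collapses because nine or more failures before the third success is the same event as at most two successes among the first eleven tosses:

```python
from math import comb

# Binomial design: p-value for nine or more failures in twelve tosses.
p_binomial = sum(comb(12, j) for j in range(9, 13)) / 2 ** 12

# Negative binomial design: nine or more failures before the third
# success, i.e. at most two successes among the first eleven tosses.
p_negbinomial = sum(comb(11, k) for k in range(3)) / 2 ** 11
```

The same data thus gives 299/4096 ≈ 7.3% under one design and 67/2048 ≈ 3.3% under the other, straddling the 5% threshold.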

If we choose Hu as the null hypothesis we would put the rejection region in the center, around f = s, since in this area Hu has smaller probability than Hf. In many cases we will be able to reject either both of Hf and Hu, or neither of them. This shows the importance of the choice of null hypothesis, and also a certain amount of subjectivity in inference by testing.

2.2.2 Multiple testing considerations

Modern analyses of large data sets involve making many investigations, and many tests if a testing approach is chosen. This means that some null hypotheses will be rejected despite being ’true’. If we test 10000 null hypotheses at the 5% level, 500 will be rejected on average even if they are all true. Correcting for multiple testing is necessary. This correction is an analog of the bias introduced in Bayesian model choice when the two types of error have different costs.

There are basically two types of error control proposed for this type of analysis: in family-wise error control (FWE [45]), one controls the probability of making at least one erroneous rejection, whereas in the more recently analyzed false discovery rate control (FDR [6]) one is concerned with the fraction of the rejected hypotheses that are erroneously rejected.

A Bonferroni correction divides the desired significance level, say 5%, by the number of tests made. So in the example of 10000 tests, the rejection level should be decreased from 5% to 0.0005%. Naturally, this means a very significant loss in power, and most of the ’false’ null hypotheses will probably not be rejected. It was shown by Hochberg [45] that one can equally well truncate the rejection list at element k, where k = max{i : p_i ≤ q/(m − i + 1)}, (p_i)_1^m is the increasing list of p-values and q is the desired FWE rate.

A recent proposal is the control of the false discovery rate (FDR). Here we are only concerned that the rate (fraction) of false rejections is below a given level. If this rate is set to 5%, it means that of the rejected null hypotheses, on average no more than 5% are falsely rejected. It was shown [6, 7] that if the tests are independent, one should truncate the rejection list at element k, where k = max{i : p_i ≤ qi/m}.

If we do not know how the tests are correlated, it was shown in [7] that the cut-off value is safe for FDR control if it is changed from qi/m to qi/(mH_m), where H_m = Σ_{i=1}^{m} 1/i, the m-th harmonic number. A more surprising (and more difficult to prove) result is that, for many types of positively correlated tests, the correction factor 1/H_m is not necessary [7]. People practicing FDR seem often to have come to the conclusion that the correction factor is too drastic for all but the most exotic and unlikely cases.

These new developments in multiple testing analysis will have a significant effect on the practice of testing. As an example, from the project described in section 5.2, in a 4000-test example, the number of p-values below 5% was 350; 9 would be accepted by the Bonferroni correction and 257 with FDR controlled at 5%. This means that the scientists involved can analyze 257 effects instead


of 9, with a 5% risk that one of the 9 was wrong, and only 13 of the 257 expected to be wrong.

FDR control is a rapidly developing technique, motivated by the need for acceptable alternatives to the Bonferroni correction, which has very little power when many tests are performed. Such applications are developing rapidly in medical research, where fMRI and micro-array investigations produce very large numbers of very weak signals.

2.3 Discussion: Bayes versus frequentism

Considering that both types of analysis are used heavily in practical and serious applications and by the most competent analysts, it would be somewhat optimistic to think that one of these approaches could be shown right and the other wrong. Philosophically, Bayesianism has a strong normative claim in the sense that every method that is not equivalent to Bayesianism can give results that are irrational in some circumstances, for example if one insists that inference should give a numerical measure of belief in hypotheses that can be translated to fair betting odds [27, 74], or if one insists that this numerical measure be consistent under propositional logic operations [24]. Among the stated problems with Bayesian analysis, the most important is probably a non-robustness sometimes observed with respect to the choice of prior. This has been countered by the introduction of families of priors in robust Bayesian analysis [9]. Another is the need for priors in general, because there are situations where it is claimed that priors do not exist. However, this is a weaker criticism, since the subjective probability philosophy does not recognize probabilities as existing: they are assessed, and can always be assessed [27, Preface]. Assessment of priors introduces an aspect of non-objectivity in inference according to some critics. However, objective probability should not be identified with objective science: good scientific practice means that all assumptions made, like model choice, significance levels, choice of experiment, as well as choice of priors, are openly described and discussed.

One can also ask whether Bayesian and testing-based methods give the same result in some sense. Under many different assumptions it can be proven that frequentist confidence intervals and Bayesian credible intervals will converge asymptotically, for many observations. But in some cases, like with censored data, this might not be true [68].

2.3.1 Model checking in Bayesian analysis

It is possible to combine the Bayesian and hypothesis-testing frameworks [38]. The parameter inference (3) gives a posterior over the parameter space, and a reasonable interpretation is that data is generated by the predictive distribution P(D*) = ∫_λ P(D*|λ) f(λ|D) dλ. Now select a test statistic t(D). By obtaining the test statistic of many predictive data sets D* drawn from this distribution we can obtain an approximate distribution of the test statistic, and if the test statistic of the real data D (from which the posterior was obtained) falls far out in the tails of the approximate distribution, it is an indication that the parameterized model does not capture the data set obtained. It can be seen as a test of the hypothesis that the data was actually generated by the parameterized model used in the derivation of the posterior. The technique is suggested by Gelman


also for ’informal’, visual, testing. In figure 11 we can see an example of this method.
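For the coin model the whole procedure fits in a few lines. The sketch below (Python; the choice of longest-run test statistic and the function names are ours) draws p from the Beta posterior, replicates the data set, and reports a posterior predictive p-value:

```python
import random

def longest_run(seq):
    """Test statistic: length of the longest run of identical outcomes."""
    best = cur = 1
    for a, b in zip(seq, seq[1:]):
        cur = cur + 1 if a == b else 1
        best = max(best, cur)
    return best

def predictive_pvalue(data, draws=2000, seed=1):
    """Posterior predictive check for the coin model: the fraction of
    predictive replicates whose statistic is at least as extreme as the
    observed one. Values very close to 0 or 1 indicate model misfit."""
    rng = random.Random(seed)
    s = sum(data)
    f = len(data) - s
    t_obs = longest_run(data)
    hits = 0
    for _ in range(draws):
        p = rng.betavariate(s + 1, f + 1)              # draw from the posterior
        rep = [1 if rng.random() < p else 0 for _ in data]
        if longest_run(rep) >= t_obs:
            hits += 1
    return hits / draws
```

A data set with suspiciously long runs (suggesting autocorrelated tosses) would produce an extreme predictive p-value even though the i.i.d. model fits the success count perfectly.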

2.4 Evidence Theory

An important development in uncertainty management is the Dempster-Shafer or evidence theory. Originally formulated by Dempster [29], the idea was to handle the case where several sources of information having slightly different frames of reference were to be combined. Specifically, it is assumed that each source has its own possible world set, but precise beliefs about it in the form of a posterior probability distribution. The impreciseness results only from a multivalued mapping, an ambiguity in how the sources’ information should be translated to a common frame of discernment. Specifically, each source has a precise probability distribution over its private frame of discernment, but the private frame is not aligned with the common frame. Thus if the private atom A_i of source i can be mapped to, e.g., A, B or C in the common frame, then the probability assigned to A_i will be allocated to the composite event {A, B, C} in the common frame, etc. It is fairly plausible that the information given by the sources is well representable as a DS-structure, which is a probability distribution over the non-empty subsets of Λ (usually a finite set in DS applications). These structures are usually denoted m, for belief mass distribution. In Dempster’s view, a DS-structure gives an upper and a lower bound for the probability of each event e ⊂ Λ, called its belief (lower bound) and plausibility (upper bound), respectively:

Σ_{e' ⊂ e} m(e') ≤ P(e) ≤ Σ_{e' : e ∩ e' ≠ ∅} m(e')     (9)

In evidence theory, the specification of priors is usually ignored. The DS-structure is thought of as evidence (prior, likelihood, or something else) about the system state. Two central ideas are pignistic transformations, which take a DS-structure to a probability distribution over Λ, and combination rules, which say how two pieces of evidence can be combined to form a new one. The pignistic transformation estimates a precise probability distribution from a DS-structure by distributing evenly the probability masses of non-singleton focal elements to their singleton members. The relative plausibility transformation, on the other hand, consists of the normalized plausibility values for the singletons of 2^Λ. Dempster’s combination rule combines two DS-structures into a new one using a random set operation: the random set intersection of the operands (regarded as random sets) conditioned on being non-empty. An alternative combination rule, the MDS rule, was recently proposed by Fixsen and Mahler [37].

Whereas Dempster’s combination rule can be expressed as

m_DS(e) ∝ Σ_{e = e1 ∩ e2} m1(e1) m2(e2),   e ≠ ∅,     (10)

the MDS rule is

m_MDS(e) ∝ Σ_{e = e1 ∩ e2} m1(e1) m2(e2) |e|/(|e1| |e2|).     (11)


It is illuminating to see how the pignistic and relative plausibility transformations emerge from a precise Bayesian inference. The observation space can in this case be considered to be 2^Λ, since this represents the only distinction among observation sets surviving from the likelihoods. The likelihood will be a function l : 2^Λ × Λ → [0, 1], the probability of seeing evidence e ⊂ Λ given world state λ ∈ Λ. Given a precise e ∈ 2^Λ as observation and a uniform prior, the inference over Λ would be f(λ|e) ∝ l(e, λ), but since we in this case have a probability distribution over the observation space, we should use equation (4), weighting the likelihoods by the masses of the DS-structures. Applying the indifference principle, l(e, λ) should be constant for λ varying over the members of e, for each e. The other likelihood values (λ ∉ e) will be zero. Two natural choices of likelihood are l1(e, λ) ∝ 1 and l2(e, λ) ∝ 1/|e|, for λ ∈ e. Amazingly, these two choices lead to the relative plausibility transformation and to the pignistic transformation, respectively:

f_i(λ|m) ∝ Σ_{e : λ ∈ e} m(e) l_i(e, λ)     (12)
         = Σ_{e : λ ∈ e} m(e) / Σ_e |e| m(e),   i = 1
         = Σ_{e : λ ∈ e} m(e)/|e|,              i = 2

It is also possible to combine two pieces of fuzzy (DS) evidence in the form of two DS-structures m1 and m2. We find the task of combining the two likelihoods Σ_e m1(e) l(e, λ) and Σ_e m2(e) l(e, λ) using Laplace’s parallel composition as in equation (4) over Λ, giving

f(λ) ∝ Σ_{e1, e2} m1(e1) m2(e2) l_i(e1, λ) l_i(e2, λ).

For the choice i = 1, this gives the relative plausibility of the result of fusing the evidences with Dempster’s rule; for the likelihood l2 associated with the pignistic transformation, we get

Σ_{e1, e2} m1(e1) m2(e2) l(e1, λ) l(e2, λ)/(|e1| |e2|).

This is the pignistic transformation of the result of combining m1 and m2 using the MDS rule. There is some current debate on which estimation (pignistic or relative plausibility) and combination (Dempster’s DS or Fixsen/Mahler’s MDS) operations should be used. I hope the above derivations show convincingly that the choice of such operators depends on the statistical model chosen for the application. In other words, none of the possible choices is intrinsically correct.
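For a finite frame, a DS-structure can be represented as a dictionary from frozensets to masses, and both the pignistic transformation and Dempster’s rule (10) are then a few lines each (Python for illustration; the representation and function names are ours):

```python
from itertools import product

def pignistic(m):
    """Pignistic transformation: spread each focal element's mass
    evenly over its singleton members (the likelihood l2 above)."""
    p = {}
    for e, mass in m.items():
        for x in e:
            p[x] = p.get(x, 0.0) + mass / len(e)
    return p

def dempster(m1, m2):
    """Dempster's rule (10): random-set intersection of the operands,
    conditioned on being non-empty (conflict mass renormalized away)."""
    out = {}
    for (e1, w1), (e2, w2) in product(m1.items(), m2.items()):
        e = e1 & e2
        if e:
            out[e] = out.get(e, 0.0) + w1 * w2
    z = sum(out.values())
    return {e: w / z for e, w in out.items()}
```

For example, combining m1 = {“A”: 0.6, “ABC”: 0.4} with m2 = {“B”: 0.5, “AB”: 0.5} discards the conflicting mass 0.3 on A ∩ B = ∅ and renormalizes; applying `pignistic` to the result yields a precise distribution over {A, B, C}.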

For a discussion of the relationships between standard and robust Bayesian analysis and evidence theory, see, e.g., [2]. An alternative interpretation of DS-structures exists, and was first argued for in [76].

2.5 Beyond Bayes: PAC-learning and SVM

Whereas Bayesian analysis is based on the assumption that priors and likelihoods are precisely known, there is an interesting approach that is probabilistic although it makes no assumptions on the probability distributions involved. This paradigm was originated by Chervonenkis and Vapnik [86], and was further developed in computational learning theory [84] and more recently in the Support Vector Machine (SVM) method [25].


How can a method be probabilistic and at the same time be independent of the actual probability distributions involved? First of all, we consider only the problem of predicting an unknown quantity y from a known quantity x, using a set of training samples (x_i, y_i), i ∈ I. Here the x_i are vectors with real-valued components drawn from a feature space R^d, and the y_i are either binary class indicators (the classification problem) or real numbers (the regression problem). If more than two classes are to be distinguished, or if more than one real number is to be predicted, there are a number of ways to organize this using the basic binary classification or single-variable prediction method we will describe.

In order to obtain distribution independence we must inevitably assume that there is some type of link between the training sample and the quantities that we try to predict. This assumption is that the training sample has been generated by the same probability distribution as the examples that we want to predict. Such a distribution can be described as a joint probability distribution p(x, y), or as a family of conditional probability distributions p(y|x) together with a distribution p(x). If we ignore the latter, we run the risk of having a training sample that is unrepresentative for future use; if we ignore the former, we run the risk that the relation between x and y is different during training from what it is during use of the predictor. Unfortunately, even when the assumption of a common distribution during training and testing is fulfilled, we run the risk of getting a bad predictor because the training sample was accidentally unrepresentative.

With the concept of PAC-learning, we consider a particular classifier (restricting ourselves temporarily to the binary classification case) C(x) : x ↦ {−1, 1}. The classifier error is the probability that C(x) is different from y when (x, y) is drawn according to the distribution p(x, y). The empirical error is the fraction of sample points where C(xi) ≠ yi. If the classifier is constructed from a random sample (xi, yi), i ∈ I, drawn according to p(x, y), and the probability is less than δ of obtaining a classifier error larger than ε, then we say that we can obtain a (δ, ε) classifier for the distribution p. If there is ε = ε(n, δ) so that this is true regardless of the distribution p(x, y), then we have a PAC bound for the classification problem. This formulation has the flavor of statistical testing in the sense that it asserts that the probability of obtaining misleading data is small, regardless of the underlying distribution. On the negative side, it is obvious that bounds that hold over every distribution will usually be pessimistic compared to cases where some knowledge about the plausibility of possible distributions is encoded into a Bayesian analysis.

The basic form of the classifier in SVM stems from Rosenblatt's perceptron[71]. This was a hardware device taking a number of real-valued inputs with correct binary classification as training data, used to set weights with which further real-valued inputs are linearly separated. Let the input be a real-valued vector x = (x1, . . . , xn) and the class y be −1 or +1. The classifier decides the class of x by the rule C(x) = sign(∑_i xiwi + b), where the wi and b are the weights of the perceptron. A good deal of research has gone into finding good ways to set the weights using repeated scanning of the training set. However, if the classes of negative and positive examples can be linearly separated, finding a separating hyperplane is equivalent to finding a feasible solution of a linear program and is in principle easy[20]. A key empirical finding has been that the classifier gets better if the hyperplane is placed at maximum distance from all training points, giving a wide margin classifier. Such a classifier can be found using a further development of Lagrange's method with multipliers, developed by Karush, Kuhn and Tucker, and it has been possible to explain its good behaviour with PAC learning theory. We will not go into the detailed algorithms for finding wide margin separators and narrow margin approximators; for details see [25, Ch 5]. An important property of the algorithms is that they work entirely with scalar products in the feature space. This is what makes the 'kernel trick' of the next section easy to implement.
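The classical error-driven training loop for the perceptron rule above can be sketched as follows (a minimal sketch with an illustrative unit learning rate; this is the repeated-scanning approach, not the linear-programming or wide-margin formulation):

```python
# Rosenblatt perceptron: adjust (w, b) whenever a training point is misclassified.
def train_perceptron(points, labels, epochs=100):
    d = len(points[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(points, labels):          # y is -1 or +1
            activation = sum(xi * wi for xi, wi in zip(x, w)) + b
            if y * activation <= 0:               # misclassified (or on the boundary)
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                errors += 1
        if errors == 0:                           # all training points separated
            break
    return w, b

def classify(w, b, x):
    # The decision rule C(x) = sign(sum_i x_i w_i + b)
    return 1 if sum(xi * wi for xi, wi in zip(x, w)) + b > 0 else -1
```

For linearly separable data the loop terminates with a separating hyperplane (Novikoff's perceptron convergence theorem), though not necessarily a wide-margin one.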

A similar technique can be used for the regression problem. Here we want to predict y by C(x) = ∑_i xiwi + b, and the idea is to define the weights such that the maximum error |y − C(x)| is minimized. This method is similar to the Adaline, an early neural network due to Widrow and Hoff[88].

Dynamic prediction problems can be handled with the SVM regression approach. If we have a time series (xi, yi), predicting yi from the lagged vector (xi−d, . . . , xi−1) for a suitable choice of d is a standard way to solve the dynamic prediction problem.
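Building the lagged training pairs is a small preprocessing step; a sketch (function name illustrative):

```python
def make_lagged_pairs(xs, ys, d):
    """Pair each y_i with the lagged vector (x_{i-d}, ..., x_{i-1});
    the first d points have no complete lag vector and are skipped."""
    return [(tuple(xs[i - d:i]), ys[i]) for i in range(d, len(xs))]
```

The resulting pairs can then be fed to any regression method, SVM regression included.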

The perceptron and the Adaline were designed as learning machines with a built-in assumption of linearity. This was heavily criticized by Minsky and Papert[55], whose book on the perceptron more or less blocked further research on neural networks for a long time.

The SVM builds on the concepts of the perceptron and the Adaline. However, the specification of a wide-margin separator or smallest-error hyperplane means that the PAC learning analysis can be performed, resulting in distribution-independent error bounds. Specifically, Cristianini and Shawe-Taylor[25, Ch 4] prove a large number of PAC bounds. The derivation of these bounds is somewhat lengthy; we will just give an example to show how they usually appear:

Proposition 1 Given a set of n examples with binary classification, such that the support of the distribution of x lies in a ball of radius R. Fix γ. With probability 1 − δ, any two parallel hyperplanes 2γ apart and separating the positive and negative examples (thus with margin γ) have error no more than

ε(n, δ, γ) = (2/n) ((64R^2/γ^2) log(enγ/(9R^2)) log(32n/γ^2) + log(4/δ)),   (13)

if n < 2/ε and 64R^2/γ^2 < n^2.

Similarly for regression (where y is a real number): if all training points lie within two parallel hyperplanes 2(Θ − γ) apart, the corresponding predictor has residual greater than Θ with probability at most ε(n, δ, γ) as defined in (13), again if n < 2/ε and 64R^2/γ^2 < n^2.

As is obvious, the common distribution p(x, y) may well prevent accurate prediction, particularly if it factors into a distribution for x and another for y, p(x, y) = p(x)p(y). It is thus important in the formulation above that in such cases we are unlikely to obtain a single classifier or regressor with the required performance on the training set. We are also not allowed to create samples repeatedly until we find one on which a good predictor or classifier exists. It is also important to note that the margin γ is defined before we have seen the examples, although there are ways around this. Classification is often useful even on distributions where we are unlikely to obtain a good training set, but where the real distribution of examples gives some outliers. With few outliers we can still find (ε, δ) bounds, and a number of related bounds are given in [25, Ch 4].
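Evaluating a bound of this shape for plausible sample sizes is instructive: for moderate n the value easily exceeds one, and only very large samples give a non-trivial guarantee. A sketch (the formula follows equation (13) as reconstructed in these notes, so the exact constants should be treated as an assumption):

```python
import math

def pac_margin_bound(n, delta, gamma, R):
    """Evaluate the margin bound of Proposition 1, equation (13):
    (2/n) * ((64 R^2 / gamma^2) * log(e n gamma / (9 R^2)) * log(32 n / gamma^2)
             + log(4 / delta))."""
    a = 64.0 * R * R / (gamma * gamma)
    return (2.0 / n) * (a * math.log(math.e * n * gamma / (9.0 * R * R))
                        * math.log(32.0 * n / (gamma * gamma))
                        + math.log(4.0 / delta))
```

With R = 1, γ = 0.5, δ = 0.05 the bound is far above one at n = 1000 and only drops below one for n in the hundreds of thousands, illustrating the pessimism discussed below.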

An interesting property of the bound (13) is that it is independent of the dimension (number of components) of x. In the perceptron discussion [55], it was soon noted that many important examples exist where a nonlinear mapping to a higher dimensional space makes classes linearly separable which are not linearly separable in the original space. As an example, if x is two-dimensional, the map Φ : (x1, x2) ↦ (x1, x2, x1^2, x1x2, x2^2) has the interesting property that linear separators in the target space R^5 correspond to conic section separators in the original space R^2. Conic sections are, e.g., ellipses, hyperbolas and parabolas, but also two parallel lines (separating the set between the lines from those outside the lines on both sides) and two crossing lines (separating the sectors opposite each other from the other two opposite sectors). The insensitivity of the bound (13) to the dimension of the feature space means that we can expect good generalization performance even after mapping the original examples to a high- or even infinite-dimensional space. This is accomplished in a general fashion with the kernel trick (next section).

The stringency and elegance of the distribution-independent framework has perhaps promoted its use more than is justified. It is not difficult to guess that the bounds obtained will in many cases be quite large, even larger than one (which is rather damaging for error bounds on probabilities). The popularity of the SVM approach stems of course mainly from the experience that the results it gives are 'useful in practice' despite the depressingly large strict error bounds. Alternative analyses of similar methods are given in a new book by Shafer, Gammerman and Vovk [77]. The Relevance Vector Machine of Tipping [83] uses a Bayesian approach and realizes a prediction and classification structure that is in some respects more attractive than the SVM.

2.5.1 The Kernel Trick

As stated above, the SVM algorithms work with the example vectors only through scalar products in the feature space. Thus, if we map the features xi to a higher dimensional space using the mapping Φ, we will have to compute scalar products Φ(x) · Φ(z). For the quadratic map Φ : (x1, x2) ↦ (x1, x2, x1^2, x1x2, x2^2), where subscripts denote vector components, we get Φ(x) · Φ(z) = x1z1 + x2z2 + x1^2z1^2 + x1x2z1z2 + x2^2z2^2. This can almost be expressed as (x · z + 1)^2, the difference being that there is a constant term and that the different terms are re-weighted by constant multipliers; it corresponds to the slightly different map

Φ'(x) = (√2 x1, √2 x2, x1^2, √2 x1x2, x2^2).

The kernel technique is built on Mercer's theorem, which says that certain types of functions K(x, y) of two feature vectors are guaranteed to correspond to some mapping Φ in the sense that Φ(x) · Φ(z) = K(x, z). In this case we do not need to explicitly map the example vectors to the high-dimensional space: the SVM algorithm uses the vectors only through scalar products, and these can be directly evaluated from the kernel K(x, z).
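The correspondence between the kernel (x · z + 1)^2 and the map Φ' is easy to verify numerically; the only extra ingredient is the constant component, which we append to Φ' here (a sketch; augmenting the map with a constant 1 is an assumption made to obtain exact equality with the kernel):

```python
def poly_kernel(x, z):
    # K(x, z) = (x . z + 1)^2 for 2-dimensional inputs
    s = x[0] * z[0] + x[1] * z[1]
    return (s + 1.0) ** 2

def phi_prime(x):
    # Feature map matching the kernel: (1, sqrt(2)x1, sqrt(2)x2, x1^2, sqrt(2)x1x2, x2^2)
    r2 = 2.0 ** 0.5
    return (1.0, r2 * x[0], r2 * x[1], x[0] * x[0], r2 * x[0] * x[1], x[1] * x[1])

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))
```

Expanding (x1z1 + x2z2 + 1)^2 term by term shows why the √2 weights appear: each cross term occurs twice in the square.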

Mercer’s theorem is part of the toolbox in mathematical physics [22] and isquite interesting: it says under what conditions the eigenfunctions of an integralequation have the property that the scalar product of the eigenfunction vectorsat two points is equal to the kernel at the two points. This allows one to usemappings into infinite-dimensional space, since the eigenfunction sets are often

25

Page 26:  · Statistical Methods in Applied Computer Science Lecture Notes Preliminary for course 2D5342, Data Mining, Jan-April 2006. Stefan Arnborg ∗ KTH Nada February 23, 2006 Abstract

infinite. The background is Mercer’s condition on a symmetric kernel K(x, y)of an integral equation: ∫

C

K(x, y)f(x)dx = λf(y)

This equation can be thought of as a simple eigenvector problem, but in an infinite-dimensional Hilbert space. There is a sequence of eigenvalues λi decreasing in magnitude, and a corresponding sequence of real-valued (normalized) eigenfunctions Ψi : C → R. We are interested in cases where all eigenvalues are positive and where the scaled eigenfunctions Φi = √λi Ψi satisfy our requirement

K(x, y) = ∑_i Φi(x)Φi(y).

If this is the case, the kernel trick works for the map Φ(x) = (Φ1(x), Φ2(x), . . .) and we can run the margin algorithm without actually mapping the feature vectors to the possibly infinite-dimensional space. Mercer's theorem says that this will be the case for a symmetric kernel K(x, y) whenever Mercer's condition is satisfied:

∫_C ∫_C K(x, y)g(x)g(y)dxdy ≥ 0

for every square integrable function g(x), where C is a compact subset of some space R^d (the original feature space).

We can now generalize our conic section separator (x · z + 1)^2 to K(x, z) = (x · z + 1)^d. This kernel corresponds to mapping a feature vector x to a high-dimensional vector whose components are all monomials of degree not greater than d.

There is a large number of kernels suitable for different types of modeling problems; see, e.g., [25, 73]. The kernel trick is not only applicable in the SVM context: whenever the required computation can be described in terms of scalar products of feature vectors, the method is applicable. Several examples are given, e.g., in [14].

2.6 Uncertainty Management

Interpretation of observations is fundamental for many engineering applications, and is studied under the heading of uncertainty management. Designers have often found statistical methods unsatisfactory for such applications, and invented a considerable battery of alternative methods claimed to be better in some or all applications. This has caused significant problems in applications like tracking in command and control, where different tracking systems with different types of uncertainty management cannot easily be integrated to make optimal use of the available plots and bearings. Among the alternative uncertainty management methods are Dempster-Shafer theory[76] and many types of non-monotonic reasoning. These methods can be interpreted, to some extent, as robust Bayesian analysis, where the analyst need not give precise priors and likelihoods but can specify convex sets of priors and likelihoods [9, 2], and as Bayesian analysis with infinitesimal probabilities, respectively[89, 5]. We have also generalized the analyses by de Finetti, Savage and Cox, showing that under slightly weaker assumptions than theirs, uncertainty management where belief is expressed with families of probability distributions that can contain infinitesimal probabilities is the most general method satisfying compelling criteria on rationality[4, 3]. Other alternative methods, like fuzzy logic, rough set theory, neural networks and case-based reasoning, should rather be seen as a search for more useful model families than those used traditionally in statistics. These methods can in principle be described in the Bayesian framework[91, 60, 67, 47], although frequently one makes heuristic validation tests rather than statistical ones. Moreover, since these methods were also developed with the intention of avoiding statistics, there is a certain reluctance to start viewing them as statistical models, besides the feeling that such analyses might be infeasible. A few statistical methods can be applied even to rather unorthodox statistical models. If my favorite method tells me that there is a certain structure in the data, how can I verify this? The standard way is to obtain a new set of observations and see if the new set shows the same structure. But new observations are in many cases unobtainable for cost or time reasons. By resampling techniques new observation sets can be created, by taking parts of the original data set. If the same structure can be seen in all or most parts of the original data set, this gives confidence that the structure seen is not only random noise. Likewise, a method's tendency to see things in data that have no real meaning can be checked by testing it on random data. By taking a real data set consisting of a number of cases, each with a set of variables, and randomly shuffling each variable among the cases, one gets a data set in which no serious method should find significant structure.
Likewise, complex prediction methods can be tested by randomly partitioning the available data set into one training set, from which the predictor is obtained, and another from which its performance is estimated. These methods should also be used when analyzing data using conventional statistical methods, since they provide good tests against programming and handling errors.
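The variable-shuffling check described above is a one-liner per column; a sketch (function name illustrative):

```python
import random

def shuffle_variables(cases, rng=random):
    """Independently permute each variable (column) among the cases (rows).
    This destroys any dependence between the variables while keeping every
    marginal distribution exactly intact."""
    n_vars = len(cases[0])
    columns = [[row[j] for row in cases] for j in range(n_vars)]
    for col in columns:
        rng.shuffle(col)
    return [tuple(col[i] for col in columns) for i in range(len(cases))]
```

Any structure a method reports on such shuffled data is, by construction, an artifact of the method rather than of the data.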

3 Data models

The formulas in Chapter 2.1 always contain a data probability distribution term, in parameterized models as a function of (conditioned on) parameters. We will now go through a number of such distributions often used in Bayesian analyses and uncertainty management. You have probably seen most of it in standard statistics courses, but the Bayesian angle of these notes makes the emphasis different. We start out from simple data types and work towards complex ones.

3.1 Univariate data

Univariate data consist of real numbers, integers, or categorical values. The latter are usually coded as integers, but the ordering has no significance – the values can be thought of as members of an unordered set.

3.1.1 Discrete distributions and the Dirichlet prior

We have already seen the coin-tossing example. Here the data are binary categorical (heads or tails), and the parameter is the probability of heads (or, by symmetry, tails). The inference is about this parameter, and the priors and posteriors in section 2.1.7 are distributions for the parameter. The distributions chosen are called Beta distributions, and many of them (specifically, those with integer parameters, since the numbers of successes and failures in our experiment must be integers) are generated from the uniform distribution by multiplication with likelihoods and normalization.

This can be generalized to general discrete distributions over the values of a categorical variable.

For a discrete distribution over d values, the outcome is a number from 1 to d. The parameter set is a sequence of probabilities x = (x1, . . . , xd), constrained by 0 ≤ xi and ∑_i xi = 1 (often the last parameter xd is omitted – it is determined by the first d − 1 ones). The likelihood after observing a set of outcomes where outcome i occurred ni times is ∏_i xi^ni. Normalizing this quantity so that its integral over the polytope given by 0 ≤ xi and ∑_i xi = 1 becomes one, we have an example of the Dirichlet distribution. The Dirichlet distribution with parameter set α is

Di(x|α) = (Γ(∑_i αi)/∏_i Γ(αi)) ∏_i xi^(αi−1),   (14)

where Γ(n + 1) = n! for natural numbers n. The normalizing constant Γ(∑_i αi)/∏_i Γ(αi) gives a useful mnemonic for integrating ∏_i xi^(αi−1) over the (d − 1)-dimensional hyperplane segment with xd = 1 − ∑_{i=1}^{d−1} xi and xi ≥ 0. It is very convenient to use Dirichlet priors, for the posterior is also a Dirichlet distribution: after having obtained data with frequency count n = (n1, . . . , nd) we just add it to the prior parameter vector α to get the posterior parameter vector α + n.

With no specific prior information for x, it is necessary from symmetry considerations to assume all Dirichlet parameters equal, αi = α. A convenient prior is the uniform prior with αi = 1. This is, e.g., the prior used by Laplace to derive the rule of succession, see Ch 18 of [48] or [28]. Other priors have been used, e.g., α = 1/2 in the case d = 2, which is a minimum information (Jeffreys) prior. The value α = 1/2 has also been used for d > 2 (Madigan and Raftery[52]). Cheeseman and Stutz[18] report the use of α = 1 + 1/d. Experiments have shown little difference between these choices, but it is easy to see that the Jeffreys prior promotes xi close to 0 or 1 somewhat, whereas α = 1 + 1/d penalizes extreme probabilities. If we get significant differences between different uninformative priors, this warrants a closer investigation of the adequacy of the data and modeling assumptions.
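Since the posterior under a symmetric Di(α) prior is Di(x|α + n), the posterior mean of each category probability is (ni + α)/(N + dα), which makes the effect of the prior choices above easy to compare (a sketch; the counts are illustrative):

```python
def dirichlet_posterior_mean(counts, alpha):
    """Posterior mean of each category probability under a symmetric
    Dirichlet prior Di(alpha): (n_i + alpha) / (N + d * alpha)."""
    total = sum(counts) + alpha * len(counts)
    return [(n_i + alpha) / total for n_i in counts]

counts = [8, 2, 0]                                   # illustrative frequency counts
for alpha in (1.0, 0.5, 1.0 + 1.0 / len(counts)):    # Laplace, Jeffreys, Cheeseman-Stutz
    print(alpha, dirichlet_posterior_mean(counts, alpha))
```

Note how the zero-count category gets a smaller posterior mean under α = 1/2 than under α = 1: the Jeffreys prior pulls estimates less strongly toward the uniform distribution.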

We will now find the data probability for a general discrete distribution (x1, . . . , xd). Integrating out the xi with the prior gives the probability of the data given model M (M is characterized by a parameterized probability distribution and a prior Di(x|α) on its parameters):

p(n|M) = ∫ p(n|x)p(x)dx

= ∫ (n choose n1, . . . , nd) ∏_i xi^ni · (Γ(∑_i αi)/∏_i Γ(αi)) ∏_i xi^(αi−1) dx,

which for the symmetric prior αi = α evaluates to

p(n|M) = (n choose n1, . . . , nd) (Γ(dα)/Γ(α)^d) (∏_i Γ(ni + α)/Γ(n + dα))   (15)

= Γ(n + 1)Γ(dα) ∏_i Γ(ni + α) / (Γ(α)^d Γ(n + dα) ∏_i Γ(ni + 1)).   (16)

As is easily seen, the uniform prior (αi = 1 for all i) gives a probability for each sample size that is independent of the actual data:

p(n|M) = Γ(n + 1)Γ(d)/Γ(n + d).   (17)
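Equation (17) assigns the same probability to every frequency vector (n1, . . . , nd) with sum n; since there are C(n + d − 1, d − 1) such vectors, these probabilities must sum to one, i.e. Γ(n + 1)Γ(d)/Γ(n + d) = 1/C(n + d − 1, d − 1). A quick numerical check:

```python
from math import comb, gamma

def p_counts_uniform_prior(n, d):
    # Equation (17): probability of any particular frequency vector (n_1, ..., n_d)
    return gamma(n + 1) * gamma(d) / gamma(n + d)

# Number of frequency vectors with d categories summing to n, times the
# per-vector probability, should be exactly 1.
n, d = 7, 4
total = comb(n + d - 1, d - 1) * p_counts_uniform_prior(n, d)
```

This is a useful sanity check when implementing (16) for general α as well.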

Exercise 7 Derive the normalization constant for the Dirichlet distribution
(i) with αi = 1 for all i;
(ii) with integer parameters αi.

3.1.2 The normal and t distributions

The most common assumption about a real-valued variable is that it has the normal distribution. This assumption has some motivation, as there are several theorems saying that under certain conditions a quantity will be normally distributed. However, in practice most data sets analyzed do not follow the normal distribution. Usually, the tails of empirical distributions are longer than the very short tails of the normal distribution. So whether or not the normal assumption is appropriate depends a lot on what kind of data you have, and on the purpose of the analysis.

The normal distribution has parameters µ and σ^2, the mean and variance. The distribution is

f(x|µ, σ^2) = N(x|µ, σ^2) = (1/(√(2π)σ)) exp(−(x − µ)^2/(2σ^2)).

There is an alternative parameterization using the precision λ = 1/σ^2 instead of the variance. It is particularly popular in Bayesian statistics[13].

So let us analyze a sequence of real-valued measurements (xi), i = 1, . . . , n, assumed to be independently drawn from some normal distribution whose parameters we want to make inference about. The likelihood function is then

f((xi)|µ, σ^2) = (1/(√(2π)σ)^n) exp(−(1/(2σ^2)) ∑_i (xi − µ)^2),

which, applying the abbreviations x̄ = ∑_i xi/n and s^2 = ∑_i (xi − x̄)^2/n, simplifies to

f((xi)|µ, σ^2) = (1/(√(2π)σ)^n) exp(−(1/(2σ^2))(ns^2 + nx̄^2 − 2nµx̄ + nµ^2)).   (18)

Since the likelihood depends on the data only through the summaries x̄ and s^2, these are so-called sufficient statistics – we do not need to retain the individual values for inference under the normality assumption. However, the individual values can well be needed for other purposes, e.g., for testing whether or not the assumption of normality is plausible.

In order to apply (3), we must have priors on µ and σ^2. But we could also assume that one of these is known, and that the purpose of the analysis is to make inference on the other. This allows for a more pedagogical progression, but is also prototypical for some common real concerns. We could for example assume that σ^2 is known, as in the case where we make repeated measurements of a fixed quantity in order to improve precision. We may have access to a statement of measuring accuracy that can be translated to a known value for σ^2 when making inference about µ. Or, we may want to make inference on measurement accuracy by repeatedly measuring a fixed quantity whose 'true' value is known, like measuring the length of the archive meter, when making inference about σ^2.

Assume we know the value of σ^2. A natural choice of uninformative prior for µ is the uniform distribution. However, since µ can vary over the whole real line, the uniform distribution does not exist as a probability distribution. It is nevertheless possible to use priors that cannot be normalized to probability distributions, and it is standard practice for the normal distribution. Such non-normalizable priors are called improper priors.

The case of fixed σ^2 and an improper uniform prior allows us to rewrite the likelihood (18) as

f((xi)|µ, σ^2) = f(µ|(xi), σ^2) ∝ exp(−(1/(2σ^2))(n(µ − x̄)^2 + C)),

where the quantity C = ns^2 does not contain the parameter µ. Regarded as an unnormalized posterior for µ, this is a normal distribution with mean x̄ and variance σ^2/n. So despite the fact that the prior is improper, the posterior – the product of the improper prior 1 and the likelihood – is a proper distribution for all non-empty data sets (n > 0), essentially because the likelihood has a convergent integral with respect to µ over the real line.

One standard assumption for the normal distribution is that (µ, log σ) is uniformly distributed over the whole plane, and thus that σ has a distribution proportional to 1/σ. This is also an improper prior, since the integral ∫(1/σ)dσ diverges both at 0 and at +∞. Again, the posterior is a proper distribution, since the normalization constant is the inverse of the convergent integral ∫_0^∞ σ^(−n−1) exp(−(1/(2σ^2))(n(µ − x̄)^2 + C))dσ. You may not have seen this distribution, but looking in a table such as [63] or [13, App. A], it is readily identified as a Gamma distribution (see exercise 9).

Finally, we may want to make inference on both µ and σ. Their joint posterior under the standard improper prior assumption is given by equation (18) multiplied by the improper prior 1/σ, which is maybe somewhat unintuitive. It would be nice to find the marginal probabilities of the two parameters, as in the case where we are only interested in the mean µ. This is obtained by integrating out σ, an instance of a common procedure known as integrating out nuisance parameters. Somewhat surprisingly, the marginal distribution of µ is no longer a normal distribution but the somewhat more long-tailed t-distribution. The reason is that averaging over a long tail of the distribution over σ makes the posterior a continuous mixture of normal distributions with varying variances. Indeed,

f(µ|(xi)) ∝ ∫_0^∞ σ^(−n−1) exp(−(1/(2σ^2))(n(µ − x̄)^2 + ns^2))dσ   (19)

∝ (1 + n(µ − x̄)^2/(ns^2))^(−n/2),   (20)

where the last line was obtained after a change of variable z = (ns^2 + n(µ − x̄)^2)/(2σ^2). This is an instance of the t distribution, which is usually given a parametric form with parameters called mean (µ), scale (σ) and degrees of freedom (ν).
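The proportionality between (19) and (20) can be checked numerically by approximating the integral with a crude Riemann sum and comparing ratios of the unnormalized density at two values of µ (a sketch; the grid resolution, truncation point and sample summaries are illustrative, and the formulas follow (19) and (20) as written here):

```python
import math

def marginal_unnorm(mu, xbar, s2, n, steps=200000, sigma_max=50.0):
    """Riemann-sum approximation of the integral (19), up to normalization."""
    h = sigma_max / steps
    total = 0.0
    for k in range(1, steps + 1):
        sigma = k * h
        total += (sigma ** (-n - 1)
                  * math.exp(-(n * (mu - xbar) ** 2 + n * s2)
                             / (2.0 * sigma * sigma)) * h)
    return total

def t_form(mu, xbar, s2, n):
    """The closed form (20), up to the same normalization."""
    return (1.0 + (mu - xbar) ** 2 / s2) ** (-n / 2.0)
```

Ratios of the two forms at different µ agree to several digits, confirming that integrating out σ produces the t-shaped tail.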

Exercise 8 A sample y1, . . . , yn of real numbers has been obtained. It is known to consist of independent variables with a common normal distribution. This distribution has known variance σ^2 and the mean is known to be either 1 or 2 (hypotheses H1 and H2, both cases considered equally plausible).
(i) Which is the data probability function P(D|Hi)?
(ii) Describe a reasonable Bayesian method for deciding the mean value.
(iii) Characterize the power of the suggested procedure as a function of σ^2, assuming that the sample consists of a single point.

Exercise 9 Show that the posterior for inference of normal distribution variance under the standard assumption is a gamma distribution, and find its parameters.

3.1.3 Nonparametrics and mixtures

A common desire in the analysis of univariate (and multivariate) data is to make inference on the distribution of a sample without assuming anything about the analytical form of the distribution. In other words, we do not want to make inference about parameters of a schoolbook distribution, but about the distribution itself. This field of investigation is called non-parametric inference. In principle, equation (3) is applicable also in non-parametric analysis. In this case the parameter set is the set of all distributions we want to consider, and because this set is overwhelmingly large, we must naturally have a suitable prior for this parameter. We can also expect that posterior computation will be completely different from the simple estimation of a few parameters we saw in the previous examples. The set of all distributions also includes functions with strange sets of singularities.

In applying non-parametric Bayesian analysis we feed an observation set into (3). Here we have the first clue to non-parametrics: if the sample is small, we can never get a good handle on the small-scale behavior of the distribution, and we must content ourselves with inference among a set of fairly regular (smooth) functions. We will consider a couple of ways to accomplish this. The first method consists in discretization: divide the range of the variable into bins and assume a general discrete distribution for the probabilities of the variable falling into each bin. From such a general distribution, a pdf that is piecewise constant over each bin can be used as an estimate of the actual distribution.

The second example shows how a distribution can be modeled as a mixture of simpler distributions. If the components are normal distributions, the mixture can be expressed as f(x) = ∑_{i=1}^n λi N(x|µi, σi^2). Here, the mixing coefficients λi form a probability distribution, i.e., they are non-negative and sum to one. We can think of drawing a value from the mixture as first drawing the index i according to the distribution defined by the mixing coefficients and then drawing the variable according to N(x|µi, σi^2). In this case we cannot know for certain which component generated a particular data item, since the supports of the distributions overlap.

3.1.4 Piecewise constant distribution

This uses the method of section 3.1.1 above. We design an MCMC trace that approaches the posterior distribution of the breakpoints delineating the bins of constant intensity. Since different 'binnings' of the variable give different resolutions of the estimate, we can approach the scale selection problem by trying different bin sizes and obtaining a posterior estimate for the appropriate number of bins, using (1) to compare the description of an interval either as one or as two bins, in each case assuming that the probability density of values is constant over each bin.

Consider now a sample of real-valued variables (xi) and a division of the real line into bins, so that bin i consists of the interval from bi to bi+1. It does not matter much to which bin we assign the dividing points bi – these points form a set of probability zero in this model. Each division into bins can be considered a composite model, and we compare the posterior probabilities of these models to find a probability distribution over models. As a preview of the MCMC method, we will describe an iterative procedure to sample from this posterior of bin boundaries and bin probabilities with respect to the observed data. This sampling is organized as a sequence of model visits, where we visit the models in such a way that the chain, taken as a sample, is a draw from the posterior. When we are positioned at one model, we consider visiting other models produced either by splitting a bin into two, or by merging two adjacent bins into one. We are thus in each step comparing two models, the finer having d bins. We always consider a uniform Dirichlet prior for the probabilities (x1, . . . , xd) of a variable falling into each of the d bins. Bin i has width wi. The data probability is then ∏_i xi^ni. The normalization constant in (14) tells us that over uniformly distributed xi, the data probability will be ∏_i Γ(ni + 1)/Γ(∑_i (ni + 1)).

Consider the adjacent model where bins j and j + 1 have been merged. We now assume in this model that the probability density is constant over bins j and j + 1, so the probability xj in the coarser model should be split into αxj and (1 − α)xj, the probabilities that generated the counts nj and nj+1 in the finer model. The proportion α is the fraction of the combined bin length belonging to the first constituent, α = wj/(wj + wj+1). The data probability for the coarser model will now be the product x1^n1 · · · (αxj)^nj ((1 − α)xj)^nj+1 xj+2^nj+2 · · · xd^nd. Integrating out the xi (now only d − 1 variables) leads to the posterior probability

α^nj (1 − α)^nj+1 ∏_{i∉{j,j+1}} Γ(ni + 1) Γ(nj + nj+1 + 1) / Γ(∑_i (ni + 1) − 1).   (21)

We must have some idea of how many bins there should be. A conventional assumption is that change-points are events with constant intensity, so the number of change-points should follow some Poisson distribution: the probability of d change-points is exp(−λ)λ^d/d!, with a hyper-parameter λ saying how many change-points we expect. The Bayes factor in favor of the finer model is thus

α^{−n_j} (1 − α)^{−n_{j+1}} Γ(n_j + 1) Γ(n_{j+1} + 1) λ / ((n + d − 1) Γ(n_j + n_{j+1} + 1)(d + 1)).   (22)

The technique used above to integrate out the x_i is known as Rao-Blackwellization [17]. It can improve MCMC computations tremendously.
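Equation (22) is straightforward to transcribe into log form for use in an MCMC acceptance step. The following Python sketch is an illustrative transcription (function and argument names are ours, not from the notes):

```python
import math

def log_bf_split(nj, nj1, wj, wj1, n, d, lam):
    """Log of the Bayes factor (22) in favor of the finer model, which splits
    a merged bin into two bins with counts nj, nj1 and widths wj, wj1;
    n is the total count, d the number of bins in the finer model, and
    lam the Poisson hyper-parameter on the number of change-points."""
    alpha = wj / (wj + wj1)
    lg = math.lgamma
    return (-nj * math.log(alpha) - nj1 * math.log(1.0 - alpha)
            + lg(nj + 1) + lg(nj1 + 1) + math.log(lam)
            - math.log(n + d - 1) - lg(nj + nj1 + 1) - math.log(d + 1))
```

With equal widths and an even split of the counts the factor is below one (splitting is not supported by the data), while the same total count concentrated in one half of the merged bin favors the split.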

While this method is very easy to apply for a univariate data inference problem, it does not really generalize to high-dimensional problems. The assumption of constant probability density over each bin is sometimes appropriate, as in making inference on the intensity of an event over time, in which case the method tries to identify a set of 'change-points'. A classical example of change-point analysis is a data set of the times at which coal-mining disasters occurred in England [64]. After having identified probable change-points of the disaster intensity, one can for example try to identify changes in coal-mining practice associated with the change-points in intensity. A result from the popular coal-mining disaster data set is shown in figure 9. This data set has been analyzed with similar objectives in several papers, e.g., in [43]. Compared to the analysis there, which uses a slightly different prior on the function space, we get more change-points but similar density estimates. It is typical of MCMC approaches that subtle differences between models that seem equally plausible give different results. Since programming is involved, it is also useful to check the model by running it on a similar set of points generated with a uniform distribution, and to check that it proposes a more even density of change-points than the actual coal-mining disaster data. One random example is shown in figure 10. With experience, the graph can be judged to be 'more random' for the synthetic data than for the real data, since it has a much more diffuse distribution of change-points. However, there is an embarrassing lack of obviousness in this judgment. It seems as if the real data could, with significant plausibility, actually have been generated with constant intensity.

In order to convince ourselves more thoroughly that the coal-mining disaster data cannot reasonably be a sample from a uniform distribution, we can generate a large number of random data sets and compare their cumulative plots with that of the coal-mining data. The result is shown in figure 11.

It may be interesting to know what might have caused the change around 1888: according to the analysis in [64], this time period was characterized by a fairly steep decline in productivity (yield per hour worked). This was apparently caused not by state regulation of safety measures in mining, but by the build-up of trade unions in the mining industry.

In other applications, the assumption of piece-wise constant densities may be less appropriate. An example is when one estimates a distribution in order to find clusters in the underlying generation mechanism. For example, we may be interested in finding possible subspecies in a population where each population member has a quantity (weight, height, etc.) that is explained by a normal distribution with subspecies-specific parameters µ and σ². In this case we can model the distribution as a mixture of normals (next section, 3.1.5).

Exercise 10 We tested above the hypothesis that the coal-mining disaster density is in fact constant. Is it possible to also test the hypothesis that the intensity is piece-wise constant?


Figure 9: Coal-mining disaster data: plots of occurrence times of disasters, MC estimate of the density of change-points, five overlaid density estimates, and the distribution of the observed number of change-points. The second plot shows what is known as the probability hypothesis density (PHD) in the more complex case of multi-target tracking[?], expected to be used, e.g., in future intelligent cruise controls to keep track of other vehicles and obstacles in road traffic.

3.1.5 Univariate Gaussian Mixture modeling

Consider the problem of deciding, for a set of real numbers, the most plausible decompositions of the distribution as a weighted sum (mixture) of a number of distributions, each being a univariate normal (Gaussian) distribution. This problem has significance when we try to find 'discrete' circumstances behind a measured variable which is also influenced by various chance fluctuations. Note that a single column of discrete data is not decomposable in this way, because a mixture of discrete distributions is again a discrete distribution, whereas a mixture of normal distributions is not itself a normal distribution. In the frequently used Enzyme problem [66], the discovered components, if any, could correspond to genetic factors in a population. There are many approaches to solving this problem, and many carry over to the more general problem of modeling a matrix of reals as coming from a mixture of multivariate Gaussians [46]. The approach presented here seems to have an advantage in that it is not necessary to include the parameters (mean and variance) of the participating Gaussians. It simulates only the assignment of variables to classes, and for each such assignment it computes the exact posterior of the data including the latent class variable. This posterior is, for class I:

p(D|I) = |I|! ∫ Π_{i∈I} p(d_i|Θ) p(Θ) dΘ.   (23)

If we assume a uniform prior for the distribution over classes, this factor will only depend on the total sample size and can be ignored (see equation (17)),


Figure 10: Checking the model: a similar data set with uniformly distributed points.

so we just multiply together the contributions from each class to get the model probability for the current class assignment.

Just as the discrete probability distribution has the Dirichlet conjugate family, the univariate Gaussian has a conjugate family, which is a distribution over the mean and variance parameters of the Gaussian. It is possible to define priors where the mean and variance parameters of the Gaussian are a priori independent, see [16], but with that prior the data probability is not expressible in closed form (it involves the exponential integral). For the natural conjugate of a Gaussian, in the notation of Bernardo and Smith [13], we use the inverse of the variance as a parameter (the precision) and have the Gaussian distribution:

f(x|µ, λ) = √(λ/(2π)) exp(−(λ/2)(x − µ)²),   (24)

and the natural prior for µ and λ is a normal-gamma distribution where the precision λ is drawn from a gamma distribution with parameters α > 0 and β > 0. The mean µ is then normally distributed with mean µ_0 and precision n_0λ, where n_0 > 0 and µ_0 are new constants:

p(λ|α, β) = (β^α/Γ(α)) λ^{α−1} e^{−βλ}   (25)

p(µ|µ_0, n_0, λ) = √(n_0λ/(2π)) exp(−(n_0λ/2)(µ − µ_0)²).   (26)

The joint probability distribution is thus:

p(D, µ, λ) = (β^α/Γ(α)) λ^{α−1} e^{−βλ} · (λ/(2π))^{n/2} e^{−(λ/2)Σ_i(x_i − µ)²} · √(n_0λ/(2π)) e^{−(n_0λ/2)(µ − µ_0)²}   (27)


Figure 11: Coal-mining data against 100 sets of uniformly distributed random data. The coordinates are cumulative count and year. The lowest curve comes from the real data and is considerably below those from the 100 random data sets. The coal-mining disaster data thus has significant non-uniform structure.

and we want to obtain the data probability by integrating over µ and λ, which after some calculation and the substitutions n′s² = Σ x_i² + n_0µ_0², n′m = Σ x_i + n_0µ_0 and n′ = n + n_0 yields:

p(D|n_0, α, β) = (β^α/Γ(α)) n! (2π)^{−n/2} √(n_0/n′) ∫_0^∞ e^{−(n′λ/2)(s² − m²)} λ^{n/2+α−1} e^{−βλ} dλ.   (28)

The value of the integral is the inverse normalization constant for a gamma distribution with parameters (α + n/2, β + (n′/2)(s² − m²)) and can be obtained by substituting in (25). The probability of data D and class assignment {I_c}_{c∈C} is then

p(D, {I_c}_{c∈C}) = (2π)^{−n/2} Π_c (β^α/Γ(α)) n_c! √(n_0/n′_c) Γ(α + n_c/2) (β + (n′_c/2)(s_c² − m_c²))^{−(α+n_c/2)}.   (29)

The choice of the parameters can be made so that µ and λ are fairly evenly distributed over a range covering the values found likely by inspection of the data. The coupling of the precisions of the distributions of the data points and of the prior class mean seems to constrain this model in a bad way, since it penalizes sharp peaks in the outskirts of the distribution. A theoretically sound way to deal with this problem is to assign a so-called hyper-prior to the parameter µ_0. The result of such an exercise is a hierarchical model.
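The per-class factors of (29) are best evaluated in the log domain. The Python sketch below (our own naming, a sketch rather than code from the notes) computes the marginal probability of the points assigned to one class under the normal-gamma prior (25)-(26):

```python
import math

def log_class_prob(xs, mu0=0.0, n0=1.0, alpha=1.0, beta=1.0):
    """Log of one class factor in (29): the marginal probability of the
    points xs assigned to a class under the normal-gamma prior."""
    n = len(xs)
    npr = n + n0                                          # n' = n + n0
    m = (sum(xs) + n0 * mu0) / npr                        # n'm  = sum x_i + n0*mu0
    s2 = (sum(x * x for x in xs) + n0 * mu0 ** 2) / npr   # n's2 = sum x_i^2 + n0*mu0^2
    lg = math.lgamma
    return (alpha * math.log(beta) - lg(alpha)            # beta^alpha / Gamma(alpha)
            + lg(n + 1)                                   # n_c! multiplicity factor
            - (n / 2) * math.log(2 * math.pi)
            + 0.5 * (math.log(n0) - math.log(npr))
            + lg(alpha + n / 2)
            - (alpha + n / 2) * math.log(beta + npr / 2 * (s2 - m * m)))
```

As a check, for a single point at the prior mean with the default parameters the marginal reduces analytically to Γ(3/2)/(2√π) = 1/4.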

Equation (29) gives an exact probability of data and class assignment, given parameters µ_0, α, β and n_0. The Metropolis-Hastings proposal will be a reassignment of the class of one data point. This changes the n_c, m_c and s_c² values of two classes by amounts that are easily computed. The resulting density ratio of the distribution (29), which controls the probability of taking the proposed move, is also easy to compute. In the hierarchical model we would also have a class-specific prior mean µ_c, which is also recomputed in a move. It is practical to forbid two class means to switch by these moves, so that the classes can always be recognized by their relative prior means.

In this one-dimensional example, switching means is a symptom of a problem that affects the naive mixture identification of which the above is an example: since membership of cases in mixture components does not depend on the relative order of the components, an MCMC trace may switch the labeling of the components. This is accomplished by a statistically rare switching of membership of cases between two components (that are relatively close in their component parameters), which after some time will effectively mean that the two components have switched 'labels', i.e., their positions in the ordered list of components. This phenomenon is of no importance when the components are dissimilar enough to prevent switching from occurring in practice, which is a common case. For more subtle mixture-identification problems, methods are available either to simply prevent switching [58], or of a more sophisticated nature [75].

3.2 Multivariate and sequence data models

In principle, a univariate data item or random variable can be generalized by two recursive operations, forming sets and sequences, respectively: a (multivariate) set is formed as a finite collection, a sequence as an indexed collection. Since the operations are recursive, we can in principle create sets of sequences, sequences of sets, and many more complicated structures, which we will avoid here. A sequence can be thought of as a time series, but sequences are used also to describe, e.g., genes and proteins. For time series we may want to predict their future behavior, which is typically done using various filter operations that can be collectively described with equation (6). But for other sequences it may be more important to make inference about a hidden 'latent structure', using the retrodiction equation (7).

Sets of variables are typically analyzed using either a graphical dependency model (common for categorical data), or with various types of normality assumptions, as in factor analysis and regression. Also here it may be useful to make inferences about hidden structure in the set, typically by decomposing the data set into classes that each have a simpler statistical structure than the total set.

3.2.1 Multivariate models

We consider a data matrix where rows are cases and columns are variables. In a medical research application, a row is associated with a person or an investigation (patient and date). In an internet use application the case could be an interaction session. The columns describe a large number of variables that could be recorded, such as background data (occupation, sex, age, etc.) and numbers extracted from investigations made, like sizes of brain regions, receptor densities and blood flow by region, etc. Categorical data can be equipped with a confidence (the probability that the recorded datum is correct), and numerical data with an error bar. Every datum can be recorded as missing, and the reason for missing data can be related to the patient's condition or to external factors (like


equipment unavailability or time and cost constraints). Only the latter type of missing data is (at least approximately) unrelated to the domain of investigation. On the level of exploratory analysis we confine ourselves to discrete and multivariate normal distributions, with Dirichlet and inverse Wishart priors. In this way, no delicate and costly MCMC methods will be required until missing data and/or segmentation is introduced. If the data do not satisfy these conditions (e.g., normality for a real variable), they may do so after suitable transformation and/or segmentation. Another approach is to ignore the distribution over the real line and regard a numerical attribute as an ordinal one, i.e., to consider only the ordering between values. Such ordinal data also appear naturally in applications where subjects are asked to grade a quantity, like their appreciation of a phenomenon in organized society or their valuation of their own emotions.

In many applications the definition of the data matrices is not obvious. For example, in text mining applications, the character sequences of information items are not directly of interest, but a complex coding of their meaning must be done, taking into account the (natural) language used and the purpose of the application.

3.2.2 Dependency tests for categorical variables using Dirichlet

Assume that we observe pairs (a, b) of discrete variables. How can we find out whether or not they are dependent? There is a large number of procedures proposed to answer this question, but one is particularly elegant and intuitive in Bayesian analysis. We want to choose between two parameterized models, one that captures independence and one that captures dependency. The first one, M_I, is characterized by two discrete distributions, one for A assumed to have d_A outcomes and one for B assumed to have d_B outcomes. Their probability vectors are (x^A_1, . . . , x^A_{d_A}) and (x^B_1, . . . , x^B_{d_B}), respectively. This model gives the probability x^A_i x^B_j for outcome (i, j). The second model, M_D, says that the two variables are generated jointly by one discrete distribution having d_A d_B outcomes, each outcome defining both an outcome for A and one for B. This distribution is characterized by the probability vector (x^{AB}_1, . . . , x^{AB}_{d_A d_B}).

In order to get composite hypotheses we must equip our models with priors over the parameter spaces. Let us choose uniform priors for all three participating distributions.

A set of outcomes for a sequence of discrete variables can be conveniently summarized in a contingency table. In our case this is a matrix n with element n_ij giving the number of occurrences of (i, j) in the sample. We can now find the probabilities of an outcome n for the two models by integrating out the parameters from the product of likelihood and prior.

As we saw in the derivation of equation (17), the uniform prior (α_i = 1 for all i) gives a probability for each sample size that is independent of the actual data:

p(n|M) = Γ(n + 1)Γ(d)/Γ(n + d).   (30)

Consider now the data matrix over A and B. Let n_ij be the number of rows with value i for A and value j for B. Let n_·j and n_i· be the marginal counts where we have summed over the 'dotted' index, and n = n_·· = Σ_ij n_ij. The probability of the data given M_D is obtained by replacing the products and replacing d by d_A d_B in equation (17):

p(n|M_D) = Γ(n + 1)Γ(d_A d_B)/Γ(n + d_A d_B).   (31)

We now consider the model of independence, M_I. Assuming parameters x^A and x^B for the two distributions, a row with values i for A and j for B will have probability x^A_i x^B_j. For discrete distribution parameters x^A, x^B, the probability of the data matrix n will be (where C(n; n_11, . . . , n_{d_A d_B}) denotes the multinomial coefficient):

p(n|x^A, x^B) = C(n; n_11, . . . , n_{d_A d_B}) Π_{i,j} (x^A_i x^B_j)^{n_ij}
             = C(n; n_11, . . . , n_{d_A d_B}) Π_i (x^A_i)^{n_i·} Π_j (x^B_j)^{n_·j}.

Integration over the uniform priors for A and B gives the data probability given model M_I:

p(n|M_I) = ∫ p(n|x^A, x^B) p(x^A) p(x^B) dx^A dx^B
         = ∫ C(n; n_11, . . . , n_{d_A d_B}) Π_i (x^A_i)^{n_i·} Π_j (x^B_j)^{n_·j} Γ(d_A)Γ(d_B) dx^A dx^B
         = (Γ(n + 1)Γ(d_A)Γ(d_B) / (Γ(n + d_A)Γ(n + d_B))) · Π_i Γ(n_i· + 1) Π_j Γ(n_·j + 1) / Π_ij Γ(n_ij + 1).

From equation (31) and the expression above we obtain the Bayes factor in favor of independence:

p(n|M_I)/p(n|M_D) = (Γ(n + d_A d_B)Γ(d_A)Γ(d_B) / (Γ(n + d_A)Γ(n + d_B)Γ(d_A d_B))) · Π_i Γ(n_i· + 1) Π_j Γ(n_·j + 1) / Π_ij Γ(n_ij + 1).   (32)

The analysis presented above will be used, and continued, in section 3.3 on graphical models. The gamma functions have very large values, and expressions involving them should normally be evaluated as logarithms to prevent numeric overflow. For example, Matlab has a useful function gammaln that evaluates the logarithm of the gamma function directly.
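As an illustration, the Bayes factor (32) can be computed entirely in the log domain with Python's math.lgamma, the analogue of Matlab's gammaln. The function below and its name are ours, a sketch rather than code from the notes:

```python
import math

def log_bf_independence(n):
    """Log of the Bayes factor (32) in favor of independence,
    for a contingency table n given as a list of rows."""
    dA, dB = len(n), len(n[0])
    rows = [sum(r) for r in n]          # marginal counts n_i.
    cols = [sum(c) for c in zip(*n)]    # marginal counts n_.j
    tot = sum(rows)                     # total count n
    lg = math.lgamma
    val = (lg(tot + dA * dB) + lg(dA) + lg(dB)
           - lg(tot + dA) - lg(tot + dB) - lg(dA * dB))
    val += sum(lg(ni + 1) for ni in rows)
    val += sum(lg(nj + 1) for nj in cols)
    val -= sum(lg(nij + 1) for r in n for nij in r)
    return val
```

A perfectly balanced 2-by-2 table yields a positive log factor (independence favored), while a strongly diagonal table of the same size yields a large negative one (dependence favored).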


3.2.3 The multivariate normal distribution

The multivariate normal distribution is ubiquitous in statistical modelling and has an abundance of interesting properties. We will describe it very briefly and indicate some central analysis methods based on it. It is a distribution over points in a d-dimensional space, and one way to obtain a multivariate normal distribution is to multiply together a number of univariate normal distributions, one over each dimension x_i. Some of these may have variance zero, which effectively is a Dirac δ-function and constrains the density to a subspace (such distributions are called degenerate). This gives a distribution whose equidensity surfaces form ellipsoids in d-space, possibly but not necessarily with different lengths of the principal axes, which are parallel to the coordinate axes. In case several principal axes have the same length, it is not possible to tell their orientation; we only know that they are mutually orthogonal and span a subspace. This is also the most general form, with one important exception: any distribution obtained by rotating the frame of reference is also a multivariate normal distribution. So, e.g., the ball used in American football has only one well-defined principal direction, while the ball used in soccer has none. The analytic form of the density function is

p(x) = (2π)^{−d/2} |Σ|^{−1/2} exp(−(1/2)(x − µ)^T Σ^{−1} (x − µ)),

where x varies over R^d, µ is the mean and Σ is the variance (often called covariance) matrix, a symmetric positive definite d by d matrix. The inverse of the variance matrix is sometimes called the precision matrix. The eigenvectors of Σ are indeed the principal directions, and its eigenvector with the largest eigenvalue shows the direction of largest variation of the distribution.
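In two dimensions the density can be written out with the explicit inverse and determinant of Σ. The following Python sketch (a hypothetical helper, not from the notes) evaluates it:

```python
import math

def mvn_density_2d(x, mu, Sigma):
    """Density of a 2-dimensional multivariate normal with mean mu and
    variance matrix Sigma, using the closed-form 2x2 inverse."""
    (a, b), (c, d) = Sigma
    det = a * d - b * c                       # |Sigma|; positive definiteness => det > 0
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (x[0] - mu[0], x[1] - mu[1])
    # quadratic form (x - mu)^T Sigma^{-1} (x - mu)
    q = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    return 1.0 / (2 * math.pi) * det ** -0.5 * math.exp(-0.5 * q)
```

At the mean of a standard bivariate normal (Σ the identity) this gives 1/(2π), as the formula requires.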

Models based on the multivariate normal distribution may be oriented towards making inferences about the orientation of the axes, particularly in principal component and factor analysis, and regression, or towards expressing an empirical data set as a mixture of multivariate distributions. The latter can be used both to explain the structure of the data and as a non-parametric inference method for multivariate data, a generalization of the method described in section 3.1.5.

Exercise 11 The multivariate normal distribution has the following nice recursive characterization: a distribution over d-dimensional space is normal if and only if either d = 1 and it is the univariate normal distribution, or d > 1 and all marginals and all conditionals of it are multivariate normal distributions over (d − 1)-dimensional space. Show that this is true! Hint: a conditional of a multivariate distribution is obtained by conditioning on one of its variables. A marginal is obtained by integrating out one of its variables.

3.2.4 Dynamical systems and Bayesian time series analysis

The analysis of time series is one of the central tasks in statistics. We will here only describe a number of useful but maybe less known methods.

For the case of a series with known likelihoods connecting observations to system state and known system state dynamics, the task will normally be to predict the next or a future state of the system. The standard solution method is conveniently formulated in the Chapman-Kolmogorov equation (6), which can be solved with a Kalman filter for known linear dynamics and known Gaussian process noise. For non-linear, non-Gaussian systems, the sequential MCMC or particle filter method described in section 4.1.1 can be used. There are tricky cases where the state and observation spaces have non-uniform structure, such as in the multiple target tracking problem, where the task is to find the probable number of targets (and their states) from cluttered radar readings. This problem has an established solution in the FISST [42] method.

Bayesian methods can also be used to make inference not only about measurement and process noise, but also about the system state space dimension. For the latter, a basic theorem is Takens' [80], considering a system governed by a first-order differential equation and an observation process, contaminated by process and measurement noise χ(t) and ψ(t):

s′(t) = G(s(t)) + χ(t)   (33)
x(t) = F(s(t)) + ψ(t).

Here the state and observation spaces are assumed to be of finite dimension (but not finite!), although a major interest in the physical interpretation of (33) is that the finite-dimensional state space of physical systems is often embedded in an infinite-dimensional state space, as in some laminar flow problems. The theorem says that under suitable but general assumptions, the space of lag-vectors (x(t), x(t − T), . . . , x(t − (d − 1)T)) is diffeomorphic to the state space for d large enough, and its interest stems from the possibility of constructing an approximate state space from this principle, i.e., by estimating the lowest value of d + 1 that gives a degenerate set of lag-vectors obtained from experimental data. The exact formulation of the theorem has a number of regularity assumptions, and it is formulated in asymptotic form assuming no noise. However, it is in some cases possible to construct non-linear predictors for such systems with extremely good performance. The idea is to find suitable values for the dimension of the state space and the time step, collect a large number of trajectories, and find approximations of the functions F and G using numerical approximation methods. The main problems with finding such predictors are that the process and measurement noise may be too large for reliable identification, and that the system may be drifting so that the functions F and G change with time. An analysis of such methods used to solve challenge problems in time series prediction in the 1991 Santa Fe competition can be found in [39]. A more rigorous analysis, based on Bayesian analysis and where the functions F and G of (33) are represented as special types of neural networks, can be found in [85]. The functions F and G are represented parametrically as:

F(s) = B tanh(As + a) + b
G(s) = s + D tanh(Cs + c) + d,

where the number of columns of B and rows of A is the number of hidden cells in the observation function neural network, and the number of columns of D and rows of C is the number of hidden cells in the transition kernel ANN. The components of the matrices and vectors are the weights of the networks. The method puts independent Gaussian priors on all weights. In their Matlab implementation, Valpola and Karhunen do not use MCMC but a numerical iterative procedure they claim to be significantly faster than MCMC.
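The lag-vector (delay-embedding) construction behind Takens' theorem is simple to state in code. The sketch below (our own naming) builds the vectors (x(t), x(t − T), . . . , x(t − (d − 1)T)) from a sampled series:

```python
def lag_vectors(x, d, T=1):
    """Delay embedding: for each admissible index t, the lag vector
    (x[t], x[t - T], ..., x[t - (d - 1) T])."""
    start = (d - 1) * T
    return [tuple(x[t - k * T] for k in range(d)) for t in range(start, len(x))]
```

For example, lag_vectors([0, 1, 2, 3, 4], 3) gives [(2, 1, 0), (3, 2, 1), (4, 3, 2)]; an approximate state space is then sought among such vectors for increasing d.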

3.2.5 Discrete states and observations: The hidden Markov model

This special case occurs naturally in some applications, like speech analysis after some signal processing (the number of phonemes in speech is finite, around 27), and is intrinsic in bioinformatics analysis of genes and proteins.

Consider a finite sequence of observations (x_1, x_2, . . . , x_n). Assume that this sequence was obtained from an unknown Markov chain on a finite state space, (s_1, s_2, . . . , s_n), with state transitions and observations defined by discrete probability processes P(s_t|s_{t−1}) and P(x_t|s_t). This is the same setup as in equation (6). The sizes of the state space and sometimes the observation space are however often not known, but can be guessed, or subject to inference based on the retrodiction equation (7). Assume they are fixed to d_o and d_s, respectively. Now the unknowns are the state transition probabilities, a d_s by d_s matrix of d_s discrete probability distributions, and one d_o by d_s observation matrix containing one discrete observation distribution for each system state. If no particular application knowledge about the parameters is available, it is natural to assume uniform Dirichlet priors for the probability distributions, and the missing observations can either be assumed uniformly distributed or, if they are few, distributed as the non-missing observations.

An MCMC procedure for making inferences about the two (d_s × d_s and d_o × d_s) matrices goes as follows: the variables are the state vector (s_1, s_2, . . . , s_n) and the two matrices. We can also accommodate a number of missing observations as variables. The discrete distributions, the state vector and the missing values are initialized to essentially arbitrary values. In each step of the chain we propose a change in one of the 2d_s distributions, a state s_i, or one or more of the missing observations. By considering the change in probability (7) and the symmetry of the proposal distribution we compute the acceptance probability for the move, draw a standard uniform random number, and accept or reject the proposal depending on whether or not the random number is less than the acceptance probability.

The probability is the product of all transition and observation probabilities connecting a state with its successor state in S, or a state s_t with the corresponding observation. Only a few of these probabilities actually change under a proposal, and it is important to take advantage of this in the computation. Evaluation of the run must be made by examination of summaries of a trace of the computation. Typically, peaked posteriors are a sign that some real structure of the system has been identified, whereas diffuse posteriors leave plausible the possibility that there is no structure in the series (high noise level, or even only noise in relation to the length of the series).
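To make the bookkeeping concrete, the sketch below (illustrative Python with hypothetical names; for simplicity it recomputes the full product rather than only the changed factors) evaluates the log-probability of a state path with its observations, the quantity whose ratio drives the Metropolis-Hastings acceptance step:

```python
import math

def log_path_prob(states, obs, trans, emit):
    """Log-probability of a state sequence and its observations, given a
    state transition matrix `trans` and an observation matrix `emit`."""
    lp = 0.0
    for t in range(1, len(states)):
        lp += math.log(trans[states[t - 1]][states[t]])   # P(s_t | s_{t-1})
    for s, x in zip(states, obs):
        lp += math.log(emit[s][x])                        # P(x_t | s_t)
    return lp
```

The acceptance probability for a proposal changing one state s_t is min(1, exp(log_path_prob(new, ...) − log_path_prob(old, ...))); in an efficient implementation only the few factors involving s_t are recomputed, as noted above.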

3.3 Graphical Models

Given a data matrix, the first question that arises concerns the relationships between its variables (columns). Could some pairs of variables be considered independent, or do the data indicate that there is a connection between them - either directly causal, mediated through another variable, or introduced through sampling bias? These questions are analyzed using graphical models, directed


or decomposable [52]. As an example, in figure 12 M1 indicates a model where A and B are dependent, whereas they are independent in model M2. These are thus graphical representations of the models called M_D and M_I, respectively, in section 3.2.2. In figure 13, we describe a directed graphical model M4′′ indicating that variables A and B are independently determined, but the value of C will be dependent on the values of A and B. The similar decomposable model M4 indicates that the dependence of A and B is completely explained by the mediation of variable C. We could think of the data generation process as determining A, then C dependent on A and last B dependent on C, or equivalently, determining first C and then A dependent on C and B dependent on C.

Figure 12: Graphical models: dependence or independence?

Bayesian analysis of graphical models involves selecting all or some graphs on the variables, depending on prior information, and comparing their posterior probabilities with respect to the data matrix. A set of highest posterior probability models usually gives many clues to the data dependencies [51, 52], although one must - as always in statistics - constantly remember that dependencies are not necessarily causalities.

3.3.1 Causality and direction in graphical models

Normally, the identification of cause and effect must depend on one's understanding of the mechanisms that generated the data. There are several claims or semi-claims that purely computational statistical methods can identify causal relations among a set of variables. What is worth remembering is that these


Figure 13: Graphical models: conditional independence?

methods create suggestions, and that even the concept of cause is not unambiguously defined but a result of the way the external world is viewed. The claim that causes can be found is based on the observation that directionality can in some instances be identified in graphical models: consider the models M4′′ and M4′ of figure 13. In M4′, variables A and B could be expected to be marginally dependent, whereas in M4′′ they would be independent. On the other hand, conditional on the value of C, the opposite would hold: dependence between A and B in M4′′ and independence in M4′! This means that it is possible to identify the direction of arrows in some cases in directed graphical models. It is difficult to believe that the causal influence should not follow the direction of the arrows in those cases, and this opens the possibility, e.g., of scavenging through all kinds of databases recording important variables in society, looking for causes of unwanted things. Since causality immediately suggests the possibility of control by manipulation of the cause, this is both a promising and a dangerous idea. Certainly, this is a potentially useful idea, but it cannot be applied in isolation from the application expertise, as the following example illustrates. It is known as Simpson's paradox, although it is not paradoxical at all.

Consider the application of drug testing. We have a new wonder drug thatwe hope cures an important disease. We find a population of 800 subjects whohave got the disease; they are asked to participate in the trial and given a choicebetween the new drug and the alternative treatment currently assumed to be


best. Fortunately, half the subjects, 400, choose the new drug. Of these, 200 recover. Of those 400 who chose the traditional treatment, only 160 recovered. Since the test population seems large enough, we can conclude that the new drug causes recovery of 50% of patients, whereas the traditional treatment only cures 40%. Since this is known to be a nasty disease difficult to cure, particularly for women, this seems to be a cause for celebration until someone suddenly remembers that the recovery rate for men using traditional treatment is supposed to be better than the 50% shown by the trial. Maybe the drug is not advantageous for men? Fortunately, in this case it was easy to find the sex of each subject and to make separate judgments for men and women.

So when men and women are separated, we find the following table:

                 recovery   no recovery   Total   rec. rate
men
  treated           180         120        300       60%
  not treated        70          30        100       70%
women
  treated            20          80        100       20%
  not treated        90         210        300       30%
total
  treated           200         200        400       50%
  not treated       160         240        400       40%

Obviously, the recovery rate is lower for the new treatment, both for women and for men! Examining the table reveals the reason, which is not paradoxical at all: the disease is more severe for women, and the explanation for the apparent benefits of the new treatment is simply that it was tried by more men, for whom the disease is less severe. The sex influences both the severity of the disease and the willingness to test the new treatment; in other words, sex is a confounder. This situation can always occur in studies of complex systems like living humans and most biological, engineering or economic systems that are not entirely understood, and the confounder can be much more subtle than sex. A first remedy is to balance the design with respect to known confounders. But in these types of systems there are always unknown confounders, and the only safe way to test new treatments is to make a 'double blind' test, where the treatment is assigned randomly, and neither the subject nor the physician knows which alternative was chosen for a subject. Needless to say, this design is ethically questionable or at least debatable, and also one reason why drug testing is extremely expensive. For complex engineered systems, however, systematically disturbing the system is one of the most effective ways of understanding its dynamics, hindered not by ethical but sometimes by economic concerns.
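The arithmetic of the trial can be checked directly; the following sketch (the helper `rate` and the dictionary layout are our own, only the counts come from the table) reproduces the reversal:

```python
# Numerical check of the Simpson's paradox example: the new treatment
# looks better in aggregate but worse within each sex.

def rate(recovered, total):
    return recovered / total

# (recovered, total) counts from the trial, split by sex and treatment
counts = {
    ("men",   "treated"):     (180, 300),
    ("men",   "not treated"): (70,  100),
    ("women", "treated"):     (20,  100),
    ("women", "not treated"): (90,  300),
}

for sex in ("men", "women"):
    t = rate(*counts[(sex, "treated")])
    u = rate(*counts[(sex, "not treated")])
    print(f"{sex}: treated {t:.0%} vs not treated {u:.0%}")   # treated worse in both

# Aggregated over sex the direction flips, because more men chose the drug
agg_t = rate(180 + 20, 300 + 100)
agg_u = rate(70 + 90, 100 + 300)
print(f"aggregate: treated {agg_t:.0%} vs not treated {agg_u:.0%}")  # treated better!
```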

When we want to find the direction of causal links, the same effect can occur. In complex systems of nature, and even in commercial warehouse databases, it is not unlikely that we have not even measured the variable that will ultimately become the explanation of a causal effect. Such an unknown and unmeasured causal variable can easily turn the direction of causal influence indicated by the comparison between models M4'' and M4', at least unless the data is abundant. A pedagogical example of this effect is when a causal influence from a barometer reading to tomorrow's weather is found. The cause of tomorrow's weather is


however not the meter reading, but the air pressure it measures, and one can even claim that the air pressure is caused by weather in the vicinity that will be tomorrow's weather here. This is not the total nonsense it may appear to be, because it is probably more effective to control tomorrow's weather by controlling the weather in the vicinity than by controlling the local air pressure or by manipulating the barometer. In other cases this confounding can be much more subtle.

Nevertheless, the new theories of causality have attracted a lot of interest, and if applied with caution they should be quite useful[61, 41]. Their philosophical content is that a mechanism, causality, that could earlier not, or only with difficulty, be formalized has become available for analysis in observational data, whereas it could earlier only be accessed in controlled experiments.

3.3.2 Segmentation and clustering

A second question that arises concerns the relationships between rows (cases) in the data matrix. Are the cases built up from distinguishable classes, so that each class has its data generated from a simpler graphical model than that of the whole data set? In the simplest case these classes can be directly read off in the graphical model. In a data matrix where inter-variable dependencies are well explained by the model M4, if C is a categorical variable taking only few values, splitting the rows by the value of C could give a set of data matrices in each of which A and B might be independent. However, the interesting cases are those where the classes cannot be directly seen in a graphical model, because then the classes are not trivially derivable. If the data matrix of the example contained only variables A and B, because C was unavailable or unknown to interfere with A and B, the highest posterior probability graphical model might be one with a link from A to B. The classes would still be there, but since C would be latent or hidden, the classes would have to be derived from the A and B variables only. A different case of classification is where the values of one numerical variable are drawn from several normal distributions with different means and variances. The full column would fit very badly to any single normal distribution, but after classification, each class could have a set of values fitting well to a normal distribution. The problem of identifying classes is known as unsupervised classification. One comprehensive system for classification based on Bayesian methodology is described by Cheeseman and Stutz[18]. In figure 14 we illustrate the use of segmentation: The segment number can be regarded as a hidden variable, and if the segments are chosen to minimize within-segment dependencies, and this succeeds, then the data is given a much more explanatory graphical model.

Finally, it is possible that a data matrix with many categorical variables with many values gives a scattered matrix with very few cases compared to the number of potentially different cases. Aggregation is a technique by which a coarsening of the data matrix can yield better insight, such as replacing the age and sex variables by the categories kids, young men, adults and seniors in a car insurance application. The question of relevant aggregation is clearly related to the problems of classification. For ordinal variables, this line of inquiry leads naturally to the concept of decision trees, which can be thought of as a recursive splitting of the data matrix by the size of one of its ordinal variables.



Figure 14: Segmentation explains data: The variables themselves are highly dependent (left). If a segmentation into components with independent variables can be found, then the variable indicating segment number can be added to get simpler dependencies.

3.3.3 Graphical model choice - local analysis

We will analyze a number of models involving two or three variables of categorical type, as a preparation for the task of determining likely decomposable or directed graphical models. First, consider the case of two variables, A and B, where our task is to determine whether or not these variables are dependent. The analysis in section 3.2.2 shows how we can choose between models M1 and M2:

Let dA and dB be the number of possible values for A and B, respectively. It is natural to regard categorical data as produced by a discrete probability distribution, and then it is convenient to assume Dirichlet distributions for the parameters (probabilities of the possible outcomes) of the distribution.

We will find that this analysis is the key step in determining a full graphical model for the data matrix.

We could also consider a different model M1', where the A column is generated first and then the B column is generated for each value of A in turn. This corresponds to the directed model M1'. With uniform priors we get:

p(n|M_1') = \frac{\Gamma(n+1)\,\Gamma(d_A)\,\Gamma(d_B)^{d_A}}{\Gamma(n+d_A)} \prod_i \frac{\Gamma(n_{i.}+1)}{\Gamma(n_{i.}+d_B)}    (34)

Observe that we are not allowed to decide between the undirected model M1 and the directed model M1' based on equations (31) and (34). This is because these models define the same set of pdf:s involving A and B, the difference lying only in the structure of parameter space and parameter priors. They overlap on a set of prior probability one.

In the next model M2 we assume that the A and B columns are independent, each having its own discrete distribution. There are two different ways to specify prior information in this case. We can either consider the two columns separately, each being assumed to be generated by a discrete distribution with its own prior. Or we could follow the style of M1' above, with the difference that each A value has the same distribution of B-values. The first approach corresponds to the analysis in section 3.2.2, equation (32). From equations (31) and (32) we obtain the Bayes factor for the undirected data model:

\frac{p(n|M_2)}{p(n|M_1)} = \frac{\Gamma(n+d_A d_B)\,\Gamma(d_A)\,\Gamma(d_B)}{\Gamma(n+d_A)\,\Gamma(n+d_B)\,\Gamma(d_A d_B)} \cdot \frac{\prod_j \Gamma(n_{.j}+1)\,\prod_i \Gamma(n_{i.}+1)}{\prod_{ij} \Gamma(n_{ij}+1)}.    (35)
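As the notes point out later, expressions like (35) are best evaluated on the log scale, since the gamma function overflows quickly. A sketch of (35) in log form, assuming SciPy is available (the function name `log_bf_independence` is our own):

```python
# Log Bayes factor of equation (35): independence model M2 vs dependence
# model M1, for a dA x dB contingency table of counts.
import numpy as np
from scipy.special import gammaln

def log_bf_independence(n_ij):
    """log p(n|M2) - log p(n|M1); positive values favor independence."""
    n_ij = np.asarray(n_ij, dtype=float)
    dA, dB = n_ij.shape
    n = n_ij.sum()
    ni = n_ij.sum(axis=1)   # row sums n_i.
    nj = n_ij.sum(axis=0)   # column sums n_.j
    return (gammaln(n + dA * dB) + gammaln(dA) + gammaln(dB)
            - gammaln(n + dA) - gammaln(n + dB) - gammaln(dA * dB)
            + gammaln(nj + 1).sum() + gammaln(ni + 1).sum()
            - gammaln(n_ij + 1).sum())

# A strongly dependent table gives a negative log Bayes factor,
# a balanced independent-looking table a positive one.
print(log_bf_independence([[50, 0], [0, 50]]))
print(log_bf_independence([[25, 25], [25, 25]]))
```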

The second approach to model independence between A and B gives the following:

p(n|M_2') = \frac{\Gamma(n+1)\,\Gamma(d_A)}{\Gamma(n+d_A)} \int \Big( \prod_i \binom{n_{i.}}{n_{i1} \ldots n_{id_B}} \prod_j x_j^{n_{ij}} \Big) \Gamma(d_B)\,dx_B =

\frac{\Gamma(n+1)\,\Gamma(d_A)\,\Gamma(d_B)}{\Gamma(n+d_A)} \Big( \prod_i \binom{n_{i.}}{n_{i1} \ldots n_{id_B}} \Big) \int \prod_j x_j^{n_{.j}}\,dx_B =

\frac{\Gamma(n+1)\,\Gamma(d_A)\,\Gamma(d_B)}{\Gamma(n+d_A)\,\Gamma(n+d_B)} \cdot \frac{\prod_i \Gamma(n_{i.}+1)\,\prod_j \Gamma(n_{.j}+1)}{\prod_{ij} \Gamma(n_{ij}+1)}.    (36)

We can now find the Bayes factor relating models M1' (equation 34) and M2' (equation 36), with no prior preference for either:

\frac{p(M_2'|D)}{p(M_1'|D)} = \frac{p(n|M_2')}{p(n|M_1')} = \frac{\prod_j \Gamma(n_{.j}+1)\,\prod_i \Gamma(n_{i.}+d_B)}{\Gamma(d_B)^{d_A-1}\,\Gamma(n+d_B)\,\prod_{ij} \Gamma(n_{ij}+1)}    (37)

Consider now a data matrix with three variables, A, B and C (figure 13). The analysis of the model M3', where full dependencies are accepted, is very similar to M1 above (equation 31). For the model M4 without the link between A and B we should partition the data matrix by the value of C and multiply the probabilities of the blocks with the probability of the partitioning defined by C.

Since we are ultimately after the Bayes factors relating M4 and M3, respectively M4' and M3', we can simply multiply the Bayes factors relating M2 and M1 (equation 32), respectively M2' and M1' (equation 37), for each block of the partition to get the Bayes factors sought:


\frac{p(M_4|D)}{p(M_3|D)} = \frac{p(n|M_4)}{p(n|M_3)} = \frac{\Gamma(d_A)^{d_C}\,\Gamma(d_B)^{d_C}}{\Gamma(d_A d_B)^{d_C}} \prod_c \frac{\Gamma(n_{..c}+d_A d_B)\,\prod_j \Gamma(n_{.jc}+1)\,\prod_i \Gamma(n_{i.c}+1)}{\Gamma(n_{..c}+d_A)\,\Gamma(n_{..c}+d_B)\,\prod_{ij} \Gamma(n_{ijc}+1)}    (38)

and the directed case is similar[44]. The values of the gamma function are rather large even for moderate values of its argument. For this reason the formulas in this section are always evaluated in logarithmic form, where products like formula (38) translate to sums of logarithms.
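Since (38) is a product over the blocks defined by C of independence factors of the same shape as (35), the log-form evaluation can be organized per block. A sketch under that reading (function names are our own, SciPy assumed):

```python
# Equation (38) in log form: log p(n|M4) - log p(n|M3) for a
# dA x dB x dC array of counts n_ijc, computed block by block over C.
import numpy as np
from scipy.special import gammaln

def log_bf_block(n_ij):
    """Log of one c-block factor of (38) for the dA x dB slice n_ij."""
    n_ij = np.asarray(n_ij, dtype=float)
    dA, dB = n_ij.shape
    n = n_ij.sum()
    return (gammaln(n + dA * dB) - gammaln(n + dA) - gammaln(n + dB)
            + gammaln(n_ij.sum(axis=0) + 1).sum()
            + gammaln(n_ij.sum(axis=1) + 1).sum()
            - gammaln(n_ij + 1).sum())

def log_bf_conditional_independence(n_ijc):
    """log p(n|M4) - log p(n|M3); positive favors A independent of B given C."""
    n_ijc = np.asarray(n_ijc, dtype=float)
    dA, dB, dC = n_ijc.shape
    const = dC * (gammaln(dA) + gammaln(dB) - gammaln(dA * dB))
    return const + sum(log_bf_block(n_ijc[:, :, c]) for c in range(dC))
```

Products over c in (38) become the sum over `log_bf_block` values, exactly as the text recommends.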

Exercise 12 Derive a formula similar to (38) for the directed case, p(M4'|D)/p(M3'|D).

3.3.4 Graphical model choice - global analysis

Figure 15: Symptoms and causes relevant to heart problems (the graph connects the variables Mental Work, Physical Work, Lipoproteins, Smoking, Blood Pressure and Anamnesis).

If we have many variables, their interdependencies can be modeled as a graph with vertices corresponding to the variables. The example of figure 15 is from [51], and shows the dependencies in a data matrix related to heart disease. Of course, a graph of this kind can give a data probability to the data matrix in a way analogous to the calculations in the previous section, although the formulae become rather involved, and the number of possible graphs increases dramatically with the number of variables. It is completely infeasible to list and evaluate all graphs if there is more than a handful of variables. An interesting possibility to simplify the calculations would use some kind of separation, so that an edge in the model could be given a score independent of the inclusion or exclusion of most other potential edges. Indeed, the derivations of the last section show how this works. Let C in that example be a compound variable, obtained by merging columns {c1, . . . , cd}. If two models G and G' differ only by the presence and absence of the edge {A,B}, and if there is no path between A and B except through the vertex set C, then the expressions for p(n|M4) and p(n|M3) above will become factors of the expressions for p(n|G) and p(n|G'), respectively, and the other factors will be the same in the two expressions. Thus, the Bayes factor relating the probabilities of G and G' is the same as that relating M4 and M3. This result is independent of the choice of distributions and priors of the model, since the structure of the derivation follows the structure of the graph of the model - it is equally valid for Gaussian or other data models, as long as the parameters of the participating distributions are assumed independent in the prior assumptions.

We can now think of various 'greedy' methods for building high probability interaction graphs relating the variables (columns in the data matrix). It is convenient and customary to restrict attention to either decomposable (chordal) graphs or directed acyclic graphs. Chordal graphs are fundamental in many applications of describing relationships between variables (typically variables in systems of equations or inequalities). They can be characterized in many different but equivalent ways, see (Rose [69], Rose, Lueker and Tarjan [70]). One simple way is to consider a decomposable graph as consisting of the union of a number of maximal complete graphs (cliques, or maximally connected subgraphs), in such a way that (i) there is at least one vertex that appears in only one clique (a simplicial vertex), (ii) if an edge to a simplicial vertex is removed, another decomposable graph remains, and (iii) the graph without any edges is decomposable. A characteristic feature of a simplicial vertex is that its neighbors are completely connected by edges. This recursive definition can be reversed into a generation procedure: Given a decomposable graph G on the set of vertices, find two vertices s and n such that (i) s is simplicial, i.e., its neighbors are completely connected, and (ii) n is connected to all neighbors of s. Then the graph G' obtained by adding the edge between s and n to G is also decomposable. We will call such an edge a permissible edge of G. This procedure describes a generation structure (a directed acyclic graph whose vertices are decomposable graphs on the set of vertices) containing all decomposable graphs on the variable set. An interesting feature of this generation process is that it is easy to compute the Bayes factor comparing the posterior probabilities of the graphs G and G' as graphical models of the data: Let s correspond to A, n to B, and the compound variable obtained by fusing the neighbors of s to C in the analysis of section 3.3.3. Without explicit prior model probabilities we have:

\frac{p(G|D)}{p(G'|D)} = \frac{p(n|M_4)}{p(n|M_3)}.
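The permissible-edge condition is easy to test directly; a sketch with the undirected graph stored as a dict of neighbor sets (the representation and function names are our own, not from the notes):

```python
# Permissible-edge test: the edge {s, n} may be added to a decomposable
# graph if s is simplicial (its neighbors form a clique) and n is
# adjacent to every neighbor of s.

def is_simplicial(g, s):
    nbrs = list(g[s])
    return all(b in g[a] for i, a in enumerate(nbrs) for b in nbrs[i + 1:])

def permissible(g, s, n):
    return n not in g[s] and is_simplicial(g, s) and g[s] <= g[n]

# Triangle a-b-c plus a pendant vertex d attached to c
g = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(permissible(g, "d", "a"))   # True: d is simplicial, a is adjacent to c
print(permissible(g, "a", "d"))   # False: d is not adjacent to a's neighbor b
```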

A search for high probability graphs can now be organized as follows:

1. Start from the graph G0 without edges.

2. Repeat: among the permissible edges, find one that gives the highest Bayes factor, and add it if the factor is greater than 1. Keep a set of the highest probability graphs encountered.

3. Then repeat: for the high probability graphs found in the previous step, find simplicial edges whose removal increases the Bayes factor the most (or decreases it the least).

For each graph kept in this process, its Bayes factor relative to G0 can be found by multiplying the Bayes factors in the generation sequence. A procedure


similar to this one is reported by (Madigan and Raftery [52]), and its results on small variable sets were found to be good, in that it found the best graphs reported in other approaches. It must be noted, however, that we have now passed into the realm of approximate analysis, since we cannot know that we will find all high probability graphs. For directed graphical models, a similar method of obtaining high probability graphs is known as the K2 algorithm[6].

3.3.5 Categorical, ordinal and Gaussian variables

We now consider data matrices made up from ordinal and real valued data, and then matrices consisting of both ordinal, real and categorical data. The standard choice for a real valued data model is the univariate or multivariate Gaussian or normal distribution. It has nice theoretical properties manifesting themselves in such forms as the central limit theorem, the least squares method, principal components, etc. However, it must be noted that it is also unsatisfactory for many data sets occurring in practice, because of its narrow tail and because many real life distributions deviate terribly from it. Several approaches to solve this problem are available. One is to consider a variable as being obtained by mixing several normal distributions. This is a special case of the classification or segmentation problem discussed below. Another is to disregard the distribution over the real line, and consider the variable as just being made up of an ordered set of values. A quite useful and robust method is to discretize the variables. This is equivalent to assuming that their probability distribution functions are piecewise constant. Discretized variables can be treated as categorical variables, by the methods described above. The method wastes some information, but is quite simple and robust. Typically, the granularity of the discretization is chosen so that a reasonably large number of observations fall in each level. It is also possible to assign an observation to several levels, depending on how close it is to the intervals of adjacent levels. This is reminiscent of linguistic coding in fuzzy logic[91]. Discretization does however waste information in the data matrix. It is possible to formulate the theory of section 3.3.3 using inverse Wishart distributions as conjugate priors for multivariate normal distributions, but this leads to fairly complex formulas and is seldom implemented. Since most practically occurring distributions deviate a lot from normality, it is in practice necessary to model the distributions as mixtures of normal distributions, which leads to even higher conceptual complexity. A compromise between discretization and use of continuous distributions is analysis of the rankings of the variables occurring in data tables. When considering the association between a categorical and a continuous variable one would thus investigate the ranks of the continuous variable, which are independently and uniformly distributed over their range for every category if there is no association. Using a model where the ranks are non-uniformly distributed (e.g. with a linearly varying density), we can build the system of model comparisons of section 3.3.3. The difficulty is that the nuisance parameters cannot be analytically integrated out, so a numerical quadrature procedure must be used.
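As an illustration of the discretization approach, a minimal equal-frequency binning helper (our own construction, not from the notes): cut points are chosen at empirical quantiles so that roughly the same number of observations falls in each level.

```python
# Equal-frequency discretization: map real values to levels 0..levels-1
# so each level receives about the same number of observations.
import numpy as np

def discretize(x, levels):
    """Return integer level codes 0..levels-1 based on empirical quantiles."""
    x = np.asarray(x, dtype=float)
    cuts = np.quantile(x, np.linspace(0, 1, levels + 1)[1:-1])
    return np.searchsorted(cuts, x, side="right")

rng = np.random.default_rng(0)
x = rng.lognormal(size=1000)          # heavy-tailed, far from normal
z = discretize(x, 4)
print(np.bincount(z))                 # roughly 250 observations per level
```

The resulting codes can then be fed into the categorical machinery of section 3.3.3.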


3.4 Missing values and errors in data matrix

Data collected from experiments are seldom perfect. The problem of missing and erroneous data is a vast field in the statistics literature. First of all there is a possibility that 'missingness' of data values is significant for the analysis, in which case missingness should be modeled as an ordinary data value. Then the problem has been internalized, and the analysis can proceed as usual, with the important difference that the missing values are not available for analysis. A more sceptical approach was developed by Ramoni and Sebastiani[65], who consider an option to regard the missing values as adversaries (the conclusions on dependence would then be true no matter what the missing values are). A third possibility is that missingness is known to have nothing to do with the objectives of the analysis. For example, in a medical application, if data is missing because of the bad condition of the patient, missingness is significant if the investigation is concerned with patients. But if data is missing because of unavailability of equipment, it is probably not - unless maybe if the investigation is related to hospital quality. In Bayesian data analysis, the problem of missing or erroneous data creates significant complications, as we will see. As an example, consider the analysis of the two-column data matrix with binary categorical variables A and B, analyzed against models M1 and M2 of section 3.3.3. Suppose we obtained n00, n01, n10 and n11 cases with the values 00, 01, etc. We then have a posterior Dirichlet distribution with parameters nij for the probabilities of the four possible cases. If we now receive a case where both A and B are unknown, it is reasonable that this case is altogether ignored. But what shall we do if a case arrives where A is known, say 0, but B is unknown? One possibility is to waste the entire case, but this is not orthodox Bayesian, since we are not making use of information we have. Another possibility is to use the current posterior to estimate a pdf for the missing value; in our case the probability that B has value 0 is p0 = n00/n0.. So our posterior is now either a Dirichlet with parameters n00 + 1, n01, n10 and n11 (probability p0), or one with parameters n00, n01 + 1, n10 and n11 (probability 1 - p0). But this means that the posterior is now a weighted average of two Dirichlet distributions; in other terms, it is not a Dirichlet distribution at all! As the number of missing values increases, the number of terms in the posterior will increase exponentially, and the whole advantage with conjugate distributions will be lost. So wasting the whole case seems to be a reasonable option unless we find a more clever way to proceed.
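The exponential blow-up of the mixture posterior can be made concrete. In the sketch below (all names are our own), each case with A observed but B missing splits every Dirichlet component in two, weighted by that component's predictive probability for B:

```python
# Posterior for the 2x2 table as a weighted mixture of Dirichlet
# components; each component is (weight, dict of cell counts).

def update_complete(components, cell):
    """A fully observed case just bumps one count in every component."""
    out = []
    for w, alpha in components:
        a = dict(alpha)
        a[cell] += 1
        out.append((w, a))
    return out

def update_missing_b(components, a_value):
    """A case with A observed and B missing doubles the component count."""
    out = []
    for w, alpha in components:
        for b in (0, 1):
            total = alpha[(a_value, 0)] + alpha[(a_value, 1)]
            p_b = alpha[(a_value, b)] / total   # predictive prob. of B = b
            a = dict(alpha)
            a[(a_value, b)] += 1
            out.append((w * p_b, a))
    return out

# One component with unit counts, then one complete and three incomplete cases
post = [(1.0, {(0, 0): 1, (0, 1): 1, (1, 0): 1, (1, 1): 1})]
post = update_complete(post, (0, 0))
for _ in range(3):                    # three cases with A = 0, B missing
    post = update_missing_b(post, 0)
print(len(post))                      # 2^3 = 8 components
```

The weights still sum to one, but the number of components grows as 2^k with k incomplete cases, which is exactly why conjugacy is lost.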

Assuming that data is missing at random, it is relatively easy to get an adequate analysis. It is not necessary to waste entire cases just because they have a missing item. Most of the analyses made refer only to a small number of columns, and these columns can be compared for all cases that have no missing data in these particular columns. In this way it is, e.g., possible to make a graphical model for a data set even if every case has some missing item, since all computations of section 3.3.3 refer to a small number of columns. In this situation it is even possible to impute the missing values, because the graphical model obtained shows which variables influence the missing one most. So every missing value for a variable can be guessed by predicting it from the values of the case for neighbors of the variable in the graph of the model. When this is done, one must always remember that the value is guessed. It can thus never be used to create a formal significance measure - that would be equivalent to using the


same data twice, which is not permitted in formal inference.

The method of imputing missing values has a nice formalization in the Expectation Maximization (EM) method: This method is used to create values for missing data items by using a parameterized statistical model of the data. In the first step, the non-missing data is used to create an approximation of the parameters. Then the missing data values are defined (given imputed values) to give highest probability to the imputed data matrix. We can then refine the parameter estimates by maximization of probability over parameters with the now imputed data, then over the missing data, etc., until convergence obtains. This method is recommended for use in many situations despite the fact that it is not strictly Bayesian and it violates the principle of not creating significance from guessed (imputed) values. It is important to verify that data are really missing at random, otherwise the distribution of missing data values cannot be inferred from the distribution of non-missing data. A spot check that is easy to perform is to code a variable with many missing items as a 0/1 indicator column of missingness, and check its association to other columns using equation (32). The most spectacular use of the EM algorithm is for automatic (unsupervised) classification in the AUTOCLASS model (see section 3.5).

Imputation of missing values can also be performed with the MCMC method, by having one variable for each missing value. The trace will give a picture of their joint posterior distribution, and this has the advantage of not creating more significance than is justified by the model and the data. The method is strongly recommended over simple imputation by Gelman.

The related case of errors in data is more difficult to treat. How do we describe data where there are known uncertainties in the recording procedure? This is a problem worked on for centuries when it comes to real valued quantities as measured in physics and astronomy, and is one of the main features of interpretation of physics experiments. When it comes to categorical data there is less help in the literature - an obvious alternative is to relate recorded vs actual values of discrete variables as a probability distribution, or - which is fairly expedient in our approach - as an equivalent sample.

3.4.1 Decision trees

Decision trees are typically used when we want to predict one variable - the class variable - from other - explanatory - variables in a case, and we have a data matrix of known cases. When modeling data with decision trees, we are usually trying to segment the data set into ranges - n-dimensional boxes, of which some are unbounded - such that a particular variable - the class variable - is fairly constant over each box. If the class variable is truly constant in each box, we have a tree that is consistent with respect to the data. This means that for new cases, where the class variable is not directly available, it can be well predicted by the box into which the case falls. The method is suitable where the variables used for prediction are of any kind (categorical, ordinal or numerical) and where the predicted variable is categorical or ordinal with a small domain. There are several efficient ways to heuristically build good decision trees, and it is a central technique in the field of machine learning[56]. Practical experience has given many cases where the predictive performance of decision trees is good, but also many counter-intuitive phenomena have been uncovered by practical experiments. Recently, several treatments of decision trees have been published


where it is discussed whether or not the smallest possible tree consistent with all cases is the best one. This turned out not to be the case, and the argument that a smallest decision tree should be preferred because of some kind of Occam's razor argument is apparently not valid, neither in theory nor in practice[87, 12]. The explanation is that the data one wants to classify has usually not been generated in such a way that the smallness of the tree gives a good indication of generalizing power. An example is shown in figure 16. The distribution of the data is well approximated by two multivariate normal distributions, but since these lie diagonally, the ortholinear separating planes of the decision tree produced from a moderate size sample fit badly to the distribution and depend a lot on the sample. In this example, decision trees with general linear boundaries perform much better, and in fact the optimal separator of the two distributions is close to the diagonal line that separates the point sets with the largest possible margin (section 2.5).


Figure 16: Decision tree with bad generalization power

The Bayesian approach gives the right information on the credibility and generalizing power of a decision tree. It is explained in recent papers by (Chipman, George and McCulloch[19]) and by (Paass and Kindermann[59]). A decision tree statistical model is one where a number of boxes are defined on one set of variables by recursive splitting of one box into two, by splitting the range of one designated variable into two. Data are assumed to be generated by a discrete distribution over the boxes, and for each box it is assumed that the class variable value is generated by another discrete distribution. Both these distributions are given uninformative Dirichlet prior distributions, and thus the posterior probability of a decision tree can be computed from data. Since larger trees have more parameters, there is an automatic penalization of large trees, but the distribution of cases into boxes also enters the picture, so it is not clear that the smallest tree giving perfect classification will be preferred, or even that


a consistent tree will be preferred over an inconsistent one. The decision trees we describe here do not give a clear cut decision on the value of the decision variable for a case, but a probability distribution over values. If the probability distribution is not peaked at a specific class value, then this indicates that possibly more data must be collected before a decision can be made. Also, since the name of this data model indicates its use for decision making, one can get better trees for an application by including information about the utility of the decision in the form of a loss function, and by comparing trees based on the expected utility rather than model probability.

For a decision tree T with d boxes, data with c classes, where the number of cases in box i with class value k is nik, and n = n.., we have, with uniform priors on both the assignment of case to box and of class within box,

p(D|T) = \frac{\Gamma(n+1)\,\Gamma(d)}{\Gamma(n+d)} \prod_i \frac{\Gamma(n_{i.}+1)\,\Gamma(c)}{\Gamma(n_{i.}+c)}

However, in order to compare two trees T and T', we would have to form the set of intersection boxes and ask about the probability of finding the data with a common parameter over the boxes belonging to a common box of T, relative to the probability of the data when the parameters are common in boxes of T'. For the case where T and T' only differ by the splitting of one box i into i' and i'', the calculation is easy (ni''j + ni'j = nij):

\frac{p(D|T')}{p(D|T)} = \frac{\Gamma(n_{i.}+c)}{\Gamma(n_{i'.}+c)\,\Gamma(n_{i''.}+c)} \prod_j \frac{\Gamma(n_{i'j}+1)\,\Gamma(n_{i''j}+1)}{\Gamma(n_{ij}+1)}    (39)

A reasonably well generalizing decision tree can now be obtained top-down recursively by selecting the variable and decision boundary that maximize (39), until the maximum is less than 1, where we make a terminal node deciding by the majority of the labels of the training cases falling in the box. This is a slight generalization of the 'change-point' problem described in section 3.1.3.
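Equation (39) again calls for log-scale evaluation. A sketch of the split score used in the top-down procedure (function name ours, SciPy assumed): a split that separates the classes is rewarded, an uninformative split is penalized.

```python
# Log Bayes factor of equation (39) for splitting one box: n1 and n2 are
# the per-class count vectors in the two halves of the proposed split.
import numpy as np
from scipy.special import gammaln

def log_split_bf(n1, n2):
    """log p(D|T')/p(D|T) when a box with counts n1 + n2 is split in two."""
    n1 = np.asarray(n1, dtype=float)
    n2 = np.asarray(n2, dtype=float)
    c = len(n1)                        # number of classes
    ni = n1 + n2                       # counts of the unsplit box
    return (gammaln(ni.sum() + c) - gammaln(n1.sum() + c) - gammaln(n2.sum() + c)
            + (gammaln(n1 + 1) + gammaln(n2 + 1) - gammaln(ni + 1)).sum())

# A split that separates two classes perfectly is favored (positive) ...
print(log_split_bf([30, 0], [0, 30]))
# ... while a split that leaves the class mix unchanged is penalized.
print(log_split_bf([15, 15], [15, 15]))
```

Splitting stops exactly when no candidate split achieves a positive log score.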

3.5 Segmentation - Latent variables

Segmentation and latent variable analysis is directed at describing the data set as a collection of subsets, each having simpler descriptions than the full data matrix. The related technique of cluster analysis, although not described here, can also be given a Bayesian interpretation as an effort to describe a data set as a collection of clusters with small variation around the center of each cluster. Suppose data set D is partitioned into d_c classes {D^(i)}, and each of these has a high posterior probability p(D^(i)|M_i) wrt some model set {M_i}. Then we think that the classification is a good model for the data. However, some problems remain to consider. First, what is it that we compare the classification against, and second, how do we accomplish the partitioning of the data cases into classes? The first question is the simplest to answer: we compare a classification model against some other model, based on classification or not. The second is trickier, since the introduction of this section is somewhat misleading. The prior information for a model based on classification must have some information about classes, but it does not have an explicit division of the data into classes available. Indeed, if we were allowed to make this division into classes on our own, seeking the highest posterior class model probabilities, we would

55

Page 56:  · Statistical Methods in Applied Computer Science Lecture Notes Preliminary for course 2D5342, Data Mining, Jan-April 2006. Stefan Arnborg ∗ KTH Nada February 23, 2006 Abstract

probably over-fit by using the same data twice - once for class assignment and once for posterior model probability computation. The statistical model generating segmented data could be the following: a case is first assigned to a class by a discrete distribution obtained from a suitable uninformative Dirichlet distribution, and then its visible attributes are assigned by a class-dependent distribution. This model can be used to compute a probability of the data matrix, and then, via Bayes rule, a Bayes factor relating the model with another one, e.g., one without classes or with a different number of classes. One can also have a variable number of classes and evaluate by finding the posterior distribution of the number of classes. The data probability is obtained by integrating, over the Dirichlet distribution, the sum over all assignments of cases to classes, of the assignment probability times the product of all resulting case probabilities according to the respective class model. Needless to say, this integration is feasible only for a handful of cases where the data is too meager to permit any kind of significant conclusion on the number of classes and their distributions. The most well-known procedures for automatic classification are built on expectation maximization. With this technique, a set of class parameters is refined by assigning cases to classes probabilistically, with the probability of each case membership determined by the likelihood vector for it under the current class parameters [18]. After this likelihood computation, a number of cases are moved to new classes to which they belong with high likelihood. This procedure converges to a local maximum, where each case belongs to the class under which it has the highest likelihood. But there are many local maxima, and the procedure must be repeated a number of times with different starting configurations.
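The expectation-maximization loop just described can be sketched for the simple case of a mixture of classes with independent binary attributes. This is an illustrative sketch of ours, not the exact procedure of [18]; the function name and the Beta(1,1) smoothing are our own choices:

```python
import numpy as np

def em_binary_mixture(X, C, iters=50, seed=0):
    """EM for a mixture of C classes over independent binary attributes.

    X: (n, K) 0/1 data matrix.  Returns class weights pi, per-class
    attribute probabilities theta, and the (n, C) responsibility matrix."""
    rng = np.random.default_rng(seed)
    n, K = X.shape
    pi = np.full(C, 1.0 / C)
    theta = rng.uniform(0.3, 0.7, size=(C, K))      # random starting point
    for _ in range(iters):
        # E-step: log-likelihood of each case under each class
        logp = (np.log(pi)[None, :]
                + X @ np.log(theta).T
                + (1 - X) @ np.log(1 - theta).T)
        logp -= logp.max(axis=1, keepdims=True)
        R = np.exp(logp)
        R /= R.sum(axis=1, keepdims=True)           # responsibilities
        # M-step: re-estimate parameters from the soft assignments
        pi = R.mean(axis=0)
        theta = (R.T @ X + 1.0) / (R.sum(axis=0)[:, None] + 2.0)  # Beta(1,1) smoothing
    return pi, theta, R
```

Because of the many local maxima mentioned above, such a routine would in practice be restarted with several different seeds and the best-scoring run kept.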

We can also solve the problem with the MCMC approach [66]. The MCMC approach to classification is the following: assume that we have a data matrix and want a classification of its cases which makes the attributes independent. Define a class assignment randomly, and compute the probability of data, given the model with independent attributes, as in (32), which is easy to generalize to more attributes. The MCMC will now implement a move function, proposing a changed class for some case. The move is accepted if the posterior probability increases, and otherwise with a probability given by the ratio of new to old data probability (see section 4). This procedure is reasonably efficient, since it is possible to evaluate the class probabilities incrementally, by keeping just the current contingency table for each class and updating it incrementally. Since absolute probabilities are kept updated, we also avoid a common complication in MCMC applications arising when the dimension of the parameter space changes. Although this complication can sometimes be avoided, it is not always so; the reversible jump process was designed to cope with this phenomenon [15].

4 Approximate analysis with Metropolis-Hastings simulation

We have seen several simple examples of the Markov Chain Monte Carlo approach, and we will now explicate in detail their usages and mathematical justifications – as well as their limitations.

The basic problem solved by MCMC methods is sampling from a multivariate distribution over many variables. The distribution can be given generically as


p(x, y, z, . . . , w). If some variable, e.g., y, represents measured signals, then the actual values measured, say a, can be substituted, and sampling will be from a conditional distribution p(x, a, z, . . . , w). If some other variable, say x, represents the measured quantity, then sampling and selecting the x variable will give samples from the posterior of x given the measurements. In other words, we will get a best possible estimation of the quantity given the measurements and the statistical model of the measurement process.

The two basic methods for MCMC computation are the Gibbs sampler and the Metropolis-Hastings algorithm. Both generate a Markov chain with states over the domain of a multivariate target distribution and with the target distribution as its unique limit distribution. Both exist in several more or less refined versions. The Metropolis-Hastings algorithm has the advantages that it does not require sampling from the conditional distributions of the target distribution but only finding the quotient of the distribution at two arbitrary given points, and that the proposal distribution can be chosen from a set with better convergence properties. A thorough introduction is given by Neal [57]. To sum up, MCMC methods can be used to estimate distributions that are not tractable analytically or numerically. We get real estimates of posterior distributions and not just approximate maxima of functions. On the negative side, the Markov chains generated have high autocorrelation, so a sample over a sequence of steps can give a highly misleading empirical distribution, much narrower than the real posterior. Although significant advances have been made in the area of convergence assessment and choice of samples, this problem is not yet completely solved.

The Metropolis-Hastings sampling method is, as mentioned before, organized as follows: given the pdf p(x) of a state variable x over a state space X, and an essentially arbitrary symmetric proposal function q(x, x′), a sequence of states is created in a Markov chain. In state x, draw x′ according to the proposal function q. If p(x′)/p(x) > 1, let x′ be the new state; otherwise, let x′ be the new state with probability p(x′)/p(x), and keep state x otherwise. We will in the next subsection verify that p(x) is a stable distribution of the chain, and from general Markov chain theory there are several conditions that ensure that there is only one limiting distribution and that it will always be reached asymptotically. It is much more difficult to say when we have a provably good sample, and in practice all the difficulties of hill-climbing optimization methods must be dealt with in order to assess convergence. Recent developments aim at finding MCMC procedures which can detect when an unbiased sample has been obtained, using various variations of 'coupling from the past' [62]. These methods are however still complex and difficult to apply to the classification and graphical model problems.

Nevertheless, there have been great successes with MCMC methods for those cases of Bayesian analysis where closed form solutions do not exist. With various adaptations of the method, it is possible to express multivariate data as a mixture of multivariate distributions and to find the posterior distribution of the number of classes and their parameters [30, 31, 33, 66].

4.1 Why MCMC and why does it work?

The main practical reason for using MCMC is that we often have an unnormalized probability distribution over a sometimes complex space, and we want a sample of it to get a feel for it, or we might want to estimate the average of a


function on the space under the distribution. This section is a summary of [10, Ch 6], where a more complete account can be found. A typical way one obtains a distribution which is not normalized is by using (3), where the product on the right side is not a normalized distribution and is often impossible to normalize using analytical methods (even in the case where we have analytical expressions for the prior and the likelihood). So assume we have a distribution π(x) = cf(x), with unknown normalization constant c, over a space X where we can evaluate f in every point of X (we will think of cf as a probability density function - a thorough measure-theoretic presentation is also possible, but this simplified view is appropriate in most applications). Given a set of N independent and identically distributed examples x_i with pdf π(x), we can estimate the expected value of any function v(x) over π by taking the average

    f_N = (1/N) ∑_i v(x_i).    (40)

This technique is known as Monte Carlo estimation. Assuming that X and π are reasonably well-behaved, like a Euclidean n-dimensional space, the expected value is I = ∫_X π(x)v(x)dx, and the estimate f_N converges to I with probability 1 ('almost surely'). Moreover, if the variance of v can be bounded, the central limit theorem can be used to give a convergence estimate, namely that √N(f_N − I) converges in distribution to the normal distribution with zero mean and the same variance as v.

It is not easy in general to generate an iid sample for f where its variation

is large and irregular over X. If a bound M with f(x) ≤ Mg(x) is known, we can use rejection sampling. In rejection sampling we generate the samples according to some distribution g(x), where the support of g covers the support of f (in other words, g can generate any point with non-zero probability according to cf), and keep a subset of them (i.e., we reject some of them). Specifically, if a sample x′ was generated, we keep it with probability f(x′)/(Mg(x′)). The set of kept samples will be distributed as cf(x). Obviously, this method is practical only when the bound M is reasonably tight and when g(x) is a reasonable approximation of cf(x); if this is not true the rejection probability can be very close to 1 most of the time.
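As a concrete sketch (a toy example of our own, not from the text): take the unnormalized density f(x) = x²(1−x) on [0, 1] (a Beta(3, 2) kernel, whose normalized mean is 3/5), the uniform proposal g(x) = 1, and the envelope M = 0.15, which exceeds the maximum 4/27 of f:

```python
import random

random.seed(0)

def rejection_sample(n_proposals):
    """Rejection sampling from f(x) = x^2 (1 - x) on [0, 1] (unnormalized
    Beta(3, 2)) with uniform proposal g = 1 and envelope M = 0.15."""
    f = lambda x: x * x * (1.0 - x)
    M = 0.15                             # f attains its maximum 4/27 at x = 2/3
    kept = []
    for _ in range(n_proposals):
        x = random.random()              # draw x' from g = Uniform(0, 1)
        if random.random() < f(x) / M:   # keep with probability f(x') / (M g(x'))
            kept.append(x)
    return kept

sample = rejection_sample(20000)
mean = sum(sample) / len(sample)         # close to the Beta(3, 2) mean 0.6
```

About 55% of the proposals are kept here; with a poor envelope or proposal, as noted above, almost everything would be rejected.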

Another technique to obtain a sample from an irregular distribution is importance sampling. Here we do not get a simple sample but a weighted sample, i.e., each point is equipped with a weight and represented by a pair (x_i, w_i). If we can produce a sample of the (preferably close to cf) distribution g, then setting w_i = f(x_i)/g(x_i) and normalizing the weights so that they sum to one produces a weighted sample that corresponds to the distribution cf. The weighted sample can be used directly for estimation, changing (40) to f_N = ∑_i v(x_i)w_i. The weighted sample can also be changed to an unweighted sample by resampling. An unweighted sample of M elements is obtained by drawing, M times, an integer j between 1 and N according to the probability distribution w, and selecting the corresponding sequence of x_j. Obviously, some information about f is filtered away in the sampling and resampling steps, and the accuracy is less than with iid sampling according to cf. Typically, M should be at least 10N for obtaining reliable results.

The MCMC method produces a sequence (xi) distributed asymptotically ascf , where c is unknown. It is based on a combination of the above sampling


ideas, and it produces a sample in the form of a sequence of non-independent points with stationary (i.e., limiting) distribution π(x) = cf. We can use equation (40) for this sample, but only if it is long enough to remove the effect of local autocorrelation of the x_i.

Basically, a Markov chain is represented with a transition kernel K(x_{t−1}, x_t) giving the conditional distributions p(x_t|x_{t−1}). We want to design the transition kernel so that the sequences it produces have a stationary distribution π(x) = cf(x), where we can evaluate f but not c (and thus not π either). In other words, f(x) = ∫_X K(x_{t−1}, x)f(x_{t−1})dx_{t−1}. We will achieve this by designing the chain to be π-reversible, i.e., K(x, y)f(x) = K(y, x)f(y). Reversibility is also called the detailed balance condition, a term which better explains why reversibility leads to stationarity. Namely, if we have two disjoint subsets A and B, the integral ∫_A ∫_B K(x, y)cf(x)dxdy is the probability of a sample element in B being followed by one in A, and this is by the detailed balance condition equal to the probability that an element in A is followed by one in B. So if the chain is ergodic, i.e., its ensemble distribution is the same as its path distribution, then cf is a stationary distribution of the chain. A sufficient condition for this is that the chain can reach every region in the support of f, i.e., every region with non-zero probability can be reached from every point with non-zero probability. We will not go into the detailed conditions and proofs of convergence of Markov chains; some more technical detail is given in [10] and a lot in [81, 82].

For simple (finite or constant-dimension) spaces X there is a simple way to construct a Markov chain that is both irreducible and satisfies the detailed balance condition, provided a relative (i.e., unnormalized, like the function f above) density of the target distribution can be evaluated at any point; such a chain thus has the required stationary distribution. The main problem, which does not yet have a simple solution, is to design the chain in such a way that it has reasonably low autocorrelation. Such a chain is said to mix rapidly.

Once state space and target distribution have been defined, the only design choice is the proposal distribution q(x′|x), a distribution for a proposed new state x′ dependent on the current state x. We specify q by giving a procedure to generate a new proposal x′ given current state x (using a random number generator), and another to evaluate the density given x and x′. Using the latter procedure and the unnormalized target density f, which can also be computed for any state, we can compute an acceptance probability for a proposed state that depends on the current state:

    α(x, z) = min(1, f(z)q(x|z) / (f(x)q(z|x)))    (41)

A trace (x_0, . . .) is now produced using the following algorithm:

    Start with any x_0 in the support of f
    for i = 1, . . . do
        generate z by the distribution q(z|x_i)
        with probability α(x_i, z), set x_{i+1} = z,
        otherwise set x_{i+1} = x_i

There are several ways to prove that this gives a chain with stable distribution cf, despite the fact that we do not know the value of c. The only really comprehensible argument can be found, e.g., in [10] and goes as follows:

First we need the useful consequence of (41) that


    f(x_t)q(x_{t+1}|x_t)α(x_t, x_{t+1}) = f(x_{t+1})q(x_t|x_{t+1})α(x_{t+1}, x_t).    (42)

Equation (42) is proved by noting that exactly one of α(x_{t+1}, x_t) and α(x_t, x_{t+1}) is less than one (unless both are equal to one). In each case, a simple algebraic expression without the minimum operator results from (41), which simplifies to (42) in both cases (as well as in the case where both alphas are one).

We next show that the resulting chain is π-reversible, where π(x) = cf(x). The distribution of x_{t+1} given x_t is

    p(x_{t+1}|x_t) = q(x_{t+1}|x_t)α(x_t, x_{t+1}) + δ(x_{t+1} − x_t) ∫ q(z|x_t)(1 − α(x_t, z))dz.

The first term above comes from the case where a value for x_{t+1} is proposed and accepted; the second term (where δ is Dirac's delta function) has contributions from all possible proposal values z, weighted by the probability that the value is generated and also rejected. Multiplying by π(x_t), using (42) on the first term and the symmetry of δ (which forces x_t = x_{t+1}) on the second, gives

    π(x_t)p(x_{t+1}|x_t) = π(x_t)q(x_{t+1}|x_t)α(x_t, x_{t+1}) + π(x_t)δ(x_{t+1} − x_t)(1 − ∫ q(z|x_t)α(x_t, z)dz)
        = π(x_{t+1})q(x_t|x_{t+1})α(x_{t+1}, x_t) + π(x_{t+1})δ(x_t − x_{t+1})(1 − ∫ q(z|x_{t+1})α(x_{t+1}, z)dz)
        = π(x_{t+1})p(x_t|x_{t+1}),

the detailed balance equation, which together with aperiodicity and irreducibility means that the chain has π = cf as its limiting distribution.
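The algorithm above can be sketched on a toy target of our own: the unnormalized density f(x) = exp(−x²/2) of a standard normal, with a symmetric Gaussian random-walk proposal, for which the q-ratio in (41) cancels:

```python
import math
import random

random.seed(0)

def metropolis_hastings(f, x0, steps, scale=1.0):
    """Metropolis-Hastings with a symmetric Gaussian random-walk proposal.
    f is the unnormalized target density; returns the trace (x_0, x_1, ...)."""
    xs = [x0]
    x = x0
    for _ in range(steps):
        z = x + random.gauss(0.0, scale)      # propose z from q(z|x)
        # symmetric proposal, so alpha = min(1, f(z)/f(x))
        if random.random() < min(1.0, f(z) / f(x)):
            x = z                             # accept the proposal
        xs.append(x)                          # on rejection, x is repeated
    return xs

f = lambda x: math.exp(-0.5 * x * x)          # unnormalized N(0, 1)
trace = metropolis_hastings(f, x0=0.0, steps=50000)
sample = trace[5000:]                         # discard a burn-in prefix
```

The retained sample has approximately zero mean and unit variance, but, as discussed above, it is autocorrelated: the trace must be long relative to its correlation length before (40) is trustworthy.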

Exercise 13 We believe that a sequence of real numbers, (y_i)_{i=1}^N, was generated from the mixture of two normal distributions, f(y) = ∑_{i=1}^{2} λ_i N(µ_i, σ_i), where λ_1 + λ_2 = 1.

Design an MCMC procedure to make inference about the parameters of the normal distributions and the source (1 or 2) of each y_i. Check out what happens when the sample is too small for reliable inference.

4.1.1 The particle filter

The particle filter is a computation scheme adapting MCMC to the recursive and dynamic inference problems (5, 6). It is described concisely in [10, Ch 6] and a state-of-the-art survey is given in [32]. In the first case (5), we try to estimate a static quantity and expect to get better estimates as more data flow in. In the second case (6), we try to track a moving target. The method maintains a sample that is updated for each time step and can be visualized as a moving swarm of particles. The sample represents our information about the posterior. The particle filter can handle highly non-Gaussian measurement and process noise, and the likelihoods can be chosen in very general ways, the only essential restrictions being that relative likelihoods can be computed reasonably efficiently and that process states can be simulated stochastically from one time step to the next. It is not necessary to have constant time steps, so jittered


observations can be easily handled provided the measurement times are known. It is not certain that the method performs better than sophisticated non-linear numerical methods; the attractiveness of the method stems more from ease of implementation and flexibility.

In the steady state our inference of the state at time t is represented by a set of particles (x_i^(t))_{i=1}^N. Each particle is brought to time t + 1 by drawing state u_i^(t+1) using the process kernel f(u_i^(t+1)|x_i^(t)). In practice, we produce many (say 10) u_i^(t+1) from each x_i^(t), so that the resampling will produce a better result. Now the data d_{t+1} are considered and used to weight the u_i^(t+1) particles: w_i^(t+1) = f_{t+1}(d_{t+1}|u_i^(t+1)). In order to get back to the situation at time t + 1, we now use the weights to draw N indices (i_1, . . . , i_N) from the probability distribution defined by the (normalized) weights. Finally, the set of particles defining our belief of the system state at time t + 1 is the set (u_{i_j}^(t+1))_{j=1}^N.

4.2 Basic classification with categorical attributes

For data matrices with categorical attributes, the categorical model of Cheeseman and Stutz is appropriate. We assume that rows are generated as a finite mixture with a uniform Dirichlet prior for the component (class) probabilities, and each mixture component has its rows generated independently, according to a discrete distribution, also with a uniform Dirichlet prior, for each column and class. Assume that the number of classes C is given, the number of rows is n and the number of columns is K, and that there are d_k different values that can occur in column k. For a given classification, the data probability can be computed; let n_i^(c,k) be the number of occurrences of value i in column k of rows belonging to class c. Let x_i^(c,k) be the probability of class c having the value i in column k. Let n_i^(c) be the number of occurrences in class c of the row i, and n^(c) the number of rows of class c. By (17) the probability of the class assignment depends only on the number of classes and the table size, Γ(n+1)Γ(C)/Γ(n+C). The probability of the data in class c is, if i = (i_1, . . . , i_K):

    ∫ ∏_{k,i} (x_{i_k}^(c,k))^{n_i^(c)} dx = ∏_{k=1}^{K} ∫ ∏_{i=1}^{d_k} (x_i^(c,k))^{n_i^(c,k)} dx^(c,k)

The right side integral can be evaluated using the normalization constant of the Dirichlet distribution, giving the total data probability of a classification:

    [Γ(n+1)Γ(C)/Γ(n+C)] ∏_{c,k} [∏_i Γ(n_i^(c,k) + 1) / Γ(n^(c) + d_k)]

The posterior class assignment distribution is obtained by normalizing over all class assignments. This distribution is intractable, but can be approximated by searching for a number of local maxima and estimating the weight of the neighborhood of each maximum.
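The data probability above is easy to evaluate in log form directly from the per-class contingency tables. A small sketch of ours (the function name is hypothetical), useful for comparing two classifications of the same matrix; factors that are the same for every classification, such as powers of Γ(d_k), cancel in such comparisons and are omitted:

```python
import numpy as np
from math import lgamma

def log_classification_prob(X, z, C):
    """Unnormalized log data probability of classification z (values 0..C-1)
    for a categorical matrix X with values 0..d_k-1 in column k, following
    the uniform-Dirichlet formula above."""
    n, K = X.shape
    d = X.max(axis=0) + 1                             # values per column
    s = lgamma(n + 1) + lgamma(C) - lgamma(n + C)     # class assignment term
    for c in range(C):
        Xc = X[z == c]
        nc = len(Xc)
        for k in range(K):
            counts = np.bincount(Xc[:, k], minlength=d[k])
            s += sum(lgamma(m + 1) for m in counts) - lgamma(nc + d[k])
    return s
```

A classification that groups rows with similar column distributions scores higher than an arbitrary one, which is exactly the quantity a local search or an MCMC move function would compare.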

Here the EM algorithm is competitive with MCMC calculations in many cases, because of the difficulty of tuning the proposal distribution of MCMC computations to avoid getting stuck near local maxima. The procedure is to randomly assign a few cases to classes, estimating parameters x_i^(c,k), assign remaining cases to


optimum classes, recomputing the distribution of each case over classes, reclassifying each case to its optimum class, and repeating until convergence. Repeating this procedure, one typically finds after a while a single most probable class assignment for each number of classes. The set of local optima so obtained can be used to guide an MCMC simulation giving more precise estimates of the probabilities of the possible classifications. But in practice, a set of high-probability classifications is normally a starting point for application specialists trying to give application meaning to the classes obtained.

5 Two cases

5.1 Case 1: Individualized media distribution - significance of profiles

The application concerns individualized presentation of news items (and is currently confidential in its details). The data used are historical records of individual subscribers to an electronic news service. The purpose of the investigation is to design a presentation strategy where an individual is first treated as the 'average customer', then as his record increases he can be included in one of a set of 'customer types', and finally he can also get a profile of his own. Several methodological problems arise when the data base is examined: customers are identified by an electronic address, and this address can sometimes be used by several persons, either in a household or among the customers of a store or library where the terminal is located. Similarly, the same customer might be using the system from several sites, and unless he has different roles on different sites, he might be surprised when finding that his profile is different depending on where he is. After a proposed switch to log-in procedures for users, these problems may disappear, but on the other hand the service may also become less attractive. An individual can also be in different modes with different interests, and we have no channel for a customer to signal this, except by seeing which links he follows. On the other hand, recording the access patterns from different customers is very easily automated, and the total amount of data is overwhelming although many customers have few accesses recorded. The categorization of the items to be offered is also by necessity coarse. Only two basic mechanisms are available for evaluating an individual's interest in an item: to which degree has he been interested in similar items before, and to which degree have similar individuals been interested in this item? This suggests two applications of the material of this chapter: segmentation or classification of customers into types, and evaluating a customer record against a number of types, to find out whether or not they can confidently be said to differ. We will address these problems here, and we will leave out many practical issues in the implementation.

The data base consists of a set of news items with coarse classification (into news, sport, culture, entertainment, economy, and with a coarse 'hotness indicator' and a location indicator of local, national or international, all in the perspective of Stockholm, Sweden); a set of customers, each with a type, a 'profile' and a list of rejected and accepted items; and a set of customer types, each with a list of members. Initially, we have no types or profiles, but only classified news items and the different individuals' access records. The production of customer types is a fairly manual procedure: even if many automatic


classification programs can make a reasonable initial guess, it is inevitable that the type list will be scrutinized and modified by media professionals – the types are of course also used for direct advertising. The AUTOCLASS model has the attractive property that it produces classes where the attributes are independent: we will not get a class where a customer of the class is either interested in international sports events or local culture items, but not both. In the end, there is no unique objective criterion that the type list is 'right', but if we have two proposed type lists we can distinguish them by estimating the fraction of accepted offers, and saying that the one giving a higher estimate is 'better'. In other words, a type set giving either a high or a low acceptance probability to each item is good. But this can be easily accomplished with many types, leading to imprecision in assigning customers to types – a case of over-fitting.

The assignment of new individuals to types cannot be done manually because of the large volumes. Our task is thus now to say, for a new individual with a given access list, to which type he belongs. The input to this problem is a set of tables, containing for each type, as well as for the new individual, the number of rejected and accepted offers of items from each class. The modeling assumptions required are that for each news category, there is a probability of accepting the item for the new individual or for an average member of a type. Our question is now if these data support the conclusion that the individual has the same probability table as one of the types, or if we can say that he is different from every type (and thus should get a profile of his own).

We can formulate the model choice problem by a transformation of the access tables to a dependency problem for data tables that we have already treated in depth. For a type t with a_i accepts and r_i rejects for a news category i, we imagine a table with three columns and ∑_i (a_i + r_i) rows: a 1 in column 1 to indicate an access of a type, the category number i in column 2 of (a_i + r_i) rows, a_i of which contain a 1 (for accept) and r_i a 0 (for reject) in column 3. We add a similar set of rows for the access list of the individual, marked with 0 in column 1. If the probability of a 0 (or 1) in column 3 depends on the category (column 2) but not on column 1, then the user cannot be distinguished from the type. But columns 1 and 2 may be dependent if the user has seen a different mixture of news categories compared to the type. In graphical modeling terms we could use the model choice algorithm. The probability of the customer belonging to type t is thus equal to the probability of model M4 against M3, where variable C in figure 13 corresponds to the category variable (column 2).

Although this is an elegant way to put our problem in a graphical model choice perspective, it is a little over-ambitious and a simpler method may be adequate: the types can in practice be considered as sets of reasonably precise accept probabilities, one for each news category, and an extension of the analysis of the coin tossing experiment is adequate. The customers can be described as an outcome of a sampling, where we record the number of accepted and rejected items. If type i accepts an offer for category j with probability p_ij, and a customer has accepted a_j out of n_j offers for category j, then the probability of this is, if the customer belongs to type i, p_ij^(a_j) (1 − p_ij)^(n_j − a_j). For the general model H_u of an unbalanced probability distribution (the accept probability is unknown but constant, and is uniformly distributed between 0 and 1), the probability of the outcome is a_j!(n_j − a_j)!/(n_j + 1)!. Taking the product over categories, we can now find the data probabilities {p(D|H_j)} and p(D|H_u), and a probability distribution over the types and the unknown type by evaluating


formula (2). In a prototype implementation we have the following customertypes described by their accept probabilities:

    category      Typ1   Typ2   Typ3   Typ4
    news-int      0.9    0.06   0.82   0.23
    news-loc      0.88   0.81   0.34   0.11
    sports-int    0.16   0      0.28   0.23
    sports-loc    0.09   0.06   0.17   0.21
    cult-int      0.67   0.24   0.47   0.27
    cult-loc      0.26   0.7    0.12   0.26
    tourism-int   0.08   0.2    0.11   0.11
    tourism-loc   0.08   0.14   0.2    0.13
    entert        0.2    0.25   0.74   0.28

Three new customers have arrived, with the following access records of presented (accepted) offers:

    category      Ind1   Ind2     Ind3
    news-int      3(3)   32(25)   17(8)
    news-loc      1(1)   18(9)    25(14)
    sports-int    1(1)   7(2)     7(3)
    sports-loc    0(0)   5(5)     6(1)
    cult-int      2(2)   11(4)    14(6)
    cult-loc      1(1)   6(2)     10(3)
    tourism-int   0(0)   4(4)     8(8)
    tourism-loc   1(1)   5(1)     8(3)
    entert        1(1)   17(13)   15(6)
    sum           10     105      110

We compute the data probabilities under the hypotheses that each new customer belongs to either type 1 to 4, or to H_u. Then we compute the probability distribution of the individuals' types, under the uniform prior assumption:

            Typ1     Typ2     Typ3     Typ4     Hu
    Ind1    0.0639   0.0000   0.0686   0.0001   0.8675
    Ind2    0.0000   0.0000   0.0002   0.0000   0.9980
    Ind3    0.0000   0.0000   0.0000   0.0000   1.0000

Dropping the Hu possibility, we have:

            Typ1     Typ2     Typ3     Typ4
    Ind1    0.4818   0.0000   0.5176   0.0005
    Ind2    0.0000   0.0000   1.0000   0.0000
    Ind3    0.0000   0.0000   0.9992   0.0008

Clearly, quite few observations suffice for putting a new customer confidently into a type. The profile of the third individual is significantly not in one of the four types (p(H_u) = 1). For the first individual, the access record is


too short to confidently put him in a class, but the first and third classes fit reasonably.
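The computation behind these tables can be sketched as follows; a small helper of our own (shown on synthetic numbers rather than the confidential data, and assuming accept probabilities strictly between 0 and 1, since probabilities of exactly 0 or 1 would need a log(0) guard):

```python
from math import exp, lgamma, log

def type_posteriors(accepts, offers, types):
    """Posterior over customer types plus the 'unknown profile' hypothesis
    H_u, for a customer with accepts[j] accepted out of offers[j] offers in
    category j; types[i][j] is type i's accept probability for category j.
    Uniform prior over the hypotheses."""
    logps = []
    for p in types:
        # p(D | type i) = prod_j p_ij^a_j (1 - p_ij)^(n_j - a_j)
        logps.append(sum(a * log(pj) + (n - a) * log(1.0 - pj)
                         for a, n, pj in zip(accepts, offers, p)))
    # p(D | H_u): accept probability uniform on (0, 1) in each category,
    # giving a_j! (n_j - a_j)! / (n_j + 1)! per category
    logps.append(sum(lgamma(a + 1) + lgamma(n - a + 1) - lgamma(n + 2)
                     for a, n in zip(accepts, offers)))
    m = max(logps)
    ws = [exp(lp - m) for lp in logps]
    total = sum(ws)
    return [w / total for w in ws]   # [p(Typ1), ..., p(TypT), p(H_u)]

# synthetic example: a customer whose record closely matches the first type
post = type_posteriors([18, 2], [20, 20], [[0.9, 0.1], [0.1, 0.9]])
```

With a record this strong the first type dominates both the second type and H_u, mirroring how Ind2 and Ind3 above are pulled to Typ3 once H_u is dropped.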

Actually, the model for putting individuals into types may be inappropriate when considering what the application models: a type should really be considered a set of individuals with a clustered span of profiles, whereas the model we used considers a type to be the average accept profile over the individuals in the type. This is too precise a concept. We can make the types broader in two ways: either by spanning them with a set of individuals and computing the probability that the new individual has the same profile as one of the individuals defining the type, or we can define a type as a probability distribution over accept profiles. The second option is quite attractive, since the type accept probabilities can be defined with a sample that defines a set of Beta distributions of accept probabilities. In graphical modeling terms, we test the adequacy of a type for an individual by putting their access records into a three-column table with values individual/type, category of item and accept indicator. The adequacy of the type for the individual is measured by the conditional independence of type and accept given category, evaluated by formula (38), with news category corresponding to variable C. The computation is of course done on the contingency table without actually expanding the data table. The results in our example, if the types are defined by a sample of 100 items of each category, are:

       Typ1    Typ2    Typ3    Typ4
Ind1   0.2500  0.2500  0.2500  0.2500
Ind2   0.0000  0.0000  1.0000  0.0000
Ind3   0.0000  0.0001  0.9854  0.0145

It is now even clearer than before that the access record for individual 1 is inadequate, and that the third individual is not quite compatible with any type. It should be noted that we have throughout this example worked with uniform priors. These priors have no canonic justification but should be regarded as conventional. If specific information justifying other priors is available, it can easily be used, but this is seldom the case. The choice of prior will affect the assignment of individual to type only in rare cases: when the access records are very short and the individual does not really fit any type.
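The type-assignment computation can be sketched as follows. This is a minimal illustrative variant, not the exact computation behind the tables above: each type is described by per-category accept probabilities with uniform Beta(1,1) priors updated by training counts, and the posterior probability of each type for a new individual is proportional to the Beta-Bernoulli marginal likelihood of his access record under that type, with a uniform prior over types. All names and the toy data are made up for illustration.

```python
from math import lgamma, exp

def log_beta_binom(k, n, a, b):
    """Log marginal likelihood of k accepts out of n offers
    under a Beta(a, b) prior on the accept probability."""
    return (lgamma(a + b) - lgamma(a) - lgamma(b)
            + lgamma(a + k) + lgamma(b + n - k) - lgamma(a + b + n))

def type_posterior(record, types):
    """record: per-category (accepts, offers) for the new individual.
    types: per-type, per-category (accepts, offers) from the training sample.
    Returns posterior probabilities over types, uniform prior on types."""
    logs = []
    for counts in types:
        s = 0.0
        for (k, n), (ka, na) in zip(record, counts):
            # Beta(1,1) prior updated by the type's training counts.
            s += log_beta_binom(k, n, 1 + ka, 1 + na - ka)
        logs.append(s)
    m = max(logs)           # subtract the max for numerical stability
    w = [exp(l - m) for l in logs]
    z = sum(w)
    return [x / z for x in w]

# Toy example: two categories, two types with opposite accept profiles.
types = [[(90, 100), (10, 100)],   # type 1: accepts cat 1, rejects cat 2
         [(10, 100), (90, 100)]]   # type 2: the reverse
record = [(8, 10), (1, 10)]        # a short record resembling type 1
p = type_posterior(record, types)
print(p)  # heavily favours the first type
```

Even with only twenty offers in the record, the posterior concentrates almost entirely on the matching type, which mirrors the observation above that quite few observations suffice.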

5.2 Case 2: Schizophrenia research – hunting for causal relationships in complex systems

This application is directed at understanding a complex system: the human brain. Similar methods have been applied to the understanding of complex engineered systems like paper mills, and to micro-economic systems. Many investigations on normal subjects have brought immense new knowledge about the normal function of the human brain, but mental disorders still escape understanding of their causes and cures: despite a large number of theories, it is not known why mental disorders develop except in special cases, nor which physiological and/or psychological processes cause them. In order to get handles on the complex relationships between psychology, psychiatry, and physiology of the human brain, a database is being built with many different types of variables measured for a large population of schizophrenia patients and control subjects. For each volunteering subject, a large number of variables are obtained or measured, like age, gender, age of admission to psychiatric care; volumes of gray and white matter as well as cerebrospinal fluid in several regions of the brain (obtained from MR images); genetic characterization; and measurements of concentrations in the blood of large numbers of substances and metabolites. For the affected subjects, a detailed clinical characterization is recorded.

In this application one can easily get lost. There is an enormous amount of knowledge in the medical profession on the relationships and possible significances of these quite many (ca. 150) variables. At the same time, the data collection process is costly, so the total number of subjects is very small compared, for example, with national registers that have millions of persons but relatively few variables for each of them. This means that statistical significance problems become important. A test set of 144 subjects, 83 controls and 61 affected by schizophrenia, was obtained. This set was investigated with most methods described in this chapter, giving an understanding of the strongest relationships (graphical model), possible classifications into different types of the disease, etc. In order to find possible causal chains, we tried to find variables and variable pairs with a significant difference in co-variation with the disease, i.e., variables and tuples of variables whose joint distribution is significantly different for affected persons relative to control subjects. This exercise exhibited a very large number of such variables and tuples, many of which were known before, others not. All these associations point to probable mechanisms involved in the disease, which seems to permeate every part of the organism. But it is not possible to see which is the cause and which is the effect, and many of the effects can be related to the strong medication given to all schizophrenia patients. In order to single out the more promising effects, a graphical model approach was tried: each variable was standardized around its mean, separately for affected and controls. Then the pairs of variables were detected that give the highest probability to the leftmost graph in figure 17. Here D stands for the diagnosis (classification of the subject as affected or control), and A and B are the two variables compared.
In the middle graph, the relationship can be described as the disease affecting the two variables independently, whereas in the left graph the relationship can be described as the disease affecting one of them, but with a relationship to the other that is the same for affected and for controls. Most of the pairs selecting the first graph involved some parts of the vermis (a part of the cerebellum). Particularly important was the pair subject age and posterior superior vermis volume. As shown in figure 18, this part of the brain decreases in size with age for normal subjects. But for the affected persons the size is smaller and does not change with age. Neither does the size depend significantly on the duration of the disease or the medication received. Although these findings could be explained by confounding effects, the more likely explanation presently is that the reduction occurred before the outbreak of the disease and that the processes leading to the disease involve disturbing the development of this part of the brain.

Several other variables were linked to the vermis in the same way: there was an association for control subjects but not for the affected ones, indicating that the normal co-variation is broken by the mechanisms of the disease. For variables that were similarly co-varying with the vermis for controls and affected, there is a possibility that these regions are affected by the disease similarly to the vermis.
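A crude version of this pair screening can be sketched as follows: compute the A–B correlation separately within the control and affected groups, and flag pairs where the two correlations differ markedly as candidates for the graph in which the disease breaks the association. This is only an illustrative proxy for the posterior model comparison over the graphs of figure 17; the variable names and synthetic data are made up.

```python
import random

def corr(x, y):
    """Plain Pearson correlation of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def association_gap(a, b, diagnosis):
    """Difference between the A-B correlation in controls (d == 0)
    and in affected subjects (d == 1); a large gap suggests the
    association is broken by the disease, as for age vs. vermis
    volume in the text."""
    ctrl = [(x, y) for x, y, d in zip(a, b, diagnosis) if d == 0]
    aff = [(x, y) for x, y, d in zip(a, b, diagnosis) if d == 1]
    r0 = corr([x for x, _ in ctrl], [y for _, y in ctrl])
    r1 = corr([x for x, _ in aff], [y for _, y in aff])
    return r0 - r1

# Synthetic example matching the study's group sizes:
# A and B co-vary for controls but not for affected subjects.
random.seed(0)
diagnosis = [0] * 83 + [1] * 61
a, b = [], []
for d in diagnosis:
    x = random.gauss(0, 1)
    a.append(x)
    b.append(x + random.gauss(0, 0.5) if d == 0 else random.gauss(0, 1))
gap = association_gap(a, b, diagnosis)
print(round(gap, 2))  # large gap: association present for controls only
```

In the actual analysis one would rank all variable pairs by a score of this kind (or by the graph posteriors themselves) and inspect the top of the list.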


In order to get an idea of which physiological and anatomical variables give the best predictors of the disease, a decision tree was produced using the Bayesian approach of section 3.4.1, choosing the most significant variable and decision boundary in a top-down fashion. This tree is shown in figure 19. Note that it does not accurately classify all cases, but leaves 6 of the 144 cases misclassified. A bigger decision tree would classify all cases correctly, but would also over-fit to the training set, probably giving less accuracy on new cases.
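The top-down split selection can be sketched like this: at each node, choose the variable and threshold whose binary split best separates affected from controls, scored here by a simple Bayesian marginal likelihood with a uniform Beta prior on the class probability in each leaf. This is a stand-in for the full robust-tree criterion of section 3.4.1, and the toy data is invented.

```python
from math import lgamma

def log_ml(n1, n0):
    """Log marginal likelihood of n1 affected / n0 controls in a leaf
    under a uniform Beta(1,1) prior on the class probability."""
    return lgamma(1 + n1) + lgamma(1 + n0) - lgamma(2 + n1 + n0)

def best_split(X, y):
    """X: list of feature vectors, y: 0/1 labels.
    Returns (log score, feature index, threshold) of the best single
    binary split, scored by the product of the two leaf marginals.
    Applied recursively to each leaf, this grows the tree top-down."""
    best = (float("-inf"), None, None)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2          # midpoint between adjacent values
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            score = (log_ml(sum(left), len(left) - sum(left))
                     + log_ml(sum(right), len(right) - sum(right)))
            if score > best[0]:
                best = (score, j, t)
    return best

# Toy data: feature 1 separates the classes, feature 0 is noise.
X = [[0.3, 1.0], [0.7, 1.2], [0.2, 0.9], [0.6, 3.1], [0.1, 2.8], [0.9, 3.0]]
y = [0, 0, 0, 1, 1, 1]
score, j, t = best_split(X, y)
print(j, round(t, 2))  # → 1 2.0
```

The marginal-likelihood score automatically penalizes splits that merely shuffle mixed leaves, which is part of why a tree grown this way can be stopped before it fits every training case.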

Figure 17: Graphical models detecting co-variation

6 Conclusions

This chapter emphasized the statistical approach to data mining. Meaning of data is therefore linked to its properties relative to statistical models. It also emphasized a Bayesian approach, because of its intellectual appeal. The techniques proposed for analytical models and EM classification can be applied to very large data sets, possibly after small tunings of the 'obvious' algorithm. On the other hand, Markov Chain Monte Carlo methods are today not used on gigantic data sets. I have only covered the most fundamental models and techniques; the larger application areas have developed more sophisticated models and computations to solve their typical problems. The methods described here are easy to apply using general purpose software like C, Matlab, Octave, R, etc.


[Scatter plot: Age on the horizontal axis, PSV volume on the vertical axis.]

Figure 18: Association between age and posterior inferior vermis volume depends on diagnosis. The principal directions of variation for controls (o) and affected subjects (+) are shown.

[Decision tree over the variables VenWhite, PSV, TemCSF, SubWhite, AC, CH, and PIVC; leaf labels A and C with case counts.]

Figure 19: Robust decision tree. Numbers on terminals indicate the number of cases in the training set, with misclassified subjects in parentheses.


Index

acceptance probability, 59
Adaline, 24
artificial intelligence, 3

Bayes factor, 8
Bayesian analysis, 6
belief mass, 21
Bonferroni correction, 19

causality, 43
classification problem, 23
classifier error, 23
combination rules, 21
conditional independence, 44
confidence interval, 17
confounder, 45
contingency table, 38
cost matrix, 12
Cournot's principle, 17
credible interval, 9
credible set, 9

data mining, 3
Dempster's combination rule, 21
Dempster-Shafer, 21
detailed balance condition, 59
Dirichlet distribution, 28
discrete distribution, 28
dynamic inference, 9

empirical error, 23
ergodic, 59
estimate, 12
evidence theory, 21

fair betting odds, 20
false discovery rate control, 19
family-wise error control, 19
FDR, 19
feature space, 23
frequentist statistics, 16
fuzzy logic, 27, 51
FWE, 19

hypotheses, 6

importance sampling, 58
improper priors, 30
inference, 5
information fusion, 3
inverse probability, 6

Knowledge discovery, 3

Laplace estimator, 13

machine learning, 3
MAP, 13
marginal dependency, 44
mathematical probability, 6
Maximum A Posteriori, 13
mixing coefficients, 31
mixture, 31
mixture of normals, 33
models, 6

non-parametric, 31
nuisance parameters, 30
null hypothesis, 17

objective probability, 16

p-value, 18
parameter space, 8
parameterized model, 8
particle filter, 60
pattern recognition, 3
perceptron, 23
permissible edge, 50
pignistic transformation, 21
possible world space, 8
posterior odds, 8
precision, 29
precision matrix, 40
prior odds, 8
proposal distribution, 59

recursive estimation, 9
regression problem, 23
rejection sampling, 58
relative plausibility transformation, 21
resampling, 58
reversible, 59
robust Bayesian analysis, 26
rough set theory, 27

simplicial vertex, 50
subjective probability, 8
sufficient statistics, 29
symmetric proposal, 15

test statistic, 17
transition kernel, 10

uncertainty management, 3, 26
utility, 12

weighted sample, 58
weights, 23
wide margin classifier, 23


References

[1] S. Arnborg. In J. Wang, editor, Data Mining: Opportunities and Challenges, chapter 1: A Survey of Bayesian Data Mining. Idea Group Publishing, 2003.

[2] S. Arnborg. Robust Bayesian fusion: Imprecise and paradoxical reasoning. In FUSION 2004, pages 407–414, 2004.

[3] S. Arnborg and G. Sjödin. Bayes rules in finite models. In Proc. European Conference on Artificial Intelligence, pages 571–575, Berlin, 2000.

[4] S. Arnborg and G. Sjödin. On the foundations of Bayesianism. In Ali Mohammad-Djafari, editor, Bayesian Inference and Maximum Entropy Methods in Science and Engineering, 20th International Workshop, Gif-sur-Yvette, 2000, pages 61–71. American Institute of Physics, 2001.

[5] S. Benferhat, D. Dubois, and H. Prade. Nonmonotonic reasoning, conditional objects and possibility theory. Artificial Intelligence, 92:259–276, 1997.

[6] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. of the Royal Statistical Society B, 57:289–300, 1995.

[7] Y. Benjamini and D. Yekutieli. The control of the false discovery rate in multiple testing under dependency. Technical report, Stanford University, Dept of Statistics, 2001.

[8] J. O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, 1985.

[9] J. O. Berger. An overview of robust Bayesian analysis (with discussion). Test, 3:5–124, 1994.

[10] N. Bergman. Recursive Bayesian Estimation. PhD thesis, Linköping University, 1999.

[11] G. Berkeley. The Analyst. In James Newman, editor, The World of Mathematics, New York, 1956. Simon and Schuster.

[12] N. Berkman and T. Sandholm. What should be optimized in a decision tree? Technical report, University of Massachusetts at Amherst, 1995.

[13] J.M. Bernardo and A.F. Smith. Bayesian Theory. Wiley, 1994.

[14] Barbara Caputo. A new kernel method for object recognition: spin glass-Markov random fields. PhD thesis, KTH, Sweden, October 11 2004.

[15] B. P. Carlin and S. Chib. Bayesian model choice via Markov chain Monte Carlo methods. Journal of the Royal Statistical Society, Series B, 57:473–484, 1995.

[16] B. P. Carlin and T. A. Louis. Bayes and Empirical Bayes Methods for Data Analysis. Chapman & Hall, 1997.

[17] G. Casella and C. P. Robert. Rao-Blackwellization of sampling schemes. Technical report, Cornell University, 1995.

[18] P. Cheeseman and J. Stutz. Bayesian classification (AUTOCLASS): Theory and results. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining. 1995.

[19] H. Chipman, E. I. George, and R. E. McCulloch. Bayesian CART. Technical report, University of Chicago, 1995.

[20] V. Chvátal. Linear Programming. Freeman and Company, 1983.

[21] J. Corrander and M. Sillanpää. A unified approach to joint modeling of multiple quantitative and qualitative traits in gene mapping. J. Theor. Biol., 218:435–446, 2002.

[22] R. Courant and D. Hilbert. Methods of Mathematical Physics V.I. Interscience Publishers, New York, 1953.

[23] D. R. Cox and Nanny Wermuth. Multivariate Dependencies. Chapman and Hall, 1996.

[24] R.T. Cox. Probability, frequency, and reasonable expectation. Am. Jour. Phys., 14:1–13, 1946.

[25] N. Cristianini and J. Shawe-Taylor, editors. Support Vector Machines and other kernel based methods. Cambridge University Press, 2000.

[26] A. Dale. A History of Inverse Probability: from Thomas Bayes to Karl Pearson. Springer, Berlin, 1991.

[27] B. de Finetti. Theory of Probability. London: Wiley, 1974.

[28] P. de Laplace. On probability. In James Newman, editor, The World of Mathematics, New York, 1956. Simon and Schuster.

[29] A.P. Dempster. Upper and lower probabilities induced by a multi-valued mapping. Annals of Mathematical Statistics, 38:325–339, 1967.

[30] D. K. Dey, L. Kuo, and S. K. Sahu. A Bayesian predictive approach to determining the number of components in a mixture distribution. Technical report, University of Connecticut, 1993.

[31] J. Diebolt and C. Robert. Estimation of finite mixture distributions through Bayesian sampling. Journal of the Royal Statistical Society, Series B, 56:589–590, 1994.

[32] A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo Methods in Practice. Springer, 2001.

[33] M. D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90:577–588, 1995.

[34] W. Feller. An Introduction to Probability Theory and Its Applications, volume 1. Wiley, New York, 1956.

[35] S. Fienberg. When did Bayesian inference become "Bayesian"? Bayesian Analysis, 1(1):1–40, 2005.

[36] R.A. Fisher. Inverse probability. Proceedings of the Cambridge Philosophical Society, 26:528–535, 1930.

[37] D. Fixsen and R.P.S. Mahler. The modified Dempster-Shafer approach to classification. IEEE Trans. SMC-A, 27(1):96–104, January 1997.

[38] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis (Second Edition). Chapman & Hall, New York, 2003.

[39] N. Gershenfeld and A. Weigend. The future of time series: Learning and understanding. In A. Weigend and N. Gershenfeld, editors, Time Series Prediction: Forecasting the Future and Understanding the Past, pages 1–66. Addison-Wesley, 1993.

[40] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter. Markov Chain Monte Carlo in Practice. Chapman and Hall, 1996.

[41] C. Glymour and G. Cooper, editors. Computation, Causation and Discovery. MIT Press, 1999.

[42] I.R. Goodman, R. Mahler, and H.T. Nguyen. The Mathematics of Data Fusion. Kluwer Academic Publishers, 1997.

[43] P. J. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82:711–732, 1995.

[44] David Heckerman. Bayesian networks for data mining. Data Mining and Knowledge Discovery, 1:79–119, 1997.

[45] Y. Hochberg. A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75:800–803, 1988.

[46] A. Holst. The Use of a Bayesian Neural Network Model for Classification Tasks. Norstedts tryckeri AB, 1997. PhD thesis, Stockholm University.

[47] E. Hüllermeier. Toward a probabilistic formalization of case-based inference. In D. Thomas, editor, Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI-99-Vol1), pages 248–253, S.F., July 31–August 6 1999. Morgan Kaufmann Publishers.

[48] E. T. Jaynes. Probability Theory: The Logic of Science. Preprint: Washington University, 1996. http://bayes.wustl.edu/etj/prob.html.

[49] F. B. Jensen. Bayesian Networks and Decision Graphs. Springer, 2001.

[50] Steffen L. Lauritzen. Graphical Models. Clarendon Press, 1996.

[51] D. Madigan and A. E. Raftery. Model selection and accounting for model uncertainty in graphical models using Occam's window. Technical report, University of Washington, 1993.

[52] D. Madigan and A. E. Raftery. Model selection and accounting for model uncertainty in graphical models using Occam's window. J. American Statistical Ass., 428:1535–1546, 1994.

[53] Heikki Mannila, Hannu Toivonen, Atte Korhola, and Heikki Olander. Learning, mining, or modeling? — a case study from paleoecology.

[54] V. Megalooikonomou, J. Ford, Li Shen, F. Makedon, and A. Saykin. Data mining in brain imaging. Statistical Methods in Medical Research, 9:359–394, 2000.

[55] M. Minsky and S. Papert. Perceptrons. MIT Press, Cambridge, Mass, 1969.

[56] T.M. Mitchell. Machine Learning. McGraw-Hill, 1997.

[57] R.M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical report, Department of Computer Science, University of Toronto, 1993. CRG-TR-93-1.

[58] M.-S. Oh and A. Raftery. Model-based clustering with dissimilarities: A Bayesian approach. Technical Report 441, Dept of Statistics, University of Washington, 2003.

[59] G. Paass and J. Kindermann. Bayesian classification trees with overlapping leaves applied to credit-scoring. Lecture Notes in Computer Science, 1394:234–245, 1998.

[60] Z. Pawlak. Rough Sets. Kluwer, Dordrecht, 1992.

[61] Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge, England, 2000.

[62] J.G. Propp and D.B. Wilson. How to get a perfectly random sample from a generic Markov chain and generate a random spanning tree of a directed graph. Journal of Algorithms, 27(2):170–217, May 1998.

[63] Lennart Råde and Bertil Westergren. Beta, Mathematical Handbook. Chartwell-Bratt, Old Orchard, Bickley Road, Bromley, Kent BR1 2NE, England, second edition, 1990.

[64] A. E. Raftery and V.E. Akman. Bayesian analysis of a Poisson process with a change-point. Biometrika, 73:85–89, 1986.

[65] M. Ramoni and P. Sebastiani. Parameter estimation in Bayesian networks from incomplete databases. Intelligent Data Analysis, 2, 1998.

[66] S. Richardson and P. J. Green. On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society, Series B, 59:731–792, 1997.

[67] B. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, 1996.

[68] J. Robins and L. Wasserman. Conditioning, likelihood and coherence: A review of some foundational concepts. J. American Statistical Ass., 95:1340–1345, 2000.

[69] D. J. Rose. Triangulated graphs and the elimination process. J. Math. Anal. Appl., 32:597–609, 1970.

[70] D. J. Rose, R. E. Tarjan, and G. S. Lueker. Algorithmic aspects of vertex elimination on graphs. SIAM J. Comput., 5:266–283, 1976.

[71] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408, 1958.

[72] R. Royall. On the probability of observing misleading statistical evidence (with discussion). J. American Statistical Ass., 95:760–780, 2000.

[73] C. Saunders, M. O. Stitson, J. Weston, L. Bottou, B. Schoelkopf, and A. Smola. Support vector machine - reference manual. Technical Report CSD-TR-98-03, Department of Computer Science, Royal Holloway, University of London, Egham, UK, 1998.

[74] L.J. Savage. Foundations of Statistics. John Wiley & Sons, New York, 1954.

[75] Valeriu Savcenko. Bayesian methods for mixture modelling. Master's thesis, Royal Institute of Technology, 2004.

[76] G. Shafer. A Mathematical Theory of Evidence. Princeton University Press, 1976.

[77] G. Shafer, A. Gammerman, and V. Vovk. Algorithmic Learning in a Random World. Springer, 2005.

[78] G. Shafer and V. Vovk. Probability and Finance: It's Only a Game! Wiley, 2001.

[79] D. S. Sivia. Data Analysis: A Bayesian Tutorial. Clarendon Press: Oxford, 1996.

[80] F. Takens. Detecting strange attractors in turbulence. In D. A. Rand and L. S. Young, editors, Dynamical Systems and Turbulence, pages 366–381. Springer-Verlag, New York, 1980.

[81] L. Tierney. Markov chains for exploring posterior distributions. Annals of Statistics, 22:1701–1762, 1994.

[82] L. Tierney. Theoretical perspectives. In W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, editors, Markov Chain Monte Carlo in Practice. Chapman and Hall, 1996.

[83] Michael E. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211–244, 2001.

[84] Leslie G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, November 1984.

[85] H. Valpola and J. Karhunen. An unsupervised ensemble learning method for nonlinear dynamic state-space models. Neural Computation, 14(11):2647–2692, 2002.

[86] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York, 1982.

[87] G.I. Webb. Further experimental evidence against the utility of Occam's razor. Journal of AI Research, 4:397–417, 1996.

[88] B. Widrow and M. E. Hoff. Adaptive switching circuits. In Proceedings WESCON, pages 96–104, 1960.

[89] N. Wilson. Extended probability. In Proceedings of the 12th European Conference on Artificial Intelligence, pages 667–671. John Wiley and Sons, 1996.

[90] M. W. Woolrich, C. F. Beckmann, M. Jenkinson, and S. M. Smith. Multilevel linear modelling for fMRI group analysis using Bayesian inference. Neuroimage, 21(4):1732–1747, 2004.

[91] L. A. Zadeh. The roles of fuzzy logic and soft computing in the conception, design and deployment of intelligent systems. Lecture Notes in Computer Science, 1198:183–210, 1997.