Semi-Supervised Clustering: Application to Image Segmentation

Mario A.T. Figueiredo

Instituto de Telecomunicações and Instituto Superior Técnico, Technical University of Lisbon, 1049-001 Lisboa, Portugal; [email protected]

Abstract. This paper describes a new approach to semi-supervised model-based clustering. The problem is formulated as penalized logistic regression, where the labels are only indirectly observed (via the component densities). This formulation allows deriving a generalized EM algorithm with closed-form update equations, in contrast with other related approaches, which require expensive Gibbs sampling or suboptimal algorithms. We show how this approach can be naturally used for image segmentation under spatial priors, avoiding the usual hard combinatorial optimization required by classical Markov random fields; this opens the door to the use of sophisticated spatial priors (such as those based on wavelet representations) in a simple and computationally very efficient way.

1 Introduction

In recent years there has been a considerable amount of interest in semi-supervised learning problems (see Zhu (2006)). Most formulations of semi-supervised learning approach the problem from one of the two ends of the unsupervised-supervised spectrum: either supervised learning in the presence of unlabelled data (see, e.g., Belkin and Niyogi (2003), Krishnapuram et al. (2004), Seeger (2001), Zhu et al. (2003)) or unsupervised learning with additional information (see, e.g., Basu et al. (2004), Law et al. (2005), Lu and Leen (2005), Shental et al. (2003), Wagstaff et al. (2001)). The second perspective, known as semi-supervised clustering (SSC), is usually adopted when labels are completely absent from the training data, but there are (say, pair-wise) relations that one wishes to enforce or simply encourage.

Most methods for SSC work by incorporating the desired relations (or constraints) into classical algorithms such as the expectation-maximization (EM) algorithm for mixture-based clustering or the K-means algorithm. These relations may be imposed in a hard way, as constraints (Shental et al. (2003), Wagstaff et al. (2001)), or used to build priors under which probabilistic clustering is performed (Basu et al. (2004), Lu and Leen (2005)). This last approach has been shown to yield good results and is the most natural for applications where one knows that the relations should be encouraged, but not enforced (e.g., in image segmentation, neighboring pixels should be encouraged, but obviously not enforced, to belong to the same class). However, the resulting EM-type algorithms have a considerable drawback: because of the presence of the prior on the grouping relations, the E-step no longer has a simple closed form, requiring the use of expensive stochastic (e.g., Gibbs) sampling schemes (Lu and Leen (2005)) or suboptimal methods such as the iterated conditional modes (ICM) algorithm (Basu et al. (2004)).

In this paper, we describe a new approach to semi-supervised mixture-based clustering for which we derive a simple, fully deterministic generalized EM (GEM) algorithm. The keystone of our approach is the formulation of semi-supervised mixture-based clustering as a penalized logistic regression problem, where the labels are only indirectly observed. The linearity of the resulting complete log-likelihood, with respect to the missing group labels, will allow deriving a simple GEM algorithm.

We show how the proposed formulation is used for image segmentation under spatial priors which, until now, were only used for real-valued fields (e.g., image restoration/denoising): Gaussian fields and wavelet-based priors. Under these priors, our GEM algorithm can be implemented very efficiently by resorting to fast Fourier or fast wavelet transforms. Our approach completely avoids the combinatorial nature of standard segmentation methods, which are based on Markov random fields of discrete labels (see Li (2001)).

Although we focus on image segmentation, SSC has been recently used in other areas, such as clustering of image databases (see Grira et al. (2005)), clustering of documents (see Zhong (2006) for a survey), and bioinformatics (see, e.g., Nikkilä et al. (2001), Cebron and Berthold (2006)). Our approach will thus also be potentially useful in those application areas.

2 Formulation

We build on the standard formulation of finite mixtures: let $\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ be an observed data set, with each $\mathbf{x}_i \in \mathbb{R}^d$ assumed to have been generated (independently) according to one of a set of $K$ probability densities $\{p(\cdot|\phi^{(1)}), \ldots, p(\cdot|\phi^{(K)})\}$. Associated with $\mathcal{X}$, there is a hidden/missing label set $\mathcal{Y} = \{\mathbf{y}_1, \ldots, \mathbf{y}_n\}$, where $\mathbf{y}_i = [y_i^{(1)}, \ldots, y_i^{(K)}]^T \in \{0,1\}^K$, with $y_i^{(k)} = 1$ if and only if $\mathbf{x}_i$ was generated by source $k$ ("1-of-$K$" binary encoding). Thus,
\[
p\big(\mathcal{X} \mid \mathcal{Y}, \phi^{(1)}, \ldots, \phi^{(K)}\big) = \prod_{i=1}^{n} \prod_{k=1}^{K} \left[ p(\mathbf{x}_i|\phi^{(k)}) \right]^{y_i^{(k)}}. \tag{1}
\]
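For concreteness, the following minimal NumPy sketch evaluates the complete-data log-likelihood corresponding to (1); Gaussian components are used merely as one possible choice of $p(\cdot|\phi^{(k)})$, and the function name and SciPy dependency are ours, not part of the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def complete_data_loglik(X, Y, params):
    """Log of (1): sum_i sum_k y_i^(k) log p(x_i | phi^(k)).

    X      : (n, d) array of observations.
    Y      : (n, K) array of "1-of-K" label vectors.
    params : list of K (mean, cov) pairs; Gaussian components are used
             here only as one possible choice of p(.|phi^(k)).
    """
    # (n, K) matrix of log-densities log p(x_i | phi^(k))
    logp = np.column_stack(
        [multivariate_normal.logpdf(X, mean=m, cov=C) for (m, C) in params]
    )
    return float(np.sum(Y * logp))
```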


In standard mixture models, the hidden labels $\mathbf{y}_i$ are assumed to be (independent) samples of a multinomial variable with probabilities $\{\eta^{(1)}, \ldots, \eta^{(K)}\}$, i.e., $P(\mathcal{Y}) = \prod_i \prod_k (\eta^{(k)})^{y_i^{(k)}}$. This independence assumption clearly has to be abandoned in order to insert grouping constraints or a prior preference for some grouping relations. In Basu et al. (2004) and Lu and Leen (2005), this is done by defining a prior $P(\mathcal{Y})$ in which $\mathbf{y}_1, \ldots, \mathbf{y}_n$ are not independent. However, any such prior destroys the simple structure of EM for standard finite mixtures, which is critically supported on the independence assumption. Here, we follow a different route, in which $\mathbf{y}_1, \ldots, \mathbf{y}_n$ are also not modelled as independent, but for which we can still derive a simple GEM algorithm.

Let the set of hidden labels $\mathcal{Y} = \{\mathbf{y}_1, \ldots, \mathbf{y}_n\}$ depend on a new set of variables $\mathcal{Z} = \{\mathbf{z}_1, \ldots, \mathbf{z}_n\}$, where each $\mathbf{z}_i = [z_i^{(1)}, \ldots, z_i^{(K)}]^T \in \mathbb{R}^K$, according to a multinomial logistic model (see Böhning (1992)):
\[
P(\mathcal{Y}|\mathcal{Z}) = \prod_{i=1}^{n} \prod_{k=1}^{K} \left( P[y_i^{(k)} = 1 \mid \mathbf{z}_i] \right)^{y_i^{(k)}} = \prod_{i=1}^{n} \prod_{k=1}^{K} \left( \frac{e^{z_i^{(k)}}}{\sum_{l=1}^{K} e^{z_i^{(l)}}} \right)^{y_i^{(k)}}. \tag{2}
\]

Due to the normalization constraint $\sum_{k=1}^{K} P[y_i^{(k)} = 1 \mid \mathbf{z}_i] = 1$, we set (without loss of generality) $z_i^{(K)} = 0$, for $i = 1, \ldots, n$ (see Böhning (1992)). We are thus left with a total of $n(K-1)$ real variables, i.e., $\mathcal{Z} = \{\mathbf{z}^{(1)}, \ldots, \mathbf{z}^{(K-1)}\}$, where $\mathbf{z}^{(k)} = [z_1^{(k)}, \ldots, z_n^{(k)}]^T$. Since the variables $z_i^{(k)}$ are real-valued and totally unconstrained, it is formally simple to define a prior $p(\mathcal{Z})$ and to perform optimization w.r.t. $\mathcal{Z}$. This contrasts with the direct manipulation of $\mathcal{Y}$, which, due to its discrete nature, turns the corresponding problems into combinatorial ones.
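A small sketch of the link (2), under the convention $z_i^{(K)} = 0$ just adopted, is given below; the function name is ours.

```python
import numpy as np

def label_probabilities(Z):
    """Multinomial logistic link (2): P[y_i^(k) = 1 | z_i] = softmax(z_i)_k.

    Z : (n, K-1) array of free variables; the K-th component is fixed to
        zero, following the convention z_i^(K) = 0 adopted above.
    Returns an (n, K) array of class probabilities.
    """
    Zfull = np.column_stack([Z, np.zeros(Z.shape[0])])  # append z_i^(K) = 0
    Zfull -= Zfull.max(axis=1, keepdims=True)           # numerical stability
    E = np.exp(Zfull)
    return E / E.sum(axis=1, keepdims=True)
```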

The prior grouping relations are now expressed by a prior $p(\mathcal{Z})$; in particular, preferred pair-wise relations are encoded in a Gaussian prior
\[
p(\mathcal{Z}|\mathbf{W}, \boldsymbol{\alpha}) \propto \prod_{k=1}^{K-1} \exp\!\left[ -\frac{\|\mathbf{z}^{(k)} - \alpha^{(k)}\mathbf{1}\|^2}{2} - \frac{1}{4} \sum_{i,j=1}^{n} W_{i,j}\big(z_i^{(k)} - z_j^{(k)}\big)^2 \right], \tag{3}
\]
where $\mathbf{1} = [1, \ldots, 1]^T$ is a vector of $n$ ones, $\boldsymbol{\alpha} = [\alpha^{(1)}, \ldots, \alpha^{(K-1)}]^T$, where $\alpha^{(k)}$ is a global mean for $\mathbf{z}^{(k)}$, and $\mathbf{W}$ is a matrix (with zeros in the diagonal) encoding the pair-wise preferences: $W_{i,j} > 0$ expresses a preference (with strength proportional to $W_{i,j}$) for having points $i$ and $j$ in the same cluster; $W_{i,j} = 0$ expresses the absence of any preference concerning the pair $(i,j)$. The first term pulls the variables in $\mathbf{z}^{(k)}$ towards a common mean $\alpha^{(k)}$. If all $W_{i,j} = 0$, we have a standard mixture model in which each probability $\eta^{(k)}$ is a function of the corresponding $\alpha^{(k)}$. Defining
\[
\mathbf{z} = [z_1^{(1)}, \ldots, z_n^{(1)}, z_1^{(2)}, \ldots, z_n^{(2)}, \ldots, z_1^{(K-1)}, \ldots, z_n^{(K-1)}]^T = \left[ (\mathbf{z}^{(1)})^T, \ldots, (\mathbf{z}^{(K-1)})^T \right]^T
\]
and the matrix $\boldsymbol{\Delta}$ (the well-known graph Laplacian; see Belkin and Niyogi (2003)),
\[
\boldsymbol{\Delta} = \mathrm{diag}\Big\{ \sum_{j=1}^{n} W_{1,j}, \ldots, \sum_{j=1}^{n} W_{n,j} \Big\} - \mathbf{W}, \tag{4}
\]
allows writing the prior (3) in the more standard Gaussian form
\[
\log p(\mathbf{z}|\mathbf{W}, \boldsymbol{\alpha}) = \log \mathcal{N}(\mathbf{z}|\boldsymbol{\beta}, \boldsymbol{\Psi}^{-1}) = -\frac{1}{2}(\mathbf{z}-\boldsymbol{\beta})^T \boldsymbol{\Psi} (\mathbf{z}-\boldsymbol{\beta}) + \frac{1}{2}\log\!\left( |\boldsymbol{\Psi}|\,(2\pi)^{-n(K-1)} \right), \tag{5}
\]
where the mean $\boldsymbol{\beta}$ and inverse covariance $\boldsymbol{\Psi}$ are given by
\[
\boldsymbol{\beta} = \boldsymbol{\alpha} \otimes \left( (\mathbf{I}_n + \boldsymbol{\Delta})^{-1}\mathbf{1}_n \right) \quad \text{and} \quad \boldsymbol{\Psi} = \mathbf{I}_{K-1} \otimes (\mathbf{I}_n + \boldsymbol{\Delta}). \tag{6}
\]
In (6), $\otimes$ is the Kronecker matrix product, $\mathbf{I}_a$ stands for an $a \times a$ identity matrix, and $\mathbf{1}_a = [1, 1, \ldots, 1]^T$ is a vector of $a$ ones. From this point on, we consider $\mathbf{W}$ (but not $\boldsymbol{\alpha}$) as fixed, thus we omit it and write simply $p(\mathbf{z}|\boldsymbol{\alpha})$.
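The following sketch builds $\boldsymbol{\Delta}$, $\boldsymbol{\beta}$, and $\boldsymbol{\Psi}$ as in (4) and (6); it uses dense matrices purely for illustration (the function name is ours), whereas a practical implementation would exploit the structure of $\mathbf{W}$.

```python
import numpy as np

def gaussian_prior_params(W, alpha):
    """Graph Laplacian (4) and the prior mean/precision of (6).

    W     : (n, n) symmetric matrix of pair-wise preferences (zero diagonal).
    alpha : (K-1,) vector of global means alpha^(k).
    Returns (Delta, beta, Psi); beta has length n*(K-1) and Psi is
    (n*(K-1)) x (n*(K-1)). Dense linear algebra, for illustration only.
    """
    n = W.shape[0]
    Delta = np.diag(W.sum(axis=1)) - W                        # eq. (4)
    A = np.eye(n) + Delta                                     # I_n + Delta
    beta = np.kron(alpha, np.linalg.solve(A, np.ones(n)))     # eq. (6), mean
    Psi = np.kron(np.eye(len(alpha)), A)                      # eq. (6), precision
    return Delta, beta, Psi
```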

3 Model estimation

3.1 Marginal maximum a posteriori and the GEM algorithm

Based on the above formulation, semi-supervised clustering consists in estimating the unknown parameters of the model, $\boldsymbol{\alpha}$, $\mathbf{z}$, and $\boldsymbol{\phi} = \{\phi^{(1)}, \ldots, \phi^{(K)}\}$, taking into account that $\mathcal{Y}$ is missing. For this purpose, we adopt the marginal maximum a posteriori criterion, obtained by marginalizing out the hidden labels; thus, since by Bayes law $p(\mathcal{X}, \mathcal{Y}, \mathbf{z}|\boldsymbol{\phi}, \boldsymbol{\alpha}) = p(\mathcal{X}|\mathcal{Y}, \boldsymbol{\phi})\, P(\mathcal{Y}|\mathbf{z})\, p(\mathbf{z}|\boldsymbol{\alpha})$,
\[
\big( \widehat{\mathbf{z}}, \widehat{\boldsymbol{\phi}}, \widehat{\boldsymbol{\alpha}} \big) = \arg\max_{\mathbf{z}, \boldsymbol{\phi}, \boldsymbol{\alpha}} \sum_{\mathcal{Y}} p(\mathcal{X}|\mathcal{Y}, \boldsymbol{\phi})\, P(\mathcal{Y}|\mathbf{z})\, p(\mathbf{z}|\boldsymbol{\alpha}),
\]
where the sum is over all the possible label configurations, and we are assuming flat priors for $\boldsymbol{\phi}$ and $\boldsymbol{\alpha}$. We address this estimation problem using a generalized EM (GEM) algorithm (see, e.g., McLachlan and Krishnan (1997)), that is, by iterating the following two steps (until some convergence criterion is met):

E-step: Compute the conditional expectation of the complete log-posterior, given the current estimates $(\widehat{\mathbf{z}}, \widehat{\boldsymbol{\phi}}, \widehat{\boldsymbol{\alpha}})$ and the observations $\mathcal{X}$:
\[
Q(\mathbf{z}, \boldsymbol{\phi}, \boldsymbol{\alpha} \mid \widehat{\mathbf{z}}, \widehat{\boldsymbol{\phi}}, \widehat{\boldsymbol{\alpha}}) = E_{\mathcal{Y}}\!\left[ \log p(\mathcal{X}, \mathcal{Y}, \mathbf{z}|\boldsymbol{\phi}, \boldsymbol{\alpha}) \mid \widehat{\mathbf{z}}, \widehat{\boldsymbol{\phi}}, \widehat{\boldsymbol{\alpha}}, \mathcal{X} \right]. \tag{7}
\]

M-step: Update the estimates, that is, compute $(\mathbf{z}_{\mathrm{new}}, \boldsymbol{\phi}_{\mathrm{new}}, \boldsymbol{\alpha}_{\mathrm{new}})$, such that these new values are guaranteed to improve the $Q$ function, i.e.,
\[
Q(\mathbf{z}_{\mathrm{new}}, \boldsymbol{\phi}_{\mathrm{new}}, \boldsymbol{\alpha}_{\mathrm{new}} \mid \widehat{\mathbf{z}}, \widehat{\boldsymbol{\phi}}, \widehat{\boldsymbol{\alpha}}) \geq Q(\widehat{\mathbf{z}}, \widehat{\boldsymbol{\phi}}, \widehat{\boldsymbol{\alpha}} \mid \widehat{\mathbf{z}}, \widehat{\boldsymbol{\phi}}, \widehat{\boldsymbol{\alpha}}). \tag{8}
\]

It is well known that, under mild conditions, a GEM algorithm converges to a local maximum of the marginal log-posterior $p(\mathcal{X}, \mathbf{z}|\boldsymbol{\phi}, \boldsymbol{\alpha})$ (see Wu (1983)).


3.2 E-step

Using equation (1) for $p(\mathcal{X}|\mathcal{Y}, \boldsymbol{\phi})$, equation (2) for $P(\mathcal{Y}|\mathbf{z})$ (notice that $\mathbf{z}$ and $\mathcal{Z}$ are the same), and equation (5) for $p(\mathbf{z}|\boldsymbol{\alpha})$, leads to
\[
\log p(\mathcal{X}, \mathcal{Y}, \mathbf{z}|\boldsymbol{\phi}, \boldsymbol{\alpha}) \doteq \sum_{i=1}^{n} \sum_{k=1}^{K} y_i^{(k)} \log p(\mathbf{x}_i|\phi^{(k)}) - \frac{(\mathbf{z}-\boldsymbol{\beta})^T \boldsymbol{\Psi} (\mathbf{z}-\boldsymbol{\beta})}{2} + \sum_{i=1}^{n} \left[ \sum_{k=1}^{K} y_i^{(k)} z_i^{(k)} - \log \sum_{k=1}^{K} e^{z_i^{(k)}} \right], \tag{9}
\]
where $\doteq$ stands for "equal apart from an additive constant". The important thing to notice here is that this function is linear w.r.t. the hidden variables $y_i^{(k)}$. Thus, the E-step reduces to the computation of their conditional expectations, which are then plugged into $\log p(\mathcal{X}, \mathcal{Y}, \mathbf{z}|\boldsymbol{\phi}, \boldsymbol{\alpha})$.

As in standard mixtures, the missing $y_i^{(k)}$ are binary, thus their expectations (denoted $\widehat{y}_i^{(k)}$) are equal to their probabilities of being equal to one, which can be obtained via Bayes law:
\[
\widehat{y}_i^{(k)} \equiv E[y_i^{(k)} \mid \widehat{\mathbf{z}}, \widehat{\boldsymbol{\phi}}, \widehat{\boldsymbol{\alpha}}, \mathcal{X}] = P(y_i^{(k)} = 1 \mid \widehat{\mathbf{z}}_i, \widehat{\boldsymbol{\phi}}, \mathbf{x}_i) = \frac{p(\mathbf{x}_i|\widehat{\phi}^{(k)})\, P(y_i^{(k)} = 1|\widehat{\mathbf{z}}_i)}{\sum_{j=1}^{K} p(\mathbf{x}_i|\widehat{\phi}^{(j)})\, P(y_i^{(j)} = 1|\widehat{\mathbf{z}}_i)}. \tag{10}
\]
Notice that (10) is similar to the E-step for a standard finite mixture (see, e.g., McLachlan and Krishnan (1997)), with the probabilities $P(y_i^{(k)} = 1|\widehat{\mathbf{z}}_i) = \exp(\widehat{z}_i^{(k)}) / \sum_j \exp(\widehat{z}_i^{(j)})$ playing the role of the class probabilities. Finally, the $Q$ function is obtained by plugging the $\widehat{y}_i^{(k)}$ into (9).
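A possible NumPy sketch of the E-step (10) is given below; Gaussian components are again only an illustrative choice, and the function name is ours.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, Z, params):
    """E-step of (10): posterior probabilities y_hat_i^(k).

    X      : (n, d) observations.
    Z      : (n, K-1) current z estimates (z_i^(K) = 0 implicitly).
    params : list of K (mean, cov) pairs; Gaussian components are used
             here only as an illustrative choice of p(.|phi^(k)).
    Returns an (n, K) array of responsibilities.
    """
    # class "prior" probabilities from the logistic link, eq. (2)
    Zfull = np.column_stack([Z, np.zeros(Z.shape[0])])
    Zfull -= Zfull.max(axis=1, keepdims=True)
    prior = np.exp(Zfull)
    prior /= prior.sum(axis=1, keepdims=True)
    # component densities p(x_i | phi^(k))
    dens = np.column_stack(
        [multivariate_normal.pdf(X, mean=m, cov=C) for (m, C) in params]
    )
    num = dens * prior
    return num / num.sum(axis=1, keepdims=True)
```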

3.3 M-step: density parameters φ

It is clear from (9) that the maximization w.r.t. $\boldsymbol{\phi}$ can be decoupled into
\[
\phi^{(k)}_{\mathrm{new}} = \arg\max_{\phi^{(k)}} \sum_{i=1}^{n} \widehat{y}_i^{(k)} \log p(\mathbf{x}_i|\phi^{(k)}), \quad k = 1, \ldots, K. \tag{11}
\]
This is the well-known weighted maximum likelihood criterion, exactly as it appears in the M-step for standard mixtures. The specific form of this update depends on the choice of $p(\cdot|\phi^{(k)})$; e.g., this step can be easily applied to any finite mixture of exponential family densities (see Banerjee et al. (2004)), of which the Gaussian is by far the one most often adopted.
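For the Gaussian case mentioned above, the weighted ML update (11) reduces to weighted means and covariances; a possible sketch follows (the small ridge added to the covariance is our own numerical safeguard, not part of the paper).

```python
import numpy as np

def m_step_gaussian(X, R):
    """Weighted ML update (11) for Gaussian components (one common choice).

    X : (n, d) observations.
    R : (n, K) responsibilities y_hat from the E-step.
    Returns a list of K (mean, cov) pairs.
    """
    n, d = X.shape
    params = []
    for k in range(R.shape[1]):
        w = R[:, k]
        sw = w.sum()
        mean = (w[:, None] * X).sum(axis=0) / sw
        diff = X - mean
        cov = (w[:, None] * diff).T @ diff / sw + 1e-6 * np.eye(d)  # small ridge
        params.append((mean, cov))
    return params
```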

3.4 M-step: z and α

The $\mathbf{z}$ and $\boldsymbol{\alpha}$ estimates are updated by maximizing (or increasing, see (8))
\[
L(\mathbf{z}, \boldsymbol{\alpha}) \equiv \sum_{i=1}^{n} \left[ \sum_{k=1}^{K} \widehat{y}_i^{(k)} z_i^{(k)} - \log \sum_{k=1}^{K} e^{z_i^{(k)}} \right] - \frac{1}{2}(\mathbf{z}-\boldsymbol{\beta})^T \boldsymbol{\Psi} (\mathbf{z}-\boldsymbol{\beta}). \tag{12}
\]
Ignoring the second term (the log-prior), this would correspond to a standard logistic regression (LR) problem, with an identity design matrix (Böhning (1992)), but where instead of the usual hard labels $y_i^{(k)} \in \{0,1\}$ we have soft labels $\widehat{y}_i^{(k)} \in [0,1]$.

The standard approaches to maximum likelihood LR are the Newton-Raphson algorithm (also known as iteratively reweighted least squares, IRLS; see Hastie et al. (2001)) and the bound optimization approach (BOA) (see Böhning (1992) and Lange et al. (2000)). In the presence of the log-prior, with a fixed $\boldsymbol{\alpha}$ (thus fixed $\boldsymbol{\beta}$), we have a quadratically penalized LR problem, and it is easy to modify either the IRLS or the BOA (see below) for this case. However, since we assume that $\boldsymbol{\alpha}$ is unknown, we adopt a scheme in which we maximize w.r.t. $\mathbf{z}$ and $\boldsymbol{\alpha}$ in an iterative fashion, by cycling through

\[
\boldsymbol{\alpha}_{\mathrm{new}} = \arg\max_{\boldsymbol{\alpha}} L(\widehat{\mathbf{z}}, \boldsymbol{\alpha}) \tag{13}
\]
\[
\mathbf{z}_{\mathrm{new}} = \arg\max_{\mathbf{z}} L(\mathbf{z}, \boldsymbol{\alpha}_{\mathrm{new}}). \tag{14}
\]
It turns out that, although (13) has a very simple closed-form solution,
\[
\alpha^{(k)}_{\mathrm{new}} = \frac{1}{n} \sum_{i=1}^{n} \widehat{z}_i^{(k)}, \quad \text{for } k = 1, \ldots, K-1, \tag{15}
\]
the maximization (14) can only be solved iteratively. Adopting the BOA will lead to a simple update equation which, for certain choices of $\mathbf{W}$, can be implemented very efficiently.

We now briefly recall the BOA for maximizing a concave function with bounded Hessian (see Böhning (1992)). Let $G(\boldsymbol{\theta})$ be a concave differentiable function, such that its Hessian $\mathbf{H}(\boldsymbol{\theta})$ is bounded below by $-\mathbf{B}$ (i.e., $\mathbf{H}(\boldsymbol{\theta}) \succeq -\mathbf{B}$ in the matrix sense, meaning that $\mathbf{H}(\boldsymbol{\theta}) + \mathbf{B}$ is positive semi-definite), where $\mathbf{B}$ is a positive definite matrix. Then, it is easy to show that the iteration
\[
\boldsymbol{\theta}_{\mathrm{new}} = \arg\max_{\boldsymbol{\theta}} \left\{ \boldsymbol{\theta}^T \mathbf{g}(\widehat{\boldsymbol{\theta}}) - \frac{1}{2}(\boldsymbol{\theta} - \widehat{\boldsymbol{\theta}})^T \mathbf{B}\, (\boldsymbol{\theta} - \widehat{\boldsymbol{\theta}}) \right\} = \widehat{\boldsymbol{\theta}} + \mathbf{B}^{-1}\mathbf{g}(\widehat{\boldsymbol{\theta}}),
\]
where $\mathbf{g}(\widehat{\boldsymbol{\theta}})$ denotes the gradient of $G(\boldsymbol{\theta})$ at $\widehat{\boldsymbol{\theta}}$, monotonically improves $G(\boldsymbol{\theta})$, i.e., $G(\boldsymbol{\theta}_{\mathrm{new}}) \geq G(\widehat{\boldsymbol{\theta}})$. In our specific problem, the gradient of the logistic log-likelihood function (i.e., (12) without the log-prior) is $\mathbf{g}(\mathbf{z}) = \widehat{\mathbf{y}} - \mathbf{p}$ and the Hessian verifies
\[
\mathbf{H}(\mathbf{z}) \succeq -\frac{1}{2}\left[ \left( \mathbf{I}_{K-1} - \frac{\mathbf{1}_{K-1}\mathbf{1}_{K-1}^T}{K} \right) \otimes \mathbf{I}_n \right] \equiv -\mathbf{B}, \tag{16}
\]
where $\widehat{\mathbf{y}} = [\widehat{y}_1^{(1)}, \ldots, \widehat{y}_n^{(1)}, \widehat{y}_1^{(2)}, \ldots, \widehat{y}_n^{(2)}, \ldots, \widehat{y}_1^{(K-1)}, \ldots, \widehat{y}_n^{(K-1)}]^T$ is a vector arrangement of the soft labels, and $\mathbf{p} = [p_1^{(1)}, \ldots, p_n^{(1)}, p_1^{(2)}, \ldots, p_n^{(2)}, \ldots, p_1^{(K-1)}, \ldots, p_n^{(K-1)}]^T$ with $p_i^{(k)} = e^{z_i^{(k)}} / \sum_j e^{z_i^{(j)}}$.


The update equation for solving (14) via a BOA is thus
\[
\begin{aligned}
\mathbf{z}_{\mathrm{new}} &= \arg\max_{\mathbf{z}} \left\{ 2\,\mathbf{z}^T \mathbf{g}(\widehat{\mathbf{z}}) - (\mathbf{z} - \widehat{\mathbf{z}})^T \mathbf{B}\,(\mathbf{z} - \widehat{\mathbf{z}}) - (\mathbf{z} - \boldsymbol{\beta}_{\mathrm{new}})^T \boldsymbol{\Psi}\,(\mathbf{z} - \boldsymbol{\beta}_{\mathrm{new}}) \right\} \\
&= (\mathbf{B} + \boldsymbol{\Psi})^{-1}\left( \mathbf{g}(\widehat{\mathbf{z}}) + \mathbf{B}\,\widehat{\mathbf{z}} + \boldsymbol{\Psi}\,\boldsymbol{\beta}_{\mathrm{new}} \right) \\
&= (\mathbf{B} + \boldsymbol{\Psi})^{-1}\left[ \mathbf{g}(\widehat{\mathbf{z}}) + \mathbf{B}\,\widehat{\mathbf{z}} + \boldsymbol{\alpha}_{\mathrm{new}} \otimes \mathbf{1}_n \right],
\end{aligned} \tag{17}
\]
where, according to the definition of $\boldsymbol{\beta}$ in (6), we write $\boldsymbol{\beta}_{\mathrm{new}} = \boldsymbol{\alpha}_{\mathrm{new}} \otimes \left[ (\mathbf{I}_n + \boldsymbol{\Delta})^{-1}\mathbf{1}_n \right]$. The equality $\boldsymbol{\Psi}\boldsymbol{\beta}_{\mathrm{new}} = \boldsymbol{\alpha}_{\mathrm{new}} \otimes \mathbf{1}_n$ is shown in the appendix.

Notice that $(\mathbf{B} + \boldsymbol{\Psi})^{-1}$ needs only be computed once. This is the fundamental advantage of the BOA over IRLS, which would require the inversion of a new matrix at each iteration (see Böhning (1992)).

Summarizing our GEM algorithm: the E-step is the application of (10), for all $i$ and $k$; the M-step consists of (11) followed by (one or more) applications of (13)-(14). Eq. (13) is implemented by (15), and (14) is implemented by (one or more) applications of (17).
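For concreteness, the following dense NumPy sketch performs one cycle of (15) followed by one application of the exact-bound update (17); the function name, the stacking conventions, and the per-call construction of the matrices are ours, for illustration only (in practice $(\mathbf{B} + \boldsymbol{\Psi})^{-1}$ would be precomputed once).

```python
import numpy as np

def boa_z_alpha_cycle(z_hat, y_hat, p_hat, Delta, K):
    """One cycle of (15) and (17): closed-form alpha update, then a BOA z step.

    z_hat, y_hat, p_hat : (n*(K-1),) stacked vectors, class-major order
                          (all of z^(1), then z^(2), ...).
    Delta               : (n, n) graph Laplacian from (4).
    Dense linear algebra throughout; an illustration, not the efficient path.
    """
    n = Delta.shape[0]
    alpha_new = z_hat.reshape(K - 1, n).mean(axis=1)          # eq. (15)
    # exact bound B of (16) and prior precision Psi of (6)
    M = np.eye(K - 1) - np.ones((K - 1, K - 1)) / K
    B = 0.5 * np.kron(M, np.eye(n))
    Psi = np.kron(np.eye(K - 1), np.eye(n) + Delta)
    Ainv = np.linalg.inv(B + Psi)          # in practice, computed only once
    g = y_hat - p_hat                      # gradient of the logistic term
    z_new = Ainv @ (g + B @ z_hat + np.kron(alpha_new, np.ones(n)))  # eq. (17)
    return z_new, alpha_new
```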

3.5 Speeding up the algorithm

The inversion $(\mathbf{B}+\boldsymbol{\Psi})^{-1}$, although it can be performed off-line, may be costly because $(\mathbf{B} + \boldsymbol{\Psi})$ is an $(n(K-1)) \times (n(K-1))$ matrix. We can alleviate this problem at the cost of using a less tight bound in (16), as shown by the following lemma (proved in the appendix):

Lemma 1. Let $\xi_K = 1/2$, if $K > 2$, and $\xi_K = 1/4$, if $K = 2$; let $\mathbf{B}$ be defined as in (16). Then, $\mathbf{B} \preceq \xi_K \mathbf{I}_{n(K-1)}$.

This lemma allows replacing $\mathbf{B}$ by $\xi_K \mathbf{I}_{n(K-1)}$ in the BOA (because, obviously, $\mathbf{H}(\mathbf{z}) \succeq -\mathbf{B} \succeq -\xi_K \mathbf{I}_{n(K-1)}$). The matrix inversion in (17) becomes (see proof in appendix)
\[
(\xi_K \mathbf{I}_{n(K-1)} + \boldsymbol{\Psi})^{-1} = \mathbf{I}_{K-1} \otimes \left( (\xi_K + 1)\mathbf{I}_n + \boldsymbol{\Delta} \right)^{-1}, \tag{18}
\]
which means that we avoid the $(n(K-1)) \times (n(K-1))$ inversion and are left with a single $n \times n$ inversion; for a general matrix (assuming the standard inversion cost of $O(n^3)$), this yields a computational saving of roughly $(K-1)^3$, which for large $K$ can be very meaningful. Finally, careful observation of
\[
\mathbf{z}_{\mathrm{new}} = \left[ \mathbf{I}_{K-1} \otimes \left[ (\xi_K + 1)\mathbf{I}_n + \boldsymbol{\Delta} \right]^{-1} \right] \left( \mathbf{g}(\widehat{\mathbf{z}}) + \xi_K\, \widehat{\mathbf{z}} + \boldsymbol{\alpha} \otimes \mathbf{1}_n \right)
\]
reveals that it can be decoupled among the several $\mathbf{z}^{(k)}$, yielding
\[
\mathbf{z}^{(k)}_{\mathrm{new}} = \left[ (\xi_K + 1)\mathbf{I}_n + \boldsymbol{\Delta} \right]^{-1} \left( \widehat{\mathbf{y}}^{(k)} - \mathbf{p}^{(k)} + \xi_K\, \widehat{\mathbf{z}}^{(k)} + \alpha^{(k)}\mathbf{1}_n \right). \tag{19}
\]
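With the looser bound of Lemma 1, the decoupled update (19) can be implemented with a single sparse factorization, reused across classes and iterations. A possible sketch, assuming SciPy's sparse routines (the function names are ours):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import factorized

def make_z_update(W, K):
    """Returns a function implementing the decoupled update (19).

    W is a sparse (n, n) preference matrix; the (xi_K + 1) I + Delta factor
    is factorized once and reused for every class and every iteration.
    """
    n = W.shape[0]
    Delta = sp.diags(np.asarray(W.sum(axis=1)).ravel()) - W      # eq. (4)
    xi = 0.25 if K == 2 else 0.5                                 # Lemma 1
    solve = factorized(((xi + 1) * sp.eye(n) + Delta).tocsc())   # LU, done once

    def update(z_hat, y_hat, p_hat, alpha):
        """z_hat, y_hat, p_hat : (n, K-1) arrays; alpha : (K-1,) vector."""
        z_new = np.empty_like(z_hat)
        for k in range(K - 1):
            rhs = y_hat[:, k] - p_hat[:, k] + xi * z_hat[:, k] + alpha[k]
            z_new[:, k] = solve(rhs)                             # eq. (19)
        return z_new

    return update
```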


4 Application to image segmentation

Let $\mathcal{L} = \{i = (r,c),\ r = 1, \ldots, N,\ c = 1, \ldots, M\}$ be a 2D lattice of $n = |\mathcal{L}| = MN$ sites/pixels. A $K$-segmentation $\mathcal{R} = \{R_k \subseteq \mathcal{L},\ k = 1, \ldots, K\}$ is a partition of $\mathcal{L}$ into $K$ regions. As above, $\mathbf{y}$ is a "1-of-$K$" encoding of $\mathcal{R}$, i.e., $(y_i^{(k)} = 1) \Leftrightarrow (i \in R_k)$. The observation model (1) covers intensity-based or texture-based segmentation (each $\mathbf{x}_i$ can be a $d$-dimensional vector of texture features) and segmentation of multi-spectral images (e.g., color or remote sensing images, with $d$ the number of spectral bands).

4.1 Stationary Gaussian field priors

If $W_{i,j}$ only depends on the relative position of $i$ and $j$, the Gaussian field prior is stationary. If, in addition, the neighborhood system defined by $\mathbf{W}$ has periodic boundary conditions, both $\mathbf{W}$ and $\boldsymbol{\Delta}$ are block-circulant, with circulant blocks (see Balram and Moura (1993)), thus are diagonalized by a 2D discrete Fourier transform (2D-DFT): $\boldsymbol{\Delta} = \mathbf{U}^H \mathbf{D} \mathbf{U}$, where $\mathbf{D}$ is diagonal, $\mathbf{U}$ is the matrix representation of the 2D-DFT, and $(\cdot)^H$ denotes conjugate transpose. Since $\mathbf{U}$ is orthogonal ($\mathbf{U}^H\mathbf{U} = \mathbf{U}\mathbf{U}^H = \mathbf{I}$), (19) can be written as
\[
\mathbf{z}^{(k)}_{\mathrm{new}} = \mathbf{U}^H \left[ (\xi_K + 1)\mathbf{I}_n + \mathbf{D} \right]^{-1} \mathbf{U} \left( \widehat{\mathbf{y}}^{(k)} - \mathbf{p}^{(k)} + \xi_K\, \widehat{\mathbf{z}}^{(k)} + \alpha^{(k)}\mathbf{1}_n \right). \tag{20}
\]
We now have a trivial diagonal inversion, and the matrix-vector products by $\mathbf{U}$ and $\mathbf{U}^H$ are not carried out explicitly but rather (very efficiently) via the fast Fourier transform (FFT) algorithm.
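For a 4-nearest-neighbour $\mathbf{W}$ with unit weights on a periodic lattice, a possible FFT implementation of (20) is sketched below; the hard-coded kernel corresponds to that particular $\mathbf{W}$, and the function name is ours.

```python
import numpy as np

def fft_z_update(z_hat, y_hat, p_hat, alpha_k, xi=0.5):
    """Update (20) for one class k on an (N, M) periodic lattice with a
    4-nearest-neighbour W (W_ij = 1 for neighbours). All array inputs are
    (N, M) images (N, M >= 3); alpha_k is a scalar; xi = 0.5 assumes K > 2.
    """
    N, M = z_hat.shape
    # eigenvalues of Delta: 2D DFT of its (periodic) convolution kernel
    kernel = np.zeros((N, M))
    kernel[0, 0] = 4.0
    kernel[1, 0] = kernel[-1, 0] = kernel[0, 1] = kernel[0, -1] = -1.0
    d = np.real(np.fft.fft2(kernel))
    rhs = y_hat - p_hat + xi * z_hat + alpha_k
    return np.real(np.fft.ifft2(np.fft.fft2(rhs) / (xi + 1.0 + d)))
```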

4.2 Wavelet-based priors

It is known that piece-wise smooth images have sparse wavelet-based representations (see Mallat (1998)); this fact underlies the excellent performance of wavelet-based denoising and compression techniques. Piece-wise smoothness of the $\mathbf{z}^{(k)}$ translates into segmentations in which pixels in each class tend to form connected regions. Consider a wavelet expansion of each $\mathbf{z}^{(k)}$,
\[
\mathbf{z}^{(k)} = \mathbf{L}\,\boldsymbol{\theta}^{(k)}, \quad k = 1, \ldots, K-1, \tag{21}
\]
where the $\boldsymbol{\theta}^{(k)}$ are sets of coefficients and $\mathbf{L}$ is now a matrix in which each column is a wavelet basis function; $\mathbf{L}$ may be orthogonal or have more columns than rows (e.g., in over-complete, shift-invariant representations) (see Mallat (1998)). The goal now is to estimate $\boldsymbol{\theta} = \{\boldsymbol{\theta}^{(1)}, \ldots, \boldsymbol{\theta}^{(K-1)}\}$, under a sparseness prior $p(\boldsymbol{\theta})$. Classical choices for $p(\boldsymbol{\theta})$ are independent generalized Gaussians (see Moulin and Liu (1999)); a particularly well-known case is the Laplacian,
\[
p(\boldsymbol{\theta}) = \prod_{k=1}^{K-1} \prod_{j} \frac{\lambda}{2} \exp\{ -\lambda\, |\theta_j^{(k)}| \}, \tag{22}
\]
which induces a strongly non-Gaussian, non-Markovian prior $p(\mathbf{z})$, via (21).


The impact of adopting this wavelet-based prior is that the logistic regression equations (see (12)) now have $\mathbf{L}$ as the design matrix, instead of the identity; thus, matrix $\mathbf{I}_n$ in (16) must be replaced by $\mathbf{L}^T\mathbf{L}$ and all occurrences of $z_i^{(k)}$ replaced by $(\mathbf{L}\,\boldsymbol{\theta}^{(k)})_i$. Finally, in Lemma 1, $\xi_K$ becomes $C/2$ and $C/4$, for $K > 2$ and $K = 2$, respectively, where $C$ is the maximum eigenvalue of $\mathbf{L}^T\mathbf{L}$. Propagating all these changes through the derivations of the GEM algorithm leads to a simple closed-form update equation which involves the well-known soft-threshold non-linearity (see Figueiredo (2005) for details).
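For reference only, the soft-threshold non-linearity mentioned above is the standard elementwise operator sketched below; how exactly it enters the coefficient update is detailed in Figueiredo (2005) and is not reproduced here.

```python
import numpy as np

def soft_threshold(theta, tau):
    """Elementwise soft-threshold: sign(theta) * max(|theta| - tau, 0).

    theta : array of wavelet coefficients.
    tau   : non-negative threshold (related to the Laplacian parameter lambda).
    """
    return np.sign(theta) * np.maximum(np.abs(theta) - tau, 0.0)
```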

5 Experiments

5.1 Semi-supervised clustering

We show a simple toy experiment illustrating the behavior of the algorithm. The 900 data points in Fig. 1 (a) were generated by 6 circular Gaussians; the desired grouping is the one shown by the symbols: stars, circles, and dots. In Fig. 1 (b), we show the result of estimating a 3-component Gaussian mixture using standard EM, which is of course totally unaware of the desired grouping. Prior information about the desired grouping is then embodied in the following $\mathbf{W}$ matrix: we randomly choose 300 pairs of points (out of $900 \times 899/2 \approx 4 \times 10^5$ possible pairs) such that both points belong to the same desired group, and set the corresponding $W_{i,j}$ to one. The remaining elements of $\mathbf{W}$ are zero; notice that this is a highly sparse matrix. The mixture components produced by the proposed algorithm, under this prior knowledge, are shown in Fig. 1 (d). Convergence is obtained in about 30 to 50 GEM iterations.
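A possible way to build the sparse $\mathbf{W}$ just described is sketched below; the sampling details beyond what the text states (e.g., the random seed) are our own illustrative choices.

```python
import numpy as np
import scipy.sparse as sp

def pairwise_preference_matrix(labels, n_pairs=300, seed=0):
    """Sparse symmetric W for the toy experiment: a few randomly chosen
    same-group pairs get W_ij = W_ji = 1; everything else (including the
    diagonal) stays zero.

    labels : (n,) array with the desired group of each point, used only
             to sample pairs of points belonging to the same group.
    """
    rng = np.random.default_rng(seed)
    n = len(labels)
    pairs = set()
    while len(pairs) < n_pairs:
        i, j = rng.integers(0, n, size=2)
        if i != j and labels[i] == labels[j]:
            pairs.add((min(i, j), max(i, j)))
    rows, cols = zip(*pairs)
    W = sp.coo_matrix((np.ones(len(pairs)), (rows, cols)), shape=(n, n))
    return (W + W.T).tocsr()
```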

5.2 Image segmentation

We only have space to show a simple example (see Fig. 2). The observed image contains 4 regions, following Gaussian distributions with standard deviation 0.6 and means 1, 2, 3, and 4. Matrix $\mathbf{W}$ for the Gaussian field prior has $W_{i,j} = 1$ if $i$ and $j$ are nearest neighbors, and zero otherwise. More details and results, namely with real images, can be found in Figueiredo (2005).

6 Conclusions

We have introduced a new formulation for semi-supervised clustering and shown how it leads to a simple GEM algorithm. We have demonstrated how our formulation can be applied to image segmentation with spatial priors, and have illustrated this using Gaussian field priors and wavelet-based priors. Future work includes a thorough experimental evaluation of the method, application to other problems, and extension to the case of unknown $K$.


Fig. 1. Toy experiment with semi-supervised clustering; see text for details.

Appendix: Some proofs

Proof of $\boldsymbol{\Psi}\boldsymbol{\beta}_{\mathrm{new}} = \boldsymbol{\alpha}_{\mathrm{new}} \otimes \mathbf{1}_n$ (used in (17)): Since $(\mathbf{M} \otimes \mathbf{P})(\mathbf{Q} \otimes \mathbf{R}) = \mathbf{M}\mathbf{Q} \otimes \mathbf{P}\mathbf{R}$, we have
\[
\boldsymbol{\Psi}\boldsymbol{\beta}_{\mathrm{new}} = \left[ \mathbf{I}_{K-1} \otimes (\mathbf{I}_n + \boldsymbol{\Delta}) \right]\left[ \boldsymbol{\alpha}_{\mathrm{new}} \otimes \left( (\mathbf{I}_n + \boldsymbol{\Delta})^{-1}\mathbf{1}_n \right) \right] = \mathbf{I}_{K-1}\boldsymbol{\alpha}_{\mathrm{new}} \otimes \left[ (\mathbf{I}_n + \boldsymbol{\Delta})(\mathbf{I}_n + \boldsymbol{\Delta})^{-1}\mathbf{1}_n \right] = \boldsymbol{\alpha}_{\mathrm{new}} \otimes \mathbf{1}_n. \qquad \square
\]

Proof of Lemma 1: Inserting $K = 2$ in (16) yields $\mathbf{B} = \mathbf{I}/4$. For $K > 2$, the inequality $\mathbf{I}/2 \succeq \mathbf{B}$ is equivalent to $\lambda_{\min}(\mathbf{I}/2 - \mathbf{B}) \geq 0$, which is equivalent to $\lambda_{\max}(\mathbf{B}) \leq 1/2$. Since the eigenvalues of a Kronecker product are the products of the eigenvalues of its factors, $\lambda_{\max}(\mathbf{B}) = \lambda_{\max}(\mathbf{I} - (1/K)\mathbf{1}\mathbf{1}^T)/2$. Since $\mathbf{1}\mathbf{1}^T$ is a rank-1 matrix with eigenvalues $\{0, \ldots, 0, K-1\}$, the eigenvalues of $(\mathbf{I} - (1/K)\mathbf{1}\mathbf{1}^T)$ are $\{1, \ldots, 1, 1/K\}$, thus $\lambda_{\max}(\mathbf{I} - (1/K)\mathbf{1}\mathbf{1}^T) = 1$, and $\lambda_{\max}(\mathbf{B}) = 1/2$. $\square$

Proof of equality (18): Using $(\mathbf{M} \otimes \mathbf{P})^{-1} = \mathbf{M}^{-1} \otimes \mathbf{P}^{-1}$ and $\mathbf{I}_a \otimes \mathbf{I}_b = \mathbf{I}_{ab}$, and the definition of $\boldsymbol{\Psi}$ in (6), we can write
\[
\begin{aligned}
(\xi_K \mathbf{I}_{n(K-1)} + \boldsymbol{\Psi})^{-1} &= \left( \xi_K \mathbf{I}_{n(K-1)} + \mathbf{I}_{K-1} \otimes (\mathbf{I}_n + \boldsymbol{\Delta}) \right)^{-1} \\
&= \left( \xi_K \mathbf{I}_{K-1} \otimes \mathbf{I}_n + \mathbf{I}_{K-1} \otimes (\mathbf{I}_n + \boldsymbol{\Delta}) \right)^{-1} \\
&= \left( \mathbf{I}_{K-1} \otimes \left( (\xi_K + 1)\mathbf{I}_n + \boldsymbol{\Delta} \right) \right)^{-1} \\
&= \mathbf{I}_{K-1} \otimes \left( (\xi_K + 1)\mathbf{I}_n + \boldsymbol{\Delta} \right)^{-1}. \qquad \square
\end{aligned}
\]


Fig. 2. Image segmentation example: upper left, observed image; upper right, maximum likelihood segmentation; lower left, SSC-based segmentation under Gaussian field prior; lower right, SSC-based segmentation under wavelet-based prior.

References

BALRAM, N. and MOURA, J. (1993): Noncausal Gauss-Markov Random Fields: Parameter Structure and Estimation. IEEE Transactions on Information Theory, 39, 1333–1355.

BANERJEE, A., MERUGU, S., DHILLON, I. and GHOSH, J. (2004): Clustering With Bregman Divergences. Proc. SIAM International Conference on Data Mining, Lake Buena Vista.

BASU, S., BILENKO, M. and MOONEY, R. (2004): A Probabilistic Framework for Semi-supervised Clustering. Proc. International Conference on Knowledge Discovery and Data Mining, Seattle.

BELKIN, M. and NIYOGI, P. (2003): Using Manifold Structure for Partially Labelled Classification. Proc. Neural Information Processing Systems 15, MIT Press, Cambridge.

BÖHNING, D. (1992): Multinomial Logistic Regression Algorithm. Annals of the Institute of Statistical Mathematics, 44, 197–200.

CEBRON, N. and BERTHOLD, M. (2006): Mining of Cell Assay Images Using Active Semi-supervised Clustering. Proc. Workshop on Computational Intelligence in Data Mining, Houston.

FIGUEIREDO, M. (2005): Bayesian Image Segmentation Using Wavelet-based Priors. Proc. IEEE Conference on Computer Vision and Pattern Recognition, San Diego.

GRIRA, N., CRUCIANU, M. and BOUJEMAA, N. (2005): Active and Semi-supervised Clustering for Image Database Categorization. Proc. IEEE/EURASIP Workshop on Content Based Multimedia Indexing, Riga, Latvia.

HASTIE, T., TIBSHIRANI, R. and FRIEDMAN, J. (2001): The Elements of Statistical Learning. Springer, New York.

KRISHNAPURAM, B., WILLIAMS, D., XUE, Y., HARTEMINK, A., CARIN, L. and FIGUEIREDO, M. (2005): On Semi-supervised Classification. Proc. Neural Information Processing Systems 17, MIT Press, Cambridge.

LANGE, K., HUNTER, D. and YANG, I. (2000): Optimization Transfer Using Surrogate Objective Functions. Journal of Computational and Graphical Statistics, 9, 1–59.

LAW, M., TOPCHY, A. and JAIN, A. K. (2005): Model-based Clustering With Probabilistic Constraints. Proc. SIAM Conference on Data Mining, Newport Beach.

LI, S. (2001): Markov Random Field Modelling in Computer Vision. Springer, Tokyo.

LU, Z. and LEEN, T. (2005): Probabilistic Penalized Clustering. Proc. Neural Information Processing Systems 17, MIT Press, Cambridge.

MALLAT, S. (1998): A Wavelet Tour of Signal Processing. Academic Press, San Diego.

MCLACHLAN, G. and KRISHNAN, T. (1997): The EM Algorithm and Extensions. Wiley, New York.

MOULIN, P. and LIU, J. (1999): Analysis of Multiresolution Image Denoising Schemes Using Generalized-Gaussian and Complexity Priors. IEEE Transactions on Information Theory, 45, 909–919.

NIKKILÄ, J., TÖRÖNEN, P., SINKKONEN, J. and KASKI, S. (2001): Analysis of Gene Expression Data Using Semi-supervised Clustering. Proc. Bioinformatics 2001, Skövde.

SEEGER, M. (2001): Learning With Labelled and Unlabelled Data. Technical Report, Institute for Adaptive and Neural Computation, University of Edinburgh.

SHENTAL, N., BAR-HILLEL, A., HERTZ, T. and WEINSHALL, D. (2003): Computing Gaussian Mixture Models With EM Using Equivalence Constraints. Proc. Neural Information Processing Systems 15, MIT Press, Cambridge.

WAGSTAFF, K., CARDIE, C., ROGERS, S. and SCHRÖDL, S. (2001): Constrained K-means Clustering With Background Knowledge. Proc. International Conference on Machine Learning, Williamstown.

WU, C. (1983): On the Convergence Properties of the EM Algorithm. Annals of Statistics, 11, 95–103.

ZHONG, S. (2006): Semi-supervised Model-based Document Clustering: A Comparative Study. Machine Learning, 2006 (in press).

ZHU, X. (2006): Semi-Supervised Learning Literature Survey. Technical Report, Computer Sciences Department, University of Wisconsin, Madison.

ZHU, X., GHAHRAMANI, Z. and LAFFERTY, J. (2003): Semi-supervised Learning Using Gaussian Fields and Harmonic Functions. Proc. International Conference on Machine Learning, Washington DC.