
Sequential Minimal Optimization in Convex Clustering Repetitions

Rikiya Takahashi∗

IBM Research-Tokyo, 1623-14 Shimo-tsuruma, Yamato-shi, Kanagawa 242-8502, Japan

Received 19 June 2011; revised 22 August 2011; accepted 24 October 2011. DOI: 10.1002/sam.10146

Published online 16 December 2011 in Wiley Online Library (wileyonlinelibrary.com).

Abstract: Computing not the local, but the global optimum of a cluster assignment is one of the important aspects in clustering. Fitting a Gaussian mixture model is a method of soft clustering where optimization of the mixture weights is convex if centroids and bandwidths of the clusters remain unchanged during the updates. The global optimum of the mixture weights is sparse, and clustering that utilizes the fitted sparse mixture model is called the convex clustering. To make the convex clustering practical in real applications, the author addresses three types of issues classified as (i) computational inefficiency of the Expectation-Maximization algorithm, (ii) inconsistency of the bandwidth specifications between clustering and density estimation for high-dimensional data, and (iii) selection of the optimal clustering from several bandwidth settings. The extremely large number of iterations needed in the Expectation-Maximization algorithm is significantly reduced with an accurate pruning while choosing a pair of kernels and an element-wise Newton–Raphson method. For high-dimensional data, the convex clusterings are performed several times, with initially large bandwidths and succeeding smaller bandwidths. Since the number of clusters cannot be specified precisely in the convex clustering, practitioners often try multiple settings of the initial bandwidths. To choose the optimal clustering from the multiple results, the author proposes an empirical-Bayes method that can choose appropriate bandwidths if the true clusters are Gaussian. The combination of the repetitions of the convex clusterings and the empirical-Bayes model selection achieves stable prediction performances compared to the existing mixture learning methods. © 2011 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 5: 70–89, 2012

Keywords: convex clustering; sequential minimal optimization; empirical-Bayes method

1. INTRODUCTION

Clustering and mixture-learning are two important data mining and machine learning tasks, but they suffer from local optimality. Many of the datasets used in data mining are expressed as sets of multidimensional vectors. For such vector data, fitting Gaussian mixture models with a k-means [1–3] or Expectation-Maximization (EM) [4] algorithm is one of the basic techniques to summarize the characteristics, where each exemplar is assigned to one of the k compact groups called clusters. Despite their importance, clustering with the k-means or EM algorithms requires random initializations and usually involves a local optimality issue, as shown in Fig. 1. To avoid this issue, practitioners often try the k-means algorithms many times, or use deterministic annealing (DA) [5,6]. In some applications, we need a clustering method whose partitioning results are reproducible and do not depend on the random initializations. For example, in marketing applications, clustering is often

∗ Correspondence to: Rikiya Takahashi ([email protected])

used to segment customers, and marketers then design targeting strategies depending on these segmentation results. When several trials give different partitioning results, marketers cannot execute the targeting robustly, and they often find it difficult to adopt such unstable clustering algorithms. Professional statisticians and data miners do not strongly mind the randomness of the results, because they regard the training data points as already given by a random sampling process. Yet, in our observations, practitioners and application users cannot correctly understand what randomness and robustness are. Developing completely deterministic algorithms is often crucial to satisfy users, even when such algorithms require more computational resources.

To solve the local optimality issue, Lashkari and Golland proposed a convex clustering algorithm [7] where the global optimality of the solution is guaranteed. In the fitting of Gaussian mixture models whose centroids and bandwidths remain unchanged during the optimization, the negative log-likelihood to be minimized is convex with respect to the mixture weights, and many of the optimal mixture weights


Fig. 1 Two examples of the local optima in four-means clustering, where four different symbols represent each sample's cluster assignment. The clustering quality of the assignment on the left is inferior to that on the right. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

become zero (sparse). For n data points, the convex clustering utilizes n kernel distributions whose centers are the n data points themselves. In optimizing the mixture weights, a subset of the n kernel distributions is automatically chosen, where the number of clusters is also automatically determined. The specified bandwidths essentially determine the number of clusters, where narrow bandwidths yield lots of small clusters while wide bandwidths result in a limited number of large clusters.

The first part of this paper addresses the acceleration of the convex clustering primarily discussed in the conference paper [8]. Though the global optimality in the convex clustering is quite appealing, we experimentally confirmed that the original EM algorithm provided in Ref. [7] requires thousands of iterations to converge. The EM algorithm is especially inefficient when applied to the convex clustering, mainly because of the sparsity of the solution. In the EM algorithm, the computational complexity varies in each iteration and is proportional to the number of nonzero elements in the mixture weights. Also, the EM algorithm has a first-order convergence and the updates are small near the sparse optimum. Thus, early pruning of the irrelevant kernels, which is a key to acceleration, conflicts with the nature of the EM algorithm. The actual computational times are sensitive to the specified threshold of the mixture weight.

Our acceleration technique exploits a fast pruning and an element-wise second-order optimization. Instead of using a small threshold, we use a derivative-based conditional expression to accurately judge whether or not a kernel can be trimmed off. Borrowing an idea from the sequential minimal optimization [9], the judgment is done by choosing a pair of kernels, and the choice of the pairs is based on the nearest-neighbor method. In optimizing the nonzero elements of the mixture weights, we use an element-wise Newton–Raphson method, instead of the first-order EM algorithm. While our algorithm's computational complexity per iteration is the same as that of the EM algorithm, the combination of the fast pruning and the second-order Newton–Raphson method drastically reduces the required number of iterations. The element-wise unidimensional Newton–Raphson method instead of the standard multidimensional Newton–Raphson method is essential, because inversions of Hessian matrices that increase the computational complexity are successfully eliminated.

The second part of this paper discusses the issues in handling high-dimensional data with the convex clustering algorithm, and a model selection algorithm to determine the optimal clustering from several partitioning results. For high-dimensional datasets, when the maximum-likelihood convex clustering is performed with the true bandwidths of the clusters, meaningful clusters cannot be acquired because many irrelevant kernels still remain active after the optimization. While using much larger bandwidths than the true ones often gives some clustering, we observed that the estimated cluster labels are unstable and different from the true labels when the noises are high. In addition, which clustering should be adopted from several bandwidth settings is an open problem, because the acquired mixture model is an improper density estimate and likelihood-based model selection such as likelihood cross-validation cannot be performed.

This paper newly proposes (i) an iterative algorithm to utilize convex clusterings with refitting of the cluster parameters and (ii) an empirical-Bayes model selection to choose the optimal clustering. Convex clustering with larger bandwidths than the true ones resembles fitting a mixture model with Deterministic Annealing (DA) [5,6]. Annealing algorithms optimize relaxed objective functions using high temperatures, with which the numbers of local optima are lower than in the original objectives. To acquire the optimum in the original problem, we need to gradually decrease the temperature and perform optimizations again and again. On the basis of the analogy to the iterative decreases of the temperature, we propose to repeat convex clusterings using large initial bandwidths and refitting of the cluster parameters. While the repetition of clusterings corresponds to a nonconvex optimization, initial values are given deterministically and the optimum is uniquely given. Hence we can still eliminate the random initializations that some users dislike. In performing a model selection from several


clusterings, we exploit a post-processing Bayesian density estimation whose hyperparameters can be optimized with an empirical-Bayes method. The automatic hyperparameter selection with the empirical-Bayes method can choose a reasonable clustering that incorporates the trade-off between the complexity of the fitted density and the explanation ability to the training data.

We conclude that the convex clustering is efficient for finding the initial clusters in annealing processes rather than for choosing the final clusters. Despite the analogy to DA, the proposed repetitions of the convex clusterings experimentally yielded more stable prediction performances than the existing DA-EM algorithm. While DA can reduce the possibility of being trapped in poor local optima, the k initial clusters are chosen randomly, and the quality of the k initial clusters affects the final performance. In contrast, the initial clusters in the proposed method are nonparametrically chosen by optimizing the relaxed objective, and this initial optimization instead of the random initialization dramatically increases the quality of the final results. The remainder of this paper is organized as follows. To make the paper self-contained, Section 2 shows the objective function of the convex clustering and discusses the slow convergence issue in the simple EM algorithm. The fast pruning and use of the Newton–Raphson method are introduced in Section 3. Section 4 introduces the problems when applying the convex clustering to high-dimensional data, and the iterative refitting of the cluster parameters. While different initial bandwidths result in different final clusters, the empirical-Bayes method in Section 5 gives a criterion to choose the optimal one. The work related to convex clustering is addressed in Section 6. Section 7 shows the experimental performances about the convergence rates and the detectability of the hidden clusters. Section 8 concludes the paper.

2. THE SLOW CONVERGENCE IN CONVEX CLUSTERING

This section introduces the basics of the convex clustering algorithm. Section 2.1 addresses the convex negative log-likelihood and the EM algorithm first introduced in Ref. [7]. Section 2.2 provides an experimental evaluation of the learning speed with the EM algorithm, and discusses the reasons for the slow convergence.

2.1. The Convex Negative Log-Likelihood

To do the clustering, the negative log-likelihood of the mixture models is minimized. For a d-dimensional vector x, let p(x|θ) be a probability distribution whose parameter is θ. While this paper only deals with the case in which p(x|θ) is a Gaussian distribution, broader classes of distributions, such as an exponential family whose natural parameter is θ, can be used for p(x|θ). Let m be the number of clusters and Δ_{m−1} be the simplex in R^m. Given a set of n data points D = {x_1, x_2, . . . , x_n}, mixture modeling aims to minimize the negative log-likelihood

\[
-\log p(\mathcal{D} \mid \lambda, \theta_1, \ldots, \theta_m) = -\sum_{j=1}^{n} \log\Big[\sum_{i=1}^{m} \lambda_i\, p(x_j \mid \theta_i)\Big], \tag{1}
\]

where λ ≜ (λ_1, . . . , λ_m)^T ∈ Δ_{m−1} is a vector of mixture weights, and θ_i is a distribution parameter assigned for the cluster i.

Though fitting all of the parameters {λ, θ_1, . . . , θ_m} requires a nonconvex optimization, the optimization only for λ on some fixed values of {θ_1, . . . , θ_m} is convex [7,10].¹ Hence the value of λ computed with a gradient-based method is the global optimum. On the basis of Jensen's inequality, an iterative update rule is derived as

\[
\lambda_i^{(t+1)} \leftarrow \frac{1}{n} \sum_{j=1}^{n} \frac{\lambda_i^{(t)} \kappa_{ij}}{\sum_{i'=1}^{m} \lambda_{i'}^{(t)} \kappa_{i'j}},
\]

where λ_i^{(t)} is the estimated value of λ_i at the t-th iteration and κ_ij ≜ p(x_j|θ_i). By introducing the auxiliary variables z_1^{(t)}, . . . , z_n^{(t)} and η_1^{(t)}, . . . , η_m^{(t)}, the iterative update rules can be decomposed as

\[
z_j^{(t)} \leftarrow \sum_{i=1}^{m} \lambda_i^{(t)} \kappa_{ij}, \qquad
\eta_i^{(t)} \leftarrow \frac{1}{n} \sum_{j=1}^{n} \frac{\kappa_{ij}}{z_j^{(t)}}, \qquad \text{and} \qquad
\lambda_i^{(t+1)} \leftarrow \eta_i^{(t)} \lambda_i^{(t)}. \tag{2}
\]

The values of η_1^{(t)}, . . . , η_m^{(t)} can be exploited in the convergence test because ∀i, lim_{t→∞} η_i^{(t)} = 1. After convergence, hard clustering can be performed with the Naïve Bayes rule: the cluster that x_j belongs to is arg max_i λ_i^{(∞)} κ_ij.
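As a concrete illustration, the following is a minimal sketch of update rule (2) with NumPy. The kernel matrix layout (`K[i, j]` = κ_ij), the function name, and the convergence tolerance are our own assumptions, not part of the original algorithm description.

```python
import numpy as np

def em_mixture_weights(K, n_iter=10000, tol=1e-8):
    """EM updates (2) for the mixture weights, starting from a dense vector.

    K : (m, n) array with K[i, j] = kappa_ij = p(x_j | theta_i).
    Returns the mixture-weight vector lam of length m.
    """
    m, n = K.shape
    lam = np.full(m, 1.0 / m)             # dense initialization (the paper uses 1/n with m = n)
    for _ in range(n_iter):
        z = lam @ K                       # z_j = sum_i lambda_i * kappa_ij
        eta = (K / z).sum(axis=1) / n     # eta_i = (1/n) sum_j kappa_ij / z_j
        lam = eta * lam                   # lambda_i <- eta_i * lambda_i
        if np.max(np.abs(eta - 1.0)) < tol:   # convergence test: all eta_i -> 1
            break
    return lam
```

Hard assignments can then be read off as `np.argmax(lam[:, None] * K, axis=0)`.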

The update rules (2) can be used to automatically determine the number of clusters. In many cases, most of the values λ_1^{(t)}, . . . , λ_m^{(t)} become zero (sparse) as t → ∞. Hence we define a set of indexes A_t ≜ {i; λ_i^{(t)} > 0}. Let π[i] for i ∈ {1, . . . , m} be an index such that π[i] ∈ {1, . . . , n}. If n is huge (e.g., on the order of millions), then we set m ≪ n and each π[i] is randomly chosen from {1, . . . , n} without duplication. Otherwise, we simply set m = n and π[i] = i. Then p(x|θ_i) is a distribution

¹ The proof is simply given with Jensen's inequality. For φ ∈ [0, 1] and λ, λ′ ∈ Δ_{m−1},
\[
-\log\Big[\sum_{i=1}^{m} \big(\phi\lambda_i + (1-\phi)\lambda'_i\big)\, p(x|\theta_i)\Big]
\le -\phi \log\Big[\sum_{i=1}^{m} \lambda_i\, p(x|\theta_i)\Big] - (1-\phi) \log\Big[\sum_{i=1}^{m} \lambda'_i\, p(x|\theta_i)\Big].
\]


centered at x_π[i]. At the t-th iteration, p(x|θ_i) is an active cluster distribution if i ∈ A_t, and the cardinality of A_t gives the number of clusters. A simple choice of p(x|θ_i) is an isotropic Gaussian kernel p(x|θ_i) ∝ exp(−‖x − x_π[i]‖²₂ / (2σ²)), where ‖·‖₂ denotes the L2-norm and σ² is an isotropic variance. Many mixture weights become zero when σ² is large, while a small value of σ² results in lots of small clusters. Hence the variance σ² essentially determines the number of clusters.

In Eq. (2), the computational complexity at the t-th iteration is O(n|A_t|). Because λ_i^{(t+1)} = 0 if λ_i^{(t)} = 0, a kernel trimmed off at the t-th iteration never becomes active. Therefore, the initial values of the mixture weights are required to be dense, where ∀i, λ_i^{(0)} > 0, to reach the global optimum. We adopt a simple choice that ∀i, λ_i^{(0)} ≡ 1/n. The computational complexity per iteration is first O(mn) and gradually decreases as the iterations proceed.

2.2. Inefficiency of the EM Algorithm

Though the simple EM algorithm using Eq. (2) converges to the global optimum, we experimentally confirmed that lots of iterations are required until the mixture weights become sparse. To reach the sparse solution, a component i that satisfies λ_i^{(t)} < 10⁻³/n is pruned in Ref. [7]. Figure 2 shows an experimental evaluation of the learning rate with this thresholding rule, using an artificial 2D dataset. Even the same number of iterations as the number of data points is not sufficient for pruning all of the irrelevant kernels.

The slow pruning of the irrelevant kernels is caused by the nature of the EM algorithm. Because EM algorithms to learn mixture models have first-order convergence [11,12], they become slow near the optimum. Hence the update amounts become small when the mixture weights are near zero, and pruning of the irrelevant kernels is prevented.

We think a naïve acceleration with a loose threshold is inappropriate. In Fig. 2, the updates of the mixture weights are not monotonic for the number of iterations. The weight of a component grows in some iterations and shrinks in other iterations. Hence pruning with a loose threshold involves a risk of trimming off the relevant kernels and might make the optimizations unstable.

Fig. 2 The slow convergence of the EM algorithm with the thresholding rule λ_i^{(t+1)} ← 0 if λ_i^{(t)} < 10⁻³/n. 1000 samples of R² points are distributed from a six-cluster Gaussian mixture model. The means of the clusters are (1, 0)^T, (−1, 0)^T, (0, 1)^T, (0, −1)^T, (1, 1)^T, and (−1, −1)^T. The mixture weight is 1/6 and the standard deviation is 0.2 in every cluster. We specified a bandwidth σ = 0.3 in the learning. The top-left figure shows the numbers of active components as the iterations proceed, where even 1000 or 10,000 iterations are not sufficient for the convergence. The top-right figure shows the dynamics in updating the mixture weights λ_i^{(0)}, λ_i^{(1)}, . . . , λ_i^{(100)} for several components {i}. The updates are not monotonic, because many of the mixture weights first increase and later decrease. The slow convergence can also be confirmed in the bottom three figures, where the resulting partitions at 10, 100, or 1000 iterations are shown. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

3. THE ACCELERATIONS

Instead of the first-order EM algorithm with heuristic thresholding, we introduce a second-order Newton–Raphson method with an exact pruning rule. When we focus on the mixture weights of a pair of kernels, we can exactly determine whether or not the selected kernels can be pruned. The idea of focusing on a pair of kernels is similar to sequential minimal optimization [9], which is basically a fast optimization technique to train Support Vector Machines [13]. While the derived pruning conditions are used to determine the zero elements of the mixture weights, the non-zero elements are optimized with Newton–Raphson updates. As we use element-wise updating and do not evaluate the Hessian matrix, our algorithm's computational complexity per iteration is the same as that of the EM algorithm. The choice of the pairs is based on the nearest-neighbor method. By focusing on a pair of kernels, Section 3.1 clarifies a direction to accelerate the convex clustering. Section 3.2 derives the fast and exact pruning conditions and Section 3.3 addresses the element-wise Newton–Raphson updating. Section 3.4 mentions several issues that require care in implementing the proposed algorithm.

3.1. Analysis with a Pair of Kernels

Referring to Eq. (1), we have the negative log-likelihood for the convex clustering. Let (i, i′) be a pair of components and assume that the mixture weight for each of the other components u ≠ i, i′ is fixed. Then the values λ_{\i\i′} ≜ 1 − λ_i − λ_{i′} and c_{j\i\i′} ≜ ∑_{u≠i,i′} λ_u p(x_j|θ_u) can be regarded as constants. Because the mixture weights λ_i and λ_{i′} satisfy λ_i + λ_{i′} ≡ 1 − λ_{\i\i′}, Eq. (1) can be described as f_{ii′}(λ_i), a unidimensional function of λ_i ∈ [0, 1 − λ_{\i\i′}]. Figure 3 shows three typical shapes of the function f_{ii′}(λ_i) depending on the choice of (i, i′). An important point is that we can immediately set λ_i = 0 or λ_{i′} = 0 when f_{ii′}(·) is monotonically increasing or decreasing, respectively. Generally, when two kernels p(x|θ_i) and p(x|θ_{i′}) are similar probability distributions, then only one of (i, i′) should be an active kernel and the other should be eliminated.

3.2. Fast and Accurate Pruning

By assessing the monotonicity of f_{ii′}(·), we perform a fast pruning of the irrelevant kernels. Let us take the gradient of f_{ii′}(λ_i) at λ_i = 0, as

\[
f'_{i0i'} \triangleq \left.\frac{\partial f_{ii'}}{\partial \lambda_i}\right|_{\lambda_i = 0}
= -\sum_{j=1}^{n} \frac{\kappa_{ij} - \kappa_{i'j}}{(1 - \lambda_{\backslash i \backslash i'})\,\kappa_{i'j} + c_{j\backslash i \backslash i'}}. \tag{3}
\]

If f′_{i0i′} > 0, then f_{ii′}(·) is monotonically increasing within [0, 1 − λ_{\i\i′}] and we can set λ_i = 0 and λ_{i′} = 1 − λ_{\i\i′}. In the same way, we evaluate

\[
f'_{ii'0} \triangleq \left.\frac{\partial f_{ii'}}{\partial \lambda_i}\right|_{\lambda_i = 1 - \lambda_{\backslash i \backslash i'}}
= -\sum_{j=1}^{n} \frac{\kappa_{ij} - \kappa_{i'j}}{(1 - \lambda_{\backslash i \backslash i'})\,\kappa_{ij} + c_{j\backslash i \backslash i'}}. \tag{4}
\]

If f′_{ii′0} < 0, then f_{ii′}(·) is monotonically decreasing within [0, 1 − λ_{\i\i′}], where we set λ_i = 1 − λ_{\i\i′} and λ_{i′} = 0. If f′_{i0i′} < 0 ∩ f′_{ii′0} > 0, then the optimum of λ_i is an interior point of [0, 1 − λ_{\i\i′}]. The optimization for such nonzero values is discussed later in Section 3.3. Note that there is no case such that f′_{i0i′} > 0 ∩ f′_{ii′0} < 0, because f_{ii′} is a convex function.

For each iteration, we update the mixture weights for all of the active components. At the t-th iteration, since the number of active components is |A_t|, the computational cost to evaluate Eqs. (3) and (4) is O(n|A_t|), which is the same as in the EM algorithm.

To make the optimization efficient, we set i′ to be one of the neighbors of i. Note that the optimum of λ_i or λ_{i′} tends to be zero when p(x|θ_i) and p(x|θ_{i′}) are similar. Hence taking pairs {(i, i′)}, such that i′ is a neighbor of i, can increase the chances to prune the irrelevant kernels. In a preprocessing step, we calculate a sequence of indexes ε[i, 1], ε[i, 2], . . . , ε[i, m − 1] such that ε[i, k] is the k-nearest neighbor of i based on the parameter θ_i. For each iteration, we set i′ = ε[i, k] with the minimum possible k such that ε[i, k] is an active component. Also, because i is eliminated with high probability when the weight λ_i is small, the pruning judgment based on Eqs. (3) and (4) is performed in ascending order of λ_i.

Fig. 3 The three types of the shapes of the constrained negative log-likelihood with respect to the mixture weight λ_i, where ∀u ≠ i, i′, λ_u ≡ 1/n. For the selected points {1, . . . , 4} in the left-most figure, we show the cases (i, i′) ∈ {(1, 2), (2, 3), (2, 4)}. The case (i, i′) = (2, 4), in which the optimum is not sparse, is magnified in the right-most figure. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

3.3. Element-Wise Newton–Raphson Updating

As another essential technique to accelerate the convex clustering, we derive the update rule for the nonzero elements of the mixture weights. For the unidimensional function f_{ii′}(·) introduced in Section 3.2, the first-order and second-order derivatives h^{(1)}_{ii′} ≜ ∂f_{ii′}/∂λ_i and h^{(2)}_{ii′} ≜ ∂²f_{ii′}/∂λ_i² are

\[
h^{(1)}_{ii'} = -\sum_{j=1}^{n} \frac{\kappa_{ij} - \kappa_{i'j}}{\lambda_i \kappa_{ij} + (1 - \lambda_{\backslash i \backslash i'} - \lambda_i)\,\kappa_{i'j} + c_{j\backslash i \backslash i'}}
\quad\text{and}\quad
h^{(2)}_{ii'} = \sum_{j=1}^{n} \left[\frac{\kappa_{ij} - \kappa_{i'j}}{\lambda_i \kappa_{ij} + (1 - \lambda_{\backslash i \backslash i'} - \lambda_i)\,\kappa_{i'j} + c_{j\backslash i \backslash i'}}\right]^2 > 0,
\]

respectively. An element-wise Newton–Raphson update

\[
\lambda_i^{(t+1)} \leftarrow \lambda_i^{(t)} - h^{(1)}_{ii'} / h^{(2)}_{ii'}
\]

can be done stably, because f_{ii′} is convex. The computational complexity per iteration to evaluate {h^{(1)}_{ii′}, h^{(2)}_{ii′}} for all i ∈ A_t is also O(n|A_t|).
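A minimal sketch of one SMO-style step combining the corner tests (3), (4) with the element-wise Newton–Raphson update is given below. It is not the full Algorithm 1: the caching of z and v, the nearest-neighbor pair selection, and the stopping rule are omitted, the clipping of the Newton step to the feasible interval is an extra safeguard we added, and all names are ours.

```python
import numpy as np

def pair_step(lam, K, i, ip):
    """One SMO-style step on the pair (i, ip): prune a kernel via the corner
    gradients (3) and (4), or take an element-wise Newton-Raphson step.

    lam : (m,) current mixture weights; K : (m, n) kernel matrix."""
    budget = lam[i] + lam[ip]                      # equals 1 - lambda_{\i\i'}
    c = lam @ K - lam[i] * K[i] - lam[ip] * K[ip]  # c_{j\i\i'} for every j
    diff = K[i] - K[ip]

    g0 = -np.sum(diff / (budget * K[ip] + c))      # f'_{i0i'}: gradient at lambda_i = 0
    if g0 > 0:                                     # monotonically increasing -> prune i
        lam[i], lam[ip] = 0.0, budget
        return lam
    g1 = -np.sum(diff / (budget * K[i] + c))       # f'_{ii'0}: gradient at lambda_i = budget
    if g1 < 0:                                     # monotonically decreasing -> prune i'
        lam[i], lam[ip] = budget, 0.0
        return lam

    # interior optimum: one element-wise Newton-Raphson update of lambda_i
    denom = lam[i] * K[i] + (budget - lam[i]) * K[ip] + c
    h1 = -np.sum(diff / denom)                       # h^(1)_{ii'}
    h2 = np.sum((diff / denom) ** 2)                 # h^(2)_{ii'} > 0
    lam[i] = np.clip(lam[i] - h1 / h2, 0.0, budget)  # clipping is our added safeguard
    lam[ip] = budget - lam[i]
    return lam
```

In Algorithm 1 the partner i′ is chosen as the nearest active neighbor of i, and the pairs are visited in ascending order of λ_i, as described above.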

3.4. Numerical Issues in Implementation

The accelerated algorithm for the convex clustering is summarized in Algorithm 1 using matrix notation, and here we describe the numerical issues in its implementation.

The vector z = (z_1, . . . , z_n)^T caches the current value of ∑_{i=1}^m λ_i κ_ij for each j. When judging whether a component i is pruned or not, the modified value of ∑_{i=1}^m λ_i κ_ij for each j is calculated by substituting λ_i = 0 or λ_{i′} = 0. Because only λ_i and λ_{i′} are tried to be updated, always summing the values {λ_i κ_ij} for all of i ∈ {1, . . . , m} is inefficient. We utilize the values of z and another vector v = (v_1, . . . , v_n)^T, which caches the values needed to evaluate the gradient at the corner λ_i = 0 or λ_{i′} = 0.

The value of ∑_{i=1}^m λ_i κ_ij is quickly computed when the modification λ_i = 0 or λ_{i′} = 0 is applied, as v_j = z_j − λ_i κ_ij + λ_i κ_{i′j} or v_j = z_j + λ_{i′} κ_ij − λ_{i′} κ_{i′j}, respectively. If the pruning condition is satisfied, then we substitute each element of the vector v into that of the vector z. Otherwise, the values of the vector v are ignored and the algorithm proceeds to the next step. For each i, the calculations of v and z are O(n). Hence Algorithm 1 keeps the computational complexity O(n|A_t|) per iteration.

The repeated additions and subtractions for the vector z might accumulate small numerical errors. Before the pruning trials in each iteration, we recommend recalculating the vector z using all of the active mixture weights and kernel values. This recalculation only requires O(n|A_t|) and does not increase the computational complexity.

For high-dimensional data, every value of the probability density function p(x_j|θ_i) becomes low, because of the exponential decay with the dimension. We can easily avoid the underflow by taking the logarithms of the densities. Using a value lp_j ≜ max_i log p(x_j|θ_i) that can be stably calculated, we use a normalized kernel matrix K̃ = (κ̃_ij), where

\[
\tilde{\kappa}_{ij} = \frac{p(x_j|\theta_i)}{\sum_{i'=1}^{m} p(x_j|\theta_{i'})}
= \frac{\exp\big(\log p(x_j|\theta_i) - lp_j\big)}{\sum_{i'=1}^{m} \exp\big(\log p(x_j|\theta_{i'}) - lp_j\big)},
\]

instead of the original kernel matrix K. The optimum of the mixture weight vector λ with the normalized kernel matrix K̃ is the same as that with the original kernel matrix K, because

\[
\arg\max_{\lambda} \sum_{j=1}^{n} \log \sum_{i=1}^{m} \lambda_i \tilde{\kappa}_{ij}
= \arg\max_{\lambda} \left[ \sum_{j=1}^{n} \log \sum_{i=1}^{m} \lambda_i \kappa_{ij} - \sum_{j=1}^{n} \log \sum_{i'=1}^{m} p(x_j|\theta_{i'}) \right],
\]

where the subtracted term is a constant that does not depend on λ and therefore does not change the result of the optimization.
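A minimal sketch of this log-domain normalization (the function name and the input layout `log_K[i, j] = log p(x_j|θ_i)` are our assumptions):

```python
import numpy as np

def normalize_kernel(log_K):
    """Build the normalized kernel matrix from log-densities, shifting by
    lp_j = max_i log p(x_j | theta_i) to avoid underflow.

    log_K : (m, n) array of log p(x_j | theta_i)."""
    lp = log_K.max(axis=0)                  # lp_j, one value per data point
    shifted = np.exp(log_K - lp)            # exp(log p - lp_j), safe from underflow
    return shifted / shifted.sum(axis=0)    # normalized kernel; columns sum to one
```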

4. ITERATIVE REFITTING FOR HIGH-DIMENSIONAL DATA

This section discusses the issues in setting the bandwidths of the clusters and the relationships between the bandwidth settings and annealing processes. To handle heteroscedastic data, we introduce a local Maximum-Likelihood (ML) estimation of the adaptive bandwidths in Section 4.1. The local likelihood-based bandwidth specification only works for low-dimensional data. For high-dimensional data, the mixture weight vector acquired with Algorithm 1 does not become sparse even when the correct bandwidths are used. Section 4.2 shows that the objectives for clustering and density estimation are inconsistent in the convex clustering framework. Section 4.3 discusses the relationships between large bandwidths and high temperatures in the annealing process, and addresses the necessity of refitting the cluster parameters. On the basis of the consideration about the refitting, Section 4.4 introduces an iterative algorithm that repeats convex clustering several times.

4.1. Local Maximum-Likelihood Bandwidths with k-Nearest Neighbors

Real data are often heteroscedastic, with both sparse and dense regions existing in the same dataset. For such data, using a globally common bandwidth for every kernel causes over-partitioning in the sparse regions and under-partitioning in the dense regions, as in Fig. 4. A natural improvement is to incorporate adaptive bandwidths, where each cluster has its own local bandwidth.

Let σ_i² be an isotropic variance assigned for the exemplar x_π[i]. A sparse mixture model with adaptive isotropic variances is given as

\[
p(x) = \sum_{i=1}^{m} \lambda_i\, N\big(x;\, x_{\pi[i]},\, \sigma_i^2 I_d\big)
= \sum_{i=1}^{m} \frac{\lambda_i}{\big(2\pi\sigma_i^2\big)^{d/2}} \exp\!\left(-\frac{\|x - x_{\pi[i]}\|_2^2}{2\sigma_i^2}\right),
\]

where I_d is the d-dimensional identity matrix and N(·; μ, Σ) is the probability density function of the multivariate Gaussian distribution whose mean is μ and whose variance-covariance matrix is Σ.

Assume that the values of the nonzero mixture weights are similar to one another, as in the basic c-means algorithm. Let ξ be a rough value of the number of clusters. Then each of the clusters has about (n/ξ) members distributed around its centroid. Because each local bandwidth should be determined with these (n/ξ) members, a local Maximum-Likelihood (ML) estimate of σ_i² is given as

\[
\sigma_i^2 = \frac{1}{kd} \sum_{\ell=1}^{k} \|x_{\varepsilon[i,\ell]} - x_{\pi[i]}\|_2^2, \tag{5}
\]

where k is a rounded integer of (n/ξ) and ε[i, ℓ] is the ℓ-nearest neighbor index for x_π[i].
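A minimal sketch of the local bandwidth estimate (5), assuming m = n exemplars and a brute-force nearest-neighbor search (the function name is ours; a space-partitioning index would be preferable for large n):

```python
import numpy as np

def local_bandwidths(X, xi):
    """Local ML isotropic variances (5): sigma_i^2 = (1/(k d)) times the sum of
    the squared distances to the k nearest neighbors, with k ~ n / xi.

    X : (n, d) data matrix; xi : rough number of clusters."""
    n, d = X.shape
    k = max(1, int(round(n / xi)))
    # pairwise squared Euclidean distances (brute force, O(n^2 d) memory)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(sq, np.inf)            # exclude the point itself
    knn_sq = np.sort(sq, axis=1)[:, :k]     # k smallest squared distances per exemplar
    return knn_sq.sum(axis=1) / (k * d)     # sigma_i^2 for each exemplar
```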


Fig. 4 An inappropriate partitioning for heteroscedastic data with a single bandwidth. The datasets in the two figures are the same, with low-density, middle-density, and high-density regions located at the left, the center, and the right, respectively. The left figure shows a clustering with a single bandwidth that is adjusted to the center clusters, where the low-density regions are over-partitioned while the high-density regions are under-partitioned. The right figure shows a desirable result of clustering with adaptive bandwidths. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

4.2. Inconsistency between Clustering and Density Estimation

For high-dimensional data, the exemplar-based convex clustering does not induce the sparse mixture weights even when the correct bandwidths are used. Let μ_i ∈ R^d and σ_i² be the true centroid and the bandwidth of the cluster that the exemplar x_π[i] belongs to. When an exemplar x_j is distributed from the cluster centered at μ_i, the value of (‖x_j − μ_i‖²₂ / σ_i²) obeys χ²(d), which is a chi-square distribution with d degrees of freedom. As the mean and the standard deviation of χ²(d) are d and √(2d), the probability mass of χ²(d) is strongly concentrated around its mean value d when d is high. Then the values of the probability density function N(x_j; μ_i, σ_i² I_d) are close to (2πeσ_i²)^{−d/2} for any choice of j.

In contrast, when we assume μ_i = x_π[i], the value of the kernel κ_ii = (2πσ_i²)^{−d/2} is e^{d/2} times larger than the true value. Let λ̄ ≜ (λ̄_1, . . . , λ̄_m)^T be the converged value of the mixture weight vector with Algorithm 1. Because of the extremely large multiplier e^{d/2}, most of the elements in λ̄ become positive to hold the explanation ability for the training data D. Since the vector λ̄ is not sparse, the ML density estimation using the correct bandwidths does not produce meaningful clusters. Basically, the approximation μ_i = x_π[i] is significantly inaccurate when the dimension is high.

We can slightly relax the curse of dimensionality by discounting the value of the kernel κ_ii, not by assuming μ_i = x_π[i] but by regarding x_π[i] as the closest point to μ_i. Focusing on x_{ε[i,1]}, which is the 1-nearest neighbor of x_π[i], we calculate the value of the kernel κ_ij as

\[
\kappa_{ij} =
\begin{cases}
\big(2\pi\sigma_i^2\big)^{-d/2} \exp\!\left(-\dfrac{\|x_{\varepsilon[i,1]} - x_{\pi[i]}\|_2^2}{2\sigma_i^2}\right) & \text{if } j = \pi[i], \\[2ex]
\big(2\pi\sigma_i^2\big)^{-d/2} \exp\!\left(-\dfrac{\|x_j - x_{\pi[i]}\|_2^2}{2\sigma_i^2}\right) & \text{otherwise},
\end{cases} \tag{6}
\]

where σ_i² is the local ML estimate introduced in Section 4.1. As the squared 1-nearest-neighbor distance ‖x_{ε[i,1]} − x_{π[i]}‖²₂ in Eq. (6) is also influenced by χ²(d), the ratio e^{d/2} between κ_ii and κ_ij such that j ≠ i is canceled. While various modifications of the distance structures to set μ_i ≠ x_i can relax the curse of dimensionality, Eq. (6) is one of the simplest manipulations that still guarantee that x_π[i] is the closest point to μ_i.
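A minimal sketch of the discounted kernel (6) in the log domain, again assuming m = n exemplars and precomputed local variances (all names are ours):

```python
import numpy as np

def discounted_log_kernel(X, sigma2):
    """Log of the kernel matrix (6): the diagonal entry kappa_ii is computed
    with the 1-nearest-neighbor distance instead of the zero self-distance.

    X : (n, d) data; sigma2 : (n,) local variances. Returns (n, n) log kappa."""
    n, d = X.shape
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)          # squared distances
    nn_sq = np.where(np.eye(n, dtype=bool), np.inf, sq).min(axis=1)   # 1-NN squared distance
    np.fill_diagonal(sq, nn_sq)                                       # replace 0 by the 1-NN distance
    # row i is log N(x_j; x_i, sigma_i^2 I_d) evaluated at every data point j
    return -0.5 * d * np.log(2.0 * np.pi * sigma2)[:, None] - sq / (2.0 * sigma2[:, None])
```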

Yet heuristic modifications of the distance structures do not fundamentally solve the problem. Even when using Eq. (6), the differences among the squared distances {‖x_j − x_π[i]‖²₂}_{i≠j} largely fluctuate on the scale of √(2d). Hence the relative ratio among {κ_ij}_{i=1}^m involves multipliers of the order exp(√d), which is highly volatile when d is high.

4.3. Using Large Bandwidths as an Annealing Process

One way to directly cancel the effects of the coefficient exp(√d) is to multiply some discounting factor into the squared distances. Let β (0 < β ≤ 1) be a relaxation factor for the kernel distributions. Because exp(−β‖·‖²₂) ≡ (exp(−‖·‖²₂))^β, we consider a relaxed optimization

\[
\max_{\lambda} \sum_{j=1}^{n} \log \sum_{i=1}^{m} \lambda_i \kappa_{ij}^{\beta}
\quad \text{s.t.} \quad \sum_{i=1}^{m} \lambda_i = 1. \tag{7}
\]

The optimum of λ in Eq. (7) becomes sparser as we use a lower value of β. Hence we can acquire some clustering with a low value of β. Yet we should assess the meaning of the fitted result with Eq. (7), because the probability density function with β ≠ 1 is improper and different from the standard likelihood function.
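In the log domain, the relaxation of Eq. (7) amounts to scaling the log-kernel by β before the weight optimization; a minimal sketch (combined here with the normalization of Section 3.4, with names of our own) is:

```python
import numpy as np

def relaxed_kernel(log_K, beta):
    """Relaxation of Eq. (7): kappa_ij^beta, computed in the log domain and
    renormalized per data point to avoid underflow (cf. Section 3.4).

    log_K : (m, n) array of log p(x_j | theta_i); 0 < beta <= 1."""
    scaled = beta * log_K                       # log of kappa_ij^beta
    scaled -= scaled.max(axis=0)                # log-sum-exp shift per column
    K = np.exp(scaled)
    return K / K.sum(axis=0)
```

The mixture weights are then optimized on the relaxed kernel matrix exactly as in the β = 1 case, e.g., with the accelerated solver of Section 3.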

We note that Eq. (7) resembles the mixture learning with Deterministic Annealing (DA) [5,6], which performs

\[
\max_{\lambda, \theta_1, \ldots, \theta_m} \sum_{j=1}^{n} \log \sum_{i=1}^{m} \lambda_i^{\beta}\, p(x_j|\theta_i)^{\beta}
\quad \text{s.t.} \quad \sum_{i=1}^{m} \lambda_i = 1.
\]

In DA, the hyperparameter β is called the inverse temperature. As β becomes close to zero, the objective


function gets more relaxed and the number of local optima becomes smaller. By introducing a latent probability q_ij with which the i-th cluster contributes to the generation of the j-th exemplar, the DA-EM algorithm iterates the update rules

\[
\text{E-step:}\quad q_{ij} \leftarrow \frac{\lambda_i^{\beta}\, p(x_j|\theta_i)^{\beta}}{\sum_{i'=1}^{m} \lambda_{i'}^{\beta}\, p(x_j|\theta_{i'})^{\beta}}
\quad \text{for } i \in \{1, \ldots, m\} \text{ and } j \in \{1, \ldots, n\},
\]
\[
\text{M-step:}\quad \lambda_i \leftarrow \frac{1}{n} \sum_{j=1}^{n} q_{ij}
\quad\text{and}\quad
\theta_i \leftarrow \arg\max_{\theta_i} \sum_{j=1}^{n} q_{ij} \log p(x_j|\theta_i)
\quad \text{for } i \in \{1, \ldots, m\}.
\]

The DA-EM algorithm first computes an optimum for a low value of β. Then the acquired optimum becomes the initial parameter in a slightly harder optimization using a higher β. The value of β is gradually increased and we finally perform the original optimization with β = 1. The continuity of the global optimum with respect to β is guaranteed.

While the DA-EM algorithm is designed to reduce the number of local optima in the parameter space, our relaxation reduces the number of local modes in the input space. When β is small, for many settings of the mixture weights λ, the relaxed density ∑_{i=1}^m λ_i p(x|θ_i)^β has fewer local modes (peaks) with respect to the input vector x ∈ R^d. Hence the convex clustering algorithm prunes the irrelevant kernels that do not strongly contribute to the local modes, and the relevant kernels are located near the local modes of the relaxed density.

The analogy to DA tells us that the convex clustering result with low β is not guaranteed to be a good clustering. In DA, the acquired local optimum using β < 1 is not the final solution of the original objective and needs to move in harder optimizations using higher β. Similarly, in the convex clustering, each mixture weight and relevant kernel fitted with low β are not guaranteed to be optimal and relevant in modeling the original input density. The results with low β should be used as initial values, as in the DA-EM algorithm. Figure 5 demonstrates a fitting result with low β by projecting a high-dimensional dataset into a unidimensional space, and emphasizes the importance of refitting the cluster parameters.

Fig. 5 Insufficiency of the fitted result using low β (high temperature). 1000 samples of 100-dimensional data points were distributed from an equally sized three-mixture of isotropic Gaussian distributions whose centroids are (−4, 0, . . . , 0)^T, (0, 0, . . . , 0)^T, and (3, 0, . . . , 0)^T. Every cluster's covariance matrix is I_d. The convex clustering algorithm was performed using the bandwidth I_d and β = 0.13. The figures show the values of the true or relaxed density functions for samples {x = (x, 0, . . . , 0)^T}. The left figure superimposes the true log-probability density function log p(x) and the fitted relaxed log-density log ∑_{i=1}^m λ_i p(x|θ_i)^β with a rescaling. The right figure shows the relaxed log-density log λ_i p(x|θ_i)^β for each fitted component. While the convex clustering algorithm captures the latent three components, the summed density is not three-modal in the projected subspace, because the largest component dominates the value of the relaxed density. In this condition, the three clusters cannot be detected with the Bayes' rule, and refitting of both the mixture weights and the kernel centroids is suggested. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

The difference between the proposed optimization and DA comes from the sparsity requirement. Instead of Eq. (7), one can consider an optimization

\[
\max_{\lambda} \sum_{j=1}^{n} \log \sum_{i=1}^{m} \lambda_i^{\beta} \kappa_{ij}^{\beta}
\quad \text{s.t.} \quad \sum_{i=1}^{m} \lambda_i = 1. \tag{8}
\]

Eq. (8) is also a convex optimization when 0 < β ≤ 1, regarding the concavity of x^β. When β < 1, the first derivative of Eq. (8) always becomes infinite at the corner λ_i = 0. Hence the optimum of Eq. (8) is always nonsparse and inefficient if used for clustering.

4.4. The Convex Clustering Repetition Algorithm

In refitting the cluster parameters, we repeat the convex clustering algorithm several times. In designing the repeating algorithm, one can perform the convex clustering without updating the kernel distributions, where a higher β is applied only to further prune the irrelevant kernels that were judged to be relevant when β was low. Yet the relevant kernels found with low β are often not close to the true cluster centroids, and hence we refit the centroids and bandwidths as in the EM algorithm.

Let u be the step number in the refitting process, and let σ_i^{(0)} ← σ_i and κ_ij^{(0)} ← κ_ij^β. Using Algorithm 1, we first compute the mixture weights as

\[
\lambda^{(u)} = \arg\max_{\lambda} \sum_{j=1}^{n} \log \sum_{i=1}^{m} \lambda_i \kappa_{ij}^{(u)}. \tag{9}
\]

Then the latent cluster assignment probability is estimated as

\[
q_{ij}^{(u)} \leftarrow \frac{\lambda_i^{(u)} \kappa_{ij}^{(u)}}{\sum_{i=1}^{m} \lambda_i^{(u)} \kappa_{ij}^{(u)}}
\quad \text{for each } j \in \{1, \ldots, n\} \text{ and } i: \lambda_i^{(u)} > 0. \tag{10}
\]


For i such that λ_i^{(u)} > 0, we refit the centroid x̄_i^{(u)} and the bandwidth σ_i^{(u)} as

\[
\bar{x}_i^{(u)} \leftarrow \frac{\sum_{j=1}^{n} q_{ij}^{(u)} x_j}{\sum_{j=1}^{n} q_{ij}^{(u)}}
\quad\text{and}\quad
\big(\sigma_i^{(u)}\big)^2 \leftarrow \frac{d\big(\sigma_i^{(0)}\big)^2 + \sum_{j=1}^{n} q_{ij}^{(u)} \|x_j - \bar{x}_i^{(u)}\|_2^2}{d + d\sum_{j=1}^{n} q_{ij}^{(u)}}. \tag{11}
\]

Using x̄_i^{(u)} and the bandwidth σ_i^{(u)}, we can compute a new kernel matrix K^{(u+1)} = (κ_ij^{(u+1)}) as

\[
\kappa_{ij}^{(u+1)} = N\big(x_j;\, \bar{x}_i^{(u)},\, (\sigma_i^{(u)})^2 I_d\big). \tag{12}
\]

We repeat Eqs. (9)–(12) for u = 0, 1, 2, . . . , until no more kernels are pruned. Experimentally, five iterations were sufficient because most irrelevant kernels are pruned in the first and second iterations.
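A minimal sketch of one refitting pass, Eqs. (9)–(12), is shown below; `solve_weights` stands in for Algorithm 1, the initial `log_K` is assumed to be the relaxed log-kernel κ_ij^β, and all names and the exact data layout are our assumptions rather than the author's code.

```python
import numpy as np

def refit_once(X, log_K, sigma2_0, solve_weights):
    """One pass of Eqs. (9)-(12): optimize the mixture weights, compute the soft
    assignments, refit centroids and bandwidths, and rebuild the kernel matrix.

    X: (n, d) data; log_K: (m, n) log-kernel; sigma2_0: (m,) initial variances."""
    n, d = X.shape
    K = np.exp(log_K - log_K.max(axis=0))          # normalized kernel (Section 3.4)
    K /= K.sum(axis=0)
    lam = solve_weights(K)                         # Eq. (9), e.g. Algorithm 1
    active = lam > 0
    Q = lam[active][:, None] * K[active]           # Eq. (10), unnormalized q_ij
    Q /= Q.sum(axis=0)                             # normalize over the active components
    n_i = Q.sum(axis=1)
    mu = (Q @ X) / n_i[:, None]                    # Eq. (11), refitted centroids
    sq = ((X[None, :, :] - mu[:, None, :]) ** 2).sum(axis=-1)
    sigma2 = (d * sigma2_0[active] + (Q * sq).sum(axis=1)) / (d + d * n_i)   # Eq. (11)
    # Eq. (12): Gaussian log-kernel around the refitted centroids and bandwidths
    new_log_K = -0.5 * d * np.log(2 * np.pi * sigma2)[:, None] - sq / (2 * sigma2[:, None])
    return lam[active], mu, sigma2, new_log_K
```

The pass is repeated, restricting `sigma2_0` to the surviving components, until no more kernels are pruned.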

Algorithm 2 summarizes the iterative refitting procedures. We note that the refitting algorithm does not need to be a maximum-likelihood estimator. While Eq. (11) involves a regularization term using σ_i^{(0)} ≡ σ_i, researchers can use more sophisticated Bayesian regularizations, including the variational-Bayes method. Because we finally apply a more precise empirical-Bayes regularization in Section 5, here we made the refitting algorithm simple.

5. EMPIRICAL-BAYES MODEL SELECTION

This section introduces an automatic selection of the optimal clustering based on the maximum marginal-likelihood criterion. Several settings of the rough number of clusters ξ and the inverse temperature β give different clustering results. When the desirable number of clusters is determined a priori, then we can simply choose the corresponding result. Otherwise, we often need to choose the appropriate clustering from several candidates. As the resulting numbers of clusters differ among the initial parameter settings, we need a clustering quality measure that applies even when the numbers of clusters are different among the results.


Here we adopt a maximum marginal likelihood criterion that uses proper bandwidths in terms of density estimation. When the true clusters are isotropic Gaussian distributions, the chosen number of active components becomes close to the true number of clusters. Section 5.1 derives the marginal likelihood score and Section 5.2 shows the maximum marginal likelihood solution of the prior hyperparameters, which are called the empirical-Bayes estimates.

5.1. Deriving Approximate Marginal Likelihood

For a training likelihood function with a general Gaussian mixture model

\[
p(\mathcal{D}|\Theta) = \prod_{j=1}^{n} \sum_{i=1}^{m} \phi_i\, N\big(x_j;\, \mu_i,\, \sigma_i^2 I_d\big),
\quad\text{where}\quad
\Theta \triangleq \{\phi \triangleq (\phi_1, \ldots, \phi_m)^{\mathrm{T}}, \mu_1, \ldots, \mu_m, \sigma_1^2, \ldots, \sigma_m^2\},
\]

we take its lower bound with Jensen's inequality as

\[
p(\mathcal{D}|\Theta) \ge \prod_{j=1}^{n} \prod_{i=1}^{m} \left[\frac{\phi_i\, N\big(x_j;\, \mu_i,\, \sigma_i^2 I_d\big)}{q_{ij}}\right]^{q_{ij}}
\triangleq q\big(\mathcal{D}|\Theta, \{q_{ij}\}\big).
\]

We regard q(D|Θ, {q_ij}) as an approximated training likelihood. As the values of q_ij and σ_i², we can use those of q_ij^{(u)} and σ_i^{(u)} calculated as in Section 4.4. Hence we simply denote the converged values of q_ij^{(u)} and σ_i^{(u)} by q_ij and σ_i², respectively. In addition, we denote the converged value of the centroid x̄_i^{(u)} by x̄_i.

We place an m-dimensional symmetric Dirichlet distribution prior on the mixture weight vector φ and a heteroscedastic isotropic Gaussian prior on each centroid μ_i, as

\[
p(\phi|\alpha) = \frac{\Gamma(\alpha)}{\Gamma^m(\alpha/m)} \prod_{i=1}^{m} \phi_i^{\alpha/m - 1}
\quad\text{and}\quad
p(\mu_i|\omega_i) = N\big(\mu_i;\, \mu_0,\, \omega_i \sigma_i^2 I_d\big) \quad \text{for each } i \in \{1, \ldots, m\},
\]

where α is a concentration hyperparameter, μ_0 = (1/n) ∑_{j=1}^n x_j is the global mean of the training data, and {ω_1, . . . , ω_m} is a set of prior bandwidths. The entire prior takes a factorial form p(Θ|Ψ) = p(φ|α) ∏_{i=1}^m p(μ_i|ω_i), where Ψ = {α, ω_1, . . . , ω_m}.

The approximated training likelihood q(D|Θ, {q_ij}) can be analytically marginalized with the prior p(Θ|Ψ), because q(D|Θ, {q_ij}) is a product of exponential-family distributions. While we abbreviate the derivations, the closed forms of the approximated marginal likelihood q(D|{q_ij}, Ψ) ≜ ∫_Θ q(D|Θ, {q_ij}) p(Θ|Ψ) dΘ and the posterior q(Θ|D, {q_ij}, Ψ) ≜ q(D|Θ, {q_ij}) p(Θ|Ψ) / q(D|{q_ij}, Ψ) are given as

\[
q\big(\mathcal{D}|\{q_{ij}\}, \Psi\big)
= \left[\prod_{j=1}^{n} \prod_{i=1}^{m} q_{ij}^{-q_{ij}}\right]
\frac{\Gamma(\alpha)}{\Gamma(\alpha + n)} \prod_{i=1}^{m} \frac{\Gamma(\alpha/m + n_i)}{\Gamma(\alpha/m)}
\cdot \prod_{i=1}^{m} \left[\big(2\pi\sigma_i^2\big)^{-\frac{d n_i}{2}} \big(1 + n_i\omega_i\big)^{-\frac{d}{2}}
\exp\!\left(-\frac{n_i \|\bar{x}_i - \mu_0\|_2^2}{2(1 + n_i\omega_i)\sigma_i^2}\right)\right] \tag{13}
\]

and

\[
q\big(\Theta|\mathcal{D}, \{q_{ij}\}, \Psi\big)
= \frac{\Gamma(\alpha + n)}{\prod_{i=1}^{m} \Gamma(\alpha/m + n_i)}
\prod_{i=1}^{m} \left[\phi_i^{n_i + \alpha/m - 1}\,
N\!\left(\mu_i;\, \frac{\mu_0 + n_i\omega_i \bar{x}_i}{1 + n_i\omega_i},\, \frac{\omega_i}{1 + n_i\omega_i}\sigma_i^2 I_d\right)\right],
\]

respectively, where n_i ≜ ∑_{j=1}^n q_ij ≡ nλ_i.

Equation (13) is an unsupervised clustering quality score that can compare results having different numbers of clusters. The explanation ability to the training data is integrated in the term ∏_{i=1}^m (2πσ_i²)^{−d n_i/2}, which expresses the total variances of the clusters. The term [Γ(α)/Γ(α+n)] ∏_{i=1}^m Γ(α/m + n_i)/Γ(α/m) is related to the mixture weights and is high when the number of clusters is low. For each cluster i, the term (1 + n_i ω_i)^{−d/2} exp(−n_i‖x̄_i − μ_0‖²₂ / (2(1 + n_i ω_i)σ_i²)) incorporates the loss in distributing its centroid away from the global mean. When there are many clusters, the penalty related to the Gaussian prior also increases, as well as the one related to the Dirichlet prior. In evaluating the marginal likelihood, we should take the logarithm of Eq. (13) in the implementation. Then we can avoid numerical underflow and can exploit the log-gamma function, which is stably computed unlike the gamma function.
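A minimal sketch of evaluating the logarithm of Eq. (13) with the log-gamma function (`math.lgamma`); the inputs are the converged quantities of Section 4.4, and the function and argument names are our assumptions:

```python
import numpy as np
from math import lgamma

def log_marginal_likelihood(Q, X, mu, sigma2, omega, alpha):
    """Logarithm of the approximate marginal likelihood (13).

    Q: (m, n) converged soft assignments q_ij; X: (n, d) data;
    mu: (m, d) converged centroids; sigma2, omega: (m,) cluster and prior
    bandwidths; alpha: Dirichlet concentration hyperparameter."""
    m, n = Q.shape
    d = X.shape[1]
    n_i = Q.sum(axis=1)                            # n_i = n * lambda_i
    mu0 = X.mean(axis=0)                           # global mean of the training data
    # term prod_ij q_ij^{-q_ij}  ->  -sum_ij q_ij log q_ij
    ent = -np.sum(Q[Q > 0] * np.log(Q[Q > 0]))
    # Dirichlet term via the log-gamma function
    dirich = lgamma(alpha) - lgamma(alpha + n) \
        + sum(lgamma(alpha / m + ni) - lgamma(alpha / m) for ni in n_i)
    # per-cluster Gaussian terms
    dist2 = ((mu - mu0) ** 2).sum(axis=1)
    gauss = np.sum(-0.5 * d * n_i * np.log(2 * np.pi * sigma2)
                   - 0.5 * d * np.log1p(n_i * omega)
                   - n_i * dist2 / (2 * (1 + n_i * omega) * sigma2))
    return ent + dirich + gauss
```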

While we can evaluate the logarithm of Eq. (13) with fixed values of α and {ω_i}, these prior hyperparameters can be easily optimized, as shown in Section 5.2. Hence each clustering quality should be measured after the prior hyperparameters are optimized.

5.2. Empirical-Bayes Solution

Empirical-Bayes methods adopt prior hyperparameters that maximize the marginal likelihood. For the concentration hyperparameter α, the maximum marginal-likelihood estimate

\[
\hat{\alpha} = \arg\max_{\alpha}\; \log \frac{\Gamma(\alpha)}{\Gamma(\alpha + n)} \prod_{i=1}^{m} \frac{\Gamma(\alpha/m + n_i)}{\Gamma(\alpha/m)}
\]

is calculated with a fast modified-Newton method [14]. As we prefer a sparse optimum of φ, we use an initialization α = 1. While α can be globally optimized by starting from several initial values, we experimentally confirmed that the optimum of α is unique in many cases. For the prior bandwidth ω_i, the maximum marginal-likelihood estimate is

\[
\hat{\omega}_i = \left(\frac{\|\bar{x}_i - \mu_0\|_2^2}{d\sigma_i^2} - \frac{1}{n_i}\right)_{+},
\quad\text{where } (\cdot)_{+} \triangleq \max\{\cdot, 0\}.
\]

After computing each clustering's best prior hyperparameters, the maximized marginal likelihoods of the clusterings are compared. After choosing the best clustering, we calculate the point estimates of the mixture weights and the centroids. Since a Dirichlet distribution's mode can become sparse while its mean is not sparse, we adopt Maximum A Posteriori (MAP) estimation for the vector φ. Let n*_i, x̄*_i, α*, σ*_i, and ω*_i be the values of n_i, x̄_i, α, σ_i, and ω_i corresponding to the best clustering. The MAP solutions of φ and μ_i are given as

\[
\phi_i^{*} = \frac{\big(\alpha^{*}/m + n_i^{*} - 1\big)_{+}}{\sum_{i'=1}^{m} \big(\alpha^{*}/m + n_{i'}^{*} - 1\big)_{+}}
\quad\text{and}\quad
\mu_i^{*} = \frac{\mu_0 + \omega_i^{*} n_i^{*} \bar{x}_i^{*}}{1 + \omega_i^{*} n_i^{*}}.
\]

The procedures to evaluate the optimized log marginal likelihood and the corresponding MAP estimates of the mixture weights and the centroids are summarized in Algorithm 3.
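A minimal sketch of the closed-form empirical-Bayes bandwidths and the MAP point estimates above; the optimization of α with the modified-Newton method of Ref. [14] is omitted, and all names are our assumptions:

```python
import numpy as np

def empirical_bayes_map(n_i, xbar, sigma2, mu0, alpha, m):
    """Closed-form empirical-Bayes bandwidths omega_i and MAP estimates of the
    mixture weights phi_i and the centroids mu_i.

    n_i: (k,) effective counts of the active components; xbar: (k, d) converged
    centroids; sigma2: (k,) cluster variances; mu0: (d,) global mean;
    alpha: optimized concentration; m: number of kernels in the Dirichlet prior."""
    d = xbar.shape[1]
    dist2 = ((xbar - mu0) ** 2).sum(axis=1)
    omega = np.maximum(dist2 / (d * sigma2) - 1.0 / n_i, 0.0)   # omega_i = (...)_+
    phi = np.maximum(alpha / m + n_i - 1.0, 0.0)                # Dirichlet MAP, may be sparse
    phi = phi / phi.sum()
    mu = (mu0 + (omega * n_i)[:, None] * xbar) / (1.0 + omega * n_i)[:, None]
    return omega, phi, mu
```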

Remember that the estimated number of clusters with the empirical-Bayes method is based on the assumption of isotropic Gaussian clusters. When each cluster is a non-Gaussian distribution, relatively many kernels remain active to cover each cluster's non-Gaussianity. We should also note that large noises might decrease the estimated number of clusters, because separating high-noise clusters is often not beneficial in stabilizing the estimates of the cluster parameters.

6. RELATED WORK

We start with prior work to acquire the global optimum in clustering. Representative algorithms to acquire the globally optimal clustering are spanning-tree cutting [15,16] and the mean-shift [17]. Cutting the k heaviest edges of the Euclidean minimum spanning tree for vector data can yield the global minimum of some loss function and can handle nonelliptic clusters. Yet, spanning-tree cutting is known to be sensitive to noise, where the existence of a few tail samples between the clusters strongly affects the resulting partitions. The mean-shift algorithm finds the local modes of a kernel density estimator and can also detect nonelliptic clusters while automatically determining the number of clusters. Yet, it is essentially an EM algorithm [18] that has a first-order convergence, and the fitted mixture model is not sparse. Affinity Propagation [19] is similar to the convex clustering and also achieves the globally optimal clustering depending on the hyperparameter settings. It is also an exemplar-based clustering algorithm whose computational cost is O(n²) per iteration.


The standard approach to automatically determine the number of clusters is fitting Dirichlet Process Mixtures [20,21]. We can use variational approximations [22,23] or Markov Chain Monte Carlo methods [24–26] for the inferences of Dirichlet Process Mixtures, where the variational methods are Bayesian extensions of the EM algorithms. There is also work to combine the Dirichlet process priors and exemplar-based clustering [27]. However, placing the Dirichlet process prior generally results in nonconvex optimizations.

The k-means and EM algorithms are also the basics of clustering for similarity data. Some types of datasets used in data mining are solely expressed as similarity matrices among exemplars. Kernel k-means and spectral clustering [28,29] are effective techniques to group exemplars without explicit vectorial representations. These similarity-based algorithms are also effective for handling vector data, if the shapes of the clusters are non-elliptic. Since both kernel k-means and spectral clustering consequently perform the k-means algorithm, the k-means, EM algorithms, and their Bayesian extensions are essential in the clustering of both vector and similarity data.

The accelerated algorithm with the Sequential Minimal Optimization is also effective in supervised learning, as well as unsupervised clustering and density estimation. Many supervised learning problems including classification and regression can be generalized as conditional density estimations that fit p(y|x), where x ∈ R^{d_X} and y ∈ R^{d_Y} are vectors of input and output variables, respectively. Given a set of observations {(x_1, y_1), . . . , (x_n, y_n)}, let us use a weighted kernel estimate

\[
p(y|x) = \frac{\sum_{i=1}^{m} w_i\, K_X(x, x_{\pi[i]})\, K_Y(y, y_{\pi[i]})}{\sum_{i=1}^{m} w_i\, K_X(x, x_{\pi[i]})}, \tag{14}
\]

where x_π[1], . . . , x_π[m] and y_π[1], . . . , y_π[m] are m (≤ n) random samples of the training data, and K_X(·, x_π[i]) and K_Y(·, y_π[i]) are kernel distributions of the input and output variables, respectively. For example, we can assume K_X(·, x_π[i]) and K_Y(·, y_π[i]) to be multivariate Gaussian distributions whose centers are x_π[i] and y_π[i].

The optimization of the weights w = (w_1, . . . , w_m)^T in Eq. (14), while K_X and K_Y remain unchanged, is essentially the same as that in the convex clustering. The Kullback–Leibler Importance Estimation Procedure [30] performs an optimization

\[
\max_{w} \sum_{j=1}^{n} \log\Big[\sum_{i=1}^{m} w_i\, K_X\big(x_j, x_{\pi[i]}\big)\, K_Y\big(y_j, y_{\pi[i]}\big)\Big]
\quad \text{s.t.} \quad \sum_{j=1}^{n} \sum_{i=1}^{m} K_X\big(x_j, x_{\pi[i]}\big)\, w_i = n, \tag{15}
\]

which is derived as the maximization of an empirical approximation of the density ratio between p(x, y) and p(x). Let us introduce a variable transformation λ_i ≜ (w_i/n) ∑_{j=1}^n K_X(x_j, x_π[i]). Then Eq. (15) is modified as

\[
\max_{\lambda} \sum_{j=1}^{n} \log\left[\sum_{i=1}^{m} \lambda_i\, \frac{n\, K_X\big(x_j, x_{\pi[i]}\big)}{\sum_{j'=1}^{n} K_X\big(x_{j'}, x_{\pi[i]}\big)}\, K_Y\big(y_j, y_{\pi[i]}\big)\right]
\quad \text{s.t.} \quad \sum_{i=1}^{m} \lambda_i = 1,
\]

which is an equivalent objective to that of the convex clustering, when we set

\[
\kappa_{ij} = \frac{n\, K_X\big(x_j, x_{\pi[i]}\big)}{\sum_{j'=1}^{n} K_X\big(x_{j'}, x_{\pi[i]}\big)}\, K_Y\big(y_j, y_{\pi[i]}\big).
\]
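A minimal sketch of constructing this kernel matrix with isotropic Gaussian K_X and K_Y and m = n exemplars is shown below; the bandwidths and names are our assumptions, and the Gaussian normalization constants are dropped because they only shift the objective by a constant.

```python
import numpy as np

def kliep_kernel(X, Y, sx2, sy2):
    """kappa_ij = n * K_X(x_j, x_pi[i]) * K_Y(y_j, y_pi[i]) / sum_j' K_X(x_j', x_pi[i]),
    which turns the KLIEP objective (15) into the convex clustering objective.

    X: (n, dX) inputs; Y: (n, dY) outputs; sx2, sy2: isotropic Gaussian bandwidths."""
    n = X.shape[0]
    KX = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (2 * sx2))
    KY = np.exp(-((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1) / (2 * sy2))
    return n * KX * KY / KX.sum(axis=1, keepdims=True)
```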

While this paper focuses on clustering tasks, our future plan includes the application of the sequential minimal optimization in supervised learning tasks.

7. EXPERIMENTS

Our aims in the evaluation experiments are threefold. The first one is to compare the actual computational costs between the proposed and the original EM algorithms on the same kernel parameters. The second one is to see the influences of the initial bandwidths and the temperature settings on predictions. The third one is to quantitatively evaluate the detectability of the hidden clusters. Section 7.1 presents the convergence rates using artificial datasets. Section 7.2 examines the sensitivity to the initial bandwidths and temperatures. Section 7.3 shows the detectability of the hidden clusters, using high-dimensional datasets.

7.1. Convergence Rate and Computational Time

We compared the convergence rate and the actual CPU time between the proposed and the original EM algorithms. The CPU time was calibrated on a Debian GNU/Linux x86_64 PC with an Intel® Xeon™ Processor 2.80 GHz and 16 GB main memory. Both the EM and the proposed algorithms were implemented as Java™ programs using the sparse matrix classes in Apache Commons Mathematics Library version 2.2.²

First we generated 1000 or 4000 data points in R2

from Gaussian mixture models as shown in Figs 6 and 7.Figure 6 shows the behaviors of the learning algorithms


[Figure 6 plots: number of active components versus the number of iterations (top row) and versus CPU time (middle row) for the proposed method and the thresholding rule, in panels n = 1,000 with s = 0.2, 0.3, 0.8 and n = 4,000 with s = 0.3; the bottom row shows the final partitions.]

Fig. 6 Experimental comparisons between the proposed method and the EM algorithm using the thresholding rule λ_i ← 0 if λ_i < 10⁻³/n [7], in the single-bandwidth setting. The numbers of active components as the iterations proceed, and the final partitions obtained with the proposed method, are shown. The horizontal axes used to confirm the convergence rates are the number of iterations (top) or the CPU time (middle). We generated 1000 or 4000 samples in which the cluster distributions are the same as those in Fig. 2. While the thresholding rule gradually prunes each component, the proposed method achieves significant reductions of the components in the first 10 steps. As each cluster center is constrained to one of the observed data points and is slightly apart from the true center of the underlying density, a marginally larger bandwidth than the true bandwidth yields a compact grouping. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

in the single-bandwidth setting, while Fig. 7 shows the results in the adaptive-bandwidth setting. In the adaptive-bandwidth setting, we also include several results with the local ML estimation using the k-nearest-neighbor method. Here the iterative refitting algorithm and the empirical-Bayes model selection are not applied in either setting.

In both the single- and adaptive-bandwidth settings, the combination of the fast pruning and the Newton–Raphson method drastically reduced the required numbers of iterations. In practice, about 10 iterations were sufficient in both settings, while hundreds or thousands of iterations were required in the EM algorithm.
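For reference, the baseline in this comparison can be sketched as the plain EM update of the mixture weights for a fixed kernel matrix, combined with the thresholding rule λ_i ← 0 if λ_i < 10⁻³/n of [7]. The sketch below shows that baseline only; it is not the proposed accelerated method, and the uniform initialization is an assumption.

```python
import numpy as np

def em_convex_clustering(kappa, n_iter=1000, eps=1e-3):
    """Plain EM for the fixed-kernel mixture weights, with the thresholding rule of [7].

    kappa : (m, n) matrix with kappa[i, j] = value of the i-th kernel at the j-th sample
    """
    m, n = kappa.shape
    lam = np.full(m, 1.0 / m)                    # uniform initialization (assumed)
    for _ in range(n_iter):
        resp = lam[:, None] * kappa              # unnormalized responsibilities
        resp /= resp.sum(axis=0, keepdims=True)  # normalize over components
        lam = resp.mean(axis=1)                  # EM update of the mixture weights
        lam[lam < eps / n] = 0.0                 # thresholding rule: prune tiny weights
        lam /= lam.sum()                         # renormalize onto the simplex
    return lam
```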

Let us add some details about the CPU time. For ease of comparison, in all of the results the times to calculate the kernel matrix and the nearest-neighbor indexes are not included in the figures. Computation of the kernel matrix is required in both the single- and adaptive-bandwidth settings. In the single-bandwidth setting, the proposed algorithm additionally needs to calculate the nearest-neighbor indexes by sorting the kernel matrix elements. The computational costs to calculate the kernel matrix were 1.07 ± 0.07 s when n = 1000 and 15.15 s when n = 4000. The additional costs to sort the elements were 1.81 ± 0.05 s when n = 1000 and 20.34 s when n = 4000. Hence, even when the CPU costs of the additional sorting are included, the proposed algorithm is sufficiently faster than the EM algorithm. In computing the adaptive bandwidths, since both the EM and the proposed algorithms need the sorting of the pairwise distances, the initialization costs are the same. The costs to sort the pairwise distances and compute the adaptive bandwidths were 1.60 ± 0.02 s when n = 1000 and 29.00 s when n = 4000. The costs to calculate the kernel matrix were 0.77 ± 0.02 s when n = 1000 and 13.44 s when n = 4000.

The absolute values of the CPU times are improved from the prior work [8], not by the nature of the algorithms but by


[Figure 7 plots: number of active components versus CPU time and versus the number of iterations for the proposed method and the thresholding rule, in panels n = 1,000 with ξ = 6, 5, 4 and n = 4,000 with ξ = 5; the final partitions are also shown.]

Fig. 7 Experimental comparisons between the proposed method and the EM algorithm using the thresholding rule in the adaptive-bandwidth setting. 1000 or 4000 samples of R² points are distributed from a six-cluster heteroscedastic Gaussian mixture model. The mixture weight is 1/6 in each cluster. The standard deviation for each of the three clusters centered at (−1, −1)^T, (−1, 0)^T, and (0, −1)^T is 0.2. The standard deviation for each of the two clusters centered at (−1, 2)^T and (2, −1)^T is 0.1. The largest standard deviation, assigned to the cluster centered at (1, 1)^T, is 0.5. The results in the first 10 iterations are magnified. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

the programming languages adopted. We adopted Java™ for the new implementation, while GNU Octave was used for the initial implementation. In the GNU Octave version, the pruning process could not be written as matrix calculations and the primitive commands used were costly. Since we need to run the algorithm many times in the later experiments, we newly implemented a Java™ program.

Next we evaluated the convergence on higher-dimensional artificial datasets. As these artificial datasets are also used in the later experiments, we summarize here how they are generated. Two thousand data points in R^d were generated from c-cluster heteroscedastic Gaussian mixture models in which all of the mixture weights were 1/c. The c cluster centroids were distributed from N(0, 5²I_d), and each cluster variance was the inverse of a value sampled from a chi-square distribution with ν degrees of freedom, scaled so that the cluster variances have mean σ0². The scale σ0 gives the basic noise level of the data, while ν gives the degree of heteroscedasticity; lower values of ν make the datasets more heteroscedastic.
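A minimal sketch of this data-generation procedure is given below. The rescaling of the inverse chi-square draws so that the cluster variances average σ0² is our reading of the description above and is marked as an assumption in the code.

```python
import numpy as np

def make_heteroscedastic_gmm(n=2000, d=20, c=20, sigma0=20.0, nu=5, seed=0):
    """Sample an artificial dataset as described above (one plausible reading of the text).

    Centroids ~ N(0, 5^2 I_d); each cluster variance is the inverse of a chi-square draw
    with nu degrees of freedom, rescaled so that the cluster variances average sigma0^2
    (an assumption made here to match "sigma0 gives the basic noise level").
    """
    rng = np.random.default_rng(seed)
    centroids = rng.normal(0.0, 5.0, size=(c, d))
    inv_draws = 1.0 / rng.chisquare(nu, size=c)
    variances = sigma0**2 * inv_draws / inv_draws.mean()   # rescale to mean sigma0^2
    labels = rng.integers(0, c, size=n)                    # equal mixture weights 1/c
    X = centroids[labels] + rng.normal(size=(n, d)) * np.sqrt(variances[labels])[:, None]
    return X, labels
```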

For several datasets, Fig. 8 shows the measured computational costs of the initial convex clustering using large bandwidths. On the higher-dimensional datasets, the proposed algorithm still outperformed the EM algorithm. Refitting was not applied in the results of Fig. 8, so we briefly explain the computational costs of refitting here. Since the second convex clustering starts from a condition whose number of clusters is given by the initial convex clustering, the computational cost of the second convex clustering is less than that of the initial one. Refitting the cluster parameters is usually more costly for high-dimensional data, fundamentally requiring O(nd|A_t|) operations, where A_t is the set of active components at the t-th iteration of the initial clustering. Yet this order is the same as that of the standard c-means algorithm when c = |A_t|.

7.2. Performance Dependency on the Initial Bandwidths

This experiment investigates how the prediction performances depend on the initial bandwidths and aims to establish a rule of thumb that reduces the cost of searching for appropriate hyperparameters. The artificial high-dimensional datasets introduced in Section 7.1 include the


[Figure 8 plots: number of active components versus CPU time and versus the number of iterations for the proposed method and the thresholding rule, with panels for (d, s0) = (20, 20), (20, 60), (500, 20), and (500, 60), each under two bandwidth settings (b = 0.1 and 0.2 for d = 20; b = 0.005 and 0.01 for d = 500).]

Fig. 8 Experimental comparisons between the proposed method and the EM algorithm using the thresholding rule λ_i ← 0 if λ_i < 10⁻³/n [7], in higher-dimensional datasets. The number of clusters is c = 20 and the degrees of freedom setting the heteroscedasticity is ν = 5. In all of the settings, the proposed algorithm outperformed the EM algorithm. When d = 500, the larger number of iterations needed for convergence is caused by the difficulty of the pruning judgment, where every kernel vector is similar to every other. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

information of the true cluster labels. Hence we evaluated the detectability of these cluster labels with the unsupervised convex clustering, for various settings of the rough number of clusters ξ and the inverse temperature β. In addition, we studied the characteristics of the hyperparameter selection with the proposed empirical-Bayes method.

The detectability of the hidden clusters is measured as the Normalized Mutual Information (NMI) between the estimated and the true cluster labels. For two partitions Ψ = {ψ_1, ..., ψ_{|Ψ|}} and Ω = {ω_1, ..., ω_{|Ω|}} of the n samples, the NMI between Ψ and Ω is given as 2I(Ψ, Ω)/[H(Ψ) + H(Ω)], where

I(Ψ, Ω) = (1/n) ∑_{a=1}^{|Ψ|} ∑_{b=1}^{|Ω|} |ψ_a ∩ ω_b| log [ n|ψ_a ∩ ω_b| / (|ψ_a| |ω_b|) ]  and  H(Ω) = −∑_{b=1}^{|Ω|} (|ω_b|/n) log (|ω_b|/n).

A higher NMI indicates superior predictive performance. The entropy normalization factor 2/[H(Ψ) + H(Ω)] makes NMI a clustering quality measure that can compare results having different numbers of clusters.
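For concreteness, the NMI defined above can be computed as in the following sketch (label vectors are assumed to be integer arrays; the helper is illustrative, not the evaluation code used in the paper).

```python
import numpy as np

def nmi(labels_true, labels_pred):
    """Normalized Mutual Information 2 I(Psi, Omega) / [H(Psi) + H(Omega)] as defined above."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n = len(labels_true)
    psi = [np.where(labels_pred == k)[0] for k in np.unique(labels_pred)]    # estimated clusters
    omega = [np.where(labels_true == k)[0] for k in np.unique(labels_true)]  # true clusters

    def entropy(parts):
        p = np.array([len(s) / n for s in parts])
        return -np.sum(p * np.log(p))

    mi = 0.0
    for a in psi:
        for b in omega:
            common = len(np.intersect1d(a, b, assume_unique=True))
            if common > 0:
                mi += (common / n) * np.log(n * common / (len(a) * len(b)))
    return 2.0 * mi / (entropy(psi) + entropy(omega))
```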

In generating the datasets, we paid attention to the basic noise level σ0. In Ref. [8], σ0 = 1 was used with a reference to [7]. Yet we found that this setting makes the datasets unrealistically clean, as shown in Fig. 9. As we think noisier data should be used for assessing the predictive capabilities, larger values of σ0 are adopted in this paper.

We performed a grid search over the rough number of clusters ξ and the inverse temperature β. The range of β was determined by the dimension d. As the inverse temperature β was introduced to relax the factors exp(d/2) and exp(√d) in Section 4.3, we set the


[Figure 9 plots: two-dimensional PCA projections of artificial datasets with (d, s0) = (20, 1), (20, 20), (500, 1), and (500, 20).]

Fig. 9 Example of the input density in the artificial datasets. Each figure shows a two-dimensional projection of an artificial high-dimensional dataset, obtained with Principal Component Analysis. Each sample's color and symbol represents the cluster it belongs to. For all of the datasets here, the number of clusters is c = 20 and the degrees of freedom setting the heteroscedasticity is ν = 5. Though we perform both the proposed and the reference clustering algorithms without dimension reduction techniques, we believe that σ0 = 1 makes the separation of the samples too easy. Hence noisier datasets are adopted in this paper than in the prior work. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

minimum value of β to 1/d. At β = 1/d, most of the datasets were clustered into only a single component or a few components. Hence we prepared the candidates of β as β ∈ {d⁻¹, d⁻⁰·⁹, ..., d⁻⁰·¹, 1}. The rough number of clusters ξ was chosen from 10, 20, ..., 100, and in total a 10 × 10 grid search was performed for each dataset.
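The candidate grid just described can be written down directly; in the sketch below, run_convex_clustering is a hypothetical placeholder for the repeated convex clustering with refitting and its marginal-likelihood scoring, not an actual function of this paper.

```python
# Candidate grid for the inverse temperature beta and the rough number of clusters xi.
d = 500                                              # data dimensionality
betas = [d ** (-1.0 + 0.1 * k) for k in range(11)]   # d^-1, d^-0.9, ..., d^-0.1, 1
xis = list(range(10, 101, 10))                       # 10, 20, ..., 100

# Grid search; `run_convex_clustering` is a hypothetical placeholder returning a
# marginal-likelihood score for the dataset X under (beta, xi).
# best_beta, best_xi = max(
#     ((beta, xi) for beta in betas for xi in xis),
#     key=lambda bx: run_convex_clustering(X, beta=bx[0], xi=bx[1]),
# )
```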

As β decreases, the influence of the absolute scale of the local bandwidths is weakened, because every element of the kernel matrix takes a similar value, while the relative ratios among the local bandwidths remain important. In high-dimensional datasets, because the squared distance ‖x_j − μ_i‖² becomes close to its mean dσ_i², the local bandwidth estimate σ_i² does not strongly depend on the number of neighbors determined by ξ. Hence only the ratios among the values {σ_i²}_{i=1}^m are meaningful, and we can expect that ξ does not affect the performances as strongly as β does.

Figure 10 shows the contour plots of the NMI scores depending on ξ and β. In these results, we applied the refitting of the cluster parameters. The initial convex clustering was performed with 10 iterations of the sequential minimal optimization. We applied refitting five times, and 10 iterations of the convex clustering followed each refitting, so 10 + 5 × 10 = 60 iterations were performed in total during the optimization. As expected, the results were more sensitive to the inverse temperature β than to the rough number of clusters ξ, although ξ sometimes affected the performances at specific values of β.

The empirical-Bayes method gave an almost optimal clustering when the noise level was not too high, while it tended to select more parsimonious models than the best ones. At a middle noise level, the optimal β was around 1/√d, which is the scale of the squared distances after the discounting in Eq. (6).

7.3. Unsupervised Classification

In this experiment, the detectability of the hidden clusters by the proposed algorithm was compared to that of the reference algorithms. The proposed algorithm consists of the convex clustering, the refitting of the cluster parameters, and the empirical-Bayes hyperparameter selection. For each parameter setting of the artificial datasets, 20 independent sets were randomly generated and we evaluated the average performance over the 20 sets. The rough number of clusters was set as ξ = c and the inverse temperature β was chosen from {d⁻¹, d⁻⁰·⁹, ..., 1} inside the fitting algorithm.

One reference method was the adaptive-bandwidth soft c-means algorithm using DA, in which a Gaussian mixture model with equal mixture weights, c⁻¹ ∑_{i=1}^{c} N(x; μ_i, σ_i² I_d), is fitted. The parameters {μ_1, σ_1², ..., μ_c, σ_c²} were optimized with the variances regularized as σ_i² ≥ 0.01. In DA, we maximized a relaxed log-likelihood ∑_{j=1}^{n} log c⁻¹ ∑_{i=1}^{c} N^ρ(x_j; μ_i, σ_i² I_d), and the annealing was performed first with ρ = 0.1, second with ρ = 0.5, and finally with ρ = 1. We also implemented another version of the soft c-means algorithm that repeats random initializations 20 times and picks the best clustering.
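A hedged sketch of this reference method follows: a soft c-means with spherical, per-cluster variances, responsibilities computed from the ρ-relaxed Gaussian densities, the annealing schedule ρ = 0.1, 0.5, 1, and the variance floor σ_i² ≥ 0.01. The initialization and iteration counts are assumptions; this is not the exact implementation used in the experiments.

```python
import numpy as np

def soft_cmeans_da(X, c, rhos=(0.1, 0.5, 1.0), n_iter=100, var_floor=0.01, seed=0):
    """Adaptive-bandwidth soft c-means with deterministic annealing (reference method sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, size=c, replace=False)]   # initialize centers at random samples (assumed)
    var = np.full(c, X.var())

    for rho in rhos:                                # annealing schedule rho = 0.1, 0.5, 1
        for _ in range(n_iter):
            sq = ((X[None, :, :] - mu[:, None, :]) ** 2).sum(axis=2)   # (c, n) squared distances
            log_phi = -0.5 * sq / var[:, None] - 0.5 * d * np.log(2 * np.pi * var)[:, None]
            logits = rho * log_phi                                      # N^rho relaxation
            logits -= logits.max(axis=0, keepdims=True)                 # stabilize softmax
            resp = np.exp(logits)
            resp /= resp.sum(axis=0, keepdims=True)                     # (c, n) responsibilities
            nk = resp.sum(axis=1)
            mu = (resp @ X) / nk[:, None]                               # M-step: means
            sq_new = ((X[None, :, :] - mu[:, None, :]) ** 2).sum(axis=2)
            var = (resp * sq_new).sum(axis=1) / (d * nk)                # M-step: spherical variances
            var = np.maximum(var, var_floor)                            # regularize sigma_i^2 >= 0.01
    return mu, var, resp
```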

Another reference method was Dirichlet Process Mixtures (DPM) with variational inference [22], in which both the bandwidths and the number of clusters were chosen automatically. With DPM, fully parametrized Gaussian mixture models ∑_i λ_i N(x; μ_i, σ_i² I_d) were fitted, and the posterior mean values of the parameters {λ_i, μ_i, σ_i²}_i were used for the post-processing hard clustering. In the DPM sampling scheme G|α, G_0 ∼ DP(α, G_0), we placed a hyperprior α ∼ Gamma(1, 1). For the mean μ and the covariance matrix σ²I_d of each cluster, a product measure N(μ; μ_0, σ²I_d) Gamma(1/σ²; 0.1, 0.1) was used as the base distribution G_0. While we limited the maximum


[Figure 10 contour plots: NMI as a function of log(beta) (horizontal axis) and the rough number of clusters (vertical axis), for (d, c) = (20, 20), (500, 20), (20, 100), and (500, 100), each with s0 = 20 and s0 = 80; parenthesized annotations give the number of clusters obtained at the marked hyperparameters.]

Fig. 10 Comparisons of hyperparameter settings in terms of the Normalized Mutual Information (NMI) between the true and the estimated cluster labels. The black symbol marks the best hyperparameter for predicting the true labels, while the blue symbol '•' marks the hyperparameter chosen with the unsupervised empirical-Bayes method. As expected, the inverse temperature β influenced the performance score more strongly than the rough number of clusters ξ, whereas the rough number of clusters was also meaningful in handling high-noise data. On middle-noise datasets, the maximum marginal likelihood hyperparameter almost gave the highest NMI score. On high-noise datasets, the empirical-Bayes method preferred parsimonious models. When the dimensionality is high, the optimal hyperparameter tended to be located near β = 1/√d, which is the middle point of the x-axis. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

number of mixtures to 150, the number of clusters was chosen automatically in DPM.
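The DPM baseline above uses the variational inference of [22] with the stated priors. As a rough, library-based analogue (not the implementation used here, and with a different prior parametrization), a truncated Dirichlet-process Gaussian mixture with spherical covariances and at most 150 components can be configured with scikit-learn as follows.

```python
from sklearn.mixture import BayesianGaussianMixture

# Analogue of the DPM reference method: truncated Dirichlet-process Gaussian mixture
# with spherical covariances. This is not the paper's implementation; the concentration
# and base-measure priors only roughly correspond to alpha ~ Gamma(1, 1) and G_0 above.
dpm = BayesianGaussianMixture(
    n_components=150,                                    # truncation level used in the text
    covariance_type="spherical",                         # sigma_i^2 I_d clusters
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=1.0,                      # plays roughly the role of alpha
    max_iter=500,
)
# labels = dpm.fit_predict(X)   # hard clustering from the fitted posterior; X is the data matrix
```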

Figure 11 shows the comparison results, where the performance stability of the convex clustering is prominent, especially on the high-dimensional datasets. The performances of DA and of many random initializations were highly volatile. While the variance of the DPM's performance was smaller than those of DA and random initializations, DPM struggled to find the true cluster structure in the underlying density because of its nature of automatically determining the number of clusters. In contrast, the exemplar-based nature of the convex clustering was more beneficial in robustly detecting the underlying structure. While we used the true number of clusters as the rough number of clusters ξ, we already confirmed that ξ does not affect the performance as strongly as β. The automatic selection with the empirical-Bayes method can recover the true clusters in middle-noise conditions. Improving the high-noise condition is left for future work.

8. CONCLUSION

This paper introduced an accelerated algorithm for the exemplar-based convex clustering. The proposed algorithm consists of the accurate pruning of the irrelevant kernels and the element-wise Newton–Raphson method for optimizing the positive mixture weights. For high-dimensional data, the accelerated algorithm is first applied with large initial bandwidths; then refitting of the cluster parameters and the accelerated pruning of the irrelevant kernels are repeated several times. For isotropic Gaussian-cluster data, the proposed empirical-Bayes method was able to successfully determine the optimal clustering. Unlike existing maximum-likelihood or Bayesian clustering algorithms, the proposed algorithm does not depend on random initializations and satisfies requirements that arise in real practice.

Finally, we address how to choose the number of initial kernels m. While this paper only treated data whose number of samples n is not so large, some practitioners are required to handle huge data. In such cases, we must admit some randomness in the results. Yet we should still try to acquire stable results by taking more initial kernels than the true number of clusters. The essence of the convex clustering lies in the process of choosing the relevant kernels from the given set of kernels. While the c-means algorithm directly samples the c initial kernels, the convex clustering with m > c can choose the initial kernels more accurately.


[Figure 11 plots: NMI versus σ0 for (c, ν) = (50, 5); versus the number of clusters for (σ0, ν) = (20, 5); and versus the degrees of freedom for (σ0, c) = (20, 5); left column d = 20, right column d = 500. Curves: Soft c-means (many trials), Soft c-means (annealing), Dirichlet Process Mixtures, and Convex clustering repetitions.]

Fig. 11 Comparisons of the detectability of the hidden clusters in the higher-dimensional datasets. Compared to the other c-means-type algorithms, the convex clustering with refitting exhibited stable performances, though the average performances were slightly inferior to the soft c-means when d = 20. The main reason for the poor performance at d = 20 and σ0 > 80 was the empirical-Bayes model selection, where parsimonious models were chosen to regularize the fitted density. Yet in the other settings, the automatic selection with the empirical-Bayes method achieved higher scores in detecting the hidden clusters. For the high-dimensional data (d = 500), the proposed algorithm was prominently stable and outperformed the other algorithms in almost all of the settings. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Hence we recommend setting c < m ≪ n when handling huge data. Indeed, when n is huge, the probability that the m sampled kernels include the relevant ones is high.

In future work, we plan to apply the sequential minimal optimization to supervised learning tasks and to extend the convex clustering to more complex optimizations that involve many local optima. One of the important extensions is the detection of non-spherical or non-elliptic clusters based on spectral methods. Spectral clustering and the convex clustering differ in their parameter specifications: the number of top-k eigenvectors or the bandwidths of the kernels are specified, respectively. How to keep the parameter specifications of these algorithms consistent is an open problem. In addition, we will analyze the behaviors when richer types


of kernel distributions are adopted instead of the isotropic Gaussian distributions.

Another direction under consideration is an application to hidden Markov models [31–33], where temporal dependencies exist among the hidden clusters. Since some clusters can have the same emission distributions while having different temporal dynamics [32], many local optima exist in fitting a hidden Markov model, and identifying the true clusters is more challenging than in the standard clustering problem. The value of the globally optimal clustering algorithm would be even more prominent in these challenging tasks.

REFERENCES

[1] J. B. MacQueen, Some methods for classification and analysis of multivariate observations, In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol 1, 1967, 281–297.
[2] M. R. Anderberg, Cluster Analysis for Applications, New York, Academic Press, 1973.
[3] E. W. Forgy, Cluster analysis of multivariate data: efficiency versus interpretability of classifications, Biometrics 21 (1965), 768–769.
[4] A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J R Stat Soc B 39(1) (1977), 1–38.
[5] K. Rose, Deterministic annealing for clustering, compression, classification, regression, and related optimization problems, In Proceedings of the IEEE, Vol 86, 1998, 2210–2239.
[6] N. Ueda, and R. Nakano, Deterministic annealing EM algorithm, Neural Networks 11(2) (1998), 271–282.
[7] D. Lashkari, and P. Golland, Convex clustering with exemplar-based models, In Advances in Neural Information Processing Systems 20, J. C. Platt, D. Koller, Y. Singer, and S. Roweis, eds. Cambridge, MA, MIT Press, 2008, 825–832.
[8] R. Takahashi, Sequential minimal optimization in adaptive-bandwidth convex clustering, In Proceedings of the 11th SIAM International Conference on Data Mining (SDM 2011), 2011, 896–907.
[9] J. C. Platt, Advances in kernel methods: support vector learning, Cambridge, MA, MIT Press, 1999, 185–208.
[10] I. Csiszar, and P. C. Shields, Information theory and statistics: a tutorial, Foundat Trends Commun Inform Theory 1(4) (2004), 417–528.
[11] R. A. Redner, and H. F. Walker, Mixture densities, maximum likelihood, and the EM algorithm, SIAM Rev 26 (1984), 195–239.
[12] L. Xu, and M. I. Jordan, On convergence properties of the EM algorithm for Gaussian mixtures, Neural Comput 8 (1995), 129–151.
[13] V. N. Vapnik, The Nature of Statistical Learning Theory, New York, Springer, 1995.
[14] T. Minka, Estimating a Dirichlet distribution, Technical report, Microsoft Research, 2003.
[15] C. Zahn, Graph-theoretical methods for detecting and describing gestalt clusters, IEEE Trans Comput C-20(1) (1971), 68–86.
[16] O. Grygorash, Y. Zhou, and Z. Jorgensen, Minimum spanning tree based clustering algorithms, In Proceedings of the 18th IEEE International Conference on Tools with Artificial Intelligence (ICTAI '06), IEEE Computer Society, Washington, DC, 2006, 73–81.
[17] C. Yang, R. Duraiswami, N. A. Gumerov, and L. Davis, Improved fast gauss transform and efficient kernel density estimation, In Proceedings of the Ninth IEEE International Conference on Computer Vision (ICCV '03), Vol 1, 2003, 664–671.
[18] M. A. Carreira-Perpinan, Gaussian mean shift is an EM algorithm, IEEE Trans Pattern Anal Mach Intell 29(5) (2007), 767–776.
[19] B. J. Frey, and D. Dueck, Clustering by passing messages between data points, Science 315 (2007), 972–976.
[20] C. Antoniak, Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems, Ann Stat 2 (1974), 1152–1174.
[21] T. S. Ferguson, Bayesian density estimation by mixtures of normal distributions, Rec Adv Stat (1983), 287–302.
[22] D. M. Blei, and M. I. Jordan, Variational inference for Dirichlet process mixtures, Bayesian Anal 1(1) (2006), 121–144.
[23] K. Kurihara, M. Welling, and Y. W. Teh, Collapsed variational Dirichlet process mixture models, In Proceedings of the International Joint Conference on Artificial Intelligence, Vol 20, 2007.
[24] H. Ishwaran, and L. F. James, Gibbs sampling methods for stick-breaking priors, J Am Stat Assoc 96 (2001), 161–173.
[25] S. Walker, Sampling the Dirichlet mixture model with slices, Commun Stat Simul Comput 36 (2007), 45–54.
[26] O. Papaspiliopoulos, and G. O. Roberts, Retrospective MCMC for Dirichlet process hierarchical models, Biometrika 95 (2008), 169–186.
[27] D. Tarlow, R. Zemel, and B. Frey, Flexible priors for exemplar-based clustering, In The 24th Conference on Uncertainty in Artificial Intelligence (UAI 2008), 2008.
[28] I. S. Dhillon, Y. Guan, and B. Kulis, Kernel k-means, spectral clustering and normalized cuts, In 10th ACM KDD Conference, 2004, 551–556.
[29] L. Zelnik-Manor, and P. Perona, Self-tuning spectral clustering, In Advances in Neural Information Processing Systems, Lawrence K. Saul, Yair Weiss, and Leon Bottou, eds. Cambridge, MA, MIT Press, 2005, 1601–1608.
[30] M. Sugiyama, T. Suzuki, S. Nakajima, H. Kashima, P. von Bunau, and M. Kawanabe, Direct importance estimation for covariate shift adaptation, Ann Inst Stat Math 60(4) (2008), 699–746.
[31] M. J. Beal, Z. Ghahramani, and C. E. Rasmussen, The infinite hidden Markov model, In Advances in Neural Information Processing Systems 14, T. G. Dietterich, S. Becker, and Z. Ghahramani, eds. Cambridge, MA, MIT Press, 2002, 577–584.
[32] S. M. Siddiqi, Fast state discovery for HMM model selection and learning, In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS 2007), 2007.
[33] J. Van Gael, Y. Saatci, Y. Teh, and Z. Ghahramani, Beam sampling for the infinite hidden Markov model, In Proceedings of the 25th Annual International Conference on Machine Learning (ICML 2008), Andrew McCallum, and Sam Roweis, eds. Omnipress, 2008, 1088–1095.
