Optimal interval clustering: Application to Bregman clustering and statistical mixture learning


Description

We present a generic dynamic programming method to compute the optimal clustering of n scalar elements into k pairwise disjoint intervals. This case includes 1D Euclidean k-means, k-medoids, k-medians, k-centers, etc. We extend the method to incorporate cluster size constraints and show how to choose the appropriate k by model selection. Finally, we illustrate and refine the method on two case studies: Bregman clustering and statistical mixture learning maximizing the complete likelihood. http://arxiv.org/abs/1403.2485

Transcript of Optimal interval clustering: Application to Bregman clustering and statistical mixture learning

Page 1: Optimal interval clustering: Application to Bregman clustering and statistical mixture learning

Optimal interval clustering: Application to Bregman clustering and statistical mixture learning

Frank Nielsen (1), Richard Nock (2)

(1) Sony Computer Science Laboratories / Ecole Polytechnique
(2) UAG-CEREGMIA / NICTA

May 2014


Page 2: Optimal interval clustering: Application to Bregman clustering and statistical mixture learning

Hard clustering: Partitioning the data set

- Partition $X = \{x_1, ..., x_n\} \subset \mathcal{X}$ into $k$ clusters $C_1 \subset X, ..., C_k \subset X$:

$$X = \biguplus_{i=1}^{k} C_i$$

- Center-based (prototype) hard clustering: k-means [2], k-medians, k-center, $\ell_r$-center [10], etc.

- Model-based hard clustering: statistical mixtures maximizing the complete likelihood (prototype = model parameter).

- k-means: NP-hard when $d > 1$ and $k > 1$ [11, 7, 1].

- k-medians and k-centers: NP-hard [12] (1984).

- In 1D, k-means is polynomial [3, 15]: $O(n^2 k)$.


Page 3: Optimal interval clustering: Application to Bregman clustering and statistical mixture learning

Euclidean 1D k-means

- 1D k-means [8]: optimal clusters form contiguous intervals.

- Solved by enumerating all $\binom{n-1}{k-1}$ contiguous partitions in 1D (1958). Better than the Stirling numbers of the second kind $S(n, k)$ that count all partitions.

- Polynomial in time $O(n^2 k)$ using Dynamic Programming (DP) [3] (sketched in two pages in 1973).

- R package Ckmeans.1d.dp [15] (2011).


Page 4: Optimal interval clustering: Application to Bregman clustering and statistical mixture learning

Interval clustering: Structure

Sort $X \subset \mathcal{X}$ with respect to the total order $<$ on $\mathcal{X}$ in $O(n \log n)$ time.

Output represented by:

- $k$ intervals $I_i = [x_{l_i}, x_{r_i}]$ such that $C_i = I_i \cap X$,
- or better, $k - 1$ delimiters $l_i$ ($i \in \{2, ..., k\}$), since $r_i = l_{i+1} - 1$ (for $i < k$), $r_k = n$, and $l_1 = 1$.

$$\underbrace{[x_1 ... x_{l_2-1}]}_{C_1}\; \underbrace{[x_{l_2} ... x_{l_3-1}]}_{C_2}\; ...\; \underbrace{[x_{l_k} ... x_n]}_{C_k}$$


Page 5: Optimal interval clustering: Application to Bregman clustering and statistical mixture learning

Objective function for interval clustering

Scalars $x_1 < ... < x_n$ are partitioned contiguously into $k$ clusters: $C_1 < ... < C_k$.

Clustering objective function:

$$\min\; e_k(X) = \bigoplus_{j=1}^{k} e_1(C_j)$$

$e_1(\cdot)$: intra-cluster cost/energy; $\oplus$: inter-cluster cost/energy (commutative, associative).

Note: $n = kp + 1$ equally spaced 1D points → $k$ different optimal clustering partitions (the optimum need not be unique).


Page 6: Optimal interval clustering: Application to Bregman clustering and statistical mixture learning

Examples of objective functions

In arbitrary dimension $\mathcal{X} = \mathbb{R}^d$:

- $\ell_r$-clustering ($r \geq 1$): $\oplus = \sum$,

$$e_1(C_j) = \min_{p \in \mathcal{X}} \sum_{x \in C_j} d(x, p)^r$$

(the argmin, the prototype $p_j$, is the same whether or not we take the power $1/r$ of the sum). Euclidean $\ell_r$-clustering: $r = 1$ median, $r = 2$ means.

- $k$-center ($\lim_{r \to \infty}$): $\oplus = \max$,

$$e_1(C_j) = \min_{p \in \mathcal{X}} \max_{x \in C_j} d(x, p)$$

- Discrete clustering: the search space in the min is $C_j$ instead of $\mathcal{X}$.

Note that in 1D, the $\ell_s$-norm distance is always $d(p, q) = |p - q|$, independent of $s \geq 1$.


Page 7: Optimal interval clustering: Application to Bregman clustering and statistical mixture learning

Optimal interval clustering by Dynamic Programming

$X_{j,i} = \{x_j, ..., x_i\}$ ($j \leq i$), $X_i = X_{1,i} = \{x_1, ..., x_i\}$.
$E = [e_{i,j}]$: $n \times k$ cost matrix, $O(n \times k)$ memory, with $e_{i,m} = e_m(X_i)$.

Optimality equation:

$$e_{i,m} = \min_{m \leq j \leq i} \{ e_{j-1,m-1} \oplus e_1(X_{j,i}) \}$$

with an associative/commutative operator $\oplus$ ($+$ or $\max$).
Initialize with $e_{i,1} = e_1(X_i)$.
$E$ is computed column by column, from left to right.
The best clustering cost is read at $e_{n,k}$.
Time: $n \times k \times O(n) \times T_1(n) = O(n^2 k\, T_1(n))$, with $O(nk)$ memory ($T_1(n)$: time to evaluate $e_1$).
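
To make the recurrence concrete, here is a minimal Python sketch (our own illustration, not from the slides) of the DP for the 1D Euclidean k-means cost, where $e_1(X_{j,i})$ is evaluated in $O(1)$ from prefix sums and $\oplus$ is ordinary addition; the name `interval_kmeans_dp` is ours.

```python
# Minimal sketch: interval-clustering DP for 1D k-means.
# e1(X_{j,i}) = sum of squared deviations to the mean, in O(1) from prefix sums.
import numpy as np

def interval_kmeans_dp(x, k):
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    s1 = np.concatenate(([0.0], np.cumsum(x)))        # prefix sums of x
    s2 = np.concatenate(([0.0], np.cumsum(x * x)))     # prefix sums of x^2

    def e1(j, i):  # cost of cluster {x_j, ..., x_i}, 1-based inclusive indices
        m = i - j + 1
        s, ss = s1[i] - s1[j - 1], s2[i] - s2[j - 1]
        return ss - s * s / m                           # sum (x - mean)^2

    E = np.full((n + 1, k + 1), np.inf)                 # E[i][m] = e_{i,m}
    S = np.zeros((n + 1, k + 1), dtype=int)             # argmin (left index j)
    for i in range(1, n + 1):
        E[i][1] = e1(1, i)
        S[i][1] = 1
    for m in range(2, k + 1):
        for i in range(m, n + 1):
            for j in range(m, i + 1):
                c = E[j - 1][m - 1] + e1(j, i)
                if c < E[i][m]:
                    E[i][m], S[i][m] = c, j
    return E, S, x   # optimal k-clustering cost of the sorted data is E[n][k]
```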


Page 8: Optimal interval clustering: Application to Bregman clustering and statistical mixture learning

Retrieving the solution: Backtracking

Use an auxiliary matrix $S = [s_{i,j}]$ for storing the argmin.

Backtrack in O(k) time.

- Left index $l_k$ of $C_k$ is stored at $s_{n,k}$: $l_k = s_{n,k}$.

- Iteratively retrieve the previous left interval indices: $l_j = s_{l_{j+1}-1,\, j}$ for $j = k-1, ..., 1$.

Note that $l_j - 1 = n - \sum_{l=j}^{k} n_l$ and $l_j - 1 = \sum_{l=1}^{j-1} n_l$.
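
A matching backtracking sketch (continuing the hypothetical `interval_kmeans_dp` above), recovering the left indices from the argmin matrix in $O(k)$ time:

```python
# Sketch: recover the left indices l_1, ..., l_k from the argmin matrix S
# produced by interval_kmeans_dp (E[n][k] holds the optimal cost).
def backtrack(S, n, k):
    lefts = [0] * (k + 1)           # lefts[j] = l_j (1-based); lefts[0] unused
    right = n
    for j in range(k, 0, -1):
        lefts[j] = int(S[right][j]) # left index of cluster C_j
        right = lefts[j] - 1        # C_{j-1} ends just before C_j starts
    return lefts[1:]                # l_1 = 1, ..., l_k

# Clusters are then C_j = {x_{l_j}, ..., x_{l_{j+1}-1}}, with the last one ending at x_n.
```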


Page 9: Optimal interval clustering: Application to Bregman clustering and statistical mixture learning

Optimizing time with a Look Up Table (LUT)

Save time when computing $e_1(X_{j,i})$ since we perform $n \times k \times O(n)$ such computations.

Look-Up Table (LUT): add an extra $n \times n$ matrix $E_1$ with $E_1[j][i] = e_1(X_{j,i})$.
Build it in $O(n^2 T_1(n))$ time; the DP then runs in $O(n^2 k)$ additional time.

→ quadratic amount of memory, prohibitive for large $n$ (say $n > 10000$).


Page 10: Optimal interval clustering: Application to Bregman clustering and statistical mixture learning

DP solver with cluster size constraints

$n_i^-$ and $n_i^+$: lower/upper bound constraints on $n_i = |C_i|$, with $\sum_{l=1}^{k} n_l^- \leq n$ and $\sum_{l=1}^{k} n_l^+ \geq n$.

When there are no constraints: add dummy constraints $n_i^- = 1$ and $n_i^+ = n - k + 1$.

$n_m = |C_m| = i - j + 1$ must satisfy $n_m^- \leq n_m \leq n_m^+$ → $j \leq i + 1 - n_m^-$ and $j \geq i + 1 - n_m^+$.

$$e_{i,m} = \min_{\max\{1 + \sum_{l=1}^{m-1} n_l^-,\; i + 1 - n_m^+\} \;\leq\; j \;\leq\; i + 1 - n_m^-} \{ e_{j-1,m-1} \oplus e_1(X_{j,i}) \}$$
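
A small sketch of how these bounds shrink the inner loop of the DP above (our illustration; `n_minus[m]` and `n_plus[m]` hold the assumed size bounds $n_m^-$ and $n_m^+$, 1-based):

```python
# Sketch: range of admissible left indices j for cluster C_m ending at i,
# given per-cluster size bounds n_minus[m] <= |C_m| <= n_plus[m]
# (lists of length k+1 with index 0 unused).
def j_range(i, m, n_minus, n_plus):
    lo = max(1 + sum(n_minus[1:m]), i + 1 - n_plus[m])
    hi = i + 1 - n_minus[m]
    return range(lo, hi + 1)   # replaces range(m, i + 1) in the unconstrained DP loop
```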


Page 11: Optimal interval clustering: Application to Bregman clustering and statistical mixture learning

Model selection from the DP table

$m(k) = e_k(X)$

$e_k(X)$ decreases with $k$ and reaches its minimum when $k = n$.
Model selection: trade-off, choose the best model among all models with $k \in [1, n]$.

Regularized objective function: $e'_k(X) = e_k(X) + f(k)$, where $f(k)$ is related to model complexity.
Compute the DP table once for $k = n$ (it contains the tables for all $k = n, ..., 1$ and avoids redundant computations).
Then evaluate the criterion on the last row (indexed by $n$) and choose the argmin of $e'_k$.
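
A minimal sketch of this selection step (our illustration), assuming the DP table `E` was filled once with $k = n$ and that `penalty` implements the chosen $f(k)$:

```python
# Sketch: pick k by minimizing the regularized cost e'_k = e_k(X) + f(k),
# reading e_k(X) = E[n][k] from a single DP table computed with k = n.
def select_k(E, n, penalty):
    best_k, best_score = 1, float("inf")
    for k in range(1, n + 1):
        score = E[n][k] + penalty(k)
        if score < best_score:
            best_k, best_score = k, score
    return best_k

# Example penalty (illustrative only): f(k) = lam * k for some lam > 0.
```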


Page 12: Optimal interval clustering: Application to Bregman clustering and statistical mixture learning

A Voronoi cell condition for DP optimality

elements → interval clusters → prototypes

interval clusters ← prototypes

Voronoi diagram: partition of $\mathcal{X}$ w.r.t. $P = \{p_1, ..., p_k\}$.
Voronoi cell:

$$V(p_j) = \{x \in \mathcal{X} : d^r(x, p_j) \leq d^r(x, p_l)\ \forall l \in \{1, ..., k\}\}$$

Since $x^r$ is a monotonically increasing function on $\mathbb{R}_+$, this is equivalent to

$$V'(p_j) = \{x \in \mathcal{X} : d(x, p_j) < d(x, p_l)\}$$

DP guarantees an optimal clustering when, for all $P$, each $V'(p_j)$ is an interval. The 2-clustering exhibits the Voronoi bisector.


Page 13: Optimal interval clustering: Application to Bregman clustering and statistical mixture learning

1-mean (centroid): O(n) time

$$\min_p \sum_{i=1}^{n} (x_i - p)^2$$

$D(x, p) = (x - p)^2$, $D'(x, p) = 2(x - p)$, $D''(x, p) = 2$

Convex optimization (existence and uniqueness of the solution):

$$\sum_{i=1}^{n} D'(x_i, p) = 0 \Rightarrow \sum_{i=1}^{n} x_i - np = 0$$

Center of mass $p = \frac{1}{n} \sum_{i=1}^{n} x_i$ (barycenter).

Extends to Bregman divergence:

$$D_F(x, p) = F(x) - F(p) - (x - p) F'(p)$$
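
For concreteness, a small sketch (ours, not from the slides) of a 1D Bregman divergence built from a generator $F$, recovering the squared Euclidean distance for $F(x) = x^2$ and the generalized Kullback-Leibler divergence for $F(x) = x \log x$:

```python
# Sketch: 1D Bregman divergence D_F(x, p) = F(x) - F(p) - (x - p) F'(p).
import math

def bregman(F, Fprime):
    return lambda x, p: F(x) - F(p) - (x - p) * Fprime(p)

# F(x) = x^2 gives the squared Euclidean distance (x - p)^2 ...
squared_euclidean = bregman(lambda x: x * x, lambda x: 2 * x)
# ... and F(x) = x log x (for x, p > 0) gives the generalized KL divergence
# x log(x/p) - x + p.
kullback_leibler = bregman(lambda x: x * math.log(x), lambda x: math.log(x) + 1)
```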


Page 14: Optimal interval clustering: Application to Bregman clustering and statistical mixture learning

2-means: O(n log n) time

Find $x_{l_2}$ ($n - 1$ potential locations: from $x_2$ to $x_n$):

$$\min_{x_{l_2}} \{ e_1(C_1) + e_1(C_2) \}$$

Browse the split point from left to right, $x_{l_2} = x_2, ..., x_n$. Update the cost $E_2(l+1)$ from $E_2(l)$ in constant time (SATs also give $O(1)$ evaluation):

$$E_2(l) = e_2(x_1 ... x_{l-1} \,|\, x_l ... x_n)$$

$$\mu_1(l+1) = \frac{(l-1)\,\mu_1(l) + x_l}{l}, \qquad \mu_2(l+1) = \frac{(n-l+1)\,\mu_2(l) - x_l}{n-l}$$

$$v_1(l+1) = \sum_{i=1}^{l} (x_i - \mu_1(l+1))^2 = \sum_{i=1}^{l} x_i^2 - l\,\mu_1^2(l+1)$$

$$\Delta E_2(l) = E_2(l+1) - E_2(l) = \frac{l-1}{l}\,(x_l - \mu_1(l))^2 - \frac{n-l+1}{n-l}\,(x_l - \mu_2(l))^2$$
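
A hedged sketch (our illustration) of this left-to-right sweep for optimal 1D 2-means; it uses prefix sums (SATs) so that each candidate split is scored in $O(1)$, rather than the explicit incremental mean updates above:

```python
# Sketch: optimal 1D 2-means by a single left-to-right sweep over sorted data,
# using prefix sums so each candidate split is evaluated in O(1).
import numpy as np

def two_means(x):
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    s1 = np.concatenate(([0.0], np.cumsum(x)))
    s2 = np.concatenate(([0.0], np.cumsum(x * x)))

    def sse(a, b):  # sum of squared deviations of x_a, ..., x_b (1-based, inclusive)
        m = b - a + 1
        s = s1[b] - s1[a - 1]
        return (s2[b] - s2[a - 1]) - s * s / m

    # split l means C1 = {x_1, ..., x_{l-1}}, C2 = {x_l, ..., x_n}, l = 2..n
    best = min(range(2, n + 1), key=lambda l: sse(1, l - 1) + sse(l, n))
    return best, sse(1, best - 1) + sse(best, n)
```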


Page 15: Optimal interval clustering: Application to Bregman clustering and statistical mixture learning

2-means: Experiments

Intel Win7 i7-4800

n          Brute force    SAT      Incremental
300000     155.022        0.010    0.0091
1000000    1814.44        0.018    0.015

Do we need sorting and Ω(n log n) time? (k = 1 is linear time)

Note that MaxGap does not yield the optimal separator (because the centroid is the minimizer of the sum of squared distances).


Page 16: Optimal interval clustering: Application to Bregman clustering and statistical mixture learning

Optimal 1D Bregman k-means

Bregman information [2] e1 (generalizes cluster variance):

$$e_1(C_j) = \min_{p_j} \sum_{x_l \in C_j} w_l\, B_F(x_l : p_j) \qquad (1)$$

Expressed as [14]:

$$e_1(C_j) = \Big(\sum_{x_l \in C_j} w_l\Big)\big(p_j F'(p_j) - F(p_j)\big) + \Big(\sum_{x_l \in C_j} w_l F(x_l)\Big) - F'(p_j)\Big(\sum_{x_l \in C_j} w_l x_l\Big)$$

Preprocess using Summed Area Tables [6] (SATs): $S_1(j) = \sum_{l=1}^{j} w_l$, $S_2(j) = \sum_{l=1}^{j} w_l x_l$, and $S_3(j) = \sum_{l=1}^{j} w_l F(x_l)$, built in $O(n)$ time at the preprocessing stage.
Then evaluate the Bregman information $e_1(X_{j,i})$ in constant time $O(1)$.
For example, $\sum_{l=j}^{i} w_l F(x_l) = S_3(i) - S_3(j-1)$, with $S_3(0) = 0$.
Bregman Voronoi diagrams have connected cells [4], thus the DP yields an optimal interval clustering.
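
A minimal sketch (ours) of the SAT preprocessing and the $O(1)$ evaluation of the Bregman information of an interval; `F` is assumed to be a NumPy-vectorized generator, and the simplified form $\sum w_l F(x_l) - (\sum w_l) F(p_j)$ is used, which is algebraically equal to the expression above when $p_j$ is the weighted mean:

```python
# Sketch: Bregman information e1(X_{j,i}) in O(1) time from three
# summed-area tables S1, S2, S3 built in O(n) at preprocessing.
import numpy as np

def make_sats(x, w, F):
    x, w = np.asarray(x, dtype=float), np.asarray(w, dtype=float)
    S1 = np.concatenate(([0.0], np.cumsum(w)))          # sums of w_l
    S2 = np.concatenate(([0.0], np.cumsum(w * x)))       # sums of w_l x_l
    S3 = np.concatenate(([0.0], np.cumsum(w * F(x))))    # sums of w_l F(x_l)
    return S1, S2, S3

def bregman_information(S1, S2, S3, F, j, i):
    W  = S1[i] - S1[j - 1]
    WX = S2[i] - S2[j - 1]
    WF = S3[i] - S3[j - 1]
    p = WX / W                                            # Bregman centroid (weighted mean)
    return WF - W * F(p)                                  # e1(X_{j,i})
```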


Page 17: Optimal interval clustering: Application to Bregman clustering and statistical mixture learning

Exponential families in statistics

Family of probability distributions:

$$\mathcal{F} = \{p_F(x; \theta) : \theta \in \Theta\}$$

Exponential families [13]:

$$p_F(x; \theta) = \exp\big(t(x)\theta - F(\theta) + k(x)\big)$$

For example, the univariate Rayleigh distribution $R(\sigma)$: $t(x) = x^2$, $k(x) = \log x$, $\theta = -\frac{1}{2\sigma^2}$, $\eta = -\frac{1}{\theta}$, $F(\theta) = \log\big(-\frac{1}{2\theta}\big)$, and $F^*(\eta) = -1 + \log\frac{2}{\eta}$.


Page 18: Optimal interval clustering: Application to Bregman clustering and statistical mixture learning

Uniorder exponential families: MLE

Maximum Likelihood Estimator (MLE) [13]:

$$e_1(X_{j,i}) = \hat{l}(x_j, ..., x_i) = F^*(\hat{\eta}_{j,i}) + \frac{1}{i-j+1} \sum_{l=j}^{i} k(x_l),$$

with $\hat{\eta}_{j,i} = \frac{1}{i-j+1} \sum_{l=j}^{i} t(x_l)$.

By making the change of variable $y_l = t(x_l)$, and discarding the $\sum k(x_l)$ terms that are constant for any clustering, we get

$$e_1(X_{j,i}) \equiv F^*\Big(\frac{1}{i-j+1} \sum_{l=j}^{i} y_l\Big)$$
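
A short sketch (ours) of this reduction for the Rayleigh case from the previous slide, with $t(x) = x^2$ and $F^*(\eta) = -1 + \log(2/\eta)$; `y_prefix` is an assumed prefix-sum array of the transformed values $y_l$:

```python
# Sketch: cluster cost for a uni-order exponential family after the change
# of variable y_l = t(x_l); Rayleigh case: t(x) = x^2, F*(eta) = -1 + log(2/eta).
import math

def rayleigh_t(x):
    return x * x

def rayleigh_Fstar(eta):
    return -1.0 + math.log(2.0 / eta)

def e1_interval(y_prefix, Fstar, j, i):
    # y_prefix[i] = y_1 + ... + y_i with y_prefix[0] = 0 (1-based indices, j <= i)
    eta_hat = (y_prefix[i] - y_prefix[j - 1]) / (i - j + 1)
    return Fstar(eta_hat)
```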


Page 19: Optimal interval clustering: Application to Bregman clustering and statistical mixture learning

Hard clustering for learning statistical mixtures

Expectation-Maximization (EM) learns monotonically from an initialization by maximizing the incomplete log-likelihood.
Here: mixture maximizing the complete log-likelihood:

$$l_c(X; L, \Omega) = \sum_{i=1}^{n} \log\big(\alpha_{l_i}\, p(x_i; \theta_{l_i})\big),$$

$L = \{l_i\}_i$: hidden labels.

$$\max l_c \equiv \min_{\theta_1, ..., \theta_k} \sum_{i=1}^{n} \min_{j=1}^{k} \big(-\log p(x_i; \theta_j) - \log \alpha_j\big).$$

For fixed $\alpha$, $-\log p_F(x; \theta)$ amounts to a dual Bregman divergence [2].
Run Bregman k-means; the DP yields an optimal partition since additively-weighted Bregman Voronoi diagrams have interval cells [4].
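
For illustration (ours, not the slides' code), the hard assignment implied by this minimization, given fixed mixing weights $\alpha_j$ and component densities passed as callables:

```python
# Sketch: hard assignment step maximizing the complete likelihood, assigning
# each x_i to the component minimizing -log p(x_i; theta_j) - log(alpha_j).
import math

def hard_assign(x, components, alphas):
    # components: list of density functions p(x; theta_j); alphas: mixing weights
    labels = []
    for xi in x:
        costs = [-math.log(p(xi)) - math.log(a) for p, a in zip(components, alphas)]
        labels.append(min(range(len(costs)), key=costs.__getitem__))
    return labels
```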


Page 20: Optimal interval clustering: Application to Bregman clustering and statistical mixture learning

Hard clustering for learning statistical mixtures

Location families:

$$\mathcal{F} = \Big\{ f(x; \mu) = \frac{1}{\sigma} f_0\Big(\frac{x - \mu}{\sigma}\Big),\ \mu \in \mathbb{R} \Big\}$$

$f_0$: standard density, $\sigma > 0$ fixed. Cauchy or Laplacian families have density graphs intersecting pairwise in exactly one point.

→ singly-connected Maximum Likelihood Voronoi cells.

Model selection: Akaike Information Criterion [5] (AIC):

$$\mathrm{AIC}(x_1, ..., x_n) = -2\, l(x_1, ..., x_n) + 2k + \frac{2k(k+1)}{n - k - 1}$$
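
A direct transcription of this criterion as a sketch (ours); `loglik` stands for $l(x_1, ..., x_n)$ and `k` for the number of free parameters as used on the slide:

```python
# Sketch: the (corrected) Akaike Information Criterion exactly as written above.
def aic(loglik, k, n):
    return -2.0 * loglik + 2 * k + 2.0 * k * (k + 1) / (n - k - 1)
```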


Page 21: Optimal interval clustering: Application to Bregman clustering and statistical mixture learning

Experiments with Gaussian Mixture Models (GMMs)

gmm1 score = −3.0754314021966658 (Euclidean k-means)
gmm2 score = −3.038795325884112 (Bregman k-means, better)


Page 22: Optimal interval clustering: Application to Bregman clustering and statistical mixture learning

Conclusion

- Generic DP for solving interval clustering:
  - $O(n^2 k\, T_1(n))$ time using $O(nk)$ memory
  - $O(n^2 T_1(n))$ time using $O(n^2)$ memory (with the LUT)

- Refine the DP by adding minimum/maximum cluster size constraints

- Model selection from the DP table

- Two applications:
  - 1D Bregman $\ell_r$-clustering; 1D Bregman k-means in $O(n^2 k)$ time using $O(nk)$ memory with Summed Area Tables (SATs)
  - Mixture learning maximizing the complete likelihood:
    - for uni-order exponential families, this amounts to a dual Bregman k-means on $Y = \{y_i = t(x_i)\}_i$
    - for location families with density graphs intersecting pairwise in one point (Cauchy, Laplacian: $\notin$ exponential families)


Page 23: Optimal interval clustering: Application to Bregman clustering and statistical mixture learning

Perspectives

Ω(n log n) for sorting.

Hierarchical center-based clustering with single-linkage: clustering tree.
Best k-partition pruning using DP [?]: optimal for $\alpha = 2 + \sqrt{3}$-perturbation-resilient instances.
Time $O(nk^2 + n T_1(n))$.
Question: how to maintain dynamically an optimal contiguous clustering? (core-set approximation in the streaming model [9])


Page 24: Optimal interval clustering: Application to Bregman clustering and statistical mixture learning

Bibliography I

Daniel Aloise, Amit Deshpande, Pierre Hansen, and Preyas Popat.

NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245–248, 2009.

Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh.

Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749, 2005.

Richard Bellman.

A note on cluster analysis and dynamic programming. Mathematical Biosciences, 18(3-4):311–312, 1973.

Jean-Daniel Boissonnat, Frank Nielsen, and Richard Nock.

Bregman Voronoi diagrams. Discrete & Computational Geometry, 44(2):281–307, September 2010.

J. Cavanaugh.

Unifying the derivations for the Akaike and corrected Akaike information criteria. Statistics & Probability Letters, 33(2):201–208, April 1997.

Franklin C. Crow.

Summed-area tables for texture mapping. In Proceedings of the 11th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '84, pages 207–212, New York, NY, USA, 1984. ACM.

Sanjoy Dasgupta.

The hardness of k-means clustering. Technical Report CS2008-0916.


Page 25: Optimal interval clustering: Application to Bregman clustering and statistical mixture learning

Bibliography II

Walter D. Fisher.

On grouping for maximum homogeneity. Journal of the American Statistical Association, 53(284):789–798, 1958.

Sariel Har-Peled and Akash Kushal.

Smaller coresets for k-median and k-means clustering. In Proceedings of the Twenty-first Annual Symposium on Computational Geometry, SCG '05, pages 126–134, New York, NY, USA, 2005. ACM.

Meizhu Liu, Baba C. Vemuri, Shun-ichi Amari, and Frank Nielsen.

Shape retrieval using hierarchical total Bregman soft clustering. IEEE Trans. Pattern Anal. Mach. Intell., 34(12):2407–2419, 2012.

Meena Mahajan, Prajakta Nimbhorkar, and Kasturi R. Varadarajan.

The planar k-means problem is NP-hard. Theoretical Computer Science, 442:13–21, 2012.

Nimrod Megiddo and Kenneth J. Supowit.

On the complexity of some common geometric location problems. SIAM Journal on Computing, 13(1):182–196, 1984.

Frank Nielsen.

k-MLE: A fast algorithm for learning statistical mixture models. CoRR, abs/1203.5181, 2012.

Frank Nielsen and Richard Nock.

Sided and symmetrized Bregman centroids. IEEE Transactions on Information Theory, 55(6):2882–2904, 2009.


Page 26: Optimal interval clustering: Application to Bregman clustering and statistical mixture learning

Bibliography III

Haizhou Wang and Mingzhou Song.

Ckmeans.1d.dp: Optimal k-means clustering in one dimension by dynamic programming. R Journal, 3(2), 2011.
