Lecture 7: Clustering Algorithms
STAT 545: Intro. to Computational Statistics
Vinayak Rao, Purdue University
September 12, 2016
Clustering
Given a large dataset, group data points into ‘clusters’.
Data points in the same cluster are similar in some sense.
E.g., cluster students' scores (to decide grades)
Applications:
Compression/feature-extraction/exploration/visualization
• simpler representation of complex data
Image segmentation, community detection, co-expressed genes
Clustering (contd.)
We are given $N$ data vectors $(x_1, \ldots, x_N)$ in $\Re^d$. Let $c_i$ be the cluster assignment of observation $x_i$:
$$c_i \in \{1, \ldots, K\}$$
Equivalently, we can use one-hot (or 1-of-$K$) encoding:
$$r_{ic} = \begin{cases} 1, & \text{if } c_i = c \\ 0, & \text{otherwise} \end{cases}$$
Observe: $r_{ic} \ge 0$ and $\sum_{c=1}^{K} r_{ic} = 1$, just like a probability vector.
However, $r_{ic}$ is binary: we will relax this in later lectures.
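A minimal numpy sketch of the 1-of-$K$ encoding (variable names are illustrative, not from the slides):

```python
import numpy as np

c = np.array([1, 3, 2, 3])       # cluster assignments c_i, each in {1, ..., K}
K = 3
R = np.eye(K, dtype=int)[c - 1]  # row i is the one-hot vector r_i
print(R)                         # entries are >= 0 ...
print(R.sum(axis=1))             # ... and each row sums to 1: [1 1 1 1]
```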
Cluster parameters
Associate cluster $c$ with a parameter $\theta_c \in \Re^d$ (the cluster prototype).
Write $\theta = \{\theta_1, \ldots, \theta_K\}$, $C = \{c_1, \ldots, c_N\}$ (or $R = \{r_1, \ldots, r_N\}$).
Problem: Given data $(x_1, \ldots, x_N)$, find $\theta$ and $C$.
Define a loss function $L(\theta, C)$, and minimize it.
Start by defining a distance (or similarity measure) $d(x, \theta)$:
$$d(x, \theta) = \sum_{i=1}^{d} (x_i - \theta_i)^2 \qquad \text{squared Euclidean, or } L_2 \text{, distance}$$
$$d(x, \theta) = \sum_{i=1}^{d} |x_i - \theta_i| \qquad L_1 \text{ distance}$$
Clustering loss function
We want all members of a cluster to be close to the prototype:
$$\sum_{i \,:\, c_i = c} d(x_i, \theta_c) = \sum_{i=1}^{N} r_{ic}\, d(x_i, \theta_c) \quad \text{should be small for each } c.$$
Overall loss function:
$$L(\theta, R) = \sum_{c=1}^{K} \sum_{i=1}^{N} r_{ic}\, d(x_i, \theta_c)$$
Optimize over both:
• cluster assignments (discrete)
• cluster parameters (continuous)
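A small numpy sketch of this loss under the squared Euclidean distance (the function and variable names are illustrative, not from the slides):

```python
import numpy as np

def clustering_loss(X, R, theta):
    """L(theta, R) = sum_c sum_i r_ic * d(x_i, theta_c), with d the squared Euclidean distance."""
    # X: (N, d) data; R: (N, K) one-hot assignments; theta: (K, d) prototypes
    sq_dists = ((X[:, None, :] - theta[None, :, :]) ** 2).sum(axis=-1)  # (N, K) matrix of d(x_i, theta_c)
    return (R * sq_dists).sum()
```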
K-means
Minimizing $L(\theta, C)$ exactly is hard ($O(N^{dK+1})$).
Instead, use heuristic greedy (local-search) algorithms. When $d(\cdot, \cdot)$ is Euclidean, the most popular is Lloyd's algorithm.
If we had the cluster parameters $\theta^*$, can we solve for $C$?
$$C^{opt} = \arg\min_{C} L(\theta^*, C)$$
If we had the cluster assignments $C^*$, can we solve for $\theta$?
$$\theta^{opt} = \arg\min_{\theta} L(\theta, C^*)$$
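Both subproblems have closed-form answers for the squared Euclidean distance (a standard fact, noted here since the slide poses them as questions): each observation goes to its nearest prototype, and each prototype is the mean of its assigned observations,
$$c_i^{opt} = \arg\min_{c \in \{1,\ldots,K\}} d(x_i, \theta^*_c), \qquad \theta_c^{opt} = \frac{\sum_{i=1}^{N} r^*_{ic}\, x_i}{\sum_{i=1}^{N} r^*_{ic}}.$$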
K-means
Start with an initialization of the parameters, call it $\theta^0$. Assign observations to nearest clusters, giving $R^0$.
Repeat for $i = 1, 2, \ldots$ until convergence:
• Recalculate cluster means, $\theta^i$
• Recalculate cluster assignments, $R^i$
This is coordinate descent.
The resulting algorithm has complexity $O(INKD)$, where $I$ is the number of iterations.
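A minimal numpy sketch of Lloyd's algorithm (illustrative only; initializing the prototypes at random data points and keeping the old prototype for an empty cluster are my own simple choices):

```python
import numpy as np

def lloyds_algorithm(X, K, max_iters=100, seed=0):
    """Sketch of Lloyd's algorithm for K-means with squared Euclidean distance."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    theta = X[rng.choice(N, size=K, replace=False)].copy()  # prototypes: K random data points
    for _ in range(max_iters):
        # Assignment step: send each observation to its nearest prototype.
        sq_dists = ((X[:, None, :] - theta[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
        c = sq_dists.argmin(axis=1)
        # Update step: move each prototype to the mean of its assigned points;
        # an empty cluster keeps its old prototype.
        new_theta = np.array([X[c == k].mean(axis=0) if np.any(c == k) else theta[k]
                              for k in range(K)])
        if np.allclose(new_theta, theta):  # prototypes stopped moving: converged
            break
        theta = new_theta
    return theta, c
```

Each sweep costs $O(NKD)$ for the distance matrix, matching the $O(INKD)$ total above.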
[demo]
Questions
Does this algorithm converge to a global minimum?
Does it converge at all?
What is the convergence criterion?
Limitations
Local optima: sensitive to initialization.
Solution: run many times and pick the best clustering.

Empty clusters.
Solution: discard them, or use heuristics to assign observations to them.

Choosing $K$.
Solution: search over a set of $K$'s, penalizing larger values.

Requires (roughly) circular clusters.
Solution: use some other method.
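As a practical illustration of the first two fixes (random restarts, and scanning over $K$), here is a sketch using scikit-learn, which the lecture does not reference:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])  # two synthetic blobs
for K in range(1, 6):
    km = KMeans(n_clusters=K, n_init=20, random_state=0).fit(X)  # 20 random restarts, keep the best
    print(K, km.inertia_)  # within-cluster sum of squares; look for an 'elbow' as K grows
```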
Variations to k-means
Modify the distance function.
• $L_1$ distance: k-medians

Modify the algorithm.
• $L_1$ distance: k-medoids (exemplar-based)
Hierarchical clustering

k-means is a partitioning algorithm that assigns each observation to a unique cluster. Often there is no clear best clustering, and it is natural instead to view the data as having a hierarchical structure.
Two approaches:
Top-down (divisive) clustering:
• Initialize all observations into a single cluster, and divide clusters sequentially.

Bottom-up (agglomerative) clustering:
• Initialize each observation in its own cluster, and merge clusters sequentially.
• More flexible, and more common.
Agglomerative clustering

[Figure: observations 1-5 merged step by step, alongside the resulting dendrogram over leaves 1 2 3 4 5]
Agglomerative clustering

Pick a distance function (e.g. Euclidean).

Pick a linkage criterion defining the distance between two clusters:
• Single linkage: $d(A,B) = \min_{x \in A, y \in B} d(x, y)$.
• Complete linkage: $d(A,B) = \max_{x \in A, y \in B} d(x, y)$.
• Centroid linkage: $d(A,B) = d(C_A, C_B)$ ($C_A$: centroid of $A$).
• Average linkage: $d(A,B) = \frac{1}{|A||B|} \sum_{x \in A, y \in B} d(x, y)$.
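A short sketch with scipy's hierarchical-clustering routines (scipy is my choice of tool here, not the lecture's); each method string corresponds to one of the linkage criteria above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))
# method='single' | 'complete' | 'average' | 'centroid' matches the list above;
# the pointwise distance defaults to Euclidean.
Z = linkage(X, method='single')
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the dendrogram into 3 clusters
```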
Some properties of the Gaussian
Marginalization:
$$\begin{bmatrix} X \\ Y \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_X \\ \mu_Y \end{bmatrix}, \begin{bmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{bmatrix} \right), \qquad X \sim \,?$$

[Figure: joint density of $(X, Y)$ with marginal histograms of $X$ and $Y$, for $\mu = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, $\Sigma = \begin{bmatrix} 10 & 4 \\ 4 & 2 \end{bmatrix}$]

$$X \sim \mathcal{N}(\mu_X, \Sigma_{XX})$$
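A quick numerical check of marginalization by simulation, using the $\mu$ and $\Sigma$ from the figure (a sketch, not from the slides):

```python
import numpy as np

mu = np.array([0.0, 0.0])
Sigma = np.array([[10.0, 4.0],
                  [4.0, 2.0]])
xy = np.random.default_rng(0).multivariate_normal(mu, Sigma, size=100_000)
x = xy[:, 0]                # discarding Y samples from the marginal of X
print(x.mean(), x.var())    # close to mu_X = 0 and Sigma_XX = 10
```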
Some properties of the Gaussian
Conditioning:
$$\begin{bmatrix} X \\ Y \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_X \\ \mu_Y \end{bmatrix}, \begin{bmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{bmatrix} \right), \qquad Y \mid (X = a) \sim \,?$$

[Figure: joint density of $(X, Y)$ with marginal histograms of $X$ and $Y$, for $\mu = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, $\Sigma = \begin{bmatrix} 10 & 4 \\ 4 & 2 \end{bmatrix}$]

$$Y \mid (X = a) \sim \mathcal{N}\left( \mu_Y + \Sigma_{YX} \Sigma_{XX}^{-1} (a - \mu_X),\; \Sigma_{YY} - \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY} \right)$$
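A numerical check of the conditioning formula with the same $\mu$ and $\Sigma$ (again a sketch; the conditioning value $a = 2$ is arbitrary):

```python
import numpy as np

mu_X, mu_Y = 0.0, 0.0
S_XX, S_XY, S_YX, S_YY = 10.0, 4.0, 4.0, 2.0
a = 2.0  # condition on X = a

# Analytic conditional from the formula above
cond_mean = mu_Y + S_YX / S_XX * (a - mu_X)   # 0.8
cond_var = S_YY - S_YX / S_XX * S_XY          # 0.4

# Empirical check: keep the Y values of samples whose X lands near a
rng = np.random.default_rng(0)
xy = rng.multivariate_normal([mu_X, mu_Y], [[S_XX, S_XY], [S_YX, S_YY]], size=500_000)
y_near = xy[np.abs(xy[:, 0] - a) < 0.05, 1]
print(cond_mean, cond_var)            # 0.8 0.4
print(y_near.mean(), y_near.var())    # approximately 0.8 and 0.4
```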
The Gaussian distribution, conjugacy and Bayes' rule
We have a Gaussian 'prior' on $X_1$. We observe a noisy measurement $Y_1 \mid X_1 \sim \mathcal{N}(AX_1, \Sigma_E)$.
$$\begin{bmatrix} X \\ \epsilon \end{bmatrix} \rightarrow \begin{bmatrix} X \\ Y \end{bmatrix}$$
$X$ and $Y$ are jointly Gaussian: what are the mean and covariance?
$Y$ is marginally Gaussian: what are its mean and covariance?
$X \mid Y$ is Gaussian: what are its mean and covariance?
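The slide leaves these as questions; for reference, writing the prior as $X_1 \sim \mathcal{N}(\mu_0, \Sigma_0)$ (notation mine) so that $Y_1 = AX_1 + \epsilon$ with independent noise $\epsilon \sim \mathcal{N}(0, \Sigma_E)$, standard Gaussian algebra gives
$$\begin{bmatrix} X_1 \\ Y_1 \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_0 \\ A\mu_0 \end{bmatrix}, \begin{bmatrix} \Sigma_0 & \Sigma_0 A^\top \\ A\Sigma_0 & A\Sigma_0 A^\top + \Sigma_E \end{bmatrix} \right),$$
so $Y_1 \sim \mathcal{N}(A\mu_0, A\Sigma_0 A^\top + \Sigma_E)$ marginally, and $X_1 \mid Y_1$ follows from the conditioning formula on the previous slide.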
Product of Gaussian densities:
The product of two Gaussian densities is Gaussian (up to normalization).

[Figure: two Gaussian density curves and their pointwise product, itself a rescaled Gaussian]
Intuition: the sum of two quadratic functions is a quadratic.

Aside: we need only specify a probability distribution up to a constant:
• $p(x)$ and $C \cdot p(x)$ represent the same distribution, if $C$ is independent of $x$
• Probabilities must integrate to 1
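Making the quadratic intuition concrete in one dimension (a standard identity, not spelled out on the slide), completing the square in the exponent gives
$$\mathcal{N}(x; \mu_1, \sigma_1^2)\, \mathcal{N}(x; \mu_2, \sigma_2^2) \propto \mathcal{N}(x; \mu, \sigma^2), \qquad \sigma^2 = \left( \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2} \right)^{-1}, \quad \mu = \sigma^2 \left( \frac{\mu_1}{\sigma_1^2} + \frac{\mu_2}{\sigma_2^2} \right).$$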