Lecture 7: Clustering Algorithms
STAT 545: Intro. to Computational Statistics
Vinayak Rao, Purdue University
September 12, 2016
Clustering
Given a large dataset, group data points into ‘clusters’.
Data points in the same cluster are similar in some sense.
E.g., cluster students' scores (to decide grades)
Applications:
Compression/feature-extraction/exploration/visualization
• simpler representation of complex data
Image segmentation, community detection, co-expressed genes
Clustering (contd.)
We are given $N$ data vectors $(x_1, \ldots, x_N)$ in $\Re^d$. Let $c_i$ be the cluster assignment of observation $x_i$:
$$c_i \in \{1, \ldots, K\}$$
Equivalently, we can use one-hot (or 1-of-$K$) encoding:
$$r_{ic} = \begin{cases} 1, & \text{if } c_i = c \\ 0, & \text{otherwise} \end{cases}$$
Observe: $r_{ic} \ge 0$ and $\sum_{c=1}^{K} r_{ic} = 1$, just like a probability vector.
However, $r_{ic}$ is binary: we will relax this in later lectures.
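A minimal numpy sketch of the 1-of-$K$ encoding (variable names are illustrative, not from the slides):

```python
import numpy as np

c = np.array([1, 3, 2, 3])       # cluster assignments c_i, each in {1, ..., K}
K = 3
R = np.eye(K, dtype=int)[c - 1]  # row i is the one-hot vector r_i
print(R)                         # entries are >= 0 ...
print(R.sum(axis=1))             # ... and each row sums to 1: [1 1 1 1]
```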
Cluster parameters
Associate cluster $c$ with a parameter $\theta_c \in \Re^d$ (the cluster prototype).
Write $\theta = \{\theta_1, \ldots, \theta_K\}$, $C = \{c_1, \ldots, c_N\}$ (or $R = \{r_1, \ldots, r_N\}$).
Problem: Given data $(x_1, \ldots, x_N)$, find $\theta$ and $C$.
Define a loss function $L(\theta, C)$, and minimize it.
Start by defining a distance (or similarity measure) $d(x, \theta)$:
$$d(x, \theta) = \sum_{i=1}^{d} (x_i - \theta_i)^2 \qquad \text{squared Euclidean, or } L_2 \text{, distance}$$
$$d(x, \theta) = \sum_{i=1}^{d} |x_i - \theta_i| \qquad L_1 \text{ distance}$$
Clustering loss function
We want all members of a cluster to be close to the prototype:
$$\sum_{i \,:\, c_i = c} d(x_i, \theta_c) = \sum_{i=1}^{N} r_{ic}\, d(x_i, \theta_c) \quad \text{should be small for each } c.$$
Overall loss function:
$$L(\theta, R) = \sum_{c=1}^{K} \sum_{i=1}^{N} r_{ic}\, d(x_i, \theta_c)$$
Optimize over both:
• cluster assignments (discrete)
• cluster parameters (continuous)
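A small numpy sketch of this loss under the squared Euclidean distance (the function and variable names are illustrative, not from the slides):

```python
import numpy as np

def clustering_loss(X, R, theta):
    """L(theta, R) = sum_c sum_i r_ic * d(x_i, theta_c), with d the squared Euclidean distance."""
    # X: (N, d) data; R: (N, K) one-hot assignments; theta: (K, d) prototypes
    sq_dists = ((X[:, None, :] - theta[None, :, :]) ** 2).sum(axis=-1)  # (N, K) matrix of d(x_i, theta_c)
    return (R * sq_dists).sum()
```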
K-means
Minimizing $L(\theta, C)$ exactly is hard ($O(N^{dK+1})$).
Instead, use heuristic greedy (local-search) algorithms. When $d(\cdot, \cdot)$ is Euclidean, the most popular is Lloyd's algorithm.
If we had the cluster parameters $\theta^*$, can we solve for $C$?
$$C^{opt} = \arg\min_{C} L(\theta^*, C)$$
If we had the cluster assignments $C^*$, can we solve for $\theta$?
$$\theta^{opt} = \arg\min_{\theta} L(\theta, C^*)$$
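Both subproblems have closed-form answers for the squared Euclidean distance (a standard fact, noted here since the slide poses them as questions): each observation goes to its nearest prototype, and each prototype is the mean of its assigned observations,
$$c_i^{opt} = \arg\min_{c \in \{1,\ldots,K\}} d(x_i, \theta^*_c), \qquad \theta_c^{opt} = \frac{\sum_{i=1}^{N} r^*_{ic}\, x_i}{\sum_{i=1}^{N} r^*_{ic}}.$$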
K-means
Start with an initialization of the parameters, call it $\theta^0$. Assign observations to nearest clusters, giving $R^0$.
Repeat for $i = 1, 2, \ldots$ until convergence:
• Recalculate cluster means, $\theta^i$
• Recalculate cluster assignments, $R^i$
This is coordinate descent.
The resulting algorithm has complexity $O(INKD)$, where $I$ is the number of iterations.
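A minimal numpy sketch of Lloyd's algorithm (illustrative only; initializing the prototypes at random data points and keeping the old prototype for an empty cluster are my own simple choices):

```python
import numpy as np

def lloyds_algorithm(X, K, max_iters=100, seed=0):
    """Sketch of Lloyd's algorithm for K-means with squared Euclidean distance."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    theta = X[rng.choice(N, size=K, replace=False)].copy()  # prototypes: K random data points
    for _ in range(max_iters):
        # Assignment step: send each observation to its nearest prototype.
        sq_dists = ((X[:, None, :] - theta[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
        c = sq_dists.argmin(axis=1)
        # Update step: move each prototype to the mean of its assigned points;
        # an empty cluster keeps its old prototype.
        new_theta = np.array([X[c == k].mean(axis=0) if np.any(c == k) else theta[k]
                              for k in range(K)])
        if np.allclose(new_theta, theta):  # prototypes stopped moving: converged
            break
        theta = new_theta
    return theta, c
```

Each sweep costs $O(NKD)$ for the distance matrix, matching the $O(INKD)$ total above.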
[demo]
Questions
Does this algorithm converge to a global minimum?
Does it converge at all?
What is the convergence criterion?
Limitations
Local optima: sensitive to initialization.
Solution: run many times and pick the best clustering.

Empty clusters.
Solution: discard them, or use heuristics to assign observations to them.

Choosing $K$.
Solution: search over a set of $K$'s, penalizing larger values.

Requires (roughly) circular clusters.
Solution: use some other method.
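As a practical illustration of the first two fixes (random restarts, and scanning over $K$), here is a sketch using scikit-learn, which the lecture does not reference:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])  # two synthetic blobs
for K in range(1, 6):
    km = KMeans(n_clusters=K, n_init=20, random_state=0).fit(X)  # 20 random restarts, keep the best
    print(K, km.inertia_)  # within-cluster sum of squares; look for an 'elbow' as K grows
```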
Variations to k-means
Modify the distance function.
• $L_1$ distance: k-medians

Modify the algorithm.
• $L_1$ distance: k-medoids (exemplar-based)
Hierarchical clustering

k-means is a partitioning algorithm that assigns each observation to a unique cluster. Often there is no clear best clustering, and it is natural instead to view the data as having a hierarchical structure.
Two approaches:
Top-down (divisive) clustering:
• Initialize all observations into a single cluster, and divide clusters sequentially.

Bottom-up (agglomerative) clustering:
• Initialize each observation in its own cluster, and merge clusters sequentially.
• More flexible, and more common.
Agglomerative clustering

[Figure: observations 1-5 merged step by step, alongside the resulting dendrogram over leaves 1 2 3 4 5]
Agglomerative clustering

Pick a distance function (e.g. Euclidean).

Pick a linkage criterion defining the distance between two clusters:
• Single linkage: $d(A,B) = \min_{x \in A, y \in B} d(x, y)$.
• Complete linkage: $d(A,B) = \max_{x \in A, y \in B} d(x, y)$.
• Centroid linkage: $d(A,B) = d(C_A, C_B)$ ($C_A$: centroid of $A$).
• Average linkage: $d(A,B) = \frac{1}{|A||B|} \sum_{x \in A, y \in B} d(x, y)$.
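A short sketch with scipy's hierarchical-clustering routines (scipy is my choice of tool here, not the lecture's); each method string corresponds to one of the linkage criteria above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))
# method='single' | 'complete' | 'average' | 'centroid' matches the list above;
# the pointwise distance defaults to Euclidean.
Z = linkage(X, method='single')
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the dendrogram into 3 clusters
```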
Some properties of the Gaussian
Marginalization:
$$\begin{bmatrix} X \\ Y \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_X \\ \mu_Y \end{bmatrix}, \begin{bmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{bmatrix} \right), \qquad X \sim \,?$$

[Figure: joint density of $(X, Y)$ with marginal histograms of $X$ and $Y$, for $\mu = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, $\Sigma = \begin{bmatrix} 10 & 4 \\ 4 & 2 \end{bmatrix}$]

$$X \sim \mathcal{N}(\mu_X, \Sigma_{XX})$$
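A quick numerical check of marginalization by simulation, using the $\mu$ and $\Sigma$ from the figure (a sketch, not from the slides):

```python
import numpy as np

mu = np.array([0.0, 0.0])
Sigma = np.array([[10.0, 4.0],
                  [4.0, 2.0]])
xy = np.random.default_rng(0).multivariate_normal(mu, Sigma, size=100_000)
x = xy[:, 0]                # discarding Y samples from the marginal of X
print(x.mean(), x.var())    # close to mu_X = 0 and Sigma_XX = 10
```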
Some properties of the Gaussian
Conditioning:
$$\begin{bmatrix} X \\ Y \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_X \\ \mu_Y \end{bmatrix}, \begin{bmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{bmatrix} \right), \qquad Y \mid (X = a) \sim \,?$$

[Figure: joint density of $(X, Y)$ with marginal histograms of $X$ and $Y$, for $\mu = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, $\Sigma = \begin{bmatrix} 10 & 4 \\ 4 & 2 \end{bmatrix}$]

$$Y \mid (X = a) \sim \mathcal{N}\left( \mu_Y + \Sigma_{YX} \Sigma_{XX}^{-1} (a - \mu_X),\; \Sigma_{YY} - \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY} \right)$$
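A numerical check of the conditioning formula with the same $\mu$ and $\Sigma$ (again a sketch; the conditioning value $a = 2$ is arbitrary):

```python
import numpy as np

mu_X, mu_Y = 0.0, 0.0
S_XX, S_XY, S_YX, S_YY = 10.0, 4.0, 4.0, 2.0
a = 2.0  # condition on X = a

# Analytic conditional from the formula above
cond_mean = mu_Y + S_YX / S_XX * (a - mu_X)   # 0.8
cond_var = S_YY - S_YX / S_XX * S_XY          # 0.4

# Empirical check: keep the Y values of samples whose X lands near a
rng = np.random.default_rng(0)
xy = rng.multivariate_normal([mu_X, mu_Y], [[S_XX, S_XY], [S_YX, S_YY]], size=500_000)
y_near = xy[np.abs(xy[:, 0] - a) < 0.05, 1]
print(cond_mean, cond_var)            # 0.8 0.4
print(y_near.mean(), y_near.var())    # approximately 0.8 and 0.4
```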
The Gaussian distribution, conjugacy and Bayes' rule
We have a Gaussian 'prior' on $X_1$. We observe a noisy measurement $Y_1 \mid X_1 \sim \mathcal{N}(AX_1, \Sigma_E)$.
$$\begin{bmatrix} X \\ \epsilon \end{bmatrix} \rightarrow \begin{bmatrix} X \\ Y \end{bmatrix}$$
$X$ and $Y$ are jointly Gaussian: what are the mean and covariance?
$Y$ is marginally Gaussian: what are its mean and covariance?
$X \mid Y$ is Gaussian: what are its mean and covariance?
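The slide leaves these as questions; for reference, writing the prior as $X_1 \sim \mathcal{N}(\mu_0, \Sigma_0)$ (notation mine) so that $Y_1 = AX_1 + \epsilon$ with independent noise $\epsilon \sim \mathcal{N}(0, \Sigma_E)$, standard Gaussian algebra gives
$$\begin{bmatrix} X_1 \\ Y_1 \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_0 \\ A\mu_0 \end{bmatrix}, \begin{bmatrix} \Sigma_0 & \Sigma_0 A^\top \\ A\Sigma_0 & A\Sigma_0 A^\top + \Sigma_E \end{bmatrix} \right),$$
so $Y_1 \sim \mathcal{N}(A\mu_0, A\Sigma_0 A^\top + \Sigma_E)$ marginally, and $X_1 \mid Y_1$ follows from the conditioning formula on the previous slide.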
Product of Gaussian densities:
The product of two Gaussian densities is Gaussian (up to normalization).

[Figure: two Gaussian density curves and their pointwise product, itself a rescaled Gaussian]
Intuition: the sum of two quadratic functions is a quadratic.

Aside: we need only specify a probability distribution up to a constant:
• $p(x)$ and $C \cdot p(x)$ represent the same distribution, if $C$ is independent of $x$
• Probabilities must integrate to 1
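Making the quadratic intuition concrete in one dimension (a standard identity, not spelled out on the slide), completing the square in the exponent gives
$$\mathcal{N}(x; \mu_1, \sigma_1^2)\, \mathcal{N}(x; \mu_2, \sigma_2^2) \propto \mathcal{N}(x; \mu, \sigma^2), \qquad \sigma^2 = \left( \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2} \right)^{-1}, \quad \mu = \sigma^2 \left( \frac{\mu_1}{\sigma_1^2} + \frac{\mu_2}{\sigma_2^2} \right).$$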