Maria-Florina Balcan
04/06/2015
Clustering.
Unsupervised Learning
Additional resources:
• Center Based Clustering: A Foundational Perspective.
Awasthi, Balcan. Handbook of Clustering Analysis. 2015.
Reading:
• Chapter 14.3: Hastie, Tibshirani, Friedman.
Logistics
• Midway Review due today.
• Final Report, May 8.
• Poster Presentation, May 11.
• Exam #2 on April 29th.
• Project:
• Communicate with your mentor TA!
Clustering, Informal Goals
Goal: Automatically partition unlabeled data into groups of
similar datapoints.
Question: When and why would we want to do this?
• Automatically organizing data.
Useful for:
• Representing high-dimensional data in a low-dimensional space (e.g., for visualization purposes).
• Understanding hidden structure in data.
• Preprocessing for further analysis.
• Cluster news articles or web pages or search results by topic.
Applications (Clustering comes up everywhere…)
• Cluster protein sequences by function or genes according to expression profile.
• Cluster users of social networks by interest (community detection).
Facebook network Twitter Network
• Cluster customers according to purchase history.
Applications (Clustering comes up everywhere…)
• Cluster galaxies or nearby stars (e.g. Sloan Digital Sky Survey)
• And many many more applications….
Clustering
[March 4th: EM-style algorithm for clustering for mixture of Gaussians (specific
probabilistic model).]
Today:
• Objective based clustering
• Hierarchical clustering
• Mention overlapping clusters
Objective Based Clustering
Goal: output a partition of the data.
Input: A set S of n points, also a distance/dissimilarity measure specifying the distance d(x,y) between pairs (x,y).
E.g., # keywords in common, edit distance, wavelet coefficients, etc.
– k-median: find center points c_1, c_2, …, c_k to minimize Σ_{i=1}^n min_{j∈{1,…,k}} d(x_i, c_j)
– k-means: find center points c_1, c_2, …, c_k to minimize Σ_{i=1}^n min_{j∈{1,…,k}} d²(x_i, c_j)
– k-center: find a partition to minimize the maximum radius
[Figure: points x, y, z, s assigned to centers c1, c2, c3]
Euclidean k-means Clustering
Input: A set of n datapoints x_1, x_2, …, x_n in R^d; target number of clusters k.
Output: k representatives c_1, c_2, …, c_k ∈ R^d.
Objective: choose c_1, c_2, …, c_k ∈ R^d to minimize Σ_{i=1}^n min_{j∈{1,…,k}} ‖x_i − c_j‖²
Natural assignment: each point is assigned to its closest center; this induces a Voronoi partition.
Computational complexity:
NP-hard even for k = 2 [Dasgupta ’08] or d = 2 [Mahajan-Nimbhorkar-Varadarajan ’09].
There are a couple of easy cases…
An Easy Case for k-means: k = 1
Input: A set of n datapoints x_1, x_2, …, x_n in R^d.
Output: c ∈ R^d to minimize Σ_{i=1}^n ‖x_i − c‖².
Solution: the optimal choice is μ = (1/n) Σ_{i=1}^n x_i.
Idea: a bias/variance-like decomposition:
(1/n) Σ_{i=1}^n ‖x_i − c‖² = ‖μ − c‖² + (1/n) Σ_{i=1}^n ‖x_i − μ‖²
(avg k-means cost w.r.t. c) = ‖μ − c‖² + (avg k-means cost w.r.t. μ)
So the optimal choice for c is μ.
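The decomposition above is easy to verify numerically; a minimal sketch (the data values below are made up for illustration):

```python
# Check the bias/variance-like decomposition for k = 1:
#   (1/n) sum_i ||x_i - c||^2  =  ||mu - c||^2  +  (1/n) sum_i ||x_i - mu||^2
points = [(0.0, 1.0), (2.0, 3.0), (4.0, -1.0), (1.0, 1.0)]  # arbitrary points in R^2
n = len(points)

mu = tuple(sum(p[d] for p in points) / n for d in range(2))  # the mean

def sq_dist(a, b):
    return sum((a[d] - b[d]) ** 2 for d in range(2))

c = (5.0, 5.0)  # an arbitrary candidate center
avg_cost_c = sum(sq_dist(p, c) for p in points) / n
avg_cost_mu = sum(sq_dist(p, mu) for p in points) / n

# The identity holds, and since ||mu - c||^2 >= 0 the minimizer over c is mu.
assert abs(avg_cost_c - (sq_dist(mu, c) + avg_cost_mu)) < 1e-9
```

Since the extra term ‖μ − c‖² is nonnegative and vanishes only at c = μ, the mean is optimal.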
Another Easy Case for k-means: d = 1
Input: A set of n datapoints x_1, x_2, …, x_n on the real line.
Output: k centers c_1, …, c_k minimizing the k-means cost.
Extra-credit homework question.
Hint: dynamic programming in time O(n²k).
Common Heuristic in Practice: Lloyd’s Method
Input: A set of n datapoints x_1, x_2, …, x_n in R^d.
Initialize centers c_1, c_2, …, c_k ∈ R^d and clusters C_1, C_2, …, C_k in any way.
Repeat until there is no further change in the cost:
• For each j: C_j ← {x ∈ S whose closest center is c_j}
• For each j: c_j ← mean of C_j
[Least squares quantization in PCM, Lloyd, IEEE Transactions on Information Theory, 1982]
The two steps alternate:
• Holding c_1, c_2, …, c_k fixed, pick the optimal clusters C_1, C_2, …, C_k (assignment step).
• Holding C_1, C_2, …, C_k fixed, pick the optimal centers c_1, c_2, …, c_k (update step).
Note: Lloyd’s method always converges:
• the cost always drops, and
• there are only finitely many Voronoi partitions (so only finitely many values the cost can take).
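The alternation of the two steps can be sketched as a short, dependency-free implementation (function and variable names are my own, not from the slides):

```python
import random

def lloyds(points, k, seed=0):
    """Lloyd's method: alternate the assignment step and the mean step
    until the centers stop changing. points: list of d-dim tuples."""
    rng = random.Random(seed)
    d = len(points[0])

    def sq_dist(a, b):
        return sum((a[i] - b[i]) ** 2 for i in range(d))

    centers = rng.sample(points, k)  # random initialization from the datapoints
    while True:
        # Assignment step: each point joins its closest center (Voronoi partition).
        clusters = [[] for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda j: sq_dist(x, centers[j]))
            clusters[j].append(x)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its old center).
        new_centers = [
            tuple(sum(x[i] for x in C) / len(C) for i in range(d)) if C else centers[j]
            for j, C in enumerate(clusters)
        ]
        if new_centers == centers:  # no change => converged
            return centers, clusters
        centers = new_centers
```

As the slides note, the quality of the result depends heavily on the initialization; this sketch just uses random datapoints as the initial centers.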
Initialization for Lloyd’s Method
• Initialization is crucial: it affects both how fast the method converges and the quality of the solution output.
• Techniques commonly used in practice:
• Random centers chosen from the datapoints (repeat a few times).
• k-means++ (works well and has provable guarantees).
• Furthest-point traversal.
Lloyd’s method: Random Initialization
Example: given a set of datapoints,
• Select initial centers at random.
• Assign each point to its nearest center.
• Recompute optimal centers given the fixed clustering.
• Repeat the previous two steps until there is no change.
We get a good-quality solution in this example.
Lloyd’s method: Performance
It always converges, but it may converge to a local optimum that differs from the global optimum, and in fact can be arbitrarily worse in terms of its cost.
Lloyd’s method: Performance
Local optimum: every point is assigned to its nearest center
and every center is the mean value of its points.
Lloyd’s method: Performance
It can be arbitrarily worse than the optimal solution…
Lloyd’s method: Performance
This bad performance can happen even with well-separated Gaussian clusters.
Lloyd’s method: Performance
This bad performance can happen even with well-separated Gaussian clusters: some Gaussians get combined…
Lloyd’s method: Performance
• For k equal-sized Gaussians, Pr[each initial center lands in a different Gaussian] ≈ k!/k^k ≈ 1/e^k.
• This becomes unlikely as k gets large.
• So with random initialization, as k increases it becomes more and more likely that we fail to pick exactly one center per Gaussian, and Lloyd’s method will output a bad solution.
Another Initialization Idea: Furthest-Point Heuristic
• Choose c_1 arbitrarily (or at random).
• For j = 2, …, k: pick c_j to be the datapoint among x_1, x_2, …, x_n that is farthest from the previously chosen c_1, c_2, …, c_{j−1}.
This fixes the Gaussian problem, but it can be thrown off by outliers…
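The heuristic is a two-line loop; a minimal sketch (names are my own):

```python
def furthest_point_init(points, k):
    """Furthest-point traversal: the first center is arbitrary, each next
    center is the point farthest from all centers chosen so far."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    centers = [points[0]]  # c1 chosen arbitrarily (here: the first point)
    for _ in range(1, k):
        # D(x): distance from x to its nearest already-chosen center;
        # pick the point maximizing D(x).
        next_c = max(points, key=lambda x: min(sq_dist(x, c) for c in centers))
        centers.append(next_c)
    return centers
```

Its weakness is visible in the definition: an isolated outlier maximizes D(x), so it is always chosen as a center.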
The furthest-point heuristic does well on the previous example.
Furthest-point initialization heuristic is sensitive to outliers
Assume k = 3, with datapoints (0,1), (0,−1), (−2,0), (3,0).
K-means++ Initialization: D² sampling [AV07]
• Interpolates between random and furthest-point initialization.
• Let D(x) be the distance between a point x and its nearest already-chosen center. Choose the next center with probability proportional to D²(x).
• Choose c_1 at random.
• For j = 2, …, k: pick c_j among the datapoints x_1, x_2, …, x_n according to the distribution Pr(c_j = x_i) ∝ min_{j′<j} ‖x_i − c_{j′}‖² = D²(x_i).
Theorem: k-means++ always attains an O(log k) approximation to the optimal k-means solution in expectation.
Running Lloyd’s afterwards can only further improve the cost.
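D² sampling can be sketched as follows (a minimal version; names are my own, and a production implementation would typically cache the D² values between rounds):

```python
import random

def kmeans_pp_init(points, k, seed=0):
    """k-means++ initialization: first center uniform at random, each
    next center x_i with probability proportional to D^2(x_i), the
    squared distance from x_i to its nearest already-chosen center."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    centers = [rng.choice(points)]
    for _ in range(1, k):
        d2 = [min(sq_dist(x, c) for c in centers) for x in points]
        # Sample index i with Pr(i) proportional to D^2(x_i).
        r = rng.random() * sum(d2)
        acc = 0.0
        for i, w in enumerate(d2):
            acc += w
            if r <= acc:
                centers.append(points[i])
                break
    return centers
```

Note that points already chosen as centers have D²(x) = 0, so they are (essentially) never picked again.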
K-means++ Idea: D^α sampling
• Interpolate between random and furthest-point initialization.
• Let D(x) be the distance between a point x and its nearest center. Choose the next center with probability proportional to D^α(x).
• α = 0: random sampling.
• α = ∞: furthest point (side note: this actually works well for k-center).
• α = 2: k-means++.
• Side note: α = 1 works well for k-median.
K-means++ fixes the outlier example: datapoints (0,1), (0,−1), (−2,0), (3,0).
K-means++ / Lloyd’s Running Time
• k-means++ initialization: O(nd) time and one pass over the data to select each next center, so O(nkd) time in total.
• Lloyd’s method:
Repeat until there is no change in the cost:
• For each j: C_j ← {x ∈ S whose closest center is c_j}
• For each j: c_j ← mean of C_j
Each round takes time O(nkd).
• Number of rounds: exponential in the worst case [AV07]; expected polynomial time in the smoothed analysis model!
K-means++ / Lloyd’s Summary
• Lloyd’s method may take an exponential number of rounds in the worst case [AV07], but runs in expected polynomial time in the smoothed analysis model.
• k-means++ always attains an O(log k) approximation to the optimal k-means solution in expectation.
• Running Lloyd’s afterwards can only further improve the cost.
• Does well in practice.
What value of k?
• Hold-out validation / cross-validation on an auxiliary task (e.g., a supervised learning task).
• Heuristic: find a large gap between the (k−1)-means cost and the k-means cost.
• Try hierarchical clustering.
Hierarchical Clustering
[Example hierarchy: All topics → sports, fashion; sports → soccer, tennis; fashion → Gucci, Lacoste]
• A hierarchy might be more natural.
• Different users might care about different levels of granularity, or even different prunings.
Hierarchical Clustering
Top-down (divisive):
• Partition the data into 2 groups (e.g., 2-means).
• Recursively cluster each group.
Bottom-up (agglomerative):
• Start with every point in its own cluster.
• Repeatedly merge the “closest” two clusters.
• Different definitions of “closest” give different algorithms.
Bottom-Up (agglomerative)
Have a distance measure on pairs of objects: d(x, y) = distance between x and y. E.g., # keywords in common, edit distance, etc.
• Single linkage: dist(A, B) = min_{x∈A, x′∈B} dist(x, x′)
• Average linkage: dist(A, B) = avg_{x∈A, x′∈B} dist(x, x′)
• Complete linkage: dist(A, B) = max_{x∈A, x′∈B} dist(x, x′)
• Ward’s method
Single Linkage
Bottom-up (agglomerative):
• Start with every point in its own cluster.
• Repeatedly merge the “closest” two clusters.
Single linkage: dist(A, B) = min_{x∈A, x′∈B} dist(x, x′)
[Figure: six points A–F on the line at positions −3, −2, 0, 2.1, 3.2, 6, with the resulting dendrogram; merges numbered 1–5.]
Single Linkage
Single linkage: dist(A, B) = min_{x∈A, x′∈B} dist(x, x′)
One way to think of it: at any moment, the clusters are the connected components of the graph in which any two points at distance < r are connected. Watch as r grows (only n−1 values of r are relevant, because merges happen only when r reaches the distance between two points currently in different clusters).
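The merge loop can be written directly from the definition; a minimal quadratic-time sketch on 1-D data (the coordinates below are the six values that appear on the slide, used here purely as an illustration):

```python
def single_linkage(points, num_clusters):
    """Agglomerative clustering with single linkage on the line:
    dist(A, B) = min over x in A, x' in B of |x - x'|.
    Repeatedly merges the closest pair until num_clusters remain."""
    clusters = [[p] for p in points]  # start: every point in its own cluster
    while len(clusters) > num_clusters:
        best = None  # (distance, i, j) of the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(x - y) for x in clusters[i] for y in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters
```

Swapping the inner `min` for `max` gives complete linkage, and for an average gives average linkage.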
Complete Linkage
Bottom-up (agglomerative):
• Start with every point in its own cluster.
• Repeatedly merge the “closest” two clusters.
Complete linkage: dist(A, B) = max_{x∈A, x′∈B} dist(x, x′)
One way to think of it: keep the maximum cluster diameter as small as possible at any level.
Ward’s Method
Bottom-up (agglomerative):
• Start with every point in its own cluster.
• Repeatedly merge the “closest” two clusters.
Ward’s method: dist(C, C′) = (|C|·|C′|)/(|C|+|C′|) · ‖mean(C) − mean(C′)‖²
Merge the two clusters for which the increase in k-means cost is as small as possible. Works well in practice.
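Ward’s score can be checked against its interpretation: the formula equals exactly the increase in k-means cost caused by merging the two clusters. A 1-D sketch (the data values are made up):

```python
def mean(C):
    return sum(C) / len(C)

def kmeans_cost(C):
    # k-means cost of a single cluster: squared distances to its mean.
    m = mean(C)
    return sum((x - m) ** 2 for x in C)

def ward_dist(C, Cp):
    # Ward's method: (|C| * |C'|) / (|C| + |C'|) * (mean(C) - mean(C'))^2
    return len(C) * len(Cp) / (len(C) + len(Cp)) * (mean(C) - mean(Cp)) ** 2

A, B = [0.0, 1.0, 2.0], [10.0, 12.0]
increase = kmeans_cost(A + B) - (kmeans_cost(A) + kmeans_cost(B))
assert abs(increase - ward_dist(A, B)) < 1e-9
```

So greedily minimizing Ward’s distance is exactly greedily minimizing the growth of the k-means objective at each merge.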
Running time
• Each algorithm starts with N clusters and performs N−1 merges.
• For each algorithm, computing dist(C, C′) can be done in time O(|C|·|C′|) (e.g., by examining dist(x, x′) for all x ∈ C, x′ ∈ C′).
• The time to compute all pairwise cluster distances and take the smallest is O(N²).
• Overall time is O(N³).
• In fact, all of these algorithms can be run in time O(N² log N).
See: Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. http://www-nlp.stanford.edu/IR-book/
Hierarchical Clustering Experiments [BLG, JMLR’15]
Ward’s method does the best among classic techniques.
What You Should Know
• Partitional clustering: k-means and k-means++.
• Lloyd’s method.
• Initialization techniques (random, furthest-point traversal, k-means++).
• Hierarchical clustering: single linkage, complete linkage, Ward’s method.
Additional Slides
Smoothed analysis model
• Imagine a worst-case input.
• But then add small Gaussian perturbation to each data point.
Smoothed analysis model
• Imagine a worst-case input, but then add a small Gaussian perturbation to each data point.
• Lloyd’s method might still find a local optimum that is far from the global optimum.
• Theorem [Arthur-Manthey-Röglin 2009]: if we add Gaussian perturbations with variance σ², then E[number of rounds until Lloyd’s converges] is polynomial in n and 1/σ. The actual bound is O(n^34 · k^34 · d^8 / σ^6).
Overlapping Clusters: Communities
[Figure: Christos Papadimitriou and his colleagues at Berkeley form overlapping communities: TCS, Databases, Systems, Algorithmic Game Theory.]
Overlapping Clusters: Communities
• Social networks, professional networks.
• Product purchasing networks, citation networks, biological networks, etc.
Overlapping Clusters: Communities
[Figure: a product such as “Baby’s Favorite Songs” sits in overlapping categories: Kids, CDs, lullabies, Electronics.]