Source: i-systems.github.io/.../12_Clustering_K-means.pdf (GitHub Pages)
Clustering: K-means
Industrial AI Lab.
Prof. Seungchul Lee
Supervised vs. Unsupervised Learning
• Supervised learning: building a model from labeled data
• Unsupervised learning: clustering from unlabeled data
Data Clustering
• Data clustering is an unsupervised learning problem
• Given:
– $m$ unlabeled examples $\{x^{(1)}, x^{(2)}, \cdots, x^{(m)}\}$
– the number of partitions 𝑘
• Goal: group the examples into 𝑘 partitions
Data Clustering: Similarity
• The only information clustering uses is the mutual similarity between samples
• A good clustering is one that achieves:
– high within-cluster similarity
– low inter-cluster similarity
K-means: (Iterative) Algorithm
1) Initialization
• Input
– 𝑘: the number of clusters
– Training set $x^{(1)}, x^{(2)}, \cdots, x^{(m)}$
• Randomly initialize cluster centers anywhere in $\mathbb{R}^n$
2) Iteration
• Repeat until convergence
– A possible convergence criterion: the cluster centers no longer change
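The two steps inside the loop are not preserved in this transcript. As a sketch of the standard K-means iteration, writing $c_i$ for the cluster index assigned to example $x^{(i)}$ and $\mu_j$ for the $j$-th cluster center (this notation is an assumption chosen to match the rest of the slides):

```latex
\begin{aligned}
&\text{Assignment step, for } i = 1, \cdots, m: \quad
  c_i \leftarrow \arg\min_{j \in \{1, \cdots, k\}} \lVert x^{(i)} - \mu_j \rVert^2 \\
&\text{Update step, for } j = 1, \cdots, k: \quad
  \mu_j \leftarrow \frac{1}{\lvert \{ i : c_i = j \} \rvert} \sum_{i \,:\, c_i = j} x^{(i)}
\end{aligned}
```

The first line assigns every point to its nearest center; the second moves every center to the mean of the points assigned to it.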
3) Output
• $c$ (label): index (1 to $k$) of the cluster centroid to which each point is assigned
• $\mu$: cluster centroids $\mu_1, \mu_2, \cdots, \mu_k$, each the average (mean) of the points assigned to that cluster
Initialization (𝑘 = 2)
[Figure sequence: after the initialization, four rounds alternating between "Assigning Points" and "Recomputing the Cluster Centers"]
Summary: K-means Clustering
• (Iterative) Algorithm
K-means: Optimization Point of View (Optional)
• $c_i$ = index of the cluster ($1, 2, \cdots, k$) to which example $x^{(i)}$ is currently assigned
• $\mu_k$ = centroid of cluster $k$
• $\mu_{c_i}$ = centroid of the cluster to which example $x^{(i)}$ has been assigned
• Optimization objective:
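The objective formula itself is missing from this transcript. With the definitions above, the standard K-means objective (the average squared distance from each example to its assigned centroid) would read as follows; this is the textbook form, assumed here rather than copied from the slide:

```latex
J(c_1, \cdots, c_m, \mu_1, \cdots, \mu_k)
  = \frac{1}{m} \sum_{i=1}^{m} \lVert x^{(i)} - \mu_{c_i} \rVert^2,
\qquad
\min_{c_1, \cdots, c_m,\; \mu_1, \cdots, \mu_k} J(c_1, \cdots, c_m, \mu_1, \cdots, \mu_k)
```

Assigning each point to its nearest centroid minimizes $J$ over the $c_i$ with the centroids held fixed, and recomputing the means minimizes $J$ over the $\mu_k$ with the assignments held fixed, so $J$ never increases from one iteration to the next.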
Expectation Maximization (EM) Algorithm
• It is a "chicken and egg" problem (dilemma)
– Q: if we knew the cluster centers $\mu_k$, how would we determine which points to associate with each center?
– A: for each point $x^{(i)}$, choose the closest center
– Q: if we knew the cluster memberships $c_i$, how would we get the centers?
– A: set each center $\mu_k$ to the mean of all points in its cluster
• Extension of the K-means algorithm
– K-means is a special case of the Expectation Maximization (EM) algorithm
– It corresponds to a Gaussian Mixture Model (GMM) with hard assignments and identical spherical covariances
– Won’t be discussed in this course
Python: Data Generation
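The slide's code is not preserved in this transcript. A minimal sketch of how such data could be generated with NumPy, assuming three 2-D Gaussian blobs; the blob centers, sample counts, and seed are illustrative choices, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (assumption) for reproducibility

# Hypothetical blob centers in R^2, with 100 points drawn around each
true_centers = np.array([[1.0, 1.0], [5.0, 5.0], [9.0, 1.0]])
n_per_blob = 100

# Add isotropic Gaussian noise around each center and stack into one (m, n) array
X = np.vstack([c + rng.normal(scale=1.0, size=(n_per_blob, 2)) for c in true_centers])
rng.shuffle(X)  # shuffle rows in place so the row order carries no label information

print(X.shape)  # (300, 2)
```

Only `X` is passed on to clustering; the generating centers are kept solely for checking the result afterwards.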
Python: Data Generation and Random Initialization
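The code for this slide is also missing. One way to realize "randomly initialize cluster centers" is to draw each center uniformly inside the bounding box of the data. The helper name `init_centers_uniform` and the toy data below are assumptions for illustration:

```python
import numpy as np

def init_centers_uniform(X, k, rng):
    """Draw k centers uniformly at random inside the bounding box of X."""
    lo, hi = X.min(axis=0), X.max(axis=0)  # per-feature minimum and maximum
    return rng.uniform(lo, hi, size=(k, X.shape[1]))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))  # toy data standing in for the generated set
mu = init_centers_uniform(X, k=3, rng=rng)

print(mu.shape)  # (3, 2)
```

Restricting the draw to the bounding box is a design choice: centers placed far outside the data tend to end up with no points assigned to them.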
Python: K-Means
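The slide's implementation is not available here. The following is a minimal NumPy sketch of the iterative algorithm described earlier, not the course's own code. It assumes Euclidean distance, initializes the centers at `k` randomly chosen data points (one common variant of random initialization), and uses 0-based labels; the function name `kmeans` and its parameters are hypothetical:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means sketch: returns (labels c, centroids mu)."""
    rng = np.random.default_rng(seed)
    # Initialize centers at k distinct data points (assumption, see lead-in)
    mu = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # shape (m, k)
        c = dists.argmin(axis=1)
        # Update step: mean of the points in each cluster (keep the old center if empty)
        new_mu = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):  # convergence: centers no longer move
            break
        mu = new_mu
    return c, mu

# Usage on two well-separated toy blobs centered near (0, 0) and (5, 5)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
c, mu = kmeans(X, k=2)
print(mu)
```

On data this well separated, the two returned centroids should land close to the two blob means.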
Python: K-Means in Scikit-learn
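The slide's code is not preserved. A short sketch of the usual scikit-learn workflow, assuming toy data from `make_blobs` rather than the data used in the lecture:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data: 3 Gaussian blobs in 2-D (illustrative, not the slide's data)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0)
km.fit(X)

print(km.cluster_centers_.shape)  # (3, 2)
print(km.labels_[:10])            # cluster index of the first 10 points
print(km.inertia_)                # sum of squared distances to the nearest center
```

Note that scikit-learn numbers the clusters from 0 to 𝑘 − 1, whereas the slides index them from 1 to 𝑘.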
Initialization Issues
• K-means is extremely sensitive to cluster center initialization
• Bad initialization can lead to
– Poor convergence speed
– Bad overall clustering
• Safeguarding measures:
– Choose the first center as one of the examples, the second as the example farthest from the first, the third as the example farthest from both, and so on
– Try multiple initializations and choose the best result
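The first safeguard can be sketched as a farthest-first traversal. This sketch interprets "farthest from both" as maximizing the distance to the nearest already-chosen center, which is an assumption; the helper name `farthest_first_centers` and the toy blobs are hypothetical:

```python
import numpy as np

def farthest_first_centers(X, k, rng):
    """First center is a random example; each next one is the example
    farthest from all centers chosen so far."""
    centers = [X[rng.integers(len(X))]]
    # min_dist[i] = distance from point i to its nearest chosen center so far
    min_dist = np.linalg.norm(X - centers[0], axis=1)
    for _ in range(1, k):
        idx = int(min_dist.argmax())
        centers.append(X[idx])
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[idx], axis=1))
    return np.array(centers)

# Usage: three tight, well-separated toy blobs
rng = np.random.default_rng(0)
blob_means = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([m + rng.normal(scale=0.5, size=(50, 2)) for m in blob_means])
mu0 = farthest_first_centers(X, k=3, rng=rng)

# Each initial center should land in a different blob
nearest_blob = np.linalg.norm(mu0[:, None, :] - blob_means[None, :, :], axis=2).argmin(axis=1)
print(sorted(nearest_blob.tolist()))  # [0, 1, 2]
```

This deterministic rule tends to pick outliers as centers. For the second safeguard, scikit-learn's `KMeans` exposes an `n_init` parameter that reruns the algorithm from several initializations and keeps the run with the lowest objective.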
Choosing the Number of Clusters
• Idea: stop increasing 𝑘 when adding another cluster no longer gives a much better model of the data
• One way to select 𝑘 for the K-means algorithm is to try different values of 𝑘, plot the K-means objective versus 𝑘, and look for the 'elbow point' in the plot
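A sketch of the elbow procedure with scikit-learn, assuming toy data with four blobs and using `inertia_` (the sum of squared distances to the nearest center) as the K-means objective; the range of 𝑘 values is an arbitrary choice:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with 4 underlying blobs (illustrative)
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=0)

ks = range(1, 9)
# K-means objective for each candidate k, best of 10 initializations per k
objective = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

for k, J in zip(ks, objective):
    print(k, round(J, 1))
```

Plotting `objective` against `ks` gives the elbow plot. The objective keeps shrinking as 𝑘 grows, so the point of interest is where the decrease flattens out, not the minimum.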
K-means: Limitations
• Makes hard assignments of points to clusters
– A point either belongs completely to a cluster or does not belong to it at all
– No notion of a soft assignment (i.e., probability of being assigned to each cluster)
– Gaussian mixture model (we will study later) and Fuzzy K-means allow soft assignments
• Sensitive to outlier examples
– K-medians algorithm is a more robust alternative for data with outliers
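The sensitivity to outliers comes from the mean used in the center update. A small sketch with made-up numbers, comparing the mean with the coordinate-wise median (the center update used by K-medians) on a single cluster that contains one outlier:

```python
import numpy as np

# Five points near (1, 1) plus one outlier at (50, 50); values are illustrative
cluster = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.0, 1.2], [0.9, 0.8]])
with_outlier = np.vstack([cluster, [[50.0, 50.0]]])

mean_center = with_outlier.mean(axis=0)          # K-means style center
median_center = np.median(with_outlier, axis=0)  # K-medians style center

print(mean_center)
print(median_center)
```

The single outlier drags the mean to roughly (9.15, 9.17), far from every regular point, while the median stays at about (1.0, 1.05).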
• Works well only for round-shaped clusters of roughly equal size and density
• Performs poorly if the clusters have non-convex shapes
– Spectral clustering (we will study later) and Kernelized K-means can be alternatives
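A sketch of the non-convex case, assuming scikit-learn's two-moons toy data and its `SpectralClustering` with a nearest-neighbor affinity. The dataset, the neighbor count, and the use of the adjusted Rand index as a score are illustrative choices:

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: a classic non-convex clustering problem
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

labels_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
labels_sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                               n_neighbors=10, random_state=0).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping
ari_kmeans = adjusted_rand_score(y_true, labels_km)
ari_spectral = adjusted_rand_score(y_true, labels_sc)
print(ari_kmeans, ari_spectral)
```

K-means splits the plane with a straight boundary and therefore cuts across both moons, whereas the graph-based method follows the connectivity of the points and should score noticeably higher here.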
• Non-convex/non-round-shaped clusters: standard K-means fails!
• (optional) Connectivity → networks → spectral partitioning
• Clusters with different densities