Introduction to Machine Learning: Clustering Algorithms
02-223 How to Analyze Your Own Genome
Fall 2013
• Organizing data into clusters such that there is
  • high intra-cluster similarity
  • low inter-cluster similarity
• Informally, finding natural groupings among objects.
The quality or state of being similar; likeness; resemblance; as, a similarity of features. (Webster's Dictionary)
Similarity is hard to define, but… "We know it when we see it."
The real meaning of similarity is a philosophical question. We will take a more pragmatic approach.
Definition: Let x and y be two objects from the universe of possible objects. The distance (dissimilarity) between x and y is a real number denoted by D(x, y).
A common choice of similarity measure is the correlation:

$$ s(x, y) = \frac{\sum_{i=1}^{J} (x_i - \mu_x)(y_i - \mu_y)}{(J - 1)\,\sigma_x \sigma_y} $$

where x = (x_1, x_2, ..., x_J) and y = (y_1, y_2, ..., y_J), J is the number of features, and μ and σ denote the sample mean and standard deviation of each vector.
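The correlation similarity above can be computed directly; a minimal sketch (the example vectors are made up for illustration):

```python
import numpy as np

def pearson_similarity(x, y):
    """Correlation similarity s(x, y) between two length-J feature vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    J = len(x)
    sx = np.std(x, ddof=1)  # sample standard deviations (J - 1 in denominator)
    sy = np.std(y, ddof=1)
    return np.sum((x - x.mean()) * (y - y.mean())) / ((J - 1) * sx * sy)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
print(pearson_similarity(x, y))  # perfectly correlated -> 1.0
```

The value ranges from −1 (perfectly anti-correlated) to +1 (perfectly correlated), so higher means more similar.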
Clustering Algorithms
• Hierarchical agglomerative clustering
• K-means clustering algorithm
• Gaussian mixture model
Hierarchical Clustering
• Probably the most popular clustering algorithm in computational biology
• Agglomerative (bottom-up)
• Algorithm:
  1. Initialize: each item is its own cluster
  2. Iterate:
     • select the two most similar clusters
     • merge them
  3. Halt when only one cluster is left; the merge history forms a dendrogram.
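The three steps above can be sketched in a few lines; this is a toy implementation with average pairwise distance between clusters (the points and the stopping count k are made up, and real libraries use much faster algorithms):

```python
import numpy as np

def agglomerative(points, k=1):
    """Bottom-up clustering: start with singleton clusters, repeatedly
    merge the two closest clusters until only k remain."""
    pts = np.asarray(points, dtype=float)
    clusters = [[i] for i in range(len(pts))]      # 1. each item its own cluster
    while len(clusters) > k:                       # 2. iterate
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # average pairwise distance between the two clusters
                d = np.mean([np.linalg.norm(pts[i] - pts[j])
                             for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]    # merge the closest pair
        del clusters[b]
    return clusters                                # 3. halt

print(agglomerative([[0, 0], [0, 1], [5, 5], [5, 6]], k=2))  # [[0, 1], [2, 3]]
```

Recording which pair merges at which distance, instead of stopping at k clusters, is exactly what produces the dendrogram.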
Similarity Criterion: Single Linkage
• cluster similarity = similarity of the two most similar members
- Potentially long and skinny clusters
Example: Single Linkage
(figure: five points, labeled 1-5, merged step by step)
In most cases, (1 − r²), where r is the correlation coefficient, is used as the distance measure between samples.
Example: Single Linkage (continued)
(figure: successive merges of points 1-5 and the growing dendrogram)
Similarity Criterion: Complete Linkage
• cluster similarity = similarity of the two least similar members
+ Tight clusters
Similarity Criterion: Average Linkage
• cluster similarity = average similarity of all pairs
• The most widely used similarity measure
• Robust against noise
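The three linkage criteria differ only in how they reduce the set of pairwise distances between two clusters; a minimal sketch (Euclidean distance and the example points are assumptions for illustration):

```python
import numpy as np

def cluster_distance(A, B, pts, method="average"):
    """Distance between clusters A and B (lists of point indices)
    under the three linkage criteria."""
    d = [np.linalg.norm(np.asarray(pts[i], float) - np.asarray(pts[j], float))
         for i in A for j in B]
    if method == "single":    # closest pair -> long, skinny clusters
        return min(d)
    if method == "complete":  # farthest pair -> tight clusters
        return max(d)
    return sum(d) / len(d)    # average over all pairs -> robust to noise

pts = [[0, 0], [0, 2], [10, 0]]
A, B = [0, 1], [2]
print(cluster_distance(A, B, pts, "single"))    # 10.0
print(cluster_distance(A, B, pts, "complete"))  # sqrt(104), about 10.2
```

Swapping the `method` argument while keeping the same merge loop reproduces the single-, complete-, and average-linkage variants of hierarchical clustering.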
But What Are the Clusters?
In some cases we can determine the "correct" number of clusters. However, things are rarely this clear-cut, unfortunately.
K-means Clustering
• Nonhierarchical: each object is placed in exactly one of K non-overlapping clusters.
• The user has to specify the desired number of clusters, K.
• In hierarchical clustering, we use similarity measures between two observed samples, whereas in K-means clustering, we use the similarity between an observed sample and the cluster center (mean).
(figure: data on a 0-5 grid with three randomly placed centers k1, k2, k3)
• For a pre-defined number of clusters K, initialize K centers randomly.
(figure: each object assigned to the nearest of centers k1, k2, k3)
• Iterate between the following two steps:
  – Assign all objects to the nearest center.
  – Move each center to the mean of its members.
(figure: centers k1, k2, k3 moved to the means of their members)
• After moving the centers, re-assign the objects…
(figure: objects re-assigned to the moved centers k1, k2, k3)
• After moving the centers, re-assign the objects to the nearest centers.
• Move each center to the mean of its new members.
(figure: converged clusters around k1, k2, k3)
• Re-assign and move centers until no objects change membership.
Algorithm k-means
1. Decide on a value for K, the number of clusters.
2. Initialize the K cluster centers randomly.
3. Decide the cluster memberships of the N objects by assigning them to the nearest cluster center.
4. Re-estimate the K cluster centers, assuming the memberships found above are correct.
5. Repeat steps 3 and 4 until none of the N objects changes membership in an iteration.
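The five steps can be sketched as follows; a toy implementation with Euclidean distance (the example data, seed, and iteration cap are assumptions, not part of the slides):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-means following steps 1-5 above."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # step 2: pick K distinct data points as initial centers
    centers = X[rng.choice(len(X), size=K, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # step 3: assign every object to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                  # step 5: no membership changed
        labels = new_labels
        # step 4: re-estimate each center as the mean of its members
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels, centers

labels, _ = kmeans([[0, 0], [0, 1], [5, 5], [5, 6]], K=2)
print(labels)  # points 0,1 share one label; points 2,3 share the other
```

Note that the result depends on the random initialization in step 2; in practice the algorithm is often restarted several times and the solution with the smallest within-cluster distance is kept.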
Algorithm k-means (with notes)
1. Decide on a value for K, the number of clusters.
2. Initialize the K cluster centers (randomly, if necessary).
3. Decide the cluster memberships of the N objects by assigning them to the nearest cluster center (use one of the distance / similarity functions we discussed earlier).
4. Re-estimate the K cluster centers, assuming the memberships found above are correct (average / median of the cluster members).
5. Repeat steps 3 and 4 until none of the N objects changes membership in an iteration.
Gaussian Mixture Model as Soft Clustering
• In the K-means algorithm, each sample can be assigned to only a single cluster.
• Gaussian mixture models relax this assumption:
  – Each sample can be assigned to multiple clusters, with certain probabilities.
  – The data for each cluster is modeled with a Gaussian distribution:
    • Mean parameter μ: cluster center
    • Variance parameter σ²: cluster width
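Soft assignment can be illustrated by computing the per-cluster membership probabilities (responsibilities) of a sample under fixed 1-D Gaussian clusters; the means, variances, and weights below are made-up values for illustration (in practice they are fitted with the EM algorithm):

```python
import math

def responsibilities(x, means, variances, weights):
    """P(cluster k | x) for a 1-D Gaussian mixture: each sample gets a
    probability for every cluster, and the probabilities sum to 1."""
    dens = [w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
            for m, v, w in zip(means, variances, weights)]
    total = sum(dens)
    return [d / total for d in dens]

# three made-up clusters: centers (means) and widths (variances)
means, variances, weights = [0.0, 5.0, 10.0], [1.0, 1.0, 1.0], [1/3, 1/3, 1/3]
r = responsibilities(2.5, means, variances, weights)
print(r)        # split evenly between clusters 1 and 2; cluster 3 near zero
print(sum(r))   # sums to 1, like each row of the table on the next slide
```

A sample halfway between two cluster centers gets roughly 0.5 probability for each, which is exactly the uncertainty K-means's hard assignment cannot express.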
Soft Clustering of Individuals into Three Clusters with Gaussian Mixture Model

Probability of membership:

|               | Cluster 1 | Cluster 2 | Cluster 3 | Sum |
|---------------|-----------|-----------|-----------|-----|
| Individual 1  | 0.1       | 0.4       | 0.5       | 1   |
| Individual 2  | 0.8       | 0.1       | 0.1       | 1   |
| Individual 3  | 0.7       | 0.2       | 0.1       | 1   |
| Individual 4  | 0.10      | 0.05      | 0.85      | 1   |
| Individual 5  | …         | …         | …         | 1   |
| Individual 6  | …         | …         | …         | 1   |
| Individual 7  | …         | …         | …         | 1   |
| Individual 8  | …         | …         | …         | 1   |
| Individual 9  | …         | …         | …         | 1   |
| Individual 10 | …         | …         | …         | 1   |

• Each individual can be assigned to more than one cluster with a certain probability.
• For each individual, the probabilities for all clusters should sum to 1 (i.e., each row should sum to 1).
• Each cluster is described by a cluster center variable (i.e., cluster mean).
Summary
• Clustering algorithms
  – Hierarchical agglomerative clustering
    • Build a dendrogram over samples using similarity measures
  – K-means clustering algorithm
    • Partition samples into K clusters
    • Hard assignment: each sample can belong to only one cluster
  – Gaussian mixture models for clustering
    • Statistical/probabilistic variation of the K-means algorithm
    • Soft assignment: each sample can belong to multiple clusters, and this uncertainty in cluster assignment is represented as probabilities that sum to 1.