lec-3
-
Upload
noorahmed1991 -
Category
Documents
-
view
217 -
download
1
description
Transcript of lec-3
![Page 1: lec-3](https://reader038.fdocuments.in/reader038/viewer/2022103005/55cf903b550346703ba418ad/html5/thumbnails/1.jpg)
COL870: Clustering Algorithms
Ragesh Jaiswal, CSE, IIT Delhi
Ragesh Jaiswal, CSE, IIT Delhi COL870: Clustering Algorithms
![Page 2: lec-3](https://reader038.fdocuments.in/reader038/viewer/2022103005/55cf903b550346703ba418ad/html5/thumbnails/2.jpg)
The k-means Problem
Ragesh Jaiswal, CSE, IIT Delhi COL870: Clustering Algorithms
![Page 3: lec-3](https://reader038.fdocuments.in/reader038/viewer/2022103005/55cf903b550346703ba418ad/html5/thumbnails/3.jpg)
The k-means Problem
k-means
Given a set of points S ⊂ Rd in a d dimensional Euclidean space,and an integer k , output a set T ⊂ Rd of points (called centers)such that |T | = k and the following cost function is minimised:
cost(S ,T ) =∑x∈S
minz∈T||x − z ||2.
Figure : What is the solution for the 2-means problem for the above 2-Ddataset?
Ragesh Jaiswal, CSE, IIT Delhi COL870: Clustering Algorithms
![Page 4: lec-3](https://reader038.fdocuments.in/reader038/viewer/2022103005/55cf903b550346703ba418ad/html5/thumbnails/4.jpg)
The k-means Problem
k-means
Given a set of points S ⊂ Rd in a d dimensional Euclidean space,and an integer k , output a set T ⊂ Rd of points (called centers)such that |T | = k and the following cost function is minimised:
cost(S ,T ) =∑x∈S
minz∈T||x − z ||2.
How hard is the 1-means problem?
Given x1, ..., xn ∈ Rd find a point z ∈ Rd such thatf (z) =
∑i ||xi − z ||2 is minimized.
Ragesh Jaiswal, CSE, IIT Delhi COL870: Clustering Algorithms
![Page 5: lec-3](https://reader038.fdocuments.in/reader038/viewer/2022103005/55cf903b550346703ba418ad/html5/thumbnails/5.jpg)
The k-means Problem
k-means
Given a set of points S ⊂ Rd in a d dimensional Euclidean space,and an integer k , output a set T ⊂ Rd of points (called centers)such that |T | = k and the following cost function is minimised:
cost(S ,T ) =∑x∈S
minz∈T||x − z ||2.
How hard is the 1-means problem?
Given x1, ..., xn ∈ Rd find a point z ∈ Rd such thatf (z) =
∑i ||xi − z ||2 is minimized.
What is ∂f (z)∂zi
?
So, for what z ,∑
i ||xi − z ||2 gets minimized?
Ragesh Jaiswal, CSE, IIT Delhi COL870: Clustering Algorithms
![Page 6: lec-3](https://reader038.fdocuments.in/reader038/viewer/2022103005/55cf903b550346703ba418ad/html5/thumbnails/6.jpg)
The k-means Problem
k-means
Given a set of points S ⊂ Rd in a d dimensional Euclidean space,and an integer k , output a set T ⊂ Rd of points (called centers)such that |T | = k and the following cost function is minimised:
cost(S ,T ) =∑x∈S
minz∈T||x − z ||2.
How hard is the 1-means problem?
Given x1, ..., xn ∈ Rd find a point z ∈ Rd such thatf (z) =
∑i ||xi − z ||2 is minimized.
What is ∂f (z)∂zi
? ∂f (z)∂zi
= nzi −∑
j xji
So, for what z ,∑
i ||xi − z ||2 gets minimized? z =∑
i xin∑
i xin is called the centroid of the points x1, ..., xn.
Ragesh Jaiswal, CSE, IIT Delhi COL870: Clustering Algorithms
![Page 7: lec-3](https://reader038.fdocuments.in/reader038/viewer/2022103005/55cf903b550346703ba418ad/html5/thumbnails/7.jpg)
The k-means Problem
k-means
Given a set of points S ⊂ Rd in a d dimensional Euclidean space,and an integer k , output a set T ⊂ Rd of points (called centers)such that |T | = k and the following cost function is minimised:
cost(S ,T ) =∑x∈S
minz∈T||x − z ||2.
How hard is the k-means problem for k > 1 when d = 1? Inother words, how hard is the 1-dimensional k-means problem?
Ragesh Jaiswal, CSE, IIT Delhi COL870: Clustering Algorithms
![Page 8: lec-3](https://reader038.fdocuments.in/reader038/viewer/2022103005/55cf903b550346703ba418ad/html5/thumbnails/8.jpg)
The k-means Problem
k-means
Given a set of points S ⊂ Rd in a d dimensional Euclidean space,and an integer k , output a set T ⊂ Rd of points (called centers)such that |T | = k and the following cost function is minimised:
cost(S ,T ) =∑x∈S
minz∈T||x − z ||2.
How hard is the k-means problem for k > 1 when d = 1? Inother words, how hard is the 1-dimensional k-means problem?
There is a simpler Dynamic Programming algorithms for thisproblem!
Ragesh Jaiswal, CSE, IIT Delhi COL870: Clustering Algorithms
![Page 9: lec-3](https://reader038.fdocuments.in/reader038/viewer/2022103005/55cf903b550346703ba418ad/html5/thumbnails/9.jpg)
The k-means Problem
k-means
Given a set of points S ⊂ Rd in a d dimensional Euclidean space,and an integer k , output a set T ⊂ Rd of points (called centers)such that |T | = k and the following cost function is minimised:
cost(S ,T ) =∑x∈S
minz∈T||x − z ||2.
How hard is the k-means problem for k > 1 and d > 1?
NP-hard.
Ragesh Jaiswal, CSE, IIT Delhi COL870: Clustering Algorithms
![Page 10: lec-3](https://reader038.fdocuments.in/reader038/viewer/2022103005/55cf903b550346703ba418ad/html5/thumbnails/10.jpg)
The k-means Problem
k-means
Given a set of points S ⊂ Rd in a d dimensional Euclidean space, andan integer k, output a set T ⊂ Rd of points (called centers) such that|T | = k and the following cost function is minimised:
cost(S ,T ) =∑x∈S
minz∈T||x − z ||2.
How is this problem related to clustering?
The k centers induce a voronoi partition of Rd (and hence thegiven data points).The voronoi partition corresponding to a center z is the region ofspace whose nearest center (among the k centers) is z .
Ragesh Jaiswal, CSE, IIT Delhi COL870: Clustering Algorithms
![Page 11: lec-3](https://reader038.fdocuments.in/reader038/viewer/2022103005/55cf903b550346703ba418ad/html5/thumbnails/11.jpg)
The k-means ProblemThe k-means Algorithm
The most popular heuristic that used to solve the k-meansproblem in practice is the k-means algorithm (also known asLloyd’s Algorithm).
k-means Algorithm
- Initialize centers z1, ..., zk ∈ Rd .- Repeat until there is no further change in cost:
- For each j : Cj ← {x ∈ S |zj is the closest center of x}.- For each j : zj ← Centroid(Cj).
Ragesh Jaiswal, CSE, IIT Delhi COL870: Clustering Algorithms
![Page 12: lec-3](https://reader038.fdocuments.in/reader038/viewer/2022103005/55cf903b550346703ba418ad/html5/thumbnails/12.jpg)
The k-means ProblemThe k-means Algorithm
k-means Algorithm
- Initialize centers z1, ..., zk ∈ Rd .- Repeat until there is no further change in cost:
- For each j : Cj ← {x ∈ S |zj is the closest center of x}.- For each j : zj ← Centroid(Cj).
Claim 1: For any dataset S , let C̄i denote the set of centersafter the i th iteration of the loop. Then for all i ,cost(S , C̄i+1) ≤ cost(S , C̄i ).
Lemma
For any set S ⊂ Rd and any z ∈ Rd ,
cost(S , z) = cost(S ,Centroid(S)) + |S | · ||z − Centroid(S)||2.
Ragesh Jaiswal, CSE, IIT Delhi COL870: Clustering Algorithms
![Page 13: lec-3](https://reader038.fdocuments.in/reader038/viewer/2022103005/55cf903b550346703ba418ad/html5/thumbnails/13.jpg)
The k-means ProblemThe k-means Algorithm
k-means Algorithm
- Initialize centers z1, ..., zk ∈ Rd .- Repeat until there is no further change in cost:
- For each j : Cj ← {x ∈ S |zj is the closest center of x}.- For each j : zj ← Centroid(Cj).
Claim 1: For any dataset S , let C̄i denote the set of centersafter the i th iteration of the loop. Then for all i ,cost(S , C̄i+1) ≤ cost(S , C̄i ).
Claim 2: There exists datasets on which the k-meansalgorithm gives arbitrarily bad solutions.
Ragesh Jaiswal, CSE, IIT Delhi COL870: Clustering Algorithms
![Page 14: lec-3](https://reader038.fdocuments.in/reader038/viewer/2022103005/55cf903b550346703ba418ad/html5/thumbnails/14.jpg)
End
Ragesh Jaiswal, CSE, IIT Delhi COL870: Clustering Algorithms