Determining the k in k-means with MapReduce
![Page 1: Determining the k in k-means with MapReduce](https://reader033.fdocuments.in/reader033/viewer/2022060108/555096dfb4c90595208b4666/html5/thumbnails/1.jpg)
Determining the k in k-means with MapReduce
Thibault Debatty, Pietro Michiardi, Wim Mees & Olivier Thonnard
Algorithms for MapReduce and Beyond 2014
![Page 2: Determining the k in k-means with MapReduce](https://reader033.fdocuments.in/reader033/viewer/2022060108/555096dfb4c90595208b4666/html5/thumbnails/2.jpg)
Clustering & k-means
● Clustering
● K-means
  [Stuart P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28:129–137, 1982.]
  – 1982 (a great year!)
  – But still widely used
  – Drawbacks (amongst others):
    ● Local minimum
    ● K is a parameter!
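To make the two drawbacks concrete, here is a minimal sketch of Lloyd's algorithm in plain Python. The function name `kmeans`, the fixed iteration count and the seeded random initialization are assumptions made for illustration, not part of the talk. Note that `k` has to be supplied by the caller, and that the result depends on the initial centers (hence the risk of a local minimum).

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm on a list of equal-length tuples; k is chosen by the caller."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers: k input points picked at random
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its points.
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers
```

On two well-separated groups of points, `kmeans(points, 2)` should return one center per group; with a wrong `k` it still returns exactly `k` centers, which is the problem the rest of the talk addresses.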
![Page 3: Determining the k in k-means with MapReduce](https://reader033.fdocuments.in/reader033/viewer/2022060108/555096dfb4c90595208b4666/html5/thumbnails/3.jpg)
Clustering & k-means
● Determine k:
  – VERY difficult
    [Anil K. Jain. Data Clustering: 50 Years Beyond K-Means. Pattern Recognition Letters, 2009]
  – Using cluster evaluation metrics: Dunn's index, Elbow, Silhouette, “jump method” (based on information theory), “Gap statistic”, ...
    O(k²)
![Page 4: Determining the k in k-means with MapReduce](https://reader033.fdocuments.in/reader033/viewer/2022060108/555096dfb4c90595208b4666/html5/thumbnails/4.jpg)
G-means
● G-means
  [Greg Hamerly and Charles Elkan. Learning the k in k-means. In Neural Information Processing Systems. MIT Press, 2003]
● K-means: points in each cluster are spherically distributed around the center
  (Figure source: scikit-learn)
![Page 5: Determining the k in k-means with MapReduce](https://reader033.fdocuments.in/reader033/viewer/2022060108/555096dfb4c90595208b4666/html5/thumbnails/5.jpg)
G-means
● G-means: normality test & recursion
![Page 6: Determining the k in k-means with MapReduce](https://reader033.fdocuments.in/reader033/viewer/2022060108/555096dfb4c90595208b4666/html5/thumbnails/6.jpg)
G-means
Dataset
![Page 7: Determining the k in k-means with MapReduce](https://reader033.fdocuments.in/reader033/viewer/2022060108/555096dfb4c90595208b4666/html5/thumbnails/7.jpg)
G-means
1. Pick 2 centers
![Page 8: Determining the k in k-means with MapReduce](https://reader033.fdocuments.in/reader033/viewer/2022060108/555096dfb4c90595208b4666/html5/thumbnails/8.jpg)
G-means
2. k-means
![Page 9: Determining the k in k-means with MapReduce](https://reader033.fdocuments.in/reader033/viewer/2022060108/555096dfb4c90595208b4666/html5/thumbnails/9.jpg)
G-means
3. Project
![Page 10: Determining the k in k-means with MapReduce](https://reader033.fdocuments.in/reader033/viewer/2022060108/555096dfb4c90595208b4666/html5/thumbnails/10.jpg)
G-means
3. Project
![Page 11: Determining the k in k-means with MapReduce](https://reader033.fdocuments.in/reader033/viewer/2022060108/555096dfb4c90595208b4666/html5/thumbnails/11.jpg)
G-means
4. Normality test
Normal? No => recursion
![Page 12: Determining the k in k-means with MapReduce](https://reader033.fdocuments.in/reader033/viewer/2022060108/555096dfb4c90595208b4666/html5/thumbnails/12.jpg)
G-means
5. Recursion
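Steps 3 and 4 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the two child centers `c1` and `c2` are taken as given (the output of the 2-center k-means of step 2), and the correction factor `1 + 4/n - 25/n²` and the critical value `1.8692` are assumptions to be checked against the G-means paper. All function names are hypothetical.

```python
import math

def normal_cdf(z):
    """Cumulative distribution function of the standard normal law."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def anderson_darling(values):
    """Corrected Anderson-Darling statistic; mean and variance are estimated from the sample."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    z = sorted((v - mean) / std for v in values)
    eps = 1e-12  # clamp the CDF so that log() never receives 0
    total = 0.0
    for i in range(n):
        low = min(max(normal_cdf(z[i]), eps), 1.0 - eps)
        high = min(max(normal_cdf(z[n - 1 - i]), eps), 1.0 - eps)
        total += (2 * i + 1) * (math.log(low) + math.log(1.0 - high))
    a2 = -n - total / n
    return a2 * (1.0 + 4.0 / n - 25.0 / n ** 2)  # assumed small-sample correction

def project(points, c1, c2):
    """Step 3: project every point on the vector v = c1 - c2 joining the child centers."""
    v = [a - b for a, b in zip(c1, c2)]
    norm2 = sum(a * a for a in v)
    return [sum(a * b for a, b in zip(p, v)) / norm2 for p in points]

def should_split(points, c1, c2, critical=1.8692):
    """Step 4: the cluster is split (recursion) when its projection does not look normal."""
    return anderson_darling(project(points, c1, c2)) > critical
```

A cluster for which `should_split` returns `True` is replaced by its two children, and steps 1 to 4 are applied again to each child; otherwise the original center is kept.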
![Page 13: Determining the k in k-means with MapReduce](https://reader033.fdocuments.in/reader033/viewer/2022060108/555096dfb4c90595208b4666/html5/thumbnails/13.jpg)
MapReduce G-means
● Challenges:
1. Reduce I/O operations
2. Reduce number of jobs
3. Maximize parallelism
4. Limit memory usage
![Page 15: Determining the k in k-means with MapReduce](https://reader033.fdocuments.in/reader033/viewer/2022060108/555096dfb4c90595208b4666/html5/thumbnails/15.jpg)
MapReduce G-means
2. Reduce number of jobs
PickInitialCenters
while Not ClusteringCompleted do
    KMeans
    KMeansAndFindNewCenters
    TestClusters
end while
![Page 16: Determining the k in k-means with MapReduce](https://reader033.fdocuments.in/reader033/viewer/2022060108/555096dfb4c90595208b4666/html5/thumbnails/16.jpg)
MapReduce G-means
TestClusters
procedure Map(key, point)
    Find cluster
    Find vector
    Project point on vector
    Emit(cluster, projection)
end procedure
procedure Reduce(cluster, projections)
    Build a vector
    ADtest(vector)
    if normal then
        Mark cluster
    end if
end procedure
3. Maximize parallelism
4. Limit memory usage
![Page 17: Determining the k in k-means with MapReduce](https://reader033.fdocuments.in/reader033/viewer/2022060108/555096dfb4c90595208b4666/html5/thumbnails/17.jpg)
MapReduce G-means
TestClusters
Bottleneck: Reduce(cluster, projections)
3. Maximize parallelism
4. Limit memory usage (risk of crash)
![Page 18: Determining the k in k-means with MapReduce](https://reader033.fdocuments.in/reader033/viewer/2022060108/555096dfb4c90595208b4666/html5/thumbnails/18.jpg)
MapReduce G-means
TestFewClusters (in-memory combiner)
procedure Map(key, point)
    Find cluster
    Find vector
    Project point on vector
    Add projection to list
end procedure
procedure Close()
    for each list do
        Build a vector
        A2 = ADtest(vector)
        Emit(cluster, A2)
    end for each
end procedure
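The in-memory combiner pattern of TestFewClusters can be sketched as a small Python class. The class name `FewClustersMapper` and the `ad_test` callable are hypothetical; in a real Hadoop job the same structure would live in the mapper's per-record and end-of-task hooks. What happens to the emitted A2 values downstream is not shown on the slide and is left out here.

```python
from collections import defaultdict

class FewClustersMapper:
    """Mapper that buffers projections per cluster and tests them when the task closes."""

    def __init__(self, centers, vectors, ad_test):
        self.centers = centers
        self.vectors = vectors
        self.ad_test = ad_test
        self.lists = defaultdict(list)  # in-memory combiner: one list per cluster

    def map(self, key, point):
        # Find cluster: nearest center by squared Euclidean distance.
        cluster = min(
            range(len(self.centers)),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(point, self.centers[c])),
        )
        v = self.vectors[cluster]       # Find vector
        projection = sum(a * b for a, b in zip(point, v)) / sum(a * a for a in v)
        self.lists[cluster].append(projection)  # Add projection to list, emit nothing

    def close(self):
        # One statistic per cluster seen by this mapper, computed on its own input split.
        for cluster, values in sorted(self.lists.items()):
            yield cluster, self.ad_test(values)  # Emit(cluster, A2)
```

Because every mapper tests its own share of the data, the work is spread over all map tasks even when there are only a few clusters, at the price of holding the projections of one input split in memory.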
![Page 20: Determining the k in k-means with MapReduce](https://reader033.fdocuments.in/reader033/viewer/2022060108/555096dfb4c90595208b4666/html5/thumbnails/20.jpg)
MapReduce G-means
Switch from TestFewClusters to TestClusters when:
● #clusters > #reducers
  &
● Estimated required memory < Java heap
  Experimentally: 64 bytes / point
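The choice between the two jobs can be written as a small decision function. This is a sketch under two assumptions that go beyond the slide: the condition is read as the criterion for using TestClusters (TestFewClusters being the fallback), and the required memory is estimated as the size of the largest cluster times the experimental 64 bytes per point. The function name and its parameters are hypothetical.

```python
BYTES_PER_POINT = 64  # experimental figure quoted on the slide

def choose_test_job(cluster_sizes, num_reducers, heap_bytes):
    """Return the name of the job to run for the current set of clusters."""
    # Assumption: a reducer must hold all projections of the largest cluster.
    estimated_memory = max(cluster_sizes) * BYTES_PER_POINT
    enough_clusters = len(cluster_sizes) > num_reducers   # #clusters > #reducers
    fits_in_heap = estimated_memory < heap_bytes          # estimated memory < Java heap
    if enough_clusters and fits_in_heap:
        return "TestClusters"
    return "TestFewClusters"
```

For example, with 8 reducers and a 1 GiB heap, 4 clusters of 1,000 points give `TestFewClusters`, while 16 clusters of 1,000 points give `TestClusters`.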
![Page 21: Determining the k in k-means with MapReduce](https://reader033.fdocuments.in/reader033/viewer/2022060108/555096dfb4c90595208b4666/html5/thumbnails/21.jpg)
Comparison
MR multi-k-means vs. MR G-means: speed and quality
● MR multi-k-means: all possible values of k in a single job
![Page 22: Determining the k in k-means with MapReduce](https://reader033.fdocuments.in/reader033/viewer/2022060108/555096dfb4c90595208b4666/html5/thumbnails/22.jpg)
Comparison
Speed
● MR multi-k-means: O(nk²) computations
● MR G-means: O(nk) computations
  But:
  – more iterations
  – more dataset reads
  – log₂(k)
Quality
● MR G-means: new centers added if and where needed
  But: tends to overestimate k!
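The two orders of magnitude can be recovered with a short calculation. It assumes that one k-means pass over n points with i centers costs about n·i distance computations, that multi-k-means evaluates every candidate value from 1 to k, and that G-means doubles the number of centers at each iteration until it reaches k = 2^m; these are simplifying assumptions, not figures from the talk.

```latex
% Assumption: one pass over n points with i centers costs about n*i distance computations.
\begin{align*}
\text{MR multi-k-means:}\quad
  & \sum_{i=1}^{k} n\,i \;=\; \frac{n\,k\,(k+1)}{2} \;=\; O(nk^{2}) \\
\text{MR G-means } (k = 2^{m}):\quad
  & \sum_{j=0}^{m} n\,2^{j} \;=\; n\,\bigl(2^{m+1}-1\bigr) \;=\; n\,(2k-1) \;=\; O(nk)
\end{align*}
```

Under the same assumptions G-means needs about m = log₂(k) rounds, each of which reads the dataset again, which is the trade-off listed under "But".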
![Page 23: Determining the k in k-means with MapReduce](https://reader033.fdocuments.in/reader033/viewer/2022060108/555096dfb4c90595208b4666/html5/thumbnails/23.jpg)
Experimental results: Speed
● Hadoop
● Synthetic dataset
● 10M points in R^10
● Euclidean distance
● 8 machines
![Page 24: Determining the k in k-means with MapReduce](https://reader033.fdocuments.in/reader033/viewer/2022060108/555096dfb4c90595208b4666/html5/thumbnails/24.jpg)
Experimental results: Quality
● Hadoop
● Synthetic dataset
● 10M points in R^10
● Euclidean distance
● 8 machines

k                                 100     200     400
k found                           150     279     639     (x ~1.5)
Within Cluster Sum of Squares (less is better):
  MR G-means                      3.34    3.33    3.23
  multi-k-means (with same k)     3.71    3.6     3.39
![Page 25: Determining the k in k-means with MapReduce](https://reader033.fdocuments.in/reader033/viewer/2022060108/555096dfb4c90595208b4666/html5/thumbnails/25.jpg)
Conclusions & future work...
● MapReduce algorithm to determine k
● Running time proportional to k
● Future:
  – Overestimation of k
  – Test on real data
  – Test scalability
  – Reduce I/O (using Spark)
  – Consider skewed data
  – Consider impact of machine failure
![Page 26: Determining the k in k-means with MapReduce](https://reader033.fdocuments.in/reader033/viewer/2022060108/555096dfb4c90595208b4666/html5/thumbnails/26.jpg)
Thank you!