Clustering (Part 2)

Transcript of DEIcapri/BDC/MATERIAL/Clustering-2-1920.pdf

Center-based clustering

Let P be a set of N points in a metric space (M, d), and let k be the target number of clusters, 1 ≤ k ≤ N. We define a k-clustering of P as a tuple C = (C1, C2, . . . , Ck; c1, c2, . . . , ck) where

• (C1, C2, . . . , Ck) defines a partition of P, i.e., P = C1 ∪ C2 ∪ · · · ∪ Ck and Ci ∩ Cj = ∅ for i ≠ j

• c1, c2, . . . , ck are suitably selected centers for the clusters, where ci ∈ Ci for every 1 ≤ i ≤ k.

Observe that the above definition requires the centers to belong to the clusters, hence to the pointset. Whenever appropriate, we will discuss the case when the centers can be chosen more freely from the metric space.
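For concreteness, a minimal Python sketch of this representation and of the validity check implied by the definition (the names and the tuple-of-coordinates point type are illustrative assumptions, not from the slides):

def is_valid_k_clustering(P, clusters, centers):
    # A k-clustering is represented as a pair (clusters, centers), where
    # clusters[i] plays the role of C_{i+1} and centers[i] of c_{i+1}.
    # Condition 1: (C1, ..., Ck) is a partition of P (multiset check).
    if sorted(p for C in clusters for p in C) != sorted(P):
        return False
    # Condition 2: each center belongs to its own cluster (hence to P).
    return all(c in C for C, c in zip(clusters, centers))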


Example of 5-clustering

[Figure not reproduced.]

Center-based clustering

In this course, we study the three most popular center-based clustering problems. Namely (a formal definition will be given shortly):

• k-center clustering: requires minimizing the maximum distance of any point from the closest center

• k-means clustering: requires minimizing the sum of the squared distances of the points from their closest centers

• k-median clustering: requires minimizing the sum of the distances of the points from their closest centers

Observation: for k-means and k-median, minimizing the sum of (squared) distances is equivalent to minimizing the average (squared) distance.


Center-based clustering: k-center objective

[Figure not reproduced.]

Center-based clustering: k-means/median objectives

[Figure not reproduced.]

Center-based clustering: formal definitions

Definition

Consider a metric space (M, d) and an integer k > 0. A k-clustering problem for (M, d) is an optimization problem whose instances are the finite pointsets P ⊆ M. For each instance P, the problem requires computing a k-clustering C of P (feasible solution) which minimizes a suitable objective function Φ(C).

The following are the three most popular objective functions.

• Φkcenter(C) = max_{1 ≤ i ≤ k} max_{a ∈ Ci} d(a, ci)   (k-center clustering)

• Φkmeans(C) = Σ_{i=1..k} Σ_{a ∈ Ci} (d(a, ci))²   (k-means clustering)

• Φkmedian(C) = Σ_{i=1..k} Σ_{a ∈ Ci} d(a, ci)   (k-median clustering)

The k-means objective is also referred to as Sum of Squared Errors (SSE).
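As an illustration, a minimal Python sketch of the three objectives, assuming points are tuples of coordinates and d is the Euclidean distance (any metric would do; the names are not from the slides):

from math import dist  # Euclidean distance between coordinate tuples (Python 3.8+)

def phi_kcenter(clusters, centers):
    # max over all clusters of the max distance of a point from its center
    return max(max(dist(a, c) for a in C) for C, c in zip(clusters, centers))

def phi_kmeans(clusters, centers):
    # sum of squared distances of the points from their centers (SSE)
    return sum(dist(a, c) ** 2 for C, c in zip(clusters, centers) for a in C)

def phi_kmedian(clusters, centers):
    # sum of distances of the points from their centers
    return sum(dist(a, c) for C, c in zip(clusters, centers) for a in C)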


Observations

• All aforementioned problems (k-center, k-means, k-median) are NP-hard ⇒ it is impractical to search for optimal solutions.

• There are several efficient approximation algorithms that in practice return good-quality solutions. However, dealing efficiently with large inputs is still a challenge!

• How do we decide which problem suits our needs better?

• k-center is useful when we need to guarantee that every point is close to a center (e.g., if centers represent locations of crucial facilities or if they are needed as "representatives" for the data). As we will see later, it is sensitive to noise.

• k-means/median provide guarantees on the average (squared) distances. k-means is more sensitive to noise, but there exist faster algorithms to solve it.


Observations

• k-center and k-median belong to the family of facility-location problems, which, given a set F of candidate locations of facilities and a set C of customers, require finding a subset of k locations where to open facilities, and an assignment of customers to them, so as to minimize the max/avg distance between a customer and its assigned facility.


Notations

Let P be a pointset from a metric space (M, d). For every integer k ∈ [1, |P|] we define

• Φoptkcenter(P, k) = minimum value of Φkcenter(C) over all possible k-clusterings C of P.

• Φoptkmeans(P, k) = minimum value of Φkmeans(C) over all possible k-clusterings C of P.

• Φoptkmedian(P, k) = minimum value of Φkmedian(C) over all possible k-clusterings C of P.

Also, for any point p ∈ P and any subset S ⊆ P we define

d(p, S) = min{d(p, q) : q ∈ S},

which generalizes the distance function to distances between points and sets.
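In code, the point-to-set distance is a one-line generalization (same Euclidean assumption as above):

from math import dist

def dist_to_set(p, S):
    # d(p, S) = min over q in S of d(p, q)
    return min(dist(p, q) for q in S)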


Partitioning primitive

Let P be a pointset and S ⊆ P a set of k selected centers. For all previously defined clustering problems, the best k-clustering around these centers is the one where each point is assigned to the cluster of the closest center (ties broken arbitrarily).

We will use the primitive Partition(P, S) to denote this task:

Partition(P,S)

Let S = {c1, c2, . . . , ck} ⊆ P
for i ← 1 to k do Ci ← {ci}
for each p ∈ P − S do
    ℓ ← argmin_{1 ≤ i ≤ k} d(p, ci) // ties broken arbitrarily
    Cℓ ← Cℓ ∪ {p}
C ← (C1, C2, . . . , Ck; c1, c2, . . . , ck)
return C
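A direct Python transcription of the primitive, under the same illustrative assumptions as the earlier sketches (Euclidean metric, points as tuples):

from math import dist

def partition(P, S):
    # S lists the selected centers c1, ..., ck (all belonging to P).
    centers = list(S)
    clusters = [[c] for c in centers]   # each cluster starts with its center
    for p in P:
        if p in centers:                # centers are already placed
            continue
        # index of the closest center; min breaks ties by smallest index
        l = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        clusters[l].append(p)
    return clusters, centers

Each point is compared with all k centers, so the cost is O(|P| · k) distance computations.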


Exercise

Let |P| = N and suppose that each point x ∈ P is represented as a key-value pair (IDx, (x, f)), where IDx is a distinct integer in [0, N), and f is a binary flag which is 1 if x ∈ S and 0 otherwise.

Assume that k = |S| = O(√N). Design a 1-round MapReduce algorithm that implements Partition(P, S) with ML = O(√N) and MA = O(N).


k-center clustering


Farthest-First Traversal: algorithm

• Popular 2-approximation sequential algorithm developed by T.F. Gonzalez [Theoretical Computer Science, 38:293-306, 1985]

• Simple and reasonably fast implementation when the input fits in main memory.

• Powerful primitive for extracting samples for further analyses

Input: Set P of N points from a metric space (M, d), integer k > 1

Output: k-clustering C = (C1, C2, . . . , Ck; c1, c2, . . . , ck) of P which is a good approximation to the k-center problem for P.


Farthest-First Traversal: algorithm (cont’d)

S ← {c1} // c1 ∈ P arbitrary point
for i ← 2 to k do
    Find the point ci ∈ P − S that maximizes d(ci, S)
    S ← S ∪ {ci}
return Partition(P, S)

Observation: The invocation of Partition(P, S) at the end can be avoided by maintaining, in every iteration of the for-loop, the best assignment of the points to the current set of centers (see the sketch below).
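A Python sketch of the whole algorithm (same illustrative Euclidean setting as before). Following the observation, it maintains for every point the distance to, and the index of, its closest current center; this removes the final Partition call and also yields the O(N · k) running time addressed by a later exercise:

from math import dist, inf

def farthest_first_traversal(P, k):
    # Gonzalez's algorithm: a 2-approximation for k-center.
    P = list(P)
    centers = [P[0]]                # c1: an arbitrary first center
    closest = [inf] * len(P)        # closest[j] = d(P[j], current S)
    assign = [0] * len(P)           # index of the center closest to P[j]
    for i in range(1, k + 1):
        c = centers[-1]             # refresh distances w.r.t. the last center
        for j, p in enumerate(P):
            dpc = dist(p, c)
            if dpc < closest[j]:
                closest[j], assign[j] = dpc, len(centers) - 1
        if i == k:                  # k centers selected: stop
            break
        # next center: the point farthest from the current set S
        farthest = max(range(len(P)), key=closest.__getitem__)
        centers.append(P[farthest])
    clusters = [[] for _ in centers]
    for j, p in enumerate(P):
        clusters[assign[j]].append(p)
    return clusters, centers

Each of the k iterations performs a single scan of P, hence the total cost is O(N · k).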

Farthest-First Traversal: example (k = 5)

[Sequence of figures showing the successive selection of the five centers; not reproduced.]

Exercise

Show that the Farthest-First Traversal algorithm can be implemented to run in O(N · k) time.

Hint: make sure that in each iteration i of the for-loop each point p ∈ P − S knows its closest center among c1, c2, . . . , ci−1 and the distance from such a center.


Farthest-First Traversal: analysis

Theorem

Let Calg be the k-clustering of P returned by the Farthest-First Traversal algorithm. Then:

Φkcenter(Calg) ≤ 2 · Φoptkcenter(P, k).

That is, Farthest-First Traversal is a 2-approximation algorithm.

Proof of Theorem

Let S = {c1, c2, . . . , ck} be the set of centers returned by Farthest-First Traversal with input P, and let Si = {c1, c2, . . . , ci} be the partial set of centers determined up to iteration i of the for-loop.

It is easy to see that:

d(ci, Si−1) ≤ d(cj, Sj−1)   ∀ 2 ≤ j < i ≤ k   (1)


Farthest-First Traversal: analysis (cont’d)

Proof of Theorem (cont’d)

Let q be the point with maximum d(q, S) (i.e., maximum distance from the closest center). Therefore, Φkcenter(Calg) = d(q, S) = d(q, Sk). Let ci be the closest center to q, that is, d(q, S) = d(q, ci).

We now show that the k + 1 points c1, c2, . . . , ck, q are such that any two of them are at distance at least Φkcenter(Calg) from one another.

Indeed, since we assumed ci to be the closest center to q, we have

Φkcenter(Calg) = d(q, ci) ≤ d(q, cj)   ∀ j ≠ i.

Note that q would be the point selected by the algorithm as the (k + 1)-st center (if more than k centers were sought). Hence, by Inequality (1) in the previous slide, we have that for any pair of indices j1, j2, with 1 ≤ j1 < j2 ≤ k,

Φkcenter(Calg) = d(q, Sk) ≤ d(ck, Sk−1) ≤ d(cj2, Sj2−1) ≤ d(cj2, cj1),

where the last inequality holds since cj1 ∈ Sj2−1.

[Figure illustrating the k + 1 points c1, c2, . . . , ck, q; not reproduced.]

Farthest-First Traversal: analysis (cont’d)

Proof of Theorem (cont’d).

Now we know that all pairwise distances between c1, c2, . . . , ck, q are at least Φkcenter(Calg). Clearly, by the pigeonhole principle, two of these k + 1 points must fall in the same cluster of the optimal clustering.

Suppose that x, y are two points in {c1, c2, . . . , ck, q} which belong to the same cluster of the optimal clustering with center c. By the triangle inequality, we have that

Φkcenter(Calg) ≤ d(x, y) ≤ d(x, c) + d(c, y) ≤ 2Φoptkcenter(P, k),

and the theorem follows.


Observations on k-center clustering

• k-center clustering provides a strong guarantee on how close each point is to the center of its cluster.

• For any fixed ε > 0 it is NP-hard to compute a k-clustering Calg of a pointset P with

Φkcenter(Calg) ≤ (2 − ε)Φoptkcenter(P, k),

hence the Farthest-First Traversal is likely to provide almost the best approximation guarantee obtainable in polynomial time.

• For noisy pointsets (e.g., pointsets with outliers) the clustering which optimizes the k-center objective may obfuscate some "natural" clustering inherent in the data (see the example in the next slide).


Example: noisy pointset

[Figure not reproduced.]

k-center clustering for big data

How can we compute a "good" k-center clustering of a pointset P that is too large for a single machine?

Observation: Farthest-First Traversal requires k − 1 scans of the pointset P, which is impractical for massive pointsets when k is not small. In these cases, we resort to the following general technique.

(Composable) coreset technique

It is a variant of sampling:

1 Extract a "small" subset of the input (dubbed coreset), making sure that it contains a good solution to the problem.

2 Run the best known (sequential) algorithm, possibly an expensive one, on the small coreset rather than on the entire input, still obtaining a fairly good solution.

Observation: the technique is effective when the coreset can be extracted efficiently and, in particular, when it can be composed by combining smaller coresets independently extracted from some (arbitrary) partition of the input (see the sketch below).
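In skeletal form, the pattern looks as follows (a minimal Python sketch; extract_coreset and solve are illustrative stand-ins for problem-specific routines):

def composable_coreset_solve(chunks, extract_coreset, solve):
    # Phase 1: extract a small coreset from each chunk of the input;
    # the extractions are independent, hence parallelizable.
    partial = [extract_coreset(chunk) for chunk in chunks]
    # Phase 2: combine the partial coresets and run the (possibly
    # expensive) sequential algorithm on the combined coreset only.
    coreset = [p for part in partial for p in part]
    return solve(coreset)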


Coreset technique

[Figure not reproduced.]

Composable coreset technique

[Figure not reproduced.]

Application of coreset technique to k-center

• Consider a large pointset P and a target granularity k.

• Select a coreset T ⊂ P, making sure that each x ∈ P − T is close to some point of T (approx. within Φoptkcenter(P, k)).

• Search for the set S of k final centers in T, so that each point of T is at distance approx. within Φoptkcenter(P, k) from the closest center.

• By combining the above two points, we gather that each point of P is at distance O(Φoptkcenter(P, k)) from the closest center.

• The MR algorithm described next implements this idea using a composable coreset construction.


MapReduce-Farthest-First Traversal

Let P be a set of N points (N large!) from a metric space (M, d), and let k > 1 be an integer. The following MapReduce algorithm (MR-Farthest-First Traversal) computes a good k-center clustering. (The details of the map and reduce phases of each round are left as an exercise.)

• Round 1: Partition P arbitrarily into ℓ subsets of equal size P1, P2, . . . , Pℓ and execute the Farthest-First Traversal algorithm on each Pi separately to determine a set Ti of k centers, for 1 ≤ i ≤ ℓ.

• Round 2: Gather the coreset T = T1 ∪ T2 ∪ · · · ∪ Tℓ (of size ℓ · k) and run, using a single reducer, the Farthest-First Traversal algorithm on T to determine a set S = {c1, c2, . . . , ck} of k centers.

• Round 3: Execute Partition(P, S) (see the previous exercise on how to implement Partition(P, S) in one round).

Observation: note that in Rounds 1 and 2 the Farthest-First Traversal algorithm is used to determine only centers, not complete clusterings. A sequential sketch of the three rounds follows.
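A sequential Python sketch of the three rounds, reusing the farthest_first_traversal and partition sketches above (an illustration only: in an actual MapReduce implementation Round 1 runs in parallel on the ℓ subsets, and Round 3 is the 1-round Partition of the earlier exercise):

def mr_farthest_first_traversal(P, k, l):
    # Round 1: arbitrary partition of P into l subsets of (roughly) equal
    # size; extract k centers from each subset (assumes len(P) >= l * k).
    P = list(P)
    coreset = []
    for i in range(l):
        _, T_i = farthest_first_traversal(P[i::l], k)
        coreset.extend(T_i)
    # Round 2: run Farthest-First Traversal on the coreset T (size l * k)
    # to obtain the k final centers S.
    _, S = farthest_first_traversal(coreset, k)
    # Round 3: assign every point of P to its closest center in S.
    return partition(P, S)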


Analysis of MR-Farthest-First Traversal

Assume k ≤ √N. By setting ℓ = √(N/k), the 3-round MR-Farthest-First Traversal algorithm uses

• Local space ML = O(√(N · k)) = o(N). Indeed,

R1: O(N/ℓ) = O(√(N · k))
R2: O(ℓ · k) = O(√(N · k))
R3: O(√N) (from the exercise).

• Aggregate space MA = O(N)
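As a worked check of the local-space bounds for Rounds 1 and 2:

\[
\ell = \sqrt{N/k} \;\Longrightarrow\; \frac{N}{\ell} = \frac{N}{\sqrt{N/k}} = \sqrt{N k},
\qquad \ell \cdot k = \sqrt{N/k} \cdot k = \sqrt{N k}.
\]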

Theorem

Let Calg be the k-clustering of P returned by the MR-Farthest-First Traversal algorithm. Then:

Φkcenter(Calg) ≤ 4 · Φoptkcenter(P, k).

That is, MR-Farthest-First Traversal is a 4-approximation algorithm.


Analysis of MR-Farthest-First Traversal (cont’d)

Proof of Theorem

For 1 ≤ i ≤ ℓ, let di be the maximum distance of a point of Pi from the centers Ti determined in Round 1.

By reasoning as in the analysis of Farthest-First Traversal, we can show that there exist k + 1 points in Pi whose pairwise distances are all ≥ di. At least two such points, say x, y, must belong to the same cluster of the optimal clustering for P (not Pi!), with center c. Therefore,

di ≤ d(x, y) ≤ d(x, c) + d(c, y) ≤ 2Φoptkcenter(P, k).

Consider now the set of centers S determined in Round 2 from the coreset T, and let dT be the maximum distance of a point of T from S. The same argument as above shows that

dT ≤ 2Φoptkcenter(P, k).


Analysis of MR-Farthest-First Traversal (cont’d)

Proof of Theorem (cont’d).

Finally, consider an arbitrary point p ∈ P and suppose that p ∈ Pi, for some 1 ≤ i ≤ ℓ. By virtue of the two inequalities derived in the previous slide, we know that there must exist a point t ∈ Ti (hence t ∈ T) and a point c ∈ S, such that

d(p, t) ≤ 2Φoptkcenter(P, k)

d(t, c) ≤ 2Φoptkcenter(P, k).

Therefore, by the triangle inequality, we have that

d(p, c) ≤ d(p, t) + d(t, c) ≤ 4Φoptkcenter(P, k),

and this immediately implies that Φkcenter(Calg) ≤ 4 · Φoptkcenter(P, k).


Observations on MR-Farthest-First Traversal

• The sequential Farthest-First Traversal algorithm is used both to extract the coreset and to compute the final set of centers. It provides a good coreset since it ensures that any point not in the coreset is well represented by some coreset point.

• MR-Farthest-First Traversal is able to handle very large pointsets, and the final approximation is not too far from the best achievable one.

• By selecting k′ > k centers from each subset Pi in Round 1, the quality of the final clustering improves. In fact, it can be shown that when P satisfies certain properties and k′ is sufficiently large, MR-Farthest-First Traversal can almost match the approximation quality of the sequential Farthest-First Traversal, while still using sublinear local space and linear aggregate space.


Exercise

Let P be a set of N points in a metric space (M, d), and let T ⊆ P be a coreset of |T| > k points such that for each x ∈ P we have d(x, T) ≤ ε · Φoptkcenter(P, k), for some ε ∈ (0, 1). Let S be the set of k centers obtained by running the Farthest-First Traversal algorithm on T, and let CS be the clustering returned by Partition(P, S). Prove an upper bound on Φkcenter(CS) as a function of ε and Φoptkcenter(P, k).

Exercise

Let P be a set of points in a metric space (M, d), and let T ⊆ P. For any k < |T|, |P|, show that Φoptkcenter(T, k) ≤ 2 · Φoptkcenter(P, k). Is the bound tight?


Exercise

Let P be a set of N points in a metric space (M, d), and let C = (C1, C2, . . . , Ck; c1, c2, . . . , ck) be a k-clustering of P. Initially, each point q ∈ P is represented by a pair (ID(q), (q, c(q))), where ID(q) is a distinct key in [0, N − 1] and c(q) ∈ {c1, . . . , ck} is the center of the cluster of q.

1 Design a 2-round MapReduce algorithm that for each cluster center ci determines the most distant point among those belonging to the cluster Ci (ties can be broken arbitrarily).

2 Analyze the local and aggregate space required by your algorithm. Your algorithm must require o(N) local space and O(N) aggregate space.


Exercise

Let P be a set of N bicolored points from a metric space, partitioned into k clusters C1, C2, . . . , Ck. Each point x ∈ P is initially represented by the key-value pair (IDx, (x, ix, γx)), where IDx is a distinct key in [0, N − 1], ix is the index of the cluster to which x belongs, and γx ∈ {0, 1} is the color of x.

1 Design a 2-round MapReduce algorithm that for each cluster Ci checks whether all points of Ci have the same color. The output of the algorithm must be the k pairs (i, bi), with 1 ≤ i ≤ k, where bi = −1 if Ci contains points of different colors; otherwise bi is the color common to all points of Ci.

2 Analyze the local and aggregate space required by your algorithm. Your algorithm must require o(N) local space and O(N) aggregate space.
