Algorithms at Scale - NUS Computinggilbert/CS5234/2019/... · 2019. 9. 20. · Algorithms at Scale...

Algorithms at Scale (Week 6)


Summary

Today: Clustering and Streaming

k-median clustering
• Find k centers to minimize the average distance to a center.

LP approximation algorithm
• Find 2k centers that give a 4-approximation of the optimal clustering.

Streaming
• Find k centers in a stream of points.
• Use a hierarchical scheme to reduce space.

Other clustering problems

Last Week: Graph Streaming
• Connectivity
• Bipartite test
• MST
• Spanners
• Matching

Going forward…

Problem set due Thursday, October 3:
• Experimental problem set.
• Implement a streaming algorithm/sketch.
• See what performance you can get.
• Goal: test it out and see what you can learn about it.

Coming up:
• End-of-semester Mini Project.
• Teams of two.
• Goal: look more deeply into a topic covered in this class.
• I’ll provide options from each of the 4 parts of the class (sublinear time/sampling, streaming, caching, parallel).
• Will send more information.

Task: Find a partner this week.

k-Clustering

Given points: P = p1, p2, …, pn

Assumptions:
⇒ Points are in a metric space: distances satisfy the triangle inequality.
⇒ (Think: Euclidean space)
⇒ The number of clusters k is given.

Goal:
⇒ Choose a set of k points (“centers”) that minimize some cost metric.

Example: 3 clusters

Metric space:
1. d(x,y) = 0 iff x = y
2. d(x,y) = d(y,x)
3. d(x,y) ≤ d(x,z) + d(z,y)
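The three axioms can be checked mechanically on a finite point set. The sketch below is only an illustration: the helper names `is_metric`, `euclid` and `squared` and the sample points are assumptions, not part of the lecture, and the points are assumed to be pairwise distinct. It shows that Euclidean distance passes while squared Euclidean distance fails the triangle inequality.

```python
from itertools import product

def is_metric(points, d, tol=1e-9):
    """Check the three metric-space axioms on a finite set of distinct points."""
    for x, y in product(points, repeat=2):
        if (d(x, y) <= tol) != (x == y):        # 1. d(x,y) = 0 iff x = y
            return False
        if abs(d(x, y) - d(y, x)) > tol:        # 2. symmetry
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, y) > d(x, z) + d(z, y) + tol:   # 3. triangle inequality
            return False
    return True

def euclid(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def squared(p, q):
    return euclid(p, q) ** 2

pts = [(0, 0), (1, 0), (2, 0)]      # hypothetical sample points
print(is_metric(pts, euclid))       # Euclidean distance satisfies all three axioms here
print(is_metric(pts, squared))      # squared distance: d(p0,p2) = 4 > 1 + 1
```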

k-Clustering

Given points: P = p1, p2, …, pn

Many clustering variants:
⇒ k-Center
⇒ k-Median
⇒ k-Means
⇒ k-Medoids
⇒ Min-Cut Clustering
⇒ Spectral Clustering
⇒ Etc.

Example: 3 clusters

k-Center Clustering

Given points: P = p1, p2, …, pn

Assumptions:
⇒ Points are in a metric space: distances satisfy the triangle inequality.
⇒ (Think: Euclidean space)
⇒ The number of clusters k is given.

Goal:
⇒ Choose a set of k points (“centers”) that minimize the maximum distance to a center.

Example: 3 clusters

k-Median Clustering

Given points: P = p1, p2, …, pn

Assumptions:
⇒ Points are in a metric space: distances satisfy the triangle inequality.
⇒ (Think: Euclidean space)
⇒ The number of clusters k is given.

Goal:
⇒ Choose a set of k points (“centers”) that minimize the average distance to a center.
⇒ Equivalent: minimize the sum of the distances to the centers.

Example: 3 clusters
• Avg. dist.: 2
• Total dist.: 22

[Figure: points grouped into 3 clusters, with point-to-center distances labeled 2, 2, 1, 1, 3, 4, 3, 6.]

k-Median Clustering

Given points: P = p1, p2, …, pn

Facts:
• k-Median is NP-hard.
• In the Euclidean metric, there is a nearly linear time (1+𝜀)-approximation algorithm.
• In general:
  o Li-Svensson 2013: (1+√3+𝜀)-approximation
  o Byrka et al. 2015: (2.675+𝜀)-approximation
  o Improvements since then?

k-Median Clustering

Given points: P = p1, p2, …, pn

Find points C = c1, c2, …, ck in P that minimize:

$$D(P,C) = \sum_{i=1}^{n} \min_{c_j \in C} |p_i - c_j|$$

k-Median Clustering

Given points: P = p1, p2, …, pn

Find points C = c1, c2, …, ck in P and an assignment function c() that maps P → C, minimizing:

$$D(P,C) = \sum_{i=1}^{n} |p_i - c(i)|$$
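As a concrete reading of the objective D(P,C), here is a minimal Python sketch, assuming Euclidean points given as tuples. The function names `kmedian_cost` and `brute_force_kmedian` and the sample points are illustrative assumptions, not from the lecture. The brute-force search tries every k-subset of P, so it is exponential and only meant to make the definition concrete on tiny inputs (`math.dist` needs Python 3.8 or later).

```python
from itertools import combinations
import math

def kmedian_cost(points, centers, d=math.dist):
    """D(P, C): each point pays the distance to its nearest center."""
    return sum(min(d(p, c) for c in centers) for p in points)

def brute_force_kmedian(points, k, d=math.dist):
    """Try every k-subset of P as the center set (tiny inputs only)."""
    return min(combinations(points, k),
               key=lambda C: kmedian_cost(points, C, d))

# Hypothetical input: five points on a line, k = 2.
P = [(0, 0), (1, 0), (10, 0), (11, 0), (20, 0)]
C = brute_force_kmedian(P, 2)
print(C, kmedian_cost(P, C))
```

Choosing the nearest center for every point is exactly the assignment function c() that minimizes the sum for a fixed C.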


Approximate k-Median Clustering

Given points: P = p1, p2, …, pn

Let C* be the optimal clustering.

Clustering C is a γ-approximation if:

$$D(P,C) \le \gamma \cdot D(P,C^*)$$

Approximate k-Median Clustering

Given points: P = p1, p2, …, pn

Let C* be the optimal clustering with k centers. Clustering C is an (α,γ)-approximation if it has at most αk centers and:

$$D(P,C) \le \gamma \cdot D(P,C^*)$$

(2,2)-approximation

Example: 3 clusters
• Avg. dist.: 2
• Total dist.: 22

Example: 6 clusters
• Avg. dist.: 4
• Total dist.: 44

[Figure: the same points clustered twice. With 3 centers the labeled distances are 2, 2, 1, 1, 3, 4, 3, 6; with 6 centers they are 10, 4, 12, 8, 10.]
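The bicriteria definition can be checked with two comparisons. The sketch below plugs in the numbers from the example, under the assumption (not stated explicitly on the slide) that the 3-cluster solution with total distance 22 is optimal for k = 3. The function name `is_bicriteria_approx` is hypothetical.

```python
def is_bicriteria_approx(num_centers, cost, k, opt_cost, alpha, gamma):
    """True if a clustering with `num_centers` centers and total distance `cost`
    is an (alpha, gamma)-approximation, given that the optimal clustering
    with k centers has total distance `opt_cost`."""
    return num_centers <= alpha * k and cost <= gamma * opt_cost

# 6 centers, total distance 44, compared against k = 3 with optimum 22 (assumed).
print(is_bicriteria_approx(6, 44, 3, 22, alpha=2, gamma=2))   # 6 <= 2*3 and 44 <= 2*22
print(is_bicriteria_approx(6, 44, 3, 22, alpha=1, gamma=2))   # too many centers for alpha = 1
```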

Approximate k-Median Clustering

Given points: P = p1, p2, …, pn

Let C* be the optimal clustering with k centers. Clustering C is an (α,γ)-approximation if it has at most αk centers and:

$$D(P,C) \le \gamma \cdot D(P,C^*)$$

Today: (2,4)-approximation

Approximate k-Median Clustering

Integer Linear Program

Variables (intuition):
yj : Is point pj a cluster head?
xi,j : Is point pi assigned to center pj?

Example (points p1, p2, p3):
y1 = 0    x1,2 = 1
y2 = 1    x2,3 = 0
y3 = 1    x1,3 = 0

Approximate k-Median Clustering

Integer Linear Program

Variables (intuition):
yj : Is point pj a cluster head?
xi,j : Is point pi assigned to center pj?

ILP:

$$\min \sum_{i,j} x_{i,j}\, d(p_i, p_j)$$

subject to:

$$\forall i:\ \sum_j x_{i,j} = 1$$

$$\sum_j y_j \le k$$

$$\forall i,j:\ x_{i,j} \le y_j$$

$$\forall i,j:\ x_{i,j},\, y_j \in \{0,1\}$$


Approximate k-Median Clustering

Integer Linear Program

Claim 1: If x and y satisfy the constraints, then they describe a valid solution to the clustering problem.

Claim 2: If you have a valid clustering solution, you can choose x and y to satisfy the constraints.

Bad news: Solving Integer Linear Programs is NP-hard.
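Claim 2 can be made concrete by encoding a clustering as 0/1 variables and checking each ILP constraint. The following is a sketch under assumptions: points are identified by index 0..n-1, and the helper names `encode`, `satisfies_ilp` and `ilp_objective` as well as the distance matrix are hypothetical. The first instance mirrors the three-point example (p1 assigned to p2, with p2 and p3 as cluster heads).

```python
def encode(n, assignment, centers):
    """Build ILP variables from a clustering.
    assignment[i] is the index of the center that point i is assigned to."""
    y = [1 if j in centers else 0 for j in range(n)]
    x = [[1 if assignment[i] == j else 0 for j in range(n)] for i in range(n)]
    return x, y

def satisfies_ilp(x, y, k):
    n = len(y)
    assigned_once = all(sum(x[i]) == 1 for i in range(n))          # sum_j x_ij = 1
    at_most_k = sum(y) <= k                                        # sum_j y_j <= k
    only_heads = all(x[i][j] <= y[j] for i in range(n) for j in range(n))
    integral = all(v in (0, 1) for row in x for v in row) and all(v in (0, 1) for v in y)
    return assigned_once and at_most_k and only_heads and integral

def ilp_objective(x, dist):
    n = len(x)
    return sum(x[i][j] * dist[i][j] for i in range(n) for j in range(n))

dist = [[0, 1, 5], [1, 0, 4], [5, 4, 0]]         # hypothetical distances
x, y = encode(3, [1, 1, 2], {1, 2})              # 0-indexed version of the example
print(satisfies_ilp(x, y, k=2), ilp_objective(x, dist))

x_bad, y_bad = encode(3, [0, 1, 2], {1, 2})      # point 0 assigned to a non-center
print(satisfies_ilp(x_bad, y_bad, k=2))
```

The second instance violates x_ij ≤ y_j, which is the constraint that ties assignments to chosen cluster heads.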

Approximate k-Median Clustering

Relax: Linear Program

Good news: Relax! Replace the integrality constraints with [0,1] constraints.

LP:

$$\min \sum_{i,j} x_{i,j}\, d(p_i, p_j)$$

subject to:

$$\forall i:\ \sum_j x_{i,j} = 1$$

$$\sum_j y_j \le k$$

$$\forall i,j:\ x_{i,j} \le y_j$$

$$\forall i,j:\ 0 \le x_{i,j},\, y_j \le 1$$

• Can solve efficiently (in polynomial time) using an LP solver.
• If you have a valid clustering solution, you can choose x and y to satisfy the constraints.

Approximate k-Median Clustering

Relax: Linear Program

Good news: Relax! If you have a valid clustering solution, you can choose x and y to satisfy the constraints.

If C is an optimal (fractional) solution to the LP, and C* is the optimal (integral) solution, then:

$$D(C,P) \le D(C^*,P)$$

The LP solution is no worse than the optimal clustering! Maybe better than optimal!

Approximate k-Median Clustering

Relax: Linear Program

Bad news: the solution is fractional. If x and y satisfy the constraints, then they may NOT be a valid solution to the clustering problem.

Variables (intuition):
yj : Is point pj a cluster head?
xi,j : Is point pi assigned to center pj?

Example (points p1, p2, p3):
y1 = 0.5    x1,2 = 0.5
y2 = 0.5    x2,3 = 0
y3 = 1      x1,3 = 0.5

Approximate k-Median Clustering

Relax: Linear Program

Solution: round to integers. If x and y satisfy the constraints, then maybe we can round the variables in a way that does not increase the cost too much.

Rounding the k-Median LP

Step 1: What is the cost?

LP minimizes:

$$\min \sum_{i,j} x_{i,j}\, d(p_i, p_j)$$

Define the cost of pi:

$$C_i = \sum_j x_{i,j}\, d(p_i, p_j)$$

so the LP objective is:

$$\min \sum_i C_i$$

Goal: round in a way that does not increase cost too much!

Goal after rounding: construct C’ s.t. for every point pj:

$$C'_j \le 4 C_j$$

Rounding the k-Median LP

Step 2: Sort by cost.

Notice: smallest cost is hardest to round. (Most risk that it will increase too much.)

Step 3: Add the point pj with smallest cost Cj to our set of centers.

S = {pj}

Rounding the k-Median LP

Step 4: If pi is within distance 4Cj of pj, then we can delete it.

S = {pj}

⇒ pi is already close enough to a center in our solution:

$$C'_i \le d(p_i, p_j) \le 4C_j \le 4C_i$$

Recall: Cj was the smallest.

It is in fact enough for pi to be within distance 4Ci of pj:

$$C'_i \le d(p_i, p_j) \le 4C_i$$

Rounding the k-Median LP

Step 4: If there is some point q where:

$$d(p_i, q) \le 2C_i \quad\text{and}\quad d(p_j, q) \le 2C_j$$

then we can delete pi.

⇒ pi is already close enough to a center in our solution:

$$C'_i \le d(p_i, p_j) \le d(p_i, q) + d(q, p_j) \le 2C_i + 2C_j \le 2C_i + 2C_i = 4C_i$$

Recall: Cj was the smallest.

[Figure: two points with balls of radius 2C around them; the balls share a point q.]

Rounding the k-Median LP

Step 4: If there is some point q where:

$$d(p_i, q) \le 2C_i \quad\text{and}\quad d(p_j, q) \le 2C_j$$

then we can delete pi.

$$V(j) = \{\, p_i \mid \exists q:\ d(p_i, q) \le 2C_i,\ d(p_j, q) \le 2C_j \,\}$$

⇒ All points in V(j) are close enough to pj that we can delete them.

Rounding the k-Median LP

Step 5: Repeat until all points are deleted.

Rounding Algorithm:
1. S = {}
2. Repeat until all points are deleted:
   • Let pj be the remaining point with minimum Cj.
   • Add pj to S.
   • Delete all points in V(j).
3. Return S.

Where did we use the LP solution??
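The rounding loop can be sketched in a few lines of Python. This is an illustration under assumptions, not the lecture's code: the function name `round_lp` is hypothetical, points are indices into a distance matrix, q ranges over all input points, and the fractional solution x below is written by hand (it is feasible with every y_j = 0.5, but it is not the output of an LP solver). The LP solution enters only through the per-point costs C_i.

```python
def round_lp(dist, x):
    """Round a fractional k-median LP solution x (n x n) to a list of centers S."""
    n = len(dist)
    # C[i] = sum_j x[i][j] * d(p_i, p_j): fractional cost of point i.
    C = [sum(x[i][j] * dist[i][j] for j in range(n)) for i in range(n)]
    remaining = set(range(n))
    S = []
    while remaining:
        # Remaining point with minimum cost (ties broken by index).
        j = min(remaining, key=lambda i: (C[i], i))
        S.append(j)
        # V(j): remaining points p_i for which some q has
        # d(p_i, q) <= 2*C_i and d(p_j, q) <= 2*C_j.  p_j itself is always in V(j).
        V = {i for i in remaining
             if any(dist[i][q] <= 2 * C[i] and dist[j][q] <= 2 * C[j]
                    for q in range(n))}
        remaining -= V
    return S, C

# Hypothetical instance: 4 points on a line at 0, 1, 10, 11, with k = 2.
coords = [0, 1, 10, 11]
dist = [[abs(a - b) for b in coords] for a in coords]
k = 2
# Each point is half-assigned to itself and half to its neighbor.
x = [[0.5, 0.5, 0.0, 0.0],
     [0.5, 0.5, 0.0, 0.0],
     [0.0, 0.0, 0.5, 0.5],
     [0.0, 0.0, 0.5, 0.5]]
S, C = round_lp(dist, x)
rounded = [min(dist[i][s] for s in S) for i in range(len(coords))]
print(S, C, rounded)
```

On this instance every point ends up within 4·C_i of a chosen center, and the number of centers stays within 2k.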

Rounding the k-Median LP

Claim: For all j:

$$C'_j \le 4C_j$$

Here C’j is the cost of pj computed using the centers in S.


Rounding the k-Median LP

Claim: For all j:

$$C'_j \le 4C_j$$

⇒ Summing over all points:

$$d(C', P) \le 4\, d(C, P) \le 4\, d(C^*, P)$$

Rounding the k-Median LP

Remaining problem: How many centers are added to S?

Claim: At most 2k centers are added to S.

Rounding the k-Median LP

Key lemma: If pi is added to S, then:

$$\sum_{j \in V(i)} y_j \ge 1/2$$

Stated in terms of distances:

$$\sum_{j\,:\,d(p_i,p_j) \le 2C_i} y_j \ge 1/2$$

⇒ Since the y’s sum to at most k and the sets V(j) are disjoint, we cannot add more than 2k points to S.

Subtle point: symmetry! If adding pi deletes pj, then adding pj deletes pi.

Rounding the k-Median LP

Key lemma: If pi is added to S, then:

$$\sum_{j\,:\,d(p_i,p_j) \le 2C_i} y_j \ge 1/2$$

Observation 1:

$$\sum_{j\,:\,d(p_i,p_j) \le 2C_i} y_j \ \ge \sum_{j\,:\,d(p_i,p_j) \le 2C_i} x_{i,j}$$

This follows from the LP constraint $\forall i,j:\ x_{i,j} \le y_j$.

Rounding the k-Median LP

Observation 2:

$$C_i = \sum_j x_{i,j}\, d(p_i, p_j)$$

Ci = “average” distance from pi to a center.

Let Z be a random variable that equals d(pi,pj) with probability xi,j. Then:

$$E[Z] = C_i$$

and, by Markov's inequality:

$$\sum_{j\,:\,d(p_i,p_j) \le 2C_i} x_{i,j} = \Pr(Z \le 2C_i) = 1 - \Pr(Z > 2C_i) = 1 - \Pr(Z > 2E[Z]) \ge 1 - 1/2 = 1/2$$

Rounding the k-Median LP

Key lemma: If pi is added to S, then:

$$\sum_{j\,:\,d(p_i,p_j) \le 2C_i} y_j \ge 1/2$$

Conclusion:

$$\sum_{j\,:\,d(p_i,p_j) \le 2C_i} y_j \ \ge \sum_{j\,:\,d(p_i,p_j) \le 2C_i} x_{i,j} \ \ge 1/2$$
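The Markov step can be checked numerically on a single row of a fractional solution. In this sketch the function name `mass_within_twice_cost` and the numbers are hypothetical, chosen only so that the arithmetic is easy to follow: the row x_i sums to 1, C_i is its expected distance, and the mass placed on centers within distance 2·C_i is at least 1/2.

```python
def mass_within_twice_cost(x_row, d_row):
    """For one point p_i, return (C_i, sum of x_ij over j with d(p_i, p_j) <= 2*C_i)."""
    C_i = sum(x * d for x, d in zip(x_row, d_row))
    near = sum(x for x, d in zip(x_row, d_row) if d <= 2 * C_i)
    return C_i, near

x_row = [0.5, 0.25, 0.25]     # fractional assignment of p_i (sums to 1)
d_row = [0.0, 2.0, 8.0]       # distances d(p_i, p_j) to the three candidates
C_i, near = mass_within_twice_cost(x_row, d_row)
print(C_i, near)              # C_i = 0.25*2 + 0.25*8 = 2.5, mass within distance 5 is 0.75
```

Only the candidate at distance 8 lies outside the ball of radius 2·C_i = 5, and it carries mass 0.25, which is below the 1/2 that Markov's inequality allows.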

Rounding the k-Median LP

Key lemma: If pi is added to S, then:

$$\sum_{j\,:\,d(p_i,p_j) \le 2C_i} y_j \ge 1/2$$

⇒ Fact: the yj’s sum to ≤ k.
⇒ Fact: the sets V(i) are disjoint.
⇒ Fact: For each pi added to S, we delete points with yj’s that sum to at least ½.
⇒ Cannot add more than 2k points to S.

Approximate k-Median Clustering

Given points: P = p1, p2, …, pn

Today: (2,4)-approximation
• Give Integer Linear Program (ILP).
• Relax to Linear Program (LP).
• Solve LP.
• Round (carefully).

Weighted k-Median Clustering

Given points: P = p1, p2, …, pn

Given weights: w1, w2, …, wn

Find points C = c1, c2, …, ck in P and an assignment function c() that maps P → C, minimizing:

$$D(P,C) = \sum_{i=1}^{n} w_i\, |p_i - c(i)|$$

Exercise:

Show how to adapt the approximate k-median algorithm to give a (2,4)-approximate solution for the weighted k-Median clustering problem.
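To make the weighted objective concrete, here is a small sketch of the cost function only (it does not solve the exercise). The name `weighted_kmedian_cost` and the sample data are assumptions; each point is charged its weight times the distance to its nearest center.

```python
import math

def weighted_kmedian_cost(points, weights, centers, d=math.dist):
    """D(P, C) = sum_i w_i * d(p_i, c(i)), with c(i) the nearest center to p_i."""
    return sum(w * min(d(p, c) for c in centers)
               for p, w in zip(points, weights))

P = [(0, 0), (3, 0), (10, 0)]     # hypothetical points
w = [1, 2, 4]                     # hypothetical weights
C = [(0, 0), (10, 0)]
print(weighted_kmedian_cost(P, w, C))   # only (3,0) is off-center: 2 * 3

# With integer weights, the cost equals the unweighted cost of w_i copies of p_i.
copies = [p for p, wi in zip(P, w) for _ in range(wi)]
print(weighted_kmedian_cost(copies, [1] * len(copies), C))
```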

Streaming Data

Data arrives in a stream: S = s1, s2, …, sT

Each sj is a point.
⇒ Each point shows up exactly once.
⇒ Points show up in an arbitrary (worst-case) order.

Example in Euclidean plane:
S = (17,3), (1,7), (15,1), (4,1), (3,19), (1,1), (2,1)

At end of stream: output k cluster centers.

Streaming Data

Data arrives in a stream: S = s1, s2, …, sT

Memory: $O(\sqrt{nk})$

Goal: (2, O(1))-approximation

Warning: Today, the approximation ratio is going to be large.

Streaming k-Median

Core-Set Algorithm

S = ∅
repeat $\sqrt{n/k}$ times:
1. Let P = next $\sqrt{nk}$ points.
2. Find (2,4)-approximate clustering on P.
3. Add 2k new cluster centers to S. Weight each cluster center with # of points attached to it.
4. Empty P.

Return (2,4)-approximate (weighted) clustering on S.
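The control flow of the core-set algorithm can be sketched as follows. This is an illustration under loud assumptions: the `cluster` function is a brute-force stand-in that returns the best set of at most 2k centers, not the LP-rounding (2,4)-approximation from the lecture, and it is exponential, so the sketch only runs on tiny inputs. All function names and the sample stream are hypothetical.

```python
import math
from itertools import combinations, islice

def cost(points, weights, centers):
    return sum(w * min(math.dist(p, c) for c in centers)
               for p, w in zip(points, weights))

def cluster(points, weights, m):
    """Stand-in for the (2,4)-approximation: brute force over sets of m centers."""
    m = min(m, len(points))
    return list(min(combinations(points, m),
                    key=lambda C: cost(points, weights, C)))

def streaming_kmedian(stream, n, k):
    block = max(1, math.isqrt(n * k))          # about sqrt(nk) points per segment
    it = iter(stream)
    S, W = [], []                              # weighted core set
    while True:
        P = list(islice(it, block))            # 1. next block of points
        if not P:
            break
        T = cluster(P, [1] * len(P), 2 * k)    # 2. up to 2k centers for this block
        counts = {c: 0 for c in T}
        for p in P:                            # 3. weight = number of attached points
            counts[min(T, key=lambda c: math.dist(p, c))] += 1
        S.extend(counts)
        W.extend(counts.values())
        # 4. P is discarded when the loop continues.
    return cluster(S, W, 2 * k)                # final weighted clustering on S

# Hypothetical stream: 16 points alternating between two far-apart groups, k = 1.
stream = [(i if i % 2 == 0 else 100 + i, 0) for i in range(16)]
centers = streaming_kmedian(stream, n=16, k=1)
print(centers)
```

At any time the sketch holds one block of about √(nk) points plus the weighted core set, which is the source of the space bound.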

Streaming k-Median

Core-Set Algorithm

[Figure: the data stream containing n elements is split into segments S1, …, St of √(nk) elements each. A (2,4)-approximate k-median on each segment outputs 2k centers. Together these form the core set: 2√(nk) centers at the intermediate level. A (2,4)-approximate weighted k-median on the core set outputs the final 2k centers.]

Space: $O(\sqrt{nk})$
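The space bound follows from a short count, sketched here under the assumption that the clustering subroutine itself uses space linear in its input. With t segments of √(nk) points each:

```latex
\[
t = \sqrt{n/k}, \qquad |S_i| = \sqrt{nk}, \qquad
t \cdot |S_i| = \sqrt{n/k} \cdot \sqrt{nk} = n,
\]
\[
|S_w| \le 2k \cdot t = 2k \sqrt{n/k} = 2\sqrt{nk}, \qquad
\text{space} \le |S_i| + |S_w| = \sqrt{nk} + 2\sqrt{nk} = O(\sqrt{nk}).
\]
```

The first line confirms that the segments cover the whole stream; the second counts the intermediate centers that must be kept alongside the current segment.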

Streaming k-Median

Core-Set Algorithm

Claims:

Claim 1: Space is at most $O(\sqrt{nk})$.

Claim 2: The output is at most 2k centers.

By construction.

Claim 3: The output is a (2,80)-approximation for k-Median.

Streaming k-Median

Core-Set Algorithm

Notation:
1: Substream Si is the ith segment of the stream.
2: Points Ti are the 2k centers output by the ith (2,4)-approximation.
3: Sw are the weighted points, and w are the weights, used for the final (2,4)-approximation.
4: Points T are the final output of the algorithm.

Streaming k-Median

Core-Set Algorithm

Lemma:

$$d(S,T) \le \sum_{i=1}^{t} d(S_i, T_i) + d(S_w, T)$$

Interpretation: We can bound the final distances by two parts:
(1) the distance of a point to the intermediate clustering, and
(2) the distance of the intermediate clustering to the final clustering.

Streaming k-Median

Core-Set Algorithm

Proof:

$$
\begin{aligned}
d(S, T) &= \sum_{i=1}^{t} \sum_{x \in S_i} d(x, T) \\
&\le \sum_{i=1}^{t} \sum_{x \in S_i} \left[ d(x, t_i(x)) + d(t_i(x), T) \right] \\
&= \sum_{i=1}^{t} d(S_i, T_i) + \sum_{i=1}^{t} \sum_{x \in S_i} d(t_i(x), T) \\
&= \sum_{i=1}^{t} d(S_i, T_i) + \sum_{i=1}^{t} \sum_{j=1}^{2k} |S_{ij}| \, d(t_{ij}, T) \\
&= \sum_{i=1}^{t} d(S_i, T_i) + d(S_w, T)
\end{aligned}
$$

Line 1: Definition of $d(S, T)$. Variables $x$ range over all points in the set.

Line 2: Triangle inequality. Point $t_i(x)$ is the center assigned to $x$ in the intermediate coreset, where $x$ is a point in segment $S_i$ of the stream.

Line 3: Definition of $d(S_i, T_i)$.

Line 4: Iterate over all centers in the coreset. Count how many times each is included in the sum. Point $t_{ij}$ is one of the 2k points in the coreset for the $i$-th segment, and $S_{ij}$ is the set of points in $S_i$ assigned to $t_{ij}$.

Line 5: Definition of $d(S_w, T)$. The weight of center $t_{ij}$ is $w(t_{ij}) = |S_{ij}|$.
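The lemma only uses the triangle inequality, so it should hold for any choice of intermediate centers and final centers. The following is a minimal numerical sketch of that claim, assuming 1-D points with distance |a - b|, three segments of 20 points, and randomly chosen centers. The names `nearest` and `cost` and all the sizes are illustrative choices, not part of the lecture.

```python
import random

def nearest(p, centers):
    """Return the center closest to p (1-D points, distance |p - c|)."""
    return min(centers, key=lambda c: abs(p - c))

def cost(points, centers, weights=None):
    """d(points, centers): (weighted) sum of distances to the nearest center."""
    if weights is None:
        weights = [1] * len(points)
    return sum(w * abs(p - nearest(p, centers)) for p, w in zip(points, weights))

random.seed(0)
S = [random.uniform(0, 100) for _ in range(60)]
segments = [S[i:i + 20] for i in range(0, len(S), 20)]  # S_1, S_2, S_3

intermediate = 0.0   # running value of sum_i d(S_i, T_i)
Sw, w = [], []       # weighted coreset points and their weights
for Si in segments:
    Ti = random.sample(Si, 4)          # arbitrary intermediate centers T_i
    intermediate += cost(Si, Ti)
    for t in Ti:
        Sw.append(t)
        # weight of t = number of points of S_i assigned to t
        w.append(sum(1 for x in Si if nearest(x, Ti) == t))

T = random.sample(Sw, 2)               # arbitrary final centers T
lhs = cost(S, T)                       # d(S, T)
rhs = intermediate + cost(Sw, T, w)    # sum_i d(S_i, T_i) + d(S_w, T)
print(lhs <= rhs + 1e-9)
```

Because the centers here are arbitrary, the check exercises the lemma itself rather than any approximation guarantee.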


Streaming k-Median

Core-Set Algorithm

Lemma:

$$d(S, T) \le \sum_{i=1}^{t} d(S_i, T_i) + d(S_w, T)$$

Goal:

$$d(S, T) \le 80 \, d(S, C^*)$$

Streaming k-Median

Core-Set Algorithm

Useful fact:

$$\min_{T' \subseteq S'} d(S', T') \le 2 \min_{T' \subseteq A} d(S', T')$$

Where $A$ is some larger set of all possible points in the metric space, and $S'$ is an arbitrary subset of $A$.

Interpretation: To cluster $S'$, we can focus on points in $S'$ (and only lose a factor of 2). We don't need centers not in $S'$.

Streaming k-Median

Core-Set Algorithm

Useful fact:

$$\min_{T' \subseteq S'} d(S', T') \le 2 \min_{T' \subseteq A} d(S', T')$$

Proof: Triangle inequality. Let $T'$ be the optimal solution in $A$. Let $t$ be some point in $T'$ that is not in $S'$, let $t'$ be the closest point in $S'$ to $t$, and let $s$ be some other point in $S'$. We can replace $t$ with $t'$ because:

$$d(s, t') \le d(s, t) + d(t, t') \le d(s, t) + d(s, t) = 2 d(s, t)$$

The second inequality holds because $t'$ is the closest point in $S'$ to $t$, so $d(t, t') \le d(t, s)$.
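A small brute-force check can make the useful fact concrete. The sketch below assumes 2-D Euclidean points, k = 1, and a hand-picked example in which the best center (the middle of a square) is not one of the input points. The helper names `cost` and `best_cost` and the point sets are illustrative assumptions.

```python
import math
from itertools import combinations

def cost(points, centers):
    """k-median cost: sum of distances from each point to its nearest center."""
    return sum(min(math.dist(p, c) for c in centers) for p in points)

def best_cost(points, candidates, k):
    """Cheapest cost over all k-subsets of the candidate centers (brute force)."""
    return min(cost(points, C) for C in combinations(candidates, k))

S_prime = [(0, 0), (2, 0), (0, 2), (2, 2)]   # the points to cluster
A = S_prime + [(1, 1)]                       # larger set: also allows the middle
k = 1

restricted = best_cost(S_prime, S_prime, k)  # centers must come from S'
unrestricted = best_cost(S_prime, A, k)      # centers may come from A
print(restricted, unrestricted)
```

Here the restricted optimum is strictly worse than the unrestricted one, but by less than the factor of 2 that the fact allows.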


Streaming k-Median

Core-Set Algorithm

Lemma:

$$\sum_{i=1}^{t} d(S_i, T_i) \le 8 \, d(S, C^*)$$

Interpretation: We can bound the distances to the coreset by the optimal clustering.

Streaming k-Median

Core-Set Algorithm

Proof:

$$
\begin{aligned}
\sum_{i=1}^{t} d(S_i, T_i) &\le \sum_{i=1}^{t} 4 \min_{T' \subseteq S_i} d(S_i, T') \\
&\le \sum_{i=1}^{t} 8 \min_{T' \subseteq P} d(S_i, T') \\
&\le \sum_{i=1}^{t} 8 \, d(S_i, C^*) \\
&= 8 \, d(S, C^*)
\end{aligned}
$$

Line 1: Because we use a (2,4)-approximation algorithm to compute the coreset.

Line 2: Because we only lose a factor of two going to a larger set of points.

Line 3: By definition of the optimal clustering.

Line 4: By summing over all points.


Streaming k-Median

Core-Set Algorithm

Lemma:

$$d(S_w, T) \le 8 \sum_{i=1}^{t} d(S_i, T_i) + 8 \, d(S, C^*)$$

Interpretation: We can bound the cost of the second part…

Streaming k-Median

Core-Set Algorithm

Part 1:

$$d(S_w, C^*) \le \sum_{i=1}^{t} d(S_i, T_i) + d(S, C^*)$$

$$
\begin{aligned}
d(S_w, C^*) &= \sum_{i,j} |S_{i,j}| \, d(t_{i,j}, C^*) \\
&\le \sum_{i,j} \sum_{x \in S_{i,j}} \left[ d(t_{i,j}, x) + d(x, t^*(x)) \right] \\
&= \sum_{i} \sum_{x \in S_i} \left[ d(t_i(x), x) + d(x, t^*(x)) \right] \\
&= \sum_{i=1}^{t} d(S_i, T_i) + d(S, C^*)
\end{aligned}
$$

Here $t^*(x)$ is the center in $C^*$ assigned to $x$.

Line 1: Definition of weighted...

Line 2: Sum over $S_{i,j}$ and use triangle inequality.

Line 3: Simplify enumeration over all points in coreset.

Line 4: Definition…

Streaming k-Median

Core-Set Algorithm

Part 1:

$$d(S_w, C^*) \le \sum_{i=1}^{t} d(S_i, T_i) + d(S, C^*)$$

Part 2:

$$d(S_w, T) \le 8 \, d(S_w, C^*)$$

Conclusion:

$$d(S_w, T) \le 8 \sum_{i=1}^{t} d(S_i, T_i) + 8 \, d(S, C^*)$$

Streaming k-Median

Core-Set Algorithm

Part 1:

$$d(S_w, C^*) \le \sum_{i=1}^{t} d(S_i, T_i) + d(S, C^*)$$

Part 2:

$$
\begin{aligned}
d(S_w, T) &\le 4 \min_{T' \subseteq S_w} d(S_w, T') \\
&\le 8 \min_{T' \subseteq P} d(S_w, T') \\
&\le 8 \, d(S_w, C^*)
\end{aligned}
$$

Line 1: Because we used a 4-approximation algorithm.

Line 2: Because using points in $S_w$ only loses a factor of 2.

Line 3: By definition…

Hence:

$$d(S_w, T) \le 8 \, d(S_w, C^*)$$

Streaming k-Median

Core-Set Algorithm

Add it all up:

$$
\begin{aligned}
d(S, T) &\le \sum_{i=1}^{t} d(S_i, T_i) + d(S_w, T) \\
&\le 8 \, d(S, C^*) + d(S_w, T) \\
&\le 8 \, d(S, C^*) + 8 \sum_{i=1}^{t} d(S_i, T_i) + 8 \, d(S, C^*) \\
&\le 8 \, d(S, C^*) + 8 \left( 8 \, d(S, C^*) \right) + 8 \, d(S, C^*) \\
&= 80 \, d(S, C^*)
\end{aligned}
$$

Streaming k-Median

Core-Set Algorithm

[Figure: the data stream containing n elements is split into segments $S_1, \ldots, S_t$ of $\sqrt{nk}$ elements each. Each segment is passed to a (2,4)-approximate k-median, which outputs 2k centers. The resulting $2\sqrt{nk}$ centers at the intermediate level are passed to a (2,4)-approximate weighted k-median, which outputs the final 2k centers.]

Space: $O(\sqrt{nk})$

Approximation: (2, 80)
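The two-level pipeline in the figure can be sketched in a few lines. This is only an outline of the data flow, assuming 1-D points with distance |a - b|. The subroutine `farthest_first` is a placeholder that is NOT a (2,4)-approximation and ignores weights; a real implementation would plug in the LP-based algorithm from earlier in the lecture through the hypothetical `approx` parameter. All function names here are illustrative.

```python
import math
import random

def nearest(p, centers):
    """Center closest to p (1-D points, distance |p - c|)."""
    return min(centers, key=lambda c: abs(p - c))

def farthest_first(points, weights, m):
    """Placeholder subroutine: ignores weights, NOT a (2,4)-approximation."""
    centers = [points[0]]
    while len(centers) < min(m, len(points)):
        centers.append(max(points, key=lambda p: abs(p - nearest(p, centers))))
    return centers

def summarize(points, weights, k, approx):
    """Replace weighted points by at most 2k centers carrying the absorbed weight."""
    centers = approx(points, weights, 2 * k)
    tally = dict.fromkeys(centers, 0)
    for p, wt in zip(points, weights):
        tally[nearest(p, centers)] += wt
    return list(tally), list(tally.values())

def coreset_k_median(stream, n, k, approx=farthest_first):
    """Two-level core-set scheme; assumes n >= 1 and n is known in advance."""
    seg = max(1, math.isqrt(n * k))     # segment size sqrt(nk)
    Sw, w, buf = [], [], []             # intermediate weighted points, buffer
    for x in stream:
        buf.append(x)
        if len(buf) == seg:             # segment S_i complete: keep 2k centers
            c, cw = summarize(buf, [1] * len(buf), k, approx)
            Sw += c
            w += cw
            buf = []
    if buf:                             # last, possibly shorter, segment
        c, cw = summarize(buf, [1] * len(buf), k, approx)
        Sw += c
        w += cw
    T, _ = summarize(Sw, w, k, approx)  # final weighted clustering
    return T, Sw, w

random.seed(1)
n, k = 400, 4
points = [random.uniform(0, 1000) for _ in range(n)]
T, Sw, w = coreset_k_median(points, n, k)
print(len(T), len(Sw), sum(w))
```

At any time the sketch holds one segment of about $\sqrt{nk}$ points plus the intermediate centers, which matches the space bound on the slide; the approximation guarantee depends entirely on the subroutine that is plugged in.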

Core Set Algorithm

Question:

What if you want less space?
• Increase segment size?
• Decrease number of coresets?

Idea: hierarchical construction!

Hierarchical Construction

[Figure: the data stream containing n elements is split into segments $S_1, S_2, \ldots, S_6, \ldots, S_t$. Each segment is reduced to 2k centers, and groups of these 2k-center sets are merged, level by level, into new sets of 2k centers.]

Core Set Algorithm

Algorithm idea:

Define $m = n^{\epsilon}$.

Whenever you see m elements in the stream:
• Run the (2,4)-approximation → 2k centers.
• Store the 2k new centers in level 1.

Whenever you have m sets of centers in level j:
• Run the (2,4)-approximation → 2k centers.
• Store the 2k new centers in level j+1.
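The two rules above can be sketched as a cascade of buffers, one per level. The code below is a minimal illustration assuming 1-D points with distance |a - b|. As a loud caveat, `farthest_first` is a placeholder that is NOT a (2,4)-approximation; the names `summarize`, `push`, and `hierarchical` are hypothetical helpers, and the final merge of leftover sets at the end of the stream is an assumption about how to finish, since the slides do not spell it out.

```python
def nearest(p, centers):
    """Center closest to p (1-D points, distance |p - c|)."""
    return min(centers, key=lambda c: abs(p - c))

def farthest_first(points, weights, m):
    """Placeholder subroutine: ignores weights, NOT a (2,4)-approximation."""
    centers = [points[0]]
    while len(centers) < min(m, len(points)):
        centers.append(max(points, key=lambda p: abs(p - nearest(p, centers))))
    return centers

def summarize(weighted, k, approx):
    """Reduce (point, weight) pairs to at most 2k weighted centers."""
    pts = [p for p, _ in weighted]
    centers = approx(pts, [wt for _, wt in weighted], 2 * k)
    tally = dict.fromkeys(centers, 0)
    for p, wt in weighted:
        tally[nearest(p, centers)] += wt
    return list(tally.items())

def push(levels, j, center_set, m, k, approx):
    """Store a center set at level index j; when m sets pile up, merge upward."""
    if j == len(levels):
        levels.append([])
    levels[j].append(center_set)
    if len(levels[j]) == m:
        merged = [pw for s in levels[j] for pw in s]
        levels[j] = []
        push(levels, j + 1, summarize(merged, k, approx), m, k, approx)

def hierarchical(stream, m, k, approx=farthest_first):
    raw, levels = [], []      # levels[j] holds the center sets of level j+1
    for x in stream:
        raw.append((x, 1))
        if len(raw) == m:     # m new stream elements: summarize into level 1
            push(levels, 0, summarize(raw, k, approx), m, k, approx)
            raw = []
    # End of stream (assumed non-empty): merge whatever is left at every level.
    leftovers = raw + [pw for lvl in levels for s in lvl for pw in s]
    return summarize(leftovers, k, approx), levels

result, levels = hierarchical(range(1000), m=10, k=2)
print(result, [len(lvl) for lvl in levels])
```

Each level never holds more than m sets of at most 2k centers, which is the property the space analysis relies on.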


Core Set Algorithm

Algorithm idea:

Define $m = n^{\epsilon}$.

Tree with fan-out m has how many levels?

$$\log_m n = \frac{\log n}{\log m} = \frac{\log n}{\log n^{\epsilon}} = \frac{1}{\epsilon}$$

Space usage:

$$\left( \frac{1}{\epsilon} \right) (m) (2k) = \frac{2k n^{\epsilon}}{\epsilon}$$

Store at most m sets of centers for each level of the tree.
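To get a feel for the numbers, the level count and space bound can be evaluated for concrete parameters. This is a small sketch; the function name `hierarchy_parameters` and the sample values of n, ε, and k are illustrative assumptions, and the level count is computed with integer arithmetic to avoid floating-point rounding in the logarithms.

```python
def hierarchy_parameters(n, eps, k):
    """Return (m, levels, space) for fan-out m = n^eps."""
    m = max(2, round(n ** eps))      # fan-out of the tree
    levels, capacity = 0, 1
    while capacity < n:              # smallest number of levels with m^levels >= n
        capacity *= m
        levels += 1
    space = levels * m * 2 * k       # m sets of 2k centers per level
    return m, levels, space

m, levels, space = hierarchy_parameters(n=10**6, eps=0.5, k=10)
print(m, levels, space)
```

With these sample values the fan-out is 1000, the tree has 1/ε = 2 levels, and the bound is 2 · 1000 · 20 = 40000 stored centers, compared with a stream of a million points.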

Core Set Algorithm

Algorithm idea:

Define $m = n^{\epsilon}$.

Approximation factor?

Lemma:

$$d(S_w, T) \le 8 \sum_{i=1}^{t} d(S_i, T_i) + 8 \, d(S, C^*)$$

Interpretation: We can bound the cost of level 1 by 8 times level 0…

Similarly: We can bound the cost of level 2 by 8 times level 1… We can bound the cost of level (1/ε) by 8 times level (1/ε) − 1.

Approximation factor:

$$O(8^{1/\epsilon})$$

Hierarchical Construction

[Figure: the hierarchical construction again: segments $S_1, \ldots, S_t$ of the data stream containing n elements, each reduced to 2k centers and merged level by level.]

Space: $O(k n^{\epsilon} / \epsilon)$

Approximation: $(2, O(8^{1/\epsilon}))$

k-Center Clustering

Given points: P = p1, p2, …, pn

Assumptions:
⇒ Points are in a metric space: distances satisfy triangle inequality.
⇒ (Think: Euclidean space)
⇒ The number of clusters k is given.

Goal:
⇒ Choose a set of k points ("centers") that minimize the maximum distance to a center.

Example: 3 clusters

k-Center Approximation Algorithm

Show that this is a 2-approximation:

1. T = {x} for any x in P.
2. Repeat until |T| = k:
• Let z be the point in P that maximizes d(z, T).
• Add z to T.
3. Return T.

Claim: cost(P, T) ≤ 2 cost(P, C*)

cost(P, T) is the maximum distance of any point in P to the set T.
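The greedy algorithm above is short enough to try directly. The sketch below assumes 1-D points with distance |a - b| and a small hand-picked input; `brute_force_opt` restricts optimal centers to input points, which can only make the optimum larger, so the 2-approximation claim still implies the printed comparison. Function names are illustrative.

```python
from itertools import combinations

def cost(points, centers):
    """k-center cost: maximum distance of any point to its nearest center."""
    return max(min(abs(p - c) for c in centers) for p in points)

def greedy_k_center(points, k):
    """Farthest-first traversal: repeatedly add the point farthest from T."""
    T = [points[0]]                       # step 1: any starting point
    while len(T) < k:                     # step 2: repeat until |T| = k
        z = max(points, key=lambda p: min(abs(p - c) for c in T))
        T.append(z)
    return T                              # step 3

def brute_force_opt(points, k):
    """Best cost over all k-subsets of the input points (small inputs only)."""
    return min(cost(points, C) for C in combinations(points, k))

points = [0, 1, 2, 10, 11, 12, 50, 51, 100]
T = greedy_k_center(points, 3)
greedy_cost = cost(points, T)
opt_cost = brute_force_opt(points, 3)
print(T, greedy_cost, opt_cost)
```

On this input the greedy centers are 0, 100, and 50, with cost 12 against a brute-force optimum of 10.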

k-Center Clustering

Some useful things to prove:

If x is the farthest point from T at the end (at distance r):
⇒ Every pair of points in T ∪ {x} is at distance at least r from each other.
⇒ Every other point is at distance ≤ r from T.

If C* is an optimal clustering:
⇒ At least two points in T ∪ {x} are assigned to the same center.
⇒ So the center must be at least distance r/2 from one of them.
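Putting these observations together gives the 2-approximation claim. The following is a short derivation sketch, where $u$ and $v$ denote the two points of $T \cup \{x\}$ that share an optimal center $c \in C^*$ (these names are introduced here for illustration):

$$
\begin{aligned}
\mathrm{cost}(P, T) = r &\le d(u, v) \\
&\le d(u, c) + d(c, v) \\
&\le 2 \, \mathrm{cost}(P, C^*)
\end{aligned}
$$

The first inequality uses that all $k+1$ points of $T \cup \{x\}$ are pairwise at distance at least $r$; the pair $u, v$ exists by the pigeonhole principle, since $k+1$ points are assigned to only $k$ centers; the last step uses that each of $u$ and $v$ is within $\mathrm{cost}(P, C^*)$ of its assigned center $c$.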

Streaming k-Center Clustering

Show that this is an 8-approximation:

Assume minimum distance between points is 1.

T = first k points in stream.
R = 1
Repeat until end of stream:

1. While |T| ≤ k:
• Get new point x.
• If d(x, T) > 2R, then add x to T.
2. T' = ∅.
3. While some z in T has d(z, T') > 2R: add z to T'. (Rebuild T' here.)
4. T = T'
5. R = 2R (Double R.)
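The doubling algorithm can be written out as follows. This is a minimal sketch assuming 1-D points with distance |a - b|, k ≥ 1, and distinct points at least distance 1 apart, as on the slide. Step 3 is implemented as a single pass over T, which leaves no z with d(z, T') > 2R because T' only grows; the function names and the sample stream are illustrative.

```python
def dist_to_set(x, centers):
    """d(x, T); infinite when T is empty."""
    return min((abs(x - c) for c in centers), default=float("inf"))

def streaming_k_center(stream, k):
    """Doubling algorithm; returns the centers T and the final radius R."""
    end = object()                      # sentinel marking the end of the stream
    it = iter(stream)
    T = []
    for x in it:                        # T = first k points in the stream
        T.append(x)
        if len(T) == k:
            break
    R = 1.0
    while True:
        while len(T) <= k:              # step 1: absorb points until |T| = k + 1
            x = next(it, end)
            if x is end:
                return T, R
            if dist_to_set(x, T) > 2 * R:
                T.append(x)
        T_new = []                      # steps 2-3: rebuild with spacing > 2R
        for z in T:
            if dist_to_set(z, T_new) > 2 * R:
                T_new.append(z)
        T = T_new                       # step 4
        R *= 2                          # step 5: double R

stream = [0, 1, 2, 10, 11, 12, 50, 51, 100]
T, R = streaming_k_center(stream, 3)
worst = max(dist_to_set(p, T) for p in stream)
print(T, R, worst)
```

On this stream the sketch ends with centers 0, 50, and 100 and R = 16; every point is within 2R of a center, which is the invariant the 8-approximation argument builds on.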

Streaming k-Center Clustering

Some useful things to prove:

Before step (2):
⇒ Every point is within 2R of T.

Before step (5):
⇒ Every point is within 4R of T.

Before step (2):
⇒ There are k+1 centers at distance at least R from each other.

Before step (5):
⇒ All centers are distance at least 2R from each other.

Summary

Today: Clustering and Streaming

k-median clustering
• Find k centers to minimize the average distance to a center.

LP approximation algorithm
• Find 2k centers that give a 4-approximation of the optimal clustering.

Streaming
• Find k centers in a stream of points.
• Use a hierarchical scheme to reduce space.

Other clustering problems

Last Week: Graph Streaming
• Connectivity
• Bipartite test
• MST
• Spanners
• Matching