Christian Sohler 1 HEINZ NIXDORF INSTITUT Universität Paderborn Algorithmen und Komplexität A Fast...

Post on 26-Mar-2015

Christian Sohler

HEINZ NIXDORF INSTITUT, Universität Paderborn

Algorithmen und Komplexität

A Fast PTAS for k-Means Clustering

Dan Feldman (Tel Aviv University), Morteza Monemizadeh and Christian Sohler (Universität Paderborn)

Simple coreset for clustering problems: Overview

• Introduction
• Weak Coresets
  - Definition
  - Intuition
  - The construction
  - A sketch of analysis
• The k-means PTAS
• Conclusions

Introduction: Clustering

Clustering
• Partition the input into sets (clusters), such that
  - Objects in the same cluster are similar
  - Objects in different clusters are dissimilar

Goal
• Simplification
• Discovery of patterns

Procedure
• Map objects to Euclidean space => point set P
• Points in the same cluster are close
• Points in different clusters are far away from each other

Introduction: k-means clustering

Clustering with Prototypes
• One prototype (center) for each cluster

k-Means Clustering
• k clusters $C_1,\dots,C_k$
• One center $c_i$ for each cluster $C_i$
• Minimize $\sum_{i=1}^{k} \sum_{p \in C_i} d(p,c_i)^2$

Introduction: Simplification / Lossy Compression

[Figure: color values (128,59,88) and (218,181,163)]

Introduction: Properties of k-means

Optimal solution, if
• Centers are given ⇒ assign each point to the nearest center
• Clusters are given ⇒ use the centroid (mean) of each cluster as its center

Notation: cost(P,C) denotes the cost of the solution obtained this way, i.e. when each point of P is assigned to its nearest center in C.
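The two properties and the cost notation can be illustrated with a minimal sketch. The function names (`sq_dist`, `cost`, `assign`, `centroid`) and the toy data are illustrative assumptions, not part of the slides, and empty clusters are not handled.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def cost(P, C):
    """cost(P, C): every point pays the squared distance to its nearest center."""
    return sum(min(sq_dist(p, c) for c in C) for p in P)

def assign(P, C):
    """Centers given: put each point into the cluster of its nearest center."""
    clusters = [[] for _ in C]
    for p in P:
        nearest = min(range(len(C)), key=lambda i: sq_dist(p, C[i]))
        clusters[nearest].append(p)
    return clusters

def centroid(points):
    """Clusters given: the best single center of a cluster is its mean."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

# Toy data (assumed for illustration): two pairs of points, two rough centers.
P = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
C = [(1.0, 1.0), (9.0, 1.0)]

clusters = assign(P, C)                      # property 1: nearest-center assignment
better = [centroid(cl) for cl in clusters]   # property 2: move centers to the means
print(cost(P, C), cost(P, better))
```

Replacing the given centers by the cluster means can only lower the cost, which is what the second property states.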

Weak Coresets: Centroid Sets

Definition ($\varepsilon$-approx. centroid set)
A set S is called an $\varepsilon$-approximate centroid set if it contains a subset $C \subseteq S$ s.t. $\mathrm{cost}(P,C) \le (1+\varepsilon)\,\mathrm{cost}(P,\mathrm{Opt})$

Lemma [KSS04]
The centroid of a random set of $2/\varepsilon$ points is, with constant probability, a $(1+\varepsilon)$-approx. of the optimal center of P.

Corollary
The set of all centroids of subsets of $2/\varepsilon$ points is an $\varepsilon$-approx. centroid set.
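The lemma can be checked empirically with a small experiment. This is a sketch under assumptions that are not in the slides: points are drawn uniformly with repetition, the data is a synthetic Gaussian cloud, and the names `sampled_centroid` and `one_mean_cost` are made up for illustration.

```python
import math
import random

def centroid(points):
    """Mean of a list of points given as tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def one_mean_cost(P, c):
    """Sum of squared distances from all points of P to a single center c."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, c)) for p in P)

def sampled_centroid(P, eps, rng):
    """Centroid of ceil(2/eps) points drawn uniformly with repetition (assumption)."""
    m = math.ceil(2 / eps)
    return centroid([rng.choice(P) for _ in range(m)])

rng = random.Random(0)
P = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]  # synthetic data
eps = 0.2
opt = one_mean_cost(P, centroid(P))   # the centroid of P is the optimal 1-mean center

trials = 200
hits = sum(
    one_mean_cost(P, sampled_centroid(P, eps, rng)) <= (1 + eps) * opt
    for _ in range(trials)
)
success_rate = hits / trials
print(success_rate)
```

The printed fraction is the empirical counterpart of the "constant probability" in the lemma for this one data set; it is not a proof.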

Weak Coresets: Definition

Definition (weak $\varepsilon$-coreset for k-means)
A pair (K,S) is called a weak $\varepsilon$-coreset for P if, for every set C of k centers from the $\varepsilon$-approx. centroid set S, we have
$(1-\varepsilon)\,\mathrm{cost}(P,C) \le \mathrm{cost}(K,C) \le (1+\varepsilon)\,\mathrm{cost}(P,C)$

[Figure: point set P (light blue)]
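The defining inequality can be tested by brute force on tiny inputs. The sketch below assumes a weighted point set K with weights w and a small explicit candidate set S; the helper names and the toy data are hypothetical, and the check is only feasible when S is tiny.

```python
from itertools import combinations

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def weighted_cost(K, w, C):
    """cost(K, C) for a weighted point set: weight times squared distance to nearest center."""
    return sum(wi * min(sq_dist(p, c) for c in C) for p, wi in zip(K, w))

def is_weak_coreset(P, K, w, S, k, eps):
    """Check (1-eps) cost(P,C) <= cost(K,C) <= (1+eps) cost(P,C) for all k-subsets C of S."""
    ones = [1.0] * len(P)
    for C in combinations(S, k):
        full = weighted_cost(P, ones, C)
        approx = weighted_cost(K, w, C)
        if not ((1 - eps) * full <= approx <= (1 + eps) * full):
            return False
    return True

# Toy data (assumed): P has repeated points, so K with multiplicities is an exact summary.
P = [(0.0, 0.0)] * 3 + [(4.0, 0.0)] * 5
K = [(0.0, 0.0), (4.0, 0.0)]
S = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]

good = is_weak_coreset(P, K, [3.0, 5.0], S, k=2, eps=0.1)
bad = is_weak_coreset(P, K, [1.0, 1.0], S, k=2, eps=0.1)
print(good, bad)
```

Note that the guarantee is only required for center sets taken from S, which is what makes the coreset "weak".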

[Figure sequence: the set of solutions S (yellow); a possible coreset with weights (red); the coreset approximates the cost of k centers (violet) from S]

Weak Coresets: Ideal Sampling

Problem
• Given n numbers $a_1,\dots,a_n > 0$
• Task: approximate $A := \sum_{i=1}^{n} a_i$ by random sampling

Ideal Sampling
• Assign weights $w_1,\dots,w_n$ to the numbers
• $w_j = A / a_j$
• $\Pr[x=j] = a_j / A$
• Estimator: $w_x a_x$

Properties of the estimator:
(1) $w_x a_x = A$ (zero variance)
(2) The expected weight of number j is 1, since $\Pr[x=j] \cdot w_j = 1$

Only problem: Weights can be very large
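A minimal sketch of ideal sampling follows. The function name `ideal_sample` and the numbers are illustrative assumptions. The sketch computes A exactly in order to sample, so it only demonstrates the estimator's properties, not a practical way to avoid computing A.

```python
import random

def ideal_sample(a, rng):
    """Draw index j with Pr[x=j] = a_j / A and return (j, w_j) with w_j = A / a_j."""
    A = sum(a)
    j = rng.choices(range(len(a)), weights=a, k=1)[0]
    return j, A / a[j]

a = [1.0, 2.0, 5.0, 0.5]   # toy numbers (assumed), A = 8.5
rng = random.Random(1)

# Every single draw reproduces A, which is the zero-variance property.
estimates = []
for _ in range(1000):
    j, w = ideal_sample(a, rng)
    estimates.append(w * a[j])

# The smallest number gets the largest weight: the "only problem" from the slide.
largest_weight = sum(a) / min(a)
print(min(estimates), max(estimates), largest_weight)
```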

Weak Coresets: Construction

Step 1
• Compute a constant-factor approximation

Step 2
• Consider each cluster separately

Main idea: Apply ideal sampling to each cluster C (with center c)
• $\Pr[p_i \text{ is taken}] = \mathrm{dist}(p_i,c) / \mathrm{cost}(C,c)$
• $w(p_i) = \mathrm{cost}(C,c) / \mathrm{dist}(p_i,c)$

But what about high weights?

Step 3
• A little twist
  - Uniform sampling from a small ball around the center; radius = average distance / $\varepsilon$
  - Ideal sampling from the 'outliers' (the points outside the ball)
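The per-cluster sampling step might look roughly as follows. This is a sketch under several assumptions that the slides do not fix: cost(C,c) is read as the sum of (unsquared) distances, as in the formulas for Pr and w; the sample sizes `m_ball` and `m_out` are free parameters; weights are divided by the number of draws; and `sample_cluster` is a hypothetical name. It requires Python 3.8+ for `math.dist` and assumes 0 < eps < 1.

```python
import math
import random

def sample_cluster(cluster, c, eps, m_ball, m_out, rng):
    """Return (sample, ball, outliers); sample is a list of (point, weight) pairs."""
    d = [math.dist(p, c) for p in cluster]
    total = sum(d)                               # cost(C, c), read as a sum of distances
    radius = (total / len(cluster)) / eps        # average distance / eps
    ball = [p for p, dp in zip(cluster, d) if dp <= radius]
    outliers = [(p, dp) for p, dp in zip(cluster, d) if dp > radius]

    # Uniform sampling inside the ball: each draw stands for |ball| / m_ball points.
    sample = [(rng.choice(ball), len(ball) / m_ball) for _ in range(m_ball)]

    # Ideal sampling among the outliers: probability ~ dist, weight ~ 1 / dist.
    if outliers:
        out_total = sum(dp for _, dp in outliers)
        picks = rng.choices(outliers, weights=[dp for _, dp in outliers], k=m_out)
        sample += [(p, out_total / (m_out * dp)) for p, dp in picks]
    return sample, ball, outliers

# Synthetic cluster (assumed): a dense cloud plus three far-away points.
rng = random.Random(2)
cluster = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(95)]
cluster += [(100.0, 0.0), (0.0, 100.0), (-100.0, 0.0)]
eps, m_ball, m_out = 0.1, 20, 3

sample, ball, outliers = sample_cluster(cluster, (0.0, 0.0), eps, m_ball, m_out, rng)
print(len(ball), len(outliers), sum(w for _, w in sample))
```

With this choice of radius, fewer than an eps-fraction of the points can lie outside the ball (by an averaging argument), and every outlier has distance greater than the radius, so its weight stays bounded.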

Weak Coresets: Analysis

Fix an arbitrary set of centers K
• Case (a): nearest center is 'far away'
  - At least a $(1-\varepsilon)$-fraction of the points is inside the ball, by choice of the radius
  - Weight of the samples from outliers is at most $\varepsilon |C|$

Forget about outliers!

It doesn't matter where the points lie inside the ball

[Figure: distances labeled D]

• Case (b): nearest center is 'near'
  - Almost ideal sampling
  - Expectation is cost(C,K)
  - Low variance

Weak Coresets: Result

The centroid set
• S is the set of all centroids of $2/\varepsilon$ points (with repetition) from our sample set K
• Can show that K approximates all solutions from S
• Can show that S is an $\varepsilon$-approx. centroid set w.h.p.

Theorem
One can compute in O(nkd) time a weak $\varepsilon$-coreset (K,S). The size of K is poly(k, $1/\varepsilon$). S is the set of all centroids of subsets of K of size $2/\varepsilon$.

Weak Coresets: Applications

Fast-k-Means-PTAS(P,k)
1. Compute a weak coreset K
2. Project K onto a poly($1/\varepsilon$, k)-dimensional space
3. Exhaustively search for the best solution C in the (projection of the) centroid set
4. Return the centroids of the points that create C

Running time: $O\big(nkd + (k/\varepsilon)^{\tilde{O}(k/\varepsilon)}\big)$
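Step 3 can be pictured with a brute-force sketch on a tiny weighted set. This is only an illustration under assumptions: the projection of step 2 is skipped, the subset size `m` plays the role of 2/ε, the coreset K and its weights are toy values, and `exhaustive_search` is a hypothetical name.

```python
from itertools import combinations, combinations_with_replacement

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def weighted_cost(K, w, C):
    """Weighted k-means cost of the center set C on the coreset K."""
    return sum(wi * min(sq_dist(p, c) for c in C) for p, wi in zip(K, w))

def exhaustive_search(K, w, k, m):
    """Try every set of k centers from S = centroids of m points of K (with repetition)."""
    S = {centroid(subset) for subset in combinations_with_replacement(K, m)}
    best = min(combinations(sorted(S), k), key=lambda C: weighted_cost(K, w, C))
    return list(best), weighted_cost(K, w, best)

# Toy coreset (assumed): two well-separated groups with unit weights.
K = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
w = [1.0] * len(K)

centers, best_cost = exhaustive_search(K, w, k=2, m=2)
print(centers, best_cost)
```

The search space depends only on the size of K, not on n, which is why the exhaustive step contributes a term independent of n to the running time.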

Summary

Weak Coresets
• Size independent of n and d
• Fast PTAS for k-means
• First PTAS for kernel k-means (if the kernel maps into a finite-dimensional space)

Christian Sohler
Heinz Nixdorf Institut & Institut für Informatik
Universität Paderborn
Fürstenallee 11
33102 Paderborn, Germany

Tel.: +49 (0) 52 51/60 64 27
Fax: +49 (0) 52 51/62 64 82
E-Mail: csohler@upb.de
http://www.upb.de/cs/ag-madh

Thank you!