
Chapter 3

k-means Clustering

The k-means or Lloyd's algorithm is one of the most popular clustering algorithms. It can be used to solve the k-median or the k-means problem (see Problem 1.14). To recall, in the k-median problem we are given a set of points $P \subseteq M$ and $k \in \mathbb{N}$ and have to find a partition of $P$ into $k$ subsets or clusters $C_1, \ldots, C_k$ with corresponding set of centroids $C = \{c_1, \ldots, c_k\}$ such that $\mathrm{cost}_D(P, C) = \mathrm{cost}_D^k(P)$. Here $\mathrm{cost}_D^k(P)$ is the minimal cost of $P$ with respect to a set of $k$ centroids. In this section we mainly look at the k-median problem for the squared Euclidean distance, in which case it is called the k-means problem. However, Lloyd's algorithm can be formulated and used for a variety of distance or dissimilarity measures. The basic idea of the algorithm is as follows.

1. choose $k$ initial centers

2. repeat the following steps until there is no improvement in the cost function:

a) $C_i$ := set of points closest to $c_i$

b) $c_i$ := centroid of $C_i$

    This idea raises several questions.

    1. What are the centroids (with respect to D)?

    2. Does k-means converge? If so, how fast?

    3. How good are the solutions found by k-means?

    4. For which dissimilarity measures can it be applied?


3.1 Lloyd's algorithm for the squared Euclidean distance

For the Euclidean distance the first question has a perhaps surprising answer. In this case the centroids are called Weber points. Even for $k = 1$ these points are hard to compute. In fact, they cannot be computed exactly, in the sense that the Weber point of a set cannot be expressed by a formula using only simple functions (arithmetic operations and $d$-th roots) in the coordinates of the original points. For the squared Euclidean distance, centroids are easier to describe.

Lemma 3.1. For any finite set $X \subseteq \mathbb{R}^d$ the centroid of $X$ with respect to the squared Euclidean distance $D_{\ell_2^2}$ is given by the center of gravity of the points in $X$, i.e.
$$c(X) = \frac{1}{|X|} \sum_{x \in X} x.$$
More precisely, for any $y \in \mathbb{R}^d$:
$$D_{\ell_2^2}(X, y) = D_{\ell_2^2}(X, c(X)) + |X| \cdot D_{\ell_2^2}(c(X), y).$$

Proof. We only need to prove the final statement, since $\|x - y\|^2 = 0 \Leftrightarrow x = y$. Let $X \subseteq \mathbb{R}^d$ and $y \in \mathbb{R}^d$ be arbitrary. By $\langle u, v \rangle$ we denote the inner product of two vectors in $\mathbb{R}^d$. To simplify notation, we write $c$ for the centroid $c(X)$ of $X$. Then
$$\begin{aligned}
D_{\ell_2^2}(X, y) &= \sum_{x \in X} \|x - y\|^2 \\
&= \sum_{x \in X} \|x - c + c - y\|^2 \\
&= \sum_{x \in X} \langle x - c + c - y,\, x - c + c - y \rangle \\
&= \sum_{x \in X} \langle x - c, x - c \rangle + \sum_{x \in X} \langle c - y, c - y \rangle + 2 \sum_{x \in X} \langle x - c, c - y \rangle \\
&= D_{\ell_2^2}(X, c) + |X| \cdot \|c - y\|^2 + 2 \Big\langle \sum_{x \in X} (x - c),\, c - y \Big\rangle.
\end{aligned}$$


By definition of the centroid, $\sum_{x \in X} (x - c) = 0$. Hence
$$\Big\langle \sum_{x \in X} (x - c),\; c - y \Big\rangle = 0$$
and the lemma follows.
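As a quick numerical sanity check of this identity, the following small NumPy sketch (my own, not part of the notes; the helper name squared_cost is made up) verifies Lemma 3.1 on random data:

import numpy as np

def squared_cost(X, y):
    # D_{l_2^2}(X, y): sum of squared Euclidean distances from the rows of X to y
    return np.sum(np.linalg.norm(X - y, axis=1) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))      # 50 points in R^3
y = rng.normal(size=3)            # an arbitrary point y
c = X.mean(axis=0)                # center of gravity c(X)

lhs = squared_cost(X, y)
rhs = squared_cost(X, c) + len(X) * np.sum((c - y) ** 2)
print(np.isclose(lhs, rhs))       # True, as claimed by Lemma 3.1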

Using the lemma we obtain the following formulation of Lloyd's algorithm for the squared Euclidean distance.

k-Means(P)
  choose k initial centroids c_1, ..., c_k;
  repeat
    /* assignment step */
    for i = 1, ..., k do
      C_i := set of points in P closest to c_i;
    end
    /* re-evaluation step */
    for i = 1, ..., k do
      c_i := c(C_i) = (1/|C_i|) * sum_{p in C_i} p;
    end
  until convergence;
  return c_1, ..., c_k and C_1, ..., C_k
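Below is a minimal NumPy sketch of this pseudocode (my own illustration, not code from the notes; the function name lloyd, the plain uniform initialization and the max_iter cap are assumptions):

import numpy as np

def lloyd(P, k, rng=np.random.default_rng(0), max_iter=100):
    # Lloyd's algorithm on a point set P given as an (n x d) float array
    centroids = P[rng.choice(len(P), size=k, replace=False)]   # k initial centroids
    labels = None
    for _ in range(max_iter):
        # assignment step: C_i := set of points in P closest to c_i
        dists = ((P[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments unchanged, hence no further improvement in the cost
        labels = new_labels
        # re-evaluation step: c_i := center of gravity of C_i
        for i in range(k):
            if np.any(labels == i):
                centroids[i] = P[labels == i].mean(axis=0)
    return centroids, labels

The initialization here is deliberately naive; the k-Means++ seeding discussed later in this chapter replaces it with a much better choice of initial centroids.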

The next lemma shows that the algorithm always halts after a finite number of steps.

Lemma 3.2. Algorithm k-Means always halts after a finite number of steps. The number of assignment and re-evaluation steps can be bounded by $n^{O(k^2 d)}$.

Proof. We only show the first statement; the proof of the second statement is an exercise. Since the algorithm stops as soon as there is no further improvement in the quality of the solution, it will never compute the same clustering twice. Hence it suffices to show that the number of possible k-clusterings of a point set $P$ is finite. Set $n := |P|$. The number of subsets of $P$ is $2^n$. Since every k-clustering of $P$ consists of $k$ subsets of $P$, the number of possible k-clusterings is bounded by $2^{kn}$.


Unfortunately, the upper bound of the previous lemma on the running time of Algorithm k-Means cannot be improved significantly.

Lemma 3.3. There are sets $X \subseteq \mathbb{R}^n$, $n = |X|$, for which there are initial centers such that k-Means uses $2^{\Omega(\sqrt{n})}$ assignment and re-evaluation steps.

However, note that the dimension of the point sets in this lemma depends on the number of points.

It is well known that k-Means can get stuck in poor local optima and that the quality of the solution found by k-Means depends heavily on the initial centroids. An example of this behavior is given in Figure 3.1.

[Figure 3.1: Points $a, b, c, d$ with initial centroids $c_1, c_2$ — the k-means clusters $C_1, C_2$ found from these centroids differ from the optimal clusters $C_1, C_2$.]

Since the k-means problem is a hard problem, one should not be surprised that Algorithm k-Means does not perform consistently well. In fact, we have

    Theorem 3.4. The k-means problem is NP-complete. This remains true if

    1. d = 2 and k is arbitrary,

    2. k = 2 and d is arbitrary.

Despite its shortcomings, k-Means is very popular and widely used. In practice, it usually performs reasonably fast and returns useful and meaningful clusterings. Later, we will see an elegant approach to compute good


initial centroids for k-Means. Before doing so, we show that k-Means can be applied not only to the squared Euclidean distance, but to a large class of dissimilarity measures, called Bregman divergences.

    3.2 k-Means for Bregman divergences

To define Bregman divergences we need to recall some basic mathematical definitions.

A set $S \subseteq \mathbb{R}^d$ is called convex if
$$\forall x, y \in S,\ \forall \lambda \in [0, 1]:\quad \lambda x + (1 - \lambda) y \in S.$$
A function $\Phi : S \to \mathbb{R}$ is called strictly convex if
$$\forall x, y \in S,\ x \neq y,\ \forall \lambda \in (0, 1):\quad \lambda \Phi(x) + (1 - \lambda) \Phi(y) > \Phi(\lambda x + (1 - \lambda) y).$$

If the function $g : \mathbb{R}^d \to \mathbb{R}$, $(x_1, \ldots, x_d) \mapsto g(x_1, \ldots, x_d)$, is differentiable, then
$$\nabla g(y) := \Big( \frac{\partial g}{\partial x_1}(y), \ldots, \frac{\partial g}{\partial x_d}(y) \Big)$$
is the gradient of $g$.

Definition 3.5. Let $S \subseteq \mathbb{R}^d$, $S \neq \emptyset$, be convex and let $\Phi : S \to \mathbb{R}$ be a differentiable, strictly convex function. The Bregman divergence $d_\Phi$ associated to $\Phi$ is defined by
$$d_\Phi : S \times S \to \mathbb{R}_{\geq 0}, \quad (x, y) \mapsto \Phi(x) - \Phi(y) - \langle x - y, \nabla \Phi(y) \rangle.$$

Bregman divergences have a nice geometric interpretation, which is shown in Figure 3.2. For fixed $y$, let $f_y : \mathbb{R}^d \to \mathbb{R}$ be defined by $f_y(x) := \Phi(y) + \langle x - y, \nabla \Phi(y) \rangle$. Then $f_y$ is a linear function and hence a linear approximation to $\Phi$ at the point $y$, and $d_\Phi(x, y)$ is the difference between the true function value $\Phi(x)$ and the value $f_y(x)$ of the linear approximation to $\Phi$ at $y$. From this description it should be clear that Bregman divergences need not be symmetric. In fact, many Bregman divergences are asymmetric. Examples of Bregman divergences include:


[Figure 3.2: Geometric interpretation of Bregman divergences — $d_\Phi(x, y)$ is the gap at $x$ between $\Phi$ and its linear approximation $\Phi(y) + \langle x - y, \nabla \Phi(y) \rangle$ at $y$.]

The squared Euclidean distance $D_{\ell_2^2}$ is the Bregman divergence associated to $\Phi(x) = \langle x, x \rangle = \sum_{i=1}^d x_i^2$.

The Mahalanobis divergence $D_A$, with $A \in \mathbb{R}^{d \times d}$ positive definite (see Example 1.6), is the Bregman divergence associated to $\Phi(x) = x^T A x$.

The Kullback-Leibler divergence $D_{KL}$ (see Example 1.8) is the Bregman divergence associated to $\Phi(x) = \sum_{i=1}^d x_i \log x_i$.
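To make Definition 3.5 and these examples concrete, here is a small NumPy sketch (my own illustration; the helper name bregman is made up) that builds $d_\Phi$ from $\Phi$ and $\nabla \Phi$:

import numpy as np

def bregman(phi, grad_phi):
    # d_phi(x, y) = phi(x) - phi(y) - <x - y, grad_phi(y)>
    return lambda x, y: phi(x) - phi(y) - np.dot(x - y, grad_phi(y))

# squared Euclidean distance: phi(x) = <x, x>
d_sq = bregman(lambda x: np.dot(x, x), lambda y: 2 * y)

# Mahalanobis divergence for a positive definite A: phi(x) = x^T A x
A = np.array([[2.0, 0.5], [0.5, 1.0]])
d_mahal = bregman(lambda x: x @ A @ x, lambda y: (A + A.T) @ y)

# (generalized) Kullback-Leibler divergence: phi(x) = sum_i x_i log x_i
d_kl = bregman(lambda x: np.sum(x * np.log(x)), lambda y: np.log(y) + 1)

x, y = np.array([0.2, 0.8]), np.array([0.5, 0.5])
print(d_sq(x, y), np.sum((x - y) ** 2))        # both equal ||x - y||^2
print(d_kl(x, y), np.sum(x * np.log(x / y)))   # KL divergence on the simplex

The printed pairs agree, matching the closed forms $\|x - y\|^2$ and, on the probability simplex, the Kullback-Leibler divergence.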

The following result follows from the fact that $\Phi$ is strictly convex; we omit a detailed proof.

    Lemma 3.6. Bregman divergences are positive and reflexive.


The following simple but surprising and powerful lemma shows that we can generalize Algorithm k-Means (Algorithm 5) without modifications to any Bregman divergence. For example, we can apply the version of k-Means for the Kullback-Leibler divergence to try to solve the clustering problem derived and described in Section 1.3.

Lemma 3.7. Let $d_\Phi : S \times S \to \mathbb{R}_{\geq 0}$ be a Bregman divergence and $X \subseteq S$, $|X| < \infty$. Then the centroid of $X$ with respect to $d_\Phi$ is the center of gravity $c(X) = \frac{1}{|X|} \sum_{x \in X} x$.

3.3 k-Means++

In practice, one often chooses the initial centers uniformly at random from the input set $P$. However, this strategy does not guarantee that k-Means finds clusterings close to the global optimum. In this section, we describe and analyze a non-uniform seeding strategy to compute initial centers such that k-Means, started with these centroids, in expectation finds a clustering that is fairly close to the optimal clustering.

We restrict ourselves to the squared Euclidean distance, although the technique can be generalized to certain Bregman divergences. To simplify notation we write $D(\cdot, \cdot)$ instead of $D_{\ell_2^2}(\cdot, \cdot)$. As usual, $P \subseteq \mathbb{R}^d$, $|P| < \infty$, is the input. The following conventions further simplify the exposition.

If $C$, $|C| = k$, is a set of centroids with corresponding set of clusters $\mathcal{C} = \{C_1, \ldots, C_k\}$, then we call both $C$ and $\mathcal{C}$ simply a clustering.

For each $x \in \mathbb{R}^d$ and each finite, non-empty $C \subseteq \mathbb{R}^d$ we write $D(x, C) := \min_{c \in C} D(x, c)$, and for $x \in P$ we define the sampling distribution $p_C(x) := \frac{D(x, C)}{\sum_{y \in P} D(y, C)}$.

k-Means++(P, k)
  choose c ∈ P uniformly at random, C := {c};
  repeat
    choose c ∈ P according to distribution p_C(·);
    C := C ∪ {c};
  until |C| = k;
  run k-Means on P with initial centers C;
  return C;
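The seeding loop of this pseudocode can be sketched in a few lines of NumPy (my own illustration; the name kmeanspp_seeding is hypothetical, and the final k-Means run from the last line of the pseudocode is omitted):

import numpy as np

def kmeanspp_seeding(P, k, rng=np.random.default_rng(0)):
    # choose k initial centers from P by D^2-sampling
    centers = [P[rng.integers(len(P))]]          # first center: uniform over P
    while len(centers) < k:
        # D(x, C): squared distance of each point to its closest chosen center
        d2 = np.min(((P[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2), axis=1)
        probs = d2 / d2.sum()                    # the distribution p_C(.)
        centers.append(P[rng.choice(len(P), p=probs)])
    return np.array(centers)

The returned centers would then be passed to a k-Means routine, as in the last line of the pseudocode.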

Intuitively, k-Means++ chooses the initial centroids iteratively, preferring in the $i$-th step those points that are farther away from the first $i - 1$ centroids. It is clear that k-Means++ can be implemented to run in polynomial time. To analyze the quality of the solutions found by k-Means++ we will completely ignore the k-Means step of the algorithm, although in practice running even a small number of k-Means steps improves the solution and leads to very good results. Our goal is to prove the following theorem.

Theorem 3.8. For any finite set of points $P \subseteq \mathbb{R}^d$ and any $k \in \mathbb{N}$, algorithm k-Means++ computes a k-clustering $C$ of $P$ such that
$$\mathbb{E}[D(P, C)] \leq 8 \, (2 + \ln k) \cdot \mathrm{opt}_k(P).$$

To prove the theorem we need three preliminary lemmas. The next lemma is used to analyze the first random choice of the algorithm.

Lemma 3.9. Let $A \subseteq \mathbb{R}^d$, $|A| < \infty$. If $a \in A$ is chosen uniformly at random from $A$, then
$$\mathbb{E}[D(A, \{a\})] = 2 \cdot \mathrm{opt}_1(A).$$

    Proof. By c(A) we denote the center of gravity of A. By Lemma 3.1 we have


$\mathrm{opt}_1(A) = D(A, c(A))$. Next,
$$\begin{aligned}
\mathbb{E}[D(A, \{a\})] &= \sum_{a_0 \in A} \frac{1}{|A|} \sum_{b \in A} D(b, a_0) \\
&= \sum_{a_0 \in A} \frac{1}{|A|} \sum_{b \in A} \|b - a_0\|^2 \\
&\overset{\text{Lemma 3.1}}{=} \frac{1}{|A|} \sum_{a_0 \in A} \Big( \sum_{b \in A} \|b - c(A)\|^2 + |A| \cdot \|a_0 - c(A)\|^2 \Big) \\
&= 2 \sum_{b \in A} \|b - c(A)\|^2 \\
&= 2 \cdot \mathrm{opt}_1(A).
\end{aligned}$$
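Lemma 3.9 is also easy to confirm numerically by computing the expectation exactly as an average over all choices of $a_0$ (my own sketch, not from the notes):

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(30, 2))                   # a finite point set A in R^2
c = A.mean(axis=0)                             # center of gravity c(A)

opt1 = np.sum((A - c) ** 2)                    # opt_1(A) = D(A, c(A))
# E[D(A, {a})] for a uniform in A: average over all a0 of sum_b ||b - a0||^2
expectation = np.mean([np.sum((A - a0) ** 2) for a0 in A])
print(np.isclose(expectation, 2 * opt1))       # True, as in Lemma 3.9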

The following lemma is a key ingredient to analyze the subsequent random choices of k-Means++.

Lemma 3.10. Let $A \subseteq P$ be a cluster of $\mathcal{C}_{opt}$ and let $C$, $|C| < k$, be arbitrary. If $a$ is chosen according to $p_C(\cdot)$, then
$$\mathbb{E}[D(A, C \cup \{a\}) \mid a \in A] \leq 8 \cdot D_{opt}(A).$$

Proof. First, observe that
$$p_C(a \mid a \in A) = \frac{p_C(a)}{\sum_{y \in A} p_C(y)} = \frac{D(a, C)}{\sum_{x \in P} D(x, C)} \cdot \frac{\sum_{x \in P} D(x, C)}{\sum_{y \in A} D(y, C)} = \frac{D(a, C)}{\sum_{y \in A} D(y, C)} = \frac{D(a, C)}{D(A, C)}.$$
Hence
$$\mathbb{E}[D(A, C \cup \{a\}) \mid a \in A] = \sum_{a_0 \in A} \frac{D(a_0, C)}{D(A, C)} \sum_{a \in A} \min\{D(a, C), D(a, a_0)\}.$$


Using
$$\forall a_0, a: \quad D(a_0, C) \leq 2 D(a, C) + 2 D(a, a_0) \qquad \text{(Exercise 1.3)}$$
we obtain
$$D(a_0, C) \leq \frac{2}{|A|} \sum_{a \in A} D(a, C) + \frac{2}{|A|} \sum_{a \in A} D(a, a_0) = \frac{2}{|A|} D(A, C) + \frac{2}{|A|} \sum_{a \in A} D(a, a_0).$$
Hence
$$\begin{aligned}
\mathbb{E}[D(A, C \cup \{a\}) \mid a \in A] \leq{}& \frac{2}{|A|} \sum_{a_0 \in A} \frac{D(A, C)}{D(A, C)} \sum_{a \in A} \min\{D(a, C), D(a, a_0)\} \\
&+ \frac{2}{|A|} \sum_{a_0 \in A} \frac{\sum_{a \in A} D(a, a_0)}{D(A, C)} \sum_{a \in A} \min\{D(a, C), D(a, a_0)\}.
\end{aligned}$$
In the first summand we replace the minimum by $D(a, a_0)$ and in the second summand by $D(a, C)$ to obtain
$$\mathbb{E}[D(A, C \cup \{a\}) \mid a \in A] \leq \frac{2}{|A|} \sum_{a_0 \in A} \sum_{a \in A} D(a, a_0) + \frac{2}{|A|} \sum_{a_0 \in A} \sum_{a \in A} D(a, a_0) = \frac{4}{|A|} \sum_{a_0 \in A} \sum_{a \in A} D(a, a_0).$$
Using Lemma 3.9 and its proof we see that
$$\frac{1}{|A|} \sum_{a_0 \in A} \sum_{a \in A} D(a, a_0) = \mathbb{E}[D(A, \{a_0\})] = 2 \cdot \mathrm{opt}_1(A) = 2 \cdot D_{opt}(A).$$

    This concludes the proof.

    The next lemma will lead easily to the proof of Theorem 3.8.


Lemma 3.11. Let $0 < u < k$, $0 \leq t \leq u$. Let $P^u$ be the union of $u$ different clusters of $\mathcal{C}_{opt}$ and set $P^c := P \setminus P^u$. Finally, let $B \subseteq P^c$ and set $C_0 := B$ and $C_j := C_{j-1} \cup \{a_j\}$, $j = 1, \ldots, t$, where $a_j$ is chosen according to $p_{C_{j-1}}$. Then
$$\mathbb{E}[D(P, C_t)] \leq (1 + H_t) \big( D(P^c, B) + 8 \cdot D_{opt}(P^u) \big) + \frac{u - t}{u} \cdot D(P^u, B),$$
where $H_t = \sum_{i=1}^t \frac{1}{i}$.

    Before we prove this lemma, we show that it implies Theorem 3.8.

Proof of Theorem 3.8. Let $c_1$ be the first centroid chosen by algorithm k-Means++ and assume that $c_1$ lies in the cluster $A$ of an optimal clustering $\mathcal{C}_{opt}$. We apply Lemma 3.11 with the following parameter settings
$$t = u = k - 1, \quad B = \{c_1\}, \quad P^c = A, \quad P^u = P \setminus A$$
and obtain
$$\begin{aligned}
\mathbb{E}[D(P, C)] &\leq (1 + H_{k-1}) \big( D(A, \{c_1\}) + 8 \cdot D_{opt}(P \setminus A) \big) \\
&\overset{\text{Lemma 3.9}}{\leq} (2 + \ln k) \big( 2 \cdot \mathrm{opt}_1(A) + 8 \cdot \mathrm{opt}_k(P) - 8 \cdot \mathrm{opt}_1(A) \big) \\
&\leq 8 \, (2 + \ln k) \, \mathrm{opt}_k(P).
\end{aligned}$$

    We are left with

Proof of Lemma 3.11. The proof is by induction on pairs $(t, u)$. More precisely, we show that

1. the lemma holds for all pairs $(t, u)$ with $t = 0$ and $u > 0$,

2. the lemma holds for the pair $(1, 1)$,

3. if the lemma holds for the pairs $(t - 1, u)$ and $(t - 1, u - 1)$, then it holds for the pair $(t, u)$.


We call the optimal clusters in $P^u$ the uncovered clusters and the clusters in $P^c$ the covered clusters. In the cases $t = 0$, $u > 0$, we have $1 + H_0 = \frac{u - t}{u} = 1$ and
$$\mathbb{E}[D(P, B)] = \mathbb{E}[D(P, C_0)] = D(P^c, B) + D(P^u, B),$$
which proves the lemma in these cases.

For the pair $(t, u) = (1, 1)$ we have $1 + H_1 = 2$ and $\frac{u - t}{u} = 0$. Moreover, $P^u$ consists of a single cluster $A$ and we have just $a = a_1$ to choose. By definition of the k-Means++ distribution $p_B$,
$$\Pr[a_1 \in A] = \frac{D(A, B)}{D(P, B)} \quad \text{and} \quad \Pr[a_1 \notin A] = \frac{D(P^c, B)}{D(P, B)}.$$
Using Lemma 3.10 this implies
$$\mathbb{E}[D(P, C_1)] \leq \frac{D(A, B)}{D(P, B)} \big( D(P^c, B) + 8 \cdot D_{opt}(A) \big) + \frac{D(P^c, B)}{D(P, B)} \cdot D(P, B) \leq 2 \cdot D(P^c, B) + 8 \cdot D_{opt}(A),$$
which proves the lemma in this case as well.

Now we prove the induction step from the cases $(t - 1, u)$, $(t - 1, u - 1)$ to the case $(t, u)$. Consider the first choice $a_1$. With probability

(i) $\frac{D(P^c, B)}{D(P, B)}$ we have $a_1 \in P^c$,

(ii) $\frac{D(P^u, B)}{D(P, B)}$ we have $a_1 \in P^u$.

In case (i) we can use the hypothesis for $(t - 1, u)$ to conclude
$$\mathbb{E}[D(P, C_t) \mid a_1 \in P^c] \leq (1 + H_{t-1}) \big( D(P^c, B) + 8 \cdot D_{opt}(P^u) \big) + \frac{u - (t - 1)}{u} D(P^u, B)$$
and
$$\frac{D(P^c, B)}{D(P, B)} \, \mathbb{E}[D(P, C_t) \mid a_1 \in P^c] \leq \frac{D(P^c, B)}{D(P, B)} \Big( (1 + H_{t-1}) \big( D(P^c, B) + 8 \cdot D_{opt}(P^u) \big) + \frac{u - (t - 1)}{u} D(P^u, B) \Big). \tag{3.1}$$


In case (ii) we partition the event $a_1 \in P^u$ into the disjoint subevents $a_1 \in A$, for $A \subseteq P^u$, $A \in \mathcal{C}_{opt}$. Hence
$$\Pr[a_1 \in P^u] = \sum_{A \subseteq P^u,\, A \in \mathcal{C}_{opt}} \Pr[a_1 \in A].$$
Using the induction hypothesis for the pair $(t - 1, u - 1)$ and Lemma 3.10 we obtain
$$\begin{aligned}
\mathbb{E}[D(P, C_t) \mid a_1 \in A] &\leq (1 + H_{t-1}) \big( D(P^c, B) + 8 \cdot D_{opt}(A) + 8 \cdot D_{opt}(P^u) - 8 \cdot D_{opt}(A) \big) + \frac{u - t}{u - 1} \big( D(P^u, B) - D(A, B) \big) \\
&= (1 + H_{t-1}) \big( D(P^c, B) + 8 \cdot D_{opt}(P^u) \big) + \frac{u - t}{u - 1} \big( D(P^u, B) - D(A, B) \big).
\end{aligned}$$
Summing over all $A$ and using the following consequence of the power-mean inequality,
$$\sum_{A \subseteq P^u,\, A \in \mathcal{C}_{opt}} D(A, B)^2 \geq \frac{1}{u} D(P^u, B)^2,$$
we obtain
$$\begin{aligned}
\sum_{A \subseteq P^u,\, A \in \mathcal{C}_{opt}} &\Pr[a_1 \in A] \cdot \mathbb{E}[D(P, C_t) \mid a_1 \in A] \\
&\leq \sum_{A \subseteq P^u,\, A \in \mathcal{C}_{opt}} \frac{D(A, B)}{D(P, B)} \Big( (1 + H_{t-1}) \big( D(P^c, B) + 8 \cdot D_{opt}(P^u) \big) + \frac{u - t}{u - 1} \big( D(P^u, B) - D(A, B) \big) \Big) \\
&\leq \frac{D(P^u, B)}{D(P, B)} (1 + H_{t-1}) \big( D(P^c, B) + 8 \cdot D_{opt}(P^u) \big) + \frac{1}{D(P, B)} \cdot \frac{u - t}{u - 1} \Big( D(P^u, B)^2 - \frac{1}{u} D(P^u, B)^2 \Big) \\
&= \frac{D(P^u, B)}{D(P, B)} \Big( (1 + H_{t-1}) \big( D(P^c, B) + 8 \cdot D_{opt}(P^u) \big) + \frac{u - t}{u} D(P^u, B) \Big).
\end{aligned}$$


Combining this with (3.1) leads to
$$\begin{aligned}
\mathbb{E}[D(P, C_t)] &= \frac{D(P^c, B)}{D(P, B)} \, \mathbb{E}[D(P, C_t) \mid a_1 \in P^c] + \sum_{A \subseteq P^u,\, A \in \mathcal{C}_{opt}} \Pr[a_1 \in A] \cdot \mathbb{E}[D(P, C_t) \mid a_1 \in A] \\
&\leq (1 + H_{t-1}) \big( D(P^c, B) + 8 \cdot D_{opt}(P^u) \big) + \frac{u - t}{u} D(P^u, B) + \frac{D(P^c, B)}{D(P, B)} \cdot \frac{1}{u} D(P^u, B) \\
&\leq \Big( 1 + H_{t-1} + \frac{1}{u} \Big) \big( D(P^c, B) + 8 \cdot D_{opt}(P^u) \big) + \frac{u - t}{u} D(P^u, B) \\
&\leq (1 + H_t) \big( D(P^c, B) + 8 \cdot D_{opt}(P^u) \big) + \frac{u - t}{u} D(P^u, B),
\end{aligned}$$
where we used $t \leq u$, and hence $\frac{1}{u} \leq \frac{1}{t}$, for the last inequality.

3.4 A constant factor approximation algorithm for the k-means problem

In this section we describe a constant factor approximation algorithm for the k-means problem. Since we only consider the squared Euclidean distance, we write $D(\cdot)$ instead of $D_{\ell_2^2}(\cdot)$. The algorithm will be a local improvement algorithm. More precisely, given some intermediate solution, i.e. a set of $k$ centroids, the algorithm will try to improve the quality of the solution by replacing a single centroid of the current solution by some other centroid, until no more improvement is possible. To make this idea work properly, we restrict the set of centroids to some finite set of candidates.

Let $P \subseteq \mathbb{R}^d$ be the set of input points. As usual, by $C_{opt}$ we denote a set of optimal centroids for the k-means problem on input $P$. Let $T \subseteq \mathbb{R}^d$ be some finite set. Then for $t \in T$,
$$N_T(t) := \{ p \in P : D(p, t) \leq D(p, y) \text{ for all } y \in T \},$$
i.e. $N_T(t)$ is the set of points closer to $t$ than to any other point in $T$. For $p \in P$ and $t \in T$ we denote by $t_p$ the closest point in $T$ to $p$ (ties are broken


arbitrarily). Then
$$D(P, T) := \sum_{p \in P} D(p, t_p).$$

Let $K \subseteq \mathbb{R}^d$ be some finite set of points and let $c \geq 1$ be some constant. $K$ is a $(k, c)$-approximate candidate set for $P$ if
$$\exists S \subseteq K, |S| = k : \quad D(P, S) \leq c \cdot \mathrm{opt}_k(P) = c \cdot D(P, C_{opt}),$$
i.e. the best set of $k$ centroids in $K$ is at most $c$ times worse than the (overall) optimal set of $k$ centroids. The next lemma is not hard to prove; its proof is left as an exercise.

Lemma 3.12. For all finite sets $P \subseteq \mathbb{R}^d$ and all $k \in \mathbb{N}$, the set $P$ is a $(k, 2)$-approximate candidate set for itself.

    The following observation is also not hard to prove.

Observation. If $K$ is a $(k, 2)$-approximate candidate set for $P$ and if $D(P, S) \leq c \cdot \min_{T \subseteq K, |T| = k} D(P, T)$, then $D(P, S) \leq 2c \cdot \mathrm{opt}_k(P)$.

From now on we restrict ourselves to the candidate set $K = P$ and denote by
$$O := \operatorname*{argmin}_{S \subseteq P, |S| = k} D(P, S),$$
i.e. $O$ is an optimal set of centroids in $P$. However, most of the results presented in the sequel can be generalized (with appropriate but minor modifications) to arbitrary candidate sets. The following definition is fundamental for the constant factor approximation algorithm for k-means.

Definition 3.13. Let $S \subseteq P$ and $0 < \varepsilon < 1$.

1. $S$ is called stable if for all $s \in S$, $s' \in P \setminus S$:
$$D(P, S \setminus \{s\} \cup \{s'\}) \geq D(P, S).$$

2. $S$ is called $\varepsilon$-stable if for all $s \in S$, $s' \in P \setminus S$:
$$D(P, S \setminus \{s\} \cup \{s'\}) \geq (1 - \varepsilon) \, D(P, S).$$


Our goal is to compute stable sets. This leads to the following algorithm.

k-means-Li(P)
  choose a set S ⊆ P of k initial centroids;
  repeat
    find s ∈ S, s' ∈ P \ S with D(P, S \ {s} ∪ {s'}) < D(P, S);
    set S := S \ {s} ∪ {s'};
  until S is stable;

Unfortunately, the number of loop iterations of this algorithm cannot be bounded by a polynomial in |P|. Hence, we modify it to obtain

k-means-Li(P)
  choose a set S ⊆ P of k initial centroids;
  repeat
    find s ∈ S, s' ∈ P \ S with D(P, S \ {s} ∪ {s'}) < (1 − ε) D(P, S);
    set S := S \ {s} ∪ {s'};
  until S is ε-stable;
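As an illustration only, here is a brute-force NumPy sketch of the ε-stable variant (my own code, not from the notes; the names cost and kmeans_li are hypothetical and no attempt is made at an efficient implementation):

import numpy as np

def cost(P, S):
    # D(P, S): assign every point of P to its closest candidate centroid in S
    d2 = ((P[:, None, :] - S[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def kmeans_li(P, k, eps=0.1, rng=np.random.default_rng(0)):
    # single-swap local search over the candidate set K = P until S is eps-stable
    idx = set(rng.choice(len(P), size=k, replace=False).tolist())
    improved = True
    while improved:
        improved = False
        current = cost(P, P[list(idx)])
        for s in list(idx):                       # centroid to swap out
            for t in range(len(P)):               # candidate centroid to swap in
                if t in idx:
                    continue
                new_idx = (idx - {s}) | {t}
                if cost(P, P[list(new_idx)]) < (1 - eps) * current:
                    idx = new_idx
                    improved = True
                    break
            if improved:
                break
    return P[list(idx)]

Each accepted swap decreases the cost by a factor of at least (1 − ε); this geometric decrease is what the polynomial running-time bound mentioned below rests on.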

Using standard techniques, it can be shown that for fixed $\varepsilon$ this algorithm has polynomial running time. The quality of the solutions found by the two algorithms follows immediately from the main theorem of this section.

Theorem 3.14. If $S$ is a stable set, then
$$D(P, S) \leq 25 \cdot D(P, O).$$
If $S$ is an $\varepsilon$-stable set, then
$$D(P, S) \leq \Big( \frac{5}{1 - \varepsilon} \Big)^2 \cdot D(P, O).$$
We will only prove the first part of this theorem; the proof of the second part is an easy extension that we leave as an exercise. Combining this theorem with Lemma 3.12, we obtain

Corollary 3.15. For any $\varepsilon > 0$, the k-means problem can be approximated with factor $50 + \varepsilon$ in time polynomial in the input size and in $1/\varepsilon$.


In the rest of this section we prove the first part of Theorem 3.14. First we need some notation and definitions. Recall that $O \subseteq P$, $|O| = k$, denotes an optimal set of centroids in $P$. Let $S$ be a stable set, $|S| = k$. We call $S$ a set of heuristic centroids.

If $s \in S$ is a closest point in $S$ to $o \in O$, then we say that $s$ captures $o$, or that $o$ is captured by $s$, and we write $s = s_o$.

If $s \in S$ captures no element of $O$, then $s$ is called lonely.

Next we partition $S$ into $S_1, \ldots, S_m$ and $O$ into $O_1, \ldots, O_m$ such that

1. $|S_i| = |O_i|$ for $i = 1, \ldots, m$,

2. if $s \in S_i$, then either $s$ is lonely or $s$ captures all $o \in O_i$.

One easily sees that such a partitioning always exists. An example is given in Figure 3.3. The edges in this figure denote the capture relation.

[Figure 3.3: Partitioning optimal and heuristic centers consistent with captured points — heuristic centroids grouped into $S_1, S_2, S_3$ and optimal centroids into $O_1, O_2, O_3$, with edges indicating which optimal centers are captured.]

Given a partitioning as above, $(s_1, o_1), \ldots, (s_k, o_k)$ are called swap pairs if

1. $\forall j : (s_j, o_j) \in S_i \times O_i$ for some $i$,

2. each $o \in O$ is contained in exactly one pair,

3. each $s \in S$ is contained in at most two pairs,

4. for each pair $(s_j, o_j)$ the element $s_j$ captures no $o \neq o_j$.


[Figure 3.4: Swap pairs for the partitioning of Figure 3.3.]

    One easily sees that

Observation. For each partitioning of centroids $S_1, \ldots, S_m$ and $O_1, \ldots, O_m$ there is a set of swap pairs.

For the example shown in Figure 3.3, possible swap pairs are shown in Figure 3.4. Let $(s, o)$ be a swap pair in the set $\{(s_1, o_1), \ldots, (s_k, o_k)\}$ and let $C_1, \ldots, C_k$ be the clusters for the set $S = \{s_1, \ldots, s_k\}$. We define a new clustering or assignment of points in $P$ for the set of centroids $S' = S \setminus \{s\} \cup \{o\}$ as follows:

1. if $q \notin N_S(s) \cup N_O(o)$, then $q$ stays in its old cluster,

2. if $q \in N_O(o)$, then $q$ is assigned to $o$'s cluster,

3. if $q \in N_S(s) \setminus N_O(o)$, then $q$ is assigned to the cluster belonging to $s_{o_q}$.

Using our notation, $s_{o_q}$ is the heuristic center that is closest to the optimal center closest to $q$. To show that this reassignment leads to some clustering with respect to the centroids in $S'$, we need to argue that for $q \in N_S(s) \setminus N_O(o)$ we have $s_{o_q} \in S'$. However, $q \in N_S(s) \setminus N_O(o)$ implies $o_q \neq o$. Since $(s, o)$ is a swap pair, $s$ captures at most $o$. In particular, $s$ does not capture $o_q$. Therefore $s_{o_q} \neq s$ and hence $s_{o_q} \in S'$.

Note that the clustering defined by reassigning points as above need not be the best possible assignment of points to the centroids in $S'$. The difference in cost between the clustering defined above and the clustering with respect to $S$ is given by
$$\sum_{q \in N_O(o)} \big( D(q, o) - D(q, s_q) \big) + \sum_{q \in N_S(s) \setminus N_O(o)} \big( D(q, s_{o_q}) - D(q, s) \big). \tag{3.2}$$


Lemma 3.16. Let $S$ be a stable set. Then
$$0 \leq D(P, O) - 3 D(P, S) + 2R,$$
where $R := \sum_{q \in P} D(q, s_{o_q})$.

Proof. By stability, for any swap pair $(s, o)$,
$$\sum_{q \in N_O(o)} \big( D(q, o) - D(q, s_q) \big) + \sum_{q \in N_S(s) \setminus N_O(o)} \big( D(q, s_{o_q}) - D(q, s) \big) \geq 0.$$
For any $q \in N_S(s)$ the term in the second sum is always non-negative. Hence we also obtain
$$\sum_{q \in N_O(o)} \big( D(q, o) - D(q, s_q) \big) + \sum_{q \in N_S(s)} \big( D(q, s_{o_q}) - D(q, s) \big) \geq 0.$$
Since each $o \in O$ is contained in exactly one swap pair and each $s \in S$ is contained in at most two swap pairs, we get from the previous inequality, by summing over all swap pairs,
$$\sum_{q \in P} \big( D(q, o_q) - D(q, s_q) \big) + 2 \sum_{q \in P} \big( D(q, s_{o_q}) - D(q, s_q) \big) \geq 0.$$
Rearranging terms, this leads to
$$\sum_{q \in P} D(q, o_q) - 3 \sum_{q \in P} D(q, s_q) + 2 \sum_{q \in P} D(q, s_{o_q}) \geq 0.$$
By definition of $R$, this proves the lemma.

We need to estimate $R$ in terms of $D(P, O)$ and $D(P, S)$. By rearranging terms again we get
$$R = \sum_{o \in O} \sum_{q \in N_O(o)} D(q, s_o) = \sum_{o \in O} D(N_O(o), \{s_o\}).$$
By $c(o)$ we denote the center of gravity of $N_O(o)$, i.e.
$$c(o) := \frac{1}{|N_O(o)|} \sum_{q \in N_O(o)} q.$$


Then we get
$$\begin{aligned}
R &= \sum_{o \in O} D(N_O(o), \{s_o\}) \\
&\overset{\text{Lemma 3.1}}{=} \sum_{o \in O} \Big( D(N_O(o), \{c(o)\}) + |N_O(o)| \cdot D(c(o), s_o) \Big) \\
&= \sum_{o \in O} \sum_{q \in N_O(o)} \big( D(q, c(o)) + D(c(o), s_o) \big) \\
&\overset{\text{def. of } s_o}{\leq} \sum_{o \in O} \sum_{q \in N_O(o)} \big( D(q, c(o)) + D(c(o), s_q) \big) \\
&= \sum_{q \in P} \big( D(q, c(o_q)) + D(c(o_q), s_q) \big) \\
&\overset{\triangle\text{-inequality}}{\leq} \sum_{q \in P} D(q, c(o_q)) + \sum_{q \in P} \big( \|c(o_q) - q\| + \|q - s_q\| \big)^2 \\
&\overset{\text{Lemma 3.1}}{\leq} D(P, O) + \sum_{q \in P} \|c(o_q) - q\|^2 + \sum_{q \in P} \|q - s_q\|^2 + 2 \sum_{q \in P} \|c(o_q) - q\| \cdot \|q - s_q\| \\
&\overset{\text{Lemma 3.1}}{\leq} 2 \cdot D(P, O) + D(P, S) + 2 \sum_{q \in P} \|c(o_q) - q\| \cdot \|q - s_q\|.
\end{aligned}$$

To analyze the sum in the last expression we use the following lemma, which is a reformulation of the Cauchy-Schwarz inequality.

Lemma 3.17. Let $\alpha_1, \ldots, \alpha_n$ and $\beta_1, \ldots, \beta_n$ be two sequences of real numbers and set
$$\alpha^2 := \frac{\sum_{i=1}^n \alpha_i^2}{\sum_{i=1}^n \beta_i^2}.$$
Then
$$\sum_{i=1}^n \alpha_i \beta_i \leq \frac{1}{\alpha} \sum_{i=1}^n \alpha_i^2.$$
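For completeness, here is a short derivation of Lemma 3.17 from the Cauchy-Schwarz inequality (the notes leave the proof implicit; this filling-in is mine and assumes $\alpha > 0$, i.e. neither sequence is identically zero):
$$\sum_{i=1}^n \alpha_i \beta_i \leq \Big( \sum_{i=1}^n \alpha_i^2 \Big)^{1/2} \Big( \sum_{i=1}^n \beta_i^2 \Big)^{1/2} = \Big( \sum_{i=1}^n \alpha_i^2 \Big)^{1/2} \cdot \frac{1}{\alpha} \Big( \sum_{i=1}^n \alpha_i^2 \Big)^{1/2} = \frac{1}{\alpha} \sum_{i=1}^n \alpha_i^2,$$
since $\big( \sum_i \beta_i^2 \big)^{1/2} = \frac{1}{\alpha} \big( \sum_i \alpha_i^2 \big)^{1/2}$ by the definition of $\alpha^2$.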

We apply this lemma to the sequences $\alpha_q = \|q - s_q\|$, $q \in P$, and $\beta_q = \|c(o_q) - q\|$, $q \in P$, to obtain
$$R \leq 2 \cdot D(P, O) + D(P, S) + \frac{2}{\hat{\alpha}} \sum_{q \in P} \|q - s_q\|^2 = 2 \cdot D(P, O) + D(P, S) + \frac{2}{\hat{\alpha}} \, D(P, S),$$
where $\hat{\alpha}^2 = \sum_{q \in P} \|q - s_q\|^2 / \sum_{q \in P} \|c(o_q) - q\|^2$ is the ratio from Lemma 3.17.

Again using Lemma 3.1 we see that
$$D(P, O) = \sum_{o \in O} D(N_O(o), \{o\}) \geq \sum_{o \in O} D(N_O(o), \{c(o)\}).$$

Setting
$$\alpha^2 := \frac{D(P, S)}{D(P, O)},$$
this implies $\alpha \leq \hat{\alpha}$ or, equivalently, $\frac{1}{\hat{\alpha}} \leq \frac{1}{\alpha}$. We conclude

$$R \leq 2 \cdot D(P, O) + D(P, S) + \frac{2}{\hat{\alpha}} \, D(P, S) \leq 2 \cdot D(P, O) + D(P, S) + \frac{2}{\alpha} \, D(P, S) = 2 \cdot D(P, O) + \Big( 1 + \frac{2}{\alpha} \Big) D(P, S).$$
Note that $\alpha^2$ is the approximation factor we are interested in.

Combining Lemma 3.16 with our last inequality for $R$ we get
$$0 \leq D(P, O) - 3 D(P, S) + 2 \Big( 2 \cdot D(P, O) + \Big( 1 + \frac{2}{\alpha} \Big) D(P, S) \Big) = 5 \cdot D(P, O) - \Big( 1 - \frac{4}{\alpha} \Big) D(P, S).$$
Equivalently,
$$\frac{5}{1 - 4/\alpha} \geq \frac{D(P, S)}{D(P, O)} = \alpha^2.$$
This implies
$$5 \geq \alpha^2 \Big( 1 - \frac{4}{\alpha} \Big),$$
or
$$0 \geq \alpha^2 - 4\alpha - 5 = (\alpha + 1)(\alpha - 5).$$
Since the first factor is positive for $\alpha > 0$, the second factor must be non-positive. Hence $\alpha \leq 5$, which by definition of $\alpha$ proves Theorem 3.14.
