Closest Pairs COMP 3711H - HKUST Version of 27/11/2014 M. J. Golin


Outline

• Introduction

• The Gridding Lemma

• An O(n log n) sweep line algorithm

• An O(n) randomized algorithm

– Hashing a point set

– The algorithm


Introduction

• Most algorithms seen so far have assumed that items have 1-dimensional weights or keys, i.e., they can be ordered.

• There is often a need, e.g., in graphics and databases, to manipulate multi-dimensional data. The area of algorithms that treats geometric (multi-dimensional) data is called computational geometry.

• In this set of slides, we will see two algorithms for solving the 2-dimensional closest-pair problem.

• The first algorithm will take O(n log n) worst-case time to find the closest pair among n points. The second algorithm will be randomized and only require O(n) average time.


Problem Definitions

• Problem input is $P = \{p_1, p_2, \ldots, p_n\}$, a set of $n$ 2-dimensional points, where $p = (p.x, p.y)$.

• The distance between $p$ and $p'$ is $d(p, p') = \sqrt{(p.x - p'.x)^2 + (p.y - p'.y)^2}$.

• Set $d(p, P) = \min_{p' \in P - \{p\}} d(p, p')$ to be the distance from $p$ to its nearest point.

• Set $\delta(P) = \min_{p \in P} d(p, P)$ to be the closest-pair distance in $P$.

• The Closest-Pair Problem is to find $p, p' \in P$ such that $d(p, p') = \delta(P)$.
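The definitions above translate directly into a quadratic-time baseline. The following is a minimal sketch, assuming points are stored as `(x, y)` tuples; the function names are illustrative and do not come from the slides.

```python
import math
from itertools import combinations

def dist(p, q):
    """Euclidean distance d(p, q) between two (x, y) tuples."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def closest_pair_brute_force(points):
    """Return (delta, p, q) by checking all pairs; O(n^2) time, needs n >= 2."""
    return min((dist(p, q), p, q) for p, q in combinations(points, 2))
```

A baseline like this is mainly useful for checking the faster algorithms on small inputs.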


The Gridding Lemma

The Gridding Lemma:

Let $P = \{p_1, p_2, \ldots, p_n\}$ be a set of points and $\delta = \delta(P)$ their closest-pair distance. Grid the plane into $\frac{\delta}{2} \times \frac{\delta}{2}$ boxes using horizontal and vertical lines. Then

(a) No box holds more than 1 point.

(b) If $d(p, p') = \delta$ then the boxes containing $p$ and $p'$ are within two horizontal and two vertical boxes of each other.

Proof:

(a) If two points are in the same box they can be at most $\delta/\sqrt{2} < \delta$ away from each other, contradicting the definition of $\delta$.

(b) If two points $p, p'$ are more than two boxes horizontally (or vertically) away from each other then $|p.x - p'.x| > \delta$ (respectively $|p.y - p'.y| > \delta$), so $d(p, p') > \delta$, contradicting $d(p, p') = \delta$.
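Both parts of the lemma can be checked numerically on a small point set. The sketch below assumes distinct points given as `(x, y)` tuples and a grid anchored at the coordinate origin with half-open boxes; `check_gridding_lemma` is a hypothetical helper name.

```python
import math
from itertools import combinations

def check_gridding_lemma(points):
    """Check parts (a) and (b) of the lemma for a set of distinct points."""
    delta = min(math.dist(p, q) for p, q in combinations(points, 2))
    cell = delta / 2  # side length of one grid box
    box = {p: (math.floor(p[0] / cell), math.floor(p[1] / cell)) for p in points}
    # (a) every point sits in its own box
    one_point_per_box = len(set(box.values())) == len(points)
    # (b) pairs realising delta are at most two boxes apart in each direction
    closest_pairs_nearby = all(
        abs(box[p][0] - box[q][0]) <= 2 and abs(box[p][1] - box[q][1]) <= 2
        for p, q in combinations(points, 2)
        if math.dist(p, q) == delta
    )
    return one_point_per_box and closest_pairs_nearby
```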


A Sweep Algorithm

Sort the points by x coordinate.

[Figure: example point set $p_1, \ldots, p_9$.]

Set $P_i = \{p_1, \ldots, p_i\}$.

Note $\delta(P_i) = \min(\delta(P_{i-1}), d(p_i, P_{i-1}))$.

The algorithm sweeps a vertical line through the points $p_2, p_3, \ldots, p_n$, at each step using this equation to calculate $\delta(P_i)$.

At the end, $\delta = \delta(P_n)$.

In the example:

$\delta(P_2) = \delta(P_3) = \delta(P_4) = d(p_2, p_1)$

$\delta(P_5) = \delta(P_6) = \delta(P_7) = \delta(P_8) = d(p_5, p_4)$

$\delta = \delta(P_9) = d(p_9, p_8)$
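The recurrence can be traced with a deliberately naive sketch that computes $d(p_i, P_{i-1})$ by scanning all earlier points, so it runs in quadratic time. It only illustrates how $\delta(P_i)$ evolves during the sweep; the function name and the dictionary output are assumptions made for this example.

```python
import math

def incremental_deltas(points):
    """Return {i: delta(P_i)} for i = 2..n, with points sorted by x."""
    pts = sorted(points)  # p_1, ..., p_n by x (ties broken by y)
    deltas = {}
    delta = math.inf
    for i in range(1, len(pts)):
        # d(p_i, P_{i-1}), computed naively in O(i) time
        d_i = min(math.dist(pts[i], q) for q in pts[:i])
        delta = min(delta, d_i)
        deltas[i + 1] = delta  # slides use 1-based indices
    return deltas
```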


Updating One Sweep Step

$\delta(P_i) = \min(\delta(P_{i-1}), d(p_i, P_{i-1}))$. Set $\delta_i = \delta(P_i)$ and $d_i = d(p_i, P_{i-1})$.

If $d_i < \delta_{i-1}$ then $d_i = d(p_i, p_j)$ for some $j < i$ with $|p_i.x - p_j.x| \le d_i = \delta_i < \delta_{i-1}$.

If we grid the plane with a $\frac{\delta_{i-1}}{2} \times \frac{\delta_{i-1}}{2}$ grid with $p_i$ as the origin, then $p_j$ must be within 2 horizontal grid lines and 2 vertical grid lines of $p_i$.

The Gridding Lemma tells us there is at most one point of $P_{i-1}$ per box.

[Figure: grid around $p_i$, with side lengths labelled $\delta_{i-1}$.]

Let $S_i = \{p \in P_{i-1} : p_i.x - p.x < \delta_{i-1}\}$.

This implies: if $d_i < \delta_{i-1}$ then $d_i = d(p_i, p)$ where

(a) $p \in S_i$ and

(b) $p$ is one of the 4 points above $p_i$ in $S_i$, or $p$ is one of the 4 points below $p_i$ in $S_i$.


Updating One Sweep Step (ii)

$\delta_i = \min(\delta_{i-1}, d(p_i, P_{i-1}))$

Let $S_i = \{p \in P_{i-1} : p_i.x - p.x < \delta_{i-1}\}$. If $d_i < \delta_{i-1}$ then $d_i = d(p_i, p)$ where

(a) $p \in S_i$ and

(b) $p$ is one of the 4 points above $p_i$ in $S_i$, or $p$ is one of the 4 points below $p_i$ in $S_i$.

Suppose we knew $\delta_{i-1}$ and had the points in $S_i$ kept in a balanced binary search tree ordered by increasing y coordinate.

We could then calculate $\delta_i$ in O(log n) time as follows:

• In $S_i$, find the 4 points above $p_i$ and the 4 points below it. This takes O(log n) time.

• Find the distances between these (at most) 8 points and $p_i$. This takes O(1) time. Call the minimum distance $d'$ (note that there might be no points in $S_i$; in this case $d' = \infty$).

• Set $\delta_i = \min(\delta_{i-1}, d')$.
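One update step can be sketched with a y-sorted Python list and the standard `bisect` module standing in for the balanced BST. This is an assumption made for brevity: the lookup is still O(log n), but the list is not a BST. The strip $S_i$ is stored as `(y, x)` tuples so that tuple order is y order, and `sweep_step` is a hypothetical name.

```python
import bisect
import math

def sweep_step(strip, p, delta_prev):
    """Return delta_i given S_i as a y-sorted list of (y, x) tuples.

    p is the new point p_i as an (x, y) tuple; delta_prev is delta_{i-1}.
    """
    pos = bisect.bisect_left(strip, (p[1], p[0]))
    # up to 4 points just below p_i and up to 4 just above it in y order
    candidates = strip[max(0, pos - 4):pos + 4]
    d_prime = min((math.dist(p, (x, y)) for y, x in candidates),
                  default=math.inf)  # d' is infinity when S_i is empty
    return min(delta_prev, d_prime)
```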


The Full Sweep Algorithm

We just developed the following algorithm to calculate $\delta_2, \delta_3, \ldots, \delta_n$.

Let $S_i = \{p \in P_{i-1} : p_i.x - p.x < \delta_{i-1}\}$.

(A) Set $\delta_2 = d(p_1, p_2)$; create a y-sorted balanced BST on $S_2 = \{p_1\}$.

(B) For $i = 3$ to $n$:

1. Update the BST from storing $S_{i-1}$ to storing $S_i$.

2. In O(log n) time, use the BST storing $S_i$ to calculate $\delta_i$.

Step (A) uses O(1) time. Since step (B)(2) uses O(log n) time per call and is called O(n) times, all the calls to it use only O(n log n) time in total.

If we can implement step (B)(1) in O(n log n) total time, then the entire algorithm, including the initial sort by x coordinate, is O(n log n).


The Final Piece of Analysis

The Problem: Given $p_1, p_2, \ldots, p_n$ sorted by nondecreasing x-coordinate and a sequence $\delta_2 \ge \delta_3 \ge \cdots \ge \delta_{n-1}$, iteratively construct balanced BSTs storing the sets $S_2, S_3, \ldots, S_n$ sorted by y-coordinate. Use only O(n log n) time in total.

For $k < i$, if $p_k \notin S_i$ then $p_{k-1} \notin S_i$, because

$p_i.x - p_{k-1}.x \ge p_i.x - p_k.x \ge \delta_{i-1}$,

so $S_i$ is a contiguous sequence of points $p_k, p_{k+1}, \ldots, p_{i-1}$.

Furthermore, if $p_k \notin S_{i-1}$ then $p_k \notin S_i$, because

$p_i.x - p_k.x \ge p_{i-1}.x - p_k.x \ge \delta_{i-2} \ge \delta_{i-1}$.

⇒ $S_{i-1}$ can be updated to $S_i$ by (a) adding $p_{i-1}$ to the BST and then (b) walking rightwards from the leftmost point in $S_{i-1}$, removing points from the BST, until finding the new leftmost point in $S_i$.

Every point is added once and removed at most once, so we perform O(n) inserts and O(n) deletes from the BST, requiring only O(n log n) time.
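Putting the strip maintenance and the update step together gives the sketch below. It assumes a plain y-sorted Python list in place of the balanced BST, so each insert or delete costs O(n) rather than O(log n). The control flow follows the slides, but the running time of this sketch is not O(n log n). The function name is illustrative.

```python
import bisect
import math

def closest_pair_sweep(points):
    """Return delta(P) via the sweep; needs at least two (x, y) points."""
    pts = sorted(points)               # p_1..p_n by nondecreasing x
    delta = math.dist(pts[0], pts[1])  # delta_2
    strip = [(pts[0][1], pts[0][0])]   # S_2 = {p_1}, stored as (y, x)
    left = 0                           # index of leftmost point still in strip
    for i in range(2, len(pts)):
        p = pts[i]
        # (a) add p_{i-1} to the strip
        prev = pts[i - 1]
        bisect.insort(strip, (prev[1], prev[0]))
        # (b) walk rightwards, evicting points at x-distance >= delta_{i-1}
        while left < i and p[0] - pts[left][0] >= delta:
            q = pts[left]
            strip.pop(bisect.bisect_left(strip, (q[1], q[0])))
            left += 1
        # compare p_i with up to 4 strip points below and 4 above it
        pos = bisect.bisect_left(strip, (p[1], p[0]))
        for y, x in strip[max(0, pos - 4):pos + 4]:
            delta = min(delta, math.dist(p, (x, y)))
    return delta
```

Swapping the list for a balanced tree (for example a red-black tree keyed on y) would restore the O(log n) bound per insert and delete that the analysis relies on.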


Odds & Ends

• This approach of sorting the points by x and then updating the solution by moving from left to right is a common one in computational geometry. It is called the sweep-line technique.

• It is possible to prove that Ω(n log n) is a lower bound for solving this problem in the algebraic decision tree model of computation (a generalization of the comparison tree model we used for sorting). So, our algorithm is optimal in this model.

• With randomization and the use of the floor function ($\lfloor x \rfloor$) this can be improved to an O(n) average-case time randomized algorithm.


A Randomized Algorithm

• In this section we will see a randomized algorithm for finding the closest pair in O(n) average case time.

• Input is n two-dimensional points.

• The randomization will come from two places:

– The first step of the algorithm will be to randomly order the input. That is, it will choose one of the n! possible orderings of the n points uniformly at random and label the points as $p_1, p_2, \ldots, p_n$ in that order.

– The only other place randomization is used will be in the use of (universal) hashing.

• Before starting the main part of the algorithm we perform an O(n) scan finding $x_{\min}$, $x_{\max}$, $y_{\min}$, $y_{\max}$, the smallest and largest x and y coordinates among all of the points.


Hashing a Sparse Array

• Given an $s \times s$ array in which at most $n$ items exist (are non-zero).

• We would like to be able to perform the following two operations in O(1) time each.

– Insert(x): Insert new item x into the array

– Search(x): If x exists, return it (else return nil)

• In practice, an array item is a pointer to a record. A successful search returns a pointer to the record's data.

• This can be solved using hashing.

• Array index is $(i, j)$ with $0 \le i, j < s$. Array indices can be mapped uniquely to the integer $i + js < s^2 + 1$.

• Applying universal hashing with $U = s^2 + 1$ and $m = n$ gives O(1) average case insert and search.

[Figure: an $s \times s$ array with axes $i$ and $j$ and one marked cell $(i, j)$.]
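The index mapping can be sketched as follows, assuming indices satisfy $0 \le i, j < s$ so that $i + js$ is unique. Python's built-in `dict` is used as the hash table here; it is a stand-in for the universal hashing scheme the slides refer to, not an implementation of it. The class name `SparseGrid` is hypothetical.

```python
class SparseGrid:
    """Sparse s x s array backed by a hash table keyed on i + j*s."""

    def __init__(self, s):
        self.s = s
        self.table = {}

    def _key(self, i, j):
        # flatten (i, j) to a single integer; unique when 0 <= i, j < s
        if not (0 <= i < self.s and 0 <= j < self.s):
            raise IndexError("index out of range")
        return i + j * self.s

    def insert(self, i, j, item):
        self.table[self._key(i, j)] = item

    def search(self, i, j):
        """Return the stored item, or None if the cell is empty."""
        return self.table.get(self._key(i, j))
```

Storage is proportional to the number of inserted items rather than to $s^2$, which is the point of hashing the array.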


Bucketing: Hashing a Grid

• Let $p_0 = (x_{\min}, y_{\min})$; $D = \max(x_{\max} - x_{\min},\; y_{\max} - y_{\min})$.

[Figure: square of side $D$ with lower-left corner $p_0$.]

• A grid of size $d$ will denote a horizontal/vertical grid with grid separation $d$ between adjacent parallel lines and $p_0$ as the origin.

• The grid box labelled $(X, Y)$ will contain all points $(x, y)$ with $X \le x \le X + d$ and $Y \le y \le Y + d$ (and an appropriate tie-breaking rule for points on the boundaries).

• $(x, y)$ will be in the box $\left( \left\lfloor \frac{x - x_{\min}}{d} \right\rfloor, \left\lfloor \frac{y - y_{\min}}{d} \right\rfloor \right)$.

• Maximum box coordinate is $\lceil D/d \rceil$.

• The entire point set is contained in the square of side $D$ with lower-left corner $p_0$.
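The box formula is a pair of floor operations. A minimal sketch, assuming points are `(x, y)` tuples and using illustrative helper names `grid_frame` and `box_of`:

```python
import math

def grid_frame(points):
    """Return the grid origin p0 = (xmin, ymin) and the bounding side length D."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys)), max(max(xs) - min(xs), max(ys) - min(ys))

def box_of(point, origin, d):
    """Integer box coordinates of point in a grid of size d anchored at origin."""
    return (math.floor((point[0] - origin[0]) / d),
            math.floor((point[1] - origin[1]) / d))
```

Using `floor` makes every box half-open, which is one possible tie-breaking rule for points on grid lines.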


Hashing the Nearest Neighbor Grid

• $(x, y)$ will be in the box $\left( \left\lfloor \frac{x - x_{\min}}{d} \right\rfloor, \left\lfloor \frac{y - y_{\min}}{d} \right\rfloor \right)$.

• Maximum box coordinate is $\lceil D/d \rceil$.

• Using the array hashing technique discussed before, we can hash the points by storing them in their corresponding array element (bucket). Non-empty buckets will return a pointer to the list of points in that grid box.

• Let $P_i = \{p_1, p_2, \ldots, p_i\}$, $\delta_i = \delta(P_i)$, and let the grid size be $d = \delta_i/2$.

• Inserting a point into the grid takes average O(1) time. Searching for points in a grid box takes average O(1) time plus the number of points in the box.

• The Gridding Lemma says that each grid box contains at most 1 point.

• ⇒ Finding points in a grid box takes average O(1) time.
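Bucketing a point set into the grid might look like the sketch below. It assumes Python's `dict` with tuple keys as the hash table, rather than the flattened integer key with universal hashing; `build_grid` is a hypothetical name.

```python
import math
from collections import defaultdict

def build_grid(points, origin, d):
    """Bucket every point into its grid box; maps (X, Y) -> list of points."""
    grid = defaultdict(list)
    for x, y in points:
        box = (math.floor((x - origin[0]) / d),
               math.floor((y - origin[1]) / d))
        grid[box].append((x, y))
    return grid
```

When `d` is half the closest-pair distance of `points`, each bucket list has length 1, as the Gridding Lemma predicts.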


One Algorithmic Step

• Suppose we know $\delta_i$ and have hashed and stored the point set $P_i$ with grid size $d = \delta_i/2$.

• How can we calculate $\delta_{i+1}$ and, after completion, have $P_{i+1}$ hashed and stored with grid size $\delta_{i+1}/2$?

• Note that $\delta_{i+1} = \min(\delta_i, d(p_{i+1}, P_i))$.

• $\delta_{i+1} < \delta_i$ iff $d(p_{i+1}, P_i) = d(p_{i+1}, p) < \delta_i$ for some $p \in P_i$ with $p$ in one of the 25 boxes within distance 2 of the box containing $p_{i+1}$.

• Using the hash search, these 25 boxes and their contents (at most one point each) can be found in O(1) time. In O(1) further time, find $d'$, the min distance between $p_{i+1}$ and these ≤ 25 points.

• If $d' \ge \delta_i$ then $\delta_{i+1} = \delta_i$ ⇒ insert $p_{i+1}$ into the grid in O(1) average time.

If $d' < \delta_i$ then $\delta_{i+1} = d'$ ⇒ throw away the entire old grid and, in O(i) average time, regrid $P_{i+1}$ with the new $d = d'/2$.
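The 25-box scan that produces $d'$ can be sketched as below, assuming the grid is a plain `dict` mapping integer box coordinates to lists of `(x, y)` points; the function name is illustrative.

```python
import math

def nearest_in_25_boxes(grid, box, p):
    """Min distance from p to points stored in the 5x5 block centred on box."""
    bx, by = box
    best = math.inf  # d' is infinity if all 25 boxes are empty
    for dx in range(-2, 3):
        for dy in range(-2, 3):
            for q in grid.get((bx + dx, by + dy), ()):
                best = min(best, math.dist(p, q))
    return best
```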


The Complete Algorithm

• Randomly label the points as $p_1, p_2, \ldots, p_n$. Perform an O(n) scan finding $x_{\min}$, $x_{\max}$, $y_{\min}$, $y_{\max}$, the smallest and largest x and y coordinates among all of the points.

• Set $\delta_2 = d(p_1, p_2)$. Hash $P_2$ into the grid with $d = \delta_2/2$.

• For $i = 2$ to $n - 1$ do

(A) Find the min distance $d'$ between $p_{i+1}$ and the ≤ 25 points in the 25 grid boxes surrounding $p_{i+1}$.

(B) (i) If $d' \ge \delta_i$ then set $\delta_{i+1} = \delta_i$ and insert $p_{i+1}$ into the old grid;

(ii) else ($d' < \delta_i$) set $\delta_{i+1} = d'$ and regrid all of $P_{i+1}$ with grid size $d = \delta_{i+1}/2$.

Average Run Time

1st Step: O(n)

2nd Step: O(1)

All 3(A): O(n)

All 3(B)(i): O(n)

Total = O(n) + time for all 3(B)(ii)
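The whole procedure fits in a short sketch. It assumes a Python `dict` keyed by box coordinates in place of universal hashing and an optional `rng` parameter (a hypothetical addition, for reproducible shuffles). It also returns early when two points coincide, a case the slides do not discuss.

```python
import math
import random

def closest_pair_randomized(points, rng=random):
    """Return delta(P) for at least two (x, y) points."""
    pts = list(points)
    rng.shuffle(pts)  # random labelling p_1..p_n
    x0 = min(x for x, _ in pts)
    y0 = min(y for _, y in pts)

    def box(p, d):
        return (math.floor((p[0] - x0) / d), math.floor((p[1] - y0) / d))

    def regrid(k, d):
        grid = {}
        for q in pts[:k]:  # hash P_k with grid size d
            grid.setdefault(box(q, d), []).append(q)
        return grid

    delta = math.dist(pts[0], pts[1])  # delta_2
    if delta == 0:
        return 0.0
    grid = regrid(2, delta / 2)
    for i in range(2, len(pts)):
        p = pts[i]
        bx, by = box(p, delta / 2)
        # (A) scan the 25 boxes around p
        d_prime = min((math.dist(p, q)
                       for dx in range(-2, 3) for dy in range(-2, 3)
                       for q in grid.get((bx + dx, by + dy), ())),
                      default=math.inf)
        if d_prime >= delta:
            grid.setdefault((bx, by), []).append(p)  # (B)(i) keep old grid
        else:
            delta = d_prime                          # (B)(ii) regrid
            if delta == 0:
                return 0.0
            grid = regrid(i + 1, delta / 2)
    return delta
```

For example, `closest_pair_randomized(pts, random.Random(0))` gives a repeatable run; the returned distance does not depend on the shuffle, only the running time does.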


Final Piece

• We just saw that the algorithm has O(n) average running time, with the exception of 3(B)(ii).

• 3(B)(ii) can be rephrased as follows, when $p_{i+1}$ is processed: if $\delta_{i+1} = \delta_i$ do nothing; else do O(i + 1) average work.

• Note that trivially $\delta_{i+1} = d(p_j, p_k)$ for some $j, k \le i + 1$. $\delta_{i+1} < \delta_i$ only if $j = i + 1$ or $k = i + 1$.

• Points are in random order, so the probability of this happening is at most $2/(i+1)$, i.e., $\Pr(\delta_{i+1} < \delta_i) \le \frac{2}{i+1}$.

• Let $X_i$ be the amount of work done by 3(B)(ii) when $p_{i+1}$ is processed.

$X_i = 0$ with probability at least $1 - \frac{2}{i+1}$; $X_i = O(i + 1)$ with probability at most $\frac{2}{i+1}$.

⇒ expected value $E(X_i) \le \frac{2}{i+1}\, O(i + 1) = O(1)$.

• Total work by 3(B)(ii) is $Y = \sum_{i=2}^{n-1} X_i$. Then the total average work is

$E(Y) = E\left(\sum_{i=2}^{n-1} X_i\right) = \sum_{i=2}^{n-1} E(X_i) = O(n)$.
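To make the last sum concrete, suppose (as an assumption for illustration) that regridding $P_{i+1}$ costs at most $c\,(i+1)$ for some constant $c$, the constant hidden in the $O(i+1)$ term. The factors of $i + 1$ then cancel term by term:

$$E(Y) = \sum_{i=2}^{n-1} E(X_i) \le \sum_{i=2}^{n-1} \frac{2}{i+1} \cdot c\,(i+1) = \sum_{i=2}^{n-1} 2c = 2c\,(n-2) = O(n).$$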


The Complete Algorithm

With the analysis of 3(B)(ii) in hand, every step of the algorithm has a bounded average cost.

Average Run Time

1st Step (random labelling and min/max scan): O(n)

2nd Step (computing $\delta_2$ and hashing $P_2$): O(1)

All 3(A): O(n)

All 3(B)(i): O(n)

All 3(B)(ii): O(n)

Total = O(n)