Mehlhorn-Tsakalidis Revisited

Christos Zaroliagis

[Joint work with A. Kaporis, C. Makris, S. Sioutas, A. Tsakalidis, & K. Tsichlas]

1 / 24

Mehlhorn-Tsakalidis - once upon a time ...

2 / 24

Dynamic Dictionary Search - Problem

Given a set of keys S = {X1, . . . , Xn}, Xi ∈ [a, b] ⊂ ℝ

Maintain (under key insertions and deletions) a non-decreasing ordering

P = {X(1), . . . , X(n)} of S such that:

given a query element y

find largest X(j) ∈ P : X(j) ≤ y

3 / 24

Dynamic Dictionary Search - Classical methods

Use arbitrary rule to select splitting element X(k) ∈ P that splits P into two

subsets, and recurse

e.g., in binary search, splitting element = middle element

Balanced search trees (AVL-trees, red-black trees, (a, b)-trees, etc):

search & update time: O(log n)

Bounds

• optimal for Pointer Machine

• RAM: Θ(√(log n / log log n)) [Andersson & Thorup, 2001]

4 / 24

Surpassing the Bounds - Expected Complexities

Interpolation Search [Peterson,57]

Select splitting elements by taking advantage of the statistical properties of

the keys

Splitting elements are spread closer to query key y

Expected search time (static problem):

Θ(log log n), uniform distribution

[Yao & Yao,76], [Gonnet,77], [Perl & Reingold,77], [Perl, Itai & Avni,78],

[Gonnet, Rogers & George, 80]

Θ(log log n), regular (non-uniform) distribution

[Willard,85]
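A minimal sketch of static interpolation search, written for the predecessor query of the problem statement (largest key ≤ y). The function name and the clamping of the probe position are choices of this sketch, not taken from the cited papers; it only illustrates how the splitting element is chosen from the key values rather than from the middle of the range.

```python
def interpolation_search(keys, y):
    """Return the index of the largest element <= y in sorted `keys`, or -1."""
    lo, hi = 0, len(keys) - 1
    if hi < 0 or y < keys[lo]:
        return -1
    if y >= keys[hi]:
        return hi
    # Invariant: keys[lo] <= y < keys[hi], hence keys[hi] > keys[lo].
    while hi - lo > 1:
        # Estimate the position of y by linear interpolation on the key values.
        frac = (y - keys[lo]) / (keys[hi] - keys[lo])
        mid = lo + int(frac * (hi - lo))
        # Keep the probe strictly inside (lo, hi) so the range always shrinks.
        mid = min(max(mid, lo + 1), hi - 1)
        if keys[mid] <= y:
            lo = mid
        else:
            hi = mid
    return lo


keys = [2, 3, 5, 7, 11, 13, 17, 19]
print(interpolation_search(keys, 12))  # 4 (keys[4] == 11)
```

On uniformly distributed keys the probe lands close to the answer, which is the intuition behind the Θ(log log n) expected bound; on skewed data the same loop can degrade to a linear number of probes.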

5 / 24

Dynamic Interpolation Search

Scenario: µ-random insertions, random deletions

µ uniform [Frederickson,83], [Itai,Konheim & Rodeh, 81]

search: O(log log n) expected time

update: O(n^ε) time, 0 < ε < 1

6 / 24

µ (f1, f2)-smooth (peak-less or peak-adjusting) [Mehlhorn & Tsakalidis,85]

Pr[c2 − (c3 − c1)/f1(n) ≤ X ≤ c2 | c1 ≤ X ≤ c3] ≤ β f2(n)/n

smooth ⊃ {uniform, regular, bounded, other non-uniform}

→ any probability distribution is (f1, Θ(n))-smooth

update:

O(log log n) expected [Mehlhorn & Tsakalidis,85]

O(1) time (position given) [Andersson & Mattsson,93]
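As a worked instance, take µ uniform on [a, b] and assume the smoothness condition bounds the conditional probability by β f2(n)/n (the 1/n factor is read off the "O(f2(n)/n)" ball-in-bin probability used later in the talk). Conditioned on c1 ≤ X ≤ c3, X is uniform on [c1, c3], so an interval of length (c3 − c1)/f1(n) carries probability at most 1/f1(n):

```latex
\Pr\!\left[\, c_2 - \frac{c_3 - c_1}{f_1(n)} \le X \le c_2 \;\middle|\; c_1 \le X \le c_3 \right]
\;\le\; \frac{(c_3 - c_1)/f_1(n)}{c_3 - c_1}
\;=\; \frac{1}{f_1(n)}
\;\le\; \beta\,\frac{f_2(n)}{n}
\quad\text{whenever}\quad f_1(n)\, f_2(n) \ge \frac{n}{\beta}.
```

For example, under this reading the uniform distribution is (n^α, n^(1−α))-smooth with β = 1 for any α ∈ (0, 1).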

7 / 24

Static/Dynamic Interpolation Search

Key Assumption

Conditional distribution on the subinterval dictated by an arbitrary interpolation step remains unaffected (i.e., µ-random)

8 / 24

This Work (Revisiting Mehl-Tsak DIS) - Result 1

Key assumption is valid

• only when elements are distinct (indeed assumed in all previous work)

→ produced under some continuous (or discrete) distribution with zero probability of collision

• otherwise, it fails

Probabilistic analyses of previous Interpolation Search structures are inapplicable to sequences of non-distinct elements

→ produced by discrete probability distributions with measurable (non-zero) probability of key collision

9 / 24

Interpolation Search - Lack of Generalization

∃ applications where we must store duplicates

Secondary indices in databases

Searching tables with alphabetic keys (names, dictionary entries, etc)

• keys follow a non-uniform, discrete distribution and collisions do occur

• Empirical results showed that Interpolation Search performs very poorly on such data

[Perl & Reingold,77], [Burton & Lewis,80], [Santoro & Sidney,85], [Perl & Gabriel,92]

• Heuristics; no rigorous performance analysis

Storing duplicates once: statistical properties are destroyed

10 / 24

New dynamic interpolation search data structure

update: O(1) time (position given)

Bounds hold with high probability

Analysis

− always valid, whether or not the elements are distinct

− applies to (n/(log log n)^(1+ε), n^δ)-smooth input distributions, δ ∈ (0, 1), ε > 0

Robust (no a priori knowledge of the input distribution is required)

11 / 24

This Work (Revisiting Mehl-Tsak DIS) - Other Results

Modifications to our data structure yield

O(1) expected search time w.h.p. for a (largest so far) subclass of smooth distributions

O(log log n) expected search time w.h.p. for Power-law and Binomial distributions

• Power-law and Binomial: ∄ proper choices of f1, f2 to achieve O(log log n) expected search time

12 / 24

Interpolation Search [Mehlhorn & Tsakalidis,85], [Andersson & Mattsson,93]

First Interpolation Step

[Figure: the root and its R(n) children. The root's ID array ID[1], . . . , ID[I(n)] marks the boundaries a + j(b − a)/I(n), j = 1, . . . , I(n) − 1, of [a, b]; the root's REP array REP[1], . . . , REP[R(n)] holds the representatives of the children.]

µ is (f1(n), f2(n)) = (I(n), n/R(n))-smooth

root has R(n) children; a root-child has R(n/R(n)) children, etc.

ID array: partitions [a, b] into I(n) equal-length parts

REP array: partitions P into R(n) equal-sized subsets, each of size n/R(n)

REP[i] ↔ Pi = {X ∈ P | X((i − 1)n/R(n)) < X ≤ X(i n/R(n))}, i = 1, . . . , R(n)

13 / 24

Interpolation Search [Mehlhorn & Tsakalidis,85], [Andersson & Mattsson,93]

First Interpolation Step

[Figure: the root, its ID array and its REP array, as on the previous slide.]

Query element y → j = ⌊(y − a) I(n)/(b − a)⌋ + 1 → Ij = [a + (j − 1)(b − a)/I(n), a + j(b − a)/I(n)]

Search for appropriate REP within Ij

− O(1) expected number of REPs

− distribution of elements between consecutive REPs: µ-random (smooth)

14 / 24

Analysis and the Subtle Case

[a, b] µ-random and smooth, (a′, b′] ⊆ [a, b]

Pr[X = λ | a′ < X ≤ b′] = Pr[X = λ] / Pr[a′ < X < b′], a′ < λ < b′   (1)

Consider X1, X2, X3 ∈ [a, b], and their ordering X(1) ≤ X(2) ≤ X(3)

Pr[X = λ | X(1) = a′ ∩ X(3) = b′] = Pr[X = λ ∩ X(1) = a′ ∩ X(3) = b′] / Pr[X(1) = a′ ∩ X(3) = b′]   (2)

Is (2) µ-random and smooth?

15 / 24

Distinct keys (Pr[key collision] = 0)

Event E1 = {X = λ ∩ X(1) = a′ ∩ X(3) = b′} occurs if at least one of the following occurs:

{X2 = λ, X3 = a′, X1 = b′}, {X1 = λ, X2 = a′, X3 = b′}, {X3 = λ, X1 = a′, X2 = b′},

{X1 = λ, X3 = a′, X2 = b′}, {X3 = λ, X2 = a′, X1 = b′}, {X2 = λ, X1 = a′, X3 = b′}

Event E2 = {X(1) = a′ ∩ X(3) = b′} occurs if at least one of the following occurs:

{X1 = a′, X2 = b′, a′ < X3 < b′}, {X2 = a′, X1 = b′, a′ < X3 < b′}, {X1 = a′, X3 = b′, a′ < X2 < b′},

{X3 = a′, X1 = b′, a′ < X2 < b′}, {X2 = a′, X3 = b′, a′ < X1 < b′}, {X3 = a′, X2 = b′, a′ < X1 < b′}

(2) = Pr[E1] / Pr[E2] = (6 Pr[X=λ] Pr[X=a′] Pr[X=b′]) / (6 Pr[X=a′] Pr[X=b′] Pr[a′<X<b′]) = Pr[X=λ] / Pr[a′<X<b′] = (1)

16 / 24

Non-distinct keys (Pr[key collision] ≠ 0)

Event E1 = {X = λ ∩ X(1) = a′ ∩ X(3) = b′} occurs if at least one of the following occurs:

{X2 = λ, X3 = a′, X1 = b′}, {X1 = λ, X2 = a′, X3 = b′}, {X3 = λ, X1 = a′, X2 = b′},

{X1 = λ, X3 = a′, X2 = b′}, {X3 = λ, X2 = a′, X1 = b′}, {X2 = λ, X1 = a′, X3 = b′}

Event E2 = {X(1) = a′ ∩ X(3) = b′} occurs if at least one of the following occurs:

{X1 = a′, X2 = b′, a′ < X3 < b′}, {X2 = a′, X1 = b′, a′ < X3 < b′}, {X1 = a′, X3 = b′, a′ < X2 < b′},

{X3 = a′, X1 = b′, a′ < X2 < b′}, {X2 = a′, X3 = b′, a′ < X1 < b′}, {X3 = a′, X2 = b′, a′ < X1 < b′},

{X1 = X2 = a′, X3 = b′}, {X1 = X2 = b′, X3 = a′}, {X1 = X3 = a′, X2 = b′},

{X1 = X3 = b′, X2 = a′}, {X2 = X3 = a′, X1 = b′}, {X2 = X3 = b′, X1 = a′}

(2) = Pr[E1] / Pr[E2] = Pr[X=λ] / ((Pr[X=a′] + Pr[X=b′])/2 + Pr[a′<X<b′]) ≠ (1) !!
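The non-distinct case can be checked by brute force on a tiny discrete distribution. The sketch below is an illustration only: the uniform distribution on {1, ..., 5}, the choice a′ = 1, b′ = 5, λ = 3 and the helper name are all assumptions of this example. It enumerates every triple (X1, X2, X3) with exact rational arithmetic and compares the conditional probability of the middle element with the right-hand side of (1) and with the collision-aware expression.

```python
from fractions import Fraction
from itertools import product


def middle_given_extremes(pmf, lo, hi, lam):
    """Exact Pr[X(2) = lam | X(1) = lo, X(3) = hi] for three i.i.d. draws."""
    num = den = Fraction(0)
    for triple in product(pmf, repeat=3):
        p = pmf[triple[0]] * pmf[triple[1]] * pmf[triple[2]]
        ordered = sorted(triple)
        if ordered[0] == lo and ordered[2] == hi:
            den += p
            if ordered[1] == lam:
                num += p
    return num / den


pmf = {k: Fraction(1, 5) for k in range(1, 6)}  # discrete, so collisions occur
lo, hi, lam = 1, 5, 3

exact = middle_given_extremes(pmf, lo, hi, lam)
inside = sum(p for k, p in pmf.items() if lo < k < hi)     # Pr[a' < X < b']
eq1 = pmf[lam] / inside                                    # right-hand side of (1)
with_collisions = pmf[lam] / ((pmf[lo] + pmf[hi]) / 2 + inside)

print(exact, eq1, with_collisions)  # 1/4 1/3 1/4
```

The enumerated value agrees with the collision-aware expression and differs from (1), which is the failure of the key assumption in miniature.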

17 / 24

The New Data Structure

n elements drawn µ-randomly from [a, b]

µ is (f1, f2) = (n^α, n^δ)-smooth, α, δ ∈ (0, 1)

Key idea

Forget about REPs

Partition [a, b] into f1(n) intervals, each of size (b − a)/f1(n)

Recurse on each interval (BIN) until |BIN| = O(poly log n)
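A minimal, static sketch of this key idea, under several assumptions that are not from the talk: f1(n) is taken as ⌈n^α⌉ with α = 0.5, `LEAF_SIZE` and `MAX_DEPTH` are hypothetical thresholds standing in for the O(poly log n) leaf size, a sorted list with `bisect` replaces the leaf structure, and the `left_max` array is an addition of this sketch so that a query falling into an empty bin can still report its predecessor. Insertions and deletions are omitted.

```python
import bisect
import math

LEAF_SIZE = 16   # hypothetical stand-in for the O(poly log n) leaf threshold
MAX_DEPTH = 32   # hypothetical cap so heavy duplicates cannot recurse forever


class LayeredBins:
    """Recursive equal-length partition of [lo, hi]; no REP array is kept."""

    def __init__(self, keys, lo, hi, alpha=0.5, depth=0):
        self.lo, self.hi = lo, hi
        self.children = None
        if len(keys) <= LEAF_SIZE or lo >= hi or depth >= MAX_DEPTH:
            self.keys = sorted(keys)          # leaf bin
            return
        self.fanout = max(2, math.ceil(len(keys) ** alpha))   # f1(n) bins
        buckets = [[] for _ in range(self.fanout)]
        for x in keys:
            buckets[self._index(x)].append(x)
        width = (hi - lo) / self.fanout
        self.children = [
            LayeredBins(b, lo + j * width, lo + (j + 1) * width, alpha, depth + 1)
            for j, b in enumerate(buckets)
        ]
        # left_max[j] = largest key stored in bins 0 .. j-1 (None if none).
        self.left_max = []
        best = None
        for b in buckets:
            self.left_max.append(best)
            if b:
                best = max(b)

    def _index(self, x):
        """Bin of x, computed arithmetically and clamped to a valid index."""
        j = int((x - self.lo) * self.fanout / (self.hi - self.lo))
        return min(max(j, 0), self.fanout - 1)

    def predecessor(self, y):
        """Largest stored key <= y, or None if there is none."""
        if self.children is None:
            i = bisect.bisect_right(self.keys, y)
            return self.keys[i - 1] if i else None
        j = self._index(y)
        found = self.children[j].predecessor(y)
        return found if found is not None else self.left_max[j]


bins = LayeredBins([0.5, 2.0, 2.0, 7.25, 9.0], 0.0, 10.0)
print(bins.predecessor(3.0))  # 2.0
```

Because bins are located by arithmetic on the key value alone, duplicates simply land in the same bin; nothing in the structure depends on order statistics of the sample.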

18 / 24

[Figure: 1st layer of f1(n) bins BIN(1), . . . , BIN(j1), . . . , BIN(f1(n)) over [a, b], with boundaries a + j(b − a)/f1(n), holding n1, . . . , nf1(n) balls. Each bin receives a ball with probability O(f2(n)/n).]

[Figure: BIN(j1)'s 2nd layer of f1(nj1) bins BIN(j1, 1), . . . , BIN(j1, j2), . . . , BIN(j1, f1(nj1)) over [aj1, bj1], with boundaries aj1 + j(bj1 − aj1)/f1(nj1). Each bin receives a ball with probability O(f2(nj1)/nj1).]

Leaf BIN: q∗-heap [Willard,93] ⇒ O(1) search & update time

19 / 24

Lem. 1 The elements of each BIN (subinterval) are µ-randomly distributed

Lem. 2 |BIN| = f2(|parent-BIN|) w.h.p.

⇓

− 1st layer: |BIN| = O(n^δ)

− k-th layer: |BIN| = O(n^(δ^k))

→ Number of layers: O(log log n)
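The layer count follows from the bin-size recurrence in one line. Assuming recursion stops once a bin holds at most (log n)^c elements for some constant c > 0 (the constant c is introduced here only for the calculation):

```latex
n^{\delta^{k}} \le (\log n)^{c}
\iff \delta^{k}\,\log n \le c\,\log\log n
\iff k \ge \log_{1/\delta}\!\left(\frac{\log n}{c\,\log\log n}\right),
\qquad\text{so}\qquad k = O(\log\log n).
```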

20 / 24

O(1) Search Time

[Figure: first layer of f1(n) BINs, BIN(1), . . . , BIN(f1(n)), over [a, b], with boundaries a + j(b − a)/f1(n). Each BIN receives a ball with probability O(f2(n)/n) = O(ln^O(1) n / n).]

µ: (f1(n), f2(n)) = (n/g, ln^O(1) n)-smooth, g: constant

|BIN| = O(ln^O(1) n) w.h.p.

Implement each bin as q∗-heap → O(1) search time

21 / 24

Power-law Distribution

[Figure: the key range split into intervals I1 and I2.]

I1 is dense → van Emde Boas structure, O(log log |I1|) search time

I2 is sparse → distribution is (f1(n), poly log n)-smooth

Similar idea for Binomial

22 / 24

Revisiting Mehl-Tsak DIS - Conclusions

New dynamic interpolation search data structure

Supports non-distinct (duplicate) elements

Expected search time O(log log n) w.h.p. for smooth and other

distributions (e.g., power law)

O(1) expected search time for a subclass (largest so far) of smooth

distributions

23 / 24

Thanks

Thank you Thanassi for

introducing me (and others) to the intricacies of elementary data structures

your contributions to CS&E and for establishing it within Greece

your huge contribution in elevating the Dept of Computer Engineering &

Informatics in all respects (infrastructure, procedures, etc) within and out of

our University

the artistic touch you gave to CEID

... your support and friendship

Wish you all the best for the future

24 / 24
