Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables


By: Dahlia Malkhi, Moni Naor & David Ratajczak, Nov. 11, 2003

Presented by Zhenlei Jia, Nov. 11, 2004

Acknowledgments

Some of the following slides are adapted from the slides created by the authors of the paper

Outline

• DHT
• DHT Properties
• Viceroy
  • Structure
  • Routing Algorithm
  • Join/Leave
  • Bounding In-degree: Bucket Solution
  • Fault Tolerance
• Summary

DHT

What is a DHT?
• Stores (key, value) pairs
• Lookup
• Join/Leave

Examples: CAN, Pastry, Tapestry, Chord, etc.

DHT Properties

• Dilation: efficient lookup, usually O(log(n))
• Maintenance cost: support for a dynamic environment; control messages, affected servers
• Degree: number of open connections; servers impacted by a node join/leave; heartbeat, graceful leave

DHT Properties (cont.)

• Congestion: peers should share the routing load evenly
• Load (of a node): the probability that it is on a route with random source and destination
• If path length = O(log(n)), then on average each node is on n² × O(log(n))/n = O(n·log(n)) routes
• Average load = O(n·log(n))/n² = O(log(n))/n
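To make the arithmetic above concrete, here is a tiny back-of-envelope check in Python (the network size n is an arbitrary example value, not from the slides):

```python
import math

# Back-of-envelope check of the congestion argument above.
n = 2 ** 16                              # example network size (assumption)
routes = n * n                           # one route per (source, destination) pair
hops_per_route = math.log2(n)            # idealized O(log n) path length
total_node_visits = routes * hops_per_route
routes_per_node = total_node_visits / n  # ~ n log n
average_load = routes_per_node / routes  # ~ log n / n
print(routes_per_node, average_load, math.log2(n) / n)
```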

Previous Works

                        Node degree   Dilation     Congestion     Topology
  Chord (MIT)           log(n)        log(n)       log(n)/n       Hypercube
  Tapestry (Berkeley)   log(n)        log(n)       log(n)/n       Hypercube
  CAN (ICSI)            d             d·n^(1/d)    d·n^(1/d)/n    d-dim. torus
  Small worlds          3             log²(n)      log²(n)/n      Cube-connected cycles
  Ours (Viceroy)        3-7           log(n)       log(n)/n       Butterfly

Intuition

• A route is a combination of links of appropriate sizes
• Chord: each node has ALL log(n) links
• Viceroy:
  • Each node has ONE of the long-range links
  • A link of length 1/2^k points to a node that has a link of length 1/2^(k+1)

Chord

[Figure: Chord's long-range links among nodes 000-111, drawn at levels 1-4.]

A Butterfly Network

• Each node has ONE of the long-range links
• A link of length 1/2^k points to a node that has a link of length 1/2^(k+1)
• Nodes "share" each other's long links
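The halving idea can be seen with a toy calculation; this is only an illustration of the intuition (a greedy walk down halving link lengths), not the actual Viceroy routing rule:

```python
# Toy illustration: follow a chain of long links whose lengths halve at each
# level (1/2, 1/4, 1/8, ...). The residual ring distance shrinks like a
# binary search, so ~log2(1/precision) hops suffice.
def hops_to_reach(distance: float, eps: float = 1e-6) -> int:
    hops, link_len = 0, 0.5
    while distance > eps:
        if link_len <= distance:   # take the current long link if it fits
            distance -= link_len
        link_len /= 2              # the next node's long link is half as long
        hops += 1
    return hops

print(hops_to_reach(0.7))   # about 20 hops to get within 1e-6 of the target
```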

Routing

1. Route to the root
2. Route to the right group
3. Route to the right level

Path: O(log(n))
Degree: O(1)

A Viceroy network

• Ideally, there should be log(n) levels
• There is no global counter
• Later, we will see how a node can estimate log(n) locally

[Figure: a Viceroy network with nodes 0001-1111 on the identifier ring, each assigned to one of levels 1-3.]

Structure: Nodes

• Id: a 128-bit binary string, u
• Level: a positive integer, u.level
• Order of ids: the string b1b2…bk is ordered by its value ∑i=1…k bi/2^i
• Each node has a SUCCESSOR and a PREDECESSOR: SUCC(u), PRED(u)
• Node u stores the keys k such that u ≤ k < SUCC(u)
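The ordering above maps each binary id to a point on the unit interval. A minimal Python sketch (helper names are mine, not the paper's) of that mapping and of the key-ownership rule:

```python
def id_to_point(bits: str) -> float:
    """Map a binary id string b1 b2 ... bk onto [0, 1) as sum_i b_i / 2^i."""
    return sum(int(b) / 2 ** (i + 1) for i, b in enumerate(bits))

def responsible(node_id: str, succ_id: str, key: float) -> bool:
    """True if this node owns `key`, i.e. node <= key < SUCC(node), wrapping at 1."""
    lo, hi = id_to_point(node_id), id_to_point(succ_id)
    return lo <= key < hi if lo < hi else (key >= lo or key < hi)

# Example: node 0101 sits at 0.3125 and, with successor 0110 (0.375), owns key 0.33.
print(id_to_point("0101"), responsible("0101", "0110", 0.33))
```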

Structure: Nodes

[Figure: the identifier ring 0-1; node x with PRED(x) and SUCC(x); the keys stored on x lie between x and SUCC(x).]

Lemma 2.1

• Let n0 = 1/d(x, SUCC(x)). Then w.h.p. (i.e., with probability > 1 - 1/n^(1+ε)):
  log(n) - log(log(n)) - O(1) < log(n0) ≤ 3·log(n)
• Node x selects its level uniformly at random from 1…log(n0)
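Lemma 2.1 is what lets a node pick a level without a global counter. A minimal sketch of the rule (function names are my own):

```python
import math
import random

def estimate_log_n(x_point: float, succ_point: float) -> int:
    """Estimate log(n) from the ring gap to the successor: n0 = 1/d(x, SUCC(x))."""
    gap = (succ_point - x_point) % 1.0 or 1.0
    n0 = 1.0 / gap
    return max(1, int(math.log2(n0)))

def choose_level(x_point: float, succ_point: float) -> int:
    """Select a level uniformly at random from 1 .. log(n0)."""
    return random.randint(1, estimate_log_n(x_point, succ_point))

print(choose_level(0.3125, 0.3126))  # small gap -> n0 ~ 10^4 -> level in 1..13
```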

Structure: Links

A node u at level k has six out-links:
• 2 short: SUCCESSOR, PREDECESSOR
• 2 medium (left/right): the closest level-(k+1) node whose id matches u's first k bits and is smaller (resp. larger) than u.id
• 1 long: the closest level-(k+1) node with prefix u1…u(k-1)(1-uk)u(k+1)…uw, i.e. u's first w bits with the k-th bit flipped, where w = log(n0) - log(log(n0))
• 1 parent: the closest level-(k-1) node

Each node also keeps track of its in-bound links.
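A minimal sketch of the per-node state this list implies (field and class names are mine, not the paper's):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ViceroyLinks:
    successor: "Node"                       # short link (next node on the ring)
    predecessor: "Node"                     # short link (previous node on the ring)
    medium_left: Optional["Node"] = None    # closest level-(k+1) node below u matching u's first k bits
    medium_right: Optional["Node"] = None   # closest level-(k+1) node above u matching u's first k bits
    long: Optional["Node"] = None           # level-(k+1) node matching u's first w bits with bit k flipped
    parent: Optional["Node"] = None         # closest level-(k-1) node
    inbound: List["Node"] = field(default_factory=list)  # nodes whose out-links point at u

@dataclass
class Node:
    ident: str                              # 128-bit id as a binary string
    level: int                              # k, drawn from 1 .. log(n0)
    links: Optional[ViceroyLinks] = None
```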

Structure: Links

[Figure: the example network (nodes 0001-1111, levels 1-3) annotated with one node's out-links: short links to SUCCESSOR and PREDECESSOR, a parent link to level k-1, a medium link to a level-(k+1) node matching x[k] (here 0*), and a long link crossing about 1/2^k to a node matching u's first w bits except the k-th bit (here 11*); a candidate matching only 1* is marked "Wrong!".]

Routing: Algorithm

LOOKUP(x, y):
• Initialization: set cur to x
• Proceed to root: while cur.level > 1, set cur = cur.parent
• Greedy search: if cur.id ≤ y < SUCC(cur).id, return cur; otherwise, choose the m among cur's links that minimizes d(m, y), move to m, and repeat

Demo: http://www.cs.huji.ac.il/labs/danss/anatt/viceroy.html
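A minimal, single-process sketch of LOOKUP, reusing the Node/ViceroyLinks and id_to_point helpers sketched above (my reading of the slide, not the authors' code):

```python
def ring_dist(a: float, b: float) -> float:
    """Clockwise distance from a to b on the unit ring."""
    return (b - a) % 1.0

def lookup(x: "Node", y: float) -> "Node":
    cur = x
    # Phase 1: proceed to a level-1 node ("root") via parent links.
    while cur.level > 1 and cur.links.parent is not None:
        cur = cur.links.parent
    # Phase 2: greedy search over the out-links.
    while True:
        here = id_to_point(cur.ident)
        succ = cur.links.successor
        if ring_dist(here, y) < ring_dist(here, id_to_point(succ.ident)):
            return cur  # cur.id <= y < SUCC(cur).id, so cur owns y
        out = [cur.links.successor, cur.links.predecessor, cur.links.medium_left,
               cur.links.medium_right, cur.links.long, cur.links.parent]
        cur = min((m for m in out if m is not None),
                  key=lambda m: ring_dist(id_to_point(m.ident), y))
```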

Routing: Example

[Figure: an example lookup route from node x to target y on the example network (nodes 0001-1111, levels 1-3).]

One Observation

[Figure: the example Viceroy network (nodes 0001-1111, levels 1-3).]

Routing: Analysis (1)

[Figure: the route from x to y on the example network, illustrating the analysis below.]

Routing: Analysis (2)

Expected path length = O(log(n)):
• O(log(n)) to reach a level-1 node
• O(log(n)) for traveling among clusters
• O(log(n)) for the final local search

Routing: Theorems

Theorem 4.4: The path length from x to y is O(log(n)) w.h.p. The proof is based on several lemmas.

Lemma 4.1: For every node u with level u.level < log(n) - log(log(n)), the number of nodes between u and u.Medium-left (resp. Medium-right), if it exists, is at most 6·log²(n) w.h.p.

Routing: Theorems (2)

Lemma 4.2: In the greedy search phase of a lookup of value y from node x, let the j-th greedy step v_j, for 1 ≤ j ≤ m, be such that v_j is more than O(log²(n)) nodes away from y. Then w.h.p. node v_j is reached over a Medium or Long link, and hence satisfies v_j.level = j and v_j[j] = y[j].

Here m = log(n) - 2·log(log(n)) - log(3+ε). W.h.p., within m steps we are n/2^m = 6·log²(n) nodes away from the destination.

Routing: Theorems (3)

Lemma 4.3: Let v be a node that is O(log²(n)) nodes away from the target y. Then w.h.p., the target y is reached from v within O(log(n)) greedy steps.

Theorem 4.4: The total length of a route from x to y is O(log(n)) w.h.p.

Theorem 4.6: The expected load on every node is O(log(n)/n). The load on every node is O(log²(n)/n) w.h.p.

Theorem 4.7: Every node u has in-degree O(log(n)) w.h.p.

Join: Algorithm

1. Choose identifier: select a random 128-bit string x1x2…x128.
2. Set up short links: invoke LOOKUP(x) and let x' be the resulting node. Insert x between x' and x'.SUCCESSOR.
3. Choose level: let k be the maximal number of matching prefix bits between x and either SUCC(x) or PRED(x); choose a level from 1…k.
4. Set parent link: if SUCC(x) has level x.level - 1, set x.parent to it. Otherwise, move to SUCC(x) and repeat.
5. Set long link: let p = x1…x(k-1)(1-xk)x(k+1)…xw. Invoke LOOKUP(p) and stop once a node at level x.level + 1 that matches p is reached.

Join: Algorithm (cont.)

6. Set medium links: denote p = x1x2…x(x.level). If SUCC(x) has prefix p and level x.level + 1, set x.Medium-right to it. Otherwise, move to SUCC(x) and repeat.
7. Set inbound links: denote p = x1x2…x(x.level).
   • Set inbound Medium links: follow SUCC links; as long as the successor y has prefix p and a level different from x.level, if y.level = x.level - 1, set y.Medium-left to x.
   • Set inbound Long links: following SUCC links, find a y that has a prefix matching p and has level x.level; take over any of y's inbound links that are closer to x than to y.
   • Set inbound Parent links: follow PRED links; for each node y with y.level = x.level + 1, set y.parent to x; repeat until a node at the same level as x is met.
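A minimal sketch of JOIN steps 1-3 (identifier, short links, level), again reusing the Node, ViceroyLinks, lookup, and id_to_point sketches above; steps 4-7 then walk short links exactly as listed:

```python
import random

def common_prefix_len(a: str, b: str) -> int:
    """Number of leading bits on which two ids agree."""
    return next((i for i, (p, q) in enumerate(zip(a, b)) if p != q), min(len(a), len(b)))

def join(bootstrap: "Node") -> "Node":
    # 1. Choose identifier: a random 128-bit string.
    ident = "".join(random.choice("01") for _ in range(128))
    # 2. Set up short links: LOOKUP(x) returns the node x' owning x's id;
    #    splice x in between x' and x'.SUCCESSOR.
    x_pred = lookup(bootstrap, id_to_point(ident))
    x_succ = x_pred.links.successor
    x = Node(ident=ident, level=1,
             links=ViceroyLinks(successor=x_succ, predecessor=x_pred))
    x_pred.links.successor = x
    x_succ.links.predecessor = x
    # 3. Choose level: k = longest prefix match with SUCC(x) or PRED(x),
    #    then pick a level uniformly from 1 .. k.
    k = max(common_prefix_len(ident, x_pred.ident),
            common_prefix_len(ident, x_succ.ident), 1)
    x.level = random.randint(1, k)
    return x
```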

Join: Example

[Figure: node 0111 joins the example network (nodes 0001-1111, levels 1-3). The annotations trace the link-setting steps:]

• Lookup(x) places 0111 on the ring between its predecessor and successor.
• Set Medium link (O(log²(n)) w.h.p.): with p = x1x2…xk (here 01), follow SUCC links; if y[k] ≠ p, stop; if y[k] = p and y.level = k + 1, set the Medium link; otherwise move to SUCC(y).
• Set Parent link: follow SUCC links until a node with level k - 1 is found.
• Set Long link: with p = x1…x(k-1)(1-xk)…xw (in this case, find 00*), stop at a matching node of level k + 1.
• Set inbound long links: follow short links to find y such that y[k] = x[k] and y.level = x.level, then check y's inbound links.

Join: Analysis

• LOOKUP takes O(log(n)) messages w.h.p.
• Travel along short links during the link-setting phase is O(log²(n)) w.h.p.: a Medium link is within 6·log²(n) nodes of x w.h.p., and similarly for the other links.

Theorem 5.1: A JOIN operation by a new node x incurs an expected O(log(n)) number of messages, and O(log²(n)) messages w.h.p. The expected number of nodes that change their state as a result of x's join is constant, and w.h.p. it is O(log(n)) (because node x has O(log(n)) in-degree w.h.p.).

Similar results hold for LEAVE.

Bounding In-degrees

Theorem 4.7: Every node has expected constant in-degree, and O(log(n)) in-degree w.h.p.

• In-degree = the number of servers affected by a join/leave
• How to guarantee constant in-degree? The bucket solution: a background process that balances the assignment of levels

Bucket Solution: Intuition

[Figure: about log(n) nodes at level k-1 all pointing their Medium-right links at the same node x at level k.]

• Node x can have log(n) in-degree when there are too many nodes at level k-1 and too few nodes at level k
• Fix: improve the level-selection procedure

Bucket Solution

• The name space is divided into non-overlapping buckets.
• A bucket contains m nodes, where log(n) ≤ m ≤ c·log(n), for some c > 2.
• Within a bucket, levels are NOT assigned randomly: for each 1 ≤ j ≤ log(n), there are between 1 and c nodes at level j in each bucket.
• In-degree bound: In(x) < 7c (?? 2c).

[Figure: the identifier space 0-1 divided into buckets over nodes 0001-1111.]
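A minimal illustration of the per-bucket level invariant above; the round-robin assignment is my own example policy, the slide only requires that every level 1…log(n) holds between 1 and c nodes:

```python
def assign_levels(num_nodes: int, log_n: int, c: int) -> list[int]:
    """Assign levels within one bucket of log(n) <= m <= c*log(n) nodes."""
    assert log_n <= num_nodes <= c * log_n, "bucket size out of range"
    # Deal levels 1..log_n out round-robin, so each level gets 1..c nodes.
    return [1 + (i % log_n) for i in range(num_nodes)]

levels = assign_levels(num_nodes=20, log_n=8, c=3)
counts = {lvl: levels.count(lvl) for lvl in range(1, 9)}
print(counts)   # every level appears between 1 and c = 3 times
```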

Maintaining Bucket Size

• n can be accurately estimated.
• When a bucket's size exceeds c·log(n), the bucket is split into two equal-size buckets.
• When a bucket's size drops below log(n), it is merged with a neighboring bucket. Furthermore, if the merged bucket is larger than log(n)·(2c+2)/3, the new bucket is split into two buckets. (Each half then has more than log(n)·(c+1)/3 > log(n) nodes, since (c+1)/3 > 1 for c > 2.)
• Buckets are organized into a ring, which allows a merge or split with O(1) messages.
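A minimal sketch of the split/merge rules above; the thresholds are the ones on the slide, while the function and its return strings are mine:

```python
import math

def bucket_action(size: int, neighbor_size: int, n: int, c: float) -> str:
    """Decide whether a bucket should split, merge, or merge-then-split."""
    log_n = math.log2(n)
    if size > c * log_n:
        return "split into two equal-size buckets"
    if size < log_n:
        merged = size + neighbor_size
        if merged > (2 * c + 2) / 3 * log_n:
            # Each half is then > (c+1)/3 * log(n) > log(n), since c > 2.
            return "merge with neighbor, then split the merged bucket"
        return "merge with neighbor"
    return "no action"

print(bucket_action(size=9, neighbor_size=25, n=1024, c=3.0))  # merge, then split
```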

Maintain Level Property

• A node join/leave without a merge or split costs O(1):
  • Join: size < c·log(n), so choose a level that has fewer than c nodes.
  • Leave: if the leaving node is the only node at its level j, find another level that has two nodes and reassign level j to one of them.
• A bucket merge or split may require reassigning the levels of all nodes in the bucket(s): O(log(n)).
• Merging/splitting is expensive, but it does not happen very often: after a merge or split, at least log(n)·(c-2)/3 JOIN/LEAVE operations must happen in this bucket before another merge or split of this bucket is performed.
• Amortized overhead = c / ((c-2)/3) = 3c/(c-2) = O(1) for c > 2.

Amortized analysis

[Figure: bucket sizes on the interval from log(n) to c·log(n). A new bucket's size lies between min(c/2, (c+1)/3)·log(n) and max(c/2, (2c+2)/3)·log(n), so its distances d1 and d2 to the merge/split thresholds are both greater than (c-2)/3·log(n).]

Fault Tolerance

• Viceroy has no built-in support for fault tolerance.
• Viceroy requires graceful leaves: leaves are NOT the same as failures.
• Performance is sensitive to failures.
• External techniques: thickening edges, state machine replication.

State Machine Replication

[Figure: a super node is implemented by a group of Viceroy nodes running state machine replication (SMR), shown in its old and new configurations.]

Related Works

• De Bruijn graph based networks: Distance Halving, D2B, Koorde
• Others: Symphony (small-world model), Ulysses (butterfly; degree log(n), path length log(n)/log(log(n)))

Summary

• Constant out-degree
• Expected constant in-degree; O(log(n)) w.h.p.; O(1) with the bucket solution
• O(log(n)) path length w.h.p.
• Expected O(log(n)/n) load; O(log²(n)/n) w.h.p.
• Weaknesses/improvements: not locality-aware; no fault-tolerance support, due to the lack of flexibility of the butterfly network

Question
