Distributed Hash Table Algorithms


Transcript of Distributed Hash Table Algorithms

Page 1: Distributed Hash Table Algorithms

L3S Research Center, University of Hannover

Distributed Hash Table Algorithms

Wolf-Tilo Balke and Wolf Siberski, 14.11.2007

Peer-to-Peer Systems and Applications, Springer LNCS 3485

*With slides provided by Stefan Götz and Klaus Wehrle (University of Tübingen) and

Review: DHT Basics

Objects need unique key

Key is hashed to integer value

Huge key space, e.g. 2^128

„Purple Rain“

Hash function (e.g. SHA-1)

Key space partitioned

Each peer gets its key range

DHT Goals

Efficient routing to the responsible peer

Efficient routing table maintenance

[Figure: „Purple Rain“ hashes to key 2313; a ring whose key space is partitioned among peers into the ranges 3485–610, 611–709, 1008–1621, 1622–2010, 2011–2206, 2207–2905, and 2906–3484]
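To make the review concrete, here is a minimal Python sketch of hashing an object name into the key space and finding the responsible peer. The peer ids and the "successor owns the key" rule are illustrative assumptions, not from the slides.

```python
import hashlib
from bisect import bisect_left

def object_key(name: str, key_space: int) -> int:
    """Hash an object name (e.g. "Purple Rain") to an integer key."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % key_space

def responsible_peer(key: int, peer_ids: list[int]) -> int:
    """The peer with the smallest id >= key is responsible (wraps around)."""
    idx = bisect_left(peer_ids, key) % len(peer_ids)
    return peer_ids[idx]

peers = sorted([610, 709, 1621, 2010, 2206, 2905, 3484])  # ring from the figure
key = object_key("Purple Rain", 3486)                     # tiny demo key space
print(key, "->", responsible_peer(key, peers))
```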

Page 2: Distributed Hash Table Algorithms

DHT Design Space

Minimal routing table

[Figure: the partitioned ring with each peer linked only to its successor]

Peer state O(1), avg. path length O(n)

Brittle network

Maximal routing table

[Figure: the partitioned ring with every peer linked to every other peer]

Peer state O(n), path length O(1)

Very inefficient routing table maintenance

DHT Routing Tables

Usual routing table

[Figure: the partitioned ring with each peer keeping a handful of links to other peers]

Peer state O(log n), path length O(log n)

Compromise between routing efficiency and maintenance efficiency

Page 3: Distributed Hash Table Algorithms

DHT Algorithms

CHORD

Pastry

Symphony

Viceroy

CAN


Pastry: Basics

128-bit circular id space

Routing table elements:

Leaf set: Key space proximity

Routing table: long distance links

Neighborhood set: network proximity

Basic routing:

If (target key in key space proximity)

Use direct leaf set link


else

Use link from routing table to resolve next digit of target key

Page 4: Distributed Hash Table Algorithms

Pastry: Leaf sets

Each node maintains IP addresses of the nodes with the L numerically closest larger and smaller nodeIds, respectively.

routing efficiency/robustness

fault detection (keep-alive)

application-specific local coordination
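As a small illustration, a leaf set can be computed from the sorted ring of all nodeIds; this is an assumed layout for the sketch, not Pastry's actual data structures.

```python
def leaf_set(self_id: int, all_ids: list[int], L: int) -> list[int]:
    """The L/2 numerically closest smaller and larger nodeIds."""
    ring = sorted(i for i in all_ids if i != self_id)
    pos = sum(1 for i in ring if i < self_id)  # where self_id would sit
    half = L // 2
    smaller = [ring[(pos - 1 - k) % len(ring)] for k in range(half)]
    larger = [ring[(pos + k) % len(ring)] for k in range(half)]
    return smaller[::-1] + larger
```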


Pastry: Routing table

L nodes in leaf set

log_{2^b} N rows (actually log_{2^b} 2^128 = 128/b)

2^b columns


L network neighbors

Page 5: Distributed Hash Table Algorithms

Pastry: Routing

log_{2^b} N steps (O(log N))

[Figure: routing example; Route(d46a1c) starting at node 65a1fc; each hop shares a longer nodeId prefix with the key: d13da3, d4213f, d462ba, d467c4; d471f1 shown adjacent to the key d46a1c]

Pastry: Routing procedure

If (destination is within range of our leaf set) forward to numerically closest member

else
let l = length of shared prefix
let d = value of l-th digit in D's address
if (R[l][d] exists) forward to R[l][d]
else forward to a known node* that
(a) shares at least as long a prefix
(b) is numerically closer than this node

*from LeafSet, RoutingTable, or NetworkNeighbors
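The procedure above can be sketched in Python, assuming hex nodeIds (b = 4) and simplified state; leaf_set and routing_table are illustrative names, not Pastry's actual API.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading digits two ids have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def next_hop(self_id: str, key: str, leaf_set: list[str],
             routing_table: dict[tuple[int, str], str]) -> str:
    num = lambda s: int(s, 16)
    # 1) Destination within leaf set range: numerically closest member.
    ids = sorted(leaf_set + [self_id], key=num)
    if num(ids[0]) <= num(key) <= num(ids[-1]):
        return min(ids, key=lambda s: abs(num(s) - num(key)))
    # 2) Routing table entry R[l][d] resolving one more digit of the key.
    l = shared_prefix_len(self_id, key)
    if (l, key[l]) in routing_table:
        return routing_table[(l, key[l])]
    # 3) Fallback: any known node with at least as long a shared prefix
    #    that is numerically closer to the key than this node.
    known = leaf_set + list(routing_table.values())
    closer = [x for x in known
              if shared_prefix_len(x, key) >= l
              and abs(num(x) - num(key)) < abs(num(self_id) - num(key))]
    return min(closer, key=lambda x: abs(num(x) - num(key))) if closer else self_id
```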

Page 6: Distributed Hash Table Algorithms

Pastry: Routing Properties

O(log N) routing table size: 2^b · log_{2^b} N + 2L

O(log N) message forwarding steps

Network stability: guaranteed unless L/2 nodes with adjacent nodeIds fail simultaneously


Number of routing hops:

No failures: < log_{2^b} N average, 128/b + 1 max

During failure recovery O(N) worst case, average case much better

Pastry: Node addition

[Figure: node addition; new node X = d46a1c asks nearby node A = 65a1fc to route a join message with key d46a1c; it travels via d13da3, d4213f, d462ba and arrives at Z = d467c4]

Page 7: Distributed Hash Table Algorithms

Routing table maintenance

Leaf set: copy from neighbor

Extend by sending request to right/left boundary leaf link

Routing table: collect routing tables from peers encountered during network entry

Works because peers encountered share same prefix

Can be incomplete


Network neighbor set: probe nodes from collected routing tables

Request neighborhood sets for known nearby nodes

Pastry: Locality properties

Assumption: scalar proximity metric e.g. ping/RTT delay, # IP hops, geographical distance

a node can probe distance to any other node

Proximity invariant: Each routing table entry refers to a node close to the local node (in the proximity space), among all nodes with the appropriate nodeId prefix.


Page 8: Distributed Hash Table Algorithms

Pastry: Geometric Routing in proximity space

[Figure: the route Route(d46a1c) from 65a1fc via d13da3, d4213f, d462ba, d467c4, drawn both in the proximity space and in the nodeId space]

Network distance covered by each routing step increases exponentially (the entry in row l is chosen from a set of nodes of size N/2^(bl))

Distance increases monotonically (the message takes larger and larger strides)

Pastry: Locality properties

Each routing step is local, but there is no guarantee of globally shortest path

Nevertheless, simulations show:

Expected distance traveled by a message in the proximity space is within a small constant of the minimum

Among k nodes with nodeIds closest to the key, message likely to reach the node closest to the source node first


Page 9: Distributed Hash Table Algorithms

Pastry: Node addition details

New node X contacts nearby node A

A routes a “join” message with X’s nodeId as key; it arrives at Z, the node numerically closest to X

X obtains its leaf set from Z, and the i-th routing table row from the i-th node on the path from A to Z

X informs any nodes that need to be aware of its arrival
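A rough sketch of this join flow follows; node objects with routing_table_row() and leaf_set() methods are hypothetical names for illustration, not Pastry's API.

```python
def join(x_id: str, a, route_path):
    """route_path(a, x_id) is assumed to yield the nodes the join
    message traverses from A up to Z (numerically closest to X)."""
    path = list(route_path(a, x_id))
    rows = [node.routing_table_row(i)      # i-th row from i-th node
            for i, node in enumerate(path)]
    leaves = path[-1].leaf_set()           # leaf set comes from Z
    return rows, leaves
```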


Node departure/failure

Leaf set repair (eager – all the time): Leaf set members exchange keep-alive messages

request set from furthest live node in set

Routing table repair (lazy – upon failure): get table from peers in the same row, if not found – from higher rows

Neighborhood set repair (eager)


Page 10: Distributed Hash Table Algorithms

Pastry: Average # of hops

[Figure: average number of hops vs. number of nodes (1,000–100,000); curves for Pastry and log(N); y-axis 0–4.5 hops; L = 16, 100k random queries]

Pastry distance vs IP distance

[Figure: distance traveled by a Pastry message vs. distance between source and destination (0–1,400); mean = 1.59; GATech topology, .5M hosts, 60K nodes, 20K random messages]

Page 11: Distributed Hash Table Algorithms

Pastry Summary

Usual DHT scalability

Peer state log(N)

Avg. path length log(N)

Very robust

Different routes possible

Lazy routing table update sufficient


Network proximity aware

No IP network detours

Symphony

Symphony DHT

Map the nodes and keys to the ring

Link every node with its successor and predecessor

Add k random links with probability proportional to 1/(d·log N), where d is the distance on the ring

Lookup time O(log² N)

If k = log N: lookup time O(log N)

Easy to insert and remove nodes


(perform periodic refreshes of the links)

Page 12: Distributed Hash Table Algorithms

Symphony in a Nutshell

Nodes arranged in a unit circle (perimeter = 1)

Arrival --> node chooses position along circle uniformly at random

Each node has 1 short link (next node on circle) and k long links

Adaptation of the Small World idea [Kleinberg00]: long links chosen from a probability distribution function p(x) = 1/(x log n), where n = #nodes.

Simple greedy routing: “Forward along the link that minimizes the absolute distance to the destination.”

[Figure: unit ring with a node, its short link, and its long links; one node asks “n?”]

Fault tolerance: no backups for long links! Only short links are fortified for fault tolerance.

Average lookup latency = O((log² n) / k) hops
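A minimal sketch of drawing long-link distances from this PDF by inverse transform sampling (the derivation is mine, not from the slides): for p(x) = 1/(x ln n) on [1/n, 1], the CDF is F(x) = 1 + ln(x)/ln(n), so F⁻¹(u) = n^(u−1) for uniform u.

```python
import random

def draw_long_link_distance(n: int) -> float:
    """One draw from p(x) = 1/(x ln n) via inverse transform."""
    u = random.random()
    return n ** (u - 1)   # lies in [1/n, 1)

def long_link_targets(position: float, k: int, n: int) -> list[float]:
    """k long-link endpoints for a node at `position` on the unit ring."""
    return [(position + draw_long_link_distance(n)) % 1.0 for _ in range(k)]
```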

Network Size Estimation Protocol

Problem: What is the current value of n, the total number of nodes?

x = length of arc (between adjacent nodes on the ring)

1/x = estimate of n

3 arcs are enough.
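A sketch of the estimator, under the assumption that each arc between adjacent nodes on the unit ring has expected length 1/n:

```python
def estimate_network_size(arc_lengths: list[float]) -> float:
    """Inverse of the average observed arc length estimates n."""
    return len(arc_lengths) / sum(arc_lengths)

# three observed arcs of roughly 1/1000 each suggest n is about 1000
print(estimate_network_size([0.0011, 0.0009, 0.0010]))
```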

Page 13: Distributed Hash Table Algorithms

Step 0: Symphony

Symphony: “Draw from the PDF k times”

[Plot: probability distribution p(x) = 1/(x log n) over the distance to the long-distance neighbor, from 0 to 1]

Step 1: Step-Symphony

Step-Symphony: “Draw from the discretized PDF k times”

[Plot: discretized p(x) = 1/(x log n) over the distance to the long-distance neighbor, from 0 to 1]

Page 14: Distributed Hash Table Algorithms

Step 2: Divide PDF into log n Equal Bins

Step-Partitioned-Symphony: “Draw exactly once from each of k bins”

[Plot: p(x) divided into log n equal bins over the distance to the long-distance neighbor, from 0 to 1]

Step 3: Discrete PDF

Chord: “Draw exactly once from each of log n bins”; each bin is essentially a point.

[Plot: discrete PDF over the distance to the long-distance neighbor, from 0 to 1]
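The steps can be contrasted in code; this sketch (the unit ring and log-scale bins are assumptions of the sketch) shows Symphony's continuous draws, the once-per-bin variant, and Chord's bins collapsed to fixed finger distances 1/2, 1/4, …:

```python
import math
import random

def symphony_links(n: int, k: int) -> list[float]:
    """Symphony: k draws from p(x) = 1/(x ln n), via x = n**(u - 1)."""
    return [n ** (random.random() - 1) for _ in range(k)]

def step_partitioned_links(n: int, k: int) -> list[float]:
    """One draw from each of k log-scale bins [2^-(i+1), 2^-i]."""
    bins = random.sample(range(int(math.log2(n))), k)  # needs k <= log2 n
    return [2.0 ** -(i + random.random()) for i in bins]

def chord_fingers(n: int) -> list[float]:
    """Chord: every bin collapsed to a point, distances 1/2, 1/4, ..., 1/n."""
    return [2.0 ** -(i + 1) for i in range(int(math.log2(n)))]
```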

Page 15: Distributed Hash Table Algorithms

Two Optimizations

Bi-directional routing: exploit both outgoing and incoming links!

Route to the neighbor that minimizes absolute distance to destination

Reduces avg latency by 25-30%

1-Lookahead: list of neighbor’s neighbors

Reduces avg. latency by 40%


Also applicable to other DHTs

Symphony: Summary (1)

Distributed Hashing in a Small World

Like Chord:

Overlay structure: ring

Key ID space partitioning

Unlike Chord:

Routing table:

Two short links for immediate neighbors

k long distance links for jumping

Long distance links are built in a probabilistic way


Peers are selected using a Probability Distribution Function (pdf)

Exploit the characteristics of a small-world network

Dynamically estimate the current system size

Page 16: Distributed Hash Table Algorithms

Symphony: Summary (2)

Each node has k = O(1) long distance links

Lookup: expected path length O((log² N)/k) hops

Join & leave: expected O(log² N) messages

Compared with Chord: discards the strong requirements on the routing table (finger table) and relies on the small world to reach the destination.


Viceroy network

Arrange nodes and keys on a ring, as usual


Page 17: Distributed Hash Table Algorithms

Viceroy network

Assign to each node a level value chosen uniformly from the set {1, …, log n}

estimate n by taking the inverse of the distance of the node to its successor

easy to update


Viceroy network

Create a ring of nodes within the same level


Page 18: Distributed Hash Table Algorithms

Downward links

For a peer with key x at level i:

Direct successor peer on level i+1

Long link to peer at x + (½)^i on level i+1


Upward links

For each peer with key x at level i

Predecessor link on level i-1

Long link to peer at x - (½)^i on level i-1


Page 19: Distributed Hash Table Algorithms

Butterfly links

Each node x at level i has two downward links to level i+1:

a left link to the first node of level i+1 after position x on the ring

a right link to the first node of level i+1 after position x + (½)^i
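A minimal sketch of computing these links on the unit ring; the data layout (a sorted list of node positions per level) is an assumption of the sketch, not the full Viceroy protocol.

```python
from bisect import bisect_right

def first_after(positions: list[float], x: float) -> float:
    """First node position after x, wrapping around the ring."""
    i = bisect_right(positions, x)
    return positions[i % len(positions)]

def butterfly_links(x: float, level: int,
                    nodes_by_level: dict[int, list[float]]) -> dict[str, float]:
    links = {}
    below = nodes_by_level.get(level + 1)
    if below:
        links["down_left"] = first_after(below, x)
        links["down_right"] = first_after(below, (x + 0.5 ** level) % 1.0)
    above = nodes_by_level.get(level - 1)
    if above:
        links["up"] = first_after(above, x)
    return links
```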


Viceroy

Emulating the butterfly network

[Figure: butterfly network emulation over nodeIds 000–111, levels 1–4]

Logarithmic path lengths between any two nodes in the network

Constant degree per node

Page 20: Distributed Hash Table Algorithms

Viceroy Summary

Scalability: optimal peer state

Peer state O(1)

Avg. path length log(N)

Complex algorithm

Network proximity not taken into account


CAN: Overview

Early and successful algorithm (Sylvia Ratnasamy et al., 2001)

Simple & elegant: intuitive to understand and implement

Many improvements and optimizations exist

Main responsibilities:

CAN is a distributed system that maps keys onto values

Keys hashed into a d-dimensional space


Interface: insert(key, value)

retrieve(key)

Page 21: Distributed Hash Table Algorithms

Virtual d-dimensional Cartesian coordinate system on a d-torus

Example: 2-d [0,1] x [0,1]

CAN

Dynamically partitioned among all nodes

Pair (K,V) is stored by mapping key K to a point P in the space using a uniform hash function and storing (K,V) at the node in the zone containing P

Retrieve entry (K,V) by applying the same hash function to map K to P and retrieving the entry from the node in the zone containing P

If P is not contained in the zone of the requesting node or its neighboring zones, route the request to the neighbor node in the zone nearest P
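A small sketch of the key-to-point mapping; hashing the key once per dimension is an illustrative choice, not CAN's specified construction.

```python
import hashlib

def key_to_point(key: str, d: int = 2) -> tuple[float, ...]:
    """Map a key to a point on the d-dimensional unit torus."""
    coords = []
    for dim in range(d):
        digest = hashlib.sha1(f"{dim}:{key}".encode()).digest()
        coords.append(int.from_bytes(digest[:8], "big") / 2**64)
    return tuple(coords)

point = key_to_point("Purple Rain")  # store/lookup (K,V) at the zone containing it
```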

CAN

State of the system at time t

[Figure: 2-d coordinate space partitioned into zones; legend marks peers, resources, and one highlighted zone]

In this 2-dimensional space a key is mapped to a point (x,y)

Page 22: Distributed Hash Table Algorithms

CAN: Routing

d-dimensional space with n zones

2 zones are neighbours if d-1 dimensions overlap

[Figure: peer forwards query Q(x,y) across neighbouring zones toward the point (x,y)]

Algorithm: choose the neighbor nearest to the destination

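A minimal sketch of this greedy rule; representing each neighbor by its zone's center point is a simplification.

```python
import math
from typing import Sequence

def torus_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance on the unit d-torus (coordinates wrap)."""
    return math.sqrt(sum(min(abs(x - y), 1.0 - abs(x - y)) ** 2
                         for x, y in zip(a, b)))

def greedy_next_hop(neighbor_centers: list[Sequence[float]],
                    target: Sequence[float]) -> Sequence[float]:
    """Choose the neighbor nearest to the destination point."""
    return min(neighbor_centers, key=lambda c: torus_distance(c, target))

print(greedy_next_hop([(0.2, 0.5), (0.8, 0.7)], (0.9, 0.9)))  # (0.8, 0.7)
```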

CAN: Construction - Basic Idea


Page 23: Distributed Hash Table Algorithms

CAN: Construction

[Figure: a new node, outside the overlay, and the bootstrap node]

CAN: Construction

[Figure: the bootstrap node points the new node to node “I”]

1) Discover some node “I” already in CAN

Page 24: Distributed Hash Table Algorithms

CAN: Construction

2) Pick a random point (x,y) in the space

CAN: Construction

3) I routes to (x,y), discovers node J

Page 25: Distributed Hash Table Algorithms

CAN: Construction

4) Split J’s zone in half… the new node owns one half
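A sketch of the split in step 4, assuming a zone is an axis-aligned box of per-dimension (low, high) intervals; splitting along the longest dimension is one reasonable choice for the sketch.

```python
def split_zone(zone: list[tuple[float, float]]):
    """Halve a zone; return (J's remaining half, the new node's half)."""
    dim = max(range(len(zone)), key=lambda i: zone[i][1] - zone[i][0])
    low, high = zone[dim]
    mid = (low + high) / 2
    old, new = list(zone), list(zone)
    old[dim] = (low, mid)    # J keeps the lower half
    new[dim] = (mid, high)   # the joining node owns the upper half
    return old, new

j_zone = [(0.0, 0.5), (0.0, 1.0)]
print(split_zone(j_zone))  # ([(0.0,0.5),(0.0,0.5)], [(0.0,0.5),(0.5,1.0)])
```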

CAN-Improvement: Multiple Realities

Build several CAN-networks

Each network is called a reality

Routing:

Jump between realities

Choose the reality in which the distance is shortest


Page 26: Distributed Hash Table Algorithms

CAN-Improvement: Multiple Dimensions


CAN: Multiple Dimensions vs. Multiple Realities

More dimensions --> shorter paths

More realities --> more robustness

Trade-off?


Page 27: Distributed Hash Table Algorithms

CAN: Summary

Inferior scalability

Peer state O(d)

Avg. path length O(d · N^(1/d))

Useful for spatial data!


Spectrum of DHT Protocols

Topology                Protocol   #links                        latency
Deterministic           CAN        O(d)                          O(d · N^(1/d))
Deterministic           Chord      O(log N)                      O(log N)
Partly randomized       Viceroy    O(1)                          O(log N)
Partly randomized       Pastry     O((2^b - 1) · (log_2 N)/b)    O((log_2 N)/b)
Completely randomized   Symphony   O(k)                          O((log² N)/k)

Page 28: Distributed Hash Table Algorithms

Latency vs State Maintenance

[Figure: average latency (y-axis, 0–15) vs. number of TCP connections (x-axis, 0–60) for a network of n = 2^15 nodes; data points for Viceroy, CAN, Chord, Pastry, and Symphony]