
Sylvia Ratnasamy (UC Berkeley dissertation, 2002)

Paul Francis

Mark Handley

Richard Karp

Scott Shenker

A Scalable Content-Addressable Network

Slides by the authors above, with additions/modifications by Michael Isaacs (UCSB) and Tony Hannan (Georgia Tech)

Outline

• Introduction
• Design
• Design Improvements
• Evaluation
• Related Work
• Conclusion

Internet-scale hash tables

• Hash tables – essential building block in software systems

• Internet-scale distributed hash tables – equally valuable to large-scale distributed systems

• peer-to-peer systems – Napster, Gnutella, Groove, FreeNet, MojoNation…

• large-scale storage management systems – Publius, OceanStore, PAST, Farsite, CFS…

• mirroring on the Web

Content-Addressable Network (CAN)

• CAN: Internet-scale hash table

• Interface
– insert(key, value)
– value = retrieve(key)

• Properties
– scalable
– operationally simple
– good performance

• Related systems: Chord, Plaxton/Tapestry ...
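As a minimal sketch of how this interface looks to an application, the Python class below exposes insert/retrieve over a plain dictionary; the class name and the local dictionary stand in for the distributed machinery described in the rest of these slides and are assumptions, not part of the paper.

    # Minimal sketch of the CAN hash-table interface (insert / retrieve).
    # The local dict stands in for the distributed store: a real CAN node
    # would hash the key to a point and route the request to the owner.
    class CanHashTable:
        def __init__(self):
            self._store = {}

        def insert(self, key, value):
            self._store[key] = value      # in CAN: route (key, value) to the owning node

        def retrieve(self, key):
            return self._store.get(key)   # in CAN: route "retrieve(key)" to the owning node

    # Usage
    table = CanHashTable()
    table.insert("K1", "V1")
    assert table.retrieve("K1") == "V1"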

Problem Scope

• provide the hash table interface
• scalability
• robustness
• performance

Not in scope

• security
• anonymity
• higher level primitives
– keyword searching
– mutable content

Outline

• Introduction
• Design
• Design Improvements
• Evaluation
• Related Work
• Conclusion

CAN: basic idea

[Diagram sequence: a set of nodes, each holding part of a (key, value) hash table. insert(K1,V1) is issued at one node and routed to the node responsible for K1, which stores (K1,V1); retrieve(K1) is later routed to that same node.]

CAN: solution

• Virtual Cartesian coordinate space

• Entire space is partitioned amongst all the nodes – every node “owns” a zone in the overall space

• Abstraction
– can store data at “points” in the space
– can route from one “point” to another

• Point = node that owns the enclosing zone

CAN: simple example

[Diagram sequence: nodes 1, 2, 3, and 4 join one after another, each join splitting an existing zone, so that the 2-d coordinate space ends up partitioned among the four nodes.]

CAN: simple example

node I::insert(K,V)
(1) a = hx(K), b = hy(K)
(2) route (K,V) to point (a,b)
(3) the node owning the zone containing (a,b) stores (K,V)

node J::retrieve(K)
(1) a = hx(K), b = hy(K)
(2) route “retrieve(K)” to (a,b)

Data stored in the CAN is addressed by name (i.e. key), not location (i.e. IP address)
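As a concrete illustration of hx and hy in step (1) above, the sketch below derives two coordinates from one key by hashing it with two different salts; the use of SHA-1 and the salting scheme are assumptions made for this illustration only.

    import hashlib

    def _hash_to_unit(data: bytes) -> float:
        # Map a byte string onto [0, 1) using SHA-1 (160-bit digest).
        return int.from_bytes(hashlib.sha1(data).digest(), "big") / float(1 << 160)

    def key_to_point(key: str):
        a = _hash_to_unit(b"x:" + key.encode())   # a = hx(K)
        b = _hash_to_unit(b"y:" + key.encode())   # b = hy(K)
        return (a, b)

    # Usage: the same key always maps to the same point in the unit square,
    # so any node can compute where (K, V) lives.
    print(key_to_point("K1"))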


CAN: routing

[Diagram: a message addressed to point (a,b) is forwarded from the node at (x,y) hop by hop through neighboring zones until it reaches the zone containing (a,b); each node's routing table holds only its neighbors' zones.]

A node only maintains state for its immediate neighboring nodes
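A rough sketch of the forwarding step pictured above: each hop hands the message to the neighbor whose zone lies closest to the target point. Representing neighbors by their zone centers is a simplification made for this illustration.

    import math

    def next_hop(my_point, neighbors, target):
        # neighbors: dict mapping neighbor id -> center point of that neighbor's zone.
        # Greedy rule: forward to the neighbor closest to the target, but only
        # if it is closer than we are (otherwise the target is in our own zone).
        best = min(neighbors, key=lambda n: math.dist(neighbors[n], target))
        if math.dist(neighbors[best], target) < math.dist(my_point, target):
            return best
        return None

    # Usage: a node at (0.1, 0.1) with two neighbors, routing toward (0.9, 0.9).
    print(next_hop((0.1, 0.1), {"A": (0.4, 0.1), "B": (0.1, 0.4)}, (0.9, 0.9)))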

CAN: node insertion

1) New node discovers some node “I” already in the CAN
2) New node picks a random point (p,q) in the space
3) I routes to (p,q) and discovers node J, the owner of that zone
4) J's zone is split in half… the new node owns one half

Inserting a new node affects only a single other node and its immediate neighbors
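A sketch of steps 2-4 above, with zones represented as (xmin, xmax, ymin, ymax) boxes; this representation, and splitting along the longer side, are simplifications chosen for the illustration.

    import random

    def pick_random_point():
        # Step 2: the new node picks a random point (p, q) in the unit square.
        return (random.random(), random.random())

    def split_zone(zone):
        # Step 4: split the owning node J's zone in half; the new node takes one half.
        xmin, xmax, ymin, ymax = zone
        if (xmax - xmin) >= (ymax - ymin):            # split the longer side
            mid = (xmin + xmax) / 2
            return (xmin, mid, ymin, ymax), (mid, xmax, ymin, ymax)
        mid = (ymin + ymax) / 2
        return (xmin, xmax, ymin, mid), (xmin, xmax, mid, ymax)

    # Usage: J initially owns the right half of the space.
    j_keeps, new_owns = split_zone((0.5, 1.0, 0.0, 1.0))
    print(j_keeps, new_owns)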

CAN: node failures

• Need to repair the space

– recover database
• state updates
• use replication, rebuild database from replicas

– repair routing
• takeover algorithm

CAN: takeover algorithm

• Nodes periodically send update messages to their neighbors

• An absent update message indicates a failure

– Each node sends a recover message to the takeover node

– The takeover node takes over the failed node's zone

CAN: takeover node

• The takeover node is the node whose VID is numerically closest to the failed node's VID.

• The VID is a binary string indicating the path from the root to the zone in the binary partition tree.

[Diagram: binary partition tree of the coordinate space; the zones carry VIDs such as 0, 10, 100, 101, and 11.]
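One way to make “numerically closest VID” concrete is to read each VID as a binary fraction and compare those values; this reading, and the helper below, are assumptions made for illustration rather than the thesis's exact definition.

    def vid_value(vid: str) -> float:
        # Read a VID such as "101" as the binary fraction 0.101 = 0.625.
        return sum(int(bit) / (2 ** (i + 1)) for i, bit in enumerate(vid))

    def takeover_node(failed_vid: str, candidate_vids):
        # Pick the candidate whose VID value is numerically closest to the failed node's.
        return min(candidate_vids, key=lambda v: abs(vid_value(v) - vid_value(failed_vid)))

    # Usage with the zones from the diagram above:
    print(takeover_node("100", ["0", "101", "11"]))   # -> "101"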

CAN: VID predecessor / successor

• To handle multiple neighbors failing simultaneously, each node maintains its predecessor and successor in the VID ordering (idea borrowed from Chord).

• This guarantees that the takeover node can always be found.

Only the failed node’s immediate neighbors are required for recovery


Basic Design recap

• Basic CAN
– completely distributed
– self-organizing
– nodes only maintain state for their immediate neighbors

Outline

• Introduction
• Design
• Design Improvements
• Evaluation
• Related Work
• Conclusion

Design Improvement Objectives

• Reduce latency of CAN routing
– Reduce CAN path length
– Reduce per-hop latency (IP path length/latency per CAN hop)

• Improve CAN robustness
– data availability
– routing

Design Improvement Tradeoffs

Routing performance and system robustness

vs.

Increased per-node state and system complexity

Design Improvement Techniques

1. Increase number of dimensions in coordinate space

2. Multiple coordinate spaces (realities)

3. Better CAN routing metrics (measure & consider time)

4. Overloading coordinate zones (peer nodes)

5. Multiple hash functions

6. Geographically sensitive construction of the CAN overlay network (distributed binning)

7. More uniform partitioning (when new nodes enter)

8. Caching & Replication techniques for “hot-spots”

9. Background load balancing algorithm

1. Increase dimensions

2. Multiple realities

• Multiple (r) coordinate spaces

• Each node maintains a different zone in each reality

• Contents of hash table replicated on each reality

• Routing chooses neighbor closest in any reality (can make large jumps towards target)

• Advantages:
– Greater data availability
– Fewer hops

Multiple realities

Multiple dimensions vs realities

3. Time-sensitive routing

• Each node measures round-trip time (RTT) to each of its neighbors

• Routing message is forwarded to the neighbor with the maximum ratio of progress to RTT

• Advantage:
– reduces per-hop latency (25-40% depending on # dimensions)
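A sketch of the rule above: among the neighbors that make progress toward the destination, pick the one with the largest ratio of progress to measured RTT. The neighbor-table layout and the RTT numbers are illustrative assumptions.

    import math

    def rtt_weighted_next_hop(my_point, neighbors, target):
        # neighbors: dict mapping neighbor id -> (zone center point, measured RTT in ms).
        my_dist = math.dist(my_point, target)
        best, best_ratio = None, 0.0
        for nid, (point, rtt) in neighbors.items():
            progress = my_dist - math.dist(point, target)   # how much closer this hop gets us
            if progress > 0 and progress / rtt > best_ratio:
                best, best_ratio = nid, progress / rtt
        return best   # None if no neighbor makes progress

    # Usage: A makes more progress, but B's much lower RTT gives it the better ratio.
    neighbors = {"A": ((0.6, 0.6), 80.0), "B": ((0.4, 0.4), 10.0)}
    print(rtt_weighted_next_hop((0.1, 0.1), neighbors, (0.9, 0.9)))   # -> "B"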

4. Zone overloading

• Node maintains list of “peers” in same zone

• A zone is not split until it already holds MAXPEERS nodes

• Only the lowest-RTT peer in each neighboring zone is maintained as a neighbor

• Data is replicated or split across a zone's peers

• Advantages

– Reduced per hop latency

– Reduced path length (because sharing zones has same effect as reducing total number of nodes)

– Improved fault tolerance (a zone is vacant only when all nodes fail)
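A small sketch of the overloading rule described above: a joining node simply becomes another peer of the zone it lands in until the zone already holds MAXPEERS nodes, and only then does the usual split happen. The constant and data structures are assumptions for illustration.

    MAXPEERS = 4   # illustrative threshold; in practice a system parameter

    def handle_join(zone_peers, new_node):
        # zone_peers: list of node ids currently sharing (overloading) this zone.
        if len(zone_peers) < MAXPEERS:
            zone_peers.append(new_node)   # overload the zone instead of splitting
            return "added-as-peer"
        return "split-zone"               # fall back to the basic zone split

    # Usage
    peers = ["n1", "n2", "n3"]
    print(handle_join(peers, "n4"))   # -> "added-as-peer"
    print(handle_join(peers, "n5"))   # -> "split-zone"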

Zone overload

Avg over 2^8 to 2^18 nodes

5. Multiple hash

• (key, value) mapped to “k” multiple points (and nodes)

• Parallel query execution on “k” paths

• Or choose closest one in space

• Advantages:
– improve data availability
– reduce path length
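To illustrate the k-hash-function idea, the same key can be salted with a replica index so that it maps to k different points, and hence k different nodes; the salting scheme is an assumption, mirroring the earlier key_to_point sketch.

    import hashlib

    def key_to_points(key: str, k: int):
        # Map one key to k points by salting it with the replica index 0..k-1.
        points = []
        for i in range(k):
            digest = hashlib.sha1(f"{i}:{key}".encode()).digest()   # 20 bytes
            a = int.from_bytes(digest[:10], "big") / float(1 << 80)
            b = int.from_bytes(digest[10:], "big") / float(1 << 80)
            points.append((a, b))
        return points

    # Usage: (K, V) is stored at all k points; a query can go to all of them
    # in parallel or just to the point closest to the querier.
    for p in key_to_points("K1", k=3):
        print(p)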

Multiple hash functions

6. Geographically distributed binning

• Distributed binning of CAN nodes based on their relative distances from m IP landmarks (such as root DNS servers)

• m! partitions of coordinate space

• New node joins partition associated with its landmark ordering

• Topologically close nodes will be assigned to the same partition

• Nodes in same partition divide the space up randomly

• Disadvantage: coordinate space is no longer uniformly populated

– background load balancing could alleviate this problem
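A sketch of the binning step above: a joining node orders the m landmarks by measured RTT and uses that ordering (one of m! possibilities) as its bin. Landmark names and RTT values below are purely illustrative.

    def landmark_bin(rtts):
        # rtts: dict mapping landmark name -> RTT measured from this node (ms).
        # The bin is the ordering of landmarks from nearest to farthest.
        return tuple(sorted(rtts, key=rtts.get))

    # Usage: two topologically close nodes usually measure the same ordering,
    # so they land in the same partition of the coordinate space.
    node_a = {"L1": 20.0, "L2": 75.0, "L3": 140.0}
    node_b = {"L1": 25.0, "L2": 80.0, "L3": 150.0}
    print(landmark_bin(node_a) == landmark_bin(node_b))   # -> True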

Distributed binning

7. More uniform partitioning

• When adding a node, the target zone's owner splits whichever zone is largest among its own and its neighbors' zones

• Advantage
– 85% of nodes have the average load, as opposed to 40% without this technique, and the largest zone volume drops from 8x to 2x the average

8. Caching & Replication

• A node can cache values of keys it has recently requested.

• A node can push out “hot” keys to its neighbors.

9. Load balancing

• Merge and split zones to balance loads

Outline

• Introduction
• Design
• Design Improvements
• Evaluation
• Related Work
• Conclusion

Evaluation Metrics

1. Path length
2. Neighbor state
3. Latency
4. Volume
5. Routing fault tolerance
6. Data availability

CAN: scalability

• For a uniformly partitioned space with n nodes and d dimensions:
– per node, number of neighbors is 2d
– average routing path is O(d * n^(1/d)) hops

• Simulations show that the above results hold in practice (Transit-Stub topologies from the GT-ITM generator)

• Can scale the network without increasing per-node state
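As a worked example of these formulas (the numbers are chosen for illustration, not taken from the paper): with n = 2^20 nodes and d = 10 dimensions, each node keeps only 2d = 20 neighbors, while the average path grows on the order of d * n^(1/d) = 10 * 2^2 = 40 hops; doubling n leaves the per-node state unchanged and lengthens the path only slightly.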

CAN: latency

• latency stretch = (CAN routing delay) / (IP routing delay)

• To reduce latency, apply design improvements:
– increase dimensions (d)
– multiple nodes per zone (peer nodes) (p)
– multiple “realities”, i.e. coordinate spaces (r)
– multiple hash functions (k)
– use RTT-weighted routing
– distributed binning
– uniform partitioning
– caching & replication (for “hot spots”)

Overall improvement

CAN: Robustness

• Completely distributed – no single point of failure

• Not exploring database recovery

• Resilience of routing
– can route around trouble

Routing resilience

[Diagram sequence: a message travels from source to destination; when nodes along the greedy path fail, the message is routed around the failed zones and still reaches the destination.]

• Node X::route(D)
– if X cannot make progress to D:
• check if any neighbor of X can make progress
• if yes, forward the message to one such neighbor
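A sketch of this fallback layered on the greedy rule from the earlier routing sketch: if no live neighbor of X is closer to D, X forwards to any neighbor that reports it can still make progress. The can_make_progress query is a hypothetical helper standing in for that check.

    import math

    def route_with_fallback(my_point, neighbors, target, can_make_progress):
        # neighbors: dict mapping live neighbor id -> center point of its zone.
        # can_make_progress(neighbor_id, target): hypothetical query asking a
        # neighbor whether it can get the message closer to the target.
        my_dist = math.dist(my_point, target)
        closer = [n for n, p in neighbors.items() if math.dist(p, target) < my_dist]
        if closer:
            # Normal case: greedy forwarding toward the destination.
            return min(closer, key=lambda n: math.dist(neighbors[n], target))
        # Fallback from the rule above: X itself cannot make progress, so hand
        # the message to any neighbor that says it can.
        for n in neighbors:
            if can_make_progress(n, target):
                return n
        return None   # message cannot currently be delivered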


Outline

• Introduction
• Design
• Design Improvements
• Evaluation
• Related Work
• Conclusion

Related Algorithms

• Distance Vector (DV) and Link State (LS) in IP routing
– require widespread dissemination of topology
– not well suited to dynamic topologies like file-sharing networks

• Hierarchical routing algorithms
– too centralized, single points of stress

• Other DHTs
– Plaxton, Tapestry, Pastry, Chord

Related Systems

• DNS

• OceanStore (uses Tapestry)

• Publius: Web-publishing system with high anonymity

• P2P file sharing systems
– Napster, Gnutella, KaZaA, Freenet

Outline

• Introduction
• Design
• Design Improvements
• Evaluation
• Related Work
• Conclusion

Conclusion: Purpose

• CAN
– an Internet-scale hash table
– potential building block in Internet applications

Conclusion: Scalability

• O(d) per-node state

• O(d * n^(1/d)) path length

For d = (log n)/2, path length and per-node state are O(log n), as in other DHTs
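A quick check of this claim: with d = (log₂ n)/2, n^(1/d) = 2^((log₂ n)·(2/log₂ n)) = 2^2 = 4, so the path length d·n^(1/d) = 2·log₂ n and the per-node state 2d = log₂ n, both O(log n).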

Conclusion: Latency

• With the main design improvements, latency is less than twice the IP path latency

Conclusion: Robustness

• Decentralized, can route around trouble

• With certain design improvements (replication), high data availability

Weaknesses / Future Work

• Security
– Denial-of-service attacks
• hard to combat because a malicious node can act as a malicious server as well as a malicious client

• Mutable content

• Search