Implementation and Deployment of a Large-scale Network Infrastructure

Ben Y. Zhao, L. Huang, S. Rhea, J. Stribling, A. D. Joseph, J. D. Kubiatowicz
EECS, UC Berkeley
HP Labs, November 2002


Next-generation Network Applications

Scalability in applications
- Processes/threads on a single-node server
- Cluster (LAN): fast, reliable, unlimited communication
- Next step: scaling to the wide area

Complexities of global deployment
- Network unreliability: BGP slow convergence, redundancy unexploited
- Lack of administrative control over components constrains protocol deployment: multicast, congestion control
- Management of large-scale resources/components: locate and utilize resources despite failures


Enabling Technology: DOLR (Decentralized Object Location and Routing)

[Figure: a DOLR overlay cloud routing messages to objects identified by GUIDs (GUID1, GUID2), independent of where their replicas live]


A Solution

Decentralized Object Location and Routing (DOLR): a wide-area overlay application infrastructure
- Self-organizing, scalable
- Fault-tolerant routing and object location
- Efficient (bandwidth, latency) data delivery
- Extensible, supports application-specific protocols

Recent work: Tapestry, Chord, CAN, Pastry, Kademlia, Viceroy, …


What is Tapestry?

The DOLR driving OceanStore global storage (Zhao, Kubiatowicz, Joseph et al. 2000)

Network structure
- Nodes are assigned bit-sequence nodeIds from the namespace 0 - 2^160, interpreted in some radix (e.g. 16)
- Keys come from the same namespace
- Each key dynamically maps to exactly one unique live node: its root

Base API
- Publish / Unpublish (Object ID)
- RouteToNode (NodeId)
- RouteToObject (Object ID)


Tapestry Mesh: Incremental prefix-based routing

[Figure: a routing mesh of nodes with hexadecimal NodeIDs (e.g. 0x43FE, 0xEF31, 0xEF32, 0xEF34, 0xEF37, 0x0921, ...); links are labeled with routing levels 1-4, and each hop resolves one more digit of the destination's prefix]


Routing Mesh

Routing via local routing tables
- Based on incremental prefix matching
- Example: 5324 routes to 0629 via 5324 -> 0231 -> 0667 -> 0625 -> 0629
- At the nth hop, the local node matches the destination in at least n digits (if any such node exists)
- The ith entry in the nth-level routing table points to the nearest node matching prefix(local_ID, n) + i (sketched in code below)

Properties
- At most log(N) overlay hops between nodes
- Routing table size: b log(N)
- Actual entries keep c-1 backups, for a total size of c b log(N)
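
The next-hop rule can be made concrete with a small Java sketch (not the deployed code): it assumes fixed-length hex nodeIds and a routingTable[level][digit] array of neighbor IDs, with null marking entries for which no matching node is known; the class and method names are illustrative.

    // Minimal sketch of Tapestry-style prefix routing.
    public class PrefixRouter {
        private final String localId;            // e.g. "43FE"
        private final String[][] routingTable;   // [level][digit] -> neighbor ID, or null

        public PrefixRouter(String localId, String[][] routingTable) {
            this.localId = localId;
            this.routingTable = routingTable;
        }

        // Length of the prefix shared by the local ID and the destination.
        private int sharedPrefixLength(String dest) {
            int n = 0;
            while (n < localId.length() && localId.charAt(n) == dest.charAt(n)) n++;
            return n;
        }

        // Next hop: at the level of the shared prefix, pick the entry for the
        // destination's next digit (null means no such node is known).
        public String nextHop(String dest) {
            if (dest.equals(localId)) return localId;     // we are the destination
            int level = sharedPrefixLength(dest);
            int digit = Character.digit(dest.charAt(level), 16);
            return routingTable[level][digit];
        }
    }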


Object Location: Randomization and Locality


Object Location

Distribute replicas of object references
- Only references, not the data itself (a level of indirection)
- Place more of them closer to the object itself

Publication
- Place object location pointers into the network
- Store them at the hops between the object and its "root" node

Location
- Route a message towards the root from the client
- Redirect to the object when a location pointer is found (see the toy example below)
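
A toy Java model of publish and locate, under heavy assumptions: overlay routing is replaced by two hard-coded paths that converge at an invented root (4378), and only location pointers are stored, never data. It is meant only to show how a lookup can short-circuit at the first pointer it meets.

    import java.util.*;

    public class ObjectLocationToy {
        // node ID -> GUIDs for which this node holds a location pointer
        static final Map<String, Set<String>> pointers = new HashMap<>();

        // Stand-in for overlay routing: precomputed paths toward the object's root (4378).
        static final Map<String, List<String>> pathToRoot = Map.of(
            "4228", List.of("4228", "42A1", "4361", "4378"),   // server's path
            "197E", List.of("197E", "1B4F", "4361", "4378"));  // client's path

        // Publish: leave a pointer to the object at every hop from server to root.
        static void publish(String server, String guid) {
            for (String hop : pathToRoot.get(server))
                pointers.computeIfAbsent(hop, k -> new HashSet<>()).add(guid);
        }

        // Locate: route toward the root; redirect as soon as a pointer is found.
        static String locate(String client, String guid) {
            for (String hop : pathToRoot.get(client))
                if (pointers.getOrDefault(hop, Set.of()).contains(guid))
                    return hop;        // first node holding a pointer (here: 4361)
            return null;               // not reached if the object was published
        }

        public static void main(String[] args) {
            publish("4228", "GUID1");
            System.out.println("pointer found at " + locate("197E", "GUID1"));
        }
    }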


Node Insertion

Inserting a new node N
- Notify need-to-know nodes of N; N fills null entries in their routing tables
- Move locally rooted object references to N
- Construct a locally optimal routing table for N
- Notify nodes near N so they can optimize their tables

Two-phase node insertion
- Acknowledged multicast
- Nearest neighbor approximation


Acknowledged Multicast

Reach the need-to-know nodes of N (e.g. 3111):
- Add N to their routing tables
- Move root object references to N

[Figure: acknowledged multicast from gateway node G on behalf of new node N, reaching nodes 3211, 3229, 3013, 3222, 3205, 3023, 3022, 3021]


Nearest Neighbor

N iterates, starting with list = the need-to-know nodes and L = prefix(N, S), the length of the prefix N shares with its surrogate S (sketched below):
- Measure distances to the nodes in list; use them to fill routing table level L
- Trim list to the k closest nodes; set list = the backpointers of that k-set; decrement L
- Repeat until L == 0

[Figure: the new node N measuring distances to the need-to-know nodes]
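
A rough Java sketch of this loop, with invented helper signatures (distanceTo, backpointers) standing in for real RTT measurements and backpointer queries to live nodes; it is not the Tapestry implementation.

    import java.util.*;
    import java.util.function.*;

    public class NearestNeighborSketch {
        // needToKnow:   nodes reached by the acknowledged multicast
        // startLevel:   L = length of the prefix shared by N and its surrogate S
        // distanceTo:   measured network distance from the joining node N
        // backpointers: nodes that point at a given node in their routing tables
        // k:            how many closest nodes to keep per iteration
        static Map<Integer, List<String>> fillLevels(
                Set<String> needToKnow, int startLevel,
                ToDoubleFunction<String> distanceTo,
                Function<String, Set<String>> backpointers, int k) {
            Map<Integer, List<String>> levels = new HashMap<>();
            List<String> list = new ArrayList<>(needToKnow);
            for (int level = startLevel; level >= 0; level--) {
                list.sort(Comparator.comparingDouble(distanceTo));  // measure distances
                levels.put(level, new ArrayList<>(list));           // fill this level
                List<String> closest = list.subList(0, Math.min(k, list.size()));
                Set<String> next = new HashSet<>();
                for (String node : closest) next.addAll(backpointers.apply(node));
                list = new ArrayList<>(next);                       // descend one level
            }
            return levels;
        }
    }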


Talk Outline

Algorithms

Architecture

Architectural components

Extensibility API

Evaluation

Ongoing Projects

Conclusion


Single Tapestry Node

[Figure: layered architecture of a single Tapestry node. From bottom to top: Transport Protocols; Network Link Management; Router (Routing Table & Object Pointer DB) alongside Dynamic Node Management; Application Interface / Upcall API; applications such as Decentralized File Systems, Application-Level Multicast, and Approximate Text Matching]


Single Node Implementation

[Figure: single-node implementation. Applications sit on the Application Programming Interface; below it the dynamic Tapestry core (Core Router, Patchwork) and the Network Stage with its Distance Map run on the SEDA event-driven framework over the Java Virtual Machine. Interactions shown include enter/leave Tapestry, state maintenance, node insert/delete, routing link maintenance, messages, UDP pings, route-to-node/object requests, API calls, upcalls, and fault-detection heartbeat messages]


Message Routing

Router: fast routing to nodes / objects

On receiving a new route message:
- If the application has registered an upcall, signal the application's upcall handler
- Otherwise, forward to nextHop(h+1, G)

On receiving a new location message:
- If this node has no object pointers for G, forward to nextHop(h+1, G)
- If it has pointers and an upcall is registered, signal the application's upcall handler
- Otherwise, forward to nextHop(0, obj), i.e. route directly toward the object's server

(This dispatch logic is sketched in code below.)
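
A schematic of that dispatch logic in Java; the method names (hasUpcall, hasObjectPointers, serverFor, nextHop) are placeholders, not the actual Tapestry classes.

    abstract class RouterSketch {
        abstract boolean hasUpcall(Message m);            // app registered a forward() upcall?
        abstract boolean hasObjectPointers(String guid);  // local pointer DB has the object?
        abstract String serverFor(String guid);           // node ID stored in the pointer
        abstract void signalUpcall(Message m);
        abstract void forward(Message m, String node);
        abstract String nextHop(int hop, String id);      // routing-table lookup

        void onRouteMsg(Message m, int h) {
            if (hasUpcall(m)) signalUpcall(m);                 // let the app intervene
            else forward(m, nextHop(h + 1, m.destGuid));       // keep routing toward G
        }

        void onLocationMsg(Message m, int h) {
            if (!hasObjectPointers(m.destGuid)) {
                forward(m, nextHop(h + 1, m.destGuid));        // no pointer: keep climbing
            } else if (hasUpcall(m)) {
                signalUpcall(m);                               // app decides what to do
            } else {
                forward(m, nextHop(0, serverFor(m.destGuid))); // redirect toward the server
            }
        }
    }

    class Message { String destGuid; }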


Extensibility API

deliver(G, Aid, Msg)
- Invoked at the message's destination
- Asynchronous, returns immediately

forward(G, Aid, Msg)
- Invoked at each intermediate hop in the route
- No action taken by default; the application calls route()

route(G, Aid, Msg, NextHopNodeSet)
- Called by the application to request that the message be routed to the given set of next-hop nodes (rendered as Java interfaces below)
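
One way to picture the API is as a pair of Java interfaces; the interface and type names below are placeholders (IDs shown as plain Strings), not the real Tapestry signatures.

    import java.util.Set;

    interface TapestryApplication {
        // deliver(G, Aid, Msg): invoked at the destination; asynchronous, returns immediately.
        void deliver(String guid, String appId, byte[] msg);

        // forward(G, Aid, Msg): invoked at an intermediate hop; nothing happens by default,
        // so the application is expected to call route() to keep the message moving.
        void forward(String guid, String appId, byte[] msg);
    }

    interface TapestryRouter {
        // route(G, Aid, Msg, NextHopNodeSet): request that the message be routed
        // to the given set of next-hop nodes.
        void route(String guid, String appId, byte[] msg, Set<String> nextHopNodes);
    }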


Local Operations

Access to Tapestry-maintained state:

nextHopSet = Llookup(G, num)
- Accesses the routing table
- Returns up to num candidates for the next hop towards G

objReferenceSet = Lsearch(G, num)
- Searches the object references stored for G
- Returns up to num references to the object, sorted by increasing network distance

(A hypothetical use of these operations appears below.)
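
As a hypothetical example, a forward() upcall for object location could try Lsearch first and fall back to Llookup; everything in this Java sketch besides the two operation names is invented.

    import java.util.*;

    class LocateUpcall {
        interface LocalOps {
            List<String> Llookup(String guid, int num);  // up to num next-hop candidates
            List<String> Lsearch(String guid, int num);  // up to num references, nearest first
        }

        private final LocalOps local;

        LocateUpcall(LocalOps local) { this.local = local; }

        // Returns the next hops this node should route the location message to.
        Set<String> onForward(String guid) {
            List<String> refs = local.Lsearch(guid, 1);
            if (!refs.isEmpty()) return Set.of(refs.get(0));   // redirect to the object's server
            return new HashSet<>(local.Llookup(guid, 1));      // no pointer here: keep climbing
        }
    }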


Deployment Status

C simulator
- Packet-level simulation
- Scales up to 10,000 nodes

Java implementation
- 50,000 semicolons of Java, 270 class files
- Deployed on a local-area cluster (40 nodes)
- Deployed on the PlanetLab global network (~100 distributed nodes)


Talk Outline

Algorithms

Architecture

Evaluation

Micro-benchmarks

Stable network performance

Single and parallel node insertion

Ongoing Projects

Conclusion


Micro-benchmark Methodology

- Experiment run on a LAN with Gigabit Ethernet
- Sender sends 60,001 messages at full speed
- Measure inter-arrival time for the last 50,000 messages: discarding the first 10,000 removes cold-start effects, and averaging over 50,000 removes network jitter effects

[Figure: sender and receiver, each running control logic over a Tapestry instance, connected by a LAN link]


Micro-benchmark Results

[Figure: two plots versus message size (0.01-10,000 KB). "Message Processing Latency": time per message (ms), 0.01-100. "Sustainable Throughput": throughput (MB/s), 0-30]

- Constant per-message processing overhead of ~50 µs
- Latency dominated by byte copying
- For 5 KB messages, throughput is ~10,000 msgs/sec


Large Scale Methodology

PlanetLab global network
- 98 machines at 42 institutions, in North America, Europe, and Australia (~60 machines utilized)
- 1.26 GHz PIII (1 GB RAM), 1.8 GHz PIV (2 GB RAM)
- North American machines (2/3) on Internet2

Tapestry Java deployment
- 6-7 nodes on each physical machine
- IBM Java JDK 1.3.0
- Node virtualization inside the JVM and SEDA
- Scheduling between virtual nodes increases latency


Node to Node Routing

- Ratio of end-to-end overlay routing latency to the shortest ping distance between nodes (the relative delay penalty, RDP)
- All node pairs measured, placed into buckets

[Figure: RDP (min, median, 90th percentile) versus internode RTT ping time in 5 ms buckets (0-300 ms); RDP axis 0-35. Annotation: median = 31.5, 90th percentile = 135]


Object Location

- Ratio of end-to-end object location latency to the shortest ping distance between the client and the object's location
- Each node publishes 10,000 objects; lookups performed on all objects

[Figure: location RDP (min, median, 90th percentile) versus client-to-object RTT ping time in 1 ms buckets (0-200 ms); RDP axis 0-25. Annotation: 90th percentile = 158]


Latency to Insert Node

Latency to dynamically insert a node into an existing Tapestry, as a function of the size of the existing network

Humps due to expected filling of each routing level

[Figure: integration latency (ms), 0-2000, versus size of the existing network, 0-500 nodes]


Bandwidth to Insert Node

Bandwidth cost of dynamically inserting a node into the Tapestry, amortized over each node in the network

Per node bandwidth decreases with size of network

[Figure: control traffic bandwidth per node (KB), 0-1.4, versus size of the existing network, 0-400 nodes]


Parallel Insertion Latency

Latency to dynamically insert nodes in unison into an existing Tapestry of 200 nodes

Shown as a function of the ratio of insertion group size to network size

[Figure: latency to convergence (ms), 0-20,000, versus ratio of insertion group size to network size, 0-0.3. Annotation: 90th percentile = 55,042]


Results Summary

Lessons learned
- Node virtualization causes resource contention
- Accurate network distances are hard to measure

Efficiency verified
- Message processing ~50 µs, throughput ~10,000 msgs/sec
- Routing to nodes/objects is a small factor over optimal

Algorithmic scalability
- Per-node latency/bandwidth scale sublinearly with network size
- Parallel insertion scales linearly with group size


Talk Outline

Algorithms

Architecture

Evaluation

Ongoing Projects

P2P landmark routing: Brocade

Applications: Shuttle, Interweave, ATA

Conclusion


State of the Art Routing

High-dimensionality and coordinate-based P2P routing: Tapestry, Pastry, Chord, CAN, etc.
- Sub-linear storage and number of overlay hops per route
- Properties dependent on random name distribution
- Optimized for uniform mesh-style networks


Reality

[Figure: a P2P overlay network spanning autonomous systems AS-1, AS-2, and AS-3, with source S and receiver R]

Transit-stub topology, disparate resources per node

Result: inefficient inter-domain routing (bandwidth, latency)


Landmark Routing on P2P: Brocade
- Exploit non-uniformity
- Minimize wide-area routing hops / bandwidth

Secondary overlay on top of Tapestry
- Select super-nodes by administrative domain; divide the network into cover sets
- Super-nodes form a secondary Tapestry and advertise their cover sets as local objects
- Brocade routes directly into the destination's local network, then resumes p2p routing (see the sketch below)
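
A rough Java sketch of the shortcut decision; all names here (BrocadeSketch, Overlay, localCoverSet) are invented, and in the real system the remote super-node is found by locating the destination's cover set as an object on the secondary Tapestry.

    import java.util.Set;

    class BrocadeSketch {
        interface Overlay { void route(String dest, byte[] msg); }

        private final Set<String> localCoverSet;   // node IDs in this administrative domain
        private final Overlay tapestry;            // primary overlay
        private final Overlay brocade;             // secondary overlay among super-nodes

        BrocadeSketch(Set<String> cover, Overlay tapestry, Overlay brocade) {
            this.localCoverSet = cover;
            this.tapestry = tapestry;
            this.brocade = brocade;
        }

        void send(String dest, byte[] msg) {
            if (localCoverSet.contains(dest)) {
                tapestry.route(dest, msg);  // destination is local: plain p2p routing
            } else {
                brocade.route(dest, msg);   // tunnel between super-nodes, then resume p2p routing
            }
        }
    }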


Brocade Routing

[Figure: the P2P network spanning AS-1, AS-2, and AS-3 with a Brocade layer of super-nodes above it; the original route and the Brocade route from source S to destination D are compared]


Applications under Development

OceanStore: global resilient file store

Shuttle
- Decentralized P2P chat service
- Leverages Tapestry for fault-tolerant routing

Interweave
- Keyword-searchable file-sharing utility
- Fully decentralized, exploits network locality

Approximate Text Addressing
- Uses text fingerprinting to map similar documents to single IDs
- Killer app: decentralized spam mail filter


For More Information

Tapestry and related projects (and these slides): http://www.cs.berkeley.edu/~ravenben/tapestry

OceanStore: http://oceanstore.cs.berkeley.edu

Related papers: http://oceanstore.cs.berkeley.edu/publications and http://www.cs.berkeley.edu/~ravenben/publications

[email protected]