
Implementation and Deployment of a Large-scale Network Infrastructure

Ben Y. Zhao, L. Huang, S. Rhea, J. Stribling, A. D. Joseph, J. D. Kubiatowicz
EECS, UC Berkeley

HP Labs, November 2002


Next-generation Network Applications

Scalability in applications
- Processes/threads on a single-node server
- Cluster (LAN): fast, reliable, unlimited communication
- Next step: scaling to the wide area

Complexities of global deployment
- Network unreliability: BGP slow convergence, redundancy unexploited
- Lack of administrative control over components constrains protocol deployment: multicast, congestion control
- Management of large-scale resources/components: locate and utilize resources despite failures


Enabling Technology: DOLR (Decentralized Object Location and Routing)

[Figure: objects identified by GUIDs (GUID1, GUID2) are published to and located through the DOLR layer]


A Solution

Decentralized Object Location and Routing (DOLR): a wide-area overlay application infrastructure
- Self-organizing, scalable
- Fault-tolerant routing and object location
- Efficient (bandwidth, latency) data delivery
- Extensible, supports application-specific protocols

Recent work: Tapestry, Chord, CAN, Pastry, Kademlia, Viceroy, …


What is Tapestry?

The DOLR driving OceanStore global storage (Zhao, Kubiatowicz, Joseph et al. 2000)

Network structure
- Nodes are assigned bit-sequence nodeIds from the namespace 0 to 2^160, interpreted in some radix (e.g. 16)
- Keys are drawn from the same namespace
- Each key dynamically maps to one unique live node: its root

Base API (a minimal sketch follows)
- Publish / Unpublish (ObjectID)
- RouteToNode (NodeId)
- RouteToObject (ObjectID)
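A minimal Java sketch of this base API, with names and signatures assumed for illustration rather than taken from the actual Tapestry interface:

    // Hypothetical sketch of the DOLR base API; names and types are illustrative.
    public interface Dolr {
        // Announce that this node holds a replica of the object.
        void publish(byte[] objectId);
        // Withdraw a previously published replica.
        void unpublish(byte[] objectId);
        // Route a message to the unique live node responsible for nodeId.
        void routeToNode(byte[] nodeId, byte[] message);
        // Route a message to a (nearby) replica of the object.
        void routeToObject(byte[] objectId, byte[] message);
    }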


Tapestry Mesh: Incremental Prefix-based Routing

[Figure: a routing mesh of nodes with hexadecimal nodeIds (0x43FE, 0xEF31, 0xEFBA, 0x0921, 0xE932, 0xEF37, 0xE324, 0xEF97, 0xEF32, 0xFF37, 0xE555, 0xE530, 0xEF44, 0x0999, 0x099F, 0xE399, 0xEF40, 0xEF34); links are labeled with their routing level (1-4), and each hop resolves one more digit of the destination's prefix]


Routing Mesh

Routing via local routing tables, based on incremental prefix matching (next-hop lookup sketched below)
- Example: 5324 routes to 0629 via 5324 -> 0231 -> 0667 -> 0625 -> 0629
- At the nth hop, the local node matches the destination in at least n digits (if any such node exists)
- The ith entry in the nth-level routing table points to the nearest node matching prefix(local_ID, n) + i

Properties
- At most log_b(N) overlay hops between nodes
- Routing table size: b * log_b(N)
- Each entry keeps c-1 backups, for a total size of c * b * log_b(N)
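A simplified sketch of that prefix-based next-hop lookup, assuming hexadecimal digits and a plain two-dimensional table (an illustration, not the actual Tapestry router):

    // Simplified sketch of prefix-based next-hop selection over an assumed table
    // layout; not the actual Tapestry router. IDs are strings of hex digits.
    public final class PrefixRouter {
        // table[n][d]: nearest known node that matches the local ID in its first
        // n digits and has digit d in position n (null if no such node is known).
        private final String[][] table;
        private final String localId;

        public PrefixRouter(String localId, String[][] table) {
            this.localId = localId;
            this.table = table;
        }

        // Next hop toward dest; returns localId when this node is the root.
        public String nextHop(String dest) {
            int n = sharedPrefixLength(localId, dest);
            if (n == dest.length()) return localId;          // exact match: we own it
            int digit = Character.digit(dest.charAt(n), 16);
            String candidate = table[n][digit];
            return candidate != null ? candidate : localId;  // surrogate routing omitted
        }

        private static int sharedPrefixLength(String a, String b) {
            int i = 0;
            while (i < a.length() && i < b.length() && a.charAt(i) == b.charAt(i)) i++;
            return i;
        }
    }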


Object Location: Randomization and Locality


Object Location

Distribute replicas of object references
- Only references, not the data itself (a level of indirection)
- Place more of them closer to the object itself

Publication
- Place object location pointers into the network
- Store pointers at the hops between the object and its "root" node

Location (both walks sketched below)
- Route a message towards the root from the client
- Redirect to the object when a location pointer is found
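A minimal sketch of the publish and locate walks, assuming a hop-by-hop mesh interface and per-node pointer maps (illustrative assumptions, not the deployed code):

    // Illustrative sketch of DOLR publish/locate walks; the Mesh interface and
    // the global pointer map are assumptions for exposition, not Tapestry code.
    import java.util.*;

    class LocationSketch {
        interface Mesh {
            // Next hop from fromNode toward the root of objectId;
            // returns fromNode itself when fromNode is that root.
            String nextHop(String fromNode, String objectId);
        }

        // Pointers held at each node: nodeId -> (objectId -> serverId).
        final Map<String, Map<String, String>> pointers = new HashMap<>();
        final Mesh mesh;

        LocationSketch(Mesh mesh) { this.mesh = mesh; }

        // Publish: walk from the server toward the object's root, leaving a
        // pointer back to the server at every hop along the way.
        void publish(String objectId, String serverId) {
            String node = serverId;
            while (true) {
                pointers.computeIfAbsent(node, k -> new HashMap<>()).put(objectId, serverId);
                String next = mesh.nextHop(node, objectId);
                if (next.equals(node)) break;                // reached the root
                node = next;
            }
        }

        // Locate: walk from the client toward the root; the first hop holding a
        // pointer short-circuits the walk and yields the server's nodeId.
        String locate(String objectId, String clientId) {
            String node = clientId;
            while (true) {
                Map<String, String> held = pointers.get(node);
                if (held != null && held.containsKey(objectId)) return held.get(objectId);
                String next = mesh.nextHop(node, objectId);
                if (next.equals(node)) return null;          // root reached, not found
                node = next;
            }
        }
    }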


Node Insertion

Inserting a new node N
- Notify the need-to-know nodes of N; N fills the null entries in their routing tables
- Move locally rooted object references to N
- Construct a locally optimal routing table for N
- Notify nodes near N for optimization

Two-phase node insertion (detailed on the next two slides)
- Acknowledged multicast
- Nearest-neighbor approximation


Acknowledged Multicast

Reach the need-to-know nodes of N (e.g. N = 3111)
- Add N to their routing tables
- Move root object references to N

[Figure: acknowledged multicast from node G toward N, fanning out over nodes 3211, 3229, 3013, 3222, 3205, 3023, 3022, 3021]


Nearest Neighbor

N iterates (sketched below):
- list = need-to-know nodes, L = length of the shared prefix(N, S)
- Measure the distances to the nodes in list; use them to fill routing table level L
- Trim list to the k closest nodes; set list = backpointers of that k-set; decrement L
- Repeat until L == 0

[Figure: N measuring distances to its need-to-know nodes]
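A rough sketch of that iteration, with the distance measurement, backpointer lookup, and routing-table fill abstracted behind an assumed helper interface:

    // Rough sketch of the nearest-neighbor iteration; the Neighbors helper
    // (distance measurement, backpointers, table fill) is assumed for exposition.
    import java.util.*;

    class NearestNeighborSketch {
        interface Neighbors {
            double distanceTo(String nodeId);                // measured network distance
            Set<String> backpointers(Set<String> nodes);     // nodes that point at this set
            void fillLevel(int level, List<String> measuredNodes);
        }

        void insert(Set<String> needToKnow, int startLevel, int k, Neighbors nb) {
            Set<String> list = needToKnow;
            for (int level = startLevel; level >= 0 && !list.isEmpty(); level--) {
                // Measure distances and use them to fill this routing-table level.
                List<String> sorted = new ArrayList<>(list);
                sorted.sort(Comparator.comparingDouble(nb::distanceTo));
                nb.fillLevel(level, sorted);
                // Trim to the k closest nodes, then follow their backpointers to
                // form the candidate list for the next (shorter-prefix) level.
                List<String> closest = sorted.subList(0, Math.min(k, sorted.size()));
                list = nb.backpointers(new HashSet<>(closest));
            }
        }
    }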


Talk Outline

Algorithms

Architecture

Architectural components

Extensibility API

Evaluation

Ongoing Projects

Conclusion


Single Tapestry Node

[Figure: layered node architecture. Applications such as decentralized file systems, application-level multicast, and approximate text matching sit on the application interface / upcall API; beneath it, the router with its routing table and object pointer DB alongside dynamic node management; at the bottom, network link management over the transport protocols]


Single Node Implementation

[Figure: implementation stack. Applications over the application programming interface; the dynamic Tapestry core (router, dynamic node management for enter/leave Tapestry, state maintenance, node insert/delete, routing link maintenance) plus Patchwork; a network stage with a distance map exchanging messages and UDP pings; all hosted on the SEDA event-driven framework over the Java Virtual Machine. Route-to-node/object API calls flow down, upcalls flow up, and fault detection uses heartbeat messages]


Message Routing

Router: fast routing to nodes and objects (decision logic sketched below)

[Flowchart: on a new route message, signal the application's upcall handler if an upcall is registered, otherwise forward to nextHop(h+1, G). On a new location message, forward to nextHop(0, obj) if object pointers are held; otherwise signal the upcall handler if an upcall is registered, or forward to nextHop(h+1, G)]
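A condensed sketch of that decision logic, with the message type, upcall handler, and helper methods assumed for exposition:

    // Condensed sketch of the router's decision logic; Msg, UpcallHandler and the
    // helper methods are assumptions for exposition, not the actual router code.
    class RouterSketch {
        interface UpcallHandler { void upcall(Msg m); }

        static class Msg {
            String guid;            // destination node or object GUID
            int hop;                // number of overlay hops taken so far
            boolean isLocationMsg;  // true for object-location traffic
        }

        UpcallHandler handler;      // null when no application upcall is registered

        boolean haveObjectPointers(String guid) { return false; }     // placeholder
        void forwardToNextHop(int level, String guid) { /* hand to the mesh */ }

        void onMessage(Msg m) {
            if (m.isLocationMsg && haveObjectPointers(m.guid)) {
                forwardToNextHop(0, m.guid);              // redirect toward the object
            } else if (handler != null) {
                handler.upcall(m);                        // let the application intercept
            } else {
                forwardToNextHop(m.hop + 1, m.guid);      // continue toward the root
            }
        }
    }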


Extensibility API

deliver(G, Aid, Msg)
- Invoked at the message destination
- Asynchronous, returns immediately

forward(G, Aid, Msg)
- Invoked at an intermediate hop in the route
- No action taken by default; the application calls route()

route(G, Aid, Msg, NextHopNodeSet)
- Called by the application to request that the message be routed to the set of next-hop nodes
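A hypothetical Java rendering of this upcall interface; the parameter types are placeholders chosen for illustration, not the real Tapestry classes:

    // Hypothetical rendering of the extensibility API; GUIDs, application IDs and
    // message bodies are given placeholder types, not the real Tapestry classes.
    import java.util.Set;

    // Upcalls made by Tapestry into the application.
    interface TapestryUpcalls {
        // Invoked at the message's destination; asynchronous, returns immediately.
        void deliver(byte[] guid, int appId, byte[] msg);
        // Invoked at each intermediate hop; does nothing by default, and an
        // application that wants the message to keep moving calls route().
        void forward(byte[] guid, int appId, byte[] msg);
    }

    // Downcall made by the application back into Tapestry.
    interface TapestryRouting {
        // Request that the message be routed on to the given set of next hops.
        void route(byte[] guid, int appId, byte[] msg, Set<byte[]> nextHopNodes);
    }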


Local Operations

Access to Tapestry-maintained state (sketched below)

nextHopSet = Llookup(G, num)
- Accesses the routing table
- Returns up to num candidates for the next hop towards G

objReferenceSet = Lsearch(G, num)
- Searches the object references for G
- Returns up to num references for the object, sorted by increasing network distance
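A small sketch of how these local accessors might look over simple collection-backed state (illustrative only, not the actual Tapestry implementation):

    // Illustrative sketch of the local accessors over assumed in-memory state;
    // not the actual Tapestry implementation.
    import java.util.*;

    class LocalStateSketch {
        final Map<String, List<String>> routingCandidates = new HashMap<>();  // G -> next hops
        final Map<String, List<String>> objectReferences = new HashMap<>();   // G -> refs by distance

        // Llookup: up to num candidate next hops toward G, from the routing table.
        List<String> Llookup(String guid, int num) {
            List<String> hops = routingCandidates.getOrDefault(guid, List.of());
            return hops.subList(0, Math.min(num, hops.size()));
        }

        // Lsearch: up to num object references for G, kept sorted by increasing
        // network distance when they were stored.
        List<String> Lsearch(String guid, int num) {
            List<String> refs = objectReferences.getOrDefault(guid, List.of());
            return refs.subList(0, Math.min(num, refs.size()));
        }
    }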


Deployment Status

C simulator
- Packet-level simulation
- Scales up to 10,000 nodes

Java implementation
- 50,000 semicolons of Java, 270 class files
- Deployed on a local-area cluster (40 nodes)
- Deployed on the PlanetLab global network (~100 distributed nodes)


Talk Outline

Algorithms

Architecture

Evaluation

Micro-benchmarks

Stable network performance

Single and parallel node insertion

Ongoing Projects

Conclusion


Micro-benchmark Methodology

Experiment run in a LAN over Gigabit Ethernet
- Sender sends 60,001 messages at full speed
- Measure the inter-arrival time of the last 50,000 messages
- Discarding the first 10,000 messages removes cold-start effects
- Measuring over 50,000 messages removes network jitter effects

[Figure: sender-control and receiver-control applications on two Tapestry nodes connected by a LAN link]


Micro-benchmark Results

[Plots: message processing latency (time per message, ms) and sustainable throughput (MB/s), each versus message size from 0.01 KB to 10,000 KB]

- Constant processing overhead of ~50 µs
- Latency dominated by byte copying
- For 5 KB messages, throughput is ~10,000 msgs/sec


Large-Scale Methodology

PlanetLab global network
- 98 machines at 42 institutions in North America, Europe, and Australia (~60 machines utilized)
- 1.26 GHz PIII (1 GB RAM) and 1.8 GHz PIV (2 GB RAM) machines
- North American machines (2/3 of them) are on Internet2

Tapestry Java deployment
- 6-7 nodes on each physical machine
- IBM Java JDK 1.3.0
- Node virtualization inside the JVM and SEDA
- Scheduling between virtual nodes increases latency


Node to Node Routing

Ratio of end-to-end overlay routing latency to the shortest ping distance between nodes (the relative delay penalty, RDP); all node pairs measured and placed into 5 ms buckets

[Plot: RDP (min, median, 90th percentile) versus internode RTT ping time, 0-300 ms in 5 ms buckets; annotated median = 31.5, 90th percentile = 135]
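The RDP plotted here and on the next slide is simply the overlay latency divided by the direct ping latency; a small illustrative helper for computing and bucketing it (layout and method assumed, not taken from the measurement scripts):

    // Illustrative helper: relative delay penalty (RDP) samples bucketed by ping RTT.
    import java.util.*;

    class RdpSketch {
        // Returns bucketStartMs -> RDP samples (overlay latency / ping latency).
        static Map<Integer, List<Double>> bucketize(double[] overlayMs, double[] pingMs, int bucketMs) {
            Map<Integer, List<Double>> buckets = new TreeMap<>();
            for (int i = 0; i < pingMs.length; i++) {
                double rdp = overlayMs[i] / pingMs[i];             // penalty over the direct path
                int bucket = (int) (pingMs[i] / bucketMs) * bucketMs;
                buckets.computeIfAbsent(bucket, k -> new ArrayList<>()).add(rdp);
            }
            return buckets;
        }
    }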


Object Location

Ratio of end-to-end object location latency to the shortest ping distance between the client and the object's location; each node publishes 10,000 objects, and lookups are performed on all objects

[Plot: RDP (min, median, 90th percentile) versus client-to-object RTT ping time, 0-200 ms in 1 ms buckets; 90th percentile = 158]


Latency to Insert Node

Latency to dynamically insert a node into an existing Tapestry, as a function of the size of the existing network; humps are due to the expected filling of each routing level

[Plot: integration latency (ms), 0-2000, versus size of existing network, 0-500 nodes]


Bandwidth to Insert Node

Bandwidth cost of dynamically inserting a node into the Tapestry, amortized over each node in the network; per-node bandwidth decreases with the size of the network

[Plot: control traffic bandwidth (KB), 0-1.4, versus size of existing network, 0-400 nodes]


Parallel Insertion Latency

Latency for a group of nodes to insert in unison into an existing Tapestry of 200 nodes, shown as a function of the ratio of insertion group size to network size

[Plot: latency to convergence (ms), 0-20,000, versus ratio of insertion group size to network size, 0-0.3; 90th percentile = 55,042 ms]


Results Summary

Lessons learned
- Node virtualization causes resource contention
- Accurate network distances are hard to measure

Efficiency verified
- Message processing = 50 µs, throughput ~10,000 msgs/s
- Routing to nodes/objects is a small factor over optimal

Algorithmic scalability
- Single-node latency/bandwidth scales sublinearly with network size
- Parallel insertion scales linearly with group size


Talk Outline

Algorithms

Architecture

Evaluation

Ongoing Projects

P2P landmark routing: Brocade

Applications: Shuttle, Interweave, ATA

Conclusion


State of the Art Routing

High-dimensionality and coordinate-based P2P routing: Tapestry, Pastry, Chord, CAN, etc.
- Sub-linear storage and number of overlay hops per route
- Properties depend on a random name distribution
- Optimized for uniform mesh-style networks


Reality

[Figure: a P2P overlay network spanning autonomous systems AS-1, AS-2, and AS-3, with sender S and receiver R]

Transit-stub topology, disparate resources per node
Result: inefficient inter-domain routing (bandwidth, latency)


Landmark Routing on P2P: Brocade

Exploit non-uniformity; minimize wide-area routing hops and bandwidth

Secondary overlay on top of Tapestry
- Select super-nodes by administrative domain
- Divide the network into cover sets
- Super-nodes form a secondary Tapestry
- Advertise each cover set as local objects

Brocade routes directly into the destination's local network, then resumes P2P routing (sketched below)
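A very rough sketch of that forwarding decision, with the overlay layers, super-node lookup, and cover-set test abstracted behind assumed interfaces:

    // Very rough sketch of Brocade-style forwarding; the Overlay interface,
    // super-node lookup and cover-set test are assumptions, not Brocade's code.
    class BrocadeSketch {
        interface Overlay { void route(String fromNode, String destNode, byte[] msg); }

        final Overlay localTapestry;   // the ordinary peer-to-peer overlay
        final Overlay brocadeLayer;    // secondary overlay among the super-nodes

        BrocadeSketch(Overlay localTapestry, Overlay brocadeLayer) {
            this.localTapestry = localTapestry;
            this.brocadeLayer = brocadeLayer;
        }

        boolean sameCoverSet(String a, String b) { return false; }  // placeholder
        String superNodeFor(String nodeId) { return nodeId; }       // placeholder

        void send(String src, String dest, byte[] msg) {
            if (sameCoverSet(src, dest)) {
                // Destination is local: ordinary Tapestry routing suffices.
                localTapestry.route(src, dest, msg);
            } else {
                // Tunnel across the super-node overlay into the destination's
                // cover set, then resume peer-to-peer routing inside it.
                brocadeLayer.route(superNodeFor(src), superNodeFor(dest), msg);
                localTapestry.route(superNodeFor(dest), dest, msg);
            }
        }
    }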


Brocade Routing

[Figure: the P2P network spanning AS-1, AS-2, and AS-3 with the Brocade layer above it, comparing the original route from S to D against the shorter Brocade route]


Applications under Development

OceanStore: global resilient file store

Shuttle
- Decentralized P2P chat service
- Leverages Tapestry for fault-tolerant routing

Interweave
- Keyword-searchable file-sharing utility
- Fully decentralized, exploits network locality

Approximate Text Addressing
- Uses text fingerprinting to map similar documents to single IDs
- Killer app: a decentralized spam mail filter


For More Information

Tapestry and related projects (and these slides): http://www.cs.berkeley.edu/~ravenben/tapestry

OceanStore: http://oceanstore.cs.berkeley.edu

Related papers:
http://oceanstore.cs.berkeley.edu/publications
http://www.cs.berkeley.edu/~ravenben/publications

ravenben@eecs.berkeley.edu