Implementation and Deployment of a Large-scale Network Infrastructure

Ben Y. Zhao, L. Huang, S. Rhea, J. Stribling, A. D. Joseph, J. D. Kubiatowicz
EECS, UC Berkeley
HP Labs, November 2002
Next-generation Network Applications

- Scalability in applications
  - Process/threads on a single-node server
  - Cluster (LAN): fast, reliable, unlimited communication
  - Next step: scaling to the wide area
- Complexities of global deployment
  - Network unreliability: BGP slow convergence, redundancy unexploited
  - Lack of administrative control over components; constrains protocol deployment (multicast, congestion control)
- Management of large-scale resources / components
  - Locate and utilize resources despite failures
Enabling Technology: DOLR (Decentralized Object Location and Routing)

[Figure: a DOLR overlay routing messages to objects by GUID (GUID1, GUID2), independent of where the objects physically reside]
A Solution

- Decentralized Object Location and Routing (DOLR): a wide-area overlay application infrastructure
  - Self-organizing, scalable
  - Fault-tolerant routing and object location
  - Efficient (bandwidth, latency) data delivery
  - Extensible, supports application-specific protocols
- Recent work: Tapestry, Chord, CAN, Pastry, Kademlia, Viceroy, …
What is Tapestry?

- DOLR driving OceanStore global storage (Zhao, Kubiatowicz, Joseph et al. 2000)
- Network structure
  - Nodes assigned bit-sequence nodeIds from the namespace 0-2^160, interpreted in some radix (e.g. 16)
  - Keys come from the same namespace
  - Keys dynamically map to one unique live node: the root
- Base API (sketched below)
  - Publish / Unpublish (Object ID)
  - RouteToNode (NodeId)
  - RouteToObject (Object ID)
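The base API maps naturally onto a small interface. A minimal Java sketch under stated assumptions: the method and parameter names follow this slide, but the types (hex-string IDs, byte[] payloads) and the interface itself are illustrative, not the actual Berkeley classes.

    // Hypothetical sketch of the Tapestry base API; IDs are hex strings
    // drawn from the 160-bit namespace described above.
    public interface TapestryApi {
        // Advertise that this node stores the object.
        void publish(String objectGuid);
        // Withdraw the advertisement.
        void unpublish(String objectGuid);
        // Deliver msg to the live node matching nodeId (or its root).
        void routeToNode(String nodeId, byte[] msg);
        // Deliver msg to a nearby replica of the object.
        void routeToObject(String objectGuid, byte[] msg);
    }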
Tapestry Mesh: Incremental Prefix-based Routing
[Figure: a routing mesh of nodes with hex IDs (0x43FE, 0xEF31, 0xEF34, 0xEF37, …), linked by edges that match the destination in one additional prefix digit per hop]
Routing Mesh

- Routing via local routing tables, based on incremental prefix matching (sketched below)
  - Example: 5324 routes to 0629 via 5324 → 0231 → 0667 → 0625 → 0629
  - At the nth hop, the local node matches the destination in at least n digits (if any such node exists)
  - The ith entry in the nth-level routing table points to the nearest node matching prefix(local_ID, n) + i
- Properties
  - At most log(N) overlay hops between nodes
  - Routing table size: b log(N)
  - Each entry has c-1 backups, for a total size of c b log(N)
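A hedged sketch of the per-hop routing decision, assuming hex digits (radix 16) and a nested-map routing table; the class and field names are invented for illustration, not taken from the implementation.

    import java.util.Map;

    // Incremental prefix routing: at each hop, pick a neighbor that
    // matches the destination in one more digit than the local node.
    public class PrefixRouter {
        private final String localId;   // e.g. "5324"
        // level n -> (next digit -> nearest node sharing that prefix)
        private final Map<Integer, Map<Character, String>> table;

        public PrefixRouter(String localId,
                            Map<Integer, Map<Character, String>> table) {
            this.localId = localId;
            this.table = table;
        }

        private int sharedPrefixLen(String dest) {
            int n = 0;
            while (n < localId.length() && localId.charAt(n) == dest.charAt(n))
                n++;
            return n;
        }

        // Returns the next hop toward dest, or null if this node is
        // already the closest match (the destination's root).
        public String nextHop(String dest) {
            int n = sharedPrefixLen(dest);
            if (n == dest.length()) return null;
            Map<Character, String> level = table.get(n);
            return (level == null) ? null : level.get(dest.charAt(n));
        }
    }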
Object Location: Randomization and Locality
Object Location

- Distribute replicas of object references
  - Only references, not the data itself (a level of indirection)
  - Place more of them closer to the object itself
- Publication: place object location pointers into the network, stored at each hop between the object and its "root" node
- Location: route a message toward the root from the client; redirect to the object once a location pointer is found (sketch below)
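The two operations above fit a small sketch, assuming a map from object GUIDs to server addresses as the pointer store; onPublishMessage/onLocateMessage are invented names for what each hop on the object-to-root path would run.

    import java.util.*;

    // Location pointers: each node on the path from a server to the
    // object's root keeps <GUID, server> entries, never the data.
    public class PointerStore {
        private final Map<String, Set<String>> pointers = new HashMap<>();

        // Publication: record the pointer, then (not shown) forward the
        // publish message one hop closer to the object's root.
        public void onPublishMessage(String guid, String serverAddr) {
            pointers.computeIfAbsent(guid, g -> new HashSet<>())
                    .add(serverAddr);
        }

        // Location: a query routed toward the root is redirected to the
        // object as soon as any hop holds a pointer for it.
        public String onLocateMessage(String guid) {
            Set<String> servers = pointers.get(guid);
            return (servers == null || servers.isEmpty())
                    ? null                          // keep routing to root
                    : servers.iterator().next();    // redirect to object
        }
    }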
Node Insertion

- Inserting a new node N
  - Notify the need-to-know nodes of N; N fills null entries in their routing tables
  - Move locally rooted object references to N
  - Construct a locally optimal routing table for N
  - Notify nodes near N for optimization
- Two-phase node insertion
  - Acknowledged multicast
  - Nearest-neighbor approximation
Acknowledged Multicast

- Reach the need-to-know nodes of N (e.g. N = 3111)
  - Add N to their routing tables
  - Move root object references to N

[Figure: acknowledged multicast from gateway G fanning out to nodes 3211, 3229, 3222, 3205, 3013, 3023, 3022, 3021]
Nearest Neighbor Approximation

- N iterates, starting with list = need-to-know nodes and L = prefix(N, S), the shared-prefix length:
  - Measure distances to the nodes in list; use them to fill routing table level L
  - Trim list to the k closest nodes; set list = backpointers from that k-set; L--
  - Repeat until L == 0 (a sketch follows)

[Figure: new node N measuring distances to its need-to-know nodes]
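A sketch of the iteration just described, under stated assumptions: k, the RTT measurement, and the backpointer query are stand-ins, and the loop follows the slide's "repeat until L == 0".

    import java.util.*;

    // Nearest-neighbor phase of insertion: walk the routing levels from
    // L = prefix(N, S) down to 0, filling each level from measured RTTs.
    public abstract class NearestNeighborFill {
        static final int K = 16;   // trim factor; value not given on slide

        void buildTable(List<String> needToKnow, int startLevel) {
            List<String> list = new ArrayList<>(needToKnow);
            for (int level = startLevel; level >= 0; level--) {
                // Measure distances and fill routing table level L.
                list.sort(Comparator.comparingDouble(this::measureRtt));
                fillLevel(level, list);
                // Trim to the k closest; their backpointers seed the
                // candidate list for the next (shorter) prefix level.
                List<String> next = new ArrayList<>();
                for (String n : list.subList(0, Math.min(K, list.size())))
                    next.addAll(backpointersOf(n, level));
                list = next;
            }
        }

        abstract double measureRtt(String node);
        abstract void fillLevel(int level, List<String> sortedCandidates);
        abstract List<String> backpointersOf(String node, int level);
    }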
Talk Outline

- Algorithms
- Architecture
  - Architectural components
  - Extensibility API
- Evaluation
- Ongoing Projects
- Conclusion
Single Tapestry Node

[Figure: layered node architecture. Applications (decentralized file systems, application-level multicast, approximate text matching) sit on the application interface / upcall API; beneath it, the router (routing table and object pointer DB) and dynamic node management run over network link management and transport protocols]
Single Node Implementation

[Figure: implementation stack. Applications issue API calls and receive upcalls through the application programming interface; the dynamic Tapestry core (router plus Patchwork) handles route-to-node/object messages, node insert/delete, enter/leave Tapestry, and state maintenance; the network stage keeps a distance map via UDP pings, fault detection, and heartbeat messages; everything runs on the SEDA event-driven framework inside a Java virtual machine]
Message Routing

Router: fast routing to nodes / objects (dispatch logic sketched below)

- New location (route-to-object) message:
  - If this node holds object pointers for the target: signal the application's upcall handler if one is registered, otherwise forward to nextHop(0, obj) (toward the object)
  - Otherwise forward to nextHop(h+1, G) (one more hop toward the root)
- New route (route-to-node) message:
  - Signal the application's upcall handler if one is registered, otherwise forward to nextHop(h+1, G)
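The same decision tree in code form, a hedged reconstruction: the Msg fields, upcall registry, and nextHop() below are invented stand-ins for router internals.

    import java.util.*;

    public class RouterDispatch {
        static class Msg { String guid; String appId; int hop; }

        private final Map<String, Set<String>> objPointers = new HashMap<>();

        void onLocationMsg(Msg m) {
            Set<String> ptrs = objPointers.get(m.guid);
            if (ptrs == null || ptrs.isEmpty()) {
                forwardTo(nextHop(m.hop + 1, m.guid));   // on toward root
            } else if (hasUpcall(m.appId)) {
                signalUpcallHandler(m);                  // app takes over
            } else {
                forwardTo(nextHop(0, ptrs.iterator().next())); // to object
            }
        }

        void onRouteMsg(Msg m) {
            if (hasUpcall(m.appId)) signalUpcallHandler(m);
            else forwardTo(nextHop(m.hop + 1, m.guid));
        }

        // Stubs standing in for real router machinery.
        boolean hasUpcall(String appId) { return false; }
        void signalUpcallHandler(Msg m) {}
        String nextHop(int level, String guid) { return guid; }
        void forwardTo(String node) {}
    }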
Extensibility API

- deliver (G, Aid, Msg): invoked at the message destination; asynchronous, returns immediately
- forward (G, Aid, Msg): invoked at an intermediate hop in the route; no action taken by default, the application calls route()
- route (G, Aid, Msg, NextHopNodeSet): called by the application to request that a message be routed to the set of next-hop nodes (a Java rendering follows)
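Rendered as Java, a minimal sketch: the upcalls become an application-side interface and route() a Tapestry-side call; the signatures are illustrative, not the Berkeley class definitions.

    import java.util.Set;

    // Application-side upcalls.
    interface TapestryApplication {
        // At the destination; asynchronous, returns immediately.
        void deliver(String guid, String appId, byte[] msg);
        // At an intermediate hop; no action by default, the app
        // calls route() to send the message onward.
        void forward(String guid, String appId, byte[] msg);
    }

    // Tapestry-side call the application uses from inside forward().
    interface TapestryRouting {
        void route(String guid, String appId, byte[] msg,
                   Set<String> nextHopNodes);
    }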
Local Operations

Accessors for Tapestry-maintained state:

- nextHopSet = Llookup(G, num): accesses the routing table; returns up to num candidates for the next hop toward G
- objReferenceSet = Lsearch(G, num): searches object references for G; returns up to num references for the object, sorted by increasing network distance

(A usage sketch combining these with forward() follows.)
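The sketch: an application's forward() upcall that short-circuits a query to the nearest known replica. The wrapper class and wiring are hypothetical; only the Llookup/Lsearch semantics come from the slide.

    import java.util.*;

    public abstract class LocalityAwareApp {
        abstract List<String> lLookup(String guid, int num);  // routing table
        abstract List<String> lSearch(String guid, int num);  // object refs
        abstract void route(String guid, String appId, byte[] msg,
                            Set<String> nextHops);

        // forward() upcall at an intermediate hop.
        public void forward(String guid, String appId, byte[] msg) {
            List<String> refs = lSearch(guid, 1); // closest reference first
            if (!refs.isEmpty()) {
                route(guid, appId, msg, Set.of(refs.get(0)));
            } else {
                route(guid, appId, msg, new HashSet<>(lLookup(guid, 1)));
            }
        }
    }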
Deployment Status

- C simulator
  - Packet-level simulation
  - Scales up to 10,000 nodes
- Java implementation
  - 50,000 semicolons of Java, 270 class files
  - Deployed on a local-area cluster (40 nodes)
  - Deployed on the PlanetLab global network (~100 distributed nodes)
Talk Outline

- Algorithms
- Architecture
- Evaluation
  - Micro-benchmarks
  - Stable network performance
  - Single and parallel node insertion
- Ongoing Projects
- Conclusion
Micro-benchmark Methodology

- Experiment run in a LAN over GBit Ethernet
- Sender sends 60,001 messages at full speed; measure inter-arrival time for the last 50,000 msgs
  - Discarding the first 10,000 msgs removes cold-start effects
  - Averaging over 50,000 msgs removes network jitter effects

[Figure: benchmark setup: sender and receiver Tapestry nodes connected by a LAN link, each driven by a control process]
Micro-benchmark Results
[Graphs: message processing latency (ms per message) and sustainable throughput (MB/s), each vs. message size from 0.01 KB to 10,000 KB on a log scale]
- Constant processing overhead of ~50 μs
- Latency dominated by byte copying
- For 5 KB messages, throughput ≈ 10,000 msgs/sec
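As a sanity check on these numbers (our arithmetic, not from the slide): a fixed ~50 μs of processing per message bounds small-message throughput at about 1 / 50 μs = 20,000 msgs/sec, so the measured ~10,000 msgs/sec at 5 KB is consistent with byte copying consuming the rest of the per-message budget.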
Large Scale Methodology

- PlanetLab global network
  - 98 machines at 42 institutions in North America, Europe, and Australia (~60 machines utilized)
  - 1.26 GHz PIII (1 GB RAM), 1.8 GHz PIV (2 GB RAM)
  - North American machines (2/3) on Internet2
- Tapestry Java deployment
  - 6-7 nodes on each physical machine
  - IBM Java JDK 1.3.0
  - Node virtualization inside JVM and SEDA
  - Scheduling between virtual nodes increases latency
Node to Node Routing

- Ratio of end-to-end routing latency to the shortest ping distance between nodes
- All node pairs measured, placed into buckets
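The plotted metric is thus a relative delay penalty: RDP(A, B) = Tapestry latency(A → B) / ping(A, B), so RDP = 1 would mean overlay routing is as fast as direct IP.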
[Graph: RDP (min, median, 90th percentile) vs. internode round-trip ping time in 5 ms buckets; annotated: median = 31.5, 90th percentile = 135]
Object Location

- Ratio of end-to-end latency for object location to the shortest ping distance between the client and the object's location
- Each node publishes 10,000 objects; lookups performed on all objects
[Graph: RDP (min, median, 90th percentile) vs. client-to-object round-trip ping time in 1 ms buckets; 90th percentile = 158]
Latency to Insert Node

- Latency to dynamically insert a node into an existing Tapestry, as a function of the size of the existing network
- Humps are due to the expected filling of each routing level
[Graph: integration latency (ms, 0-2000) vs. size of existing network (0-500 nodes)]
Bandwidth to Insert Node

- Cost in bandwidth of dynamically inserting a node into the Tapestry, amortized over each node in the network
- Per-node bandwidth decreases with the size of the network
[Graph: control traffic bandwidth (KB, 0-1.4) vs. size of existing network (0-400 nodes)]
Parallel Insertion Latency

- Latency to dynamically insert nodes in unison into an existing Tapestry of 200 nodes
- Shown as a function of the ratio of insertion group size to network size
[Graph: latency to convergence (ms) vs. ratio of insertion group size to network size (0-0.3); 90th percentile = 55,042 ms]
Results Summary

- Lessons learned
  - Node virtualization causes resource contention
  - Accurate network distances are hard to measure
- Efficiency verified
  - Msg processing ≈ 50 μs, throughput ~10,000 msgs/s
  - Route to node/object within a small factor of optimal
- Algorithmic scalability
  - Single-node insertion latency/bandwidth scale sublinearly with network size
  - Parallel insertion scales linearly with group size
Talk Outline

- Algorithms
- Architecture
- Evaluation
- Ongoing Projects
  - P2P landmark routing: Brocade
  - Applications: Shuttle, Interweave, ATA
- Conclusion
State of the Art Routing

- High-dimensionality and coordinate-based P2P routing: Tapestry, Pastry, Chord, CAN, etc.
- Sub-linear storage and number of overlay hops per route
- Properties dependent on random name distribution
- Optimized for uniform mesh-style networks
Reality

[Figure: a P2P overlay network spanning autonomous systems AS-1, AS-2, AS-3, with source S and receiver R]

- Transit-stub topology, disparate resources per node
- Result: inefficient inter-domain routing (bandwidth, latency)
Landmark Routing on P2P: Brocade

- Exploit non-uniformity; minimize wide-area routing hops / bandwidth
- Secondary overlay on top of Tapestry
  - Select super-nodes by administrative domain
  - Divide the network into cover sets
- Super-nodes form a secondary Tapestry and advertise their cover sets as local objects
- Brocade routes directly into the destination's local network, then resumes P2P routing (see the sketch below)
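A hedged sketch of that two-level decision; coverSetOf() and the two routing calls are invented stand-ins for Brocade and Tapestry machinery.

    // Brocade routing: cross the wide area at most once, via the
    // super-node overlay, then fall back to ordinary Tapestry routing.
    public abstract class BrocadeRouter {
        abstract String coverSetOf(String nodeId);     // admin. domain
        abstract void tapestryRoute(String dest, byte[] msg);
        abstract void superNodeRoute(String coverSet, byte[] msg);

        void route(String localId, String dest, byte[] msg) {
            if (coverSetOf(dest).equals(coverSetOf(localId))) {
                tapestryRoute(dest, msg);  // already in the right network
            } else {
                // Jump directly into the destination's cover set.
                superNodeRoute(coverSetOf(dest), msg);
            }
        }
    }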
Brocade Routing

[Figure: Brocade layer above the P2P network spanning AS-1, AS-2, AS-3; the Brocade route from S to D crosses the wide area once through super-nodes, while the original route takes many overlay hops]
Applications under Development

- OceanStore: global resilient file store
- Shuttle: decentralized P2P chat service; leverages Tapestry for fault-tolerant routing
- Interweave: keyword-searchable file-sharing utility; fully decentralized, exploits network locality
- Approximate Text Addressing: uses text fingerprinting to map similar documents to single IDs; killer app: a decentralized spam mail filter
For More Information

Tapestry and related projects (and these slides):
http://www.cs.berkeley.edu/~ravenben/tapestry

OceanStore:
http://oceanstore.cs.berkeley.edu

Related papers:
http://oceanstore.cs.berkeley.edu/publications
http://www.cs.berkeley.edu/~ravenben/publications