Tomography-based Overlay Network Monitoring
Yan Chen, David Bindel, and Randy H. Katz, UC Berkeley
Motivation

• Infrastructure ossification spurred the growth of overlay and P2P applications
• Such applications are flexible in their choice of paths and targets, and thus can benefit from E2E distance monitoring
  – Overlay routing/location
  – VPN management/provisioning
  – Service redirection/placement …
• Requirements for an E2E monitoring system
  – Scalable & efficient: small amount of probing traffic
  – Accurate: captures congestion/failures
  – Incrementally deployable
  – Easy to use
Existing Work

• General metrics: RON (O(n^2) measurements)
• Latency estimation
  – Clustering-based: IDMaps, Internet Isobar, etc.
  – Coordinate-based: GNP, ICS, Virtual Landmarks
• Network tomography
  – Focuses on inferring the characteristics of physical links rather than E2E paths
  – Limited measurements -> an under-constrained system with unidentifiable links
Problem Formulation

Given an overlay of n end hosts and O(n^2) paths, how do we select a minimal subset of paths to monitor so that the loss rates/latency of all other paths can be inferred?

Assumptions:
• Topology is measurable
• We can only measure the E2E path, not the individual links
Our Approach

• Select a basis set of k paths that fully describes all O(n^2) paths (k « O(n^2))
• Monitor the loss rates of the k paths, and infer the loss rates of all other paths
• Applicable to any additive metric, such as latency

[Figure: end hosts send topology and measurement data to an overlay network operation center]
Modeling of Path Space

Let p_1 be the loss rate of a path and l_j the loss rate of link j. For the example path that traverses links 1 and 2:

  1 - p_1 = (1 - l_1)(1 - l_2)

Taking logarithms makes the relation linear:

  log(1 - p_1) = log(1 - l_1) + log(1 - l_2)
               = [1 1 0] [log(1 - l_1)  log(1 - l_2)  log(1 - l_3)]^T

i.e., b_1 = [1 1 0] x, with x_j = log(1 - l_j).

[Figure: example topology with end hosts A, B, C, D, links 1, 2, 3, and path p_1 traversing links 1 and 2]
Putting All Paths Together

With r = O(n^2) paths in total over s links (s « r), stack the per-path equations into

  b = G x

where
• G ∈ {0,1}^{r×s} is the path matrix (G_ij = 1 iff path i traverses link j)
• x is the s×1 link loss-rate vector, x_j = log(1 - l_j)
• b is the r×1 path loss-rate vector, b_i = log(1 - p_i)

[Figure: the same example topology; each path, such as p_1, contributes one row to G]
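As a minimal numerical sketch of this linear model (numpy assumed; the link loss rates are made-up illustrative values), the three-path example can be checked directly: stacking the paths into G and working in log space reproduces the multiplicative path losses.

```python
import numpy as np

# Path matrix for the example: row i = path i, column j = link j.
# Path 1 uses links 1,2; path 2 uses link 3; path 3 uses links 1,2,3.
G = np.array([[1, 1, 0],
              [0, 0, 1],
              [1, 1, 1]], dtype=float)

link_loss = np.array([0.02, 0.05, 0.10])  # hypothetical link loss rates
x = np.log(1 - link_loss)                 # x_j = log(1 - l_j)
b = G @ x                                 # b_i = log(1 - p_i), all paths at once

path_loss = 1 - np.exp(b)                 # back to per-path loss rates
# Path 3 traverses all three links, so its loss matches the product form:
expected = 1 - (1 - 0.02) * (1 - 0.05) * (1 - 0.10)
assert abs(path_loss[2] - expected) < 1e-12
```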
Sample Path Matrix

For the example topology, the three paths give b_1 = x_1 + x_2, b_2 = x_3, b_3 = x_1 + x_2 + x_3:

  G = [ 1 1 0
        0 0 1
        1 1 1 ],   b = G x

• x_1 - x_2 is unknown => cannot compute x_1, x_2 individually
• The set of vectors α[1 -1 0]^T forms the null space of G
• To separate identifiable vs. unidentifiable components: x = x_G + x_N, where

  x_G = ((x_1 + x_2)/2) [1 1 0]^T + x_3 [0 0 1]^T = [b_1/2  b_1/2  b_2]^T
  x_N = ((x_1 - x_2)/2) [1 -1 0]^T

[Figure: the path/row space (measured), containing (1,1,0), versus the null space (unmeasured), spanned by [1 -1 0]^T]
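The decomposition x = x_G + x_N can be sketched numerically (numpy assumed; the loss rates are illustrative): project x onto the row space of G to obtain the identifiable part, and verify the claims above.

```python
import numpy as np

G = np.array([[1, 1, 0],
              [0, 0, 1],
              [1, 1, 1]], dtype=float)

x = np.log(1 - np.array([0.02, 0.05, 0.10]))  # hypothetical link loss rates
b = G @ x

# Orthonormal basis of the row space of G, from the SVD.
_, s, Vt = np.linalg.svd(G)
rank = int(np.sum(s > 1e-10))
row_basis = Vt[:rank]

x_G = row_basis.T @ (row_basis @ x)  # identifiable component (row space)
x_N = x - x_G                        # unidentifiable component (null space)

assert np.allclose(G @ x_N, 0)       # null-space part is invisible to all paths
assert np.allclose(G @ x_G, b)       # hence b = G x = G x_G
assert np.allclose(x_G, [b[0] / 2, b[0] / 2, b[1]])  # matches the closed form
```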
Intuition through Topology Virtualization

Virtual links:
• Minimal path segments whose loss rates can be uniquely identified
• Can fully describe all paths
• x_G is composed of virtual links

All E2E paths lie in the path space, i.e., G x_N = 0, so

  b = G x = G (x_G + x_N) = G x_G

[Figure: in the example, links 1 and 2 (individually unidentifiable) merge into a single virtual link; x_G = ((x_1 + x_2)/2) [1 1 0]^T + x_3 [0 0 1]^T as before]
More Examples

Real links (solid) and all of the overlay paths (dotted) traversing them, with the corresponding virtual links after virtualization:

• Three real links reduce to two virtual links 1', 2':

  G = [ 1 0
        0 1
        1 1 ],   rank(G) = 2

• Four real links reduce to virtual links 1'-4':

  G = [ 1 1 0 0
        1 0 1 0
        0 1 0 1
        0 0 1 1 ],   rank(G) = 3
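The ranks quoted above are easy to check numerically (numpy assumed):

```python
import numpy as np

# Path matrices from the two virtualization examples above.
G1 = np.array([[1, 0],
               [0, 1],
               [1, 1]])
G2 = np.array([[1, 1, 0, 0],
               [1, 0, 1, 0],
               [0, 1, 0, 1],
               [0, 0, 1, 1]])

# rank(G) = number of independent paths that must actually be monitored.
assert np.linalg.matrix_rank(G1) == 2
assert np.linalg.matrix_rank(G2) == 3
```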
How Many Measurements Are Saved?

Is k « O(n^2)? For a power-law Internet topology:
• When the majority of end hosts are on the overlay: k = O(n) (with proof)
• When a small portion of end hosts are on the overlay:
  – If the Internet were a pure hierarchy (a tree): k = O(n)
  – If the Internet had no hierarchy at all (worst case, a clique): k = O(n^2)
  – The Internet has a moderate hierarchical structure [TGJ+02]: for reasonably large n (e.g., ≥ 100), k = O(n log n) (extensive linear regression tests on both synthetic and real topologies)
Evaluation

• Extensive simulations
• Experiments on PlanetLab
  – 51 hosts, each from a different organization
  – 51 × 50 = 2,550 paths
  – On average k = 872
• Results highlights
  – Average real loss rate: 0.023
  – Absolute error mean: 0.0027; 90% of errors < 0.014
  – Relative error mean: 1.1; 90% of errors < 2.0
  – On average, 248 of the 2,550 paths have no or incomplete routing information
  – No router aliases were resolved
Linear Regression Tests of the Hypothesis

• BRITE router-level topologies: Barabási-Albert, Waxman, and hierarchical models
• Mercator real topology
• Most fit best with O(n), except the hierarchical topologies, which fit best with O(n log n)

[Figure: k vs. n for a BRITE 20K-node hierarchical topology and the Mercator 284K-node real router topology]
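A sketch of this kind of regression test (numpy assumed; the (n, k) samples below are synthetic and generated to grow exactly as n log n, purely to illustrate the fitting procedure — the real test fits ranks measured on BRITE and Mercator topologies): fit k against both n and n log n and compare goodness of fit.

```python
import numpy as np

# Synthetic (n, k) samples growing as c * n * log n (illustration only).
n = np.array([100.0, 200, 400, 800, 1600, 3200])
k = 0.8 * n * np.log(n)

def r_squared(feature, y):
    """Least-squares fit y ~ a * feature + c; return the R^2 of the fit."""
    A = np.column_stack([feature, np.ones_like(feature)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - np.sum(resid**2) / np.sum((y - y.mean())**2)

r2_linear = r_squared(n, k)             # hypothesis: k = O(n)
r2_nlogn = r_squared(n * np.log(n), k)  # hypothesis: k = O(n log n)
assert r2_nlogn > r2_linear             # n log n explains this data better
```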
Algorithms

• Select k = rank(G) linearly independent paths to monitor
  – Use a rank-revealing (QR) decomposition
  – Leverage the sparsity of the matrix: O(rk^2) time and O(k^2) memory
  – E.g., 10 minutes for n = 350 (r = 61,075 paths, k = 2,958)
• Compute the loss rates of the other paths
  – O(k^2) time and O(k^2) memory

Measure b̄ on the selected paths Ḡ ∈ {0,1}^{k×s}, solve Ḡ x_G = b̄ for x_G, then infer all paths via b = G x_G with G ∈ {0,1}^{r×s}.
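A toy end-to-end sketch of the two steps (numpy assumed; a greedy rank test stands in for the rank-revealing QR decomposition, and the link loss rates are made up):

```python
import numpy as np

# Toy path matrix: rows = candidate paths (r of them), columns = links.
G = np.array([[1, 1, 0],
              [0, 0, 1],
              [1, 1, 1]], dtype=float)

# Step 1: pick k = rank(G) linearly independent paths to monitor.
selected = []
for i in range(G.shape[0]):
    if np.linalg.matrix_rank(G[selected + [i]]) == len(selected) + 1:
        selected.append(i)
k = len(selected)
assert k == np.linalg.matrix_rank(G)

G_bar = G[selected]  # k x s sub-matrix of monitored paths

# Step 2: "measure" b_bar on the selected paths only, solve G_bar x_G = b_bar,
# then infer every path's loss rate via b = G x_G.
x_true = np.log(1 - np.array([0.02, 0.05, 0.10]))    # hypothetical link losses
b_bar = G_bar @ x_true                               # simulated measurements
x_G, *_ = np.linalg.lstsq(G_bar, b_bar, rcond=None)  # minimum-norm solution
b_all = G @ x_G                                      # inferred for all r paths

assert np.allclose(b_all, G @ x_true)  # matches measuring every path directly
```

Here only k = 2 of the 3 paths are probed, yet all 3 path loss rates are recovered exactly, because the third row of G lies in the span of the monitored rows.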
Practical Issues

• Tolerance of topology measurement errors
  – We care about path loss rates rather than those of any interior link
  – Poor router alias resolution => assign similar loss rates to the same links
  – Unidentifiable routers => add virtual links to bypass them
• Measurement load balancing on end hosts
  – Randomly order the paths for the scan and selection of Ḡ
• Topology changes
  – Efficient algorithms for incrementally updating Ḡ when end hosts are added/removed or routes change
Work in Progress

• Provide it as a continuous service on PlanetLab
• Network diagnostics: which links or path segments are down?
• Iterative methods for better speed and scalability
Topology Changes

• Basic building block: add/remove one path
  – Incremental changes: O(k^2) time (vs. O(n^2 k^2) for a full re-scan)
  – Add path: check linear dependency against the old basis set Ḡ
  – Delete path p: hard when p is in Ḡ. The essential information described by p is a vector y that lies in the path space of Ḡ but not in the path space of Ḡ \ {p}; if such a y remains uncovered, add to Ḡ any path that covers it
• Add/remove end hosts, routing changes
• Topology is relatively stable on the order of a day => incremental detection
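The add-path building block can be sketched as a linear-dependency test against the current basis (numpy assumed; the paper does this incrementally in O(k^2) by updating a decomposition, whereas this illustration recomputes a least-squares projection from scratch):

```python
import numpy as np

# Current basis of monitored paths (rows), from the earlier example.
G_bar = np.array([[1, 1, 0],
                  [0, 0, 1]], dtype=float)

def add_path(G_bar, p, tol=1e-10):
    """Add path p to the basis only if it is linearly independent of it.

    Independence test: express p as a combination of existing rows via
    least squares and check the residual.
    """
    coef, *_ = np.linalg.lstsq(G_bar.T, p, rcond=None)
    residual = p - G_bar.T @ coef
    if np.linalg.norm(residual) > tol:
        return np.vstack([G_bar, p]), True  # new independent path
    return G_bar, False                     # redundant, skip it

# Path (1,1,1) is the sum of the two basis rows -> dependent, not added.
G2, added = add_path(G_bar, np.array([1.0, 1.0, 1.0]))
assert not added and G2.shape == (2, 3)

# Path (1,0,0) is not expressible from the basis -> added.
G3, added = add_path(G_bar, np.array([1.0, 0.0, 0.0]))
assert added and G3.shape == (3, 3)
```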
Evaluation

• Simulation
  – Topology
    • BRITE: Barabási-Albert, Waxman, and hierarchical models; 1K - 20K nodes
    • Real topology from Mercator: 284K nodes
  – Fraction of end hosts on the overlay: 1 - 10%
  – Loss rate distribution (90% of links are good)
    • Good link: 0-1% loss rate; bad link: 5-10% loss rate
    • Good link: 0-1% loss rate; bad link: 1-100% loss rate
  – Loss model:
    • Bernoulli: independent packet drops
    • Gilbert: bursty packet drops
  – Path loss rates simulated via transmission of 10K packets
• Experiments on PlanetLab
PlanetLab hosts by area and domain:

  Area                Domain       # of hosts
  US (40)             .edu         33
                      .org         3
                      .net         2
                      .gov         1
                      .us          1
  International (11)
    Europe (6)        France       1
                      Sweden       1
                      Denmark      1
                      Germany      1
                      UK           2
    Asia (2)          Taiwan       1
                      Hong Kong    1
    Canada                         2
    Australia                      1
Experiments on PlanetLab

• 51 hosts, each from a different organization
  – 51 × 50 = 2,550 paths
• Simultaneous loss rate measurement
  – 300 trials, 300 msec each
  – In each trial, send a 40-byte UDP packet to every other host
• Simultaneous topology measurement
  – Traceroute
• Experiments: 6/24 - 6/27; 100 experiments during peak hours
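A minimal sketch of one probing trial (Python sockets; the port number and host list are assumptions, not from the slides — receivers would count arrivals to estimate per-path loss):

```python
import socket
import time

PROBE_PORT = 12345       # assumed port, not from the slides
PAYLOAD = b"x" * 40      # 40-byte UDP probe packet, as in the experiment

def run_trial(peers):
    """Send one probe to every other host; return the number sent."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sent = 0
    for host in peers:
        sock.sendto(PAYLOAD, (host, PROBE_PORT))
        sent += 1
    sock.close()
    return sent

def run_experiment(peers, trials=300, interval=0.3):
    # 300 trials, one every 300 msec, as in the measurement setup.
    for _ in range(trials):
        run_trial(peers)
        time.sleep(interval)
```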
Sensitivity Test of Sending Frequency

• Sharp jump in the number of lossy paths when the sending rate exceeds 12.8 Mbps
PlanetLab Experiment Results

• Loss rate distribution:

  [0, 0.05): 95.9% of paths
  lossy paths [0.05, 1.0]: 4.1% of paths, broken down as
    [0.05, 0.1): 15.2%   [0.1, 0.3): 31.0%   [0.3, 0.5): 23.9%   [0.5, 1.0): 4.3%   1.0: 25.6%

• Metrics
  – Absolute error |p - p'|: average 0.0027 for all paths, 0.0058 for lossy paths
  – Relative error [BDPT02]:

      F(p, p') = max( p̂(p)/p̂(p'), p̂(p')/p̂(p) ),  where p̂(p) = max(p, ε) and p̂(p') = max(p', ε)

  – Lossy path inference: coverage and false positive ratio
• On average k = 872 paths measured out of 2,550
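The two error metrics can be sketched as follows (the value of the ε threshold is an assumption; [BDPT02] clips loss rates below at ε so that near-zero rates do not blow up the ratio):

```python
EPSILON = 0.001  # hypothetical clipping threshold for tiny loss rates

def absolute_error(p, p_hat):
    return abs(p - p_hat)

def relative_error(p, p_hat, eps=EPSILON):
    """F(p, p') = max(clip(p)/clip(p'), clip(p')/clip(p)),
    with clip(p) = max(p, eps)."""
    a, b = max(p, eps), max(p_hat, eps)
    return max(a / b, b / a)

# A perfect estimate gives the minimum relative error of 1.0 ...
assert relative_error(0.05, 0.05) == 1.0
# ... and the metric is symmetric in the true and estimated rates.
assert relative_error(0.02, 0.08) == relative_error(0.08, 0.02)
# A 0.0027 absolute error sits well inside the reported 95th percentile.
assert absolute_error(0.05, 0.0473) < 0.0135
```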
Accuracy Results for All Experiments

• For each experiment, compute its 95th-percentile absolute and relative errors
• Most experiments have absolute error < 0.0135 and relative error < 2.0
Lossy Path Inference Accuracy

• 90 out of 100 runs have coverage over 85% and a false positive ratio under 10%
• Many of the misclassifications are caused by boundary effects around the 5% lossy-path threshold
Topology/Dynamics Issues

• Out of 13 sets of pair-wise traceroutes …
• On average, 248 out of 2,550 paths have no or incomplete routing information
• No router aliases were resolved
• Conclusion: robust against topology measurement errors
• Simulations of adding/removing end hosts and of routing changes also give good results
Performance Improvement with Overlay

• With a single-node relay
• Loss rate improvement
  – Among 10,980 lossy paths:
    – 5,705 paths (52.0%) have their loss rate reduced by 0.05 or more
    – 3,084 paths (28.1%) change from lossy to non-lossy
• Throughput improvement
  – Estimated with throughput = sqrt(1.5 / loss rate) / rtt
  – 60,320 paths (24%) have a non-zero loss rate, so throughput is computable
  – Among them, 32,939 paths (54.6%) have improved throughput, and 13,734 paths (22.8%) have their throughput doubled or more
• Implication: use overlay paths to bypass congestion or failures
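A sketch of the throughput estimate and why relaying helps (the formula is the Mathis et al. TCP model as read off the slide, with MSS absorbed into the constant; the RTT and loss numbers are made up):

```python
import math

def tcp_throughput(rtt, loss_rate):
    """Rough TCP throughput estimate: sqrt(1.5 / loss_rate) / rtt.
    Only defined for non-zero loss rates, which is why lossless paths
    are excluded from the throughput comparison."""
    if loss_rate <= 0:
        raise ValueError("throughput not computable for zero loss rate")
    return math.sqrt(1.5 / loss_rate) / rtt

# Relaying around congestion: at the same RTT, cutting the loss rate
# from 4% to 1% doubles the estimate (sqrt(0.04)/sqrt(0.01) = 2).
direct = tcp_throughput(rtt=0.1, loss_rate=0.04)
relayed = tcp_throughput(rtt=0.1, loss_rate=0.01)
assert abs(relayed / direct - 2.0) < 1e-9
```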
[Figure: skip-free streaming media recovery across UC Berkeley, UC San Diego, Stanford, and HP Labs. 1. Client sets up a connection to the server; 2. client registers a trigger with the overlay network operation center; 3. network congestion/failure occurs; 4. the center detects the congestion/failure; 5. it sends an alert plus a new overlay path; 6. the new path is set up through an overlay relay node; 7. skip-free streaming media recovery.]
Adaptive Overlay Streaming Media

• Implemented with a Winamp client and a SHOUTcast server
• Congestion introduced with a Packet Shaper
• Skip-free playback: server buffering and rewinding
• Total adaptation time < 4 seconds
Adaptive Streaming Media Architecture

[Figure: the media source server runs a SHOUTcast server plus a buffering layer with a per-client buffer, byte counter, and calculated concatenation point for skip-free rewinding; clients 1-4 connect across the Internet through overlay relay nodes. Both server and client stack an overlay layer with path management on top of the TCP/IP layer; the client runs Winamp with a video/audio filter and byte counter. The overlay network operation center delivers the triggering/alert + new path messages.]
Conclusions

• A tomography-based overlay network monitoring system
  – Given n end hosts, characterize the O(n^2) paths with a basis set of O(n log n) paths
  – Selectively monitor the O(n log n) basis paths to compute their loss rates, then infer the loss rates of all other paths
• Both simulations and real Internet experiments are promising
• Built an adaptive overlay streaming media system on top of the monitoring services
  – Bypasses congestion/failures for smooth playback within seconds