Measuring
-
Upload
sandra4211 -
Category
Documents
-
view
256 -
download
4
Transcript of Measuring
Measuring & Managing Distributed Networked Systems CSE Science of Complex Systems Seminar
December 6, 2006
Chen-Nee ChuahRobust & Ubiquitous Networking (RUBINET) Lab
http://www.ece.ucdavis.edu/rubinetElectrical & Computer Engineering
2CSE Seminar, December 6, 2006
Networked Computers: Trends
Laptops
WirelessSensors
Hand-held devices(PDAs, etc)
Smart home appliances
Super computer
Intelligent transportation system
Rover Mars Vehicle
• Different capabilities/constraints• Different requirements• Explosive growth in numbers
We know how individual component/layer behaves, but when they are inter-connected and start interacting with each other, we need to cope with the unexpected!
3CSE Seminar, December 6, 2006
What do networking researchers/engineers care about when they design networks/protocols? End-to-end behavior
– Reachability– Performance in terms of delay, losses, throughput– Security– Stability/fault-resilience of the end-to-end path
System-wide behavior– Scalability– Balanced load distribution within a domain– Stability/Robustness/Survivability– Manageability – Evolvability and other X-ities
• J. Kurose, INFOCOM’04 Keynote Speech
4CSE Seminar, December 6, 2006
How do we know when we get there?
We know how to do the following fairly well:– Prove correctness/completeness of stand-alone system or protocol
• E.g., algorithm complexity, convergence behavior
– Look at steady state, worst-case, and average scenario• E.g., Queuing models
– Run simulations/experiments to show improvement of protocol/architecture Z over A, B, C, D ….
What is lacking:– End-to-end Validation of the design solution or system behavior
• Is the system behavior what we really intended?
• How do we verify what type of behaviors/properties are ‘correct’ and what are ‘abnormal’?
– Verification of the system ‘dynamics’, e.g., how different components or network layers interact
5CSE Seminar, December 6, 2006
Challenges
End-to-end system behavior depends on:
Physical topology
Routing protocols
BGP Policies
NAT boxes, firewalls, packet
filters, packet transformers
Logical topology
Traffic Demand
Messy dependency graphs => A lot to model if we truly want to understand and able to validate system behavior
6CSE Seminar, December 6, 2006
Research Philosophy
What’s lacking? A systematic approach to measure and validate
end-to-end behavior & verify system dynamics
Our Approach Combination of measurement, modeling,
inferencing, and experimental techniques
7CSE Seminar, December 6, 2006
Talk Outline
Two sample problem areas: Characterizing ‘Service Availability’ of IP networks
– Measuring & understanding Internet routing dynamics
– Availability as a metric for differentiating topologies
– Example application: graceful network upgrade
Modeling and Validating Firewall Configurations– Validating end-to-end reachability
– Static analysis on firewall rules
8CSE Seminar, December 6, 2006
Overview
What does the current Internet look like?– Topologies
• Inter-domain graphs look like power-law
• Intra-domain– Star/hub, mesh
– North America: backbone links along train tracks
– Routing hierarchy: inter-domain vs. intra-domain routing
– Distributed management• Different tiers of ASes (Customers vs. Tier-1/2/3 ISPs)
– Multiple players• ISPs, content providers, application providers, customers
9CSE Seminar, December 6, 2006
Service Availability
99.999 equivalent of PSTN Service Level Agreements offered by ISP (a single AS)
– Performance in terms of average delay and packet loss• 0.3% loss, speed-of-light end-to-end latencies
– Port availability– Time to fix a reported problem
What do Tier-1 ISPs do to meet SLAs?– Over-provisioning– Load balancing on per-prefix or per-packet basis
End-to-end availability (spanning multiple ASes)? – Does not exist!– ISPs do not share proprietary topology and operational info with one
another
10CSE Seminar, December 6, 2006
Service Availability: Challenges
Typical Service Level Agreements (SLA) or static graph properties (like out-degree, network diameter) do not account for instantaneous network operational conditions
Important factor for designing & evaluating – Failure protection/restoration mechanisms,
– Traffic engineering, e.g., IGP weight selection, capacity provisioning
– Network design e.g., topology, link/node upgrade• Given different choices of network topologies, how do we know if which one
offer good service availability?
– Router configuration, e.g., tuning protocol timers
So, what do we need?– Understanding Internet routing failures would be a good first step!
11CSE Seminar, December 6, 2006
Integrated Monitoring
Where?• Tier-1 ISP backbone (600+ nodes), enterprise/campus networks
What? • BGP/IGP passive route listeners, SONET alarm logs• IPMON/CMON passive traffic monitoring & active probes • Controlled failure experiments• Router configurations and BGP policies
Point-of-Presence (PoP)
APOP-B
POP-CCust
CustPOP-
12CSE Seminar, December 6, 2006
Questions we want to answer
IGP/BGP routing behavior during sunny and rainy days– How frequent do failures occur?
– How do they behave during convergence or routing instabilities
– What are the causes?
– Are there anomalies?
– How do failures/instability/anomalies propagate across networks?
How does the control plane impact the data forwarding?– What actually happens to the packets? What is the effect on end-
to-end service availability?
How do network components/protocols interact? – Unexpected coupling, race-conditions, conflicts?
13CSE Seminar, December 6, 2006
Lessons Learned from Phase 1 (1)
1. Transient, individual link failures are the norm No, failures are not independent & uniformly distributed [INFOCOM’04]
Apr May Jun Jul Aug Sep Oct NovDay
14CSE Seminar, December 6, 2006
Backbone Failure Characteristics
How often? (time between failures)– Weibull distribution:
How are failures spread across links?– Power law: (link with rank l has nl failures)
))/(exp(1)( shapescalexxF
35.1 lnl
15CSE Seminar, December 6, 2006
Revisit Service Availability
So, we now have the failure model….but that’s only one piece of the puzzle
– Failure recovery is *NOT* instantaneous => forwarding can be disrupted during route re-convergence
– Overload/Congestion on backup paths
16CSE Seminar, December 6, 2006
1. Delete IS-IS adj
1b. ISIS Notification
Service Convergence After Failures
Client
1a. Failure Detection
4.Update FIB on linecards
2.Generate LSP2. LSP Flooding
• Protocol convergence = 1+2+3, but data forwarding only resumes after step 4.
- FIB updates depend on number of routing prefixes• Invalid routes during route convergence (2-8 seconds)
=> complete black-hole of traffic => packet drops!
3. SPF & update RIB
17CSE Seminar, December 6, 2006
Service Convergence DelayDetection of link down (SONET) <100msTimer to filter out transient flaps 2s Timer before sending any message out 50msFlooding of the message in the network ~10ms/hop
O(network distance)Timer before computing the new shortest paths 5.5sComputation of the new shortest paths 100-400 ms
O(network size)
Protocol-level convergence: Worst case: 5.9s
Update of the Forwarding Information Base 1.5-2.1 sO(network size + BGP peering points) Service-level convergence: Worst case: 8.0s
Service availability depends on topology, failure model, router configurations, router architecture, location of peering points/customers [IWQoS’04]
18CSE Seminar, December 6, 2006
Predicting Service Availability
Network convergence time is only a rough upper-bound on actual service disruption time (SD Time)
Service availability
– Three metrics: service disruption time (SD), traffic disruption (TD), and delay violation (DV)
– Three perspectives: ingress node, link, and network-wide Develop a java-based tool to estimate SA
– Given topology, link weights, BGP peering points, and traffic demand
Goodness factors derived from SA can be used to characterize different networks [IWQoS’04]
19CSE Seminar, December 6, 2006
Simulation Studies
Two sets of topologies– Set A: Ring, Mesh, 2 Tier 1 ISPs– Set B: 10 similar topologies
Realistic Traffic Matrix– Based on data from tier 1 ISPs – 20% large nodes,
30% medium nodes and 50% small nodes Prefix Distribution
– Proportional to traffic Equal link metric assignment
20CSE Seminar, December 6, 2006
Example Application 1:
Network Design: Choice of TopologiesCDF of Traffic Disruption due to single link failures
CDF of Number of Delay-Parameter Violations due to single link failures
Ring
Ring
ISPs
ISPsMesh
CD
F
CD
F
Traffic Disruption Delay Parameter ViolationsCan differentiate topologies
based on SA Metrics
21CSE Seminar, December 6, 2006
Example Application 2:
Identifying links for upgrade
ONLY SHOW BOTTOMGRAPH
Topology-6
Link Number
Badness
Fact
or
22CSE Seminar, December 6, 2006
Example Application 3:
Graceful Network Upgrade
Tier-1 ISP backbone networks– Nodes Points of Presence (PoPs); Links Inter-POP links
“Graceful” Network Upgrade– Adding new nodes and links into an existing operational network
result in => new topology => new operational conditions => impact of link failures could be different
– Impact on service availability of existing customers should be minimized during upgrade process
Assumptions:– ISPs can reliably determine the future traffic demands when new
nodes are added
– ISPs can reliably estimate the revenue associated with the addition of every new node
• Revenues resulting from new node are independent of each other
23CSE Seminar, December 6, 2006
Graceful Network Upgrade: Problem Statement
Given a budget constraint:– Find where to add new nodes and how they should be
connected to the rest of the network
– Determine an optimal sequence for adding these new nodes/links to minimize impact on existing customers
Two-Phase Framework*– Phase-1: Finding Optimal End State
• Formulated as an optimization problem
– Phase-2: Multistage Node Addition Strategy• Formulated as a multistage dynamic programming (DP) problem
*Joint work with R. Keralapura and Prof. Y. Fan
24CSE Seminar, December 6, 2006
Phase-1: Finding Optimal End State
Objective Function:
Subject to:– Budget constraint:
– Node degree constraint
}max{ '' rk
,...2,1 21
',
nnikM i
gn
jji
bnckuMDuuMDulc T
nTngn
Tgn .)()..()..(
2''''
Link installation costNode installation cost
Budget
Revenue
25CSE Seminar, December 6, 2006
Phase-1: Finding Optimal End State (Cont’d)
– SD Time constraint
– TD constraint
– Link utilization constraint
Can be solved by non-linear programming techniques – tabu search, genetic algorithms
NjSSNqLi
qji
NqLi
qji
,
,,
,
',,
''
iLi
LU'
max
NjTTNqLi
qji
NqLi
qji
,
,,
,
',,
''
26CSE Seminar, December 6, 2006
Phase-2: Multistage Node Addition
Assumptions:– Already have the solution from Phase-1
– ISP wants to add one node and its associated links into the network at every stage
• Hard to debug if all nodes are added simultaneously
• Node/link construction delays
• Sequential budget allocation
• …
– Time period between every stage could range from several months to few years
27CSE Seminar, December 6, 2006
Phase-2: Multistage Node Addition
Multistage dynamic programming problem with costs for performance degradation– Maximize (revenue – cost)
Cost considered:– Node degree
– Link utilization
– SD Time
– TD Time
Cost instead of constraints to ensure wefind feasible solutions
28CSE Seminar, December 6, 2006
Numerical Example
Original NodesPotential Node Locations
A
C
B
D
1
2 3
4Miami
Atlanta
Seattle
DallasPhoenix
New York
Chicago
San Jose
14000 13800, 13000, 14000,cost node
50000 38000, 25000, 45000,revenue
00005b
1costlink
29CSE Seminar, December 6, 2006
Numerical Example – Phase-1 Solution
A
C
B
D
1
2 3
4Miami
Atlanta
Seattle
DallasPhoenix
New YorkChicago
San Jose
Original NodesNew NodesNew Links
Original Links
Solution from Phase-1
30CSE Seminar, December 6, 2006
Penalty for *Not* Considering SA
31CSE Seminar, December 6, 2006
Numerical Example – Phase-2 Solution
A
C
B
D
1
2 3
4Miami
Atlanta
Seattle
DallasPhoenix
New YorkChicago
San Jose
Original NodesNew NodesNew Links
Original Links
Solution from Phase-1
32CSE Seminar, December 6, 2006
Numerical Example – Phase-2 Solution
Ignoring SD time and TD in Phase-2 results in significant performance deterioration for customers
33CSE Seminar, December 6, 2006
Remarks
The two-phase framework is flexible and can be easily adapted to include other constraints or scenario-specific requirements– Revenue and traffic demands in the network after node additions could
dynamically change because of endogenous and exogenous effects• Adaptive and stochastic dynamic programming
– Revenue generated by adding one node is usually not independent of other node additions
– Uncertainty in predicting traffic demands
– Applications of this framework in different environments• Wireless networks (WiFi, WiMax, celllular)
34CSE Seminar, December 6, 2006
Talk Outline
Two sample problem areas: Characterizing ‘Service Availability’ of IP networks
– Measuring & understanding Internet routing dynamics
– Availability as a metric for differentiating topologies
– Example application: graceful network upgrade
Modeling and Validating Firewall Configurations– Validating end-to-end reachability
– Static analysis on firewall rules
35CSE Seminar, December 6, 2006
End-to-End Reachability/Security
When user A sends a packet from a source node S to a destination node D in the Internet– How do we verify there is indeed a route that exist between S and D?
– How do we verify that the packet follow a certain path that adheres to inter-domain peering relationships?
– How do we verify that only this end-to-end connection satisfy some higher-level security policy?
• E.g. Only user A can reach D and other users are blocked?
Answer depends on:– Router configurations & BGP policies
– Packet filters along the way: Firewalls, NAT boxes, etc.
36CSE Seminar, December 6, 2006
Firewalls
Popular network perimeter defense – Packet inspection and filtering
Why are protected networks still observing malicious packets ?– Firewalls can improve security only if they are configured
CORRECTLY
– But: a study of 37 firewall rule sets [Wool’04] shows• One had a single misconfiguration
• All the others had multiple errors
37CSE Seminar, December 6, 2006
Firewall Configurations
Platform-specific, low-level languages– Cisco PIX, Linux IPTables, BSD PF …
Large and complex– Hundreds of rules
Distributed deployment– Consistent w/ global security policy?
38CSE Seminar, December 6, 2006
INTERNET
`
InternetDeMilitarized Zone (DMZ)
Secured Intranet
Distributed Firewalls
ISP B
ISP A
Example: Network of Firewalls
39CSE Seminar, December 6, 2006
Access Control List (ACL)
Core of firewall configuration Consists of a list of rules Individual rules -> <P, action> P: A predicate matching packets
– 5-tuple <proto, sip, spt, dip, dpt>
Format Action
<P, accept> Forward the packet
<P, deny> Drop the packet
<P, chain Y> Goto user chain “Y”
<P, return> Resume calling chain
40CSE Seminar, December 6, 2006
IPX Model- Multiple access list in sequential order
IPtable / Netfilter Model- Modeled as function calls
41CSE Seminar, December 6, 2006
Types of Misconfigurations
Type 1: Policy Violations– Configurations that contradict a high/network level
policy, e.g., • Block NetBIOS and Remote Procedure Call• Bogons : unallocated/reserved IP
Type 2: Inconsistencies– Intra-firewall– Inter-firewall – Cross-path
Type 3: Inefficiencies– Redundant rules
42CSE Seminar, December 6, 2006
3
Intra-Firewall Inconsistencies
Shadowing: all packets matched by the current rule have been assigned a different action by preceding rules
1 deny 10.128.0.0/9
2 deny 10.0.0.0/9
3 accept 10.0.0.0/8
1 2
IP Space
43CSE Seminar, December 6, 2006
Cross-Path Inconsistencies
deny udp
deny udp
Loophole depends on routing
Or “intermittent connectivity”
Troubleshooting nightmare
ZY
Internal
W XDMZ
Internet
44CSE Seminar, December 6, 2006
Validating End-to-End Reachability/Security
How do we verify configuration of firewall rules?– Borrow model checking techniques from software
programming
Static Analysis Algorithm* – Control Flow Analysis
• Complex chain -> simple list
• Network topology -> ACL Graph
– Checks at three levels:• Individual firewalls
• Inter-firewall (see paper for details)
• Root of ACL graph
*Joint work with Prof. Z. Su and Prof. H. Chen
45CSE Seminar, December 6, 2006
RYn
X2
Y1
State Transformation
X1
A/D – packets accepted/dropped so far At the end of ACL, we have Aacl, Dacl
D
Aaccep
t
denyX2
X1
accept
Xj
denyXk
start
46CSE Seminar, December 6, 2006
Checks for Individual Firewalls
D
A
R
accept
P
Current rule to check
Initial State
P
Good RP
47CSE Seminar, December 6, 2006
Checks for Individual Firewalls
D
A
R
accept
P
P
nCorrelatio & Redundancy :Else
Redundancy0
Shadowing
Rule Masked
DP
DP
RP
48CSE Seminar, December 6, 2006
Checks for Individual Firewalls
D
A
R
accept
P
P
nCorrelatio 0
maskedpartially
0 and
DP
RPRP
49CSE Seminar, December 6, 2006
Symbolic Technique – BDD
Our algorithm need:– Keep track of where does all the packets go
– Efficient representation of random set of IP packets
– Fast set operation
Binary Decision Diagram– Compact representation of ordered decision tree
– Efficient apply algorithm, scalable (2^100)
– Encode bit-vectors (packet header)
– Set operations -> BDD operations
50CSE Seminar, December 6, 2006
Evaluation
FW #
ACLs
#
Rules
# Violations Found
Policy Inconsistency Inefficiency
PIX1 7 249 3 16 2
BSD1 2 94 3 0 0
PIX2 3 36 2 0 5
Takes about 1 second Initial allocation of 400KB (10000 BDD nodes) is
enough
51CSE Seminar, December 6, 2006
Speed – Intra-Firewall
0 100 200 300 400 500 600 700 800
ACL Rule length0.0
0.5
1.0
1.5
2.0
2.5
3.0
Tim
e (s
econ
ds)
800 rules
2 seconds
52CSE Seminar, December 6, 2006
Network of Firewalls: Remaining Issues
How do we validate/verify dynamic behavioral changes?– With multi-homing and dynamic load-balancing, the end-to-end
path and sequence of firewalls traversed could change over time
– Adaptation of firewall rules on demand depending on applications
How do we optimize firewall configurations?– Inter-firewall & inter-path optimization
• Must interface with routing plane
– Heavy traffic ‘accepted’ first?• Need to interact with traffic measurement/monitoring modules
53CSE Seminar, December 6, 2006
Other Problem Areas
Measuring, validating, and managing
1. End-to-end network properties– Example: end-to-end reachability and/or security
2. Interactions between multiple control loops (across protocol layers or between multiple entities)
– Example: overlay/IP-layer routing
3. Measurement/monitoring methodologies – How do we know we’re measuring the traffic features
that are really important instead of distorting them?
54CSE Seminar, December 6, 2006
Acknowledgement
Collaborators– Supratik Bhattacharrya (previously at Sprint ATL)
– Hao Chen, Computer Science, UC Davis
– Christophe Diot, Thomson Technology Paris Lab. – Yueyue Fan, Civil & Environmental Eng., UC Davis
– Gianluca Iannaccone, Intel Berkeley
– Zhendong Su, Computer Science, UC Davis
Students– Ram Keralapura
– Jianning Mai
– Saqib Raza
– Lihua Yuan
55CSE Seminar, December 6, 2006
Questions
http://www.ece.ucdavis.edu/rubinet