A Network-State Management Service · 2014-12-10 · A Network-State Management Service Peng Sun...
A Network-State Management Service
Peng Sun, Ratul Mahajan, Jennifer Rexford,
Lihua Yuan, Ming Zhang, Ahsan Arefin
Princeton & Microsoft
Complex Infrastructure
Microsoft Azure
Number of          2010           2014
Data centers       A few          10s
Network devices    1,000s         10s of 1,000s
Network capacity   10s of Tbps    Pbps
• Variety of vendors/models/time
Management Applications
• Traffic Engineering
• Load Balancing
• Link Corruption Mitigation
• Device Firmware Upgrade
• ……
Our Question
How to safely run multiple management applications on shared infrastructure?
Naïve Solution
• Run independently
• It does not work due to 2 problems
[Diagram: Traffic Engineering, Link Corruption Mitigation, and Firmware Upgrade, each above Network Devices]
Problem #1: Conflict
[Diagram: topology with ToRs, Agg A, Agg B, Core 1, and Core 2]
• Link-corruption-mitigation adjusts traffic away from Core 1
• TE tunes traffic among links to Core 1 and Core 2
Problem #2: Safety Violation
[Diagram: topology with ToRs, Agg A, Agg B, Core 1, and Core 2]
• Link-corruption-mitigation shuts down faulty Agg A
• Firmware-upgrade schedules Agg B to upgrade
Potential Solution #1
• One monolithic application
• Central control of all actions
[Diagram: Traffic Engineering, Firmware Upgrade, and Link Corruption Mitigation]
Too Complex to Build
• Difficult to develop
  • Combine all applications that are already individually complicated
• High maintenance cost
  • For such huge software in practice
Potential Solution #2
• Explicit coordination among applications
• Consensus over network changes
[Diagram: Traffic Engineering, Firmware Upgrade, and Link Corruption Mitigation]
Still Too Complex
• Hard to understand each other
  • Diverse network interactions
[Table: columns Application, Routing, Device Config; rows Traffic Engineering, Firmware Upgrade]
Main Enemy: Complexity
• Application development
• Application coordination
[Diagram: the Independent, Monolithic, and Explicitly coordinate approaches placed on a Simple-to-Complex scale]
What We Advocate
• Loose coupling of applications
• Design principle:
  • Simplicity with safety guarantees
• Forgo joint optimization
  • Worthwhile tradeoff for simplicity
  • Applications could do it out-of-band
Overview of Statesman
• Network operating system for safe multi-application operation
• Uses network state abstraction
  • Three views of network state
  • Dependency model of states
The “State” in Statesman
• Complexity of dealing with devices
  • Heterogeneity
  • Device-specific commands
[Diagram: Network State as a layer above Network Devices]
State Variable Examples
State Variable           Value
Device Power Status      Up, down
Device Firmware          Version number
Device SDN Agent Boot    Up, down
Device Routing State     Routing rules
Link Admin Status        Up, down
Link Control Plane       BGP, OpenFlow, …
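One way to picture the table above is as a flat map from (entity, variable) to value. The sketch below is only an illustration under that assumption: the class name, the `device:`/`link:` prefixes, and the sample values are hypothetical and are not taken from Statesman itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateVariable:
    entity: str    # hypothetical naming, e.g. "device:AggA" or "link:AggA-Core1"
    variable: str  # e.g. "PowerStatus", "FirmwareVersion", "AdminStatus"
    value: str     # e.g. "up", "down", a version number


def to_state(rows):
    """Build a network state: a map from (entity, variable) to value."""
    return {(r.entity, r.variable): r.value for r in rows}


def read(state, entity, variable):
    """Return the value of one state variable, or None if it is not present."""
    return state.get((entity, variable))


# Illustrative values only.
observed = to_state([
    StateVariable("device:AggA", "PowerStatus", "up"),
    StateVariable("device:AggA", "FirmwareVersion", "1.2"),
    StateVariable("link:AggA-Core1", "AdminStatus", "up"),
])
```

With this shape, an application asks for a variable by name instead of issuing a device-specific command, which is the simplification the next slide describes.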
Simplify Device Interaction
• Past: Application talks to Network Devices directly, using device statistics and device-specific cmds (SNMP, OF, vendor API, …)
• Now: Application reads and writes Network State, which sits above Network Devices
Views of Network State
• Observed State: Actual state of the whole network
• Target State: Desired state to be updated on the whole network
[Diagram: Applications, Observed State, Target State, Network Devices]
Two Views Are Not Enough
One More View
• Proposed State: A group of entity-variable-values desired by an application
[Diagram: Applications, Proposed State, Observed State, Target State, Network Devices]
How Merging Works
• Combine multiple proposed states into a safe target state
• Conflict resolution
  • Last-writer-wins
  • Priority-based locking
  • Sufficient for current deployment
• Safety invariant checking
  • Partial rejection & Skip update
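The conflict-resolution bullets above can be sketched as follows. This is a minimal illustration, not Statesman's implementation: the `Merger` class, its method names, the numeric priorities, and the rule that a higher-priority application may take over a lock are all assumptions made for the example. Safety invariant checking is left out of this block.

```python
class Merger:
    """Toy merge of proposed states into a target state (assumed semantics)."""

    def __init__(self, priority):
        self.priority = priority  # app name -> int; higher wins (assumption)
        self.locks = {}           # (entity, variable) -> app holding the lock
        self.target = {}          # (entity, variable) -> accepted value

    def acquire(self, app, key):
        """Take the lock if it is free, already ours, or held by a lower-priority app."""
        holder = self.locks.get(key)
        if holder is None or holder == app or self.priority[app] > self.priority[holder]:
            self.locks[key] = app
            return True
        return False

    def release(self, app, key):
        if self.locks.get(key) == app:
            del self.locks[key]

    def propose(self, app, proposed):
        """Merge one proposed state and return the keys that were rejected."""
        rejected = []
        for key, value in proposed.items():
            holder = self.locks.get(key)
            if holder is not None and holder != app:
                rejected.append(key)      # partial rejection: only this variable is refused
            else:
                self.target[key] = value  # last writer wins on unlocked variables
        return rejected


# Illustrative run with hypothetical applications and values.
m = Merger({"firmware-upgrade": 2, "te": 1})
br1 = ("BR1", "RoutingState")
br2 = ("BR2", "RoutingState")
m.propose("te", {br1: "rules-v1"})
m.acquire("firmware-upgrade", br1)
rejected = m.propose("te", {br1: "rules-v2", br2: "rules-v2"})
```

In this run the write to BR1 is refused because firmware-upgrade holds the lock, while the write to BR2 in the same proposal still goes through.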
Choose Safety Invariants
• Tight: Hinder application too frequently
• Loose: Cannot protect network operation
• Our current choice
  • Connectivity: Every pair of ToRs in one DC is connected
  • Capacity: 99% of ToR pairs have at least 50% capacity
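The two invariants above could be checked roughly as below. This is a sketch under simplifying assumptions: the topology is a plain list of links that are currently up, and the per-pair capacity is supplied by the caller as a fraction of its baseline rather than computed from the topology. Function and parameter names are hypothetical.

```python
from collections import deque


def connected(tors, up_links):
    """True if every pair of ToRs can reach each other over the links that are up."""
    if not tors:
        return True
    adjacency = {}
    for a, b in up_links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    # Breadth-first search from one ToR; all other ToRs must be reached.
    seen = {tors[0]}
    queue = deque([tors[0]])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return all(t in seen for t in tors)


def capacity_ok(pair_capacity, pair_fraction=0.99, min_capacity=0.50):
    """pair_capacity maps a ToR pair to its remaining capacity as a fraction of baseline."""
    if not pair_capacity:
        return True
    good = sum(1 for c in pair_capacity.values() if c >= min_capacity)
    return good / len(pair_capacity) >= pair_fraction


# Illustrative topology: two ToRs, each linked to Agg A and Agg B.
tors = ["ToR1", "ToR2"]
all_links = [("ToR1", "AggA"), ("ToR1", "AggB"), ("ToR2", "AggA"), ("ToR2", "AggB")]
without_agg_a = [link for link in all_links if "AggA" not in link]
without_both = [link for link in without_agg_a if "AggB" not in link]
```

Taking Agg A out alone keeps the ToRs connected; taking Agg B out as well does not, which is the situation from Problem #2 that the checker is meant to refuse.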
Recap of Three-View Model
• Simplify network management
• Observed State: What we see from the network
• Proposed State: What we want the network to be
• Target State: What can be actually done on the network
[Diagram: Applications and Statesman around the Observed, Proposed, and Target States]
Yet Another Problem
• What’s in Proposed State
  • Small number of state variables that the application cares about
• Implicit conflicts arise
  • Caused by state dependency
Implicit Conflict
[Diagram: devices A, B, C, and D]
• TE writes new value of routing state of B for tunneling traffic
• Firmware-upgrade writes new value of firmware state of B
Dependency Relations
[Diagram: dependency relations among state variables]
• Device: PowerState, FirmwareVersion, ConfigurationState, RoutingState
• Link: AdminState, ConfigurationState
• PathState
• Also labeled on the diagram: bgpd, SDN
Build in Dependency Model
• Statesman calculates it internally
• Only exposes the result for each state variable
  • Whether the variable is controllable
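A rough sketch of how a "controllable" flag could be derived from a dependency chain. The chain used here (power, then firmware, then configuration, then routing), the set of blocking values, and the choice to treat missing observations as blocking are all assumptions for illustration; the slides only name the variables and say that Statesman computes the result internally.

```python
# Assumed device dependency chain: each variable depends on the one it maps to.
PARENT = {
    "FirmwareVersion": "PowerState",
    "ConfigurationState": "FirmwareVersion",
    "RoutingState": "ConfigurationState",
}

# Assumed observed values that make dependent variables uncontrollable.
BLOCKING = {"down", "upgrading", "unknown"}


def controllable(observed, device, variable):
    """A variable is controllable only if none of its ancestors is in a blocking state."""
    parent = PARENT.get(variable)
    while parent is not None:
        # A missing observation is treated as "unknown", which blocks.
        if observed.get((device, parent), "unknown") in BLOCKING:
            return False
        parent = PARENT.get(parent)
    return True


# Device B while its firmware is being upgraded (illustrative values).
during_upgrade = {
    ("B", "PowerState"): "up",
    ("B", "FirmwareVersion"): "upgrading",
    ("B", "ConfigurationState"): "ok",
}
after_upgrade = dict(during_upgrade)
after_upgrade[("B", "FirmwareVersion")] = "2.0"
```

Under these assumptions, B's routing state is not controllable while its firmware is upgrading, so a TE write to it would be an implicit conflict that can be detected without TE knowing about the upgrade.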
Statesman System
[Diagram: Monitor, Checker, and Updater around a Storage Service that holds the Observed State, Proposed State, and Target State]
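The components in the diagram can be read as a loop over the three views. The sketch below is one plausible reading, not the actual system: the division of work between Monitor, Checker, and Updater is assumed from their names, and `Storage`, `poll_devices`, and `is_safe` are hypothetical stand-ins.

```python
class Storage:
    """In-memory stand-in for the storage service holding the three views."""

    def __init__(self):
        self.observed = {}  # (entity, variable) -> value seen on the network
        self.proposed = []  # list of (app, {(entity, variable): value})
        self.target = {}    # (entity, variable) -> accepted value


def monitor(storage, poll_devices):
    """Refresh the observed state from the network (poll_devices is hypothetical)."""
    storage.observed = dict(poll_devices())


def checker(storage, is_safe):
    """Accept each proposed state only if the resulting network state stays safe."""
    for _app, proposal in storage.proposed:
        candidate = {**storage.observed, **storage.target, **proposal}
        if is_safe(candidate):
            storage.target.update(proposal)
        # An unsafe proposal is skipped; the application may propose again later.
    storage.proposed = []


def updater(storage):
    """List the changes needed to move the observed state toward the target state."""
    return [(key, value) for key, value in sorted(storage.target.items())
            if storage.observed.get(key) != value]


def one_agg_stays_up(state):
    """Toy invariant: at least one device must remain powered up."""
    return any(v == "up" for (_dev, var), v in state.items() if var == "PowerState")


store = Storage()
monitor(store, lambda: {("AggA", "PowerState"): "up", ("AggB", "PowerState"): "up"})
store.proposed.append(("link-corruption-mitigation", {("AggA", "PowerState"): "down"}))
store.proposed.append(("firmware-upgrade", {("AggB", "PowerState"): "down"}))
checker(store, one_agg_stays_up)
changes = updater(store)
```

Here the first proposal is accepted and the second is skipped because it would take down the last remaining Agg, so only the Agg A change reaches the updater.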
Deployment Overview
• Operational in Microsoft Azure for 12 months
• Covers 10 DCs with 20K devices
Production Applications
• 3 diverse applications built
  • Device firmware upgrade
  • Link corruption mitigation
  • Traffic engineering
• Finished within months
• Only thousands of lines of code
Case #1: Resolve Conflict
Inter-DC TE & Firmware-upgrade
[Diagram: DC 1 (BR 1, BR 2), DC 2 (BR 3, BR 4), DC 3 (BR 5, BR 6), DC 4 (BR 7, BR 8); DC = Data Center, BR = Border Router]
Timeline:
• Firmware-upgrade acquires lock of BR1
• TE fails to acquire lock, and moves traffic away
• BR1 firmware upgrade starts
• BR1 firmware upgrade ends. Lock released.
• TE re-acquires lock, and moves traffic back
Case #1 Summary
• Each application:
  • Simple logic
  • Unaware of the other
• Statesman enables:
  • Conflict resolution
  • Necessary coordination
Case #2: Maintain Capacity Invariant
Firmware-upgrade & Link-corruption-mitigation
[Diagram: Pod 1, Pod 4, and Pod 10 shown, each with ToRs 1 to n and Aggs 1 to 4, below the Core layer; one link is marked as corrupting packets]
• Upgrade proceeds at normal speed in Pod 3 and 5
• Upgrade in Pod 4 is slowed down by checker due to lost capacity
Case #2 Summary
• Statesman:
  • Automatically adjusts application progress
  • Keeps the network within safety requirements
Conclusion
• Need network operating system for multiple management applications
• Statesman
  • Loose coupling of applications
  • Network state abstraction
• Deployed and operational in Azure
Thanks!
Questions?
Check paper for related work