
Backward Congestion Notification Version 2.0

Davide Bergamasco (davide@cisco.com)

Rong Pan (ropan@cisco.com)

Cisco Systems, Inc.

IEEE 802.1 Interim Meeting

Garden Grove, CA (USA)

September 22, 2005


Credits

• Valentina Alaria (Cisco)

• Andrea Baldini (Cisco)

• Flavio Bonomi (Cisco)

• Manoj K. Wadekar (Intel)


BCN v2.0

• Desire from Mick to see an analytical study of BCN stability

• BCN v2.0 improvements

• Linear control loop allows analysis of stability

• Simplified detection mechanism

• Reduced signaling rate

• Original BCN framework remains the same


BCN Background

[Figure: Data Center Network. End Nodes A, B, and C; Edge Switches A, B, and C; Core Switch; 10 Gbps links. Labels: Traffic, Congestion, BCN Message.]

Detection & Signaling

[Figure: congestion point queue with IN and OUT ends, spanning EMPTY QUEUE to FULL QUEUE, with thresholds Qsc and Qeq. Message to generate, by queue region: BCN (Qoff, Qdelta), BCN (0,0), or No Message. Flowchart: RL Tagged Frame? (Yes/No); Sample Frame with Probability P; Sampled Frame? (Yes: Send BCN; No: NOP).]

Qoff = Qeq - Qlen, range [-Qeq, +Qeq]

Qdelta = #pktEnq - #pktDeq, range [-2Qeq, +2Qeq]
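The Qoff and Qdelta definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the switch implementation: the class name `CongestionPoint`, its method names, and the choice to restart the enqueue/dequeue counters at each sample are assumptions. The clamping ranges follow the bracketed intervals on the slide.

```python
import random


def clamp(value, lo, hi):
    """Limit value to the closed interval [lo, hi]."""
    return max(lo, min(hi, value))


class CongestionPoint:
    """Hypothetical model of a congestion point queue (names are illustrative)."""

    def __init__(self, q_eq, sample_prob, rng=None):
        self.q_eq = q_eq                # equilibrium threshold Qeq
        self.sample_prob = sample_prob  # sampling probability P
        self.q_len = 0                  # current queue length Qlen
        self.enq = 0                    # #pktEnq since the last sample
        self.deq = 0                    # #pktDeq since the last sample
        self.rng = rng or random.Random()

    def feedback(self):
        """Return (Qoff, Qdelta), clamped to the ranges given on the slide."""
        q_off = clamp(self.q_eq - self.q_len, -self.q_eq, self.q_eq)
        q_delta = clamp(self.enq - self.deq, -2 * self.q_eq, 2 * self.q_eq)
        self.enq = self.deq = 0  # assumption: counters restart at each sample
        return q_off, q_delta

    def dequeue(self):
        """Remove one frame if the queue is not empty."""
        if self.q_len > 0:
            self.q_len -= 1
            self.deq += 1

    def enqueue(self):
        """Add one frame; return (Qoff, Qdelta) if it is sampled, else None."""
        self.q_len += 1
        self.enq += 1
        if self.rng.random() >= self.sample_prob:
            return None  # frame not sampled: no BCN message
        return self.feedback()
```

With `sample_prob=0.01`, roughly one frame in a hundred triggers a feedback computation; the caller would then place the returned pair in a BCN message.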


Reaction

[Figure: reaction point at the EDGE NODE. BCN Messages from the congested point arrive on Control IN. Frames on Data IN are matched against filters F1, F2, ..., Fn with rate limiters R1, R2, ..., Rn, or take the No Match path. Frames leave on Data OUT toward the NETWORK CORE as Packets Marked with RATE_LIMITED_TAG.]

* Feedback

Fb = (Qoff - W * Qdelta)

* Additive Increase (Fb > 0)

R = R + Gi * Fb * ru

* Multiplicative Decrease (Fb < 0)

R = R * ( 1 - Gd * |Fb| )

* Parameters

W = derivative weight

Gi = increase gain

Gd = decrease gain

ru = rate unit
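The reaction rules above translate directly into code. The following is a minimal sketch: the function names and the optional `min_rate`/`max_rate` bounds are assumptions added for illustration, while the three formulas are the ones on the slide.

```python
def feedback(q_off, q_delta, w):
    """Fb = Qoff - W * Qdelta."""
    return q_off - w * q_delta


def update_rate(rate, fb, gi, gd, ru, min_rate=0.0, max_rate=float("inf")):
    """Apply one BCN message to a rate limiter and return the new rate.

    min_rate and max_rate are hypothetical safety bounds, not part of the slide.
    """
    if fb > 0:
        rate = rate + gi * fb * ru        # additive increase
    elif fb < 0:
        rate = rate * (1 - gd * abs(fb))  # multiplicative decrease
    return max(min_rate, min(max_rate, rate))
```

For example, with W = 2, Gi = 4, Gd = 1/64, and ru = 8 Mbps, a message carrying Qoff = -4 and Qdelta = 2 gives Fb = -8, so a 1 Gbps limiter is cut by a factor of 1 - 8/64 to 875 Mbps.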


Suggested BCN Message Format

 0                            15                              31
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+   DA = SA of sampled frame    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+    SA = MAC Address of CP     +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   IEEE 802.1Q Tag or S-Tag                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|        EtherType = BCN        |Version|       Reserved        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                             CPID                              +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|             Qoff              |            Qdelta             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Timestamp           |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
|                                                               |
|        First N bytes of sampled frame starting from DA        |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              FCS                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
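To make the layout concrete, here is a hedged sketch that serializes the fields of the suggested BCN message with Python's `struct` module. Several things are assumptions rather than part of the proposal: the field widths are read off the 32-bit-wide diagram (48-bit addresses, 4-bit Version, 12-bit Reserved, 64-bit CPID, 16-bit signed Qoff and Qdelta, 16-bit Timestamp), the EtherType constant is a placeholder because the slides do not assign a value, the default N = 64 is arbitrary, and the FCS is left to the MAC layer.

```python
import struct

# Placeholder only: an IEEE local experimental EtherType, NOT an assigned BCN value.
BCN_ETHERTYPE = 0x88B5

# DA, SA, 802.1Q tag, EtherType, Version+Reserved, CPID, Qoff, Qdelta, Timestamp
_BCN_HEADER = struct.Struct("!6s6sIHH8shhH")


def pack_bcn(da, sa, vlan_tag, version, cpid, q_off, q_delta, timestamp,
             sampled_frame, n=64):
    """Build a BCN message (without FCS) from the fields in the diagram.

    da            -- destination MAC, i.e. the SA of the sampled frame (6 bytes)
    sa            -- MAC address of the congestion point (6 bytes)
    vlan_tag      -- 32-bit 802.1Q tag or S-Tag as an integer
    cpid          -- congestion point identifier (8 bytes)
    sampled_frame -- raw bytes of the sampled frame, starting from its DA
    n             -- how many bytes of the sampled frame to include (assumed)
    """
    version_reserved = (version & 0xF) << 12  # Version in the top 4 bits
    header = _BCN_HEADER.pack(da, sa, vlan_tag, BCN_ETHERTYPE,
                              version_reserved, cpid, q_off, q_delta, timestamp)
    return header + sampled_frame[:n]
```

The fixed part is 34 bytes long under these assumptions, so Qoff and Qdelta sit at byte offsets 28 and 30.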


Suggested RLT Tag Format

 0     3       7              15                              31
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+   DA of rate-limited frame    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+   SA of rate-limited frame    +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|        IEEE 802.1Q Tag or S-Tag of rate-limited frame         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|        EtherType = RLT        |Version|       Reserved        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                             CPID                              +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Timestamp           |EtherType of rate limited frame|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                 Payload of rate-limited frame                 +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              FCS                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Simulation Environment (1)

[Figure: simulation topology. Nodes ST1-ST4, SU1-SU4, DT, DU, SR1, SR2, DR1, DR2, and SJ; edge switches ES1-ES6; Core Switch. Legend: TCP Bulk, UDP On/Off. A Congestion marker indicates the congested point.]


Simulation Environment (2)

• Short Range, High Speed DC Network

• Link Capacity = 10 Gbps

• Switch latency = 1 µs

• Link Length = 100 m (0.5 µs propagation delay)

• Control loop

• Delay ~ 3 µs

• Parameters

• W = 2

• Gi = 4

• Gd = 1/64

• Ru = 8 Mbps

• Workload

• ST1-ST4: 10 parallel TCP connections transferring 1 MB each continuously

• SU1-SU4: 64 KB bursts of UDP traffic starting at t = 10 ms
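The propagation delay quoted for the link length can be checked with a one-line calculation. The sketch below assumes a signal speed of about 2 x 10^8 m/s in the cable (roughly two thirds of the speed of light); that figure and the function name are assumptions, not values stated on the slide.

```python
# Assumption: signals travel at roughly 2/3 of the speed of light in the medium.
PROPAGATION_SPEED_M_PER_S = 2.0e8


def propagation_delay_us(link_length_m, speed_m_per_s=PROPAGATION_SPEED_M_PER_S):
    """One-way propagation delay in microseconds for a link of the given length."""
    return link_length_m / speed_m_per_s * 1e6


short_link = propagation_delay_us(100)     # about 0.5 microseconds
long_link = propagation_delay_us(20000)    # about 100 microseconds
```

Under this assumption a 100 m link contributes about 0.5 µs each way, and a 20000 m link about 100 µs.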

BCNv1.0

BCNv2.0

Higher Stability @ Steady State

Faster Transient Response

Simulation Environment (3)

• Long Range, High Speed DC Network

• Link Capacity = 10 Gbps

• Switch latency = 1 µs

• Link Length = 20000 m (100 µs propagation delay)

• Control loop

• Delay ~ 200 µs

• Parameters

• W = 2

• Gi = 4

• Gd = 1/64

• Ru = 8 Mbps

• Workload

• ST1-ST4: 10 parallel TCP connections transferring 1 MB each continuously

• SU1-SU4: 64 KB bursts of UDP traffic starting at t = 10 ms

BCNv1.0

BCNv2.0

Much higher stability @ steady state with larger loop delays

Summary

• BCN v2 has a number of advantages …

• Can be studied analytically

• Better protection of TCP flows in mixed TCP and UDP traffic scenarios

• Detection algorithm independent of Switch implementation

• Better Performance

• Lower signaling frequency (from 10% to 1%)

• Better stability

• Increased tolerance to loop delays

• … and one disadvantage

• Slower convergence to fairness

A Control-Theoretic Approach to BCN Design and Analysis

Notation

N: Number of Flows

C: Link Capacity

τ: Round Trip Delay

w: Weight of the Derivative

Pm: Sampling Probability

Gi: Additive Increase Gain

Gd: Multiplicative Decrease Gain

Block Diagram of BCN Congestion Control

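[Figure: block diagram of the BCN control loop. Blocks and signals: N, C, q, R, ∆R, Pm, Gi, Gd, Time Delay, joined by summing junctions. The feedback block computes:]

$$Fb(T) = \bigl(q_{eq} - q(T)\bigr) - w\,\bigl(q(T) - q(T-1)\bigr)$$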

Non-linear Differential Equations

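Link Control

$$\frac{dq(t)}{dt} = N\,R(t) - C$$

$$Fb(t) = \bigl(q_{eq} - q(t)\bigr) - \frac{w}{C\,P_m}\,\frac{dq(t)}{dt}$$

Source Control

If Fb(t-τ) > 0:

$$\frac{dR(t)}{dt} = G_i\,Fb(t-\tau)\,R(t)\,P_m$$

If Fb(t-τ) < 0:

$$\frac{dR(t)}{dt} = G_d\,Fb(t-\tau)\,R(t)^2\,P_m$$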
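The link and source equations can be explored numerically. The sketch below integrates them with a forward-Euler step and a fixed-length buffer for the delayed feedback. It is an illustration under stated assumptions, not the simulator behind the results in this deck: the function name, the Euler scheme, the clamping of queue and rate at zero, and the zero initial feedback history are all assumptions. Quantities are in arbitrary consistent units (for example packets and packets per second).

```python
from collections import deque


def simulate_bcn(n_flows, capacity, tau, w, gi, gd, pm, q_eq,
                 r0, q0=0.0, dt=1e-6, steps=1000):
    """Forward-Euler integration of the fluid model; returns (queue, rate) lists."""
    delay_steps = max(1, round(tau / dt))
    # Assumption: no feedback is in flight before t = 0.
    fb_history = deque([0.0] * delay_steps, maxlen=delay_steps)
    q, rate = q0, r0
    qs, rates = [q], [rate]
    for _ in range(steps):
        dq = n_flows * rate - capacity           # link: dq/dt = N*R - C
        if q <= 0.0 and dq < 0.0:
            dq = 0.0                             # an empty queue cannot drain
        fb_now = (q_eq - q) - w * dq / (capacity * pm)
        fb_delayed = fb_history[0]               # Fb(t - tau)
        fb_history.append(fb_now)                # oldest entry is dropped
        if fb_delayed > 0:
            d_rate = gi * fb_delayed * rate * pm          # additive increase
        elif fb_delayed < 0:
            d_rate = gd * fb_delayed * rate * rate * pm   # multiplicative decrease
        else:
            d_rate = 0.0
        q = max(0.0, q + dq * dt)
        rate = max(0.0, rate + d_rate * dt)
        qs.append(q)
        rates.append(rate)
    return qs, rates
```

Plotting the returned queue trace against time for different values of `tau` or `w` gives a quick feel for how loop delay and the derivative weight affect oscillation.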


Linearization Around Operating Point

• Using feedback control to analyze local stability

• Operating point:

R = C/N;

q’ = qeq – q = 0;

• Linearization

Difficulty: the system response differs depending on sgn(Fb(t-τ))

– Luckily, a piecewise-linear function

Details are in the appendix


Block Diagram of BCN Feedback Control

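[Figure: linearized feedback loop. The link block N/s integrates the rate R into the queue q; the feedback block produces Fb from q; the source block maps Fb back to R through a summing junction. Annotations: lose 90° margin; add lead zero to compensate.]

Feedback block:

$$Fb(s):\quad \left(1 + \frac{w\,s}{C\,P_m}\right) e^{-s\tau}$$

Multiplicative Decrease:

$$\frac{R(s)}{Fb(s)} = \frac{G_d\,P_m\,C^2/N^2}{s + w\,G_d\,C/N}$$

Additive Increase:

$$\frac{R(s)}{Fb(s)} = \frac{G_i\,P_m\,C/N}{s + w\,G_i}$$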

The Effect Of Zero From Time Domain’s Eyes

[Figure: time-domain traces of R and q; the zero corresponds to the dq/dt term.]

Choosing Parameters – an example

• Network conditions (10G link)

N = 50

τ = 200 µs

• Choose parameters such that the feedback loop is stable with a 35° margin

w = 4

Gi = 2 Mbps

Gd = 1/128

Pm = 0.01
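One way to check a parameter choice like this is to compute the phase margin of the linearized loop numerically. The sketch below assumes the open-loop transfer function is the product of the link integrator N/s, the feedback block (1 + w*s/(C*Pm)) * e^(-s*τ), and the source block from the linearization: (Gi*Pm*C/N) / (s + w*Gi) for additive increase, or (Gd*Pm*C²/N²) / (s + w*Gd*C/N) for multiplicative decrease. Function names are hypothetical, and the units in the usage lines are illustrative (C in packets per second) rather than taken from the slides, so the result is not expected to reproduce the 35° figure.

```python
import math


def source_block(branch, n, c, pm, w, gi, gd):
    """Return (gain, pole) of the source block R(s)/Fb(s) = gain / (s + pole)."""
    if branch == "increase":
        return gi * pm * c / n, w * gi
    return gd * pm * c ** 2 / n ** 2, w * gd * c / n


def phase_margin_deg(branch, n, c, pm, w, gi, gd, tau):
    """Return (gain crossover in rad/s, phase margin in degrees)."""
    gain, pole = source_block(branch, n, c, pm, w, gi, gd)
    k = n * gain          # integrator N/s combined with the source gain
    zero = c * pm / w     # lead zero of the feedback block at s = -C*Pm/w

    def magnitude(omega):
        # |L(j*omega)|; the delay term has unit magnitude and drops out.
        return k * math.hypot(1.0, omega / zero) / (omega * math.hypot(omega, pole))

    # The magnitude decreases monotonically, so bisect on a log scale for |L| = 1.
    lo, hi = 1e-6, 1e15
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        if magnitude(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    wc = math.sqrt(lo * hi)
    # Phase: lead zero, integrator (-90 degrees), source pole, and loop delay.
    phase = (math.atan(wc / zero) - math.pi / 2
             - math.atan(wc / pole) - wc * tau)
    return wc, 180.0 + math.degrees(phase)


# Illustrative values only; the units are assumptions, not the slide's.
wc, margin = phase_margin_deg("decrease", n=50, c=833333.0, pm=0.01,
                              w=4.0, gi=10.0, gd=1 / 128, tau=200e-6)
```

Because the delay only adds phase lag, increasing `tau` leaves the crossover frequency unchanged and lowers the margin, which is the mechanism behind the loss of stability at larger loop delays.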

Stability Result

[Figure: stability plot, annotated "lost 90° margin".]

1. With N = 50 and delay = 200 µs, the system is stable

2. The phase margin allows for extreme network conditions of N -> 1000 flows or τ -> 1 ms before oscillation

Simulation Result Shows A Stable System for N = 50; Delay = 200us

Simulation Result Shows System is stable, but on the verge of oscillation: N = 50, Delay = 1ms

Change W = 4 -> 1

1. When w = 1, a system with N = 50, delay = 200us has already run out of margin and is on the verge of oscillation

2. With w = 1 the effect of the zero diminishes, and the system can’t cope with a wide range of network conditions

Indeed, the system is stable but on the verge of oscillation even for N = 50, Delay = 200us when w = 1.0

Requests to 802.1

• Start a Task Force on Congestion Management

• Use BCN as a Baseline Proposal

Appendix

Linearizing…

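With q and R measured as deviations from the operating point (q = q_eq, R = C/N):

$$\dot{q}(t) = N\,R(t) \quad\Longrightarrow\quad q(s) = \frac{N\,R(s)}{s}$$

$$Fb(t) = -G\left(q(t) + \frac{w}{C\,P_m}\,\dot{q}(t)\right) \quad\Longrightarrow\quad Fb(s) = -G\left(1 + \frac{w\,s}{C\,P_m}\right) q(s)$$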

Linearizing Additive Increase Function

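$$f:\quad \frac{dR(t)}{dt} = G_i\,R(t)\,P_m\,Fb(t)$$

$$\frac{\partial f}{\partial Fb(t)} = G_i\,R(t)\,P_m = \frac{G_i\,P_m\,C}{N}$$

$$\frac{\partial f}{\partial R(t)} = \frac{\partial}{\partial R(t)}\left[G_i\,R(t)\,P_m\,G\left(\bigl(q_{eq} - q(t)\bigr) - \frac{w}{C\,P_m}\,\frac{dq(t)}{dt}\right)\right]$$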

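Linearizing Additive Increase Function

$$
\begin{aligned}
\frac{\partial f}{\partial R(t)}
&= \frac{\partial}{\partial R(t)}\left[G_i\,R(t)\,P_m\,G\left(\bigl(q_{eq} - q(t)\bigr) - \frac{w}{C\,P_m}\,\frac{dq(t)}{dt}\right)\right] \\
&= G_i\,P_m\,G\left(\bigl(q_{eq} - q(t)\bigr) - \frac{w}{C\,P_m}\,\frac{dq(t)}{dt}\right)
 + G_i\,G\,R(t)\,P_m\,\frac{\partial}{\partial R(t)}\left(\bigl(q_{eq} - q(t)\bigr) - \frac{w}{C\,P_m}\,\frac{dq(t)}{dt}\right) \\
&= G_i\,G\,R(t)\,P_m\,\frac{\partial}{\partial R(t)}\left(-\frac{w\,\bigl(N\,R(t) - C\bigr)}{C\,P_m}\right) \\
&= -G_i\,G\,R(t)\,P_m\,\frac{w\,N}{C\,P_m} \\
&= -w\,G\,G_i
\end{aligned}
$$

The first term of the second line vanishes at the operating point, where q = q_eq and dq/dt = 0; the last step uses R = C/N. Therefore:

$$s\,R = \frac{G_i\,P_m\,C}{N}\,Fb - w\,G\,G_i\,R
\quad\Longrightarrow\quad
R = \frac{G_i\,P_m\,C/N}{s + w\,G\,G_i}\,Fb$$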

Linearizing Multiplicative Decrease Function

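$$g:\quad \frac{dR(t)}{dt} = G_d\,R(t)^2\,P_m\,Fb(t)$$

$$\frac{\partial g}{\partial Fb(t)} = G_d\,R(t)^2\,P_m = \frac{G_d\,P_m\,C^2}{N^2}$$

$$\frac{\partial g}{\partial R(t)} = \frac{\partial}{\partial R(t)}\left[G_d\,R(t)^2\,P_m\,G\left(\bigl(q_{eq} - q(t)\bigr) - \frac{w}{C\,P_m}\,\frac{dq(t)}{dt}\right)\right]$$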

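Linearizing Multiplicative Decrease Function

$$
\begin{aligned}
\frac{\partial g}{\partial R(t)}
&= \frac{\partial}{\partial R(t)}\left[G_d\,R(t)^2\,P_m\,G\left(\bigl(q_{eq} - q(t)\bigr) - \frac{w}{C\,P_m}\,\frac{dq(t)}{dt}\right)\right] \\
&= 2\,G_d\,R(t)\,P_m\,G\left(\bigl(q_{eq} - q(t)\bigr) - \frac{w}{C\,P_m}\,\frac{dq(t)}{dt}\right)
 + G_d\,R(t)^2\,P_m\,G\,\frac{\partial}{\partial R(t)}\left(\bigl(q_{eq} - q(t)\bigr) - \frac{w}{C\,P_m}\,\frac{dq(t)}{dt}\right) \\
&= G_d\,R(t)^2\,P_m\,G\,\frac{\partial}{\partial R(t)}\left(-\frac{w\,\bigl(N\,R(t) - C\bigr)}{C\,P_m}\right) \\
&= -G_d\,R(t)^2\,P_m\,G\,\frac{w\,N}{C\,P_m} \\
&= -\frac{w\,G\,G_d\,C}{N}
\end{aligned}
$$

The first term of the second line vanishes at the operating point, where q = q_eq and dq/dt = 0; the last step uses R = C/N. Therefore:

$$s\,R = \frac{G_d\,P_m\,C^2}{N^2}\,Fb - \frac{w\,G\,G_d\,C}{N}\,R
\quad\Longrightarrow\quad
R = \frac{G_d\,P_m\,C^2/N^2}{s + w\,G\,G_d\,C/N}\,Fb$$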

Issue #1: Non-linearity

[Figure: queue length Q versus time t around Qeq, with + and - BCN messages marked; annotation: Stop Generation of BCN Messages.]

• ISSUE: Overshoots and undershoots accumulate over time

• SOLUTION: Signal only when

• Q > Qeq && dQ/dt > 0

• Q < Qeq && dQ/dt < 0

• Easy to implement in hardware: just an Up/Down counter

• Increment @ every enqueue

• Decrement @ every dequeue

• Reduces signaling rate by 50%!!
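The signaling rule and the up/down counter described above can be sketched as follows. The class and function names are assumptions, as is the choice to reset the counter each time it is read; the two conditions are the ones listed on the slide.

```python
class UpDownCounter:
    """Tracks the sign of dQ/dt between samples with a single counter."""

    def __init__(self):
        self.count = 0

    def on_enqueue(self):
        self.count += 1   # increment at every enqueue

    def on_dequeue(self):
        self.count -= 1   # decrement at every dequeue

    def read_and_reset(self):
        """Return the net change since the last read (assumed reset policy)."""
        value, self.count = self.count, 0
        return value


def should_signal(q_len, q_eq, q_delta):
    """Signal only when the queue is moving away from equilibrium."""
    growing_above = q_len > q_eq and q_delta > 0
    draining_below = q_len < q_eq and q_delta < 0
    return growing_above or draining_below
```

A queue that is above Qeq but already draining, or below Qeq but already filling, produces no message under this rule.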


Issue #2: Specific Detection Mechanism

[Figure: detection flowchart. The queue, with IN and OUT ends, spans EMPTY QUEUE through EQUILIBRIUM to FULL QUEUE with thresholds T-4 to T+4. Message to generate, by queue region: BCN-4 to BCN-1, BCN 0, BCN+1 to BCN+4, or No Message. Decision boxes: RL Tagged Frame?; Sample Frame with Probability P; Sampled Frame?; RL Tag && Solicit Bit Set?; BCN type (+, -, 0); dQ/dt < 0?; dQ/dt > 0?. Actions: Send BCN, NOP.]