The Internet’s architecture for managing congestion


Damon Wischik, UCL
www.wischik.com/damon

Some Internet History

• 1974: First draft of TCP/IP [“A protocol for packet network interconnection”, Vint Cerf and Robert Kahn]

• 1983: ARPANET switches on TCP/IP

• 1986: Congestion collapse

• 1988: Congestion control for TCP [“Congestion avoidance and control”, Van Jacobson]

[“A Brief History of the Internet”, the Internet Society]

End-to-end control

Internet congestion is controlled by the end-systems. The network operates as a dumb pipe. [“End-to-end arguments in system design” by Saltzer, Reed, Clark, 1981]

The TCP algorithm, running on the server, decides how fast to send data.

[Figure: User and Server (TCP) exchange a request, data, and acknowledgements]

TCP

// NewReno-style ACK processing, as shown on the slide
if (seqno > _last_acked) {
    if (!_in_fast_recovery) {
        // New ACK in normal operation: slide the window and grow cwnd
        _last_acked = seqno;
        _dupacks = 0;
        inflate_window();
        send_packets(now);
        _last_sent_time = now;
        return;
    }
    if (seqno < _recover) {
        // Partial ACK during fast recovery: deflate cwnd and retransmit
        uint32_t new_data = seqno - _last_acked;
        _last_acked = seqno;
        if (new_data < _cwnd) _cwnd -= new_data; else _cwnd = 0;
        _cwnd += _mss;
        retransmit_packet(now);
        send_packets(now);
        return;
    }
    // Full ACK: leave fast recovery
    uint32_t flightsize = _highest_sent - seqno;
    _cwnd = min(_ssthresh, flightsize + _mss);
    _last_acked = seqno;
    _dupacks = 0;
    _in_fast_recovery = false;
    send_packets(now);
    return;
}
if (_in_fast_recovery) {
    // Duplicate ACK during fast recovery: inflate cwnd by one segment
    _cwnd += _mss;
    send_packets(now);
    return;
}
_dupacks++;
if (_dupacks != 3) {
    send_packets(now);
    return;
}
// Third duplicate ACK: halve ssthresh and enter fast recovery
_ssthresh = max(_cwnd / 2, (uint32_t)(2 * _mss));
retransmit_packet(now);
_cwnd = _ssthresh + 3 * _mss;
_in_fast_recovery = true;
_recover = _highest_sent;

[Figure: traffic rate, 0-100 kB/sec, against time, 0-8 sec]

How TCP shares capacity

[Figure: individual flow bandwidths and their sum, against time, beneath the available bandwidth]

Motivation: buffer size

• Internet routers have buffers, to accommodate bursts in traffic.

• How big do the buffers need to be?
– 3 GByte? Rule of thumb: what Cisco does today
– 300 MByte? [Appenzeller, Keslassy, McKeown, 2004]
– 30 kByte?

• Large buffers are unsustainable:
– Data volumes double every 10 months
– CPU speeds double every 18 months
– Memory access speeds double every 10 years

Motivation: TCP’s teleology [Kelly, Maulloo, Tan, 1998]

• Consider several TCP flows sharing a single link

• Let xr be the mean bandwidth of flow r [pkts/sec]
Let y be the total bandwidth of all flows [pkts/sec]
Let C be the total available capacity [pkts/sec]

• TCP and the network act so as to solve:
maximise Σr U(xr) - P(y,C) over xr ≥ 0, where y = Σr xr

[Figure: utility U(x) against x; penalty P(y,C) against y, rising steeply as y approaches C]
Bad teleology

Little extra value is attached to high-bandwidth flows; there is a severe penalty for allocating too little bandwidth.

[Figure: utility U(x) against x]

Bad teleology

[Figure: utility U(x) against x. Flows with large RTT are satisfied with little bandwidth; flows with small RTT want more bandwidth.]

Bad teleology

[Figure: penalty P(y,C) against y. There is no penalty unless links are overloaded.]

TCP’s teleology

• The network acts as if it’s trying to solve an optimization problem

• Is this what we want the Internet to optimize?

• Does it even succeed in performing the optimization?

[Figure: utility U(x) and penalty P(y,C)]

Synchronization

Desynchronized TCP flows:
– aggregate traffic is smooth
– network solves the optimization

Synchronized TCP flows:
– aggregate traffic is bursty
– network oscillates about the optimum

[Figure: individual flow rates summing to the aggregate traffic rate, against time]


TCP traffic model

• When there are many TCP flows, the aggregate traffic rate xt varies smoothly, according to a differential equation [Misra, Gong, Towsley, 2000]

• The equation involves
– pt, the packet loss probability at time t
– RTT, the average round trip time

[Figure: aggregate traffic rate against time, desynchronized vs synchronized]


Queue model

• How does packet loss probability pt depend on buffer size?

• There are two families of answers, depending on queueing delay:
– Small buffers (queueing delay « RTT)
– Large buffers (queueing delay ≈ RTT)


Small buffers

As the optical fibre’s line rate increases
– queue size fluctuates more and more rapidly
– queue size distribution does not change (it depends only on link utilization, not on line rate)

[Figure: queue size, 0-15 pkt, against time, 0-5 sec, for queueing delays of 19 ms, 1.9 ms and 0.19 ms]


Large buffers (queueing delay 200 ms)

• When xt < C the queue size is small (C = line rate)

• No packet drops, so TCP increases xt

• When xt > C the queue fills up and packets begin to get dropped

• TCPs may ‘overshoot’, leading to synchronization

• Drop probability depends on both traffic rate xt and queue size qt

[Figure: queue size, 0-160 pkt, against time, 0-10 sec]

Analysis

• Write down differential equations
– for aggregate TCP traffic rate xt
– for queue dynamics and loss probability pt
taking account of buffer size

• Calculate
– average link utilization
– average queue occupancy/delay
– extent of synchronization, and consequent loss of utilization and jitter

[Gaurav Raina, PhD thesis, 2005]

Stability/instability analysis

• For some values of C*RTT, the dynamical system is stable

• For others it is unstable and there are oscillations (i.e. the flows are partially synchronized)

• When it is unstable, we can calculate the amplitude of the oscillations

[Figure: traffic rate xt/C against time, in the stable case and the oscillating case]

Instability plot

[Figure: log10 of packet loss probability p against traffic intensity x/C, showing the TCP throughput equation, the queue equation, and the extent of oscillations in x/C]

Instability plot

[Figure: log10 of packet loss probability p against traffic intensity x/C, for C*RTT = 4 pkts, 20 pkts, and 100 pkts]

Alternative buffer-sizing rules

[Figure: four instability plots of packet loss probability against traffic intensity, one per buffer-sizing rule]

• Intermediate buffers, buffer = bandwidth*delay / sqrt(#flows), or large buffers, buffer = bandwidth*delay

• Large buffers with AQM, buffer = bandwidth*delay*{¼, 1, 4}

• Small buffers, buffer = {10, 20, 50} pkts

• Small buffers, ScalableTCP, buffer = {50, 1000} pkts [Vinnicombe 2002] [T. Kelly 2002]

Conclusion

• The network acts to solve an optimization problem.
– We can choose which optimization problem, by choosing the right buffer size and by changing TCP’s code.

• It may or may not attain the solution.
– In order to make sure the network is stable, we need to choose the buffer size and TCP code carefully.

Prescription

• ScalableTCP in end-systems (need to persuade Microsoft, Linus)

• Much smaller buffers in routers (need to persuade BT/AT&T)

[Figure: utility U(x) and penalty P(y,C) under the proposed rules]

ScalableTCP gives more weight to high-bandwidth flows, and it has been shown to be stable.

With small buffers, the network likes to run with slightly lower utilization, hence lower delay.