Chapter III: Transport Layer

UG3 Computer Communications & Networks (COMN)
Mahesh Marina
Slides copyright of Kurose and Ross


Transport services and protocols

• provide logical communication between app processes running on different hosts
• transport protocols run in end systems
  – send side: breaks app messages into segments, passes to network layer
  – rcv side: reassembles segments into messages, passes to app layer
• more than one transport protocol available to apps
  – Internet: TCP and UDP

[Figure: logical end-end transport between the protocol stacks (application, transport, network, data link, physical) of two end systems]

Transport vs network layer

• network layer: logical communication between hosts
• transport layer: logical communication between processes
  – relies on, enhances, network layer services

household analogy: 12 kids in Ann's house sending letters to 12 kids in Bill's house
• hosts = houses
• processes = kids
• app messages = letters in envelopes
• transport protocol = Ann and Bill, who demux to in-house siblings
• network-layer protocol = postal service

Internet transport-layer protocols

• reliable, in-order delivery: TCP
  – congestion control
  – flow control
  – connection setup
• unreliable, unordered delivery: UDP
  – no-frills extension of "best-effort" IP
• services not available:
  – delay guarantees
  – bandwidth guarantees

[Figure: logical end-end transport between two end systems (full five-layer stacks) across intermediate routers that implement only the network, data link, and physical layers]

UDP: User Datagram Protocol [RFC 768]

• "bare bones" Internet transport protocol
• "best effort" service, UDP segments may be:
  – lost
  – delivered out-of-order to app
• connectionless:
  – no handshaking between UDP sender, receiver
  – each UDP segment handled independently of others

• UDP use:
  – streaming multimedia apps (loss tolerant, rate sensitive)
  – DNS
  – SNMP
• reliable transfer over UDP:
  – add reliability at application layer
  – application-specific error recovery

UDP: segment header

UDP segment format (32 bits wide):
  source port # | dest port #
  length        | checksum
  application data (payload)

length: in bytes of UDP segment, including header

why is there a UDP?
• no connection establishment (which can add delay)
• simple: no connection state at sender, receiver
• small header size
• no congestion control: UDP can blast away as fast as desired
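
The four header fields described above can be packed and unpacked in a few lines. This is a minimal sketch, not a real UDP stack: the function names are hypothetical, and the checksum is left as a caller-supplied placeholder (0 here) rather than computed.

```python
import struct

# UDP header: four 16-bit big-endian fields
# (source port, dest port, length, checksum).
UDP_HEADER = struct.Struct("!HHHH")

def build_udp_segment(src_port, dst_port, payload, checksum=0):
    """Prepend an 8-byte UDP header; length counts header plus payload."""
    length = UDP_HEADER.size + len(payload)
    return UDP_HEADER.pack(src_port, dst_port, length, checksum) + payload

def parse_udp_segment(segment):
    """Split a segment back into its header fields and payload."""
    src_port, dst_port, length, checksum = UDP_HEADER.unpack_from(segment)
    return src_port, dst_port, length, checksum, segment[UDP_HEADER.size:length]

# Illustrative port numbers and payload (assumptions, not from the slides).
segment = build_udp_segment(5353, 53, b"hello")
print(len(segment), parse_udp_segment(segment))
```

The header is always 8 bytes, which is the "small header size" the slide refers to.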

UDP checksum

Goal: detect "errors" (e.g., flipped bits) in transmitted segment

sender:
• treat segment contents, including header fields, as sequence of 16-bit integers
• checksum: 1's complement of the 1's complement sum of segment contents
• sender puts checksum value into UDP checksum field

receiver:
• compute checksum of received segment
• check if computed checksum equals checksum field value:
  – NO: error detected
  – YES: no error detected. But maybe errors nonetheless? More later …

Internet checksum example

example: add two 16-bit integers

              1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
              1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
            ---------------------------------
wraparound  1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1

sum           1 0 1 1 1 0 1 1 1 0 1 1 1 1 0 0
checksum      0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 1

Note: when adding numbers, a carryout from the most significant bit needs to be added to the result
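
The worked example can be reproduced with a short sketch. The function names are illustrative assumptions. The receiver check relies on the fact that summing all words plus the checksum yields all ones when no error is detected.

```python
def ones_complement_sum16(words):
    """Add 16-bit words, wrapping any carry out of bit 15 back into the sum."""
    total = 0
    for w in words:
        total += w
        total = (total & 0xFFFF) + (total >> 16)  # wraparound
    return total

def internet_checksum(words):
    """Checksum = bitwise complement of the 1's complement sum."""
    return ~ones_complement_sum16(words) & 0xFFFF

def verify(words, checksum):
    """Receiver side: words plus checksum must sum to all ones."""
    return ones_complement_sum16(list(words) + [checksum]) == 0xFFFF

a = 0b1110011001100110
b = 0b1101010101010101
s = ones_complement_sum16([a, b])
c = internet_checksum([a, b])
print(f"sum      {s:016b}")
print(f"checksum {c:016b}")

# A single flipped bit is detected ...
print(verify([a ^ 1, b], c))
# ... but two compensating flips in the same bit position are not.
print(verify([a ^ 1, b ^ 1], c))
```

The last line illustrates "no error detected, but maybe errors nonetheless": the checksum is a weak error-detecting code.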

Principles of reliable data transfer

• important in application, transport, link layers
  – top-10 list of important networking topics!
• characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)


Reliable data transfer: getting started

send side / receive side interfaces:
• rdt_send(): called from above (e.g., by app). Passed data to deliver to receiver upper layer
• udt_send(): called by rdt, to transfer packet over unreliable channel to receiver
• rdt_rcv(): called when packet arrives on rcv-side of channel
• deliver_data(): called by rdt to deliver data to upper layer

We'll:
• incrementally develop sender, receiver sides of reliable data transfer protocol (rdt)
• consider only unidirectional data transfer
  – but control info will flow in both directions!
• use finite state machines (FSMs) to specify sender, receiver

FSM notation:
• state: when in this "state", next state is uniquely determined by next event
• each transition (state 1 → state 2) is labelled with the event causing the state transition and the actions taken on the state transition

rdt1.0: reliable transfer over a reliable channel

• underlying channel perfectly reliable
  – no bit errors
  – no loss of packets
• separate FSMs for sender, receiver:
  – sender sends data into underlying channel
  – receiver reads data from underlying channel

sender
  state: Wait for call from above
    rdt_send(data)
      packet = make_pkt(data)
      udt_send(packet)

receiver
  state: Wait for call from below
    rdt_rcv(packet)
      extract(packet, data)
      deliver_data(data)

rdt2.0: channel with bit errors

• underlying channel may flip bits in packet
  – checksum to detect bit errors
• the question: how to recover from errors?
  – acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
  – negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
  – sender retransmits pkt on receipt of NAK
• new mechanisms in rdt2.0 (beyond rdt1.0):
  – error detection
  – receiver feedback: control msgs (ACK, NAK) rcvr->sender

How do humans recover from "errors" during conversation?


rdt2.0: FSM specification

sender
  state: Wait for call from above
    rdt_send(data)
      sndpkt = make_pkt(data, checksum)
      udt_send(sndpkt)
      → Wait for ACK or NAK
  state: Wait for ACK or NAK
    rdt_rcv(rcvpkt) && isNAK(rcvpkt)
      udt_send(sndpkt)
    rdt_rcv(rcvpkt) && isACK(rcvpkt)
      Λ
      → Wait for call from above

receiver
  state: Wait for call from below
    rdt_rcv(rcvpkt) && corrupt(rcvpkt)
      udt_send(NAK)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
      extract(rcvpkt, data)
      deliver_data(data)
      udt_send(ACK)

rdt2.0: operation with no errors

Path taken through the rdt2.0 FSM:
1. sender, Wait for call from above: rdt_send(data) → sndpkt = make_pkt(data, checksum); udt_send(sndpkt); now in Wait for ACK or NAK
2. receiver, Wait for call from below: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) → extract(rcvpkt, data); deliver_data(data); udt_send(ACK)
3. sender, Wait for ACK or NAK: rdt_rcv(rcvpkt) && isACK(rcvpkt) → Λ; back to Wait for call from above

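rdt2.0: error scenario

Path taken through the rdt2.0 FSM:
1. sender, Wait for call from above: rdt_send(data) → sndpkt = make_pkt(data, checksum); udt_send(sndpkt); now in Wait for ACK or NAK
2. receiver, Wait for call from below: rdt_rcv(rcvpkt) && corrupt(rcvpkt) → udt_send(NAK)
3. sender, Wait for ACK or NAK: rdt_rcv(rcvpkt) && isNAK(rcvpkt) → udt_send(sndpkt); stays in Wait for ACK or NAK
4. receiver, Wait for call from below: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) → extract(rcvpkt, data); deliver_data(data); udt_send(ACK)
5. sender, Wait for ACK or NAK: rdt_rcv(rcvpkt) && isACK(rcvpkt) → Λ; back to Wait for call from above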

rdt2.0 has a fatal flaw!

what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver!
• can't just retransmit: possible duplicate

handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt

stop and wait: sender sends one packet, then waits for receiver response

rdt2.1: sender, handles garbled ACK/NAKs

  state: Wait for call 0 from above
    rdt_send(data)
      sndpkt = make_pkt(0, data, checksum)
      udt_send(sndpkt)
      → Wait for ACK or NAK 0
  state: Wait for ACK or NAK 0
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
      udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
      Λ
      → Wait for call 1 from above
  state: Wait for call 1 from above
    rdt_send(data)
      sndpkt = make_pkt(1, data, checksum)
      udt_send(sndpkt)
      → Wait for ACK or NAK 1
  state: Wait for ACK or NAK 1
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
      udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
      Λ
      → Wait for call 0 from above

rdt2.1: receiver, handles garbled ACK/NAKs

  state: Wait for 0 from below
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
      extract(rcvpkt, data)
      deliver_data(data)
      sndpkt = make_pkt(ACK, chksum)
      udt_send(sndpkt)
      → Wait for 1 from below
    rdt_rcv(rcvpkt) && corrupt(rcvpkt)
      sndpkt = make_pkt(NAK, chksum)
      udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
      sndpkt = make_pkt(ACK, chksum)
      udt_send(sndpkt)
  state: Wait for 1 from below
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
      extract(rcvpkt, data)
      deliver_data(data)
      sndpkt = make_pkt(ACK, chksum)
      udt_send(sndpkt)
      → Wait for 0 from below
    rdt_rcv(rcvpkt) && corrupt(rcvpkt)
      sndpkt = make_pkt(NAK, chksum)
      udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
      sndpkt = make_pkt(ACK, chksum)
      udt_send(sndpkt)

rdt2.1: Example 1

Path taken through the rdt2.1 sender and receiver FSMs (data packet arrives corrupted):
1. sender, Wait for call 0 from above: rdt_send(data) → sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); now in Wait for ACK or NAK 0
2. receiver, Wait for 0 from below: rdt_rcv(rcvpkt) && corrupt(rcvpkt) → sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender, Wait for ACK or NAK 0: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) → udt_send(sndpkt)
4. receiver, Wait for 0 from below: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) → extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); now in Wait for 1 from below
5. sender, Wait for ACK or NAK 0: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) → Λ; now in Wait for call 1 from above
6. end state: sender in Wait for call 1 from above, receiver in Wait for 1 from below

rdt2.1: Example 2

Path taken through the rdt2.1 sender and receiver FSMs (the receiver's reply does not reach the sender intact):
1. receiver, Wait for 0 from below: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) → extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); now in Wait for 1 from below
2. sender, Wait for ACK or NAK 0: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) → udt_send(sndpkt)
3. receiver, Wait for 1 from below: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) → sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); the duplicate is not delivered
4. sender, Wait for ACK or NAK 0: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) → Λ; now in Wait for call 1 from above
5. end state: sender in Wait for call 1 from above, receiver in Wait for 1 from below

rdt2.1: discussion

sender:
• seq # added to pkt
• two seq #'s (0, 1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  – state must "remember" whether "expected" pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  – state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender

rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  – receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt

rdt2.2: sender, receiver fragments

sender FSM fragment
  state: Wait for call 0 from above
    rdt_send(data)
      sndpkt = make_pkt(0, data, checksum)
      udt_send(sndpkt)
      → Wait for ACK 0
  state: Wait for ACK 0
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1))
      udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0)
      Λ

receiver FSM fragment
  transition into state: Wait for 0 from below
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
      extract(rcvpkt, data)
      deliver_data(data)
      sndpkt = make_pkt(ACK1, chksum)
      udt_send(sndpkt)
  state: Wait for 0 from below
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt))
      udt_send(sndpkt)

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
  – checksum, seq #, ACKs, retransmissions will be of help … but not enough

approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  – retransmission will be duplicate, but seq #'s already handle this
  – receiver must specify seq # of pkt being ACKed
• requires countdown timer
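
The timeout-and-retransmit approach can be sketched as a small deterministic simulation. This is an illustration under simplifying assumptions, not the protocol's definitive form: the channel is a hypothetical scripted one that loses chosen transmissions, a loss is always followed by a timeout, and bit corruption and premature timeouts are not modelled.

```python
def run_stop_and_wait(messages, drop_data=(), drop_acks=()):
    """Simulate an alternating-bit (rdt3.0-style) transfer.

    drop_data / drop_acks hold the 0-based indices of the data and ACK
    transmissions that the scripted channel loses.
    """
    drop_data, drop_acks = set(drop_data), set(drop_acks)
    delivered, log = [], []
    expected_seq = 0      # receiver state: seq # it is waiting for
    data_tx = ack_tx = 0  # how many data packets / ACKs were sent so far

    for i, msg in enumerate(messages):
        seq = i % 2       # sequence numbers alternate 0, 1, 0, 1, ...
        while True:
            log.append(f"send pkt{seq}")
            lost = data_tx in drop_data
            data_tx += 1
            if lost:
                log.append("timeout")   # no ACK arrives: retransmit
                continue
            # Receiver side: deliver new data, discard duplicates.
            if seq == expected_seq:
                delivered.append(msg)
                expected_seq ^= 1
            else:
                log.append(f"duplicate pkt{seq} discarded")
            log.append(f"send ack{seq}")
            lost = ack_tx in drop_acks
            ack_tx += 1
            if lost:
                log.append("timeout")   # ACK lost: sender retransmits
                continue
            break                       # sender received the expected ACK
    return delivered, log

delivered, log = run_stop_and_wait(["a", "b", "c"], drop_acks={1})
print(delivered)
print(log)
```

With the second ACK lost, the retransmitted pkt1 reaches the receiver twice, and the sequence number is what lets the receiver discard the duplicate.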

rdt3.0 sender

  state: Wait for call 0 from above
    rdt_send(data)
      sndpkt = make_pkt(0, data, checksum)
      udt_send(sndpkt)
      start_timer
      → Wait for ACK0
    rdt_rcv(rcvpkt)
      Λ
  state: Wait for ACK0
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1))
      Λ
    timeout
      udt_send(sndpkt)
      start_timer
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0)
      stop_timer
      → Wait for call 1 from above
  state: Wait for call 1 from above
    rdt_send(data)
      sndpkt = make_pkt(1, data, checksum)
      udt_send(sndpkt)
      start_timer
      → Wait for ACK1
    rdt_rcv(rcvpkt)
      Λ
  state: Wait for ACK1
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0))
      Λ
    timeout
      udt_send(sndpkt)
      start_timer
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1)
      stop_timer
      → Wait for call 0 from above

rdt3.0 in action

(a) no loss
1. sender: send pkt0
2. receiver: rcv pkt0, send ack0
3. sender: rcv ack0, send pkt1
4. receiver: rcv pkt1, send ack1
5. sender: rcv ack1, send pkt0
6. receiver: rcv pkt0, send ack0

(b) packet loss
1. sender: send pkt0
2. receiver: rcv pkt0, send ack0
3. sender: rcv ack0, send pkt1; pkt1 is lost (X)
4. sender: timeout, resend pkt1
5. receiver: rcv pkt1, send ack1
6. sender: rcv ack1, send pkt0
7. receiver: rcv pkt0, send ack0

(c) ACK loss
1. sender: send pkt0
2. receiver: rcv pkt0, send ack0
3. sender: rcv ack0, send pkt1
4. receiver: rcv pkt1, send ack1; ack1 is lost (X)
5. sender: timeout, resend pkt1
6. receiver: rcv pkt1 (detect duplicate), send ack1
7. sender: rcv ack1, send pkt0
8. receiver: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
1. sender: send pkt0
2. receiver: rcv pkt0, send ack0
3. sender: rcv ack0, send pkt1
4. receiver: rcv pkt1, send ack1 (delayed)
5. sender: timeout, resend pkt1
6. sender: rcv ack1, send pkt0
7. receiver: rcv pkt1 (detect duplicate), send ack1
8. receiver: rcv pkt0, send ack0
9. sender: rcv ack1, do nothing
10. sender: rcv ack0, send pkt1

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

$$D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^9\ \text{bits/sec}} = 8\ \text{microsecs}$$

• U_sender: utilization, the fraction of time sender busy sending (times below in msec)

$$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$$

• if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
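
The numbers in this example can be checked with a few lines of arithmetic. The variable and function names are illustrative. RTT is taken as twice the 15 ms propagation delay.

```python
def stop_and_wait_utilization(packet_bits, rate_bps, rtt_s):
    """Fraction of time the sender is busy: (L/R) / (RTT + L/R)."""
    d_trans = packet_bits / rate_bps
    return d_trans / (rtt_s + d_trans)

L = 8000     # packet length in bits
R = 1e9      # link rate in bits/sec (1 Gbps)
RTT = 0.030  # seconds, twice the 15 ms propagation delay

d_trans = L / R
u = stop_and_wait_utilization(L, R, RTT)
throughput_bytes_per_s = (L / 8) / (RTT + d_trans)

print(f"D_trans    = {d_trans * 1e6:.0f} microsecs")
print(f"U_sender   = {u:.5f}")
print(f"throughput = {throughput_bytes_per_s / 1000:.1f} kB/sec")
```

One 1000-byte packet per roughly 30 ms gives about 33 kB/sec, however fast the link is.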

rdt3.0: stop-and-wait operation

[Figure: sender/receiver timeline]
• t = 0: first packet bit transmitted
• t = L / R: last packet bit transmitted
• first packet bit arrives; last packet bit arrives, send ACK
• t = RTT + L / R: ACK arrives, send next packet

$$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$$

Pipelined protocols

pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver

• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

[Figure: sender/receiver timeline]
• t = 0: first packet bit transmitted
• t = L / R: last bit transmitted
• first packet bit arrives; last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• t = RTT + L / R: ACK arrives, send next packet

3-packet pipelining increases utilization by a factor of 3!

$$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$$
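
The same formula generalises to a window of n packets. The sketch below makes two assumptions not on the slide: utilization is capped at 1, and exact fractions are used to avoid rounding. It also computes how many packets must be in flight to keep this link busy. The function names are illustrative.

```python
from fractions import Fraction
import math

def pipelined_utilization(n_packets, packet_bits, rate_bps, rtt):
    """n * (L/R) / (RTT + L/R), capped at 1 (the link cannot be busier than always)."""
    d_trans = Fraction(packet_bits, rate_bps)
    return min(Fraction(1), n_packets * d_trans / (rtt + d_trans))

def window_to_fill_pipe(packet_bits, rate_bps, rtt):
    """Smallest window for which the sender never has to sit idle."""
    d_trans = Fraction(packet_bits, rate_bps)
    return math.ceil((rtt + d_trans) / d_trans)

RTT = Fraction(30, 1000)  # 30 msec, expressed exactly
u3 = pipelined_utilization(3, 8000, 10**9, RTT)
print(float(u3))
print(window_to_fill_pipe(8000, 10**9, RTT))
```

With 8000-bit packets on a 1 Gbps link and a 30 msec RTT, a window of three packets still leaves the sender idle almost all of the time.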

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  – doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  – when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  – when timer expires, retransmit only that unacked packet

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n ("cumulative ACK")
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

state: Wait
  initially
    base = 1
    nextseqnum = 1

  rdt_send(data)
    if (nextseqnum < base+N)
      sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
      udt_send(sndpkt[nextseqnum])
      if (base == nextseqnum)
        start_timer
      nextseqnum++
    else
      refuse_data(data)

  timeout
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    …
    udt_send(sndpkt[nextseqnum-1])

  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
      stop_timer
    else
      start_timer

  rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    Λ
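
The sender FSM translates fairly directly into an event-driven class. This is a sketch under stated assumptions: `send` is a hypothetical callback standing in for udt_send, the timer is modelled as a boolean flag rather than a real clock, packets are plain `(seq, data)` tuples without a checksum, and the `max()` in `on_ack` is a defensive addition that keeps `base` from moving backwards.

```python
class GBNSender:
    def __init__(self, window_size, send):
        self.N = window_size
        self.send = send            # stands in for udt_send
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}            # seq -> packet, kept until ACKed
        self.timer_running = False

    def rdt_send(self, data):
        """Send data if the window has room; return False to refuse it."""
        if self.nextseqnum >= self.base + self.N:
            return False            # refuse_data(data)
        pkt = (self.nextseqnum, data)
        self.sndpkt[self.nextseqnum] = pkt
        self.send(pkt)
        if self.base == self.nextseqnum:
            self.timer_running = True   # start_timer
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        """Cumulative ACK: everything up to and including acknum is done."""
        new_base = max(self.base, acknum + 1)
        for seq in range(self.base, new_base):
            self.sndpkt.pop(seq, None)
        self.base = new_base
        # stop_timer if nothing is outstanding, otherwise (re)start it
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        """Go back N: resend every packet that is still unACKed."""
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.send(self.sndpkt[seq])

sent = []
sender = GBNSender(window_size=4, send=sent.append)
for d in "abcd":
    sender.rdt_send(d)
sender.on_ack(2)      # packets 1 and 2 acknowledged
sender.on_timeout()   # packets 3 and 4 are resent
print([seq for seq, _ in sent])
```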


GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don't buffer): no receiver buffering!
  – re-ACK pkt with highest in-order seq #

state: Wait
  initially
    expectedseqnum = 1
    sndpkt = make_pkt(0, ACK, chksum)

  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum)
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++

  default
    udt_send(sndpkt)
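
The receiver side is small enough to sketch in full. Assumptions: corruption is passed in as a flag instead of being detected with a checksum, and the method returns the ACK number that would be sent. The class and attribute names are illustrative.

```python
class GBNReceiver:
    def __init__(self):
        self.expectedseqnum = 1
        self.last_ack = 0     # mirrors the initial make_pkt(0, ACK, chksum)
        self.delivered = []

    def rdt_rcv(self, seq, data, corrupt=False):
        """Process one arriving packet; return the ACK number sent back."""
        if not corrupt and seq == self.expectedseqnum:
            self.delivered.append(data)      # deliver_data(data)
            self.last_ack = self.expectedseqnum
            self.expectedseqnum += 1
        # default case: out-of-order or corrupt packets are discarded
        # and the most recent in-order ACK is sent again
        return self.last_ack

rcv = GBNReceiver()
arrivals = [(1, "a"), (2, "b"), (4, "d"), (5, "e"), (3, "c"), (4, "d")]
acks = [rcv.rdt_rcv(seq, data) for seq, data in arrivals]
print(acks)
print(rcv.delivered)
```

Packets 4 and 5 arrive while 3 is missing, so they are dropped and ACK 2 is repeated. They are accepted only when retransmitted after packet 3.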


GBN in action

sender window (N=4) over seq #s 0 to 8; pkt2 is lost (X)

1. sender: send pkt0, send pkt1, send pkt2 (lost), send pkt3, then (wait)
2. receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, discard, (re)send ack1
3. sender: rcv ack0, send pkt4; rcv ack1, send pkt5; ignore duplicate ACK
4. receiver: receive pkt4, discard, (re)send ack1; receive pkt5, discard, (re)send ack1
5. sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
6. receiver: rcv pkt2, deliver, send ack2; rcv pkt3, deliver, send ack3; rcv pkt4, deliver, send ack4; rcv pkt5, deliver, send ack5


Selective repeat

• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #'s
  – limits seq #s of sent, unACKed packets

Selective repeat: sender, receiver windows

[Figure: sender and receiver windows over the sequence-number space]

Selective repeat

sender
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
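
The receiver rules above can be sketched as follows. To keep the example short it assumes unbounded sequence numbers (no wraparound) and returns the ACK number instead of sending a packet. The class name and fields are illustrative.

```python
class SRReceiver:
    def __init__(self, window_size):
        self.N = window_size
        self.rcvbase = 0
        self.buffer = {}       # out-of-order packets awaiting delivery
        self.delivered = []

    def on_packet(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcvbase <= n <= self.rcvbase + self.N - 1:
            self.buffer[n] = data
            # Deliver the in-order run starting at rcvbase, if any,
            # and advance the window past it.
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n <= self.rcvbase - 1:
            return n           # already delivered: re-ACK so the sender can advance
        return None            # otherwise: ignore

rcv = SRReceiver(window_size=4)
acks = [rcv.on_packet(n, f"pkt{n}") for n in (0, 1, 3, 4, 5, 2)]
print(acks)
print(rcv.delivered)
```

Packets 3, 4 and 5 are ACKed individually and buffered. The arrival of packet 2 releases all four to the upper layer in order.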

Selective repeat in action

sender window (N=4) over seq #s 0 to 8; pkt2 is lost (X)

1. sender: send pkt0, send pkt1, send pkt2 (lost), send pkt3, then (wait)
2. receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, buffer, send ack3
3. sender: rcv ack0, send pkt4; rcv ack1, send pkt5; record ack3 arrived
4. receiver: receive pkt4, buffer, send ack4; receive pkt5, buffer, send ack5
5. sender: pkt 2 timeout, send pkt2; record ack4 arrived; record ack5 arrived
6. receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

(a) no problem
• sender sends pkt0, pkt1, pkt2; receiver ACKs each and advances its window to 3, 0, 1
• ACKs arrive; sender advances its window and sends pkt3, which is lost (X), then a new pkt0
• receiver will accept packet with seq number 0

(b) oops!
• sender sends pkt0, pkt1, pkt2; receiver ACKs each and advances its window to 3, 0, 1
• all three ACKs are lost (X); sender times out and retransmits pkt0
• receiver will accept packet with seq number 0

receiver can't see sender side; receiver behavior identical in both cases! something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
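
One way to explore the question is to compare two windows: the one the sender may still be retransmitting from, and the one the receiver has advanced to after receiving a full window. The sketch below checks whether they share a sequence number. The function name is an assumption for illustration. The standard textbook answer is that the window size must be at most half the size of the sequence-number space.

```python
def windows_overlap(seq_space, window):
    """True if an old retransmission could be mistaken for new data.

    Worst case: the receiver got a full window (and advanced), but every
    ACK was lost, so the sender still holds the old window.
    """
    old = {i % seq_space for i in range(0, window)}           # sender's unACKed window
    new = {i % seq_space for i in range(window, 2 * window)}  # receiver's advanced window
    return bool(old & new)

print(windows_overlap(seq_space=4, window=3))  # the slide's example
print(windows_overlap(seq_space=4, window=2))
```

For the example on the slide (4 sequence numbers, window 3) the windows overlap on 0 and 1, which is why the retransmitted pkt0 is accepted as new. With a window of 2 they are disjoint.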

TCP: Overview [RFCs 793, 1122, 1323, 2018, 2581]

• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver

TCP segment structure

TCP segment format (32 bits wide):
  source port # | dest port #
  sequence number
  acknowledgement number
  head len | not used | U A P R S F | receive window
  checksum | Urg data pointer
  options (variable length)
  application data (variable length)

field notes:
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)

TCP seq. numbers, ACKs

sequence numbers:
  – byte stream "number" of first byte in segment's data
acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK
Q: how receiver handles out-of-order segments
  – A: TCP spec doesn't say; up to implementor

[Figure: outgoing segment from sender and incoming segment to sender (fields: source port #, dest port #, sequence number, acknowledgement number, A flag, rwnd, checksum, urg pointer), shown against the sender sequence number space: sent ACKed | sent, not-yet ACKed ("in-flight") | usable but not yet sent | not usable; window size N]

Byte stream in TCP

[Figure: an HTTP GET message (K bytes) in the byte stream; a window of N bytes; a segment of M bytes starting at the 100th byte, carried with TCP header (seq no = 100); the part of the message outside the window cannot be transmitted now]

TCP seq. numbers, ACKs

simple telnet scenario (Host A, Host B):
1. User types 'C'. A → B: Seq=42, ACK=79, data = 'C'
2. host ACKs receipt of 'C', echoes back 'C'. B → A: Seq=79, ACK=43, data = 'C'
3. host ACKs receipt of echoed 'C'. A → B: Seq=43, ACK=80

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT

$$\text{EstimatedRTT} = (1-\alpha)\cdot\text{EstimatedRTT} + \alpha\cdot\text{SampleRTT}$$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Figure: SampleRTT and EstimatedRTT for gaia.cs.umass.edu to fantasia.eurecom.fr; RTT (milliseconds, 100 to 350) vs. time (seconds)]


TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

$$\text{DevRTT} = (1-\beta)\cdot\text{DevRTT} + \beta\cdot|\text{SampleRTT} - \text{EstimatedRTT}|$$

(typically, β = 0.25)

$$\text{TimeoutInterval} = \text{EstimatedRTT} + 4\cdot\text{DevRTT}$$

(estimated RTT plus "safety margin")
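
The EstimatedRTT, DevRTT and TimeoutInterval updates can be combined in a small estimator. The slides leave two details open, filled in here as assumptions borrowed from RFC 6298: the first sample initialises EstimatedRTT to the sample and DevRTT to half of it, and DevRTT is updated using the previous EstimatedRTT. The class and attribute names are illustrative.

```python
class RttEstimator:
    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2   # assumed initial value (RFC 6298 style)

    def update(self, sample_rtt):
        """Fold one SampleRTT into the running estimates; return the timeout."""
        # Deviation is measured against the estimate from before this sample.
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        return self.timeout_interval

    @property
    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt

est = RttEstimator(first_sample=100.0)   # milliseconds
for sample in (200.0, 100.0, 100.0, 100.0):
    timeout = est.update(sample)
    print(f"sample={sample:6.1f}  estimated={est.estimated_rtt:7.2f}  "
          f"dev={est.dev_rtt:6.2f}  timeout={timeout:7.2f}")
```

A single outlier sample raises the timeout sharply through DevRTT, and the margin then decays as later samples settle.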

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks

let's initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control

TCP sender events

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments

TCP sender (simplified)

state: wait for event
  initially
    NextSeqNum = InitialSeqNum
    SendBase = InitialSeqNum

  data received from application above
    create segment, seq #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
      start timer

  timeout
    retransmit not-yet-acked segment with smallest seq #
    start timer

  ACK received, with ACK field value y
    if (y > SendBase)
      SendBase = y
      /* SendBase–1: last cumulatively ACKed byte */
      if (there are currently not-yet-acked segments)
        start timer
      else stop timer
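
The three events of the simplified sender map onto three methods. As with the FSM, this sketch ignores duplicate ACKs, flow control and congestion control. It also assumes a hypothetical `send` callback in place of IP, a boolean in place of a real timer, and sequence numbers that do not wrap around.

```python
class SimplifiedTcpSender:
    def __init__(self, initial_seq, send):
        self.next_seq_num = initial_seq
        self.send_base = initial_seq
        self.send = send        # stands in for "pass segment to IP"
        self.unacked = {}       # seq # of first byte -> segment data
        self.timer_running = False

    def on_app_data(self, data):
        seq = self.next_seq_num
        self.unacked[seq] = data
        self.send((seq, data))
        self.next_seq_num += len(data)
        if not self.timer_running:
            self.timer_running = True           # start timer

    def on_timeout(self):
        if self.unacked:
            seq = min(self.unacked)             # smallest not-yet-acked seq #
            self.send((seq, self.unacked[seq]))
            self.timer_running = True           # restart timer

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y                  # bytes up to y-1 are ACKed
            for seq in [s for s in self.unacked
                        if s + len(self.unacked[s]) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)

sent = []
tcp = SimplifiedTcpSender(initial_seq=92, send=sent.append)
tcp.on_app_data(b"8 bytes!")        # Seq=92, 8 bytes of data
tcp.on_app_data(b"x" * 20)          # Seq=100, 20 bytes of data
tcp.on_timeout()                    # only the oldest segment is resent
tcp.on_ack(120)                     # cumulative ACK covers both segments
print([(seq, len(data)) for seq, data in sent], tcp.send_base)
```

Unlike Go-Back-N, a timeout resends only the oldest unacknowledged segment.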

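TCP: retransmission scenarios

lost ACK scenario (Host A, Host B):
1. A → B: Seq=92, 8 bytes of data (SendBase=92)
2. B → A: ACK=100, lost (X)
3. A: timeout; A → B: Seq=92, 8 bytes of data
4. B → A: ACK=100 (SendBase=100)

premature timeout (Host A, Host B):
1. A → B: Seq=92, 8 bytes of data; A → B: Seq=100, 20 bytes of data
2. B → A: ACK=100; B → A: ACK=120
3. A: timeout before the ACKs arrive; A → B: Seq=92, 8 bytes of data
4. A: ACKs arrive (SendBase=100, then SendBase=120)
5. B → A: ACK=120 (SendBase=120)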

cumulative ACK (Host A, Host B):
1. A → B: Seq=92, 8 bytes of data; A → B: Seq=100, 20 bytes of data
2. B → A: ACK=100, lost (X)
3. B → A: ACK=120, arrives before the timeout
4. A → B: Seq=120, 15 bytes of data (no retransmission needed)

TCP ACK generation [RFC 5681]

event at receiver → TCP receiver action
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment, higher-than-expected seq #; gap detected → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap → immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  – likely that unacked segment lost, so don't wait for timeout

TCP fast retransmit

fast retransmit after sender receipt of triple duplicate ACK (Host A, Host B):
1. A → B: Seq=92, 8 bytes of data
2. A → B: Seq=100, 20 bytes of data, lost (X)
3. B → A: ACK=100
4. B → A: ACK=100, ACK=100, ACK=100 (3 DUP ACKs)
5. A → B: Seq=100, 20 bytes of data, resent before the timeout expires
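
The duplicate-ACK rule can be sketched as a counter kept beside SendBase. The class and method names are hypothetical, and the sketch only decides when to retransmit; it does not send anything.

```python
class DupAckDetector:
    def __init__(self, send_base):
        self.send_base = send_base
        self.dup_acks = 0

    def on_ack(self, y):
        """Return the seq # to fast-retransmit, or None."""
        if y > self.send_base:
            self.send_base = y      # new data ACKed: reset the counter
            self.dup_acks = 0
            return None
        if y == self.send_base:
            self.dup_acks += 1      # another ACK for already-ACKed data
            if self.dup_acks == 3:
                return self.send_base   # resend segment starting at this byte
        return None

detector = DupAckDetector(send_base=92)
print([detector.on_ack(100) for _ in range(4)])
```

The first ACK=100 acknowledges new data. The next three are duplicates, and the third duplicate triggers retransmission of the segment starting at byte 100, as in the scenario above.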

TCP flow control

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

• application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)

[Figure: receiver protocol stack: segments from sender pass through IP code and TCP code into the TCP socket receiver buffers, which the application process reads from; the application sits above the OS]

TCP flow control

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which is divided into buffered data and free buffer space (rwnd); buffered data is passed to the application process.]
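A minimal sketch of the bookkeeping behind rwnd may help. It assumes the usual textbook model, in which the free space is RcvBuffer minus the bytes received but not yet read; the names `last_byte_rcvd`, `last_byte_read` and `sender_may_send` are illustrative and do not appear on the slides.

```python
class ReceiveBuffer:
    """Toy model of receiver-side buffering (assumed model, byte counters only)."""

    def __init__(self, rcv_buffer=4096):
        self.rcv_buffer = rcv_buffer  # total buffer size in bytes
        self.last_byte_rcvd = 0       # bytes placed in the buffer so far
        self.last_byte_read = 0       # bytes the application has read so far

    def rwnd(self):
        # free buffer space = RcvBuffer - buffered data
        return self.rcv_buffer - (self.last_byte_rcvd - self.last_byte_read)

    def segment_arrives(self, nbytes):
        if nbytes > self.rwnd():
            raise ValueError("sender ignored rwnd: buffer would overflow")
        self.last_byte_rcvd += nbytes

    def app_reads(self, nbytes):
        n = min(nbytes, self.last_byte_rcvd - self.last_byte_read)
        self.last_byte_read += n
        return n


def sender_may_send(last_byte_sent, last_byte_acked, rwnd):
    """Bytes the sender may still put in flight without exceeding rwnd."""
    return max(0, rwnd - (last_byte_sent - last_byte_acked))


buf = ReceiveBuffer(rcv_buffer=4096)
buf.segment_arrives(3000)
window_after_arrival = buf.rwnd()   # 4096 - 3000
buf.app_reads(1000)
window_after_read = buf.rwnd()      # 4096 - 2000
allowance = sender_may_send(last_byte_sent=5000, last_byte_acked=4000,
                            rwnd=window_after_read)
print(window_after_arrival, window_after_read, allowance)
```

The window shrinks as data is buffered and grows again only when the application reads, which is how a slow application throttles the sender.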

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other is willing to establish connection)
• agree on connection parameters

client and server each hold:
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client

client:  Socket clientSocket = new Socket("hostname", "port number");
server:  Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

client state: CLOSED; server state: LISTEN

1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client state becomes SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server state becomes SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client state becomes ESTAB
4. server: received ACK(y) indicates client is live; server state becomes ESTAB

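TCP 3-way handshake: FSM

client: closed -> SYN sent -> ESTAB
• closed -> SYN sent: Socket clientSocket = new Socket("hostname", "port number"); send SYN(seq=x)
• SYN sent -> ESTAB: receive SYNACK(seq=y, ACKnum=x+1); send ACK(ACKnum=y+1)

server: closed -> listen -> SYN rcvd -> ESTAB
• closed -> listen: Socket connectionSocket = welcomeSocket.accept();
• listen -> SYN rcvd: receive SYN(x); send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
• SYN rcvd -> ESTAB: receive ACK(ACKnum=y+1); no action (Λ)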
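The exchange above can be traced with a small simulation. This is a sketch under simplifying assumptions: segments are plain dictionaries, nothing is lost, and the function name `three_way_handshake` is made up for illustration.

```python
import random

SEQ_SPACE = 2 ** 32  # TCP sequence numbers are 32 bits wide


def three_way_handshake(x=None, y=None):
    """Simulate SYN, SYNACK, ACK and return (client_state, server_state, trace)."""
    x = random.randrange(SEQ_SPACE) if x is None else x  # client's initial seq #
    y = random.randrange(SEQ_SPACE) if y is None else y  # server's initial seq #
    client, server = "CLOSED", "LISTEN"
    trace = []

    # 1. client chooses x and sends SYN
    syn = {"SYNbit": 1, "seq": x}
    client = "SYNSENT"
    trace.append(("client->server", syn))

    # 2. server chooses y and sends SYNACK, acking the SYN
    synack = {"SYNbit": 1, "seq": y, "ACKbit": 1,
              "acknum": (syn["seq"] + 1) % SEQ_SPACE}
    server = "SYN RCVD"
    trace.append(("server->client", synack))

    # 3. client checks the SYNACK acknowledges its SYN, then sends ACK
    if synack["acknum"] != (x + 1) % SEQ_SPACE:
        raise RuntimeError("SYNACK does not acknowledge our SYN")
    ack = {"ACKbit": 1, "acknum": (synack["seq"] + 1) % SEQ_SPACE}
    client = "ESTAB"
    trace.append(("client->server", ack))

    # 4. server checks the ACK acknowledges its SYNACK
    if ack["acknum"] == (y + 1) % SEQ_SPACE:
        server = "ESTAB"
    return client, server, trace


client_state, server_state, trace = three_way_handshake(x=100, y=300)
for direction, segment in trace:
    print(direction, segment)
```

Fixing x=100 and y=300 makes the ACK numbers easy to read off: the SYNACK carries ACKnum=101 and the final ACK carries ACKnum=301.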

TCP: closing a connection

• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

client state: ESTAB; server state: ESTAB

1. client: clientSocket.close(); send FINbit=1, seq=x; enter FIN_WAIT_1 (can no longer send but can receive data)
2. server: send ACKbit=1, ACKnum=x+1; enter CLOSE_WAIT (can still send data)
3. client: on receiving the ACK, enter FIN_WAIT_2 (wait for server close)
4. server: send FINbit=1, seq=y; enter LAST_ACK (can no longer send data)
5. client: send ACKbit=1, ACKnum=y+1; enter TIMED_WAIT (timed wait for 2 × max segment lifetime), then CLOSED
6. server: on receiving the ACK, enter CLOSED

The "Two Army Problem"

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!

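Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate $\lambda_{in}$ approaches capacity

[Figure: Host A and Host B each send original data at rate $\lambda_{in}$ into a router with unlimited shared output link buffers; throughput is $\lambda_{out}$. First plot: $\lambda_{out}$ versus $\lambda_{in}$, levelling off at R/2. Second plot: delay versus $\lambda_{in}$, growing sharply as $\lambda_{in}$ approaches R/2.]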

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: $\lambda_{in} = \lambda_{out}$
  - transport-layer input includes retransmissions: $\lambda'_{in} \geq \lambda_{in}$

[Figure: Host A sends towards Host B through a router with finite shared output link buffers. $\lambda_{in}$: original data; $\lambda'_{in}$: original data plus retransmitted data; $\lambda_{out}$: throughput.]

Causes/costs of congestion: scenario 2

idealization: perfect knowledge
• sender sends only when router buffers available

[Figure: sender A keeps a copy of each packet and transmits only when the router has free buffer space, never when there is no buffer space. Plot: $\lambda_{out}$ versus $\lambda_{in}$, rising up to R/2.]

Causes/costs of congestion: scenario 2

idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
• when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)

[Figure: sender A towards Host B through a router with finite buffers and some free buffer space. $\lambda_{in}$: original data; $\lambda'_{in}$: original data plus retransmitted data. Plot: $\lambda_{out}$ versus $\lambda'_{in}$, both axes up to R/2.]

Causes/costs of congestion: scenario 2

realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered

"costs" of congestion:
• more work (retransmissions) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput

[Figure: sender A keeps a copy of each packet and retransmits it on timeout although the router had free buffer space, so Host B receives both copies. Plot: $\lambda_{out}$ versus $\lambda'_{in}$, both axes up to R/2.]

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as $\lambda_{in}$ and $\lambda'_{in}$ increase?
A: as red $\lambda'_{in}$ increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

[Figure: Hosts A, B, C and D send over multihop paths through routers with finite shared output link buffers. $\lambda_{in}$: original data; $\lambda'_{in}$: original data plus retransmitted data; $\lambda_{out}$: throughput.]

another "cost" of congestion:
• when packet dropped, any upstream transmission capacity used for that packet was wasted!

[Plot: throughput by blue traffic ($\lambda_{out}$) versus offered load by Host A ($\lambda'_{in}$), both axes up to R/2; annotation: bandwidth wastage for packets dropped at the 2nd router.]

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss

[Figure: AIMD sawtooth behavior, probing for bandwidth. cwnd (TCP sender congestion window size) versus time: additively increase window size until loss occurs, then cut window in half.]
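The sawtooth can be reproduced with a toy model. The loss rule here is an assumption made purely for illustration: a loss is detected in any RTT in which cwnd exceeds a fixed capacity (both in units of MSS). Real losses depend on router queues and competing traffic.

```python
def aimd(rounds, capacity_mss, cwnd=1.0):
    """Return cwnd (in MSS) at the start of each RTT under a toy loss model."""
    history = []
    for _ in range(rounds):
        history.append(cwnd)
        if cwnd > capacity_mss:         # assumed: loss whenever cwnd exceeds capacity
            cwnd = max(1.0, cwnd / 2)   # multiplicative decrease: cut cwnd in half
        else:
            cwnd += 1.0                 # additive increase: 1 MSS per RTT
    return history


sawtooth = aimd(rounds=12, capacity_mss=8, cwnd=6.0)
print(sawtooth)
```

The window climbs by one MSS per round, is halved once it overshoots the assumed capacity, and then climbs again, which is the sawtooth shown on the slide.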

TCP Congestion Control: details

• sender limits transmission:

    LastByteSent - LastByteAcked ≤ cwnd

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

    rate ≈ cwnd / RTT bytes/sec

[Figure: sender sequence number space, showing last byte ACKed, bytes sent but not yet ACKed ("in-flight", spanning cwnd), and last byte sent.]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A to Host B over time. One segment is sent in the first RTT, two segments in the second, four segments in the third.]

TCP: detecting, reacting to loss

• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP Reno
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate ACKs)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

initially (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; enter slow start

slow start:
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: Λ; go to congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment

congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start

fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
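The three-state machine summarised above can be written down almost literally. The sketch below is a simplified model, not a TCP implementation: it only tracks cwnd, ssthresh and the duplicate-ACK counter in bytes, it does not send anything, and the class name `RenoSender` is hypothetical.

```python
class RenoSender:
    """Window bookkeeping for slow start, congestion avoidance and fast recovery."""

    def __init__(self, mss=1460, ssthresh=64 * 1024):
        self.mss = mss
        self.cwnd = mss              # initially cwnd = 1 MSS
        self.ssthresh = ssthresh     # initially 64 KB, as on the slide
        self.dup_acks = 0
        self.state = "slow start"

    def on_new_ack(self):
        if self.state == "slow start":
            self.cwnd += self.mss                        # doubles cwnd every RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion avoidance"
        elif self.state == "congestion avoidance":
            self.cwnd += self.mss * self.mss / self.cwnd  # about 1 MSS per RTT
        else:                                            # fast recovery
            self.cwnd = self.ssthresh
            self.state = "congestion avoidance"
        self.dup_acks = 0

    def on_dup_ack(self):
        """Return True when the missing segment should be retransmitted now."""
        if self.state == "fast recovery":
            self.cwnd += self.mss
            return False
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss
            self.state = "fast recovery"
            return True
        return False

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = self.mss
        self.dup_acks = 0
        self.state = "slow start"


s = RenoSender(mss=1000, ssthresh=4000)
for _ in range(4):
    s.on_new_ack()                 # slow start: 2000, 3000, 4000, 5000
state_after_slow_start = s.state
s.on_new_ack()                     # congestion avoidance: +1000*1000/5000
cwnd_in_ca = s.cwnd
retransmit = [s.on_dup_ack() for _ in range(3)]
cwnd_in_fr = s.cwnd
s.on_timeout()
print(state_after_slow_start, cwnd_in_ca, retransmit, cwnd_in_fr, s.state, s.cwnd)
```

Small round numbers (MSS of 1000 bytes, ssthresh of 4000 bytes) are used in the usage lines only to keep the arithmetic easy to follow.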

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT

    $\text{avg TCP throughput} = \frac{3}{4} \cdot \frac{W}{RTT}$ bytes/sec

[Figure: sawtooth of window size oscillating between W/2 and W.]

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

    $\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \sqrt{L}}$

• to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate!
• new versions of TCP for high-speed
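The numbers in this example can be checked directly from the two formulas above. The function names below are illustrative; the only inputs are the slide's values (1500-byte segments, 100 ms RTT, 10 Gbps target).

```python
import math


def window_segments(rate_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain rate_bps: rate * RTT / segment size."""
    return rate_bps * rtt_s / (mss_bytes * 8)


def mathis_throughput_bps(mss_bytes, rtt_s, loss):
    """Throughput = 1.22 * MSS / (RTT * sqrt(L)), converted to bits/sec."""
    return 1.22 * mss_bytes * 8 / (rtt_s * math.sqrt(loss))


def loss_for_rate(rate_bps, rtt_s, mss_bytes):
    """Invert the formula above for the loss probability L."""
    return (1.22 * mss_bytes * 8 / (rtt_s * rate_bps)) ** 2


W = window_segments(rate_bps=10e9, rtt_s=0.100, mss_bytes=1500)
L = loss_for_rate(rate_bps=10e9, rtt_s=0.100, mss_bytes=1500)
print(f"W = {W:.0f} segments, L = {L:.2e}")
```

The required loss probability comes out at roughly 2 · 10^-10, matching the figure quoted on the slide.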

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput versus Connection 1 throughput, both axes up to R, with the equal bandwidth share line (points (X2, Y2) where X2 = Y2) and the full bandwidth utilization line (points (X1, Y1) where X1 + Y1 = R). The trajectory alternates between congestion avoidance (additive increase) and loss (decrease window by factor of 2).]
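The convergence argument can be checked numerically. The sketch below assumes an idealised model that is not stated on the slide: both sessions see loss in the same round whenever their combined rate exceeds the link capacity, and both add one unit per round otherwise.

```python
def two_flow_aimd(x1, x2, capacity, rounds):
    """Evolve two AIMD flows sharing one bottleneck; returns final (x1, x2)."""
    for _ in range(rounds):
        if x1 + x2 > capacity:          # assumed: synchronised loss for both flows
            x1, x2 = x1 / 2, x2 / 2     # multiplicative decrease halves the gap
        else:
            x1, x2 = x1 + 1, x2 + 1     # additive increase keeps the gap unchanged
    return x1, x2


x1, x2 = two_flow_aimd(1.0, 40.0, capacity=50.0, rounds=500)
print(x1, x2)
```

Additive increase moves the operating point parallel to the equal-share line, while every halving cuts the difference between the two rates in half, so a very unequal start ends up at nearly equal rates.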

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets about R/2 (11 of the 20 connections)

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and is marked ECN=11 by a congested router; the destination returns a TCP ACK segment with ECE=1.]


UDP: User Datagram Protocol [RFC 768]

• "bare bones" Internet transport protocol
• "best effort" service, UDP segments may be:
  - lost
  - delivered out-of-order to app
• connectionless:
  - no handshaking between UDP sender, receiver
  - each UDP segment handled independently of others
• UDP use:
  - streaming multimedia apps (loss tolerant, rate sensitive)
  - DNS
  - SNMP
• reliable transfer over UDP:
  - add reliability at application layer
  - application-specific error recovery!

UDP: segment header

UDP segment format (32 bits wide):
  source port # | dest port #
  length        | checksum
  application data (payload)

length: in bytes of UDP segment, including header

why is there a UDP?
• no connection establishment (which can add delay)
• simple: no connection state at sender, receiver
• small header size
• no congestion control: UDP can blast away as fast as desired
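The segment format above maps onto four 16-bit fields in network byte order. The sketch below packs and unpacks that 8-byte header with Python's standard `struct` module; the helper names are illustrative, and the checksum is left at 0 rather than computed.

```python
import struct

UDP_HEADER = struct.Struct("!HHHH")  # source port, dest port, length, checksum


def build_udp_segment(src_port, dst_port, payload, checksum=0):
    """Prepend the 8-byte UDP header; length counts header plus payload."""
    length = UDP_HEADER.size + len(payload)
    return UDP_HEADER.pack(src_port, dst_port, length, checksum) + payload


def parse_udp_segment(segment):
    """Split a segment back into its header fields and payload."""
    src, dst, length, checksum = UDP_HEADER.unpack(segment[:UDP_HEADER.size])
    return {
        "src_port": src,
        "dst_port": dst,
        "length": length,
        "checksum": checksum,
        "payload": segment[UDP_HEADER.size:length],
    }


segment = build_udp_segment(5000, 53, b"hello")
fields = parse_udp_segment(segment)
print(len(segment), fields)
```

The whole header is only 8 bytes, which is the "small header size" point made above.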

UDP checksum

Goal: detect "errors" (e.g., flipped bits) in transmitted segment

sender:
• treat segment contents, including header fields, as sequence of 16-bit integers
• checksum: addition (1's complement sum) of segment contents
• sender puts checksum value into UDP checksum field

receiver:
• compute checksum of received segment
• check if computed checksum equals checksum field value:
  - NO: error detected
  - YES: no error detected. But maybe errors nonetheless? More later ...

Internet checksum example

example: add two 16-bit integers

                1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
                1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1

wraparound    1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1

sum             1 0 1 1 1 0 1 1 1 0 1 1 1 1 0 0
checksum        0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 1

Note: when adding numbers, a carryout from the most significant bit needs to be added to the result
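The wraparound rule in the note can be expressed in a couple of lines. The sketch below works on a list of 16-bit integers only (how a real segment is split into words is left out) and uses the two integers 1110011001100110 and 1101010101010101, i.e. 0xE666 and 0xD555.

```python
def ones_complement_sum16(words):
    """Add 16-bit words, folding any carry out of bit 15 back into the sum."""
    total = 0
    for word in words:
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # wraparound of the carry
    return total


def internet_checksum(words):
    """Checksum = bitwise complement of the 1's complement sum."""
    return ~ones_complement_sum16(words) & 0xFFFF


a = 0b1110011001100110
b = 0b1101010101010101
total = ones_complement_sum16([a, b])
checksum = internet_checksum([a, b])
print(f"sum      = {total:016b}")
print(f"checksum = {checksum:016b}")

# Receiver check: summing the data words plus the checksum gives all ones.
receiver_sum = ones_complement_sum16([a, b, checksum])
```

If any bit flips in transit, the receiver's sum is usually no longer all ones, although some combinations of errors cancel out and go undetected.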

Principles of reliable data transfer

• important in application, transport, link layers
  - top-10 list of important networking topics!
• characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)


Reliable data transfer: getting started

send side:
• rdt_send(): called from above (e.g., by app). Passed data to deliver to receiver upper layer
• udt_send(): called by rdt, to transfer packet over unreliable channel to receiver

receive side:
• rdt_rcv(): called when packet arrives on rcv-side of channel
• deliver_data(): called by rdt to deliver data to upper layer

We'll:
• incrementally develop sender, receiver sides of reliable data transfer protocol (rdt)
• consider only unidirectional data transfer
  - but control info will flow in both directions!
• use finite state machines (FSMs) to specify sender, receiver

FSM notation: an arrow from state 1 to state 2 is labelled with the event causing the state transition and, below it, the actions taken on the state transition.
state: when in this "state", next state is uniquely determined by next event

rdt1.0: reliable transfer over a reliable channel

• underlying channel perfectly reliable
  - no bit errors
  - no loss of packets
• separate FSMs for sender, receiver:
  - sender sends data into underlying channel
  - receiver reads data from underlying channel

sender (single state: Wait for call from above)
  rdt_send(data)
    packet = make_pkt(data)
    udt_send(packet)

receiver (single state: Wait for call from below)
  rdt_rcv(packet)
    extract(packet, data)
    deliver_data(data)

rdt2.0: channel with bit errors

• underlying channel may flip bits in packet
  - checksum to detect bit errors
• the question: how to recover from errors? (How do humans recover from "errors" during conversation?)
  - acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
  - negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
  - sender retransmits pkt on receipt of NAK
• new mechanisms in rdt2.0 (beyond rdt1.0):
  - error detection
  - receiver feedback: control msgs (ACK, NAK) from receiver to sender


rdt2.0: FSM specification

sender
  state: Wait for call from above
    rdt_send(data)
      sndpkt = make_pkt(data, checksum)
      udt_send(sndpkt)
      -> Wait for ACK or NAK
  state: Wait for ACK or NAK
    rdt_rcv(rcvpkt) && isNAK(rcvpkt)
      udt_send(sndpkt)
    rdt_rcv(rcvpkt) && isACK(rcvpkt)
      Λ
      -> Wait for call from above

receiver
  state: Wait for call from below
    rdt_rcv(rcvpkt) && corrupt(rcvpkt)
      udt_send(NAK)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
      extract(rcvpkt, data)
      deliver_data(data)
      udt_send(ACK)

rdt2.0: operation with no errors

[Figure: the rdt2.0 sender and receiver FSMs, highlighting the path taken when no error occurs: the sender sends sndpkt and waits for ACK or NAK; the receiver finds the packet not corrupt, delivers the data and sends ACK; on the ACK the sender returns to Wait for call from above.]

rdt2.0: error scenario

[Figure: the rdt2.0 sender and receiver FSMs, highlighting the path taken when the packet is corrupted: the receiver sends NAK; on the NAK the sender resends sndpkt and keeps waiting for ACK or NAK.]

rdt2.0 has a fatal flaw!

what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver!
• can't just retransmit: possible duplicate

handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt

stop and wait: sender sends one packet, then waits for receiver response

rdt2.1: sender, handles garbled ACK/NAKs

  state: Wait for call 0 from above
    rdt_send(data)
      sndpkt = make_pkt(0, data, checksum)
      udt_send(sndpkt)
      -> Wait for ACK or NAK 0
  state: Wait for ACK or NAK 0
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
      udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
      Λ
      -> Wait for call 1 from above
  state: Wait for call 1 from above
    rdt_send(data)
      sndpkt = make_pkt(1, data, checksum)
      udt_send(sndpkt)
      -> Wait for ACK or NAK 1
  state: Wait for ACK or NAK 1
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
      udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
      Λ
      -> Wait for call 0 from above

rdt2.1: receiver, handles garbled ACK/NAKs

  state: Wait for 0 from below
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
      extract(rcvpkt, data)
      deliver_data(data)
      sndpkt = make_pkt(ACK, chksum)
      udt_send(sndpkt)
      -> Wait for 1 from below
    rdt_rcv(rcvpkt) && corrupt(rcvpkt)
      sndpkt = make_pkt(NAK, chksum)
      udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
      sndpkt = make_pkt(ACK, chksum)
      udt_send(sndpkt)
  state: Wait for 1 from below
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
      extract(rcvpkt, data)
      deliver_data(data)
      sndpkt = make_pkt(ACK, chksum)
      udt_send(sndpkt)
      -> Wait for 0 from below
    rdt_rcv(rcvpkt) && corrupt(rcvpkt)
      sndpkt = make_pkt(NAK, chksum)
      udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
      sndpkt = make_pkt(ACK, chksum)
      udt_send(sndpkt)

rdt2.1: Example 1

[Figure sequence: the rdt2.1 sender and receiver FSMs, stepping through one exchange in which the data packet is corrupted.]
1. sender (Wait for call 0 from above): rdt_send(data); sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
2. receiver (Wait for 0 from below): rdt_rcv(rcvpkt) && corrupt(rcvpkt); sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender (Wait for ACK or NAK 0): rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)); udt_send(sndpkt)
4. receiver (Wait for 0 from below): rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
5. sender (Wait for ACK or NAK 0): rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt); Λ

rdt2.1: Example 2

[Figure sequence: the rdt2.1 sender and receiver FSMs, stepping through one exchange in which the ACK is corrupted.]
1. receiver (Wait for 0 from below): rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
2. sender (Wait for ACK or NAK 0): rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)); udt_send(sndpkt)
3. receiver (Wait for 1 from below): rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); duplicate, so only sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
4. sender (Wait for ACK or NAK 0): rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt); Λ

rdt2.1: discussion

sender:
• seq # added to pkt
• two seq #'s (0, 1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  - state must "remember" whether "expected" pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  - state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender

rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  - receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt

rdt2.2: sender, receiver fragments

sender FSM fragment
  state: Wait for call 0 from above
    rdt_send(data)
      sndpkt = make_pkt(0, data, checksum)
      udt_send(sndpkt)
      -> Wait for ACK 0
  state: Wait for ACK 0
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1))
      udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0)
      Λ

receiver FSM fragment
  transition into Wait for 0 from below
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
      extract(rcvpkt, data)
      deliver_data(data)
      sndpkt = make_pkt(ACK1, chksum)
      udt_send(sndpkt)
  state: Wait for 0 from below
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt))
      udt_send(sndpkt)

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
  - checksum, seq #, ACKs, retransmissions will be of help ... but not enough

approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  - retransmission will be duplicate, but seq #'s already handle this
  - receiver must specify seq # of pkt being ACKed
• requires countdown timer

rdt3.0 sender

  state: Wait for call 0 from above
    rdt_send(data)
      sndpkt = make_pkt(0, data, checksum)
      udt_send(sndpkt)
      start_timer
      -> Wait for ACK0
    rdt_rcv(rcvpkt)
      Λ
  state: Wait for ACK0
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1))
      Λ
    timeout
      udt_send(sndpkt)
      start_timer
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0)
      stop_timer
      -> Wait for call 1 from above
  state: Wait for call 1 from above
    rdt_send(data)
      sndpkt = make_pkt(1, data, checksum)
      udt_send(sndpkt)
      start_timer
      -> Wait for ACK1
    rdt_rcv(rcvpkt)
      Λ
  state: Wait for ACK1
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0))
      Λ
    timeout
      udt_send(sndpkt)
      start_timer
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1)
      stop_timer
      -> Wait for call 0 from above
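The sender FSM above can be exercised with a small alternating-bit simulation. This is a simplified sketch: it models loss only (no bit errors), treats "no ACK arrived" as a timeout, and lets the caller choose which transmissions are lost through the hypothetical parameters `drop_data` and `drop_ack` (1-based transmission counters).

```python
def simulate_rdt30(messages, drop_data=(), drop_ack=()):
    """Stop-and-wait transfer over a lossy channel; returns (delivered, log)."""
    delivered, log = [], []
    expected = 0             # receiver: seq # it is waiting for
    seq = 0                  # sender: seq # of the packet being sent
    data_tx = ack_tx = 0     # how many data packets / ACKs have been transmitted
    for msg in messages:
        while True:
            data_tx += 1
            log.append(f"send pkt{seq}")
            if data_tx in drop_data:        # data packet lost in the channel
                log.append("timeout")
                continue                    # retransmit the same packet
            if seq == expected:             # new packet: deliver it
                delivered.append(msg)
                expected ^= 1
            # duplicate packets are not delivered again, but are still ACKed
            ack_tx += 1
            log.append(f"send ack{seq}")
            if ack_tx in drop_ack:          # ACK lost in the channel
                log.append("timeout")
                continue                    # retransmit; receiver will see a duplicate
            break                           # ACK received: stop timer
        seq ^= 1                            # alternate between 0 and 1
    return delivered, log


# Second data transmission and second ACK are lost.
delivered, log = simulate_rdt30(["a", "b", "c"], drop_data={2}, drop_ack={2})
print(delivered)
print(log)
```

Each message is delivered exactly once even though pkt1 is sent three times: once lost, once delivered with its ACK lost, and once recognised as a duplicate.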

rdt3.0 in action

(a) no loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0

(b) packet loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 (pkt1 lost, X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0

(c) ACK loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 lost, X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 delayed)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  sender: rcv ack1 (second copy), do nothing
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

    $D_{trans} = \frac{L}{R} = \frac{8000 \text{ bits}}{10^9 \text{ bits/sec}} = 8 \text{ microsecs}$

• $U_{sender}$: utilization, fraction of time sender busy sending

    $U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$

• if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!

rdt3.0: stop-and-wait operation

[Figure: sender/receiver timeline.]
• t = 0: first packet bit transmitted
• t = L/R: last packet bit transmitted
• first packet bit arrives
• last packet bit arrives, send ACK
• t = RTT + L/R: ACK arrives, send next packet

    $U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$

Pipelined protocols

pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver

• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

[Figure: sender/receiver timeline with three packets in flight.]
• t = 0: first packet bit transmitted
• t = L/R: last bit transmitted
• first packet bit arrives
• last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• t = RTT + L/R: ACK arrives, send next packet

3-packet pipelining increases utilization by a factor of 3!

    $U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$
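The utilization figures for stop-and-wait and for 3-packet pipelining follow from the same expression with a different number of packets in flight. The helper below is an illustrative sketch; the `window` parameter is an assumption that generalises the slide's factor of 3.

```python
def sender_utilization(packet_bits, rate_bps, rtt_s, window=1):
    """Fraction of time the sender is busy: window * (L/R) / (RTT + L/R)."""
    d_trans = packet_bits / rate_bps                # transmission delay L/R
    return min(1.0, window * d_trans / (rtt_s + d_trans))


# 1 Gbps link, 15 ms propagation delay each way (RTT = 30 ms), 8000-bit packets
u_stop_and_wait = sender_utilization(8000, 1e9, 0.030)
u_pipelined_3 = sender_utilization(8000, 1e9, 0.030, window=3)
print(f"stop-and-wait: {u_stop_and_wait:.5f}")
print(f"3 packets in flight: {u_pipelined_3:.5f}")
```

Utilization grows linearly with the number of packets in flight until the link is kept busy for the whole round trip, at which point it is capped at 1.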

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n ("cumulative ACK")
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

  initially (Λ): base = 1; nextseqnum = 1
  state: Wait
    rdt_send(data)
      if (nextseqnum < base+N) {
        sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
        udt_send(sndpkt[nextseqnum])
        if (base == nextseqnum)
          start_timer
        nextseqnum++
      }
      else
        refuse_data(data)
    timeout
      start_timer
      udt_send(sndpkt[base])
      udt_send(sndpkt[base+1])
      ...
      udt_send(sndpkt[nextseqnum-1])
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
      base = getacknum(rcvpkt)+1
      if (base == nextseqnum)
        stop_timer
      else
        start_timer
    rdt_rcv(rcvpkt) && corrupt(rcvpkt)
      Λ


GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #

  initially (Λ): expectedseqnum = 1; sndpkt = make_pkt(0, ACK, chksum)
  state: Wait
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum)
      extract(rcvpkt, data)
      deliver_data(data)
      sndpkt = make_pkt(expectedseqnum, ACK, chksum)
      udt_send(sndpkt)
      expectedseqnum++
    default
      udt_send(sndpkt)


GBN in action

sender window (N=4); pkt2 is lost (X)

sender:   send pkt0, send pkt1, send pkt2, send pkt3, (wait)
receiver: receive pkt0, send ack0
receiver: receive pkt1, send ack1
receiver: receive pkt3, discard, (re)send ack1
sender:   rcv ack0, send pkt4
sender:   rcv ack1, send pkt5
receiver: receive pkt4, discard, (re)send ack1
receiver: receive pkt5, discard, (re)send ack1
sender:   ignore duplicate ACK
sender:   pkt 2 timeout: send pkt2, send pkt3, send pkt4, send pkt5
receiver: rcv pkt2, deliver, send ack2
receiver: rcv pkt3, deliver, send ack3
receiver: rcv pkt4, deliver, send ack4
receiver: rcv pkt5, deliver, send ack5


Selective repeat

• receiver individually acknowledges all correctly received packets
    - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
    - sender timer for each unACKed packet
• sender window
    - N consecutive seq #s
    - limits seq #s of sent, unACKed packets

Selective repeat: sender, receiver windows

Selective repeat

sender

data from above:
• if next available seq # in window, send pkt

timeout(n):
• resend pkt n, restart timer

ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver

pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt

pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)

otherwise:
• ignore
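The receiver rules above can be sketched as follows. This is an illustrative model only: `SRReceiver` is a made-up name, and sequence numbers are treated as unbounded integers, so wraparound (the subject of the dilemma slide) is deliberately left out.

```python
class SRReceiver:
    """Selective-repeat receiver sketch with unbounded sequence numbers."""

    def __init__(self, window_size):
        self.N = window_size
        self.rcvbase = 0
        self.buffer = {}         # out-of-order packets awaiting delivery
        self.delivered = []      # hypothetical stand-in for the upper layer

    def receive(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcvbase <= n <= self.rcvbase + self.N - 1:
            self.buffer[n] = data
            # deliver the in-order run starting at rcvbase, if there is one
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n <= self.rcvbase - 1:
            return n             # already delivered: ACK again
        return None              # otherwise: ignore
```

Re-ACKing packets in [rcvbase-N, rcvbase-1] matters because the sender's window cannot advance until it hears an ACK for its oldest packet, even if the receiver has long since delivered it.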

Selective repeat in action

sender window (N=4); pkt2 is lost (X)

sender:   send pkt0, send pkt1, send pkt2, send pkt3, (wait)
receiver: receive pkt0, send ack0
receiver: receive pkt1, send ack1
receiver: receive pkt3, buffer, send ack3
sender:   rcv ack0, send pkt4
sender:   rcv ack1, send pkt5
sender:   record ack3 arrived
receiver: receive pkt4, buffer, send ack4
receiver: receive pkt5, buffer, send ack5
sender:   pkt 2 timeout: send pkt2
sender:   record ack4 arrived
sender:   record ack5 arrived
receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #s: 0, 1, 2, 3
• window size = 3

(a) no problem
sender:   send pkt0, pkt1, pkt2; all three ACKs arrive, so the sender window (after receipt) advances
sender:   send pkt3 (lost, X), then send pkt0 (new data)
receiver: receiver window (after receipt) has advanced; will accept packet with seq number 0

(b) oops!
sender:   send pkt0, pkt1, pkt2; all three ACKs are lost (X X X)
sender:   timeout, retransmit pkt0
receiver: receiver window (after receipt) has advanced; will accept packet with seq number 0

• receiver can't see sender side
• receiver behavior identical in both cases: receiver sees no difference in two scenarios!
• something's (very) wrong: duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
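One way to explore the question is to compare the sequence numbers the sender may still retransmit with those the receiver is prepared to accept as new. The sketch below assumes the worst case from scenario (b): the receiver got a full window and advanced, while every ACK was lost. The function name `sr_is_ambiguous` is hypothetical. The usual textbook answer, which the function lets you check, is that the window size must be at most half the sequence-number space.

```python
def sr_is_ambiguous(seq_space, window):
    """True if a retransmitted old packet could be accepted as new data."""
    # seq #s of the first window, which the sender may retransmit on timeout
    old = {i % seq_space for i in range(window)}
    # seq #s the receiver accepts after delivering that whole window
    new = {i % seq_space for i in range(window, 2 * window)}
    return bool(old & new)
```

For the slide's example, `sr_is_ambiguous(4, 3)` is true, while a window of 2 with the same four sequence numbers is safe.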

58

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
    - one sender, one receiver
• reliable, in-order byte stream:
    - no "message boundaries"
• pipelined:
    - TCP congestion and flow control set window size
• full duplex data:
    - bi-directional data flow in same connection
    - MSS: maximum segment size
• connection-oriented:
    - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
    - sender will not overwhelm receiver

TCP segment structure

Header fields (each row 32 bits wide):
• source port #, dest port #
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• head len, not used, flag bits U A P R S F, receive window
    - URG: urgent data (generally not used)
    - ACK: ACK # valid
    - PSH: push data now
    - RST, SYN, FIN: connection estab (setup, teardown commands)
    - receive window: # bytes rcvr willing to accept
• checksum, Urg data pointer
    - checksum: Internet checksum (as in UDP)
• options (variable length)
• application data (variable length)

TCP seq. numbers, ACKs

sequence numbers:
    - byte stream "number" of first byte in segment's data

acknowledgements:
    - seq # of next byte expected from other side
    - cumulative ACK

Q: how receiver handles out-of-order segments
    - A: TCP spec doesn't say, up to implementor

[figure: outgoing segment from sender and incoming segment to sender (fields: source port #, dest port #, sequence number, acknowledgement number A, rwnd, checksum, urg pointer); sender sequence number space divided into: sent ACKed | sent, not-yet ACKed ("in-flight") | usable but not yet sent | not usable; window size N]

Byte stream in TCP

[figure: HTTP Get Message (K bytes) viewed as a byte stream; window of N bytes; a segment of M bytes starting at the 100th byte carries TCP header (seq no = 100); the rest of the message cannot be transmitted now]

TCP seq. numbers, ACKs

simple telnet scenario

Host A -> Host B: Seq=42, ACK=79, data = 'C'    (user types 'C')
Host B -> Host A: Seq=79, ACK=43, data = 'C'    (host ACKs receipt of 'C', echoes back 'C')
Host A -> Host B: Seq=43, ACK=80                (host ACKs receipt of echoed 'C')

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
    - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
    - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
    - average several recent measurements, not just current SampleRTT

TCP round trip time, timeout

EstimatedRTT = (1 - α)·EstimatedRTT + α·SampleRTT

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[figure: RTT, gaia.cs.umass.edu to fantasia.eurecom.fr; SampleRTT and EstimatedRTT in milliseconds (axis 100 to 350) against time in seconds]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
    - large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

DevRTT = (1 - β)·DevRTT + β·|SampleRTT - EstimatedRTT|
(typically, β = 0.25)

TimeoutInterval = EstimatedRTT + 4·DevRTT
                  (estimated RTT)  ("safety margin")
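The three formulas can be combined into a small estimator. This is a sketch under stated assumptions: the class name `RttEstimator` is made up, the initial values (EstimatedRTT set to the first sample, DevRTT to half of it) follow common practice rather than anything on the slides, and DevRTT is updated before EstimatedRTT so that the deviation is measured against the previous estimate.

```python
class RttEstimator:
    """EWMA-based RTT estimator using the slide's alpha and beta."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2     # assumed initial deviation

    def update(self, sample_rtt):
        """Fold in one SampleRTT (not taken from a retransmitted segment)."""
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    @property
    def timeout_interval(self):
        """EstimatedRTT plus the 4*DevRTT safety margin."""
        return self.estimated_rtt + 4 * self.dev_rtt
```

Starting from a 100 ms sample and then observing 120 ms, the estimate moves only to 102.5 ms, which shows how strongly the average damps a single outlier.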

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
    - pipelined segments
    - cumulative acks
    - single retransmission timer
• retransmissions triggered by:
    - timeout events
    - duplicate acks

let's initially consider simplified TCP sender:
    - ignore duplicate acks
    - ignore flow control, congestion control

TCP sender events

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
    - think of timer as for oldest unacked segment
    - expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
    - update what is known to be ACKed
    - start timer if there are still unacked segments

TCP sender (simplified)

state: wait for event

initial transition (Λ):
    NextSeqNum = InitialSeqNum
    SendBase = InitialSeqNum

event: data received from application above
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
        start timer

event: timeout
    retransmit not-yet-acked segment with smallest seq. #
    start timer

event: ACK received, with ACK field value y
    if (y > SendBase) {
        SendBase = y
        /* SendBase-1: last cumulatively ACKed byte */
        if (there are currently not-yet-acked segments)
            start timer
        else stop timer
    }
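A compact Python rendering of the simplified sender may help when tracing the retransmission scenarios. It is a sketch only: `SimpleTcpSender` and its attributes are illustrative names, the timer is a boolean flag, and "passing a segment to IP" is modelled by appending to a list.

```python
class SimpleTcpSender:
    """Simplified TCP sender: cumulative ACKs, one timer, no dup-ACK handling."""

    def __init__(self, initial_seq_num):
        self.next_seq_num = initial_seq_num
        self.send_base = initial_seq_num
        self.unacked = {}            # seq # of first byte -> segment data
        self.timer_running = False
        self.sent = []               # hypothetical record of segments handed to IP

    def send(self, data):
        """Data received from the application above."""
        seq = self.next_seq_num
        self.unacked[seq] = data
        self.sent.append((seq, data))
        self.next_seq_num += len(data)
        if not self.timer_running:
            self.timer_running = True

    def timeout(self):
        """Retransmit the not-yet-acked segment with the smallest seq #."""
        if not self.unacked:
            return
        seq = min(self.unacked)
        self.sent.append((seq, self.unacked[seq]))
        self.timer_running = True

    def ack(self, y):
        """ACK received with ACK field value y."""
        if y > self.send_base:
            self.send_base = y       # send_base - 1 is the last ACKed byte
            fully_acked = [s for s, d in self.unacked.items() if s + len(d) <= y]
            for s in fully_acked:
                del self.unacked[s]
            self.timer_running = bool(self.unacked)
```

Sending 8 bytes at Seq=92 and 20 bytes at Seq=100, then calling `ack(120)`, reproduces the cumulative ACK case: one ACK clears both segments.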

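TCP: retransmission scenarios

lost ACK scenario
Host A -> Host B: Seq=92, 8 bytes of data
Host B -> Host A: ACK=100 (lost, X)
Host A: timeout
Host A -> Host B: Seq=92, 8 bytes of data
Host B -> Host A: ACK=100

premature timeout (SendBase: 92, then 100, then 120, then 120)
Host A -> Host B: Seq=92, 8 bytes of data
Host A -> Host B: Seq=100, 20 bytes of data
Host B -> Host A: ACK=100
Host B -> Host A: ACK=120
Host A: timeout (before the ACKs arrive)
Host A -> Host B: Seq=92, 8 bytes of data
Host B -> Host A: ACK=120

cumulative ACK
Host A -> Host B: Seq=92, 8 bytes of data
Host A -> Host B: Seq=100, 20 bytes of data
Host B -> Host A: ACK=100 (lost, X)
Host B -> Host A: ACK=120 (arrives before timeout)
Host A -> Host B: Seq=120, 15 bytes of data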

TCP ACK generation [RFC 5681]

event at receiver: arrival of in-order segment with expected seq #. All data up to expected seq # already ACKed
TCP receiver action: delayed ACK. Wait up to 500ms for next segment. If no next segment, send ACK

event at receiver: arrival of in-order segment with expected seq #. One other segment has ACK pending
TCP receiver action: immediately send single cumulative ACK, ACKing both in-order segments

event at receiver: arrival of out-of-order segment, higher-than-expect seq. #. Gap detected
TCP receiver action: immediately send duplicate ACK, indicating seq. # of next expected byte

event at receiver: arrival of segment that partially or completely fills gap
TCP receiver action: immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
    - long delay before resending lost packet
• detect lost segments via duplicate ACKs
    - sender often sends many segments back-to-back
    - if segment is lost, there will likely be many duplicate ACKs

if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
    - likely that unacked segment lost, so don't wait for timeout

fast retransmit after sender receipt of triple duplicate ACK
Host A -> Host B: Seq=92, 8 bytes of data
Host A -> Host B: Seq=100, 20 bytes of data (lost, X)
Host B -> Host A: ACK=100
Host B -> Host A: ACK=100
Host B -> Host A: ACK=100
Host B -> Host A: ACK=100
Host A: 3 DUP ACKs received, timeout not yet expired
Host A -> Host B: Seq=100, 20 bytes of data

TCP flow control

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

[figure: receiver protocol stack; segments from sender pass through IP code and TCP code into TCP socket receiver buffers (OS), from which the application process reads]

application may remove data from TCP socket buffers slower than TCP receiver is delivering (sender is sending)

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
    - RcvBuffer size set via socket options (typical default is 4096 bytes)
    - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[figure: receiver-side buffering; TCP segment payloads enter RcvBuffer, which is split into buffered data and free buffer space (rwnd); buffered data goes to application process]

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

client side (application / network):
    Socket clientSocket = new Socket("hostname", "port number");
    connection state: ESTAB
    connection variables:
        seq # client-to-server, server-to-client
        rcvBuffer size at server, client

server side (application / network):
    Socket connectionSocket = welcomeSocket.accept();
    connection state: ESTAB
    connection variables:
        seq # client-to-server, server-to-client
        rcvBuffer size at server, client

TCP 3-way handshake

initial states: client state CLOSED, server state LISTEN

1. client: choose init seq num, x; send TCP SYN msg
       SYNbit=1, Seq=x
       client state: SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN
       SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1
       server state: SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK; this segment may contain client-to-server data
       ACKbit=1, ACKnum=y+1
       client state: ESTAB
4. server: received ACK(y) indicates client is live
       server state: ESTAB

TCP 3-way handshake: FSM

closed -> SYN sent
    event:  Socket clientSocket = new Socket("hostname", "port number");
    action: SYN(seq=x)

SYN sent -> ESTAB
    event:  SYNACK(seq=y, ACKnum=x+1)
    action: ACK(ACKnum=y+1)

closed -> listen
    event:  Socket connectionSocket = welcomeSocket.accept();
    action: Λ

listen -> SYN rcvd
    event:  SYN(x)
    action: SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client

SYN rcvd -> ESTAB
    event:  ACK(ACKnum=y+1)
    action: Λ

TCP: closing a connection

• client, server each close their side of connection
    - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
    - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

initial states: client state ESTAB, server state ESTAB

1. client: clientSocket.close(); send FINbit=1, seq=x
       client state: FIN_WAIT_1 (can no longer send but can receive data)
2. server: send ACKbit=1; ACKnum=x+1
       server state: CLOSE_WAIT (can still send data)
       client state: FIN_WAIT_2 (wait for server close)
3. server: send FINbit=1, seq=y
       server state: LAST_ACK (can no longer send data)
4. client: send ACKbit=1; ACKnum=y+1
       client state: TIMED_WAIT (timed wait for 2*max segment lifetime), then CLOSED
       server state: CLOSED

The "Two Army Problem"

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
    - lost packets (buffer overflow at routers)
    - long delays (queueing in router buffers)
• a top-10 problem!

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission

[figure: Host A sends original data at rate λ_in into unlimited shared output link buffers; Host B receives throughput λ_out]

• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity

[plots: λ_out against λ_in, levelling off at R/2; delay against λ_in, growing sharply as λ_in approaches R/2]

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
    - application-layer input = application-layer output: λ_in = λ_out
    - transport-layer input includes retransmissions: λ'_in ≥ λ_in

[figure: Host A offers λ_in (original data) and λ'_in (original data, plus retransmitted data) into finite shared output link buffers; Host B receives λ_out]

Causes/costs of congestion: scenario 2

idealization: perfect knowledge
• sender sends only when router buffers available

[figure: Host A keeps a copy of each packet and sends λ_in (original data), λ'_in (original data, plus retransmitted data) only when there is free buffer space in the finite shared output link buffers, holding back when there is no buffer space; plot of λ_out against λ_in, both reaching R/2]

Causes/costs of congestion: scenario 2

Idealization: known loss
packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
• when sending at R/2, some packets are retransmissions but asymptotic goodput is still R/2 (why?)

Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions including duplicated that are delivered!

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
    - decreasing goodput

[plots: λ_out against λ'_in, axes up to R/2, for the known-loss and duplicate cases]

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

[figure: Hosts A, B, C, D; λ_in: original data; λ'_in: original data, plus retransmitted data; finite shared output link buffers; λ_out]

Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput -> 0

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!

[plot: throughput by blue traffic (λ_out, up to R/2) against offered load by Host A (λ'_in, up to R/2); bandwidth wastage for packets dropped at the 2nd router]

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
    - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
    - explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
    - additive increase: increase cwnd by 1 MSS every RTT until loss detected
    - multiplicative decrease: cut cwnd in half after loss

AIMD saw tooth behavior: probing for bandwidth

[plot: cwnd (TCP sender congestion window size) against time; additively increase window size ... until loss occurs (then cut window in half)]

TCP Congestion Control: details

• sender limits transmission:

    LastByteSent - LastByteAcked ≤ cwnd

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

    rate ≈ cwnd / RTT bytes/sec

[figure: sender sequence number space; last byte ACKed | sent, not-yet ACKed ("in-flight") | last byte sent; the in-flight span is bounded by cwnd]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
    - initially cwnd = 1 MSS
    - double cwnd every RTT
    - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[figure: Host A sends one segment, then two segments, then four segments to Host B, one burst per RTT]

TCP: detecting, reacting to loss

• loss indicated by timeout:
    - cwnd set to 1 MSS
    - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
    - dup ACKs indicate network capable of delivering some segments
    - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

initial transition (Λ) -> slow start
    cwnd = 1 MSS
    ssthresh = 64 KB
    dupACKcount = 0

state: slow start
    new ACK:
        cwnd = cwnd + MSS
        dupACKcount = 0
        transmit new segment(s), as allowed
    duplicate ACK:
        dupACKcount++
    cwnd > ssthresh:
        Λ -> congestion avoidance
    timeout:
        ssthresh = cwnd/2
        cwnd = 1 MSS
        dupACKcount = 0
        retransmit missing segment
    dupACKcount == 3:
        ssthresh = cwnd/2
        cwnd = ssthresh + 3
        retransmit missing segment -> fast recovery

state: congestion avoidance
    new ACK:
        cwnd = cwnd + MSS·(MSS/cwnd)
        dupACKcount = 0
        transmit new segment(s), as allowed
    duplicate ACK:
        dupACKcount++
    timeout:
        ssthresh = cwnd/2
        cwnd = 1 MSS
        dupACKcount = 0
        retransmit missing segment -> slow start
    dupACKcount == 3:
        ssthresh = cwnd/2
        cwnd = ssthresh + 3
        retransmit missing segment -> fast recovery

state: fast recovery
    duplicate ACK:
        cwnd = cwnd + MSS
        transmit new segment(s), as allowed
    new ACK:
        cwnd = ssthresh
        dupACKcount = 0 -> congestion avoidance
    timeout:
        ssthresh = cwnd/2
        cwnd = 1
        dupACKcount = 0
        retransmit missing segment -> slow start
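The three-state machine can be sketched as a small Python class that tracks only cwnd, ssthresh and the state. The names (`RenoSender`, `on_new_ack`, and so on) are invented for the example, the default MSS of 1460 bytes is an assumption, and "cwnd = ssthresh + 3" is read as ssthresh plus 3 MSS. Retransmission and actual segment sending are left out.

```python
class RenoSender:
    """Window bookkeeping for the slow start / CA / fast recovery FSM."""

    def __init__(self, mss=1460, ssthresh=64 * 1024):
        self.mss = mss                    # assumed segment size in bytes
        self.cwnd = float(mss)            # cwnd = 1 MSS
        self.ssthresh = float(ssthresh)   # 64 KB, as on the slide
        self.dup_acks = 0
        self.state = "slow_start"

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += self.mss         # doubles cwnd once per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            # roughly one MSS per RTT, spread over the ACKs of that RTT
            self.cwnd += self.mss * (self.mss / self.cwnd)
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += self.mss
            return
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = float(self.mss)
        self.dup_acks = 0
        self.state = "slow_start"
```

Feeding the class a sequence of ACK, duplicate-ACK and timeout events is a quick way to reproduce the sawtooth by hand.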

TCP throughput

• avg. TCP throughput as function of window size, RTT?
    - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
    - avg. window size (# in-flight bytes) is 3/4 W
    - avg. throughput is 3/4 W per RTT

avg TCP throughput = (3/4)·W / RTT bytes/sec

[plot: window size oscillating between W/2 and W, average 3/4 W]

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

    TCP throughput = 1.22·MSS / (RTT·√L)

    - to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
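The two numbers on this slide can be checked with a few lines of arithmetic. The function names are made up for the example; the second function simply solves the throughput formula above for L, with MSS converted to bits.

```python
def required_window_segments(throughput_bps, rtt_s, segment_bytes):
    """In-flight segments needed to sustain throughput_bps over one RTT."""
    return throughput_bps * rtt_s / (segment_bytes * 8)


def required_loss_rate(throughput_bps, rtt_s, mss_bytes):
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L."""
    return (1.22 * mss_bytes * 8 / (rtt_s * throughput_bps)) ** 2


w = required_window_segments(10e9, 0.100, 1500)
loss = required_loss_rate(10e9, 0.100, 1500)
print(f"W = {w:.0f} segments, L = {loss:.2e}")
```

For 10 Gbps, 100 ms and 1500-byte segments this gives about 83,333 segments and a loss rate of roughly 2·10^-10, in line with the slide.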

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[plot: Connection 2 throughput against Connection 1 throughput, both axes from 0 to R; the "equal bandwidth share" line is where X2 = Y2; the "full bandwidth utilization" line is where X1 + Y1 = R; the trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2", moving towards the equal-share line]
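The convergence argument can be tried out numerically. The sketch below is a deliberately crude model, assuming both flows have the same RTT, increase by one unit per round, and both see a loss whenever their combined rate exceeds the link capacity. The function name `aimd_two_flows` is hypothetical.

```python
def aimd_two_flows(x1, x2, capacity, rounds):
    """Evolve two AIMD flows sharing one bottleneck; return their final rates."""
    for _ in range(rounds):
        if x1 + x2 > capacity:
            # loss: both flows halve, which halves the gap between them
            x1, x2 = x1 / 2, x2 / 2
        else:
            # additive increase: both grow by 1, so the gap stays the same
            x1, x2 = x1 + 1, x2 + 1
    return x1, x2


a, b = aimd_two_flows(1.0, 80.0, capacity=100.0, rounds=2000)
print(f"flow 1: {a:.3f}, flow 2: {b:.3f}")
```

Because additive increase preserves the difference between the two rates and each multiplicative decrease halves it, the rates end up practically equal however unequal they start.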

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
    - do not want rate throttled by congestion control
• instead use UDP:
    - send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
    - new app asks for 1 TCP, gets rate R/10
    - new app asks for 11 TCPs, gets R/2

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[figure: source and destination protocol stacks (application, transport, network, link, physical); IP datagram leaves the source with ECN=00 and arrives with ECN=11; TCP ACK segment returns with ECE=1]


UDP: User Datagram Protocol [RFC 768]

• "bare bones" Internet transport protocol
• "best effort" service, UDP segments may be:
    - lost
    - delivered out-of-order to app
• connectionless:
    - no handshaking between UDP sender, receiver
    - each UDP segment handled independently of others

• UDP use:
    - streaming multimedia apps (loss tolerant, rate sensitive)
    - DNS
    - SNMP
• reliable transfer over UDP:
    - add reliability at application layer
    - application-specific error recovery!

UDP: segment header

UDP segment format (each row 32 bits wide):
• source port #, dest port #
• length, checksum
    - length: in bytes of UDP segment, including header
• application data (payload)

why is there a UDP?
• no connection establishment (which can add delay)
• simple: no connection state at sender, receiver
• small header size
• no congestion control: UDP can blast away as fast as desired

UDP checksum

Goal: detect "errors" (e.g., flipped bits) in transmitted segment

sender:
• treat segment contents, including header fields, as sequence of 16-bit integers
• checksum: addition (1's complement sum) of segment contents
• sender puts checksum value into UDP checksum field

receiver:
• compute checksum of received segment
• check if computed checksum equals checksum field value:
    - NO: error detected
    - YES: no error detected. But maybe errors nonetheless? More later ...

Internet checksum: example

example: add two 16-bit integers

               1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
               1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
wraparound   1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1
sum            1 0 1 1 1 0 1 1 1 0 1 1 1 1 0 0
checksum       0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 1

Note: when adding numbers, a carryout from the most significant bit needs to be added to the result
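The wraparound rule is easy to get wrong by hand, so here is a short sketch of the computation. The function name `internet_checksum` is chosen for the example, and the two input words (1110011001100110 and 1101010101010101, i.e. 0xE666 and 0xD555) are assumed to be the operands of the example above.

```python
def internet_checksum(words):
    """1's complement of the 1's complement sum of 16-bit words."""
    total = 0
    for w in words:
        total += w
        # add any carry out of bit 15 back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


csum = internet_checksum([0xE666, 0xD555])
print(format(csum, "016b"))
```

A receiver can run the same function over the data words plus the checksum field: a result of zero means no error was detected.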

Principles of reliable data transfer

• important in application, transport, link layers
    - top-10 list of important networking topics!
• characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)


Reliable data transfer: getting started

send side:
• rdt_send(): called from above (e.g., by app). Passed data to deliver to receiver upper layer
• udt_send(): called by rdt, to transfer packet over unreliable channel to receiver

receive side:
• rdt_rcv(): called when packet arrives on rcv-side of channel
• deliver_data(): called by rdt to deliver data to upper

Reliable data transfer: getting started

We'll:
• incrementally develop sender, receiver sides of reliable data transfer protocol (rdt)
• consider only unidirectional data transfer
    - but control info will flow on both directions!
• use finite state machines (FSMs) to specify sender, receiver

FSM notation:
    state 1 -> state 2, each transition labelled
        event causing state transition
        actions taken on state transition

state: when in this "state", next state uniquely determined by next event

rdt1.0: reliable transfer over a reliable channel

• underlying channel perfectly reliable
    - no bit errors
    - no loss of packets
• separate FSMs for sender, receiver:
    - sender sends data into underlying channel
    - receiver reads data from underlying channel

sender
state: Wait for call from above
    event: rdt_send(data)
        packet = make_pkt(data)
        udt_send(packet)

receiver
state: Wait for call from below
    event: rdt_rcv(packet)
        extract(packet, data)
        deliver_data(data)

rdt2.0: channel with bit errors

• underlying channel may flip bits in packet
    - checksum to detect bit errors
• the question: how to recover from errors:
    - acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
    - negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
    - sender retransmits pkt on receipt of NAK
• new mechanisms in rdt2.0 (beyond rdt1.0):
    - error detection
    - receiver feedback: control msgs (ACK, NAK) rcvr->sender

How do humans recover from "errors" during conversation?


rdt2.0: FSM specification

sender
state: Wait for call from above
    event: rdt_send(data)
        sndpkt = make_pkt(data, checksum)
        udt_send(sndpkt)
        -> Wait for ACK or NAK
state: Wait for ACK or NAK
    event: rdt_rcv(rcvpkt) && isNAK(rcvpkt)
        udt_send(sndpkt)
    event: rdt_rcv(rcvpkt) && isACK(rcvpkt)
        Λ
        -> Wait for call from above

receiver
state: Wait for call from below
    event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
        udt_send(NAK)
    event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
        extract(rcvpkt, data)
        deliver_data(data)
        udt_send(ACK)

rdt2.0: operation with no errors

1. sender (Wait for call from above): rdt_send(data)
       sndpkt = make_pkt(data, checksum); udt_send(sndpkt)
2. receiver (Wait for call from below): rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
       extract(rcvpkt, data); deliver_data(data); udt_send(ACK)
3. sender (Wait for ACK or NAK): rdt_rcv(rcvpkt) && isACK(rcvpkt)
       Λ; back to Wait for call from above

rdt2.0: error scenario

1. sender (Wait for call from above): rdt_send(data)
       sndpkt = make_pkt(data, checksum); udt_send(sndpkt)
2. receiver (Wait for call from below): rdt_rcv(rcvpkt) && corrupt(rcvpkt)
       udt_send(NAK)
3. sender (Wait for ACK or NAK): rdt_rcv(rcvpkt) && isNAK(rcvpkt)
       udt_send(sndpkt)
4. receiver (Wait for call from below): rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
       extract(rcvpkt, data); deliver_data(data); udt_send(ACK)
5. sender (Wait for ACK or NAK): rdt_rcv(rcvpkt) && isACK(rcvpkt)
       Λ; back to Wait for call from above

rdt2.0 has a fatal flaw!

what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver!
• can't just retransmit: possible duplicate

handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt

stop and wait: sender sends one packet, then waits for receiver response

rdt2.1: sender, handles garbled ACK/NAKs

state: Wait for call 0 from above
    event: rdt_send(data)
        sndpkt = make_pkt(0, data, checksum)
        udt_send(sndpkt)
        -> Wait for ACK or NAK 0
state: Wait for ACK or NAK 0
    event: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
        udt_send(sndpkt)
    event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
        Λ
        -> Wait for call 1 from above
state: Wait for call 1 from above
    event: rdt_send(data)
        sndpkt = make_pkt(1, data, checksum)
        udt_send(sndpkt)
        -> Wait for ACK or NAK 1
state: Wait for ACK or NAK 1
    event: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
        udt_send(sndpkt)
    event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
        Λ
        -> Wait for call 0 from above

rdt2.1: receiver, handles garbled ACK/NAKs

state: Wait for 0 from below
    event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
        extract(rcvpkt, data)
        deliver_data(data)
        sndpkt = make_pkt(ACK, chksum)
        udt_send(sndpkt)
        -> Wait for 1 from below
    event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
        sndpkt = make_pkt(NAK, chksum)
        udt_send(sndpkt)
    event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
        sndpkt = make_pkt(ACK, chksum)
        udt_send(sndpkt)
state: Wait for 1 from below
    event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
        extract(rcvpkt, data)
        deliver_data(data)
        sndpkt = make_pkt(ACK, chksum)
        udt_send(sndpkt)
        -> Wait for 0 from below
    event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
        sndpkt = make_pkt(NAK, chksum)
        udt_send(sndpkt)
    event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
        sndpkt = make_pkt(ACK, chksum)
        udt_send(sndpkt)

rdt2.1: Example 1 (slides 23-28)

[Figure sequence: the rdt2.1 sender and receiver FSMs, stepping through a corrupted data packet.]
1. sender, "Wait for call 0 from above": rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
2. receiver, "Wait for 0 from below": rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender, "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
4. receiver, "Wait for 0 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
5. sender, "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ
6. sender is now in "Wait for call 1 from above"; receiver is in "Wait for 1 from below"

rdt2.1: Example 2 (slides 29-33)

[Figure sequence: the rdt2.1 sender and receiver FSMs, stepping through a corrupted ACK and the resulting duplicate packet.]
1. receiver, "Wait for 0 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
2. sender, "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
3. receiver, "Wait for 1 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)   (duplicate: re-ACKed, not delivered)
4. sender, "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ
5. sender is now in "Wait for call 1 from above"; receiver is in "Wait for 1 from below"

rdt2.1: discussion

sender:
• seq # added to pkt
• two seq. #'s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  – state must "remember" whether "expected" pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  – state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender
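The receiver-side duplicate check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Packet` and `Rdt21Receiver` names are hypothetical, and a boolean `corrupt` flag stands in for a real checksum test.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    seq: int               # alternating sequence number, 0 or 1
    data: str
    corrupt: bool = False  # assumption: stands in for a failed checksum


@dataclass
class Rdt21Receiver:
    expected: int = 0                      # state: which seq # is expected next
    delivered: list = field(default_factory=list)

    def receive(self, pkt: Packet) -> str:
        """Return the reply ("ACK" or "NAK") the receiver would send."""
        if pkt.corrupt:
            return "NAK"
        if pkt.seq == self.expected:
            self.delivered.append(pkt.data)  # new packet: deliver up
            self.expected ^= 1               # flip 0 <-> 1
        # Duplicate (unexpected seq #): re-ACK without delivering again.
        return "ACK"
```

A retransmitted packet carrying the old sequence number is acknowledged again but never delivered twice, which is why one bit of sequence number is enough for stop-and-wait.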

34

rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  – receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt

35

rdt2.2: sender, receiver fragments

36

sender FSM fragment
  state "Wait for call 0 from above":
    rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)   -> "Wait for ACK 0"
  state "Wait for ACK 0":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)) / udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0) / Λ

receiver FSM fragment
  state "Wait for 0 from below":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)) / udt_send(sndpkt)
  transition into "Wait for 0 from below":
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) /
      extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
  – checksum, seq. #, ACKs, retransmissions will be of help ... but not enough

approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  – retransmission will be duplicate, but seq. #'s already handle this
  – receiver must specify seq # of pkt being ACKed
• requires countdown timer

37

rdt3.0 sender

38

state "Wait for call 0 from above":
  rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer   -> "Wait for ACK0"
  rdt_rcv(rcvpkt) / Λ
state "Wait for ACK0":
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)) / Λ
  timeout / udt_send(sndpkt); start_timer
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0) / stop_timer   -> "Wait for call 1 from above"
state "Wait for call 1 from above":
  rdt_send(data) / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer   -> "Wait for ACK1"
  rdt_rcv(rcvpkt) / Λ
state "Wait for ACK1":
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0)) / Λ
  timeout / udt_send(sndpkt); start_timer
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1) / stop_timer   -> "Wait for call 0 from above"

rdt3.0 in action (slides 39-40)

Each line shows a sender step, then the receiver's reaction.

(a) no loss
  sender: send pkt0                     receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1           receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0           receiver: rcv pkt0, send ack0

(b) packet loss
  sender: send pkt0                     receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1           (pkt1 lost, X)
  sender: timeout, resend pkt1          receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0           receiver: rcv pkt0, send ack0

(c) ACK loss
  sender: send pkt0                     receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1           receiver: rcv pkt1, send ack1 (ack1 lost, X)
  sender: timeout, resend pkt1          receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0           receiver: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
  sender: send pkt0                     receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1           receiver: rcv pkt1, send ack1 (ack1 delayed)
  sender: timeout, resend pkt1          receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0           receiver: rcv pkt0, send ack0
  sender: rcv ack1 (second copy), do nothing

Performance of rdt3.0

41

• rdt3.0 is correct, but performance is far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

  $D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \mu\text{s}$

• U_sender: utilization, the fraction of time sender is busy sending (times in ms):

  $U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$

• if RTT = 30 msec: one 1 KB pkt every 30 msec, i.e. 33 kB/sec throughput over a 1 Gbps link
• network protocol limits use of physical resources!
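The arithmetic on this slide can be reproduced with a short Python sketch. The function and constant names are illustrative assumptions; the numbers are the ones from the example (8000-bit packet, 1 Gbps link, 30 ms RTT).

```python
def stop_and_wait_utilization(packet_bits: float, rate_bps: float, rtt_s: float) -> float:
    """Fraction of time the sender is busy: (L/R) / (RTT + L/R)."""
    d_trans = packet_bits / rate_bps          # transmission delay L/R
    return d_trans / (rtt_s + d_trans)


L_BITS = 8000      # packet size L
R_BPS = 1e9        # link rate R
RTT_S = 0.030      # round-trip time

u_sender = stop_and_wait_utilization(L_BITS, R_BPS, RTT_S)
# One packet is delivered per (RTT + L/R) seconds.
throughput_bytes_per_s = (L_BITS / 8) / (RTT_S + L_BITS / R_BPS)

print(f"U_sender   = {u_sender:.5f}")
print(f"throughput = {throughput_bytes_per_s / 1000:.1f} kB/sec")
```

The point of the calculation is that the link rate barely matters here: almost all of each cycle is spent waiting for the ACK.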

rdt3.0: stop-and-wait operation

42

[Figure: sender/receiver timeline]
• t = 0: first packet bit transmitted
• t = L/R: last packet bit transmitted
• first packet bit arrives; last packet bit arrives, send ACK
• t = RTT + L/R: ACK arrives, send next packet

  $U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$

Pipelined protocols

43

pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver

• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

44

[Figure: sender/receiver timeline with three packets in flight]
• t = 0: first packet bit transmitted
• t = L/R: last bit transmitted
• first packet bit arrives; last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• t = RTT + L/R: ACK arrives, send next packet

3-packet pipelining increases utilization by a factor of 3!

  $U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$
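As a rough sketch, the same utilization formula generalizes to a window of N packets sent back-to-back per round trip. The function name and the cap at 1.0 (a full pipe) are assumptions added for illustration.

```python
def pipelined_utilization(n_packets: int, packet_bits: float,
                          rate_bps: float, rtt_s: float) -> float:
    """Sender utilization with n_packets in flight per RTT: N*(L/R) / (RTT + L/R)."""
    d_trans = packet_bits / rate_bps
    # Utilization cannot exceed 1: once the pipe is full, the sender is always busy.
    return min(1.0, n_packets * d_trans / (rtt_s + d_trans))


for n in (1, 3, 1000, 4000):
    u = pipelined_utilization(n, 8000, 1e9, 0.030)
    print(f"N = {n:5d}  U_sender = {u:.5f}")
```

With the slide's parameters, utilization grows linearly in N until several thousand packets are in flight, which is why large windows matter on fast, long links.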

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  – doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  – when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  – when timer expires, retransmit only that unacked packet

45

Go-Back-N: sender

46

• k-bit seq # in pkt header
• "window" of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n ("cumulative ACK")
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

47

initially: base = 1; nextseqnum = 1

state "Wait":
  rdt_send(data) /
    if (nextseqnum < base+N) {
      sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
      udt_send(sndpkt[nextseqnum])
      if (base == nextseqnum)
        start_timer
      nextseqnum++
    }
    else
      refuse_data(data)
  timeout /
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    ...
    udt_send(sndpkt[nextseqnum-1])
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) /
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
      stop_timer
    else
      start_timer
  rdt_rcv(rcvpkt) && corrupt(rcvpkt) / Λ
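The sender FSM translates fairly directly into Python. The sketch below is a hedged illustration, not the slides' code: the class name `GbnSender` is hypothetical, a list called `sent` stands in for `udt_send`, the timer is reduced to a boolean, and `max()` is an added guard so a stale ACK cannot move `base` backwards.

```python
class GbnSender:
    def __init__(self, window_size: int):
        self.N = window_size
        self.base = 1              # oldest unacked seq #
        self.nextseqnum = 1        # next seq # to use
        self.sndpkt = {}           # seq # -> data, kept for retransmission
        self.timer_running = False
        self.sent = []             # log of transmitted seq #s (stand-in for udt_send)

    def rdt_send(self, data) -> bool:
        """Send data if the window has room; return False to refuse it."""
        if self.nextseqnum < self.base + self.N:
            self.sndpkt[self.nextseqnum] = data
            self.sent.append(self.nextseqnum)
            if self.base == self.nextseqnum:
                self.timer_running = True   # start_timer for oldest in-flight pkt
            self.nextseqnum += 1
            return True
        return False                        # refuse_data(data)

    def on_ack(self, acknum: int) -> None:
        """Cumulative ACK: everything up to and including acknum is acknowledged."""
        self.base = max(self.base, acknum + 1)
        # stop_timer if nothing is in flight, otherwise (re)start it
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self) -> None:
        """Go back N: retransmit every unacked packet in the window."""
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.sent.append(seq)
```

For example, with N = 4, a fifth `rdt_send` is refused until an ACK slides the window forward, and a timeout resends everything from `base` to `nextseqnum - 1`.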


GBN: receiver extended FSM

49

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don't buffer): no receiver buffering!
  – re-ACK pkt with highest in-order seq #

initially: expectedseqnum = 1; sndpkt = make_pkt(0, ACK, chksum)

state "Wait":
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum) /
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++
  default / udt_send(sndpkt)


GBN in action

51

sender window (N=4), sequence numbers 0 1 2 3 4 5 6 7 8

sender                                  receiver
send pkt0
send pkt1
send pkt2 (lost, X)
send pkt3
(wait)
                                        receive pkt0, send ack0
                                        receive pkt1, send ack1
                                        receive pkt3, discard, (re)send ack1
rcv ack0, send pkt4
rcv ack1, send pkt5
ignore duplicate ACK
                                        receive pkt4, discard, (re)send ack1
                                        receive pkt5, discard, (re)send ack1
pkt 2 timeout
send pkt2
send pkt3
send pkt4
send pkt5
                                        rcv pkt2, deliver, send ack2
                                        rcv pkt3, deliver, send ack3
                                        rcv pkt4, deliver, send ack4
                                        rcv pkt5, deliver, send ack5


Selective repeat

• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #'s
  – limits seq #s of sent, unACKed packets

53

Selective repeat: sender, receiver windows

54

Selective repeat

55

sender

data from above:
• if next available seq # in window, send pkt

timeout(n):
• resend pkt n, restart timer

ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver

pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt

pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)

otherwise:
• ignore
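The receiver rules can be sketched as follows. This is a minimal illustration with assumptions: `SrReceiver` is a hypothetical name, sequence numbers are unbounded integers (no wrap-around), and the return value is the ACK number to send, or `None` when the packet is ignored.

```python
class SrReceiver:
    def __init__(self, window_size: int):
        self.N = window_size
        self.rcv_base = 0        # lowest seq # not yet received
        self.buffer = {}         # out-of-order packets awaiting delivery
        self.delivered = []      # data handed to the upper layer, in order

    def receive(self, n: int, data):
        if self.rcv_base <= n < self.rcv_base + self.N:
            self.buffer[n] = data
            # Deliver the in-order run starting at rcv_base, advancing the window.
            while self.rcv_base in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcv_base))
                self.rcv_base += 1
            return n             # send ACK(n)
        if self.rcv_base - self.N <= n < self.rcv_base:
            return n             # already delivered: re-ACK so the sender can advance
        return None              # otherwise: ignore
```

Feeding it packets 0, 1, 3, 4, 5 and then 2 delivers 0 and 1 immediately, buffers 3 to 5, and releases 2 through 5 in order once the gap is filled.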

Selective repeat in action

56

sender window (N=4), sequence numbers 0 1 2 3 4 5 6 7 8

sender                                  receiver
send pkt0
send pkt1
send pkt2 (lost, X)
send pkt3
(wait)
                                        receive pkt0, send ack0
                                        receive pkt1, send ack1
                                        receive pkt3, buffer, send ack3
rcv ack0, send pkt4
rcv ack1, send pkt5
record ack3 arrived
                                        receive pkt4, buffer, send ack4
                                        receive pkt5, buffer, send ack5
pkt 2 timeout
send pkt2
record ack4 arrived
record ack5 arrived
                                        rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

[Figure: sender window (after receipt) and receiver window (after receipt) over the sequence 0 1 2 3 0 1 2, in two scenarios.]

(a) no problem: sender sends pkt0, pkt1, pkt2, which are received and ACKed. The sender window advances; the sender sends pkt3, which is lost (X), and then pkt0 carrying new data. The receiver will accept packet with seq number 0.

(b) oops!: sender sends pkt0, pkt1, pkt2, which are received, but all three ACKs are lost (X X X). The sender times out and retransmits pkt0. The receiver will accept packet with seq number 0.

• receiver can't see sender side
• receiver behavior identical in both cases!
• something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
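One way to explore the question is to check, for a given sequence-number space and window size, whether a retransmitted old packet can land inside the receiver's advanced window. The sketch below is an illustrative assumption (the function name `sr_is_ambiguous` is hypothetical); it models the worst case from scenario (b), where the receiver has accepted a full window but none of the ACKs reached the sender.

```python
def sr_is_ambiguous(seq_space: int, window: int) -> bool:
    """True if old (retransmitted) and new seq #s can collide at the receiver."""
    # Sender may still retransmit any packet of its un-advanced window...
    old = {n % seq_space for n in range(0, window)}
    # ...while the receiver, having advanced, now accepts the next `window` numbers.
    new = {n % seq_space for n in range(window, 2 * window)}
    return bool(old & new)


for window in (1, 2, 3):
    print(f"seq space 4, window {window}: ambiguous = {sr_is_ambiguous(4, window)}")
```

Under this model the collision disappears exactly when the window size is at most half the sequence-number space, which matches the commonly stated rule for selective repeat.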

58

TCP: Overview   RFCs: 793, 1122, 1323, 2018, 2581

• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver

59

TCP segment structure

60

segment layout (each row is 32 bits wide):
  source port # | dest port #
  sequence number
  acknowledgement number
  head len | not used | U A P R S F | receive window
  checksum | Urg data pointer
  options (variable length)
  application data (variable length)

field notes:
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)

TCP seq. numbers, ACKs

61

sequence numbers:
  – byte stream "number" of first byte in segment's data
acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK
Q: how receiver handles out-of-order segments
  – A: TCP spec doesn't say, up to implementor

[Figure: an outgoing segment from the sender and an incoming segment to the sender (each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer), alongside the sender sequence number space, divided into: sent ACKed | sent, not-yet ACKed ("in-flight") | usable but not yet sent | not usable; window size N.]

Byte stream in TCP

62

[Figure: an HTTP Get Message (K bytes) viewed as a byte stream. A TCP header with seq no = 100 precedes M bytes starting at the 100th byte; the window covers N bytes, and the bytes outside it are marked "Cannot be transmitted now".]

TCP seq. numbers, ACKs

63

simple telnet scenario (Host A, Host B):
1. User types 'C'. Host A -> Host B: Seq=42, ACK=79, data = 'C'
2. Host B ACKs receipt of 'C', echoes back 'C'. Host B -> Host A: Seq=79, ACK=43, data = 'C'
3. Host A ACKs receipt of echoed 'C'. Host A -> Host B: Seq=43, ACK=80

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT

64

[Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr. x-axis: time (seconds), 1 to 106; y-axis: RTT (milliseconds), 100 to 350; curves: SampleRTT and EstimatedRTT.]

  $EstimatedRTT = (1 - \alpha) \cdot EstimatedRTT + \alpha \cdot SampleRTT$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: $\alpha = 0.125$
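A short Python sketch of the moving average, to show how one outlier sample is damped. The function name `smooth_rtt` and the choice to seed the estimate with the first sample are assumptions for illustration.

```python
def smooth_rtt(samples, alpha=0.125):
    """Return the EstimatedRTT after each sample (EWMA with weight alpha)."""
    estimated = samples[0]            # assumption: seed with the first sample
    history = [estimated]
    for sample in samples[1:]:
        estimated = (1 - alpha) * estimated + alpha * sample
        history.append(estimated)
    return history


# One 180 ms spike among 100 ms samples moves the estimate by only 10 ms,
# and its influence then shrinks by a factor (1 - alpha) with every new sample.
print(smooth_rtt([100.0, 180.0, 100.0, 100.0]))
```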


TCP round trip time, timeout

66

• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

  $DevRTT = (1 - \beta) \cdot DevRTT + \beta \cdot |SampleRTT - EstimatedRTT|$
  (typically, $\beta = 0.25$)

  $TimeoutInterval = EstimatedRTT + 4 \cdot DevRTT$
  (EstimatedRTT: estimated RTT; 4·DevRTT: "safety margin")
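The two estimators combine into a small timeout calculator. This sketch makes assumptions that the slide does not state: the class name `RttEstimator` is hypothetical, DevRTT starts at half the first sample, and DevRTT is updated using the previous EstimatedRTT before EstimatedRTT itself is updated (the ordering used by RFC 6298).

```python
class RttEstimator:
    def __init__(self, first_sample: float, alpha: float = 0.125, beta: float = 0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2      # assumption: initial deviation

    def update(self, sample_rtt: float) -> float:
        """Fold in one SampleRTT and return the new TimeoutInterval."""
        # Deviation first, measured against the previous EstimatedRTT.
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        return self.timeout_interval()

    def timeout_interval(self) -> float:
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(100.0)                 # times in milliseconds
print(est.update(120.0))                  # timeout after a 120 ms sample
```

Steady samples shrink DevRTT and pull the timeout toward EstimatedRTT; jittery samples widen the margin.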

TCP reliable data transfer

67

• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks

let's initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control

TCP sender events:

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments

68

TCP sender (simplified)

69

initially: NextSeqNum = InitialSeqNum; SendBase = InitialSeqNum

state "wait for event":
  data received from application above /
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
      start timer
  timeout /
    retransmit not-yet-acked segment with smallest seq. #
    start timer
  ACK received, with ACK field value y /
    if (y > SendBase) {
      SendBase = y
      /* SendBase–1: last cumulatively ACKed byte */
      if (there are currently not-yet-acked segments)
        start timer
      else stop timer
    }
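The three events of the simplified sender map onto three methods. This is a hedged sketch, not TCP as implemented anywhere: `SimplifiedTcpSender` is a hypothetical name, the timer is a boolean, and a list called `wire` records what would be passed to IP.

```python
class SimplifiedTcpSender:
    def __init__(self, initial_seq_num: int = 0):
        self.next_seq_num = initial_seq_num
        self.send_base = initial_seq_num     # oldest unacked byte
        self.unacked = {}                    # seq # -> segment data
        self.timer_running = False
        self.wire = []                       # (seq #, data) passed to IP

    def on_app_data(self, data: bytes) -> None:
        seq = self.next_seq_num
        self.unacked[seq] = data
        self.wire.append((seq, data))        # pass segment to IP
        self.next_seq_num += len(data)       # seq #s count bytes, not segments
        if not self.timer_running:
            self.timer_running = True        # start timer

    def on_timeout(self) -> None:
        if self.unacked:
            seq = min(self.unacked)          # not-yet-acked segment with smallest seq #
            self.wire.append((seq, self.unacked[seq]))
            self.timer_running = True        # restart timer

    def on_ack(self, y: int) -> None:
        if y > self.send_base:               # cumulative ACK covering new data
            self.send_base = y
            for seq in [s for s, d in self.unacked.items() if s + len(d) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)
```

Replaying the numbers from the retransmission scenarios (Seq=92 with 8 bytes, Seq=100 with 20 bytes) shows a single ACK=120 acknowledging both segments, and a late ACK=100 being ignored.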

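TCP: retransmission scenarios (slides 70-71)

lost ACK scenario (SendBase=92 at start):
  Host A -> Host B: Seq=92, 8 bytes of data
  Host B -> Host A: ACK=100 (lost, X)
  Host A: timeout
  Host A -> Host B: Seq=92, 8 bytes of data (retransmission)
  Host B -> Host A: ACK=100; SendBase=100

premature timeout (SendBase=92 at start):
  Host A -> Host B: Seq=92, 8 bytes of data
  Host A -> Host B: Seq=100, 20 bytes of data
  Host B -> Host A: ACK=100, ACK=120 (still in transit when the timer expires)
  Host A: timeout
  Host A -> Host B: Seq=92, 8 bytes of data (retransmission)
  Host A: ACK=100 arrives, SendBase=100; ACK=120 arrives, SendBase=120
  Host B -> Host A: ACK=120 (for the retransmitted segment); SendBase=120

cumulative ACK:
  Host A -> Host B: Seq=92, 8 bytes of data
  Host A -> Host B: Seq=100, 20 bytes of data
  Host B -> Host A: ACK=100 (lost, X)
  Host B -> Host A: ACK=120 (arrives before the timeout)
  Host A -> Host B: Seq=120, 15 bytes of data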

TCP ACK generation [RFC 5681]

72

event at receiver -> TCP receiver action

• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  -> delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  -> immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected
  -> immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap
  -> immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

73

• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 duplicate ACKs for the same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  – likely that unacked segment lost, so don't wait for timeout
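The trigger can be sketched as a small counter. The class name `DupAckDetector` is hypothetical, and the sketch assumes the usual reading of "triple duplicate ACK": the first ACK for a value is the original, and the third repeat after it fires the retransmission.

```python
class DupAckDetector:
    def __init__(self):
        self.last_ack = None
        self.dup_count = 0

    def on_ack(self, ack: int) -> bool:
        """Return True exactly when this ACK is the third duplicate."""
        if ack == self.last_ack:
            self.dup_count += 1
            return self.dup_count == 3     # fast retransmit: resend segment at `ack`
        self.last_ack = ack                # new ACK value: reset the counter
        self.dup_count = 0
        return False


detector = DupAckDetector()
for ack in (100, 100, 100, 100):
    if detector.on_ack(ack):
        print(f"fast retransmit segment starting at byte {ack}")
```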

TCP fast retransmit

74

[Figure: fast retransmit after sender receipt of triple duplicate ACK]
  Host A -> Host B: Seq=92, 8 bytes of data
  Host A -> Host B: Seq=100, 20 bytes of data (lost, X)
  Host B -> Host A: ACK=100, ACK=100, ACK=100, ACK=100 (the original ACK plus 3 DUP ACKs)
  Host A -> Host B: Seq=100, 20 bytes of data (fast retransmit, before the timeout expires)

TCP flow control

75

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

• application may remove data from TCP socket buffers ...
• ... slower than TCP receiver is delivering (sender is sending)

[Figure: receiver protocol stack. Segments from sender pass up through IP code and TCP code (in the OS) into the TCP socket receiver buffers, which the application process reads from.]

TCP flow control

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
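A minimal sketch of the bookkeeping on both sides. The function names and the `last_byte_*` parameters are assumptions introduced for illustration; they do not appear on the slide.

```python
def advertised_rwnd(rcv_buffer: int, last_byte_rcvd: int, last_byte_read: int) -> int:
    """Free buffer space at the receiver: RcvBuffer minus buffered, unread data."""
    return rcv_buffer - (last_byte_rcvd - last_byte_read)


def sender_may_send(last_byte_sent: int, last_byte_acked: int,
                    rwnd: int, nbytes: int) -> bool:
    """Sender keeps in-flight data (after sending nbytes more) within rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + nbytes <= rwnd


rwnd = advertised_rwnd(rcv_buffer=4096, last_byte_rcvd=3000, last_byte_read=1000)
print(rwnd)                                              # free space to advertise
print(sender_may_send(5000, 4000, rwnd, nbytes=1000))    # fits within rwnd
print(sender_may_send(5000, 4000, rwnd, nbytes=1200))    # would overflow
```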

76

[Figure: receiver-side buffering. RcvBuffer holds buffered data plus free buffer space (rwnd); TCP segment payloads enter the buffer, and data leaves to the application process.]

Connection Management

77

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

[Figure: client and server, each an application over the network, each holding:]
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client

client:  Socket clientSocket = new Socket("hostname", "port number");
server:  Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

80

client state: CLOSED                    server state: LISTEN

1. client: choose init seq num, x; send TCP SYN msg
     client -> server: SYNbit=1, Seq=x
     client state: SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN
     server -> client: SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1
     server state: SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK; this segment may contain client-to-server data
     client -> server: ACKbit=1, ACKnum=y+1
     client state: ESTAB
4. server: received ACK(y) indicates client is live
     server state: ESTAB

TCP 3-way handshake: FSM

81

state "closed":
  Socket clientSocket = new Socket("hostname", "port number"); / SYN(seq=x)   -> "SYN sent"
  Socket connectionSocket = welcomeSocket.accept(); / Λ   -> "listen"
state "listen":
  SYN(x) / SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client   -> "SYN rcvd"
state "SYN sent":
  SYNACK(seq=y, ACKnum=x+1) / ACK(ACKnum=y+1)   -> "ESTAB"
state "SYN rcvd":
  ACK(ACKnum=y+1) / Λ   -> "ESTAB"

TCP: closing a connection (slides 82-83)

• client, server each close their side of connection
  – send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

client state: ESTAB                     server state: ESTAB

1. client: clientSocket.close(); sends FINbit=1, seq=x
     client state: FIN_WAIT_1 (can no longer send but can receive data)
2. server: sends ACKbit=1, ACKnum=x+1
     server state: CLOSE_WAIT (can still send data)
     client state: FIN_WAIT_2 (wait for server close)
3. server: sends FINbit=1, seq=y
     server state: LAST_ACK (can no longer send data)
4. client: sends ACKbit=1, ACKnum=y+1
     client state: TIMED_WAIT (timed wait for 2 * max segment lifetime), then CLOSED
     server state: CLOSED

The ldquoTwo Army Problemrdquo

84

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
• a top-10 problem!

85

Causes/costs of congestion: scenario 1

86

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission

[Figure: Host A and Host B each send original data at rate λ_in into a router with unlimited shared output link buffers; throughput: λ_out.]
[Plot: λ_out vs. λ_in, rising to R/2 at λ_in = R/2.]
[Plot: delay vs. λ_in, growing sharply as λ_in approaches R/2.]

• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity

Causes/costs of congestion: scenario 2

87

• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: λ_in = λ_out
  – transport-layer input includes retransmissions: λ'_in ≥ λ_in

[Figure: Host A sends λ_in (original data) and λ'_in (original data, plus retransmitted data) into a router with finite shared output link buffers; Host B receives λ_out.]

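Causes/costs of congestion: scenario 2

88

idealization: perfect knowledge
• sender sends only when router buffers available

[Figure: Host A sends λ_in (original data) and λ'_in (original data, plus retransmitted data) into a router with finite shared output link buffers; the sender keeps a copy and transmits only when there is free buffer space; Host B receives λ_out.]
[Plot: λ_out vs. λ_in, both axes up to R/2.]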

Causes/costs of congestion: scenario 2 (slides 89-92)

Idealization: known loss
packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost

[Figure: Host A sends λ_in (original data) and λ'_in (original data, plus retransmitted data) and keeps a copy; the packet finds no buffer space at the router and is dropped, and the copy is resent when there is free buffer space.]
[Plot: λ_out vs. λ'_in, both axes up to R/2.]
when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)

Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered

[Figure: Host A times out and sends a second copy while the router has free buffer space; both copies reach Host B.]
[Plot: λ_out vs. λ'_in, both axes up to R/2.]
when sending at R/2, some packets are retransmissions, including duplicates that are delivered!

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  – decreasing goodput

Causes/costs of congestion: scenario 3 (slides 93-94)

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput -> 0

[Figure: Hosts A, B, C and D send over multihop paths through routers with finite shared output link buffers; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out at the destination.]
[Plot: λ_out (throughput by blue traffic) vs. λ'_in (offered load by Host A), both axes up to R/2; annotation: bandwidth wastage for packets dropped at the 2nd router.]

another "cost" of congestion:
• when packet dropped, any upstream transmission capacity used for that packet was wasted!

Approaches towards congestion control

95

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at

TCP congestion control: additive increase multiplicative decrease (AIMD)

96

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss

[Plot: cwnd (TCP sender congestion window size) vs. time. AIMD saw tooth behavior, probing for bandwidth: additively increase window size ... until loss occurs (then cut window in half).]
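The saw tooth can be reproduced with a few lines of Python. This is an idealized sketch under assumptions: time advances in whole RTTs, cwnd is measured in MSS units, and the set of RTTs at which loss is detected (`loss_at`) is supplied by hand rather than produced by a network model.

```python
def aimd_trace(rtts: int, loss_at, start: float = 1, mss: float = 1):
    """Return cwnd at the start of each RTT under additive increase, multiplicative decrease."""
    cwnd = start
    trace = []
    for t in range(rtts):
        trace.append(cwnd)
        if t in loss_at:
            cwnd = max(mss, cwnd / 2)   # multiplicative decrease: cut cwnd in half
        else:
            cwnd += mss                 # additive increase: +1 MSS per RTT
    return trace


print(aimd_trace(6, loss_at={3}, start=8))
```

Plotting a longer trace with periodic losses gives the saw tooth shape sketched on the slide.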

TCP Congestion Control: details

97

• sender limits transmission:

  $LastByteSent - LastByteAcked \le cwnd$

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

  $rate \approx \frac{cwnd}{RTT}$ bytes/sec

[Figure: sender sequence number space: last byte ACKed | sent, not-yet ACKed ("in-flight") | last byte sent; the in-flight span is bounded by cwnd.]

TCP Slow Start

98

• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A / Host B timeline: one segment in the first RTT, two segments in the second, four segments in the third.]
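Why "increment cwnd for every ACK" doubles the window each RTT can be seen in a small sketch. It assumes an idealized round: every segment sent in one RTT is ACKed before the next RTT begins, and cwnd is counted in MSS units. The function name is hypothetical.

```python
def slow_start_trace(rtts: int, mss: int = 1):
    """Return cwnd (in MSS units) at the start of each RTT during slow start."""
    cwnd = mss
    trace = []
    for _ in range(rtts):
        trace.append(cwnd)
        acks_this_rtt = cwnd // mss      # one ACK per segment sent this RTT
        for _ in range(acks_this_rtt):
            cwnd += mss                  # +1 MSS per ACK received
    return trace


print(slow_start_trace(4))
```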

TCP: detecting, reacting to loss

• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate acks)

99

TCP: switching from slow start to CA

100

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

(FSM; each transition is written as event / actions, and Λ means no event or no action)

initially: Λ / cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0  -> slow start

state "slow start":
  new ACK / cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK / dupACKcount++
  cwnd > ssthresh / Λ  -> congestion avoidance
  timeout / ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
  dupACKcount == 3 / ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment  -> fast recovery

state "congestion avoidance":
  new ACK / cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK / dupACKcount++
  timeout / ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment  -> slow start
  dupACKcount == 3 / ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment  -> fast recovery

state "fast recovery":
  duplicate ACK / cwnd = cwnd + MSS; transmit new segment(s), as allowed
  new ACK / cwnd = ssthresh; dupACKcount = 0  -> congestion avoidance
  timeout / ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment  -> slow start
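The three-state machine summarized on this slide can be sketched as a small class. This is an illustrative sketch, not a real TCP stack: the class and method names are hypothetical, cwnd is kept in bytes, the "+ 3" inflation is read as 3 MSS (an assumption), and retransmission of the missing segment is left out.

```python
class RenoSender:
    """Sketch of the slow start / congestion avoidance / fast recovery FSM."""

    def __init__(self, mss=1460):
        self.mss = mss
        self.state = "slow_start"
        self.cwnd = float(mss)        # cwnd = 1 MSS
        self.ssthresh = 64 * 1024.0   # ssthresh = 64 KB
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "slow_start":
            self.cwnd += self.mss                          # exponential growth
        elif self.state == "congestion_avoidance":
            self.cwnd += self.mss * (self.mss / self.cwnd)  # about 1 MSS per RTT
        else:  # fast_recovery: deflate the window and resume linear growth
            self.cwnd = self.ssthresh
            self.state = "congestion_avoidance"
        self.dup_acks = 0
        if self.state == "slow_start" and self.cwnd > self.ssthresh:
            self.state = "congestion_avoidance"

    def on_duplicate_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += self.mss
            return
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss  # assumption: "+ 3" means 3 MSS
            self.state = "fast_recovery"              # (retransmission omitted)

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = float(self.mss)
        self.dup_acks = 0
        self.state = "slow_start"                     # (retransmission omitted)
```

A short usage example: after one new ACK, three duplicate ACKs move the sender into fast recovery with ssthresh set to half of the old cwnd.

```python
s = RenoSender(mss=1000)
s.on_new_ack()
for _ in range(3):
    s.on_duplicate_ack()
print(s.state, s.ssthresh, s.cwnd)
```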

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT

$\text{avg TCP throughput} = \frac{3}{4} \cdot \frac{W}{\text{RTT}}$ bytes/sec

[Figure: sawtooth of window size over time, oscillating between W/2 and W, with average 3/4 W]

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

$$\text{TCP throughput} = \frac{1.22 \cdot \text{MSS}}{\text{RTT} \cdot \sqrt{L}}$$

• to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate!
• new versions of TCP for high-speed
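The two numbers on this slide can be checked with a few lines of arithmetic. The sketch below assumes MSS is expressed in bytes and throughput in bits/sec; the function names are illustrative.

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_s, loss):
    """Throughput predicted by 1.22 * MSS / (RTT * sqrt(L)), in bits/sec."""
    return 1.22 * mss_bytes * 8 / (rtt_s * math.sqrt(loss))


def loss_for_target(mss_bytes, rtt_s, target_bps):
    """Loss probability L at which the formula yields target_bps."""
    return (1.22 * mss_bytes * 8 / (rtt_s * target_bps)) ** 2


def window_segments(target_bps, rtt_s, mss_bytes):
    """In-flight segments needed to fill the pipe: bandwidth * RTT / segment size."""
    return target_bps * rtt_s / (mss_bytes * 8)


print(window_segments(10e9, 0.1, 1500))   # about 83,333 segments
print(loss_for_target(1500, 0.1, 10e9))   # about 2e-10
```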

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput (vertical axis, up to R) against Connection 1 throughput (horizontal axis, up to R). The full bandwidth utilization line is the set of points (X1, Y1) where X1 + Y1 = R; the equal bandwidth share line is the set of points (X2, Y2) where X2 = Y2. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2"]
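The convergence argument can be illustrated with a toy simulation. It assumes two sessions with equal RTT that both see a loss whenever their combined rate exceeds R; the function name, the step size, and the synchronized-loss model are simplifying assumptions.

```python
def aimd(x, y, capacity, rounds, step=1.0):
    """Two flows under additive increase / multiplicative decrease."""
    for _ in range(rounds):
        if x + y > capacity:
            x, y = x / 2, y / 2        # loss: both halve, so the gap x - y halves too
        else:
            x, y = x + step, y + step  # additive increase keeps the gap unchanged
    return x, y


x, y = aimd(10.0, 80.0, capacity=100.0, rounds=2000)
print(abs(x - y))
```

Additive increase moves the operating point along a line of slope 1, while each halving pulls it toward the origin, so the difference between the two rates shrinks with every loss event.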

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram is labelled ECN=00 before the congested router and ECN=11 after it; the TCP ACK segment returned by the destination carries ECE=1]


UDP: User Datagram Protocol [RFC 768]

• "bare bones" Internet transport protocol
• "best effort" service, UDP segments may be:
  - lost
  - delivered out-of-order to app
• connectionless:
  - no handshaking between UDP sender, receiver
  - each UDP segment handled independently of others
• UDP use:
  - streaming multimedia apps (loss tolerant, rate sensitive)
  - DNS
  - SNMP
• reliable transfer over UDP:
  - add reliability at application layer
  - application-specific error recovery

UDP: segment header

UDP segment format (each row is 32 bits):
  source port # | dest port #
  length        | checksum
  application data (payload)

length: in bytes of UDP segment, including header

why is there a UDP?
• no connection establishment (which can add delay)
• simple: no connection state at sender, receiver
• small header size
• no congestion control: UDP can blast away as fast as desired

UDP checksum

Goal: detect "errors" (e.g., flipped bits) in transmitted segment

sender:
• treat segment contents, including header fields, as sequence of 16-bit integers
• checksum: addition (one's complement sum) of segment contents
• sender puts checksum value into UDP checksum field

receiver:
• compute checksum of received segment
• check if computed checksum equals checksum field value:
  - NO: error detected
  - YES: no error detected. But maybe errors nonetheless? More later ...

Internet checksum: example

example: add two 16-bit integers

               1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
               1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
wraparound   1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1
sum            1 0 1 1 1 0 1 1 1 0 1 1 1 1 0 0
checksum       0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 1

Note: when adding numbers, a carryout from the most significant bit needs to be added to the result
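The wraparound rule is easy to get wrong by hand, so here is a small sketch of the computation on 16-bit words. The function names are illustrative, and the input is assumed to be already split into 16-bit integers.

```python
def ones_complement_sum16(words):
    """One's complement sum of 16-bit words: any carry out of bit 15 is added back in."""
    total = 0
    for w in words:
        total += w
        total = (total & 0xFFFF) + (total >> 16)  # wrap the carry around
    return total


def internet_checksum(words):
    """Checksum field value: bitwise complement of the one's complement sum."""
    return ~ones_complement_sum16(words) & 0xFFFF


a = 0b1110011001100110
b = 0b1101010101010101
print(format(ones_complement_sum16([a, b]), "016b"))  # sum row of the example
print(format(internet_checksum([a, b]), "016b"))      # checksum row of the example
```

A receiver can recompute `internet_checksum` over the received words and compare it with the checksum field; a mismatch means an error was detected.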

Principles of reliable data transfer

• important in application, transport, link layers
  - top-10 list of important networking topics!
• characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)


Reliable data transfer: getting started

send side:
• rdt_send(): called from above (e.g., by app). Passed data to deliver to receiver upper layer
• udt_send(): called by rdt, to transfer packet over unreliable channel to receiver

receive side:
• rdt_rcv(): called when packet arrives on rcv-side of channel
• deliver_data(): called by rdt to deliver data to upper layer

Reliable data transfer: getting started

We'll:
• incrementally develop sender, receiver sides of reliable data transfer protocol (rdt)
• consider only unidirectional data transfer
  - but control info will flow in both directions!
• use finite state machines (FSMs) to specify sender, receiver

[Figure: FSM notation. An arrow from state 1 to state 2 is labelled with the event causing the state transition and, below it, the actions taken on the state transition]

state: when in this "state", next state is uniquely determined by next event

rdt1.0: reliable transfer over a reliable channel

• underlying channel perfectly reliable
  - no bit errors
  - no loss of packets
• separate FSMs for sender, receiver:
  - sender sends data into underlying channel
  - receiver reads data from underlying channel

sender, state "Wait for call from above":
  rdt_send(data) / packet = make_pkt(data); udt_send(packet)

receiver, state "Wait for call from below":
  rdt_rcv(packet) / extract(packet, data); deliver_data(data)

rdt2.0: channel with bit errors

• underlying channel may flip bits in packet
  - checksum to detect bit errors
• the question: how to recover from errors?
  - acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
  - negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
  - sender retransmits pkt on receipt of NAK
• new mechanisms in rdt2.0 (beyond rdt1.0):
  - error detection
  - receiver feedback: control msgs (ACK, NAK) rcvr->sender

How do humans recover from "errors" during conversation?


rdt2.0: FSM specification

(each transition is written as event / actions; Λ means no action)

sender
  state "Wait for call from above":
    rdt_send(data) / sndpkt = make_pkt(data, checksum); udt_send(sndpkt)  -> "Wait for ACK or NAK"
  state "Wait for ACK or NAK":
    rdt_rcv(rcvpkt) && isNAK(rcvpkt) / udt_send(sndpkt)
    rdt_rcv(rcvpkt) && isACK(rcvpkt) / Λ  -> "Wait for call from above"

receiver
  state "Wait for call from below":
    rdt_rcv(rcvpkt) && corrupt(rcvpkt) / udt_send(NAK)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) / extract(rcvpkt, data); deliver_data(data); udt_send(ACK)

rdt2.0: operation with no errors

[Figure: the rdt2.0 sender and receiver FSMs with the error-free path highlighted: rdt_send(data) with udt_send(sndpkt), the receiver's notcorrupt(rcvpkt) transition with udt_send(ACK), and the sender's isACK(rcvpkt) transition back to "Wait for call from above"]

rdt2.0: error scenario

[Figure: the rdt2.0 sender and receiver FSMs with the error path highlighted: the receiver's corrupt(rcvpkt) transition with udt_send(NAK), and the sender's isNAK(rcvpkt) transition with udt_send(sndpkt)]

rdt2.0 has a fatal flaw!

what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver!
• can't just retransmit: possible duplicate

handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt

stop and wait: sender sends one packet, then waits for receiver response

rdt2.1: sender, handles garbled ACK/NAKs

  state "Wait for call 0 from above":
    rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)  -> "Wait for ACK or NAK 0"
  state "Wait for ACK or NAK 0":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ  -> "Wait for call 1 from above"
  state "Wait for call 1 from above":
    rdt_send(data) / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt)  -> "Wait for ACK or NAK 1"
  state "Wait for ACK or NAK 1":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ  -> "Wait for call 0 from above"

rdt2.1: receiver, handles garbled ACK/NAKs

  state "Wait for 0 from below":
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)  -> "Wait for 1 from below"
    rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
  state "Wait for 1 from below":
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)  -> "Wait for 0 from below"
    rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)

rdt2.1: Example 1

(walk-through of the rdt2.1 FSMs, one highlighted transition per step)

1. sender, in "Wait for call 0 from above": rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
2. receiver, in "Wait for 0 from below": rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender, in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
4. receiver, in "Wait for 0 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
5. sender, in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ
6. sender is now in "Wait for call 1 from above"; receiver is in "Wait for 1 from below"

rdt2.1: Example 2

(walk-through of the rdt2.1 FSMs, one highlighted transition per step)

1. receiver, in "Wait for 0 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
2. sender, in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
3. receiver, in "Wait for 1 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
4. sender, in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ
5. sender is now in "Wait for call 1 from above"; receiver is in "Wait for 1 from below"

rdt2.1: discussion

sender:
• seq # added to pkt
• two seq. #'s (0, 1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  - state must "remember" whether "expected" pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  - state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender

rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  - receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt

rdt2.2: sender, receiver fragments

sender FSM fragment
  state "Wait for call 0 from above":
    rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)  -> "Wait for ACK 0"
  state "Wait for ACK 0":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)) / udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0) / Λ  -> next state

receiver FSM fragment
  transition into "Wait for 0 from below":
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
  state "Wait for 0 from below":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)) / udt_send(sndpkt)

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
  - checksum, seq. #, ACKs, retransmissions will be of help ... but not enough

approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  - retransmission will be duplicate, but seq. #'s already handle this
  - receiver must specify seq # of pkt being ACKed
• requires countdown timer

rdt3.0 sender

  state "Wait for call 0 from above":
    rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer  -> "Wait for ACK0"
    rdt_rcv(rcvpkt) / Λ
  state "Wait for ACK0":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)) / Λ
    timeout / udt_send(sndpkt); start_timer
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0) / stop_timer  -> "Wait for call 1 from above"
  state "Wait for call 1 from above":
    rdt_send(data) / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer  -> "Wait for ACK1"
    rdt_rcv(rcvpkt) / Λ
  state "Wait for ACK1":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0)) / Λ
    timeout / udt_send(sndpkt); start_timer
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1) / stop_timer  -> "Wait for call 0 from above"
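The four-state sender can be collapsed into one class that tracks the current sequence bit and whether an ACK is outstanding. The sketch below is illustrative only: the class name, the `sent` log standing in for udt_send, and the choice to refuse new data while waiting (the FSM simply has no such transition) are assumptions.

```python
class Rdt30Sender:
    """Alternating-bit (stop-and-wait) sender in the style of rdt3.0."""

    def __init__(self):
        self.seq = 0                  # sequence bit of the next/current packet
        self.waiting_for_ack = False
        self.sndpkt = None
        self.timer_running = False
        self.sent = []                # log of every packet handed to the channel

    def rdt_send(self, data):
        if self.waiting_for_ack:
            return False              # assumption: refuse data until the ACK arrives
        self.sndpkt = (self.seq, data)
        self.sent.append(self.sndpkt)  # udt_send(sndpkt)
        self.timer_running = True      # start_timer
        self.waiting_for_ack = True
        return True

    def rdt_rcv(self, ack_seq, corrupt=False):
        if not self.waiting_for_ack:
            return                    # stray packet while idle: no action
        if corrupt or ack_seq != self.seq:
            return                    # garbled or wrong ACK: no action, let the timer expire
        self.timer_running = False    # stop_timer
        self.waiting_for_ack = False
        self.seq ^= 1                 # move on to the other sequence number

    def timeout(self):
        if self.waiting_for_ack:
            self.sent.append(self.sndpkt)  # retransmit
            self.timer_running = True      # restart timer
```

For example, a wrong ACK followed by a timeout causes exactly one retransmission of the same packet:

```python
s = Rdt30Sender()
s.rdt_send("a")
s.rdt_rcv(1)      # ACK for the other sequence number is ignored
s.timeout()
print(s.sent)
```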

rdt3.0 in action

(a) no loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0

(b) packet loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 (X: loss)
  sender: timeout, resend pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0

(c) ACK loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (X: loss)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ACK delayed)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  sender: rcv ack1, do nothing
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

$$D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \text{microsecs}$$

• $U_{sender}$: utilization, fraction of time sender busy sending

$$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$$

• if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
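The utilization figure can be reproduced directly from the slide's parameters. The function names below are illustrative; RTT is taken as twice the 15 ms propagation delay.

```python
def sender_utilization(packet_bits, rate_bps, rtt_s):
    """Fraction of time a stop-and-wait sender is busy: (L/R) / (RTT + L/R)."""
    d_trans = packet_bits / rate_bps
    return d_trans / (rtt_s + d_trans)


def stop_and_wait_throughput_bps(packet_bits, rate_bps, rtt_s):
    """One packet delivered per (RTT + L/R) seconds."""
    return packet_bits / (rtt_s + packet_bits / rate_bps)


u = sender_utilization(8000, 1e9, 0.030)
t = stop_and_wait_throughput_bps(8000, 1e9, 0.030)
print(round(u, 5))          # 0.00027
print(round(t / 8 / 1000))  # about 33 kB/sec
```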

rdt3.0: stop-and-wait operation

[Figure: sender/receiver timeline]
• first packet bit transmitted, t = 0
• last packet bit transmitted, t = L / R
• first packet bit arrives
• last packet bit arrives, send ACK
• ACK arrives, send next packet, t = RTT + L / R

$$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$$

Pipelined protocols

pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver

• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

[Figure: sender/receiver timeline with three packets in flight]
• first packet bit transmitted, t = 0
• last bit transmitted, t = L / R
• first packet bit arrives
• last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• ACK arrives, send next packet, t = RTT + L / R

3-packet pipelining increases utilization by a factor of 3!

$$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} = 0.0008$$

Pipelined protocols: overview

Go-Back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: "cumulative ACK"
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

initially: Λ / base = 1; nextseqnum = 1

state "Wait":

rdt_send(data) /
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)

timeout /
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  ...
  udt_send(sndpkt[nextseqnum-1])

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) /
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer

rdt_rcv(rcvpkt) && corrupt(rcvpkt) /
  Λ
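The extended FSM translates almost line for line into a small class. In this sketch the class name, the `wire` list standing in for udt_send, and the boolean timer flag are illustrative; sequence numbers are unbounded integers rather than k-bit values, and the `max` guard against a stale ACK moving base backwards is an addition not present in the FSM.

```python
class GbnSender:
    """Go-Back-N sender following the extended FSM (sketch)."""

    def __init__(self, window_size):
        self.N = window_size
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}            # seq -> packet, kept for retransmission
        self.timer_running = False
        self.wire = []              # every packet handed to the channel, in order

    def rdt_send(self, data):
        if self.nextseqnum < self.base + self.N:
            pkt = (self.nextseqnum, data)
            self.sndpkt[self.nextseqnum] = pkt
            self.wire.append(pkt)
            if self.base == self.nextseqnum:
                self.timer_running = True   # start_timer for the oldest packet
            self.nextseqnum += 1
            return True
        return False                        # refuse_data: window is full

    def timeout(self):
        self.timer_running = True           # start_timer
        for seq in range(self.base, self.nextseqnum):
            self.wire.append(self.sndpkt[seq])  # resend every unacked packet

    def rdt_rcv_ack(self, acknum, corrupt=False):
        if corrupt:
            return                          # no action
        self.base = max(self.base, acknum + 1)  # cumulative ACK (guard is an addition)
        self.timer_running = self.base != self.nextseqnum
```

With N = 4, a fifth call to `rdt_send` is refused until an ACK slides the window forward, and a timeout resends everything from `base` up to `nextseqnum - 1`.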


GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #

initially: Λ / expectedseqnum = 1; sndpkt = make_pkt(0, ACK, chksum)

state "Wait":

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum) /
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++

default /
  udt_send(sndpkt)


GBN in action

sender window (N=4), over sequence numbers 0 1 2 3 4 5 6 7 8

sender:
• send pkt0
• send pkt1
• send pkt2 (X: loss)
• send pkt3
• (wait)
• rcv ack0, send pkt4
• rcv ack1, send pkt5
• ignore duplicate ACK
• pkt 2 timeout: send pkt2, send pkt3, send pkt4, send pkt5

receiver:
• receive pkt0, send ack0
• receive pkt1, send ack1
• receive pkt3, discard, (re)send ack1
• receive pkt4, discard, (re)send ack1
• receive pkt5, discard, (re)send ack1
• rcv pkt2, deliver, send ack2
• rcv pkt3, deliver, send ack3
• rcv pkt4, deliver, send ack4
• rcv pkt5, deliver, send ack5


Selective repeat

• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #s of sent, unACKed packets

Selective repeat: sender, receiver windows

[Figure: sender and receiver views of the sequence number space]

Selective repeat

sender

data from above:
• if next available seq # in window, send pkt

timeout(n):
• resend pkt n, restart timer

ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver

pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt

pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)

otherwise:
• ignore
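The receiver rules can be sketched as a single method that returns the ACK to send. This is a simplified illustration: the class and attribute names are hypothetical, sequence numbers are unbounded integers (no wraparound), and corruption checking is left out.

```python
class SrReceiver:
    """Selective-repeat receiver: buffer out-of-order packets, deliver in order."""

    def __init__(self, window_size):
        self.N = window_size
        self.rcvbase = 0
        self.buffer = {}       # seq -> data, received but not yet delivered
        self.delivered = []    # data passed to the upper layer, in order

    def on_packet(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcvbase <= n <= self.rcvbase + self.N - 1:
            self.buffer[n] = data
            # deliver the in-order run starting at rcvbase and advance the window
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n <= self.rcvbase - 1:
            return n           # already delivered: re-ACK so the sender can advance
        return None            # otherwise: ignore
```

Re-ACKing packets from the previous window matters: if the original ACK was lost, the sender would otherwise keep retransmitting a packet the receiver already delivered.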

Selective repeat in action

sender window (N=4), over sequence numbers 0 1 2 3 4 5 6 7 8

sender:
• send pkt0
• send pkt1
• send pkt2 (X: loss)
• send pkt3
• (wait)
• rcv ack0, send pkt4
• rcv ack1, send pkt5
• record ack3 arrived
• pkt 2 timeout: send pkt2
• record ack4 arrived
• record ack5 arrived

receiver:
• receive pkt0, send ack0
• receive pkt1, send ack1
• receive pkt3, buffer, send ack3
• receive pkt4, buffer, send ack4
• receive pkt5, buffer, send ack5
• rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

[Figure: two scenarios over the sequence numbers 0 1 2 3 0 1 2, showing the sender window and the receiver window after each receipt.
(a) no problem: pkt0, pkt1, pkt2 are received and ACKed; pkt3 is lost (X); the sender then sends a new pkt0; the receiver will accept packet with seq number 0.
(b) oops!: pkt0, pkt1, pkt2 are received but all three ACKs are lost (X X X); the sender times out and retransmits pkt0; the receiver will accept packet with seq number 0.]

• receiver can't see sender side
• receiver behavior identical in both cases!
• something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
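The slide leaves the question open. The usual textbook answer is that the window size must be at most half the sequence-number space; the sketch below checks that condition by brute force for the worst case, where the receiver has advanced by a full window but every ACK was lost. The function name and this worst-case model are assumptions made for illustration.

```python
def sr_ambiguous(seq_space, window):
    """True if a retransmitted old packet can land inside the receiver's new window."""
    # Sender still holds the old window; receiver has moved on by `window` packets.
    old = {n % seq_space for n in range(0, window)}
    new = {n % seq_space for n in range(window, 2 * window)}
    return bool(old & new)


print(sr_ambiguous(4, 3))  # the slide's example
print(sr_ambiguous(4, 2))
```

With 4 sequence numbers and a window of 3 the old and new windows overlap (scenario (b)); shrinking the window to 2 removes the overlap.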

TCP: Overview  RFCs: 793, 1122, 1323, 2018, 2581

• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver

TCP segment structure

segment layout (each row is 32 bits):
  source port # | dest port #
  sequence number
  acknowledgement number
  head len | not used | U A P R S F | receive window
  checksum | Urg data pointer
  options (variable length)
  application data (variable length)

field notes:
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)

TCP seq. numbers, ACKs

sequence numbers:
  - byte stream "number" of first byte in segment's data

acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK

Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say, up to implementor

[Figure: an outgoing segment from sender and an incoming segment to sender, each showing the header fields source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer. The incoming acknowledgement number A points into the sender sequence number space, which is divided into: sent, ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable. The window size is N]

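Byte stream in TCP

[Figure: an HTTP GET message (K bytes) viewed as a byte stream. A window of N bytes starts at the 100th byte; M bytes are carried in a segment whose TCP header has seq no = 100; the part of the message beyond the window cannot be transmitted now]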

TCP seq. numbers, ACKs

simple telnet scenario

1. Host A: user types 'C'; A sends Seq=42, ACK=79, data = 'C'
2. Host B: host ACKs receipt of 'C', echoes back 'C'; B sends Seq=79, ACK=43, data = 'C'
3. Host A: host ACKs receipt of echoed 'C'; A sends Seq=43, ACK=80

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT

TCP round trip time, timeout

$\text{EstimatedRTT} = (1 - \alpha) \cdot \text{EstimatedRTT} + \alpha \cdot \text{SampleRTT}$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: $\alpha = 0.125$

[Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr. Vertical axis: RTT (milliseconds), 100 to 350; horizontal axis: time (seconds), 1 to 106; curves: SampleRTT and EstimatedRTT]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

$\text{DevRTT} = (1 - \beta) \cdot \text{DevRTT} + \beta \cdot |\text{SampleRTT} - \text{EstimatedRTT}|$   (typically, $\beta = 0.25$)

$\text{TimeoutInterval} = \text{EstimatedRTT} + 4 \cdot \text{DevRTT}$

(EstimatedRTT is the estimated RTT; 4 · DevRTT is the "safety margin")
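The two moving averages and the timeout formula fit in a few lines. In the sketch below the class name is illustrative, and two details are assumptions not stated on the slides: the estimator is seeded with the first sample (DevRTT starting at half of it, as RFC 6298 does), and DevRTT is updated using the EstimatedRTT value from before the current sample.

```python
class RttEstimator:
    """EWMA-based RTT estimate and timeout interval (sketch)."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample   # assumption: seed with the first sample
        self.dev_rtt = first_sample / 2     # assumption: initial deviation, as in RFC 6298

    def update(self, sample_rtt):
        # Deviation first, using the previous EstimatedRTT (ordering is an assumption).
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    @property
    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)   # milliseconds
est.update(200.0)
print(est.estimated_rtt, est.dev_rtt, est.timeout_interval)
```

A single outlier of 200 ms moves the estimate only one eighth of the way toward it, while the deviation term widens the safety margin noticeably.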

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks

let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control

TCP sender events

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments

TCP sender (simplified)

initially: Λ / NextSeqNum = InitialSeqNum; SendBase = InitialSeqNum

state "wait for event":

data received from application above /
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., "send")
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
    start timer

timeout /
  retransmit not-yet-acked segment with smallest seq. #
  start timer

ACK received, with ACK field value y /
  if (y > SendBase) {
    SendBase = y
    /* SendBase-1: last cumulatively ACKed byte */
    if (there are currently not-yet-acked segments)
      start timer
    else stop timer
  }
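The three events can be sketched as methods on a small class. The names are illustrative, the `wire` list stands in for passing segments to IP, the timer is reduced to a boolean flag, and flow and congestion control are ignored as on the slide.

```python
class SimpleTcpSender:
    """Simplified TCP sender: cumulative ACKs and a single retransmission timer."""

    def __init__(self, initial_seq):
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.unacked = {}           # seq of first byte -> segment data
        self.timer_running = False
        self.wire = []              # segments passed to IP, in order

    def on_app_data(self, data):
        self.unacked[self.next_seq] = data
        self.wire.append((self.next_seq, data))
        self.next_seq += len(data)  # sequence numbers count bytes
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if self.unacked:
            seq = min(self.unacked)  # not-yet-acked segment with smallest seq #
            self.wire.append((seq, self.unacked[seq]))
        self.timer_running = bool(self.unacked)

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y       # send_base - 1 is the last cumulatively ACKed byte
            for seq in [s for s in self.unacked if s + len(self.unacked[s]) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)
```

An ACK whose value does not exceed `send_base` is silently ignored here; reacting to such duplicate ACKs is the subject of fast retransmit.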

TCP: retransmission scenarios

lost ACK scenario
  Host A: Seq=92, 8 bytes of data
  Host B: ACK=100 (X: lost)
  Host A: timeout; resend Seq=92, 8 bytes of data
  Host B: ACK=100

premature timeout
  Host A: Seq=92, 8 bytes of data (SendBase=92)
  Host A: Seq=100, 20 bytes of data
  Host B: ACK=100
  Host B: ACK=120
  Host A: timeout before the ACKs arrive; resend Seq=92, 8 bytes of data
  Host A: ACKs arrive (SendBase=100, then SendBase=120)
  Host B: ACK=120 (SendBase stays at 120)

cumulative ACK
  Host A: Seq=92, 8 bytes of data
  Host A: Seq=100, 20 bytes of data
  Host B: ACK=100 (X: lost)
  Host B: ACK=120 (arrives before timeout)
  Host A: Seq=120, 15 bytes of data

TCP ACK generation [RFC 5681]

event at receiver: arrival of in-order segment with expected seq #. All data up to expected seq # already ACKed
TCP receiver action: delayed ACK. Wait up to 500ms for next segment. If no next segment, send ACK

event at receiver: arrival of in-order segment with expected seq #. One other segment has ACK pending
TCP receiver action: immediately send single cumulative ACK, ACKing both in-order segments

event at receiver: arrival of out-of-order segment, higher-than-expected seq. #. Gap detected
TCP receiver action: immediately send duplicate ACK, indicating seq. # of next expected byte

event at receiver: arrival of segment that partially or completely fills gap
TCP receiver action: immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 duplicate ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout

[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes of data and Seq=100, 20 bytes of data; the Seq=100 segment is lost (X). Host B returns ACK=100 four times (the first ACK plus 3 DUP ACKs), and A resends Seq=100, 20 bytes of data before its timeout expires.]
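The duplicate-ACK rule can be expressed as a small counter. The sketch below is illustrative only; the function name `fast_retransmit_events` and its `threshold` parameter are assumptions made for this example.

```python
def fast_retransmit_events(acks, threshold=3):
    """Return the ACK numbers that trigger a fast retransmit.

    acks: ACK field values in the order they arrive at the sender.
    A retransmit fires when the `threshold`-th duplicate of an ACK arrives,
    i.e. on the fourth identical ACK when threshold is 3.
    """
    retransmits = []
    last_ack = None
    dup_count = 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == threshold:
                retransmits.append(ack)   # resend segment starting at byte `ack`
        else:
            last_ack = ack                # new ACK: reset the duplicate counter
            dup_count = 0
    return retransmits
```

With the ACK stream from the figure, `fast_retransmit_events([100, 100, 100, 100])` reports a retransmission of the segment starting at byte 100, while three identical ACKs alone are not enough.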

TCP flow control

[Figure: receiver protocol stack. Segments from sender pass up through IP code and TCP code into the TCP socket receiver buffers (in the OS); the application process removes data from the socket buffers.]

application may remove data from TCP socket buffers ... slower than TCP receiver is delivering (sender is sending)

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); buffered data goes to the application process.]
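A minimal sketch of the receive-window bookkeeping follows. The slides do not give a formula; the relation rwnd = RcvBuffer - (LastByteRcvd - LastByteRead) used here is the usual textbook one and is an assumption of this example, as are the names `ReceiveBuffer` and `sender_can_send`.

```python
class ReceiveBuffer:
    """Tracks free space in the receiver's socket buffer (illustrative only)."""

    def __init__(self, rcv_buffer=4096):
        self.rcv_buffer = rcv_buffer
        self.last_byte_rcvd = 0   # highest byte placed in the buffer
        self.last_byte_read = 0   # highest byte consumed by the application

    @property
    def rwnd(self):
        # free buffer space advertised to the sender
        return self.rcv_buffer - (self.last_byte_rcvd - self.last_byte_read)

    def receive(self, nbytes):
        if nbytes > self.rwnd:
            raise OverflowError("sender exceeded advertised window")
        self.last_byte_rcvd += nbytes

    def app_read(self, nbytes):
        nbytes = min(nbytes, self.last_byte_rcvd - self.last_byte_read)
        self.last_byte_read += nbytes
        return nbytes


def sender_can_send(last_byte_sent, last_byte_acked, rwnd):
    """Bytes the sender may still put in flight without exceeding rwnd."""
    return max(0, rwnd - (last_byte_sent - last_byte_acked))
```

If the application reads slowly, `rwnd` shrinks and `sender_can_send` approaches zero, which is how the receiver throttles the sender.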

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

[Figure: client and server each hold, between application and network: connection state: ESTAB; connection variables: seq # client-to-server and server-to-client, rcvBuffer size at server, client.]

client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

client state: CLOSED; server state: LISTEN

1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client enters SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server enters SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client enters ESTAB
4. server: received ACK(y) indicates client is live; server enters ESTAB
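The sequence and acknowledgement numbers exchanged in the handshake can be traced with a few lines of Python. This is only a sketch: segments are modelled as plain dictionaries, the function name `three_way_handshake` is an assumption, and the modulo reflects the fact that TCP sequence numbers are 32 bits wide.

```python
SEQ_MODULUS = 2 ** 32  # TCP sequence numbers wrap around at 32 bits


def three_way_handshake(x, y):
    """Return the SYN, SYNACK and ACK segments for client ISN x and server ISN y."""
    syn = {"SYNbit": 1, "Seq": x}
    synack = {
        "SYNbit": 1,
        "Seq": y,
        "ACKbit": 1,
        "ACKnum": (syn["Seq"] + 1) % SEQ_MODULUS,    # acks the client's SYN
    }
    ack = {"ACKbit": 1, "ACKnum": (synack["Seq"] + 1) % SEQ_MODULUS}
    return [syn, synack, ack]
```

For example, `three_way_handshake(1000, 5000)` yields a SYNACK carrying ACKnum 1001 and a final ACK carrying ACKnum 5001.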

TCP 3-way handshake: FSM

server side:
  closed -> listen: Socket connectionSocket = welcomeSocket.accept(); / Λ
  listen -> SYN rcvd: SYN(x) / SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
  SYN rcvd -> ESTAB: ACK(ACKnum=y+1) / Λ

client side:
  closed -> SYN sent: Socket clientSocket = new Socket("hostname", "port number"); / SYN(seq=x)
  SYN sent -> ESTAB: SYNACK(seq=y, ACKnum=x+1) / ACK(ACKnum=y+1)

TCP: closing a connection

• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

TCP: closing a connection (client state and server state both start in ESTAB)

1. client: clientSocket.close(); sends FINbit=1, seq=x; enters FIN_WAIT_1 (can no longer send but can receive data)
2. server: replies ACKbit=1, ACKnum=x+1; enters CLOSE_WAIT (can still send data); client enters FIN_WAIT_2 (wait for server close)
3. server: sends FINbit=1, seq=y; enters LAST_ACK (can no longer send data)
4. client: replies ACKbit=1, ACKnum=y+1; enters TIMED_WAIT (timed wait for 2 * max segment lifetime), then CLOSED
5. server: on receiving the ACK, enters CLOSED

The "Two Army Problem"

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2

[Figure: Host A and Host B send original data at rate λ_in into a router with unlimited shared output link buffers; throughput is λ_out. Left plot: λ_out versus λ_in, both axes marked at R/2. Right plot: delay versus λ_in, with large delays as arrival rate λ_in approaches capacity.]

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λ_in = λ_out
  - transport-layer input includes retransmissions: λ'_in ≥ λ_in

[Figure: Host A sends λ_in (original data) and λ'_in (original data plus retransmitted data) into a router with finite shared output link buffers; Host B receives λ_out.]

idealization: perfect knowledge
• sender sends only when router buffers available
[Plot: λ_out versus λ_in, both axes marked at R/2. In the accompanying figure a copy of each packet is kept at the sender and sent only when there is free buffer space.]

idealization: known loss
packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
[Plot: λ_out versus λ'_in, both axes marked at R/2.] when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)

realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
[Plot: λ_out versus λ'_in, both axes marked at R/2.] when sending at R/2, some packets are retransmissions, including duplicates that are delivered!

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput -> 0

[Figure: Hosts A, B, C and D send over multihop paths through routers with finite shared output link buffers; λ_in: original data; λ'_in: original data plus retransmitted data; λ_out: throughput.]

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!

[Plot: throughput by blue traffic (λ_out) versus offered load by Host A (λ'_in), both axes marked at R/2; annotation: bandwidth wastage for packets dropped at the 2nd router.]

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at

TCP congestion control: additive increase multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss

[Plot: cwnd (TCP sender congestion window size) versus time. AIMD saw tooth behavior, probing for bandwidth: additively increase window size ... until loss occurs (then cut window in half).]
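The saw tooth can be reproduced with a short trace. This is a sketch under simplifying assumptions: cwnd is counted in units of MSS, one step equals one RTT, and the set of RTTs in which a loss is detected (`loss_at`) is supplied by hand; the function name `aimd_trace` is invented for this example.

```python
def aimd_trace(rtts, loss_at, mss=1, start=10):
    """Return cwnd (in MSS) at the start of each RTT.

    cwnd grows by 1 MSS per RTT and is halved after every RTT listed in loss_at.
    """
    cwnd = start
    trace = []
    for t in range(rtts):
        trace.append(cwnd)
        if t in loss_at:
            cwnd = max(mss, cwnd / 2)   # multiplicative decrease
        else:
            cwnd += mss                 # additive increase
    return trace
```

For instance, `aimd_trace(6, {2}, start=10)` climbs 10, 11, 12, drops to 6 after the loss, and then climbs again.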

TCP Congestion Control: details

• sender limits transmission:

$$ \text{LastByteSent} - \text{LastByteAcked} \le \text{cwnd} $$

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

$$ \text{rate} \approx \frac{\text{cwnd}}{RTT}\ \text{bytes/sec} $$

[Figure: sender sequence number space, showing last byte ACKed, bytes sent but not-yet ACKed ("in-flight", spanning cwnd), and last byte sent.]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A sends one segment to Host B, then two segments one RTT later, then four segments, along a time axis.]

TCP: detecting, reacting to loss

• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate acks)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

initially (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; start in slow start

slow start
  new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK: dupACKcount++
  cwnd > ssthresh: Λ -> congestion avoidance
  timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment -> slow start
  dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment -> fast recovery

congestion avoidance
  new ACK: cwnd = cwnd + MSS * (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK: dupACKcount++
  timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment -> slow start
  dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment -> fast recovery

fast recovery
  duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
  new ACK: cwnd = ssthresh; dupACKcount = 0 -> congestion avoidance
  timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment -> slow start
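The three-state summary can be sketched as a small Python class that tracks only cwnd, ssthresh and the duplicate-ACK counter. It is an illustration rather than a faithful TCP stack: segment transmission is omitted, `MSS = 1460` bytes is an assumed value, and the class name `RenoCwnd` is invented here. The slow start exit uses the `cwnd > ssthresh` test from the summary.

```python
MSS = 1460  # bytes; assumed segment size for illustration


class RenoCwnd:
    """cwnd bookkeeping for slow start, congestion avoidance and fast recovery."""

    def __init__(self):
        self.state = "slow start"
        self.cwnd = 1 * MSS
        self.ssthresh = 64 * 1024
        self.dup_ack_count = 0

    def on_new_ack(self):
        if self.state == "slow start":
            self.cwnd += MSS                       # doubles cwnd every RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion avoidance"
        elif self.state == "congestion avoidance":
            self.cwnd += MSS * MSS / self.cwnd     # about 1 MSS per RTT
        else:                                      # fast recovery
            self.cwnd = self.ssthresh
            self.state = "congestion avoidance"
        self.dup_ack_count = 0

    def on_dup_ack(self):
        if self.state == "fast recovery":
            self.cwnd += MSS
            return
        self.dup_ack_count += 1
        if self.dup_ack_count == 3:                # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1 * MSS
        self.dup_ack_count = 0
        self.state = "slow start"
```

Feeding the object one new ACK, three duplicate ACKs, another new ACK and then a timeout walks it through all three states in turn.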

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT

$$ \text{avg TCP throughput} = \frac{3}{4}\,\frac{W}{RTT}\ \text{bytes/sec} $$

[Plot: window size versus time, a saw tooth oscillating between W/2 and W, with average 3/4 W.]

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

$$ \text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT\,\sqrt{L}} $$

  to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
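The numbers in this example can be checked directly. The sketch below assumes MSS is given in bytes and throughput in bits per second; the function names are introduced only for this illustration.

```python
import math


def window_segments(target_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain target_bps over one RTT."""
    return target_bps * rtt_s / (mss_bytes * 8)


def mathis_throughput_bps(mss_bytes, rtt_s, loss):
    """Throughput = 1.22 * MSS / (RTT * sqrt(L)), with MSS converted to bits."""
    return 1.22 * mss_bytes * 8 / (rtt_s * math.sqrt(loss))


def loss_for_throughput(mss_bytes, rtt_s, target_bps):
    """Invert the formula above to get the loss rate L for a target throughput."""
    return (1.22 * mss_bytes * 8 / (rtt_s * target_bps)) ** 2
```

With 1500-byte segments, a 100 ms RTT and a 10 Gbps target, `window_segments` gives about 83,333 segments and `loss_for_throughput` gives roughly 2.1·10^-10, in line with the figures quoted above.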

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Plot: Connection 2 throughput versus Connection 1 throughput, both axes from 0 to R. The full bandwidth utilization line is the set of points (X1, Y1) where X1 + Y1 = R; the equal bandwidth share line is the set of points (X2, Y2) where X2 = Y2. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2".]
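The convergence argument can be checked numerically. The sketch below is an idealised model, not a packet-level simulation: both flows see the same RTT, both detect loss in the same round whenever their combined rate exceeds the capacity, and the function name `aimd_two_flows` is an assumption of this example.

```python
def aimd_two_flows(x, y, capacity, rounds):
    """Evolve two AIMD flows with rates x and y sharing one bottleneck link."""
    for _ in range(rounds):
        if x + y > capacity:
            x, y = x / 2, y / 2      # loss: both halve (the gap x - y halves too)
        else:
            x, y = x + 1, y + 1      # additive increase (the gap is unchanged)
    return x, y
```

Because the difference between the two rates is preserved by additive increase and halved by every multiplicative decrease, a start as uneven as `aimd_two_flows(80, 10, 100, 1000)` ends with the two rates practically equal.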

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical). An IP datagram leaves the source with ECN=00 and reaches the destination marked ECN=11; the destination returns a TCP ACK segment with ECE=1.]


UDP: User Datagram Protocol [RFC 768]

• "bare bones" Internet transport protocol
• "best effort" service, UDP segments may be:
  - lost
  - delivered out-of-order to app
• connectionless:
  - no handshaking between UDP sender, receiver
  - each UDP segment handled independently of others
• UDP use:
  - streaming multimedia apps (loss tolerant, rate sensitive)
  - DNS
  - SNMP
• reliable transfer over UDP:
  - add reliability at application layer
  - application-specific error recovery!

UDP: segment header

UDP segment format (each row is 32 bits wide):

  | source port #               | dest port #                 |
  | length                      | checksum                    |
  | application data (payload)                                |

length: in bytes of UDP segment, including header

why is there a UDP?
• no connection establishment (which can add delay)
• simple: no connection state at sender, receiver
• small header size
• no congestion control: UDP can blast away as fast as desired

UDP checksum

Goal: detect "errors" (e.g., flipped bits) in transmitted segment

sender:
• treat segment contents, including header fields, as sequence of 16-bit integers
• checksum: addition (one's complement sum) of segment contents, followed by a bitwise complement of the result
• sender puts checksum value into UDP checksum field

receiver:
• compute checksum of received segment
• check if computed checksum equals checksum field value:
  - NO - error detected
  - YES - no error detected. But maybe errors nonetheless? More later ...

Internet checksum: example

example: add two 16-bit integers

                   1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
                   1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
                 ---------------------------------
wraparound       1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1

sum                1 0 1 1 1 0 1 1 1 0 1 1 1 1 0 0
checksum           0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 1

Note: when adding numbers, a carryout from the most significant bit needs to be added to the result
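The wraparound addition in this example can be written in a few lines of Python. This is a sketch operating on a list of 16-bit integers rather than on a real UDP segment; the function names are assumptions made for illustration.

```python
def ones_complement_sum16(words):
    """Add 16-bit words, folding any carry out of bit 15 back into the sum."""
    total = 0
    for w in words:
        total += w
        total = (total & 0xFFFF) + (total >> 16)   # wraparound of the carry
    return total


def internet_checksum(words):
    """Checksum = bitwise complement of the one's complement sum."""
    return ~ones_complement_sum16(words) & 0xFFFF


def verify(words, checksum):
    """Receiver check: data plus checksum must sum to all ones."""
    return ones_complement_sum16(list(words) + [checksum]) == 0xFFFF
```

Applied to the two integers above, `ones_complement_sum16` returns `0b1011101110111100` and `internet_checksum` returns `0b0100010001000011`, matching the sum and checksum rows of the example.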

Principles of reliable data transfer

• important in application, transport, link layers
  - top-10 list of important networking topics!
• characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)


Reliable data transfer: getting started

send side:
• rdt_send(): called from above (e.g., by app). Passed data to deliver to receiver upper layer
• udt_send(): called by rdt, to transfer packet over unreliable channel to receiver

receive side:
• rdt_rcv(): called when packet arrives on rcv-side of channel
• deliver_data(): called by rdt to deliver data to upper layer

Reliable data transfer: getting started

We'll:
• incrementally develop sender, receiver sides of reliable data transfer protocol (rdt)
• consider only unidirectional data transfer
  - but control info will flow in both directions!
• use finite state machines (FSMs) to specify sender, receiver

[Figure: FSM notation. The arrow from state 1 to state 2 is labeled with the event causing the state transition (above the line) and the actions taken on the state transition (below the line). state: when in this "state", next state uniquely determined by next event.]

rdt1.0: reliable transfer over a reliable channel

• underlying channel perfectly reliable
  - no bit errors
  - no loss of packets
• separate FSMs for sender, receiver:
  - sender sends data into underlying channel
  - receiver reads data from underlying channel

sender (single state: Wait for call from above)
  rdt_send(data) / packet = make_pkt(data); udt_send(packet)

receiver (single state: Wait for call from below)
  rdt_rcv(packet) / extract(packet, data); deliver_data(data)

rdt2.0: channel with bit errors

• underlying channel may flip bits in packet
  - checksum to detect bit errors
• the question: how to recover from errors? (How do humans recover from "errors" during conversation?)
  - acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
  - negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
  - sender retransmits pkt on receipt of NAK
• new mechanisms in rdt2.0 (beyond rdt1.0):
  - error detection
  - receiver feedback: control msgs (ACK, NAK) rcvr->sender


rdt2.0: FSM specification

sender
  state "Wait for call from above":
    rdt_send(data) / sndpkt = make_pkt(data, checksum); udt_send(sndpkt) -> "Wait for ACK or NAK"
  state "Wait for ACK or NAK":
    rdt_rcv(rcvpkt) && isNAK(rcvpkt) / udt_send(sndpkt) -> "Wait for ACK or NAK"
    rdt_rcv(rcvpkt) && isACK(rcvpkt) / Λ -> "Wait for call from above"

receiver
  state "Wait for call from below":
    rdt_rcv(rcvpkt) && corrupt(rcvpkt) / udt_send(NAK)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) / extract(rcvpkt, data); deliver_data(data); udt_send(ACK)

rdt2.0: operation with no errors

[Figure: the rdt2.0 FSM specification again, following the error-free path: sender rdt_send(data) / sndpkt = make_pkt(data, checksum); udt_send(sndpkt); receiver rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) / extract(rcvpkt, data); deliver_data(data); udt_send(ACK); sender rdt_rcv(rcvpkt) && isACK(rcvpkt) / Λ.]

rdt2.0: error scenario

[Figure: the rdt2.0 FSM specification again, following the error path: sender rdt_send(data) / sndpkt = make_pkt(data, checksum); udt_send(sndpkt); receiver rdt_rcv(rcvpkt) && corrupt(rcvpkt) / udt_send(NAK); sender rdt_rcv(rcvpkt) && isNAK(rcvpkt) / udt_send(sndpkt); the retransmission then follows the error-free path.]

rdt2.0 has a fatal flaw!

what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver!
• can't just retransmit: possible duplicate

handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt

stop and wait: sender sends one packet, then waits for receiver response

rdt2.1: sender, handles garbled ACK/NAKs

  state "Wait for call 0 from above":
    rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt) -> "Wait for ACK or NAK 0"
  state "Wait for ACK or NAK 0":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt) -> "Wait for ACK or NAK 0"
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ -> "Wait for call 1 from above"
  state "Wait for call 1 from above":
    rdt_send(data) / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt) -> "Wait for ACK or NAK 1"
  state "Wait for ACK or NAK 1":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt) -> "Wait for ACK or NAK 1"
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ -> "Wait for call 0 from above"

rdt2.1: receiver, handles garbled ACK/NAKs

  state "Wait for 0 from below":
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) -> "Wait for 1 from below"
    rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
  state "Wait for 1 from below":
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) -> "Wait for 0 from below"
    rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)

rdt2.1: Example 1

[Figure sequence: the rdt2.1 sender and receiver FSMs, stepping through a transfer in which the data packet arrives corrupted.]

1. sender, in "Wait for call 0 from above": rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); moves to "Wait for ACK or NAK 0"
2. receiver, in "Wait for 0 from below": rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender, in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
4. receiver, in "Wait for 0 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); moves to "Wait for 1 from below"
5. sender, in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ; moves to "Wait for call 1 from above"
6. end: sender in "Wait for call 1 from above", receiver in "Wait for 1 from below"

rdt2.1: Example 2

[Figure sequence: the rdt2.1 sender and receiver FSMs, stepping through a transfer in which the receiver's reply is garbled and the data packet is retransmitted.]

1. receiver, in "Wait for 0 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); moves to "Wait for 1 from below"
2. sender, in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
3. receiver, in "Wait for 1 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) (duplicate is not delivered again)
4. sender, in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ; moves to "Wait for call 1 from above"
5. end: sender in "Wait for call 1 from above", receiver in "Wait for 1 from below"

rdt2.1: discussion

sender:
• seq # added to pkt
• two seq. #'s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  - state must "remember" whether "expected" pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  - state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender

rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  - receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt

rdt2.2: sender, receiver fragments

sender FSM fragment
  state "Wait for call 0 from above":
    rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt) -> "Wait for ACK 0"
  state "Wait for ACK 0":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)) / udt_send(sndpkt) -> "Wait for ACK 0"
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0) / Λ

receiver FSM fragment
  transition into "Wait for 0 from below":
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
  state "Wait for 0 from below":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)) / udt_send(sndpkt)

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
  - checksum, seq #, ACKs, retransmissions will be of help ... but not enough

approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  - retransmission will be duplicate, but seq #'s already handle this
  - receiver must specify seq # of pkt being ACKed
• requires countdown timer
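The timeout-and-retransmit idea can be exercised with a tiny deterministic simulation of an alternating-bit sender and receiver. This is a sketch, not the FSM itself: bit errors are not modelled, a lost packet or lost ACK simply shows up as a timeout, and the function name `stop_and_wait` and its `drop_data` / `drop_ack` parameters are assumptions introduced for this example.

```python
def stop_and_wait(messages, drop_data=(), drop_ack=()):
    """Simulate alternating-bit stop-and-wait over a lossy channel.

    drop_data / drop_ack: 0-based indices of the data transmissions and ACK
    transmissions that the channel loses.
    Returns (messages delivered to the upper layer, number of data transmissions).
    """
    delivered = []
    expected = 0            # receiver: sequence bit of the next new packet
    data_tx = ack_tx = 0    # per-direction transmission counters
    for i, msg in enumerate(messages):
        seq = i % 2
        while True:
            lost = data_tx in drop_data
            data_tx += 1
            if lost:
                continue                # no ACK arrives: timeout, resend
            if seq == expected:         # new packet: deliver exactly once
                delivered.append(msg)
                expected ^= 1
            # a duplicate is not delivered again, but is still ACKed
            lost = ack_tx in drop_ack
            ack_tx += 1
            if not lost:
                break                   # sender receives ACK(seq), moves on
    return delivered, data_tx
```

Dropping one data packet and one ACK, as in `stop_and_wait(["a", "b", "c"], drop_data={1}, drop_ack={1})`, costs two extra transmissions, yet each message is still delivered exactly once and in order.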

rdt3.0 sender

  state "Wait for call 0 from above":
    rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer -> "Wait for ACK0"
    rdt_rcv(rcvpkt) / Λ
  state "Wait for ACK0":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)) / Λ
    timeout / udt_send(sndpkt); start_timer
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0) / stop_timer -> "Wait for call 1 from above"
  state "Wait for call 1 from above":
    rdt_send(data) / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer -> "Wait for ACK1"
    rdt_rcv(rcvpkt) / Λ
  state "Wait for ACK1":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0)) / Λ
    timeout / udt_send(sndpkt); start_timer
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1) / stop_timer -> "Wait for call 0 from above"

rdt3.0 in action

(a) no loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0

(b) packet loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1; pkt1 is lost (X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0

(c) ACK loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1; ack1 is lost (X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 is delayed)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  sender: rcv ack1 (second copy), do nothing
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

$$ D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \text{microsecs} $$

  - U_sender: utilization, fraction of time sender busy sending

$$ U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027 $$

  - if RTT = 30 msec, 1KB pkt every 30 msec: 33kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
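The utilization figure is easy to recompute. The function below is a sketch; its name and the optional `in_flight` parameter (the number of packets sent back-to-back per RTT, 1 for stop-and-wait) are assumptions added for illustration.

```python
def sender_utilization(packet_bits, link_bps, rtt_s, in_flight=1):
    """Fraction of time the sender is busy transmitting.

    U = in_flight * (L/R) / (RTT + L/R), capped at 1.0.
    """
    d_trans = packet_bits / link_bps          # L / R, in seconds
    return min(1.0, in_flight * d_trans / (rtt_s + d_trans))
```

For an 8000-bit packet on a 1 Gbps link with a 30 ms RTT, `sender_utilization(8000, 1e9, 0.030)` is about 0.00027, so the sender is idle more than 99.9% of the time.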

rdt3.0: stop-and-wait operation

[Figure: sender/receiver timeline. first packet bit transmitted, t = 0; last packet bit transmitted, t = L / R; first packet bit arrives; last packet bit arrives, send ACK; ACK arrives, send next packet, t = RTT + L / R.]

$$ U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027 $$

Pipelined protocols

pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver

• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

[Figure: sender/receiver timeline. first packet bit transmitted, t = 0; last bit transmitted, t = L / R; first packet bit arrives; last packet bit arrives, send ACK; last bit of 2nd packet arrives, send ACK; last bit of 3rd packet arrives, send ACK; ACK arrives, send next packet, t = RTT + L / R.]

3-packet pipelining increases utilization by a factor of 3!

$$ U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008 $$

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n - "cumulative ACK"
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

initially (Λ): base = 1; nextseqnum = 1
single state: Wait

rdt_send(data):
  if (nextseqnum < base+N) {
      sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
      udt_send(sndpkt[nextseqnum])
      if (base == nextseqnum)
          start_timer
      nextseqnum++
  }
  else
      refuse_data(data)

timeout:
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  ...
  udt_send(sndpkt[nextseqnum-1])

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
      stop_timer
  else
      start_timer

rdt_rcv(rcvpkt) && corrupt(rcvpkt):
  Λ
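The sender's three events translate almost line for line into Python. The sketch below makes some simplifying assumptions that are not part of the FSM: sequence numbers are unbounded integers rather than k-bit values, the timer is a boolean, packets handed to `udt_send` are recorded in a list called `wire`, and an ACK below `base` is ignored as a duplicate. The class name `GbnSender` is invented for this example.

```python
class GbnSender:
    """Go-Back-N sender with window size N (illustrative sketch)."""

    def __init__(self, window_size):
        self.N = window_size
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}            # seq # -> data, kept until ACKed
        self.timer_running = False
        self.wire = []              # seq #s passed to udt_send, in order

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False            # refuse_data: window is full
        self.sndpkt[self.nextseqnum] = data
        self.wire.append(self.nextseqnum)
        if self.base == self.nextseqnum:
            self.timer_running = True
        self.nextseqnum += 1
        return True

    def timeout(self):
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.wire.append(seq)   # go back N: resend every unACKed packet

    def rdt_rcv_ack(self, acknum):
        if acknum < self.base:
            return                  # duplicate cumulative ACK: nothing new
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)
        self.base = acknum + 1
        self.timer_running = self.base != self.nextseqnum
```

With a window of 4, a fifth call to `rdt_send` is refused until an ACK slides the window forward, and a timeout puts every still-unacknowledged sequence number back on the wire.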


GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #

initially (Λ): expectedseqnum = 1; sndpkt = make_pkt(0, ACK, chksum)
single state: Wait

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++

default:
  udt_send(sndpkt)


GBN in action

sender window (N=4), sequence numbers 0 1 2 3 4 5 6 7 8

  sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
  receiver: receive pkt0, send ack0
  receiver: receive pkt1, send ack1
  receiver: receive pkt3, discard, (re)send ack1
  sender: rcv ack0, send pkt4
  sender: rcv ack1, send pkt5
  sender: ignore duplicate ACK
  receiver: receive pkt4, discard, (re)send ack1
  receiver: receive pkt5, discard, (re)send ack1
  sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
  receiver: rcv pkt2, deliver, send ack2
  receiver: rcv pkt3, deliver, send ack3
  receiver: rcv pkt4, deliver, send ack4
  receiver: rcv pkt5, deliver, send ack5


Selective repeat

• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #s of sent, unACKed packets

Selective repeat: sender, receiver windows

[Figure: sender and receiver views of the sequence number space]

Selective repeat

sender:
• data from above:
  - if next available seq # in window, send pkt
• timeout(n):
  - resend pkt n, restart timer
• ACK(n) in [sendbase, sendbase+N-1]:
  - mark pkt n as received
  - if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver:
• pkt n in [rcvbase, rcvbase+N-1]:
  - send ACK(n)
  - out-of-order: buffer
  - in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
• pkt n in [rcvbase-N, rcvbase-1]:
  - ACK(n)
• otherwise:
  - ignore
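The receiver rules above can be sketched as follows. This is a minimal illustration, not a full protocol: the class name `SRReceiver` is hypothetical, sequence numbers are assumed to grow without wrapping around, and returning the ACK number stands in for actually sending it.

```python
class SRReceiver:
    """Selective-repeat receiver sketch (assumes seq numbers never wrap)."""

    def __init__(self, window_size):
        self.N = window_size
        self.rcvbase = 0
        self.buffer = {}      # out-of-order packets, keyed by seq number
        self.delivered = []   # data passed to the upper layer, in order

    def on_packet(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            # deliver this packet and any buffered in-order successors
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n          # already delivered: ACK again, the old ACK may be lost
        return None           # otherwise: ignore


rcv = SRReceiver(window_size=4)
# pkt2 is lost at first and arrives last, after its retransmission
sr_acks = [rcv.on_packet(n, f"pkt{n}") for n in (0, 1, 3, 4, 5, 2)]
print(sr_acks)         # [0, 1, 3, 4, 5, 2]
print(rcv.delivered)   # ['pkt0', 'pkt1', 'pkt2', 'pkt3', 'pkt4', 'pkt5']
```

Unlike Go-Back-N, pkt3 to pkt5 are buffered rather than discarded, so only pkt2 needs to be resent.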

Selective repeat in action

sender window (N=4), sequence numbers 0 to 8

sender:
• send pkt0
• send pkt1
• send pkt2 (X: loss)
• send pkt3
• (wait)
• rcv ack0, send pkt4
• rcv ack1, send pkt5
• record ack3 arrived
• pkt 2 timeout: send pkt2
• record ack4 arrived
• record ack5 arrived

receiver:
• receive pkt0, send ack0
• receive pkt1, send ack1
• receive pkt3, buffer, send ack3
• receive pkt4, buffer, send ack4
• receive pkt5, buffer, send ack5
• rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?

[Figure: position of the sender window over sequence numbers 0 to 8 at each step]


Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

(a) no problem
• sender sends pkt0, pkt1, pkt2; receiver ACKs all three and advances its window
• ACKs arrive; sender sends pkt3 (lost, X) and a new pkt0
• receiver will accept packet with seq number 0

(b) oops!
• sender sends pkt0, pkt1, pkt2; receiver ACKs all three and advances its window
• all three ACKs are lost (X X X)
• timeout: sender retransmits the old pkt0
• receiver will accept packet with seq number 0

receiver can't see sender side; receiver behavior identical in both cases! something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?

[Figure: sender window (after receipt) and receiver window (after receipt) over the sequence 0 1 2 3 0 1 2 for cases (a) and (b)]
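One way to explore the question is to enumerate the worst case directly. The sketch below is an illustration only; the function name `ambiguous_seq_numbers` is hypothetical. It assumes the receiver has accepted a full window and advanced while every ACK was lost, so the sender may still retransmit the old window.

```python
def ambiguous_seq_numbers(seq_space, window):
    """Seq numbers that could be either an old retransmission or new data.

    Worst case: the receiver accepted packets 0..window-1 and moved its
    window forward, but no ACK reached the sender.
    """
    old = {n % seq_space for n in range(0, window)}            # sender may resend these
    new = {n % seq_space for n in range(window, 2 * window)}   # receiver now expects these
    return sorted(old & new)


print(ambiguous_seq_numbers(seq_space=4, window=3))  # [0, 1]
print(ambiguous_seq_numbers(seq_space=4, window=2))  # []
```

Under this worst-case model the overlap is empty exactly when the window size is at most half the sequence number space, which is the usual textbook answer to the question on the slide.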

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver

TCP segment structure

Segment layout (each row is 32 bits wide):
  source port # | dest port #
  sequence number
  acknowledgement number
  head len | not used | U A P R S F | receive window
  checksum | Urg data pointer
  options (variable length)
  application data (variable length)

Field notes:
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• checksum: Internet checksum (as in UDP)

TCP seq. numbers, ACKs

sequence numbers:
  - byte stream "number" of first byte in segment's data

acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK

Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say, up to implementor

[Figure: sender sequence number space with window size N, divided into: sent, ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable. The outgoing segment from sender carries the sequence number; the incoming segment to sender carries the acknowledgement number A. Fields shown in both segments: source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer]

Byte stream in TCP

[Figure: an HTTP GET message (K bytes) viewed as a byte stream. A segment carrying M bytes that starts at the 100th byte has a TCP header with seq no = 100. The window covers N bytes; bytes beyond the window cannot be transmitted now.]

TCP seq. numbers, ACKs

simple telnet scenario (Host A, Host B):
1. User types 'C'. Host A → Host B: Seq=42, ACK=79, data = 'C'
2. Host B ACKs receipt of 'C', echoes back 'C'. Host B → Host A: Seq=79, ACK=43, data = 'C'
3. Host A ACKs receipt of echoed 'C'. Host A → Host B: Seq=43, ACK=80

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT

EstimatedRTT = (1 - α)·EstimatedRTT + α·SampleRTT

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr. x-axis: time (seconds), 1 to 106; y-axis: RTT (milliseconds), 100 to 350; curves: SampleRTT and EstimatedRTT]


TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

DevRTT = (1 - β)·DevRTT + β·|SampleRTT - EstimatedRTT|
(typically, β = 0.25)

TimeoutInterval = EstimatedRTT + 4·DevRTT
(EstimatedRTT is the estimated RTT; 4·DevRTT is the "safety margin")
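The three formulas can be combined into a small estimator. This is a sketch under stated assumptions rather than a reference implementation: the class name `RttEstimator` is hypothetical, the seeding (EstimatedRTT = first sample, DevRTT = half of it) and the order of the two updates (deviation first, using the old estimate) follow RFC 6298 and are not specified on the slides.

```python
class RttEstimator:
    """EWMA-based RTT and timeout estimator, times in milliseconds."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample     # assumption: seed with first sample
        self.dev_rtt = first_sample / 2       # assumption: seed as in RFC 6298

    def update(self, sample_rtt):
        """Fold in one SampleRTT and return the new TimeoutInterval."""
        # assumption: DevRTT is updated first, against the old EstimatedRTT
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        return self.timeout_interval()

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)
timeout = est.update(200.0)   # one unusually slow sample
print(est.estimated_rtt, est.dev_rtt, timeout)   # 112.5 62.5 362.5
```

A single slow sample moves EstimatedRTT only slightly (100 to 112.5 ms) but widens the safety margin considerably, which is the intended behaviour.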

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks

let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control

TCP sender events:

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments

TCP sender (simplified)

State: wait for event
Initial: NextSeqNum = InitialSeqNum; SendBase = InitialSeqNum

Event: data received from application above
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., "send")
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
    start timer

Event: timeout
  retransmit not-yet-acked segment with smallest seq. #
  start timer

Event: ACK received, with ACK field value y
  if (y > SendBase) {
    SendBase = y
    /* SendBase-1: last cumulatively ACKed byte */
    if (there are currently not-yet-acked segments)
      start timer
    else stop timer
  }
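The three events of the simplified sender translate almost line by line into code. The sketch below is illustrative only: the class name `SimplifiedTcpSender`, the `sent` log (standing in for "pass segment to IP") and the boolean `timer_running` (standing in for a real timer) are assumptions, not part of the slides.

```python
class SimplifiedTcpSender:
    """Simplified TCP sender: no duplicate-ACK handling, no flow/congestion control."""

    def __init__(self, initial_seq):
        self.next_seq_num = initial_seq
        self.send_base = initial_seq
        self.unacked = {}            # seq number -> data of not-yet-acked segments
        self.timer_running = False
        self.sent = []               # log of (seq, length) handed to IP

    def on_app_data(self, data):
        seq = self.next_seq_num
        self.unacked[seq] = data
        self.sent.append((seq, len(data)))
        self.next_seq_num += len(data)
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if not self.unacked:
            return
        seq = min(self.unacked)      # not-yet-acked segment with smallest seq number
        self.sent.append((seq, len(self.unacked[seq])))
        self.timer_running = True    # restart the timer

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y       # send_base - 1 is the last cumulatively ACKed byte
            for seq in [s for s in self.unacked if s + len(self.unacked[s]) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)


# two segments (Seq=92 with 8 bytes, Seq=100 with 20 bytes) covered by one ACK=120
sender = SimplifiedTcpSender(initial_seq=92)
sender.on_app_data(b"a" * 8)
sender.on_app_data(b"b" * 20)
sender.on_ack(120)
print(sender.send_base, sender.timer_running)   # 120 False
```

Because ACKs are cumulative, the single ACK=120 clears both segments and stops the timer even if ACK=100 never arrives.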

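TCP: retransmission scenarios

lost ACK scenario (Host A, Host B):
1. A → B: Seq=92, 8 bytes of data
2. B → A: ACK=100, lost (X)
3. timeout at A; A → B: Seq=92, 8 bytes of data
4. B → A: ACK=100

premature timeout (Host A, Host B):
1. A → B: Seq=92, 8 bytes of data (SendBase=92)
2. A → B: Seq=100, 20 bytes of data
3. B → A: ACK=100 and ACK=120, still in flight when the timer expires
4. timeout at A; A → B: Seq=92, 8 bytes of data
5. ACK=100 arrives (SendBase=100), then ACK=120 arrives (SendBase=120)
6. B → A: ACK=120 for the retransmitted segment (SendBase=120)

cumulative ACK (Host A, Host B):
1. A → B: Seq=92, 8 bytes of data
2. A → B: Seq=100, 20 bytes of data
3. B → A: ACK=100, lost (X)
4. B → A: ACK=120, arrives before the timeout
5. A → B: Seq=120, 15 bytes of data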

TCP ACK generation [RFC 5681]

event at receiver → TCP receiver action

• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  → delayed ACK: wait up to 500ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected
  → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap
  → immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout

[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost (X). Host B sends ACK=100 four times (3 DUP ACKs). Host A resends Seq=100, 20 bytes of data, before the timeout expires.]

TCP flow control

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

application may remove data from TCP socket buffers slower than TCP receiver is delivering (sender is sending)

[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code (OS) into the TCP socket receiver buffers, from which the application process reads]

TCP flow control

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); buffered data goes to the application process]

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

state held on each side (client and server), between application and network:
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client

client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

client state: CLOSED; server state: LISTEN

1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client state: SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server state: SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client state: ESTAB
4. server: received ACK(y) indicates client is live; server state: ESTAB

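TCP 3-way handshake: FSM

server side:
• closed → listen: Socket connectionSocket = welcomeSocket.accept(); action: Λ
• listen → SYN rcvd: event SYN(x); action: SYNACK(seq=y, ACKnum=x+1), create new socket for communication back to client
• SYN rcvd → ESTAB: event ACK(ACKnum=y+1); action: Λ

client side:
• closed → SYN sent: Socket clientSocket = new Socket("hostname", "port number"); action: SYN(seq=x)
• SYN sent → ESTAB: event SYNACK(seq=y, ACKnum=x+1); action: ACK(ACKnum=y+1)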

TCP: closing a connection

• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

TCP: closing a connection (client state, server state)

1. client ESTAB, server ESTAB
2. client: clientSocket.close(); sends FINbit=1, seq=x; state FIN_WAIT_1 (can no longer send but can receive data)
3. server: sends ACKbit=1; ACKnum=x+1; state CLOSE_WAIT (can still send data)
4. client: state FIN_WAIT_2 (wait for server close)
5. server: sends FINbit=1, seq=y; state LAST_ACK (can no longer send data)
6. client: sends ACKbit=1; ACKnum=y+1; state TIMED_WAIT (timed wait for 2*max segment lifetime), then CLOSED
7. server: state CLOSED

The ldquoTwo Army Problemrdquo

84

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λin, approaches capacity

[Figure: Host A and Host B send original data at rate λin through unlimited shared output link buffers; throughput: λout. Plots: λout vs λin (both up to R/2) and delay vs λin (growing as λin approaches R/2)]

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λin = λout
  - transport-layer input includes retransmissions: λ'in ≥ λin

[Figure: Host A and Host B send through finite shared output link buffers. λin: original data; λ'in: original data, plus retransmitted data; λout: output]

idealization: perfect knowledge
• sender sends only when router buffers available

[Figure: a copy of each packet is kept at the sender and sent when there is free buffer space. Plot: λout vs λin, both up to R/2]

Idealization: known loss
packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost

[Figure: a packet arriving when there is no buffer space is dropped and its copy is resent. Plot: λout vs λ'in, both axes up to R/2]

when sending at R/2, some packets are retransmissions but asymptotic goodput is still R/2 (why?)

Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered

[Figure: a timeout fires although there is free buffer space, so two copies are sent. Plot: λout vs λ'in, both axes up to R/2]

when sending at R/2, some packets are retransmissions, including duplicates that are delivered!

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λin and λ'in increase?
A: as red λ'in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

[Figure: Host A, Host B, Host C and Host D send over multihop paths through routers with finite shared output link buffers. λin: original data; λ'in: original data, plus retransmitted data; λout: output]

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!

[Figure: throughput by blue traffic (λout, up to R/2) vs offered load by Host A (λ'in, up to R/2); the drop at high load shows bandwidth wastage for packets dropped at the 2nd router]

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss

AIMD saw tooth behavior: probing for bandwidth

[Figure: cwnd (TCP sender congestion window size) vs time. The window is additively increased until loss occurs, then cut in half, giving a saw tooth]

TCP Congestion Control: details

• sender limits transmission:

  LastByteSent - LastByteAcked ≤ cwnd

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

  rate ≈ cwnd / RTT bytes/sec

[Figure: sender sequence number space. Bytes up to the last byte ACKed; then bytes sent, not-yet ACKed ("in-flight"), up to the last byte sent; the in-flight span is bounded by cwnd]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A sends one segment to Host B; one RTT later two segments; one RTT later four segments]

TCP: detecting, reacting to loss

• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

Initial (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; enter slow start

State: slow start
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: Λ; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; stay in slow start
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery

State: congestion avoidance
• new ACK: cwnd = cwnd + MSS·(MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery

State: fast recovery
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
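The state machine summarised above can be sketched as a small class that tracks only cwnd, ssthresh and the state. This is an illustration under stated assumptions: the class name `TcpRenoCwnd` is hypothetical, cwnd and ssthresh are counted in units of MSS (so the initial ssthresh is a parameter rather than 64 KB), and actual segment transmission is left out.

```python
MSS = 1.0  # assumption: window sizes are counted in units of one MSS


class TcpRenoCwnd:
    """Tracks cwnd through slow start, congestion avoidance and fast recovery."""

    def __init__(self, ssthresh=64.0):
        self.state = "slow start"
        self.cwnd = 1.0 * MSS
        self.ssthresh = ssthresh
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "fast recovery":
            self.cwnd = self.ssthresh
            self.state = "congestion avoidance"
        elif self.state == "slow start":
            self.cwnd += MSS                       # doubles cwnd once per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion avoidance"
        else:
            self.cwnd += MSS * (MSS / self.cwnd)   # about +1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast recovery":
            self.cwnd += MSS
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                     # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1.0 * MSS
        self.dup_acks = 0
        self.state = "slow start"


tcp = TcpRenoCwnd(ssthresh=8.0)
for _ in range(8):
    tcp.on_new_ack()
print(tcp.state, tcp.cwnd)        # congestion avoidance 9.0
for _ in range(3):
    tcp.on_dup_ack()
print(tcp.state, tcp.cwnd)        # fast recovery 7.5
```

Calling `on_timeout()` from any state halves ssthresh and drops cwnd back to 1 MSS, matching the three timeout transitions in the summary.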

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT

avg TCP throughput = (3/4)·W / RTT bytes/sec

[Figure: saw tooth of the window size between W/2 and W]

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

  TCP throughput = 1.22·MSS / (RTT·√L)

  to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
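The two numbers in this example can be checked with a few lines of arithmetic. The sketch below simply rearranges the throughput formula for L; the constant names are illustrative.

```python
MSS_BITS = 1500 * 8     # segment size in bits
RTT = 0.100             # round trip time in seconds
TARGET = 10e9           # desired throughput in bits/sec (10 Gbps)

# window needed to keep the pipe full: throughput * RTT, counted in segments
window_segments = TARGET * RTT / MSS_BITS

# throughput = 1.22 * MSS / (RTT * sqrt(L))  =>  L = (1.22 * MSS / (RTT * throughput))^2
loss_rate = (1.22 * MSS_BITS / (RTT * TARGET)) ** 2

print(int(window_segments))   # 83333
print(f"{loss_rate:.1e}")     # 2.1e-10
```

The result is about 2·10^-10, roughly one lost segment in every five billion, which is why the slide calls the loss rate very small.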

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput vs Connection 1 throughput, both axes up to R. Shown: the equal bandwidth share line; the full bandwidth utilization line, with points (X1, Y1) where X1+Y1 = R; points (X2, Y2) where X2 = Y2. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2"]

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2
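The parallel-connection example can be verified with exact fractions. The helper name `app_share` is hypothetical, and the sketch assumes every TCP connection on the link receives an equal share of R.

```python
from fractions import Fraction


def app_share(existing, mine):
    """Fraction of link rate R obtained by an app that opens `mine` connections
    alongside `existing` other connections (assumes equal per-connection share)."""
    return Fraction(mine, existing + mine)


print(app_share(existing=9, mine=1))    # 1/10
print(app_share(existing=9, mine=11))   # 11/20
```

With 11 connections the app holds 11 of 20 connections, so the "R/2" on the slide is a rounded figure for 0.55·R.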

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and reaches the destination with ECN=11; the TCP ACK segment returns with ECE=1]

Page 6: ChapterIII: Transport Layer

UDP: segment header

UDP segment format (each row is 32 bits wide):
  source port # | dest port #
  length | checksum
  application data (payload)

length: in bytes of UDP segment, including header

why is there a UDP?
• no connection establishment (which can add delay)
• simple: no connection state at sender, receiver
• small header size
• no congestion control: UDP can blast away as fast as desired

UDP checksum

Goal: detect "errors" (e.g., flipped bits) in transmitted segment

sender:
• treat segment contents, including header fields, as sequence of 16-bit integers
• checksum: addition (1's complement sum) of segment contents
• sender puts checksum value into UDP checksum field

receiver:
• compute checksum of received segment
• check if computed checksum equals checksum field value:
  - NO: error detected
  - YES: no error detected. But maybe errors nonetheless? More later ...

Internet checksum example

example: add two 16-bit integers

                1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
                1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
wraparound    1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1
sum             1 0 1 1 1 0 1 1 1 0 1 1 1 1 0 0
checksum        0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 1

Note: when adding numbers, a carryout from the most significant bit needs to be added to the result
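The wraparound rule is easy to get wrong by hand, so here is a short sketch that reproduces the example. The function names are illustrative, and the input is assumed to be already split into 16-bit integers.

```python
def ones_complement_sum16(words):
    """Add 16-bit integers, folding any carry out of bit 15 back into the sum."""
    total = 0
    for w in words:
        total += w
        total = (total & 0xFFFF) + (total >> 16)   # wraparound of the carry
    return total


def internet_checksum(words):
    """Checksum is the 1's complement (bitwise inverse) of the wrapped sum."""
    return ~ones_complement_sum16(words) & 0xFFFF


words = [0b1110011001100110, 0b1101010101010101]
wrapped = ones_complement_sum16(words)
checksum = internet_checksum(words)
print(f"{wrapped:016b}")    # 1011101110111100
print(f"{checksum:016b}")   # 0100010001000011
```

A receiver can run the same sum over the data plus the checksum field; with no bit errors the result is sixteen 1 bits (0xFFFF).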

Principles of reliable data transfer

• important in application, transport, link layers
  - top-10 list of important networking topics!
• characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)


Reliable data transfer: getting started

send side:
• rdt_send(): called from above (e.g., by app). Passed data to deliver to receiver upper layer
• udt_send(): called by rdt, to transfer packet over unreliable channel to receiver

receive side:
• rdt_rcv(): called when packet arrives on rcv-side of channel
• deliver_data(): called by rdt to deliver data to upper layer

Reliable data transfer: getting started

We'll:
• incrementally develop sender, receiver sides of reliable data transfer protocol (rdt)
• consider only unidirectional data transfer
  - but control info will flow on both directions!
• use finite state machines (FSMs) to specify sender, receiver

FSM notation:
• state: when in this "state", next state uniquely determined by next event
• each transition from state 1 to state 2 is labelled event / actions:
  - event causing state transition
  - actions taken on state transition

rdt1.0: reliable transfer over a reliable channel

• underlying channel perfectly reliable
  - no bit errors
  - no loss of packets
• separate FSMs for sender, receiver:
  - sender sends data into underlying channel
  - receiver reads data from underlying channel

sender
State: Wait for call from above
Event rdt_send(data):
  packet = make_pkt(data)
  udt_send(packet)

receiver
State: Wait for call from below
Event rdt_rcv(packet):
  extract(packet, data)
  deliver_data(data)

rdt2.0: channel with bit errors

• underlying channel may flip bits in packet
  - checksum to detect bit errors
• the question: how to recover from errors:
  - acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
  - negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
  - sender retransmits pkt on receipt of NAK
• new mechanisms in rdt2.0 (beyond rdt1.0):
  - error detection
  - receiver feedback: control msgs (ACK, NAK) rcvr->sender

How do humans recover from "errors" during conversation?


rdt2.0: FSM specification

sender
State: Wait for call from above
Event rdt_send(data):
  sndpkt = make_pkt(data, checksum)
  udt_send(sndpkt)
  go to Wait for ACK or NAK

State: Wait for ACK or NAK
Event rdt_rcv(rcvpkt) && isNAK(rcvpkt):
  udt_send(sndpkt)
Event rdt_rcv(rcvpkt) && isACK(rcvpkt):
  Λ
  go to Wait for call from above

receiver
State: Wait for call from below
Event rdt_rcv(rcvpkt) && corrupt(rcvpkt):
  udt_send(NAK)
Event rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
  extract(rcvpkt, data)
  deliver_data(data)
  udt_send(ACK)

rdt2.0: operation with no errors

[Figure: the rdt2.0 FSM from the specification slide. The sender sends sndpkt; the receiver finds it not corrupt, delivers the data and replies with ACK; the sender returns to "Wait for call from above"]

rdt2.0: error scenario

[Figure: the rdt2.0 FSM from the specification slide. The receiver finds the packet corrupt and replies with NAK; the sender retransmits sndpkt]

rdt2.0 has a fatal flaw!

what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver!
• can't just retransmit: possible duplicate

handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt

stop and wait: sender sends one packet, then waits for receiver response

rdt2.1: sender, handles garbled ACK/NAKs

State: Wait for call 0 from above
Event rdt_send(data):
  sndpkt = make_pkt(0, data, checksum)
  udt_send(sndpkt)
  go to Wait for ACK or NAK 0

State: Wait for ACK or NAK 0
Event rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)):
  udt_send(sndpkt)
Event rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt):
  Λ
  go to Wait for call 1 from above

State: Wait for call 1 from above
Event rdt_send(data):
  sndpkt = make_pkt(1, data, checksum)
  udt_send(sndpkt)
  go to Wait for ACK or NAK 1

State: Wait for ACK or NAK 1
Event rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)):
  udt_send(sndpkt)
Event rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt):
  Λ
  go to Wait for call 0 from above

rdt2.1: receiver, handles garbled ACK/NAKs

State: Wait for 0 from below
Event rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt):
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(ACK, chksum)
  udt_send(sndpkt)
  go to Wait for 1 from below
Event rdt_rcv(rcvpkt) && corrupt(rcvpkt):
  sndpkt = make_pkt(NAK, chksum)
  udt_send(sndpkt)
Event rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt):
  sndpkt = make_pkt(ACK, chksum)
  udt_send(sndpkt)

State: Wait for 1 from below
Event rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt):
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(ACK, chksum)
  udt_send(sndpkt)
  go to Wait for 0 from below
Event rdt_rcv(rcvpkt) && corrupt(rcvpkt):
  sndpkt = make_pkt(NAK, chksum)
  udt_send(sndpkt)
Event rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt):
  sndpkt = make_pkt(ACK, chksum)
  udt_send(sndpkt)

rdt2.1: Example 1

Sender states: Wait for call 0 from above; Wait for ACK or NAK 0; Wait for call 1 from above; Wait for ACK or NAK 1
Receiver states: Wait for 0 from below; Wait for 1 from below

Steps, one per slide:
1. sender, rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
2. receiver, rdt_rcv(rcvpkt) && corrupt(rcvpkt): sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender, rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)): udt_send(sndpkt)
4. receiver, rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt): extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
5. sender, rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt): Λ
6. final states shown

rdt2.1: Example 2

Sender states: Wait for call 0 from above; Wait for ACK or NAK 0; Wait for call 1 from above; Wait for ACK or NAK 1
Receiver states: Wait for 0 from below; Wait for 1 from below

Steps, one per slide:
1. receiver, rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt): extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
2. sender, rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)): udt_send(sndpkt)
3. receiver, rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt), now a duplicate: sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
4. sender, rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt): Λ
5. final states shown

rdt2.1: discussion

sender:
• seq # added to pkt
• two seq. #'s (0, 1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  – state must "remember" whether "expected" pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  – state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender

rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  – receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt

rdt2.2: sender, receiver fragments

sender FSM fragment

Wait for call 0 from above
  rdt_send(data) → sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)   [go to Wait for ACK 0]

Wait for ACK 0
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)) → udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0) → Λ (no action)

receiver FSM fragment

Wait for 0 from below
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)) → udt_send(sndpkt)

(transition into Wait for 0 from below)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) → extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
  – checksum, seq. #, ACKs, retransmissions will be of help ... but not enough

approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  – retransmission will be duplicate, but seq. #'s already handle this
  – receiver must specify seq # of pkt being ACKed
• requires countdown timer

rdt3.0 sender

Wait for call 0 from above
  rdt_send(data) → sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer   [go to Wait for ACK0]
  rdt_rcv(rcvpkt) → Λ (no action)

Wait for ACK0
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)) → Λ (no action)
  timeout → udt_send(sndpkt); start_timer
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0) → stop_timer   [go to Wait for call 1 from above]

Wait for call 1 from above
  rdt_send(data) → sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer   [go to Wait for ACK1]
  rdt_rcv(rcvpkt) → Λ (no action)

Wait for ACK1
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0)) → Λ (no action)
  timeout → udt_send(sndpkt); start_timer
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1) → stop_timer   [go to Wait for call 0 from above]

rdt3.0 in action

[Four sender/receiver timing diagrams, listed as event sequences.]

(a) no loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0

(b) packet loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 (X: pkt1 lost)
  sender: timeout, resend pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0

(c) ACK loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (X: ack1 lost)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 delayed)
  sender: timeout, resend pkt1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt1 (detect duplicate), send ack1
  receiver: rcv pkt0, send ack0
  sender: rcv ack1, do nothing
  sender: rcv ack0, send pkt1

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

$D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \text{microsecs}$

• U_sender: utilization, the fraction of time sender is busy sending

$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$

• if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!

rdt3.0: stop-and-wait operation

[Timing diagram, sender and receiver:]
• first packet bit transmitted, t = 0
• last packet bit transmitted, t = L/R
• first packet bit arrives
• last packet bit arrives, send ACK
• ACK arrives, send next packet, t = RTT + L/R

$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$

Pipelined protocols

pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

[Timing diagram, sender and receiver:]
• first packet bit transmitted, t = 0
• last bit transmitted, t = L/R
• first packet bit arrives
• last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• ACK arrives, send next packet, t = RTT + L/R

3-packet pipelining increases utilization by a factor of 3!

$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$
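The utilization formula can be checked numerically. The following is a minimal sketch, not part of the original slides; the function name `sender_utilization` and its parameters are illustrative assumptions. It evaluates (n · L/R) / (RTT + L/R) for the slide's example link with one packet in flight (stop-and-wait) and with three.

```python
def sender_utilization(packet_bits: float, rate_bps: float, rtt_s: float,
                       in_flight: int = 1) -> float:
    """Fraction of time the sender is busy: (n * L/R) / (RTT + L/R)."""
    d_trans = packet_bits / rate_bps  # L/R: time to push one packet onto the link
    # Utilization cannot exceed 1 once the pipe is full.
    return min(1.0, in_flight * d_trans / (rtt_s + d_trans))


# Slide example: 8000-bit packets, 1 Gbps link, RTT = 30 ms.
stop_and_wait = sender_utilization(8000, 1e9, 0.030)
pipelined_3 = sender_utilization(8000, 1e9, 0.030, in_flight=3)
print(f"stop-and-wait:       {stop_and_wait:.5f}")
print(f"3-packet pipelining: {pipelined_3:.5f}")
```

With these numbers the pipe is nowhere near full, so utilization grows linearly with the number of packets in flight.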

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  – doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  – when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  – when timer expires, retransmit only that unacked packet

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: "cumulative ACK"
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

initially:
  base = 1
  nextseqnum = 1

state: Wait

rdt_send(data):
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)

timeout:
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  ...
  udt_send(sndpkt[nextseqnum-1])

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer

rdt_rcv(rcvpkt) && corrupt(rcvpkt):
  Λ (no action)
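The sender FSM above can be sketched as a small class. This is an illustrative sketch only: the class name `GBNSender`, the `sent` log standing in for udt_send, and the boolean `timer_running` standing in for a real countdown timer are all assumptions, not part of the slides. It also guards against stale ACKs with `max()`, which the FSM does not spell out.

```python
class GBNSender:
    """Go-Back-N sender bookkeeping; network and timer are stubbed out."""

    def __init__(self, window_size: int):
        self.N = window_size
        self.base = 1             # oldest unacknowledged sequence number
        self.nextseqnum = 1       # next sequence number to assign
        self.sndpkt = {}          # seq -> data, kept for retransmission
        self.timer_running = False
        self.sent = []            # log of transmitted seq numbers (stand-in for udt_send)

    def rdt_send(self, data) -> bool:
        """Send data if the window has room; return False to refuse it."""
        if self.nextseqnum >= self.base + self.N:
            return False                       # refuse_data(data)
        self.sndpkt[self.nextseqnum] = data
        self.sent.append(self.nextseqnum)      # udt_send(sndpkt[nextseqnum])
        if self.base == self.nextseqnum:
            self.timer_running = True          # start_timer
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum: int) -> None:
        """Cumulative ACK: everything up to and including acknum is acknowledged."""
        self.base = max(self.base, acknum + 1)
        for seq in [s for s in self.sndpkt if s < self.base]:
            del self.sndpkt[seq]
        # stop_timer if nothing is outstanding, otherwise (re)start it
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self) -> None:
        """Go back N: retransmit every packet from base to nextseqnum-1."""
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.sent.append(seq)


sender = GBNSender(window_size=4)
accepted = [sender.rdt_send(f"data{i}") for i in range(5)]  # fifth call is refused
sender.on_ack(1)             # window slides by one
sender.rdt_send("data4")     # now accepted as seq 5
sender.on_timeout()          # resends seq 2..5
```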


GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don't buffer): no receiver buffering!
  – re-ACK pkt with highest in-order seq #

initially (Λ):
  expectedseqnum = 1
  sndpkt = make_pkt(0, ACK, chksum)

state: Wait

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++

default:
  udt_send(sndpkt)


GBN in action

[Timing diagram; sender window (N=4) slides over seq #s 0 1 2 3 4 5 6 7 8.]

  sender: send pkt0, send pkt1, send pkt2 (X: pkt2 lost), send pkt3, (wait)
  receiver: receive pkt0, send ack0
  receiver: receive pkt1, send ack1
  receiver: receive pkt3, discard, (re)send ack1
  sender: rcv ack0, send pkt4
  sender: rcv ack1, send pkt5
  sender: ignore duplicate ACK
  receiver: receive pkt4, discard, (re)send ack1
  receiver: receive pkt5, discard, (re)send ack1
  sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
  receiver: rcv pkt2, deliver, send ack2
  receiver: rcv pkt3, deliver, send ack3
  receiver: rcv pkt4, deliver, send ack4
  receiver: rcv pkt5, deliver, send ack5


Selective repeat

• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #'s
  – limits seq #s of sent, unACKed packets

Selective repeat: sender, receiver windows

[Figure: sender and receiver views of the sequence number space.]

Selective repeat

sender

data from above:
• if next available seq # in window, send pkt

timeout(n):
• resend pkt n, restart timer

ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver

pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt

pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)

otherwise:
• ignore
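The receiver rules above can be sketched in a few lines. This is a hedged illustration, not the slides' own code: the class name `SRReceiver` and its fields are hypothetical, and sequence numbers are treated as unbounded integers, so wraparound of a finite sequence space is deliberately ignored here.

```python
class SRReceiver:
    """Selective Repeat receiver with unbounded sequence numbers (no wraparound)."""

    def __init__(self, window_size: int):
        self.N = window_size
        self.rcvbase = 0       # smallest not-yet-received sequence number
        self.buffer = {}       # out-of-order packets: seq -> data
        self.delivered = []    # data handed to the upper layer, in order

    def on_packet(self, n: int, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            # Deliver the in-order run starting at rcvbase and advance the window.
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n           # already delivered: re-ACK so the sender can advance
        return None            # otherwise: ignore


rcv = SRReceiver(window_size=4)
# Arrival order from the "in action" example: pkt2 is lost and arrives last.
acks = [rcv.on_packet(n, f"pkt{n}") for n in (0, 1, 3, 4, 5, 2)]
```

Re-ACKing packets below the window matters because the earlier ACK may have been lost; without it the sender's window could never advance.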

Selective repeat in action

[Timing diagram; sender window (N=4) slides over seq #s 0 1 2 3 4 5 6 7 8.]

  sender: send pkt0, send pkt1, send pkt2 (X: pkt2 lost), send pkt3, (wait)
  receiver: receive pkt0, send ack0
  receiver: receive pkt1, send ack1
  receiver: receive pkt3, buffer, send ack3
  sender: rcv ack0, send pkt4
  sender: rcv ack1, send pkt5
  sender: record ack3 arrived
  receiver: receive pkt4, buffer, send ack4
  receiver: receive pkt5, buffer, send ack5
  sender: pkt 2 timeout; send pkt2
  sender: record ack4 arrived
  sender: record ack5 arrived
  receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

[Figure: sender window and receiver window (after receipt) over the sequence 0 1 2 3 0 1 2, in two scenarios.]

(a) no problem
  sender: send pkt0, pkt1, pkt2; all three received and ACKed
  one ACK is lost (X), the sender window still advances, and the sender sends pkt3 and then a new pkt0
  receiver: will accept packet with seq number 0

(b) oops!
  sender: send pkt0, pkt1, pkt2; all three received, but all ACKs are lost (X X X)
  sender: timeout, retransmit pkt0
  receiver: will accept packet with seq number 0

• receiver can't see sender side
• receiver behavior identical in both cases!
• something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
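One way to explore the question is to test directly whether an old sender window and the receiver's advanced window can share a sequence number. The sketch below is illustrative only; the function name `sr_windows_can_collide` is hypothetical. It models the worst case from scenario (b): the receiver has accepted a full window while every ACK was lost. The check it runs agrees with the standard textbook answer that the window must be at most half the sequence-number space.

```python
def sr_windows_can_collide(seq_space: int, window: int) -> bool:
    """True if a retransmitted old packet can look like a new one to the receiver.

    Worst case: the receiver got all `window` packets and advanced its window,
    but every ACK was lost, so the sender still retransmits from the old window.
    """
    old_sender_window = {n % seq_space for n in range(0, window)}
    new_receiver_window = {n % seq_space for n in range(window, 2 * window)}
    return bool(old_sender_window & new_receiver_window)


# Slide example: seq #'s 0..3 with window size 3 is ambiguous; window size 2 is not.
print(sr_windows_can_collide(4, 3), sr_windows_can_collide(4, 2))
```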

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver

TCP segment structure

[Figure: segment layout, 32 bits wide, top to bottom:]
  source port # | dest port #
  sequence number
  acknowledgement number
  head len | not used | U A P R S F | receive window
  checksum | Urg data pointer
  options (variable length)
  application data (variable length)

Annotations:
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• checksum: Internet checksum (as in UDP)

TCP seq. numbers, ACKs

sequence numbers:
  – byte stream "number" of first byte in segment's data

acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK

Q: how receiver handles out-of-order segments
  – A: TCP spec doesn't say; up to implementor

[Figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer; the incoming acknowledgement number is labelled A. Sender sequence number space with window size N: sent ACKed | sent, not-yet ACKed ("in-flight") | usable but not yet sent | not usable.]

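Byte stream in TCP

[Figure: an HTTP GET message (K bytes) viewed as a byte stream. A window of N bytes covers part of it; a segment with TCP header (seq. no. = 100) carries M bytes starting at the 100th byte; bytes beyond the window cannot be transmitted now.]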

TCP seq. numbers, ACKs

simple telnet scenario

  Host A → Host B: Seq=42, ACK=79, data = 'C'   (user types 'C')
  Host B → Host A: Seq=79, ACK=43, data = 'C'   (host ACKs receipt of 'C', echoes back 'C')
  Host A → Host B: Seq=43, ACK=80               (host ACKs receipt of echoed 'C')

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT

TCP round trip time, timeout

$\text{EstimatedRTT} = (1-\alpha)\cdot\text{EstimatedRTT} + \alpha\cdot\text{SampleRTT}$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Figure: RTT, gaia.cs.umass.edu to fantasia.eurecom.fr. x-axis: time (seconds); y-axis: RTT (milliseconds), 100 to 350; curves: SampleRTT and EstimatedRTT.]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

$\text{DevRTT} = (1-\beta)\cdot\text{DevRTT} + \beta\cdot|\text{SampleRTT} - \text{EstimatedRTT}|$

(typically, β = 0.25)

$\text{TimeoutInterval} = \text{EstimatedRTT} + 4\cdot\text{DevRTT}$

(EstimatedRTT is the estimated RTT; 4·DevRTT is the "safety margin")
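The two moving averages and the timeout formula fit in a short class. This is a sketch under stated assumptions: the class name `RttEstimator` is hypothetical, the initial values are supplied by the caller because the slides do not say how to seed them, and DevRTT is updated using the estimate from before the new sample, an ordering the slides leave open.

```python
class RttEstimator:
    """EWMA-based RTT estimate and timeout, using the slide's typical constants."""

    def __init__(self, estimated_rtt: float, dev_rtt: float,
                 alpha: float = 0.125, beta: float = 0.25):
        self.estimated_rtt = estimated_rtt
        self.dev_rtt = dev_rtt
        self.alpha = alpha
        self.beta = beta

    def update(self, sample_rtt: float) -> None:
        """Fold one SampleRTT (not from a retransmission) into both averages."""
        # Assumption: deviation is measured against the pre-update estimate.
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    @property
    def timeout_interval(self) -> float:
        """EstimatedRTT plus a safety margin of four deviations."""
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(estimated_rtt=100.0, dev_rtt=10.0)  # milliseconds
est.update(120.0)
print(est.estimated_rtt, est.dev_rtt, est.timeout_interval)
```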

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks

let's initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control

TCP sender events

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments

TCP sender (simplified)

initially (Λ):
  NextSeqNum = InitialSeqNum
  SendBase = InitialSeqNum

state: wait for event

data received from application above:
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., "send")
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
    start timer

timeout:
  retransmit not-yet-acked segment with smallest seq. #
  start timer

ACK received, with ACK field value y:
  if (y > SendBase) {
    SendBase = y
    /* SendBase–1: last cumulatively ACKed byte */
    if (there are currently not-yet-acked segments)
      start timer
    else stop timer
  }
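The three event handlers of the simplified sender can be sketched as follows. The class name `SimpleTcpSender`, the `wire` list standing in for "pass segment to IP", and the boolean timer flag are illustrative assumptions rather than anything defined in the slides.

```python
class SimpleTcpSender:
    """Simplified TCP sender: cumulative ACKs, single timer, no flow/congestion control."""

    def __init__(self, initial_seq: int):
        self.next_seq_num = initial_seq
        self.send_base = initial_seq
        self.unacked = {}            # seq of first byte -> segment data
        self.timer_running = False
        self.wire = []               # (seq, data) pairs handed to IP

    def on_app_data(self, data: bytes) -> None:
        seq = self.next_seq_num
        self.unacked[seq] = data
        self.wire.append((seq, data))
        self.next_seq_num += len(data)
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self) -> None:
        if not self.unacked:
            return
        seq = min(self.unacked)       # not-yet-acked segment with smallest seq #
        self.wire.append((seq, self.unacked[seq]))
        self.timer_running = True     # restart timer

    def on_ack(self, y: int) -> None:
        if y > self.send_base:
            self.send_base = y        # send_base - 1 is the last cumulatively ACKed byte
            for seq in [s for s, d in self.unacked.items() if s + len(d) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)


tcp = SimpleTcpSender(initial_seq=92)
tcp.on_app_data(b"x" * 8)     # Seq=92, 8 bytes
tcp.on_app_data(b"y" * 20)    # Seq=100, 20 bytes
tcp.on_timeout()              # retransmits Seq=92 only
tcp.on_ack(120)               # cumulative ACK covers both segments
```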

TCP: retransmission scenarios

[Three Host A / Host B timing diagrams.]

lost ACK scenario
  Host A: Seq=92, 8 bytes of data
  Host B: ACK=100 (X: lost)
  Host A: timeout; resend Seq=92, 8 bytes of data
  Host B: ACK=100

premature timeout
  Host A: Seq=92, 8 bytes of data
  Host A: Seq=100, 20 bytes of data
  Host B: ACK=100
  Host B: ACK=120
  Host A: timeout (before the ACKs arrive); resend Seq=92, 8 bytes of data
  Host B: ACK=120
  SendBase values marked in the figure: 92, 100, 120, 120

cumulative ACK
  Host A: Seq=92, 8 bytes of data
  Host A: Seq=100, 20 bytes of data
  Host B: ACK=100 (X: lost)
  Host B: ACK=120 (arrives before timeout)
  Host A: Seq=120, 15 bytes of data

TCP ACK generation [RFC 5681]

event at receiver → TCP receiver action

• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK

• arrival of in-order segment with expected seq #; one other segment has ACK pending
  → immediately send single cumulative ACK, ACKing both in-order segments

• arrival of out-of-order segment, higher-than-expected seq #; gap detected
  → immediately send duplicate ACK, indicating seq # of next expected byte

• arrival of segment that partially or completely fills gap
  → immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  – likely that unacked segment lost, so don't wait for timeout

[Figure: fast retransmit after sender receipt of triple duplicate ACK]
  Host A: Seq=92, 8 bytes of data
  Host A: Seq=100, 20 bytes of data (X: lost)
  Host B: ACK=100
  Host B: ACK=100
  Host B: ACK=100
  Host B: ACK=100
  Host A: 3 DUP ACKs received; resend Seq=100, 20 bytes of data (before timeout)
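The duplicate-ACK rule can be sketched as a counter over the stream of ACK numbers the sender sees. The function name `fast_retransmit_points` is a hypothetical helper for illustration. As in the figure, the first ACK for a value is the original and the next three identical ones are the duplicates that trigger the retransmission.

```python
def fast_retransmit_points(acks):
    """Return the indices in `acks` at which a fast retransmit would be triggered."""
    triggers = []
    last_ack = None
    dup_count = 0
    for i, ack in enumerate(acks):
        if ack == last_ack:
            dup_count += 1
            if dup_count == 3:        # third duplicate of the same ACK number
                triggers.append(i)
        else:
            last_ack = ack            # new ACK value: reset the duplicate counter
            dup_count = 0
    return triggers


# Figure scenario: ACK=100 arrives four times (one original, three duplicates).
print(fast_retransmit_points([100, 100, 100, 100]))
```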

TCP flow control

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

• application may remove data from TCP socket buffers ...
• ... slower than TCP receiver is delivering (sender is sending)

[Figure: receiver protocol stack. Segments arrive from sender through IP code and TCP code into TCP socket receiver buffers (OS), from which the application process reads.]

TCP flow control

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); data leaves to the application process.]
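A small sketch of the bookkeeping behind rwnd follows. The slides do not give a formula; the expression used here, free space = RcvBuffer − (LastByteRcvd − LastByteRead), is the usual textbook formulation and is stated as an assumption, as are the function names `advertised_window` and `sender_may_send`.

```python
def advertised_window(rcv_buffer: int, last_byte_rcvd: int, last_byte_read: int) -> int:
    """Free receive-buffer space the receiver would advertise as rwnd (bytes)."""
    buffered = last_byte_rcvd - last_byte_read   # data waiting for the application
    return rcv_buffer - buffered


def sender_may_send(last_byte_sent: int, last_byte_acked: int,
                    rwnd: int, nbytes: int) -> bool:
    """True if sending nbytes more keeps in-flight data within rwnd."""
    return (last_byte_sent + nbytes) - last_byte_acked <= rwnd


# 4096-byte buffer; the application has read 1000 of the 3000 bytes received.
rwnd = advertised_window(4096, last_byte_rcvd=3000, last_byte_read=1000)
print(rwnd, sender_may_send(1000, 500, rwnd, 1500))
```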

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

[Figure: client and server, each with an application layer above the network and the same connection state:]
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client

client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

client state: CLOSED; server state: LISTEN

1. client: choose init seq num, x; send TCP SYN msg [SYNbit=1, Seq=x]; client state → SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN [SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1]; server state → SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK [ACKbit=1, ACKnum=y+1]; this segment may contain client-to-server data; client state → ESTAB
4. server: received ACK(y) indicates client is live; server state → ESTAB

TCP 3-way handshake: FSM

closed → listen
  Socket connectionSocket = welcomeSocket.accept();  /  Λ (no action)

closed → SYN sent
  Socket clientSocket = new Socket("hostname", "port number");  /  SYN(seq=x)

listen → SYN rcvd
  SYN(x)  /  SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client

SYN sent → ESTAB
  SYNACK(seq=y, ACKnum=x+1)  /  ACK(ACKnum=y+1)

SYN rcvd → ESTAB
  ACK(ACKnum=y+1)  /  Λ (no action)

TCP: closing a connection

• client, server each close their side of connection
  – send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

TCP: closing a connection

client state: ESTAB; server state: ESTAB

1. client: clientSocket.close(); send FINbit=1, seq=x; client state → FIN_WAIT_1 (can no longer send but can receive data)
2. server: send ACKbit=1, ACKnum=x+1; server state → CLOSE_WAIT (can still send data)
3. client: on receiving the ACK, client state → FIN_WAIT_2 (wait for server close)
4. server: send FINbit=1, seq=y; server state → LAST_ACK (can no longer send data)
5. client: send ACKbit=1, ACKnum=y+1; client state → TIMED_WAIT (timed wait for 2 × max segment lifetime), then CLOSED
6. server: on receiving the ACK, server state → CLOSED

The "Two Army Problem"

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
• a top-10 problem!

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity

[Figure: Host A sends original data at rate λ_in; Host B receives throughput λ_out; unlimited shared output link buffers.]
[Plots: λ_out versus λ_in (both axes up to R/2); delay versus λ_in (up to R/2).]

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: λ_in = λ_out
  – transport-layer input includes retransmissions: λ'_in ≥ λ_in

[Figure: Host A, Host B; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out; finite shared output link buffers.]

idealization: perfect knowledge
• sender sends only when router buffers available
[Plot: λ_out versus λ_in, both axes up to R/2.]

idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
• when sending at R/2, some packets are retransmissions but asymptotic goodput is still R/2 (why?)
[Plot: λ_out versus λ'_in, both axes up to R/2.]

realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered!
[Plot: λ_out versus λ'_in, both axes up to R/2.]

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  – decreasing goodput

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

[Figure: Hosts A, B, C, D; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out; finite shared output link buffers.]

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!

[Plot: λ_out (throughput by blue traffic) versus λ'_in (offered load by Host A), axes up to R/2; annotation: bandwidth wastage for packets dropped at the 2nd router.]

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss

AIMD saw tooth behavior: probing for bandwidth

[Plot: cwnd (TCP sender congestion window size) versus time; additively increase window size ... until loss occurs (then cut window in half).]
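The sawtooth can be reproduced with a few lines. This is a deliberately crude sketch: the function name `aimd_trace` is hypothetical, cwnd is measured in MSS, and loss is modelled as occurring whenever cwnd reaches a fixed threshold `loss_at`, which is an assumption made only to get a repeatable trace.

```python
def aimd_trace(rounds: int, loss_at: float, start: float = 1.0):
    """cwnd (in MSS) per RTT: +1 MSS each RTT, halved after cwnd reaches loss_at."""
    cwnd = start
    trace = []
    for _ in range(rounds):
        trace.append(cwnd)
        if cwnd >= loss_at:
            cwnd = cwnd / 2       # multiplicative decrease after a loss
        else:
            cwnd += 1             # additive increase: 1 MSS per RTT
    return trace


print(aimd_trace(rounds=12, loss_at=8, start=4))
```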

TCP Congestion Control: details

• sender limits transmission:

$\text{LastByteSent} - \text{LastByteAcked} \le \text{cwnd}$

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

$\text{rate} \approx \frac{\text{cwnd}}{RTT}\ \text{bytes/sec}$

[Figure: sender sequence number space: last byte ACKed | sent, not-yet ACKed ("in-flight") | last byte sent; the in-flight span is cwnd.]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A sends one segment, then two segments, then four segments to Host B in successive RTTs.]

TCP: detecting, reacting to loss

• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

initially (Λ):
  cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0   [enter slow start]

slow start
  new ACK → cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK → dupACKcount++
  cwnd > ssthresh → Λ   [go to congestion avoidance]
  dupACKcount == 3 → ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment   [go to fast recovery]
  timeout → ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment

congestion avoidance
  new ACK → cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK → dupACKcount++
  dupACKcount == 3 → ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment   [go to fast recovery]
  timeout → ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment   [go to slow start]

fast recovery
  duplicate ACK → cwnd = cwnd + MSS; transmit new segment(s), as allowed
  new ACK → cwnd = ssthresh; dupACKcount = 0   [go to congestion avoidance]
  timeout → ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment   [go to slow start]
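The state machine summarized above can be sketched as a class that tracks only cwnd, ssthresh, and the duplicate-ACK count. The class name `RenoCwnd` and its method names are hypothetical; actual transmission and retransmission are omitted, and "ssthresh + 3" is interpreted as ssthresh + 3·MSS, which is an assumption about units.

```python
class RenoCwnd:
    """Congestion-window bookkeeping for the slow start / CA / fast recovery FSM."""

    def __init__(self, mss: float = 1.0, ssthresh: float = 64 * 1024):
        self.mss = mss
        self.cwnd = mss
        self.ssthresh = ssthresh
        self.dup_acks = 0
        self.state = "slow_start"

    def on_new_ack(self) -> None:
        if self.state == "slow_start":
            self.cwnd += self.mss                           # doubles cwnd per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        elif self.state == "congestion_avoidance":
            self.cwnd += self.mss * (self.mss / self.cwnd)  # about +1 MSS per RTT
        else:                                               # fast recovery
            self.cwnd = self.ssthresh
            self.state = "congestion_avoidance"
        self.dup_acks = 0

    def on_duplicate_ack(self) -> None:
        if self.state == "fast_recovery":
            self.cwnd += self.mss
            return
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss        # assumption: "+ 3" means 3 MSS
            self.state = "fast_recovery"

    def on_timeout(self) -> None:
        self.ssthresh = self.cwnd / 2
        self.cwnd = self.mss
        self.dup_acks = 0
        self.state = "slow_start"


reno = RenoCwnd(mss=1.0, ssthresh=8.0)
for _ in range(8):
    reno.on_new_ack()          # slow start until cwnd exceeds ssthresh
for _ in range(3):
    reno.on_duplicate_ack()    # triple duplicate ACK: enter fast recovery
```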

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is ¾ W
  – avg. throughput is ¾ W per RTT

$\text{avg TCP throughput} = \frac{3}{4}\cdot\frac{W}{RTT}\ \text{bytes/sec}$

[Plot: window size oscillating between W/2 and W, with average ¾ W.]

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

$\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \sqrt{L}}$

• to achieve 10 Gbps throughput, need a loss rate of L = 2·10^{-10}, a very small loss rate!
• new versions of TCP for high-speed
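The numbers in this example can be reproduced from the formula. This is a hedged sketch; the function names are hypothetical, MSS is converted from bytes to bits, and throughput is taken in bits per second.

```python
import math


def mathis_throughput_bps(mss_bytes: float, rtt_s: float, loss_prob: float) -> float:
    """Throughput = 1.22 * MSS / (RTT * sqrt(L)), in bits per second."""
    return 1.22 * mss_bytes * 8 / (rtt_s * math.sqrt(loss_prob))


def required_loss_prob(mss_bytes: float, rtt_s: float, target_bps: float) -> float:
    """Invert the formula: the loss probability that yields target_bps."""
    return (1.22 * mss_bytes * 8 / (rtt_s * target_bps)) ** 2


def window_segments(mss_bytes: float, rtt_s: float, target_bps: float) -> float:
    """Segments that must be in flight to fill target_bps for one RTT."""
    return target_bps * rtt_s / (mss_bytes * 8)


# Slide example: 1500-byte segments, 100 ms RTT, 10 Gbps target.
loss = required_loss_prob(1500, 0.100, 10e9)
window = window_segments(1500, 0.100, 10e9)
print(f"L = {loss:.3e}, W = {window:.0f} segments")
```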

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R.]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Plot: Connection 2 throughput versus Connection 1 throughput, both axes up to R. The "equal bandwidth share" line is where X2 = Y2; the "full bandwidth utilization" line is where X1 + Y1 = R. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2".]

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2
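The parallel-connection arithmetic can be checked with exact fractions. This sketch assumes every TCP connection gets an equal share of the bottleneck, which is the idealized fairness goal; the function name `app_share` is hypothetical.

```python
from fractions import Fraction


def app_share(existing_connections: int, new_connections: int) -> Fraction:
    """Fraction of link rate R obtained by an app opening `new_connections`,
    assuming every connection on the link gets an equal share."""
    return Fraction(new_connections, existing_connections + new_connections)


print(app_share(9, 1))    # one connection among ten
print(app_share(9, 11))   # eleven connections among twenty
```

Under this assumption the second case works out to 11/20 of R, slightly more than the R/2 quoted as a round figure.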

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and is marked ECN=11 by a router; the TCP ACK segment returns with ECE=1.]


UDP checksum

Goal: detect “errors” (e.g., flipped bits) in transmitted segment

sender:
• treat segment contents, including header fields, as sequence of 16-bit integers
• checksum: addition (one’s complement sum) of segment contents
• sender puts checksum value into UDP checksum field

receiver:
• compute checksum of received segment
• check if computed checksum equals checksum field value:
  – NO: error detected
  – YES: no error detected. But maybe errors nonetheless? More later …

Internet checksum example

example: add two 16-bit integers

                1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
                1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
wraparound    1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1
sum             1 0 1 1 1 0 1 1 1 0 1 1 1 1 0 0
checksum        0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 1

Note: when adding numbers, a carryout from the most significant bit needs to be added to the result
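The wraparound rule in the note above can be sketched in a few lines. This is an illustrative implementation over a list of 16-bit words, not production code; the function names are made up here.

```python
def ones_complement_sum16(words):
    """Add 16-bit words, folding any carry out of bit 15 back into the sum."""
    total = 0
    for word in words:
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # wraparound
    return total


def internet_checksum(words):
    """Checksum is the one's complement (bit inversion) of the wrapped sum."""
    return ~ones_complement_sum16(words) & 0xFFFF


# The two integers from the slide example.
a = 0b1110011001100110
b = 0b1101010101010101
print(f"{ones_complement_sum16([a, b]):016b}")
print(f"{internet_checksum([a, b]):016b}")
```

A receiver can run the same sum over the data words plus the checksum field; if nothing was corrupted, the result is all ones (0xFFFF).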

Principles of reliable data transfer

• important in application, transport, link layers
  – top-10 list of important networking topics!
• characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)

Reliable data transfer: getting started

[Figure: send side and receive side of the rdt protocol]

• rdt_send(): called from above (e.g., by app). Passed data to deliver to receiver upper layer
• udt_send(): called by rdt, to transfer packet over unreliable channel to receiver
• rdt_rcv(): called when packet arrives on rcv-side of channel
• deliver_data(): called by rdt to deliver data to upper layer

Reliable data transfer: getting started

We’ll:
• incrementally develop sender, receiver sides of reliable data transfer protocol (rdt)
• consider only unidirectional data transfer
  – but control info will flow in both directions!
• use finite state machines (FSMs) to specify sender, receiver

FSM notation: each transition from state 1 to state 2 is labeled with the event causing the state transition and the actions taken on the state transition (event / actions).

state: when in this “state”, next state is uniquely determined by next event

rdt1.0: reliable transfer over a reliable channel

• underlying channel perfectly reliable
  – no bit errors
  – no loss of packets
• separate FSMs for sender, receiver:
  – sender sends data into underlying channel
  – receiver reads data from underlying channel

sender FSM:
  state: Wait for call from above
    event:   rdt_send(data)
    actions: packet = make_pkt(data); udt_send(packet)

receiver FSM:
  state: Wait for call from below
    event:   rdt_rcv(packet)
    actions: extract(packet, data); deliver_data(data)

rdt2.0: channel with bit errors

• underlying channel may flip bits in packet
  – checksum to detect bit errors
• the question: how to recover from errors?
  – acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
  – negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
  – sender retransmits pkt on receipt of NAK
• new mechanisms in rdt2.0 (beyond rdt1.0):
  – error detection
  – receiver feedback: control msgs (ACK, NAK) rcvr -> sender

How do humans recover from “errors” during conversation?

rdt2.0: FSM specification

sender FSM:
  state: Wait for call from above
    event:   rdt_send(data)
    actions: sndpkt = make_pkt(data, checksum); udt_send(sndpkt)
    next:    Wait for ACK or NAK
  state: Wait for ACK or NAK
    event:   rdt_rcv(rcvpkt) && isNAK(rcvpkt)
    actions: udt_send(sndpkt)
    next:    Wait for ACK or NAK
    event:   rdt_rcv(rcvpkt) && isACK(rcvpkt)
    actions: Λ
    next:    Wait for call from above

receiver FSM:
  state: Wait for call from below
    event:   rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    actions: udt_send(NAK)
    event:   rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
    actions: extract(rcvpkt, data); deliver_data(data); udt_send(ACK)

rdt2.0: operation with no errors

[Figure: the rdt2.0 sender and receiver FSMs, repeated from the FSM specification slide]

rdt2.0: error scenario

[Figure: the rdt2.0 sender and receiver FSMs, repeated from the FSM specification slide]

rdt2.0 has a fatal flaw!

what happens if ACK/NAK corrupted?
• sender doesn’t know what happened at receiver!
• can’t just retransmit: possible duplicate

handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn’t deliver up) duplicate pkt

stop and wait: sender sends one packet, then waits for receiver response

rdt2.1: sender, handles garbled ACK/NAKs

  state: Wait for call 0 from above
    event:   rdt_send(data)
    actions: sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
    next:    Wait for ACK or NAK 0
  state: Wait for ACK or NAK 0
    event:   rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
    actions: udt_send(sndpkt)
    next:    Wait for ACK or NAK 0
    event:   rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
    actions: Λ
    next:    Wait for call 1 from above
  state: Wait for call 1 from above
    event:   rdt_send(data)
    actions: sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt)
    next:    Wait for ACK or NAK 1
  state: Wait for ACK or NAK 1
    event:   rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
    actions: udt_send(sndpkt)
    next:    Wait for ACK or NAK 1
    event:   rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
    actions: Λ
    next:    Wait for call 0 from above

rdt2.1: receiver, handles garbled ACK/NAKs

  state: Wait for 0 from below
    event:   rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
    actions: extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
    next:    Wait for 1 from below
    event:   rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    actions: sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
    next:    Wait for 0 from below
    event:   rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
    actions: sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
    next:    Wait for 0 from below
  state: Wait for 1 from below
    event:   rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
    actions: extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
    next:    Wait for 0 from below
    event:   rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    actions: sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
    next:    Wait for 1 from below
    event:   rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
    actions: sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
    next:    Wait for 1 from below

rdt2.1: Example 1

[Figures: the rdt2.1 sender FSM (Wait for call 0 from above, Wait for ACK or NAK 0, Wait for call 1 from above, Wait for ACK or NAK 1) and receiver FSM (Wait for 0 from below, Wait for 1 from below), stepped through one transition per slide]

1. sender: rdt_send(data); sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
2. receiver: rdt_rcv(rcvpkt) && corrupt(rcvpkt); sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)); udt_send(sndpkt)
4. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
5. sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt); Λ

rdt2.1: Example 2

[Figures: the rdt2.1 sender and receiver FSMs, stepped through one transition per slide]

1. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
2. sender: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)); udt_send(sndpkt)
3. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
4. sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt); Λ

rdt2.1: discussion

sender:
• seq # added to pkt
• two seq. #’s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  – state must “remember” whether “expected” pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  – state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender

rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  – receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt

rdt2.2: sender, receiver fragments

sender FSM fragment:
  state: Wait for call 0 from above
    event:   rdt_send(data)
    actions: sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
    next:    Wait for ACK 0
  state: Wait for ACK 0
    event:   rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1))
    actions: udt_send(sndpkt)
    event:   rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0)
    actions: Λ

receiver FSM fragment:
  transition into state Wait for 0 from below:
    event:   rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
    actions: extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
  state: Wait for 0 from below
    event:   rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt))
    actions: udt_send(sndpkt)

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
  – checksum, seq. #, ACKs, retransmissions will be of help … but not enough

approach: sender waits “reasonable” amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  – retransmission will be duplicate, but seq. #’s already handle this
  – receiver must specify seq # of pkt being ACKed
• requires countdown timer
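The timeout-and-retransmit approach can be illustrated with a small deterministic simulation. This is a sketch under stated assumptions: losses are scripted by transmission index rather than random, there is no corruption or delay, and the function and parameter names (`run_stop_and_wait`, `drop_data`, `drop_ack`) are invented for this example.

```python
def run_stop_and_wait(messages, drop_data=(), drop_ack=()):
    """Alternating-bit sender and receiver over a channel with scripted losses.

    drop_data / drop_ack hold the 1-based indices of data and ACK
    transmissions that the channel loses.
    """
    delivered, log = [], []
    seq, expected = 0, 0          # sender's current bit, receiver's expected bit
    data_tx = ack_tx = 0
    for msg in messages:
        while True:
            data_tx += 1
            log.append(f"send pkt{seq}")
            if data_tx in drop_data:
                log.append("timeout")        # packet lost: no ACK ever arrives
                continue
            if seq == expected:              # new packet: deliver it
                delivered.append(msg)
                expected ^= 1
            # A duplicate is not delivered again, but is still ACKed.
            ack_tx += 1
            if ack_tx in drop_ack:
                log.append("timeout")        # ACK lost: sender retransmits
                continue
            log.append(f"rcv ack{seq}")
            break
        seq ^= 1
    return delivered, log


delivered, log = run_stop_and_wait(["a", "b", "c"], drop_data={2}, drop_ack={2})
print(delivered)
print(log)
```

Even with one lost data packet and one lost ACK, each message is delivered exactly once, because the one-bit sequence number lets the receiver recognise the retransmission as a duplicate.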

rdt3.0 sender

  state: Wait for call 0 from above
    event:   rdt_send(data)
    actions: sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer
    next:    Wait for ACK0
    event:   rdt_rcv(rcvpkt)
    actions: Λ
  state: Wait for ACK0
    event:   rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1))
    actions: Λ
    event:   timeout
    actions: udt_send(sndpkt); start_timer
    event:   rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0)
    actions: stop_timer
    next:    Wait for call 1 from above
  state: Wait for call 1 from above
    event:   rdt_send(data)
    actions: sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer
    next:    Wait for ACK1
    event:   rdt_rcv(rcvpkt)
    actions: Λ
  state: Wait for ACK1
    event:   rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0))
    actions: Λ
    event:   timeout
    actions: udt_send(sndpkt); start_timer
    event:   rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1)
    actions: stop_timer
    next:    Wait for call 0 from above

rdt3.0 in action

[Figures: four sender/receiver timelines]

(a) no loss
1. sender: send pkt0; receiver: rcv pkt0, send ack0
2. sender: rcv ack0, send pkt1; receiver: rcv pkt1, send ack1
3. sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0

(b) packet loss
1. sender: send pkt0; receiver: rcv pkt0, send ack0
2. sender: rcv ack0, send pkt1; pkt1 lost (X)
3. sender: timeout, resend pkt1; receiver: rcv pkt1, send ack1
4. sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0

(c) ACK loss
1. sender: send pkt0; receiver: rcv pkt0, send ack0
2. sender: rcv ack0, send pkt1; receiver: rcv pkt1, send ack1; ack1 lost (X)
3. sender: timeout, resend pkt1; receiver: rcv pkt1 (detect duplicate), send ack1
4. sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
1. sender: send pkt0; receiver: rcv pkt0, send ack0
2. sender: rcv ack0, send pkt1; receiver: rcv pkt1, send ack1 (ack1 delayed)
3. sender: timeout, resend pkt1; receiver: rcv pkt1 (detect duplicate), send ack1
4. sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0
5. sender: rcv ack1 (second copy), do nothing
6. sender: rcv ack0, send pkt1

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

$D_{trans} = \frac{L}{R} = \frac{8000 \text{ bits}}{10^9 \text{ bits/sec}} = 8 \text{ microsecs}$

• U_sender: utilization, the fraction of time sender is busy sending

$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$

• if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
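The utilization figure can be reproduced numerically. The sketch below also takes a `window` parameter, an addition of this example, to show what happens when several packets are sent back to back before waiting for an ACK; the function name is illustrative.

```python
def sender_utilization(packet_bits, rate_bps, rtt_s, window=1):
    """Fraction of time the sender is busy: window * (L/R) / (RTT + L/R), capped at 1."""
    d_trans = packet_bits / rate_bps
    return min(1.0, window * d_trans / (rtt_s + d_trans))


# Slide example: 8000-bit packet, 1 Gbps link, RTT = 30 ms.
stop_and_wait = sender_utilization(8000, 1e9, 0.030)
three_in_flight = sender_utilization(8000, 1e9, 0.030, window=3)
print(f"stop-and-wait: {stop_and_wait:.5f}")
print(f"3 packets in flight: {three_in_flight:.5f}")
```

Stop-and-wait keeps the sender busy for roughly 0.027% of the time; allowing three packets in flight triples that.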

rdt3.0: stop-and-wait operation

[Figure: sender/receiver timeline]
• t = 0: first packet bit transmitted
• t = L/R: last packet bit transmitted
• first packet bit arrives at receiver
• last packet bit arrives, send ACK
• t = RTT + L/R: ACK arrives, send next packet

$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$

Pipelined protocols

pipelining: sender allows multiple, “in-flight”, yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver

• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

[Figure: sender/receiver timeline]
• t = 0: first packet bit transmitted
• t = L/R: last bit transmitted
• first packet bit arrives at receiver
• last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• t = RTT + L/R: ACK arrives, send next packet

3-packet pipelining increases utilization by a factor of 3!

$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} = 0.0008$

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  – doesn’t ack packet if there’s a gap
• sender has timer for oldest unacked packet
  – when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  – when timer expires, retransmit only that unacked packet

Go-Back-N: sender

• k-bit seq # in pkt header
• “window” of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: “cumulative ACK”
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

initial (Λ):
  base = 1
  nextseqnum = 1

state: Wait

event: rdt_send(data)
  if (nextseqnum < base+N) {
      sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
      udt_send(sndpkt[nextseqnum])
      if (base == nextseqnum)
          start_timer
      nextseqnum++
  }
  else
      refuse_data(data)

event: timeout
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  …
  udt_send(sndpkt[nextseqnum-1])

event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
      stop_timer
  else
      start_timer

event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
  Λ
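The extended FSM above maps fairly directly onto a small class. This is a sketch, not a network implementation: `udt_send` is replaced by appending sequence numbers to a list, the timer is a boolean flag, corruption checks are omitted, and the class and attribute names are chosen for this example.

```python
class GbnSender:
    """Go-Back-N sender state, following the FSM's base/nextseqnum variables."""

    def __init__(self, window):
        self.N = window
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}            # seq number -> buffered data
        self.timer_running = False
        self.sent = []              # stand-in for udt_send: seq numbers put on the wire

    def rdt_send(self, data):
        if self.nextseqnum < self.base + self.N:
            self.sndpkt[self.nextseqnum] = data
            self.sent.append(self.nextseqnum)
            if self.base == self.nextseqnum:
                self.timer_running = True     # start_timer for oldest unacked pkt
            self.nextseqnum += 1
            return True
        return False                          # refuse_data: window is full

    def on_ack(self, acknum):
        # Cumulative ACK: everything up to and including acknum is acknowledged.
        self.base = acknum + 1
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        # Go back N: resend every packet that is still unacknowledged.
        self.timer_running = True
        for n in range(self.base, self.nextseqnum):
            self.sent.append(n)


sender = GbnSender(window=4)
accepted = [sender.rdt_send(ch) for ch in "abcde"]   # fifth call is refused
sender.on_ack(2)                                      # slides window to base = 3
sender.rdt_send("e")
sender.on_timeout()
print(accepted, sender.sent)
```

After the timeout, packets 3, 4 and 5 are all resent, which is the defining cost of Go-Back-N compared with Selective Repeat.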

GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don’t buffer): no receiver buffering!
  – re-ACK pkt with highest in-order seq #

initial (Λ):
  expectedseqnum = 1
  sndpkt = make_pkt(0, ACK, chksum)

state: Wait

event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum)
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++

event: default
  udt_send(sndpkt)

GBN in action

[Figure: sender/receiver timeline; sender window (N=4) sliding over sequence numbers 0 1 2 3 4 5 6 7 8]

sender:
• send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
• rcv ack0, send pkt4
• rcv ack1, send pkt5
• ignore duplicate ACK
• pkt 2 timeout: send pkt2, send pkt3, send pkt4, send pkt5

receiver:
• receive pkt0, send ack0
• receive pkt1, send ack1
• receive pkt3, discard, (re)send ack1
• receive pkt4, discard, (re)send ack1
• receive pkt5, discard, (re)send ack1
• rcv pkt2, deliver, send ack2
• rcv pkt3, deliver, send ack3
• rcv pkt4, deliver, send ack4
• rcv pkt5, deliver, send ack5

Selective repeat

• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #’s
  – limits seq #s of sent, unACKed packets

Selective repeat: sender, receiver windows

[Figure: sender and receiver views of the sequence number space]

Selective repeat

sender:
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver:
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
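The receiver rules above can be sketched as a small class. Assumptions made for simplicity and not taken from the slides: sequence numbers are unbounded integers starting at 0 (no wraparound), packets are never corrupted, and the class and method names are invented here.

```python
class SrReceiver:
    """Selective Repeat receiver: buffer out-of-order packets, ACK individually."""

    def __init__(self, window):
        self.N = window
        self.rcv_base = 0
        self.buffer = {}       # seq number -> data waiting for in-order delivery
        self.delivered = []    # data handed to the upper layer, in order

    def on_packet(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcv_base <= n <= self.rcv_base + self.N - 1:
            self.buffer[n] = data
            # Deliver any run of in-order packets and advance the window.
            while self.rcv_base in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcv_base))
                self.rcv_base += 1
            return n
        if self.rcv_base - self.N <= n <= self.rcv_base - 1:
            return n           # already delivered: re-ACK so the sender can move on
        return None            # outside both ranges: ignore


rcv = SrReceiver(window=4)
acks = [rcv.on_packet(n, f"d{n}") for n in (0, 1, 3, 4, 5, 2)]
print(acks, rcv.delivered)
```

Packets 3, 4 and 5 are ACKed on arrival but only delivered once the missing packet 2 shows up.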

Selective repeat in action

[Figure: sender/receiver timeline; sender window (N=4) sliding over sequence numbers 0 1 2 3 4 5 6 7 8]

sender:
• send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
• rcv ack0, send pkt4
• rcv ack1, send pkt5
• record ack3 arrived
• pkt 2 timeout: send pkt2
• record ack4 arrived
• record ack5 arrived

receiver:
• receive pkt0, send ack0
• receive pkt1, send ack1
• receive pkt3, buffer, send ack3
• receive pkt4, buffer, send ack4
• receive pkt5, buffer, send ack5
• rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?

Selective repeat: dilemma

example:
• seq #’s: 0, 1, 2, 3
• window size = 3

[Figure: two scenarios, each showing the sender window (after receipt) and receiver window (after receipt) over the sequence 0 1 2 3 0 1 2]

(a) no problem
• sender sends pkt0, pkt1, pkt2; receiver ACKs each and advances its window
• sender sends pkt3, which is lost (X)
• sender sends pkt0 (new data); receiver will accept packet with seq number 0

(b) oops!
• sender sends pkt0, pkt1, pkt2; receiver ACKs each and advances its window
• all three ACKs are lost (X X X)
• timeout: sender retransmits pkt0; receiver will accept packet with seq number 0

receiver can’t see sender side. receiver behavior identical in both cases! something’s (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
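One way to explore the question is to check, for a given sequence-number space and window size, whether the receiver's advanced window can contain a number the sender might still retransmit. The sketch below is an illustration written for this purpose, with a hypothetical function name. It reproduces the standard textbook answer: the window size must be at most half the size of the sequence-number space.

```python
def sr_is_ambiguous(seq_space, window):
    """True if an old retransmission can be mistaken for a new packet.

    Worst case: the receiver got a full window (numbers 0..window-1) and advanced,
    but every ACK was lost, so the sender may still retransmit any of them.
    """
    may_be_retransmitted = {n % seq_space for n in range(0, window)}
    accepted_as_new = {n % seq_space for n in range(window, 2 * window)}
    return bool(may_be_retransmitted & accepted_as_new)


print(sr_is_ambiguous(seq_space=4, window=3))  # the slide's example
print(sr_is_ambiguous(seq_space=4, window=2))
```

With 4 sequence numbers, a window of 3 is ambiguous (scenario (b) on the slide) while a window of 2 is not.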

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no “message boundaries”
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver

TCP segment structure

[Figure: TCP segment layout, 32 bits wide]
• source port #, dest port #
• sequence number
• acknowledgement number
• head len, not used, flag bits U A P R S F, receive window
• checksum, Urg data pointer
• options (variable length)
• application data (variable length)

field notes:
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• checksum: Internet checksum (as in UDP)

TCP seq. numbers, ACKs

sequence numbers:
  – byte stream “number” of first byte in segment’s data
acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK
Q: how receiver handles out-of-order segments
  – A: TCP spec doesn’t say; up to implementor

[Figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number (A bit), rwnd, checksum, urg pointer. Sender sequence number space, window size N: sent and ACKed; sent, not-yet ACKed (“in-flight”); usable but not yet sent; not usable]

Byte stream in TCP

[Figure: an HTTP GET message (K bytes) in the byte stream; a window of N bytes begins at the 100th byte; a segment with TCP header (seq no = 100) carries M bytes; the remainder of the HTTP GET message cannot be transmitted now]

TCP seq. numbers, ACKs

simple telnet scenario (Host A, Host B):
1. User types ‘C’. A → B: Seq=42, ACK=79, data = ‘C’
2. host ACKs receipt of ‘C’, echoes back ‘C’. B → A: Seq=79, ACK=43, data = ‘C’
3. host ACKs receipt of echoed ‘C’. A → B: Seq=43, ACK=80

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT “smoother”
  – average several recent measurements, not just current SampleRTT

$\text{EstimatedRTT} = (1 - \alpha)\cdot\text{EstimatedRTT} + \alpha\cdot\text{SampleRTT}$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr; RTT (milliseconds, 100 to 350) versus time (seconds, 1 to 106); curves: SampleRTT and EstimatedRTT]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus “safety margin”
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

$\text{DevRTT} = (1 - \beta)\cdot\text{DevRTT} + \beta\cdot|\text{SampleRTT} - \text{EstimatedRTT}|$

(typically, β = 0.25)

$\text{TimeoutInterval} = \text{EstimatedRTT} + 4\cdot\text{DevRTT}$

(first term: estimated RTT; second term: “safety margin”)
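The two moving averages and the timeout formula can be combined into a small estimator. This is a sketch with assumptions the slides do not specify: the estimate is seeded with the first sample, DevRTT starts at zero, and DevRTT is updated using the EstimatedRTT value from before the current sample is folded in. The class name is hypothetical.

```python
class RttEstimator:
    """EWMA-based RTT estimate with a deviation-based safety margin."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample  # assumption: seed with first measurement
        self.dev_rtt = 0.0                 # assumption: no deviation observed yet

    def update(self, sample_rtt):
        # Deviation first, measured against the estimate before this sample.
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    @property
    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)   # milliseconds
est.update(200.0)                        # one slow sample
print(est.estimated_rtt, est.dev_rtt, est.timeout_interval)
```

A single outlier moves the estimate only slightly (α is small) but widens the safety margin noticeably, which is the intended behaviour when RTT is variable.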

TCP reliable data transfer

• TCP creates rdt service on top of IP’s unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks

let’s initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control

TCP sender events

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments

TCP sender (simplified)

initial (Λ):
  NextSeqNum = InitialSeqNum
  SendBase = InitialSeqNum

state: wait for event

event: data received from application above
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., “send”)
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
      start timer

event: timeout
  retransmit not-yet-acked segment with smallest seq. #
  start timer

event: ACK received, with ACK field value y
  if (y > SendBase) {
      SendBase = y
      /* SendBase–1: last cumulatively ACKed byte */
      if (there are currently not-yet-acked segments)
          start timer
      else stop timer
  }
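The three events of the simplified sender can be sketched as methods on a class. This is an illustration only: passing a segment to IP is modelled by appending to a list, the timer is a boolean, and the class and attribute names are made up for this example.

```python
class SimplifiedTcpSender:
    """Simplified TCP sender: cumulative ACKs, single timer, no flow/congestion control."""

    def __init__(self, initial_seq):
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.unacked = []           # (seq, data) pairs, oldest first
        self.timer_running = False
        self.wire = []              # stand-in for IP: (seq, length) of each transmission

    def on_app_data(self, data):
        self.unacked.append((self.next_seq, data))
        self.wire.append((self.next_seq, len(data)))
        self.next_seq += len(data)
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if self.unacked:
            seq, data = self.unacked[0]     # not-yet-acked segment with smallest seq #
            self.wire.append((seq, len(data)))
            self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y
            # Drop every segment whose bytes are all below y.
            self.unacked = [(s, d) for s, d in self.unacked if s + len(d) > y]
            self.timer_running = bool(self.unacked)


tcp = SimplifiedTcpSender(initial_seq=92)
tcp.on_app_data(b"12345678")        # Seq=92, 8 bytes
tcp.on_app_data(b"x" * 20)          # Seq=100, 20 bytes
tcp.on_timeout()                    # resend Seq=92 only
tcp.on_ack(120)                     # cumulative ACK covers both segments
print(tcp.wire, tcp.send_base, tcp.timer_running)
```

Note that the timeout resends only the oldest segment, unlike Go-Back-N, and that one cumulative ACK can acknowledge several segments at once.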

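TCP: retransmission scenarios

[Figures: three Host A / Host B timelines]

lost ACK scenario:
1. A → B: Seq=92, 8 bytes of data
2. B → A: ACK=100, lost (X)
3. timeout at A. A → B: Seq=92, 8 bytes of data
4. B → A: ACK=100

premature timeout:
1. A → B: Seq=92, 8 bytes of data (SendBase=92)
2. A → B: Seq=100, 20 bytes of data
3. B → A: ACK=100, then ACK=120
4. timeout at A before the ACKs arrive. A → B: Seq=92, 8 bytes of data
5. A receives ACK=100 (SendBase=100), then ACK=120 (SendBase=120)
6. B → A: ACK=120 for the retransmitted segment (SendBase=120)

cumulative ACK:
1. A → B: Seq=92, 8 bytes of data
2. A → B: Seq=100, 20 bytes of data
3. B → A: ACK=100, lost (X)
4. B → A: ACK=120, arrives before timeout
5. A → B: Seq=120, 15 bytes of data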

TCP ACK generation [RFC 5681]

event at receiver → TCP receiver action

• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected
  → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap
  → immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data (“triple duplicate ACKs”), resend unacked segment with smallest seq #
• likely that unacked segment lost, so don’t wait for timeout

[Figure: fast retransmit after sender receipt of triple duplicate ACK (Host A, Host B)]
1. A → B: Seq=92, 8 bytes of data
2. A → B: Seq=100, 20 bytes of data, lost (X)
3. B → A: ACK=100, followed by three more ACK=100 (3 DUP ACKs)
4. A → B: Seq=100, 20 bytes of data, resent before the timeout expires
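The duplicate-ACK trigger can be sketched as a counter. This example follows the figure, where the original ACK=100 is followed by three duplicates; the class name and the choice to return a boolean are assumptions made for illustration.

```python
class DupAckDetector:
    """Counts duplicate ACKs and signals when fast retransmit should fire."""

    def __init__(self, send_base):
        self.send_base = send_base
        self.dup_acks = 0

    def on_ack(self, y):
        """Return True exactly when the third duplicate ACK arrives."""
        if y > self.send_base:
            # New data acknowledged: advance and reset the duplicate count.
            self.send_base = y
            self.dup_acks = 0
            return False
        self.dup_acks += 1
        return self.dup_acks == 3


detector = DupAckDetector(send_base=92)
signals = [detector.on_ack(100) for _ in range(4)]
print(signals)
```

The first ACK=100 acknowledges new data; the next three are duplicates, and the third one triggers the retransmission of the segment starting at byte 100.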

TCP flow control

flow control: receiver controls sender, so sender won’t overflow receiver’s buffer by transmitting too much, too fast

[Figure: receiver protocol stack. Segments arrive from sender through IP code and TCP code (OS) into TCP socket receiver buffers, from which the application process reads. Application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)]

• receiver “advertises” free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked (“in-flight”) data to receiver’s rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which is split into buffered data and free buffer space (rwnd); buffered data goes to the application process]
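The receiver-side bookkeeping can be sketched with two small functions. The variable names `last_byte_rcvd`, `last_byte_read`, `last_byte_sent` and `last_byte_acked` are assumptions borrowed from the usual textbook presentation; they do not appear on the slides.

```python
def advertised_rwnd(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free buffer space: total buffer minus data received but not yet read by the app."""
    return rcv_buffer - (last_byte_rcvd - last_byte_read)


def sender_may_send(nbytes, last_byte_sent, last_byte_acked, rwnd):
    """Sender keeps in-flight (unacked) data within the receiver's advertised window."""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + nbytes <= rwnd


rwnd = advertised_rwnd(rcv_buffer=4096, last_byte_rcvd=3000, last_byte_read=1000)
print(rwnd)
print(sender_may_send(1000, last_byte_sent=5000, last_byte_acked=4000, rwnd=rwnd))
print(sender_may_send(1200, last_byte_sent=5000, last_byte_acked=4000, rwnd=rwnd))
```

With 2000 bytes still unread in a 4096-byte buffer, the receiver advertises 2096 bytes; a sender with 1000 bytes already in flight may add 1000 more but not 1200.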

Connection Management

before exchanging data, sender/receiver “handshake”:
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

[Figure: client and server, each with an application above the network and the same connection state]
connection state: ESTAB
connection variables:
  – seq # client-to-server, server-to-client
  – rcvBuffer size at server, client

client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

client state: CLOSED; server state: LISTEN

1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x). client state: SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1). server state: SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data. client state: ESTAB
4. server: received ACK(y) indicates client is live. server state: ESTAB
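The sequence and acknowledgement numbers exchanged in the handshake can be traced with a short function. This is a sketch that only builds the three messages as dictionaries; no sockets are involved, and the function name and dictionary keys are chosen for illustration.

```python
def three_way_handshake(x, y):
    """Return the three handshake segments given client ISN x and server ISN y."""
    syn = {"from": "client", "SYN": 1, "seq": x}
    # Server acknowledges the SYN by asking for byte x + 1 next.
    synack = {"from": "server", "SYN": 1, "ACK": 1, "seq": y, "acknum": syn["seq"] + 1}
    # Client acknowledges the SYNACK by asking for byte y + 1 next.
    ack = {"from": "client", "ACK": 1, "acknum": synack["seq"] + 1}
    return [syn, synack, ack]


for segment in three_way_handshake(x=42, y=79):
    print(segment)
```

Each side's ACK number is the other side's initial sequence number plus one, because the SYN itself consumes one sequence number.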

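TCP 3-way handshake: FSM

client side:
  closed → SYN sent
    event:   Socket clientSocket = new Socket("hostname", "port number");
    actions: SYN(seq=x)
  SYN sent → ESTAB
    event:   SYNACK(seq=y, ACKnum=x+1)
    actions: ACK(ACKnum=y+1)

server side:
  closed → listen
    event:   Socket connectionSocket = welcomeSocket.accept();
    actions: Λ
  listen → SYN rcvd
    event:   SYN(x)
    actions: SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
  SYN rcvd → ESTAB
    event:   ACK(ACKnum=y+1)
    actions: Λ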

TCP: closing a connection

• client, server each close their side of connection
  – send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

[Figure: client/server timeline; client state: ESTAB, server state: ESTAB]
1. client: clientSocket.close(); send FINbit=1, seq=x. client state: FIN_WAIT_1 (can no longer send but can receive data)
2. server: send ACKbit=1, ACKnum=x+1. server state: CLOSE_WAIT (can still send data). client state: FIN_WAIT_2 (wait for server close)
3. server: send FINbit=1, seq=y. server state: LAST_ACK (can no longer send data)
4. client: send ACKbit=1, ACKnum=y+1. client state: TIMED_WAIT (timed wait for 2*max segment lifetime), then CLOSED
5. server: on receiving the ACK, server state: CLOSED

The "Two Army Problem"

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate λin approaches capacity

[Figure: Host A sends original data at rate λin through a router with unlimited shared output link buffers; Host B receives at throughput λout. Plots: λout vs. λin (both axes up to R/2) and delay vs. λin (up to R/2).]

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λin = λout
  - transport-layer input includes retransmissions: λ'in ≥ λin

[Figure: Host A offers λin (original data) and λ'in (original data plus retransmitted data) to a router with finite shared output link buffers; Host B receives λout.]

Causes/costs of congestion: scenario 2

idealization: perfect knowledge
• sender sends only when router buffers available

[Figure: Host A keeps a copy of each packet and sends toward Host B only when the router has free buffer space, not when there is no buffer space. Plot: λout vs. λin, both axes up to R/2.]

Causes/costs of congestion: scenario 2

Idealization: known loss. Packets can be lost, dropped at router due to full buffers.
• sender only resends if packet known to be lost
• when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)

Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered!

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput

[Figures: Host A, router with finite shared output link buffers, Host B, with the sender's copy, free buffer space and timeout marked; plots of λout vs. λin, both axes up to R/2.]

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λin and λ'in increase?
A: as red λ'in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

[Figure: Hosts A, B, C, D send over multihop paths through routers with finite shared output link buffers; λin: original data; λ'in: original data plus retransmitted data; λout: throughput.]

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!

[Figure: λout (throughput by blue traffic) vs. λ'in (offered load by Host A), both axes up to R/2, showing bandwidth wastage for packets dropped at the 2nd router.]

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at

TCP congestion control: additive increase multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss

[Figure: cwnd (TCP sender congestion window size) vs. time. AIMD sawtooth behavior, probing for bandwidth: additively increase window size … until loss occurs (then cut window in half).]
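The sawtooth can be reproduced with a few lines of simulation. This is a minimal sketch, assuming cwnd is counted in MSS units, one update per RTT, and that the RTTs in which a loss is detected are supplied by hand; the function name `aimd` and its parameters are illustrative only.

```python
def aimd(cwnd, rtts, loss_rtts, mss=1):
    """Return cwnd at the start of each RTT under AIMD.

    cwnd      -- initial window, in MSS units
    rtts      -- number of RTTs to simulate
    loss_rtts -- set of RTT indices in which a loss is detected
    """
    trace = []
    for t in range(rtts):
        trace.append(cwnd)
        if t in loss_rtts:
            cwnd = max(mss, cwnd / 2)  # multiplicative decrease
        else:
            cwnd += mss                # additive increase: +1 MSS per RTT
    return trace


trace = aimd(cwnd=8, rtts=8, loss_rtts={3})
```

With a loss in RTT 3 the window climbs 8, 9, 10, 11, is halved to 5.5, and then resumes its linear climb.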

TCP Congestion Control: details

• sender limits transmission:
  LastByteSent - LastByteAcked ≤ cwnd
• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

$\text{rate} \approx \frac{\text{cwnd}}{\text{RTT}}$ bytes/sec

[Figure: sender sequence number space: last byte ACKed; sent, not-yet ACKed ("in-flight"); last byte sent; the in-flight span is limited by cwnd.]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A sends one segment to Host B, then two segments one RTT later, then four segments.]
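To see why incrementing cwnd once per ACK doubles it every RTT, consider this small sketch. It assumes cwnd is counted in MSS units, that every segment sent in an RTT is ACKed within that RTT, and that no loss occurs; `slow_start` is an illustrative name.

```python
def slow_start(rtts, mss=1):
    """Return cwnd (in MSS units) at the start of each RTT, with no loss."""
    cwnd = mss
    trace = [cwnd]
    for _ in range(rtts):
        acks = cwnd // mss      # one ACK per segment sent in this RTT
        for _ in range(acks):
            cwnd += mss         # +1 MSS for every ACK received
        trace.append(cwnd)
    return trace


trace = slow_start(3)
```

The trace matches the figure: one segment, then two, then four, then eight.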

TCP: detecting, reacting to loss

• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

[Figure: FSM with states slow start, congestion avoidance, fast recovery]

initial (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0 → slow start

slow start:
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: Λ → congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment → fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment

congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment → fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment → slow start

fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0 → congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment → slow start
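The three-state machine above can be sketched as a small class. This is an illustration under simplifying assumptions, not a real TCP implementation: cwnd and ssthresh are counted in MSS units (so the initial ssthresh is 64 MSS rather than 64 KB), retransmission itself is omitted, and the class name `RenoSender` and its method names are hypothetical.

```python
MSS = 1  # window arithmetic is done in MSS units (assumption of this sketch)


class RenoSender:
    def __init__(self):
        self.state = "slow_start"
        self.cwnd = 1 * MSS
        self.ssthresh = 64 * MSS   # slide uses 64 KB; 64 MSS here for illustration
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh            # deflate the window
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += MSS                     # exponential growth per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += MSS * (MSS / self.cwnd)  # about +1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += MSS
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                   # fast retransmit would happen here
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast_recovery"

    def on_timeout(self):                        # same action in every state
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1 * MSS
        self.dup_acks = 0
        self.state = "slow_start"
```

Feeding the object seven new ACKs, three duplicate ACKs, one new ACK and then a timeout walks it through slow start, fast recovery, congestion avoidance and back to slow start.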

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT

$\text{avg TCP throughput} = \frac{3}{4}\,\frac{W}{RTT}$ bytes/sec

[Figure: sawtooth window size oscillating between W/2 and W, with average 3/4 W.]

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

$\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \sqrt{L}}$

• to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
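The two numbers quoted in the example can be checked directly from the formula. The sketch below assumes MSS is expressed in bits so that the throughput comes out in bits/sec; the function names are illustrative.

```python
def window_segments(rate_bps, rtt_s, mss_bytes):
    """In-flight segments needed to fill a pipe of rate_bps over rtt_s."""
    return rate_bps * rtt_s / (mss_bytes * 8)


def required_loss_rate(rate_bps, rtt_s, mss_bytes):
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L."""
    return (1.22 * mss_bytes * 8 / (rtt_s * rate_bps)) ** 2


W = window_segments(10e9, 0.100, 1500)
L = required_loss_rate(10e9, 0.100, 1500)
```

For 1500-byte segments, a 100 ms RTT and a 10 Gbps target, W comes out just above 83,333 segments and L is roughly 2·10^-10.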

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput vs. Connection 1 throughput, both axes up to R, showing the full bandwidth utilization line (X1, Y1) where X1 + Y1 = R and the equal bandwidth share line (X2, Y2) where X2 = Y2. The trajectory alternates between congestion avoidance (additive increase) and loss (decrease window by factor of 2).]

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram is shown with ECN=00 and ECN=11; the TCP ACK segment returned to the source carries ECE=1.]

Page 8: ChapterIII: Transport Layer

Internet checksum: example

example: add two 16-bit integers

              1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
              1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
wraparound  1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1
sum           1 0 1 1 1 0 1 1 1 0 1 1 1 1 0 0
checksum      0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 1

Note: when adding numbers, a carryout from the most significant bit needs to be added to the result.
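The wraparound rule in the note can be written out in a few lines. This is a minimal sketch of the one's-complement sum over 16-bit words; the function names are illustrative, and the input is assumed to be already split into 16-bit integers.

```python
def ones_complement_sum16(words):
    """Add 16-bit words, folding any carry out of bit 15 back into the sum."""
    total = 0
    for w in words:
        total += w
        total = (total & 0xFFFF) + (total >> 16)  # wraparound carry
    return total


def internet_checksum(words):
    """Checksum is the one's complement (bit inversion) of the sum."""
    return ~ones_complement_sum16(words) & 0xFFFF


a = 0b1110011001100110
b = 0b1101010101010101
total = ones_complement_sum16([a, b])
checksum = internet_checksum([a, b])
```

A receiver that adds all the words together with the checksum should obtain 0xFFFF (all ones) when no error is detected.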

Principles of reliable data transfer

• important in application, transport, link layers
  - top-10 list of important networking topics!
• characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)

Reliable data transfer: getting started

[Figure: send side and receive side of rdt over an unreliable channel]
• rdt_send(): called from above (e.g., by app). Passed data to deliver to receiver upper layer
• udt_send(): called by rdt, to transfer packet over unreliable channel to receiver
• rdt_rcv(): called when packet arrives on rcv-side of channel
• deliver_data(): called by rdt to deliver data to upper layer

Reliable data transfer: getting started

We'll:
• incrementally develop sender, receiver sides of reliable data transfer protocol (rdt)
• consider only unidirectional data transfer
  - but control info will flow in both directions!
• use finite state machines (FSMs) to specify sender, receiver

[Figure: FSM notation. The arrow from state 1 to state 2 is labeled with the event causing the state transition and the actions taken on the state transition.]
state: when in this "state", next state is uniquely determined by next event

rdt1.0: reliable transfer over a reliable channel

• underlying channel perfectly reliable
  - no bit errors
  - no loss of packets
• separate FSMs for sender, receiver:
  - sender sends data into underlying channel
  - receiver reads data from underlying channel

sender: Wait for call from above. On rdt_send(data): packet = make_pkt(data); udt_send(packet)
receiver: Wait for call from below. On rdt_rcv(packet): extract(packet, data); deliver_data(data)

rdt2.0: channel with bit errors

• underlying channel may flip bits in packet
  - checksum to detect bit errors
• the question: how to recover from errors?
  - acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
  - negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
  - sender retransmits pkt on receipt of NAK
• new mechanisms in rdt2.0 (beyond rdt1.0):
  - error detection
  - receiver feedback: control msgs (ACK, NAK) rcvr->sender

How do humans recover from "errors" during conversation?

rdt2.0: FSM specification

sender:
• Wait for call from above: on rdt_send(data), sndpkt = make_pkt(data, checksum); udt_send(sndpkt) → Wait for ACK or NAK
• Wait for ACK or NAK: on rdt_rcv(rcvpkt) && isNAK(rcvpkt), udt_send(sndpkt)
• Wait for ACK or NAK: on rdt_rcv(rcvpkt) && isACK(rcvpkt), Λ (no action) → Wait for call from above

receiver:
• Wait for call from below: on rdt_rcv(rcvpkt) && corrupt(rcvpkt), udt_send(NAK)
• Wait for call from below: on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt), extract(rcvpkt, data); deliver_data(data); udt_send(ACK)

rdt2.0: operation with no errors

[Figure: the rdt2.0 FSM, traced along the error-free path]
• sender: rdt_send(data) → sndpkt = make_pkt(data, checksum); udt_send(sndpkt)
• receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) → extract(rcvpkt, data); deliver_data(data); udt_send(ACK)
• sender: rdt_rcv(rcvpkt) && isACK(rcvpkt) → Λ

rdt2.0: error scenario

[Figure: the rdt2.0 FSM, traced along the error path]
• sender: rdt_send(data) → sndpkt = make_pkt(data, checksum); udt_send(sndpkt)
• receiver: rdt_rcv(rcvpkt) && corrupt(rcvpkt) → udt_send(NAK)
• sender: rdt_rcv(rcvpkt) && isNAK(rcvpkt) → udt_send(sndpkt)
• receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) → extract(rcvpkt, data); deliver_data(data); udt_send(ACK)
• sender: rdt_rcv(rcvpkt) && isACK(rcvpkt) → Λ

rdt2.0 has a fatal flaw!

what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver!
• can't just retransmit: possible duplicate

handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt

stop and wait: sender sends one packet, then waits for receiver response

rdt2.1: sender, handles garbled ACK/NAKs

• Wait for call 0 from above: on rdt_send(data), sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt) → Wait for ACK or NAK 0
• Wait for ACK or NAK 0: on rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)), udt_send(sndpkt)
• Wait for ACK or NAK 0: on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt), Λ → Wait for call 1 from above
• Wait for call 1 from above: on rdt_send(data), sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt) → Wait for ACK or NAK 1
• Wait for ACK or NAK 1: on rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)), udt_send(sndpkt)
• Wait for ACK or NAK 1: on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt), Λ → Wait for call 0 from above

rdt2.1: receiver, handles garbled ACK/NAKs

• Wait for 0 from below: on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt), extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) → Wait for 1 from below
• Wait for 0 from below: on rdt_rcv(rcvpkt) && corrupt(rcvpkt), sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
• Wait for 0 from below: on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt), sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
• Wait for 1 from below: on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt), extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) → Wait for 0 from below
• Wait for 1 from below: on rdt_rcv(rcvpkt) && corrupt(rcvpkt), sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
• Wait for 1 from below: on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt), sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)

rdt2.1: Example 1

[Figure sequence: the rdt2.1 sender and receiver FSMs, one transition per slide]
• sender (Wait for call 0 from above): rdt_send(data) → sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
• receiver (Wait for 0 from below): rdt_rcv(rcvpkt) && corrupt(rcvpkt) → sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
• sender (Wait for ACK or NAK 0): rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) → udt_send(sndpkt)
• receiver (Wait for 0 from below): rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) → extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
• sender (Wait for ACK or NAK 0): rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) → Λ
• final states: sender in Wait for call 1 from above, receiver in Wait for 1 from below

rdt2.1: Example 2

[Figure sequence: the rdt2.1 sender and receiver FSMs, one transition per slide]
• receiver (Wait for 0 from below): rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) → extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
• sender (Wait for ACK or NAK 0): rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) → udt_send(sndpkt)
• receiver (Wait for 1 from below): rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) → sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
• sender (Wait for ACK or NAK 0): rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) → Λ
• final states: sender in Wait for call 1 from above, receiver in Wait for 1 from below

rdt2.1: discussion

sender:
• seq # added to pkt
• two seq. #'s (0, 1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  - state must "remember" whether "expected" pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  - state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender

rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  - receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt

rdt2.2: sender, receiver fragments

sender FSM fragment:
• Wait for call 0 from above: on rdt_send(data), sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt) → Wait for ACK 0
• Wait for ACK 0: on rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)), udt_send(sndpkt)
• Wait for ACK 0: on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0), Λ

receiver FSM fragment:
• Wait for 0 from below: on rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)), udt_send(sndpkt)
• on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt), extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt) → Wait for 0 from below

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
  - checksum, seq #, ACKs, retransmissions will be of help … but not enough

approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  - retransmission will be duplicate, but seq #'s already handle this
  - receiver must specify seq # of pkt being ACKed
• requires countdown timer

rdt3.0 sender

• Wait for call 0 from above: on rdt_send(data), sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer → Wait for ACK0
• Wait for call 0 from above: on rdt_rcv(rcvpkt), Λ
• Wait for ACK0: on rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)), Λ
• Wait for ACK0: on timeout, udt_send(sndpkt); start_timer
• Wait for ACK0: on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0), stop_timer → Wait for call 1 from above
• Wait for call 1 from above: on rdt_send(data), sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer → Wait for ACK1
• Wait for call 1 from above: on rdt_rcv(rcvpkt), Λ
• Wait for ACK1: on rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0)), Λ
• Wait for ACK1: on timeout, udt_send(sndpkt); start_timer
• Wait for ACK1: on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1), stop_timer → Wait for call 0 from above
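The alternating-bit behaviour of the sender, together with a matching receiver, can be simulated in a few lines. This is a rough sketch under stated assumptions: the channel only loses packets (no corruption or reordering), losses are scripted by transmission index rather than random, a timeout is modelled simply as "no ACK came back", and the function name `stop_and_wait` is illustrative.

```python
def stop_and_wait(messages, drop_data=(), drop_ack=()):
    """Simulate an alternating-bit sender and receiver.

    drop_data / drop_ack -- 0-based indices of the data / ACK
    transmissions that the channel loses.
    Returns (messages delivered to the upper layer, event log).
    """
    delivered, log = [], []
    expected = 0              # receiver: seq bit it is waiting for
    data_tx = ack_tx = 0      # running counts of transmissions
    for i, msg in enumerate(messages):
        seq = i % 2           # sender alternates between 0 and 1
        while True:
            log.append(f"send pkt{seq}")
            lost = data_tx in drop_data
            data_tx += 1
            if lost:
                log.append("timeout")
                continue      # retransmit the same packet
            if seq == expected:
                delivered.append(msg)   # new packet: deliver up
                expected ^= 1
            # a duplicate is not delivered again, but is still ACKed
            lost = ack_tx in drop_ack
            ack_tx += 1
            if lost:
                log.append("timeout")
                continue
            log.append(f"rcv ack{seq}")
            break
    return delivered, log


delivered, log = stop_and_wait(["a", "b", "c"], drop_data={1}, drop_ack={1})
```

In this run the second data transmission and the second ACK are lost; the sender times out twice, and the receiver discards the resulting duplicate of packet 1, so each message is still delivered exactly once.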

rdt3.0 in action

(a) no loss
• sender: send pkt0
• receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1
• receiver: rcv pkt1, send ack1
• sender: rcv ack1, send pkt0
• receiver: rcv pkt0, send ack0

(b) packet loss
• sender: send pkt0
• receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1 (X: loss)
• sender: timeout, resend pkt1
• receiver: rcv pkt1, send ack1
• sender: rcv ack1, send pkt0
• receiver: rcv pkt0, send ack0

(c) ACK loss
• sender: send pkt0
• receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1
• receiver: rcv pkt1, send ack1 (X: loss)
• sender: timeout, resend pkt1
• receiver: rcv pkt1 (detect duplicate), send ack1
• sender: rcv ack1, send pkt0
• receiver: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
• sender: send pkt0
• receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1
• receiver: rcv pkt1, send ack1 (delayed)
• sender: timeout, resend pkt1
• receiver: rcv pkt1 (detect duplicate), send ack1
• sender: rcv ack1, send pkt0
• receiver: rcv pkt0, send ack0
• sender: rcv ack1, do nothing
• sender: rcv ack0, send pkt1

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

$D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \text{microsecs}$

• U sender: utilization, fraction of time sender busy sending

$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$

• if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
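The utilization figure is easy to reproduce. The sketch below assumes the RTT is twice the 15 ms propagation delay and ignores ACK transmission time; the `n_pipelined` parameter is an added convenience for comparing against a sender that keeps several packets in flight, and the function name is illustrative.

```python
def sender_utilization(L_bits, R_bps, rtt_s, n_pipelined=1):
    """Fraction of time the sender is busy transmitting."""
    d_trans = L_bits / R_bps                    # time to push one packet out
    return n_pipelined * d_trans / (rtt_s + d_trans)


u = sender_utilization(L_bits=8000, R_bps=1e9, rtt_s=0.030)
```

The result is about 0.00027, so the sender is busy for less than 0.03% of the time on this link.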

rdt3.0: stop-and-wait operation

[Figure: sender/receiver timeline]
• first packet bit transmitted, t = 0
• last packet bit transmitted, t = L / R
• first packet bit arrives
• last packet bit arrives, send ACK
• ACK arrives, send next packet, t = RTT + L / R

$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$

Pipelined protocols

pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver

• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

[Figure: sender/receiver timeline with three packets in flight]
• first packet bit transmitted, t = 0
• last bit transmitted, t = L / R
• first packet bit arrives
• last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• ACK arrives, send next packet, t = RTT + L / R

3-packet pipelining increases utilization by a factor of 3!

$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n ("cumulative ACK")
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

(single state: Wait)

Λ (initial):
  base = 1
  nextseqnum = 1

rdt_send(data):
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)

timeout:
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  …
  udt_send(sndpkt[nextseqnum-1])

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer

rdt_rcv(rcvpkt) && corrupt(rcvpkt):
  Λ
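The extended FSM translates almost line for line into code. The sketch below is illustrative only: `udt_send` is replaced by appending sequence numbers to a log, the timer is a boolean flag rather than a real countdown, sequence numbers are not wrapped to k bits, and the class name `GBNSender` is hypothetical.

```python
class GBNSender:
    def __init__(self, window):
        self.N = window
        self.base = 1            # oldest unacked sequence number
        self.nextseqnum = 1      # next sequence number to use
        self.sndpkt = {}         # buffered copies for retransmission
        self.timer_running = False
        self.sent = []           # log of transmitted seq numbers (stands in for udt_send)

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False         # window full: refuse_data
        self.sndpkt[self.nextseqnum] = data
        self.sent.append(self.nextseqnum)
        if self.base == self.nextseqnum:
            self.timer_running = True    # start_timer for oldest pkt
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        # cumulative ACK: everything up to and including acknum is acked
        self.base = acknum + 1
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.sent.append(seq)        # go back N: resend all unacked pkts


s = GBNSender(window=4)
accepted = [s.rdt_send(d) for d in "abcde"]
s.on_ack(2)
s.on_timeout()
```

With a window of 4, the fifth call to `rdt_send` is refused; after a cumulative ACK for packet 2, a timeout retransmits packets 3 and 4.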

GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #

(single state: Wait)
• Λ (initial): expectedseqnum = 1; sndpkt = make_pkt(0, ACK, chksum)
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum): extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(expectedseqnum, ACK, chksum); udt_send(sndpkt); expectedseqnum++
• default: udt_send(sndpkt)

GBN in action

[Figure: sender window (N=4) sliding over sequence numbers 0 to 8]
• sender: send pkt0, send pkt1, send pkt2 (X: loss), send pkt3, (wait)
• receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, discard, (re)send ack1
• sender: rcv ack0, send pkt4; rcv ack1, send pkt5; ignore duplicate ACK
• receiver: receive pkt4, discard, (re)send ack1; receive pkt5, discard, (re)send ack1
• sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
• receiver: rcv pkt2, deliver, send ack2; rcv pkt3, deliver, send ack3; rcv pkt4, deliver, send ack4; rcv pkt5, deliver, send ack5

Selective repeat

• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #s of sent, unACKed packets

Selective repeat sender receiver windows

54

Selective repeat

sender
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
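The receiver rules can be sketched as follows. This is a simplified illustration: sequence numbers are plain integers that never wrap around, corruption is not modelled, and the class name `SRReceiver` and its method are hypothetical.

```python
class SRReceiver:
    def __init__(self, window):
        self.N = window
        self.rcvbase = 0      # smallest seq # not yet received
        self.buffer = {}      # out-of-order packets awaiting delivery
        self.delivered = []   # data handed to the upper layer, in order

    def on_packet(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            # deliver this packet and any buffered in-order successors
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n          # already delivered: re-ACK so the sender can advance
        return None


r = SRReceiver(window=4)
acks = [r.on_packet(n, f"pkt{n}") for n in (0, 1, 3, 4, 5, 2)]
```

Packets 3, 4 and 5 are buffered and individually ACKed while packet 2 is missing; once packet 2 arrives, packets 2 through 5 are delivered in order and the window base moves to 6.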

Selective repeat in action

[Figure: sender window (N=4) sliding over sequence numbers 0 to 8]
• sender: send pkt0, send pkt1, send pkt2 (X: loss), send pkt3, (wait)
• receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, buffer, send ack3
• sender: rcv ack0, send pkt4; rcv ack1, send pkt5; record ack3 arrived
• receiver: receive pkt4, buffer, send ack4; receive pkt5, buffer, send ack5
• sender: pkt 2 timeout, send pkt2; record ack4 arrived; record ack5 arrived
• receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?

Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

[Figure: sender window and receiver window (after receipt) over the sequence 0 1 2 3 0 1 2]
(a) no problem: sender sends pkt0, pkt1, pkt2; the ACKs arrive, so the sender goes on to send pkt3 (X: lost) and then a new pkt0. Receiver will accept packet with seq number 0.
(b) oops!: sender sends pkt0, pkt1, pkt2; all three ACKs are lost (X X X); sender times out and retransmits pkt0. Receiver will accept packet with seq number 0.

• receiver can't see sender side
• receiver behavior identical in both cases!
• something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
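One way to explore the question is to check when the receiver's new window can contain sequence numbers that the sender might still retransmit from its old window. The sketch below assumes both windows have the same size and that the receiver has advanced by a full window while the sender has not advanced at all (the worst case, as in scenario (b)); `windows_overlap` is an illustrative helper.

```python
def windows_overlap(seq_space, window):
    """True if old sender window and new receiver window share a seq #."""
    old = {i % seq_space for i in range(window)}              # sender, not advanced
    new = {i % seq_space for i in range(window, 2 * window)}  # receiver, fully advanced
    return bool(old & new)


ambiguous = windows_overlap(seq_space=4, window=3)
safe = windows_overlap(seq_space=4, window=2)
```

With 4 sequence numbers, a window of 3 is ambiguous while a window of 2 is not, which is consistent with the standard rule that the window size must be at most half the size of the sequence number space.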

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver

TCP segment structure

[Figure: TCP segment layout, 32 bits wide]
• source port #, dest port #
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• head len, not used, flag bits U, A, P, R, S, F, receive window
  - URG: urgent data (generally not used)
  - ACK: ACK # valid
  - PSH: push data now
  - RST, SYN, FIN: connection estab (setup, teardown commands)
  - receive window: # bytes rcvr willing to accept
• checksum (Internet checksum, as in UDP), Urg data pointer
• options (variable length)
• application data (variable length)

TCP seq. numbers, ACKs

sequence numbers:
  - byte stream "number" of first byte in segment's data
acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK
Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say; up to implementor

[Figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer.]
[Figure: sender sequence number space with window size N: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable.]

Byte stream in TCP

62

Window N bytes

HTTP Get Message (K bytes)

100th byte

TCP header(seq no = 100)

M bytes

HTTP Get Message (K bytes)

Cannot be transmitted now

TCP seq. numbers, ACKs

simple telnet scenario (Host A, Host B):
1. User types 'C' at Host A. A → B: Seq=42, ACK=79, data = 'C'
2. Host B ACKs receipt of 'C', echoes back 'C'. B → A: Seq=79, ACK=43, data = 'C'
3. Host A ACKs receipt of echoed 'C'. A → B: Seq=43, ACK=80

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT

[Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr; SampleRTT and EstimatedRTT in milliseconds (axis 100 to 350) versus time in seconds]

EstimatedRTT = (1 - α) * EstimatedRTT + α * SampleRTT

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125


TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

DevRTT = (1 - β) * DevRTT + β * |SampleRTT - EstimatedRTT|
(typically, β = 0.25)

TimeoutInterval = EstimatedRTT + 4 * DevRTT
(EstimatedRTT is the estimated RTT; 4 * DevRTT is the "safety margin")
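The estimator formulas can be sketched in a few lines of Python. This is a minimal illustration, not TCP's actual implementation. Two choices are assumptions rather than anything stated on the slides: the initial DevRTT (half the first sample, as RFC 6298 does for its variance term) and updating DevRTT before EstimatedRTT. The class name `RttEstimator` is hypothetical.

```python
class RttEstimator:
    """EWMA-based RTT estimator following the slide formulas."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        # Assumption: initialise the deviation to half the first sample.
        self.dev_rtt = first_sample / 2

    def update(self, sample_rtt):
        # Assumption: DevRTT is updated using the previous EstimatedRTT.
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    def timeout_interval(self):
        # EstimatedRTT plus a safety margin of four deviations.
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)   # milliseconds
est.update(200.0)                        # one unusually slow sample
print(est.estimated_rtt, est.dev_rtt, est.timeout_interval())
```

A single slow sample moves EstimatedRTT only slightly but widens the safety margin noticeably, which is the intended behaviour when RTT varies.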

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks

let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control

TCP sender events:

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments

TCP sender (simplified)

initially (Λ), entering the single state "wait for event":
    NextSeqNum = InitialSeqNum
    SendBase = InitialSeqNum

event: data received from application above
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
        start timer

event: timeout
    retransmit not-yet-acked segment with smallest seq. #
    start timer

event: ACK received, with ACK field value y
    if (y > SendBase) {
        SendBase = y
        /* SendBase-1: last cumulatively ACKed byte */
        if (there are currently not-yet-acked segments)
            start timer
        else stop timer
    }
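The three events of the simplified sender can be sketched as a small Python class. This is an illustrative model only: the timer is a boolean flag, "sending" appends to a list, and the names `SimplifiedTcpSender`, `sent` and `unacked` are hypothetical.

```python
class SimplifiedTcpSender:
    """Model of the simplified TCP sender (no duplicate-ACK handling,
    no flow or congestion control)."""

    def __init__(self, initial_seq_num=0):
        self.next_seq_num = initial_seq_num
        self.send_base = initial_seq_num
        self.unacked = {}           # seq # -> data of not-yet-acked segments
        self.timer_running = False
        self.sent = []              # log of (seq #, data) passed to IP

    def on_app_data(self, data):
        seq = self.next_seq_num
        self.unacked[seq] = data
        self.sent.append((seq, data))
        self.next_seq_num += len(data)
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if self.unacked:
            seq = min(self.unacked)             # smallest unacked seq #
            self.sent.append((seq, self.unacked[seq]))
            self.timer_running = True           # restart the timer

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y
            # Cumulative ACK: drop every segment that ends at or before y.
            for seq in [s for s, d in self.unacked.items() if s + len(d) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)


sender = SimplifiedTcpSender(initial_seq_num=92)
sender.on_app_data(b"A" * 8)     # Seq=92, 8 bytes
sender.on_app_data(b"B" * 20)    # Seq=100, 20 bytes
sender.on_ack(120)               # one cumulative ACK covers both segments
print(sender.send_base, sender.timer_running)
```

The sequence numbers match the retransmission scenarios on the following slides, so each scenario can be replayed by calling the event methods in order.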

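TCP: retransmission scenarios

lost ACK scenario (Host A, Host B):
1. A → B: Seq=92, 8 bytes of data
2. B → A: ACK=100, lost (X)
3. timeout at A; A → B: Seq=92, 8 bytes of data (retransmission)
4. B → A: ACK=100

premature timeout (Host A, Host B):
1. A → B: Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data (SendBase=92)
2. B → A: ACK=100, then ACK=120; neither arrives before the timeout
3. timeout at A; A → B: Seq=92, 8 bytes of data (retransmission)
4. ACK=100 arrives (SendBase=100), then ACK=120 arrives (SendBase=120)
5. B → A: ACK=120 in response to the retransmission (SendBase=120)

cumulative ACK (Host A, Host B):
1. A → B: Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
2. B → A: ACK=100, lost (X); B → A: ACK=120, which arrives before the timeout
3. A → B: Seq=120, 15 bytes of data (no retransmission needed)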

TCP ACK generation [RFC 5681]

event at receiver, followed by TCP receiver action:

1. event: arrival of in-order segment with expected seq #. All data up to expected seq # already ACKed
   action: delayed ACK. Wait up to 500 ms for next segment. If no next segment, send ACK
2. event: arrival of in-order segment with expected seq #. One other segment has ACK pending
   action: immediately send single cumulative ACK, ACKing both in-order segments
3. event: arrival of out-of-order segment, higher-than-expected seq #. Gap detected
   action: immediately send duplicate ACK, indicating seq # of next expected byte
4. event: arrival of segment that partially or completely fills gap
   action: immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout
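The triple-duplicate-ACK rule can be sketched as a small counter. This is an illustrative sketch; the class name `FastRetransmitDetector` and its return convention are assumptions, and a real sender would combine this with the retransmission timer.

```python
class FastRetransmitDetector:
    """Counts duplicate ACKs and signals when to fast-retransmit."""

    def __init__(self, send_base):
        self.send_base = send_base   # oldest unacknowledged byte
        self.dup_acks = 0

    def on_ack(self, ack_num):
        """Return True when the segment starting at send_base should be resent."""
        if ack_num > self.send_base:
            # New data acknowledged: advance and reset the duplicate counter.
            self.send_base = ack_num
            self.dup_acks = 0
            return False
        if ack_num == self.send_base:
            self.dup_acks += 1
            return self.dup_acks == 3   # third duplicate triggers the resend
        return False


detector = FastRetransmitDetector(send_base=92)
# One ACK=100 for the first segment, then three duplicates of it.
print([detector.on_ack(100) for _ in range(4)])
```

The first ACK=100 acknowledges new data; only the third repeat of it triggers the retransmission of the segment with Seq=100.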

TCP fast retransmit

fast retransmit after sender receipt of triple duplicate ACK (Host A, Host B):
1. A → B: Seq=92, 8 bytes of data; A → B: Seq=100, 20 bytes of data, lost (X)
2. B → A: ACK=100, followed by three more ACK=100 (3 DUP ACKs)
3. before the timeout expires, A → B: Seq=100, 20 bytes of data (retransmission)

TCP flow control

[Figure: receiver protocol stack: segments from sender pass through IP code and TCP code (OS) into the TCP socket receiver buffers, from which the application process reads]

application may remove data from TCP socket buffers slower than TCP receiver is delivering (sender is sending)

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

TCP flow control

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering: TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); buffered data goes to application process]
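A minimal sketch of the bookkeeping behind rwnd follows. The variable names LastByteRcvd, LastByteRead, LastByteSent and LastByteAcked come from the Kurose and Ross textbook treatment rather than from this slide, and both function names are hypothetical.

```python
def advertised_rwnd(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free buffer space the receiver advertises in the TCP header."""
    buffered = last_byte_rcvd - last_byte_read   # data not yet read by the app
    return rcv_buffer - buffered


def sender_may_send(last_byte_sent, last_byte_acked, nbytes, rwnd):
    """True if sending nbytes more keeps in-flight data within rwnd."""
    in_flight_after = (last_byte_sent + nbytes) - last_byte_acked
    return in_flight_after <= rwnd


rwnd = advertised_rwnd(rcv_buffer=4096, last_byte_rcvd=3000, last_byte_read=1000)
print(rwnd)
print(sender_may_send(last_byte_sent=5000, last_byte_acked=4000, nbytes=1000, rwnd=rwnd))
```

Because the sender never lets unacknowledged data exceed the advertised value, the receive buffer cannot overflow even when the application reads slowly.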

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

[Figure: client and server each hold, between application and network:]
    connection state: ESTAB
    connection variables:
        seq # client-to-server, server-to-client
        rcvBuffer size at server, client

client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

initial states: client state CLOSED, server state LISTEN

1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client state becomes SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server state becomes SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client state becomes ESTAB
4. server: received ACK(y) indicates client is live; server state becomes ESTAB

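TCP 3-way handshake: FSM

client side:
• closed → SYN sent
    event: Socket clientSocket = new Socket("hostname", "port number");
    action: SYN(seq=x)
• SYN sent → ESTAB
    event: SYNACK(seq=y,ACKnum=x+1)
    action: ACK(ACKnum=y+1)

server side:
• closed → listen
    event: Socket connectionSocket = welcomeSocket.accept();
    action: Λ
• listen → SYN rcvd
    event: SYN(x)
    action: SYNACK(seq=y,ACKnum=x+1); create new socket for communication back to client
• SYN rcvd → ESTAB
    event: ACK(ACKnum=y+1)
    action: Λ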

TCP: closing a connection

• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

TCP: closing a connection

client state and server state are both ESTAB initially:
1. client calls clientSocket.close() and sends FINbit=1, seq=x; client state FIN_WAIT_1 (can no longer send but can receive data)
2. server replies ACKbit=1, ACKnum=x+1; server state CLOSE_WAIT (can still send data); client state FIN_WAIT_2 (wait for server close)
3. server sends FINbit=1, seq=y; server state LAST_ACK (can no longer send data)
4. client replies ACKbit=1, ACKnum=y+1; client state TIMED_WAIT (timed wait for 2 * max segment lifetime), then CLOSED
5. server receives the ACK; server state CLOSED

The ldquoTwo Army Problemrdquo

84

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λin, approaches capacity

[Figure: Host A and Host B send original data at rate λin into a router with unlimited shared output link buffers; throughput is λout. Plots: λout versus λin (both axes up to R/2) and delay versus λin]

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λin = λout
  - transport-layer input includes retransmissions: λ'in ≥ λin

[Figure: Host A and Host B; finite shared output link buffers; λin: original data; λ'in: original data, plus retransmitted data; λout]

Causes/costs of congestion: scenario 2

idealization: perfect knowledge
• sender sends only when router buffers available

[Figure: the sender keeps a copy of each packet and transmits when there is free buffer space at the router, holding back when there is no buffer space; λin: original data; λ'in: original data, plus retransmitted data; plot of λout versus λin, both axes up to R/2]

Causes/costs of congestion: scenario 2

Idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
• when sending at R/2, some packets are retransmissions but asymptotic goodput is still R/2 (why?)

[Figure: Host A, Host B; the sender keeps a copy and resends when the router has free buffer space; plot of λout versus λin, both axes up to R/2]

Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered!

[Figure: Host A times out and resends a copy although the router has free buffer space; plot of λout versus λin, both axes up to R/2]

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λin and λ'in increase?
A: as red λ'in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

[Figure: Host A, Host B, Host C and Host D connected through routers with finite shared output link buffers; λin: original data; λ'in: original data, plus retransmitted data; λout]

Causes/costs of congestion: scenario 3

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!

[Figure: throughput by blue traffic (λout) versus offered load by Host A (λ'in), both axes up to R/2; annotation: bandwidth wastage for packets dropped at the 2nd router]

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at

TCP congestion control: additive increase multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss

[Figure: cwnd (TCP sender congestion window size) versus time; AIMD saw tooth behavior, probing for bandwidth: additively increase window size ... until loss occurs (then cut window in half)]
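The saw tooth can be reproduced with a short simulation. This is a sketch under simplifying assumptions: time advances in whole RTTs, cwnd is measured in units of MSS, and the rounds in which loss is detected are supplied by the caller. The function name `aimd_trace` is hypothetical.

```python
def aimd_trace(rounds, loss_rounds, mss=1, start=1):
    """Return cwnd at the start of each RTT under AIMD.

    loss_rounds: set of RTT indices in which a loss is detected
    (an assumption of this sketch; real TCP infers loss from ACKs).
    """
    cwnd = start
    trace = []
    for rtt in range(rounds):
        trace.append(cwnd)
        if rtt in loss_rounds:
            cwnd = max(mss, cwnd / 2)   # multiplicative decrease
        else:
            cwnd += mss                 # additive increase: 1 MSS per RTT
    return trace


print(aimd_trace(rounds=8, loss_rounds={3}, start=4))
```

Plotting the returned list against the round index gives the saw tooth shape described on the slide.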

TCP Congestion Control: details

• sender limits transmission:

    LastByteSent - LastByteAcked ≤ cwnd

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

    rate ≈ cwnd / RTT bytes/sec

[Figure: sender sequence number space: last byte ACKed; sent, not-yet ACKed ("in-flight"); last byte sent; the in-flight span is bounded by cwnd]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A sends one segment, then two segments, then four segments to Host B in successive RTTs]
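The claim that incrementing cwnd once per ACK doubles it every RTT can be checked with a few lines of Python. The sketch assumes no loss and that every segment sent in a round is acknowledged before the next round begins. The function name `slow_start_rounds` is hypothetical.

```python
def slow_start_rounds(rounds, mss=1):
    """Return cwnd (in bytes, or in segments when mss=1) at the start of each RTT."""
    cwnd = mss
    sizes = []
    for _ in range(rounds):
        sizes.append(cwnd)
        acks_this_round = cwnd // mss     # one ACK per segment sent
        for _ in range(acks_this_round):
            cwnd += mss                   # increment cwnd for every ACK received
    return sizes


print(slow_start_rounds(4))
```

Each round sends cwnd/MSS segments and receives as many ACKs, so the per-ACK increment adds up to a doubling.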

TCP: detecting, reacting to loss

• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate acks)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

initially (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; enter slow start

state: slow start
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: Λ; go to congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment

state: congestion avoidance
• new ACK: cwnd = cwnd + MSS * (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start

state: fast recovery
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
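The three-state machine can be sketched as a Python class that tracks only cwnd, ssthresh and the duplicate-ACK counter. This is an illustration of the transitions in the summary, not a TCP implementation: retransmission and segment transmission are left out, the MSS value is an arbitrary example, and the class and method names are hypothetical.

```python
MSS = 1460  # bytes; an illustrative value, not taken from the slides


class RenoCongestionControl:
    def __init__(self):
        self.state = "slow_start"
        self.cwnd = MSS
        self.ssthresh = 64 * 1024
        self.dup_ack_count = 0

    def on_new_ack(self):
        if self.state == "slow_start":
            self.cwnd += MSS
            self.dup_ack_count = 0
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        elif self.state == "congestion_avoidance":
            self.cwnd += MSS * MSS / self.cwnd   # about 1 MSS per RTT
            self.dup_ack_count = 0
        else:  # fast recovery ends on the first new ACK
            self.cwnd = self.ssthresh
            self.dup_ack_count = 0
            self.state = "congestion_avoidance"

    def on_duplicate_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += MSS
            return
        self.dup_ack_count += 1
        if self.dup_ack_count == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast_recovery"     # missing segment is resent here

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = MSS
        self.dup_ack_count = 0
        self.state = "slow_start"            # missing segment is resent here


cc = RenoCongestionControl()
cc.on_new_ack()
for _ in range(3):
    cc.on_duplicate_ack()
print(cc.state, cc.cwnd, cc.ssthresh)
```

Keeping the window arithmetic separate from packet handling makes it easy to replay an ACK sequence and inspect how cwnd evolves.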

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT

    avg TCP throughput = (3/4) * W / RTT bytes/sec

[Figure: window size saw tooth oscillating between W/2 and W, with average 3/4 W]

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

    TCP throughput = 1.22 * MSS / (RTT * sqrt(L))

  to achieve 10 Gbps throughput, need a loss rate of L = 2 * 10^-10, a very small loss rate!
• new versions of TCP for high-speed
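Both numbers in the example can be recomputed from the formulas on the slide. The sketch below simply solves the throughput expression for L; the function names are hypothetical.

```python
def required_window_segments(rate_bps, rtt_s, segment_bytes):
    """Segments that must be in flight to sustain rate_bps over one RTT."""
    return rate_bps * rtt_s / (segment_bytes * 8)


def required_loss_rate(rate_bps, rtt_s, mss_bytes):
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L."""
    mss_bits = mss_bytes * 8
    return (1.22 * mss_bits / (rtt_s * rate_bps)) ** 2


# 1500 byte segments, 100 ms RTT, 10 Gbps target
print(required_window_segments(10e9, 0.1, 1500))
print(required_loss_rate(10e9, 0.1, 1500))
```

The results agree with the slide: roughly 83,333 segments in flight and a loss probability on the order of 2 * 10^-10.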

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput versus Connection 1 throughput, both axes up to R; the equal bandwidth share line, (X2, Y2) where X2 = Y2; the full bandwidth utilization line, (X1, Y1) where X1 + Y1 = R; the trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2"]

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2
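The parallel-connection example is simple arithmetic once we assume that the bottleneck capacity is split equally among all TCP connections. The helper below is a hypothetical illustration that returns the application's share as a fraction of R.

```python
from fractions import Fraction


def app_share(existing_connections, new_connections):
    """Fraction of link rate R obtained by an app opening new_connections,
    assuming every connection gets an equal share of the bottleneck."""
    total = existing_connections + new_connections
    return Fraction(new_connections, total)


print(app_share(9, 1))    # one connection among ten
print(app_share(9, 11))   # eleven connections among twenty
```

With 11 connections the share is 11/20 of R, which the slide rounds to R/2.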

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical); the IP datagram leaves the source with ECN=00 and is marked ECN=11 by a router; the destination returns a TCP ACK segment with ECE=1]


Principles of reliable data transfer

• important in application, transport, link layers
  - top-10 list of important networking topics!
• characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)


Reliable data transfer: getting started

[Figure: send side and receive side of the rdt protocol, connected by the unreliable channel]

• rdt_send(): called from above (e.g., by app). Passed data to deliver to receiver upper layer
• udt_send(): called by rdt, to transfer packet over unreliable channel to receiver
• rdt_rcv(): called when packet arrives on rcv-side of channel
• deliver_data(): called by rdt to deliver data to upper

Reliable data transfer: getting started

We'll:
• incrementally develop sender, receiver sides of reliable data transfer protocol (rdt)
• consider only unidirectional data transfer
  - but control info will flow on both directions!
• use finite state machines (FSMs) to specify sender, receiver

FSM notation: an arrow from state 1 to state 2 is labelled with the event causing the state transition, written above the actions taken on the state transition.

state: when in this "state", next state uniquely determined by next event

rdt1.0: reliable transfer over a reliable channel

• underlying channel perfectly reliable
  - no bit errors
  - no loss of packets
• separate FSMs for sender, receiver:
  - sender sends data into underlying channel
  - receiver reads data from underlying channel

sender (state: Wait for call from above)
    event: rdt_send(data)
    actions: packet = make_pkt(data); udt_send(packet)

receiver (state: Wait for call from below)
    event: rdt_rcv(packet)
    actions: extract(packet, data); deliver_data(data)

rdt2.0: channel with bit errors

• underlying channel may flip bits in packet
  - checksum to detect bit errors
• the question: how to recover from errors:
  - acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
  - negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
  - sender retransmits pkt on receipt of NAK
• new mechanisms in rdt2.0 (beyond rdt1.0):
  - error detection
  - receiver feedback: control msgs (ACK, NAK) rcvr->sender

How do humans recover from "errors" during conversation?


rdt2.0: FSM specification

sender:
    state: Wait for call from above
        event: rdt_send(data)
        actions: sndpkt = make_pkt(data, checksum); udt_send(sndpkt)
        next state: Wait for ACK or NAK
    state: Wait for ACK or NAK
        event: rdt_rcv(rcvpkt) && isNAK(rcvpkt)
        actions: udt_send(sndpkt)
        event: rdt_rcv(rcvpkt) && isACK(rcvpkt)
        actions: Λ
        next state: Wait for call from above

receiver:
    state: Wait for call from below
        event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
        actions: udt_send(NAK)
        event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
        actions: extract(rcvpkt, data); deliver_data(data); udt_send(ACK)

rdt2.0: operation with no errors

[Figure: the rdt2.0 sender and receiver FSMs, repeated from the FSM specification slide]

rdt2.0: error scenario

[Figure: the rdt2.0 sender and receiver FSMs, repeated from the FSM specification slide]

rdt2.0 has a fatal flaw!

what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver!
• can't just retransmit: possible duplicate

handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt

stop and wait: sender sends one packet, then waits for receiver response

rdt2.1: sender, handles garbled ACK/NAKs

state: Wait for call 0 from above
    event: rdt_send(data)
    actions: sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
    next state: Wait for ACK or NAK 0
state: Wait for ACK or NAK 0
    event: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
    actions: udt_send(sndpkt)
    event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
    actions: Λ
    next state: Wait for call 1 from above
state: Wait for call 1 from above
    event: rdt_send(data)
    actions: sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt)
    next state: Wait for ACK or NAK 1
state: Wait for ACK or NAK 1
    event: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
    actions: udt_send(sndpkt)
    event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
    actions: Λ
    next state: Wait for call 0 from above

rdt2.1: receiver, handles garbled ACK/NAKs

state: Wait for 0 from below
    event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
    actions: extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
    next state: Wait for 1 from below
    event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    actions: sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
    event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
    actions: sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
state: Wait for 1 from below
    event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
    actions: extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
    next state: Wait for 0 from below
    event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    actions: sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
    event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
    actions: sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)

rdt2.1: Example 1

[Figure sequence: the rdt2.1 sender and receiver FSMs, with one transition shown per slide]
1. sender, Wait for call 0 from above: rdt_send(data); sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
2. receiver, Wait for 0 from below: rdt_rcv(rcvpkt) && corrupt(rcvpkt); sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender, Wait for ACK or NAK 0: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)); udt_send(sndpkt)
4. receiver, Wait for 0 from below: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
5. sender, Wait for ACK or NAK 0: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt); Λ
6. final slide: FSMs shown with no transition

rdt2.1: Example 2

[Figure sequence: the rdt2.1 sender and receiver FSMs, with one transition shown per slide]
1. receiver, Wait for 0 from below: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
2. sender, Wait for ACK or NAK 0: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)); udt_send(sndpkt)
3. receiver, Wait for 1 from below: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
4. sender, Wait for ACK or NAK 0: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt); Λ
5. final slide: FSMs shown with no transition

rdt2.1: discussion

sender:
• seq # added to pkt
• two seq. #'s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  - state must "remember" whether "expected" pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  - state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender

rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  - receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt

rdt2.2: sender, receiver fragments

sender FSM fragment:
    state: Wait for call 0 from above
        event: rdt_send(data)
        actions: sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
        next state: Wait for ACK 0
    state: Wait for ACK 0
        event: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1))
        actions: udt_send(sndpkt)
        event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0)
        actions: Λ

receiver FSM fragment:
    transition into state Wait for 0 from below:
        event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
        actions: extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
    state: Wait for 0 from below
        event: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt))
        actions: udt_send(sndpkt)

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
  - checksum, seq. #, ACKs, retransmissions will be of help ... but not enough

approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  - retransmission will be duplicate, but seq. #'s already handle this
  - receiver must specify seq # of pkt being ACKed
• requires countdown timer

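rdt3.0 sender

state: Wait for call 0 from above
    event: rdt_send(data)
    actions: sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer
    next state: Wait for ACK0
    event: rdt_rcv(rcvpkt)
    actions: Λ
state: Wait for ACK0
    event: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1))
    actions: Λ
    event: timeout
    actions: udt_send(sndpkt); start_timer
    event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0)
    actions: stop_timer
    next state: Wait for call 1 from above
state: Wait for call 1 from above
    event: rdt_send(data)
    actions: sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer
    next state: Wait for ACK1
    event: rdt_rcv(rcvpkt)
    actions: Λ
state: Wait for ACK1
    event: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0))
    actions: Λ
    event: timeout
    actions: udt_send(sndpkt); start_timer
    event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1)
    actions: stop_timer
    next state: Wait for call 0 from above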
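The alternating-bit behaviour of rdt3.0 can be exercised with a deterministic simulation. This is a sketch under stated assumptions: the channel never corrupts or reorders, losses are specified up front by transmission index, and a lost packet or lost ACK is modelled as an immediate timeout. The function name `simulate_rdt30` and its parameters are hypothetical.

```python
def simulate_rdt30(messages, lost_data=(), lost_acks=()):
    """Stop-and-wait delivery with a 1-bit sequence number.

    lost_data / lost_acks: 0-based indices of data and ACK transmissions
    that the (hypothetical) channel drops.
    Returns (delivered messages, number of data transmissions).
    """
    delivered = []
    expected = 0            # receiver: seq # it is waiting for
    seq = 0                 # sender: seq # of the current packet
    data_tx = ack_tx = 0    # transmission counters
    for msg in messages:
        while True:
            dropped = data_tx in lost_data
            data_tx += 1
            if dropped:
                continue                 # timeout, resend same packet
            if seq == expected:          # new packet: deliver it
                delivered.append(msg)
                expected ^= 1
            # Otherwise it is a duplicate: not delivered, but re-ACKed.
            ack_dropped = ack_tx in lost_acks
            ack_tx += 1
            if ack_dropped:
                continue                 # timeout, resend same packet
            break                        # ACK for seq received
        seq ^= 1                         # alternate between 0 and 1
    return delivered, data_tx


print(simulate_rdt30(["a", "b", "c"], lost_acks={1}))
```

In the printed run the ACK for the second packet is lost, so the packet is sent twice but delivered once, which mirrors the ACK loss scenario in the timing diagrams.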

rdt3.0 in action

(a) no loss
1. sender: send pkt0; receiver: rcv pkt0, send ack0
2. sender: rcv ack0, send pkt1; receiver: rcv pkt1, send ack1
3. sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0

(b) packet loss
1. sender: send pkt0; receiver: rcv pkt0, send ack0
2. sender: rcv ack0, send pkt1; pkt1 is lost (X)
3. sender: timeout, resend pkt1; receiver: rcv pkt1, send ack1
4. sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0

(c) ACK loss
1. sender: send pkt0; receiver: rcv pkt0, send ack0
2. sender: rcv ack0, send pkt1; receiver: rcv pkt1, send ack1; ack1 is lost (X)
3. sender: timeout, resend pkt1; receiver: rcv pkt1 (detect duplicate), send ack1
4. sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0

(d) premature timeout/delayed ACK
1. sender: send pkt0; receiver: rcv pkt0, send ack0
2. sender: rcv ack0, send pkt1; receiver: rcv pkt1, send ack1 (delayed)
3. sender: timeout, resend pkt1; receiver: rcv pkt1 (detect duplicate), send ack1
4. sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0
5. sender: rcv ack1 (second copy), do nothing
6. sender: rcv ack0, send pkt1

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

    D_trans = L / R = 8000 bits / 10^9 bits/sec = 8 microsecs

  - U_sender: utilization, fraction of time sender busy sending

    U_sender = (L / R) / (RTT + L / R) = 0.008 / 30.008 = 0.00027

  - if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
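The utilization figure can be recomputed directly. The helper below is a hypothetical sketch; its `window` parameter generalises the formula to a sender that keeps several packets in flight, which is an extension beyond this slide (stop-and-wait corresponds to window=1).

```python
def sender_utilization(packet_bits, link_bps, rtt_s, window=1):
    """Fraction of time the sender is busy transmitting.

    window: number of packets sent per RTT (1 for stop-and-wait).
    """
    d_trans = packet_bits / link_bps              # L / R, in seconds
    return min(1.0, window * d_trans / (rtt_s + d_trans))


# 8000 bit packet, 1 Gbps link, RTT = 30 ms
print(sender_utilization(8000, 1e9, 0.030))
print(sender_utilization(8000, 1e9, 0.030, window=3))
```

The first value is the stop-and-wait utilization of about 0.00027; sending three packets per RTT triples it.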

rdt3.0: stop-and-wait operation

[Figure: sender/receiver timeline]
• first packet bit transmitted, t = 0
• last packet bit transmitted, t = L / R
• first packet bit arrives
• last packet bit arrives, send ACK
• ACK arrives, send next packet, t = RTT + L / R

    U_sender = (L / R) / (RTT + L / R) = 0.008 / 30.008 = 0.00027

Pipelined protocols

pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver

• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

[Figure: sender/receiver timeline with three packets in flight]
• first packet bit transmitted, t = 0
• last bit transmitted, t = L / R
• first packet bit arrives
• last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• ACK arrives, send next packet, t = RTT + L / R

3-packet pipelining increases utilization by a factor of 3!

    U_sender = (3L / R) / (RTT + L / R) = 0.024 / 30.008 = 0.0008

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n - "cumulative ACK"
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

initial transition (Λ):
  base = 1
  nextseqnum = 1

state: Wait

event: rdt_send(data)
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)

event: timeout
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  ...
  udt_send(sndpkt[nextseqnum-1])

event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer

event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
  Λ

GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
- may generate duplicate ACKs
- need only remember expectedseqnum

• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #

initial transition (Λ):
  expectedseqnum = 1
  sndpkt = make_pkt(0, ACK, chksum)

state: Wait

event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum)
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++

event: default
  udt_send(sndpkt)

GBN in action

sender window (N=4); sequence numbers 0 1 2 3 4 5 6 7 8

sender:
• send pkt0
• send pkt1
• send pkt2 (lost, X)
• send pkt3
• (wait)
• rcv ack0, send pkt4
• rcv ack1, send pkt5
• ignore duplicate ACK
• pkt 2 timeout: send pkt2, send pkt3, send pkt4, send pkt5

receiver:
• receive pkt0, send ack0
• receive pkt1, send ack1
• receive pkt3, discard, (re)send ack1
• receive pkt4, discard, (re)send ack1
• receive pkt5, discard, (re)send ack1
• rcv pkt2, deliver, send ack2
• rcv pkt3, deliver, send ack3
• rcv pkt4, deliver, send ack4
• rcv pkt5, deliver, send ack5
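The receiver side of this trace can be replayed with a small sketch. It assumes an uncorrupted channel and sequence numbers that start at 0 and never wrap, as in the trace (the FSM slide starts at 1 instead); the class and method names are illustrative only.

```python
class GBNReceiver:
    """Go-Back-N receiver: delivers in-order packets, discards everything else."""

    def __init__(self, first_seq=0):
        self.expected = first_seq   # the only state the receiver has to remember
        self.delivered = []         # sequence numbers handed to the upper layer

    def receive(self, seq):
        """Process packet `seq` and return the cumulative ACK number sent back."""
        if seq == self.expected:
            self.delivered.append(seq)
            self.expected += 1
        # In order or not, always (re)ACK the highest in-order packet so far.
        return self.expected - 1


rcvr = GBNReceiver()
arrivals = [0, 1, 3, 4, 5, 2, 3, 4, 5]      # pkt2 lost, then resent with 3, 4, 5
acks = [rcvr.receive(seq) for seq in arrivals]
print(acks)
print(rcvr.delivered)
```

The three out-of-order arrivals each trigger a duplicate ack1, which the sender ignores until its timer for pkt2 expires.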

Selective repeat

• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #s of sent, unACKed packets

Selective repeat: sender, receiver windows

[Figure: sender and receiver views of the sequence number space.]

Selective repeat

sender

data from above:
• if next available seq # in window, send pkt

timeout(n):
• resend pkt n, restart timer

ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver

pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt

pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)

otherwise:
• ignore

Selective repeat in action

sender window (N=4); sequence numbers 0 1 2 3 4 5 6 7 8

sender:
• send pkt0
• send pkt1
• send pkt2 (lost, X)
• send pkt3
• (wait)
• rcv ack0, send pkt4
• rcv ack1, send pkt5
• record ack3 arrived
• pkt 2 timeout: send pkt2
• record ack4 arrived
• record ack5 arrived

receiver:
• receive pkt0, send ack0
• receive pkt1, send ack1
• receive pkt3, buffer, send ack3
• receive pkt4, buffer, send ack4
• receive pkt5, buffer, send ack5
• rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?
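A minimal sketch of the selective-repeat receiver rules, replaying the trace above. It assumes sequence numbers that do not wrap around and an error-free channel; the class name and attributes are hypothetical.

```python
class SRReceiver:
    """Selective-repeat receiver: ACKs each packet, buffers out-of-order ones."""

    def __init__(self, window_size, rcv_base=0):
        self.n = window_size
        self.rcv_base = rcv_base
        self.buffer = set()      # received but not yet delivered
        self.delivered = []      # sequence numbers passed to the upper layer

    def receive(self, seq):
        """Return the ACK sent for `seq`, or None if the packet is ignored."""
        if self.rcv_base <= seq < self.rcv_base + self.n:
            self.buffer.add(seq)
            # Deliver the in-order run starting at rcv_base and slide the window.
            while self.rcv_base in self.buffer:
                self.buffer.remove(self.rcv_base)
                self.delivered.append(self.rcv_base)
                self.rcv_base += 1
            return seq
        if self.rcv_base - self.n <= seq < self.rcv_base:
            return seq           # already delivered: re-ACK so the sender can advance
        return None


rcvr = SRReceiver(window_size=4)
acks = [rcvr.receive(seq) for seq in [0, 1, 3, 4, 5, 2]]
print(acks)
print(rcvr.delivered)
```

Unlike Go-Back-N, the arrival of the retransmitted pkt2 releases the buffered pkt3, pkt4 and pkt5 in one step, so only one packet had to be resent.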

Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

[Figure: two sender/receiver timelines over the sequence space 0 1 2 3 0 1 2, showing the sender window and the receiver window after each receipt.]

(a) no problem: pkt0, pkt1, pkt2 are received and ACKed; pkt3 is lost (X); the sender then sends a new pkt0, and the receiver will accept the packet with seq number 0.

(b) oops!: pkt0, pkt1, pkt2 are received but all three ACKs are lost (X X X); the sender times out and retransmits the old pkt0, and the receiver will accept the packet with seq number 0.

• receiver sees no difference in the two scenarios!
• duplicate data accepted as new in (b)

receiver can't see sender side. receiver behavior identical in both cases! something's (very) wrong!

Q: what relationship between seq # size and window size to avoid problem in (b)?
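One way to explore the question is to model the worst case directly: the receiver has accepted a full window and advanced, while every ACK was lost so the sender may still retransmit the whole old window. The sketch below (function name hypothetical) reports whether the two windows share a sequence number. The usual textbook answer, that the window must be at most half the sequence number space, is stated here as background rather than taken from the slide.

```python
def sr_windows_overlap(seq_space, window):
    """True if an old retransmission could be mistaken for a new packet."""
    # Sequence numbers the sender may still retransmit (all ACKs were lost).
    sender_old = {i % seq_space for i in range(window)}
    # Sequence numbers the receiver now accepts as new data.
    receiver_new = {(window + i) % seq_space for i in range(window)}
    return bool(sender_old & receiver_new)


print(sr_windows_overlap(seq_space=4, window=3))   # the slide's example
print(sr_windows_overlap(seq_space=4, window=2))
```

For the slide's example (4 sequence numbers, window 3) the windows overlap at 0 and 1, which is exactly the ambiguity in scenario (b); with window 2 they do not.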

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver

TCP segment structure

[Figure: TCP segment layout, 32 bits wide.]

• source port #, dest port #
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• head len, not used, flag bits U, A, P, R, S, F
  - URG: urgent data (generally not used)
  - ACK: ACK # valid
  - PSH: push data now
  - RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
• Urg data pointer
• options (variable length)
• application data (variable length)

TCP seq. numbers, ACKs

sequence numbers:
- byte stream "number" of first byte in segment's data

acknowledgements:
- seq # of next byte expected from other side
- cumulative ACK

Q: how does the receiver handle out-of-order segments?
- A: TCP spec doesn't say; up to implementor

[Figure: outgoing segment from sender and incoming segment to sender (fields: source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer; the incoming acknowledgement number is marked A), drawn above the sender sequence number space, which is divided into: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable. Window size N.]

Byte stream in TCP

[Figure: byte stream carrying an HTTP Get Message (K bytes). Labels: Window (N bytes); 100th byte; TCP header (seq. no = 100) followed by M bytes; the remainder of the message "cannot be transmitted now".]

TCP seq. numbers, ACKs

simple telnet scenario (Host A, Host B):

• User at Host A types 'C'
  A -> B: Seq=42, ACK=79, data = 'C'
• Host B ACKs receipt of 'C', echoes back 'C'
  B -> A: Seq=79, ACK=43, data = 'C'
• Host A ACKs receipt of echoed 'C'
  A -> B: Seq=43, ACK=80
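The ACK numbers in the telnet exchange follow from one rule: the acknowledgement is the sequence number of the next byte expected, i.e. the received segment's sequence number plus its payload length. A tiny illustrative helper (the name is an assumption, not TCP terminology):

```python
def next_ack(seq, payload):
    """ACK number a receiver returns after an in-order segment arrives."""
    return seq + len(payload)   # sequence number of the next byte expected


print(next_ack(42, b"C"))   # Host B acknowledging Host A's 'C'
print(next_ack(79, b"C"))   # Host A acknowledging the echoed 'C'
```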

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT

TCP round trip time, timeout

$\text{EstimatedRTT} = (1 - \alpha) \cdot \text{EstimatedRTT} + \alpha \cdot \text{SampleRTT}$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: $\alpha = 0.125$

[Figure: RTT, gaia.cs.umass.edu to fantasia.eurecom.fr. x-axis: time (seconds); y-axis: RTT (milliseconds), 100 to 350; series: SampleRTT, EstimatedRTT.]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

$\text{DevRTT} = (1 - \beta) \cdot \text{DevRTT} + \beta \cdot |\text{SampleRTT} - \text{EstimatedRTT}|$

(typically, $\beta = 0.25$)

$\text{TimeoutInterval} = \text{EstimatedRTT} + 4 \cdot \text{DevRTT}$

(EstimatedRTT is the estimated RTT; $4 \cdot \text{DevRTT}$ is the "safety margin")
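The two moving averages and the timeout formula can be combined into a small estimator. This is a sketch under stated assumptions: the first sample seeds EstimatedRTT, DevRTT starts at 0, and DevRTT is updated before EstimatedRTT (the slides do not fix the initial values or the update order). The class name is hypothetical.

```python
class RttEstimator:
    """EWMA-based RTT estimate and timeout, using the slide's alpha and beta."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample   # assumption: seed with the first sample
        self.dev_rtt = 0.0                  # assumption: no deviation known yet

    def update(self, sample_rtt):
        """Fold in one SampleRTT and return the new TimeoutInterval."""
        # Assumption: deviation is measured against the previous estimate.
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        return self.timeout_interval()

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)      # milliseconds
print(est.update(200.0), est.estimated_rtt, est.dev_rtt)
```

A single outlier of 200 ms moves the estimate only to 112.5 ms, but the deviation term widens the timeout to 212.5 ms, which is the "safety margin" at work.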

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks

let's initially consider simplified TCP sender:
- ignore duplicate acks
- ignore flow control, congestion control

TCP sender events:

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments

TCP sender (simplified)

initial transition (Λ):
  NextSeqNum = InitialSeqNum
  SendBase = InitialSeqNum

state: wait for event

event: data received from application above
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., "send")
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
    start timer

event: timeout
  retransmit not-yet-acked segment with smallest seq. #
  start timer

event: ACK received, with ACK field value y
  if (y > SendBase) {
    SendBase = y
    /* SendBase-1: last cumulatively ACKed byte */
    if (there are currently not-yet-acked segments)
      start timer
    else stop timer
  }
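The three events of the simplified sender translate fairly directly into code. The sketch below keeps the timer as a boolean and records transmissions in a list instead of touching a network; it also assumes ACKs always fall on segment boundaries. All names are illustrative.

```python
class SimpleTcpSender:
    """Simplified TCP sender: cumulative ACKs, one timer, no dup-ACK handling."""

    def __init__(self, initial_seq):
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.unacked = {}            # seq # -> payload of not-yet-acked segments
        self.timer_running = False
        self.sent = []               # log of (seq #, payload) passed "to IP"

    def on_app_data(self, data):
        self.unacked[self.next_seq] = data
        self.sent.append((self.next_seq, data))
        self.next_seq += len(data)
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if not self.unacked:
            return
        seq = min(self.unacked)      # not-yet-acked segment with smallest seq #
        self.sent.append((seq, self.unacked[seq]))
        self.timer_running = True    # restart timer

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y       # send_base - 1: last cumulatively ACKed byte
            # Assumption: y lies on a segment boundary.
            for seq in [s for s in self.unacked if s < y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)


s = SimpleTcpSender(initial_seq=92)
s.on_app_data(b"a" * 8)              # Seq=92, 8 bytes
s.on_app_data(b"b" * 20)             # Seq=100, 20 bytes
s.on_ack(120)                        # one cumulative ACK covers both segments
print(s.send_base, s.timer_running)
```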

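TCP: retransmission scenarios

lost ACK scenario (Host A, Host B):
• A sends Seq=92, 8 bytes of data
• B replies ACK=100, which is lost (X)
• A times out and resends Seq=92, 8 bytes of data
• B replies ACK=100

premature timeout (Host A, Host B):
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data (SendBase=92)
• B replies ACK=100, then ACK=120
• A times out before the ACKs arrive and resends Seq=92, 8 bytes of data
• the ACKs arrive at A (SendBase=100, then SendBase=120)
• B answers the duplicate segment with ACK=120 (SendBase stays at 120)

cumulative ACK (Host A, Host B):
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
• B replies ACK=100, which is lost (X), then ACK=120
• ACK=120 arrives before the timeout, so A does not retransmit and goes on to send Seq=120, 15 bytes of data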

TCP ACK generation [RFC 5681]

event at receiver -> TCP receiver action

• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  -> delayed ACK: wait up to 500ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  -> immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment, higher-than-expect seq. #; gap detected
  -> immediately send duplicate ACK, indicating seq. # of next expected byte
• arrival of segment that partially or completely fills gap
  -> immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
- likely that unacked segment lost, so don't wait for timeout

[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost (X); Host B returns ACK=100 repeatedly; after 3 DUP ACKs, and before the timeout expires, Host A resends Seq=100, 20 bytes of data.]

TCP flow control

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

application may remove data from TCP socket buffers ... slower than TCP receiver is delivering (sender is sending)

[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code (OS) into the TCP socket receiver buffers, from which the application process reads.]

TCP flow control

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data and free buffer space (rwnd); data leaves to the application process.]

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

[Figure: client and server, each with an application above the network and per-connection state:
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client]

client: Socket clientSocket = newSocket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

client state: CLOSED; server state: LISTEN

• client: choose init seq num, x; send TCP SYN msg
  segment: SYNbit=1, Seq=x
  client state: SYNSENT
• server: choose init seq num, y; send TCP SYNACK msg, acking SYN
  segment: SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1
  server state: SYN RCVD
• client: received SYNACK(x) indicates server is live; send ACK for SYNACK; this segment may contain client-to-server data
  segment: ACKbit=1, ACKnum=y+1
  client state: ESTAB
• server: received ACK(y) indicates client is live
  server state: ESTAB
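The sequence and acknowledgement numbers of the three segments depend only on the two initial sequence numbers. The sketch below builds them as plain dictionaries; the field names mirror the slide's labels and are not a real socket API.

```python
def three_way_handshake(x, y):
    """Return the SYN, SYNACK and ACK segments for client ISN x and server ISN y."""
    syn = {"SYNbit": 1, "Seq": x}
    synack = {"SYNbit": 1, "Seq": y, "ACKbit": 1, "ACKnum": syn["Seq"] + 1}
    ack = {"ACKbit": 1, "ACKnum": synack["Seq"] + 1}
    return [syn, synack, ack]


for segment in three_way_handshake(x=1000, y=5000):
    print(segment)
```

Each side acknowledges the other's initial sequence number plus one, which is how both ends confirm that the peer is live.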

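TCP 3-way handshake: FSM

server side:
• closed -> listen
  event: Socket connectionSocket = welcomeSocket.accept();
  action: Λ
• listen -> SYN rcvd
  event: SYN(x)
  action: SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
• SYN rcvd -> ESTAB
  event: ACK(ACKnum=y+1)
  action: Λ

client side:
• closed -> SYN sent
  event: Socket clientSocket = newSocket("hostname", "port number");
  action: SYN(seq=x)
• SYN sent -> ESTAB
  event: SYNACK(seq=y, ACKnum=x+1)
  action: ACK(ACKnum=y+1)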

TCP: closing a connection

• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

client state: ESTAB; server state: ESTAB

• client: clientSocket.close(); sends FINbit=1, seq=x
  client state: FIN_WAIT_1 (can no longer send but can receive data)
• server: replies ACKbit=1, ACKnum=x+1
  server state: CLOSE_WAIT (can still send data)
  client state: FIN_WAIT_2 (wait for server close)
• server: sends FINbit=1, seq=y
  server state: LAST_ACK (can no longer send data)
• client: replies ACKbit=1, ACKnum=y+1
  client state: TIMED_WAIT (timed wait for 2*max segment lifetime), then CLOSED
• server: receives the ACK
  server state: CLOSED

The "Two Army Problem"

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity

[Figure: Host A sends original data at rate λ_in into a router with unlimited shared output link buffers; Host B receives throughput λ_out. Plots: λ_out vs. λ_in (both up to R/2) and delay vs. λ_in (up to R/2).]

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λ_in = λ_out
  - transport-layer input includes retransmissions: λ'_in ≥ λ_in

[Figure: Host A sends λ_in (original data) and λ'_in (original data, plus retransmitted data) into a router with finite shared output link buffers; Host B receives λ_out.]

idealization: perfect knowledge
• sender sends only when router buffers available
• [Plot: λ_out vs. λ_in, both up to R/2.]

idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
• when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)

realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered!

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λ_in and λ_in' increase?
A: as red λ_in' increases, all arriving blue pkts at upper queue are dropped, blue throughput -> 0

[Figure: Hosts A, B, C, D connected through routers with finite shared output link buffers; λ_in: original data; λ_in': original data, plus retransmitted data; λ_out at Host B.]

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!

[Plot: throughput by blue traffic (λ_out, up to R/2) vs. offered load by Host A (λ_in', up to R/2); annotation: bandwidth wastage for packets dropped at the 2nd router.]

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss

[Figure: AIMD saw tooth behavior: probing for bandwidth. x-axis: time; y-axis: cwnd, TCP sender congestion window size. The window size is additively increased until loss occurs, then cut in half.]

TCP Congestion Control: details

• sender limits transmission:

$\text{LastByteSent} - \text{LastByteAcked} \le \text{cwnd}$

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

$\text{rate} \approx \frac{\text{cwnd}}{\text{RTT}}$ bytes/sec

[Figure: sender sequence number space: last byte ACKed; sent, not-yet ACKed ("in-flight"); last byte sent. The in-flight span is bounded by cwnd.]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A / Host B timeline. Host A sends one segment, then two segments one RTT later, then four segments one RTT after that.]

TCP: detecting, reacting to loss

• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate acks)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

initial transition (Λ):
  cwnd = 1 MSS
  ssthresh = 64 KB
  dupACKcount = 0

state: slow start
  event: new ACK
    cwnd = cwnd + MSS
    dupACKcount = 0
    transmit new segment(s), as allowed
  event: duplicate ACK
    dupACKcount++
  event: cwnd > ssthresh
    Λ; go to congestion avoidance
  event: dupACKcount == 3
    ssthresh = cwnd/2
    cwnd = ssthresh + 3
    retransmit missing segment; go to fast recovery
  event: timeout
    ssthresh = cwnd/2
    cwnd = 1 MSS
    dupACKcount = 0
    retransmit missing segment

state: congestion avoidance
  event: new ACK
    cwnd = cwnd + MSS * (MSS/cwnd)
    dupACKcount = 0
    transmit new segment(s), as allowed
  event: duplicate ACK
    dupACKcount++
  event: dupACKcount == 3
    ssthresh = cwnd/2
    cwnd = ssthresh + 3
    retransmit missing segment; go to fast recovery
  event: timeout
    ssthresh = cwnd/2
    cwnd = 1 MSS
    dupACKcount = 0
    retransmit missing segment; go to slow start

state: fast recovery
  event: duplicate ACK
    cwnd = cwnd + MSS
    transmit new segment(s), as allowed
  event: new ACK
    cwnd = ssthresh
    dupACKcount = 0
    go to congestion avoidance
  event: timeout
    ssthresh = cwnd/2
    cwnd = 1
    dupACKcount = 0
    retransmit missing segment; go to slow start
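The three-state machine summarized above can be sketched as a small class that tracks only cwnd, ssthresh and the duplicate-ACK count. Assumptions: cwnd and ssthresh are kept in units of MSS (so the initial ssthresh of 64 is 64 segments, not the slide's 64 KB), and retransmission itself is left out. The class and state names are illustrative.

```python
class RenoCwnd:
    """cwnd evolution for slow start, congestion avoidance and fast recovery."""

    def __init__(self, ssthresh=64.0):
        self.cwnd = 1.0              # in MSS
        self.ssthresh = ssthresh     # in MSS (assumption, see lead-in)
        self.dup_acks = 0
        self.state = "slow_start"

    def new_ack(self):
        if self.state == "slow_start":
            self.cwnd += 1.0                     # +1 MSS per ACK: doubles per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        elif self.state == "congestion_avoidance":
            self.cwnd += 1.0 / self.cwnd         # about +1 MSS per RTT
        else:                                    # fast recovery ends on a new ACK
            self.cwnd = self.ssthresh
            self.state = "congestion_avoidance"
        self.dup_acks = 0

    def dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += 1.0
            return
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3
            self.state = "fast_recovery"

    def timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1.0
        self.dup_acks = 0
        self.state = "slow_start"


c = RenoCwnd()
for _ in range(3):
    c.new_ack()          # slow start: cwnd grows 1 -> 4
for _ in range(3):
    c.dup_ack()          # triple duplicate ACK: halve, then enter fast recovery
print(c.state, c.cwnd, c.ssthresh)
```

Driving the object with a sequence of events is a convenient way to see the sawtooth: a timeout always sends cwnd back to 1 MSS, while a triple duplicate ACK only halves it.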

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT

$\text{avg TCP throughput} = \frac{3}{4} \cdot \frac{W}{RTT}$ bytes/sec

[Figure: sawtooth of window size oscillating between W/2 and W.]

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

$\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \sqrt{L}}$

-> to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate!
• new versions of TCP for high-speed
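Both numbers on this slide follow from short calculations, reproduced below as a sketch. The function names are hypothetical; the only formula used beyond plain arithmetic is the throughput expression quoted on the slide, with MSS converted from bytes to bits.

```python
from math import sqrt


def window_segments(target_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain target_bps over one RTT."""
    return target_bps * rtt_s / (mss_bytes * 8)


def mathis_throughput_bps(mss_bytes, rtt_s, loss):
    """Throughput predicted by 1.22 * MSS / (RTT * sqrt(L)), in bits/sec."""
    return 1.22 * mss_bytes * 8 / (rtt_s * sqrt(loss))


def loss_for_throughput(target_bps, rtt_s, mss_bytes):
    """Loss probability L at which the formula yields target_bps."""
    return (1.22 * mss_bytes * 8 / (rtt_s * target_bps)) ** 2


print(window_segments(10e9, 0.1, 1500))
print(loss_for_throughput(10e9, 0.1, 1500))
```

The window comes out at roughly 83,333 segments and the tolerable loss rate at about 2 in 10 billion segments, matching the slide.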

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R.]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput vs. Connection 1 throughput, both axes from 0 to R, with the equal bandwidth share line and the full bandwidth utilization line. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2". Marked points: (X1, Y1), where X1 + Y1 = R, and (X2, Y2), where X2 = Y2.]

Fairness (more)

Fairness and UDP:
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections:
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets about R/2 (11 of the 20 connections)

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and is marked ECN=11 on the way; the destination returns a TCP ACK segment with ECE=1.]

Principles of reliable data transfer

• important in application, transport, link layers
  - top-10 list of important networking topics!
• characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)

Reliable data transfer: getting started

send side:
• rdt_send(): called from above (e.g., by app.). Passed data to deliver to receiver upper layer
• udt_send(): called by rdt, to transfer packet over unreliable channel to receiver

receive side:
• rdt_rcv(): called when packet arrives on rcv-side of channel
• deliver_data(): called by rdt to deliver data to upper layer

Reliable data transfer: getting started

We'll:
• incrementally develop sender, receiver sides of reliable data transfer protocol (rdt)
• consider only unidirectional data transfer
  - but control info will flow on both directions!
• use finite state machines (FSMs) to specify sender, receiver

[Figure: FSM notation. Two states (state 1, state 2) joined by a transition labeled with the event causing the state transition, above the actions taken on the state transition (event / actions). state: when in this "state", next state uniquely determined by next event.]

rdt1.0: reliable transfer over a reliable channel

• underlying channel perfectly reliable
  - no bit errors
  - no loss of packets
• separate FSMs for sender, receiver:
  - sender sends data into underlying channel
  - receiver reads data from underlying channel

sender
  state: Wait for call from above
  event: rdt_send(data)
    packet = make_pkt(data)
    udt_send(packet)

receiver
  state: Wait for call from below
  event: rdt_rcv(packet)
    extract(packet, data)
    deliver_data(data)

rdt2.0: channel with bit errors

• underlying channel may flip bits in packet
  - checksum to detect bit errors
• the question: how to recover from errors?
  - acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
  - negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
  - sender retransmits pkt on receipt of NAK
• new mechanisms in rdt2.0 (beyond rdt1.0):
  - error detection
  - receiver feedback: control msgs (ACK, NAK) rcvr->sender

How do humans recover from "errors" during conversation?

rdt2.0: FSM specification

sender
  state: Wait for call from above
    event: rdt_send(data)
      sndpkt = make_pkt(data, checksum)
      udt_send(sndpkt)
      go to Wait for ACK or NAK
  state: Wait for ACK or NAK
    event: rdt_rcv(rcvpkt) && isNAK(rcvpkt)
      udt_send(sndpkt)
    event: rdt_rcv(rcvpkt) && isACK(rcvpkt)
      Λ
      go to Wait for call from above

receiver
  state: Wait for call from below
    event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
      udt_send(NAK)
    event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
      extract(rcvpkt, data)
      deliver_data(data)
      udt_send(ACK)

rdt2.0: operation with no errors

[Figure: the rdt2.0 sender and receiver FSMs with the error-free path highlighted: on rdt_send(data) the sender builds and sends sndpkt; the receiver sees rdt_rcv(rcvpkt) && notcorrupt(rcvpkt), extracts and delivers the data, and sends ACK; on rdt_rcv(rcvpkt) && isACK(rcvpkt) the sender returns to Wait for call from above.]

rdt2.0: error scenario

[Figure: the rdt2.0 sender and receiver FSMs with the error path highlighted: the receiver sees rdt_rcv(rcvpkt) && corrupt(rcvpkt) and sends NAK; on rdt_rcv(rcvpkt) && isNAK(rcvpkt) the sender retransmits with udt_send(sndpkt) and stays in Wait for ACK or NAK.]

rdt2.0 has a fatal flaw!

what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver!
• can't just retransmit: possible duplicate

handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt

stop and wait: sender sends one packet, then waits for receiver response

rdt2.1: sender, handles garbled ACK/NAKs

state: Wait for call 0 from above
  event: rdt_send(data)
    sndpkt = make_pkt(0, data, checksum)
    udt_send(sndpkt)
    go to Wait for ACK or NAK 0

state: Wait for ACK or NAK 0
  event: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
    udt_send(sndpkt)
  event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
    Λ
    go to Wait for call 1 from above

state: Wait for call 1 from above
  event: rdt_send(data)
    sndpkt = make_pkt(1, data, checksum)
    udt_send(sndpkt)
    go to Wait for ACK or NAK 1

state: Wait for ACK or NAK 1
  event: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
    udt_send(sndpkt)
  event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
    Λ
    go to Wait for call 0 from above

rdt2.1: receiver, handles garbled ACK/NAKs

state: Wait for 0 from below
  event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(ACK, chksum)
    udt_send(sndpkt)
    go to Wait for 1 from below
  event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    sndpkt = make_pkt(NAK, chksum)
    udt_send(sndpkt)
  event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
    sndpkt = make_pkt(ACK, chksum)
    udt_send(sndpkt)

state: Wait for 1 from below
  event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(ACK, chksum)
    udt_send(sndpkt)
    go to Wait for 0 from below
  event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    sndpkt = make_pkt(NAK, chksum)
    udt_send(sndpkt)
  event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
    sndpkt = make_pkt(ACK, chksum)
    udt_send(sndpkt)
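The receiver's two states differ only in which sequence bit they expect, so a compact sketch can hold that bit in one variable. Corruption is modeled with a flag rather than a real checksum, and the class and method names are illustrative rather than taken from the slides.

```python
class Rdt21Receiver:
    """rdt2.1 receiver: 1-bit sequence numbers, ACK/NAK feedback."""

    def __init__(self):
        self.expected = 0        # 0: "Wait for 0 from below", 1: "Wait for 1 from below"
        self.delivered = []      # data passed up via deliver_data

    def rdt_rcv(self, seq, data, corrupt=False):
        """Handle one arriving packet and return the feedback sent: 'ACK' or 'NAK'."""
        if corrupt:
            return "NAK"                     # ask the sender to retransmit
        if seq == self.expected:
            self.delivered.append(data)      # new packet: deliver it
            self.expected ^= 1               # and wait for the other sequence number
        # A duplicate (wrong seq) is not delivered again, but is still ACKed
        # so that a sender whose ACK was garbled can move on.
        return "ACK"


rcvr = Rdt21Receiver()
print(rcvr.rdt_rcv(0, "a"))                  # new packet
print(rcvr.rdt_rcv(0, "a"))                  # retransmitted duplicate
print(rcvr.rdt_rcv(1, "b", corrupt=True))    # garbled packet
print(rcvr.delivered)
```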

rdt2.1: Example 1

[Figure sequence: the rdt2.1 sender FSM (Wait for call 0 from above, Wait for ACK or NAK 0, Wait for call 1 from above, Wait for ACK or NAK 1) and receiver FSM (Wait for 0 from below, Wait for 1 from below), with one transition highlighted per step.]

• sender: rdt_send(data)
    sndpkt = make_pkt(0, data, checksum)
    udt_send(sndpkt)
• receiver: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    sndpkt = make_pkt(NAK, chksum)
    udt_send(sndpkt)
• sender: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
    udt_send(sndpkt)
• receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(ACK, chksum)
    udt_send(sndpkt)
• sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
    Λ
• final step: no transition highlighted

rdt2.1: Example 2

[Animation over the rdt2.1 sender and receiver FSMs; each slide highlights one transition]

1. receiver, in "Wait for 0 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
2. sender, in "Wait for ACK or NAK 0" (the ACK arrives corrupted): rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)); udt_send(sndpkt)
3. receiver, in "Wait for 1 from below" (duplicate pkt 0, not delivered again): rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
4. sender, in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt); Λ
5. final states: sender in "Wait for call 1 from above", receiver in "Wait for 1 from below"

rdt2.1: discussion

sender:
• seq # added to pkt
• two seq #'s (0, 1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  - state must "remember" whether "expected" pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  - state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender

rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  - receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt

rdt2.2: sender, receiver fragments

sender FSM fragment

state: Wait for call 0 from above
  event: rdt_send(data)
  action: sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
  next: Wait for ACK 0

state: Wait for ACK 0
  event: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1))
  action: udt_send(sndpkt)
  next: Wait for ACK 0
  event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0)
  action: Λ

receiver FSM fragment

state: Wait for 0 from below
  event: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt))
  action: udt_send(sndpkt)
  next: Wait for 0 from below

transition into "Wait for 0 from below":
  event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
  action: extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
  - checksum, seq #, ACKs, retransmissions will be of help ... but not enough

approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  - retransmission will be duplicate, but seq #'s already handle this
  - receiver must specify seq # of pkt being ACKed
• requires countdown timer

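rdt3.0 sender

[Sender FSM, four states; Λ means "no action"]

state: Wait for call 0 from above
  event: rdt_send(data)
  action: sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer
  next: Wait for ACK0
  event: rdt_rcv(rcvpkt)
  action: Λ

state: Wait for ACK0
  event: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1))
  action: Λ
  event: timeout
  action: udt_send(sndpkt); start_timer
  event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0)
  action: stop_timer
  next: Wait for call 1 from above

state: Wait for call 1 from above
  event: rdt_send(data)
  action: sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer
  next: Wait for ACK1
  event: rdt_rcv(rcvpkt)
  action: Λ

state: Wait for ACK1
  event: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0))
  action: Λ
  event: timeout
  action: udt_send(sndpkt); start_timer
  event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1)
  action: stop_timer
  next: Wait for call 0 from above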

rdt3.0 in action

[Figure: four sender/receiver timelines]

(a) no loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0

(b) packet loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 (pkt1 lost, X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0

(c) ACK loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 lost, X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 delayed)
  sender: timeout, resend pkt1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, do nothing
  receiver: rcv pkt0, send ack0

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

$$D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \text{microsecs}$$

• U_sender: utilization, the fraction of time sender busy sending (times below in msec)

$$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$$

• if RTT = 30 msec, 1KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
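The stop-and-wait numbers on this slide can be checked with a few lines of Python. This is a minimal sketch: the function name `sender_utilization` and its `window` parameter (packets sent back-to-back per round trip, 1 for stop-and-wait) are illustrative assumptions, not names from the slides.

```python
def sender_utilization(packet_bits, rate_bps, rtt_s, window=1):
    """Fraction of time the sender is busy transmitting.

    window is a hypothetical parameter: the number of packets sent
    back-to-back before waiting for an ACK (1 = stop-and-wait).
    """
    d_trans = packet_bits / rate_bps              # transmission delay L/R, seconds
    return min(1.0, window * d_trans / (rtt_s + d_trans))


L_BITS, R_BPS, RTT_S = 8000, 1e9, 0.030           # values from the slide
u = sender_utilization(L_BITS, R_BPS, RTT_S)
throughput_kBps = L_BITS / (RTT_S + L_BITS / R_BPS) / 8 / 1000
print(f"U_sender = {u:.5f}, throughput = {throughput_kBps:.1f} kB/sec")
```

With `window=1` the result rounds to the 0.00027 quoted on the slide; larger values of `window` show how sending several packets per round trip scales the utilization.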

rdt3.0: stop-and-wait operation

[Figure: sender/receiver timeline]
• sender: first packet bit transmitted, t = 0
• sender: last packet bit transmitted, t = L / R
• receiver: first packet bit arrives
• receiver: last packet bit arrives, send ACK
• sender: ACK arrives, send next packet, t = RTT + L / R

$$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$$

Pipelined protocols

pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver

• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

[Figure: sender/receiver timeline with three packets in flight]
• sender: first packet bit transmitted, t = 0
• sender: last bit transmitted, t = L / R
• receiver: first packet bit arrives
• receiver: last packet bit arrives, send ACK
• receiver: last bit of 2nd packet arrives, send ACK
• receiver: last bit of 3rd packet arrives, send ACK
• sender: ACK arrives, send next packet, t = RTT + L / R

3-packet pipelining increases utilization by a factor of 3!

$$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$$

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: "cumulative ACK"
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

initially: base = 1; nextseqnum = 1 (single state: Wait)

event: rdt_send(data)
  if (nextseqnum < base+N) {
      sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
      udt_send(sndpkt[nextseqnum])
      if (base == nextseqnum)
          start_timer
      nextseqnum++
  }
  else
      refuse_data(data)

event: timeout
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  ...
  udt_send(sndpkt[nextseqnum-1])

event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
      stop_timer
  else
      start_timer

event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
  Λ (no action)
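The GBN sender events can be sketched as a small Python class. This is an illustration under stated assumptions, not the slides' implementation: the class name `GBNSender`, the `send` callback standing in for udt_send, and the boolean `timer_running` standing in for a real countdown timer are all hypothetical; checksums and sequence-number wraparound are left out. One deliberate deviation is marked in a comment: `base` is never moved backwards by an old ACK.

```python
class GBNSender:
    """Go-Back-N sender sketch: window of N, cumulative ACKs, one timer."""

    def __init__(self, window_size, send):
        self.N = window_size
        self.send = send              # hypothetical stand-in for udt_send
        self.base = 1                 # oldest unACKed sequence number
        self.nextseqnum = 1           # next sequence number to use
        self.sndpkt = {}              # seq -> packet, kept until ACKed
        self.timer_running = False    # stand-in for start_timer/stop_timer

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False              # window full: refuse_data(data)
        pkt = (self.nextseqnum, data)
        self.sndpkt[self.nextseqnum] = pkt
        self.send(pkt)
        if self.base == self.nextseqnum:
            self.timer_running = True
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        # Cumulative ACK: everything up to and including acknum is ACKed.
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)
        # Assumption beyond the FSM: max() keeps base from moving backwards.
        self.base = max(self.base, acknum + 1)
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        # Go back N: resend every packet that is still unACKed.
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.send(self.sndpkt[seq])


sent = []
sender = GBNSender(window_size=4, send=sent.append)
for payload in "abcde":
    sender.rdt_send(payload)          # the fifth call is refused
sender.on_ack(2)                      # ACKs packets 1 and 2
sender.on_timeout()                   # resends packets 3 and 4
```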


GBN: receiver extended FSM

initially: expectedseqnum = 1; sndpkt = make_pkt(0, ACK, chksum) (single state: Wait)

event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum)
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++

event: default
  udt_send(sndpkt)

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #


GBN in action

[Figure: sender/receiver timeline, sender window (N=4) sliding over sequence numbers 0 1 2 3 4 5 6 7 8]

  sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
  receiver: receive pkt0, send ack0
  receiver: receive pkt1, send ack1
  receiver: receive pkt3, discard, (re)send ack1
  sender: rcv ack0, send pkt4
  sender: rcv ack1, send pkt5
  sender: ignore duplicate ACK
  receiver: receive pkt4, discard, (re)send ack1
  receiver: receive pkt5, discard, (re)send ack1
  sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
  receiver: rcv pkt2, deliver, send ack2
  receiver: rcv pkt3, deliver, send ack3
  receiver: rcv pkt4, deliver, send ack4
  receiver: rcv pkt5, deliver, send ack5


Selective repeat

• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #s of sent, unACKed packets

Selective repeat: sender, receiver windows

[Figure: sender and receiver views of the sequence number space]

Selective repeat

sender

data from above:
• if next available seq # in window, send pkt

timeout(n):
• resend pkt n, restart timer

ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver

pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt

pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)

otherwise:
• ignore
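The receiver rules above map directly onto a few lines of Python. The sketch below is illustrative only: the class name `SRReceiver`, the `delivered` list standing in for the upper layer, and the use of unbounded integer sequence numbers (no wraparound) are assumptions made to keep the example short.

```python
class SRReceiver:
    """Selective-repeat receiver sketch with a window of N packets."""

    def __init__(self, window_size):
        self.N = window_size
        self.rcvbase = 0        # smallest not-yet-received sequence number
        self.buffer = {}        # out-of-order packets waiting for delivery
        self.delivered = []     # hypothetical stand-in for the upper layer

    def on_packet(self, n, data):
        """Return the sequence number to ACK, or None to ignore the packet."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            # Deliver the in-order run starting at rcvbase, if there is one.
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n            # already delivered: re-ACK so the sender can advance
        return None             # outside both ranges: ignore


rx = SRReceiver(window_size=4)
acks = [rx.on_packet(n, f"pkt{n}") for n in (0, 1, 3, 2)]
print(acks, rx.delivered)
```

Re-ACKing packets in [rcvbase-N, rcvbase-1] matters because the sender's window only advances once it has seen those ACKs, even though the receiver has already delivered the data.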

Selective repeat in action

[Figure: sender/receiver timeline, sender window (N=4) sliding over sequence numbers 0 1 2 3 4 5 6 7 8]

  sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
  receiver: receive pkt0, send ack0
  receiver: receive pkt1, send ack1
  receiver: receive pkt3, buffer, send ack3
  sender: rcv ack0, send pkt4
  sender: rcv ack1, send pkt5
  sender: record ack3 arrived
  receiver: receive pkt4, buffer, send ack4
  receiver: receive pkt5, buffer, send ack5
  sender: pkt 2 timeout, send pkt2
  sender: record ack4 arrived; record ack5 arrived
  receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

[Figure: sender window (after receipt) and receiver window (after receipt) over the sequence 0 1 2 3 0 1 2, in two scenarios]

(a) no problem: pkt0, pkt1, pkt2 are received and ACKed; the sender moves on to pkt3 (marked X in the figure) and then a new pkt0; receiver will accept packet with seq number 0

(b) oops!: pkt0, pkt1, pkt2 are received but the ACKs are lost (X X X); sender times out and retransmits pkt0; receiver will accept packet with seq number 0

• receiver can't see sender side
• receiver behavior identical in both cases!
• something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
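One way to explore the question is to test, for a given sequence-number space and window size, whether the receiver's next window can contain a number that an old retransmitted packet might still carry. The sketch below is a hedged illustration; the function name `sr_window_is_safe` is an assumption. It models the worst case from scenario (b): all N packets were received, but every ACK was lost.

```python
def sr_window_is_safe(seq_space, window):
    """True if old retransmissions cannot be mistaken for new packets.

    Worst case: the receiver got packets 0..window-1 and advanced its
    window, while the sender (all ACKs lost) may still retransmit them.
    """
    old = {s % seq_space for s in range(0, window)}            # may be retransmitted
    new = {s % seq_space for s in range(window, 2 * window)}   # receiver's next window
    return old.isdisjoint(new)


print(sr_window_is_safe(seq_space=4, window=3))   # the slide's example
print(sr_window_is_safe(seq_space=4, window=2))
```

The check fails for the slide's example (4 sequence numbers, window 3) and passes once the window is at most half the sequence-number space, which is the standard condition for selective repeat.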

TCP: Overview    RFCs: 793, 1122, 1323, 2018, 2581

• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver

TCP segment structure

[Figure: TCP segment layout, 32 bits wide]
• source port #, dest port #
• sequence number
• acknowledgement number
• head len, not used, flag bits U A P R S F, receive window
• checksum, Urg data pointer
• options (variable length)
• application data (variable length)

annotations:
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• checksum: Internet checksum (as in UDP)

TCP seq. numbers, ACKs

sequence numbers:
  - byte stream "number" of first byte in segment's data

acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK

Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say; up to implementor

[Figure: outgoing segment from sender and incoming segment to sender (fields: source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer), shown against the sender sequence number space: sent ACKed | sent, not-yet ACKed ("in-flight") | usable but not yet sent | not usable; window size N]

Byte stream in TCP

[Figure: an HTTP Get Message (K bytes) as a byte stream; window of N bytes; a segment of M bytes starting at the 100th byte carries a TCP header with seq. no. = 100; the part of the message outside the window cannot be transmitted now]

TCP seq. numbers, ACKs

simple telnet scenario (Host A, Host B):
1. User types 'C'. A to B: Seq=42, ACK=79, data = 'C'
2. host ACKs receipt of 'C', echoes back 'C'. B to A: Seq=79, ACK=43, data = 'C'
3. host ACKs receipt of echoed 'C'. A to B: Seq=43, ACK=80

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT

TCP round trip time, timeout

$$\text{EstimatedRTT} = (1-\alpha)\cdot\text{EstimatedRTT} + \alpha\cdot\text{SampleRTT}$$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Figure: RTT (milliseconds, axis from 100 to 350) versus time (seconds), gaia.cs.umass.edu to fantasia.eurecom.fr; curves: SampleRTT and EstimatedRTT]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

$$\text{DevRTT} = (1-\beta)\cdot\text{DevRTT} + \beta\cdot|\text{SampleRTT} - \text{EstimatedRTT}|$$

(typically, β = 0.25)

$$\text{TimeoutInterval} = \text{EstimatedRTT} + 4\cdot\text{DevRTT}$$

(first term: estimated RTT; second term: "safety margin")
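The three formulas combine into a small estimator. The following Python sketch is one possible reading, with assumptions labeled: the class name `RttEstimator` is hypothetical, the first sample seeds EstimatedRTT, DevRTT starts at 0, and EstimatedRTT is updated before DevRTT (the slides do not fix the initial values or the update order).

```python
class RttEstimator:
    """EWMA-based RTT and timeout estimator (sketch, times in milliseconds)."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample   # assumption: seed with first sample
        self.dev_rtt = 0.0                  # assumption: no deviation known yet

    def update(self, sample_rtt):
        # EstimatedRTT = (1 - alpha) * EstimatedRTT + alpha * SampleRTT
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        # DevRTT = (1 - beta) * DevRTT + beta * |SampleRTT - EstimatedRTT|
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        return self.timeout_interval()

    def timeout_interval(self):
        # TimeoutInterval = EstimatedRTT + 4 * DevRTT
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)
for sample in (120.0, 95.0, 110.0):
    print(round(est.update(sample), 2))
```

Samples from retransmitted segments should not be passed to `update`, matching the "ignore retransmissions" rule for SampleRTT.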

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks

let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control

TCP sender events:

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments

TCP sender (simplified)

initially: NextSeqNum = InitialSeqNum; SendBase = InitialSeqNum; then wait for event

event: data received from application above
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., "send")
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
      start timer

event: timeout
  retransmit not-yet-acked segment with smallest seq. #
  start timer

event: ACK received, with ACK field value y
  if (y > SendBase) {
      SendBase = y
      /* SendBase-1: last cumulatively ACKed byte */
      if (there are currently not-yet-acked segments)
          start timer
      else stop timer
  }
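The three events of the simplified sender can be written out as a short Python class. This is a sketch under assumptions, not TCP itself: the class name `SimpleTcpSender`, the `send(seq, data)` callback standing in for "pass segment to IP", and the boolean `timer_running` in place of a real timer are hypothetical, and duplicate ACKs, flow control and congestion control are ignored as on the slide.

```python
class SimpleTcpSender:
    """Simplified TCP sender: byte-stream seq numbers, one retransmission timer."""

    def __init__(self, initial_seq, send):
        self.next_seq = initial_seq     # NextSeqNum
        self.send_base = initial_seq    # SendBase
        self.send = send                # hypothetical "pass segment to IP"
        self.unacked = {}               # seq of first byte -> segment data
        self.timer_running = False

    def on_app_data(self, data):
        seq = self.next_seq
        self.unacked[seq] = data
        self.send(seq, data)
        self.next_seq += len(data)
        if not self.timer_running:
            self.timer_running = True   # start timer

    def on_timeout(self):
        if not self.unacked:
            return
        seq = min(self.unacked)         # not-yet-acked segment with smallest seq
        self.send(seq, self.unacked[seq])
        self.timer_running = True       # restart timer

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y          # bytes up to y-1 are cumulatively ACKed
            for seq in [s for s in self.unacked
                        if s + len(self.unacked[s]) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)


wire = []
tcp = SimpleTcpSender(92, lambda seq, data: wire.append((seq, len(data))))
tcp.on_app_data(b"x" * 8)      # Seq=92, 8 bytes
tcp.on_app_data(b"y" * 20)     # Seq=100, 20 bytes
tcp.on_timeout()               # resends Seq=92 only
tcp.on_ack(120)                # cumulative ACK covers both segments
```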

TCP: retransmission scenarios

[Figure: three Host A / Host B timelines]

lost ACK scenario
  A: Seq=92, 8 bytes of data
  B: ACK=100 (lost, X)
  A: timeout; resends Seq=92, 8 bytes of data
  B: ACK=100

premature timeout
  A: Seq=92, 8 bytes of data (SendBase=92)
  A: Seq=100, 20 bytes of data
  B: ACK=100
  B: ACK=120
  A: timeout before the ACKs arrive; resends Seq=92, 8 bytes of data
  A: ACKs arrive (SendBase=100, then SendBase=120)
  B: ACK=120 (SendBase stays at 120)

cumulative ACK
  A: Seq=92, 8 bytes of data
  A: Seq=100, 20 bytes of data
  B: ACK=100 (lost, X)
  B: ACK=120 (arrives before timeout)
  A: Seq=120, 15 bytes of data

TCP ACK generation [RFC 5681]

event at receiver: arrival of in-order segment with expected seq #. All data up to expected seq # already ACKed
TCP receiver action: delayed ACK. Wait up to 500ms for next segment. If no next segment, send ACK

event at receiver: arrival of in-order segment with expected seq #. One other segment has ACK pending
TCP receiver action: immediately send single cumulative ACK, ACKing both in-order segments

event at receiver: arrival of out-of-order segment, higher-than-expect seq. #. Gap detected
TCP receiver action: immediately send duplicate ACK, indicating seq. # of next expected byte

event at receiver: arrival of segment that partially or completely fills gap
TCP receiver action: immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout

[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost (X); Host B replies ACK=100 four times (the original plus 3 DUP ACKs); Host A resends Seq=100, 20 bytes of data before the timeout expires]
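Counting duplicate ACKs takes only a few lines. The sketch below is illustrative: the class name `FastRetransmitDetector` is an assumption, and it treats an ACK equal to the current SendBase as a duplicate, so the trigger fires on the third duplicate after the original ACK, as in the figure.

```python
class FastRetransmitDetector:
    """Tracks SendBase and signals fast retransmit on the 3rd duplicate ACK."""

    def __init__(self, send_base):
        self.send_base = send_base
        self.dup_acks = 0

    def on_ack(self, y):
        """Return True if the segment starting at send_base should be resent now."""
        if y > self.send_base:          # new data ACKed: reset the counter
            self.send_base = y
            self.dup_acks = 0
            return False
        if y == self.send_base:         # duplicate ACK for already ACKed data
            self.dup_acks += 1
            return self.dup_acks == 3
        return False                    # stale ACK: ignore


detector = FastRetransmitDetector(send_base=92)
decisions = [detector.on_ack(100) for _ in range(4)]
print(decisions)
```

In the figure's scenario the first ACK=100 advances SendBase and the next three are duplicates, so only the fourth call signals a retransmission of the segment with Seq=100.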

TCP flow control

[Figure: receiver protocol stack: IP code and TCP code in the OS, TCP socket receiver buffers, application process above; segments arrive from sender]

• application may remove data from TCP socket buffers ... slower than TCP receiver is delivering (sender is sending)
• flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

TCP flow control

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering: TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); data leaves to application process]

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

[Figure: client and server, each with application and network layers, each holding:
  connection state: ESTAB
  connection variables: seq # client-to-server, server-to-client; rcvBuffer size at server, client]

client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

client state: CLOSED; server state: LISTEN

1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client enters SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server enters SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client enters ESTAB
4. server: received ACK(y) indicates client is live; server enters ESTAB

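TCP 3-way handshake: FSM

[FSM: states closed, listen, SYN rcvd, SYN sent, ESTAB; Λ means "no action"]

• closed to listen (server): Socket connectionSocket = welcomeSocket.accept();
• listen to SYN rcvd: on SYN(x), send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
• SYN rcvd to ESTAB: on ACK(ACKnum=y+1), Λ
• closed to SYN sent (client): Socket clientSocket = new Socket("hostname", "port number"); send SYN(seq=x)
• SYN sent to ESTAB: on SYNACK(seq=y, ACKnum=x+1), send ACK(ACKnum=y+1)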

TCP: closing a connection

• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

[Figure: closing sequence; client state and server state both start in ESTAB]
1. client: clientSocket.close(); sends FINbit=1, seq=x; enters FIN_WAIT_1 (can no longer send but can receive data)
2. server: sends ACKbit=1; ACKnum=x+1; enters CLOSE_WAIT (can still send data)
3. client: enters FIN_WAIT_2 (wait for server close)
4. server: sends FINbit=1, seq=y; enters LAST_ACK (can no longer send data)
5. client: sends ACKbit=1; ACKnum=y+1; enters TIMED_WAIT (timed wait for 2*max segment lifetime), then CLOSED
6. server: on receiving the ACK, enters CLOSED

The "Two Army Problem"

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity

[Figure: Host A and Host B send original data at rate λ_in through a router with unlimited shared output link buffers; throughput: λ_out]

[Figure: two plots against λ_in, with R/2 marked: λ_out (up to R/2) and delay]

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λ_in = λ_out
  - transport-layer input includes retransmissions: λ'_in ≥ λ_in

[Figure: Host A and Host B share a router with finite shared output link buffers; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out]

idealization: perfect knowledge
• sender sends only when router buffers available

[Figure: λ_out versus λ_in, both axes up to R/2; a copy is sent only when there is free buffer space]

Idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
• when sending at R/2, some packets are retransmissions but asymptotic goodput is still R/2 (why?)

[Figure: λ_out versus λ'_in, both axes up to R/2; a copy is dropped when there is no buffer space and resent when there is free buffer space]

Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered!

[Figure: λ_out versus λ'_in, both axes up to R/2; a timeout triggers a second copy while there is free buffer space]

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

[Figure: Hosts A, B, C and D connected through routers with finite shared output link buffers; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out]

[Figure: throughput of blue traffic (λ_out) versus offered load by Host A (λ'_in), axes up to R/2; bandwidth wastage for packets dropped at the 2nd router]

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss

[Figure: cwnd (TCP sender congestion window size) versus time; AIMD saw tooth behavior, probing for bandwidth: additively increase window size ... until loss occurs (then cut window in half)]

TCP Congestion Control: details

[Figure: sender sequence number space: last byte ACKed | sent, not-yet ACKed ("in-flight") | last byte sent; cwnd spans the in-flight region]

• sender limits transmission:

$$\text{LastByteSent} - \text{LastByteAcked} \le \text{cwnd}$$

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

$$\text{rate} \approx \frac{\text{cwnd}}{RTT}\ \text{bytes/sec}$$

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A sends one segment to Host B, then two segments one RTT later, then four segments, over time]

TCP: detecting, reacting to loss

• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

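Summary: TCP Congestion Control

[FSM: three states; Λ means "no action"]

initially: cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; start in slow start

state: slow start
  event: new ACK
  action: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
  event: duplicate ACK
  action: dupACKcount++
  event: cwnd > ssthresh
  action: Λ
  next: congestion avoidance
  event: dupACKcount == 3
  action: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment
  next: fast recovery
  event: timeout
  action: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
  next: slow start

state: congestion avoidance
  event: new ACK
  action: cwnd = cwnd + MSS * (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
  event: duplicate ACK
  action: dupACKcount++
  event: dupACKcount == 3
  action: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment
  next: fast recovery
  event: timeout
  action: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
  next: slow start

state: fast recovery
  event: duplicate ACK
  action: cwnd = cwnd + MSS; transmit new segment(s), as allowed
  event: New ACK
  action: cwnd = ssthresh; dupACKcount = 0
  next: congestion avoidance
  event: timeout
  action: ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment
  next: slow start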
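The congestion-control FSM can be traced with a small Python model of cwnd. The sketch below makes several labeled assumptions: the class name `RenoCwnd` and the default MSS of 1460 bytes are hypothetical, cwnd is kept in bytes as a float, the "+ 3" in the FSM is read as 3 MSS, and retransmission itself is left out so that only the window arithmetic is shown.

```python
class RenoCwnd:
    """Window arithmetic of the slow start / congestion avoidance / fast recovery FSM."""

    def __init__(self, mss=1460, ssthresh=64 * 1024):
        self.mss = mss                      # assumption: bytes per segment
        self.cwnd = float(mss)              # cwnd = 1 MSS
        self.ssthresh = float(ssthresh)     # ssthresh = 64 KB
        self.dup_acks = 0
        self.state = "slow_start"

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += self.mss           # doubles cwnd every RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += self.mss * (self.mss / self.cwnd)   # about 1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += self.mss
            return
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss   # assumption: "+ 3" means 3 MSS
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = float(self.mss)
        self.dup_acks = 0
        self.state = "slow_start"


cc = RenoCwnd(mss=1000, ssthresh=4000)
for _ in range(4):
    cc.on_new_ack()                 # slow start until cwnd exceeds ssthresh
for _ in range(3):
    cc.on_dup_ack()                 # third duplicate ACK enters fast recovery
print(cc.state, cc.cwnd, cc.ssthresh)
```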

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT

$$\text{avg TCP throughput} = \frac{3}{4}\,\frac{W}{RTT}\ \text{bytes/sec}$$

[Figure: sawtooth window oscillating between W/2 and W]

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

  TCP throughput = 1.22 · MSS / (RTT · sqrt(L))

  - to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate
• new versions of TCP for high-speed
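Both numbers on this slide can be reproduced from the stated example values. A short sketch, assuming the [Mathis 1997] formula as written above (function names are hypothetical):

```python
def required_window_segments(rate_bps, rtt_s, mss_bytes):
    """Segments that must be in flight to fill rate_bps over one RTT."""
    return rate_bps * rtt_s / (mss_bytes * 8)


def required_loss_rate(rate_bps, rtt_s, mss_bytes):
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L."""
    return (1.22 * mss_bytes * 8 / (rtt_s * rate_bps)) ** 2


w = required_window_segments(10e9, 0.100, 1500)   # about 83,333 segments
loss = required_loss_rate(10e9, 0.100, 1500)      # about 2e-10
```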

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput plotted against Connection 1 throughput, both axes from 0 to R. The "equal bandwidth share" line is X2 = Y2; the "full bandwidth utilization" line is X1 + Y1 = R. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2".]
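The convergence argument can be checked with a toy simulation. The sketch below assumes two synchronized flows that both add one unit per round and both halve whenever their sum exceeds the link capacity; the function name `aimd` and the parameter values are illustrative only.

```python
def aimd(x1, x2, capacity, rounds=200, step=1.0):
    """Two synchronized AIMD flows sharing one bottleneck link."""
    for _ in range(rounds):
        if x1 + x2 > capacity:
            # Loss: multiplicative decrease halves the gap between the flows.
            x1, x2 = x1 / 2, x2 / 2
        else:
            # No loss: additive increase leaves the gap unchanged.
            x1, x2 = x1 + step, x2 + step
    return x1, x2


a, b = aimd(80.0, 10.0, capacity=100.0)
```

Each loss event halves the difference between the two rates while additive increase preserves it, so an initial gap of 70 shrinks to well under 1 after a few cycles.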

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2
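The parallel-connection arithmetic is a simple ratio if every connection is assumed to receive an equal share of R. A sketch with a hypothetical helper name:

```python
from fractions import Fraction


def app_share(existing_connections, new_connections):
    """Fraction of link rate R obtained by an app opening new_connections."""
    total = existing_connections + new_connections
    return Fraction(new_connections, total)


one = app_share(9, 1)      # 1/10 of R
eleven = app_share(9, 11)  # 11/20 of R, which the slide rounds to R/2
```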

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram is shown with ECN=00 and ECN=11; the TCP ACK segment returning to the source carries ECE=1.]


Principles of reliable data transfer

• important in application, transport, link layers
  - top-10 list of important networking topics
• characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)

Reliable data transfer: getting started

send side:
• rdt_send(): called from above (e.g., by app); passed data to deliver to receiver upper layer
• udt_send(): called by rdt, to transfer packet over unreliable channel to receiver

receive side:
• rdt_rcv(): called when packet arrives on rcv-side of channel
• deliver_data(): called by rdt to deliver data to upper layer

Reliable data transfer: getting started

We'll:
• incrementally develop sender, receiver sides of reliable data transfer protocol (rdt)
• consider only unidirectional data transfer
  - but control info will flow in both directions
• use finite state machines (FSMs) to specify sender, receiver

FSM notation: an arrow from state 1 to state 2 is labelled with the event causing the state transition and the actions taken on the state transition (written "event / actions").
state: when in this "state", next state uniquely determined by next event

rdt1.0: reliable transfer over a reliable channel

• underlying channel perfectly reliable
  - no bit errors
  - no loss of packets
• separate FSMs for sender, receiver:
  - sender sends data into underlying channel
  - receiver reads data from underlying channel

sender (single state "Wait for call from above")
  rdt_send(data) / packet = make_pkt(data); udt_send(packet)

receiver (single state "Wait for call from below")
  rdt_rcv(packet) / extract(packet, data); deliver_data(data)

rdt2.0: channel with bit errors

• underlying channel may flip bits in packet
  - checksum to detect bit errors
• the question: how to recover from errors?
  - acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
  - negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
  - sender retransmits pkt on receipt of NAK
• new mechanisms in rdt2.0 (beyond rdt1.0):
  - error detection
  - receiver feedback: control msgs (ACK, NAK) rcvr -> sender

How do humans recover from "errors" during conversation?


rdt2.0: FSM specification

sender
  state "Wait for call from above"
    rdt_send(data) / sndpkt = make_pkt(data, checksum); udt_send(sndpkt) -> "Wait for ACK or NAK"
  state "Wait for ACK or NAK"
    rdt_rcv(rcvpkt) && isNAK(rcvpkt) / udt_send(sndpkt)
    rdt_rcv(rcvpkt) && isACK(rcvpkt) / Λ -> "Wait for call from above"

receiver
  state "Wait for call from below"
    rdt_rcv(rcvpkt) && corrupt(rcvpkt) / udt_send(NAK)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) / extract(rcvpkt, data); deliver_data(data); udt_send(ACK)

(Λ denotes "no action".)

rdt2.0: operation with no errors

[Figure: the rdt2.0 FSM with the error-free path highlighted. The sender sends sndpkt and waits; the receiver finds the packet not corrupt, delivers the data and sends ACK; the sender receives the ACK and returns to "Wait for call from above".]

rdt2.0: error scenario

[Figure: the rdt2.0 FSM with the error path highlighted. The receiver finds the packet corrupt and sends NAK; the sender receives the NAK and retransmits sndpkt.]

rdt2.0 has a fatal flaw

what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver
• can't just retransmit: possible duplicate

handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt

stop and wait: sender sends one packet, then waits for receiver response

rdt2.1: sender, handles garbled ACK/NAKs

  state "Wait for call 0 from above"
    rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt) -> "Wait for ACK or NAK 0"
  state "Wait for ACK or NAK 0"
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ -> "Wait for call 1 from above"
  state "Wait for call 1 from above"
    rdt_send(data) / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt) -> "Wait for ACK or NAK 1"
  state "Wait for ACK or NAK 1"
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ -> "Wait for call 0 from above"

rdt2.1: receiver, handles garbled ACK/NAKs

  state "Wait for 0 from below"
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) -> "Wait for 1 from below"
    rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
  state "Wait for 1 from below"
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) -> "Wait for 0 from below"
    rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)

rdt2.1: Example 1

[Figure sequence: the rdt2.1 sender and receiver FSMs, with one transition highlighted per slide.]
1. sender, "Wait for call 0 from above": rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
2. receiver, "Wait for 0 from below": rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender, "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
4. receiver, "Wait for 0 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
5. sender, "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ

rdt2.1: Example 2

[Figure sequence: the rdt2.1 sender and receiver FSMs, with one transition highlighted per slide.]
1. receiver, "Wait for 0 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
2. sender, "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
3. receiver, "Wait for 1 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
4. sender, "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ

rdt2.1: discussion

sender:
• seq # added to pkt
• two seq. #'s (0, 1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  - state must "remember" whether "expected" pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  - state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender
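The receiver side of this discussion, a one-bit expected sequence number used to spot duplicates, can be sketched as follows. The class `Rdt21Receiver` and the `corrupt` flag (standing in for a failed checksum) are assumptions made for illustration.

```python
class Rdt21Receiver:
    """Alternating-bit receiver: delivers each packet once, ACKs duplicates."""

    def __init__(self):
        self.expected = 0    # sequence number of the next new packet
        self.delivered = []  # data handed to the upper layer

    def rdt_rcv(self, seq, data, corrupt=False):
        if corrupt:
            return "NAK"
        if seq == self.expected:
            self.delivered.append(data)
            self.expected ^= 1  # flip between 0 and 1
        # A duplicate (seq != expected) is ACKed again but not delivered.
        return "ACK"
```

A retransmitted packet 0 arriving while the receiver waits for packet 1 is acknowledged again, but the data is not delivered a second time.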

rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  - receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt

rdt2.2: sender, receiver fragments

sender FSM fragment
  state "Wait for call 0 from above"
    rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt) -> "Wait for ACK 0"
  state "Wait for ACK 0"
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)) / udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0) / Λ

receiver FSM fragment
  transition into "Wait for 0 from below"
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
  state "Wait for 0 from below"
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)) / udt_send(sndpkt)

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
  - checksum, seq. #, ACKs, retransmissions will be of help ... but not enough

approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  - retransmission will be duplicate, but seq. #'s already handle this
  - receiver must specify seq # of pkt being ACKed
• requires countdown timer
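The timeout-and-retransmit approach can be exercised with a deterministic toy channel. In this sketch the function `rdt30_transfer` is hypothetical, losses are scripted by transmission index rather than random, and a timeout is modelled simply as "no ACK came back, so send again".

```python
def rdt30_transfer(messages, drop_data=frozenset(), drop_ack=frozenset()):
    """Stop-and-wait transfer; returns (delivered data, total transmissions).

    drop_data / drop_ack hold 0-based transmission indices whose data
    packet or returning ACK the channel loses.
    """
    delivered = []
    expected = 0       # receiver: next sequence number it will deliver
    seq = 0            # sender: sequence number of the packet in flight
    transmissions = 0
    for data in messages:
        while True:
            attempt = transmissions
            transmissions += 1
            if attempt in drop_data:
                continue          # data lost: timer expires, resend
            if seq == expected:   # new packet: deliver it exactly once
                delivered.append(data)
                expected ^= 1
            # Duplicate or not, the receiver ACKs the sequence number it saw.
            if attempt in drop_ack:
                continue          # ACK lost: timer expires, resend
            break                 # ACK arrived: stop timer
        seq ^= 1
    return delivered, transmissions


result = rdt30_transfer(["a", "b", "c"], drop_data={1}, drop_ack={3})
```

With one lost data packet and one lost ACK, three messages take five transmissions and each is still delivered exactly once.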

rdt3.0 sender

  state "Wait for call 0 from above"
    rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer -> "Wait for ACK0"
    rdt_rcv(rcvpkt) / Λ
  state "Wait for ACK0"
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)) / Λ
    timeout / udt_send(sndpkt); start_timer
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0) / stop_timer -> "Wait for call 1 from above"
  state "Wait for call 1 from above"
    rdt_send(data) / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer -> "Wait for ACK1"
    rdt_rcv(rcvpkt) / Λ
  state "Wait for ACK1"
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0)) / Λ
    timeout / udt_send(sndpkt); start_timer
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1) / stop_timer -> "Wait for call 0 from above"

rdt3.0 in action

(a) no loss
  1. sender: send pkt0. receiver: rcv pkt0, send ack0
  2. sender: rcv ack0, send pkt1. receiver: rcv pkt1, send ack1
  3. sender: rcv ack1, send pkt0. receiver: rcv pkt0, send ack0

(b) packet loss
  1. sender: send pkt0. receiver: rcv pkt0, send ack0
  2. sender: rcv ack0, send pkt1. pkt1 is lost (X)
  3. sender: timeout, resend pkt1. receiver: rcv pkt1, send ack1
  4. sender: rcv ack1, send pkt0. receiver: rcv pkt0, send ack0

(c) ACK loss
  1. sender: send pkt0. receiver: rcv pkt0, send ack0
  2. sender: rcv ack0, send pkt1. receiver: rcv pkt1, send ack1. ack1 is lost (X)
  3. sender: timeout, resend pkt1. receiver: rcv pkt1 (detect duplicate), send ack1
  4. sender: rcv ack1, send pkt0. receiver: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
  1. sender: send pkt0. receiver: rcv pkt0, send ack0
  2. sender: rcv ack0, send pkt1. receiver: rcv pkt1, send ack1 (delayed)
  3. sender: timeout, resend pkt1. receiver: rcv pkt1 (detect duplicate), send ack1
  4. sender: rcv ack1, send pkt0. receiver: rcv pkt0, send ack0
  5. sender: rcv ack1 (second copy), do nothing
  6. sender: rcv ack0, send pkt1

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

  D_trans = L / R = 8000 bits / 10^9 bits/sec = 8 microsecs

• U_sender: utilization, fraction of time sender busy sending

  U_sender = (L / R) / (RTT + L / R) = 0.008 / 30.008 = 0.00027

• if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources

rdt3.0: stop-and-wait operation

[Figure: sender/receiver timeline]
• t = 0: first packet bit transmitted
• t = L / R: last packet bit transmitted
• first packet bit arrives; last packet bit arrives, send ACK
• t = RTT + L / R: ACK arrives, send next packet

U_sender = (L / R) / (RTT + L / R) = 0.008 / 30.008 = 0.00027

Pipelined protocols

pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver

• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

[Figure: sender/receiver timeline]
• t = 0: first packet bit transmitted
• t = L / R: last bit transmitted
• first packet bit arrives; last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• t = RTT + L / R: ACK arrives, send next packet

3-packet pipelining increases utilization by a factor of 3:

U_sender = (3L / R) / (RTT + L / R) = 0.024 / 30.008 ≈ 0.0008
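The utilization figures for stop-and-wait and for 3-packet pipelining come from the same formula. A small sketch using the example values from the slides (1 Gbps link, 30 ms RTT, 8000-bit packets); the function name and the `in_flight` parameter are hypothetical:

```python
def sender_utilization(packet_bits, rate_bps, rtt_s, in_flight=1):
    """Fraction of time the sender is busy with in_flight packets per RTT."""
    d_trans = packet_bits / rate_bps  # L / R, time to push one packet out
    return min(1.0, in_flight * d_trans / (rtt_s + d_trans))


u_stop_and_wait = sender_utilization(8000, 1e9, 0.030)            # about 0.00027
u_pipelined = sender_utilization(8000, 1e9, 0.030, in_flight=3)   # about 0.0008
```

The `min(1.0, ...)` cap reflects that the link cannot be more than fully used once enough packets are in flight.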

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: "cumulative ACK"
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

initially: base = 1; nextseqnum = 1

single state "Wait":

  rdt_send(data) /
    if (nextseqnum < base+N) {
      sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
      udt_send(sndpkt[nextseqnum])
      if (base == nextseqnum)
        start_timer
      nextseqnum++
    }
    else
      refuse_data(data)

  timeout /
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    ...
    udt_send(sndpkt[nextseqnum-1])

  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) /
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
      stop_timer
    else
      start_timer

  rdt_rcv(rcvpkt) && corrupt(rcvpkt) / Λ
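The sender FSM above maps almost line for line onto a small class. This is a sketch under simplifying assumptions: `GBNSender` is a hypothetical name, sequence numbers are unbounded integers rather than k-bit values, the timer is reduced to a boolean, and `sent` merely records what would go on the wire.

```python
class GBNSender:
    def __init__(self, window):
        self.N = window
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}            # seq -> data kept for retransmission
        self.timer_running = False
        self.sent = []              # log of sequence numbers put on the wire

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False            # window full: refuse_data
        self.sndpkt[self.nextseqnum] = data
        self.sent.append(self.nextseqnum)
        if self.base == self.nextseqnum:
            self.timer_running = True
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        # max() is a defensive addition that is not part of the slide FSM.
        self.base = max(self.base, acknum + 1)
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.sent.append(seq)   # resend every unacked packet
```

After a cumulative ACK for packet 2, a timeout resends everything from `base` onward, which is the "go back N" behaviour.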


GBN: receiver extended FSM

initially: expectedseqnum = 1; sndpkt = make_pkt(0, ACK, chksum)

single state "Wait":

  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum) /
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++

  default / udt_send(sndpkt)

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering
  - re-ACK pkt with highest in-order seq #
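The receiver side needs even less state. In this sketch the hypothetical class `GBNReceiver` returns the ACK number it would send; the `corrupt` flag again stands in for a failed checksum.

```python
class GBNReceiver:
    def __init__(self):
        self.expectedseqnum = 1
        self.last_ack = 0     # mirrors the initial sndpkt = make_pkt(0, ACK, chksum)
        self.delivered = []

    def rdt_rcv(self, seq, data, corrupt=False):
        if not corrupt and seq == self.expectedseqnum:
            self.delivered.append(data)
            self.last_ack = seq
            self.expectedseqnum += 1
        # Default action: re-send the ACK for the highest in-order packet.
        return self.last_ack
```

An out-of-order packet is discarded and simply triggers a repeat of the previous ACK.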


GBN in action

sender window (N=4), sequence numbers 0 1 2 3 4 5 6 7 8

1. sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, then (wait)
2. receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, discard, (re)send ack1
3. sender: rcv ack0, send pkt4; rcv ack1, send pkt5; ignore duplicate ACK
4. receiver: receive pkt4, discard, (re)send ack1; receive pkt5, discard, (re)send ack1
5. sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
6. receiver: rcv pkt2, deliver, send ack2; rcv pkt3, deliver, send ack3; rcv pkt4, deliver, send ack4; rcv pkt5, deliver, send ack5


Selective repeat

• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #s of sent, unACKed packets

Selective repeat: sender, receiver windows

[Figure: sender and receiver views of the sequence number space]

Selective repeat

sender

data from above:
• if next available seq # in window, send pkt

timeout(n):
• resend pkt n, restart timer

ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver

pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt

pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)

otherwise:
• ignore
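The receiver rules above can be sketched as a small class. Assumptions: `SRReceiver` is a hypothetical name, sequence numbers are unbounded integers (no wrap-around), and the return value is the ACK number sent, or `None` when the packet is ignored.

```python
class SRReceiver:
    def __init__(self, window):
        self.N = window
        self.rcv_base = 0
        self.buffer = {}      # out-of-order packets awaiting delivery
        self.delivered = []

    def on_packet(self, n, data):
        if self.rcv_base <= n < self.rcv_base + self.N:
            self.buffer[n] = data
            # Deliver any in-order run now available and slide the window.
            while self.rcv_base in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcv_base))
                self.rcv_base += 1
            return n          # ACK(n)
        if self.rcv_base - self.N <= n < self.rcv_base:
            return n          # already delivered: re-ACK so the sender can advance
        return None           # otherwise: ignore
```

Replaying a run in which pkt2 arrives last shows packets 3 to 5 being buffered and then delivered together once pkt2 fills the gap.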

Selective repeat in action

sender window (N=4), sequence numbers 0 1 2 3 4 5 6 7 8

1. sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, then (wait)
2. receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, buffer, send ack3
3. sender: rcv ack0, send pkt4; rcv ack1, send pkt5; record ack3 arrived
4. receiver: receive pkt4, buffer, send ack4; receive pkt5, buffer, send ack5
5. sender: pkt 2 timeout, send pkt2; record ack4 arrived; record ack5 arrived
6. receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

(a) no problem
  1. sender sends pkt0, pkt1, pkt2; receiver gets all three and its window advances
  2. ACKs reach the sender, whose window advances; it sends pkt3, which is lost (X), then a new pkt0
  3. receiver will accept packet with seq number 0

(b) oops
  1. sender sends pkt0, pkt1, pkt2; receiver gets all three and its window advances
  2. all three ACKs are lost (X X X)
  3. sender: timeout, retransmit pkt0
  4. receiver will accept packet with seq number 0

receiver can't see sender side; receiver behavior identical in both cases; something's (very) wrong
• receiver sees no difference in two scenarios
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
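One way to explore the question is to check whether the receiver's window, after it has advanced by a full window, still contains sequence numbers the sender might retransmit. The sketch below uses a hypothetical helper; the standard answer, not stated on the slide, is that the window may be at most half the sequence number space.

```python
def windows_overlap(seq_space, window):
    """True if old (possibly retransmitted) and new seq #'s can collide."""
    old = {i % seq_space for i in range(0, window)}           # sender may resend these
    new = {i % seq_space for i in range(window, 2 * window)}  # receiver now expects these
    return bool(old & new)


problem = windows_overlap(seq_space=4, window=3)  # the slide's example
safe = windows_overlap(seq_space=4, window=2)
```

With four sequence numbers, a window of 3 collides (seq 0 and 1 appear in both sets) while a window of 2 does not.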

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver

TCP segment structure

Header layout (each row is 32 bits wide):
  source port # | dest port #
  sequence number
  acknowledgement number
  head len | not used | U A P R S F | receive window
  checksum | Urg data pointer
  options (variable length)
  application data (variable length)

Notes:
• sequence number, acknowledgement number: counting by bytes of data (not segments)
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)

TCP seq. numbers, ACKs

sequence numbers:
  - byte stream "number" of first byte in segment's data
acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK
Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say, up to implementor

[Figure: an outgoing segment from sender and an incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer. Below, the sender sequence number space with window size N, divided into: sent, ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable.]

Byte stream in TCP

[Figure: an HTTP Get Message (K bytes) shown as a byte stream. A window of N bytes is marked, starting at the 100th byte; M bytes are carried behind a TCP header (seq. no. = 100); the remainder of the message cannot be transmitted now.]

TCP seq. numbers, ACKs

simple telnet scenario (Host A, Host B):
1. User types 'C'. Host A -> Host B: Seq=42, ACK=79, data = 'C'
2. Host B ACKs receipt of 'C', echoes back 'C'. Host B -> Host A: Seq=79, ACK=43, data = 'C'
3. Host A ACKs receipt of echoed 'C'. Host A -> Host B: Seq=43, ACK=80

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT

EstimatedRTT = (1 - α) · EstimatedRTT + α · SampleRTT

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr. x-axis: time (seconds), 1 to 106; y-axis: RTT (milliseconds), 100 to 350; two curves: SampleRTT and EstimatedRTT.]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

  DevRTT = (1 - β) · DevRTT + β · |SampleRTT - EstimatedRTT|
  (typically, β = 0.25)

  TimeoutInterval = EstimatedRTT + 4 · DevRTT
  (EstimatedRTT is the estimated RTT; 4 · DevRTT is the "safety margin")
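The two moving averages and the timeout formula fit in a few lines. In this sketch the class name `RttEstimator` is hypothetical; two details are assumptions not given on the slides: DevRTT is seeded with half the first sample (the initialization used by RFC 6298), and DevRTT is updated before EstimatedRTT so that it uses the previous estimate.

```python
class RttEstimator:
    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2   # assumption: RFC 6298 style seed

    def update(self, sample_rtt):
        # Assumption: deviation is measured against the previous estimate.
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    @property
    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)   # milliseconds
est.update(140.0)
```

One 140 ms sample after a 100 ms start moves EstimatedRTT to 105 ms and DevRTT to 47.5 ms, giving a timeout of 295 ms.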

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks

let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control

TCP sender events:

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments

TCP sender (simplified)

initially: NextSeqNum = InitialSeqNum; SendBase = InitialSeqNum

single state "wait for event":

  data received from application above /
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
      start timer

  timeout /
    retransmit not-yet-acked segment with smallest seq. #
    start timer

  ACK received, with ACK field value y /
    if (y > SendBase) {
      SendBase = y
      /* SendBase-1: last cumulatively ACKed byte */
      if (there are currently not-yet-acked segments)
        start timer
      else stop timer
    }
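The simplified sender's three events can be sketched directly. `SimpleTcpSender` is a hypothetical name; the timer is reduced to a boolean and `wire` records the (seq, data) pairs that would be passed to IP.

```python
class SimpleTcpSender:
    def __init__(self, initial_seq):
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.unacked = {}           # seq of first byte -> segment data
        self.timer_running = False
        self.wire = []              # segments handed to IP, in order

    def on_app_data(self, data):
        self.unacked[self.next_seq] = data
        self.wire.append((self.next_seq, data))
        self.next_seq += len(data)
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if self.unacked:
            seq = min(self.unacked)  # not-yet-acked segment with smallest seq #
            self.wire.append((seq, self.unacked[seq]))
            self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y       # send_base - 1: last cumulatively ACKed byte
            fully_acked = [s for s, d in self.unacked.items() if s + len(d) <= y]
            for seq in fully_acked:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)
```

Sending 8 bytes at seq 92 and 20 bytes at seq 100, a timeout resends only the segment starting at 92, and a cumulative ACK of 120 clears both segments and stops the timer.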

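TCP: retransmission scenarios

lost ACK scenario (Host A, Host B):
1. A sends Seq=92, 8 bytes of data
2. B replies ACK=100; the ACK is lost (X)
3. A: timeout; A resends Seq=92, 8 bytes of data
4. B replies ACK=100

premature timeout (Host A, Host B):
1. A sends Seq=92, 8 bytes of data (SendBase=92), then Seq=100, 20 bytes of data
2. B replies ACK=100 and ACK=120
3. A: timeout before the ACKs arrive; A resends Seq=92, 8 bytes of data
4. the ACKs arrive at A: SendBase=100, then SendBase=120
5. B replies to the duplicate with ACK=120 (SendBase=120)

cumulative ACK (Host A, Host B):
1. A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
2. B replies ACK=100, which is lost (X), and ACK=120, which arrives before the timeout
3. A sends Seq=120, 15 bytes of data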

TCP ACK generation [RFC 5681]

1. event at receiver: arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
   TCP receiver action: delayed ACK; wait up to 500 ms for next segment; if no next segment, send ACK

2. event at receiver: arrival of in-order segment with expected seq #; one other segment has ACK pending
   TCP receiver action: immediately send single cumulative ACK, ACKing both in-order segments

3. event at receiver: arrival of out-of-order segment with higher-than-expected seq #; gap detected
   TCP receiver action: immediately send duplicate ACK, indicating seq # of next expected byte

4. event at receiver: arrival of segment that partially or completely fills gap
   TCP receiver action: immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout

fast retransmit after sender receipt of triple duplicate ACK (Host A, Host B):
1. A sends Seq=92, 8 bytes of data; B replies ACK=100
2. A sends Seq=100, 20 bytes of data; the segment is lost (X)
3. B sends three more ACK=100 (3 DUP ACKs)
4. A resends Seq=100, 20 bytes of data, before its timeout expires
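The duplicate-ACK counting behind fast retransmit can be sketched separately from the rest of the sender. The class name `FastRetransmit` is hypothetical, and `on_ack` returns the sequence number to resend, or `None` when nothing should be resent.

```python
class FastRetransmit:
    def __init__(self, send_base):
        self.send_base = send_base
        self.dup_count = 0

    def on_ack(self, y):
        if y > self.send_base:
            # New data acknowledged: advance and reset the duplicate counter.
            self.send_base = y
            self.dup_count = 0
            return None
        self.dup_count += 1
        if self.dup_count == 3:
            return self.send_base   # resend unacked segment with smallest seq #
        return None


fr = FastRetransmit(send_base=92)
results = [fr.on_ack(100), fr.on_ack(100), fr.on_ack(100), fr.on_ack(100)]
```

Replaying the scenario above, the first ACK=100 advances SendBase and the third duplicate triggers a resend of the segment starting at byte 100.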

TCP flow control

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

application may remove data from TCP socket buffers ... slower than TCP receiver is delivering (sender is sending)

[Figure: receiver protocol stack. Segments from sender pass up through IP code and TCP code into the TCP socket receiver buffers inside the OS, from which the application process reads.]

TCP flow control

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); data leaves to the application process.]
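The advertised window is the free space left in RcvBuffer, and the sender keeps its in-flight data within it. The variable names below (`last_byte_rcvd`, `last_byte_read`, `last_byte_sent`, `last_byte_acked`) do not appear on the slides; they are assumptions chosen for this sketch.

```python
def rwnd(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free buffer space the receiver advertises to the sender."""
    buffered = last_byte_rcvd - last_byte_read
    return rcv_buffer - buffered


def can_send(last_byte_sent, last_byte_acked, rwnd_value, nbytes):
    """True if nbytes more can be sent without exceeding the advertised window."""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + nbytes <= rwnd_value


window = rwnd(rcv_buffer=4096, last_byte_rcvd=3000, last_byte_read=1000)
```

With 2000 bytes still unread in a 4096-byte buffer, the receiver advertises 2096 bytes, so a sender with 1000 bytes in flight may send 1000 more but not 1200.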

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

each side (application over network) holds:
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client

client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

client state: CLOSED -> SYNSENT -> ESTAB
server state: LISTEN -> SYN RCVD -> ESTAB

1. client: choose init seq num, x; send TCP SYN msg
   segment: SYNbit=1, Seq=x (client enters SYNSENT)
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN
   segment: SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1 (server enters SYN RCVD)
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK; this segment may contain client-to-server data
   segment: ACKbit=1, ACKnum=y+1 (client enters ESTAB)
4. server: received ACK(y) indicates client is live (server enters ESTAB)
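The sequence and acknowledgement numbers exchanged in the handshake can be traced with a short function. This is a sketch only: segments are plain dictionaries, `three_way_handshake` is a hypothetical name, and no real sockets are involved.

```python
def three_way_handshake(x, y):
    """Return (log, client_state, server_state) for initial seq numbers x, y."""
    log = []
    client, server = "CLOSED", "LISTEN"

    syn = {"SYN": 1, "seq": x}
    client = "SYNSENT"
    log.append(("client", client, syn))

    synack = {"SYN": 1, "ACK": 1, "seq": y, "acknum": syn["seq"] + 1}
    server = "SYN RCVD"
    log.append(("server", server, synack))

    ack = {"ACK": 1, "acknum": synack["seq"] + 1}
    client = "ESTAB"
    log.append(("client", client, ack))

    if ack["acknum"] == y + 1:
        server = "ESTAB"   # the ACK shows the server that the client is live
    return log, client, server


log, client_state, server_state = three_way_handshake(100, 300)
```

For x = 100 and y = 300 the SYNACK carries ACKnum 101 and the final ACK carries ACKnum 301.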

TCP 3-way handshake: FSM

state "closed"
  Socket connectionSocket = welcomeSocket.accept(); / Λ -> "listen"
  Socket clientSocket = new Socket("hostname", "port number"); / SYN(seq=x) -> "SYN sent"
state "listen"
  SYN(x) / SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client -> "SYN rcvd"
state "SYN sent"
  SYNACK(seq=y, ACKnum=x+1) / ACK(ACKnum=y+1) -> "ESTAB"
state "SYN rcvd"
  ACK(ACKnum=y+1) / Λ -> "ESTAB"

TCP: closing a connection

• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

TCP: closing a connection (client state and server state both start in ESTAB)

1. client: clientSocket.close(); sends FINbit=1, seq=x; enters FIN_WAIT_1 (can no longer send but can receive data)
2. server: replies ACKbit=1, ACKnum=x+1; enters CLOSE_WAIT (can still send data); client enters FIN_WAIT_2 (wait for server close)
3. server: sends FINbit=1, seq=y; enters LAST_ACK (can no longer send data)
4. client: replies ACKbit=1, ACKnum=y+1; enters TIMED_WAIT (timed wait for 2 · max segment lifetime), then CLOSED
5. server: on receiving the ACK, enters CLOSED

The "Two Army Problem"

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity

[Figure: Host A and Host B send original data at rate λ_in into unlimited shared output link buffers; throughput is λ_out. Left plot: λ_out versus λ_in, both axes up to R/2. Right plot: delay versus λ_in, up to R/2.]

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λ_in = λ_out
  - transport-layer input includes retransmissions: λ'_in ≥ λ_in

[Figure: Host A and Host B share finite output link buffers. λ_in: original data; λ'_in: original data plus retransmitted data; λ_out: throughput.]

idealization: perfect knowledge
• sender sends only when router buffers available

[Figure: the sender keeps a copy of each packet and transmits when there is free buffer space, holding back when there is no buffer space. Plot: λ_out versus λ_in, both axes up to R/2.]

Causes/costs of congestion: scenario 2

Idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost

[Figure: Host A keeps a copy of each packet and resends it after it is dropped at the router for lack of free buffer space. Plot: λ_out versus λ_in, both axes up to R/2.]

when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)

Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered

[Figure: Host A times out and sends a second copy although there is free buffer space and the original is delivered. Plot: λ_out versus λ_in, both axes up to R/2.]

when sending at R/2, some packets are retransmissions, including duplicates that are delivered

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

[Figure: Hosts A, B, C and D send over multihop paths through routers with finite shared output link buffers. λ_in: original data; λ'_in: original data plus retransmitted data; λ_out: throughput.]

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted

[Figure: throughput by blue traffic (λ_out, up to R/2) versus offered load by Host A (λ'_in, up to R/2); annotation: bandwidth wastage for packets dropped at the 2nd router.]

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at

TCP congestion control: additive increase multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss

[Figure: cwnd (TCP sender congestion window size) versus time. AIMD saw tooth behavior, probing for bandwidth: additively increase window size until loss occurs, then cut window in half.]
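The saw tooth can be reproduced with a small Python sketch. It is a toy model under stated assumptions: cwnd is counted in MSS units, one loop iteration stands for one RTT, and a loss is assumed to occur exactly when cwnd exceeds a fixed `capacity`. The function name `aimd` and both parameters are illustrative, not part of TCP.

```python
def aimd(capacity, rounds, cwnd=1.0):
    """Return the cwnd value (in MSS) seen at the start of each RTT."""
    trace = []
    for _ in range(rounds):
        trace.append(cwnd)
        if cwnd > capacity:      # assumed loss signal
            cwnd = cwnd / 2      # multiplicative decrease
        else:
            cwnd += 1            # additive increase: 1 MSS per RTT
    return trace


if __name__ == "__main__":
    for rtt, w in enumerate(aimd(capacity=8, rounds=12, cwnd=6.0)):
        print(f"RTT {rtt:2d}: cwnd = {w:5.2f} MSS " + "#" * int(w))
```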

TCP Congestion Control: details

• sender limits transmission:

    LastByteSent - LastByteAcked ≤ cwnd

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

    rate ≈ cwnd / RTT bytes/sec

[Figure: sender sequence number space, showing last byte ACKed, bytes sent but not yet ACKed ("in-flight"), and last byte sent; the in-flight span is bounded by cwnd.]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received

• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A sends one segment to Host B, then two segments one RTT later, then four segments, along the time axis.]
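Why "incrementing cwnd for every ACK" doubles the window each RTT can be checked with a short Python sketch. It assumes an idealized, loss-free path in which every segment sent in one RTT is ACKed within that RTT; `slow_start` is a hypothetical helper name.

```python
def slow_start(rtts, mss=1):
    """Return cwnd (in units of mss) at the start of each RTT."""
    cwnd = mss
    trace = [cwnd]
    for _ in range(rtts):
        acks = cwnd // mss        # one ACK per segment sent this RTT
        for _ in range(acks):
            cwnd += mss           # +1 MSS for every ACK received
        trace.append(cwnd)
    return trace


if __name__ == "__main__":
    print(slow_start(4))
```

Each RTT, a window of n segments produces n ACKs, and each ACK adds one MSS, so the window grows from n to 2n.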

TCP: detecting, reacting to loss

• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

(transitions are written as event: actions; Λ means no event or no action)

initially (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; enter slow start

slow start:
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: Λ; go to congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; stay in slow start

congestion avoidance:
• new ACK: cwnd = cwnd + MSS * (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start

fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
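The three-state machine summarized on this slide can be sketched as a small Python class. This is an illustrative model only: it tracks cwnd, ssthresh and dupACKcount, but does not send or retransmit anything, cwnd is kept in MSS units, and the class and method names are assumptions made for this sketch.

```python
class TcpCongestion:
    """Toy model of the slow start / congestion avoidance / fast recovery FSM."""

    def __init__(self, mss=1.0, ssthresh=64.0):
        self.mss = mss
        self.cwnd = mss
        self.ssthresh = ssthresh
        self.dup_ack_count = 0
        self.state = "slow start"

    def new_ack(self):
        if self.state == "slow start":
            self.cwnd += self.mss
            self.dup_ack_count = 0
            if self.cwnd > self.ssthresh:
                self.state = "congestion avoidance"
        elif self.state == "congestion avoidance":
            self.cwnd += self.mss * (self.mss / self.cwnd)
            self.dup_ack_count = 0
        else:  # fast recovery
            self.cwnd = self.ssthresh
            self.dup_ack_count = 0
            self.state = "congestion avoidance"

    def duplicate_ack(self):
        if self.state == "fast recovery":
            self.cwnd += self.mss
            return
        self.dup_ack_count += 1
        if self.dup_ack_count == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss
            self.state = "fast recovery"

    def timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = self.mss
        self.dup_ack_count = 0
        self.state = "slow start"


if __name__ == "__main__":
    tcp = TcpCongestion(mss=1.0, ssthresh=4.0)
    for event in ["new_ack"] * 4 + ["duplicate_ack"] * 3 + ["new_ack", "timeout"]:
        getattr(tcp, event)()
        print(f"{event:14s} -> {tcp.state:21s} cwnd={tcp.cwnd:.2f} ssthresh={tcp.ssthresh:.2f}")
```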

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT

    $$\text{avg TCP throughput} = \frac{3}{4} \cdot \frac{W}{RTT} \text{ bytes/sec}$$

[Figure: window size saw tooth oscillating between W/2 and W.]

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

    $$\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \sqrt{L}}$$

  to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate
• new versions of TCP for high-speed
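The two numbers in this example can be reproduced by plugging the slide's values into the formulas. The Python sketch below assumes the window is simply the bandwidth-delay product in segments and inverts the throughput formula for L; the function names are illustrative.

```python
def required_window_segments(throughput_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain throughput_bps over rtt_s."""
    return throughput_bps * rtt_s / (mss_bytes * 8)


def required_loss_rate(throughput_bps, rtt_s, mss_bytes):
    """Invert throughput = 1.22 * MSS / (RTT * sqrt(L)) for L."""
    mss_bits = mss_bytes * 8
    return (1.22 * mss_bits / (rtt_s * throughput_bps)) ** 2


if __name__ == "__main__":
    W = required_window_segments(10e9, 0.100, 1500)
    L = required_loss_rate(10e9, 0.100, 1500)
    print(f"W = {W:,.0f} segments")
    print(f"L = {L:.2e}")
```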

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput versus Connection 1 throughput, both axes from 0 to R. The full bandwidth utilization line is the set of points (X1, Y1) where X1 + Y1 = R; the equal bandwidth share line is the set of points (X2, Y2) where X2 = Y2. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2".]
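The convergence argument can be checked numerically with a minimal Python sketch. It assumes both sessions see the same RTT, increase by one unit per step, and both halve whenever their combined rate exceeds the capacity R; these synchronized losses are a simplification, and `two_flow_aimd` is a hypothetical name.

```python
def two_flow_aimd(x1, x2, capacity, steps):
    """Evolve two AIMD flows sharing one bottleneck; return final rates."""
    for _ in range(steps):
        if x1 + x2 > capacity:
            x1, x2 = x1 / 2, x2 / 2      # loss: both cut by factor of 2
        else:
            x1, x2 = x1 + 1, x2 + 1      # additive increase, slope 1
    return x1, x2


if __name__ == "__main__":
    a, b = two_flow_aimd(70.0, 10.0, capacity=100.0, steps=1000)
    print(f"flow 1 = {a:.3f}, flow 2 = {b:.3f}, gap = {abs(a - b):.2e}")
```

Additive increase leaves the gap between the two rates unchanged, while every multiplicative decrease halves it, so the operating point drifts toward the equal bandwidth share line.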

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2
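The arithmetic behind the parallel-connection example is a one-liner, sketched here in Python under the assumption that the bottleneck is shared equally per connection. `app_share` is an illustrative name.

```python
def app_share(app_connections, other_connections):
    """Fraction of link rate R an app gets with per-connection fair sharing."""
    return app_connections / (app_connections + other_connections)


if __name__ == "__main__":
    print("1 connection  :", app_share(1, 9))    # R/10
    print("11 connections:", app_share(11, 9))   # 11/20, roughly R/2
```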

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram is shown with ECN=00 and then ECN=11; the TCP ACK segment returned by the destination carries ECE=1.]


Reliable data transfer: getting started

send side:
• rdt_send(): called from above (e.g., by app). Passed data to deliver to receiver upper layer
• udt_send(): called by rdt, to transfer packet over unreliable channel to receiver

receive side:
• rdt_rcv(): called when packet arrives on rcv-side of channel
• deliver_data(): called by rdt to deliver data to upper layer

We'll:
• incrementally develop sender, receiver sides of reliable data transfer protocol (rdt)
• consider only unidirectional data transfer
  - but control info will flow on both directions
• use finite state machines (FSMs) to specify sender, receiver

[Figure: FSM notation. An arrow from state 1 to state 2 is labeled with the event causing the state transition and, below it, the actions taken on state transition. state: when in this "state", next state is uniquely determined by next event.]

rdt1.0: reliable transfer over a reliable channel

• underlying channel perfectly reliable
  - no bit errors
  - no loss of packets
• separate FSMs for sender, receiver:
  - sender sends data into underlying channel
  - receiver reads data from underlying channel

(transitions are written as event / actions)

sender:
  Wait for call from above
    rdt_send(data) / packet = make_pkt(data); udt_send(packet)

receiver:
  Wait for call from below
    rdt_rcv(packet) / extract(packet, data); deliver_data(data)

rdt2.0: channel with bit errors

• underlying channel may flip bits in packet
  - checksum to detect bit errors
• the question: how to recover from errors?
  - acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
  - negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
  - sender retransmits pkt on receipt of NAK
• new mechanisms in rdt2.0 (beyond rdt1.0):
  - error detection
  - receiver feedback: control msgs (ACK, NAK) rcvr->sender

How do humans recover from "errors" during conversation?


rdt2.0: FSM specification

(transitions are written as event / actions; Λ means no action)

sender:
  Wait for call from above
    rdt_send(data) / sndpkt = make_pkt(data, checksum); udt_send(sndpkt) → Wait for ACK or NAK
  Wait for ACK or NAK
    rdt_rcv(rcvpkt) && isNAK(rcvpkt) / udt_send(sndpkt)
    rdt_rcv(rcvpkt) && isACK(rcvpkt) / Λ → Wait for call from above

receiver:
  Wait for call from below
    rdt_rcv(rcvpkt) && corrupt(rcvpkt) / udt_send(NAK)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) / extract(rcvpkt, data); deliver_data(data); udt_send(ACK)

rdt2.0: operation with no errors

[Figure: the rdt2.0 FSM from the specification slide, repeated. With no errors, the sender sends sndpkt and waits; the receiver finds notcorrupt(rcvpkt), delivers the data and sends ACK; the sender sees isACK(rcvpkt) and returns to "Wait for call from above".]

rdt2.0: error scenario

[Figure: the rdt2.0 FSM from the specification slide, repeated. With a bit error, the receiver finds corrupt(rcvpkt) and sends NAK; the sender sees isNAK(rcvpkt) and retransmits sndpkt, staying in "Wait for ACK or NAK".]

rdt2.0 has a fatal flaw

what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver
• can't just retransmit: possible duplicate

handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt

stop and wait: sender sends one packet, then waits for receiver response

rdt2.1: sender, handles garbled ACK/NAKs

(transitions are written as event / actions; Λ means no action)

  Wait for call 0 from above
    rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt) → Wait for ACK or NAK 0
  Wait for ACK or NAK 0
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ → Wait for call 1 from above
  Wait for call 1 from above
    rdt_send(data) / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt) → Wait for ACK or NAK 1
  Wait for ACK or NAK 1
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ → Wait for call 0 from above

rdt2.1: receiver, handles garbled ACK/NAKs

  Wait for 0 from below
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) → Wait for 1 from below
    rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
  Wait for 1 from below
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) → Wait for 0 from below
    rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)

rdt2.1: Example 1

[Figure sequence: the rdt2.1 sender and receiver FSMs, one transition highlighted per slide]

1. sender, in "Wait for call 0 from above": rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt) → "Wait for ACK or NAK 0"
2. receiver, in "Wait for 0 from below": rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender, in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
4. receiver, in "Wait for 0 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) → "Wait for 1 from below"
5. sender, in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ → "Wait for call 1 from above"
6. sender is in "Wait for call 1 from above"; receiver is in "Wait for 1 from below"

rdt2.1: Example 2

[Figure sequence: the rdt2.1 sender and receiver FSMs, one transition highlighted per slide]

1. receiver, in "Wait for 0 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) → "Wait for 1 from below"
2. sender, in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
3. receiver, in "Wait for 1 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
4. sender, in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ → "Wait for call 1 from above"
5. sender is in "Wait for call 1 from above"; receiver is in "Wait for 1 from below"

rdt2.1: discussion

sender:
• seq # added to pkt
• two seq. #'s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  - state must "remember" whether "expected" pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  - state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender

rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  - receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt

rdt2.2: sender, receiver fragments

sender FSM fragment:
  Wait for call 0 from above
    rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt) → Wait for ACK 0
  Wait for ACK 0
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)) / udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0) / Λ

receiver FSM fragment:
  Wait for 0 from below
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)) / udt_send(sndpkt)
  transition into "Wait for 0 from below":
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
  - checksum, seq #, ACKs, retransmissions will be of help … but not enough

approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  - retransmission will be duplicate, but seq #'s already handle this
  - receiver must specify seq # of pkt being ACKed
• requires countdown timer

rdt3.0 sender

(transitions are written as event / actions; Λ means no action)

  Wait for call 0 from above
    rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer → Wait for ACK0
    rdt_rcv(rcvpkt) / Λ
  Wait for ACK0
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)) / Λ
    timeout / udt_send(sndpkt); start_timer
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0) / stop_timer → Wait for call 1 from above
  Wait for call 1 from above
    rdt_send(data) / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer → Wait for ACK1
    rdt_rcv(rcvpkt) / Λ
  Wait for ACK1
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0)) / Λ
    timeout / udt_send(sndpkt); start_timer
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1) / stop_timer → Wait for call 0 from above

rdt3.0 in action

(a) no loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0

(b) packet loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  pkt1 lost (X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0

(c) ACK loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1
  ack1 lost (X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ACK delayed)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  sender: rcv ack1 (second copy), do nothing
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
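The packet-loss and ACK-loss scenarios can be replayed with a compact Python simulation of the alternating-bit idea behind rdt3.0. It is a sketch under simplifying assumptions: corruption is not modelled, a lost packet or lost ACK is always followed by a timeout, and losses are chosen deterministically by transmission index. The function name and the `drop_data` / `drop_ack` parameters are hypothetical.

```python
def rdt30_transfer(messages, drop_data=(), drop_ack=()):
    """Stop-and-wait transfer with 1-bit sequence numbers.

    drop_data / drop_ack: 0-based indices of data / ACK transmissions to lose.
    Returns (delivered messages, sender-side event log).
    """
    delivered, log = [], []
    expected = 0                    # receiver: seq # it is waiting for
    data_sends = ack_sends = 0
    for i, msg in enumerate(messages):
        seq = i % 2
        while True:
            log.append(f"send pkt{seq}")
            lost = data_sends in drop_data
            data_sends += 1
            if lost:
                log.append("timeout")
                continue            # retransmit the same packet
            if seq == expected:     # new packet: deliver it
                delivered.append(msg)
                expected ^= 1
            # duplicates are not delivered again, but are still ACKed
            ack_lost = ack_sends in drop_ack
            ack_sends += 1
            if ack_lost:
                log.append("timeout")
                continue
            log.append(f"rcv ack{seq}")
            break
    return delivered, log


if __name__ == "__main__":
    for name, kwargs in [("no loss", {}), ("packet loss", {"drop_data": {1}}),
                         ("ACK loss", {"drop_ack": {1}})]:
        data, log = rdt30_transfer(["a", "b", "c"], **kwargs)
        print(name, data, log)
```

In the ACK-loss run the receiver sees pkt1 twice but delivers it once, which is exactly the role of the sequence number.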

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

    $$D_{trans} = \frac{L}{R} = \frac{8000 \text{ bits}}{10^9 \text{ bits/sec}} = 8 \text{ microsecs}$$

  - U_sender: utilization, the fraction of time sender busy sending

    $$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$$

  - if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources

rdt3.0: stop-and-wait operation

[Figure: sender/receiver timeline]
  t = 0: first packet bit transmitted
  t = L/R: last packet bit transmitted
  first packet bit arrives; last packet bit arrives, send ACK
  t = RTT + L/R: ACK arrives, send next packet

    $$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$$

Pipelined protocols

pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver

• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

[Figure: sender/receiver timeline]
  t = 0: first packet bit transmitted
  t = L/R: last bit transmitted
  first packet bit arrives; last packet bit arrives, send ACK
  last bit of 2nd packet arrives, send ACK
  last bit of 3rd packet arrives, send ACK
  t = RTT + L/R: ACK arrives, send next packet

3-packet pipelining increases utilization by a factor of 3:

    $$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$$
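Both utilization figures can be recomputed from the example's parameters. The Python sketch below uses the slide's formula, with a `window` argument (an illustrative name) for the number of packets sent back to back before waiting for the first ACK.

```python
def sender_utilization(packet_bits, link_bps, rtt_s, window=1):
    """Fraction of time the sender is busy: (window * L/R) / (RTT + L/R)."""
    d_trans = packet_bits / link_bps
    return window * d_trans / (rtt_s + d_trans)


if __name__ == "__main__":
    stop_and_wait = sender_utilization(8000, 1e9, 0.030)
    pipelined = sender_utilization(8000, 1e9, 0.030, window=3)
    print(f"stop-and-wait: {stop_and_wait:.5f}")
    print(f"3-packet pipelining: {pipelined:.5f}")
```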

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unACKed packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unACKed packet
  - when timer expires, retransmit all unACKed packets

Selective Repeat:
• sender can have up to N unACKed packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unACKed packet
  - when timer expires, retransmit only that unACKed packet

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N, consecutive unACKed pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n ("cumulative ACK")
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

initially (Λ): base = 1; nextseqnum = 1

state: Wait

rdt_send(data) /
    if (nextseqnum < base+N) {
        sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
        udt_send(sndpkt[nextseqnum])
        if (base == nextseqnum)
            start_timer
        nextseqnum++
    }
    else
        refuse_data(data)

timeout /
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    …
    udt_send(sndpkt[nextseqnum-1])

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) /
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
        stop_timer
    else
        start_timer

rdt_rcv(rcvpkt) && corrupt(rcvpkt) / Λ
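The sender FSM translates almost line for line into Python. The sketch below is illustrative only: packets are recorded in a list instead of being sent, the timer is a boolean flag, and the class and attribute names are assumptions. It follows the slide in setting base directly from the ACK number.

```python
class GbnSender:
    """Toy Go-Back-N sender mirroring the extended FSM."""

    def __init__(self, window):
        self.N = window
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}            # seq # -> data, kept for retransmission
        self.timer_running = False
        self.sent = []              # log of transmitted seq #s

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False            # window full: refuse_data
        self.sndpkt[self.nextseqnum] = data
        self.sent.append(self.nextseqnum)       # udt_send
        if self.base == self.nextseqnum:
            self.timer_running = True           # start_timer
        self.nextseqnum += 1
        return True

    def rcv_ack(self, acknum):
        """Cumulative ACK for all packets up to and including acknum."""
        self.base = acknum + 1
        # stop the timer if nothing is in flight, otherwise restart it
        self.timer_running = self.base != self.nextseqnum

    def timeout(self):
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.sent.append(seq)               # resend every unACKed packet


if __name__ == "__main__":
    s = GbnSender(window=4)
    print([s.rdt_send(f"data{i}") for i in range(5)])   # fifth call is refused
    s.rcv_ack(2)
    s.timeout()
    print(s.sent)
```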


GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering
  - re-ACK pkt with highest in-order seq #

initially (Λ): expectedseqnum = 1; sndpkt = make_pkt(0, ACK, chksum)

state: Wait

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum) /
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++

default /
    udt_send(sndpkt)


GBN in action

sender window (N=4) slides over seq #s 0 1 2 3 4 5 6 7 8

  sender: send pkt0
  sender: send pkt1
  sender: send pkt2 (X, loss)
  sender: send pkt3
  sender: (wait)
  receiver: receive pkt0, send ack0
  receiver: receive pkt1, send ack1
  receiver: receive pkt3, discard, (re)send ack1
  sender: rcv ack0, send pkt4
  sender: rcv ack1, send pkt5
  sender: ignore duplicate ACK
  receiver: receive pkt4, discard, (re)send ack1
  receiver: receive pkt5, discard, (re)send ack1
  sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
  receiver: rcv pkt2, deliver, send ack2
  receiver: rcv pkt3, deliver, send ack3
  receiver: rcv pkt4, deliver, send ack4
  receiver: rcv pkt5, deliver, send ack5


Selective repeat

• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #s of sent, unACKed packets

Selective repeat: sender, receiver windows

[Figure: sender and receiver windows over the sequence number space.]

Selective repeat

sender:
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver:
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore

Selective repeat in action

sender window (N=4) slides over seq #s 0 1 2 3 4 5 6 7 8

  sender: send pkt0
  sender: send pkt1
  sender: send pkt2 (X, loss)
  sender: send pkt3
  sender: (wait)
  receiver: receive pkt0, send ack0
  receiver: receive pkt1, send ack1
  receiver: receive pkt3, buffer, send ack3
  sender: rcv ack0, send pkt4
  sender: rcv ack1, send pkt5
  sender: record ack3 arrived
  receiver: receive pkt4, buffer, send ack4
  receiver: receive pkt5, buffer, send ack5
  sender: pkt 2 timeout, send pkt2
  sender: record ack4 arrived
  sender: record ack5 arrived
  receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

[Figure: sender window (after receipt) and receiver window (after receipt) over the repeating sequence number space 0 1 2 3 0 1 2, in two scenarios]

(a) no problem: sender sends pkt0, pkt1, pkt2, then pkt3 (X, lost) and a new pkt0; receiver will accept packet with seq number 0

(b) oops: sender sends pkt0, pkt1, pkt2; the three ACKs are lost (X X X); sender times out and retransmits pkt0; receiver will accept packet with seq number 0

• receiver sees no difference in two scenarios
• duplicate data accepted as new in (b)

receiver can't see sender side; receiver behavior identical in both cases; something's (very) wrong

Q: what relationship between seq # size and window size to avoid problem in (b)?
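One way to explore the question is to compute which sequence numbers can be both "old, possibly retransmitted" and "new, acceptable" at the receiver. The Python sketch below assumes the worst case from scenario (b): the receiver has advanced its window by a full window size N while the sender has not advanced at all. The function name is hypothetical. The standard result it illustrates, which the slide leaves as a question, is that the window size must be at most half the sequence number space.

```python
def ambiguous_seq_numbers(seq_space, window):
    """Seq #s that appear in both the sender's old window and the
    receiver's advanced window, in the worst case."""
    old = {s % seq_space for s in range(0, window)}
    new = {s % seq_space for s in range(window, 2 * window)}
    return old & new


if __name__ == "__main__":
    print(ambiguous_seq_numbers(seq_space=4, window=3))   # slide example
    print(ambiguous_seq_numbers(seq_space=4, window=2))   # no overlap
```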

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver

TCP segment structure

[Figure: TCP segment layout, 32 bits wide]
  source port # | dest port #
  sequence number
  acknowledgement number
  head len | not used | U A P R S F | receive window
  checksum | Urg data pointer
  options (variable length)
  application data (variable length)

annotations:
• sequence number, acknowledgement number: counting by bytes of data (not segments)
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)

TCP seq. numbers, ACKs

sequence numbers:
  - byte stream "number" of first byte in segment's data
acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK
Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say, up to implementor

[Figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer. Below them, the sender sequence number space with window size N: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable.]

Byte stream in TCP

[Figure: byte stream in TCP. Labels: window of N bytes; HTTP GET message (K bytes) starting at the 100th byte; TCP header (seq no = 100) followed by M bytes; part of the message marked "cannot be transmitted now".]

TCP seq. numbers, ACKs

simple telnet scenario (Host A, Host B):

1. User types 'C'; Host A sends Seq=42, ACK=79, data = 'C'
2. Host B ACKs receipt of 'C', echoes back 'C': Seq=79, ACK=43, data = 'C'
3. Host A ACKs receipt of echoed 'C': Seq=43, ACK=80

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT

TCP round trip time, timeout

$$\mathrm{EstimatedRTT} = (1-\alpha)\cdot\mathrm{EstimatedRTT} + \alpha\cdot\mathrm{SampleRTT}$$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Figure: SampleRTT and EstimatedRTT, RTT (milliseconds, 100 to 350) versus time (seconds), measured from gaia.cs.umass.edu to fantasia.eurecom.fr.]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

$$\mathrm{DevRTT} = (1-\beta)\cdot\mathrm{DevRTT} + \beta\cdot|\mathrm{SampleRTT}-\mathrm{EstimatedRTT}|$$

(typically, β = 0.25)

$$\mathrm{TimeoutInterval} = \mathrm{EstimatedRTT} + 4\cdot\mathrm{DevRTT}$$

Here EstimatedRTT is the estimated RTT and 4·DevRTT is the "safety margin".
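
The three update rules for EstimatedRTT, DevRTT and TimeoutInterval can be tried out with a small sketch. The class name `RttEstimator` is hypothetical. Two points are assumptions not stated on the slides: DevRTT starts at half the first sample, and DevRTT is updated before EstimatedRTT so that the deviation is measured against the previous estimate.

```python
class RttEstimator:
    """EWMA-based RTT estimator following the slide formulas (times in ms)."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2  # assumption: initial deviation

    def update(self, sample_rtt):
        # Deviation first, measured against the old EstimatedRTT (assumption).
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    def timeout_interval(self):
        # EstimatedRTT plus the "safety margin" of 4 * DevRTT.
        return self.estimated_rtt + 4 * self.dev_rtt
```

Samples taken from retransmitted segments should not be passed to `update`, matching the "ignore retransmissions" rule.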

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks

let's initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control

TCP sender events

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments

TCP sender (simplified)

[FSM with a single state, "wait for event".]

initialization (Λ):
    NextSeqNum = InitialSeqNum
    SendBase = InitialSeqNum

event: data received from application above
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
        start timer

event: timeout
    retransmit not-yet-acked segment with smallest seq. #
    start timer

event: ACK received, with ACK field value y
    if (y > SendBase) {
        SendBase = y
        /* SendBase–1: last cumulatively ACKed byte */
        if (there are currently not-yet-acked segments)
            start timer
        else stop timer
    }
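
The three events of the simplified sender can be sketched as a small class. This is an illustration under stated assumptions, not the slides' implementation: the timer is modelled as a boolean flag rather than a real clock, and `send` is a hypothetical callback standing in for "pass segment to IP".

```python
class SimplifiedTcpSender:
    """Simplified TCP sender: no duplicate-ACK handling, no flow or congestion control."""

    def __init__(self, initial_seq_num, send):
        self.next_seq_num = initial_seq_num
        self.send_base = initial_seq_num
        self.send = send            # hypothetical callback: send(seq, data)
        self.unacked = {}           # seq # -> data of not-yet-acked segments
        self.timer_running = False  # assumption: timer modelled as a flag

    def on_app_data(self, data):
        seq = self.next_seq_num
        self.unacked[seq] = data
        self.send(seq, data)
        self.next_seq_num += len(data)
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if self.unacked:
            seq = min(self.unacked)  # not-yet-acked segment with smallest seq #
            self.send(seq, self.unacked[seq])
            self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y       # bytes up to y-1 are cumulatively ACKed
            done = [s for s, d in self.unacked.items() if s + len(d) <= y]
            for seq in done:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)
```

A single cumulative ACK can retire several segments at once, which is why `on_ack` scans every buffered segment rather than only the oldest.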

TCP: retransmission scenarios

lost ACK scenario (Host A, Host B):
  A → B: Seq=92, 8 bytes of data
  B → A: ACK=100 (lost, X)
  timeout at A
  A → B: Seq=92, 8 bytes of data (retransmission)
  B → A: ACK=100

premature timeout (Host A, Host B):
  A → B: Seq=92, 8 bytes of data
  A → B: Seq=100, 20 bytes of data
  B → A: ACK=100
  B → A: ACK=120
  timeout at A before the ACKs arrive
  A → B: Seq=92, 8 bytes of data (retransmission)
  B → A: ACK=120
  SendBase at A: 92, then 100, then 120, then still 120

cumulative ACK (Host A, Host B):
  A → B: Seq=92, 8 bytes of data
  A → B: Seq=100, 20 bytes of data
  B → A: ACK=100 (lost, X)
  B → A: ACK=120 (arrives before timeout)
  A → B: Seq=120, 15 bytes of data

TCP ACK generation [RFC 5681]

event at receiver → TCP receiver action

1. arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
   → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
2. arrival of in-order segment with expected seq #; one other segment has ACK pending
   → immediately send single cumulative ACK, ACKing both in-order segments
3. arrival of out-of-order segment with higher-than-expected seq #; gap detected
   → immediately send duplicate ACK, indicating seq # of next expected byte
4. arrival of segment that partially or completely fills gap
   → immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 duplicate ACKs for the same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  – likely that unacked segment is lost, so don't wait for timeout

[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost (X). Host B answers with ACK=100 four times (the original ACK plus 3 DUP ACKs). Host A resends Seq=100, 20 bytes of data, before the timeout expires.]
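
The triple-duplicate-ACK rule amounts to a counter that resets whenever a new ACK advances SendBase. A minimal sketch follows; the class name `DupAckDetector` and the return-value convention are assumptions made for illustration.

```python
class DupAckDetector:
    """Counts duplicate ACKs and signals when fast retransmit should fire."""

    def __init__(self, send_base):
        self.send_base = send_base
        self.dup_acks = 0

    def on_ack(self, y):
        """Return the seq # to retransmit now, or None if nothing is due."""
        if y > self.send_base:       # new ACK: advance and reset the counter
            self.send_base = y
            self.dup_acks = 0
            return None
        if y == self.send_base:      # duplicate of the last cumulative ACK
            self.dup_acks += 1
            if self.dup_acks == 3:
                return self.send_base  # unacked segment with smallest seq #
        return None
```

With SendBase=92, four arrivals of ACK=100 give one new ACK followed by three duplicates, so the fourth call returns 100, the segment to resend.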

TCP flow control

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code into the TCP socket receiver buffers inside the OS; the application process reads from those buffers.]

application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)

TCP flow control

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); data leaves to the application process.]
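
The rwnd bookkeeping can be written as two small functions. This sketch uses byte counters named `last_byte_rcvd`, `last_byte_read`, `last_byte_sent` and `last_byte_acked`; these names are assumptions introduced here for illustration and do not appear on the slide.

```python
def advertised_rwnd(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free buffer space the receiver advertises in the TCP header."""
    buffered = last_byte_rcvd - last_byte_read  # data not yet read by the app
    return rcv_buffer - buffered

def sendable_bytes(rwnd, last_byte_sent, last_byte_acked):
    """How many more bytes the sender may put in flight under flow control."""
    in_flight = last_byte_sent - last_byte_acked
    return max(0, rwnd - in_flight)
```

For example, a 4096-byte RcvBuffer holding 2000 unread bytes advertises rwnd = 2096; a sender with 1000 bytes in flight may then send 1096 more.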

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

[Figure: client and server, each with an application above the network, each holding:
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client]

client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

client state: CLOSED; server state: LISTEN

1. client: choose init seq num, x; send TCP SYN msg
   client → server: SYNbit=1, Seq=x
   client state: SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN
   server → client: SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1
   server state: SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK; this segment may contain client-to-server data
   client → server: ACKbit=1, ACKnum=y+1
   client state: ESTAB
4. server: received ACK(y) indicates client is live
   server state: ESTAB
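
The sequence and acknowledgement numbers exchanged in the handshake follow directly from the two initial sequence numbers. A small sketch, assuming segments are represented as plain dictionaries (an illustrative choice) and that sequence numbers wrap at 32 bits:

```python
SEQ_SPACE = 2 ** 32  # TCP sequence numbers are 32-bit values

def three_way_handshake(x, y):
    """Return the SYN, SYNACK and ACK segments for initial seq numbers x and y."""
    syn = {"SYN": 1, "seq": x}
    synack = {"SYN": 1, "seq": y, "ACK": 1,
              "acknum": (syn["seq"] + 1) % SEQ_SPACE}   # acks the client's SYN
    ack = {"ACK": 1,
           "acknum": (synack["seq"] + 1) % SEQ_SPACE}   # acks the server's SYN
    return [syn, synack, ack]
```

The SYN consumes one sequence number even though it carries no data, which is why each side acknowledges the other's initial value plus one.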

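TCP 3-way handshake: FSM

[FSM with states closed, listen, SYN sent, SYN rcvd, ESTAB. Transitions, written as event / action:]
• closed → listen: Socket connectionSocket = welcomeSocket.accept(); / Λ
• closed → SYN sent: Socket clientSocket = new Socket("hostname", "port number"); / SYN(seq=x)
• listen → SYN rcvd: SYN(x) / SYNACK(seq=y, ACKnum=x+1), create new socket for communication back to client
• SYN sent → ESTAB: SYNACK(seq=y, ACKnum=x+1) / ACK(ACKnum=y+1)
• SYN rcvd → ESTAB: ACK(ACKnum=y+1) / Λ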

TCP: closing a connection

• client, server each close their side of connection
  – send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

[Figure: closing sequence; client state and server state both start at ESTAB.]
1. clientSocket.close(): client sends FINbit=1, seq=x and enters FIN_WAIT_1 (can no longer send but can receive data)
2. server replies ACKbit=1, ACKnum=x+1 and enters CLOSE_WAIT (can still send data); client enters FIN_WAIT_2 (wait for server close)
3. server sends FINbit=1, seq=y and enters LAST_ACK (can no longer send data)
4. client replies ACKbit=1, ACKnum=y+1 and enters TIMED_WAIT (timed wait for 2*max segment lifetime), then CLOSED; server enters CLOSED

The "Two Army Problem"

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
• a top-10 problem!

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2

[Figure: Host A and Host B send original data at rate λ_in into a router with unlimited shared output link buffers; throughput is λ_out. Plots: λ_out versus λ_in, levelling off at R/2; delay versus λ_in, growing as λ_in approaches R/2.]

• large delays as arrival rate, λ_in, approaches capacity

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: λ_in = λ_out
  – transport-layer input includes retransmissions: λ'_in ≥ λ_in

[Figure: Host A sends λ_in (original data) and λ'_in (original data, plus retransmitted data) into finite shared output link buffers; Host B receives λ_out.]

idealization: perfect knowledge
• sender sends only when router buffers available
[Plot: λ_out versus λ_in, both up to R/2.]

idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
[Plot: λ_out versus λ'_in, both up to R/2.] when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)

realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
[Plot: λ_out versus λ'_in, both up to R/2.] when sending at R/2, some packets are retransmissions, including duplicates that are delivered!

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  – decreasing goodput

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

[Figure: Hosts A, B, C and D connected over multihop paths through routers with finite shared output link buffers; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out.]

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!

[Plot: throughput by blue traffic (λ_out) versus offered load by Host A (λ'_in), both up to R/2; annotation: bandwidth wastage for packets dropped at the 2nd router.]

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss

[Figure: AIMD saw tooth behavior: probing for bandwidth. cwnd (TCP sender congestion window size) versus time: additively increase window size … until loss occurs (then cut window in half).]
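
The saw-tooth shape can be reproduced with a toy trace. This sketch measures cwnd in units of MSS and assumes, purely for illustration, that a loss occurs in any RTT where cwnd exceeds a fixed `capacity`; real loss depends on queueing at the bottleneck.

```python
def aimd_trace(cwnd, capacity, rtts):
    """Return cwnd (in MSS) at the start of each RTT under AIMD."""
    trace = []
    for _ in range(rtts):
        trace.append(cwnd)
        if cwnd > capacity:           # assumed loss signal
            cwnd = max(1, cwnd // 2)  # multiplicative decrease
        else:
            cwnd += 1                 # additive increase: 1 MSS per RTT
    return trace
```

Starting from 6 MSS with a capacity of 8, the window climbs to 9, is halved to 4, and starts climbing again.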

TCP Congestion Control: details

• sender limits transmission:

$$\mathrm{LastByteSent} - \mathrm{LastByteAcked} \le \mathrm{cwnd}$$

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

$$\mathrm{rate} \approx \frac{\mathrm{cwnd}}{\mathrm{RTT}}\ \text{bytes/sec}$$

[Figure: sender sequence number space, showing last byte ACKed; sent, not-yet ACKed ("in-flight"); last byte sent; the in-flight region spans cwnd.]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: over successive RTTs, Host A sends one segment, then two segments, then four segments to Host B.]

TCP: detecting, reacting to loss

• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

[FSM with states slow start, congestion avoidance, fast recovery.]

initialization (Λ, enter slow start):
    cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0

slow start:
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: Λ; go to congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment

congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start

fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment; go to slow start
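
The three-state machine can be exercised with a compact sketch. Assumptions made for readability: cwnd and ssthresh are measured in units of MSS (so the initial ssthresh of 64 is in segments, not the 64 KB on the slide), the "+ 3" on entering fast recovery is read as 3 MSS, and retransmission itself is left out. The class name `RenoSender` is hypothetical.

```python
MSS = 1  # assumption: window sizes are counted in segments

class RenoSender:
    """Window bookkeeping for slow start, congestion avoidance, fast recovery."""

    def __init__(self, ssthresh=64):
        self.state = "slow start"
        self.cwnd = 1 * MSS
        self.ssthresh = ssthresh
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "fast recovery":
            self.cwnd = self.ssthresh
            self.state = "congestion avoidance"
        elif self.state == "slow start":
            self.cwnd += MSS                      # doubles cwnd every RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion avoidance"
        else:
            self.cwnd += MSS * (MSS / self.cwnd)  # about 1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast recovery":
            self.cwnd += MSS
            return
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1 * MSS
        self.dup_acks = 0
        self.state = "slow start"
```

Feeding the object a sequence of `on_new_ack`, `on_dup_ack` and `on_timeout` calls and recording `cwnd` after each one gives a trace that can be compared with the saw-tooth plots.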

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is 3/4 W
  – avg. throughput is 3/4 W per RTT

$$\text{avg TCP throughput} = \frac{3}{4}\,\frac{W}{\mathrm{RTT}}\ \text{bytes/sec}$$

[Figure: saw tooth of window size oscillating between W/2 and W, with average 3/4 W.]

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

$$\text{TCP throughput} = \frac{1.22\cdot\mathrm{MSS}}{\mathrm{RTT}\,\sqrt{L}}$$

• to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
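
The two numbers in the example can be checked by rearranging the formulas. This sketch assumes MSS is given in bytes and throughput in bits per second, so MSS is converted to bits; the function names are illustrative.

```python
def segments_in_flight(throughput_bps, rtt_s, mss_bytes):
    """Window W (in segments) needed to sustain the given throughput."""
    return throughput_bps * rtt_s / (mss_bytes * 8)

def required_loss_rate(throughput_bps, rtt_s, mss_bytes):
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L."""
    return (1.22 * mss_bytes * 8 / (rtt_s * throughput_bps)) ** 2

w = segments_in_flight(10e9, 0.100, 1500)  # about 83,333 segments
l = required_loss_rate(10e9, 0.100, 1500)  # about 2e-10
```

Solving for L squares the ratio, so a tenfold increase in target throughput needs a hundredfold lower loss rate.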

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput versus Connection 1 throughput, both axes from 0 to R, with the "equal bandwidth share" line and the full bandwidth utilization line. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2". (X1, Y1) where X1+Y1 = R; (X2, Y2) where X2 = Y2.]
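
The convergence argument can be checked numerically. The sketch below assumes two flows with equal RTTs that both see a loss whenever their combined rate exceeds the link capacity; those synchronised losses are a simplification made for illustration.

```python
def aimd_two_flows(x, y, capacity, rounds):
    """Evolve two AIMD flows sharing one bottleneck; return their final rates."""
    for _ in range(rounds):
        if x + y > capacity:
            x, y = x / 2, y / 2  # loss: both halve, which halves their difference
        else:
            x, y = x + 1, y + 1  # additive increase: difference unchanged
    return x, y
```

Additive increase keeps the gap between the flows constant and each multiplicative decrease halves it, so the rates drift toward the equal bandwidth share line.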

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and is marked ECN=11 by a congested router; the destination returns a TCP ACK segment with ECE=1.]

Page 13: ChapterIII: Transport Layer

Reliable data transfer: getting started

We'll:
• incrementally develop sender, receiver sides of reliable data transfer protocol (rdt)
• consider only unidirectional data transfer
  – but control info will flow on both directions!
• use finite state machines (FSMs) to specify sender, receiver

[Figure: FSM notation. An arrow from state 1 to state 2 is labelled with the event causing the state transition above the line and the actions taken on state transition below it (event / actions).]

state: when in this "state", next state uniquely determined by next event

rdt1.0: reliable transfer over a reliable channel

• underlying channel perfectly reliable
  – no bit errors
  – no loss of packets
• separate FSMs for sender, receiver:
  – sender sends data into underlying channel
  – receiver reads data from underlying channel

sender FSM (single state "Wait for call from above"):
    rdt_send(data) / packet = make_pkt(data); udt_send(packet)

receiver FSM (single state "Wait for call from below"):
    rdt_rcv(packet) / extract(packet, data); deliver_data(data)

rdt2.0: channel with bit errors

• underlying channel may flip bits in packet
  – checksum to detect bit errors
• the question: how to recover from errors:
  – acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
  – negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
  – sender retransmits pkt on receipt of NAK
• new mechanisms in rdt2.0 (beyond rdt1.0):
  – error detection
  – receiver feedback: control msgs (ACK, NAK) rcvr->sender

How do humans recover from "errors" during conversation?


rdt2.0: FSM specification

sender:
  state "Wait for call from above":
    rdt_send(data) / sndpkt = make_pkt(data, checksum); udt_send(sndpkt)  → "Wait for ACK or NAK"
  state "Wait for ACK or NAK":
    rdt_rcv(rcvpkt) && isNAK(rcvpkt) / udt_send(sndpkt)
    rdt_rcv(rcvpkt) && isACK(rcvpkt) / Λ  → "Wait for call from above"

receiver:
  state "Wait for call from below":
    rdt_rcv(rcvpkt) && corrupt(rcvpkt) / udt_send(NAK)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) / extract(rcvpkt, data); deliver_data(data); udt_send(ACK)

rdt2.0: operation with no errors

[Figure: the rdt2.0 sender and receiver FSMs, following the path without errors: rdt_send(data) / sndpkt = make_pkt(data, checksum), udt_send(sndpkt); then rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) / extract(rcvpkt, data), deliver_data(data), udt_send(ACK); then rdt_rcv(rcvpkt) && isACK(rcvpkt) / Λ.]

rdt2.0: error scenario

[Figure: the same FSMs, following the path with a corrupted packet: rdt_rcv(rcvpkt) && corrupt(rcvpkt) / udt_send(NAK); then rdt_rcv(rcvpkt) && isNAK(rcvpkt) / udt_send(sndpkt); the retransmission is then delivered and ACKed.]

rdt2.0 has a fatal flaw!

what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver!
• can't just retransmit: possible duplicate

handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt

stop and wait: sender sends one packet, then waits for receiver response

rdt2.1: sender, handles garbled ACK/NAKs

  state "Wait for call 0 from above":
    rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)  → "Wait for ACK or NAK 0"
  state "Wait for ACK or NAK 0":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ  → "Wait for call 1 from above"
  state "Wait for call 1 from above":
    rdt_send(data) / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt)  → "Wait for ACK or NAK 1"
  state "Wait for ACK or NAK 1":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ  → "Wait for call 0 from above"

rdt2.1: receiver, handles garbled ACK/NAKs

  state "Wait for 0 from below":
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)  → "Wait for 1 from below"
    rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
  state "Wait for 1 from below":
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)  → "Wait for 0 from below"
    rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)

rdt2.1: Example 1

[Figure sequence: the rdt2.1 sender and receiver FSMs, stepping through these transitions.]
1. sender, "Wait for call 0 from above": rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
2. receiver, "Wait for 0 from below": rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender, "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
4. receiver, "Wait for 0 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
5. sender, "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ

rdt2.1: Example 2

[Figure sequence: the same FSMs, stepping through these transitions.]
1. receiver, "Wait for 0 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
2. sender, "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
3. receiver, "Wait for 1 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
4. sender, "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ

rdt2.1: discussion

sender:
• seq # added to pkt
• two seq. #'s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  – state must "remember" whether "expected" pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  – state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender

rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  – receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt

rdt2.2: sender, receiver fragments

sender FSM fragment:
  state "Wait for call 0 from above":
    rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)  → "Wait for ACK 0"
  state "Wait for ACK 0":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)) / udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0) / Λ

receiver FSM fragment:
  transition into "Wait for 0 from below":
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
  state "Wait for 0 from below":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)) / udt_send(sndpkt)

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
  – checksum, seq. #, ACKs, retransmissions will be of help … but not enough

approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  – retransmission will be duplicate, but seq. #'s already handles this
  – receiver must specify seq # of pkt being ACKed
• requires countdown timer

rdt3.0 sender

  state "Wait for call 0 from above":
    rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer  → "Wait for ACK0"
    rdt_rcv(rcvpkt) / Λ
  state "Wait for ACK0":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)) / Λ
    timeout / udt_send(sndpkt); start_timer
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0) / stop_timer  → "Wait for call 1 from above"
  state "Wait for call 1 from above":
    rdt_send(data) / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer  → "Wait for ACK1"
    rdt_rcv(rcvpkt) / Λ
  state "Wait for ACK1":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,0)) / Λ
    timeout / udt_send(sndpkt); start_timer
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1) / stop_timer  → "Wait for call 0 from above"
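
The four-state sender collapses naturally into a current sequence bit plus a "waiting for ACK" flag. The sketch below makes several simplifying assumptions: the checksum is replaced by a `corrupt` boolean passed in by the caller, the timer is a flag rather than a real clock, and `udt_send` is a hypothetical callback for the unreliable channel.

```python
class Rdt30Sender:
    """Alternating-bit (stop-and-wait) sender with retransmission on timeout."""

    def __init__(self, udt_send):
        self.udt_send = udt_send     # hypothetical unreliable-channel callback
        self.seq = 0                 # sequence bit of the next/current packet
        self.waiting_for_ack = False
        self.sndpkt = None
        self.timer_running = False

    def rdt_send(self, data):
        if self.waiting_for_ack:
            return False             # stop and wait: one packet at a time
        self.sndpkt = (self.seq, data)
        self.udt_send(self.sndpkt)
        self.timer_running = True
        self.waiting_for_ack = True
        return True

    def rdt_rcv(self, ack_seq, corrupt=False):
        # Corrupt ACKs and ACKs for the other bit are ignored; the timer recovers.
        if not self.waiting_for_ack or corrupt or ack_seq != self.seq:
            return
        self.timer_running = False
        self.waiting_for_ack = False
        self.seq = 1 - self.seq

    def timeout(self):
        if self.waiting_for_ack:
            self.udt_send(self.sndpkt)
            self.timer_running = True
```

Ignoring a wrong or corrupt ACK, instead of retransmitting at once, is what separates this sender from rdt2.2: the timeout alone triggers retransmission.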

rdt3.0 in action

(a) no loss
  sender: send pkt0 → receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 → receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0 → receiver: rcv pkt0, send ack0

(b) packet loss
  sender: send pkt0 → receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 → pkt1 lost (X)
  sender: timeout, resend pkt1 → receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0 → receiver: rcv pkt0, send ack0

(c) ACK loss
  sender: send pkt0 → receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 → receiver: rcv pkt1, send ack1; ack1 lost (X)
  sender: timeout, resend pkt1 → receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0 → receiver: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
  sender: send pkt0 → receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 → receiver: rcv pkt1, send ack1 (delayed)
  sender: timeout, resend pkt1 → receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0 → receiver: rcv pkt0, send ack0
  sender: rcv ack1 (second copy), do nothing
  sender: rcv ack0, send pkt1

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

$$D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^9\ \text{bits/sec}} = 8\ \text{microsecs}$$

  – U_sender: utilization, the fraction of time sender busy sending

$$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$$

  – if RTT = 30 msec, 1KB pkt every 30 msec: 33kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!

rdt3.0: stop-and-wait operation

[Figure: sender/receiver timeline. First packet bit transmitted, t = 0; last packet bit transmitted, t = L/R; first packet bit arrives; last packet bit arrives, send ACK; ACK arrives, send next packet, t = RTT + L/R.]

Pipelined protocols

pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

[Figure: sender/receiver timeline. First packet bit transmitted, t = 0; last bit transmitted, t = L/R; first packet bit arrives; last packet bit arrives, send ACK; last bit of 2nd packet arrives, send ACK; last bit of 3rd packet arrives, send ACK; ACK arrives, send next packet, t = RTT + L/R.]

3-packet pipelining increases utilization by a factor of 3!

$$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} = 0.0008$$
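
The utilization figures for stop-and-wait and for 3-packet pipelining come from the same expression. A minimal sketch, with the `packets_in_flight` parameter added here as a generalisation of the slide's factor of 3:

```python
def sender_utilization(packet_bits, link_bps, rtt_s, packets_in_flight=1):
    """Fraction of time the sender is busy: (n * L/R) / (RTT + L/R)."""
    d_trans = packet_bits / link_bps  # transmission delay L/R in seconds
    return packets_in_flight * d_trans / (rtt_s + d_trans)

stop_and_wait = sender_utilization(8000, 1e9, 0.030)      # about 0.00027
pipelined_3 = sender_utilization(8000, 1e9, 0.030, 3)     # about 0.0008
```

The expression is only meaningful while the window's transmission time fits within RTT + L/R; beyond that point the sender is busy all the time and utilization is capped at 1.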

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  – doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  – when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  – when timer expires, retransmit only that unacked packet

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n - "cumulative ACK"
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

[FSM with a single state, "Wait".]

initialization (Λ):
    base = 1
    nextseqnum = 1

event: rdt_send(data)
    if (nextseqnum < base+N) {
        sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
        udt_send(sndpkt[nextseqnum])
        if (base == nextseqnum)
            start_timer
        nextseqnum++
    }
    else
        refuse_data(data)

event: timeout
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    …
    udt_send(sndpkt[nextseqnum-1])

event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
        stop_timer
    else
        start_timer

event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    Λ
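
The extended FSM translates almost line for line into code. The sketch below makes some assumptions for brevity: packets are `(seq, data)` tuples without a checksum, the timer is a boolean flag, sequence numbers are unbounded integers rather than k-bit values, and an ACK older than `base` is ignored (a guard added here that the FSM does not spell out).

```python
class GbnSender:
    """Go-Back-N sender with a window of N unacked packets."""

    def __init__(self, window_size, udt_send):
        self.N = window_size
        self.udt_send = udt_send     # hypothetical unreliable-channel callback
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}             # seq # -> buffered packet
        self.timer_running = False

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False             # window full: refuse_data
        pkt = (self.nextseqnum, data)
        self.sndpkt[self.nextseqnum] = pkt
        self.udt_send(pkt)
        if self.base == self.nextseqnum:
            self.timer_running = True
        self.nextseqnum += 1
        return True

    def timeout(self):
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.udt_send(self.sndpkt[seq])  # resend every unacked packet

    def rdt_rcv_ack(self, acknum):
        if acknum < self.base:
            return                   # assumption: stale/duplicate ACK ignored
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)       # cumulative ACK frees the buffer
        self.base = acknum + 1
        self.timer_running = self.base != self.nextseqnum
```

With N = 4, a fifth call to `rdt_send` before any ACK arrives is refused, and a timeout after ACK 2 resends packets 3 and 4 only.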

GBN receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don't buffer): no receiver buffering!
  – re-ACK pkt with highest in-order seq #

initial transition (Λ):
  expectedseqnum = 1
  sndpkt = make_pkt(0, ACK, chksum)

state: Wait

event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum)
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++

event: default
  udt_send(sndpkt)
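The receiver side is small enough to sketch in a few lines of Python. The class and method names are illustrative assumptions, and corruption is passed in as a flag instead of being derived from a checksum.

```python
class GBNReceiver:
    """Go-Back-N receiver sketch: only expectedseqnum needs to be remembered."""

    def __init__(self):
        self.expectedseqnum = 1
        self.last_ack = 0        # mirrors the initial sndpkt = make_pkt(0, ACK, chksum)
        self.delivered = []

    def rdt_rcv(self, seqnum, data, corrupt=False):
        """Handle one arriving packet and return the ACK number sent back."""
        if not corrupt and seqnum == self.expectedseqnum:
            self.delivered.append(data)          # deliver_data(data)
            self.last_ack = self.expectedseqnum
            self.expectedseqnum += 1
        # default case: re-send the ACK for the highest in-order packet
        return self.last_ack
```

An out-of-order or corrupt packet leaves the state untouched and simply repeats the previous ACK, which is where the duplicate ACKs come from.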


GBN in action

sender window (N=4), seq #s 0 1 2 3 4 5 6 7 8

sender:   send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
receiver: receive pkt0, send ack0
receiver: receive pkt1, send ack1
receiver: receive pkt3, discard, (re)send ack1
sender:   rcv ack0, send pkt4
sender:   rcv ack1, send pkt5
receiver: receive pkt4, discard, (re)send ack1
receiver: receive pkt5, discard, (re)send ack1
sender:   ignore duplicate ACK
sender:   pkt 2 timeout: send pkt2, send pkt3, send pkt4, send pkt5
receiver: rcv pkt2, deliver, send ack2
receiver: rcv pkt3, deliver, send ack3
receiver: rcv pkt4, deliver, send ack4
receiver: rcv pkt5, deliver, send ack5


Selective repeat

• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #'s
  – limits seq #s of sent, unACKed packets

Selective repeat sender receiver windows

54

Selective repeat

sender

data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver

pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
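The receiver rules can be sketched as a short Python class. This is a simplified illustration: the names are assumptions, and sequence numbers are treated as unbounded integers, so wraparound is not modelled here.

```python
class SRReceiver:
    """Selective repeat receiver sketch (no sequence-number wraparound)."""

    def __init__(self, window_size):
        self.N = window_size
        self.rcvbase = 0
        self.buffer = {}         # out-of-order packets waiting for the gap to fill
        self.delivered = []

    def receive(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            # Deliver the in-order run starting at rcvbase, advancing the window.
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n             # already delivered: re-ACK so the sender can advance
        return None              # otherwise: ignore
```

Feeding it pkt0, pkt1, pkt3, pkt4, pkt5 and then pkt2 buffers the three out-of-order packets and delivers pkt2 through pkt5 together once the gap is filled.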

Selective repeat in action

sender window (N=4), seq #s 0 1 2 3 4 5 6 7 8

sender:   send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
receiver: receive pkt0, send ack0
receiver: receive pkt1, send ack1
receiver: receive pkt3, buffer, send ack3
sender:   rcv ack0, send pkt4
sender:   rcv ack1, send pkt5
sender:   record ack3 arrived
receiver: receive pkt4, buffer, send ack4
receiver: receive pkt5, buffer, send ack5
sender:   pkt 2 timeout: send pkt2
sender:   record ack4 arrived
sender:   record ack5 arrived
receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

(a) no problem
sender:   send pkt0, pkt1, pkt2; all three ACKs arrive, sender window moves on
sender:   send pkt3 (lost, X), then a new pkt0
receiver: window (after receipt) covers 3, 0, 1; will accept packet with seq number 0

(b) oops!
sender:   send pkt0, pkt1, pkt2; all three ACKs are lost (X X X)
sender:   timeout, retransmit pkt0
receiver: window (after receipt) covers 3, 0, 1; will accept packet with seq number 0

receiver can't see sender side. receiver behavior identical in both cases! something's (very) wrong!

• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
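One way to explore the question is to check when a retransmitted packet from the sender's old window can land inside the receiver's advanced window. The helper below is a hypothetical sketch (its name and its worst-case framing are assumptions); it suggests the usual textbook answer that the window may be at most half the sequence-number space.

```python
def sr_ambiguous(seq_space, window):
    """True if an old retransmission can be mistaken for a new packet.

    Worst case: the receiver got a full window and advanced, but every ACK
    was lost, so the sender still retransmits from its old window.
    """
    old = {n % seq_space for n in range(0, window)}            # sender's window
    new = {n % seq_space for n in range(window, 2 * window)}   # receiver's window
    return bool(old & new)


# The slide's example: 4 sequence numbers, window of 3, is ambiguous.
print(sr_ambiguous(4, 3), sr_ambiguous(4, 2))
```

The two sets are disjoint exactly when `2 * window <= seq_space`.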

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver

TCP segment structure

fields (each row is 32 bits wide):
  source port # | dest port #
  sequence number
  acknowledgement number
  head len | not used | U A P R S F | receive window
  checksum | Urg data pointer
  options (variable length)
  application data (variable length)

• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• checksum: Internet checksum (as in UDP)

TCP seq. numbers, ACKs

sequence numbers:
  – byte stream "number" of first byte in segment's data

acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK

Q: how receiver handles out-of-order segments
  – A: TCP spec doesn't say, up to implementor

[figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer]

sender sequence number space (window size N):
  sent, ACKed | sent, not-yet ACKed ("in-flight") | usable but not yet sent | not usable

Byte stream in TCP

62

Window N bytes

HTTP Get Message (K bytes)

100th byte

TCP header(seq no = 100)

M bytes

HTTP Get Message (K bytes)

Cannot be transmitted now

TCP seq numbers ACKs

simple telnet scenario

Host A → Host B: Seq=42, ACK=79, data = 'C'    (user types 'C')
Host B → Host A: Seq=79, ACK=43, data = 'C'    (host ACKs receipt of 'C', echoes back 'C')
Host A → Host B: Seq=43, ACK=80                (host ACKs receipt of echoed 'C')
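The numbers in the telnet exchange follow mechanically from the two rules on the earlier slide: a segment's sequence number is the byte the peer expects next, and its ACK number is the next byte expected from the peer. The helper below is a hypothetical sketch of that arithmetic.

```python
def reply(seq_in, ack_in, data_len_in, data_len_out):
    """Seq/ACK numbers of the segment sent in response to a received segment."""
    seq_out = ack_in                  # we send the byte the peer expects next
    ack_out = seq_in + data_len_in    # next byte we expect from the peer
    return seq_out, ack_out, data_len_out


# Host A sends Seq=42, ACK=79 with one byte ('C').
b_segment = reply(42, 79, 1, 1)       # Host B echoes one byte back
a_segment = reply(*b_segment, 0)      # Host A sends a pure ACK
print(b_segment, a_segment)
```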

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT

$\text{EstimatedRTT} = (1-\alpha)\cdot\text{EstimatedRTT} + \alpha\cdot\text{SampleRTT}$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[figure: RTT, gaia.cs.umass.edu to fantasia.eurecom.fr; SampleRTT and EstimatedRTT in milliseconds (axis 100 to 350) versus time in seconds]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

  $\text{DevRTT} = (1-\beta)\cdot\text{DevRTT} + \beta\cdot|\text{SampleRTT}-\text{EstimatedRTT}|$
  (typically, β = 0.25)

  $\text{TimeoutInterval} = \text{EstimatedRTT} + 4\cdot\text{DevRTT}$
  (first term: estimated RTT; second term: "safety margin")
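The estimator can be sketched in Python as below. Several details are assumptions, since the slides do not state them: the estimate is seeded with the first sample, DevRTT starts at zero, and EstimatedRTT is updated before DevRTT. The class name is illustrative.

```python
class RttEstimator:
    """EWMA-based RTT estimator sketch using the slide's alpha and beta."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample   # assumption: seed with first sample
        self.dev_rtt = 0.0                  # assumption: no deviation yet

    def update(self, sample_rtt):
        # EstimatedRTT = (1 - alpha) * EstimatedRTT + alpha * SampleRTT
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        # DevRTT = (1 - beta) * DevRTT + beta * |SampleRTT - EstimatedRTT|
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))

    def timeout_interval(self):
        # TimeoutInterval = EstimatedRTT + 4 * DevRTT
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)   # milliseconds
est.update(200.0)                        # one unusually slow sample
print(est.estimated_rtt, est.dev_rtt, est.timeout_interval())
```

A single slow sample moves the estimate only slightly, but it widens the safety margin noticeably.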

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks

let's initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control

TCP sender events

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments

TCP sender (simplified)

initial transition (Λ):
  NextSeqNum = InitialSeqNum
  SendBase = InitialSeqNum

state: wait for event

event: data received from application above
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., "send")
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
    start timer

event: timeout
  retransmit not-yet-acked segment with smallest seq. #
  start timer

event: ACK received, with ACK field value y
  if (y > SendBase) {
    SendBase = y
    /* SendBase–1: last cumulatively ACKed byte */
    if (there are currently not-yet-acked segments)
      start timer
    else stop timer
  }
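The three events of the simplified sender map onto three methods in the Python sketch below. The class name, the `send` callback and the boolean timer flag are assumptions for illustration; duplicate ACKs, flow control and congestion control are ignored, as on the slide.

```python
class SimpleTcpSender:
    """Simplified TCP sender sketch: one timer, cumulative ACKs."""

    def __init__(self, initial_seq, send):
        self.next_seq_num = initial_seq
        self.send_base = initial_seq
        self.unacked = []            # (seq, data) pairs in send order
        self.timer_running = False
        self.send = send             # hypothetical "pass segment to IP" callback

    def data_from_app(self, data):
        self.send(self.next_seq_num, data)
        self.unacked.append((self.next_seq_num, data))
        self.next_seq_num += len(data)
        if not self.timer_running:
            self.timer_running = True

    def timeout(self):
        if self.unacked:
            self.send(*self.unacked[0])   # not-yet-acked segment with smallest seq #
            self.timer_running = True

    def ack_received(self, y):
        if y > self.send_base:
            self.send_base = y            # send_base - 1 is the last ACKed byte
            self.unacked = [(s, d) for s, d in self.unacked if s + len(d) > y]
            self.timer_running = bool(self.unacked)
```

Driving it with 8 bytes at Seq=92 and 20 bytes at Seq=100 reproduces the SendBase values 92, 100 and 120 used in the retransmission scenarios.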

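TCP: retransmission scenarios

lost ACK scenario
Host A → Host B: Seq=92, 8 bytes of data
Host B → Host A: ACK=100 (lost, X)
Host A: timeout
Host A → Host B: Seq=92, 8 bytes of data
Host B → Host A: ACK=100

premature timeout
Host A → Host B: Seq=92, 8 bytes of data
Host A → Host B: Seq=100, 20 bytes of data
Host B → Host A: ACK=100
Host B → Host A: ACK=120
Host A: timeout (before the ACKs arrive)
Host A → Host B: Seq=92, 8 bytes of data
Host B → Host A: ACK=120

cumulative ACK
Host A → Host B: Seq=92, 8 bytes of data
Host A → Host B: Seq=100, 20 bytes of data
Host B → Host A: ACK=100 (lost, X)
Host B → Host A: ACK=120 (arrives before timeout)
Host A → Host B: Seq=120, 15 bytes of data

SendBase values marked in the figures: 92, 100, 120, 120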

TCP ACK generation [RFC 5681]

event at receiver → TCP receiver action

• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment, higher-than-expected seq #; gap detected
  → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap
  → immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  – likely that unacked segment lost, so don't wait for timeout
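Duplicate-ACK detection needs only a counter. In the sketch below (names are illustrative assumptions), "triple duplicate" is read as three repeats after the first ACK for the same data, which matches the four ACK=100 segments in the fast retransmit figure.

```python
class DupAckDetector:
    """Counts duplicate ACKs and signals when fast retransmit should fire."""

    def __init__(self):
        self.last_ack = None
        self.dup_count = 0

    def on_ack(self, acknum):
        """Return True exactly when the third duplicate of an ACK arrives."""
        if acknum == self.last_ack:
            self.dup_count += 1
            return self.dup_count == 3
        # A new ACK number resets the count.
        self.last_ack = acknum
        self.dup_count = 0
        return False


detector = DupAckDetector()
print([detector.on_ack(a) for a in (100, 100, 100, 100)])
```

Only the fourth ACK=100 triggers the resend; further duplicates do not trigger it again.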

TCP fast retransmit

fast retransmit after sender receipt of triple duplicate ACK:

Host A → Host B: Seq=92, 8 bytes of data
Host A → Host B: Seq=100, 20 bytes of data (lost, X)
Host B → Host A: ACK=100
Host B → Host A: ACK=100
Host B → Host A: ACK=100
Host B → Host A: ACK=100
Host A → Host B: Seq=100, 20 bytes of data (resent on 3 DUP ACKs, before the timeout expires)

TCP flow control

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

• application may remove data from TCP socket buffers …
• … slower than TCP receiver is delivering (sender is sending)

[figure: receiver protocol stack; segments from sender pass up through IP code and TCP code in the OS into the TCP socket receiver buffers, which the application process reads]

TCP flow control

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[figure: receiver-side buffering; TCP segment payloads enter RcvBuffer, which holds buffered data and free buffer space (rwnd), and leave to the application process]
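The bookkeeping behind rwnd can be sketched in two small functions. The variable names `last_byte_rcvd` and `last_byte_read` are assumptions borrowed from the usual textbook treatment; the slides only name RcvBuffer and rwnd.

```python
def rwnd(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free buffer space the receiver advertises to the sender."""
    buffered = last_byte_rcvd - last_byte_read   # data not yet read by the app
    return rcv_buffer - buffered


def can_send(last_byte_sent, last_byte_acked, nbytes, rwnd_value):
    """Sender side: keep the amount of in-flight data within rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + nbytes <= rwnd_value


window = rwnd(rcv_buffer=4096, last_byte_rcvd=3000, last_byte_read=1000)
print(window, can_send(5000, 4000, 1000, window))
```

When the application reads slowly, the buffered amount grows, rwnd shrinks, and the sender is held back.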

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

at each end (application / network):
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client

client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

client state: CLOSED → SYNSENT → ESTAB
server state: LISTEN → SYN RCVD → ESTAB

1. client (→ SYNSENT): choose init seq num, x; send TCP SYN msg
     SYNbit=1, Seq=x
2. server (→ SYN RCVD): choose init seq num, y; send TCP SYNACK msg, acking SYN
     SYNbit=1, Seq=y, ACKbit=1; ACKnum=x+1
3. client (→ ESTAB): received SYNACK(x) indicates server is live; send ACK for SYNACK; this segment may contain client-to-server data
     ACKbit=1, ACKnum=y+1
4. server (→ ESTAB): received ACK(y) indicates client is live
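The header fields of the three handshake segments can be written out as plain dictionaries. This is only a sketch of the numbering; the function name and dictionary keys are hypothetical, and no real sockets are involved.

```python
def three_way_handshake(x, y):
    """Return the SYN, SYNACK and ACK segments for initial seq nums x and y."""
    syn = {"SYN": 1, "seq": x}
    # The server acknowledges the SYN, which consumes one sequence number.
    synack = {"SYN": 1, "seq": y, "ACK": 1, "acknum": syn["seq"] + 1}
    # The client acknowledges the server's SYN in the same way.
    ack = {"ACK": 1, "acknum": synack["seq"] + 1}
    return syn, synack, ack


for segment in three_way_handshake(x=1000, y=5000):
    print(segment)
```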

TCP 3-way handshake: FSM

(each transition: event, then action)

server side
closed → listen
  Socket connectionSocket = welcomeSocket.accept();
  Λ
listen → SYN rcvd
  SYN(x)
  SYNACK(seq=y,ACKnum=x+1); create new socket for communication back to client
SYN rcvd → ESTAB
  ACK(ACKnum=y+1)
  Λ

client side
closed → SYN sent
  Socket clientSocket = new Socket("hostname", "port number");
  SYN(seq=x)
SYN sent → ESTAB
  SYNACK(seq=y,ACKnum=x+1)
  ACK(ACKnum=y+1)

TCP: closing a connection

• client, server each close their side of connection
  – send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

client state: ESTAB → FIN_WAIT_1 → FIN_WAIT_2 → TIMED_WAIT → CLOSED
server state: ESTAB → CLOSE_WAIT → LAST_ACK → CLOSED

1. client: clientSocket.close(); sends FINbit=1, seq=x; enters FIN_WAIT_1 (can no longer send but can receive data)
2. server: enters CLOSE_WAIT; sends ACKbit=1; ACKnum=x+1 (can still send data)
3. client: enters FIN_WAIT_2 (wait for server close)
4. server: sends FINbit=1, seq=y; enters LAST_ACK (can no longer send data)
5. client: enters TIMED_WAIT; sends ACKbit=1; ACKnum=y+1; timed wait for 2*max segment lifetime, then CLOSED
6. server: on receiving the ACK, CLOSED

The ldquoTwo Army Problemrdquo

84

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
• a top-10 problem!

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity

[figure: Host A and Host B send original data at rate λ_in through a router with unlimited shared output link buffers, giving throughput λ_out; plots of λ_out versus λ_in (both up to R/2) and of delay versus λ_in]

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: λ_in = λ_out
  – transport-layer input includes retransmissions: λ'_in ≥ λ_in

[figure: Host A, Host B, finite shared output link buffers; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out]

Causes/costs of congestion: scenario 2

idealization: perfect knowledge
• sender sends only when router buffers available

[figure: sender keeps a copy of each packet and transmits only when there is free buffer space in the finite shared output link buffers; plot of λ_out versus λ_in, both up to R/2]

Causes/costs of congestion: scenario 2

Idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
• when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)

Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered!

[figure: plots of λ_out versus λ'_in, both up to R/2, for each case]

Causes/costs of congestion: scenario 2

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  – decreasing goodput

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

[figure: Hosts A, B, C, D; finite shared output link buffers; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out]

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!

[figure: λ_out (throughput by blue traffic, up to R/2) versus λ'_in (offered load by Host A, up to R/2); bandwidth wastage for packets dropped at the 2nd router]

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at

TCP congestion control: additive increase multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss

[figure: cwnd (TCP sender congestion window size) versus time; AIMD saw tooth behavior, probing for bandwidth: additively increase window size … until loss occurs (then cut window in half)]

TCP Congestion Control: details

• sender limits transmission: LastByteSent - LastByteAcked ≤ cwnd
• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

  $\text{rate} \approx \frac{\text{cwnd}}{\text{RTT}}$ bytes/sec

[figure: sender sequence number space; last byte ACKed | sent, not-yet ACKed ("in-flight") | last byte sent; cwnd spans the in-flight bytes]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[figure: Host A sends one segment to Host B, then two segments one RTT later, then four segments]

TCP: detecting, reacting to loss

• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

initial transition (Λ), into slow start:
  cwnd = 1 MSS
  ssthresh = 64 KB
  dupACKcount = 0

state: slow start
  new ACK:
    cwnd = cwnd+MSS
    dupACKcount = 0
    transmit new segment(s), as allowed
  duplicate ACK:
    dupACKcount++
  cwnd > ssthresh:
    Λ → congestion avoidance
  timeout:
    ssthresh = cwnd/2
    cwnd = 1 MSS
    dupACKcount = 0
    retransmit missing segment
  dupACKcount == 3:
    ssthresh = cwnd/2
    cwnd = ssthresh + 3
    retransmit missing segment → fast recovery

state: congestion avoidance
  new ACK:
    cwnd = cwnd + MSS • (MSS/cwnd)
    dupACKcount = 0
    transmit new segment(s), as allowed
  duplicate ACK:
    dupACKcount++
  timeout:
    ssthresh = cwnd/2
    cwnd = 1 MSS
    dupACKcount = 0
    retransmit missing segment → slow start
  dupACKcount == 3:
    ssthresh = cwnd/2
    cwnd = ssthresh + 3
    retransmit missing segment → fast recovery

state: fast recovery
  duplicate ACK:
    cwnd = cwnd + MSS
    transmit new segment(s), as allowed
  New ACK:
    cwnd = ssthresh
    dupACKcount = 0 → congestion avoidance
  timeout:
    ssthresh = cwnd/2
    cwnd = 1
    dupACKcount = 0
    retransmit missing segment → slow start
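The summary FSM translates fairly directly into code. The sketch below tracks only the window arithmetic and the state; it sends no segments. Measuring cwnd in units of MSS and taking the initial ssthresh as a parameter are assumptions made to keep the numbers readable, and the class name is illustrative.

```python
MSS = 1.0   # assumption: cwnd and ssthresh are measured in units of MSS


class RenoCwnd:
    """Window arithmetic for slow start, congestion avoidance, fast recovery."""

    def __init__(self, ssthresh):
        self.state = "slow start"
        self.cwnd = 1.0 * MSS
        self.ssthresh = ssthresh
        self.dup_acks = 0

    def new_ack(self):
        if self.state == "fast recovery":
            self.cwnd = self.ssthresh
            self.state = "congestion avoidance"
        elif self.state == "slow start":
            self.cwnd += MSS                        # doubles cwnd every RTT
        else:
            self.cwnd += MSS * (MSS / self.cwnd)    # about 1 MSS per RTT
        self.dup_acks = 0
        if self.state == "slow start" and self.cwnd > self.ssthresh:
            self.state = "congestion avoidance"

    def dup_ack(self):
        if self.state == "fast recovery":
            self.cwnd += MSS
            return
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3
            self.state = "fast recovery"

    def timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1.0 * MSS
        self.dup_acks = 0
        self.state = "slow start"
```

Starting from `RenoCwnd(ssthresh=4.0)`, four new ACKs take cwnd from 1 to 5 and switch the state to congestion avoidance; three duplicate ACKs then move it to fast recovery.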

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is ¾ W
  – avg. throughput is ¾ W per RTT

  $\text{avg TCP throughput} = \frac{3}{4}\cdot\frac{W}{\text{RTT}}$ bytes/sec

[figure: sawtooth window oscillating between W/2 and W, with average ¾ W]

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

  $\text{TCP throughput} = \frac{1.22\cdot\text{MSS}}{\text{RTT}\cdot\sqrt{L}}$

• to achieve 10 Gbps throughput, need a loss rate of $L = 2\cdot10^{-10}$, a very small loss rate!
• new versions of TCP for high-speed
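The figures in the example can be checked with a few lines of Python. The function names are hypothetical; the formula is the one on the slide, with MSS expressed in bits so that the result is in bits per second.

```python
from math import sqrt


def mathis_throughput(mss_bits, rtt_s, loss):
    """TCP throughput in bits/sec for segment loss probability `loss`."""
    return 1.22 * mss_bits / (rtt_s * sqrt(loss))


def required_loss(mss_bits, rtt_s, target_bps):
    """Loss probability needed to sustain target_bps (the formula inverted)."""
    return (1.22 * mss_bits / (rtt_s * target_bps)) ** 2


def required_window_segments(mss_bits, rtt_s, target_bps):
    """In-flight segments needed to fill the pipe: throughput * RTT / MSS."""
    return target_bps * rtt_s / mss_bits


mss_bits = 1500 * 8     # 1500-byte segments
rtt_s = 0.100           # 100 ms RTT
target = 10e9           # 10 Gbps
print(required_window_segments(mss_bits, rtt_s, target))
print(required_loss(mss_bits, rtt_s, target))
```

The window comes out at roughly 83,333 segments and the loss rate at roughly 2·10^-10, in line with the slide.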

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[figure: Connection 2 throughput versus Connection 1 throughput, both axes up to R, with the equal bandwidth share line and the full bandwidth utilization line; the trajectory alternates "congestion avoidance: additive increase" and "loss: decrease window by factor of 2"]

(X1, Y1) where X1+Y1 = R
(X2, Y2) where X2 = Y2
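The convergence argument can be tried out with a toy simulation. It assumes, as the figure does, that both sessions see every loss at the same time and have the same RTT; the function name and the one-unit-per-round increase are illustrative choices.

```python
def aimd_two_flows(x, y, capacity, rounds):
    """Throughputs of two AIMD flows sharing a link, after `rounds` steps."""
    for _ in range(rounds):
        if x + y > capacity:
            # loss: multiplicative decrease, which also halves the gap x - y
            x, y = x / 2, y / 2
        else:
            # congestion avoidance: additive increase, gap x - y unchanged
            x, y = x + 1, y + 1
    return x, y


x, y = aimd_two_flows(x=1.0, y=80.0, capacity=100.0, rounds=500)
print(abs(x - y))
```

Additive increase moves the point parallel to the equal-share line, and each halving moves it toward the origin, so the gap between the two flows shrinks with every loss.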

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2
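The parallel-connection arithmetic, assuming every connection gets an equal share of the link, can be checked with a hypothetical helper:

```python
from fractions import Fraction


def app_share(existing_connections, new_connections):
    """Fraction of link rate R that the new app's connections get in total."""
    total = existing_connections + new_connections
    return Fraction(new_connections, total)


print(app_share(9, 1))    # one connection among ten
print(app_share(9, 11))   # eleven connections among twenty
```

With 11 connections the exact share is 11/20 of R, which the slide rounds to R/2.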

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[figure: source and destination protocol stacks (application, transport, network, link, physical); the IP datagram leaves the source with ECN=00 and arrives marked ECN=11; the destination returns a TCP ACK segment with ECE=1]


rdt1.0: reliable transfer over a reliable channel

• underlying channel perfectly reliable
  – no bit errors
  – no loss of packets
• separate FSMs for sender, receiver:
  – sender sends data into underlying channel
  – receiver reads data from underlying channel

sender
state: Wait for call from above
  rdt_send(data)
    packet = make_pkt(data)
    udt_send(packet)

receiver
state: Wait for call from below
  rdt_rcv(packet)
    extract(packet, data)
    deliver_data(data)

rdt2.0: channel with bit errors

• underlying channel may flip bits in packet
  – checksum to detect bit errors
• the question: how to recover from errors:
  – acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
  – negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
  – sender retransmits pkt on receipt of NAK
• new mechanisms in rdt2.0 (beyond rdt1.0):
  – error detection
  – receiver feedback: control msgs (ACK, NAK) rcvr->sender

How do humans recover from "errors" during conversation?


rdt2.0: FSM specification

sender
state: Wait for call from above
  rdt_send(data)
    sndpkt = make_pkt(data, checksum)
    udt_send(sndpkt)
    → Wait for ACK or NAK
state: Wait for ACK or NAK
  rdt_rcv(rcvpkt) && isNAK(rcvpkt)
    udt_send(sndpkt)
  rdt_rcv(rcvpkt) && isACK(rcvpkt)
    Λ
    → Wait for call from above

receiver
state: Wait for call from below
  rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    udt_send(NAK)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
    extract(rcvpkt, data)
    deliver_data(data)
    udt_send(ACK)

rdt2.0: operation with no errors

(same FSM as in the specification; transitions taken:)
1. sender: rdt_send(data); sndpkt = make_pkt(data, checksum); udt_send(sndpkt)
2. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt); extract, deliver_data, udt_send(ACK)
3. sender: rdt_rcv(rcvpkt) && isACK(rcvpkt); Λ, back to Wait for call from above

rdt2.0: error scenario

(same FSM as in the specification; transitions taken:)
1. sender: rdt_send(data); sndpkt = make_pkt(data, checksum); udt_send(sndpkt)
2. receiver: rdt_rcv(rcvpkt) && corrupt(rcvpkt); udt_send(NAK)
3. sender: rdt_rcv(rcvpkt) && isNAK(rcvpkt); udt_send(sndpkt)
4. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt); extract, deliver_data, udt_send(ACK)

rdt2.0 has a fatal flaw!

what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver!
• can't just retransmit: possible duplicate

handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt

stop and wait: sender sends one packet, then waits for receiver response

rdt2.1: sender, handles garbled ACK/NAKs

state: Wait for call 0 from above
  rdt_send(data)
    sndpkt = make_pkt(0, data, checksum)
    udt_send(sndpkt)
    → Wait for ACK or NAK 0
state: Wait for ACK or NAK 0
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
    udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
    Λ
    → Wait for call 1 from above
state: Wait for call 1 from above
  rdt_send(data)
    sndpkt = make_pkt(1, data, checksum)
    udt_send(sndpkt)
    → Wait for ACK or NAK 1
state: Wait for ACK or NAK 1
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
    udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
    Λ
    → Wait for call 0 from above

rdt2.1: receiver, handles garbled ACK/NAKs

state: Wait for 0 from below
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(ACK, chksum)
    udt_send(sndpkt)
    → Wait for 1 from below
  rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    sndpkt = make_pkt(NAK, chksum)
    udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
    sndpkt = make_pkt(ACK, chksum)
    udt_send(sndpkt)
state: Wait for 1 from below
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(ACK, chksum)
    udt_send(sndpkt)
    → Wait for 0 from below
  rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    sndpkt = make_pkt(NAK, chksum)
    udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
    sndpkt = make_pkt(ACK, chksum)
    udt_send(sndpkt)

rdt2.1: Example 1

(transitions taken, one per slide, in the sender and receiver FSMs)
1. sender, Wait for call 0 from above: rdt_send(data)
     sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
2. receiver, Wait for 0 from below: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
     sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender, Wait for ACK or NAK 0: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
     udt_send(sndpkt)
4. receiver, Wait for 0 from below: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
     extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
5. sender, Wait for ACK or NAK 0: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
     Λ

rdt2.1: Example 2

(transitions taken, one per slide, in the sender and receiver FSMs)
1. receiver, Wait for 0 from below: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
     extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
2. sender, Wait for ACK or NAK 0: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
     udt_send(sndpkt)
3. receiver, Wait for 1 from below: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
     sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
4. sender, Wait for ACK or NAK 0: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
     Λ

rdt2.1: discussion

sender:
• seq # added to pkt
• two seq. #'s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  – state must "remember" whether "expected" pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  – state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender

rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  – receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt

rdt2.2: sender, receiver fragments

sender FSM fragment
state: Wait for call 0 from above
  rdt_send(data)
    sndpkt = make_pkt(0, data, checksum)
    udt_send(sndpkt)
    → Wait for ACK 0
state: Wait for ACK 0
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1))
    udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0)
    Λ

receiver FSM fragment
state: Wait for 0 from below
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt))
    udt_send(sndpkt)
transition into Wait for 0 from below:
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(ACK1, chksum)
    udt_send(sndpkt)

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
– checksum, seq #, ACKs, retransmissions will be of help … but not enough

approach: sender waits “reasonable” amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  – retransmission will be duplicate, but seq #’s already handle this
  – receiver must specify seq # of pkt being ACKed
• requires countdown timer

rdt3.0 sender

state: Wait for call 0 from above
  rdt_send(data)
    sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer   → Wait for ACK0
  rdt_rcv(rcvpkt)
    Λ

state: Wait for ACK0
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1))
    Λ
  timeout
    udt_send(sndpkt); start_timer
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0)
    stop_timer   → Wait for call 1 from above

state: Wait for call 1 from above
  rdt_send(data)
    sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer   → Wait for ACK1
  rdt_rcv(rcvpkt)
    Λ

state: Wait for ACK1
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0))
    Λ
  timeout
    udt_send(sndpkt); start_timer
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1)
    stop_timer   → Wait for call 0 from above
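The rdt3.0 sender FSM can be sketched as a small event-driven class. This is only an illustration under stated assumptions: the class name Rdt30Sender, the `sent` list standing in for udt_send, the boolean `timer_running` standing in for a real countdown timer, and returning False when data arrives while an ACK is outstanding are all choices made here, not part of the slides.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Rdt30Sender:
    """Alternating-bit (stop-and-wait) sender: one packet in flight, one timer."""
    seq: int = 0                      # sequence number of the current packet (0 or 1)
    waiting_for_ack: bool = False     # False: "Wait for call", True: "Wait for ACK"
    sndpkt: Optional[tuple] = None    # last packet sent, kept for retransmission
    timer_running: bool = False       # stands in for start_timer / stop_timer
    sent: list = field(default_factory=list)  # stands in for udt_send()

    def rdt_send(self, data):
        """Call from above. Assumption: refuse data while an ACK is outstanding."""
        if self.waiting_for_ack:
            return False
        self.sndpkt = (self.seq, data)
        self.sent.append(self.sndpkt)
        self.timer_running = True
        self.waiting_for_ack = True
        return True

    def rdt_rcv(self, ack_seq, corrupt=False):
        """ACK arrival. Corrupt, unexpected, or wrong-numbered ACKs do nothing (Λ)."""
        if not self.waiting_for_ack or corrupt or ack_seq != self.seq:
            return
        self.timer_running = False    # stop_timer
        self.waiting_for_ack = False
        self.seq ^= 1                 # alternate between 0 and 1

    def timeout(self):
        """Timer expiry: retransmit the current packet and restart the timer."""
        if self.waiting_for_ack:
            self.sent.append(self.sndpkt)
            self.timer_running = True


s = Rdt30Sender()
s.rdt_send("a")      # pkt0 goes out
s.timeout()          # pkt0 is retransmitted
s.rdt_rcv(0)         # ack0 arrives, sender moves on to seq 1
```

Ignoring a wrong-numbered ACK instead of retransmitting immediately is what lets the timer alone drive retransmission, which matches the Λ actions in the FSM.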

rdt3.0 in action

(a) no loss
  sender: send pkt0 → receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 → receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0 → receiver: rcv pkt0, send ack0

(b) packet loss
  sender: send pkt0 → receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 → pkt1 lost (X)
  sender: timeout, resend pkt1 → receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0 → receiver: rcv pkt0, send ack0

rdt3.0 in action

(c) ACK loss
  sender: send pkt0 → receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 → receiver: rcv pkt1, send ack1 → ack1 lost (X)
  sender: timeout, resend pkt1 → receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0 → receiver: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
  sender: send pkt0 → receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 → receiver: rcv pkt1, send ack1 (ack1 delayed)
  sender: timeout, resend pkt1 → receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0 → receiver: rcv pkt0, send ack0
  sender: rcv ack1, do nothing
  sender: rcv ack0, send pkt1

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

\[ D_{trans} = \frac{L}{R} = \frac{8000 \text{ bits}}{10^{9} \text{ bits/sec}} = 8 \text{ microsecs} \]

• U_sender: utilization, the fraction of time sender is busy sending

\[ U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027 \]

• if RTT = 30 msec, 1KB pkt every 30 msec: 33kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!

rdt3.0: stop-and-wait operation

[Figure: sender/receiver timing diagram]
• first packet bit transmitted, t = 0
• last packet bit transmitted, t = L / R
• first packet bit arrives
• last packet bit arrives, send ACK
• ACK arrives, send next packet, t = RTT + L / R

\[ U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027 \]

Pipelined protocols

pipelining: sender allows multiple, “in-flight”, yet-to-be-acknowledged pkts
– range of sequence numbers must be increased
– buffering at sender and/or receiver

• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

[Figure: sender/receiver timing diagram]
• first packet bit transmitted, t = 0
• last bit transmitted, t = L / R
• first packet bit arrives
• last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• ACK arrives, send next packet, t = RTT + L / R

3-packet pipelining increases utilization by a factor of 3!

\[ U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008 \]
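The utilization formula is easy to check numerically. A minimal sketch, assuming the slide's example values (8000-bit packets, 1 Gbps link, 30 ms RTT); the function name and parameter names are made up here for illustration.

```python
def sender_utilization(l_bits, r_bps, rtt_s, n_pkts=1):
    """Fraction of time the sender is busy when n_pkts are sent per round trip.

    U = (n * L/R) / (RTT + L/R), capped at 1.0 once the pipe is full.
    """
    d_trans = l_bits / r_bps                      # time to push one packet onto the link
    return min(1.0, n_pkts * d_trans / (rtt_s + d_trans))


u1 = sender_utilization(8000, 1e9, 0.030)             # stop-and-wait
u3 = sender_utilization(8000, 1e9, 0.030, n_pkts=3)   # 3-packet pipelining
print(f"stop-and-wait: {u1:.5f}, 3-packet pipeline: {u3:.5f}")
```

With these numbers the sender is busy well under 0.1% of the time in both cases, which is why much larger windows are needed on fast, long links.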

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  – doesn’t ack packet if there’s a gap
• sender has timer for oldest unacked packet
  – when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  – when timer expires, retransmit only that unacked packet

Go-Back-N: sender

• k-bit seq # in pkt header
• “window” of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: “cumulative ACK”
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

initially: base = 1; nextseqnum = 1
state: Wait

rdt_send(data)
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)

timeout
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  …
  udt_send(sndpkt[nextseqnum-1])

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer

rdt_rcv(rcvpkt) && corrupt(rcvpkt)
  Λ
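The GBN sender FSM translates almost line for line into code. The sketch below is illustrative only: the class name GbnSender, the `wire` list standing in for udt_send, and the boolean timer are assumptions made here. It also guards against `base` moving backwards on an old ACK, which the FSM on the slide does not spell out.

```python
class GbnSender:
    """Go-Back-N sender with a window of N packets and a single timer."""

    def __init__(self, window_size):
        self.N = window_size
        self.base = 1                 # oldest unacknowledged sequence number
        self.nextseqnum = 1           # next sequence number to use
        self.sndpkt = {}              # seq -> packet, kept until cumulatively ACKed
        self.timer_running = False    # stands in for start_timer / stop_timer
        self.wire = []                # stands in for udt_send()

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False              # refuse_data: window is full
        pkt = (self.nextseqnum, data)
        self.sndpkt[self.nextseqnum] = pkt
        self.wire.append(pkt)
        if self.base == self.nextseqnum:
            self.timer_running = True
        self.nextseqnum += 1
        return True

    def timeout(self):
        """Go back N: resend every packet from base up to nextseqnum-1."""
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.wire.append(self.sndpkt[seq])

    def rdt_rcv(self, acknum, corrupt=False):
        if corrupt:
            return                    # Λ: ignore corrupted ACKs
        # Assumption: never let base move backwards on a stale ACK.
        self.base = max(self.base, acknum + 1)
        for seq in [s for s in self.sndpkt if s < self.base]:
            del self.sndpkt[seq]
        # stop_timer when nothing is outstanding, otherwise (re)start it.
        self.timer_running = self.base != self.nextseqnum


g = GbnSender(window_size=4)
accepted = [g.rdt_send(x) for x in "abcde"]   # the fifth call is refused
```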


GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
– may generate duplicate ACKs
– need only remember expectedseqnum
• out-of-order pkt:
  – discard (don’t buffer): no receiver buffering!
  – re-ACK pkt with highest in-order seq #

initially: expectedseqnum = 1; sndpkt = make_pkt(0, ACK, chksum)
state: Wait

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum)
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++

default
  udt_send(sndpkt)
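The receiver side is even smaller, since it keeps only expectedseqnum and the last ACK it built. A minimal sketch, with the class name GbnReceiver and the return-the-ACK-number convention assumed here for illustration:

```python
class GbnReceiver:
    """Go-Back-N receiver: delivers in-order packets, re-ACKs everything else."""

    def __init__(self):
        self.expectedseqnum = 1
        self.last_ack = 0          # ACK 0 is prepared before any packet arrives
        self.delivered = []        # stands in for deliver_data()

    def rdt_rcv(self, seq, data, corrupt=False):
        """Return the ACK number sent in response to this packet."""
        if not corrupt and seq == self.expectedseqnum:
            self.delivered.append(data)
            self.last_ack = self.expectedseqnum
            self.expectedseqnum += 1
        # default case: out-of-order or corrupt, so resend the last ACK
        return self.last_ack


rcv = GbnReceiver()
acks = [rcv.rdt_rcv(seq, data) for seq, data in [(1, "a"), (2, "b"), (4, "d"), (3, "c"), (4, "d")]]
```

Packet 4 is discarded the first time because packet 3 has not arrived yet, so the sender has to send it again.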


GBN in action

sender window (N=4), seq #’s 0 1 2 3 4 5 6 7 8

sender                          receiver
send pkt0                       receive pkt0, send ack0
send pkt1                       receive pkt1, send ack1
send pkt2 (X loss)
send pkt3                       receive pkt3, discard, (re)send ack1
(wait)
rcv ack0, send pkt4             receive pkt4, discard, (re)send ack1
rcv ack1, send pkt5             receive pkt5, discard, (re)send ack1
ignore duplicate ACK
pkt 2 timeout
send pkt2                       rcv pkt2, deliver, send ack2
send pkt3                       rcv pkt3, deliver, send ack3
send pkt4                       rcv pkt4, deliver, send ack4
send pkt5                       rcv pkt5, deliver, send ack5


Selective repeat

• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #’s
  – limits seq #s of sent, unACKed packets

Selective repeat sender receiver windows

54

Selective repeat

sender

data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver

pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore

Selective repeat in action

sender window (N=4), seq #’s 0 1 2 3 4 5 6 7 8

sender                          receiver
send pkt0                       receive pkt0, send ack0
send pkt1                       receive pkt1, send ack1
send pkt2 (X loss)
send pkt3                       receive pkt3, buffer, send ack3
(wait)
rcv ack0, send pkt4             receive pkt4, buffer, send ack4
rcv ack1, send pkt5             receive pkt5, buffer, send ack5
record ack3 arrived
pkt 2 timeout
send pkt2                       rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
record ack4 arrived
record ack5 arrived

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #’s: 0, 1, 2, 3
• window size = 3

[Figure: sender window (after receipt) and receiver window (after receipt) over the seq # space 0 1 2 3 0 1 2, for two scenarios]

(a) no problem
  sender sends pkt0, pkt1, pkt2; receiver window advances and will accept packet with seq number 0
  sender goes on to pkt3 and then a new pkt0

(b) oops!
  sender sends pkt0, pkt1, pkt2; receiver window advances and will accept packet with seq number 0
  ACKs are lost (X X X); sender times out and retransmits pkt0

• receiver can’t see sender side
• receiver behavior identical in both cases!
• something’s (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
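The standard answer to the question is that the window may be at most half the sequence-number space. A small brute-force check illustrates why. The sketch assumes the worst case from scenario (b): the receiver has advanced by a full window while the sender may still retransmit its whole old window. The function name is hypothetical.

```python
def sr_window_is_safe(seq_space, window):
    """True if a retransmitted old packet can never be mistaken for a new one."""
    # Sequence numbers the sender may still retransmit (old window).
    old = {s % seq_space for s in range(window)}
    # Sequence numbers the receiver now treats as new (advanced window).
    new = {s % seq_space for s in range(window, 2 * window)}
    return old.isdisjoint(new)


print(sr_window_is_safe(4, 3))   # the slide's example: seq #'s 0..3, window 3
print(sr_window_is_safe(4, 2))   # window reduced to half the seq # space
```

The two sets overlap exactly when 2 × window exceeds the size of the sequence-number space, which gives the rule window ≤ (seq # space) / 2.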

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no “message boundaries”
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver

TCP segment structure

[Figure: TCP segment layout, 32 bits wide]

  source port #             | dest port #
  sequence number
  acknowledgement number
  head len | not used | U A P R S F | receive window
  checksum                  | Urg data pointer
  options (variable length)
  application data (variable length)

• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)

TCP seq. numbers, ACKs

sequence numbers:
– byte stream “number” of first byte in segment’s data

acknowledgements:
– seq # of next byte expected from other side
– cumulative ACK

Q: how receiver handles out-of-order segments
– A: TCP spec doesn’t say, up to implementor

[Figure: outgoing segment from sender and incoming segment to sender (fields: source port #, dest port #, sequence number, acknowledgement number, A flag, rwnd, checksum, urg pointer), mapped onto the sender sequence number space with window size N: sent ACKed | sent, not-yet ACKed (“in-flight”) | usable but not yet sent | not usable]

Byte stream in TCP

[Figure: byte stream carrying an HTTP Get Message (K bytes). Labels: Window N bytes; 100th byte; TCP header (seq no = 100); M bytes; HTTP Get Message (K bytes); Cannot be transmitted now]

TCP seq. numbers, ACKs

simple telnet scenario (Host A, Host B)

1. User types ‘C’. A → B: Seq=42, ACK=79, data = ‘C’
2. Host B ACKs receipt of ‘C’, echoes back ‘C’. B → A: Seq=79, ACK=43, data = ‘C’
3. Host A ACKs receipt of echoed ‘C’. A → B: Seq=43, ACK=80

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT “smoother”
  – average several recent measurements, not just current SampleRTT

TCP round trip time, timeout

\[ EstimatedRTT = (1 - \alpha) \cdot EstimatedRTT + \alpha \cdot SampleRTT \]

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Figure: RTT, gaia.cs.umass.edu to fantasia.eurecom.fr; x-axis: time (seconds); y-axis: RTT (milliseconds), 100 to 350; curves: SampleRTT, EstimatedRTT]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus “safety margin”
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

\[ DevRTT = (1 - \beta) \cdot DevRTT + \beta \cdot |SampleRTT - EstimatedRTT| \]

(typically, β = 0.25)

\[ TimeoutInterval = EstimatedRTT + 4 \cdot DevRTT \]

(first term: estimated RTT; second term: “safety margin”)
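The three formulas combine into a small estimator. The sketch below makes two assumptions that the slides do not state: the first sample seeds EstimatedRTT directly, with DevRTT set to half of it, and DevRTT is updated using the EstimatedRTT value from before the current sample is folded in. The class and method names are illustrative.

```python
class RttEstimator:
    """EWMA-based RTT estimate and retransmission timeout."""

    def __init__(self, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = None
        self.dev_rtt = 0.0

    def update(self, sample_rtt):
        """Fold in one SampleRTT and return the new TimeoutInterval."""
        if self.estimated_rtt is None:
            # Assumption: seed the estimator from the first measurement.
            self.estimated_rtt = sample_rtt
            self.dev_rtt = sample_rtt / 2
        else:
            # Assumption: deviation is measured against the previous estimate.
            self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                            + self.beta * abs(sample_rtt - self.estimated_rtt))
            self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                                  + self.alpha * sample_rtt)
        return self.timeout_interval()

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator()
for sample_ms in (100, 120, 110, 300, 105):
    print(sample_ms, round(est.update(sample_ms), 1))
```

A single outlier such as the 300 ms sample moves EstimatedRTT only slightly but widens the safety margin noticeably, which is the intended behaviour.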

TCP reliable data transfer

• TCP creates rdt service on top of IP’s unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks

let’s initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control

TCP sender events

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments

TCP sender (simplified)

initially (Λ): NextSeqNum = InitialSeqNum; SendBase = InitialSeqNum
state: wait for event

data received from application above
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., “send”)
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
    start timer

timeout
  retransmit not-yet-acked segment with smallest seq. #
  start timer

ACK received, with ACK field value y
  if (y > SendBase) {
    SendBase = y
    /* SendBase–1: last cumulatively ACKed byte */
    if (there are currently not-yet-acked segments)
      start timer
    else stop timer
  }
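The three event handlers of the simplified sender can be sketched as follows. The class name, the `unacked` dictionary, and the `wire` list standing in for "pass segment to IP" are assumptions made for illustration; duplicate ACKs, flow control, and congestion control are ignored, as on the slide.

```python
class SimplifiedTcpSender:
    """Simplified TCP sender: cumulative ACKs and one retransmission timer."""

    def __init__(self, initial_seq_num=0):
        self.next_seq_num = initial_seq_num
        self.send_base = initial_seq_num
        self.unacked = {}            # seq # of first byte -> segment data
        self.timer_running = False
        self.wire = []               # stands in for passing segments to IP

    def on_app_data(self, data: bytes):
        seq = self.next_seq_num
        self.unacked[seq] = data
        self.wire.append((seq, data))
        self.next_seq_num += len(data)
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        """Retransmit only the not-yet-acked segment with the smallest seq #."""
        if self.unacked:
            seq = min(self.unacked)
            self.wire.append((seq, self.unacked[seq]))
            self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y       # bytes up to y-1 are now cumulatively ACKed
            for seq in [s for s, d in self.unacked.items() if s + len(d) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)


tcp = SimplifiedTcpSender(initial_seq_num=92)
tcp.on_app_data(b"x" * 8)     # Seq=92, 8 bytes of data
tcp.on_app_data(b"y" * 20)    # Seq=100, 20 bytes of data
tcp.on_ack(120)               # one cumulative ACK covers both segments
```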

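TCP: retransmission scenarios

lost ACK scenario (Host A, Host B)
1. A → B: Seq=92, 8 bytes of data
2. B → A: ACK=100, lost (X)
3. timeout at A; A → B: Seq=92, 8 bytes of data
4. B → A: ACK=100

premature timeout (Host A, Host B)
1. A → B: Seq=92, 8 bytes of data
2. A → B: Seq=100, 20 bytes of data
3. B → A: ACK=100, then ACK=120; A's timer expires before they arrive
4. timeout at A; A → B: Seq=92, 8 bytes of data
5. B → A: ACK=120
SendBase values annotated on A's timeline: 92, 100, 120, 120

TCP: retransmission scenarios

cumulative ACK (Host A, Host B)
1. A → B: Seq=92, 8 bytes of data
2. A → B: Seq=100, 20 bytes of data
3. B → A: ACK=100, lost (X)
4. B → A: ACK=120, arrives before the timeout
5. A → B: Seq=120, 15 bytes of data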

TCP ACK generation [RFC 5681]

event at receiver → TCP receiver action

• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  → delayed ACK: wait up to 500ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment, higher-than-expect seq #; gap detected
  → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap
  → immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data (“triple duplicate ACKs”), resend unacked segment with smallest seq #
  – likely that unacked segment lost, so don’t wait for timeout

TCP fast retransmit

fast retransmit after sender receipt of triple duplicate ACK (Host A, Host B)
1. A → B: Seq=92, 8 bytes of data
2. A → B: Seq=100, 20 bytes of data, lost (X)
3. B → A: ACK=100, followed by three more ACK=100 (3 DUP ACKs)
4. A → B: Seq=100, 20 bytes of data, resent before the timeout expires

TCP flow control

flow control: receiver controls sender, so sender won’t overflow receiver’s buffer by transmitting too much, too fast

application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)

[Figure: receiver protocol stack; segments from sender pass through IP code and TCP code (OS) into the TCP socket receiver buffers, then to the application process]

TCP flow control

• receiver “advertises” free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked (“in-flight”) data to receiver’s rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering; TCP segment payloads enter RcvBuffer, which holds buffered data and free buffer space (rwnd); data leaves to application process]

Connection Management

before exchanging data, sender/receiver “handshake”:
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

[Figure: client and server, each holding the following between application and network]
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client

client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

client state: CLOSED; server state: LISTEN

1. client: choose init seq num, x; send TCP SYN msg
   client → server: SYNbit=1, Seq=x
   client state: SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN
   server → client: SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1
   server state: SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK; this segment may contain client-to-server data
   client → server: ACKbit=1, ACKnum=y+1
   client state: ESTAB
4. server: received ACK(y) indicates client is live
   server state: ESTAB

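TCP 3-way handshake: FSM

states: closed, listen, SYN sent, SYN rcvd, ESTAB

closed → listen
  event: Socket connectionSocket = welcomeSocket.accept();
  action: Λ
closed → SYN sent
  event: Socket clientSocket = new Socket("hostname", "port number");
  action: SYN(seq=x)
listen → SYN rcvd
  event: SYN(x)
  action: SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
SYN sent → ESTAB
  event: SYNACK(seq=y, ACKnum=x+1)
  action: ACK(ACKnum=y+1)
SYN rcvd → ESTAB
  event: ACK(ACKnum=y+1)
  action: Λ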

TCP: closing a connection

• client, server each close their side of connection
  – send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

TCP: closing a connection

client state: ESTAB; server state: ESTAB

1. client: clientSocket.close()
   client → server: FINbit=1, seq=x
   client state: FIN_WAIT_1 (can no longer send but can receive data)
2. server → client: ACKbit=1, ACKnum=x+1
   server state: CLOSE_WAIT (can still send data)
   client state: FIN_WAIT_2 (wait for server close)
3. server → client: FINbit=1, seq=y
   server state: LAST_ACK (can no longer send data)
4. client → server: ACKbit=1, ACKnum=y+1
   client state: TIMED_WAIT (timed wait for 2*max segment lifetime), then CLOSED
   server state: CLOSED

The “Two Army Problem”

84

Principles of congestion control

congestion:
• informally: “too many sources sending too much data too fast for network to handle”
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
• a top-10 problem!

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity

[Figure: Host A and Host B send original data (λ_in) through unlimited shared output link buffers; throughput λ_out. Plots: λ_out vs. λ_in, axes marked at R/2; delay vs. λ_in, axis marked at R/2]

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: λ_in = λ_out
  – transport-layer input includes retransmissions: λ'_in ≥ λ_in

[Figure: Host A, Host B; finite shared output link buffers; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out]

Causes/costs of congestion: scenario 2

idealization: perfect knowledge
• sender sends only when router buffers available

[Figure: sender keeps a copy and sends when there is free buffer space; plot of λ_out vs. λ_in, axes marked at R/2]

Causes/costs of congestion: scenario 2

idealization: known loss
packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
• when sending at R/2, some packets are retransmissions but asymptotic goodput is still R/2 (why?)

[Figure: packet dropped when there is no buffer space; copy resent when there is free buffer space; plot of λ_out vs. λ'_in, axes marked at R/2]

Causes/costs of congestion: scenario 2

realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered!

[Figure: timeout at sender while the original is still queued; copy sent into free buffer space; plot of λ_out vs. λ'_in, axes marked at R/2]

“costs” of congestion:
• more work (retrans) for given “goodput”
• unneeded retransmissions: link carries multiple copies of pkt
  – decreasing goodput

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

[Figure: Hosts A, B, C, D; finite shared output link buffers; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out]

Causes/costs of congestion: scenario 3

another “cost” of congestion:
• when packet dropped, any “upstream” transmission capacity used for that packet was wasted!

[Figure: throughput by blue traffic (λ_out) vs. offered load by Host A (λ'_in), axes marked at R/2; annotation: bandwidth wastage for packets dropped at the 2nd router]

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss

[Figure: AIMD saw tooth behavior, probing for bandwidth; cwnd (TCP sender congestion window size) vs. time; additively increase window size … until loss occurs (then cut window in half)]

TCP Congestion Control: details

• sender limits transmission:

\[ LastByteSent - LastByteAcked \le cwnd \]

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

\[ rate \approx \frac{cwnd}{RTT} \text{ bytes/sec} \]

[Figure: sender sequence number space; last byte ACKed | sent, not-yet ACKed (“in-flight”), spanning cwnd | last byte sent]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A sends one segment, then two segments, then four segments to Host B in successive RTTs]

TCP: detecting, reacting to loss

• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

initially (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0   → slow start

state: slow start
  new ACK
    cwnd = cwnd+MSS; dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK
    dupACKcount++
  cwnd > ssthresh
    Λ   → congestion avoidance
  timeout
    ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
  dupACKcount == 3
    ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment   → fast recovery

state: congestion avoidance
  new ACK
    cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK
    dupACKcount++
  timeout
    ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment   → slow start
  dupACKcount == 3
    ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment   → fast recovery

state: fast recovery
  duplicate ACK
    cwnd = cwnd + MSS; transmit new segment(s), as allowed
  New ACK
    cwnd = ssthresh; dupACKcount = 0   → congestion avoidance
  timeout
    ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment   → slow start
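The congestion-control FSM can be exercised with a compact state machine. This is a sketch under stated assumptions, not a full TCP implementation: the class name RenoSender and its method names are invented here, "ssthresh + 3" is read as ssthresh plus 3 MSS, and actual segment transmission and retransmission are left out so that only cwnd, ssthresh, and the state are tracked.

```python
class RenoSender:
    """Tracks cwnd, ssthresh and state for slow start, CA and fast recovery."""

    def __init__(self, mss=1460, initial_ssthresh=64 * 1024):
        self.mss = mss
        self.cwnd = float(mss)
        self.ssthresh = float(initial_ssthresh)
        self.dup_acks = 0
        self.state = "slow_start"

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh                        # deflate the window
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += self.mss                            # doubles cwnd per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += self.mss * (self.mss / self.cwnd)   # about +1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += self.mss
            return
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss         # assumption: "+3" means 3 MSS
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = float(self.mss)
        self.dup_acks = 0
        self.state = "slow_start"


reno = RenoSender(mss=1, initial_ssthresh=8)   # count the window in segments
for _ in range(8):
    reno.on_new_ack()                          # slow start until cwnd exceeds ssthresh
print(reno.state, reno.cwnd)
```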

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is 3/4 W
  – avg. throughput is 3/4 W per RTT

\[ \text{avg TCP throughput} = \frac{3}{4} \cdot \frac{W}{RTT} \text{ bytes/sec} \]

[Figure: window saw tooth between W/2 and W, average at 3/4 W]

TCP Futures: TCP over “long, fat pipes”

• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

\[ \text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \sqrt{L}} \]

  → to achieve 10 Gbps throughput, need a loss rate of L = 2·10^{-10}, a very small loss rate!
• new versions of TCP for high-speed
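Both numbers in the example follow directly from the formulas. A short check, assuming throughput is expressed in bits per second and MSS in bytes; the function names are illustrative.

```python
def window_segments(rate_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain rate_bps: W = rate * RTT / MSS."""
    return rate_bps * rtt_s / (8 * mss_bytes)


def required_loss_rate(rate_bps, rtt_s, mss_bytes):
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L."""
    return (1.22 * 8 * mss_bytes / (rtt_s * rate_bps)) ** 2


w = window_segments(10e9, 0.100, 1500)
loss = required_loss_rate(10e9, 0.100, 1500)
print(f"W = {w:.0f} segments, L = {loss:.2e}")
```

The loss rate comes out at roughly 2·10^-10, i.e. about one lost segment in five billion.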

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput vs. Connection 1 throughput, both axes up to R; equal bandwidth share line; full bandwidth utilization line; trajectory alternates “congestion avoidance: additive increase” and “loss: decrease window by factor of 2”; marked points (X1, Y1) where X1+Y1 = R and (X2, Y2) where X2 = Y2]
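The convergence argument can be reproduced with a toy simulation. The model below is deliberately crude and rests on assumptions of its own: both flows see loss in the same round whenever their combined rate exceeds the capacity, both increase by the same step per round, and RTTs are equal.

```python
def aimd_two_flows(x, y, capacity, rounds=200, step=1.0):
    """Simulate two AIMD flows sharing one bottleneck; return their final rates."""
    for _ in range(rounds):
        if x + y > capacity:
            x, y = x / 2, y / 2          # loss: both cut their rate in half
        else:
            x, y = x + step, y + step    # no loss: both add the same increment
    return x, y


x, y = aimd_two_flows(1.0, 50.0, capacity=100.0)
print(f"flow 1: {x:.2f}, flow 2: {y:.2f}, gap: {abs(x - y):.2f}")
```

Additive increase leaves the gap between the two rates unchanged, while every multiplicative decrease halves it, so the rates drift toward the equal bandwidth share line.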

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2
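The parallel-connection example is simple arithmetic once it is assumed that every TCP connection on the link gets an equal share. A minimal sketch with a hypothetical helper:

```python
from fractions import Fraction


def app_share(existing_conns, new_conns):
    """Fraction of link rate R obtained by an app opening new_conns connections."""
    return Fraction(new_conns, existing_conns + new_conns)


print(app_share(9, 1))    # one connection among ten
print(app_share(9, 11))   # eleven connections among twenty
```

With 11 connections the new app holds 11 of 20 equal shares, slightly more than the R/2 quoted on the slide.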

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical); IP datagram marked ECN=00 at the source and ECN=11 after the congested router; TCP ACK segment returned with ECE=1]


rdt2.0: channel with bit errors

• underlying channel may flip bits in packet
  – checksum to detect bit errors
• the question: how to recover from errors:
  – acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
  – negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
  – sender retransmits pkt on receipt of NAK
• new mechanisms in rdt2.0 (beyond rdt1.0):
  – error detection
  – feedback: control msgs (ACK, NAK) from receiver to sender

How do humans recover from “errors” during conversation?

rdt20 FSM specification

17

Wait for call from above

sndpkt = make_pkt(data checksum)udt_send(sndpkt)

extract(rcvpktdata)deliver_data(data)udt_send(ACK)

rdt_rcv(rcvpkt) ampamp notcorrupt(rcvpkt)

rdt_rcv(rcvpkt) ampamp isACK(rcvpkt)

udt_send(sndpkt)

rdt_rcv(rcvpkt) ampampisNAK(rcvpkt)

udt_send(NAK)

rdt_rcv(rcvpkt) ampamp corrupt(rcvpkt)

Wait for ACK or NAK

Wait for call from belowsender

receiverrdt_send(data)

L

rdt2.0: operation with no errors

[Figure: the same rdt2.0 sender and receiver FSMs as on the specification slide.]

rdt2.0: error scenario

[Figure: the same rdt2.0 sender and receiver FSMs as on the specification slide.]

rdt2.0 has a fatal flaw!

what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver!
• can't just retransmit: possible duplicate

handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt

stop and wait: sender sends one packet, then waits for receiver response

rdt2.1: sender, handles garbled ACK/NAKs

• state "Wait for call 0 from above"
  - rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to "Wait for ACK or NAK 0"
• state "Wait for ACK or NAK 0"
  - rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)): udt_send(sndpkt)
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt): Λ; go to "Wait for call 1 from above"
• state "Wait for call 1 from above"
  - rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); go to "Wait for ACK or NAK 1"
• state "Wait for ACK or NAK 1"
  - rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)): udt_send(sndpkt)
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt): Λ; go to "Wait for call 0 from above"

rdt2.1: receiver, handles garbled ACK/NAKs

• state "Wait for 0 from below"
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt): extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); go to "Wait for 1 from below"
  - rdt_rcv(rcvpkt) && corrupt(rcvpkt): sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
• state "Wait for 1 from below"
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); go to "Wait for 0 from below"
  - rdt_rcv(rcvpkt) && corrupt(rcvpkt): sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt): sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)

rdt2.1: Example 1

[Figure sequence: walk-through on the rdt2.1 sender and receiver FSMs]
1. sender, in "Wait for call 0 from above": rdt_send(data); sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
2. receiver, in "Wait for 0 from below": rdt_rcv(rcvpkt) && corrupt(rcvpkt); sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender, in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)); udt_send(sndpkt)
4. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
5. sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt); Λ

rdt2.1: Example 2

[Figure sequence: walk-through on the rdt2.1 sender and receiver FSMs]
1. receiver, in "Wait for 0 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
2. sender, in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)); udt_send(sndpkt)
3. receiver, now in "Wait for 1 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt) (duplicate is not delivered again)
4. sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt); Λ

rdt2.1: discussion

sender:
• seq # added to pkt
• two seq #'s (0, 1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  - state must "remember" whether "expected" pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  - state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender

rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  - receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt
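The NAK-free receiver behaviour can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: packets are modelled as a sequence bit plus a `corrupt` flag instead of a real checksum, and the class name `Rdt22Receiver` and its method are hypothetical, not part of the slides.

```python
class Rdt22Receiver:
    """Alternating-bit receiver that answers with ACKs only (no NAKs)."""

    def __init__(self):
        self.expected = 0      # seq # (0 or 1) of the next new packet
        self.delivered = []    # data handed up to the application

    def on_packet(self, seq, data, corrupt=False):
        """Process one arriving packet; return the seq # carried in the ACK."""
        if not corrupt and seq == self.expected:
            self.delivered.append(data)   # new, intact packet: deliver it
            self.expected ^= 1            # flip 0 <-> 1
            return seq                    # ACK the packet just received
        # corrupt or duplicate packet: re-ACK the last packet received OK
        return self.expected ^ 1
```

A sender waiting for ACK 0 that sees ACK 1 (or a garbled ACK) treats it like a NAK and retransmits, which is why the ACK has to carry the sequence number.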

rdt2.2: sender, receiver fragments

sender FSM fragment
• state "Wait for call 0 from above"
  - rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to "Wait for ACK 0"
• state "Wait for ACK 0"
  - rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)): udt_send(sndpkt)
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0): Λ

receiver FSM fragment
• transition into "Wait for 0 from below"
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
• state "Wait for 0 from below"
  - rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)): udt_send(sndpkt)

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
  - checksum, seq #, ACKs, retransmissions will be of help ... but not enough

approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  - retransmission will be duplicate, but seq #'s already handle this
  - receiver must specify seq # of pkt being ACKed
• requires countdown timer

rdt3.0 sender

• state "Wait for call 0 from above"
  - rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK0"
  - rdt_rcv(rcvpkt): Λ
• state "Wait for ACK0"
  - rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)): Λ
  - timeout: udt_send(sndpkt); start_timer
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0): stop_timer; go to "Wait for call 1 from above"
• state "Wait for call 1 from above"
  - rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK1"
  - rdt_rcv(rcvpkt): Λ
• state "Wait for ACK1"
  - rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0)): Λ
  - timeout: udt_send(sndpkt); start_timer
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1): stop_timer; go to "Wait for call 0 from above"
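The timer-driven retransmission can be illustrated with a small simulation. This is a sketch under stated assumptions: the channel only loses packets (corruption is not modelled), losses are chosen deterministically by transmission index, and the function name `stop_and_wait` and its parameters are hypothetical.

```python
def stop_and_wait(messages, lose_data=(), lose_ack=()):
    """Simulate an rdt3.0-style alternating-bit sender/receiver pair.

    lose_data / lose_ack: collections of transmission indices (0-based,
    counting every data transmission) whose data packet or ACK is dropped.
    Returns (delivered_messages, number_of_data_transmissions).
    """
    delivered = []
    expected = 0          # receiver: next expected seq # (0 or 1)
    tx = 0                # data transmissions so far
    for i, msg in enumerate(messages):
        seq = i % 2       # alternating-bit sequence number
        while True:
            lost_data = tx in lose_data
            lost_ack = tx in lose_ack
            tx += 1
            if lost_data:
                continue              # no ACK arrives: timeout, retransmit
            if seq == expected:       # new packet: deliver exactly once
                delivered.append(msg)
                expected ^= 1
            # a duplicate is re-ACKed but not delivered again
            if lost_ack:
                continue              # ACK lost: timeout, retransmit
            break                     # ACK received: stop timer, next message
    return delivered, tx
```

With `lose_data={1}` and `lose_ack={3}`, three messages still arrive exactly once and in order, at the cost of two extra transmissions.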

rdt3.0 in action

[Figure: sender/receiver timelines for four scenarios]
(a) no loss: send pkt0, rcv pkt0, send ack0, rcv ack0; send pkt1, rcv pkt1, send ack1, rcv ack1; send pkt0, rcv pkt0, send ack0
(b) packet loss: pkt1 is lost (X); sender times out and resends pkt1; receiver rcv pkt1, send ack1; sender rcv ack1, send pkt0
(c) ACK loss: ack1 is lost (X); sender times out and resends pkt1; receiver rcv pkt1 (detect duplicate), send ack1; sender rcv ack1, send pkt0
(d) premature timeout / delayed ACK: ack1 is delayed; sender times out and resends pkt1; receiver rcv pkt1 (detect duplicate), send ack1; sender rcv ack1, send pkt0; on the second ack1 the sender does nothing; receiver rcv pkt0, send ack0

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

$$D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \text{microsecs}$$

• U_sender: utilization, the fraction of time sender busy sending

$$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$$

• if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
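The utilization figure on this slide can be reproduced numerically. The helper below is a minimal sketch; the function name `sender_utilization` and its `window` parameter (number of packets sent back-to-back per round trip, 1 for stop-and-wait) are assumptions made for illustration.

```python
def sender_utilization(L_bits, R_bps, rtt_s, window=1):
    """Fraction of time the sender is busy transmitting.

    L_bits: packet length in bits, R_bps: link rate in bits/sec,
    rtt_s: round-trip time in seconds, window: packets in flight per RTT.
    """
    d_trans = L_bits / R_bps                      # transmission delay L/R
    return min(1.0, window * d_trans / (rtt_s + d_trans))


u = sender_utilization(8000, 1e9, 0.030)
print(f"{u:.5f}")   # 0.00027
```

The same function with `window=3` gives roughly three times the stop-and-wait value, which is the motivation for pipelining.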

rdt3.0: stop-and-wait operation

[Figure: sender/receiver timeline]
• first packet bit transmitted, t = 0
• last packet bit transmitted, t = L / R
• first packet bit arrives
• last packet bit arrives, send ACK
• ACK arrives, send next packet, t = RTT + L / R

$$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$$

Pipelined protocols

pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver

• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

[Figure: sender/receiver timeline]
• first packet bit transmitted, t = 0
• last bit transmitted, t = L / R
• first packet bit arrives
• last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• ACK arrives, send next packet, t = RTT + L / R

3-packet pipelining increases utilization by a factor of 3!

$$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$$

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n ("cumulative ACK")
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

initially (Λ): base = 1; nextseqnum = 1

state "Wait":
• rdt_send(data):
    if (nextseqnum < base+N) {
        sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
        udt_send(sndpkt[nextseqnum])
        if (base == nextseqnum)
            start_timer
        nextseqnum++
    }
    else
        refuse_data(data)
• timeout:
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    ...
    udt_send(sndpkt[nextseqnum-1])
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
        stop_timer
    else
        start_timer
• rdt_rcv(rcvpkt) && corrupt(rcvpkt): Λ
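The sender FSM translates fairly directly into code. The class below is a minimal sketch, not a full implementation: the channel is replaced by a log of transmitted sequence numbers, the timer is a boolean flag, and the class and attribute names (`GBNSender`, `sent`, `timer_running`) are hypothetical. It also ignores ACKs below `base`, which is an added guard that the FSM on the slide does not show.

```python
class GBNSender:
    """Go-Back-N sender with window size N (seq #s are unbounded ints here)."""

    def __init__(self, N):
        self.N = N
        self.base = 1              # oldest unacked seq #
        self.nextseqnum = 1        # next seq # to use
        self.sndpkt = {}           # seq # -> buffered packet
        self.timer_running = False
        self.sent = []             # log of seq #s handed to the channel

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False                          # window full: refuse_data
        self.sndpkt[self.nextseqnum] = (self.nextseqnum, data)
        self.sent.append(self.nextseqnum)         # udt_send
        if self.base == self.nextseqnum:
            self.timer_running = True             # start_timer
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        if acknum < self.base:
            return                                # stale ACK (added guard)
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)            # cumulative ACK frees buffer
        self.base = acknum + 1
        # stop the timer if nothing is in flight, otherwise restart it
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        self.timer_running = True                 # start_timer
        for seq in range(self.base, self.nextseqnum):
            self.sent.append(seq)                 # resend every unacked pkt
```

A real sender would also wrap sequence numbers modulo 2^k for a k-bit header field; that is left out to keep the window logic visible.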


GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #

initially (Λ): expectedseqnum = 1; sndpkt = make_pkt(0, ACK, chksum)

state "Wait":
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++
• default: udt_send(sndpkt)
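The receiver side is even simpler, because it keeps a single counter. The sketch below mirrors the FSM under the assumption that corruption is signalled by a flag rather than a checksum; the class name `GBNReceiver` and the `last_ack` attribute are hypothetical.

```python
class GBNReceiver:
    """Go-Back-N receiver: delivers in order, never buffers out-of-order pkts."""

    def __init__(self):
        self.expectedseqnum = 1
        self.last_ack = 0        # mirrors the initial make_pkt(0, ACK, chksum)
        self.delivered = []

    def rdt_rcv(self, seq, data, corrupt=False):
        """Process one packet; return the seq # acknowledged in the reply."""
        if not corrupt and seq == self.expectedseqnum:
            self.delivered.append(data)           # deliver_data
            self.last_ack = self.expectedseqnum
            self.expectedseqnum += 1
        # default action: (re)send ACK for highest in-order seq #
        return self.last_ack
```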


GBN in action

[Figure: sender/receiver timeline; sender window (N=4) sliding over seq #s 0 1 2 3 4 5 6 7 8]
sender: send pkt0, send pkt1, send pkt2 (X, loss), send pkt3, (wait)
receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, discard, (re)send ack1
sender: rcv ack0, send pkt4; rcv ack1, send pkt5; ignore duplicate ACK
receiver: receive pkt4, discard, (re)send ack1; receive pkt5, discard, (re)send ack1
sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
receiver: rcv pkt2, deliver, send ack2; rcv pkt3, deliver, send ack3; rcv pkt4, deliver, send ack4; rcv pkt5, deliver, send ack5


Selective repeat

• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #s of sent, unACKed packets

Selective repeat: sender, receiver windows

Selective repeat

sender
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore

Selective repeat in action

[Figure: sender/receiver timeline; sender window (N=4) sliding over seq #s 0 1 2 3 4 5 6 7 8]
sender: send pkt0, send pkt1, send pkt2 (X, loss), send pkt3, (wait)
receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, buffer, send ack3
sender: rcv ack0, send pkt4; rcv ack1, send pkt5; record ack3 arrived
receiver: receive pkt4, buffer, send ack4; receive pkt5, buffer, send ack5
sender: pkt 2 timeout; send pkt2; record ack4 arrived; record ack5 arrived
receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

[Figure: sender window (after receipt) and receiver window (after receipt) over seq #s 0 1 2 3 0 1 2, for two scenarios]
(a) no problem: pkt0, pkt1, pkt2 are sent and acknowledged; pkt3 is lost (X); the next pkt0 carries new data; receiver will accept packet with seq number 0
(b) oops!: pkt0, pkt1, pkt2 are sent, but all three ACKs are lost (X X X); sender times out and retransmits pkt0; receiver will accept packet with seq number 0

receiver can't see sender side. receiver behavior identical in both cases! something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
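The standard answer to this question is that the window size must be at most half the size of the sequence-number space. The sketch below checks that rule and lists the ambiguous sequence numbers in the worst case; both function names are hypothetical and the worst case modelled (every ACK of a full window lost) is the one from scenario (b).

```python
def sr_window_is_safe(seq_space, window):
    """True if a selective-repeat receiver can always tell old pkts from new."""
    return window <= seq_space // 2


def ambiguous_seq_numbers(seq_space, window):
    """Seq #s that could be either a retransmission or a new packet.

    Worst case: the receiver got packets 0..window-1 but every ACK was lost.
    The sender may still retransmit 0..window-1, while the receiver's window
    has already moved on to window..2*window-1 (taken modulo seq_space).
    """
    old = {s % seq_space for s in range(0, window)}
    new = {s % seq_space for s in range(window, 2 * window)}
    return sorted(old & new)
```

For the slide's example (4 sequence numbers, window of 3) the overlap is non-empty, so a retransmitted pkt0 can be mistaken for new data; with a window of 2 the two ranges are disjoint.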

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver

TCP segment structure

[Figure: TCP segment layout, 32 bits wide]
• source port #, dest port #
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• head len, not used, flag bits U A P R S F
  - URG: urgent data (generally not used)
  - ACK: ACK # valid
  - PSH: push data now
  - RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
• Urg data pointer
• options (variable length)
• application data (variable length)

TCP seq. numbers, ACKs

sequence numbers:
  - byte stream "number" of first byte in segment's data
acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK
Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say; up to implementor

[Figure: outgoing segment from sender and incoming segment to sender (fields: source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer), alongside the sender sequence number space with window size N: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable]

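Byte stream in TCP

[Figure: an HTTP Get Message (K bytes) viewed as a byte stream; a window of N bytes; a TCP segment whose header has seq no = 100 carries M bytes starting at the 100th byte; the part of the message outside the window cannot be transmitted now]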

TCP seq. numbers, ACKs

simple telnet scenario (Host A, Host B):
1. User types 'C'. A -> B: Seq=42, ACK=79, data = 'C'
2. host ACKs receipt of 'C', echoes back 'C'. B -> A: Seq=79, ACK=43, data = 'C'
3. host ACKs receipt of echoed 'C'. A -> B: Seq=43, ACK=80

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT

$$\mathrm{EstimatedRTT} = (1-\alpha)\cdot\mathrm{EstimatedRTT} + \alpha\cdot\mathrm{SampleRTT}$$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Figure: SampleRTT and EstimatedRTT, RTT: gaia.cs.umass.edu to fantasia.eurecom.fr; x-axis: time (seconds), 1 to 106; y-axis: RTT (milliseconds), 100 to 350]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

$$\mathrm{DevRTT} = (1-\beta)\cdot\mathrm{DevRTT} + \beta\cdot|\mathrm{SampleRTT}-\mathrm{EstimatedRTT}|$$

(typically, β = 0.25)

$$\mathrm{TimeoutInterval} = \mathrm{EstimatedRTT} + 4\cdot\mathrm{DevRTT}$$

(estimated RTT plus "safety margin")
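The two moving averages and the timeout formula can be combined into a small estimator. This is a sketch with labelled assumptions: the class name `RttEstimator` is hypothetical, the initial values (EstimatedRTT seeded with the first sample, DevRTT with half of it) follow the convention of RFC 6298 rather than anything stated on the slides, and DevRTT is updated using the previous EstimatedRTT.

```python
class RttEstimator:
    """EWMA-based RTT and timeout estimator (times in any consistent unit)."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample     # assumption: seed with 1st sample
        self.dev_rtt = first_sample / 2       # assumption: RFC 6298-style seed

    def update(self, sample_rtt):
        """Fold one SampleRTT (not from a retransmitted segment) into the state."""
        # deviation first, so it is measured against the previous EstimatedRTT
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt
```

A single outlier sample moves EstimatedRTT only by a factor α of its distance, but it inflates DevRTT and therefore the safety margin, which is the behaviour the slide asks for.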

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks

let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control

TCP sender events

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments

TCP sender (simplified)

initially (Λ): NextSeqNum = InitialSeqNum; SendBase = InitialSeqNum

state "wait for event":
• data received from application above:
    create segment, seq #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
        start timer
• timeout:
    retransmit not-yet-acked segment with smallest seq #
    start timer
• ACK received, with ACK field value y:
    if (y > SendBase) {
        SendBase = y
        /* SendBase-1: last cumulatively ACKed byte */
        if (there are currently not-yet-acked segments)
            start timer
        else stop timer
    }
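The three events of the simplified sender map onto three methods. The class below is a sketch only: "passing a segment to IP" is replaced by appending to a log, the timer is a boolean, and the names `SimpleTcpSender`, `wire` and `unacked` are hypothetical. Duplicate ACKs, flow control and congestion control are ignored, as on the slide.

```python
class SimpleTcpSender:
    """Simplified TCP sender: cumulative ACKs, one retransmission timer."""

    def __init__(self, initial_seq):
        self.next_seq_num = initial_seq
        self.send_base = initial_seq
        self.unacked = {}            # seq # of first byte -> payload
        self.timer_running = False
        self.wire = []               # log of (seq, payload) passed to IP

    def send(self, data):
        """Data received from the application above."""
        self.wire.append((self.next_seq_num, data))
        self.unacked[self.next_seq_num] = data
        self.next_seq_num += len(data)
        if not self.timer_running:
            self.timer_running = True            # start timer

    def timeout(self):
        """Retransmit the not-yet-acked segment with the smallest seq #."""
        if self.unacked:
            seq = min(self.unacked)
            self.wire.append((seq, self.unacked[seq]))
            self.timer_running = True            # restart timer

    def ack(self, y):
        """ACK received with ACK field value y (next byte expected)."""
        if y > self.send_base:
            self.send_base = y                   # bytes up to y-1 are ACKed
            for seq in [s for s in self.unacked
                        if s + len(self.unacked[s]) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)
```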

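TCP: retransmission scenarios

[Figure: Host A / Host B timelines for three scenarios]
lost ACK scenario:
• A sends Seq=92, 8 bytes of data; B replies ACK=100, which is lost (X)
• timeout at A; A resends Seq=92, 8 bytes of data; B replies ACK=100
• SendBase=92 before the ACK arrives, SendBase=100 after

premature timeout:
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data; B replies ACK=100 and ACK=120
• timeout at A before the ACKs arrive; A resends Seq=92, 8 bytes of data; B replies ACK=120
• SendBase=92, then SendBase=100, SendBase=120, SendBase=120

cumulative ACK:
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
• B's ACK=100 is lost (X), but ACK=120 arrives before the timeout
• A continues with Seq=120, 15 bytes of data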

TCP ACK generation [RFC 5681]

event at receiver -> TCP receiver action
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed -> delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending -> immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected -> immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap -> immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout
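Duplicate-ACK counting is easy to get off by one, so a short sketch may help. It assumes that "triple duplicate ACKs" means three duplicates arriving after the first ACK for the same byte (four ACKs with the same number in total); the function name `fast_retransmit_points` is hypothetical.

```python
def fast_retransmit_points(acks, threshold=3):
    """Return the ACK numbers at which a fast retransmit would fire.

    acks: ACK field values in arrival order. A retransmit of the segment
    starting at byte `ack` fires when the `threshold`-th duplicate arrives.
    """
    fired = []
    last_ack, dup_count = None, 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == threshold:
                fired.append(ack)      # resend segment starting at byte `ack`
        else:
            last_ack, dup_count = ack, 0
    return fired
```

Firing only when the count equals the threshold (not when it exceeds it) keeps further duplicates from triggering the same retransmission again.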

TCP fast retransmit

[Figure: fast retransmit after sender receipt of triple duplicate ACK]
• Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost (X)
• Host B returns ACK=100 four times (the original ACK plus 3 DUP ACKs)
• Host A resends Seq=100, 20 bytes of data, before the timeout expires

TCP flow control

[Figure: receiver protocol stack; segments from sender pass through IP code and TCP code (OS) into the TCP socket receiver buffers, which are read by the application process]

application may remove data from TCP socket buffers ... slower than TCP receiver is delivering (sender is sending)

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

TCP flow control

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering; TCP segment payloads enter RcvBuffer, which holds buffered data and free buffer space (rwnd); data leaves to the application process]
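The advertised window is simple arithmetic on the receive buffer. The sketch below uses the byte counters `last_byte_rcvd`, `last_byte_read`, `last_byte_sent` and `last_byte_acked`; these names are assumptions borrowed from the usual textbook presentation and do not appear on the slides.

```python
def rwnd(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free receive-buffer space the receiver advertises to the sender."""
    buffered = last_byte_rcvd - last_byte_read   # data not yet read by the app
    return rcv_buffer - buffered


def can_send(last_byte_sent, last_byte_acked, nbytes, rwnd_value):
    """True if sending nbytes more keeps in-flight data within rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + nbytes <= rwnd_value
```

For example, a 4096-byte buffer holding 2000 unread bytes advertises 2096 bytes, so a sender with 1000 bytes in flight may send at most 1096 more.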

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

[Figure: client and server each hold, between application and network:
  connection state: ESTAB
  connection variables: seq # client-to-server, server-to-client; rcvBuffer size at server, client]

client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

client state: CLOSED; server state: LISTEN

1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client enters SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server enters SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client enters ESTAB
4. server: received ACK(y) indicates client is live; server enters ESTAB
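The numbering in the three segments can be written out explicitly. The function below is a sketch that only builds the header fields mentioned on the slide; the dictionary layout and the name `three_way_handshake` are hypothetical. Sequence numbers are 32-bit, so the `+1` wraps modulo 2^32.

```python
SEQ_SPACE = 2 ** 32   # TCP sequence and ACK numbers are 32-bit values


def three_way_handshake(x, y):
    """Return the SYN, SYNACK and ACK header fields for initial seq #s x, y."""
    syn = {"SYNbit": 1, "Seq": x}
    synack = {"SYNbit": 1, "Seq": y,
              "ACKbit": 1, "ACKnum": (x + 1) % SEQ_SPACE}
    ack = {"ACKbit": 1, "ACKnum": (y + 1) % SEQ_SPACE}
    return [syn, synack, ack]
```

The SYN itself consumes one sequence number, which is why each side acknowledges the other's initial value plus one even though no data byte has been sent.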

TCP 3-way handshake: FSM

client:
• closed -> SYN sent: Socket clientSocket = new Socket("hostname", "port number"); send SYN(seq=x)
• SYN sent -> ESTAB: receive SYNACK(seq=y, ACKnum=x+1); send ACK(ACKnum=y+1)

server:
• closed -> listen: Socket connectionSocket = welcomeSocket.accept(); Λ
• listen -> SYN rcvd: receive SYN(x); send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
• SYN rcvd -> ESTAB: receive ACK(ACKnum=y+1); Λ

TCP: closing a connection

• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

[Figure: client state / server state timeline, both starting in ESTAB]
1. client: clientSocket.close(); sends FINbit=1, seq=x; enters FIN_WAIT_1 (can no longer send but can receive data)
2. server: sends ACKbit=1, ACKnum=x+1; enters CLOSE_WAIT (can still send data)
3. client: enters FIN_WAIT_2 (wait for server close)
4. server: sends FINbit=1, seq=y; enters LAST_ACK (can no longer send data)
5. client: sends ACKbit=1, ACKnum=y+1; enters TIMED_WAIT (timed wait for 2 * max segment lifetime), then CLOSED
6. server: on receiving the ACK, enters CLOSED

The "Two Army Problem"

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λin, approaches capacity

[Figure: Host A and Host B each send original data at rate λin into unlimited shared output link buffers; throughput: λout]
[Plots: λout versus λin (axes marked R/2); delay versus λin (axis marked R/2)]

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λin = λout
  - transport-layer input includes retransmissions: λ'in ≥ λin

[Figure: Host A and Host B share finite output link buffers; λin: original data; λ'in: original data, plus retransmitted data; λout]

idealization: perfect knowledge
• sender sends only when router buffers available
[Plot: λout versus λin, axes marked R/2]

idealization: known loss. packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
[Plot: λout versus λ'in, axes marked R/2] when sending at R/2, some packets are retransmissions but asymptotic goodput is still R/2 (why?)

realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
[Plot: λout versus λ'in, axes marked R/2] when sending at R/2, some packets are retransmissions, including duplicates that are delivered!

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λin and λ'in increase?
A: as red λ'in increases, all arriving blue pkts at upper queue are dropped, blue throughput -> 0

[Figure: Host A, Host B, Host C, Host D; finite shared output link buffers; λin: original data; λ'in: original data, plus retransmitted data; λout]

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!

[Plot: λout versus λ'in, axes marked R/2; throughput by blue traffic against offered load by Host A; annotation: bandwidth wastage for packets dropped at the 2nd router]

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss

[Figure: AIMD saw tooth behavior: probing for bandwidth; y-axis: cwnd, TCP sender congestion window size; x-axis: time; additively increase window size ... until loss occurs (then cut window in half)]

TCP Congestion Control: details

• sender limits transmission: LastByteSent - LastByteAcked ≤ cwnd
• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

$$\text{rate} \approx \frac{\mathrm{cwnd}}{\mathrm{RTT}}\ \text{bytes/sec}$$

[Figure: sender sequence number space; last byte ACKed; sent, not-yet ACKed ("in-flight"); last byte sent; cwnd]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A sends one segment, then two segments, then four segments to Host B, one batch per RTT]

TCP: detecting, reacting to loss

• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

[FSM with three states: slow start, congestion avoidance, fast recovery. Λ marks a transition with no event or no action.]

initial transition (Λ) into slow start:
    cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0

slow start:
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: Λ; go to congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment

congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start

fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment; go to slow start
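The three-state machine summarized on this slide can be sketched as a small Python class. This is an illustration, not a real TCP stack. As assumptions made here: the window is counted in segments (`MSS = 1`), the initial threshold is given in segments rather than the slide's 64 KB, and the class and method names (`RenoSender`, `on_new_ack`, `on_dup_ack`, `on_timeout`) are invented for the example. Retransmission itself is left out.

```python
MSS = 1  # assumption: measure cwnd in segments to keep the numbers readable

class RenoSender:
    def __init__(self, ssthresh=64):
        self.state = "slow_start"
        self.cwnd = 1 * MSS
        self.ssthresh = ssthresh      # assumed initial threshold, in segments
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "slow_start":
            self.cwnd += MSS                      # exponential growth per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        elif self.state == "congestion_avoidance":
            self.cwnd += MSS * MSS / self.cwnd    # about +1 MSS per RTT
        else:                                     # leaving fast recovery
            self.cwnd = self.ssthresh
            self.state = "congestion_avoidance"
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += MSS
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                    # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1 * MSS
        self.dup_acks = 0
        self.state = "slow_start"
```

For example, seven new ACKs in slow start bring cwnd to 8 segments; three duplicate ACKs then set ssthresh to 4 and cwnd to 7 and enter fast recovery; the next new ACK deflates cwnd to 4 and continues in congestion avoidance.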

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is 3/4 W
  – avg. throughput is 3/4 W per RTT

    avg TCP throughput = (3/4) · W / RTT bytes/sec

[Figure: sawtooth of window size oscillating between W/2 and W, with average 3/4 W]

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

    TCP throughput = 1.22 · MSS / (RTT · √L)

  to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
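The two numbers in this example can be checked with a short calculation. The sketch below assumes MSS is converted to bits so that it matches a rate in bits per second; the function names are made up for illustration.

```python
def window_segments(rate_bps, rtt_s, mss_bytes):
    """In-flight segments needed to fill the pipe: rate * RTT / segment size."""
    return rate_bps * rtt_s / (8 * mss_bytes)

def loss_for_rate(rate_bps, rtt_s, mss_bytes):
    """Invert throughput = 1.22 * MSS / (RTT * sqrt(L)) to solve for L."""
    return (1.22 * 8 * mss_bytes / (rtt_s * rate_bps)) ** 2

print(round(window_segments(10e9, 0.100, 1500)))   # 83333
print(f"{loss_for_rate(10e9, 0.100, 1500):.1e}")   # 2.1e-10
```

The second value agrees with the slide's L = 2·10^-10 to one significant figure.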

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput versus Connection 1 throughput, both axes from 0 to R, with the "equal bandwidth share" line and the "full bandwidth utilization" line. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2". Marked points: (X1, Y1) where X1 + Y1 = R; (X2, Y2) where X2 = Y2]
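The convergence argument can be tried numerically. The following sketch makes assumptions that go beyond the slide: both sessions have the same RTT, both detect loss in the same round whenever their combined rate exceeds the capacity, and rates are in arbitrary units. The function name is invented for the example.

```python
def two_flow_aimd(x, y, capacity, rounds):
    """Evolve two AIMD flows sharing one bottleneck; return their final rates."""
    for _ in range(rounds):
        if x + y > capacity:          # link overloaded: both flows see loss
            x, y = x / 2, y / 2       # multiplicative decrease halves the gap
        else:
            x, y = x + 1, y + 1       # additive increase keeps the gap unchanged
    return x, y

x, y = two_flow_aimd(1.0, 50.0, capacity=100.0, rounds=1000)
print(abs(x - y) < 1e-6)   # True
```

Additive increase leaves the difference between the two rates unchanged, while every loss event halves it, so the rates move toward the equal-share line.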

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical). An IP datagram leaves the source with ECN=00 and reaches the destination marked ECN=11; the destination returns a TCP ACK segment with ECE=1]


rdt2.0: channel with bit errors

• underlying channel may flip bits in packet
  – checksum to detect bit errors
• the question: how to recover from errors:
  – acknowledgements (ACKs): receiver explicitly tells sender that pkt received OK
  – negative acknowledgements (NAKs): receiver explicitly tells sender that pkt had errors
  – sender retransmits pkt on receipt of NAK
• new mechanisms in rdt2.0 (beyond rdt1.0):
  – error detection
  – feedback: control msgs (ACK, NAK) from receiver to sender

rdt2.0: FSM specification

sender
• state "Wait for call from above":
  – rdt_send(data): sndpkt = make_pkt(data, checksum); udt_send(sndpkt); go to "Wait for ACK or NAK"
• state "Wait for ACK or NAK":
  – rdt_rcv(rcvpkt) && isNAK(rcvpkt): udt_send(sndpkt)
  – rdt_rcv(rcvpkt) && isACK(rcvpkt): Λ; go to "Wait for call from above"

receiver
• state "Wait for call from below":
  – rdt_rcv(rcvpkt) && corrupt(rcvpkt): udt_send(NAK)
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt): extract(rcvpkt, data); deliver_data(data); udt_send(ACK)

rdt2.0: operation with no errors

[Figure: the rdt2.0 sender and receiver FSMs from the specification slide, with the error-free path highlighted: rdt_send(data) leads to sndpkt = make_pkt(data, checksum) and udt_send(sndpkt); the receiver gets an uncorrupted pkt, delivers the data and sends ACK; the sender receives the ACK and returns to "Wait for call from above"]

rdt2.0: error scenario

[Figure: the rdt2.0 sender and receiver FSMs from the specification slide, with the error path highlighted: the receiver gets a corrupted pkt and sends NAK; the sender receives the NAK and retransmits with udt_send(sndpkt); the retransmitted pkt arrives uncorrupted, is delivered and is ACKed]

rdt2.0 has a fatal flaw!

what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver!
• can't just retransmit: possible duplicate

handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt

stop and wait: sender sends one packet, then waits for receiver response

rdt2.1: sender, handles garbled ACK/NAKs

• state "Wait for call 0 from above":
  – rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to "Wait for ACK or NAK 0"
• state "Wait for ACK or NAK 0":
  – rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)): udt_send(sndpkt)
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt): Λ; go to "Wait for call 1 from above"
• state "Wait for call 1 from above":
  – rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); go to "Wait for ACK or NAK 1"
• state "Wait for ACK or NAK 1":
  – rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)): udt_send(sndpkt)
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt): Λ; go to "Wait for call 0 from above"

rdt2.1: receiver, handles garbled ACK/NAKs

• state "Wait for 0 from below":
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt): extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); go to "Wait for 1 from below"
  – rdt_rcv(rcvpkt) && corrupt(rcvpkt): sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
• state "Wait for 1 from below":
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); go to "Wait for 0 from below"
  – rdt_rcv(rcvpkt) && corrupt(rcvpkt): sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt): sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)

rdt2.1: Example 1

[Animation over the rdt2.1 sender and receiver FSMs, one highlighted transition per slide:]
1. sender, in "Wait for call 0 from above": rdt_send(data); sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
2. receiver, in "Wait for 0 from below": rdt_rcv(rcvpkt) && corrupt(rcvpkt); sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender, in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)); udt_send(sndpkt)
4. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
5. sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt); Λ
6. final slide: both FSMs shown with no transition highlighted

rdt2.1: Example 2

[Animation over the rdt2.1 sender and receiver FSMs, one highlighted transition per slide:]
1. receiver, in "Wait for 0 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
2. sender, in "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)); udt_send(sndpkt)
3. receiver, now in "Wait for 1 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt), a duplicate; sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
4. sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt); Λ
5. final slide: both FSMs shown with no transition highlighted

rdt2.1: discussion

sender:
• seq # added to pkt
• two seq. #'s (0, 1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  – state must "remember" whether "expected" pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  – state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender

rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  – receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt

rdt2.2: sender, receiver fragments

sender FSM fragment
• state "Wait for call 0 from above":
  – rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to "Wait for ACK 0"
• state "Wait for ACK 0":
  – rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)): udt_send(sndpkt)
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0): Λ

receiver FSM fragment
• transition into "Wait for 0 from below":
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
• state "Wait for 0 from below":
  – rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)): udt_send(sndpkt)

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
  – checksum, seq. #, ACKs, retransmissions will be of help ... but not enough

approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  – retransmission will be duplicate, but seq. #'s already handles this
  – receiver must specify seq # of pkt being ACKed
• requires countdown timer
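The timeout-and-retransmit idea can be exercised with a toy simulation. This sketch is not the slide's FSM. It assumes a channel that loses but never corrupts or reorders packets, and it models each timeout simply as "try again". The parameters `drop_data` and `drop_ack` are hypothetical, naming which transmission attempts (numbered from 0) lose their data packet or their ACK.

```python
def stop_and_wait(messages, drop_data=(), drop_ack=()):
    """Alternating-bit transfer over a lossy channel with scripted losses.

    Returns (messages delivered to the upper layer, total data transmissions).
    """
    delivered = []
    expected = 0                      # receiver: sequence bit it will accept
    attempt = 0                       # counts every data transmission
    for i, msg in enumerate(messages):
        seq = i % 2                   # sender alternates seq # 0, 1, 0, ...
        while True:
            this_attempt = attempt
            attempt += 1
            if this_attempt in drop_data:
                continue              # data lost: timeout, retransmit
            if seq == expected:       # new packet: deliver and flip the bit
                delivered.append(msg)
                expected ^= 1
            # a duplicate is not delivered again, but it is still ACKed
            if this_attempt in drop_ack:
                continue              # ACK lost: timeout, retransmit
            break                     # ACK arrived: move to next message
    return delivered, attempt

print(stop_and_wait(["a", "b", "c"], drop_data={1}, drop_ack={3}))
# (['a', 'b', 'c'], 5)
```

Five transmissions carry three messages: one extra for the lost data packet and one for the lost ACK, and the duplicate caused by the lost ACK is not delivered twice.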

rdt3.0 sender

• state "Wait for call 0 from above":
  – rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK0"
  – rdt_rcv(rcvpkt): Λ
• state "Wait for ACK0":
  – rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)): Λ
  – timeout: udt_send(sndpkt); start_timer
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0): stop_timer; go to "Wait for call 1 from above"
• state "Wait for call 1 from above":
  – rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK1"
  – rdt_rcv(rcvpkt): Λ
• state "Wait for ACK1":
  – rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0)): Λ
  – timeout: udt_send(sndpkt); start_timer
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1): stop_timer; go to "Wait for call 0 from above"

rdt3.0 in action

[Four sender/receiver timeline diagrams; S = sender, R = receiver]

(a) no loss
  S: send pkt0
  R: rcv pkt0, send ack0
  S: rcv ack0, send pkt1
  R: rcv pkt1, send ack1
  S: rcv ack1, send pkt0
  R: rcv pkt0, send ack0

(b) packet loss
  S: send pkt0
  R: rcv pkt0, send ack0
  S: rcv ack0, send pkt1 (pkt1 lost, X)
  S: timeout, resend pkt1
  R: rcv pkt1, send ack1
  S: rcv ack1, send pkt0
  R: rcv pkt0, send ack0

(c) ACK loss
  S: send pkt0
  R: rcv pkt0, send ack0
  S: rcv ack0, send pkt1
  R: rcv pkt1, send ack1 (ack1 lost, X)
  S: timeout, resend pkt1
  R: rcv pkt1 (detect duplicate), send ack1
  S: rcv ack1, send pkt0
  R: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
  S: send pkt0
  R: rcv pkt0, send ack0
  S: rcv ack0, send pkt1
  R: rcv pkt1, send ack1 (ack1 delayed)
  S: timeout, resend pkt1
  S: rcv ack1, send pkt0
  R: rcv pkt1 (detect duplicate), send ack1
  S: rcv ack1, do nothing
  R: rcv pkt0, send ack0
  S: rcv ack0, send pkt1

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

    Dtrans = L / R = 8000 bits / 10^9 bits/sec = 8 microsecs

  – U sender: utilization, fraction of time sender busy sending

    U sender = (L / R) / (RTT + L / R) = 0.008 / 30.008 = 0.00027

  – if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!

rdt3.0: stop-and-wait operation

[Timing diagram between sender and receiver:]
• first packet bit transmitted, t = 0
• last packet bit transmitted, t = L / R
• first packet bit arrives
• last packet bit arrives, send ACK
• ACK arrives, send next packet, t = RTT + L / R

    U sender = (L / R) / (RTT + L / R) = 0.008 / 30.008 = 0.00027

Pipelined protocols

pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver

• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

[Timing diagram between sender and receiver:]
• first packet bit transmitted, t = 0
• last bit transmitted, t = L / R
• first packet bit arrives
• last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• ACK arrives, send next packet, t = RTT + L / R

3-packet pipelining increases utilization by a factor of 3!

    U sender = (3L / R) / (RTT + L / R) = 0.024 / 30.008 ≈ 0.0008
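The stop-and-wait and 3-packet pipelining utilization figures follow from one formula. The sketch below uses the example's numbers (8000-bit packet, 1 Gbps link, 30 ms RTT). The function name and the `window` parameter are introduced here for illustration, and the result is capped at 1 because a sender cannot be busy more than all of the time.

```python
def sender_utilization(packet_bits, rate_bps, rtt_s, window=1):
    """Fraction of time the sender is busy when `window` packets go out per RTT."""
    d_trans = packet_bits / rate_bps              # L / R, in seconds
    return min(1.0, window * d_trans / (rtt_s + d_trans))

stop_and_wait_u = sender_utilization(8000, 1e9, 0.030)
pipelined_u = sender_utilization(8000, 1e9, 0.030, window=3)
print(f"{stop_and_wait_u:.5f} {pipelined_u:.5f}")   # 0.00027 0.00080
```

With these numbers, a window of roughly 3,750 packets would be needed before the sender is busy all of the time.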

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  – doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  – when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  – when timer expires, retransmit only that unacked packet

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n ("cumulative ACK")
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

initial transition (Λ): base = 1; nextseqnum = 1

single state "Wait", with events:

rdt_send(data):
    if (nextseqnum < base+N) {
        sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
        udt_send(sndpkt[nextseqnum])
        if (base == nextseqnum)
            start_timer
        nextseqnum++
    }
    else
        refuse_data(data)

timeout:
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    ...
    udt_send(sndpkt[nextseqnum-1])

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
        stop_timer
    else
        start_timer

rdt_rcv(rcvpkt) && corrupt(rcvpkt):
    Λ
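The GBN sender events translate almost line for line into Python. This is a sketch and not a network implementation. As assumptions made here: sequence numbers are unbounded integers rather than k-bit values, `udt_send` is replaced by appending to a list named `wire`, the timer is a boolean flag, and `max()` is used so that a stale ACK cannot move `base` backwards.

```python
class GBNSender:
    def __init__(self, window):
        self.N = window
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}              # seq -> data, kept for retransmission
        self.timer_running = False
        self.wire = []                # stand-in for udt_send: (seq, data) log

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False              # window full: refuse_data
        self.sndpkt[self.nextseqnum] = data
        self.wire.append((self.nextseqnum, data))
        if self.base == self.nextseqnum:
            self.timer_running = True
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        # cumulative ACK: everything up to and including acknum is acknowledged
        self.base = max(self.base, acknum + 1)
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):   # go back N
            self.wire.append((seq, self.sndpkt[seq]))
```

With a window of 4, a fifth `rdt_send` is refused until an ACK slides the window forward, and a timeout resends every packet from `base` up to `nextseqnum - 1`.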


GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don't buffer): no receiver buffering!
  – re-ACK pkt with highest in-order seq #

initial transition (Λ): expectedseqnum = 1; sndpkt = make_pkt(0, ACK, chksum)

single state "Wait", with events:

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++

default:
    udt_send(sndpkt)


GBN in action

[Timeline; sender window N = 4 sliding over seq #'s 0 1 2 3 4 5 6 7 8; S = sender, R = receiver]

  S: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
  R: receive pkt0, send ack0
  R: receive pkt1, send ack1
  R: receive pkt3, discard, (re)send ack1
  S: rcv ack0, send pkt4
  S: rcv ack1, send pkt5
  S: ignore duplicate ACK
  R: receive pkt4, discard, (re)send ack1
  R: receive pkt5, discard, (re)send ack1
  S: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
  R: rcv pkt2, deliver, send ack2
  R: rcv pkt3, deliver, send ack3
  R: rcv pkt4, deliver, send ack4
  R: rcv pkt5, deliver, send ack5


Selective repeat

• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #'s
  – limits seq #s of sent, unACKed packets

Selective repeat: sender, receiver windows

[Figure: sender and receiver views of the sequence number space]

Selective repeat

sender

data from above:
• if next available seq # in window, send pkt

timeout(n):
• resend pkt n, restart timer

ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver

pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt

pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)

otherwise:
• ignore
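The receiver rules can be sketched as follows. Assumptions for this illustration: sequence numbers are unbounded integers (no wrap-around), packets arrive uncorrupted, and the class name and its `on_packet` method are invented here. The method returns the ACK number to send, or `None` when the packet is ignored.

```python
class SRReceiver:
    def __init__(self, window):
        self.N = window
        self.rcvbase = 0
        self.buffer = {}              # out-of-order packets awaiting delivery
        self.delivered = []           # data handed to the upper layer, in order

    def on_packet(self, n, data):
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            # deliver the in-order run starting at rcvbase, advancing the window
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n                  # send ACK(n)
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n                  # already delivered: re-ACK only
        return None                   # otherwise: ignore
```

Feeding it packets 0, 1, 3, 4, 5 and then 2 with N = 4 buffers 3, 4 and 5 until 2 arrives, then delivers 2 through 5 in one step.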

Selective repeat in action

[Timeline; sender window N = 4 sliding over seq #'s 0 1 2 3 4 5 6 7 8; S = sender, R = receiver]

  S: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
  R: receive pkt0, send ack0
  R: receive pkt1, send ack1
  R: receive pkt3, buffer, send ack3
  S: rcv ack0, send pkt4
  S: rcv ack1, send pkt5
  S: record ack3 arrived
  R: receive pkt4, buffer, send ack4
  R: receive pkt5, buffer, send ack5
  S: pkt 2 timeout; send pkt2
  S: record ack4 arrived
  S: record ack5 arrived
  R: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

[Figure: two scenarios, each showing the sender window (after receipt) and the receiver window (after receipt) over the sequence 0 1 2 3 0 1 2.
(a) no problem: pkt0, pkt1, pkt2 are sent and received; the sender then sends pkt3 (lost, X) and a new pkt0; receiver will accept packet with seq number 0.
(b) oops!: pkt0, pkt1, pkt2 are sent and received, but the three ACKs are lost (X X X); sender times out and retransmits pkt0; receiver will accept packet with seq number 0.]

• receiver can't see sender side
• receiver behavior identical in both cases!
• something's (very) wrong!
  – receiver sees no difference in two scenarios!
  – duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
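One way to explore the question is to check, for a given sequence-number space and window size, whether the receiver's new window can contain a number that the sender might still retransmit from the old window. The sketch below assumes the worst case from scenario (b): the receiver has accepted a full window and every ACK was lost. The function name is made up for the example.

```python
def sr_window_is_safe(seq_space, window):
    """True if an old retransmission can never be mistaken for a new packet."""
    old = set(range(window))                                    # sender may resend these
    new = {n % seq_space for n in range(window, 2 * window)}    # receiver now accepts these
    return not (old & new)

print(sr_window_is_safe(4, 3))   # False: the slide's failing example
print(sr_window_is_safe(4, 2))   # True
```

Running the check over many sizes suggests the usual rule of thumb: the window must be no larger than half the sequence-number space.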

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver

TCP segment structure

[Figure: TCP segment layout, 32 bits wide, top to bottom:]
• source port #, dest port #
• sequence number
• acknowledgement number
• head len, not used, flag bits U A P R S F, receive window
• checksum, Urg data pointer
• options (variable length)
• application data (variable length)

annotations:
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• sequence and acknowledgement numbers: counting by bytes of data (not segments!)
• checksum: Internet checksum (as in UDP)

TCP seq. numbers, ACKs

sequence numbers:
  – byte stream "number" of first byte in segment's data

acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK

Q: how receiver handles out-of-order segments
  – A: TCP spec doesn't say, up to implementor

[Figure: outgoing segment from sender and incoming segment to sender, each with source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum and urg pointer fields. Sender sequence number space with window size N: sent and ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable]

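Byte stream in TCP

[Figure: an HTTP GET message of K bytes viewed as a byte stream. A window of N bytes covers the first M bytes, which are sent in a segment whose TCP header carries seq. no. = 100 (the 100th byte); the remainder of the message is marked "cannot be transmitted now"]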

TCP seq. numbers, ACKs

simple telnet scenario (Host A, Host B):
1. User types 'C'. A to B: Seq=42, ACK=79, data = 'C'
2. host ACKs receipt of 'C', echoes back 'C'. B to A: Seq=79, ACK=43, data = 'C'
3. host ACKs receipt of echoed 'C'. A to B: Seq=43, ACK=80

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT

TCP round trip time, timeout

    EstimatedRTT = (1 - α) · EstimatedRTT + α · SampleRTT

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr. SampleRTT and EstimatedRTT in milliseconds (axis from 100 to 350) plotted against time in seconds (axis from 1 to 106)]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

    DevRTT = (1 - β) · DevRTT + β · |SampleRTT - EstimatedRTT|
    (typically, β = 0.25)

    TimeoutInterval = EstimatedRTT + 4 · DevRTT

  where EstimatedRTT is the estimated RTT and 4 · DevRTT is the "safety margin"
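The three formulas combine into a small estimator. This is a sketch with two assumptions that the slide does not state. The first sample initializes EstimatedRTT directly, with DevRTT set to half of it, and DevRTT is updated before EstimatedRTT so that the deviation is measured against the previous estimate. Both choices follow the convention of RFC 6298. The class name is invented for the example.

```python
class RttEstimator:
    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2       # assumed initial deviation

    def update(self, sample_rtt):
        # deviation is measured against the estimate before this sample
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt

est = RttEstimator(100.0)          # milliseconds
est.update(200.0)
print(est.estimated_rtt, est.dev_rtt, est.timeout_interval())   # 112.5 62.5 362.5
```

A single large sample moves the estimate only one eighth of the way toward it, but widens the safety margin noticeably.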

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks

let's initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control

TCP sender events

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments

TCP sender (simplified)

initial transition (Λ): NextSeqNum = InitialSeqNum; SendBase = InitialSeqNum

single state "wait for event", with events:

data received from application above:
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
        start timer

timeout:
    retransmit not-yet-acked segment with smallest seq. #
    start timer

ACK received, with ACK field value y:
    if (y > SendBase) {
        SendBase = y
        /* SendBase-1: last cumulatively ACKed byte */
        if (there are currently not-yet-acked segments)
            start timer
        else stop timer
    }
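The simplified sender can be written out as a small class. As in the slide, duplicate ACKs, flow control and congestion control are ignored. The additions are assumptions of this sketch: the names are invented, the timer is a boolean flag, and `sent` is a list standing in for "pass segment to IP".

```python
class SimpleTcpSender:
    def __init__(self, initial_seq):
        self.send_base = initial_seq
        self.next_seq = initial_seq
        self.unacked = []             # (seq, data) pairs, oldest first
        self.timer_running = False
        self.sent = []                # stand-in for "pass segment to IP"

    def send(self, data):
        segment = (self.next_seq, data)
        self.unacked.append(segment)
        self.sent.append(segment)
        self.next_seq += len(data)    # seq # counts bytes, not segments
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if self.unacked:
            self.sent.append(self.unacked[0])   # smallest seq # only
            self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:        # cumulative ACK: bytes below y are ACKed
            self.send_base = y
            self.unacked = [(s, d) for s, d in self.unacked if s + len(d) > y]
            self.timer_running = bool(self.unacked)
```

Sending 8 bytes at Seq=92 and 20 bytes at Seq=100 and then receiving ACK=120 moves SendBase straight to 120 and stops the timer, even if ACK=100 never arrives.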

TCP retransmission scenarios

[Three Host A / Host B timeline diagrams]

lost ACK scenario:
  A: Seq=92, 8 bytes of data
  B: ACK=100 (lost, X)
  A: timeout; resend Seq=92, 8 bytes of data
  B: ACK=100
  A: SendBase=100

premature timeout:
  A: Seq=92, 8 bytes of data (SendBase=92)
  A: Seq=100, 20 bytes of data
  B: ACK=100
  B: ACK=120
  A: timeout before the ACKs arrive; resend Seq=92, 8 bytes of data
  A: SendBase=100, then SendBase=120 as the ACKs arrive
  B: ACK=120
  A: SendBase=120

cumulative ACK:
  A: Seq=92, 8 bytes of data
  A: Seq=100, 20 bytes of data
  B: ACK=100 (lost, X)
  B: ACK=120, arrives before timeout
  A: Seq=120, 15 bytes of data

TCP ACK generation [RFC 5681]

event at receiver, followed by TCP receiver action:

1. arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
   action: delayed ACK. Wait up to 500 ms for next segment. If no next segment, send ACK
2. arrival of in-order segment with expected seq #; one other segment has ACK pending
   action: immediately send single cumulative ACK, ACKing both in-order segments
3. arrival of out-of-order segment, higher-than-expect seq. #; gap detected
   action: immediately send duplicate ACK, indicating seq. # of next expected byte
4. arrival of segment that partially or completely fills gap
   action: immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  – likely that unacked segment lost, so don't wait for timeout
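Counting duplicates takes only a few lines. In this sketch, "triple duplicate ACK" is read as three ACKs that repeat an earlier ACK number (the interpretation used in RFC 5681); the function name is hypothetical, and the function only reports where a fast retransmit would fire rather than resending anything.

```python
def fast_retransmit_points(acks):
    """Return the ACK numbers at which a third duplicate ACK is seen."""
    triggers = []
    last, dups = None, 0
    for a in acks:
        if a == last:
            dups += 1
            if dups == 3:             # third duplicate: resend segment at seq a
                triggers.append(a)
        else:
            last, dups = a, 0         # new ACK number resets the counter
    return triggers

print(fast_retransmit_points([100, 100, 100, 100, 120]))   # [100]
```

Further duplicates beyond the third do not trigger another retransmission of the same segment in this sketch.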

TCP fast retransmit

[Host A / Host B timeline: fast retransmit after sender receipt of triple duplicate ACK]
  A: Seq=92, 8 bytes of data
  A: Seq=100, 20 bytes of data (lost, X)
  B: ACK=100
  B: ACK=100
  B: ACK=100
  B: ACK=100 (3 DUP ACKs)
  A: resend Seq=100, 20 bytes of data, before the timeout expires

TCP flow control

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

[Figure: receiver protocol stack. Segments from sender pass up through IP code and TCP code (OS) into the TCP socket receiver buffers, from which the application process reads. Application may remove data from TCP socket buffers slower than TCP receiver is delivering (sender is sending)]

TCP flow control

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which is split into buffered data and free buffer space (rwnd); buffered data is passed to application process]
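The bookkeeping behind rwnd can be sketched with two helper functions. The variable names (`last_byte_rcvd`, `last_byte_read`, and so on) are the usual textbook ones and are an assumption here, since the slide only shows the buffer picture. Both functions are illustrative and not part of any socket API.

```python
def advertised_rwnd(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free space in the receive buffer: total size minus buffered, unread bytes."""
    return rcv_buffer - (last_byte_rcvd - last_byte_read)

def sender_may_send(last_byte_sent, last_byte_acked, nbytes, rwnd):
    """True if sending nbytes more keeps in-flight data within rwnd."""
    return (last_byte_sent - last_byte_acked) + nbytes <= rwnd

rwnd = advertised_rwnd(rcv_buffer=4096, last_byte_rcvd=3000, last_byte_read=1000)
print(rwnd)                                        # 2096
print(sender_may_send(3000, 2000, 1000, rwnd))     # True
print(sender_may_send(3000, 2000, 1200, rwnd))     # False
```

As the application reads more data, `last_byte_read` advances and the advertised window grows again.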

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

[Figure: client and server, each with an application above the network and the same connection state:
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client]

client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

client state: CLOSED; server state: LISTEN

1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client state becomes SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server state becomes SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client state becomes ESTAB
4. server: received ACK(y) indicates client is live; server state becomes ESTAB
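The numbering in the three messages can be made concrete with a small function. It only builds the header fields shown on the slide as Python dictionaries; it opens no sockets, and the field names are chosen here for readability.

```python
def three_way_handshake(x, y):
    """Return the SYN, SYNACK and ACK segments for initial seq numbers x and y."""
    syn = {"SYN": 1, "seq": x}
    synack = {"SYN": 1, "ACK": 1, "seq": y, "acknum": syn["seq"] + 1}
    ack = {"ACK": 1, "acknum": synack["seq"] + 1}
    return [syn, synack, ack]

for segment in three_way_handshake(x=1000, y=5000):
    print(segment)
# {'SYN': 1, 'seq': 1000}
# {'SYN': 1, 'ACK': 1, 'seq': 5000, 'acknum': 1001}
# {'ACK': 1, 'acknum': 5001}
```

Each side acknowledges the other's initial sequence number plus one, which is how both learn that the other is live.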

TCP 3-way handshake: FSM

client side:
• closed to SYN sent: Socket clientSocket = new Socket("hostname", "port number"); send SYN(seq=x)
• SYN sent to ESTAB: receive SYNACK(seq=y, ACKnum=x+1); send ACK(ACKnum=y+1)

server side:
• closed to listen: Socket connectionSocket = welcomeSocket.accept(); Λ
• listen to SYN rcvd: receive SYN(x); send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
• SYN rcvd to ESTAB: receive ACK(ACKnum=y+1); Λ

TCP: closing a connection

• client, server each close their side of connection
  – send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

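[Figure: closing a connection; client state and server state both start in ESTAB]

1. client: clientSocket.close(); sends FINbit=1, seq=x; enters FIN_WAIT_1 (can no longer send but can receive data)
2. server: sends ACKbit=1; ACKnum=x+1; enters CLOSE_WAIT (can still send data)
3. client: enters FIN_WAIT_2 (wait for server close)
4. server: sends FINbit=1, seq=y; enters LAST_ACK (can no longer send data)
5. client: sends ACKbit=1; ACKnum=y+1; enters TIMED_WAIT (timed wait for 2 × max segment lifetime), then CLOSED
6. server: enters CLOSED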

The "Two Army Problem"

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
• a top-10 problem!

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity

[Figure: Host A sends original data at rate λ_in into a router with unlimited shared output link buffers; Host B receives throughput λ_out. Two plots: λ_out against λ_in (both axes marked at R/2) and delay against λ_in (λ_in axis marked at R/2).]

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: λ_in = λ_out
  – transport-layer input includes retransmissions: λ'_in ≥ λ_in

[Figure: Host A and Host B share a router with finite shared output link buffers. λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out.]

Causes/costs of congestion: scenario 2

idealization: perfect knowledge
• sender sends only when router buffers available

[Figure: Host A keeps a copy of each packet and sends only when the router has free buffer space, holding back when there is no buffer space. λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out. Plot: λ_out against λ_in, both axes marked at R/2.]

Causes/costs of congestion: scenario 2

Idealization: known loss. Packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
• when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)

Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered!

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  – decreasing goodput

[Figure: Host A (keeping a copy of each packet, with a timeout in the duplicates case) and Host B share a router with finite buffers and some free buffer space. λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out. Plots: λ_out against λ_in, both axes marked at R/2.]

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λ_in and λ'_in increase?

A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!

[Figure: Hosts A, B, C and D connected by multihop paths through routers with finite shared output link buffers. λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out.]

Causes/costs of congestion: scenario 3

[Figure: plot of λ_out (throughput by blue traffic) against λ'_in (offered load by Host A), both axes marked at R/2. Annotation: bandwidth wastage for packets dropped at the 2nd router.]

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss

[Figure: AIMD sawtooth behavior, probing for bandwidth. cwnd (TCP sender congestion window size) plotted against time: additively increase window size until loss occurs, then cut window in half.]
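The sawtooth can be reproduced with a small Python sketch. It is only an illustration under stated assumptions: cwnd is counted in MSS units, time advances one RTT per step, and the set of RTT rounds in which a loss is detected (`loss_at`) is a hypothetical input rather than something a real sender knows in advance.

```python
def aimd_trace(rounds, loss_at, cwnd=1.0, mss=1.0):
    """Return cwnd (in MSS units) at the start of each RTT round."""
    trace = []
    for rtt in range(rounds):
        trace.append(cwnd)
        if rtt in loss_at:
            cwnd = max(mss, cwnd / 2)  # multiplicative decrease after loss
        else:
            cwnd += mss                # additive increase: +1 MSS per RTT
    return trace


# Start at 4 MSS, losses detected in rounds 3 and 6.
print(aimd_trace(8, {3, 6}, cwnd=4.0))
# prints [4.0, 5.0, 6.0, 7.0, 3.5, 4.5, 5.5, 2.75]
```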

TCP Congestion Control: details

• sender limits transmission:

  LastByteSent - LastByteAcked ≤ cwnd

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

  rate ≈ cwnd / RTT bytes/sec

[Figure: sender sequence number space, showing last byte ACKed, bytes sent but not yet ACKed ("in-flight") and last byte sent; the in-flight span is bounded by cwnd.]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A and Host B timeline. Host A sends one segment, then two segments one RTT later, then four segments.]

TCP: detecting, reacting to loss

• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

[FSM reconstructed from the figure as event / action lists; Λ means no event or no action]

initial (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; enter slow start

slow start
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: Λ; go to congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment

congestion avoidance
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start

fast recovery
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment; go to slow start
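The three-state FSM summarized on this slide can be sketched as a small Python class. This is an illustrative model, not a real TCP stack: the class name `RenoSender`, the MSS value of 1460 bytes and the reading of "ssthresh + 3" as "ssthresh + 3 MSS" are assumptions of the sketch, and segment transmission and retransmission are left out so that only the window arithmetic is shown.

```python
MSS = 1460  # bytes; an assumed value for illustration


class RenoSender:
    def __init__(self):
        self.state = "slow_start"
        self.cwnd = 1 * MSS
        self.ssthresh = 64 * 1024
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh            # deflate window, leave recovery
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += MSS                     # doubles cwnd every RTT
        else:
            self.cwnd += MSS * MSS / self.cwnd   # about +1 MSS per RTT
        self.dup_acks = 0
        if self.state == "slow_start" and self.cwnd > self.ssthresh:
            self.state = "congestion_avoidance"

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += MSS
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                   # fast retransmit would go here
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1 * MSS
        self.dup_acks = 0
        self.state = "slow_start"
```

Feeding the object a sequence of `on_new_ack()`, `on_dup_ack()` and `on_timeout()` calls lets one step through the transitions by hand and compare them with the lists above.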

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is 3/4 W
  – avg. throughput is 3/4 W per RTT

avg TCP throughput = (3/4) · W / RTT bytes/sec

[Figure: sawtooth of window size oscillating between W/2 and W, averaging 3/4 W.]

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

  TCP throughput = (1.22 · MSS) / (RTT · √L)

• to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
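The two numbers in this example can be checked with a few lines of Python. The function names are hypothetical; the first divides the bandwidth-delay product by the segment size, and the second solves the throughput formula above for L, with MSS converted from bytes to bits.

```python
def window_segments(throughput_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain the target throughput."""
    return throughput_bps * rtt_s / (8 * mss_bytes)


def required_loss_rate(throughput_bps, rtt_s, mss_bytes):
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L."""
    return (1.22 * 8 * mss_bytes / (rtt_s * throughput_bps)) ** 2


w = window_segments(10e9, 0.100, 1500)
loss = required_loss_rate(10e9, 0.100, 1500)
print(int(w))          # prints 83333
print(f"{loss:.1e}")   # prints 2.1e-10
```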

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput plotted against Connection 1 throughput, both axes up to R. The full bandwidth utilization line is the set of points (X1, Y1) where X1 + Y1 = R; the equal bandwidth share line is the set of points (X2, Y2) where X2 = Y2. The trajectory alternates between congestion avoidance (additive increase) and loss (decrease window by factor of 2).]

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2
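The arithmetic behind the parallel-connection example can be sketched as follows, assuming the bottleneck ends up divided equally among all TCP connections (the idealized fairness goal, not a guarantee). The function name `app_share` is hypothetical.

```python
from fractions import Fraction


def app_share(existing_connections, opened_by_app):
    """Fraction of link rate R obtained by an app opening several connections."""
    total = existing_connections + opened_by_app
    return Fraction(opened_by_app, total)


print(app_share(9, 1))    # prints 1/10
print(app_share(9, 11))   # prints 11/20
```

Under that assumption the second case works out to 11/20 of R, which the slide rounds to R/2.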

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and is marked ECN=11 by a router; the destination returns a TCP ACK segment with ECE=1.]


rdt2.0: FSM specification

[FSM reconstructed from the figure as event / action pairs; Λ means no action]

sender
• "Wait for call from above": rdt_send(data) / sndpkt = make_pkt(data, checksum); udt_send(sndpkt); go to "Wait for ACK or NAK"
• "Wait for ACK or NAK":
  – rdt_rcv(rcvpkt) && isNAK(rcvpkt) / udt_send(sndpkt)
  – rdt_rcv(rcvpkt) && isACK(rcvpkt) / Λ; go to "Wait for call from above"

receiver
• "Wait for call from below":
  – rdt_rcv(rcvpkt) && corrupt(rcvpkt) / udt_send(NAK)
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) / extract(rcvpkt,data); deliver_data(data); udt_send(ACK)

rdt2.0: operation with no errors

[Figure: the rdt2.0 sender and receiver FSMs, highlighting the error-free path: rdt_send(data) / sndpkt = make_pkt(data, checksum); udt_send(sndpkt), then rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) / extract(rcvpkt,data); deliver_data(data); udt_send(ACK), then rdt_rcv(rcvpkt) && isACK(rcvpkt) / Λ.]

rdt2.0: error scenario

[Figure: the rdt2.0 sender and receiver FSMs, highlighting the error path: rdt_rcv(rcvpkt) && corrupt(rcvpkt) / udt_send(NAK) at the receiver, then rdt_rcv(rcvpkt) && isNAK(rcvpkt) / udt_send(sndpkt) at the sender.]

rdt2.0 has a fatal flaw!

what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver!
• can't just retransmit: possible duplicate

handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt

stop and wait: sender sends one packet, then waits for receiver response

rdt2.1: sender, handles garbled ACK/NAKs

[FSM reconstructed from the figure as event / action pairs; Λ means no action]

• "Wait for call 0 from above": rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to "Wait for ACK or NAK 0"
• "Wait for ACK or NAK 0":
  – rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ; go to "Wait for call 1 from above"
• "Wait for call 1 from above": rdt_send(data) / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); go to "Wait for ACK or NAK 1"
• "Wait for ACK or NAK 1":
  – rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ; go to "Wait for call 0 from above"

rdt2.1: receiver, handles garbled ACK/NAKs

[FSM reconstructed from the figure as event / action pairs]

• "Wait for 0 from below":
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); go to "Wait for 1 from below"
  – rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
• "Wait for 1 from below":
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); go to "Wait for 0 from below"
  – rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)

rdt2.1: Example 1

[Figures: the rdt2.1 sender and receiver FSMs, one highlighted transition per slide]

1. sender, "Wait for call 0 from above": rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
2. receiver, "Wait for 0 from below": rdt_rcv(rcvpkt) && corrupt(rcvpkt) / sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender, "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
4. receiver, "Wait for 0 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
5. sender, "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ
6. sender ends in "Wait for call 1 from above", receiver in "Wait for 1 from below"

rdt2.1: Example 2

[Figures: the rdt2.1 sender and receiver FSMs, one highlighted transition per slide]

1. receiver, "Wait for 0 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
2. sender, "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
3. receiver, "Wait for 1 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
4. sender, "Wait for ACK or NAK 0": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ
5. sender ends in "Wait for call 1 from above", receiver in "Wait for 1 from below"

rdt2.1: discussion

sender:
• seq # added to pkt
• two seq. #'s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  – state must "remember" whether "expected" pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  – state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender

rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  – receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt

rdt2.2: sender, receiver fragments

[FSM fragments reconstructed from the figure as event / action pairs; Λ means no action]

sender FSM fragment
• "Wait for call 0 from above": rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to "Wait for ACK 0"
• "Wait for ACK 0":
  – rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)) / udt_send(sndpkt)
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0) / Λ

receiver FSM fragment
• entering "Wait for 0 from below": rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
• "Wait for 0 from below": rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)) / udt_send(sndpkt)

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
  – checksum, seq. #, ACKs, retransmissions will be of help … but not enough

approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  – retransmission will be duplicate, but seq. #'s already handle this
  – receiver must specify seq # of pkt being ACKed
• requires countdown timer

rdt3.0 sender

[FSM reconstructed from the figure as event / action pairs; Λ means no action]

• "Wait for call 0 from above":
  – rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK0"
  – rdt_rcv(rcvpkt) / Λ
• "Wait for ACK0":
  – rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)) / Λ
  – timeout / udt_send(sndpkt); start_timer
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0) / stop_timer; go to "Wait for call 1 from above"
• "Wait for call 1 from above":
  – rdt_send(data) / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK1"
  – rdt_rcv(rcvpkt) / Λ
• "Wait for ACK1":
  – rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,0)) / Λ
  – timeout / udt_send(sndpkt); start_timer
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1) / stop_timer; go to "Wait for call 0 from above"

rdt3.0 in action

[Four sender/receiver timelines, reconstructed from the figures]

(a) no loss
1. sender: send pkt0; receiver: rcv pkt0, send ack0
2. sender: rcv ack0, send pkt1; receiver: rcv pkt1, send ack1
3. sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0

(b) packet loss
1. sender: send pkt0; receiver: rcv pkt0, send ack0
2. sender: rcv ack0, send pkt1; pkt1 is lost (X)
3. sender: timeout, resend pkt1; receiver: rcv pkt1, send ack1
4. sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0

(c) ACK loss
1. sender: send pkt0; receiver: rcv pkt0, send ack0
2. sender: rcv ack0, send pkt1; receiver: rcv pkt1, send ack1; ack1 is lost (X)
3. sender: timeout, resend pkt1; receiver: rcv pkt1 (detect duplicate), send ack1
4. sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
1. sender: send pkt0; receiver: rcv pkt0, send ack0
2. sender: rcv ack0, send pkt1; receiver: rcv pkt1, send ack1 (delayed)
3. sender: timeout, resend pkt1
4. sender: rcv ack1, send pkt0; receiver: rcv pkt1 (detect duplicate), send ack1
5. receiver: rcv pkt0, send ack0; sender: rcv ack1, do nothing
6. sender: rcv ack0, send pkt1

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

  D_trans = L / R = 8000 bits / 10^9 bits/sec = 8 microsecs

  – U_sender: utilization, fraction of time sender busy sending

    U_sender = (L / R) / (RTT + L / R) = 0.008 / 30.008 = 0.00027

  – if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!

rdt3.0: stop-and-wait operation

[Figure: sender/receiver timeline. First packet bit transmitted at t = 0; last packet bit transmitted at t = L / R; first packet bit arrives; last packet bit arrives, send ACK; ACK arrives, send next packet at t = RTT + L / R.]

U_sender = (L / R) / (RTT + L / R) = 0.008 / 30.008 = 0.00027

Pipelined protocols

pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver

• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

[Figure: sender/receiver timeline. First packet bit transmitted at t = 0; last bit transmitted at t = L / R; first packet bit arrives; last packet bit arrives, send ACK; last bit of 2nd packet arrives, send ACK; last bit of 3rd packet arrives, send ACK; ACK arrives, send next packet at t = RTT + L / R.]

3-packet pipelining increases utilization by a factor of 3!

U_sender = (3L / R) / (RTT + L / R) = 0.024 / 30.008 ≈ 0.0008
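The utilization figures for stop-and-wait and for 3-packet pipelining can be checked numerically. The function below is a small sketch; its name and the `window` parameter (number of packets sent back to back per round trip) are introduced here for illustration.

```python
def sender_utilization(packet_bits, link_bps, rtt_s, window=1):
    """Fraction of time the sender is busy transmitting."""
    d_trans = packet_bits / link_bps  # L / R, in seconds
    return min(1.0, window * d_trans / (rtt_s + d_trans))


stop_and_wait = sender_utilization(8000, 1e9, 0.030)
pipelined = sender_utilization(8000, 1e9, 0.030, window=3)
print(round(stop_and_wait, 5))  # prints 0.00027
print(round(pipelined, 5))      # prints 0.0008
```

The `min(1.0, ...)` cap reflects that a sender cannot be busy more than all of the time, however large the window.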

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  – doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  – when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  – when timer expires, retransmit only that unacked packet

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n, "cumulative ACK"
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

[Extended FSM with a single state, "Wait"; Λ means no event or no action]

initial (Λ):
  base = 1
  nextseqnum = 1

rdt_send(data):
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)

timeout:
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  …
  udt_send(sndpkt[nextseqnum-1])

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer

rdt_rcv(rcvpkt) && corrupt(rcvpkt):
  Λ
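The sender FSM above translates fairly directly into Python. The sketch below is an illustration under stated assumptions: the class name `GBNSender` is hypothetical, `send` is any callable standing in for udt_send, the timer is reduced to a boolean flag, and sequence numbers are unbounded integers rather than k-bit values. It also never moves base backwards on an old ACK, which is a guard added in this sketch and not part of the FSM.

```python
class GBNSender:
    def __init__(self, window, send):
        self.N = window
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}            # seq -> packet, kept until ACKed
        self.timer_running = False
        self.send = send            # stands in for udt_send

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False            # window full: refuse_data
        pkt = (self.nextseqnum, data)
        self.sndpkt[self.nextseqnum] = pkt
        self.send(pkt)
        if self.base == self.nextseqnum:
            self.timer_running = True   # start_timer for oldest unACKed pkt
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        # Cumulative ACK: everything up to and including acknum is ACKed.
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)
        self.base = max(self.base, acknum + 1)
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.send(self.sndpkt[seq])  # go back N: resend all unACKed pkts
```

With a window of 4, a fifth `rdt_send` is refused until an ACK advances `base`, and a timeout resends every packet from `base` to `nextseqnum - 1`.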


GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don't buffer): no receiver buffering!
  – re-ACK pkt with highest in-order seq #

[Extended FSM with a single state, "Wait"; Λ means no event]

initial (Λ):
  expectedseqnum = 1
  sndpkt = make_pkt(0, ACK, chksum)

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++

default:
  udt_send(sndpkt)


GBN in action

[Sender/receiver timeline reconstructed from the figure; sender window (N=4) slides over sequence numbers 0 1 2 3 4 5 6 7 8]

1. sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
2. receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, discard, (re)send ack1
3. sender: rcv ack0, send pkt4; rcv ack1, send pkt5; ignore duplicate ACK
4. receiver: receive pkt4, discard, (re)send ack1; receive pkt5, discard, (re)send ack1
5. sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
6. receiver: rcv pkt2, deliver, send ack2; rcv pkt3, deliver, send ack3; rcv pkt4, deliver, send ack4; rcv pkt5, deliver, send ack5


Selective repeat

• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #'s
  – limits seq #s of sent, unACKed packets

Selective repeat: sender, receiver windows

[Figure: sender and receiver windows over the sequence number space]

Selective repeat

sender

data from above:
• if next available seq # in window, send pkt

timeout(n):
• resend pkt n, restart timer

ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver

pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt

pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)

otherwise:
• ignore

Selective repeat in action

[Sender/receiver timeline reconstructed from the figure; sender window (N=4) slides over sequence numbers 0 1 2 3 4 5 6 7 8]

1. sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
2. receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, buffer, send ack3
3. sender: rcv ack0, send pkt4; rcv ack1, send pkt5; record ack3 arrived
4. receiver: receive pkt4, buffer, send ack4; receive pkt5, buffer, send ack5
5. sender: pkt 2 timeout; send pkt2; record ack4 arrived; record ack5 arrived
6. receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

(a) no problem: sender sends pkt0, pkt1, pkt2; all three are received and their ACKs arrive, so the sender window advances and the sender sends pkt3 (lost, X) and then a new pkt0. The receiver will accept packet with seq number 0.

(b) oops!: sender sends pkt0, pkt1, pkt2; all three are received, but all three ACKs are lost (X X X). The sender times out and retransmits pkt0. The receiver will accept packet with seq number 0.

[Figure: sender window (after receipt) and receiver window (after receipt) over the sequence 0 1 2 3 0 1 2 in both cases.]

• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)
• receiver can't see sender side; receiver behavior identical in both cases: something's (very) wrong!

Q: what relationship between seq # size and window size to avoid problem in (b)?
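One way to explore the question is to check, for a given sequence-number space and window size, whether a retransmitted old packet can fall inside the receiver's new window. The Python sketch below is an illustration under a stated assumption: the worst case is the one in (b), where the receiver has accepted a full window and every ACK was lost. The function name `sr_ambiguous` is hypothetical.

```python
def sr_ambiguous(seq_space, window):
    """True if an old retransmission can be mistaken for a new packet."""
    # Seq numbers the sender may still retransmit (first window, all ACKs lost).
    old = {n % seq_space for n in range(0, window)}
    # Seq numbers the receiver now accepts as new (its advanced window).
    new = {n % seq_space for n in range(window, 2 * window)}
    return bool(old & new)


print(sr_ambiguous(4, 3))  # prints True  (the slide's example)
print(sr_ambiguous(4, 2))  # prints False
```

Under that assumption the two sets overlap exactly when twice the window size exceeds the sequence-number space, which agrees with the usual answer that the window must be at most half the sequence-number space.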

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver

TCP segment structure

[Figure: TCP segment layout, 32 bits wide, reconstructed as a field list]

• source port #, dest port #
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• head len, not used, flag bits U, A, P, R, S, F:
  – URG: urgent data (generally not used)
  – ACK: ACK # valid
  – PSH: push data now
  – RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
• Urg data pointer
• options (variable length)
• application data (variable length)

TCP seq. numbers, ACKs

sequence numbers:
  – byte stream "number" of first byte in segment's data

acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK

Q: how receiver handles out-of-order segments
  – A: TCP spec doesn't say, up to implementor

[Figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer. Sender sequence number space with window size N: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable.]

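Byte stream in TCP

[Figure: an HTTP Get Message (K bytes) placed in the byte stream starting at the 100th byte; a window of N bytes; a segment of M bytes whose TCP header carries seq. no = 100; the part of the message outside the window cannot be transmitted now.]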

TCP seq. numbers, ACKs

simple telnet scenario (Host A, Host B):
1. User types 'C'. Host A → Host B: Seq=42, ACK=79, data = 'C'
2. Host B ACKs receipt of 'C', echoes back 'C'. Host B → Host A: Seq=79, ACK=43, data = 'C'
3. Host A ACKs receipt of echoed 'C'. Host A → Host B: Seq=43, ACK=80

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT

EstimatedRTT = (1 - α) · EstimatedRTT + α · SampleRTT

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr. SampleRTT and EstimatedRTT in milliseconds (axis from 100 to 350) plotted against time in seconds (1 to 106).]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

  DevRTT = (1 - β) · DevRTT + β · |SampleRTT - EstimatedRTT|

  (typically, β = 0.25)

TimeoutInterval = EstimatedRTT + 4 · DevRTT

(EstimatedRTT is the estimated RTT; 4 · DevRTT is the "safety margin")
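The estimator and timeout formulas can be sketched as a small Python class. Two details are assumptions of this sketch, since the slides do not fix them: the estimate is seeded with the first sample and a deviation of zero, and DevRTT is updated using the EstimatedRTT from before the current sample is folded in. The class name `RttEstimator` is hypothetical.

```python
class RttEstimator:
    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample  # assumed initialization
        self.dev_rtt = 0.0                 # assumed initialization

    def update(self, sample_rtt):
        """Fold one SampleRTT into the running estimates."""
        deviation = abs(sample_rtt - self.estimated_rtt)
        self.dev_rtt = (1 - self.beta) * self.dev_rtt + self.beta * deviation
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    @property
    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(100.0)    # milliseconds
est.update(200.0)
print(est.estimated_rtt)     # prints 112.5
print(est.dev_rtt)           # prints 25.0
print(est.timeout_interval)  # prints 212.5
```

A single outlier of 200 ms moves the estimate only to 112.5 ms, but it widens the safety margin enough to push the timeout past the outlier itself.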

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks

let's initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control

TCP sender events:

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments

TCP sender (simplified)

initial transition (Λ): NextSeqNum = InitialSeqNum; SendBase = InitialSeqNum

state: wait for event

event: data received from application above
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., "send")
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
      start timer

event: timeout
  retransmit not-yet-acked segment with smallest seq. #
  start timer

event: ACK received, with ACK field value y
  if (y > SendBase) {
      SendBase = y
      /* SendBase–1: last cumulatively ACKed byte */
      if (there are currently not-yet-acked segments)
          start timer
      else stop timer
  }
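The three events of the simplified sender can be sketched as a small class. This is an illustrative sketch under stated assumptions: the class name `SimpleTcpSender`, the `sent` log standing in for "pass segment to IP", and the boolean `timer_running` standing in for a real timer are all hypothetical.

```python
class SimpleTcpSender:
    """Simplified sender: cumulative ACKs, one timer, no flow/congestion control."""

    def __init__(self, initial_seq_num):
        self.next_seq_num = initial_seq_num
        self.send_base = initial_seq_num
        self.unacked = {}           # seq # of first byte -> payload
        self.timer_running = False  # stand-in for a real retransmission timer
        self.sent = []              # log of (seq #, payload) handed to IP

    def on_data(self, data):
        """Data received from application above."""
        seq = self.next_seq_num
        self.unacked[seq] = data
        self.sent.append((seq, data))
        self.next_seq_num += len(data)
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        """Retransmit the not-yet-acked segment with the smallest seq #."""
        if not self.unacked:
            return
        seq = min(self.unacked)
        self.sent.append((seq, self.unacked[seq]))
        self.timer_running = True

    def on_ack(self, y):
        """ACK received with ACK field value y (cumulative)."""
        if y > self.send_base:
            self.send_base = y
            for seq in [s for s in self.unacked if s + len(self.unacked[s]) <= y]:
                del self.unacked[seq]
            # Timer keeps running only while something is still unacked.
            self.timer_running = bool(self.unacked)
```

Because ACKs are cumulative, a single `on_ack(120)` clears both the 8-byte segment at 92 and the 20-byte segment at 100.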

TCP: retransmission scenarios

lost ACK scenario: Host A sends Seq=92, 8 bytes of data; Host B's ACK=100 is lost (X); A times out and resends Seq=92, 8 bytes of data; B replies ACK=100 again.

premature timeout: Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data; A's timer expires before ACK=100 and ACK=120 arrive, so A resends Seq=92, 8 bytes of data; B answers the duplicate with cumulative ACK=120. SendBase at A: 92, then 100, then 120, then 120.

cumulative ACK: Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data; ACK=100 is lost (X), but ACK=120 arrives before the timeout and covers both segments; A continues with Seq=120, 15 bytes of data.

TCP ACK generation [RFC 5681]

event at receiver → TCP receiver action

• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK

• arrival of in-order segment with expected seq #; one other segment has ACK pending
  → immediately send single cumulative ACK, ACKing both in-order segments

• arrival of out-of-order segment with higher-than-expected seq #; gap detected
  → immediately send duplicate ACK, indicating seq # of next expected byte

• arrival of segment that partially or completely fills gap
  → immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  – long delay before resending lost packet

• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 duplicate ACKs for the same data ("triple duplicate ACKs"), i.e. the same ACK three more times after the original, resend unacked segment with smallest seq #
  – likely that unacked segment lost, so don't wait for timeout

[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost (X); Host B returns ACK=100 four times (the original plus 3 DUP ACKs); A resends Seq=100, 20 bytes of data before its timeout expires.]
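The duplicate-ACK counting behind fast retransmit can be sketched as follows. The function name `fast_retransmit_points` and the list-of-ACK-numbers input are hypothetical conveniences for illustration.

```python
def fast_retransmit_points(acks, dup_threshold=3):
    """Return the indices in `acks` at which a fast retransmit would fire.

    `acks` is the sequence of ACK numbers seen by the sender. The first copy
    of an ACK number is the original; later identical ones are duplicates.
    """
    fired = []
    last_ack = None
    dup_count = 0
    for i, ack in enumerate(acks):
        if ack == last_ack:
            dup_count += 1
            if dup_count == dup_threshold:
                fired.append(i)   # resend the segment starting at byte `ack`
        else:
            last_ack = ack
            dup_count = 0
    return fired

# Four ACK=100 in a row: the fourth one is the third duplicate.
print(fast_retransmit_points([100, 100, 100, 100]))
```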

TCP flow control

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

• application may remove data from TCP socket buffers slower than TCP receiver is delivering (sender is sending)

[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code in the OS into the TCP socket receiver buffers, from which the application process reads.]

TCP flow control

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer

• sender limits amount of unacked ("in-flight") data to receiver's rwnd value

• guarantees receive buffer will not overflow

[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); buffered data goes to the application process.]
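The bookkeeping on both sides can be sketched as below. The slides give only rwnd and RcvBuffer; the names `last_byte_rcvd` and `last_byte_read` (receiver) and the helper names themselves are assumptions made for this sketch.

```python
def advertised_rwnd(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free buffer space the receiver advertises as rwnd."""
    buffered = last_byte_rcvd - last_byte_read  # bytes waiting for the application
    return rcv_buffer - buffered

def sender_may_send(nbytes, last_byte_sent, last_byte_acked, rwnd):
    """True if sending nbytes more keeps in-flight data within rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + nbytes <= rwnd

rwnd = advertised_rwnd(4096, last_byte_rcvd=3000, last_byte_read=1000)
print(rwnd, sender_may_send(1000, 5000, 4000, rwnd))
```

With 2000 bytes still buffered, a 4096-byte RcvBuffer leaves rwnd = 2096, so a sender with 1000 bytes in flight may send 1000 more but not 1200.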

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

[Figure: client and server each hold, between application and network: connection state: ESTAB; connection variables: seq # client-to-server and server-to-client, rcvBuffer size at server, client.]

Socket clientSocket = new Socket("hostname", "port number");

Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

client state: CLOSED; server state: LISTEN

1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client state: SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server state: SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client state: ESTAB
4. server: received ACK(y) indicates client is live; server state: ESTAB
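The numbering in the three segments can be sketched as follows. The function name and the dictionary representation of a segment are hypothetical; the modulo reflects that TCP sequence numbers are 32-bit values.

```python
SEQ_SPACE = 2 ** 32  # TCP sequence numbers are 32 bits wide

def three_way_handshake(x, y):
    """Return the SYN, SYNACK and ACK segments for initial seq nums x and y."""
    syn = {"SYN": 1, "seq": x}
    synack = {"SYN": 1, "ACK": 1, "seq": y, "acknum": (x + 1) % SEQ_SPACE}
    ack = {"ACK": 1, "acknum": (y + 1) % SEQ_SPACE}
    return [syn, synack, ack]

for segment in three_way_handshake(100, 300):
    print(segment)
```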

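TCP 3-way handshake: FSM

client side: closed → SYN sent → ESTAB
• closed → SYN sent: Socket clientSocket = new Socket("hostname", "port number"); send SYN(seq=x)
• SYN sent → ESTAB: receive SYNACK(seq=y, ACKnum=x+1); send ACK(ACKnum=y+1)

server side: closed → listen → SYN rcvd → ESTAB
• closed → listen: Socket connectionSocket = welcomeSocket.accept();
• listen → SYN rcvd: receive SYN(x); send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
• SYN rcvd → ESTAB: receive ACK(ACKnum=y+1); no action (Λ)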

TCP: closing a connection

• client, server each close their side of connection
  – send TCP segment with FIN bit = 1

• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN

• simultaneous FIN exchanges can be handled

client state: ESTAB; server state: ESTAB

1. client: clientSocket.close(); send FINbit=1, seq=x; client state: FIN_WAIT_1 (can no longer send but can receive data)
2. server: send ACKbit=1, ACKnum=x+1; server state: CLOSE_WAIT (can still send data); client state: FIN_WAIT_2 (wait for server close)
3. server: send FINbit=1, seq=y; server state: LAST_ACK (can no longer send data)
4. client: send ACKbit=1, ACKnum=y+1; client state: TIMED_WAIT (timed wait for 2*max segment lifetime), then CLOSED
5. server: on receiving the ACK, server state: CLOSED

The "Two Army Problem"

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
• a top-10 problem

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2

[Figure: Host A and Host B each send original data at rate $\lambda_{in}$ into one router with unlimited shared output link buffers; throughput is $\lambda_{out}$. Plot 1: $\lambda_{out}$ versus $\lambda_{in}$, both up to R/2. Plot 2: delay versus $\lambda_{in}$, up to R/2.]

• large delays as arrival rate, $\lambda_{in}$, approaches capacity

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: $\lambda_{in} = \lambda_{out}$
  – transport-layer input includes retransmissions: $\lambda'_{in} \ge \lambda_{in}$

[Figure: Host A and Host B share finite output link buffers at one router. $\lambda_{in}$: original data; $\lambda'_{in}$: original data, plus retransmitted data; output rate $\lambda_{out}$.]

idealization: perfect knowledge
• sender sends only when router buffers available
• [Plot: $\lambda_{out}$ versus $\lambda_{in}$, both axes up to R/2.]

idealization: known loss. packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
• [Plot: $\lambda_{out}$ versus $\lambda'_{in}$, both axes up to R/2.] when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)

realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• [Plot: $\lambda_{out}$ versus $\lambda'_{in}$, both axes up to R/2.] when sending at R/2, some packets are retransmissions, including duplicates that are delivered

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  – decreasing goodput

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as $\lambda_{in}$ and $\lambda'_{in}$ increase?

A: as red $\lambda'_{in}$ increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

[Figure: Hosts A, B, C and D send over multihop paths through routers with finite shared output link buffers. $\lambda_{in}$: original data; $\lambda'_{in}$: original data, plus retransmitted data; output rate $\lambda_{out}$.]

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted

[Plot: throughput by blue traffic ($\lambda_{out}$) versus offered load by Host A ($\lambda'_{in}$), both axes up to R/2. Annotation: bandwidth wastage for packets dropped at the 2nd router.]

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss

[Figure: AIMD sawtooth behavior, probing for bandwidth. cwnd (TCP sender congestion window size) versus time: additively increase window size until loss occurs, then cut window in half.]
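The sawtooth can be reproduced with a short trace. This is a minimal sketch: the function name `aimd_trace`, the per-RTT granularity, and supplying loss events as a set of RTT indices are simplifying assumptions.

```python
def aimd_trace(cwnd, rtts, loss_rtts, mss=1):
    """Return cwnd (in units of `mss`) at the start of each RTT.

    `loss_rtts` holds the RTT indices during which a loss is detected.
    """
    trace = []
    for rtt in range(rtts):
        trace.append(cwnd)
        if rtt in loss_rtts:
            cwnd = max(mss, cwnd / 2)  # multiplicative decrease
        else:
            cwnd = cwnd + mss          # additive increase: 1 MSS per RTT
    return trace

print(aimd_trace(8, 6, {2}))
```

The window climbs 8, 9, 10, is halved to 5 after the loss in the third RTT, then climbs again.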

TCP Congestion Control: details

• sender limits transmission:

$\mathrm{LastByteSent} - \mathrm{LastByteAcked} \le \mathrm{cwnd}$

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

$\mathrm{rate} \approx \dfrac{\mathrm{cwnd}}{\mathrm{RTT}}$ bytes/sec

[Figure: sender sequence number space. Between last byte ACKed and last byte sent lies the sent, not-yet ACKed ("in-flight") data, bounded by cwnd.]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received

• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A sends one segment, then two segments, then four segments to Host B in successive RTTs.]

TCP: detecting, reacting to loss

• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly

• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly

• TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate acks)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?

A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

initial transition (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; enter slow start

slow start:
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: no action (Λ); go to congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; stay in slow start

congestion avoidance:
• new ACK: cwnd = cwnd + MSS * (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start

fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
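The three-state summary can be sketched as a small state machine that tracks only cwnd, ssthresh and the duplicate-ACK count. This is an illustrative sketch: the class name `RenoCwnd`, the state strings, and the MSS value of 1460 bytes are assumptions, and actual retransmission and segment sending are left out.

```python
MSS = 1460  # bytes; illustrative value, not taken from the slides

class RenoCwnd:
    """Tracks cwnd across slow start, congestion avoidance and fast recovery."""

    def __init__(self):
        self.state = "slow_start"
        self.cwnd = MSS
        self.ssthresh = 64 * 1024
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh           # deflate the window
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += MSS                    # doubles cwnd every RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += MSS * MSS / self.cwnd  # about 1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += MSS
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                  # retransmit missing segment here
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast_recovery"

    def on_timeout(self):                       # retransmit missing segment here
        self.ssthresh = self.cwnd / 2
        self.cwnd = MSS
        self.dup_acks = 0
        self.state = "slow_start"
```

Feeding it three new ACKs and then three duplicate ACKs moves it from slow start into fast recovery with ssthresh set to half the window held at the moment of loss.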

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send

• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is 3/4 W
  – avg. throughput is 3/4 W per RTT

$\text{avg TCP throughput} = \dfrac{3}{4}\,\dfrac{W}{RTT}$ bytes/sec

[Figure: sawtooth of window size oscillating between W/2 and W, with average 3/4 W.]

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

$\text{TCP throughput} = \dfrac{1.22 \cdot MSS}{RTT\,\sqrt{L}}$

• to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate
• new versions of TCP for high-speed
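The two numbers in this example can be checked directly. The helper names below are hypothetical; the second function simply solves the throughput formula above for L, with MSS converted to bits.

```python
def window_segments(rate_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain rate_bps over a path with rtt_s."""
    return rate_bps * rtt_s / (mss_bytes * 8)

def required_loss_rate(rate_bps, rtt_s, mss_bytes):
    """Solve rate = 1.22 * MSS / (RTT * sqrt(L)) for L, with MSS in bits."""
    return (1.22 * mss_bytes * 8 / (rtt_s * rate_bps)) ** 2

print(int(window_segments(10e9, 0.1, 1500)))   # in-flight segments
print(required_loss_rate(10e9, 0.1, 1500))     # loss probability L
```

The loss rate comes out at roughly 2.1e-10, which the slide rounds to 2·10^-10.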

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput versus Connection 1 throughput, both axes up to R, showing the equal bandwidth share line and the full bandwidth utilization line. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2". Points (X1, Y1) on the full-utilization line satisfy X1 + Y1 = R; points (X2, Y2) on the equal-share line satisfy X2 = Y2.]
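The convergence argument can be tried numerically. This is an idealized sketch: the function name is hypothetical, both sessions are assumed to have the same RTT, to increase by one unit per round, and to detect loss in the same round whenever their combined rate exceeds the capacity.

```python
def two_session_aimd(x, y, capacity, rounds):
    """Evolve the throughputs of two AIMD sessions sharing one bottleneck."""
    for _ in range(rounds):
        if x + y > capacity:
            x, y = x / 2, y / 2  # loss: both halve, which also halves the gap
        else:
            x, y = x + 1, y + 1  # additive increase: gap unchanged
    return x, y

x, y = two_session_aimd(1, 80, capacity=100, rounds=500)
print(abs(x - y))
```

Additive increase leaves the difference between the two rates unchanged, while every halving cuts it in half, so the pair drifts toward the equal bandwidth share line.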

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2
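The parallel-connection arithmetic can be checked with exact fractions. The helper name `app_share` is hypothetical, and the sketch assumes every connection on the link gets an equal share.

```python
from fractions import Fraction

def app_share(existing, new):
    """Fraction of link rate R obtained by an app that opens `new` connections."""
    return Fraction(new, existing + new)

print(app_share(9, 1), app_share(9, 11))
```

With 11 connections out of 20 the app gets 11/20 of R, which the slide rounds to R/2.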

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and reaches the destination with ECN=11; the destination returns a TCP ACK segment with ECE=1.]


rdt2.0: operation with no errors

sender:
• state "Wait for call from above":
  – rdt_send(data): sndpkt = make_pkt(data, checksum); udt_send(sndpkt); go to "Wait for ACK or NAK"
• state "Wait for ACK or NAK":
  – rdt_rcv(rcvpkt) && isNAK(rcvpkt): udt_send(sndpkt)
  – rdt_rcv(rcvpkt) && isACK(rcvpkt): no action (Λ); go to "Wait for call from above"

receiver:
• state "Wait for call from below":
  – rdt_rcv(rcvpkt) && corrupt(rcvpkt): udt_send(NAK)
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt): extract(rcvpkt,data); deliver_data(data); udt_send(ACK)

rdt2.0: error scenario

[Same sender and receiver FSM as for rdt2.0 operation with no errors. Path for a corrupted packet: sender rdt_send(data), sndpkt = make_pkt(data, checksum), udt_send(sndpkt); receiver rdt_rcv(rcvpkt) && corrupt(rcvpkt), udt_send(NAK); sender rdt_rcv(rcvpkt) && isNAK(rcvpkt), udt_send(sndpkt); receiver rdt_rcv(rcvpkt) && notcorrupt(rcvpkt), extract(rcvpkt,data), deliver_data(data), udt_send(ACK).]

rdt2.0 has a fatal flaw

what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver
• can't just retransmit: possible duplicate

handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt

stop and wait: sender sends one packet, then waits for receiver response

rdt2.1: sender, handles garbled ACK/NAKs

• state "Wait for call 0 from above":
  – rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to "Wait for ACK or NAK 0"
• state "Wait for ACK or NAK 0":
  – rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)): udt_send(sndpkt)
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt): no action (Λ); go to "Wait for call 1 from above"
• state "Wait for call 1 from above":
  – rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); go to "Wait for ACK or NAK 1"
• state "Wait for ACK or NAK 1":
  – rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)): udt_send(sndpkt)
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt): no action (Λ); go to "Wait for call 0 from above"

rdt2.1: receiver, handles garbled ACK/NAKs

• state "Wait for 0 from below":
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt): extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); go to "Wait for 1 from below"
  – rdt_rcv(rcvpkt) && corrupt(rcvpkt): sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
• state "Wait for 1 from below":
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt); go to "Wait for 0 from below"
  – rdt_rcv(rcvpkt) && corrupt(rcvpkt): sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt): sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)

rdt2.1: Example 1

[Animation over the rdt2.1 sender and receiver FSMs, one highlighted transition per frame:]
1. sender: rdt_send(data); sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
2. receiver: rdt_rcv(rcvpkt) && corrupt(rcvpkt); sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)); udt_send(sndpkt)
4. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
5. sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt); no action (Λ)
6. final frame: no transition highlighted

rdt2.1: Example 2

[Animation over the rdt2.1 sender and receiver FSMs, one highlighted transition per frame:]
1. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
2. sender: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)); udt_send(sndpkt)
3. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
4. sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt); no action (Λ)
5. final frame: no transition highlighted

rdt2.1: discussion

sender:
• seq # added to pkt
• two seq. #'s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  – state must "remember" whether "expected" pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  – state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender

rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  – receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt

rdt2.2: sender, receiver fragments

sender FSM fragment:
• state "Wait for call 0 from above":
  – rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to "Wait for ACK 0"
• state "Wait for ACK 0":
  – rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)): udt_send(sndpkt)
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0): no action (Λ)

receiver FSM fragment:
• transition into "Wait for 0 from below":
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
• state "Wait for 0 from below":
  – rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)): udt_send(sndpkt)

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
  – checksum, seq. #, ACKs, retransmissions will be of help ... but not enough

approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  – retransmission will be duplicate, but seq. #'s already handle this
  – receiver must specify seq # of pkt being ACKed
• requires countdown timer

rdt3.0 sender

• state "Wait for call 0 from above":
  – rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK0"
  – rdt_rcv(rcvpkt): no action (Λ)
• state "Wait for ACK0":
  – rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)): no action (Λ)
  – timeout: udt_send(sndpkt); start_timer
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0): stop_timer; go to "Wait for call 1 from above"
• state "Wait for call 1 from above":
  – rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK1"
  – rdt_rcv(rcvpkt): no action (Λ)
• state "Wait for ACK1":
  – rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,0)): no action (Λ)
  – timeout: udt_send(sndpkt); start_timer
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1): stop_timer; go to "Wait for call 0 from above"

rdt3.0 in action

(a) no loss
• sender: send pkt0; rcv ack0, send pkt1; rcv ack1, send pkt0
• receiver: rcv pkt0, send ack0; rcv pkt1, send ack1; rcv pkt0, send ack0

(b) packet loss
• sender: send pkt0; rcv ack0, send pkt1 (pkt1 lost, X); timeout, resend pkt1; rcv ack1, send pkt0
• receiver: rcv pkt0, send ack0; rcv pkt1, send ack1; rcv pkt0, send ack0

(c) ACK loss
• sender: send pkt0; rcv ack0, send pkt1; timeout, resend pkt1; rcv ack1, send pkt0
• receiver: rcv pkt0, send ack0; rcv pkt1, send ack1 (ack1 lost, X); rcv pkt1 (detect duplicate), send ack1; rcv pkt0, send ack0

(d) premature timeout / delayed ACK
• sender: send pkt0; rcv ack0, send pkt1; timeout, resend pkt1; rcv ack1, send pkt0; rcv ack1, do nothing; rcv ack0, send pkt1
• receiver: rcv pkt0, send ack0; rcv pkt1, send ack1; rcv pkt1 (detect duplicate), send ack1; rcv pkt0, send ack0
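Scenarios (a) to (c) can be replayed with a tiny alternating-bit simulation. This is a sketch under stated assumptions: the function name, and modelling loss as sets of transmission numbers (counted from 1) whose data packet or returning ACK is dropped, are hypothetical; every loss is followed by a timeout and a resend.

```python
def stop_and_wait(messages, data_drops=(), ack_drops=()):
    """Deliver messages in order using seq #s 0 and 1; return (delivered, transmissions)."""
    delivered = []
    expected = 0  # receiver: seq # it will accept as new
    seq = 0       # sender: seq # of the packet currently being sent
    tx = 0        # number of data-packet transmissions so far
    for msg in messages:
        acked = False
        while not acked:
            tx += 1
            if tx in data_drops:
                continue                 # pkt lost: sender times out and resends
            if seq == expected:          # new packet: deliver and flip expectation
                delivered.append(msg)
                expected ^= 1
            # otherwise a duplicate: discarded, but the receiver still ACKs it
            acked = tx not in ack_drops  # lost ACK: sender times out and resends
        seq ^= 1
    return delivered, tx

print(stop_and_wait(["a", "b"], ack_drops={2}))
```

With the ACK of the second transmission lost, "b" is sent twice but delivered once, which is the duplicate detection shown in (c).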

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

$D_{trans} = \dfrac{L}{R} = \dfrac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \text{microsecs}$

  – $U_{sender}$: utilization, fraction of time sender busy sending

$U_{sender} = \dfrac{L/R}{RTT + L/R} = \dfrac{0.008}{30.008} = 0.00027$

  – if RTT = 30 msec, 1KB pkt every 30 msec: 33kB/sec throughput over 1 Gbps link

• network protocol limits use of physical resources

rdt3.0: stop-and-wait operation

[Figure: sender/receiver timeline. First packet bit transmitted, t = 0; last packet bit transmitted, t = L / R; first packet bit arrives; last packet bit arrives, send ACK; ACK arrives, send next packet, t = RTT + L / R.]

$U_{sender} = \dfrac{L/R}{RTT + L/R} = \dfrac{0.008}{30.008} = 0.00027$

Pipelined protocols

pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver

• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

[Figure: sender/receiver timeline. First packet bit transmitted, t = 0; last bit transmitted, t = L / R; first packet bit arrives; last packet bit arrives, send ACK; last bit of 2nd packet arrives, send ACK; last bit of 3rd packet arrives, send ACK; ACK arrives, send next packet, t = RTT + L / R.]

3-packet pipelining increases utilization by a factor of 3

$U_{sender} = \dfrac{3L/R}{RTT + L/R} = \dfrac{0.024}{30.008} \approx 0.0008$
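Both utilization figures follow from one formula. The function name and the `window` parameter (number of packets sent back-to-back per RTT) are assumptions made for this sketch; times are in seconds.

```python
def sender_utilization(packet_bits, rate_bps, rtt_s, window=1):
    """Fraction of time the sender is busy when `window` packets go out per RTT."""
    d_trans = packet_bits / rate_bps            # L / R
    return min(1.0, window * d_trans / (rtt_s + d_trans))

print(sender_utilization(8000, 1e9, 0.030))            # stop-and-wait
print(sender_utilization(8000, 1e9, 0.030, window=3))  # 3-packet pipelining
```

The exact quotient for three packets is about 0.0008, three times the stop-and-wait value of about 0.00027.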

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  – doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  – when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  – when timer expires, retransmit only that unacked packet

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: "cumulative ACK"
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

initial transition (Λ): base = 1; nextseqnum = 1

state: Wait

rdt_send(data):
  if (nextseqnum < base+N) {
      sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
      udt_send(sndpkt[nextseqnum])
      if (base == nextseqnum)
          start_timer
      nextseqnum++
  }
  else
      refuse_data(data)

timeout:
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  ...
  udt_send(sndpkt[nextseqnum-1])

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
      stop_timer
  else
      start_timer

rdt_rcv(rcvpkt) && corrupt(rcvpkt):
  no action (Λ)
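The sender FSM translates almost line for line into a small class. This is an illustrative sketch: the class name `GbnSender`, the `wire` log standing in for udt_send, the boolean timer, and the `max` guard against a stale ACK moving base backwards are assumptions added here, not part of the FSM.

```python
class GbnSender:
    """Go-Back-N sender with window N; seq #s start at 1 as in the FSM."""

    def __init__(self, window):
        self.N = window
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}            # seq # -> data, kept for retransmission
        self.timer_running = False  # stand-in for the single timer
        self.wire = []              # seq #s handed to the channel, in order

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False            # window full: refuse_data
        self.sndpkt[self.nextseqnum] = data
        self.wire.append(self.nextseqnum)
        if self.base == self.nextseqnum:
            self.timer_running = True
        self.nextseqnum += 1
        return True

    def timeout(self):
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.wire.append(seq)   # go back N: resend every unacked packet

    def rcv_ack(self, acknum):
        # Guard (an addition to the FSM): never move base backwards.
        self.base = max(self.base, acknum + 1)
        self.timer_running = self.base != self.nextseqnum
```

With N=4, a fifth rdt_send is refused until an ACK slides the window, and a timeout after ACK 2 resends packets 3 and 4.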


GBN: receiver extended FSM

initial transition (Λ): expectedseqnum = 1; sndpkt = make_pkt(0, ACK, chksum)

state: Wait

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++

default:
  udt_send(sndpkt)

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum

• out-of-order pkt:
  – discard (don't buffer): no receiver buffering
  – re-ACK pkt with highest in-order seq #


GBN in action

sender window (N=4), seq #s 0 1 2 3 4 5 6 7 8

sender:
• send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
• rcv ack0, send pkt4
• rcv ack1, send pkt5
• ignore duplicate ACK
• pkt 2 timeout: send pkt2, send pkt3, send pkt4, send pkt5

receiver:
• receive pkt0, send ack0
• receive pkt1, send ack1
• receive pkt3, discard, (re)send ack1
• receive pkt4, discard, (re)send ack1
• receive pkt5, discard, (re)send ack1
• rcv pkt2, deliver, send ack2
• rcv pkt3, deliver, send ack3
• rcv pkt4, deliver, send ack4
• rcv pkt5, deliver, send ack5
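The receiver side of this trace can be replayed in a few lines. The function name is hypothetical, and numbering starts at 0 here to match the trace (the receiver FSM itself starts at 1); corruption is not modelled.

```python
def gbn_receiver(arrivals, first_seq=0):
    """Return (delivered seq #s, ACK sent for each arrival) for a GBN receiver."""
    expected = first_seq
    delivered, acks = [], []
    for seq in arrivals:
        if seq == expected:        # in-order: deliver and advance
            delivered.append(seq)
            expected += 1
        # out-of-order packets are discarded; either way, ACK the
        # highest in-order seq # received so far
        acks.append(expected - 1)
    return delivered, acks

# pkt2 is lost the first time, then pkt2..pkt5 are resent after the timeout.
print(gbn_receiver([0, 1, 3, 4, 5, 2, 3, 4, 5]))
```

The three discarded packets each trigger a repeated ack1, matching the duplicate ACKs the sender ignores.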


Selective repeat

• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #'s
  – limits seq #s of sent, unACKed packets

Selective repeat: sender, receiver windows

Selective repeat

sender:

data from above:
• if next available seq # in window, send pkt

timeout(n):
• resend pkt n, restart timer

ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver:

pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt

pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)

otherwise:
• ignore
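The receiver rules can be sketched as below. This is a minimal sketch: the class name `SrReceiver` is hypothetical, and sequence numbers are treated as unbounded integers, so wraparound of a k-bit sequence space is deliberately not modelled.

```python
class SrReceiver:
    """Selective repeat receiver with window N (no seq # wraparound)."""

    def __init__(self, window, base=0):
        self.N = window
        self.rcvbase = base
        self.buffer = {}      # out-of-order packets awaiting delivery
        self.delivered = []

    def on_pkt(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcvbase <= n <= self.rcvbase + self.N - 1:
            self.buffer[n] = data
            # Deliver the in-order run, if any, and advance the window.
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n <= self.rcvbase - 1:
            return n          # already delivered: re-ACK so the sender can move on
        return None
```

Feeding it packets 0, 1, 3, 4, 5 and then 2 buffers 3 to 5 and releases all four packets at once when 2 arrives.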

Selective repeat in action

sender window (N=4), seq #s 0 1 2 3 4 5 6 7 8

sender:
• send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
• rcv ack0, send pkt4
• rcv ack1, send pkt5
• record ack3 arrived
• pkt 2 timeout: send pkt2
• record ack4 arrived
• record ack5 arrived

receiver:
• receive pkt0, send ack0
• receive pkt1, send ack1
• receive pkt3, buffer, send ack3
• receive pkt4, buffer, send ack4
• receive pkt5, buffer, send ack5
• rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #s: 0, 1, 2, 3
• window size = 3

[Figure: two timelines over the seq # sequence 0 1 2 3 0 1 2, each showing the sender window (after receipt) and the receiver window (after receipt)]

(a) no problem: pkt0, pkt1, pkt2 are sent and ACKed; pkt3 is lost; the sender then sends a new pkt0; the receiver will accept a packet with seq number 0

(b) oops!: pkt0, pkt1, pkt2 are sent but all three ACKs are lost; the sender times out and retransmits pkt0; the receiver will accept a packet with seq number 0

receiver can't see sender side; receiver behavior identical in both cases: something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
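One way to explore the question is to check when the receiver's new window can contain a sequence number that the sender might still retransmit. The sketch below assumes the worst case from scenario (b): a full window is received but every ACK is lost. The function name `sr_is_ambiguous` is made up for this example.

```python
def sr_is_ambiguous(seq_space, window):
    """True if a retransmitted old packet could be mistaken for a new one.

    Worst case: the sender's window covers packets 0..window-1, the receiver
    got all of them and moved its window to window..2*window-1, but every ACK
    was lost, so the sender may retransmit any of the old packets.
    """
    sender_old = {s % seq_space for s in range(0, window)}
    receiver_new = {s % seq_space for s in range(window, 2 * window)}
    return bool(sender_old & receiver_new)


print(sr_is_ambiguous(seq_space=4, window=3))  # the slide's example
print(sr_is_ambiguous(seq_space=4, window=2))
```

Under this model the overlap disappears exactly when the window size is at most half the size of the sequence-number space.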

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver

TCP segment structure

[Figure: TCP segment layout, 32 bits wide]
• source port #, dest port #
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• head len, not used, flag bits U A P R S F
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
• Urg data pointer
• options (variable length)
• application data (variable length)

flag bits:
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
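To make the layout concrete, the fixed 20-byte part of the header can be unpacked with Python's standard `struct` module. This is an illustrative sketch: the names `TCPHeader`, `FLAG_NAMES` and `parse_tcp_header` are invented here, options are not decoded, and only the six flag bits named on the slide are reported.

```python
import struct
from collections import namedtuple

TCPHeader = namedtuple(
    "TCPHeader",
    "src_port dst_port seq ack header_len flags rwnd checksum urg_ptr",
)

# Flag bits from least significant upward (F S R P A U on the slide).
FLAG_NAMES = ["FIN", "SYN", "RST", "PSH", "ACK", "URG"]


def parse_tcp_header(segment):
    """Parse the fixed 20-byte TCP header at the start of `segment` (bytes)."""
    (src, dst, seq, ack, offset_byte, flag_bits,
     rwnd, checksum, urg_ptr) = struct.unpack("!HHIIBBHHH", segment[:20])
    header_len = (offset_byte >> 4) * 4   # head len is counted in 32-bit words
    flags = {name for i, name in enumerate(FLAG_NAMES) if flag_bits & (1 << i)}
    return TCPHeader(src, dst, seq, ack, header_len, flags, rwnd, checksum, urg_ptr)


# Hand-built segment: Seq=42, ACK=79, SYN+ACK set, 1 byte of data.
raw = struct.pack("!HHIIBBHHH", 12345, 80, 42, 79, 5 << 4, 0x12, 4096, 0, 0) + b"C"
hdr = parse_tcp_header(raw)
payload = raw[hdr.header_len:]
```

The application data starts at offset `header_len`, which is larger than 20 whenever options are present.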

TCP seq. numbers, ACKs

sequence numbers:
- byte stream "number" of first byte in segment's data

acknowledgements:
- seq # of next byte expected from other side
- cumulative ACK

Q: how does the receiver handle out-of-order segments?
- A: TCP spec doesn't say; up to implementor

[Figure: outgoing segment from sender and incoming segment to sender (source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer), drawn above the sender sequence number space with window size N: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable]

Byte stream in TCP

[Figure labels: Window: N bytes; HTTP GET message (K bytes); 100th byte; TCP header (seq no = 100); M bytes; cannot be transmitted now]

TCP seq. numbers, ACKs

simple telnet scenario

Host A -> Host B: Seq=42, ACK=79, data = 'C'   (user types 'C')
Host B -> Host A: Seq=79, ACK=43, data = 'C'   (host ACKs receipt of 'C', echoes back 'C')
Host A -> Host B: Seq=43, ACK=80               (host ACKs receipt of echoed 'C')

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT

TCP round trip time, timeout

EstimatedRTT = (1 - α)·EstimatedRTT + α·SampleRTT

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Figure: SampleRTT and EstimatedRTT, RTT (milliseconds, 100 to 350) versus time (seconds), for gaia.cs.umass.edu to fantasia.eurecom.fr]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

  DevRTT = (1 - β)·DevRTT + β·|SampleRTT - EstimatedRTT|
  (typically, β = 0.25)

  TimeoutInterval = EstimatedRTT + 4·DevRTT
  (EstimatedRTT: estimated RTT; 4·DevRTT: "safety margin")
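The two moving averages and the timeout formula can be tried out with a small Python sketch. Two details here are assumptions rather than something the slides state: DevRTT is updated before EstimatedRTT (so it uses the old estimate), and the initial DevRTT is set to half the first sample. The class name `RTTEstimator` is hypothetical.

```python
class RTTEstimator:
    """EWMA-based RTT estimator following the slide formulas (sketch)."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2   # assumption: initial deviation

    def update(self, sample_rtt):
        # Assumption: deviation is measured against the previous estimate.
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RTTEstimator(first_sample=100.0)   # milliseconds
est.update(200.0)                        # one unusually slow sample
```

A single outlier moves EstimatedRTT only slightly, but it widens the safety margin noticeably, which is the intended effect.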

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks

let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control

TCP sender events

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments

TCP sender (simplified)

initially (Λ):
  NextSeqNum = InitialSeqNum
  SendBase = InitialSeqNum

state: wait for event

event: data received from application above
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., "send")
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
    start timer

event: timeout
  retransmit not-yet-acked segment with smallest seq. #
  start timer

event: ACK received, with ACK field value y
  if (y > SendBase) {
    SendBase = y
    /* SendBase-1: last cumulatively ACKed byte */
    if (there are currently not-yet-acked segments)
      start timer
    else stop timer
  }
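The three events of the simplified sender map directly onto three methods. The Python sketch below is only an illustration: the class `SimpleTCPSender`, the `sent` log that stands in for passing segments to IP, and the boolean `timer_running` that stands in for a real timer are all assumptions of this example.

```python
class SimpleTCPSender:
    """Simplified TCP sender: no duplicate-ACK handling, no flow/congestion control."""

    def __init__(self, initial_seq=0):
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.unacked = {}            # seq # of first byte -> segment data
        self.timer_running = False   # stand-in for a real retransmission timer
        self.sent = []               # log of (seq, data) "passed to IP"

    def on_app_data(self, data):
        seq = self.next_seq
        self.unacked[seq] = data
        self.sent.append((seq, data))
        self.next_seq += len(data)
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if self.unacked:
            seq = min(self.unacked)  # not-yet-acked segment with smallest seq #
            self.sent.append((seq, self.unacked[seq]))
            self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y       # send_base - 1: last cumulatively ACKed byte
            for seq in [s for s, d in self.unacked.items() if s + len(d) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)


s = SimpleTCPSender(initial_seq=92)
s.on_app_data(b"a" * 8)     # Seq=92, 8 bytes
s.on_app_data(b"b" * 20)    # Seq=100, 20 bytes
s.on_timeout()              # retransmits only the oldest segment (Seq=92)
s.on_ack(120)               # cumulative ACK covers both segments
```

Because ACKs are cumulative, a single ACK=120 clears both segments even if ACK=100 never arrives.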

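TCP: retransmission scenarios

lost ACK scenario
Host A -> Host B: Seq=92, 8 bytes of data   (SendBase=92)
Host B -> Host A: ACK=100                   (lost)
Host A: timeout
Host A -> Host B: Seq=92, 8 bytes of data
Host B -> Host A: ACK=100                   (SendBase=100)

premature timeout
Host A -> Host B: Seq=92, 8 bytes of data   (SendBase=92)
Host A -> Host B: Seq=100, 20 bytes of data
Host B -> Host A: ACK=100
Host B -> Host A: ACK=120
Host A: timeout before the ACKs arrive
Host A -> Host B: Seq=92, 8 bytes of data
Host A: ACK=100 arrives (SendBase=100), ACK=120 arrives (SendBase=120)
Host B -> Host A: ACK=120                   (SendBase=120)

cumulative ACK
Host A -> Host B: Seq=92, 8 bytes of data
Host A -> Host B: Seq=100, 20 bytes of data
Host B -> Host A: ACK=100                   (lost)
Host B -> Host A: ACK=120                   (arrives before timeout)
Host A -> Host B: Seq=120, 15 bytes of data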

TCP ACK generation [RFC 5681]

event at receiver: arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  TCP receiver action: delayed ACK. Wait up to 500ms for next segment. If no next segment, send ACK

event at receiver: arrival of in-order segment with expected seq #; one other segment has ACK pending
  TCP receiver action: immediately send single cumulative ACK, ACKing both in-order segments

event at receiver: arrival of out-of-order segment with higher-than-expected seq #; gap detected
  TCP receiver action: immediately send duplicate ACK, indicating seq # of next expected byte

event at receiver: arrival of segment that partially or completely fills gap
  TCP receiver action: immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout

fast retransmit after sender receipt of triple duplicate ACK:
Host A -> Host B: Seq=92, 8 bytes of data
Host A -> Host B: Seq=100, 20 bytes of data   (lost)
Host B -> Host A: ACK=100
Host B -> Host A: ACK=100
Host B -> Host A: ACK=100
Host B -> Host A: ACK=100                     (3 DUP ACKs)
Host A -> Host B: Seq=100, 20 bytes of data   (resent before the timeout expires)

TCP flow control

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

application may remove data from TCP socket buffers ... slower than TCP receiver is delivering (sender is sending)

[Figure: receiver protocol stack: data from sender passes through IP code and TCP code into the TCP socket receiver buffers (OS), from which the application process reads]

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering: TCP segment payloads enter RcvBuffer, which holds buffered data and free buffer space (rwnd); data leaves to the application process]

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

client side:
  Socket clientSocket = new Socket("hostname", "port number");

server side:
  Socket connectionSocket = welcomeSocket.accept();

state kept at each end (application above, network below):
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client

TCP 3-way handshake

client state: CLOSED; server state: LISTEN

1. client: choose init seq num, x; send TCP SYN msg
   client -> server: SYNbit=1, Seq=x
   client state: SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN
   server -> client: SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1
   server state: SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK; this segment may contain client-to-server data
   client -> server: ACKbit=1, ACKnum=y+1
   client state: ESTAB
4. server: received ACK(y) indicates client is live
   server state: ESTAB

TCP 3-way handshake: FSM

client side:
  closed -> SYN sent
    Socket clientSocket = new Socket("hostname", "port number");
    send SYN(seq=x)
  SYN sent -> ESTAB
    receive SYNACK(seq=y, ACKnum=x+1)
    send ACK(ACKnum=y+1)

server side:
  closed -> listen
    Socket connectionSocket = welcomeSocket.accept();
    Λ
  listen -> SYN rcvd
    receive SYN(x)
    send SYNACK(seq=y, ACKnum=x+1)
    create new socket for communication back to client
  SYN rcvd -> ESTAB
    receive ACK(ACKnum=y+1)
    Λ

TCP: closing a connection

• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

client state: ESTAB; server state: ESTAB

1. client: clientSocket.close()
   client -> server: FINbit=1, seq=x
   client state: FIN_WAIT_1 (can no longer send but can receive data)
2. server -> client: ACKbit=1, ACKnum=x+1
   server state: CLOSE_WAIT (can still send data)
   client state: FIN_WAIT_2 (wait for server close)
3. server -> client: FINbit=1, seq=y
   server state: LAST_ACK (can no longer send data)
4. client -> server: ACKbit=1, ACKnum=y+1
   client state: TIMED_WAIT (timed wait for 2 * max segment lifetime), then CLOSED
   server state: CLOSED

The "Two Army Problem"

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission

[Figure: Host A and Host B send original data at rate λin through one router with unlimited shared output link buffers; throughput: λout]

• maximum per-connection throughput: R/2
• large delays as arrival rate, λin, approaches capacity

[Figure: two plots against λin (up to R/2): λout, which levels off at R/2, and delay]

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λin = λout
  - transport-layer input includes retransmissions: λ'in >= λin

[Figure: Host A and Host B send through one router with finite shared output link buffers; λin: original data; λ'in: original data, plus retransmitted data; λout]

Causes/costs of congestion: scenario 2

idealization: perfect knowledge
• sender sends only when router buffers available

[Figure: Host A keeps a copy of each packet and sends only when there is free buffer space in the finite shared output link buffers; λin: original data; λ'in: original data, plus retransmitted data; λout; plot of λout versus λin, both up to R/2]

Causes/costs of congestion: scenario 2

Idealization: known loss
packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost

[Figure: Host A keeps a copy of each packet; a packet arriving when there is no buffer space is dropped and resent once free buffer space is available; plot of λout versus λ'in, both up to R/2]

when sending at R/2, some packets are retransmissions but asymptotic goodput is still R/2 (why?)

Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered

[Figure: Host A times out while free buffer space is still available and sends a second copy; plot of λout versus λ'in, both up to R/2]

when sending at R/2, some packets are retransmissions including duplicated that are delivered!

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λin and λ'in increase?
A: as red λ'in increases, all arriving blue pkts at upper queue are dropped, blue throughput -> 0

[Figure: Hosts A, B, C and D send over multihop paths through routers with finite shared output link buffers; λin: original data; λ'in: original data, plus retransmitted data; λout]

[Figure: throughput by blue traffic (λout, up to R/2) versus offered load by Host A (λ'in, up to R/2), annotated "bandwidth wastage for packets dropped at the 2nd router"]

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss

[Figure: cwnd (TCP sender congestion window size) versus time; AIMD sawtooth behavior, probing for bandwidth: additively increase window size ... until loss occurs (then cut window in half)]
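The sawtooth can be reproduced with a tiny per-RTT simulation. This is a sketch under simplifying assumptions: cwnd is counted in MSS, time advances one RTT per step, and losses are supplied by the caller as a set of round numbers rather than produced by a network model. The function name `aimd_trace` is made up for the example.

```python
def aimd_trace(rounds, loss_rounds, cwnd=1.0):
    """Return cwnd (in MSS) at the start of each RTT round under AIMD."""
    trace = []
    for r in range(rounds):
        trace.append(cwnd)
        if r in loss_rounds:
            cwnd = max(1.0, cwnd / 2)   # multiplicative decrease
        else:
            cwnd += 1.0                 # additive increase: +1 MSS per RTT
    return trace


trace = aimd_trace(rounds=8, loss_rounds={3, 6}, cwnd=4.0)
# trace == [4.0, 5.0, 6.0, 7.0, 3.5, 4.5, 5.5, 2.75]
```

Plotting `trace` against the round number gives the linear ramps and halving steps described on the slide.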

TCP Congestion Control: details

• sender limits transmission:

  LastByteSent - LastByteAcked <= cwnd

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

  rate ≈ cwnd / RTT bytes/sec

[Figure: sender sequence number space: last byte ACKed; sent, not-yet ACKed ("in-flight"); last byte sent; the in-flight span is bounded by cwnd]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A sends one segment to Host B, then two segments one RTT later, then four segments one RTT after that]

TCP: detecting, reacting to loss

• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

initially (Λ):
  cwnd = 1 MSS
  ssthresh = 64 KB
  dupACKcount = 0
  -> slow start

state: slow start
  new ACK
    cwnd = cwnd + MSS
    dupACKcount = 0
    transmit new segment(s), as allowed
  duplicate ACK
    dupACKcount++
  cwnd > ssthresh
    Λ
    -> congestion avoidance
  dupACKcount == 3
    ssthresh = cwnd/2
    cwnd = ssthresh + 3
    retransmit missing segment
    -> fast recovery
  timeout
    ssthresh = cwnd/2
    cwnd = 1 MSS
    dupACKcount = 0
    retransmit missing segment

state: congestion avoidance
  new ACK
    cwnd = cwnd + MSS · (MSS/cwnd)
    dupACKcount = 0
    transmit new segment(s), as allowed
  duplicate ACK
    dupACKcount++
  dupACKcount == 3
    ssthresh = cwnd/2
    cwnd = ssthresh + 3
    retransmit missing segment
    -> fast recovery
  timeout
    ssthresh = cwnd/2
    cwnd = 1 MSS
    dupACKcount = 0
    retransmit missing segment
    -> slow start

state: fast recovery
  duplicate ACK
    cwnd = cwnd + MSS
    transmit new segment(s), as allowed
  new ACK
    cwnd = ssthresh
    dupACKcount = 0
    -> congestion avoidance
  timeout
    ssthresh = cwnd/2
    cwnd = 1
    dupACKcount = 0
    retransmit missing segment
    -> slow start
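The state machine above can be written as a small Python class to trace how cwnd and ssthresh evolve. This is a sketch with stated assumptions: cwnd and ssthresh are counted in MSS (so `MSS = 1`), the "+ 3" on entering fast recovery is taken to mean 3 MSS, and segment (re)transmission itself is left out. The class name `RenoCongestionControl` is hypothetical.

```python
MSS = 1.0   # assumption: window sizes are counted in units of MSS


class RenoCongestionControl:
    """Slow start / congestion avoidance / fast recovery transitions (sketch)."""

    def __init__(self, ssthresh=64.0):
        self.state = "slow_start"
        self.cwnd = 1 * MSS
        self.ssthresh = ssthresh
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += MSS                       # doubles cwnd per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += MSS * (MSS / self.cwnd)   # about +1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += MSS
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                     # fast retransmit point
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1 * MSS
        self.dup_acks = 0
        self.state = "slow_start"
```

Driving the object with a sequence of `on_new_ack`, `on_dup_ack` and `on_timeout` calls shows the exponential ramp, the switch to linear growth, and the two different reactions to loss.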

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT

  avg TCP throughput = (3/4) · W / RTT bytes/sec

[Figure: sawtooth of window size oscillating between W/2 and W, averaging 3/4 W]

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

  TCP throughput = (1.22 · MSS) / (RTT · sqrt(L))

  - to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
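The two numbers in this example follow from simple arithmetic, reproduced here as a quick check. The function names are made up for the sketch; the second function just solves the throughput formula above for L.

```python
def window_segments(throughput_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain `throughput_bps` over one RTT."""
    return throughput_bps * rtt_s / (mss_bytes * 8)


def required_loss_rate(throughput_bps, rtt_s, mss_bytes):
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L."""
    mss_bits = mss_bytes * 8
    return (1.22 * mss_bits / (rtt_s * throughput_bps)) ** 2


W = window_segments(10e9, 0.100, 1500)       # about 83,333 segments
L = required_loss_rate(10e9, 0.100, 1500)    # about 2e-10
```

Both results match the figures quoted on the slide to the precision given there.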

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput versus Connection 1 throughput, both axes from 0 to R, with the equal bandwidth share line and the full bandwidth utilization line; the trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2"; marked points: (X1, Y1) where X1+Y1 = R, and (X2, Y2) where X2 = Y2]

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical); the IP datagram's ECN field changes from ECN=00 to ECN=11 along the path; the TCP ACK segment is returned with ECE=1]

Page 19: ChapterIII: Transport Layer

rdt2.0: error scenario

sender

state: Wait for call from above
  rdt_send(data)
    sndpkt = make_pkt(data, checksum)
    udt_send(sndpkt)
    -> Wait for ACK or NAK

state: Wait for ACK or NAK
  rdt_rcv(rcvpkt) && isNAK(rcvpkt)
    udt_send(sndpkt)
  rdt_rcv(rcvpkt) && isACK(rcvpkt)
    Λ
    -> Wait for call from above

receiver

state: Wait for call from below
  rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    udt_send(NAK)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
    extract(rcvpkt, data)
    deliver_data(data)
    udt_send(ACK)

rdt2.0 has a fatal flaw!

what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver!
• can't just retransmit: possible duplicate

handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt

stop and wait: sender sends one packet, then waits for receiver response

rdt2.1: sender, handles garbled ACK/NAKs

state: Wait for call 0 from above
  rdt_send(data)
    sndpkt = make_pkt(0, data, checksum)
    udt_send(sndpkt)
    -> Wait for ACK or NAK 0

state: Wait for ACK or NAK 0
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
    udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
    Λ
    -> Wait for call 1 from above

state: Wait for call 1 from above
  rdt_send(data)
    sndpkt = make_pkt(1, data, checksum)
    udt_send(sndpkt)
    -> Wait for ACK or NAK 1

state: Wait for ACK or NAK 1
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
    udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
    Λ
    -> Wait for call 0 from above

rdt2.1: receiver, handles garbled ACK/NAKs

state: Wait for 0 from below
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(ACK, chksum)
    udt_send(sndpkt)
    -> Wait for 1 from below
  rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    sndpkt = make_pkt(NAK, chksum)
    udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
    sndpkt = make_pkt(ACK, chksum)
    udt_send(sndpkt)

state: Wait for 1 from below
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(ACK, chksum)
    udt_send(sndpkt)
    -> Wait for 0 from below
  rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    sndpkt = make_pkt(NAK, chksum)
    udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
    sndpkt = make_pkt(ACK, chksum)
    udt_send(sndpkt)

rdt2.1: Example 1

[Figure sequence: the rdt2.1 sender FSM (Wait for call 0 from above, Wait for ACK or NAK 0, Wait for call 1 from above, Wait for ACK or NAK 1) and receiver FSM (Wait for 0 from below, Wait for 1 from below), with one transition highlighted per step]

1. sender, on rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
2. receiver, on rdt_rcv(rcvpkt) && corrupt(rcvpkt): sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender, on rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)): udt_send(sndpkt)
4. receiver, on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt): extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
5. sender, on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt): Λ

rdt2.1: Example 2

[Figure sequence: the same rdt2.1 sender and receiver FSMs, with one transition highlighted per step]

1. receiver, on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt): extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
2. sender, on rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)): udt_send(sndpkt)
3. receiver (now waiting for 1), on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt): sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
4. sender, on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt): Λ

rdt2.1: discussion

sender:
• seq # added to pkt
• two seq. #s (0, 1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  - state must "remember" whether "expected" pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  - state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender

rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  - receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt

rdt2.2: sender, receiver fragments

sender FSM fragment

state: Wait for call 0 from above
  rdt_send(data)
    sndpkt = make_pkt(0, data, checksum)
    udt_send(sndpkt)
    -> Wait for ACK 0

state: Wait for ACK 0
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1))
    udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0)
    Λ

receiver FSM fragment

state: Wait for 0 from below
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt))
    udt_send(sndpkt)

transition into Wait for 0 from below:
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(ACK1, chksum)
    udt_send(sndpkt)

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
  - checksum, seq. #, ACKs, retransmissions will be of help ... but not enough

approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  - retransmission will be duplicate, but seq. #s already handle this
  - receiver must specify seq # of pkt being ACKed
• requires countdown timer

rdt3.0 sender

state: Wait for call 0 from above
  rdt_send(data)
    sndpkt = make_pkt(0, data, checksum)
    udt_send(sndpkt)
    start_timer
    -> Wait for ACK0
  rdt_rcv(rcvpkt)
    Λ

state: Wait for ACK0
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1))
    Λ
  timeout
    udt_send(sndpkt)
    start_timer
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0)
    stop_timer
    -> Wait for call 1 from above

state: Wait for call 1 from above
  rdt_send(data)
    sndpkt = make_pkt(1, data, checksum)
    udt_send(sndpkt)
    start_timer
    -> Wait for ACK1
  rdt_rcv(rcvpkt)
    Λ

state: Wait for ACK1
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0))
    Λ
  timeout
    udt_send(sndpkt)
    start_timer
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1)
    stop_timer
    -> Wait for call 0 from above

rdt3.0 in action

(a) no loss
sender: send pkt0
receiver: rcv pkt0, send ack0
sender: rcv ack0, send pkt1
receiver: rcv pkt1, send ack1
sender: rcv ack1, send pkt0
receiver: rcv pkt0, send ack0

(b) packet loss
sender: send pkt0
receiver: rcv pkt0, send ack0
sender: rcv ack0, send pkt1 (pkt1 lost)
sender: timeout, resend pkt1
receiver: rcv pkt1, send ack1
sender: rcv ack1, send pkt0
receiver: rcv pkt0, send ack0

(c) ACK loss
sender: send pkt0
receiver: rcv pkt0, send ack0
sender: rcv ack0, send pkt1
receiver: rcv pkt1, send ack1 (ack1 lost)
sender: timeout, resend pkt1
receiver: rcv pkt1 (detect duplicate), send ack1
sender: rcv ack1, send pkt0
receiver: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
sender: send pkt0
receiver: rcv pkt0, send ack0
sender: rcv ack0, send pkt1
receiver: rcv pkt1, send ack1 (ack1 delayed)
sender: timeout, resend pkt1
receiver: rcv pkt1 (detect duplicate), send ack1
sender: rcv ack1, send pkt0
sender: rcv ack1, do nothing
receiver: rcv pkt0, send ack0
sender: rcv ack0, send pkt1

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

  Dtrans = L/R = 8000 bits / 10^9 bits/sec = 8 microsecs

  - U sender: utilization, fraction of time sender busy sending

    U sender = (L/R) / (RTT + L/R) = 0.008 / 30.008 = 0.00027

  - if RTT = 30 msec, 1KB pkt every 30 msec: 33kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
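The utilization figure is easy to reproduce. The helper below is a sketch; its name and the optional `window` parameter (number of packets kept in flight, 1 for stop-and-wait) are assumptions added for illustration.

```python
def sender_utilization(packet_bits, rate_bps, rtt_s, window=1):
    """Fraction of time the sender is busy transmitting."""
    d_trans = packet_bits / rate_bps                 # L / R, in seconds
    return min(1.0, window * d_trans / (rtt_s + d_trans))


L_bits, R_bps, rtt = 8000, 1e9, 0.030                # 15 ms prop. delay each way
u_stop_and_wait = sender_utilization(L_bits, R_bps, rtt)
throughput_bytes_per_s = (L_bits / 8) / (rtt + L_bits / R_bps)
```

With these inputs the sender is busy for roughly 0.027% of the time and moves about 33 kB per second, even though the link itself could carry 1 Gbps.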

rdt3.0: stop-and-wait operation

sender: first packet bit transmitted, t = 0
sender: last packet bit transmitted, t = L/R
receiver: first packet bit arrives
receiver: last packet bit arrives, send ACK
sender: ACK arrives, send next packet, t = RTT + L/R

U sender = (L/R) / (RTT + L/R) = 0.008 / 30.008 = 0.00027

Pipelined protocols

pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver

• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

sender: first packet bit transmitted, t = 0
sender: last bit transmitted, t = L/R
receiver: first packet bit arrives
receiver: last packet bit arrives, send ACK
receiver: last bit of 2nd packet arrives, send ACK
receiver: last bit of 3rd packet arrives, send ACK
sender: ACK arrives, send next packet, t = RTT + L/R

3-packet pipelining increases utilization by a factor of 3!

U sender = (3L/R) / (RTT + L/R) = 0.024 / 30.008 ≈ 0.0008

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: "cumulative ACK"
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

initially (Λ):
  base = 1
  nextseqnum = 1

state: Wait

rdt_send(data)
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)

timeout
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  …
  udt_send(sndpkt[nextseqnum-1])

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer

rdt_rcv(rcvpkt) && corrupt(rcvpkt)
  Λ
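The extended FSM translates almost line for line into Python. The sketch below is illustrative only: the class `GBNSender`, the `wire` list that records what would be handed to `udt_send`, and the boolean timer are assumptions, and one small guard is added so that a stale ACK can never move `base` backwards.

```python
class GBNSender:
    """Go-Back-N sender following the extended FSM (sketch)."""

    def __init__(self, window_size):
        self.N = window_size
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}             # seq -> data, kept for retransmission
        self.timer_running = False   # stand-in for start_timer / stop_timer
        self.wire = []               # seq numbers handed to udt_send, in order

    def rdt_send(self, data):
        """Return True if the data was accepted, False if refused."""
        if self.nextseqnum < self.base + self.N:
            self.sndpkt[self.nextseqnum] = data
            self.wire.append(self.nextseqnum)
            if self.base == self.nextseqnum:
                self.timer_running = True
            self.nextseqnum += 1
            return True
        return False                 # refuse_data: window is full

    def timeout(self):
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.wire.append(seq)    # resend every unacked packet

    def rdt_rcv_ack(self, acknum):
        # Guard (assumption): ignore ACKs older than the current base.
        self.base = max(self.base, acknum + 1)
        for seq in [s for s in self.sndpkt if s < self.base]:
            del self.sndpkt[seq]
        self.timer_running = self.base != self.nextseqnum
```

With a window of 4, the fifth call to `rdt_send` is refused until an ACK advances `base`, and a timeout resends everything from `base` up to `nextseqnum - 1`.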


GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #

initially (Λ):
  expectedseqnum = 1
  sndpkt = make_pkt(0, ACK, chksum)

state: Wait

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum)
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++

default
  udt_send(sndpkt)


GBN in action

sender window (N=4); pkt2 is lost (X)

sender:
    send pkt0
    send pkt1
    send pkt2 (lost)
    send pkt3
    (wait)
    rcv ack0, send pkt4
    rcv ack1, send pkt5
    ignore duplicate ACK
    pkt 2 timeout: send pkt2, send pkt3, send pkt4, send pkt5

receiver:
    receive pkt0, send ack0
    receive pkt1, send ack1
    receive pkt3, discard, (re)send ack1
    receive pkt4, discard, (re)send ack1
    receive pkt5, discard, (re)send ack1
    rcv pkt2, deliver, send ack2
    rcv pkt3, deliver, send ack3
    rcv pkt4, deliver, send ack4
    rcv pkt5, deliver, send ack5


Selective repeat

• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #s
  - limits seq #s of sent, unACKed packets

Selective repeat: sender, receiver windows
[Figure: sender and receiver views of the sequence number space]

Selective repeat

sender

data from above:
• if next available seq # in window, send pkt

timeout(n):
• resend pkt n, restart timer

ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver

pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt

pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)

otherwise:
• ignore
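The three receiver cases above can be sketched as a small class. As a simplifying assumption, sequence numbers here are unbounded integers (no wraparound), and `deliver` and `send_ack` are hypothetical callbacks.

```python
class SRReceiver:
    """Selective-repeat receiver sketch with unbounded sequence numbers."""

    def __init__(self, window_size, deliver, send_ack):
        self.N = window_size
        self.rcvbase = 0
        self.buffer = {}              # seq -> data awaiting in-order delivery
        self.deliver = deliver
        self.send_ack = send_ack

    def on_packet(self, n, data):
        if self.rcvbase <= n <= self.rcvbase + self.N - 1:
            self.send_ack(n)                      # ACK each packet individually
            self.buffer.setdefault(n, data)       # buffer (possibly out of order)
            while self.rcvbase in self.buffer:    # deliver any in-order run
                self.deliver(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
        elif self.rcvbase - self.N <= n <= self.rcvbase - 1:
            self.send_ack(n)                      # already delivered: re-ACK only
        # otherwise: ignore
```

Replaying the arrival order 0, 1, 3, 4, 5, 2 delivers nothing after packet 1 until packet 2 arrives, at which point packets 2 to 5 are delivered together.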

Selective repeat in action

sender window (N=4); pkt2 is lost (X)

sender:
    send pkt0
    send pkt1
    send pkt2 (lost)
    send pkt3
    (wait)
    rcv ack0, send pkt4
    rcv ack1, send pkt5
    record ack3 arrived
    pkt 2 timeout: send pkt2
    record ack4 arrived
    record ack5 arrived

receiver:
    receive pkt0, send ack0
    receive pkt1, send ack1
    receive pkt3, buffer, send ack3
    receive pkt4, buffer, send ack4
    receive pkt5, buffer, send ack5
    rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #s: 0, 1, 2, 3
• window size = 3

(a) no problem
    sender sends pkt0, pkt1, pkt2; windows advance after receipt
    sender sends pkt3 (lost, X), then pkt0
    receiver will accept packet with seq number 0

(b) oops!
    sender sends pkt0, pkt1, pkt2; receiver window advances after receipt
    all three ACKs are lost (X X X)
    timeout: sender retransmits pkt0
    receiver will accept packet with seq number 0

receiver can't see sender side. receiver behavior identical in both cases! something's (very) wrong!

• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
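One way to explore the question is to check, for a given sequence-number space and window size, whether the receiver's next window reuses any number from the window it has just finished. The sketch below assumes the worst case from scenario (b): the receiver has advanced by a full window while the sender has not advanced at all. The function name is illustrative.

```python
def sr_window_is_ambiguous(seq_space, window):
    """True if a retransmission from the old window could be mistaken
    for a new packet in the receiver's next window."""
    old = {i % seq_space for i in range(window)}
    new = {i % seq_space for i in range(window, 2 * window)}
    return bool(old & new)


print(sr_window_is_ambiguous(4, 3))   # the slide's example
print(sr_window_is_ambiguous(4, 2))
```

Under this model the overlap disappears once the window is at most half the sequence-number space.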

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver

TCP segment structure

Header layout (each row is 32 bits wide):
    source port # | dest port #
    sequence number
    acknowledgement number
    head len | not used | U A P R S F | receive window
    checksum | Urg data pointer
    options (variable length)
    application data (variable length)

Field notes:
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• checksum: Internet checksum (as in UDP)
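As an illustration of the layout above, the fixed 20-byte part of the header can be unpacked with Python's standard struct module. The `TCPHeader` tuple and the function name are hypothetical helpers for this sketch; options and payload are not parsed.

```python
import struct
from collections import namedtuple

TCPHeader = namedtuple(
    "TCPHeader",
    "src_port dst_port seq ack header_len flags rwnd checksum urg_ptr",
)

# Flag bits from least significant upwards: FIN, SYN, RST, PSH, ACK, URG.
FLAG_NAMES = ["FIN", "SYN", "RST", "PSH", "ACK", "URG"]


def parse_tcp_header(segment):
    """Parse the fixed 20-byte TCP header from the start of `segment`."""
    src, dst, seq, ack, offset, flagbits, rwnd, cksum, urg = struct.unpack(
        "!HHIIBBHHH", segment[:20]
    )
    header_len = (offset >> 4) * 4    # head len is counted in 32-bit words
    flags = {name for i, name in enumerate(FLAG_NAMES) if flagbits & (1 << i)}
    return TCPHeader(src, dst, seq, ack, header_len, flags, rwnd, cksum, urg)
```

The "!" in the format string selects network (big-endian) byte order, which is how the fields appear on the wire.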

TCP seq. numbers, ACKs

sequence numbers:
  - byte stream "number" of first byte in segment's data

acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK

Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say, up to implementor

[Figure: outgoing segment from sender and incoming segment to sender (fields: source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer); sender sequence number space with window size N, divided into: sent ACKed | sent, not-yet ACKed ("in-flight") | usable but not yet sent | not usable]

Byte stream in TCP

[Figure: an HTTP GET message (K bytes) viewed as a byte stream; a segment carrying M bytes starts at the 100th byte with TCP header seq no = 100; the window covers N bytes; bytes beyond the window cannot be transmitted now]

TCP seq. numbers, ACKs

simple telnet scenario

    Host A -> Host B: Seq=42, ACK=79, data = 'C'    (user types 'C')
    Host B -> Host A: Seq=79, ACK=43, data = 'C'    (host ACKs receipt of 'C', echoes back 'C')
    Host A -> Host B: Seq=43, ACK=80                (host ACKs receipt of echoed 'C')

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT

TCP round trip time, timeout

EstimatedRTT = (1 - α) * EstimatedRTT + α * SampleRTT

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Figure: RTT (milliseconds) vs. time (seconds), gaia.cs.umass.edu to fantasia.eurecom.fr; curves: SampleRTT and EstimatedRTT]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

DevRTT = (1 - β) * DevRTT + β * |SampleRTT - EstimatedRTT|
(typically, β = 0.25)

TimeoutInterval = EstimatedRTT + 4 * DevRTT
    (EstimatedRTT: estimated RTT; 4 * DevRTT: "safety margin")
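The two moving averages and the resulting timeout can be sketched as below. The class name and the starting values are illustrative. One assumption to note: DevRTT is updated before EstimatedRTT, so the deviation is measured against the previous estimate; the slides do not state the order.

```python
class RttEstimator:
    """EWMA estimate of RTT and its deviation, as in the formulas above."""

    def __init__(self, estimated_rtt, dev_rtt, alpha=0.125, beta=0.25):
        self.estimated_rtt = estimated_rtt
        self.dev_rtt = dev_rtt
        self.alpha = alpha
        self.beta = beta

    def update(self, sample_rtt):
        """Fold one SampleRTT in and return the new TimeoutInterval."""
        # Assumed order: deviation first, against the old EstimatedRTT.
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        return self.timeout_interval()

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(estimated_rtt=100.0, dev_rtt=10.0)   # milliseconds
print(est.update(140.0))
```

A single slow sample moves EstimatedRTT only slightly, but the larger deviation widens the safety margin noticeably.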

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks

let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control

TCP sender events

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments

TCP sender (simplified)

Initial transition (Λ):
    NextSeqNum = InitialSeqNum
    SendBase = InitialSeqNum

State: wait for event

data received from application above:
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
        start timer

timeout:
    retransmit not-yet-acked segment with smallest seq. #
    start timer

ACK received, with ACK field value y:
    if (y > SendBase)
        SendBase = y
        /* SendBase-1: last cumulatively ACKed byte */
        if (there are currently not-yet-acked segments)
            start timer
        else stop timer
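The three events of the simplified sender can be sketched as follows. The `send` callback, the dictionary of unacknowledged segments and the boolean timer flag are assumptions made for illustration; a real sender also handles duplicate ACKs, flow control and congestion control.

```python
class SimpleTCPSender:
    """Simplified TCP sender: cumulative ACKs and a single timer."""

    def __init__(self, initial_seq, send):
        self.send = send               # hypothetical: callable(seq, data)
        self.next_seq = initial_seq    # NextSeqNum
        self.send_base = initial_seq   # SendBase
        self.unacked = {}              # seq of first byte -> segment data
        self.timer_running = False

    def on_app_data(self, data):
        self.unacked[self.next_seq] = data
        self.send(self.next_seq, data)
        self.next_seq += len(data)     # sequence numbers count bytes
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if self.unacked:
            seq = min(self.unacked)    # not-yet-acked segment with smallest seq #
            self.send(seq, self.unacked[seq])
            self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y         # bytes up to y-1 are cumulatively ACKed
            done = [s for s, d in self.unacked.items() if s + len(d) <= y]
            for seq in done:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)
```

An ACK whose value does not exceed SendBase changes nothing here, which matches the simplification of ignoring duplicate ACKs.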

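TCP: retransmission scenarios

lost ACK scenario
    Host A -> Host B: Seq=92, 8 bytes of data
    Host B -> Host A: ACK=100 (lost, X)
    timeout at Host A
    Host A -> Host B: Seq=92, 8 bytes of data
    Host B -> Host A: ACK=100

premature timeout
    SendBase=92
    Host A -> Host B: Seq=92, 8 bytes of data
    Host A -> Host B: Seq=100, 20 bytes of data
    Host B -> Host A: ACK=100
    Host B -> Host A: ACK=120
    timeout at Host A
    Host A -> Host B: Seq=92, 8 bytes of data
    SendBase=100
    SendBase=120
    Host B -> Host A: ACK=120
    SendBase=120

cumulative ACK
    Host A -> Host B: Seq=92, 8 bytes of data
    Host A -> Host B: Seq=100, 20 bytes of data
    Host B -> Host A: ACK=100 (lost, X)
    Host B -> Host A: ACK=120 (arrives before timeout)
    Host A -> Host B: Seq=120, 15 bytes of data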

TCP ACK generation [RFC 5681]

event at receiver -> TCP receiver action

1. arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
   -> delayed ACK: wait up to 500ms for next segment; if no next segment, send ACK

2. arrival of in-order segment with expected seq #; one other segment has ACK pending
   -> immediately send single cumulative ACK, ACKing both in-order segments

3. arrival of out-of-order segment, higher-than-expect seq #; gap detected
   -> immediately send duplicate ACK, indicating seq # of next expected byte

4. arrival of segment that partially or completely fills gap
   -> immediate send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit:
if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout

fast retransmit after sender receipt of triple duplicate ACK
    Host A -> Host B: Seq=92, 8 bytes of data
    Host A -> Host B: Seq=100, 20 bytes of data (lost, X)
    Host B -> Host A: ACK=100
    Host B -> Host A: ACK=100
    Host B -> Host A: ACK=100
    Host B -> Host A: ACK=100
    3 DUP ACKs: Host A -> Host B: Seq=100, 20 bytes of data (before timeout)
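Counting duplicate ACKs is simple enough to sketch directly. The function below assumes, following the trace above, that "triple duplicate" means three repeats after the first ACK for the same value (four identical ACKs in total); the function name and list-based interface are illustrative.

```python
def fast_retransmit_events(acks, dup_threshold=3):
    """Return the ACK values that would trigger a fast retransmit."""
    last_ack = None
    dup_count = 0
    retransmits = []
    for y in acks:
        if y == last_ack:
            dup_count += 1
            if dup_count == dup_threshold:
                # resend the unacked segment with the smallest seq #, i.e.
                # the one starting at byte y, without waiting for the timer
                retransmits.append(y)
        else:
            last_ack, dup_count = y, 0
    return retransmits


print(fast_retransmit_events([100, 100, 100, 100]))
```

Further duplicates beyond the third do not trigger another retransmission in this sketch.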

TCP flow control

application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

[Figure: receiver protocol stack: segments arrive from sender, pass through IP code and TCP code (OS) into TCP socket receiver buffers, and are read by the application process]

TCP flow control

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering: TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); data leaves to the application process]
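The bookkeeping behind rwnd can be sketched with two small functions. The byte counters (`last_byte_rcvd`, `last_byte_read`, `last_byte_sent`, `last_byte_acked`) are assumed names for illustration and do not appear on the slide.

```python
def advertised_rwnd(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free buffer space the receiver advertises in the rwnd field."""
    buffered = last_byte_rcvd - last_byte_read   # data not yet read by the app
    return rcv_buffer - buffered


def sendable_bytes(rwnd, last_byte_sent, last_byte_acked):
    """How much more the sender may transmit without exceeding rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return max(0, rwnd - in_flight)


rwnd = advertised_rwnd(4096, last_byte_rcvd=3000, last_byte_read=1000)
print(rwnd, sendable_bytes(rwnd, last_byte_sent=5000, last_byte_acked=4000))
```

When the application stops reading, the buffered amount grows, rwnd shrinks toward zero, and the sender stalls.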

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

client side:
    Socket clientSocket = new Socket("hostname", "port number");
    connection state: ESTAB
    connection variables:
        seq # client-to-server, server-to-client
        rcvBuffer size at server, client

server side:
    Socket connectionSocket = welcomeSocket.accept();
    connection state: ESTAB
    connection variables:
        seq # client-to-server, server-to-client
        rcvBuffer size at server, client

TCP 3-way handshake

client state: CLOSED; server state: LISTEN

1. client: choose init seq num, x; send TCP SYN msg
       SYNbit=1, Seq=x
   client state: SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN
       SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1
   server state: SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK; this segment may contain client-to-server data
       ACKbit=1, ACKnum=y+1
   client state: ESTAB
4. server: received ACK(y) indicates client is live
   server state: ESTAB

TCP 3-way handshake: FSM

server path:
    closed
      Socket connectionSocket = welcomeSocket.accept(); / Λ
    -> listen
      SYN(x) / SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
    -> SYN rcvd
      ACK(ACKnum=y+1) / Λ
    -> ESTAB

client path:
    closed
      Socket clientSocket = new Socket("hostname", "port number"); / SYN(seq=x)
    -> SYN sent
      SYNACK(seq=y, ACKnum=x+1) / ACK(ACKnum=y+1)
    -> ESTAB

TCP: closing a connection

• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

client state: ESTAB; server state: ESTAB

1. clientSocket.close(): client sends FINbit=1, seq=x
   client state: FIN_WAIT_1 (can no longer send but can receive data)
2. server sends ACKbit=1; ACKnum=x+1
   server state: CLOSE_WAIT (can still send data)
   client state: FIN_WAIT_2 (wait for server close)
3. server sends FINbit=1, seq=y
   server state: LAST_ACK (can no longer send data)
4. client sends ACKbit=1; ACKnum=y+1
   client state: TIMED_WAIT (timed wait for 2 * max segment lifetime), then CLOSED
   server state: CLOSED

The "Two Army Problem"

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λin, approaches capacity

[Figure: Host A sends original data at rate λin into unlimited shared output link buffers; Host B receives throughput λout. Graphs: λout vs. λin and delay vs. λin, axes marked at R/2]

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λin = λout
  - transport-layer input includes retransmissions: λ'in ≥ λin

[Figure: Host A, Host B, finite shared output link buffers; λin: original data; λ'in: original data, plus retransmitted data; λout]

Causes/costs of congestion: scenario 2

idealization: perfect knowledge
• sender sends only when router buffers available

[Figure: Host A keeps a copy and sends only when there is free buffer space (not when there is no buffer space) in the finite shared output link buffers; λin: original data; λ'in: original data, plus retransmitted data; λout at Host B. Graph: λout vs. λin, axes marked at R/2]

Causes/costs of congestion: scenario 2

Idealization: known loss
packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
• when sending at R/2, some packets are retransmissions but asymptotic goodput is still R/2 (why?)

Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions including duplicated that are delivered!

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput

[Figure: Host A keeps a copy of each packet (timeout, free buffer space at router); λin: original data; λ'in: original data, plus retransmitted data; λout at Host B. Graphs: λout vs. λin, axes marked at R/2]

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λin and λ'in increase?
A: as red λ'in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!
• bandwidth wastage for packets dropped at the 2nd router

[Figure: Hosts A, B, C, D; finite shared output link buffers; λin: original data; λ'in: original data, plus retransmitted data; λout. Graph: throughput by blue traffic (λout) vs. offered load by Host A (λ'in), axes marked at R/2]

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss

AIMD saw tooth behavior: probing for bandwidth
additively increase window size … until loss occurs (then cut window in half)

[Figure: cwnd (TCP sender congestion window size) vs. time, saw-tooth shape]

TCP Congestion Control: details

• sender limits transmission:
      LastByteSent - LastByteAcked <= cwnd
• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
      rate ≈ cwnd / RTT bytes/sec

[Figure: sender sequence number space: last byte ACKed | sent, not-yet ACKed ("in-flight") | last byte sent; cwnd spans the in-flight region]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A sends one segment, then two segments, then four segments to Host B, one batch per RTT]

TCP: detecting, reacting to loss

• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

Initial transition (Λ):
    cwnd = 1 MSS
    ssthresh = 64 KB
    dupACKcount = 0
    -> slow start

State: slow start
    new ACK:
        cwnd = cwnd + MSS
        dupACKcount = 0
        transmit new segment(s), as allowed
    duplicate ACK:
        dupACKcount++
    cwnd > ssthresh:
        Λ
        -> congestion avoidance
    timeout:
        ssthresh = cwnd/2
        cwnd = 1 MSS
        dupACKcount = 0
        retransmit missing segment
    dupACKcount == 3:
        ssthresh = cwnd/2
        cwnd = ssthresh + 3
        retransmit missing segment
        -> fast recovery

State: congestion avoidance
    new ACK:
        cwnd = cwnd + MSS * (MSS/cwnd)
        dupACKcount = 0
        transmit new segment(s), as allowed
    duplicate ACK:
        dupACKcount++
    timeout:
        ssthresh = cwnd/2
        cwnd = 1 MSS
        dupACKcount = 0
        retransmit missing segment
        -> slow start
    dupACKcount == 3:
        ssthresh = cwnd/2
        cwnd = ssthresh + 3
        retransmit missing segment
        -> fast recovery

State: fast recovery
    duplicate ACK:
        cwnd = cwnd + MSS
        transmit new segment(s), as allowed
    New ACK:
        cwnd = ssthresh
        dupACKcount = 0
        -> congestion avoidance
    timeout:
        ssthresh = cwnd/2
        cwnd = 1
        dupACKcount = 0
        retransmit missing segment
        -> slow start
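The state machine above can be sketched as a small class that tracks only cwnd, ssthresh and the state; retransmission and sending are left out. Two assumptions: window sizes are kept in units of MSS (so the 64 KB default becomes 64), and the "+ 3" in the fast-recovery entry is read as 3 MSS.

```python
class RenoCongestionControl:
    """Window bookkeeping for slow start, congestion avoidance, fast recovery."""

    def __init__(self, mss=1, ssthresh=64):
        self.mss = mss
        self.cwnd = 1 * mss
        self.ssthresh = ssthresh
        self.dup_acks = 0
        self.state = "slow start"

    def on_new_ack(self):
        if self.state == "fast recovery":
            self.cwnd = self.ssthresh               # deflate the window
            self.state = "congestion avoidance"
        elif self.state == "slow start":
            self.cwnd += self.mss                   # doubles cwnd every RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion avoidance"
        else:
            self.cwnd += self.mss * (self.mss / self.cwnd)   # about 1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast recovery":
            self.cwnd += self.mss
            return
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss
            self.state = "fast recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1 * self.mss
        self.dup_acks = 0
        self.state = "slow start"
```

Driving it with a stream of new ACKs, three duplicates and then a timeout walks through all three states in turn.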

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT

avg TCP throughput = (3/4) * W / RTT bytes/sec

[Figure: saw-tooth window oscillating between W/2 and W, average 3/4 W]

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

      TCP throughput = 1.22 * MSS / (RTT * sqrt(L))

• to achieve 10 Gbps throughput, need a loss rate of L = 2 * 10^-10, a very small loss rate!
• new versions of TCP for high-speed
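The two numbers on this slide can be reproduced with a few lines of arithmetic. The function names are illustrative; the second function simply solves the throughput formula above for L.

```python
def required_window_segments(throughput_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain the throughput over one RTT."""
    return throughput_bps * rtt_s / (mss_bytes * 8)


def required_loss_rate(throughput_bps, rtt_s, mss_bytes):
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L."""
    mss_bits = mss_bytes * 8
    return (1.22 * mss_bits / (rtt_s * throughput_bps)) ** 2


print(required_window_segments(10e9, 0.100, 1500))
print(required_loss_rate(10e9, 0.100, 1500))
```

The window comes out at about 83,333 segments and the loss rate at roughly 2 * 10^-10, matching the slide.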

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput vs. Connection 1 throughput, both axes up to R; the full bandwidth utilization line is the set of points (X1, Y1) where X1+Y1 = R; the equal bandwidth share line is the set of points (X2, Y2) where X2 = Y2; the trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2"]

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2
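The parallel-connection example is a one-line calculation if each TCP connection is assumed to receive an equal share of the link. The function name is illustrative.

```python
from fractions import Fraction


def app_share(existing_connections, new_connections):
    """Fraction of link rate R obtained by an app opening new_connections,
    assuming every connection on the link gets an equal share."""
    total = existing_connections + new_connections
    return Fraction(new_connections, total)


print(app_share(9, 1))    # one connection among ten
print(app_share(9, 11))   # eleven connections among twenty
```

The second case evaluates to 11/20, which the slide rounds to R/2.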

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical); IP datagram changes from ECN=00 to ECN=11 along the path; TCP ACK segment returns with ECE=1]


rdt2.0 has a fatal flaw!

what happens if ACK/NAK corrupted?
• sender doesn't know what happened at receiver!
• can't just retransmit: possible duplicate

handling duplicates:
• sender retransmits current pkt if ACK/NAK corrupted
• sender adds sequence number to each pkt
• receiver discards (doesn't deliver up) duplicate pkt

stop and wait: sender sends one packet, then waits for receiver response

rdt2.1: sender, handles garbled ACK/NAKs

State: Wait for call 0 from above
    rdt_send(data):
        sndpkt = make_pkt(0, data, checksum)
        udt_send(sndpkt)
        -> Wait for ACK or NAK 0

State: Wait for ACK or NAK 0
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)):
        udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt):
        Λ
        -> Wait for call 1 from above

State: Wait for call 1 from above
    rdt_send(data):
        sndpkt = make_pkt(1, data, checksum)
        udt_send(sndpkt)
        -> Wait for ACK or NAK 1

State: Wait for ACK or NAK 1
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)):
        udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt):
        Λ
        -> Wait for call 0 from above

rdt2.1: receiver, handles garbled ACK/NAKs

State: Wait for 0 from below
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt):
        extract(rcvpkt, data)
        deliver_data(data)
        sndpkt = make_pkt(ACK, chksum)
        udt_send(sndpkt)
        -> Wait for 1 from below
    rdt_rcv(rcvpkt) && corrupt(rcvpkt):
        sndpkt = make_pkt(NAK, chksum)
        udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt):
        sndpkt = make_pkt(ACK, chksum)
        udt_send(sndpkt)

State: Wait for 1 from below
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt):
        extract(rcvpkt, data)
        deliver_data(data)
        sndpkt = make_pkt(ACK, chksum)
        udt_send(sndpkt)
        -> Wait for 0 from below
    rdt_rcv(rcvpkt) && corrupt(rcvpkt):
        sndpkt = make_pkt(NAK, chksum)
        udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt):
        sndpkt = make_pkt(ACK, chksum)
        udt_send(sndpkt)
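The two receiver states differ only in which sequence bit they expect, so the receiver can be sketched with a single variable. The `deliver` and `send` callbacks and the `corrupt` flag are assumptions standing in for deliver_data, udt_send and the checksum test.

```python
def make_rdt21_receiver(deliver, send):
    """Return an on_packet handler for the rdt2.1 receiver."""
    expected = 0      # 0: "Wait for 0 from below", 1: "Wait for 1 from below"

    def on_packet(seq, data, corrupt=False):
        nonlocal expected
        if corrupt:
            send("NAK")
        elif seq == expected:
            deliver(data)             # new packet: deliver and flip the bit
            send("ACK")
            expected = 1 - expected
        else:
            send("ACK")               # duplicate: re-ACK but do not deliver

    return on_packet
```

A retransmitted packet 0 arriving while the receiver waits for 1 is acknowledged again but not delivered a second time.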

rdt2.1: Example 1

FSM states shown on each slide: sender (Wait for call 0 from above, Wait for ACK or NAK 0, Wait for call 1 from above, Wait for ACK or NAK 1); receiver (Wait for 0 from below, Wait for 1 from below)

Transition highlighted on each successive slide:
1. sender: rdt_send(data)
       sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
2. receiver: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
       sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
       udt_send(sndpkt)
4. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
       extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
5. sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
       Λ
6. (no transition highlighted)

rdt2.1: Example 2

FSM states shown on each slide: sender (Wait for call 0 from above, Wait for ACK or NAK 0, Wait for call 1 from above, Wait for ACK or NAK 1); receiver (Wait for 0 from below, Wait for 1 from below)

Transition highlighted on each successive slide:
1. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
       extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
2. sender: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
       udt_send(sndpkt)
3. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
       sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
4. sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
       Λ
5. (no transition highlighted)

rdt2.1: discussion

sender:
• seq # added to pkt
• two seq #s (0, 1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  - state must "remember" whether "expected" pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  - state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender

rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  - receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt

rdt2.2: sender, receiver fragments

sender FSM fragment

State: Wait for call 0 from above
    rdt_send(data):
        sndpkt = make_pkt(0, data, checksum)
        udt_send(sndpkt)
        -> Wait for ACK 0

State: Wait for ACK 0
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)):
        udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0):
        Λ

receiver FSM fragment

Transition into Wait for 0 from below:
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt):
        extract(rcvpkt, data)
        deliver_data(data)
        sndpkt = make_pkt(ACK1, chksum)
        udt_send(sndpkt)

State: Wait for 0 from below
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)):
        udt_send(sndpkt)

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
  - checksum, seq. #, ACKs, retransmissions will be of help … but not enough

approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  - retransmission will be duplicate, but seq. #s already handle this
  - receiver must specify seq # of pkt being ACKed
• requires countdown timer

rdt3.0 sender

State: Wait for call 0 from above
    rdt_send(data):
        sndpkt = make_pkt(0, data, checksum)
        udt_send(sndpkt)
        start_timer
        -> Wait for ACK0
    rdt_rcv(rcvpkt):
        Λ

State: Wait for ACK0
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)):
        Λ
    timeout:
        udt_send(sndpkt)
        start_timer
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0):
        stop_timer
        -> Wait for call 1 from above

State: Wait for call 1 from above
    rdt_send(data):
        sndpkt = make_pkt(1, data, checksum)
        udt_send(sndpkt)
        start_timer
        -> Wait for ACK1
    rdt_rcv(rcvpkt):
        Λ

State: Wait for ACK1
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0)):
        Λ
    timeout:
        udt_send(sndpkt)
        start_timer
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1):
        stop_timer
        -> Wait for call 0 from above

rdt3.0 in action

(a) no loss
    sender: send pkt0
    receiver: rcv pkt0, send ack0
    sender: rcv ack0, send pkt1
    receiver: rcv pkt1, send ack1
    sender: rcv ack1, send pkt0
    receiver: rcv pkt0, send ack0

(b) packet loss
    sender: send pkt0
    receiver: rcv pkt0, send ack0
    sender: rcv ack0, send pkt1 (lost, X)
    sender: timeout, resend pkt1
    receiver: rcv pkt1, send ack1
    sender: rcv ack1, send pkt0
    receiver: rcv pkt0, send ack0

rdt3.0 in action

(c) ACK loss
    sender: send pkt0
    receiver: rcv pkt0, send ack0
    sender: rcv ack0, send pkt1
    receiver: rcv pkt1, send ack1 (lost, X)
    sender: timeout, resend pkt1
    receiver: rcv pkt1 (detect duplicate), send ack1
    sender: rcv ack1, send pkt0
    receiver: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
    sender events:
        send pkt0
        rcv ack0, send pkt1
        timeout, resend pkt1
        rcv ack1, send pkt0
        rcv ack1, do nothing
        rcv ack0, send pkt1
    receiver events:
        rcv pkt0, send ack0
        rcv pkt1, send ack1
        rcv pkt1 (detect duplicate), send ack1
        rcv pkt0, send ack0

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

      D_trans = L / R = 8000 bits / 10^9 bits/sec = 8 microsecs

• U_sender: utilization, fraction of time sender busy sending

      U_sender = (L / R) / (RTT + L / R) = 0.008 / 30.008 = 0.00027

• if RTT = 30 msec, 1KB pkt every 30 msec: 33kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!

rdt3.0: stop-and-wait operation

sender:
    first packet bit transmitted, t = 0
    last packet bit transmitted, t = L / R
    ACK arrives, send next packet, t = RTT + L / R

receiver:
    first packet bit arrives
    last packet bit arrives, send ACK

U_sender = (L / R) / (RTT + L / R) = 0.008 / 30.008 = 0.00027

Pipelined protocols

pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
– range of sequence numbers must be increased
– buffering at sender and/or receiver

• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

[Figure: sender/receiver timing diagram. First packet bit transmitted at t = 0; last bit transmitted at t = L/R; first packet bit arrives; last packet bit arrives, receiver sends ACK; last bit of 2nd packet arrives, send ACK; last bit of 3rd packet arrives, send ACK; ACK arrives, sender sends next packet at t = RTT + L/R.]

3-packet pipelining increases utilization by a factor of 3!

$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$
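The effect of the window size on utilization can be sketched in a few lines of Python. This is an illustrative helper, not something from the slides; it assumes the sender may have n packets in flight per RTT + L/R cycle and caps utilization at 1.

```python
def pipelined_utilization(n_packets, packet_bits, link_bps, rtt_s):
    """U_sender = n * (L/R) / (RTT + L/R), capped at full utilization."""
    d_trans = packet_bits / link_bps  # transmission delay L/R in seconds
    return min(1.0, n_packets * d_trans / (rtt_s + d_trans))


if __name__ == "__main__":
    # Slide example: 8000-bit packets, 1 Gbps link, RTT = 30 ms
    for n in (1, 3, 100, 10000):
        print(n, pipelined_utilization(n, 8000, 1e9, 0.030))
```

With these parameters a window of a few thousand packets is needed before the link is kept busy, which is why pipelining needs larger sequence number ranges and buffering.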

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  – doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  – when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  – when timer expires, retransmit only that unacked packet

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n – "cumulative ACK"
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

Single state: Wait

initially (Λ):
    base = 1
    nextseqnum = 1

rdt_send(data):
    if (nextseqnum < base+N) {
        sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
        udt_send(sndpkt[nextseqnum])
        if (base == nextseqnum)
            start_timer
        nextseqnum++
    }
    else
        refuse_data(data)

timeout:
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    …
    udt_send(sndpkt[nextseqnum-1])

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
        stop_timer
    else
        start_timer

rdt_rcv(rcvpkt) && corrupt(rcvpkt):
    Λ
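The sender FSM above can be sketched in Python. This is a simplified illustration under stated assumptions, not a full implementation: the class name `GBNSender`, the `send` callback (standing in for `udt_send`) and the boolean `timer_running` (standing in for a real timer) are hypothetical, sequence numbers do not wrap around, and ACKs below the current base are ignored, a guard the FSM does not show.

```python
class GBNSender:
    """Sketch of the Go-Back-N sender FSM (no seq number wraparound)."""

    def __init__(self, window_size, send):
        self.N = window_size
        self.send = send            # callable(seqnum, data), stands in for udt_send
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}            # unacked packets, keyed by sequence number
        self.timer_running = False  # stands in for start_timer / stop_timer

    def rdt_send(self, data):
        """Send data if the window has room; return False to refuse it."""
        if self.nextseqnum < self.base + self.N:
            self.sndpkt[self.nextseqnum] = data
            self.send(self.nextseqnum, data)
            if self.base == self.nextseqnum:
                self.timer_running = True
            self.nextseqnum += 1
            return True
        return False  # refuse_data(data)

    def timeout(self):
        """Retransmit every unacked packet in the window."""
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.send(seq, self.sndpkt[seq])

    def rdt_rcv_ack(self, acknum):
        """Handle an uncorrupted cumulative ACK for all packets up to acknum."""
        if acknum + 1 <= self.base:
            return  # duplicate or old ACK: ignored (assumption, see lead-in)
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)
        self.base = acknum + 1
        self.timer_running = self.base != self.nextseqnum


if __name__ == "__main__":
    sent = []
    sender = GBNSender(4, lambda seq, data: sent.append(seq))
    for i in range(5):
        sender.rdt_send(f"data{i}")  # the fifth call is refused (window full)
    sender.rdt_rcv_ack(2)            # cumulative ACK: packets 1 and 2 delivered
    sender.timeout()                 # go back N: resend 3 and 4
    print(sent)
```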


GBN: receiver extended FSM

Single state: Wait

initially (Λ):
    expectedseqnum = 1
    sndpkt = make_pkt(0, ACK, chksum)

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++

default:
    udt_send(sndpkt)

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
– may generate duplicate ACKs
– need only remember expectedseqnum

• out-of-order pkt:
  – discard (don't buffer): no receiver buffering!
  – re-ACK pkt with highest in-order seq #
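A minimal Python sketch of the receiver side follows. The class name and the `corrupt` flag are illustrative assumptions; the method returns the ACK number that would be sent, instead of building and transmitting a packet.

```python
class GBNReceiver:
    """Sketch of the Go-Back-N receiver: in-order delivery, cumulative ACKs."""

    def __init__(self):
        self.expectedseqnum = 1
        self.last_ack = 0    # matches the initial sndpkt = make_pkt(0, ACK, chksum)
        self.delivered = []  # stands in for deliver_data()

    def rdt_rcv(self, seqnum, data, corrupt=False):
        """Return the ACK number to send for this arrival."""
        if not corrupt and seqnum == self.expectedseqnum:
            self.delivered.append(data)
            self.last_ack = self.expectedseqnum
            self.expectedseqnum += 1
        # default case: out-of-order or corrupt packets are discarded and
        # the ACK for the highest in-order packet is re-sent
        return self.last_ack


if __name__ == "__main__":
    rcv = GBNReceiver()
    # packet 2 is lost, so packet 3 arrives out of order
    print([rcv.rdt_rcv(seq, f"pkt{seq}") for seq in (1, 3, 2, 3)])
```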


GBN in action

sender window (N=4), sequence numbers 0 1 2 3 4 5 6 7 8

sender:
  send pkt0
  send pkt1
  send pkt2 (lost, X)
  send pkt3
  (wait)
  rcv ack0, send pkt4
  rcv ack1, send pkt5
  ignore duplicate ACK
  pkt 2 timeout
  send pkt2
  send pkt3
  send pkt4
  send pkt5

receiver:
  receive pkt0, send ack0
  receive pkt1, send ack1
  receive pkt3, discard, (re)send ack1
  receive pkt4, discard, (re)send ack1
  receive pkt5, discard, (re)send ack1
  rcv pkt2, deliver, send ack2
  rcv pkt3, deliver, send ack3
  rcv pkt4, deliver, send ack4
  rcv pkt5, deliver, send ack5


Selective repeat

• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #'s
  – limits seq #s of sent, unACKed packets

Selective repeat sender receiver windows

54

Selective repeat

sender

data from above:
• if next available seq # in window, send pkt

timeout(n):
• resend pkt n, restart timer

ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver

pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt

pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)

otherwise:
• ignore
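The receiver rules above can be sketched as follows. This is an illustrative Python sketch, not a complete protocol: the class name is hypothetical, and sequence numbers are assumed to grow without wraparound.

```python
class SRReceiver:
    """Sketch of the selective repeat receiver (no seq number wraparound)."""

    def __init__(self, window_size):
        self.N = window_size
        self.rcv_base = 0
        self.buffer = {}     # out-of-order packets waiting for the gap to fill
        self.delivered = []  # stands in for delivery to the upper layer

    def receive(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcv_base <= n < self.rcv_base + self.N:
            self.buffer[n] = data
            # deliver the in-order run starting at rcv_base, advancing the window
            while self.rcv_base in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcv_base))
                self.rcv_base += 1
            return n
        if self.rcv_base - self.N <= n < self.rcv_base:
            return n  # already delivered: re-ACK so the sender can advance
        return None   # otherwise: ignore


if __name__ == "__main__":
    rcv = SRReceiver(4)
    # pkt2 is lost at first and arrives last, after its timeout
    print([rcv.receive(n, f"pkt{n}") for n in (0, 1, 3, 4, 5, 2)])
    print(rcv.delivered)
```

Re-ACKing packets in [rcvbase-N, rcvbase-1] matters because the original ACK may have been lost; without it the sender's window could never move forward.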

Selective repeat in action

sender window (N=4), sequence numbers 0 1 2 3 4 5 6 7 8

sender:
  send pkt0
  send pkt1
  send pkt2 (lost, X)
  send pkt3
  (wait)
  rcv ack0, send pkt4
  rcv ack1, send pkt5
  record ack3 arrived
  pkt 2 timeout
  send pkt2
  record ack4 arrived
  record ack5 arrived

receiver:
  receive pkt0, send ack0
  receive pkt1, send ack1
  receive pkt3, buffer, send ack3
  receive pkt4, buffer, send ack4
  receive pkt5, buffer, send ack5
  rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

[Figure: two sender/receiver scenarios over the sequence number space 0 1 2 3 0 1 2, showing the sender window and receiver window after each receipt.
(a) no problem: pkt0, pkt1, pkt2 are received and ACKed and both windows advance; pkt3 is lost (X); the receiver will accept a packet with seq number 0, and the pkt0 that arrives is new data.
(b) oops!: pkt0, pkt1, pkt2 are received but all three ACKs are lost (X X X); the sender times out and retransmits pkt0; the receiver will accept a packet with seq number 0, so the old pkt0 is taken as new.]

receiver can't see sender side. receiver behavior identical in both cases! something's (very) wrong!

• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
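The dilemma can be explored with a small worst-case check. The sketch below is illustrative (the function name is not from the slides): it assumes the receiver has accepted a full window while every ACK was lost, so the sender may still retransmit its whole old window. The standard answer to the question is that the window size must be at most half the sequence number space, which the loop confirms for small values.

```python
def windows_can_overlap(seq_space, window_size):
    """True if a retransmitted old packet can fall inside the receiver's new window.

    Worst case: the receiver got all window_size packets but every ACK was
    lost, so the sender's window has not moved while the receiver's has.
    """
    sender_window = {i % seq_space for i in range(window_size)}
    receiver_window = {i % seq_space for i in range(window_size, 2 * window_size)}
    return bool(sender_window & receiver_window)


if __name__ == "__main__":
    print(windows_can_overlap(4, 3))  # slide example, scenario (b) is possible
    print(windows_can_overlap(4, 2))
    for seq_space in range(2, 9):
        safe = [w for w in range(1, seq_space) if not windows_can_overlap(seq_space, w)]
        print(seq_space, "largest safe window:", max(safe))
```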

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver

TCP segment structure

[Figure: TCP segment layout, 32 bits wide]
• source port #, dest port #
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• head len, not used, flag bits U A P R S F
  – URG: urgent data (generally not used)
  – ACK: ACK # valid
  – PSH: push data now
  – RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
• Urg data pointer
• options (variable length)
• application data (variable length)

TCP seq. numbers, ACKs

sequence numbers:
– byte stream "number" of first byte in segment's data

acknowledgements:
– seq # of next byte expected from other side
– cumulative ACK

Q: how receiver handles out-of-order segments
– A: TCP spec doesn't say, up to implementor

[Figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number (A), rwnd, checksum, urg pointer. Below: sender sequence number space with window size N, divided into: sent, ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable.]

Byte stream in TCP

62

Window N bytes

HTTP Get Message (K bytes)

100th byte

TCP header(seq no = 100)

M bytes

HTTP Get Message (K bytes)

Cannot be transmitted now

TCP seq. numbers, ACKs

simple telnet scenario (Host A, Host B):
1. User types 'C': A sends Seq=42, ACK=79, data = 'C'
2. Host B ACKs receipt of 'C', echoes back 'C': B sends Seq=79, ACK=43, data = 'C'
3. Host A ACKs receipt of echoed 'C': A sends Seq=43, ACK=80
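The numbers in the telnet scenario follow mechanically from the byte-counting rule. A minimal Python sketch, assuming each side has no other unacknowledged data outstanding; the `Segment` type and `reply_to` helper are illustrative names:

```python
from dataclasses import dataclass


@dataclass
class Segment:
    seq: int           # byte-stream number of the first data byte
    ack: int           # next byte expected from the other side
    data: bytes = b""


def reply_to(incoming: Segment, data: bytes = b"") -> Segment:
    """Build the reply: our seq is what the peer expects next, and our
    ACK is the peer's seq plus the number of data bytes just received."""
    return Segment(seq=incoming.ack, ack=incoming.seq + len(incoming.data), data=data)


if __name__ == "__main__":
    a1 = Segment(seq=42, ack=79, data=b"C")  # user types 'C'
    b1 = reply_to(a1, b"C")                  # B ACKs 'C' and echoes it back
    a2 = reply_to(b1)                        # A ACKs the echoed 'C'
    print(a1, b1, a2, sep="\n")
```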

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT

TCP round trip time, timeout

$EstimatedRTT = (1 - \alpha) \cdot EstimatedRTT + \alpha \cdot SampleRTT$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Figure: SampleRTT and EstimatedRTT for the path gaia.cs.umass.edu to fantasia.eurecom.fr. x-axis: time (seconds); y-axis: RTT (milliseconds), 100 to 350.]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

$DevRTT = (1 - \beta) \cdot DevRTT + \beta \cdot |SampleRTT - EstimatedRTT|$
(typically, β = 0.25)

$TimeoutInterval = EstimatedRTT + 4 \cdot DevRTT$
(estimated RTT plus "safety margin")
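The estimator and timeout formulas can be sketched in Python. Two details are assumptions beyond the slides and follow the convention of RFC 6298: the first sample initializes EstimatedRTT to the sample and DevRTT to half of it, and DevRTT is updated before EstimatedRTT on each new sample. The class name is illustrative.

```python
class RttEstimator:
    """Sketch of EstimatedRTT / DevRTT / TimeoutInterval tracking."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2  # assumption, as in RFC 6298

    def update(self, sample_rtt):
        """Fold in one SampleRTT and return the new TimeoutInterval."""
        # deviation uses the estimate from before this sample (assumption)
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        return self.timeout_interval()

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


if __name__ == "__main__":
    est = RttEstimator(first_sample=100.0)  # milliseconds
    for sample in (120.0, 95.0, 250.0, 110.0):
        print(sample, round(est.update(sample), 1))
```

A single outlier (250 ms here) moves EstimatedRTT only slightly but widens the safety margin through DevRTT, which is the behavior the slide describes.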

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks

let's initially consider simplified TCP sender:
– ignore duplicate acks
– ignore flow control, congestion control

TCP sender events

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments

TCP sender (simplified)

Single state: wait for event

initially (Λ):
    NextSeqNum = InitialSeqNum
    SendBase = InitialSeqNum

data received from application above:
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
        start timer

timeout:
    retransmit not-yet-acked segment with smallest seq. #
    start timer

ACK received, with ACK field value y:
    if (y > SendBase) {
        SendBase = y
        /* SendBase–1: last cumulatively ACKed byte */
        if (there are currently not-yet-acked segments)
            start timer
        else stop timer
    }
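The three events of the simplified sender can be sketched in Python. This is an illustration under assumptions, not TCP itself: the class name, the `transmit` callback (standing in for passing a segment to IP) and the boolean timer flag are hypothetical, and flow control, congestion control and duplicate ACK handling are left out as on the slide.

```python
class SimpleTcpSender:
    """Sketch of the simplified TCP sender: one timer, cumulative ACKs."""

    def __init__(self, initial_seq, transmit):
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.transmit = transmit    # callable(seq, data), stands in for IP
        self.unacked = {}           # seq -> data for not-yet-acked segments
        self.timer_running = False

    def on_app_data(self, data):
        self.unacked[self.next_seq] = data
        self.transmit(self.next_seq, data)
        self.next_seq += len(data)
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if self.unacked:
            seq = min(self.unacked)  # not-yet-acked segment with smallest seq
            self.transmit(seq, self.unacked[seq])
            self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y       # send_base - 1: last cumulatively ACKed byte
            for seq in [s for s, d in self.unacked.items() if s + len(d) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)


if __name__ == "__main__":
    sent = []
    tcp = SimpleTcpSender(92, lambda seq, data: sent.append((seq, len(data))))
    tcp.on_app_data(b"x" * 8)    # Seq=92, 8 bytes
    tcp.on_app_data(b"y" * 20)   # Seq=100, 20 bytes
    tcp.on_timeout()             # premature timeout: resend Seq=92 only
    tcp.on_ack(100)
    tcp.on_ack(120)
    print(sent, tcp.send_base, tcp.timer_running)
```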

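TCP: retransmission scenarios

lost ACK scenario (Host A, Host B):
• A sends Seq=92, 8 bytes of data
• B replies ACK=100, which is lost (X)
• A times out and resends Seq=92, 8 bytes of data
• B replies ACK=100

premature timeout (Host A, Host B):
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data (SendBase=92)
• B replies ACK=100, then ACK=120
• A times out before the ACKs arrive and resends Seq=92, 8 bytes of data
• ACK=100 arrives (SendBase=100), then ACK=120 (SendBase=120)
• B answers the duplicate segment with ACK=120 (SendBase=120)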

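TCP: retransmission scenarios

cumulative ACK (Host A, Host B):
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
• B replies ACK=100, which is lost (X), then ACK=120
• ACK=120 arrives before the timeout, so nothing is retransmitted; A continues with Seq=120, 15 bytes of data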

TCP ACK generation [RFC 5681]

1. event at receiver: arrival of in-order segment with expected seq #. All data up to expected seq # already ACKed
   TCP receiver action: delayed ACK. Wait up to 500ms for next segment. If no next segment, send ACK

2. event at receiver: arrival of in-order segment with expected seq #. One other segment has ACK pending
   TCP receiver action: immediately send single cumulative ACK, ACKing both in-order segments

3. event at receiver: arrival of out-of-order segment, higher-than-expect seq. #. Gap detected
   TCP receiver action: immediately send duplicate ACK, indicating seq. # of next expected byte

4. event at receiver: arrival of segment that partially or completely fills gap
   TCP receiver action: immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
• likely that unacked segment lost, so don't wait for timeout
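Duplicate ACK counting can be sketched as follows. The sketch assumes the usual reading of "triple duplicate ACKs": three further ACKs for the same byte after the ACK that first advanced SendBase. The class name is illustrative.

```python
class FastRetransmitDetector:
    """Sketch: count duplicate ACKs and signal a fast retransmit on the third."""

    def __init__(self, send_base):
        self.send_base = send_base
        self.dup_acks = 0

    def on_ack(self, y):
        """Return the seq # to retransmit now, or None."""
        if y > self.send_base:
            # new data acknowledged: advance and reset the duplicate count
            self.send_base = y
            self.dup_acks = 0
            return None
        if y == self.send_base:
            self.dup_acks += 1
            if self.dup_acks == 3:
                return self.send_base  # resend unacked segment with smallest seq #
        return None


if __name__ == "__main__":
    det = FastRetransmitDetector(send_base=92)
    # Seq=92 is ACKed, Seq=100 is lost, later segments trigger duplicate ACKs
    print([det.on_ack(100) for _ in range(4)])
```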

TCP fast retransmit

[Figure: fast retransmit after sender receipt of triple duplicate ACK (Host A, Host B). A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost (X). B answers with ACK=100, followed by three more ACK=100 (3 DUP ACKs). A resends Seq=100, 20 bytes of data before the timeout expires.]

TCP flow control

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code in the OS into the TCP socket receiver buffers, from which the application process reads.]

application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)

TCP flow control

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); data leaves to the application process.]

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

[Figure: client and server, each with application and network layers, each holding:
  connection state: ESTAB
  connection variables: seq # client-to-server, server-to-client; rcvBuffer size at server, client]

client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

client state: CLOSED; server state: LISTEN

1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x). Client state: SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1). Server state: SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data. Client state: ESTAB
4. server: received ACK(y) indicates client is live. Server state: ESTAB

TCP 3-way handshake: FSM

Each transition is listed as event, then action (Λ: none).

closed → listen:
    Socket connectionSocket = welcomeSocket.accept();
    Λ

listen → SYN rcvd:
    SYN(x)
    SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client

SYN rcvd → ESTAB:
    ACK(ACKnum=y+1)
    Λ

closed → SYN sent:
    Socket clientSocket = new Socket("hostname", "port number");
    SYN(seq=x)

SYN sent → ESTAB:
    SYNACK(seq=y, ACKnum=x+1)
    ACK(ACKnum=y+1)

TCP: closing a connection

• client, server each close their side of connection
  – send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

client state: ESTAB; server state: ESTAB

1. client: clientSocket.close(); sends FINbit=1, seq=x. Client state: FIN_WAIT_1 (can no longer send but can receive data)
2. server: sends ACKbit=1, ACKnum=x+1. Server state: CLOSE_WAIT (can still send data). Client state: FIN_WAIT_2 (wait for server close)
3. server: sends FINbit=1, seq=y. Server state: LAST_ACK (can no longer send data)
4. client: sends ACKbit=1, ACKnum=y+1. Client state: TIMED_WAIT (timed wait for 2*max segment lifetime), then CLOSED
5. server: on receiving the ACK. Server state: CLOSED

The "Two Army Problem"

84

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
• a top-10 problem!

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity

[Figure: Host A and Host B send original data at rate λ_in into a router with unlimited shared output link buffers; throughput is λ_out. Plots: λ_out versus λ_in, levelling off at R/2; delay versus λ_in, growing as λ_in approaches R/2.]

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: λ_in = λ_out
  – transport-layer input includes retransmissions: λ'_in ≥ λ_in

[Figure: Host A and Host B share a router with finite shared output link buffers. λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out: output.]

Causes/costs of congestion: scenario 2

idealization: perfect knowledge
• sender sends only when router buffers available

[Figure: Host A keeps a copy of each packet and sends into a router with finite shared output link buffers, shown once with free buffer space and once with no buffer space. λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out. Plot: λ_out versus λ_in, both axes up to R/2.]

Causes/costs of congestion: scenario 2

Idealization: known loss
packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost

[Figure: Host A sends to Host B through a router with free buffer space. λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out.]

Causes/costs of congestion: scenario 2

Idealization: known loss
packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost

[Figure: plot of λ_out versus input rate, both axes up to R/2.]
when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)

Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered

[Figure: Host A keeps a copy, times out, and sends it again although the router has free buffer space. Plot of λ_out versus input rate, both axes up to R/2.]
when sending at R/2, some packets are retransmissions, including duplicates that are delivered!

Causes/costs of congestion: scenario 2

Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered

[Figure: plot of λ_out versus input rate, both axes up to R/2.]
when sending at R/2, some packets are retransmissions, including duplicates that are delivered!

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  – decreasing goodput

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

[Figure: Hosts A, B, C, D send over multihop paths through routers with finite shared output link buffers. λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out.]

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!

Causes/costs of congestion: scenario 3

[Figure: plot of λ_out (throughput by blue traffic) versus λ'_in (offered load by Host A), both axes up to R/2. Annotation: bandwidth wastage for packets dropped at the 2nd router.]

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss

[Figure: AIMD saw tooth behavior, probing for bandwidth. y-axis: cwnd, TCP sender congestion window size; x-axis: time. Additively increase window size … until loss occurs (then cut window in half).]
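The saw tooth can be reproduced with a toy simulation. The sketch below is illustrative only: it measures cwnd in MSS, advances one RTT per step, and assumes a loss is detected whenever cwnd exceeds a fixed, hypothetical capacity; real loss events are not this regular.

```python
def aimd_trace(capacity_mss, rounds, cwnd=1.0):
    """Return cwnd (in MSS) at the start of each RTT under simple AIMD."""
    trace = []
    for _ in range(rounds):
        trace.append(cwnd)
        if cwnd > capacity_mss:
            cwnd = cwnd / 2  # multiplicative decrease: cut cwnd in half after loss
        else:
            cwnd = cwnd + 1  # additive increase: 1 MSS every RTT
    return trace


if __name__ == "__main__":
    print(aimd_trace(capacity_mss=4, rounds=8))
```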

TCP Congestion Control: details

• sender limits transmission:

$LastByteSent - LastByteAcked \le cwnd$

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

$rate \approx \frac{cwnd}{RTT}$ bytes/sec

[Figure: sender sequence number space, showing last byte ACKed; sent, not-yet ACKed ("in-flight"); last byte sent. cwnd spans the in-flight bytes.]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A sends one segment to Host B, then two segments, then four segments, one RTT apart.]

TCP: detecting, reacting to loss

• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

initially (Λ):
    cwnd = 1 MSS
    ssthresh = 64 KB
    dupACKcount = 0
    → slow start

State: slow start
  new ACK:
      cwnd = cwnd + MSS
      dupACKcount = 0
      transmit new segment(s), as allowed
  duplicate ACK:
      dupACKcount++
  cwnd > ssthresh:
      Λ
      → congestion avoidance
  dupACKcount == 3:
      ssthresh = cwnd/2
      cwnd = ssthresh + 3 MSS
      retransmit missing segment
      → fast recovery
  timeout:
      ssthresh = cwnd/2
      cwnd = 1 MSS
      dupACKcount = 0
      retransmit missing segment

State: congestion avoidance
  new ACK:
      cwnd = cwnd + MSS · (MSS/cwnd)
      dupACKcount = 0
      transmit new segment(s), as allowed
  duplicate ACK:
      dupACKcount++
  dupACKcount == 3:
      ssthresh = cwnd/2
      cwnd = ssthresh + 3 MSS
      retransmit missing segment
      → fast recovery
  timeout:
      ssthresh = cwnd/2
      cwnd = 1 MSS
      dupACKcount = 0
      retransmit missing segment
      → slow start

State: fast recovery
  duplicate ACK:
      cwnd = cwnd + MSS
      transmit new segment(s), as allowed
  new ACK:
      cwnd = ssthresh
      dupACKcount = 0
      → congestion avoidance
  timeout:
      ssthresh = cwnd/2
      cwnd = 1 MSS
      dupACKcount = 0
      retransmit missing segment
      → slow start
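The state machine can be sketched as a small Python class that tracks only cwnd, ssthresh and the state. Several things here are assumptions for illustration: the class and method names, MSS = 1460 bytes, and reading the "+ 3" in the fast recovery transition as 3 MSS. Transmission and retransmission of segments are left out.

```python
MSS = 1460  # bytes; an assumed value, the slides do not fix one


class TcpCongestionControl:
    """Sketch of the slow start / congestion avoidance / fast recovery FSM."""

    def __init__(self):
        self.state = "slow_start"
        self.cwnd = 1 * MSS
        self.ssthresh = 64 * 1024
        self.dup_ack_count = 0

    def on_new_ack(self):
        if self.state == "slow_start":
            self.cwnd += MSS
            self.dup_ack_count = 0
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        elif self.state == "congestion_avoidance":
            self.cwnd += MSS * MSS / self.cwnd  # about 1 MSS per RTT in total
            self.dup_ack_count = 0
        else:  # fast_recovery
            self.cwnd = self.ssthresh
            self.dup_ack_count = 0
            self.state = "congestion_avoidance"

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += MSS
            return
        self.dup_ack_count += 1
        if self.dup_ack_count == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast_recovery"  # missing segment is retransmitted here

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1 * MSS
        self.dup_ack_count = 0
        self.state = "slow_start"         # missing segment is retransmitted here


if __name__ == "__main__":
    cc = TcpCongestionControl()
    for _ in range(3):
        cc.on_new_ack()
    for _ in range(3):
        cc.on_dup_ack()
    print(cc.state, cc.cwnd / MSS, cc.ssthresh / MSS)
```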

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is 3/4 W
  – avg. throughput is 3/4 W per RTT

$\text{avg TCP throughput} = \frac{3}{4} \cdot \frac{W}{RTT}$ bytes/sec

[Figure: saw tooth of window size oscillating between W/2 and W, with average 3/4 W.]

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

$\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \sqrt{L}}$

  to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
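Both numbers in this example can be recomputed from the formula. A short Python sketch, with illustrative function names; MSS is given in bytes and throughput in bits per second:

```python
import math


def required_window_segments(throughput_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain the throughput: rate * RTT / MSS."""
    return throughput_bps * rtt_s / (mss_bytes * 8)


def mathis_throughput_bps(mss_bytes, rtt_s, loss):
    """TCP throughput = 1.22 * MSS / (RTT * sqrt(L)), in bits per second."""
    return 1.22 * mss_bytes * 8 / (rtt_s * math.sqrt(loss))


def required_loss_rate(throughput_bps, rtt_s, mss_bytes):
    """Solve the throughput formula for L."""
    return (1.22 * mss_bytes * 8 / (rtt_s * throughput_bps)) ** 2


if __name__ == "__main__":
    # Slide example: 1500-byte segments, 100 ms RTT, 10 Gbps target
    print(int(required_window_segments(10e9, 0.1, 1500)))
    print(required_loss_rate(10e9, 0.1, 1500))
```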

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R.]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput versus Connection 1 throughput, both axes up to R, with the full bandwidth utilization line ((X1, Y1) where X1 + Y1 = R) and the equal bandwidth share line ((X2, Y2) where X2 = Y2). The trajectory alternates between congestion avoidance: additive increase, and loss: decrease window by factor of 2.]
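The convergence argument can be checked with a toy simulation of two AIMD flows. The sketch assumes synchronized losses, that is, both flows halve their rate whenever the combined rate exceeds the capacity R; the function name and the units are illustrative.

```python
def two_flow_aimd(x, y, capacity, rounds):
    """Simulate two AIMD flows sharing one link; return their final rates."""
    for _ in range(rounds):
        if x + y > capacity:
            x, y = x / 2, y / 2  # loss: both decrease by a factor of 2
        else:
            x, y = x + 1, y + 1  # congestion avoidance: additive increase
    return x, y


if __name__ == "__main__":
    # start far from the equal-share line: flow 2 has almost all the bandwidth
    x, y = two_flow_aimd(1.0, 41.0, capacity=50.0, rounds=200)
    print(round(x, 3), round(y, 3))
```

Additive increase leaves the gap between the two rates unchanged, while every loss halves it, so the pair drifts toward the equal bandwidth share line.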

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets roughly R/2 (11 of the 20 connections)
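The arithmetic behind the parallel-connection example is a one-liner. A sketch assuming every TCP connection on the bottleneck gets an equal share; the function name is illustrative:

```python
from fractions import Fraction


def app_share(app_connections, other_connections):
    """Fraction of link rate R an app gets if each connection has an equal share."""
    return Fraction(app_connections, app_connections + other_connections)


if __name__ == "__main__":
    print(app_share(1, 9))   # 1 new connection next to 9 existing ones
    print(app_share(11, 9))  # 11 new connections next to 9 existing ones
```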

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and is marked ECN=11 by a router; the destination returns a TCP ACK segment with ECE=1.]

Page 21: ChapterIII: Transport Layer

rdt2.1: sender, handles garbled ACK/NAKs

State: Wait for call 0 from above
  rdt_send(data):
      sndpkt = make_pkt(0, data, checksum)
      udt_send(sndpkt)
      → Wait for ACK or NAK 0

State: Wait for ACK or NAK 0
  rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isNAK(rcvpkt) ):
      udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt):
      Λ
      → Wait for call 1 from above

State: Wait for call 1 from above
  rdt_send(data):
      sndpkt = make_pkt(1, data, checksum)
      udt_send(sndpkt)
      → Wait for ACK or NAK 1

State: Wait for ACK or NAK 1
  rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isNAK(rcvpkt) ):
      udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt):
      Λ
      → Wait for call 0 from above

rdt2.1: receiver, handles garbled ACK/NAKs

State: Wait for 0 from below
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt):
      extract(rcvpkt, data)
      deliver_data(data)
      sndpkt = make_pkt(ACK, chksum)
      udt_send(sndpkt)
      → Wait for 1 from below
  rdt_rcv(rcvpkt) && corrupt(rcvpkt):
      sndpkt = make_pkt(NAK, chksum)
      udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt):
      sndpkt = make_pkt(ACK, chksum)
      udt_send(sndpkt)

State: Wait for 1 from below
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt):
      extract(rcvpkt, data)
      deliver_data(data)
      sndpkt = make_pkt(ACK, chksum)
      udt_send(sndpkt)
      → Wait for 0 from below
  rdt_rcv(rcvpkt) && corrupt(rcvpkt):
      sndpkt = make_pkt(NAK, chksum)
      udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt):
      sndpkt = make_pkt(ACK, chksum)
      udt_send(sndpkt)

rdt2.1: Example 1

[Figure sequence: the sender FSM (Wait for call 0 from above, Wait for ACK or NAK 0, Wait for call 1 from above, Wait for ACK or NAK 1) and the receiver FSM (Wait for 0 from below, Wait for 1 from below), with one transition shown per step.]

1. sender: rdt_send(data); sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
2. receiver: rdt_rcv(rcvpkt) && corrupt(rcvpkt); sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender: rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isNAK(rcvpkt) ); udt_send(sndpkt)
4. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
5. sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt); Λ
6. both FSMs at rest after the exchange

rdt2.1: Example 2

[Figure sequence: the same sender and receiver FSMs as in Example 1, with one transition shown per step.]

1. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
2. sender: rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isNAK(rcvpkt) ); udt_send(sndpkt)
3. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
4. sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt); Λ
5. both FSMs at rest after the exchange

rdt2.1: discussion

sender:
• seq # added to pkt
• two seq. #'s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  – state must "remember" whether "expected" pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  – state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender

rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  – receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt

rdt2.2: sender, receiver fragments

sender FSM fragment:

state "Wait for call 0 from above":
  rdt_send(data):
    sndpkt = make_pkt(0, data, checksum)
    udt_send(sndpkt)
    go to "Wait for ACK 0"

state "Wait for ACK 0":
  rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isACK(rcvpkt, 1) ):
    udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0):
    Λ (no action)

receiver FSM fragment:

transition into "Wait for 0 from below":
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt):
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(ACK1, chksum)
    udt_send(sndpkt)

state "Wait for 0 from below":
  rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || has_seq1(rcvpkt) ):
    udt_send(sndpkt)

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
  – checksum, seq #, ACKs, retransmissions will be of help ... but not enough

approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  – retransmission will be duplicate, but seq #'s already handle this
  – receiver must specify seq # of pkt being ACKed
• requires countdown timer
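The timeout-and-retransmit idea can be sketched as a small simulation. This is an illustrative sketch only, not the slides' FSM: the function name `run_rdt3` and the `lose_data` / `lose_ack` knobs are hypothetical, losses are scripted rather than random, a lost packet or ACK is modelled as an immediate timeout, and corruption and premature timeouts are left out.

```python
def run_rdt3(messages, lose_data=(), lose_ack=()):
    """Stop-and-wait with a 1-bit sequence number and retransmit on timeout.

    lose_data / lose_ack are sets of 0-based transmission indices to drop
    (hypothetical knobs for this sketch, counted over all data / ACK sends).
    """
    delivered, log = [], []
    expected = 0             # receiver: seq # it will deliver next
    data_tx = ack_tx = 0     # how many data packets / ACKs were sent so far
    seq = 0                  # sender: seq # of the packet in flight
    for msg in messages:
        acked = False
        while not acked:
            # sender (re)transmits the current packet and starts its timer
            lost = data_tx in lose_data
            data_tx += 1
            log.append(f"send pkt{seq}" + (" (lost)" if lost else ""))
            if lost:
                log.append("timeout")
                continue
            # receiver delivers only the expected seq #, else flags a duplicate
            if seq == expected:
                delivered.append(msg)
                expected ^= 1
            else:
                log.append(f"rcv pkt{seq} (duplicate)")
            # receiver ACKs the seq # it just received
            lost = ack_tx in lose_ack
            ack_tx += 1
            log.append(f"send ack{seq}" + (" (lost)" if lost else ""))
            if lost:
                log.append("timeout")
                continue
            acked = True
        seq ^= 1             # alternate between 0 and 1
    return delivered, log


delivered, log = run_rdt3(["a", "b", "c"], lose_data={1}, lose_ack={1})
print(delivered)
print("\n".join(log))
```

Dropping one data packet and one ACK still delivers every message exactly once, because the receiver recognises the retransmission by its sequence number.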

37

rdt3.0 sender

state "Wait for call 0 from above":
  rdt_send(data):
    sndpkt = make_pkt(0, data, checksum)
    udt_send(sndpkt)
    start_timer
    go to "Wait for ACK0"
  rdt_rcv(rcvpkt):
    Λ (no action)

state "Wait for ACK0":
  rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isACK(rcvpkt, 1) ):
    Λ (no action)
  timeout:
    udt_send(sndpkt)
    start_timer
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0):
    stop_timer
    go to "Wait for call 1 from above"

state "Wait for call 1 from above":
  rdt_send(data):
    sndpkt = make_pkt(1, data, checksum)
    udt_send(sndpkt)
    start_timer
    go to "Wait for ACK1"
  rdt_rcv(rcvpkt):
    Λ (no action)

state "Wait for ACK1":
  rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isACK(rcvpkt, 0) ):
    Λ (no action)
  timeout:
    udt_send(sndpkt)
    start_timer
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1):
    stop_timer
    go to "Wait for call 0 from above"

rdt3.0 in action

(a) no loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0

(b) packet loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 (pkt1 lost, X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0

(c) ACK loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 lost, X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 delayed)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  sender: rcv ack1 (second copy), do nothing
  receiver: rcv pkt0, send ack0

Performance of rdt3.0

• rdt3.0 is correct, but performance is far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

$$D_{trans} = \frac{L}{R} = \frac{8000 \text{ bits}}{10^{9} \text{ bits/sec}} = 8 \text{ microsecs}$$

• U sender: utilization, the fraction of time sender is busy sending

$$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$$

• if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
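The utilization figure can be reproduced numerically. A minimal sketch, assuming the slide's values (L = 8000 bits, R = 1 Gbps, RTT = 30 ms); the function names are chosen here for illustration.

```python
def sender_utilization(l_bits, r_bps, rtt_s):
    """Fraction of time a stop-and-wait sender spends transmitting."""
    d_trans = l_bits / r_bps            # transmission delay L/R in seconds
    return d_trans / (rtt_s + d_trans)


def stop_and_wait_throughput(l_bits, r_bps, rtt_s):
    """Bytes per second when one packet is sent per (RTT + L/R)."""
    return (l_bits / 8) / (rtt_s + l_bits / r_bps)


u = sender_utilization(8000, 1e9, 0.030)
tput = stop_and_wait_throughput(8000, 1e9, 0.030)
print(f"U_sender = {u:.5f}, throughput = {tput / 1000:.1f} kB/sec")
```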

rdt3.0: stop-and-wait operation

timeline (sender to receiver and back):
• t = 0: first packet bit transmitted
• t = L / R: last packet bit transmitted
• first packet bit arrives at receiver
• last packet bit arrives, receiver sends ACK
• t = RTT + L / R: ACK arrives, sender sends next packet

$$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$$

Pipelined protocols

pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver

• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

timeline (sender to receiver and back):
• t = 0: first packet bit transmitted
• t = L / R: last bit transmitted
• first packet bit arrives at receiver
• last packet bit arrives, receiver sends ACK
• last bit of 2nd packet arrives, receiver sends ACK
• last bit of 3rd packet arrives, receiver sends ACK
• t = RTT + L / R: ACK arrives, sender sends next packet

3-packet pipelining increases utilization by a factor of 3:

$$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$$

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  – doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  – when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  – when timer expires, retransmit only that unacked packet

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n ("cumulative ACK")
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

initially (Λ):
  base = 1
  nextseqnum = 1

state "Wait":

rdt_send(data):
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)

timeout:
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  ...
  udt_send(sndpkt[nextseqnum-1])

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer

rdt_rcv(rcvpkt) && corrupt(rcvpkt):
  Λ (no action)
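The sender FSM above translates fairly directly into code. The following is a sketch under stated assumptions, not the slides' implementation: the class name `GBNSender` and the `send` callback (standing in for udt_send) are hypothetical, the timer is reduced to a boolean flag, sequence numbers do not wrap, and `max()` is an added guard so a stale ACK cannot move `base` backwards.

```python
class GBNSender:
    """Go-Back-N sender state: window of N, cumulative ACKs, one timer."""

    def __init__(self, window, send):
        self.N = window
        self.send = send            # callback standing in for udt_send
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}            # seq # -> buffered packet
        self.timer_running = False  # simplified stand-in for a real timer

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False            # window full: refuse_data(data)
        pkt = (self.nextseqnum, data)
        self.sndpkt[self.nextseqnum] = pkt
        self.send(pkt)
        if self.base == self.nextseqnum:
            self.timer_running = True
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        # cumulative ACK: everything up to and including acknum is done
        self.base = max(self.base, acknum + 1)
        for seq in [s for s in self.sndpkt if s < self.base]:
            del self.sndpkt[seq]
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        # go back N: resend every packet that is still unacknowledged
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.send(self.sndpkt[seq])


sent = []
sender = GBNSender(window=4, send=sent.append)
accepted = [sender.rdt_send(c) for c in "abcde"]   # fifth call is refused
sender.on_ack(2)                                   # ACKs packets 1 and 2
sender.on_timeout()                                # resends packets 3 and 4
print(accepted, sent)
```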


GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don't buffer): no receiver buffering!
  – re-ACK pkt with highest in-order seq #

initially (Λ):
  expectedseqnum = 1
  sndpkt = make_pkt(0, ACK, chksum)

state "Wait":

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++

default:
  udt_send(sndpkt)
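The receiver side is short enough to sketch in a few lines. This is an illustrative sketch: the class name `GBNReceiver` and the `corrupt` flag are hypothetical, and the ACK is returned as a number instead of being sent over a channel.

```python
class GBNReceiver:
    """Go-Back-N receiver: no buffering, always ACK the highest in-order seq #."""

    def __init__(self):
        self.expectedseqnum = 1
        self.last_ack = 0           # matches the initial make_pkt(0, ACK, chksum)
        self.delivered = []

    def rdt_rcv(self, seq, data, corrupt=False):
        if not corrupt and seq == self.expectedseqnum:
            self.delivered.append(data)      # deliver_data(data)
            self.last_ack = seq
            self.expectedseqnum += 1
        # default case: out-of-order or corrupt packets just re-trigger the last ACK
        return self.last_ack


rcv = GBNReceiver()
acks = [rcv.rdt_rcv(1, "a"), rcv.rdt_rcv(3, "c"), rcv.rdt_rcv(2, "b"), rcv.rdt_rcv(3, "c")]
print(acks, rcv.delivered)
```

The second call shows the duplicate ACK: packet 3 arrives before packet 2, is discarded, and ACK 1 is repeated.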


GBN in action

sender window (N=4), sequence numbers 0 to 8; pkt2 is lost (X)

sender:
  send pkt0, send pkt1, send pkt2, send pkt3 (wait)
  rcv ack0, send pkt4
  rcv ack1, send pkt5
  ignore duplicate ACK
  pkt 2 timeout: send pkt2, send pkt3, send pkt4, send pkt5

receiver:
  receive pkt0, send ack0
  receive pkt1, send ack1
  receive pkt3, discard, (re)send ack1
  receive pkt4, discard, (re)send ack1
  receive pkt5, discard, (re)send ack1
  rcv pkt2, deliver, send ack2
  rcv pkt3, deliver, send ack3
  rcv pkt4, deliver, send ack4
  rcv pkt5, deliver, send ack5


Selective repeat

• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #'s
  – limits seq #s of sent, unACKed packets

Selective repeat sender receiver windows

54

Selective repeat

sender:

data from above:
• if next available seq # in window, send pkt

timeout(n):
• resend pkt n, restart timer

ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver:

pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt

pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)

otherwise:
• ignore
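The receiver rules above can be sketched as follows. This is an illustrative sketch only: the class name `SRReceiver` is hypothetical, sequence numbers are treated as unbounded integers (no wraparound), and `on_pkt` returns the ACK number to send, or `None` when the packet is ignored.

```python
class SRReceiver:
    """Selective Repeat receiver: buffer out-of-order packets, ACK each one."""

    def __init__(self, window):
        self.N = window
        self.rcvbase = 0
        self.buffer = {}            # seq # -> data waiting for in-order delivery
        self.delivered = []

    def on_pkt(self, n, data):
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            # deliver the in-order run starting at rcvbase, advancing the window
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n                # send ACK(n)
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n                # already delivered: re-ACK so sender can advance
        return None                 # otherwise: ignore


rcv = SRReceiver(window=4)
acks = [rcv.on_pkt(n, f"pkt{n}") for n in (0, 1, 3, 4, 5, 2)]
print(acks, rcv.delivered)
```

The arrival order used here mirrors the "pkt2 lost, then retransmitted" case: packets 3, 4 and 5 are buffered and released together once packet 2 arrives.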

Selective repeat in action

sender window (N=4), sequence numbers 0 to 8; pkt2 is lost (X)

sender:
  send pkt0, send pkt1, send pkt2, send pkt3 (wait)
  rcv ack0, send pkt4
  rcv ack1, send pkt5
  record ack3 arrived
  pkt 2 timeout: send pkt2
  record ack4 arrived
  record ack5 arrived

receiver:
  receive pkt0, send ack0
  receive pkt1, send ack1
  receive pkt3, buffer, send ack3
  receive pkt4, buffer, send ack4
  receive pkt5, buffer, send ack5
  rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

(a) no problem
  sender sends pkt0, pkt1, pkt2; receiver gets all three and advances its window to 3, 0, 1
  ACKs arrive, sender window advances; sender sends pkt3 (lost, X), then a new pkt0
  receiver will accept packet with seq number 0

(b) oops!
  sender sends pkt0, pkt1, pkt2; receiver gets all three and advances its window to 3, 0, 1
  all three ACKs are lost (X X X)
  sender times out and retransmits the old pkt0
  receiver will accept packet with seq number 0

receiver can't see sender side; receiver behavior is identical in both cases! something's (very) wrong!

• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
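The slide leaves the question open. The answer usually given is that the window must be at most half the sequence-number space; that claim is an assumption here, not something the slide states. The sketch below checks it by brute force for the worst case in (b): the receiver has advanced a full window while the sender may still retransmit its old window. The function name is hypothetical.

```python
def ambiguous_seq_numbers(seq_space, window):
    """Seq #'s that appear both in the sender's old window and in the
    receiver's window after it has advanced by one full window."""
    old_window = {i % seq_space for i in range(window)}
    new_window = {(window + i) % seq_space for i in range(window)}
    return sorted(old_window & new_window)


print(ambiguous_seq_numbers(4, 3))   # the slide's example: overlap exists
print(ambiguous_seq_numbers(4, 2))   # window = half the seq space: no overlap
```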

58

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver

TCP segment structure

fields, in order (each row is 32 bits wide):
  source port #, dest port #
  sequence number
  acknowledgement number
  head len, not used, flag bits U A P R S F, receive window
  checksum, Urg data pointer
  options (variable length)
  application data (variable length)

notes:
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• sequence and acknowledgement numbers: counting by bytes of data (not segments!)
• checksum: Internet checksum (as in UDP)

TCP seq. numbers, ACKs

sequence numbers:
  – byte stream "number" of first byte in segment's data

acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK

Q: how receiver handles out-of-order segments
  – A: TCP spec doesn't say; up to implementor

[Figure: sender sequence number space, split into: sent ACKed | sent, not-yet ACKed ("in-flight") | usable but not yet sent | not usable; window size N. The outgoing segment from sender carries the sequence number; the incoming segment to sender carries the acknowledgement number (A). Both show header fields source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer.]

Byte stream in TCP

62

Window N bytes

HTTP Get Message (K bytes)

100th byte

TCP header(seq no = 100)

M bytes

HTTP Get Message (K bytes)

Cannot be transmitted now

TCP seq. numbers, ACKs

simple telnet scenario (Host A, Host B):

1. User types 'C'.
   Host A to Host B: Seq=42, ACK=79, data = 'C'
2. Host B ACKs receipt of 'C', echoes back 'C'.
   Host B to Host A: Seq=79, ACK=43, data = 'C'
3. Host A ACKs receipt of echoed 'C'.
   Host A to Host B: Seq=43, ACK=80
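The numbers in the telnet scenario follow from two rules: the sequence number is the byte-stream number of the first data byte, and the ACK number is the next byte expected from the other side. A small sketch (function name and dictionary layout are illustrative choices, not from the slides):

```python
def telnet_echo(seq_a=42, seq_b=79, char=b"C"):
    """Return the three segments of the echo exchange as dictionaries."""
    n = len(char)
    # A sends the typed character; it expects B's byte seq_b next
    seg1 = {"from": "A", "seq": seq_a, "ack": seq_b, "data": char}
    # B echoes it; B has received n bytes from A, so it expects seq_a + n
    seg2 = {"from": "B", "seq": seq_b, "ack": seq_a + n, "data": char}
    # A acknowledges the echo with an empty segment
    seg3 = {"from": "A", "seq": seq_a + n, "ack": seq_b + n, "data": b""}
    return [seg1, seg2, seg3]


for seg in telnet_echo():
    print(seg)
```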

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT

TCP round trip time, timeout

$$\text{EstimatedRTT} = (1-\alpha)\cdot\text{EstimatedRTT} + \alpha\cdot\text{SampleRTT}$$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr; SampleRTT and EstimatedRTT in milliseconds (axis 100 to 350) plotted against time in seconds]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

$$\text{DevRTT} = (1-\beta)\cdot\text{DevRTT} + \beta\cdot|\text{SampleRTT}-\text{EstimatedRTT}|$$

(typically, β = 0.25)

$$\text{TimeoutInterval} = \text{EstimatedRTT} + 4\cdot\text{DevRTT}$$

(estimated RTT plus "safety margin")
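The two moving averages and the timeout formula can be combined in a small estimator. This is a sketch under assumptions: the class name is hypothetical, EstimatedRTT is updated before DevRTT (the order in which the slides present the formulas), and the initial DevRTT of half the first sample is borrowed from RFC 6298 rather than from the slides.

```python
class RttEstimator:
    """EWMA-based RTT estimate and retransmission timeout."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha, self.beta = alpha, beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2     # assumption: RFC 6298 initial value

    def update(self, sample_rtt):
        # fold the new sample into the smoothed estimate
        self.estimated_rtt = (1 - self.alpha) * self.estimated_rtt + self.alpha * sample_rtt
        # track how far samples stray from the estimate
        self.dev_rtt = (1 - self.beta) * self.dev_rtt + self.beta * abs(sample_rtt - self.estimated_rtt)
        return self.timeout_interval()

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)      # milliseconds
for sample in (120.0, 95.0, 110.0):
    print(sample, round(est.update(sample), 2))
```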

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks

let's initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control

TCP sender events

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments

TCP sender (simplified)

initially (Λ):
  NextSeqNum = InitialSeqNum
  SendBase = InitialSeqNum

state "wait for event":

data received from application above:
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., "send")
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
    start timer

timeout:
  retransmit not-yet-acked segment with smallest seq. #
  start timer

ACK received, with ACK field value y:
  if (y > SendBase) {
    SendBase = y
    /* SendBase–1: last cumulatively ACKed byte */
    if (there are currently not-yet-acked segments)
      start timer
    else stop timer
  }
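The three event handlers can be sketched as a class. This is an illustration under stated assumptions, not a TCP implementation: the class name and the `send` callback are hypothetical, the timer is a boolean flag, and duplicate ACKs, flow control and congestion control are ignored as on the slide.

```python
class SimpleTcpSender:
    """Simplified TCP sender: cumulative ACKs and a single timer."""

    def __init__(self, initial_seq, send):
        self.send = send                # callback standing in for "pass segment to IP"
        self.next_seq_num = initial_seq
        self.send_base = initial_seq
        self.unacked = {}               # seq # of first byte -> segment data
        self.timer_running = False

    def on_app_data(self, data):
        seq = self.next_seq_num
        self.unacked[seq] = data
        self.send(seq, data)
        self.next_seq_num += len(data)
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if not self.unacked:
            return
        seq = min(self.unacked)         # not-yet-acked segment with smallest seq #
        self.send(seq, self.unacked[seq])
        self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y          # send_base - 1 is the last cumulatively ACKed byte
            done = [s for s, d in self.unacked.items() if s + len(d) <= y]
            for seq in done:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)


sent = []
tcp = SimpleTcpSender(92, lambda seq, data: sent.append((seq, len(data))))
tcp.on_app_data(b"x" * 8)     # Seq=92, 8 bytes
tcp.on_app_data(b"y" * 20)    # Seq=100, 20 bytes
tcp.on_ack(120)               # one cumulative ACK covers both segments
print(sent, tcp.send_base, tcp.timer_running)
```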

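TCP: retransmission scenarios

lost ACK scenario (Host A, Host B):
  A sends Seq=92, 8 bytes of data
  B sends ACK=100, which is lost (X)
  A times out and resends Seq=92, 8 bytes of data
  B sends ACK=100

premature timeout (Host A, Host B):
  SendBase=92
  A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
  B sends ACK=100 and ACK=120
  A times out before the ACKs arrive and resends Seq=92, 8 bytes of data
  ACK=100 arrives: SendBase=100
  ACK=120 arrives: SendBase=120
  B answers the retransmission with ACK=120: SendBase=120

cumulative ACK (Host A, Host B):
  A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
  B sends ACK=100, which is lost (X), then ACK=120
  ACK=120 arrives before the timeout, so nothing is retransmitted
  A sends Seq=120, 15 bytes of data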

TCP ACK generation [RFC 5681]

event at receiver → TCP receiver action

1. arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
   → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
2. arrival of in-order segment with expected seq #; one other segment has ACK pending
   → immediately send single cumulative ACK, ACKing both in-order segments
3. arrival of out-of-order segment with higher-than-expected seq #; gap detected
   → immediately send duplicate ACK, indicating seq # of next expected byte
4. arrival of segment that partially or completely fills gap
   → immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  – likely that unacked segment lost, so don't wait for timeout

[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data (lost, X). Host B returns ACK=100 four times. On the 3 DUP ACKs, Host A resends Seq=100, 20 bytes of data before the timeout expires.]
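Duplicate-ACK counting can be sketched in a few lines. One reading is assumed here, matching the figure: the first ACK for a value is the original, and fast retransmit fires when the third duplicate of it arrives (the fourth ACK with that value overall). The function name is hypothetical.

```python
def fast_retransmit_points(acks):
    """Return the ACK values at which a fast retransmit would be triggered."""
    fired = []
    last_ack, dup_count = None, 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == 3:          # third duplicate of the same ACK
                fired.append(ack)       # resend the segment starting at byte `ack`
        else:
            last_ack, dup_count = ack, 0
    return fired


print(fast_retransmit_points([100, 100, 100, 100, 120]))
```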

TCP flow control

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

application may remove data from TCP socket buffers ... slower than TCP receiver is delivering (sender is sending)

[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code in the OS into the TCP socket receiver buffers, which the application process reads.]

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd), and drain to the application process.]
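The advertised window is the free space in the receive buffer, and the sender keeps its in-flight data within it. A minimal sketch; the variable names `last_byte_rcvd`, `last_byte_read`, `last_byte_sent` and `last_byte_acked` are assumptions following common textbook notation and do not appear on these slides.

```python
def advertised_rwnd(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free buffer space the receiver advertises as rwnd."""
    buffered = last_byte_rcvd - last_byte_read   # data not yet read by the application
    return rcv_buffer - buffered


def sender_may_send(last_byte_sent, last_byte_acked, nbytes, rwnd):
    """True if sending nbytes more keeps in-flight data within rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + nbytes <= rwnd


rwnd = advertised_rwnd(rcv_buffer=4096, last_byte_rcvd=3000, last_byte_read=1000)
print(rwnd, sender_may_send(5000, 4000, 1000, rwnd))
```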

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

state held on each side (client and server), between application and network:
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client

client: Socket clientSocket = new Socket("hostname", "port number");

server: Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

client state: CLOSED → SYNSENT → ESTAB
server state: LISTEN → SYN RCVD → ESTAB

1. client: choose init seq num, x; send TCP SYN msg
   segment: SYNbit=1, Seq=x
   client enters SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN
   segment: SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1
   server enters SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK; this segment may contain client-to-server data
   segment: ACKbit=1, ACKnum=y+1
   client enters ESTAB
4. server: received ACK(y) indicates client is live
   server enters ESTAB
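The numbering in the handshake can be traced with a short sketch. It only models the three segments and the state changes named on the slide; the function name and the dictionary representation of a segment are illustrative choices.

```python
def three_way_handshake(x, y):
    """Return the three handshake segments and the (client, server) state trace."""
    client, server = "CLOSED", "LISTEN"
    trace = [(client, server)]

    syn = {"SYNbit": 1, "Seq": x}                        # client picks x
    client = "SYNSENT"
    trace.append((client, server))

    synack = {"SYNbit": 1, "Seq": y,                     # server picks y
              "ACKbit": 1, "ACKnum": syn["Seq"] + 1}     # and acks the SYN
    server = "SYN RCVD"
    trace.append((client, server))

    ack = {"ACKbit": 1, "ACKnum": synack["Seq"] + 1}     # client acks the SYNACK
    client = "ESTAB"
    trace.append((client, server))

    server = "ESTAB"                                     # server receives ACK(y)
    trace.append((client, server))
    return [syn, synack, ack], trace


segments, trace = three_way_handshake(x=1000, y=5000)
print(segments)
print(trace)
```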

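TCP 3-way handshake: FSM

server side:
  closed → listen:
    event: Socket connectionSocket = welcomeSocket.accept();
    action: Λ
  listen → SYN rcvd:
    event: SYN(x)
    action: SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
  SYN rcvd → ESTAB:
    event: ACK(ACKnum=y+1)
    action: Λ

client side:
  closed → SYN sent:
    event: Socket clientSocket = new Socket("hostname", "port number");
    action: SYN(seq=x)
  SYN sent → ESTAB:
    event: SYNACK(seq=y, ACKnum=x+1)
    action: ACK(ACKnum=y+1)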

TCP: closing a connection

• client, server each close their side of connection
  – send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

sequence (client state and server state both start in ESTAB):

1. client calls clientSocket.close() and sends FINbit=1, seq=x
   client enters FIN_WAIT_1: can no longer send but can receive data
2. server replies ACKbit=1, ACKnum=x+1
   server enters CLOSE_WAIT: can still send data
   client enters FIN_WAIT_2: wait for server close
3. server sends FINbit=1, seq=y
   server enters LAST_ACK: can no longer send data
4. client replies ACKbit=1, ACKnum=y+1
   client enters TIMED_WAIT: timed wait for 2 × max segment lifetime, then CLOSED
   server enters CLOSED

The "Two Army Problem"

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
• a top-10 problem!

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λin, approaches capacity

[Figure: Host A sends original data at rate λin into unlimited shared output link buffers; Host B receives throughput λout.]
[Figure: two plots against λin (axis up to R/2): λout, which levels off at R/2, and delay, which grows as λin approaches R/2.]

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: λin = λout
  – transport-layer input includes retransmissions: λ'in ≥ λin

[Figure: Host A offers λin (original data) and λ'in (original data, plus retransmitted data) to finite shared output link buffers; Host B receives λout.]

idealization: perfect knowledge
• sender sends only when router buffers available

[Figure: λout plotted against λin, both axes up to R/2. The sender keeps a copy of each packet and transmits only when there is free buffer space.]

Idealization: known loss
packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost

[Figure: λout plotted against λ'in, both axes up to R/2. A packet arriving when there is no buffer space is dropped; the sender resends its copy once there is free buffer space.]

when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)

Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered

[Figure: λout plotted against λ'in, both axes up to R/2. After a timeout the sender sends a second copy even though there is free buffer space and the first copy gets through.]

when sending at R/2, some packets are retransmissions, including duplicates that are delivered!

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  – decreasing goodput

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λin and λ'in increase?

A: as red λ'in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

[Figure: Hosts A, B, C and D connected over multihop paths through routers with finite shared output link buffers; λin: original data; λ'in: original data, plus retransmitted data; λout measured at the receiving host.]

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!

[Figure: λout (throughput by blue traffic) plotted against λ'in (offered load by Host A), axes up to R/2; the gap shows bandwidth wastage for packets dropped at the 2nd router.]

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at

TCP congestion control: additive increase multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss

[Figure: AIMD saw tooth behavior, probing for bandwidth. cwnd (TCP sender congestion window size) plotted against time: additively increase window size ... until loss occurs (then cut window in half).]

TCP Congestion Control: details

• sender limits transmission:

$$\text{LastByteSent} - \text{LastByteAcked} \le \text{cwnd}$$

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

$$\text{rate} \approx \frac{\text{cwnd}}{RTT} \text{ bytes/sec}$$

[Figure: sender sequence number space: last byte ACKed | sent, not-yet ACKed ("in-flight"), spanning cwnd | last byte sent.]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A sends one segment to Host B, then two segments one RTT later, then four segments; time runs downward.]

TCP: detecting, reacting to loss

• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?

A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

initially (Λ):
  cwnd = 1 MSS
  ssthresh = 64 KB
  dupACKcount = 0
  enter slow start

slow start:
  new ACK:
    cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK:
    dupACKcount++
  cwnd > ssthresh:
    Λ; go to congestion avoidance
  dupACKcount == 3:
    ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
  timeout:
    ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment

congestion avoidance:
  new ACK:
    cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK:
    dupACKcount++
  dupACKcount == 3:
    ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
  timeout:
    ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start

fast recovery:
  duplicate ACK:
    cwnd = cwnd + MSS; transmit new segment(s), as allowed
  new ACK:
    cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
  timeout:
    ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment; go to slow start
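The three-state machine can be sketched as a class that only tracks cwnd, ssthresh and the current state. This is a sketch under assumptions: the class name is hypothetical, "ssthresh + 3" is read as three MSS, the slow-start exit test is applied right after each new ACK, and no segments are actually sent.

```python
class RenoCwnd:
    """Tracks cwnd through slow start, congestion avoidance and fast recovery."""

    def __init__(self, mss=1460, ssthresh=64 * 1024):
        self.mss = mss
        self.cwnd = mss
        self.ssthresh = ssthresh
        self.dup_acks = 0
        self.state = "slow start"

    def on_new_ack(self):
        if self.state == "fast recovery":
            self.cwnd = self.ssthresh               # deflate the window
            self.state = "congestion avoidance"
        elif self.state == "slow start":
            self.cwnd += self.mss                   # doubles cwnd every RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion avoidance"
        else:
            self.cwnd += self.mss * self.mss / self.cwnd   # about 1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast recovery":
            self.cwnd += self.mss
            return
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss   # assumption: "+ 3" means 3 MSS
            self.state = "fast recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = self.mss
        self.dup_acks = 0
        self.state = "slow start"


c = RenoCwnd(mss=1, ssthresh=8)
for _ in range(8):
    c.on_new_ack()
print(c.state, c.cwnd)
```

Using an MSS of 1 keeps the window in units of segments, which makes the transitions easy to follow by hand.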

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is 3/4 W
  – avg. throughput is 3/4 W per RTT

$$\text{avg TCP throughput} = \frac{3}{4}\,\frac{W}{RTT} \text{ bytes/sec}$$

[Figure: window size saw tooth oscillating between W/2 and W.]

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

$$\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT\,\sqrt{L}}$$

→ to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!

• new versions of TCP for high-speed
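Both numbers on this slide can be checked by rearranging the formula. A minimal sketch; the function names are chosen here, and MSS is converted from bytes to bits so that the result is in bits per second.

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_s, loss):
    """TCP throughput estimate: 1.22 * MSS / (RTT * sqrt(L)), in bits/sec."""
    return 1.22 * mss_bytes * 8 / (rtt_s * math.sqrt(loss))


def required_loss_rate(mss_bytes, rtt_s, target_bps):
    """Loss probability L that yields the target throughput."""
    return (1.22 * mss_bytes * 8 / (rtt_s * target_bps)) ** 2


def segments_in_flight(target_bps, rtt_s, mss_bytes):
    """Window, in segments, needed to keep the pipe full."""
    return target_bps * rtt_s / (mss_bytes * 8)


print(int(segments_in_flight(10e9, 0.1, 1500)))
print(required_loss_rate(1500, 0.1, 10e9))
```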

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput plotted against Connection 1 throughput, both axes from 0 to R, with the equal bandwidth share line and the full bandwidth utilization line. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2". Marked points: (X1, Y1) where X1+Y1 = R; (X2, Y2) where X2 = Y2.]

Fairness (more)

Fairness and UDP:
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections:
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2
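The parallel-connection example is simple arithmetic if each TCP connection gets an equal share of the link. A short sketch (the function name is hypothetical) shows that the "R/2" on the slide is a rounding of 11/20 of R.

```python
from fractions import Fraction


def app_share(existing_connections, new_connections):
    """Fraction of link rate R obtained by an app opening new_connections,
    assuming every connection receives an equal share."""
    total = existing_connections + new_connections
    return Fraction(new_connections, total)


print(app_share(9, 1))    # one new connection among ten
print(app_share(9, 11))   # eleven new connections among twenty
```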

106

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and is marked ECN=11 by a congested router; the TCP ACK segment returns with ECE=1.]

Page 22: ChapterIII: Transport Layer

rdt2.1: receiver, handles garbled ACK/NAKs

state "Wait for 0 from below":
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt):
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(ACK, chksum)
    udt_send(sndpkt)
    go to "Wait for 1 from below"
  rdt_rcv(rcvpkt) && corrupt(rcvpkt):
    sndpkt = make_pkt(NAK, chksum)
    udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt):
    sndpkt = make_pkt(ACK, chksum)
    udt_send(sndpkt)

state "Wait for 1 from below":
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt):
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(ACK, chksum)
    udt_send(sndpkt)
    go to "Wait for 0 from below"
  rdt_rcv(rcvpkt) && corrupt(rcvpkt):
    sndpkt = make_pkt(NAK, chksum)
    udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt):
    sndpkt = make_pkt(ACK, chksum)
    udt_send(sndpkt)

rdt2.1: Example 1 (slides 23 to 28)

The animation steps through the sender and receiver FSMs when packet 0 arrives corrupted:

1. sender, in "Wait for call 0 from above":
   event: rdt_send(data)
   actions: sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
2. receiver, in "Wait for 0 from below":
   event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
   actions: sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
3. sender, in "Wait for ACK or NAK 0":
   event: rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isNAK(rcvpkt) )
   action: udt_send(sndpkt) (retransmit packet 0)
4. receiver, in "Wait for 0 from below":
   event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
   actions: extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
5. sender, in "Wait for ACK or NAK 0":
   event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
   action: Λ (none); sender moves to "Wait for call 1 from above"

rdt2.1: Example 2

[Animation over the same sender and receiver FSMs as Example 1. Transitions highlighted, in order:]
1. receiver, rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt): extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
2. sender, rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)): udt_send(sndpkt)
3. receiver, rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt): sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
4. sender, rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt): Λ (no action)

rdt2.1: discussion

sender:
• seq # added to pkt
• two seq #'s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  – state must "remember" whether "expected" pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  – state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender

rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  – receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt

rdt2.2: sender, receiver fragments

sender FSM fragment:
• state "Wait for call 0 from above", on rdt_send(data):
  – sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to "Wait for ACK 0"
• state "Wait for ACK 0", on rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)):
  – udt_send(sndpkt)
• state "Wait for ACK 0", on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0):
  – Λ (no action)

receiver FSM fragment:
• entering state "Wait for 0 from below", on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt):
  – extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
• state "Wait for 0 from below", on rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)):
  – udt_send(sndpkt)

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
  – checksum, seq #, ACKs, retransmissions will be of help … but not enough

approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  – retransmission will be duplicate, but seq #'s already handle this
  – receiver must specify seq # of pkt being ACKed
• requires countdown timer

rdt3.0 sender

[Sender FSM with four states]

state "Wait for call 0 from above":
• rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK0"
• rdt_rcv(rcvpkt): Λ

state "Wait for ACK0":
• rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)): Λ
• timeout: udt_send(sndpkt); start_timer
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0): stop_timer; go to "Wait for call 1 from above"

state "Wait for call 1 from above":
• rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK1"
• rdt_rcv(rcvpkt): Λ

state "Wait for ACK1":
• rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,0)): Λ
• timeout: udt_send(sndpkt); start_timer
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1): stop_timer; go to "Wait for call 0 from above"

rdt3.0 in action

[Four sender/receiver timing diagrams]

(a) no loss
  sender: send pkt0 → receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 → receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0 → receiver: rcv pkt0, send ack0

(b) packet loss
  sender: send pkt0 → receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 (lost, X)
  sender: timeout, resend pkt1 → receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0 → receiver: rcv pkt0, send ack0

(c) ACK loss
  sender: send pkt0 → receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 → receiver: rcv pkt1, send ack1 (lost, X)
  sender: timeout, resend pkt1 → receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0 → receiver: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
  sender: send pkt0 → receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 → receiver: rcv pkt1, send ack1 (delayed)
  sender: timeout, resend pkt1 → receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0 → receiver: rcv pkt0, send ack0
  sender: rcv ack1 (duplicate), do nothing
  sender: rcv ack0, send pkt1
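The scenarios above can be replayed with a small simulation. This is only a sketch, not the slides' FSM: the function name `rdt30_transfer` and the idea of scripting losses by transmission index are assumptions made for illustration, and a timeout is modelled simply as "loop and send again".

```python
def rdt30_transfer(messages, lost_data=frozenset(), lost_acks=frozenset()):
    """Replay an alternating-bit (rdt3.0-style) transfer with scripted losses.

    lost_data / lost_acks hold 0-based indices of data transmissions (counted
    over the whole run) whose data packet or returning ACK is dropped.
    Returns (delivered_messages, total_data_transmissions).
    """
    delivered = []
    expected_seq = 0          # receiver state: seq # it is waiting for
    tx = 0                    # running count of data transmissions
    for i, msg in enumerate(messages):
        seq = i % 2           # sender alternates between seq # 0 and 1
        while True:
            this_tx = tx
            tx += 1
            if this_tx in lost_data:
                continue      # data lost: timer expires, retransmit
            if seq == expected_seq:
                delivered.append(msg)   # new packet: deliver to upper layer
                expected_seq ^= 1
            # otherwise it is a duplicate: not delivered, but still ACKed
            if this_tx in lost_acks:
                continue      # ACK lost: timer expires, retransmit
            break             # ACK(seq) arrived: move on to next message
    return delivered, tx


# Scenario (b) then (c): second transmission lost, fourth ACK lost.
delivered, transmissions = rdt30_transfer(["a", "b", "c"], lost_data={1}, lost_acks={3})
```

Each message is delivered exactly once even though two retransmissions occur, which is the role the 1-bit sequence number plays in the diagrams.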

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

  $D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \text{microsecs}$

• U_sender: utilization – fraction of time sender busy sending

  $U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$

• if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!

rdt3.0: stop-and-wait operation

[Timing diagram, sender and receiver:]
• first packet bit transmitted, t = 0
• last packet bit transmitted, t = L / R
• first packet bit arrives
• last packet bit arrives, send ACK
• ACK arrives, send next packet, t = RTT + L / R

$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$

Pipelined protocols

pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver

• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

[Timing diagram, sender and receiver:]
• first packet bit transmitted, t = 0
• last bit transmitted, t = L / R
• first packet bit arrives
• last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• ACK arrives, send next packet, t = RTT + L / R

3-packet pipelining increases utilization by a factor of 3!

$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$
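The utilization figures can be checked numerically. The helper below is a sketch; the function name and the `window` parameter are assumptions, and the result is capped at 1 because a sender cannot be busy more than all of the time.

```python
def sender_utilization(packet_bits, rate_bps, rtt_s, window=1):
    """Fraction of time the sender is busy when `window` packets are
    transmitted back-to-back in every (RTT + L/R) cycle."""
    d_trans = packet_bits / rate_bps          # L / R, in seconds
    return min(1.0, window * d_trans / (rtt_s + d_trans))


# Slide example: 8000-bit packets, 1 Gbps link, RTT = 30 ms.
stop_and_wait = sender_utilization(8000, 1e9, 0.030)       # about 0.00027
pipelined = sender_utilization(8000, 1e9, 0.030, window=3)  # about 0.0008
```

With `window=1` this is the stop-and-wait case of rdt3.0; raising the window scales utilization linearly until the link is full.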

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unACKed packets in pipeline
• receiver only sends cumulative ack
  – doesn't ack packet if there's a gap
• sender has timer for oldest unACKed packet
  – when timer expires, retransmit all unACKed packets

Selective Repeat:
• sender can have up to N unACKed packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unACKed packet
  – when timer expires, retransmit only that unACKed packet

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N, consecutive unACKed pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n – "cumulative ACK"
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

single state "Wait"; initially (Λ): base=1, nextseqnum=1

• rdt_send(data):
    if (nextseqnum < base+N) {
        sndpkt[nextseqnum] = make_pkt(nextseqnum,data,chksum)
        udt_send(sndpkt[nextseqnum])
        if (base == nextseqnum)
            start_timer
        nextseqnum++
    }
    else
        refuse_data(data)

• timeout:
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    …
    udt_send(sndpkt[nextseqnum-1])

• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
        stop_timer
    else
        start_timer

• rdt_rcv(rcvpkt) && corrupt(rcvpkt): Λ
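The sender FSM above translates fairly directly into code. The following is a sketch under stated assumptions: the class name `GBNSender`, the `send` callback (standing in for udt_send), and the boolean `timer_running` flag (standing in for a real timer) are all hypothetical, and an ACK below `base` is ignored explicitly, a guard the FSM text does not spell out.

```python
class GBNSender:
    """Go-Back-N sender events, with unbounded sequence numbers."""

    def __init__(self, window_size, send):
        self.N = window_size
        self.send = send            # hypothetical stand-in for udt_send
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}            # seq # -> packet, kept until ACKed
        self.timer_running = False

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False            # window full: refuse_data(data)
        pkt = (self.nextseqnum, data)
        self.sndpkt[self.nextseqnum] = pkt
        self.send(pkt)
        if self.base == self.nextseqnum:
            self.timer_running = True   # start_timer for oldest unACKed pkt
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        if acknum < self.base:
            return                  # duplicate ACK: nothing new is covered
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)  # cumulative ACK frees these packets
        self.base = acknum + 1
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        self.timer_running = True   # restart timer, then go back N
        for seq in range(self.base, self.nextseqnum):
            self.send(self.sndpkt[seq])


sent = []
gbn = GBNSender(window_size=4, send=sent.append)
accepted = [gbn.rdt_send(d) for d in "abcde"]   # fifth call hits a full window
gbn.on_ack(2)                                    # cumulative ACK for pkts 1 and 2
gbn.on_timeout()                                 # resends pkts 3 and 4
```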


GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don't buffer): no receiver buffering!
  – re-ACK pkt with highest in-order seq #

single state "Wait"; initially (Λ): expectedseqnum=1; sndpkt = make_pkt(0,ACK,chksum)

• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt,expectedseqnum):
    extract(rcvpkt,data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum,ACK,chksum)
    udt_send(sndpkt)
    expectedseqnum++

• default:
    udt_send(sndpkt)
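A matching receiver needs only one counter. The sketch below assumes a hypothetical `GBNReceiver` class whose `rdt_rcv` method returns the ACK number it would send; corruption is passed in as a flag rather than detected by a checksum.

```python
class GBNReceiver:
    """Go-Back-N receiver: no buffering, cumulative ACKs only."""

    def __init__(self):
        self.expectedseqnum = 1
        self.last_ack = 0           # mirrors the initial make_pkt(0,ACK,chksum)
        self.delivered = []

    def rdt_rcv(self, seq, data, corrupt=False):
        if not corrupt and seq == self.expectedseqnum:
            self.delivered.append(data)     # in-order packet: deliver upward
            self.last_ack = seq
            self.expectedseqnum += 1
        # default case: corrupt or out-of-order packet is discarded and the
        # ACK for the highest in-order packet is sent again
        return self.last_ack


rcv = GBNReceiver()
acks = [rcv.rdt_rcv(1, "a"), rcv.rdt_rcv(2, "b"),
        rcv.rdt_rcv(4, "d"),                 # gap: discarded, ack 2 repeated
        rcv.rdt_rcv(3, "c"),
        rcv.rdt_rcv(4, "d", corrupt=True)]   # corrupt: ack 3 repeated
```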


GBN in action

[Timing diagram; sender window (N=4) sliding over seq #'s 0 1 2 3 4 5 6 7 8]

sender:
  send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
  rcv ack0, send pkt4
  rcv ack1, send pkt5
  ignore duplicate ACK
  pkt 2 timeout: send pkt2, send pkt3, send pkt4, send pkt5

receiver:
  receive pkt0, send ack0
  receive pkt1, send ack1
  receive pkt3, discard, (re)send ack1
  receive pkt4, discard, (re)send ack1
  receive pkt5, discard, (re)send ack1
  rcv pkt2, deliver, send ack2
  rcv pkt3, deliver, send ack3
  rcv pkt4, deliver, send ack4
  rcv pkt5, deliver, send ack5


Selective repeat

• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #'s
  – limits seq #'s of sent, unACKed packets

Selective repeat: sender, receiver windows

[Figure: sender and receiver views of the sequence number space.]

Selective repeat

sender:
• data from above:
  – if next available seq # in window, send pkt
• timeout(n):
  – resend pkt n, restart timer
• ACK(n) in [sendbase, sendbase+N-1]:
  – mark pkt n as received
  – if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver:
• pkt n in [rcvbase, rcvbase+N-1]:
  – send ACK(n)
  – out-of-order: buffer
  – in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
• pkt n in [rcvbase-N, rcvbase-1]:
  – ACK(n)
• otherwise:
  – ignore
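The three receiver cases can be sketched as follows. The class name `SRReceiver` and its return convention (the ACK number sent, or `None` when the packet is ignored) are assumptions for illustration; sequence numbers are treated as unbounded integers, so wraparound is not modelled.

```python
class SRReceiver:
    """Selective-repeat receiver with a window of N sequence numbers."""

    def __init__(self, window_size):
        self.N = window_size
        self.rcvbase = 0
        self.buffer = {}            # out-of-order packets awaiting delivery
        self.delivered = []

    def on_packet(self, n, data):
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            # deliver this packet and any buffered in-order successors
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n                # send ACK(n)
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n                # already delivered: re-send ACK(n)
        return None                 # otherwise: ignore


sr = SRReceiver(window_size=4)
# Arrival order from the "in action" slide: pkt2 is lost and arrives last.
sr_acks = [sr.on_packet(n, f"pkt{n}") for n in (0, 1, 3, 4, 5, 2)]
```

Re-ACKing packets below the window matters because the sender's window cannot advance until it has seen an ACK for each of them.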

Selective repeat in action

[Timing diagram; sender window (N=4) sliding over seq #'s 0 1 2 3 4 5 6 7 8]

sender:
  send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
  rcv ack0, send pkt4
  rcv ack1, send pkt5
  record ack3 arrived
  pkt 2 timeout: send pkt2
  record ack4 arrived
  record ack5 arrived

receiver:
  receive pkt0, send ack0
  receive pkt1, send ack1
  receive pkt3, buffer, send ack3
  receive pkt4, buffer, send ack4
  receive pkt5, buffer, send ack5
  rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

[Figure: sender window and receiver window (after receipt) over the seq # space 0 1 2 3 0 1 2, in two scenarios]

(a) no problem
  sender sends pkt0, pkt1, pkt2; all arrive and the window advances; pkt3 is lost (X); sender then sends a new pkt0
  receiver: will accept packet with seq number 0

(b) oops!
  sender sends pkt0, pkt1, pkt2; all three ACKs are lost (X X X); sender times out and retransmits pkt0
  receiver: will accept packet with seq number 0

receiver can't see sender side; receiver behavior identical in both cases! something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
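One way to explore the question is to check when an old sender window and the receiver's advanced window share a sequence number. The sketch below is an illustration, not part of the slides; `sr_is_ambiguous` is a hypothetical helper, and the well-known answer it reproduces is that the window size must be at most half the sequence number space.

```python
def sr_is_ambiguous(seq_space, window):
    """True if a retransmitted old packet can be mistaken for a new one.

    Worst case: the receiver got all `window` packets and advanced its
    window, but every ACK was lost, so the sender may still retransmit
    any packet of the old window.
    """
    old_window = {n % seq_space for n in range(window)}
    new_window = {n % seq_space for n in range(window, 2 * window)}
    return bool(old_window & new_window)


slide_example = sr_is_ambiguous(seq_space=4, window=3)   # scenario (b) can occur
safe_example = sr_is_ambiguous(seq_space=4, window=2)    # windows never overlap
```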

TCP: Overview   RFCs: 793, 1122, 1323, 2018, 2581

• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver

TCP segment structure

[Segment layout, 32 bits wide, top to bottom:]
• source port # | dest port #
• sequence number
• acknowledgement number
• head len | not used | flags U A P R S F | receive window
• checksum | Urg data pointer
• options (variable length)
• application data (variable length)

annotations:
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• checksum: Internet checksum (as in UDP)

TCP seq. numbers, ACKs

sequence numbers:
  – byte stream "number" of first byte in segment's data

acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK

Q: how receiver handles out-of-order segments
  – A: TCP spec doesn't say, up to implementor

[Figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number (A), rwnd, checksum, urg pointer. The sender sequence number space is divided into: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable. Window size: N.]

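Byte stream in TCP

[Figure: an HTTP GET message (K bytes) in the sender's byte stream, with a window of N bytes. A segment of M bytes starting at the 100th byte is sent with TCP header (seq. no. = 100); the part of the message outside the window cannot be transmitted now.]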

TCP seq. numbers, ACKs

simple telnet scenario (Host A, Host B):
1. User types 'C'. Host A sends: Seq=42, ACK=79, data = 'C'
2. Host B ACKs receipt of 'C', echoes back 'C': Seq=79, ACK=43, data = 'C'
3. Host A ACKs receipt of echoed 'C': Seq=43, ACK=80

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT

TCP round trip time, timeout

$EstimatedRTT = (1-\alpha) \cdot EstimatedRTT + \alpha \cdot SampleRTT$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Figure: "RTT: gaia.cs.umass.edu to fantasia.eurecom.fr", plotting SampleRTT and EstimatedRTT; x-axis: time (seconds), y-axis: RTT (milliseconds), 100 to 350.]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

  $DevRTT = (1-\beta) \cdot DevRTT + \beta \cdot |SampleRTT - EstimatedRTT|$
  (typically, β = 0.25)

  $TimeoutInterval = EstimatedRTT + 4 \cdot DevRTT$
  (estimated RTT plus "safety margin")
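The two moving averages and the timeout formula combine into one small update step. This is a sketch: the function name is hypothetical, and the update order (DevRTT computed from the EstimatedRTT value before that value is itself updated) is an assumption borrowed from RFC 6298, since the slides do not specify an order.

```python
ALPHA = 0.125   # typical weight for EstimatedRTT
BETA = 0.25     # typical weight for DevRTT


def update_rtt(estimated_rtt, dev_rtt, sample_rtt):
    """Return (EstimatedRTT, DevRTT, TimeoutInterval) after one new sample.

    Assumption: DevRTT uses the old EstimatedRTT, as in RFC 6298.
    All values share one time unit (milliseconds in the example).
    """
    dev_rtt = (1 - BETA) * dev_rtt + BETA * abs(sample_rtt - estimated_rtt)
    estimated_rtt = (1 - ALPHA) * estimated_rtt + ALPHA * sample_rtt
    timeout_interval = estimated_rtt + 4 * dev_rtt
    return estimated_rtt, dev_rtt, timeout_interval


# One sample of 120 ms against an estimate of 100 ms and deviation of 10 ms.
est, dev, timeout = update_rtt(100.0, 10.0, 120.0)
```

A single outlier moves the estimate only slightly (100 to 102.5 ms), while the deviation term widens the safety margin.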

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks

let's initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control

TCP sender events

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unACKed segment
  – expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unACKed segments
  – update what is known to be ACKed
  – start timer if there are still unACKed segments

TCP sender (simplified)

single state "wait for event"; initially (Λ): NextSeqNum = InitialSeqNum; SendBase = InitialSeqNum

• data received from application above:
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
        start timer

• timeout:
    retransmit not-yet-acked segment with smallest seq. #
    start timer

• ACK received, with ACK field value y:
    if (y > SendBase) {
        SendBase = y
        /* SendBase–1: last cumulatively ACKed byte */
        if (there are currently not-yet-acked segments)
            start timer
        else stop timer
    }
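The three events can be written out as a small class. This is a sketch under assumptions: `SimpleTCPSender`, the `send` callback (standing in for handing a segment to IP), and the boolean timer flag are hypothetical, and segments are modelled as `(seq, payload)` tuples.

```python
class SimpleTCPSender:
    """Simplified TCP sender: no duplicate-ACK handling, no flow or
    congestion control, a single retransmission timer."""

    def __init__(self, initial_seq, send):
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.send = send            # hypothetical "pass segment to IP"
        self.unacked = []           # (seq, payload) in sending order
        self.timer_running = False

    def on_app_data(self, data):
        segment = (self.next_seq, data)
        self.unacked.append(segment)
        self.send(segment)
        self.next_seq += len(data)  # seq #s count bytes, not segments
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if self.unacked:
            self.send(self.unacked[0])   # smallest not-yet-acked seq #
            self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y           # bytes up to y-1 are now ACKed
            self.unacked = [(s, d) for s, d in self.unacked if s + len(d) > y]
            self.timer_running = bool(self.unacked)


wire = []
tcp = SimpleTCPSender(initial_seq=92, send=wire.append)
tcp.on_app_data(b"x" * 8)     # Seq=92, 8 bytes
tcp.on_app_data(b"y" * 20)    # Seq=100, 20 bytes
tcp.on_ack(100)               # first segment acknowledged
tcp.on_timeout()              # retransmits the Seq=100 segment
```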

TCP: retransmission scenarios

[Three Host A / Host B timing diagrams]

lost ACK scenario:
  A: Seq=92, 8 bytes of data
  B: ACK=100 (lost, X)
  A: timeout; resends Seq=92, 8 bytes of data
  B: ACK=100

premature timeout:
  A: Seq=92, 8 bytes of data; Seq=100, 20 bytes of data (SendBase=92)
  B: ACK=100; ACK=120
  A: timeout before ACKs arrive; resends Seq=92, 8 bytes of data
  A: ACKs arrive: SendBase=100, then SendBase=120
  B: ACK=120 (SendBase stays at 120)

cumulative ACK:
  A: Seq=92, 8 bytes of data; Seq=100, 20 bytes of data
  B: ACK=100 (lost, X); ACK=120
  A: ACK=120 arrives before timeout; sends Seq=120, 15 bytes of data

TCP ACK generation [RFC 5681]

event at receiver → TCP receiver action

• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected
  → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap
  → immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unACKed segment with smallest seq #
  – likely that unACKed segment lost, so don't wait for timeout

[Timing diagram, Host A and Host B: fast retransmit after sender receipt of triple duplicate ACK]
  A: Seq=92, 8 bytes of data; Seq=100, 20 bytes of data (lost, X); further segments
  B: ACK=100, then ACK=100, ACK=100, ACK=100 (3 DUP ACKs)
  A: resends Seq=100, 20 bytes of data before the timeout expires

TCP flow control

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)

[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code (OS) into the TCP socket receiver buffers, from which the application process reads.]

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unACKed ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); data leaves to the application process.]

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

[Figure: client and server, each with application and network layers, each holding:]
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client

client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

[Timing diagram; client state starts at CLOSED, server state at LISTEN]
1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client state: SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server state: SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client state: ESTAB
4. server: received ACK(y) indicates client is live; server state: ESTAB

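TCP 3-way handshake: FSM

states: closed, listen, SYN rcvd, SYN sent, ESTAB

server side:
• closed → listen: Socket connectionSocket = welcomeSocket.accept(); Λ
• listen → SYN rcvd: on SYN(x), send SYNACK(seq=y,ACKnum=x+1); create new socket for communication back to client
• SYN rcvd → ESTAB: on ACK(ACKnum=y+1); Λ

client side:
• closed → SYN sent: Socket clientSocket = new Socket("hostname", "port number"); send SYN(seq=x)
• SYN sent → ESTAB: on SYNACK(seq=y,ACKnum=x+1), send ACK(ACKnum=y+1)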

TCP: closing a connection

• client, server each close their side of connection
  – send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

[Timing diagram; client state and server state both start at ESTAB]
1. client: clientSocket.close(); sends FINbit=1, seq=x; client state: FIN_WAIT_1 (can no longer send but can receive data)
2. server: sends ACKbit=1, ACKnum=x+1; server state: CLOSE_WAIT (can still send data)
3. client state: FIN_WAIT_2 (wait for server close)
4. server: sends FINbit=1, seq=y; server state: LAST_ACK (can no longer send data)
5. client: sends ACKbit=1, ACKnum=y+1; client state: TIMED_WAIT (timed wait for 2*max segment lifetime), then CLOSED
6. server: on receiving the ACK, server state: CLOSED

The "Two Army Problem"

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
• a top-10 problem!

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity

[Figure: Host A sends original data at rate λ_in through a router with unlimited shared output link buffers; Host B receives throughput λ_out. Plots: λ_out vs. λ_in (axes up to R/2) and delay vs. λ_in (up to R/2).]

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: λ_in = λ_out
  – transport-layer input includes retransmissions: λ'_in ≥ λ_in

[Figure: Host A offers λ_in (original data) and λ'_in (original data, plus retransmitted data) to a router with finite shared output link buffers; Host B receives λ_out.]

idealization: perfect knowledge
• sender sends only when router buffers available
[Figure: a copy is kept at the sender; packets are sent while there is free buffer space and held when there is no buffer space. Plot: λ_out vs. λ_in, axes up to R/2.]

idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
[Plot: λ_out vs. λ'_in, axes up to R/2.]
when sending at R/2, some packets are retransmissions but asymptotic goodput is still R/2 (why?)

realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
[Plot: λ_out vs. λ'_in, axes up to R/2.]
when sending at R/2, some packets are retransmissions, including duplicates that are delivered!

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  – decreasing goodput

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

[Figure: Hosts A, B, C, D connected over multihop paths through routers with finite shared output link buffers; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out.]

[Plot: throughput by blue traffic (λ_out) vs. offered load by Host A (λ'_in), axes up to R/2; annotation: bandwidth wastage for packets dropped at the 2nd router.]

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss

[Plot: cwnd (TCP sender congestion window size) vs. time. AIMD saw tooth behavior, probing for bandwidth: additively increase window size … until loss occurs (then cut window in half).]

TCP Congestion Control: details

• sender limits transmission:

  LastByteSent - LastByteAcked ≤ cwnd

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

  rate ≈ cwnd / RTT  bytes/sec

[Figure: sender sequence number space showing last byte ACKed, bytes sent but not-yet ACKed ("in-flight"), and last byte sent; the in-flight span is bounded by cwnd.]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Timing diagram, Host A and Host B: one segment in the first RTT, two segments in the second, four segments in the third.]

TCP: detecting, reacting to loss

• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

[FSM with three states; initially (Λ): cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0; start in slow start]

state "slow start":
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: Λ; go to congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment

state "congestion avoidance":
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start

state "fast recovery":
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment; go to slow start
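The FSM's window arithmetic can be traced in code. The sketch below is illustrative only: the class name `RenoSender` and its method names are hypothetical, cwnd and ssthresh are kept in units of MSS rather than bytes (an assumption, so "+ 3" means 3 MSS), and segment transmission is left out so that only the state and window updates remain.

```python
class RenoSender:
    """Window bookkeeping for the slow start / congestion avoidance /
    fast recovery FSM. cwnd and ssthresh are in units of MSS (assumption)."""

    def __init__(self, ssthresh=64.0):
        self.state = "slow start"
        self.cwnd = 1.0
        self.ssthresh = ssthresh
        self.dup_ack_count = 0

    def on_new_ack(self):
        if self.state == "slow start":
            self.cwnd += 1.0                    # +1 MSS per ACK: doubles per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion avoidance"
        elif self.state == "congestion avoidance":
            self.cwnd += 1.0 / self.cwnd        # about +1 MSS per RTT
        else:                                   # fast recovery
            self.cwnd = self.ssthresh
            self.state = "congestion avoidance"
        self.dup_ack_count = 0

    def on_duplicate_ack(self):
        if self.state == "fast recovery":
            self.cwnd += 1.0
            return
        self.dup_ack_count += 1
        if self.dup_ack_count == 3:             # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3
            self.state = "fast recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1.0
        self.dup_ack_count = 0
        self.state = "slow start"


reno = RenoSender(ssthresh=4.0)
for _ in range(4):
    reno.on_new_ack()          # cwnd: 2, 3, 4, 5; leaves slow start at 5
for _ in range(3):
    reno.on_duplicate_ack()    # third duplicate triggers fast recovery
```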

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is 3/4 W
  – avg. throughput is 3/4 W per RTT

  avg TCP throughput = (3/4) · W / RTT  bytes/sec

[Plot: saw tooth window oscillating between W/2 and W, with average 3/4 W.]

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

  $\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \sqrt{L}}$

  to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
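The numbers in this example follow from the formula with a little arithmetic. The helpers below are a sketch with hypothetical names; MSS is converted from bytes to bits so that throughput comes out in bits per second.

```python
import math


def window_segments(throughput_bps, rtt_s, mss_bytes):
    """Segments that must be in flight to sustain the given throughput."""
    return throughput_bps * rtt_s / (mss_bytes * 8)


def mathis_throughput_bps(mss_bytes, rtt_s, loss_prob):
    """TCP throughput = 1.22 * MSS / (RTT * sqrt(L)), in bits per second."""
    return 1.22 * mss_bytes * 8 / (rtt_s * math.sqrt(loss_prob))


def loss_for_throughput(throughput_bps, rtt_s, mss_bytes):
    """Invert the formula: the loss probability that yields the throughput."""
    return (1.22 * mss_bytes * 8 / (rtt_s * throughput_bps)) ** 2


# Slide example: 1500-byte segments, 100 ms RTT, 10 Gbps target.
w = window_segments(10e9, 0.100, 1500)         # about 83,333 segments
loss = loss_for_throughput(10e9, 0.100, 1500)  # about 2e-10
```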

TCP Fairness

fairness goal if K TCP sessions share same bottleneck link of bandwidth R each should have average rate of RK

104

TCP connection 1

bottleneckroutercapacity RTCP connection 2

Why is TCP fair

two competing sessionsbull additive increase gives slope of 1 as throughout increasesbull multiplicative decrease decreases throughput proportionally

105

[Figure: Connection 2 throughput plotted against Connection 1 throughput, both axes from 0 to R. The "equal bandwidth share" line is the set of points (X2, Y2) where X2 = Y2; the "full bandwidth utilization" line is the set of points (X1, Y1) where X1 + Y1 = R. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2".]
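The convergence argument can be tried out numerically. The sketch below is an idealized model, not real TCP: both flows are assumed to see every loss at the same time, rates change once per round, and the function name and starting rates are invented for illustration.

```python
def aimd_two_flows(x1, x2, capacity, rounds):
    """Two synchronized AIMD flows sharing one bottleneck link."""
    for _ in range(rounds):
        if x1 + x2 > capacity:
            # loss: multiplicative decrease halves both rates,
            # which also halves the gap between them
            x1, x2 = x1 / 2, x2 / 2
        else:
            # no loss: additive increase keeps the gap unchanged
            x1, x2 = x1 + 1, x2 + 1
    return x1, x2


start_gap = abs(10 - 70)
a, b = aimd_two_flows(10, 70, capacity=100, rounds=200)
print(start_gap, abs(a - b))
```

Each multiplicative decrease halves the difference between the two rates while additive increase leaves it alone, so the pair drifts toward the equal bandwidth share line.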

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2

106

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and is marked ECN=11 by a congested router; the destination returns a TCP ACK segment with ECE=1.]

Page 23: ChapterIII: Transport Layer

rdt2.1 Example 1

23

Wait for call 0 from above

rdt_send(data)

sndpkt = make_pkt(0, data, checksum)
udt_send(sndpkt)

Wait for ACK or NAK 0

Wait for call 1 from above

Wait for ACK or NAK 1

Wait for 0 from below

Wait for 1 from below

rdt2.1 Example 1

24

Wait for 0 from below

Wait for 1 from below

rdt_rcv(rcvpkt) && corrupt(rcvpkt)

sndpkt = make_pkt(NAK, chksum)
udt_send(sndpkt)

Wait for call 0 from above

Wait for ACK or NAK 0

Wait for call 1 from above

Wait for ACK or NAK 1

rdt2.1 Example 1

25

Wait for 0 from below

Wait for 1 from below

Wait for call 0 from above

Wait for ACK or NAK 0

Wait for call 1 from above

Wait for ACK or NAK 1

rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))

udt_send(sndpkt)

rdt2.1 Example 1

26

Wait for call 0 from above

Wait for ACK or NAK 0

Wait for call 1 from above

Wait for ACK or NAK 1

Wait for 0 from below

Wait for 1 from below

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)

extract(rcvpkt, data)
deliver_data(data)
sndpkt = make_pkt(ACK, chksum)
udt_send(sndpkt)

rdt2.1 Example 1

27

Wait for call 0 from above

Wait for ACK or NAK 0

Wait for call 1 from above

Wait for ACK or NAK 1

Wait for 0 from below

Wait for 1 from below

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)

Λ

rdt2.1 Example 1

28

Wait for call 0 from above

Wait for ACK or NAK 0

Wait for call 1 from above

Wait for ACK or NAK 1

Wait for 0 from below

Wait for 1 from below

rdt2.1 Example 2

29

Wait for call 0 from above

Wait for ACK or NAK 0

Wait for call 1 from above

Wait for ACK or NAK 1

Wait for 0 from below

Wait for 1 from below

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)

extract(rcvpkt, data)
deliver_data(data)
sndpkt = make_pkt(ACK, chksum)
udt_send(sndpkt)

rdt2.1 Example 2

30

Wait for call 0 from above

Wait for ACK or NAK 0

Wait for call 1 from above

Wait for ACK or NAK 1

Wait for 0 from below

Wait for 1 from below

rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))

udt_send(sndpkt)

rdt2.1 Example 2

31

Wait for call 0 from above

Wait for ACK or NAK 0

Wait for call 1 from above

Wait for ACK or NAK 1

Wait for 0 from below

Wait for 1 from below

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)

sndpkt = make_pkt(ACK, chksum)
udt_send(sndpkt)

rdt2.1 Example 2

32

Wait for call 0 from above

Wait for ACK or NAK 0

Wait for call 1 from above

Wait for ACK or NAK 1

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)

Λ

Wait for 0 from below

Wait for 1 from below

rdt2.1 Example 2

33

Wait for call 0 from above

Wait for ACK or NAK 0

Wait for call 1 from above

Wait for ACK or NAK 1

Wait for 0 from below

Wait for 1 from below

rdt2.1: discussion

sender:
• seq # added to pkt
• two seq. #'s (0, 1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  - state must "remember" whether "expected" pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  - state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender

34

rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  - receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt

35

rdt2.2: sender, receiver fragments

36

sender FSM fragment

Wait for call 0 from above
• rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to Wait for ACK 0

Wait for ACK 0
• rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)): udt_send(sndpkt)
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0): Λ

receiver FSM fragment

transition into Wait for 0 from below
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)

Wait for 0 from below
• rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)): udt_send(sndpkt)

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
  - checksum, seq. #, ACKs, retransmissions will be of help ... but not enough

approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  - retransmission will be duplicate, but seq. #'s already handles this
  - receiver must specify seq # of pkt being ACKed
• requires countdown timer

37

rdt3.0 sender

38

Wait for call 0 from above
• rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer; go to Wait for ACK0
• rdt_rcv(rcvpkt): Λ

Wait for ACK0
• rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)): Λ
• timeout: udt_send(sndpkt); start_timer
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0): stop_timer; go to Wait for call 1 from above

Wait for call 1 from above
• rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer; go to Wait for ACK1
• rdt_rcv(rcvpkt): Λ

Wait for ACK1
• rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,0)): Λ
• timeout: udt_send(sndpkt); start_timer
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1): stop_timer; go to Wait for call 0 from above

rdt3.0 in action

39

(a) no loss
• sender: send pkt0
• receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1
• receiver: rcv pkt1, send ack1
• sender: rcv ack1, send pkt0
• receiver: rcv pkt0, send ack0

(b) packet loss
• sender: send pkt0
• receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1 (pkt1 lost, X)
• sender: timeout, resend pkt1
• receiver: rcv pkt1, send ack1
• sender: rcv ack1, send pkt0
• receiver: rcv pkt0, send ack0

rdt3.0 in action

40

(c) ACK loss
• sender: send pkt0
• receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1
• receiver: rcv pkt1, send ack1 (ack1 lost, X)
• sender: timeout, resend pkt1
• receiver: rcv pkt1 (detect duplicate), send ack1
• sender: rcv ack1, send pkt0
• receiver: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
• sender: send pkt0
• receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1
• receiver: rcv pkt1, send ack1 (ack1 delayed)
• sender: timeout, resend pkt1
• receiver: rcv pkt1 (detect duplicate), send ack1
• sender: rcv ack1, send pkt0
• sender: rcv ack1, do nothing
• receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1
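The loss scenarios can be replayed with a small simulation. This is a simplified sketch rather than the full rdt3.0 FSM: corruption and premature timeouts are left out, a lost packet or ACK is modelled as an immediate timeout, and `rdt30_trace`, `lost_data` and `lost_acks` are names invented here.

```python
def rdt30_trace(num_pkts, lost_data=(), lost_acks=()):
    """Alternating-bit transfer of num_pkts packets.

    lost_data / lost_acks hold the 0-based indices of the data and ACK
    transmissions that the channel drops.
    """
    events, delivered = [], []
    expected = 0                  # receiver: next expected sequence bit
    data_tx = ack_tx = 0          # transmissions attempted so far
    for i in range(num_pkts):
        seq = i % 2
        while True:
            events.append(f"send pkt{seq}")
            dropped = data_tx in lost_data
            data_tx += 1
            if dropped:
                events.append("timeout")
                continue          # retransmit the same packet
            if seq == expected:   # new data: deliver and flip expected bit
                delivered.append(i)
                expected ^= 1
                events.append(f"rcv pkt{seq}, send ack{seq}")
            else:                 # duplicate: re-ACK but do not deliver again
                events.append(f"rcv pkt{seq} (duplicate), send ack{seq}")
            dropped = ack_tx in lost_acks
            ack_tx += 1
            if dropped:
                events.append("timeout")
                continue
            events.append(f"rcv ack{seq}")
            break
    return events, delivered


events, delivered = rdt30_trace(3, lost_acks={1})
print("\n".join(events))
```

With the second ACK dropped, the trace shows the sender timing out, the receiver detecting the duplicate pkt1 and re-ACKing it, and each packet still being delivered exactly once.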

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet

41

$D_{trans} = \frac{L}{R} = \frac{8000 \text{ bits}}{10^9 \text{ bits/sec}} = 8 \text{ microsecs}$

• U sender: utilization, the fraction of time sender busy sending

$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$

• if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!

rdt3.0: stop-and-wait operation

42

[Figure: sender/receiver timing diagram]
• first packet bit transmitted, t = 0
• last packet bit transmitted, t = L / R
• first packet bit arrives
• last packet bit arrives, send ACK
• ACK arrives, send next packet, t = RTT + L / R

$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$

Pipelined protocols

pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver

43

• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

44

[Figure: sender/receiver timing diagram]
• first packet bit transmitted, t = 0
• last bit transmitted, t = L / R
• first packet bit arrives
• last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• ACK arrives, send next packet, t = RTT + L / R

3-packet pipelining increases utilization by a factor of 3!

$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$
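The utilization figures for stop-and-wait and for 3-packet pipelining follow from the same expression. A minimal sketch, where the function name and the `window_pkts` parameter are illustrative and the result is capped at 1 because a sender cannot be busy more than all of the time:

```python
def sender_utilization(pkt_bits, rate_bps, rtt_s, window_pkts=1):
    """Fraction of time the sender is busy with window_pkts packets per RTT."""
    d_trans = pkt_bits / rate_bps                 # L / R
    return min(1.0, window_pkts * d_trans / (rtt_s + d_trans))


# Slide example: 8000 bit packet, 1 Gbps link, RTT = 30 ms.
stop_and_wait = sender_utilization(8000, 1e9, 0.030)
pipelined = sender_utilization(8000, 1e9, 0.030, window_pkts=3)
print(stop_and_wait, pipelined)
```

Stop-and-wait comes out near 0.00027 and the 3-packet window at three times that, about 0.0008.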

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet

45

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N, consecutive unacked pkts allowed

46

• ACK(n): ACKs all pkts up to, including seq # n - "cumulative ACK"
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

47

initial:
  base = 1
  nextseqnum = 1

state: Wait

rdt_send(data):
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)

timeout:
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  ...
  udt_send(sndpkt[nextseqnum-1])

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer

rdt_rcv(rcvpkt) && corrupt(rcvpkt):
  Λ
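The sender FSM maps fairly directly onto a small class. The sketch below makes several simplifying assumptions: the timer is a boolean flag, checksums are omitted, `send` is a hypothetical callback standing in for udt_send, and ACKs below `base` are ignored as duplicates (a guard the FSM text does not spell out).

```python
class GBNSender:
    def __init__(self, window_size, send):
        self.N = window_size
        self.send = send              # hypothetical channel: send(seq, data)
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}              # seq -> data, kept until ACKed
        self.timer_running = False

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False              # refuse_data: window is full
        self.sndpkt[self.nextseqnum] = data
        self.send(self.nextseqnum, data)
        if self.base == self.nextseqnum:
            self.timer_running = True             # start_timer
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        if acknum < self.base:
            return                    # duplicate ACK: nothing new is covered
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)            # cumulative ACK
        self.base = acknum + 1
        # stop_timer if nothing is in flight, otherwise restart it
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.send(seq, self.sndpkt[seq])      # go back N


sent = []
sender = GBNSender(4, lambda seq, data: sent.append(seq))
accepted = [sender.rdt_send(f"d{i}") for i in range(5)]
sender.on_ack(2)
sender.on_timeout()
print(accepted, sent)
```

In this usage the fifth call is refused because the window of 4 is full, and the timeout after ACK 2 resends only packets 3 and 4.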


GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #

49

initial (Λ):
  expectedseqnum = 1
  sndpkt = make_pkt(0, ACK, chksum)

state: Wait

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++

default:
  udt_send(sndpkt)
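A matching receiver needs very little state. In this sketch the class and method names are illustrative, corruption is passed in as a flag instead of being computed from a checksum, and the return value stands for the ACK number carried by the packet handed to udt_send.

```python
class GBNReceiver:
    def __init__(self):
        self.expectedseqnum = 1
        self.last_ack = 0             # mirrors the initial make_pkt(0, ACK, chksum)
        self.delivered = []

    def rdt_rcv(self, seq, data, corrupt=False):
        """Handle one arriving packet and return the ACK number sent back."""
        if not corrupt and seq == self.expectedseqnum:
            self.delivered.append(data)           # deliver_data
            self.last_ack = self.expectedseqnum
            self.expectedseqnum += 1
        # default case: out-of-order or corrupt packets are discarded and
        # the last in-order ACK is sent again
        return self.last_ack


receiver = GBNReceiver()
acks = [receiver.rdt_rcv(seq, f"d{seq}") for seq in (1, 2, 4, 5, 3)]
print(acks, receiver.delivered)
```

Packets 4 and 5 arrive while 3 is still missing, so both are dropped and answered with a repeat of ACK 2.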


GBN in action

51

sender window (N=4), sequence numbers 0 1 2 3 4 5 6 7 8

• sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
• receiver: receive pkt0, send ack0
• receiver: receive pkt1, send ack1
• receiver: receive pkt3, discard, (re)send ack1
• sender: rcv ack0, send pkt4
• sender: rcv ack1, send pkt5
• sender: ignore duplicate ACK
• receiver: receive pkt4, discard, (re)send ack1
• receiver: receive pkt5, discard, (re)send ack1
• sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
• receiver: rcv pkt2, deliver, send ack2
• receiver: rcv pkt3, deliver, send ack3
• receiver: rcv pkt4, deliver, send ack4
• receiver: rcv pkt5, deliver, send ack5


Selective repeat

• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #s of sent, unACKed packets

53

Selective repeat: sender, receiver windows

54

Selective repeat

55

sender

data from above:
• if next available seq # in window, send pkt

timeout(n):
• resend pkt n, restart timer

ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver

pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt

pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)

otherwise:
• ignore
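The three receiver cases can be sketched as a single method. Assumptions made here for brevity: sequence numbers are plain non-negative integers that never wrap around, and `SRReceiver` and `on_packet` are illustrative names.

```python
class SRReceiver:
    def __init__(self, window_size):
        self.N = window_size
        self.rcv_base = 0
        self.buffer = {}              # out-of-order packets awaiting delivery
        self.delivered = []

    def on_packet(self, n, data):
        """Return the seq # to ACK, or None if the packet is ignored."""
        if self.rcv_base <= n < self.rcv_base + self.N:
            self.buffer[n] = data
            # deliver this packet plus any buffered in-order successors,
            # advancing the window to the next not-yet-received packet
            while self.rcv_base in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcv_base))
                self.rcv_base += 1
            return n
        if self.rcv_base - self.N <= n < self.rcv_base:
            return n                  # already delivered: ACK again
        return None                   # otherwise: ignore


rcv = SRReceiver(4)
acks = [rcv.on_packet(n, f"d{n}") for n in (0, 1, 3, 4, 5, 2)]
print(acks, rcv.delivered)
```

Packets 3, 4 and 5 are ACKed individually and buffered, then handed up together once the retransmitted packet 2 arrives.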

Selective repeat in action

56

sender window (N=4), sequence numbers 0 1 2 3 4 5 6 7 8

• sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
• receiver: receive pkt0, send ack0
• receiver: receive pkt1, send ack1
• receiver: receive pkt3, buffer, send ack3
• sender: rcv ack0, send pkt4
• sender: rcv ack1, send pkt5
• sender: record ack3 arrived
• receiver: receive pkt4, buffer, send ack4
• receiver: receive pkt5, buffer, send ack5
• sender: pkt 2 timeout, send pkt2
• sender: record ack4 arrived
• sender: record ack5 arrived
• receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

[Figure (a) no problem: sender sends pkt0, pkt1, pkt2 and the ACKs arrive; the sender window (after receipt) and the receiver window (after receipt) both advance; the sender sends pkt3, which is lost (X), and then a new pkt0; the receiver will accept packet with seq number 0.]

[Figure (b) oops!: sender sends pkt0, pkt1, pkt2 but all three ACKs are lost (X X X); the receiver window advances anyway; on timeout the sender retransmits pkt0; the receiver will accept packet with seq number 0.]

receiver can't see sender side. receiver behavior identical in both cases! something's (very) wrong!

• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
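One way to explore the question is to check, for a given sequence-number space and window size, whether the receiver's advanced window can overlap the packets the sender may still retransmit. The sketch below assumes the worst case from scenario (b), where the receiver has moved forward by a full window while no ACK reached the sender; `sr_window_is_safe` is an illustrative name.

```python
def sr_window_is_safe(seq_space, window):
    """True if a retransmission can never be mistaken for new data."""
    # seq #s the sender may still retransmit (its window has not moved)
    old = {n % seq_space for n in range(0, window)}
    # seq #s the receiver now accepts as new (its window moved by `window`)
    new = {n % seq_space for n in range(window, 2 * window)}
    return old.isdisjoint(new)


print(sr_window_is_safe(4, 3), sr_window_is_safe(4, 2))
```

For the slide's example (4 sequence numbers, window 3) the sets overlap, while a window of 2 is safe; under this model the check passes exactly when the window is at most half the sequence-number space.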

58

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver

59

TCP segment structure

60

[Figure: TCP segment layout, 32 bits wide]
• source port #, dest port #
• sequence number
• acknowledgement number
• head len, not used, flag bits U A P R S F, receive window
• checksum, Urg data pointer
• options (variable length)
• application data (variable length)

field notes:
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)

TCP seq. numbers, ACKs

sequence numbers:
  - byte stream "number" of first byte in segment's data

acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK

Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say, up to implementor

61

[Figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer; the incoming segment's acknowledgement number is A]

[Figure: sender sequence number space with window size N: sent ACKed | sent, not-yet ACKed ("in-flight") | usable but not yet sent | not usable]

Byte stream in TCP

62

Window N bytes

HTTP Get Message (K bytes)

100th byte

TCP header(seq no = 100)

M bytes

HTTP Get Message (K bytes)

Cannot be transmitted now

TCP seq. numbers, ACKs

63

simple telnet scenario (Host A, Host B)

• User types 'C'. Host A sends: Seq=42, ACK=79, data = 'C'
• host ACKs receipt of 'C', echoes back 'C'. Host B sends: Seq=79, ACK=43, data = 'C'
• host ACKs receipt of echoed 'C'. Host A sends: Seq=43, ACK=80

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT

64

TCP round trip time, timeout

65

$EstimatedRTT = (1 - \alpha) \cdot EstimatedRTT + \alpha \cdot SampleRTT$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: $\alpha = 0.125$

[Figure: RTT, gaia.cs.umass.edu to fantasia.eurecom.fr. x-axis: time (seconds), 1 to 106; y-axis: RTT (milliseconds), 100 to 350; series: SampleRTT, EstimatedRTT]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

66

$DevRTT = (1 - \beta) \cdot DevRTT + \beta \cdot |SampleRTT - EstimatedRTT|$

(typically, $\beta = 0.25$)

$TimeoutInterval = EstimatedRTT + 4 \cdot DevRTT$

(EstimatedRTT is the estimated RTT; $4 \cdot DevRTT$ is the "safety margin")
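The three formulas combine into a small estimator. Two details below are assumptions rather than something the slides state: the first sample initializes EstimatedRTT with DevRTT set to half of it, and DevRTT is updated before EstimatedRTT so that it uses the previous estimate (both choices follow RFC 6298). The class name is illustrative.

```python
class RttEstimator:
    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2     # assumed initial value

    def update(self, sample_rtt):
        # DevRTT first, using the EstimatedRTT from before this sample
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)      # milliseconds
est.update(200.0)
print(est.estimated_rtt, est.dev_rtt, est.timeout_interval())
```

A single 200 ms sample after a 100 ms start moves the estimate only to 112.5 ms, but widens the safety margin enough to push the timeout to 362.5 ms.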

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks

67

let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control

TCP sender events:

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments

68

TCP sender (simplified)

69

initial (Λ):
  NextSeqNum = InitialSeqNum
  SendBase = InitialSeqNum

state: wait for event

data received from application above:
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., "send")
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
    start timer

timeout:
  retransmit not-yet-acked segment with smallest seq. #
  start timer

ACK received, with ACK field value y:
  if (y > SendBase) {
    SendBase = y
    /* SendBase-1: last cumulatively ACKed byte */
    if (there are currently not-yet-acked segments)
      start timer
    else
      stop timer
  }
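The three events of the simplified sender translate into three methods. This is a sketch under stated assumptions: the timer is a boolean flag, `send` is a hypothetical callback standing in for passing a segment to IP, and flow control, congestion control and duplicate ACKs are ignored as on the slide.

```python
class SimpleTcpSender:
    def __init__(self, initial_seq, send):
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.send = send              # hypothetical: send(seq, data)
        self.unacked = {}             # seq of first byte -> segment data
        self.timer_running = False

    def on_app_data(self, data):
        self.unacked[self.next_seq] = data
        self.send(self.next_seq, data)
        self.next_seq += len(data)
        self.timer_running = True     # start timer if not already running

    def on_timeout(self):
        if not self.unacked:
            return
        seq = min(self.unacked)       # not-yet-acked segment with smallest seq #
        self.send(seq, self.unacked[seq])
        self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y        # y - 1 is the last cumulatively ACKed byte
            for seq in [s for s, d in self.unacked.items() if s + len(d) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)


sent = []
tcp = SimpleTcpSender(92, lambda seq, data: sent.append((seq, len(data))))
tcp.on_app_data(b"x" * 8)     # Seq=92, 8 bytes of data
tcp.on_app_data(b"y" * 20)    # Seq=100, 20 bytes of data
tcp.on_timeout()              # resend only the oldest unacked segment
tcp.on_ack(120)               # cumulative ACK covers both segments
print(sent, tcp.send_base, tcp.timer_running)
```

Unlike Go-Back-N, the timeout resends just the segment starting at byte 92, and the single cumulative ACK=120 clears both segments and stops the timer.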

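TCP: retransmission scenarios

70

lost ACK scenario (Host A, Host B)
• A sends Seq=92, 8 bytes of data
• B sends ACK=100, which is lost (X)
• A times out and resends Seq=92, 8 bytes of data
• B sends ACK=100

premature timeout (Host A, Host B)
• A sends Seq=92, 8 bytes of data (SendBase=92), then Seq=100, 20 bytes of data
• B sends ACK=100, then ACK=120
• A times out before ACK=100 arrives and resends Seq=92, 8 bytes of data
• ACK=100 arrives (SendBase=100), then ACK=120 (SendBase=120)
• B answers the retransmitted segment with ACK=120 (SendBase=120)

TCP: retransmission scenarios

71

cumulative ACK (Host A, Host B)
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
• B sends ACK=100, which is lost (X), then ACK=120
• ACK=120 arrives before the timeout, so A does not retransmit and next sends Seq=120, 15 bytes of data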

TCP ACK generation [RFC 5681]

72

event at receiver → TCP receiver action

• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment, higher-than-expect seq #; gap detected
  → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap
  → immediate send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs

73

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout

TCP fast retransmit

74

fast retransmit after sender receipt of triple duplicate ACK (Host A, Host B)
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost (X)
• B sends ACK=100 four times
• on the 3 DUP ACKs, before the timeout expires, A resends Seq=100, 20 bytes of data
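Duplicate-ACK counting takes only a few lines. The sketch below assumes the interpretation used in the figure, where the first ACK for a value is followed by three duplicates; `fast_retransmit_points` is an illustrative name.

```python
def fast_retransmit_points(acks):
    """Return the ACK values that trigger a fast retransmit."""
    triggers = []
    last_ack, dup_count = None, 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == 3:        # triple duplicate ACK
                triggers.append(ack)  # resend the segment starting at `ack`
        else:
            last_ack, dup_count = ack, 0
    return triggers


print(fast_retransmit_points([100, 100, 100, 100, 120]))
```

The ACK stream from the figure (ACK=100 four times) triggers exactly one retransmission of the segment starting at byte 100.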

TCP flow control

75

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code into the TCP socket receiver buffers (OS), from which the application process reads.]

application may remove data from TCP socket buffers ... slower than TCP receiver is delivering (sender is sending)

TCP flow control

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

76

[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); buffered data goes to application process.]

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

77

[Figure: two hosts, each with an application above the network and the same connection information]
connection state: ESTAB
connection variables:
  - seq # client-to-server, server-to-client
  - rcvBuffer size at server, client

Socket clientSocket = new Socket("hostname", "port number");

Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

80

client state: CLOSED; server state: LISTEN

• client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client state: SYNSENT
• server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server state: SYN RCVD
• client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client state: ESTAB
• server: received ACK(y) indicates client is live; server state: ESTAB

TCP 3-way handshake: FSM

81

states: closed, listen, SYN sent, SYN rcvd, ESTAB

• closed → listen: Socket connectionSocket = welcomeSocket.accept(); / Λ
• closed → SYN sent: Socket clientSocket = new Socket("hostname", "port number"); / SYN(seq=x)
• listen → SYN rcvd: SYN(x) / SYNACK(seq=y, ACKnum=x+1), create new socket for communication back to client
• SYN sent → ESTAB: SYNACK(seq=y, ACKnum=x+1) / ACK(ACKnum=y+1)
• SYN rcvd → ESTAB: ACK(ACKnum=y+1) / Λ

TCP: closing a connection

• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

82

TCP: closing a connection

83

client state: ESTAB; server state: ESTAB

• client: clientSocket.close(); send FINbit=1, seq=x; client state: FIN_WAIT_1 (can no longer send but can receive data)
• server: send ACKbit=1, ACKnum=x+1; server state: CLOSE_WAIT (can still send data)
• client: client state: FIN_WAIT_2 (wait for server close)
• server: send FINbit=1, seq=y; server state: LAST_ACK (can no longer send data)
• client: send ACKbit=1, ACKnum=y+1; client state: TIMED_WAIT (timed wait for 2*max segment lifetime), then CLOSED
• server: server state: CLOSED

The ldquoTwo Army Problemrdquo

84

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!

85

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2

86

[Figure: Host A and Host B send original data at rate $\lambda_{in}$ through one router with unlimited shared output link buffers; throughput is $\lambda_{out}$]

[Plots: $\lambda_{out}$ against $\lambda_{in}$, both axes up to R/2; delay against $\lambda_{in}$, up to R/2]

• large delays as arrival rate, $\lambda_{in}$, approaches capacity

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: $\lambda_{in} = \lambda_{out}$
  - transport-layer input includes retransmissions: $\lambda'_{in} \ge \lambda_{in}$

87

[Figure: Host A, Host B and a router with finite shared output link buffers; $\lambda_{in}$: original data; $\lambda'_{in}$: original data, plus retransmitted data; $\lambda_{out}$]

Causes/costs of congestion: scenario 2

idealization: perfect knowledge
• sender sends only when router buffers available

88

[Figure: Host A keeps a copy of each packet and sends when there is free buffer space at the router with finite shared output link buffers, not when there is no buffer space; $\lambda_{in}$: original data; $\lambda'_{in}$: original data, plus retransmitted data; $\lambda_{out}$]

[Plot: $\lambda_{out}$ against $\lambda_{in}$, both axes up to R/2]

Causes/costs of congestion: scenario 2

Idealization: known loss. Packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost

89

[Figure: Host A, Host B and a router with free buffer space; $\lambda_{in}$: original data; $\lambda'_{in}$: original data, plus retransmitted data; $\lambda_{out}$]

Causes/costs of congestion: scenario 2

90

Idealization: known loss. Packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost

[Plot: $\lambda_{out}$ against $\lambda_{in}$, both axes up to R/2] when sending at R/2, some packets are retransmissions but asymptotic goodput is still R/2 (why?)

Causes/costs of congestion: scenario 2

91

Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered

[Figure: Host A times out and sends a copy while the router has free buffer space; $\lambda_{in}$, $\lambda'_{in}$, $\lambda_{out}$; Host B]

[Plot: $\lambda_{out}$ against $\lambda_{in}$, both axes up to R/2] when sending at R/2, some packets are retransmissions including duplicated that are delivered!

Causes/costs of congestion: scenario 2

92

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

93

Q: what happens as $\lambda_{in}$ and $\lambda'_{in}$ increase?

A: as red $\lambda'_{in}$ increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

[Figure: Host A, Host B, Host C and Host D connected through routers with finite shared output link buffers; $\lambda_{in}$: original data; $\lambda'_{in}$: original data, plus retransmitted data; $\lambda_{out}$]

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!

Causes/costs of congestion: scenario 3

94

[Plot: throughput by blue traffic ($\lambda_{out}$, up to R/2) against offered load by Host A ($\lambda'_{in}$, up to R/2), annotated "Bandwidth wastage for packets dropped at the 2nd router"]

Approaches towards congestion control

95

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at

TCP congestion control: additive increase multiplicative decrease (AIMD)

96

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss

[Figure: AIMD saw tooth behavior: probing for bandwidth. cwnd (TCP sender congestion window size) against time: additively increase window size ... until loss occurs (then cut window in half)]

TCP Congestion Control: details

• sender limits transmission:

$LastByteSent - LastByteAcked \le cwnd$

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

$rate \approx \frac{cwnd}{RTT}$ bytes/sec

97

[Figure: sender sequence number space: last byte ACKed | sent, not-yet ACKed ("in-flight"), spanning cwnd | last byte sent]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

98

[Figure: Host A sends one segment to Host B, then two segments one RTT later, then four segments; time runs downward]

TCP: detecting, reacting to loss

• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

99

TCP: switching from slow start to CA

100

Q: when should the exponential increase switch to linear?

A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

[Figure: FSM with three states (slow start, congestion avoidance, fast recovery). Transitions are listed below as event: actions; Λ denotes no event or no action.]

initial (Λ), enter slow start:
  cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0

slow start:
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: Λ; go to congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment

congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start

fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
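The summary FSM can be sketched as a small event-driven class. This is an illustrative model only: the class and method names are made up for the example, MSS = 1460 bytes is an assumed value, and retransmission and actual sending are left out so that only the cwnd / ssthresh / dupACKcount bookkeeping is shown.

```python
from dataclasses import dataclass

MSS = 1460  # bytes; assumed segment size for the example


@dataclass
class RenoSender:
    cwnd: float = MSS
    ssthresh: float = 64 * 1024
    dup_acks: int = 0
    state: str = "slow_start"

    def on_new_ack(self):
        if self.state == "slow_start":
            self.cwnd += MSS                      # exponential growth per RTT
        elif self.state == "congestion_avoidance":
            self.cwnd += MSS * (MSS / self.cwnd)  # about +1 MSS per RTT
        else:                                     # fast_recovery: deflate window
            self.cwnd = self.ssthresh
            self.state = "congestion_avoidance"
        self.dup_acks = 0
        # slow start ends once cwnd exceeds ssthresh
        if self.state == "slow_start" and self.cwnd > self.ssthresh:
            self.state = "congestion_avoidance"

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += MSS                      # inflate window per extra dup ACK
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                    # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = MSS
        self.dup_acks = 0
        self.state = "slow_start"


s = RenoSender()
for _ in range(3):
    s.on_new_ack()            # cwnd grows from 1 MSS to 4 MSS
for _ in range(3):
    s.on_dup_ack()            # third duplicate ACK triggers fast recovery
print(s.state, s.cwnd / MSS)  # fast_recovery 5.0
```

Tahoe would differ only in `on_dup_ack`: on the third duplicate ACK it would behave like `on_timeout`.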

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT

  $\text{avg TCP throughput} = \dfrac{3}{4}\,\dfrac{W}{\text{RTT}}$ bytes/sec

[Figure: sawtooth of window size oscillating between W/2 and W, with average 3/4 W.]

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

  $\text{TCP throughput} = \dfrac{1.22 \cdot \text{MSS}}{\text{RTT}\,\sqrt{L}}$

  to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate!
• new versions of TCP for high-speed
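The two numbers in this example can be checked with a few lines of arithmetic. The sketch below assumes the slide's values (1500-byte segments, 100 ms RTT, 10 Gbps target) and simply inverts the throughput formula for L; the function names are illustrative.

```python
MSS_BYTES = 1500
RTT_S = 0.100
TARGET_BPS = 10e9


def required_window_segments(target_bps, rtt_s, mss_bytes):
    """Segments that must be in flight to fill the pipe: rate * RTT / segment size."""
    return target_bps * rtt_s / (8 * mss_bytes)


def required_loss_rate(target_bps, rtt_s, mss_bytes):
    """Invert throughput = 1.22 * MSS / (RTT * sqrt(L)) for L."""
    sqrt_l = 1.22 * (8 * mss_bytes) / (rtt_s * target_bps)
    return sqrt_l ** 2


print(int(required_window_segments(TARGET_BPS, RTT_S, MSS_BYTES)))  # 83333
print(required_loss_rate(TARGET_BPS, RTT_S, MSS_BYTES))             # about 2.1e-10
```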

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R.]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput plotted against Connection 1 throughput, both axes from 0 to R, with the "equal bandwidth share" line and the "full bandwidth utilization" line. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2". Marked points: (X1, Y1) where X1 + Y1 = R, and (X2, Y2) where X2 = Y2.]
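The geometric argument can be checked numerically. The sketch below is a deliberately simplified model, assuming both sessions see loss in the same round whenever their combined rate exceeds the capacity, and that both increase by one unit per round otherwise; the function name and the numbers are illustrative.

```python
def aimd_two_flows(x, y, capacity, rounds):
    """Synchronised-loss AIMD model for two flows sharing one bottleneck."""
    for _ in range(rounds):
        if x + y > capacity:
            x, y = x / 2, y / 2    # multiplicative decrease: the gap x - y halves
        else:
            x, y = x + 1, y + 1    # additive increase: the gap x - y is unchanged
    return x, y


x, y = aimd_two_flows(80.0, 10.0, capacity=100.0, rounds=1000)
print(abs(x - y) < 1e-6)  # True: the two rates have converged to an equal share
```

Additive increase moves the operating point along a 45-degree line, and each halving moves it toward the origin, so the difference between the two rates shrinks by a factor of 2 at every loss event.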

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2
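A quick check of the parallel-connection arithmetic, assuming the bottleneck is split equally per TCP connection (the idealised fairness goal). The helper name is illustrative.

```python
from fractions import Fraction


def app_share(existing_connections, new_connections):
    """Fraction of link rate R obtained by an app opening new_connections."""
    total = existing_connections + new_connections
    return Fraction(new_connections, total)


print(app_share(9, 1))   # 1/10
print(app_share(9, 11))  # 11/20, i.e. roughly R/2
```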

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and is marked ECN=11 by a congested router; the destination returns a TCP ACK segment with ECE=1.]


rdt2.1: Example 1

[Figure sequence (five slides): the rdt2.1 sender FSM (states: Wait for call 0 from above, Wait for ACK or NAK 0, Wait for call 1 from above, Wait for ACK or NAK 1) and receiver FSM (states: Wait for 0 from below, Wait for 1 from below), with one transition shown per slide. Λ denotes no action.]

1. receiver: rdt_rcv(rcvpkt) && corrupt(rcvpkt) → sndpkt = make_pkt(NAK, chksum); udt_send(sndpkt)
2. sender: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) → udt_send(sndpkt)
3. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) → extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
4. sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) → Λ
5. no transition shown (final slide of the sequence)

rdt2.1: Example 2

[Figure sequence (five slides): the same rdt2.1 sender and receiver FSMs as in Example 1, with one transition shown per slide. Λ denotes no action.]

1. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) → extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
2. sender: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) → udt_send(sndpkt)
3. receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) → sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
4. sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) → Λ
5. no transition shown (final slide of the sequence)

rdt2.1: discussion

sender:
• seq # added to pkt
• two seq #'s (0, 1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  - state must "remember" whether "expected" pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  - state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender

rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  - receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt

rdt2.2: sender, receiver fragments

sender FSM fragment:
• Wait for call 0 from above:
  - rdt_send(data) → sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to Wait for ACK 0
• Wait for ACK 0:
  - rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)) → udt_send(sndpkt)
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0) → Λ

receiver FSM fragment:
• Wait for 0 from below:
  - rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)) → udt_send(sndpkt)
• transition attached to Wait for 0 from below:
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) → extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
  - checksum, seq. #, ACKs, retransmissions will be of help ... but not enough

approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  - retransmission will be duplicate, but seq. #'s already handle this
  - receiver must specify seq # of pkt being ACKed
• requires countdown timer

rdt3.0 sender

[Figure: sender FSM with four states. Transitions are listed as event → actions; Λ denotes no action.]

Wait for call 0 from above:
• rdt_send(data) → sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer; go to Wait for ACK0
• rdt_rcv(rcvpkt) → Λ

Wait for ACK0:
• rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)) → Λ
• timeout → udt_send(sndpkt); start_timer
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0) → stop_timer; go to Wait for call 1 from above

Wait for call 1 from above:
• rdt_send(data) → sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer; go to Wait for ACK1
• rdt_rcv(rcvpkt) → Λ

Wait for ACK1:
• rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0)) → Λ
• timeout → udt_send(sndpkt); start_timer
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1) → stop_timer; go to Wait for call 0 from above

rdt3.0 in action

[Figure: four sender/receiver timelines.]

(a) no loss: sender sends pkt0; receiver rcv pkt0, send ack0; sender rcv ack0, send pkt1; receiver rcv pkt1, send ack1; sender rcv ack1, send pkt0; receiver rcv pkt0, send ack0.

(b) packet loss: sender sends pkt0; receiver rcv pkt0, send ack0; sender rcv ack0, send pkt1; pkt1 is lost (X); sender timeout, resend pkt1; receiver rcv pkt1, send ack1; sender rcv ack1, send pkt0; receiver rcv pkt0, send ack0.

(c) ACK loss: sender sends pkt0; receiver rcv pkt0, send ack0; sender rcv ack0, send pkt1; receiver rcv pkt1, send ack1; ack1 is lost (X); sender timeout, resend pkt1; receiver rcv pkt1 (detect duplicate), send ack1; sender rcv ack1, send pkt0; receiver rcv pkt0, send ack0.

(d) premature timeout / delayed ACK: sender sends pkt0; receiver rcv pkt0, send ack0; sender rcv ack0, send pkt1; receiver rcv pkt1, send ack1; ack1 is delayed; sender timeout, resend pkt1; receiver rcv pkt1 (detect duplicate), send ack1; sender rcv ack1, send pkt0; sender rcv second ack1, do nothing; receiver rcv pkt0, send ack0.

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

  $D_{trans} = \dfrac{L}{R} = \dfrac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \text{microsecs}$

  - U_sender: utilization, the fraction of time sender is busy sending

  $U_{sender} = \dfrac{L/R}{RTT + L/R} = \dfrac{0.008}{30.008} = 0.00027$

  - if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
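The utilization figure can be reproduced with a short calculation. The sketch below assumes the slide's numbers (8000-bit packets, 1 Gbps link, 30 ms RTT); the `window_packets` parameter is an addition for illustration, showing what happens when several packets are sent back-to-back before waiting for the first ACK.

```python
def sender_utilization(packet_bits, link_bps, rtt_s, window_packets=1):
    """Fraction of time the sender is busy transmitting.

    One cycle lasts RTT + L/R; the sender transmits for window_packets * L/R of it.
    """
    d_trans = packet_bits / link_bps
    return min(1.0, window_packets * d_trans / (rtt_s + d_trans))


stop_and_wait = sender_utilization(8000, 1e9, 0.030)
three_in_flight = sender_utilization(8000, 1e9, 0.030, window_packets=3)
print(round(stop_and_wait, 5))    # 0.00027
print(round(three_in_flight, 4))  # 0.0008
```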

rdt3.0: stop-and-wait operation

[Figure: sender/receiver timeline. First packet bit transmitted, t = 0; last packet bit transmitted, t = L/R; first packet bit arrives; last packet bit arrives, send ACK; ACK arrives, send next packet, t = RTT + L/R.]

  $U_{sender} = \dfrac{L/R}{RTT + L/R} = \dfrac{0.008}{30.008} = 0.00027$

Pipelined protocols

pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

[Figure: sender/receiver timeline. First packet bit transmitted, t = 0; last bit transmitted, t = L/R; first packet bit arrives; last packet bit arrives, send ACK; last bit of 2nd packet arrives, send ACK; last bit of 3rd packet arrives, send ACK; ACK arrives, send next packet, t = RTT + L/R.]

3-packet pipelining increases utilization by a factor of 3:

  $U_{sender} = \dfrac{3L/R}{RTT + L/R} = \dfrac{0.024}{30.008} \approx 0.0008$

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n ("cumulative ACK")
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

initial (Λ):
    base = 1
    nextseqnum = 1

state: Wait

rdt_send(data):
    if (nextseqnum < base+N) {
        sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
        udt_send(sndpkt[nextseqnum])
        if (base == nextseqnum)
            start_timer
        nextseqnum++
    }
    else
        refuse_data(data)

timeout:
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    ...
    udt_send(sndpkt[nextseqnum-1])

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
        stop_timer
    else
        start_timer

rdt_rcv(rcvpkt) && corrupt(rcvpkt):
    Λ


GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #

initial (Λ):
    expectedseqnum = 1
    sndpkt = make_pkt(0, ACK, chksum)

state: Wait

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++

default:
    udt_send(sndpkt)
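The two extended FSMs can be exercised together in a few lines. The sketch below is illustrative only: class and method names are made up, the channel and timer are replaced by direct method calls, corruption is not modelled, and sequence numbers are unbounded integers rather than k-bit values. The guard that ignores ACKs below `base` is an added assumption, not part of the FSM as drawn.

```python
class GBNSender:
    def __init__(self, window):
        self.N = window
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}            # seq -> data, kept until cumulatively ACKed
        self.timer_running = False

    def rdt_send(self, data):
        """Return the (seq, data) packet to transmit, or None if the window is full."""
        if self.nextseqnum >= self.base + self.N:
            return None             # refuse_data
        seq = self.nextseqnum
        self.sndpkt[seq] = data
        if self.base == self.nextseqnum:
            self.timer_running = True
        self.nextseqnum += 1
        return (seq, data)

    def on_ack(self, acknum):
        if acknum < self.base:      # assumption: stale or duplicate ACK is ignored
            return
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)
        self.base = acknum + 1      # cumulative ACK slides the window
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        """Go back N: return every unACKed packet for retransmission."""
        self.timer_running = True
        return [(seq, self.sndpkt[seq]) for seq in range(self.base, self.nextseqnum)]


class GBNReceiver:
    def __init__(self):
        self.expectedseqnum = 1
        self.delivered = []

    def rdt_rcv(self, seq, data):
        """Deliver in-order data only; return the ACK number to send."""
        if seq == self.expectedseqnum:
            self.delivered.append(data)
            self.expectedseqnum += 1
        return self.expectedseqnum - 1   # highest in-order seq # (0 if none yet)


sender, receiver = GBNSender(window=4), GBNReceiver()
sent = [sender.rdt_send(d) for d in "abcde"]           # fifth call is refused
arrived = [p for p in sent if p is not None and p[0] != 2]   # packet 2 is lost
for ack in [receiver.rdt_rcv(*p) for p in arrived]:
    sender.on_ack(ack)
resent = sender.on_timeout()
print([seq for seq, _ in resent])  # [2, 3, 4]
```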


GBN in action

[Figure: sender/receiver timeline with sender window (N=4) sliding over sequence numbers 0 1 2 3 4 5 6 7 8; pkt2 is lost (X).]

sender:
• send pkt0, send pkt1, send pkt2, send pkt3, (wait)
• rcv ack0, send pkt4
• rcv ack1, send pkt5
• ignore duplicate ACK
• pkt 2 timeout: send pkt2, send pkt3, send pkt4, send pkt5

receiver:
• receive pkt0, send ack0
• receive pkt1, send ack1
• receive pkt3, discard, (re)send ack1
• receive pkt4, discard, (re)send ack1
• receive pkt5, discard, (re)send ack1
• rcv pkt2, deliver, send ack2
• rcv pkt3, deliver, send ack3
• rcv pkt4, deliver, send ack4
• rcv pkt5, deliver, send ack5


Selective repeat

• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #s of sent, unACKed packets

Selective repeat: sender, receiver windows

Selective repeat

sender:
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver:
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
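The receiver rules can be sketched as a short class. This is an illustration under simplifying assumptions: sequence numbers are unbounded integers (no wraparound), corruption is not modelled, and the class and method names are hypothetical.

```python
class SRReceiver:
    def __init__(self, window, first_seq=0):
        self.N = window
        self.rcvbase = first_seq
        self.buffer = {}            # out-of-order packets awaiting delivery
        self.delivered = []

    def on_packet(self, n, data):
        """Return the seq # to ACK, or None if the packet is ignored."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            # deliver the in-order prefix and advance the window
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n                # already delivered: re-ACK so the sender can advance
        return None


r = SRReceiver(window=4)
acks = [r.on_packet(n, f"pkt{n}") for n in (0, 1, 3, 4, 5, 2)]
print(acks)         # [0, 1, 3, 4, 5, 2]
print(r.delivered)  # ['pkt0', 'pkt1', 'pkt2', 'pkt3', 'pkt4', 'pkt5']
```

Re-ACKing packets in [rcvbase-N, rcvbase-1] matters because the original ACK may have been lost; without the re-ACK the sender's window would never move.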

Selective repeat in action

[Figure: sender/receiver timeline with sender window (N=4) sliding over sequence numbers 0 1 2 3 4 5 6 7 8; pkt2 is lost (X).]

sender:
• send pkt0, send pkt1, send pkt2, send pkt3, (wait)
• rcv ack0, send pkt4
• rcv ack1, send pkt5
• record ack3 arrived
• pkt 2 timeout: send pkt2
• record ack4 arrived
• record ack5 arrived

receiver:
• receive pkt0, send ack0
• receive pkt1, send ack1
• receive pkt3, buffer, send ack3
• receive pkt4, buffer, send ack4
• receive pkt5, buffer, send ack5
• rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

[Figure: two scenarios, each showing the sender window (after receipt) and the receiver window (after receipt) over the sequence 0 1 2 3 0 1 2.
(a) no problem: pkt0, pkt1, pkt2 are sent and acknowledged; pkt3 is lost (X); the sender then sends a new pkt0, and the receiver will accept packet with seq number 0.
(b) oops!: pkt0, pkt1, pkt2 are sent but all three ACKs are lost (X X X); the sender times out and retransmits pkt0, and the receiver will accept packet with seq number 0.]

• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

receiver can't see sender side; receiver behavior identical in both cases! something's (very) wrong!

Q: what relationship between seq # size and window size to avoid problem in (b)?

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver

TCP segment structure

[Figure: TCP segment layout, 32 bits wide. Fields, top to bottom:]
• source port #, dest port #
• sequence number: counting by bytes of data (not segments!)
• acknowledgement number
• head len, not used, flag bits U A P R S F, receive window
  - URG: urgent data (generally not used)
  - ACK: ACK # valid
  - PSH: push data now
  - RST, SYN, FIN: connection estab (setup, teardown commands)
  - receive window: # bytes rcvr willing to accept
• checksum (Internet checksum, as in UDP), Urg data pointer
• options (variable length)
• application data (variable length)
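To make the layout concrete, the fixed 20-byte part of the header can be unpacked with Python's standard `struct` module. This is a sketch for illustration: the function name, the dictionary keys and the sample values are made up, and options are skipped rather than parsed.

```python
import struct

# Flag bit positions in the low 6 bits of the 16-bit header-length/flags word
FLAG_BITS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04, "PSH": 0x08, "ACK": 0x10, "URG": 0x20}


def parse_tcp_segment(segment: bytes) -> dict:
    """Split a raw TCP segment into its fixed header fields and payload."""
    (src, dst, seq, ack, off_flags, rwnd, checksum, urg) = struct.unpack(
        "!HHIIHHHH", segment[:20]
    )
    header_len = (off_flags >> 12) * 4   # head len is counted in 32-bit words
    flags = {name for name, bit in FLAG_BITS.items() if off_flags & bit}
    return {
        "src_port": src, "dst_port": dst,
        "seq": seq, "ack": ack,
        "header_len": header_len, "flags": flags,
        "rwnd": rwnd, "checksum": checksum, "urg_ptr": urg,
        "data": segment[header_len:],    # anything between byte 20 and header_len is options
    }


# Hand-built sample: a one-byte payload with PSH and ACK set (values are illustrative)
sample = struct.pack("!HHIIHHHH", 50000, 23, 42, 79, (5 << 12) | 0x18, 4096, 0, 0) + b"C"
fields = parse_tcp_segment(sample)
print(fields["seq"], fields["ack"], sorted(fields["flags"]), fields["data"])
```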

TCP seq. numbers, ACKs

sequence numbers:
  - byte stream "number" of first byte in segment's data
acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK
Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say, up to implementor

[Figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer. Below, the sender sequence number space with window size N: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable.]

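Byte stream in TCP

[Figure: an HTTP Get Message (K bytes) mapped onto the byte stream with a window of N bytes. A segment with TCP header (seq no = 100) carries M bytes starting at the 100th byte; the part of the message beyond the window cannot be transmitted now.]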

TCP seq. numbers, ACKs

simple telnet scenario (Host A, Host B):
1. User types 'C'. Host A → Host B: Seq=42, ACK=79, data = 'C'
2. Host B ACKs receipt of 'C', echoes back 'C'. Host B → Host A: Seq=79, ACK=43, data = 'C'
3. Host A ACKs receipt of echoed 'C'. Host A → Host B: Seq=43, ACK=80

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT

  $\text{EstimatedRTT} = (1-\alpha)\cdot\text{EstimatedRTT} + \alpha\cdot\text{SampleRTT}$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: $\alpha = 0.125$

[Figure: RTT, gaia.cs.umass.edu to fantasia.eurecom.fr. SampleRTT and EstimatedRTT in milliseconds (axis 100 to 350) plotted against time in seconds (axis 1 to 106).]


TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

  $\text{DevRTT} = (1-\beta)\cdot\text{DevRTT} + \beta\cdot|\text{SampleRTT} - \text{EstimatedRTT}|$   (typically, $\beta = 0.25$)

  $\text{TimeoutInterval} = \text{EstimatedRTT} + 4\cdot\text{DevRTT}$

  (first term: estimated RTT; second term: "safety margin")
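The estimator fits in a few lines of code. In the sketch below, the class name, the initialisation (EstimatedRTT set to the first sample, DevRTT to half of it) and the update order (DevRTT first, using the old EstimatedRTT) are assumptions in the style of common practice; the slides give only the update formulas.

```python
class RttEstimator:
    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2      # assumed initial value

    def update(self, sample_rtt):
        """Fold one SampleRTT into the EWMAs and return the new TimeoutInterval."""
        # DevRTT is updated first, against the previous EstimatedRTT (assumed order)
        self.dev_rtt = (1 - self.beta) * self.dev_rtt + self.beta * abs(
            sample_rtt - self.estimated_rtt
        )
        self.estimated_rtt = (1 - self.alpha) * self.estimated_rtt + self.alpha * sample_rtt
        return self.timeout_interval()

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)   # milliseconds
print(est.update(120.0))                 # 272.5
```

Samples taken from retransmitted segments should not be fed to `update`, matching the "ignore retransmissions" rule.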

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks

let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control

TCP sender events

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments

TCP sender (simplified)

initial (Λ):
    NextSeqNum = InitialSeqNum
    SendBase = InitialSeqNum

state: wait for event

data received from application above:
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
        start timer

timeout:
    retransmit not-yet-acked segment with smallest seq. #
    start timer

ACK received, with ACK field value y:
    if (y > SendBase) {
        SendBase = y
        /* SendBase-1: last cumulatively ACKed byte */
        if (there are currently not-yet-acked segments)
            start timer
        else stop timer
    }
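The simplified sender's three events translate directly into code. The sketch below is illustrative: names are made up, the timer is a boolean flag rather than a real clock, and "sending" is represented by returning the segment to the caller.

```python
class SimpleTcpSender:
    def __init__(self, initial_seq):
        self.next_seq_num = initial_seq
        self.send_base = initial_seq
        self.unacked = {}             # seq # of first byte -> segment data
        self.timer_running = False

    def on_app_data(self, data: bytes):
        """Create a segment numbered by its first byte and hand it to IP."""
        seq = self.next_seq_num
        self.unacked[seq] = data
        self.next_seq_num += len(data)
        if not self.timer_running:
            self.timer_running = True
        return (seq, data)

    def on_timeout(self):
        """Retransmit only the not-yet-acked segment with the smallest seq #."""
        if not self.unacked:
            return None
        seq = min(self.unacked)
        self.timer_running = True     # restart timer
        return (seq, self.unacked[seq])

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y        # bytes up to y-1 are cumulatively ACKed
            fully_acked = [s for s, d in self.unacked.items() if s + len(d) <= y]
            for s in fully_acked:
                del self.unacked[s]
            self.timer_running = bool(self.unacked)


s = SimpleTcpSender(initial_seq=92)
s.on_app_data(b"x" * 8)       # Seq=92, 8 bytes
s.on_app_data(b"y" * 20)      # Seq=100, 20 bytes
print(s.on_timeout()[0])      # 92: only the oldest segment is resent
s.on_ack(100)
s.on_ack(120)
print(s.send_base, s.timer_running)  # 120 False
```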

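TCP: retransmission scenarios

[Figure: three Host A / Host B timelines.]

lost ACK scenario:
• Host A → Host B: Seq=92, 8 bytes of data
• Host B → Host A: ACK=100, lost (X)
• timeout at Host A; Host A → Host B: Seq=92, 8 bytes of data
• Host B → Host A: ACK=100

premature timeout:
• Host A → Host B: Seq=92, 8 bytes of data; then Seq=100, 20 bytes of data (SendBase=92)
• Host B → Host A: ACK=100, then ACK=120; both arrive after the timeout
• timeout at Host A; Host A → Host B: Seq=92, 8 bytes of data
• at Host A: SendBase=100, then SendBase=120
• Host B → Host A: ACK=120 (SendBase=120)

cumulative ACK:
• Host A → Host B: Seq=92, 8 bytes of data; then Seq=100, 20 bytes of data
• Host B → Host A: ACK=100, lost (X); ACK=120 arrives before the timeout
• Host A → Host B: Seq=120, 15 bytes of data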

TCP ACK generation [RFC 5681]

event at receiver → TCP receiver action

• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  → delayed ACK: wait up to 500ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment, higher-than-expected seq #; gap detected
  → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap
  → immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout

[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost (X). Host B returns ACK=100 four times (3 DUP ACKs); Host A resends Seq=100, 20 bytes of data before the timeout expires.]

TCP flow control

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code in the OS into the TCP socket receiver buffers, which the application process reads. The application may remove data from TCP socket buffers slower than TCP receiver is delivering (sender is sending).]

TCP flow control

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data and free buffer space (rwnd); data leaves to the application process.]

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

[Figure: client and server, each with application and network layers, each holding: connection state: ESTAB; connection variables: seq # client-to-server and server-to-client, rcvBuffer size at server, client.]

client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

[Figure: client/server timeline; client state starts at CLOSED, server state at LISTEN.]

1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client enters SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server enters SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client enters ESTAB
4. server: received ACK(y) indicates client is live; server enters ESTAB
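The sequence and acknowledgement numbers in the three segments can be traced with a tiny model. This is only a sketch of the numbering: the function name and dictionary layout are made up for the example, no real sockets are involved, and the random choice of x and y stands in for whatever method a real stack uses to pick initial sequence numbers.

```python
import random

SEQ_SPACE = 2 ** 32   # TCP sequence numbers are 32-bit and wrap around


def three_way_handshake(x=None, y=None):
    """Return the SYN, SYNACK and ACK segments as dictionaries."""
    x = random.randrange(SEQ_SPACE) if x is None else x   # client's initial seq num
    y = random.randrange(SEQ_SPACE) if y is None else y   # server's initial seq num
    syn = {"SYN": 1, "seq": x}
    synack = {"SYN": 1, "ACK": 1, "seq": y, "acknum": (x + 1) % SEQ_SPACE}
    ack = {"ACK": 1, "acknum": (y + 1) % SEQ_SPACE}
    return [syn, synack, ack]


for segment in three_way_handshake(x=100, y=300):
    print(segment)
```

Each side acknowledges the other's initial sequence number plus one, which is how the SYN comes to consume one sequence number even though it carries no data.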

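TCP 3-way handshake: FSM

[Figure: FSM with states closed, listen, SYN rcvd, SYN sent, ESTAB. Transitions are listed as event → action; Λ denotes none.]

• closed → listen (server): Socket connectionSocket = welcomeSocket.accept(); → Λ
• closed → SYN sent (client): Socket clientSocket = new Socket("hostname", "port number"); → SYN(seq=x)
• listen → SYN rcvd: SYN(x) → SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
• SYN sent → ESTAB: SYNACK(seq=y, ACKnum=x+1) → ACK(ACKnum=y+1)
• SYN rcvd → ESTAB: ACK(ACKnum=y+1) → Λ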

TCP: closing a connection

• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

[Figure: client/server timeline; client state and server state both start at ESTAB.]

1. client calls clientSocket.close() and sends FINbit=1, seq=x; client enters FIN_WAIT_1 (can no longer send but can receive data)
2. server replies ACKbit=1, ACKnum=x+1 and enters CLOSE_WAIT (can still send data); client enters FIN_WAIT_2 (wait for server close)
3. server sends FINbit=1, seq=y and enters LAST_ACK (can no longer send data)
4. client replies ACKbit=1, ACKnum=y+1 and enters TIMED_WAIT (timed wait for 2*max segment lifetime), then CLOSED; server enters CLOSED

The "Two Army Problem"

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission

[Figure: Host A and Host B send original data at rate λ_in into a router with unlimited shared output link buffers; throughput is λ_out.]
[Figure: λ_out plotted against λ_in, levelling off at R/2; delay plotted against λ_in, rising steeply as λ_in approaches R/2.]

• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λ_in = λ_out
  - transport-layer input includes retransmissions: λ'_in ≥ λ_in

[Figure: Host A and Host B sending through a router with finite shared output link buffers. λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out.]

idealization: perfect knowledge
• sender sends only when router buffers available
[Figure: sender keeps a copy and transmits when there is free buffer space, holding back when there is no buffer space. λ_out plotted against λ_in, rising to R/2 at λ_in = R/2.]

idealization: known loss. packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
[Figure: λ_out plotted against λ'_in, both axes up to R/2.]
• when sending at R/2, some packets are retransmissions but asymptotic goodput is still R/2 (why?)

realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
[Figure: a packet is queued in free buffer space, the sender times out and sends a copy. λ_out plotted against λ'_in, both axes up to R/2.]
• when sending at R/2, some packets are retransmissions including duplicated that are delivered!

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

[Figure: Hosts A, B, C, D with multihop paths through routers with finite shared output link buffers. λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out.]
[Figure: throughput by blue traffic (λ_out) plotted against offered load by Host A (λ'_in), both axes up to R/2; annotated "bandwidth wastage for packets dropped at the 2nd router".]

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss

[Figure: cwnd (TCP sender congestion window size) versus time. AIMD sawtooth behavior, probing for bandwidth: additively increase window size until loss occurs, then cut window in half]
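The sawtooth can be reproduced with a few lines of Python. This is only a sketch under simplifying assumptions: cwnd is counted in whole MSS units, and the set of RTTs in which a loss is detected (`loss_rounds`) is a hypothetical input rather than something derived from a network model.

```python
def aimd_trace(rounds, loss_rounds, start_cwnd=10, mss=1):
    """Return the cwnd value (in MSS units) at the start of each RTT.

    loss_rounds is a hypothetical input: the set of RTT indices in which
    the sender detects a loss.
    """
    cwnd = start_cwnd
    trace = []
    for rtt in range(rounds):
        trace.append(cwnd)
        if rtt in loss_rounds:
            cwnd = max(mss, cwnd // 2)  # multiplicative decrease: halve cwnd
        else:
            cwnd += mss                 # additive increase: +1 MSS per RTT
    return trace


trace = aimd_trace(rounds=8, loss_rounds={3, 6})
print(trace)
```

Plotting `trace` against the RTT index gives the sawtooth shape: linear ramps separated by halvings.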

TCP Congestion Control: details

• sender limits transmission:

  $LastByteSent - LastByteAcked \le cwnd$

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

  $rate \approx \frac{cwnd}{RTT}$ bytes/sec

[Figure: sender sequence number space: last byte ACKed; sent, not-yet ACKed (“in-flight”); last byte sent; cwnd spans the in-flight bytes]
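The window constraint and the rate approximation can be written directly as two small functions. The function and parameter names below are illustrative assumptions, not part of any TCP API.

```python
def can_send(last_byte_sent, last_byte_acked, cwnd, nbytes):
    """True if sending nbytes more keeps the in-flight data within cwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + nbytes <= cwnd


def approx_rate(cwnd_bytes, rtt_seconds):
    """Rough sending rate in bytes/sec: one window per round trip."""
    return cwnd_bytes / rtt_seconds


print(can_send(1000, 400, 1000, 400))
print(approx_rate(14600, 0.5))
```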

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A sends one segment to Host B, then two segments one RTT later, then four segments]
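A short sketch shows why incrementing cwnd once per ACK doubles it every RTT. It assumes no loss and one ACK per segment (no delayed ACKs), with cwnd counted in MSS units.

```python
def slow_start(rounds, mss=1):
    """Return cwnd (in MSS units) after each RTT, starting from 1 MSS.

    Assumes every segment sent in an RTT is ACKed individually, no loss.
    """
    cwnd = mss
    sizes = [cwnd]
    for _ in range(rounds):
        acks = cwnd // mss        # one ACK arrives per segment sent this RTT
        for _ in range(acks):
            cwnd += mss           # increment cwnd by 1 MSS for every ACK
        sizes.append(cwnd)
    return sizes


print(slow_start(3))
```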

TCP: detecting, reacting to loss

• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

initially (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; enter slow start

slow start:
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh (Λ): go to congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment

congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start

fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment; go to slow start
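The three-state machine above can be sketched as a small Python class. This is an illustration, not a real TCP stack: cwnd and ssthresh are floats in MSS units (so the initial ssthresh is a parameter rather than 64 KB), "cwnd = ssthresh + 3" is read as ssthresh + 3 MSS, and segment transmission and retransmission are left out. The class and method names are assumptions.

```python
class RenoSender:
    """Window bookkeeping for slow start, congestion avoidance, fast recovery."""

    def __init__(self, mss=1.0, ssthresh=64.0):
        self.mss = mss
        self.cwnd = mss
        self.ssthresh = ssthresh
        self.dup_ack_count = 0
        self.state = "slow_start"

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh                      # deflate window
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += self.mss                          # exponential growth
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += self.mss * (self.mss / self.cwnd)  # about +1 MSS per RTT
        self.dup_ack_count = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += self.mss                          # each dup ACK inflates cwnd
            return
        self.dup_ack_count += 1
        if self.dup_ack_count == 3:                        # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = self.mss
        self.dup_ack_count = 0
        self.state = "slow_start"
```

Feeding the object a sequence of `on_new_ack`, `on_dup_ack` and `on_timeout` calls traces how cwnd and the state evolve for a given pattern of events.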

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is 3/4 W
  – avg. throughput is 3/4 W per RTT

  $\text{avg TCP throughput} = \frac{3}{4} \cdot \frac{W}{RTT}$ bytes/sec

[Figure: sawtooth window oscillating between W/2 and W, with average 3/4 W]

TCP Futures: TCP over “long, fat pipes”

• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

  $\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \sqrt{L}}$

  to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
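Both numbers in the example can be checked numerically. The sketch below solves the throughput formula for L; the function names are illustrative, and throughput is taken in bits/sec with MSS converted from bytes to bits.

```python
def window_segments(rate_bps, rtt_s, segment_bytes):
    """In-flight segments needed to sustain rate_bps over one RTT."""
    return rate_bps * rtt_s / (segment_bytes * 8)


def required_loss_rate(rate_bps, rtt_s, mss_bytes):
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L."""
    return (1.22 * mss_bytes * 8 / (rtt_s * rate_bps)) ** 2


w = window_segments(10e9, 0.100, 1500)
loss = required_loss_rate(10e9, 0.100, 1500)
print(f"W = {w:.0f} segments, L = {loss:.2e}")
```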

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput versus Connection 1 throughput, both axes up to R, with the equal bandwidth share line (points (X2, Y2) where X2 = Y2) and the full bandwidth utilization line (points (X1, Y1) where X1 + Y1 = R). The trajectory alternates between congestion avoidance (additive increase) and loss (decrease window by factor of 2)]
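The convergence argument can be checked with a toy simulation of two AIMD flows. It assumes both flows have the same RTT and both detect loss in the same round whenever their combined rate exceeds the capacity; rates are in arbitrary units and the function name is illustrative.

```python
def two_flow_aimd(x1, x2, capacity, rounds):
    """Evolve two AIMD flows sharing one bottleneck; return final rates."""
    for _ in range(rounds):
        if x1 + x2 > capacity:
            x1, x2 = x1 / 2, x2 / 2    # loss: both halve, so the gap halves
        else:
            x1, x2 = x1 + 1, x2 + 1    # additive increase: gap unchanged
    return x1, x2


a, b = two_flow_aimd(10, 70, capacity=100, rounds=200)
print(abs(a - b))
```

Each loss halves the difference between the two rates while additive increase leaves it unchanged, so the rates approach the equal bandwidth share line.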

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical); the IP datagram leaves the source with ECN=00 and is marked ECN=11 by a router; the destination returns a TCP ACK segment with ECE=1]


rdt2.1: Example 1

[Figure sequence: rdt2.1 sender FSM (states: Wait for call 0 from above, Wait for ACK or NAK 0, Wait for call 1 from above, Wait for ACK or NAK 1) and receiver FSM (states: Wait for 0 from below, Wait for 1 from below), with one transition highlighted per step]

• sender, on rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)):
    udt_send(sndpkt)
• receiver, on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt):
    extract(rcvpkt,data)
    deliver_data(data)
    sndpkt = make_pkt(ACK, chksum)
    udt_send(sndpkt)
• sender, on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt):
    Λ (no action)

rdt2.1: Example 2

[Figure sequence: the same rdt2.1 sender and receiver FSMs, with one transition highlighted per step]

• receiver, on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt):
    extract(rcvpkt,data)
    deliver_data(data)
    sndpkt = make_pkt(ACK, chksum)
    udt_send(sndpkt)
• sender, on rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)):
    udt_send(sndpkt)
• receiver, on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt):
    sndpkt = make_pkt(ACK, chksum)
    udt_send(sndpkt)
• sender, on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt):
    Λ (no action)

rdt2.1: discussion

sender:
• seq # added to pkt
• two seq #'s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  – state must “remember” whether “expected” pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  – state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender

rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  – receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt

rdt2.2: sender, receiver fragments

sender FSM fragment:
• Wait for call 0 from above, on rdt_send(data):
    sndpkt = make_pkt(0, data, checksum)
    udt_send(sndpkt)
    go to Wait for ACK 0
• Wait for ACK 0, on rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)):
    udt_send(sndpkt)
• Wait for ACK 0, on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0):
    Λ (no action)

receiver FSM fragment:
• Wait for 0 from below, on rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)):
    udt_send(sndpkt)
• transition into Wait for 0 from below, on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt):
    extract(rcvpkt,data)
    deliver_data(data)
    sndpkt = make_pkt(ACK1, chksum)
    udt_send(sndpkt)

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
  – checksum, seq #, ACKs, retransmissions will be of help … but not enough

approach: sender waits “reasonable” amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  – retransmission will be duplicate, but seq #'s already handles this
  – receiver must specify seq # of pkt being ACKed
• requires countdown timer

rdt3.0 sender

Wait for call 0 from above:
• rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer; go to Wait for ACK0
• rdt_rcv(rcvpkt): Λ (no action)

Wait for ACK0:
• rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)): Λ (no action)
• timeout: udt_send(sndpkt); start_timer
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0): stop_timer; go to Wait for call 1 from above

Wait for call 1 from above:
• rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer; go to Wait for ACK1
• rdt_rcv(rcvpkt): Λ (no action)

Wait for ACK1:
• rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,0)): Λ (no action)
• timeout: udt_send(sndpkt); start_timer
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1): stop_timer; go to Wait for call 0 from above
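The alternating-bit behavior of rdt3.0 can be exercised with a toy simulation. This is a sketch under strong assumptions: losses are scripted by transmission index (`drop_data`, `drop_ack` are hypothetical inputs), a lost packet or ACK is modeled as an immediate timeout, and corruption and delay are ignored.

```python
def stop_and_wait(messages, drop_data=(), drop_ack=()):
    """Simulate an rdt3.0-style sender/receiver pair.

    drop_data / drop_ack: indices of data transmissions / ACKs that are lost.
    Returns (delivered messages, number of data transmissions).
    """
    delivered = []
    expected = 0        # receiver: seq # it is waiting for
    seq = 0             # sender: seq # of the current packet
    data_tx = 0         # data transmissions so far
    acks_tx = 0         # ACKs sent so far
    for data in messages:
        while True:
            index, data_tx = data_tx, data_tx + 1
            if index in drop_data:
                continue                # packet lost: timeout, retransmit
            if seq == expected:
                delivered.append(data)  # new packet: deliver to upper layer
                expected ^= 1
            # otherwise duplicate: do not deliver, just re-ACK
            index, acks_tx = acks_tx, acks_tx + 1
            if index in drop_ack:
                continue                # ACK lost: timeout, retransmit
            break                       # ACK for current seq # received
        seq ^= 1                        # alternate between 0 and 1
    return delivered, data_tx


print(stop_and_wait(["a", "b", "c"], drop_data={1}, drop_ack={1}))
```

Despite one lost packet and one lost ACK, each message is delivered exactly once, at the cost of extra transmissions.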

rdt3.0 in action

[Figure: four sender/receiver timelines]
(a) no loss: send pkt0, rcv pkt0, send ack0, rcv ack0; send pkt1, rcv pkt1, send ack1, rcv ack1; send pkt0, rcv pkt0, send ack0
(b) packet loss: pkt1 is lost (X); sender times out and resends pkt1; receiver rcv pkt1, send ack1; sender rcv ack1, send pkt0
(c) ACK loss: ack1 is lost (X); sender times out and resends pkt1; receiver rcv pkt1 (detect duplicate), send ack1; sender rcv ack1, send pkt0
(d) premature timeout/delayed ACK: sender times out before ack1 arrives and resends pkt1; receiver rcv pkt1 (detect duplicate), send ack1; sender rcv the first ack1 and sends pkt0; sender rcv the second ack1 and does nothing

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

  $D_{trans} = \frac{L}{R} = \frac{8000 \text{ bits}}{10^9 \text{ bits/sec}} = 8 \text{ microsecs}$

  – U_sender: utilization, fraction of time sender busy sending

  $U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$

  – if RTT = 30 msec, 1KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
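The utilization figure can be recomputed directly from the formula. The `window` parameter is an addition for illustration: it counts how many packets are sent back-to-back before the sender has to wait for an ACK, so `window=1` is stop-and-wait.

```python
def sender_utilization(packet_bits, rate_bps, rtt_s, window=1):
    """Fraction of time the sender is busy: window * (L/R) / (RTT + L/R)."""
    d_trans = packet_bits / rate_bps          # transmission delay L/R
    return min(1.0, window * d_trans / (rtt_s + d_trans))


u = sender_utilization(8000, 1e9, 0.030)
print(f"stop-and-wait utilization: {u:.5f}")
```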

rdt3.0: stop-and-wait operation

[Figure: sender/receiver timeline]
• first packet bit transmitted, t = 0
• last packet bit transmitted, t = L/R
• first packet bit arrives
• last packet bit arrives, send ACK
• ACK arrives, send next packet, t = RTT + L/R

  $U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$

Pipelined protocols

pipelining: sender allows multiple, “in-flight”, yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

[Figure: sender/receiver timeline]
• first packet bit transmitted, t = 0
• last bit transmitted, t = L/R
• first packet bit arrives
• last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• ACK arrives, send next packet, t = RTT + L/R

3-packet pipelining increases utilization by a factor of 3!

  $U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} = 0.0008$

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  – doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  – when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  – when timer expires, retransmit only that unacked packet

Go-Back-N: sender

• k-bit seq # in pkt header
• “window” of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n - “cumulative ACK”
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

initially (Λ):
    base = 1
    nextseqnum = 1

state Wait:
• rdt_send(data):
    if (nextseqnum < base+N) {
        sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
        udt_send(sndpkt[nextseqnum])
        if (base == nextseqnum)
            start_timer
        nextseqnum++
    }
    else
        refuse_data(data)
• timeout:
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    …
    udt_send(sndpkt[nextseqnum-1])
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
        stop_timer
    else
        start_timer
• rdt_rcv(rcvpkt) && corrupt(rcvpkt):
    Λ (no action)
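The sender FSM translates almost line by line into Python. The sketch below keeps the FSM's variable names but simplifies the environment: packets are `(seq, data)` tuples without a checksum, the timer is a boolean flag, and `send` is a caller-supplied callable standing in for udt_send. All of these are assumptions for illustration.

```python
class GBNSender:
    """Go-Back-N sender bookkeeping, following the extended FSM."""

    def __init__(self, window, send):
        self.N = window
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}            # seq # -> packet, kept for retransmission
        self.send = send            # stand-in for udt_send
        self.timer_running = False

    def rdt_send(self, data):
        if self.nextseqnum < self.base + self.N:
            pkt = (self.nextseqnum, data)
            self.sndpkt[self.nextseqnum] = pkt
            self.send(pkt)
            if self.base == self.nextseqnum:
                self.timer_running = True       # start_timer
            self.nextseqnum += 1
            return True
        return False                            # refuse_data: window is full

    def on_ack(self, acknum):
        self.base = acknum + 1                  # cumulative ACK
        # stop_timer if nothing is in flight, otherwise (re)start it
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        self.timer_running = True               # start_timer
        for seq in range(self.base, self.nextseqnum):
            self.send(self.sndpkt[seq])         # resend every unacked packet


sent = []
sender = GBNSender(window=2, send=sent.append)
sender.rdt_send("a")
sender.rdt_send("b")
print(sender.rdt_send("c"), sent)
```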

GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don't buffer): no receiver buffering!
  – re-ACK pkt with highest in-order seq #

initially (Λ):
    expectedseqnum = 1
    sndpkt = make_pkt(0, ACK, chksum)

state Wait:
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++
• default:
    udt_send(sndpkt)
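A matching receiver sketch needs only one counter. As an illustration under simplifying assumptions, corruption is passed in as a flag instead of being computed from a checksum, and the method returns the ACK number that would be sent.

```python
class GBNReceiver:
    """Go-Back-N receiver: delivers in order, re-ACKs otherwise."""

    def __init__(self):
        self.expectedseqnum = 1
        self.last_ack = 0           # matches the initial make_pkt(0, ACK, chksum)
        self.delivered = []

    def rdt_rcv(self, seq, data, corrupt=False):
        if not corrupt and seq == self.expectedseqnum:
            self.delivered.append(data)     # deliver_data
            self.last_ack = seq
            self.expectedseqnum += 1
        # default case: discard the packet and re-send the last ACK
        return self.last_ack


r = GBNReceiver()
print([r.rdt_rcv(1, "a"), r.rdt_rcv(3, "c"), r.rdt_rcv(2, "b")], r.delivered)
```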

GBN in action

[Figure: sender/receiver timeline, sender window (N=4) sliding over sequence numbers 0 to 8]

sender:
• send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
• rcv ack0, send pkt4
• rcv ack1, send pkt5
• ignore duplicate ACK
• pkt 2 timeout: send pkt2, send pkt3, send pkt4, send pkt5

receiver:
• receive pkt0, send ack0
• receive pkt1, send ack1
• receive pkt3, discard, (re)send ack1
• receive pkt4, discard, (re)send ack1
• receive pkt5, discard, (re)send ack1
• rcv pkt2, deliver, send ack2
• rcv pkt3, deliver, send ack3
• rcv pkt4, deliver, send ack4
• rcv pkt5, deliver, send ack5

Selective repeat

• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #'s
  – limits seq #s of sent, unACKed packets

Selective repeat: sender, receiver windows

[Figure: sender and receiver views of the sequence number space]

Selective repeat

sender:
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver:
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
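The receiver rules can be sketched as follows. For simplicity this assumes sequence numbers that never wrap around (plain integers), which sidesteps the wrap-around problem; the class and method names are illustrative.

```python
class SRReceiver:
    """Selective-repeat receiver: buffers out-of-order pkts, ACKs each one."""

    def __init__(self, window):
        self.N = window
        self.rcv_base = 0
        self.buffer = {}            # seq # -> data, received but not delivered
        self.delivered = []

    def on_packet(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcv_base <= n < self.rcv_base + self.N:
            self.buffer[n] = data
            # deliver any in-order run and advance the window
            while self.rcv_base in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcv_base))
                self.rcv_base += 1
            return n
        if self.rcv_base - self.N <= n < self.rcv_base:
            return n                # already delivered: re-ACK
        return None                 # otherwise: ignore


r = SRReceiver(window=4)
acks = [r.on_packet(n, f"pkt{n}") for n in (0, 1, 3, 4, 5, 2)]
print(acks, r.delivered)
```

The arrival order used here (0, 1, 3, 4, 5, then the retransmitted 2) shows pkt3 to pkt5 being buffered until pkt2 fills the gap.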

Selective repeat in action

[Figure: sender/receiver timeline, sender window (N=4) sliding over sequence numbers 0 to 8]

sender:
• send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
• rcv ack0, send pkt4
• rcv ack1, send pkt5
• record ack3 arrived
• pkt 2 timeout: send pkt2
• record ack4 arrived
• record ack5 arrived

receiver:
• receive pkt0, send ack0
• receive pkt1, send ack1
• receive pkt3, buffer, send ack3
• receive pkt4, buffer, send ack4
• receive pkt5, buffer, send ack5
• rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?

Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

[Figure: sender window and receiver window (after receipt) over sequence numbers 0 1 2 3 0 1 2, in two scenarios]
(a) no problem: sender sends pkt0, pkt1, pkt2, which are received and ACKed; sender then sends pkt3, which is lost (X), and a new pkt0; receiver will accept packet with seq number 0
(b) oops!: sender sends pkt0, pkt1, pkt2, which are received, but all three ACKs are lost (XXX); sender times out and retransmits pkt0; receiver will accept packet with seq number 0

• receiver can't see sender side
• receiver behavior identical in both cases!
• something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
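One way to explore the question is to test, for a given sequence-number space and window size, whether the worst case of scenario (b) can occur. The sketch below assumes that worst case: the receiver has advanced by a full window while the sender, having lost every ACK, still retransmits the old window. The function name is illustrative.

```python
def sr_is_ambiguous(seq_space, window):
    """True if a retransmitted old packet can look like a new one."""
    # seq #'s the sender may still retransmit (its old window)
    old = {i % seq_space for i in range(0, window)}
    # seq #'s the receiver now accepts as new (its advanced window)
    new = {i % seq_space for i in range(window, 2 * window)}
    return bool(old & new)


for n in (1, 2, 3):
    print(n, sr_is_ambiguous(seq_space=4, window=n))
```

With 4 sequence numbers the overlap appears at window size 3 but not at 2, consistent with the usual rule that the window must be no larger than half the sequence-number space.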

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no “message boundaries”
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver

TCP segment structure

[Figure: TCP segment layout, 32 bits wide]
• source port #, dest port #
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• head len, not used, flag bits U A P R S F
  – URG: urgent data (generally not used)
  – ACK: ACK # valid
  – PSH: push data now
  – RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
• Urg data pointer
• options (variable length)
• application data (variable length)

TCP seq. numbers, ACKs

sequence numbers:
  – byte stream “number” of first byte in segment's data
acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK
Q: how receiver handles out-of-order segments
  – A: TCP spec doesn't say, up to implementor

[Figure: outgoing segment from sender and incoming segment to sender (fields: source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer); sender sequence number space divided into: sent ACKed; sent, not-yet ACKed (“in-flight”); usable but not yet sent; not usable; window size N]

Byte stream in TCP

[Figure labels: Window: N bytes; HTTP Get Message (K bytes); 100th byte; TCP header (seq no = 100); M bytes; Cannot be transmitted now]

TCP seq. numbers, ACKs

simple telnet scenario (Host A, Host B):
• User types ‘C’: Host A sends Seq=42, ACK=79, data = ‘C’
• host ACKs receipt of ‘C’, echoes back ‘C’: Host B sends Seq=79, ACK=43, data = ‘C’
• host ACKs receipt of echoed ‘C’: Host A sends Seq=43, ACK=80

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT “smoother”
  – average several recent measurements, not just current SampleRTT

  $EstimatedRTT = (1-\alpha) \cdot EstimatedRTT + \alpha \cdot SampleRTT$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Figure: RTT (milliseconds, 100 to 350) versus time (seconds), gaia.cs.umass.edu to fantasia.eurecom.fr, showing SampleRTT and EstimatedRTT]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus “safety margin”
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

  $DevRTT = (1-\beta) \cdot DevRTT + \beta \cdot |SampleRTT - EstimatedRTT|$

  (typically, β = 0.25)

  $TimeoutInterval = EstimatedRTT + 4 \cdot DevRTT$

  (EstimatedRTT: estimated RTT; 4·DevRTT: “safety margin”)
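The two moving averages and the timeout rule fit in a small class. Two choices in this sketch are assumptions not stated above: the first sample initializes EstimatedRTT and sets DevRTT to half of it, and DevRTT is updated using the previous EstimatedRTT before EstimatedRTT itself is updated (the order used by RFC 6298). The class name is illustrative.

```python
class RttEstimator:
    """EWMA-based RTT estimate and retransmission timeout."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2       # assumed initial deviation

    def update(self, sample_rtt):
        # deviation first, measured against the previous estimate
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    @property
    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)   # milliseconds
est.update(120.0)
print(est.estimated_rtt, est.dev_rtt, est.timeout_interval)
```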

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks

let's initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control

TCP sender events

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments

TCP sender (simplified)

initially (Λ):
    NextSeqNum = InitialSeqNum
    SendBase = InitialSeqNum

state wait for event:
• data received from application above:
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., “send”)
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
        start timer
• timeout:
    retransmit not-yet-acked segment with smallest seq. #
    start timer
• ACK received, with ACK field value y:
    if (y > SendBase) {
        SendBase = y
        /* SendBase–1: last cumulatively ACKed byte */
        if (there are currently not-yet-acked segments)
            start timer
        else stop timer
    }
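The simplified sender can be sketched in Python as below. As in the FSM, duplicate ACKs, flow control and congestion control are ignored. The timer is reduced to a boolean flag and `send` is a caller-supplied callable standing in for passing a segment to IP; both are assumptions for illustration.

```python
class SimpleTcpSender:
    """Simplified TCP sender: cumulative ACKs, single retransmission timer."""

    def __init__(self, initial_seq, send):
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.unacked = {}               # seq # of first byte -> segment data
        self.timer_running = False
        self.send = send                # stand-in for "pass segment to IP"

    def on_app_data(self, data):
        seq = self.next_seq
        self.unacked[seq] = data
        self.send(seq, data)
        self.next_seq += len(data)
        if not self.timer_running:
            self.timer_running = True   # start timer

    def on_timeout(self):
        if self.unacked:
            seq = min(self.unacked)     # not-yet-acked segment, smallest seq #
            self.send(seq, self.unacked[seq])
            self.timer_running = True   # restart timer

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y          # bytes up to y-1 are ACKed
            for seq in [s for s, d in self.unacked.items() if s + len(d) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)


log = []
tcp = SimpleTcpSender(92, lambda seq, data: log.append((seq, len(data))))
tcp.on_app_data(b"x" * 8)
tcp.on_app_data(b"y" * 20)
tcp.on_ack(100)
tcp.on_timeout()
print(log, tcp.send_base)
```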

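TCP: retransmission scenarios

lost ACK scenario (Host A, Host B):
• A sends Seq=92, 8 bytes of data
• B sends ACK=100, which is lost (X)
• A times out and resends Seq=92, 8 bytes of data
• B sends ACK=100

premature timeout (Host A, Host B):
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data (SendBase=92)
• B sends ACK=100, then ACK=120
• A times out before the ACKs arrive and resends Seq=92, 8 bytes of data
• the ACKs arrive: SendBase=100, then SendBase=120
• B answers the retransmission with ACK=120 (SendBase=120)

cumulative ACK (Host A, Host B):
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
• B sends ACK=100, which is lost (X), then ACK=120
• ACK=120 arrives before the timeout, so A continues with Seq=120, 15 bytes of data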

TCP ACK generation [RFC 5681]

event at receiver → TCP receiver action
• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed → delayed ACK: wait up to 500ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment, higher-than-expect seq #; gap detected → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap → immediate send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data (“triple duplicate ACKs”), resend unacked segment with smallest seq #
  – likely that unacked segment lost, so don't wait for timeout

[Figure: fast retransmit after sender receipt of triple duplicate ACK (Host A, Host B): A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost (X); B sends ACK=100 four times (3 DUP ACKs); A resends Seq=100, 20 bytes of data before the timeout expires]
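Duplicate-ACK counting can be sketched with a small helper. It assumes “triple duplicate” means three ACKs repeating an ACK number that was already received once, as in the figure (four ACK=100 in total); the class name is illustrative.

```python
class DupAckDetector:
    """Signals fast retransmit on the third duplicate of the same ACK."""

    def __init__(self):
        self.last_ack = None
        self.dup_count = 0

    def on_ack(self, acknum):
        """Return True when the segment starting at acknum should be resent."""
        if acknum == self.last_ack:
            self.dup_count += 1
            return self.dup_count == 3
        self.last_ack = acknum          # new ACK: reset the counter
        self.dup_count = 0
        return False


d = DupAckDetector()
print([d.on_ack(a) for a in (100, 100, 100, 100)])
```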

TCP flow control

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

[Figure: receiver protocol stack: application process above the TCP socket receiver buffers, TCP code and IP code in the OS; segments arrive from sender. Application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)]

• receiver “advertises” free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked (“in-flight”) data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering: TCP segment payloads enter RcvBuffer, which holds buffered data and free buffer space (rwnd); data leaves to application process]
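A minimal sketch of the bookkeeping on both sides follows. The receiver-side formula (free space = buffer size minus bytes received but not yet read) follows the usual textbook formulation and is an assumption here, as are the function and variable names.

```python
def advertised_rwnd(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free buffer space the receiver advertises as rwnd."""
    buffered = last_byte_rcvd - last_byte_read
    return rcv_buffer - buffered


def sender_may_send(last_byte_sent, last_byte_acked, rwnd, nbytes):
    """True if nbytes more keeps unacked data within the receiver's rwnd."""
    return (last_byte_sent - last_byte_acked) + nbytes <= rwnd


rwnd = advertised_rwnd(rcv_buffer=4096, last_byte_rcvd=3000, last_byte_read=1000)
print(rwnd, sender_may_send(5000, 4000, rwnd, 1000))
```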

Connection Management

before exchanging data, sender/receiver “handshake”:
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

[Figure: client and server, each with application and network layers, each holding connection state: ESTAB and connection variables: seq # client-to-server, server-to-client; rcvBuffer size at server, client]

client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

[Figure: client (initial state CLOSED) and server (initial state LISTEN) timelines]
• client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client state SYNSENT
• server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server state SYN RCVD
• client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client state ESTAB
• server: received ACK(y) indicates client is live; server state ESTAB
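The sequence and acknowledgement numbers of the three segments can be traced with a short sketch. Segments are modeled as plain dictionaries and no networking is involved; the field names and the function name are illustrative assumptions.

```python
def three_way_handshake(x, y):
    """Return the three segments exchanged and the final (client, server) states."""
    client_state, server_state = "CLOSED", "LISTEN"

    syn = {"SYN": 1, "seq": x}                        # client -> server
    client_state = "SYN_SENT"

    synack = {"SYN": 1, "ACK": 1, "seq": y,           # server -> client
              "acknum": syn["seq"] + 1}
    server_state = "SYN_RCVD"

    ack = {"ACK": 1, "acknum": synack["seq"] + 1}     # client -> server
    client_state = "ESTAB"
    server_state = "ESTAB"                            # on receiving the ACK

    return [syn, synack, ack], client_state, server_state


segments, client, server = three_way_handshake(x=100, y=300)
for seg in segments:
    print(seg)
```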

TCP 3-way handshake: FSM

• closed → SYN sent: on Socket clientSocket = new Socket("hostname", "port number"); send SYN(seq=x)
• SYN sent → ESTAB: on SYNACK(seq=y, ACKnum=x+1); send ACK(ACKnum=y+1)
• closed → listen: on Socket connectionSocket = welcomeSocket.accept(); Λ (no action)
• listen → SYN rcvd: on SYN(x); send SYNACK(seq=y, ACKnum=x+1), create new socket for communication back to client
• SYN rcvd → ESTAB: on ACK(ACKnum=y+1); Λ (no action)

TCP: closing a connection

• client, server each close their side of connection
  – send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

[Figure: client and server timelines, both starting in ESTAB]
• client calls clientSocket.close() and sends FINbit=1, seq=x; client state FIN_WAIT_1: can no longer send but can receive data
• server replies ACKbit=1; ACKnum=x+1; server state CLOSE_WAIT: can still send data; client state FIN_WAIT_2: wait for server close
• server sends FINbit=1, seq=y; server state LAST_ACK: can no longer send data
• client replies ACKbit=1; ACKnum=y+1; client state TIMED_WAIT: timed wait for 2*max segment lifetime, then CLOSED
• server state CLOSED

The “Two Army Problem”

Principles of congestion control

congestion:
• informally: “too many sources sending too much data too fast for network to handle”
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
• a top-10 problem!

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity

[Figure: Host A and Host B send original data at rate λ_in through a router with unlimited shared output link buffers; throughput: λ_out. Plots: λ_out versus λ_in, both up to R/2; delay versus λ_in, up to R/2]

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: λ_in = λ_out
  – transport-layer input includes retransmissions: λ'_in ≥ λ_in

[Figure: Host A and Host B connected through a router with finite shared output link buffers; λ_in: original data; λ'_in: original data plus retransmitted data; λ_out]

Causes/costs of congestion: scenario 2

idealization: perfect knowledge
• sender sends only when router buffers available

[Figure: Host A keeps a copy and sends through a router with finite shared output link buffers toward Host B; λ_in: original data; λ'_in: original data plus retransmitted data; λ_out; one panel with free buffer space, one with no buffer space; plot of λ_out versus λ_in, both up to R/2]

Causes/costs of congestion: scenario 2

Idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost

[Figure: Host A, Host B; λ_in: original data; λ'_in: original data plus retransmitted data; λ_out; free buffer space]

Causes/costs of congestion: scenario 2

Idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost

[Figure: plot of λ_out versus λ_in, both up to R/2] when sending at R/2, some packets are retransmissions but asymptotic goodput is still R/2 (why?)

Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered

[Figure: Host A keeps a copy and resends after a timeout; free buffer space at the router; plot of λ_out versus λ_in, both up to R/2] when sending at R/2, some packets are retransmissions including duplicated that are delivered!

Causes/costs of congestion: scenario 2

[Figure: plot of λ_out versus λ_in, both up to R/2] when sending at R/2, some packets are retransmissions including duplicated that are delivered!

“costs” of congestion:
• more work (retransmissions) for a given “goodput”
• unneeded retransmissions: link carries multiple copies of a packet, decreasing goodput

Causescosts of congestion scenario 2

92

Realistic duplicatesv packets can be lost dropped

at router due to full buffersv sender times out prematurely

sending two copies both of which are delivered

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

93

Q: what happens as λin and λ'in increase?
A: as red λ'in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

[Figure: Hosts A, B, C and D connected through routers with finite shared output link buffers. λin: original data; λ'in: original data plus retransmitted data; λout: throughput.]

Causes/costs of congestion: scenario 3

94

another "cost" of congestion:
• when packet dropped, any upstream transmission capacity used for that packet was wasted!

[Plot: throughput of blue traffic (λout, up to R/2) versus offered load by Host A (λ'in). The curve falls at high load: bandwidth wastage for packets dropped at the 2nd router.]

Approaches towards congestion control

95

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

96

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss

[Figure: AIMD sawtooth behavior, probing for bandwidth. y-axis: cwnd, the TCP sender congestion window size; x-axis: time. The window size is increased additively until loss occurs, then cut in half.]
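
The sawtooth can be reproduced with a few lines of Python. This is only a sketch under a strong assumption that is not in the slides: the path has a fixed capacity (the hypothetical parameter `capacity_mss`), and a loss is detected in exactly those RTTs where cwnd exceeds it. The function name `aimd_trace` is likewise illustrative.

```python
def aimd_trace(capacity_mss, rounds, cwnd=1.0):
    """Return cwnd (in MSS) at the start of each RTT under idealized AIMD.

    Assumption: a loss is detected in any RTT where cwnd exceeds
    capacity_mss, a hypothetical fixed path capacity.
    """
    trace = []
    for _ in range(rounds):
        trace.append(cwnd)
        if cwnd > capacity_mss:
            cwnd = cwnd / 2      # multiplicative decrease: halve after loss
        else:
            cwnd = cwnd + 1      # additive increase: +1 MSS per RTT
    return trace


trace = aimd_trace(capacity_mss=8, rounds=12, cwnd=5.0)
print(trace)
```

Plotting `trace` against its index gives the sawtooth: a linear climb, a halving, and another linear climb.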

TCP Congestion Control: details

• sender limits transmission:
  $\text{LastByteSent} - \text{LastByteAcked} \le \text{cwnd}$
• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
  $\text{rate} \approx \dfrac{\text{cwnd}}{\text{RTT}}$ bytes/sec

97

[Figure: sender sequence number space, showing the last byte ACKed, the bytes sent but not yet ACKed ("in-flight"), and the last byte sent; the in-flight span is bounded by cwnd.]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

98

[Figure: Host A sends one segment to Host B, then two segments one RTT later, then four segments; time runs downward.]
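
To see why adding 1 MSS per ACK doubles the window each RTT, the following sketch counts the window in segments. It assumes every segment sent in an RTT is ACKed within that RTT and that nothing is lost; the helper name `slow_start_rounds` is made up for illustration.

```python
MSS = 1  # count the window in segments for simplicity


def slow_start_rounds(ssthresh_segments, cwnd=1):
    """Return cwnd at the start of each RTT until it reaches ssthresh."""
    history = [cwnd]
    while cwnd < ssthresh_segments:
        acks = cwnd            # one ACK per segment sent during this RTT
        for _ in range(acks):
            cwnd += MSS        # +1 MSS per ACK, so cwnd doubles per RTT
        history.append(cwnd)
    return history


print(slow_start_rounds(16))
```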

TCP: detecting, reacting to loss

• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate ACKs)

99

TCP: switching from slow start to CA

100

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

101

initially (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; start in slow start

state "slow start":
  new ACK
    cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK
    dupACKcount++
  cwnd > ssthresh
    Λ (no action) → "congestion avoidance"
  dupACKcount == 3
    ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment → "fast recovery"
  timeout
    ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment

state "congestion avoidance":
  new ACK
    cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
  duplicate ACK
    dupACKcount++
  dupACKcount == 3
    ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment → "fast recovery"
  timeout
    ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment → "slow start"

state "fast recovery":
  duplicate ACK
    cwnd = cwnd + MSS; transmit new segment(s), as allowed
  new ACK
    cwnd = ssthresh; dupACKcount = 0 → "congestion avoidance"
  timeout
    ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment → "slow start"
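
The state machine above can be sketched as a small Python class. This is an illustration, not a real TCP implementation: the class name `RenoSender`, the 1460-byte MSS, and the reading of "ssthresh + 3" as "ssthresh + 3 MSS" are assumptions, and actual segment transmission and retransmission are left out.

```python
MSS = 1460  # bytes; an assumed segment size


class RenoSender:
    """Tracks cwnd, ssthresh and state only; sending is not modelled."""

    def __init__(self):
        self.state = "slow_start"
        self.cwnd = MSS
        self.ssthresh = 64 * 1024
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh            # deflate the window
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += MSS                     # exponential growth
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += MSS * MSS / self.cwnd   # about +1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += MSS
            return
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS  # assumption: "+3" means 3 MSS
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = MSS
        self.dup_acks = 0
        self.state = "slow_start"


s = RenoSender()
s.on_new_ack()
for _ in range(3):
    s.on_dup_ack()
print(s.state, s.cwnd, s.ssthresh)
```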

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg. window size (number of in-flight bytes) is 3/4 W
  – avg. throughput is 3/4 W per RTT

102

$\text{avg TCP throughput} = \dfrac{3}{4}\,\dfrac{W}{\text{RTT}}$ bytes/sec

[Figure: sawtooth of window size over time, oscillating between W/2 and W, with average 3/4 W.]

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

  $\text{TCP throughput} = \dfrac{1.22 \cdot \text{MSS}}{\text{RTT}\,\sqrt{L}}$

• to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate!
• new versions of TCP for high-speed
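
The two numbers on this slide can be checked directly. The sketch below plugs the slide's example values into the window-size and loss-rate relations; the variable names are illustrative.

```python
MSS_BITS = 1500 * 8      # 1500-byte segments, in bits
RTT = 0.100              # seconds
TARGET_BPS = 10e9        # 10 Gbps

# Window needed to keep the pipe full: throughput * RTT, in segments.
w_segments = TARGET_BPS * RTT / MSS_BITS

# Invert throughput = 1.22 * MSS / (RTT * sqrt(L)) to solve for L.
loss = (1.22 * MSS_BITS / (RTT * TARGET_BPS)) ** 2

print(round(w_segments), loss)
```

The result is about 83,333 segments in flight and a tolerable loss probability of roughly 2·10^-10, matching the slide.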

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

104

[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

105

[Figure: Connection 2 throughput (y-axis, up to R) versus Connection 1 throughput (x-axis, up to R), with the equal bandwidth share line and the full bandwidth utilization line. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2". Marked points: (X1, Y1) where X1+Y1 = R, and (X2, Y2) where X2 = Y2.]
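
The convergence toward the equal-share line can be simulated. The sketch assumes an idealized, synchronized model that goes beyond what the slide states: both flows have the same RTT, both add 1 unit per round, and both see a loss in the same round whenever their combined rate exceeds the capacity. The name `two_flow_aimd` is hypothetical.

```python
def two_flow_aimd(x1, x2, capacity, rounds):
    """Idealized synchronized AIMD for two flows sharing one link."""
    for _ in range(rounds):
        if x1 + x2 > capacity:
            x1, x2 = x1 / 2, x2 / 2      # both see loss: halve both rates
        else:
            x1, x2 = x1 + 1, x2 + 1      # additive increase, slope 1
    return x1, x2


x1, x2 = two_flow_aimd(x1=80.0, x2=10.0, capacity=100.0, rounds=500)
print(x1, x2)
```

Additive increase leaves the gap between the two rates unchanged, while each halving cuts that gap in half, so the rates approach each other over repeated loss events.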

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2

106

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and is marked ECN=11 by a congested router; the destination returns a TCP ACK segment with ECE=1.]


rdt2.1: Example 1

26

[Figure: rdt2.1 FSMs. Sender states: "Wait for call 0 from above", "Wait for ACK or NAK 0", "Wait for call 1 from above", "Wait for ACK or NAK 1". Receiver states: "Wait for 0 from below", "Wait for 1 from below".]

highlighted receiver transition:
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
    extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)

rdt2.1: Example 1

27

highlighted sender transition:
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
    Λ (no action)

rdt2.1: Example 1

28

[Same FSM figure, no transition highlighted.]

rdt2.1: Example 2

29

[Figure: the same rdt2.1 sender and receiver FSMs as in Example 1.]

highlighted receiver transition:
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
    extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)

rdt2.1: Example 2

30

highlighted sender transition:
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
    udt_send(sndpkt)

rdt2.1: Example 2

31

highlighted receiver transition (duplicate packet, re-ACKed but not delivered):
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
    sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)

rdt2.1: Example 2

32

highlighted sender transition:
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
    Λ (no action)

rdt2.1: Example 2

33

[Same FSM figure, no transition highlighted.]

rdt2.1: discussion

sender:
• seq # added to pkt
• two seq. #'s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  – state must "remember" whether "expected" pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  – state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender

34

rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  – receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt

35

rdt2.2: sender, receiver fragments

36

sender FSM fragment:
  state "Wait for call 0 from above":
    rdt_send(data)
      sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt) → "Wait for ACK 0"
  state "Wait for ACK 0":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1))
      udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0)
      Λ (no action)

receiver FSM fragment:
  transition into "Wait for 0 from below":
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
      extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
  state "Wait for 0 from below":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt))
      udt_send(sndpkt)

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
  – checksum, seq. #, ACKs, retransmissions will be of help … but not enough

approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  – retransmission will be duplicate, but seq. #'s already handle this
  – receiver must specify seq # of pkt being ACKed
• requires countdown timer

37

rdt3.0 sender

38

state "Wait for call 0 from above":
  rdt_send(data)
    sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer → "Wait for ACK0"
  rdt_rcv(rcvpkt)
    Λ (no action)

state "Wait for ACK0":
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1))
    Λ (no action)
  timeout
    udt_send(sndpkt); start_timer
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0)
    stop_timer → "Wait for call 1 from above"

state "Wait for call 1 from above":
  rdt_send(data)
    sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer → "Wait for ACK1"
  rdt_rcv(rcvpkt)
    Λ (no action)

state "Wait for ACK1":
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,0))
    Λ (no action)
  timeout
    udt_send(sndpkt); start_timer
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1)
    stop_timer → "Wait for call 0 from above"

rdt3.0 in action

39

(a) no loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0

(b) packet loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 (pkt1 lost, X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0

rdt3.0 in action

40

(c) ACK loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 lost, X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 delayed)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  sender: rcv second ack1, do nothing
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

  $D_{trans} = \dfrac{L}{R} = \dfrac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \text{microsecs}$

41

• $U_{sender}$: utilization, the fraction of time sender is busy sending

  $U_{sender} = \dfrac{L/R}{RTT + L/R} = \dfrac{0.008}{30.008} = 0.00027$

• if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
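
The utilization figures can be verified numerically. The sketch below uses the slide's example values; the function name `sender_utilization` and its `n_pipelined` parameter (the number of packets sent back-to-back per RTT) are illustrative additions.

```python
L = 8000          # packet length, bits
R = 1e9           # link rate, bits/sec
RTT = 0.030       # seconds (2 x 15 ms propagation delay)


def sender_utilization(n_pipelined=1):
    """Fraction of time the sender is busy transmitting."""
    d_trans = L / R                       # time to push one packet out
    return n_pipelined * d_trans / (RTT + d_trans)


u1 = sender_utilization()                 # stop-and-wait
u3 = sender_utilization(3)                # three packets in flight
throughput_bytes = (L / 8) / (RTT + L / R)

print(u1, u3, throughput_bytes)
```

Stop-and-wait keeps the sender busy about 0.027% of the time, which corresponds to roughly 33 kB/sec on a 1 Gbps link; sending three packets per round trip triples the utilization.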

rdt3.0: stop-and-wait operation

42

[Figure: sender/receiver timeline. First packet bit transmitted at t = 0; last packet bit transmitted at t = L/R; first packet bit arrives; last packet bit arrives, send ACK; ACK arrives, send next packet at t = RTT + L/R.]

$U_{sender} = \dfrac{L/R}{RTT + L/R} = \dfrac{0.008}{30.008} = 0.00027$

Pipelined protocols

pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver

43

• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

44

[Figure: sender/receiver timeline. First packet bit transmitted at t = 0; last bit transmitted at t = L/R; first packet bit arrives; last packet bit arrives, send ACK; last bit of 2nd packet arrives, send ACK; last bit of 3rd packet arrives, send ACK; ACK arrives, send next packet at t = RTT + L/R.]

3-packet pipelining increases utilization by a factor of 3!

$U_{sender} = \dfrac{3L/R}{RTT + L/R} = \dfrac{0.024}{30.008} \approx 0.0008$

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  – doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  – when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  – when timer expires, retransmit only that unacked packet

45

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N, consecutive unacked pkts allowed

46

• ACK(n): ACKs all pkts up to, including seq # n - "cumulative ACK"
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

47

initially (Λ): base = 1; nextseqnum = 1

state "Wait":
  rdt_send(data)
    if (nextseqnum < base+N)
      sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
      udt_send(sndpkt[nextseqnum])
      if (base == nextseqnum)
        start_timer
      nextseqnum++
    else
      refuse_data(data)
  timeout
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    …
    udt_send(sndpkt[nextseqnum-1])
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
      stop_timer
    else
      start_timer
  rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    Λ (no action)


GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don't buffer): no receiver buffering!
  – re-ACK pkt with highest in-order seq #

49

initially (Λ): expectedseqnum = 1; sndpkt = make_pkt(0, ACK, chksum)

state "Wait":
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum)
    extract(rcvpkt,data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++
  default
    udt_send(sndpkt)


GBN in action

51

sender window (N=4), sequence numbers 0 to 8

  sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
  receiver: receive pkt0, send ack0
  receiver: receive pkt1, send ack1
  receiver: receive pkt3, discard, (re)send ack1
  sender: rcv ack0, send pkt4
  sender: rcv ack1, send pkt5
  sender: ignore duplicate ACK
  receiver: receive pkt4, discard, (re)send ack1
  receiver: receive pkt5, discard, (re)send ack1
  sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
  receiver: rcv pkt2, deliver, send ack2
  receiver: rcv pkt3, deliver, send ack3
  receiver: rcv pkt4, deliver, send ack4
  receiver: rcv pkt5, deliver, send ack5
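
The trace above can be replayed with a small simulation. It is a deliberately simplified, round-based sketch rather than the event-driven FSM: ACKs are assumed never to be lost, a "timeout" is modelled as a round in which no new cumulative ACK arrives, and the names `gbn_transfer` and `lost_first_tx` are hypothetical.

```python
def gbn_transfer(num_pkts, window, lost_first_tx):
    """Simulate Go-Back-N over a channel that drops the first copy of
    each sequence number in lost_first_tx. ACKs are never lost.
    Returns (delivered sequence numbers, log of all transmissions)."""
    base, next_seq = 0, 0          # sender state
    expected = 0                   # receiver state
    delivered, sent_log = [], []
    already_dropped = set()
    while base < num_pkts:
        # Sender: transmit everything the window currently allows.
        burst = []
        while next_seq < base + window and next_seq < num_pkts:
            burst.append(next_seq)
            next_seq += 1
        sent_log.extend(burst)
        # Channel and receiver: handle the burst in order.
        highest_ack = base - 1
        for seq in burst:
            if seq in lost_first_tx and seq not in already_dropped:
                already_dropped.add(seq)
                continue                     # packet lost in the channel
            if seq == expected:
                delivered.append(seq)        # in order: deliver
                expected += 1
            # Out-of-order packets are discarded; either way the
            # receiver ACKs the highest in-order sequence number.
            highest_ack = max(highest_ack, expected - 1)
        if highest_ack >= base:
            base = highest_ack + 1           # cumulative ACK slides window
        else:
            next_seq = base                  # timeout: go back to base
    return delivered, sent_log


delivered, sent_log = gbn_transfer(num_pkts=6, window=4, lost_first_tx={2})
print(delivered)
print(sent_log)
```

With pkt2 lost once, the log shows packets 3, 4 and 5 being sent twice even though only pkt2 was lost, which is the cost of discarding out-of-order packets at the receiver.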


Selective repeat

• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #'s
  – limits seq #s of sent, unACKed packets

53

Selective repeat sender receiver windows

54

Selective repeat

sender:
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

55

receiver:
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore

Selective repeat in action

56

sender window (N=4), sequence numbers 0 to 8

  sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
  receiver: receive pkt0, send ack0
  receiver: receive pkt1, send ack1
  receiver: receive pkt3, buffer, send ack3
  sender: rcv ack0, send pkt4
  sender: rcv ack1, send pkt5
  sender: record ack3 arrived
  receiver: receive pkt4, buffer, send ack4
  receiver: receive pkt5, buffer, send ack5
  sender: pkt 2 timeout; send pkt2
  sender: record ack4 arrived
  sender: record ack5 arrived
  receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

(a) no problem
  sender sends pkt0, pkt1, pkt2; receiver gets all three and advances its window to seq #'s 3, 0, 1; the ACKs arrive and the sender advances its window; pkt3 is sent but lost (X); sender then sends pkt0 (new data); receiver will accept packet with seq number 0.

(b) oops!
  sender sends pkt0, pkt1, pkt2; receiver gets all three and advances its window to seq #'s 3, 0, 1; all three ACKs are lost (X X X); sender times out and retransmits pkt0; receiver will accept packet with seq number 0.

receiver can't see sender side: receiver behavior identical in both cases! Something's (very) wrong!

• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
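
One way to explore the question is to check, for a given sequence-number space and window size, whether the receiver's advanced window can contain a sequence number that the sender may still retransmit. The sketch below considers the worst case from scenario (b), where the receiver has advanced by a full window and every ACK was lost. The function name `sr_window_is_safe` is made up for illustration, and the standard textbook result it reproduces (window size at most half the sequence-number space) is stated here as background rather than taken from the slide.

```python
def sr_window_is_safe(seq_space, window):
    """True if a retransmitted old packet can never fall inside the
    receiver's advanced window (worst case: all ACKs were lost)."""
    # Sequence numbers the sender may still retransmit.
    sender_window = {s % seq_space for s in range(0, window)}
    # Receiver window after accepting a full window of packets.
    receiver_window = {s % seq_space for s in range(window, 2 * window)}
    return sender_window.isdisjoint(receiver_window)


print(sr_window_is_safe(seq_space=4, window=3))
print(sr_window_is_safe(seq_space=4, window=2))
```

The slide's example (4 sequence numbers, window size 3) is reported as unsafe, while window size 2 is safe.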

58

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver

59

TCP segment structure

60

[Figure: TCP segment layout, 32 bits wide:
  source port # | dest port #
  sequence number
  acknowledgement number
  head len | not used | U A P R S F | receive window
  checksum | Urg data pointer
  options (variable length)
  application data (variable length)]

• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now (generally not used)
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• sequence and acknowledgement numbers: counting by bytes of data (not segments!)
• checksum: Internet checksum (as in UDP)

TCP seq. numbers, ACKs

sequence numbers:
  – byte stream "number" of first byte in segment's data
acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK
Q: how receiver handles out-of-order segments
  – A: TCP spec doesn't say, up to implementor

61

[Figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum and urg pointer. Sender sequence number space with window size N: sent and ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable.]

Byte stream in TCP

62

Window N bytes

HTTP Get Message (K bytes)

100th byte

TCP header(seq no = 100)

M bytes

HTTP Get Message (K bytes)

Cannot be transmitted now

TCP seq. numbers, ACKs

63

simple telnet scenario (Host A, Host B):
  Host A: user types 'C'; sends Seq=42, ACK=79, data = 'C'
  Host B: ACKs receipt of 'C', echoes back 'C'; sends Seq=79, ACK=43, data = 'C'
  Host A: ACKs receipt of echoed 'C'; sends Seq=43, ACK=80

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT

64

TCP round trip time, timeout

$\text{EstimatedRTT} = (1-\alpha)\cdot\text{EstimatedRTT} + \alpha\cdot\text{SampleRTT}$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: $\alpha = 0.125$

65

[Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr. x-axis: time (seconds), 1 to 106; y-axis: RTT (milliseconds), 100 to 350; curves: SampleRTT and EstimatedRTT.]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

  $\text{DevRTT} = (1-\beta)\cdot\text{DevRTT} + \beta\cdot|\text{SampleRTT} - \text{EstimatedRTT}|$

  (typically, $\beta = 0.25$)

66

$\text{TimeoutInterval} = \text{EstimatedRTT} + 4\cdot\text{DevRTT}$

(EstimatedRTT is the estimated RTT; 4·DevRTT is the "safety margin")
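
The three formulas combine into a single update step per RTT sample. The sketch below is illustrative: the function name `update_rtt` is made up, and it assumes DevRTT is updated using the EstimatedRTT value from before the current sample is folded in (the order used by RFC 6298); the slides do not specify the order.

```python
ALPHA = 0.125   # weight of the newest sample in EstimatedRTT
BETA = 0.25     # weight of the newest deviation in DevRTT


def update_rtt(estimated, dev, sample):
    """One update step; returns (EstimatedRTT, DevRTT, TimeoutInterval).

    Assumption: DevRTT uses the previous EstimatedRTT, i.e. it is
    updated before EstimatedRTT itself.
    """
    dev = (1 - BETA) * dev + BETA * abs(sample - estimated)
    estimated = (1 - ALPHA) * estimated + ALPHA * sample
    timeout = estimated + 4 * dev
    return estimated, dev, timeout


# Times in milliseconds: current estimate 100, deviation 10, new sample 140.
print(update_rtt(100.0, 10.0, 140.0))
```

A single late sample moves the estimate only slightly (100 ms to 105 ms) but widens the safety margin noticeably, so the timeout grows from 140 ms to 175 ms.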

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks

67

let's initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control

TCP sender events:

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments

68

TCP sender (simplified)

69

initially (Λ): NextSeqNum = InitialSeqNum; SendBase = InitialSeqNum

state "wait for event":
  data received from application above
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
      start timer
  timeout
    retransmit not-yet-acked segment with smallest seq. #
    start timer
  ACK received, with ACK field value y
    if (y > SendBase)
      SendBase = y
      /* SendBase–1: last cumulatively ACKed byte */
      if (there are currently not-yet-acked segments)
        start timer
      else stop timer

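TCP: retransmission scenarios

70

lost ACK scenario (Host A, Host B):
  A: sends Seq=92, 8 bytes of data
  B: sends ACK=100, which is lost (X)
  A: timeout; resends Seq=92, 8 bytes of data
  B: sends ACK=100

premature timeout (Host A, Host B):
  A: sends Seq=92, 8 bytes of data
  A: sends Seq=100, 20 bytes of data
  B: sends ACK=100
  B: sends ACK=120
  A: timeout before the ACKs arrive; resends Seq=92, 8 bytes of data
  B: sends ACK=120

[The figures annotate the sender timelines with SendBase=92, SendBase=100, SendBase=120 and SendBase=120.]

TCP: retransmission scenarios

71

cumulative ACK (Host A, Host B):
  A: sends Seq=92, 8 bytes of data
  A: sends Seq=100, 20 bytes of data
  B: sends ACK=100, which is lost (X)
  B: sends ACK=120, which arrives before the timeout
  A: sends Seq=120, 15 bytes of data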

TCP ACK generation [RFC 5681]

72

event at receiver → TCP receiver action

• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected
  → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap
  → immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs

73

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  – likely that unacked segment lost, so don't wait for timeout

TCP fast retransmit

74

fast retransmit after sender receipt of triple duplicate ACK (Host A, Host B):
  A: sends Seq=92, 8 bytes of data
  A: sends Seq=100, 20 bytes of data, which is lost (X)
  B: sends ACK=100
  B: sends ACK=100, ACK=100, ACK=100 for later segments (3 DUP ACKs)
  A: before the timeout expires, resends Seq=100, 20 bytes of data

TCP flow control

75

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)

[Figure: receiver protocol stack. Segments arrive from the sender, pass through IP code and TCP code in the OS into the TCP socket receiver buffers, and are read by the application process.]

TCP flow control

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

76

[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data and free buffer space (rwnd); data leaves to the application process.]

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

77

[Figure: client and server, each with an application above the network and the same connection information:
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client]

client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

80

client state: CLOSED; server state: LISTEN

  client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); state SYNSENT
  server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); state SYN RCVD
  client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; state ESTAB
  server: received ACK(y) indicates client is live; state ESTAB

TCP 3-way handshake: FSM

81

client side:
  closed
    Socket clientSocket = new Socket("hostname", "port number");
    send SYN(seq=x) → SYN sent
  SYN sent
    receive SYNACK(seq=y, ACKnum=x+1); send ACK(ACKnum=y+1) → ESTAB

server side:
  closed
    Socket connectionSocket = welcomeSocket.accept();
    Λ → listen
  listen
    receive SYN(x); send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client → SYN rcvd
  SYN rcvd
    receive ACK(ACKnum=y+1); Λ → ESTAB

TCP: closing a connection

• client, server each close their side of connection
  – send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

82

TCP: closing a connection

83

client state: ESTAB; server state: ESTAB

  client: clientSocket.close(); sends FINbit=1, seq=x; state FIN_WAIT_1 (can no longer send but can receive data)
  server: sends ACKbit=1, ACKnum=x+1; state CLOSE_WAIT (can still send data)
  client: state FIN_WAIT_2 (wait for server close)
  server: sends FINbit=1, seq=y; state LAST_ACK (can no longer send data)
  client: sends ACKbit=1, ACKnum=y+1; state TIMED_WAIT (timed wait for 2 × max segment lifetime), then CLOSED
  server: receives the ACK; state CLOSED

The "Two Army Problem"

84

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
• a top-10 problem!

85

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2

86

[Figure: Host A sends original data at rate λin through a router with unlimited shared output link buffers; Host B receives throughput λout. Plot 1: λout versus λin, both up to R/2. Plot 2: delay versus λin, growing without bound as λin approaches R/2.]

• large delays as arrival rate, λin, approaches capacity

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: λin = λout
  – transport-layer input includes retransmissions: λ'in ≥ λin

87

[Figure: Host A sends original data at rate λin, and original data plus retransmitted data at rate λ'in, into finite shared output link buffers; Host B receives at rate λout.]

Causescosts of congestion scenario 2

idealization perfect knowledgebull sender sends only when router

buffers available

88

finite shared output link buffers

lin original dataloutlin original data plus

retransmitted datacopy

free buffer space

R2

R2

l out

lin

Host B

A

lin original dataloutlin original data plus

retransmitted datacopy

no buffer space

Causes/costs of congestion: scenario 2

Idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
• when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)

[Figure: Host A keeps a copy of each packet and resends it when it is dropped at the router. Plot: λ_out versus λ'_in, axes marked at R/2.]

Causes/costs of congestion: scenario 2

Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered

"costs" of congestion:
• more work (retransmissions) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  – decreasing goodput

[Figure: a timeout at Host A triggers a second copy while free buffer space is available at the router. Plot: λ_out versus λ'_in, axes marked at R/2.]

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

[Figure: Hosts A, B, C and D send over multihop paths through routers with finite shared output link buffers. λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out: output.]

Causes/costs of congestion: scenario 3

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!

[Figure: λ_out versus λ'_in, axes marked at R/2: throughput of the blue traffic against the load offered by Host A, showing the bandwidth wastage for packets dropped at the 2nd router.]

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss

[Figure: AIMD sawtooth behavior, probing for bandwidth. cwnd (TCP sender congestion window size) versus time: additively increase window size until loss occurs, then cut window in half.]
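The sawtooth can be reproduced with a few lines of Python. This is a minimal sketch, not TCP itself: the function name `aimd_trace` and the parameter `capacity_mss` are made up for illustration, and loss is assumed to occur in any RTT in which cwnd exceeds that capacity.

```python
def aimd_trace(rounds, capacity_mss, cwnd=1.0):
    """Return cwnd (in MSS) at the start of each RTT under idealised AIMD.

    Assumption: a loss is detected in every RTT where cwnd exceeds
    capacity_mss, a hypothetical stand-in for the bottleneck.
    """
    trace = []
    for _ in range(rounds):
        trace.append(cwnd)
        if cwnd > capacity_mss:
            cwnd = max(1.0, cwnd / 2)   # multiplicative decrease: halve cwnd
        else:
            cwnd += 1.0                 # additive increase: +1 MSS per RTT
    return trace


trace = aimd_trace(rounds=12, capacity_mss=8)
print(trace)
```

The window climbs by 1 MSS per RTT, overshoots the assumed capacity, is halved, and starts climbing again.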

TCP Congestion Control: details

• sender limits transmission:

    LastByteSent - LastByteAcked ≤ cwnd

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

    rate ≈ cwnd / RTT bytes/sec

[Figure: sender sequence number space: last byte ACKed; sent, not-yet ACKed ("in-flight") bytes spanning cwnd; last byte sent.]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A sends one segment to Host B, then two segments one RTT later, then four segments, over time.]

TCP: detecting, reacting to loss

• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

[FSM with three states: slow start, congestion avoidance, fast recovery. Each transition is listed as event, then actions.]

initially (Λ), enter slow start:
    cwnd = 1 MSS
    ssthresh = 64 KB
    dupACKcount = 0

slow start:
    new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
    duplicate ACK: dupACKcount++
    cwnd > ssthresh (Λ): go to congestion avoidance
    timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
    dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery

congestion avoidance:
    new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
    duplicate ACK: dupACKcount++
    timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
    dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery

fast recovery:
    duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
    new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
    timeout: ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment; go to slow start
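The three-state machine summarised on this slide can be sketched as a small Python class. This is an illustrative model only: the class name `RenoSender`, the method names, and the choice to count everything in units of one MSS are assumptions, and no segments are actually transmitted or retransmitted.

```python
MSS = 1  # assumption: count cwnd and ssthresh in units of one MSS


class RenoSender:
    """Toy model of the slow start / congestion avoidance / fast recovery FSM."""

    def __init__(self, ssthresh=64):
        # The slide starts with ssthresh = 64 KB; 64 MSS here is an arbitrary stand-in.
        self.state = "slow_start"
        self.cwnd = 1 * MSS
        self.ssthresh = ssthresh
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh            # deflate window, leave fast recovery
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += MSS                     # +1 MSS per ACK: doubles per RTT
        else:
            self.cwnd += MSS * (MSS / self.cwnd) # about +1 MSS per RTT
        self.dup_acks = 0
        if self.state == "slow_start" and self.cwnd > self.ssthresh:
            self.state = "congestion_avoidance"

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += MSS
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                   # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1 * MSS
        self.dup_acks = 0
        self.state = "slow_start"


s = RenoSender(ssthresh=4)
for _ in range(4):
    s.on_new_ack()
print(s.state, s.cwnd)
```

With ssthresh set to 4 MSS, the fourth new ACK pushes cwnd past the threshold and the sender moves to congestion avoidance.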

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is 3/4 W
  – avg. throughput is 3/4 W per RTT

    $\text{avg TCP throughput} = \frac{3}{4} \cdot \frac{W}{RTT}$ bytes/sec

[Figure: sawtooth of window size oscillating between W/2 and W, with average 3/4 W.]

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

    $\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \sqrt{L}}$

  to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
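The two numbers on this slide can be checked by direct arithmetic. The sketch below assumes the formula throughput = 1.22 · MSS / (RTT · √L) with MSS converted to bits; the function names are made up for illustration.

```python
def window_segments(throughput_bps, rtt_s, mss_bytes):
    """Segments that must be in flight to sustain throughput_bps."""
    return throughput_bps * rtt_s / (mss_bytes * 8)


def required_loss_rate(throughput_bps, rtt_s, mss_bytes):
    """Invert throughput = 1.22 * MSS / (RTT * sqrt(L)) to solve for L."""
    root_l = 1.22 * mss_bytes * 8 / (rtt_s * throughput_bps)
    return root_l ** 2


# Slide example: 1500-byte segments, 100 ms RTT, 10 Gbps target.
w = window_segments(10e9, 0.100, 1500)
loss = required_loss_rate(10e9, 0.100, 1500)
print(round(w), loss)
```

The window comes out at about 83,333 segments and the loss rate at roughly 2·10^-10, matching the slide.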

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput versus Connection 1 throughput, both axes from 0 to R, with the equal bandwidth share line and the full bandwidth utilization line. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2". (X1, Y1) is a point where X1 + Y1 = R; (X2, Y2) is a point where X2 = Y2.]

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2
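The parallel-connection example follows from assuming that every TCP connection on the link gets an equal share. A minimal sketch of that arithmetic (the function name `app_share` is hypothetical):

```python
from fractions import Fraction


def app_share(existing_conns, new_conns):
    """Fraction of link rate R obtained by an app opening new_conns
    connections, assuming all connections share the link equally."""
    return Fraction(new_conns, existing_conns + new_conns)


print(app_share(9, 1))    # 1/10
print(app_share(9, 11))   # 11/20
```

With 11 of 20 connections the new app gets 11/20 of R, which the slide rounds to R/2.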

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical). A router changes ECN=00 to ECN=11 in the IP datagram; the destination returns a TCP ACK segment with ECE=1.]


rdt2.1: Example 1

[Figure: rdt2.1 sender FSM (states: Wait for call 0 from above, Wait for ACK or NAK 0, Wait for call 1 from above, Wait for ACK or NAK 1) and receiver FSM (states: Wait for 0 from below, Wait for 1 from below). Highlighted sender transition:
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
    Λ]

rdt2.1: Example 2

[Figure: the same sender and receiver FSMs, stepped through these transitions (event, then actions):
 • receiver:
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
    extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
 • sender:
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
    udt_send(sndpkt)
 • receiver:
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
    sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
 • sender:
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
    Λ]

rdt2.1: discussion

sender:
• seq # added to pkt
• two seq. #'s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  – state must "remember" whether "expected" pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  – state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender

rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  – receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt

rdt2.2: sender, receiver fragments

sender FSM fragment (event, then actions):

Wait for call 0 from above:
    rdt_send(data)
    sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to Wait for ACK 0

Wait for ACK 0:
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1))
    udt_send(sndpkt)

    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0)
    Λ

receiver FSM fragment (event, then actions):

Wait for 0 from below:
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt))
    udt_send(sndpkt)

transition into Wait for 0 from below:
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
    extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
  – checksum, seq. #, ACKs, retransmissions will be of help … but not enough

approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  – retransmission will be duplicate, but seq. #'s already handles this
  – receiver must specify seq # of pkt being ACKed
• requires countdown timer

rdt3.0 sender

[FSM with four states; each transition is listed as event, then actions.]

Wait for call 0 from above:
    rdt_send(data)
    sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer; go to Wait for ACK0

    rdt_rcv(rcvpkt)
    Λ

Wait for ACK0:
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1))
    Λ

    timeout
    udt_send(sndpkt); start_timer

    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0)
    stop_timer; go to Wait for call 1 from above

Wait for call 1 from above:
    rdt_send(data)
    sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer; go to Wait for ACK1

    rdt_rcv(rcvpkt)
    Λ

Wait for ACK1:
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0))
    Λ

    timeout
    udt_send(sndpkt); start_timer

    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1)
    stop_timer; go to Wait for call 0 from above

rdt3.0 in action

[Figure: four sender/receiver timelines.]

(a) no loss
• sender: send pkt0
• receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1
• receiver: rcv pkt1, send ack1
• sender: rcv ack1, send pkt0
• receiver: rcv pkt0, send ack0

(b) packet loss
• sender: send pkt0
• receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1; pkt1 is lost (X)
• sender: timeout, resend pkt1
• receiver: rcv pkt1, send ack1
• sender: rcv ack1, send pkt0
• receiver: rcv pkt0, send ack0

rdt3.0 in action

(c) ACK loss
• sender: send pkt0
• receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1
• receiver: rcv pkt1, send ack1; ack1 is lost (X)
• sender: timeout, resend pkt1
• receiver: rcv pkt1 (detect duplicate), send ack1
• sender: rcv ack1, send pkt0
• receiver: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
• sender: send pkt0
• receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1
• receiver: rcv pkt1, send ack1 (delayed)
• sender: timeout, resend pkt1
• sender: rcv ack1, send pkt0
• receiver: rcv pkt1 (detect duplicate), send ack1
• sender: rcv ack1, do nothing
• receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

    $D_{trans} = \frac{L}{R} = \frac{8000 \text{ bits}}{10^9 \text{ bits/sec}} = 8 \text{ microsecs}$

  – U_sender: utilization, fraction of time sender busy sending

    $U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$

  – if RTT = 30 msec, 1KB pkt every 30 msec: 33kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!

rdt3.0: stop-and-wait operation

[Figure: sender/receiver timeline.
 • first packet bit transmitted, t = 0
 • last packet bit transmitted, t = L / R
 • first packet bit arrives
 • last packet bit arrives, send ACK
 • ACK arrives, send next packet, t = RTT + L / R]

    $U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$

Pipelined protocols

pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

[Figure: sender/receiver timeline.
 • first packet bit transmitted, t = 0
 • last bit transmitted, t = L / R
 • first packet bit arrives
 • last packet bit arrives, send ACK
 • last bit of 2nd packet arrives, send ACK
 • last bit of 3rd packet arrives, send ACK
 • ACK arrives, send next packet, t = RTT + L / R]

3-packet pipelining increases utilization by a factor of 3!

    $U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$
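The utilization figures for stop-and-wait and for 3-packet pipelining can be recomputed with a short function. This is a sketch of the slide arithmetic only; the function and parameter names are invented for illustration.

```python
def sender_utilization(packet_bits, rate_bps, rtt_s, packets_in_flight=1):
    """Fraction of time the sender is busy: (n * L/R) / (RTT + L/R)."""
    d_trans = packet_bits / rate_bps          # transmission delay L/R in seconds
    return packets_in_flight * d_trans / (rtt_s + d_trans)


# Slide example: 8000-bit packets, 1 Gbps link, RTT = 30 ms.
u_stop_and_wait = sender_utilization(8000, 1e9, 0.030)
u_pipelined = sender_utilization(8000, 1e9, 0.030, packets_in_flight=3)
print(u_stop_and_wait, u_pipelined)
```

Even with three packets in flight the sender is busy for less than a tenth of a percent of the time, which is why larger windows are needed on fast links.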

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  – doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  – when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  – when timer expires, retransmit only that unacked packet

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: "cumulative ACK"
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

state: Wait

initially (Λ):
    base = 1
    nextseqnum = 1

rdt_send(data):
    if (nextseqnum < base+N) {
        sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
        udt_send(sndpkt[nextseqnum])
        if (base == nextseqnum)
            start_timer
        nextseqnum++
    }
    else
        refuse_data(data)

timeout:
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    …
    udt_send(sndpkt[nextseqnum-1])

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
        stop_timer
    else
        start_timer

rdt_rcv(rcvpkt) && corrupt(rcvpkt):
    Λ
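The sender FSM above translates fairly directly into code. The sketch below is a simplified model under stated assumptions: the class `GBNSender` and its method names are invented, `send_fn` is a hypothetical stand-in for udt_send, the timer is reduced to a boolean flag, and an extra guard ignores ACKs below the window base (the slide FSM has no such guard).

```python
class GBNSender:
    """Toy Go-Back-N sender following the extended FSM."""

    def __init__(self, window_size, send_fn):
        self.N = window_size
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}             # seq -> packet, kept until cumulatively ACKed
        self.timer_running = False   # stands in for start_timer / stop_timer
        self.send = send_fn          # hypothetical stand-in for udt_send

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False             # window full: refuse_data
        pkt = (self.nextseqnum, data)
        self.sndpkt[self.nextseqnum] = pkt
        self.send(pkt)
        if self.base == self.nextseqnum:
            self.timer_running = True
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        if acknum < self.base:       # added guard: duplicate or stale ACK, ignore
            return
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)
        self.base = acknum + 1       # cumulative ACK slides the window
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.send(self.sndpkt[seq])   # go back N: resend every unACKed pkt


sent = []
sender = GBNSender(window_size=4, send_fn=sent.append)
accepted = [sender.rdt_send(f"d{i}") for i in range(5)]
print(accepted, sent)
```

The fifth call is refused because four packets are already unacknowledged, which is the window limit N in the FSM.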


GBN: receiver extended FSM

state: Wait

initially (Λ):
    expectedseqnum = 1
    sndpkt = make_pkt(0, ACK, chksum)

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++

default:
    udt_send(sndpkt)

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don't buffer): no receiver buffering!
  – re-ACK pkt with highest in-order seq #


GBN in action

[Figure: sender/receiver timeline with sender window (N=4) sliding over sequence numbers 0 1 2 3 4 5 6 7 8.]

sender:
• send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, then (wait)
• rcv ack0, send pkt4
• rcv ack1, send pkt5
• ignore duplicate ACK
• pkt 2 timeout: send pkt2, send pkt3, send pkt4, send pkt5

receiver:
• receive pkt0, send ack0
• receive pkt1, send ack1
• receive pkt3, discard, (re)send ack1
• receive pkt4, discard, (re)send ack1
• receive pkt5, discard, (re)send ack1
• rcv pkt2, deliver, send ack2
• rcv pkt3, deliver, send ack3
• rcv pkt4, deliver, send ack4
• rcv pkt5, deliver, send ack5


Selective repeat

• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #'s
  – limits seq #s of sent, unACKed packets

Selective repeat: sender, receiver windows

[Figure: sender and receiver windows over the sequence number space.]

Selective repeat

sender:

data from above:
• if next available seq # in window, send pkt

timeout(n):
• resend pkt n, restart timer

ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver:

pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt

pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)

otherwise:
• ignore

Selective repeat in action

[Figure: sender/receiver timeline with sender window (N=4) sliding over sequence numbers 0 1 2 3 4 5 6 7 8.]

sender:
• send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, then (wait)
• rcv ack0, send pkt4
• rcv ack1, send pkt5
• record ack3 arrived
• pkt 2 timeout: send pkt2
• record ack4 arrived
• record ack5 arrived

receiver:
• receive pkt0, send ack0
• receive pkt1, send ack1
• receive pkt3, buffer, send ack3
• receive pkt4, buffer, send ack4
• receive pkt5, buffer, send ack5
• rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?

[Figure: sender window and receiver window (after receipt) over the sequence 0 1 2 3 0 1 2.
 (a) no problem: pkt0, pkt1, pkt2 are received and ACKed; the sender then sends pkt3, which is lost (X), and a new pkt0; receiver will accept packet with seq number 0.
 (b) oops!: pkt0, pkt1, pkt2 are received but all three ACKs are lost (X X X); the sender times out and retransmits pkt0; receiver will accept packet with seq number 0.
 receiver can't see sender side; receiver behavior identical in both cases! something's (very) wrong!]
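One way to explore the question is to model the worst case from scenario (b): every ACK for the first window is lost, so the sender may still retransmit seq #'s 0..N-1 while the receiver has already advanced to the next N seq #'s. The sketch below is an illustration under that assumption; the function name `sr_is_ambiguous` is made up.

```python
def sr_is_ambiguous(seq_space, window):
    """True if a retransmitted old packet could be taken for a new one.

    Assumed worst case: the sender's window is still the first `window`
    seq #'s, while the receiver's window has moved on to the next `window`
    seq #'s, all taken modulo seq_space.
    """
    old_window = {s % seq_space for s in range(0, window)}
    new_window = {s % seq_space for s in range(window, 2 * window)}
    return bool(old_window & new_window)


print(sr_is_ambiguous(seq_space=4, window=3))   # slide example: True
print(sr_is_ambiguous(seq_space=4, window=2))   # False
```

Under this model the two windows stop overlapping once the window size is at most half the sequence number space, which is the usual answer given for this question.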

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver

TCP segment structure

[Figure: TCP segment layout, 32 bits wide.]
• source port #, dest port #
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• head len, not used, flag bits U A P R S F
  – URG: urgent data (generally not used)
  – ACK: ACK # valid
  – PSH: push data now
  – RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
• Urg data pointer
• options (variable length)
• application data (variable length)

TCP seq. numbers, ACKs

sequence numbers:
  – byte stream "number" of first byte in segment's data

acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK

Q: how receiver handles out-of-order segments
  – A: TCP spec doesn't say, up to implementor

[Figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer. Sender sequence number space with window size N: sent, ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable.]

Byte stream in TCP

[Figure labels: Window, N bytes; HTTP Get Message (K bytes); 100th byte; TCP header (seq no = 100); M bytes; "Cannot be transmitted now".]

TCP seq. numbers, ACKs

simple telnet scenario:
• Host A: user types 'C'
    Seq=42, ACK=79, data = 'C'
• Host B: host ACKs receipt of 'C', echoes back 'C'
    Seq=79, ACK=43, data = 'C'
• Host A: host ACKs receipt of echoed 'C'
    Seq=43, ACK=80

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT

TCP round trip time, timeout

    $EstimatedRTT = (1 - \alpha) \cdot EstimatedRTT + \alpha \cdot SampleRTT$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr. SampleRTT and EstimatedRTT in milliseconds (axis from 100 to 350) versus time in seconds (1 to 106).]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

    $DevRTT = (1 - \beta) \cdot DevRTT + \beta \cdot |SampleRTT - EstimatedRTT|$
    (typically, β = 0.25)

    $TimeoutInterval = EstimatedRTT + 4 \cdot DevRTT$

  where EstimatedRTT is the estimated RTT and 4 · DevRTT is the "safety margin".
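The three formulas combine into a small estimator. The sketch below is illustrative and rests on assumptions the slides do not state: the class name `RttEstimator` is invented, the estimate is seeded with the first sample, DevRTT starts at 0, and EstimatedRTT is updated before DevRTT (real implementations may order these steps differently).

```python
class RttEstimator:
    """EWMA estimate of RTT and its deviation, as in the formulas above."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample   # assumption: seed with first sample
        self.dev_rtt = 0.0                  # assumption: no deviation known yet

    def update(self, sample_rtt):
        # Assumed order: update EstimatedRTT first, then DevRTT against it.
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt   # estimate + safety margin


est = RttEstimator(first_sample=100.0)   # milliseconds
est.update(200.0)
print(est.estimated_rtt, est.dev_rtt, est.timeout_interval())
```

A single outlier sample moves the estimate only an eighth of the way toward it, while the deviation term widens the timeout so the outlier does not cause a premature retransmission.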

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks

let's initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control

TCP sender events

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments

TCP sender (simplified)

state: wait for event

initially (Λ):
    NextSeqNum = InitialSeqNum
    SendBase = InitialSeqNum

data received from application above:
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
        start timer

timeout:
    retransmit not-yet-acked segment with smallest seq. #
    start timer

ACK received, with ACK field value y:
    if (y > SendBase) {
        SendBase = y
        /* SendBase–1: last cumulatively ACKed byte */
        if (there are currently not-yet-acked segments)
            start timer
        else stop timer
    }
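The simplified sender can be modelled in a few lines. This sketch makes several assumptions for illustration: the class `SimpleTcpSender` and its method names are invented, `send_fn` is a hypothetical stand-in for passing a segment to IP, and the single retransmission timer is reduced to a boolean flag.

```python
class SimpleTcpSender:
    """Toy model of the simplified TCP sender (no dup ACKs, no flow control)."""

    def __init__(self, initial_seq, send_fn):
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.unacked = {}            # seq # of first byte -> segment data
        self.timer_running = False   # single retransmission timer
        self.send = send_fn          # hypothetical stand-in for "pass to IP"

    def on_app_data(self, data):
        seq = self.next_seq
        self.unacked[seq] = data
        self.send(seq, data)
        self.next_seq += len(data)   # seq #'s count bytes, not segments
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if not self.unacked:
            return
        seq = min(self.unacked)      # not-yet-acked segment with smallest seq #
        self.send(seq, self.unacked[seq])
        self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y       # bytes up to y-1 are cumulatively ACKed
            for seq in [s for s, d in self.unacked.items() if s + len(d) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)


wire = []
tcp = SimpleTcpSender(92, lambda seq, data: wire.append((seq, len(data))))
tcp.on_app_data(b"x" * 8)     # Seq=92, 8 bytes of data
tcp.on_app_data(b"y" * 20)    # Seq=100, 20 bytes of data
tcp.on_ack(120)               # one cumulative ACK covers both segments
print(wire, tcp.send_base, tcp.timer_running)
```

The usage lines use Seq=92 with 8 bytes and Seq=100 with 20 bytes, so a single ACK=120 acknowledges both segments and stops the timer.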

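TCP: retransmission scenarios

lost ACK scenario:
• Host A sends Seq=92, 8 bytes of data
• Host B replies ACK=100, which is lost (X)
• Host A times out and resends Seq=92, 8 bytes of data
• Host B replies ACK=100

premature timeout:
• Host A sends Seq=92, 8 bytes of data (SendBase=92), then Seq=100, 20 bytes of data
• Host B replies ACK=100, then ACK=120
• Host A times out before the ACKs arrive and resends Seq=92, 8 bytes of data
• ACK=100 arrives (SendBase=100), then ACK=120 (SendBase=120)
• Host B answers the retransmission with ACK=120 (SendBase=120)

TCP: retransmission scenarios

cumulative ACK:
• Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
• Host B replies ACK=100, which is lost (X), then ACK=120
• ACK=120 arrives before the timeout; Host A continues with Seq=120, 15 bytes of data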

TCP ACK generation [RFC 5681]

event at receiver → TCP receiver action

• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  → delayed ACK: wait up to 500ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment, higher-than-expected seq #; gap detected
  → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap
  → immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  – likely that unacked segment lost, so don't wait for timeout

TCP fast retransmit

[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost (X). Host B replies ACK=100 four times (the last three are the 3 DUP ACKs). Host A resends Seq=100, 20 bytes of data before the timeout expires.]

TCP flow control

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)

[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code into the TCP socket receiver buffers in the OS; the application process reads from them.]

TCP flow control

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); data leaves to the application process.]

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

[Figure: client and server each sit between application and network and hold:
    connection state: ESTAB
    connection variables:
        seq # client-to-server, server-to-client
        rcvBuffer size at server, client]

client:
    Socket clientSocket = new Socket("hostname", "port number");

server:
    Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

client state: CLOSED → SYNSENT → ESTAB
server state: LISTEN → SYN RCVD → ESTAB

• client: choose init seq num, x; send TCP SYN msg
    SYNbit=1, Seq=x
• server: choose init seq num, y; send TCP SYNACK msg, acking SYN
    SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1
• client: received SYNACK(x) indicates server is live; send ACK for SYNACK; this segment may contain client-to-server data
    ACKbit=1, ACKnum=y+1
• server: received ACK(y) indicates client is live
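The sequence and acknowledgement numbers in the three segments can be written out explicitly. This is a toy model, not a socket implementation: segments are plain dictionaries, and the function name and field names are chosen for illustration only.

```python
def three_way_handshake(x, y):
    """Return the three handshake segments for initial seq numbers x and y."""
    syn = {"SYNbit": 1, "Seq": x}                       # client -> server
    synack = {"SYNbit": 1, "Seq": y,
              "ACKbit": 1, "ACKnum": syn["Seq"] + 1}    # server -> client
    ack = {"ACKbit": 1, "ACKnum": synack["Seq"] + 1}    # client -> server
    return [syn, synack, ack]


for segment in three_way_handshake(x=1000, y=5000):
    print(segment)
```

Each side acknowledges the other's initial sequence number plus one, which is how both ends learn that their SYN was received.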

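TCP 3-way handshake: FSM

[FSM with states closed, listen, SYN rcvd, SYN sent, ESTAB; each transition is listed as event, then actions.]

closed → listen (server):
    Socket connectionSocket = welcomeSocket.accept();
    Λ

listen → SYN rcvd:
    SYN(x)
    SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client

SYN rcvd → ESTAB:
    ACK(ACKnum=y+1)
    Λ

closed → SYN sent (client):
    Socket clientSocket = new Socket("hostname", "port number");
    SYN(seq=x)

SYN sent → ESTAB:
    SYNACK(seq=y, ACKnum=x+1)
    ACK(ACKnum=y+1)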

TCP: closing a connection

• client, server each close their side of connection
  – send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

TCP: closing a connection

[Figure: client state and server state, both starting in ESTAB.
 • client: clientSocket.close(); sends FINbit=1, seq=x; enters FIN_WAIT_1 (can no longer send but can receive data)
 • server: replies ACKbit=1, ACKnum=x+1; enters CLOSE_WAIT (can still send data)
 • client: enters FIN_WAIT_2 (wait for server close)
 • server: sends FINbit=1, seq=y; enters LAST_ACK (can no longer send data)
 • client: replies ACKbit=1, ACKnum=y+1; enters TIMED_WAIT (timed wait for 2*max segment lifetime), then CLOSED
 • server: enters CLOSED]

The ldquoTwo Army Problemrdquo

84

Principles of congestion control

congestionbull informally ldquotoo many sources sending too much data

too fast for network to handlerdquobull different from flow controlbull manifestations

ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)

bull a top-10 problem

85

Causescosts of congestion scenario 1

bull two senders two receivers

bull one router infinite buffers

bull output link capacity Rbull no retransmission

bull maximum per-connection throughput R2

86

unlimited shared output link buffers

Host A

original data lin

Host B

throughput lout

R2

R2

l out

lin R2

dela

ylin

v large delays as arrival rate lin approaches capacity

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: λ_in = λ_out
  – transport-layer input includes retransmissions: λ'_in ≥ λ_in

[Figure: Host A sends λ_in (original data) and λ'_in (original data plus retransmitted data) into finite shared output link buffers; Host B receives λ_out.]

Causes/costs of congestion: scenario 2

idealization: perfect knowledge
• sender sends only when router buffers available

[Figure: Host A sends λ_in (original data) and λ'_in (original data plus retransmitted data) into finite shared output link buffers with free buffer space; plot of λ_out versus λ_in, both axes up to R/2.]

Causes/costs of congestion: scenario 2

idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost

[Figure: Host A keeps a copy of each packet; a packet that arrives when there is no buffer space is dropped, and the copy is resent when free buffer space is available.]

Causes/costs of congestion: scenario 2

idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost

[Figure: plot of λ_out versus λ_in, both axes up to R/2.]

when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)

Causes/costs of congestion: scenario 2

realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered

[Figure: Host A times out and sends a second copy of a packet although free buffer space exists; plot of λ_out versus λ_in, both axes up to R/2.]

when sending at R/2, some packets are retransmissions, including duplicates that are delivered!

“costs” of congestion:
• more work (retransmissions) for given “goodput”
• unneeded retransmissions: link carries multiple copies of a packet
  – decreasing goodput

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

[Figure: Hosts A, B, C and D connected over multihop paths through routers with finite shared output link buffers; each sender offers λ_in (original data) and λ'_in (original data plus retransmitted data); throughput is λ_out.]

Causes/costs of congestion: scenario 3

another “cost” of congestion:
• when packet dropped, any “upstream” transmission capacity used for that packet was wasted!

[Figure: throughput of blue traffic (λ_out, up to R/2) versus offered load by Host A (λ'_in, up to R/2); bandwidth is wasted for packets dropped at the 2nd router.]

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss

[Figure: AIMD sawtooth behavior (probing for bandwidth): cwnd, the TCP sender congestion window size, plotted against time; the window is additively increased until loss occurs, then cut in half.]
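The sawtooth can be reproduced with a few lines of Python. This is only a sketch under stated assumptions: the window is counted in MSS, and `loss_at` is a hypothetical capacity above which a loss is assumed to occur (real TCP detects loss through timeouts or duplicate ACKs, not by knowing the capacity).

```python
def aimd_trace(rounds, loss_at, cwnd=1.0):
    """Return cwnd (in MSS) at the start of each RTT round.

    loss_at is a hypothetical capacity: a loss is assumed in any round
    where cwnd exceeds it.
    """
    trace = []
    for _ in range(rounds):
        trace.append(cwnd)
        if cwnd > loss_at:
            cwnd = cwnd / 2      # multiplicative decrease: halve after loss
        else:
            cwnd = cwnd + 1      # additive increase: +1 MSS per RTT
    return trace


trace = aimd_trace(rounds=12, loss_at=8)
print(trace)
```

The printed list climbs by 1 MSS per round, drops to half once it exceeds the assumed capacity, and then climbs again, which is the sawtooth shape described above.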

TCP Congestion Control: details

• sender limits transmission:

  LastByteSent - LastByteAcked ≤ cwnd

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

  $\text{rate} \approx \frac{\text{cwnd}}{\text{RTT}}$ bytes/sec

[Figure: sender sequence number space, showing last byte ACKed, bytes sent but not yet ACKed (“in-flight”), and last byte sent; the in-flight span is bounded by cwnd.]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A sends one segment to Host B, two segments one RTT later, then four segments, plotted against time.]

TCP: detecting, reacting to loss

• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate ACKs)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

[FSM with three states: slow start, congestion avoidance, fast recovery]

initially (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; enter slow start

slow start:
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: no action (Λ); go to congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment

congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start

fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
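The three-state FSM can be written down as a small Python class. This is an illustrative sketch, not a real TCP stack: the class name `RenoSender` and its method names are made up here, the window is counted in units of one MSS, the initial ssthresh of 64 is a stand-in for the slide's 64 KB, and retransmission itself is left out so that only the window bookkeeping is visible.

```python
MSS = 1  # assumption: measure cwnd and ssthresh in units of one MSS


class RenoSender:
    """Window bookkeeping for slow start, congestion avoidance, fast recovery."""

    def __init__(self):
        self.state = "slow start"
        self.cwnd = 1 * MSS
        self.ssthresh = 64          # stand-in for the 64 KB initial value
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "fast recovery":
            self.cwnd = self.ssthresh             # deflate the window
            self.state = "congestion avoidance"
        elif self.state == "slow start":
            self.cwnd += MSS                      # doubles cwnd every RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion avoidance"
        else:
            self.cwnd += MSS * MSS / self.cwnd    # about +1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast recovery":
            self.cwnd += MSS
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                    # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1 * MSS
        self.dup_acks = 0
        self.state = "slow start"
```

For example, seven new ACKs followed by three duplicate ACKs move the sender from slow start into fast recovery with ssthresh = 4 and cwnd = 7; the next new ACK sets cwnd back to ssthresh and enters congestion avoidance.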

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is 3/4 W
  – avg. throughput is 3/4 W per RTT

  $\text{avg TCP throughput} = \frac{3}{4} \cdot \frac{W}{RTT}$ bytes/sec

[Figure: sawtooth of the window size oscillating between W/2 and W, with average 3/4 W.]

TCP Futures: TCP over “long, fat pipes”

• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

  $\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \sqrt{L}}$

• to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
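The two numbers in this example can be checked directly. The sketch below assumes only the values given above (1500-byte segments, 100 ms RTT, 10 Gbps) and the loss-rate formula as stated; the function names are illustrative.

```python
def window_segments(rate_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain rate_bps: (rate * RTT) / MSS."""
    return rate_bps * rtt_s / (mss_bytes * 8)


def required_loss_rate(rate_bps, rtt_s, mss_bytes):
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L."""
    return (1.22 * mss_bytes * 8 / (rtt_s * rate_bps)) ** 2


w = window_segments(10e9, 0.100, 1500)
loss = required_loss_rate(10e9, 0.100, 1500)
print(int(w), loss)
```

The window comes out at 83,333 segments and the loss rate at roughly 2·10^-10, in line with the figures quoted above.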

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R.]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput versus Connection 1 throughput, both axes up to R, with the equal bandwidth share line (points (X2, Y2) where X2 = Y2) and the full bandwidth utilization line (points (X1, Y1) where X1 + Y1 = R). The trajectory alternates between congestion avoidance (additive increase) and loss (decrease window by factor of 2).]
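A short simulation illustrates why the trajectory drifts toward the equal-share line. It is a sketch under a strong idealization that is assumed here for simplicity: both sessions have the same RTT and both see every loss at the same time. Additive increase leaves the gap between the two rates unchanged, and each halving cuts the gap in half.

```python
def two_flow_aimd(x, y, capacity, rounds):
    """Two AIMD flows on one bottleneck with synchronized loss (idealized)."""
    for _ in range(rounds):
        if x + y > capacity:
            x, y = x / 2, y / 2      # loss: both flows halve their rate
        else:
            x, y = x + 1, y + 1      # no loss: both add one unit per RTT
    return x, y


x, y = two_flow_aimd(x=1.0, y=70.0, capacity=100.0, rounds=500)
print(abs(x - y))
```

Starting from very unequal rates (1 and 70), the difference between the two flows shrinks toward zero, so each ends up near half of the capacity.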

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets about R/2 (11 of the 20 connections)

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical); the IP datagram leaves the source with ECN=00 and is marked ECN=11 by a congested router; the destination returns a TCP ACK segment with ECE=1.]

rdt2.1: Example 1

[Figure: rdt2.1 sender FSM states (Wait for call 0 from above, Wait for ACK or NAK 0, Wait for call 1 from above, Wait for ACK or NAK 1) and receiver FSM states (Wait for 0 from below, Wait for 1 from below).]

rdt2.1: Example 2

[Figure: the same sender and receiver FSMs, stepped through one transition at a time.]

• receiver, Wait for 0 from below:
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
  → extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
• sender, Wait for ACK or NAK 0:
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt))
  → udt_send(sndpkt)
• receiver, Wait for 1 from below:
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
  → sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
• sender, Wait for ACK or NAK 0:
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
  → Λ (no action)

rdt2.1: discussion

sender:
• seq # added to pkt
• two seq. #'s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  – state must “remember” whether “expected” pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  – state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender

rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  – receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt

rdt2.2: sender, receiver fragments

sender FSM fragment:
• Wait for call 0 from above
  – rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to Wait for ACK 0
• Wait for ACK 0
  – rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)): udt_send(sndpkt)
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0): Λ (no action)

receiver FSM fragment:
• Wait for 0 from below
  – rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)): udt_send(sndpkt)
• transition into Wait for 0 from below
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
  – checksum, seq. #, ACKs, retransmissions will be of help … but not enough

approach: sender waits “reasonable” amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  – retransmission will be duplicate, but seq. #'s already handle this
  – receiver must specify seq # of pkt being ACKed
• requires countdown timer

rdt3.0 sender

• Wait for call 0 from above
  – rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer; go to Wait for ACK0
  – rdt_rcv(rcvpkt): Λ (no action)
• Wait for ACK0
  – rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)): Λ (no action)
  – timeout: udt_send(sndpkt); start_timer
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0): stop_timer; go to Wait for call 1 from above
• Wait for call 1 from above
  – rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer; go to Wait for ACK1
  – rdt_rcv(rcvpkt): Λ (no action)
• Wait for ACK1
  – rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,0)): Λ (no action)
  – timeout: udt_send(sndpkt); start_timer
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1): stop_timer; go to Wait for call 0 from above
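The alternating-bit behavior of this sender, together with a matching receiver, can be sketched as a deterministic simulation. The assumptions are labelled in the code: losses are scripted rather than random, a lost packet or lost ACK is modelled simply as "timeout, then resend", and corruption is not modelled. The function name and parameters are illustrative.

```python
def stop_and_wait(messages, drop_data=(), drop_ack=()):
    """Alternating-bit transfer over a channel with scripted losses.

    drop_data / drop_ack hold transmission numbers (0-based, counted over
    all attempts) at which the data packet or its ACK is lost.
    Returns the delivered messages and the number of transmissions.
    """
    delivered, attempts = [], 0
    expected = 0                      # receiver: seq # it is waiting for
    for i, msg in enumerate(messages):
        seq = i % 2                   # sender alternates seq # 0, 1, 0, ...
        while True:
            n = attempts
            attempts += 1
            if n in drop_data:
                continue              # data lost: timeout, resend
            if seq == expected:       # new packet: deliver once
                delivered.append(msg)
                expected = 1 - expected
            # a duplicate is not delivered again, only re-ACKed
            if n in drop_ack:
                continue              # ACK lost: timeout, resend
            break                     # ACK received: move to next message
    return delivered, attempts


print(stop_and_wait(["a", "b", "c"], drop_data={1}, drop_ack={3}))
```

With one lost data packet and one lost ACK, every message is still delivered exactly once, at the cost of two extra transmissions.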

rdt3.0 in action

(a) no loss
• sender: send pkt0
• receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1
• receiver: rcv pkt1, send ack1
• sender: rcv ack1, send pkt0
• receiver: rcv pkt0, send ack0

(b) packet loss
• sender: send pkt0
• receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1; pkt1 is lost (X)
• sender: timeout, resend pkt1
• receiver: rcv pkt1, send ack1
• sender: rcv ack1, send pkt0
• receiver: rcv pkt0, send ack0

rdt3.0 in action

(c) ACK loss
• sender: send pkt0
• receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1
• receiver: rcv pkt1, send ack1; ack1 is lost (X)
• sender: timeout, resend pkt1
• receiver: rcv pkt1 (detect duplicate), send ack1
• sender: rcv ack1, send pkt0
• receiver: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
• sender: send pkt0
• receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1
• receiver: rcv pkt1, send ack1 (delayed)
• sender: timeout, resend pkt1
• sender: rcv ack1, send pkt0
• receiver: rcv pkt1 (detect duplicate), send ack1
• receiver: rcv pkt0, send ack0
• sender: rcv ack1, do nothing
• sender: rcv ack0, send pkt1

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

  $D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \text{microsecs}$

• U_sender: utilization, fraction of time sender busy sending

  $U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$

• if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!

rdt3.0: stop-and-wait operation

[Figure: sender/receiver timeline.]
• t = 0: first packet bit transmitted
• t = L/R: last packet bit transmitted
• first packet bit arrives; last packet bit arrives, send ACK
• t = RTT + L/R: ACK arrives, send next packet

  $U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$

Pipelined protocols

pipelining: sender allows multiple, “in-flight”, yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

[Figure: sender/receiver timeline with three packets in flight.]
• t = 0: first packet bit transmitted
• t = L/R: last bit transmitted
• first packet bit arrives; last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• t = RTT + L/R: ACK arrives, send next packet

3-packet pipelining increases utilization by a factor of 3!

  $U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} = 0.0008$
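The utilization figures for stop-and-wait and for 3-packet pipelining follow from the same formula, which can be checked numerically. This sketch uses only the example values from these slides (8000-bit packets, 1 Gbps link, 30 ms RTT); the function name and the `window` parameter are illustrative.

```python
def sender_utilization(packet_bits, link_bps, rtt_s, window=1):
    """Fraction of time the sender is busy when `window` packets go out per RTT."""
    d_trans = packet_bits / link_bps          # L / R
    return window * d_trans / (rtt_s + d_trans)


u1 = sender_utilization(8000, 1e9, 0.030)             # stop-and-wait
u3 = sender_utilization(8000, 1e9, 0.030, window=3)   # 3-packet pipelining
print(round(u1, 5), round(u3, 5))
```

Stop-and-wait keeps the sender busy about 0.027% of the time; sending three packets per round trip triples that figure, which is still far below full utilization.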

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  – doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  – when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  – when timer expires, retransmit only that unacked packet

Go-Back-N: sender

• k-bit seq # in pkt header
• “window” of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: “cumulative ACK”
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

initially (Λ):
  base = 1
  nextseqnum = 1

rdt_send(data):
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)

timeout:
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  …
  udt_send(sndpkt[nextseqnum-1])

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer

rdt_rcv(rcvpkt) && corrupt(rcvpkt):
  Λ (no action)
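The sender FSM translates almost line by line into Python. In this sketch the class and method names are illustrative, the `sent` list is a stand-in for udt_send so the behavior can be inspected, and the timer is reduced to a boolean flag; a real implementation would need an actual countdown timer and a finite sequence number space.

```python
class GBNSender:
    """Go-Back-N sender bookkeeping, following the FSM events above."""

    def __init__(self, window):
        self.N = window
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}              # seq # -> buffered payload
        self.timer_running = False
        self.sent = []                # log of transmitted seq #s (udt_send stand-in)

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False              # refuse_data: window is full
        self.sndpkt[self.nextseqnum] = data
        self.sent.append(self.nextseqnum)
        if self.base == self.nextseqnum:
            self.timer_running = True # start_timer for the oldest packet
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        self.base = acknum + 1        # cumulative ACK slides the window
        # stop the timer if nothing is in flight, otherwise restart it
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.sent.append(seq)     # resend every unACKed packet
```

With a window of 4, a fifth call to rdt_send is refused until an ACK advances base; a timeout then resends everything from base up to nextseqnum-1.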

GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don't buffer): no receiver buffering!
  – re-ACK pkt with highest in-order seq #

initially (Λ):
  expectedseqnum = 1
  sndpkt = make_pkt(0, ACK, chksum)

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++

default:
  udt_send(sndpkt)

GBN in action

sender window (N=4), sequence numbers 0 1 2 3 4 5 6 7 8

• sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
• receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, discard, (re)send ack1
• sender: rcv ack0, send pkt4; rcv ack1, send pkt5; ignore duplicate ACK
• receiver: receive pkt4, discard, (re)send ack1; receive pkt5, discard, (re)send ack1
• sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
• receiver: rcv pkt2, deliver, send ack2; rcv pkt3, deliver, send ack3; rcv pkt4, deliver, send ack4; rcv pkt5, deliver, send ack5

Selective repeat

• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #'s
  – limits seq #s of sent, unACKed packets

Selective repeat: sender, receiver windows

Selective repeat

sender

data from above:
• if next available seq # in window, send pkt

timeout(n):
• resend pkt n, restart timer

ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver

pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt

pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)

otherwise:
• ignore
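The receiver rules can be sketched as follows. This is a minimal illustration with assumptions: sequence numbers are plain integers that never wrap around (so the window-size question raised by the dilemma does not arise here), and the class and method names are made up for the example.

```python
class SRReceiver:
    """Selective repeat receiver: buffer out-of-order, deliver in order."""

    def __init__(self, window):
        self.N = window
        self.rcvbase = 0
        self.buffer = {}              # out-of-order packets, keyed by seq #
        self.delivered = []           # data handed to the upper layer

    def on_packet(self, n, data):
        """Return the seq # to ACK, or None if the packet is ignored."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            while self.rcvbase in self.buffer:    # deliver any in-order run
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1                 # advance the window
            return n
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n                              # already delivered: re-ACK
        return None                               # otherwise: ignore
```

Feeding it packets 0, 1, 3, 4, 5 and then 2 buffers packets 3 to 5 and delivers 2 through 5 in one run when packet 2 finally arrives.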

Selective repeat in action

sender window (N=4), sequence numbers 0 1 2 3 4 5 6 7 8

• sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
• receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, buffer, send ack3
• sender: rcv ack0, send pkt4; rcv ack1, send pkt5; record ack3 arrived
• receiver: receive pkt4, buffer, send ack4; receive pkt5, buffer, send ack5
• sender: pkt 2 timeout, send pkt2; record ack4 arrived; record ack5 arrived
• receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?

Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

[Figure: sender window (after receipt) and receiver window (after receipt) over the sequence number space 0 1 2 3 0 1 2, in two scenarios.]

(a) no problem
• sender sends pkt0, pkt1, pkt2; receiver gets all three and advances its window, so it will accept packet with seq number 0
• the ACKs arrive; sender sends pkt3 (lost, X) and then a new pkt0
• receiver accepts the packet with seq number 0, correctly, as new data

(b) oops!
• sender sends pkt0, pkt1, pkt2; receiver gets all three and advances its window, so it will accept packet with seq number 0
• all three ACKs are lost (X X X)
• sender times out and retransmits pkt0
• receiver accepts the packet with seq number 0 as new data

• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

receiver can't see sender side; receiver behavior identical in both cases! something's (very) wrong!

Q: what relationship between seq # size and window size to avoid problem in (b)?

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no “message boundaries”
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver

TCP segment structure

header rows, each 32 bits wide:
• source port # | dest port #
• sequence number
• acknowledgement number
• head len | not used | U A P R S F | receive window
• checksum | Urg data pointer
• options (variable length)
• application data (variable length)

field notes:
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)

TCP seq. numbers, ACKs

sequence numbers:
  – byte stream “number” of first byte in segment's data

acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK

Q: how receiver handles out-of-order segments
  – A: TCP spec doesn't say, up to implementor

[Figure: outgoing segment from sender and incoming segment to sender, each with the fields source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer. The sender sequence number space is divided into: sent ACKed; sent, not-yet ACKed (“in-flight”); usable but not yet sent; not usable. The window size is N.]

Byte stream in TCP

[Figure: an HTTP GET message of K bytes viewed as a byte stream; a TCP header with seq no = 100 precedes a segment of M bytes that starts at the 100th byte; the window covers N bytes, and bytes beyond it cannot be transmitted now.]

TCP seq. numbers, ACKs

simple telnet scenario (Host A, Host B):
• user types 'C': Host A sends Seq=42, ACK=79, data = 'C'
• Host B ACKs receipt of 'C', echoes back 'C': Seq=79, ACK=43, data = 'C'
• Host A ACKs receipt of echoed 'C': Seq=43, ACK=80

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT “smoother”
  – average several recent measurements, not just current SampleRTT

TCP round trip time, timeout

  $\text{EstimatedRTT} = (1 - \alpha) \cdot \text{EstimatedRTT} + \alpha \cdot \text{SampleRTT}$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr; SampleRTT and EstimatedRTT in milliseconds (100 to 350) plotted against time in seconds (1 to 106).]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus “safety margin”
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

  $\text{DevRTT} = (1 - \beta) \cdot \text{DevRTT} + \beta \cdot |\text{SampleRTT} - \text{EstimatedRTT}|$
  (typically, β = 0.25)

  $\text{TimeoutInterval} = \text{EstimatedRTT} + 4 \cdot \text{DevRTT}$
  (estimated RTT plus “safety margin”)
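The three update rules fit in a small class. This is a sketch with two assumptions that the slides do not specify: the estimate is seeded with the first sample and DevRTT starts at zero, and EstimatedRTT is updated before DevRTT in each step. The class name is illustrative.

```python
class RttEstimator:
    """EWMA estimate of RTT and its deviation, giving a timeout interval."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha, self.beta = alpha, beta
        self.estimated_rtt = first_sample   # assumption: seed with first sample
        self.dev_rtt = 0.0                  # assumption: start with zero deviation

    def update(self, sample_rtt):
        # assumption: EstimatedRTT is updated first, DevRTT uses the new value
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        return self.timeout_interval()

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)
print(est.update(200.0))   # one outlier sample widens the safety margin
```

A single 200 ms sample after a 100 ms estimate moves EstimatedRTT only to 112.5 ms, but the deviation term pushes the timeout interval up to 200 ms.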

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks

let's initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control

TCP sender events:

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments

TCP sender (simplified)

initially (Λ), then wait for event:
  NextSeqNum = InitialSeqNum
  SendBase = InitialSeqNum

event: data received from application above
  create segment, seq #: NextSeqNum
  pass segment to IP (i.e., “send”)
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
    start timer

event: timeout
  retransmit not-yet-acked segment with smallest seq #
  start timer

event: ACK received, with ACK field value y
  if (y > SendBase) {
    SendBase = y
    /* SendBase–1: last cumulatively ACKed byte */
    if (there are currently not-yet-acked segments)
      start timer
    else stop timer
  }
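The three events of the simplified sender can be sketched in Python. The names below are illustrative, the `wire` list is a stand-in for handing segments to IP, and the timer is a boolean flag rather than a real countdown; duplicate ACKs, flow control and congestion control are ignored, as in the simplified sender itself.

```python
class SimpleTcpSender:
    """Simplified TCP sender: one timer, cumulative ACKs, no dup-ACK handling."""

    def __init__(self, initial_seq):
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.unacked = {}             # seq # of first byte -> payload
        self.timer_running = False
        self.wire = []                # (seq, payload) log standing in for IP

    def send(self, data):
        self.unacked[self.next_seq] = data
        self.wire.append((self.next_seq, data))
        self.next_seq += len(data)    # seq #s count bytes, not segments
        self.timer_running = True     # start timer if not already running

    def on_timeout(self):
        if not self.unacked:
            return
        seq = min(self.unacked)       # not-yet-ACKed segment with smallest seq #
        self.wire.append((seq, self.unacked[seq]))
        self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y        # bytes up to y-1 are cumulatively ACKed
            done = [s for s, d in self.unacked.items() if s + len(d) <= y]
            for s in done:
                del self.unacked[s]
            self.timer_running = bool(self.unacked)
```

Sending 8 bytes at seq 92 and 20 bytes at seq 100 and then receiving only ACK=120 clears both segments at once, which is the effect of a cumulative ACK.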

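TCP retransmission scenarios

lost ACK scenario (Host A, Host B):
• A sends Seq=92, 8 bytes of data
• B replies ACK=100, which is lost (X)
• A times out and resends Seq=92, 8 bytes of data
• B replies ACK=100

premature timeout (Host A, Host B; SendBase=92 at the start):
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
• B replies ACK=100, then ACK=120
• A times out before the ACKs arrive and resends Seq=92, 8 bytes of data
• ACK=100 arrives: SendBase=100; ACK=120 arrives: SendBase=120
• B answers the duplicate segment with ACK=120; SendBase=120

TCP retransmission scenarios

cumulative ACK (Host A, Host B):
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
• B's ACK=100 is lost (X), but ACK=120 arrives before the timeout
• A does not retransmit and continues with Seq=120, 15 bytes of data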

TCP ACK generation [RFC 5681]

event at receiver → TCP receiver action

• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected
  → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap
  → immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data (“triple duplicate ACKs”), resend unacked segment with smallest seq #
  – likely that unacked segment lost, so don't wait for timeout

TCP fast retransmit

[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost (X). Host B sends ACK=100 four times; after the 3 duplicate ACKs, Host A resends Seq=100, 20 bytes of data, before the timeout expires.]

TCP flow control

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)

[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code into the TCP socket receiver buffers inside the OS; the application process reads from those buffers.]

TCP flow control

• receiver “advertises” free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked (“in-flight”) data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which is split into buffered data and free buffer space (rwnd); buffered data is passed to the application process.]
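The bookkeeping behind rwnd can be sketched with two small functions. The variable names (`last_byte_rcvd`, `last_byte_read`, and so on) are assumptions borrowed from common textbook notation, not from these slides, and the numbers in the example are arbitrary.

```python
def advertised_rwnd(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free buffer space: RcvBuffer minus data received but not yet read."""
    return rcv_buffer - (last_byte_rcvd - last_byte_read)


def can_send(n_bytes, last_byte_sent, last_byte_acked, rwnd):
    """Sender keeps unACKed ("in-flight") data within the advertised rwnd."""
    return (last_byte_sent - last_byte_acked) + n_bytes <= rwnd


rwnd = advertised_rwnd(rcv_buffer=4096, last_byte_rcvd=3000, last_byte_read=1000)
print(rwnd, can_send(1000, last_byte_sent=5000, last_byte_acked=4000, rwnd=rwnd))
```

With 2000 bytes still waiting in a 4096-byte buffer, the receiver advertises 2096 bytes; a sender with 1000 bytes in flight may send another 1000 bytes but not another 1500.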

Connection Management

before exchanging data, sender/receiver “handshake”:
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

[Figure: client and server each hold, between application and network: connection state: ESTAB; connection variables: seq # client-to-server and server-to-client, rcvBuffer size at server, client.]

client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

client state: CLOSED → SYNSENT → ESTAB
server state: LISTEN → SYN RCVD → ESTAB

• client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); enter SYNSENT
• server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); enter SYN RCVD
• client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; enter ESTAB
• server: received ACK(y) indicates client is live; enter ESTAB
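The numbering in the three segments can be made concrete with a short sketch. It only builds the header fields that matter for the handshake; representing a segment as a Python dict, and the function name, are choices made for this example.

```python
def three_way_handshake(x, y):
    """Return the SYN, SYNACK and ACK segments for initial seq numbers x and y."""
    syn = {"SYN": 1, "seq": x}                                  # client -> server
    synack = {"SYN": 1, "ACK": 1, "seq": y,
              "acknum": syn["seq"] + 1}                         # server -> client
    ack = {"ACK": 1, "acknum": synack["seq"] + 1}               # client -> server
    return [syn, synack, ack]


for segment in three_way_handshake(x=100, y=300):
    print(segment)
```

Each side acknowledges the other's initial sequence number plus one, which is how both learn that the other end is live.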

TCP 3-way handshake FSM

81

closed

L

listen

SYNrcvd

SYNsent

ESTAB

Socket clientSocket = newSocket(hostnameport number)

SYN(seq=x)

Socket connectionSocket = welcomeSocketaccept()

SYN(x)SYNACK(seq=yACKnum=x+1)create new socket for communication back to client

SYNACK(seq=yACKnum=x+1)ACK(ACKnum=y+1)ACK(ACKnum=y+1)

L

TCP closing a connection

bull client server each close their side of connectionndash send TCP segment with FIN bit = 1

bull respond to received FIN with ACKndash on receiving FIN ACK can be combined with own FIN

bull simultaneous FIN exchanges can be handled

82

FIN_WAIT_2

CLOSE_WAIT

FINbit=1 seq=y

ACKbit=1 ACKnum=y+1

ACKbit=1 ACKnum=x+1wait for server

close

can stillsend data

can no longersend data

LAST_ACK

CLOSED

TIMED_WAIT

timed wait for 2max

segment lifetime

CLOSED

TCP closing a connection

83

FIN_WAIT_1 FINbit=1 seq=xcan no longersend but canreceive data

clientSocketclose()

client state server stateESTABESTAB

The ldquoTwo Army Problemrdquo

84

Principles of congestion control

congestionbull informally ldquotoo many sources sending too much data

too fast for network to handlerdquobull different from flow controlbull manifestations

ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)

bull a top-10 problem

85

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λin, approaches capacity

[Figure: Host A and Host B send original data at rate λin into unlimited shared output link buffers; throughput: λout. Plots: λout vs. λin (levels off at R/2) and delay vs. λin (grows as λin approaches R/2)]

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: λin = λout
  – transport-layer input includes retransmissions: λ’in ≥ λin

[Figure: Host A and Host B; finite shared output link buffers; λin: original data; λ’in: original data, plus retransmitted data; λout]

Causes/costs of congestion: scenario 2 (continued)

idealization: perfect knowledge
• sender sends only when router buffers available

[Figure: sender keeps a copy of each packet; finite shared output link buffers with free buffer space; plot of λout vs. λin, both axes up to R/2]

Idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
• when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)

[Figure: packet arrives when there is no buffer space and is dropped; sender retransmits its copy once free buffer space is available; plot of λout vs. λ’in, both axes up to R/2]

Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered!

[Figure: sender timeout fires while the original is still queued; both copies are delivered; plot of λout vs. λ’in, both axes up to R/2]

“costs” of congestion:
• more work (retrans) for given “goodput”
• unneeded retransmissions: link carries multiple copies of pkt
  – decreasing goodput

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λin and λ’in increase?
A: as red λ’in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

[Figure: Host A, Host B, Host C, Host D connected through routers with finite shared output link buffers; λin: original data; λ’in: original data, plus retransmitted data; λout]

another “cost” of congestion:
• when packet dropped, any “upstream” transmission capacity used for that packet was wasted!

[Figure: λout vs. λ’in, both axes up to R/2; throughput by blue traffic vs. offered load by Host A; bandwidth wastage for packets dropped at the 2nd router]

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss

[Figure: AIMD saw tooth behavior: probing for bandwidth. x-axis: time; y-axis: cwnd, TCP sender congestion window size. Additively increase window size … until loss occurs (then cut window in half)]
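The saw tooth can be reproduced with a few lines of code. This is a sketch under simplifying assumptions: cwnd is counted in units of MSS, one step is one RTT, and the caller says in which RTTs a loss is detected. The function name `aimd_trace` and the floor of 1 MSS are choices made for this example.

```python
def aimd_trace(rtts, loss_at, cwnd=1.0):
    """Return cwnd (in MSS) at the start of each RTT.

    loss_at is a set of RTT indices in which a loss is detected.
    """
    trace = []
    for t in range(rtts):
        trace.append(cwnd)
        if t in loss_at:
            cwnd = max(cwnd / 2, 1.0)  # multiplicative decrease: cut cwnd in half
        else:
            cwnd += 1.0                # additive increase: +1 MSS per RTT
    return trace
```

Starting from 8 MSS with a loss detected in RTT 3, `aimd_trace(6, {3}, 8.0)` climbs 8, 9, 10, 11, then drops to 5.5 and resumes the linear climb.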

TCP Congestion Control: details

• sender limits transmission:
  LastByteSent - LastByteAcked ≤ cwnd
• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

rate ≈ cwnd / RTT bytes/sec

[Figure: sender sequence number space: last byte ACKed; sent, not-yet ACKed (“in-flight”); last byte sent; the in-flight span is bounded by cwnd]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A sends one segment to Host B in the first RTT, two segments in the second, four segments in the third; vertical axis: time]

TCP: detecting, reacting to loss

• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

FSM with three states (transitions written as event / action):

initially (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0 → slow start

slow start:
• new ACK / cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK / dupACKcount++
• cwnd > ssthresh / Λ → congestion avoidance
• dupACKcount == 3 / ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment → fast recovery
• timeout / ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment → slow start

congestion avoidance:
• new ACK / cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK / dupACKcount++
• dupACKcount == 3 / ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment → fast recovery
• timeout / ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment → slow start

fast recovery:
• duplicate ACK / cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK / cwnd = ssthresh; dupACKcount = 0 → congestion avoidance
• timeout / ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment → slow start
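The three-state FSM can be sketched as a small class. This is an illustrative model only, with several assumptions: cwnd and ssthresh are counted in units of MSS (so the initial ssthresh of 64 stands in for the slide's 64 KB), the "+ 3" on entering fast recovery is read as 3 MSS, and retransmission and sending are left out. The class name `RenoSender` and its method names are hypothetical.

```python
MSS = 1  # count the window in units of MSS to keep the numbers readable


class RenoSender:
    """Congestion-window bookkeeping for the slow start / CA / fast recovery FSM."""

    def __init__(self):
        self.state = "slow start"
        self.cwnd = 1 * MSS
        self.ssthresh = 64  # assumption: stands in for the 64 KB initial value
        self.dup_ack_count = 0

    def on_new_ack(self):
        if self.state == "fast recovery":
            self.cwnd = self.ssthresh            # deflate the window
            self.state = "congestion avoidance"
        elif self.state == "slow start":
            self.cwnd += MSS                     # doubles cwnd once per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion avoidance"
        else:
            self.cwnd += MSS * (MSS / self.cwnd)  # about +1 MSS per RTT
        self.dup_ack_count = 0

    def on_duplicate_ack(self):
        if self.state == "fast recovery":
            self.cwnd += MSS
            return
        self.dup_ack_count += 1
        if self.dup_ack_count == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1 * MSS
        self.dup_ack_count = 0
        self.state = "slow start"
```

Feeding it seven new ACKs, three duplicate ACKs, one new ACK and a timeout walks through all three states in turn.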

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is ¾ W
  – avg. throughput is 3/4 W per RTT

avg TCP throughput = (3/4) · W / RTT bytes/sec

[Figure: window size oscillates between W/2 and W, with average 3/4 W]

TCP Futures: TCP over “long, fat pipes”

• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

$\text{TCP throughput} = \dfrac{1.22 \cdot MSS}{RTT \cdot \sqrt{L}}$

• to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
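The two numbers on this slide can be checked directly. The sketch below assumes the window is simply the bandwidth-delay product divided by the segment size, and it inverts the throughput formula for L; the function names are made up for this example.

```python
def window_segments(rate_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain rate_bps over one RTT."""
    return rate_bps * rtt_s / (mss_bytes * 8)


def required_loss_rate(rate_bps, rtt_s, mss_bytes):
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L."""
    mss_bits = mss_bytes * 8
    return (1.22 * mss_bits / (rtt_s * rate_bps)) ** 2
```

With 1500-byte segments, a 100 ms RTT and a 10 Gbps target, the first function gives about 83,333 segments and the second about 2·10^-10.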

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput vs. Connection 1 throughput, both axes from 0 to R. Full bandwidth utilization line: points (X1, Y1) where X1 + Y1 = R. Equal bandwidth share line: points (X2, Y2) where X2 = Y2. The trajectory alternates between “congestion avoidance: additive increase” and “loss: decrease window by factor of 2”, moving toward the equal share line]
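The convergence argument can be illustrated numerically. The sketch below makes strong simplifying assumptions: both sessions have the same RTT, both see a loss whenever their combined rate exceeds the capacity, and rates move in unit steps. The function name `two_flow_aimd` is hypothetical.

```python
def two_flow_aimd(x1, x2, capacity, rounds):
    """Evolve two AIMD flows sharing one bottleneck; return their final rates."""
    for _ in range(rounds):
        if x1 + x2 > capacity:
            # Loss: both flows halve, which also halves the gap between them.
            x1, x2 = x1 / 2, x2 / 2
        else:
            # Additive increase: both grow equally, so the gap is unchanged.
            x1, x2 = x1 + 1, x2 + 1
    return x1, x2
```

Because additive increase preserves the difference between the two rates and every halving cuts it in two, a start as uneven as (80, 10) on a link of capacity 100 ends up with nearly equal rates.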

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2
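The parallel-connection example is simple arithmetic, sketched below under the assumption that every TCP connection on the link gets an equal share. The function name `app_share` is made up for this example.

```python
from fractions import Fraction


def app_share(existing_conns, new_conns):
    """Fraction of the link rate R obtained by an app opening new_conns connections."""
    return Fraction(new_conns, existing_conns + new_conns)
```

With 9 existing connections, one new connection gets 1/10 of R, while 11 new connections get 11/20 of R, which is slightly more than the R/2 quoted on the slide.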

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical); the IP datagram leaves the source with ECN=00 and is marked ECN=11 by a router; the TCP ACK segment returns with ECE=1]


rdt2.1: Example 2

[Figure: sender FSM with states Wait for call 0 from above, Wait for ACK or NAK 0, Wait for call 1 from above, Wait for ACK or NAK 1; receiver FSM with states Wait for 0 from below, Wait for 1 from below. The slides step through the following transitions (event / action):]

• receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
• sender: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)) / udt_send(sndpkt)
• receiver: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) / sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
• sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ

rdt2.1: discussion

sender:
• seq # added to pkt
• two seq. #’s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  – state must “remember” whether “expected” pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  – state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender

rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  – receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt

rdt2.2: sender, receiver fragments

sender FSM fragment (event / action):
• Wait for call 0 from above: rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt) → Wait for ACK 0
• Wait for ACK 0: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)) / udt_send(sndpkt)
• Wait for ACK 0: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0) / Λ

receiver FSM fragment (event / action):
• Wait for 0 from below: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)) / udt_send(sndpkt)
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) / extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt) → Wait for 0 from below

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
  – checksum, seq. #, ACKs, retransmissions will be of help … but not enough

approach: sender waits “reasonable” amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  – retransmission will be duplicate, but seq. #’s already handles this
  – receiver must specify seq # of pkt being ACKed
• requires countdown timer

rdt3.0 sender

FSM transitions (event / action):

• Wait for call 0 from above:
  – rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer → Wait for ACK0
  – rdt_rcv(rcvpkt) / Λ
• Wait for ACK0:
  – rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)) / Λ
  – timeout / udt_send(sndpkt); start_timer
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0) / stop_timer → Wait for call 1 from above
• Wait for call 1 from above:
  – rdt_send(data) / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer → Wait for ACK1
  – rdt_rcv(rcvpkt) / Λ
• Wait for ACK1:
  – rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,0)) / Λ
  – timeout / udt_send(sndpkt); start_timer
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1) / stop_timer → Wait for call 0 from above

rdt3.0 in action

(a) no loss
• sender: send pkt0 → receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1 → receiver: rcv pkt1, send ack1
• sender: rcv ack1, send pkt0 → receiver: rcv pkt0, send ack0

(b) packet loss
• sender: send pkt0 → receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1 → pkt1 lost (X)
• sender: timeout, resend pkt1 → receiver: rcv pkt1, send ack1
• sender: rcv ack1, send pkt0 → receiver: rcv pkt0, send ack0

(c) ACK loss
• sender: send pkt0 → receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1 → receiver: rcv pkt1, send ack1 → ack1 lost (X)
• sender: timeout, resend pkt1 → receiver: rcv pkt1 (detect duplicate), send ack1
• sender: rcv ack1, send pkt0 → receiver: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
• sender: send pkt0 → receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1 → receiver: rcv pkt1, send ack1 (ack1 delayed)
• sender: timeout, resend pkt1 → receiver: rcv pkt1 (detect duplicate), send ack1
• sender: rcv ack1, send pkt0 → receiver: rcv pkt0, send ack0
• sender: rcv ack1 (second copy), do nothing
• sender: rcv ack0, send pkt1

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

$D_{trans} = \dfrac{L}{R} = \dfrac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \text{microsecs}$

• U_sender: utilization, the fraction of time sender busy sending

$U_{sender} = \dfrac{L/R}{RTT + L/R} = \dfrac{0.008}{30.008} = 0.00027$

• if RTT = 30 msec, 1KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!

rdt3.0: stop-and-wait operation

[Figure: sender/receiver timeline]
• first packet bit transmitted, t = 0
• last packet bit transmitted, t = L / R
• first packet bit arrives
• last packet bit arrives, send ACK
• ACK arrives, send next packet, t = RTT + L / R

$U_{sender} = \dfrac{L/R}{RTT + L/R} = \dfrac{0.008}{30.008} = 0.00027$

Pipelined protocols

pipelining: sender allows multiple, “in-flight”, yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver

• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

[Figure: sender/receiver timeline]
• first packet bit transmitted, t = 0
• last bit transmitted, t = L / R
• first packet bit arrives
• last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• ACK arrives, send next packet, t = RTT + L / R

3-packet pipelining increases utilization by a factor of 3!

$U_{sender} = \dfrac{3L/R}{RTT + L/R} = \dfrac{0.024}{30.008} = 0.0008$
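The utilization figures for stop-and-wait and for 3-packet pipelining follow from one formula. The sketch below assumes all times are in seconds and caps the result at 1; the function name `sender_utilization` and its parameters are made up for this example.

```python
def sender_utilization(packet_bits, link_bps, rtt_s, window=1):
    """Fraction of time the sender is busy when `window` packets are sent per RTT."""
    d_trans = packet_bits / link_bps          # L / R
    return min(1.0, window * d_trans / (rtt_s + d_trans))
```

For the 8000-bit packet, 1 Gbps link and 30 ms RTT used above, `window=1` gives about 0.00027 and `window=3` gives three times that, about 0.0008.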

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  – doesn’t ack packet if there’s a gap
• sender has timer for oldest unacked packet
  – when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  – when timer expires, retransmit only that unacked packet

Go-Back-N: sender

• k-bit seq # in pkt header
• “window” of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: “cumulative ACK”
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

single state: Wait

initially (Λ):
  base=1
  nextseqnum=1

rdt_send(data):
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum,data,chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)

timeout:
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  …
  udt_send(sndpkt[nextseqnum-1])

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer

rdt_rcv(rcvpkt) && corrupt(rcvpkt):
  Λ
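The extended FSM above translates almost line for line into code. The following is a sketch, not a network implementation: the timer is reduced to a boolean flag, packets are `(seqnum, data)` tuples, and `send` is a caller-supplied callback standing in for udt_send. One deliberate deviation is labeled in the code: `base` is never moved backwards by a stale ACK.

```python
class GBNSender:
    """Go-Back-N sender bookkeeping (timer modeled as a flag only)."""

    def __init__(self, N, send):
        self.N = N
        self.send = send              # stand-in for udt_send
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}              # seqnum -> packet, kept for retransmission
        self.timer_running = False

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False              # window full: refuse_data
        pkt = (self.nextseqnum, data)
        self.sndpkt[self.nextseqnum] = pkt
        self.send(pkt)
        if self.base == self.nextseqnum:
            self.timer_running = True  # start_timer for the oldest in-flight pkt
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        # Assumption beyond the slide: ignore stale ACKs instead of moving base back.
        self.base = max(self.base, acknum + 1)
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.send(self.sndpkt[seq])  # go back N: resend every unacked pkt
```

With N=4, a fifth `rdt_send` is refused until an ACK slides the window, and a timeout resends every packet from `base` up to `nextseqnum-1`.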


GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don’t buffer): no receiver buffering!
  – re-ACK pkt with highest in-order seq #

single state: Wait

initially (Λ):
  expectedseqnum=1
  sndpkt = make_pkt(0,ACK,chksum)

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt,expectedseqnum):
  extract(rcvpkt,data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum,ACK,chksum)
  udt_send(sndpkt)
  expectedseqnum++

default:
  udt_send(sndpkt)


GBN in action

sender window (N=4) over seq #s 0 1 2 3 4 5 6 7 8

• sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
• receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, discard, (re)send ack1
• sender: rcv ack0, send pkt4; rcv ack1, send pkt5; ignore duplicate ACK
• receiver: receive pkt4, discard, (re)send ack1; receive pkt5, discard, (re)send ack1
• sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
• receiver: rcv pkt2, deliver, send ack2; rcv pkt3, deliver, send ack3; rcv pkt4, deliver, send ack4; rcv pkt5, deliver, send ack5


Selective repeat

• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #’s
  – limits seq #s of sent, unACKed packets

[Figure: Selective repeat: sender, receiver windows]

Selective repeat

sender

data from above:
• if next available seq # in window, send pkt

timeout(n):
• resend pkt n, restart timer

ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver

pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt

pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)

otherwise:
• ignore
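The receiver rules above can be sketched in a few lines. This is an illustration under one important assumption: sequence numbers are unbounded integers, so the wrap-around problem of a finite sequence space does not arise here. The class name `SRReceiver` and its fields are hypothetical.

```python
class SRReceiver:
    """Selective repeat receiver with window size N (unbounded seq numbers assumed)."""

    def __init__(self, N):
        self.N = N
        self.rcvbase = 0
        self.buffer = {}       # out-of-order packets waiting for the gap to fill
        self.delivered = []    # data handed to the upper layer, in order

    def on_pkt(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            # Deliver every in-order packet and advance the window.
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n           # already delivered: re-ACK so the sender can advance
        return None            # otherwise: ignore
```

Receiving pkt0, pkt1, pkt3 and then pkt2 with N=4 delivers p0 and p1 immediately, buffers p3, and releases p2 and p3 together once the gap is filled.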

Selective repeat in action

sender window (N=4) over seq #s 0 1 2 3 4 5 6 7 8

• sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
• receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, buffer, send ack3
• sender: rcv ack0, send pkt4; rcv ack1, send pkt5; record ack3 arrived
• receiver: receive pkt4, buffer, send ack4; receive pkt5, buffer, send ack5
• sender: pkt 2 timeout; send pkt2; record ack4 arrived; record ack5 arrived
• receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #’s: 0, 1, 2, 3
• window size=3

(a) no problem
• sender sends pkt0, pkt1, pkt2; all arrive; receiver window (after receipt) advances to 3, 0, 1
• ACKs for pkt0 and pkt1 arrive, the ACK for pkt2 is lost (X); sender window (after receipt) advances and the sender sends pkt3
• a new pkt0 follows; receiver will accept packet with seq number 0

(b) oops!
• sender sends pkt0, pkt1, pkt2; all arrive; receiver window (after receipt) advances to 3, 0, 1
• all three ACKs are lost (X X X)
• sender: timeout, retransmit pkt0
• receiver will accept packet with seq number 0

receiver can’t see sender side; receiver behavior identical in both cases! something’s (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
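One way to explore the question is to check whether the receiver's new window can contain a sequence number that the sender might still retransmit. The sketch below models exactly the worst case of scenario (b): the receiver has received a full window while the sender has received no ACKs. The function name `sr_ambiguous` is made up for this example.

```python
def sr_ambiguous(seq_space, window):
    """True if an old retransmission can be mistaken for a new packet."""
    # Seq #s the sender may still retransmit: the first full window.
    old = set(range(window))
    # Seq #s the receiver now accepts as new: the next full window, wrapped.
    new = {(window + i) % seq_space for i in range(window)}
    return bool(old & new)
```

For the slide's example (4 sequence numbers, window size 3) the two sets overlap, so the case is ambiguous; with window size 2 they do not. In this model the overlap disappears exactly when the window is at most half the sequence number space.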

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no “message boundaries”
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver

TCP segment structure

header fields (rows are 32 bits wide):
• source port #, dest port #
• sequence number: counting by bytes of data (not segments!)
• acknowledgement number: counting by bytes of data (not segments!)
• head len, not used, flag bits U A P R S F
  – URG: urgent data (generally not used)
  – ACK: ACK # valid
  – PSH: push data now (generally not used)
  – RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
• Urg data pointer
• options (variable length)
• application data (variable length)

TCP seq. numbers, ACKs

sequence numbers:
  – byte stream “number” of first byte in segment’s data

acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK

Q: how receiver handles out-of-order segments
  – A: TCP spec doesn’t say, up to implementor

[Figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number (A), rwnd, checksum, urg pointer. Sender sequence number space, with window size N: sent ACKed; sent, not-yet ACKed (“in-flight”); usable but not yet sent; not usable]

Byte stream in TCP

[Figure: an HTTP Get Message (K bytes) viewed as a byte stream; a window of N bytes covers the first M bytes, sent with a TCP header (seq. no = 100, the 100th byte); the rest of the message cannot be transmitted now]

TCP seq. numbers, ACKs

simple telnet scenario (Host A, Host B):
• User types ‘C’; Host A sends: Seq=42, ACK=79, data = ‘C’
• Host B ACKs receipt of ‘C’, echoes back ‘C’: Seq=79, ACK=43, data = ‘C’
• Host A ACKs receipt of echoed ‘C’: Seq=43, ACK=80

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT “smoother”
  – average several recent measurements, not just current SampleRTT

TCP round trip time, timeout

EstimatedRTT = (1 - α)·EstimatedRTT + α·SampleRTT

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Figure: RTT: gaia.cs.umass.edu to fantasia.eurecom.fr. x-axis: time (seconds), 1 to 106; y-axis: RTT (milliseconds), 100 to 350; series: SampleRTT, EstimatedRTT]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus “safety margin”
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

DevRTT = (1 - β)·DevRTT + β·|SampleRTT - EstimatedRTT|
(typically, β = 0.25)

TimeoutInterval = EstimatedRTT + 4·DevRTT
(estimated RTT plus “safety margin”)
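The two moving averages and the timeout rule combine into a small estimator. This sketch makes two assumptions that the slides do not state: the first sample initializes EstimatedRTT directly with DevRTT set to half of it, and DevRTT is updated before EstimatedRTT on each new sample (both follow the convention of RFC 6298). The class name `RttEstimator` is hypothetical.

```python
class RttEstimator:
    """EWMA-based RTT estimate and timeout interval (units follow the samples)."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2   # assumption: initial deviation

    def update(self, sample_rtt):
        # Deviation first, using the previous EstimatedRTT (assumed ordering).
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        return self.timeout_interval()

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt
```

Starting from a 100 ms sample, a second sample of 200 ms moves EstimatedRTT to 112.5 ms and DevRTT to 62.5 ms, giving a timeout interval of 362.5 ms.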

TCP reliable data transfer

• TCP creates rdt service on top of IP’s unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks

let’s initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control

TCP sender events

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments

TCP sender (simplified)

single state: wait for event

initially (Λ):
  NextSeqNum = InitialSeqNum
  SendBase = InitialSeqNum

data received from application above:
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., “send”)
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
    start timer

timeout:
  retransmit not-yet-acked segment with smallest seq. #
  start timer

ACK received, with ACK field value y:
  if (y > SendBase) {
    SendBase = y
    /* SendBase–1: last cumulatively ACKed byte */
    if (there are currently not-yet-acked segments)
      start timer
    else stop timer
  }
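The simplified sender can be written out as a small class. This is a sketch with the same simplifications as the slide (no duplicate-ACK handling, no flow or congestion control), plus a few assumptions of its own: the timer is a boolean flag, unacked segments are kept in a dict keyed by sequence number, and `send(seq, data)` is a caller-supplied stand-in for passing a segment to IP.

```python
class SimpleTcpSender:
    """Simplified TCP sender: cumulative ACKs and a single retransmission timer."""

    def __init__(self, initial_seq, send):
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.send = send            # stand-in for "pass segment to IP"
        self.unacked = {}           # seq # of first byte -> segment data
        self.timer_running = False

    def on_app_data(self, data):
        self.unacked[self.next_seq] = data
        self.send(self.next_seq, data)
        self.next_seq += len(data)
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if not self.unacked:
            return
        seq = min(self.unacked)     # not-yet-acked segment with smallest seq #
        self.send(seq, self.unacked[seq])
        self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y      # send_base - 1 is the last cumulatively ACKed byte
            for seq in [s for s in self.unacked if s < y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)
```

Sending 8 bytes at Seq=92 and 20 bytes at Seq=100, a timeout resends only the Seq=92 segment, and a later ACK=120 moves SendBase to 120 and stops the timer.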

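TCP: retransmission scenarios

lost ACK scenario (Host A, Host B):
• A sends Seq=92, 8 bytes of data
• B sends ACK=100, which is lost (X)
• A: timeout; resends Seq=92, 8 bytes of data
• B sends ACK=100; SendBase=100

premature timeout (Host A, Host B):
• SendBase=92: A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
• B sends ACK=100, then ACK=120
• A: timeout before the ACKs arrive; resends Seq=92, 8 bytes of data
• ACK=100 arrives: SendBase=100; ACK=120 arrives: SendBase=120
• B answers the retransmission with ACK=120; SendBase=120

cumulative ACK (Host A, Host B):
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
• B sends ACK=100, which is lost (X), then ACK=120
• ACK=120 arrives before the timeout, so nothing is retransmitted; A continues with Seq=120, 15 bytes of data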

TCP ACK generation [RFC 5681]

event at receiver → TCP receiver action

• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  → delayed ACK: wait up to 500ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment, higher-than-expect seq. #; gap detected
  → immediately send duplicate ACK, indicating seq. # of next expected byte
• arrival of segment that partially or completely fills gap
  → immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data (“triple duplicate ACKs”), resend unacked segment with smallest seq #
  – likely that unacked segment lost, so don’t wait for timeout

fast retransmit after sender receipt of triple duplicate ACK (Host A, Host B):
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost (X), then further segments
• B sends ACK=100 for the first segment, then ACK=100 again for each later segment that arrives: 3 DUP ACKs
• A resends Seq=100, 20 bytes of data, before the timeout expires
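Duplicate-ACK counting can be sketched as a scan over the ACK numbers a sender receives. The sketch assumes that "triple duplicate" means three duplicates after the original ACK (four identical ACK values in total), which matches the dupACKcount == 3 test in the congestion control FSM; the function name `fast_retransmit_points` is made up for this example.

```python
def fast_retransmit_points(acks):
    """Return the ACK numbers at which a fast retransmit would be triggered."""
    triggers = []
    last_ack = None
    dup_count = 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == 3:          # third duplicate of the same ACK
                triggers.append(ack)    # resend the segment starting at this seq #
        else:
            last_ack = ack
            dup_count = 0
    return triggers
```

For the scenario above, the ACK stream 100, 100, 100, 100 triggers a retransmission of the segment starting at byte 100.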

TCP flow control

flow control: receiver controls sender, so sender won’t overflow receiver’s buffer by transmitting too much, too fast

application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)

[Figure: receiver protocol stack: segments from sender pass through IP code and TCP code into the TCP socket receiver buffers (OS), from which the application process reads]

TCP flow control

• receiver “advertises” free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked (“in-flight”) data to receiver’s rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering: TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); buffered data goes to application process]
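The advertised window and the sender-side check can be sketched as two small functions. The names LastByteRcvd, LastByteRead, LastByteSent and LastByteAcked are the usual textbook variables, not taken from this slide, so treat the exact formulation as an assumption.

```python
def advertised_rwnd(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free buffer space: RcvBuffer minus data received but not yet read by the app."""
    return rcv_buffer - (last_byte_rcvd - last_byte_read)


def may_send(nbytes, last_byte_sent, last_byte_acked, rwnd):
    """True if sending nbytes more keeps in-flight data within the receiver's rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + nbytes <= rwnd
```

With a 4096-byte RcvBuffer holding 2000 unread bytes, rwnd is 2096; a sender with 1000 bytes in flight may send another 1000 bytes but not another 1200.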

Connection Management

before exchanging data, sender/receiver “handshake”:
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

[Figure: client and server each hold connection state (ESTAB) and connection variables (seq # client-to-server and server-to-client, rcvBuffer size at server and client), between application and network]

Socket clientSocket = new Socket("hostname", "port number");

Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

80

SYNbit=1 Seq=x

choose init seq num xsend TCP SYN msg

ESTAB

SYNbit=1 Seq=yACKbit=1 ACKnum=x+1

choose init seq num ysend TCP SYNACKmsg acking SYN

ACKbit=1 ACKnum=y+1

received SYNACK(x) indicates server is livesend ACK for SYNACK

this segment may contain client-to-server data received ACK(y)

indicates client is live

SYNSENT

ESTAB

SYN RCVD

client stateCLOSED

server stateLISTEN

TCP 3-way handshake FSM

81

closed

L

listen

SYNrcvd

SYNsent

ESTAB

Socket clientSocket = newSocket(hostnameport number)

SYN(seq=x)

Socket connectionSocket = welcomeSocketaccept()

SYN(x)SYNACK(seq=yACKnum=x+1)create new socket for communication back to client

SYNACK(seq=yACKnum=x+1)ACK(ACKnum=y+1)ACK(ACKnum=y+1)

L

TCP closing a connection

bull client server each close their side of connectionndash send TCP segment with FIN bit = 1

bull respond to received FIN with ACKndash on receiving FIN ACK can be combined with own FIN

bull simultaneous FIN exchanges can be handled

82

client state: ESTAB; server state: ESTAB

1. client: clientSocket.close(); send FINbit=1, seq=x; client enters FIN_WAIT_1 (can no longer send but can receive data)
2. server: send ACKbit=1, ACKnum=x+1; server enters CLOSE_WAIT (can still send data); client enters FIN_WAIT_2 (wait for server close)
3. server: send FINbit=1, seq=y; server enters LAST_ACK (can no longer send data)
4. client: send ACKbit=1, ACKnum=y+1; client enters TIMED_WAIT (timed wait for 2 * max segment lifetime), then CLOSED; server enters CLOSED

The "Two Army Problem"

84

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
• a top-10 problem

85

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2

[Figure: Host A and Host B each send original data at rate λ_in into a router with unlimited shared output link buffers; throughput is λ_out. Plots: λ_out vs. λ_in (levelling off at R/2) and delay vs. λ_in]

• large delays as arrival rate λ_in approaches capacity

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: λ_in = λ_out
  – transport-layer input includes retransmissions: λ'_in ≥ λ_in

[Figure: Host A offers λ_in (original data) and λ'_in (original data plus retransmitted data) to finite shared output link buffers; Host B receives λ_out]

Causes/costs of congestion: scenario 2

idealization: perfect knowledge
• sender sends only when router buffers available

[Figure: Host A keeps a copy of each packet and sends only when the finite shared output link buffers have free buffer space, never when there is no buffer space. Plot: λ_out vs. λ_in, both axes up to R/2]

Causes/costs of congestion: scenario 2

Idealization: known loss. Packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost

[Plot: λ_out vs. λ'_in, both axes up to R/2] when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)

Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered

[Plot: λ_out vs. λ'_in, both axes up to R/2] when sending at R/2, some packets are retransmissions, including duplicates that are delivered

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  – decreasing goodput

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

[Figure: Hosts A, B, C, D connected through routers with finite shared output link buffers; λ_in: original data; λ'_in: original data plus retransmitted data; λ_out]

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted

Causes/costs of congestion: scenario 3

[Plot: throughput by blue traffic (λ_out, up to R/2) vs. offered load by Host A (λ'_in, up to R/2)]

Bandwidth wastage for packets dropped at the 2nd router

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss

[Plot: cwnd (TCP sender congestion window size) vs. time. AIMD sawtooth behavior, probing for bandwidth: additively increase window size until loss occurs, then cut window in half]
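The sawtooth can be reproduced with a few lines of code. The sketch below is illustrative only: the function name `aimd_trace`, the starting window and the idea of passing in the RTT rounds at which loss is detected are all assumptions made for the example.

```python
def aimd_trace(rounds, loss_rounds, start=10, mss=1):
    """Return cwnd (in MSS) at the start of each RTT round.

    loss_rounds is the set of round indices in which a loss is detected
    (a hypothetical input; a real sender infers loss from ACKs and timers).
    """
    cwnd = start
    trace = []
    for rtt in range(rounds):
        trace.append(cwnd)
        if rtt in loss_rounds:
            cwnd = max(mss, cwnd / 2)  # multiplicative decrease: halve cwnd
        else:
            cwnd += mss                # additive increase: +1 MSS per RTT
    return trace
```

For instance, `aimd_trace(6, {2})` gives window sizes 10, 11, 12, 6, 7, 8 (in MSS): linear growth, a halving at the loss, then linear growth again.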

TCP Congestion Control: details

• sender limits transmission: LastByteSent - LastByteAcked ≤ cwnd
• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

  rate ≈ cwnd / RTT bytes/sec

[Figure: sender sequence number space: last byte ACKed; sent, not-yet ACKed ("in-flight"); last byte sent; cwnd spans the in-flight bytes]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A sends one segment to Host B, then two segments one RTT later, then four segments]
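Why incrementing cwnd per ACK doubles it per RTT can be seen in a short sketch. It assumes no loss and exactly one ACK per segment sent; the function name `slow_start_trace` is made up for the example.

```python
def slow_start_trace(rtts, mss=1):
    """Return cwnd (in MSS) at the start of each RTT, assuming no loss."""
    cwnd = mss
    trace = [cwnd]
    for _ in range(rtts):
        acks = cwnd // mss      # one ACK comes back per segment sent this RTT
        for _ in range(acks):
            cwnd += mss         # cwnd grows by 1 MSS for every ACK received
        trace.append(cwnd)
    return trace
```

`slow_start_trace(3)` returns `[1, 2, 4, 8]`, the one, two, four segment pattern of the figure.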

TCP: detecting, reacting to loss

• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)

99

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

[FSM with three states: slow start, congestion avoidance, fast recovery]

initially (Λ): cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0; enter slow start

slow start:
• new ACK: cwnd = cwnd + MSS, dupACKcount = 0, transmit new segment(s) as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: Λ (no action); go to congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2, cwnd = ssthresh + 3, retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment; stay in slow start

congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd), dupACKcount = 0, transmit new segment(s) as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2, cwnd = ssthresh + 3, retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment; go to slow start

fast recovery:
• duplicate ACK: cwnd = cwnd + MSS, transmit new segment(s) as allowed
• new ACK: cwnd = ssthresh, dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2, cwnd = 1, dupACKcount = 0, retransmit missing segment; go to slow start
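The three-state FSM translates fairly directly into code. The class below is a sketch under stated assumptions, not a real TCP implementation: the name `RenoSender` is hypothetical, cwnd and ssthresh are kept in units of MSS, the initial ssthresh of 64 (MSS) is an arbitrary stand-in for the 64 KB on the slide, and retransmission and segment transmission are left out.

```python
class RenoSender:
    """Congestion-window bookkeeping only (hypothetical sketch)."""

    def __init__(self, mss=1.0, ssthresh=64.0):
        self.mss = mss
        self.cwnd = mss            # start with cwnd = 1 MSS
        self.ssthresh = ssthresh   # assumption: expressed in MSS units
        self.dup_ack_count = 0
        self.state = "slow_start"

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh              # deflate the window
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += self.mss                  # exponential growth
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += self.mss * (self.mss / self.cwnd)  # about +1 MSS/RTT
        self.dup_ack_count = 0

    def on_duplicate_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += self.mss                  # inflate per extra dup ACK
            return
        self.dup_ack_count += 1
        if self.dup_ack_count == 3:                # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = self.mss
        self.dup_ack_count = 0
        self.state = "slow_start"
```

Feeding the object a sequence of `on_new_ack()`, `on_duplicate_ack()` and `on_timeout()` calls and printing `state` and `cwnd` after each call is a quick way to trace the diagram by hand.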

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is 3/4 W
  – avg. throughput is 3/4 W per RTT

$\text{avg TCP throughput} = \frac{3}{4} \cdot \frac{W}{RTT}$ bytes/sec

[Plot: window size sawtooth oscillating between W/2 and W, averaging 3/4 W]

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

  $\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \sqrt{L}}$

  to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate
• new versions of TCP for high-speed
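The two numbers in this example can be checked with a short calculation. The helper names `window_segments` and `mathis_loss` are made up for this sketch; the second one simply solves the throughput formula above for L.

```python
def window_segments(rate_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain rate_bps over one RTT."""
    return rate_bps * rtt_s / (mss_bytes * 8)


def mathis_loss(rate_bps, rtt_s, mss_bytes):
    """Loss probability L that gives rate_bps in
    throughput = 1.22 * MSS / (RTT * sqrt(L))."""
    return (1.22 * mss_bytes * 8 / (rtt_s * rate_bps)) ** 2


w = window_segments(10e9, 0.100, 1500)   # roughly 83,333 segments
loss = mathis_loss(10e9, 0.100, 1500)    # roughly 2e-10
```

With 1500-byte segments, a 100 ms RTT and a 10 Gbps target, `w` comes out at about 83,333 segments and `loss` at about 2·10^-10, in line with the figures quoted above.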

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Plot: Connection 2 throughput vs. Connection 1 throughput, both axes from 0 to R, with the equal bandwidth share line (X2 = Y2) and the full bandwidth utilization line (X1 + Y1 = R). The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2", moving toward the equal share line]
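The convergence toward the equal-share line can be illustrated numerically. The sketch below assumes a simplified model that the slide only implies: both sessions have the same RTT, each adds one unit of rate per round, and both see a loss (and halve) in the same round whenever their combined rate exceeds R. The function name `two_flow_aimd` is hypothetical.

```python
def two_flow_aimd(x1, x2, capacity, rounds):
    """Evolve the rates of two AIMD flows sharing one bottleneck link."""
    for _ in range(rounds):
        if x1 + x2 > capacity:
            # Synchronized loss (assumption): both flows halve their rate,
            # which also halves the gap between them.
            x1, x2 = x1 / 2, x2 / 2
        else:
            # Additive increase: both grow equally, so the gap is unchanged.
            x1, x2 = x1 + 1, x2 + 1
    return x1, x2
```

Starting from an unequal split such as `two_flow_aimd(1, 41, 100, 500)`, the gap between the two rates shrinks by half at every loss event, so after a few hundred rounds the flows are practically equal.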

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2
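The arithmetic behind the parallel-connection example is a one-liner, assuming the bottleneck capacity is split equally among all TCP connections. The function name `app_share` is made up for the sketch.

```python
def app_share(existing_connections, new_connections):
    """Fraction of link rate R obtained by an app opening new_connections,
    assuming every connection gets an equal share of the bottleneck."""
    return new_connections / (existing_connections + new_connections)


one = app_share(9, 1)      # 1/10 of R
eleven = app_share(9, 11)  # 11/20 of R, i.e. roughly R/2
```

Opening 11 connections instead of 1 lifts the application's share from R/10 to 11R/20, a little more than the R/2 quoted above.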

106

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and is marked ECN=11 by a congested router; the destination returns a TCP ACK segment with ECE=1]

Page 30: ChapterIII: Transport Layer

rdt2.1: Example 2

[FSM walkthrough shown over four slides. Sender states: Wait for call 0 from above, Wait for ACK or NAK 0, Wait for call 1 from above, Wait for ACK or NAK 1. Receiver states: Wait for 0 from below, Wait for 1 from below]

Transitions highlighted in turn:
• sender: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isNAK(rcvpkt)); action: udt_send(sndpkt)
• receiver: rdt_rcv(rcvpkt) && not corrupt(rcvpkt) && has_seq0(rcvpkt); action: sndpkt = make_pkt(ACK, chksum), udt_send(sndpkt)
• sender: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt); action: Λ (none)

rdt2.1: discussion

sender:
• seq # added to pkt
• two seq. #'s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  – state must "remember" whether "expected" pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  – state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender

34

rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  – receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt

35

rdt2.2: sender, receiver fragments

sender FSM fragment:
• Wait for call 0 from above -> Wait for ACK 0
    on: rdt_send(data)
    action: sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
• Wait for ACK 0 (self-loop)
    on: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1))
    action: udt_send(sndpkt)
• Wait for ACK 0 -> next state
    on: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0)
    action: Λ (none)

receiver FSM fragment:
• Wait for 0 from below (self-loop)
    on: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt))
    action: udt_send(sndpkt)
• -> Wait for 0 from below
    on: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
    action: extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
  – checksum, seq. #, ACKs, retransmissions will be of help ... but not enough

approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  – retransmission will be duplicate, but seq. #'s already handles this
  – receiver must specify seq # of pkt being ACKed
• requires countdown timer

37

rdt3.0 sender

[FSM with four states: Wait for call 0 from above, Wait for ACK0, Wait for call 1 from above, Wait for ACK1]

Wait for call 0 from above:
• rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer; go to Wait for ACK0
• rdt_rcv(rcvpkt): Λ (none)

Wait for ACK0:
• rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)): Λ (none)
• timeout: udt_send(sndpkt); start_timer
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0): stop_timer; go to Wait for call 1 from above

Wait for call 1 from above:
• rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer; go to Wait for ACK1
• rdt_rcv(rcvpkt): Λ (none)

Wait for ACK1:
• rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,0)): Λ (none)
• timeout: udt_send(sndpkt); start_timer
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1): stop_timer; go to Wait for call 0 from above

rdt3.0 in action

(a) no loss
• sender: send pkt0
• receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1
• receiver: rcv pkt1, send ack1
• sender: rcv ack1, send pkt0
• receiver: rcv pkt0, send ack0

(b) packet loss
• sender: send pkt0
• receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1 (pkt1 lost: X)
• sender: timeout, resend pkt1
• receiver: rcv pkt1, send ack1
• sender: rcv ack1, send pkt0
• receiver: rcv pkt0, send ack0

(c) ACK loss
• sender: send pkt0
• receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1
• receiver: rcv pkt1, send ack1 (ack1 lost: X)
• sender: timeout, resend pkt1
• receiver: rcv pkt1 (detect duplicate), send ack1
• sender: rcv ack1, send pkt0
• receiver: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
• sender: send pkt0
• receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1
• receiver: rcv pkt1, send ack1 (ack1 delayed)
• sender: timeout, resend pkt1
• sender: rcv ack1, send pkt0
• receiver: rcv pkt1 (detect duplicate), send ack1
• sender: rcv ack1, do nothing
• receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

  $D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \text{microsecs}$

  – U_sender: utilization, fraction of time sender busy sending

  $U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$

  – if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources
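The utilization figure is easy to reproduce. In the sketch below the function name `sender_utilization` and its `window` parameter are assumptions for illustration; `window=1` corresponds to stop-and-wait, and larger values model a sender that keeps several packets in flight per round trip.

```python
def sender_utilization(packet_bits, rate_bps, rtt_s, window=1):
    """Fraction of time the sender is busy transmitting."""
    d_trans = packet_bits / rate_bps          # transmission delay L / R
    # The sender transmits `window` packets, then idles until the first ACK.
    return min(1.0, window * d_trans / (rtt_s + d_trans))


u = sender_utilization(8000, 1e9, 0.030)      # stop-and-wait, about 0.00027
throughput_bytes = 8000 / 8 / (0.030 + 8000 / 1e9)  # about 33 kB/sec
```

With an 8000-bit packet, a 1 Gbps link and a 30 ms RTT the sender is busy for only about 0.027% of the time, which is the 0.00027 above.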

rdt3.0: stop-and-wait operation

[Timing diagram, sender and receiver]
• first packet bit transmitted, t = 0
• last packet bit transmitted, t = L / R
• first packet bit arrives
• last packet bit arrives, send ACK
• ACK arrives, send next packet, t = RTT + L / R

$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$

Pipelined protocols

pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver

• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

[Timing diagram, sender and receiver]
• first packet bit transmitted, t = 0
• last bit transmitted, t = L / R
• first packet bit arrives
• last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• ACK arrives, send next packet, t = RTT + L / R

3-packet pipelining increases utilization by a factor of 3:

$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  – doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  – when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  – when timer expires, retransmit only that unacked packet

45

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: "cumulative ACK"
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

[FSM with a single state: Wait]

initially (Λ):
  base = 1
  nextseqnum = 1

rdt_send(data):
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)

timeout:
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  ...
  udt_send(sndpkt[nextseqnum-1])

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer

rdt_rcv(rcvpkt) && corrupt(rcvpkt):
  Λ (no action)


GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don't buffer): no receiver buffering
  – re-ACK pkt with highest in-order seq #

[FSM with a single state: Wait]

initially (Λ):
  expectedseqnum = 1
  sndpkt = make_pkt(0, ACK, chksum)

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++

default:
  udt_send(sndpkt)


GBN in action

[Timing diagram: sender window (N=4) sliding over sequence numbers 0 1 2 3 4 5 6 7 8]

• sender: send pkt0, send pkt1, send pkt2 (lost: X), send pkt3, (wait)
• receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, discard, (re)send ack1
• sender: rcv ack0, send pkt4; rcv ack1, send pkt5; ignore duplicate ACK
• receiver: receive pkt4, discard, (re)send ack1; receive pkt5, discard, (re)send ack1
• sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
• receiver: rcv pkt2, deliver, send ack2; rcv pkt3, deliver, send ack3; rcv pkt4, deliver, send ack4; rcv pkt5, deliver, send ack5
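The Go-Back-N sender logic can be sketched as a small class. This is an illustrative model rather than a network implementation: the class name `GBNSender` and its `sent` log are hypothetical, packets are represented only by their sequence numbers, the timer is a boolean flag, and the `max()` guard against stale ACKs is a defensive addition not shown in the FSM. Numbering starts at 1 as in the extended FSM, whereas the timing diagram starts at 0.

```python
class GBNSender:
    """Window bookkeeping for a Go-Back-N sender (hypothetical sketch)."""

    def __init__(self, window):
        self.N = window
        self.base = 1            # oldest unacked sequence number
        self.nextseqnum = 1      # next sequence number to use
        self.timer_running = False
        self.sent = []           # log of every transmitted sequence number

    def rdt_send(self):
        """Try to send one new packet; return False if the window is full."""
        if self.nextseqnum < self.base + self.N:
            self.sent.append(self.nextseqnum)
            if self.base == self.nextseqnum:
                self.timer_running = True   # timer covers the oldest packet
            self.nextseqnum += 1
            return True
        return False                        # corresponds to refuse_data

    def on_ack(self, acknum):
        """Cumulative ACK: everything up to and including acknum is acked."""
        self.base = max(self.base, acknum + 1)  # guard: never move backwards
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        """Resend every packet in [base, nextseqnum - 1]."""
        self.timer_running = True
        resend = list(range(self.base, self.nextseqnum))
        self.sent.extend(resend)
        return resend
```

With a window of 4, sending four packets, receiving a cumulative ACK for the second, sending two more and then timing out causes the third through sixth packets to be resent, the same pattern as the trace above shifted by one.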


Selective repeat

• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #'s
  – limits seq #s of sent, unACKed packets

53

Selective repeat sender receiver windows

54

Selective repeat

sender

data from above:
• if next available seq # in window, send pkt

timeout(n):
• resend pkt n, restart timer

ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver

pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt

pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)

otherwise:
• ignore

Selective repeat in action

[Timing diagram: sender window (N=4) sliding over sequence numbers 0 1 2 3 4 5 6 7 8]

• sender: send pkt0, send pkt1, send pkt2 (lost: X), send pkt3, (wait)
• receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, buffer, send ack3
• sender: rcv ack0, send pkt4; rcv ack1, send pkt5; record ack3 arrived
• receiver: receive pkt4, buffer, send ack4; receive pkt5, buffer, send ack5
• sender: pkt 2 timeout, send pkt2; record ack4 arrived; record ack5 arrived
• receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

[Figure: sender window (after receipt) and receiver window (after receipt) over the sequence 0 1 2 3 0 1 2, for two scenarios]

(a) no problem
• sender sends pkt0, pkt1, pkt2; receiver gets all three and ACKs them
• sender's window advances; it sends pkt3 (lost: X) and a new pkt0
• receiver will accept packet with seq number 0

(b) oops!
• sender sends pkt0, pkt1, pkt2; receiver gets all three, but all three ACKs are lost (X X X)
• sender times out and retransmits the old pkt0
• receiver will accept packet with seq number 0

receiver can't see sender side; receiver behavior identical in both cases; something's (very) wrong
• receiver sees no difference in two scenarios
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
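One way to explore the question is to enumerate which sequence numbers become ambiguous. The sketch below models scenario (b): the receiver has accepted a full window and advanced, while the sender, having seen no ACKs, may still retransmit the old window. The function names are hypothetical; the condition that the window must be at most half the sequence-number space is the standard textbook answer and is stated here as background, not taken from the slide.

```python
def ambiguous_seq_numbers(seq_space, window):
    """Sequence numbers that are both retransmittable old packets and
    acceptable new packets after the receiver advances by a full window."""
    old = {i % seq_space for i in range(window)}              # sender may resend
    new = {(window + i) % seq_space for i in range(window)}   # receiver accepts
    return sorted(old & new)


def sr_window_is_safe(seq_space, window):
    """Standard condition: window size at most half the sequence space."""
    return window <= seq_space // 2
```

For the slide's example, `ambiguous_seq_numbers(4, 3)` returns `[0, 1]`, so a retransmitted pkt0 is indistinguishable from new data, while `ambiguous_seq_numbers(4, 2)` is empty.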

58

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver

59

TCP segment structure

[Figure: TCP segment layout, 32 bits wide]
• source port #, dest port #
• sequence number, acknowledgement number: counting by bytes of data (not segments)
• head len, not used, flag bits U, A, P, R, S, F
  – URG: urgent data (generally not used)
  – ACK: ACK # valid
  – PSH: push data now
  – RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
• Urg data pointer
• options (variable length)
• application data (variable length)

TCP seq. numbers, ACKs

sequence numbers:
  – byte stream "number" of first byte in segment's data

acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK

Q: how receiver handles out-of-order segments
  – A: TCP spec doesn't say, up to implementor

[Figure: outgoing segment from sender and incoming segment to sender (fields: source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer). Sender sequence number space with window size N: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable]

Byte stream in TCP

[Figure: an HTTP Get Message (K bytes) in the byte stream; a window of N bytes starting at the 100th byte; a TCP segment with TCP header (seq. no = 100) carrying M bytes; the rest of the message cannot be transmitted now]

TCP seq. numbers, ACKs

simple telnet scenario (Host A, Host B):
1. Host A: user types 'C'; sends Seq=42, ACK=79, data = 'C'
2. Host B: host ACKs receipt of 'C', echoes back 'C'; sends Seq=79, ACK=43, data = 'C'
3. Host A: host ACKs receipt of echoed 'C'; sends Seq=43, ACK=80

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT

64

[Plot: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr. RTT (milliseconds), 100 to 350, vs. time (seconds); series: SampleRTT, EstimatedRTT]

$EstimatedRTT = (1 - \alpha) \cdot EstimatedRTT + \alpha \cdot SampleRTT$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

  $DevRTT = (1 - \beta) \cdot DevRTT + \beta \cdot |SampleRTT - EstimatedRTT|$ (typically, β = 0.25)

  $TimeoutInterval = EstimatedRTT + 4 \cdot DevRTT$

  (EstimatedRTT is the estimated RTT; 4 · DevRTT is the "safety margin")
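The EstimatedRTT, DevRTT and TimeoutInterval formulas can be combined into a small estimator. This is a sketch under assumptions: the class name `RttEstimator` is hypothetical, the initial DevRTT of 0 is an arbitrary choice, and DevRTT is updated before EstimatedRTT (so the deviation is measured against the previous estimate), an ordering the slides do not specify.

```python
class RttEstimator:
    """EWMA-based RTT estimate and timeout (hypothetical sketch)."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = 0.0   # assumption: start with no recorded deviation

    def update(self, sample_rtt):
        # Deviation first, measured against the previous estimate.
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        # Then fold the new sample into the moving average.
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt
```

Starting from a 100 ms sample and then observing 200 ms, the estimate moves only to 112.5 ms, but the deviation term of 25 ms pushes the timeout out to 212.5 ms.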

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks

let's initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control

TCP sender events

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments

68

TCP sender (simplified)

[FSM with a single state: wait for event]

initially (Λ):
  NextSeqNum = InitialSeqNum
  SendBase = InitialSeqNum

data received from application above:
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., "send")
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
    start timer

timeout:
  retransmit not-yet-acked segment with smallest seq. #
  start timer

ACK received, with ACK field value y:
  if (y > SendBase) {
    SendBase = y
    /* SendBase-1: last cumulatively ACKed byte */
    if (there are currently not-yet-acked segments)
      start timer
    else stop timer
  }
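The three events of the simplified sender can be modelled in a few lines. This is an illustrative sketch, not TCP: the class name `SimpleTcpSender`, the `unacked` dictionary and the `log` list are hypothetical, the timer is a boolean flag, and nothing is actually sent on a network.

```python
class SimpleTcpSender:
    """Bookkeeping for the simplified TCP sender (hypothetical sketch)."""

    def __init__(self, initial_seq):
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.unacked = {}            # seq # of first byte -> segment data
        self.timer_running = False
        self.log = []                # record of (action, seq, length)

    def on_data(self, data):
        seq = self.next_seq
        self.unacked[seq] = data
        self.log.append(("send", seq, len(data)))
        self.next_seq += len(data)   # seq numbers count bytes, not segments
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if self.unacked:
            seq = min(self.unacked)  # not-yet-acked segment, smallest seq #
            self.log.append(("retransmit", seq, len(self.unacked[seq])))
            self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:       # cumulative ACK covers new bytes
            self.send_base = y
            for seq in [s for s in self.unacked
                        if s + len(self.unacked[s]) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)
```

Driving it with an 8-byte segment at Seq=92, a 20-byte segment at Seq=100, a timeout, and then ACK=100 and ACK=120 leaves SendBase at 120 with a single retransmission of the Seq=92 segment.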

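TCP retransmission scenarios

lost ACK scenario (Host A, Host B):
• A: send Seq=92, 8 bytes of data (SendBase=92)
• B: send ACK=100 (lost: X)
• A: timeout; resend Seq=92, 8 bytes of data
• B: send ACK=100
• A: SendBase=100

premature timeout (Host A, Host B):
• A: send Seq=92, 8 bytes of data; send Seq=100, 20 bytes of data (SendBase=92)
• B: send ACK=100; send ACK=120
• A: timeout before the ACKs arrive; resend Seq=92, 8 bytes of data
• A: ACK=100 arrives (SendBase=100); ACK=120 arrives (SendBase=120)
• B: send ACK=120 for the duplicate segment
• A: SendBase=120

cumulative ACK (Host A, Host B):
• A: send Seq=92, 8 bytes of data; send Seq=100, 20 bytes of data
• B: send ACK=100 (lost: X); send ACK=120
• A: ACK=120 arrives before timeout; send Seq=120, 15 bytes of data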

TCP ACK generation [RFC 5681]

event at receiver -> TCP receiver action

• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  -> delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  -> immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected
  -> immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap
  -> immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  – likely that unacked segment lost, so don't wait for timeout

TCP fast retransmit

[Timing diagram, Host A and Host B: fast retransmit after sender receipt of triple duplicate ACK]
• A: send Seq=92, 8 bytes of data; send Seq=100, 20 bytes of data (lost: X); keep sending later segments
• B: send ACK=100, then ACK=100, ACK=100, ACK=100 (3 DUP ACKs)
• A: on the triple duplicate ACK, resend Seq=100, 20 bytes of data, before the timeout expires

TCP flow control

[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code (OS) into the TCP socket receiver buffers, from which the application process reads]

application may remove data from TCP socket buffers ... slower than TCP receiver is delivering (sender is sending)

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

TCP flow control

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering. TCP segment payloads arrive into RcvBuffer, which holds buffered data and free buffer space (rwnd); buffered data passes to the application process]

Connection Management

before exchanging data senderreceiver ldquohandshakerdquobull agree to establish connection (each knowing the other willing to

establish connection)bull agree on connection parameters

77

connection state ESTABconnection variables

seq client-to-serverserver-to-client

rcvBuffer sizeat serverclient

application

network

connection state ESTABconnection Variables

seq client-to-serverserver-to-client

rcvBuffer sizeat serverclient

application

network

Socket clientSocket = newSocket(hostnameport number)

Socket connectionSocket = welcomeSocketaccept()

TCP 3-way handshake

80

SYNbit=1 Seq=x

choose init seq num xsend TCP SYN msg

ESTAB

SYNbit=1 Seq=yACKbit=1 ACKnum=x+1

choose init seq num ysend TCP SYNACKmsg acking SYN

ACKbit=1 ACKnum=y+1

received SYNACK(x) indicates server is livesend ACK for SYNACK

this segment may contain client-to-server data received ACK(y)

indicates client is live

SYNSENT

ESTAB

SYN RCVD

client stateCLOSED

server stateLISTEN

TCP 3-way handshake FSM

81

closed

L

listen

SYNrcvd

SYNsent

ESTAB

Socket clientSocket = newSocket(hostnameport number)

SYN(seq=x)

Socket connectionSocket = welcomeSocketaccept()

SYN(x)SYNACK(seq=yACKnum=x+1)create new socket for communication back to client

SYNACK(seq=yACKnum=x+1)ACK(ACKnum=y+1)ACK(ACKnum=y+1)

L

TCP: closing a connection

• client, server each close their side of connection
  – send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

[Figure: closing a connection; client and server both start in ESTAB.]
1. client calls clientSocket.close() and sends FINbit=1, seq=x; client enters FIN_WAIT_1 (can no longer send, but can receive data)
2. server replies ACKbit=1, ACKnum=x+1 and enters CLOSE_WAIT (can still send data); client enters FIN_WAIT_2 (wait for server close)
3. server sends FINbit=1, seq=y and enters LAST_ACK (can no longer send data)
4. client replies ACKbit=1, ACKnum=y+1 and enters TIMED_WAIT (timed wait for 2 × max segment lifetime), then CLOSED
5. server receives the ACK and enters CLOSED

The “Two Army Problem”

Principles of congestion control

congestion:
• informally: “too many sources sending too much data too fast for network to handle”
• different from flow control
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
• a top-10 problem

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2

[Figure: Host A and Host B send original data at rate $\lambda_{in}$ into a router with unlimited shared output link buffers; throughput is $\lambda_{out}$. Left plot: $\lambda_{out}$ versus $\lambda_{in}$, both axes marked at R/2. Right plot: delay versus $\lambda_{in}$, marked at R/2.]

• large delays as arrival rate $\lambda_{in}$ approaches capacity

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: $\lambda_{in} = \lambda_{out}$
  – transport-layer input includes retransmissions: $\lambda'_{in} \geq \lambda_{in}$

[Figure: Host A and Host B share finite output link buffers. $\lambda_{in}$: original data; $\lambda'_{in}$: original data plus retransmitted data; $\lambda_{out}$: throughput.]

Causes/costs of congestion: scenario 2

idealization: perfect knowledge
• sender sends only when router buffers available

[Figure: the sender keeps a copy of each packet and transmits only when the router has free buffer space, holding back when there is no buffer space. Plot: $\lambda_{out}$ versus $\lambda_{in}$, both axes marked at R/2.]

Causes/costs of congestion: scenario 2

idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost

[Figure: Host A and Host B share finite output link buffers. $\lambda_{in}$: original data; $\lambda'_{in}$: original data plus retransmitted data. Plot: $\lambda_{out}$ versus $\lambda'_{in}$, both axes marked at R/2.]

• when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)

Causes/costs of congestion: scenario 2

realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered

[Figure: the sender's timeout fires while the original copy is still queued at the router (free buffer space available), so both copies reach Host B. Plot: $\lambda_{out}$ versus $\lambda'_{in}$, both axes marked at R/2.]

• when sending at R/2, some packets are retransmissions, including duplicates that are delivered

“costs” of congestion:
• more work (retransmissions) for given “goodput”
• unneeded retransmissions: link carries multiple copies of pkt
  – decreasing goodput

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as $\lambda_{in}$ and $\lambda'_{in}$ increase?

A: as red $\lambda'_{in}$ increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

[Figure: Hosts A, B, C and D send over multihop paths through routers with finite shared output link buffers. $\lambda_{in}$: original data; $\lambda'_{in}$: original data plus retransmitted data; $\lambda_{out}$: throughput.]

[Plot: throughput by blue traffic ($\lambda_{out}$) versus offered load by Host A ($\lambda'_{in}$), both axes marked at R/2; annotation: bandwidth wastage for packets dropped at the 2nd router.]

another “cost” of congestion:
• when packet dropped, any upstream transmission capacity used for that packet was wasted

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss

[Figure: AIMD sawtooth behavior, probing for bandwidth. cwnd (TCP sender congestion window size) versus time: additively increase window size until loss occurs, then cut window in half.]
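The sawtooth can be reproduced with a short sketch. It works in whole RTT rounds and measures cwnd in MSS units; the loss rule (a loss whenever cwnd exceeds a fixed `capacity_mss`) is a simplifying assumption made for illustration, not something the slides specify.

```python
def aimd_trace(rounds, capacity_mss, cwnd=1.0):
    """Return cwnd (in MSS) at the start of each RTT round."""
    trace = []
    for _ in range(rounds):
        trace.append(cwnd)
        if cwnd > capacity_mss:          # assumed loss signal
            cwnd = max(cwnd / 2, 1.0)    # multiplicative decrease
        else:
            cwnd += 1.0                  # additive increase: +1 MSS per RTT
    return trace
```

With `capacity_mss=4`, the window climbs by one MSS per round, overshoots, and is halved, which is the repeating pattern in the figure.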

TCP Congestion Control: details

• sender limits transmission:

  $LastByteSent - LastByteAcked \leq cwnd$

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

  $rate \approx \frac{cwnd}{RTT}$ bytes/sec

[Figure: sender sequence number space, showing last byte ACKed, bytes sent but not yet ACKed (“in-flight”), and last byte sent; cwnd spans the in-flight region.]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A sends one segment to Host B, then two segments one RTT later, then four segments.]
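A minimal sketch of why “increment cwnd for every ACK” yields doubling per RTT. It assumes no loss, one ACK per segment, and that all ACKs for a round arrive before the next round starts; `slow_start_rounds` is a name chosen here for illustration.

```python
def slow_start_rounds(rounds, mss=1):
    """cwnd (in units of mss) at the start of each RTT, with no loss."""
    cwnd = mss
    history = []
    for _ in range(rounds):
        history.append(cwnd)
        acks = cwnd // mss        # one ACK per segment sent this round
        for _ in range(acks):
            cwnd += mss           # +1 MSS per ACK, so cwnd doubles per RTT
    return history
```

The first few rounds match the figure: one segment, then two, then four.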

TCP: detecting, reacting to loss

• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate ACKs)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?

A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

initialization (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; enter slow start

slow start:
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: no action (Λ); go to congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment

congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start

fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
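The three-state machine summarized on this slide can be sketched as a small class. This only tracks cwnd, ssthresh and the state; it sends nothing. The class name `RenoSender`, the value `MSS = 1460` and the method names are assumptions made for illustration.

```python
MSS = 1460  # bytes; an assumed segment size


class RenoSender:
    """Tracks cwnd/ssthresh through slow start, congestion avoidance, fast recovery."""

    def __init__(self):
        self.state = "slow_start"
        self.cwnd = MSS
        self.ssthresh = 64 * 1024
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh              # deflate window, leave recovery
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += MSS                       # exponential growth per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += MSS * MSS / self.cwnd     # about +1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += MSS
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                     # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = MSS
        self.dup_acks = 0
        self.state = "slow_start"
```

A caller would invoke `on_new_ack`, `on_dup_ack` or `on_timeout` as events arrive and read `cwnd` to decide how much may be in flight.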

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is ¾ W
  – avg. throughput is ¾ W per RTT

  avg TCP throughput $= \frac{3}{4} \cdot \frac{W}{RTT}$ bytes/sec

[Figure: sawtooth of window size oscillating between W/2 and W.]
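The ¾ W figure is the mean of a window that rises linearly from W/2 to W. A quick numerical check, assuming the window is sampled once per RTT and grows by one unit per RTT (function names are illustrative):

```python
def average_sawtooth_window(w):
    """Mean window over one AIMD cycle rising from w/2 to w in unit steps (w even)."""
    samples = range(w // 2, w + 1)
    return sum(samples) / len(samples)


def average_throughput(w_bytes, rtt_s):
    """Average throughput in bytes/sec under the 3/4 * W / RTT approximation."""
    return 0.75 * w_bytes / rtt_s
```

For an even `w`, the mean comes out at exactly three quarters of `w`.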

TCP Futures: TCP over “long, fat pipes”

• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

  TCP throughput $= \frac{1.22 \cdot MSS}{RTT \sqrt{L}}$

  → to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate
• new versions of TCP for high-speed
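Both numbers on this slide follow from the stated example. The sketch below recomputes them; the function names are chosen here for illustration, and the loss rate comes from inverting the throughput formula for L.

```python
def window_segments(rate_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain rate_bps over a path with rtt_s."""
    return rate_bps * rtt_s / (8 * mss_bytes)


def required_loss_rate(rate_bps, rtt_s, mss_bytes):
    """Invert throughput = 1.22 * MSS / (RTT * sqrt(L)) for L."""
    return (1.22 * 8 * mss_bytes / (rtt_s * rate_bps)) ** 2
```

With 10 Gbps, 100 ms and 1500-byte segments, the window comes to roughly 83,333 segments and the loss rate to roughly 2·10^-10.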

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput versus Connection 1 throughput, both axes from 0 to R. The full bandwidth utilization line holds points (X1, Y1) where X1 + Y1 = R; the equal bandwidth share line holds points (X2, Y2) where X2 = Y2. The trajectory alternates between congestion avoidance (additive increase) and loss (decrease window by factor of 2).]
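The convergence argument can be checked numerically. In the sketch below, both flows see a loss in the same round whenever their combined rate exceeds the link capacity; that synchronized-loss rule and the function name are simplifying assumptions made for illustration.

```python
def two_flow_aimd(x, y, capacity, rounds):
    """Evolve two AIMD flows sharing one link; return their final rates."""
    for _ in range(rounds):
        if x + y > capacity:
            x, y = x / 2, y / 2      # multiplicative decrease halves the gap
        else:
            x, y = x + 1, y + 1      # additive increase keeps the gap unchanged
    return x, y
```

Additive increase leaves the difference between the flows untouched, while each halving cuts it in two, so the rates drift toward the equal-share line.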

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets about R/2 (11 of 20 connections)

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical). An IP datagram leaves the source with ECN=00 and is marked ECN=11 by a congested router; the destination returns a TCP ACK segment with ECE=1.]


rdt2.1: Example 2

[Figure, shown in three animation steps. Sender FSM states: “Wait for call 0 from above”, “Wait for ACK or NAK 0”, “Wait for call 1 from above”, “Wait for ACK or NAK 1”. Receiver FSM states: “Wait for 0 from below”, “Wait for 1 from below”.]

Step 1, receiver transition:
  event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt)
  action: sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)

Step 2, sender transition:
  event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt)
  action: none (Λ)

rdt2.1: discussion

sender:
• seq # added to pkt
• two seq #’s (0, 1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  – state must “remember” whether “expected” pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  – state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender

rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  – receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt

rdt2.2: sender, receiver fragments

sender FSM fragment:
• Wait for call 0 from above → Wait for ACK 0
  event: rdt_send(data)
  action: sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
• Wait for ACK 0 (self-loop)
  event: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1))
  action: udt_send(sndpkt)
• Wait for ACK 0 → next state
  event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0)
  action: none (Λ)

receiver FSM fragment:
• Wait for 0 from below (self-loop)
  event: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt))
  action: udt_send(sndpkt)
• transition into Wait for 0 from below
  event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
  action: extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
  – checksum, seq #, ACKs, retransmissions will be of help … but not enough

approach: sender waits “reasonable” amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  – retransmission will be duplicate, but seq #’s already handle this
  – receiver must specify seq # of pkt being ACKed
• requires countdown timer

rdt3.0 sender

• Wait for call 0 from above
  – rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer; go to Wait for ACK0
  – rdt_rcv(rcvpkt): no action (Λ)
• Wait for ACK0
  – rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)): no action (Λ)
  – timeout: udt_send(sndpkt); start_timer
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0): stop_timer; go to Wait for call 1 from above
• Wait for call 1 from above
  – rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer; go to Wait for ACK1
  – rdt_rcv(rcvpkt): no action (Λ)
• Wait for ACK1
  – rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0)): no action (Λ)
  – timeout: udt_send(sndpkt); start_timer
  – rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1): stop_timer; go to Wait for call 0 from above
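The four-state sender collapses naturally into an alternating-bit class, since the two halves differ only in the sequence bit. The sketch below is one such reading: the `send` callable stands in for udt_send, timers are driven externally by calling `on_timeout`, and all names are hypothetical.

```python
class Rdt30Sender:
    """Alternating-bit sender: one packet in flight, retransmit on timeout."""

    def __init__(self, send):
        self.send = send          # hypothetical callable that puts a packet on the channel
        self.seq = 0              # sequence bit of the next packet to send
        self.pending = None       # packet awaiting its ACK, or None

    def rdt_send(self, data):
        if self.pending is not None:
            raise RuntimeError("still waiting for ACK")
        self.pending = (self.seq, data)
        self.send(self.pending)   # the caller is expected to start the timer here

    def on_timeout(self):
        if self.pending is not None:
            self.send(self.pending)   # resend and restart the timer

    def on_ack(self, acknum, corrupt=False):
        # Corrupt or wrong-numbered ACKs are ignored; the timer handles recovery.
        if self.pending is None or corrupt or acknum != self.seq:
            return
        self.pending = None       # stop the timer, move to the other sequence bit
        self.seq ^= 1
```

Passing `list.append` as `send` is a convenient way to record what goes onto the channel when experimenting.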

rdt3.0 in action

(a) no loss
• sender: send pkt0; receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1; receiver: rcv pkt1, send ack1
• sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0

(b) packet loss
• sender: send pkt0; receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1; pkt1 is lost (X)
• sender: timeout, resend pkt1; receiver: rcv pkt1, send ack1
• sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0

(c) ACK loss
• sender: send pkt0; receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1; receiver: rcv pkt1, send ack1; ack1 is lost (X)
• sender: timeout, resend pkt1; receiver: rcv pkt1 (detect duplicate), send ack1
• sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
• sender: send pkt0; receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1; receiver: rcv pkt1, send ack1 (ack1 is delayed)
• sender: timeout, resend pkt1; receiver: rcv pkt1 (detect duplicate), send ack1
• sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0
• sender: rcv ack1 (the duplicate), do nothing
• sender: rcv ack0, send pkt1

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

  $D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \text{microsecs}$

  – $U_{sender}$: utilization, fraction of time sender busy sending

  $U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$

  – if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources

rdt3.0: stop-and-wait operation

[Figure: sender/receiver timeline.]
• first packet bit transmitted, t = 0
• last packet bit transmitted, t = L/R
• first packet bit arrives
• last packet bit arrives, send ACK
• ACK arrives, send next packet, t = RTT + L/R

$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$

Pipelined protocols

pipelining: sender allows multiple, “in-flight”, yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

[Figure: sender/receiver timeline with three packets in flight.]
• first packet bit transmitted, t = 0
• last bit transmitted, t = L/R
• first packet bit arrives
• last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• ACK arrives, send next packet, t = RTT + L/R

3-packet pipelining increases utilization by a factor of 3

$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$
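The utilization figures for stop-and-wait and for 3-packet pipelining come from the same formula with a different window. A minimal sketch, assuming the slide's example values (8000-bit packets, 1 Gbps link, 30 ms RTT); the function name and the cap at 1.0 are additions made here.

```python
def sender_utilization(packet_bits, link_bps, rtt_s, window=1):
    """Fraction of time the sender is busy, with `window` packets sent per RTT cycle."""
    d_trans = packet_bits / link_bps               # transmission delay L/R in seconds
    return min(1.0, window * d_trans / (rtt_s + d_trans))
```

`window=1` gives the stop-and-wait case and `window=3` the pipelined case; utilization scales linearly with the window until the pipe is full.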

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unACKed packets in pipeline
• receiver only sends cumulative ack
  – doesn’t ack packet if there’s a gap
• sender has timer for oldest unACKed packet
  – when timer expires, retransmit all unACKed packets

Selective Repeat:
• sender can have up to N unACKed packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unACKed packet
  – when timer expires, retransmit only that unACKed packet

Go-Back-N: sender

• k-bit seq # in pkt header
• “window” of up to N consecutive unACKed pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: “cumulative ACK”
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

initialization (Λ):
  base = 1
  nextseqnum = 1

rdt_send(data):
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)

timeout:
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  …
  udt_send(sndpkt[nextseqnum-1])

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer

rdt_rcv(rcvpkt) && corrupt(rcvpkt):
  no action (Λ)
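The sender FSM translates almost line for line into a class. The sketch below is one such translation, with some liberties: the timer is a boolean flag the caller would act on, `send` is a hypothetical stand-in for udt_send, sequence numbers do not wrap, and ACKs below `base` are ignored (a guard the FSM itself does not show).

```python
class GBNSender:
    """Go-Back-N sender with window N and a single timer for the oldest packet."""

    def __init__(self, window, send):
        self.N = window
        self.send = send              # hypothetical callable: send(seqnum, data)
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}              # seqnum -> data kept for retransmission
        self.timer_running = False

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False              # window full: refuse_data
        self.sndpkt[self.nextseqnum] = data
        self.send(self.nextseqnum, data)
        if self.base == self.nextseqnum:
            self.timer_running = True
        self.nextseqnum += 1
        return True

    def on_timeout(self):
        self.timer_running = True     # restart timer, then go back N
        for seq in range(self.base, self.nextseqnum):
            self.send(seq, self.sndpkt[seq])

    def on_ack(self, acknum):
        if acknum < self.base:
            return                    # assumed guard: ignore old cumulative ACKs
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)
        self.base = acknum + 1
        self.timer_running = self.base != self.nextseqnum
```

A timeout resends every packet from `base` up to `nextseqnum - 1`, which is the behavior that gives the protocol its name.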


GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don’t buffer): no receiver buffering
  – re-ACK pkt with highest in-order seq #

initialization (Λ):
  expectedseqnum = 1
  sndpkt = make_pkt(0, ACK, chksum)

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++

default:
  udt_send(sndpkt)


GBN in action

sender window (N=4), seq #’s 0 1 2 3 4 5 6 7 8

sender:
• send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, then wait
• rcv ack0, send pkt4
• rcv ack1, send pkt5
• ignore duplicate ACK
• pkt 2 timeout: send pkt2, send pkt3, send pkt4, send pkt5

receiver:
• receive pkt0, send ack0
• receive pkt1, send ack1
• receive pkt3, discard, (re)send ack1
• receive pkt4, discard, (re)send ack1
• receive pkt5, discard, (re)send ack1
• rcv pkt2, deliver, send ack2
• rcv pkt3, deliver, send ack3
• rcv pkt4, deliver, send ack4
• rcv pkt5, deliver, send ack5


Selective repeat

• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #’s
  – limits seq #’s of sent, unACKed packets

Selective repeat: sender, receiver windows

[Figure: sender and receiver views of the sequence number space.]

Selective repeat

sender:

data from above:
• if next available seq # in window, send pkt

timeout(n):
• resend pkt n, restart timer

ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver:

pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt

pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)

otherwise:
• ignore
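The receiver rules above can be sketched as a small class. To keep it short, sequence numbers are assumed to grow without wrapping, and `deliver` and `send_ack` are hypothetical callbacks standing in for the upper layer and the channel.

```python
class SRReceiver:
    """Selective Repeat receiver with window N and out-of-order buffering."""

    def __init__(self, window, deliver, send_ack):
        self.N = window
        self.rcvbase = 0
        self.buffer = {}              # seq -> data received but not yet delivered
        self.deliver = deliver
        self.send_ack = send_ack

    def on_packet(self, n, data):
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.send_ack(n)
            self.buffer[n] = data
            # Deliver any run of in-order packets and slide the window forward.
            while self.rcvbase in self.buffer:
                self.deliver(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
        elif self.rcvbase - self.N <= n < self.rcvbase:
            self.send_ack(n)          # already delivered: re-ACK so the sender can advance
        # otherwise: ignore
```

The re-ACK branch matters: without it, a sender whose ACK was lost would never be able to slide its window forward.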

Selective repeat in action

sender window (N=4), seq #’s 0 1 2 3 4 5 6 7 8

sender:
• send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, then wait
• rcv ack0, send pkt4
• rcv ack1, send pkt5
• record ack3 arrived
• pkt 2 timeout: send pkt2
• record ack4 arrived
• record ack5 arrived

receiver:
• receive pkt0, send ack0
• receive pkt1, send ack1
• receive pkt3, buffer, send ack3
• receive pkt4, buffer, send ack4
• receive pkt5, buffer, send ack5
• rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #’s: 0, 1, 2, 3
• window size = 3

(a) no problem: sender sends pkt0, pkt1, pkt2; all arrive and are ACKed, so sender and receiver windows both advance to 3, 0, 1. Sender then sends pkt3 (lost, X) and a new pkt0; receiver will accept packet with seq number 0, correctly, as new data.

(b) oops: sender sends pkt0, pkt1, pkt2; all arrive, but all three ACKs are lost (X X X). Receiver window advances to 3, 0, 1; sender window does not. Sender times out and retransmits pkt0; receiver will accept packet with seq number 0.

• receiver sees no difference in two scenarios
• duplicate data accepted as new in (b)
• receiver can’t see sender side; receiver behavior identical in both cases; something’s (very) wrong

Q: what relationship between seq # size and window size to avoid problem in (b)?
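One way to explore the question is to model the worst case from scenario (b) directly: all ACKs are lost, so the sender may still retransmit its first window while the receiver has already advanced by a full window. The sketch below checks whether those two sets of sequence numbers overlap; the function name is chosen here for illustration.

```python
def sr_window_is_safe(seq_space, window):
    """True if a retransmitted old packet can never fall inside the receiver's new window."""
    # Sender may still resend seq #s 0..window-1 (all ACKs lost).
    sender_old = {n % seq_space for n in range(0, window)}
    # Receiver has advanced and now accepts window..2*window-1 (mod seq_space).
    receiver_new = {n % seq_space for n in range(window, 2 * window)}
    return sender_old.isdisjoint(receiver_new)
```

For the slide's example (4 sequence numbers, window 3) the sets overlap; the overlap disappears once the window is at most half the sequence number space.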

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no “message boundaries”
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver

TCP segment structure

[Figure: TCP segment layout, 32 bits wide.]
• source port #, dest port #
• sequence number: counting by bytes of data (not segments)
• acknowledgement number
• head len, not used, flag bits U, A, P, R, S, F
  – URG: urgent data (generally not used)
  – ACK: ACK # valid
  – PSH: push data now
  – RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
• Urg data pointer
• options (variable length)
• application data (variable length)

TCP seq. numbers, ACKs

sequence numbers:
  – byte stream “number” of first byte in segment’s data

acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK

Q: how receiver handles out-of-order segments
  – A: TCP spec doesn’t say, up to implementor

[Figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum and urg pointer. Below, the sender sequence number space with window size N: sent and ACKed; sent, not-yet ACKed (“in-flight”); usable but not yet sent; not usable.]

Byte stream in TCP

[Figure: an HTTP GET message of K bytes in the byte stream, beginning at the 100th byte. With a window of N bytes, the first M bytes are sent behind a TCP header with seq no = 100; the remainder of the message cannot be transmitted now.]

TCP seq. numbers, ACKs

simple telnet scenario:
1. User at Host A types ‘C’: A → B, Seq=42, ACK=79, data = ‘C’
2. Host B ACKs receipt of ‘C’, echoes back ‘C’: B → A, Seq=79, ACK=43, data = ‘C’
3. Host A ACKs receipt of echoed ‘C’: A → B, Seq=43, ACK=80

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT “smoother”
  – average several recent measurements, not just current SampleRTT

$EstimatedRTT = (1 - \alpha) \cdot EstimatedRTT + \alpha \cdot SampleRTT$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: $\alpha = 0.125$

[Figure: RTT (milliseconds) versus time (seconds), gaia.cs.umass.edu to fantasia.eurecom.fr, showing the SampleRTT and EstimatedRTT curves; vertical axis from 100 to 350 ms.]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus “safety margin”
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

  $DevRTT = (1 - \beta) \cdot DevRTT + \beta \cdot |SampleRTT - EstimatedRTT|$   (typically, $\beta = 0.25$)

  $TimeoutInterval = EstimatedRTT + 4 \cdot DevRTT$

  (EstimatedRTT is the estimated RTT; $4 \cdot DevRTT$ is the “safety margin”)
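The estimator and timeout formulas can be sketched as a small class. Two choices here are assumptions rather than anything stated on the slides: DevRTT starts at half the first sample, and each update computes DevRTT from the EstimatedRTT value held before the new sample is folded in. The class and method names are illustrative.

```python
class RttEstimator:
    """EWMA-based RTT estimate with a deviation-based timeout interval."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha, self.beta = alpha, beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2    # assumed initial deviation

    def update(self, sample_rtt):
        # Deviation is measured against the estimate from before this sample.
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        return self.timeout_interval()

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt
```

Per the slide, samples taken from retransmitted segments should be skipped rather than passed to `update`.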

TCP reliable data transfer

• TCP creates rdt service on top of IP’s unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks

let’s initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control

TCP sender events

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unACKed segment
  – expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unACKed segments
  – update what is known to be ACKed
  – start timer if there are still unACKed segments

TCP sender (simplified)

initialization (Λ):
  NextSeqNum = InitialSeqNum
  SendBase = InitialSeqNum

event: data received from application above
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., “send”)
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
    start timer

event: timeout
  retransmit not-yet-acked segment with smallest seq. #
  start timer

event: ACK received, with ACK field value y
  if (y > SendBase) {
    SendBase = y
    /* SendBase–1: last cumulatively ACKed byte */
    if (there are currently not-yet-acked segments)
      start timer
    else stop timer
  }
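The simplified sender maps onto three event handlers. The sketch below follows the pseudocode under a few assumptions: the timer is modeled as a boolean flag, `send` is a hypothetical stand-in for handing a segment to IP, sequence numbers do not wrap, and a segment counts as ACKed only once the cumulative ACK covers its last byte.

```python
class SimpleTcpSender:
    """Simplified TCP sender: cumulative ACKs, one retransmission timer."""

    def __init__(self, initial_seq, send):
        self.send = send                  # hypothetical callable: send(seq, data)
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.unacked = {}                 # seq -> data for not-yet-acked segments
        self.timer_running = False

    def on_app_data(self, data):
        seq = self.next_seq
        self.unacked[seq] = data
        self.send(seq, data)
        self.next_seq += len(data)
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if self.unacked:
            seq = min(self.unacked)       # not-yet-acked segment with smallest seq #
            self.send(seq, self.unacked[seq])
            self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y            # send_base - 1 is the last ACKed byte
            done = [s for s, d in self.unacked.items() if s + len(d) <= y]
            for s in done:
                del self.unacked[s]
            self.timer_running = bool(self.unacked)
```

Because ACKs are cumulative, a single ACK can clear several segments at once, which is why a lost earlier ACK need not trigger a retransmission.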

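TCP: retransmission scenarios

lost ACK scenario:
1. Host A sends Seq=92, 8 bytes of data
2. Host B replies ACK=100, which is lost (X)
3. Host A times out and resends Seq=92, 8 bytes of data
4. Host B replies ACK=100

premature timeout (SendBase=92 initially):
1. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
2. Host B replies ACK=100 and ACK=120
3. Host A’s timer expires before the ACKs arrive; it resends Seq=92, 8 bytes of data
4. ACK=100 arrives (SendBase=100), then ACK=120 arrives (SendBase=120)
5. Host B answers the duplicate segment with ACK=120 (SendBase=120)

cumulative ACK:
1. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
2. Host B replies ACK=100, which is lost (X), then ACK=120
3. ACK=120 arrives before the timeout, so Host A continues with Seq=120, 15 bytes of data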

TCP ACK generation [RFC 5681]

event at receiver → TCP receiver action

• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment, higher-than-expected seq #; gap detected
  → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap
  → immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs

if sender receives 3 ACKs for same data (“triple duplicate ACKs”), resend unACKed segment with smallest seq #
  – likely that unACKed segment lost, so don’t wait for timeout

[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost (X), followed by further segments. Host B replies ACK=100 four times (the original plus 3 DUP ACKs). Host A resends Seq=100, 20 bytes of data before the timeout expires.]
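Counting duplicates is the whole mechanism, so a sketch is short. It scans a stream of ACK numbers and reports where a fast retransmit would fire; the function name is illustrative, and the rule “three duplicates after the original ACK” follows the slide's reading of triple duplicate ACKs.

```python
def fast_retransmit_points(acks, threshold=3):
    """Return the ACK numbers for which a fast retransmit would be triggered."""
    fired = []
    last, dups = None, 0
    for a in acks:
        if a == last:
            dups += 1
            if dups == threshold:     # third duplicate of the same ACK
                fired.append(a)
        else:
            last, dups = a, 0         # new ACK number resets the count
    return fired
```

In the figure's scenario, the ACK stream 100, 100, 100, 100 triggers a resend of the segment starting at byte 100.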

TCP flow control

flow control: receiver controls sender, so sender won’t overflow receiver’s buffer by transmitting too much, too fast

• application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)

[Figure: receiver protocol stack. Segments from sender pass up through IP code and TCP code in the OS into the TCP socket receiver buffers, from which the application process reads.]

• receiver “advertises” free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unACKed (“in-flight”) data to receiver’s rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering. TCP segment payloads arrive into RcvBuffer, which is split into buffered data and free buffer space (rwnd); buffered data is passed up to the application process.]
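The bookkeeping on both sides can be sketched in two small functions. The variable names follow the usual textbook convention (LastByteRcvd, LastByteRead, LastByteSent, LastByteAcked), which is an assumption here since this slide does not name them; sequence number wraparound is ignored.

```python
def advertised_window(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free buffer space (rwnd) the receiver advertises to the sender."""
    buffered = last_byte_rcvd - last_byte_read   # data not yet read by the application
    return rcv_buffer - buffered


def can_send(n_bytes, last_byte_sent, last_byte_acked, rwnd):
    """True if sending n_bytes more keeps in-flight data within rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + n_bytes <= rwnd
```

With a 4096-byte buffer holding 2000 unread bytes, the receiver would advertise 2096 bytes, and a sender with 1000 bytes in flight could send about another 1000.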

Connection Management

before exchanging data senderreceiver ldquohandshakerdquobull agree to establish connection (each knowing the other willing to

establish connection)bull agree on connection parameters

77

connection state ESTABconnection variables

seq client-to-serverserver-to-client

rcvBuffer sizeat serverclient

application

network

connection state ESTABconnection Variables

seq client-to-serverserver-to-client

rcvBuffer sizeat serverclient

application

network

Socket clientSocket = newSocket(hostnameport number)

Socket connectionSocket = welcomeSocketaccept()

TCP 3-way handshake

80

SYNbit=1 Seq=x

choose init seq num xsend TCP SYN msg

ESTAB

SYNbit=1 Seq=yACKbit=1 ACKnum=x+1

choose init seq num ysend TCP SYNACKmsg acking SYN

ACKbit=1 ACKnum=y+1

received SYNACK(x) indicates server is livesend ACK for SYNACK

this segment may contain client-to-server data received ACK(y)

indicates client is live

SYNSENT

ESTAB

SYN RCVD

client stateCLOSED

server stateLISTEN

TCP 3-way handshake FSM

81

closed

L

listen

SYNrcvd

SYNsent

ESTAB

Socket clientSocket = newSocket(hostnameport number)

SYN(seq=x)

Socket connectionSocket = welcomeSocketaccept()

SYN(x)SYNACK(seq=yACKnum=x+1)create new socket for communication back to client

SYNACK(seq=yACKnum=x+1)ACK(ACKnum=y+1)ACK(ACKnum=y+1)

L

TCP closing a connection

bull client server each close their side of connectionndash send TCP segment with FIN bit = 1

bull respond to received FIN with ACKndash on receiving FIN ACK can be combined with own FIN

bull simultaneous FIN exchanges can be handled

client state: ESTAB, server state: ESTAB

1. client calls clientSocket.close() and sends FINbit=1, seq=x; client enters FIN_WAIT_1 (can no longer send but can receive data)
2. server sends ACKbit=1, ACKnum=x+1; server enters CLOSE_WAIT (can still send data); client enters FIN_WAIT_2 (wait for server close)
3. server sends FINbit=1, seq=y; server enters LAST_ACK (can no longer send data)
4. client sends ACKbit=1, ACKnum=y+1; client enters TIMED_WAIT (timed wait for 2 * max segment lifetime), then CLOSED; server enters CLOSED

The "Two Army Problem"

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
• a top-10 problem

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity

[Figure: Host A sends original data at rate λ_in into unlimited shared output link buffers; Host B receives at throughput λ_out. Plots: λ_out vs. λ_in, with R/2 marked on both axes; delay vs. λ_in, with R/2 marked.]

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: λ_in = λ_out
  – transport-layer input includes retransmissions: λ'_in ≥ λ_in

[Figure: Host A sends λ_in (original data) and λ'_in (original data plus retransmitted data) into finite shared output link buffers; Host B receives λ_out.]

Causes/costs of congestion: scenario 2

idealization: perfect knowledge
• sender sends only when router buffers available

[Figure: Host A keeps a copy of each packet and sends λ_in (original data) and λ'_in (original data plus retransmitted data) into finite shared output link buffers toward Host B, shown once with free buffer space and once with no buffer space. Plot: λ_out vs. λ_in, with R/2 marked on both axes.]

Causes/costs of congestion: scenario 2

Idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
• when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)

Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  – decreasing goodput

[Figures: Host A keeps a copy of each packet and sends λ_in (original data) and λ'_in (original data plus retransmitted data) into finite shared output link buffers toward Host B; in the duplicates case a timeout fires while buffer space is free. Plots: λ_out against the input rate, with R/2 marked on both axes.]

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted

[Figure: Hosts A, B, C and D connected over multihop paths through routers with finite shared output link buffers; λ_in: original data; λ'_in: original data plus retransmitted data; λ_out at the receiver. Plot: throughput by blue traffic (λ_out) vs. offered load by Host A (λ'_in), with R/2 marked on both axes; annotation: bandwidth wastage for packets dropped at the 2nd router.]

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss

[Figure: cwnd (TCP sender congestion window size) vs. time. AIMD sawtooth behavior, probing for bandwidth: additively increase window size ... until loss occurs (then cut window in half).]
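The sawtooth can be reproduced with a very small simulation. The sketch below is illustrative only: the function name `aimd_trace`, the choice to measure cwnd in units of MSS, and the idea of feeding in the loss rounds by hand are all assumptions, not part of the slides.

```python
def aimd_trace(rounds, loss_rounds, start_cwnd=1.0, mss=1.0):
    """Return cwnd (in units of MSS) observed at the start of each RTT.

    rounds      -- number of RTTs to simulate
    loss_rounds -- set of RTT indices in which a loss is detected
    """
    cwnd = start_cwnd
    trace = []
    for rtt in range(rounds):
        trace.append(cwnd)
        if rtt in loss_rounds:
            # multiplicative decrease: cut cwnd in half after loss
            cwnd = max(mss, cwnd / 2)
        else:
            # additive increase: one MSS per RTT
            cwnd += mss
    return trace
```

Calling `aimd_trace(6, {3}, start_cwnd=8)` grows the window from 8 to 11 MSS, halves it to 5.5 after the loss in round 3, and then resumes the linear climb.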

TCP Congestion Control: details

• sender limits transmission:

    LastByteSent - LastByteAcked ≤ cwnd

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

    rate ≈ cwnd / RTT bytes/sec

[Figure: sender sequence number space, showing last byte ACKed, bytes sent but not yet ACKed ("in-flight", spanning cwnd), and last byte sent.]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A and Host B timeline, one RTT per round: one segment, then two segments, then four segments.]
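The link between "increment cwnd for every ACK" and "double cwnd every RTT" is easy to check numerically. The following is a minimal sketch, assuming no loss, one ACK per segment, and cwnd counted in whole segments; the function name `slow_start_trace` is made up for illustration.

```python
def slow_start_trace(rtts, mss=1):
    """Return cwnd (in segments) at the start of each RTT during slow start."""
    cwnd = mss
    trace = []
    for _ in range(rtts):
        trace.append(cwnd)
        # Every segment sent in this RTT is ACKed once (assumption: no loss).
        acks_this_rtt = cwnd // mss
        for _ in range(acks_this_rtt):
            cwnd += mss  # increment cwnd by one MSS per ACK received
    return trace
```

`slow_start_trace(4)` gives 1, 2, 4, 8 segments, matching the one, two, four segments shown in the timeline.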

TCP: detecting, reacting to loss

• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

initial transition (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0 → slow start

slow start
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: Λ → congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment → fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment (stay in slow start)

congestion avoidance
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment → fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment → slow start

fast recovery
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0 → congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment → slow start
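The three-state machine summarised on this slide can be written as a small event-driven class. This is a sketch under assumptions, not a real TCP stack: the class name `RenoSender` and its method names are hypothetical, cwnd and ssthresh are counted in units of MSS, "ssthresh + 3" is read as three MSS, and retransmission and segment transmission are left out so that only the window arithmetic remains.

```python
MSS = 1.0  # window arithmetic is done in units of one MSS (assumption)


class RenoSender:
    """Window bookkeeping for slow start, congestion avoidance, fast recovery."""

    def __init__(self, ssthresh=64.0):
        self.state = "slow start"
        self.cwnd = 1 * MSS
        self.ssthresh = ssthresh
        self.dup_ack_count = 0

    def on_new_ack(self):
        if self.state == "slow start":
            self.cwnd += MSS
            self.dup_ack_count = 0
            if self.cwnd > self.ssthresh:
                self.state = "congestion avoidance"
        elif self.state == "congestion avoidance":
            self.cwnd += MSS * (MSS / self.cwnd)
            self.dup_ack_count = 0
        else:  # fast recovery: deflate the window and resume linear growth
            self.cwnd = self.ssthresh
            self.dup_ack_count = 0
            self.state = "congestion avoidance"

    def on_duplicate_ack(self):
        if self.state == "fast recovery":
            self.cwnd += MSS
            return
        self.dup_ack_count += 1
        if self.dup_ack_count == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1 * MSS
        self.dup_ack_count = 0
        self.state = "slow start"
```

Driving the object with a sequence of `on_new_ack`, `on_duplicate_ack` and `on_timeout` calls traces the same path through the states as following the arrows in the FSM by hand.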

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is 3/4 W
  – avg. throughput is 3/4 W per RTT

    avg TCP throughput = (3/4) · W / RTT bytes/sec

[Figure: sawtooth window oscillating between W/2 and W, with average 3/4 W.]

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

    TCP throughput = 1.22 · MSS / (RTT · √L)

• to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate
• new versions of TCP for high-speed
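Both numbers on this slide follow from short arithmetic. The sketch below simply evaluates them; the function names are hypothetical, and the loss-rate function is just the throughput formula above solved for L.

```python
def window_in_segments(rate_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain rate_bps over a path with rtt_s."""
    return rate_bps * rtt_s / (8 * mss_bytes)


def required_loss_rate(rate_bps, rtt_s, mss_bytes):
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L."""
    mss_bits = 8 * mss_bytes
    return (1.22 * mss_bits / (rtt_s * rate_bps)) ** 2


if __name__ == "__main__":
    w = window_in_segments(10e9, 0.100, 1500)
    loss = required_loss_rate(10e9, 0.100, 1500)
    print(f"W = {w:.0f} segments, L = {loss:.2e}")
```

With 1500-byte segments, a 100 ms RTT and a 10 Gbps target, the window comes out at about 83,333 segments and the tolerable loss rate at roughly 2·10^-10.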

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R.]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput vs. Connection 1 throughput, both axes from 0 to R, with the equal bandwidth share line and the full bandwidth utilization line. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2". Marked points: (X1, Y1) where X1 + Y1 = R; (X2, Y2) where X2 = Y2.]

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2
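The parallel-connection example is plain per-connection sharing. As a quick check, assuming every TCP connection on the link ends up with an equal share (the idealised fairness goal), the fraction of R obtained by the new app can be computed as follows; `app_share` is a hypothetical helper name.

```python
from fractions import Fraction


def app_share(existing_connections, new_connections):
    """Fraction of link rate R obtained by an app opening new_connections,
    assuming all connections on the link receive equal shares."""
    total = existing_connections + new_connections
    return Fraction(new_connections, total)


if __name__ == "__main__":
    print(app_share(9, 1))   # share with one connection
    print(app_share(9, 11))  # share with eleven connections
```

With 9 existing connections, one new connection gets 1/10 of R, while eleven new connections get 11/20 of R, which the slide rounds to R/2.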

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and is marked ECN=11 by a congested router; the destination returns a TCP ACK segment with ECE=1.]


rdt2.1: Example 2

[Figure: sender FSM with states "Wait for call 0 from above", "Wait for ACK or NAK 0", "Wait for call 1 from above", "Wait for ACK or NAK 1"; receiver FSM with states "Wait for 0 from below", "Wait for 1 from below". Highlighted sender transition: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) / Λ.]

rdt2.1: discussion

sender:
• seq # added to pkt
• two seq. #'s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  – state must "remember" whether "expected" pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  – state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender

rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  – receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt

rdt2.2: sender, receiver fragments

sender FSM fragment
  Wait for call 0 from above
    rdt_send(data) /
      sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
      → Wait for ACK 0
  Wait for ACK 0
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)) /
      udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0) /
      Λ

receiver FSM fragment
  Wait for 0 from below
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)) /
      udt_send(sndpkt)
  transition into Wait for 0 from below
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt) /
      extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
  – checksum, seq. #, ACKs, retransmissions will be of help ... but not enough

approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  – retransmission will be duplicate, but seq. #'s already handles this
  – receiver must specify seq # of pkt being ACKed
• requires countdown timer

rdt3.0 sender

Wait for call 0 from above
• rdt_send(data) / sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer → Wait for ACK0
• rdt_rcv(rcvpkt) / Λ

Wait for ACK0
• rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)) / Λ
• timeout / udt_send(sndpkt); start_timer
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0) / stop_timer → Wait for call 1 from above

Wait for call 1 from above
• rdt_send(data) / sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer → Wait for ACK1
• rdt_rcv(rcvpkt) / Λ

Wait for ACK1
• rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,0)) / Λ
• timeout / udt_send(sndpkt); start_timer
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1) / stop_timer → Wait for call 0 from above

rdt3.0 in action

(a) no loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0

(b) packet loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 (pkt1 lost, X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0

(c) ACK loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 lost, X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 delayed)
  sender: timeout, resend pkt1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt1 (detect duplicate), send ack1
  receiver: rcv pkt0, send ack0
  sender: rcv ack1, do nothing
  sender: rcv ack0, send pkt1

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

    D_trans = L / R = 8000 bits / 10^9 bits/sec = 8 microsecs

  – U_sender: utilization, fraction of time sender busy sending

    U_sender = (L / R) / (RTT + L / R) = 0.008 / 30.008 = 0.00027

  – if RTT = 30 msec, 1KB pkt every 30 msec: 33kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources
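The utilization figure can be recomputed directly from L, R and RTT. The helper below is a sketch; the name `sender_utilization` and the optional `window` parameter (number of packets sent back-to-back per RTT, 1 for stop-and-wait) are assumptions added for illustration.

```python
def sender_utilization(packet_bits, link_bps, rtt_s, window=1):
    """Fraction of time the sender is busy transmitting.

    window=1 models stop-and-wait; larger values model a pipelined sender
    that transmits `window` packets before waiting for the first ACK.
    """
    d_trans = packet_bits / link_bps          # transmission delay L / R
    utilization = window * d_trans / (rtt_s + d_trans)
    return min(1.0, utilization)              # a link cannot be more than 100% busy


if __name__ == "__main__":
    u = sender_utilization(8000, 1e9, 0.030)
    print(f"stop-and-wait utilization: {u:.5f}")
```

For the slide's numbers (8000-bit packets, 1 Gbps, RTT = 30 ms) this evaluates to about 0.00027, so the sender is idle more than 99.9% of the time.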

rdt3.0: stop-and-wait operation

  sender: first packet bit transmitted, t = 0
  sender: last packet bit transmitted, t = L / R
  receiver: first packet bit arrives
  receiver: last packet bit arrives, send ACK
  sender: ACK arrives, send next packet, t = RTT + L / R

    U_sender = (L / R) / (RTT + L / R) = 0.008 / 30.008 = 0.00027

Pipelined protocols

pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

  sender: first packet bit transmitted, t = 0
  sender: last bit transmitted, t = L / R
  receiver: first packet bit arrives
  receiver: last packet bit arrives, send ACK
  receiver: last bit of 2nd packet arrives, send ACK
  receiver: last bit of 3rd packet arrives, send ACK
  sender: ACK arrives, send next packet, t = RTT + L / R

3-packet pipelining increases utilization by a factor of 3:

    U_sender = (3L / R) / (RTT + L / R) = 0.024 / 30.008 ≈ 0.0008

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  – doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  – when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  – when timer expires, retransmit only that unacked packet

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n ("cumulative ACK")
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

initially (Λ):
  base=1
  nextseqnum=1

Wait (single state)

rdt_send(data):
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum,data,chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)

timeout:
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  ...
  udt_send(sndpkt[nextseqnum-1])

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer

rdt_rcv(rcvpkt) && corrupt(rcvpkt):
  Λ
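The extended FSM maps almost line for line onto a small class. The following is a sketch, not a working transport: `GBNSender` and its method names are hypothetical, `send` is a caller-supplied callable standing in for udt_send, packets are modelled as `(seq, data)` tuples without a checksum, and the timer is reduced to a boolean flag.

```python
class GBNSender:
    """Go-Back-N sender bookkeeping, following the extended FSM above."""

    def __init__(self, window_size, send):
        self.N = window_size
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}            # seq -> packet kept for retransmission
        self.timer_running = False  # stand-in for start_timer / stop_timer
        self.send = send            # hypothetical stand-in for udt_send

    def rdt_send(self, data):
        """Accept data from above; return False if the window is full."""
        if self.nextseqnum < self.base + self.N:
            packet = (self.nextseqnum, data)
            self.sndpkt[self.nextseqnum] = packet
            self.send(packet)
            if self.base == self.nextseqnum:
                self.timer_running = True
            self.nextseqnum += 1
            return True
        return False                # corresponds to refuse_data(data)

    def on_ack(self, acknum):
        """Handle an uncorrupted cumulative ACK."""
        self.base = acknum + 1
        # stop the timer when nothing is outstanding, otherwise restart it
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        """Retransmit every packet from base up to nextseqnum - 1."""
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.send(self.sndpkt[seq])
```

With `N = 4`, the fifth call to `rdt_send` is refused until an ACK advances `base`; a timeout then resends everything still outstanding, which is the "go back N" behaviour.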

GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don't buffer): no receiver buffering
  – re-ACK pkt with highest in-order seq #

initially (Λ):
  expectedseqnum=1
  sndpkt = make_pkt(0,ACK,chksum)

Wait (single state)

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt,expectedseqnum):
  extract(rcvpkt,data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum,ACK,chksum)
  udt_send(sndpkt)
  expectedseqnum++

default:
  udt_send(sndpkt)

GBN in action

sender window (N=4), seq #'s 0 1 2 3 4 5 6 7 8

sender:
  send pkt0
  send pkt1
  send pkt2 (lost, X)
  send pkt3
  (wait)
  rcv ack0, send pkt4
  rcv ack1, send pkt5
  ignore duplicate ACK
  pkt 2 timeout
  send pkt2
  send pkt3
  send pkt4
  send pkt5

receiver:
  receive pkt0, send ack0
  receive pkt1, send ack1
  receive pkt3, discard, (re)send ack1
  receive pkt4, discard, (re)send ack1
  receive pkt5, discard, (re)send ack1
  rcv pkt2, deliver, send ack2
  rcv pkt3, deliver, send ack3
  rcv pkt4, deliver, send ack4
  rcv pkt5, deliver, send ack5

Selective repeat

• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #'s
  – limits seq #s of sent, unACKed packets

Selective repeat: sender, receiver windows

[Figure: sender and receiver views of the sequence number space and their windows.]

Selective repeat

sender
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
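The receiver rules can be sketched as a short class. This is an illustration under assumptions: `SRReceiver` and `on_packet` are hypothetical names, sequence numbers are treated as unbounded integers (no wraparound, so the dilemma discussed later does not arise), and corruption checks are omitted.

```python
class SRReceiver:
    """Selective-repeat receiver: buffer out-of-order, deliver in order."""

    def __init__(self, window_size):
        self.N = window_size
        self.rcvbase = 0
        self.buffer = {}      # seq -> data received but not yet delivered
        self.delivered = []   # data handed to the upper layer, in order

    def on_packet(self, n, data):
        """Process packet n; return the ACK number to send, or None to ignore."""
        if self.rcvbase <= n <= self.rcvbase + self.N - 1:
            # inside the window: buffer, then deliver any in-order run
            self.buffer[n] = data
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n <= self.rcvbase - 1:
            # already delivered: re-ACK so the sender can advance
            return n
        return None           # otherwise: ignore
```

Feeding the receiver packets 0, 1, 3, 4, 5 and then 2 produces an individual ACK for each, with packets 3 to 5 held in the buffer until packet 2 arrives and the whole run is delivered.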

Selective repeat in action

sender window (N=4), seq #'s 0 1 2 3 4 5 6 7 8

sender:
  send pkt0
  send pkt1
  send pkt2 (lost, X)
  send pkt3
  (wait)
  rcv ack0, send pkt4
  rcv ack1, send pkt5
  record ack3 arrived
  pkt 2 timeout
  send pkt2
  record ack4 arrived
  record ack5 arrived

receiver:
  receive pkt0, send ack0
  receive pkt1, send ack1
  receive pkt3, buffer, send ack3
  receive pkt4, buffer, send ack4
  receive pkt5, buffer, send ack5
  rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?

Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

(a) no problem
  sender sends pkt0, pkt1, pkt2; the receiver window (after receipt) advances past them; the sender window (after receipt of the ACKs) advances and pkt3 is sent but lost (X); receiver will accept packet with seq number 0

(b) oops
  sender sends pkt0, pkt1, pkt2; the receiver window (after receipt) advances past them; all three ACKs are lost (X X X); sender times out and retransmits pkt0; receiver will accept packet with seq number 0

receiver can't see sender side; receiver behavior identical in both cases; something's (very) wrong
• receiver sees no difference in two scenarios
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?

TCP: Overview   RFCs: 793, 1122, 1323, 2018, 2581

• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver

TCP segment structure

[Figure: TCP segment layout, 32 bits wide]
  source port # | dest port #
  sequence number
  acknowledgement number
  head len | not used | U A P R S F | receive window
  checksum | Urg data pointer
  options (variable length)
  application data (variable length)

annotations:
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• sequence number, acknowledgement number: counting by bytes of data (not segments)
• checksum: Internet checksum (as in UDP)

TCP seq. numbers, ACKs

sequence numbers:
  – byte stream "number" of first byte in segment's data
acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK
Q: how receiver handles out-of-order segments
  – A: TCP spec doesn't say, up to implementor

[Figure: sender sequence number space with window size N, divided into: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable. The outgoing segment from sender and the incoming segment to sender each show the header fields source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer; the incoming acknowledgement number is marked A.]

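Byte stream in TCP

[Figure: an HTTP Get Message (K bytes) in the byte stream. A window of N bytes starts at the 100th byte; a segment with TCP header (seq no = 100) carries M bytes; the part of the HTTP Get Message beyond the window cannot be transmitted now.]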

TCP seq. numbers, ACKs

simple telnet scenario (Host A, Host B)

1. User types 'C'. Host A → Host B: Seq=42, ACK=79, data = 'C'
2. Host B ACKs receipt of 'C', echoes back 'C'. Host B → Host A: Seq=79, ACK=43, data = 'C'
3. Host A ACKs receipt of echoed 'C'. Host A → Host B: Seq=43, ACK=80

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT

    EstimatedRTT = (1 - α) · EstimatedRTT + α · SampleRTT

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Figure: RTT: gaia.cs.umass.edu to fantasia.eurecom.fr. RTT (milliseconds, 100 to 350) vs. time (seconds, 1 to 106), showing SampleRTT and EstimatedRTT.]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

    DevRTT = (1 - β) · DevRTT + β · |SampleRTT - EstimatedRTT|
    (typically, β = 0.25)

    TimeoutInterval = EstimatedRTT + 4 · DevRTT
    (EstimatedRTT: estimated RTT; 4 · DevRTT: "safety margin")
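The two moving averages and the timeout formula fit in a few lines. The sketch below makes some assumptions the slides do not state: the class name `RttEstimator` is made up, the first sample seeds EstimatedRTT, DevRTT starts at 0, and DevRTT is updated before EstimatedRTT so that the deviation is measured against the previous estimate.

```python
class RttEstimator:
    """EWMA-based RTT estimate and retransmission timeout."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample  # assumption: seed with first sample
        self.dev_rtt = 0.0                 # assumption: no deviation known yet

    def update(self, sample_rtt):
        """Fold in a new SampleRTT and return the new TimeoutInterval."""
        # deviation is measured against the estimate before this sample
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        return self.timeout_interval()

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt
```

Starting from a 100 ms estimate, a single 200 ms sample moves EstimatedRTT only to 112.5 ms but raises DevRTT to 25 ms, so the timeout jumps to 212.5 ms: the safety margin reacts to variation faster than the estimate itself.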

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks

let's initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control

TCP sender events:

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments

TCP sender (simplified)

initially (Λ):
  NextSeqNum = InitialSeqNum
  SendBase = InitialSeqNum

wait for event

data received from application above:
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., "send")
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
    start timer

timeout:
  retransmit not-yet-acked segment with smallest seq. #
  start timer

ACK received, with ACK field value y:
  if (y > SendBase) {
    SendBase = y
    /* SendBase–1: last cumulatively ACKed byte */
    if (there are currently not-yet-acked segments)
      start timer
    else stop timer
  }
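The three event handlers translate into a compact class. As before this is a sketch under assumptions: `SimpleTcpSender` and its method names are hypothetical, `send` is a caller-supplied callable standing in for handing a segment to IP, segments are `(seq, data)` tuples, and the timer is a boolean rather than a real clock.

```python
class SimpleTcpSender:
    """Simplified TCP sender: no duplicate-ACK handling, no flow or congestion control."""

    def __init__(self, initial_seq_num, send):
        self.next_seq_num = initial_seq_num
        self.send_base = initial_seq_num
        self.unacked = {}           # seq -> data for not-yet-acked segments
        self.timer_running = False
        self.send = send            # hypothetical stand-in for "pass segment to IP"

    def on_app_data(self, data):
        self.unacked[self.next_seq_num] = data
        self.send((self.next_seq_num, data))
        self.next_seq_num += len(data)
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if not self.unacked:
            return
        seq = min(self.unacked)     # not-yet-acked segment with smallest seq #
        self.send((seq, self.unacked[seq]))
        self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y      # send_base - 1 is the last cumulatively ACKed byte
            for seq in list(self.unacked):
                if seq + len(self.unacked[seq]) <= y:
                    del self.unacked[seq]
            self.timer_running = bool(self.unacked)
```

Sending 8 bytes at Seq=92 and 20 bytes at Seq=100 leaves `next_seq_num` at 120; an ACK of 120 alone acknowledges both segments and stops the timer, which is the cumulative-ACK behaviour.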

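TCP: retransmission scenarios

lost ACK scenario (Host A, Host B)
  1. Host A sends Seq=92, 8 bytes of data
  2. Host B sends ACK=100, which is lost (X)
  3. timeout: Host A resends Seq=92, 8 bytes of data
  4. Host B sends ACK=100

premature timeout (Host A, Host B)
  1. SendBase=92: Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
  2. Host B sends ACK=100 and ACK=120
  3. timeout fires before the ACKs arrive: Host A resends Seq=92, 8 bytes of data
  4. ACKs arrive: SendBase=100, then SendBase=120
  5. Host B answers the retransmission with ACK=120; SendBase=120

cumulative ACK (Host A, Host B)
  1. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
  2. Host B sends ACK=100, which is lost (X), then ACK=120
  3. ACK=120 arrives before the timeout; Host A continues with Seq=120, 15 bytes of data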

TCP ACK generation [RFC 5681]

event at receiver → TCP receiver action

• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  → delayed ACK: wait up to 500ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment, higher-than-expect seq. #; gap detected
  → immediately send duplicate ACK, indicating seq. # of next expected byte
• arrival of segment that partially or completely fills gap
  → immediate send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  – likely that unacked segment lost, so don't wait for timeout
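Detecting the trigger only needs a counter. The sketch below assumes that "triple duplicate ACKs" means three further ACKs repeating an ACK number already seen (so four identical ACKs in total, as in the ACK=100 sequence of the accompanying figure); the class name `DupAckDetector` is hypothetical.

```python
class DupAckDetector:
    """Signal fast retransmit on the third duplicate of the same ACK number."""

    def __init__(self):
        self.last_ack = None
        self.dup_count = 0

    def on_ack(self, acknum):
        """Return True exactly when fast retransmit should fire."""
        if acknum == self.last_ack:
            self.dup_count += 1
            return self.dup_count == 3
        # a new ACK number resets the duplicate counter
        self.last_ack = acknum
        self.dup_count = 0
        return False
```

When `on_ack` returns True, the sender would resend the unacked segment with the smallest sequence number instead of waiting for the retransmission timer.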

TCP fast retransmit

fast retransmit after sender receipt of triple duplicate ACK (Host A, Host B):
  1. Host A sends Seq=92, 8 bytes of data; Host B replies ACK=100
  2. Host A sends Seq=100, 20 bytes of data; segment lost (X)
  3. Host B keeps replying ACK=100 to later segments (3 DUP ACKs)
  4. Host A resends Seq=100, 20 bytes of data, before the timeout expires

TCP flow control

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

application may remove data from TCP socket buffers ... slower than TCP receiver is delivering (sender is sending)

[Figure: receiver protocol stack. Segments arrive from sender, pass through IP code and TCP code into the TCP socket receiver buffers (OS), and are read by the application process.]

TCP flow control

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering. TCP segment payloads arrive into RcvBuffer, which holds buffered data plus free buffer space (rwnd); buffered data is passed to the application process.]
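The advertised window and the sender-side check can be written as two small functions. This is a sketch with assumed names: `LastByteRcvd` and `LastByteRead` are not defined on these slides and are introduced here only to express "buffered data" as a difference of byte counters, and the function names are hypothetical.

```python
def advertised_rwnd(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free buffer space the receiver advertises.

    buffered data = last_byte_rcvd - last_byte_read (assumed counters)
    """
    buffered = last_byte_rcvd - last_byte_read
    return rcv_buffer - buffered


def sender_may_send(last_byte_sent, last_byte_acked, nbytes, rwnd):
    """True if sending nbytes more keeps in-flight data within rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + nbytes <= rwnd
```

With a 4096-byte RcvBuffer holding 2000 unread bytes, the receiver advertises 2096 bytes; a sender with 500 bytes in flight may send another 1000, but one with 2000 bytes in flight may not.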

Connection Management

before exchanging data senderreceiver ldquohandshakerdquobull agree to establish connection (each knowing the other willing to

establish connection)bull agree on connection parameters

77

connection state ESTABconnection variables

seq client-to-serverserver-to-client

rcvBuffer sizeat serverclient

application

network

connection state ESTABconnection Variables

seq client-to-serverserver-to-client

rcvBuffer sizeat serverclient

application

network

Socket clientSocket = newSocket(hostnameport number)

Socket connectionSocket = welcomeSocketaccept()

TCP 3-way handshake

80

SYNbit=1 Seq=x

choose init seq num xsend TCP SYN msg

ESTAB

SYNbit=1 Seq=yACKbit=1 ACKnum=x+1

choose init seq num ysend TCP SYNACKmsg acking SYN

ACKbit=1 ACKnum=y+1

received SYNACK(x) indicates server is livesend ACK for SYNACK

this segment may contain client-to-server data received ACK(y)

indicates client is live

SYNSENT

ESTAB

SYN RCVD

client stateCLOSED

server stateLISTEN

TCP 3-way handshake FSM

81

closed

L

listen

SYNrcvd

SYNsent

ESTAB

Socket clientSocket = newSocket(hostnameport number)

SYN(seq=x)

Socket connectionSocket = welcomeSocketaccept()

SYN(x)SYNACK(seq=yACKnum=x+1)create new socket for communication back to client

SYNACK(seq=yACKnum=x+1)ACK(ACKnum=y+1)ACK(ACKnum=y+1)

L

TCP closing a connection

bull client server each close their side of connectionndash send TCP segment with FIN bit = 1

bull respond to received FIN with ACKndash on receiving FIN ACK can be combined with own FIN

bull simultaneous FIN exchanges can be handled

82

FIN_WAIT_2

CLOSE_WAIT

FINbit=1 seq=y

ACKbit=1 ACKnum=y+1

ACKbit=1 ACKnum=x+1wait for server

close

can stillsend data

can no longersend data

LAST_ACK

CLOSED

TIMED_WAIT

timed wait for 2max

segment lifetime

CLOSED

TCP closing a connection

83

FIN_WAIT_1 FINbit=1 seq=xcan no longersend but canreceive data

clientSocketclose()

client state server stateESTABESTAB

The ldquoTwo Army Problemrdquo

84

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
• a top-10 problem!

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity

[Figure: Host A and Host B send original data at rate λ_in into unlimited shared output link buffers; throughput is λ_out. Plot 1: λ_out vs λ_in, levelling off at R/2. Plot 2: delay vs λ_in, growing without bound as λ_in approaches R/2.]

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: λ_in = λ_out
  – transport-layer input includes retransmissions: λ'_in ≥ λ_in

[Figure: Host A and Host B share finite output link buffers. λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out: output.]

Causes/costs of congestion: scenario 2

idealization: perfect knowledge
• sender sends only when router buffers available

[Figure: sender keeps a copy of each packet and transmits only when there is free buffer space in the finite shared output link buffers, never when there is no buffer space. Plot: λ_out vs λ_in, rising linearly up to R/2.]

Causes/costs of congestion: scenario 2

idealization: known loss. Packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost

[Figure: Host A keeps a copy of each packet; a packet arriving when there is no free buffer space is dropped and later retransmitted. Plot: λ_out vs λ'_in, both axes up to R/2.]

when sending at R/2, some packets are retransmissions but asymptotic goodput is still R/2 (why?)

Causes/costs of congestion: scenario 2

realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered

[Figure: Host A times out while the original packet is still queued (free buffer space available) and sends a copy; both reach Host B. Plot: λ_out vs λ'_in, both axes up to R/2.]

when sending at R/2, some packets are retransmissions, including duplicates that are delivered!

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  – decreasing goodput

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λ_in and λ'_in increase?

A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

[Figure: Hosts A, B, C and D send over multihop paths through routers with finite shared output link buffers. λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out: output.]

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!

[Plot: throughput by blue traffic (λ_out) vs offered load by Host A (λ'_in), both axes up to R/2; the falling part of the curve marks bandwidth wastage for packets dropped at the 2nd router.]

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss

AIMD saw tooth behavior: probing for bandwidth

[Plot: cwnd (TCP sender congestion window size) vs time. The window is additively increased until loss occurs, then cut in half.]
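The saw tooth can be reproduced with a few lines of code. The sketch below is illustrative only: `aimd_trace` is a hypothetical helper, cwnd is counted in units of MSS, and the rounds (RTTs) in which loss occurs are supplied by the caller rather than derived from a network model.

```python
def aimd_trace(rounds, loss_rounds, cwnd=1.0):
    """Return cwnd (in MSS) at the start of each RTT under AIMD.

    loss_rounds: set of round indices in which a loss is detected (assumed input).
    """
    trace = []
    for r in range(rounds):
        trace.append(cwnd)
        if r in loss_rounds:
            cwnd = max(1.0, cwnd / 2)  # multiplicative decrease: cut cwnd in half
        else:
            cwnd += 1.0                # additive increase: 1 MSS per RTT
    return trace
```

Plotting the returned list against the round index gives the saw tooth shape described on the slide.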

TCP Congestion Control: details

• sender limits transmission:

  LastByteSent - LastByteAcked ≤ cwnd

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

  rate ≈ cwnd / RTT bytes/sec

[Figure: sender sequence number space, showing last byte ACKed, bytes sent but not yet ACKed ("in-flight"), and last byte sent; the in-flight span is bounded by cwnd.]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A sends one segment to Host B in the first RTT, two segments in the second, four segments in the third.]
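Why incrementing cwnd once per ACK doubles it every RTT can be checked directly. This is a minimal sketch under simplifying assumptions: `slow_start_round` is a hypothetical name, cwnd is counted in segments, and every segment sent in a round is ACKed within that round.

```python
def slow_start_round(cwnd_segments):
    """Advance slow start by one RTT and return the new cwnd (in segments)."""
    acks = cwnd_segments       # one ACK comes back per segment in flight
    for _ in range(acks):
        cwnd_segments += 1     # cwnd grows by 1 MSS for every ACK received
    return cwnd_segments
```

Starting from 1 segment, successive calls give 2, 4, 8, matching the one, two, four segments in the figure.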

TCP: detecting, reacting to loss

• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?

A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

(Λ denotes "no event" or "no action".)

initially (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; enter slow start

slow start
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: Λ; go to congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; stay in slow start

congestion avoidance
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start

fast recovery
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
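The three-state machine summarized above can be sketched as a small class. This is an illustration under stated assumptions, not a real TCP stack: the class name `RenoSender`, the default MSS of 1000 bytes and the method names are invented here, cwnd and ssthresh are held in bytes, and actual segment transmission and retransmission are left out.

```python
class RenoSender:
    """Tracks cwnd, ssthresh and state only; no packets are sent (sketch)."""

    def __init__(self, mss=1000, ssthresh=64 * 1024):
        self.mss = mss                    # assumed MSS in bytes
        self.cwnd = float(mss)            # cwnd = 1 MSS
        self.ssthresh = float(ssthresh)   # ssthresh = 64 KB
        self.dup_acks = 0
        self.state = "slow start"

    def on_timeout(self):
        # Same reaction in every state: back to slow start with cwnd = 1 MSS.
        self.ssthresh = self.cwnd / 2
        self.cwnd = float(self.mss)
        self.dup_acks = 0
        self.state = "slow start"

    def on_new_ack(self):
        if self.state == "slow start":
            self.cwnd += self.mss
            if self.cwnd > self.ssthresh:
                self.state = "congestion avoidance"
        elif self.state == "congestion avoidance":
            self.cwnd += self.mss * (self.mss / self.cwnd)
        else:  # fast recovery: deflate the window and resume linear growth
            self.cwnd = self.ssthresh
            self.state = "congestion avoidance"
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast recovery":
            self.cwnd += self.mss
            return
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss
            self.state = "fast recovery"
```

Feeding the object a sequence of `on_new_ack`, `on_dup_ack` and `on_timeout` calls shows how cwnd and ssthresh evolve for a given loss pattern.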

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is 3/4 W
  – avg. throughput is 3/4 W per RTT

$$\text{avg TCP throughput} = \frac{3}{4}\,\frac{W}{RTT}\ \text{bytes/sec}$$

[Plot: window size oscillating between W/2 and W, with average 3/4 W.]

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

$$\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \sqrt{L}}$$

  to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
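The two numbers in this example follow from simple arithmetic, which the sketch below reproduces. The function names are hypothetical; the only formula used is the one quoted above, solved for L, together with W = throughput × RTT / segment size.

```python
def window_segments(throughput_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain the target throughput."""
    return throughput_bps * rtt_s / (mss_bytes * 8)


def required_loss_rate(throughput_bps, rtt_s, mss_bytes):
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L."""
    return (1.22 * mss_bytes * 8 / (rtt_s * throughput_bps)) ** 2


w = window_segments(10e9, 0.100, 1500)        # about 83,333 segments
loss = required_loss_rate(10e9, 0.100, 1500)  # about 2e-10
```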

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Plot: Connection 2 throughput vs Connection 1 throughput, both axes from 0 to R. The full bandwidth utilization line is the set of points (X1, Y1) where X1+Y1 = R; the equal bandwidth share line is the set of points (X2, Y2) where X2 = Y2. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2", moving toward the equal bandwidth share line.]
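The convergence argument can be checked numerically. The sketch below is a simplified model, not a packet-level simulation: `two_flow_aimd` is a hypothetical function, both flows are assumed to have the same RTT, and both are assumed to see a loss in any round where their combined rate exceeds the link capacity.

```python
def two_flow_aimd(x, y, capacity, rounds):
    """Evolve two AIMD flows sharing one bottleneck; return final rates."""
    for _ in range(rounds):
        if x + y > capacity:
            x, y = x / 2, y / 2   # loss: both decrease window by factor of 2
        else:
            x, y = x + 1, y + 1   # congestion avoidance: additive increase
    return x, y


x, y = two_flow_aimd(1.0, 70.0, capacity=100.0, rounds=500)
```

Additive increase leaves the gap between the two rates unchanged, while every halving cuts it in half, so the gap shrinks toward zero and the flows approach the equal bandwidth share line.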

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and is marked ECN=11 by a congested router; the destination returns a TCP ACK segment with ECE=1.]


rdt2.1: Example 2

[Figure: sender FSM states: Wait for call 0 from above, Wait for ACK or NAK 0, Wait for call 1 from above, Wait for ACK or NAK 1. Receiver FSM states: Wait for 0 from below, Wait for 1 from below.]

rdt2.1: discussion

sender:
• seq # added to pkt
• two seq #'s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
  – state must "remember" whether "expected" pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
  – state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender

rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  – receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt

rdt2.2: sender, receiver fragments

(Λ denotes "no action".)

sender FSM fragment
• Wait for call 0 from above: on rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to Wait for ACK 0
• Wait for ACK 0: on rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)): udt_send(sndpkt); stay
• Wait for ACK 0: on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0): Λ; move on to the next state

receiver FSM fragment
• Wait for 0 from below: on rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)): udt_send(sndpkt); stay
• transition into Wait for 0 from below: on rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt): extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
  – checksum, seq #, ACKs, retransmissions will be of help … but not enough

approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  – retransmission will be duplicate, but seq #'s already handle this
  – receiver must specify seq # of pkt being ACKed
• requires countdown timer

rdt3.0 sender

(Λ denotes "no action".)

Wait for call 0 from above
• rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer; go to Wait for ACK0
• rdt_rcv(rcvpkt): Λ

Wait for ACK0
• rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1)): Λ
• timeout: udt_send(sndpkt); start_timer
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0): stop_timer; go to Wait for call 1 from above

Wait for call 1 from above
• rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer; go to Wait for ACK1
• rdt_rcv(rcvpkt): Λ

Wait for ACK1
• rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,0)): Λ
• timeout: udt_send(sndpkt); start_timer
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1): stop_timer; go to Wait for call 0 from above

rdt3.0 in action

(a) no loss
1. sender: send pkt0 → receiver: rcv pkt0, send ack0
2. sender: rcv ack0, send pkt1 → receiver: rcv pkt1, send ack1
3. sender: rcv ack1, send pkt0 → receiver: rcv pkt0, send ack0

(b) packet loss
1. sender: send pkt0 → receiver: rcv pkt0, send ack0
2. sender: rcv ack0, send pkt1 → pkt1 lost (X)
3. sender: timeout, resend pkt1 → receiver: rcv pkt1, send ack1
4. sender: rcv ack1, send pkt0 → receiver: rcv pkt0, send ack0

(c) ACK loss
1. sender: send pkt0 → receiver: rcv pkt0, send ack0
2. sender: rcv ack0, send pkt1 → receiver: rcv pkt1, send ack1 → ack1 lost (X)
3. sender: timeout, resend pkt1 → receiver: rcv pkt1 (detect duplicate), send ack1
4. sender: rcv ack1, send pkt0 → receiver: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
1. sender: send pkt0 → receiver: rcv pkt0, send ack0
2. sender: rcv ack0, send pkt1 → receiver: rcv pkt1, send ack1 (delayed)
3. sender: timeout, resend pkt1 → receiver: rcv pkt1 (detect duplicate), send ack1
4. sender: rcv ack1, send pkt0 → receiver: rcv pkt0, send ack0
5. sender: rcv ack1 (second copy), do nothing
6. sender: rcv ack0, send pkt1

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

$$D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \text{microsecs}$$

  – U_sender: utilization, the fraction of time sender busy sending

$$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$$

  – if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!

rdt3.0: stop-and-wait operation

[Figure: sender/receiver timeline.
• sender: first packet bit transmitted, t = 0
• sender: last packet bit transmitted, t = L/R
• receiver: first packet bit arrives
• receiver: last packet bit arrives, send ACK
• sender: ACK arrives, send next packet, t = RTT + L/R]

$$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$$

Pipelined protocols

pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver

• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

[Figure: sender/receiver timeline.
• sender: first packet bit transmitted, t = 0
• sender: last bit transmitted, t = L/R
• receiver: first packet bit arrives
• receiver: last packet bit arrives, send ACK
• receiver: last bit of 2nd packet arrives, send ACK
• receiver: last bit of 3rd packet arrives, send ACK
• sender: ACK arrives, send next packet, t = RTT + L/R]

3-packet pipelining increases utilization by a factor of 3!

$$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$$
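The utilization figures for stop-and-wait and for 3-packet pipelining can be recomputed with one small function. This is a sketch: `sender_utilization` and its `window` parameter are names chosen here, and the cap at 1.0 is an added assumption (a sender cannot be busy more than all of the time).

```python
def sender_utilization(packet_bits, link_bps, rtt_s, window=1):
    """Fraction of time the sender is busy sending, with `window` pkts per RTT."""
    d_trans = packet_bits / link_bps  # L / R, in seconds
    return min(1.0, window * d_trans / (rtt_s + d_trans))


stop_and_wait = sender_utilization(8000, 1e9, 0.030)          # about 0.00027
pipelined = sender_utilization(8000, 1e9, 0.030, window=3)    # about 0.0008
```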

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  – doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  – when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  – when timer expires, retransmit only that unacked packet

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: "cumulative ACK"
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

(single state: Wait; Λ denotes "no event" or "no action")

initially (Λ): base=1; nextseqnum=1

• rdt_send(data):
    if (nextseqnum < base+N) {
        sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
        udt_send(sndpkt[nextseqnum])
        if (base == nextseqnum)
            start_timer
        nextseqnum++
    }
    else
        refuse_data(data)

• timeout:
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    …
    udt_send(sndpkt[nextseqnum-1])

• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
        stop_timer
    else
        start_timer

• rdt_rcv(rcvpkt) && corrupt(rcvpkt): Λ
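The extended FSM above translates fairly directly into code. The sketch below is illustrative: the class name `GBNSender`, the `send` callback and the boolean `timer_running` flag (standing in for start_timer/stop_timer) are assumptions made here, packets are plain `(seq, data)` tuples, and corrupt ACKs are assumed to be dropped before they reach the sender. One small deviation from the FSM is flagged in a comment.

```python
class GBNSender:
    def __init__(self, n, send):
        self.n = n                 # window size N
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}           # seq -> packet kept for retransmission
        self.timer_running = False
        self.send = send           # hypothetical stand-in for udt_send

    def rdt_send(self, data):
        """Return True if data was accepted, False if refused (window full)."""
        if self.nextseqnum < self.base + self.n:
            pkt = (self.nextseqnum, data)
            self.sndpkt[self.nextseqnum] = pkt
            self.send(pkt)
            if self.base == self.nextseqnum:
                self.timer_running = True   # start_timer
            self.nextseqnum += 1
            return True
        return False                        # refuse_data(data)

    def timeout(self):
        self.timer_running = True           # start_timer
        for seq in range(self.base, self.nextseqnum):
            self.send(self.sndpkt[seq])     # resend every unacked packet

    def rdt_rcv_ack(self, acknum):
        # Deviation from the FSM: max() keeps a stale ACK from moving base back.
        self.base = max(self.base, acknum + 1)
        self.timer_running = self.base != self.nextseqnum
```

With N=4, a fifth call to `rdt_send` is refused until a cumulative ACK advances `base`.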


GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don't buffer): no receiver buffering!
  – re-ACK pkt with highest in-order seq #

(single state: Wait; Λ denotes "no event")

initially (Λ): expectedseqnum=1; sndpkt = make_pkt(0, ACK, chksum)

• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++

• default: udt_send(sndpkt)


GBN in action

sender window (N=4), seq #'s 0 1 2 3 4 5 6 7 8

1. sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, then (wait)
2. receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, discard, (re)send ack1
3. sender: rcv ack0, send pkt4; rcv ack1, send pkt5; ignore duplicate ACK
4. receiver: receive pkt4, discard, (re)send ack1; receive pkt5, discard, (re)send ack1
5. sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
6. receiver: rcv pkt2, deliver, send ack2; rcv pkt3, deliver, send ack3; rcv pkt4, deliver, send ack4; rcv pkt5, deliver, send ack5


Selective repeat

• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #'s
  – limits seq #s of sent, unACKed packets

Selective repeat: sender, receiver windows

[Figure: sender and receiver views of the sequence number space; not recoverable from the transcript.]

Selective repeat

sender

data from above:
• if next available seq # in window, send pkt

timeout(n):
• resend pkt n, restart timer

ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver

pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt

pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)

otherwise:
• ignore
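The receiver rules above can be sketched as follows. This is an illustration with simplifying assumptions: the class name `SRReceiver` and its attributes are invented here, sequence numbers are unbounded integers (no wraparound), and corrupt packets are assumed to have been discarded already.

```python
class SRReceiver:
    def __init__(self, n):
        self.n = n            # window size N
        self.rcvbase = 0
        self.buffer = {}      # out-of-order packets awaiting delivery
        self.delivered = []   # data handed to the upper layer, in order

    def receive(self, seq, data):
        """Return the seq # to ACK, or None if the packet is ignored."""
        if self.rcvbase <= seq < self.rcvbase + self.n:
            self.buffer[seq] = data
            # Deliver any in-order run and advance the window.
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return seq
        if self.rcvbase - self.n <= seq < self.rcvbase:
            return seq        # already delivered: re-ACK so the sender can advance
        return None           # otherwise: ignore
```

Receiving packets 0, 1, 3, 4, 5 and then 2 with N=4 buffers 3 to 5 and delivers all six packets in order once packet 2 arrives.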

Selective repeat in action

sender window (N=4), seq #'s 0 1 2 3 4 5 6 7 8

1. sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, then (wait)
2. receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, buffer, send ack3
3. sender: rcv ack0, send pkt4; rcv ack1, send pkt5; record ack3 arrived
4. receiver: receive pkt4, buffer, send ack4; receive pkt5, buffer, send ack5
5. sender: pkt 2 timeout, send pkt2; record ack4 arrived; record ack5 arrived
6. receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size=3

(a) no problem
1. sender sends pkt0, pkt1, pkt2; receiver gets all three and ACKs them; receiver window (after receipt) advances to 3, 0, 1
2. ACKs arrive; sender window (after receipt) advances; sender sends pkt3, which is lost (X), then a new pkt0
3. receiver will accept packet with seq number 0

(b) oops!
1. sender sends pkt0, pkt1, pkt2; receiver gets all three and ACKs them; receiver window (after receipt) advances to 3, 0, 1
2. all three ACKs are lost (X X X); sender times out and retransmits pkt0
3. receiver will accept packet with seq number 0

receiver can't see sender side; receiver behavior identical in both cases! something's (very) wrong!

• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver

TCP segment structure

Fields, top to bottom (each row 32 bits wide):
• source port #, dest port #
• sequence number
• acknowledgement number
• head len, not used, flag bits U A P R S F, receive window
• checksum, Urg data pointer
• options (variable length)
• application data (variable length)

Notes:
• sequence and acknowledgement numbers: counting by bytes of data (not segments!)
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now (generally not used)
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)

TCP seq. numbers, ACKs

sequence numbers:
  – byte stream "number" of first byte in segment's data

acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK

Q: how receiver handles out-of-order segments
  – A: TCP spec doesn't say; up to implementor

[Figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer. Sender sequence number space with window size N: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable.]

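Byte stream in TCP

[Figure: an HTTP GET message (K bytes) in the byte stream, with a window of N bytes. The first M bytes, starting at the 100th byte, are sent behind a TCP header (seq. no. = 100); the part of the message beyond the window cannot be transmitted now.]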

TCP seq. numbers, ACKs

simple telnet scenario (Host A, Host B):

1. User types 'C'. Host A → Host B: Seq=42, ACK=79, data = 'C'
2. Host B ACKs receipt of 'C', echoes back 'C'. Host B → Host A: Seq=79, ACK=43, data = 'C'
3. Host A ACKs receipt of echoed 'C'. Host A → Host B: Seq=43, ACK=80

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT

TCP round trip time, timeout

$$EstimatedRTT = (1-\alpha) \cdot EstimatedRTT + \alpha \cdot SampleRTT$$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Plot: RTT, gaia.cs.umass.edu to fantasia.eurecom.fr. x-axis: time (seconds), 1 to 106; y-axis: RTT (milliseconds), 100 to 350; two curves: SampleRTT and EstimatedRTT.]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

$$DevRTT = (1-\beta) \cdot DevRTT + \beta \cdot |SampleRTT - EstimatedRTT|$$

  (typically, β = 0.25)

$$TimeoutInterval = EstimatedRTT + 4 \cdot DevRTT$$

  (estimated RTT plus "safety margin")
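The three update rules can be combined into a small estimator. The sketch below is illustrative and makes choices the slides do not specify: the class name `RttEstimator`, seeding EstimatedRTT with the first sample, starting DevRTT at 0, and updating EstimatedRTT before DevRTT are all assumptions made here.

```python
class RttEstimator:
    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample  # assumption: seed with first sample
        self.dev_rtt = 0.0                 # assumption: no deviation seen yet

    def update(self, sample_rtt):
        """Fold one SampleRTT into the estimates; return TimeoutInterval."""
        # EstimatedRTT = (1 - alpha) * EstimatedRTT + alpha * SampleRTT
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        # DevRTT = (1 - beta) * DevRTT + beta * |SampleRTT - EstimatedRTT|
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        return self.timeout_interval()

    def timeout_interval(self):
        # TimeoutInterval = EstimatedRTT + 4 * DevRTT
        return self.estimated_rtt + 4 * self.dev_rtt
```

A single outlier sample moves EstimatedRTT only by a fraction α of the difference, but it widens the safety margin through DevRTT.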

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks

let's initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control

TCP sender events

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments

TCP sender (simplified)

(single state: wait for event; Λ denotes "no event")

initially (Λ): NextSeqNum = InitialSeqNum; SendBase = InitialSeqNum

• data received from application above:
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
        start timer

• timeout:
    retransmit not-yet-acked segment with smallest seq. #
    start timer

• ACK received, with ACK field value y:
    if (y > SendBase) {
        SendBase = y
        /* SendBase–1: last cumulatively ACKed byte */
        if (there are currently not-yet-acked segments)
            start timer
        else
            stop timer
    }
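The three events of the simplified sender can be sketched in code. This is an illustration only: the class name `SimpleTcpSender`, the `send(seq, data)` callback standing in for "pass segment to IP", and the boolean `timer_running` flag are assumptions made here, and duplicate ACKs, flow control and congestion control are ignored, as on the slide.

```python
class SimpleTcpSender:
    def __init__(self, initial_seq, send):
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.unacked = {}          # seq of first byte -> segment data
        self.timer_running = False
        self.send = send           # hypothetical hook: pass segment to IP

    def on_app_data(self, data):
        self.send(self.next_seq, data)
        self.unacked[self.next_seq] = data
        self.next_seq += len(data)
        if not self.timer_running:
            self.timer_running = True       # start timer

    def on_timeout(self):
        if self.unacked:
            seq = min(self.unacked)         # not-yet-acked segment, smallest seq
            self.send(seq, self.unacked[seq])
            self.timer_running = True       # start timer

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y              # y - 1 is last cumulatively ACKed byte
            for seq in list(self.unacked):
                if seq + len(self.unacked[seq]) <= y:
                    del self.unacked[seq]
            self.timer_running = bool(self.unacked)
```

An ACK with a value at or below SendBase changes nothing, which is why a late ACK=100 after ACK=120 has no effect.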

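TCP: retransmission scenarios

lost ACK scenario (Host A, Host B)
1. A → B: Seq=92, 8 bytes of data
2. B → A: ACK=100, lost (X)
3. A: timeout; A → B: Seq=92, 8 bytes of data
4. B → A: ACK=100

premature timeout (Host A, Host B); SendBase=92 initially
1. A → B: Seq=92, 8 bytes of data
2. A → B: Seq=100, 20 bytes of data
3. B → A: ACK=100 and ACK=120, both still in flight when the timer expires
4. A: timeout; A → B: Seq=92, 8 bytes of data
5. A receives ACK=100: SendBase=100; A receives ACK=120: SendBase=120
6. B → A: ACK=120 (for the retransmission); SendBase=120

cumulative ACK (Host A, Host B)
1. A → B: Seq=92, 8 bytes of data
2. A → B: Seq=100, 20 bytes of data
3. B → A: ACK=100, lost (X)
4. B → A: ACK=120, arrives before timeout
5. A → B: Seq=120, 15 bytes of data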

TCP ACK generation [RFC 5681]

• event at receiver: arrival of in-order segment with expected seq #. All data up to expected seq # already ACKed
  TCP receiver action: delayed ACK. Wait up to 500ms for next segment. If no next segment, send ACK

• event at receiver: arrival of in-order segment with expected seq #. One other segment has ACK pending
  TCP receiver action: immediately send single cumulative ACK, ACKing both in-order segments

• event at receiver: arrival of out-of-order segment with higher-than-expected seq #. Gap detected
  TCP receiver action: immediately send duplicate ACK, indicating seq # of next expected byte

• event at receiver: arrival of segment that partially or completely fills gap
  TCP receiver action: immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs

if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  – likely that unacked segment lost, so don't wait for timeout

fast retransmit after sender receipt of triple duplicate ACK (Host A, Host B):
1. A → B: Seq=92, 8 bytes of data
2. A → B: Seq=100, 20 bytes of data, lost (X)
3. B → A: ACK=100, then three more ACK=100 as later segments arrive (3 DUP ACKs)
4. A → B: Seq=100, 20 bytes of data, resent before the timeout expires

TCP flow control

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)

[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code in the OS into the TCP socket receiver buffers, from which the application process reads.]

TCP flow control

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); data leaves to the application process.]
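The bookkeeping behind rwnd can be sketched in a few lines. This is an illustration, not part of the slides: the function names are hypothetical, and the variables LastByteRcvd, LastByteRead, LastByteSent and LastByteAcked are the usual textbook names, assumed here to describe the buffer and the in-flight data.

```python
def advertised_rwnd(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free buffer space the receiver advertises to the sender."""
    buffered = last_byte_rcvd - last_byte_read  # data not yet read by the app
    return rcv_buffer - buffered


def sendable_bytes(last_byte_sent, last_byte_acked, rwnd):
    """How many more bytes the sender may put in flight under flow control."""
    in_flight = last_byte_sent - last_byte_acked
    return max(0, rwnd - in_flight)
```

With a 4096-byte RcvBuffer holding 2000 unread bytes, the receiver advertises rwnd = 2096, and a sender with 1000 bytes in flight may send 1096 more.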

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

[Figure: client and server each hold connection state (ESTAB) and connection variables: seq # client-to-server and server-to-client; rcvBuffer size at server and client. This state sits between the application and the network.]

Socket clientSocket = newSocket(hostnameport number)

Socket connectionSocket = welcomeSocketaccept()

TCP 3-way handshake

80

SYNbit=1 Seq=x

choose init seq num xsend TCP SYN msg

ESTAB

SYNbit=1 Seq=yACKbit=1 ACKnum=x+1

choose init seq num ysend TCP SYNACKmsg acking SYN

ACKbit=1 ACKnum=y+1

received SYNACK(x) indicates server is livesend ACK for SYNACK

this segment may contain client-to-server data received ACK(y)

indicates client is live

SYNSENT

ESTAB

SYN RCVD

client stateCLOSED

server stateLISTEN

TCP 3-way handshake FSM

81

closed

L

listen

SYNrcvd

SYNsent

ESTAB

Socket clientSocket = newSocket(hostnameport number)

SYN(seq=x)

Socket connectionSocket = welcomeSocketaccept()

SYN(x)SYNACK(seq=yACKnum=x+1)create new socket for communication back to client

SYNACK(seq=yACKnum=x+1)ACK(ACKnum=y+1)ACK(ACKnum=y+1)

L

TCP closing a connection

bull client server each close their side of connectionndash send TCP segment with FIN bit = 1

bull respond to received FIN with ACKndash on receiving FIN ACK can be combined with own FIN

bull simultaneous FIN exchanges can be handled

82

FIN_WAIT_2

CLOSE_WAIT

FINbit=1 seq=y

ACKbit=1 ACKnum=y+1

ACKbit=1 ACKnum=x+1wait for server

close

can stillsend data

can no longersend data

LAST_ACK

CLOSED

TIMED_WAIT

timed wait for 2max

segment lifetime

CLOSED

TCP closing a connection

83

FIN_WAIT_1 FINbit=1 seq=xcan no longersend but canreceive data

clientSocketclose()

client state server stateESTABESTAB

The ldquoTwo Army Problemrdquo

84

Principles of congestion control

congestionbull informally ldquotoo many sources sending too much data

too fast for network to handlerdquobull different from flow controlbull manifestations

ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)

bull a top-10 problem

85

Causescosts of congestion scenario 1

bull two senders two receivers

bull one router infinite buffers

bull output link capacity Rbull no retransmission

bull maximum per-connection throughput R2

86

unlimited shared output link buffers

Host A

original data lin

Host B

throughput lout

R2

R2

l out

lin R2

dela

ylin

v large delays as arrival rate lin approaches capacity

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
– application-layer input = application-layer output: λin = λout
– transport-layer input includes retransmissions: λ'in ≥ λin

[Figure: Host A and Host B share finite output link buffers. λin: original data; λ'in: original data, plus retransmitted data; λout.]

idealization: perfect knowledge
• sender sends only when router buffers available

[Figure: the sender keeps a copy of each packet and transmits when there is free buffer space, holding back when there is no buffer space. Plot: λout against λin, both axes marked at R/2.]

Causes/costs of congestion: scenario 2

Idealization: known loss
packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
• when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)

[Figure: Host A keeps a copy of each packet and resends it once the loss is known and there is free buffer space. Plot: λout against λ'in, both axes marked at R/2.]

Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered!

[Figure: Host A times out and resends a packet whose first copy was only delayed. Plot: λout against λ'in, both axes marked at R/2.]

“costs” of congestion:
• more work (retrans) for given “goodput”
• unneeded retransmissions: link carries multiple copies of pkt
– decreasing goodput

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λin and λ'in increase?
A: as red λ'in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

[Figure: Host A, Host B, Host C and Host D send over multihop paths through routers with finite shared output link buffers. λin: original data; λ'in: original data, plus retransmitted data; λout.]

another “cost” of congestion:
• when packet dropped, any “upstream” transmission capacity used for that packet was wasted!

[Figure: λout against λ'in, axes marked at R/2. Throughput by blue traffic against offered load by Host A; the gap is the bandwidth wastage for packets dropped at the 2nd router.]

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
– single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
– explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
– additive increase: increase cwnd by 1 MSS every RTT until loss detected
– multiplicative decrease: cut cwnd in half after loss

[Figure: AIMD saw tooth behavior: probing for bandwidth. cwnd (TCP sender congestion window size) against time: additively increase window size … until loss occurs (then cut window in half).]
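The saw tooth can be reproduced with a few lines of Python. This is only a sketch: the function name `aimd_trace` and the `loss_rounds` input are hypothetical, cwnd is counted in units of MSS, and losses are simply given as a set of round numbers rather than inferred from ACKs.

```python
def aimd_trace(rounds, loss_rounds, start_mss=1.0):
    """cwnd (in units of MSS) at the start of each RTT round.

    loss_rounds is a hypothetical set of round indices in which a loss
    is detected; a real sender infers this from duplicate ACKs or timeouts.
    """
    cwnd = start_mss
    trace = []
    for rnd in range(rounds):
        trace.append(cwnd)
        if rnd in loss_rounds:
            cwnd = max(1.0, cwnd / 2)  # multiplicative decrease: halve cwnd
        else:
            cwnd += 1.0                # additive increase: +1 MSS per RTT
    return trace


if __name__ == "__main__":
    print(aimd_trace(8, {3, 6}, start_mss=4.0))
```

Plotting the returned list against the round index gives the saw tooth: straight ramps of slope 1 MSS per RTT, each ended by a halving.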

TCP Congestion Control: details

• sender limits transmission: LastByteSent - LastByteAcked ≤ cwnd
• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

$\text{rate} \approx \dfrac{\text{cwnd}}{RTT}$ bytes/sec

[Figure: sender sequence number space: last byte ACKed; sent, not-yet ACKed (“in-flight”); last byte sent. The in-flight span is bounded by cwnd.]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
– initially cwnd = 1 MSS
– double cwnd every RTT
– done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A sends one segment, then two segments, then four segments to Host B, one burst per RTT, along the time axis.]
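Why a per-ACK increment doubles cwnd every RTT can be seen in a short sketch. It assumes, for illustration only, that every segment sent in a round is ACKed within that round and that nothing is lost; `slow_start_rounds` is a hypothetical name.

```python
def slow_start_rounds(num_rounds, mss=1):
    """Segments' worth of cwnd at the start of each RTT round (no loss)."""
    cwnd = mss
    sizes = []
    for _ in range(num_rounds):
        sizes.append(cwnd)
        acks = cwnd // mss      # one ACK comes back per segment sent this round
        for _ in range(acks):
            cwnd += mss         # +1 MSS per ACK, so cwnd doubles per RTT
    return sizes


if __name__ == "__main__":
    print(slow_start_rounds(4))
```

The first three values correspond to the one, two and four segments shown in the figure.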

TCP: detecting, reacting to loss

• loss indicated by timeout:
– cwnd set to 1 MSS
– window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
– dup ACKs indicate network capable of delivering some segments
– cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

initial transition (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; enter slow start

slow start
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
• cwnd > ssthresh: no action (Λ); go to congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery

congestion avoidance
• new ACK: cwnd = cwnd + MSS * (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery

fast recovery
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment; go to slow start
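The three-state machine can be sketched as a small Python class. This is an illustrative sketch rather than a real TCP implementation: `RenoSender` and the `MSS` value of 1460 bytes are assumptions, "ssthresh + 3" is read as "ssthresh + 3 MSS", and retransmission and actual sending are left out so that only the window bookkeeping remains.

```python
MSS = 1460  # bytes; assumed value for illustration


class RenoSender:
    """Window bookkeeping for slow start, congestion avoidance, fast recovery."""

    def __init__(self):
        self.state = "slow start"
        self.cwnd = 1 * MSS
        self.ssthresh = 64 * 1024
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "fast recovery":
            self.cwnd = self.ssthresh            # deflate window, leave recovery
            self.state = "congestion avoidance"
        elif self.state == "slow start":
            self.cwnd += MSS                     # exponential growth per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion avoidance"
        else:
            self.cwnd += MSS * MSS / self.cwnd   # about +1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast recovery":
            self.cwnd += MSS                     # each dup ACK means a segment left the network
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                   # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1 * MSS
        self.dup_acks = 0
        self.state = "slow start"
```

Feeding the object a sequence of `on_new_ack`, `on_dup_ack` and `on_timeout` calls and recording `cwnd` after each call traces the same transitions as the FSM.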

TCP throughput

• avg. TCP throughput as function of window size, RTT?
– ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
– avg. window size (# in-flight bytes) is ¾ W
– avg. throughput is 3/4 W per RTT

$\text{avg TCP throughput} = \dfrac{3}{4} \cdot \dfrac{W}{RTT}$ bytes/sec

[Figure: window size oscillating between W/2 and W.]
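The ¾ W figure is the mean of a window that ramps linearly from W/2 to W. A small numerical check, with hypothetical function names and an idealized saw tooth, is sketched below.

```python
def mean_sawtooth_window(w, steps=1000):
    """Average of a window that ramps linearly from W/2 up to W."""
    samples = [w / 2 + (w / 2) * i / steps for i in range(steps + 1)]
    return sum(samples) / len(samples)


def avg_tcp_throughput(w_bytes, rtt_s):
    """Average throughput in bytes/sec: 3/4 * W / RTT."""
    return 0.75 * w_bytes / rtt_s


if __name__ == "__main__":
    print(mean_sawtooth_window(100.0))        # close to 75
    print(avg_tcp_throughput(100_000, 0.5))   # bytes/sec
```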

TCP Futures: TCP over “long, fat pipes”

• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

$\text{TCP throughput} = \dfrac{1.22 \cdot MSS}{RTT \sqrt{L}}$

• to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate!
• new versions of TCP for high-speed
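Both numbers on this slide follow from simple arithmetic, sketched here in Python. The function names are hypothetical; MSS is converted from bytes to bits so that throughput is in bits/sec.

```python
import math


def window_in_segments(target_bps, mss_bytes, rtt_s):
    """In-flight segments needed to sustain target_bps: rate * RTT / segment size."""
    return target_bps * rtt_s / (mss_bytes * 8)


def mathis_throughput_bps(mss_bytes, rtt_s, loss):
    """throughput = 1.22 * MSS / (RTT * sqrt(L)), with MSS in bits."""
    return 1.22 * mss_bytes * 8 / (rtt_s * math.sqrt(loss))


def required_loss_rate(target_bps, mss_bytes, rtt_s):
    """Solve the formula above for L."""
    return (1.22 * mss_bytes * 8 / (rtt_s * target_bps)) ** 2


if __name__ == "__main__":
    print(window_in_segments(10e9, 1500, 0.1))   # about 83,333 segments
    print(required_loss_rate(10e9, 1500, 0.1))   # about 2e-10
```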

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput against Connection 1 throughput, both axes from 0 to R, with the equal bandwidth share line and the full bandwidth utilization line. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2". Points (X1, Y1) where X1 + Y1 = R lie on the full bandwidth utilization line; points (X2, Y2) where X2 = Y2 lie on the equal bandwidth share line.]

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
– do not want rate throttled by congestion control
• instead use UDP:
– send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
– new app asks for 1 TCP, gets rate R/10
– new app asks for 11 TCPs, gets R/2
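The parallel-connection example is a matter of counting connections, assuming every TCP connection gets an equal share of the link. The helper name `app_share` is hypothetical.

```python
from fractions import Fraction


def app_share(existing_conns, new_conns):
    """Fraction of link rate R that an app opening new_conns connections gets,
    assuming R is split equally among all connections on the link."""
    return Fraction(new_conns, existing_conns + new_conns)


if __name__ == "__main__":
    print(app_share(9, 1))    # share with 1 new connection
    print(app_share(9, 11))   # share with 11 new connections
```

With 11 new connections the exact share is 11/20 of R, which the slide rounds to R/2.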

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and is marked ECN=11 on the way; the destination returns a TCP ACK segment with ECE=1.]


rdt2.1: discussion

sender:
• seq # added to pkt
• two seq. #'s (0,1) will suffice. Why?
• must check if received ACK/NAK corrupted
• twice as many states
– state must “remember” whether “expected” pkt should have seq # of 0 or 1

receiver:
• must check if received packet is duplicate
– state indicates whether 0 or 1 is expected pkt seq #
• note: receiver cannot know if its last ACK/NAK received OK at sender

rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
– receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt

rdt2.2: sender, receiver fragments

sender FSM fragment

state: Wait for call 0 from above
• event: rdt_send(data)
  actions: sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
  next state: Wait for ACK 0

state: Wait for ACK 0
• event: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1))
  actions: udt_send(sndpkt)
• event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0)
  actions: none (Λ)

receiver FSM fragment

transition into state: Wait for 0 from below
• event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
  actions: extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)

state: Wait for 0 from below
• event: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt))
  actions: udt_send(sndpkt)

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
– checksum, seq. #, ACKs, retransmissions will be of help … but not enough

approach: sender waits “reasonable” amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
– retransmission will be duplicate, but seq. #'s already handles this
– receiver must specify seq # of pkt being ACKed
• requires countdown timer

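rdt3.0 sender

state: Wait for call 0 from above
• event: rdt_send(data)
  actions: sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer
  next state: Wait for ACK0
• event: rdt_rcv(rcvpkt)
  actions: none (Λ)

state: Wait for ACK0
• event: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,1))
  actions: none (Λ)
• event: timeout
  actions: udt_send(sndpkt); start_timer
• event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0)
  actions: stop_timer
  next state: Wait for call 1 from above

state: Wait for call 1 from above
• event: rdt_send(data)
  actions: sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer
  next state: Wait for ACK1
• event: rdt_rcv(rcvpkt)
  actions: none (Λ)

state: Wait for ACK1
• event: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt,0))
  actions: none (Λ)
• event: timeout
  actions: udt_send(sndpkt); start_timer
• event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1)
  actions: stop_timer
  next state: Wait for call 0 from above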

rdt3.0 in action

(a) no loss
• sender: send pkt0; receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1; receiver: rcv pkt1, send ack1
• sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0

(b) packet loss
• sender: send pkt0; receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1; pkt1 is lost (X)
• sender: timeout, resend pkt1; receiver: rcv pkt1, send ack1
• sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0

(c) ACK loss
• sender: send pkt0; receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1; receiver: rcv pkt1, send ack1; ack1 is lost (X)
• sender: timeout, resend pkt1; receiver: rcv pkt1 (detect duplicate), send ack1
• sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
• sender: send pkt0; receiver: rcv pkt0, send ack0
• sender: rcv ack0, send pkt1; receiver: rcv pkt1, send ack1 (ack1 is delayed)
• sender: timeout, resend pkt1; receiver: rcv pkt1 (detect duplicate), send ack1
• sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0
• sender: rcv ack1 (the second copy), do nothing
• sender: rcv ack0, send pkt1

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

$D_{trans} = \dfrac{L}{R} = \dfrac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \text{microsecs}$

• U sender: utilization, fraction of time sender busy sending

$U_{sender} = \dfrac{L/R}{RTT + L/R} = \dfrac{0.008}{30.008} = 0.00027$

• if RTT = 30 msec, 1KB pkt every 30 msec: 33kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
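The utilization figure can be checked directly. The sketch below uses hypothetical function names and takes RTT as twice the 15 ms propagation delay.

```python
def stop_and_wait_utilization(packet_bits, rate_bps, rtt_s):
    """U_sender = (L/R) / (RTT + L/R)."""
    d_trans = packet_bits / rate_bps
    return d_trans / (rtt_s + d_trans)


def stop_and_wait_throughput(packet_bits, rate_bps, rtt_s):
    """Bytes per second when one packet is delivered per RTT + L/R."""
    d_trans = packet_bits / rate_bps
    return (packet_bits / 8) / (rtt_s + d_trans)


if __name__ == "__main__":
    print(stop_and_wait_utilization(8000, 1e9, 0.030))   # about 0.00027
    print(stop_and_wait_throughput(8000, 1e9, 0.030))    # about 33 kB/sec
```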

rdt3.0: stop-and-wait operation

[Figure: sender/receiver timeline. first packet bit transmitted, t = 0; last packet bit transmitted, t = L / R; first packet bit arrives; last packet bit arrives, send ACK; ACK arrives, send next packet, t = RTT + L / R.]

$U_{sender} = \dfrac{L/R}{RTT + L/R} = \dfrac{0.008}{30.008} = 0.00027$

Pipelined protocols

pipelining: sender allows multiple, “in-flight”, yet-to-be-acknowledged pkts
– range of sequence numbers must be increased
– buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

[Figure: sender/receiver timeline. first packet bit transmitted, t = 0; last bit transmitted, t = L / R; first packet bit arrives; last packet bit arrives, send ACK; last bit of 2nd packet arrives, send ACK; last bit of 3rd packet arrives, send ACK; ACK arrives, send next packet, t = RTT + L / R.]

3-packet pipelining increases utilization by a factor of 3!

$U_{sender} = \dfrac{3L/R}{RTT + L/R} = \dfrac{0.024}{30.008} \approx 0.0008$
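Generalizing from 3 packets to n is a one-line change, sketched below with a hypothetical function name. Utilization is capped at 1 because the sender cannot be busy more than all of the time.

```python
def pipelined_utilization(n, packet_bits, rate_bps, rtt_s):
    """U_sender with n packets in flight: min(1, n * (L/R) / (RTT + L/R))."""
    d_trans = packet_bits / rate_bps
    return min(1.0, n * d_trans / (rtt_s + d_trans))


if __name__ == "__main__":
    for n in (1, 3, 1000, 4000):
        print(n, pipelined_utilization(n, 8000, 1e9, 0.030))
```

With these link parameters the pipe only fills once several thousand packets are in flight, which is why window size matters so much on fast, long links.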

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
– doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
– when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
– when timer expires, retransmit only that unacked packet

Go-Back-N: sender

• k-bit seq # in pkt header
• “window” of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n - “cumulative ACK”
– may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

single state: Wait

initial transition (Λ):
  base = 1
  nextseqnum = 1

event: rdt_send(data)
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)

event: timeout
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  …
  udt_send(sndpkt[nextseqnum-1])

event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer

event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
  no action (Λ)
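The sender FSM translates almost line for line into a small Python class. This is a sketch under simplifying assumptions: the class name `GBNSender` is hypothetical, the timer is reduced to a boolean flag, the channel is a list that logs what was sent, sequence numbers do not wrap, and a `max` guard (not in the FSM) keeps a stale ACK from moving `base` backwards.

```python
class GBNSender:
    """Go-Back-N sender bookkeeping: window, timer flag, retransmit-all."""

    def __init__(self, window_size):
        self.N = window_size
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}            # seq -> data, kept until cumulatively ACKed
        self.timer_running = False
        self.sent = []              # log of (seq, data) handed to the channel

    def _udt_send(self, seq):
        self.sent.append((seq, self.sndpkt[seq]))

    def rdt_send(self, data):
        """Return True if data was accepted, False if the window is full."""
        if self.nextseqnum < self.base + self.N:
            self.sndpkt[self.nextseqnum] = data
            self._udt_send(self.nextseqnum)
            if self.base == self.nextseqnum:
                self.timer_running = True    # start_timer
            self.nextseqnum += 1
            return True
        return False                         # refuse_data

    def timeout(self):
        self.timer_running = True            # restart timer
        for seq in range(self.base, self.nextseqnum):
            self._udt_send(seq)              # resend every unACKed packet

    def rdt_rcv_ack(self, acknum):
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)       # cumulative ACK frees the buffer
        self.base = max(self.base, acknum + 1)
        self.timer_running = self.base != self.nextseqnum
```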


GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
– may generate duplicate ACKs
– need only remember expectedseqnum
• out-of-order pkt:
– discard (don't buffer): no receiver buffering!
– re-ACK pkt with highest in-order seq #

single state: Wait

initial transition (Λ):
  expectedseqnum = 1
  sndpkt = make_pkt(0, ACK, chksum)

event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum)
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++

event: default
  udt_send(sndpkt)
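The receiver side is even smaller. In this sketch the class name `GBNReceiver` is hypothetical, a packet is just a (seq, data) pair, and corruption is passed in as a flag instead of being detected with a checksum.

```python
class GBNReceiver:
    """Go-Back-N receiver: deliver in-order packets, re-ACK otherwise."""

    def __init__(self):
        self.expectedseqnum = 1
        self.last_ack = 0           # matches the initial make_pkt(0, ACK, chksum)
        self.delivered = []

    def rdt_rcv(self, seq, data, corrupt=False):
        """Return the ACK number sent in response to this packet."""
        if not corrupt and seq == self.expectedseqnum:
            self.delivered.append(data)       # deliver_data
            self.last_ack = self.expectedseqnum
            self.expectedseqnum += 1
        # default case: out-of-order or corrupt packets are discarded and
        # the ACK for the highest in-order packet is sent again
        return self.last_ack
```

A gap shows up at the sender as a repeated ACK number, which is exactly the duplicate ACK the sender slide mentions.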


GBN in action

sender window (N=4) over sequence numbers 0 1 2 3 4 5 6 7 8

• sender: send pkt0, send pkt1, send pkt2, send pkt3, (wait); pkt2 is lost (X)
• receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, discard, (re)send ack1
• sender: rcv ack0, send pkt4; rcv ack1, send pkt5; ignore duplicate ACK
• receiver: receive pkt4, discard, (re)send ack1; receive pkt5, discard, (re)send ack1
• sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
• receiver: rcv pkt2, deliver, send ack2; rcv pkt3, deliver, send ack3; rcv pkt4, deliver, send ack4; rcv pkt5, deliver, send ack5


Selective repeat

• receiver individually acknowledges all correctly received packets
– buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
– sender timer for each unACKed packet
• sender window
– N consecutive seq #'s
– limits seq #s of sent, unACKed packets

Selective repeat: sender, receiver windows

Selective repeat

sender

data from above:
• if next available seq # in window, send pkt

timeout(n):
• resend pkt n, restart timer

ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver

pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt

pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)

otherwise:
• ignore
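The three receiver cases map onto three branches. The sketch below assumes, for simplicity, sequence numbers that never wrap around; `SRReceiver` and `on_packet` are hypothetical names.

```python
class SRReceiver:
    """Selective Repeat receiver with unbounded sequence numbers."""

    def __init__(self, window_size):
        self.N = window_size
        self.rcvbase = 0
        self.buffer = {}            # out-of-order packets awaiting delivery
        self.delivered = []

    def on_packet(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcvbase <= n <= self.rcvbase + self.N - 1:
            self.buffer[n] = data
            # deliver the in-order run starting at rcvbase, advancing the window
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n <= self.rcvbase - 1:
            return n                # already delivered: re-ACK so the sender can advance
        return None                 # otherwise: ignore
```

The second branch matters: without the re-ACK, a sender whose ACK was lost would keep retransmitting a packet the receiver already has.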

Selective repeat in action

sender window (N=4) over sequence numbers 0 1 2 3 4 5 6 7 8

• sender: send pkt0, send pkt1, send pkt2, send pkt3, (wait); pkt2 is lost (X)
• receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, buffer, send ack3
• sender: rcv ack0, send pkt4; rcv ack1, send pkt5; record ack3 arrived
• receiver: receive pkt4, buffer, send ack4; receive pkt5, buffer, send ack5
• sender: pkt 2 timeout; send pkt2; record ack4 arrived; record ack5 arrived
• receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size=3
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?

[Figure: sender window (after receipt) and receiver window (after receipt) over the sequence 0 1 2 3 0 1 2.
(a) no problem: sender sends pkt0, pkt1, pkt2, all are ACKed, and the sender window advances; pkt3 is lost (X) and a new pkt0 follows; receiver will accept packet with seq number 0.
(b) oops!: sender sends pkt0, pkt1, pkt2, but all three ACKs are lost (X X X); timeout, retransmit pkt0; receiver will accept packet with seq number 0.
receiver can't see sender side; receiver behavior identical in both cases! something's (very) wrong!]
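One way to explore the question is to compare the sequence numbers the sender may still retransmit with those the receiver treats as new. The sketch below is an illustration with hypothetical function names; it assumes the receiver has received a full window while every ACK was lost, which is the worst case shown in (b).

```python
def ambiguous_seqs(seq_space, window):
    """Sequence numbers that could be either an old retransmission or new data."""
    old = {i % seq_space for i in range(0, window)}            # sender may resend these
    new = {i % seq_space for i in range(window, 2 * window)}   # receiver expects these
    return sorted(old & new)


def sr_window_is_safe(seq_space, window):
    """No ambiguity when the window is at most half the sequence number space."""
    return 2 * window <= seq_space


if __name__ == "__main__":
    print(ambiguous_seqs(4, 3), sr_window_is_safe(4, 3))
    print(ambiguous_seqs(4, 2), sr_window_is_safe(4, 2))
```

For the slide's example (4 sequence numbers, window 3) the overlap includes seq number 0, the packet wrongly accepted in (b); shrinking the window to 2 removes the overlap.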

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
– one sender, one receiver
• reliable, in-order byte stream:
– no “message boundaries”
• pipelined:
– TCP congestion and flow control set window size
• full duplex data:
– bi-directional data flow in same connection
– MSS: maximum segment size
• connection-oriented:
– handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
– sender will not overwhelm receiver

TCP segment structure

header fields (each row is 32 bits wide):
• source port #, dest port #
• sequence number
• acknowledgement number
• head len, not used, flag bits U A P R S F, receive window
• checksum, Urg data pointer
• options (variable length)
• application data (variable length)

notes:
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• checksum: Internet checksum (as in UDP)

TCP seq. numbers, ACKs

sequence numbers:
– byte stream “number” of first byte in segment's data

acknowledgements:
– seq # of next byte expected from other side
– cumulative ACK

Q: how receiver handles out-of-order segments
– A: TCP spec doesn't say, up to implementor

[Figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer. Sender sequence number space with window size N: sent ACKed; sent, not-yet ACKed (“in-flight”); usable but not yet sent; not usable.]

Byte stream in TCP

[Figure: an HTTP Get Message (K bytes) viewed as a byte stream. A window of N bytes covers part of it; a segment carrying M bytes starts at the 100th byte, behind a TCP header (seq. no = 100); bytes beyond the window cannot be transmitted now.]

TCP seq. numbers, ACKs

simple telnet scenario (Host A, Host B)

• User types ‘C’: Host A sends Seq=42, ACK=79, data = ‘C’
• host ACKs receipt of ‘C’, echoes back ‘C’: Host B sends Seq=79, ACK=43, data = ‘C’
• host ACKs receipt of echoed ‘C’: Host A sends Seq=43, ACK=80

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
– but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
– ignore retransmissions
• SampleRTT will vary, want estimated RTT “smoother”
– average several recent measurements, not just current SampleRTT

TCP round trip time, timeout

$\text{EstimatedRTT} = (1 - \alpha) \cdot \text{EstimatedRTT} + \alpha \cdot \text{SampleRTT}$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Figure: RTT: gaia.cs.umass.edu to fantasia.eurecom.fr. SampleRTT and EstimatedRTT in milliseconds (axis from 100 to 350) against time in seconds.]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus “safety margin”
– large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

$\text{DevRTT} = (1 - \beta) \cdot \text{DevRTT} + \beta \cdot |\text{SampleRTT} - \text{EstimatedRTT}|$

(typically, β = 0.25)

$\text{TimeoutInterval} = \text{EstimatedRTT} + 4 \cdot \text{DevRTT}$

(estimated RTT plus “safety margin”)
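The estimator fits in a few lines of Python. Several details here are assumptions the slides do not fix: the class name `RTTEstimator`, seeding EstimatedRTT with the first sample and DevRTT with 0, and updating DevRTT before EstimatedRTT (the order used by RFC 6298).

```python
class RTTEstimator:
    """EWMA estimate of RTT and its deviation, plus the timeout interval."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample   # assumed seeding, not from the slides
        self.dev_rtt = 0.0                  # assumed seeding, not from the slides

    def update(self, sample_rtt):
        # deviation first, measured against the estimate before this sample
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


if __name__ == "__main__":
    est = RTTEstimator(100.0)               # milliseconds
    for sample in (200.0, 110.0, 105.0):
        est.update(sample)
        print(est.estimated_rtt, est.dev_rtt, est.timeout_interval())
```

A single outlier sample moves EstimatedRTT only a little but widens the safety margin at once, which is the intended behaviour.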

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
– pipelined segments
– cumulative acks
– single retransmission timer
• retransmissions triggered by:
– timeout events
– duplicate acks

let's initially consider simplified TCP sender:
– ignore duplicate acks
– ignore flow control, congestion control

TCP sender events:

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
– think of timer as for oldest unacked segment
– expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
– update what is known to be ACKed
– start timer if there are still unacked segments

TCP sender (simplified)

single state: wait for event

initial transition (Λ):
  NextSeqNum = InitialSeqNum
  SendBase = InitialSeqNum

event: data received from application above
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., “send”)
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
    start timer

event: timeout
  retransmit not-yet-acked segment with smallest seq. #
  start timer

event: ACK received, with ACK field value y
  if (y > SendBase) {
    SendBase = y
    /* SendBase–1: last cumulatively ACKed byte */
    if (there are currently not-yet-acked segments)
      start timer
    else stop timer
  }
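The three events can be sketched as methods of a small class. As with the slide, duplicate ACKs, flow control and congestion control are ignored; the class name `SimpleTCPSender`, the boolean timer flag and the `wire` log standing in for IP are assumptions made for illustration.

```python
class SimpleTCPSender:
    """Simplified TCP sender: byte sequence numbers, one timer, cumulative ACKs."""

    def __init__(self, initial_seq_num=0):
        self.next_seq_num = initial_seq_num
        self.send_base = initial_seq_num
        self.unacked = {}            # seq of first byte -> segment data
        self.timer_running = False
        self.wire = []               # log of (seq, data) passed to IP

    def on_app_data(self, data):
        seq = self.next_seq_num
        self.unacked[seq] = data
        self.wire.append((seq, data))            # pass segment to IP
        self.next_seq_num += len(data)
        if not self.timer_running:
            self.timer_running = True            # start timer

    def on_timeout(self):
        if self.unacked:
            seq = min(self.unacked)              # not-yet-acked segment with smallest seq
            self.wire.append((seq, self.unacked[seq]))
            self.timer_running = True            # restart timer

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y                   # bytes up to y-1 are now ACKed
            fully_acked = [s for s, d in self.unacked.items() if s + len(d) <= y]
            for s in fully_acked:
                del self.unacked[s]
            self.timer_running = bool(self.unacked)
```

Replaying the premature-timeout scenario (Seq=92 with 8 bytes, Seq=100 with 20 bytes, a timeout, then ACK=120) retransmits only the segment starting at 92 and ends with SendBase=120.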

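TCP: retransmission scenarios

lost ACK scenario (Host A, Host B)
• A sends Seq=92, 8 bytes of data
• B replies ACK=100, which is lost (X)
• A times out and resends Seq=92, 8 bytes of data
• B replies ACK=100

premature timeout (Host A, Host B)
• SendBase=92: A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
• B replies ACK=100, then ACK=120
• A times out before the ACKs arrive and resends Seq=92, 8 bytes of data
• the ACKs arrive: SendBase=100, then SendBase=120
• B answers the retransmission with ACK=120; SendBase=120

cumulative ACK (Host A, Host B)
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
• B's ACK=100 is lost (X), but ACK=120 arrives before the timeout
• A does not retransmit and goes on to send Seq=120, 15 bytes of data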

TCP ACK generation [RFC 5681]

event at receiver: arrival of in-order segment with expected seq #. All data up to expected seq # already ACKed
TCP receiver action: delayed ACK. Wait up to 500ms for next segment. If no next segment, send ACK

event at receiver: arrival of in-order segment with expected seq #. One other segment has ACK pending
TCP receiver action: immediately send single cumulative ACK, ACKing both in-order segments

event at receiver: arrival of out-of-order segment, higher-than-expect seq #. Gap detected
TCP receiver action: immediately send duplicate ACK, indicating seq # of next expected byte

event at receiver: arrival of segment that partially or completely fills gap
TCP receiver action: immediate send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
– long delay before resending lost packet
• detect lost segments via duplicate ACKs
– sender often sends many segments back-to-back
– if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data (“triple duplicate ACKs”), resend unacked segment with smallest seq #
– likely that unacked segment lost, so don't wait for timeout

[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost (X). Host B answers ACK=100, followed by three more ACK=100 (3 DUP ACKs). Host A resends Seq=100, 20 bytes of data before the timeout expires.]

TCP flow control

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)

[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code into the TCP socket receiver buffers (OS), from which the application process reads.]

• receiver “advertises” free buffer space by including rwnd value in TCP header of receiver-to-sender segments
– RcvBuffer size set via socket options (typical default is 4096 bytes)
– many operating systems autoadjust RcvBuffer
• sender limits amount of unacked (“in-flight”) data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data and free buffer space (rwnd); data leaves to application process.]
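The bookkeeping behind rwnd can be sketched as follows. `ReceiveBuffer` and `sender_may_send` are hypothetical names, and the buffer tracks only byte counts, not the data itself.

```python
class ReceiveBuffer:
    """Tracks how much of RcvBuffer is occupied; rwnd is the free space."""

    def __init__(self, rcv_buffer=4096):
        self.rcv_buffer = rcv_buffer
        self.buffered = 0

    @property
    def rwnd(self):
        return self.rcv_buffer - self.buffered   # value advertised to the sender

    def on_segment(self, nbytes):
        if nbytes > self.rwnd:
            raise ValueError("sender exceeded the advertised window")
        self.buffered += nbytes

    def app_read(self, nbytes):
        taken = min(nbytes, self.buffered)       # application drains the buffer
        self.buffered -= taken
        return taken


def sender_may_send(in_flight, nbytes, rwnd):
    """Sender keeps unacked ("in-flight") data within the receiver's rwnd."""
    return in_flight + nbytes <= rwnd
```

When the application reads slowly, `rwnd` shrinks and `sender_may_send` starts returning False, which is how the receiver throttles the sender.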

Connection Management

before exchanging data, sender/receiver “handshake”:
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

[Figure: client and server each keep, between application and network: connection state: ESTAB; connection variables: seq # client-to-server, server-to-client; rcvBuffer size at server, client.]

Socket clientSocket = new Socket("hostname", "port number");

Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

client state: CLOSED; server state: LISTEN

• client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client state: SYNSENT
• server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server state: SYN RCVD
• client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client state: ESTAB
• server: received ACK(y) indicates client is live; server state: ESTAB
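The sequence and acknowledgement numbers in the three segments follow a fixed pattern, sketched here with plain dictionaries. The function name and the dictionary keys are assumptions for illustration and do not correspond to a real socket API.

```python
def three_way_handshake(x, y):
    """Return the SYN, SYNACK and ACK segments for initial seq numbers x and y."""
    syn = {"SYNbit": 1, "Seq": x}                                   # client -> server
    synack = {"SYNbit": 1, "Seq": y,
              "ACKbit": 1, "ACKnum": syn["Seq"] + 1}                # server -> client
    ack = {"ACKbit": 1, "ACKnum": synack["Seq"] + 1}                # client -> server
    return [syn, synack, ack]


if __name__ == "__main__":
    for segment in three_way_handshake(1000, 5000):
        print(segment)
```

Each ACKnum is the other side's initial sequence number plus one, because the SYN itself consumes one sequence number.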

TCP 3-way handshake FSM

81

closed

L

listen

SYNrcvd

SYNsent

ESTAB

Socket clientSocket = newSocket(hostnameport number)

SYN(seq=x)

Socket connectionSocket = welcomeSocketaccept()

SYN(x)SYNACK(seq=yACKnum=x+1)create new socket for communication back to client

SYNACK(seq=yACKnum=x+1)ACK(ACKnum=y+1)ACK(ACKnum=y+1)

L

TCP closing a connection

bull client server each close their side of connectionndash send TCP segment with FIN bit = 1

bull respond to received FIN with ACKndash on receiving FIN ACK can be combined with own FIN

bull simultaneous FIN exchanges can be handled

82

FIN_WAIT_2

CLOSE_WAIT

FINbit=1 seq=y

ACKbit=1 ACKnum=y+1

ACKbit=1 ACKnum=x+1wait for server

close

can stillsend data

can no longersend data

LAST_ACK

CLOSED

TIMED_WAIT

timed wait for 2max

segment lifetime

CLOSED

TCP closing a connection

83

FIN_WAIT_1 FINbit=1 seq=xcan no longersend but canreceive data

clientSocketclose()

client state server stateESTABESTAB

The ldquoTwo Army Problemrdquo

84

Principles of congestion control

congestionbull informally ldquotoo many sources sending too much data

too fast for network to handlerdquobull different from flow controlbull manifestations

ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)

bull a top-10 problem

85

Causescosts of congestion scenario 1

bull two senders two receivers

bull one router infinite buffers

bull output link capacity Rbull no retransmission

bull maximum per-connection throughput R2

86

unlimited shared output link buffers

Host A

original data lin

Host B

throughput lout

R2

R2

l out

lin R2

dela

ylin

v large delays as arrival rate lin approaches capacity

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: λ_in = λ_out
  – transport-layer input includes retransmissions: λ'_in ≥ λ_in

[Figure: Host A and Host B share a router with finite shared output link buffers; λ_in: original data; λ'_in: original data plus retransmitted data; λ_out: delivered data.]

idealization: perfect knowledge
• sender sends only when router buffers available
• [Plot: λ_out vs. λ_in, both axes up to R/2.]

idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
• when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)

realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered

"costs" of congestion:
• more work (retransmissions) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt, decreasing goodput

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λ_in and λ'_in increase?

A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

[Figure: Hosts A, B, C and D send over multihop paths through routers with finite shared output link buffers; λ_in: original data; λ'_in: original data plus retransmitted data; λ_out: delivered data.]

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!

[Plot: throughput of blue traffic (λ_out, up to R/2) vs. offered load by Host A (λ'_in), showing bandwidth wastage for packets dropped at the 2nd router.]

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss

[Plot: cwnd (TCP sender congestion window size) vs. time. AIMD sawtooth behavior, probing for bandwidth: additively increase window size … until loss occurs (then cut window in half).]
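The sawtooth can be reproduced with a few lines of Python. This is only a sketch under a simplifying assumption that is not on the slide: a loss is detected exactly when cwnd reaches a fixed `loss_threshold_mss`, a hypothetical stand-in for the available bandwidth. The function name and parameters are illustrative.

```python
def aimd_trace(rtts, loss_threshold_mss, start_mss=1):
    """Return cwnd (in MSS) at the start of each RTT under idealized AIMD.

    Assumption: a loss is detected whenever cwnd has reached
    loss_threshold_mss (hypothetical stand-in for link capacity).
    """
    cwnd = start_mss
    trace = []
    for _ in range(rtts):
        trace.append(cwnd)
        if cwnd >= loss_threshold_mss:
            cwnd = max(1, cwnd // 2)  # multiplicative decrease: halve cwnd
        else:
            cwnd += 1                 # additive increase: +1 MSS per RTT
    return trace


if __name__ == "__main__":
    print(aimd_trace(10, 4))
```

With a threshold of 4 MSS the window climbs 1, 2, 3, 4, is halved to 2, and then repeats the 2, 3, 4 ramp.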

TCP Congestion Control: details

• sender limits transmission:

  LastByteSent - LastByteAcked ≤ cwnd

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

  rate ≈ cwnd/RTT bytes/sec

[Figure: sender sequence number space, showing last byte ACKed, bytes sent but not yet ACKed ("in-flight"), and last byte sent; the in-flight span is bounded by cwnd.]
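As a small illustration, the window constraint and the rate approximation can be written as two helper functions. The function names are hypothetical, and the in-flight test is taken to be "at most cwnd bytes outstanding".

```python
def can_send(last_byte_sent, last_byte_acked, cwnd, nbytes):
    """True if sending nbytes more keeps in-flight data within cwnd."""
    return (last_byte_sent + nbytes) - last_byte_acked <= cwnd


def approx_rate(cwnd_bytes, rtt_s):
    """Rough sending rate in bytes/sec: one window per round-trip time."""
    return cwnd_bytes / rtt_s


if __name__ == "__main__":
    print(can_send(1000, 500, 1500, 1000))  # exactly fills the window
    print(approx_rate(14600, 0.5))
```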

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A sends one segment to Host B in the first RTT, two segments in the second, four segments in the third.]
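The following sketch shows why incrementing cwnd by one MSS per ACK doubles it every RTT. It assumes no loss and one ACK per segment; the function name is illustrative.

```python
def slow_start(rounds, mss=1):
    """Return cwnd (in units of mss) at the start of each RTT.

    Assumptions: no loss, every segment sent in an RTT is ACKed
    individually, and each ACK adds one MSS to cwnd.
    """
    cwnd = mss
    per_rtt = []
    for _ in range(rounds):
        per_rtt.append(cwnd)
        acks = cwnd // mss          # one ACK per segment sent this RTT
        for _ in range(acks):
            cwnd += mss             # per-ACK increment
    return per_rtt


if __name__ == "__main__":
    print(slow_start(4))
```

Four rounds give windows of 1, 2, 4 and 8 segments, matching the one, two, four segments drawn in the figure.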

TCP: detecting, reacting to loss

• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate ACKs)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

Λ (initially): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; enter slow start

slow start
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: Λ; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery

congestion avoidance
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery

fast recovery
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment; go to slow start
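The three-state machine can be sketched as a small Python class. This is an illustration, not a TCP implementation: the class name, the MSS value of 1460 bytes, and reading the "+ 3" in the fast-recovery entry action as 3 MSS are assumptions; segment transmission itself is left out.

```python
MSS = 1460  # bytes; an assumed segment size


class RenoState:
    """Tracks cwnd/ssthresh for the slow start / CA / fast recovery FSM."""

    def __init__(self):
        self.state = "slow_start"
        self.cwnd = MSS
        self.ssthresh = 64 * 1024
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "slow_start":
            self.cwnd += MSS                      # exponential growth per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        elif self.state == "congestion_avoidance":
            self.cwnd += MSS * MSS / self.cwnd    # about +1 MSS per RTT
        else:                                     # fast recovery ends
            self.cwnd = self.ssthresh
            self.state = "congestion_avoidance"
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += MSS                      # inflate window per dup ACK
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                    # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS   # assumption: "+ 3" means 3 MSS
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = MSS
        self.dup_acks = 0
        self.state = "slow_start"


if __name__ == "__main__":
    r = RenoState()
    for _ in range(3):
        r.on_new_ack()
    for _ in range(3):
        r.on_dup_ack()
    print(r.state, r.cwnd, r.ssthresh)
```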

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is ¾ W
  – avg. throughput is ¾ W per RTT

  avg TCP throughput = (3/4) · W/RTT bytes/sec

[Plot: window size sawtooth oscillating between W/2 and W.]
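The ¾ W figure is just the mean of a window that ramps linearly from W/2 to W. A quick numerical check, assuming an idealized linear ramp sampled at evenly spaced points (the function name is illustrative):

```python
def avg_window(W, steps=1000):
    """Average of a window growing linearly from W/2 to W."""
    samples = [W / 2 + (W / 2) * i / steps for i in range(steps + 1)]
    return sum(samples) / len(samples)


if __name__ == "__main__":
    print(avg_window(100))  # close to 75, i.e. 3/4 of W
```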

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

  TCP throughput = 1.22 · MSS / (RTT · √L)

  to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
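Both numbers in this example can be recomputed directly. The sketch below inverts the throughput formula to solve for L; the function names are illustrative, and MSS is converted from bytes to bits so that units match the bits/sec throughput.

```python
def required_window_segments(throughput_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain the throughput over one RTT."""
    return throughput_bps * rtt_s / (8 * mss_bytes)


def required_loss_rate(throughput_bps, rtt_s, mss_bytes):
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L."""
    return (1.22 * mss_bytes * 8 / (rtt_s * throughput_bps)) ** 2


if __name__ == "__main__":
    print(required_window_segments(10e9, 0.1, 1500))  # about 83,333
    print(required_loss_rate(10e9, 0.1, 1500))        # about 2e-10
```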

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Plot: Connection 2 throughput vs. Connection 1 throughput, both axes from 0 to R. The full bandwidth utilization line is the set of points (X1, Y1) where X1 + Y1 = R; the equal bandwidth share line is the set of points (X2, Y2) where X2 = Y2. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2", moving toward the equal bandwidth share line.]
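The convergence argument can be checked numerically. The sketch below assumes two synchronized flows that both see a loss whenever their combined rate exceeds the capacity; the function name and the starting rates are illustrative.

```python
def two_flow_aimd(x, y, capacity, rounds):
    """Simulate two synchronized AIMD flows sharing one bottleneck.

    Assumption: both flows detect loss in the same round, whenever
    x + y exceeds the capacity.
    """
    for _ in range(rounds):
        if x + y > capacity:
            x, y = x / 2, y / 2     # multiplicative decrease halves the gap
        else:
            x, y = x + 1, y + 1     # additive increase keeps the gap unchanged
    return x, y


if __name__ == "__main__":
    print(two_flow_aimd(1.0, 70.0, 100.0, 500))
```

Additive increase leaves the difference x − y unchanged while every loss halves it, so the two rates approach each other regardless of where they start.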

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets roughly R/2 (11 of the 20 connections)

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and is marked ECN=11 by a congested router; the destination returns a TCP ACK segment with ECE=1.]


rdt2.2: a NAK-free protocol

• same functionality as rdt2.1, using ACKs only
• instead of NAK, receiver sends ACK for last pkt received OK
  – receiver must explicitly include seq # of pkt being ACKed
• duplicate ACK at sender results in same action as NAK: retransmit current pkt

rdt2.2: sender, receiver fragments

sender FSM fragment
  state "Wait for call 0 from above":
    rdt_send(data)
      → sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt)
      → go to "Wait for ACK 0"
  state "Wait for ACK 0":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1))
      → udt_send(sndpkt)
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0)
      → Λ (no action); leave this state

receiver FSM fragment
  transition into "Wait for 0 from below":
    rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
      → extract(rcvpkt, data); deliver_data(data)
      → sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)
  state "Wait for 0 from below":
    rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt))
      → udt_send(sndpkt)

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
  – checksum, seq #, ACKs, retransmissions will be of help … but not enough

approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  – retransmission will be duplicate, but seq #'s already handle this
  – receiver must specify seq # of pkt being ACKed
• requires countdown timer

rdt3.0 sender

state "Wait for call 0 from above":
  rdt_send(data)
    → sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer
    → go to "Wait for ACK0"
  rdt_rcv(rcvpkt)
    → Λ

state "Wait for ACK0":
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1))
    → Λ
  timeout
    → udt_send(sndpkt); start_timer
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0)
    → stop_timer; go to "Wait for call 1 from above"

state "Wait for call 1 from above":
  rdt_send(data)
    → sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer
    → go to "Wait for ACK1"
  rdt_rcv(rcvpkt)
    → Λ

state "Wait for ACK1":
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0))
    → Λ
  timeout
    → udt_send(sndpkt); start_timer
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1)
    → stop_timer; go to "Wait for call 0 from above"
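A compact way to see the alternating-bit sender is to fold the four states into a sequence bit plus a "waiting for ACK" flag. The sketch below makes that simplification; the class name, the tuple packet format and the boolean timer are assumptions, and the timer is driven by the caller rather than by a real clock.

```python
class Rdt30Sender:
    """Alternating-bit (stop-and-wait) sender in the style of rdt3.0."""

    def __init__(self, udt_send):
        self.seq = 0                  # sequence bit of the current packet
        self.waiting_for_ack = False  # False: "wait for call from above"
        self.sndpkt = None
        self.timer_running = False
        self.udt_send = udt_send      # callable that puts a packet on the channel

    def rdt_send(self, data):
        if self.waiting_for_ack:
            return False              # upper layer must wait
        self.sndpkt = (self.seq, data)
        self.udt_send(self.sndpkt)
        self.timer_running = True     # start_timer
        self.waiting_for_ack = True
        return True

    def timeout(self):
        if self.waiting_for_ack:
            self.udt_send(self.sndpkt)  # retransmit the current packet
            self.timer_running = True   # restart the timer

    def rdt_rcv(self, ack_seq, corrupt=False):
        if not self.waiting_for_ack:
            return                    # no packet outstanding: ignore
        if corrupt or ack_seq != self.seq:
            return                    # corrupt or wrong ACK: wait for timeout
        self.timer_running = False    # stop_timer
        self.waiting_for_ack = False
        self.seq ^= 1                 # flip the sequence bit


if __name__ == "__main__":
    sent = []
    s = Rdt30Sender(sent.append)
    s.rdt_send("a")
    s.timeout()      # simulate a lost packet or ACK
    s.rdt_rcv(0)
    s.rdt_send("b")
    print(sent)
```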

rdt3.0 in action

(a) no loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0

(b) packet loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1 (pkt1 lost, X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1, send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0

(c) ACK loss
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 lost, X)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  receiver: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
  sender: send pkt0
  receiver: rcv pkt0, send ack0
  sender: rcv ack0, send pkt1
  receiver: rcv pkt1, send ack1 (ack1 delayed)
  sender: timeout, resend pkt1
  receiver: rcv pkt1 (detect duplicate), send ack1
  sender: rcv ack1, send pkt0
  sender: rcv ack1 (second copy), do nothing
  receiver: rcv pkt0, send ack0

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

  D_trans = L/R = 8000 bits / 10^9 bits/sec = 8 microsecs

  – U_sender: utilization, fraction of time sender busy sending

    U_sender = (L/R) / (RTT + L/R) = 0.008 / 30.008 = 0.00027

  – if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!

rdt3.0: stop-and-wait operation

[Timing diagram, sender and receiver:]
• t = 0: first packet bit transmitted
• t = L/R: last packet bit transmitted
• first packet bit arrives at receiver
• last packet bit arrives, send ACK
• t = RTT + L/R: ACK arrives, send next packet

U_sender = (L/R) / (RTT + L/R) = 0.008 / 30.008 = 0.00027

Pipelined protocols

pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

[Timing diagram, sender and receiver:]
• t = 0: first packet bit transmitted
• t = L/R: last bit transmitted
• first packet bit arrives at receiver
• last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• t = RTT + L/R: ACK arrives, send next packet

3-packet pipelining increases utilization by a factor of 3!

U_sender = (3L/R) / (RTT + L/R) = 0.024 / 30.008 ≈ 0.0008
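The utilization formula is easy to evaluate for other link speeds and window sizes. A minimal sketch, with a hypothetical function name and a `window` parameter for the number of packets sent back-to-back per round trip:

```python
def sender_utilization(packet_bits, link_bps, rtt_s, window=1):
    """Fraction of time the sender is busy: window * (L/R) / (RTT + L/R)."""
    d_trans = packet_bits / link_bps          # transmission delay L/R
    return min(1.0, window * d_trans / (rtt_s + d_trans))


if __name__ == "__main__":
    # 8000-bit packets, 1 Gbps link, 30 ms RTT
    print(sender_utilization(8000, 1e9, 0.030))            # stop-and-wait
    print(sender_utilization(8000, 1e9, 0.030, window=3))  # 3-packet pipeline
```

The result is capped at 1.0 because a sender cannot be busy more than all of the time, however large the window.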

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  – doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  – when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  – when timer expires, retransmit only that unacked packet

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n - "cumulative ACK"
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

Λ (initially):
  base = 1
  nextseqnum = 1

rdt_send(data):
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)

timeout:
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  …
  udt_send(sndpkt[nextseqnum-1])

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer

rdt_rcv(rcvpkt) && corrupt(rcvpkt):
  Λ
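The sender FSM translates almost line for line into Python. In this sketch the class name, the `(seq, data)` tuple packet format and the boolean `timer_running` flag are assumptions; checksums and a real timer are omitted, and the caller invokes `timeout()` explicitly.

```python
class GbnSender:
    """Go-Back-N sender following the extended FSM (no real timer)."""

    def __init__(self, window, udt_send):
        self.N = window
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}              # seq -> packet, kept for retransmission
        self.timer_running = False
        self.udt_send = udt_send

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False              # refuse_data: window is full
        pkt = (self.nextseqnum, data)
        self.sndpkt[self.nextseqnum] = pkt
        self.udt_send(pkt)
        if self.base == self.nextseqnum:
            self.timer_running = True     # timer covers the oldest unacked pkt
        self.nextseqnum += 1
        return True

    def timeout(self):
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.udt_send(self.sndpkt[seq])   # resend every unacked packet

    def rdt_rcv_ack(self, acknum):
        self.base = acknum + 1            # cumulative ACK
        self.timer_running = self.base != self.nextseqnum


if __name__ == "__main__":
    sent = []
    s = GbnSender(4, sent.append)
    for d in "abcde":
        s.rdt_send(d)        # the fifth call is refused
    s.rdt_rcv_ack(1)
    s.timeout()
    print(sent)
```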


GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don't buffer): no receiver buffering!
  – re-ACK pkt with highest in-order seq #

Λ (initially):
  expectedseqnum = 1
  sndpkt = make_pkt(0, ACK, chksum)

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++

default:
  udt_send(sndpkt)
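The receiver side is even smaller, since it only tracks expectedseqnum. In this sketch the class name and the convention of returning the ACK number from `rdt_rcv` are assumptions made for illustration.

```python
class GbnReceiver:
    """Go-Back-N receiver: delivers in order, re-ACKs the last good packet."""

    def __init__(self):
        self.expectedseqnum = 1
        self.last_ack = 0        # matches the initial make_pkt(0, ACK, chksum)
        self.delivered = []

    def rdt_rcv(self, seq, data, corrupt=False):
        """Process one packet and return the ACK number that is sent back."""
        if not corrupt and seq == self.expectedseqnum:
            self.delivered.append(data)       # deliver_data
            self.last_ack = self.expectedseqnum
            self.expectedseqnum += 1
        # default case: out-of-order or corrupt, so re-send the previous ACK
        return self.last_ack


if __name__ == "__main__":
    r = GbnReceiver()
    print([r.rdt_rcv(1, "a"), r.rdt_rcv(3, "c"), r.rdt_rcv(2, "b")])
```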


GBN in action

sender window (N=4), sequence numbers 0 1 2 3 4 5 6 7 8

  sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
  receiver: receive pkt0, send ack0; receive pkt1, send ack1
  receiver: receive pkt3, discard, (re)send ack1
  sender: rcv ack0, send pkt4
  sender: rcv ack1, send pkt5
  sender: ignore duplicate ACK
  receiver: receive pkt4, discard, (re)send ack1
  receiver: receive pkt5, discard, (re)send ack1
  sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
  receiver: rcv pkt2, deliver, send ack2
  receiver: rcv pkt3, deliver, send ack3
  receiver: rcv pkt4, deliver, send ack4
  receiver: rcv pkt5, deliver, send ack5


Selective repeat

• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #'s
  – limits seq #'s of sent, unACKed packets

Selective repeat: sender, receiver windows

[Figure: sender and receiver views of the sequence number space.]

Selective repeat

sender
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
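The receiver rules can be sketched as follows. For simplicity the sketch assumes unbounded (non-wrapping) sequence numbers, which the slides do not; the class name and the convention of returning the ACKed number, or `None` when the packet is ignored, are also assumptions.

```python
class SrReceiver:
    """Selective-repeat receiver with buffering (unbounded seq numbers)."""

    def __init__(self, window):
        self.N = window
        self.rcvbase = 0
        self.buffer = {}       # out-of-order packets awaiting delivery
        self.delivered = []

    def rdt_rcv(self, n, data):
        """Return the ACK number sent, or None if the packet is ignored."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            # deliver every in-order packet and slide the window forward
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n           # already delivered: re-ACK so sender can advance
        return None            # outside both ranges: ignore


if __name__ == "__main__":
    r = SrReceiver(4)
    for n in [0, 1, 3, 4, 5, 2]:     # pkt2 arrives last, after a retransmit
        r.rdt_rcv(n, f"pkt{n}")
    print(r.delivered)
```

Feeding it packets 0, 1, 3, 4, 5 and then the retransmitted 2 delivers all six in order once pkt2 arrives.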

Selective repeat in action

sender window (N=4), sequence numbers 0 1 2 3 4 5 6 7 8

  sender: send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
  receiver: receive pkt0, send ack0; receive pkt1, send ack1
  receiver: receive pkt3, buffer, send ack3
  sender: rcv ack0, send pkt4
  sender: rcv ack1, send pkt5
  sender: record ack3 arrived
  receiver: receive pkt4, buffer, send ack4
  receiver: receive pkt5, buffer, send ack5
  sender: pkt 2 timeout; send pkt2
  sender: record ack4 arrived; record ack5 arrived
  receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

(a) no problem: sender sends pkt0, pkt1, pkt2; all three arrive and are ACKed, and the ACKs reach the sender. Receiver window advances and will accept packet with seq number 0. Sender sends pkt3 (lost, X), then a new pkt0, which the receiver correctly accepts as new data.

(b) oops!: sender sends pkt0, pkt1, pkt2; all three arrive and are ACKed, but all three ACKs are lost (X X X). Receiver window advances and will accept packet with seq number 0. Sender times out and retransmits pkt0, which the receiver accepts as new data.

receiver can't see sender side; receiver behavior identical in both cases! something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
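One way to explore the question is to test whether the sequence numbers the sender might still retransmit overlap with the receiver's advanced window. The sketch below uses the worst case where the receiver has accepted a full window while every ACK was lost; the function name is illustrative. The standard textbook answer it illustrates is that the window size must be at most half the sequence number space.

```python
def sr_ambiguous(seq_space, window):
    """True if a retransmitted old packet can fall in the receiver's new window.

    Worst case: the receiver got a full window (seq 0..window-1) and advanced,
    but all ACKs were lost, so the sender may still retransmit 0..window-1.
    """
    old = {i % seq_space for i in range(window)}               # may be resent
    new = {i % seq_space for i in range(window, 2 * window)}   # now accepted
    return bool(old & new)


if __name__ == "__main__":
    print(sr_ambiguous(4, 3))  # the slide's example
    print(sr_ambiguous(4, 2))  # window is half the sequence space
```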

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver

TCP segment structure

(each row is 32 bits wide)

  source port # | dest port #
  sequence number
  acknowledgement number
  head len | not used | U A P R S F | receive window
  checksum | Urg data pointer
  options (variable length)
  application data (variable length)

• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)

TCP seq. numbers, ACKs

sequence numbers:
  – byte stream "number" of first byte in segment's data
acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK
Q: how receiver handles out-of-order segments
  – A: TCP spec doesn't say, up to implementor

[Figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer. Sender sequence number space with window size N: sent and ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable.]

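Byte stream in TCP

[Figure: an HTTP GET message of K bytes in the byte stream; a window of N bytes; a segment whose TCP header carries seq. no. = 100 begins at the 100th byte and holds M bytes; the part of the message beyond the window cannot be transmitted now.]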

TCP seq. numbers, ACKs

simple telnet scenario (Host A, Host B):

1. User types 'C' at Host A.
   A → B: Seq=42, ACK=79, data = 'C'
2. Host B ACKs receipt of 'C', echoes back 'C'.
   B → A: Seq=79, ACK=43, data = 'C'
3. Host A ACKs receipt of echoed 'C'.
   A → B: Seq=43, ACK=80

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT

EstimatedRTT = (1 - α)·EstimatedRTT + α·SampleRTT

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Plot: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr; RTT (milliseconds, 100 to 350) vs. time (seconds); series: SampleRTT and EstimatedRTT.]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

  DevRTT = (1 - β)·DevRTT + β·|SampleRTT - EstimatedRTT|
  (typically, β = 0.25)

  TimeoutInterval = EstimatedRTT + 4·DevRTT
  (EstimatedRTT is the estimated RTT; 4·DevRTT is the "safety margin")
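The two moving averages and the timeout formula fit in a short class. This is a sketch with several assumptions that the slides leave open: the class name, seeding EstimatedRTT with the first sample, starting DevRTT at 0, and computing DevRTT against the already-updated EstimatedRTT.

```python
class RttEstimator:
    """EWMA-based RTT estimate and timeout interval."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample  # assumption: seed with first sample
        self.dev_rtt = 0.0                 # assumption: no deviation seen yet

    def update(self, sample_rtt):
        """Fold in a new SampleRTT and return the new TimeoutInterval."""
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        # assumption: deviation is measured against the updated estimate
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        return self.timeout_interval()

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


if __name__ == "__main__":
    est = RttEstimator(100.0)              # milliseconds
    for sample in [200.0, 110.0, 105.0]:
        print(est.update(sample))
```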

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks

let's initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control

TCP sender events

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments

TCP sender (simplified)

Λ (initially):
  NextSeqNum = InitialSeqNum
  SendBase = InitialSeqNum

wait for event:

data received from application above:
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., "send")
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
    start timer

timeout:
  retransmit not-yet-acked segment with smallest seq. #
  start timer

ACK received, with ACK field value y:
  if (y > SendBase) {
    SendBase = y
    /* SendBase–1: last cumulatively ACKed byte */
    if (there are currently not-yet-acked segments)
      start timer
    else stop timer
  }
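The three event handlers can be sketched in Python as below. The class name, the `(seq, data)` segment tuple, the dictionary of unacked segments and the boolean timer flag are all assumptions for illustration; duplicate ACKs, flow control and congestion control are ignored, as in the simplified sender.

```python
class SimpleTcpSender:
    """Simplified TCP sender: cumulative ACKs and a single timer flag."""

    def __init__(self, initial_seq, send):
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.unacked = {}             # seq # of first byte -> segment data
        self.timer_running = False
        self.send = send              # callable that passes a segment to IP

    def on_app_data(self, data):
        self.unacked[self.next_seq] = data
        self.send((self.next_seq, data))
        self.next_seq += len(data)    # seq numbers count bytes
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if not self.unacked:
            return
        seq = min(self.unacked)       # oldest not-yet-acked segment
        self.send((seq, self.unacked[seq]))
        self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y        # bytes up to y-1 are cumulatively ACKed
            done = [s for s, d in self.unacked.items() if s + len(d) <= y]
            for s in done:
                del self.unacked[s]
            self.timer_running = bool(self.unacked)


if __name__ == "__main__":
    sent = []
    s = SimpleTcpSender(92, sent.append)
    s.on_app_data(b"x" * 8)       # Seq=92, 8 bytes
    s.on_app_data(b"y" * 20)      # Seq=100, 20 bytes
    s.on_timeout()                # resends Seq=92 only
    s.on_ack(120)                 # cumulative ACK covers both segments
    print(s.send_base, s.timer_running, [seq for seq, _ in sent])
```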

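TCP: retransmission scenarios

lost ACK scenario (Host A, Host B):
1. A sends Seq=92, 8 bytes of data.
2. B replies ACK=100; the ACK is lost (X).
3. A times out and resends Seq=92, 8 bytes of data.
4. B replies ACK=100.

premature timeout (Host A, Host B):
1. A sends Seq=92, 8 bytes of data (SendBase=92), then Seq=100, 20 bytes of data.
2. B replies ACK=100 and ACK=120; neither reaches A before the timeout.
3. A times out and resends Seq=92, 8 bytes of data.
4. ACK=100 arrives (SendBase=100), then ACK=120 arrives (SendBase=120).
5. B answers the duplicate segment with ACK=120 (SendBase=120).

cumulative ACK (Host A, Host B):
1. A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data.
2. B replies ACK=100, which is lost (X), and ACK=120, which arrives before the timeout.
3. A does not retransmit; it continues with Seq=120, 15 bytes of data.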

TCP ACK generation [RFC 5681]

event at receiver → TCP receiver action

• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment, higher-than-expected seq #; gap detected
  → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap
  → immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  – likely that unacked segment lost, so don't wait for timeout

fast retransmit after sender receipt of triple duplicate ACK (Host A, Host B):
1. A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data; the Seq=100 segment is lost (X).
2. B replies ACK=100 to the first segment, then sends ACK=100 again for each later segment that arrives out of order (3 DUP ACKs).
3. A resends Seq=100, 20 bytes of data before the timeout expires.
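Duplicate-ACK counting is simple enough to show in a few lines. The function below is a sketch with a hypothetical name; it takes the stream of ACK numbers seen by the sender and reports which of them would trigger a fast retransmit, counting the first ACK for a value as the original and later repeats as duplicates.

```python
def fast_retransmit_points(acks, threshold=3):
    """Return the ACK values at which a fast retransmit would be triggered."""
    triggers = []
    last_ack, dup_count = None, 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == threshold:   # third duplicate of the same ACK
                triggers.append(ack)     # resend the segment starting at ack
        else:
            last_ack, dup_count = ack, 0
    return triggers


if __name__ == "__main__":
    # original ACK=100 followed by three duplicates, as in the figure
    print(fast_retransmit_points([100, 100, 100, 100]))
```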

TCP flow control

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code into TCP socket receiver buffers (OS), from which the application process reads.]

application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); data leaves to the application process.]
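The bookkeeping behind rwnd can be sketched with two helpers. The formula rwnd = RcvBuffer − (LastByteRcvd − LastByteRead) is the usual textbook definition rather than something stated on the slide, and the function and parameter names are assumptions.

```python
def advertised_rwnd(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free receive-buffer space: RcvBuffer minus data not yet read by the app."""
    return rcv_buffer - (last_byte_rcvd - last_byte_read)


def sender_may_send(last_byte_sent, last_byte_acked, rwnd, nbytes):
    """True if nbytes more keeps unacked (in-flight) data within rwnd."""
    return (last_byte_sent + nbytes) - last_byte_acked <= rwnd


if __name__ == "__main__":
    rwnd = advertised_rwnd(4096, 3000, 1000)   # 2000 bytes buffered, unread
    print(rwnd)
    print(sender_may_send(5000, 4000, rwnd, 1096))
```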

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

client:
  Socket clientSocket = new Socket("hostname", "port number");
server:
  Socket connectionSocket = welcomeSocket.accept();

on each side (application above, network below):
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client

TCP 3-way handshake

client state: CLOSED; server state: LISTEN

1. client: choose init seq num, x; send TCP SYN msg
     SYNbit=1, Seq=x
   client state: SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN
     SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1
   server state: SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK; this segment may contain client-to-server data
     ACKbit=1, ACKnum=y+1
   client state: ESTAB
4. server: received ACK(y) indicates client is live
   server state: ESTAB
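The sequence and acknowledgement numbers in the three messages follow mechanically from x and y. A minimal sketch, representing each segment as a plain dictionary (the function name and the dictionary keys are illustrative, not a real packet format):

```python
def three_way_handshake(x, y):
    """Return the SYN, SYNACK and ACK segments for initial seq numbers x, y."""
    syn = {"SYN": 1, "seq": x}
    synack = {"SYN": 1, "seq": y, "ACK": 1, "acknum": syn["seq"] + 1}
    ack = {"ACK": 1, "acknum": synack["seq"] + 1}
    return [syn, synack, ack]


if __name__ == "__main__":
    for segment in three_way_handshake(1000, 5000):
        print(segment)
```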

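TCP 3-way handshake: FSM

client: closed → SYN sent → ESTAB
• closed → SYN sent: Socket clientSocket = new Socket("hostname", "port number"); send SYN(seq=x)
• SYN sent → ESTAB: receive SYNACK(seq=y, ACKnum=x+1); send ACK(ACKnum=y+1)

server: closed → listen → SYN rcvd → ESTAB
• closed → listen: Socket connectionSocket = welcomeSocket.accept()
• listen → SYN rcvd: receive SYN(x); send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
• SYN rcvd → ESTAB: receive ACK(ACKnum=y+1); Λ (no action)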

TCP: closing a connection

• client, server each close their side of connection
  – send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

client state: ESTAB; server state: ESTAB

1. client: clientSocket.close(); sends FINbit=1, seq=x; enters FIN_WAIT_1 (can no longer send but can receive data)
2. server: replies ACKbit=1, ACKnum=x+1; enters CLOSE_WAIT (can still send data)
3. client: enters FIN_WAIT_2, wait for server close
4. server: sends FINbit=1, seq=y; enters LAST_ACK (can no longer send data)
5. client: replies ACKbit=1, ACKnum=y+1; enters TIMED_WAIT, timed wait for 2 × max segment lifetime, then CLOSED
6. server: on receiving the ACK, enters CLOSED

The ldquoTwo Army Problemrdquo

84

Principles of congestion control

congestionbull informally ldquotoo many sources sending too much data

too fast for network to handlerdquobull different from flow controlbull manifestations

ndash lost packets (buffer overflow at routers)ndash long delays (queueing in router buffers)

bull a top-10 problem

85

Causescosts of congestion scenario 1

bull two senders two receivers

bull one router infinite buffers

bull output link capacity Rbull no retransmission

bull maximum per-connection throughput R2

86

unlimited shared output link buffers

Host A

original data lin

Host B

throughput lout

R2

R2

l out

lin R2

dela

ylin

v large delays as arrival rate lin approaches capacity

Causescosts of congestion scenario 2

bull one router finite buffers bull sender retransmission of timed-out packet

ndash application-layer input = application-layer output lin = lout

ndash transport-layer input includes retransmissions lrsquoin lin

87

finite shared output link buffers

Host A

lin original data

Host B

loutlin original data plusretransmitted data

Causescosts of congestion scenario 2

idealization perfect knowledgebull sender sends only when router

buffers available

88

finite shared output link buffers

lin original dataloutlin original data plus

retransmitted datacopy

free buffer space

R2

R2

l out

lin

Host B

A

lin original dataloutlin original data plus

retransmitted datacopy

no buffer space

Causescosts of congestion scenario 2

Idealization known losspackets can be lost dropped at router due to full buffers

bull sender only resends if packet known to be lost

89

A

Host B

lin original dataloutlin original data plus

retransmitted data

free buffer space

Causescosts of congestion scenario 2

90

R2

R2lin

l out

when sending at R2 some packets are retransmissions but asymptotic goodput is still R2 (why)

A

Host B

Idealization known losspackets can be lost dropped at router due to full buffers

bull sender only resends if packet known to be lost

A

lin loutlincopy

free buffer space

timeout

R2

R2lin

l out

when sending at R2 some packets are retransmissions including duplicated that are delivered

Host B

Realistic duplicatesv packets can be lost dropped

at router due to full buffersv sender times out prematurely

sending two copies both of which are delivered

Causescosts of congestion scenario 2

91

R2

l out

when sending at R2 some packets are retransmissions including duplicated that are delivered

ldquocostsrdquo of congestionv more work (retrans) for given ldquogoodputrdquov unneeded retransmissions link carries multiple copies of pkt

sect decreasing goodput

R2lin

Causescosts of congestion scenario 2

92

Realistic duplicatesv packets can be lost dropped

at router due to full buffersv sender times out prematurely

sending two copies both of which are delivered

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λin and λ'in increase?

A: as red λ'in increases, all arriving blue packets at the upper queue are dropped, and blue throughput → 0

[Figure: Hosts A, B, C and D send over multihop paths through routers with finite shared output link buffers; λin is original data, λ'in is original data plus retransmitted data, λout is the delivered throughput]

another "cost" of congestion:
• when a packet is dropped, any upstream transmission capacity used for that packet was wasted

[Figure: throughput of the blue traffic (λout) versus the load offered by Host A (λ'in), with R/2 marked on the axes; it illustrates bandwidth wasted on packets dropped at the 2nd router]

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss

[Figure: AIMD sawtooth behavior, probing for bandwidth: cwnd (TCP sender congestion window size) versus time; the window is increased additively until loss occurs, then cut in half]
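The sawtooth described above can be reproduced with a few lines of Python. This is only a sketch under stated assumptions: the function name `aimd_trace`, the starting window, and the rounds in which a loss is detected are all made up for illustration, and the window is counted in whole MSS units.

```python
def aimd_trace(rounds, loss_rounds, mss=1, start=10):
    """Return cwnd (in MSS units) at the start of each RTT round."""
    cwnd = start
    trace = []
    for rtt in range(rounds):
        trace.append(cwnd)
        if rtt in loss_rounds:
            cwnd = max(mss, cwnd // 2)   # multiplicative decrease: halve on loss
        else:
            cwnd += mss                  # additive increase: +1 MSS per RTT
    return trace

# Hypothetical run: losses are detected in rounds 5 and 9.
trace = aimd_trace(rounds=12, loss_rounds={5, 9})
print(trace)
```

Plotting `trace` against the round number gives the sawtooth shape: slow linear climbs separated by sharp halvings.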

TCP Congestion Control: details

• sender limits transmission:

  LastByteSent - LastByteAcked ≤ cwnd

• cwnd is a dynamic function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

  rate ≈ cwnd / RTT bytes/sec

[Figure: sender sequence number space, showing the last byte ACKed, the bytes sent but not yet ACKed ("in-flight"), and the last byte sent, all within cwnd]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A sends one segment to Host B, two segments one RTT later, then four segments]

TCP: detecting, reacting to loss

• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate ACKs)

99

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?

A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

initial transition (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; enter slow start

slow start:
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: Λ (no action); go to congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; stay in slow start

congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start

fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
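The three-state machine summarized on this slide can be sketched as a small Python class. The class name `RenoSender`, its method names, and the byte values used in the example are illustrative assumptions; the sketch only tracks cwnd, ssthresh and the state, and does not transmit anything.

```python
class RenoSender:
    """Toy TCP Reno congestion-control state machine (window in bytes)."""

    def __init__(self, mss=1000, ssthresh=64_000):
        self.mss = mss
        self.cwnd = mss
        self.ssthresh = ssthresh
        self.dup_acks = 0
        self.state = "slow_start"

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh                # deflate the window
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += self.mss                    # +1 MSS per ACK: doubles per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += self.mss * self.mss / self.cwnd   # about +1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += self.mss                    # inflate for each extra dup ACK
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                       # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = self.mss
        self.dup_acks = 0
        self.state = "slow_start"


s = RenoSender(mss=1000, ssthresh=4000)
for _ in range(4):
    s.on_new_ack()
print(s.state, s.cwnd)
```

Feeding the object a sequence of `on_new_ack`, `on_dup_ack` and `on_timeout` calls is a convenient way to check a hand-drawn cwnd trace against the FSM.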

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT

  avg TCP throughput = $\frac{3}{4} \cdot \frac{W}{RTT}$ bytes/sec

[Figure: sawtooth of the window size oscillating between W/2 and W]

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

  TCP throughput = $\frac{1.22 \cdot MSS}{RTT \sqrt{L}}$

  to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate!
• new versions of TCP for high-speed
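The numbers on the two throughput slides can be checked with a short calculation. The function names below are hypothetical; the formulas are the 3/4 · W/RTT average and the 1.22 · MSS / (RTT · √L) relation quoted above, with the second one solved for L.

```python
def avg_throughput_bytes_per_sec(w_bytes, rtt_s):
    """Average AIMD throughput when the window oscillates between W/2 and W."""
    return 0.75 * w_bytes / rtt_s

def segments_in_flight(rate_bps, rtt_s, mss_bytes):
    """Window (in segments) needed to keep rate_bps flowing for one RTT."""
    return rate_bps * rtt_s / (8 * mss_bytes)

def loss_rate_for(rate_bps, rtt_s, mss_bytes):
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L."""
    mss_bits = 8 * mss_bytes
    return (1.22 * mss_bits / (rtt_s * rate_bps)) ** 2

# Slide example: 1500-byte segments, 100 ms RTT, 10 Gbps target.
w = segments_in_flight(10e9, 0.100, 1500)
loss = loss_rate_for(10e9, 0.100, 1500)
print(f"W = {w:.0f} segments, L = {loss:.2e}")
```

The result agrees with the slide: roughly 83,333 segments in flight and a tolerable loss probability on the order of 2 · 10^-10.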

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput versus Connection 1 throughput, both axes up to R, with the equal bandwidth share line (points (X2, Y2) where X2 = Y2) and the full bandwidth utilization line (points (X1, Y1) where X1 + Y1 = R); the trajectory alternates between congestion avoidance (additive increase) and loss (decrease window by factor of 2)]
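The convergence argument in the figure can be simulated directly. The sketch below assumes two perfectly synchronized flows that both see a loss whenever their combined rate exceeds the capacity; the function name and the starting rates are made up for illustration.

```python
def aimd_two_flows(x, y, capacity, rounds):
    """Synchronized AIMD: both flows add 1 per round, both halve on overload."""
    for _ in range(rounds):
        if x + y > capacity:
            x, y = x / 2, y / 2      # multiplicative decrease shrinks the gap
        else:
            x, y = x + 1, y + 1      # additive increase keeps the gap unchanged
    return x, y

# Hypothetical start: connection 1 has 80 units, connection 2 only 10.
x, y = aimd_two_flows(x=80.0, y=10.0, capacity=100.0, rounds=500)
print(abs(x - y))
```

Each halving cuts the difference between the two rates in half while additive increase leaves it alone, so the pair drifts toward the equal-share line.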

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2

106

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: an IP datagram leaves the source with ECN=00 and is marked ECN=11 by a congested router on the way to the destination; the destination returns a TCP ACK segment with ECE=1]


rdt2.2: sender, receiver fragments

sender FSM fragment:
• state "Wait for call 0 from above"
  - event: rdt_send(data)
    action: sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); go to "Wait for ACK 0"
• state "Wait for ACK 0"
  - event: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1))
    action: udt_send(sndpkt)
  - event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0)
    action: Λ (none)

receiver FSM fragment:
• state "Wait for 0 from below"
  - event: rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt))
    action: udt_send(sndpkt)
• transition into "Wait for 0 from below"
  - event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq1(rcvpkt)
    action: extract(rcvpkt, data); deliver_data(data); sndpkt = make_pkt(ACK1, chksum); udt_send(sndpkt)

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
  - checksum, seq. #, ACKs, retransmissions will be of help … but not enough

approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  - retransmission will be duplicate, but seq. #'s already handle this
  - receiver must specify seq # of pkt being ACKed
• requires countdown timer

37

rdt3.0 sender

• state "Wait for call 0 from above"
  - rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK0"
  - rdt_rcv(rcvpkt): Λ (ignore)
• state "Wait for ACK0"
  - rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)): Λ (keep waiting)
  - timeout: udt_send(sndpkt); start_timer
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0): stop_timer; go to "Wait for call 1 from above"
• state "Wait for call 1 from above"
  - rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK1"
  - rdt_rcv(rcvpkt): Λ (ignore)
• state "Wait for ACK1"
  - rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0)): Λ (keep waiting)
  - timeout: udt_send(sndpkt); start_timer
  - rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1): stop_timer; go to "Wait for call 0 from above"

rdt3.0 in action

(a) no loss
  1. sender: send pkt0
  2. receiver: rcv pkt0, send ack0
  3. sender: rcv ack0, send pkt1
  4. receiver: rcv pkt1, send ack1
  5. sender: rcv ack1, send pkt0
  6. receiver: rcv pkt0, send ack0

(b) packet loss
  1. sender: send pkt0; receiver: rcv pkt0, send ack0
  2. sender: rcv ack0, send pkt1; pkt1 is lost (X)
  3. sender: timeout, resend pkt1
  4. receiver: rcv pkt1, send ack1
  5. sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0

(c) ACK loss
  1. sender: send pkt0; receiver: rcv pkt0, send ack0
  2. sender: rcv ack0, send pkt1
  3. receiver: rcv pkt1, send ack1; ack1 is lost (X)
  4. sender: timeout, resend pkt1
  5. receiver: rcv pkt1 (detect duplicate), send ack1
  6. sender: rcv ack1, send pkt0; receiver: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
  1. sender: send pkt0; receiver: rcv pkt0, send ack0
  2. sender: rcv ack0, send pkt1
  3. receiver: rcv pkt1, send ack1; ack1 is delayed
  4. sender: timeout, resend pkt1
  5. sender: rcv ack1, send pkt0
  6. receiver: rcv pkt1 (detect duplicate), send ack1
  7. receiver: rcv pkt0, send ack0
  8. sender: rcv ack1, do nothing
  9. sender: rcv ack0, send pkt1

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

  $D_{trans} = \frac{L}{R} = \frac{8000 \text{ bits}}{10^9 \text{ bits/sec}} = 8 \text{ microsecs}$

  - U_sender: utilization, the fraction of time sender is busy sending

  $U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$

  - if RTT = 30 msec, 1KB pkt every 30 msec: 33kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!

rdt3.0: stop-and-wait operation

[Figure: sender/receiver timeline. First packet bit transmitted at t = 0; last packet bit transmitted at t = L/R; first packet bit arrives; last packet bit arrives, receiver sends ACK; ACK arrives, sender sends next packet at t = RTT + L/R]

  $U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$

Pipelined protocols

pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver

• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

[Figure: sender/receiver timeline with three packets in flight. First packet bit transmitted at t = 0; last bit of first packet transmitted at t = L/R; last bit of each of the three packets arrives and the receiver sends an ACK for each; first ACK arrives, sender sends next packet at t = RTT + L/R]

3-packet pipelining increases utilization by a factor of 3!

  $U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$
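Both utilization figures follow from the same formula, which a short Python function makes easy to recompute. The function name and its `window` parameter are assumptions made for this sketch; `window=1` corresponds to stop-and-wait and `window=3` to the 3-packet pipeline in the figure.

```python
def utilization(packet_bits, rate_bps, rtt_s, window=1):
    """Fraction of time the sender is busy with `window` packets per RTT cycle."""
    d_trans = packet_bits / rate_bps              # L / R, in seconds
    return min(1.0, window * d_trans / (rtt_s + d_trans))

# Slide example: 8000-bit packets, 1 Gbps link, 30 ms RTT.
stop_and_wait = utilization(8000, 1e9, 0.030)
pipelined = utilization(8000, 1e9, 0.030, window=3)
print(f"{stop_and_wait:.5f} {pipelined:.5f}")
```

The `min(1.0, ...)` cap reflects that once the window is large enough to fill the pipe, the sender is simply busy all the time.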

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet

45

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n ("cumulative ACK")
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

initial transition (Λ): base = 1; nextseqnum = 1; enter state "Wait"

event rdt_send(data):
    if (nextseqnum < base + N) {
        sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
        udt_send(sndpkt[nextseqnum])
        if (base == nextseqnum)
            start_timer
        nextseqnum++
    }
    else
        refuse_data(data)

event timeout:
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    …
    udt_send(sndpkt[nextseqnum-1])

event rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
    base = getacknum(rcvpkt) + 1
    if (base == nextseqnum)
        stop_timer
    else
        start_timer

event rdt_rcv(rcvpkt) && corrupt(rcvpkt):
    Λ (no action)
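The sender FSM above translates almost line by line into an event-driven Python class. This is a sketch under assumptions: the class name, the `send` callback, and the boolean `timer_running` flag (standing in for a real timer) are not from the slides, and unlike the FSM it explicitly ignores ACKs below `base` instead of restarting the timer.

```python
class GBNSender:
    """Event-driven Go-Back-N sender; `send` is a callback taking (seq, data)."""

    def __init__(self, window, send):
        self.n = window
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}              # seq -> data, kept until cumulatively ACKed
        self.send = send
        self.timer_running = False    # stand-in for start_timer / stop_timer

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.n:
            return False              # window full: refuse_data
        self.sndpkt[self.nextseqnum] = data
        self.send(self.nextseqnum, data)
        if self.base == self.nextseqnum:
            self.timer_running = True
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        if acknum < self.base:
            return                    # duplicate ACK: ignored in this sketch
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)
        self.base = acknum + 1        # cumulative ACK slides the window
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.send(seq, self.sndpkt[seq])   # go back N: resend all unACKed


sent = []
sender = GBNSender(window=4, send=lambda seq, data: sent.append(seq))
accepted = [sender.rdt_send(f"d{i}") for i in range(5)]
print(accepted, sent)
```

Keeping the packets in a dictionary keyed by sequence number makes the "resend everything from base" step a simple loop.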


GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #

initial transition (Λ): expectedseqnum = 1; sndpkt = make_pkt(0, ACK, chksum); enter state "Wait"

event rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++

event default:
    udt_send(sndpkt)


GBN in action

sender window (N = 4) over sequence numbers 0 1 2 3 4 5 6 7 8

sender:
  1. send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
  2. rcv ack0, send pkt4
  3. rcv ack1, send pkt5
  4. ignore duplicate ACK
  5. pkt 2 timeout: send pkt2, send pkt3, send pkt4, send pkt5

receiver:
  1. receive pkt0, send ack0
  2. receive pkt1, send ack1
  3. receive pkt3, discard, (re)send ack1
  4. receive pkt4, discard, (re)send ack1
  5. receive pkt5, discard, (re)send ack1
  6. rcv pkt2, deliver, send ack2
  7. rcv pkt3, deliver, send ack3
  8. rcv pkt4, deliver, send ack4
  9. rcv pkt5, deliver, send ack5


Selective repeat

• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #s of sent, unACKed packets

53

Selective repeat: sender, receiver windows

[Figure: sender and receiver views of the sequence number space]

Selective repeat

sender:
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver:
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
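The receiver rules above can be sketched in Python as follows. The class and attribute names are hypothetical, and for simplicity the sketch assumes an unbounded sequence-number space (no wraparound), so it illustrates only the buffering and re-ACK logic.

```python
class SRReceiver:
    """Selective-repeat receiver over an unbounded sequence-number space."""

    def __init__(self, window):
        self.n = window
        self.rcvbase = 0
        self.buffer = {}       # out-of-order packets waiting for a gap to fill
        self.delivered = []    # data handed to the upper layer, in order

    def on_packet(self, seq, data):
        """Return the sequence number to ACK, or None if the packet is ignored."""
        if self.rcvbase <= seq < self.rcvbase + self.n:
            self.buffer[seq] = data
            # Deliver the in-order run starting at rcvbase and slide the window.
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return seq
        if self.rcvbase - self.n <= seq < self.rcvbase:
            return seq         # already delivered: re-ACK so the sender can advance
        return None


receiver = SRReceiver(window=4)
acks = [receiver.on_packet(s, f"pkt{s}") for s in (0, 1, 3, 4, 5, 2)]
print(acks, receiver.delivered)
```

The arrival order used here (pkt2 arriving last) mirrors a lost-and-retransmitted packet: pkt3 to pkt5 are buffered and released together once pkt2 fills the gap.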

Selective repeat in action

sender window (N = 4) over sequence numbers 0 1 2 3 4 5 6 7 8

sender:
  1. send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
  2. rcv ack0, send pkt4
  3. rcv ack1, send pkt5
  4. record ack3 arrived
  5. pkt 2 timeout: send pkt2
  6. record ack4 arrived
  7. record ack5 arrived

receiver:
  1. receive pkt0, send ack0
  2. receive pkt1, send ack1
  3. receive pkt3, buffer, send ack3
  4. receive pkt4, buffer, send ack4
  5. receive pkt5, buffer, send ack5
  6. rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

[Figure (a), no problem: the sender sends pkt0, pkt1, pkt2 and receives the ACKs; both windows advance over 0 1 2 3 0 1 2; the sender goes on to send pkt3 and a new pkt0, and the receiver will accept packet with seq number 0]

[Figure (b), oops!: the sender sends pkt0, pkt1, pkt2; the receiver window advances, but all three ACKs are lost (X X X); the sender times out and retransmits pkt0; the receiver will accept packet with seq number 0]

• receiver can't see sender side
• receiver behavior identical in both cases!
• something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
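One way to explore the question is to check, for a given sequence-number space and window size, whether an old packet can land inside the receiver's new window. The helper below is a hypothetical sketch of the worst case in scenario (b): the receiver has advanced by a full window while every ACK was lost.

```python
def windows_overlap(seq_space, window):
    """True if a retransmitted old packet can fall inside the receiver's new window.

    Worst case: the receiver has advanced its window by `window` packets while
    every ACK was lost, so the sender still retransmits from its old window.
    """
    old = {s % seq_space for s in range(0, window)}            # sender's window
    new = {s % seq_space for s in range(window, 2 * window)}   # receiver's window
    return bool(old & new)

print(windows_overlap(4, 3))   # the slide's example: 4 seq #'s, window size 3
print(windows_overlap(4, 2))
```

Running it over many combinations suggests the usual rule of thumb: the ambiguity disappears when the window size is at most half the size of the sequence-number space.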

58

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver

59

TCP segment structure

[Figure: TCP segment layout, 32 bits wide]
  source port # | dest port #
  sequence number
  acknowledgement number
  head len | not used | U A P R S F | receive window
  checksum | Urg data pointer
  options (variable length)
  application data (variable length)

field notes:
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)

TCP seq. numbers, ACKs

sequence numbers:
  - byte stream "number" of first byte in segment's data

acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK

Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say, up to implementor

[Figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer; below, the sender sequence number space with window size N divided into: sent, ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable]

Byte stream in TCP

[Figure: an HTTP GET message of K bytes is sent through a window of N bytes; a segment carrying M bytes starts at the 100th byte, so its TCP header has seq no = 100; the part of the message outside the window cannot be transmitted now]

TCP seq. numbers, ACKs

simple telnet scenario (Host A, Host B):
  1. User types 'C': Host A sends Seq=42, ACK=79, data = 'C'
  2. Host B ACKs receipt of 'C', echoes back 'C': Seq=79, ACK=43, data = 'C'
  3. Host A ACKs receipt of echoed 'C': Seq=43, ACK=80

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT

64

TCP round trip time, timeout

EstimatedRTT = (1 - α)·EstimatedRTT + α·SampleRTT

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Figure: RTT (milliseconds, 100 to 350) versus time (seconds) for the path gaia.cs.umass.edu to fantasia.eurecom.fr, showing SampleRTT and EstimatedRTT]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

  DevRTT = (1 - β)·DevRTT + β·|SampleRTT - EstimatedRTT|
  (typically, β = 0.25)

  TimeoutInterval = EstimatedRTT + 4·DevRTT
  (estimated RTT plus "safety margin")
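The three formulas (EstimatedRTT, DevRTT, TimeoutInterval) fit naturally into one small estimator object. This is a sketch with labeled assumptions: the class name is hypothetical, DevRTT starts at 0 because the slides give no initial value, and DevRTT is updated using the EstimatedRTT from before the current sample.

```python
class RttEstimator:
    """EWMA-based RTT estimator following the slide formulas (times in ms)."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = 0.0   # assumption: the slides do not give an initial value

    def update(self, sample_rtt):
        # Assumption: the deviation is measured against the previous EstimatedRTT.
        deviation = abs(sample_rtt - self.estimated_rtt)
        self.dev_rtt = (1 - self.beta) * self.dev_rtt + self.beta * deviation
        self.estimated_rtt = (1 - self.alpha) * self.estimated_rtt + self.alpha * sample_rtt
        return self.timeout_interval()

    def timeout_interval(self):
        # EstimatedRTT plus a safety margin of four deviations.
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)
print(est.update(120.0))
```

A single outlier sample moves EstimatedRTT only slightly but widens the safety margin noticeably, which is the behavior the slide motivates.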

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks

let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control

TCP sender events:

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments

68

TCP sender (simplified)

initial transition (Λ): NextSeqNum = InitialSeqNum; SendBase = InitialSeqNum; enter state "wait for event"

event: data received from application above
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
        start timer

event: timeout
    retransmit not-yet-acked segment with smallest seq. #
    start timer

event: ACK received, with ACK field value y
    if (y > SendBase) {
        SendBase = y
        /* SendBase-1: last cumulatively ACKed byte */
        if (there are currently not-yet-acked segments)
            start timer
        else stop timer
    }
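The simplified sender can be sketched as a Python class with one method per event. The class name, the `send` callback and the `timer_running` flag are illustrative assumptions; no real timer or network is involved, so the caller decides when a timeout "fires".

```python
class SimpleTcpSender:
    """Simplified TCP sender: cumulative ACKs, one timer, no flow/congestion control."""

    def __init__(self, initial_seq, send):
        self.send_base = initial_seq
        self.next_seq = initial_seq
        self.unacked = {}             # seq of first byte -> payload
        self.send = send              # hypothetical callback: send(seq, payload)
        self.timer_running = False

    def on_app_data(self, payload):
        self.unacked[self.next_seq] = payload
        self.send(self.next_seq, payload)
        self.next_seq += len(payload)     # seq numbers count bytes, not segments
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if not self.unacked:
            return
        oldest = min(self.unacked)        # not-yet-acked segment with smallest seq
        self.send(oldest, self.unacked[oldest])
        self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y            # bytes up to y - 1 are now ACKed
            for seq in [s for s in self.unacked if s + len(self.unacked[s]) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)


log = []
tcp = SimpleTcpSender(92, lambda seq, payload: log.append((seq, len(payload))))
tcp.on_app_data(b"x" * 8)
tcp.on_app_data(b"y" * 20)
tcp.on_ack(120)
print(log, tcp.send_base)
```

The example uses the Seq=92 (8 bytes) and Seq=100 (20 bytes) segments from the retransmission scenarios, so a single cumulative ACK=120 clears both.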

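TCP: retransmission scenarios

lost ACK scenario:
  1. Host A sends Seq=92, 8 bytes of data
  2. Host B replies ACK=100, which is lost (X)
  3. Host A times out and resends Seq=92, 8 bytes of data
  4. Host B replies ACK=100

premature timeout:
  1. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data (SendBase=92)
  2. Host B replies ACK=100 and ACK=120
  3. Host A times out before the ACKs arrive and resends Seq=92, 8 bytes of data
  4. ACK=100 arrives (SendBase=100), then ACK=120 (SendBase=120)
  5. Host B answers the retransmission with ACK=120 (SendBase=120)

cumulative ACK:
  1. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
  2. Host B replies ACK=100, which is lost (X), then ACK=120
  3. ACK=120 arrives before the timeout, so Host A continues with Seq=120, 15 bytes of data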

TCP ACK generation [RFC 5681]

1. event at receiver: arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
   TCP receiver action: delayed ACK; wait up to 500 ms for next segment; if no next segment, send ACK

2. event at receiver: arrival of in-order segment with expected seq #; one other segment has ACK pending
   TCP receiver action: immediately send single cumulative ACK, ACKing both in-order segments

3. event at receiver: arrival of out-of-order segment with higher-than-expected seq #; gap detected
   TCP receiver action: immediately send duplicate ACK, indicating seq # of next expected byte

4. event at receiver: arrival of segment that partially or completely fills gap
   TCP receiver action: immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout

[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost (X); Host B returns ACK=100 four times (the original plus 3 DUP ACKs); Host A resends Seq=100, 20 bytes of data before the timeout expires]
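Counting duplicate ACKs takes only a few lines. The class below is a hypothetical sketch that tracks the last ACK number seen and signals a fast retransmit on the third duplicate; everything else about the sender is left out.

```python
class DupAckDetector:
    """Signals a fast retransmit when the same ACK number is seen 3 extra times."""

    def __init__(self):
        self.last_ack = None
        self.dup_count = 0

    def on_ack(self, acknum):
        """Return the seq number to fast-retransmit, or None."""
        if acknum == self.last_ack:
            self.dup_count += 1
            if self.dup_count == 3:
                return acknum          # resend the segment starting at this byte
        else:
            self.last_ack = acknum     # new ACK: reset the duplicate counter
            self.dup_count = 0
        return None


detector = DupAckDetector()
print([detector.on_ack(100) for _ in range(4)])
```

In the figure's scenario the four ACK=100 segments are one original ACK plus three duplicates, so the retransmission is triggered on the fourth arrival.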

TCP flow control

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code into the TCP socket receiver buffers (OS), from which the application process reads; application may remove data from TCP socket buffers slower than TCP receiver is delivering (sender is sending)]

TCP flow control

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which is split into buffered data and free buffer space (rwnd); buffered data goes to the application process]
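The bookkeeping on both sides is simple arithmetic. The sketch below uses the usual textbook variables (LastByteRcvd, LastByteRead, LastByteSent, LastByteAcked), which do not appear on these slides and are assumptions here, as are the function names and example numbers.

```python
def advertised_window(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Receiver side: rwnd is the free space left in RcvBuffer."""
    buffered = last_byte_rcvd - last_byte_read
    return rcv_buffer - buffered

def sendable_bytes(rwnd, last_byte_sent, last_byte_acked):
    """Sender side: keep in-flight data within the advertised window."""
    in_flight = last_byte_sent - last_byte_acked
    return max(0, rwnd - in_flight)

# Hypothetical numbers: a 4096-byte buffer holding 2000 unread bytes.
rwnd = advertised_window(4096, last_byte_rcvd=3000, last_byte_read=1000)
print(rwnd, sendable_bytes(rwnd, last_byte_sent=5000, last_byte_acked=4000))
```

When the application reads slowly, `buffered` grows, rwnd shrinks, and the sender's allowance drops toward zero, which is how the receiver throttles the sender.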

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

[Figure: client and server, each with application and network layers, hold the same kind of state:
  connection state: ESTAB
  connection variables: seq # client-to-server, server-to-client; rcvBuffer size at server, client]

client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

client state: CLOSED; server state: LISTEN

  1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client enters SYNSENT
  2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server enters SYN RCVD
  3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1), this segment may contain client-to-server data; client enters ESTAB
  4. server: received ACK(y) indicates client is live; server enters ESTAB
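The exchange can be traced with a toy model that builds the three segments and records the state of each side after every step. The `Segment` dataclass and the function name are hypothetical, state names are spelled with underscores, and setting the final ACK's sequence number to x+1 is an assumption based on the SYN consuming one sequence number.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    syn: bool = False
    ack: bool = False
    seq: int = 0
    acknum: int = 0

def three_way_handshake(x, y):
    """Return the three segments and the (client, server) state after each step."""
    states = [("CLOSED", "LISTEN")]

    syn = Segment(syn=True, seq=x)                       # client picks x
    states.append(("SYN_SENT", "LISTEN"))

    synack = Segment(syn=True, ack=True, seq=y, acknum=syn.seq + 1)   # server picks y
    states.append(("SYN_SENT", "SYN_RCVD"))

    ack = Segment(ack=True, seq=x + 1, acknum=synack.seq + 1)
    states.append(("ESTAB", "SYN_RCVD"))                 # client is done first
    states.append(("ESTAB", "ESTAB"))                    # server, once the ACK arrives
    return [syn, synack, ack], states

segments, states = three_way_handshake(x=1000, y=5000)
for segment in segments:
    print(segment)
```

Note that the client reaches ESTAB one step before the server, which is why the third segment may already carry client-to-server data.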

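TCP 3-way handshake: FSM

client side:
• closed → SYN sent: on Socket clientSocket = new Socket("hostname", "port number"), send SYN(seq=x)
• SYN sent → ESTAB: on receiving SYNACK(seq=y, ACKnum=x+1), send ACK(ACKnum=y+1)

server side:
• closed → listen: on Socket connectionSocket = welcomeSocket.accept(); Λ (nothing sent)
• listen → SYN rcvd: on receiving SYN(x), send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
• SYN rcvd → ESTAB: on receiving ACK(ACKnum=y+1); Λ (nothing sent)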

TCP: closing a connection

• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

client state: ESTAB; server state: ESTAB

  1. client: clientSocket.close(); send FINbit=1, seq=x; enter FIN_WAIT_1 (can no longer send but can receive data)
  2. server: reply ACKbit=1, ACKnum=x+1; enter CLOSE_WAIT (can still send data)
  3. client: on receiving the ACK, enter FIN_WAIT_2 (wait for server close)
  4. server: send FINbit=1, seq=y; enter LAST_ACK (can no longer send data)
  5. client: reply ACKbit=1, ACKnum=y+1; enter TIMED_WAIT (timed wait for 2 * max segment lifetime), then CLOSED
  6. server: on receiving the ACK, enter CLOSED

The ldquoTwo Army Problemrdquo

84

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!

85

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λin, approaches capacity

[Figure: Host A and Host B send original data at rate λin through one router with unlimited shared output link buffers; throughput is λout. Plots: λout versus λin, with R/2 marked on both axes, and delay versus λin, growing large as λin approaches R/2]

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λin = λout
  - transport-layer input includes retransmissions: λ'in ≥ λin

[Figure: Host A sends original data at rate λin; the transport layer offers λ'in (original data plus retransmitted data) to a router with finite shared output link buffers; Host B receives at rate λout]

Causescosts of congestion scenario 2

idealization perfect knowledgebull sender sends only when router

buffers available

88

finite shared output link buffers

lin original dataloutlin original data plus

retransmitted datacopy

free buffer space

R2

R2

l out

lin

Host B

A

lin original dataloutlin original data plus

retransmitted datacopy

no buffer space

Causescosts of congestion scenario 2

Idealization known losspackets can be lost dropped at router due to full buffers

bull sender only resends if packet known to be lost

89

A

Host B

lin original dataloutlin original data plus

retransmitted data

free buffer space

Causescosts of congestion scenario 2

90

R2

R2lin

l out

when sending at R2 some packets are retransmissions but asymptotic goodput is still R2 (why)

A

Host B

Idealization known losspackets can be lost dropped at router due to full buffers

bull sender only resends if packet known to be lost

A

lin loutlincopy

free buffer space

timeout

R2

R2lin

l out

when sending at R2 some packets are retransmissions including duplicated that are delivered

Host B

Realistic duplicatesv packets can be lost dropped

at router due to full buffersv sender times out prematurely

sending two copies both of which are delivered

Causescosts of congestion scenario 2

91

R2

l out

when sending at R2 some packets are retransmissions including duplicated that are delivered

ldquocostsrdquo of congestionv more work (retrans) for given ldquogoodputrdquov unneeded retransmissions link carries multiple copies of pkt

sect decreasing goodput

R2lin

Causescosts of congestion scenario 2

92

Realistic duplicatesv packets can be lost dropped

at router due to full buffersv sender times out prematurely

sending two copies both of which are delivered

Causescosts of congestion scenario 3

bull four sendersbull multihop pathsbull timeoutretransmit

93

Q what happens as lin and linrsquo

increase

finite shared output link buffers

Host A lout Host B

Host CHost D

lin original datalin original data plus

retransmitted data

A as red linrsquo increases all arriving

blue pkts at upper queue are dropped blue throughput g 0

another ldquocostrdquo of congestionv when packet dropped any ldquoupstream

transmission capacity used for that packet was wasted

Causescosts of congestion scenario 3

94

R2

R2

l out

linrsquo

Bandwidth wastage for packets dropped at the 2nd router

Offered load by Host A

Thro

ughp

ut b

y bl

ue tr

affic

Approaches towards congestion control

95

two broad approaches towards congestion control

end-end congestion control

bull no explicit feedback from network

bull congestion inferred from end-system observed loss delay

bull approach taken by TCP

network-assisted congestion control

bull routers provide feedback to end systemsndashsingle bit indicating

congestion (SNA DECbit TCPIP ECN ATM)

ndashexplicit rate for sender to send at

TCP congestion controladditive increase multiplicative decrease (AIMD)

96

v approach sender increases transmission rate (window size) probing for usable bandwidth until loss occurssect additive increase increase cwnd by 1 MSS every

RTT until loss detectedsectmultiplicative decrease cut cwnd in half after loss

cwnd

TCP

send

er

cong

estio

n w

indo

w s

ize

AIMD saw toothbehavior probing

for bandwidth

additively increase window size helliphellip until loss occurs (then cut window in half)

time

TCP Congestion Control: details

• sender limits transmission:

  $\mathrm{LastByteSent} - \mathrm{LastByteAcked} \le \mathrm{cwnd}$

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

  $\mathrm{rate} \approx \frac{\mathrm{cwnd}}{\mathrm{RTT}}$ bytes/sec

97

[Figure: sender sequence number space, showing last byte ACKed, bytes sent but not yet ACKed ("in-flight"), and last byte sent; the in-flight span is bounded by cwnd]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

98

[Figure: Host A/Host B timeline. Host A sends one segment in the first RTT, two segments in the second, four segments in the third]

TCP: detecting, reacting to loss

• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

99

TCP: switching from slow start to CA

100

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

101

initial (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; enter slow start

slow start:
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: Λ; go to congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment

congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start

fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is 3/4 W
  – avg. throughput is 3/4 W per RTT

  avg TCP throughput = $\frac{3}{4}\,\frac{W}{RTT}$ bytes/sec

102

[Figure: sawtooth of window size oscillating between W/2 and W]

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

  TCP throughput = $\frac{1.22 \cdot MSS}{RTT \sqrt{L}}$

  to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate!
• new versions of TCP for high-speed

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

104

[Figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

105

[Figure: Connection 2 throughput versus Connection 1 throughput, both axes up to R. The trajectory alternates between congestion avoidance (additive increase) and loss (decrease window by factor of 2), moving toward the equal bandwidth share line. Full bandwidth utilization line: points (X1, Y1) where X1 + Y1 = R. Equal bandwidth share line: points (X2, Y2) where X2 = Y2]

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2

106

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and is marked ECN=11 by a congested router; the destination returns a TCP ACK segment with ECE=1]

Page 37: ChapterIII: Transport Layer

rdt3.0: channels with errors and loss

new assumption: underlying channel can also lose packets (data, ACKs)
– checksum, seq. #, ACKs, retransmissions will be of help … but not enough

approach: sender waits "reasonable" amount of time for ACK
• retransmits if no ACK received in this time
• if pkt (or ACK) just delayed (not lost):
  – retransmission will be duplicate, but seq. #'s already handle this
  – receiver must specify seq # of pkt being ACKed
• requires countdown timer

37

rdt3.0 sender

38

state: Wait for call 0 from above
• rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer; go to Wait for ACK0
• rdt_rcv(rcvpkt): Λ

state: Wait for ACK0
• rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 1)): Λ
• timeout: udt_send(sndpkt); start_timer
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 0): stop_timer; go to Wait for call 1 from above

state: Wait for call 1 from above
• rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer; go to Wait for ACK1
• rdt_rcv(rcvpkt): Λ

state: Wait for ACK1
• rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || isACK(rcvpkt, 0)): Λ
• timeout: udt_send(sndpkt); start_timer
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt, 1): stop_timer; go to Wait for call 0 from above

rdt3.0 in action

39

[Figure: sender/receiver timelines]

(a) no loss: send pkt0; rcv pkt0, send ack0; rcv ack0, send pkt1; rcv pkt1, send ack1; rcv ack1, send pkt0; rcv pkt0, send ack0.

(b) packet loss: send pkt0; rcv pkt0, send ack0; rcv ack0, send pkt1; pkt1 is lost; timeout, resend pkt1; rcv pkt1, send ack1; rcv ack1, send pkt0; rcv pkt0, send ack0.

rdt3.0 in action

40

[Figure: sender/receiver timelines]

(c) ACK loss: send pkt0; rcv pkt0, send ack0; rcv ack0, send pkt1; rcv pkt1, send ack1; ack1 is lost; timeout, resend pkt1; rcv pkt1 (detect duplicate), send ack1; rcv ack1, send pkt0; rcv pkt0, send ack0.

(d) premature timeout / delayed ACK: send pkt0; rcv pkt0, send ack0; rcv ack0, send pkt1; rcv pkt1, send ack1; ack1 is delayed; timeout, resend pkt1; rcv ack1, send pkt0; rcv pkt1 (detect duplicate), send ack1; rcv second ack1, do nothing; rcv pkt0, send ack0.

Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

  $D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^{9}\ \text{bits/sec}} = 8\ \text{microsecs}$

41

• U sender: utilization, the fraction of time sender busy sending

  $U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$

• if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!

rdt3.0: stop-and-wait operation

42

[Figure: sender/receiver timeline. First packet bit transmitted, t = 0; last packet bit transmitted, t = L/R; first packet bit arrives; last packet bit arrives, send ACK; ACK arrives, send next packet, t = RTT + L/R]

$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$

Pipelined protocols

pipelining: sender allows multiple "in-flight", yet-to-be-acknowledged pkts
– range of sequence numbers must be increased
– buffering at sender and/or receiver

43

• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

44

[Figure: sender/receiver timeline with three packets in flight. First packet bit transmitted, t = 0; last bit transmitted, t = L/R; first packet bit arrives; last packet bit arrives, send ACK; last bit of 2nd packet arrives, send ACK; last bit of 3rd packet arrives, send ACK; ACK arrives, send next packet, t = RTT + L/R]

3-packet pipelining increases utilization by a factor of 3!

$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$
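The utilization figures for stop-and-wait and for 3-packet pipelining can be checked with a few lines of Python. This is only a sketch of the slide formula: the function name `sender_utilization` and its `window` parameter are illustrative, and ACK transmission time is assumed negligible, as in the formula itself.

```python
def sender_utilization(packet_bits, link_bps, rtt_s, window=1):
    """Fraction of time the sender is busy when `window` packets are sent per cycle."""
    d_trans = packet_bits / link_bps              # L/R, in seconds
    # One cycle lasts RTT + L/R; the sender transmits for window * L/R of it.
    return min(1.0, window * d_trans / (rtt_s + d_trans))


L_BITS, R_BPS, RTT_S = 8000, 1e9, 0.030          # values from the slide example
u_stop_and_wait = sender_utilization(L_BITS, R_BPS, RTT_S)
u_pipelined = sender_utilization(L_BITS, R_BPS, RTT_S, window=3)
print(f"stop-and-wait: {u_stop_and_wait:.5f}, 3-packet pipeline: {u_pipelined:.5f}")
```

Raising `window` until the result reaches 1.0 gives a feel for how many packets must be in flight to keep this particular link busy.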

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  – doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  – when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  – when timer expires, retransmit only that unacked packet

45

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed

46

• ACK(n): ACKs all pkts up to, including seq # n: "cumulative ACK"
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

47

initial (Λ):
  base = 1
  nextseqnum = 1

state: Wait

rdt_send(data):
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)

timeout:
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  …
  udt_send(sndpkt[nextseqnum-1])

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer

rdt_rcv(rcvpkt) && corrupt(rcvpkt):
  Λ
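The GBN sender FSM can be sketched as a small Python class. This is a hedged illustration, not the slides' implementation: the class name `GBNSender`, the `sent` log standing in for `udt_send`, and the boolean `timer_running` standing in for a real timer are all assumptions, and corrupt packets are simply never delivered to `on_ack`. Unlike the FSM, the sketch also ignores ACKs older than `base`.

```python
class GBNSender:
    """Go-Back-N sender; the channel and the timer are stubbed out."""

    def __init__(self, window_size):
        self.N = window_size
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}             # seq -> data, buffered until ACKed
        self.timer_running = False
        self.sent = []               # log of seq numbers handed to the channel

    def _udt_send(self, seq):
        self.sent.append(seq)

    def rdt_send(self, data):
        """Send data if the window has room; return False to refuse it."""
        if self.nextseqnum >= self.base + self.N:
            return False
        self.sndpkt[self.nextseqnum] = data
        self._udt_send(self.nextseqnum)
        if self.base == self.nextseqnum:
            self.timer_running = True        # start_timer for the oldest packet
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        """Cumulative ACK: everything up to and including acknum is ACKed."""
        if acknum < self.base:
            return                           # old or duplicate ACK: ignore
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)
        self.base = acknum + 1
        # stop_timer if nothing is outstanding, otherwise (re)start it
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        """Go back N: retransmit every unACKed packet in the window."""
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self._udt_send(seq)
```

With `GBNSender(4)`, a fifth `rdt_send` is refused until an ACK slides the window forward, and `on_timeout` resends every packet from `base` up to `nextseqnum - 1`.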

GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
– may generate duplicate ACKs
– need only remember expectedseqnum

• out-of-order pkt:
  – discard (don't buffer): no receiver buffering!
  – re-ACK pkt with highest in-order seq #

49

initial (Λ):
  expectedseqnum = 1
  sndpkt = make_pkt(0, ACK, chksum)

state: Wait

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++

default:
  udt_send(sndpkt)

GBN in action

51

[Figure: sender/receiver timeline, sender window (N=4), sequence numbers 0 to 8]

sender:
• send pkt0, send pkt1, send pkt2 (lost), send pkt3, (wait)
• rcv ack0, send pkt4
• rcv ack1, send pkt5
• ignore duplicate ACK
• pkt 2 timeout: send pkt2, send pkt3, send pkt4, send pkt5

receiver:
• receive pkt0, send ack0
• receive pkt1, send ack1
• receive pkt3, discard, (re)send ack1
• receive pkt4, discard, (re)send ack1
• receive pkt5, discard, (re)send ack1
• rcv pkt2, deliver, send ack2
• rcv pkt3, deliver, send ack3
• rcv pkt4, deliver, send ack4
• rcv pkt5, deliver, send ack5

Selective repeat

• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #'s
  – limits seq #'s of sent, unACKed packets

53

Selective repeat: sender, receiver windows

54

Selective repeat

55

sender

data from above:
• if next available seq # in window, send pkt

timeout(n):
• resend pkt n, restart timer

ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver

pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt

pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)

otherwise:
• ignore
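The three receiver cases of selective repeat can be sketched in Python as follows. This is an illustration under simplifying assumptions: sequence numbers are unbounded integers (no wrap-around), and the class name `SRReceiver` and its `delivered` list (standing in for delivery to the upper layer) are hypothetical.

```python
class SRReceiver:
    """Selective-repeat receiver with an unbounded sequence-number space."""

    def __init__(self, window_size, rcvbase=0):
        self.N = window_size
        self.rcvbase = rcvbase
        self.buffer = {}         # out-of-order packets awaiting delivery
        self.delivered = []      # data handed to the upper layer, in order

    def on_packet(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            # Deliver the in-order run starting at rcvbase and slide the window.
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n             # already delivered: re-ACK so the sender can advance
        return None              # otherwise: ignore
```

Feeding it packets 0, 1, 3, 4, 5 and then 2 with a window of 4 buffers packets 3 to 5 and delivers packets 2 to 5 together once packet 2 arrives.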

Selective repeat in action

56

[Figure: sender/receiver timeline, sender window (N=4), sequence numbers 0 to 8]

sender:
• send pkt0, send pkt1, send pkt2 (lost), send pkt3, (wait)
• rcv ack0, send pkt4
• rcv ack1, send pkt5
• record ack3 arrived
• pkt 2 timeout: send pkt2
• record ack4 arrived
• record ack5 arrived

receiver:
• receive pkt0, send ack0
• receive pkt1, send ack1
• receive pkt3, buffer, send ack3
• receive pkt4, buffer, send ack4
• receive pkt5, buffer, send ack5
• rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?

Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

[Figure: sender window and receiver window (after receipt) over the sequence 0 1 2 3 0 1 2, for two scenarios]

(a) no problem: sender sends pkt0, pkt1, pkt2 and receives the ACKs; it then sends pkt3 (lost) and a new pkt0. Receiver will accept packet with seq number 0.

(b) oops!: sender sends pkt0, pkt1, pkt2 but the ACKs are lost; sender times out and retransmits pkt0. Receiver will accept packet with seq number 0.

receiver can't see sender side; receiver behavior identical in both cases! something's (very) wrong!

• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?

58
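One way to explore the question is to compare the sender's oldest possible window with the receiver's newest possible window and check whether any sequence number appears in both. The sketch below does this; the function name `sr_window_is_safe` is hypothetical, and the worst case assumed is the one in scenario (b), where the receiver has advanced by a full window while the sender has seen no ACKs. Under that assumption the check reduces to the standard condition that the window be at most half the sequence-number space.

```python
def sr_window_is_safe(seq_space, window):
    """True if an old retransmission can never be mistaken for a new packet."""
    # Sender still holds the old window; receiver has moved on by `window`.
    old = {s % seq_space for s in range(0, window)}
    new = {s % seq_space for s in range(window, 2 * window)}
    return old.isdisjoint(new)


print(sr_window_is_safe(4, 3))   # the slide's example
print(sr_window_is_safe(4, 2))
```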

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver

59

TCP segment structure

60

[Figure: TCP segment layout, 32 bits wide]

• source port #, dest port #
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• head len, not used, flag bits U A P R S F
  – URG: urgent data (generally not used)
  – ACK: ACK # valid
  – PSH: push data now
  – RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
• Urg data pointer
• options (variable length)
• application data (variable length)

TCP seq. numbers, ACKs

sequence numbers:
– byte stream "number" of first byte in segment's data

acknowledgements:
– seq # of next byte expected from other side
– cumulative ACK

Q: how receiver handles out-of-order segments
– A: TCP spec doesn't say; up to implementor

61

[Figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer. Sender sequence number space with window size N: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable]

Byte stream in TCP

62

[Figure: an HTTP Get Message (K bytes) in the sender's byte stream, with a window of N bytes. A TCP header (seq no = 100) is followed by M bytes starting at the 100th byte; the rest of the message cannot be transmitted now]

TCP seq. numbers, ACKs

63

simple telnet scenario (Host A, Host B):

1. User types 'C'. Host A sends: Seq=42, ACK=79, data = 'C'
2. Host B ACKs receipt of 'C', echoes back 'C'. Host B sends: Seq=79, ACK=43, data = 'C'
3. Host A ACKs receipt of echoed 'C'. Host A sends: Seq=43, ACK=80

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT

64

[Figure: SampleRTT and Estimated RTT, RTT (milliseconds, 100 to 350) versus time (seconds, 1 to 106), for RTT: gaia.cs.umass.edu to fantasia.eurecom.fr]

$\mathrm{EstimatedRTT} = (1 - \alpha)\cdot\mathrm{EstimatedRTT} + \alpha\cdot\mathrm{SampleRTT}$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

TCP round trip time, timeout

65

[Figure: SampleRTT and EstimatedRTT, RTT (milliseconds) versus time (seconds), for RTT: gaia.cs.umass.edu to fantasia.eurecom.fr]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

  $\mathrm{DevRTT} = (1 - \beta)\cdot\mathrm{DevRTT} + \beta\cdot|\mathrm{SampleRTT} - \mathrm{EstimatedRTT}|$

  (typically, β = 0.25)

66

  $\mathrm{TimeoutInterval} = \mathrm{EstimatedRTT} + 4\cdot\mathrm{DevRTT}$

  (EstimatedRTT: estimated RTT; 4·DevRTT: "safety margin")
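The EstimatedRTT, DevRTT and TimeoutInterval updates can be combined into a small estimator. The sketch below is one possible arrangement, not something the slides specify: the class name `RttEstimator` is hypothetical, the initial values (first sample for EstimatedRTT, half of it for DevRTT) follow the seeding described in RFC 6298, and DevRTT is updated before EstimatedRTT so that the deviation is measured against the previous estimate.

```python
class RttEstimator:
    """EWMA-based RTT estimator with alpha = 0.125 and beta = 0.25 by default."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample     # assumption: seed with first sample
        self.dev_rtt = first_sample / 2       # assumption: RFC 6298-style seed

    def update(self, sample_rtt):
        """Fold in one SampleRTT and return the new TimeoutInterval."""
        # Deviation is taken against the estimate from before this sample.
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        return self.timeout_interval()

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt
```

A single large sample moves EstimatedRTT only slightly but widens the safety margin noticeably, which is the behavior the two formulas are meant to produce.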

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks

67

let's initially consider simplified TCP sender:
– ignore duplicate acks
– ignore flow control, congestion control

TCP sender events:

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments

68

TCP sender (simplified)

69

initial (Λ):
  NextSeqNum = InitialSeqNum
  SendBase = InitialSeqNum

state: wait for event

data received from application above:
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., "send")
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
    start timer

timeout:
  retransmit not-yet-acked segment with smallest seq. #
  start timer

ACK received, with ACK field value y:
  if (y > SendBase) {
    SendBase = y
    /* SendBase–1: last cumulatively ACKed byte */
    if (there are currently not-yet-acked segments)
      start timer
    else stop timer
  }
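The three events of the simplified TCP sender map onto three methods. The sketch below is an illustration only: the class name `SimpleTcpSender`, the `wire` list standing in for IP, and the boolean `timer_running` standing in for a real retransmission timer are assumptions, and duplicate ACKs, flow control and congestion control are ignored, as on the slide.

```python
class SimpleTcpSender:
    """Simplified TCP sender: cumulative ACKs and a single retransmission timer."""

    def __init__(self, initial_seq):
        self.send_base = initial_seq
        self.next_seq = initial_seq
        self.unacked = {}            # seq of first byte -> payload
        self.timer_running = False
        self.wire = []               # (seq, payload) log standing in for IP

    def send(self, data: bytes):
        """Data received from the application above."""
        seq = self.next_seq
        self.unacked[seq] = data
        self.wire.append((seq, data))
        self.next_seq += len(data)
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        """Retransmit the not-yet-ACKed segment with the smallest seq number."""
        if self.unacked:
            seq = min(self.unacked)
            self.wire.append((seq, self.unacked[seq]))
            self.timer_running = True

    def on_ack(self, y):
        """ACK field value y: all bytes up to y - 1 are cumulatively ACKed."""
        if y > self.send_base:
            self.send_base = y
            fully_acked = [s for s, d in self.unacked.items() if s + len(d) <= y]
            for seq in fully_acked:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)
```

Sending 8 bytes at Seq=92 and 20 bytes at Seq=100 and then receiving ACK=120 leaves `send_base` at 120 with the timer stopped.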

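TCP: retransmission scenarios

70

lost ACK scenario (Host A, Host B):
• Host A sends Seq=92, 8 bytes of data
• Host B replies ACK=100, which is lost
• Host A times out and resends Seq=92, 8 bytes of data
• Host B replies ACK=100

premature timeout (Host A, Host B):
• Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data (SendBase=92)
• Host B replies ACK=100, then ACK=120
• Host A times out before the ACKs arrive and resends Seq=92, 8 bytes of data
• the ACKs arrive: SendBase=100, then SendBase=120
• Host B answers the retransmission with ACK=120 (SendBase=120)

TCP: retransmission scenarios

71

cumulative ACK (Host A, Host B):
• Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
• Host B replies ACK=100, which is lost, then ACK=120
• ACK=120 arrives before the timeout; Host A continues with Seq=120, 15 bytes of data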

TCP ACK generation [RFC 5681]

72

event at receiver → TCP receiver action

• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected
  → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap
  → immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs

73

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
– likely that unacked segment lost, so don't wait for timeout

TCP fast retransmit

74

[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost. Host B replies ACK=100 repeatedly; after 3 DUP ACKs, Host A resends Seq=100, 20 bytes of data before the timeout expires]

TCP flow control

75

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code into the TCP socket receiver buffers (OS), from which the application process reads]

application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)

TCP flow control

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

76

[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data and free buffer space (rwnd); buffered data goes to application process]
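The sender-side rule (keep in-flight data within the advertised rwnd) amounts to one subtraction. The sketch below is illustrative; the function name `bytes_sender_may_send` and its parameter names are assumptions, and congestion control is left out.

```python
def bytes_sender_may_send(last_byte_sent, last_byte_acked, rwnd):
    """How many more bytes may be sent without exceeding the receiver's rwnd."""
    in_flight = last_byte_sent - last_byte_acked      # unacked ("in-flight") bytes
    return max(0, rwnd - in_flight)


print(bytes_sender_may_send(last_byte_sent=1500, last_byte_acked=1000, rwnd=4096))
```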

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

77

[Figure: client and server, each with application and network layers, each holding:
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client]

Socket clientSocket = new Socket("hostname", "port number");

Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

80

client state: CLOSED → SYNSENT → ESTAB
server state: LISTEN → SYN RCVD → ESTAB

1. client: choose init seq num, x; send TCP SYN msg
   segment: SYNbit=1, Seq=x
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN
   segment: SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK; this segment may contain client-to-server data
   segment: ACKbit=1, ACKnum=y+1
4. server: received ACK(y) indicates client is live
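The sequence and acknowledgement numbers exchanged in the handshake can be traced with a short simulation. This is a sketch only: segments are modelled as plain dictionaries, the function name `three_way_handshake` is hypothetical, and initial sequence numbers are drawn from a random generator purely for illustration. The arithmetic wraps modulo 2^32 because TCP sequence numbers are 32 bits wide.

```python
import random

SEQ_SPACE = 2 ** 32     # TCP sequence numbers are 32-bit


def three_way_handshake(rng=random):
    """Return the SYN, SYNACK and ACK segments as dictionaries."""
    x = rng.randrange(SEQ_SPACE)                     # client's initial seq num
    syn = {"SYN": 1, "seq": x}
    y = rng.randrange(SEQ_SPACE)                     # server's initial seq num
    synack = {"SYN": 1, "ACK": 1, "seq": y,
              "acknum": (syn["seq"] + 1) % SEQ_SPACE}
    ack = {"ACK": 1, "acknum": (synack["seq"] + 1) % SEQ_SPACE}
    return syn, synack, ack


syn, synack, ack = three_way_handshake(random.Random(1))
print(syn, synack, ack)
```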

TCP 3-way handshake: FSM

81

client:
• closed → SYN sent: on Socket clientSocket = new Socket("hostname", "port number"); send SYN(seq=x)
• SYN sent → ESTAB: on SYNACK(seq=y, ACKnum=x+1); send ACK(ACKnum=y+1)

server:
• closed → listen: on Socket connectionSocket = welcomeSocket.accept(); Λ
• listen → SYN rcvd: on SYN(x); send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
• SYN rcvd → ESTAB: on ACK(ACKnum=y+1); Λ

TCP: closing a connection

• client, server each close their side of connection
  – send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

82

TCP: closing a connection

83

[Figure: client/server timeline; client state and server state both start in ESTAB]

1. client calls clientSocket.close() and sends FINbit=1, seq=x; client enters FIN_WAIT_1 (can no longer send but can receive data)
2. server replies ACKbit=1, ACKnum=x+1 and enters CLOSE_WAIT (can still send data); client enters FIN_WAIT_2 (wait for server close)
3. server sends FINbit=1, seq=y and enters LAST_ACK (can no longer send data)
4. client replies ACKbit=1, ACKnum=y+1 and enters TIMED_WAIT (timed wait for 2 × max segment lifetime), then CLOSED
5. server receives the ACK and enters CLOSED

The "Two Army Problem"

84

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
• a top-10 problem!

85

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2

86

[Figure: Host A and Host B send original data at rate λin through one router with unlimited shared output link buffers; throughput is λout. Plots: λout versus λin (both up to R/2) and delay versus λin (up to R/2)]

• large delays as arrival rate, λin, approaches capacity

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: λin = λout
  – transport-layer input includes retransmissions: λ'in ≥ λin

87

[Figure: Host A and Host B send through one router with finite shared output link buffers. λin: original data; λ'in: original data, plus retransmitted data; λout: throughput]

Causes/costs of congestion: scenario 2

idealization: perfect knowledge
• sender sends only when router buffers available

88

[Figure: Host A sends λin (original data) and λ'in (original data, plus retransmitted data) into a router with finite shared output link buffers; the sender keeps a copy and sends only while there is free buffer space. Plot: λout versus λin, both up to R/2]

Causes/costs of congestion: scenario 2

Idealization: known loss. Packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost

89

[Figure: Host A keeps a copy of each packet; a packet that arrives when there is no buffer space is dropped, and the copy is resent when there is free buffer space]

Causes/costs of congestion: scenario 2

90

Idealization: known loss. Packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost

[Figure: λout versus λin, both up to R/2. When sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)]

Causes/costs of congestion: scenario 2

91

Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered

[Figure: Host A times out while there is free buffer space and sends a second copy. Plot: λout versus λin, both up to R/2. When sending at R/2, some packets are retransmissions, including duplicates that are delivered!]

"costs" of congestion:
• more work (retransmissions) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  – decreasing goodput

[Figure: λout versus λin, both axes up to R/2]

Causes/costs of congestion: scenario 2

92

Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

93

Q: what happens as λin and λin' increase?

[Figure: Hosts A, B, C, D sending over multihop paths through routers with finite shared output link buffers. λin: original data; λin': original data, plus retransmitted data; λout: throughput]

A: as red λin' increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!

Causes/costs of congestion: scenario 3

94

[Figure: throughput of blue traffic (λout, up to R/2) versus offered load by Host A (λin', up to R/2). Annotation: bandwidth wastage for packets dropped at the 2nd router]

Approaches towards congestion control

95

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

96

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss

[Figure: cwnd (TCP sender congestion window size) versus time. AIMD sawtooth behavior, probing for bandwidth: additively increase window size … until loss occurs (then cut window in half)]
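The AIMD sawtooth can be reproduced with a toy model. In the sketch below, loss is assumed to occur exactly when cwnd exceeds a fixed capacity; that trigger, the function name `aimd_trace` and the use of MSS units for cwnd are simplifying assumptions made for illustration, not part of the slides.

```python
def aimd_trace(capacity_mss, rtts, cwnd=1.0):
    """Return cwnd (in MSS) at the start of each RTT under AIMD."""
    trace = []
    for _ in range(rtts):
        trace.append(cwnd)
        if cwnd > capacity_mss:      # stand-in for a detected loss
            cwnd = cwnd / 2          # multiplicative decrease
        else:
            cwnd += 1                # additive increase: +1 MSS per RTT
    return trace


print(aimd_trace(capacity_mss=8, rtts=12))
```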

TCP Congestion Control: details

• sender limits transmission:

  $\mathrm{LastByteSent} - \mathrm{LastByteAcked} \le \mathrm{cwnd}$

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

  $\mathrm{rate} \approx \frac{\mathrm{cwnd}}{\mathrm{RTT}}$ bytes/sec

97

[Figure: sender sequence number space, showing last byte ACKed, bytes sent but not yet ACKed ("in-flight"), and last byte sent; the in-flight span is bounded by cwnd]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

98

[Figure: Host A/Host B timeline. Host A sends one segment in the first RTT, two segments in the second, four segments in the third]
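The link between "increment cwnd for every ACK" and "double cwnd every RTT" can be seen in a short loop. The sketch assumes no loss, one ACK per segment, and a full window sent each RTT; the function name `slow_start_rounds` is hypothetical.

```python
def slow_start_rounds(rounds, mss=1):
    """Return cwnd after each RTT when every segment in the window is ACKed."""
    cwnd = mss
    history = [cwnd]
    for _ in range(rounds):
        acks = cwnd // mss           # one ACK per segment sent in this RTT
        for _ in range(acks):
            cwnd += mss              # +1 MSS for every ACK received
        history.append(cwnd)
    return history


print(slow_start_rounds(3))
```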

TCP: detecting, reacting to loss

• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

99

TCP: switching from slow start to CA

100

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

initially (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; enter slow start

slow start:
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: Λ; go to congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment

congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start

fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
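The three-state machine can be written down compactly. This is a sketch under stated assumptions, not a real TCP stack: cwnd and ssthresh are kept in units of MSS, retransmission and segment transmission are left out, and the class and method names (`RenoSender`, `on_new_ack`, ...) are hypothetical.

```python
class RenoSender:
    """Congestion-window bookkeeping for the slow start / CA / fast recovery FSM."""

    def __init__(self, mss=1, ssthresh=64):
        self.mss = mss
        self.cwnd = float(mss)            # cwnd = 1 MSS
        self.ssthresh = float(ssthresh)
        self.dup_acks = 0
        self.state = "slow_start"

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh     # deflate window, resume linear growth
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += self.mss         # exponential growth: +1 MSS per ACK
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += self.mss * (self.mss / self.cwnd)  # about +1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += self.mss
            return
        self.dup_acks += 1
        if self.dup_acks == 3:            # triple duplicate ACK: halve, fast recovery
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = float(self.mss)
        self.dup_acks = 0
        self.state = "slow_start"
```

Feeding it four new ACKs with `ssthresh=4` moves it into congestion avoidance; three duplicate ACKs then halve the threshold and enter fast recovery.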

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT

$\text{avg TCP throughput} = \frac{3}{4}\,\frac{W}{RTT}$ bytes/sec

[Figure: sawtooth of window size over time, oscillating between W/2 and W]

TCP Futures: TCP over "long, fat pipes"
• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

$\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \sqrt{L}}$

  - to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
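The two numbers in this example can be checked directly. A small sketch, assuming MSS is converted to bits so that throughput comes out in bits/sec (function names are hypothetical):

```python
import math


def window_segments(throughput_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain the target throughput."""
    return throughput_bps * rtt_s / (mss_bytes * 8)


def mathis_throughput_bps(mss_bytes, rtt_s, loss):
    """Throughput = 1.22 * MSS / (RTT * sqrt(L)), with MSS expressed in bits."""
    return 1.22 * mss_bytes * 8 / (rtt_s * math.sqrt(loss))


def required_loss(throughput_bps, rtt_s, mss_bytes):
    """Invert the formula above to get the loss probability L."""
    return (1.22 * mss_bytes * 8 / (rtt_s * throughput_bps)) ** 2


print(window_segments(10e9, 0.1, 1500))   # about 83,333 segments
print(required_loss(10e9, 0.1, 1500))     # about 2e-10
```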

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput plotted against Connection 1 throughput, both axes from 0 to R, with the "equal bandwidth share" line and the "full bandwidth utilization" line. The trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2", moving toward the equal-share line. Annotations: (X1, Y1) where X1 + Y1 = R; (X2, Y2) where X2 = Y2.]
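The convergence argument can be simulated: additive increase leaves the gap between the two flows unchanged, while each multiplicative decrease halves it. A minimal sketch, assuming both flows see every loss at the same time and have equal RTTs (`aimd_two_flows` is a hypothetical name):

```python
def aimd_two_flows(x, y, capacity, rounds):
    """Evolve the throughputs (x, y) of two AIMD flows sharing one bottleneck."""
    for _ in range(rounds):
        if x + y > capacity:
            x, y = x / 2, y / 2     # loss: both cut window by factor of 2
        else:
            x, y = x + 1, y + 1     # congestion avoidance: additive increase
    return x, y


x, y = aimd_two_flows(1.0, 50.0, capacity=100.0, rounds=500)
print(abs(x - y))                   # gap shrinks toward 0: equal bandwidth share
```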

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets about R/2 (11 of the 20 connections)

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram is shown with ECN=00 near the source and ECN=11 after the congested router; the destination returns a TCP ACK segment with ECE=1.]


rdt3.0 sender

state "Wait for call 0 from above":
• rdt_send(data): sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK0"
• rdt_rcv(rcvpkt): Λ (no action)

state "Wait for ACK0":
• rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isACK(rcvpkt,1) ): Λ
• timeout: udt_send(sndpkt); start_timer
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0): stop_timer; go to "Wait for call 1 from above"

state "Wait for call 1 from above":
• rdt_send(data): sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer; go to "Wait for ACK1"
• rdt_rcv(rcvpkt): Λ (no action)

state "Wait for ACK1":
• rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isACK(rcvpkt,0) ): Λ
• timeout: udt_send(sndpkt); start_timer
• rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1): stop_timer; go to "Wait for call 0 from above"
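The four sender states collapse into one sequence bit plus a "waiting for ACK" flag. A minimal sketch, assuming the channel is replaced by a transmission log and the timer by a boolean; the class `Rdt30Sender` and its attributes are hypothetical names:

```python
class Rdt30Sender:
    """Alternating-bit (stop-and-wait) sender in the style of the rdt3.0 FSM."""

    def __init__(self):
        self.seq = 0                  # sequence bit of the next new packet
        self.waiting_for_ack = False  # True in the "Wait for ACK0/1" states
        self.sndpkt = None
        self.sent = []                # log standing in for udt_send()
        self.timer_running = False

    def _transmit(self):
        self.sent.append(self.sndpkt)
        self.timer_running = True     # start_timer

    def rdt_send(self, data):
        if self.waiting_for_ack:
            return False              # previous packet is still unacknowledged
        self.sndpkt = (self.seq, data)
        self._transmit()
        self.waiting_for_ack = True
        return True

    def timeout(self):
        if self.waiting_for_ack:
            self._transmit()          # resend the same packet, restart timer

    def rdt_rcv(self, ack_seq, corrupt=False):
        # Corrupt ACKs, ACKs for the other bit, and stray ACKs are ignored.
        if not self.waiting_for_ack or corrupt or ack_seq != self.seq:
            return
        self.timer_running = False    # stop_timer
        self.waiting_for_ack = False
        self.seq ^= 1                 # alternate between 0 and 1
```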

rdt3.0 in action

(a) no loss
  1. sender: send pkt0
  2. receiver: rcv pkt0, send ack0
  3. sender: rcv ack0, send pkt1
  4. receiver: rcv pkt1, send ack1
  5. sender: rcv ack1, send pkt0
  6. receiver: rcv pkt0, send ack0

(b) packet loss
  1. sender: send pkt0
  2. receiver: rcv pkt0, send ack0
  3. sender: rcv ack0, send pkt1; pkt1 is lost (X)
  4. sender: timeout, resend pkt1
  5. receiver: rcv pkt1, send ack1
  6. sender: rcv ack1, send pkt0
  7. receiver: rcv pkt0, send ack0

(c) ACK loss
  1. sender: send pkt0
  2. receiver: rcv pkt0, send ack0
  3. sender: rcv ack0, send pkt1
  4. receiver: rcv pkt1, send ack1; ack1 is lost (X)
  5. sender: timeout, resend pkt1
  6. receiver: rcv pkt1 (detect duplicate), send ack1
  7. sender: rcv ack1, send pkt0
  8. receiver: rcv pkt0, send ack0

(d) premature timeout / delayed ACK
  1. sender: send pkt0
  2. receiver: rcv pkt0, send ack0
  3. sender: rcv ack0, send pkt1
  4. receiver: rcv pkt1, send ack1 (delayed)
  5. sender: timeout, resend pkt1
  6. sender: rcv ack1, send pkt0
  7. receiver: rcv pkt1 (detect duplicate), send ack1
  8. sender: rcv ack1, do nothing
  9. receiver: rcv pkt0, send ack0
  10. sender: rcv ack0, send pkt1

Performance of rdt3.0
• rdt3.0 is correct, but performance is far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

$D_{trans} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^9\ \text{bits/sec}} = 8\ \text{microsecs}$

  - U_sender: utilization, the fraction of time sender is busy sending

$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$

  - if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!
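The utilization figure follows from the two delays alone. A short sketch; the `in_flight` parameter is an added assumption that generalizes the formula to a sender allowed several unacknowledged packets (1 is stop-and-wait), and the function names are hypothetical:

```python
def sender_utilization(packet_bits, link_bps, rtt_s, in_flight=1):
    """Fraction of time the sender is busy: n*(L/R) / (RTT + L/R), capped at 1."""
    d_trans = packet_bits / link_bps
    return min(1.0, in_flight * d_trans / (rtt_s + d_trans))


def throughput_bytes_per_sec(packet_bits, link_bps, rtt_s):
    """One packet delivered every RTT + L/R seconds (stop-and-wait)."""
    return (packet_bits / 8) / (rtt_s + packet_bits / link_bps)


u = sender_utilization(8000, 1e9, 0.030)
print(round(u, 5))                                 # utilization, about 0.00027
print(throughput_bytes_per_sec(8000, 1e9, 0.030))  # about 33 kB/sec
```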

rdt3.0: stop-and-wait operation

[Figure: sender/receiver timing diagram. First packet bit transmitted at t = 0; last packet bit transmitted at t = L/R; first packet bit arrives; last packet bit arrives, receiver sends ACK; ACK arrives and sender sends next packet at t = RTT + L/R.]

$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$

Pipelined protocols

pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver
• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

[Figure: sender/receiver timing diagram. First packet bit transmitted at t = 0; last bit transmitted at t = L/R; first packet bit arrives; last packet bit arrives, send ACK; last bit of 2nd packet arrives, send ACK; last bit of 3rd packet arrives, send ACK; ACK arrives, send next packet at t = RTT + L/R.]

3-packet pipelining increases utilization by a factor of 3!

$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} = 0.0008$

Pipelined protocols: overview

Go-Back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet

Go-Back-N: sender
• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n ("cumulative ACK")
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

initially (Λ):
  base = 1
  nextseqnum = 1

rdt_send(data):
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)

timeout:
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  ...
  udt_send(sndpkt[nextseqnum-1])

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer

rdt_rcv(rcvpkt) && corrupt(rcvpkt):
  Λ (no action)
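The sender FSM translates almost line for line into code. A sketch under stated assumptions: the channel is a transmission log, the timer is a boolean, sequence numbers are unbounded integers rather than k-bit values, and the `max()` guard against stale ACKs is an addition not present in the FSM. `GBNSender` and its attributes are hypothetical names.

```python
class GBNSender:
    """Go-Back-N sender with a window of N unacknowledged packets."""

    def __init__(self, window):
        self.N = window
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}              # seq -> packet, kept for retransmission
        self.sent = []                # log of sequence numbers passed to udt_send
        self.timer_running = False

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False              # refuse_data: window is full
        self.sndpkt[self.nextseqnum] = (self.nextseqnum, data)
        self.sent.append(self.nextseqnum)
        if self.base == self.nextseqnum:
            self.timer_running = True # timer covers the oldest in-flight pkt
        self.nextseqnum += 1
        return True

    def timeout(self):
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.sent.append(seq)     # go back N: resend every unacked packet

    def rdt_rcv_ack(self, acknum):
        # Cumulative ACK; max() keeps a stale ACK from moving base backwards.
        self.base = max(self.base, acknum + 1)
        self.timer_running = self.base != self.nextseqnum
```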


GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #

initially (Λ):
  expectedseqnum = 1
  sndpkt = make_pkt(0, ACK, chksum)

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++

default:
  udt_send(sndpkt)
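The receiver needs only one counter. A minimal sketch, assuming the ACK is returned as an integer instead of being sent as a packet (`GBNReceiver` is a hypothetical name):

```python
class GBNReceiver:
    """Go-Back-N receiver: delivers in-order packets, discards everything else."""

    def __init__(self):
        self.expectedseqnum = 1
        self.last_ack = 0             # ACK number carried by the current sndpkt
        self.delivered = []

    def rdt_rcv(self, seq, data, corrupt=False):
        if not corrupt and seq == self.expectedseqnum:
            self.delivered.append(data)       # deliver_data(data)
            self.last_ack = seq
            self.expectedseqnum += 1
        # default case: re-send the ACK for the highest in-order seq #
        return self.last_ack
```

An out-of-order arrival returns the previous ACK number again, which is the duplicate ACK the sender may see.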


GBN in action

sender window (N=4) over sequence numbers 0 1 2 3 4 5 6 7 8

sender, in time order:
  1. send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
  2. rcv ack0, send pkt4
  3. rcv ack1, send pkt5
  4. ignore duplicate ACK
  5. pkt 2 timeout: send pkt2, send pkt3, send pkt4, send pkt5

receiver, in time order:
  1. receive pkt0, send ack0
  2. receive pkt1, send ack1
  3. receive pkt3, discard, (re)send ack1
  4. receive pkt4, discard, (re)send ack1
  5. receive pkt5, discard, (re)send ack1
  6. rcv pkt2, deliver, send ack2
  7. rcv pkt3, deliver, send ack3
  8. rcv pkt4, deliver, send ack4
  9. rcv pkt5, deliver, send ack5


Selective repeat

• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #s of sent, unACKed packets

Selective repeat: sender, receiver windows

Selective repeat

sender
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
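The receiver rules can be sketched as follows. Assumptions: sequence numbers are unbounded integers (no wraparound), and the return value stands in for the ACK that would be sent, with `None` meaning "ignore". `SRReceiver` is a hypothetical name.

```python
class SRReceiver:
    """Selective Repeat receiver with a window of N sequence numbers."""

    def __init__(self, window):
        self.N = window
        self.rcvbase = 0
        self.buffer = {}              # out-of-order packets awaiting delivery
        self.delivered = []

    def rdt_rcv(self, n, data):
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            # Deliver the in-order run starting at rcvbase, advancing the window.
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n                  # send ACK(n)
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n                  # already delivered: re-ACK so sender advances
        return None                   # otherwise: ignore
```

Re-ACKing packets just below the window matters because the sender's window may still include them if earlier ACKs were lost.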

Selective repeat in action

sender window (N=4) over sequence numbers 0 1 2 3 4 5 6 7 8

sender, in time order:
  1. send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, (wait)
  2. rcv ack0, send pkt4
  3. rcv ack1, send pkt5
  4. record ack3 arrived
  5. pkt 2 timeout: send pkt2
  6. record ack4 arrived
  7. record ack5 arrived

receiver, in time order:
  1. receive pkt0, send ack0
  2. receive pkt1, send ack1
  3. receive pkt3, buffer, send ack3
  4. receive pkt4, buffer, send ack4
  5. receive pkt5, buffer, send ack5
  6. rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?

[Figure: sender window and receiver window (after receipt) over sequence numbers 0 1 2 3 0 1 2.
(a) no problem: pkt0, pkt1, pkt2 are received and ACKed; the sender then sends pkt3, which is lost (X), and a new pkt0; the receiver will accept packet with seq number 0.
(b) oops!: pkt0, pkt1, pkt2 are received but all three ACKs are lost (X); the sender times out and retransmits pkt0; the receiver will accept packet with seq number 0.
receiver can't see sender side; receiver behavior identical in both cases! something's (very) wrong!]
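The standard answer is that the window size must be at most half the sequence number space. The sketch below checks this by brute force for the worst case, in which a full window is received but every ACK is lost, so the sender's old window and the receiver's new window are both live. The function name `sr_ambiguous` is hypothetical.

```python
def sr_ambiguous(seq_space, window):
    """True if a retransmitted old packet can be mistaken for a new one."""
    # Sequence numbers the sender may still retransmit (its unacked window).
    old = {i % seq_space for i in range(window)}
    # Sequence numbers the receiver now accepts as new data.
    new = {i % seq_space for i in range(window, 2 * window)}
    return bool(old & new)


print(sr_ambiguous(4, 3))   # the slide's example: seq #'s 0..3, window size 3
print(sr_ambiguous(4, 2))   # window is half the sequence space
```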

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)
• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver

TCP segment structure

fields, in order, laid out in rows 32 bits wide:
  source port # | dest port #
  sequence number
  acknowledgement number
  head len | not used | U A P R S F | receive window
  checksum | Urg data pointer
  options (variable length)
  application data (variable length)

• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
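The fixed 20-byte part of this header can be unpacked with Python's `struct` module. A minimal sketch that ignores options and does not verify the checksum; `parse_tcp_header` and the dictionary keys are hypothetical names.

```python
import struct

# Flag bits in the low-order bits of the 16-bit offset/flags word.
FLAG_BITS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04,
             "PSH": 0x08, "ACK": 0x10, "URG": 0x20}


def parse_tcp_header(segment: bytes) -> dict:
    """Decode the fixed 20-byte TCP header (network byte order)."""
    (src, dst, seq, ack, off_flags,
     rwnd, checksum, urg) = struct.unpack("!HHIIHHHH", segment[:20])
    return {
        "src_port": src,
        "dst_port": dst,
        "seq": seq,
        "ack": ack,
        "header_len": (off_flags >> 12) * 4,   # head len is in 32-bit words
        "flags": {name for name, bit in FLAG_BITS.items() if off_flags & bit},
        "rwnd": rwnd,
        "checksum": checksum,
        "urg_ptr": urg,
    }


# Build a SYN+ACK header by hand and decode it again.
raw = struct.pack("!HHIIHHHH", 12345, 80, 42, 79, (5 << 12) | 0x12, 4096, 0, 0)
print(parse_tcp_header(raw))
```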

TCP seq. numbers, ACKs

sequence numbers:
  - byte stream "number" of first byte in segment's data
acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK
Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say; up to implementor

[Figure: outgoing segment from sender and incoming segment to sender, each showing the header fields source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer. Below, the sender sequence number space with window size N: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable.]

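Byte stream in TCP

[Figure: an HTTP GET message (K bytes) in the sender's byte stream with a window of N bytes. A TCP segment with header seq. no. = 100 carries M bytes starting at the 100th byte; the part of the message outside the window is marked "cannot be transmitted now".]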

TCP seq. numbers, ACKs

simple telnet scenario (Host A, Host B):
  1. User types 'C'. Host A sends: Seq=42, ACK=79, data = 'C'
  2. Host B ACKs receipt of 'C', echoes back 'C': Seq=79, ACK=43, data = 'C'
  3. Host A ACKs receipt of echoed 'C': Seq=43, ACK=80

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT

$\text{EstimatedRTT} = (1-\alpha)\cdot\text{EstimatedRTT} + \alpha\cdot\text{SampleRTT}$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr. SampleRTT and EstimatedRTT in milliseconds (axis from 100 to 350) against time in seconds.]

TCP round trip time, timeout
• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

$\text{DevRTT} = (1-\beta)\cdot\text{DevRTT} + \beta\cdot|\text{SampleRTT} - \text{EstimatedRTT}|$  (typically, β = 0.25)

$\text{TimeoutInterval} = \text{EstimatedRTT} + 4\cdot\text{DevRTT}$

where EstimatedRTT is the estimated RTT and 4·DevRTT is the "safety margin"
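The two moving averages and the timeout rule fit in a few lines. A sketch under stated assumptions: the estimator is seeded with the first sample and DevRTT = 0, and DevRTT is updated with the already-updated EstimatedRTT, following the order in which the formulas are listed here; real stacks may order or seed these differently. `RttEstimator` is a hypothetical name.

```python
class RttEstimator:
    """EWMA-based RTT estimate and timeout interval."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample   # assumption: seed with first sample
        self.dev_rtt = 0.0                  # assumption: no deviation seen yet

    def update(self, sample_rtt):
        # EstimatedRTT = (1 - alpha) * EstimatedRTT + alpha * SampleRTT
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        # DevRTT = (1 - beta) * DevRTT + beta * |SampleRTT - EstimatedRTT|
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))

    def timeout_interval(self):
        # TimeoutInterval = EstimatedRTT + 4 * DevRTT
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)      # milliseconds
est.update(200.0)                           # one unusually slow sample
print(est.estimated_rtt, est.dev_rtt, est.timeout_interval())
```

A single slow sample moves the estimate only slightly but widens the safety margin noticeably, which is the intended behavior.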

TCP reliable data transfer
• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks

let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control

TCP sender events

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments

TCP sender (simplified)

initially (Λ):
  NextSeqNum = InitialSeqNum
  SendBase = InitialSeqNum

wait for event:

data received from application above:
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., "send")
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
    start timer

timeout:
  retransmit not-yet-acked segment with smallest seq. #
  start timer

ACK received, with ACK field value y:
  if (y > SendBase) {
    SendBase = y
    /* SendBase-1: last cumulatively ACKed byte */
    if (there are currently not-yet-acked segments)
      start timer
    else stop timer
  }
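The three events map onto three methods. A sketch, assuming a transmission log in place of IP, a boolean in place of the timer, and no duplicate-ACK handling, flow control, or congestion control (as in the simplified sender); `SimpleTcpSender` and its attributes are hypothetical names.

```python
class SimpleTcpSender:
    """Simplified TCP sender: cumulative ACKs and a single retransmission timer."""

    def __init__(self, initial_seq):
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.unacked = {}             # seq # of first byte -> segment data
        self.sent = []                # log of (seq, length) handed to IP
        self.timer_running = False

    def send(self, data: bytes):
        self.unacked[self.next_seq] = data
        self.sent.append((self.next_seq, len(data)))
        self.next_seq += len(data)    # seq #s count bytes, not segments
        if not self.timer_running:
            self.timer_running = True

    def timeout(self):
        if self.unacked:
            seq = min(self.unacked)   # not-yet-acked segment with smallest seq #
            self.sent.append((seq, len(self.unacked[seq])))
            self.timer_running = True

    def ack(self, y):
        if y > self.send_base:
            self.send_base = y        # bytes up to y-1 are cumulatively ACKed
            done = [s for s, d in self.unacked.items() if s + len(d) <= y]
            for s in done:
                del self.unacked[s]
            self.timer_running = bool(self.unacked)
```

A single ACK with value 120 acknowledges both the 8-byte segment at 92 and the 20-byte segment at 100, so a late ACK with value 100 is simply ignored.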

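TCP: retransmission scenarios

lost ACK scenario (Host A, Host B):
  1. A sends Seq=92, 8 bytes of data
  2. B sends ACK=100, which is lost (X)
  3. A times out and resends Seq=92, 8 bytes of data
  4. B sends ACK=100; SendBase=100

premature timeout (Host A, Host B):
  1. A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data; SendBase=92
  2. B sends ACK=100, then ACK=120
  3. A times out before the ACKs arrive and resends Seq=92, 8 bytes of data
  4. the ACKs arrive: SendBase=100, then SendBase=120
  5. B answers the retransmission with ACK=120; SendBase=120

cumulative ACK (Host A, Host B):
  1. A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
  2. B sends ACK=100, which is lost (X), then ACK=120
  3. ACK=120 arrives before the timeout, so nothing is retransmitted; A next sends Seq=120, 15 bytes of data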

TCP ACK generation [RFC 5681]

event at receiver: arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
TCP receiver action: delayed ACK; wait up to 500 ms for next segment; if no next segment, send ACK

event at receiver: arrival of in-order segment with expected seq #; one other segment has ACK pending
TCP receiver action: immediately send single cumulative ACK, ACKing both in-order segments

event at receiver: arrival of out-of-order segment with higher-than-expected seq #; gap detected
TCP receiver action: immediately send duplicate ACK, indicating seq # of next expected byte

event at receiver: arrival of segment that partially or completely fills gap
TCP receiver action: immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit
• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout

[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost (X). Host B returns ACK=100 four times (3 DUP ACKs); Host A resends Seq=100, 20 bytes of data before the timeout expires.]
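Detecting the trigger only requires counting repeats of the last ACK value. A minimal sketch, assuming the trigger is the third duplicate after the original ACK, as in the figure with four ACK=100 segments; `DupAckDetector` is a hypothetical name.

```python
class DupAckDetector:
    """Signals fast retransmit on the third duplicate of the same ACK value."""

    def __init__(self):
        self.last_ack = None
        self.dup_count = 0

    def on_ack(self, acknum):
        """Return True exactly when a fast retransmit should be triggered."""
        if acknum == self.last_ack:
            self.dup_count += 1
            return self.dup_count == 3    # fire once, on the third duplicate
        self.last_ack = acknum            # new ACK value: reset the counter
        self.dup_count = 0
        return False


detector = DupAckDetector()
print([detector.on_ack(a) for a in (100, 100, 100, 100)])
```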

TCP flow control

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

• application may remove data from TCP socket buffers slower than TCP receiver is delivering (sender is sending)

[Figure: receiver protocol stack. Segments from sender pass up through IP code and TCP code (in the OS) into the TCP socket receiver buffers, from which the application process reads.]

TCP flow control
• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering. TCP segment payloads arrive into RcvBuffer, which holds buffered data plus free buffer space (rwnd); buffered data is passed on to the application process.]
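Both sides of the rule can be written as one-liners. A sketch, assuming the usual textbook bookkeeping in which the receiver tracks the last byte received and the last byte read by the application; the variable and function names are hypothetical and not taken from the slides.

```python
def advertised_rwnd(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free buffer space: RcvBuffer minus data received but not yet read."""
    return rcv_buffer - (last_byte_rcvd - last_byte_read)


def sender_may_send(last_byte_sent, last_byte_acked, rwnd, nbytes):
    """Sender keeps unacked ("in-flight") data within the receiver's rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + nbytes <= rwnd


rwnd = advertised_rwnd(rcv_buffer=4096, last_byte_rcvd=3000, last_byte_read=1000)
print(rwnd)                                        # free space left in RcvBuffer
print(sender_may_send(3000, 2000, rwnd, 1000))     # 2000 bytes would be in flight
print(sender_may_send(3000, 2000, rwnd, 1200))     # 2200 bytes would exceed rwnd
```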

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

[Figure: client and server, each with an application and a network layer, holding matching connection state. connection state: ESTAB; connection variables: seq # client-to-server and server-to-client; rcvBuffer size at server, client]

client: Socket clientSocket = new Socket("hostname", "port number");

server: Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

client state: CLOSED; server state: LISTEN

  1. client: choose init seq num, x; send TCP SYN msg: SYNbit=1, Seq=x. Client enters SYNSENT.
  2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN: SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1. Server enters SYN RCVD.
  3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK: ACKbit=1, ACKnum=y+1; this segment may contain client-to-server data. Client enters ESTAB.
  4. server: received ACK(y) indicates client is live. Server enters ESTAB.
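The sequence and acknowledgement numbers of the three segments depend only on the two initial sequence numbers. A small sketch that builds the segments as dictionaries; the function name and dictionary keys are hypothetical.

```python
def three_way_handshake(x, y):
    """Return the SYN, SYNACK and ACK segments for initial seq numbers x and y."""
    syn = {"SYN": 1, "seq": x}                                   # client -> server
    synack = {"SYN": 1, "ACK": 1, "seq": y,
              "acknum": syn["seq"] + 1}                          # server -> client
    ack = {"ACK": 1, "acknum": synack["seq"] + 1}                # client -> server
    return [syn, synack, ack]


for segment in three_way_handshake(x=1000, y=5000):
    print(segment)
```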

TCP 3-way handshake: FSM

client:
• closed -> SYN sent: Socket clientSocket = new Socket("hostname", "port number"); send SYN(seq=x)
• SYN sent -> ESTAB: receive SYNACK(seq=y, ACKnum=x+1); send ACK(ACKnum=y+1)

server:
• closed -> listen: Socket connectionSocket = welcomeSocket.accept();
• listen -> SYN rcvd: receive SYN(x); send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
• SYN rcvd -> ESTAB: receive ACK(ACKnum=y+1); Λ (no action)

TCP: closing a connection
• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

client state: ESTAB; server state: ESTAB

  1. client: clientSocket.close(); send FINbit=1, seq=x. Client enters FIN_WAIT_1: can no longer send but can receive data.
  2. server: send ACKbit=1, ACKnum=x+1. Server enters CLOSE_WAIT: can still send data. Client enters FIN_WAIT_2: wait for server close.
  3. server: send FINbit=1, seq=y. Server enters LAST_ACK: can no longer send data.
  4. client: send ACKbit=1, ACKnum=y+1. Client enters TIMED_WAIT: timed wait for 2 × max segment lifetime, then CLOSED. Server enters CLOSED.

The "Two Army Problem"

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!

Causes/costs of congestion: scenario 1
• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity

[Figure: Host A and Host B each send original data at rate λ_in through a router with unlimited shared output link buffers; throughput is λ_out. Plots: λ_out against λ_in, levelling off at R/2; delay against λ_in, growing sharply as λ_in approaches R/2.]

Causes/costs of congestion: scenario 2
• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λ_in = λ_out
  - transport-layer input includes retransmissions: λ'_in ≥ λ_in

[Figure: Host A and Host B send through a router with finite shared output link buffers. λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out: throughput.]

Causes/costs of congestion: scenario 2

idealization: perfect knowledge
• sender sends only when router buffers available

[Figure: the sender keeps a copy of each packet and transmits only when the router has free buffer space, holding back when there is no buffer space. Plot: λ_out against λ_in, both axes up to R/2.]

Causes/costs of congestion: scenario 2

idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
• when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)

realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered!

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput

[Figures: Host A keeps a copy of each packet and resends it, after a timeout in the duplicates case, toward Host B through a router with free buffer space. Plots of λ_out against λ'_in, both axes up to R/2, for the known-loss case and the duplicates case.]

Causes/costs of congestion: scenario 3
• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput -> 0

[Figure: Hosts A, B, C and D send over multihop paths through routers with finite shared output link buffers. λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out: throughput.]

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!

[Figure: throughput by blue traffic, λ_out, against offered load by Host A, λ'_in, both axes up to R/2; annotated "bandwidth wastage for packets dropped at the 2nd router".]

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)
• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss

[Figure: AIMD sawtooth behavior, probing for bandwidth. cwnd (TCP sender congestion window size) against time: additively increase window size until loss occurs, then cut window in half.]
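The sawtooth can be reproduced with a toy model. A minimal sketch, assuming cwnd is counted in MSS and a loss is detected whenever cwnd reaches a fixed level `loss_at`; that fixed level is an illustrative stand-in for the available bandwidth and not part of TCP itself.

```python
def aimd_trace(cwnd, loss_at, rounds):
    """Return cwnd (in MSS) at the start of each RTT under AIMD."""
    trace = []
    for _ in range(rounds):
        trace.append(cwnd)
        if cwnd >= loss_at:
            cwnd = max(1, cwnd // 2)   # multiplicative decrease: cut in half
        else:
            cwnd += 1                  # additive increase: +1 MSS per RTT
    return trace


print(aimd_trace(cwnd=4, loss_at=8, rounds=10))
```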

TCP Congestion Control: details
• sender limits transmission
• cwnd is dynamic, function of perceived network congestion


Page 39: ChapterIII: Transport Layer

rdt3.0 in action

[Figure: sender/receiver timelines. (a) no loss: pkt0, ack0, pkt1, ack1, pkt0, ack0 are exchanged in turn. (b) packet loss: pkt1 is lost; the sender times out, resends pkt1, and the exchange continues with ack1, pkt0, ack0.]

rdt3.0 in action

[Figure: sender/receiver timelines. (c) ACK loss: ack1 is lost; the sender times out and resends pkt1; the receiver detects the duplicate and sends ack1 again. (d) premature timeout / delayed ACK: the sender times out before the delayed ack1 arrives and resends pkt1; the receiver detects the duplicate and sends ack1 again; the sender sends pkt0 on the first ack1 and does nothing on the second.]

Performance of rdt3.0

• rdt3.0 is correct, but performance is far from ideal
• e.g., 1 Gbps link, 15 ms prop. delay, 8000-bit packet:

$$D_{trans} = \frac{L}{R} = \frac{8000 \text{ bits}}{10^9 \text{ bits/sec}} = 8 \text{ microsecs}$$

• U_sender: utilization, the fraction of time the sender is busy sending (times below in msec):

$$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$$

• if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!

rdt3.0: stop-and-wait operation

[Figure: sender/receiver timeline. First packet bit transmitted at t = 0; last packet bit transmitted at t = L/R; first packet bit arrives, then last packet bit arrives and the receiver sends an ACK; the ACK arrives and the sender sends the next packet at t = RTT + L/R.]

$$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$$

Pipelined protocols

pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver

• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

[Figure: sender/receiver timeline with three packets in flight. First packet bit transmitted at t = 0; last bit transmitted at t = L/R; at the receiver, the last bits of the 1st, 2nd, and 3rd packets arrive in turn, each triggering an ACK; the first ACK arrives and the sender sends the next packet at t = RTT + L/R.]

3-packet pipelining increases utilization by a factor of 3:

$$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$$
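The utilization figures for stop-and-wait and for 3-packet pipelining can be checked numerically. The sketch below is only an illustration: the function name `sender_utilization` and its parameters are hypothetical, and it assumes the sender is idle whenever it is not pushing bits onto the link.

```python
def sender_utilization(packet_bits, link_bps, rtt_s, packets_in_flight=1):
    """Fraction of time the sender is busy transmitting (U_sender)."""
    d_trans = packet_bits / link_bps           # L/R: time to push one packet onto the link
    busy = packets_in_flight * d_trans         # time spent sending per cycle
    return min(1.0, busy / (rtt_s + d_trans))  # one cycle lasts RTT + L/R

# Slide example: 1 Gbps link, 8000-bit packets, RTT = 30 ms
stop_and_wait = sender_utilization(8000, 1e9, 0.030)
pipelined_3 = sender_utilization(8000, 1e9, 0.030, packets_in_flight=3)
print(f"stop-and-wait: {stop_and_wait:.5f}, 3 in flight: {pipelined_3:.5f}")
```

With these inputs the sender is idle more than 99.9% of the time in both cases, which is the motivation for larger windows.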

Pipelined protocols: overview

Go-Back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet

45

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n ("cumulative ACK")
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

state: Wait

initial transition (Λ):
    base = 1
    nextseqnum = 1

rdt_send(data):
    if (nextseqnum < base+N) {
        sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
        udt_send(sndpkt[nextseqnum])
        if (base == nextseqnum)
            start_timer
        nextseqnum++
    }
    else
        refuse_data(data)

timeout:
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    ...
    udt_send(sndpkt[nextseqnum-1])

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
        stop_timer
    else
        start_timer

rdt_rcv(rcvpkt) && corrupt(rcvpkt):
    Λ (no action)
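The sender FSM above can be sketched as a small Python class. This is an illustration under stated assumptions, not the slides' code: `GBNSender`, `on_ack`, `on_timeout`, and the `send` callback (standing in for `udt_send`) are hypothetical names, the timer is reduced to a boolean flag, packets are plain `(seq, data)` tuples without a checksum, and ACKs below `base` are ignored rather than allowed to move `base` backwards.

```python
class GBNSender:
    """Go-Back-N sender sketch following the extended FSM."""

    def __init__(self, window_size, send):
        self.N = window_size
        self.send = send              # hypothetical stand-in for udt_send
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}              # seq -> packet, buffered until ACKed
        self.timer_running = False

    def rdt_send(self, data):
        """Send data if the window has room; return False to refuse it."""
        if self.nextseqnum >= self.base + self.N:
            return False              # refuse_data(data)
        pkt = (self.nextseqnum, data)
        self.sndpkt[self.nextseqnum] = pkt
        self.send(pkt)
        if self.base == self.nextseqnum:
            self.timer_running = True  # start_timer
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        """Cumulative ACK: packets up to and including acknum are done."""
        if acknum < self.base:
            return                    # duplicate ACK: nothing new
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)
        self.base = acknum + 1
        # stop the timer if nothing is outstanding, else restart it
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        """Resend every sent-but-unACKed packet in the window."""
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.send(self.sndpkt[seq])


sent = []
sender = GBNSender(window_size=4, send=sent.append)
accepted = [sender.rdt_send(f"d{i}") for i in range(5)]  # fifth call is refused
sender.on_ack(2)       # packets 1 and 2 acknowledged
sender.on_timeout()    # goes back and resends packets 3 and 4
```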


GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #

state: Wait

initial transition (Λ):
    expectedseqnum = 1
    sndpkt = make_pkt(0, ACK, chksum)

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++

default:
    udt_send(sndpkt)
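A matching receiver sketch shows why GBN needs only `expectedseqnum` as state. The class name `GBNReceiver` and the `deliver` and `send_ack` callbacks are hypothetical, and corruption is modelled as a boolean flag instead of a real checksum.

```python
class GBNReceiver:
    """Go-Back-N receiver sketch: no buffering, cumulative ACKs only."""

    def __init__(self, deliver, send_ack):
        self.expectedseqnum = 1
        self.deliver = deliver      # hypothetical: hands data to the upper layer
        self.send_ack = send_ack    # hypothetical: stand-in for udt_send(sndpkt)
        self.last_ack = 0           # mirrors the initial make_pkt(0, ACK, chksum)

    def rdt_rcv(self, seq, data, corrupt=False):
        if not corrupt and seq == self.expectedseqnum:
            self.deliver(data)
            self.last_ack = self.expectedseqnum
            self.expectedseqnum += 1
        # in every case, (re)send the ACK for the highest in-order packet;
        # out-of-order and corrupt packets are simply discarded
        self.send_ack(self.last_ack)


acks, delivered = [], []
receiver = GBNReceiver(delivered.append, acks.append)
for seq, data in [(1, "a"), (2, "b"), (4, "d"), (3, "c")]:  # packet 3 arrives late
    receiver.rdt_rcv(seq, data)
```

Packet 4 arrives out of order, so it is dropped and the receiver repeats its ACK for packet 2; the sender would have to retransmit packet 4 later.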


GBN in action

[Figure: sender/receiver timeline with sender window N = 4 over sequence numbers 0 to 8; pkt2 is lost.]

sender:
• send pkt0, pkt1, pkt2, pkt3, then wait (window full)
• rcv ack0, send pkt4
• rcv ack1, send pkt5
• ignore duplicate ACK
• pkt2 timeout: send pkt2, pkt3, pkt4, pkt5

receiver:
• receive pkt0, send ack0
• receive pkt1, send ack1
• receive pkt3, discard, (re)send ack1
• receive pkt4, discard, (re)send ack1
• receive pkt5, discard, (re)send ack1
• rcv pkt2, deliver, send ack2
• rcv pkt3, deliver, send ack3
• rcv pkt4, deliver, send ack4
• rcv pkt5, deliver, send ack5


Selective repeat

• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #'s of sent, unACKed packets

53

Selective repeat sender receiver windows

54

Selective repeat

sender:
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n is smallest unACKed pkt, advance window base to next unACKed seq #

receiver:
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
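The receiver rules above translate almost line for line into code. The sketch below is illustrative only: `SRReceiver`, `on_packet`, and the two callbacks are hypothetical names, and sequence numbers are assumed to grow without wrapping around.

```python
class SRReceiver:
    """Selective Repeat receiver sketch with per-packet ACKs and buffering."""

    def __init__(self, window_size, deliver, send_ack):
        self.N = window_size
        self.rcvbase = 0
        self.buffer = {}            # seq -> data held for in-order delivery
        self.deliver = deliver      # hypothetical upper-layer callback
        self.send_ack = send_ack    # hypothetical ACK-sending callback

    def on_packet(self, n, data):
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.send_ack(n)
            self.buffer[n] = data
            # deliver the in-order run starting at rcvbase, advancing the window
            while self.rcvbase in self.buffer:
                self.deliver(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
        elif self.rcvbase - self.N <= n < self.rcvbase:
            self.send_ack(n)        # already delivered: ACK again
        # otherwise: ignore


acks, delivered = [], []
rx = SRReceiver(4, delivered.append, acks.append)
for n in [0, 1, 3, 4, 5, 2]:        # pkt2 arrives last, after a retransmission
    rx.on_packet(n, f"pkt{n}")
```

Packets 3, 4, and 5 are ACKed immediately but held in the buffer until packet 2 fills the gap, at which point all four are delivered in order.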

Selective repeat in action

[Figure: sender/receiver timeline with sender window N = 4 over sequence numbers 0 to 8; pkt2 is lost.]

sender:
• send pkt0, pkt1, pkt2, pkt3, then wait (window full)
• rcv ack0, send pkt4
• rcv ack1, send pkt5
• record ack3 arrived
• pkt2 timeout: send pkt2
• record ack4 arrived
• record ack5 arrived

receiver:
• receive pkt0, send ack0
• receive pkt1, send ack1
• receive pkt3, buffer, send ack3
• receive pkt4, buffer, send ack4
• receive pkt5, buffer, send ack5
• rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

[Figure: sender window and receiver window (after receipt) over the sequence 0 1 2 3 0 1 2 in two scenarios. (a) no problem: pkt0, pkt1, pkt2 are received and ACKed, the sender's window advances, pkt3 is lost (X), and the sender then sends a new pkt0; the receiver will accept packet with seq number 0. (b) oops!: pkt0, pkt1, pkt2 are received but all three ACKs are lost (X X X); the sender times out and retransmits the old pkt0; the receiver will accept packet with seq number 0.]

• receiver can't see sender side: receiver behavior is identical in both cases, so something's (very) wrong!
• receiver sees no difference in two scenarios
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
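One way to explore the question is to model scenario (b) directly: all N packets arrive, every ACK is lost, and the sender retransmits from its old window while the receiver has already moved on. The helper below is a hypothetical sketch of that worst case; it reports whether any old sequence number collides with the receiver's new window.

```python
def sr_window_is_safe(seq_space, window):
    """True if a retransmitted old packet can never look new to the receiver.

    Worst case from scenario (b): the sender's window still covers the first
    `window` packets, while the receiver's window has advanced by `window`.
    """
    old = {s % seq_space for s in range(0, window)}            # may be retransmitted
    new = {s % seq_space for s in range(window, 2 * window)}   # receiver accepts as new
    return old.isdisjoint(new)


print(sr_window_is_safe(seq_space=4, window=3))  # the slide's example
print(sr_window_is_safe(seq_space=4, window=2))
```

Under this model the two sets stay disjoint exactly when the window is at most half the sequence-number space, which the slide's example (window 3, four sequence numbers) violates.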

58

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver

59

TCP segment structure

[Figure: TCP segment layout, 32 bits wide, with the fields below.]
• source port #, dest port #
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• head len, not used, flag bits U A P R S F
  - URG: urgent data (generally not used)
  - ACK: ACK # valid
  - PSH: push data now
  - RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)
• Urg data pointer
• options (variable length)
• application data (variable length)

TCP seq. numbers, ACKs

sequence numbers:
  - byte stream "number" of first byte in segment's data
acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK
Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say, up to implementor

[Figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer. Sender sequence number space with window size N: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable.]

Byte stream in TCP

62

Window N bytes

HTTP Get Message (K bytes)

100th byte

TCP header(seq no = 100)

M bytes

HTTP Get Message (K bytes)

Cannot be transmitted now

TCP seq. numbers, ACKs

simple telnet scenario (Host A, Host B):
• User types 'C': A sends Seq=42, ACK=79, data = 'C'
• host B ACKs receipt of 'C', echoes back 'C': B sends Seq=79, ACK=43, data = 'C'
• host A ACKs receipt of echoed 'C': A sends Seq=43, ACK=80

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT

64

[Figure: RTT (milliseconds, 100 to 350) vs. time (seconds), gaia.cs.umass.edu to fantasia.eurecom.fr, showing SampleRTT and EstimatedRTT.]

$$EstimatedRTT = (1-\alpha) \cdot EstimatedRTT + \alpha \cdot SampleRTT$$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

TCP round trip time, timeout

[Figure: RTT (milliseconds) vs. time (seconds), gaia.cs.umass.edu to fantasia.eurecom.fr, showing SampleRTT and EstimatedRTT.]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

$$DevRTT = (1-\beta) \cdot DevRTT + \beta \cdot |SampleRTT - EstimatedRTT|$$

(typically, β = 0.25)

$$TimeoutInterval = EstimatedRTT + 4 \cdot DevRTT$$

where EstimatedRTT is the estimated RTT and 4 · DevRTT is the "safety margin".
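The two moving averages and the timeout formula fit in a few lines. This is a sketch under assumptions the slides do not state: the class name `RttEstimator` is hypothetical, the first sample seeds EstimatedRTT directly, DevRTT starts at half the first sample (the convention RFC 6298 uses), and the deviation is measured against the previous estimate before the estimate is updated.

```python
class RttEstimator:
    """EWMA-based RTT estimator producing a retransmission timeout."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2   # assumption: initial deviation

    def update(self, sample_rtt):
        # assumed ordering: deviation uses the estimate from before this sample
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        return self.timeout_interval()

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)   # all values in milliseconds
for sample in (100.0, 200.0):            # a steady sample, then a spike
    est.update(sample)
print(est.estimated_rtt, est.dev_rtt, est.timeout_interval())
```

A single 200 ms spike moves the estimate only from 100 ms to 112.5 ms, but it widens the safety margin enough to push the timeout well above the spike itself.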

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks

let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control

TCP sender events:

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments

68

TCP sender (simplified)

state: wait for event

initial transition (Λ):
    NextSeqNum = InitialSeqNum
    SendBase = InitialSeqNum

data received from application above:
    create segment, seq #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
        start timer

timeout:
    retransmit not-yet-acked segment with smallest seq #
    start timer

ACK received, with ACK field value y:
    if (y > SendBase) {
        SendBase = y
        /* SendBase-1: last cumulatively ACKed byte */
        if (there are currently not-yet-acked segments)
            start timer
        else stop timer
    }
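The three events of the simplified sender can be sketched as follows. Names such as `SimplifiedTcpSender` and the `send` callback (standing in for "pass segment to IP") are hypothetical, the single retransmission timer is modelled as a boolean, and duplicate ACKs, flow control, and congestion control are ignored, as on the slide.

```python
class SimplifiedTcpSender:
    """Sketch of the simplified TCP sender: byte-stream seq #s, one timer."""

    def __init__(self, initial_seq, send):
        self.next_seq_num = initial_seq
        self.send_base = initial_seq
        self.send = send            # hypothetical: hands (seq, data) to IP
        self.unacked = {}           # seq of first byte -> segment data
        self.timer_running = False

    def on_app_data(self, data):
        self.unacked[self.next_seq_num] = data
        self.send((self.next_seq_num, data))
        self.next_seq_num += len(data)
        if not self.timer_running:
            self.timer_running = True    # start timer

    def on_timeout(self):
        if self.unacked:
            seq = min(self.unacked)      # not-yet-acked segment with smallest seq #
            self.send((seq, self.unacked[seq]))
            self.timer_running = True    # restart timer

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y           # bytes up to y-1 are cumulatively ACKed
            for seq in [s for s, d in self.unacked.items() if s + len(d) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)


sent = []
tcp = SimplifiedTcpSender(92, sent.append)
tcp.on_app_data(b"x" * 8)     # Seq=92, 8 bytes
tcp.on_app_data(b"y" * 20)    # Seq=100, 20 bytes
tcp.on_timeout()              # resends only the oldest segment, Seq=92
tcp.on_ack(120)               # cumulative ACK covers both segments
```

Unlike Go-Back-N, the timeout resends a single segment, and one cumulative ACK of 120 clears both outstanding segments even if the ACK of 100 never arrived.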

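TCP: retransmission scenarios

[Figure: lost ACK scenario. Host A sends Seq=92, 8 bytes of data; Host B's ACK=100 is lost (X); Host A times out and resends Seq=92, 8 bytes of data; Host B replies ACK=100 again. SendBase moves from 92 to 100.]

[Figure: premature timeout. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data; Host B replies ACK=100 and ACK=120, but Host A's timer expires first and it resends Seq=92, 8 bytes of data; Host B replies ACK=120. SendBase moves from 92 to 100, then 120, and stays at 120.]

TCP: retransmission scenarios

[Figure: cumulative ACK. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data; ACK=100 is lost (X) but ACK=120 arrives before the timeout, so nothing is resent; Host A continues with Seq=120, 15 bytes of data.]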

TCP ACK generation [RFC 5681]

event at receiver → TCP receiver action

• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected
  → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap
  → immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout

[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost (X). Host B replies ACK=100 four times: once for Seq=92 and three duplicates. On the 3 dup ACKs, Host A resends Seq=100, 20 bytes of data, before the timeout expires.]

TCP flow control

[Figure: receiver protocol stack. Segments arrive from sender through IP code and TCP code into TCP socket receiver buffers (OS), from which the application process reads.]

• application may remove data from TCP socket buffers slower than TCP receiver is delivering (sender is sending)

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

TCP flow control

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data and free buffer space (rwnd); data leaves to the application process.]

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

[Figure: client and server, each with application and network layers, each holding: connection state: ESTAB; connection variables: seq # client-to-server, server-to-client; rcvBuffer size at server, client.]

client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

client state: CLOSED; server state: LISTEN

• client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client enters SYNSENT
• server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server enters SYN RCVD
• client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client enters ESTAB
• server: received ACK(y) indicates client is live; server enters ESTAB

TCP 3-way handshake FSM

81

closed

L

listen

SYNrcvd

SYNsent

ESTAB

Socket clientSocket = newSocket(hostnameport number)

SYN(seq=x)

Socket connectionSocket = welcomeSocketaccept()

SYN(x)SYNACK(seq=yACKnum=x+1)create new socket for communication back to client

SYNACK(seq=yACKnum=x+1)ACK(ACKnum=y+1)ACK(ACKnum=y+1)

L

TCP: closing a connection

• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

82

FIN_WAIT_2

CLOSE_WAIT

FINbit=1 seq=y

ACKbit=1 ACKnum=y+1

ACKbit=1 ACKnum=x+1wait for server

close

can stillsend data

can no longersend data

LAST_ACK

CLOSED

TIMED_WAIT

timed wait for 2max

segment lifetime

CLOSED

TCP closing a connection

83

FIN_WAIT_1 FINbit=1 seq=xcan no longersend but canreceive data

clientSocketclose()

client state server stateESTABESTAB

The ldquoTwo Army Problemrdquo

84

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!

85

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2

[Figure: Host A and Host B send original data at rate λ_in into a router with unlimited shared output link buffers; throughput is λ_out. Plots: λ_out vs. λ_in, rising to R/2; delay vs. λ_in, growing as λ_in approaches R/2.]

• large delays as arrival rate, λ_in, approaches capacity

Causescosts of congestion scenario 2

bull one router finite buffers bull sender retransmission of timed-out packet

ndash application-layer input = application-layer output lin = lout

ndash transport-layer input includes retransmissions lrsquoin lin

87

finite shared output link buffers

Host A

lin original data

Host B

loutlin original data plusretransmitted data

Causescosts of congestion scenario 2

idealization perfect knowledgebull sender sends only when router

buffers available

88

finite shared output link buffers

lin original dataloutlin original data plus

retransmitted datacopy

free buffer space

R2

R2

l out

lin

Host B

A

lin original dataloutlin original data plus

retransmitted datacopy

no buffer space

Causescosts of congestion scenario 2

Idealization known losspackets can be lost dropped at router due to full buffers

bull sender only resends if packet known to be lost

89

A

Host B

lin original dataloutlin original data plus

retransmitted data

free buffer space

Causescosts of congestion scenario 2

90

R2

R2lin

l out

when sending at R2 some packets are retransmissions but asymptotic goodput is still R2 (why)

A

Host B

Idealization known losspackets can be lost dropped at router due to full buffers

bull sender only resends if packet known to be lost

A

lin loutlincopy

free buffer space

timeout

R2

R2lin

l out

when sending at R2 some packets are retransmissions including duplicated that are delivered

Host B

Realistic duplicatesv packets can be lost dropped

at router due to full buffersv sender times out prematurely

sending two copies both of which are delivered

Causescosts of congestion scenario 2

91

R2

l out

when sending at R2 some packets are retransmissions including duplicated that are delivered

ldquocostsrdquo of congestionv more work (retrans) for given ldquogoodputrdquov unneeded retransmissions link carries multiple copies of pkt

sect decreasing goodput

R2lin

Causescosts of congestion scenario 2

92

Realistic duplicatesv packets can be lost dropped

at router due to full buffersv sender times out prematurely

sending two copies both of which are delivered

Causescosts of congestion scenario 3

bull four sendersbull multihop pathsbull timeoutretransmit

93

Q what happens as lin and linrsquo

increase

finite shared output link buffers

Host A lout Host B

Host CHost D

lin original datalin original data plus

retransmitted data

A as red linrsquo increases all arriving

blue pkts at upper queue are dropped blue throughput g 0

another ldquocostrdquo of congestionv when packet dropped any ldquoupstream

transmission capacity used for that packet was wasted

Causescosts of congestion scenario 3

94

R2

R2

l out

linrsquo

Bandwidth wastage for packets dropped at the 2nd router

Offered load by Host A

Thro

ughp

ut b

y bl

ue tr

affic

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss

[Figure: cwnd (TCP sender congestion window size) vs. time. AIMD saw tooth behavior, probing for bandwidth: additively increase window size until loss occurs (then cut window in half).]
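The saw-tooth shape follows directly from the two AIMD rules. The sketch below is illustrative only: the function `aimd`, its parameters, and the chosen loss rounds are hypothetical, cwnd is measured in units of MSS, and each loop iteration stands for one RTT.

```python
def aimd(rounds, loss_rounds, start=1.0, mss=1.0):
    """Return the cwnd value at the start of each RTT under AIMD."""
    cwnd = start
    trace = []
    for rtt in range(rounds):
        trace.append(cwnd)
        if rtt in loss_rounds:
            cwnd = max(mss, cwnd / 2)   # multiplicative decrease: halve on loss
        else:
            cwnd += mss                 # additive increase: +1 MSS per RTT
    return trace


trace = aimd(rounds=8, loss_rounds={3, 6}, start=4.0)
print(trace)
```

Plotting `trace` against the round index gives the linear climbs and sudden halvings described above.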

TCP Congestion Control: details

• sender limits transmission:

$$LastByteSent - LastByteAcked \le cwnd$$

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

$$rate \approx \frac{cwnd}{RTT} \text{ bytes/sec}$$

[Figure: sender sequence number space: last byte ACKed; sent, not-yet ACKed ("in-flight"); last byte sent; cwnd spans the in-flight bytes.]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A / Host B timeline. Host A sends one segment in the first RTT, two segments in the second, four segments in the third.]

TCP: detecting, reacting to loss

• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate acks)

99

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

initial transition (Λ), entering slow start:
    cwnd = 1 MSS
    ssthresh = 64 KB
    dupACKcount = 0

slow start:
    new ACK:
        cwnd = cwnd + MSS
        dupACKcount = 0
        transmit new segment(s), as allowed
    duplicate ACK:
        dupACKcount++
    cwnd > ssthresh:
        Λ (go to congestion avoidance)
    dupACKcount == 3 (go to fast recovery):
        ssthresh = cwnd/2
        cwnd = ssthresh + 3
        retransmit missing segment
    timeout (stay in slow start):
        ssthresh = cwnd/2
        cwnd = 1 MSS
        dupACKcount = 0
        retransmit missing segment

congestion avoidance:
    new ACK:
        cwnd = cwnd + MSS * (MSS/cwnd)
        dupACKcount = 0
        transmit new segment(s), as allowed
    duplicate ACK:
        dupACKcount++
    dupACKcount == 3 (go to fast recovery):
        ssthresh = cwnd/2
        cwnd = ssthresh + 3
        retransmit missing segment
    timeout (go to slow start):
        ssthresh = cwnd/2
        cwnd = 1 MSS
        dupACKcount = 0
        retransmit missing segment

fast recovery:
    duplicate ACK:
        cwnd = cwnd + MSS
        transmit new segment(s), as allowed
    new ACK (go to congestion avoidance):
        cwnd = ssthresh
        dupACKcount = 0
    timeout (go to slow start):
        ssthresh = cwnd/2
        cwnd = 1 MSS
        dupACKcount = 0
        retransmit missing segment
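The three-state machine summarized above can be sketched in Python to trace how cwnd and ssthresh evolve. The class `RenoCongestionControl` and its method names are hypothetical, cwnd is tracked in bytes, the "+ 3" in the fast-recovery entry is interpreted as 3 MSS, and actual transmission and retransmission of segments are left out.

```python
class RenoCongestionControl:
    """Sketch of slow start, congestion avoidance, and fast recovery."""

    def __init__(self, mss=1460):
        self.mss = mss
        self.cwnd = mss
        self.ssthresh = 64 * 1024
        self.dup_acks = 0
        self.state = "slow_start"

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh              # deflate the window
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += self.mss                  # doubles cwnd every RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += self.mss * self.mss / self.cwnd  # about +1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += self.mss
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                     # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss
            self.state = "fast_recovery"           # missing segment is resent here

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = self.mss
        self.dup_acks = 0
        self.state = "slow_start"                  # missing segment is resent here


cc = RenoCongestionControl(mss=1000)
for _ in range(3):
    cc.on_new_ack()        # slow start: cwnd grows 1000 -> 4000
for _ in range(3):
    cc.on_dup_ack()        # third duplicate triggers fast recovery
after_fast_retransmit = (cc.state, cc.cwnd, cc.ssthresh)
cc.on_new_ack()            # leave fast recovery with cwnd = ssthresh
```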

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT

$$\text{avg TCP throughput} = \frac{3}{4} \cdot \frac{W}{RTT} \text{ bytes/sec}$$

[Figure: saw tooth window size oscillating between W/2 and W.]

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

$$\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \sqrt{L}}$$

• to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate!
• new versions of TCP for high-speed
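Both numbers on this slide can be reproduced from the formula. The function names below are hypothetical, and the only assumption beyond the slide is that MSS equals the 1500-byte segment size, converted to bits.

```python
import math


def window_segments(rate_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain rate_bps over one RTT."""
    return rate_bps * rtt_s / (mss_bytes * 8)


def mathis_throughput_bps(mss_bytes, rtt_s, loss):
    """TCP throughput = 1.22 * MSS / (RTT * sqrt(L)), in bits/sec."""
    return 1.22 * mss_bytes * 8 / (rtt_s * math.sqrt(loss))


def required_loss_rate(rate_bps, rtt_s, mss_bytes):
    """Invert the throughput formula to get the tolerable loss probability L."""
    return (1.22 * mss_bytes * 8 / (rtt_s * rate_bps)) ** 2


w = window_segments(10e9, 0.100, 1500)
loss = required_loss_rate(10e9, 0.100, 1500)
print(f"W = {w:.0f} segments, L = {loss:.2e}")
```

The result is roughly one lost segment in five billion, which is why standard TCP struggles to fill such a pipe.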

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput vs. Connection 1 throughput, both axes from 0 to R, with the equal bandwidth share line (X2, Y2) where X2 = Y2 and the full bandwidth utilization line (X1, Y1) where X1 + Y1 = R. The trajectory alternates: congestion avoidance, additive increase; loss, decrease window by factor of 2.]

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2
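The parallel-connection example is simple arithmetic once each connection is assumed to receive an equal share of the link, which is the fairness goal rather than a guarantee. The helper `app_share` is a hypothetical name; it returns the new application's share as an exact fraction of R.

```python
from fractions import Fraction


def app_share(existing_connections, new_connections):
    """Fraction of link rate R obtained by an app opening new_connections,
    assuming every TCP connection gets an equal share of the bottleneck."""
    total = existing_connections + new_connections
    return Fraction(new_connections, total)


one = app_share(9, 1)
eleven = app_share(9, 11)
print(one, eleven, float(eleven))
```

With 11 connections the application holds 11 of the 20 connections, so its share is slightly above the R/2 quoted on the slide.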

106

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and is marked ECN=11 by a router; the destination returns a TCP ACK segment with ECE=1.]

Page 40: ChapterIII: Transport Layer

rdt3.0 in action

[Figure: sender/receiver timelines. (c) ACK loss: ack1 is lost; the sender times out and resends pkt1; the receiver detects the duplicate and sends ack1 again. (d) premature timeout / delayed ACK: the sender times out before the delayed ack1 arrives and resends pkt1; the receiver detects the duplicate and sends ack1 again; the sender sends pkt0 on the first ack1 and does nothing on the second.]

Performance of rdt3.0

• rdt3.0 is correct, but performance is far from ideal
• e.g., 1 Gbps link, 15 ms prop. delay, 8000-bit packet:

$$D_{trans} = \frac{L}{R} = \frac{8000 \text{ bits}}{10^9 \text{ bits/sec}} = 8 \text{ microsecs}$$

• U_sender: utilization, the fraction of time the sender is busy sending (times below in msec):

$$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$$

• if RTT = 30 msec, 1 KB pkt every 30 msec: 33 kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!

rdt30 stop-and-wait operation

42

first packet bit transmitted t = 0sender receiver

RTT

last packet bit transmitted t = L R

first packet bit arriveslast packet bit arrives send ACK

ACK arrives send next packet t = RTT + L R

U sender =

008 30008

= 000027 L R RTT + L R

=

Pipelined protocols

pipelining sender allows multiple ldquoin-flightrdquo yet-to-be-acknowledged pktsndash range of sequence numbers must be increasedndash buffering at sender andor receiver

43

bull two generic forms of pipelined protocols Go-Back-N Selective Repeat

Pipelining: increased utilization

[Figure: sender/receiver timeline with three packets in flight. First packet bit transmitted at t = 0; last bit transmitted at t = L/R; at the receiver, the last bits of the 1st, 2nd, and 3rd packets arrive in turn, each triggering an ACK; the first ACK arrives and the sender sends the next packet at t = RTT + L/R.]

3-packet pipelining increases utilization by a factor of 3:

$$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$$

Pipelined protocols overview

Go-back-Nbull sender can have up to

N unacked packets in pipeline

bull receiver only sends cumulative ackndash Doesnrsquot ack packet if

therersquos a gapbull sender has timer for

oldest unacked packetndash when timer expires

retransmit all unackedpackets

Selective Repeatbull sender can have up to

N unacked packets in pipeline

bull rcvr sends individual ackfor each packet

bull sender maintains timer for each unacked packetndash when timer expires

retransmit only that unacked packet

45

Go-Back-N sender

bull k-bit seq in pkt headerbull ldquowindowrdquo of up to N consecutive unacked pkts allowed

46

v ACK(n) ACKs all pkts up to including seq n - ldquocumulative ACKrdquosect may receive duplicate ACKs (see receiver)

v timer for oldest in-flight pktv timeout(n) retransmit packet n and all higher seq pkts in

window

GBN: sender extended FSM

initial transition (Λ):
  base = 1
  nextseqnum = 1

state: Wait

rdt_send(data):
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)

timeout:
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  ...
  udt_send(sndpkt[nextseqnum-1])

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer

rdt_rcv(rcvpkt) && corrupt(rcvpkt):
  Λ (no action)
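The sender FSM can be sketched as a small class. This is an illustration under stated assumptions, not the course's reference code: the checksum is omitted, the timer is reduced to a boolean flag, `send` is a hypothetical callable standing in for udt_send, and a guard against stale ACKs is added that the FSM itself does not show.

```python
class GBNSender:
    def __init__(self, window_size, send):
        self.N = window_size
        self.send = send            # stand-in for udt_send (assumption)
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}            # seq -> packet kept for retransmission
        self.timer_running = False

    def rdt_send(self, data):
        """Send data if the window has room; return False to refuse it."""
        if self.nextseqnum >= self.base + self.N:
            return False            # refuse_data(data)
        pkt = (self.nextseqnum, data)
        self.sndpkt[self.nextseqnum] = pkt
        self.send(pkt)
        if self.base == self.nextseqnum:
            self.timer_running = True   # start_timer
        self.nextseqnum += 1
        return True

    def timeout(self):
        """Restart the timer and resend every unacked packet in the window."""
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.send(self.sndpkt[seq])

    def ack_received(self, acknum):
        """Cumulative ACK: everything up to and including acknum is acked."""
        # max() ignores stale ACKs; this guard is an addition to the FSM.
        self.base = max(self.base, acknum + 1)
        # stop_timer when nothing is outstanding, otherwise (re)start it
        self.timer_running = self.base != self.nextseqnum
```

With N = 4, a fifth call to rdt_send before any ACK arrives is refused, and a timeout resends exactly the packets from base to nextseqnum-1.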


GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #

initial transition (Λ):
  expectedseqnum = 1
  sndpkt = make_pkt(0, ACK, chksum)

state: Wait

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++

default:
  udt_send(sndpkt)
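A matching receiver sketch, again only an illustration: corruption is modelled by a boolean argument instead of a checksum, and the method returns the ACK number rather than transmitting a packet.

```python
class GBNReceiver:
    def __init__(self):
        self.expectedseqnum = 1
        self.last_ack = 0        # mirrors sndpkt = make_pkt(0, ACK, chksum)
        self.delivered = []      # data handed to the upper layer

    def rdt_rcv(self, seq, data, corrupt=False):
        """Process one packet and return the ACK number to send back."""
        if not corrupt and seq == self.expectedseqnum:
            self.delivered.append(data)          # deliver_data(data)
            self.last_ack = self.expectedseqnum
            self.expectedseqnum += 1
        # default case: out-of-order or corrupt, re-ACK highest in-order seq #
        return self.last_ack
```

Only expectedseqnum and the last ACK are stored, which is the "no receiver buffering" property.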


GBN in action

sender window (N=4); pkt2 is lost (X)

sender                          receiver
send pkt0
send pkt1
send pkt2 (lost)
send pkt3
(wait)
                                receive pkt0, send ack0
                                receive pkt1, send ack1
                                receive pkt3, discard, (re)send ack1
rcv ack0, send pkt4
rcv ack1, send pkt5
ignore duplicate ACK
                                receive pkt4, discard, (re)send ack1
                                receive pkt5, discard, (re)send ack1
pkt 2 timeout
send pkt2
send pkt3
send pkt4
send pkt5
                                rcv pkt2, deliver, send ack2
                                rcv pkt3, deliver, send ack3
                                rcv pkt4, deliver, send ack4
                                rcv pkt5, deliver, send ack5


Selective repeat

• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #'s
  - limits seq #s of sent, unACKed packets

[Figure: Selective repeat: sender, receiver windows]

Selective repeat

sender:
data from above:
• if next available seq # in window, send pkt
timeout(n):
• resend pkt n, restart timer
ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver:
pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)
otherwise:
• ignore
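The receiver rules can be sketched as follows. The sketch assumes unbounded sequence numbers (no wrap-around), and the class and method names are illustrative.

```python
class SRReceiver:
    def __init__(self, window_size):
        self.N = window_size
        self.rcvbase = 0
        self.buffer = {}        # out-of-order packets awaiting delivery
        self.delivered = []     # data handed to the upper layer, in order

    def receive(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer.setdefault(n, data)
            # deliver every in-order packet now available, advancing the window
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n            # already delivered: re-ACK so the sender can advance
        return None             # otherwise: ignore
```

Feeding it packets 0, 1, 3, 4, 5 and then 2 with N = 4 buffers 3 to 5 and delivers 2 to 5 together once pkt2 arrives.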

Selective repeat in action

sender window (N=4); pkt2 is lost (X)

sender                          receiver
send pkt0
send pkt1
send pkt2 (lost)
send pkt3
(wait)
                                receive pkt0, send ack0
                                receive pkt1, send ack1
                                receive pkt3, buffer, send ack3
rcv ack0, send pkt4
rcv ack1, send pkt5
record ack3 arrived
                                receive pkt4, buffer, send ack4
                                receive pkt5, buffer, send ack5
pkt 2 timeout
send pkt2
record ack4 arrived
record ack5 arrived
                                rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

(a) no problem
• sender sends pkt0, pkt1, pkt2; all arrive and are ACKed; receiver window (after receipt) advances
• ACKs arrive; sender window (after receipt) advances; sender sends pkt3, which is lost (X), then a new pkt0
• receiver will accept packet with seq number 0

(b) oops!
• sender sends pkt0, pkt1, pkt2; all arrive, but all three ACKs are lost (X X X)
• timeout: sender retransmits pkt0
• receiver will accept packet with seq number 0

receiver can't see sender side: receiver behavior identical in both cases! something's (very) wrong!
• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
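One way to explore the question is to enumerate the worst case directly. The sketch below assumes the receiver has accepted a full window while every ACK was lost, so the sender retransmits the old window; the function name is hypothetical.

```python
def sr_ambiguous(seq_space, window):
    """True if a retransmitted old packet can land inside the receiver's new window."""
    # Old window: seq #'s the sender may retransmit if all its ACKs were lost.
    old = {n % seq_space for n in range(window)}
    # New window: seq #'s the receiver expects after accepting the old window.
    new = {n % seq_space for n in range(window, 2 * window)}
    return bool(old & new)


print(sr_ambiguous(4, 3))   # the example above: seq #'s 0..3, window size 3
print(sr_ambiguous(4, 2))   # same seq # space, window size 2
```

Under this worst case the two windows stay disjoint exactly when the window size is at most half the size of the sequence number space.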

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver

TCP segment structure

[Figure: TCP segment layout, 32 bits wide]
• source port #, dest port #
• sequence number
• acknowledgement number
• head len, not used, flag bits U A P R S F, receive window
• checksum, Urg data pointer
• options (variable length)
• application data (variable length)

annotations:
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)

TCP seq. numbers, ACKs

sequence numbers:
  - byte stream "number" of first byte in segment's data

acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK

Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say; up to implementor

[Figure: sender sequence number space with window size N: sent, ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable. The outgoing segment from sender and the incoming segment to sender (carrying acknowledgement number A) each show source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer.]

Byte stream in TCP

[Figure: byte stream in TCP. Labels: window (N bytes); HTTP GET message (K bytes); 100th byte; TCP header (seq. no. = 100); M bytes; cannot be transmitted now]

TCP seq. numbers, ACKs

simple telnet scenario (Host A, Host B):
1. User types 'C': A sends Seq=42, ACK=79, data = 'C'
2. host ACKs receipt of 'C', echoes back 'C': B sends Seq=79, ACK=43, data = 'C'
3. host ACKs receipt of echoed 'C': A sends Seq=43, ACK=80

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT

TCP round trip time, timeout

EstimatedRTT = (1 - α)*EstimatedRTT + α*SampleRTT

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Figure: RTT: gaia.cs.umass.edu to fantasia.eurecom.fr. SampleRTT and EstimatedRTT, RTT (milliseconds, 100 to 350) against time (seconds)]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

DevRTT = (1 - β)*DevRTT + β*|SampleRTT - EstimatedRTT|
(typically, β = 0.25)

TimeoutInterval = EstimatedRTT + 4*DevRTT
(estimated RTT plus "safety margin")
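The estimator can be sketched in a few lines. Two details are assumptions of this sketch rather than statements from the slides: the initial DevRTT value, and updating DevRTT before EstimatedRTT (so the deviation is measured against the old estimate).

```python
class RTTEstimator:
    def __init__(self, first_sample, alpha=0.125, beta=0.25, initial_dev=0.0):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = initial_dev      # starting value is an assumption

    def update(self, sample_rtt):
        """Fold one SampleRTT into DevRTT and EstimatedRTT."""
        # Deviation uses the estimate from before this sample (assumed order).
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    def timeout_interval(self):
        """EstimatedRTT plus a safety margin of four deviations."""
        return self.estimated_rtt + 4 * self.dev_rtt


est = RTTEstimator(first_sample=100.0)   # milliseconds
est.update(200.0)
print(est.estimated_rtt, est.dev_rtt, est.timeout_interval())
```

A single outlier of 200 ms moves EstimatedRTT only one eighth of the way towards it, while the safety margin widens noticeably.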

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks

let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control

TCP sender events:

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments

TCP sender (simplified)

initial transition (Λ):
  NextSeqNum = InitialSeqNum
  SendBase = InitialSeqNum

state: wait for event

data received from application above:
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., "send")
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
    start timer

timeout:
  retransmit not-yet-acked segment with smallest seq. #
  start timer

ACK received, with ACK field value y:
  if (y > SendBase) {
    SendBase = y
    /* SendBase-1: last cumulatively ACKed byte */
    if (there are currently not-yet-acked segments)
      start timer
    else stop timer
  }
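The three events can be sketched as methods on a small class. This is an illustration only: `send` is a hypothetical callable standing in for passing a segment to IP, the timer is reduced to a boolean flag, and segments are plain (seq, data) tuples.

```python
class SimpleTCPSender:
    def __init__(self, initial_seq, send):
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.send = send              # stand-in for "pass segment to IP"
        self.unacked = {}             # seq # -> data of not-yet-acked segments
        self.timer_running = False

    def data_from_app(self, data):
        self.unacked[self.next_seq] = data
        self.send((self.next_seq, data))
        self.next_seq += len(data)
        if not self.timer_running:
            self.timer_running = True

    def timeout(self):
        if self.unacked:
            seq = min(self.unacked)   # not-yet-acked segment with smallest seq #
            self.send((seq, self.unacked[seq]))
            self.timer_running = True

    def ack_received(self, y):
        if y > self.send_base:
            self.send_base = y        # send_base - 1: last cumulatively ACKed byte
            # drop every segment whose bytes all lie below y
            for seq in [s for s in self.unacked if s + len(self.unacked[s]) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)
```

Sending 8 bytes at Seq=92 and 20 bytes at Seq=100 and then receiving ACK=120 acknowledges both segments at once, which is the cumulative ACK case.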

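TCP: retransmission scenarios

lost ACK scenario (Host A, Host B):
1. A sends Seq=92, 8 bytes of data
2. B replies ACK=100; the ACK is lost (X)
3. timeout: A resends Seq=92, 8 bytes of data
4. B replies ACK=100

premature timeout (Host A, Host B):
1. A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
2. B replies ACK=100, then ACK=120
3. timeout fires early: A resends Seq=92, 8 bytes of data
4. B replies ACK=120

[Figure labels on the sender side: SendBase=92, SendBase=100, SendBase=120, SendBase=120]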

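TCP: retransmission scenarios

cumulative ACK (Host A, Host B):
1. A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
2. B replies ACK=100, which is lost (X), then ACK=120
3. ACK=120 arrives before the timeout, so nothing is retransmitted; A sends Seq=120, 15 bytes of data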

TCP ACK generation [RFC 5681]

event at receiver -> TCP receiver action

• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  -> delayed ACK: wait up to 500ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  -> immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment, higher-than-expect seq #; gap detected
  -> immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap
  -> immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout

TCP fast retransmit

fast retransmit after sender receipt of triple duplicate ACK (Host A, Host B):
1. A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data; the second segment is lost (X)
2. B replies ACK=100, followed by three more ACK=100 (3 DUP ACKs)
3. A resends Seq=100, 20 bytes of data, before the timeout expires
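The duplicate-ACK trigger can be sketched as a counter. It assumes that "triple duplicate" means three ACKs that repeat an earlier one, which matches the four ACK=100 segments in the scenario; the class name is illustrative.

```python
class DupAckDetector:
    def __init__(self):
        self.last_ack = None
        self.dup_count = 0

    def on_ack(self, ack):
        """Return True when this ACK should trigger a fast retransmit."""
        if ack == self.last_ack:
            self.dup_count += 1
            return self.dup_count == 3   # fire once, on the third duplicate
        self.last_ack = ack              # a new ACK resets the count
        self.dup_count = 0
        return False


detector = DupAckDetector()
print([detector.on_ack(100) for _ in range(4)])
```

The retransmission itself would resend the unacked segment with the smallest seq #, without waiting for the timer.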

TCP flow control

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

application may remove data from TCP socket buffers ... slower than TCP receiver is delivering (sender is sending)

[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code into the TCP socket receiver buffers (OS), from which the application process reads]

TCP flow control

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); buffered data goes to application process]
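The bookkeeping behind rwnd can be sketched as below, taking rwnd as RcvBuffer minus the buffered data, as in the buffering picture. The class, the `sendable` helper and their parameter names are illustrative assumptions.

```python
class ReceiveBuffer:
    def __init__(self, rcv_buffer=4096):
        self.rcv_buffer = rcv_buffer   # total buffer size in bytes
        self.buffered = 0              # bytes not yet read by the application

    @property
    def rwnd(self):
        """Free buffer space advertised to the sender."""
        return self.rcv_buffer - self.buffered

    def segment_arrives(self, nbytes):
        if nbytes > self.rwnd:
            raise ValueError("sender exceeded the advertised window")
        self.buffered += nbytes

    def app_reads(self, nbytes):
        taken = min(nbytes, self.buffered)
        self.buffered -= taken
        return taken


def sendable(rwnd, last_byte_sent, last_byte_acked):
    """Bytes the sender may still put in flight without exceeding rwnd."""
    return max(0, rwnd - (last_byte_sent - last_byte_acked))
```

If the application reads slowly, rwnd shrinks and the sender's allowance shrinks with it; once the application catches up, the advertised window grows again.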

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

[Figure: client and server, each with application and network layers, each holding:
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client]

client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

client state: CLOSED -> SYNSENT -> ESTAB
server state: LISTEN -> SYN RCVD -> ESTAB

1. client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client enters SYNSENT
2. server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server enters SYN RCVD
3. client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client enters ESTAB
4. server: received ACK(y) indicates client is live; server enters ESTAB
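The numbering in the three segments can be traced with a toy function. It models no real socket API; the dictionary keys simply reuse the field names above, and the states are those reached once each segment has been delivered.

```python
def three_way_handshake(x, y):
    """Return [(segment, (client_state, server_state)), ...] for the handshake."""
    syn = {"SYNbit": 1, "Seq": x}
    synack = {"SYNbit": 1, "Seq": y, "ACKbit": 1, "ACKnum": syn["Seq"] + 1}
    ack = {"ACKbit": 1, "ACKnum": synack["Seq"] + 1}
    return [
        (syn, ("SYNSENT", "SYN RCVD")),    # server has received the SYN
        (synack, ("ESTAB", "SYN RCVD")),   # client has received the SYNACK
        (ack, ("ESTAB", "ESTAB")),         # server has received the final ACK
    ]


for segment, states in three_way_handshake(x=1000, y=5000):
    print(segment, states)
```

Each side acknowledges the other's initial sequence number plus one, since the SYN itself consumes one sequence number.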

TCP 3-way handshake: FSM

Each transition lists the triggering event, then the action taken (Λ: none).

closed -> listen:
  Socket connectionSocket = welcomeSocket.accept();
  Λ
listen -> SYN rcvd:
  SYN(x)
  SYNACK(seq=y,ACKnum=x+1); create new socket for communication back to client
SYN rcvd -> ESTAB:
  ACK(ACKnum=y+1)
  Λ
closed -> SYN sent:
  Socket clientSocket = new Socket("hostname", "port number");
  SYN(seq=x)
SYN sent -> ESTAB:
  SYNACK(seq=y,ACKnum=x+1)
  ACK(ACKnum=y+1)

TCP: closing a connection

• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

client state / server state (both start in ESTAB):
1. client calls clientSocket.close(), sends FINbit=1, seq=x and enters FIN_WAIT_1: can no longer send but can receive data
2. server replies ACKbit=1; ACKnum=x+1 and enters CLOSE_WAIT: can still send data; client enters FIN_WAIT_2: wait for server close
3. server sends FINbit=1, seq=y and enters LAST_ACK: can no longer send data
4. client replies ACKbit=1; ACKnum=y+1 and enters TIMED_WAIT: timed wait for 2*max segment lifetime, then CLOSED; server enters CLOSED

The "Two Army Problem"

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission

[Figure: Host A and Host B send original data at rate λ_in into a router with unlimited shared output link buffers; throughput: λ_out]

• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity

[Figure: two plots against λ_in (up to R/2): throughput λ_out, rising to R/2, and delay]

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λ_in = λ_out
  - transport-layer input includes retransmissions: λ'_in ≥ λ_in

[Figure: Host A and Host B send through a router with finite shared output link buffers; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out]

idealization: perfect knowledge
• sender sends only when router buffers available

[Figure: the sender keeps a copy of each packet and sends only when the router has free buffer space, not when there is no buffer space; plot of λ_out against λ_in, both up to R/2]

Causes/costs of congestion: scenario 2

Idealization: known loss
packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost

[Figure: Host A keeps a copy of each packet and resends it once the router has free buffer space; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out. Plot of λ_out against λ'_in, axes up to R/2]

when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)

Causes/costs of congestion: scenario 2

Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered

[Figure: after a timeout, Host A resends its copy while the router has free buffer space, so Host B receives both copies. Plot of λ_out against λ'_in, axes up to R/2]

when sending at R/2, some packets are retransmissions, including duplicates that are delivered!

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput -> 0

[Figure: Hosts A, B, C and D connected through routers with finite shared output link buffers; λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out]

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!

Causes/costs of congestion: scenario 3

[Figure: throughput by blue traffic (λ_out) against offered load by Host A (λ'_in), axes up to R/2, annotated "Bandwidth wastage for packets dropped at the 2nd router"]

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss

[Figure: AIMD saw tooth behavior, probing for bandwidth. cwnd (TCP sender congestion window size) against time: additively increase window size ... until loss occurs (then cut window in half)]

TCP Congestion Control: details

• sender limits transmission:

  LastByteSent - LastByteAcked < cwnd

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

  rate ≈ cwnd/RTT bytes/sec

[Figure: sender sequence number space. last byte ACKed; sent, not-yet ACKed ("in-flight"); last byte sent; cwnd spans the in-flight bytes]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A sends one segment, then two segments, then four segments to Host B in successive RTTs]

TCP: detecting, reacting to loss

• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

initial transition (Λ), into slow start:
  cwnd = 1 MSS
  ssthresh = 64 KB
  dupACKcount = 0

state: slow start
• new ACK: cwnd = cwnd+MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: Λ; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery

state: congestion avoidance
• new ACK: cwnd = cwnd + MSS * (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery

state: fast recovery
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• New ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment; go to slow start
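The three-state summary can be sketched as a class. As a simplifying assumption, cwnd and ssthresh are kept in units of MSS (so the default ssthresh of 64 is a stand-in for 64 KB), and retransmission and sending are left out; the class name is illustrative.

```python
class RenoSketch:
    def __init__(self, ssthresh=64.0):
        self.state = "slow start"
        self.cwnd = 1.0               # in MSS units (assumption of this sketch)
        self.ssthresh = ssthresh
        self.dup_ack_count = 0

    def new_ack(self):
        if self.state == "fast recovery":
            self.cwnd = self.ssthresh
            self.state = "congestion avoidance"
        elif self.state == "slow start":
            self.cwnd += 1.0          # one MSS per ACK: doubles every RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion avoidance"
        else:
            self.cwnd += 1.0 / self.cwnd   # about one MSS per RTT
        self.dup_ack_count = 0

    def duplicate_ack(self):
        if self.state == "fast recovery":
            self.cwnd += 1.0
            return
        self.dup_ack_count += 1
        if self.dup_ack_count == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3
            self.state = "fast recovery"   # missing segment is retransmitted here

    def timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1.0
        self.dup_ack_count = 0
        self.state = "slow start"          # missing segment is retransmitted here
```

Driving the object with a stream of new ACKs, three duplicate ACKs and then a timeout walks it through all three states in turn.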

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT

avg TCP throughput = (3/4) * W/RTT bytes/sec

[Figure: window size saw tooth oscillating between W/2 and W, with average 3/4 W]

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

  TCP throughput = 1.22 * MSS / (RTT * sqrt(L))

  to achieve 10 Gbps throughput, need a loss rate of L = 2*10^-10, a very small loss rate!
• new versions of TCP for high-speed
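Both numbers in the example can be reproduced from the formula. The function names below are illustrative; throughput is expressed in bits per second, so MSS is converted from bytes to bits.

```python
import math


def window_segments(throughput_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain the throughput over one RTT."""
    return throughput_bps * rtt_s / (mss_bytes * 8)


def mathis_throughput_bps(mss_bytes, rtt_s, loss):
    """TCP throughput = 1.22 * MSS / (RTT * sqrt(L)), in bits per second."""
    return 1.22 * mss_bytes * 8 / (rtt_s * math.sqrt(loss))


def required_loss(throughput_bps, mss_bytes, rtt_s):
    """Solve the same formula for the loss probability L."""
    return (1.22 * mss_bytes * 8 / (rtt_s * throughput_bps)) ** 2


print(window_segments(10e9, 0.100, 1500))
print(required_loss(10e9, 1500, 0.100))
```

Because throughput falls with the square root of L, a tenfold increase in target rate needs a hundredfold lower loss rate.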

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput against Connection 1 throughput, both axes from 0 to R, with the equal bandwidth share line and the full bandwidth utilization line. The trajectory alternates "congestion avoidance: additive increase" and "loss: decrease window by factor of 2". Marked points: (X1, Y1) where X1+Y1 = R; (X2, Y2) where X2 = Y2]

Fairness (more)

Fairness and UDP:
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections:
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2
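The arithmetic behind the parallel-connection example, assuming every TCP connection on the link ends up with an equal share of R; the function name is illustrative.

```python
from fractions import Fraction


def app_share(existing_connections, new_connections):
    """Fraction of R obtained by an app that opens `new_connections` TCPs."""
    total = existing_connections + new_connections
    return Fraction(new_connections, total)


print(app_share(9, 1))    # share of R with one new connection
print(app_share(9, 11))   # share of R with eleven new connections
```

With 11 connections the app holds 11 of 20 equal shares, slightly more than the R/2 quoted above.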

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical). An IP datagram leaves the source with ECN=00 and is marked ECN=11 by a router; the destination returns a TCP ACK segment with ECE=1]


Performance of rdt3.0

• rdt3.0 is correct, but performance far from ideal
• e.g.: 1 Gbps link, 15 ms prop. delay, 8000 bit packet:

  D_trans = L/R = 8000 bits / 10^9 bits/sec = 8 microsecs

  - U_sender: utilization, fraction of time sender busy sending

  U_sender = (L/R) / (RTT + L/R) = 0.008 / 30.008 = 0.00027

  - if RTT = 30 msec, 1KB pkt every 30 msec: 33kB/sec throughput over 1 Gbps link
• network protocol limits use of physical resources!

rdt3.0: stop-and-wait operation

[Figure: sender/receiver timing diagram. First packet bit transmitted at t = 0; last packet bit transmitted at t = L/R; first packet bit arrives; last packet bit arrives, send ACK; ACK arrives, send next packet at t = RTT + L/R]

U_sender = (L/R) / (RTT + L/R) = 0.008 / 30.008 = 0.00027

Pipelined protocols

pipelining sender allows multiple ldquoin-flightrdquo yet-to-be-acknowledged pktsndash range of sequence numbers must be increasedndash buffering at sender andor receiver

43

bull two generic forms of pipelined protocols Go-Back-N Selective Repeat

Pipelining increased utilization

44

first packet bit transmitted t = 0sender receiver

RTT

last bit transmitted t = L R

first packet bit arriveslast packet bit arrives send ACK

ACK arrives send next packet t = RTT + L R

last bit of 2nd packet arrives send ACKlast bit of 3rd packet arrives send ACK

3-packet pipelining increasesutilization by a factor of 3

U sender =

0024 30008

= 000081 3L R RTT + L R

=

Pipelined protocols overview

Go-back-Nbull sender can have up to

N unacked packets in pipeline

bull receiver only sends cumulative ackndash Doesnrsquot ack packet if

therersquos a gapbull sender has timer for

oldest unacked packetndash when timer expires

retransmit all unackedpackets

Selective Repeatbull sender can have up to

N unacked packets in pipeline

bull rcvr sends individual ackfor each packet

bull sender maintains timer for each unacked packetndash when timer expires

retransmit only that unacked packet

45

Go-Back-N sender

bull k-bit seq in pkt headerbull ldquowindowrdquo of up to N consecutive unacked pkts allowed

46

v ACK(n) ACKs all pkts up to including seq n - ldquocumulative ACKrdquosect may receive duplicate ACKs (see receiver)

v timer for oldest in-flight pktv timeout(n) retransmit packet n and all higher seq pkts in

window

GBN sender extended FSM

47

Wait start_timerudt_send(sndpkt[base])udt_send(sndpkt[base+1])hellipudt_send(sndpkt[nextseqnum-1])

timeout

rdt_send(data)

if (nextseqnum lt base+N) sndpkt[nextseqnum] = make_pkt(nextseqnumdatachksum)udt_send(sndpkt[nextseqnum])if (base == nextseqnum)

start_timernextseqnum++

else

refuse_data(data)

base = getacknum(rcvpkt)+1If (base == nextseqnum)

stop_timerelse

start_timer

rdt_rcv(rcvpkt) ampamp notcorrupt(rcvpkt)

base=1nextseqnum=1

rdt_rcv(rcvpkt) ampamp corrupt(rcvpkt)

L

GBN sender extended FSM

48

Wait start_timerudt_send(sndpkt[base])udt_send(sndpkt[base+1])hellipudt_send(sndpkt[nextseqnum-1])

timeout

rdt_send(data)

if (nextseqnum lt base+N) sndpkt[nextseqnum] = make_pkt(nextseqnumdatachksum)udt_send(sndpkt[nextseqnum])if (base == nextseqnum)

start_timernextseqnum++

else

refuse_data(data)

base = getacknum(rcvpkt)+1If (base == nextseqnum)

stop_timerelse

start_timer

rdt_rcv(rcvpkt) ampamp notcorrupt(rcvpkt)

base=1nextseqnum=1

rdt_rcv(rcvpkt) ampamp corrupt(rcvpkt)

L

GBN receiver extended FSM

ACK-only always send ACK for correctly-received pktwith highest in-order seq ndash may generate duplicate ACKsndash need only remember expectedseqnum

bull out-of-order pkt ndash discard (donrsquot buffer) no receiver bufferingndash re-ACK pkt with highest in-order seq

49

Wait

udt_send(sndpkt)default

rdt_rcv(rcvpkt)ampamp notcurrupt(rcvpkt)ampamp hasseqnum(rcvpktexpectedseqnum)

extract(rcvpktdata)deliver_data(data)sndpkt = make_pkt(expectedseqnumACKchksum)udt_send(sndpkt)expectedseqnum++

expectedseqnum=1sndpkt = make_pkt(0ACKchksum)

L

GBN receiver extended FSM

ACK-only always send ACK for correctly-received pktwith highest in-order seq ndash may generate duplicate ACKsndash need only remember expectedseqnum

bull out-of-order pkt ndash discard (donrsquot buffer) no receiver bufferingndash re-ACK pkt with highest in-order seq

50

Wait

udt_send(sndpkt)default

rdt_rcv(rcvpkt)ampamp notcurrupt(rcvpkt)ampamp hasseqnum(rcvpktexpectedseqnum)

extract(rcvpktdata)deliver_data(data)sndpkt = make_pkt(expectedseqnumACKchksum)udt_send(sndpkt)expectedseqnum++

expectedseqnum=1sndpkt = make_pkt(0ACKchksum)

L

GBN in action

51

send pkt0send pkt1send pkt2send pkt3

(wait)

sender receiver

receive pkt0 send ack0receive pkt1 send ack1

receive pkt3 discard (re)send ack1rcv ack0 send pkt4

rcv ack1 send pkt5

pkt 2 timeoutsend pkt2send pkt3send pkt4send pkt5

Xloss

receive pkt4 discard (re)send ack1

receive pkt5 discard (re)send ack1

rcv pkt2 deliver send ack2rcv pkt3 deliver send ack3rcv pkt4 deliver send ack4rcv pkt5 deliver send ack5

ignore duplicate ACK

0 1 2 3 4 5 6 7 8

sender window (N=4)

0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8

GBN in action

52

send pkt0send pkt1send pkt2send pkt3

(wait)

sender receiver

receive pkt0 send ack0receive pkt1 send ack1

receive pkt3 discard (re)send ack1rcv ack0 send pkt4

rcv ack1 send pkt5

pkt 2 timeoutsend pkt2send pkt3send pkt4send pkt5

Xloss

receive pkt4 discard (re)send ack1

receive pkt5 discard (re)send ack1

rcv pkt2 deliver send ack2rcv pkt3 deliver send ack3rcv pkt4 deliver send ack4rcv pkt5 deliver send ack5

ignore duplicate ACK

0 1 2 3 4 5 6 7 8

sender window (N=4)

0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8

Selective repeat

bull receiver individually acknowledges all correctly received packetsndash buffers packets as needed for eventual in-order delivery to

upper layer

bull sender only resends packets for which ACK not receivedndash sender timer for each unACKed packet

bull sender windowndash N consecutive seq rsquosndash limits seq s of sent unACKed packets

53

Selective repeat sender receiver windows

54

Selective repeat

data from abovebull if next available seq in

window send pkt

timeout(n)bull resend pkt n restart timer

ACK(n) in [sendbase sendbase+N-1]

bull mark pkt n as receivedbull if n smallest unACKed pkt

advance window base to next unACKed seq

55

senderpkt n in [rcvbase rcvbase+N-1]

v send ACK(n)v out-of-order bufferv in-order deliver (also

deliver buffered in-order pkts) advance window to next not-yet-received pkt

pkt n in [rcvbase-N rcvbase-1]

v ACK(n)otherwisev ignore

receiver

Selective repeat in action

sender window (N=4), sequence numbers 0 1 2 3 4 5 6 7 8

sender:
• send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, then (wait)
• rcv ack0, send pkt4
• rcv ack1, send pkt5
• record ack3 arrived
• pkt 2 timeout: send pkt2
• record ack4 arrived
• record ack5 arrived

receiver:
• receive pkt0, send ack0
• receive pkt1, send ack1
• receive pkt3, buffer, send ack3
• receive pkt4, buffer, send ack4
• receive pkt5, buffer, send ack5
• rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

(a) no problem
• sender sends pkt0, pkt1, pkt2; receiver gets all three and moves its window to 3, 0, 1
• the ACKs arrive; sender sends pkt3 (lost, X), then a new pkt0
• receiver will accept packet with seq number 0

(b) oops!
• sender sends pkt0, pkt1, pkt2; receiver gets all three and moves its window to 3, 0, 1
• all three ACKs are lost (X X X)
• timeout: sender retransmits the old pkt0
• receiver will accept packet with seq number 0

receiver can't see sender side; receiver behavior identical in both cases! something's (very) wrong!

• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
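One way to explore the question is to test the worst case directly: the receiver has accepted a full window and advanced, but every ACK was lost, so the sender retransmits the old window. The sketch below checks whether any old sequence number falls into the receiver's new window. The function name `sr_window_is_safe` is hypothetical; the well-known result it reproduces (window size at most half the sequence-number space) is standard textbook material rather than something stated on this slide.

```python
def sr_window_is_safe(seq_space, window):
    """True if a retransmitted old packet can never land in the receiver's new window.

    Worst case: all `window` packets were received, the receiver advanced by
    `window`, and every ACK was lost, so the sender resends the old window.
    """
    old_window = {s % seq_space for s in range(0, window)}
    new_window = {s % seq_space for s in range(window, 2 * window)}
    return old_window.isdisjoint(new_window)


# The slide's example: 4 sequence numbers, window size 3.
slide_example_safe = sr_window_is_safe(4, 3)    # False: seq 0 is ambiguous
smaller_window_safe = sr_window_is_safe(4, 2)   # True: {0, 1} vs {2, 3}
```

Under these assumptions the check passes exactly when `2 * window <= seq_space`.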

TCP: Overview (RFCs 793, 1122, 1323, 2018, 2581)

• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver

TCP segment structure

fields, in order, laid out in rows of 32 bits:
• source port #, dest port #
• sequence number
• acknowledgement number
• head len, not used, flag bits U A P R S F, receive window
• checksum, Urg data pointer
• options (variable length)
• application data (variable length)

notes:
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• checksum: Internet checksum (as in UDP)
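To make the layout concrete, here is a minimal sketch that decodes the fixed 20-byte part of a TCP header with Python's standard `struct` module. The function name `parse_tcp_header`, the dictionary keys and the sample values are illustrative assumptions; options are not decoded, only skipped via the header length.

```python
import struct

# Flag bits in the low 6 bits of the 16-bit "head len / not used / flags" field.
FLAG_BITS = {"URG": 0x20, "ACK": 0x10, "PSH": 0x08, "RST": 0x04, "SYN": 0x02, "FIN": 0x01}


def parse_tcp_header(segment: bytes) -> dict:
    """Decode the fixed 20-byte TCP header (network byte order)."""
    src, dst, seq, ack, off_flags, rwnd, checksum, urg_ptr = struct.unpack(
        "!HHIIHHHH", segment[:20]
    )
    header_len = (off_flags >> 12) * 4  # head len counts 32-bit words
    return {
        "src_port": src,
        "dst_port": dst,
        "seq": seq,
        "ack": ack,
        "header_len": header_len,
        "flags": {name for name, bit in FLAG_BITS.items() if off_flags & bit},
        "rwnd": rwnd,
        "checksum": checksum,
        "urg_ptr": urg_ptr,
        "payload": segment[header_len:],  # application data after any options
    }


# Hypothetical segment: head len = 5 words, ACK and PSH set, one data byte.
example = struct.pack("!HHIIHHHH", 12345, 80, 42, 79, (5 << 12) | 0x18, 4096, 0, 0) + b"C"
header = parse_tcp_header(example)
```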

TCP seq. numbers, ACKs

sequence numbers:
• byte stream "number" of first byte in segment's data

acknowledgements:
• seq # of next byte expected from other side
• cumulative ACK

Q: how receiver handles out-of-order segments
• A: TCP spec doesn't say; up to implementor

[figure: outgoing segment from sender and incoming segment to sender, each with source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer; sender sequence number space with window size N, divided into: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable]

Byte stream in TCP

[figure: an HTTP GET message (K bytes) as a byte stream; a window of N bytes; a TCP header (seq no = 100) placed in front of M bytes starting at the 100th byte; the part of the message beyond the window cannot be transmitted now]

TCP seq. numbers, ACKs

simple telnet scenario (Host A, Host B):
• user types 'C'; A sends Seq=42, ACK=79, data = 'C'
• host B ACKs receipt of 'C', echoes back 'C': Seq=79, ACK=43, data = 'C'
• host A ACKs receipt of echoed 'C': Seq=43, ACK=80

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT

$$\text{EstimatedRTT} = (1 - \alpha) \cdot \text{EstimatedRTT} + \alpha \cdot \text{SampleRTT}$$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: $\alpha = 0.125$

[figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr; SampleRTT and EstimatedRTT in milliseconds (100 to 350) against time in seconds]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

$$\text{DevRTT} = (1 - \beta) \cdot \text{DevRTT} + \beta \cdot |\text{SampleRTT} - \text{EstimatedRTT}|$$

(typically, $\beta = 0.25$)

$$\text{TimeoutInterval} = \text{EstimatedRTT} + 4 \cdot \text{DevRTT}$$

(estimated RTT plus "safety margin")
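The three update rules fit in a small class. The sketch below is an illustration, not TCP's actual implementation: the class name `RttEstimator` is hypothetical, and two details the slides leave open are filled in as assumptions borrowed from RFC 6298 (DevRTT starts at half the first sample, and DevRTT is updated before EstimatedRTT so that it uses the previous estimate).

```python
class RttEstimator:
    """EWMA-based RTT and timeout estimator (sketch)."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2  # assumption: initial value as in RFC 6298

    def update(self, sample_rtt):
        """Fold one SampleRTT (not from a retransmission) into the estimates."""
        # Assumption: deviation is measured against the previous estimate.
        self.dev_rtt = (1 - self.beta) * self.dev_rtt + self.beta * abs(
            sample_rtt - self.estimated_rtt
        )
        self.estimated_rtt = (1 - self.alpha) * self.estimated_rtt + self.alpha * sample_rtt

    @property
    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)  # milliseconds, illustrative value
est.update(200.0)                       # one slow sample
```

After the single 200 ms sample the estimate moves only an eighth of the way toward it, while the deviation term widens the timeout considerably, which is the "safety margin" effect described above.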

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks

let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control

TCP sender events

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments

TCP sender (simplified)

initially (Λ), then wait for event:
  NextSeqNum = InitialSeqNum
  SendBase = InitialSeqNum

event: data received from application above
  create segment, seq. #: NextSeqNum
  pass segment to IP (i.e., "send")
  NextSeqNum = NextSeqNum + length(data)
  if (timer currently not running)
    start timer

event: timeout
  retransmit not-yet-acked segment with smallest seq. #
  start timer

event: ACK received, with ACK field value y
  if (y > SendBase) {
    SendBase = y
    /* SendBase-1: last cumulatively ACKed byte */
    if (there are currently not-yet-acked segments)
      start timer
    else stop timer
  }
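The three events of the simplified sender translate almost line for line into code. The sketch below is a toy model under stated assumptions: segments "passed to IP" are just appended to a list called `wire`, the timer is a boolean, and the class name `SimplifiedTcpSender` and its method names are hypothetical.

```python
class SimplifiedTcpSender:
    """Simplified TCP sender: no duplicate-ACK handling, flow or congestion control."""

    def __init__(self, initial_seq):
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.unacked = []           # (seq, data) pairs, oldest first
        self.timer_running = False
        self.wire = []              # every segment handed to IP, in order

    def on_app_data(self, data: bytes):
        segment = (self.next_seq, data)
        self.wire.append(segment)               # pass segment to IP
        self.unacked.append(segment)
        self.next_seq += len(data)
        if not self.timer_running:
            self.timer_running = True           # start timer

    def on_timeout(self):
        if self.unacked:
            self.wire.append(self.unacked[0])   # smallest not-yet-acked seq
            self.timer_running = True           # restart timer

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y                  # bytes up to y-1 are ACKed
            self.unacked = [(s, d) for s, d in self.unacked if s + len(d) > y]
            self.timer_running = bool(self.unacked)


sender = SimplifiedTcpSender(initial_seq=92)
sender.on_app_data(b"x" * 8)    # Seq=92, 8 bytes
sender.on_app_data(b"y" * 20)   # Seq=100, 20 bytes
sender.on_ack(120)              # one cumulative ACK covers both segments
```

Because ACKs are cumulative, the single ACK=120 clears both segments and stops the timer even if an earlier ACK=100 never arrived.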

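TCP: retransmission scenarios

lost ACK scenario (Host A, Host B):
• A sends Seq=92, 8 bytes of data (SendBase=92)
• B sends ACK=100, which is lost (X)
• timeout: A resends Seq=92, 8 bytes of data
• B sends ACK=100 (SendBase=100)

premature timeout (Host A, Host B):
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data (SendBase=92)
• B sends ACK=100, then ACK=120
• timeout fires before the ACKs arrive: A resends Seq=92, 8 bytes of data
• ACKs arrive (SendBase=100, then SendBase=120)
• B answers the retransmission with ACK=120 (SendBase=120)

cumulative ACK (Host A, Host B):
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
• B sends ACK=100, which is lost (X), then ACK=120
• ACK=120 arrives before the timeout; A continues with Seq=120, 15 bytes of data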

TCP ACK generation [RFC 5681]

event at receiver -> TCP receiver action

• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  -> delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  -> immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected
  -> immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap
  -> immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  - long delay before resending lost packet
• detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  - likely that unacked segment lost, so don't wait for timeout
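Duplicate-ACK detection needs only two pieces of state: the last ACK number seen and how many times it has repeated. The sketch below is one possible reading, with an assumption made explicit: it counts duplicates after the first ACK for a given number and fires on the third duplicate, which is how RFC 5681 phrases the threshold. The class name `DupAckDetector` is hypothetical.

```python
class DupAckDetector:
    """Signals fast retransmit once the same ACK number has repeated `threshold` times."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.last_ack = None
        self.dup_count = 0

    def on_ack(self, ack_num):
        """Return the seq # to retransmit right away, or None."""
        if ack_num == self.last_ack:
            self.dup_count += 1
            if self.dup_count == self.threshold:
                return ack_num  # the receiver is still waiting for this byte
        else:
            self.last_ack = ack_num  # new data ACKed: reset the counter
            self.dup_count = 0
        return None


detector = DupAckDetector()
# One ACK=100 followed by three duplicates.
decisions = [detector.on_ack(a) for a in (100, 100, 100, 100)]
```

Here only the fourth ACK=100 (the third duplicate) triggers a retransmission of the segment starting at byte 100.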

TCP fast retransmit

fast retransmit after sender receipt of triple duplicate ACK (Host A, Host B):
• A sends Seq=92, 8 bytes of data
• A sends Seq=100, 20 bytes of data, which is lost (X)
• B sends ACK=100, then ACK=100, ACK=100, ACK=100 (3 DUP ACKs)
• A resends Seq=100, 20 bytes of data, before the timeout expires

TCP flow control

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

• application may remove data from TCP socket buffers ...
• ... slower than TCP receiver is delivering (sender is sending)

[figure: receiver protocol stack; segments from sender pass through IP code and TCP code into the TCP socket receiver buffers inside the OS, and from there to the application process]

TCP flow control

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow
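A small numeric sketch shows how the two sides of flow control fit together. The variable names `last_byte_rcvd` and `last_byte_read` are assumptions borrowed from the usual textbook presentation (they do not appear on the slide), and both function names are hypothetical.

```python
def advertised_window(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Receiver side: free buffer space to advertise as rwnd."""
    buffered = last_byte_rcvd - last_byte_read  # data not yet read by the app
    return max(rcv_buffer - buffered, 0)


def sender_may_send(last_byte_sent, last_byte_acked, rwnd, nbytes):
    """Sender side: would sending nbytes keep in-flight data within rwnd?"""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + nbytes <= rwnd


# 4096-byte buffer holding 2000 unread bytes leaves 2096 bytes free.
rwnd = advertised_window(rcv_buffer=4096, last_byte_rcvd=3000, last_byte_read=1000)
ok_small = sender_may_send(5000, 4000, rwnd, nbytes=1000)  # 2000 in flight: allowed
ok_large = sender_may_send(5000, 4000, rwnd, nbytes=1100)  # 2100 in flight: refused
```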

[figure: receiver-side buffering; TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); buffered data goes to application process]

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

client: Socket clientSocket = new Socket(hostname, portNumber);
server: Socket connectionSocket = welcomeSocket.accept();

on each side (application above, network below):
  connection state: ESTAB
  connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client

TCP 3-way handshake

client state: CLOSED; server state: LISTEN

• client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client state: SYNSENT
• server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server state: SYN RCVD
• client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client state: ESTAB
• server: received ACK(y) indicates client is live; server state: ESTAB

TCP 3-way handshake: FSM

client:
• closed -> SYN sent: on Socket clientSocket = new Socket(hostname, portNumber); send SYN(seq=x)
• SYN sent -> ESTAB: on SYNACK(seq=y, ACKnum=x+1); send ACK(ACKnum=y+1)

server:
• closed -> listen: on Socket connectionSocket = welcomeSocket.accept(); no action (Λ)
• listen -> SYN rcvd: on SYN(x); send SYNACK(seq=y, ACKnum=x+1), create new socket for communication back to client
• SYN rcvd -> ESTAB: on ACK(ACKnum=y+1); no action (Λ)

TCP: closing a connection

• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

TCP: closing a connection

client state: ESTAB; server state: ESTAB

• client: clientSocket.close(); send FINbit=1, seq=x; client state: FIN_WAIT_1 (can no longer send but can receive data)
• server: send ACKbit=1, ACKnum=x+1; server state: CLOSE_WAIT (can still send data)
• client state: FIN_WAIT_2 (wait for server close)
• server: send FINbit=1, seq=y; server state: LAST_ACK (can no longer send data)
• client: send ACKbit=1, ACKnum=y+1; client state: TIMED_WAIT (timed wait for 2*max segment lifetime), then CLOSED
• server: on receipt of the ACK, server state: CLOSED

The "Two Army Problem"

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem!

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, $\lambda_{in}$, approaches capacity

[figure: Host A sends original data at rate $\lambda_{in}$ into unlimited shared output link buffers; Host B; throughput $\lambda_{out}$ against $\lambda_{in}$, both up to R/2; delay against $\lambda_{in}$ up to R/2]

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: $\lambda_{in} = \lambda_{out}$
  - transport-layer input includes retransmissions: $\lambda'_{in} \ge \lambda_{in}$

[figure: Host A, Host B, finite shared output link buffers; $\lambda_{in}$: original data; $\lambda'_{in}$: original data, plus retransmitted data; $\lambda_{out}$]

Causes/costs of congestion: scenario 2

idealization: perfect knowledge
• sender sends only when router buffers available

[figure: Host A sends a copy only when there is free buffer space in the finite shared output link buffers; $\lambda_{in}$: original data; $\lambda'_{in}$: original data, plus retransmitted data; $\lambda_{out}$ against $\lambda_{in}$, both up to R/2]

Idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
• when sending at R/2, some packets are retransmissions but asymptotic goodput is still R/2 (why?)

[figure: no buffer space at the router, packet dropped, copy resent once free buffer space exists; $\lambda_{out}$ against $\lambda'_{in}$, both up to R/2]

Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered!

[figure: timeout at Host A while the first copy is still queued; both copies reach Host B; $\lambda_{out}$ against $\lambda'_{in}$, both up to R/2]

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as $\lambda_{in}$ and $\lambda'_{in}$ increase?
A: as red $\lambda'_{in}$ increases, all arriving blue pkts at upper queue are dropped, blue throughput -> 0

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!

[figure: Hosts A, B, C, D with finite shared output link buffers; $\lambda_{in}$: original data; $\lambda'_{in}$: original data, plus retransmitted data; $\lambda_{out}$; plot of throughput by blue traffic ($\lambda_{out}$) against offered load by Host A ($\lambda'_{in}$), both up to R/2, annotated "Bandwidth wastage for packets dropped at the 2nd router"]

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss

AIMD saw tooth behavior: probing for bandwidth
• additively increase window size ... until loss occurs (then cut window in half)

[figure: cwnd (TCP sender congestion window size) against time]

TCP Congestion Control: details

• sender limits transmission:

$$\text{LastByteSent} - \text{LastByteAcked} \le \text{cwnd}$$

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

$$\text{rate} \approx \frac{\text{cwnd}}{\text{RTT}} \text{ bytes/sec}$$

[figure: sender sequence number space; last byte ACKed; sent, not-yet ACKed ("in-flight"); last byte sent; cwnd]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[figure: Host A sends one segment, then two segments, then four segments to Host B, one round per RTT, against time]

TCP: detecting, reacting to loss

• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

initially (Λ): cwnd = 1 MSS, ssthresh = 64 KB, dupACKcount = 0; enter slow start

slow start:
• new ACK: cwnd = cwnd + MSS, dupACKcount = 0, transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: no action (Λ); go to congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2, cwnd = ssthresh + 3 MSS, retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment; stay in slow start

congestion avoidance:
• new ACK: cwnd = cwnd + MSS * (MSS/cwnd), dupACKcount = 0, transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2, cwnd = ssthresh + 3 MSS, retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment; go to slow start

fast recovery:
• duplicate ACK: cwnd = cwnd + MSS, transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh, dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2, cwnd = 1 MSS, dupACKcount = 0, retransmit missing segment; go to slow start
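The state machine summarized on this slide can be expressed as a small class that tracks only `cwnd`, `ssthresh` and the duplicate-ACK count. This is a sketch under assumptions: the class name `RenoCwnd`, the state strings and the MSS value of 1460 bytes are illustrative, "+ 3" is read as 3 MSS, and actual segment transmission and retransmission are left out.

```python
MSS = 1460  # bytes; illustrative value, not from the slides


class RenoCwnd:
    """Congestion-window bookkeeping for slow start, congestion avoidance, fast recovery."""

    def __init__(self):
        self.state = "slow_start"
        self.cwnd = 1 * MSS
        self.ssthresh = 64 * 1024
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "slow_start":
            self.cwnd += MSS                        # doubles cwnd every RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        elif self.state == "congestion_avoidance":
            self.cwnd += MSS * MSS / self.cwnd      # about 1 MSS per RTT
        else:                                       # fast_recovery
            self.cwnd = self.ssthresh
            self.state = "congestion_avoidance"
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += MSS                        # inflate window per duplicate
            return
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast_recovery"            # missing segment is resent here

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1 * MSS
        self.dup_acks = 0
        self.state = "slow_start"                   # missing segment is resent here
```

Feeding it three new ACKs, three duplicate ACKs, and then a new ACK walks through slow start, fast recovery and congestion avoidance in turn, with `cwnd` ending at half its pre-loss value.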

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT

$$\text{avg TCP throughput} = \frac{3}{4} \cdot \frac{W}{\text{RTT}} \text{ bytes/sec}$$

[figure: sawtooth window oscillating between W/2 and W]

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

$$\text{TCP throughput} = \frac{1.22 \cdot \text{MSS}}{\text{RTT} \cdot \sqrt{L}}$$

  - to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate!
• new versions of TCP for high-speed
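Both numbers in this example can be checked with a few lines of arithmetic. The sketch below assumes MSS is measured in bytes and throughput in bits per second; the function names are hypothetical.

```python
import math


def window_segments(rate_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain rate_bps over one RTT."""
    return rate_bps * rtt_s / (8 * mss_bytes)


def mathis_throughput_bps(mss_bytes, rtt_s, loss):
    """Throughput from the 1.22 * MSS / (RTT * sqrt(L)) relation, in bits/sec."""
    return 1.22 * mss_bytes * 8 / (rtt_s * math.sqrt(loss))


def loss_for_rate(rate_bps, rtt_s, mss_bytes):
    """Loss probability L that yields rate_bps under the same relation."""
    return (1.22 * mss_bytes * 8 / (rtt_s * rate_bps)) ** 2


W = window_segments(10e9, 0.100, 1500)   # about 83,333 segments
L = loss_for_rate(10e9, 0.100, 1500)     # about 2e-10
```

Plugging `L` back into `mathis_throughput_bps` returns the 10 Gbps target, which confirms that the two helper functions are inverses of each other.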

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[figure: Connection 2 throughput against Connection 1 throughput, both axes up to R; full bandwidth utilization line through points (X1, Y1) where X1 + Y1 = R; equal bandwidth share line through points (X2, Y2) where X2 = Y2; the trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2"]

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2
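The two figures in the parallel-connection example follow from assuming that every TCP connection on the bottleneck gets an equal share of R. A minimal sketch of that arithmetic, with the hypothetical helper `app_share` returning the new application's share as a fraction of R:

```python
from fractions import Fraction


def app_share(existing_connections, new_connections):
    """Fraction of link rate R obtained by an app opening new_connections."""
    total = existing_connections + new_connections
    return Fraction(new_connections, total)


one_conn = app_share(9, 1)      # 1/10 of R
eleven_conn = app_share(9, 11)  # 11/20 of R, slightly more than R/2
```

So the slide's "R/2" for 11 connections is a rounded value: the exact share under this equal-split assumption is 11/20 of R.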

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[figure: source and destination protocol stacks (application, transport, network, link, physical); IP datagram carries ECN=00 from the source and ECN=11 after the congested router; TCP ACK segment returns with ECE=1]

Page 42: ChapterIII: Transport Layer

rdt3.0: stop-and-wait operation

sender, receiver timeline:
• first packet bit transmitted, t = 0
• last packet bit transmitted, t = L / R
• first packet bit arrives
• last packet bit arrives, send ACK
• ACK arrives, send next packet, t = RTT + L / R

$$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027$$

Pipelined protocols

pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  - range of sequence numbers must be increased
  - buffering at sender and/or receiver

• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

sender, receiver timeline:
• first packet bit transmitted, t = 0
• last bit transmitted, t = L / R
• first packet bit arrives
• last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• ACK arrives, send next packet, t = RTT + L / R

3-packet pipelining increases utilization by a factor of 3!

$$U_{sender} = \frac{3L/R}{RTT + L/R} = \frac{0.024}{30.008} \approx 0.0008$$
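The utilization formula is easy to evaluate for both the stop-and-wait and the pipelined case. The sketch below takes the slide's figures as given, L/R = 0.008 ms and RTT = 30 ms; the function name `sender_utilization` and its `pipelined` parameter are hypothetical.

```python
def sender_utilization(l_over_r, rtt, pipelined=1):
    """Fraction of time the sender is busy when `pipelined` packets are sent per RTT.

    l_over_r and rtt must use the same time unit.
    """
    return pipelined * l_over_r / (rtt + l_over_r)


u_stop_and_wait = sender_utilization(0.008, 30.0)             # one packet per RTT
u_three_packets = sender_utilization(0.008, 30.0, pipelined=3)
```

Sending three packets per round trip triples the numerator while the denominator is unchanged, so the utilization is exactly three times the stop-and-wait value.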

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n ("cumulative ACK")
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

single state: Wait

initially (Λ):
  base = 1
  nextseqnum = 1

event: rdt_send(data)
  if (nextseqnum < base+N) {
    sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
    udt_send(sndpkt[nextseqnum])
    if (base == nextseqnum)
      start_timer
    nextseqnum++
  }
  else
    refuse_data(data)

event: timeout
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  ...
  udt_send(sndpkt[nextseqnum-1])

event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
  base = getacknum(rcvpkt)+1
  if (base == nextseqnum)
    stop_timer
  else
    start_timer

event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
  no action (Λ)
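The sender FSM maps onto a short class with one method per event. The sketch below is a toy model under assumptions: the "channel" is a list named `sent` that records every sequence number transmitted, the timer is a boolean, corrupt packets are simply never passed in, and `max()` is added in `on_ack` as a defensive deviation from the FSM so a stale ACK cannot move `base` backwards. The class name `GbnSender` is hypothetical.

```python
class GbnSender:
    """Go-Back-N sender with window size N (sketch of the extended FSM)."""

    def __init__(self, window_size):
        self.N = window_size
        self.base = 1
        self.next_seq = 1
        self.sndpkt = {}            # seq -> data kept for retransmission
        self.timer_running = False
        self.sent = []              # log of every seq handed to the channel

    def rdt_send(self, data):
        """Return True if the data was sent, False if refused (window full)."""
        if self.next_seq >= self.base + self.N:
            return False            # refuse_data
        self.sndpkt[self.next_seq] = data
        self.sent.append(self.next_seq)
        if self.base == self.next_seq:
            self.timer_running = True
        self.next_seq += 1
        return True

    def timeout(self):
        self.timer_running = True
        for seq in range(self.base, self.next_seq):
            self.sent.append(seq)   # go back N: resend every unacked packet

    def on_ack(self, acknum):
        self.base = max(self.base, acknum + 1)   # cumulative ACK
        for seq in [s for s in self.sndpkt if s < self.base]:
            del self.sndpkt[seq]                 # no longer needed
        self.timer_running = self.base != self.next_seq
```

With N = 4, a fifth `rdt_send` is refused until an ACK slides the window, and a timeout resends everything from `base` up to `next_seq - 1`.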


GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #

single state: Wait

initially (Λ):
  expectedseqnum = 1
  sndpkt = make_pkt(0, ACK, chksum)

event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum)
  extract(rcvpkt, data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum, ACK, chksum)
  udt_send(sndpkt)
  expectedseqnum++

event: default
  udt_send(sndpkt)


GBN in action

51

send pkt0send pkt1send pkt2send pkt3

(wait)

sender receiver

receive pkt0 send ack0receive pkt1 send ack1

receive pkt3 discard (re)send ack1rcv ack0 send pkt4

rcv ack1 send pkt5

pkt 2 timeoutsend pkt2send pkt3send pkt4send pkt5

Xloss

receive pkt4 discard (re)send ack1

receive pkt5 discard (re)send ack1

rcv pkt2 deliver send ack2rcv pkt3 deliver send ack3rcv pkt4 deliver send ack4rcv pkt5 deliver send ack5

ignore duplicate ACK

0 1 2 3 4 5 6 7 8

sender window (N=4)

0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8

GBN in action

52

send pkt0send pkt1send pkt2send pkt3

(wait)

sender receiver

receive pkt0 send ack0receive pkt1 send ack1

receive pkt3 discard (re)send ack1rcv ack0 send pkt4

rcv ack1 send pkt5

pkt 2 timeoutsend pkt2send pkt3send pkt4send pkt5

Xloss

receive pkt4 discard (re)send ack1

receive pkt5 discard (re)send ack1

rcv pkt2 deliver send ack2rcv pkt3 deliver send ack3rcv pkt4 deliver send ack4rcv pkt5 deliver send ack5

ignore duplicate ACK

0 1 2 3 4 5 6 7 8

sender window (N=4)

0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8

Selective repeat

bull receiver individually acknowledges all correctly received packetsndash buffers packets as needed for eventual in-order delivery to

upper layer

bull sender only resends packets for which ACK not receivedndash sender timer for each unACKed packet

bull sender windowndash N consecutive seq rsquosndash limits seq s of sent unACKed packets

53

Selective repeat sender receiver windows

54

Selective repeat

data from abovebull if next available seq in

window send pkt

timeout(n)bull resend pkt n restart timer

ACK(n) in [sendbase sendbase+N-1]

bull mark pkt n as receivedbull if n smallest unACKed pkt

advance window base to next unACKed seq

55

senderpkt n in [rcvbase rcvbase+N-1]

v send ACK(n)v out-of-order bufferv in-order deliver (also

deliver buffered in-order pkts) advance window to next not-yet-received pkt

pkt n in [rcvbase-N rcvbase-1]

v ACK(n)otherwisev ignore

receiver

Selective repeat in action

56

send pkt0send pkt1send pkt2send pkt3

(wait)

sender receiver

receive pkt0 send ack0receive pkt1 send ack1

receive pkt3 buffer send ack3rcv ack0 send pkt4

rcv ack1 send pkt5

pkt 2 timeoutsend pkt2

Xloss

receive pkt4 buffer send ack4

receive pkt5 buffer send ack5

rcv pkt2 deliver pkt2pkt3 pkt4 pkt5 send ack2

record ack3 arrived

0 1 2 3 4 5 6 7 8

sender window (N=4)

0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8

record ack4 arrivedrecord ack5 arrived

Q what happens when ack2 arrives

Selective repeat in action

57

send pkt0send pkt1send pkt2send pkt3

(wait)

sender receiver

receive pkt0 send ack0receive pkt1 send ack1

receive pkt3 buffer send ack3rcv ack0 send pkt4

rcv ack1 send pkt5

pkt 2 timeoutsend pkt2

Xloss

receive pkt4 buffer send ack4

receive pkt5 buffer send ack5

rcv pkt2 deliver pkt2pkt3 pkt4 pkt5 send ack2

record ack3 arrived

0 1 2 3 4 5 6 7 8

sender window (N=4)

0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8

record ack4 arrivedrecord ack5 arrived

Q what happens when ack2 arrives

Selective repeatdilemma

example bull seq rsquos 0 1 2 3bull window size=3

receiver window(after receipt)

sender window(after receipt)

0 1 2 3 0 1 2

0 1 2 3 0 1 2

0 1 2 3 0 1 2

pkt0pkt1pkt2

0 1 2 3 0 1 2 pkt0

timeoutretransmit pkt0

0 1 2 3 0 1 2

0 1 2 3 0 1 2

0 1 2 3 0 1 2XXX

will accept packetwith seq number 0(b) oops

0 1 2 3 0 1 2

0 1 2 3 0 1 2

0 1 2 3 0 1 2

pkt0pkt1pkt2

0 1 2 3 0 1 2pkt0

0 1 2 3 0 1 2

0 1 2 3 0 1 2

0 1 2 3 0 1 2

Xwill accept packetwith seq number 0

0 1 2 3 0 1 2 pkt3

(a) no problem

receiver canrsquot see sender sidereceiver behavior identical in both casessomethingrsquos (very) wrong

v receiver sees no difference in two scenarios

v duplicate data accepted as new in (b)

Q what relationship between seq size and window size to avoid problem in (b)

58

TCP Overview RFCs 79311221323 2018 2581

bull point-to-pointndash one sender one receiver

bull reliable in-order byte streamndash no ldquomessage boundariesrdquo

bull pipelinedndash TCP congestion and flow

control set window size

bull full duplex datandash bi-directional data flow in

same connectionndash MSS maximum segment

size

bull connection-orientedndash handshaking (exchange of

control msgs) inits sender receiver state before data exchange

bull flow controlledndash sender will not overwhelm

receiver

59

TCP segment structure

60

source port dest port

32 bits

applicationdata (variable length)

sequence numberacknowledgement number

receive windowUrg data pointerchecksum

FSRPAUheadlen

notused

options (variable length)

URG urgent data (generally not used)

ACK ACK valid

PSH push data now

RST SYN FINconnection estab(setup teardown

commands)

bytes rcvr willingto accept

countingby bytes of data(not segments)

Internetchecksum

(as in UDP)

TCP seq numbers ACKs

sequence numbersndashbyte stream ldquonumberrdquo of first byte in segmentrsquos data

acknowledgementsndashseq of next byte expected from other side

ndashcumulative ACKQ how receiver handles out-of-order segmentsndashA TCP spec doesnrsquot say ndashup to implementor

61

source port dest port

sequence numberacknowledgement number

checksum

rwndurg pointer

incoming segment to sender

A

sent ACKed

sent not-yet ACKed(ldquoin-flightrdquo)

usablebut not yet sent

not usable

window sizeN

sender sequence number space

source port dest port

sequence numberacknowledgement number

checksum

rwndurg pointer

outgoing segment from sender

Byte stream in TCP

62

Window N bytes

HTTP Get Message (K bytes)

100th byte

TCP header(seq no = 100)

M bytes

HTTP Get Message (K bytes)

Cannot be transmitted now

TCP seq numbers ACKs

63

UsertypeslsquoCrsquo

host ACKsreceipt

of echoedlsquoCrsquo

host ACKsreceipt oflsquoCrsquo echoesback lsquoCrsquo

simple telnet scenario

Host BHost A

Seq=42 ACK=79 data = lsquoCrsquo

Seq=79 ACK=43 data = lsquoCrsquo

Seq=43 ACK=80

TCP round trip time timeout

Q how to set TCP timeout value

bull longer than RTTndash but RTT varies

bull too short premature timeout unnecessary retransmissions

bull too long slow reaction to segment loss

Q how to estimate RTTbull SampleRTT measured

time from segment transmission until ACK receiptndash ignore retransmissions

bull SampleRTT will vary want estimated RTT ldquosmootherrdquondash average several recent

measurements not just current SampleRTT

64

RTT gaiacsumassedu to fantasiaeurecomfr

100

150

200

250

300

350

1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106time (seconnds)

RTT

(mill

iseco

nds)

SampleRTT Estimated RTT

EstimatedRTT = (1- a)EstimatedRTT + aSampleRTT

v exponential weighted moving averagev influence of past sample decreases exponentially fastv typical value a = 0125


TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

DevRTT = (1 - β)·DevRTT + β·|SampleRTT - EstimatedRTT|
(typically, β = 0.25)

TimeoutInterval = EstimatedRTT + 4·DevRTT
(EstimatedRTT is the estimated RTT; 4·DevRTT is the "safety margin")
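The two estimators and the timeout formula fit together as in the sketch below. The class name is hypothetical, and two details are assumptions the slides do not state: the initial DevRTT is half the first sample, and DevRTT is updated before EstimatedRTT (both follow RFC 6298).

```python
class RttEstimator:
    """Tracks EstimatedRTT and DevRTT and derives TimeoutInterval."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2  # assumption, as in RFC 6298

    def update(self, sample_rtt):
        # Assumption: deviation is measured against the previous estimate.
        deviation = abs(sample_rtt - self.estimated_rtt)
        self.dev_rtt = (1 - self.beta) * self.dev_rtt + self.beta * deviation
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    @property
    def timeout_interval(self):
        # EstimatedRTT plus a safety margin of 4 * DevRTT.
        return self.estimated_rtt + 4 * self.dev_rtt

estimator = RttEstimator(first_sample=100.0)
initial_timeout = estimator.timeout_interval
estimator.update(100.0)
steady_timeout = estimator.timeout_interval
```

When samples stop varying, DevRTT decays and the timeout tightens towards EstimatedRTT; a burst of variation widens the margin again.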

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks

let's initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control

TCP sender events

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments

TCP sender (simplified)

initialization (Λ):
    NextSeqNum = InitialSeqNum
    SendBase = InitialSeqNum

state: wait for event

event: data received from application above
    create segment, seq #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
        start timer

event: timeout
    retransmit not-yet-acked segment with smallest seq #
    start timer

event: ACK received, with ACK field value y
    if (y > SendBase) {
        SendBase = y
        /* SendBase–1: last cumulatively ACKed byte */
        if (there are currently not-yet-acked segments)
            start timer
        else stop timer
    }
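The three event handlers of the simplified sender can be sketched in Python as follows. The class, the `sent` log and the boolean timer flag are illustrative assumptions; real timers, sequence number wraparound, flow control and congestion control are left out.

```python
class SimplifiedTcpSender:
    """Event handlers of the simplified TCP sender (no dup ACKs, no windows)."""

    def __init__(self, initial_seq_num):
        self.next_seq_num = initial_seq_num
        self.send_base = initial_seq_num
        self.unacked = {}          # seq # -> data of not-yet-acked segments
        self.timer_running = False
        self.sent = []             # log of (seq #, data) passed to IP

    def on_app_data(self, data):
        seq = self.next_seq_num
        self.unacked[seq] = data
        self.sent.append((seq, data))       # pass segment to IP
        self.next_seq_num += len(data)
        if not self.timer_running:
            self.timer_running = True       # start timer

    def on_timeout(self):
        if not self.unacked:
            return
        seq = min(self.unacked)             # smallest not-yet-acked seq #
        self.sent.append((seq, self.unacked[seq]))
        self.timer_running = True           # restart timer

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y              # bytes up to y - 1 are ACKed
            for seq in [s for s, d in self.unacked.items() if s + len(d) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)
```

An ACK value at or below SendBase changes nothing here, because this simplified version deliberately ignores duplicate ACKs.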

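TCP: retransmission scenarios

lost ACK scenario (Host A, Host B):
• A sends Seq=92, 8 bytes of data
• B replies ACK=100; the ACK is lost (X)
• A times out and resends Seq=92, 8 bytes of data
• B replies ACK=100 (SendBase=100)

premature timeout (Host A, Host B):
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data (SendBase=92)
• B replies ACK=100, then ACK=120
• A times out before the ACKs arrive and resends Seq=92, 8 bytes of data
• the ACKs arrive at A (SendBase=100, then SendBase=120)
• B replies ACK=120 to the retransmission (SendBase=120)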

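TCP: retransmission scenarios

cumulative ACK (Host A, Host B):
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
• B replies ACK=100, which is lost (X), then ACK=120
• ACK=120 arrives before the timeout, so A retransmits nothing and sends Seq=120, 15 bytes of data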

TCP ACK generation [RFC 5681]

event at receiver → TCP receiver action

• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  → delayed ACK: wait up to 500 ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected
  → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap
  → immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  – likely that unacked segment lost, so don't wait for timeout
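A sketch of the duplicate-ACK counting behind fast retransmit is given below. The function name and the list-based interface are assumptions made for illustration; it counts duplicates after the first ACK for a given value and triggers on the third duplicate.

```python
def fast_retransmit_events(acks, send_base):
    """Return the seq #s that would be fast-retransmitted for an ACK stream.

    acks: ACK field values in arrival order.
    send_base: oldest unacknowledged byte before the first ACK arrives.
    """
    dup_count = 0
    retransmits = []
    for y in acks:
        if y > send_base:
            send_base = y          # new data ACKed, reset the counter
            dup_count = 0
        elif y == send_base:
            dup_count += 1         # duplicate ACK for already-ACKed data
            if dup_count == 3:
                retransmits.append(send_base)  # resend smallest unacked seq #
    return retransmits

# ACK=100 arrives once as a new ACK, then three more times as duplicates.
events = fast_retransmit_events([100, 100, 100, 100], send_base=92)
```

Only the third duplicate triggers a resend; further duplicates for the same value do not trigger another one in this sketch.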

TCP fast retransmit

[Figure: fast retransmit after sender receipt of triple duplicate ACK. Host A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data, which is lost (X). Host B answers ACK=100 to each arriving segment. After the first ACK=100 and three duplicate ACK=100s (3 DUP ACKs), Host A resends Seq=100, 20 bytes of data before its timeout expires.]

TCP flow control

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)

[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code into the TCP socket receiver buffers in the OS, from which the application process reads.]

TCP flow control

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which is split into buffered data and free buffer space (rwnd); buffered data is passed on to the application process.]
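The flow-control rule can be illustrated with two small helpers. Both function names and the byte counts are hypothetical; the sketch only encodes "rwnd is the free buffer space" and "in-flight data must not exceed rwnd".

```python
def advertised_rwnd(rcv_buffer, buffered_bytes):
    """Free buffer space the receiver advertises in the TCP header."""
    return rcv_buffer - buffered_bytes

def sendable_bytes(last_byte_sent, last_byte_acked, rwnd):
    """How many more bytes the sender may put in flight without overflow."""
    in_flight = last_byte_sent - last_byte_acked
    return max(0, rwnd - in_flight)

# Receiver with the 4096-byte default buffer and 1000 bytes not yet read.
rwnd = advertised_rwnd(rcv_buffer=4096, buffered_bytes=1000)
budget = sendable_bytes(last_byte_sent=5000, last_byte_acked=4000, rwnd=rwnd)
```

If the application stops reading, the buffered data grows, rwnd shrinks towards zero and the sender's budget drops with it.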

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

[Figure: client and server, each with an application and a network layer, hold the same kind of state. connection state: ESTAB; connection variables: seq # client-to-server and server-to-client, rcvBuffer size at server, client.]

client: Socket clientSocket = new Socket("hostname", "port number");

server: Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

client state: CLOSED; server state: LISTEN

• client chooses init seq num, x, and sends TCP SYN msg: SYNbit=1, Seq=x (client enters SYNSENT)
• server chooses init seq num, y, and sends TCP SYNACK msg, acking SYN: SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1 (server enters SYN RCVD)
• received SYNACK(x) indicates server is live; client sends ACK for SYNACK: ACKbit=1, ACKnum=y+1 (client enters ESTAB); this segment may contain client-to-server data
• received ACK(y) indicates client is live (server enters ESTAB)
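The sequence-number bookkeeping of the three messages can be sketched as follows. The dictionaries are a hypothetical stand-in for TCP headers, and x and y are passed in explicitly rather than chosen at random.

```python
def three_way_handshake(x, y):
    """Return the SYN, SYNACK and ACK segments for initial seq nums x and y."""
    syn = {"SYNbit": 1, "Seq": x}
    synack = {"SYNbit": 1, "Seq": y, "ACKbit": 1, "ACKnum": syn["Seq"] + 1}
    ack = {"ACKbit": 1, "ACKnum": synack["Seq"] + 1}
    return syn, synack, ack

# Hypothetical initial sequence numbers for client (x) and server (y).
syn, synack, ack = three_way_handshake(x=1000, y=5000)
```

Each side acknowledges the other's initial sequence number plus one, so the SYN itself consumes one sequence number in each direction.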

TCP 3-way handshake: FSM

client:
• closed → SYN sent: Socket clientSocket = new Socket("hostname", "port number"); send SYN(seq=x)
• SYN sent → ESTAB: receive SYNACK(seq=y, ACKnum=x+1); send ACK(ACKnum=y+1)

server:
• closed → listen: Socket connectionSocket = welcomeSocket.accept();
• listen → SYN rcvd: receive SYN(x); send SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
• SYN rcvd → ESTAB: receive ACK(ACKnum=y+1); no action (Λ)

TCP: closing a connection

• client, server each close their side of connection
  – send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

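TCP: closing a connection

client state: ESTAB; server state: ESTAB

• client calls clientSocket.close() and sends FINbit=1, seq=x; client enters FIN_WAIT_1 (can no longer send but can receive data)
• server replies ACKbit=1, ACKnum=x+1 and enters CLOSE_WAIT (can still send data)
• client receives the ACK and enters FIN_WAIT_2 (wait for server close)
• server sends FINbit=1, seq=y and enters LAST_ACK (can no longer send data)
• client replies ACKbit=1, ACKnum=y+1 and enters TIMED_WAIT (timed wait for 2 × max segment lifetime), then CLOSED
• server receives the ACK and enters CLOSED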

The "Two Army Problem"

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
• a top-10 problem

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission

[Figure: Host A and Host B send original data at rate λ_in into a router with unlimited shared output link buffers; throughput is λ_out.]

• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity

[Plots: λ_out versus λ_in, levelling off at R/2; delay versus λ_in, growing sharply as λ_in approaches R/2.]

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: λ_in = λ_out
  – transport-layer input includes retransmissions: λ'_in ≥ λ_in

[Figure: Host A and Host B send into a router with finite shared output link buffers. λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out: throughput.]

Causes/costs of congestion: scenario 2

idealization: perfect knowledge
• sender sends only when router buffers available

[Plot: λ_out versus λ_in, both axes up to R/2.]

idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost

[Plot: λ_out versus λ'_in, both axes up to R/2.] When sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)

realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered

[Plot: λ_out versus λ'_in, both axes up to R/2.] When sending at R/2, some packets are retransmissions, including duplicates that are delivered.

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  – decreasing goodput

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λ_in and λ'_in increase?

A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

[Figure: Hosts A, B, C and D send over multihop paths through routers with finite shared output link buffers. λ_in: original data; λ'_in: original data, plus retransmitted data; λ_out: throughput.]

another "cost" of congestion:
• when packet dropped, any upstream transmission capacity used for that packet was wasted

Causes/costs of congestion: scenario 3

[Plot: throughput by blue traffic (λ_out) versus offered load by Host A (λ'_in), both axes up to R/2. Annotation: bandwidth wastage for packets dropped at the 2nd router.]

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss

[Plot: cwnd (TCP sender congestion window size) versus time. AIMD sawtooth behavior, probing for bandwidth: additively increase window size until loss occurs, then cut window in half.]
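The sawtooth can be reproduced with a toy per-RTT simulation. This is a sketch under stated assumptions: cwnd is measured in MSS units, loss is supplied as a set of RTT indices instead of being caused by a real bottleneck, and the function name is hypothetical.

```python
def aimd_trace(num_rtts, loss_rtts, start_cwnd=10, mss=1):
    """Return cwnd (in MSS) at the start of each RTT under AIMD.

    loss_rtts: set of RTT indices in which a loss is detected.
    """
    cwnd = start_cwnd
    trace = []
    for rtt in range(num_rtts):
        trace.append(cwnd)
        if rtt in loss_rtts:
            cwnd = max(mss, cwnd / 2)   # multiplicative decrease
        else:
            cwnd = cwnd + mss           # additive increase: +1 MSS per RTT
    return trace

trace = aimd_trace(num_rtts=5, loss_rtts={2})
```

With a loss in the third RTT the window climbs 10, 11, 12, drops to 6 and then starts climbing again, which is one tooth of the sawtooth.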

TCP Congestion Control: details

• sender limits transmission:

    LastByteSent - LastByteAcked ≤ cwnd

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

    rate ≈ cwnd / RTT bytes/sec

[Figure: sender sequence number space, showing last byte ACKed, bytes sent but not-yet ACKed ("in-flight"), and last byte sent; the in-flight span is bounded by cwnd.]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A sends one segment to Host B, then two segments one RTT later, then four segments.]
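A short sketch shows why adding one MSS per ACK doubles the window every RTT. It assumes, for illustration only, that every segment sent in a round is ACKed within that round and that no loss occurs; the function name is hypothetical.

```python
def slow_start_rounds(num_rounds, mss=1):
    """Return the number of segments sent in each RTT during slow start."""
    cwnd = mss                        # initially cwnd = 1 MSS
    segments_per_round = []
    for _ in range(num_rounds):
        segments = cwnd // mss
        segments_per_round.append(segments)
        for _ in range(segments):     # one ACK comes back per segment
            cwnd += mss               # increment cwnd for every ACK received
    return segments_per_round

rounds = slow_start_rounds(4)
```

The first three rounds match the one, two and four segments of the timing diagram.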

TCP: detecting, reacting to loss

• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

initialization (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; enter slow start

slow start:
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh (Λ): enter congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; enter fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment

congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; enter fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; enter slow start

fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; enter congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; enter slow start
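The three-state machine can be sketched as a small Python class. This is an illustration under stated assumptions, not a faithful TCP implementation: the class name is hypothetical, cwnd is in bytes, the "+ 3" in the fast-recovery entry is read as 3 MSS, and retransmission itself is not modelled.

```python
class RenoSender:
    """Congestion window bookkeeping for slow start, CA and fast recovery."""

    def __init__(self, mss=1460):
        self.mss = mss
        self.cwnd = mss                 # cwnd = 1 MSS
        self.ssthresh = 64 * 1024       # ssthresh = 64 KB
        self.dup_acks = 0
        self.state = "slow_start"

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += self.mss
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += self.mss * self.mss / self.cwnd
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += self.mss
            return
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss  # assumption: 3 MSS
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = self.mss
        self.dup_acks = 0
        self.state = "slow_start"
```

Feeding the object a stream of `on_new_ack`, `on_dup_ack` and `on_timeout` calls reproduces the transitions listed in the summary.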

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is 3/4 W
  – avg. throughput is 3/4 W per RTT

    avg TCP throughput = (3/4) · W / RTT bytes/sec

[Plot: window size oscillating between W/2 and W over time, with average 3/4 W.]

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

    TCP throughput = 1.22 · MSS / (RTT · √L)

  – to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate
• new versions of TCP for high-speed
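The numbers in the example can be checked with a few lines of arithmetic. The function names are hypothetical, and the sketch assumes the 1500-byte segment size is used as the MSS and that throughput is expressed in bits per second.

```python
import math

def mathis_throughput_bps(mss_bytes, rtt_s, loss):
    """Throughput = 1.22 * MSS / (RTT * sqrt(L)), in bits per second."""
    return 1.22 * mss_bytes * 8 / (rtt_s * math.sqrt(loss))

def required_loss_rate(mss_bytes, rtt_s, target_bps):
    """Solve the same formula for L at a target throughput."""
    return (1.22 * mss_bytes * 8 / (rtt_s * target_bps)) ** 2

def in_flight_segments(mss_bytes, rtt_s, target_bps):
    """Window, in segments, needed to fill the pipe: rate * RTT / MSS."""
    return target_bps * rtt_s / (mss_bytes * 8)

window = in_flight_segments(1500, 0.100, 10e9)
loss = required_loss_rate(1500, 0.100, 10e9)
```

Here `window` comes out at roughly 83,333 segments and `loss` at roughly 2·10^-10, in line with the figures quoted above.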

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Plot: Connection 2 throughput versus Connection 1 throughput, both axes from 0 to R. The full bandwidth utilization line is the set of points (X1, Y1) where X1 + Y1 = R; the equal bandwidth share line is the set of points (X2, Y2) where X2 = Y2. The trajectory alternates between congestion avoidance (additive increase) and loss (decrease window by factor of 2).]

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: an IP datagram leaves the source with ECN=00 and is marked ECN=11 on its way to the destination; the destination returns a TCP ACK segment with ECE=1.]

Page 43: ChapterIII: Transport Layer

Pipelined protocols

pipelining: sender allows multiple, "in-flight", yet-to-be-acknowledged pkts
  – range of sequence numbers must be increased
  – buffering at sender and/or receiver

• two generic forms of pipelined protocols: Go-Back-N, Selective Repeat

Pipelining: increased utilization

[Figure: timing diagram between sender and receiver.
• first packet bit transmitted, t = 0
• last bit transmitted, t = L/R
• first packet bit arrives
• last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• ACK arrives, send next packet, t = RTT + L/R]

3-packet pipelining increases utilization by a factor of 3:

    U_sender = (3L/R) / (RTT + L/R) = 0.024 / 30.008 ≈ 0.0008
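The utilization figure can be recomputed directly. The packet size, link rate and RTT below are assumptions chosen to match the 0.008 ms transmission time and 30 ms RTT implied by the numbers above (8000-bit packets on a 1 Gbps link); the function name is hypothetical.

```python
def sender_utilization(packets_in_flight, packet_bits, rate_bps, rtt_s):
    """Fraction of time the sender is busy: n * (L/R) / (RTT + L/R)."""
    transmit_time = packet_bits / rate_bps          # L/R in seconds
    return packets_in_flight * transmit_time / (rtt_s + transmit_time)

# Assumed example values: L = 8000 bits, R = 1 Gbps, RTT = 30 ms.
L_BITS, R_BPS, RTT_S = 8000, 1e9, 0.030

u_stop_and_wait = sender_utilization(1, L_BITS, R_BPS, RTT_S)
u_pipelined = sender_utilization(3, L_BITS, R_BPS, RTT_S)
```

Sending three packets per round trip triples the utilization, yet the sender is still idle for more than 99.9% of the time, which motivates much larger windows.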

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  – doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  – when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  – when timer expires, retransmit only that unacked packet

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N, consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n: "cumulative ACK"
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

initialization (Λ):
    base = 1
    nextseqnum = 1

state: Wait

event: rdt_send(data)
    if (nextseqnum < base+N) {
        sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
        udt_send(sndpkt[nextseqnum])
        if (base == nextseqnum)
            start_timer
        nextseqnum++
    }
    else
        refuse_data(data)

event: timeout
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    …
    udt_send(sndpkt[nextseqnum-1])

event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
        stop_timer
    else
        start_timer

event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    Λ (no action)
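The sender FSM translates almost line for line into Python. The sketch below makes several simplifying assumptions: the class and the `wire` log are hypothetical, the timer is a boolean flag, checksums and corruption are omitted, and `base` is never moved backwards by a stale ACK.

```python
class GbnSender:
    """Go-Back-N sender events: rdt_send, timeout, ACK arrival."""

    def __init__(self, window_size):
        self.N = window_size
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}            # seq # -> packet kept for retransmission
        self.timer_running = False
        self.wire = []              # packets handed to the unreliable channel

    def rdt_send(self, data):
        if self.nextseqnum < self.base + self.N:
            pkt = (self.nextseqnum, data)
            self.sndpkt[self.nextseqnum] = pkt
            self.wire.append(pkt)               # udt_send
            if self.base == self.nextseqnum:
                self.timer_running = True       # start_timer
            self.nextseqnum += 1
            return True
        return False                            # refuse_data: window is full

    def timeout(self):
        self.timer_running = True               # start_timer
        for seq in range(self.base, self.nextseqnum):
            self.wire.append(self.sndpkt[seq])  # resend all unacked pkts

    def rdt_rcv_ack(self, acknum):
        # Cumulative ACK; max() guards against stale ACKs (an added safeguard).
        self.base = max(self.base, acknum + 1)
        self.timer_running = self.base != self.nextseqnum
```

With a window of 4, a fifth call to `rdt_send` is refused until an ACK advances `base`.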


GBN: receiver extended FSM

initialization (Λ):
    expectedseqnum = 1
    sndpkt = make_pkt(0, ACK, chksum)

state: Wait

event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum)
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++

event: default
    udt_send(sndpkt)

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don't buffer): no receiver buffering
  – re-ACK pkt with highest in-order seq #
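The receiver side is even smaller. In this sketch the class name and the `delivered` list are illustrative assumptions, corruption checks are left out, and the return value stands for the ACK number that would be sent.

```python
class GbnReceiver:
    """Go-Back-N receiver: deliver in-order packets, re-ACK otherwise."""

    def __init__(self):
        self.expectedseqnum = 1
        self.delivered = []         # data passed up to the application

    def rdt_rcv(self, seq, data):
        if seq == self.expectedseqnum:
            self.delivered.append(data)     # deliver_data
            self.expectedseqnum += 1
        # ACK the highest in-order seq # (0 before anything has arrived);
        # out-of-order packets are discarded, not buffered.
        return self.expectedseqnum - 1

receiver = GbnReceiver()
acks = [receiver.rdt_rcv(seq, "d%d" % seq) for seq in (1, 3, 2)]
```

Packet 3 arrives early, is discarded and triggers a duplicate ACK for 1; once packet 2 arrives the ACK advances to 2, and packet 3 has to be retransmitted.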


GBN in action

sender window (N=4), sequence numbers 0 1 2 3 4 5 6 7 8

sender:
• send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, then (wait)
• rcv ack0, send pkt4
• rcv ack1, send pkt5
• ignore duplicate ACK
• pkt 2 timeout: send pkt2, send pkt3, send pkt4, send pkt5

receiver:
• receive pkt0, send ack0
• receive pkt1, send ack1
• receive pkt3, discard, (re)send ack1
• receive pkt4, discard, (re)send ack1
• receive pkt5, discard, (re)send ack1
• rcv pkt2, deliver, send ack2
• rcv pkt3, deliver, send ack3
• rcv pkt4, deliver, send ack4
• rcv pkt5, deliver, send ack5


Selective repeat

• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #'s
  – limits seq #s of sent, unACKed packets

Selective repeat: sender, receiver windows

Selective repeat

sender:

data from above:
• if next available seq # in window, send pkt

timeout(n):
• resend pkt n, restart timer

ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver:

pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt

pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)

otherwise:
• ignore
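The receiver rules can be sketched as follows. The class name is hypothetical, and for simplicity the sequence numbers are unbounded integers, so the wraparound issue of a finite sequence number space does not arise here.

```python
class SrReceiver:
    """Selective repeat receiver: ACK individually, buffer out-of-order pkts."""

    def __init__(self, window_size):
        self.N = window_size
        self.rcv_base = 0
        self.buffer = {}            # seq # -> data waiting for in-order delivery
        self.delivered = []

    def on_packet(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcv_base <= n < self.rcv_base + self.N:
            self.buffer[n] = data
            # Deliver any in-order run and advance the window.
            while self.rcv_base in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcv_base))
                self.rcv_base += 1
            return n
        if self.rcv_base - self.N <= n < self.rcv_base:
            return n                # already delivered: re-ACK only
        return None                 # otherwise: ignore

receiver = SrReceiver(window_size=4)
acks = [receiver.on_packet(n, "pkt%d" % n) for n in (0, 1, 3, 4, 5, 2)]
```

Packets 3, 4 and 5 are ACKed and buffered while packet 2 is missing; when packet 2 finally arrives, all four are delivered in order.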

Selective repeat in action

sender window (N=4), sequence numbers 0 1 2 3 4 5 6 7 8

sender:
• send pkt0, send pkt1, send pkt2 (lost, X), send pkt3, then (wait)
• rcv ack0, send pkt4
• rcv ack1, send pkt5
• record ack3 arrived
• pkt 2 timeout: send pkt2
• record ack4 arrived
• record ack5 arrived

receiver:
• receive pkt0, send ack0
• receive pkt1, send ack1
• receive pkt3, buffer, send ack3
• receive pkt4, buffer, send ack4
• receive pkt5, buffer, send ack5
• rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

(a) no problem: sender sends pkt0, pkt1, pkt2 and the ACKs arrive; the sender window (after receipt) advances, and the sender sends pkt3 (lost, X) and then a new pkt0. The receiver window (after receipt) has also advanced, so the receiver will accept packet with seq number 0.

(b) oops: sender sends pkt0, pkt1, pkt2, but all three ACKs are lost (XXX); the sender times out and retransmits pkt0. The receiver window has advanced as in (a), so the receiver will accept packet with seq number 0.

receiver can't see sender side; receiver behavior identical in both cases; something's (very) wrong:
• receiver sees no difference in two scenarios
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
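One way to explore the question is to check, for a given sequence number space and window size, whether a retransmitted old packet can land inside the receiver's new window. The function below is a sketch with hypothetical names; it models only the worst case from (b), where the receiver has advanced by a full window and the sender has not advanced at all.

```python
def is_ambiguous(seq_space, window):
    """True if an old packet's seq # can fall inside the receiver's new window."""
    # Sender may still retransmit any of its first `window` packets.
    old = {n % seq_space for n in range(0, window)}
    # Receiver, having received them all, now expects the next `window` packets.
    new = {n % seq_space for n in range(window, 2 * window)}
    return bool(old & new)

dilemma = is_ambiguous(seq_space=4, window=3)   # the example above
safe = is_ambiguous(seq_space=4, window=2)
```

In this model the overlap disappears once the window size is at most half the size of the sequence number space, which is the usual answer to the question.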

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver

TCP segment structure

[Figure: TCP segment layout, 32 bits wide.
• source port #, dest port #
• sequence number
• acknowledgement number
• head len, not used, flag bits U A P R S F, receive window
• checksum, Urg data pointer
• options (variable length)
• application data (variable length)]

field notes:
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• sequence number, acknowledgement number: counting by bytes of data (not segments)
• checksum: Internet checksum (as in UDP)

TCP seq numbers ACKs

sequence numbersndashbyte stream ldquonumberrdquo of first byte in segmentrsquos data

acknowledgementsndashseq of next byte expected from other side

ndashcumulative ACKQ how receiver handles out-of-order segmentsndashA TCP spec doesnrsquot say ndashup to implementor

61

source port dest port

sequence numberacknowledgement number

checksum

rwndurg pointer

incoming segment to sender

A

sent ACKed

sent not-yet ACKed(ldquoin-flightrdquo)

usablebut not yet sent

not usable

window sizeN

sender sequence number space

source port dest port

sequence numberacknowledgement number

checksum

rwndurg pointer

outgoing segment from sender

Byte stream in TCP

62

Window N bytes

HTTP Get Message (K bytes)

100th byte

TCP header(seq no = 100)

M bytes

HTTP Get Message (K bytes)

Cannot be transmitted now

TCP seq numbers ACKs

63

UsertypeslsquoCrsquo

host ACKsreceipt

of echoedlsquoCrsquo

host ACKsreceipt oflsquoCrsquo echoesback lsquoCrsquo

simple telnet scenario

Host BHost A

Seq=42 ACK=79 data = lsquoCrsquo

Seq=79 ACK=43 data = lsquoCrsquo

Seq=43 ACK=80

TCP round trip time timeout

Q how to set TCP timeout value

bull longer than RTTndash but RTT varies

bull too short premature timeout unnecessary retransmissions

bull too long slow reaction to segment loss

Q how to estimate RTTbull SampleRTT measured

time from segment transmission until ACK receiptndash ignore retransmissions

bull SampleRTT will vary want estimated RTT ldquosmootherrdquondash average several recent

measurements not just current SampleRTT

64

RTT gaiacsumassedu to fantasiaeurecomfr

100

150

200

250

300

350

1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106time (seconnds)

RTT

(mill

iseco

nds)

SampleRTT Estimated RTT

EstimatedRTT = (1- a)EstimatedRTT + aSampleRTT

v exponential weighted moving averagev influence of past sample decreases exponentially fastv typical value a = 0125

TCP round trip time timeout

65

RTT

(milli

seco

nds)

RTT gaiacsumassedu to fantasiaeurecomfr

sampleRTTEstimatedRTT

time (seconds)

TCP round trip time timeout

bull timeout interval EstimatedRTT plus ldquosafety marginrdquondash large variation in EstimatedRTT egrave larger safety margin

bull estimate SampleRTT deviation from EstimatedRTT

66

DevRTT = (1-b)DevRTT +b|SampleRTT-EstimatedRTT|

(typically b = 025)

TimeoutInterval = EstimatedRTT + 4DevRTT

estimated RTT ldquosafety marginrdquo

TCP reliable data transfer

bull TCP creates rdt service on top of IPrsquos unreliable servicendash pipelined segmentsndash cumulative acksndash single retransmission timer

bull retransmissions triggered byndash timeout eventsndash duplicate acks

67

letrsquos initially consider simplified TCP senderndash ignore duplicate acksndash ignore flow control

congestion control

TCP sender events

data rcvd from appbull create segment with seq bull seq is byte-stream

number of first data byte in segment

bull start timer if not already running ndash think of timer as for oldest

unacked segmentndash expiration interval TimeOutInterval

timeoutbull retransmit segment that

caused timeoutbull restart timerack rcvdbull if ack acknowledges

previously unackedsegmentsndash update what is known to

be ACKedndash start timer if there are still

unacked segments

68

TCP sender (simplified)

69

waitfor event

NextSeqNum = InitialSeqNumSendBase = InitialSeqNum

L

create segment seq NextSeqNumpass segment to IP (ie ldquosendrdquo)NextSeqNum = NextSeqNum + length(data) if (timer currently not running)

start timer

data received from application above

retransmit not-yet-acked segment with smallest seq

start timer

timeout

if (y gt SendBase) SendBase = y SendBasendash1 last cumulatively ACKed byte if (there are currently not-yet-acked segments)

start timerelse stop timer

ACK received with ACK field value y

TCP retransmission scenarios

70

lost ACK scenario

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=92 8 bytes of data

Xtimeo

ut

ACK=100

premature timeout

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=92 8bytes of data

timeo

ut

ACK=120

Seq=100 20 bytes of data

ACK=120

SendBase=100

SendBase=120

SendBase=120

SendBase=92

TCP retransmission scenarios

71

X

cumulative ACK

Host BHost A

Seq=92 8 bytes of data

ACK=100

Seq=120 15 bytes of data

timeo

ut

Seq=100 20 bytes of data

ACK=120

TCP ACK generation [RFC 5861]

72

event at receiver

arrival of in-order segment withexpected seq All data up toexpected seq already ACKed

arrival of in-order segment withexpected seq One other segment has ACK pending

arrival of out-of-order segmenthigher-than-expect seq Gap detected

arrival of segment that partially or completely fills gap

TCP receiver action

delayed ACK Wait up to 500msfor next segment If no next segmentsend ACK

immediately send single cumulative ACK ACKing both in-order segments

immediately send duplicate ACKindicating seq of next expected byte

immediate send ACK provided thatsegment starts at lower end of gap

TCP fast retransmit

bull time-out period often relatively longndash long delay before resending

lost packet

bull detect lost segments via duplicate ACKsndash sender often sends many

segments back-to-backndash if segment is lost there will

likely be many duplicate ACKs

73

if sender receives 3 ACKs for same data(ldquotriple duplicate ACKsrdquo)resend unackedsegment with smallest seq sect likely that unacked

segment lost so donrsquot wait for timeout

TCP fast retransmit

(ldquotriple duplicate ACKsrdquo)

X

fast retransmit after sender receipt of triple duplicate ACK

Host BHost A

Seq=92 8 bytes of data

ACK=100

timeo

ut ACK=100

ACK=100

ACK=100

TCP fast retransmit

74

Seq=100 20 bytes of data

Seq=100 20 bytes of data

3 DUP ACKs

TCP flow control

75

applicationprocess

TCP socketreceiver buffers

TCPcode

IPcode

applicationOS

receiver protocol stack

application may remove data from

TCP socket buffers hellip

hellip slower than TCP receiver is delivering(sender is sending)

from sender

receiver controls sender so sender wonrsquot overflow receiverrsquos buffer by transmitting too much too fast

flow control

TCP flow control

bull receiver ldquoadvertisesrdquo free buffer space by including rwnd value in TCP header of receiver-to-sender segmentsndash RcvBuffer size set via socket

options (typical default is 4096 bytes)ndash many operating systems autoadjustRcvBuffer

bull sender limits amount of unacked(ldquoin-flightrdquo) data to receiverrsquos rwnd value

bull guarantees receive buffer will not overflow

76

buffered data

free buffer spacerwnd

RcvBuffer

TCP segment payloads

to application process

receiver-side buffering

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

[Figure: client and server each sit between the application and the network, and each holds
  connection state: ESTAB
  connection variables: seq # client-to-server and server-to-client; rcvBuffer size at server, client]

Socket clientSocket = new Socket(hostname, portNumber);

Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

client state: CLOSED → SYNSENT → ESTAB
server state: LISTEN → SYN RCVD → ESTAB

• client: choose init seq num x, send TCP SYN msg (SYNbit=1, Seq=x)
• server: choose init seq num y, send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1)
• client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data
• server: received ACK(y) indicates client is live
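The sequence and acknowledgement numbers exchanged in the handshake can be sketched as follows. This is an illustration only: the dictionaries stand in for TCP headers, and the function name is hypothetical.

```python
SEQ_SPACE = 2**32  # TCP sequence numbers are 32-bit and wrap around


def three_way_handshake(x, y):
    """Return the three handshake segments for client ISN x and server ISN y."""
    syn = {"SYN": 1, "seq": x}
    # Server acknowledges the SYN: ACKnum = x + 1
    synack = {"SYN": 1, "ACK": 1, "seq": y,
              "acknum": (syn["seq"] + 1) % SEQ_SPACE}
    # Client acknowledges the SYNACK: ACKnum = y + 1
    ack = {"ACK": 1, "acknum": (synack["seq"] + 1) % SEQ_SPACE}
    return syn, synack, ack
```

A real stack chooses x and y unpredictably; fixed values are passed in here so the arithmetic is easy to follow.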

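TCP 3-way handshake: FSM

[Figure: handshake FSM with states closed, listen, SYN rcvd, SYN sent, ESTAB. Transitions (event / action):
• closed → listen (server): Socket connectionSocket = welcomeSocket.accept(); / Λ
• closed → SYN sent (client): Socket clientSocket = new Socket(hostname, portNumber); / send SYN(seq=x)
• listen → SYN rcvd: receive SYN(x) / send SYNACK(seq=y, ACKnum=x+1), create new socket for communication back to client
• SYN sent → ESTAB: receive SYNACK(seq=y, ACKnum=x+1) / send ACK(ACKnum=y+1)
• SYN rcvd → ESTAB: receive ACK(ACKnum=y+1) / Λ]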

TCP: closing a connection

• client, server each close their side of connection
  - send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

[Figure: connection teardown; client state and server state both start in ESTAB.
• client calls clientSocket.close(), sends FINbit=1, seq=x, and enters FIN_WAIT_1 (can no longer send but can receive data)
• server replies ACKbit=1, ACKnum=x+1 and enters CLOSE_WAIT (can still send data); client enters FIN_WAIT_2 (wait for server close)
• server sends FINbit=1, seq=y and enters LAST_ACK (can no longer send data)
• client replies ACKbit=1, ACKnum=y+1 and enters TIMED_WAIT (timed wait for 2 * max segment lifetime), then CLOSED
• server enters CLOSED]

The "Two Army Problem"

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control
• manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)
• a top-10 problem

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λ_in, approaches capacity

[Figure: Host A and Host B each send original data at rate λ_in into a router with unlimited shared output link buffers; throughput is λ_out. Plot 1: λ_out versus λ_in, both axes up to R/2. Plot 2: delay versus λ_in, growing sharply as λ_in approaches R/2.]

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  - application-layer input = application-layer output: λ_in = λ_out
  - transport-layer input includes retransmissions: λ'_in ≥ λ_in

[Figure: Host A and Host B send into a router with finite shared output link buffers. λ_in: original data; λ'_in: original data plus retransmitted data; λ_out: throughput.]

Causes/costs of congestion: scenario 2

idealization: perfect knowledge
• sender sends only when router buffers available

[Figure: finite shared output link buffers. The sender keeps a copy of each packet and sends when the router has free buffer space, and holds back when there is no buffer space. λ_in: original data; λ'_in: original data plus retransmitted data. Plot: λ_out versus λ_in, both axes up to R/2.]

Causes/costs of congestion: scenario 2

idealization: known loss
• packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost

[Plot: λ_out versus λ'_in, both axes up to R/2.]
when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)

realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered

[Plot: λ_out versus λ'_in, both axes up to R/2.]
when sending at R/2, some packets are retransmissions, including duplicates that are delivered

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λ_in and λ'_in increase?
A: as red λ'_in increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

[Figure: Hosts A, B, C, D send over multihop paths through routers with finite shared output link buffers. λ_in: original data; λ'_in: original data plus retransmitted data; λ_out: throughput.]

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted

[Plot: throughput by blue traffic (λ_out) versus offered load by Host A (λ'_in), axes up to R/2, annotated "bandwidth wastage for packets dropped at the 2nd router".]

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss

[Plot: cwnd (TCP sender congestion window size) versus time. AIMD saw tooth behavior, probing for bandwidth: additively increase window size … until loss occurs (then cut window in half).]
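The saw tooth can be reproduced with a few lines. This is a toy sketch under stated assumptions: time advances in whole RTTs, cwnd is counted in MSS units, and the set of RTTs in which a loss is detected (`loss_at`) is supplied by hand.

```python
def aimd_trace(rtts, loss_at, mss=1, cwnd=10):
    """Return cwnd at the start of each RTT under AIMD."""
    trace = []
    for t in range(rtts):
        trace.append(cwnd)
        if t in loss_at:
            cwnd = max(mss, cwnd / 2)   # multiplicative decrease
        else:
            cwnd += mss                 # additive increase: 1 MSS per RTT
    return trace
```

Starting from cwnd = 10 with a loss in RTT 2, the window climbs 10, 11, 12, drops to 6, and resumes climbing one MSS per RTT.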

TCP Congestion Control: details

• sender limits transmission:

    $\mathrm{LastByteSent} - \mathrm{LastByteAcked} \le \mathrm{cwnd}$

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

    $\text{rate} \approx \dfrac{\mathrm{cwnd}}{\mathrm{RTT}}$ bytes/sec

[Figure: sender sequence number space, showing last byte ACKed, bytes sent but not-yet ACKed ("in-flight"), and last byte sent; the in-flight span is bounded by cwnd.]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A sends one segment to Host B; one RTT later, two segments; one RTT after that, four segments.]
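Why "increment cwnd for every ACK" amounts to doubling per RTT can be seen in a short sketch. It assumes no loss, one ACK per segment, and cwnd counted in MSS units; the function name is hypothetical.

```python
def slow_start_rounds(rounds, mss=1):
    """Return the number of segments sent in each RTT during slow start."""
    cwnd = mss                          # initially cwnd = 1 MSS
    sent_per_round = []
    for _ in range(rounds):
        segments = cwnd // mss
        sent_per_round.append(segments)
        for _ in range(segments):       # one ACK comes back per segment
            cwnd += mss                 # each ACK grows cwnd by 1 MSS
    return sent_per_round
```

Each round returns as many ACKs as segments were sent, so the window doubles: 1, 2, 4, 8 segments.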

TCP: detecting, reacting to loss

• loss indicated by timeout:
  - cwnd set to 1 MSS
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

[FSM with three states: slow start, congestion avoidance, fast recovery. Each transition is listed as event, then actions; Λ means no event]

initially (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0 → slow start

slow start:
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment
• cwnd > ssthresh (Λ) → congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment → fast recovery

congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment → slow start
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment → fast recovery

fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0 → congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment → slow start
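The three-state summary can be written out as a small state machine. This is a sketch of the slide's FSM only: it tracks cwnd, ssthresh and the state, and leaves out actual transmission and retransmission. The class name and the MSS value of 1460 bytes are assumptions for illustration.

```python
MSS = 1460  # bytes; illustrative value, not taken from the slides


class RenoSender:
    """Tracks cwnd/ssthresh through slow start, congestion avoidance, fast recovery."""

    def __init__(self):
        self.state = "slow start"
        self.cwnd = MSS
        self.ssthresh = 64 * 1024
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "fast recovery":
            self.cwnd = self.ssthresh              # deflate the window
            self.state = "congestion avoidance"
        elif self.state == "slow start":
            self.cwnd += MSS                       # exponential growth
            if self.cwnd > self.ssthresh:
                self.state = "congestion avoidance"
        else:
            self.cwnd += MSS * MSS / self.cwnd     # about 1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast recovery":
            self.cwnd += MSS
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                     # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = MSS
        self.dup_acks = 0
        self.state = "slow start"
```

Feeding it three new ACKs, then three duplicate ACKs, then a new ACK walks through slow start, fast recovery and congestion avoidance in turn.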

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  - ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. throughput is 3/4 W per RTT

    $\text{avg TCP throughput} = \dfrac{3}{4}\,\dfrac{W}{RTT}$ bytes/sec

[Plot: window size over time, a saw tooth oscillating between W/2 and W.]

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

    $\text{TCP throughput} = \dfrac{1.22 \cdot MSS}{RTT\,\sqrt{L}}$

  to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate
• new versions of TCP for high-speed
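The two numbers in the example can be checked directly from the formula. The helper names below are hypothetical; the arithmetic is just the slide's expression solved for W and for L.

```python
def required_window_segments(rate_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain rate_bps over a path with rtt_s."""
    return rate_bps * rtt_s / (8 * mss_bytes)


def required_loss_rate(rate_bps, rtt_s, mss_bytes):
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L."""
    return (1.22 * 8 * mss_bytes / (rtt_s * rate_bps)) ** 2
```

For 1500-byte segments, a 100 ms RTT and 10 Gbps, these give roughly 83,333 segments and a loss rate of about 2 · 10^-10.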

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Plot: Connection 2 throughput versus Connection 1 throughput, both axes up to R, with the "equal bandwidth share" line and the "full bandwidth utilization" line. The operating point alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2". Marked points: (X1, Y1) where X1+Y1 = R; (X2, Y2) where X2 = Y2.]
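The convergence argument can be tried numerically. The sketch below is an idealized model, not TCP itself: both flows see the same RTT, detect loss at the same instant whenever their combined rate exceeds the capacity, and rates are in arbitrary units.

```python
def aimd_two_flows(x, y, capacity, steps):
    """Evolve two AIMD flows sharing one bottleneck; return their final rates."""
    for _ in range(steps):
        if x + y > capacity:
            x, y = x / 2, y / 2      # loss: both halve (multiplicative decrease)
        else:
            x, y = x + 1, y + 1      # no loss: both add 1 (additive increase)
    return x, y
```

Additive increase leaves the gap between the two rates unchanged, while every halving cuts it in half, so the flows drift toward the equal-share line. Starting from rates 1 and 60 on a link of capacity 100, the gap of 59 shrinks below 1 within 200 steps.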

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
• instead use UDP:
  - send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets about R/2 (11 of the 20 connections)

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and is marked ECN=11 by a router on the path; the destination returns a TCP ACK segment with ECE=1.]

Page 44: ChapterIII: Transport Layer

Pipelining: increased utilization

[Figure: sender/receiver timeline.
• first packet bit transmitted, t = 0
• last bit transmitted, t = L / R
• first packet bit arrives
• last packet bit arrives, send ACK
• last bit of 2nd packet arrives, send ACK
• last bit of 3rd packet arrives, send ACK
• ACK arrives, send next packet, t = RTT + L / R]

3-packet pipelining increases utilization by a factor of 3

    $U_{sender} = \dfrac{3L/R}{RTT + L/R} = \dfrac{0.024}{30.008} \approx 0.0008$
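The utilization figure can be recomputed in a few lines. The values L = 8000 bits, R = 1 Gbps and RTT = 30 ms are assumptions, chosen because they reproduce the 30.008 ms denominator on the slide.

```python
def sender_utilization(n_packets, packet_bits, rate_bps, rtt_s):
    """Fraction of time the sender is busy when n_packets are pipelined."""
    transmit = packet_bits / rate_bps            # L / R, in seconds
    return n_packets * transmit / (rtt_s + transmit)
```

With these values, one packet per round trip keeps the sender busy about 0.027% of the time, and pipelining three packets triples that to about 0.08%.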

Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n ("cumulative ACK")
  - may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

[FSM with a single state, Wait; each transition is listed as event, then actions]

initially (Λ):
    base = 1
    nextseqnum = 1

rdt_send(data):
    if (nextseqnum < base+N) {
        sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
        udt_send(sndpkt[nextseqnum])
        if (base == nextseqnum)
            start_timer
        nextseqnum++
    }
    else
        refuse_data(data)

timeout:
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    …
    udt_send(sndpkt[nextseqnum-1])

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt):
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
        stop_timer
    else
        start_timer

rdt_rcv(rcvpkt) && corrupt(rcvpkt):
    Λ
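The sender FSM translates fairly directly into code. This is a minimal sketch: the timer is reduced to a boolean flag, `send` is a caller-supplied stand-in for udt_send, checksums are omitted, and an ACK below `base` is simply ignored (a small simplification of the FSM, which would restart the timer).

```python
class GBNSender:
    """Go-Back-N sender following the extended FSM, with a flag for the timer."""

    def __init__(self, window, send):
        self.N = window
        self.send = send              # stand-in for udt_send
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}              # seq # -> packet, kept until ACKed
        self.timer_running = False

    def rdt_send(self, data):
        if self.nextseqnum >= self.base + self.N:
            return False              # window full: refuse_data
        self.sndpkt[self.nextseqnum] = (self.nextseqnum, data)
        self.send(self.sndpkt[self.nextseqnum])
        if self.base == self.nextseqnum:
            self.timer_running = True
        self.nextseqnum += 1
        return True

    def on_ack(self, acknum):
        if acknum < self.base:
            return                    # duplicate ACK: nothing new
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)            # cumulative ACK
        self.base = acknum + 1
        self.timer_running = self.base != self.nextseqnum

    def on_timeout(self):
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.send(self.sndpkt[seq])           # resend whole window
```

With N = 4, a fifth call to `rdt_send` is refused until an ACK slides the window, and a timeout resends every packet from `base` up to `nextseqnum - 1`.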


GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
• out-of-order pkt:
  - discard (don't buffer): no receiver buffering
  - re-ACK pkt with highest in-order seq #

[FSM with a single state, Wait; each transition is listed as event, then actions]

initially (Λ):
    expectedseqnum = 1
    sndpkt = make_pkt(0, ACK, chksum)

rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++

default:
    udt_send(sndpkt)


GBN in action

[Figure: sender/receiver timeline; sender window (N=4) sliding over seq #s 0 to 8; pkt2 is lost (X).]

sender: send pkt0, send pkt1, send pkt2 (lost), send pkt3, (wait)
receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, discard, (re)send ack1
sender: rcv ack0, send pkt4; rcv ack1, send pkt5; ignore duplicate ACK
receiver: receive pkt4, discard, (re)send ack1; receive pkt5, discard, (re)send ack1
sender: pkt 2 timeout; send pkt2, send pkt3, send pkt4, send pkt5
receiver: rcv pkt2, deliver, send ack2; rcv pkt3, deliver, send ack3; rcv pkt4, deliver, send ack4; rcv pkt5, deliver, send ack5


Selective repeat

• receiver individually acknowledges all correctly received packets
  - buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  - sender timer for each unACKed packet
• sender window
  - N consecutive seq #s
  - limits seq #s of sent, unACKed packets

Selective repeat: sender, receiver windows

Selective repeat

sender:

data from above:
• if next available seq # in window, send pkt

timeout(n):
• resend pkt n, restart timer

ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver:

pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt

pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)

otherwise:
• ignore
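The receiver rules can be sketched as follows. To keep the example short it assumes unbounded sequence numbers (no wraparound) and uncorrupted packets; the class and attribute names are hypothetical.

```python
class SRReceiver:
    """Selective-repeat receiver: buffers out-of-order packets, delivers in order."""

    def __init__(self, window):
        self.N = window
        self.rcvbase = 0
        self.buffer = {}          # seq # -> data waiting for in-order delivery
        self.delivered = []       # data handed to the upper layer, in order

    def on_packet(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            # deliver any run of in-order packets and advance the window
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n              # already delivered: re-ACK so sender can advance
        return None               # otherwise: ignore
```

Replaying the arrival order 0, 1, 3, 4, 5, 2 buffers packets 3 to 5, then delivers 2 through 5 together once packet 2 arrives.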

Selective repeat in action

[Figure: sender/receiver timeline; sender window (N=4) sliding over seq #s 0 to 8; pkt2 is lost (X).]

sender: send pkt0, send pkt1, send pkt2 (lost), send pkt3, (wait)
receiver: receive pkt0, send ack0; receive pkt1, send ack1; receive pkt3, buffer, send ack3
sender: rcv ack0, send pkt4; rcv ack1, send pkt5; record ack3 arrived
receiver: receive pkt4, buffer, send ack4; receive pkt5, buffer, send ack5
sender: pkt 2 timeout; send pkt2; record ack4 arrived; record ack5 arrived
receiver: rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #s: 0, 1, 2, 3
• window size=3

• receiver sees no difference in two scenarios
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?

[Figure: two scenarios, each showing the sender window (after receipt) and the receiver window (after receipt) over seq #s 0 1 2 3 0 1 2.
(a) no problem: pkt0, pkt1, pkt2 are sent and ACKed; the sender moves on to pkt3 (lost, X) and a new pkt0; the receiver will accept packet with seq number 0.
(b) oops: pkt0, pkt1, pkt2 are sent, but all three ACKs are lost (X X X); the sender times out and retransmits pkt0; the receiver will accept packet with seq number 0.
The receiver can't see the sender side; receiver behavior is identical in both cases. Something's (very) wrong.]

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
  - one sender, one receiver
• reliable, in-order byte stream:
  - no "message boundaries"
• pipelined:
  - TCP congestion and flow control set window size
• full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
• connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  - sender will not overwhelm receiver

TCP segment structure

[Figure: TCP segment layout, 32 bits wide:
  source port # | dest port #
  sequence number
  acknowledgement number
  head len | not used | U A P R S F | receive window
  checksum | Urg data pointer
  options (variable length)
  application data (variable length)]

• sequence number, acknowledgement number: counting by bytes of data (not segments)
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)

TCP seq. numbers, ACKs

sequence numbers:
  - byte stream "number" of first byte in segment's data

acknowledgements:
  - seq # of next byte expected from other side
  - cumulative ACK

Q: how receiver handles out-of-order segments
  - A: TCP spec doesn't say; up to implementor

[Figure: an outgoing segment from sender and an incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer. Sender sequence number space with window size N: sent ACKed | sent, not-yet ACKed ("in-flight") | usable but not yet sent | not usable.]

Byte stream in TCP

[Figure: byte stream in TCP. Labels: HTTP Get Message (K bytes); 100th byte; TCP header (seq no = 100); M bytes; Window N bytes; Cannot be transmitted now.]

TCP seq. numbers, ACKs

simple telnet scenario (Host A, Host B):
• User types 'C': Host A sends Seq=42, ACK=79, data = 'C'
• host ACKs receipt of 'C', echoes back 'C': Host B sends Seq=79, ACK=43, data = 'C'
• host ACKs receipt of echoed 'C': Host A sends Seq=43, ACK=80

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  - but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  - average several recent measurements, not just current SampleRTT

    $\mathrm{EstimatedRTT} = (1-\alpha)\cdot\mathrm{EstimatedRTT} + \alpha\cdot\mathrm{SampleRTT}$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: $\alpha = 0.125$

[Plot: RTT, gaia.cs.umass.edu to fantasia.eurecom.fr. SampleRTT and EstimatedRTT in milliseconds (100 to 350) versus time in seconds (1 to 106).]


TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  - large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

    $\mathrm{DevRTT} = (1-\beta)\cdot\mathrm{DevRTT} + \beta\cdot|\mathrm{SampleRTT}-\mathrm{EstimatedRTT}|$    (typically, $\beta = 0.25$)

    $\mathrm{TimeoutInterval} = \mathrm{EstimatedRTT} + 4\cdot\mathrm{DevRTT}$

  (first term: estimated RTT; second term: "safety margin")
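The estimator and the timeout formula fit in a small class. Two details are assumptions rather than anything the slides state: the initial values (EstimatedRTT set to the first sample, DevRTT to half of it, as RFC 6298 does) and the update order (DevRTT is updated using the old EstimatedRTT).

```python
class RttEstimator:
    """EWMA-based RTT estimate and retransmission timeout (times in ms)."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2      # assumed initial value

    def update(self, sample_rtt):
        # Deviation is measured against the estimate before it is updated.
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt
```

After a first sample of 100 ms and a second of 200 ms, the estimate moves only to 112.5 ms while the deviation term widens the timeout to 362.5 ms.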

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
• retransmissions triggered by:
  - timeout events
  - duplicate acks

let's initially consider simplified TCP sender:
  - ignore duplicate acks
  - ignore flow control, congestion control

TCP sender events:

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments

TCP sender (simplified)

[FSM with a single state, wait for event; each transition is listed as event, then actions]

initially (Λ):
    NextSeqNum = InitialSeqNum
    SendBase = InitialSeqNum

data received from application above:
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
        start timer

timeout:
    retransmit not-yet-acked segment with smallest seq. #
    start timer

ACK received, with ACK field value y:
    if (y > SendBase) {
        SendBase = y
        /* SendBase-1: last cumulatively ACKed byte */
        if (there are currently not-yet-acked segments)
            start timer
        else stop timer
    }
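The three events of the simplified sender can be sketched in code. As on the slide, duplicate ACKs, flow control and congestion control are ignored; the timer is a boolean flag, `send` is a caller-supplied stand-in for passing a segment to IP, and the class name is hypothetical.

```python
class SimpleTcpSender:
    """Simplified TCP sender: cumulative ACKs and a single retransmission timer."""

    def __init__(self, initial_seq, send):
        self.send = send
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.unacked = {}               # seq # of first byte -> segment data
        self.timer_running = False

    def on_app_data(self, data):
        self.unacked[self.next_seq] = data
        self.send((self.next_seq, data))
        self.next_seq += len(data)      # seq #s count bytes, not segments
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if self.unacked:
            seq = min(self.unacked)     # oldest not-yet-acked segment only
            self.send((seq, self.unacked[seq]))
            self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y          # bytes up to y-1 are now ACKed
            fully_acked = [s for s, d in self.unacked.items() if s + len(d) <= y]
            for seq in fully_acked:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)
```

Sending 8 bytes at seq # 92 and then 20 bytes at seq # 100, a single ACK=120 clears both segments and stops the timer, even if ACK=100 never arrives.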

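TCP: retransmission scenarios

lost ACK scenario (Host A, Host B):
• A sends Seq=92, 8 bytes of data
• B replies ACK=100, which is lost (X)
• A's timeout expires; A resends Seq=92, 8 bytes of data
• B replies ACK=100 again; SendBase=100

premature timeout (Host A, Host B; SendBase=92 at the start):
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
• B replies ACK=100, then ACK=120
• A's timeout expires before ACK=100 arrives; A resends Seq=92, 8 bytes of data
• the ACKs arrive: SendBase=100, then SendBase=120
• B replies ACK=120 to the retransmission; SendBase=120

cumulative ACK (Host A, Host B):
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
• B replies ACK=100, which is lost (X), then ACK=120
• ACK=120 arrives before the timeout, so nothing is retransmitted; A goes on to send Seq=120, 15 bytes of data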

TCP ACK generation [RFC 5681]

event at receiver → TCP receiver action

• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  → delayed ACK: wait up to 500ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected
  → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap
  → immediately send ACK, provided that segment starts at lower end of gap






Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss

[Plot: cwnd (TCP sender congestion window size) versus time. AIMD sawtooth behavior, probing for bandwidth: additively increase window size until loss occurs, then cut window in half.]
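The sawtooth described above can be reproduced with a few lines of Python. This is a minimal sketch rather than TCP itself: the function name `aimd_trace`, the `capacity_mss` parameter, and the rule that a loss is detected whenever cwnd exceeds that capacity are assumptions made purely for illustration.

```python
def aimd_trace(capacity_mss, rounds, start_mss=1):
    """Return cwnd (in MSS) at the start of each RTT under idealized AIMD.

    Assumption: a loss is detected exactly when cwnd exceeds capacity_mss.
    """
    cwnd = start_mss
    trace = []
    for _ in range(rounds):
        trace.append(cwnd)
        if cwnd > capacity_mss:
            cwnd = max(1, cwnd // 2)   # multiplicative decrease: halve cwnd
        else:
            cwnd += 1                  # additive increase: +1 MSS per RTT
    return trace


trace = aimd_trace(capacity_mss=8, rounds=16)
print(trace)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 4, 5, 6, 7, 8, 9, 4]
```

The window climbs by one MSS per round, overshoots the assumed capacity, is halved, and then climbs again, which is the sawtooth from the plot.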

TCP Congestion Control: details

• sender limits transmission:

  $\text{LastByteSent} - \text{LastByteAcked} \le \text{cwnd}$

• cwnd is dynamic, a function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

  $\text{rate} \approx \frac{\text{cwnd}}{RTT}$ bytes/sec

[Figure: sender sequence number space, showing last byte ACKed; bytes sent but not-yet ACKed ("in-flight"), spanning cwnd; last byte sent.]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: timeline between Host A and Host B. Host A sends one segment, then two segments one RTT later, then four segments.]
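The per-ACK increment is what produces the doubling per RTT. The sketch below counts in units of MSS and assumes, for illustration only, that every segment sent in a round is ACKed within that round; `slow_start_trace` is a hypothetical helper name.

```python
def slow_start_trace(rtts, mss=1):
    """Return cwnd at the start of each RTT during loss-free slow start."""
    cwnd = mss
    trace = []
    for _ in range(rtts):
        trace.append(cwnd)
        acks_this_rtt = cwnd // mss      # one ACK per segment sent this round
        for _ in range(acks_this_rtt):
            cwnd += mss                  # +1 MSS for every ACK received
    return trace


print(slow_start_trace(5))  # [1, 2, 4, 8, 16]
```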

TCP: detecting, reacting to loss

• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

State: slow start
    initialization (Λ):  cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0
    new ACK:             cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
    duplicate ACK:       dupACKcount++
    cwnd > ssthresh:     Λ; go to congestion avoidance
    dupACKcount == 3:    ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
    timeout:             ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; stay in slow start

State: congestion avoidance
    new ACK:             cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
    duplicate ACK:       dupACKcount++
    dupACKcount == 3:    ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
    timeout:             ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start

State: fast recovery
    duplicate ACK:       cwnd = cwnd + MSS; transmit new segment(s), as allowed
    new ACK:             cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
    timeout:             ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start
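The three-state machine above can be sketched as a small Python class. This is an illustrative model only: it counts cwnd and ssthresh in units of MSS, the class name `RenoSender` and its method names are hypothetical, the `ssthresh_mss` parameter stands in for the 64 KB initial value on the slide, and segment transmission itself is left out.

```python
MSS = 1.0  # assumption: cwnd and ssthresh are counted in units of MSS


class RenoSender:
    def __init__(self, ssthresh_mss):
        self.state = "slow_start"
        self.cwnd = 1.0 * MSS
        self.ssthresh = float(ssthresh_mss)
        self.dup_acks = 0

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh              # deflate window, resume CA
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += MSS                       # exponential growth per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += MSS * (MSS / self.cwnd)   # about +1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += MSS                       # window inflation
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                     # triple duplicate ACK
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1.0 * MSS
        self.dup_acks = 0
        self.state = "slow_start"


s = RenoSender(ssthresh_mss=4)
for _ in range(4):
    s.on_new_ack()          # cwnd: 2, 3, 4, 5 -> crosses ssthresh
print(s.state, s.cwnd)      # congestion_avoidance 5.0
for _ in range(3):
    s.on_dup_ack()
print(s.state, s.cwnd, s.ssthresh)  # fast_recovery 5.5 2.5
```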

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is 3/4 W
  – avg. throughput is 3/4 W per RTT

  $\text{avg TCP throughput} = \frac{3}{4} \cdot \frac{W}{RTT}$ bytes/sec

[Plot: window size over time, a sawtooth oscillating between W/2 and W.]
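The 3/4 factor is simply the mean of a window that ramps linearly from W/2 to W. The sketch below checks that numerically and then applies the formula; the values of `W` and `RTT` are hypothetical, chosen only for illustration.

```python
def avg_tcp_throughput(w_bytes, rtt_s):
    """Average throughput in bytes/sec: 3/4 * W / RTT."""
    return 0.75 * w_bytes / rtt_s


def sawtooth_mean(w_bytes, steps=1000):
    """Numerically average a window that ramps linearly from W/2 to W."""
    samples = [w_bytes / 2 + (w_bytes / 2) * i / steps for i in range(steps + 1)]
    return sum(samples) / len(samples)


W = 100_000   # bytes (hypothetical)
RTT = 0.1     # seconds (hypothetical)
print(sawtooth_mean(W) / W)        # close to 0.75
print(avg_tcp_throughput(W, RTT))  # close to 750000 bytes/sec
```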

TCP Futures: TCP over "long, fat pipes"

• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

  $\text{TCP throughput} = \frac{1.22 \cdot MSS}{RTT \cdot \sqrt{L}}$

  to achieve 10 Gbps throughput, need a loss rate of $L = 2 \cdot 10^{-10}$, a very small loss rate!
• new versions of TCP for high-speed
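Both numbers on this slide follow from short arithmetic, sketched below. The function names are hypothetical; the second function just inverts the throughput formula above to solve for L.

```python
def window_segments(rate_bps, rtt_s, mss_bytes):
    """Segments that must be in flight to fill the pipe: rate * RTT / MSS."""
    return rate_bps * rtt_s / (mss_bytes * 8)


def required_loss_rate(rate_bps, rtt_s, mss_bytes):
    """Solve throughput = 1.22 * MSS / (RTT * sqrt(L)) for L."""
    return (1.22 * mss_bytes * 8 / (rtt_s * rate_bps)) ** 2


W = window_segments(10e9, 0.100, 1500)
L = required_loss_rate(10e9, 0.100, 1500)
print(int(W))        # 83333
print(f"{L:.1e}")    # 2.1e-10
```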

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R.]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Plot: Connection 2 throughput versus Connection 1 throughput, each axis up to R, with the equal bandwidth share line and the full bandwidth utilization line. The trajectory alternates between congestion avoidance (additive increase) and loss (decrease window by factor of 2). Marked points: (X1, Y1) where X1 + Y1 = R; (X2, Y2) where X2 = Y2.]
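The convergence argument can be checked with a toy simulation. The sketch assumes, for illustration, that both sessions see a loss in the same round whenever their combined rate exceeds the link capacity; `aimd_two_flows` is a hypothetical name.

```python
def aimd_two_flows(x, y, capacity, rounds):
    """Evolve two AIMD flows sharing one bottleneck; return final rates."""
    for _ in range(rounds):
        if x + y > capacity:
            x, y = x / 2, y / 2      # loss: both halve, so the gap halves too
        else:
            x, y = x + 1, y + 1      # additive increase keeps the gap constant
    return x, y


x, y = aimd_two_flows(1.0, 40.0, capacity=50.0, rounds=200)
print(abs(x - y))   # far smaller than the initial gap of 39
```

Additive increase moves the operating point along a 45-degree line, while each halving pulls it toward the origin, so the difference between the two rates shrinks toward the equal-share line.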

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets R/2
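The two figures in the parallel-connection example come from dividing the link equally among all connections. A minimal sketch, assuming perfectly equal per-connection shares (`app_share` is a hypothetical helper):

```python
from fractions import Fraction


def app_share(existing_conns, new_conns):
    """Fraction of link rate R obtained by an app opening new_conns connections."""
    return Fraction(new_conns, existing_conns + new_conns)


print(app_share(9, 1))    # 1/10
print(app_share(9, 11))   # 11/20, roughly the R/2 quoted above
```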

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical). The IP datagram leaves the source with ECN=00 and arrives marked ECN=11; the TCP ACK segment returns with ECE=1.]


Pipelined protocols: overview

Go-back-N:
• sender can have up to N unacked packets in pipeline
• receiver only sends cumulative ack
  – doesn't ack packet if there's a gap
• sender has timer for oldest unacked packet
  – when timer expires, retransmit all unacked packets

Selective Repeat:
• sender can have up to N unacked packets in pipeline
• rcvr sends individual ack for each packet
• sender maintains timer for each unacked packet
  – when timer expires, retransmit only that unacked packet

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n ("cumulative ACK")
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

Single state: Wait

Initialization (Λ):
    base = 1
    nextseqnum = 1

Event: rdt_send(data)
    if (nextseqnum < base+N) {
        sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
        udt_send(sndpkt[nextseqnum])
        if (base == nextseqnum)
            start_timer
        nextseqnum++
    }
    else
        refuse_data(data)

Event: timeout
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    ...
    udt_send(sndpkt[nextseqnum-1])

Event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
        stop_timer
    else
        start_timer

Event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    Λ (no action)
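The sender FSM above maps fairly directly onto a small Python class. This is a sketch under stated assumptions: the channel is replaced by an in-memory `sent` log, the timer is a boolean flag, packets are plain `(seq, data)` tuples without a checksum, and ACKs older than `base` are ignored so that `base` never moves backwards (a small deviation from the FSM). The class and attribute names are hypothetical.

```python
class GBNSender:
    def __init__(self, window_size):
        self.N = window_size
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}            # seq -> packet, kept until cumulatively ACKed
        self.timer_running = False
        self.sent = []              # stand-in for udt_send: log of packets sent

    def rdt_send(self, data):
        """Accept data from above if the window has room; else refuse it."""
        if self.nextseqnum < self.base + self.N:
            pkt = (self.nextseqnum, data)
            self.sndpkt[self.nextseqnum] = pkt
            self.sent.append(pkt)
            if self.base == self.nextseqnum:
                self.timer_running = True       # start_timer
            self.nextseqnum += 1
            return True
        return False                            # refuse_data

    def timeout(self):
        """Go back N: resend every packet from base to nextseqnum-1."""
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.sent.append(self.sndpkt[seq])

    def rdt_rcv_ack(self, acknum):
        """Handle an uncorrupted cumulative ACK for packets up to acknum."""
        if acknum < self.base:
            return                              # stale or duplicate ACK
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)
        self.base = acknum + 1
        self.timer_running = self.base != self.nextseqnum


s = GBNSender(window_size=4)
accepted = [s.rdt_send(c) for c in "abcde"]
print(accepted)        # [True, True, True, True, False]
s.rdt_rcv_ack(2)       # packets 1 and 2 acknowledged, window slides
s.timeout()            # resends packets 3 and 4
print(s.sent[-2:])     # [(3, 'c'), (4, 'd')]
```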


GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don't buffer): no receiver buffering!
  – re-ACK pkt with highest in-order seq #

Single state: Wait

Initialization (Λ):
    expectedseqnum = 1
    sndpkt = make_pkt(0, ACK, chksum)

Event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum)
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++

Event: default
    udt_send(sndpkt)
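A matching receiver sketch is even shorter, since the only state is `expectedseqnum` plus the last ACK sent. As before, this is an illustrative model: the `corrupt` flag stands in for a checksum test, and the returned integer stands in for the ACK packet. The names are hypothetical.

```python
class GBNReceiver:
    def __init__(self):
        self.expectedseqnum = 1
        self.last_ack = 0          # matches the initial make_pkt(0, ACK, chksum)
        self.delivered = []        # data passed up to the application

    def rdt_rcv(self, seq, data, corrupt=False):
        """Process one arriving packet and return the ACK number sent back."""
        if not corrupt and seq == self.expectedseqnum:
            self.delivered.append(data)
            self.last_ack = self.expectedseqnum
            self.expectedseqnum += 1
        # default case: corrupt or out-of-order packet is discarded,
        # and the most recent ACK is re-sent
        return self.last_ack


r = GBNReceiver()
acks = [r.rdt_rcv(seq, data) for seq, data in [(1, "a"), (2, "b"), (4, "d"), (3, "c")]]
print(acks)          # [1, 2, 2, 3]
print(r.delivered)   # ['a', 'b', 'c']
```

Packet 4 arrives out of order, so it is discarded and ACK 2 is repeated; it would have to be retransmitted by the sender.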


GBN in action

sender window (N=4) over sequence numbers 0 1 2 3 4 5 6 7 8

sender                                   receiver
send pkt0
send pkt1
send pkt2 (X: loss)
send pkt3
(wait)
                                         receive pkt0, send ack0
                                         receive pkt1, send ack1
                                         receive pkt3, discard, (re)send ack1
rcv ack0, send pkt4
rcv ack1, send pkt5
ignore duplicate ACK
                                         receive pkt4, discard, (re)send ack1
                                         receive pkt5, discard, (re)send ack1
pkt 2 timeout
send pkt2
send pkt3
send pkt4
send pkt5
                                         rcv pkt2, deliver, send ack2
                                         rcv pkt3, deliver, send ack3
                                         rcv pkt4, deliver, send ack4
                                         rcv pkt5, deliver, send ack5


Selective repeat

• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #'s
  – limits seq #s of sent, unACKed packets

Selective repeat: sender, receiver windows

[Figure: sender and receiver windows over the sequence number space.]

Selective repeat

sender

data from above:
• if next available seq # in window, send pkt

timeout(n):
• resend pkt n, restart timer

ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver

pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt

pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)

otherwise:
• ignore
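The receiver rules above can be sketched as follows. To keep the example short it assumes unbounded sequence numbers (no wraparound) and returns the ACK number instead of sending a packet; `SRReceiver` and `on_packet` are hypothetical names.

```python
class SRReceiver:
    def __init__(self, window_size):
        self.N = window_size
        self.rcvbase = 0
        self.buffer = {}        # out-of-order packets awaiting delivery
        self.delivered = []     # data handed to the upper layer, in order

    def on_packet(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            # deliver every in-order packet now available, advancing the window
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n            # already delivered: re-ACK so the sender can advance
        return None             # otherwise: ignore


r = SRReceiver(window_size=4)
arrivals = [0, 1, 3, 4, 5, 2]                # pkt2 arrives late (retransmitted)
acks = [r.on_packet(n, f"pkt{n}") for n in arrivals]
print(acks)          # [0, 1, 3, 4, 5, 2]
print(r.delivered)   # ['pkt0', 'pkt1', 'pkt2', 'pkt3', 'pkt4', 'pkt5']
```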

Selective repeat in action

sender window (N=4) over sequence numbers 0 1 2 3 4 5 6 7 8

sender                                   receiver
send pkt0
send pkt1
send pkt2 (X: loss)
send pkt3
(wait)
                                         receive pkt0, send ack0
                                         receive pkt1, send ack1
                                         receive pkt3, buffer, send ack3
rcv ack0, send pkt4
rcv ack1, send pkt5
record ack3 arrived
                                         receive pkt4, buffer, send ack4
                                         receive pkt5, buffer, send ack5
pkt 2 timeout
send pkt2
record ack4 arrived
record ack5 arrived
                                         rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

[Figure: sender window and receiver window (after receipt) over the sequence 0 1 2 3 0 1 2, for two scenarios.]

(a) no problem: sender sends pkt0, pkt1, pkt2; all are received and ACKed, and the sender window advances. pkt3 is then sent and lost (X), followed by a new pkt0. Receiver will accept packet with seq number 0.

(b) oops!: sender sends pkt0, pkt1, pkt2; all are received, but the ACKs are lost (X X X). Sender times out and retransmits pkt0. Receiver will accept packet with seq number 0.

• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

receiver can't see sender side; receiver behavior identical in both cases! something's (very) wrong!

Q: what relationship between seq # size and window size to avoid problem in (b)?
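One way to explore the question is to compute which sequence numbers are ambiguous for a given configuration. The sketch below compares the numbers the sender might still retransmit (the old window) with the numbers the receiver now treats as new (its advanced window); any overlap reproduces case (b). The standard textbook answer, stated here as background rather than taken from the slide, is that the window size must be at most half the sequence number space. `ambiguous_seqs` is a hypothetical helper.

```python
def ambiguous_seqs(seq_space, window):
    """Sequence numbers that could be either an old retransmission or new data."""
    old_window = {n % seq_space for n in range(0, window)}
    new_window = {n % seq_space for n in range(window, 2 * window)}
    return sorted(old_window & new_window)


print(ambiguous_seqs(seq_space=4, window=3))   # [0, 1]: the slide's dilemma
print(ambiguous_seqs(seq_space=4, window=2))   # []: window <= seq_space / 2
```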

TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)

• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver

TCP segment structure

Header layout (each row is 32 bits wide):
    source port #  |  dest port #
    sequence number
    acknowledgement number
    head len  |  not used  |  U A P R S F  |  receive window
    checksum  |  Urg data pointer
    options (variable length)
    application data (variable length)

Annotations:
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• checksum: Internet checksum (as in UDP)

TCP seq. numbers, ACKs

sequence numbers:
  – byte stream "number" of first byte in segment's data

acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK

Q: how receiver handles out-of-order segments
  – A: TCP spec doesn't say, up to implementor

[Figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer. Sender sequence number space with window size N: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable.]

Byte stream in TCP

[Figure: an HTTP GET message (K bytes) shown as a byte stream. Labels: window of N bytes; 100th byte; TCP header (seq. no. = 100) followed by M bytes; portion that cannot be transmitted now.]

TCP seq. numbers, ACKs

simple telnet scenario

Host A                                   Host B
User types 'C'
        Seq=42, ACK=79, data = 'C'  →
                                         host ACKs receipt of 'C', echoes back 'C'
        ←  Seq=79, ACK=43, data = 'C'
host ACKs receipt of echoed 'C'
        Seq=43, ACK=80  →
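The numbers in the telnet exchange follow one rule: each ACK field is the peer's sequence number plus the number of data bytes received. A short sketch, using plain dictionaries as stand-ins for segments (the helper `next_ack` is hypothetical):

```python
def next_ack(seq, payload_len):
    """ACK number = sequence number of the next byte expected from the peer."""
    return seq + payload_len


a_seq, b_seq = 42, 79   # starting values from the scenario above

seg1 = {"seq": a_seq, "ack": b_seq, "data": b"C"}                 # A -> B
seg2 = {"seq": b_seq,
        "ack": next_ack(seg1["seq"], len(seg1["data"])),
        "data": b"C"}                                             # B -> A (echo)
seg3 = {"seq": seg2["ack"],
        "ack": next_ack(seg2["seq"], len(seg2["data"])),
        "data": b""}                                              # A -> B (pure ACK)

print(seg2["seq"], seg2["ack"])   # 79 43
print(seg3["seq"], seg3["ack"])   # 43 80
```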

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT

TCP round trip time, timeout

$\text{EstimatedRTT} = (1 - \alpha) \cdot \text{EstimatedRTT} + \alpha \cdot \text{SampleRTT}$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: $\alpha = 0.125$

[Plot: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr. x-axis: time (seconds); y-axis: RTT (milliseconds), 100 to 350; series: SampleRTT and EstimatedRTT.]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

  $\text{DevRTT} = (1 - \beta) \cdot \text{DevRTT} + \beta \cdot |\text{SampleRTT} - \text{EstimatedRTT}|$
  (typically, $\beta = 0.25$)

  $\text{TimeoutInterval} = \text{EstimatedRTT} + 4 \cdot \text{DevRTT}$

  Here EstimatedRTT is the estimated RTT and $4 \cdot \text{DevRTT}$ is the "safety margin".
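The estimator and timeout formulas can be combined into a small class. This sketch makes two choices the slides leave open, both labeled as assumptions: DevRTT starts at 0, and on each sample DevRTT is updated using the previous EstimatedRTT before EstimatedRTT itself is updated. `RttEstimator` is a hypothetical name.

```python
class RttEstimator:
    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = 0.0          # assumption: no deviation known initially

    def update(self, sample_rtt):
        """Fold in one SampleRTT and return the new TimeoutInterval."""
        # assumption: deviation is measured against the previous estimate
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        return self.timeout_interval()

    def timeout_interval(self):
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)      # milliseconds
timeout = est.update(200.0)
print(est.estimated_rtt, est.dev_rtt, timeout)   # 112.5 25.0 212.5
```

A single outlier of 200 ms moves the estimate only slightly (to 112.5 ms) but widens the safety margin considerably, which is the intended behavior.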

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks

let's initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control

TCP sender events

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments

TCP sender (simplified)

Single state: wait for event

Initialization (Λ):
    NextSeqNum = InitialSeqNum
    SendBase = InitialSeqNum

Event: data received from application above
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
        start timer

Event: timeout
    retransmit not-yet-acked segment with smallest seq. #
    start timer

Event: ACK received, with ACK field value y
    if (y > SendBase) {
        SendBase = y
        /* SendBase–1: last cumulatively ACKed byte */
        if (there are currently not-yet-acked segments)
            start timer
        else stop timer
    }
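The simplified sender can be sketched in Python as below. It is a model under stated assumptions: the timer is a boolean flag, "pass segment to IP" appends to an in-memory `wire` list, and duplicate ACKs, flow control and congestion control are ignored, as on the slide. `SimpleTcpSender` and its members are hypothetical names.

```python
class SimpleTcpSender:
    def __init__(self, initial_seq):
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.unacked = {}            # seq of first byte -> segment data
        self.timer_running = False
        self.wire = []               # stand-in for IP: (seq, data) sent

    def on_app_data(self, data):
        self.unacked[self.next_seq] = data
        self.wire.append((self.next_seq, data))
        self.next_seq += len(data)
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        if self.unacked:
            seq = min(self.unacked)  # not-yet-acked segment with smallest seq #
            self.wire.append((seq, self.unacked[seq]))
            self.timer_running = True

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y       # send_base - 1: last cumulatively ACKed byte
            fully_acked = [s for s, d in self.unacked.items() if s + len(d) <= y]
            for s in fully_acked:
                del self.unacked[s]
            self.timer_running = bool(self.unacked)


s = SimpleTcpSender(initial_seq=92)
s.on_app_data(b"x" * 8)      # Seq=92, 8 bytes
s.on_app_data(b"y" * 20)     # Seq=100, 20 bytes
s.on_ack(120)                # one cumulative ACK covers both segments
print(s.send_base, s.timer_running)   # 120 False
```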

TCP: retransmission scenarios

lost ACK scenario

Host A                                   Host B
        Seq=92, 8 bytes of data  →
        ←  ACK=100  (X: lost)
(timeout)
        Seq=92, 8 bytes of data  →
        ←  ACK=100

premature timeout

Host A (SendBase=92)                     Host B
        Seq=92, 8 bytes of data  →
        Seq=100, 20 bytes of data  →
        ←  ACK=100
        ←  ACK=120
(timeout fires before the ACKs arrive)
        Seq=92, 8 bytes of data  →
(ACKs arrive: SendBase=100, then SendBase=120)
        ←  ACK=120
(SendBase=120)

cumulative ACK

Host A                                   Host B
        Seq=92, 8 bytes of data  →
        Seq=100, 20 bytes of data  →
        ←  ACK=100  (X: lost)
        ←  ACK=120
(ACK=120 arrives before timeout)
        Seq=120, 15 bytes of data  →

TCP ACK generation [RFC 5681]

event at receiver → TCP receiver action

• arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed
  → delayed ACK: wait up to 500ms for next segment; if no next segment, send ACK
• arrival of in-order segment with expected seq #; one other segment has ACK pending
  → immediately send single cumulative ACK, ACKing both in-order segments
• arrival of out-of-order segment with higher-than-expected seq #; gap detected
  → immediately send duplicate ACK, indicating seq # of next expected byte
• arrival of segment that partially or completely fills gap
  → immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  – likely that unacked segment lost, so don't wait for timeout

TCP fast retransmit

fast retransmit after sender receipt of triple duplicate ACK

Host A                                   Host B
        Seq=92, 8 bytes of data  →
        Seq=100, 20 bytes of data  →  (X: lost)
        ←  ACK=100
        ←  ACK=100
        ←  ACK=100
        ←  ACK=100
(3 DUP ACKs received before timeout)
        Seq=100, 20 bytes of data  →

TCP flow control

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)

[Figure: receiver protocol stack. Segments from sender pass through IP code and TCP code into the TCP socket receiver buffers (OS), from which the application process reads.]

TCP flow control

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); buffered data flows to the application process.]
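The bookkeeping behind rwnd can be sketched in a few lines. The slide only says that rwnd is the free buffer space; the variable names `last_byte_rcvd` and `last_byte_read`, and the formula built from them, follow the usual textbook presentation and should be read as assumptions here.

```python
def rwnd(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free receive-buffer space advertised to the sender, in bytes."""
    return rcv_buffer - (last_byte_rcvd - last_byte_read)


def can_send(nbytes, last_byte_sent, last_byte_acked, rwnd_value):
    """Sender-side check: in-flight data plus new data must fit within rwnd."""
    return (last_byte_sent - last_byte_acked) + nbytes <= rwnd_value


window = rwnd(rcv_buffer=4096, last_byte_rcvd=3000, last_byte_read=1000)
print(window)                               # 2096
print(can_send(1000, 5000, 4000, window))   # True: 2000 bytes would be in flight
print(can_send(1500, 5000, 4000, window))   # False: 2500 bytes would overflow
```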

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

[Figure: client and server, each with application and network layers, each holding:
    connection state: ESTAB
    connection variables:
        seq # client-to-server, server-to-client
        rcvBuffer size at server, client]

Client side:
    Socket clientSocket = new Socket("hostname", "port number");

Server side:
    Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

client state                             server state
CLOSED                                   LISTEN
choose init seq num, x
send TCP SYN msg
SYNSENT
        SYNbit=1, Seq=x  →
                                         choose init seq num, y
                                         send TCP SYNACK msg, acking SYN
                                         SYN RCVD
        ←  SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1
received SYNACK(x) indicates server is live;
send ACK for SYNACK;
this segment may contain client-to-server data
ESTAB
        ACKbit=1, ACKnum=y+1  →
                                         received ACK(y) indicates client is live
                                         ESTAB
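The exchange above can be traced with a short sketch that builds the three segments as dictionaries and records the state changes. No sockets are involved; the function name and the dictionary keys (taken from the field labels on the slide) are illustrative choices.

```python
def three_way_handshake(x, y):
    """Return final (client_state, server_state) and the three segments."""
    client, server = "CLOSED", "LISTEN"

    # 1. client chooses initial seq num x and sends SYN
    syn = {"SYNbit": 1, "Seq": x}
    client = "SYNSENT"

    # 2. server chooses initial seq num y and sends SYNACK, acking the SYN
    synack = {"SYNbit": 1, "Seq": y, "ACKbit": 1, "ACKnum": syn["Seq"] + 1}
    server = "SYN RCVD"

    # 3. client ACKs the SYNACK; both sides are then established
    ack = {"ACKbit": 1, "ACKnum": synack["Seq"] + 1}
    client = "ESTAB"
    server = "ESTAB"

    return (client, server), [syn, synack, ack]


states, segments = three_way_handshake(x=100, y=300)
print(states)                  # ('ESTAB', 'ESTAB')
print(segments[1]["ACKnum"])   # 101
print(segments[2]["ACKnum"])   # 301
```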

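TCP 3-way handshake: FSM

Server side:
    closed → listen
        event:  Socket connectionSocket = welcomeSocket.accept();
        action: Λ
    listen → SYN rcvd
        event:  SYN(x)
        action: SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
    SYN rcvd → ESTAB
        event:  ACK(ACKnum=y+1)
        action: Λ

Client side:
    closed → SYN sent
        event:  Socket clientSocket = new Socket("hostname", "port number");
        action: SYN(seq=x)
    SYN sent → ESTAB
        event:  SYNACK(seq=y, ACKnum=x+1)
        action: ACK(ACKnum=y+1)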

TCP: closing a connection

• client, server each close their side of connection
  – send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

TCP: closing a connection

client state                             server state
ESTAB                                    ESTAB
clientSocket.close()
FIN_WAIT_1: can no longer send but can receive data
        FINbit=1, seq=x  →
                                         CLOSE_WAIT: can still send data
        ←  ACKbit=1, ACKnum=x+1
FIN_WAIT_2: wait for server close
                                         LAST_ACK: can no longer send data
        ←  FINbit=1, seq=y
TIMED_WAIT: timed wait for 2*max segment lifetime
        ACKbit=1, ACKnum=y+1  →
                                         CLOSED
CLOSED

The "Two Army Problem"

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
• a top-10 problem!

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, $\lambda_{in}$, approaches capacity

[Figure: Host A sends original data at rate $\lambda_{in}$ through a router with unlimited shared output link buffers; Host B receives throughput $\lambda_{out}$. Plots: $\lambda_{out}$ versus $\lambda_{in}$ (both up to R/2) and delay versus $\lambda_{in}$ (up to R/2).]

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: $\lambda_{in} = \lambda_{out}$
  – transport-layer input includes retransmissions: $\lambda'_{in} \ge \lambda_{in}$

[Figure: Host A and Host B connected through a router with finite shared output link buffers. $\lambda_{in}$: original data; $\lambda'_{in}$: original data plus retransmitted data; $\lambda_{out}$: output rate.]

Causes/costs of congestion: scenario 2

idealization: perfect knowledge
• sender sends only when router buffers available

[Figure: Host A keeps a copy of each packet and sends into a router with finite shared output link buffers only when there is free buffer space; with no buffer space it holds back. $\lambda_{in}$: original data; $\lambda'_{in}$: original data plus retransmitted data; $\lambda_{out}$: output rate. Plot: $\lambda_{out}$ versus $\lambda_{in}$, both up to R/2.]


Page 46: ChapterIII: Transport Layer

Go-Back-N: sender

• k-bit seq # in pkt header
• "window" of up to N consecutive unacked pkts allowed
• ACK(n): ACKs all pkts up to, including seq # n ("cumulative ACK")
  – may receive duplicate ACKs (see receiver)
• timer for oldest in-flight pkt
• timeout(n): retransmit packet n and all higher seq # pkts in window

GBN: sender extended FSM

initial transition (Λ):
    base = 1
    nextseqnum = 1

state: Wait

event: rdt_send(data)
    if (nextseqnum < base+N) {
        sndpkt[nextseqnum] = make_pkt(nextseqnum, data, chksum)
        udt_send(sndpkt[nextseqnum])
        if (base == nextseqnum)
            start_timer
        nextseqnum++
    }
    else
        refuse_data(data)

event: timeout
    start_timer
    udt_send(sndpkt[base])
    udt_send(sndpkt[base+1])
    …
    udt_send(sndpkt[nextseqnum-1])

event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt)
    base = getacknum(rcvpkt)+1
    if (base == nextseqnum)
        stop_timer
    else
        start_timer

event: rdt_rcv(rcvpkt) && corrupt(rcvpkt)
    Λ (no action)
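The sender FSM translates almost line for line into code. The following is a minimal sketch, not a full implementation: the `send` callback stands in for udt_send, the timer is reduced to a boolean flag, corruption checking is omitted, and the guard against stale ACKs is an addition that the slide's FSM does not show.

```python
class GBNSender:
    """Go-Back-N sender mirroring the slide's FSM (illustrative sketch)."""

    def __init__(self, window_size, send):
        self.N = window_size
        self.base = 1
        self.nextseqnum = 1
        self.sndpkt = {}            # seq -> packet, kept until ACKed
        self.send = send            # hypothetical stand-in for udt_send
        self.timer_running = False  # a real sender would arm a timer here

    def rdt_send(self, data):
        """Called by the application; returns False if data is refused."""
        if self.nextseqnum < self.base + self.N:
            pkt = (self.nextseqnum, data)
            self.sndpkt[self.nextseqnum] = pkt
            self.send(pkt)
            if self.base == self.nextseqnum:
                self.timer_running = True
            self.nextseqnum += 1
            return True
        return False  # window full: refuse_data

    def timeout(self):
        """Retransmit every sent-but-unACKed packet, oldest first."""
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.send(self.sndpkt[seq])

    def rdt_rcv_ack(self, acknum):
        """Cumulative ACK: everything up to and including acknum is done."""
        if acknum + 1 <= self.base:
            return  # stale ACK (guard added here, not on the slide)
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)
        self.base = acknum + 1
        self.timer_running = self.base != self.nextseqnum
```

A single timer flag is enough because Go-Back-N only ever times the oldest in-flight packet.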


GBN: receiver extended FSM

ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  – may generate duplicate ACKs
  – need only remember expectedseqnum
• out-of-order pkt:
  – discard (don't buffer): no receiver buffering!
  – re-ACK pkt with highest in-order seq #

initial transition (Λ):
    expectedseqnum = 1
    sndpkt = make_pkt(0, ACK, chksum)

state: Wait

event: rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum)
    extract(rcvpkt, data)
    deliver_data(data)
    sndpkt = make_pkt(expectedseqnum, ACK, chksum)
    udt_send(sndpkt)
    expectedseqnum++

event: default
    udt_send(sndpkt)
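The receiver side is even smaller, since it keeps only expectedseqnum and the last ACK it built. The sketch below is illustrative: the method returns the ACK number instead of sending a packet, and corruption is modelled by a boolean argument rather than a checksum.

```python
class GBNReceiver:
    """Go-Back-N receiver mirroring the slide's FSM (illustrative sketch)."""

    def __init__(self):
        self.expectedseqnum = 1
        self.last_ack = 0     # matches the initial make_pkt(0, ACK, chksum)
        self.delivered = []   # stands in for deliver_data()

    def rdt_rcv(self, seq, data, corrupt=False):
        """Process one arriving packet and return the ACK number to send."""
        if not corrupt and seq == self.expectedseqnum:
            self.delivered.append(data)
            self.last_ack = seq
            self.expectedseqnum += 1
        # Default case: out-of-order or corrupt packets are discarded and
        # the most recent in-order ACK is sent again.
        return self.last_ack


r = GBNReceiver()
acks = [r.rdt_rcv(seq, f"d{seq}") for seq in (1, 2, 4, 5, 3)]
print(acks, r.delivered)
```

Packets 4 and 5 arrive before 3, so they are dropped and trigger duplicate ACKs for packet 2.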


GBN in action

sender window (N=4), sequence numbers 0 1 2 3 4 5 6 7 8

sender                                  receiver
send pkt0
send pkt1
send pkt2  (X: loss)
send pkt3
(wait)
                                        receive pkt0, send ack0
                                        receive pkt1, send ack1
                                        receive pkt3, discard, (re)send ack1
rcv ack0, send pkt4
rcv ack1, send pkt5
ignore duplicate ACK
                                        receive pkt4, discard, (re)send ack1
                                        receive pkt5, discard, (re)send ack1
pkt 2 timeout
send pkt2
send pkt3
send pkt4
send pkt5
                                        rcv pkt2, deliver, send ack2
                                        rcv pkt3, deliver, send ack3
                                        rcv pkt4, deliver, send ack4
                                        rcv pkt5, deliver, send ack5


Selective repeat

• receiver individually acknowledges all correctly received packets
  – buffers packets, as needed, for eventual in-order delivery to upper layer
• sender only resends packets for which ACK not received
  – sender timer for each unACKed packet
• sender window
  – N consecutive seq #s
  – limits seq #s of sent, unACKed packets

Selective repeat: sender, receiver windows

[Figure: sender and receiver views of the sequence-number windows.]

Selective repeat

sender

data from above:
• if next available seq # in window, send pkt

timeout(n):
• resend pkt n, restart timer

ACK(n) in [sendbase, sendbase+N-1]:
• mark pkt n as received
• if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver

pkt n in [rcvbase, rcvbase+N-1]:
• send ACK(n)
• out-of-order: buffer
• in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt

pkt n in [rcvbase-N, rcvbase-1]:
• ACK(n)

otherwise:
• ignore
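The three receiver cases can be sketched as follows. To keep the example short it assumes sequence numbers grow without wrapping around (a real protocol uses a finite k-bit space), and it returns the ACK number, or None for "ignore", instead of sending anything. Class and method names are illustrative.

```python
class SRReceiver:
    """Selective-repeat receiver (sketch; no sequence-number wraparound)."""

    def __init__(self, window_size):
        self.N = window_size
        self.rcvbase = 0
        self.buffer = {}      # out-of-order packets waiting for the gap
        self.delivered = []   # data handed to the upper layer, in order

    def receive(self, n, data):
        """Return the ACK number to send, or None if the packet is ignored."""
        if self.rcvbase <= n < self.rcvbase + self.N:
            self.buffer[n] = data
            # Deliver the in-order run starting at rcvbase and slide the
            # window to the next not-yet-received packet.
            while self.rcvbase in self.buffer:
                self.delivered.append(self.buffer.pop(self.rcvbase))
                self.rcvbase += 1
            return n
        if self.rcvbase - self.N <= n < self.rcvbase:
            return n  # already delivered: re-ACK so the sender can advance
        return None   # otherwise: ignore


r = SRReceiver(4)
acks = [r.receive(n, f"pkt{n}") for n in (0, 1, 3, 4, 5, 2)]
print(acks, r.delivered)
```

Re-ACKing packets below rcvbase matters: if the original ACK was lost, the sender's window would otherwise never move.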

Selective repeat in action

sender window (N=4), sequence numbers 0 1 2 3 4 5 6 7 8

sender                                  receiver
send pkt0
send pkt1
send pkt2  (X: loss)
send pkt3
(wait)
                                        receive pkt0, send ack0
                                        receive pkt1, send ack1
                                        receive pkt3, buffer, send ack3
rcv ack0, send pkt4
rcv ack1, send pkt5
record ack3 arrived
                                        receive pkt4, buffer, send ack4
                                        receive pkt5, buffer, send ack5
pkt 2 timeout
send pkt2
record ack4 arrived
record ack5 arrived
                                        rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2

Q: what happens when ack2 arrives?


Selective repeat: dilemma

example:
• seq #s: 0, 1, 2, 3
• window size = 3

[Figure: two scenarios, each showing the sender window (after receipt) and the receiver window (after receipt) over the sequence 0 1 2 3 0 1 2.
(a) no problem: sender sends pkt0, pkt1, pkt2 and then pkt3 and a new pkt0; one packet is lost (X); receiver will accept packet with seq number 0.
(b) oops!: sender sends pkt0, pkt1, pkt2; the ACKs are lost (X X X); sender times out and retransmits pkt0; receiver will accept packet with seq number 0.]

receiver can't see sender side. receiver behavior identical in both cases! something's (very) wrong!

• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size to avoid problem in (b)?
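One way to explore the question is to test the worst case directly: the receiver has accepted a full window, every ACK is lost, and the sender retransmits the old window while the receiver is already expecting the next one. The check below is a sketch; the function name and the choice of starting at sequence number 0 are assumptions made for illustration.

```python
def sr_window_is_safe(seq_space, window):
    """True if retransmitted old packets can never be mistaken for new ones.

    Worst case: the receiver accepted packets 0..window-1, every ACK was
    lost, so the sender may resend the old window while the receiver has
    already advanced to the next window (all numbers modulo seq_space).
    """
    old = {s % seq_space for s in range(0, window)}
    new = {s % seq_space for s in range(window, 2 * window)}
    return old.isdisjoint(new)


print(sr_window_is_safe(4, 3))  # the slide's example
print(sr_window_is_safe(4, 2))
```

Trying other values suggests the usual rule: the window must be no larger than half the sequence-number space.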

TCP: Overview   RFCs: 793, 1122, 1323, 2018, 2581

• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no "message boundaries"
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver

TCP segment structure

(each row is 32 bits wide)

source port #  |  dest port #
sequence number
acknowledgement number
head len  |  not used  |  U A P R S F  |  receive window
checksum  |  Urg data pointer
options (variable length)
application data (variable length)

• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• checksum: Internet checksum (as in UDP)

TCP seq. numbers, ACKs

sequence numbers:
  – byte stream "number" of first byte in segment's data

acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK

Q: how receiver handles out-of-order segments
  – A: TCP spec doesn't say, up to implementor

[Figure: outgoing segment from sender and incoming segment to sender (fields: source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer), next to the sender sequence number space, which is divided into: sent ACKed; sent, not-yet ACKed ("in-flight"); usable but not yet sent; not usable. Window size N is marked.]

Byte stream in TCP

[Figure: an HTTP GET message (K bytes) in the byte stream, starting at the 100th byte. A segment with TCP header (seq. no = 100) carries M bytes of it; the window is N bytes; part of the message is marked "cannot be transmitted now".]

TCP seq. numbers, ACKs

simple telnet scenario

Host A                                          Host B
User types 'C'
    Seq=42, ACK=79, data = 'C'   →
                                                host ACKs receipt of 'C', echoes back 'C'
                             ←   Seq=79, ACK=43, data = 'C'
host ACKs receipt of echoed 'C'
    Seq=43, ACK=80               →
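The numbers in the telnet exchange follow from two rules: a reply's sequence number is whatever the peer said it expects next, and its ACK number is the received sequence number plus the number of data bytes received. A small sketch, with segments modelled as dictionaries purely for illustration:

```python
def reply(segment, data=b""):
    """Build the reply to `segment` under byte-stream numbering."""
    return {
        "seq": segment["ack"],                         # next byte the peer expects from us
        "ack": segment["seq"] + len(segment["data"]),  # next byte we expect from the peer
        "data": data,
    }


first = {"seq": 42, "ack": 79, "data": b"C"}  # A -> B: user types 'C'
echo = reply(first, b"C")                     # B -> A: ACKs 'C', echoes it back
final = reply(echo)                           # A -> B: ACKs the echoed 'C'
print(echo, final)
```

The last segment carries no data, so it does not consume any sequence numbers.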

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT "smoother"
  – average several recent measurements, not just current SampleRTT

TCP round trip time, timeout

$\text{EstimatedRTT} = (1 - \alpha) \cdot \text{EstimatedRTT} + \alpha \cdot \text{SampleRTT}$

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: $\alpha = 0.125$

[Figure: "RTT: gaia.cs.umass.edu to fantasia.eurecom.fr". RTT (milliseconds, 100 to 350) against time (seconds), plotting SampleRTT and EstimatedRTT.]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus "safety margin"
  – large variation in EstimatedRTT → larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

  $\text{DevRTT} = (1 - \beta) \cdot \text{DevRTT} + \beta \cdot |\text{SampleRTT} - \text{EstimatedRTT}|$
  (typically, $\beta = 0.25$)

  $\text{TimeoutInterval} = \text{EstimatedRTT} + 4 \cdot \text{DevRTT}$
  (estimated RTT plus "safety margin")
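The RTT formulas combine into a small estimator. In this sketch the weights default to the typical values from the slides (0.125 and 0.25); the initial EstimatedRTT and DevRTT are passed in by the caller because the slides do not say how to initialise them, and updating the deviation before the estimate is a choice made here, not something the slides specify.

```python
class RttEstimator:
    """EWMA-based RTT and timeout estimator (illustrative sketch)."""

    def __init__(self, estimated_rtt, dev_rtt, alpha=0.125, beta=0.25):
        self.estimated_rtt = estimated_rtt
        self.dev_rtt = dev_rtt
        self.alpha = alpha
        self.beta = beta

    def update(self, sample_rtt):
        """Fold one SampleRTT (not from a retransmission) into the estimates."""
        # Deviation uses the estimate from before this sample is applied.
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    @property
    def timeout_interval(self):
        """EstimatedRTT plus a safety margin of four deviations."""
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(estimated_rtt=100.0, dev_rtt=10.0)  # milliseconds
est.update(120.0)
print(est.estimated_rtt, est.dev_rtt, est.timeout_interval)
```

A single 120 ms sample moves the estimate only slightly, while the growing deviation widens the timeout.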

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks

let's initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control

TCP sender events:

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments

TCP sender (simplified)

initial transition (Λ):
    NextSeqNum = InitialSeqNum
    SendBase = InitialSeqNum

state: wait for event

event: data received from application above
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., "send")
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
        start timer

event: timeout
    retransmit not-yet-acked segment with smallest seq. #
    start timer

event: ACK received, with ACK field value y
    if (y > SendBase) {
        SendBase = y
        /* SendBase–1: last cumulatively ACKed byte */
        if (there are currently not-yet-acked segments)
            start timer
        else stop timer
    }
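The simplified sender can be sketched in code as three event handlers. As on the slide, duplicate ACKs, flow control and congestion control are ignored. The `send` callback (standing in for "pass segment to IP"), the boolean timer flag and the dictionary of unacked segments are illustrative choices, not part of the slides.

```python
class SimpleTcpSender:
    """Simplified TCP sender: one timer, cumulative ACKs (sketch)."""

    def __init__(self, initial_seq, send):
        self.next_seq = initial_seq
        self.send_base = initial_seq
        self.unacked = {}           # seq of first byte -> segment data
        self.send = send            # hypothetical stand-in for IP
        self.timer_running = False

    def data_from_app(self, data):
        self.unacked[self.next_seq] = data
        self.send((self.next_seq, data))
        self.next_seq += len(data)  # sequence numbers count bytes
        if not self.timer_running:
            self.timer_running = True

    def timeout(self):
        if not self.unacked:
            return
        seq = min(self.unacked)     # not-yet-acked segment with smallest seq
        self.send((seq, self.unacked[seq]))
        self.timer_running = True   # restart timer

    def ack_received(self, y):
        if y > self.send_base:
            self.send_base = y      # bytes up to y-1 are cumulatively ACKed
            for seq in [s for s, d in self.unacked.items() if s + len(d) <= y]:
                del self.unacked[seq]
            self.timer_running = bool(self.unacked)
```

Feeding it 8 bytes and then 20 bytes starting at sequence number 92, followed by a single ACK=120, acknowledges both segments at once and stops the timer.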

TCP: retransmission scenarios

lost ACK scenario (Host A, Host B)
    A → B: Seq=92, 8 bytes of data
    B → A: ACK=100 (X: lost)
    timeout at A
    A → B: Seq=92, 8 bytes of data
    B → A: ACK=100

premature timeout (Host A, Host B)
    SendBase=92
    A → B: Seq=92, 8 bytes of data
    A → B: Seq=100, 20 bytes of data
    B → A: ACK=100
    B → A: ACK=120
    timeout at A
    A → B: Seq=92, 8 bytes of data
    SendBase=100
    SendBase=120
    B → A: ACK=120
    SendBase=120

TCP: retransmission scenarios

cumulative ACK (Host A, Host B)
    A → B: Seq=92, 8 bytes of data
    A → B: Seq=100, 20 bytes of data
    B → A: ACK=100 (X: lost)
    B → A: ACK=120 (arrives before the timeout)
    A → B: Seq=120, 15 bytes of data

TCP ACK generation [RFC 5681]

event at receiver: arrival of in-order segment with expected seq #. All data up to expected seq # already ACKed
TCP receiver action: delayed ACK. Wait up to 500 ms for next segment. If no next segment, send ACK

event at receiver: arrival of in-order segment with expected seq #. One other segment has ACK pending
TCP receiver action: immediately send single cumulative ACK, ACKing both in-order segments

event at receiver: arrival of out-of-order segment, higher-than-expected seq #. Gap detected
TCP receiver action: immediately send duplicate ACK, indicating seq # of next expected byte

event at receiver: arrival of segment that partially or completely fills gap
TCP receiver action: immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data ("triple duplicate ACKs"), resend unacked segment with smallest seq #
  – likely that unacked segment lost, so don't wait for timeout

TCP fast retransmit

fast retransmit after sender receipt of triple duplicate ACK (Host A, Host B)
    A → B: Seq=92, 8 bytes of data
    A → B: Seq=100, 20 bytes of data (X: lost)
    B → A: ACK=100
    B → A: ACK=100
    B → A: ACK=100
    B → A: ACK=100
    3 DUP ACKs received at A, before the timeout expires
    A → B: Seq=100, 20 bytes of data
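Counting duplicate ACKs takes very little state. The sketch below assumes the convention shown in the figure: the first ACK=100 is the original and the next three are duplicates, so the trigger fires on the fourth identical ACK. The class name and return convention are illustrative.

```python
class DupAckDetector:
    """Signals fast retransmit on the third duplicate ACK (sketch)."""

    def __init__(self):
        self.last_ack = None
        self.dup_count = 0

    def on_ack(self, acknum):
        """Return True exactly when fast retransmit should be triggered."""
        if acknum == self.last_ack:
            self.dup_count += 1
            return self.dup_count == 3
        # A new ACK number resets the duplicate counter.
        self.last_ack = acknum
        self.dup_count = 0
        return False


d = DupAckDetector()
decisions = [d.on_ack(100) for _ in range(4)]
print(decisions)
```

When the detector fires, the sender resends the unacked segment with the smallest sequence number (Seq=100 in the figure) without waiting for the timer.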

TCP flow control

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

application may remove data from TCP socket buffers … slower than TCP receiver is delivering (sender is sending)

[Figure: receiver protocol stack. Segments from sender pass up through IP code and TCP code into the TCP socket receiver buffers inside the OS; the application process reads from those buffers.]

TCP flow control

• receiver "advertises" free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked ("in-flight") data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering. TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); data leaves toward the application process.]
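The receive-window bookkeeping can be written down in two small functions. The variable names LastByteRcvd and LastByteRead do not appear on these slides; they are introduced here as assumptions to express "buffered data" as the bytes received but not yet read by the application.

```python
def rwnd(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free buffer space the receiver advertises to the sender."""
    buffered = last_byte_rcvd - last_byte_read
    return rcv_buffer - buffered


def sender_may_send(last_byte_sent, last_byte_acked, rwnd_value, nbytes):
    """True if sending nbytes more keeps in-flight data within rwnd."""
    in_flight_after = (last_byte_sent + nbytes) - last_byte_acked
    return in_flight_after <= rwnd_value


# 4096-byte buffer with 2000 bytes received but not yet read.
window = rwnd(4096, last_byte_rcvd=3000, last_byte_read=1000)
print(window, sender_may_send(5000, 4000, window, 1000))
```

Because the sender never lets unacked data exceed the advertised value, the receive buffer cannot overflow.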

Connection Management

before exchanging data, sender/receiver "handshake":
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

[Figure: client and server, each with an application and a network layer, each holding:
    connection state: ESTAB
    connection variables:
        seq # client-to-server, server-to-client
        rcvBuffer size at server, client]

client:  Socket clientSocket = new Socket("hostname", "port number");
server:  Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

client state: CLOSED                            server state: LISTEN

client: choose init seq num, x; send TCP SYN msg
    →  SYNbit=1, Seq=x
client state: SYNSENT
                                                server: choose init seq num, y; send TCP SYNACK msg, acking SYN
    ←  SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1
                                                server state: SYN RCVD
client: received SYNACK(x) indicates server is live; send ACK for SYNACK; this segment may contain client-to-server data
    →  ACKbit=1, ACKnum=y+1
client state: ESTAB
                                                server: received ACK(y) indicates client is live
                                                server state: ESTAB
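The three segments of the handshake can be generated from the two initial sequence numbers alone. The sketch below models segments as dictionaries for illustration and wraps the arithmetic at 2^32, since TCP sequence numbers are 32 bits wide.

```python
SEQ_SPACE = 2 ** 32  # TCP sequence numbers are 32-bit values


def three_way_handshake(x, y):
    """Return the SYN, SYNACK and ACK segments for client ISN x, server ISN y."""
    syn = {"SYN": 1, "seq": x}
    synack = {"SYN": 1, "ACK": 1, "seq": y,
              "acknum": (syn["seq"] + 1) % SEQ_SPACE}
    ack = {"ACK": 1, "acknum": (synack["seq"] + 1) % SEQ_SPACE}
    return [syn, synack, ack]


for segment in three_way_handshake(100, 300):
    print(segment)
```

Each side acknowledges the other's initial sequence number plus one, which shows that the SYN itself consumes one sequence number.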

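TCP 3-way handshake: FSM

states: closed, listen, SYN rcvd, SYN sent, ESTAB

server side:
    closed → listen
        event: Socket connectionSocket = welcomeSocket.accept();
        action: Λ
    listen → SYN rcvd
        event: SYN(x)
        action: SYNACK(seq=y, ACKnum=x+1); create new socket for communication back to client
    SYN rcvd → ESTAB
        event: ACK(ACKnum=y+1)
        action: Λ

client side:
    closed → SYN sent
        event: Socket clientSocket = new Socket("hostname", "port number");
        action: SYN(seq=x)
    SYN sent → ESTAB
        event: SYNACK(seq=y, ACKnum=x+1)
        action: ACK(ACKnum=y+1)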

TCP: closing a connection

• client, server each close their side of connection
  – send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

TCP: closing a connection

client state: ESTAB                             server state: ESTAB

client: clientSocket.close()
    →  FINbit=1, seq=x
client state: FIN_WAIT_1 (can no longer send but can receive data)
                                                server state: CLOSE_WAIT (can still send data)
    ←  ACKbit=1, ACKnum=x+1
client state: FIN_WAIT_2 (wait for server close)
                                                server state: LAST_ACK (can no longer send data)
    ←  FINbit=1, seq=y
client state: TIMED_WAIT (timed wait for 2 · max segment lifetime)
    →  ACKbit=1, ACKnum=y+1
                                                server state: CLOSED
client state: CLOSED

The "Two Army Problem"

Principles of congestion control

congestion:
• informally: "too many sources sending too much data too fast for network to handle"
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
• a top-10 problem!

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, $\lambda_{in}$, approaches capacity

[Figure: Host A and Host B send original data at rate $\lambda_{in}$ into a router with unlimited shared output link buffers; throughput is $\lambda_{out}$. Two plots against $\lambda_{in}$ (up to R/2): throughput $\lambda_{out}$ (up to R/2) and delay.]

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: $\lambda_{in} = \lambda_{out}$
  – transport-layer input includes retransmissions: $\lambda'_{in} \ge \lambda_{in}$

[Figure: Host A sends $\lambda_{in}$ (original data) and $\lambda'_{in}$ (original data, plus retransmitted data) into finite shared output link buffers; Host B receives $\lambda_{out}$.]

Causes/costs of congestion: scenario 2

idealization: perfect knowledge
• sender sends only when router buffers available

[Figure: Host A keeps a copy of each packet and sends $\lambda_{in}$ (original data) and $\lambda'_{in}$ (original data, plus retransmitted data) into finite shared output link buffers with free buffer space; Host B receives $\lambda_{out}$. Plot: $\lambda_{out}$ against $\lambda_{in}$, both up to R/2. A second build of the figure shows the case "no buffer space!".]

Causes/costs of congestion: scenario 2

Idealization: known loss. packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost

[Figure: Host A sends $\lambda_{in}$ (original data) and $\lambda'_{in}$ (original data, plus retransmitted data); Host B receives $\lambda_{out}$; the router has free buffer space when the retransmitted copy arrives. Plot: $\lambda_{out}$ against $\lambda_{in}$, both up to R/2.]

when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)

Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered

[Figure: Host A times out and sends a second copy while the router has free buffer space; both copies reach Host B. Plot: $\lambda_{out}$ against $\lambda_{in}$, both up to R/2.]

when sending at R/2, some packets are retransmissions, including duplicates that are delivered!

Causes/costs of congestion: scenario 2

[Figure: plot of $\lambda_{out}$ against $\lambda_{in}$, both up to R/2.]

when sending at R/2, some packets are retransmissions, including duplicates that are delivered!

"costs" of congestion:
• more work (retrans) for given "goodput"
• unneeded retransmissions: link carries multiple copies of pkt
  – decreasing goodput

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as $\lambda_{in}$ and $\lambda'_{in}$ increase?
A: as red $\lambda'_{in}$ increases, all arriving blue pkts at upper queue are dropped, blue throughput → 0

[Figure: Hosts A, B, C and D connected through routers with finite shared output link buffers; $\lambda_{in}$ is original data, $\lambda'_{in}$ is original data plus retransmitted data, and $\lambda_{out}$ is the throughput.]

another "cost" of congestion:
• when packet dropped, any "upstream" transmission capacity used for that packet was wasted!

Causes/costs of congestion: scenario 3

[Figure: throughput by blue traffic ($\lambda_{out}$, up to R/2) against offered load by Host A ($\lambda'_{in}$, up to R/2), annotated "bandwidth wastage for packets dropped at the 2nd router".]

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss

[Figure: AIMD sawtooth behavior, probing for bandwidth. cwnd (TCP sender congestion window size) against time: additively increase window size … until loss occurs (then cut window in half).]

TCP Congestion Control: details

• sender limits transmission:

  $\text{LastByteSent} - \text{LastByteAcked} \le \text{cwnd}$

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

  $\text{rate} \approx \dfrac{\text{cwnd}}{\text{RTT}}$ bytes/sec

[Figure: sender sequence number space, showing last byte ACKed; sent, not-yet ACKed ("in-flight"); last byte sent; with cwnd spanning the in-flight bytes.]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: along the time axis, Host A sends one segment to Host B, then two segments one RTT later, then four segments.]

TCP: detecting, reacting to loss

• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate acks)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

initial transition (Λ), entering slow start:
    cwnd = 1 MSS
    ssthresh = 64 KB
    dupACKcount = 0

state: slow start
    new ACK:
        cwnd = cwnd + MSS
        dupACKcount = 0
        transmit new segment(s), as allowed
    duplicate ACK:
        dupACKcount++
    cwnd > ssthresh (→ congestion avoidance):
        Λ
    dupACKcount == 3 (→ fast recovery):
        ssthresh = cwnd/2
        cwnd = ssthresh + 3
        retransmit missing segment
    timeout (stay in slow start):
        ssthresh = cwnd/2
        cwnd = 1 MSS
        dupACKcount = 0
        retransmit missing segment

state: congestion avoidance
    new ACK:
        cwnd = cwnd + MSS · (MSS/cwnd)
        dupACKcount = 0
        transmit new segment(s), as allowed
    duplicate ACK:
        dupACKcount++
    dupACKcount == 3 (→ fast recovery):
        ssthresh = cwnd/2
        cwnd = ssthresh + 3
        retransmit missing segment
    timeout (→ slow start):
        ssthresh = cwnd/2
        cwnd = 1 MSS
        dupACKcount = 0
        retransmit missing segment

state: fast recovery
    duplicate ACK:
        cwnd = cwnd + MSS
        transmit new segment(s), as allowed
    new ACK (→ congestion avoidance):
        cwnd = ssthresh
        dupACKcount = 0
    timeout (→ slow start):
        ssthresh = cwnd/2
        cwnd = 1 MSS
        dupACKcount = 0
        retransmit missing segment
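The summary FSM can be sketched as a small state machine that tracks only cwnd, ssthresh and the duplicate-ACK count. Several points are assumptions made for this example: cwnd is kept in bytes, the "+ 3" in fast recovery is read as 3 MSS, and retransmission and segment sending are left out because only the window arithmetic is modelled.

```python
class TcpCongestionControl:
    """Window arithmetic of the slow start / CA / fast recovery FSM (sketch)."""

    def __init__(self, mss=1460, ssthresh=64 * 1024):
        self.mss = mss
        self.cwnd = mss
        self.ssthresh = ssthresh
        self.dup_acks = 0
        self.state = "slow start"

    def on_new_ack(self):
        if self.state == "fast recovery":
            self.cwnd = self.ssthresh
            self.state = "congestion avoidance"
        elif self.state == "slow start":
            self.cwnd += self.mss                         # doubles cwnd per RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion avoidance"
        else:
            self.cwnd += self.mss * self.mss / self.cwnd  # about 1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        if self.state == "fast recovery":
            self.cwnd += self.mss
            return
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss
            self.state = "fast recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = self.mss
        self.dup_acks = 0
        self.state = "slow start"
```

Driving the object with a sequence of new ACKs, three duplicate ACKs and a timeout reproduces the slow-start ramp, the halving on triple duplicate ACK and the reset to 1 MSS on timeout.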

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is 3/4 W
  – avg. throughput is 3/4 W per RTT

  $\text{avg TCP throughput} = \dfrac{3}{4} \cdot \dfrac{W}{\text{RTT}}$ bytes/sec

[Figure: sawtooth window oscillating between W/2 and W, with the average level 3/4 W marked.]
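The 3/4 factor is simply the mean of a window that ramps linearly from W/2 up to W. The sketch below checks this numerically; the function names, the number of sample points and the example values of W and RTT are arbitrary choices for illustration.

```python
def sawtooth_average(w, steps=1000):
    """Mean window over one AIMD cycle that ramps linearly from w/2 to w."""
    samples = [w / 2 + (w / 2) * i / steps for i in range(steps + 1)]
    return sum(samples) / len(samples)


def avg_tcp_throughput(w_bytes, rtt_s):
    """Average throughput in bytes/sec for loss-window W and a fixed RTT."""
    return 0.75 * w_bytes / rtt_s


W = 120_000  # bytes, illustrative value
print(sawtooth_average(W) / W, avg_tcp_throughput(W, 0.1))
```

The printed ratio is very close to 0.75, matching the average window of 3/4 W used in the throughput formula.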


Page 47: ChapterIII: Transport Layer

GBN sender extended FSM

47

Wait start_timerudt_send(sndpkt[base])udt_send(sndpkt[base+1])hellipudt_send(sndpkt[nextseqnum-1])

timeout

rdt_send(data)

if (nextseqnum lt base+N) sndpkt[nextseqnum] = make_pkt(nextseqnumdatachksum)udt_send(sndpkt[nextseqnum])if (base == nextseqnum)

start_timernextseqnum++

else

refuse_data(data)

base = getacknum(rcvpkt)+1If (base == nextseqnum)

stop_timerelse

start_timer

rdt_rcv(rcvpkt) ampamp notcorrupt(rcvpkt)

base=1nextseqnum=1

rdt_rcv(rcvpkt) ampamp corrupt(rcvpkt)

L

GBN sender extended FSM

48

Wait start_timerudt_send(sndpkt[base])udt_send(sndpkt[base+1])hellipudt_send(sndpkt[nextseqnum-1])

timeout

rdt_send(data)

if (nextseqnum lt base+N) sndpkt[nextseqnum] = make_pkt(nextseqnumdatachksum)udt_send(sndpkt[nextseqnum])if (base == nextseqnum)

start_timernextseqnum++

else

refuse_data(data)

base = getacknum(rcvpkt)+1If (base == nextseqnum)

stop_timerelse

start_timer

rdt_rcv(rcvpkt) ampamp notcorrupt(rcvpkt)

base=1nextseqnum=1

rdt_rcv(rcvpkt) ampamp corrupt(rcvpkt)

L

GBN receiver extended FSM

ACK-only always send ACK for correctly-received pktwith highest in-order seq ndash may generate duplicate ACKsndash need only remember expectedseqnum

bull out-of-order pkt ndash discard (donrsquot buffer) no receiver bufferingndash re-ACK pkt with highest in-order seq

49

Wait

udt_send(sndpkt)default

rdt_rcv(rcvpkt)ampamp notcurrupt(rcvpkt)ampamp hasseqnum(rcvpktexpectedseqnum)

extract(rcvpktdata)deliver_data(data)sndpkt = make_pkt(expectedseqnumACKchksum)udt_send(sndpkt)expectedseqnum++

expectedseqnum=1sndpkt = make_pkt(0ACKchksum)

L

GBN receiver extended FSM

ACK-only always send ACK for correctly-received pktwith highest in-order seq ndash may generate duplicate ACKsndash need only remember expectedseqnum

bull out-of-order pkt ndash discard (donrsquot buffer) no receiver bufferingndash re-ACK pkt with highest in-order seq

50

Wait

udt_send(sndpkt)default

rdt_rcv(rcvpkt)ampamp notcurrupt(rcvpkt)ampamp hasseqnum(rcvpktexpectedseqnum)

extract(rcvpktdata)deliver_data(data)sndpkt = make_pkt(expectedseqnumACKchksum)udt_send(sndpkt)expectedseqnum++

expectedseqnum=1sndpkt = make_pkt(0ACKchksum)

L

GBN in action

51

send pkt0send pkt1send pkt2send pkt3

(wait)

sender receiver

receive pkt0 send ack0receive pkt1 send ack1

receive pkt3 discard (re)send ack1rcv ack0 send pkt4

rcv ack1 send pkt5

pkt 2 timeoutsend pkt2send pkt3send pkt4send pkt5

Xloss

receive pkt4 discard (re)send ack1

receive pkt5 discard (re)send ack1

rcv pkt2 deliver send ack2rcv pkt3 deliver send ack3rcv pkt4 deliver send ack4rcv pkt5 deliver send ack5

ignore duplicate ACK

0 1 2 3 4 5 6 7 8

sender window (N=4)

0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8

GBN in action

52

send pkt0send pkt1send pkt2send pkt3

(wait)

sender receiver

receive pkt0 send ack0receive pkt1 send ack1

receive pkt3 discard (re)send ack1rcv ack0 send pkt4

rcv ack1 send pkt5

pkt 2 timeoutsend pkt2send pkt3send pkt4send pkt5

Xloss

receive pkt4 discard (re)send ack1

receive pkt5 discard (re)send ack1

rcv pkt2 deliver send ack2rcv pkt3 deliver send ack3rcv pkt4 deliver send ack4rcv pkt5 deliver send ack5

ignore duplicate ACK

0 1 2 3 4 5 6 7 8

sender window (N=4)

0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8

Selective repeat

bull receiver individually acknowledges all correctly received packetsndash buffers packets as needed for eventual in-order delivery to

upper layer

bull sender only resends packets for which ACK not receivedndash sender timer for each unACKed packet

bull sender windowndash N consecutive seq rsquosndash limits seq s of sent unACKed packets

53

Selective repeat sender receiver windows

54

Selective repeat

data from abovebull if next available seq in

window send pkt

timeout(n)bull resend pkt n restart timer

ACK(n) in [sendbase sendbase+N-1]

bull mark pkt n as receivedbull if n smallest unACKed pkt

advance window base to next unACKed seq

55

senderpkt n in [rcvbase rcvbase+N-1]

v send ACK(n)v out-of-order bufferv in-order deliver (also

deliver buffered in-order pkts) advance window to next not-yet-received pkt

pkt n in [rcvbase-N rcvbase-1]

v ACK(n)otherwisev ignore

receiver

Selective repeat in action

56

send pkt0send pkt1send pkt2send pkt3

(wait)

sender receiver

receive pkt0 send ack0receive pkt1 send ack1

receive pkt3 buffer send ack3rcv ack0 send pkt4

rcv ack1 send pkt5

pkt 2 timeoutsend pkt2

Xloss

receive pkt4 buffer send ack4

receive pkt5 buffer send ack5

rcv pkt2 deliver pkt2pkt3 pkt4 pkt5 send ack2

record ack3 arrived

0 1 2 3 4 5 6 7 8

sender window (N=4)

0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8

record ack4 arrivedrecord ack5 arrived

Q what happens when ack2 arrives

Selective repeat in action

57

send pkt0send pkt1send pkt2send pkt3

(wait)

sender receiver

receive pkt0 send ack0receive pkt1 send ack1

receive pkt3 buffer send ack3rcv ack0 send pkt4

rcv ack1 send pkt5

pkt 2 timeoutsend pkt2

Xloss

receive pkt4 buffer send ack4

receive pkt5 buffer send ack5

rcv pkt2 deliver pkt2pkt3 pkt4 pkt5 send ack2

record ack3 arrived

0 1 2 3 4 5 6 7 8

sender window (N=4)

0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8

0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8

record ack4 arrivedrecord ack5 arrived

Q what happens when ack2 arrives

Selective repeat: dilemma

example:
• seq #'s: 0, 1, 2, 3
• window size = 3

[Figure: sender window (after receipt) and receiver window (after receipt) over seq numbers 0 1 2 3 0 1 2, in two scenarios]

(a) no problem
• sender sends pkt0, pkt1, pkt2; the ACKs arrive and the sender window advances
• sender sends pkt3, which is lost (X), then a new pkt0
• receiver will accept packet with seq number 0

(b) oops!
• sender sends pkt0, pkt1, pkt2; all three ACKs are lost (X X X)
• timeout: sender retransmits pkt0
• receiver will accept packet with seq number 0

receiver can't see sender side; receiver behavior identical in both cases: something's (very) wrong!

• receiver sees no difference in two scenarios!
• duplicate data accepted as new in (b)

Q: what relationship between seq # size and window size is needed to avoid the problem in (b)?
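The standard answer to this question is that the window size must be at most half the sequence number space. The short Python check below illustrates why; it is a sketch with a hypothetical function name, and it models only the worst case where every ACK of the previous window is lost.

```python
def windows_overlap(seq_space, window):
    """True if an old, retransmitted packet can be mistaken for a new one.

    Worst case: the receiver has accepted one full window and moved on,
    but every ACK was lost, so the sender may still retransmit the old window.
    """
    # seq numbers the sender may still retransmit (previous window)
    old = {s % seq_space for s in range(0, window)}
    # seq numbers the receiver is now willing to accept (next window)
    new = {s % seq_space for s in range(window, 2 * window)}
    return bool(old & new)


# Slide example: seq #'s 0..3 (space of 4), window size 3 -> ambiguous
print(windows_overlap(4, 3))
# Same sequence space with window size 2 -> no ambiguity
print(windows_overlap(4, 2))
```

With these definitions the two windows are disjoint exactly when `2 * window <= seq_space`.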

TCP: Overview (RFCs 793, 1122, 1323, 2018, 2581)

• point-to-point:
  – one sender, one receiver
• reliable, in-order byte stream:
  – no “message boundaries”
• pipelined:
  – TCP congestion and flow control set window size
• full duplex data:
  – bi-directional data flow in same connection
  – MSS: maximum segment size
• connection-oriented:
  – handshaking (exchange of control msgs) inits sender, receiver state before data exchange
• flow controlled:
  – sender will not overwhelm receiver

TCP segment structure

[Figure: TCP segment layout, 32 bits wide, top to bottom]
• source port #, dest port #
• sequence number
• acknowledgement number
• head len, not used, flag bits U A P R S F, receive window
• checksum, Urg data pointer
• options (variable length)
• application data (variable length)

Field annotations:
• URG: urgent data (generally not used)
• ACK: ACK # valid
• PSH: push data now
• RST, SYN, FIN: connection estab (setup, teardown commands)
• receive window: # bytes rcvr willing to accept
• sequence number, acknowledgement number: counting by bytes of data (not segments!)
• checksum: Internet checksum (as in UDP)

TCP seq. numbers, ACKs

sequence numbers:
  – byte stream “number” of first byte in segment's data

acknowledgements:
  – seq # of next byte expected from other side
  – cumulative ACK

Q: how does receiver handle out-of-order segments?
  – A: TCP spec doesn't say; up to implementor

[Figure: outgoing segment from sender and incoming segment to sender, each showing source port #, dest port #, sequence number, acknowledgement number, rwnd, checksum, urg pointer]

[Figure: sender sequence number space with window size N: sent, ACKed | sent, not-yet ACKed (“in-flight”) | usable but not yet sent | not usable]

Byte stream in TCP

[Figure: byte stream carrying an HTTP Get Message (K bytes); labels: Window (N bytes); 100th byte; TCP header (seq no = 100) followed by M bytes; remaining part of the stream marked "Cannot be transmitted now"]

TCP seq. numbers, ACKs

simple telnet scenario (Host A, Host B)

• User types 'C'. Host A to Host B: Seq=42, ACK=79, data = 'C'
• Host B ACKs receipt of 'C', echoes back 'C'. Host B to Host A: Seq=79, ACK=43, data = 'C'
• Host A ACKs receipt of echoed 'C'. Host A to Host B: Seq=43, ACK=80
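The numbers in the telnet scenario follow from two rules: the ACK field is the sequence number of the next byte expected, and a host's own sequence number continues from where the peer's ACK points. A minimal Python sketch (the function name `reply_header` is hypothetical):

```python
def reply_header(seq_in, ack_in, data_in):
    """Seq and ACK fields of the segment sent in response to a received one."""
    seq_out = ack_in                  # our next byte is the one the peer expects
    ack_out = seq_in + len(data_in)   # next byte we expect from the peer
    return seq_out, ack_out


# Host A sends Seq=42, ACK=79, data='C'; Host B answers, echoing 'C'
b_seq, b_ack = reply_header(42, 79, b"C")
# Host A then acknowledges the one-byte echo
a_seq, a_ack = reply_header(b_seq, b_ack, b"C")
print(b_seq, b_ack, a_seq, a_ack)
```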

TCP round trip time, timeout

Q: how to set TCP timeout value?
• longer than RTT
  – but RTT varies
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss

Q: how to estimate RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  – ignore retransmissions
• SampleRTT will vary, want estimated RTT “smoother”
  – average several recent measurements, not just current SampleRTT

TCP round trip time, timeout

EstimatedRTT = (1 - α) · EstimatedRTT + α · SampleRTT

• exponential weighted moving average
• influence of past sample decreases exponentially fast
• typical value: α = 0.125

[Figure: SampleRTT and EstimatedRTT, gaia.cs.umass.edu to fantasia.eurecom.fr; x-axis: time (seconds), 1 to 106; y-axis: RTT (milliseconds), 100 to 350]

TCP round trip time, timeout

• timeout interval: EstimatedRTT plus “safety margin”
  – large variation in EstimatedRTT -> larger safety margin
• estimate SampleRTT deviation from EstimatedRTT:

DevRTT = (1 - β) · DevRTT + β · |SampleRTT - EstimatedRTT|
(typically, β = 0.25)

TimeoutInterval = EstimatedRTT + 4 · DevRTT
(EstimatedRTT is the estimated RTT; 4 · DevRTT is the “safety margin”)
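The three formulas (EstimatedRTT, DevRTT, TimeoutInterval) can be combined into a small estimator. The sketch below is illustrative only: the class name is hypothetical, and two details are assumptions not stated on the slides, namely that the first sample initialises EstimatedRTT with DevRTT set to half of it, and that DevRTT is updated before EstimatedRTT.

```python
class RttEstimator:
    """EWMA-based RTT estimator sketch (hypothetical class, times in ms)."""

    def __init__(self, first_sample, alpha=0.125, beta=0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2   # assumption: initial deviation

    def update(self, sample_rtt):
        # assumption: DevRTT uses the EstimatedRTT from before this sample
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)

    def timeout_interval(self):
        # EstimatedRTT plus a safety margin of four deviations
        return self.estimated_rtt + 4 * self.dev_rtt


est = RttEstimator(first_sample=100.0)
est.update(200.0)   # one unusually slow sample
print(est.estimated_rtt, est.dev_rtt, est.timeout_interval())
```

A single slow sample moves EstimatedRTT only slightly, while the larger DevRTT widens the safety margin.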

TCP reliable data transfer

• TCP creates rdt service on top of IP's unreliable service
  – pipelined segments
  – cumulative acks
  – single retransmission timer
• retransmissions triggered by:
  – timeout events
  – duplicate acks

let's initially consider simplified TCP sender:
  – ignore duplicate acks
  – ignore flow control, congestion control

TCP sender events:

data rcvd from app:
• create segment with seq #
• seq # is byte-stream number of first data byte in segment
• start timer if not already running
  – think of timer as for oldest unacked segment
  – expiration interval: TimeOutInterval

timeout:
• retransmit segment that caused timeout
• restart timer

ack rcvd:
• if ack acknowledges previously unacked segments
  – update what is known to be ACKed
  – start timer if there are still unacked segments

TCP sender (simplified)

[FSM with a single state: wait for event]

initially (Λ):
    NextSeqNum = InitialSeqNum
    SendBase = InitialSeqNum

event: data received from application above
    create segment, seq. #: NextSeqNum
    pass segment to IP (i.e., “send”)
    NextSeqNum = NextSeqNum + length(data)
    if (timer currently not running)
        start timer

event: timeout
    retransmit not-yet-acked segment with smallest seq. #
    start timer

event: ACK received, with ACK field value y
    if (y > SendBase) {
        SendBase = y
        /* SendBase–1: last cumulatively ACKed byte */
        if (there are currently not-yet-acked segments)
            start timer
        else stop timer
    }
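The simplified sender's three events translate almost line for line into Python. In this sketch the class name, the `sent` log (standing in for "pass segment to IP") and the boolean `timer_running` (standing in for a real timer) are all hypothetical simplifications.

```python
class SimplifiedTcpSender:
    """Simplified TCP sender sketch: no duplicate-ACK, flow or congestion control."""

    def __init__(self, initial_seq):
        self.next_seq = initial_seq    # NextSeqNum
        self.send_base = initial_seq   # SendBase
        self.unacked = []              # (seq, data) pairs, oldest first
        self.timer_running = False
        self.sent = []                 # log of segments "passed to IP"

    def on_app_data(self, data):
        segment = (self.next_seq, data)
        self.unacked.append(segment)
        self.sent.append(segment)
        self.next_seq += len(data)
        if not self.timer_running:
            self.timer_running = True

    def on_timeout(self):
        # retransmit the not-yet-acked segment with the smallest seq number
        if self.unacked:
            self.sent.append(self.unacked[0])
        self.timer_running = True      # restart timer

    def on_ack(self, y):
        if y > self.send_base:
            self.send_base = y         # bytes up to y-1 are cumulatively ACKed
            self.unacked = [(s, d) for (s, d) in self.unacked if s + len(d) > y]
            # keep the timer only while something is still unacked
            self.timer_running = bool(self.unacked)
```

Replaying a premature timeout (two segments sent from Seq=92, timeout, then ACK=100 and ACK=120) leaves `send_base` at 120 with the timer stopped, and a later duplicate ACK=100 changes nothing.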

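TCP: retransmission scenarios

lost ACK scenario (Host A, Host B)
• A sends Seq=92, 8 bytes of data
• B replies ACK=100; the ACK is lost (X)
• timeout at A: A resends Seq=92, 8 bytes of data
• B replies ACK=100 again

premature timeout (Host A, Host B)
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
• B replies ACK=100, then ACK=120
• timeout at A before the ACKs arrive: A resends Seq=92, 8 bytes of data
• B replies ACK=120 to the duplicate
• SendBase values shown at the sender: 92, 100, 120, 120

TCP: retransmission scenarios

cumulative ACK (Host A, Host B)
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data
• B's ACK=100 is lost (X); ACK=120 arrives before the timeout
• A continues with Seq=120, 15 bytes of data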

TCP ACK generation [RFC 5681]

• event at receiver: arrival of in-order segment with expected seq #. All data up to expected seq # already ACKed
  TCP receiver action: delayed ACK. Wait up to 500ms for next segment. If no next segment, send ACK

• event at receiver: arrival of in-order segment with expected seq #. One other segment has ACK pending
  TCP receiver action: immediately send single cumulative ACK, ACKing both in-order segments

• event at receiver: arrival of out-of-order segment, higher-than-expected seq #. Gap detected
  TCP receiver action: immediately send duplicate ACK, indicating seq # of next expected byte

• event at receiver: arrival of segment that partially or completely fills gap
  TCP receiver action: immediately send ACK, provided that segment starts at lower end of gap

TCP fast retransmit

• time-out period often relatively long:
  – long delay before resending lost packet
• detect lost segments via duplicate ACKs
  – sender often sends many segments back-to-back
  – if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for same data (“triple duplicate ACKs”), resend unacked segment with smallest seq #
  – likely that unacked segment lost, so don't wait for timeout

TCP fast retransmit

[Figure: fast retransmit after sender receipt of triple duplicate ACK (Host A, Host B)]
• A sends Seq=92, 8 bytes of data, then Seq=100, 20 bytes of data; the Seq=100 segment is lost (X)
• B sends ACK=100, followed by three more ACK=100 (3 DUP ACKs)
• A resends Seq=100, 20 bytes of data before the timeout expires
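The duplicate-ACK rule can be sketched as a small counter. The class and method names here are hypothetical, and the sketch covers only loss detection: it returns the sequence number to resend and leaves the actual retransmission to the caller.

```python
class FastRetransmitDetector:
    """Counts duplicate ACKs and signals a fast retransmit on the third one."""

    def __init__(self, send_base):
        self.send_base = send_base
        self.dup_acks = 0

    def on_ack(self, y):
        """Return the seq number to retransmit now, or None."""
        if y > self.send_base:
            # new ACK: advance SendBase and reset the duplicate counter
            self.send_base = y
            self.dup_acks = 0
            return None
        if y == self.send_base:
            self.dup_acks += 1
            if self.dup_acks == 3:
                return y   # resend unacked segment with smallest seq number
        return None


detector = FastRetransmitDetector(send_base=92)
# the first ACK=100 is new; the next three are duplicates
print([detector.on_ack(100) for _ in range(4)])
```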

TCP flow control

[Figure: receiver protocol stack: application process (application) above TCP socket receiver buffers, TCP code and IP code (OS); segments arrive from sender]

application may remove data from TCP socket buffers ... slower than TCP receiver is delivering (sender is sending)

flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast

TCP flow control

• receiver “advertises” free buffer space by including rwnd value in TCP header of receiver-to-sender segments
  – RcvBuffer size set via socket options (typical default is 4096 bytes)
  – many operating systems autoadjust RcvBuffer
• sender limits amount of unacked (“in-flight”) data to receiver's rwnd value
• guarantees receive buffer will not overflow

[Figure: receiver-side buffering: TCP segment payloads enter RcvBuffer, which holds buffered data plus free buffer space (rwnd); data leaves to application process]
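A small Python sketch of the flow-control bookkeeping follows. The slides only show rwnd as the free part of RcvBuffer; the explicit formula with LastByteRcvd and LastByteRead is the usual textbook formulation and is an assumption here, as are the function names.

```python
def advertised_rwnd(rcv_buffer, last_byte_rcvd, last_byte_read):
    """Free buffer space the receiver advertises (all values in bytes)."""
    buffered = last_byte_rcvd - last_byte_read   # data not yet read by the app
    return rcv_buffer - buffered


def sender_may_send(n_bytes, last_byte_sent, last_byte_acked, rwnd):
    """True if sending n_bytes keeps the in-flight data within rwnd."""
    in_flight = last_byte_sent - last_byte_acked
    return in_flight + n_bytes <= rwnd


rwnd = advertised_rwnd(rcv_buffer=4096, last_byte_rcvd=3000, last_byte_read=1000)
print(rwnd)
print(sender_may_send(1000, last_byte_sent=5000, last_byte_acked=4000, rwnd=rwnd))
```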

Connection Management

before exchanging data, sender/receiver “handshake”:
• agree to establish connection (each knowing the other willing to establish connection)
• agree on connection parameters

[Figure: client and server hosts, each with an application above the network, each holding the following state]
connection state: ESTAB
connection variables:
    seq # client-to-server, server-to-client
    rcvBuffer size at server, client

client: Socket clientSocket = new Socket("hostname", "port number");
server: Socket connectionSocket = welcomeSocket.accept();

TCP 3-way handshake

client state: CLOSED, then SYNSENT, then ESTAB
server state: LISTEN, then SYN RCVD, then ESTAB

• client: choose init seq num, x; send TCP SYN msg (SYNbit=1, Seq=x); client enters SYNSENT
• server: choose init seq num, y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); server enters SYN RCVD
• client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; client enters ESTAB
• server: received ACK(y) indicates client is live; server enters ESTAB

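TCP 3-way handshake: FSM

[FSM states: closed, listen, SYN rcvd, SYN sent, ESTAB; Λ marks an empty event or action]

server side:
• closed to listen: Socket connectionSocket = welcomeSocket.accept(); action: Λ
• listen to SYN rcvd: event SYN(x); action SYNACK(seq=y, ACKnum=x+1), create new socket for communication back to client
• SYN rcvd to ESTAB: event ACK(ACKnum=y+1); action: Λ

client side:
• closed to SYN sent: Socket clientSocket = new Socket("hostname", "port number"); action SYN(seq=x)
• SYN sent to ESTAB: event SYNACK(seq=y, ACKnum=x+1); action ACK(ACKnum=y+1)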
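The exchange of the three handshake segments can be traced with a few lines of Python. This is a sketch, not a protocol implementation: the function name and dictionary layout are hypothetical, and the only behaviour modelled is how x and y determine the ACK numbers (with wrap-around, since TCP sequence numbers are 32 bits wide).

```python
SEQ_SPACE = 2 ** 32   # TCP sequence numbers are 32-bit values


def three_way_handshake(x, y):
    """Return the three segments and the final (client, server) states."""
    client, server = "CLOSED", "LISTEN"

    syn = {"SYNbit": 1, "Seq": x}
    client = "SYNSENT"

    # server acknowledges the SYN, which consumes one sequence number
    synack = {"SYNbit": 1, "Seq": y, "ACKbit": 1,
              "ACKnum": (syn["Seq"] + 1) % SEQ_SPACE}
    server = "SYN RCVD"

    # client acknowledges the server's SYN in the same way
    ack = {"ACKbit": 1, "ACKnum": (synack["Seq"] + 1) % SEQ_SPACE}
    client = "ESTAB"
    server = "ESTAB"   # on receipt of the final ACK

    return [syn, synack, ack], (client, server)


segments, states = three_way_handshake(x=10, y=500)
print(segments, states)
```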

TCP: closing a connection

• client, server each close their side of connection
  – send TCP segment with FIN bit = 1
• respond to received FIN with ACK
  – on receiving FIN, ACK can be combined with own FIN
• simultaneous FIN exchanges can be handled

TCP: closing a connection

[Figure: client state and server state timelines, both starting in ESTAB]
• client: clientSocket.close(); sends FINbit=1, seq=x; enters FIN_WAIT_1 (can no longer send but can receive data)
• server: sends ACKbit=1, ACKnum=x+1; enters CLOSE_WAIT (can still send data)
• client: enters FIN_WAIT_2 (wait for server close)
• server: sends FINbit=1, seq=y; enters LAST_ACK (can no longer send data)
• client: sends ACKbit=1, ACKnum=y+1; enters TIMED_WAIT (timed wait for 2 * max segment lifetime), then CLOSED
• server: on receiving the ACK, enters CLOSED

The “Two Army Problem”

Principles of congestion control

congestion:
• informally: “too many sources sending too much data too fast for network to handle”
• different from flow control!
• manifestations:
  – lost packets (buffer overflow at routers)
  – long delays (queueing in router buffers)
• a top-10 problem!

Causes/costs of congestion: scenario 1

• two senders, two receivers
• one router, infinite buffers
• output link capacity: R
• no retransmission
• maximum per-connection throughput: R/2
• large delays as arrival rate, λin, approaches capacity

[Figure: Host A and Host B send original data at rate λin into unlimited shared output link buffers; throughput: λout]
[Plots: λout versus λin, both axes up to R/2; delay versus λin, up to R/2]

Causes/costs of congestion: scenario 2

• one router, finite buffers
• sender retransmission of timed-out packet
  – application-layer input = application-layer output: λin = λout
  – transport-layer input includes retransmissions: λ'in ≥ λin

[Figure: Host A and Host B send into finite shared output link buffers; λin: original data; λ'in: original data, plus retransmitted data; λout]

idealization: perfect knowledge
• sender sends only when router buffers available
[Figure: a copy of each packet is kept at the sender; free buffer space at the router. Plot: λout versus λin, both axes up to R/2]

Idealization: known loss
packets can be lost, dropped at router due to full buffers
• sender only resends if packet known to be lost
[Figure: no buffer space at the router; the packet is dropped and the sender resends its copy once there is free buffer space. Plot: λout versus λ'in, both axes up to R/2]
• when sending at R/2, some packets are retransmissions, but asymptotic goodput is still R/2 (why?)

Realistic: duplicates
• packets can be lost, dropped at router due to full buffers
• sender times out prematurely, sending two copies, both of which are delivered
[Figure: timeout at the sender while free buffer space exists at the router. Plot: λout versus λ'in, both axes up to R/2]
• when sending at R/2, some packets are retransmissions, including duplicates that are delivered!

“costs” of congestion:
• more work (retrans) for given “goodput”
• unneeded retransmissions: link carries multiple copies of pkt
  – decreasing goodput

Causes/costs of congestion: scenario 3

• four senders
• multihop paths
• timeout/retransmit

Q: what happens as λin and λ'in increase?
A: as red λ'in increases, all arriving blue pkts at upper queue are dropped, blue throughput -> 0

[Figure: Hosts A, B, C, D; finite shared output link buffers; λin: original data; λ'in: original data, plus retransmitted data; λout]

Causes/costs of congestion: scenario 3

another “cost” of congestion:
• when packet dropped, any “upstream” transmission capacity used for that packet was wasted!

[Plot: λout (throughput by blue traffic) versus λ'in (offered load by Host A), both axes up to R/2; annotation: bandwidth wastage for packets dropped at the 2nd router]

Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
• no explicit feedback from network
• congestion inferred from end-system observed loss, delay
• approach taken by TCP

network-assisted congestion control:
• routers provide feedback to end systems
  – single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  – explicit rate for sender to send at

TCP congestion control: additive increase, multiplicative decrease (AIMD)

• approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  – additive increase: increase cwnd by 1 MSS every RTT until loss detected
  – multiplicative decrease: cut cwnd in half after loss

[Figure: AIMD saw tooth behavior: probing for bandwidth. y-axis: cwnd (TCP sender congestion window size); x-axis: time. Additively increase window size ... until loss occurs (then cut window in half)]
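The saw tooth can be reproduced with a few lines of Python. This is a toy model under loud assumptions: cwnd is counted in whole MSS units, one loop iteration stands for one RTT, and a loss is assumed whenever cwnd exceeds a fixed `capacity`, which is a hypothetical stand-in for real loss detection.

```python
def aimd_trace(rounds, capacity, cwnd=1):
    """Return cwnd (in MSS) at the start of each RTT under AIMD."""
    trace = []
    for _ in range(rounds):
        trace.append(cwnd)
        if cwnd > capacity:
            # multiplicative decrease: cut cwnd in half after loss
            cwnd = max(1, cwnd // 2)
        else:
            # additive increase: 1 MSS per RTT
            cwnd += 1
    return trace


print(aimd_trace(rounds=8, capacity=4))
```

With `capacity=4` the window climbs 1, 2, 3, 4, 5, is halved to 2, and then climbs again, which is the saw tooth shape of AIMD.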

TCP Congestion Control: details

• sender limits transmission:

    LastByteSent - LastByteAcked ≤ cwnd

• cwnd is dynamic, function of perceived network congestion

TCP sending rate:
• roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes

    rate ≈ cwnd / RTT  bytes/sec

[Figure: sender sequence number space: last byte ACKed | sent, not-yet ACKed (“in-flight”) | last byte sent; cwnd spans the in-flight region]

TCP Slow Start

• when connection begins, increase rate exponentially until first loss event:
  – initially cwnd = 1 MSS
  – double cwnd every RTT
  – done by incrementing cwnd for every ACK received
• summary: initial rate is slow but ramps up exponentially fast

[Figure: Host A / Host B timeline: one segment in the first RTT, then two segments, then four segments]

TCP: detecting, reacting to loss

• loss indicated by timeout:
  – cwnd set to 1 MSS
  – window then grows exponentially (as in slow start) to threshold, then grows linearly
• loss indicated by 3 duplicate ACKs: TCP RENO
  – dup ACKs indicate network capable of delivering some segments
  – cwnd is cut in half, window then grows linearly
• TCP Tahoe always sets cwnd to 1 MSS (timeout or 3 duplicate acks)

TCP: switching from slow start to CA

Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
• variable ssthresh
• on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Summary: TCP Congestion Control

[FSM with three states: slow start, congestion avoidance, fast recovery; Λ marks an empty event or action]

initially (Λ): cwnd = 1 MSS; ssthresh = 64 KB; dupACKcount = 0; enter slow start

slow start:
• new ACK: cwnd = cwnd + MSS; dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• cwnd > ssthresh: Λ; go to congestion avoidance
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment

congestion avoidance:
• new ACK: cwnd = cwnd + MSS · (MSS/cwnd); dupACKcount = 0; transmit new segment(s), as allowed
• duplicate ACK: dupACKcount++
• dupACKcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3; retransmit missing segment; go to fast recovery
• timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupACKcount = 0; retransmit missing segment; go to slow start

fast recovery:
• duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
• new ACK: cwnd = ssthresh; dupACKcount = 0; go to congestion avoidance
• timeout: ssthresh = cwnd/2; cwnd = 1; dupACKcount = 0; retransmit missing segment; go to slow start
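The three-state congestion control FSM (slow start, congestion avoidance, fast recovery) can be written as a small Python class. This is a sketch under stated assumptions: the class and method names are hypothetical, cwnd and ssthresh are measured in the same unit as `mss` (here 1.0 means one segment), retransmission and sending are left out, and the initial `ssthresh` is a parameter rather than the 64 KB shown in the diagram.

```python
class RenoCongestionControl:
    """Window bookkeeping only; no segments are actually sent."""

    def __init__(self, mss=1.0, ssthresh=64.0):
        self.mss = mss
        self.cwnd = mss
        self.ssthresh = ssthresh
        self.dup_acks = 0
        self.state = "slow start"

    def on_new_ack(self):
        if self.state == "fast recovery":
            self.cwnd = self.ssthresh              # deflate the window
            self.state = "congestion avoidance"
        elif self.state == "slow start":
            self.cwnd += self.mss                  # doubles cwnd every RTT
            if self.cwnd > self.ssthresh:
                self.state = "congestion avoidance"
        else:
            # about one MSS of growth per RTT
            self.cwnd += self.mss * (self.mss / self.cwnd)
        self.dup_acks = 0

    def on_duplicate_ack(self):
        if self.state == "fast recovery":
            self.cwnd += self.mss
            return
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3 * self.mss
            self.state = "fast recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd / 2
        self.cwnd = self.mss
        self.dup_acks = 0
        self.state = "slow start"
```

Driving it with four new ACKs (using `ssthresh=4`), then three duplicate ACKs, then a new ACK walks through all three states in turn.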

TCP throughput

• avg. TCP throughput as function of window size, RTT?
  – ignore slow start, assume always data to send
• W: window size (measured in bytes) where loss occurs
  – avg. window size (# in-flight bytes) is 3/4 W
  – avg. throughput is 3/4 W per RTT

    avg TCP throughput = (3/4) · W / RTT  bytes/sec

[Figure: saw tooth of window size oscillating between W/2 and W]

TCP Futures: TCP over “long, fat pipes”

• example: 1500 byte segments, 100ms RTT, want 10 Gbps throughput
• requires W = 83,333 in-flight segments
• throughput in terms of segment loss probability, L [Mathis 1997]:

    TCP throughput = (1.22 · MSS) / (RTT · √L)

  – to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
• new versions of TCP for high-speed
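The two numbers in the long, fat pipe example can be checked with a short calculation. The function names below are hypothetical; the formulas are the window/RTT relation and the loss-based throughput formula attributed to [Mathis 1997], with MSS converted from bytes to bits.

```python
import math


def window_in_segments(throughput_bps, rtt_s, mss_bytes):
    """In-flight segments needed to sustain the given throughput."""
    return throughput_bps * rtt_s / (mss_bytes * 8)


def mathis_throughput_bps(mss_bytes, rtt_s, loss):
    """TCP throughput = 1.22 * MSS / (RTT * sqrt(L)), in bits per second."""
    return 1.22 * mss_bytes * 8 / (rtt_s * math.sqrt(loss))


def required_loss_rate(throughput_bps, rtt_s, mss_bytes):
    """Loss probability L obtained by solving the formula above for L."""
    return (1.22 * mss_bytes * 8 / (rtt_s * throughput_bps)) ** 2


w = window_in_segments(10e9, 0.1, 1500)
loss = required_loss_rate(10e9, 0.1, 1500)
print(f"W = {w:.0f} segments, L = {loss:.2e}")
```

The loss rate works out to about 2.1·10^-10, in line with the rounded value of 2·10^-10 quoted for this example.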

TCP Fairness

fairness goal: if K TCP sessions share same bottleneck link of bandwidth R, each should have average rate of R/K

[Figure: TCP connection 1 and TCP connection 2 share a bottleneck router of capacity R]

Why is TCP fair?

two competing sessions:
• additive increase gives slope of 1, as throughput increases
• multiplicative decrease decreases throughput proportionally

[Figure: Connection 2 throughput versus Connection 1 throughput, both axes 0 to R; equal bandwidth share line; full bandwidth utilization line; the trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2"; (X1, Y1) where X1+Y1 = R; (X2, Y2) where X2 = Y2]
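The convergence argument can be checked numerically. The following Python sketch is a toy model, not a TCP simulation: the function name is hypothetical, both flows are assumed to have the same RTT and to see every loss at the same time, and a loss is assumed whenever the combined rate exceeds the link capacity.

```python
def aimd_two_flows(x, y, capacity, rounds):
    """Evolve two AIMD flows sharing one bottleneck; return their final rates."""
    for _ in range(rounds):
        if x + y > capacity:
            # loss: both flows halve, which also halves their difference
            x, y = x / 2, y / 2
        else:
            # additive increase: both grow by 1, the difference is unchanged
            x, y = x + 1, y + 1
    return x, y


x, y = aimd_two_flows(x=1.0, y=50.0, capacity=100.0, rounds=300)
print(abs(x - y))
```

Additive increase keeps the gap between the two flows constant, while each multiplicative decrease halves it, so the rates drift toward the equal bandwidth share line.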

Fairness (more)

Fairness and UDP
• multimedia apps often do not use TCP
  – do not want rate throttled by congestion control
• instead use UDP:
  – send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections
• application can open multiple parallel connections between two hosts
• web browsers do this
• e.g., link of rate R with 9 existing connections:
  – new app asks for 1 TCP, gets rate R/10
  – new app asks for 11 TCPs, gets roughly R/2 (11 of the 20 connections)

Explicit Congestion Notification (ECN)

network-assisted congestion control:
• two bits in IP header (ToS field) marked by network router to indicate congestion
• congestion indication carried to receiving host
• receiver (seeing congestion indication in IP datagram) sets ECE bit on receiver-to-sender ACK segment to notify sender of congestion

[Figure: source and destination protocol stacks (application, transport, network, link, physical); the IP datagram leaves the source with ECN=00 and is marked ECN=11 by a router; the TCP ACK segment returns with ECE=1]
